

I'm pretty sure if you've been working in Linux/UNIX environments that you've run into the problem of 'finding a file which contains a given pattern'. Most definitely the first thing you do is use the 'grep' utility to perform a regular-expression match on the contents of files.
Given this problem, let's look at the characteristics of a possible massively parallel solution. To do this, we look at the problem and break it down into tasks:
So now the problem becomes identifying the possible parallelism in each of the three tasks. Note that we're looking for possible parallelism available, and then distill the efficient and cost-effective (in terms of complexity and time-to-develop) to come up with a feasible solution.
Parallel Regex?
We start with the regular expression pattern matching task. Inherently, the implementation is state-ful and thus very hard to parallelize. What we can exploit the parallelism in is the fact that we can first do 'dirty matching' on data-blocks then defer the actual pattern matching on data that passes the dirty match. Usually this is done with fixed-sized blocks of data, checking for markers present in the regular expression to determine whether a given block may possibly match the regular expression, short of actually using the regular expression to check for a match.
Admittedly, this first pass will not be parallel when we're dealing with a stream -- but if the data were already chunked accordingly, then doing a dirty-match on blocks can happen in parallel.
File Chunking
This then brings us to the second task where we can exploit parallelism: file system operations. When reading file data, the traditional implementation is to buffer input and then let the buffer be consumed by the application. We then extend this idea by getting multiple buffers at the same time (similar to reading multiple blocks at the same time) and applying computations on the blocks in parallel. The dirty-matching can then happen in parallel, and we can form identify the blocks where the regex pattern may match.
There are a lot more details to this approach including merging contiguous blocks and applying the pattern match on these merged blocks, and things like overlap-chunking to limit the requirement for block merging, which I will not go into detail.
We can then imagine a multi-machine grep approach to searching for files that have contents that match a pattern and leverage for example the design of a DFS on which we have the files available. We can move the grepping of files hosted in machine A to be performed in the same machine to mitigate the effects of network latency if we were to move data between/among machines instead. This is an approach similar to what the Hadoop system achieves with both providing a DFS and parallel work distribution based on file locality.
Multi-Process Grep
Currently, the traditional approach is to let grep do the analysis in serial fashion -- and just rely on the grep implementation to be very efficient in a sequential setting; then deal with the files one at a time. Let's perform a simple calculation for example when grep'ing a file takes 0.1 second and we have 1000 files:
Given 0.1 sec/file * 1000 files we get 100.0 sec total time in the sequential fashion. If we did it with a shell script that divided the 1000 files into 4 groups, and ran four grep's that dealt with different groups of files on a four-processor machine (assuming 1:1 mapping between processes to processors), we should theoretically be able to get through the 1000 files in 1/4 of the time (25 secs). And if we increase the number of processors and processes running, we can further reduce this time.
Bigger Picture
This is just looking at the top-level (or the most coarse) parallelism on the computation given the data dependency we have. At a lower level, we should be able to leverage parallelism on a per-file processing given the approaches earlier. It would be arguable though whether the 0.1 sec/file time will be reduced given the algorithm, so it's usually at that time we can try tuning and trying.
But for the most part, the idea of data-parallel computations getting a boost from multi-core processors and increased parallelism is one of the cornerstones for the multi-core movement. It's becoming apparent that at the scale of systems we're building at the moment and the inherent parallelism in the solutions we're trying to build as well are the major drivers to pushing the parallel computing wave.
Large enterprises and web systems will benefit greatly in large data processing and analysis through parallelism at the solution level. Without having to invest in big-iron hardware but more on low-cost commodity hardware to achieve greater efficiency in terms of multi-core performance and multi-machine parallel computing (Beowulf clusters come to mind), it's more and more becoming a better value proposition to go parallel.
More to Come
In the next article I will explore another Coarse-Grained Data-Parallel problem/solution which has a different focus: statistics gathering based on huge data sets.
This article is part of a series on Parallel Computing with C++. If you're interested in going parallel for your applications, contact the author at dean.berris@cplusplus-soup.com for inquiries.




When I first talked about OpenCards, I was still using OpenOffice.org on my MS Windows XP partition. But on moving to Ubuntu, I was puzzled as to why I couldn't install the extension, or most any OpenOffice.org extension, at all. Turns out it has everything to do with making OpenOffice.org "see" the Java runtime. READ MORE ABOUT IT »




Having celebrated my 33rd year of geekiness on this world a couple of days ago, I decided to give my site a new look. I really didn't like the look I started with early this year. So on a whim I set out to upgrade the software on my blog and introduce a brand new look! READ MORE ABOUT IT »


Hello O&B. I'm experiencing a problem right now. Recently, I've been doing my distributed database machine problem and self-studying JSF at the same time. Now, the problem is this, the database part is not that hard but the JSF implementation quite made itself a stumbling block. And I couldn't just roll things back since I'm almost finished and I simply want to finish this project implementing the JSF framework (in all instances – I don't want to resort to the JSP style of iterating over the arrays and have the <tr> <td>s).
Currently, I simply want to have a table within a table, I couldn't find (or have not researched enough) info about nesting tables, so I think it would be a great help if you guys can give me a code, send me a tutorial link, or give me any articles or documents that can solve this problem.
The Dummy scenario (Just to simplify things):
I have A and B classes. A has an array of B. I also have a Request Bean C, which has an array of A.
I have an init method in my JSF Java file, which populates the Request Bean C from the Dao. (So this is already with the assumption that the whole thing is populated – just the view part is not working)
Here's the scenario (I'll simplify the tags –I'll just give their name):
<table>
<table row group; sourceData= C.As; sourceVar= currentRow>
<table column>
<static text; text = currentRow.value[<field name here other than the array B>]>
</table column>
<here>
</table row group>
</table>
==This is part is definitely not working – just a rough guess on how the JSF works, thought this worked.
What I did, in the <here>:
<table column>
<table>
<table row group; sourceData= currentRow.value[Bs]>
</table row group>
</table>
</table column>
You can prolly guess what I was trying to do. This just didn't work. What I wanted to achieve:
A[0]
A.B[0]
A.B[1]
…
A[1]
…
So there, I hope I've given enough info. Thanks in advance for the help(S).


For those who have had trouble with their ATI Graphics Card on Ubuntu, you might actually have hope now in Hardy Heron! READ MORE ABOUT IT »


I'm currently doing the mock-up of a new online system for my organization. Part of the work is to get screenshots of webpages to paste onto the use case documents. I used to have an application that could help me grab full shots of a page without having to hit the "print screen" button every so often and stitching up the images using Photoshop or GIMP. Unfortunately, it isn't free and is only available for MS Windows applications. Fortunately, I found a Firefox add-on that does exactly what I needed. READ MORE ABOUT IT »


So today we've finished the deployment of a very important piece of software where I work. The development of this piece of software has taught me a lot not only about the real power of C++ when you do it "the right way" in terms of expressiveness and sound software design: but also of the power of doing the right things that cause the most impact.
I cannot talk about what that project is and what is about, but I can talk about a surprising little thing that contributed most to the success of the project: the Proactor Pattern. From the original paper:
The Proactor pattern presented in this paper describes how to structure applications and systems that effectively utilize asynchronous mechanisms supported by operating systems. When an application invokes an asynchronous operation, the OS performs the operation on behalf of the application. This allows the application to have multiple operations running simultaneously without requiring the application to have a corresponding number of threads. Therefore, the Proactor pattern simplifies concurrent programming and improves performance by requiring fewer threads and leveraging OS support for asynchronous operations.
This pattern has allowed me to write a very stable highly scalable server. The good thing really is not just that it's a cool pattern, but more because of how you can leverage it to do a lot of other things. The pattern where you "handle the completion of an action" or "do something, then do this" allows you to distribute the processing of what the 'action' is (or what 'this' is) in the particular context to a group of processing threads. Allowing your processing threads to be as oblivious to what it's executing as possible compared to having to tie in the actual processing to specific thread life-times means you're decoupling basically the event queuing mechanism from the actual event handling.
This decoupled design allows users of a framework/library to leverage it in ways that are otherwise impossible with more coupled/rigid frameworks. For instance, I have been (ab)using the Boost.Asio 'io_service' object as a task queue for processing by a thread pool -- by packaging tasks as opaque Boost.Function objects or results of Boost.Bind operations. Boost.Asio uses a proactor framework internally to handle the IO related scheduling and dispatching of asynchronous events. And if C++ developers were lucky, it would be part of the next C++ standard library release (along with C++0x).
The surprise is, thanks to Boost.Asio we're now able to handle thousands of transactions per second with this new piece of software with very little strain on the CPU. Yet again, modern C++ development, a sprinkling of a little template metaprogramming voodoo here and there, and well-designed solutions saves the day -- and provisions for growth to scale in the future.
Time permitting, I may be able to post a simple example of the types of things Parallel Computing, High Performance Computing, Functional Programming, Modern C++, and Template Metaprogramming all culminating together allows. For now, this is an installment in the series of articles "On Parallel Computing" to let the world know of "that little pattern that could".
This is part of an on-going series of articles pertaining to parallel computing and programming techniques in C++. If you have specific questions particular to your application and your situation, please feel free to email me through dean.berris@cplusplus-soup.com and I'll get to your questions as soon as I can.


Just today I had the chance to play around with Hypertable using the C++ API, and all I can say is that it's really something worth looking into if you're looking at building systems which will handle large amounts of data, and need to scale. From the site:
Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks. Our goal is to bring the benefits of new levels of both performance and scale to many data-driven businesses who are currently limited by previous-generation platforms. Our goal is nothing less than that Hypertable become one of the world’s most massively parallel high performance database platforms.
They have an Alpha release, and although hunting for documentation is a pain, I think I'll find it really amusing to help document this fine piece of software. If I find it really interesting and worth while, maybe I might even dive into the code.
If it's really very interesting to deal with, I might even write some articles about using Hypertable in high performance C++ applications. Maybe include an example in the Coarse Grained Data Parallel quadrant article.
Stay tuned!


When dealing with high performance parallel code on a daily basis, you will find yourself having to think in mostly common terms. These usually distill into some common concepts which I would try to discuss in the next series of articles covering a broad range of topics including concurrency, functional programming, graph theory, and software design. Of course, I will put these in the C++ context and how they map to the work I usually deal with on a daily basis as a C++ programmer.
Generally, you can find that there is a simple matrix of four different parallel programming patterns differentiated by two axes: grains of parallelism and the type of parallelism. There are two 'degrees' of grains when it comes to parallelism: coarse and fine. Then there are two types of parallelism: computational parallelism and data parallelism.
Given these two factors, we get four different quadrants of parallelism:
There are some patterns that arise unique to the quadrant in which your program is located in. I will talk more about these patterns as I write more about them and the patterns I am talking about.
At a more fundamental level however, in any parallel system you will have to delineate regarding the following factors: concurrent computation, data dependencies, and task granularity. These factors all taken together will help you determine how best to approach the problem you are solving, and eventually what kind of program you will be writing.
In this parallel view of things there are a few terms that need to be defined. The first term to be defined is a task which I will hastily define as:
A unit of work which is considered atomic, or executed in a single thread of execution. Usually tasks are modeled as data objects with metadata, manipulated and executed externally.
Using this definition of a task, you can then look at a parallel program as a set or list of tasks which need to be executed -- each task having its own data dependencies (metadata) that gets executed externally by threads of execution. More concretely, these threads of executions can be either real threads (hardware threads), "green" threads (software-driven threads), operating system threads, or remote compute nodes (in the case of Cluster of Workstations or Beowulf clusters).
It's interesting to note that there are arguably infinitely many ways of breaking a problem (or solution) up into tasks, and the challenge really is to find the right recipe to achieve both speedup and efficiency. I will also write more about the difference between just speeding things up, and making your programs more efficient. Effectively, we would also like to consider how the program would scale as the problem sizes increase.
In the next few posts I will discuss ways of breaking up a problem and eventually the solution(s) down into tasks that get processed in parallel. These will involve brute force "pen-and-paper" approaches as well as analytical graph theoretic approaches. I will also present them in the context of C++, maybe highlighting the use of the Intel Threading Building Blocks as well as the Boost C++ Library.
This post is meant to be an introductory in a long line of posts related to parallel computing with C++.
If you would like to learn more about turning your application into parallel applications that leverage multi-core technologies, or are interested in hiring a consultant that has experience with parallel computing and C++, send an email to dean@cplusplus-soup.com -- I should be able to help you connect with people who can help, or be available for consulting depending on the situation.


(Note: Geek post ahead. You have been warned.)
Admittedly, I'm an Emacs power user (and no, I am not as hardcore as some people). Emacs has been my primary text editor for the longest time— the first time I picked up Emacs was way back in 2004, a little over four years ago, when I decided to just sit down and learn the little bugger. Now, I do a lot in the editor: my mail is handled and served by Gnus, I handle my TODO lists and outlines through Org Mode, and I do the majority of my text editing and coding in it.
The last item bears a little explanation. My work entails a lot of heavy text editing in the form of code. I am a programmer by profession, and that means that I spend a majority of time looking at code if not writing it. We do a lot of Java where I work, and most of the company uses Eclipse as the IDE of choice. I use Emacs.
Now, Emacs has some decent support for editing Java files in the form of java-mode, which features basic syntax coloring. That's a good bit, but it isn't exactly what I need— if syntax coloring was all I needed, I could just edit files in gedit. And, admittedly, there are a lot of goodies that a full blown IDE such as Eclipse can bring to the table— code completion comes to mind, especially since Java can get somewhat verbose.
So, my Emacs config also includes JDEE, the Java Development Environment for Emacs. Unfortunately, JDEE development has been relatively stagnant of late, what with it only supporting ant and not Maven— and the latter is a big deal in our shop, as we use AppFuse 2 in a lot of our projects.
Fortunately, someone wrote a decent parser for Maven 2 POMs (which isn't really a parser as it simply asks Maven for the classpath, but what the hey, it works). Unfortunately, it doesn't work for multi-module projects, for one reason or another (and I'm bad at debugging elisp. So, I decided to write a custom macro to handle it. I simple place the following in a JDEE prj.el file (in the root directory of my project):
(jmi/load-multi-module-pom
"/path/to/project/root/here"
'("pom.xml" "core/pom.xml" "web/pom.xml") ;; Path to module POMs
('(jde-project-name "My JDEE Maven Project") ;; Other JDEE variables set here
'(jde-expand-classpath-p t)
'(jde-lib-directory-names '("^lib" "^jar" "^java" "^plugins"))
'(jde-expand-classpath-p t)
'(jde-ant-enable-find t)
'(jde-gen-k&r t)
'(tab-width 4)
'(jde-compile-option-command-line-args
(quote ("-Xlint:all" "-Xlint:-serial")))))
Et, voila. I get multi-module support. The macro is pretty simple, but it's the first I've ever written (and I'm thinking I could have written it as a function, if not for the expansion of the JDEE variable list at the end):
;; Project file helper for multi-module Maven projects
(defmacro jmi/load-multi-module-pom (base-path pom-path-list other-variable-setters)
"Macro to load multi-module projects into JDEE. BASE-PATH is
the path to the root of the multi-module project, POM-PATH-LIST
is a list of paths to the submodule pom.xml (relative to
BASE-PATH). Pass in a list of JDEE variables to set in OTHER-VARIABLE-SETTERS."
(let ((my-classpath (make-symbol "my-full-classpath"))
(my-sourcepath (make-symbol "my-sourcepath")))
`(progn
(require 'pom-parser)
(setq ,my-classpath '("/usr/share/java"))
(setq ,my-sourcepath '())
(mapcar
(lambda (pom-name)
(progn
(message "Reading %s" pom-name)
(with-pom (concat ,base-path pom-name)
(pom-set-jde-variables *pom-node*))
(setq ,my-classpath (append ,my-classpath jde-global-classpath))
(if (stringp jde-sourcepath)
(setq ,my-sourcepath (append ,my-sourcepath (list jde-sourcepath)))
(setq ,my-sourcepath (append ,my-sourcepath jde-sourcepath)))))
,pom-path-list)
(jde-set-variables
,@other-variable-setters
'(jde-global-classpath ,my-classpath)
'(jde-sourcepath ,my-sourcepath)))))
I'll probably elaborate on more of my Java development environment under Emacs in future posts...




Last Sunday, I scheduled myself to prepare for my next training assignment which is mostly related to Test Driven Development (TDD) and Unit Testing. I know how to do it but teaching it and making the participants appreciate TDD is a whole different thing for me. After browsing through the slides for the nth time, nothing came to me as to how I would attack this topic. Thankfully, Franz is at the office as well. He gave me something to read from The Craftsman. Its just a short story but gave a clear approach on how to teach TDD.
Come Tuesday that week, I was so excited to try out the approach I've learned in the story. I started by explaining the development approach that we will be learning for that day. I asked one student to be the typist and reminded that everybody is expected to share their thoughts on the application.
We used the same problem used in The Craftsman, they are to get the prime factors of a certain number. So we started writing their very first test. The test scenario is called testTwo which checks if the length of the array is correct and the value of the array is indeed 2. Of course their test failed after running it, so I asked them to write the simplest code that would pass the test. After writing down their solution, I was surprised it’s the same solution in the story that I was simulating. So I just continued by asking them to write the next test which would get the prime factors of 3. At this point, they're already making fun of me by saying "Oh we'll be testing the whole day, we'll be writing test for 2-1000!" What made me freak out is that their reactions and solutions are very similar to the story. The main deviation would be the additional/minor information I added during the process such as best practices in testing and seeing it LIVE! I never thought that the story predicted the events in such accurate points.
At the end of the exercise, the participants have completed the events in the story. They enjoyed the interactions that they had during the exercise and were really happy with what they just learned, AND that System.out.println() is not the only testing tool around.




Interesting discussion on interaction and perf issues on bundled browsers in the One Laptop project.


It's been a long, long while since I last posted on this blog. I'd like to say that I've been busy, but for the life of me I couldn't figure out what I've been busy with. I know there's a lot of stuff I need to do and yet I feel like I haven't done anything much. So I decided to call this day my "Get Things Done and Organized" Day. I then revived three items which worked for me long ago, though separately. Yet by putting them all together, I realized that I had gained an upper hand in finally getting things done. READ MORE ABOUT IT »




