Jabberwocky

MapReduce

Posted in algorithms, parallel by elisehuard on January 12, 2009

MapReduce is the generic term for a simple algorithm, but also for a library implementing this algorithm at Google. Details in the Google paper on MapReduce and the Google round table on the subject.

  • Map is the processing of a lot of different entities (like, say, counting words in a html page). This step produces a number of key -> value pairs, with duplicates in the keys, so that the values can be grouped for the next step
  • Reduce is the aggregation of all the different results into the required info (like, say, creating a search index). The values from previous steps are grouped per keys (key -> list of values), and these are aggregated (reduced) to produce the required results.

This is obviously only suited for certain operations (process-intensive map operation on lots of different entities, lighter operation to gather into one result). It’s also very well suited to distributed systems. You start the map on a bunch of different machines, and then you reduce the result of all machines in one or more steps.

The simplicity of this concept made for it success – distributed stuff usually is painfully hard to bend your head round. It’s being used in: CouchDB, Hadoop framework, Microsoft Dryad, and a few others mentioned in the wikipedia article. And at Google.

I’ll talk about CouchDB next.

Advertisements
Tagged with: , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: