Look Out Oracle: MapReduce Might Be Able To Do JOINs

Apologies for the esoteric post, folks... but this is kind of important... Two folks from Yahoo, plus two from UCLA, have just published a paper through the ACM on a new kind of parallel algorithm: Map-Reduce-Merge.

If you don't know about MapReduce, it's the algorithm that makes most of Google possible. It's a simple technique that lets you break a complex problem into hundreds of smaller problems, use hundreds of computers to solve them, then stitch the complete solution back together. Google says it's excellent for:

"distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..."
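(For the unfamiliar, here's a toy single-machine sketch in Python of the word-count example usually used to explain MapReduce. The function names are mine, not Google's, and a real implementation distributes the map and reduce steps across many machines.)

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(map_phase(docs))
# counts["the"] == 2, counts["fox"] == 1
```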

bla bla bla... but MapReduce can't do joins between relational data sets. In other words, it's great for making a search engine, but woefully impractical for virtually every business application known to man... although some MapReduce-based systems are trying anyway (CouchDB, Hadoop, etc.)

UPDATE: Some Hadoop fans mentioned in the comments that MapReduce can do joins in the Map step or the Reduce step... but it's highly restrictive in the Map step, and sometimes slow in the Reduce step... joins are possible, but sometimes impractical.

Well... this latest twist from the Yahoo folks fixes that: they claim MapReduceMerge now supports table JOINs. No proof as of yet, but there are a lot of folks staking their reputations on this... so it's a fair bet. The Hadoop folks seem to be experimenting with MapReduceMerge... so if they spit out some new insanely fast benchmarks, my guess is that this is for real...

What does this mean for relational databases like Oracle? Uncertain... but I did hear a juicy rumor about 15 months back: some guy from Yahoo sat down in a room with Oracle's math PhDs, and spent a day discussing an algorithm for super-fast multidimensional table joins... like sub-second performance on 14-table relational queries, with no upper limit. My sources told me the Oracle dudes were floored, and started making immediate plans to integrate some new stuff into their database. The Yahoo connection made me think this might be the MapReduceMerge concept...

Coincidence? Perhaps... but a juicy rumor nonetheless.

MapReduce can do Joins

Actually, joins are quite possible with MapReduce. There are limitations, but both map-side and reduce-side joins can be done. Map-side joins are more scalable, but can only be done under special cases (see the Hadoop contrib src tree). Reduce-side joins are always possible, but if the joined set is huge, there will be some penalty.
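To make that concrete, here's a toy single-process sketch of a reduce-side join (plain Python with made-up names, not Hadoop's actual Java API): the map step tags each record with its source table, the shuffle groups records by join key, and the reduce step pairs them up. The penalty mentioned above shows up when one join key has a huge group, since all of its rows land on a single reducer.

```python
from collections import defaultdict

def reduce_side_join(left, right, key):
    """Tag each record with its source table (map), group records by the
    join key (shuffle), then pair rows within each group (reduce)."""
    tagged = [("L", row) for row in left] + [("R", row) for row in right]
    groups = defaultdict(lambda: {"L": [], "R": []})
    for tag, row in tagged:
        groups[row[key]][tag].append(row)      # shuffle: group by join key
    joined = []
    for bucket in groups.values():
        for l in bucket["L"]:                  # reduce: cross product per key
            for r in bucket["R"]:
                joined.append({**l, **r})
    return joined

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]
rows = reduce_side_join(users, orders, "id")
# two joined rows, both for user "ann"; "bob" has no matching orders
```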

Both Cascading and Pig support joins very naturally, plus there are examples in the Hadoop contrib tree.

That said, MapReduceMerge will be a welcome addition as joins are complex to implement (unless using one of the above tools).


Get ready for fifth generation computers...

Large scale parallel computers really can do things that serial computers can't!

JOINs are possible, but MapReduceMerge makes them practical

I'll wait and see what Hadoop does with this... hopefully they'll do some before and after stats.

Hadoop already has some

Hadoop already has some built-in packages to join data from multiple sources.
One package offers a generic framework for reduce-side joins; another is for map-side joins. Typically, the map-side join is more efficient than the reduce-side join.
However, for the map-side join to be applicable, the input data has to meet certain criteria:
the input data from the different sources must be partitioned into the same number of parts on the join keys, and
each part must be sorted in the same order on the join keys.
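As a rough illustration of why those criteria matter, here's a sketch (plain Python, not Hadoop's actual Java API) of what a map-side merge join does once both inputs are sorted on the join key: each mapper can stream the two sorted parts in lockstep, with no shuffle phase at all.

```python
def map_side_merge_join(left, right, key):
    """Merge join of two inputs already sorted on the join key -- the
    precondition a map-side join requires. Walks both lists in lockstep."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1                              # left key too small: advance left
        elif lk > rk:
            j += 1                              # right key too small: advance right
        else:
            jj = j                              # pair left[i] with the whole run of
            while jj < len(right) and right[jj][key] == lk:
                out.append({**left[i], **right[jj]})
                jj += 1                         # right rows sharing this key
            i += 1
    return out

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]    # sorted on id
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]  # sorted on id
rows = map_side_merge_join(users, orders, "id")
# both orders pair with user "ann"
```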

In addition, it is actually easier than commonly believed to write one's own map/reduce jobs to join data from multiple sources.
Depending on the relative sizes of the inputs from the different sources, one can do a hash-based map-side join, a map-side merge join,
or a reduce-side join.
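A sketch of the first of those options, the hash-based map-side join (sometimes called a broadcast or replicated join): when one input is small enough to fit in each mapper's memory, build a hash table from it and stream the large input past it, so again no shuffle is needed. As before, this is illustrative Python, not Hadoop's API.

```python
def broadcast_hash_join(small, large, key):
    """Build an in-memory hash table from the small input, then stream the
    large input through it, yielding one merged row per match."""
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    for big_row in large:                       # the streamed (mapper) side
        for small_row in table.get(big_row[key], []):
            yield {**small_row, **big_row}

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]    # small side
orders = [{"id": 1, "item": "book"}, {"id": 1, "item": "pen"}]  # large side
rows = list(broadcast_hash_join(users, orders, "id"))
# two rows, both joining "ann" to her orders
```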

re: Hadoop already has some

Runping, how do you see this new algorithm affecting how Hadoop does things? Do you think it's better than a map-side join with restricted data?

Dirty sloppy parallel processing

Is this the basic idea?

Say I have a slick serial processing algorithm that takes 10 minutes to run on one machine to process a dataset.

Say I have another algorithm that can take the same dataset, send it to 10 machines, have each run on 1/10 of the dataset for 4 minutes, then merge for another 4 minutes.

Total time of the second approach: 8 minutes, but it took 44 minutes of processing power.
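Spelling out the arithmetic (assuming, as the 44-minute figure implies, that the merge runs on a single machine):

```python
serial_wall = 10                # minutes for the serial algorithm, one machine
machines = 10
map_wall = 4                    # each machine works 4 minutes, in parallel
merge_wall = 4                  # merge runs on one machine for 4 minutes

parallel_wall = map_wall + merge_wall             # elapsed time: 8 minutes
parallel_cpu = machines * map_wall + merge_wall   # total work: 44 machine-minutes
```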

sort of...

MapReduceMerge is sloppy, but I don't think your numbers are "real world" metrics...

Google's #1 expense right now is electricity. So if they could save 80% of total processing power for a mere 20% drop in performance, they would go for the cheaper option.

Although I am curious to see some official numbers.

Dirty sloppy parallel processing... Really?

That's an interesting point, but what if in this completely theoretical, non-specific scenario, we actually need to get the job done in 8 minutes, or 5 minutes? You will not accomplish this with your serial processing algorithm. Furthermore, what if your data set doubles? In MapReduce, it's likely that your processing time will scale horizontally. Whether it's efficient or not, some things are simply not possible in single-serial processing systems. That is, until Moore's law resets the exponential possibilities. But at that point, MapReduce will have moved on to exponentially more complicated problems.

Best of luck

can hash join be implemented

Can a hash join be implemented on Hadoop in some efficient way?

Different problems, different tools, different strengths...

A good interview with Doug Cutting, inventor of Hadoop (this is page 2 of the article):


Computerworld: "Are they replacing relational databases for the most part, or just supplementing them?"

Cutting: "They are augmenting and not replacing. There are a lot of things I don't think Hadoop is ever going to replace, things like doing payroll, the real nuts-and-bolts things that people have been using relational database[s] for forever. It's not really a sweet spot for Hadoop."

I'm really interested in Hadoop: for some things, like big data sets, an RDBMS can have scaling problems. Indexing is great for performance, but it eats space like crazy. Hadoop is wonderful for pattern investigation and exploratory analysis.
