I'll describe, at a very high level and as I understand it, the differences on the shuffle side between Apache Spark and Apache Hadoop MapReduce.
Since a few folks have already covered the differences in terms of I/O etc., I'll stick to the shuffle differences only.
First, let's discuss the map-side differences:
Map side of Hadoop MapReduce (see left side of the image above):
- Each map task outputs its data as key-value pairs.
- The output is first stored in a CIRCULAR BUFFER instead of being written to disk.
- The circular buffer is around 100 MB in size. By default, when it is 80% full, the data is spilled to disk; these files are called shuffle spill files. (The configuration knobs behind these numbers are sketched after this list.)
- Many map tasks run on a given node, so many spill files are created. Hadoop merges all the spill files on that node into one big file, which is SORTED and PARTITIONED based on the number of reducers.
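For completeness, the buffer size and spill threshold above map to well-known MapReduce configuration properties. A minimal sketch, assuming the Hadoop 2.x property names (the names are not part of the answer itself, just the usual knobs):

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: map-side sort buffer and spill threshold, Hadoop 2.x property
// names (older releases used io.sort.mb and io.sort.spill.percent).
val conf = new Configuration()

// Size of the in-memory circular (sort) buffer, in MB (default ~100 MB).
conf.setInt("mapreduce.task.io.sort.mb", 100)

// Fraction of the buffer that may fill up before a background spill to disk
// starts (default 0.80, i.e. the "80% full" figure mentioned above).
conf.set("mapreduce.map.sort.spill.percent", "0.80")
```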
Map side of Spark (see right side of the image above):
- Initial Design:
  - The map-side output is written to the OS BUFFER CACHE.
  - The operating system decides whether the data can stay in the OS buffer cache or should be spilled to DISK.
  - Each map task creates as many shuffle spill files as there are reducers.
  - Spark doesn't merge and partition the shuffle spill files into one big file, which is what Apache Hadoop does.
  - Example: if there are 6000 (R) reducers and 2000 (M) map tasks, there will be M*R = 2000*6000 = 12 million shuffle files, because in Spark each map task creates one spill file per reducer. This caused performance degradation.
  - This was the initial design of Apache Spark.
- Shuffle File Consolidation:
  - Same as the initial design above.
  - The only difference is that map tasks running on the same core write into the same set of files, so each CORE outputs as many shuffle files as there are reducers.
  - Example: if there are 6000 (R) reducers and 4 (C) cores, the number of shuffle files will be R*C = 6000*4 = 24,000. Note the huge drop in the number of shuffle files (a quick calculation for both designs is sketched after this list).
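To make the two examples concrete, here is a quick back-of-the-envelope calculation using the numbers above (M = 2000, R = 6000, C = 4):

```scala
// Back-of-the-envelope shuffle file counts for the two designs above.
val numMapTasks = 2000L // M
val numReducers = 6000L // R
val numCores    = 4L    // C

// Initial design: every map task writes one file per reducer.
val initialDesignFiles = numMapTasks * numReducers // 12,000,000

// Shuffle file consolidation: one file group per core, one file per reducer.
val consolidatedFiles = numCores * numReducers // 24,000

println(s"initial design: $initialDesignFiles files, consolidated: $consolidatedFiles files")
```

If I remember correctly, in the old hash-based shuffle this consolidation was toggled with the spark.shuffle.consolidateFiles setting; later Spark versions moved to a sort-based shuffle, so that setting no longer applies.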
Now to the reduce side differences:
- Reduce side of Hadoop MR:
  - The reducer PULLS the intermediate files (shuffle files) created on the map side, and the data is loaded into memory.
  - If the buffer reaches 70% of its limit, the data is spilled to disk (the corresponding configuration knobs are sketched after this list).
  - The spills are then merged to form bigger files.
  - Finally, the reduce method is invoked.
- Reduce side of Apache Spark:
  - The map side PUSHES the intermediate files (shuffle files) to the reducer.
  - The data is written directly to memory.
  - If the data doesn't fit in memory, it is spilled to disk from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown (see the second sketch after this list).
  - Finally, the reducer functionality is invoked.
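For the Hadoop reduce side, the 70% figure above corresponds to the reduce-side shuffle buffer settings. A minimal sketch, again assuming the Hadoop 2.x property names (not something stated in the answer itself):

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: reduce-side shuffle buffer knobs (Hadoop 2.x property names).
val conf = new Configuration()

// Fraction of the reducer's heap used to hold map outputs fetched over the
// network (default 0.70 -- the "70%" figure mentioned above).
conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70")

// Usage threshold of that in-memory buffer at which a merge/spill to disk
// begins (default 0.66).
conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66")
```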
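And for the Spark reduce side, the spill-to-disk behaviour mentioned above was exposed as a flag in those older releases. A minimal sketch, assuming the pre-2.0 spark.shuffle.spill setting (recent Spark versions always spill and ignore this flag):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: in Spark 0.9.x - 1.x, reduce-side aggregation could spill to
// disk when it didn't fit in memory; spark.shuffle.spill (default true from
// 0.9 onwards) controlled this behaviour.
val conf = new SparkConf()
  .setAppName("shuffle-spill-sketch")
  .setMaster("local[*]")
  .set("spark.shuffle.spill", "true")

val sc = new SparkContext(conf)

// A shuffle-heavy aggregation; with spilling enabled the reduce side falls
// back to disk instead of throwing an OOM when memory runs out.
val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 1000, 1))
  .reduceByKey(_ + _)

println(counts.count())
sc.stop()
```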