Why Spark Is Faster Than MapReduce


Apache Spark is a unified analytics engine for processing huge amounts of data. It supports development in multiple languages, including Scala, Java, Python, and R.

Spark is faster than MapReduce and also works as a distributed system: it divides data into a number of partitions, distributes them across multiple nodes, and then processes each partition on its node.
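As a rough illustration in plain Python (not actual Spark APIs), partitioning simply means splitting a dataset into chunks that can be processed independently on different nodes:

```python
# Hypothetical sketch: split a dataset into partitions, the way a distributed
# engine like Spark does before assigning them to worker nodes.
def partition(data, num_partitions):
    """Split data into roughly equal chunks (one per 'node')."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

records = list(range(10))
parts = partition(records, 3)
# Each partition can now be processed in parallel on a different node.
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Real Spark decides the number of partitions from the input source and cluster configuration; the point here is only that each chunk can be worked on independently.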

Spark can be up to 100 times faster than MapReduce, for the following reasons.


1. In-Memory Computation/Processing:

Spark is faster than MapReduce because Spark does all of its processing in memory (RAM), while MapReduce performs disk I/O (input/output), which is an expensive operation. In MapReduce, the map operation runs first and saves its output to disk; the data is then fetched from disk back into memory to perform the reduce operation.
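A minimal sketch of this difference, using plain Python rather than either framework's real API: the "MapReduce-style" word count spills its map output to disk and reads it back, while the "Spark-style" version keeps the intermediate pairs in memory.

```python
import json
import os
import tempfile

data = ["a b", "b c", "a a"]

def mapreduce_wordcount(lines):
    # Map phase: emit (word, 1) pairs and spill them to a disk file.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for line in lines:
            for word in line.split():
                f.write(json.dumps([word, 1]) + "\n")
    # Reduce phase: read the pairs back from disk and aggregate them.
    counts = {}
    with open(path) as f:
        for entry in f:
            word, one = json.loads(entry)
            counts[word] = counts.get(word, 0) + one
    os.remove(path)
    return counts

def inmemory_wordcount(lines):
    # Map phase: the pairs stay in RAM between the two phases.
    pairs = [(w, 1) for line in lines for w in line.split()]
    # Reduce phase: aggregate directly from memory.
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

print(mapreduce_wordcount(data) == inmemory_wordcount(data))  # True
```

Both produce the same result; the extra disk round-trip in the first version is what MapReduce pays on every map/reduce boundary.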


2. Lazy Evaluation:

Spark does not start evaluating or processing a given transformation until an action is called. Once an action is called, Spark first executes all the queued transformations in sequence and then runs the action in one go. This reduces disk I/O operations and keeps the processing in memory.
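A loose analogy for this behavior, using Python generators instead of Spark APIs: each "transformation" wraps the previous one without doing any work, and nothing executes until the final "action" consumes the pipeline.

```python
# Rough lazy-evaluation analogy (generators, not Spark): work happens only
# when the result is actually demanded.
log = []

def numbers():
    for n in range(5):
        log.append(f"read {n}")   # record when work actually happens
        yield n

doubled = (n * 2 for n in numbers())        # "transformation": nothing runs yet
evens = (n for n in doubled if n % 4 == 0)  # another transformation: still nothing

assert log == []        # no data has been touched so far

result = list(evens)    # the "action": triggers the whole pipeline at once
assert result == [0, 4, 8]
assert len(log) == 5    # the reads happened only when the action ran
```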

As you can see in the diagram below:

1. First, take the input data, which can be any structured, semi-structured, or unstructured data.

2. Create an RDD (Resilient Distributed Dataset), DataFrame, or Dataset. You can also create an RDD from a sequence of numbers, from hard-coded values, or in several other ways.

3. Apply transformations such as filter and map to the RDD, DataFrame, or Dataset. These do not execute immediately; at this point Spark only records the order of the transformations.

4. Apply an action such as show, write, or count. Spark then executes the transformations in the given sequence, followed by the action. In this way, once an action is called, all the transformations execute in memory in one go, reducing disk I/O operations.
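The four steps above can be sketched with a toy lazy dataset in plain Python. The class and method names here are hypothetical, chosen to mirror Spark's API shape, not the real PySpark implementation:

```python
class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: it records transformations and
    executes them only when an action is called."""

    def __init__(self, data):
        self.data = data          # steps 1-2: input data wrapped in a dataset
        self.plan = []            # ordered list of pending transformations

    def map(self, fn):            # step 3: queue a transformation
        self.plan.append(lambda rows: [fn(r) for r in rows])
        return self

    def filter(self, pred):       # step 3: queue another transformation
        self.plan.append(lambda rows: [r for r in rows if pred(r)])
        return self

    def count(self):              # step 4: the action runs the whole plan now
        rows = self.data
        for step in self.plan:
            rows = step(rows)
        return len(rows)

ds = LazyDataset([1, 2, 3, 4, 5]).map(lambda x: x * 10).filter(lambda x: x > 20)
assert ds.data == [1, 2, 3, 4, 5]   # nothing has executed yet
assert ds.count() == 3              # 30, 40, 50 pass the filter
```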


3. Fault Tolerance:

An RDD is a resilient distributed dataset that is immutable and re-computable. If a worker node fails and a partition of an RDD is lost as a result, that partition can be recomputed from the original dataset, or from any previously cached or persisted dataset, using the lineage of all the transformations that were performed.
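A toy illustration of lineage-based recovery in plain Python (hypothetical names, not Spark code): keep the immutable source partitions plus the ordered list of transformations, and if a computed partition is lost, replay the lineage for just that partition.

```python
# Immutable base data split into 3 partitions, plus the recorded lineage
# of transformations applied to every partition, in order.
source = [[1, 2], [3, 4], [5, 6]]
lineage = [lambda x: x + 1, lambda x: x * 2]

def compute_partition(i):
    """Rebuild partition i by replaying the lineage over the source data."""
    rows = source[i]
    for fn in lineage:
        rows = [fn(r) for r in rows]
    return rows

computed = [compute_partition(i) for i in range(3)]
computed[1] = None                   # simulate a node failure losing partition 1
computed[1] = compute_partition(1)   # recompute only the lost partition
assert computed[1] == [8, 10]        # (3+1)*2, (4+1)*2
```

Because only the lost partition is replayed, recovery does not require re-running the whole job or keeping full replicas of intermediate data.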
