Wednesday, February 27, 2013

Understanding Pregel

As I said in my previous post, I am interacting a lot with Apache Giraph. Before beginning with Giraph, I needed to understand the concept behind it. Giraph is an open-source implementation of the Pregel model, proposed by Google in 2010.

What is Pregel?

A model or framework designed for processing large-scale graphs.
How large can these large-scale graphs be? Well, let's say billions is a good number :)

What makes Pregel special?

Or shall I ask: Why not MapReduce (M/R)?
Well, M/R and graph processing are not the best buddies. In M/R it's quite hard to express graph algorithms, there is no option of calculating per node/vertex. Moreover, in M/R we have to pass the entire state of the graph from one phase to the other causing too much I/O traffic and watching the performance to go down.

Pregel comes to the rescue and offers what M/R cannot! The programmer can make calculations per vertex - thus it's very expressive - and there is no necessity for passing the graph state - less I/O, better performance. Pregel makes the graph processing while preserving fault tolerance, scalability and being distributed.

How does Pregel work?

Initialization: To start, we dedicate several machines (or cores) for the Pregel computation. Among them one is elected as the master for coordinating the rest, the workers. After inserting as an input a directed graph, the graph gets divided into partitions, each one including a set of vertices and their outgoing edges.

Execution: The master assigns these partitions to the workers and then a loop of supersteps begins! In each superstep, all vertices (in each worker) pass through 3 steps: (i) receive messages from other vertices, (ii) execute the same user-defined function, (iii) send messages to other vertices. At the end of each superstep, a global synchronization occurs among the workers and messages are exchanged.

Termination: This wonderful loop terminates when all vertices become inactive and there are no messages in transit (no messages to be sent to other vertices).

For more details - including strong and weak points of the paper, have a look at the review I made as part of ID2220 - Advance Topics in Distributed Systems, at KTH.





No comments:

Post a Comment