Feb 132013

Megastore is a storage system develope by Google, as an improvement of their BigTable one. They claimed to support strong consistency and high availability guatantees. Moreover, fully serializable ACID semantics within fine-grained partitions of data are provided.

To achieve hight availability and consistency, synchronous replication is applied. Replication is implemeted using Paxos for every write across the data centers, e.g. synchronous replication for writes is supported.


  1. Availability and scalability are apriory build-in.
  2. Paxos replication and consensus algorithm is optimized for low-latency across geo distirubted datacentres. This way high availability is supported. (Low-latency = 10-1000 miliseconds)

Availability + Scalability

Availability is supported by using synchronous, fault-tolerant log replicator optimized for long distances. While scalability is supported with partitioning the database. Those two concepts are well thought and oriented for long-distance links and active communication over them.

To scale throughput data is partitioned into a collection of entity groups. Operation across entity groups leverage asynchronous messaging. It is alos used for logically distant entities. Minimizing latency and improving throughput are laid on the applocation control level.


Megastore is deployed through clients libraries to which applications are linked. This library implements Paxos, therefor they deal with failure detecting and replication. Durability of Paxos operations is provided by direct send of an operation to the local Bigtable instance.

Another cool thing about the Megastore is its various index tuning. e.g. secondary indexes can be declared, therefor optimise further lookup.

Concurrency control

Each Entity group is abstracted to be a DB that provide serializable ACID semantics. Each row i nthe Megastore can be stored with multiple values with different timestamps. This way multiversion concurency control supported. Similar approach I will use in my thesis project to suport concurrent updates to the DS.

For atomic updates across entity groups a two phase commit can be used. However its usage is not recommended as it may increase the risk of contention.


Reads and writes can be instantiated from any replica preserving ACID semantic. Repication is performed per entity group. Simply, transaction log synchronously transmitted to a quorum of replicas. Agreemet between replicas is maintained by an optimized Paxos. Coordinator tracks a set of entities groups for each replica. Fast writes are leveraged by teh master-based approach (where each read and write are dedicated to one master). This way Paxos prepare stage can be skipped, if the previous requests succeded.

Strong side

  • Partitioning the data into entity groups. This way suport for ACIT transaction is guarateed.
  • Consistency level can be loosed but the user, therefor the latency improves.
  • Indexation is extremely useful for development.
  • Extension of Paxos to optimize wide-range aggrements.

Weak side

  • Partitioning of the data laid on the application side.
  • Write rate is quite low
  • Joins give an extra overhead, as the data should be denotmalized
  • Asymetric network partitioning vulnerability.
  • Chubby for failure detection, which is, most probably, not the most effecient tool.
Nov 072012

Pregel is a system for processing large-scale graphs. The main contributions of the paper are the following: 1) fault-tolerant distributed programming framework for graph algorithms execution; 2)API with direct message passing among vertices.

Brief description:

Pregel is a synchronized computation process on vertices. Upon inserting a graph as an input, the graph is divided into partitions (using consistent hashing), which include a set of vertices and their outgoing edges.  One of the machine is coordinator. The workers then undergo a series of iterations, called supersteps. In every superstep, all vertices in each worker execute the same user-defined function which can (a) receive messages sent during the previous superstep, (b) modify the state of the vertex and its outgoing edges (vertices and edges are kept on the machines) and (c) send messages to be delivered during the next superstep. At the end of each superstep a global synchronization point occurs. Vertices can become inactive and the sequence of iterations terminates when all vertices are inactive and there are no messages in transit. During computation, the master also sends ping messages to check for workers failures. The network is used only for sending messages and therefore it significantly reduces the communication overhead, becoming more efficient.

The presentation can be found here:


  • Usability
  • Scalability
  • Performance
  • Transparency in vertex-to-machine assigning

Not good/Open questions/Future improvements:

  • How to detect a master failure?
  • Why not to compare with MapReduce comparison in evaluation part.
  • Failure Detection needed to be improved to confined recovery.
  • Improve automatisation for defining user-defined functions
  • hash(VertexID) mod Partition – Why not to use smarter vertex assigning (to reduce then network usage)?