Nov 08 2012

In the field of distributed data storage it is almost impossible to come up with a universal system that will satisfy all needs. That’s why various distributed storage systems have appeared recently, each facing different needs and using a different approach.

DynamoDB uses a key-value interface, with replication only within a region. I haven’t checked the latency range myself, but according to its website latency stays within single-digit milliseconds, which is at least 10 times more than I want to reach in the thesis system.

Megastore doesn’t reach great performance because it is based on Bigtable (with high communication cost); however, it is scalable and consistent. Synchronization for wide-area replication is done with Paxos. Because scalability, consistency and fault tolerance are prioritized, latency is sacrificed and is within 100-400 milliseconds for reads and writes.

Scatter is a DHT-based key-value store that layers transactions on top of consistent replication (it uses a low-level interface). Even though it provides high availability and scales well, latencies for its operations are still within milliseconds.

VoltDB is an in-memory DB that supports master-slave replication over the wide area.

Cassandra is a column-based store developed and used by Facebook, with reads within milliseconds.

Spanner supports a semi-relational data model and provides high performance, a high-level interface, general-purpose transactions and external consistency (using GPS and atomic clocks with a new concept of time leases: TrueTime). Spanner also integrates concurrency control with replication. The main contribution of the paper is that the system solves the problem of wide-area replication and that it implements globally synchronized timestamps (supporting strong consistency and linearizability for writes and snapshot isolation for reads).

Good: TrueTime. Interleaving data. Atomic schema change. Snapshot reads in the past.

Weak: Possible clock uncertainty. Paxos groups are not reconfigurable. Read-only transactions use a trivial solution for executing reads (if there are a few Paxos groups, Spanner does not communicate between these groups and simply applies the latest timestamp to the read).

Typical reads are near 10 ms and writes average 100 ms.
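To make the TrueTime idea a bit more concrete, here is a minimal Python sketch of a clock that returns an uncertainty interval instead of a single timestamp, plus a "commit wait" that holds a write until its timestamp is guaranteed to be in the past. This is only an illustration of the concept and not Spanner's actual API: the names `TTInterval`, `TrueTimeSketch` and `commit_with_wait`, and the `epsilon_s` uncertainty bound, are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    """Interval that is assumed to contain the true absolute time."""
    earliest: float
    latest: float


class TrueTimeSketch:
    """Toy clock with a fixed uncertainty bound (hypothetical, not Spanner's API)."""

    def __init__(self, epsilon_s, clock=time.monotonic):
        self.epsilon_s = epsilon_s  # assumed clock uncertainty, in seconds
        self._clock = clock

    def now(self):
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, ts):
        # True only when ts has definitely passed, even in the worst case.
        return self.now().earliest > ts


def commit_with_wait(tt, sleep=time.sleep):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):
        sleep(tt.epsilon_s / 10)
    return commit_ts
```

With this scheme the wait lasts roughly twice the uncertainty bound, which is why clock uncertainty shows up directly in write latency.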

Which characteristics can be sacrificed in order to reach specific goals? The answer is: the system should be adapted as much as possible to the needs. It is another story when you are actually chasing latencies… Most probably only a rare DB will fit your requirements…

If it is not at least 90% suited – let the fun part start -> do it yourself 🙂 Like me :))))

Jun 28 2012

Recently I had some free space in my schedule, so why not do some extra stuff on Scalable Distributed Systems?

My choice fell on NoSQL DBs, to be precise Redis, MongoDB and Cassandra. Of course I could have done more, but I didn’t have enough time. So I’m planning to do it later.. one day.. maybe 🙂

Hmm… What did I actually do?

First! Install.

Second! Take a look at the performance. But how to do it in the most efficient way and, obviously, skip reinventing the wheel? Why not use the Yahoo! Cloud Serving Benchmark to evaluate the most commonly used NoSQL databases? The main metrics to compare were:

  • Run Time (ms)
  • Throughput (ops/sec)

Third! Install the next one and go back to step 2.

All details about the DB installations and their performance are here.
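The two metrics are tied together: throughput is just the number of operations divided by the run time. A small Python sketch of pulling both out of a benchmark run, assuming the summary lines look like `[OVERALL], RunTime(ms), <value>` and `[OVERALL], Throughput(ops/sec), <value>`; the helper names `parse_overall` and `throughput_ops_per_sec` are made up for this example.

```python
def parse_overall(output):
    """Collect the '[OVERALL], <metric>, <value>' lines into a dict."""
    metrics = {}
    for line in output.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 3 and parts[0] == "[OVERALL]":
            metrics[parts[1]] = float(parts[2])
    return metrics


def throughput_ops_per_sec(operations, run_time_ms):
    """Throughput (ops/sec) derived from an operation count and a run time in ms."""
    if run_time_ms <= 0:
        raise ValueError("run_time_ms must be positive")
    return operations / (run_time_ms / 1000.0)
```

Running `parse_overall` on the output of each DB's transaction phase gives one pair of numbers per DB, which is enough to compare them side by side.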

Conclusions and observations:

The main challenge was to properly set up all the parameters and connect each DB to the client side (the Yahoo! Cloud Serving Benchmark), using an update-heavy workload, two execution phases (load and transaction) and two metrics: run time and throughput.

During the evaluation it was found that all three DBs performed quite similarly, with a slight difference for MongoDB: the run time of its transaction phase was slightly greater than the others’ and its throughput was lower.

Obviously, due to the different application areas of these NoSQL DBs (in-memory DB, DB for small structured data), it’s impossible to say which one of them is the best. That’s why the DB should be chosen depending on the needs 🙂