Jun 28 2012
 

Recently I had some free time in my schedule, so why not do some extra work on scalable distributed systems?

My choice fell on NoSQL databases, to be precise Redis, MongoDB and Cassandra. Of course I could have covered more, but I didn't have enough time for it. So I'm planning to do it later.. one day.. maybe 🙂

Hmm… What did I actually do?

First! Install.

Second! Take a look at the performance. But how to do it in the most efficient way, while skipping the part where I reinvent the wheel? Why not use the Yahoo! Cloud Serving Benchmark (YCSB) for evaluating the most commonly used NoSQL databases? The main metrics to compare were (see the small sketch after the steps for how they relate):

  • Run Time (ms)
  • Throughput (ops/sec)
Third! Install the next one and go back to step 2.
All details about the database installations and their performance are here.
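
Just to make those two metrics concrete, here is a tiny Java sketch of how run time and throughput relate during a load phase. It is purely illustrative (a HashMap stands in for the real database client; YCSB computes these numbers itself):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the two metrics: run time of a load phase and the resulting
// throughput. A HashMap stands in for the real Redis/MongoDB/Cassandra client.
public class MetricsSketch {
    public static void main(String[] args) {
        int operationCount = 100_000;
        Map<String, String> store = new HashMap<>();

        long start = System.nanoTime();
        for (int i = 0; i < operationCount; i++) {
            store.put("user" + i, "field0=value" + i);   // one insert operation
        }
        long runTimeMs = Math.max(1, (System.nanoTime() - start) / 1_000_000);

        double throughput = operationCount * 1000.0 / runTimeMs;  // ops/sec
        System.out.printf("RunTime(ms): %d, Throughput(ops/sec): %.1f%n",
                runTimeMs, throughput);
    }
}
```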

Conclusions and observations:

The main challenge was to set up all the parameters properly and connect each database to the client side, the Yahoo! Cloud Serving Benchmark, using an update-heavy workload with two execution phases – load and transaction – and two metrics: run time and throughput.

During the evaluation it was found that all three databases performed quite similarly, with a slight difference for MongoDB: the run time of its transaction phase was slightly longer than the others' and its throughput was lower.

Obviously, due to the different application areas of these NoSQL databases – in-memory storage, small structured data – it's impossible to say which one of them is the best. That's why the database should be chosen according to the needs at hand 🙂

 

May 18 2012
 

Currently, I'm quite inspired by one of the projects I have at UPC.

It's a practical evaluation of the process coordination service ZooKeeper. The main goal is to fully understand its working principles and to try to find some bottlenecks by building a somewhat sophisticated system on top of ZooKeeper.

Obviously, it's quite hard to work with a system without knowing the basic principles and ideas behind it. To work clearly and consciously, I decided to read the paper on the topic written by the authors of the system.

Here is what I came up with:

ZooKeeper is a service for coordinating the processes of distributed applications. It is essentially a key/value table with hierarchical keys. It's not designed for general data storage purposes, though some client information can still be stored there. How does it actually achieve all this? By incorporating elements from group messaging, shared registers and distributed lock services into a replicated, centralized service. To combine all these features into one, ZooKeeper uses wait-free shared registers and FIFO execution guarantees with an event-driven mechanism, so it can stay simple and powerful at the same time. The idea of ZooKeeper was mainly inspired by Chubby, a locking service with strong synchronization guarantees. However, ZooKeeper was developed avoiding blocking primitives such as locks. The system implements an API that manipulates simple wait-free data objects organized hierarchically, as in a file system.
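
To give a feel for that API, here is a minimal Java sketch using the standard org.apache.zookeeper client (the connection string and the paths are made up for illustration):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Minimal use of the ZooKeeper Java client: connect, create a znode in the
// hierarchical namespace and read it back. Host, port and paths are examples.
public class ZkBasics {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();   // wait for the session to be established

        // Znodes form a tree, like a file system: /app1/config is a child of /app1.
        if (zk.exists("/app1", false) == null) {
            zk.create("/app1", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app1/config", "timeout=30".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        System.out.println(new String(zk.getData("/app1/config", false, null)));
        zk.close();
    }
}
```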

Which guarantees does ZooKeeper provide?

  • Linearizable writes.
  • FIFO client order.
  • Handling shared configuration updates.
  • Liveness and durability.

How?

  • Different node types (regular, ephemeral)
  • Watches
  • Completely Replicated System

What for?

  • Configuration Management
  • Rendezvous
  • Group Membership
  • Simple Locks
  • Simple Locks without Herd Effect (see the sketch after this list)
  • Read/Write Locks
  • Double Barrier
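
These recipes come straight from the paper; as an illustration, here is a hedged Java sketch of the "simple lock without herd effect" recipe, where each client creates an ephemeral sequential znode and watches only the node right before its own, so a release wakes up a single waiter (the /lock parent znode is assumed to exist):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of the "simple lock without herd effect" recipe from the ZooKeeper
// paper. The lock path "/lock" and the session handling are illustrative.
public class ZkLockSketch {
    private final ZooKeeper zk;
    private final String lockRoot = "/lock";
    private String myNode;

    public ZkLockSketch(ZooKeeper zk) { this.zk = zk; }

    public void lock() throws Exception {
        myNode = zk.create(lockRoot + "/lock-", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren(lockRoot, false);
            Collections.sort(children);
            String myName = myNode.substring(lockRoot.length() + 1);
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) {
                return;                        // lowest sequence number: lock acquired
            }
            // Watch only the node immediately before mine (no herd effect).
            String previous = lockRoot + "/" + children.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            Stat stat = zk.exists(previous, event -> gone.countDown());
            if (stat != null) {
                gone.await();                  // wait until the previous holder releases
            }
            // Re-check the children and try again.
        }
    }

    public void unlock() throws Exception {
        zk.delete(myNode, -1);                 // releasing wakes up at most one waiter
    }
}
```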

Want to read more? Check my wiki.

Implementations

  Configuration Management

  Reliable Multicast

Both programs were implemented with ZooKeeper. The code can be found by following the links above; explanations and activity diagrams of the applications are fully presented in the presentation below.

Presentation

Summary and Critique from my side

ZooKeeper uses wait-free protocols to implement process coordination in distributed systems. A main design goal of the system is its lack of locks, which improves performance significantly. Thanks to features like watches and the different node types, ZooKeeper supports the following applications:

  • Configuration Management
  • Leader Election
  • Group Membership

All of the mentioned patterns were exercised in the following applications:

  • Dynamic LogBack Configuration Management
  • Reliable Multicast with handling of group membership and leader election to provide total order and reliability

A weakness of ZooKeeper is that intermediate changes can be dropped: between receiving a watch event and re-setting the watch, multiple changes to the object may happen, so not every change is tracked, only the latest state. This actually means that ZooKeeper is a state-based system, not an event system.
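
In practice, clients deal with this by re-reading the current state and re-registering the watch every time it fires. A rough Java sketch of that pattern (the path is illustrative):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of the usual "state based" watch pattern: whenever the watch fires,
// re-register it and read the *current* data. Any intermediate updates that
// happened while the watch was not set are collapsed into this single read.
public class ConfigWatcherSketch implements Watcher {
    private final ZooKeeper zk;
    private final String path = "/app1/config";   // illustrative path

    public ConfigWatcherSketch(ZooKeeper zk) throws Exception {
        this.zk = zk;
        readAndWatch();
    }

    private void readAndWatch() throws Exception {
        // Passing "this" re-registers the watch together with the read.
        byte[] data = zk.getData(path, this, new Stat());
        System.out.println("current config: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                readAndWatch();                // the watch was consumed; set it again
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```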

Also, complete replication limits the total size of data that can be managed by ZooKeeper, and serializing all updates through a single leader is a possible performance bottleneck.

May 17 2012
 

This is another one of the projects I'm currently into.

Currently, the deployment of translucent optical networks is the most promising short-term solution in optical backbone networks to reduce cost and energy consumption. But at the same time some problems can appear, like the inflexibility and coarse granularity of the Wavelength Switched Optical Network (WSON). So how to deal with it? Apply Optical Burst Switching (OBS), or even better (as not every variant of this switching is good enough), Translucent OBS (T-OBS).

What actually is T-OBS?

It's an architecture in which core nodes switch incoming data bursts to their output ports in an all-optical fashion, or through regenerators when signal regeneration is required. It is, in some sense, a bridge between the transparent and opaque flavours of OBS.

The background information for the project is taken from Oscar Pedrola's paper.

Project itself

In this kind of network, with OBS switching technology, the need to solve the RRPD problem appears.

What is RRPD?

RRPD is the Routing and Regenerator Placement and Dimensioning problem, made up of two parts. The first part is the routing problem: minimizing the average number of paths that go through a single link. The second part searches for where to place the necessary regenerators, in order to reduce their number in the network while still providing an acceptable signal-to-noise ratio.

As the problem is very complicated and needs a lot of resources, it is a good idea to split the work:

  • starting from the MILP for the routing problem, which was already implemented in CPLEX (code is available on the wiki). As the solution of this problem is supposed to be exact, in the paper it was solved only through ILP. In this project we went a bit further and suggest a heuristic for this part as well. The main idea is to compare the ILP and the heuristic, and perform an analysis based on that.
  • the next part of the project deals with regenerator placement and dimensioning. After a few hours of researching the most efficient heuristics for this problem, the Biased Random-Key Genetic Algorithm (BRKGA) was chosen (a generic sketch of one BRKGA generation is shown below). Even though in the paper this problem was solved only by means of a heuristic, we are planning to implement an ILP for it in CPLEX and compare the results and performance of both approaches.
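
For reference, this is roughly what one BRKGA generation looks like. It is a generic Java sketch, not the project's implementation: the decoder below is a toy that just sums the keys, whereas the real decoder would map the random-key vector to a regenerator placement and return its cost.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Generic sketch of the Biased Random-Key Genetic Algorithm (BRKGA):
// random-key chromosomes, an elite set copied unchanged, fresh mutants,
// and biased crossover towards the elite parent.
public class BrkgaSketch {
    static final int POP = 50, KEYS = 20, ELITE = 10, MUTANTS = 5;
    static final double RHO = 0.7;               // bias towards the elite parent
    static final Random rnd = new Random(42);

    // Toy decoder: fitness = sum of keys (to be minimized). A real RRPD
    // decoder would translate the keys into a regenerator placement cost.
    static double decode(double[] keys) {
        return Arrays.stream(keys).sum();
    }

    static double[] randomChromosome() {
        double[] c = new double[KEYS];
        for (int i = 0; i < KEYS; i++) c[i] = rnd.nextDouble();
        return c;
    }

    public static void main(String[] args) {
        double[][] pop = new double[POP][];
        for (int i = 0; i < POP; i++) pop[i] = randomChromosome();

        for (int gen = 0; gen < 100; gen++) {
            Arrays.sort(pop, Comparator.comparingDouble(BrkgaSketch::decode));

            double[][] next = new double[POP][];
            // 1. Copy the elite set unchanged.
            for (int i = 0; i < ELITE; i++) next[i] = pop[i];
            // 2. Introduce brand-new mutants.
            for (int i = ELITE; i < ELITE + MUTANTS; i++) next[i] = randomChromosome();
            // 3. Fill the rest by biased crossover: elite parent x non-elite parent.
            for (int i = ELITE + MUTANTS; i < POP; i++) {
                double[] elite = pop[rnd.nextInt(ELITE)];
                double[] other = pop[ELITE + rnd.nextInt(POP - ELITE)];
                double[] child = new double[KEYS];
                for (int k = 0; k < KEYS; k++)
                    child[k] = (rnd.nextDouble() < RHO) ? elite[k] : other[k];
                next[i] = child;
            }
            pop = next;
        }
        Arrays.sort(pop, Comparator.comparingDouble(BrkgaSketch::decode));
        System.out.println("best fitness: " + decode(pop[0]));
    }
}
```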

All further details on the project can be found on my wiki.

Apr 24 2012
 

Here is a recent presentation that I had the chance to complete successfully. The topic was quite far from my comfort zone, but I still managed to do it.

The main part of the research was to find and analyse possible bottlenecks within an application of the chosen benchmark.

I chose the PARSEC benchmark suite and the fluidanimate application within it.

Steps to achieve the goal of the research: 

  • Install PARSEC and analyse the application both on my laptop (MacBook Air) and on the Boada server (12 physical cores, Intel Xeon at 2.4 GHz, 24 GB RAM) -> strange behaviour was found.
  • Extract traces from the application using the Extrae tool.
  • Analyse the extracted traces in Paraver.

More details can be found on the

Apr 19 2012
 

The last project in the Measurement Tools and Techniques course was about simulation. For this purpose two tools were used: Dimemas and Pin.

Analysis of high-performance parallel applications is very important for designing and tuning both applications and hardware. Simulation tools help to analyse how an application would work in a different environment or how hardware would perform with different configurations. In this project two simulation tools are described and used for performance analysis. The first one simulates different environments for a NAS benchmark application in order to find the optimal environment configuration on which the application would run comparably to an environment similar to the MareNostrum supercomputer's. The second one simulates three levels of cache in order to find the optimal cache configuration for a matrix multiplication application.

Dimemas

The Dimemas tool for performance analysis of message-passing programs was used. The main reason to use this tool is to be able to develop and tune parallel applications on a workstation while getting an accurate prediction of their performance on the parallel target machine. Dimemas generates trace files that are suitable for further analysis and visualization with the Paraver tool, so an even more detailed examination of the application can be done. Paraver trace analyses are presented in this work as the result of the Dimemas simulations.

The analysis was performed with Dimemas for different numbers of processors (2, 4, 8, 16, …) and with various values of latency, network bandwidth, number of buses and relative CPU performance. The main goal was to find:

  • the maximum acceptable latency at which the performance of the system does not degrade significantly compared to the "ideal" one
  • the minimum acceptable bandwidth at which the performance of the system does not degrade significantly
  • the minimum acceptable number of buses (connectivity) for which the application still performs comparably to the "ideal" application
  • the minimum relative CPU performance for which the application is still comparable to the "ideal" version of the application

Performing the analysis and the simulations requires choosing the right tool for the job. For the simulation, the initial configuration of the system was taken from the MareNostrum supercomputer, with the following parameters:

  • Number of buses: 0 (that is, infinite)
  • Network bandwidth: 250 MByte/sec
  • Startup on remote communication (latency): 0.000008 ms
  • Relative CPU speed: 1.0
  • 128 cores, 1 core per processor

To discover the best tuning of the system environment, the parameters were varied. The ranges of the parameters were the following:

  • Number of buses: (0 .. 128], exponential step size 2
  • Network Bandwidth: (250 .. 100] MBytes/sec, step size 10
  • Startup on remote communication (latency): (0.000008..0.524288] s, exponential step size 2
  • Relative CPU performance: [3.0 .. 0.2], step size 0.2

Pin

Pin is a dynamic binary instrumentation framework that enables the creation of dynamic program analysis tools. The tools created with Pin are called Pintools and are used to perform program analysis on user-space applications.

This part of the project contains the results of a multilevel cache simulation with a per-processor L1 data cache, a cluster-shared L2 data cache and a globally shared L3 data cache. To analyze the application, the following parameters were varied:

  • number of application threads (or CPUs)
  • sizes of the L1, L2 and L3 caches
  • number of processors per cluster that share the same cluster-shared (L2) cache

For the simulation, a Pintool was created that models this multilevel hierarchy (per-processor L1, cluster-shared L2 and globally shared L3 data caches).
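
The real Pintool is a C++ tool written against the Pin API; purely to illustrate the cache model behind it, here is a small Java sketch of a chain of cache levels that count their own misses (the sizes, line size and the direct-mapped policy are simplifying assumptions):

```java
// Illustrative model of one cache level used in the simulation: direct-mapped,
// counting accesses and misses, forwarding misses to the next level
// (L1 -> L2 -> L3). The actual Pintool is built on the C++ Pin API.
public class CacheLevelSketch {
    private final long[] tags;               // one tag per cache line
    private final int lineSize;              // bytes per line
    private final CacheLevelSketch next;     // next level, or null for memory
    long accesses = 0, misses = 0;

    CacheLevelSketch(int sizeBytes, int lineSize, CacheLevelSketch next) {
        this.tags = new long[sizeBytes / lineSize];
        this.lineSize = lineSize;
        this.next = next;
        java.util.Arrays.fill(tags, -1);
    }

    void access(long address) {
        accesses++;
        long block = address / lineSize;
        int index = (int) (block % tags.length);
        if (tags[index] != block) {          // miss: fetch the line, go one level down
            misses++;
            tags[index] = block;
            if (next != null) next.access(address);
        }
    }

    double missRate() { return accesses == 0 ? 0.0 : (double) misses / accesses; }

    public static void main(String[] args) {
        // Example hierarchy: 64 kB L1 -> 2 MB L2 -> 16 MB L3, 64-byte lines.
        CacheLevelSketch l3 = new CacheLevelSketch(16 * 1024 * 1024, 64, null);
        CacheLevelSketch l2 = new CacheLevelSketch(2 * 1024 * 1024, 64, l3);
        CacheLevelSketch l1 = new CacheLevelSketch(64 * 1024, 64, l2);

        // Feed it a toy access stream (sequential 4-byte reads over 1 MB).
        for (long addr = 0; addr < 1 << 20; addr += 4) l1.access(addr);
        System.out.printf("L1 miss rate: %.4f, L2: %.4f, L3: %.4f%n",
                l1.missRate(), l2.missRate(), l3.missRate());
    }
}
```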

Conclusions

The first simulation ran the NAS benchmark's Integer Sort application in an environment similar to the MareNostrum supercomputer's. The goal of the simulation was to find the minimal environment configuration values at which its execution would still be comparable to the original execution on the MareNostrum-like environment. The simulation was done with Dimemas. The results show that such an environment configuration is: 64 buses, latency up to 1 ms, bandwidth 250 MB/s; choosing a faster CPU won't make much difference.

The second simulation was the multilevel cache simulation. A matrix multiplication application was executed on different configurations of the L1, L2 and L3 caches in order to find the optimal cache configuration for this execution. For the simulation, a Pintool was created that gathers the miss rates of the caches. The gathered results show that the optimal configuration is: 64 kB for L1, 2 MB for L2, no L3 cache (but if one is used, a 16 MB L3 is the best choice), with 8 processors sharing one L2 cache.

Jan 19 2012
 

The Smith–Waterman algorithm is a well-known algorithm for local sequence alignment, i.e. for determining similar regions between two nucleotide or protein sequences. Proteins are made of amino acid sequences, and similar protein structures have similar amino acid sequences. In this project we did a parallel implementation of the Smith–Waterman algorithm using the Message Passing Interface (MPI).

To compare two amino acid sequences, we first have to align them. To find the best local alignment between two sequences, the algorithm populates a matrix H of size N × N (N is the sequence length) using a scoring criterion: it requires a scoring matrix (the score of matching two symbols) and a gap penalty. After populating H, we can obtain the optimal local alignment by tracing back through the matrix starting from its highest value.
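
As a reminder of what that recurrence looks like, here is a sequential Java sketch of the scoring phase (the match, mismatch and gap values are illustrative; the project parallelized this same computation with MPI):

```java
// Sequential sketch of the Smith-Waterman scoring phase. A simple
// match/mismatch score and a linear gap penalty stand in for a real
// scoring matrix; the parallel MPI version distributes this recurrence.
public class SmithWatermanSketch {
    static final int MATCH = 2, MISMATCH = -1, GAP = -1;

    static int[][] score(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] h = new int[n + 1][m + 1];      // first row and column stay 0
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;
                int left = h[i][j - 1] + GAP;
                // Local alignment: never let the score drop below zero.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
            }
        }
        return h;   // the optimal local alignment ends at the maximum cell of h
    }

    public static void main(String[] args) {
        int[][] h = score("ACACACTA", "AGCACACA");
        int best = 0;
        for (int[] row : h) for (int v : row) best = Math.max(best, v);
        System.out.println("best local alignment score: " + best);
    }
}
```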

Pipelined computation was used to achieve a certain degree of parallelism. Different parallelization techniques were also compared to find the optimal one for this problem. Parallelization started with different block sizes B at the column level. Furthermore, parallelization with different levels of interleaving I at the row level was introduced.

For performance measurement, a performance model of both implementations was created for two interconnection networks: a linear array and a 2D mesh.

All the details can be found here.