Jul 31, 2013

I was thinking about the process of writing the master thesis report. You know, it is not the funniest part of the project, but it is still mandatory.

1. Start with the questions you would like to answer with your thesis. Whether it is a well-known problem or something new that you just came up with, try to be critical and find the questions that you will actually be able to answer. Obviously, don’t ask questions that are impossible to answer, e.g. is there life in another galaxy?…

2. Write a preliminary conclusion and check whether your questions are answered in it.

3. Use the questions defined above as a starting point for the problem definition. After this it should not be a problem to define and properly describe the set of limitations.

4. Use your “stuck” time properly. As writing the final report is the greatest pain in the rear part of your body, try to fill in the background and related work sections as soon as you read new papers.

5. Rewrite and rethink your abstract constantly throughout the whole project. Yes, write it before having anything done. This will help you stay focused and get straight to the point during the process.

6. Plan your experimental part after the related work, background and a “kinda ok” system description are ready. Proper selection of experiments is the key to success. Look at the evaluation parts of related papers. Be consistent with the comparison data for your project.

7. Talk to your supervisor… I wish I had done it more often.

8. Use Mendeley or Zotero. It helps so much to keep all your references and notes on the papers in one place, even categorised into directories. Unlike Zotero, Mendeley also lets you add all the PDFs you have on disk, so you will always have access to the full text, which can be annotated and stored in Mendeley. So my personal choice is Mendeley.

9. The easiest way to have something innovative done is either to reimplement something that already exists and improve it, or to find missing functionality in something and implement it. No matter what you are doing, consider including in your introduction how you identified the problem and how/why you decided to work on it.

10. Do all the preliminary presentations properly. It does not matter what you are selling; what is important is how you do it!

Honestly, I am not sure I followed all this advice myself, but what I definitely did, and what helped me a lot, was to procrastinate by writing the report 🙂 Wishing you similar procrastination and good luck 😉

Jan 25, 2013

Most probably many of you have asked yourselves: To build from scratch or Not To build from scratch?

  1. – To build!
  2. – Not to build!
  3. – To build!
  4. – Not to build!

Hmm…  Hard to decide? Take a look at the diagram and decide!

To build from scratch or not to build from scratch

Is there any “Re Use” in your answer? Then you are one of those lucky bastards who can relax for a while and enjoy a cocktail 🙂

Re-use in all its beauty

Distributed platforms and frameworks aka Map-Reduce:

  • Apache Hadoop is open-source software for reliable, scalable, distributed computing. It allows simple programming models to be used for the distributed processing of large data sets across clusters of computers. The main components are HDFS (a distributed file system), Hadoop YARN (a framework for job scheduling and resource management) and Hadoop MapReduce (a system for parallel processing of large data sets).
  • YARN is the second generation of MapReduce. The main idea behind its architecture is to split the JobTracker’s resource management and job scheduling/monitoring into separate daemons, so that a global ResourceManager and a per-application ApplicationMaster are represented in the system.
  • Disco Project is a framework based on the MapReduce paradigm. It is an open-source Nokia Research Center project whose purpose is to handle massive amounts of data. According to the documentation, it distributes and replicates your data and schedules your jobs efficiently. Moreover, indexing of large numbers of records is supported, so that real-time querying is possible.

    • The master adds jobs to the job queue and runs them when nodes are available.
    • Clients submit jobs to the master.
    • Slaves are started on the nodes by the master, and they spawn and monitor processes.
    • Workers do the jobs. The output location is reported to the master.
    • Data locality mechanisms are applied once results are saved to the cluster.

  • Spring Batch is an open-source framework for batch processing. It supports the development of robust batch applications and is built on the Spring Framework. It supports logging, transaction management, statistics, restart, skip and resource management. Some optimization and partitioning techniques can be tuned in the system as well.

    • The infrastructure level is a low-level tool. It provides the ability to batch operations together and retry if an error occurs.
    • The execution environment provides robust features for tracing and management of the batch lifecycle.
    • The core module is the batch-focused domain and implementation. It includes statistics, job launch and restart.

  • Gearman is a framework for delegating tasks to other machines and processes that are better suited for them. Parallel execution of work is supported, as well as load balancing and multi-language function calls.

    • A client, a worker and a job server are the parts of the system.
    • The client creates jobs and sends them to the job server.
    • The job server forwards tasks to suitable workers.
    • The worker does the work and responds to the client through the job server.
    • Communication with the server is established through TCP sockets.

Distributed platforms and frameworks aka Directed Acyclic Graph aka Stream Processing Systems:

  • Spark is a system optimized for data analytics to make it fast both to run and to write. It is suited for in-memory data processing. The API is in Java and Scala. Purpose: machine learning and data mining, though general-purpose use is also possible. It runs on Apache Mesos to share resources with Hadoop and other apps.
  • Dryad is a general-purpose runtime for executing data-parallel apps. An application is modeled as a directed acyclic graph (DAG) that defines the dataflow, with vertices representing the operations to be performed on the data. However, creating the graph is quite a tricky part. That is why some high-level language compilers were created; one of them is DryadLINQ.
  • S4 and Storm are both for real-time, reliable processing of unbounded streams of data. The main difference between them is that Storm guarantees messages will be processed even while failures occur, while S4 supports state recovery. More in the previous post.

Distributed Data Storage (DSS)

  • Key-Value
  • Column Based
  • SQL
  • NewSQL
  • bla bla bla…
  • Latency
  • Consistency
  • Availability
  • bla bla bla…

The problem of choosing the most suitable distributed storage system is quite tricky and requires some reading in the field. Some information on storage systems, with a deep review of each, from my previous project on Decentralized Storage Systems, can be found on my wiki. A brief review of some hot systems is also presented in my previous post.

Actor Model Frameworks:

The actor model is a quite old computational model for concurrent computation. It consists of concurrent entities called actors that react to received messages: they make local decisions, spawn other actors, send messages and designate the behavior for the next message to be received. However, this model has some issues that should be taken into account if the decision to use it is made: (a) scalability, (b) transparency, (c) inconsistency.

Actor programming languages include Erlang, Scala and others. This is one extra motivation to get to know these languages better.

The most popular Actor Libraries:

  • Akka
    • Language: Java and Scala
    • Purpose: Build highly concurrent, distributed, fault-tolerant, event-driven applications on the JVM.
    • Actors: Very lightweight concurrent entities. They process messages asynchronously using an event-driven receive loop.
  • Pykka
    • Language: Python
    • Purpose: Build highly concurrent, distributed, fault-tolerant, event-driven applications.
    • Actors: An actor is an execution unit that executes concurrently with other actors. Actors don’t share state with each other; communication is maintained by sending/receiving messages. On receiving a message, an actor can perform some actions. Only one message is processed at a time.
  • Theron
    • Language: C++
    • Purpose: Build highly concurrent, distributed, event-driven applications.
    • Actors: They are specialized objects that execute in parallel natively. Each of them has a unique address. Communication is done by messaging. Each actor’s behavior is defined in message handlers, which are user-defined private member functions.
  • S4 – more in the Stream Processing Systems section.

Scheduling and Load Balancing:

  • Luigi Scheduler
    • It is a Python module that helps to build complex pipelines of batch jobs, and it has built-in support for Hadoop. It is a scheduler that was open-sourced by Spotify and is used within the company.
    • It is still quite immature, and anyone can try their luck contributing to this scheduler, which is written in Python.
  • Apache Oozie
    • It is a workflow scheduler system to manage Apache Hadoop jobs.
    • It is a directed acyclical graph of actions.
    • It is a scalable, reliable and extensible system.
  • Azkaban
    • It is a batch job scheduler.
    • It helps to control dependencies and scheduling of individual pieces to run.
  • Helix
    • It is a generic cluster management framework for automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.
    • Features: (a) automatic assignments of resources to nodes, (b) node failure detection and recovery, (c) dynamic addition of resources and nodes to a cluster, (d) pluggable distributed state machine, (e) automatic load balancing and throttling of transactions
  • Norbert
    • Provides easy cluster management and workload distribution.
    • It is implemented in Scala and wraps ZooKeeper, Netty and Protocol Buffers to make it easier to build applications.
    • Purpose: (a) Provide group management: change configuration, add/remove nodes. (b) Partition workload using software load balancing. (c) Provide asynchronous client/server RPC and notifications.


Consensus and stuff:

Log Collection:

  • Apache Kafka
A distributed pub/sub messaging system that supports:

    • Persistent messaging with constant time performance
    • High-throughput
    • Explicit support for partitioning messages over the servers and distributing consumption over consumer machines
    • Support for parallel data load into Hadoop

    It is a viable solution for providing logged data to offline analysis systems like Hadoop, but it might be quite limited for building real-time processing. The system is quite similar to Scribe and Apache Flume, as they all do activity stream processing, even though the architectures are different.

  • Logstash

Message Passing Frameworks:

(De)Serialization for sending objects over the network safely:

  • Avro
    • It is a data serialization system.
    • Provides: (a) rich data structures, (b) a compact, fast, binary data format, (c) a container file to store persistent data, (d) integration with dynamic languages. Schemas are defined with JSON.
    • Used by Spotify
  • Protocol Buffers
    • Developed and Used by Google.
    • It encodes structured data in an efficient extensible format.
  • Apache Thrift
    • Purpose: scalable cross-language services deployment.

Chasing latency:



Jan 25, 2013

I was trying, not for the first time, to get a very deep understanding of socket programming in C++, and I finally got it. That is why I am really eager to share the main things to know. I also wanna show that I was actually doing something these days :)


Below I listed the Elephants of socket programming (abstractly :)). And all of them are very logical.

  • socket(family, socktype, protocol)
    • Necessary to create a socket descriptor that will later be fed to other system calls.
    • Creation: Feed in the protocol family (IPv4, IPv6, any), socket type (stream, datagram) and protocol (0 or any)
    • Returns: Socket Descriptor
  • bind(sock_descriptor, addr, addrlen)
    • Necessary to prepare for listen mode; binds the socket to a local port (or a random free port)
    • Creation: Feed the socket descriptor, address information, address length
    • Returns: -1 if error occurs
  • listen(sock_descriptor, backlog)
    • After binding to the port, listen mode can be turned on.
    • Creation: Feed the socket descriptor from socket() and the number of connections allowed in the queue (backlog)
    • Returns: -1 if error occurs
  • accept(sock_descriptor, addr, addrlen)
    • After setting listen mode, accept can be called to take an incoming connection.
    • Creation: Feed the socket descriptor, the address structure where information about the incoming connection will be stored, and the address length.
    • Returns: -1 if error occurs
  • connect(sock_descriptor, addr, addrlen)
    • Necessary to connect to the remote host.
    • Creation: Feed the socket descriptor, address information of remote host, address length.
    • Returns: -1 if error occurs
  • sockaddr_in/sockaddr addr: Consists of fields that are required to be filled: sin_family, sin_port, sin_addr, sin_zero.
    • sin_family: protocol family (IPv4, IPv6, any)
    • sin_port: in network byte order: htons(atoi(port))
    • sin_addr: in binary format: inet_aton(“awesome_ip”, &addr.sin_addr)

Note: All of the above is written in a kind of abstract way for better understanding of the concept. More details can be found here.

Another cool thing to know:

  • getsockname (sock_descriptor, addr, addrlen)

    • Retrieves the locally bound name of a specific socket and stores it in addr.
  • getpeername (sock_descriptor, addr, addrlen)

    • Retrieves the name of the connected peer socket and stores it in addr.

Using these elephants will allow you to establish connections between hosts, make them act as a server or a client, and do whatever you like. However, some issues can appear when you are sending information through the network: how do you ensure that the host on the other end of the wire gets exactly the data you sent, when you are sending binary rather than text? For this, serialization was created.


There are three options to do it:

  • sprintf() to make text. Send it. strtol() to get the data back from the text. SLOW!
  • Send raw data: pass a pointer to the data to send(). DANGEROUS! NON-PORTABLE!
  • Encode into a portable binary form and decode it on the other end. RULES! More here in Chapter 7.4.

Casting in C++

A very cool thing exists in C++ for type conversion between classes and controlling inappropriate conversion.

  • dynamic_cast <new_type> (expression)
    • is used only to convert pointers and references to objects within a class hierarchy.
    • Purpose: Ensure the conversion yields a valid, complete object.
    • Success: Conversion from a child to a parent class, OR conversion from a parent pointer that actually points to a child, down to that child.
  • static_cast <new_type> (expression)
    • is used to convert between pointers to related classes (including from parent classes down to derived ones), for any non-pointer conversion that could also be done implicitly, and for conversions between classes with an explicit constructor.
    • Purpose: Makes classes compatible; however, no runtime safety check is made.
  • reinterpret_cast <new_type> (expression)
    • is used to convert from any pointer type to any other pointer type. It is actually just a binary copy of the value from one pointer to another.
    • Purpose: Reinterprets the bit pattern; no compatibility or safety checks are made at all.
  • const_cast <new_type> (expression)
    • Purpose: manipulates the constness of objects, either adding or removing const. Example: pass a const argument to a function as a non-const parameter.
  • typeid
    • Purpose: get info about a type. Returns: a reference to a constant object of type type_info.


May 18, 2012

To be a good programmer is difficult and noble. The hardest part of making real a collective vision of a software project is dealing with one’s coworkers and customers. Writing computer programs is important and takes great intelligence and skill. But it is really child’s play compared to everything else that a good programmer must do to make a software system that succeeds for both the customer and myriad colleagues for whom she is partially responsible.