Mar 01, 2013
 

GENIUM Data Store (GDS) is a system I am working on for my Master's thesis at NASDAQ.

GENIUM INET messaging bus

To achieve its performance goals, GDS leverages the high performance of the GENIUM INET messaging bus. The GENIUM INET messaging bus is based on UDP multicasting and is made reliable by a totally ordered sequence of backed-up messages with a gap-fill mechanism using rewinders. As is well known, total order broadcast algorithms are fundamental building blocks for constructing fault-tolerant systems. The purpose of such algorithms is to provide a communication primitive that allows processes to agree on the set of messages to deliver and the order in which to deliver them. The NASDAQ OMX implementation of this abstraction assumes a perfect failure detector, i.e. it forces a process to fail once it has been considered faulty. Moreover, uniform reliable total order is preserved: a process is not allowed to deliver any message out of order, even if it is faulty.

The receivers of the ordered messages should guarantee exactly-once delivery of each message to the applications; in this way uniform integrity is guaranteed. Across the cluster of applications/clients and servers connected to the message stream, messages should be delivered in the same order.

The message stream can be configured without restriction, so it can contain any number of clients placed on the same or different servers, with a defined replication level. The server/sequencer is responsible for creating the sequenced/ordered stream as an output from the received clients' messages.

Failure resilience is provided by the following:

  •  All message callbacks are fully deterministic and replayable (if single-threaded), as the incoming stream is identical each time it is received.
  • Replication can be achieved by installing the same receivers on multiple servers.
  • As long as the new primary has rewound to the same point as the failed one, the message stream is sufficient to synchronize state.

GDS takes advantage of the infrastructure described above and thus gets a fault-tolerant real-time system essentially from the start. Moreover, the distributed state is kept consistent by adhering to the sequence numbering implied by the message stream.
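To make the idea concrete, here is a minimal sketch of how a deterministic receiver might rebuild its state purely from the sequenced stream. All names (Replica, on_message, the rewinder callback) are invented for illustration and are not the actual GENIUM INET API:

    class Replica:
        """Deterministic receiver: state is derived only from the sequenced stream."""

        def __init__(self, rewinder):
            self.rewinder = rewinder   # callable used to re-request a missed range (gap-fill)
            self.next_seq = 1          # next sequence number we expect
            self.state = {}            # application state, identical on every replica

        def on_message(self, seq, payload):
            if seq < self.next_seq:
                return                              # duplicate: already applied (exactly-once)
            if seq > self.next_seq:
                self.rewinder(self.next_seq, seq)   # gap: ask a rewinder for [next_seq, seq)
                return
            self.apply(payload)                     # deterministic, single-threaded transition
            self.next_seq += 1

        def apply(self, payload):
            key, value = payload                    # e.g. an insert carried on the bus
            self.state[key] = value

Because apply is deterministic and the incoming stream is identical everywhere, installing the same receiver on several servers is enough to obtain replication.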

MoldUDP

In GENIUM Data Store, transactions are submitted through MoldUDP, which ensures the lowest possible transaction latency.

MoldUDP is a networking protocol that makes transmission of data messages efficient and scalable in a scenario where one transmitter and many listeners are present. It is a lightweight protocol built on top of UDP in which missed packets can easily be detected, although retransmission is not supported by UDP itself.

Some optimizations can be applied to make this protocol more efficient: (a) multiple messages are aggregated into a single packet, to reduce network traffic, and (b) a caching re-request server is placed near remote receivers, to reduce retransmission latency and bandwidth.

MoldUDP presumes that the system consists of listeners, which subscribe to multicast groups, and a server, which transmits on those groups. The MoldUDP server sends the normal data stream to listeners as downstream packets over UDP multicast. It also sends periodic heartbeats, so listeners can detect packet loss when it occurs. In addition, listeners should be configured with the IP address and port to which they can submit re-requests.

Note: a message in this context is an atomic piece of information, from 0 to 64 KB in size, carried by the MoldUDP protocol.
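As an illustration only, here is a simplified sketch of what a MoldUDP-style listener's gap detection could look like. The header layout (session, sequence number, message count, then length-prefixed messages) follows my reading of the public MoldUDP64 specification, but the exact field sizes and all function names should be treated as assumptions, not the real protocol:

    import struct

    HEADER = struct.Struct("!10sQH")   # assumed layout: session, first sequence number, message count

    def deliver(message):
        print("delivered:", message)                # stand-in for the application callback

    def request_retransmission(start_seq, count):
        print("re-request:", start_seq, count)      # stand-in for a unicast to the re-request server

    def handle_packet(packet, expected_seq):
        session, seq, count = HEADER.unpack_from(packet, 0)
        if count == 0:
            # Heartbeat: carries the current sequence number so listeners can notice loss.
            if seq > expected_seq:
                request_retransmission(expected_seq, seq - expected_seq)
            return expected_seq
        if seq > expected_seq:
            # Gap detected: ask the re-request server for the missed range and wait
            # for the retransmission instead of delivering out of order.
            request_retransmission(expected_seq, seq - expected_seq)
            return expected_seq
        offset = HEADER.size
        for i in range(count):                      # several messages aggregated into one packet
            (length,) = struct.unpack_from("!H", packet, offset)
            offset += 2
            if seq + i >= expected_seq:             # skip messages we have already delivered
                deliver(packet[offset:offset + length])
            offset += length
        return max(expected_seq, seq + count)       # next sequence number we expect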

SoupTCP

GENIUM Data Store will also support read queries. For this, a TCP-based protocol is intended to be used to stream data back to the client in response to a submitted query.

SoupTCP is a lightweight point-to-point protocol built on top of TCP/IP sockets. It delivers a set of sequenced messages from a server to a client and guarantees that the client receives all messages sent by the server strictly in sequence, even when failures occur.

Server functionality with SoupTCP includes (a) client authentication on login and (b) delivery of a logical stream of sequenced messages to a client in real time. Messages that clients send to the server are not guaranteed to be delivered in case of failure, so a client may need to resubmit its request. A rough client-side sketch is given after the protocol steps below.

Protocol flow:

  • The client opens a TCP/IP socket to the server and sends a login request.
  • If the login information is valid, the server responds with an accept and starts to send sequenced data.
  • Both client and server track the message number locally by simply counting messages; the first message in a session is always number 1.
  • Link failures are detected by heartbeating; both server and client send heartbeat messages. The former lets a client learn of a server failure and reconnect to another socket; the latter lets the server close the socket of a failed client and listen for a new connection.
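As promised above, here is a rough sketch of a SoupTCP-style client session. The packet-type letters and framing are invented for illustration; only the overall flow (login, accept, sequenced data, heartbeats in both directions, local message counting) comes from the description above:

    import socket
    import struct

    def recv_exact(sock, n):
        """Read exactly n bytes from the socket (TCP may return partial reads)."""
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed mid-message")
            buf += chunk
        return buf

    def handle(seq, payload):
        print(seq, payload)                      # stand-in for the application callback

    def run_session(host, port, username, password):
        sock = socket.create_connection((host, port), timeout=5)
        # Login request; the server validates it before sending any data.
        sock.sendall(b"L" + username.encode().ljust(6) + password.encode().ljust(10))
        if recv_exact(sock, 1) != b"A":          # anything but an accept means the login was rejected
            sock.close()
            return
        next_seq = 1                             # both sides simply count messages, starting at 1
        while True:                              # reconnect/rewind logic omitted in this sketch
            kind = recv_exact(sock, 1)
            if kind == b"H":                     # server heartbeat: the link is alive
                sock.sendall(b"h")               # client heartbeat, so the server can detect our failure
            elif kind == b"S":                   # one sequenced data message
                (length,) = struct.unpack("!H", recv_exact(sock, 2))
                handle(next_seq, recv_exact(sock, length))
                next_seq += 1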
Jan 28, 2013
 

The system should provide functionality to insert, modify, and read data.

SLA:

  • One instance of the data store should handle 250 000 messages per second, where each message is no more than 200 bytes.
  • A query for a record by unique index should return in no more than 10 milliseconds on a 10-billion-record database.
  • The capacity of a given response stream should be 100 Mbps.
  • The response streaming capacity of a datastore server should be 300 Mbps (divided across several clients), while receiving 100 000 messages per second.
  • A transaction should be acknowledged on the message bus within 100 microseconds.
  • Failover within 30 seconds on the same site.
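To put the first two numbers in perspective (my own back-of-envelope arithmetic, not part of the SLA): 250 000 messages per second at up to 200 bytes each is about 50 MB/s, or roughly 400 Mbit/s of incoming payload per instance, and a 10-billion-record database at that record size is on the order of 2 TB that the unique-index lookup has to cover within 10 milliseconds.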

It is important to note that consistency and fault tolerance are key concerns when building distributed systems; however, leveraging the internal NASDAQ OMX messaging system INET provides built-in support for this functionality. In further posts (at least I really hope 🙂 lol) I will explore how the load will be handled and which techniques will be used to provide high availability and recovery from specific faults.

Jan 23, 2013
 

NASDAQ OMX (N) is the world's largest stock exchange technology provider. Around ten percent of all the world's securities transactions happen here. More importantly, N's multi-asset trading and clearing system is the world's fastest and most scalable, which enables the trading of any asset in any place in the world. The company's core system for trading and clearing is called Genium INET; it is very complex, highly optimized, and customized for different customers. N provides technology to 70 stock exchanges in 50 countries and serves multiple asset classes, including equities, derivatives, debt, commodities, structured products, and ETFs.

  • N unites 3,900 companies in 39 countries, with $4.5 trillion in total market value.
  • N is the world’s largest IT industry exchange.
  • N is the world’s largest biotech industry exchange.
  • N is the world’s largest provider of technology for the exchange industry.
  • Over 700 companies use more than 1500 different corporate products and services.
  • N data is accessed by more than two million users in 83 countries.
  • The average transaction latency on N in 2010 was less than 100 microseconds.
  • The N implementation of INET is able to handle more than one million transactions per second.

The main issues faced are maintaining a high level of reliability, low latency, and security in the system. This niche is unique and normally not covered by conventional IT vendors and consultancies. Existing N systems include the following: trading systems, clearing systems, market data systems, surveillance systems, and advisory systems.

Obviously, all the data in the system needs to be stored reliably, with extremely fast responses. Moreover, some requests and queries can be performed over the data, in addition to processing the incoming data on the fly and storing the results.

A variety of challenges arise in designing and implementing such a system, among them load balancing of incoming store requests, fault tolerance to maintain 24/7 availability, and the ability to hot-update without losing availability.

Jan 15, 2013
 

My third day at my thesis host company – NASDAQ OMX, Stockholm.

Big eyes.

Happy face.

Hands ready to code.

This is the spirit!

Feel like I will have a crazy 5 months of work with an amazing experience and hands/head ready to handle any problem.

Finally, going down to the thing I’m working on and the thesis topic:

GENIUM DataStore: Distributed Data Store and Data Processing System.

The system must meet very strict requirements of extremely low latency, high availability, and scalability. When I talk about the low latency requirement, I mean a latency of 100 microseconds per operation and 250,000 transactions per second, which for now sounds quite insane :). Two main paradigms will be supported: storing and processing the data. Storing will be a kind of integration part, through whose API information can be stored in the DBs already used internally, both relational and key-value. The processing part is similar to a stream processing system and will be integrated with the storage part.

Building a system with high requirements on consistency (C) and fault tolerance (FT) is supposed to be a pain in the rear part of the body, but using the internally developed libraries for C and FT support will make my life much easier. However, the performance goals can be a real challenge.

The main goal is not to make a system that has a variety of features and is an "almost stable" version, but one that has fewer features and is a stable release. Another side of the work here is that whatever you are building should be well thought through, given that the data is critical and mistakes carry great responsibility. Hooorayyy…

Thesis structure. Preliminary thoughts.

I was also already thinking about the structure of the final document, and it is actually quite hard to come up with one at this point. I am still sure that my vision will change; however, here it is:

  • Abstract (Cap. Obvious :D)
  • Introduction

Motivation
Contribution
Results
Structure of the Document

  • Background and Related Work

NASDAQ INET (just a few words to give at least a rough picture of the base system for GENIUM DS)
MoldUDP
SoupTCP
Distributed Data Storage Systems
Distributed Data Stream Processing Systems

  • GENIUM DataStore

NASDAQ OMX Domain (What is NASDAQ, what data is used, volume of the data to be processed)
Requirements
Architecture (and Reasoning for such architecture)
Fault Resilience
Availability
Have no idea what more… but for sure should be something

  • Implementation Details (I'm sure that this part will be necessary, as coding will be my main occupation these days :))

Tools
API
Failure Scenarios

  • Experimental Results

Prototype
Set Up
Scalability and Performance

  • Discussion

Main findings
Scalability
Low Latency
Availability
Positioning??? (can’t find any more suitable word… will think on it… one day:))
Comparison to other existing systems (if possible)
Future Work

  • Conclusions
  • References

Let's see what it is going to be in 1 month :)

Finally, I hope not to become the turtle below, and to behave :)

Julia is not a turtle 🙂