Key-Value Storage using MemcacheDB

What is the Entity-Attribute-Value model (aka key-value storage)

Key-value storage is also known as the Entity-Attribute-Value (EAV) model. It is used in circumstances where the number of attributes (properties) that could describe an entity is very large, but the number of attributes actually used for any given entity is modest.

Let's think, in database terms, about what an Entity-Attribute-Value model would look like for storing a user profile.

id  user_id  key          value
1   101      screen_name  john
2   101      first_name   John
3   101      last_name    Smith

The table has one row for each Attribute-Value pair. In practice, we prefer to separate values based on data type, to let the database perform type validation and to support proper indexing. So programmers tend to create separate EAV tables for strings, real and integer numbers, dates, long text and BLOBs.

The benefits of such a structure are:

  1. Flexibility: there is no limit on the attributes that can describe an entity, and no schema redesign is needed when new attributes appear.
  2. The storage is efficient for sparse data.
  3. The data is easy to export into an XML format for interchange.

There are also some important drawbacks:

  1. No real use of database data types
  2. Awkward use of database constraints
  3. There are several problems in querying such a structure (see the sketch below).
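
To make the last drawback concrete, here is a minimal JDBC sketch against a hypothetical user_profile table shaped like the example above; the table and column names are made up for illustration only:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    public class EavProfileDao {

        // Loads all attribute/value pairs for one user from a hypothetical
        // user_profile EAV table (columns: id, user_id, `key`, value).
        public Map<String, String> loadProfile(Connection conn, long userId) throws SQLException {
            String sql = "SELECT `key`, value FROM user_profile WHERE user_id = ?";
            Map<String, String> profile = new HashMap<String, String>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        profile.put(rs.getString("key"), rs.getString("value"));
                    }
                }
            }
            return profile;
        }

        // The awkward part: a relational question such as "users whose first_name
        // is John AND last_name is Smith" needs one self-join per attribute
        // instead of a simple WHERE clause on two columns.
        static final String FIND_JOHN_SMITH =
            "SELECT fn.user_id "
            + "FROM user_profile fn JOIN user_profile ln ON fn.user_id = ln.user_id "
            + "WHERE fn.`key` = 'first_name' AND fn.value = 'John' "
            + "AND ln.`key` = 'last_name' AND ln.value = 'Smith'";
    }

Loading one entity is cheap, but anything that filters on several attributes quickly turns into a pile of self-joins.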

What is MemcacheDB

MemcacheDB is a distributed key-value storage system designed for persistence. It is very fast and reliable, it supports transactions and replication, and it uses Berkeley DB as its persistent storage backend.

Why is it better than a database?

  1. It is faster: there is no SQL engine on top of MemcacheDB.
  2. It is designed for concurrency, for millions of requests.
  3. It is optimized for small pieces of data.

MemcacheDB is suitable for messaging, metadata storage, identity management (accounts, profiles, preferences, etc.), indexes, counters, flags and so on.
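
As a quick illustration of the profile use case, here is a small Java sketch using the spymemcached client (any client that speaks the memcached protocol will do); the host, the port 21201 and the key naming scheme are assumptions about your deployment, not something dictated by MemcacheDB:

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class ProfileStoreExample {
        public static void main(String[] args) throws Exception {
            // MemcacheDB speaks the memcached protocol, so any memcached client
            // works; 21201 is the port MemcacheDB typically listens on, adjust
            // host and port to match your own deployment.
            MemcachedClient client =
                    new MemcachedClient(new InetSocketAddress("localhost", 21201));

            // Store a profile attribute under a composite key; unlike plain
            // memcached, the server persists the value in Berkeley DB.
            client.set("profile:101:screen_name", 0, "john").get();

            // Read it back.
            String screenName = (String) client.get("profile:101:screen_name");
            System.out.println(screenName);

            client.shutdown();
        }
    }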

The main features of MemcacheDB are:

Storage, replication and recovery

Berkeley DB stores data quickly and easily, without the overhead found in other databases.

MemcacheDB supports replication using master and slave nodes. The exact deployment design must be chosen according to your application's needs. A MemcacheDB (Berkeley DB) environment essentially consists of three things: the database files on disk, the transaction log files and an in-memory cache.

One problem spot is the log files: they record your transactions, and over time they accumulate a lot of data, which makes recovery painful. For this, MemcacheDB relies on checkpoints. A checkpoint empties the in-memory cache, writes a checkpoint record, flushes the logs and writes out a list of the open database files.

Berkeley DB also allows hot backups, and the backup files can be compressed with tar and gzip.

Monitoring

MemcacheDB has a lot of built-in commands for monitoring.

What I liked most about memcached is that you can use telnet to connect to the running process and issue commands from the command prompt. The same is true for MemcacheDB.

Besides the memcached built-in functions, the Berkeley DB engine comes with its own stats tool:

db_stat:
  -c  locking statistics
  -l  logging statistics
  -m  cache statistics
  -r  replication statistics
  -t  transaction statistics
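
The memcached-style stats can also be pulled programmatically instead of over telnet. A small sketch with the spymemcached client used earlier (host and port are again assumptions about your setup):

    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.util.Map;
    import net.spy.memcached.MemcachedClient;

    public class StatsDump {
        public static void main(String[] args) throws Exception {
            MemcachedClient client =
                    new MemcachedClient(new InetSocketAddress("localhost", 21201));

            // Issues the memcached-style "stats" command against every connected
            // node and prints the name/value pairs that come back.
            Map<SocketAddress, Map<String, String>> stats = client.getStats();
            for (Map.Entry<SocketAddress, Map<String, String>> node : stats.entrySet()) {
                for (Map.Entry<String, String> stat : node.getValue().entrySet()) {
                    System.out.println(node.getKey() + " " + stat.getKey() + " = " + stat.getValue());
                }
            }

            client.shutdown();
        }
    }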

Overall I liked what I saw of this alternative, and I think it is a very suitable solution for storing user profiles and other user data that does not need to be queried. When you need to scale, it is for sure a very reliable solution. Have fun!

Further reading

Homepage: http://memcachedb.org

Mailing list: http://groups.google.com/group/memcachedb

Facebook temporarily lost data.

Last Sunday Facebook reported a data loss. We are talking about approximately 15% of users' photos. Losing your clients' data is the worst thing that could happen to you, and it reminded me of something someone once said in a tech talk: "The main rules in running an online community service are: never lose data and never go to jail."

Facebook has not yet made the details of what happened public, and has only assured users that their photos would be restored from a backup. The official report states that it was a hardware failure at the storage level.

First off, some quick math: the reported 15% of users' photos works out to approximately 1.5 billion lost pictures, which implies Facebook was hosting on the order of 10 billion photos at the time. Wow!

How does Facebook handle users' images? Last year Jason Sobel, Manager of the Facebook Infrastructure Group, presented some insights about the current Facebook storage solution and the future one. We don't know right now whether the new storage solution failed or the old one is to blame.

Writing files using the old way

They were using upload servers and storing images via NFS onto NetApp storage (last year they were planning to replace it). Each image is stored four times. This solution suffered a heavy workload when processing metadata.

Reading files using the old way

Here everything comes down to speed.

The main concerns with the above architecture are that the NetApp storage is overwhelmed and that they rely too much on CDNs.

Obviously, when your app grows like hell, you start to think it is better to build your own toys, fully customized and optimized for your particular problem. So did Amazon back in 2001, and Google too. This is how the Facebook storage solution was born: Haystack.

Haystack

The answer was to develop, in house, a distributed file system along the lines of GFS (the Google File System). Haystack should run on inexpensive commodity hardware and deliver high aggregate performance to a large number of clients.

Haystack is file based and stores arbitrary data in large files. For each 1 GB data file on disk they keep about 1 MB of in-memory index. This way a read costs one disk seek, which is much better than the NetApp solution, which needed three.

The Haystack format is rather simple and efficient: version number, magic number (supplied by the client to prevent brute-force attacks), length, data, checksum. The index simply stores the version, photo key, photo size, start offset and length.

Using a Haystack server

To write, it uses POST:
    /write/[pvid]_[key]_[magic]_[size].jpg
    - writes the data to the on-disk haystack file
    - adds an entry to the in-memory index

To read, it uses GET:
    /[pvid]_[key]_[magic]_[size].jpg
    - uses the in-memory index to retrieve the offset
    - reads the data from the on-disk file
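
To make the read path concrete, here is a rough Java sketch of the idea described above; it is not Facebook's code, and the class and field names are invented purely for illustration:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class HaystackReadSketch {

        // One in-memory entry per photo, mirroring the index fields listed above.
        static final class IndexEntry {
            final long start;  // offset of the needle inside the haystack file
            final int length;  // number of bytes to read
            IndexEntry(long start, int length) { this.start = start; this.length = length; }
        }

        private final Map<Long, IndexEntry> index = new HashMap<Long, IndexEntry>();
        private final RandomAccessFile haystack;

        HaystackReadSketch(RandomAccessFile haystack) { this.haystack = haystack; }

        // Reading a photo: look up the offset in memory, then do a single
        // seek + read against the big on-disk file.
        byte[] read(long photoKey) throws IOException {
            IndexEntry entry = index.get(photoKey);
            if (entry == null) return null;  // unknown photo
            byte[] data = new byte[entry.length];
            haystack.seek(entry.start);      // the one disk seek
            haystack.readFully(data);
            return data;
        }
    }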

This simple approach allows Facebook to easily balance reads and writes across Haystack clusters, but to speed up reads they still plan to use CDNs in areas where they don't have datacenters, and Cachr for profiles. This is their first step toward creating their own CDN network.

Additional readings

Needle in a haystack: efficient storage of billions of photos

Engineering[at]Facebook’s Notes

Simple way to scale your Web App – Part 1

Every time I built a J2EE application, my main concern was how many requests it would ultimately be able to handle. Early in my career I was in the situation where my app became a victim of its own success: too much load for a single server. We always had a reactive attitude and tried to deal with the problem when it happened, but sometimes it was too damn late. For an application to be able to scale, it must be designed to scale.

But what would be a simple scalable architecture?

Simple Web Application Architecture

Let's consider the diagram on the left. Our application will be split into three logical clusters: the application cluster, the database cluster and, finally, the logging and statistics cluster.

The Load Balancer

A load balancer distributes the traffic among your application servers. It can be software or a hardware device. A handy solution is to use a software balancer, such as Apache, but a software solution is not as robust and performant as a hardware balancer.

Hardware

A dedicated load-balancing device is more suitable and gives you more performance. However, it comes with additional costs; there are many devices on the market, and you should choose the best one based on cost and features. When evaluating a load balancer, some things must be kept in mind.

Software alternatives

To balance SSL connections, your balancer should provide SSL termination capability. Otherwise, since SSL is a connection-level protocol, the SSL connection must persist between client and server, which means allocating the same host to the same client.

More about Load Balancing in a future post.

The Application Cluster

To be able to scale a web application, it has to be designed to scale. The main issue is session handling. Depending on the load balancing algorithm, you will need either to replicate the session or to use sticky sessions.

Stateless

This mode doesn't require much from the load balancer. It is very common for REST applications, and for a web 2.0 application it could be the default approach. The application stores everything it needs on the client side. The first user request goes to one machine, while the second may hit a different machine. No data has to be shared between web servers. To handle more requests, new servers can be added to the web pool and the system will scale out.

A stateless design gives us seamless failover, and it can be achieved no matter what language we use to develop our application.

Sticky session

Sometimes our application needs to store user-specific data at the session level. This means that every time a user hits a server we need some local data and state in order to process the request, and this data must be available on whichever server the request lands on. To cope with this problem we can use "sticky sessions", which means ensuring that the user always hits the same server as the initial request.

The most common approach to sticky sessions is to have the load balancer route requests based on a session cookie (in a Java web application, typically the JSESSIONID) or on the client's IP address.

Session replication

This technique is very common in J2EE, where the servlet containers provide a way to replicate the session between servers. There are several conditions a session must meet to be replicable, but it can be a viable alternative to sticky sessions (see the sketch below).
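
One of those conditions is that whatever you put into the session must be Serializable (and the web application must be marked as distributable in web.xml). A minimal sketch, assuming a standard servlet container such as Tomcat; the class names are made up for illustration:

    import java.io.IOException;
    import java.io.Serializable;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Anything stored in a replicated session must be Serializable so the
    // container can ship it to the other nodes in the cluster.
    class ShoppingCart implements Serializable {
        private static final long serialVersionUID = 1L;
        private int items;
        public void addItem() { items++; }
        public int getItems() { return items; }
    }

    public class CartServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            ShoppingCart cart = (ShoppingCart) req.getSession(true).getAttribute("cart");
            if (cart == null) {
                cart = new ShoppingCart();
            }
            cart.addItem();
            // Re-setting the attribute tells the container the object changed,
            // so the update gets replicated to the other nodes.
            req.getSession().setAttribute("cart", cart);
            resp.getWriter().println("Items in cart: " + cart.getItems());
        }
    }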

The Database cluster

Obviously, my immediate choice is MySQL as the database server. It's free, it has a large community, and it powers a lot of well-known web 2.0 sites.

With MySQL you can scale horizontally by using its replication mechanism. When the database grows, it is time to partition the data horizontally into shards.

Replication

Master-Slave

This type of configuration helps you separate reads from writes. Reads always go to a slave, while transactions that alter the data go to the master. You can have one master and multiple slaves (see the sketch below).
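
A minimal sketch of such a read/write split in plain JDBC; the URLs, host names and credentials are placeholders, and a real setup would typically hide this behind a connection pool or the data access layer:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ReadWriteRouter {

        // Placeholder JDBC URLs and credentials; replace with your own hosts.
        private static final String MASTER_URL = "jdbc:mysql://db-master:3306/app";
        private static final String SLAVE_URL  = "jdbc:mysql://db-slave1:3306/app";

        // Writes and transactions that modify data go to the master.
        public Connection writeConnection() throws SQLException {
            return DriverManager.getConnection(MASTER_URL, "app", "secret");
        }

        // Reads go to a slave; with several slaves you would pick one here
        // (round robin, random, least loaded...).
        public Connection readConnection() throws SQLException {
            return DriverManager.getConnection(SLAVE_URL, "app", "secret");
        }
    }

Keep in mind that slaves lag slightly behind the master, so a read that must see the very latest write should still go to the master.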

Master – Master

Such an approach distributes the load evenly and also provides high availability. MySQL 5.0 replicates at the statement level, which can often break replication because of conflicts. The most common conflicts I have met are on unique indexes, but you can cope with this problem by using the REPLACE command instead of INSERT (see the sketch below). Anyway, MySQL 5.1 will bring some significant changes to the replication module.
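
For the unique-index conflicts mentioned above, the REPLACE trick looks roughly like this; the users table and its columns are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class UniqueKeyWriter {

        // Hypothetical table: users(email VARCHAR PRIMARY KEY, name VARCHAR).
        // REPLACE first deletes any row with the same unique key and then
        // inserts, so replaying the statement on the other master does not
        // fail with a duplicate-key error the way a plain INSERT would.
        public void saveUser(Connection conn, String email, String name) throws SQLException {
            String sql = "REPLACE INTO users (email, name) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, email);
                ps.setString(2, name);
                ps.executeUpdate();
            }
        }
    }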

Tree Replication

This gives you a lot of possibilities: you can combine master-master with master-slave in a tree structure, and combined with sharding this can result in a finely tuned MySQL cluster. More about tree replication and data partitioning in future posts.

Logging & Batch processing

Finally we have reached the last component of our application: the logging server and the batch processing machine. Why do we need them? Simply because somebody has to do these jobs. Every application has some batch processing to be done, usually with cron and scripts. I recommend using a scripting language for batch processing, such as bash, Ruby or Python.

For logging I would suggest syslog or syslog-ng; the latter has more advanced features than syslog, better CPU performance and UTF-8 support.

What’s next?

Next I would like to walk step by step through designing a J2EE application based on the architecture discussed above, using Apache, Tomcat, Struts, Hibernate and MySQL.
