Posted on Apr 12, 2011


Implementing durability for in-memory databases, on SSDs

As the examination for a recently completed course in Database Systems Implementation, students had to implement a durable, high-throughput, in-memory key/value database for strings, coincidentally the same problem as this year’s SIGMOD programming contest. I thought I’d present aspects of my own implementation of durability, focusing on problems I encountered and how I solved them. I also relate parts of my solution to existing NoSQL databases, and discuss how SSDs can reduce the sophistication needed for a good-enough solution.

Perhaps the main problem of the programming contest lies in implementing the in-memory data structure itself, with issues such as concurrency control and efficient string comparisons. I will, however, assume such an arbitrary data structure, and focus this post on implementing durability.

Disclaimer
This is by no means efficient, beautiful or advanced, and ideas should probably not be copied without first dissecting them with experienced eyes. I learned as I proceeded, and I had essentially no prior knowledge of databases before this course and assignment. This post is written for inexperienced enthusiasts like myself; for others there is probably little new.

Definitions

First, a couple of loose definitions, as this post shouldn’t require too much prior knowledge within databases:

Durability: the D in ACID. For our purposes, an update to the database is physically written to persistent storage (hard disks or SSDs), and thus recoverable in case of a system crash.

Throughput: a measure of performance, to be maximised. Requests served per second.

In-memory database: the database resides fully in memory, meaning its size is small enough to be contained in RAM. (Conversely, most traditional databases are built for persistent storage, due to the larger available capacity of hard drives.)

Key/value database: only a key and its corresponding value is stored, as opposed to the relational model which stores data as tables with multiple attributes.

SSD: Solid State Drive, flash-based storage without moving parts. Faster in every aspect but much more expensive than hard drives.

A naive in-memory only solution

If durability is no requirement, an in-memory database is simply any data structure mapping string keys to string values, which also supports lexicographical iteration over elements. For example, Java’s TreeMap<K,V> would be perfectly fine. The database is then simply an adapter containing an instance of TreeMap, mapping inserts to put(K,V) etc. The only potentially difficult part involves concurrency.
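
In C++ terms (with std::map playing the role of TreeMap, since it also keeps keys in sorted order), such a store is little more than a thin adapter. A minimal sketch, purely for illustration:

    #include <map>
    #include <string>

    // A non-durable key/value "database": nothing more than an ordered map.
    // std::map keeps keys sorted, so lexicographical iteration comes for free.
    class MemoryDB {
    public:
        void put(const std::string& key, const std::string& value) {
            map_[key] = value;
        }
        bool get(const std::string& key, std::string* value) const {
            auto it = map_.find(key);
            if (it == map_.end()) return false;
            *value = it->second;
            return true;
        }
        void remove(const std::string& key) { map_.erase(key); }
    private:
        std::map<std::string, std::string> map_;  // not thread safe on its own
    };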

Concurrency
Any Java person has probably been thrown a ConcurrentModificationException at some point, and if you code C you probably just get segfaults if your data structure is not thread safe. In the highly concurrent environment a database lives in, updates to the in-memory structure need to support concurrent access.

Three broad strategies exist:

  1. associate a mutex with the whole structure, only allowing access to one thread at a time;
  2. implement a more fine-grained locking strategy, for example with upgradeable read/write locks as used in traditional B trees;
  3. use a lock free data structure, such as ConcurrentSkipListMap<K,V>.

Transactions
With support for concurrent access, a transaction for an in-memory-only data structure roughly follows this pseudocode:

    lock(data_structure)
    update(data_structure)
    unlock()
    return

That is, each update is synchronous within each thread, and each operation is considered a transaction of its own.

To make things simple, I used Kyoto Cabinet, a DBM with an excellent C/C++ API, for the in-memory data structure and concurrency control. More specifically, I used a B tree with node-level read/write locks, provided by Kyoto’s GrassDB class. This way, multiple threads may read simultaneously from the data structure.
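
For reference, this is roughly what using Kyoto Cabinet looks like. The sketch below goes through the polymorphic PolyDB interface, where the special name "%" selects the in-memory cache tree database (i.e. GrassDB); the set/get/remove calls are the same BasicDB interface whether you go through PolyDB or use the GrassDB class directly:

    #include <kcpolydb.h>
    #include <iostream>
    #include <string>

    int main() {
        kyotocabinet::PolyDB db;
        // "%" selects the in-memory cache tree database (GrassDB), a B+ tree
        // kept entirely in RAM.
        if (!db.open("%", kyotocabinet::PolyDB::OWRITER | kyotocabinet::PolyDB::OCREATE)) {
            std::cerr << "open error: " << db.error().name() << std::endl;
            return 1;
        }

        db.set("foo", "hop");          // insert or update
        std::string value;
        if (db.get("foo", &value))     // read
            std::cout << value << std::endl;
        db.remove("foo");              // delete

        db.close();
        return 0;
    }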

High-level problem analysis

With full durability as a requirement, each update to the in-memory data structure needs to be written to disk as well, before success or failure is returned. Furthermore, each transaction is synchronous which poses a surprisingly interesting constraint. With the goal of maximised throughput, it is probably desirable to decouple writing to disk from accessing the in-memory data structure — something which I initially failed at, as illustrated in the following two invalid solutions. (Each disk write is assumed to be atomic.)

Solution 1
This solution is not valid since the database does not maintain consistency: while a slow disk write is under way in thread 1, other threads may read the update from memory even though thread 1 has not committed (returned ‘success’ or ‘failure’):

    lock(data_structure)
    update(data_structure)
    unlock()
    update(disk)
    return

Solution 2
This is essentially as fast as updating in-memory only, but not valid since the disk update occurs asynchronously in a separate thread, ‘success’ being returned before writing to disk has finished:

    lock(data_structure)
    update(data_structure)
    unlock()
    dispatch_new_thread(update(disk))
    return

A naive solution to both scenarios above associates an exclusive lock with both updating memory and updating disk:

    exclusive_lock()
    update(data_structure)
    update(disk)
    unlock()
    return

This, however, has two drawbacks: 1) sophisticated concurrency control in the chosen data structure is not leveraged; 2) no read may be served from memory while any disk write is in progress.

Instead, I chose to write to disk before updating the memory, solving the consistency problem and allowing for concurrent access to memory. This way, the updating of memory and disk can be optimised separately.

    update(disk)
    lock(data_structure)
    update(data_structure)
    unlock()
    return

Notice the similarity to Write Ahead Logging. Curiously, this is also the same strategy used by Facebook’s Cassandra (see Lakshman & Malik, Cassandra — A Decentralized Structured Storage System). An additional logging step should be added before returning, confirming the update to memory in order to guarantee atomicity; if the system crashes in-between writing to disk and updating memory, the key/value pair written to disk will be restored to memory.

The problem of implementing durability is now fully decoupled from whatever in-memory data structure is used, and can be optimised on its own.

File organisation for SSDs

Since the database resides fully in memory, no individual pages are written to or read from disk. Block-sized pages, clustered files and other such traditional database considerations are thus not needed, and a separate recovery feature need not be very efficient. A simple transaction log is sufficient for maintaining a persistent state of the database.

Due to the large rotational delay and seek time of hard drives, transaction logs are typically append-only, minimising the time to write an update by leveraging the fast sequential write speed of hard drives. This is coupled with periodic checkpointing of the database state, so that the logs don’t keep growing ad infinitum — if everything were a log, deletes would simply add data rather than remove it.

SSDs, however, have no rotational delay and have seek times orders of magnitude smaller than those of hard drives. Furthermore, the random write speed of a state-of-the-art SSD is much faster than even the sequential write speed of the fastest hard drives. AnandTech offers great insight and graphs on this.

For SSDs, a complex architecture involving both logging and checkpointing can be substituted by in-place writing of updates: maintain a slot directory of free (deleted) positions within a persistent file, as well as a lookup table mapping keys to offsets in the file. When writing to the file, the persistence layer first checks these structures for a free slot that fits, and only appends to the end of the file if none is found.
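
A sketch of the in-memory bookkeeping this implies; the names and layout are purely illustrative, and record headers, error handling and splitting of oversized slots are left out:

    #include <sys/types.h>  // off_t
    #include <cstddef>
    #include <map>
    #include <string>
    #include <unordered_map>

    struct Slot {
        off_t offset;   // where the record lives in the data file
        size_t size;    // how many bytes it occupies
    };

    class FileLayout {
    public:
        // Decide where a record of `needed` bytes should go: reuse the smallest
        // freed slot that is large enough, otherwise append to the end of the file.
        off_t allocate(size_t needed) {
            auto it = free_slots_.lower_bound(needed);
            if (it != free_slots_.end()) {
                off_t offset = it->second;
                free_slots_.erase(it);
                return offset;
            }
            off_t offset = end_of_file_;
            end_of_file_ += needed;
            return offset;
        }

        // On delete (or overwrite), hand the old slot back for reuse.
        void release(const Slot& slot) {
            free_slots_.insert(std::make_pair(slot.size, slot.offset));
        }

        // key -> position of its current record on disk
        std::unordered_map<std::string, Slot> key_to_slot;

    private:
        std::multimap<size_t, off_t> free_slots_;  // free space, indexed by size
        off_t end_of_file_ = 0;
    };

On a write, the layer looks up or allocates a slot, writes the record at that offset, and updates the lookup table; a delete simply releases the slot for reuse.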

This way, the file ‘garbage collects’ itself, and all free space is hopefully reused (depending on the sizes of keys and values). One mechanism that could be added, though, is some sort of garbage collection of large chunks of free space, since the file size is never reduced and is bounded by the largest amount of data that has resided in the database at any point in time. Nevertheless, this file organisation leverages the inherent advantages of SSDs and provides a simple, good-enough solution to persistent writing.

Issues with durability

In C, I/O usually involves write() and read() operations. For my purposes, apart from a recovery phase, only writing to disk is necessary as read operations are served directly from memory. However, simply invoking write() does not guarantee durability, since a system crash may occur before the disk cache is flushed to disk. Any update that is reported as complete may then, in fact, be lost given some bad luck.

To alleviate this, there exists the database programmers’ favourite function, fsync(). It essentially forces file updates to physical disk, sacrificing performance for guaranteed durability. The time for a write() is negligible compared to that of an fsync(), which typically takes tens to hundreds of milliseconds to complete.
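
Stripped of everything else, the pattern is a write() followed by an fsync(); the file name and record format below are only for illustration:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    // Append a buffer and force it to physical storage. Without the fsync(),
    // the data may linger in the OS page cache and be lost in a crash even
    // though write() reported success.
    bool durable_append(int fd, const void* buf, size_t len) {
        if (write(fd, buf, len) != static_cast<ssize_t>(len)) return false;
        return fsync(fd) == 0;  // this call dominates the cost of the update
    }

    int main() {
        int fd = open("data.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return 1;
        const char record[] = "key=value\n";
        durable_append(fd, record, std::strlen(record));
        close(fd);
        return 0;
    }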

Dealing with fsync() is thus crucial to the overall performance of the database.

I considered three strategies for making the updates to memory persistent:

  1. Use Kyoto Cabinet’s PolyDB::OAUTOSYNC mode for letting Kyoto handle physical file writing as well as file organisation. This was, unfortunately, way too slow.
  2. Implement my own file organisation, and open the persistent file in O_DIRECT mode, bypassing any buffer. Linus Torvalds convinced me this wasn’t such a good idea:

    This is your brain: O
    This is your brain on O_DIRECT: .

    Any questions?

  3. Implement my own file organisation, but fsync() manually. This is what I opted for.

A comment regarding the cost of an fsync()
A simple, linear formula for the cost of an fsync() is given in a blog post at Tokutek:

    fsync() time = N/R + K

where N is the amount of dirty data that needs to be written to disk, R is the disk write rate, and K is a constant time defined by the storage system, such as disk seek time.

Consider what happens if each individual write operation is followed by an fsync(). With X clients each writing an update of 1 unit, the total time for all to finish is:

    X(1/R + K) = X/R + XK

That is, the constant is multiplied by the number of clients. Furthermore, since each write() is followed by its own fsync(), the overall throughput is bounded by the cost of each individual fsync(): the total time can never be less than the number of writes multiplied by the cost of an fsync().

Instead, I consider batching updates together, and issuing one fsync() for the whole batch. Again, for X clients each writing 1 unit of data, this results in improved performance according to the formula:

    X/R + K

The more updates in the batch, the better, as the goal is fewer than one fsync() per update.
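
For illustration only, with made-up numbers: say K ≈ 10 ms, N/R is negligible for these small updates, and X = 100 concurrent clients each write one update:

    one fsync() per write:  100 × (1/R + 0.01 s) ≈ 1 s      (~100 updates/second)
    one fsync() per batch:  100/R + 0.01 s ≈ 0.01 s         (~10,000 updates/second)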

Batching writes together
There are a couple of strategies in common use for minimising the number of fsync()s:

  1. Keep a buffer of fixed byte-size in memory, and issue fsync()s whenever it becomes full. Worst-case data loss is the number of updates that fit inside the buffer. This is a strategy used by multiple DBMSs.
  2. Issue an fsync() periodically, every x milliseconds. Worst-case data loss is the amount of data written under a period of x milliseconds. For example, Redis can be set to fsync() once every second.

NoSQL stores are quick to sacrifice persistence for performance; MongoDB, for example, doesn’t fsync() at all (!). Neither of these strategies is fully durable, so they don’t fulfil the requirement given in the exam. On the other hand, fsync()‘ing each write doesn’t scale and is painfully slow.

A scalable and fully durable solution

The strategy I came up with batches updates together based on concurrent access. This way, writing to disk scales well with the number of concurrent clients, and the worst-case data loss is zero — adhering to the constraints given.

I realise this with a producer/consumer pattern: each client (a producer) adds its update to a queue. A dedicated thread (the consumer), responsible for writing to the persistent file, empties the queue, writes the updates to disk and issues one fsync() for the whole batch. While those writes and the disk synchronisation are in progress, other clients fill up the queue, so writes are batched, with more of them combined as the number of concurrent clients increases. The size of a batch is bounded by the number of write jobs added during the time it takes to write the previous batch and issue its fsync().

Recall that each operation as a whole needs to be synchronous; if a client merely added its update to the queue and let the separate thread write, it would return before the write had finished. I solved this problem by packaging each update into a struct, attaching a boolean which the writing thread sets to true once the update has been written and synced:
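
A minimal sketch of what this might look like (WriteJob, write_queue and the rest are illustrative names, not the original code):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>

    struct WriteJob {
        std::string key;
        std::string value;
        bool done = false;     // set by the writer thread once the batch is fsync()'d
        bool success = false;  // outcome of the disk write
    };

    // Shared between the client threads (producers) and the writer thread (consumer).
    std::queue<WriteJob*> write_queue;
    std::mutex queue_mutex;
    std::condition_variable queue_not_empty;  // clients -> writer: work to do
    std::condition_variable batch_written;    // writer -> clients: batch is durable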

The code for the writing thread:
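
Continuing the sketch above, the consumer drains whatever has accumulated, writes each record, issues a single fsync() for the batch, and only then marks the jobs as done. Here, write_record() stands in for the file-organisation logic described earlier:

    #include <unistd.h>  // write(), fsync()
    #include <vector>

    // Stand-in for the slot-directory/append logic: here it just appends
    // "key value\n" to the data file referred to by fd.
    bool write_record(int fd, const std::string& key, const std::string& value) {
        std::string rec = key + ' ' + value + '\n';
        return write(fd, rec.data(), rec.size()) == static_cast<ssize_t>(rec.size());
    }

    void writer_thread(int fd) {
        for (;;) {
            std::vector<WriteJob*> batch;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                queue_not_empty.wait(lock, [] { return !write_queue.empty(); });
                while (!write_queue.empty()) {  // drain everything queued so far
                    batch.push_back(write_queue.front());
                    write_queue.pop();
                }
            }
            for (WriteJob* job : batch)
                job->success = write_record(fd, job->key, job->value);
            fsync(fd);  // one fsync() for the whole batch
            {
                std::lock_guard<std::mutex> lock(queue_mutex);
                for (WriteJob* job : batch)
                    job->done = true;  // each update in the batch is now durable
            }
            batch_written.notify_all();  // wake every waiting client
        }
    }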

Using a condition variable, each client then waits for a notify_all() issued by the writing thread, checks the boolean, and only then returns:
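
Again only as a sketch, the client side enqueues the job, wakes the writer, and blocks until the batch containing the job has been synced:

    bool durable_put(const std::string& key, const std::string& value) {
        WriteJob job;
        job.key = key;
        job.value = value;

        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            write_queue.push(&job);
        }
        queue_not_empty.notify_one();

        // Block until the writer thread has fsync()'d the batch containing this job.
        std::unique_lock<std::mutex> lock(queue_mutex);
        batch_written.wait(lock, [&job] { return job.done; });
        return job.success;
    }

Only after durable_put() returns does the caller go on to update the in-memory structure, preserving the write-to-disk-first ordering from before.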

Results

I benchmarked my solution against the provided code for the SIGMOD contest. It performs around 4 million random inserts, reads and deletes with 8-byte keys and 8-byte values. Below is a plot of the performance of the index as a whole, with and without durability. The y-axis shows throughput in requests/second, while the x-axis shows the number of threads. Bear in mind that the throughput depends on the sizes of keys and values.

[Plot: throughput (requests/second) versus number of threads, with and without durability]

The same data, plotted on a logarithmic y-scale:

[Plot: the same data with a logarithmic y-axis]

As a reference point, without the consumer/producer pattern the performance remained constant at ~300 requests/second, the same as for 1 thread. With it, under highly concurrent access the durable B tree actually outperforms the in-memory-only B tree. While this is likely due to reduced contention around nodes in the B tree, a stable state around 12000 requests/second seems to be reached, which probably reflects the number of writes possible during an fsync(). The next step in optimising the solution would be to find ways to speed up writes from a single thread.

Conclusion

I built an in-memory database using Kyoto Cabinet for the data structure, and a hand-rolled persistence layer. A simple file organisation is provided for SSD disks, and I optimise persistent disk writes by batching together concurrent writes, thus scaling well with concurrent access. But most importantly, I learned a lot regarding databases and programming techniques!

EDIT: Discussion on Hacker News: http://news.ycombinator.com/item?id=2440406
EDIT 2: I’m pleased to know that this assignment received a distinction.