LevelDB Project

NAME
    LevelDB filesystem using FUSE
    
SYNOPSIS	
    leveldbd [-v] _keystore_dir_ _tablestore_dir_
    
DESCRIPTION
    Uses FUSE to attach a key-value filesystem implemented using the 
    ideas of the LevelDB datastore that runs Google's BigTable. 
    
OPTIONS
    -v verbose output
        
HISTORY
    New for csc421-201, november 2019

BUGS

US Patent US4905110 A

Goals

An introduction to modern datastores, providing a persistence service in distributed systems.
An introduction to a specific datastore, the Level DB, that underlies Google's Big Table.
A demonstration of the overloading of traditional file system semantics, to implement a different underlying data model (i.e. using traditional file operations to talk the talk and walk the walk of key-value pairs.)
An exercise in FUSE, to build user defined file systems.
An exercise in memory mapped files.
An exercise in synchronization, and ordering updates, and immutable files, to deal with distribution and concurrency.
To be the first kid on the block to have built a No-SQL datastore.

LevelDB is a very durable method for storing data. It is resistant to corruption caused by concurrency issues. The LevelDB system can be used in a distributed architecture, with multiple threads, because the LevelDB files are immutable. Immutable files cannot give conflicting data, because the data never changes.

Updates replace old immutable file with new ones in a manner that is atomic, consistent, efficient, and concurrent. The datastore will never be unavailable, and its files can be copied or backed up at any point, as they never change and are never wrong.

Immutable Key-Value Stores

The traditional file system is a data store that sometimes does not scale well. It provides the service of data persistence. However, scalable datastores need to be distributed so that many processes can access the data at higher access rates; they need to be tolerant to failure, and to this end, the data is replicated. Distributed access, replicated data, and fault tolerence are easier accommodated with key-value stores.

Databases that change in time are difficult to make consistent. A database that never changes is of limited application. A compromise is a database that many ways is immutable, and otherwise grows by appending new facts. No fact is over wrong, but it could be out-dated. The collection of all facts at any moment is a subset of all facts at any future time.

A data store based on key-value pairs has advantages in implementing such a database. A key-value store is a collection pairs, a key with a value. The functions on the store are,

Insert a key-value pair into the store, when there is no pair matching the key;
Update the value of an existing key-value pair;
Retrieve the value associated to a key;
Delete the key-value pair.

The key-value pairs are stored in files that are either completely immutable, once created they never change, or are append only. Any file is therefore complete with respect to its own consistency, and the append only files are consistent with respect to the common items they share.

These files are ordered, and a key is appended to the front, whether it be an insert, update or delete. A retrieval works backards through the key-value pairs found in the append only file, then through the time-ordered sequence sequence of the immutable files, looking for the first key match.

Deletes are handled by tombstone key-value pairs, that halt the search. A tombstone is a key-value value pair, with key matching the delete, and value which proclaims its own non-existence.

LevelDB

LevelDB is particular implementation of the sort of key-value store just described. It solves many practical issues with implementing the desired data stores by organizing into three levels of files,

Level 0: key-value pairs are appended to the end of the currently open level 0 file. Once a level 0 file is closed, it is placed at the head of a list of closed level 0 files.
A search for a key-value pair is in reverse order of its append. The most recent appends are in the unique open level 0 file, so search is first in that file. In each level 0 file the key-values pairs are in sequence order, so search proceeds from the end of the file to its front.
Level 1: A level 1 file is created from a closed level 0 file by sorting according to they key. The sorting is stable — if key-value pairs agree on their key, their order in the file is maintained. Then the file is gleaned of key-value pairs with duplicate keys by removing all but the most recent of the key-value pairs. (That is why we used a stable sort.)
The creation of the level 1 is concurrent with the continued use of the level 0 it will replace. The change over to the using the new level 1 can be done very safely and very quickly but first appending the new level 1 to the head of the list of level 1's, then removing the level 0 from the end of the list of level 0's.
If there is no match in the level 0 files, the level 1 files are searched in the reverse order of the list of level 1's. Each level 1 file is searched for the key using an efficient binary search.
Level 2: There is a single level 2 file that is created by merging the oldest level 1 files and the current level 2 file. Because all the input files are sorted, an O(n) sweep gives an ordered merge of the files. Any duplicate pairs are gleaned during the search by retaining only the most current pair; and a deleted pair (marked by a tombstone) can be dropped.
Level 0 and 1 files must retain a tombstone for a deleted key. The search must be stopped to avoid finding an older pair with that key. As there are no more files behind the level 2 file, tombstones need not and should not appear in the level 2 file.

LevelDB tables

         +-----------------------------------------------------------+ 
level 2  | k,v | k,v | k,v |           ...                     | k,v |
         +-----------------------------------------------------------+
         
           ⇑ search level 2 last
           
                +----------------------------+    
level 1         | k,v | k,v |   ...    | k,v |
                +----------------------------+
                    
           ⇑ search level 1 newest to oldest
           
                           +----------------------------+    
level 1                    | k,v | k,v |   ...    | k,v |
                           +----------------------------+ 

            ⇐ search level 0 newest first, open table first ⇐   

                  (closed)                     (open) 
         +------------------------+     +------------------
level 0  | k,v | k,v | .... | k,v |     | k,v | k,v | ....    ⇐   | k,v |
         +------------------------+     +------------------`   new pairs append to
                                                               open table



        +----------------+------------------------------------------------+
        | KEY (16 bytes) |           VALUE (48 bytes)                     |
        +----------------+------------------------------------------------+

The LevelDB filesystem

The key–value store is given the patina of a file system by running your FUSE enabled program, leveldbd, on the mount point store.

To add or update (key,value), write value to the file store/key.
To query the store for (key,value), read the file store/key.
To removed form the store (key,value), unlink the file store/key.
Listing the store directory lists the current keys in the store.

Under the directory store, these files do not exist. The store directory is a way of inserting a key-value pair into actually true files, in the kv-table directory.

The kv-table directory can contain a number of level 0 and level 1 tables, and a single level 2 table, each as separate files. The level 0 tables are append only until a certain size, when they become immutable. The other tables are always immutable.

The files all share a common format. They are a sequence of fixed size (64 bytes) records, with a key (first 16 bytes) then a value (48 bytes). The keys and values are generally null-terminated strings: the key and value are placed into their respective fields, with null ('\0') padding on the right. The null byte is not allowed in the key or value itself.

An empty key (the empty string) is not permitted.
A record where the key is the empty string signals the end of data in the level file.
An empty value is represented as the first two bytes of the value being the null character.
The tombstone is represented by the first byte the null byte and the second byte a one character -1, 0xff.
Other values for the byte following the final null byte are reserved.

Filesystem Layout

proj5
    |
    +----  all the .c, .h files, etc
    |
    +----  store  (fuse mounts here, these are simulated files)
    |      |
    |      +----- key-a (contains "hello leveldb!")
    |      |
    |      +----- ... other key names
    |
    +----  kv-table (the backing store, where the data really is)
           |
           +----- l0.dat  
           |
           +----- ... other .dat files

Project: Level 0

This project implements just the first level 0 file. You have template code that does the FUSE part and also opens up the file kv-table/l0.dat, which will be the unique level 0 table for this part of the project.

Your leveldbd program,

Responds to reads on the file named KEY-NAME by looking for a key-value pair with key name "KEY-NAME" in the l0.dat level file, and reads and returns to fuse the corresponding value. If the key is not found, or deleted, return the empty string.
Responds to writes on the file named KEY-NAME of write data VALUE DATA by appending the key-value pair ("KEY-NAME","VALUE DATA") to the l0.dat level file.
Responds to unlink (remove or delete) of KEY-NAME by appending the tombstone key-value pair ("KEY-NAME",__tombstone_2_byte_sequence__).
Responds to the read directory call back by using the filler function to add the key names of all the undeleted key-value pairs appearing in the store.
Responds to the get attributes call back with fixed values for all but the size. For existing key-value pairs, respond with the length of the value string. For all others queries report a length of zero.

Truncate keys longer than 15 characters to 15 characters, and values longer than 47 characters to 47 characters.

The Fuse demon has been set to debug, and the leveldb-sub.c file has printf's. Read the debug and printf output to see what other Fuse functions are called, to determine what implementations are needed to support the behavior.

Working with Memory Mapped I/O

To make file manipulation (cumbersome) look like array access (can anything be better?) you can make use of memory mapped I/O.

The leveldb file is in fact an array of fixed sized structs, with two fields: a key field and a value field. Memory map the file so it appears in memory as such.

Refer to the class/example/mmap directory (in github) for how this is done.

Frequently in class I have stressed that a C string is not just an array of bytes. It is an array of bytes that additionally,

Must have a null character at the end,
Cannot have a null character not at the end,
Has to be copied carefully so not to over-run the destination,
Needs an extra byte of storage for the terminating null byte.

Sure, C strings are convenient, but you must obey the above rules.

Note that the value strings defined for this project are C string only if the value is of length greater than zero. Otherwise, it is more like a two byte character array, char[2].

And remember, hexdump -C kv-table/l0.dat is your friend.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

author: burton rosenberg
created: 10 nov 2019
update: 13 nov 2021