by: burt rosenberg at: university of miami
date: nov 2021
NAME
LevelDB filesystem using FUSE
SYNOPSIS
leveldbd [-v] _keystore_dir_ _tablestore_dir_
DESCRIPTION
Uses FUSE to attach a key-value filesystem implemented using the
ideas of the LevelDB datastore that runs Google's BigTable.
OPTIONS
-v verbose output
HISTORY
New for csc421-201, november 2019
BUGS
US Patent US4905110 A
Goals
An introduction to modern datastores, providing a persistence service in distributed
systems.
An introduction to a specific datastore, the Level DB, that underlies Google's Big Table.
A demonstration of the overloading of traditional file system semantics, to
implement a different underlying data model (i.e. using traditional file operations
to talk the talk and walk the walk of key-value pairs.)
An exercise in FUSE, to build user defined file systems.
An exercise in memory mapped files.
An exercise in synchronization, and ordering updates, and immutable files, to deal with
distribution and concurrency.
To be the first kid on the block to have built a No-SQL datastore.
LevelDB is a very
durable method for storing data. It is resistant to corruption caused by concurrency
issues. The LevelDB system can be used in a distributed architecture, with multiple threads,
because the LevelDB files are immutable. Immutable files cannot give conflicting
data, because the data never changes.
Updates replace old immutable file with new ones in a manner that is atomic, consistent,
efficient, and concurrent.
The datastore will never be unavailable, and its files can be copied or backed up
at any point, as they never change and are never wrong.
Immutable Key-Value Stores
The traditional file system is a data store that sometimes does not scale well. It provides
the service of data persistence. However, scalable datastores need to be distributed so that
many processes can access the data at higher access rates; they need to be tolerant to
failure, and to this end, the data is replicated. Distributed access, replicated data, and
fault tolerence are easier accommodated with key-value stores.
Databases that change in time are difficult to make consistent. A database that never changes
is of limited application. A compromise is a database that many ways is immutable, and otherwise
grows by appending new facts. No fact is over wrong, but it could be out-dated. The collection of
all facts at any moment is a subset of all facts at any future time.
A data store based on key-value pairs has advantages in implementing such a database.
A key-value store is a collection pairs, a key with a value. The functions on the
store are,
Insert a key-value pair into the store, when there is no pair matching the key;
Update the value of an existing key-value pair;
Retrieve the value associated to a key;
Delete the key-value pair.
The key-value pairs are stored in files that are either completely immutable, once
created they never change, or are append only. Any file is therefore complete with respect
to its own consistency, and the append only files are consistent with respect to
the common items they share.
These files are ordered, and a key is appended to the front, whether it be an insert,
update or delete. A retrieval works backards through the key-value pairs found in
the append only file, then through the time-ordered sequence
sequence of the immutable files, looking for the first key match.
Deletes are handled by tombstone key-value pairs, that halt the search.
A tombstone is a key-value value pair, with key matching the delete, and value which
proclaims its own non-existence.
LevelDB
LevelDB is particular implementation of the sort of key-value store just described.
It solves many practical issues with implementing the desired data stores by
organizing into three levels of files,
Level 0: key-value pairs are appended to the end of the currently open level 0 file.
Once a level 0 file is closed, it is placed at the head of a list of closed level 0 files.
A search for a key-value pair is in reverse order of its append. The most recent
appends are in the unique open level 0 file, so search is first in that file.
In each level 0 file the key-values pairs are in sequence order, so search proceeds
from the end of the file to its front.
Level 1: A level 1 file is created from a closed level 0 file by sorting according
to they key. The sorting is stable — if key-value pairs agree on
their key, their order in the file is maintained.
Then the file is gleaned of key-value pairs with duplicate keys by removing
all but the most recent of the key-value pairs. (That is why we used a stable sort.)
The creation of the level 1 is concurrent with the continued use of the level 0
it will replace. The change over to the using the new level 1 can be done very
safely and very quickly but first appending the new level 1 to the head of the list
of level 1's, then removing the level 0 from the end of the list of level 0's.
If there is no match in the level 0 files,
the level 1 files are searched in the reverse order of the list of level 1's.
Each level 1 file is searched for the key using
an efficient binary search.
Level 2: There is a single level 2 file that is created by merging
the oldest level 1 files and the current level 2 file.
Because all the input files are sorted, an O(n) sweep gives an ordered merge of the
files. Any duplicate pairs are gleaned during the search by retaining only the most
current pair; and a deleted pair (marked by a tombstone) can be dropped.
Level 0 and 1 files must retain a tombstone for a deleted key. The search must
be stopped to avoid finding an older pair with that key.
As there are no more files behind the level 2 file, tombstones need not and should not
appear in the level 2 file.
The LevelDB filesystem
The key–value store is given the patina of a file system by running your FUSE enabled
program, leveldbd, on the mount point store.
To add or update (key,value), write value to the file store/key.
To query the store for (key,value), read the file store/key.
To removed form the store (key,value), unlink the file store/key.
Listing the store directory lists the current keys in the store.
Under the directory store, these files do not exist. The store directory is a way of
inserting a key-value pair into actually true files, in the kv-table directory.
The kv-table directory can contain a number of level 0 and level 1 tables, and a
single level 2 table, each as separate files.
The level 0 tables are append only until a certain size, when they become immutable.
The other tables are always immutable.
The files all share a common format. They are a sequence of fixed size (64 bytes)
records, with a key (first 16 bytes) then a value (48 bytes).
The keys and values are generally null-terminated strings:
the key and value are placed into their respective fields, with null ('\0')
padding on the right. The null byte is not allowed in the key or value itself.
An empty key (the empty string) is not permitted.
A record where the key is the empty string signals the end of data in the level file.
An empty value is represented as the first two bytes of the value being the null character.
The tombstone is represented by the first byte the null byte and the
second byte a one character -1, 0xff.
Other values for the byte following the final null byte are reserved.
Project: Level 0
This project implements just the first level 0 file. You have template code that
does the FUSE part and also opens up the file kv-table/l0.dat, which will be the
unique level 0 table for this part of the project.
Your leveldbd program,
Responds to reads on the file named KEY-NAME by looking for
a key-value pair with key name "KEY-NAME" in the l0.dat level file, and reads
and returns to fuse the corresponding value. If the key is not found, or deleted,
return the empty string.
Responds to writes on the file named KEY-NAME of write data VALUE DATA by appending
the key-value pair ("KEY-NAME","VALUE DATA") to the l0.dat level file.
Responds to unlink (remove or delete) of KEY-NAME by appending the tombstone
key-value pair ("KEY-NAME",__tombstone_2_byte_sequence__).
Responds to the read directory call back by using the filler function to add the key
names of all the undeleted key-value pairs appearing in the store.
Responds to the get attributes call back with fixed values for all but the size.
For existing key-value pairs, respond with the length of the value string. For all
others queries report a length of zero.
Truncate keys longer than 15 characters to 15 characters, and
values longer than 47 characters to 47 characters.
The Fuse demon has been set to debug, and the leveldb-sub.c file has printf's.
Read the debug and printf output to see what other Fuse functions are called, to
determine what implementations are needed to support the behavior.
Frequently in class I have stressed that a C string is not just an array of bytes.
It is an array of bytes that additionally,
Must have a null character at the end,
Cannot have a null character not at the end,
Has to be copied carefully so not to over-run the destination,
Needs an extra byte of storage for the terminating null byte.
Sure, C strings are convenient, but you must obey the above rules.
Note that the value strings defined for this project are C string only if the
value is of length greater than zero. Otherwise, it is more like a two byte
character array, char[2].
And remember, hexdump -C kv-table/l0.dat is your friend.