Dolt's New Storage Engine
In the beginning, there was Noms. The creation of Aaron Boodman and Attic Labs, Noms introduced Prolly Trees, a novel search index structure that supports diff, merge and sync operations. Noms development has since been halted, but its contributions live on in the open source community. Replicache and Dolt both use Prolly Tree based storage engines.
DoltHub was founded in 2018 with the vision of creating an open platform for data sharing and collaboration. We wanted to create Git-for-Data, and we built Dolt as our central collaboration tool. From the beginning, Dolt was built on Noms. It's why Dolt is the only SQL database that can branch, diff, merge, push, and pull. As DoltHub has grown, our focus has shifted from data sharing to building a production database. This week we launched Hosted Dolt, our first cloud offering of Dolt. As our aims have changed, performance has become a major focus of development. Optimizing indexed access in our storage engine has been particularly important for improving Dolt's efficiency, and we've made significant progress in catching up to MySQL on benchmarks.
Incrementally optimizing Noms has provided substantial gains in efficiency, but we've reached a point of diminishing returns. One major barrier is the need to maintain compatibility with the Noms binary serialization format. Altering this format is a major breaking change for Dolt customers and would require a database migration, but we believe it's the best path forward. Today we're announcing the alpha release of Dolt's new storage engine!
Purpose-Built Storage Engine
Storage engines are at the core of database architecture. They're responsible for indexing data, managing transaction concurrency and isolation, and providing durability guarantees. They're often a key differentiator between databases because design decisions at the storage layer impact the features and performance characteristics of the overall system.
Noms is described as a "decentralized database": a syncable, peer-to-peer data store indexed with Prolly Trees. The Noms data format is schema-less, somewhat like a NoSQL database. In general, the Noms storage layer was designed for maximum flexibility, and its data model was reflected in its binary format. However, this flexibility comes at a cost, and it's largely unnecessary in the context of a relational database like Dolt.
By removing the binary compatibility requirement, we were able to design a new storage engine purpose-built for Dolt. Our new engine borrows heavily from Noms, but its architecture is focused on OLTP access patterns. Specifically, we wanted fast access to indexed data and minimal memory allocation overhead. To achieve this, we wrote a new serialization format for tuples and Prolly Trees. Let's dive into the details.
Static Schemas
Noms' self-describing serialization format encodes type information into the serialized data. Its type system uses type accretion to build up a graph of type descriptors in which each object contains metadata about its children. Each serialized value encodes both its own type and the types of all of its child values. For large indexes, these type descriptors add noticeable overhead, both in serialized size and in CPU time spent decoding.
Dolt stores row data in a Prolly Tree with Map&lt;Tuple, Tuple&gt; semantics: key fields are grouped into one tuple, and non-key fields are grouped into another. During index maintenance we enforce map semantics by guaranteeing key uniqueness within the tree. Because Dolt is a relational database, each row has the same layout, corresponding to the static schema of the table. Storing schema information out-of-band creates a much more compact encoding, reducing the size of our indexes. A compact encoding format is important for reducing IO overhead when reading from persistent storage, and when caching index nodes in memory. However, our new encoding format is roughly the same size as the old format because of decisions we made to optimize deserialization.
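To make the Map&lt;Tuple, Tuple&gt; model concrete, here's a minimal sketch (hypothetical code, not Dolt's actual API) of how one row splits into a key tuple and a value tuple, with the schema kept out-of-band:

package main

import "fmt"

// A hypothetical table: CREATE TABLE users (id INT PRIMARY KEY, name TEXT, age INT);
// With a static schema, every row has the same layout, so types and field
// positions live in the schema rather than in each serialized tuple.
type row struct {
	id   int32  // primary key field
	name string // non-key field
	age  int32  // non-key field
}

func main() {
	r := row{id: 42, name: "ada", age: 36}
	key := []any{r.id}          // key fields grouped into one tuple
	val := []any{r.name, r.age} // non-key fields grouped into another
	// The Prolly Tree stores the index with map semantics: key -> val,
	// and index maintenance guarantees each key appears exactly once.
	fmt.Println(key, val)
}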
Fast Random Access
After the type system, the biggest changes in the new storage engine come from the value encoding format. The focus of these changes was to create a format that supports fast, zero-allocation decoding and fast random access. Random access in this context means the ability to access data in the middle of an object without first scanning the entire prefix of the object. Let's look at how index data is serialized to see what's changed and why it matters.
Once serialized, tuples and Prolly Tree nodes have a similar structure. They're essentially container objects that organize their elements as an array: nodes hold a collection of key and value tuples, and tuples hold a collection of fields. This holds for both the old and new formats. The key difference in the new format is that nodes and tuples serialize offsets at the end of their buffer, after the elements have been serialized. Offsets act like back pointers within the buffer that enable random access: in the new format, any value can be accessed in O(1) by looking up the offsets of its beginning and end. The zeroth offset can be omitted because the start of the buffer is already known. The new structure of a Tuple looks like this:
+---------+---------+-----+---------+----------+----------+-----+----------+-------+
| Value 0 | Value 1 | ... | Value K | Offset 1 | Offset 2 | ... | Offset K | Count |
+---------+---------+-----+---------+----------+----------+-----+----------+-------+
Compared to Tuples in the old format:
+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+
| Type 0 | Length 0 | Value 0 | Type 1 | Length 1 | Value 1 | ... | Type K | Length K | Value K |
+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+
In the old format, random access also uses offsets, but those offsets must be computed on the fly during deserialization: when an object is read from disk, it's scanned front to back to compute the locations of its elements. Computing offsets this way means the serialized data is smaller, but it comes with a cost on the read path, as a new buffer must be allocated to store the computed offsets. It also means that random access is O(n) when an object is first read from persistent storage. Once the object is cached in memory, these offsets can be reused.
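Here's a minimal sketch of the offset trick, assuming 2-byte little-endian offsets and a 2-byte trailing count (the layout and names here are illustrative, not Dolt's actual implementation):

package tuple

import "encoding/binary"

// Tuple is a serialized buffer laid out as:
//   value 0 | ... | value N-1 | offset 1 | ... | offset N-1 | count
// Offset i is the start position of value i; offset 0 is omitted
// because value 0 always starts at position 0.
type Tuple []byte

// Count reads the number of values from the last two bytes.
func (t Tuple) Count() int {
	return int(binary.LittleEndian.Uint16(t[len(t)-2:]))
}

// valuesEnd is where the values region stops and the offsets begin.
func (t Tuple) valuesEnd() int {
	return len(t) - 2 - 2*(t.Count()-1)
}

// start returns the starting position of value i.
func (t Tuple) start(i int) int {
	if i == 0 {
		return 0
	}
	pos := t.valuesEnd() + 2*(i-1)
	return int(binary.LittleEndian.Uint16(t[pos : pos+2]))
}

// GetField returns value i as a sub-slice of the buffer: two offset
// reads and a slice, O(1) with zero allocations and no prefix scan.
func (t Tuple) GetField(i int) []byte {
	end := t.valuesEnd()
	if i < t.Count()-1 {
		end = t.start(i + 1)
	}
	return t[t.start(i):end]
}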
Benchmark Results
Let's run some benchmarks to get a sense of how the new format performs. Each benchmark was performed against the same in-memory persistence layer, so the difference between the old and new storage engines comes down to their serialization formats alone. If you're interested, the source code for the benchmarks is here. The first benchmark does point lookups of individual rows; the second does point inserts. Each benchmark is run against indexes of 10K, 100K, and 1 million rows.
goos: darwin
goarch: amd64
pkg: github.com/dolthub/dolt/go/store/prolly/benchmark
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapGet
BenchmarkMapGet/benchmark_maps_10k
BenchmarkMapGet/benchmark_maps_10k/benchmark_new_format_reads
BenchmarkMapGet/benchmark_maps_10k/benchmark_new_format_reads-12 824058 1822 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_10k/benchmark_old_format_reads
BenchmarkMapGet/benchmark_maps_10k/benchmark_old_format_reads-12 199551 5893 ns/op 2783 B/op 47 allocs/op
BenchmarkMapGet/benchmark_maps_100k
BenchmarkMapGet/benchmark_maps_100k/benchmark_new_format_reads
BenchmarkMapGet/benchmark_maps_100k/benchmark_new_format_reads-12 663870 1869 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_100k/benchmark_old_format_reads
BenchmarkMapGet/benchmark_maps_100k/benchmark_old_format_reads-12 127732 9615 ns/op 3702 B/op 66 allocs/op
BenchmarkMapGet/benchmark_maps_1M
BenchmarkMapGet/benchmark_maps_1M/benchmark_new_format_reads
BenchmarkMapGet/benchmark_maps_1M/benchmark_new_format_reads-12 508759 3162 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_1M/benchmark_old_format_reads
BenchmarkMapGet/benchmark_maps_1M/benchmark_old_format_reads-12 58594 19413 ns/op 5114 B/op 88 allocs/op
PASS
goos: darwin
goarch: amd64
pkg: github.com/dolthub/dolt/go/store/prolly/benchmark
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapUpdate
BenchmarkMapUpdate/benchmark_maps_10k
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_new_format_writes
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_new_format_writes-12 14779 68341 ns/op 19563 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_old_format_writes
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_old_format_writes-12 1684 660560 ns/op 486386 B/op 3204 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_new_format_writes
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_new_format_writes-12 16516 72802 ns/op 21005 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_old_format_writes
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_old_format_writes-12 1294 1054807 ns/op 713440 B/op 4905 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_new_format_writes
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_new_format_writes-12 12213 96054 ns/op 28961 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_old_format_writes
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_old_format_writes-12 1219 1025194 ns/op 814927 B/op 5089 allocs/op
PASS
On the read path we're between 3x and 6x faster. On the write path we're consistently 10x faster!
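If you don't want to dig through the linked source, the harness is shaped roughly like this. A plain Go map stands in for the Prolly Tree, so this sketch shows only the structure of the benchmark, not its actual subject:

package benchmark

import (
	"fmt"
	"math/rand"
	"testing"
)

// Build an index of n rows, then measure random point lookups with
// allocation reporting, at each of the three scales used above.
func BenchmarkMapGet(b *testing.B) {
	for _, n := range []int{10_000, 100_000, 1_000_000} {
		idx := make(map[uint64][]byte, n) // stand-in for the index under test
		for k := 0; k < n; k++ {
			idx[uint64(k)] = []byte("row")
		}
		b.Run(fmt.Sprintf("benchmark_maps_%d", n), func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				_ = idx[uint64(rand.Intn(n))] // point lookup of a single row
			}
		})
	}
}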
Future Work
The new storage engine and serialization format are still experimental, but we're releasing an early version to users in order to gather feedback and to preview future performance gains. It's currently behind a feature flag, but you can turn it on by setting the environment variable DOLT_DEFAULT_BIN_FORMAT=__DOLT_1__ and creating a new database with dolt init.
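For example, assuming a fresh directory for the new database:

$ export DOLT_DEFAULT_BIN_FORMAT=__DOLT_1__
$ mkdir new_db && cd new_db
$ dolt init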
While the new format is in development, the full feature set of Dolt is not yet supported; in particular, many CLI utilities have not yet been ported. The focus of this alpha release is dolt sql and dolt sql-server, which are largely supported, with the exception of some DDL and foreign key operations. A general release of the new storage engine and storage format will be available later this year and will include a migration path for existing databases. If you have any questions or want more information about the storage engine roadmap, join us on our Discord!