Dolt is a MySQL-compatible SQL database with Git-like version control features, including commit, branch, merge, diff, push, pull and clone. Dolt targets a diverse set of use cases and is often deployed in an OLTP context where it receives transactional writes that are expected to be ACID compliant. We’ve recently started working on crash-recovery testing of Dolt, in order to gain confidence in its durability properties following a VM crash.
Background
DoltHub has been building Dolt for almost eight years, and its storage backend has seen repeated iteration in that period. Dolt originally started life as an offline data sharing tool, much like Git is used by individual software developers to edit their local copy of the source code and push the locally authored artifacts back upstream to a remote. We quickly realized that users wanted OLTP capabilities and a running SQL server from Dolt, and we have been building Dolt as an OLTP database for most of its lifetime. Dolt has a large battery of existing tests, including unit tests and various integration tests such as our bats tests.
We recently started writing crash-recovery tests for Dolt as well. These tests currently take the form of integration tests which run against live Dolt sql-server processes. Our testing harness starts these sql-server processes in VMs, and the integration tests can request that the VMs be hard reset, disrupting the server process. When the VMs boot back up, the sql-server process comes back online and assertions can be made about the durability of the writes acknowledged so far and the health of the server process itself.
This blog post will outline a bit about the technologies we are currently using for our testing and where we’re headed.
The Overall Architecture
There are four high-level components for the tests:
- Linux Virtual Machines. These are provisioned with system software and are able to run the Dolt sql-server processes under test.
- DCRTS. A first-party daemon running on the VMs. It provides features to the integration tests, including the ability to provision and monitor a running Dolt sql-server on the VM.
- Control Plane. A coordination component, run as part of the integration test suite, which interacts with the VM runtime and DCRTS instances to provide facilities to the integration tests. It launches the VMs and provides the interfaces necessary for the tests to access them and to request operations against them and the Dolt sql-server instances they run.
- Individual Integration Tests. These interact with the Control Plane to run a VM and a Dolt sql-server on it. They act as a client to the sql-server and also make use of facilities for interfacing with the VM running the server. They can request that the VM be hard rebooted, causing the kinds of crashes we want Dolt to be durable against, and they can be notified if the Dolt sql-server process on the VM crashes or logs errors unexpectedly.
As we will see, all of these pieces currently run locally on a single machine which supports virtualization. The Control Plane is a library embedded into the integration tests. The VMs are QEMU VMs which are launched locally by the Control Plane. DCRTS is a Go program installed into the VMs which interfaces with the Control Plane through gRPC. And the integration tests are standard Go *testing.T tests which run their logic and make assertions using stretchr/testify.
Each component has quite a few details to work through, and so we’ll dig into each one in turn.
The VMs
While the current architecture is amenable to distribution, the VMs are run locally for the time being to keep things straightforward and iteration times quick. Currently we only test on Linux aarch64, but we hope to add Linux x86_64 soon.
The VMs themselves are QEMU VMs that can be run locally. They are created by a custom installer script that has a few components. I didn’t have much experience with creating and running QEMU VMs locally when I started this project, and I’m still far from an expert, but we landed in a place that works pretty well for us.
Initially I tried to use Packer to build the QEMU VMs, since that is what we use to build GCP and AWS VMs across our infrastructure. Unfortunately, I wasn’t able to find minimal installers that I could figure out how to run and configure fully headless using the Packer QEMU backend. After spinning my wheels unproductively for a while, I ended up building my own VM provisioning program written in a combination of Go and Bourne shell. It’s part of the same Bazel repository where the crash recovery tests themselves live. It is a relatively straightforward process which does the following:
- Creates the target qcow2 disk for the VM.
- Copies the UEFI .vars from the qemu installation to have a mutable, exclusively owned copy for the VM.
- Runs the Alpine Linux installer in a VM using the installer ISO. This involves running QEMU with its serial console configured to connect as a client to a running TCP server. The installer program, in turn, runs a TCP server on a dedicated port and interfaces with the incoming connection as the serial console of the VM running the installer. The installer interacts with the TCP connection using Netflix/go-expect to script logging into the console, mounting disk resources configured for the VM and running our `install.sh` script, which configures simple drive and networking options. A sketch of this console scripting appears after this list.
- When the installation completes successfully, the guest VM is powered off, which allows our VM build process to continue. The process continues by rendering some local scripts into the VM’s directory which make it ergonomic to access and work with: things like `run.sh`, which starts the VM under QEMU, and `ssh.sh` and `scp.sh`, which use SSH to access a shell on it and move files between it and the host.
- With these scripts rendered, there are some post-install provisioning steps which still need to run in the guest VM: DCRTS needs to be installed, and the bootloader and system services need to be configured so that the crash-reboot cycle is as short as possible. The last step of the VM building process is to run these post-install steps. This provisioning step makes use of the rendered `run.sh` script and accesses the VM over SSH, scp’ing relevant files to it, running scripts and then shutting it down.
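To make the console-scripting step concrete, here is a minimal sketch of the approach, assuming QEMU’s serial console has been configured to dial a local TCP port. The port number, login prompt, mount device and sentinel string are all illustrative, not the actual values from our installer.

```go
package main

import (
	"io"
	"log"
	"net"
	"time"

	expect "github.com/Netflix/go-expect"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000") // port is arbitrary here
	if err != nil {
		log.Fatal(err)
	}
	conn, err := ln.Accept() // the VM's serial console dials in
	if err != nil {
		log.Fatal(err)
	}
	log.Println("got connection, running installer")

	console, err := expect.NewConsole(expect.WithDefaultTimeout(5 * time.Minute))
	if err != nil {
		log.Fatal(err)
	}
	defer console.Close()

	// Bridge the TCP connection and the expect console's pty. A real
	// implementation would also manage terminal echo and teardown.
	go io.Copy(console.Tty(), conn)
	go io.Copy(conn, console.Tty())

	// Script the install: log in, mount the install media, run install.sh.
	console.ExpectString("login:")
	console.SendLine("root")
	console.ExpectString("#")
	console.SendLine("mount /dev/vdb /mnt && /mnt/install.sh") // device is illustrative
	console.ExpectString("install.sh: done") // hypothetical sentinel printed by install.sh
	log.Println("installer ran successfully")
}
```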
The end result is a fully installed VM in a standalone directory on the host. The script runs under Bazel, which is also responsible for providing some of its dependencies, such as the compiled binary of the first-party daemon, DCRTS, and the installer ISO. Creating a VM for local integration testing looks like:
$ bazel run //go/cmd/install -- -out `pwd`/testvm_1 -qemushare /usr/local/share/qemu
...
2026/01/23 15:36:42 got connection, running installer
2026/01/23 15:36:48 saw login prompt. Logging in.
2026/01/23 15:36:48 Logged in. Mounting floppy
2026/01/23 15:36:48 Mounted storage. Running install.sh.
2026/01/23 15:37:03 installer ran successfully
with the resulting artifact:
$ ls -l testvm_1
total 533160
-rw-r--r-- 1 aaronson staff 200409088 Jan 23 15:37 disk.qcow2
lrwxr-xr-x 1 aaronson staff 51 Jan 23 15:37 eficode.fd -> /usr/local/share/qemu/edk2-aarch64-code.fd
-rw-r--r-- 1 aaronson staff 71026 Jan 23 15:37 install.log
-rw-r--r-- 1 aaronson staff 6 Jan 23 15:37 MYSQL_PORT
-rwxr-xr-x 1 aaronson staff 963 Jan 23 15:37 run.sh
-rwxr-xr-x 1 aaronson staff 93 Jan 23 15:37 scp.sh
-rw-r--r-- 1 aaronson staff 94 Jan 23 15:37 ssh_config
-rwxr-xr-x 1 aaronson staff 93 Jan 23 15:37 ssh.sh
-rw-r--r-- 1 aaronson staff 67108864 Jan 23 15:37 vars.fd
Running the VM is as simple as:
% ./testvm_1/run.sh
QEMU 10.1.2 monitor - type 'help' for more information
(qemu)
and accessing it through SSH looks like:
$ ./testvm_1/ssh.sh localhost
Warning: Permanently added '[localhost]:26252' (ED25519) to the list of known hosts.
Welcome to Alpine!
The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <https://wiki.alpinelinux.org/>.
You can setup the system with the command: setup-alpine
You may change this message by editing /etc/motd.
alpine:~# rc-status
Runlevel: default
acpid [ started ]
crond [ started ]
sshd [ started ]
dcrts [ started ]
Dynamic Runlevel: hotplugged
Dynamic Runlevel: needed/wanted
sysfs [ started ]
fsck [ started ]
root [ started ]
localmount [ started ]
Dynamic Runlevel: manual
alpine:~#
DCRTS
On the VM runs a first-party daemon with which the integration tests interact. This daemon is called DCRTS, and it is an extension point for functionality which the integration tests might need on the VM. For now, it takes the following approach.
It is installed as a system service that always runs when the VM comes up. As part of its functionality, it makes an outbound, bi-directional streaming RPC connection to a gRPC service which the integration test Control Plane is running. This lets the Control Plane know that the VM is running, provides a mechanism for the VM to communicate server failures and unexpected log messages to the Control Plane in a timely manner, and allows the Control Plane to issue commands to DCRTS which change the state of the VM.
These commands allow the integration test to change the configuration and runtime behavior of the VM to enact the test it is carrying out (a sketch of the daemon’s side of this protocol follows the list). So, for example, incoming commands can:
- request a specific version of Dolt be installed on the VM,
- request that the currently running Dolt sql-server instance be cleanly shut down and the database deleted,
- request that a new Dolt sql-server instance be provisioned on the VM and the server instance be brought up,
- request that the VM run a system-wide filesystem sync so that all filesystem buffers are durably written to disk.
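Here is a rough sketch of what the daemon’s main loop might look like. The pb package, the Session RPC, and every message and command name are hypothetical stand-ins; the actual proto definitions are not shown in this post. The sync handler uses Go’s syscall.Sync, which maps to sync(2) on Linux.

```go
package main

import (
	"context"
	"log"
	"syscall"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/dcrts/proto" // hypothetical generated stubs
)

func run(ctx context.Context, controlPlaneAddr string) error {
	conn, err := grpc.NewClient(controlPlaneAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	// Outbound, bi-directional stream: the guest dials the Control Plane,
	// so the host never needs inbound connectivity to the VM.
	stream, err := pb.NewControlPlaneClient(conn).Session(ctx)
	if err != nil {
		return err
	}
	// Announce ourselves; the Control Plane now knows the VM is up. Later
	// Sends on this stream report sql-server crashes or unexpected logs.
	if err := stream.Send(&pb.DaemonEvent{Kind: pb.EventKind_VM_UP}); err != nil {
		return err
	}
	for {
		cmd, err := stream.Recv()
		if err != nil {
			return err
		}
		switch cmd.GetKind() {
		case pb.CommandKind_SYNC_FILESYSTEMS:
			syscall.Sync() // flush all filesystem buffers, as in sync(2)
		case pb.CommandKind_INSTALL_DOLT:
			// fetch and install the requested Dolt version ...
		case pb.CommandKind_START_SQL_SERVER:
			// provision a database, exec dolt sql-server, and watch its
			// process and log output for unexpected failures ...
		}
	}
}

func main() {
	// 10.0.2.2 is the host as seen from QEMU user-mode networking.
	if err := run(context.Background(), "10.0.2.2:50051"); err != nil {
		log.Fatal(err)
	}
}
```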
With those services provided by DCRTS, the Control Plane is responsible for running the VMs and the gRPC server to which the DCRTS daemons on those VMs connect.
Control Plane
For now the Control Plane has facilities to run VMs, which essentially means running QEMU for the VMs created and configured by the installer described above, and to interact with those VMs through two mechanisms. The first is QEMU itself, which the Control Plane uses to hard reboot the VM and to request a clean shutdown. The second is the gRPC server which the Control Plane runs, through which it makes requests to the DCRTS daemons running on the VMs.
Typically a given integration test will request a VM to be run, specifying that it wants Dolt to be installed and a sql-server instance to be running. The Control Plane will launch the VM with QEMU, wait for that VM’s DCRTS daemon to connect to the Control Plane service, and then request that Dolt be installed and that a Dolt sql-server database be provisioned with the desired config settings. At that point, the Control Plane can make the connection settings for that sql-server instance available to the running integration test. The Control Plane provides facilities so that the test can listen for unexpected sql-server process crashes or errors, as reported by DCRTS to the Control Plane. Then the test goes about its work, performing the kinds of operations it is testing against the sql-server endpoint while simultaneously interacting with the Control Plane to hard reset the VM during the test.
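This post doesn’t spell out exactly how the Control Plane drives QEMU, but one plausible mechanism for the hard reboot is QEMU’s QMP monitor socket, whose system_reset command yanks the machine’s reset line with no clean shutdown. A minimal sketch, assuming the VM was launched with a -qmp unix socket (the socket path below is illustrative):

```go
package main

import (
	"encoding/json"
	"log"
	"net"
)

// hardReset hard-resets a QEMU VM over its QMP socket. Assumes the VM was
// launched with something like: -qmp unix:/path/vm.qmp,server,nowait
func hardReset(qmpSocketPath string) error {
	conn, err := net.Dial("unix", qmpSocketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	dec := json.NewDecoder(conn)
	enc := json.NewEncoder(conn)

	// QMP sends a greeting on connect and requires a capabilities
	// negotiation before it will accept any other command.
	var greeting map[string]any
	if err := dec.Decode(&greeting); err != nil {
		return err
	}
	if err := enc.Encode(map[string]any{"execute": "qmp_capabilities"}); err != nil {
		return err
	}
	var resp map[string]any
	if err := dec.Decode(&resp); err != nil {
		return err
	}

	// system_reset is an immediate reset with no clean shutdown, which is
	// exactly the crash we want to simulate. A real client should match on
	// the "return" key, since QMP interleaves async events (such as RESET)
	// with command responses.
	if err := enc.Encode(map[string]any{"execute": "system_reset"}); err != nil {
		return err
	}
	return dec.Decode(&resp)
}

func main() {
	if err := hardReset("testvm_1/qmp.sock"); err != nil { // path is illustrative
		log.Fatal(err)
	}
}
```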
Integration Tests
In some ways, once the Control Plane is available and can control the VMs and configure them using DCRTS, the integration tests themselves are pretty straightforward to write. Like most integration tests, they end up being composed of three parts (a sketch of a complete test follows the list):
- Set Up — First they run the VM and wait for DCRTS to get the Dolt sql-server provisioned as expected. Then they do further setup against the sql-server instance, such as creating databases with a given schema, configuring SQL users and grants, etc. As part of setup, the test needs to be configured to terminate in a timely manner and report failure if the Dolt sql-server running on the VM crashes or experiences any unexpected errors.
- Test — Typically there are two independent and parallel components. The first is something which repeatedly enacts a particular change to the Dolt sql-server using standard SQL commands over a TCP connection. These interactions are typically resilient to transient failures and do not consider things like unexpected TCP disconnects to be an issue. That is because the second component of the test is typically a process which intermittently requests the Control Plane to hard reboot the VM. For now this happens on a fixed cadence with a bit of jitter.
- Clean Up — First, any post-test state can be observed and asserted on. For example, the state of the database observed throughout the run and after its completion should be consistent with the timeline of every attempted write, with every ACK’d write being durably stored regardless of the timeline of reboots. As part of cleanup, the Control Plane is responsible for asking DCRTS to shut down the Dolt sql-server process and to clean up any filesystem artifacts associated with the database. Finally the Control Plane shuts down the VM so that it is available for use in another test.
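Here is a sketch of a test following that shape. The harness package and its methods (RunVM, HardReset, AwaitSQLServerUp, Shutdown) are illustrative stand-ins for our Control Plane interface, not the real API:

```go
package crashtest

import (
	"context"
	"database/sql"
	"math/rand"
	"testing"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL wire protocol driver
	"github.com/stretchr/testify/require"

	"example.com/crashtest/harness" // hypothetical Control Plane client library
)

func TestInsertsSurviveHardReset(t *testing.T) {
	ctx := context.Background()

	// Set Up: boot the VM, get a provisioned sql-server, create a schema.
	vm := harness.RunVM(t, harness.WithDoltSQLServer())
	vm.FailTestOnServerCrashOrErrorLogs(t)
	db, err := sql.Open("mysql", vm.DSN())
	require.NoError(t, err)
	_, err = db.ExecContext(ctx, "CREATE TABLE t (id BIGINT PRIMARY KEY)")
	require.NoError(t, err)

	// Test: one component writes continuously, tolerating disconnects;
	// the other hard reboots the VM on a jittered cadence.
	deadline := time.Now().Add(30 * time.Second) // fixed here; configurable in practice
	var acked []int64
	writerDone := make(chan struct{})
	go func() {
		defer close(writerDone)
		for i := int64(0); time.Now().Before(deadline); i++ {
			if _, err := db.ExecContext(ctx, "INSERT INTO t VALUES (?)", i); err != nil {
				continue // disconnects mid-reboot are expected, not failures
			}
			acked = append(acked, i) // only ACK'd writes must be durable
		}
	}()
	for time.Now().Before(deadline) {
		jitter := time.Duration(rand.Intn(5000)) * time.Millisecond
		time.Sleep(10*time.Second + jitter)
		require.NoError(t, vm.HardReset(ctx))
	}
	<-writerDone

	// Clean Up: wait for the server to recover, then assert that every
	// acknowledged write survived the reboots.
	require.NoError(t, vm.AwaitSQLServerUp(ctx))
	for _, id := range acked {
		var got int64
		err := db.QueryRowContext(ctx, "SELECT id FROM t WHERE id = ?", id).Scan(&got)
		require.NoError(t, err)
	}
	vm.Shutdown(t)
}
```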
Duration and Run Time
Unlike most unit and integration tests, this form of crash recovery testing is looking for unexpected misbehavior across a wide range of possible interactions. No single successful run of a crash-reboot cycle proves that Dolt has the kind of durability we want it to have when recovering from a crash. Instead, we can only gain incremental confidence over time that Dolt is behaving as desired as we run these tests over many cycles. Even after many cycles, a passing test by itself cannot prove the absence of critical durability issues in the interactions under test. Luckily, a failing test does provide a smoking gun in the form of the execution trace and the resulting on-disk state, and we can dig into any problems in order to improve Dolt’s behavior going forward.
For this reason, these tests are all written to have a configurable runtime, kind of like (*testing.B).Loop(), and the developer or system running the test controls how much time is dedicated to each run.
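Mechanically, that can be as simple as a duration flag consulted by the test loop; the flag name below is illustrative, not the one we actually use:

```go
package crashtest

import (
	"flag"
	"time"
)

// testDuration lets the developer or CI system decide how much wall-clock
// time each crash-recovery test gets.
var testDuration = flag.Duration("crash.duration", 30*time.Second,
	"how long each crash-recovery test should run")

// runUntilDeadline invokes step repeatedly until the configured duration
// elapses, in the spirit of (*testing.B).Loop() but wall-clock bounded.
func runUntilDeadline(step func()) {
	deadline := time.Now().Add(*testDuration)
	for time.Now().Before(deadline) {
		step()
	}
}
```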
Conclusion and Future Directions
We have spent considerable effort to get the framework and coordination in place to be able to run these tests repeatedly and reliably. As you can see, there are a number of moving pieces, from reliably and reproducibly building the VM images to coordinating the VMs and the processes running on them during the test. We now have the framework in place, but only a few crash recovery tests so far. Our current tests focus on high-touch use cases and help give us confidence in Dolt’s crash recovery properties for DML and DDL within existing databases, as well as for Dolt garbage collection. Going forward, we want to test and improve Dolt’s crash recovery properties for the following interactions:
- Interactions with Dolt remotes, including `dolt_pull()`, `dolt_fetch()` and `dolt_clone()`.
- Database creation and deletion, including `CREATE DATABASE`, `DROP DATABASE` and `DOLT_UNDROP()`.
- Users and Grants manipulation, including `CREATE USER`, `DROP USER` and `GRANT`, as well as interactions with the system tables `dolt_branch_control` and `dolt_branch_namespace_control`.
- Interactions with Dolt remote metadata, through `dolt_remote('add', ...)` and `dolt_remote('remove', ...)`.
In addition to increasing the coverage of different operations Dolt supports, we would also like to improve the test harness’s capabilities. For example, we could run tests with higher parallelism if we could provision and run VMs on a cloud provider. We currently test against ext4 using the data journaling modes ordered and writeback, but testing against other popular filesystems would also make sense. In addition, expectations of the durability and correctness properties of Dolt extend to other types of filesystem failures beyond crash recovery. Using a fault-injection filesystem like LazyFS might allow us to run our individual tests more quickly, and Dolt would also benefit from us finding compelling ways to test the injection of other types of failures, such as I/O, ENOSPC and permission errors.
We hope to continue extending these tests going forward and to make these tests a regular part of our CI/CD pipeline. If you have experience or opinions in developing fault injection testing for SQL databases or distributed systems, or you just want to talk development of Dolt, don’t hesitate to drop by our Discord and reach out to aaron@.