Dolt Use Cases in the Wild
About a year ago, we wrote a blog post about how we thought people might use Dolt, the SQL database you can fork, clone, branch, merge, push and pull just like a git repository. At the time we wrote it, we didn't have any paying customers, and we were still looking for how we might get some. We've come a long way since then, long enough that it's time to publish a revised version of this blog post that reflects how people are actually using Dolt to solve their problems today, and where we think they'll apply it next.
Introduction
Dolt is Git for data. Instead of versioning files, Dolt versions tables. DoltHub is a place on the internet to share Dolt databases. Dolt is the only database with branches. How would you use such a thing?
This document describes how people are using the product today, in the wild. The use cases we're going to cover, in descending order of popularity:
- Backing an application
- Crowd-sourcing open datasets
- Reproducing models or analysis
- Sharing data on the Internet
Backing an application
This was surprising to us, but all of our paying customers (and a bunch of prospective ones) want to use Dolt to back an application, essentially replacing MySQL or Postgres. That's despite the fact that Dolt is currently between 2 and 20 times slower than MySQL (though we're committed to getting that down to 4x). But Dolt is a drop-in replacement for MySQL, so you don't have to rewrite any application code. Everything just works. A year ago this wasn't close to true, and we didn't think Dolt was a good match for this use case. We've come a very long way since then.
Why use Dolt instead of a more mature database product? Generally, our customers chose Dolt because its revision and sharing model lets it do things no other database can.
Data releases
Many applications operate in a read-only mode for the majority of their lifecycle, and only want updates applied to the data in well-defined batches on some regular schedule. Many of these applications also want a human to have the opportunity to review the new data before it gets deployed. They use a workflow that looks like this:
- Customer traffic is served by a `prod` branch of the database
- New iterations / batches of data are continually worked on by developers, using another fork or branch of the database, `dev`
- On some regular schedule, submit a PR to merge `dev` into `prod`
- If necessary, have a human review the changes before merging them

In this workflow `dev` could be updated by any number of processes: a private instance of the same app customers use, command line tools, or some internal editing app.
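As a sketch, the branch-and-release cycle above might look like this on the Dolt command line. The database, table names, and values here are hypothetical; this assumes the Dolt CLI is installed and a DoltHub remote named `origin`:

```shell
# On the dev branch, apply a batch of updates (hypothetical table/values).
dolt checkout dev
dolt sql -q "UPDATE products SET price = 9.99 WHERE sku = 'ABC123'"
dolt add products
dolt commit -m "March catalog updates"

# Push dev and open a PR against prod on DoltHub for human review.
dolt push origin dev

# Once the PR is approved, the merge lands the whole batch in prod.
dolt checkout prod
dolt merge dev
```

In practice the merge would typically happen through a DoltHub PR rather than locally; the commands above just illustrate the underlying branch mechanics.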
Human review of the data merge is really only feasible if the size of the data is relatively small. Otherwise the reviewer is going to rely on sampling or spot-checking to try to catch large errors. We're experimenting with solutions to large data review, to make it tractable for a larger set of use cases.
The cool thing about data releases is that they work even when the `prod` database branch isn't read-only. Either you do data releases for only a subset of your tables (e.g. the product catalog for an online store front), or, if the rate of change is manageable, you can merge from `prod` to `dev` periodically to safely manage data releases on more active tables as well.
Giving branch & merge to customers
Some products use Dolt branches to manage data releases to provide an application to customers. Others deliver that branch and merge functionality to customers directly: customers store structured data in the application, and the application uses Dolt branches under the hood to let the customers manage their data, including workflows analogous to PRs on DoltHub.
This is a subtle but important distinction. As a Dolt customer you can either do data releases for your own application's purposes, or you can deliver data releases to customers directly as a first-class feature of your application, either by exposing the raw data and diffs directly or by putting a layer of UX on top of them. This works well for any use case resembling a content management system (CMS), such as a wiki, website builder, blog, online storefront, etc. You can think of it as a way to make database transactions durable beyond the life of a single session, or even a single user. Make as many edits as you need to, then let the customer merge them back to `master` in one big batch when they're happy with the changes.
Personal database replicas
A surprisingly large chunk of custom enterprise software exists just to manage entries in a database. It's the job of entire departments to develop applications to inspect and update these entries through custom workflows and GUIs. Often, these run into scaling problems as the organization grows: not because the database itself is too large or busy, but because multiple teams need to experiment or test out their changes at the same time, and there's only one copy of the dev database.
With Dolt, getting a copy of the dev (or prod) database is as simple as running a `clone` command. Point your team's enterprise software at the clone, and run your experiments without worrying about what everyone else in the company is doing. When you're done, if the experiment was a success, submit a PR to merge your changes back to `master`. If it failed, just blow away the clone; no one else at the company even needs to know you messed up. And if you need additional help from another group, they can pull your changes to investigate.
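A personal replica workflow might look something like this. The remote, database, branch, and table names are hypothetical:

```shell
# Grab a private copy of the shared dev database.
dolt clone org/app-db
cd app-db

# Experiment on your own branch; nobody else sees these changes.
dolt checkout -b my-experiment
dolt sql -q "ALTER TABLE customers ADD COLUMN segment VARCHAR(32)"
dolt add customers
dolt commit -m "Trial customer segmentation schema"

# Success? Push the branch and open a PR. Failure? Just delete the clone.
dolt push origin my-experiment
```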
Data provenance & audit
Dolt tracks the history of every cell in the database: who put it there, what the commit message was, and what its previous values were. This is very useful in any domain, but absolutely vital for some forms of compliance. Typical enterprise teams build complicated business logic into their applications and schemas to comply with data provenance and auditing requirements. With Dolt, it's free.
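For example, the full history of any row can be queried from Dolt's generated `dolt_history_<table>` system tables, which expose every historical version of each row along with who committed it and when. The table and column names below are hypothetical:

```shell
# Every past value of this row, with committer and timestamp attached.
dolt sql -q "SELECT sku, price, committer, commit_date
             FROM dolt_history_products
             WHERE sku = 'ABC123'
             ORDER BY commit_date DESC"
```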
Other features you can find in other databases, but are better in Dolt
In addition to these unique capabilities, customers using Dolt as a backing store for their application also get access to a bunch of features that do come standard in most other databases, but which work better in Dolt:
- Snapshots. Every database provides a way to snapshot the data. But most databases require you to jump through a lot of hoops to set up the snapshots, and then more hoops to actually use them. With Dolt snapshots are automatic: every commit is a snapshot you can refer to for backup, recovery, reproducible access etc. Our customers use Dolt commits as the basis for reproducible data pipeline jobs. No downtime or special setup is required to read from the database at an older commit: just set a single system var in your SQL session and you're reading from the commit of your choice.
- Time travel. Other databases provide ways to query older revisions. But it often requires you to install a special extension or configure a bunch of settings before you insert any data. Depending on the implementation, you might take a performance hit querying older history. Dolt gives you this ability for free, out of the box, and adds the ability to diff the values of rows in any two revisions.
- Rollbacks. Most databases provide some way to roll back the data after a disaster. But as with snapshots, you have to set it up ahead of time. If you forgot, or if your backup is a little too old, you're out of luck. With Dolt, rollback is built in. Just run `dolt reset --hard HEAD~3` and you've immediately undone the last 3 commits. Even better: it's easy to use the commit history after the disaster to cherry-pick necessary changes back in after the rollback.
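A rough sketch of what these three features look like in practice (the table name and commit refs are hypothetical):

```shell
# Time travel: read a table as it looked three commits ago.
dolt sql -q "SELECT * FROM products AS OF 'HEAD~3'"

# Diff: see exactly which rows changed between two revisions.
dolt diff HEAD~3 HEAD products

# Rollback: undo the last three commits in one step.
dolt reset --hard HEAD~3
```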
Crowd-sourcing open datasets
A few months ago we decided to run an experiment to see if we could pay volunteers to build open data sets. The idea was simple: offer a large cash prize for the data, and use the features of Dolt to review every participant's contributions and track what portion of the dataset they were responsible for. Then at the end, divide up the pot based on each person's contributions. If you contributed half the cells in the final dataset, you get half the prize money.
We call the program data bounties, and have completed and paid out two bounties already. The first one paid $25k for precinct-level vote tallies from the 2016 and 2020 US presidential elections. The second paid $10k for hospital procedure prices. A third one, running now, pays $10k for course catalogs from US colleges.
When we launched the experiment we didn't really know if the idea would catch on, but it worked better than we expected. The bounties have been a resounding success, attracting a few dozen participants who earn a substantial secondary income assembling data as a side gig. And for people who want open data, this is the cheapest and fastest way to get it, by a couple orders of magnitude.
This use-case is possible because Dolt lets many people collaborate on the same dataset simultaneously without coordination, and gives bounty organizers control over what gets accepted via the PR mechanism, with no need to trust anyone with the keys to the database ahead of time or manage fine-grained permissions. And unlike flat files in source control, it scales pretty well: our open data sets weigh in at dozens of gigabytes.
We think data bounties are far and away the most effective way to spend grant money to assemble open datasets. If you have some to spend, let us know how we can help.
Reproducing models or analysis
Several customers are building data pipelines using Dolt as the backing store, so that machine learning and other analysis pipelines know which version of the data they're reading and can be reproduced exactly in the future.
Dolt has the concept of commits. Commits mark a dataset at a point in time and it is really simple to switch between commits. If you produce a model or analysis, make a commit of the data at that point of time and note the commit in the documentation of the model or analysis. When you or someone else returns to that model in the future, getting the data as it looked then is as simple as checking out the commit.
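Concretely, pinning a model or analysis to a commit might look like this (`<commit-hash>` and the table name are illustrative placeholders):

```shell
# Record the commit hash the model was trained against.
dolt log -n 1          # note the commit hash, e.g. in the model's README

# Later, to reproduce the run, check out that exact commit...
dolt checkout <commit-hash>

# ...or query against it directly without moving HEAD.
dolt sql -q "SELECT * FROM training_data AS OF '<commit-hash>'"
```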
If your data is labeled, do you ever want to see if a particular labeler, machine or human, has a particular bias? Inspect the commit history of any cell in the Dolt database to see who changed it and why. Use branches to try different labeling or noise strategies with the ability to easily compare branches. Build your models off branches for reproducibility of every model.
Other databases provide snapshotting or time travel to get reproducibility of data reads. But only Dolt lets you develop simultaneous experimental data branches for your pipeline, then merge the one that works best back into `master`.
Sharing data on the Internet
Sharing data on the Internet was the guiding use case for which we built the current iteration of Dolt and DoltHub. We set out to build a fundamentally better format than CSVs, JSON, or APIs for distributed data collaboration. With those formats, every write you make to data you receive is, in version control lingo, a hard fork. With Dolt, you can write to data you get off the internet and still merge updates easily. You can even contribute these writes back to the producer, allowing deep collaboration between data producer and data consumer.
Collectively, we write a lot of code taking data out of a database, putting it in a format for sharing, sharing it, and then putting it back into a database for consumption. Dolt lets you share the database itself, including schema and views, so you can delete all the code used to transfer data.
DoltHub is a beautiful interface for exploring data. DoltHub allows you to "try before you buy" data. You can run SQL queries on the web to see if the data matches your needs. The data provider can even build sample queries to guide the consumer's exploration. Via the commit log, you can see how often the data is updated. You can see who changed the data and why.
The data you share can be private. For a small fee, you can host private datasets and add read, write, or admin collaborators. Work with a distributed team to build a great dataset for you all to use. Private databases are free as long as they stay under a gig.
As bounties demonstrate, you can even effectively collaborate with total strangers to produce great datasets, with no coordination ahead of time.
Outside of bounties there's only a small number of public databases being shared on DoltHub so far, but as the product continues to catch on we're seeing this happen more. It seems like most people publishing Dolt databases aren't doing it to "publish" them per se. Rather, they just want a small personal database for their own use, and DoltHub gives them that for free. Or they want to share some data with a small set of other people, and telling their friends to run a `clone` command with Dolt is easier than mailing tarballs around and trying to keep them up to date. In other words, "sharing and collaboration" seem to be more important to our early DoltHub customers than "publishing data."
Other use cases
Our original use cases blog post fantasizes about some other things we think Dolt would be great at, but they're not included here because we aren't personally aware of anyone using the product those ways, yet. But we continue to have faith Dolt will eventually get used in exactly the ways we envisioned.
Did we miss anything? Are you using Dolt in a way that's not well captured here? Let us know!
Try Dolt Today
As you can see, Dolt and DoltHub can be used for a number of different tasks. Do any of these use cases resonate? If so, try Dolt today, or come join us on Discord to talk to us and other customers. We are still very early on this adventure. Come be part of it with us!