Dolt: A Database with Branches
As we discussed in the Where Is the Data Catalog? blog post, Dolt is a database designed for internet-scale collaboration. Other databases already offer diffs, history, rollback, and audit logging. We think Dolt's Git semantics provide these capabilities in a different and potentially better way.
We have yet to find a database that does true branching. Branching is the unique capability Dolt provides. We think branching is the feature that will enable internet-scale collaboration on data and create a thriving open data community.
When we refer to branching, we are really referring to a set of capabilities. The first is the ability for the user to declare a branch or fork of a database at a given point in time. The user is signaling that from this point forward this copy of the database will diverge from the current copy. Both copies can evolve in parallel but can also share updates in a convenient way if the owner of either copy so chooses.
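To make the idea of a branch declared "at a given point in time" concrete, here is a minimal sketch in Python. It is a toy model that borrows Git's picture of branches as named pointers into a chain of commits, not a description of how Dolt stores data. The Repo class, its method names, and the branch name my-fork are all hypothetical, and the model only handles single-parent history (no merge commits).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    id: str
    parent: Optional[str]  # id of the previous commit, or None for the first one
    message: str


class Repo:
    """Toy model (hypothetical): a branch is just a name pointing at a commit."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, Optional[str]] = {"master": None}
        self._counter = 0

    def commit(self, branch: str, message: str) -> str:
        # Record a new commit on top of the branch and advance the branch pointer.
        self._counter += 1
        cid = f"c{self._counter}"
        self.commits[cid] = Commit(cid, self.branches[branch], message)
        self.branches[branch] = cid
        return cid

    def branch(self, new: str, source: str) -> None:
        # Declaring a branch copies a pointer, not the data.
        self.branches[new] = self.branches[source]

    def history(self, branch: str) -> List[str]:
        # Walk parent links from the branch tip back to the first commit.
        out: List[str] = []
        cid = self.branches[branch]
        while cid is not None:
            out.append(cid)
            cid = self.commits[cid].parent
        return out

    def fork_point(self, a: str, b: str) -> Optional[str]:
        # Most recent commit that both branches have in common.
        seen = set(self.history(a))
        for cid in self.history(b):
            if cid in seen:
                return cid
        return None


repo = Repo()
repo.commit("master", "import places")                  # c1
repo.branch("my-fork", "master")                        # diverge from this point on
repo.commit("my-fork", "widen the Statue of Liberty")   # c2
repo.commit("master", "add new cafes")                  # c3

print(repo.history("master"))                # ['c3', 'c1']
print(repo.history("my-fork"))               # ['c2', 'c1']
print(repo.fork_point("master", "my-fork"))  # c1
```

Both branches keep the shared commit c1 in their history, which is what later lets the two copies exchange updates instead of being compared from scratch.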
Sharing updates between branches is called a merge. With a command, the owner of one copy can ask for a version of the database that contains all the updates from both branches. This triggers a process called merge conflict detection. Dolt has data-specific merge conflict detection across both schema and data.
If a conflict is detected during merge, the merge cannot continue without manual intervention. If no conflict is detected, the merge is allowed to proceed. For data, if both branches changed the same cell (a row, column pair) to a different value, a conflict is thrown. For schema, the rules are more complex. Dolt makes its best effort to prevent your data from getting into a bad state.
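The cell-level rule for data can be sketched as a three-way comparison against the version the two branches started from. The Python below is only an illustration of that rule under stated assumptions: a table is modelled as a dictionary keyed by (row key, column name), the function name merge_cells is hypothetical, and none of this is Dolt's actual implementation. Schema conflicts, which follow more complex rules, are left out.

```python
from typing import Any, Dict, List, Tuple

Cell = Tuple[str, str]   # (row key, column name)
Table = Dict[Cell, Any]  # assumption: a table is a flat map from cell to value

_MISSING = object()      # marks a cell that does not exist in a given version


def merge_cells(base: Table, ours: Table, theirs: Table) -> Tuple[Table, List[Cell]]:
    """Merge two branches cell by cell, relative to their common starting point."""
    merged: Table = {}
    conflicts: List[Cell] = []
    for cell in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(cell, _MISSING)
        o = ours.get(cell, _MISSING)
        t = theirs.get(cell, _MISSING)
        if o == t:
            value = o              # both branches agree (or neither changed it)
        elif o == b:
            value = t              # only their branch changed the cell
        elif t == b:
            value = o              # only our branch changed the cell
        else:
            conflicts.append(cell)  # both changed it, to different values
            continue
        if value is not _MISSING:
            merged[cell] = value
    return merged, conflicts


base = {("1", "name"): "Statue of Liberty", ("1", "area"): "Liberty Island"}

# Each branch edits a different cell of the same row: no conflict.
ours = {**base, ("1", "area"): "Visible from NY and NJ"}
theirs = {**base, ("1", "name"): "Liberty Enlightening the World"}
merged, conflicts = merge_cells(base, ours, theirs)
print(merged)
print(conflicts)       # []

# Both branches edit the same cell to different values: conflict.
theirs_2 = {**base, ("1", "area"): "New York Harbor"}
merged_2, conflicts_2 = merge_cells(base, ours, theirs_2)
print(conflicts_2)     # [('1', 'area')]
```

In this sketch a conflicting cell is left out of the merged result and reported, mirroring the idea that the merge cannot continue without manual intervention.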
Branching allows for distributed collaboration among a large number of contributors. There can be one master copy of a database with an owner. Many other people can choose to make copies of the data, edit them for their own use cases, and still get updates from the master copy as it evolves. You can maintain your own view of the data without being forced to forfeit updates.
Moreover, if you think the changes you are making to the data are beneficial to the master, you can ask the owner of master to merge your changes into master. This is the type of distributed collaboration that created a thriving open source community. We think the same dynamic can create a thriving open data community.
A good example of a dataset that could use branching and merging is Open Street Maps. One use of Open Street Maps is to produce, for any given GPS coordinate, a probability distribution over the places that coordinate could be in. When you opened an app on your phone, were you in the Starbucks or in your car at the stoplight on the road out front?
Using data from Open Street Maps for this purpose often requires a fork. For your application, you start to have different assumptions about what a place is. For instance, in some applications, the Statue of Liberty is only Liberty Island but for others, the Statue of Liberty is anywhere in New York and New Jersey where you can see the monument. For this reason, my impression is that many users fork Open Street Maps and start to maintain their own copy.
Without a tool to make collaboration easy, the copy diverges over time and updates no longer flow to or from that fork. If Open Street Maps were managed with Dolt, every user could maintain their own copy and manage the places where their copy diverges.
DoltHub allows you to manage this collaboration over the internet in a truly distributed manner. Its permissions system, in which organizations, teams, and individuals are granted read, write, or admin access, should be familiar to most.
We just released a Pull Request feature that allows for data review, similar to code review, before merging. We think the combination of Dolt and DoltHub will energize the open data community similar to the way Git and GitHub energized the open source community.
Dolt is open source. DoltHub is free to host public datasets. Grab a copy of Dolt, clone a dataset from DoltHub, and start being part of the data community.