Dolt as an Immutable Database
Dolt is Git for data. DoltHub is a place on the internet to share Dolt databases. In a recent discussion with a potential customer, the customer thought we did not make a big enough deal of Dolt being immutable at a commit. If you provide the commit hash of a Dolt commit, your queries are guaranteed to return the same results.
This blog post goes into detail about Dolt immutability and how one might use this Dolt feature.
How Dolt Immutability Works
Immutability is the result of Dolt's use of content addressing in the storage layer. I'm reusing pictures from this article that goes into great detail about how Dolt stores databases.
First, you have a commit graph of the database. Each commit points to its parent and to a hash of the Dolt database revision, with some extra metadata like commit author built into the hash.
Each Dolt database revision is the hash value of all the schema and data in that database.
Dolt builds these hash values up from the leaf nodes of the table data. In order to access the leaf node data, Dolt must know the root hash and that root hash needs to be the hash values of all its children. If anything in the database changes, Dolt will no longer be able to access the values from that root commit. So, with a Dolt commit, you are guaranteed to always get the same data and schema back on read, if you can read at all. The result is data and schema immutability at a commit.
How We Traditionally thought Immutability
Traditionally, we've focused on Dolt immutability as a feature for data sharing. If you are sharing data between parties that have low to no trust, the best way to know what you are getting is immutability of commits. So, whenever we've focused on immutability in this blog in the past, it's usually been in the context of data sharing and trust.
Moreover, quickly and easily being able to see what changed through diffs is very valuable. So, we've tended to gloss over immutability and go straight to diffs when discussing Dolt. Who doesn't want to see who, what, when, why, and how a database changed? Diffs are the feature we've focused on, potentially at the expense of discussing immutability.
How one could use Dolt's Immutability
In Dolt SQL, we expose queries at a commit via AS OF
syntax.
demo $ dolt log
commit t0tt6cpn2lmaubcunpsvt7fnfs8dqk1q
Merge: 5pmrkq7rpkvufna9qcstonk39kbu85ej 5pmrkq7rpkvufna9qcstonk39kbu85ej
Author: timsehn <tim@liquidata.co>
Date: Thu Jul 22 14:23:09 -0700 2021
Add query changes from workspace 527c46f5-6c43-4b1d-8ed7-de2364a90d69
commit 5pmrkq7rpkvufna9qcstonk39kbu85ej
Merge: hsjl3pbe13773ou98c30b3cjbm977j7s hsjl3pbe13773ou98c30b3cjbm977j7s
Author: timsehn <tim@liquidata.co>
Date: Thu Jul 22 14:21:23 -0700 2021
Run SQL query: insert into demo values (4, 'zap')
commit hsjl3pbe13773ou98c30b3cjbm977j7s
Merge: uq868f3oco9rte29b9ldjjfm5c4egrrn uq868f3oco9rte29b9ldjjfm5c4egrrn
Author: timsehn <tim@liquidata.co>
Date: Thu Jul 22 14:20:40 -0700 2021
Run SQL query: insert into demo values (3, 'bap')
commit uq868f3oco9rte29b9ldjjfm5c4egrrn
Merge: 3n7nnuggsqg5hkqfru4655keg357k3uv 3n7nnuggsqg5hkqfru4655keg357k3uv
Author: timsehn <tim@liquidata.co>
Date: Wed Jun 30 12:02:16 -0700 2021
Add query changes from workspace 15d30f70-6275-4a53-8656-b4c67c3bcfc1
commit 3n7nnuggsqg5hkqfru4655keg357k3uv
Merge: g67abr0esic305cic4rmsjoic9cl5g62 g67abr0esic305cic4rmsjoic9cl5g62
Author: timsehn <tim@liquidata.co>
Date: Wed Jun 30 12:02:03 -0700 2021
Run SQL query: insert into demo values (2, 'bar')
commit g67abr0esic305cic4rmsjoic9cl5g62
Merge: gshagnjp0a5todq4aqni9jh6kmjbp0q0 gshagnjp0a5todq4aqni9jh6kmjbp0q0
Author: timsehn <tim@liquidata.co>
Date: Wed Jun 30 07:33:39 -0700 2021
Add query changes from workspace 0d4089e7-8063-4447-91e2-54e41de323e0
commit gshagnjp0a5todq4aqni9jh6kmjbp0q0
Author: timsehn <tim@liquidata.co>
Date: Wed Jun 30 07:32:39 -0700 2021
Initialize data repository
demo $ dolt sql -q "select * from demo as of '3n7nnuggsqg5hkqfru4655keg357k3uv'"
+----+------+
| pk | col1 |
+----+------+
| 1 | foo |
| 2 | bar |
+----+------+
demo $ dolt sql -q "select * from demo as of 't0tt6cpn2lmaubcunpsvt7fnfs8dqk1q'"
+----+------+
| pk | col1 |
+----+------+
| 1 | foo |
| 2 | bar |
| 3 | bap |
| 4 | zap |
+----+------+
You can achieve the above in a different way by using DOLT_CHECKOUT()
or similar SQL to set your database to a specific commit. Dolt also supports tags if you like names instead of commits.
Thus, if you build an application on top of Dolt that is commit aware using AS OF
SQL syntax (or equivalent) for data access, you can pin the revision of the database you are querying to a particular commit. The application can then choose when to start accessing data at a different commit. Or that application can be upgraded to a new commit with an out of band process, like for instance a code or configuration deployment.
Building an application this way allows you to do data "releases". Don't just take the tip of master. Run an out of band quality assurance process on new data releases. If all your tests pass, upgrade to the latest release at a time when customer impact would be lowest if something goes wrong, much like upgrading other dependencies in your software. If the Dolt database supports more than one application, allow each application to choose which commit it trusts. For instance your development instance may trust the tip of master but production is tied to a specific data release.
Conclusion
Do you have any other ideas about how you would use the immutability of data and schema at a commit in Dolt. Come talk to us about your use cases on our Discord.