Dolt for Git Noobs
TL;DR
- Dolt is a SQL database with built-in Git versioning, branching, and distribution semantics that makes collaborating on and distributing data effortless.
- What Git does for files, Dolt does for data.
- Where Git versions files, allowing for fine-grained control, tracking, and management of files over time, Dolt versions data and database schema, allowing for fine-grained control throughout an entire Dolt repository and every branch it houses.
- Where Git supports the branching and merging of files for true user collaboration, Dolt supports branching and merging of data and databases, finally giving data producers and data consumers the tool they need to truly collaborate on data in a meaningful way.
Git + SQL = Dolt
A quick apology to our more technologically fluent readers -- please forgive me. I’ve taken some liberties with some of my technical descriptions and explanations. I've also included some simplified illustrations that depict features of Git, SQL, and Dolt that fall very short of illustrating the elegance and complexity of any of them. My aim in this overly simplified approach is to give readers, some of whom maybe less technically fluent, a wide, surface-level understanding of what Dolt does and why it matters. Thanks!
Dolt is a SQL database with built in Git versioning, branching, and distribution semantics. To understand why those features make Dolt such a powerful tool it's important to know a bit about Git and a bit about SQL.
Git for Noobs
At a high level, Git is software that allows users to manage and track all changes made to files over time. Git keeps a version of every saved change to those files. Git calls saving a version of files it stores a commit. Git commits make up a history for the files stored in a Git repository. You can think of this history as a tree-like structure, where each saved version of a file is descended from another version (the previously saved version). This tree-like structure is what makes Git such a powerful tool.
Git users can "checkout" any commit in the tree, which gives them a copy of the Git repo as it existed at that commit. They can make changes to that file and save (commit) those changes. Or, they could revert all changes they made to that file by checking out the previous version.
When two different people want to edit the same files, they usually make their own branches to work on. You can think of a branch like a branch of a tree, coming off the trunk. A branch keeps each user's work separate from anyone else's changes until it's time to merge them back into the main trunk of the tree.
We can represent this idea with the following simplified illustration:
In the diagram above, imagine we are looking at a Git repository containing the history of a single text file, called Hello.txt. The file originates on a branch named master
, colored green, and this branch makes up the starting point or "trunk" of the tree. When Hello.txt was created, "Hello" was written to the file and it was saved, or committed, making it the first commit on this file's tree (a sideways tree that grows from left to right, over time).
Next, someone edited the file to add the word "World" and committed it, resulting in a new version of Hello.txt. The original version is still there, saved in the Git repository. It can always be retrieved, recovered, or restored at any time.
Next, John, Sally, and Kayla each decide to edit the file. Each one creates their own branch of the repository. These branches are represented in the picture above by the solid-line arrows stemming from the second commit in the tree. By branching, each user makes a copy of the second version of Hello.txt from branch master
, meaning each user now has a copy of Hello.txt as it existed at the second commit.
Here's where things get really cool. Each user can now make whatever changes they want to the file without interfering with or affecting any other version of Hello.txt contained in the Repository. They all commit a new version. John changed his version of Hello.txt to say "Hello John!", Sally changed her version to say "Hello Sally!", and Kayla changed hers to say "It's Kayla!". As you can see, each of these differing versions is independent of the others, while still being housed in a single Git repository.
Finally, Kayla decides to merge her changes back to the master
branch. This means that the master
branch everyone uses will now have Kayla's changes to the file. The merge is represented by the dotted-line arrows stemming from both Kayla's version and from the master
version of the file. This merging combines the two versions of Hello.txt (for our purposes magically) to make a new version of the file that reads "Hello World It's Kayla!".
It's this "merge" feature of Git that makes repository branching so powerful. The ability to merge one branch into another, combining changes made on two different versions of the Hello.txt file, is what facilitates true user collaboration. These powerful features have made Git an irreplaceable tool in many software engineering tool belts, and that's why we've built these features into Dolt.
What Git does for files, Dolt does for data. Where Git supports the branching and merging of files for true user collaboration, Dolt supports branching and merging of data and databases, giving data producers and data consumers the tools they need to truly collaborate on data in a meaningful way.
SQL for Noobs
Users edit Dolt tables with SQL commands. If you're not familiar with SQL, pronounced "sequel", here's a quick, simplified summary to give you some context into what SQL is and why Dolt supports it. SQL is a time-tested standard with 50 years of knowledge and tooling behind it. Countless companies use SQL databases and consume various data using SQL. SQL support in Dolt lets users query their data in a language most already know.
SQL queries look like normal English and are easy to use. Here's an example:
As you can see in the image above, we have on the left a simple SQL query, "SELECT * FROM People WHERE Age = 24;". This query essentially means "Please return rows containing all of a person's information from the table called 'People', but only include people 24 years of age." The table being queried, titled "People", contains three people, each with a different age. The output of the query is the single row from the table with Age = 24
.
It's Dolt's SQL interface combined with features from Git, that make it such a powerful and exciting tool to produce, consume, share, and maintain data.
Editing Dolt tables with SQL commands
To demonstrate precisely what we mean when we say Dolt is "Git for data", let's recreate our Hello.txt example from earlier, using a Dolt repository in place of the Git repository! Our example Dolt repository will be called Hello_Repo
and will look very similar to the Git repository from earlier. Instead of tracking and versioning files, Hello_Repo
will be tracking and versioning data tables.
The illustration above shows the Hello_Repo
Dolt repository we'll be creating. I will also be showing you the exact Dolt commands you can use to create this repository yourself.
To start, make sure you have the latest version of Dolt installed, then open your terminal
or a command line shell.
From inside the terminal, we can verify that we have Dolt correctly installed by running Dolt's version command:
$ dolt version
dolt version 0.15.2
Also if this is your first time using Dolt, you should use Dolt's config command to set your user.name
and user.email
for Dolt, which will allow you to initialize new Dolt repositories.
$ dolt config --global --add user.name yourusername
$ dolt config --global --add user.email youremail@youremail.com
With Dolt successfully installed, let's create the directory for our Dolt repository and initialize it to track our data tables. To do this we will use Dolt's init command to initialize the repository, then we will display the current status of our repository by using Dolt's status command:
$ mkdir Hello_Repo
$ cd Hello_Repo
$ dolt init
Successfully initialized dolt data repository.
$ dolt status
On branch master
Untracked files:
(use "dolt add <table|doc>" to include in what will be committed)
new doc: LICENSE.md
new doc: README.md
When we initialize a new Dolt repository, Dolt automatically generates a LICENSE.md
file and a README.md
file to make writing a usage license and description of the repository easy. Dolt automatically tracks changes made to these files, just like it tracks changes to tables.
For our example, let's ignore these two files and just focus on representing our Hello.txt file as a data table, in Dolt.
Notice the first line of output from dolt status
says "On branch master". All new repositories start on branch master
and running dolt status
will always tell you which branch you are currently using.
We can also use Dolt's branch command, which will display a list of available branches we can move between. The current branch will have an asterisk next to its name.
$ dolt branch
* master
To add our schema and create our table, we will use Dolt's sql command, which allows us to interface with our Dolt repository using SQL. We will call our data table Hellos
and the SQL statement we will use to create this, looks like this:
CREATE TABLE Hellos (
line_number INT NOT NULL,
content VARCHAR(255) NOT NULL,
PRIMARY KEY (line_number)
);
Even if this syntax appears a bit cryptic, all it is doing is telling Dolt to create a table called Hellos
, and that the table should have two columns, one called "line_number" and one called "content". To execute this SQL statement, we run:
$ dolt sql -q 'CREATE TABLE Hellos (line_number INT NOT NULL, content VARCHAR(255) NOT NULL, PRIMARY KEY (line_number));'
The -q
flag (q
for query) paired with Dolt's sql
command tells Dolt that we only want to run a single SQL statement or query. Without -q
, we'd start an interactive SQL shell.
After adding our Hellos
table, running dolt status
again tells us that our Dolt repository is now aware of the table we've created.
$ dolt status
On branch master
Untracked files:
(use "dolt add <table|doc>" to include in what will be committed)
new table: Hellos
new doc: LICENSE.md
new doc: README.md
Though our repository is aware of table Hellos
, it is not yet tracking the table. Notice how new table: Hellos
is displayed in the list of Untracked files:
. In order to allow Dolt to start tracking changes to our table and storing different versions of our table as we change it, we need to tell Dolt to do so. This is what is known as staging a file or table for tracking, and is another amazing Git feature we've built into Dolt. If you want more information on what staging means and why it is valuable, check this out.
For the purpose of this example however, we will simply tell Dolt to start tracking and storing versions of Hellos
by running two Dolt commands back to back. Those commands are add and commit:
$ dolt add .
$ dolt commit -m "Add new table Hellos"
commit 3sto00up3jfp8mltu2uae48dscvd217p
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 12:14:06 -0700 2020
Add new table Hellos
The .
after dolt add
tells Dolt to start tracking all tables in our current directory. The -m
flag (m
for message) allows us to write a brief message describing this newly saved version of our table Hellos
.
After running dolt commit
we see a success commit message is displayed that includes the commit ID, the author (yes my user.name
is four cat emojis) along with the author's email, the date, and our commit message.
We've successfully made our first commit on branch master and it contains the same table we can see in the illustration of our Hello_Repo
repository above!
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+---------+
| line_number | content |
+-------------+---------+
+-------------+---------+
After making a commit, our branch is now considered "clean", meaning that Dolt has not detected any new changes made on this branch since the last commit. We can confirm that our current branch is "clean" by running dolt status
once again:
$ dolt status
On branch master
nothing to commit, working tree clean
Let's now add some data to Hellos
. The data we add will be line number
1 and the content
"Hello", just like the file Hello.txt had in our Git example. To add this data we return to Dolt's sql
command:
$ dolt sql -q 'INSERT INTO Hellos (line_number, content) VALUES (1, "Hello");'
+---------+
| updated |
+---------+
| 1 |
+---------+
Dolt lets us know that we've successfully updated a row in our table. Our table now looks like this:
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+---------+
| line_number | content |
+-------------+---------+
| 1 | Hello |
+-------------+---------+
dolt status
shows that we have changed the table since our last commit:
$ dolt status
On branch master
Changes not staged for commit:
(use "dolt add <table>" to update what will be committed)
(use "dolt checkout <table>" to discard changes in working directory)
modified: Hellos
Let's commit this change, permanently saving this new version of the table, the same way we did before:
$ dolt add .
$ dolt commit -m 'Insert Hello on line 1'
commit pl170pbhseui2u3p30867ki23d0rgms8
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 13:01:28 -0700 2020
Insert Hello on line 1
Great! Let's make our final change to table Hellos
on branch master
, changing the content
on line_number
1 from "Hello" to "Hello World", so our version tree matches the tree in the illustration above:
$ dolt sql -q 'UPDATE Hellos SET content="Hello World" WHERE line_number = 1;'
+---------+---------+
| matched | updated |
+---------+---------+
| 1 | 1 |
+---------+---------+
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
+-------------+-------------+
$ dolt status
On branch master
Changes not staged for commit:
(use "dolt add <table>" to update what will be committed)
(use "dolt checkout <table>" to discard changes in working directory)
modified: Hellos
$ dolt add .
$ dolt commit -m 'Change line 1 to say Hello World'
commit r6tn4i3omlte7lmc7vchlrerqc21531e
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 13:13:32 -0700 2020
Change line 1 to say Hello World
We have now successfully created three versions of our data table Hellos
on branch master
, just like in the illustration! The first version contains Hellos
as an empty table, the second version contains Hellos
with the content
"Hello" and line_number
1, and the most recent version contains the content
"Hello World" on line_number
1. Three distinct versions of the same table, all tracked and stored within our repository!
Now let's create a version of table Hellos
for each of our example users by using Dolt's powerful branching feature. To do this, we can once again use Dolt's branch
command. Providing a name argument after the branch
command tells Dolt to copy the current branch we are on, master
, and name it's copy after the name argument we supplied:
$ dolt branch Johns_Branch
$ dolt branch Sallys_Branch
$ dolt branch Kaylas_Branch
Creating new branches in our repository is really that simple; just three commands. We can the see the different branches we have created in our repository by listing them using dolt branch
once more:
$ dolt branch
Johns_Branch
Kaylas_Branch
Sallys_Branch
* master
These new branches are represented in the illustration of our Dolt repository by the solid-line arrows stemming from the third commit made on branch master
and pointing at three different commits, each made by our different example users. These commits will be different versions of table Hellos
and will reflect the changes made by our example users.
Let's make these three commits now, for each example user, starting with John.
First we "checkout" Johns_Branch
, making it our current branch, by using Dolt's checkout command:
$ dolt checkout Johns_Branch
Switched to branch 'Johns_Branch'
Now we can make John's changes to Hellos
using the same Dolt command pattern we used to modify the table when we were on branch master
:
$ dolt sql -q 'UPDATE Hellos SET content="Hello John!" WHERE line_number = 1;'
+---------+---------+
| matched | updated |
+---------+---------+
| 1 | 1 |
+---------+---------+
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello John! |
+-------------+-------------+
$ dolt add .
$ dolt commit -m 'Change line 1 to say Hello John!'
commit 0uohugsepvo9ai3mojv5i28757joh2vv
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 13:43:13 -0700 2020
Change line 1 to say Hello John!
Let's do the same for both Sallys_Branch
and Kaylas_Branch
:
$ dolt checkout Sallys_Branch
Switched to branch 'Sallys_Branch'
$ dolt sql -q 'UPDATE Hellos SET content="Hello Sally!" WHERE line_number = 1;'
+---------+---------+
| matched | updated |
+---------+---------+
| 1 | 1 |
+---------+---------+
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+--------------+
| line_number | content |
+-------------+--------------+
| 1 | Hello Sally! |
+-------------+--------------+
$ dolt add .
$ dolt commit -m 'Change line 1 to say Hello Sally!'
commit tmgrjpn48tk82vtpp5u6gki16f58hk31
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 13:46:12 -0700 2020
Change line 1 to say Hello Sally!
$ dolt checkout Kaylas_Branch
Switched to branch 'Kaylas_Branch'
$ dolt sql -q 'INSERT INTO Hellos (line_number, content) VALUES (2, "Its Kayla!");'
+---------+
| updated |
+---------+
| 1 |
+---------+
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
| 2 | Its Kayla! |
+-------------+-------------+
$ dolt add .
$ dolt commit -m 'Add Its Kayla to line 2'
commit bktov62uk65tl9knjh1bt8p6dm0b9t7v
Author: 😺😺😺😺 <dustin@liquidata.co>
Date: Thu Apr 02 14:05:17 -0700 2020
Add Its Kayla to line 2
Notice that we made a couple adjustments to our SQL for Kaylas_Branch
. She doesn't change the content
for line_number
1 the way John and Sally did. Instead she adds a new line (and row) to the table. She added line_number
2 and content
"Its Kayla!" (please forgive my lack of apostrophe in It's).
We now have a single data table, Hellos
, with multiple versions on different branches! Let's look at each branch's data:
$ dolt checkout master
Switched to branch 'master'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
+-------------+-------------+
$ dolt checkout Johns_Branch
Switched to branch 'Johns_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello John! |
+-------------+-------------+
$ dolt checkout Sallys_Branch
Switched to branch 'Sallys_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+--------------+
| line_number | content |
+-------------+--------------+
| 1 | Hello Sally! |
+-------------+--------------+
$ dolt checkout Kaylas_Branch
Switched to branch 'Kaylas_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
| 2 | Its Kayla! |
+-------------+-------------+
Fully versioned data, Git branching functionality, and a SQL data interface -- That is the power of Dolt. These example users are now true data collaborators.
Merging changes from a branch
To create the final commit in our version tree represented in the Hello_Repo
illustration, we will checkout
branch master
once more and use Dolt's merge command to merge Kaylas_Branch
into master
:
$ dolt checkout master
Switched to branch 'master'
$ dolt merge Kaylas_Branch
Updating r6tn4i3omlte7lmc7vchlrerqc21531e..bktov62uk65tl9knjh1bt8p6dm0b9t7v
Fast-forward
We now have the final commit we needed to represent our Hello.txt Git example in Dolt! The version of Kayla's Hellos
table was merged into the current version of the table on branch master
. On branch master
, this means we've successfully created another new version of Hellos
!
If we inspect all current versions of Hellos
on each branch again, we will see that Hellos
on master
reflects the additions from Kayla, while all other versions remain the same:
$ dolt checkout master
Already on branch 'master'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
| 2 | Its Kayla! |
+-------------+-------------+
$ dolt checkout Johns_Branch
Switched to branch 'Johns_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello John! |
+-------------+-------------+
$ dolt checkout Sallys_Branch
Switched to branch 'Sallys_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+--------------+
| line_number | content |
+-------------+--------------+
| 1 | Hello Sally! |
+-------------+--------------+
$ dolt checkout Kaylas_Branch
Switched to branch 'Kaylas_Branch'
$ dolt sql -q 'SELECT * FROM Hellos;'
+-------------+-------------+
| line_number | content |
+-------------+-------------+
| 1 | Hello World |
| 2 | Its Kayla! |
+-------------+-------------+
Conclusion
Fully versioned data, Git branching and merging functionality, with a SQL data interface -- That is the power of Dolt. Each branch in our repository can exist with its own versions of data and schema, and can be merged with others allowing for total control and total collaboration.
As mentioned earlier the examples above are really just a surface-level introduction to Git, SQL, and Dolt and the power they give to their users.
If you'd like to learn more about Dolt, DoltHub, or our team here, check out our other blog posts here. Thanks for reading!