So You Want Git for Data? 2024 Edition
It's hard to believe it's been over four years since I first mused about Git for Data. Let's revisit the topic.
What is "Git for Data" in practice? Many products have come to market around the "Git for data" theme. Dolt and DoltHub are our answer. This post unpacks what the various products in the space offer as their answer to what "Git for data" means.
What do you mean by Git?
Do you mean data versioning? If so, which parts of version control do you care about? Do you care about rollback? Diffs? Lineage, i.e. who changed what and when? Branch/merge? Sharing changes with others? Do you mean a content-addressed version of the above with all the good distributed qualities that solution provides? Do you care about some of the more esoteric version control features of Git like a staging area or multiple remotes?
Four years on, I have a much better conceptual idea of what people mean when they say Git. People mean version control. Most people who say Git for Data want log, branch, merge, and diff on data in a "Git-ty" way. That means open source, content addressing, and a commit graph. They'll settle for other version control models, but Git won version control of files for a reason.
Or are you thinking more of GitHub? Do you want an online data catalog? If so, is what you really want a thriving open data community akin to the open source community? Do you want to be able to collaborate remotely and asynchronously on private data projects? Do you want pull requests, i.e. integrated human review of data changes? Do you want to be able to create issues referring to certain changes or parts of the data?
I debated whether to exclude this category from this blog altogether. Git is not GitHub. I could write a "So you want GitHub for Data" article. But I think many people who want Git for Data really want the GitHub experience with data. So, I ultimately decided this category stays.
What do you mean by Data?
This topic hasn't changed much since I last wrote.
Do you mean data in files or data in tables? Do you mean unstructured data like images or text from web pages? Do you mean CSV tables or JSON blobs? Do you mean big data like time series log entries? Do you mean relational databases? If relational, do you care about schema or just data (or vice versa)? Do you mean data transformations, like exist in data pipelines? Do you have an application in mind? Data for machine learning (i.e. labeled data)? Data for visualizations and reports? Data for a software application?
I will add a further distinction. If you're looking for a relational database, are you looking for an Online Transaction Processing (OLTP) database like MySQL or Postgres? Or are you looking for an Online Analytical Processing (OLAP) database like a data warehouse or data lake?
We're a little biased, but we think data means database. Data in files is just data yearning to be structured into a database so it can be queried in O(log n) time.
Products
The categories of products claiming adjacency to Git for Data are the same:
- Data Catalogs
- Versioned Data Pipelines
- Version Controlled Databases
The products in each category have changed. Some categories have clear winners and new entrants since 2020. Some of the tools listed in the original article are no longer maintained: Qri and Noms. Some other companies have de-emphasized their Git for Data angle: data.world and Quilt.
Data Catalogs
This category has consolidated around a few winners, with Hugging Face coming on strong as a new entrant. We have two platforms competing to be GitHub for machine learning/generative artificial intelligence models and one platform focused on being GitHub for databases.
Kaggle
- Tagline
- "The Home of Data Science"
- Initial Release
- April 2010
Kaggle started by hosting machine learning competitions. The contest runner posts an open dataset, sets the terms of the contest, and receives model submissions. The winner of the contest receives a cash prize.
Kaggle was purchased by Google in 2017 and continues to operate as a standalone entity. It has evolved into a social network of sorts for data scientists, continuing to run contests, but also hosting public datasets, modeling code in the form of notebooks, and models. The interface is beautiful. There is a thriving, vibrant community.
The datasets are distributed as CSV or JSON files. Datasets are versioned in the sense that older versions remain available on Kaggle, but there is no diff or merge tooling. So for tooling beyond data and model discovery, you are on your own, in a good way.
Hugging Face
- Tagline
- "The AI community building the future"
- Initial Release
- February 2021
Hugging Face was started by a team of artificial intelligence researchers who wanted a place to post their generative models. After a few years, they launched the Hugging Face Hub, which they conceived as a GitHub for artificial intelligence.
Like Kaggle, Hugging Face hosts models and datasets. Just recently, Hugging Face launched SQL querying of datasets. The datasets are versioned in the sense that old dataset versions can be viewed, but there is no diff or merge functionality.
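As an illustration of what that versioning looks like in practice, here is a minimal sketch of pinning a dataset to a revision with the `datasets` library. The dataset name is just a well-known public one; in real use you would pass a commit hash rather than a branch name.

```python
# A minimal sketch using the `datasets` library; "imdb" is just a
# well-known public dataset used for illustration.
from datasets import load_dataset

# `revision` pins the dataset to a commit hash, branch, or tag on the
# Hub, so a downstream run is reproducible even if the dataset changes.
ds = load_dataset("imdb", split="train", revision="main")
print(ds[0])
```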
Hugging Face seems to have the momentum in this space. I used it when working with nanoGPT to grab OpenAI's GPT-2 model to fine-tune.
To me, Hugging Face seems more open source-y and Kaggle feels more Google-y. I'm not an artificial intelligence practitioner, so I don't know how I would decide between the two if I wanted to publish an open model.
DoltHub
- Tagline
- "GitHub for Dolt"
- Initial Release
- October 2019
We will get to Dolt, the world's first and only version-controlled SQL database, later. But first, DoltHub.
DoltHub is a place on the internet to share Dolt databases. It adds Pull Requests, Issues, and a SQL workbench-style user interface on top of Dolt databases. If you are looking for the collaborative GitHub experience on data, DoltHub is that thing. Dolt is the only SQL database you can diff and merge, so DoltHub is the only place you can have a Pull Request workflow on data.
There is a small community of open data publishers, mostly publishing US Stock Market databases.
Unlike the previous two companies, DoltHub is single-purpose: GitHub for data. There is no built-in artificial intelligence angle, though you're welcome to use the data on DoltHub for whatever you wish, including building machine learning models.
Data Pipeline Versioning
In my view, this category is pretty much unchanged since 2020 with both products carving out their niche in the space.
Pachyderm
- Tagline
- "Reproducible Data Science at Scale!"
- Initial Release
- May 5, 2016
- GitHub
- https://github.com/pachyderm/pachyderm
Pachyderm is a data pipeline versioning tool. In the Pachyderm model, data is stored as a set of content-addressed files in a repository. Pipeline code is also stored as a content-addressed set of code files. When pipeline code is run on data, Pachyderm models this as a sort of merge commit, allowing for versioning concepts like branching and lineage across your data pipeline.
We find the Pachyderm model extremely intriguing from a versioning perspective. Modeling data plus code as a merge commit is clever and produces some very useful insights.
This type of versioning is useful in many machine learning applications. Often, you are transforming images or large text files into different files using code, for instance, making every image in a set the same dimensions. You may want to reuse those modified files in many different pipelines and only do work when the set changes. If something goes awry, you want to be able to debug what changed to cause the issue. If you're running a large-scale data pipeline on files, as is common in machine learning, Pachyderm is for you.
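To make the model concrete, here is a sketch of the shape of a minimal Pachyderm pipeline spec (normally written as JSON or YAML), rendered as a Python dict for illustration. The repo, image, and command names are hypothetical. Each run of the pipeline produces an output commit whose lineage points back to both the input data commit and the pipeline version.

```python
# A hedged sketch of the shape of a classic Pachyderm pipeline spec,
# written as a Python dict for illustration; the repo name, container
# image, and command are hypothetical.
import json

resize_pipeline = {
    "pipeline": {"name": "resize"},
    # Input: each file matching the glob in the `images` repo becomes a
    # datum the transform runs over.
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
    # Transform: the container and command Pachyderm runs on each datum.
    # Inputs are mounted under /pfs/<repo>; results go to /pfs/out.
    "transform": {
        "image": "example/resize:latest",
        "cmd": ["python3", "/resize.py", "/pfs/images", "/pfs/out"],
    },
}

print(json.dumps(resize_pipeline, indent=2))
```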
DVC (Data Version Control)
- Tagline
- "Git for Data & Models"
- Initial Release
- May 4, 2017
- GitHub
- https://github.com/iterative/dvc
Similar to Pachyderm, DVC versions data pipelines. Unlike Pachyderm, DVC does not have its own execution engine. DVC is a wrapper around Git that handles large files (like git-lfs does) and versions code along with data. It also comes with some friendly pipeline hooks, like visualizations and reproduce commands. Most of the documentation has a machine learning focus.
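Beyond the command line, DVC ships a small Python API for reading data out of a DVC-tracked Git repository at a given revision. A sketch, with a hypothetical repo URL, file path, and tag:

```python
# A minimal sketch of DVC's Python API; the repo URL, file path, and
# Git tag below are hypothetical.
import dvc.api

# Open a file as it existed at Git tag v1.0; DVC resolves the Git
# revision and fetches the matching data from remote storage.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/project",
    rev="v1.0",
) as f:
    print(f.readline())
```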
DVC is lighter weight and more community-driven than Pachyderm. Pachyderm is more enterprise focused. If you are looking for data pipeline versioning without having to adopt an execution engine, check out DVC.
Version Controlled Databases
This category has changed a lot since 2020. Noms is read-only and no longer maintained. A couple of new entrants are making their mark in the OLAP and graph/document database categories. Dolt has become the dominant player in the versioned OLTP database space.
A little-known fact is that turquoise in your logo signals version control to potential customers. Dolt was first. We claim turquoise!
LakeFS
- Tagline
- "Scalable Data Version Control"
- Initial Release
- August 2020
- GitHub
- https://github.com/treeverse/lakeFS
LakeFS defines a new category: data lake versioning. "Data lake" is a relatively new term for unstructured or semi-structured data stored in large cloud storage systems like S3 and GCS. Data lakes exist in contrast to data warehouses, which are structured and SQL-based.
LakeFS sits in front of your cloud storage and adds data versioning to the data in your lake. You get commits, branches, and rollback. Merge is supported, but conflicts are detected at the file level; finer-grained merge is unavailable.
A file in this case can be quite large, more like a dataset or table than a single row of data. Data is shared at the file level between commits, but any change to a file means a new version of the whole file, so storage can grow quickly as versions accumulate.
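Because LakeFS exposes an S3-compatible endpoint, existing S3 clients can read branch-scoped data without modification; the branch rides along in the object key. A sketch with boto3, where the endpoint, credentials, repository, branch, and path are all hypothetical:

```python
# A minimal sketch against LakeFS's S3-compatible gateway; the endpoint,
# credentials, repository, branch, and key are all hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your LakeFS server
    aws_access_key_id="AKIAEXAMPLE",            # LakeFS access key pair
    aws_secret_access_key="example-secret",
)

# In LakeFS, the "bucket" is the repository and the key is prefixed with
# the branch (or commit) you want to read from.
obj = s3.get_object(Bucket="my-repo", Key="main/tables/events.parquet")
print(obj["ContentLength"])
```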
LakeFS is relatively new, launched in 2020, and the sponsoring company Treeverse is well funded. Expect more development from the company. We like what we see from a versioning perspective.
TerminusDB
- Tagline
- "Making Data Collaboration Easy"
- Initial Release
- October 2019
- GitHub
- https://github.com/terminusdb/terminusdb
TerminusDB has full schema and data versioning capability, but offers a graph database interface using a custom query language called Web Object Query Language (WOQL). WOQL is schema-optional. TerminusDB can also query JSON documents directly, similar to MongoDB, giving users a more document-database-style interface.
The versioning syntax is exposed via the TerminusDB Console or a command line interface. The versioning metaphors are similar to Git: you branch, push, and pull. See their how-to documentation for more information.
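As a rough sketch of what that looks like from Python with the terminusdb-client package (method names and auth details have shifted across client versions, so treat the specifics below as assumptions):

```python
# A hedged sketch using the terminusdb-client Python package; the server
# URL, credentials, database, and branch names are assumptions, and the
# connect() signature varies across client versions.
from terminusdb_client import WOQLClient

client = WOQLClient("http://localhost:6363")
client.connect(team="admin", user="admin", key="root", db="mydb")

client.create_branch("schema-experiment")  # branch, like Git
client.branch = "schema-experiment"        # work on the new branch
# Document-style write, similar in spirit to inserting JSON in MongoDB
client.insert_document({"@type": "Person", "name": "Ada"})
```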
TerminusDB is established. The company is very responsive, has an active Discord, and is well funded. If you think your database version control makes more sense in graph or document form, check them out.
Dolt
- Tagline
- "Git for Data"
- Initial Release
- August 2019
- GitHub
- https://github.com/dolthub/dolt
Dolt takes the Git for Data mantra rather literally. Dolt adapted Git's model and interface to the SQL database. This was no small feat: it required a new SQL database built from the storage engine up.
Dolt supports MySQL's full SQL dialect and combines it with the full Git command line. The target of version control is database tables instead of files. In SQL, Git functionality is exposed as system tables, functions, and stored procedures.
Dolt has a commit graph, content addressing, merge, conflicts, rebase, remotes, a staging area, cell-wise queryable diffs, and all the other esoteric Git functionality you know and love, exposed via SQL. If you never use any of these features, Dolt just works and performs like a MySQL database.
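Because Dolt speaks MySQL's wire protocol, any MySQL client can drive the version control features. Here is a sketch of a branch, commit, diff workflow from Python against a local `dolt sql-server`; the database, table, and branch names are made up for illustration.

```python
# A minimal sketch against a local `dolt sql-server` using pymysql; the
# database, table, and branch names are made up for illustration.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=3306, user="root", database="mydb", autocommit=True
)
with conn.cursor() as cur:
    cur.execute("CALL DOLT_CHECKOUT('-b', 'fix-prices')")             # new branch
    cur.execute("UPDATE prices SET close = 101.5 WHERE symbol = 'XYZ'")
    cur.execute("CALL DOLT_COMMIT('-a', '-m', 'Correct XYZ close')")  # Dolt commit
    # Cell-wise diff between main and the branch, as an ordinary query
    cur.execute("SELECT * FROM DOLT_DIFF('main', 'fix-prices', 'prices')")
    for row in cur.fetchall():
        print(row)
conn.close()
```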
MySQL not your thing? Doltgres, a Postgres-compatible Dolt, is in Alpha.
We're biased, but... look no further. Dolt truly is Git for Data.