An Overview of BranchBench

June 3, 2026

At DoltHub we are building the world’s first version-controlled SQL database, Dolt¹. Dolt is built from the ground up to support branching, diffing, and merging. It stores table data in a content-addressed Merkle tree, like Git does for source code repositories. It operates exactly like a SQL database, but the entire database value supports familiar Git primitives, including diff, branch, and merge, and interactions with remotes like push, pull, fetch, and clone.

We have been working on Dolt for almost eight years now. We originally started developing Dolt to support an envisioned data sharing use case. Based on feedback from customers, we added more and more OLTP support, because people wanted to add branch, diff, and merge to their own database-backed applications. Back in November of 2025, we started seeing more potential and success from agentic workflows, particularly around coding, and we quickly developed a hypothesis—agents need version control in order to safely make changes to a system of record. They need to be able to see their own changes, track those changes over time, and submit them for approval in a principled way.

Since then, we have been working to make Dolt the best database for agentic workflows. Recently, a research group out of Columbia University published a benchmark suite, as well as a few pre-print papers, which speaks to a similar vision for agentic workflows against databases. The paper immediately piqued our interest, both because it directly compared Dolt against a number of offerings which provide a similar set of features and because it presented our thesis around agentic workflows operating on databases in new language and in a new context.

After spending some time with the benchmark suite and the paper, I wanted to provide a quick introduction to it and provide a bit of context about how we are responding to it at DoltHub.

Overview

BranchBench is a benchmark suite introduced by the DAP Lab at Columbia University. It is described in a pre-print paper available on arXiv, as well as in a shortened paper written for the CAIS’26 SAO Workshop. The primary researchers present some of the top-line takeaways in their own words in a blog post on the DAP Lab blog, titled Branchable Databases Aren’t Ready for Agentic Workloads.

The blog post title makes clear a fundamental takeaway of the arXiv pre-print—no existing benchmarked offering meets all the requirements that the researchers envision for agentic workflows. But what are these agentic workflows and how are they described and ultimately quantified?

Agentic Workflows

In the framework presented by BranchBench, agentic workflows use the database as working memory. From the blog post above:

Future agents will create thousands [of branches] — forking a candidate state, mutating it, evaluating the result (perhaps against other branches), pruning irrelevant states, and repeating.

The researchers identify five archetypal agentic workflows, each of which represents a concrete point in a parameterized space of data-oriented agentic workflows. These five workflows are:

Agentic Software Engineering: Each instance of an agent independently developing a feature can branch the database to apply schema changes, migrations, backfills and updates. Running the test suite against this branch is read/write heavy. Branching can be used to checkpoint state-mutating actions to provide reliable revert.
Failure Reproduction: An agent forks a production database from a point before a data error was introduced, then binary searches on the length of the transaction log prefix that is replayed against the fork, checking at each iteration for the presence of the data error. Like git bisect operating with a newly written test, this allows an agent to identify the minimal prefix of the transaction log which produces the error. Further agentic analysis and pruning could help to identify more meaningful minimal reproductions so a fix could be developed.
Data Curation: A continuous monitoring process identifies potential data anomalies and spawns agentic workflows to explore ways of cleaning them. Each cleaning workflow creates a branch to clean an individual anomaly, and those workflows perform many mutations, data scans, validation queries, etc., against their branch as they explore possible avenues for improvement.
Monte Carlo Tree Search: The search uses a fixed branching factor at each layer, and each iteration applies writes to a new branch to create a new candidate state. The candidate states are evaluated against each other, and the most promising ones are explored further. The resulting tree has branches which grow deep and narrow.
Monte Carlo Simulation: Running many independent trials against a database with randomized inputs and aggregating the final results. This workflow is high fan-out, low depth, with each trial executing many write transactions against its own branch.
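
The binary search in the Failure Reproduction workflow can be sketched in a few lines. This is a minimal illustration, not code from BranchBench: `has_error_after` is a hypothetical callback standing in for "fork the database, replay the first n transactions, and run the error check". The sketch assumes the error is monotonic (once a prefix produces it, every longer prefix does too), that the empty prefix is clean, and that replaying the full log reproduces the error.

```python
from typing import Callable


def minimal_failing_prefix(log_length: int, has_error_after: Callable[[int], bool]) -> int:
    """Return the smallest n such that replaying the first n transactions shows the error."""
    # Invariant: a prefix of length lo is clean, a prefix of length hi is bad.
    lo, hi = 0, log_length
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_error_after(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in for a real check: pretend the 7th transaction introduces the error.
probes = []


def fake_check(n: int) -> bool:
    probes.append(n)  # in a real run, each probe would be one fresh fork
    return n >= 7


result = minimal_failing_prefix(10, fake_check)
print(result, probes)
```

Each probe corresponds to one fork of the pre-error state, so the number of forks grows with the logarithm of the log length rather than with the log length itself.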

The Macrobenchmark Framework

BranchBench maps these five agentic workflows to a benchmark framework which it uses to evaluate the runtime performance and characteristics of various branchable databases. The benchmark framework works as follows.

A given benchmark run runs a fixed number of concurrent workers. Each worker runs a fixed number of steps. Each step does the following: (1) create a new branch from an existing candidate branch, (2) apply parameterized DDL and DML operations to the newly created branch, (3) execute read queries on the newly mutated branch, representing evaluation or testing of the agentic process, (4) with workflow-specific probability, prune the newly created branch.
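
The four phases of a step can be sketched as follows. This is an illustration under stated assumptions rather than BranchBench's actual driver: `RecordingDB` is a hypothetical in-memory stand-in that only records calls, `run_step` is our own name, and parent selection is a plain random choice that ignores the fanout and depth limits.

```python
import random


class RecordingDB:
    """Hypothetical in-memory stand-in for a branchable database."""

    def __init__(self):
        self.branches = {"root"}
        self.calls = []

    def create_branch(self, parent, name):
        assert parent in self.branches
        self.branches.add(name)
        self.calls.append(("create", name))

    def apply_changes(self, branch):
        self.calls.append(("write", branch))  # DDL and DML would run here

    def run_reads(self, branch):
        self.calls.append(("read", branch))  # evaluation queries would run here

    def delete_branch(self, branch):
        self.branches.discard(branch)
        self.calls.append(("prune", branch))


def run_step(db, candidates, name, prune_probability, rng):
    """Run one branch / mutate / read / maybe-prune cycle; return the surviving branch or None."""
    parent = rng.choice(candidates)
    db.create_branch(parent, name)        # (1) branch from an existing candidate
    db.apply_changes(name)                # (2) DDL and DML on the new branch
    db.run_reads(name)                    # (3) read queries on the mutated branch
    if rng.random() < prune_probability:  # (4) prune with workflow-specific probability
        db.delete_branch(name)
        return None
    return name


rng = random.Random(0)
db = RecordingDB()
kept = run_step(db, ["root"], "w0-s0", prune_probability=0.0, rng=rng)
pruned = run_step(db, ["root", "w0-s0"], "w0-s1", prune_probability=1.0, rng=rng)
```

A worker would call `run_step` once per step, adding each surviving branch to the candidate list for later steps.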

A given benchmark run can also run and benchmark a fixed number of cross-branch queries across the frontier of available branches as the benchmark run proceeds.

To control the shape and roll-out of the resulting branch tree, and the work each step of the benchmark does, the workflows are parameterized by the following values:

Parameter	Description
Workers	The number of concurrent workers operating on the branch tree.
Steps per Worker	How many branch/DDL+DML/query/maybe-prune cycles each worker takes.
Cross-branch Queries	How many cross-branch queries are run throughout the whole workload (not per Worker).
Root Fanout	The number of branches which will be created from the initial database state.
Inner Fanout	The maximum number of branches which can be created from a child database state.
Depth	The maximum depth of the branch tree. No branch deeper than this is a candidate for new branches.
Num Schema Changes	How many DDL operations to perform per step.
Num Data Changes	How many DML operations to perform per step.
Num Queries	How many read queries to perform per step.
Prune Probability	The probability that a given branch is pruned at the end of its step.

Here are two concrete examples from the paper. The parameterization for the Monte Carlo Simulation workflow described above has workers = 1000, steps per worker = 1, root fanout = 1000, inner fanout = 0, depth = 1, and prune probability = 1. The end result is 1,000 concurrent workers, each of which creates a single branch off the root database state, runs its DML and read queries, and then deletes the branch it created.

Behaving quite differently, the Agentic Software Engineering workflow has workers = 5, steps per worker = 20, root fanout = 5, inner fanout = 3, depth = 3, and prune probability = 0.1. It creates five concurrent workers, all of which immediately create branches off the root database value. As the benchmark proceeds, each worker creates twenty branches in total, either off the root or an existing branch. The root has five branches coming off it, each interior node has up to three branches, and no path through the tree is more than three branches deep.
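
To make the arithmetic behind these two parameterizations explicit, here is a small sketch. The `WorkflowParams` class and its property names are our own illustrative choices, not BranchBench's API, and it only carries the parameters quoted above. It relies on the fact that every step creates exactly one branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowParams:
    """Hypothetical container for the tree-shape parameters of one workflow."""

    workers: int
    steps_per_worker: int
    root_fanout: int
    inner_fanout: int
    depth: int
    prune_probability: float

    @property
    def branches_created(self) -> int:
        # Each step creates exactly one new branch.
        return self.workers * self.steps_per_worker

    @property
    def expected_pruned(self) -> float:
        # Each new branch is pruned independently with the same probability.
        return self.branches_created * self.prune_probability


simulation = WorkflowParams(
    workers=1000, steps_per_worker=1, root_fanout=1000,
    inner_fanout=0, depth=1, prune_probability=1.0,
)
software_engineering = WorkflowParams(
    workers=5, steps_per_worker=20, root_fanout=5,
    inner_fanout=3, depth=3, prune_probability=0.1,
)

for params in (simulation, software_engineering):
    print(params.branches_created, params.expected_pruned)
```

Under these assumptions the simulation workflow creates 1,000 branches and prunes all of them, while the software engineering workflow creates 100 branches and prunes about 10 of them on average.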

The end result is that, in addition to the different DDL, DML and read queries which the different agentic workflows run, each workflow creates differently shaped trees of branches. Some are deep and narrow, some are shallow and broad, and some are bushy.

The Results

Definitely see the paper for the full results. No tested configuration was able to complete all of the given macrobenchmarks in the allotted time of two hours, either because the capabilities of the system did not meet the requirements of the test or because the benchmark run timed out.

The paper digs into specific performance differences between Neon and Dolt. It finds that Dolt has much faster branching capabilities, but its query performance is quite a bit worse on some of the workloads.

Observations

We love the presentation of what agentic workflows against databases will look like in the future and the assertion that branchable databases are vitally important for agentic workflows. The description of five concrete agentic workflows, each having different fan out and distributions of data and query operations, is useful for building an understanding of what needs to be optimized for and which tradeoffs are appropriate in which contexts.

We think Dolt has a major advantage over any of the presented competitors for agentic workflows. The paper necessarily builds its benchmark suite and its discussion around a set of common functionality available in the targeted technologies. But Dolt supports branches as a first-class citizen within the database itself, and it supports more operations on them than the presented alternatives. Diff and merge are particularly important for these kinds of workflows. Permissions and agentic access are another consideration: Dolt has first-class branch permissions, and branch management is done directly within the SQL layer using stored procedures. All of the presented alternatives need separate access control and things like filesystem or API access to accomplish the branching capabilities.

In a similar way, the macrobenchmark framework as presented comes with a couple of caveats compared to actual agentic work.

Cross-branch queries are run against separate database connections, and how to accomplish a cross-branch query lives as global knowledge in the driver. The driver is querying branches that it knows exist based on the state of the workers so far, not based on the state of the database. In reality, Dolt supports cross-branch queries directly against the database, so that things like joins, aggregates and windows do not have to be reimplemented on the client. And Dolt supports SQL querying of available branches and their metadata.
Perhaps somewhat similarly, an agentic benchmark developed directly against Dolt might choose to parameterize the workflows slightly differently. Instead of having a fixed number of workers, all running concurrently, and at each step each worker choosing an existing eligible branch to fan out from, an agentic workflow could also be described by the agent owning its own branch head and having the capability to spawn subagents to work on their own branches. If subagents were able to be detached or structurally nested, then in some workflows agents could even block on the completion of their subagents. This would potentially provide a natural way to describe opportunities for cross-branch queries, roll out, summarization, pruning, etc.
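
To make the first caveat concrete, here is a hedged sketch of how a client might build a single cross-branch query for Dolt's SQL layer instead of fanning out over per-branch connections. It relies on Dolt's `dolt_branches` system table and `AS OF` clause. The table name `trial_results` and the helper `cross_branch_count_sql` are hypothetical, and the snippet only builds SQL strings; it has not been run against a Dolt server.

```python
import re

# Branch names can be read from inside the database via a Dolt system table.
LIST_BRANCHES_SQL = "SELECT name FROM dolt_branches"

# Conservative allow-list, since names are interpolated into the SQL text.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_./-]+")


def cross_branch_count_sql(table: str, branches: list[str]) -> str:
    """Build one statement that counts the rows of `table` on each branch."""
    if not branches:
        raise ValueError("need at least one branch")
    for ident in [table, *branches]:
        if not _SAFE_NAME.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    selects = [
        f"SELECT '{b}' AS branch, COUNT(*) AS row_count FROM `{table}` AS OF '{b}'"
        for b in branches
    ]
    return "\nUNION ALL\n".join(selects)


sql = cross_branch_count_sql("trial_results", ["main", "trial-001"])
print(sql)
```

Because the result is a single statement, the aggregation across branches happens in the database, and the list of branches can come from `LIST_BRANCHES_SQL` rather than from driver-side bookkeeping.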

Future Work

The paper presents some performance numbers for read queries against Dolt which leave a lot to be desired compared to Neon. Our next steps at DoltHub involve reproducing these numbers and better understanding where the bottlenecks are. Dolt’s storage technology has higher inherent overhead compared to Postgres, but there is no good excuse for that translating into a query that takes approximately 4,000x as long to execute on Dolt vs. Neon². Where Dolt is slower, most queries are within a factor of 2-5x, which is in line with where our internal Doltgres TPC-C benchmarking lands vs. Postgres.

We plan to add BranchBench to our standard benchmarking suite and start tracking Dolt and Doltgres’ performance on it over time.

Are you interested in agentic workloads against branchable databases? Stop by our Discord and chat with us about it.

Footnotes

¹ We also develop Doltgres, a Prolly tree backed SQL database that supports PostgreSQL syntax, instead of MySQL. We have been recently experimenting with seeing how far vibe coding/agentic engineering can get on porting Dolt’s fundamental ideas to other database engines and paradigms, such as DoltLite, where we replace SQLite’s storage engine with Prolly trees, and DumboDB, where we make a Prolly tree backed database speak MongoDB’s wire protocol.
² As the Read queries in the Simulation workflow did; see page 11 of the arXiv paper.
