Introducing Global Tables

7 min read

At Dolthub, we're a small team that runs a lean ship. This means we rely on user requests and feedback to help us determine what to prioritize. Literally the most important factor driving new features is whether we believe that people will get value from it.

Oftentimes users only want to version-control some of their data. A common request we get from users is to support non-versioned tables that can be accessed from multiple branches. We call these Global Tables, because they exist completely outside of the branch framework.

We have a feature that enables this: the dolt_ignore system table indicates which user tables should not be staged for commits. But ignored tables are still tied to a specific branch, where global tables wouldn't be tied to any branch: they would simply exist on every branch as a single, non-versioned table.

We agree that there's value in being able to create tables like this. There are many uses cases where you have versioned data that depends on non-versioned metadata, and there are legitimate reasons why the metadata should not or cannot be subject to version control. This is likely to happen if you're using Dolt to implement version control features in your own application.

Now, Dolt does offer a couple of ways to access data outside of your current branch:

  • To access a table that's been committed to another branch or tag, you can use the AS OF syntax: SELECT * FROM table_name AS OF 'v1.0';
  • You can even access uncommitted data from another branch's working set by using a fully qualified database name: SELECT * FROM `db_name/main`.table_name;

However, this only helps you if you have full control over your database schema and over the SQL that you generate. If you're using Object-relational mapping software like Django or Laravel, then you have much less control over the generated SQL, and the ORM may require that certain tables exist in the same schema.

I get very excited about people using Dolt to build cool new applications that wouldn't otherwise be possible. And Global Tables seemed like a no-brainer, easy-to-implement feature. But as we began designing the feature, we realized that it actually has a ton of subtle gotchas:

Reasoning about transactions becomes a lot harder.

For concurrently-accessed databases, transactions are atomic actions that either happen fully or not-at all. They're a core part of the ACID concept of concurrency modeling, and database servers need to be able to detect when two concurrent transactions could conflict with each other.

For example: what happens two branches both attempt to modify the same global table? Currently, Dolt requires that each transaction only modifies tables on a single branch: this makes transaction conflicts easy to detect, since it only requires looking at the current branch. But if a transaction can include edits to global tables, then now any transaction could conflict with any other transaction. Even if this is implemented correctly, it could create a huge performance bottleneck if multiple branches have concurrent transactions.

Global tables aren't tracked in the version history of individual branches.

This means that some version control operations might no longer behave as expected. If a branch depends on global tables, rolling that branch back to an older commit won't roll back the global table and might result in different behavior than when that commit was first created.

If the branch has check constraints or foreign keys that reference the global table, those constraints may become invalid. A major feature of Dolt is the ability to capture a snapshot of the database at a specific point in time, but that's simply not possible for queries that reference global tables, even if the user wants to.

The data can change out from under you.

Not only is it possible for the data in global tables to change, the schema could change too. The table could even be dropped entirely. So not only is it possible for constraints to be violated, the constraint definition itself might become nonsensical if the table it references no longer exists.

Implications on our data model

It's not obvious how to store global table: it would necessarily require a change in our data model. But if we change our data model, older versions of Dolt might not incorrectly handle newer database.

Implications on our remote API

Dolt is decentralized, just like Git. You can push and pull from Dolt remotes. Should global tables be replicated to remotes?

In the end, we were forced to conclude that while global tables are simple and intuitive in the most common cases, supporting them in the general case introduces a bunch of unwanted complexity. This was frustrating for us, because the original motivation was sound. Users wanted global tables for situations where they made sense, and it was unsatisfying to drop the idea because there were other cases that didn't.

So if the common cases made sense, what if we just supported those common cases? Put another way, could we implement a feature that was like global tables in all the ways that mattered, allowed the functionality that people were asking for, but had extra guard rails in place to avoid problems?

Introducing Global Tables

With the next release of Dolt, we're introducing a new dolt system table: dolt_global_tables. In the simplest use case, this table affects how table names are resolved, allowing a table name to resolve to a table stored on a different branch. In practice, this allows for nearly all of the benefits of true global tables, and can be extended to add more features as needed.

Let's look at an example.

The schema of dolt_global_tables looks like this:

CREATE TABLE `dolt_global_tables` (                              
    `table_name` varchar(65535) NOT NULL,                          
    `target_ref` varchar(65535),                                   
    `ref_table` varchar(65535),                                    
    `options` varchar(65535),                                     
    PRIMARY KEY (`table_name`)                                     
)

Each row of this table can be read as: "When resolving table names on this branch, a table named table_name should instead resolve to the table named ref_table on target_ref.

  • If target_ref is a local branch, then the table will be writable. If target_ref is a tag, commit hash, or remote branch, then the table will be read-only.
  • The ref_table column will usually be omitted, but can be used if you want the table name on your child branch to be different from the table name on the "global" branch.
  • Like dolt_ignore, this configuration is per-branch and versioned.
  • The table_name column can be a pattern, the same patterns supported by dolt_ignore. This is useful when all the global tables share a common prefix:
INSERT INTO dolt_global_tables (table_name, target_ref, options) VALUES ('global_*', 'main', 'immediate')

Putting this all together, if your workflow is built around a single main branch and other branches that sprout from it, then you might consider something like the following in your main branch:

INSERT INTO dolt_global_tables (table_name, target_ref, options) VALUES ('global_*', 'main', 'immediate');
INSERT INTO dolt_ignore (pattern, ignored) VALUES ('global_*', true);
CALL dolt_add('dolt_global_tables', 'dolt_ignore');
CALL dolt_commit('-m', "commit dolt_global_tables and dolt_ignore system tables so they'll be inherited by new branches");

Now, any branches that split from main will have dolt_global_tables properly configured. Any queries on these branches that reference a table whose name begins with global_ will instead resolve to the branches on main instead. Since these tables are meant to be unversioned, the added row to dolt_ignore guarantees that they'll never be accidentally staged for a commit.

Of course, these tables are still part of Dolt and can still be version controlled later if you decide to do so: simply remove the dolt_ignore entry from the main branch or use dolt_add('--force'), and these tables will be staged on the main branch.

Right now, the only supported value for the options column is "immediate", which means that changes made to the table in one branch can be immediately read on another branch. In order to avoid the problems described above, this mode has some limitations:

  • You cannot define foreign key constraints on immediate global tables (in either direction). In a future update it will be possible to force foreign key constraints on immediate global tables, but if another branch modifies the table, it may cause these constraints to be silently invalidated. The DOLT_VERIFY_CONSTRAINTS() stored procedure can detect this if it happens.
  • You cannot have database transactions that modify both global tables and versioned tables. If you attempt to commit such a transaction, it will fail and be rolled back.

For cases where foreign keys or SQL transactions are required, we are currently implementing an alternate mode: "manual". In this mode, changes made to a global table are not immediately visible on other branches. The user must call a new system procedure DOLT_GLOBAL_TABLE_PUSH() in order for changes to be seen on other branches.

This mode doesn't have the same restrictions as "immediate" mode: the user can define foreign key constraints on global tables, and can modify global tables and versioned tables in the same SQL transaction.

But it has its own tradeoffs:

  • If multiple branches are making conflicting changes to a global branch, this can cause a conflict akin to a merge conflict that the user will have to resolve.
  • When a global table changes, each branch will observe those changes the next time it accesses the global table. This is indistinguishable from the "immediate" mode, except when a branch has a foreign key constraint on the global table with a referential action like ON DELETE CASCADE. Since the cascade only happens once the table is accessed, this makes it possible for reading a table to have side effects on other tables, something that isn't normally possible in MySQL.

We recommend using "immediate" mode when possible, and only using "manual" mode when foreign keys are needed, or when you need to modify both versioned and global tables in the same SQL transaction.

This is a work in progress: we're still in the process of implementing the manual mode. But users that would benefit from the immediate mode don't need to wait. This also means that if there's features you want global tables to have, you should join our Discord or message us on BlueSky and let us know. If you feel like this almost fits your use case but needs a little nudge, please tell us. We want to make Dolt as useful as possible so that people build amazing stuff with it.

That's all for now. Until next time, thanks for reading.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.