Exploring DoltgreSQL through Type Compatibility

DOLTGRESSQLTECHNICAL
5 min read

We're steadily making progress on DoltgreSQL, which is a version of Dolt built to be a drop-in replacement for PostgreSQL. For those that may not know, Dolt is the world's first database that is built, from the ground up, with version control in mind. That means you can branch, merge, diff, etc. your data and schemas. You can even run joins across commits, unlocking a way of interacting with data that has not been seen before.

As Dolt is battle-tested and used in production across a multitude of companies, we did not want to start completely from scratch with DoltgreSQL. After all, the technology underlying Dolt's versioning capabilities works for all relational data, regardless of whether it's MySQL or PostgreSQL flavored. That underlying technology (the storage engine), however, is only one part of Dolt. There are two additional parts to Dolt, which are the MySQL emulation layer and the translation layer between the other two layers.

This division works great for building Dolt. The MySQL emulation layer is the go-mysql-server (GMS) project, which we inherited from source{d} to become the maintainers. It is strictly focused on MySQL compatibility, including all of its quirks and features. There is also our fork of Vitess, which we use to implement MySQL's wire protocol, along with parsing MySQL statements into an AST that GMS understands. Both of these create a few problems for DoltgreSQL though.

PostgreSQL isn't MySQL

The section header is probably the most obvious thing that you'll read all year, but it's worth mentioning because we all know how many ways that they're different. For starting DoltgreSQL though, their similarities meant that we could get a prototype up and running in less than a month. MySQL definitely has its own variant of SQL, but at its heart it is still SQL. A SELECT statement returns results, and an INSERT statement adds rows. As long as we could talk the protocol, we could simply pass queries to Vitess and GMS, and convert those results back into PostgreSQL's protocol.

This worked for the purposes of a prototype, and showed the feasibility of the implementation (amongst other considerations, such as a foreign data wrapper, an extension, etc.). It also showed that this approach really only worked as a prototype, as encountering Postgres-specific types would immediately break down. For example, both MySQL and Postgres will happily accept an INT or INTEGER since it's in the SQL specification, but Postgres also allows the forms INT2, INT4, etc., which would immediately fail to parse in Vitess.

Rather than extending Vitess to accept Postgres types, we decided to replace Vitess with our own parser that outputs its own AST. We then added a translation layer that converts the Postgres-oriented AST to the Vitess AST that GMS can understand. This greatly moved the needle forward for Doltgres, since it meant that we could now work with a statement's intent rather than its exact syntax. Idiomatic Postgres uses ::TYPE for casting, and we could simply translate that into the equivalent CONVERT(VALUE, TYPE) MySQL call.

This allowed us to build tooling that would allow us to measure our correctness relative to Postgres, similar to our correctness tracking for Dolt. Rather than simply knowing that we do not support "stuff", we can quantify that X statement parses but does not have a translatable equivalent into a Vitess AST, or that Y statement is unsupported by the parser and needs to be added, or that Z statement simply returns an incorrect result in some scenarios. While this has been great for development, there's still the elephant in the room: what do we do for concepts that do not have an equivalent in MySQL at all, such as many of the types?

To Extend or Replace

I mentioned it earlier, but it's worth repeating: the goal of GMS is to act as though it were MySQL. But that's actually its secondary purpose. It's primary purpose is to facilitate Dolt's goal of being a drop-in replacement for MySQL. The two statements may seem similar, but the nuance comes from Dolt being the primary driver. Since other projects completely unrelated to our own use GMS for their own purposes, we do our best to stray from breaking changes. However, if they must be done for Dolt, then they must be done. So how, then, does DoltgreSQL fit into this picture?

Well, in a way, it kind of doesn't. Types are a great example of this, since types are one of the core pillars of a relational database. In MySQL, types are sort of an abstract concept to define a specific contract with data. Saying that a column has the type INTEGER doesn't just mean that it stores whole numbers. It also defines the range of valid numbers, how those numbers are ordered, how to interpret those numbers as other types, etc. From a high level, this doesn't look any different than Postgres, since Postgres also has many of the same functionality with its type. One key difference, however, is that Postgres is explicit in its definitions. You can add a SMALLINT/INT2 and an INTEGER/INT4 because there exists the function int24pl. Addition is still commutative because there is the function int42pl. This is all explicitly defined, and of course that is part of what makes Postgres so extensible. Add a new function to the group, and now you can add two types together where it was previously not possible. In fact, types themselves are just a collection of functions, meaning all type functionally is explicitly declared somewhere.

Since GMS is faithful to MySQL, it also models types as these abstract concepts. In a fundamental way, they are not extensible, modifiable, etc. Doltgres needs the extensibility because it's core to the definition of a type in Postgres, and is a bit at-odds with the abstract nature of MySQL types.

In order to keep up our rate of progress, we decided to shift as much responsibility for type handling into Doltgres for Postgres types, and introduced a new opaque type in GMS and the translation layer in Dolt. While GMS expects specific types in specific situations and will often fail if it sees the new opaque type in some scenarios, these scenarios are often scenarios where GMS is executing logic that is specific to MySQL. For example, GMS knows what to do when adding two integer types, but doesn't know what to do with the opaque type, so that prompted a full reimplementation of operators in Doltgres, which properly models operators as functions rather than intrinsic properties of types. This means that we're incrementally working our way toward complete compatibility with Postgres, while actually being usable in some contexts today.

Strategy Benefits

Besides compatibility, our incremental approach means that you can actually use Doltgres, today, for some use-cases. It's definitely not production ready (we still have a ton of work to do), but one great use-case for it is through replication. My coworker wrote an excellent blog demonstrating how to implement it in Golang, which shows some of the knowledge we gained while working on it. As long as the type is supported in Doltgres, then we expect replication to just work. We're also always working on adding new types, and we've added 16 new types in just the latest release. We publish a general overview of our types compatibility on our docs.

Conclusion

DoltgreSQL still has a lot of work left, but everyday we're getting closer to our goal of full compatibility! Versioned databases are truly a game-changer, and we're excited to bring the concept of Dolt to the PostgreSQL audience.

You can find us on Twitter/X and chat with us on Discord. We also welcome all issue reports! Thank you for reading!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.