# Adding Types to DoltgreSQL
In November 2023, we officially started development on DoltgreSQL, a version of Dolt built to be a drop-in replacement for PostgreSQL. For those who may not know, Dolt is the world's first database built, from the ground up, with version control in mind. That means you can branch, merge, and diff your data and schemas. You can even run joins across commits, unlocking a way of interacting with data that has not been seen before.
Due to MySQL's popularity, Dolt was created to be a drop-in replacement for MySQL. This satisfied a lot of users; however, many others use PostgreSQL instead, and that's where DoltgreSQL comes in. Rather than starting from the very bottom and building up to DoltgreSQL, we've decided to "fork" Dolt and build all of the necessary PostgreSQL functionality on top of that fork. This lets us build on the same engine that underpins Dolt as we progressively work toward full compatibility with PostgreSQL.

This blog details one of the changes that we've had to make while working on DoltgreSQL: in particular, how types are handled, since Postgres and MySQL handle them a bit differently.
## Dolt's First Type System
Before we dive into the changes that have been made to the type system for DoltgreSQL, let's first go over the evolution of the existing type system in Dolt.
During the initial development of Dolt, we actually did not build it as a SQL database. Instead, we envisioned a sort of versioned data storage that users could use to distribute and collaborate on data. Dolt would be used for data importing and exporting, with DoltHub functioning as the hub for sharing and collaboration. Our tagline was Git For Data. As a result, we only needed a few "core" data types that we felt were representative of the vast majority of data:
| Type | Description |
|---|---|
| `bool` | Represents either `true` or `false` |
| `float64` | IEEE-754 double-precision number |
| `int64` | Integers from -9223372036854775808 to 9223372036854775807 |
| `string` | An array of characters |
| `uint64` | Integers from 0 to 18446744073709551615 |
| `uuid` | 128-bit value as defined by RFC 4122 |
Regarding the storage, each data type was represented by an identifier, which we referred to as a `kind`. While scanning the data, we could determine which data was present by first looking at the `kind`, which also told us how the following data was encoded.
```
 Kind                      Kind
  ↓   Data→                 ↓   Data→
-------------------------------------------------
| 2 |   |   |   |   |   | 4 |   |   |   |   |   |
-------------------------------------------------
```
Let's say that a `kind` with the value `2` represents a `float64`. Upon reading the value `2`, we would know that the next 8 bytes represent a `float64` value, which we could also read. Then, we'd come across a `kind` with the value `4`, which may represent a `string`. Since a `string` has a variable number of characters, we read the next 4 bytes as the string's length, and then read the number of bytes equivalent to that length.
This sufficed for a few months, but we eventually landed on the idea of adding SQL support, since SQL is the de facto method of interacting with data. At the time, MySQL was the most popular database for our target audience (and go-mysql-server already existed), so we decided to base our SQL flavor on MySQL.
## Dolt's First Type System Extended
With the change to a SQL implementation, we also reworked our schemas to include MySQL types rather than our previous `kind`s. We still relied on the `kind` at the storage layer: a `BIGINT` in MySQL is still just an `int64`, so we knew to map that MySQL type to its respective `kind`.
MySQL has over 40 types, while our initial type system had only 6 (including `bool` and `uuid`, which aren't valid types in MySQL), so adding `kind`s for all of them would have been a fairly large effort.

Rather than adding new `kind`s, we took a "shortcut" and reused some of them. For example, a MySQL `INT` is an `int32`, which is just an `int64` with a reduced range, so we can easily store an `int32` inside of our `int64` `kind`. This wasted some disk space, but it greatly reduced the number of changes we needed to make as we added more MySQL types. Even MySQL types such as `ENUM` were stored in an `int64`, since internally an `ENUM` is just another integer. We did have to add a few more `kind`s, but overall we were able to support every MySQL type by mapping it to a `kind`.
## Dolt's Second Type System
Our first type system allowed us to grow Dolt fairly rapidly, as we were less worried about the storage engine. We could quickly iterate on features and correctness with respect to MySQL. That is, until we needed to change our storage engine. Dolt was initially built on Noms, which we leveraged for all of our versioning capabilities. This worked great for a while, but eventually we needed to focus on raw performance for the OLTP use case, and Noms was built in a more general direction than what we needed. We wrote a few blog posts that go into more detail on that side of things, so I won't repeat them here.
On the type side, we added encodings, which play a very similar role to `kind`s in the old engine, but are far more flexible, easier to implement, and far more specific to each type. This meant that we could store an `int8` in a single byte, rather than storing it in an `int64` and wasting 7 bytes of space, since `int8` had its own encoding. Similar to `kind`s, encodings also handle all of the sorting rules and other information relevant to the data's layout on disk.
This new type system works perfectly for MySQL, but it has a fundamental flaw with regard to Postgres types.
## PostgreSQL Type Flexibility
Postgres has a far more flexible type system than MySQL. I'm not just referring to indexes being able to declare the NULL ordering though, as we could add a new encoding for such variations. I'm referring to the fact that PostgreSQL allows users to create their own types. Encodings are statically defined within the code, but Postgres treats types as data.
On the Doltgres side, we do not yet support user-defined types, and we're not quite sure how we're going to implement them just yet. We do know that we need to remove the static encoding dependency, though, so we came up with a solution.
First, though, we must make a brief detour to go-mysql-server (abbreviated GMS), which is the library that Dolt uses to implement all of MySQL's "non-storage engine" functionality. GMS defines types through a collection of functions (in Go, these are referred to as interfaces).
```go
// Type represents a SQL type.
type Type interface {
	CollationCoercible
	// Compare returns an integer comparing two values.
	// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
	Compare(interface{}, interface{}) (int, error)
	// Convert a value of a compatible type to a most accurate type, returning
	// the new value, whether the value is in range, or an error. If |inRange| is
	// false, the value was coerced according to MySQL's rules.
	Convert(interface{}) (interface{}, ConvertInRange, error)
	// Equals returns whether the given type is equivalent to the calling type. All parameters are included in the
	// comparison, so ENUM("a", "b") is not equivalent to ENUM("a", "b", "c").
	Equals(otherType Type) bool
	// MaxTextResponseByteLength returns the maximum number of bytes needed to serialize an instance of this type
	// as a string in a response over the wire for MySQL's text protocol – in other words, this is the maximum bytes
	// needed to serialize any value of this type as human-readable text, NOT in a more compact, binary representation.
	MaxTextResponseByteLength(ctx *Context) uint32
	// Promote will promote the current type to the largest representing type of the same kind, such as Int8 to Int64.
	Promote() Type
	// SQL returns the sqltypes.Value for the given value.
	// Implementations can optionally use |dest| to append
	// serialized data, but should not mutate existing data.
	SQL(ctx *Context, dest []byte, v interface{}) (sqltypes.Value, error)
	// Type returns the query.Type for the given Type.
	Type() query.Type
	// ValueType returns the Go type of the value returned by Convert().
	ValueType() reflect.Type
	// Zero returns the golang zero value for this type
	Zero() interface{}
	fmt.Stringer
}
```
For example, the `Compare` function takes two values and returns whether one value should be sorted before the other, or whether they're equivalent. Dolt stores these types within the schema, and also has encodings that match the behavior of these types. Since Doltgres builds on Dolt, Doltgres also builds on GMS, so we added a few more functions in GMS that handle the functionality of the encodings. That allows Doltgres to define Postgres types through this new set of functions without having to worry about modifying encodings.
```go
// ExtendedType is a serializable type that offers an extended interface for interacting with types in a wider context.
type ExtendedType interface {
	sql.Type
	// SerializedCompare compares two byte slices that each represent a serialized value, without first deserializing
	// the value. This should return the same result as the Compare function.
	SerializedCompare(v1 []byte, v2 []byte) (int, error)
	// SerializeValue converts the given value into a binary representation.
	SerializeValue(val any) ([]byte, error)
	// DeserializeValue converts a binary representation of a value into its canonical type.
	DeserializeValue(val []byte) (any, error)
	// FormatValue returns a string version of the value. Primarily intended for display.
	FormatValue(val any) (string, error)
	// MaxSerializedWidth returns the maximum size that the serialized value may represent.
	MaxSerializedWidth() ExtendedTypeSerializedWidth
}

type ExtendedTypeSerializedWidth uint8

const (
	// ExtendedTypeSerializedWidth_64K represents a variably-sized value. The maximum number of bytes is (2^16)-1.
	ExtendedTypeSerializedWidth_64K ExtendedTypeSerializedWidth = iota
	// ExtendedTypeSerializedWidth_Unbounded represents a variably-sized value. The maximum number of bytes is
	// (2^64)-1, which is practically unbounded.
	ExtendedTypeSerializedWidth_Unbounded
)
```
In the new Dolt engine, a row stores either direct values or references to values. References are useful for values that may have an arbitrary size, as inlining such values would have a negative impact on performance. All values fall into one of these two categories, so we've added two special encodings (`ExtendedTypeSerializedWidth_64K` and `ExtendedTypeSerializedWidth_Unbounded`) for these two cases. The new set of functions in GMS includes one that reports whether the type should be inlined or referenced (`MaxSerializedWidth()`). In addition, whenever one of the two new encodings is encountered in the new engine, rather than relying on some implicit behavior of the encoding, the engine calls the corresponding function from the new set.
For example, take the sorting of values. Postgres allows users to define how values should sort relative to one another, and this is now a function call in the engine rather than static behavior. You can see this by comparing the original comparator with the new comparator. Although we've not yet implemented it, the function call (`SerializedCompare`) could dispatch to a set of expressions that a user defines, which would achieve our goal of representing types as data. Even better, we can do all of this from within the Doltgres package, which greatly increases our productivity, since there's only one location where we need to define the functionality.
So far, this has allowed us to begin adding the array types, which neither GMS nor Dolt was explicitly built to handle, and the new flexibility let us prototype these array types in mere hours.
## Conclusion
DoltgreSQL is gaining speed, and we're having fun tackling all of these relatively hard problems! At this early stage we may not yet be ready to replace PostgreSQL outright, but we're already finding users who want to use Doltgres as a replication target so they can take advantage of the versioning features. We'll continue to add new types and expand our type coverage in the near future, so stay tuned!
You can find us on Twitter/X and chat with us on Discord. We also welcome all issue reports! Thank you for reading!