Doltgres Correctness Update

November 12, 2024

7 min read

We are hard at work building Doltgres, the Postgres-compatible version of Dolt, the world's first and only version-controlled SQL database. We are planning a public Beta launch of Doltgres in the first quarter of 2025, and one of the most important metrics for calling the product Beta is correctness.

In this blog, we'll discuss how we measure Doltgres's correctness, and what kind of progress we're making on each of these indicators.

sqllogictest

Doltgres runs the standard set of SQL correctness tests defined by the SQLite3 project, known as sqllogictest. These tests verify that the server returns the same results at Postgres itself, and runs some very complex queries, many of which are randomly generated. Here's an example of a query run by sqllogictest:

SELECT pk FROM tab1 WHERE ((((col3 > 0 AND ((col0 >= 7 AND col0 <= 2)
AND (col0 <= 4 OR col4 < 5.82 OR col3 > 7) AND col0 >= 4) AND col0 <
0) AND ((col1 > 7.76))))) OR ((col1 > 7.23 OR (col0 <= 3) OR (col4 >=
2.72 OR col1 >= 8.63) OR (col3 >= 3 AND col3 <= 4)) AND ((col0 < 2 AND
col3 < 0 AND (col1 < 6.30 AND col4 >= 7.2)) AND (((col3 < 5 AND col4
IN (SELECT col1 FROM tab1 WHERE ((col3 >= 7 AND col3 <= 6) OR col0 < 0
OR col1 >= 0.64 OR col3 <= 7 AND (col3 >= 8) AND ((col3 <= 6) AND
((col0 = 1 AND col3 IS NULL)) OR col0 > 7 OR col3 IN (8,1,7,4) OR col3
> 7 AND col3 >= 5 AND (col3 < 0) OR col0 > 3 AND col4 > 1.21 AND col0
< 4 OR ((col4 > 9.30)) AND ((col3 >= 5 AND col3 <= 7))) AND col0 <= 5
OR ((col0 >= 1 AND col4 IS NULL AND col0 > 5 AND (col0 < 3) OR col4 <=
8.86 AND (col3 > 0) AND col3 = 8)) OR col3 >= 1 OR (col3 < 4 OR (col3
= 7 OR (col1 >= 4.84 AND col1 <= 5.61)) OR col3 >= 5 AND ((col3 < 4)
AND ((col3 > 9)) OR (col0 < 3) AND (((col0 IS NULL))) AND (col0 < 4))
AND ((col4 IN (0.79)))) OR (col4 = 6.26 AND col1 >= 5.64) OR col1 IS
NULL AND col0 < 1)))) AND ((((col3 < 9) OR ((col0 IS NULL) OR (((col1
>= 8.40 AND col1 <= 0.30) AND col3 IS NULL OR (col0 <= 7 OR ((col3 >
4))) AND col0 = 6)) OR col3 < 6 AND (((((((col1 > 4.8)) OR col0 < 9 OR
(col3 = 1))) AND col4 >= 4.12))) OR (((col1 > 1.58 AND col0 < 7))) AND
(col1 < 8.60) AND ((col0 > 1 OR col0 > 1 AND ((col3 >= 2 AND col3 <=
0) AND col0 <= 0) OR ((col0 >= 8)) AND (((col3 >= 8 AND col3 <= 8) OR
col0 > 4 OR col3 = 8)) AND col1 > 5.10) AND ((col0 < 7 OR (col0 < 6 OR
(col3 < 0 OR col4 >= 9.51 AND (col3 IS NULL AND col1 < 9.41 AND col1 =
1.9 AND col0 > 1 AND col3 < 9 OR (col4 IS NULL) OR col1 = 0.5 AND
(col0 >= 3) OR col4 = 9.25 OR ((col1 > 0.26)) AND col4 < 8.25 AND
(col0 >= 2) AND col3 IS NULL AND (col1 > 3.52) OR (((col4 < 7.24)) AND
col1 IS NULL) OR col0 > 3) AND col3 >= 4 AND col4 >= 2.5 AND col0 >= 0
OR (col3 > 3 AND col3 >= 3) AND col0 = 1 OR col1 <= 8.9 AND col1 >
9.66 OR (col3 > 9) AND col0 > 0 AND col3 >= 0 AND ((col4 > 8.39))))
AND (col1 IS NULL)))))) AND col1 <= 2.0 OR col4 < 1.8 AND (col4 = 6.59
AND col3 IN (3,9,0))))) OR col4 <= 4.25 OR ((col3 = 5))) OR (((col0 >
0)) AND col0 > 6 AND (col4 >= 6.56)))

Dolt first hit 90% correctness on this suite in late 2019, 99% correctness in mid 2021, and finally achieved 100% in early 2024.

Doltgres benefits from using the same underlying query and storage engine as Dolt itself, so it's hitting the same milestones much more rapidly. When we first began measuring Doltgres correctness about a year ago (after our Alpha release), we came in at just over 70%. This low number was due to a large number of CREATE TABLE and ALTER TABLE statements specific to postgres that we couldn't parse or execute due to missing features like schema, sequences, etc. After a year of work, Doltgres now scores 91.1% on this measure, and we expect to continue our rapid progress, hitting 99% before our Beta release.

Postgres regression tests

sqllogictest was built to measure the accuracy of SQLite3, and so it doesn't capture very many Postgres-specific pieces of functionality. To get a better read on that number, we decided to run the same set of correctness tests that Postgres itself does, which specifically target Postgres features. Running these tests on Doltgres wasn't straightforward -- we needed to run the Postgres test process and capture the SQL statements being emitted, as well as expected results, and convert them into a repeatable Go process. Here's an example of one of the thousands of Postgres regression tests we're now executing:

SELECT i AS "Value",
  CASE WHEN (i < 0) THEN 'small'
       WHEN (i = 0) THEN 'zero'
       WHEN (i = 1) THEN 'one'
       WHEN (i = 2) THEN 'two'
       ELSE 'big'
  END AS "Category"
  FROM CASE_TBL;

Because these tests are tailored specifically to Postgres functionality, we are doing much worse on them. Currently, Doltgres passes 33.8% of these tests. This sounds much worse than it is. Because these are tests that specifically target Postgres features, they cover a very wide surface area that will take time to build out. They also include many pieces of functionality we will likely never implement, or will implement differently, so we don't yet know what portion of these tests it's possible for Doltgres to reasonably pass.

Nevertheless, we are making rapid solid progress in closing the real functionality gaps they expose, like in this recent PR. Just a few weeks ago Doltgres was passing only 27.2% of these tests. Now that we've deployed our correctness czar James to focus on them, we expect to see this number climb every week.

These tests also run as part of our normal GitHub CI workflow, so on every PR we can verify that we're adding new functionality and not breaking anything that used to work. The automation here is very cool: every PR shows us what queries used to fail and now pass.

Dolt engine tests

In addition to the third-party test above, Dolt has its own collection of tens of thousands of hand-written correctness tests, including ones that test Dolt's version-control functionality that we obviously can't expect anyone else to write for us. These were written for the MySQL dialect that Dolt uses, but we've been busy porting them to Doltgres so they can run there as well.

Here's a sample of what these tests look like:

	{
		Name: "Create branches from HEAD with dolt_branch procedure",
		Assertions: []queries.ScriptTestAssertion{
			{
				Query:    "CALL DOLT_BRANCH('myNewBranch1')",
				Expected: []sql.Row{{0}},
			},
			{
				Query:    "SELECT COUNT(*) FROM DOLT_BRANCHES WHERE NAME='myNewBranch1';",
				Expected: []sql.Row{{1}},
			},
			{
				// Trying to recreate that branch fails without the force flag
				Query:          "CALL DOLT_BRANCH('myNewBranch1')",
				ExpectedErrStr: "fatal: A branch named 'myNewBranch1' already exists.",
			},
			{
				Query:    "CALL DOLT_BRANCH('-f', 'myNewBranch1')",
				Expected: []sql.Row{{0}},
			},
		},
	}

Converting these tests has been a long process, and we've found and fixed a lot of bugs while doing it. But it's been getting easier as we go, which we take as a good sign that the product is actually reaching maturity pretty quickly. Most of the remaining failures in these tests are a result of failing to accurately translate the MySQL dialect tests correctly, but several months ago there were large missing pieces of Dolt functionality these tests were catching.

We expect that 100% of these tests will run successfully for the Beta release, minus a few with MySQL dialect dependencies we can't port in time.

Discussion

What do these numbers really mean? Are any of them the true measure of correctness?

The reality is each of them are measuring something different, and they each give us good insights into Doltgres's correctness and how far we have to go.

sqllogictest tells us that we have good coverage for simple operations: most CREATE TABLE statements, as well as CRUD operations like INSERT, DELETE, UPDATE. We return the correct results for most queries, even very complex ones. Performance on this measure is a decent proxy for the kinds of operations you're likely to run into writing a database application. Basically: this number is the basic correctness score for a typical user, and hitting 100% on it means that a typical database application can run without issue and have confidence of correct results.
Postgres regression testing tells us that we have a lot of Postgres-specific functionality left to implement. Postgres has been around for decades and has a lot of features. Most of them aren't going to be used by a typical database application, but rare customers will need specific features we don't yet support. Basically: this number is the long tail of compatibility, and we need to hit 100% to call ourselves a true drop-in replacement for Postgres, which is our ultimate long term goal.
Dolt engine testing tells us that most Dolt functionality works correctly. Our long-term goal is that these tests will assure us that all the version-control features that Dolt offers work the same between both products, and as we add more such features to Dolt, Doltgres will get them as well. Basically: these are Dolt-specific regression tests that we require during CI for any release to succeed. Many of them are skipped while being ported, but we expect a 100% pass rate by the Beta release.

Conclusion

Doltgres is making rapid progress on correctness to our Beta release. If you are thinking about using Doltgres and have missing features or bugs you want us to prioritize, make an issue on our Github Issues page. Or join our Discord server to talk our engineers in real time. We hope to hear from you!

Blog

sqllogictest

Postgres regression tests

Dolt engine tests

Discussion

Conclusion

Get started with Dolt