Three 9's of Correctness
Dolt is a SQL database with a custom storage layer designed for Git semantics. Dolt supports fast branching, diffing, and merging, plus all of the features MySQL provides.
One of the ways we track our compatibility with MySQL are
sqllogictests
, a collection of about 6 million correctness tests.
We reached
90% correctness in 2019,
99% correctness in
2021, and
have now reached 99.9% correctness.
This blog will give some examples of how the recent features contributed to correctness.
Background
Parsing a SQL query enforces one layer of SQL language semantics. But queries that are valid SQL strings might still not be executable.
"Binding", or name resolution, is the process of applying specific database catalogs, schemas, and SQL semantics to generic abstract syntax trees (AST). A query that passes through binding should have column and table names validated, types checked, and other mandatory semantics like aggregation determinism verified.
We won't discuss binding a whole lot in this blog, but most of the correctness fixes described are sourced from improvements in our ability to resolve well-formed queries. With that note, let's get to the queries!
Join
Many subtle but simple joins used to fail in Dolt until recently:
create table a (b int);
create table c (d int);
select * from a join c c1 on b = d join c c2 on c1.d = c2.d;
ambiguous column name "d", it's present in all these tables: c1, c2
The b = d
filter resolves in MySQL because only left-deep
tables are available for definition lookups. In other words, when we
built b = d
, only tables a
and c1
have been defined so far. The
definition ambiguity is only present after we have built c2
, at which
point we have to qualify c1.d = c2.d
to avoid an error.
Dolt now supports queries that depend on implicit left to right naming conventions.
In-Scope Lateral References
Lateral references are select clauses with subquery expressions that access other columns in the same SELECT clause. As with the previous example, subqueries can access fields defined to the left:
select x,y,
(select a from ab where x = a) as a,
(select u from uv where u = a) as u
from xy;
The "order" of resolving here is xy
, a
, and then u
. The FROM
columns are available for reference when we build a
. Likewise,
a
has been defined by the time we build u
.
Dolt first started supporting this query a year ago with some of Jason's work.
Lateral joins
Dolt now supports lateral joins. Lateral references can often be written equivalently as lateral joins:
select x, y from xy,
lateral (select a from ab where x = a) as dt1,
lateral (select u from uv where u = a) dt2;
The query above is a rewrite of the previous example, as long as dt1
and dt2
only return 1 row. dt
still has access to x,y
, dt2
still has
access to x, y, a
, and the output still includes all columns. If dt1
returns 2 values, the output will double in number because
a lateral join behaves like a cross join.
LATERAL
clauses make naming separation explicit, which is often more
predictable as an execution plan. We are making execution more reliable
over time by converting lateral subqueries to lateral joins.
Aggregation
Dolt used to panic when aggregation functions were used outside of SELECT clauses:
select sum(x+1)+1 from xy group by x having avg(x) > 1 order by 1
panic: unresolved column is a placeholder node, but Type was called
goroutine 1 [running]:
github.com/dolthub/go-mysql-server/sql/expression.(*UnresolvedColumn).Type(...)
/Users/max-hoffman/go/pkg/mod/github.com/dolthub/go-mysql-server@v0.16.1-0.20230815232241-704a2c0709d3/sql/expression/unresolved.go:66
...
Aggs require reassembling to convert SQL clauses into well-formed query plans. The schematic below gives an idea of how aggregate functions need to be separated and grouped from standard expressions before being stitched back together again:
Dolt now supports most aggregation queries. We are still working on the long-tail of the most complicated sorts (see q11 on page 4 here for an example!).
Using
The last feature James recently added is USING join support.
NATURAL JOIN joins two tables with equality filters between columns of the same name, eliminating the redundant columns of the right hand side:
select * from xy t1 natural join xy t2;
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
+---+---+
USING joins do the same thing, except you can select a subset of columns for equijoining/redundancy elimination:
select * from xy t1 join xy t2 using (x,y);
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
+---+---+
select * from xy t1 join xy t2 using (x);
+---+---+---+
| x | y | y |
+---+---+---+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
+---+---+---+
Summary
This blog marks three 9's of SQL correctness and discusses some of the significant classes of queries we fixed to reach this goal. All of these fixes are related to a name binding refactor, which should continue to be a source of correctness and performance wins for Dolt moving forward. Stay tuned for the next update, when we reach 4 9's of correctness!
If you have any questions about Dolt, databases, or Golang performance reach out to us on Twitter, Discord, and GitHub!