Costing Index Scans


Dolt is the first version controlled SQL database. We have made many correctness and performance improvements over the last couple of years. But one thing we have never been good at is queries that need to adapt to the underlying table data. We call these "costed" optimizations, because they depend on estimating the result counts of partial plans. To choose wisely in these scenarios, we need extra data structures that describe our index data structures. These are normally called table statistics, or histograms.

We are announcing the initial release of Dolt table statistics. A user can run ANALYZE TABLE <table list...> to collect statistics for tables. The statistics are not persisted across restarts or refreshes by default, and costing currently only applies to index selection. Table statistics are limited, but it's a start. Future releases will make statistics collection automatic and persistent across database restarts. At that point we will enable them by default, and costing will invisibly improve performance for all users, just like in MySQL or Postgres.
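For example, collecting statistics is a single statement per table or list of tables (the table names here are illustrative):

-- collect statistics for a single table
analyze table xy;

-- or for several tables in one statement
analyze table xy, warehouse;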

Using Analyze to Cost Plans

We start with a simple example of how index costing can help us pick better query plans.

We make a quick table with three columns, two indexes, and 1,000 rows:

create table xy (
  x int,
  y int,
  z int,
  primary key (x),
  key(y, z)
);
insert into xy
select x, 1, 1
from (
  with recursive inputs(x) as (
    select 1
    union
    select x+1 from inputs where x < 1000
  )
  select * from inputs
) dt;

This is a heavily skewed table: the x values are unique and increment by 1, and every (y,z) tuple is (1,1):

 select * from xy limit 5;
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
+---+---+---+
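A quick sanity check confirms the skew: x has 1,000 distinct values, while y and z each have exactly one:

select count(distinct x), count(distinct y), count(distinct z) from xy;
-- returns 1000, 1, 1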

So let's trick the costless optimizer into picking a bad plan:

explain select * from xy where x < 100 and y = 1 and z = 1;
+-------------------------------------+
| plan                                |
+-------------------------------------+
| Filter                              |
|  ├─ (xy.x < 100)                    |
|  └─ IndexedTableAccess(xy)          |
|      ├─ index: [xy.y,xy.z]          |
|      ├─ filters: [{[1, 1], [1, 1]}] |
|      └─ columns: [x y z]            |
+-------------------------------------+

Dolt chooses this index by default because indexes are usually shaped around workload patterns where they return small result sets. This example may seem contrived, but most databases end up hitting this problem. A small number of ecommerce clients are power users that drive a disproportionate fraction of sales; a small number of social media celebrities have the majority of followers and drive the most traffic; certain maritime ports are chokepoints for the distribution of goods to large geographical areas.

Basically, databases are subject to Zipf's Law. As businesses grow, databases and workload patterns expand. Scaling businesses reach sizes where data distribution skew impacts partitioning, horizontal read and write scaling, and optimal index design. A smart database adapts queries to match workload patterns as the contents of tables change.

We will switch to Dolt v1.26.0, cost our indexes, and use the information schema to inspect the results (using --output-format=vertical):

analyze table xy;
*************************** 1. row ***************************
   Table: xy
      Op: analyze
Msg_type: status
Msg_text: OK

select * from information_schema.column_statistics;

*************************** 1. row ***************************
SCHEMA_NAME: tmp
 TABLE_NAME: xy
COLUMN_NAME: x
  HISTOGRAM: {"statistic": {"avg_size": 0, "buckets": [{"bound_count": 1, "distinct_count": 209, "mcv_counts": [1,1,1], "mcvs": [[209],[2],[3]], "null_count": 0, "row_count": 209, "upper_bound": [209]},{"bound_count": 1, "distinct_count": 205, "mcv_counts": [1,1,1], "mcvs": [[414],[211],[212]], "null_count": 0, "row_count": 205, "upper_bound": [414]},{"bound_count": 1, "distinct_count": 55, "mcv_counts": [1,1,1], "mcvs": [[469],[416],[417]], "null_count": 0, "row_count": 55, "upper_bound": [469]},{"bound_count": 1, "distinct_count": 117, "mcv_counts": [1,1,1], "mcvs": [[586],[471],[472]], "null_count": 0, "row_count": 117, "upper_bound": [586]},{"bound_count": 1, "distinct_count": 132, "mcv_counts": [1,1,1], "mcvs": [[718],[588],[589]], "null_count": 0, "row_count": 132, "upper_bound": [718]},{"bound_count": 1, "distinct_count": 112, "mcv_counts": [1,1,1], "mcvs": [[830],[720],[721]], "null_count": 0, "row_count": 112, "upper_bound": [830]},{"bound_count": 1, "distinct_count": 170, "mcv_counts": [1,1,1], "mcvs": [[1000],[832],[833]], "null_count": 0, "row_count": 170, "upper_bound": [1000]}], "columns": ["x"], "created_at": "2023-11-14T09:29:14.138995-08:00", "distinct_count": 1000, "null_count": 1000, "qualifier": "tmp4.xy.PRIMARY", "row_count": 1000, "types:": ["int"]}}

*************************** 2. row ***************************
SCHEMA_NAME: tmp
 TABLE_NAME: xy
COLUMN_NAME: y,z
  HISTOGRAM: {"statistic": {"avg_size": 0, "buckets": [{"bound_count": 240, "distinct_count": 1, "mcv_counts": [240], "mcvs": [[1,1]], "null_count": 0, "row_count": 240, "upper_bound": [1,1]},{"bound_count": 197, "distinct_count": 1, "mcv_counts": [197], "mcvs": [[1,1]], "null_count": 0, "row_count": 197, "upper_bound": [1,1]},{"bound_count": 220, "distinct_count": 1, "mcv_counts": [220], "mcvs": [[1,1]], "null_count": 0, "row_count": 220, "upper_bound": [1,1]},{"bound_count": 169, "distinct_count": 1, "mcv_counts": [169], "mcvs": [[1,1]], "null_count": 0, "row_count": 169, "upper_bound": [1,1]},{"bound_count": 120, "distinct_count": 1, "mcv_counts": [120], "mcvs": [[1,1]], "null_count": 0, "row_count": 120, "upper_bound": [1,1]},{"bound_count": 54, "distinct_count": 1, "mcv_counts": [54], "mcvs": [[1,1]], "null_count": 0, "row_count": 54, "upper_bound": [1,1]}], "columns": ["y", "z"], "created_at": "2023-11-14T09:29:14.13942-08:00", "distinct_count": 6, "null_count": 1000, "qualifier": "tmp4.xy.yz", "row_count": 1000, "types:": ["int", "int"]}}

We will dig more into what the column_statistics output means later. For now let's run our query again:

explain select * from xy where x < 100 and y = 1 and z = 1;
+----------------------------------+
| plan                             |
+----------------------------------+
| Filter                           |
|  ├─ ((xy.y = 1) AND (xy.z = 1))  |
|  └─ IndexedTableAccess(xy)       |
|      ├─ index: [xy.x]            |
|      ├─ filters: [{(NULL, 100)}] |
|      └─ columns: [x y z]         |
+----------------------------------+

The first plan using the (y,z) index reads 1000 rows from disk (~70ms); the second plan uses the (x) index to read 100 rows from disk (~60ms). They both return the same results, because the filters excluded from the index scan are still executed in memory as a filter operator. But the second plan returns faster because it does less work.

Histograms

Now let's take a deeper look at the data structure we use to cost plans. Here is an abbreviated statistic from the previous example, formatted with jq:

{
  "statistic": {
    "buckets": [
      {
        "bound_count": 1,
        "distinct_count": 209,
        "null_count": 0,
        "row_count": 209,
        "upper_bound": [209]
      },
      {
        "bound_count": 1,
        "distinct_count": 205,
        "null_count": 0,
        "row_count": 205,
        "upper_bound": [414]
      },
...
      {
        "bound_count": 1,
        "distinct_count": 170,
        "null_count": 0,
        "row_count": 170,
        "upper_bound": [1000]
      }
    ],
    "columns": ["x"],
    "avg_size": 0,
    "distinct_count": 1000,
    "row_count": 1000,
    "qualifier": "tmp.xy.PRIMARY",
    "created_at": "2023-11-14T09:37:08.193049-08:00",
    "types:": ["int"]
  }
}

The key data structure is the histogram, a series of "buckets" that summarize the contents of a section of the index. A bucket's row_count is the number of rows between (1) the upper_bound of the previous bucket, exclusive, and (2) the upper_bound of the current bucket, inclusive. The second bucket has 205 rows: the previous upper bound is 209 and the current upper bound is 414, so there are 205 rows in the range (209, 414]. A visualization of the histogram is shown below:

[Figure: histogram of row counts per bucket for the x index]
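Because the x values are dense and unique, we can verify a bucket's row_count directly against the table; for the second bucket:

-- rows after the previous upper bound (209), up to and
-- including the current upper bound (414)
select count(*) from xy where x > 209 and x <= 414;
-- returns 205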

The ideal histogram uses a small amount of space to accurately estimate a wide variety of range scans.

For example, let's say we have a range query:

select count(*) from xy
where
  x < 100 OR
  x between 200 and 300 OR
  x > 800;

The x values are unique and range over [1,1000], so we can do a little math to estimate that the result of the query will be 100 + (300-200) + (1000-800) = 100 + 100 + 200 = 400.
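Running the query shows the back-of-the-envelope math lands on the exact answer here: x < 100 matches 99 rows, the BETWEEN matches 101, and x > 800 matches 200, for a total of 400:

select count(*) from xy
where
  x < 100 OR
  x between 200 and 300 OR
  x > 800;
-- returns 400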

We would use a histogram to estimate this value by truncating buckets that fall outside of the filter ranges. The output of that process looks something like the chart below:

[Figure: the same histogram with buckets truncated to the filter ranges]

The sum of the remaining buckets is 696. This is not a perfect estimate of the query result, and histograms are not foolproof. But it is a lot better than zero information, and it gives us a tunable trade-off: adding more buckets yields better estimates at the expense of maintaining a more expensive set of statistics.

A More Complicated Example

This time, the ideal index scan is less obvious.

This warehouse table has a unique identifier, geographical properties, the date it started operating, and a maximum capacity. The table has a few indexes that might reflect common access patterns.

CREATE TABLE warehouse (
  Id int primary key,
  City varchar(60),
  State varchar(60),
  Zip smallint,
  Region varchar(60),
  Capacity int,
  operatingStart datetime,
  Key (region, state, city, zip),
  Key (region, state, capacity),
  Key (operatingStart, capacity)
);
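As before, we would collect statistics and inspect them through the information schema before trusting the optimizer's choice (a sketch; the output shape follows the column_statistics example above):

analyze table warehouse;

select column_name, histogram
from information_schema.column_statistics
where table_name = 'warehouse';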

We consider a query that touches all three secondary indexes:

select * from warehouse
where
  region = 'North America' AND
  city = 'Los Angeles' AND
  state = 'California' AND
  operatingStart >= '2022-01-01' AND
  capacity > 10;

In other words, a different subset of the filters is a partial match for each index's column expression keys:

  1. ('North America', 'California', 'Los Angeles', ?) => (region, state, city, zip)
  2. ('North America', 'California', 10-∞) => (region, state, capacity)
  3. ('2022-01-01'-∞, ?) => (operatingStart, capacity)

(We number the options as a reference aid in the comparisons below.)

-- option 1
(Project
  (*)
  (Filter
    (List
      (OperatingStart >= '2022-01-01')
      (Capacity > 10))
    (IndexScan
      (Index warehouse.regionStateCityZip)
      (Range ('North America') ('California') ('Los Angeles') (-∞, ∞)))))

-- option 2
(Project
  (*)
  (Filter
    (List
      (City = 'Los Angeles')
      (OperatingStart >= '2022-01-01')
      (Capacity > 10))
    (IndexScan
      (Index warehouse.regionStateCapacity)
      (Range ('North America') ('California') (10, ∞)))))

-- option 3
(Project
  (*)
  (Filter
    (List
      (Region = 'North America')
      (City = 'Los Angeles')
      (State = 'California')
      (Capacity > 10))
    (IndexScan
      (Index warehouse.operatingStartCapacity)
      (Range ('2022-01-01', ∞) (-∞, ∞)))))

Each index scan option has 1) a set of filters absorbed into the table scan, and 2) a remaining set of filters executed in memory. Depending on the contents of the database, all or none of these options might be fast.

If we skip a few steps and truncate the histograms for each index like we did above, the estimated row counts are: (1) 1574, (2) 10002, (3) 345.

The range filter on operatingStart (option 3) is the most selective, returns the fewest rows, and will be the most performant:

(Project
  (*)
  (Filter
    (List
      (Region = 'North America')
      (City = 'Los Angeles')
      (State = 'California')
      (Capacity > 10))
    (IndexScan
      (Index warehouse.operatingStartCapacity)
      (Range ('2022-01-01', ∞) (-∞, ∞)))))

It is important to keep in mind that changes to the database can change the optimal index scan. For example, if several thousand new warehouses become operational next year, suddenly operatingStart >= '2022-01-01' will not be as selective.
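Until statistics collection is automatic, re-running ANALYZE after a bulk change keeps the costs honest. A minimal sketch, with a hypothetical newly operational warehouse:

-- hypothetical: new warehouses come online in 2024
insert into warehouse (Id, City, State, Zip, Region, Capacity, operatingStart)
values (10001, 'Newark', 'New Jersey', 7102, 'North America', 500, '2024-03-01');

-- refresh the histograms so costing reflects the new distribution
analyze table warehouse;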

Future

We look forward to expanding costing to improve other optimizer decisions like join planning, sorting externally vs with an index, and choosing whether to aggregate before or after joins. We will also soon publish blogs talking more about the storage-layer lifecycle of Dolt statistics.

Until then, if you have any questions about Dolt, databases, or Golang performance reach out to us on Twitter, Discord, and GitHub!
