So you want Slowly Changing Dimension?
Slowly changing dimension is a framework used to discuss database versioning. It can be used to simulate database version control.
In the past, database version control needed to be simulated using slowly changing dimension because the technology didn't exist to actually version control a database. We here at DoltHub solved that problem with Dolt, the world's first and only version controlled SQL database. Dolt is a database with slowly changing dimension type 3 built in.
In this article, we will explain what slowly changing dimension is including examples. Then, we'll talk about what it's used for. Finally, we'll talk a little bit about Dolt, the SQL database with slowly changing dimension built in.
What is Slowly Changing Dimension?
Slowly changing dimension was originally designed as a way to add data versioning to an Extract, Transform, Load (ETL) process. If you were extracting data from your Online Transaction Protocol (OLTP) database and loading it into your Offline Analytics Processing (OLAP) database (ie. data warehouse), how should the history of data changes be handled for particular columns? Slowly changing dimension was developed as a framework to answer this question.
Imagine a business database that contains customer information: their names and addresses. Most of the time this data won't change, but it sometimes will. Maybe a customer moves, or gets married. These columns can be versioned using slowly changing dimension.
You could try a few approaches for dealing with this, each with their own tradeoffs. For our customer table, suppose you had an address change. You could:
- overwrite the old address, and lose it forever
- create a new address column in the table, but make the table wider
- add a new row for the new address, but make the database larger with more complicated read queries
- add a history table, which tracks all address changes, but also makes the database larger
These approaches are called Type 1 through Type 4 slowly changing dimension, respectively.
Ralph Kimball and Margy Ross are credited with developing the Slowly Changing dimension framework in 2005. They published a book called The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling in 2013.
Over time, slowly changing dimension techniques were used for simulating database version control in Online Transaction Protocol (OLTP) databases. My teams have implemented slowly changing dimension techniques to simulate version control in an OLTP database. This database version control is exposed by the application to allow users to see history, make multiple changes at once (ie. simulated branches), or soft delete.
So, I prefer a more liberal definition of slowly changing dimension: slowly changing dimension defines a framework for history and version handling in databases.
Why use Slowly Changing Dimension?
Now that you understand what slowly changing dimension is, the question is why would you ever need slowly changing dimension? I'll focus on four key reasons to use slowly changing dimension:
- Historical Reporting
- Expose Versioning to Your Application
- Disaster Recovery
- Reproducibility
Historical Reporting
Most slowly changing dimension has an ETL focus. You have a fast changing OLTP database and you need history in your OLAP data warehouse to build reports. You use the Slowly Changing Dimension system to define the relationship between your unversioned production database and your versioned data warehouse. You codify this relationship in ETL jobs.
Expose Versioning to your Application
Maybe your application customers are asking for history reporting in the application. So, you add slowly changing dimension to your production database and run OLAP-style historical reporting queries on your production OLTP database. These queries are generally more expensive than OLTP queries so a common pattern is to run them on a replica.
Another slowly changing dimension use is to add other versioning features beyond logs to your application like diffs, branches, and even merges. Branches can be simulated by having multiple active versions with branch names. We'll cover this a little more in the Poor Man's Data Version Control section.
Disaster Recovery
Like soft deletes or change data capture, slowly changing dimension can be used to recover from a certain class of bad query. Without any history, you are forced to go to backups and logs to save yourself from an erroneous delete. Slowly changing dimension provides some history in the database to use instead.
Reproducibility
Slowly changing dimension can be used for model reproducibility. Increasingly, data in databases in being used in machine learning applications. You may want to be able to build a model with different parameters using the same data, even if it is changing out from under you. Slowly changing dimension can be used to make a reproducible query to your database. The reproducible query results can be used to train a model.
What data benefits most from Slowly Changing Dimension?
Slowly changing dimension has traditionally been used to version data like names of geographical locations, customer account data, or product information. This data changes relatively slowly and a history is useful. Any time you see an audit or change log in an application, you are likely looking at slowly changing dimension behind the scenes. Typically, databases were limited in the scale of data they could store and query. As databases have been able to handle more scale, more data has become targeted for versioning with slowly changing dimension.
Recently, machine learning training sets benefit from versioning for model reproducibility. This development has brought data versioning and thus, slowly changing dimension back into focus.
Slowly Changing Dimension Types
The internet disagrees about whether there are 5, 7, or 8 slowly changing dimension types. I'll run through all of them with examples for completeness. In slowly changing dimension types, the index starts at zero, so beware of off by one errors. Moreover, Type 1 is no slowly changing dimension so things get even more wacky.
I think the best way to show off slowly changing dimension is through a simple example. Note, there are a bunch of nuances around joining with other tables as we get into Type 5+ slowly changing dimension types that a simple example like this ignores. To really understand Types 5+, you will likely need a more complicated example. However, this example should be enough to get most people started.
For this example, I'm going to have an employee table with employee_id
, last_name
, first_name
, and title
columns. We're going to have an employee last name, "Sehn", first name, "Tim", start as a "Software Engineer Intern", become a "Software Engineer", and finally become a "Software Engineer Manager". This is a simplified version of a table one would expect to find in most Enterprise Resource Planning (ERP) software.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
primary key(`employee_id`));
Query OK, 0 rows affected (0.02 sec)
In your source OLTP database, you would have a table with a row that looks like this at first:
mysql> insert into employees values (1, 'Sehn', 'Tim', 'Software Engineer Intern');
Query OK, 1 row affected (0.02 sec)
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+--------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+--------------------------+
| 1 | Sehn | Tim | Software Engineer Intern |
+-------------+-----------+------------+--------------------------+
1 row in set (0.01 sec)
Then it is replaced this:
mysql> update employees set title='Software Engineer' where employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+-------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+-------------------+
| 1 | Sehn | Tim | Software Engineer |
+-------------+-----------+------------+-------------------+
1 row in set (0.00 sec)
And finally it ends like this:
mysql> update employees set title='Software Engineer Manager' where employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+---------------------------+
| 1 | Sehn | Tim | Software Engineer Manager |
+-------------+-----------+------------+---------------------------+
1 row in set (0.00 sec)
Pretty straightforward. Now let's see what these changes look like under different slowly changing dimension types. Believe it or not, what you just witnessed is slowly changing dimension type 1.
Type 0: Retain Original
Retain Original or Type 0 slowly changing dimension is reserved for columns that never change. In our example this could be the employee_id
. Even last_name
and first_name
could change. The idea here is to mark columns as immutable. This doesn't make much sense for our example but if our table was slowly changing dimension type 0, the row above would always look like:
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+--------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+--------------------------+
| 1 | Sehn | Tim | Software Engineer Intern |
+-------------+-----------+------------+--------------------------+
No promotions at this company I guess.
Type 1: Overwrite
Overwrite or Type 1 slowly changing dimension is default database behavior. Just write over the original and forget it. I like to say type 1 slowly changing dimension is no slowly changing dimension. This kind of makes sense when defining an OLTP to OLAP transform but it is kind of confusing if you are looking for a database versioning framework.
The data in our schema would end up looking exactly like it does in our original example.
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+---------------------------+
| 1 | Sehn | Tim | Software Engineer Manager |
+-------------+-----------+------------+---------------------------+
Type 2: Add New Row
This is where we start to get interesting. For the Add New Row or Type 2 slowly changing dimension we add a row every time we make a change. We issue an insert instead of or in addition to an update. In order to do this we need to add one or more columns indicating which row is current or active. We need to add these columns to the primary key so we can have multiple copies of the primary key from the original table.
The upside of this approach is you get unlimited history. The downside is your update process and retrieval queries are more complicated.
There are three strategies to do this: a version
column, a start_date
and end date
column, and an effective_date
and current
boolean column.
For the version
approach, we add an auto_increment
column and insert rows instead of update. To get the current version we add a where clause max(version)
in a subquery. You must make version part of the primary key for this approach to work.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
`version` int auto_increment,
primary key(`employee_id`, `version`));
Query OK, 0 rows affected (0.02 sec)
mysql> insert into employees (employee_id, last_name, first_name, title)
values (1, 'Sehn', 'Tim', 'Software Engineer Intern');
Query OK, 1 row affected (0.02 sec)
mysql> insert into employees (employee_id, last_name, first_name, title)
values (1, 'Sehn', 'Tim', 'Software Engineer');
Query OK, 1 row affected (0.01 sec)
mysql> insert into employees (employee_id, last_name, first_name, title)
values (1, 'Sehn', 'Tim', 'Software Engineer Manager');
Query OK, 1 row affected (0.02 sec)
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+---------+
| employee_id | last_name | first_name | title | version |
+-------------+-----------+------------+---------------------------+---------+
| 1 | Sehn | Tim | Software Engineer Intern | 1 |
| 1 | Sehn | Tim | Software Engineer | 2 |
| 1 | Sehn | Tim | Software Engineer Manager | 3 |
+-------------+-----------+------------+---------------------------+---------+
3 rows in set (0.00 sec)
mysql> select * from employees where employee_id=1
and version = (select max(version) from employees where employee_id=1);
+-------------+-----------+------------+---------------------------+---------+
| employee_id | last_name | first_name | title | version |
+-------------+-----------+------------+---------------------------+---------+
| 1 | Sehn | Tim | Software Engineer Manager | 3 |
+-------------+-----------+------------+---------------------------+---------+
For the start_date
and end_date
approach, you add a start_date
and end_date
column. You must add start_date
to the primary key. When you want to change a value, you update the current end_date
, and insert a new row with NOW()
as the start_date
and a NULL
as the end_date
. The issue here is that changing a value is now an update and an insert. The upside is that finding the current version is a simple end_date is NULL
where clause.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
`start_date` datetime,
`end_date` datetime,
primary key(`employee_id`, `start_date`));
Query OK, 0 rows affected (0.02 sec)
mysql> insert into employees (employee_id, last_name, first_name, title, start_date)
values (1, 'Sehn', 'Tim', 'Software Engineer Intern', now());
Query OK, 1 row affected (0.02 sec)
mysql> update employees set end_date=now() where end_date is NULL;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees (employee_id, last_name, first_name, title, start_date)
values (1, 'Sehn', 'Tim', 'Software Engineer', now());
Query OK, 1 row affected (0.01 sec)
mysql> update employees set end_date=now() where end_date is NULL;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees (employee_id, last_name, first_name, title, start_date)
values (1, 'Sehn', 'Tim', 'Software Engineer Manager', now());
Query OK, 1 row affected (0.01 sec)
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+----------------------------+----------------------------+
| employee_id | last_name | first_name | title | start_date | end_date |
+-------------+-----------+------------+---------------------------+----------------------------+----------------------------+
| 1 | Sehn | Tim | Software Engineer Intern | 2023-06-20 12:40:50.208589 | 2023-06-20 12:40:55.754209 |
| 1 | Sehn | Tim | Software Engineer | 2023-06-20 12:41:02.424386 | 2023-06-20 12:41:08.820748 |
| 1 | Sehn | Tim | Software Engineer Manager | 2023-06-20 12:41:16.554803 | NULL |
+-------------+-----------+------------+---------------------------+----------------------------+----------------------------+
mysql> select * from employees where end_date is NULL and employee_id=1;
+-------------+-----------+------------+---------------------------+----------------------------+----------+
| employee_id | last_name | first_name | title | start_date | end_date |
+-------------+-----------+------------+---------------------------+----------------------------+----------+
| 1 | Sehn | Tim | Software Engineer Manager | 2023-06-20 12:41:16.554803 | NULL |
+-------------+-----------+------------+---------------------------+----------------------------+----------+
1 row in set (0.00 sec)
Finally, for this approach we have effective_date
and current
boolean. This is pretty similar to start_date
and end_date
but without any pesky NULL
s. Plus, you'll notice there is a gap between start and end dates in the example above. You can get around this by wrapping the update and insert in a transaction. The effective_date
and current
approach does not require a transactional update and insert.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
`effective_date` datetime,
`current` tinyint,
primary key(`employee_id`, `effective_date`));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into employees (employee_id, last_name, first_name, title, effective_date, current)
values (1, 'Sehn', 'Tim', 'Software Engineer Intern', now(), 1);
Query OK, 1 row affected (0.01 sec)
mysql> update employees set current=0 where current=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees (employee_id, last_name, first_name, title, effective_date, current)
values (1, 'Sehn', 'Tim', 'Software Engineer', now(), 1);
Query OK, 1 row affected (0.02 sec)
mysql> update employees set current=0 where current=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees (employee_id, last_name, first_name, title, effective_date, current)
values (1, 'Sehn', 'Tim', 'Software Engineer Manager', now(), 1);
Query OK, 1 row affected (0.01 sec)
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+----------------------------+---------+
| employee_id | last_name | first_name | title | effective_date | current |
+-------------+-----------+------------+---------------------------+----------------------------+---------+
| 1 | Sehn | Tim | Software Engineer Intern | 2023-06-20 12:54:58.201774 | 0 |
| 1 | Sehn | Tim | Software Engineer | 2023-06-20 12:55:35.789219 | 0 |
| 1 | Sehn | Tim | Software Engineer Manager | 2023-06-20 12:55:48.984256 | 1 |
+-------------+-----------+------------+---------------------------+----------------------------+---------+
3 rows in set (0.01 sec)
mysql> select * from employees where employee_id=1 and current=1;
+-------------+-----------+------------+---------------------------+----------------------------+---------+
| employee_id | last_name | first_name | title | effective_date | current |
+-------------+-----------+------------+---------------------------+----------------------------+---------+
| 1 | Sehn | Tim | Software Engineer Manager | 2023-06-20 12:55:48.984256 | 1 |
+-------------+-----------+------------+---------------------------+----------------------------+---------+
1 row in set (0.00 sec)
Type 3: Add New Attribute
The Add New Attribute or Type 3 slowly changing dimension approach is where you add a new column (or set of columns) every time you want to add a revision. This really is only practical for a few revisions or your schema will get really wide. Plus, adding a revision to the largest number of revisions is a schema change which is expensive. But if you want to only store the last few revisions of a row, not the full history, it may be a good approach.
For our example, we will imagine that we only want to store your current job title and the previous job title you had. If you have more than two job titles, any title not in your previous two is lost.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`current_title` varchar(100),
`effective_date` datetime,
`previous_title` varchar(100),
primary key(employee_id));
Query OK, 0 rows affected (0.02 sec)
mysql> insert into employees (employee_id, last_name, first_name, current_title, effective_date)
values (1, 'Sehn', 'Tim', 'Software Engineer Intern', now());
Query OK, 1 row affected (0.01 sec)
mysql> update employees
set previous_title=current_title, effective_date=now(),current_title='Software Engineer'
where employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> update employees
set previous_title=current_title, effective_date=now(),current_title='Software Engineer Manager'
where employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+----------------------------+-------------------+
| employee_id | last_name | first_name | current_title | effective_date | previous_title |
+-------------+-----------+------------+---------------------------+----------------------------+-------------------+
| 1 | Sehn | Tim | Software Engineer Manager | 2023-06-20 13:20:29.459638 | Software Engineer |
+-------------+-----------+------------+---------------------------+----------------------------+-------------------+
Type 4: Add History Table
The Add History Table method or Type 4 slowly changing dimension may be familiar to those who have implemented Soft Deletes or Change Data Capture. This method adds a history table that is inserted into every time a change is made to the original table. This is usually accomplished via a trigger.
For our example we keep our employees
table and add an employees_history
table with a similar schema but the addition of a create_date
as part of the primary key. Every time we change employees
, we insert a row into employees_history
with the previous value of the employees
table. In the example below, I do the insert manually for clarity but you could set up a trigger to do the same thing.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
primary key(`employee_id`));
Query OK, 0 rows affected (0.02 sec)
mysql> create table employees_history(
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
`create_date` datetime,
primary key(`employee_id`, `create_date`));
Query OK, 0 rows affected (0.02 sec)
mysql> insert into employees values (1, 'Sehn', 'Tim', 'Software Engineer Intern');
Query OK, 1 row affected (0.01 sec)
mysql> insert into employees_history(employee_id, last_name, first_name, title, create_date)
select employee_id, last_name, first_name, title, now() from employees;
Query OK, 1 row affected (0.02 sec)
mysql> update employees set title='Software Engineer' where employee_id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees_history(employee_id, last_name, first_name, title, create_date)
select employee_id, last_name, first_name, title, now() from employees;
Query OK, 1 row affected (0.02 sec)
mysql> update employees set title='Software Engineer Manager' where employee_id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+---------------------------+
| 1 | Sehn | Tim | Software Engineer Manager |
+-------------+-----------+------------+---------------------------+
1 row in set (0.00 sec)
mysql> select * from employees_history where employee_id=1;
+-------------+-----------+------------+--------------------------+----------------------------+
| employee_id | last_name | first_name | title | create_date |
+-------------+-----------+------------+--------------------------+----------------------------+
| 1 | Sehn | Tim | Software Engineer Intern | 2023-06-20 13:47:09.114368 |
| 1 | Sehn | Tim | Software Engineer | 2023-06-20 13:48:34.548684 |
+-------------+-----------+------------+--------------------------+----------------------------+
2 rows in set (0.00 sec)
Type 5: Mini Dimension
Slowly changing dimension Types 5+ get a little complicated. Also note, every type above 5 has at one point or another been a called a variant type 6. From now on, the types are built on the idea that you may want to combine approaches for different columns in the table. So, for our example, let's imagine that first_name
and last_name
could also change and we want to maintain history but much less frequently than title
. How do we handle that? Slowly changing dimension Types 5+ are all hybrids of the first five approaches: type 0 to type 4.
With Mini Dimension or Type 5 slowly changing dimension, we break out the part of the table we want to version into a separate table and add a foreign key to the new table in the original table. We don't version the original table making it type 1. For the new table, we add a history table making it type 4. Type 4 + Type 1 = Type 5.
This approach means that we only store history for the values we want, saving space. It is especially useful for tables with many columns and deep history. The downside is you now have a lot of tables and certain select queries now require joins.
Again in this example, I walk through the queries manually but you could make triggers to have the history table populated automatically.
mysql> create table employees_title(
`title_id` int,
`employee_id` int,
`title` varchar(100),
primary key(`title_id`, `employee_id`));
Query OK, 0 rows affected (0.02 sec)
mysql> create table employees_title_history(
`title_id` int,
`employee_id` int,
`title` varchar(100),
`create_date` datetime,
primary key(`title_id`, `employee_id`, `create_date`));
Query OK, 0 rows affected (0.01 sec)
mysql> create table employees(
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title_id` int,
primary key(`employee_id`),
foreign key(`title_id`) references employees_title(`title_id`));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into employees_title values (1,1,'Software Engineer Intern');
Query OK, 1 row affected (0.02 sec)
mysql> insert into employees values (1, 'Sehn', 'Tim', 1);
Query OK, 1 row affected (0.01 sec)
mysql> insert into employees_title_history select title_id, employee_id, title, now() from employees_title;
Query OK, 1 row affected (0.01 sec)
mysql> update employees_title set title = 'Software Engineer' where title_id=1 and employee_id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees_title_history select title_id, employee_id, title, now() from employees_title;
Query OK, 1 row affected (0.01 sec)
mysql> update employees_title set title = 'Software Engineer Manager' where title_id=1 and employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+----------+
| employee_id | last_name | first_name | title_id |
+-------------+-----------+------------+----------+
| 1 | Sehn | Tim | 1 |
+-------------+-----------+------------+----------+
1 row in set (0.00 sec)
mysql> select * from employees_title where employee_id=1;
+----------+-------------+---------------------------+
| title_id | employee_id | title |
+----------+-------------+---------------------------+
| 1 | 1 | Software Engineer Manager |
+----------+-------------+---------------------------+
1 row in set (0.00 sec)
mysql> select * from employees_title_history where employee_id=1;
+----------+-------------+--------------------------+----------------------------+
| title_id | employee_id | title | create_date |
+----------+-------------+--------------------------+----------------------------+
| 1 | 1 | Software Engineer Intern | 2023-06-20 15:09:27.252767 |
| 1 | 1 | Software Engineer | 2023-06-20 15:11:03.283149 |
+----------+-------------+--------------------------+----------------------------+
2 rows in set (0.00 sec)
mysql> select employees.employee_id, employees.last_name, employees.first_name, employees_title.title
from employees join employees_title
on employees.title_id=employees_title.title_id
and employees.employee_id=employees_title.employee_id
where employees.employee_id=1;
+-------------+-----------+------------+---------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+---------------------------+
| 1 | Sehn | Tim | Software Engineer Manager |
+-------------+-----------+------------+---------------------------+
1 row in set (0.01 sec)
As you can see, you save some columns in the history table but now you have to join with another table to show query information like current title.
Type 6: Combined Approach
Combined Approach or Type 6 slowly changing dimension combines Overwrite, Add New Dimension, and Add new Row types into Voltron-like table. Type 1 + Type 2 + Type 3 = Type 6.
This combined approach give you access to many types of current view and history queries on a single table. The downside is that you need to update an entire column when you make a write as well as add new rows.
So for our example, we change the title
column to current_title
and historical_title
and add a start_date
and end_date
column to correspond to the historical title. You can optionally add a current
boolean column as well but I left that off for simplicity. This was already getting kind of much.
mysql> create table employees(
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`current_title` varchar(100),
`historical_title` varchar(100),
`start_date` datetime,
`end_date` datetime,
primary key(`employee_id`, `start_date`));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into employees values
(1, 'Sehn', 'Tim', 'Software Engineer Intern', 'Software Engineer Intern', now(), NULL);
Query OK, 1 row affected (0.02 sec)
mysql> update employees set end_date=now() where end_date is NULL;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees values
(1, 'Sehn', 'Tim', 'Software Engineer', 'Software Engineer', now(), NULL);
Query OK, 1 row affected (0.02 sec)
mysql> update employees set current_title='Software Engineer';
Query OK, 1 row affected (0.01 sec)
Rows matched: 2 Changed: 1 Warnings: 0
mysql> update employees set end_date=now() where end_date is NULL;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> insert into employees values
(1, 'Sehn', 'Tim', 'Software Engineer Manager', 'Software Engineer Manager', now(), NULL);
Query OK, 1 row affected (0.01 sec)
mysql> update employees set current_title='Software Engineer Manager';
Query OK, 3 rows affected (0.01 sec)
Rows matched: 3 Changed: 3 Warnings: 0
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------------------------+
| employee_id | last_name | first_name | current_title | historical_title | start_date | end_date |
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------------------------+
| 1 | Sehn | Tim | Software Engineer Manager | Software Engineer Intern | 2023-06-21 10:46:30.4786 | 2023-06-21 10:46:38.130405 |
| 1 | Sehn | Tim | Software Engineer Manager | Software Engineer | 2023-06-21 10:46:52.129884 | 2023-06-21 10:47:06.613117 |
| 1 | Sehn | Tim | Software Engineer Manager | Software Engineer Manager | 2023-06-21 10:47:20.212249 | NULL |
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------------------------+
3 rows in set (0.01 sec)
mysql> select * from employees where end_date is NULL and employee_id=1;
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------+
| employee_id | last_name | first_name | current_title | historical_title | start_date | end_date |
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------+
| 1 | Sehn | Tim | Software Engineer Manager | Software Engineer Manager | 2023-06-21 10:47:20.212249 | NULL |
+-------------+-----------+------------+---------------------------+---------------------------+----------------------------+----------+
1 row in set (0.01 sec)
Type 7: Hybrid
Honestly, I read through Type 7 on both The Kimball Groups website and wikipedia and I couldn't figure out exactly what it was. The two sources seem to disagree somewhat so I went from the Kimball documentation.
From the Type 7 Kimball documentation it sounds like a combination of Overwrite (Type 1) and Add New Row (Type 2) techniques. I think this is stretching the limits of my example because it needs multiple keys referenced by other tables to make sense. Thus, understanding Type 7 slowly changing dimension is "an exercise left for the reader".
Poor Man's Data Version Control
Slowly changing dimension techniques can be modified to simulate data version control functionality in your application. Beyond historical reporting, which in version control lingo is called "log", you can use slowly changing dimension to do branching, merging, and diffs.
Note, this requires your application follow rules on writes so it is not true version control enforced by the database. If your application or an operator makes a mistake, data can be lost. True version control would prevent this type of loss.
Logs
Add New Row (Type 2) and History table (Type 4) slowly dimension types give you an infinite log of history much like a log in a version control system. Note, one of the problems with these slowly changing dimension types is storage. Each row may contain extra information that needs to be stored and parsed on access. Practically, because storage is not shared across versions, log-like version control can only deployed for a limited history size.
Diffs
Diffs can be achieved using history queries against Add New Row (Type 2) and History table (Type 4) slowly dimension types much like audit queries for logs. Usually these diffs are achieved by selecting values at different timestamps.
Branches/Merges
You can add a "parent branch" and "branch" column to Overwrite (Type 1), New Row (Type 2) and History table (Type 4) slowly dimension types to simulate branches. Making a branch allows histories to diverge. Your application needs to be aware of what branch it is writing to and reading from. Merges can be handled by the application as well if you make "parent branch", "parent left branch", and add "parent right branch" as well. You can design custom merge logic and update these columns accordingly. This all gets rather complicated but is doable.
Dolt
- Tagline
- It's Git for Data
- Initial Release
- August 2019
- GitHub
- https://github.com/dolthub/dolt
What if your database had slowly changing dimension built in? The database was designed from storage up to do real data version control. We built a database like that. Dolt is a database where every column has slowly changing dimension Type 3.
Let's see how it works for our example. We treat our table as if it has no slowly changing dimension (ie. Type 1). Instead we make a Dolt commit between every write, ensuring the history is preserved.
mysql> create table employees (
`employee_id` int,
`last_name` varchar(100),
`first_name` varchar(100),
`title` varchar(100),
primary key(`employee_id`));
Query OK, 0 rows affected (0.02 sec)
mysql> call dolt_commit('-Am', 'Created Table');
+----------------------------------+
| hash |
+----------------------------------+
| kbas1fkidnht61ekeivf28a3r4circ6g |
+----------------------------------+
1 row in set (0.02 sec)
mysql> insert into employees values (1,'Sehn','Tim','Software Engineer Intern');
Query OK, 1 row affected (0.02 sec)
mysql> call dolt_commit('-Am', 'Added Tim Sehn');
+----------------------------------+
| hash |
+----------------------------------+
| hp705lj2ehefnuk6352fs05nm3m0cc6t |
+----------------------------------+
1 row in set (0.01 sec)
mysql> update employees set title='Software Engineer' where employee_id=1;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> call dolt_commit('-Am', 'Rehired Tim Sehn full time');
+----------------------------------+
| hash |
+----------------------------------+
| k27ecdhqh3igk46a23mcqq0mg90pjre5 |
+----------------------------------+
1 row in set (0.02 sec)
mysql> update employees set title='Software Engineer Manager' where employee_id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> call dolt_commit('-Am', 'Promoted Tim Sehn');
+----------------------------------+
| hash |
+----------------------------------+
| 0vlrmae4kc2d10guqp5vdijlnefs0sjb |
+----------------------------------+
1 row in set (0.02 sec)
mysql> select * from employees where employee_id=1;
+-------------+-----------+------------+---------------------------+
| employee_id | last_name | first_name | title |
+-------------+-----------+------------+---------------------------+
| 1 | Sehn | Tim | Software Engineer Manager |
+-------------+-----------+------------+---------------------------+
1 row in set (0.01 sec)
mysql> select * from dolt_history_employees where employee_id=1;
+-------------+-----------+------------+---------------------------+----------------------------------+-----------+-------------------------+
| employee_id | last_name | first_name | title | commit_hash | committer | commit_date |
+-------------+-----------+------------+---------------------------+----------------------------------+-----------+-------------------------+
| 1 | Sehn | Tim | Software Engineer Manager | 0vlrmae4kc2d10guqp5vdijlnefs0sjb | root | 2023-06-21 18:32:24.81 |
| 1 | Sehn | Tim | Software Engineer | k27ecdhqh3igk46a23mcqq0mg90pjre5 | root | 2023-06-21 18:31:58.486 |
| 1 | Sehn | Tim | Software Engineer Intern | hp705lj2ehefnuk6352fs05nm3m0cc6t | root | 2023-06-21 18:31:04.552 |
+-------------+-----------+------------+---------------------------+----------------------------------+-----------+-------------------------+
If you are looking at adding slowly changing dimension to your database, we think you should consider using Dolt instead. Look at how much simpler the example is. A copy of this database has been pushed to DoltHub if you would like to explore more.
Dolt also natively supports branches, merges, diffs, clones and remotes. All these version control concepts are modeled after Git so if you know Git, you know Dolt. Note, this is true version control. It is enforced at the database layer. Dolt shares storage between versions for data that doesn't change. Thus, Dolt can scale to much deeper histories than slowly changing dimension can allow.
Dolt is the database with slowly changing dimension built in. If you're interested in discussing slowly changing dimension or Dolt come hang out on our Discord.