Pandas DataFrames Works With Dolt

REFERENCE
4 min read

Today's blog we will go over a simple way to utilize Pandas Dataframes using Dolt's version control features. Pandas Dataframes simplify data manipulation and analysis. Dolt simplifies data sharing and collaboration. Combining the two, a user can write and read data and diff any data changes made by Pandas Dataframes functions. It might be easy to see the difference in the DataFrame after cleaning up, but what if your data is so big that you can only hope that the changes are correct?

Dolt + Pandas

Let's get started

Since the Pandas library is not a SQL tool for Python, we need SQLAlchemy or a similar SQL toolkit to connect to a Dolt database. Tim previously wrote an awesome blog on how to use SQLAlchemy with Dolt. It provides a clear guide on how to get started including installing Dolt and necessary python libraries. It also has many super useful functions defined to work with Dolt's version control features. Any functions used in this blog that are not defined will be referenced from the demo code from Tim's blog.

For the demo data, we will use one of popular databases on DoltHub, stocks database. It has a table called dividend with schema of:

% dolt sql -q "DESCRIBE dividend"
+------------+----------------+------+-----+---------+-------+
| Field      | Type           | Null | Key | Default | Extra |
+------------+----------------+------+-----+---------+-------+
| act_symbol | varchar(16383) | NO   | PRI | NULL    |       |
| ex_date    | date           | NO   | PRI | NULL    |       |
| amount     | decimal(10,5)  | YES  |     | NULL    |       |
+------------+----------------+------+-----+---------+-------+

The complete code for this demo can be found in this GitHub repository.

Use dolt diff feature to see what changes Pandas DataFrame made

Pandas DataFrame is good for data cleaning. It can fill or drop rows with None/NULL values. Let's add a couple of dummy entries with NULL values and commit using SQLAlchemy. This will allow us to see the difference in the working set, which holds all the changes since we committed.

'''
Inserts values to `dividend` table using SQLAlchemy. The `amount` values are NULL.
'''
def add_null_data(engine):
    dividend_table = Table('dividend', MetaData(), autoload_with=engine)
    stmt = insert(dividend_table).values([
        {'act_symbol': 'A', 'ex_date': '2006-11-02', 'amount': None},
        {'act_symbol': 'A', 'ex_date': '2012-03-31', 'amount': None}, 
        ])
    with engine.connect() as conn:
        conn.execute(stmt)
        conn.commit()

It is a good practice to develop data in a separate local branch than the default branch, master in this case. Let's create a new branch and checkout that branch. Now we have clean working set to make our changes safely. We retrieve dividend table as a Dataframe using the Pandas read_sql_table() function. We find and fill in the NULL values. In this example, we use the DataFrame ffill() function which replaces the NULL values with the last valid value observed.

# these functions are explained in Tim's blog.
dolt_create_branch(engine, 'dev-branch')
engine = dolt_checkout(database,'dev-branch')
print_active_branch(engine)

# read the `dividend` table.
dividend = pd.read_sql_table(table_name=table_name, con=engine.connect())
print('The first 5 entries: \n', dividend.head())
print('All entries with NULL amount: \n', dividend[dividend['amount'].isna()])

# fill the NULL values the last valid observed
updated_dividend = dividend.ffill()

Let's "update" the SQL table

Pandas library does not support SQL actions other than importing data from or writing data to a SQL database. The DataFrame to_sql() is used to write data to SQL database. The if_exists option makes the function fail if a table with the given name exists. append to the existing table. The replace option drops and creates a new table with the same name. The new table will not have any primary key defined and column types can be different from the original table. This causes Dolt be unable to perform diff on the table as the schema will differ between the original table and the new table. Therefore, we use append option combined with TRUNCATE statement. This deletes the data in the tables and re-inserts all the rows, including the modified ones, from the Panda dataframe. Don't worry, Dolt is versioned so we can always go back to an old version of our table if we need to.

'''
Truncates the given table and inserts the given dataframe
'''
def update_db_table_with_df(engine, table_name, df):
    with engine.connect() as con:
        con.execute(text('TRUNCATE '+table_name+';'))
        con.commit()
    df.to_sql(name=table_name, con=engine, if_exists='append', index=False)

Pandas sql table reading function only allow reading 'sql base tables and views'

We have used the Pandas read_sql_table() function previously to retrieve the dividend table. It only retrieves sql base tables. Since the dolt diff tables are system tables, we use read_sql_query() to get the diff data as a DataFrame.

diff = get_diff(engine, table_name, dividend.columns)
print('The diff result: \n', diff)
    
'''
Gets diff on given table in format of `from` to `to`.
'''
def get_diff(engine, table_name, columns):
    to_cols=''
    from_cols=''
    for col in columns:
        to_cols += ', to_'+col
        from_cols += ', from_'+col

    return pd.read_sql_query("select diff_type" + from_cols + to_cols + " from dolt_diff_" + table_name +" where to_commit='WORKING';", con=engine.connect())

Here is the result we get from the demo. We can see the steps we did to develop data safely and verify the changes. The next step is to commit and merge the changes to master branch, which is included in SQLAlchemy works with Dolt blog demo.

Using branch: master
Active branch: master
Created commit: 0obegghjslprnk2k9fb76hfjn257utjf
The first 5 entries: 
   act_symbol    ex_date   amount
0          A 2006-11-01  1.47139
1          A 2006-11-02      NaN
2          A 2012-03-30  0.07153
3          A 2012-03-31      NaN
4          A 2012-06-29  0.07153
All entries with NULL amount: 
   act_symbol    ex_date  amount
1          A 2006-11-02     NaN
3          A 2012-03-31     NaN
Created branch: dev-branch
Using branch: dev-branch
Active branch: dev-branch
The diff result:
  diff_type from_act_symbol from_ex_date from_amount to_act_symbol  to_ex_date  to_amount
0  modified               A   2006-11-02        None             A  2006-11-02    1.47139
1  modified               A   2012-03-31        None             A  2012-03-31    0.07153

Summary

This is it. We went over a super simple example of how you can utilize useful functions from Pandas DataFrames with Dolt's version control features. There are many other creative ways to combine these two cool tools. We would love to hear about your use case of combining Dolt and DataFrames. Need help with these tools? We can be found on our Discord channel, or file an issue on our Dolt GitHub repository.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.