Pandas DataFrames Works With Dolt
Today's blog we will go over a simple way to utilize Pandas Dataframes using Dolt's version control features. Pandas Dataframes simplify data manipulation and analysis. Dolt simplifies data sharing and collaboration. Combining the two, a user can write and read data and diff any data changes made by Pandas Dataframes functions. It might be easy to see the difference in the DataFrame after cleaning up, but what if your data is so big that you can only hope that the changes are correct?
Let's get started
Since the Pandas library is not a SQL tool for Python, we need SQLAlchemy or a similar SQL toolkit to connect to a Dolt database.
Tim previously wrote an awesome blog on how to use SQLAlchemy with Dolt.
It provides a clear guide on how to get started including installing Dolt
and necessary python libraries. It also
has many super useful functions defined to work with Dolt's version control features. Any functions used in this blog
that are not defined will be referenced from the demo code from Tim's blog.
For the demo data, we will use one of popular databases on DoltHub,
stocks database. It has a table called dividend
with schema of:
% dolt sql -q "DESCRIBE dividend"
+------------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+----------------+------+-----+---------+-------+
| act_symbol | varchar(16383) | NO | PRI | NULL | |
| ex_date | date | NO | PRI | NULL | |
| amount | decimal(10,5) | YES | | NULL | |
+------------+----------------+------+-----+---------+-------+
The complete code for this demo can be found in this GitHub repository.
Use dolt diff
feature to see what changes Pandas DataFrame made
Pandas DataFrame is good for data cleaning. It can fill or drop rows with None
/NULL
values.
Let's add a couple of dummy entries with NULL values and commit using SQLAlchemy
. This will allow us to see the difference
in the working set, which holds all the changes since we committed.
'''
Inserts values to `dividend` table using SQLAlchemy. The `amount` values are NULL.
'''
def add_null_data(engine):
dividend_table = Table('dividend', MetaData(), autoload_with=engine)
stmt = insert(dividend_table).values([
{'act_symbol': 'A', 'ex_date': '2006-11-02', 'amount': None},
{'act_symbol': 'A', 'ex_date': '2012-03-31', 'amount': None},
])
with engine.connect() as conn:
conn.execute(stmt)
conn.commit()
It is a good practice to develop data in a separate local branch than the default branch, master
in this case. Let's create
a new branch and checkout that branch. Now we have clean working set to make our changes safely. We
retrieve dividend
table as a Dataframe using the Pandas read_sql_table()
function. We find and fill in the NULL
values.
In this example, we use the DataFrame ffill()
function which replaces the NULL
values with the last valid value observed.
# these functions are explained in Tim's blog.
dolt_create_branch(engine, 'dev-branch')
engine = dolt_checkout(database,'dev-branch')
print_active_branch(engine)
# read the `dividend` table.
dividend = pd.read_sql_table(table_name=table_name, con=engine.connect())
print('The first 5 entries: \n', dividend.head())
print('All entries with NULL amount: \n', dividend[dividend['amount'].isna()])
# fill the NULL values the last valid observed
updated_dividend = dividend.ffill()
Let's "update" the SQL table
Pandas library does not support SQL actions other than importing data from or writing data to a SQL database.
The DataFrame to_sql()
is used to write
data to SQL database. The if_exists
option makes the function fail if a table with the given name exists.
append to the existing table. The replace
option drops and creates a new table with the same name. The new table will not
have any primary key defined and column types can be different from the original table. This causes Dolt be unable to
perform diff on the table as the schema will differ between the original table and the new table. Therefore, we use
append
option combined with TRUNCATE
statement. This deletes the data in the tables and re-inserts all the rows,
including the modified ones, from the Panda dataframe. Don't worry, Dolt is versioned so we can always go back to an
old version of our table if we need to.
'''
Truncates the given table and inserts the given dataframe
'''
def update_db_table_with_df(engine, table_name, df):
with engine.connect() as con:
con.execute(text('TRUNCATE '+table_name+';'))
con.commit()
df.to_sql(name=table_name, con=engine, if_exists='append', index=False)
Pandas sql table reading function only allow reading 'sql base tables and views'
We have used the Pandas read_sql_table()
function previously to retrieve the dividend
table. It only retrieves
sql base tables. Since the dolt diff tables are system tables, we use read_sql_query()
to get the diff data as
a DataFrame.
diff = get_diff(engine, table_name, dividend.columns)
print('The diff result: \n', diff)
'''
Gets diff on given table in format of `from` to `to`.
'''
def get_diff(engine, table_name, columns):
to_cols=''
from_cols=''
for col in columns:
to_cols += ', to_'+col
from_cols += ', from_'+col
return pd.read_sql_query("select diff_type" + from_cols + to_cols + " from dolt_diff_" + table_name +" where to_commit='WORKING';", con=engine.connect())
Here is the result we get from the demo. We can see the steps we did to develop data safely and verify the
changes. The next step is to commit and merge the changes to master
branch, which is included in SQLAlchemy works with Dolt blog demo.
Using branch: master
Active branch: master
Created commit: 0obegghjslprnk2k9fb76hfjn257utjf
The first 5 entries:
act_symbol ex_date amount
0 A 2006-11-01 1.47139
1 A 2006-11-02 NaN
2 A 2012-03-30 0.07153
3 A 2012-03-31 NaN
4 A 2012-06-29 0.07153
All entries with NULL amount:
act_symbol ex_date amount
1 A 2006-11-02 NaN
3 A 2012-03-31 NaN
Created branch: dev-branch
Using branch: dev-branch
Active branch: dev-branch
The diff result:
diff_type from_act_symbol from_ex_date from_amount to_act_symbol to_ex_date to_amount
0 modified A 2006-11-02 None A 2006-11-02 1.47139
1 modified A 2012-03-31 None A 2012-03-31 0.07153
Summary
This is it. We went over a super simple example of how you can utilize useful functions from Pandas DataFrames with Dolt's version control features. There are many other creative ways to combine these two cool tools. We would love to hear about your use case of combining Dolt and DataFrames. Need help with these tools? We can be found on our Discord channel, or file an issue on our Dolt GitHub repository.