Announcing Write Queries for DoltHub's SQL API
DoltHub is a place on the internet to share, discover, and collaborate on Dolt databases. It's been our goal to add more features to make it easy to add and edit data on DoltHub. As a part of this roadmap, we're excited to announce our previously read-only SQL API is now writable!
How it works
DoltHub's SQL API exposes data via an HTTP endpoint at www.dolthub.com/api/v1alpha1
. Our
existing read-only url has the form:
https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/<branch_name>?q=<sql_query>
Since adding, updating, and deleting data can take a bit of time to execute against larger databases, we made the writable API asynchronous. Therefore we needed to add two new HTTP endpoints for executing these kinds of queries using the SQL API.
The first endpoint starts the write operation with the database information and query. The url has the form:
https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/write/<from_branch>/<to_branch>?q=<sql_query>
There are a few important things to note here:
- A nonexistent
<to_branch>
creates a new branch from the<from_branch>
- An empty
<sql_query>
merges<from_branch>
into<to_branch>
This allows the user to branch, change, review, and merge data using this endpoint, similar to what the workflow would be using Dolt or DoltHub.
The first endpoint returns an operation name in the response, which can be used to poll the second endpoint until the operation has completed. This operation url has the form:
https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/write?operationName=<operation_name>
Once done === true
the response will include a to_commit_id
and from_commit_id
,
which can be used to query the
dolt_commit_diff_$tablename
table for reviewing changes.
An example
Our SHAQ basketball bounty finished a
few days ago. Tim wrote a blog about some work needed to
clean it up, which mostly includes deduping players added twice with different player_id
s.
This could be a great use case for the writable SQL API.
Authentication
Making the SQL API writable required putting in place some kind of authentication system. Anyone can access our read-only endpoint for public databases, but making changes to a database requires write permissions. We created API tokens, which can be used in the header of an API request to authenticate. You can learn more about this process in our docs here.
Now that we have our API token, I can fork dolthub/SHAQ
to my own namespace
(tbantle/SHAQ
) and we can move forward updating players in our SHAQ
fork.
What we want to change
We can use Dolt, a query on DoltHub, or the read-only SQL API endpoint to see which players have more than one entry.
SHAQ $ dolt sql -q "select concat(lower(first_name), ' ', lower(last_name)) as name, count(*) from players group by name having count(name) > 1 order by count(*) desc limit 10"
+-----------------+----------+
| name | count(*) |
+-----------------+----------+
| brandon brown | 37 |
| brandon johnson | 28 |
| jordan jones | 27 |
| marko popovic | 27 |
| jordan johnson | 25 |
| chris williams | 23 |
| chris johnson | 23 |
| jordan davis | 23 |
| justin robinson | 22 |
| jamar smith | 22 |
+-----------------+----------+
I scanned the list and found two
entries
for a college player named Dimitri Saliba in the players
table, one with player_id
of
714287 and the other 15404617. In the player_season_stat_totals
table there are four
entries
for these IDs, two for his years at Western Wyoming (NJCAA DI) and two for his years at
Colorado Christian Cougars (NCAA DII).
After a quick Google to make sure the same person played at both these schools, we want to
remove one of the players
entries for Dimitri Saliba and update the player_id
in the
player_season_stat_totals
table. We're going to keep player_id
714287 since it has
more information filled out.
Using the writable SQL API
The process of using the SQL API to make changes requires a few steps.
1. Run query on new branch
First, we want to hit the write endpoint (using the Python requests
library) with a from
branch (main
), to branch (update-dimitri
, which does not exist yet), and an UPDATE
query. I'm going to put this logic in a function because we'll use it again later.
import requests
owner, repo, from_branch, to_branch = 'tbantle', 'SHAQ', 'main', 'update-dimitri'
query = '''UPDATE player_season_stat_totals SET player_id=714287 WHERE player_id=15404617'''
headers = {
'authorization': 'token [token you copied from Settings > Tokens (starts with dhat.v1.)]'
}
def run_write_query(from_branch_name, to_branch_name, q):
res = requests.post(
f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/write/{from_branch_name}/{to_branch_name}',
params={'q': query},
headers=headers,
)
return res.json()
update_res = run_write_query(from_branch, to_branch, query)
print(update_res)
The JSON response includes an operation name, which can be used to poll the operation endpoint.
{
"query_execution_status": "Success",
"query_execution_message": "",
"repository_owner": "tbantle",
"repository_name": "SHAQ",
"to_branch_name": "update-dimitri",
"from_branch_name": "main",
"query": "UPDATE player_season_stat_totals SET player_id=714287 WHERE player_id=15404617",
"operation_name": "operations/72abb56b-d478-43ae-9a2d-c9602184c7ab"
}
2. Poll operation endpoint
Now, we need a function that makes a get request to the operation endpoint, and calls this
function every few seconds until the done
field in the response is true.
import time
def get_operation(op_name):
poll_res = requests.get(
f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/write',
params={'operationName': op_name},
headers=headers,
)
return poll_res.json()
def poll_operation(op_name):
done = False
while done == False:
poll_res = get_operation(op_name)
done = poll_res['done']
if done:
return poll_res
else:
time.sleep(3)
op_res = poll_operation(update_res['operation_name'])
print(op_res)
We'll get some query metadata when the operation is done, which includes to and from
commit ids that can be used to query the dolt_commit_diff_$tablename
table.
{
"_id": "operations/72abb56b-d478-43ae-9a2d-c9602184c7ab",
"done": true,
"res_details": {
"query_execution_status": "Success",
"query_execution_message": "Query OK, 2 rows affected. Rows matched: 2 Changed: 2 Warnings: 0.",
"owner_name": "tbantle",
"repository_name": "SHAQ",
"from_commit_id": "518ue176ec13qf563e87uoerkqqepood",
"to_commit_id": "5sp7i0bov6itma2u04dpk15ui24lrigp"
}
}
3. View diff
A lot of Dolt's functionality is exposed via SQL, so we can query commit diffs using the
API. Each row in the
dolt_commit_diff_$tablename
table represents a row that has changed between two commits. We can use this table to view
the resulting changes from our UPDATE
query. To make the results easier to read, we're
only going to select a few columns.
def get_dolt_commit_diff(table_name, columns, from_commit, to_commit, branch):
col_str = ', '.join((columns)
query = f'''SELECT {col_str}, diff_type FROM `dolt_commit_diff_{table_name}` WHERE from_commit="{from_commit}" AND to_commit="{to_commit}"'''
diff_res = requests.get(
f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}',
params={'q': query},
headers=headers,
)
return diff_res.json()
details = poll_res['res_details']
cols = ['from_player_id', 'to_player_id']
get_dolt_commit_diff('player_season_stat_totals', cols, details['from_commit_id'], details['to_commit_id'], to_branch)
As with any read query, the response includes some metadata, the table schema, and the result rows.
{
"query_execution_status": "Success",
"query_execution_message": "",
"repository_owner": "tbantle",
"repository_name": "SHAQ",
"commit_ref": "update-dimitri",
"sql_query": "SELECT from_player_id, to_player_id, diff_type FROM `dolt_commit_diff_player_season_stat_totals` WHERE from_commit='518ue176ec13qf563e87uoerkqqepood' AND to_commit='5sp7i0bov6itma2u04dpk15ui24lrigp'",
"schema": [
{
"columnName": "from_player_id",
"columnType": "INT",
"isPrimaryKey": false
},
{
"columnName": "to_player_id",
"columnType": "INT",
"isPrimaryKey": false
},
{ "columnName": "diff_type", "columnType": "TEXT", "isPrimaryKey": false }
],
"rows": [
{
"from_player_id": "null",
"to_player_id": "714287",
"diff_type": "added"
},
{
"from_player_id": "null",
"to_player_id": "714287",
"diff_type": "added"
},
{
"from_player_id": "15404617",
"to_player_id": "null",
"diff_type": "removed"
},
{
"from_player_id": "15404617",
"to_player_id": "null",
"diff_type": "removed"
}
]
}
4. Repeat
Now that we've updated the player_id
in the player_season_stat_totals
table, we can
delete the extra Dimitri Saliba entry in the players
table. We can repeat steps 1-3
above using the functions we wrote, but changing some of the arguments.
# Run query
delete_query = 'DELETE FROM players WHERE player_id=15404617'
delete_res = run_write_query(to_branch, to_branch, delete_query)
# Poll operation
op_res = poll_operation(delete_res['operation_name'])
# View diff
details = poll_res['res_details']
get_dolt_commit_diff('players', details['from_commit_id'], details['to_commit_id'], to_branch)
5. Merge branches
Once we're satisfied with our changes, we can merge our branches using our
run_write_query
function without a query.
merge_res = run_write_query(to_branch, from_branch, '')
poll_res = poll_operation(merge_res['operation_name'])
print(poll_res)
We get a successful response with an empty to_commit_id
.
{
"_id": "operations/0690bd2c-024b-44fb-bd9e-7163a484d17a",
"done": true,
"res_details": {
"query_execution_status": "Success",
"query_execution_message": "",
"owner_name": "tbantle",
"repository_name": "SHAQ",
"from_commit_id": "518ue176ec13qf563e87uoerkqqepood",
"to_commit_id": ""
}
}
If we look at our from branch (main
in this case), we should see the commits generated
from our queries in the log.
Conclusion
Now that the SQL API is writable, you really can do anything with your data. We're still continuing to improve the SQL API, and it is still in alpha. Its shape may change over time, but the basic principle of data access through SQL will remain. If you have any questions or want to talk to us about your use case, please feel free to reach out on Discord.