In our blogs we have shown over and over again how easy it is to clone data from DoltHub and immediately
start querying it with SQL. We are constantly working on improving our data catalog, and as it grows there are more and more
opportunities to derive interesting information by joining two or more of its datasets. In this blog I'll walk you through
how I took data from the dolthub/us-congress repository and
the dolthub/irs-soi repository to find
the congressional districts representing the largest portion of US tax dollars.
The Setup
To begin we'll create a directory to store our repositories, and then clone data into them.
mkdir datasets
cd datasets
dolt clone dolthub/irs-soi
dolt clone dolthub/us-congress
At this point we are immediately ready to begin running SQL against our data. We will start a Dolt SQL shell session
with the --multi-db-dir parameter, which tells Dolt to look at every subdirectory of the provided directory as a separate
database (version 0.16.4+ is required to try this yourself). We run the command from our newly created datasets directory like so:
dolt sql --multi-db-dir ./
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| irs_soi            |
| us_congress        |
+--------------------+
As you can see, the shell shows our two cloned repositories, but it's worth noting that the names of the repositories
have been converted to valid SQL identifiers (in this case underscores replace dashes). Additionally, the
shell shows the information_schema database. This is a standard MySQL system database which provides metadata about all
the databases, but it is not important to our work here.
One thing to note about this data is that congressional districts are not cut along zipcode boundaries. As a result, a
zipcode that is not wholly contained within a single congressional district will appear in multiple rows, once for each
district that contains some of its residents.
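You can see which zipcodes straddle district boundaries with a query along these lines (this assumes the districts table has the state, district, and zip columns used later in this post):

```sql
-- Count how many districts each zipcode appears in; any zip with a
-- count above 1 is split across a district boundary.
SELECT zip, COUNT(*) AS district_count
FROM districts
GROUP BY zip
HAVING COUNT(*) > 1
ORDER BY district_count DESC;
```

Keep these split zips in mind; they are the reason the district-level numbers later in this post are approximations.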
Joining Data from Multiple Databases
Now that we have cloned our data and opened a shell which provides access to both databases, we will create a table and
insert data joined from both databases. Our table will be every state, district, and zip joined with the adjusted gross
incomes and tax return count for each zip.
> USE us_congress;
Database changed
us_congress> CREATE TABLE irs_district_info (
          ->   state VARCHAR(2) NOT NULL,
          ->   district VARCHAR(2) NOT NULL,
          ->   zip VARCHAR(5) NOT NULL,
          ->   return_count BIGINT NOT NULL,
          ->   agi BIGINT NOT NULL,
          ->   PRIMARY KEY (zip, state, district)
          -> );
us_congress> INSERT INTO irs_district_info
          -> SELECT districts.state, districts.district, districts.zip, allnoagi.return_count, allnoagi.adjusted_gross_income
          -> FROM districts
          -> INNER JOIN irs_soi.allnoagi AS allnoagi ON districts.zip = allnoagi.zip;
Query OK, 34943 rows affected
Now that we have a table that contains districts and data from the IRS, we will create a view that makes it easy to see
aggregations at the district level. As mentioned earlier, because congressional districts are not cut along zipcode
boundaries this is only an approximation. If zipcode Z has some residents in district 1 and other residents
in district 2, then that zipcode's tax dollars are counted twice.
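A view definition along these lines would produce the district_agi view queried below; the exact definition is an assumption based on the columns (total_agi, return_count) that the later queries use:

```sql
-- Assumed definition of the district_agi view: roll the per-zip rows
-- in irs_district_info up to the (state, district) level.
CREATE VIEW district_agi AS
SELECT state,
       district,
       SUM(agi) AS total_agi,
       SUM(return_count) AS return_count
FROM irs_district_info
GROUP BY state, district;
```

Because the view is defined over the joined table, any double counting of split zipcodes carries through to these aggregates.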
Now we are all set to look at the congressional districts representing the most and fewest tax dollars. We'll
start by looking at the districts representing the most tax dollars.
us_congress> SELECT *
          -> FROM house
          -> INNER JOIN district_agi ON house.district = district_agi.district AND house.state = district_agi.state
          -> ORDER BY total_agi DESC
          -> LIMIT 10;
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
| state | district | last_name  | first_name     | party      | year_assumed_office | residence              | birth_year | sex | district | state | total_agi | return_count |
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
| NY    | 10       | Nadler     | Jerry          | Democratic | 1992                | Chelsea, Manhattan     | 1947       | M   | 10       | NY    | 180290441 | 920080       |
| NY    | 12       | Maloney    | Carolyn        | Democratic | 1993                | East Harlem, Manhattan | 1946       | F   | 12       | NY    | 159256865 | 666310       |
| CA    | 18       | Eshoo      | Anna           | Democratic | 1993                | Atherton, California   | 1942       | F   | 18       | CA    | 139138057 | 671370       |
| CA    | 33       | Lieu       | Ted            | Democratic | 2015                | Torrance, California   | 1969       | M   | 33       | CA    | 125883759 | 715860       |
| TX    | 2        | Crenshaw   | Dan            | Republican | 2019                | Spring, Texas          | 1984       | M   | 2        | TX    | 94084261  | 856060       |
| CA    | 17       | Khanna     | Ro             | Democratic | 2017                | Fremont, California    | 1976       | M   | 17       | CA    | 84393712  | 594210       |
| TX    | 21       | Roy        | Chip           | Republican | 2019                | Austin, Texas          | 1972       | M   | 21       | TX    | 82842753  | 865730       |
| IL    | 7        | Davis      | Danny K.       | Democratic | 1997                | Austin, Chicago        | 1941       | M   | 7        | IL    | 81989015  | 980360       |
| TX    | 7        | Fletcher   | Lizzie Pannill | Democratic | 2019                | Houston                | 1975       | F   | 7        | TX    | 81768777  | 659310       |
| NJ    | 7        | Malinowski | Tom            | Democratic | 2019                | Rocky Hill, New Jersey | 1965       | M   | 7        | NJ    | 81167355  | 616610       |
+-------+----------+------------+----------------+------------+---------------------+------------------------+------------+-----+----------+-------+-----------+--------------+
The adjusted gross income data from the IRS is in thousands of dollars, so you can see that our approximation of adjusted
gross income for each congressional district has residents of New York's 10th district accounting for $180 billion in taxable
income in 2017.
Now the districts representing the fewest tax dollars.
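This is the same query as above with the sort order reversed:

```sql
-- Same join as before, but ascending: districts with the least
-- adjusted gross income come first.
SELECT *
FROM house
INNER JOIN district_agi
    ON house.district = district_agi.district
   AND house.state = district_agi.state
ORDER BY total_agi ASC
LIMIT 10;
```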
Multi-Database Support in dolt sql-server
Just like dolt sql, dolt sql-server supports the --multi-db-dir parameter, allowing you to query
multiple Dolt repositories, each exposed as a separate SQL database. Additionally, dolt sql-server now supports a YAML configuration
file using the --config <file> parameter, which offers more complete control over your databases. In a YAML
configuration file you may provide the path to each Dolt repository on your local machine along with the database name
it will be accessible through. This is far less restrictive than requiring all datasets to be subdirectories of the
directory that the command runs in. The YAML configuration file also provides configuration for things that are not
available via command line configuration. See the dolt sql-server documentation
for details.
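As a sketch, a configuration file mapping local repository paths to database names might look like the following. The paths are placeholders and the exact schema here is an assumption; consult the dolt sql-server documentation for the authoritative format.

```yaml
listener:
  host: localhost
  port: 3306

databases:
  - name: irs_soi
    path: /path/to/datasets/irs-soi
  - name: us_congress
    path: /path/to/datasets/us-congress
```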
Conclusion
Historically, acquiring the data you need can be a painful process. It may involve scraping websites, downloading CSVs, or
parsing data files in proprietary formats. Once you have the raw data, you may need to write scripts to make it usable before
you can import it into your analysis tools. Often you need to repeat this process several times, combining data from
different sources to build the dataset you really need. Dolt was designed to solve these problems. Dolt
lets you clone datasets and immediately start combining them using SQL.
The vision for DoltHub is to be a place where our users can come together and collaborate on data in a way that hasn't been
possible before. As we execute against that vision we hope to become the place to go for well-maintained data. We feel
that being able to easily acquire data from different sources and combine it to build new and exciting datasets is an
extremely compelling reason why you should give Dolt a try, and why you should put your data on DoltHub.
Multiple database support is new to Dolt. We think it provides a hint at where Dolt could go in the future as a
distributed, online data catalog. Imagine being able to search DoltHub for all the datasets with
similar columns to join on, and using this new feature to analyze the combined data.