Using Dolt Workbench to Work with the Medicaid Database

7 min read

Last week, I published an article called “Can Claude Find Fraud?”. The article covered the results of a five-hour session with Dolt Workbench in agent mode analyzing a newly-released Medicaid payment dataset.

In that article, I promised to show you how to analyze the data yourself. I’m at home sick on my Windows desktop so let me start from scratch here and show you how.

Install Dolt Workbench#

As I said, I’m at home on Windows so I’ll head to the Microsoft store to install Dolt Workbench. If you are not a Windows user, Dolt Workbench is also available to download from our GitHub or from the Mac App Store. I clicked the “Download and Install” link in the Windows App Store and opened Dolt Workbench.

Clone timsehn/medicaid_provider_spending#

Once Dolt Workbench is open, click the “Clone a Remote Dolt database from DoltHub” option. Enter the Connection Name (e.g. medicaid), Port (e.g. 3658), Owner Name (e.g. timsehn), and Remote Database Name (e.g. medicaid_provider_spending).

Workbench Clone

Once, you’ve entered the fields click “Start Clone”. The Medicaid spending database is about 30GB so it may take a while to clone. On my home internet, it took about 10 minutes. Once it finished, I was greeted with the database.

Workbench Database

Start Agent Mode#

There is a robot icon in the top right corner of Dolt Workbench. Click that to open agent mode. The workbench uses Claude as its agent. You will be asked to enter an Anthropic API key. Grab one from the Anthropic website and enter it.

Now, ask the agent something about the database. I went with “Tell me about this database”.

Workbench Agent Mode

It’s that easy. You have the power of a SQL workbench and a modern AI agent all in the same handy user interface. This makes data analysis a breeze. Dolt Workbench is Cursor for SQL.

Find some Fraud#

Now that we have agent mode working, we can instruct the agent to look for suspicious payment activity in our Medicaid payments database. This is the prompt I used.

This database contains all medicaid payments from January 2018 through December 2024. 
More details can be found here:

https://opendata.hhs.gov/datasets/medicaid-provider-spending/

There have been claims of rampant fraud in medicaid payments. I would like you to 
analyze the data and try to identify irregular payment patterns in the data. 

The approach you use is up to you but I would like all identified fraud to be 
persisted as a view. Thus, we can persist and share the queries used on DoltHub. 

This database is large: 227M records. Plan your queries accordingly. I would avoid 
queries with greater than two joins or subqueries. For more complicated queries
that require deeper analysis, use a Python script.

The agent came up with a number of views, but I only saved one called fraud_risk_score. Here is its definition.

CREATE VIEW `fraud_risk_score` AS -- Composite fraud risk score ranking providers by how many independent fraud signals
-- they trigger. Each signal contributes 1 point. Providers triggering multiple signals
-- are the highest-confidence fraud candidates.
--
-- Signals:
--   1. high_pay_per_claim: avg payment/claim > $15,000 (10x 99th percentile)
--   2. high_claims_per_bene: claims/beneficiary > 139 (5x 99th percentile)
--   3. high_paid_per_bene: spending/beneficiary > $27,156 (5x 99th percentile)
--   4. low_procedure_diversity: >$10M total with <=3 HCPCS codes
--   5. ghost_billing: >500 distinct servicing providers with >$1M total
--
-- Thresholds derived from statistical analysis of all 227M records (Jan 2018 - Dec 2024).
-- Per-code pricing outliers excluded here as they require per-code median lookups.
SELECT
    BILLING_PROVIDER_NPI_NUM,
    SUM(TOTAL_PAID) as total_paid,
    SUM(TOTAL_CLAIMS) as total_claims,
    SUM(TOTAL_UNIQUE_BENEFICIARIES) as total_beneficiaries,
    COUNT(DISTINCT HCPCS_CODE) as num_procedures,
    COUNT(DISTINCT CLAIM_FROM_MONTH) as num_months,
    COUNT(DISTINCT SERVICING_PROVIDER_NPI_NUM) as num_servicing_providers,
    SUM(TOTAL_PAID) / NULLIF(SUM(TOTAL_CLAIMS), 0) as avg_paid_per_claim,
    SUM(TOTAL_CLAIMS) / NULLIF(SUM(TOTAL_UNIQUE_BENEFICIARIES), 0) as claims_per_beneficiary,
    SUM(TOTAL_PAID) / NULLIF(SUM(TOTAL_UNIQUE_BENEFICIARIES), 0) as paid_per_beneficiary,
    (CASE WHEN SUM(TOTAL_CLAIMS) > 100
               AND SUM(TOTAL_PAID) > 1000000
               AND SUM(TOTAL_PAID) / NULLIF(SUM(TOTAL_CLAIMS), 0) > 15000
          THEN 1 ELSE 0 END)
    +
    (CASE WHEN SUM(TOTAL_UNIQUE_BENEFICIARIES) > 10
               AND SUM(TOTAL_PAID) > 500000
               AND SUM(TOTAL_CLAIMS) / NULLIF(SUM(TOTAL_UNIQUE_BENEFICIARIES), 0) > 139
          THEN 1 ELSE 0 END)
    +
    (CASE WHEN SUM(TOTAL_UNIQUE_BENEFICIARIES) > 10
               AND SUM(TOTAL_PAID) > 1000000
               AND SUM(TOTAL_PAID) / NULLIF(SUM(TOTAL_UNIQUE_BENEFICIARIES), 0) > 27156
          THEN 1 ELSE 0 END)
    +
    (CASE WHEN SUM(TOTAL_PAID) > 10000000
               AND COUNT(DISTINCT HCPCS_CODE) <= 3
          THEN 1 ELSE 0 END)
    +
    (CASE WHEN COUNT(DISTINCT SERVICING_PROVIDER_NPI_NUM) > 500
               AND SUM(TOTAL_PAID) > 1000000
          THEN 1 ELSE 0 END)
    as fraud_signal_count
FROM medicaid_provider_spending
GROUP BY BILLING_PROVIDER_NPI_NUM
HAVING
    (CASE WHEN SUM(TOTAL_CLAIMS) > 100
               AND SUM(TOTAL_PAID) > 1000000
               AND SUM(TOTAL_PAID) / NULLIF(SUM(TOTAL_CLAIMS), 0) > 15000
          THEN 1 ELSE 0 END)
    +
    (CASE WHEN SUM(TOTAL_UNIQUE_BENEFICIARIES) > 10
               AND SUM(TOTAL_PAID) > 500000
               AND SUM(TOTAL_CLAIMS) / NULLIF(SUM(TOTAL_UNIQUE_BENEFICIARIES), 0) > 139
          THEN 1 ELSE 0 END)

It takes about a minute to run. But once it does, you can see the following results.

Workbench View Results

Then, to do some analysis of the provider, just take the NPI field and ask Claude about it. Something like so:

For NPI 1013030808 tell me a little about the provider. 
Is there any reports of fraud or wrongdoing in the past. 
Why did it show up in this view?

After a couple minutes, the following report was generated.

NPI 1013030808 — Provider Analysis#

Provider Identification#

FieldDetails
NPI1013030808
Provider NameCounty of Merced
Also Known AsMarie Green Psychiatric Healthcare Facility
TypeOrganization (NPI-2)
SpecialtyPsychiatric Hospital (Taxonomy 283Q00000X)
Address300 E 15th St, Suite B, Merced, CA 95341
Enumeration DateApril 9, 2007

The Marie Green Psychiatric Center is a county-run psychiatric health facility operated by Merced County Behavioral Health and Recovery Services. It is a small facility with 16 licensed beds that provides inpatient services for individuals needing brief psychiatric hospitalization due to a crisis or emergency. It primarily serves Medi-Cal patients and uninsured individuals.

What the Data Shows#

This provider bills exclusively under HCPCS code H2013 — “Psychiatric health facility service, per diem” — which is a standard billing code for daily inpatient psychiatric care. Here are the key numbers:

MetricValue
Total Paid (2018-2024)$47.6 million
Total Claims2,578
Total Beneficiaries2,248
Average Paid Per Claim~$18,464
Claims Per Beneficiary~1.15
Number of HCPCS Codes1
Months Active73 out of 84
Servicing Providers1

Why It Appears in the Fraud Risk View#

The provider triggered 2 out of 5 fraud signals, which is the minimum threshold to appear in the fraud_risk_score view. Specifically:

Signal 1: High Pay Per Claim (TRIGGERED)#

The average payment per claim is ~$18,464, which exceeds the $15,000 threshold. This is because each “claim” represents a per diem billing period, and psychiatric facility stays can accumulate significant daily charges. For a 16-bed facility billing per diem over many years, this rate is not inherently unusual — it reflects the cost of inpatient psychiatric care in California.

Signal 2: Low Procedure Diversity (TRIGGERED)#

The provider bills over $10 million total but uses only 1 HCPCS code. This makes sense for a specialized psychiatric health facility — H2013 is the standard per diem code for this type of facility. A psychiatric hospital would not typically bill a wide range of procedure codes.

Signals Not Triggered#

  • High Claims Per Beneficiary: The claims-per-beneficiary ratio is very low (1.15), well below the 139 threshold.
  • High Paid Per Beneficiary: Paid per beneficiary (~$21,175) does not exceed the $27,156 threshold.
  • Ghost Billing: There is only 1 servicing provider, far below the 500 threshold.

Monthly Spending Trend (2018-2024)#

MonthClaimsBeneficiariesTotal Paid
2018-012929$431,578
2018-064136$580,549
2018-123030$383,381
2019-061414$336,777
2019-123129$554,088
2020-062018$727,827
2021-062019$536,261
2021-123937$807,958
2022-063636$615,253
2022-122828$682,231
2023-064846$1,223,172
2023-122726$847,309
2024-074343$760,950

Note: Table above shows selected months. Full data spans 73 months from January 2018 through July 2024.

Fraud or Wrongdoing History#

Based on research of publicly available records, no fraud investigations, enforcement actions, or OIG exclusions were found associated with either the County of Merced or the Marie Green Psychiatric Center. The facility appears in publicly available directories as a legitimate county-operated behavioral health facility.

Assessment#

This is a false positive in the fraud risk view. The two signals that were triggered are entirely explained by the nature of the facility:

  • A psychiatric hospital naturally has high per diem costs, which inflates the “pay per claim” metric.
  • A specialized facility legitimately bills under a single HCPCS code (H2013), triggering the “low procedure diversity” flag.

The provider’s other metrics are actually quite normal — low claims per beneficiary, a single servicing provider, and consistent activity over 73 months. This pattern is consistent with a small, county-run psychiatric hospital rather than a fraudulent operation.

Sources#

Have Fun!#

Dolt Workbench in agent mode is an incredible way to explore this Medicaid dataset. I’ve already spent about ten hours working through the data with Claude and will have more analyses beyond the one I published last week coming in the next couple weeks. If you want a spoiler, ask Dolt Workbench agent about ambulances in Chicago.

Need help getting started with Dolt Workbench and this dataset? We’re always available on Discord and would love to get you going.