DQBench

The standard benchmark for data quality tools — detection, transformation, entity resolution, and pipeline orchestration.

Requires Python 3.11+. License: MIT.

The ImageNet of data quality — standardized benchmarks for validation, transformation, entity resolution, and pipeline tools.

Why DQBench?

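Every data quality tool claims to be the best, but there is no standard way to compare them. DQBench fixes that with tiered benchmark datasets, a common scoring formula, and a pluggable adapter interface, so every tool is measured the same way.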

Install

pip install dqbench

The repo also includes an OCR Company benchmark for post-OCR company-name confidence and correction quality.

Quick Start

# Run detection benchmark with GoldenCheck
pip install goldencheck
dqbench run goldencheck

# Run ER benchmark with GoldenMatch
pip install goldenmatch
dqbench run goldenmatch

# Run pipeline benchmark with GoldenPipe
pip install goldenpipe
dqbench run goldenpipe

# Run with a custom adapter
dqbench run --adapter my_adapter.py

Benchmark Categories

| Category | What it measures | Example tools |
| --- | --- | --- |
| Detect | Find data quality issues in a dataset | GoldenCheck, Great Expectations, Pandera, Soda Core |
| Transform | Clean, normalize, and repair data | GoldenFlow, dbt, pandas |
| ER | Entity resolution: deduplicate and link records | GoldenMatch, Splink, Dedupe |
| Pipeline | End-to-end pipeline orchestration and quality gates | GoldenPipe, Airflow, Prefect |
| OCR Company | OCR company-name confidence, review, and correction quality | OCR confidence/correction tools |

Head-to-Head Results — Detect (DQBench v1.0)

| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
| --- | --- | --- | --- | --- | --- |
| GoldenCheck | zero-config | 84.9% | 80.0% | 57.6% | 72.00 |
| Pandera | best-effort | 36.4% | 38.1% | 25.0% | 32.51 |
| Soda Core | best-effort | 38.1% | 23.5% | 13.3% | 22.36 |
| Great Expectations | best-effort | 36.4% | 23.5% | 12.5% | 21.68 |
| Great Expectations | auto-profiled | 22.2% | 42.1% | 0.0% | 21.29 |
| Soda Core | auto-profiled | 0.0% | 11.1% | 6.2% | 6.94 |
| All other tools | zero-config | 0.0% | 0.0% | 0.0% | 0.00 |

GoldenCheck’s zero-config discovery scores 72.00, more than double the best hand-written competitor configuration (Pandera best-effort, 32.51).

Head-to-Head Results — ER (DQBench v1.1)

| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
| --- | --- | --- | --- | --- | --- |
| GoldenMatch | with LLM | 92.6% | 97.8% | 94.1% | 95.30 |
| GoldenMatch | without LLM | | | | 77.21 |

GoldenMatch with LLM achieves a 95.30 DQBench ER Score across all three scored tiers.

Cost estimate: ~$0.15-0.30 per full run (3 tiers) with LLM scoring. Without LLM: free, ~23s total. With LLM: ~$0.25, ~670s total. LLM scoring is optional and activates automatically when OPENAI_API_KEY or ANTHROPIC_API_KEY is set.
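Because LLM scoring switches on automatically, it can be useful to check up front whether a run is likely to incur API cost. The sketch below only mirrors the rule stated above (either environment variable being set); the helper name `llm_scoring_enabled` is hypothetical and not part of DQBench, and treating an empty value as "unset" is an assumption.

```python
import os

# Environment variables that, per the note above, activate LLM scoring.
LLM_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def llm_scoring_enabled(environ=None) -> bool:
    """Return True if either API key variable is set to a non-empty value.

    Hypothetical helper: it approximates the documented activation rule,
    it does not call into DQBench itself.
    """
    env = os.environ if environ is None else environ
    return any(env.get(name) for name in LLM_KEY_VARS)


if __name__ == "__main__":
    mode = "with LLM (paid)" if llm_scoring_enabled() else "without LLM (free)"
    print(f"The ER benchmark would likely run {mode}")
```

Under this reading, unsetting both variables before `dqbench run goldenmatch` should give the free, faster run.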

External validation: GoldenMatch also scores 75.0% F1 on BPID (Amazon’s adversarial PII deduplication benchmark, EMNLP 2024), within 0.2 points of Ditto (75.2%) while using zero training data. See the benchmark writeup.

T4 — Mistyped (diagnostic): since DQBench v1.2 the ER harness also ships a fourth tier where four column names deliberately disagree with their content (first_name=hex tokens, last_name=numeric IDs, address=free-form notes, industry=person names). The duplicate signal lives in email/phone, so a deduper that gates per-column refinements on profiled col_type should land near its T1 F1; one that trusts the column name will fire name/address scorers on noise and pay a precision tax. T4 has weight 0 in the composite ER score — it’s reported but doesn’t move the headline.

Run the comparisons yourself:

# Detect benchmark
pip install dqbench goldencheck great_expectations pandera soda-core
dqbench run all

# ER benchmark
pip install dqbench goldenmatch
dqbench run goldenmatch

Leaderboard

Every dqbench run records its result under ~/.dqbench/results/ (latest run per tool per category wins). View your local board at any time:

dqbench run all              # populate Detect results
dqbench run goldenmatch      # add an ER result
dqbench leaderboard          # ranked tables for every category
dqbench leaderboard -c er    # just one category
dqbench leaderboard --json   # machine-readable

Use dqbench run <adapter> --no-save to benchmark without recording, and dqbench leaderboard --clear to reset the local board.
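The "latest run per tool per category wins" rule can be pictured as a small reduction over recorded runs. This is a minimal sketch, not DQBench's actual storage code: the `RunResult` record and its field names are hypothetical, and it assumes timestamps are ISO-8601 strings in a single timezone so that string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical shape of one recorded run (not the real on-disk format)."""
    tool: str
    category: str
    score: float
    finished_at: str  # ISO-8601 timestamp, assumed to share one timezone


def build_board(results: list[RunResult]) -> dict[str, list[RunResult]]:
    """Keep the latest run per (tool, category), then rank each category by score."""
    latest: dict[tuple[str, str], RunResult] = {}
    for run in results:
        key = (run.tool, run.category)
        if key not in latest or run.finished_at > latest[key].finished_at:
            latest[key] = run

    board: dict[str, list[RunResult]] = {}
    for run in latest.values():
        board.setdefault(run.category, []).append(run)
    for rows in board.values():
        rows.sort(key=lambda r: r.score, reverse=True)
    return board


runs = [
    RunResult("goldencheck", "detect", 70.10, "2025-01-01T10:00:00"),
    RunResult("goldencheck", "detect", 72.00, "2025-01-02T10:00:00"),
    RunResult("pandera-best", "detect", 32.51, "2025-01-01T11:00:00"),
]
board = build_board(runs)
```

Here the older GoldenCheck run is dropped, so the Detect board holds one row per tool, ranked by score.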

Published leaderboard

The repository ships a public, version-controlled board in LEADERBOARD.md, generated from leaderboard/results/.

Results must reproduce. Every entry is backed by a manifest under leaderboard/submissions/ that declares how to run the benchmark (tool, adapter, pinned packages). CI re-runs each changed manifest on a clean runner and rejects any entry whose numbers don’t reproduce — so scores can’t be hand-edited onto the board.

To submit your tool, add a manifest and reproduce it:

# leaderboard/submissions/detect-mytool.json describes how to run your tool
dqbench reproduce leaderboard/submissions/detect-mytool.json --write
dqbench verify    leaderboard/submissions/detect-mytool.json
# commit the manifest + leaderboard/results/*.json + LEADERBOARD.md, then open a PR

CI gates the PR with dqbench publish --check (every entry has a manifest, board in sync) and dqbench verify (the numbers reproduce). View the published board with dqbench leaderboard --source repo. Full guide: docs/leaderboard.md.

Tiers

Detect

| Tier | Rows | Columns | Domain | Difficulty |
| --- | --- | --- | --- | --- |
| 1 - Basics | 5,000 | 20 | Customer DB | Obvious errors, baseline |
| 2 - Realistic | 50,000 | 30 | E-commerce | Subtle issues + false positive traps |
| 3 - Adversarial | 100,000 | 50 | Healthcare | Encoding traps, semantic errors, cross-column logic |

Each tier has columns WITH planted issues and columns WITHOUT (false positive traps). Tools that flag clean columns lose precision points.
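To see why flagging clean columns costs precision, the column-level metrics can be sketched as set arithmetic. This is a minimal illustration under the assumption that a tool's output is reduced to the set of column names it flagged; the function name and the example column names are hypothetical, and the real scorer may differ in details such as severity filtering.

```python
def column_scores(planted: set[str], flagged: set[str]) -> dict[str, float]:
    """Column-level precision, recall, and F1 for one tier."""
    tp = len(planted & flagged)   # planted-issue columns the tool found
    fp = len(flagged - planted)   # clean columns flagged (the traps)
    fn = len(planted - flagged)   # planted-issue columns the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


planted = {"email", "phone", "zip"}

# Same two true detections, but the second run also flags two clean columns.
careful_run = column_scores(planted, {"email", "phone"})
noisy_run = column_scores(planted, {"email", "phone", "country", "created_at"})
```

Both runs have the same recall, but the noisy run's precision drops from 1.0 to 0.5 and its F1 falls with it.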

ER (Entity Resolution)

| Tier | Rows | Dupes | Profile | In composite score |
| --- | --- | --- | --- | --- |
| 1 - Easy | 1,000 | 100 | Case-change, single-char typo, name swap | ✓ (20%) |
| 2 - Fuzzy | 5,000 | 750 | Nicknames, missing fields, format change, transposed fields | ✓ (40%) |
| 3 - Adversarial | 10,000 | 2,000 | Phonetic variants, address abbrev, split records, unicode, merged records, multi-field corruption | ✓ (40%) |
| 4 - Mistyped (diagnostic) | 800 | 80 | Person-shaped rows where four column names deliberately disagree with their content | weight 0 |

T4 is a diagnostic tier — same row shape as T1 but with first_name=hex, last_name=numeric ID, address=free-form notes, industry=person names. It exists to expose dedupers that fire per-column refinements (name scorers, address normalisation) on noise when the column name doesn’t match the data. Reported alongside T1-T3 but excluded from the composite ER score so it doesn’t move headline numbers for tools that don’t opt in.
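The behaviour T4 rewards, gating per-column refinements on what a column contains rather than what it is called, might look roughly like the sketch below. Everything here is illustrative: the regexes, the 0.8 threshold, and the names `profile_col_type` and `should_apply_name_scorer` are assumptions, not GoldenMatch's or DQBench's actual profiling logic.

```python
import re

# Rough content patterns (assumed for illustration only).
NUMERIC_ID = re.compile(r"^\d+$")
HEX_TOKEN = re.compile(r"^[0-9a-f]{8,}$", re.IGNORECASE)
NAME_LIKE = re.compile(r"^[A-Za-z][A-Za-z'\-]*$")


def profile_col_type(values: list[str], threshold: float = 0.8) -> str:
    """Guess a column's type from its values, ignoring the column name."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"

    def share(pattern: re.Pattern) -> float:
        return sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty)

    # Numeric is checked before hex because all-digit strings also look like hex.
    if share(NUMERIC_ID) >= threshold:
        return "numeric_id"
    if share(HEX_TOKEN) >= threshold:
        return "hex_token"
    if share(NAME_LIKE) >= threshold:
        return "name"
    return "text"


def should_apply_name_scorer(column_name: str, values: list[str]) -> bool:
    """Gate the name scorer on profiled content; the column name is ignored."""
    return profile_col_type(values) == "name"


mistyped_first_name = ["a3f9c2e1", "00ff12ab9c", "7be4d0a155"]
mistyped_last_name = ["100234", "998877", "45"]
real_first_name = ["Alice", "O'Brien", "Jean-Luc"]
```

With this gate, a T4-style `first_name` column full of hex tokens never reaches the name scorer, while a genuine name column still does.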

Scoring

| Metric | Description |
| --- | --- |
| Recall | % of planted-issue columns detected |
| Precision | % of flagged columns that actually have issues |
| F1 | Harmonic mean of recall and precision |
| FPR | Clean columns incorrectly flagged (WARNING/ERROR only) |
| DQBench Score | Tier1_F1 x 20% + Tier2_F1 x 40% + Tier3_F1 x 40% |
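The composite formula is a plain weighted sum, sketched below with the 20/40/40 weights from the table. The function name `dqbench_score` is hypothetical. Feeding in the rounded GoldenCheck tier F1s from the Detect results gives a value within a few hundredths of the published 72.00; the small gap is presumably because the published score is computed from unrounded F1 values.

```python
def dqbench_score(t1_f1: float, t2_f1: float, t3_f1: float,
                  weights: tuple[float, float, float] = (0.20, 0.40, 0.40)) -> float:
    """Weighted composite of the three scored tier F1s (given in percent)."""
    return sum(w * f1 for w, f1 in zip(weights, (t1_f1, t2_f1, t3_f1)))


# Rounded per-tier F1s for GoldenCheck (zero-config) from the Detect table.
goldencheck = dqbench_score(84.9, 80.0, 57.6)
print(f"{goldencheck:.2f}")  # close to the published 72.00
```

Because Tiers 2 and 3 carry 80% of the weight between them, a tool that only handles the obvious Tier 1 errors cannot score above 20.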

Write Your Own Adapter

Implement one class to benchmark any tool:

from dqbench.adapters.base import DQBenchAdapter
from dqbench.models import DQBenchFinding
from pathlib import Path

class MyToolAdapter(DQBenchAdapter):
    @property
    def name(self) -> str:
        return "MyTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def validate(self, csv_path: Path) -> list[DQBenchFinding]:
        # Run your tool on the CSV
        # Return a list of DQBenchFinding objects
        return [
            DQBenchFinding(
                column="email",
                severity="error",      # "error", "warning", or "info"
                check="format",         # what kind of issue
                message="Invalid email format",
                confidence=0.9,         # optional, 0.0-1.0
            )
        ]

Then run:

dqbench run --adapter my_adapter.py

Writing a Custom ER Adapter

To benchmark an entity resolution tool, implement the EntityResolutionAdapter interface:

from dqbench.adapters.er_base import EntityResolutionAdapter
from dqbench.models import ERPrediction
from pathlib import Path
import polars as pl

class MyERAdapter(EntityResolutionAdapter):
    @property
    def name(self) -> str:
        return "MyERTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def resolve(self, df: pl.DataFrame) -> list[ERPrediction]:
        # Given a DataFrame with potential duplicates,
        # return predicted duplicate pairs
        return [
            ERPrediction(
                record_id_a="row_001",
                record_id_b="row_042",
                confidence=0.95,        # 0.0-1.0
                match=True,             # True = predicted duplicate
            )
        ]

Then run:

dqbench run --adapter my_er_adapter.py
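When developing an ER adapter, it can help to sanity-check predicted pairs against a small hand-labelled set before running the full benchmark. The sketch below uses order-insensitive pairwise precision, recall, and F1, a common way to score duplicate pairs. Whether DQBench's ER scorer works exactly this way is an assumption, and `pairwise_f1` is a hypothetical helper, not part of the package.

```python
def normalize_pairs(pairs) -> set[frozenset]:
    """Treat (a, b) and (b, a) as the same pair and drop self-pairs."""
    return {frozenset((a, b)) for a, b in pairs if a != b}


def pairwise_f1(predicted, truth) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over unordered duplicate pairs."""
    pred = normalize_pairs(predicted)
    gold = normalize_pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


predicted = [("row_001", "row_042"), ("row_042", "row_001"), ("row_007", "row_009")]
truth = [("row_042", "row_001"), ("row_010", "row_011")]
precision, recall, f1 = pairwise_f1(predicted, truth)
```

The reversed duplicate prediction is counted once, so one of two distinct predictions is correct and one of two true pairs is found.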

CLI Reference

| Command | Description |
| --- | --- |
| dqbench run <adapter> | Run benchmark |
| dqbench run --adapter <path> | Run with custom adapter file |
| dqbench run <adapter> --tier 2 | Run specific tier only |
| dqbench run <adapter> --json | JSON output |
| dqbench run goldenmatch | Run ER benchmark with GoldenMatch |
| dqbench run goldenpipe | Run Pipeline benchmark with GoldenPipe (tuned Flow+Match) |
| dqbench run goldensuite-tuned | Full Golden suite pipeline, tuned (Check+Flow+Match) |
| dqbench run goldensuite-zero | Full Golden suite pipeline, zero-config engine |
| dqbench run placeholder --adapter <path> | Run a custom OCR Company adapter |
| dqbench run goldenmatch --tier 4 | Run only the ER T4 (Mistyped) diagnostic tier |
| dqbench run <adapter> --no-save | Run without recording the result on the leaderboard |
| dqbench leaderboard | Show your local ranked leaderboard across all categories |
| dqbench leaderboard --category er | Show the leaderboard for one category |
| dqbench leaderboard --json | Leaderboard as JSON |
| dqbench leaderboard --source repo | Show the published (committed) board |
| dqbench leaderboard --clear | Delete all locally recorded results |
| dqbench reproduce <manifest> --write | Run a submission manifest and record it on the board |
| dqbench verify <manifest> | Reproduce a manifest and confirm its committed entry matches |
| dqbench publish | Regenerate LEADERBOARD.md from the store |
| dqbench publish --check | Validate store + manifests and verify LEADERBOARD.md (CI) |
| dqbench generate | Generate/cache detection datasets |
| dqbench generate --er | Generate ER benchmark datasets (T1-T4) |
| dqbench generate --pipeline | Generate Pipeline benchmark datasets |
| dqbench generate --ocr-company | Generate OCR Company benchmark datasets |
| dqbench generate --all | Generate datasets for all categories |
| dqbench generate --force | Regenerate datasets |

Supported Categories

| Category | Tiers | Description |
| --- | --- | --- |
| Detect | 3 | Data quality issue detection |
| Transform | 3 | Data cleaning and normalization |
| ER | 3 scored + 1 diagnostic (T4 Mistyped) | Entity resolution and deduplication |
| Pipeline | 3 | End-to-end pipeline orchestration |
| OCR Company | 3 | OCR company-name confidence and correction |

Full suite: 251 tests passing across all five categories.

OCR Company Benchmark

This benchmark measures company-name OCR scoring quality rather than generic data validation.

It ships with deterministic synthetic company OCR tiers and scores tools on company-name confidence, review, and correction quality.

Generate OCR Company datasets:

dqbench generate --ocr-company

Run with a custom OCR Company adapter:

dqbench run placeholder --adapter examples/ocr_company_adapter.py

5 categories, 16 tiers (15 scored + T4 Mistyped diagnostic), 178 tests passing.

Built-in Adapters

| Adapter | Tool | Category | Modes | Install |
| --- | --- | --- | --- | --- |
| goldencheck | GoldenCheck | Detect | zero-config | pip install goldencheck |
| gx-zero, gx-auto, gx-best | Great Expectations | Detect | zero / auto / best-effort | pip install great_expectations |
| pandera-zero, pandera-auto, pandera-best | Pandera | Detect | zero / auto / best-effort | pip install pandera |
| soda-zero, soda-auto, soda-best | Soda Core | Detect | zero / auto / best-effort | pip install soda-core |
| goldenmatch | GoldenMatch | ER | with-LLM / without-LLM | pip install goldenmatch |
| goldenpipe | GoldenPipe | Pipeline | default | pip install goldenpipe |

Want to add your tool? See CONTRIBUTING.md.

Reproducibility

License

MIT


From the maker of GoldenCheck, GoldenMatch, GoldenFlow, and GoldenPipe.