The standard benchmark for data quality tools — detection, transformation, entity resolution, and pipeline orchestration.
The ImageNet of data quality — standardized benchmarks for validation, transformation, entity resolution, and pipeline tools.
Every data quality tool claims to be the best. But there’s no standard way to compare them. DQBench fixes that with:
The repo now also includes an OCR Company benchmark for post-OCR company-name confidence and correction quality.
```bash
pip install dqbench

# Run detection benchmark with GoldenCheck
pip install goldencheck
dqbench run goldencheck

# Run ER benchmark with GoldenMatch
pip install goldenmatch
dqbench run goldenmatch

# Run pipeline benchmark with GoldenPipe
pip install goldenpipe
dqbench run goldenpipe

# Run with a custom adapter
dqbench run --adapter my_adapter.py
```
| Category | What it measures | Example tools |
|---|---|---|
| Detect | Find data quality issues in a dataset | GoldenCheck, Great Expectations, Pandera, Soda Core |
| Transform | Clean, normalize, and repair data | GoldenFlow, dbt, pandas |
| ER | Entity resolution — deduplicate and link records | GoldenMatch, Splink, Dedupe |
| Pipeline | End-to-end pipeline orchestration and quality gates | GoldenPipe, Airflow, Prefect |
| OCR Company | OCR company-name confidence, review, and correction quality | OCR confidence/correction tools |
| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
|---|---|---|---|---|---|
| GoldenCheck | zero-config | 84.9% | 80.0% | 57.6% | 72.00 |
| Pandera | best-effort | 36.4% | 38.1% | 25.0% | 32.51 |
| Soda Core | best-effort | 38.1% | 23.5% | 13.3% | 22.36 |
| Great Expectations | best-effort | 36.4% | 23.5% | 12.5% | 21.68 |
| Great Expectations | auto-profiled | 22.2% | 42.1% | 0.0% | 21.29 |
| Soda Core | auto-profiled | 0.0% | 11.1% | 6.2% | 6.94 |
| All other tools | zero-config | 0.0% | 0.0% | 0.0% | 0.00 |
GoldenCheck’s zero-config discovery (72.00) scores more than twice the best competing result, Pandera’s hand-written best-effort rules (32.51).
| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
|---|---|---|---|---|---|
| GoldenMatch | with LLM | 92.6% | 97.8% | 94.1% | 95.30 |
| GoldenMatch | without LLM | — | — | — | 77.21 |
GoldenMatch with LLM achieves a 95.30 DQBench ER Score across all three scored tiers.
Cost estimate: ~$0.15-0.30 per full run (3 tiers) with LLM scoring. Without LLM: free, ~23s total. With LLM: ~$0.25, ~670s total. LLM scoring is optional and activates automatically when `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` is set.

External validation: GoldenMatch also scores 75.0% F1 on BPID (Amazon’s adversarial PII deduplication benchmark, EMNLP 2024), matching Ditto (75.2%) with zero training data. See the benchmark writeup.
T4 Mistyped (diagnostic): since DQBench v1.2 the ER harness also ships a fourth tier where four column names deliberately disagree with their content (`first_name`=hex tokens, `last_name`=numeric IDs, `address`=free-form notes, `industry`=person names). The duplicate signal lives in `phone`, so a deduper that gates per-column refinements on profiled `col_type` should land near its T1 F1; one that trusts the column name will fire name/address scorers on noise and pay a precision tax. T4 has weight 0 in the composite ER score: it’s reported but doesn’t move the headline.
Run the comparisons yourself:
```bash
# Detect benchmark
pip install dqbench goldencheck great_expectations pandera soda-core
dqbench run all

# ER benchmark
pip install dqbench goldenmatch
dqbench run goldenmatch
```
Every `dqbench run` records its result under `~/.dqbench/results/` (latest run per
tool per category wins). View your local board at any time:
```bash
dqbench run all            # populate Detect results
dqbench run goldenmatch    # add an ER result
dqbench leaderboard        # ranked tables for every category
dqbench leaderboard -c er  # just one category
dqbench leaderboard --json # machine-readable
```
Use `dqbench run <adapter> --no-save` to benchmark without recording, and
`dqbench leaderboard --clear` to reset the local board.
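The "latest run per tool per category wins" rule can be pictured as one file per (category, tool) pair that each new run overwrites. The sketch below is only an illustration: the file naming, the JSON fields, and the `save_result` / `load_board` helpers are assumptions, not DQBench's actual storage format, and the tool names and scores are made up.

```python
import json
import tempfile
from pathlib import Path


def save_result(store: Path, tool: str, category: str, score: float) -> Path:
    """Write one file per (category, tool); a newer run overwrites the older one."""
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{category}__{tool}.json"  # hypothetical naming scheme
    path.write_text(json.dumps({"tool": tool, "category": category, "score": score}))
    return path


def load_board(store: Path) -> dict[str, list[dict]]:
    """Group stored results by category, ranked by score (highest first)."""
    board: dict[str, list[dict]] = {}
    for path in sorted(store.glob("*.json")):
        entry = json.loads(path.read_text())
        board.setdefault(entry["category"], []).append(entry)
    for entries in board.values():
        entries.sort(key=lambda e: e["score"], reverse=True)
    return board


# Usage with a throwaway directory standing in for ~/.dqbench/results/
demo_store = Path(tempfile.mkdtemp()) / "results"
save_result(demo_store, "tool_a", "detect", 60.0)
save_result(demo_store, "tool_a", "detect", 65.0)  # replaces the earlier tool_a run
save_result(demo_store, "tool_b", "detect", 40.0)
board = load_board(demo_store)
print(board["detect"])  # tool_a (65.0) first, then tool_b (40.0)
```

Keying the file name on the pair keeps the board free of stale duplicates without any explicit cleanup step.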
The repository ships a public, version-controlled board in
`LEADERBOARD.md`, generated from `leaderboard/results/`.
Results must reproduce. Every entry is backed by a manifest under
`leaderboard/submissions/` that declares how to run the benchmark (tool, adapter,
pinned packages). CI re-runs each changed manifest on a clean runner and rejects any
entry whose numbers don’t reproduce, so scores can’t be hand-edited onto the board.
To submit your tool, add a manifest and reproduce it:
```bash
# leaderboard/submissions/detect-mytool.json describes how to run your tool
dqbench reproduce leaderboard/submissions/detect-mytool.json --write
dqbench verify leaderboard/submissions/detect-mytool.json
# commit the manifest + leaderboard/results/*.json + LEADERBOARD.md, then open a PR
```
CI gates the PR with `dqbench publish --check` (every entry has a manifest, board in
sync) and `dqbench verify` (the numbers reproduce). View the published board with
`dqbench leaderboard --source repo`. Full guide: `docs/leaderboard.md`.
| Tier | Rows | Columns | Domain | Difficulty |
|---|---|---|---|---|
| 1 — Basics | 5,000 | 20 | Customer DB | Obvious errors, baseline |
| 2 — Realistic | 50,000 | 30 | E-commerce | Subtle issues + false positive traps |
| 3 — Adversarial | 100,000 | 50 | Healthcare | Encoding traps, semantic errors, cross-column logic |
Each tier has columns WITH planted issues and columns WITHOUT (false positive traps). Tools that flag clean columns lose precision points.
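To make the planted-issue and false-positive-trap idea concrete, here is a minimal sketch of a deterministic generator. It assumes a seeded stdlib `random.Random` instance (the README mentions `random.Random(42)`); the `make_dataset` helper, the column names, and the 5% corruption rate are hypothetical and not taken from DQBench's real generators.

```python
import random


def make_dataset(seed: int = 42, rows: int = 100):
    """Build a tiny dataset with one planted-issue column and one clean column."""
    rng = random.Random(seed)  # private RNG instance; global random state is untouched
    emails = [f"user{i}@example.com" for i in range(rows)]
    ages = [rng.randint(18, 90) for _ in range(rows)]  # clean column, a false-positive trap

    # Plant issues in 5% of the email column by dropping the "@"
    bad_rows = sorted(rng.sample(range(rows), k=rows // 20))
    for i in bad_rows:
        emails[i] = emails[i].replace("@", "")

    data = {"email": emails, "age": ages}
    ground_truth = {"email": bad_rows}  # "age" is absent: flagging it costs precision
    return data, ground_truth


# Two calls with the same seed yield identical data and ground truth
data_a, truth_a = make_dataset()
data_b, truth_b = make_dataset()
print(data_a == data_b, truth_a == truth_b)
```

Because the ground truth is produced alongside the data, scoring only needs to compare a tool's flagged columns against `ground_truth`.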
| Tier | Rows | Dupes | Profile | In composite score |
|---|---|---|---|---|
| 1 — Easy | 1,000 | 100 | Case-change, single-char typo, name swap | ✓ (20%) |
| 2 — Fuzzy | 5,000 | 750 | Nicknames, missing fields, format change, transposed fields | ✓ (40%) |
| 3 — Adversarial | 10,000 | 2,000 | Phonetic variants, address abbrev, split records, unicode, merged records, multi-field corruption | ✓ (40%) |
| 4 — Mistyped (diagnostic) | 800 | 80 | Person-shaped rows where four column names deliberately disagree with their content | weight 0 |
T4 is a diagnostic tier — same row shape as T1 but with first_name=hex, last_name=numeric ID, address=free-form notes, industry=person names. It exists to expose dedupers that fire per-column refinements (name scorers, address normalisation) on noise when the column name doesn’t match the data. Reported alongside T1-T3 but excluded from the composite ER score so it doesn’t move headline numbers for tools that don’t opt in.
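The behaviour T4 rewards, choosing per-column scorers from profiled content rather than from the column name, might look roughly like the sketch below. Everything here is illustrative: `profile_column`, `choose_scorers`, the type labels, the scorer names, and the 90% threshold are assumptions, not GoldenMatch's or DQBench's actual logic.

```python
import re

HEX_RE = re.compile(r"[0-9a-f]{8,}")
PHONE_RE = re.compile(r"\+?[\d\s().-]{7,}")


def profile_column(values: list[str]) -> str:
    """Guess a column's type from its content, ignoring the column name."""
    def share(pred) -> float:
        return sum(1 for v in values if pred(v)) / max(len(values), 1)

    if share(lambda v: v.isdigit()) > 0.9:
        return "numeric_id"
    if share(lambda v: HEX_RE.fullmatch(v.lower()) is not None) > 0.9:
        return "hex_token"
    if share(lambda v: PHONE_RE.fullmatch(v) is not None
             and sum(c.isdigit() for c in v) >= 7) > 0.9:
        return "phone"
    if share(lambda v: v.replace(" ", "").isalpha()) > 0.9:
        return "name_like"
    return "text"


def choose_scorers(columns: dict[str, list[str]]) -> dict[str, str]:
    """Pick a scorer per column from the profiled type; skip columns that look like noise."""
    scorer_for_type = {"name_like": "fuzzy_name", "phone": "exact_digits"}  # hypothetical scorers
    return {
        col: scorer_for_type.get(profile_column(values), "skip")
        for col, values in columns.items()
    }


# T4-style sample: column names disagree with their content
mistyped = {
    "first_name": ["a3f9c2d1e0", "00ffab12cd", "9e8d7c6b5a"],
    "last_name": ["10482", "99310", "55021"],
    "industry": ["Alice Smith", "Bob Jones", "Carol Diaz"],
    "phone": ["555-010-1234", "555-010-9876", "+1 555 010 4444"],
}
plan = choose_scorers(mistyped)
print(plan)
```

A name-trusting deduper would instead run a name scorer on `first_name` and `last_name`, which hold hex tokens and numeric IDs here.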
| Metric | Description |
|---|---|
| Recall | % of planted-issue columns detected |
| Precision | % of flagged columns that actually have issues |
| F1 | Harmonic mean of recall and precision |
| FPR | Clean columns incorrectly flagged (WARNING/ERROR only) |
| DQBench Score | Tier1_F1 x 20% + Tier2_F1 x 40% + Tier3_F1 x 40% |
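The metrics in this table can be worked through in a few lines. This is a sketch of the arithmetic only, with hypothetical helper names (`column_metrics`, `dqbench_score`) and made-up column sets; it is not DQBench's scoring code.

```python
def column_metrics(flagged: set[str], planted: set[str]) -> dict[str, float]:
    """Column-level precision, recall and F1 for one tier."""
    tp = len(flagged & planted)  # flagged columns that really contain planted issues
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(planted) if planted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def dqbench_score(t1_f1: float, t2_f1: float, t3_f1: float) -> float:
    """Composite score: Tier1_F1 x 20% + Tier2_F1 x 40% + Tier3_F1 x 40%."""
    return 0.2 * t1_f1 + 0.4 * t2_f1 + 0.4 * t3_f1


# A tool flags three columns; two of them are among the four planted-issue columns
m = column_metrics(flagged={"email", "phone", "age"},
                   planted={"email", "phone", "zip", "dob"})
print(m)  # precision 2/3, recall 1/2, F1 4/7

# Using the rounded F1 values from the GoldenCheck row of the Detect table
score = dqbench_score(84.9, 80.0, 57.6)
print(round(score, 2))  # about 72.02
```

The result lands close to the published 72.00; the small gap is presumably because the published score is computed from unrounded F1 values.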
Implement one class to benchmark any tool:
```python
from dqbench.adapters.base import DQBenchAdapter
from dqbench.models import DQBenchFinding
from pathlib import Path


class MyToolAdapter(DQBenchAdapter):
    @property
    def name(self) -> str:
        return "MyTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def validate(self, csv_path: Path) -> list[DQBenchFinding]:
        # Run your tool on the CSV
        # Return a list of DQBenchFinding objects
        return [
            DQBenchFinding(
                column="email",
                severity="error",  # "error", "warning", or "info"
                check="format",  # what kind of issue
                message="Invalid email format",
                confidence=0.9,  # optional, 0.0-1.0
            )
        ]
```
Then run:
```bash
dqbench run --adapter my_adapter.py
```
To benchmark an entity resolution tool, implement the EntityResolutionAdapter interface:
```python
from dqbench.adapters.er_base import EntityResolutionAdapter
from dqbench.models import ERPrediction
from pathlib import Path
import polars as pl


class MyERAdapter(EntityResolutionAdapter):
    @property
    def name(self) -> str:
        return "MyERTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def resolve(self, df: pl.DataFrame) -> list[ERPrediction]:
        # Given a DataFrame with potential duplicates,
        # return predicted duplicate pairs
        return [
            ERPrediction(
                record_id_a="row_001",
                record_id_b="row_042",
                confidence=0.95,  # 0.0-1.0
                match=True,  # True = predicted duplicate
            )
        ]
```
Then run:
```bash
dqbench run --adapter my_er_adapter.py
```
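For intuition about how predicted duplicate pairs could be turned into an F1 value, here is a minimal pairwise sketch. The README does not spell out the exact ER scoring procedure, so treating pairs as unordered and scoring them against ground-truth pairs is an assumption, as are the `normalize` and `pairwise_f1` helpers; plain tuples stand in for `ERPrediction` objects.

```python
def normalize(pairs) -> set[tuple[str, str]]:
    """Treat (a, b) and (b, a) as the same pair and drop self-pairs."""
    return {tuple(sorted(p)) for p in pairs if p[0] != p[1]}


def pairwise_f1(predicted, truth) -> dict[str, float]:
    """Pairwise precision, recall and F1 over unordered record-id pairs."""
    pred, gold = normalize(predicted), normalize(truth)
    tp = len(pred & gold)  # predicted pairs that are true duplicates
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


truth = [("row_001", "row_042"), ("row_007", "row_310")]
predicted = [("row_042", "row_001"),  # correct, order reversed
             ("row_007", "row_999")]  # wrong partner
result = pairwise_f1(predicted, truth)
print(result)  # precision 0.5, recall 0.5, f1 0.5
```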
| Command | Description |
|---|---|
| `dqbench run <adapter>` | Run benchmark |
| `dqbench run --adapter <path>` | Run with custom adapter file |
| `dqbench run <adapter> --tier 2` | Run specific tier only |
| `dqbench run <adapter> --json` | JSON output |
| `dqbench run goldenmatch` | Run ER benchmark with GoldenMatch |
| `dqbench run goldenpipe` | Run Pipeline benchmark with GoldenPipe (tuned Flow+Match) |
| `dqbench run goldensuite-tuned` | Full Golden suite pipeline, tuned (Check+Flow+Match) |
| `dqbench run goldensuite-zero` | Full Golden suite pipeline, zero-config engine |
| `dqbench run placeholder --adapter <path>` | Run a custom OCR Company adapter |
| `dqbench run goldenmatch --tier 4` | Run only the ER T4 (Mistyped) diagnostic tier |
| `dqbench run <adapter> --no-save` | Run without recording the result on the leaderboard |
| `dqbench leaderboard` | Show your local ranked leaderboard across all categories |
| `dqbench leaderboard --category er` | Show the leaderboard for one category |
| `dqbench leaderboard --json` | Leaderboard as JSON |
| `dqbench leaderboard --source repo` | Show the published (committed) board |
| `dqbench leaderboard --clear` | Delete all locally recorded results |
| `dqbench reproduce <manifest> --write` | Run a submission manifest and record it on the board |
| `dqbench verify <manifest>` | Reproduce a manifest and confirm its committed entry matches |
| `dqbench publish` | Regenerate LEADERBOARD.md from the store |
| `dqbench publish --check` | Validate store + manifests and verify LEADERBOARD.md (CI) |
| `dqbench generate` | Generate/cache detection datasets |
| `dqbench generate --er` | Generate ER benchmark datasets (T1-T4) |
| `dqbench generate --pipeline` | Generate Pipeline benchmark datasets |
| `dqbench generate --ocr-company` | Generate OCR Company benchmark datasets |
| `dqbench generate --all` | Generate datasets for all categories |
| `dqbench generate --force` | Regenerate datasets |
| Category | Tiers | Description |
|---|---|---|
| Detect | 3 | Data quality issue detection |
| Transform | 3 | Data cleaning and normalization |
| ER | 3 scored + 1 diagnostic (T4 Mistyped) | Entity resolution and deduplication |
| Pipeline | 3 | End-to-end pipeline orchestration |
| OCR Company | 3 | OCR company-name confidence and correction |
Full suite: 251 tests passing across all five categories.
This benchmark measures company-name OCR scoring quality rather than generic data validation.
It ships with deterministic synthetic company OCR tiers and scores:
Generate OCR Company datasets:
```bash
dqbench generate --ocr-company
```
Run with a custom OCR Company adapter:
```bash
dqbench run placeholder --adapter examples/ocr_company_adapter.py
```
5 categories, 16 tiers (15 scored + T4 Mistyped diagnostic), 178 tests passing.
| Adapter | Tool | Category | Modes | Install |
|---|---|---|---|---|
| `goldencheck` | GoldenCheck | Detect | zero-config | `pip install goldencheck` |
| `gx-zero`, `gx-auto`, `gx-best` | Great Expectations | Detect | zero / auto / best-effort | `pip install great_expectations` |
| `pandera-zero`, `pandera-auto`, `pandera-best` | Pandera | Detect | zero / auto / best-effort | `pip install pandera` |
| `soda-zero`, `soda-auto`, `soda-best` | Soda Core | Detect | zero / auto / best-effort | `pip install soda-core` |
| `goldenmatch` | GoldenMatch | ER | with-LLM / without-LLM | `pip install goldenmatch` |
| `goldenpipe` | GoldenPipe | Pipeline | default | `pip install goldenpipe` |
Want to add your tool? See CONTRIBUTING.md.
Datasets are generated deterministically from a seeded `random.Random(42)` instance (stdlib only, no numpy) and cached under `~/.dqbench/datasets/`; regenerate with `dqbench generate --force`.

License: MIT
From the maker of GoldenCheck, GoldenMatch, GoldenFlow, and GoldenPipe.