Deep Profiling Baseline

Learn dataset statistical properties once. Detect drift forever.

GoldenCheck’s baseline system goes beyond the per-scan drift profiler. It runs 6 deep analysis techniques over a known-good dataset, saves a rich YAML profile, and then detects 13 types of drift on every future scan — automatically.


Quick Start

pip install goldencheck[baseline]

# Step 1: Create baseline from clean/known-good data
goldencheck baseline data.csv

# Step 2: Scan new data — drift is detected automatically
goldencheck scan new_data.csv

GoldenCheck saves the baseline to goldencheck_baseline.yaml and picks it up automatically on subsequent scans. No flags required once the file exists.


How It Works

The baseline workflow follows a learn-once, monitor-forever pattern:

LEARN (once)
  goldencheck baseline data.csv
      │
      ├─ StatisticalProfiler    — mean, std, quantiles, IQR for numeric columns
      ├─ ConstraintMiner        — infer NOT NULL, unique, range, enum, regex constraints
      ├─ SemanticTypeInferrer   — detect email, phone, date, ID, currency patterns
      ├─ CorrelationAnalyzer    — record column pair correlations (Pearson / Spearman)
      ├─ PatternGrammarInducer  — extract structural patterns (e.g. LLL-DDDD)
      └─ ConfidencePriorBuilder — assign per-check confidence weights from data evidence
              │
              ▼
      goldencheck_baseline.yaml   ← human-readable, version-controllable

MONITOR (every scan)
  goldencheck scan new_data.csv
      │
      ├─ [all standard profilers run as normal]
      │
      └─ drift/detector.py  (13 check types vs baseline)
              │
              ▼
      Drift findings surface alongside standard findings

The baseline file is plain YAML. You can edit it, commit it, diff it in PRs, and share it across environments.


The 6 Techniques

1. Statistical Profiler

Module: goldencheck/baseline/statistical.py

Records the full distributional shape of every numeric column: mean, standard deviation, min, max, and the 5th, 25th, 50th, 75th, and 95th percentiles.

What it discovers:

  • Expected numeric range and central tendency
  • Interquartile range (used for outlier bounds)
  • Skew indicators from percentile spread

Example YAML output:

columns:
  age:
    statistical:
      mean: 34.21
      std: 12.87
      min: 18
      max: 92
      p05: 20.0
      p25: 24.0
      p50: 32.0
      p75: 44.0
      p95: 58.0
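For illustration, the same statistics can be computed with the standard library alone. This is a hypothetical sketch (the `profile_numeric` name is invented), not goldencheck's actual StatisticalProfiler:

```python
import statistics

def profile_numeric(values):
    # Hypothetical sketch, not goldencheck/baseline/statistical.py
    data = sorted(values)
    # quantiles(n=100) returns the 99 percentile cut points
    qs = statistics.quantiles(data, n=100, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),
        "min": data[0],
        "max": data[-1],
        "p05": qs[4],  "p25": qs[24], "p50": qs[49],
        "p75": qs[74], "p95": qs[94],
    }
```

The 5/25/50/75/95 percentile spread is what later drift checks compare against.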

2. Constraint Miner

Module: goldencheck/baseline/constraints.py

Infers hard constraints from the data by observing what is universally true across all rows:

  • NOT NULL — column had zero nulls
  • UNIQUE — column had 100% unique values
  • RANGE — observed min/max bounds for numeric columns
  • ENUM — column had low cardinality (≤20 values); records the allowed set
  • REGEX — consistent structural pattern detected (e.g., ^\d{5}$ for US zip codes)

Example YAML output:

columns:
  status:
    constraints:
      not_null: true
      enum:
        - active
        - inactive
        - pending
  zip_code:
    constraints:
      regex: '^\d{5}(-\d{4})?$'
      not_null: true
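The mining logic for the first three constraint kinds can be sketched in a few lines. This is a hypothetical illustration (invented `mine_constraints` helper), not the real ConstraintMiner:

```python
def mine_constraints(values, enum_max=20):
    # Hypothetical sketch of constraint mining, not goldencheck's module
    non_null = [v for v in values if v is not None]
    mined = {}
    if len(non_null) == len(values):
        mined["not_null"] = True              # zero nulls observed
    if non_null and len(set(non_null)) == len(non_null):
        mined["unique"] = True                # 100% unique values
    if 0 < len(set(non_null)) <= enum_max:
        mined["enum"] = sorted(set(non_null)) # low cardinality: record the set
    return mined
```

Because constraints are mined from what is universally true in the sample, they only become safe to enforce when the baseline data is genuinely clean.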

3. Semantic Type Inferrer

Module: goldencheck/baseline/semantic.py

Applies GoldenCheck’s semantic type classifier to the baseline data and locks in the detected type. On future scans, if the semantic type changes or the column format drifts from the baseline type, it is flagged.

Requires goldencheck[semantic] for embedding-based detection.

What it discovers:

  • Column semantic type (email, phone, url, name, id, currency, date, category)
  • Format match rate (how consistently the column matched the type)
  • Whether the type was detected by name hint, format, or embedding similarity

Example YAML output:

columns:
  customer_email:
    semantic:
      type: email
      match_rate: 0.994
      detection_method: format
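A format-based match rate like the one above can be approximated with a regex check. This is a simplified, hypothetical sketch; the real inferrer may also use name hints and embedding similarity:

```python
import re

# Deliberately loose email pattern for illustration only
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_match_rate(values):
    # Fraction of values that match the email format
    hits = sum(1 for v in values if EMAIL_RE.match(v) is not None)
    return hits / len(values)
```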

4. Correlation Analyzer

Module: goldencheck/baseline/correlation.py

Records pairwise correlations between numeric columns using Pearson and Spearman coefficients. Captures the expected correlation structure of the dataset.

What it discovers:

  • Strong positive or negative correlations between column pairs
  • Which correlations are stable enough to enforce as drift checks

Only pairs with correlation ≥ 0.7 are recorded, to keep the baseline focused on meaningful relationships.

Example YAML output:

correlations:
  - columns: [age, years_experience]
    pearson: 0.84
    spearman: 0.81
  - columns: [order_total, item_count]
    pearson: 0.91
    spearman: 0.89

5. Pattern Grammar Inducer

Module: goldencheck/baseline/patterns.py

Generalises the structural patterns of string columns into a grammar: digits become D, letters become L, punctuation is preserved. Records the dominant pattern and its coverage.

What it discovers:

  • The structural “shape” expected for each string column
  • How consistently values conform to that shape
  • Whether multiple patterns coexist (e.g., short and long formats)

Example YAML output:

columns:
  product_code:
    patterns:
      dominant: 'LLL-DDDD'
      coverage: 0.983
      alternatives:
        - pattern: 'LLLLL-DDDD'
          coverage: 0.017
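The digits-to-D, letters-to-L generalisation described above can be sketched directly (hypothetical `induce_pattern` helper, not goldencheck's module):

```python
from collections import Counter

def induce_pattern(values):
    # Hypothetical sketch: digits -> D, letters -> L, punctuation preserved
    def shape(s):
        return "".join("D" if c.isdigit() else "L" if c.isalpha() else c
                       for c in s)
    counts = Counter(shape(v) for v in values)
    dominant, n = counts.most_common(1)[0]
    return {"dominant": dominant, "coverage": n / len(values)}
```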

6. Confidence Prior Builder

Module: goldencheck/baseline/priors.py

Assigns a per-check confidence weight to every column based on the evidence seen in the baseline data. These priors are used to calibrate drift findings: a check backed by 50,000 rows of consistent data gets higher confidence than one backed by 200.

What it discovers:

  • Evidence strength for each inferred constraint
  • Row count at baseline time (used to weight future comparisons)
  • Which checks should fire at ERROR vs WARNING given the confidence level

Example YAML output:

columns:
  transaction_id:
    priors:
      uniqueness:
        confidence: high
        evidence_rows: 125000
      not_null:
        confidence: high
        evidence_rows: 125000
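A row-count-to-confidence mapping like the one above might look as follows. The thresholds here are invented for illustration; the actual calibration may differ:

```python
def confidence_for(evidence_rows):
    # Hypothetical thresholds, not goldencheck's actual calibration
    if evidence_rows >= 10_000:
        return "high"
    if evidence_rows >= 1_000:
        return "medium"
    return "low"
```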

Drift Detection

When goldencheck_baseline.yaml exists, the drift detector (goldencheck/drift/detector.py) runs 13 check types on every scan. Drift findings are returned alongside standard profiler findings.

Check Severity What it detects
null_rate_increase WARNING Null rate grew above baseline null rate
null_rate_introduced ERROR Column was NOT NULL in baseline but now has nulls
enum_violation ERROR Value not in the baseline enum set
range_min_violation WARNING Minimum value dropped below baseline min
range_max_violation WARNING Maximum value exceeded baseline max
mean_shift WARNING Mean shifted more than 2 standard deviations from baseline
std_increase WARNING Standard deviation more than doubled vs baseline
semantic_type_change ERROR Column semantic type changed from baseline
format_rate_drop WARNING Format match rate dropped >10 percentage points
pattern_drift WARNING Dominant structural pattern changed
new_pattern_appeared INFO New pattern not seen in baseline
correlation_broken WARNING Column pair correlation dropped below 0.5 (was ≥0.7)
cardinality_explosion WARNING Cardinality increased >5x vs baseline (enum drift)

Severity logic: Checks derived from strict constraints (NOT NULL, enum, semantic type) are always ERROR. Distribution-based checks (mean shift, std, format rate) fire as WARNING by default, upgraded to ERROR if the deviation is severe (>3 standard deviations or >3x baseline std).
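The escalation rule for a distribution-based check like mean_shift can be sketched as follows (hypothetical helper, not drift/detector.py itself):

```python
def mean_shift_severity(baseline_mean, baseline_std, observed_mean):
    # Hypothetical sketch of the WARNING -> ERROR escalation described above
    if baseline_std == 0:
        return None
    z = abs(observed_mean - baseline_mean) / baseline_std
    if z > 3:
        return "ERROR"    # severe deviation: >3 standard deviations
    if z > 2:
        return "WARNING"  # moderate deviation: >2 standard deviations
    return None           # within tolerance
```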


CLI Reference

baseline command

Create or update a baseline from a data file:

goldencheck baseline <file> [flags]

Examples:

# Create baseline from production data export
goldencheck baseline data.csv

# Save to a custom path
goldencheck baseline data.csv --output baselines/production.yaml

# Update an existing baseline (merges new statistics)
goldencheck baseline data.csv --update

# Skip specific techniques
goldencheck baseline data.csv --skip correlation,patterns

Flags:

Flag Type Default Description
--output, -o path goldencheck_baseline.yaml Output path for the baseline file
--update bool false Merge new statistics into an existing baseline instead of overwriting
--skip <techniques> string (none) Comma-separated list of techniques to skip: statistical, constraints, semantic, correlation, patterns, priors

Requires goldencheck[baseline] installed:

pip install goldencheck[baseline]

scan with baseline flags

# Use a specific baseline file (instead of goldencheck_baseline.yaml)
goldencheck scan new_data.csv --baseline baselines/production.yaml

# Disable automatic baseline loading
goldencheck scan new_data.csv --no-baseline

Flags added in v1.1.0:

Flag Type Default Description
--baseline <path> path goldencheck_baseline.yaml Path to baseline file. Auto-loaded if file exists
--no-baseline bool false Disable baseline loading even if goldencheck_baseline.yaml exists

Python API

create_baseline()

from goldencheck.baseline import create_baseline

baseline = create_baseline("data.csv")
# Saves to goldencheck_baseline.yaml by default

baseline = create_baseline(
    "data.csv",
    output="baselines/production.yaml",
    skip=["correlation"],   # skip expensive technique
)

load_baseline()

from goldencheck.baseline import load_baseline

baseline = load_baseline("goldencheck_baseline.yaml")
print(baseline.columns["age"].statistical.mean)
print(baseline.columns["status"].constraints.enum)

scan_file() with baseline

from goldencheck.engine.scanner import scan_file
from goldencheck.baseline import load_baseline

baseline = load_baseline("goldencheck_baseline.yaml")
findings, profile = scan_file("new_data.csv", baseline=baseline)

# Drift findings have check names like "null_rate_introduced", "mean_shift", etc.
drift_findings = [f for f in findings if f.source == "baseline"]
for f in drift_findings:
    print(f"{f.severity.name}: [{f.column}] {f.check}: {f.message}")

Baseline YAML Format

A complete goldencheck_baseline.yaml has three parts: a metadata header, a columns section, and a correlations section:

goldencheck_baseline: 1           # format version
created_at: "2025-11-12T09:14:32"
source_file: data.csv
row_count: 125000
column_count: 14

columns:
  customer_id:
    statistical: null             # not numeric — omitted
    constraints:
      not_null: true
      unique: true
    semantic:
      type: id
      match_rate: 1.0
      detection_method: heuristic
    patterns:
      dominant: 'LLLL-DDDDDDDD'
      coverage: 1.0
      alternatives: []
    priors:
      uniqueness:
        confidence: high
        evidence_rows: 125000
      not_null:
        confidence: high
        evidence_rows: 125000

  age:
    statistical:
      mean: 34.21
      std: 12.87
      min: 18
      max: 92
      p05: 20.0
      p25: 24.0
      p50: 32.0
      p75: 44.0
      p95: 58.0
    constraints:
      not_null: true
      range:
        min: 18
        max: 92
    semantic:
      type: null
    patterns: null
    priors:
      not_null:
        confidence: high
        evidence_rows: 125000
      range:
        confidence: medium
        evidence_rows: 125000

correlations:
  - columns: [age, years_experience]
    pearson: 0.84
    spearman: 0.81
  - columns: [order_total, item_count]
    pearson: 0.91
    spearman: 0.89

Configuration as Code

The baseline YAML is designed to be human-editable and version-controllable:

  • Edit constraints manually — tighten an enum, add a regex, lower a range bound
  • Commit to version control — diff baselines across releases to see what changed
  • Share across environments — one baseline file works in dev, staging, and CI
  • Review in PRs — baseline changes are visible in GitHub/GitLab diffs

Example: tighten an enum in the baseline

# Before (mined from data)
columns:
  status:
    constraints:
      enum:
        - active
        - inactive
        - pending
        - legacy          # you want to retire this value

# After (manually edited)
columns:
  status:
    constraints:
      enum:
        - active
        - inactive
        - pending
        # "legacy" removed — will now trigger ERROR on any new rows with this value
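The effect of the edit amounts to a membership check against the tightened set. A hypothetical sketch, not the real drift detector:

```python
def enum_violations(values, allowed):
    # Values outside the (edited) baseline enum set surface as ERROR findings
    allowed = set(allowed)
    return [v for v in values if v not in allowed]
```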

Committing the baseline:

git add goldencheck_baseline.yaml
git commit -m "chore: tighten status enum, retire legacy value"

Optional Extras

Extra Command Adds
[baseline] pip install goldencheck[baseline] scipy, numpy — required for statistical profiling and correlation analysis
[semantic] pip install goldencheck[semantic] sentence-transformers — enables embedding-based semantic type detection in the baseline

Install both:

pip install goldencheck[baseline,semantic]

Next Steps