How large does a golden dataset need to be before it's useful?

For early-stage validation, 100–200 documents per document type is workable. For a system in production handling significant volume, target 500–1,000 per type, with sub-variant coverage. The more consequential the extraction (regulated documents, financial data), the larger the required set. Accuracy confidence intervals narrow significantly above 500 examples per type.
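To put rough numbers on how the interval narrows with sample size, here is a minimal sketch using the normal approximation to a binomial proportion. The 90% observed accuracy and the function name `accuracy_margin` are illustrative assumptions, not figures from a specific system.

```python
import math

def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for an observed accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# Margin (in percentage points) at an assumed 90% observed accuracy.
for n in (100, 200, 500, 1000):
    print(n, round(100 * accuracy_margin(0.90, n), 1))
```

Under these assumptions the margin is roughly ±5.9 points at 100 documents, ±2.6 points at 500, and ±1.9 points at 1,000. The approximation is weaker for very small sets or accuracies close to 100%.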

How often should we run the full golden dataset evaluation?

Before every significant change to the model, schema, or processing logic. For stable systems with no active development, monthly is reasonable. For actively developed systems, run it as part of every release cycle.

What's the difference between precision and recall for field extraction?

Precision measures accuracy when the model extracts a value: of all extraction attempts for a field, what percentage were correct? Recall measures completeness: of all documents that contained the field, what percentage did the model successfully extract? Both matter. Low recall means missing data; low precision means wrong data. The relative importance depends on the use case — a field that feeds a critical downstream process usually requires high precision even at the cost of recall.

How do we handle ground truth disagreements on edge cases?

Define a ground truth adjudication process before you need it. Assign clear ownership (typically a domain expert who understands the business context, not just the technical team). When reviewers disagree on the correct extraction for an edge case, the adjudicator decides. Document the decision and the reasoning. Consistency in ground truth labeling is more important than any individual labeling decision being "right."

Should we use a subset of production traffic as the test set?

No — this creates test set contamination if the model has been trained or fine-tuned on production data. Maintain a held-out golden dataset that never overlaps with training data. You can sample from production traffic to expand the golden dataset, but label it independently and verify it hasn't been seen during training.
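One simple way to check for overlap is to compare content fingerprints of golden documents against the training corpus. The sketch below is an assumption about how documents are stored (raw bytes keyed by an ID), and the helper names are hypothetical. It only catches byte-identical files; a rescan or re-export of the same document would need fuzzier matching.

```python
import hashlib

def fingerprint(document_bytes: bytes) -> str:
    """SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(document_bytes).hexdigest()

def find_overlap(golden_docs: dict, training_docs: dict) -> list:
    """Return IDs of golden documents whose exact bytes also appear in training data."""
    training_hashes = {fingerprint(data) for data in training_docs.values()}
    return sorted(
        doc_id
        for doc_id, data in golden_docs.items()
        if fingerprint(data) in training_hashes
    )
```

Any ID returned by `find_overlap` should be removed from the golden set or replaced before the next evaluation run.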

Document Intelligence Testing & QA: Production Engineering Guide | Ashtayah Labs

Why standard software testing doesn't transfer directly

Software QA tests deterministic logic: given input X, output Y is correct or it isn't. Document intelligence is probabilistic. The model produces a confidence score alongside every field extraction. The "correct" answer depends on document quality, layout variability, and how the model was trained.

This creates testing requirements that aren't covered by standard unit test frameworks:

- **Field-level accuracy** must be tracked separately for each extraction target (invoice total, vendor name, date, line items) because failure modes are field-specific - **Confidence calibration** must be validated — a model that says it's 95% confident should be right 95% of the time, not 72% of the time - **Regression is non-obvious** — a change that improves extraction on invoices can quietly degrade accuracy on purchase orders - **Ground truth degrades** — the golden dataset that was accurate in Q1 may contain edge cases that were resolved differently in Q2 based on operational decisions

These differences require purpose-built QA infrastructure, not adapted software testing tooling.

Layer 1: The golden dataset — structure and governance

The golden dataset is the foundation of every other testing layer. It is a labeled corpus of real documents with verified ground truth extraction outputs, maintained as a production engineering artifact.

Most teams build a golden dataset once, use it for pre-launch validation, and never update it. The dataset accumulates coverage debt as new document variants enter production.

A production golden dataset requires stratified coverage across the full distribution of real inputs — not just clean, well-formatted examples. The golden set must include low-quality scans, handwritten fields, multi-page documents, documents with missing fields, and the edge cases that historically triggered exceptions or human review.

Each document in the set is versioned. When ground truth is corrected (because an extraction decision changed operationally), the version history is preserved. This allows retroactive analysis: did the model regress, or did the ground truth definition change?
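A minimal sketch of what versioned ground truth could look like, assuming each correction is stored with an effective date and a reason. The class and field names are hypothetical; a real system would likely persist this in a database or version-controlled store.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LabelVersion:
    effective: date   # date this ground truth took effect
    fields: dict      # field name -> expected value
    reason: str       # why the label was set or corrected

@dataclass
class GoldenDocument:
    doc_id: str
    doc_type: str
    versions: list = field(default_factory=list)

    def correct(self, effective: date, fields: dict, reason: str) -> None:
        """Add a new ground truth version without discarding earlier ones."""
        self.versions.append(LabelVersion(effective, dict(fields), reason))
        self.versions.sort(key=lambda version: version.effective)

    def truth_as_of(self, when: date) -> dict:
        """Ground truth that was in force on a given date."""
        applicable = [v for v in self.versions if v.effective <= when]
        if not applicable:
            raise LookupError(f"no ground truth for {self.doc_id} on {when}")
        return applicable[-1].fields
```

Scoring the same model output against `truth_as_of` for two different dates separates a genuine model regression from a change in the ground truth definition.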

On a scheduled cadence (monthly is typical), a stratified sample from production traffic is added to the golden set. Documents are labeled by the team responsible for ground truth, not auto-labeled by the model being tested.

For a production system processing 1,000+ documents per day, the golden set should cover a minimum of 500–1,000 documents per document type, with at least 50 examples per meaningful document sub-variant.
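The sampling and coverage rules above can be sketched as follows. This assumes each document is a dict carrying hypothetical `doc_type` and `sub_variant` keys; the function names are illustrative.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(documents, per_stratum, seed=0):
    """Draw up to `per_stratum` documents from each (doc_type, sub_variant) group."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for doc in documents:
        strata[(doc["doc_type"], doc["sub_variant"])].append(doc)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def coverage_gaps(golden_set, minimum=50):
    """Return strata in the golden set with fewer than `minimum` examples."""
    counts = Counter((d["doc_type"], d["sub_variant"]) for d in golden_set)
    return {key: n for key, n in counts.items() if n < minimum}
```

Running `coverage_gaps` before each monthly refresh shows which sub-variants to oversample from production traffic.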

Layer 2: Field-level accuracy tracking

System-level accuracy metrics hide the failures that matter. An overall extraction accuracy of 92% is meaningless if the field that feeds your downstream payment process has 78% accuracy.

Track accuracy at three levels:

**Field-level precision and recall:** For each extraction target, measure precision (when the model extracts a value, how often is it correct) and recall (when the field is present in the document, how often the model extracts it correctly). These diverge in meaningful ways: a high-recall, low-precision model captures most fields that are present but also returns many wrong or spurious values; a high-precision, low-recall model is accurate when it extracts but misses fields frequently.
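A minimal sketch of the per-field calculation, assuming ground truth and model output are available as paired values where `None` means "field absent" or "not extracted". It counts only correct values toward recall; exact string equality stands in for whatever match rule your fields actually need.

```python
def field_precision_recall(pairs):
    """pairs: iterable of (truth, predicted); None means absent / not extracted.

    Returns (precision, recall); either is None when its denominator is zero.
    """
    extracted = correct = present = 0
    for truth, predicted in pairs:
        if truth is not None:
            present += 1          # field exists in the document
        if predicted is not None:
            extracted += 1        # model returned a value
            if predicted == truth:
                correct += 1      # value matches ground truth
    precision = correct / extracted if extracted else None
    recall = correct / present if present else None
    return precision, recall

# Four documents for one field: two correct, one wrong value, one missed.
pairs = [("a", "a"), ("b", "b"), ("c", "x"), ("d", None)]
precision, recall = field_precision_recall(pairs)
```

Here the model extracted three values and got two right (precision 2/3), while the field was present in four documents (recall 1/2).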

**Confidence-stratified accuracy:** Segment accuracy by confidence band (0–70%, 70–85%, 85–95%, 95%+). The shape of this curve reveals calibration problems. If accuracy in the 85–95% band is only 75%, the model is overconfident in that band, and any routing threshold that treats the band as safe for automation will pass too many errors.
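The banding can be sketched as below, assuming each scored extraction is a `(confidence, is_correct)` pair with confidence expressed as a fraction between 0 and 1. The function name and output shape are illustrative.

```python
# Bands from the text as fractions; upper bounds are exclusive except the last.
BANDS = [(0.0, 0.70), (0.70, 0.85), (0.85, 0.95), (0.95, 1.0)]

def accuracy_by_band(results):
    """results: iterable of (confidence, is_correct), confidence in [0, 1]."""
    counts = [[0, 0] for _ in BANDS]  # [total, correct] per band
    for confidence, is_correct in results:
        for index, (low, high) in enumerate(BANDS):
            is_last = index == len(BANDS) - 1
            if low <= confidence < high or (is_last and confidence == high):
                counts[index][0] += 1
                counts[index][1] += int(is_correct)
                break
    return [
        {"band": band, "n": total, "accuracy": (correct / total if total else None)}
        for band, (total, correct) in zip(BANDS, counts)
    ]
```

Bands with very few examples will produce noisy accuracy figures, so it is worth reporting `n` alongside each value.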

**Document-type breakdown:** Track field accuracy separately per document type and sub-type. A single accuracy number across all document types masks the variation.

Store these metrics in a time-series format so trends are visible. Accuracy drift of 2–3% per month looks small in isolation; sustained over a quarter, it adds up to a drop of roughly 6–9 points, which is a system problem.

Layer 3: Regression test suite

Every change to the system — model update, schema change, prompt modification, pre-processing logic update, exception routing rule change — must run against the regression suite before deployment.

The regression suite is not the golden dataset. The golden dataset is comprehensive and expensive to run. The regression suite is a curated, fast-running subset designed to catch the most common failure modes and the specific regressions that previous changes introduced.

**Smoke tests:** 50–100 documents covering the primary document types and happy-path extraction scenarios. These run in under 5 minutes and catch obviously broken changes.

**Regression-specific tests:** Every time a bug is found in production, add a representative example to the regression suite. The suite accumulates institutional memory about what has broken before.

**Boundary condition tests:** Documents at known edge cases — minimum confidence scores, maximum page counts, tables that span pages, fields in unusual positions, documents with watermarks.

**Pass/fail gates:** Define explicit thresholds: overall accuracy on the regression set drops by no more than 1 percentage point, no individual field drops by more than 2 points, no document type drops by more than 3 points, and no regression-specific test fails. Gates are explicit and enforced, not subjective.
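A minimal sketch of an enforced gate, assuming accuracy is reported in percent and the thresholds are read as percentage-point drops against a stored baseline. The metric dictionary layout and the function name `check_gates` are hypothetical.

```python
def check_gates(baseline, candidate, regression_failures=0,
                max_overall_drop=1.0, max_field_drop=2.0, max_doc_type_drop=3.0):
    """Compare candidate metrics to baseline; return a list of gate failures.

    baseline / candidate: {"overall": float,
                           "fields": {name: float},
                           "doc_types": {name: float}}  (all in percent)
    """
    failures = []
    overall_drop = baseline["overall"] - candidate["overall"]
    if overall_drop > max_overall_drop:
        failures.append(f"overall accuracy dropped {overall_drop:.1f} points")
    for group, limit in (("fields", max_field_drop), ("doc_types", max_doc_type_drop)):
        for name, base_value in baseline[group].items():
            drop = base_value - candidate[group][name]
            if drop > limit:
                failures.append(f"{group}/{name} dropped {drop:.1f} points")
    if regression_failures > 0:
        failures.append(f"{regression_failures} regression-specific test(s) failed")
    return failures  # an empty list means every gate passed
```

Returning the full list of failures, rather than stopping at the first, gives the reviewer the complete picture in one run.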

Layer 4: Confidence calibration testing

Confidence calibration is tested separately from accuracy because they measure different things. An accurate model can be miscalibrated. A miscalibrated model cannot be trusted for routing decisions.

For each confidence band (in 5% increments from 0–100%), measure the actual accuracy rate on the golden dataset. Plot the calibration curve. A well-calibrated model produces a line close to the diagonal — 80% confidence means ~80% accuracy.
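The calibration curve can be computed with a sketch like the one below, assuming `(confidence, is_correct)` pairs with confidence as a fraction. The weighted average gap it returns is commonly called expected calibration error; it is included here as one possible summary number, not something the pipeline described in this guide prescribes.

```python
def calibration_report(results, bin_width=0.05):
    """Bin results by confidence; return (rows for non-empty bins, weighted gap)."""
    n_bins = round(1 / bin_width)
    bins = [{"n": 0, "conf_sum": 0.0, "correct": 0} for _ in range(n_bins)]
    for confidence, is_correct in results:
        # Values exactly on a bin edge may land on either side due to float rounding.
        index = min(int(confidence / bin_width), n_bins - 1)
        bins[index]["n"] += 1
        bins[index]["conf_sum"] += confidence
        bins[index]["correct"] += int(is_correct)

    total = sum(b["n"] for b in bins)
    rows = []
    for index, b in enumerate(bins):
        if b["n"] == 0:
            continue
        mean_confidence = b["conf_sum"] / b["n"]
        accuracy = b["correct"] / b["n"]
        rows.append({
            "bin_start": index * bin_width,
            "n": b["n"],
            "mean_confidence": mean_confidence,
            "accuracy": accuracy,
            "gap": accuracy - mean_confidence,  # negative = overconfident
        })
    weighted_gap = sum(r["n"] / total * abs(r["gap"]) for r in rows) if total else None
    return rows, weighted_gap
```

Plotting `mean_confidence` against `accuracy` for each row gives the calibration curve; rows with a negative `gap` sit below the diagonal.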

Common miscalibration patterns:

**Overconfidence at the high end:** The model reports 95% confidence on fields where actual accuracy is 85%. This causes the routing system to pass documents to automated processing that should go to human review.

**Underconfidence in the middle:** The model reports 65–75% confidence on fields it's actually getting right 90% of the time. This sends documents to human review unnecessarily, increasing operational cost.

**Distribution shift:** Calibration was validated at launch. The model's confidence distribution has shifted as the input distribution changed, but calibration hasn't been retested.

Run calibration tests quarterly, and after any change that affects the extraction model, prompt, or pre-processing layer.

Layer 5: Pre-deployment validation gates

Combine the above layers into a pre-deployment validation pipeline that runs automatically before any change reaches production.

The pipeline:

1. **Smoke test run** (5 minutes): Catches obviously broken changes. Gate: all smoke tests pass.
2. **Regression suite run** (15–30 minutes): Runs the full regression set. Gate: no field regression exceeds 2%, no regression-specific test fails.
3. **Golden dataset sample run** (1–2 hours): Runs a stratified 20% sample of the golden dataset. Gate: field-level accuracy within 1% of baseline, calibration curve within acceptable tolerance.
4. **Diff report generation**: Produces a comparison of accuracy metrics before and after the change, by field, by document type, and by confidence band.
5. **Deployment confirmation**: Human sign-off on the diff report before the change goes to production.

The full pipeline adds 2–3 hours to the release cycle. For systems processing regulated documents (KYC, clinical records, financial filings), it's not optional.
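The staged structure of the pipeline can be sketched as a runner that executes the cheapest stage first and stops at the first failed gate. The stage callables are hypothetical placeholders for your own smoke, regression, and golden-sample runs; human sign-off stays outside the code.

```python
def run_pipeline(stages):
    """Run (name, gate) stages in order; stop at the first gate that fails.

    Each gate is a zero-argument callable returning a list of failure messages
    (empty list = pass). Returns (passed, report).
    """
    report = []
    for name, run_gate in stages:
        failures = run_gate()
        report.append((name, failures))
        if failures:
            return False, report  # later, more expensive stages are skipped
    return True, report
```

Ordering stages from fastest to slowest means an obviously broken change fails in minutes rather than after the golden dataset run.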

Layer 6: Production monitoring as continuous QA

Pre-deployment testing validates a change before it ships. Production monitoring catches what testing missed.

**What to instrument:**

*Field-level extraction rates:* Track how often each field is extracted (vs. missing/null) in production. A sudden drop in extraction rate for a key field indicates a new document variant the system isn't handling, or an upstream formatting change.

*Confidence score distributions:* Track the distribution of confidence scores for each field daily. A shift in the distribution signals input distribution drift before accuracy drops are measurable.
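One way to quantify a shift in the confidence distribution is the population stability index (PSI), which compares binned proportions between a reference window and the current window. This is a swapped-in technique, not one the guide specifies; the bin count and the floor value are assumptions.

```python
import math

def population_stability_index(reference, current, bins=10, floor=1e-6):
    """PSI between two lists of confidence scores in [0, 1]."""
    if not reference or not current:
        raise ValueError("both score lists must be non-empty")

    def proportions(scores):
        counts = [0] * bins
        for score in scores:
            counts[min(int(score * bins), bins - 1)] += 1
        # Floor empty bins so the logarithm below is always defined.
        return [max(count / len(scores), floor) for count in counts]

    ref = proportions(reference)
    cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical distributions give a PSI of 0. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the alert threshold should be tuned per field against your own history.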

*Exception routing rates:* Track what percentage of documents are being routed to human review, by document type and by reason. Rising exception rates indicate either model degradation or input distribution shift.

*Downstream error rates:* Connect document intelligence metrics to downstream process metrics. Downstream payment processing errors, data entry corrections, and rejected submissions are lagging but direct indicators of extraction accuracy problems, and are often where those problems first become visible to the business.

Set alert thresholds and SLAs for each metric. When a metric breaches its threshold, it triggers a review — not necessarily a rollback, but an investigation with a defined protocol.

The common failure to avoid

The most common failure mode in document intelligence QA is treating it as a one-time exercise. Pre-launch validation is necessary but insufficient. Production systems face input distributions, schema changes, and operational decisions that weren't anticipated at build time.

The teams that maintain reliable document intelligence systems treat QA infrastructure as a first-class component: governed golden datasets, automated regression pipelines, calibration testing, pre-deployment gates, and production monitoring with defined SLAs.

Teams that skip this end up re-validating their entire system manually when something breaks, because they have no other option.

Ashtayah Labs

AI Systems Team

Document Intelligence Testing & QA: How to Build a System You Can Actually Trust in Production