What's the most important metric to track first?

Straight-through processing (STP) rate. It directly translates to cost and throughput, it's straightforward to measure, and it drives every other instrumentation decision. Once you know your STP rate by document type, you know where to focus.

How do I measure field-level accuracy without labelling every document?

Use your human review queue as your ground truth sample. Every document that goes through review produces a correction signal. This is a biased sample — it overrepresents low-confidence extractions — but it's sufficient for tracking field-level accuracy trends over time.

How often should confidence calibration be checked?

Quarterly at minimum, and immediately after any model update or significant change in document source mix. Confidence scores are stable until they aren't — drift is gradual and easy to miss without a scheduled calibration check.

What STP rate should we be targeting?

Depends on document type. For standard structured documents (invoices, purchase orders, standard forms), 85%+ is achievable with a well-tuned system. For highly variable documents (handwritten forms, non-standard contracts, multi-language documents), 65–75% is more realistic. Don't adopt vendor benchmarks — measure your own document population and set targets based on your actual input distribution.

When should we retrain the extraction model?

When field-level accuracy on a specific field type drops below an acceptable threshold, when exception rates on a specific carrier or vendor increase materially, or after 90 days of production operation (proactive retraining on accumulated ground truth). The correction feedback loop tells you exactly which examples to add.

Document Intelligence Metrics: Accuracy, SLA & ROI in Production | Ashtayah Labs

The most common mistake in document intelligence isn't a model choice or an architecture decision. It's treating accuracy as something you measure once — during a proof of concept — and then trust indefinitely.

A production document intelligence system is not a trained model sitting on a server. It's a living system processing variable inputs, interacting with downstream services, and accumulating edge cases over time. The metrics that tell you whether it's working are not the same as the metrics that told you it was ready to ship.

This guide is for engineering teams who have a document intelligence system in production — or are close to it — and need to instrument it properly. Not the model accuracy metrics your IDP vendor shows in their dashboard, but the operational metrics that tell you whether the system is doing its job.

Why Benchmark Accuracy Is the Wrong Starting Point

Published accuracy benchmarks measure performance on curated datasets under controlled conditions. The gap between benchmark performance and production performance for document AI is consistently 15–25 percentage points. That gap exists because production documents are messier: poor scan quality, non-standard layouts, handwritten annotations, multilingual content, formats that weren't in the training set.

A system that achieves 97% accuracy on a benchmark can produce a 15% exception rate in production, and that failure will not surface in any model evaluation metric. The benchmark was right. The production system is failing. The two facts are not contradictory.

The first instrumentation decision is this: stop reporting benchmark accuracy to stakeholders. Report production operational metrics instead.

Layer 1: Field-Level Extraction Metrics

Aggregate accuracy — "the system is 94% accurate" — is not an operational metric. It hides the distribution of where errors occur, and in document processing, that distribution matters enormously.

Field-level accuracy tracking disaggregates performance by field type. A system might be 99.8% accurate on invoice dates but 82% accurate on line-item unit prices. Both contribute to the aggregate. Only one is causing your exception queue backlog.

Track per-field: extraction rate (what percentage of submitted documents contain this field in the output), field-level accuracy (for fields where ground truth is available from human corrections, how often the extracted value is correct), confidence distribution (the distribution of model confidence scores for this field across real documents), and exception rate (what percentage of documents send this field to human review).

The confidence distribution tells you where your auto-approval threshold is actually sitting relative to your real document population. If 40% of your production invoice dates have confidence scores between 70–80%, you need to decide whether that's acceptable to auto-approve or whether it should route to review.

Every extracted field should emit a structured event to your logging system containing: document ID, field name, extracted value, confidence score, and whether it was auto-approved or sent to review. Run daily aggregations over these events. This is your field-level accuracy pipeline.
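As a rough illustration of the event and the daily aggregation, here is a minimal sketch. The `FieldEvent` class, the `daily_field_summary` function, and the disposition labels `"auto_approved"` and `"review"` are hypothetical names chosen for this example, not part of any particular logging system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEvent:
    """One structured log event per extracted field (names are illustrative)."""
    document_id: str
    field_name: str
    extracted_value: str
    confidence: float
    disposition: str  # assumed labels: "auto_approved" or "review"


def daily_field_summary(events):
    """Aggregate one day's events into per-field count, mean confidence, exception rate."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.field_name].append(event)

    summary = {}
    for field_name, field_events in grouped.items():
        count = len(field_events)
        reviewed = sum(1 for e in field_events if e.disposition == "review")
        summary[field_name] = {
            "count": count,
            "mean_confidence": sum(e.confidence for e in field_events) / count,
            "exception_rate": reviewed / count,
        }
    return summary


# Made-up sample events for two documents.
events = [
    FieldEvent("doc-1", "invoice_date", "2024-03-01", 0.98, "auto_approved"),
    FieldEvent("doc-1", "unit_price", "12.50", 0.71, "review"),
    FieldEvent("doc-2", "invoice_date", "2024-03-02", 0.95, "auto_approved"),
    FieldEvent("doc-2", "unit_price", "8.00", 0.93, "auto_approved"),
]
summary = daily_field_summary(events)
for field_name, stats in summary.items():
    print(field_name, stats)
```

In practice the events would come from your log aggregation rather than an in-memory list; the grouping logic stays the same.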

Layer 2: Confidence Calibration

A confidence score is only useful if it's calibrated. Calibration means that when the system reports 90% confidence on a field extraction, it is correct 90% of the time at that confidence level.

Uncalibrated confidence scores are cosmetic. If you set an auto-approval threshold at 90% and a reported 90% confidence actually means 78% correct, then roughly 22% of the fields you auto-approve at that confidence level are wrong, and nothing flags them. That number compounds across every document in your pipeline.

Measure calibration by binning your production extractions by confidence score (0–0.6, 0.6–0.7, 0.7–0.8, 0.8–0.9, 0.9–1.0) and measuring actual accuracy within each bin using human-corrected ground truth. Plot accuracy against confidence: a well-calibrated system produces a near-diagonal line.
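The binning step can be sketched in a few lines. This assumes you already have pairs of (confidence, was the extraction correct) from human-corrected ground truth; the `calibration_table` function and the sample data are illustrative only.

```python
import bisect

# Bin edges from the text: 0-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, 0.9-1.0
BIN_EDGES = [0.0, 0.6, 0.7, 0.8, 0.9, 1.0]


def calibration_table(samples):
    """samples: iterable of (confidence, is_correct).

    Returns one (low, high, count, accuracy) row per bin; accuracy is None
    for empty bins. A confidence of exactly 1.0 falls in the last bin.
    """
    bin_count = len(BIN_EDGES) - 1
    counts = [0] * bin_count
    correct = [0] * bin_count
    for confidence, is_correct in samples:
        index = min(bisect.bisect_right(BIN_EDGES, confidence) - 1, bin_count - 1)
        counts[index] += 1
        correct[index] += int(is_correct)
    return [
        (BIN_EDGES[i], BIN_EDGES[i + 1], counts[i],
         correct[i] / counts[i] if counts[i] else None)
        for i in range(bin_count)
    ]


# Made-up ground truth: (confidence, extraction was correct)
samples = [
    (0.95, True), (0.92, True), (0.91, False), (0.97, True),
    (0.65, True), (0.62, False),
]
rows = calibration_table(samples)
for low, high, count, accuracy in rows:
    print(f"{low:.1f}-{high:.1f}: n={count}, accuracy={accuracy}")
```

In this toy sample the 0.9–1.0 bin is only 75% accurate, which is the kind of overconfidence the check is meant to surface. Real bins need far more samples before the per-bin accuracy means anything.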

Recalibrate quarterly, or whenever you retrain the extraction model. Document layout drift affects confidence calibration before it affects raw accuracy — the model becomes systematically overconfident on document variants it's unfamiliar with.

The silent failure metric: track what percentage of auto-approved fields are subsequently found to be wrong. Best-in-class systems target below 0.5%. If your silent failure rate is above 2%, your confidence calibration is off, your approval threshold is too permissive, or both.
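A small sketch of the check, assuming you have some way to count auto-approved fields later found wrong (downstream disputes, audits, or spot checks). The function names and returned labels are hypothetical; the 0.5% and 2% thresholds are the ones quoted above.

```python
BEST_IN_CLASS_TARGET = 0.005  # below 0.5%
ALARM_LEVEL = 0.02            # above 2%


def silent_failure_rate(auto_approved_count, later_found_wrong_count):
    """Share of auto-approved fields subsequently found to be wrong."""
    if auto_approved_count == 0:
        return 0.0
    return later_found_wrong_count / auto_approved_count


def assess_silent_failures(rate):
    """Map a silent failure rate onto the thresholds from the text."""
    if rate < BEST_IN_CLASS_TARGET:
        return "within best-in-class target"
    if rate > ALARM_LEVEL:
        return "check calibration and approval threshold"
    return "between target and alarm level"


rate = silent_failure_rate(auto_approved_count=10_000, later_found_wrong_count=30)
print(f"{rate:.2%}: {assess_silent_failures(rate)}")
```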

Layer 3: Straight-Through Processing (STP) Rate

Straight-through processing rate measures what percentage of documents complete processing without any human intervention. It is the single most important operational metric for a document intelligence system because it directly translates to cost and throughput.

An STP rate of 60% means 40% of your document volume hits your human review queue. An STP rate of 87% means only 13% does. Top performers achieve 85%+ STP for invoice processing against an industry average of 40–60%.

Track STP rate by document type (bills of lading (BOLs) and standard invoices should have different STP targets than handwritten forms), by carrier/vendor/source (one vendor's invoices may consistently underperform, signalling a training data gap), and over time (a rolling 30-day STP rate by document type reveals drift before it becomes a crisis).
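One way to compute the rolling 30-day rate by document type is sketched below. It assumes one record per document with a processing date, a document type, and a flag for whether it completed without human intervention; the record layout and the `rolling_stp_rate` name are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def rolling_stp_rate(records, as_of, window_days=30):
    """records: iterable of (processed_on, document_type, was_straight_through).

    Returns {document_type: STP rate} over the window ending on as_of (inclusive).
    """
    window_start = as_of - timedelta(days=window_days - 1)
    totals = defaultdict(int)
    straight_through = defaultdict(int)
    for processed_on, document_type, was_straight_through in records:
        if window_start <= processed_on <= as_of:
            totals[document_type] += 1
            straight_through[document_type] += int(was_straight_through)
    return {doc_type: straight_through[doc_type] / totals[doc_type] for doc_type in totals}


# Made-up records; the first one falls outside the 30-day window.
records = [
    (date(2024, 3, 1), "invoice", False),
    (date(2024, 3, 10), "invoice", True),
    (date(2024, 3, 20), "invoice", True),
    (date(2024, 3, 25), "invoice", False),
    (date(2024, 3, 28), "handwritten_form", True),
    (date(2024, 3, 30), "handwritten_form", False),
]
rates = rolling_stp_rate(records, as_of=date(2024, 3, 31))
for doc_type, stp in rates.items():
    print(f"{doc_type}: {stp:.0%}")
```

The same function works for the carrier/vendor/source breakdown if you pass the source identifier in place of the document type.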

STP rate is also your primary capacity planning metric. If your review team can handle 200 manual reviews per day and your STP rate drops from 85% to 75%, you've just increased manual volume by 67%.
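The arithmetic behind that 67% figure is worth making explicit. The sketch below assumes a volume of 1,000 documents per day, which is a made-up number; the relative increase does not depend on it.

```python
def manual_reviews_per_day(daily_documents, stp_rate):
    """Documents that fall out of straight-through processing each day."""
    return daily_documents * (1 - stp_rate)


DAILY_DOCUMENTS = 1000  # hypothetical volume, not from the article
REVIEW_CAPACITY = 200   # manual reviews per day, from the example above

before = manual_reviews_per_day(DAILY_DOCUMENTS, 0.85)
after = manual_reviews_per_day(DAILY_DOCUMENTS, 0.75)
relative_increase = (after - before) / before

print(f"manual reviews/day: {before:.0f} -> {after:.0f}")
print(f"relative increase: {relative_increase:.0%}")
print(f"over capacity: {after > REVIEW_CAPACITY}")
```

Manual volume moves from 15% to 25% of documents, so a 10-point STP drop is a two-thirds increase in review work. At the assumed volume, that takes the team from under its 200-per-day capacity to over it.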

When STP rate drops: investigate by document type first. A drop concentrated in one document type usually signals a source format change. A broad drop across document types usually signals model degradation or an infrastructure problem.

Layer 4: SLA Compliance Tracking

Document processing has SLAs — even when they're not explicitly named. Finance teams have payment cycle requirements. Operations teams have carrier invoice windows. Compliance teams have regulatory deadlines.

SLA compliance tracking measures end-to-end cycle time: from document ingestion to final disposition. Instrument ingestion-to-extraction latency (measure P50, P95, and P99 — tail latency matters at volume), queue wait time, end-to-end cycle time by document type, and SLA breach rate.
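A minimal sketch of two of these measurements using only the Python standard library. The function names and sample values are illustrative; the interpolation method passed to `statistics.quantiles` is a choice made for this example, and other percentile definitions give slightly different tail values.

```python
import statistics


def latency_percentiles(latencies_seconds):
    """Return P50, P95, and P99 of ingestion-to-extraction latency."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(latencies_seconds, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def sla_breach_rate(cycle_times_hours, sla_hours):
    """Share of documents whose end-to-end cycle time exceeded the SLA."""
    breaches = sum(1 for hours in cycle_times_hours if hours > sla_hours)
    return breaches / len(cycle_times_hours)


# Made-up data: latencies of 1..100 seconds, and four cycle times in hours.
latencies = [float(seconds) for seconds in range(1, 101)]
percentiles = latency_percentiles(latencies)
breach_rate = sla_breach_rate([4.2, 12.0, 50.0, 30.5], sla_hours=48)

print(percentiles)
print(f"SLA breach rate: {breach_rate:.0%}")
```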

Set alerts at 70% of SLA duration, not at breach. If an invoice has a 48-hour processing window, the 70% mark falls at 33.6 hours. If it has sat in the review queue that long without a human touching it, you want to know at hour 34, not at hour 49.
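The early-warning rule is simple enough to sketch directly. The function names and the `resolved` flag are hypothetical; the 0.7 fraction and the 48-hour window come from the example above.

```python
from datetime import datetime, timedelta

ALERT_FRACTION = 0.7  # alert at 70% of the SLA window, per the text


def alert_due_at(ingested_at, sla):
    """Timestamp at which an unresolved document should trigger an alert."""
    return ingested_at + sla * ALERT_FRACTION


def should_alert(ingested_at, sla, now, resolved=False):
    """True once an unresolved document has used 70% of its SLA window."""
    return (not resolved) and now >= alert_due_at(ingested_at, sla)


ingested = datetime(2024, 3, 1, 9, 0)
sla = timedelta(hours=48)

print("alert due at:", alert_due_at(ingested, sla))
print("at 33h:", should_alert(ingested, sla, ingested + timedelta(hours=33)))
print("at 34h:", should_alert(ingested, sla, ingested + timedelta(hours=34)))
```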

Separate model latency SLAs from queue SLAs. A spike in model latency is a technical problem. A spike in queue wait time is a staffing or routing problem. Conflating them makes both harder to diagnose.

Layer 5: Exception Analysis and Root Cause Tracking

Not all exceptions are equal, and treating them as a single queue hides the information you need to improve the system.

Classify every exception at routing time: extraction failure (document couldn't be parsed — image quality, unsupported format, encryption), low confidence (extraction succeeded but below threshold), validation failure (extracted value fails a business rule), and cross-document mismatch (extracted values conflict with a related document).

The ratio between exception classes tells you where to invest next. High extraction failure rate → image quality pre-screening or format coverage gap. High low-confidence rate on specific fields → training data gap. High validation failure rate → business rule coverage gap. High cross-document mismatch → reconciliation logic gap.
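A sketch of classification at routing time and the class ratio that follows from it. The four classes are the ones named above; the `classify_exception` signature, its boolean inputs, and the order in which the checks run are assumptions made for this example.

```python
from collections import Counter
from enum import Enum


class ExceptionClass(Enum):
    EXTRACTION_FAILURE = "extraction_failure"
    LOW_CONFIDENCE = "low_confidence"
    VALIDATION_FAILURE = "validation_failure"
    CROSS_DOCUMENT_MISMATCH = "cross_document_mismatch"


def classify_exception(parsed, confidence, threshold, passes_rules, matches_related):
    """Return the first applicable exception class, or None for straight-through.

    The check order (parse, confidence, business rules, cross-document) is assumed.
    """
    if not parsed:
        return ExceptionClass.EXTRACTION_FAILURE
    if confidence < threshold:
        return ExceptionClass.LOW_CONFIDENCE
    if not passes_rules:
        return ExceptionClass.VALIDATION_FAILURE
    if not matches_related:
        return ExceptionClass.CROSS_DOCUMENT_MISMATCH
    return None


def exception_class_shares(classes):
    """Share of each exception class among all exceptions (None entries ignored)."""
    counts = Counter(c for c in classes if c is not None)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


# Made-up routing decisions for five documents.
routed = [
    classify_exception(False, 0.0, 0.9, False, False),
    classify_exception(True, 0.72, 0.9, True, True),
    classify_exception(True, 0.81, 0.9, True, True),
    classify_exception(True, 0.95, 0.9, False, True),
    classify_exception(True, 0.97, 0.9, True, True),
]
shares = exception_class_shares(routed)
for exception_class, share in shares.items():
    print(f"{exception_class.value}: {share:.0%}")
```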

When a human reviewer corrects an extraction, log every correction: document ID, field name, extracted value, corrected value, and confidence score at the time of the incorrect extraction. Aggregate these monthly. This is your training improvement queue — the highest-priority examples for your next fine-tuning run. Without this loop, your model never learns from its production failures.
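The correction log and its monthly aggregation might look like the sketch below. The `Correction` record mirrors the fields listed above; ranking fields by correction count is one plausible way to prioritize the queue and is an assumption of this example, not a prescribed method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One reviewer correction (field names follow the list in the text)."""
    document_id: str
    field_name: str
    extracted_value: str
    corrected_value: str
    confidence: float  # confidence at the time of the incorrect extraction


def corrections_by_field(corrections):
    """Return (field_name, correction_count) pairs, most-corrected field first."""
    return Counter(c.field_name for c in corrections).most_common()


# Made-up corrections for one month.
corrections = [
    Correction("doc-1", "unit_price", "12.50", "125.0", 0.88),
    Correction("doc-2", "unit_price", "8.00", "80.0", 0.91),
    Correction("doc-3", "unit_price", "3.10", "31.0", 0.76),
    Correction("doc-3", "invoice_date", "2024-03-02", "2024-02-03", 0.93),
]
priorities = corrections_by_field(corrections)
for field_name, count in priorities:
    print(field_name, count)
```

Keeping the confidence on each record also lets you pull out the corrections that were made at high confidence, which feed directly into the calibration check.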

Layer 6: Operational ROI Measurement

The ROI formula for document intelligence is straightforward: cost avoided equals the manual baseline (documents processed times manual handling time per document times fully-loaded cost per hour) minus the cost of running the system (system cost plus human review time at the current STP rate times fully-loaded cost per hour).

The inputs come from your existing instrumentation: total documents processed, STP rate, average manual review time per document (usually 2–8 minutes depending on complexity), and error correction costs from your silent failure rate and downstream cost of processing errors.
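Here is the formula as a small function, as a sketch under stated assumptions. The hourly cost, system cost, and per-document minutes in the example are made-up inputs, and this version leaves out error correction costs; subtract them as a further term if you track them.

```python
def monthly_cost_avoided(documents, manual_minutes_per_doc, review_minutes_per_doc,
                         stp_rate, hourly_cost, system_cost):
    """Cost avoided = fully manual baseline - (system cost + remaining review cost)."""
    manual_baseline = documents * manual_minutes_per_doc / 60 * hourly_cost
    review_hours = documents * (1 - stp_rate) * review_minutes_per_doc / 60
    return manual_baseline - (system_cost + review_hours * hourly_cost)


# Hypothetical month: every figure below is illustrative, not a benchmark.
avoided = monthly_cost_avoided(
    documents=42_000,
    manual_minutes_per_doc=5,
    review_minutes_per_doc=5,
    stp_rate=0.87,
    hourly_cost=40.0,
    system_cost=20_000.0,
)
print(f"cost avoided this month: {avoided:,.0f}")

# The same month with STP eight points lower shows the ROI impact of drift.
degraded = monthly_cost_avoided(42_000, 5, 5, 0.79, 40.0, 20_000.0)
print(f"with STP at 79%: {degraded:,.0f}")
```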

Report ROI monthly, not annually. Annual calculations hide month-to-month variance that reveals whether the system is improving or degrading. A month where STP rate dropped 8 points should show a corresponding ROI impact — that connection makes the case for retraining investment.

Instead of "our model is 94% accurate," you can report: "We processed 42,000 documents this month. 87% straight-through, 13% human review. Average cycle time 4.2 hours against a 48-hour SLA. Three carriers need model retraining — we're adding them to next sprint." That is the conversation engineering teams should be able to have.

The Instrumentation Stack

You don't need a specialized platform. The components are: event logging (every field extraction emits a structured log event — document ID, field, value, confidence, disposition — via your existing log aggregation), a ground truth pipeline (human review corrections logged with the same document and field IDs, joined daily against extraction events for field-level accuracy), metrics aggregation (daily aggregations give you STP rate, confidence distributions, exception class ratios, and latency percentiles), and SLA tracking (ingestion timestamps and final disposition timestamps per document give you cycle time distribution).
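The ground truth join is the piece that ties the stack together, so here is a minimal sketch of it. It assumes extractions and reviewer-confirmed values are both keyed by (document ID, field name) and compared as exact strings; real pipelines usually need value normalization first. The `field_level_accuracy` name and the sample data are illustrative.

```python
from collections import defaultdict


def field_level_accuracy(extractions, reviewed):
    """Join extraction events to review outcomes and score each field.

    extractions: {(document_id, field_name): extracted_value}
    reviewed:    {(document_id, field_name): value confirmed or corrected by a reviewer}
    Returns {field_name: accuracy over reviewed fields only}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for key, true_value in reviewed.items():
        if key not in extractions:
            continue  # review record with no matching extraction event
        field_name = key[1]
        totals[field_name] += 1
        correct[field_name] += int(extractions[key] == true_value)
    return {name: correct[name] / totals[name] for name in totals}


# Made-up day of data.
extractions = {
    ("doc-1", "invoice_date"): "2024-03-01",
    ("doc-1", "unit_price"): "12.50",
    ("doc-2", "unit_price"): "8.00",
    ("doc-3", "unit_price"): "3.10",
}
reviewed = {
    ("doc-1", "invoice_date"): "2024-03-01",
    ("doc-1", "unit_price"): "125.0",
    ("doc-2", "unit_price"): "8.00",
}
accuracy = field_level_accuracy(extractions, reviewed)
print(accuracy)
```

Because only reviewed fields have ground truth, the result describes the review sample rather than the full document population.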

Build this as a daily pipeline, not real-time. Real-time monitoring adds engineering complexity; daily operational metrics are sufficient for most document intelligence workloads unless you're processing time-critical documents with sub-hour SLAs.

Ashtayah Labs

AI Systems Team

Document Intelligence Metrics: How to Instrument Your System for Accuracy, SLA Compliance, and Operational ROI