The prototype is the easy part
Most teams can get a document extraction demo working in a day. Feed a few PDFs into an LLM, get structured JSON back, show the stakeholder, declare success. The problem starts when you move to production — where you encounter documents printed and scanned at 70 DPI, forms that were designed in 1997 and never updated, handwritten amendments over typed text, and a long tail of edge cases that never appeared in your test set.
Document intelligence at production scale is not a prompting problem. It is a systems problem.
Accuracy targets that actually mean something
Field-level accuracy is not a single number. A "95% accuracy" claim on an invoice extraction model can mean very different things depending on which fields are counted, what counts as a match, and whether rare but critical fields (like tax IDs or payment terms) are included in the average.
For production systems, we specify accuracy per field class: high-stakes fields (amounts, account numbers, dates) require different targets than descriptive fields (vendor name, line item descriptions). We track precision and recall separately — because a system that extracts the right value when it tries but skips uncertain cases behaves very differently from one that always attempts extraction but is frequently wrong.
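As a minimal sketch of tracking precision and recall per field, assume each evaluation record is a (field name, predicted value, expected value) tuple, with None meaning the system skipped the field or the field is absent from the ground truth. The function name, the record shape, and the exact-match comparison are illustrative assumptions, not a prescribed method; a real pipeline would likely normalise values (dates, currency formats) before comparing.

```python
from collections import defaultdict


def field_precision_recall(records):
    """Compute precision and recall per field.

    records: iterable of (field_name, predicted, expected) tuples.
    None means "no value" (hypothetical convention for this sketch).
    """
    counts = defaultdict(lambda: {"attempted": 0, "present": 0, "correct": 0})
    for field_name, predicted, expected in records:
        c = counts[field_name]
        if predicted is not None:
            c["attempted"] += 1  # the system tried to extract a value
        if expected is not None:
            c["present"] += 1    # the ground truth has a value
        if predicted is not None and predicted == expected:
            c["correct"] += 1    # exact match (an assumption of this sketch)

    report = {}
    for field_name, c in counts.items():
        report[field_name] = {
            # Precision: of the values extracted, how many were right.
            "precision": c["correct"] / c["attempted"] if c["attempted"] else None,
            # Recall: of the values that exist, how many were found correctly.
            "recall": c["correct"] / c["present"] if c["present"] else None,
        }
    return report


sample = [
    ("total_amount", "100.00", "100.00"),
    ("total_amount", None, "250.00"),      # skipped: hurts recall only
    ("total_amount", "75.00", "75.00"),
    ("vendor_name", "Acme", "Acme Ltd"),   # wrong: hurts precision and recall
    ("vendor_name", "Globex", "Globex"),
]
print(field_precision_recall(sample))
```

In this sample, total_amount has precision 1.0 but recall of two thirds because the system skipped one document, while vendor_name scores 0.5 on both because it always attempts extraction and is wrong half the time. A single averaged accuracy figure would hide that difference.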
The exception queue is not a failure mode
A production document intelligence system should have an exception queue. Low-confidence extractions, field validation failures, document type mismatches — these should be routed to a human review workflow, not silently passed downstream.
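The routing decision described above can be sketched roughly as follows. The Extraction type, the route function, the 0.9 confidence threshold, and the validator shown in the usage are all hypothetical; the point is only that every extraction ends in an explicit "accept" or "review" outcome with recorded reasons, so nothing is passed downstream silently.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """Hypothetical container for one document's extraction result."""
    doc_type: str           # type predicted by the classifier
    expected_doc_type: str  # type the pipeline expected for this input
    fields: dict            # field name -> (value, confidence in [0, 1])


def route(extraction, validators, min_confidence=0.9):
    """Return ("accept", []) or ("review", reasons).

    validators: dict of field name -> callable(value) -> bool.
    min_confidence: illustrative threshold, to be tuned per field class.
    """
    reasons = []
    if extraction.doc_type != extraction.expected_doc_type:
        reasons.append("document type mismatch")
    for name, (value, confidence) in extraction.fields.items():
        if confidence < min_confidence:
            reasons.append(f"low confidence: {name}")
        validator = validators.get(name)
        if validator is not None and not validator(value):
            reasons.append(f"validation failed: {name}")
    # Any reason at all sends the document to human review.
    return ("review", reasons) if reasons else ("accept", [])


# Usage with a deliberately simple amount validator (an assumption).
validators = {"total_amount": lambda v: v.replace(".", "", 1).isdigit()}
doc = Extraction(
    doc_type="invoice",
    expected_doc_type="invoice",
    fields={"total_amount": ("1O0.00", 0.95), "tax_id": ("123", 0.40)},
)
print(route(doc, validators))
```

Here the amount contains a letter O in place of a zero, so it fails validation despite a high confidence score, and the tax ID falls below the threshold. Both reasons travel with the document into the review queue.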
Teams that treat the exception queue as an embarrassment (a sign that the model is not good enough) build systems that fail silently. Teams that treat it as a first-class feature build systems that can be trusted. In most production systems we build, the exception rate after initial stabilisation is 5–15% of volume — and that is by design, not by failure.
What monitoring actually looks like
Logging that extraction happened is not monitoring. Production document intelligence systems need: field-level confidence tracking over time to detect drift, document type distribution monitoring to catch upstream changes, exception rate trending with alerting thresholds, end-to-end latency at the 95th and 99th percentiles, and downstream data quality checks (does the extracted data pass the validation rules your ERP would apply?).
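Two of these signals, tail latency and exception rate, can be computed per time window with very little code. The sketch below uses a nearest-rank percentile and assumed alert thresholds (a 5000 ms p99 limit and a 15% exception rate ceiling); the function names and default values are illustrative, and a real deployment would normally push these numbers into an existing metrics and alerting stack instead of returning a dictionary.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_window(latencies_ms, outcomes, max_p99_ms=5000, max_exception_rate=0.15):
    """Summarise one monitoring window and list any threshold breaches.

    latencies_ms: end-to-end latency per document, in milliseconds.
    outcomes: routing decision per document, "accept" or "review".
    Both thresholds are assumptions for illustration only.
    """
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    exception_rate = outcomes.count("review") / len(outcomes)

    alerts = []
    if p99 > max_p99_ms:
        alerts.append(f"p99 latency {p99} ms exceeds {max_p99_ms} ms")
    if exception_rate > max_exception_rate:
        alerts.append(
            f"exception rate {exception_rate:.1%} exceeds {max_exception_rate:.1%}"
        )
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "exception_rate": exception_rate,
        "alerts": alerts,
    }


# Usage: 100 documents, 20 of which were routed to review.
summary = check_window(
    latencies_ms=list(range(1, 101)),
    outcomes=["accept"] * 80 + ["review"] * 20,
)
print(summary["alerts"])
```

Tracking percentiles instead of the mean matters because a small share of slow documents (large scans, many pages) can leave the average looking healthy while the tail breaches downstream timeouts.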
Drift is the underappreciated risk. A model trained on this year's invoices may degrade silently when suppliers change their invoice templates next quarter. Without monitoring, you find out when someone notices a problem in the downstream system — usually after weeks of bad data.
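One simple way to surface this kind of drift is to compare each field's mean extraction confidence in a recent window against a baseline window captured after stabilisation. This is a minimal sketch under stated assumptions: the function name, the input shape, and the 0.05 drop threshold are hypothetical, and mean confidence is only a proxy for accuracy, so it complements periodic checks against human-labelled samples and does not replace them.

```python
from statistics import mean


def confidence_drift(baseline, recent, max_drop=0.05):
    """Return fields whose mean confidence fell by more than max_drop.

    baseline, recent: dict of field name -> list of confidence scores.
    max_drop: illustrative threshold; tune it per field class.
    """
    drifted = {}
    for field_name, base_scores in baseline.items():
        recent_scores = recent.get(field_name)
        if not base_scores or not recent_scores:
            continue  # not enough data to compare this field
        drop = mean(base_scores) - mean(recent_scores)
        if drop > max_drop:
            drifted[field_name] = round(drop, 4)
    return drifted


# Usage: a supplier changes its template and amounts become harder to read.
baseline = {"total_amount": [0.96, 0.98, 0.97], "vendor_name": [0.90, 0.92]}
recent = {"total_amount": [0.80, 0.85, 0.78], "vendor_name": [0.91, 0.90]}
print(confidence_drift(baseline, recent))
```

In this example only total_amount is flagged. Running the comparison per supplier or per document type, where volumes allow, narrows the alert to the template that actually changed.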
What this means for your next project
If you are evaluating a document intelligence project, ask three questions before committing: What is the accuracy target per field, and how will it be measured? What happens to low-confidence extractions? What monitoring will tell us if accuracy degrades after launch?
If those questions have good answers, you are on track for a system that will hold up. If they do not, you are building a prototype that will be called production.
Ashtayah Labs
AI Systems Team