Why does document intelligence cost optimization differ from generic LLM cost optimization?

Document intelligence has heterogeneous input complexity — invoices vs. loan agreements differ by 10–40x in processing cost — a hard accuracy floor tied to STP rate targets, and downstream failure costs when extraction quality degrades. Generic LLM optimization techniques apply but must be implemented relative to these constraints.

What is a realistic per-document extraction cost in a well-optimized production system?

For standard structured documents in a tiered system with schema caching and batch processing: $0.003–0.010 per document. For complex or non-standard documents: $0.05–0.40 per document. The average across a mixed-volume system should be below $0.015 per document if tiering is well-designed.

At what volume does self-hosting extraction models become cost-competitive with API inference?

Self-hosting smaller fine-tuned models for high-volume standard document types typically becomes cost-competitive above 50,000–100,000 documents per day. Below that threshold, API inference with caching and tiering is almost always more economical when engineering cost is included.

How do we know if our STP rate is high enough to reduce extraction model quality for cost savings?

Model the trade-off explicitly: calculate per-document cost at the cheaper extraction quality, estimate the STP rate reduction, multiply the additional exception volume by your human review labor cost per document, and compare total cost. If the exception cost increase exceeds the extraction savings, the model downgrade is not a saving.

What instrumentation should we emit per document for cost governance?

Minimum: document ID, document type, tier assignment, total token count (input + output), number of model API calls, validation status, exception routing flag, total processing cost. This enables segmented cost analysis by document type and tier — the prerequisite for any meaningful optimization.

Document Intelligence Cost Optimization: Production Engineering Guide | Ashtayah Labs

Document intelligence cost problems almost always surface the same way. A pilot runs at $200/month. The team gets confident, expands to 10,000 documents per day, and three months later the extraction infrastructure is costing $180,000 a month. The finance team asks questions. The engineering team scrambles to cut corners on model quality. Accuracy drops. The operations team starts complaining again.

The extraction is expensive. But not in the ways teams typically assume.

Generic LLM cost optimization advice — prompt caching, semantic caching, model routing — describes techniques that work on conversational systems. A document intelligence pipeline is a different engineering problem. It has multi-page inputs with heterogeneous formats, per-document cost profiles that vary by 10–40x depending on document complexity, STP rate targets that constrain how aggressively you can cut, and downstream systems that produce measurable errors when extraction quality degrades.

Cost optimization in this context is not about applying generic tips. It is about making five specific architectural decisions correctly from the start. Here is how we approach each one across the production document intelligence systems we build at Ashtayah Labs.

Why Generic LLM Cost Advice Doesn't Transfer Directly

The standard LLM cost optimization playbook assumes homogeneous traffic: similar prompt lengths, similar task complexity, similar output sizes. Document intelligence pipelines violate all three.

A one-page invoice and a 40-page loan agreement are both "documents." But their extraction cost profiles differ by orders of magnitude: different token counts, different numbers of fields to extract, different validation complexity, different exception handling requirements. A system that applies the same model, the same prompt template, and the same processing strategy to both is either overpaying on invoices or underserving loan agreements.

The second problem: document intelligence has a hard accuracy floor. Unlike a conversational AI system where a slightly wrong response is a minor UX issue, a document intelligence system feeding extracted fields into downstream ERP or compliance systems has a failure cost. If the system extracts the wrong invoice total and it propagates into accounts payable, the error cost is often 10–50x the cost of the extraction itself.

Cost optimization decisions must always be made relative to the STP rate target. Cutting model quality to save 40% on extraction spend only saves money if the resulting accuracy is sufficient to hold the STP rate above the operational threshold.

Decision 1: Classify Before You Extract

The highest-leverage cost decision in a document intelligence pipeline is also the simplest: don't send every document to the same extraction path.

A lightweight classification step — running in under 200 milliseconds, costing a fraction of a cent per document — routes each incoming document to the appropriate extraction tier before any expensive model call is made. The classification identifies document type, complexity level, and confidence of classification. Documents with clear type identification and low structural complexity route to a fast, cheap extraction path. Documents with ambiguous type, high structural variation, or critical field requirements route to a heavier path.

In a well-designed tiered system: Tier 1 handles standard structured documents — single-page, predictable layout, well-defined schema, such as invoices from known vendors, standard forms, templated contracts. Extraction cost: $0.003–0.008 per document. Tier 2 handles variable structured documents — multi-page, moderate layout variation, more complex schema, such as bank statements, multi-party agreements, regulatory filings. Extraction cost: $0.02–0.06 per document. Tier 3 handles complex or exception documents — non-standard formats, handwritten fields, poor scan quality, or high-value documents requiring extraction reliability above 99%. Extraction cost: $0.10–0.40 per document.

This tiering typically reduces average per-document cost by 55–70% relative to applying a single extraction strategy across all volume. The classifier pays for itself within the first day of production traffic.

The failure mode to avoid: building the tiering logic on document type alone without accounting for quality. A Tier 1 invoice from an unfamiliar vendor with a non-standard layout should route to Tier 2 based on classification confidence, not be forced through the Tier 1 path because its document type matched.

Decision 2: Instrument Per-Document Cost Before You Optimize Anything

You cannot optimize what you cannot measure. The first production instrumentation requirement for any document intelligence system is per-document cost attribution — not aggregate API spend, but a cost record attached to each document processed.

A per-document cost record contains: the document ID, document type, tier assignment, model(s) used, input token count, output token count, number of extraction API calls made, validation pass/fail status, exception routing flag, and total cost for that document's processing lifecycle.

This granularity reveals patterns that aggregate spend monitoring misses entirely: a single document type accounting for 3% of volume but 22% of spend because its schema requires multiple extraction passes; a validation failure pattern that causes re-extraction, doubling per-document cost for a specific vendor's invoice format; exception routing firing at 8x the expected rate on a particular document queue, inflating costs and backing up the human review queue simultaneously.

Without per-document instrumentation, cost optimization is guesswork. With it, the highest-ROI interventions become obvious. The systems we build emit this per-document cost record to a structured store from day one — it is cheaper to instrument it correctly at build time than to retrofit it later.

Decision 3: Cache Extraction Schemas, Not Just Model Outputs

Prompt caching in a document intelligence context is more nuanced than its implementation in conversational AI. The high-value caching targets are not the user query — they are the extraction schema definitions and validation rules, which are static and often constitute the majority of tokens in a complex extraction prompt.

For a production system processing 10,000 invoices per day, a schema prompt that defines 40 extraction fields, their data types, their validation rules, and their priority weighting may be 2,000–4,000 tokens. If that schema prompt is reprocessed from scratch on every document, the token cost per document has a fixed overhead entirely unrelated to document complexity.

Provider-native prompt caching reduces input token costs on the schema portion by up to 90% for cached content. For a high-volume extraction pipeline, this one structural change reduces overall per-document cost by 30–50% depending on schema length.

The implementation requirement: keep the static schema content — field definitions, validation rules, output format instructions — at the beginning of the prompt, and append the variable document content at the end. Cache invalidation occurs automatically when the schema prefix changes, which in a stable production system happens rarely.

A second caching layer worth implementing: extracted schema results for documents processed previously. If a pipeline occasionally reprocesses the same document — re-upload, amendment check, audit retrieval — returning a cached extraction result rather than rerunning the model is zero-marginal-cost. A hash of the document content keyed to its extraction result and confidence score handles this at the application layer.

Decision 4: Design Batching Around SLA, Not Throughput

Async batch APIs offer 50% cost reduction on inference versus synchronous real-time calls. In a high-volume document intelligence pipeline, batching is one of the most impactful cost levers available.

The engineering decision that teams frequently get wrong: designing batching around throughput targets — maximize batch size — rather than SLA targets — stay within latency requirements. A batch size of 1,000 documents maximizes cost efficiency. But if your SLA is 15-minute processing turnaround and document ingestion is bursty, large batches create backpressure that blows SLA windows during peak periods.

The correct design: define your maximum acceptable processing latency per tier, and set batch size and batch window dynamically relative to current queue depth and the processing capacity needed to meet that SLA. During low-traffic periods, larger batches reduce cost. During burst periods, smaller batches or synchronous fallback maintain SLA.

A production-grade batching system has three additional properties. It tracks batch success rates separately from individual document extraction success — a failed batch requiring re-submission at synchronous pricing eliminates the cost savings from batching. It handles partial batch failures cleanly, reprocessing only failed documents rather than the entire batch. And it never batches Tier 3 documents with time-sensitive escalation requirements — those always process synchronously regardless of throughput optimization.

Decision 5: Define the STP Rate / Cost Trade-Off Explicitly

Straight-through processing rate and per-document cost are in tension. More expensive extraction paths produce higher accuracy and lower exception queue volume. Cheaper paths produce lower accuracy and higher human review burden.

This trade-off exists on a curve that is specific to your document types, your field criticality weighting, and the operational cost of human review in your context. The mistake is making extraction cost reduction decisions without quantifying where you sit on this curve.

A concrete example from a production system: a logistics client processing 8,000 bills of lading per day, running at 94% STP rate — 6% required human review. Per-document extraction cost: $0.08. Total daily extraction spend: $640. Human review team handling 480 documents per day at 3 minutes per document: approximately $720 per day in review labor. Shifting to a cheaper extraction path reduced per-document cost to $0.025, saving $440 per day in extraction spend. STP rate dropped to 89% — 880 documents per day through human review, costing approximately $1,320. Net daily cost increased by $240.

The optimization that actually worked: a classifier that identified the 40% of BOLs coming from the top 5 carrier formats — consistent layout, high extraction reliability — and routed those to the cheaper path. The remaining 60% stayed on the original path. STP rate held at 93.5%, per-document extraction cost averaged $0.054, and net daily cost dropped by $310 relative to the original configuration.

This analysis requires knowing: extraction cost by tier, STP rate by tier, human review labor cost per document, and volume distribution across document types. Without those numbers, cost optimization decisions are directionally arbitrary.

Cost Governance as Ongoing Operations

Cost optimization in a document intelligence system is not a one-time project. Three structural changes cause drift over time.

Volume growth changes batch sizing math, caching economics, and may cross thresholds where self-hosting specific model components becomes cheaper than API-based inference. Document mix shift — if the proportion of Tier 3 documents increases because a new client onboards with non-standard formats — increases average per-document cost without any change to the pipeline configuration. Model pricing changes require quarterly review of model selection relative to current pricing and quality benchmarks.

The governance infrastructure for managing this over time: a per-document cost dashboard segmented by tier and document type, a STP rate dashboard segmented by the same dimensions, a monthly cost-per-STP-point metric (extraction spend divided by percentage points of STP above the floor), and a quarterly model review cycle.

If your document intelligence pipeline is already in production and you are looking at the bill, the highest-ROI interventions in order: deploy per-document cost instrumentation if you don't have it; implement schema prompt caching; audit your tier distribution; and model the STP rate / cost trade-off explicitly before making any changes to extraction model quality.

If you are building from scratch, build the cost instrumentation layer and the classification tier before writing a single extraction prompt. The architecture decisions made in the first two weeks determine whether cost management is built-in or bolted-on.

Start a system review at ashtayahlabs.com to assess where your document intelligence cost profile sits and what the highest-leverage changes are for your specific document mix.

Ashtayah Labs

AI Systems Team

Document Intelligence Cost Optimization: How to Build an Extraction Pipeline That Doesn't Cost You the Savings You're Chasing