How many document types is too many for a single pipeline?

There is no fixed limit, but complexity increases non-linearly above 10–12 types. Beyond that point, classifier training, schema management, and per-type monitoring create significant operational overhead. At 15+ types, teams typically benefit from modularizing the pipeline — organizing types into functional groups with separate classification models and shared infrastructure.

Should the classifier run before or after OCR/text extraction?

It depends on your classifier architecture. A vision-based classifier that operates on the page image does not need prior text extraction — and often benefits from seeing the layout directly. A text-based classifier requires text extraction first. In practice, most production systems run OCR at ingestion (because extraction will need the text anyway) and feed both the image and the extracted text to the classifier for best accuracy.

How do we handle documents that contain multiple types — e.g., a packet that includes both an invoice and a delivery receipt?

Document splitting is a distinct upstream step from classification. Before the classifier runs, detect whether a document is a multi-type packet and split it into individual documents by type. Building splitting logic into the classifier conflates two separate problems and degrades accuracy on both.
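As a rough illustration of splitting as its own step, the sketch below assumes a separate page-level boundary detector (hypothetical, not specified here) has already produced one coarse label per page. It only groups consecutive pages that share a label, so it would not separate two back-to-back documents of the same type.

```python
from itertools import groupby
from typing import List, Tuple


def split_packet(page_labels: List[str]) -> List[Tuple[str, List[int]]]:
    """Group consecutive pages sharing a label into (label, page_indices) segments."""
    segments = []
    indexed = list(enumerate(page_labels))
    for label, group in groupby(indexed, key=lambda pair: pair[1]):
        segments.append((label, [index for index, _ in group]))
    return segments


# Hypothetical three-page packet: a two-page invoice followed by a receipt.
packet = ["invoice", "invoice", "delivery_receipt"]
segments = split_packet(packet)
```

Each resulting segment would then be handed to the classifier as an individual document.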

What is a realistic classifier accuracy target?

For a bounded, well-defined type set (5–10 types) with a trained classification model and representative training data, expect 95–98% accuracy on your known type distribution. For an LLM-based classifier handling a broader or variable type set, expect 85–92% in practice, depending on how visually distinct your document types are. Accuracy below 90% on your primary types creates downstream extraction problems that are difficult to compensate for at the extraction layer.

How should we manage schema updates without disrupting production?

Use schema versioning and blue-green deployment for schema changes. Test the new schema version against a held-out evaluation set before rolling it to production. Run the old and new schema versions in parallel for 24–48 hours on incoming documents, comparing extraction outputs, before cutting over. Store the schema version in the extraction output record so historical documents can be reprocessed with the correct schema version if needed.
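The parallel-run comparison can be pictured with a small sketch. It assumes each document yields two flat field dictionaries, one per schema version; the field names and the idea of summarizing the run as a single disagreement rate are illustrative assumptions, not a prescribed method.

```python
from typing import Dict, List, Tuple

Output = Dict[str, str]


def diff_outputs(old: Output, new: Output) -> List[str]:
    """Return the field names whose values differ between two schema versions."""
    fields = sorted(set(old) | set(new))
    return [name for name in fields if old.get(name) != new.get(name)]


def disagreement_rate(pairs: List[Tuple[Output, Output]]) -> float:
    """Share of documents where old and new outputs differ on at least one field."""
    if not pairs:
        return 0.0
    differing = sum(1 for old, new in pairs if diff_outputs(old, new))
    return differing / len(pairs)


# Hypothetical shadow run over two incoming documents.
shadow_run = [
    ({"total": "100.00", "currency": "USD"}, {"total": "100.00", "currency": "USD"}),
    ({"total": "250.00", "currency": "EUR"},
     {"total": "250.00", "currency": "EUR", "po_number": "PO-1"}),
]
rate = disagreement_rate(shadow_run)
```

A differing field is not necessarily a regression (the second document differs only because the new version adds a field), so the per-field diff is usually worth reviewing before cutover.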

Multi-Document-Type Document Intelligence: Classification-First Pipeline Architecture

Why single-type architecture breaks on heterogeneous feeds

Most document intelligence content — guides, tutorials, vendor comparisons — assumes you are processing one document type. Invoices. Bank statements. KYC packets. A single, well-defined extraction schema.

Production systems do not work that way.

Real enterprise document feeds are heterogeneous. A logistics operations team ingests bills of lading, proof-of-delivery receipts, freight invoices, customs declarations, and carrier-specific variant forms — sometimes all in the same batch, sometimes unlabeled. A fintech compliance team processes KYC documents that arrive as passports, national IDs, utility bills, and bank statements in any combination.

Building a system that handles one document type well is a different engineering problem than building a system that handles ten document types reliably. The gap is not in the extraction model. It is in the pipeline architecture: how you classify before you extract, how you route exceptions by type, how you manage schema divergence across types, and how you instrument a system that branches.

The most common mistake when moving from single-type to multi-type document processing is treating classification as an afterthought — a preprocessing step bolted on before extraction. In production, classification is the pipeline's central load-bearing decision.

When classification fails or is skipped, extraction runs against the wrong schema. An invoice extraction model applied to a proof-of-delivery receipt produces confident-but-wrong output. The confidence scores look acceptable — the model found something — but the structured data is meaningless. This class of failure is difficult to detect because it does not produce errors. It produces plausible-looking wrong data.

The classification-first pipeline: architecture overview

A production-grade multi-document-type pipeline has four distinct layers.

Layer 1 — Ingestion and Normalization. Documents arrive from email attachments, upload APIs, SFTP drops, or webhook integrations in various formats: native PDF, scanned PDF, TIFF, JPEG. Before classification, normalize to a consistent representation — typically a rendered page image plus extracted text where available. Track document provenance (source, timestamp, sender, channel) at ingestion.

Layer 2 — Document Classification. Determine document type before any extraction runs. This is the pipeline's decision gate. The classifier outputs a document type label and a confidence score. High-confidence classifications proceed to the matching extraction path. Low-confidence classifications route to a classification review queue — a human decision, not an extraction attempt.
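A minimal sketch of the decision gate, assuming the classifier returns a label and a confidence in the range 0 to 1. The 0.90 threshold and the stage names are placeholder assumptions that would be tuned per deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    doc_type: str
    confidence: float


def route(result: Classification, threshold: float = 0.90) -> str:
    """Return the next pipeline stage for a classified document."""
    if result.confidence >= threshold:
        # Confident enough: hand off to the matching extraction path.
        return f"extract:{result.doc_type}"
    # Not confident: a human decides the type; no extraction is attempted.
    return "classification_review"


confident = route(Classification("invoice", 0.97))
uncertain = route(Classification("invoice", 0.55))
```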

Layer 3 — Type-Specific Extraction. Each document type has its own extraction configuration: a schema defining the fields to extract, the extraction model or prompt, and the confidence thresholds that trigger review for that type. Extraction is isolated per type. A schema change for invoices does not touch the bill-of-lading extraction path.
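One way to keep extraction isolated per type is a registry keyed by document type, sketched below. The field names, extractor names, versions, and thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExtractionConfig:
    doc_type: str
    schema_version: str
    fields: Tuple[str, ...]
    extractor: str            # name of the model or prompt to run
    review_threshold: float   # field confidence below this triggers review


# Each type owns its entry; editing one entry does not touch the others.
REGISTRY: Dict[str, ExtractionConfig] = {
    "invoice": ExtractionConfig(
        "invoice", "v3", ("invoice_number", "total", "due_date"),
        "invoice-extractor", 0.85),
    "bill_of_lading": ExtractionConfig(
        "bill_of_lading", "v1", ("bol_number", "shipper", "consignee"),
        "bol-extractor", 0.80),
}


def config_for(doc_type: str) -> ExtractionConfig:
    """Look up the extraction path for a type, failing loudly if none exists."""
    try:
        return REGISTRY[doc_type]
    except KeyError:
        raise LookupError(f"no extraction path configured for {doc_type!r}") from None
```

Failing loudly on an unknown type avoids silently running a document through the wrong schema.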

Layer 4 — Validation and Exception Routing. Extracted fields are validated against the type-specific schema and business rules. Exceptions are routed to a human review queue with the document type label, the extracted fields, and the specific validation failure clearly surfaced. Reviewers see what type the system thinks the document is and what it extracted — not just a blank form to fill in.
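The sketch below illustrates validation that reports each failure with its reason, so the review item can surface exactly what went wrong. The invoice rules and field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ValidationFailure:
    field: str
    reason: str


def validate_invoice(fields: Dict[str, Optional[str]]) -> List[ValidationFailure]:
    """Check hypothetical invoice rules and report every failure with its reason."""
    failures = []
    for name in ("invoice_number", "total", "due_date"):
        if not (fields.get(name) or "").strip():
            failures.append(ValidationFailure(name, "missing required field"))
    total = (fields.get("total") or "").strip()
    if total:
        try:
            if float(total) < 0:
                failures.append(ValidationFailure("total", "must not be negative"))
        except ValueError:
            failures.append(ValidationFailure("total", "not a number"))
    return failures


def review_item(doc_type: str, fields: Dict[str, Optional[str]],
                failures: List[ValidationFailure]) -> dict:
    """Bundle what the reviewer needs: the type, the fields, and the failures."""
    return {
        "doc_type": doc_type,
        "extracted": fields,
        "failures": [(f.field, f.reason) for f in failures],
    }


extracted = {"invoice_number": "INV-7", "total": "abc"}
failures = validate_invoice(extracted)
item = review_item("invoice", extracted, failures)
```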

Building the document classifier

The classifier is the highest-leverage component in a multi-type pipeline. Investing in a robust classifier reduces exception rates across every downstream extraction path.

For systems with a well-defined, bounded set of document types (say, 5–15 types that are consistent across your document feed), a fine-tuned classification model trained on labeled examples generally outperforms prompt-based LLM classification in production. It is faster, cheaper, and more consistent on a known type distribution.

For systems that need to handle new document types dynamically, or where the type set is large and variable, a vision-capable LLM with a well-structured classification prompt is more practical. The trade-off is a higher per-document cost and confidence estimates that can vary between runs on the same document.

The most production-resilient approach for mid-market enterprise systems is a tiered classifier. A fast, cheap model handles the high-confidence majority (typically 80–90% of documents in a stable feed). Documents below the confidence threshold fall through to an LLM classification pass. Documents that remain ambiguous after the second pass route to human classification. This approach keeps throughput high while reserving expensive inference for the cases that need it.
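The tiered flow can be sketched as follows. Both classifiers are passed in as plain callables returning a label and a confidence; the stub classifiers and both thresholds are illustrative assumptions rather than recommended values.

```python
from typing import Callable, Tuple

Classifier = Callable[[str], Tuple[str, float]]


def classify_tiered(document: str, fast: Classifier, fallback: Classifier,
                    fast_threshold: float = 0.90,
                    fallback_threshold: float = 0.80) -> Tuple[str, str]:
    """Return (doc_type, resolved_by); doc_type is 'unknown' when a human must decide."""
    label, confidence = fast(document)
    if confidence >= fast_threshold:
        return label, "fast_model"
    # Only documents the fast model is unsure about pay for the second pass.
    label, confidence = fallback(document)
    if confidence >= fallback_threshold:
        return label, "llm_fallback"
    return "unknown", "human_review"


# Stub classifiers standing in for real models (hypothetical behavior).
def fast_stub(document: str) -> Tuple[str, float]:
    return ("invoice", 0.97) if "INVOICE" in document else ("invoice", 0.40)


def llm_stub(document: str) -> Tuple[str, float]:
    if "customs" in document.lower():
        return ("customs_declaration", 0.88)
    return ("unknown", 0.30)


first = classify_tiered("INVOICE #7", fast_stub, llm_stub)
second = classify_tiered("Customs form", fast_stub, llm_stub)
third = classify_tiered("blurry scan", fast_stub, llm_stub)
```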

What the classifier must not do: produce a classification with a fabricated high confidence score. Many LLM-based classifiers output confidence scores that are not calibrated: they reflect the model's linguistic certainty, not a meaningful probability. If you are using LLM-based classification, do not route on the raw score. Add a calibration layer that maps raw scores to the accuracy actually observed in production outcomes.
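One simple form such a calibration layer could take is histogram binning: bucket past raw scores, then replace each raw score with the accuracy observed in its bucket. This is a swapped-in illustrative technique, and the history data, bin count, and default are assumptions for the example.

```python
from typing import Dict, List, Tuple


def bin_index(raw: float, bins: int) -> int:
    """Map a raw score in [0, 1] to a bin, keeping 1.0 in the top bin."""
    return min(int(raw * bins), bins - 1)


def build_calibration_table(history: List[Tuple[float, bool]],
                            bins: int = 5) -> Dict[int, float]:
    """history holds (raw_confidence, was_correct) pairs from reviewed documents."""
    counts: Dict[int, Tuple[int, int]] = {}
    for raw, correct in history:
        index = bin_index(raw, bins)
        seen, right = counts.get(index, (0, 0))
        counts[index] = (seen + 1, right + int(correct))
    return {index: right / seen for index, (seen, right) in counts.items()}


def calibrated(raw: float, table: Dict[int, float], bins: int = 5,
               default: float = 0.0) -> float:
    """Return observed accuracy for the score's bin; unseen bins get the default."""
    return table.get(bin_index(raw, bins), default)


# Hypothetical production outcomes: raw score and whether the label was right.
history = [(0.95, True), (0.92, True), (0.97, False), (0.99, True),
           (0.65, False), (0.62, True)]
table = build_calibration_table(history)
score = calibrated(0.96, table)
```

Defaulting unseen bins to 0.0 is a conservative choice: a score range with no production evidence is treated as untrusted and would route to review.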

Type-specific extraction schema management

In a single-type system, you manage one schema. In a multi-type system, you manage a schema per document type, potentially with sub-type variants. Schema management becomes a production engineering concern.

Schema versioning. Document formats change over time. A supplier updates their invoice template. A government agency changes the layout of a permit document. When format changes cause extraction accuracy to drop, you need to know which schema version was in effect when a document was processed. This matters for audit trails, debugging, and SLA reporting. Version your schemas. Store the schema version alongside the extraction output. When you update a schema, create a new version rather than mutating the existing one.
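A minimal in-memory sketch of "create a new version rather than mutating the existing one" is shown below. The class and method names are hypothetical; a production registry would presumably be backed by durable storage.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class SchemaVersion:
    doc_type: str
    version: int
    fields: Tuple[str, ...]


class SchemaRegistry:
    """Append-only registry: publishing never changes an earlier version."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], SchemaVersion] = {}
        self._latest: Dict[str, int] = {}

    def publish(self, doc_type: str, fields: Iterable[str]) -> SchemaVersion:
        version = self._latest.get(doc_type, 0) + 1
        schema = SchemaVersion(doc_type, version, tuple(fields))
        self._versions[(doc_type, version)] = schema
        self._latest[doc_type] = version
        return schema

    def get(self, doc_type: str, version: int) -> SchemaVersion:
        return self._versions[(doc_type, version)]

    def latest(self, doc_type: str) -> SchemaVersion:
        return self.get(doc_type, self._latest[doc_type])


registry = SchemaRegistry()
registry.publish("invoice", ["invoice_number", "total"])
registry.publish("invoice", ["invoice_number", "total", "due_date"])
```

Because old versions stay retrievable, a document processed under version 1 can still be interpreted or reprocessed against the schema that was in effect at the time.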

Sub-type divergence. KYC documents are a useful example. A passport, a national ID, and a utility bill are all KYC-supporting documents, but they have completely different field structures and extraction challenges. If you model them as a single "KYC document" type with one schema, you are forcing one extraction path to handle three distinct structures — and your confidence scoring becomes meaningless. Model sub-types explicitly when the fields extracted differ significantly between sub-types, the downstream processing differs by sub-type, or the accuracy profile differs by sub-type.

Schema drift detection. As document formats in your feed evolve, extraction accuracy for specific types may degrade without an obvious trigger. Build a weekly field-level accuracy report that tracks accuracy per field, per document type, per schema version. A drop in a specific field on a specific document type is usually a format change, not a model failure — but you need the per-type visibility to know that.
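The field-level report can be sketched as a simple aggregation. It assumes ground truth comes from reviewer corrections or sampled audits, recorded as one row per checked field; that data source and the row shape are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str, bool]   # (doc_type, schema_version, field, was_correct)
Key = Tuple[str, str, str]


def field_accuracy(rows: Iterable[Row]) -> Dict[Key, float]:
    """Accuracy per (doc_type, schema_version, field) over the reporting window."""
    totals = defaultdict(lambda: [0, 0])   # key -> [seen, correct]
    for doc_type, version, field, correct in rows:
        bucket = totals[(doc_type, version, field)]
        bucket[0] += 1
        bucket[1] += int(correct)
    return {key: right / seen for key, (seen, right) in totals.items()}


# Hypothetical week of audited fields.
rows = [
    ("invoice", "v3", "total", True),
    ("invoice", "v3", "total", True),
    ("invoice", "v3", "due_date", True),
    ("invoice", "v3", "due_date", False),
    ("bill_of_lading", "v1", "consignee", True),
]
report = field_accuracy(rows)
```

Comparing this report week over week makes a drop on one field of one type visible even when the overall accuracy barely moves.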

Exception routing in a multi-type system

Exception routing in a multi-type system requires more structure than in a single-type system. The review interface must communicate document type clearly, and exception queues should be organized by type — not by arrival order.

Classification exceptions are structurally different from extraction exceptions. A classification exception means the system does not know what type the document is. An extraction exception means the system knows the type but could not extract a specific field with sufficient confidence. These two exception types require different review interfaces and different resolution paths.

Classification exceptions should surface all the evidence the classifier had: the document image, any text extracted, the top-N candidate types with their confidence scores. The reviewer is making a routing decision, not a data entry decision.

Extraction exceptions should surface the document image, the classified type, the extracted fields that passed validation, and specifically which fields failed and why. The reviewer is filling in or correcting specific fields against a known schema.

Mixing these two exception types in a single review queue is a common design mistake. Reviewers spend mental effort on the wrong type of decision, throughput drops, and error rates in review increase.
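To make the separation concrete, the sketch below models the two exception kinds as distinct record types and derives the queue from the record type. The record fields and queue naming scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class ClassificationException:
    document_id: str
    candidates: List[Tuple[str, float]]   # top-N (type, confidence) pairs


@dataclass(frozen=True)
class ExtractionException:
    document_id: str
    doc_type: str
    passed_fields: Dict[str, str]
    failed_fields: Dict[str, str]          # field name -> failure reason


ReviewItem = Union[ClassificationException, ExtractionException]


def queue_name(item: ReviewItem) -> str:
    """Routing decisions and field corrections never share a queue."""
    if isinstance(item, ClassificationException):
        return "classification_review"
    # Extraction queues are organized by type, not by arrival order.
    return f"extraction_review:{item.doc_type}"


unknown_doc = ClassificationException(
    "doc-1", [("invoice", 0.48), ("delivery_receipt", 0.41)])
bad_field = ExtractionException(
    "doc-2", "invoice", {"invoice_number": "INV-7"}, {"total": "not a number"})
queues = [queue_name(unknown_doc), queue_name(bad_field)]
```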

Monitor exception rates per type. When a type's exception rate exceeds its defined threshold, it needs attention — either the classifier is mis-routing documents to that extraction path, the extraction model is degrading on that type's distribution, or the document format has changed.
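A per-type threshold check might look like the sketch below. The counts and the threshold values are illustrative assumptions; in practice each type's threshold would come from its own historical baseline.

```python
from typing import Dict, List, Tuple


def types_needing_attention(counts: Dict[str, Tuple[int, int]],
                            thresholds: Dict[str, float]) -> List[str]:
    """counts maps doc_type -> (exceptions, processed); returns types over threshold."""
    flagged = []
    for doc_type, (exceptions, processed) in sorted(counts.items()):
        if processed == 0:
            continue  # no traffic in this window, nothing to judge
        if exceptions / processed > thresholds[doc_type]:
            flagged.append(doc_type)
    return flagged


# Hypothetical window: 4% on carrier invoices, 22% on customs declarations.
counts = {"carrier_invoice": (40, 1000), "customs_declaration": (66, 300)}
thresholds = {"carrier_invoice": 0.05, "customs_declaration": 0.20}
flagged = types_needing_attention(counts, thresholds)
```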

Observability in a branching pipeline

A branching pipeline has more failure surfaces than a linear one. Observability requirements increase proportionally.

Per-type metrics, not aggregate metrics. Track latency, throughput, exception rate, and straight-through processing (STP) rate separately for each document type. An aggregate STP rate of roughly 85% across a ten-type system might mean nine types at 95% and one type at 5%, a critical production problem masked by aggregation.
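The masking effect is easy to reproduce with a few lines of arithmetic. The sketch assumes equal document volume per type, in which case nine types at 95% and one at 5% average to 86%, in line with the aggregate figure above; the type names are placeholders.

```python
# Nine healthy types and one failing type, assuming equal volume per type.
per_type_stp = {f"type_{i}": 0.95 for i in range(1, 10)}
per_type_stp["type_10"] = 0.05

# The unweighted mean looks acceptable on a dashboard.
aggregate = sum(per_type_stp.values()) / len(per_type_stp)

# Only the per-type view exposes the failing path.
worst_type = min(per_type_stp, key=per_type_stp.get)
worst_rate = per_type_stp[worst_type]
```

With unequal volumes the aggregate would be a volume-weighted mean instead, which can hide a low-volume failing type even more thoroughly.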

Classification confidence distribution. Track the distribution of classifier confidence scores over time. A stable, well-functioning classifier maintains a consistent confidence distribution. A shift in the distribution — more documents clustering at medium confidence, fewer at high confidence — often signals an upstream change in document feed composition before it surfaces as an accuracy drop.
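One simple way to quantify such a shift is to bucket the scores from a baseline window and a current window and compare the two histograms. The sketch uses total variation distance as the comparison, which is an assumed choice for illustration; the sample scores and bin count are also assumptions.

```python
from typing import List


def histogram(scores: List[float], bins: int = 4) -> List[float]:
    """Fraction of scores falling in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for score in scores:
        counts[min(int(score * bins), bins - 1)] += 1
    return [count / len(scores) for count in counts]


def total_variation(p: List[float], q: List[float]) -> float:
    """Half the summed absolute difference: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical windows: the current feed clusters at medium confidence.
baseline = [0.95, 0.97, 0.92, 0.99, 0.96, 0.91, 0.98, 0.60]
current = [0.95, 0.62, 0.58, 0.97, 0.66, 0.70, 0.93, 0.55]
shift = total_variation(histogram(baseline), histogram(current))
```

An alert threshold on this distance would need to be set empirically from how much the distribution moves in normal weeks.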

Lineage tracking. For each document processed, record: the ingestion source, the classification decision and confidence, the schema version used, the extraction output and confidence scores, and the final disposition (auto-processed or reviewed, and by whom). This record is the audit trail and the debugging foundation. Without per-document lineage, root-cause analysis on a multi-type system is guesswork.
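The per-document record described above could be modeled roughly as follows. The field names, the example values, and the choice of a JSON line as the storage format are assumptions for the sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LineageRecord:
    document_id: str
    source: str                        # ingestion channel, e.g. "sftp"
    doc_type: str
    classification_confidence: float
    schema_version: str
    extracted: Dict[str, str]
    field_confidence: Dict[str, float]
    disposition: str                   # "auto_processed" or "reviewed"
    reviewer: Optional[str] = None     # set only when a human was involved


record = LineageRecord(
    document_id="doc-42",
    source="sftp",
    doc_type="invoice",
    classification_confidence=0.97,
    schema_version="v3",
    extracted={"invoice_number": "INV-7", "total": "100.00"},
    field_confidence={"invoice_number": 0.99, "total": 0.93},
    disposition="auto_processed",
)

# One JSON line per document is easy to append, ship, and query later.
line = json.dumps(asdict(record), sort_keys=True)
```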

What production looks like at scale

A logistics client running Ashtayah Labs' multi-type document intelligence system processes eight document types across their inbound freight operations. The classification layer handles 94% of documents with high confidence. Of the remaining 6%, approximately half are resolved by the LLM fallback classifier, and the other half go to human classification review. Once classified, the per-type exception rate ranges from 4% (for standardized carrier invoices from their top-5 suppliers) to 22% (for customs declarations from non-standard country-of-origin markets).

The operational insight from that spread: the system's value isn't in eliminating human review. It's in concentrating human review exactly where it adds value — on the structurally ambiguous, format-variable documents that humans need to see — while fully automating the high-confidence, high-volume majority. That precision is what you build toward with a classification-first architecture.


Ashtayah Labs

AI Systems Team
