How accurate does extraction need to be for straight-through processing to be viable?

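For mandatory fields, you typically need >97% field-level accuracy before STP rates are operationally meaningful. Errors compound across mandatory fields: a document goes straight through only when every one of them is right, so a small per-field error rate turns into a much larger share of documents needing review. Measure accuracy per field, per document type, not as a single global metric.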
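To make the compounding effect concrete, here is a small worked calculation. It assumes field errors are independent, which real pipelines only approximate, and the function name is illustrative rather than part of any library.

```python
def all_fields_correct_rate(field_accuracy: float, mandatory_fields: int) -> float:
    """Share of documents where every mandatory field is right.

    Assumes errors on different fields are independent (a simplification).
    """
    return field_accuracy ** mandatory_fields


if __name__ == "__main__":
    # Compare a few per-field accuracies for a document with 10 mandatory fields.
    for accuracy in (0.95, 0.97, 0.99):
        rate = all_fields_correct_rate(accuracy, 10)
        print(f"{accuracy:.0%} per field -> {rate:.1%} of documents fully correct")
```

Under that assumption, 97% per field over ten mandatory fields leaves roughly 74% of documents fully correct, which is why the per-field number has to be high before STP becomes viable.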

How do we handle documents in multiple languages?

Multilingual extraction requires language-aware OCR and models trained on the relevant script. For India-focused BFSI, at minimum: English, Hindi (Devanagari), and the primary regional language(s) for your target customer base. Build language detection into the classification layer.
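As a rough sketch of what a first-pass check in the classification layer might look like, the function below counts characters by Unicode block. It detects script, not language (Hindi and Marathi both use Devanagari), so a real system would still need a language model on top. The function name and the two-script scope are assumptions for illustration.

```python
def dominant_script(text: str) -> str:
    """Return the script with the most characters in the text.

    Covers only Devanagari (U+0900 to U+097F) and basic Latin letters.
    """
    counts = {"devanagari": 0, "latin": 0}
    for ch in text:
        if "\u0900" <= ch <= "\u097f":
            counts["devanagari"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    if not any(counts.values()):
        return "unknown"  # digits, punctuation, or an unsupported script
    return max(counts, key=counts.get)
```

The result could then select the OCR language pack or extraction model before any field-level work starts.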

What's a realistic STP rate for a production KYC system?

60–80% STP on first submission is a reasonable target for a well-designed system with a known document set. Higher rates are achievable with stricter quality thresholds at ingestion, but the gain comes at the cost of more resubmission requests. The right balance depends on your customer experience goals.

How do we keep extraction models current as document formats change?

Monitor extraction confidence and field success rate per document type in production. Set alert thresholds for degradation. Maintain a labelling pipeline so new document samples can be used to retrain or fine-tune models on a regular cadence.

KYC Document Intelligence Architecture for Production | Ashtayah Labs

What KYC document intelligence actually involves

KYC is a broad term. For the purposes of this article, we're talking about the document-processing layer: the system that takes unstructured documents submitted during customer onboarding and converts them into structured, validated, decision-ready data.

The typical document set in a BFSI onboarding flow includes identity documents (passport, Aadhaar, PAN card, driving licence, national ID), address proofs (utility bills, bank statements, tenancy agreements), income documents (salary slips, ITR, Form 16, bank statements), entity documents for KYB (certificate of incorporation, MOA/AOA, board resolutions), and relationship documents (power of attorney, beneficiary declarations).

Each document type has different extraction targets, different validation rules, and different failure modes. A system that treats them uniformly will break under real-world load.

Layer 1: Ingestion and pre-processing

Before extraction, documents need to be classified and prepared. This sounds trivial. It isn't.

Classification determines what type of document you're looking at. A system receiving a scanned PDF needs to distinguish a PAN card from a driving licence from a utility bill — before it knows which extraction model or rule set to apply. Classification errors cascade: a PAN card processed through a utility bill extractor will produce garbage.

Pre-processing handles quality degradation: deskewing rotated images, denoising low-resolution scans, adjusting contrast for faded or overexposed documents, and detecting and rejecting unreadable inputs before they reach the extraction layer.

One design decision that matters here: where do you draw the reject threshold? Too tight and you reject legitimate documents from mobile-camera onboarding. Too loose and noisy inputs degrade extraction accuracy downstream. Define this explicitly, with quality scores and human review routing at the boundary.
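One way to make the boundary explicit is a two-threshold gate with a review band in between. This is a minimal sketch: the 0 to 1 quality score, the threshold values, and all names are assumptions, and the right numbers would come from your own data.

```python
from enum import Enum


class IngestDecision(Enum):
    ACCEPT = "accept"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


def gate_on_quality(
    quality_score: float,
    reject_below: float = 0.35,  # assumed value, tune on real samples
    accept_from: float = 0.55,   # assumed value, tune on real samples
) -> IngestDecision:
    """Route a document by image quality, with a review band at the boundary."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0 and 1")
    if quality_score < reject_below:
        return IngestDecision.REJECT
    if quality_score < accept_from:
        return IngestDecision.HUMAN_REVIEW
    return IngestDecision.ACCEPT
```

Keeping both thresholds as named parameters means the trade-off between rejection and downstream noise is visible in configuration rather than buried in code.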

Layer 2: Extraction

Extraction converts document content into structured fields. The architecture choice here has downstream consequences.

Template-based extraction works well for structured, high-volume, consistent document types — Aadhaar cards, PAN cards, standard ITR formats. It's fast, deterministic, and auditable. It breaks when the document deviates from the template.

ML-based extraction handles variation better. Models trained on diverse document samples generalise across scanners, mobile cameras, and regional print variations. The trade-off: they require training data, and low-confidence predictions need a handling strategy.

Hybrid pipelines — template extraction with ML fallback, or ML extraction with rule-based validation — tend to perform best in production. The specific combination depends on your document mix and volume.
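The template-with-ML-fallback variant can be sketched as follows. Both extractors here are toy stand-ins: the template regex is a hypothetical layout rule, and the "ML" function is a placeholder where a trained model would be called.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]

# Hypothetical template: expects this exact label before the number.
PAN_TEMPLATE = re.compile(r"Permanent Account Number\s+([A-Z]{5}[0-9]{4}[A-Z])")


def template_extractor(ocr_text: str) -> Optional[dict]:
    """Deterministic path: returns None when the layout does not match."""
    match = PAN_TEMPLATE.search(ocr_text)
    return {"pan_number": match.group(1)} if match else None


def ml_extractor_stub(ocr_text: str) -> Optional[dict]:
    """Placeholder for a model call; here it is only a looser pattern search."""
    match = re.search(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", ocr_text)
    return {"pan_number": match.group(0)} if match else None


def hybrid_extract(ocr_text: str, primary: Extractor, fallback: Extractor):
    """Try the template first; fall back only when it yields nothing."""
    fields = primary(ocr_text)
    if fields is not None:
        return "template", fields
    return "ml_fallback", fallback(ocr_text)
```

Returning the path alongside the fields lets downstream validation and the audit trail record which extractor produced each value.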

What to extract varies by document type. For a PAN card: name, father's name, date of birth, PAN number, issue date. For a bank statement: account holder name, account number, IFSC, transaction history (structured), branch address. Define extraction targets explicitly per document type, including which fields are mandatory vs. optional for downstream validation.
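A per-document-type target definition could look something like the sketch below. The field lists follow the examples above, but which fields are marked mandatory is an assumption made for illustration, as are all the names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    mandatory: bool


# Mandatory/optional flags are illustrative assumptions, not a recommendation.
EXTRACTION_TARGETS = {
    "pan_card": [
        FieldSpec("name", True),
        FieldSpec("fathers_name", False),
        FieldSpec("date_of_birth", True),
        FieldSpec("pan_number", True),
        FieldSpec("issue_date", False),
    ],
    "bank_statement": [
        FieldSpec("account_holder_name", True),
        FieldSpec("account_number", True),
        FieldSpec("ifsc", True),
        FieldSpec("transactions", False),
        FieldSpec("branch_address", False),
    ],
}


def missing_mandatory(doc_type: str, extracted: dict) -> list:
    """List mandatory fields that are absent or empty in the extraction output."""
    return [
        spec.name
        for spec in EXTRACTION_TARGETS[doc_type]
        if spec.mandatory and not extracted.get(spec.name)
    ]
```

Keeping the targets as data rather than code makes it easier to review them with compliance and to add document types without touching the pipeline.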

Layer 3: Validation

Extraction gives you data. Validation gives you confidence that the data is correct and consistent.

Field-level validation checks format and value constraints: PAN number format, Aadhaar Verhoeff check digit, valid calendar dates, amounts within realistic bounds.
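Two of these checks are easy to show in code. The sketch below validates the PAN shape (five letters, four digits, one letter) and runs the Verhoeff checksum used for the Aadhaar check digit. Function names are illustrative, and a production validator would add further rules beyond format and checksum.

```python
import re

PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")

# Verhoeff multiplication table (the dihedral group D5).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Permutation table: row i is the base permutation applied i times.
BASE = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]
P = [list(range(10))]
for _ in range(7):
    P.append([BASE[x] for x in P[-1]])


def pan_format_ok(value: str) -> bool:
    """Check the PAN shape only; it does not prove the PAN was issued."""
    return PAN_PATTERN.fullmatch(value) is not None


def verhoeff_ok(number: str) -> bool:
    """True when the digit string, check digit included, passes Verhoeff."""
    if not (number.isascii() and number.isdigit()):
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def aadhaar_ok(value: str) -> bool:
    """Twelve digits (spaces allowed) whose checksum validates."""
    digits = value.replace(" ", "")
    return len(digits) == 12 and verhoeff_ok(digits)
```

A checksum failure is a strong signal of an OCR misread, so it is a cheap way to catch single-digit errors before they reach cross-document validation.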

Cross-document validation checks consistency across the document set: name on identity document matches name on address proof, account number on bank statement matches application form, date of incorporation on certificate matches other entity documents.

Name matching deserves special attention. Exact string matching fails on legitimate documents due to transliteration differences, abbreviations, and OCR errors. Fuzzy matching with configurable thresholds is necessary, but every fuzzy match decision is a compliance event — it needs to be logged with the match score and the decision reason.
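A minimal sketch of a logged fuzzy match, using the standard library's difflib.SequenceMatcher as the similarity measure. That choice, the 0.85 threshold, and the record layout are all assumptions; production systems often use measures tuned for names and transliteration.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    letters_only = re.sub(r"[^a-z\s]", " ", name.lower())
    return " ".join(letters_only.split())


def match_names(left: str, right: str, threshold: float = 0.85) -> dict:
    """Compare two names and return a record suitable for the audit log."""
    score = SequenceMatcher(None, normalise(left), normalise(right)).ratio()
    matched = score >= threshold
    return {
        "left": left,
        "right": right,
        "score": round(score, 3),
        "threshold": threshold,
        "decision": "match" if matched else "review",
        "reason": "score at or above threshold" if matched else "score below threshold",
    }
```

The function returns the full record rather than a boolean so the score, threshold, and reason can be written to the audit trail with every decision.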

Layer 4: Exception routing and human review

No extraction pipeline achieves 100% confidence on all documents all the time. Design your exception routing explicitly.

When a document is unreadable or quality falls below threshold: reject and request resubmission.

When extraction confidence is low on a mandatory field: route to the human review queue.

When cross-document validation fails: route to compliance review and flag for manual check.

When tamper indicators are detected: flag for compliance and escalate.

When all fields are extracted with high confidence and validation passes: pass straight through to the downstream system.
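The routing rules above can be sketched as a single decision function. The order in which the checks are applied, the 0.9 confidence floor, and every name below are assumptions; the source does not specify precedence when several conditions hold at once.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    RESUBMIT = "reject_and_request_resubmission"
    HUMAN_REVIEW = "human_review_queue"
    COMPLIANCE_REVIEW = "compliance_review"
    COMPLIANCE_ESCALATION = "compliance_escalation"
    STRAIGHT_THROUGH = "straight_through"


@dataclass
class DocumentResult:
    readable: bool
    tamper_suspected: bool
    cross_validation_passed: bool
    # Mandatory field name -> confidence, or None when nothing was extracted.
    mandatory_confidences: dict = field(default_factory=dict)


def route(result: DocumentResult, min_confidence: float = 0.9) -> Route:
    """Apply the routing rules in an assumed order of precedence."""
    if not result.readable:
        return Route.RESUBMIT
    if result.tamper_suspected:
        return Route.COMPLIANCE_ESCALATION
    if any(c is None or c < min_confidence for c in result.mandatory_confidences.values()):
        return Route.HUMAN_REVIEW
    if not result.cross_validation_passed:
        return Route.COMPLIANCE_REVIEW
    return Route.STRAIGHT_THROUGH
```

Straight-through is the last branch on purpose: a document reaches it only after every exception condition has been ruled out.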

The human review interface matters more than most teams anticipate. Reviewers need the original document image, extracted values with confidence scores, the specific validation failure that triggered the review, and a clear action set. Build this first. Don't bolt it on later.

Failure modes to design for upfront

Confidence score gaming: some implementations use a single global confidence threshold, creating a perverse incentive where slightly above-threshold extractions pass without review even when the extracted value is wrong. Use field-level confidence thresholds, and route on the lowest-confidence mandatory field, not the average.
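A small example shows how an average can hide a bad field. The thresholds and confidence values are made up for illustration.

```python
# Assumed per-field thresholds; stricter for identifiers than for free text.
FIELD_THRESHOLDS = {"pan_number": 0.98, "name": 0.90, "date_of_birth": 0.95}


def fields_needing_review(confidences: dict, thresholds: dict) -> list:
    """Return mandatory fields whose confidence is below their own threshold.

    A field missing from the confidences is treated as confidence 0.0.
    """
    return [f for f, t in thresholds.items() if confidences.get(f, 0.0) < t]


confidences = {"pan_number": 0.62, "name": 0.99, "date_of_birth": 0.99}
average = sum(confidences.values()) / len(confidences)

# The average (about 0.87) would clear a global 0.85 threshold,
# yet the PAN number itself is well below any sensible bar.
print(f"average={average:.2f}", fields_needing_review(confidences, FIELD_THRESHOLDS))
```

Routing on the per-field result sends this document to review, while a global average would have passed it.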

Silent extraction errors: an extraction model that always returns something — even when it's wrong — is dangerous in KYC. Design for explicit "no extraction" states where the model couldn't extract a value with sufficient confidence. Downstream validation should treat null differently from a low-confidence value.

Audit trail gaps: regulators want to know who made this decision, when, on what basis. Every automated decision needs to be logged with a timestamp, the model version, and the input document hash. Every human review action needs to be logged with the reviewer ID and decision reason. Build the audit trail schema before you build the extraction pipeline.
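As a starting point for the schema, the two record types might be sketched like this. The field names and dictionary layout are assumptions; the contents follow the requirements listed above (timestamp, model version, input document hash, reviewer ID, decision reason).

```python
import hashlib
import json
from datetime import datetime, timezone


def _base_record(event: str, document_bytes: bytes, decision: str, reason: str) -> dict:
    """Fields shared by automated and human audit events."""
    return {
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "decision": decision,
        "reason": reason,
    }


def automated_decision_record(document_bytes, model_version, decision, reason) -> dict:
    record = _base_record("automated_decision", document_bytes, decision, reason)
    record["model_version"] = model_version
    return record


def human_review_record(document_bytes, reviewer_id, decision, reason) -> dict:
    record = _base_record("human_review", document_bytes, decision, reason)
    record["reviewer_id"] = reviewer_id
    return record


if __name__ == "__main__":
    rec = automated_decision_record(b"%PDF-sample", "extractor-1.4.2", "straight_through", "all checks passed")
    print(json.dumps(rec, indent=2))
```

Hashing the raw input bytes ties each decision to the exact file that was processed, even if the stored copy is later re-encoded or renamed.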

Schema drift: document formats change. Governments update Aadhaar card layouts. Banks change their statement formats. Instrument your extraction pipeline to track field-level confidence and extraction success rate over time, per document type. A drop in extraction quality for a specific document type is an early warning of schema drift.
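The instrumentation can start very simply, for example as a rolling success rate per document type and field. The class below is a minimal in-memory sketch; the window size, minimum sample count, and alert level are assumed values, and a production system would persist these metrics in its monitoring stack.

```python
from collections import defaultdict, deque


class DriftMonitor:
    """Track rolling extraction success per (document type, field)."""

    def __init__(self, window: int = 200, min_samples: int = 50, alert_below: float = 0.9):
        self.min_samples = min_samples
        self.alert_below = alert_below
        # Each key keeps only the most recent `window` outcomes.
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, doc_type: str, field_name: str, extracted_ok: bool) -> None:
        self._outcomes[(doc_type, field_name)].append(extracted_ok)

    def alerts(self) -> list:
        """Return (doc_type, field, rate) for keys below the alert level."""
        flagged = []
        for (doc_type, field_name), outcomes in self._outcomes.items():
            if len(outcomes) < self.min_samples:
                continue  # too few samples to judge
            rate = sum(outcomes) / len(outcomes)
            if rate < self.alert_below:
                flagged.append((doc_type, field_name, rate))
        return flagged
```

Because the rate is tracked per field, a layout change that moves only one field shows up as a targeted alert rather than a small dip in an overall average.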

What to build vs. what to buy

Platform-first approaches (ABBYY, Hyperscience, Kofax) give you a configurable extraction layer and a review UI. They're the right choice when your document types are standard, your volume is high, and you need to go live quickly.

Custom builds are warranted when your validation logic is complex (cross-document, domain-specific rules), your regulatory requirements demand custom audit formats, you need tight integration with proprietary downstream systems, or you need the extraction logic itself to be auditable and explainable.

In practice, the right answer for most BFSI implementations is a hybrid: a commercial extraction engine for the base OCR and classification layer, with a custom validation and routing system built on top. That way you buy the commodity parts and own the high-value, regulation-specific logic.

How Ashtayah Labs approaches this

We've built KYC document intelligence systems for clients in BFSI, GovTech, and healthcare — across high-volume onboarding flows and audit-sensitive compliance contexts.

Our approach: system design first. We review your document set, your validation requirements, your downstream systems, and your regulatory context before recommending an architecture. Then we build to production standard: observable, tested, with audit trails that hold up to scrutiny.

Start a system review at ashtayahlabs.com

Ashtayah Labs

AI Systems Team

How to Architect a KYC Document Intelligence System That Actually Works in Production