How a production IDP pipeline turns 500-page medical-legal bundles into structured data with Textract OCR and a 3-level LLM classification hierarchy.

TL;DR: Intelligent Document Processing (IDP) is the discipline of turning unstructured document bundles into structured, queryable data. This two-part series dissects a production IDP pipeline that ingests 500-page medical and legal bundles for personal-injury cases. Part 1 covers the perception half: upload and storage, AWS Textract OCR, and a three-level LLM classification hierarchy that tags every page using overlapping batches and priority-based merging. Part 2 covers the action half — routing, data extraction, and timeline generation.

A paralegal opens a new personal-injury case and uploads what the hospital sent over: a single 480-page PDF. Inside that one file are emergency-room notes, three months of physical-therapy progress notes, an itemized bill with adjustment columns, an explanation-of-benefits statement from an insurer, a lien letter, two fax cover sheets, and an insurance card someone scanned sideways. None of it is labeled. The page order is whatever the scanner produced.
The job of an IDP pipeline is to read that bundle the way an experienced clerk would: figure out what each page is, throw away the noise, pull the facts that matter (dates of service, diagnoses, charges, providers), and assemble them into something a human can act on. The difference is that the clerk handles one bundle an afternoon, and the pipeline handles thousands a day.
I want to be precise about the word "processing" here, because it hides a lot. When people say "we use AI to process documents," they usually mean one model call against one page. A production pipeline is a different animal. The system I'm describing runs documents through six distinct stages, and the interesting engineering is almost never in the model. It is in the orchestration around the model: where state lives, how you chunk a document that does not fit in a context window, how you reconcile contradictory classifications, and what you do when Textract returns garbage on page 3 of 480.
The mental model I keep coming back to is perception, then action. The first three stages perceive the document: get the pixels into text, then decide what every page is. The last three act on that perception: routing the document, extracting structured facts, and building a timeline. This article is Part 1: perception. Part 2 is action.

At the highest level, a document moves through six stages: upload and storage, OCR, classification, routing, data extraction, and timeline generation.
One detail trips up almost everyone the first time they read the architecture: classification is not a separate stage that fires when OCR finishes. Classification lives inside the annotation pipeline as its first step. The reason is mundane but important — classification needs the Textract text output to exist, and OCR is asynchronous and can take minutes. So the system decouples them. OCR writes its JSON to storage and stops. The document sits in an AnnotationPending state. Later, a queue processor (or a manual request, or a batch timeline regeneration) triggers the annotation pipeline, which reads the stored OCR output and runs classification as step one.
That decoupling is the first real architecture decision worth internalizing. If OCR directly triggered classification, a burst of uploads would create a thundering herd of LLM calls the moment Textract finished, and you would have no natural place to apply backpressure. By landing everything in a pending state and pulling work through a queue, the system controls its own throughput.
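The backing stores split along clean lines: S3 holds the uploaded files and the OCR output, DynamoDB tracks each TextractJob, and a job_meta field on every document job carries pipeline state.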
The job_meta field deserves a flag now because it shows up constantly in Part 2. It is a JSONB blob attached to every document job, and it accumulates state as the document moves through the pipeline: classification status, the page-level outline, the derived document types, routing flags. Treat it as the document's working memory.
OCR is the unglamorous foundation. If the text extraction is wrong, every downstream model inherits the error, and no amount of prompting recovers a date that Textract never read. So the pipeline takes it seriously and runs AWS Textract through Lambda.
There are two ways a document reaches Textract, and they exist for different operational reasons:
Path 1 — direct S3 trigger. An ObjectCreated event on the upload bucket fires a Lambda. The Lambda calls startDocumentTextDetection(), records a TextractJob in DynamoDB with status Processing, and registers an SNS notification channel. This is the standard path for ordinary uploads.
Path 2 — Step Functions. When OCR is one step inside a larger orchestrated workflow, a state machine invokes a Lambda that carries a task token. When Textract finishes, the handler calls sendTaskSuccess() with the S3 location of the results, which signals the state machine to advance. The task token is the whole point — it lets a long, async OCR step participate in a synchronous-looking workflow without polling.
Either way, when Textract completes it publishes to SNS, a handler Lambda retrieves the OCR result, extracts text blocks and tables, marks the TextractJob as Completed, and writes the output to S3.
Here is the part I found non-obvious: Textract output is stored in two formats, and they are not redundant.
The final JSON is a flat list of page text. That is all classification needs — the LLM reads text, decides a type, and never cares where on the page a word sat. Keeping a lightweight format means the classifier loads less data and runs faster.
The raw JSON preserves everything: block types, bounding boxes, table structure, confidence scores. Bill extraction needs this. To parse an itemized medical bill correctly you have to know which numbers sit in the same row and which column they fall under — geometry is the data. Throwing away bounding boxes would make the bill parser guess at table structure from a flattened text stream, which is exactly the kind of brittle heuristic you want to avoid.
So the rule is: store the cheap format for the cheap consumers, store the expensive format for the one consumer that needs it. Two representations of the same OCR pass, each shaped for its reader.
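A minimal sketch of how the lightweight format could be derived from the raw one. The field names on `RawBlock` (`BlockType`, `Page`, `Text`, `Geometry.BoundingBox`) follow Textract's block structure; the `FinalPage` shape and the `toFinalPages` helper are assumptions made for illustration, not the pipeline's actual code. Sorting lines by their top edge is a simplification that ignores multi-column layouts.

```typescript
// Minimal subset of a Textract block; the real response carries more fields.
interface RawBlock {
  BlockType: string; // e.g. "PAGE", "LINE", "WORD", "TABLE", "CELL"
  Page?: number; // 1-based page number in multi-page results
  Text?: string;
  Confidence?: number;
  Geometry?: {
    BoundingBox: { Left: number; Top: number; Width: number; Height: number };
  };
}

// Hypothetical shape of the lightweight "final" format: plain text per page.
interface FinalPage {
  page: number;
  text: string;
}

function toFinalPages(blocks: RawBlock[]): FinalPage[] {
  const linesByPage = new Map<number, RawBlock[]>();
  for (const block of blocks) {
    // Keep only LINE blocks; words, tables, and cells stay in the raw JSON.
    if (block.BlockType !== "LINE" || block.Text === undefined) continue;
    const page = block.Page ?? 1;
    const lines = linesByPage.get(page) ?? [];
    lines.push(block);
    linesByPage.set(page, lines);
  }
  return Array.from(linesByPage.entries())
    .sort(([a], [b]) => a - b)
    .map(([page, lines]) => ({
      page,
      // Order lines top to bottom, then drop the geometry entirely.
      text: lines
        .sort(
          (a, b) =>
            (a.Geometry?.BoundingBox.Top ?? 0) -
            (b.Geometry?.BoundingBox.Top ?? 0)
        )
        .map((line) => line.Text)
        .join("\n"),
    }));
}
```

The classifier would read only the `FinalPage` list, while a bill parser would go back to the raw blocks for row and column geometry.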

Classification happens at three levels of granularity, and the relationship between them is the thing to get right.
Every page is assigned exactly one of six types. This is the foundational classification; everything else derives from it.
An LLM (GPT-4o-mini) reads each page's text and assigns a type, plus a quality score (low/medium/high/unknown), a provider name, and a handwriting flag. The output per page is a PageMeta object:
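A minimal sketch of what such an object might look like. Only the `PageMeta` name, the four quality levels, and the existence of a provider name and handwriting flag come from the text. The field names, the full list of page types (the text names Medical, Bills, OtherFinancial, and Other; IncidentReport and Legal are inferred from the sub-type enums), and the `toPageQuality` helper are assumptions.

```typescript
// Quality levels come from the text; everything else is an assumed shape.
type PageQuality = "low" | "medium" | "high" | "unknown";

// Hypothetical member list for the six page types.
const PageType = {
  Medical: "Medical",
  Bills: "Bills",
  IncidentReport: "IncidentReport",
  Legal: "Legal",
  OtherFinancial: "OtherFinancial",
  Other: "Other",
} as const;
type PageType = (typeof PageType)[keyof typeof PageType];

interface PageMeta {
  page: number; // 1-based page number (assumed field name)
  type: PageType;
  sub_type?: string; // filled in by the second pass
  quality: PageQuality;
  provider: string | null; // null when no provider is visible on the page
  has_handwriting: boolean; // assumed field name for the handwriting flag
}

const QUALITY_LEVELS = new Set<string>(["low", "medium", "high", "unknown"]);

// Coerce whatever the model returned into a valid quality level.
function toPageQuality(value: unknown): PageQuality {
  return typeof value === "string" && QUALITY_LEVELS.has(value)
    ? (value as PageQuality)
    : "unknown";
}

// Illustrative values only.
const examplePage: PageMeta = {
  page: 14,
  type: PageType.Medical,
  quality: toPageQuality("high"),
  provider: "Memorial Hospital",
  has_handwriting: false,
};
```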
Once a page has a type, a second pass assigns a sub-type specific to that type. Financial pages get billing-specific sub-types; medical pages get clinical-relevance sub-types; and so on. The financial enum is the richest because billing is where the money — literally — gets decided:
Medical sub-typing is different in one important way — it uses a fine-tuned model, not a prompt:
The remaining two parent types each carry their own prompt-classified sub-types. Incident reports separate official police reports from facility incident reports:
Legal pages key off discovery-specific keywords:
The choice of a fine-tuned model for medical sub-typing is a cost/accuracy call. Medical relevance is the highest-stakes sub-classification, and getting it wrong means either burning tokens annotating worthless letterhead, or worse, ignoring a page that documents a critical procedure. Financial, incident, and legal sub-types ride on prompt-based GPT-4o-mini because the categories are more textually obvious — the literal strings "Deposition," "EBT," "Rule 26," or police letterhead are signals a good prompt can match, while "is this page clinically critical?" is a judgment call.
That sub_type field is typed as a union, PageSubTypeOrError, which combines the four real sub-type enums (FinancialSubType, MedicalSubType, IncidentReportSubType, LegalSubType) with two supporting cases. Pages whose type has no sub-types — PageType.Other and PageType.OtherFinancial — get the sentinel PageTypeAssorted.NoSubClassification. And when classification itself fails, the field carries a PageTypeError: ClassificationError for an outright failure, or InvalidType when the model returns a value outside the enum (the same InvalidType the per-batch retry falls back to). Every page ends up with a well-typed sub_type, even the degenerate and error cases.
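The union and its fallback rules can be sketched as follows. The type and sentinel names (`PageSubTypeOrError`, `PageTypeAssorted.NoSubClassification`, `PageTypeError.ClassificationError`, `InvalidType`) and the three medical levels come from the text. The members of the financial, incident, and legal enums are placeholders, the mapping of Bills to `FinancialSubType` is an assumption, and `resolveSubType` is a hypothetical helper that treats a `null` model output as a failed call.

```typescript
type ValueOf<T> = T[keyof T];

const PageType = {
  Medical: "Medical",
  Bills: "Bills",
  IncidentReport: "IncidentReport",
  Legal: "Legal",
  OtherFinancial: "OtherFinancial",
  Other: "Other",
} as const;
type PageType = ValueOf<typeof PageType>;

// Medical levels come from the text; the other member names are placeholders.
const MedicalSubType = { Critical: "Critical", Important: "Important", Ignore: "Ignore" } as const;
const FinancialSubType = { ItemizedBill: "ItemizedBill", ExplanationOfBenefits: "ExplanationOfBenefits" } as const;
const IncidentReportSubType = { PoliceReport: "PoliceReport", FacilityIncidentReport: "FacilityIncidentReport" } as const;
const LegalSubType = { Deposition: "Deposition", Discovery: "Discovery" } as const;
const PageTypeAssorted = { NoSubClassification: "NoSubClassification" } as const;
const PageTypeError = { ClassificationError: "ClassificationError", InvalidType: "InvalidType" } as const;

type PageSubTypeOrError =
  | ValueOf<typeof FinancialSubType>
  | ValueOf<typeof MedicalSubType>
  | ValueOf<typeof IncidentReportSubType>
  | ValueOf<typeof LegalSubType>
  | ValueOf<typeof PageTypeAssorted>
  | ValueOf<typeof PageTypeError>;

// Page types without an entry here have no sub-types at all.
const ALLOWED: Partial<Record<PageType, ReadonlySet<string>>> = {
  [PageType.Medical]: new Set<string>(Object.values(MedicalSubType)),
  [PageType.Bills]: new Set<string>(Object.values(FinancialSubType)),
  [PageType.IncidentReport]: new Set<string>(Object.values(IncidentReportSubType)),
  [PageType.Legal]: new Set<string>(Object.values(LegalSubType)),
};

function resolveSubType(type: PageType, modelOutput: string | null): PageSubTypeOrError {
  const allowed = ALLOWED[type];
  // Other and OtherFinancial get the sentinel without calling a model.
  if (allowed === undefined) return PageTypeAssorted.NoSubClassification;
  // A null output stands in for a failed classification call.
  if (modelOutput === null) return PageTypeError.ClassificationError;
  // Anything outside the enum for this page type is rejected.
  return allowed.has(modelOutput)
    ? (modelOutput as PageSubTypeOrError)
    : PageTypeError.InvalidType;
}
```

Because every branch returns a member of the union, callers can switch on `sub_type` without a separate null check.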
Here is the inversion that surprised me. You might expect the system to ask an LLM "what type of document is this?" It does not. Document types are computed from the page-level outline:
A document can carry multiple types simultaneously (a bundle that is both hospital records and billing). The derivation rules:
Why fuzzy thresholds for medical but simple presence for the rest? Because medical pages are noisy. A 400-page billing bundle might have one page of clinical notes stapled in by accident. Simple presence would mislabel the whole thing as hospital records and route it into expensive medical annotation. So medical-record detection counts sub-types and checks proportions:
The counts are cumulative on purpose: importantCount includes critical pages, and ignoreCount includes important and critical pages as well (minus a 5-page slack), so each bucket is a superset of the one above it. The asymmetry in the thresholds is intentional too: a critical share of just 1% is enough to flag the document, because a critical medical page is rare and valuable, while ignore-level pages must exceed 20% of the document to register. The thresholds encode a judgment about which mistakes are expensive.
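One possible reading of the medical-record check, as a hedged sketch. The 1% critical threshold, the 20% ignore threshold, and the 5-page slack come from the text. The threshold for the important bucket is not stated, so it is left as a caller-supplied parameter, and the exact way the slack is applied (subtracted once from the cumulative ignore count) is an assumption. `isMedicalRecord` and `MedicalThresholds` are hypothetical names.

```typescript
type MedicalLevel = "Critical" | "Important" | "Ignore";

interface MedicalThresholds {
  criticalMinShare: number; // 0.01 in the text (inclusive)
  importantMinShare: number; // not stated in the text; supplied by the caller
  ignoreMinShare: number; // 0.20 in the text (exclusive)
  ignoreSlackPages: number; // 5 in the text
}

function isMedicalRecord(
  levels: MedicalLevel[], // sub-types of the pages classified as Medical
  totalPages: number, // all pages in the document, medical or not
  t: MedicalThresholds
): boolean {
  if (totalPages <= 0) return false;
  const count = (level: MedicalLevel) =>
    levels.filter((l) => l === level).length;

  // Cumulative buckets: each one contains the buckets above it.
  const criticalCount = count("Critical");
  const importantCount = criticalCount + count("Important");
  const ignoreCount = Math.max(
    0,
    importantCount + count("Ignore") - t.ignoreSlackPages
  );

  return (
    criticalCount / totalPages >= t.criticalMinShare ||
    importantCount / totalPages >= t.importantMinShare ||
    ignoreCount / totalPages > t.ignoreMinShare
  );
}
```

Under this reading, a 400-page billing bundle with a single stray ignore-level clinical page stays below every threshold, because the slack absorbs it.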
You cannot paste 500 pages into a single LLM call, because it overflows the model's token limit. You cannot classify pages one at a time either, because a page rarely classifies correctly without its neighbors for context. The pipeline solves this with a layered chunking strategy of overlapping batches and priority-based merging.

Pages are split into batches of 15 with a 2-page overlap, giving an effective stride of 13:
The overlap exists because a page in isolation is often ambiguous. A medical record spanning pages 14–16 should not be cut at a batch boundary with no shared context, and provider names that appear only in a section header need to carry forward. Overlap buys context across the seam.
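The batching arithmetic can be sketched as below. The batch size of 15, the overlap of 2, and the resulting stride of 13 come from the text; the `makeBatches` helper, the `Batch` shape, and the choice to stop as soon as a batch reaches the last page are assumptions.

```typescript
interface Batch {
  start: number; // 0-based index of the first page in the batch
  end: number; // exclusive
}

function makeBatches(totalPages: number, batchSize = 15, overlap = 2): Batch[] {
  if (overlap >= batchSize) {
    throw new Error("overlap must be smaller than batchSize");
  }
  const stride = batchSize - overlap; // 15 - 2 = 13
  const batches: Batch[] = [];
  for (let start = 0; start < totalPages; start += stride) {
    const end = Math.min(start + batchSize, totalPages);
    batches.push({ start, end });
    // Stop once the last page is covered, so no batch is pure overlap.
    if (end === totalPages) break;
  }
  return batches;
}

// A 30-page document yields pages 0-14, 13-27, and 26-29:
// pages 13, 14, 26, and 27 are each seen by two batches.
const batchesFor30 = makeBatches(30);
```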
Inside each batch, pages get explicit markers so the model never confuses a batch index with a page number printed in the document itself:
The 3001 offset removes an ambiguity. If a document says "see page 5," and you had numbered your batch pages 1–15, the model might cross the wires. Starting at 3001, a number that essentially never appears in a medical or legal document, sidesteps the problem. It is the kind of detail you only add after a model confidently mislabels a page because it read an internal cross-reference.
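A small sketch of the marker scheme. Only the 3001 starting value comes from the text. The marker format (`=== PAGE n ===`), the assumption that numbering restarts at 3001 in every batch, and both helper names are hypothetical.

```typescript
const MARKER_OFFSET = 3001;

// Render one batch for the prompt, labelling each page with an
// offset marker instead of its position in the batch.
function renderBatch(pageTexts: string[]): string {
  return pageTexts
    .map((text, i) => `=== PAGE ${MARKER_OFFSET + i} ===\n${text}`)
    .join("\n\n");
}

// Map a marker the model echoed back to a 0-based index in the document.
function markerToPageIndex(marker: number, batchStart: number): number {
  return batchStart + (marker - MARKER_OFFSET);
}
```

With this scheme a printed "see page 5" inside the text can never collide with a marker, and the caller converts markers back to real page positions after the model responds.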
All batches run concurrently with Promise.all. If the model returns the wrong number of classifications for a batch, the system retries those pages individually and, failing that, marks them InvalidType, so the invariant of exactly one classification per page always holds.
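The retry-and-fallback behavior might look like the sketch below. `ClassifyFn` is a hypothetical stand-in for the LLM call, and the function names are assumptions. Treating a thrown error during the per-page retry the same as a wrong count is also an assumption.

```typescript
type Label = string;

// Hypothetical stand-in for the LLM call: one label per page, in order.
type ClassifyFn = (pages: string[]) => Promise<Label[]>;

const INVALID_TYPE: Label = "InvalidType";

async function classifyBatch(
  pages: string[],
  classify: ClassifyFn
): Promise<Label[]> {
  const labels = await classify(pages);
  if (labels.length === pages.length) return labels;

  // Wrong count: the labels cannot be aligned to pages, so retry one by one.
  return Promise.all(
    pages.map(async (page) => {
      try {
        const single = await classify([page]);
        return single.length === 1 ? single[0] : INVALID_TYPE;
      } catch {
        return INVALID_TYPE;
      }
    })
  );
}

// All batches start at once; each result has exactly one label per page.
function classifyAllBatches(
  batches: string[][],
  classify: ClassifyFn
): Promise<Label[][]> {
  return Promise.all(batches.map((batch) => classifyBatch(batch, classify)));
}
```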
Overlap means some pages get classified twice. When Batch 1 says page 14 is "Medical" and Batch 2 says "Other," you need a deterministic tie-breaker. The system resolves conflicts by priority, where a lower number means higher importance:
The merge always prefers specificity. "Medical" beats "Other"; "Bills" beats "OtherFinancial"; "Critical" beats "Important." The reasoning is that a confident specific classification carries more signal than a vague one, and in this domain the cost of under-classifying (treating a medical page as Other and skipping it) is higher than over-classifying.
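A hedged sketch of the merge. The rule that a lower number wins, and the two orderings Medical over Other and Bills over OtherFinancial, come from the text. The concrete priority numbers, the remaining type names, and the decision to rank unrecognized labels last are assumptions.

```typescript
// Hypothetical priority values; only the relative order of
// Medical vs Other and Bills vs OtherFinancial is given in the text.
const TYPE_PRIORITY: Record<string, number> = {
  Medical: 1,
  Bills: 2,
  IncidentReport: 3,
  Legal: 4,
  OtherFinancial: 5,
  Other: 6,
};

interface PageVote {
  page: number;
  type: string;
}

// Labels missing from the map (for example InvalidType) rank last.
function priorityOf(type: string): number {
  return TYPE_PRIORITY[type] ?? Number.MAX_SAFE_INTEGER;
}

// Collapse votes from overlapping batches into one type per page.
function mergeByPriority(votes: PageVote[]): Map<number, string> {
  const merged = new Map<number, string>();
  for (const { page, type } of votes) {
    const current = merged.get(page);
    if (current === undefined || priorityOf(type) < priorityOf(current)) {
      merged.set(page, type);
    }
  }
  return merged;
}
```

Because the comparison is strict and depends only on the priority map, the result is the same whichever batch finishes first.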
Sub-classification only runs on pages of the matching parent type, and it groups them into contiguous runs so unrelated sections never get analyzed together:
If a document has bills on pages 1–20 and again on 81–100 with medical records in between, you do not want to classify those two billing sections as one blob — they are different providers, different dates, different structure. Contiguous runs keep each section's context intact.
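Grouping into contiguous runs is simple enough to sketch directly. The `Run` shape and the `contiguousRuns` helper are hypothetical names; the input is assumed to be the 1-based page numbers that share one parent type.

```typescript
interface Run {
  start: number; // first page of the run (inclusive)
  end: number; // last page of the run (inclusive)
}

// Group page numbers of one parent type into runs of adjacent pages.
function contiguousRuns(pageNumbers: number[]): Run[] {
  const sorted = [...pageNumbers].sort((a, b) => a - b);
  const runs: Run[] = [];
  for (const page of sorted) {
    const last = runs[runs.length - 1];
    if (last !== undefined && page <= last.end + 1) {
      // Adjacent (or duplicate) page: extend the current run.
      last.end = Math.max(last.end, page);
    } else {
      // Gap: start a new run.
      runs.push({ start: page, end: page });
    }
  }
  return runs;
}
```

For the example in the text, bill pages 1 to 20 and 81 to 100 come back as two runs, and each run is sent to sub-classification on its own.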
Two pieces of context the model would otherwise miss lift accuracy. First, filename context: a file called Memorial_Hospital_Bill_2023.pdf is a strong hint, so it gets prepended to the page text during sub-classification:
Second, provider backfilling — medical records put the provider in a section header on the first page only, so continuation pages inherit the last known provider:
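Both context tricks can be sketched in a few lines. The `Filename:` prefix format, the `PageInfo` shape, and both helper names are assumptions; the only behavior taken from the text is that the filename is prepended to the page text and that continuation pages inherit the last known provider.

```typescript
interface PageInfo {
  page: number;
  provider: string | null;
}

// Prepend the filename so the model sees it before the page text.
function withFilenameContext(filename: string, pageText: string): string {
  return `Filename: ${filename}\n\n${pageText}`;
}

// Carry the last known provider forward onto pages that lack one.
// Pages before the first known provider keep null.
function backfillProviders(pages: PageInfo[]): PageInfo[] {
  let lastKnown: string | null = null;
  return pages.map((p) => {
    if (p.provider !== null && p.provider.trim() !== "") {
      lastKnown = p.provider;
      return p;
    }
    return { ...p, provider: lastKnown };
  });
}
```

Backfilling is probably safest when applied per contiguous run, so a provider from one section does not leak into an unrelated one; that scoping is a design suggestion rather than something the text states.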
The pipeline parallelizes aggressively but with a ceiling. Quality assessment and page-type classification run concurrently; all batches run concurrently; all contiguous runs run concurrently; provider extraction runs in its own 12-page batches. A PromiseQueue(5) caps concurrent document-outline generations so a flood of documents cannot exhaust memory or saturate database connections.
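The text names a `PromiseQueue(5)` but does not show it, so the class below is a minimal stand-in with an assumed `run` method, not the pipeline's actual implementation. It hands a finished task's slot directly to the next waiter, which keeps the number of running tasks at or below the limit.

```typescript
class PromiseQueue {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // slot transfers directly; the active count is unchanged
      } else {
        this.active--;
      }
    }
  }
}

// Usage sketch: cap outline generation at five documents at a time.
// generateOutline is a hypothetical function, shown only for shape.
// const queue = new PromiseQueue(5);
// await Promise.all(docs.map((d) => queue.run(() => generateOutline(d))));
```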
One historical note worth keeping, because it is a good lesson in not over-optimizing: the codebase contains a sampling layer that would process only 10–15% of pages (15–60 pages) for non-priority cases to save cost. It is now dead code. Business requirements shifted to full annotation for every case type, so 100% of pages are processed. The sampling logic remains, unused, in case selective processing is ever needed again. If you read the code cold, you would think sampling is active — it is not.
The end product of all this is doc_outline_v2: a per-page array of PageMeta objects, stored both in S3 (full detail) and in the document's job_meta blob (summary plus the outline). A representative slice:
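To make the shape concrete, here is an illustrative slice with invented values. The field names, the type and sub-type strings other than the ones named in the text, and the `countByType` summary helper are all assumptions; only `doc_outline_v2` as a per-page array of `PageMeta` objects comes from the article.

```typescript
interface OutlinePage {
  page: number;
  type: string;
  sub_type: string;
  quality: "low" | "medium" | "high" | "unknown";
  provider: string | null;
  has_handwriting: boolean;
}

// Invented values, for illustration only.
const docOutlineV2: OutlinePage[] = [
  { page: 1, type: "Other", sub_type: "NoSubClassification", quality: "high", provider: null, has_handwriting: false },
  { page: 2, type: "Medical", sub_type: "Critical", quality: "high", provider: "Memorial Hospital", has_handwriting: false },
  { page: 3, type: "Medical", sub_type: "Important", quality: "medium", provider: "Memorial Hospital", has_handwriting: true },
  { page: 4, type: "Bills", sub_type: "ItemizedBill", quality: "high", provider: "Memorial Hospital", has_handwriting: false },
];

// One possible summary to keep next to the outline: page counts per type.
function countByType(outline: OutlinePage[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const page of outline) {
    counts[page.type] = (counts[page.type] ?? 0) + 1;
  }
  return counts;
}
```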
Alongside it, generated_types carries the derived document-level types, and classify_status flips to 'classified'. That outline is the contract between perception and action. Everything in Part 2 (the routing decision, which extractor runs, what ends up on the timeline) reads from this structure. Get the outline right and the rest of the pipeline has a fighting chance; get it wrong and no downstream cleverness saves you.
OCR converts pixels to text — it tells you what words are on a page. Intelligent Document Processing is the full pipeline that sits on top: it classifies what each page is, decides which documents matter, extracts structured fields, and assembles the results into something queryable. OCR is one stage (the second) inside IDP. A system that stops at OCR hands you a text dump; an IDP system hands you structured data with types, providers, dates, and relationships.
Real-world bundles are mixed. A single uploaded PDF routinely contains medical records, bills, legal filings, and administrative junk interleaved in arbitrary order. Document-level classification forces one label onto a heterogeneous file and loses the structure. Page-level classification captures the reality, where page 1 is a clinical note, page 85 is a bill, and page 150 is letterhead, and then derives document-level types from the page distribution. The page is the honest unit of classification.
The split between a fine-tuned model and prompts is a cost-versus-accuracy call. Medical relevance ("is this page clinically critical, important, or ignorable?") is subtle, high-stakes, and hard to express reliably in a prompt, and it runs on a huge share of pages, so accuracy compounds. A fine-tuned model earns its training cost there. Financial, incident, and legal sub-types key off textually obvious signals (the literal words "Deposition," "Explanation of Benefits," "Police Report"), where a cheap general model with a good prompt is plenty.
Overlap deliberately classifies boundary pages more than once, then reconciles. After all batches return, a merge step walks every page and, where two batches disagree, keeps the higher-priority (more specific) type using a fixed priority map — Medical beats Other, Bills beats OtherFinancial. The invariant maintained throughout is exactly one final classification per page, so the duplication helps accuracy at the seams without inflating the page count.
OCR completion does not automatically trigger classification, and assuming it does is a common misreading of the architecture. Textract writes its JSON output to storage and marks its job complete, but it does not kick off the annotation pipeline. The document waits in an AnnotationPending state until a queue processor, a manual request, or a batch timeline regeneration pulls it forward. Decoupling OCR from classification gives the system a natural backpressure point and prevents a burst of uploads from stampeding the LLM tier.
This is Part 1 of a two-part series on building a production Intelligent Document Processing pipeline. Part 2 covers routing, data extraction, and timeline generation.
Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.