IDP Part 2: Routing, Extraction & Timeline Generation

The action half of a production IDP pipeline: skip-routing, structured bill and medical extraction, day-by-day timeline assembly, plus queues and retries.

TL;DR: Part 1 of this series covered how an Intelligent Document Processing pipeline perceives a document — OCR plus a three-level classification hierarchy. Part 2 covers what it does next: a routing step that skips low-value documents, structured extraction that turns medical and billing pages into typed annotations, and a timeline stage that merges every case document into a chronological day-by-day view. It closes with the infrastructure — capacity queues, event-driven orchestration, and retry logic — that lets the whole thing run unattended at scale.

Key Takeaways

  • Routing is a cheap gate that runs after classification: a document is skipped unless it carries a high-value type (medical record, medical bill, police report) and is not flagged low-priority. Skipping early saves the expensive extraction step.
  • There are three distinct skip mechanisms — issue-based (corrupted/password/empty), automatic (classification-driven), and manual (bulk or by-type) — and skips are reversible, with previous statuses restored on revert.
  • Extraction is format-aware. The pipeline detects a provider format (15 variants) and routes to the matching event generator, because a hospital's Epic export and a chiropractor's bill need different parsing.
  • Bill extraction relies on Textract's raw geometry, not flat text. Bounding boxes and table structure are what let the parser align CPT codes, dates of service, and charges into correct line items.
  • Annotations and timeline events are stored as structured rows (PostgreSQL) with JSON payloads, full soft-delete audit fields, and token-accounting columns that track LLM spend per event.
  • The system survives at scale through a capacity queue that meters work by page count (not job count), event-driven orchestration over EventBridge, and a uniform retry-once-then-fail pattern with per-page timeout budgets.

Figure: The data model: document job, annotation, and timeline event

Where We Left Off

In Part 1 a document was uploaded, OCR'd by Textract, and classified into a per-page outline (doc_outline_v2) with derived document-level types (generated_types). That outline is now the input to everything that follows. The annotation pipeline — the orchestrator that owns these steps — runs them in order:
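A minimal sketch of that ordering, assuming the stage sequence this article describes later for case regeneration (classify, skip-check, format-detect, generate-events). The stage names, `PipelineContext`, and `runPipeline` are illustrative and not taken from the production code, and the stages are synchronous here only to keep the example short.

```typescript
// Hypothetical stage names; the order follows the article's
// classify -> skip-check -> format-detect -> generate-events sequence.
type StageName = "classify" | "skipCheck" | "formatDetect" | "generateEvents";

interface PipelineContext {
  documentId: string;
  skipped: boolean;
  completed: StageName[];
}

interface Stage {
  name: StageName;
  run: (ctx: PipelineContext) => void;
}

// Runs stages in order and stops as soon as a stage marks the document skipped.
function runPipeline(ctx: PipelineContext, stages: Stage[]): PipelineContext {
  for (const stage of stages) {
    stage.run(ctx);
    ctx.completed.push(stage.name);
    if (ctx.skipped) break; // routing said skip: never pay for extraction
  }
  return ctx;
}

// Example wiring with stub stages. The skip rule here (an id prefix) is a stand-in
// for the real classification-driven check.
const stages: Stage[] = [
  { name: "classify", run: () => {} },
  { name: "skipCheck", run: (ctx) => { ctx.skipped = ctx.documentId.startsWith("fax-"); } },
  { name: "formatDetect", run: () => {} },
  { name: "generateEvents", run: () => {} },
];

const kept = runPipeline({ documentId: "record-1", skipped: false, completed: [] }, stages);
const dropped = runPipeline({ documentId: "fax-7", skipped: false, completed: [] }, stages);
```

In this sketch `kept` passes through all four stages, while `dropped` stops after the skip check, which is the whole point of putting the gate before extraction.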

The mindset shift from Part 1 is from perception to action. Up to now the pipeline has been deciding what the document is. From here it acts on that decision, and the first action is deciding whether to act at all.

How Does the Pipeline Decide What to Skip?

Figure: The routing gate: keep high-value documents, skip the rest

Not every page is worth the cost of extraction. An insurance card, a fax cover sheet, a return-to-work form — these are stored for completeness but add nothing to a case timeline. Running them through LLM extraction would burn tokens for no value. So immediately after classification, a routing step decides keep-or-skip.

The logic keys off the derived document types from Part 1:

A document is skipped when it has no high-value type or it carries a DEEPDIVE sort flag; it is processed when it has at least one high-value type and is not flagged DEEPDIVE. The DocumentSortType enum captures the priority intent:
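As a rough sketch of that predicate: `ESSENTIAL` and `DEEPDIVE` are the two sort values this article names, and the three high-value types come from the takeaways above. The exact type strings, the `sort_type` field name, and `shouldSkip` are assumptions for illustration. The article calls `DocumentSortType` an enum; it is modeled here as a const object with a derived union type, which behaves the same for this purpose.

```typescript
// The two sort values named in the article. Literal strings are assumed.
const DocumentSortType = { ESSENTIAL: "ESSENTIAL", DEEPDIVE: "DEEPDIVE" } as const;
type DocumentSortType = (typeof DocumentSortType)[keyof typeof DocumentSortType];

// High-value types per the article; the exact strings are assumptions.
const HIGH_VALUE_TYPES: readonly string[] = ["Medical Record", "Medical Bill", "Police Report"];

interface RoutableDocument {
  generated_types: string[];      // derived document-level types from Part 1
  sort_type?: DocumentSortType;   // hypothetical field name
}

// Skip when there is no high-value type OR the document is flagged DEEPDIVE.
function shouldSkip(doc: RoutableDocument): boolean {
  const hasHighValueType = doc.generated_types.some((t) => HIGH_VALUE_TYPES.includes(t));
  const isDeepDive = doc.sort_type === DocumentSortType.DEEPDIVE;
  return !hasHighValueType || isDeepDive;
}
```

Note that the two conditions are not symmetric: a high-value type is necessary to be processed, but a DEEPDIVE flag overrides it.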

There is a piece of vestigial machinery here worth naming, because it confused me until I dug in. The routing code checks realtime_generated_types alongside generated_types. Those realtime_ fields were meant to be populated by an image-based fast classifier that would run during upload, before OCR, to make instant routing decisions. That classifier exists in the codebase but is not active in production; it was exploratory work that never got fully integrated. So in practice realtime_generated_types is never set, and routing decisions rest entirely on the text-based generated_types from Part 1. The check survives for backward compatibility.

Worth a sentence on what that classifier would have done, since it is the original home of the DocumentSortType values you just saw. During upload it would render roughly the first five PDF pages to images and send them to a vision model (GPT-4V) for a fast, pre-OCR guess. The model classifies into five buckets — Medical Record, Medical Bill, Insurance Policy, Police Report, or Other — and the routing intent falls straight out of that: the first four map to ESSENTIAL, while anything landing in Other (authorization forms, insurance cards, correspondence) maps to DEEPDIVE. The result would have been written to realtime_sort_type and realtime_generated_types, letting the pipeline skip low-value documents before paying for OCR. It is lower-accuracy than the text-based classifier in Part 1 (five page images versus the full Textract output), which is part of why the text path won out and this one never shipped.

Three kinds of skip

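"Skip" turns out to mean three different things: issue-based skips for documents that cannot be processed at all (corrupted, password-protected, or empty), automatic skips driven by classification, and manual skips that a user triggers in bulk or by type.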

The manual path has two flavors. Bulk skip takes an explicit list of document job IDs and can revert — when is_revert is true it restores each document to its previous status (captured before the skip), not blindly to AnnotationPending:
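A minimal sketch of the revert behavior, assuming the prior status is stored on the job itself. Only `AnnotationPending` and `is_revert` are names from this article; the other statuses, the `previous_status` field, and `bulkSkip` are hypothetical.

```typescript
// Only AnnotationPending is a status named in the article; the others are placeholders.
type JobStatus = "AnnotationPending" | "AnnotationFailed" | "Skipped";

interface DocumentJob {
  id: string;
  status: JobStatus;
  previous_status?: JobStatus; // captured at skip time so a revert can restore it
}

function bulkSkip(jobs: Map<string, DocumentJob>, ids: string[], is_revert: boolean): void {
  for (const id of ids) {
    const job = jobs.get(id);
    if (!job) continue; // unknown id: nothing to do

    if (is_revert) {
      // Only revert jobs that are actually skipped and have a recorded prior status.
      if (job.status !== "Skipped" || job.previous_status === undefined) continue;
      job.status = job.previous_status;
      delete job.previous_status;
    } else if (job.status !== "Skipped") {
      job.previous_status = job.status; // remember exactly where the job was
      job.status = "Skipped";
    }
  }
}
```

A job that was in `AnnotationFailed` before the skip returns to `AnnotationFailed` on revert, rather than being reset to `AnnotationPending`.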

Auto-skip by type takes a types_to_keep list for a whole case and skips everything not matching — the same predicate as automatic routing, applied in bulk on demand:
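The selection step might look like the following sketch. `types_to_keep` is the parameter this article names; `CaseDocument`, `has_annotations`, and `selectDocumentsToSkip` are illustrative. The annotation guard reflects the rule that manual skips never touch documents that already have annotations.

```typescript
interface CaseDocument {
  id: string;
  generated_types: string[];
  has_annotations: boolean; // hypothetical flag; the real check may be a query
}

// Returns the ids to skip: documents matching none of the kept types,
// excluding any that already carry annotations.
function selectDocumentsToSkip(docs: CaseDocument[], types_to_keep: string[]): string[] {
  return docs
    .filter((d) => !d.has_annotations)
    .filter((d) => !d.generated_types.some((t) => types_to_keep.includes(t)))
    .map((d) => d.id);
}

// Example: keep only medical records and bills for this case.
const toSkip = selectDocumentsToSkip(
  [
    { id: "d1", generated_types: ["Medical Record"], has_annotations: false },
    { id: "d2", generated_types: ["Insurance Card"], has_annotations: false },
    { id: "d3", generated_types: ["Correspondence"], has_annotations: true },
  ],
  ["Medical Record", "Medical Bill"]
);
```

Here only `d2` is selected: `d1` matches a kept type and `d3` is protected because it already has annotations.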

Both manual paths stream progress to the UI over Server-Sent Events, process in rate-limited batches, and protect documents that already have annotations. The status machine stays small and reversible: a skip records the document's prior status, and a revert restores exactly that status.

Reversibility is the design value here. Skipping is an aggressive, cost-saving move; making every skip undoable, and restoring the exact prior status, means an over-eager auto-skip never destroys information a human later needs.

How Is Structured Data Extracted From a Kept Document?

Once a document survives routing, the pipeline detects its provider format and routes to the matching extractor. This is the step people underestimate: a hospital running Epic exports its records in a completely different shape than a chiropractor's billing software or a police department's incident form. One generic parser would do all of them badly.

Format detection draws on the page-type outline, extracted provider names, content patterns, and filename heuristics. The detected format then selects the event generator. There are two generations of generator: a legacy generateEvents() for the older OtherEpic and Lucy formats, and a modern generateEventsV4() for everything newer (General, Duke, Police, Bill, and so on). New documents take the V4 path.
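The dispatch half of this can be sketched as below. The format names and the two generator names are the ones mentioned above; the real format list has 15 variants, and `selectGenerator` is a hypothetical helper. Detection itself is left out because its rules are not described in enough detail to reproduce.

```typescript
// Formats named in the article. The production list has 15 variants.
type ProviderFormat = "OtherEpic" | "Lucy" | "General" | "Duke" | "Police" | "Bill";

// The two generator generations named in the article.
type GeneratorName = "generateEvents" | "generateEventsV4";

const LEGACY_FORMATS: readonly ProviderFormat[] = ["OtherEpic", "Lucy"];

// Legacy formats go to the original generator; everything newer goes to V4.
function selectGenerator(format: ProviderFormat): GeneratorName {
  return LEGACY_FORMATS.includes(format) ? "generateEvents" : "generateEventsV4";
}
```

Keeping the legacy set explicit means a newly added format defaults to the V4 path without further changes.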

Bill extraction leans on geometry

Bill extraction is its own sub-pipeline, and it is where the raw Textract JSON from Part 1 finally pays off. The flow:

  1. Detect bill pages from the outline (`PageType.Financial` / `Bills`).
  2. Process OCR blocks — the bill OCR processor reads raw Textract blocks, reconstructs tables, and identifies line items using bounding boxes.
  3. Extract structured data — an LLM pulls patient info, provider details, service dates, charges and payments, insurance fields.
  4. Classify the bill sub-type (Bills, BillsWithAPColumns, Lien, EOB, Pharmacy — see Part 1).
  5. Write annotations to the database.

This is why the pipeline stored two OCR formats. You cannot align a CPT code with its charge and date of service from a flattened text stream — the columns collapse into a soup of numbers. The bounding boxes preserve which cells share a row and which column a value sits under, and the parser rebuilds the table from that geometry. Flat text is fine for "what type of page is this"; structured billing needs the spatial layout.
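To make the geometry point concrete, here is a simplified sketch of row reconstruction. It is not the production bill OCR processor. `WordBlock` mirrors the subset of a Textract WORD block used here (bounding-box values are ratios of page width and height), while `groupIntoRows` and `rowTolerance` are hypothetical.

```typescript
// Minimal subset of a Textract WORD block: coordinates are ratios of page size.
interface WordBlock {
  Text: string;
  Geometry: { BoundingBox: { Left: number; Top: number; Width: number; Height: number } };
}

// Groups words into rows by vertical center, then orders each row left to right.
// rowTolerance is a tuning knob assumed for this sketch.
function groupIntoRows(words: WordBlock[], rowTolerance = 0.01): string[][] {
  const centerY = (w: WordBlock): number =>
    w.Geometry.BoundingBox.Top + w.Geometry.BoundingBox.Height / 2;
  const left = (w: WordBlock): number => w.Geometry.BoundingBox.Left;

  const rows: { y: number; words: WordBlock[] }[] = [];
  for (const word of words.slice().sort((a, b) => centerY(a) - centerY(b))) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(centerY(word) - last.y) <= rowTolerance) {
      last.words.push(word); // same visual line as the current row
    } else {
      rows.push({ y: centerY(word), words: [word] }); // start a new row
    }
  }
  return rows.map((row) => row.words.sort((a, b) => left(a) - left(b)).map((w) => w.Text));
}

// Example: six words arriving in scrambled order, as a flat text stream might give them.
const word = (Text: string, Left: number, Top: number): WordBlock => ({
  Text,
  Geometry: { BoundingBox: { Left, Top, Width: 0.1, Height: 0.02 } },
});
const billRows = groupIntoRows([
  word("$85.00", 0.8, 0.25), word("99213", 0.1, 0.2), word("01/12/2024", 0.4, 0.251),
  word("$150.00", 0.8, 0.199), word("97110", 0.1, 0.25), word("01/05/2024", 0.4, 0.201),
]);
```

With the positions restored, `billRows` holds one array per line item, each ordered as code, date of service, charge. A real parser would go further and assign values to columns by matching them against header positions.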

Bill extraction also runs fire-and-forget, in parallel with the main pipeline. When bill pages are detected, a background job starts and does not block timeline generation. The two products — a structured bill and a medical timeline — are independent, so there is no reason to serialize them.

What an annotation looks like

Extracted data lands as an annotation row in PostgreSQL. The schema carries a typed envelope plus a flexible JSON payload, and (importantly for a system humans audit) full soft-delete and authorship fields:
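An illustrative shape for such a row follows. The `generated` and `is_modified` flags and the `created_by`, `updated_by`, and `deleted_by` columns are named in this article; every other field name, the timestamp names, and the exact meaning given to the flag combinations are assumptions.

```typescript
// Hypothetical row shape. Only generated, is_modified, and the *_by columns
// are confirmed by the article.
interface AnnotationRow {
  id: string;
  document_job_id: string;
  annotation_type: string;           // the typed envelope
  payload: Record<string, unknown>;  // flexible JSON payload
  metadata: { generated: boolean; is_modified: boolean };
  created_by: string;
  created_at: string;
  updated_by: string | null;
  updated_at: string | null;
  deleted_by: string | null;         // soft delete: the row is kept, only marked
  deleted_at: string | null;
}

type Provenance = "ai" | "ai_edited_by_human" | "human";

// Assumed reading of the flags: generated = produced by the model,
// is_modified = a reviewer changed it afterwards.
function provenanceOf(row: AnnotationRow): Provenance {
  if (!row.metadata.generated) return "human";
  return row.metadata.is_modified ? "ai_edited_by_human" : "ai";
}

// A soft-deleted row stays in the table but is excluded from live views.
function isLive(row: AnnotationRow): boolean {
  return row.deleted_at === null;
}
```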

The generated and is_modified flags do real work here: any downstream view can tell AI output apart from human corrections without a second lookup. In a domain where the output may end up in litigation, that provenance is not optional.

How Are Per-Document Events Assembled Into a Case Timeline?

Figure: Scattered events sorted into a day-by-day timeline

Extraction produces events scattered across many documents — an admission noted in one record, a bill from another provider, a police report from a third. The final stage, the Day-by-Day (DBD) timeline, merges them into one chronological narrative for the case.
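The core of that merge can be sketched as a group-and-sort over event dates. `TimelineEvent`, its field names, and `buildDayByDay` are hypothetical, and the sketch assumes dates are ISO `YYYY-MM-DD` strings so that string order equals chronological order. The production stage also repairs and deduplicates events, which is omitted here.

```typescript
interface TimelineEvent {
  event_date: string;          // assumed ISO date, e.g. "2024-01-05"
  description: string;
  source_document_id: string;
}

interface TimelineDay {
  date: string;
  events: TimelineEvent[];
}

// Buckets events from any number of documents by day, then orders the days.
function buildDayByDay(events: TimelineEvent[]): TimelineDay[] {
  const byDay = new Map<string, TimelineEvent[]>();
  for (const event of events) {
    const bucket = byDay.get(event.event_date) ?? [];
    bucket.push(event);
    byDay.set(event.event_date, bucket);
  }
  return Array.from(byDay.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([date, dayEvents]) => ({ date, events: dayEvents }));
}

// Example: three documents contribute events, two of them on the same day.
const timeline = buildDayByDay([
  { event_date: "2024-01-12", description: "Chiropractic visit billed", source_document_id: "bill-9" },
  { event_date: "2024-01-05", description: "Collision reported", source_document_id: "police-1" },
  { event_date: "2024-01-05", description: "Emergency department admission", source_document_id: "record-3" },
]);
```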

Each event is a structured row, and the EventDataV4 type shows how much accounting the system keeps per event:
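The original type listing is not reproduced here, so the following is a stand-in under stated assumptions: all field names are hypothetical, and the only confirmed detail is that each event records prompt and completion tokens for both populating and sub-classifying. The prices passed to `eventCostUsd` are placeholders, not real rates.

```typescript
// Hypothetical field names throughout. The article confirms only the four
// kinds of token counts kept per event.
interface EventDataV4 {
  event_date: string;
  description: string;
  source_document_id: string;
  populate_prompt_tokens: number;
  populate_completion_tokens: number;
  subclassify_prompt_tokens: number;
  subclassify_completion_tokens: number;
}

interface TokenPrices {
  promptPerMillion: number;     // USD per one million prompt tokens
  completionPerMillion: number; // USD per one million completion tokens
}

// Cost to produce one timeline event: both LLM calls, both token directions.
function eventCostUsd(event: EventDataV4, prices: TokenPrices): number {
  const prompt = event.populate_prompt_tokens + event.subclassify_prompt_tokens;
  const completion = event.populate_completion_tokens + event.subclassify_completion_tokens;
  return (prompt * prices.promptPerMillion + completion * prices.completionPerMillion) / 1_000_000;
}

// Example with placeholder prices: 2,000 prompt tokens and 500 completion tokens in total.
const sampleCost = eventCostUsd(
  {
    event_date: "2024-01-05", description: "Emergency department admission",
    source_document_id: "record-3",
    populate_prompt_tokens: 1800, populate_completion_tokens: 400,
    subclassify_prompt_tokens: 200, subclassify_completion_tokens: 100,
  },
  { promptPerMillion: 2, completionPerMillion: 8 }
);
```

With those placeholder prices the example event costs (2,000 × 2 + 500 × 8) / 1,000,000 = 0.008 USD.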

Those four token-counting fields at the bottom are a detail I appreciate. The system records prompt and completion tokens for both populating an event and sub-classifying it, per event. That is how you actually understand unit economics in an LLM pipeline: not "what did the month cost" but "what does one timeline event cost to produce," the number that tells you whether the product is viable at scale. Events land in dbd_events and roll up into daily summaries in dbd_summaries.

Regenerating a whole case

When a case needs a full rebuild — new documents arrived, or extraction logic improved — regenerateSnapshotTimeline() reprocesses every document in the case: fetch all jobs, skip if recently generated, queue them all, run each through classify → skip-check → format-detect → generate-events, then repair and deduplicate the merged timeline. Bulk reprocessing is metered by two queues (capacity by page count, and a database operation queue) so a 50-document case rebuild does not knock over the system.

What Holds the Whole Thing Together at Scale?

The stages above describe the happy path. Running thousands of documents a day unattended is mostly about the parts that are not the happy path.

A capacity queue that counts pages, not jobs

The naive way to limit concurrency is to cap the number of in-flight jobs. That fails badly here, because one job might be a 2-page insurance card and the next a 600-page hospital record. So the capacity queue meters by page count:
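A minimal in-memory sketch of a page-count limiter follows. `PageCapacityQueue`, `maxPages`, the strict FIFO ordering, and the rule for documents larger than the whole budget are all assumptions made for illustration; the production queue is not shown in this article.

```typescript
// Admits work while total pages in flight stay within a budget.
class PageCapacityQueue {
  private readonly maxPages: number;
  private pagesInFlight = 0;
  private readonly waiting: { jobId: string; pages: number }[] = [];
  private readonly running = new Map<string, number>();

  constructor(maxPages: number) {
    this.maxPages = maxPages;
  }

  // Queue a job, then admit as many waiting jobs as capacity allows.
  // Returns the ids admitted by this call.
  enqueue(jobId: string, pages: number): string[] {
    this.waiting.push({ jobId, pages });
    return this.admit();
  }

  // Release a finished job's pages and admit whatever now fits.
  complete(jobId: string): string[] {
    const pages = this.running.get(jobId);
    if (pages === undefined) return [];
    this.running.delete(jobId);
    this.pagesInFlight -= pages;
    return this.admit();
  }

  private admit(): string[] {
    const admitted: string[] = [];
    while (this.waiting.length > 0) {
      const next = this.waiting[0];
      if (!next) break;
      const fits = this.pagesInFlight + next.pages <= this.maxPages;
      // Assumption: a document bigger than the whole budget may run alone,
      // otherwise it could never start.
      const idle = this.pagesInFlight === 0;
      if (!fits && !idle) break; // strict FIFO: small jobs do not jump the line
      this.waiting.shift();
      this.running.set(next.jobId, next.pages);
      this.pagesInFlight += next.pages;
      admitted.push(next.jobId);
    }
    return admitted;
  }
}
```

With a budget of 700 pages, a 600-page record and a 2-page card are admitted together, but a second 600-page record waits until the first completes. A job-count limiter set to "2" would have treated the second record the same as the card.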

Metering by pages is the right unit because pages, not jobs, are what consume memory, CPU, and LLM tokens. Five 600-page documents are a far heavier load than fifty 2-page ones, and a job-count limiter cannot tell the difference. This prevents memory exhaustion, CPU overload, and database connection saturation under bursty load.

Work also gets enqueued to SQS for asynchronous processing, with explicit job types (auto_annotation, manual_annotation, timeline_generation, timeline_update) so the queue processor can prioritize and route appropriately.

Event-driven orchestration

A document-orchestration daemon listens to document lifecycle events over AWS EventBridge and triggers the right workflow for each. Events follow a document.{action}.{scope} convention:
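The naming convention can be illustrated with a small parser and a handler table. The specific event names and handler results below are invented examples, not the production event catalog, and the sketch deliberately leaves out the EventBridge wiring itself.

```typescript
// Event names follow document.{action}.{scope}. The concrete actions and scopes
// used below are hypothetical examples.
interface DocumentEvent {
  action: string;
  scope: string;
}

function parseEventName(name: string): DocumentEvent | null {
  const parts = name.split(".");
  if (parts.length !== 3 || parts[0] !== "document") return null;
  const [, action, scope] = parts;
  if (!action || !scope) return null; // reject empty segments like "document..x"
  return { action, scope };
}

type Handler = (event: DocumentEvent) => string;

// Subscribers register per event name; adding a journey means adding an entry.
const handlers: Record<string, Handler> = {
  "document.completed.deposition": () => "index transcript for RAG",
  "document.completed.medical_record": () => "generate timeline",
};

// Returns the handler's result, or null when the name is malformed or unsubscribed.
function dispatch(name: string): string | null {
  const event = parseEventName(name);
  const handler = handlers[name];
  return event && handler ? handler(event) : null;
}
```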

This event spine is what makes the pipeline composable rather than a rigid script. A deposition transcript completing, for instance, kicks off an entirely different downstream workflow (feeding a RAG service) than a medical record completing. New document journeys are added by subscribing to events, not by editing one monolithic function.

Retries, timeouts, and a polling state machine

The error handling is deliberately boring, which is the right instinct. Classification status moves through a small state machine. While a document is in the classifying state, any other worker that needs the result polls with exponential backoff instead of duplicating the work:
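A sketch of that polling pattern, under assumptions: only `classifying` is a status named in this article, and the other statuses, the base delay, the cap, and the attempt limit are illustrative. The sleep function is injected so the caller decides how waiting actually happens.

```typescript
// Only "classifying" comes from the article; the other statuses are placeholders.
type ClassificationStatus = "pending" | "classifying" | "classified" | "failed";

// Delay before the next poll: base * 2^attempt, capped at maxMs. Values are illustrative.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Polls until the status leaves "classifying", waiting longer after each check.
async function waitForClassification(
  getStatus: () => Promise<ClassificationStatus>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 8
): Promise<ClassificationStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status !== "classifying") return status; // someone else finished (or failed) the work
    await sleep(backoffDelayMs(attempt));
  }
  throw new Error("classification still in progress after max attempts");
}
```

The waiting worker never starts its own classification run; it only observes the status until the owning worker moves it forward.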

Timeouts scale with document size, because a 600-page document legitimately takes longer than a 5-page one — a fixed timeout would either kill big documents prematurely or let stuck small ones hang:
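Using the figures quoted in the FAQ at the end of this article (a 30-minute floor plus 20 seconds per page), the budget can be written as a one-line formula. Treating the two terms as additive is an assumption, and the constant and function names are illustrative.

```typescript
const BASE_TIMEOUT_SECONDS = 30 * 60; // the 30-minute floor
const SECONDS_PER_PAGE = 20;

// Timeout budget for one document, assuming floor and per-page terms add up.
function timeoutSeconds(pageCount: number): number {
  return BASE_TIMEOUT_SECONDS + SECONDS_PER_PAGE * Math.max(0, pageCount);
}
```

Under that reading a 5-page document gets 1,900 seconds (just under 32 minutes), while a 600-page record gets 13,800 seconds (230 minutes).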

And the retry policy across classification and event generation is uniformly "try once more, then fail loudly":
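The pattern is small enough to fit in one helper. This is a sketch rather than the pipeline's own code: `withOneRetry` and its `onRetry` hook are hypothetical names.

```typescript
// Runs an operation, retries exactly once on failure, and lets a second
// failure propagate to the caller unchanged.
async function withOneRetry<T>(
  operation: () => Promise<T>,
  onRetry?: (error: unknown) => void
): Promise<T> {
  try {
    return await operation();
  } catch (firstError) {
    if (onRetry) onRetry(firstError); // e.g. log the transient failure
    return await operation();         // no catch here: fail loudly on the second error
  }
}
```

There is deliberately no loop and no delay parameter. Anything that needs more than one retry is treated as a real failure that someone should see.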

A single retry catches the transient failures (a flaky API call, a momentary rate limit) without papering over real bugs through infinite retry loops. If the second attempt fails, the error surfaces. That restraint — retry once, then let it break visibly — is the difference between a pipeline you can debug and one that silently swallows failures until a case is mysteriously incomplete.

What Do the Two Parts Add Up To?

Step back and the shape is clear. The pipeline perceives (OCR + hierarchical classification, Part 1), then acts (route, extract, assemble, Part 2). The intelligence is distributed across the whole system, not concentrated in one model: a cheap general LLM does bulk page typing, a fine-tuned model handles the high-stakes medical call, geometry-aware parsing handles bills, and a deterministic merge and timeline assemble the results. Around all of it sits the part that actually makes it production-grade — decoupled stages, page-count capacity metering, event-driven orchestration, reversible skips, per-event cost accounting, and retry logic that fails loudly.

That is the real lesson of building IDP. The model is a component. The system is the product.

FAQ

Why route and skip documents instead of just extracting everything?

Cost and signal. Extraction is the expensive step — it runs LLM calls per page and produces structured annotations. Documents like insurance cards, fax covers, and return-to-work forms add nothing to a case timeline, so spending extraction budget on them is pure waste. Routing is a cheap classification-driven gate that drops low-value documents before the expensive step. Crucially, every skip is reversible and restores the document's exact prior status, so the cost savings never come at the price of losing information a human might later need.

Why detect a provider format before extracting?

Because document structure varies enormously by source. A hospital's Epic export, a chiropractor's billing statement, and a police incident report encode the same conceptual data in completely different layouts. A single generic extractor would handle all of them poorly. Detecting the format first, from page types, provider names, content patterns, and filename hints, lets the pipeline dispatch to a generator tuned for that shape, which is the difference between a usable extraction and a garbled one.

Why store both flattened and raw OCR output?

The two consumers have different needs. Classification reads flat per-page text (it only cares about words, not their position), so it uses the lightweight finaljsons format. Bill extraction needs spatial structure: to align a CPT code with its charge and date of service, you must know which values share a table row and which column they fall under. That requires the bounding boxes and table blocks in the raw Textract output. Keeping both means each consumer reads the representation shaped for it.

How does the system meter load when document sizes vary so much?

It meters by page count, not job count. A capacity queue tracks total pages in flight against a maximum, and only admits a new document when there is room for its pages. This matters because resource consumption (memory, CPU, LLM tokens) scales with pages, not documents. Five 600-page records are a far heavier load than fifty 2-page cards, and a job-count limiter would treat them as equal. Timeouts scale the same way: a 30-minute floor plus 20 seconds per page.

How are AI-generated values distinguished from human corrections?

Every annotation carries generated and is_modified flags in its metadata, and both annotations and timeline events keep full authorship and soft-delete fields (created_by, updated_by, deleted_by, and the matching timestamps). Together they preserve provenance — the system always knows what the model produced versus what a reviewer changed. In a domain where the output can end up in litigation, that audit trail is a hard requirement, not a nicety.


This is Part 2 of a two-part series on building a production Intelligent Document Processing pipeline. Part 1 covers OCR and the classification hierarchy.

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.

Cite this Article

Aaron. "IDP Part 2: Routing, Extraction & Timeline Generation." fp8.co, June 1, 2026. https://fp8.co/articles/intelligent-document-processing-extraction-timeline
