Processing PDF invoices by hand doesn't scale. I was working on a reconciliation project where the finance team had thousands of PDF invoices that needed to be matched against records in their accounts payable system. The data lived in the PDFs — invoice numbers, dates, totals, vendor names — but getting it out meant someone had to open each one, find the fields, and type them into a spreadsheet. It was slow, error-prone, and a recurring time sink every month.
I wanted to automate the extraction end to end. Drop a PDF in, get structured data out. No manual intervention, no per-format configuration. The system needed to handle invoices, statements, contracts — whatever showed up in the inbox.
The twist was that these invoices came from hundreds of different vendors, and every single one formats their invoices differently. Different layouts, different labels, different date formats, different ways of presenting totals. Writing extraction rules per vendor wasn't going to work — there were too many, and new ones showed up all the time.
What I ended up building was a two-stage AI pipeline using AWS Textract for OCR and Amazon Bedrock for the AI normalization layer. Textract reads the page. Bedrock figures out what it all means and boils every format down to one consistent JSON structure — regardless of how the vendor laid out the invoice.

The Real Problem: Every Vendor's Invoice Looks Different
This was the fundamental challenge. We weren't dealing with one invoice format — we were dealing with hundreds. Every vendor sends their invoices laid out differently. Some have the invoice number at the top right. Others bury it in a table. Some label it "Invoice #", others call it "Reference No." or "Document ID." Dates show up as "01/15/2025," "January 15, 2025," "2025-01-15," or "15 Jan 25." Totals might be in a summary box, a footer line, or the last row of a table.
If you try to solve this with traditional OCR alone, you end up writing extraction rules for each vendor's format. That might work for your top 10 vendors, but when you have hundreds — and new ones showing up regularly — it's a losing game. You'd spend more time maintaining the rules than you'd save on manual entry.
That's where the AI comes in. The AI is the normalizer. It doesn't care that Vendor A puts the invoice number in the header and Vendor B puts it in line 3 of a table. You give it the raw extracted text and tell it: "Find me the invoice number, the date, the total, and the vendor." It figures out which field is which regardless of layout, labeling, or format — and returns it in a single consistent JSON structure every time.
Why Two AI Services Instead of One?
So if the AI handles the normalization, why not just send the PDF directly to the model and skip OCR entirely?
I actually tried this first. Bedrock supports direct PDF input — you base64-encode the file and send it straight to the model. But there are practical limits. Bedrock caps direct PDF input at 5MB, and a lot of real-world invoices — especially scanned multi-page documents — blow right past that.
More importantly, even when the files were small enough, the results were inconsistent. The model would miss table data, misread amounts, or confuse fields that were visually close together on the page. When I tried the direct-to-Bedrock approach as the primary path across roughly 15,000 documents, it failed on nearly all of them. It was trying to solve two problems at once, and neither one got the attention it needed.
The two problems are:
- The spatial problem — Where is the text on the page? What's a form label vs. a value? What belongs to which table column? This is a vision problem.
- The semantic problem — Out of all these fields, which one is the invoice number? Is this date the invoice date or the due date? Is this amount the subtotal or the total? This is a comprehension problem.
Textract is purpose-built for the spatial problem. It doesn't just do OCR — it identifies key-value pairs (like "Invoice Date: 12/15/2025"), detects table structures with rows and columns, and maps out form fields. It understands the layout of the page in a way that a general-purpose language model can't reliably match.
Bedrock handles the semantic problem — the normalization. I take all that structured data from Textract and hand it to the AI model. Now the model doesn't have to figure out where things are on the page. It just has to look at a list of key-value pairs and lines of text and decide: "This is the invoice number. This is the date. This is the total." And it does that consistently, regardless of whether the invoice came from a massive distributor or a one-person shop with a Word document template.
The combination means Textract does what it's best at (reading the page), and the AI does what it's best at (understanding the content and normalizing it into a single format). Neither one alone would have worked nearly as well.
The AWS Architecture
The whole pipeline is serverless. Drop a PDF into an S3 bucket, and everything else happens automatically — no servers to manage, no workers to monitor.

Here's the flow:
- PDF lands in S3 — uploaded to the inbox/ prefix of an S3 bucket
- S3 Event Notification fires — triggers automatically when a new object appears in inbox/
- SQS Queue receives the event — buffers the work and handles retries if the Lambda fails
- Lambda picks up the message — runs Textract, sends results to Bedrock, writes output to S3 and the database
- PDF moves through prefixes — inbox/ → processing/ → processed/ (or error/ if something fails)
The S3 prefix movement is a nice pattern. You always know the state of a document by where it sits in the bucket. If something fails mid-processing, you can see exactly which files are stuck in processing/ and investigate.
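S3 has no native "move", so each prefix transition is a copy followed by a delete. Here's a minimal sketch of that helper; I've injected the client and command classes as parameters so it's easy to stub in tests, but real code would import them from @aws-sdk/client-s3 directly.

```javascript
// "Move" an S3 object between state prefixes (inbox/ -> processing/ -> processed/).
// `s3` is an S3Client; CopyObjectCommand/DeleteObjectCommand come from @aws-sdk/client-s3.
// Assumes `key` starts with `fromPrefix`.
async function moveObject({ s3, CopyObjectCommand, DeleteObjectCommand },
                          bucket, key, fromPrefix, toPrefix) {
  const destKey = toPrefix + key.slice(fromPrefix.length);
  await s3.send(new CopyObjectCommand({
    Bucket: bucket,
    CopySource: `${bucket}/${key}`, // keys with special characters need URL-encoding here
    Key: destKey,
  }));
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
  return destKey;
}
```

The copy-then-delete order matters: if the Lambda dies between the two calls, you end up with a harmless duplicate rather than a lost document.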
The Lambda: Textract + Bedrock in One Function
The Lambda function does all the heavy lifting. When an SQS message arrives, it parses the S3 event, grabs the PDF, and runs it through both AI stages.
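The wiring at the top of the handler is mostly unwrapping envelopes: each SQS record's body is a JSON-encoded S3 event notification. A simplified sketch of that unwrapping (the helper name is mine, not from the actual code):

```javascript
// Unwrap SQS -> S3 event envelopes into a flat list of { bucket, key }.
function parseS3Records(sqsEvent) {
  const objects = [];
  for (const record of sqsEvent.Records ?? []) {
    const s3Event = JSON.parse(record.body);
    for (const s3Record of s3Event.Records ?? []) {
      objects.push({
        bucket: s3Record.s3.bucket.name,
        // S3 URL-encodes object keys in event notifications ("+" for spaces).
        key: decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return objects;
}

// The handler then just iterates:
// for (const { bucket, key } of parseS3Records(event)) { ...Textract -> Bedrock... }
```

The key-decoding step is easy to forget and bites you the first time a vendor uploads "Invoice 1234.pdf" with a space in the name.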
Step 1: Run Textract with FORMS + TABLES
```javascript
// Kick off an asynchronous Textract analysis job. This returns a JobId;
// the actual Blocks come back later from GetDocumentAnalysis once the job completes.
const start = await textract.send(
  new StartDocumentAnalysisCommand({
    DocumentLocation: {
      S3Object: { Bucket: bucket, Name: key }
    },
    FeatureTypes: ["FORMS", "TABLES"]
  })
);
```

The FORMS and TABLES feature types are key. Basic OCR just gives you text. With these enabled, Textract identifies key-value pairs (like "Invoice Date: 12/15/2025") and detects table structures with rows and columns. That structured output is what makes the Bedrock step so much more accurate.
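Because StartDocumentAnalysis is asynchronous (it returns a JobId rather than results), the Lambda has to poll GetDocumentAnalysis until the job finishes and collect every page of blocks. A simplified sketch of that loop, with the page fetch injected as a callback so it's easy to test; in real code the callback wraps `textract.send(new GetDocumentAnalysisCommand({ JobId, NextToken }))`:

```javascript
// Poll a Textract job to completion and gather all Blocks across result pages.
// `getPage(nextToken)` fetches one page of GetDocumentAnalysis output.
async function waitForAnalysis(getPage, { delayMs = 2000 } = {}) {
  const blocks = [];
  let nextToken;
  for (;;) {
    const page = await getPage(nextToken);
    if (page.JobStatus === "IN_PROGRESS") {
      await new Promise((r) => setTimeout(r, delayMs)); // wait before polling again
      continue;
    }
    if (page.JobStatus === "FAILED") {
      throw new Error(page.StatusMessage || "Textract job failed");
    }
    blocks.push(...(page.Blocks ?? []));
    if (!page.NextToken) return blocks; // last page reached
    nextToken = page.NextToken;         // more pages of results to fetch
  }
}
```

Multi-page invoices routinely produce more than one result page, so skipping the NextToken handling silently drops the back half of the document.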

Step 2: Send to Bedrock for AI Extraction
Here's where the normalization happens. I send the structured Textract data to Amazon Bedrock with a prompt that tells the model exactly what I need back.
The prompt is specific about the output format:
You extract key information from invoices or invoice-like documents.
Focus on identifying the invoice number - it is the most important field.
Return ONLY valid JSON matching exactly:
{
"document_type": "invoice|statement|contract|other",
"specific_number": {
"label": "Invoice Number|Claim Number|PO Number|...",
"value": "string or null",
"confidence": 0.0
},
"key_fields": {
"date": null, "due_date": null,
"total": null, "tax": null,
"vendor": null, "customer": null
},
"summary": "",
"notes": []
}
Rules:
- Confidence is 0.0 to 1.0. If not found, value=null and confidence=0.
- ALL dates MUST use format MM/DD/YYYY.
- ALL dollar amounts MUST use format $X,XXX.XX.

Getting the prompt right is critical. This is the part that took the most iteration. The AI will do exactly what you tell it to — and if you're not specific enough, you'll get inconsistent results across thousands of documents.
A few things I learned through trial and error:
- Define the exact JSON schema. I tell the model exactly what shape to return. No extra keys, no variations. Without this, some documents come back with extra fields, others are missing fields, and your parsing code has to handle every variation. Lock down the schema and the output is predictable.
- Enforce formats in the prompt. Dates as MM/DD/YYYY, amounts as $X,XXX.XX. This is huge. Without explicit format rules, the AI will return dates in whatever format it finds on the document — "January 15, 2025" from one vendor, "2025-01-15" from another, "01/15/25" from a third. The whole point of this pipeline is normalization, so the prompt has to enforce it. If you don't do this here, you end up writing format conversion code downstream for every variation.
- Confidence scores. The model rates its own confidence on the invoice number. A 0.95 means it's pretty sure. A 0.4 means the document might not even have an invoice number. This is valuable downstream for deciding what needs human review vs. what can flow through automatically.
- Give the model a place to flag problems. The notes array lets the AI say "multiple invoice numbers found" or "document appears to be a statement not an invoice." Without this, the model silently picks one and you never know it was uncertain. Much better to have it tell you.
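Locking down the schema in the prompt only helps if you also validate what comes back before trusting it. A minimal sketch of the validation layer between Bedrock and the database; the field names follow the schema above, and the specific checks are illustrative:

```javascript
// Parse and validate the model's response against the expected schema.
function parseExtraction(rawText) {
  // Models sometimes wrap JSON in markdown fences despite instructions; strip defensively.
  const cleaned = rawText
    .replace(/^```(?:json)?\s*/m, "")
    .replace(/\n?```\s*$/m, "")
    .trim();
  let data;
  try {
    data = JSON.parse(cleaned);
  } catch {
    throw new Error("Bedrock returned malformed JSON");
  }
  for (const field of ["document_type", "specific_number", "key_fields"]) {
    if (!(field in data)) throw new Error(`Missing required field: ${field}`);
  }
  const conf = data.specific_number?.confidence;
  if (typeof conf !== "number" || conf < 0 || conf > 1) {
    throw new Error("Confidence must be a number between 0 and 1");
  }
  return data;
}
```

Anything that throws here routes the document to the error path instead of writing a half-formed record to the database.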
Step 3: Write the Results
The Lambda writes a .ai.json file to the processed/ prefix in S3 and upserts a database record using a SQL MERGE on source_file — so reprocessing the same PDF updates instead of duplicating. After writing, the PDF moves from processing/ to processed/, or to error/ if something fails.
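For the upsert, the shape of the MERGE looks roughly like the statement below. This is a sketch, not the actual query: the table and column names are illustrative, and the @param placeholder style assumes a SQL Server driver such as mssql.

```javascript
// Idempotent upsert keyed on source_file: reprocessing the same PDF updates
// the existing row instead of inserting a duplicate.
// Hypothetical table/column names; @params assume a SQL Server driver.
const upsertSql = `
MERGE invoices AS target
USING (SELECT @source_file AS source_file) AS src
  ON target.source_file = src.source_file
WHEN MATCHED THEN
  UPDATE SET invoice_number = @invoice_number, invoice_date = @invoice_date,
             total = @total, vendor = @vendor, confidence = @confidence
WHEN NOT MATCHED THEN
  INSERT (source_file, invoice_number, invoice_date, total, vendor, confidence)
  VALUES (@source_file, @invoice_number, @invoice_date, @total, @vendor, @confidence);`;
```

Keying on source_file rather than invoice number is deliberate: two vendors can reuse the same invoice number, but the S3 key is unique.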

The Textract Fallback — When OCR Fails
This one surprised me. Some PDFs just don't work well with Textract. Scanned documents with poor quality, unusual layouts, or PDFs that are actually just embedded images. Textract would time out or return very few blocks.
Rather than failing the entire document, I built a fallback: send the PDF directly to Bedrock. Bedrock can accept PDFs directly (up to ~5MB). The extraction quality isn't as good as the two-stage approach — the model has to handle both the spatial and semantic problems — but it's significantly better than returning nothing.
The .ai.json output records whether Textract was used or if it fell back to direct PDF ("status": "FALLBACK_PDF"), so you can audit which documents might need a closer look.
This is a non-ideal workaround. I'm pointing this out for transparency. The direct PDF path loses the structured key-value pairs and table data that Textract provides. But in practice, getting 80% of the fields from a difficult document is better than getting 0%.
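The fallback request itself is just the PDF bytes, base64-encoded, sent alongside the same extraction prompt. A hedged sketch of building that request body in the Anthropic messages format that Bedrock's Claude models accept; the exact shape is model-dependent, so treat the field names as assumptions:

```javascript
// Build an InvokeModel request body for direct-PDF extraction (fallback path).
// Guards the size limit before attempting the call.
function buildFallbackRequest(pdfBuffer, prompt, maxBytes = 4.5 * 1024 * 1024) {
  if (pdfBuffer.length > maxBytes) {
    throw new Error(`PDF too large for direct Bedrock input (${pdfBuffer.length} bytes)`);
  }
  return {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 2048,
    messages: [{
      role: "user",
      content: [
        {
          type: "document",
          source: {
            type: "base64",
            media_type: "application/pdf",
            data: pdfBuffer.toString("base64"), // raw PDF bytes, base64-encoded
          },
        },
        { type: "text", text: prompt },
      ],
    }],
  };
}
```

The size guard sits slightly under the 5MB cap because base64 encoding and the prompt itself add overhead to the request.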
When Everything Fails — The Error Pipeline Matters
Here's something that doesn't get talked about enough in AI pipeline posts: you need a plan for errors, not just results.
Out of ~15,000 documents, the vast majority processed fine through the two-stage pipeline. But a meaningful number didn't — Textract timed out, Bedrock returned malformed JSON, the PDF was corrupt, the file was too large, or the document just wasn't an invoice at all (contracts, cover letters, blank pages).
When both Textract and the direct PDF fallback fail, the document lands in the error/ prefix in S3 with a detailed .error.json file next to it. That file captures what went wrong — which stage failed, the error message, timestamps. This gives you a clear picture of what needs attention.
But here's the reality: some documents still end up requiring manual entry. No pipeline handles 100% of real-world data. The goal isn't zero errors — it's making the error path visible and manageable. You want to know exactly which documents failed, why they failed, and have a clean way to either fix and reprocess them or route them to a person.
A few things that helped:
- Error sidecar files per stage. If Textract fails, a .textract-error.json gets written. If Bedrock fails, a .bedrock-error.json gets written. You can tell at a glance which stage broke.
- The error/ prefix preserves the original folder structure. If a file was in inbox/vendor-123/invoice.pdf, it lands in error/vendor-123/invoice.pdf. Easy to find, easy to reprocess — just move it back to inbox/.
- Error counting scripts. I built simple scripts to scan the error files and produce counts by error type. This tells you if you have a systemic issue (Textract throttling, Bedrock model changes) vs. one-off bad files.
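The counting logic is simple once the sidecar files are parsed. A sketch of the tallying step (the part that lists and reads the .error.json files from S3 is elided; each parsed record is assumed to carry a stage and an error type):

```javascript
// Tally parsed error sidecar records by failing stage and error type,
// most frequent first, so systemic failures float to the top.
function countErrors(errorRecords) {
  const counts = new Map();
  for (const rec of errorRecords) {
    // e.g. { stage: "textract", error: "ThrottlingException" }
    const key = `${rec.stage}: ${rec.error}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

One dominant entry at the top of the list means a systemic issue worth fixing; a long tail of singletons means individually bad files.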
The pipeline that processes 14,800 out of 15,000 documents automatically is valuable. But the system that tells you exactly which 200 need human attention — and why — is what makes it production-ready.
What I'd Do Differently
A few things I've been thinking about for the next iteration:
- Store dates and amounts as proper types. Right now dates are nvarchar and amounts are varchar with dollar signs. This means every comparison requires TRY_CAST and REPLACE gymnastics. It works, but it's fragile. I'd normalize these at write time in the Lambda instead.
- Add a review queue for low-confidence extractions. The confidence scores are there but nothing surfaces the low-confidence results for human review yet. A second SQS queue that catches anything under 0.7 confidence and routes it to a review UI would be a natural next step.
- Dead letter queue for persistent failures. Right now failed documents land in the error/ prefix in S3, which works but requires manual inspection. A DLQ with CloudWatch alarms would make this more operationally solid.
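The type normalization is the easiest of these to do at write time. A sketch of the two converters, assuming the MM/DD/YYYY and $X,XXX.XX formats that the prompt already enforces as input:

```javascript
// "$1,234.56" -> 1234.56 (numeric column), or null if unparseable.
function parseAmount(s) {
  if (typeof s !== "string") return null;
  const n = Number(s.replace(/[$,]/g, ""));
  return Number.isFinite(n) ? n : null;
}

// "01/15/2025" -> "2025-01-15" (ISO string for a proper DATE column).
function parseDate(s) {
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(s ?? "");
  return m ? `${m[3]}-${m[1]}-${m[2]}` : null;
}
```

Returning null rather than throwing keeps a single odd field from failing the whole document; the null simply shows up as a gap to review.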
Conclusion
The core problem was never "how do I OCR a PDF." It was "how do I take invoices from hundreds of vendors — all with different layouts, labels, and formats — and normalize them into one consistent structure." That's the problem the two-stage pipeline solves.
Textract handles the reading. The AI handles the normalization. You don't write rules per vendor. You don't maintain a mapping table of "Vendor A calls it Reference No., Vendor B calls it Invoice #." The model figures that out, and every document comes out the other side in the same JSON format ready for matching.
If you're evaluating AWS services for document processing, I'd recommend starting with Textract + Bedrock together rather than either one alone. Textract alone gives you raw data in whatever format the vendor decided to use. An LLM alone struggles with the spatial layout of real documents. The combination is where it clicks.
I'm still iterating on the matching and reconciliation side of this — automatically pairing the extracted data with accounts payable import records using vendor number, date, and amount as a composite key. If there's interest, I'll cover that in a follow-up post.
Have you dealt with multi-format invoice extraction? I'd be curious whether you went the rules-based route or let AI handle the normalization. Drop a comment below — I'd love to hear what worked and what didn't.

