LLM-Powered Document Validation and Auditing via AWS Bedrock

AWS

Bedrock

document validation

fraud detection

prompt engineering

lambda

serverless

Author

Kara C. Hoover

Published

February 2026

Executive Summary

Problem: Traditional identity document validation pipelines rely on OCR for text extraction followed by rule-based logic checks. This approach extracts text without understanding semantics, requires explicit programming for every validation scenario, and fails on novel or non-standard document layouts – missing exactly the kinds of inconsistencies that signal fraud.

Approach: Built a proof-of-concept serverless auditing pipeline using Claude 3 Haiku via AWS Bedrock to evaluate whether LLMs can detect fraudulent entries in document metadata before they are committed to a permanent record. The system simulates a biometric document ingestion workflow: documents are scanned upstream (via OCR) and their metadata – scores, flags, and field values – is stored as JSON records in a PostgreSQL database. A Lambda function retrieves unaudited records and passes them to Claude, acting as a “Senior Forensic Document Auditor,” which returns a structured PASS/FAIL verdict with reasoning. Ten synthetic audit log entries (8 valid, 2 flagged as malicious) were processed using three prompting strategies (minimal zero-shot, schema-driven, chain-of-thought) and evaluated on structured output reliability, response time, and anomaly detection rate.

Insights: Schema-driven prompting is an excellent design choice for production: it achieved 100% structured output reliability and 100% anomaly detection at 1.6s average response time. LLMs catch semantic violations that OCR cannot detect (e.g., expiration dates that precede issue dates, future-dated documents, or formatting inconsistencies), but they are costly and not suitable for all cases. A hybrid architecture is recommended: fast OCR handles bulk volume and LLM handles flagged cases (10–20% of documents) where semantic validation justifies the 8x cost premium.

Significance: As document fraud becomes more sophisticated, rule-based validation systems are increasingly insufficient. This project demonstrates a scalable, cost-aware integration pattern for deploying LLM semantic reasoning within enterprise document workflows – applicable to identity verification, compliance screening, and any domain requiring explainable anomaly detection at scale. While this project focused on identity documents, the hybrid approach is transferable to other document types requiring validation (e.g., legal papers, vaccination and medical records).

Key Findings

Prompt design is the most important variable in production reliability
Schema-driven prompting achieved 100% JSON parse success and 100% anomaly detection
LLMs detected semantic anomalies that are structurally invisible to OCR
Cost analysis at scale supports selective deployment: processing 10% of documents through LLM review reduces monthly cost by 70% relative to all-LLM processing while retaining 95%+ of semantic anomaly detection

Applied Findings

A validated hybrid architecture combining LLM semantic analysis with traditional identity verification pipelines outperforms either approach alone.
Produces cost and performance benchmarks directly applicable to enterprise fraud screening, compliance workflows, and identity risk management systems.

Research Question

Can large language models augment traditional rule-based document validation by detecting semantic anomalies that static logic checks miss, and if so, which prompting strategy optimizes the reliability-cost-speed trade-off for serverless production deployment?

Research Answers

Prompt Design Determines Reliability

The three prompting strategies produced dramatically different results despite using the same model and the same records. Minimal zero-shot prompting – brief instructions, no schema – achieved only 67% JSON parse success and detected no anomalies in the test cases. Schema-driven prompting, which provided an explicit JSON schema and specified validation requirements, achieved 100% parse success and 100% anomaly detection at a mean response time of 1.6 seconds. Chain-of-thought prompting, which required the model to reason step by step before rendering a verdict, also achieved 100% on both metrics but at 2.3 seconds – a 44% latency increase with no accuracy gain.

Strategy	JSON Success	Avg Time	Anomaly Detection
Minimal	67%	1.4s	0%
Schema-Driven	100%	1.6s	100%
Chain-of-Thought	100%	2.3s	100%

Interpretation: Schema-driven prompting is the production design choice. The investment is in prompt engineering, not model selection – vague instructions reliably produce inconsistent outputs regardless of model capability.

Semantic Anomaly Detection: What Rule-Based Systems Cannot Do

The clearest demonstration of LLM value came from the date anomaly test case. The metadata record contained an issue date in the future (06/20/2025) and an expiration date that preceded the issue date (06/20/2024). A rule-based system would require an explicitly programmed check to catch this; the LLM identifies it from semantic understanding of what document dates mean – no rule required.

The schema-driven response returned structured JSON identifying both violations: issue date in the future, expiration date before issuance, and a composite “date logic violation” flag – plus a confidence level of “low” and a rejection recommendation. This is the system prompt doing the work: Claude was instructed to act as a Senior Forensic Document Auditor checking for logical inconsistencies, and the schema-driven format ensured the output was reliably parseable.

The formatting inconsistency test case showed a complementary strength: the chain-of-thought strategy identified that a record with a lowercase name, three different date format conventions, and hyphenated ID numbers was anomalous in context – without requiring any pre-programmed rules about format standards. It normalized the data, explained its reasoning in natural language, and recommended human review rather than automatic rejection, which is the appropriate response for formatting irregularities that may have reasonable explanations.

Interpretation: LLMs add value by performing the semantic reasoning layer that rule-based systems architecturally cannot provide. The value is concentrated in records flagged as suspicious – a small subset of the ingestion stream.

Hybrid Architecture Optimizes Cost and Coverage

At $0.008 per document, LLM processing costs 8x more than traditional rule-based processing at $0.001 per document. At high volume, all-LLM processing is not viable – $8,000 per million documents versus $1,000 for rule-based-only. The cost case for a hybrid architecture is straightforward.

Figure 1. Prompt strategy performance comparison: JSON parse success, average response time, and anomaly detection rate across minimal, schema-driven, and chain-of-thought approaches.

In the recommended architecture, upstream OCR scans the full document stream and a rule-based filter passes clean records directly to the database (estimated 80–90% of volume), routing only flagged records to Lambda for LLM analysis. The LLM performs semantic validation and generates natural language explanations for human reviewers. At 100K documents per month, this architecture costs approximately $170/month – compared to $800 for all-LLM and $100 for rule-based-only – while retaining semantic detection capability for the 10% of documents where it matters.

Interpretation: The hybrid pipeline uses rule-based processing for faster and cheaper bulk throughput and LLMs for semantic reasoning on ambiguous cases. The cost premium is justified when the decision stakes are highest (fraud versus reasonable errors).

Next Steps

The current proof-of-concept used synthetic metadata records with controlled anomalies. Production validation would require testing against a real document corpus with known ground-truth labels to establish precision and recall for anomaly detection across document types and jurisdictions. Key open questions include: how performance degrades on genuinely novel record patterns the model has not encountered; how to handle hallucination risk in production (the model may generate plausible but incorrect reasoning); and whether fine-tuning on domain-specific audit data improves reliability beyond prompt engineering alone.

A significant infrastructure challenge identified during development is Lambda-to-RDS networking. Connecting a Lambda function outside a VPC to a publicly accessible RDS instance produces InterfaceError timeouts due to AWS internal routing overhead. The correct production architecture places Lambda inside the same private VPC as RDS, with a NAT Gateway or VPC Endpoints to maintain Bedrock API access while keeping database traffic off the public internet.

The hybrid architecture also assumes a stable confidence signal as the routing criterion. In practice, calibrating the threshold – deciding what fraction of records gets LLM review – requires empirical data on the false negative rate of rule-based-only processing in the specific operational context.

Study Design

Data Source: Synthetic audit log records generated for this proof-of-concept using the faker library. Ten JSON records inserted into a PostgreSQL (RDS) audit_logs table representing biometric document scans: 8 with valid scores (0.95) and 2 with low scores (0.30) simulating flagged/malicious entries. No real identity documents or images were used; the LLM operates on structured metadata, not raw images.

Data Handling: Unaudited records (where ai_status IS NULL) were retrieved from RDS and passed as JSON payloads to the Bedrock API. Each record included user_id, score, and doc_type. Claude was instructed via system prompt to act as a Senior Forensic Document Auditor, check for logical inconsistencies, and return {"status": "PASS"/"FAIL", "reasoning": "..."}. Results were written back to the ai_status and ai_reasoning columns. Audit outcomes were visualized as a bar chart of PASS/FAIL counts.

Analytical Approach:

Initialized PostgreSQL schema (schema.sql) with audit_logs table storing user_id, document_type, ai_status, ai_reasoning, and raw_data (JSONB)
Generated 10 synthetic records using faker and inserted via pg8000.native; 2 records seeded with low verification scores (0.30) to simulate fraud cases
Built Lambda function (Python 3.12) to retrieve unaudited records from RDS and invoke Claude 3 Haiku via Bedrock with a forensic auditor system prompt
Tested three prompting strategies varying in structure: minimal zero-shot, schema-driven zero-shot, zero-shot chain-of-thought
Evaluated each strategy on JSON parse success, response time, and PASS/FAIL accuracy against known ground truth
Modeled cost at scale under three architectures: rule-based only, LLM-only, hybrid
Derived hybrid pipeline architecture recommendation balancing cost, latency, and semantic coverage; documented VPC/networking lessons learned for Lambda-to-RDS connectivity

Project Resources

Repository: github.com/kchoover14/document-validation-bedrock

Data: Synthetic audit log records generated via the faker library – no real identity documents used. No external data source.

Code:

prompt engineering via lambda.ipynb – main pipeline: RDS connection, mock data generation, Bedrock invocation, audit results, visualization
mockDatabase.ipynb – standalone mock data generator showing synthetic record construction
schema.sql – PostgreSQL table definition for audit_logs
system_prompt.txt – forensic auditor system prompt used with Claude 3 Haiku

Project Artifacts:

Figures (n=1)

Environment:

requirements.txt – install pinned Python package versions with pip install -r requirements.txt

License:

Tools & Technologies

Languages: Python 3.12

Tools: AWS Bedrock | AWS Lambda | AWS RDS (PostgreSQL 17)

Packages: pg8000 | boto3 | faker | pandas | matplotlib

Expertise

Transferable Expertise: This project demonstrates the ability to design and evaluate AI integration architectures under real-world constraints – not just whether a technology works, but when, for whom, and at what cost. The hybrid pipeline recommendation reflects a pattern-recognition skill applicable to any enterprise context where AI capability must be balanced against operational cost, latency, and explainability requirements.