Few-Shot Prompt Engineering Achieves Production-Grade Biometric Security Auditing Without Fine-Tuning
Executive Summary
Problem: Biometric systems must make real-time pass/fail/retry decisions about sensor data quality, but traditional rule-based approaches require explicit thresholds for every sensor combination, struggle with environmental edge cases, and cannot provide natural language remediation guidance.
Approach: Developed a prompt engineering framework using AWS Bedrock to evaluate whether LLMs can classify biometric sensor telemetry and generate structured remediation logic without custom fine-tuning. Tested Amazon Nova Lite and Claude 4.5 Opus across three edge case scenarios (low light, 2D spoof, ambiguous glare) using few-shot prompting and systematic parameter sweeps across Temperature, Top-K, and Top-P.
Insights: Few-shot prompting with three carefully designed examples was sufficient to teach both models sensor modality prioritization, presentation attack detection, and remediation strategy generation. Parameter configuration – not model selection – is the critical variable: Temperature 0 with low Top-K (10–20) produces deterministic, comprehensive audit logs; Temperature 1 with the same Top-K constraint produces nuanced, prioritized user-facing instructions. Both models achieved 100% JSON schema compliance across all test cases.
Significance: This project demonstrates that production-grade security auditing is achievable through prompt engineering alone, without the time and data costs of fine-tuning. The tiered parameter strategy – one prompt template, multiple configurations for different audiences – is a generalizable pattern applicable to any domain requiring structured LLM outputs with varying levels of determinism.
Key Findings
- 100% JSON schema compliance across both models and all test cases
- Three few-shot examples were sufficient for reliable sensor modality prioritization and spoof detection
- Temperature 0 is optimal for audit logs; Temperature 1 with low Top-K (10–20) is optimal for user-facing instructions
- Low Top-K (10–20) acts as a safety rail against hallucination regardless of Temperature setting
- Nova Lite and Claude 4.5 Opus reached identical decisions on all test cases; they differ in reasoning depth and cost, not accuracy
Research Question
Can few-shot prompt engineering – without fine-tuning – achieve reliable structured output and accurate anomaly classification for biometric sensor validation, and how should Temperature and Top-K parameters be configured for different operational audiences?
Research Answers
Few-Shot Design Determines Classification Reliability
The prompt structure combined four elements: role assignment (“Senior Biometric Security Auditor”), task specification (analyze for presentation attacks or environmental interference), a negative output constraint (“Return ONLY a JSON object”), and three few-shot examples covering distinct edge cases – specular glare on eyewear, lower face occlusion, and low ambient light. Both models returned 100% JSON-compliant responses across all test cases.
The negative constraint (“Return ONLY”) proved as important as the examples themselves. Without explicit instruction about what not to include, LLMs default to conversational framing that breaks downstream JSON parsing. The few-shot examples established sensor modality logic – when to trust IR over RGB, what flat depth profiles indicate – that generalized correctly to the test scenarios.
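The four-element structure described above can be sketched as a reusable template. This is a minimal illustration, not the study's verbatim prompt: the example inputs, confidence values, JSON field names, and example outputs are assumptions for demonstration.

```python
def build_system_prompt() -> str:
    """Combine role, task, negative output constraint, and few-shot examples."""
    role = "You are a Senior Biometric Security Auditor."
    task = (
        "Analyze the sensor telemetry below for presentation attacks "
        "or environmental interference. Decide PASS, FAIL, or RETRY."
    )
    # The negative constraint: tells the model what NOT to include,
    # which prevents conversational framing that breaks JSON parsing.
    constraint = (
        'Return ONLY a JSON object with keys "decision", "reason", '
        'and "remediation". No prose before or after the object.'
    )
    # Three few-shot examples covering the edge cases named in the text;
    # all numbers and outputs here are hypothetical placeholders.
    few_shot = """\
Example 1 (specular glare on eyewear):
Input: {"rgb_confidence": 0.52, "ir_confidence": 0.61, "note": "glare on lenses"}
Output: {"decision": "RETRY", "reason": "...", "remediation": "..."}

Example 2 (lower face occlusion):
Input: {"rgb_confidence": 0.38, "ir_confidence": 0.84, "note": "mask detected"}
Output: {"decision": "RETRY", "reason": "...", "remediation": "..."}

Example 3 (low ambient light):
Input: {"rgb_confidence": 0.18, "ir_confidence": 0.93, "note": "approx. 3 lux"}
Output: {"decision": "PASS", "reason": "...", "remediation": "..."}"""
    return "\n\n".join([role, task, constraint, few_shot])
```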
Test Cases: All Three Scenarios Resolved Correctly
Test Case 1 – Low Light (IR-Heuristic): RGB confidence 0.21, IR confidence 0.96 in a 2-lux environment. Expected: PASS, trusting IR over RGB. Both models returned PASS, correctly prioritizing IR liveness signal over failed RGB in low ambient light.
Test Case 2 – 2D Spoof (Security Check): RGB confidence 0.98, IR confidence 0.15 with a flat surface depth profile. Expected: FAIL, flagging 2D presentation attack. Both models returned FAIL. The high RGB score – which a threshold rule might pass – did not mislead either model; the low IR reading was correctly interpreted as a liveness failure.
Test Case 3 – Ambiguous Glare (Remediation Check): RGB confidence 0.55, IR confidence 0.60 with transition lens glare from overhead LED. Expected: RETRY with specific remediation. Both models returned RETRY with actionable guidance. Nova Lite suggested repositioning and removing non-prescription eyewear; Claude 4.5 Opus identified photochromic lens darkening as the primary cause and provided positioning-specific guidance.
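The three scenarios reduce to compact structured payloads. The confidence values and expected decisions are those reported above; the field names (`rgb_confidence`, `ir_confidence`, `note`) are illustrative assumptions about the metadata format.

```python
# The three test cases as structured JSON metadata, paired with their
# known expected decisions for evaluation.
TEST_CASES = [
    {
        "name": "low_light_ir_heuristic",
        "input": {"rgb_confidence": 0.21, "ir_confidence": 0.96,
                  "note": "2-lux ambient environment"},
        "expected": "PASS",
    },
    {
        "name": "2d_spoof_security_check",
        "input": {"rgb_confidence": 0.98, "ir_confidence": 0.15,
                  "note": "flat surface depth profile"},
        "expected": "FAIL",
    },
    {
        "name": "ambiguous_glare_remediation",
        "input": {"rgb_confidence": 0.55, "ir_confidence": 0.60,
                  "note": "transition lens glare from overhead LED"},
        "expected": "RETRY",
    },
]
```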
Parameter Optimization: Temperature and Top-K
Temperature controls output style more than output accuracy in this setting. Temperature 0 produces comprehensive, deterministic responses – all possible causes listed, all possible remediations included, phrasing identical across repeated runs. This is the correct configuration for forensic audit logs and compliance documentation. Temperature 1 produces prioritized, targeted responses – the most likely cause identified first, more sophisticated phrasing – appropriate for real-time user-facing instructions.
Top-K functions as a vocabulary constraint that operates independently of Temperature. At the default Top-K of 50, high Temperature settings introduce creative drift – the model can hallucinate non-existent sensor issues by sampling from a wide token distribution. Constraining Top-K to 10–20 limits the model to the most probable next tokens at each step, preserving linguistic variety within safe bounds. The counterintuitive finding: Temperature 1 with Top-K 10 is more stable than moderate Temperature with default Top-K, because constrained vocabulary prevents the model from hedging with generic responses while still allowing nuanced phrasing.
Top-P at 0.8 provides a secondary anchor when responses trend verbose under high Temperature. In practice, Top-K does most of the constraint work for this use case.
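A sketch of how the two parameter tiers map onto the Bedrock Converse API via boto3. The model ID is a placeholder, and the Top-K pass-through shape varies by model family (Anthropic's `top_k` form is shown; Amazon Nova nests it differently); verify field names against current Bedrock documentation before relying on this.

```python
MODEL_ID = "your-bedrock-model-id"   # substitute an ID from the model catalog
SYSTEM_PROMPT = "..."                # the audit prompt described above

def build_converse_request(model_id: str, system_prompt: str, case_json: str,
                           *, temperature: float, top_p: float,
                           top_k: int) -> dict:
    """Assemble keyword arguments for bedrock-runtime's converse() call."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": case_json}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": temperature,
                            "topP": top_p},
        # Top-K is not part of the common inferenceConfig; it is passed
        # through per model family (Anthropic form shown here).
        "additionalModelRequestFields": {"top_k": top_k},
    }

# Tier 1: deterministic, comprehensive output for forensic audit logs.
audit_request = build_converse_request(
    MODEL_ID, SYSTEM_PROMPT, '{"rgb_confidence": 0.55, "ir_confidence": 0.60}',
    temperature=0.0, top_p=1.0, top_k=10)

# Tier 2: nuanced, prioritized output for user-facing instructions.
user_request = build_converse_request(
    MODEL_ID, SYSTEM_PROMPT, '{"rgb_confidence": 0.55, "ir_confidence": 0.60}',
    temperature=1.0, top_p=0.8, top_k=10)
```

Either request would be sent with `boto3.client("bedrock-runtime").converse(**audit_request)`; in the Converse response, the generated text sits under `response["output"]["message"]["content"][0]["text"]`.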
Model Comparison: Production vs. Research
| Criterion | Nova Lite | Claude 4.5 Opus |
|---|---|---|
| JSON Compliance | 100% | 100% |
| Decision Accuracy | 100% | 100% |
| Reasoning Style | Concise | Verbose |
| Cost per Decision | ~$0.0001 | ~$0.001 |
| Best Use | Production, high-volume | Research, novel edge cases |
Both models reached identical decisions on all test cases. The 10x cost difference is not a quality difference – it is a reasoning depth difference. Nova Lite is the production choice where consistency and throughput matter. Claude 4.5 Opus is the research choice for developing new edge case examples and auditing novel sensor configurations where detailed reasoning traces have diagnostic value.
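The "one template, multiple configurations" pattern can be captured as a small routing table mapping audience to model and sampling parameters. The model labels and the research-tier settings are assumptions consistent with, but not explicitly stated in, the findings above.

```python
# Illustrative deployment profiles: the same prompt template is reused;
# only the model and sampling parameters change per audience.
DEPLOYMENT_PROFILES = {
    "audit_log": {           # forensic logs: deterministic, comprehensive
        "model": "nova-lite",
        "temperature": 0.0, "top_k": 10, "top_p": 1.0,
    },
    "user_instructions": {   # real-time remediation: nuanced, prioritized
        "model": "nova-lite",
        "temperature": 1.0, "top_k": 10, "top_p": 0.8,
    },
    "edge_case_research": {  # novel scenarios: detailed reasoning traces
        "model": "claude-opus",
        "temperature": 0.0, "top_k": 20, "top_p": 1.0,
    },
}
```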
Study Design
Data Source: Manually constructed test cases with synthetic sensor confidence values. No real biometric data or images were used. The LLM operates on structured JSON metadata (RGB confidence score, IR confidence score, and a contextual note field), not raw sensor images.
Data Handling: Three edge case scenarios were hand-crafted to represent distinct failure modes: environmental interference (low light), presentation attack (2D spoof), and ambiguous sensor conflict (glare with transition lenses). Each was tested against both models under varying parameter configurations.
Analytical Approach:
- Designed system prompt with role assignment, task specification, negative output constraint, and three few-shot examples (specular glare, lower face occlusion by mask, low light)
- Constructed three test cases with known expected outputs (PASS, FAIL, RETRY)
- Ran each test case through Amazon Nova Lite and Claude 4.5 Opus via AWS Bedrock
- Evaluated outputs against expected decisions and JSON schema compliance
- Conducted parameter sweep: Temperature (0, 0.5, 1.0) × Top-K (1, 10–20, 50) × Top-P (0.8, 1.0)
- Ran stochastic consistency test – same input 3x per Temperature setting – to measure output stability
- Compared model outputs on reasoning depth, phrasing sophistication, and cost per decision
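The evaluation steps above can be sketched as a small scoring harness: parse each raw model response, check JSON schema compliance, and compare the decision against the expected label. The output field names are assumptions about the response schema.

```python
import json

# Assumed output schema: three required keys, decision drawn from a
# closed set. Adjust to match the actual prompt's schema.
REQUIRED_KEYS = {"decision", "reason", "remediation"}
VALID_DECISIONS = {"PASS", "FAIL", "RETRY"}

def evaluate_response(raw: str, expected: str) -> dict:
    """Score one model response for schema compliance and decision accuracy."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Conversational framing or malformed JSON fails compliance outright.
        return {"schema_ok": False, "decision_ok": False}
    schema_ok = (isinstance(parsed, dict)
                 and REQUIRED_KEYS <= parsed.keys()
                 and parsed.get("decision") in VALID_DECISIONS)
    return {"schema_ok": schema_ok,
            "decision_ok": schema_ok and parsed["decision"] == expected}
```

For the stochastic consistency test, the same harness applies: submit one input three times per Temperature setting and compare the raw strings; identical strings across runs indicate the determinism expected at Temperature 0.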
Project Resources
Repository: github.com/kchoover14/xai-prompt-engineering-bedrock
Data: Synthetic sensor confidence values constructed for this proof-of-concept. No external data source. No real biometric data used.
Code: No analysis script. Prompts were submitted directly to AWS Bedrock via the console and API. Prompt templates are documented in the Research Answers section above.
Project Artifacts:
- No figures (results presented as tables and inline text)
License:
- Code and scripts © Kara C. Hoover, licensed under the MIT License.
- Data, figures, and written content © Kara C. Hoover, licensed under CC BY-NC-SA 4.0.
Tools & Technologies
Languages: None (no analysis script)
Tools: AWS Bedrock | Amazon Nova Lite | Claude 4.5 Opus
Packages: None
Expertise
Domain Expertise: prompt engineering | few-shot learning | LLM parameter optimization | biometric sensor validation | AWS Bedrock | structured output design | cost-performance trade-off analysis
Transferable Expertise: This project demonstrates the ability to systematically evaluate and configure LLM behavior for constrained classification tasks – identifying which prompt design choices and parameter settings govern reliability, and how to match configuration to operational audience. The tiered parameter strategy (one template, multiple configurations) is applicable to any enterprise setting where AI outputs must serve different stakeholders with different reliability and interpretability requirements.