ZenStatement Labs

The State of AI in Finance Operations

We evaluated four AI systems across 1.37 million synthetic transactions — modelled on real enterprise reconciliation scenarios — to answer the question benchmarks don’t ask: can AI reliably match financial records at the accuracy month-end close demands?

2026 · Version 1.0 · 11 sections · Full methodology · Confidential
1.3Mn+
Synthetic transactions modelled on real enterprise scenarios
11
Reconciliation scenarios, 4 verticals
4
AI systems compared head-to-head
99.8%
Creo precision, #1 across all models

What we found when AI meets complex finance scenarios

Four patterns emerged consistently across 1.37 million synthetic transactions, with direct implications for any enterprise adopting AI in finance workflows.

FINDING 01
Precision matters more than recall in finance
An incorrect match becomes an accounting entry. Claude Sonnet 4.6 achieved 97.3% recall but only 76.7% precision: nearly 1 in 4 of its flagged matches (23.3%) were wrong.
FINDING 02
General-purpose models match aggressively
LLMs are optimised to be helpful, to find answers and make connections. Applied to reconciliation, this means matching when uncertain. Creo abstains when uncertain. The difference in precision is architectural.
FINDING 03
Workflow complexity breaks generic AI fast
On the Health D2C payment gateway scenario (450 records), Claude and Gemini returned 0% precision, 0% recall, 0% F1. Complete failure on a medium dataset is a structural boundary, not a scale issue.
FINDING 04
Domain-specific architecture changes outcomes
Creo’s advantage comes from architecture, not compute: a conservative matching strategy and domain-specific variance logic. Across all 11 scenarios, Creo passed 6 (54.5%), against 2 (18.2%) each for Claude and Gemini.
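
The trade-off behind Findings 01 and 02 can be illustrated with a minimal sketch. It assumes a hypothetical matcher that attaches a confidence score to each candidate pair; the `Candidate` type, the `decide` function, the thresholds, and the sample records are all illustrative assumptions, not Creo's implementation or benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A proposed match between two records (hypothetical structure)."""
    ledger_id: str
    bank_id: str
    confidence: float  # assumed score in [0, 1]
    is_correct: bool   # ground-truth label, available only during evaluation


def decide(candidates, threshold):
    """Accept candidates at or above the threshold; abstain on the rest."""
    accepted = [c for c in candidates if c.confidence >= threshold]
    abstained = [c for c in candidates if c.confidence < threshold]
    return accepted, abstained


def precision_recall(accepted, candidates):
    """Precision over accepted matches, recall over all correct matches."""
    true_pos = sum(c.is_correct for c in accepted)
    total_correct = sum(c.is_correct for c in candidates)
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall


# Illustrative data only: three true matches and two false ones.
candidates = [
    Candidate("L1", "B1", 0.99, True),
    Candidate("L2", "B2", 0.97, True),
    Candidate("L3", "B3", 0.80, True),
    Candidate("L4", "B4", 0.75, False),
    Candidate("L5", "B5", 0.60, False),
]

aggressive, _ = decide(candidates, threshold=0.5)     # match when uncertain
conservative, _ = decide(candidates, threshold=0.95)  # abstain when uncertain

print("aggressive  :", precision_recall(aggressive, candidates))
print("conservative:", precision_recall(conservative, candidates))
```

On this toy data the aggressive setting recovers every true match but also accepts both false ones, while the conservative setting accepts no false match at the cost of leaving one true match for manual review.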

Benchmark results

Overall performance across 1.37 million transactions

Model              Precision  Recall  F1 Score  Pass Rate
Creo ★             99.8%      89.2%   94.2%     6/11 (54.5%)
GPT-5.5 (High)     99.6%      85.2%   91.8%     4/11 (36.4%)
Gemini 3.1 Pro     88.0%      96.8%   92.2%     2/11 (18.2%)
Claude Sonnet 4.6  76.7%      97.3%   85.8%     2/11 (18.2%)

★ Best in class  ·  Green ≥90%  ·  Amber ≥70%  ·  Red <70%
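
As a consistency check on the table above, F1 can be recomputed as the harmonic mean of precision and recall, and the pass rate as scenarios passed out of 11. The sketch below uses only the rounded figures from the table; the `f1_score` helper and the `results` dictionary are illustrative names, and the share of wrong matches is simply 1 minus precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


TOTAL_SCENARIOS = 11

# (precision, recall, scenarios passed), copied from the results table.
results = {
    "Creo": (0.998, 0.892, 6),
    "GPT-5.5 (High)": (0.996, 0.852, 4),
    "Gemini 3.1 Pro": (0.880, 0.968, 2),
    "Claude Sonnet 4.6": (0.767, 0.973, 2),
}

for model, (p, r, passed) in results.items():
    print(
        f"{model}: F1={f1_score(p, r):.1%}, "
        f"pass rate={passed / TOTAL_SCENARIOS:.1%}, "
        f"wrong matches={1 - p:.1%}"
    )
```

The recomputed F1 values and pass rates agree with the table to one decimal place, and the 0% precision and 0% recall reported in Finding 03 correspond to an F1 of 0 under the same formula.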

“Claims without methodology are marketing. Results with methodology are evidence. This is the evidence.”

“General intelligence is not the same as operational financial reliability. The field needs evaluation frameworks that reflect this distinction.”

Built on synthetic benchmark data, transparent by design

Synthetic benchmark data
All 1.37 million transactions came from synthetic datasets designed to mimic real enterprise reconciliation scenarios. No client-identifiable data was used.
Zero-shot evaluation
No fine-tuning, no few-shot examples. Each model was tested out-of-the-box via its native agentic interface (Claude Code, Codex, Gemini CLI) at temperature=0.
Standard IR metrics
Precision, recall, and F1 score, micro-averaged at corpus level so that larger scenarios carry proportionally more weight across the 1.37 million total transactions.
Human-annotated ground truth
Ground truth built by ZenStatement’s reconciliation team — domain experts who manually verified correct matches using business rules and variance tolerances.
4 industry verticals
E-commerce, payment gateways, OMS logistics, and ERP/POS — 11 scenarios reflecting the actual complexity of high-volume enterprise finance operations.
Reproducibility available
Full benchmark dataset and evaluation code available to enterprise customers and research partners under NDA. Contact marketing@zenstatement.com.
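
The corpus-level micro-averaging described under "Standard IR metrics" can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names and the per-scenario counts are hypothetical, chosen only to show how micro-averaging pools counts across scenarios while macro-averaging weights each scenario equally.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def micro_average(scenarios):
    """Pool true/false positives and false negatives, then compute once."""
    tp = sum(s["tp"] for s in scenarios)
    fp = sum(s["fp"] for s in scenarios)
    fn = sum(s["fn"] for s in scenarios)
    return _ratio(tp, tp + fp), _ratio(tp, tp + fn)


def macro_average(scenarios):
    """Compute precision and recall per scenario, then take the plain mean."""
    precisions = [_ratio(s["tp"], s["tp"] + s["fp"]) for s in scenarios]
    recalls = [_ratio(s["tp"], s["tp"] + s["fn"]) for s in scenarios]
    return sum(precisions) / len(scenarios), sum(recalls) / len(scenarios)


# Hypothetical counts: one large scenario that goes well and one small
# scenario that fails completely. These are not benchmark figures.
scenarios = [
    {"name": "large", "tp": 9000, "fp": 100, "fn": 900},
    {"name": "small", "tp": 0, "fp": 50, "fn": 400},
]

micro_p, micro_r = micro_average(scenarios)
macro_p, macro_r = macro_average(scenarios)
print(f"micro: precision={micro_p:.3f}, recall={micro_r:.3f}")
print(f"macro: precision={macro_p:.3f}, recall={macro_r:.3f}")
```

Because micro-averaged figures are dominated by the largest scenarios, a complete failure on a small scenario moves them only slightly, which is one reason to read them alongside the per-scenario pass rate.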