ZenStatement Labs

The State of AI in Finance Operations

We evaluated four AI systems across 1.37 million synthetic transactions — modelled on real enterprise reconciliation scenarios — to answer the question benchmarks don’t ask: can AI reliably match financial records at the accuracy month-end close demands?

2026 · Version 1.0 · 11 sections · Full methodology · Confidential

1.3Mn+

Synthetic transactions modelled on real enterprise scenarios

11

Reconciliation scenarios, 4 verticals

4

AI systems compared head-to-head

99.8%

Creo precision, #1 across all models

Key findings

What we found when AI meets complex finance scenarios

Four patterns emerged consistently across 1.37 million synthetic transactions, with direct implications for any enterprise adopting AI in finance workflows.

FINDING 01

Precision matters more than recall in finance

An incorrect match becomes an accounting entry. Claude Sonnet 4.6 achieved 97.3% recall, but only 76.7% precision. Nearly 1 in 4 of its flagged matches (23.3%) were wrong.

FINDING 02

General-purpose models match aggressively

LLMs are optimised to be helpful: to find answers and make connections. Applied to reconciliation, this means proposing a match even when uncertain. Creo abstains instead. The difference in precision is architectural.

FINDING 03

Workflow complexity breaks generic AI fast

On the Health D2C payment gateway scenario (450 records), Claude and Gemini returned 0% precision, 0% recall, and 0% F1. Complete failure on a dataset of this modest size points to a structural limitation, not a scale issue.

FINDING 04

Domain-specific architecture changes outcomes

Creo’s advantage comes from architecture, not compute: a conservative matching strategy and domain-specific variance logic. The outcome is a 54.5% pass rate (6 of 11 scenarios), against 18.2% (2 of 11) each for Claude and Gemini.
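
Findings 02 and 04 contrast matching aggressively with abstaining when uncertain. The report does not describe Creo's internals, so what follows is only a toy Python sketch of the general idea: a matcher that accepts a candidate pair only above a confidence threshold and abstains otherwise. The `Candidate` type, the scores, the labels, and both threshold values are hypothetical and are not benchmark data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    ledger_id: str
    bank_id: str
    score: float      # hypothetical match confidence in [0, 1]
    is_correct: bool  # ground-truth label, used only for scoring

def evaluate(candidates, threshold):
    """Accept candidates scoring >= threshold; abstain on the rest."""
    accepted = [c for c in candidates if c.score >= threshold]
    true_pos = sum(c.is_correct for c in accepted)
    total_correct = sum(c.is_correct for c in candidates)
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall

# Made-up candidate pairs for illustration only.
CANDIDATES = [
    Candidate("L1", "B1", 0.99, True),
    Candidate("L2", "B2", 0.95, True),
    Candidate("L3", "B3", 0.80, True),
    Candidate("L4", "B4", 0.70, False),
    Candidate("L5", "B5", 0.60, True),
    Candidate("L6", "B6", 0.55, False),
]

aggressive = evaluate(CANDIDATES, threshold=0.5)    # accepts every pair
conservative = evaluate(CANDIDATES, threshold=0.9)  # abstains when unsure
print("aggressive   (precision, recall):", aggressive)
print("conservative (precision, recall):", conservative)
```

In this toy data, raising the threshold trades recall for precision, which is the same direction of trade-off the findings describe.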

Benchmark results

Overall performance across 1.37 million transactions

Model	Precision	Recall	F1 Score	Pass Rate
Creo ★	99.8%	89.2%	94.2%	6/11 (54.5%)
GPT-5.5 (High)	99.6%	85.2%	91.8%	4/11 (36.4%)
Gemini 3.1 Pro	88.0%	96.8%	92.2%	2/11 (18.2%)
Claude Sonnet 4.6	76.7%	97.3%	85.8%	2/11 (18.2%)

★ Best in class · Green ≥90% · Amber ≥70% · Red <70%
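
The F1 column can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The short Python check below does this for the four rows of the table above; the dictionary layout and the function name are illustrative and are not taken from the benchmark's evaluation code.

```python
# Precision, recall, and F1 (all in %) as reported in the table above.
REPORTED = {
    "Creo": (99.8, 89.2, 94.2),
    "GPT-5.5 (High)": (99.6, 85.2, 91.8),
    "Gemini 3.1 Pro": (88.0, 96.8, 92.2),
    "Claude Sonnet 4.6": (76.7, 97.3, 85.8),
}

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for model, (p, r, reported_f1) in REPORTED.items():
    computed = round(f1_score(p, r), 1)
    print(f"{model}: computed F1 {computed}%, reported {reported_f1}%")
```

The same function returns 0.0 when precision and recall are both zero, which is the convention assumed here for a scenario with no correct matches.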

From the report

“Claims without methodology are marketing. Results with methodology are evidence. This is the evidence.”

“General intelligence is not the same as operational financial reliability. The field needs evaluation frameworks that reflect this distinction.”

Methodology

Built on synthetic benchmark data, transparent by design

Synthetic benchmark data

All 1.37 million transactions came from synthetic datasets designed to mimic real enterprise reconciliation scenarios. No client-identifiable data was used.

Zero-shot evaluation

No fine-tuning and no few-shot examples. Each model was tested out of the box via its native agentic interface (Claude Code, Codex, or Gemini CLI) at temperature=0.

Standard IR metrics

Precision, recall, and F1 score, computed with corpus-level micro-averaging: counts are pooled across all 1.37 million transactions, so larger scenarios carry proportionally more weight.
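
As a rough illustration of corpus-level micro-averaging, the Python sketch below pools true positives, false positives, and false negatives across scenarios before computing the metrics, and compares the result with a per-scenario (macro) average. The scenario names and counts are invented for the example and are not benchmark data.

```python
from dataclasses import dataclass

@dataclass
class ScenarioCounts:
    name: str
    true_pos: int   # proposed matches that are correct
    false_pos: int  # proposed matches that are wrong
    false_neg: int  # correct matches that were missed

def micro_average(scenarios):
    """Pool counts over all scenarios, then compute precision, recall, F1."""
    tp = sum(s.true_pos for s in scenarios)
    fp = sum(s.false_pos for s in scenarios)
    fn = sum(s.false_neg for s in scenarios)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_precision(scenarios):
    """Average of per-scenario precision; every scenario counts equally."""
    values = []
    for s in scenarios:
        proposed = s.true_pos + s.false_pos
        values.append(s.true_pos / proposed if proposed else 0.0)
    return sum(values) / len(values)

# Hypothetical counts: one large scenario and one small one.
SCENARIOS = [
    ScenarioCounts("large_scenario", true_pos=9000, false_pos=100, false_neg=900),
    ScenarioCounts("small_scenario", true_pos=0, false_pos=50, false_neg=400),
]

precision, recall, f1 = micro_average(SCENARIOS)
print(f"micro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"macro precision={macro_precision(SCENARIOS):.3f}")
```

With these invented counts the micro-averaged precision stays close to the large scenario's value, while the macro average is pulled down sharply by the small one. That is why headline figures under micro-averaging should be read alongside per-scenario pass rates.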

Human-annotated ground truth

Ground truth built by ZenStatement’s reconciliation team — domain experts who manually verified correct matches using business rules and variance tolerances.
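
A variance tolerance of the kind mentioned above can be pictured as a rule that treats two amounts as matching when they differ by no more than an allowed margin. The Python sketch below is an assumption about what such a rule might look like, not ZenStatement's business rules: the function name, the absolute and percentage tolerances, and the sample amounts are all hypothetical.

```python
from decimal import Decimal

def within_tolerance(
    expected: Decimal,
    actual: Decimal,
    abs_tol: Decimal = Decimal("0.50"),   # hypothetical absolute margin
    pct_tol: Decimal = Decimal("0.001"),  # hypothetical 0.1% relative margin
) -> bool:
    """True if the amounts differ by at most abs_tol or pct_tol of expected."""
    diff = abs(expected - actual)
    return diff <= abs_tol or diff <= abs(expected) * pct_tol

# Small rounding difference: inside the absolute margin.
print(within_tolerance(Decimal("1000.00"), Decimal("999.60")))
# Larger gap on the same amount: outside both margins.
print(within_tolerance(Decimal("1000.00"), Decimal("990.00")))
# Large amount where only the relative margin applies.
print(within_tolerance(Decimal("50000.00"), Decimal("49960.00")))
```

`Decimal` is used rather than `float` so that currency amounts compare exactly, a common choice for financial arithmetic.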

4 industry verticals

E-commerce, payment gateways, OMS logistics, and ERP/POS — 11 scenarios reflecting the actual complexity of high-volume enterprise finance operations.

Reproducibility available

Full benchmark dataset and evaluation code available to enterprise customers and research partners under NDA. Contact marketing@zenstatement.com.

Simplify Your Finance With ZenStatement Today