We evaluated four AI systems across 13,73,480 real enterprise transactions to answer the question benchmarks don’t ask: can AI reliably match financial records at the accuracy month-end close demands?
Key findings
Four patterns emerged consistently across 13,73,480 transactions, with direct implications for any enterprise adopting AI in finance workflows.
Benchmark results
| Model | Precision | Recall | F1 Score | Pass Rate |
|---|---|---|---|---|
| Creo ★ | 99.8% | 89.2% | 94.2% | 6/11, 54.5% |
| GPT-5.5 (High) | 99.6% | 85.2% | 91.8% | 4/11, 36.4% |
| Gemini 3.1 Pro | 88.0% | 96.8% | 92.2% | 2/11, 18.2% |
| Claude Sonnet 4.6 | 76.7% | 97.3% | 85.8% | 2/11, 18.2% |
★ Best in class.
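The F1 and Pass Rate columns can be cross-checked against the other columns. The sketch below assumes that F1 is the standard harmonic mean of precision and recall, and that the pass rate is simply passes divided by 11; neither definition is stated in this excerpt, so treat both as assumptions. The names `RESULTS`, `f1_score`, `pass_rate`, and `summarize` are illustrative and do not come from the report.

```python
# Figures copied from the benchmark table: (precision %, recall %, passes out of 11).
RESULTS = {
    "Creo": (99.8, 89.2, 6),
    "GPT-5.5 (High)": (99.6, 85.2, 4),
    "Gemini 3.1 Pro": (88.0, 96.8, 2),
    "Claude Sonnet 4.6": (76.7, 97.3, 2),
}

# Denominator of the Pass Rate column. What the 11 units are (presumably
# test scenarios) is an assumption, not something this excerpt states.
TOTAL = 11


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed to be the report's F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pass_rate(passed: int, total: int = TOTAL) -> float:
    """Share of passed units, as a percentage."""
    return 100 * passed / total


def summarize() -> dict:
    """Return {model: (F1 %, pass rate %)}, each rounded to one decimal place."""
    return {
        name: (round(f1_score(p, r), 1), round(pass_rate(n), 1))
        for name, (p, r, n) in RESULTS.items()
    }


if __name__ == "__main__":
    for name, (f1, rate) in summarize().items():
        print(f"{name}: F1 {f1}%, pass rate {rate}%")
```

Under these assumptions the recomputed values agree with the published table to one decimal place. The check also shows why the ranking by F1 differs from the ranking by recall alone: the harmonic mean penalizes whichever of the two inputs is lower, so a model with very high recall but weaker precision still ends up with a lower F1.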
From the report
“Claims without methodology are marketing. Results with methodology are evidence. This is the evidence.”
“General intelligence is not the same as operational financial reliability. The field needs evaluation frameworks that reflect this distinction.”
Methodology