We evaluated four AI systems across 1.37 million synthetic transactions — modelled on real enterprise reconciliation scenarios — to answer the question benchmarks don’t ask: can AI reliably match financial records at the accuracy month-end close demands?
Key findings
Four patterns emerged consistently across the full set of 1.37 million synthetic transactions, each with direct implications for enterprises adopting AI in finance workflows.
Benchmark results
| Model | Precision | Recall | F1 Score | Pass Rate |
|---|---|---|---|---|
| Creo ★ | 99.8% | 89.2% | 94.2% | 6/11 (54.5%) |
| GPT-5.5 (High) | 99.6% | 85.2% | 91.8% | 4/11 (36.4%) |
| Gemini 3.1 Pro | 88.0% | 96.8% | 92.2% | 2/11 (18.2%) |
| Claude Sonnet 4.6 | 76.7% | 97.3% | 85.8% | 2/11 (18.2%) |
★ Best in class · Green ≥90% · Amber ≥70% · Red <70%
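For readers who want to sanity-check the table, here is a minimal sketch of how the F1 and pass-rate columns relate to the other figures. It assumes F1 is the standard harmonic mean of the reported precision and recall; this summary does not say how the scores were aggregated, so treat that as an assumption. The names `REPORTED`, `f1_score`, and `pass_rate` are illustrative and do not come from the report.

```python
# Precision and recall in percent, copied from the table above.
# Assumption: the F1 column is the harmonic mean of these two figures.
REPORTED = {
    "Creo": (99.8, 89.2),
    "GPT-5.5 (High)": (99.6, 85.2),
    "Gemini 3.1 Pro": (88.0, 96.8),
    "Claude Sonnet 4.6": (76.7, 97.3),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pass_rate(passed, total):
    """Share of scenarios passed, in percent."""
    return 100 * passed / total


for model, (p, r) in REPORTED.items():
    print(f"{model}: F1 = {f1_score(p, r):.1f}%")

print(f"Creo pass rate: {pass_rate(6, 11):.1f}%")
```

Under that assumption the computed values agree with the F1 column to one decimal place. The comparison also shows why F1 alone can hide the trade-off the table exposes: Gemini 3.1 Pro and GPT-5.5 (High) land within half a point of each other on F1 while differing by more than eleven points on both precision and recall.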
From the report
“Claims without methodology are marketing. Results with methodology are evidence. This is the evidence.”
“General intelligence is not the same as operational financial reliability. The field needs evaluation frameworks that reflect this distinction.”
Methodology