The Question Every Lender Asks
When we talk to private lenders about AI-powered document processing, the question always comes up: "Which AI model are you using?" The answer matters—different models have different strengths, and choosing the wrong one can mean the difference between 95% accuracy and 99.5% accuracy.
So we did what any engineering team would do: we ran a comprehensive benchmark. We fed 10,000 real loan documents through GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro, measuring accuracy, speed, and cost across 8 different document types.
The Testing Methodology
- Dataset: 10,000 documents from 50+ private lenders
- Document types: Bank statements, tax returns, appraisals, title reports, credit reports, rent rolls, operating statements, and loan applications
- Evaluation: Human experts manually verified all AI extractions
- Metrics: Field-level accuracy, processing time, and cost per document
Overall Results: The Winner Depends on the Document
Here's the surprising finding: no single model dominated across all document types. Each AI has specific strengths that make it better suited for certain tasks.
Document-by-Document Breakdown
Bank Statements: Claude Wins
Bank statements are highly structured: consistent formatting, tables, clear headers. Claude excelled here with 99.2% accuracy, correctly extracting account numbers, balances, transaction dates, and amounts, even from poor-quality scanned PDFs.
| Model | Accuracy | Avg Time | Cost/Doc |
|---|---|---|---|
| Claude 3.5 | 99.2% | 2.3s | $0.04 |
| GPT-4 | 98.1% | 3.1s | $0.06 |
| Gemini 1.5 | 97.8% | 1.8s | $0.03 |
Tax Returns: GPT-4 Edges Ahead
Tax returns require understanding complex financial relationships—how Schedule C income relates to total AGI, how depreciation affects net income. GPT-4's reasoning capabilities gave it an edge with 98.3% accuracy, particularly on complicated business returns.
| Model | Accuracy | Avg Time | Cost/Doc |
|---|---|---|---|
| GPT-4 | 98.3% | 4.2s | $0.08 |
| Claude 3.5 | 97.9% | 3.8s | $0.06 |
| Gemini 1.5 | 96.4% | 3.5s | $0.05 |
Appraisals: Claude's Precision Shines
Appraisal reports mix tables, narrative text, and embedded images. Claude handled this complexity best with 98.7% accuracy, correctly extracting comparable sales, adjustments, and final value conclusions even when formatting varied wildly.
| Model | Accuracy | Avg Time | Cost/Doc |
|---|---|---|---|
| Claude 3.5 | 98.7% | 3.4s | $0.05 |
| GPT-4 | 97.5% | 4.1s | $0.07 |
| Gemini 1.5 | 96.8% | 2.9s | $0.04 |
Credit Reports: Gemini Delivers Speed
Credit reports are long but straightforward: lots of data in a consistent structure. Gemini processed them fastest, with an acceptable 97.2% accuracy, making it ideal for high-volume scenarios where speed matters more than perfection.
| Model | Accuracy | Avg Time | Cost/Doc |
|---|---|---|---|
| Claude 3.5 | 98.1% | 2.8s | $0.04 |
| GPT-4 | 97.6% | 3.2s | $0.06 |
| Gemini 1.5 | 97.2% | 1.6s | $0.02 |
Key Insights from 10,000 Documents
1. No Model is Perfect—Yet
Even the best models hover around 97-99% accuracy. For loan underwriting, that 1-3% error rate means you still need human review. The value of AI isn't eliminating humans—it's reducing their workload from 100% to 10%.
2. Structure Matters More Than Content
Models performed better on highly structured documents (bank statements, credit reports) than on semi-structured ones (appraisals, operating statements). The lesson: if you can standardize document formats from borrowers, AI accuracy improves dramatically.
3. Cost Differences Add Up at Scale
Processing 1,000 documents per month:
- GPT-4: ~$70/month (most expensive but best for complex analysis)
- Claude: ~$50/month (best balance of accuracy and cost)
- Gemini: ~$30/month (cheapest, fastest, slightly lower accuracy)
For a lender processing 5,000+ documents monthly, model choice can mean $2,000+/year in cost differences.
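To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python, using the approximate blended per-document costs from the tables above (actual API costs vary with document length and pricing changes, so treat the numbers as illustrative):

```python
# Approximate blended per-document costs from the benchmark tables above.
# Illustrative only; real API costs depend on document length and current pricing.
COST_PER_DOC = {
    "gpt-4": 0.07,
    "claude-3.5-sonnet": 0.05,
    "gemini-1.5-pro": 0.03,
}

def monthly_cost(model: str, docs_per_month: int) -> float:
    """Estimated monthly spend for a given model and document volume."""
    return COST_PER_DOC[model] * docs_per_month

for model in COST_PER_DOC:
    monthly = monthly_cost(model, 5_000)
    print(f"{model}: ~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")
```

At 5,000 documents a month, the spread between the cheapest and most expensive option works out to roughly $200/month, which is where the $2,000+/year figure comes from.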
4. Ensemble Approach Works Best
Here's what we actually do at Mentyx: we don't pick just one model. We use Claude for bank statements and appraisals, GPT-4 for tax returns and complex financial analysis, and Gemini for high-volume credit report processing. This ensemble approach gives us the best of all worlds.
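Conceptually, the routing layer is just a lookup from document type to model. Here's a minimal sketch of the idea (the model identifiers, document-type labels, and fallback choice are illustrative placeholders, not our production configuration):

```python
# Illustrative document-to-model routing based on the benchmark results above.
ROUTING_TABLE = {
    "bank_statement": "claude-3.5-sonnet",
    "appraisal": "claude-3.5-sonnet",
    "tax_return": "gpt-4",
    "credit_report": "gemini-1.5-pro",
}
DEFAULT_MODEL = "claude-3.5-sonnet"  # reasonable default for other document types

def route_document(doc_type: str) -> str:
    """Return the model best suited to a given document type."""
    return ROUTING_TABLE.get(doc_type, DEFAULT_MODEL)

print(route_document("tax_return"))  # gpt-4
print(route_document("rent_roll"))   # claude-3.5-sonnet (fallback)
```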
Real-World Performance: Beyond Benchmarks
Raw accuracy numbers only tell part of the story. In production with real lenders, we learned some unexpected lessons:
Handling Edge Cases
GPT-4 handled unusual scenarios better—like when a borrower submits a handwritten rent roll or a property has non-standard income sources. Its reasoning ability meant it could adapt to edge cases that rigid extraction would miss.
Error Patterns Matter
Claude made fewer errors overall, but when it did err, the mistakes were often subtle—like confusing similar account numbers. Gemini's errors were more obvious (missing entire sections), making them easier to catch in human review.
Confidence Scoring
All three models provide confidence scores, but Claude's calibration was best—when it said 95% confident, it was right 95% of the time. This matters for automated workflows where you want to flag low-confidence extractions for human review.
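In an automated workflow, those confidence scores feed a simple triage rule: fields below a threshold get flagged for a human reviewer, and everything else passes through. A minimal sketch, assuming a 0.90 threshold and a per-field confidence value (both are illustrative choices, not fixed recommendations):

```python
# Illustrative confidence-based triage; threshold and field structure are assumptions.
REVIEW_THRESHOLD = 0.90

def triage(extracted_fields: dict) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and flagged-for-review buckets."""
    accepted, flagged = {}, {}
    for name, field in extracted_fields.items():
        bucket = accepted if field["confidence"] >= REVIEW_THRESHOLD else flagged
        bucket[name] = field
    return accepted, flagged

fields = {
    "account_number": {"value": "1234567890", "confidence": 0.98},
    "ending_balance": {"value": "84,312.55", "confidence": 0.82},
}
accepted, flagged = triage(fields)
print(flagged)  # {'ending_balance': ...} -> routed to human review
```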
What This Means for Your Lending Operation
Start with Claude for Most Documents
If you're building your first AI-powered workflow, Claude 3.5 Sonnet offers the best balance of accuracy, speed, and cost for typical loan documents. It's our default recommendation.
Use GPT-4 for Complex Analysis
Reserve GPT-4 for documents requiring deep understanding—complicated business tax returns, unusual deal structures, or when you need the AI to explain its reasoning to underwriters.
Deploy Gemini for High-Volume Processing
If you're processing thousands of credit reports or straightforward documents daily, Gemini's speed and cost advantages outweigh its slight accuracy disadvantage.
Always Include Human Review
No model is perfect. Design workflows where AI handles 90% of the work and humans review 100% of the output. This catches errors while still achieving massive efficiency gains.
The Future: Models Keep Improving
We ran this benchmark in Q1 2025. By the time you read this, there may be newer versions with better performance. OpenAI, Anthropic, and Google are all racing to improve document understanding.
The good news: the gap between models is narrowing. All three are now good enough for production lending use with proper oversight. The question isn't whether to use AI for document processing—it's how to deploy it intelligently across your specific document types and volumes.
Our Recommendation
For most private lenders processing 100-5,000 loans per month: Start with Claude 3.5 Sonnet as your primary model. It delivers the best all-around performance for typical loan documents. Add GPT-4 for tax returns and complex financial analysis. Consider Gemini for very high-volume, structured document processing.
Testing Methodology Details
For those interested in how we conducted this benchmark:
- Sample size: 10,000 documents (1,250 per document type)
- Time period: Documents from 2022-2024
- Lender diversity: 50+ lenders across 15 states
- Verification: Two independent human reviewers per document
- Accuracy calculation: Field-level precision (each extracted field marked correct/incorrect; see the sketch after this list)
- Cost calculation: Based on January 2025 API pricing
- Hardware: All tests were run on the same infrastructure to ensure a fair speed comparison
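For readers who want to see how the field-level metric works, here's a minimal sketch of the calculation (the field names are made up for illustration; the logic is simply correct fields divided by total fields):

```python
# Illustrative field-level accuracy computation: each extracted field is marked
# correct (True) or incorrect (False), and accuracy is correct / total.
def field_level_accuracy(results: list) -> float:
    """results: one dict per document, mapping field name -> True/False (correct?)."""
    total = sum(len(doc) for doc in results)
    correct = sum(sum(doc.values()) for doc in results)
    return correct / total if total else 0.0

docs = [
    {"borrower_name": True, "loan_amount": True, "ltv": False},
    {"borrower_name": True, "loan_amount": True, "ltv": True},
]
print(f"{field_level_accuracy(docs):.1%}")  # 83.3%
```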