Finance Agent Benchmarks Expose Where Lending Automation Breaks

As of mid-June 2026, no independently confirmed benchmark named MortarBench has appeared for mortgage loan origination [unverified]. What does exist is a cluster of adjacent vertical evaluations, led by MBABench, that put finance-domain agents through end-to-end spreadsheet tasks rather than trivia or single-formula edits. Their results are sobering: even the strongest agents degrade sharply once a workflow chains more than a few calculations, and accuracy alone says almost nothing about whether an agent handles borrower data safely.

What do vertical finance benchmarks actually measure?

MBABench, which its authors describe as one of the first end-to-end evaluations of LLM agents in finance, tests agents on financial modeling, forecasting, and scenario analysis inside spreadsheets (MBABench paper). Earlier benchmarks mostly stayed at question-answering or single-formula edits; MBABench raises the bar by asking agents to produce a complete artifact that a human reviewer would actually use. The scoring reflects that reality. Each deliverable is graded on Accuracy, Formula, and Format, because a finance workbook is not just a right answer but a document that passes through analysts, reviewers, auditors, and sometimes regulators.

The taxonomy matters. Accuracy catches whether the final numbers are right. Formula checks whether the agent built the calculations correctly rather than hard-coding plausible-looking values. Format judges whether the output is professional enough to hand to a stakeholder. The MBABench paper notes that the Claude family leads the leaderboard and produced the most professional-looking outputs in the authors’ qualitative review. But even Claude’s runs “frequently fall short of professional finance standards.” Leading the benchmark and being ready for a lending desk are not the same thing.

This is the first useful signal for mortgage operators. A loan origination file is closer to a complex workbook than to a chat answer. It contains income calculations, debt-to-income ratios, escrow projections, compliance checklists, and investor overlays. If the best general agents cannot yet meet professional spreadsheet standards on MBABench, the burden of proof on any vendor claiming “agentic underwriting” is high.

Where do agents break on chained finance tasks?

The breaking point is not subtle. According to MBABench, agents “degrade sharply as the difficulty increases beyond a few chained calculations.” A single formula edit or a one-off forecast is within reach. A task that requires several interdependent calculations, conditional logic, and clean presentation is not. The authors conclude that current agents “cannot yet reliably produce professional-quality spreadsheets at the complexity real-world workflows demand.”

For mortgage origination, that failure mode maps cleanly onto the job. Calculating a debt-to-income ratio is one step. Verifying self-employment income from two years of tax returns, reconciling Schedule C adjustments with bank-statement deposits, applying the correct overlay for the selected investor, and flagging the file for a compliance exception is a chain. An agent that succeeds on the first step and silently drifts on the fourth is worse than no agent at all, because the drift looks plausible.

The spreadsheet format also exposes a specific kind of error that text-based benchmarks hide: a cell can look correct while referencing the wrong upstream value. MBABench’s Formula dimension is designed to catch exactly that. A loan file has the same property. A compliance checklist can show every box checked while the underlying calculation pulls the wrong year’s income. The benchmark’s structure is therefore a better proxy for origination risk than any reading-comprehension test.
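A minimal sketch of how a value check and a reference check can disagree on the same cell. This is not MBABench's grader; the `Formula` class, the `evaluate` and `grade` helpers, the cell names, and the figures are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class Formula:
    """A computed cell: the upstream cells it reads and how it combines them."""
    refs: Tuple[str, ...]
    fn: Callable[..., float]


Cell = Union[float, Formula]


def evaluate(sheet: Dict[str, Cell], name: str) -> float:
    """Recursively compute the displayed value of a cell."""
    cell = sheet[name]
    if isinstance(cell, Formula):
        return cell.fn(*(evaluate(sheet, ref) for ref in cell.refs))
    return float(cell)


def grade(sheet, name, expected_value, expected_refs, tol=1e-9):
    """Score one cell on two independent axes: shown value and wiring."""
    cell = sheet[name]
    accuracy_ok = abs(evaluate(sheet, name) - expected_value) <= tol
    # A hard-coded literal fails the formula check even if its value is right.
    formula_ok = isinstance(cell, Formula) and set(cell.refs) == set(expected_refs)
    return {"accuracy": accuracy_ok, "formula": formula_ok}


# Hypothetical file: both years happen to show the same income, so a
# formula wired to the wrong year still displays the right ratio.
sheet = {
    "income_2024": 96000.0,
    "income_2025": 96000.0,
    "monthly_debt": 2800.0,
    "dti": Formula(("monthly_debt", "income_2024"), lambda debt, income: debt / (income / 12)),
}

result = grade(sheet, "dti", expected_value=0.35,
               expected_refs=("monthly_debt", "income_2025"))
print(result)  # {'accuracy': True, 'formula': False}
```

The cell passes on value and fails on wiring. If the 2025 income is later corrected, the displayed ratio will not move, which is the kind of silent drift a value-only check cannot see.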

Why can an accurate agent still fail compliance?

Task accuracy and data-handling safety fail on independent axes. That is the central finding of a joint evaluation by the Singapore AI Safety Institute and the Korea AI Safety Institute, which tested three agents on twelve realistic, non-adversarial tasks. None of the three achieved fully correct and fully safe execution across the board. Worse, successful task completion often coincided with data-handling failures such as accessing unnecessary information or disclosing information to inappropriate recipients.

The study organizes leakage risk into five categories that map directly onto regulated lending: lack of data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness (leakage paper). In origination terms, an agent with weak data awareness might pull a credit report before it has permissible purpose. Weak audience awareness might email a preliminary denial to a borrower before the adverse-action process is complete. Weak policy compliance might retain documents past their destruction schedule. Weak data minimization might collect bank statements for all accounts when only the accounts used for reserves are relevant. Weak access-boundary awareness might let a loan officer see an underwriter’s notes on a file they do not own.

Each of these is a compliance event. Each can happen while the agent’s arithmetic is perfect. That is the practical meaning of the capability-safety gap. A lender evaluating an origination agent cannot run a single pass-fail test and call it done. Accuracy and safety need separate scorecards, and a good score on one does not imply a good score on the other.
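One way to keep the two scorecards apart is sketched below, under stated assumptions. The `Risk` enum mirrors the five categories named in the leakage study, but the `TaskResult` record, the `scorecard` function, and the four sample results are hypothetical and are not data from that study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Risk(Enum):
    DATA_AWARENESS = "data awareness"
    AUDIENCE_AWARENESS = "audience awareness"
    POLICY_COMPLIANCE = "policy compliance"
    DATA_MINIMIZATION = "data minimization"
    ACCESS_BOUNDARY = "access-boundary awareness"


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    correct: bool                             # did the task output match the key?
    violations: FrozenSet[Risk] = frozenset() # data-handling failures observed


def scorecard(results: List[TaskResult]) -> Dict[str, float]:
    """Report accuracy and safety separately, plus the joint rate."""
    n = len(results)
    accurate = sum(r.correct for r in results)
    safe = sum(not r.violations for r in results)
    both = sum(r.correct and not r.violations for r in results)
    return {
        "accuracy": accurate / n,
        "safety": safe / n,
        "correct_and_safe": both / n,
    }


# Illustrative results only.
results = [
    TaskResult("income-calc", correct=True),
    TaskResult("reserves-check", correct=True,
               violations=frozenset({Risk.DATA_MINIMIZATION})),
    TaskResult("denial-notice", correct=True,
               violations=frozenset({Risk.AUDIENCE_AWARENESS})),
    TaskResult("overlay-lookup", correct=False),
]

print(scorecard(results))
# {'accuracy': 0.75, 'safety': 0.5, 'correct_and_safe': 0.25}
```

In this toy set the agent is right on three of four tasks yet both right and safe on only one, a gap that a single pass-fail number would hide.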

What would a mortgage-origination benchmark have to test?

Because no confirmed MortarBench paper is available [unverified], the shape of such a benchmark has to be inferred from the closest vertical evaluations and from the workflow itself. A credible mortgage-origination benchmark would need to test more than correctness. It would need to test the full document-heavy, compliance-bound chain that makes origination different from generic enterprise automation.

The input layer alone is demanding. A benchmark task might hand the agent a synthetic loan file containing W-2s, bank statements, pay stubs, a credit report, an appraisal, a title commitment, and a purchase contract, some of them incomplete or contradictory. The agent would need to extract the right fields, flag missing or inconsistent items, and know when to suspend processing rather than guess. MBABench’s Accuracy/Formula/Format taxonomy is a useful starting point here: Accuracy would capture whether the final underwriting decision is sound, Formula would capture whether the calculations are constructed correctly, and Format would capture whether the output matches the lender’s document and disclosure requirements.

The compliance layer is where the leakage study’s five risk types become essential. A mortgage benchmark would need to score whether the agent respects permissible-purpose rules for credit pulls, routes disclosures to the correct parties, applies waiting periods and rescission windows, generates adverse-action notices with the required reasons, and limits data retention to policy. These are not edge cases. They are the core of what makes mortgage lending regulated.

Finally, the benchmark would need an audit layer. Regulators and investors do not just want the right answer; they want to reconstruct how the answer was reached. That means every calculation, every document reference, and every compliance decision must be traceable. An agent that produces a correct loan decision but cannot explain which data supported it is not deployable in a lending environment.
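A rough sketch of what traceability could look like in data terms, assuming every value carries either a document reference or the upstream values it was derived from. The `Traced` record, the `lineage` and `untraceable` helpers, and the document labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Traced:
    """A value plus the evidence behind it."""
    name: str
    value: float
    sources: Tuple[str, ...] = ()        # document references, if any
    inputs: Tuple["Traced", ...] = ()    # upstream values, if derived


def lineage(node: Traced) -> List[str]:
    """Every document reference that supports a value, without duplicates."""
    docs = list(node.sources)
    for child in node.inputs:
        for doc in lineage(child):
            if doc not in docs:
                docs.append(doc)
    return docs


def untraceable(node: Traced) -> List[str]:
    """Names of values that cite no document and derive from nothing."""
    missing = [] if (node.sources or node.inputs) else [node.name]
    for child in node.inputs:
        missing.extend(untraceable(child))
    return missing


# Hypothetical file: income cites a document, the debt figure cites nothing.
income = Traced("gross_monthly_income", 8000.0, sources=("W-2 2025",))
debt = Traced("monthly_debt", 2800.0)
dti = Traced("dti", debt.value / income.value, inputs=(debt, income))

print(lineage(dti))      # ['W-2 2025']
print(untraceable(dti))  # ['monthly_debt']
```

A benchmark could score the second list directly: a decision whose supporting values include any untraceable entry fails the audit layer regardless of whether the number is right.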

What evidence should lenders demand before deploying an agent?

Lenders should treat vendor demos as advertisements and demand vertical, audit-shaped benchmark evidence instead. The minimum evidence package has three parts: a public or auditable benchmark score on tasks that resemble the lender’s actual workflow; a separate safety evaluation that covers data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness; and a published failure-mode inventory showing which task types, document conditions, and edge cases cause the agent to err.

The failure-mode inventory is the part vendors are least eager to provide and the part lenders should insist on most. MBABench shows that agents degrade past a few chained calculations; a mortgage deployment needs to know exactly how many chained steps its workflow requires and where the agent’s error rate crosses an unacceptable threshold. The leakage study shows that safety failures ride alongside correct-looking outputs; a lender needs to know which borrower-data actions the agent can perform, which it cannot, and how those permissions are enforced.
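To make the step-count question concrete, here is a back-of-the-envelope sketch. It assumes each step in a chain succeeds independently with the same probability, which is a simplification introduced here and not a finding of MBABench; the function names and the sample figures are illustrative.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_accuracy ** steps


def max_steps(per_step_accuracy: float, min_chain_success: float,
              limit: int = 1000) -> int:
    """Longest chain whose end-to-end success rate stays at or above the floor."""
    if not 0.0 < per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in (0, 1]")
    if not 0.0 < min_chain_success <= 1.0:
        raise ValueError("min_chain_success must be in (0, 1]")
    steps = 0
    # Stop at `limit` so a perfect per-step accuracy cannot loop forever.
    while steps < limit and chain_success(per_step_accuracy, steps + 1) >= min_chain_success:
        steps += 1
    return steps


# Illustrative numbers: 95% per-step accuracy, 80% end-to-end floor.
print(max_steps(0.95, 0.80))  # 4
```

Under these assumptions an agent that looks strong on any single step falls below the floor by the fifth chained step. Real error rates are unlikely to be independent or uniform, so a lender would want the vendor's measured per-step figures in place of a single assumed number.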

There are positive reference points, but they do not close the case. Data Intelligence Agents, a three-agent system for enterprise data, matched or surpassed the best published results across seven SQL benchmarks by treating autonomous coding agents as a first-class abstraction. That is evidence that execution-grounded agent architectures can generalize in enterprise settings. SQL benchmarks are not mortgage origination, and query generation is not underwriting, but the architectural lesson is clear: ground the agent in verifiable execution against a schema, not in open-ended text generation.

The broader mid-2026 benchmark wave, including PERIA-8B, reinforces the same trend. The PERIA model improved over its Qwen3-8B backbone by 10.0% on in-distribution and 4.4% on out-of-distribution spatial-reasoning benchmarks while matching much larger models, illustrating the push toward tool-augmented, benchmark-validated agents (PERIA paper). The industry is moving in the right direction. For mortgage lenders, the question is whether that progress arrives before their own deployment does.

Frequently Asked Questions

Does the benchmark gap matter for small lenders, or only for large originators?

It matters for both, but smaller lenders face the same RESPA, TILA, and fair-lending obligations with less budget to audit vendor claims. A missing mortgage-specific benchmark leaves them dependent on generic finance-agent scores that do not cover document verification or adverse-action workflows.

How is MBABench different from general coding-agent leaderboards like SWE-bench?

SWE-bench checks whether a code patch passes hidden unit tests, while MBABench grades a finished spreadsheet on accuracy, formula structure, and professional format. The second and third dimensions matter for lending because a workbook must survive human review, auditor inspection, and investor overlay checks, not just compute a correct number.

What should a lender’s model-risk team do before signing an agent vendor?

Re-run the vendor’s benchmark tasks on the lender’s own synthetic loan files, score accuracy and safety independently, and define a hard step-count or error-rate threshold that triggers human takeover. The evidence checklist earlier in this article names what to demand from the vendor; this step adds an internal test harness and an explicit kill switch.

What failure mode do text-only LLM benchmarks miss that MBABench surfaces?

A spreadsheet cell can show the correct figure while its formula points to the wrong upstream input, which can silently corrupt a debt-to-income ratio or compliance worksheet. In a fair-lending exam, that kind of invisible calculation drift can become a disparate-impact finding even if the arithmetic itself looks right.

How quickly could a credible mortgage-origination benchmark emerge?

The closest vertical benchmarks arrived within days of each other in mid-June 2026: MBABench v3 on June 12 and the Singapore/Korea leakage study on June 15. That pace suggests the pieces, spreadsheet evaluation and data-handling safety, are already being assembled; the missing step is a document-heavy benchmark that combines them with synthetic loan files and public failure-mode tables.