groundy

Bloomberg's Pomona Makes Small Automated Code Changes, Not Big Agent PRs

Bloomberg's Pomona agent targets diffs of roughly 10 lines and had 15 of its 17 PRs (88%) merged in a one-month production trial, suggesting that small, bounded edits earn reviewer trust faster than large autonomous refactors.


Bloomberg’s Pomona doesn’t try to write features. It doesn’t refactor architectures or generate multi-file pull requests. Instead, it runs a continuous loop of small, targeted code-quality fixes. In a one-month production deployment reported in a June 2026 arXiv paper, Pomona generated 17 pull requests for a single engineering team. Fifteen were merged, and the median time from PR open to close was under two hours.

That merge rate and turnaround are the numbers that matter, because they quantify a design choice most coding-agent projects avoid: deliberately capping agent ambition at roughly 10 lines of diff per PR.

Small Diffs by Design, Not by Accident

Pomona’s architecture is built around two agent skills, Scanning and Repair, operating against a shared task backlog. The Scanning skill identifies improvement tasks across several categories:

  1. Lint-rule enforcement. Finds linting violations across the codebase.
  2. Technical-debt classification. Detects technical-debt markers and flags dead code.
  3. Test and complexity hygiene. Surfaces test-coverage gaps and complexity issues.

Each discovered task is scored and prioritised in a structured backlog. The Repair skill then picks the highest-priority item from the backlog and generates a PR. The target is consistently around 10 lines of diff, small enough that a reviewer can verify correctness in under two minutes.
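The size target can be enforced mechanically before a PR is ever opened. The following is a minimal sketch of such a guard, not Pomona's actual implementation: the function names and the `MAX_CHANGED_LINES` constant are hypothetical, and the limit of 10 is taken from the target described above.

```python
import difflib

# Assumption: a budget of 10 changed lines, mirroring the target described in the article.
MAX_CHANGED_LINES = 10


def changed_line_count(before: str, after: str) -> int:
    """Count removed plus added lines between two versions of a file."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced line counts twice: once as removed, once as added.
            changed += (i2 - i1) + (j2 - j1)
    return changed


def within_budget(before: str, after: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True if the proposed edit is small enough to open as a PR."""
    return changed_line_count(before, after) <= limit
```

A repair step could call `within_budget` on every touched file and discard or split any candidate fix that exceeds the limit, so reviewers never see an oversized diff.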

The design principles are explicit. Pomona must be low friction (fully autonomous, no workflow changes for the engineers on the team), low stakes (human-in-the-loop via mandatory PR review, so the agent cannot land code without approval), and focused on code quality only (changes must benefit long-term maintainability, not add features).

Production Numbers: One Team, One Month

The deployment data comes from a single Bloomberg engineering team over one month. The headline figures:

Metric                        Value
PRs generated                 17
PRs merged                    15 (88%)
Median time-to-close          Under 2 hours
Engineers wanting to adopt    8 of 10 surveyed

The engineer survey is worth reading closely. Eight of ten surveyed engineers expressed a desire to adopt Pomona, praising the small diff sizes and the focus on code quality over feature work. This is not a population dazzled by AI demos; it is a group of senior engineers who noticed that the PRs were easy to review and actually improved the codebase, and who responded by wanting more of them.

Bloomberg has prior form here. B-Assist, an earlier automated program repair tool deployed at the company, achieved up to a 74.56% patch acceptance rate by inserting suggestions into active pull requests via GitHub’s Suggested Changes interface rather than creating new PRs. Pomona builds on the same insight: meet reviewers where they already are, and keep the cognitive cost of engaging with automation low.

Why This Runs Against the Current

Most coding-agent work in 2025 and 2026 has chased the SWE-bench leaderboard: give the agent a GitHub issue, let it navigate the repository, and measure whether it produces a correct patch. The implicit goal is larger, more autonomous changes. Pomona inverts that. It constrains the agent to small, bounded edits and measures throughput in terms of PRs that humans actually merge without modification.

The tension is real. The Pomona paper cites He et al. (2026), who found that agent adoption in open-source repositories increases development velocity only temporarily and correlates with longer-term increases in code complexity and quality issues. That finding has not been independently verified, and the causal mechanism is still being debated. But it frames the tradeoff clearly: an agent that lands a large, multi-file refactor may look impressive in a demo, but if reviewers start rubber-stamping because the diffs are too large to audit carefully, the net effect on codebase health can be negative.

Pomona’s 10-line diff cap is a bet that the opposite dynamic holds. If every automated PR is trivially reviewable, engineers maintain full visibility and the trust cost of automation approaches zero.

What This Means for Teams Evaluating Coding Agents

The Pomona results suggest a few concrete lessons for any team considering an agentic code-quality tool:

Bounded scope beats ambitious scope for trust. A tool that consistently produces small, correct, boring diffs builds reviewer confidence over time. A tool that alternates between brilliant large refactors and subtly broken ones destroys that confidence immediately. The asymmetric cost of failure favours Pomona’s approach.

Throughput is a function of review cost, not generation cost. Generating PRs is cheap. Getting them reviewed and merged is the bottleneck. Pomona’s sub-2-hour median time-to-close reflects a design that optimises for the reviewer’s time, not the agent’s. Teams measuring coding-agent ROI should weight merge rates and time-to-merge alongside generation volume.
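Both review-side metrics are simple to compute from PR records. The sketch below assumes each PR is available as a pair of opened and closed timestamps; the function names are hypothetical and the record shape is an assumption, not something the paper specifies.

```python
from datetime import datetime, timedelta
from statistics import median


def merge_rate(merged: int, generated: int) -> float:
    """Fraction of generated PRs that were merged."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    return merged / generated


def median_time_to_close(intervals: list[tuple[datetime, datetime]]) -> timedelta:
    """Median duration between PR open and PR close.

    Each interval is an (opened_at, closed_at) pair.
    """
    seconds = [(closed - opened).total_seconds() for opened, closed in intervals]
    return timedelta(seconds=median(seconds))


# The reported figures: 15 of 17 PRs merged.
print(f"{merge_rate(15, 17):.0%}")
```

With the reported counts, `merge_rate(15, 17)` rounds to 88%. Tracking the same two numbers for any coding agent gives a like-for-like comparison that generation volume alone does not.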

The backlog-as-product pattern is underused. Pomona’s Scan-Prioritise-Repair loop, with a persistent task backlog, is a pattern that generalises beyond Bloomberg. Any team with a linter, a dead-code detector, and a coverage tool could replicate the scanning layer. The repair layer requires LLM generation, but the prioritisation and scoping logic is straightforward engineering.

The limitation is equally clear. Pomona does not address feature work, complex refactoring, architectural migration, or any change that spans more than a handful of lines. It is a specialised tool for a specific class of technical-debt reduction, and the authors are transparent about that scope. The question for teams is not whether Pomona-style agents replace larger coding agents. It is whether starting with small, trusted, high-throughput changes is a better on-ramp to agent-assisted development than jumping straight to autonomous multi-file edits. The Bloomberg deployment data, limited as it is, suggests the answer is yes.

Frequently Asked Questions

What specific linting and static-analysis tools does Pomona’s Python implementation target?

The evaluated deployment uses ruff for rule enforcement and mypy strict for type checking, alongside detectors for TODO, FIXME, HACK, and XXX markers, dead code, test-coverage gaps, and functions exceeding 50 lines. These feed into a P1-through-P4 prioritisation matrix scored on benefit multiplied by ease of review. Teams replicating the pattern can swap in their own linter and coverage tooling without changing the backlog scoring logic.
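A benefit-times-ease score mapped onto four priority levels might look like the sketch below. The 1-to-3 scales, the thresholds, and every name here are illustrative assumptions; the article only establishes that priority is derived from benefit multiplied by ease of review and bucketed into P1 through P4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    description: str
    benefit: int         # hypothetical scale: 1 (low) to 3 (high)
    ease_of_review: int  # hypothetical scale: 1 (hard) to 3 (trivial)

    @property
    def score(self) -> int:
        """Benefit multiplied by ease of review."""
        return self.benefit * self.ease_of_review


def priority(task: Task) -> str:
    """Bucket a task into P1..P4. Thresholds are illustrative assumptions."""
    if task.score >= 6:
        return "P1"
    if task.score >= 4:
        return "P2"
    if task.score >= 2:
        return "P3"
    return "P4"


def next_task(backlog: list[Task]) -> Optional[Task]:
    """Pick the highest-scoring task, or None if the backlog is empty."""
    return max(backlog, key=lambda t: t.score) if backlog else None
```

Under these assumed scales, an unused import (moderate benefit, trivial to review) outranks splitting a long function (high benefit, hard to review), which is the bias toward reviewable work that the scoring is meant to encode.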

How does Pomona’s PR-based workflow differ from Bloomberg’s earlier B-Assist tool?

B-Assist injected repair suggestions directly into engineers’ active pull requests using GitHub’s Suggested Changes interface, achieving up to a 74.56% acceptance rate. Pomona creates its own pull requests instead, so the changes do not depend on someone already having an open PR in the affected area. The tradeoff is that B-Assist piggybacks on existing reviewer attention, while Pomona must earn its own review time but can act on code that no one is currently working on.

Can the scanning and prioritisation layers run without an LLM?

Yes. The three scanning sub-agents run standard static-analysis tooling and score results on a benefit times ease-of-review matrix that assigns P1 through P4 priority levels. Only the Repair skill, which writes the actual code fix, requires LLM generation. A team could deploy the scanning and prioritisation pipeline as a dashboard or CI check to surface technical debt, then decide later whether to automate the fixes.
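Two of the detectors mentioned earlier, debt markers and over-long functions, need nothing beyond the Python standard library. This is a minimal sketch under that assumption; the function names are hypothetical, and the marker regex is deliberately naive (it would also match a `#` that appears inside a string literal).

```python
import ast
import re

# Matches the first debt marker that appears after a '#' on a line.
MARKER_RE = re.compile(r"#.*?\b(TODO|FIXME|HACK|XXX)\b")
MAX_FUNCTION_LINES = 50  # threshold taken from the article's description


def find_debt_markers(source: str) -> list[tuple[int, str]]:
    """Return (line_number, marker) for each commented debt marker."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = MARKER_RE.search(line)
        if match:
            findings.append((lineno, match.group(1)))
    return findings


def find_long_functions(
    source: str, limit: int = MAX_FUNCTION_LINES
) -> list[tuple[str, int]]:
    """Return (function_name, length_in_lines) for functions longer than limit."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on Python 3.8 and later.
            length = node.end_lineno - node.lineno + 1
            if length > limit:
                findings.append((node.name, length))
    return findings
```

The output of both functions is plain data, so it can feed a dashboard or a CI report long before any automated repair step is wired in.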

What would cause Pomona’s 88% merge rate to drop in a wider deployment?

Several factors the one-team, one-month trial did not stress-test: repositories with tightly coupled modules where a 10-line change in one file breaks assumptions in another, multi-language codebases where the same skill definitions must target different linters and test frameworks, and reviewer fatigue if the agent generates PRs faster than the team can process them. The paper also does not report what the two rejected PRs got wrong, which would be the most informative data point for predicting failure modes.

Sources (3 cited)

  1. Pomona: Continuous Code Quality Improvement via Small, Automated Changes at Bloomberg (primary; accessed 2026-06-09)
  2. Pomona: Continuous Code Quality Improvement, full HTML (primary; accessed 2026-06-09)
  3. User-Centric Deployment of Automated Program Repair at Bloomberg (B-Assist) (analysis; accessed 2026-06-09)