Can Code-Generating LLMs Do Engineering Math? FEM-Bench Tests Them

What FEM-Bench actually tests

The finite element method subdivides complex physical systems into smaller elements to numerically solve partial differential equations: structural stresses, heat transfer, fluid flow. FEM-Bench takes 33 tasks aligned with a first graduate course on computational mechanics and asks code-generating LLMs to produce working FEM solvers from problem descriptions. The tasks are introductory but nontrivial. The authors report that state-of-the-art models do not reliably solve all of them.
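To make "subdivide into elements and solve" concrete, here is a minimal sketch of the simplest textbook case: a one-dimensional elastic bar fixed at one end and pulled at the other, split into linear two-node elements. It is not taken from FEM-Bench. The function names (`solve_bar`, `gauss_solve`) and the parameter values are assumptions chosen for illustration, and the dense Gaussian elimination is only suitable for tiny systems.

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination without pivoting.

    Adequate here because the reduced stiffness matrix is symmetric
    positive definite; not a general-purpose solver.
    """
    n = len(b)
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for i in range(n):
        for r in range(i + 1, n):
            m = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= m * A[i][c]
            b[r] -= m * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x


def solve_bar(n_elems, length, EA, tip_load):
    """Nodal displacements of a bar fixed at x=0 with a point load at x=length."""
    h = length / n_elems          # element length
    k = EA / h                    # axial stiffness of one element
    n_nodes = n_elems + 1

    # Assemble the global stiffness matrix from identical 2x2 element matrices.
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_elems):
        K[e][e] += k
        K[e][e + 1] -= k
        K[e + 1][e] -= k
        K[e + 1][e + 1] += k

    # Load vector: a single point force at the free end.
    f = [0.0] * n_nodes
    f[-1] = tip_load

    # Enforce u(0) = 0 by removing the first row and column.
    A = [row[1:] for row in K[1:]]
    u_free = gauss_solve(A, f[1:])
    return [0.0] + u_free


u = solve_bar(n_elems=4, length=2.0, EA=100.0, tip_load=10.0)
print([round(v, 6) for v in u])
```

For this load case the exact displacement is u(x) = P x / EA, so the tip value should be 10 × 2 / 100 = 0.2. Having a closed-form answer to compare against is what makes small problems like this useful as test cases.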

The distinction from generic coding benchmarks matters. HumanEval and SWE-bench measure whether generated code compiles, passes unit tests, and integrates into existing codebases. FEM-Bench measures whether the code produces a physically correct numerical result. A solver that compiles, runs, and returns a plausible-looking stress distribution can still be wrong in ways that only a domain expert would catch. Generic benchmarks have no mechanism to detect this class of error.

The numbers: which models lead and where they still fail

According to the FEM-Bench paper’s evaluation tables, Gemini 3 Pro was the best function-writing model in a five-attempt run, completing 30 of 33 tasks at least once and 26 of 33 tasks across all five attempts. GPT-5 achieved the highest score on the unit-test writing subtask with a 73.8% Average Joint Success Rate, reported in the paper’s results section. The gap between “solved at least once” and “solved all five times” is the telling number. A model that needs multiple shots to land a correct FEM solver is a model that does not reliably understand the physics.
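The difference between the two counts is easy to compute from per-attempt results. The sketch below is an assumption about how such a tally could be done, not the paper's evaluation code. The per-task pass/fail pattern is synthetic, constructed only so that the totals match the reported 30 and 26 out of 33.

```python
def attempt_metrics(results):
    """Return (fraction solved at least once, fraction solved on every attempt).

    `results` maps a task name to a list of booleans, one per attempt.
    """
    n_tasks = len(results)
    solved_once = sum(1 for attempts in results.values() if any(attempts))
    solved_always = sum(1 for attempts in results.values() if all(attempts))
    return solved_once / n_tasks, solved_always / n_tasks


# Synthetic data: only the totals (30 and 26 of 33) come from the article.
results = {}
for t in range(26):
    results[f"task_{t}"] = [True] * 5                        # solved every time
for t in range(26, 30):
    results[f"task_{t}"] = [True, False, True, False, False]  # solved sometimes
for t in range(30, 33):
    results[f"task_{t}"] = [False] * 5                        # never solved

best_of_five, all_five = attempt_metrics(results)
print(f"at least once: {best_of_five:.1%}")  # 90.9%
print(f"all five:      {all_five:.1%}")      # 78.8%
```

The roughly twelve-point gap between the two rates comes entirely from the four tasks solved on some attempts but not others.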

Even the best function-writing model left three of the 33 tasks unsolved in all five attempts. Which three is not identified here, but the pattern is consistent with what computational-mechanics researchers report anecdotally: LLMs handle routine element formulations and boundary conditions but struggle with multi-step reasoning about coupled physics or nonstandard mesh topologies.

The silent killer: wrong code that compiles and runs

The core problem FEM-Bench exposes is not that LLMs write broken code. It is that they write code that runs and returns physically incorrect results without signaling an error.

A typical HumanEval-style evaluation checks output against expected values. If the generated function returns the wrong answer, the test fails. In scientific computing, the “test” is often whether the result satisfies conservation laws, convergence criteria, and physical constraints that are implicit in the problem formulation rather than encoded as assertions. A generated FEM solver can produce a displacement field that looks reasonable, has the right units, and passes a superficial sanity check while violating equilibrium or convergence properties. A downstream engineer who trusts that output builds on a silent error.
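A small illustration of such an implicit check, under stated assumptions: for a one-dimensional bar fixed at one end and loaded by a point force at the other, equilibrium requires the axial force in every element to equal the applied load. The sketch below tests a displacement field against that condition. The function names, the tolerance, and the "buggy" field (a correct field scaled by one half, as if a stiffness term had been doubled) are hypothetical and are not drawn from the benchmark.

```python
def element_forces(u, length, EA):
    """Axial force in each linear element, from nodal displacements `u`."""
    h = length / (len(u) - 1)
    return [EA * (u[i + 1] - u[i]) / h for i in range(len(u) - 1)]


def satisfies_equilibrium(u, length, EA, tip_load, rel_tol=1e-6):
    """True if every element carries the applied tip load (within rel_tol)."""
    forces = element_forces(u, length, EA)
    return all(abs(N - tip_load) <= rel_tol * abs(tip_load) for N in forces)


length, EA, tip_load = 2.0, 100.0, 10.0
xs = [0.0, 0.5, 1.0, 1.5, 2.0]

# Exact field for this load case: u(x) = P * x / EA.
correct = [tip_load * x / EA for x in xs]

# A plausible-looking wrong answer: smooth, monotonic, right units,
# zero at the support, but every value is half of what it should be.
buggy = [0.5 * v for v in correct]

print(satisfies_equilibrium(correct, length, EA, tip_load))  # True
print(satisfies_equilibrium(buggy, length, EA, tip_load))    # False
```

Both fields would pass a check that only looks at shape, sign, and boundary values. Only the force balance separates them.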

This is not a hypothetical risk. The benchmark’s design explicitly tests for it: the evaluation compares generated solver output against reference solutions computed with verified implementations, not against simple string matches or superficial output checks.

How this fits the broader push for scientific LLM benchmarks

FEM-Bench is part of a growing set of domain-specific evaluations that test whether LLMs can reason about physics, not just manipulate syntax.

SimulCost, a complementary benchmark for automating physics simulations, found that frontier LLMs achieve 46% to 64% success on single-round parameter tuning, dropping to 35% to 54% under stricter accuracy requirements. The authors frame this as a cost-aware evaluation: each failed attempt has a concrete computational price, and current models burn through attempts without converging reliably.
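The dependence on the accuracy threshold can be shown with a toy tally. This is a sketch under assumptions: the relative-error values are invented for illustration, and `success_rate` is a hypothetical helper, not SimulCost's metric or data.

```python
def success_rate(relative_errors, tolerance):
    """Fraction of runs whose relative error is within `tolerance`."""
    hits = sum(1 for e in relative_errors if e <= tolerance)
    return hits / len(relative_errors)


# Hypothetical relative errors from ten tuning attempts (not real results).
errors = [0.001, 0.004, 0.02, 0.03, 0.08, 0.2, 0.5, 0.009, 0.04, 1.2]

print(success_rate(errors, tolerance=0.05))  # loose threshold: 0.6
print(success_rate(errors, tolerance=0.01))  # strict threshold: 0.3
```

The same ten runs score 60% or 30% depending only on where the line is drawn, which is why a success rate quoted without its tolerance says little.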

Separately, researchers have tested LLMs as online adaptive controllers for SIMP topology optimization, where the models outperformed fixed baselines by 5.7% to 18.1% on compliance. That result points in a different direction: LLMs as interactive components in optimization loops, where human feedback or structured prompting can compensate for single-shot brittleness.

The two findings are not contradictory. An LLM that struggles to write a correct solver from scratch may still provide useful adaptive control signals when the task is narrower and the feedback loop is tighter. The distinction matters for anyone deciding where to deploy these models in an engineering pipeline.

What to watch for next

FEM-Bench is a v2 preprint. The tasks cover introductory graduate-level FEM. A harder benchmark, covering nonlinear material models, dynamic analysis, or coupled multi-physics problems, would stress-test the current generation of models more severely.

The immediate question for teams evaluating LLMs for scientific code generation is whether their internal benchmarks look more like HumanEval or more like FEM-Bench. If a model passes HumanEval at 90% and your domain-specific eval at 50%, the HumanEval score is the wrong number to quote in a procurement decision.

The SimulCost finding, that accuracy requirements erase a large share of apparent competence, is the number to watch as these benchmarks mature. A model that works when “close enough” is acceptable and fails when precision matters is not a model you want anywhere near a structural analysis report.

Frequently Asked Questions

How does FEM-Bench differ from math QA benchmarks that also test scientific reasoning?

Most scientific LLM evaluations test symbolic manipulation or equation solving with short numeric answers. FEM-Bench requires models to generate complete executable numerical solvers that converge to correct physical results, not just recall formulas. The evaluation compares output against reference implementations from verified solvers, making it a test of computational pipeline assembly rather than mathematical knowledge recall.

What compute budget should teams plan for LLM-assisted FEM workflows?

SimulCost’s cost-aware evaluation found that frontier models succeed on only 46% to 64% of single-round simulation tuning attempts. Since each FEM simulation run can consume minutes to hours of compute time, a reasonable planning figure (a rule of thumb, not a number reported by SimulCost) is roughly 2x to 3x the compute of a hand-written baseline to absorb failed iterations, with automated retry logic built into the pipeline.
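A back-of-envelope way to turn a success rate into a retry budget, assuming each attempt succeeds independently with the same probability p: the number of attempts until the first success is then geometric, with mean 1/p. Real retries with feedback are not independent, so treat this as a rough sketch. The function name is hypothetical. The four rates are the range endpoints quoted from SimulCost in this article.

```python
def expected_attempts(p):
    """Mean number of independent attempts until the first success."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


# Single-round success rates quoted above (loose and strict accuracy).
for label, p in [
    ("loose, upper", 0.64),
    ("loose, lower", 0.46),
    ("strict, upper", 0.54),
    ("strict, lower", 0.35),
]:
    print(f"{label:14s} p={p:.2f} -> about {expected_attempts(p):.1f} runs")
```

Under this simplification the expected number of runs per successful result falls between about 1.6 and 2.9, which lands in the same range as a 2x to 3x allowance.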

Where are LLMs already beating traditional methods in computational mechanics?

LLMs serving as adaptive controllers in SIMP topology optimization beat fixed baselines by 5.7% to 18.1% on compliance. Unlike standalone code generation, this role puts the model inside a human-supervised optimization loop where iterative feedback compensates for single-shot unreliability. The earliest production deployments in computational mechanics will likely follow this interactive-assistance pattern rather than autonomous solver generation.

What FEM problems are beyond this benchmark’s current scope?

The 33 tasks cover introductory graduate-level material. Industrial FEM practice routinely involves nonlinear material models (plasticity, hyperelasticity), transient dynamic analysis (impact, vibration), and coupled multi-physics problems (thermo-mechanical, fluid-structure interaction), all of which fall outside the current benchmark. These problems require multi-step reasoning about coupled equation systems that models already struggle with on simpler formulations.