Can LLMs Write Better Research Paper Titles Than Authors?

A new arXiv preprint from Rehman et al. claims language models can generate “generally appropriate and reliable” titles for research papers, with fine-tuned PEGASUS-large outscoring both LLaMA-3-8B and GPT-3.5-turbo on standard text-similarity metrics. The catch is in the metrics. The study evaluates title quality using ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore: automated measures of n-gram overlap and embedding proximity. None of these capture whether a title accurately represents a paper’s contribution, whether a reader would choose to click it, or whether it leads to a citation. A title that scores well on BERTScore against the original may still misrepresent the work.

What the paper did

Rehman et al. (arXiv:2606.05085) frame title generation as an abstractive summarization task: feed a paper’s abstract into a model, get a title out. They benchmark three model families: fine-tuned PEGASUS-large, a Transformer encoder-decoder pre-trained for summarization with a gap-sentence generation objective; fine-tuned LLaMA-3-8B, an open-weight general-purpose LLM; and zero-shot GPT-3.5-turbo, representing the proprietary API route with no task-specific training.

The evaluation runs across three datasets: CSPubSum, LREC-COLING-2024, and a new SpringerSSAT corpus the authors curated from four Springer social-science journals. The authors also test ChatGPT’s ability to produce “creative” titles, a separate experimental track that the paper discusses but does not integrate into the main quantitative comparison.

Metrics are the standard summarization suite: ROUGE for surface overlap, METEOR for alignment considering synonyms and stems, MoverScore and BERTScore for semantic similarity via contextual embeddings, and SciBERTScore as a domain-specific variant. PEGASUS-large wins across most of these.

The metric trap

The problem is not that these metrics are wrong for their stated purpose. ROUGE and BERTScore are reasonable tools for measuring how closely a generated text resembles a reference text. The problem is that “resembles the author’s original title” and “is a good title” are not the same question.
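The gap between resemblance and quality can be made concrete with a minimal sketch. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1, not the scoring code used in the paper: real ROUGE implementations add stemming, longer n-grams, and longest-common-subsequence variants. The three titles are hypothetical, invented only for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical titles, invented for illustration only.
reference = "Fine-tuning improves title generation for scientific papers"
faithful = "Better scientific paper titles through model fine-tuning"
misleading = "Fine-tuning degrades title generation for scientific papers"

faithful_score = rouge1_f1(faithful, reference)
misleading_score = rouge1_f1(misleading, reference)

print(f"faithful paraphrase: {faithful_score:.3f}")
print(f"reversed claim:      {misleading_score:.3f}")
```

In this toy case the title that reverses the claim shares six of seven tokens with the reference and outscores the faithful paraphrase, which shares two. Embedding-based metrics are less brittle than raw token overlap, but they still score proximity to the reference rather than fidelity to the paper.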

A summarization model optimized to reproduce the surface features of existing titles will produce outputs that score well on n-gram and embedding similarity. That is tautological. The model learned the statistical distribution of titles in the training data; the metrics measure how close the output is to that distribution; the model wins. This tells you something about the model’s capacity for pattern matching, not about whether the generated title serves the paper’s readers.

The paper does not measure downstream impact. No A/B test of click-through rates. No correlation with citation counts. No human evaluators judging whether the generated title is more or less informative than the original. The headline conclusion, that AI-generated titles are “generally appropriate and reliable,” rests entirely on the assumption that high metric scores imply appropriateness. That assumption is the claim that needed testing.

Fluency is not fidelity

A companion paper published the same week sharpens this critique. arXiv:2606.04152 introduces PEEL, a scaffolding that pairs deterministic reading tools with LLM interpretation via Claude. The authors find systematic distortions in AI-generated text that are “invisible without non-AI measurement.” Their three design implications are worth quoting directly: deterministic instruments must accompany AI tools; fluency is not fidelity; epistemic authority must be designed in.

Applied to the title-generation question, this is the core issue. A fluent title that reads well and scores well on embedding similarity can still be wrong in ways that matter: overstating a result, narrowing a contribution, or defaulting to the most statistically probable framing rather than the most accurate one. The PEEL finding suggests these distortions are systematic, not random. They will tend in a consistent direction toward conventionality and away from precision.

Separate work on LLM calibration reinforces the point. Niu et al. report that Calibration with Semantic Reward (CSR) reduces Expected Calibration Error by up to 40% over verbalized-confidence baselines across three model families. The relevance: a gain of that size suggests that a model’s stated confidence tracks correctness poorly in the baseline case. If LLMs can be confidently wrong about their own outputs on well-studied calibration benchmarks, the confidence with which a title-generation model produces a “good” title is not evidence that the title is good. That leaves the metric as the only signal, and the metric is circular.
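For readers unfamiliar with the measure, Expected Calibration Error can be sketched in a few lines. This is the common equal-width-bin formulation, a weighted average of the gap between mean confidence and accuracy in each bin. It is not taken from Niu et al., whose exact binning may differ, and the function name and inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up example: a model that says "90% sure" four times and is right once.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
print(f"ECE = {overconfident:.2f}")
```

A score of zero means stated confidence matches observed accuracy in every bin. The made-up overconfident case lands far from zero, which is the sense in which confident output is not evidence of correct output.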

The visibility arms race

Set aside the methodological critique for a moment and consider the incentive structure. If AI-generated titles do outperform human-written ones on whatever downstream metric actually matters, whether clicks, citations, or recommendation-feed placement, researchers face a coordination problem.

The rational move for any individual researcher is to run their abstract through a title model before submission. Once a critical mass does this, AI-optimized titles become the baseline. The marginal advantage disappears, but the cost of not using the tool becomes a real disadvantage: your human-written title now looks unconventional relative to the new norm, which is itself shaped by what the model learned from the old norm. This is the same dynamic that turned search engine optimization from a competitive advantage into a cost of doing business.

The difference is that SEO optimizes for a search engine’s ranking function, which is at least an external system with its own incentives. Title-generation models optimize for text-similarity metrics that the research community chose for convenience, not because they reflect what makes a title useful to a reader. The optimization target is the tool, not the audience.

arXiv in transition

This plays out against an institutional backdrop that is actively shifting. In November 2025, arXiv announced it would no longer accept computer science review articles and position papers that had not been vetted by a journal or conference, specifically citing a rise in AI-generated submissions. In March 2026, arXiv announced it would separate from Cornell University and become an independent nonprofit on July 1, 2026, motivated by a desire to diversify funding.

The platform where AI-optimized titles would circulate is itself reconfiguring its governance and content policies around the problem of AI-generated text. Title generation sits in an awkward position: it is not the kind of wholesale fabrication that motivated arXiv’s 2025 policy change, but it is a form of AI optimization applied to the single most important surface a paper has. The title determines whether a paper appears in search results, whether it gets opened, whether it gets cited. Optimizing that surface with a tool whose quality signal is circular raises questions that the current metrics cannot answer.

The honest conclusion is that we do not yet know whether LLMs write better research titles than authors. We know they write more metric-plausible titles. Whether those titles are better at their actual job, connecting a reader to work worth reading, requires a different kind of evaluation entirely: human judgment, downstream impact data, and a willingness to treat fluency as a surface feature rather than a quality signal. Until that evaluation exists, the claim rests on the metric, and the metric rests on the model.

Frequently Asked Questions

Does this benchmark generalize beyond English-language computer science papers?

The three datasets span both CS (CSPubSum, LREC-COLING-2024) and social science: the authors curated SpringerSSAT from four Springer social-science journals specifically to widen domain coverage. However, all three corpora are English-only, so cross-lingual transfer remains untested. A title model trained on English abstracts may produce grammatically plausible but culturally misaligned titles for papers written in other languages.

What did the separate ChatGPT creative-title track actually find?

The paper ran ChatGPT’s creative titles as a distinct experimental arm that was not folded into the main quantitative comparison with PEGASUS and LLaMA-3. This means the headline result, fine-tuned PEGASUS-large wins, only covers the summarization-style evaluation path. The creative track is discussed qualitatively but excluded from the ROUGE and BERTScore leaderboard, so there is no direct metric comparison between a model instructed to be inventive and one fine-tuned to reproduce existing title patterns.

What would a rigorous evaluation of generated titles require beyond automated metrics?

Three additions the paper does not attempt: blinded human evaluation of whether a generated title accurately represents the paper’s contribution, A/B click-through testing on a preprint server, and longitudinal citation correlation over 12 to 24 months. A fourth signal comes from the Calibration with Semantic Reward work (arXiv:2605.15588), which reduced Expected Calibration Error by up to 40% over verbalized-confidence baselines. That result suggests a practical test: ask the title model to rate its own output’s accuracy on a calibrated scale, then check whether high-confidence titles actually correspond to more faithful summaries. If the model is poorly calibrated, confident titles are no more trustworthy than uncertain ones.
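As a rough illustration of the A/B click-through idea, the sketch below applies a standard two-proportion z-test to view and click counts for two title variants. The function name and the counts are hypothetical, and a real experiment on a preprint server would also need randomized assignment and a sample size fixed in advance.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both titles perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: variant A is the author's title, variant B the generated one.
z, p = two_proportion_z_test(clicks_a=120, views_a=2000, clicks_b=150, views_b=2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A positive z here means the generated title drew a higher click-through rate in the sample. Even a significant result would show only that more readers clicked, not that the title represents the paper more faithfully, which is why the blinded human evaluation is a separate requirement.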

How does arXiv’s pending independence affect the AI-title question?

arXiv separates from Cornell University on July 1, 2026, driven by a need to diversify funding. An independent arXiv reliant on a broader donor base may face pressure from academic publishers who supply both content and legitimacy. The SpringerSSAT dataset in the Rehman et al. study was curated from Springer journals, creating an awkward overlap: the same publishers whose data trained the title model may push arXiv to restrict AI-augmented metadata on papers they host. The November 2025 policy change targeted wholesale fabrication in CS reviews and position papers, but it did not address title optimization. An independent board could close that gap.

What specific failure mode does PEEL expose for AI-generated titles in peer review?

PEEL pairs deterministic reading tools with LLM interpretation via Claude and finds that systematic distortions in AI-generated text are, in the authors’ words, “invisible without non-AI measurement.” Applied to titles, this means a reviewer triaging submissions by skimming titles can accept a fluent misrepresentation without detecting the reframing. The distortion is not random noise; PEEL’s evidence points to a consistent drift toward conventionality. In a peer-review pipeline where titles gate whether a paper is read at all, this class of error compounds: conventionally framed titles may receive more attention regardless of whether the framing is accurate.