When an AI model improves a student’s writing scores while simultaneously degrading their actual writing ability, it is functioning as a crutch. A paper submitted April 16, 2026 by Susanto and colleagues tests this directly across EFL learner cohorts, comparing pre- and post-ChatGPT compositions using both automated metrics and human expert scoring.1 The results suggest the crutch-or-ceiling question turns less on which model is used than on whether the learner has the proficiency — and the workflow — to resist cognitive offloading.

The Study: What Susanto et al. Measured (and Why It’s Different)

Most AI-and-writing research arrives at a verdict: AI helps, or AI harms. “The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings,” submitted to arXiv on April 16, 2026, is structured differently.1 Susanto, Woo, Yeung, Lo-Philip, and Chi Ho Yeung compare EFL student compositions written before and after ChatGPT’s release, spanning multiple LLM generations, evaluated with a dual-instrument approach: automated metrics — Pearson correlation, MTLD (Measure of Textual Lexical Diversity), and standard readability tests — alongside human expert qualitative scoring.

The methodological pairing is the study’s core contribution. MTLD and readability tools capture surface-level linguistic output. Human expert scoring captures whether the writing actually thinks: whether arguments cohere, whether ideas develop across paragraphs. Running both simultaneously creates the conditions to detect a specific failure mode — that AI might inflate one while depressing the other.
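Since MTLD carries much of the automated-metric weight in the study, it is worth unpacking what it actually computes. Below is a minimal one-directional sketch of the standard McCarthy–Jarvis procedure; the published measure averages a forward and a reversed pass, and the 0.72 threshold is the conventional default, not a value taken from this paper.

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD pass (after McCarthy & Jarvis, 2010).

    Walk the token stream, tracking the type-token ratio (TTR) of the
    current segment. Each time TTR drops below the threshold, count one
    complete "factor" and start a fresh segment. Any leftover segment
    contributes a partial factor proportional to how far its TTR has
    fallen toward the threshold. MTLD is tokens per factor: higher
    values mean the text sustains lexical variety for longer stretches.
    """
    factors = 0.0
    types = set()
    segment_len = 0
    for token in tokens:
        segment_len += 1
        types.add(token.lower())
        ttr = len(types) / segment_len
        if ttr < threshold:
            factors += 1
            types.clear()
            segment_len = 0
    if segment_len > 0:
        # Partial credit for the unfinished final segment.
        ttr = len(types) / segment_len
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))
```

The key property for this study: a text that recycles a small vocabulary exhausts its segments quickly and scores low, while a text that keeps introducing new types scores high — which is exactly why LLM-polished vocabulary can inflate MTLD without any change in how well the text argues.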

One limitation worth noting upfront: the abstract does not report sample size or name the specific LLM generations evaluated, which constrains how far individual effect sizes can be generalized.

The Crutch Effect: When AI Scores Improve but Skills Stagnate

The central finding for lower-proficiency EFL learners is not that AI makes their writing worse. It is that AI makes their metrics better while making their writing worse.1

Advanced LLMs boost automated assessment scores and lexical diversity (MTLD) for this group. More sophisticated vocabulary appears; readability scores improve. The problem: increased LLM assistance correlates negatively with human expert ratings. Experts reading the same texts find weaker coherence, thinner argumentation, lower analytical depth. The AI generates surface fluency that automated tools reward without the learner developing the underlying capacity those metrics are supposed to proxy.

A 2023–2025 synthesis published in Frontiers in Education corroborates the mechanism across a broader evidence base: unguided AI use consistently degrades argument structure and analytical depth in academic writing, even when surface metrics improve.2 The synthesis identifies human oversight as the decisive factor separating beneficial from harmful AI integration.

The Ceiling Effect: Where Stronger Models Actually Lift Advanced Learners

The EFL study is not a blanket indictment of AI writing assistance. The crutch/ceiling split is proficiency-dependent: for advanced learners, stronger models appear to raise the ceiling rather than compensate for a floor.1

This distinction carries significant practical weight. Advanced learners — presumably more capable of critically evaluating generated output, integrating suggestions rather than substituting them for their own thinking — can use stronger models productively. Their ceiling rises.

The implication for practitioners is uncomfortable: the learners who most need assistance are the ones for whom AI assistance is most likely to function as a crutch. Deploying the most capable available model for the least capable learners may be exactly backwards from a skill-development standpoint.

The Mechanism: Ideational Scaffolding vs Textual Production

The authors frame the fix in Vygotskian terms.1 Lev Vygotsky’s Zone of Proximal Development (ZPD) describes the gap between what a learner can do independently and what they can do with guidance. The paper distinguishes two AI assistance modes that interact with the ZPD differently:

  • Ideational scaffolding: AI assists with idea generation and structuring thinking. The learner produces the prose.
  • Textual production: AI generates the written output. The learner reviews, edits, or accepts it.

Ideational scaffolding keeps the learner inside their ZPD — pushed toward the upper limit of independent capability. Textual production bypasses the ZPD: the AI performs the cognitive work that learning requires. The model’s capability is identical across both modes. The learner’s developmental trajectory is not.

This Isn’t Just an Education Problem

The EFL findings land in the middle of a convergent body of research across knowledge-work domains showing the same dynamic.

Coding. A randomized controlled trial published in February 2026 (n=52 junior developers, GPT-4o assistant) found that AI-assisted coders scored 17% lower than manual coders on a comprehension quiz, with Cohen’s d=0.738 and p=0.010 — described by the researchers as roughly two letter grades.3 The mechanism is instructive: AI users encountered a median of 1 error versus 3 for non-AI users. Fewer errors sounds like a win until you recognize that debugging experiences are a primary driver of how developers learn to reason about code.

The same study found that interaction mode was the decisive variable: developers who used AI only for conceptual inquiries scored 65% or above on comprehension tests; those who fully delegated code generation scored below 40%.3 The same model, used differently, produced radically different skill trajectories. This finding is narrow — 52 participants, one library — but the effect size is large enough to take seriously.

Organizational skill gaps. The “Augmentation Trap” paper (arXiv:2604.03501, April 2026) models five AI deployment regimes and identifies a trap region: rational short-run AI adoption that produces steady-state long-run skill loss.4 The model finds that junior workers face permanent deskilling when AI handles tasks largely independent of their developing expertise, while senior workers maintain capabilities. The result is a durable organizational skill gap — not from AI failure, but from AI success at the task level.

Medicine. A theoretical analysis (PMC, 2024) raises a compounding concern worth noting as a hypothesis, not an established empirical finding: AI-induced skill decay may be largely unconscious.5 Because task-level performance stays high while the AI compensates, professionals may not notice their underlying capability is eroding. A radiologist might believe they remain sharp at anomaly detection while their independent discrimination skill has materially declined. The mechanism this paper describes is consistent with the empirical patterns in the EFL and coding research, but the claim has not been tested directly.

What Practitioners Should Change Right Now

The convergence of these findings across domains points to a concrete intervention framework.

Audit interaction mode, not AI use. Whether AI is net positive or net negative for skill development turns on whether users are in ideational or generative mode. Organizations and educators who can only measure “AI used / not used” are missing the relevant variable entirely.

Separate AI-assisted and independent assessments. The EFL study’s methodological insight — running automated and expert scoring simultaneously — is transferable to any domain. Any evaluation regime that relies solely on output quality will miss skill stagnation masked by AI compensation. Blind spots emerge precisely where the AI is most helpful.

Weight human feedback above automated metrics. The Frontiers in Education synthesis is explicit: positive outcomes in AI-assisted writing require hybrid feedback (AI plus human), transparency requirements, and explicit focus on process over output.2 Automated scores improving while expert ratings decline is a warning signal, not a success metric.

Match AI capability to the learner’s developmental ceiling, not to task complexity. The proficiency-dependence finding in the EFL paper suggests that high-capability models given to low-proficiency users provide the most tempting offramp from cognitive effort — and therefore the highest deskilling risk.


FAQ

The EFL paper finds that AI helps advanced learners — does that mean AI assistance is fine for skilled practitioners?

Not straightforwardly. The EFL finding suggests advanced learners can use AI to raise their ceiling, but the Augmentation Trap model4 and the coding RCT3 both show that even experienced practitioners can deskill when interaction mode shifts from conceptual inquiry to full task delegation. Existing skill level sets the starting risk profile; it does not eliminate the risk.

The coding RCT sample is small. How much weight should the 17% figure carry?

The n=52 and single-library scope are genuine limitations to generalization. The study reports a Cohen’s d of 0.738, which is large by conventional standards, and p=0.010.3 The figure is a strong signal — sufficient to motivate experimental programs or audits of interaction mode — but not sufficient to declare a universal quantitative effect across all AI-assisted development contexts.
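For readers who want to sanity-check what d=0.738 means: Cohen's d is just the difference in group means divided by the pooled standard deviation. A minimal computation, using illustrative numbers rather than the study's data:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled
    sample standard deviation (Bessel-corrected variances)."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

# Two toy score samples one pooled SD unit apart in mean would give d = 0.5:
cohens_d([2, 4, 6], [1, 3, 5])  # → 0.5
```

By the conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), 0.738 sits near the "large" boundary — which is why a 52-person sample can still be a meaningful signal even though it cannot pin down the population effect precisely.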


Footnotes

  1. Susanto et al., “The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings,” arXiv:2604.15460, submitted April 16, 2026. https://arxiv.org/abs/2604.15460

  2. “The impact of generative AI on academic reading and writing: a synthesis of recent evidence (2023–2025),” Frontiers in Education. https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2025.1711718/full

  3. “How AI Impacts Skill Formation,” arXiv:2601.20245. https://arxiv.org/html/2601.20245v1

  4. “The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading,” arXiv:2604.03501, April 2026. https://arxiv.org/html/2604.03501

  5. “Does using AI assistance accelerate skill decay and hinder skill development without performers’ awareness?” PMC (2024). https://pmc.ncbi.nlm.nih.gov/articles/PMC11239631/

