groundy

Fine-Tuning a 20B LLM With RLHF on a 24GB GPU: What Fits

The “20B RLHF on 24GB” claim is real, but the dominant VRAM cost is the duplicated reward model, not the LoRA adapters. Batch size, sequence length, and throughput pay the price.

The Hugging Face blog post that coined “20B RLHF on a 24GB consumer GPU” dates to December 2022, the StackLLaMA era, before DPO existed as a serious alternative and before TRL v1 consolidated SFT, GRPO, DPO, and reward modeling into one library in March 2026. The claim is real, but it papers over a VRAM budget that has almost nothing to do with LoRA adapters.

Where does the “20B on 24GB” claim come from?

The headline traces to Hugging Face’s TRL documentation, which still links a December 9, 2022 post as its canonical origin. The technique it describes is legitimate: 4-bit quantized base weights paired with small LoRA adapters, orchestrated through TRL’s PPOTrainer. What the post doesn’t walk through is which components actually fill the card and what you give up to stay under the limit.

TRL v1, shipped March 27, 2026, expanded the library beyond PPO to cover the full post-training surface: SFT, DPO, GRPO, and reward modeling as first-class trainers. That matters because PPO and DPO have substantially different VRAM footprints, and the 2022 post predates the alternatives entirely.

What does RLHF actually load into VRAM?

RLHF requires at least two separate models resident simultaneously: a reward model and an RL policy, both typically initialized from the same pretrained base. According to Wikipedia’s RLHF overview, the reward model is trained in a supervised manner to predict whether a response to a given prompt is good or bad, based on ranking data collected from human annotators. It outputs a scalar reward score rather than token logits, and is usually in the same size class as the base model.

For a 20B model, you’re paying for two instances of 20 billion parameters, not one.

According to GeeksforGeeks’s RLHF breakdown, PPO limits the size of each policy update for training stability. During on-policy generation, where the policy samples responses and the reward model scores them, both models are in active forward-pass mode simultaneously, and activation memory peaks at that moment.
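The update limit described above is usually implemented as a clipped surrogate objective. The sketch below is a minimal, framework-free illustration for a single sample; the function name `ppo_clipped_objective` and the default `clip_eps=0.2` are illustrative assumptions, not taken from TRL.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(a|s) / pi_old(a|s); the values below are made up for illustration.
for ratio in (0.5, 1.0, 1.5):
    print(ratio, ppo_clipped_objective(ratio, advantage=1.0))
```

With a positive advantage, pushing the ratio above `1 + clip_eps` earns no extra objective value, which is what keeps any single update small.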

The VRAM budget, line by line:

  • 4-bit quantized base weights: QLoRA’s three innovations are 4-bit NormalFloat (NF4) weights, double quantization of the quantization constants, and paged optimizers; the first two compress parameter storage substantially. The QLoRA paper demonstrates fine-tuning a 65B model on a single 48GB GPU. At 4 bits per parameter, 20B weights come to roughly 10GB, so the base weight cost is manageable on 24GB in isolation.
  • LoRA adapters: LoRA updates roughly 0.5 to 5% of a model’s parameters, significantly reducing memory and hardware requirements versus full fine-tuning. The adapters are not the constraint.
  • Reward model: the same size class as the policy, quantized only if you explicitly configure it that way, and resident throughout training.
  • Optimizer states: small by comparison. Adam-style optimizers keep two extra values per trainable parameter, so this cost scales with the adapters, not with the frozen base.

The LoRA asymmetry is worth stating plainly. 0.5 to 5% of 20 billion is 100 million to 1 billion trainable parameters. Adapters stay lightweight relative to the base. The base weight plus reward model duplication is what fills the card.
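The budget above can be put into rough numbers. This is a back-of-the-envelope sketch under stated assumptions: exactly 4 bits per quantized weight with quantization constants ignored, decimal gigabytes, and 12 bytes per trainable adapter parameter (16-bit weight, 16-bit gradient, two 32-bit Adam moments). Activations and the KV cache are left out entirely, and the helper names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, to match the "24GB" on the box


def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for weights alone, ignoring quantization constants."""
    return n_params * bits_per_param / 8 / GB


def adapter_training_gb(n_trainable: float) -> float:
    """Assumed 12 bytes per trainable parameter: weight + gradient + two Adam moments."""
    bytes_per_param = 2 + 2 + 4 + 4
    return n_trainable * bytes_per_param / GB


base_params = 20e9
policy_4bit = weight_gb(base_params, 4)
reward_4bit = weight_gb(base_params, 4)
reward_fp16 = weight_gb(base_params, 16)
adapters = adapter_training_gb(100e6)  # low end of the 0.5 to 5% range

print(f"policy (4-bit):        {policy_4bit:.1f} GB")
print(f"reward model (4-bit):  {reward_4bit:.1f} GB")
print(f"reward model (FP16):   {reward_fp16:.1f} GB")
print(f"adapters + optimizer:  {adapters:.1f} GB")
print(f"left for activations:  {24 - policy_4bit - reward_4bit - adapters:.1f} GB")
```

Under these assumptions the two 4-bit models take about 20GB before a single activation is stored, and a reward model left in FP16 would need about 40GB on its own.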

Where does the 24GB ceiling actually bite?

The ceiling bites on batch size, sequence length, and the generation rollout phase, in roughly that order.

Batch size falls first. With the policy and reward model both resident, the per-sample activation cost is high. At any useful sequence length there is rarely room for more than one sample, which forces batch size 1. Gradient accumulation over micro-batches restores the effective batch size, but at the cost of longer training time; it does not lower per-step VRAM.
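Why gradient accumulation recovers the effective batch size can be seen on a toy problem. The sketch below uses a one-parameter least-squares model in plain Python; the function names and data are made up for illustration, and the equality holds when all micro-batches have the same size.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch_size: int = 1) -> float:
    """Average micro-batch gradients, touching only one micro-batch at a time."""
    total, steps = 0.0, 0
    for i in range(0, len(xs), micro_batch_size):
        total += grad_mse(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        steps += 1
    return total / steps


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(grad_mse(0.5, xs, ys))          # full batch of 4
print(accumulated_grad(0.5, xs, ys))  # four micro-batches of 1
```

The gradient is the same either way; what changes is that the micro-batch version runs four sequential passes, which is where the extra wall-clock time comes from.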

Sequence length is the second wall. Attention activations scale quadratically with sequence length in standard implementations. With the reward model already resident, sequences longer than a few hundred tokens can push activations past the 24GB budget. The 2022 StackLLaMA post used comparatively short preference examples by current standards.
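The quadratic term is easy to see in a rough estimate of one layer's attention score matrix. This sketch assumes a standard implementation that materializes the full scores in 16-bit; the head count of 64 is a hypothetical value for a 20B-class model, and fused attention kernels that avoid storing this matrix would change the picture.

```python
def attention_scores_gb(seq_len: int, n_heads: int, batch_size: int = 1,
                        bytes_per_value: int = 2) -> float:
    """Memory for one layer's attention scores: batch x heads x seq x seq values."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value / 1e9


N_HEADS = 64  # assumption for illustration only

for seq_len in (256, 512, 1024, 2048):
    print(seq_len, f"{attention_scores_gb(seq_len, N_HEADS):.4f} GB per layer")
```

Doubling the sequence length quadruples this term, and it is paid per layer for whatever must be kept for the backward pass.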

The generation rollout spike is the most abrupt failure mode. During PPO’s on-policy step, the policy generates and the reward model scores within the same step, so both sets of weights and their activations are on the card together. Paged optimizers, one of QLoRA’s key techniques per arXiv:2305.14314, soften this by treating GPU memory like virtual memory: optimizer states are paged out to CPU RAM during spikes and back in when needed. Without them, the rollout phase is far more likely to OOM.

Why is the reward model the dominant cost?

The reward model holds the full base weights, making it the same order of magnitude as the policy in VRAM. On a 20B base, that’s another 20B-parameter model sitting on the card, even if frozen and quantized. The LoRA adapters you’re training are at most a few percent of that by comparison.
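How small the adapters are depends on the rank and on which layers they target. The sketch below counts LoRA parameters for adapters on four square attention projections per layer; the hidden size of 6144 and 44 layers are assumed dimensions for a 20B-class model, and the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


def attention_lora_total(hidden: int, n_layers: int, rank: int, n_proj: int = 4) -> int:
    """Adapters on n_proj square attention projections in every layer."""
    return n_layers * n_proj * lora_param_count(hidden, hidden, rank)


HIDDEN, LAYERS = 6144, 44  # assumed shape, for illustration only

for rank in (8, 16, 32):
    total = attention_lora_total(HIDDEN, LAYERS, rank)
    print(f"rank {rank}: {total / 1e6:.1f}M trainable parameters")
```

Under these assumptions, rank 16 comes to roughly 35 million trainable parameters. Targeting more modules or raising the rank moves the count up, but it stays far below the 20 billion frozen weights of either full model.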

QLoRA’s quantization helps with the policy side. It helps on the reward model side only if you also explicitly quantize it to 4-bit, which is possible but carries a specific risk: a quantization-noisy reward model produces degraded preference signals, and those signals feed directly into PPO updates throughout training.

What does 4-bit quantization cost in throughput?

4-bit weights are dequantized on the fly in every forward pass, and for RLHF training that overhead compounds with on-policy generation. Token-by-token sampling is already slow, and on a 24GB card running PPO with a 20B base, throughput rather than peak VRAM often becomes the real wall. A run that fits in memory can still be impractical if generating preference samples takes far longer than it would on a larger card with the model in FP16.

QLoRA also restricts you to open-weight models. GeeksforGeeks’s QLoRA overview confirms that 4-bit quantization requires direct access to model weights, limiting the technique to LLaMA, Mistral, Falcon, and similar lineages.

When should you skip PPO entirely?

DPO and GRPO don’t require a separate reward model resident during training, which changes the memory budget substantially.

TRL v1 ships DPO as a first-class trainer. DPO frames preference optimization as a classification problem on the policy itself, eliminating the policy-plus-reward-model duplication. You lose the online, on-policy generation loop: DPO trains on a static preference dataset rather than freshly sampled rollouts. What you get back is the VRAM the reward model was occupying, which on a 24GB card is the difference between manageable and marginal.
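The classification framing can be made concrete with the DPO loss for a single preference pair. This is a minimal sketch of the published formula in plain Python, not TRL's implementation; the function name, the `beta=0.1` default, and the log-probability values are illustrative assumptions. The reference log-probabilities come from a frozen copy of the starting policy, which in adapter-based setups can often share the same frozen base weights, so no separate reward network is involved.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up sequence log-probabilities for one (chosen, rejected) pair.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # policy already leans toward chosen
print(dpo_loss(-15.0, -12.0, -14.0, -13.0))  # policy leans toward rejected
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one's, which is the whole training signal.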

GRPO, the method used in DeepSeek-R1’s RL training stage, does on-policy generation but computes rewards from a rule or scoring function rather than a separately loaded neural reward model. The reward computation moves to CPU-side code. That’s a different tradeoff from DPO, but similarly lighter than PPO on VRAM.
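A rough sketch of that idea: sample a group of completions for one prompt, score each with a plain function, and normalize the scores within the group. The rule in `rule_reward` is hypothetical, and the use of the population standard deviation plus a small `eps` is an assumption; real implementations differ in details.

```python
import statistics


def rule_reward(completion: str) -> float:
    """Hypothetical rule: reward completions that close an <answer> tag."""
    return 1.0 if completion.strip().endswith("</answer>") else 0.0


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up group of sampled completions for a single prompt.
group = [
    "<answer>42</answer>",
    "I think it is 42",
    "<answer>41</answer>",
    "not sure",
]
rewards = [rule_reward(c) for c in group]
print(rewards)
print(group_relative_advantages(rewards))
```

Nothing in the scoring path needs GPU memory here, which is the point of the comparison with PPO's resident reward model.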

What’s the practical recipe for staying under 24GB?

A setup with a reasonable chance of fitting requires: 20B open-weight base in 4-bit NF4 via QLoRA, reward model also quantized to 4-bit or offloaded to CPU, paged optimizers enabled, batch size 1 with gradient accumulation for effective batch sizing, sequence length capped conservatively, and LoRA applied to attention projections at a low rank (8-32 range). If static preference data is acceptable, use DPO via TRL and drop the reward model from VRAM entirely.
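As a configuration fragment, the recipe might look like the following, using `transformers` and `peft`. Treat it as a sketch under assumptions: the rank, alpha, dropout, accumulation steps, and output directory are placeholder values, the `target_modules` names vary by architecture, and TRL's trainer-specific config classes may name or organize these options differently depending on version. Sequence-length caps are omitted because their parameter names differ between trainers.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 with double quantization. In this sketch the same config would be
# passed when loading both the policy and the reward model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters on the attention projections only.
lora_config = LoraConfig(
    r=16,                       # placeholder within the 8-32 range
    lora_alpha=32,              # placeholder
    lora_dropout=0.05,          # placeholder
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    bias="none",
    task_type="CAUSAL_LM",
)

# Batch size 1, gradient accumulation, activation checkpointing, paged optimizer.
training_args = TrainingArguments(
    output_dir="rlhf-24gb-sketch",   # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # placeholder
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    bf16=True,
)
```

The quantization config is passed as `quantization_config=bnb_config` to `from_pretrained` when loading each model; forgetting it on the reward model is the first failure mode in the list below.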

The most common OOM failure modes, in likelihood order:

  1. Rollout spike with reward model in FP16: the policy is quantized but the reward model loads from checkpoint in native precision. Fix: explicitly apply the same 4-bit config to the reward model before training.
  2. Generation length uncapped: both models in active forward-pass mode at peak sequence length. Fix: lower max generation length before adjusting batch size.
  3. Activation checkpointing off: every layer’s activations are held in memory for the backward pass. Fix: enable it. Recomputing activations during the backward pass trades extra compute for lower VRAM, and at batch size 1 on a 20B model it’s almost always worth it.

The “20B on 24GB” claim is accurate in the narrow sense that it fits on a well-configured card. What fits is batch size 1, short sequences, a quantized reward model, paged optimizers, and slow training throughput. Whether that constitutes a viable experiment depends on your patience and whether you actually need PPO’s online learning loop. If you don’t, DPO on the same card is a cleaner problem.

Frequently Asked Questions

Does the 20B-on-24GB result transfer to non-LLaMA model families?

Only to open-weight lineages where you can read the raw weights. 4-bit NF4 quantization requires direct access to model parameters, so it applies to LLaMA, Mistral, and Falcon but not to API-only models. The QLoRA repository still lists 4-bit matmul as not yet integrated, so each forward pass pays a quantize-dequantize overhead.

How much memory does LoRA actually save versus full fine-tuning?

LoRA updates 0.5 to 5 percent of parameters, which trims memory by about 70 percent versus full fine-tuning while landing GLUE scores within 1 percent of fully fine-tuned baselines. On a 20B base that leaves 100 million to 1 billion trainable parameters, so adapter rank is rarely the line item that OOMs a 24GB card.

Is the rule that QLoRA uses about 0.5GB per 1GB of model actually reliable?

It is a loose heuristic, not a guarantee. Peak VRAM depends on activation footprint during PPO’s on-policy generation, sequence length, and whether the reward model is quantized. The duplicated reward model, not the base-weight ratio, is what actually pushes a 20B run past 24GB.

Where does the reward model come from if the policy is a 20B base?

It is the same pretrained model with the final token-prediction layer swapped for a randomly initialized regression head, then supervised-trained on pairwise preference comparisons until it outputs a scalar reward. That construction is why it matches the policy’s parameter class and why both must stay resident during PPO.
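The construction can be sketched in a few lines: a linear head maps the final hidden state to one scalar, and the pairwise loss pushes the preferred response's score above the rejected one's. This is a toy illustration in plain Python with a made-up hidden size and made-up hidden states; the function names are hypothetical, and the loss shown is the common Bradley-Terry style form, assumed here since the source does not name one.

```python
import math
import random


def scalar_head(hidden_state: list, weights: list, bias: float = 0.0) -> float:
    """Linear regression head: maps the final hidden state to one scalar reward."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


random.seed(0)
hidden_size = 8  # toy size; a real model's hidden size is in the thousands
head_weights = [random.gauss(0.0, 0.02) for _ in range(hidden_size)]  # random init

# Made-up final hidden states for a preferred and a rejected response.
h_chosen = [0.5] * hidden_size
h_rejected = [-0.5] * hidden_size

loss = pairwise_loss(scalar_head(h_chosen, head_weights),
                     scalar_head(h_rejected, head_weights))
print(f"pairwise loss before training: {loss:.4f}")
```

Only the head is new; everything beneath it is the full pretrained stack, which is why the reward model costs as much VRAM as the policy.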

Why does PPO need a value head that DPO and GRPO avoid?

PPO is an actor-critic method, so it maintains a value or critic head alongside the policy to estimate returns, which adds activation and optimizer memory on top of the policy and reward models. DPO reframes preference learning as classification on the policy, and GRPO scores rollouts with a rule, so neither pays that extra head’s footprint.

sources · 7 cited

  1. Reinforcement learning from human feedback - Wikipedia. en.wikipedia.org (analysis). Accessed 2026-06-27.
  2. Reinforcement learning from Human Feedback - GeeksforGeeks. geeksforgeeks.org (community). Accessed 2026-06-27.
  3. Fine-Tuning using LoRA and QLoRA - GeeksforGeeks. geeksforgeeks.org (community). Accessed 2026-06-27.
  4. QLoRA (Quantized Low-Rank Adapter) - GeeksforGeeks. geeksforgeeks.org (community). Accessed 2026-06-27.