Static entropy regularization in GRPO underperforms on mixed-difficulty tasks. Difficulty-aware allocation narrows the gap by 7-10 pass@1 points at no extra compute.
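
The summary does not say how the allocation is computed, so the following is only a minimal sketch of one plausible reading, not the method behind the reported numbers. It assumes binary rewards, uses one minus the group pass rate as a difficulty proxy, and rescales a per-prompt entropy coefficient so that the batch mean stays equal to the static baseline. The function name `difficulty_aware_entropy_coefs` and the parameter `base_coef` are hypothetical.

```python
import numpy as np


def difficulty_aware_entropy_coefs(rewards, base_coef=0.01, eps=1e-8):
    """Per-prompt entropy coefficients scaled by estimated difficulty.

    Assumptions (not from the source): `rewards` is a (num_prompts,
    group_size) array of 0/1 rewards from GRPO rollouts, and difficulty
    is approximated as 1 - pass_rate within each group.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    pass_rate = rewards.mean(axis=1)      # fraction of correct samples per prompt
    difficulty = 1.0 - pass_rate          # harder prompts score closer to 1

    mean_difficulty = difficulty.mean()
    if mean_difficulty < eps:
        # Every prompt is solved: fall back to the static coefficient.
        return np.full(rewards.shape[0], base_coef)

    # Normalize so the weights average to 1, which keeps the batch-mean
    # coefficient equal to base_coef (same total entropy budget).
    weights = difficulty / mean_difficulty
    return base_coef * weights
```

Under these assumptions, a batch with group pass rates of 0.75, 0.25, and 0.5 gets a smaller coefficient on the easy prompt and a larger one on the hard prompt, while the average matches the static setting. The coefficients are derived from rollouts GRPO already samples, so this sketch needs no additional generation.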