groundy

ethics, policy & safety

Top in ethics, policy & safety

policy

A Single RLHF Pass Can't Align an LLM to Every Online Community

The CARE framework benchmarks LLMs against 3,749 real Reddit reactions and finds community prompting does not close the realism gap, breaking the single-RLHF-pass assumption.

policy

RLHF Can Be Exploited to Optimize the Biases It Was Built to Suppress

An ICML 2026 paper shows RLHF can amplify the biases it was built to suppress, because preference data is self-referential and output-level safety evals miss the drift.

policy

Selective Geometry Attacks Bypass LLM Safety Alignment, New arXiv Paper Reports

Two papers show LLM safety alignment can be bypassed by embedding perturbations, a surface neither standard evaluations nor regulatory certifications inspect.

policy

arXiv Paper Tracks FTC Affiliate Disclosure Gaps in YouTube's Influencer Economy

A study of 2 million YouTube videos finds most affiliate content fails FTC disclosure standards, and the audit method is cheap enough for any plaintiff to replicate.

policy

AI Safety Benchmark Rankings Flip Based on Eval Config, SafetyRepro Paper Reports

SafetyRepro proves eval config alone flips safety rankings on every alignment benchmark, so compliance teams citing leaderboard scores must disclose the full evaluation setup.

policy

arXiv 2602.13372 MoralityGym Tests Whether Agents Hold Moral Priorities Across Sequential Decisions

MoralityGym's benchmark shows Safe RL agents degrade on sequential moral tradeoffs, revealing a gap in the single-turn alignment evals that vendors publish as safety proof.


  1. may 23 policy AI Agent Alignment Tests Are One-Shot. A New Benchmark Catches Multi-Step Failures
  2. may 23 policy Microsoft's Own Numbers Now Show AI Agents Cost More Than the Humans They Replaced
  3. may 22 policy CISA's Own Data Leak Has Lawmakers Demanding Answers About the Voluntary Threat-Sharing Pact
  4. may 22 policy NIH Demands Advance Clearance for Foreign Co-Authors Without a Published Rule
  5. may 18 policy Maryland Enacts First US Ban on Algorithmic Grocery Pricing, Effective Immediately
  6. may 17 policy FTC's TAKE IT DOWN Act Lands May 19: 48-Hour Deepfake NCII Takedowns and No Safe Harbor
  7. may 17 policy Frontier AI Has Broken the Open CTF Format: What the Scoreboard Collapse Means for Security Training
  8. may 17 policy Frontier AI Broke Open CTFs: What Hack The Box and BearcatCTF 2026 Results Mean for Security Hiring Signals
  9. may 17 policy Salesforce Spring '26 Reveals a Default-On AI Training Setting That Predates the Atlassian Backlash
  10. may 17 policy Connecticut SB 5 Passes May 1: AI Provenance, AEDT Disclosures, and Chatbot Guardrails by 2027
  11. may 17 policy EU Commission's May 8 Article 50 Draft Guidelines Pin AI Disclosure to an 'Average Consumer' Test
  12. may 17 policy White House Drafts FDA-Style Pre-Release Vetting for Frontier AI After Anthropic's Mythos Disclosure
  13. apr 28 policy Citizen Lab Names Three Telcos as Persistent Entry Points for Commercial SS7 Surveillance Vendors
  14. apr 28 policy California SB 1119 and AB 2023 Cleared Committee April 21: Companion Chatbots Owe Annual AG-Filed Audits
  15. apr 19 policy Atlassian Turned On AI Training Data Collection by Default: Here's What to Disable
  16. mar 26 policy The AI Grief Split: When Emotional Bonds with Language Models Break
  17. mar 13 policy Detecting AI Content in 2026: The Arms Race Nobody Is Winning
  18. feb 19 policy Anthropic Bans Third-Party Subscription Auth: The Three-Stage Repricing
  19. feb 14 policy Constitutional AI: Teaching Models to Self-Correct Before They Act
  20. feb 18 policy If You're an LLM, Please Read This: The Dark Truth About AI Training Data

AI safety is a moving target dressed up as a settled science. Vendors publish leaderboard scores from single-turn evals; independent researchers show that configuration choices flip those rankings, that multi-step agents drift past guardrails that one-shot tests never probe, and that “aligned” often means filtered rather than principled. This beat sits in that gap, treating alignment as an empirical claim that has to survive replication, not a marketing posture.

The same pattern repeats outside the model. Training-data pipelines depend on consent that was never granted; default-on data collection settings turn enterprise tools into harvesters; shadow libraries underwrite frontier capability while their authors go uncompensated. Regulators respond unevenly: state laws fragment faster than federal frameworks consolidate, transparency rules hinge on tests like “average consumer” that courts will spend years defining, and disclosure obligations land on platforms with no safe harbor before the technical standards exist.

Coverage tracks the second-order effects too. Junior-developer pipelines hollow out when seniors lean on AI pair-programmers. Companion chatbots accrue real psychological weight, and model deprecations produce real grief. Content homogenization, detector arms races, and the steady automation of online discourse all sit downstream of decisions made in places that resist scrutiny. The throughline is principled skepticism, not panic. When a safety claim, a consent assumption, or a policy fix doesn’t survive contact with how systems actually behave, that gap is the story.