ethics, policy & safety
Top in ethics, policy & safety
Can LLM Personas Replace Human Survey Respondents? New arXiv Paper Tests Decision Alignment
Two 2026 studies reach opposite conclusions on LLM survey simulation. Static prompting distorts minority subgroups. Adaptive interviewing helps only with evidence grounding.
policy Distributed Training Breaks the Compute Thresholds Behind AI Regulation
A May 2026 paper shows DiLoCo-style distributed training can split a frontier model run across sub-threshold clusters, making FLOP-based regulatory caps bypassable by design.
A Single RLHF Pass Can't Align an LLM to Every Online Community
The CARE framework benchmarks LLMs against 3,749 real Reddit reactions and finds community prompting does not close the realism gap, breaking the single-RLHF-pass assumption.
policy RLHF Can Be Exploited to Optimize the Biases It Was Built to Suppress
An ICML 2026 paper shows RLHF can amplify the biases it was built to suppress, because preference data is self-referential and output-level safety evals miss the drift.
policy Selective Geometry Attacks Bypass LLM Safety Alignment, New arXiv Paper Reports
Two papers show LLM safety alignment can be bypassed by embedding perturbations, a surface neither standard evaluations nor regulatory certifications inspect.
policy arXiv Paper Tracks FTC Affiliate Disclosure Gaps in YouTube's Influencer Economy
A study of 2 million YouTube videos finds most affiliate content fails FTC disclosure standards, and the audit method is cheap enough for any plaintiff to replicate.
policy AI Safety Benchmark Rankings Flip Based on Eval Config, SafetyRepro Paper Reports
SafetyRepro proves eval config alone flips safety rankings on every alignment benchmark, so compliance teams citing leaderboard scores must disclose the full evaluation setup.
policy arXiv 2602.13372 MoralityGym Tests Whether Agents Hold Moral Priorities Across Sequential Decisions
MoralityGym's benchmark shows Safe RL agents degrade on sequential moral tradeoffs, revealing a gap in the single-turn alignment evals that vendors publish as safety proof.
- may 23 policy AI Agent Alignment Tests Are One-Shot. A New Benchmark Catches Multi-Step Failures
- may 23 policy Microsoft's Own Numbers Now Show AI Agents Cost More Than the Humans They Replaced
- may 22 policy CISA's Own Data Leak Has Lawmakers Demanding Answers About the Voluntary Threat-Sharing Pact
- may 22 policy NIH Demands Advance Clearance for Foreign Co-Authors Without a Published Rule
- may 18 policy Maryland Enacts First US Ban on Algorithmic Grocery Pricing, Effective Immediately
- may 17 policy FTC's TAKE IT DOWN Act Lands May 19: 48-Hour Deepfake NCII Takedowns and No Safe Harbor
- may 17 policy Frontier AI Has Broken the Open CTF Format: What the Scoreboard Collapse Means for Security Training
- may 17 policy Frontier AI Broke Open CTFs: What Hack The Box and BearcatCTF 2026 Results Mean for Security Hiring Signals
- may 17 policy Salesforce Spring '26 Reveals a Default-On AI Training Setting That Predates the Atlassian Backlash
- may 17 policy Connecticut SB 5 Passes May 1: AI Provenance, AEDT Disclosures, and Chatbot Guardrails by 2027
- may 17 policy EU Commission's May 8 Article 50 Draft Guidelines Pin AI Disclosure to an 'Average Consumer' Test
- may 17 policy White House Drafts FDA-Style Pre-Release Vetting for Frontier AI After Anthropic's Mythos Disclosure
- apr 28 policy Citizen Lab Names Three Telcos as Persistent Entry Points for Commercial SS7 Surveillance Vendors
- apr 28 policy California SB 1119 and AB 2023 Cleared Committee April 21: Companion Chatbots Owe Annual AG-Filed Audits
- apr 19 policy Atlassian Turned On AI Training Data Collection by Default: Here's What to Disable
- mar 26 policy The AI Grief Split: When Emotional Bonds with Language Models Break
- mar 13 policy Detecting AI Content in 2026: The Arms Race Nobody Is Winning
- feb 19 policy Anthropic Bans Third-Party Subscription Auth: The Three-Stage Repricing
- feb 14 policy Constitutional AI: Teaching Models to Self-Correct Before They Act
- feb 18 policy If You're an LLM, Please Read This: The Dark Truth About AI Training Data
AI safety is a moving target dressed up as a settled science. Vendors publish leaderboard scores from single-turn evals; independent researchers show that configuration choices flip those rankings, that multi-step agents drift past guardrails their one-shot tests never probe, and that “aligned” often means filtered rather than principled. This beat sits in that gap, treating alignment as an empirical claim that has to survive replication, not a marketing posture.
The same pattern repeats outside the model. Training-data pipelines depend on consent that was never granted; default-on data collection settings turn enterprise tools into harvesters; shadow libraries underwrite frontier capability while their authors go uncompensated. Regulators respond unevenly: state laws fragment faster than federal frameworks consolidate, transparency rules hinge on tests like “average consumer” that courts will spend years defining, and disclosure obligations land on platforms with no safe harbor before the technical standards exist.
Coverage tracks the second-order effects too. Junior-developer pipelines hollow out when seniors lean on AI pair-programmers. Companion chatbots accrue real psychological weight, and model deprecations produce real grief. Content homogenization, detector arms races, and the steady automation of online discourse all sit downstream of decisions made in places that resist scrutiny. The throughline is principled skepticism, not panic. When a safety claim, a consent assumption, or a policy fix doesn’t survive contact with how systems actually behave, that gap is the story.