groundy

ethics, policy & safety

58 articles

Top in ethics, policy & safety


  1. jun 08 policy Can a Robot's Own Attention Flag Its Unsafe Actions Before They Run?
  2. jun 08 policy Can One Safety Adapter Realign Every Fine-Tuned LLM?
  3. jun 07 policy Can AI Be Aligned Without Modeling Human Cognitive Diversity?
  4. jun 07 policy Is the Pentagon's Software Pathway Ready to Buy AI Systems?
  5. jun 06 policy Data Safety Policies for AI Agents: Controlling What an Agent Can Leak
  6. jun 06 policy GDPR Rectification Rights Have No Clear Owner in ML Supply Chains
  7. jun 05 policy When LLM Safety Lives at Inference, Not Training: A Certification Gap
  8. jun 04 policy When Should an LLM Forget You? A Benchmark for Deciding What Memory to Drop
  9. jun 04 policy When RL Training Rewards Capability-Seeking: A New Alignment Risk
  10. jun 04 policy Refusal Steering Targets Individual Experts in MoE LLMs
  11. jun 03 policy Stacked Org Policies in LLM Chatbots Break Where Rules Collide
  12. jun 03 policy Why Fine-Tuning Strips Safety Alignment From Open-Weight LLMs
  13. jun 03 policy Game Theory vs RLHF: Modeling LLM Safety Alignment as a Non-Cooperative Game
  14. jun 02 policy Explainability Mandates Leak Graph Models to Their Attackers
  15. jun 02 policy Evolutionary Search Finds LLM Jailbreak Classes That Static Red-Teaming Misses
  16. jun 02 policy Why AI Red-Teaming Rediscovers the Same Jailbreaks and Misses the Rest
  17. jun 01 policy LLMs Treat the Assistant Persona as Privileged. That's a Safety Gap
  18. jun 01 policy Newer LLMs Aren't Always Safer: Adversarial Attacks Transfer Across Model Generations
  19. may 31 policy Can Synthetic Preference Data Keep RLHF Private Without Wrecking Alignment?
  20. may 31 policy FTC's May 11 Take It Down Act Letters Set May 19 Deadline: 48-Hour Removal, $53,088 Per Violation
  21. may 30 policy Can a Mental Health Support Chatbot Be Safe If It Learns From Forums?
  22. may 30 policy Dataset Watermarks Fail to Trace Fine-Tuned AI Image Models, New Benchmark Finds
  23. may 28 policy Can LLM Personas Replace Human Survey Respondents? New arXiv Paper Tests Decision Alignment
  24. may 28 policy Distributed Training Breaks the Compute Thresholds Behind AI Regulation
  25. may 28 policy A Single RLHF Pass Can't Align an LLM to Every Online Community
  26. may 28 policy RLHF Can Be Exploited to Optimize the Biases It Was Built to Suppress
  27. may 27 policy Selective Geometry Attacks Bypass LLM Safety Alignment, New arXiv Paper Reports
  28. may 25 policy arXiv Paper Tracks FTC Affiliate Disclosure Gaps in YouTube's Influencer Economy
  29. may 25 policy AI Safety Benchmark Rankings Flip Based on Eval Config, SafetyRepro Paper Reports
  30. may 24 policy arXiv 2602.13372: MoralityGym Tests Whether Agents Hold Moral Priorities Across Sequential Decisions
  31. may 23 policy AI Agent Alignment Tests Are One-Shot. A New Benchmark Catches Multi-Step Failures
  32. may 23 policy Microsoft's Own Numbers Now Show AI Agents Cost More Than the Humans They Replaced
  33. may 22 policy CISA's Own Data Leak Has Lawmakers Demanding Answers About the Voluntary Threat-Sharing Pact
  34. may 22 policy NIH Demands Advance Clearance for Foreign Co-Authors Without a Published Rule
  35. may 18 policy Maryland Enacts First US Ban on Algorithmic Grocery Pricing, Effective Immediately
  36. may 17 policy FTC's TAKE IT DOWN Act Lands May 19: 48-Hour Deepfake NCII Takedowns and No Safe Harbor
  37. may 17 policy Frontier AI Has Broken the Open CTF Format: What the Scoreboard Collapse Means for Security Training
  38. may 17 policy Frontier AI Broke Open CTFs: What Hack The Box and BearcatCTF 2026 Results Mean for Security Hiring Signals
  39. may 17 policy Salesforce Spring '26 Reveals a Default-On AI Training Setting That Predates the Atlassian Backlash
  40. may 17 policy Connecticut SB 5 Passes May 1: AI Provenance, AEDT Disclosures, and Chatbot Guardrails by 2027
  41. may 17 policy EU Commission's May 8 Article 50 Draft Guidelines Pin AI Disclosure to an 'Average Consumer' Test
  42. may 17 policy White House Drafts FDA-Style Pre-Release Vetting for Frontier AI After Anthropic's Mythos Disclosure
  43. apr 28 policy Citizen Lab Names Three Telcos as Persistent Entry Points for Commercial SS7 Surveillance Vendors
  44. apr 28 policy California SB 1119 and AB 2023 Cleared Committee April 21: Companion Chatbots Owe Annual AG-Filed Audits
  45. apr 19 policy Atlassian Turned On AI Training Data Collection by Default: Here's What to Disable
  46. mar 26 policy The AI Grief Split: When Emotional Bonds with Language Models Break
  47. mar 13 policy Detecting AI Content in 2026: The Arms Race Nobody Is Winning
  48. feb 19 policy Anthropic Bans Third-Party Subscription Auth: The Three-Stage Repricing
  49. feb 18 policy If You're an LLM, Please Read This: The Dark Truth About AI Training Data
  50. feb 14 policy Constitutional AI: Teaching Models to Self-Correct Before They Act

AI safety is a moving target dressed up as a settled science. Vendors publish leaderboard scores from single-turn evals; independent researchers show that configuration choices flip those rankings, that multi-step agents drift past guardrails their one-shot tests never probe, and that “aligned” often means filtered rather than principled. This beat sits in that gap, treating alignment as an empirical claim that has to survive replication, not a marketing posture.

The same pattern repeats outside the model. Training-data pipelines depend on consent that was never granted; default-on data-collection settings turn enterprise tools into harvesters; shadow libraries underwrite frontier capability while the authors whose work they host go uncompensated. Regulators respond unevenly: state laws fragment faster than federal frameworks consolidate, transparency rules hinge on tests like “average consumer” that courts will spend years defining, and disclosure obligations land on platforms, with no safe harbor, before the technical standards exist.

Coverage tracks the second-order effects too. Junior-developer pipelines hollow out when seniors lean on AI pair-programmers. Companion chatbots accrue real psychological weight, and model deprecations produce real grief. Content homogenization, detector arms races, and the steady automation of online discourse all sit downstream of decisions made in places that resist scrutiny. The throughline is principled skepticism, not panic. When a safety claim, a consent assumption, or a policy fix doesn’t survive contact with how systems actually behave, that gap is the story.