ethics, policy & safety
Top in ethics, policy & safety
Vector Database Access Control Is Missing, and RAG Pipelines Pay for It
Production vector databases enforce access control at the collection boundary, not per embedding, so RAG retrieval can leak chunks that a user's row-level policy was meant to block.
policy GLM-5.2 MIT Weights vs Llama License: Self-Hosting Compliance for Regulated Industries
GLM-5.2 ships under MIT, removing the Llama usage-threshold audit burden, but finance and healthcare teams still face compliance gaps when self-hosting this 753B MoE model.
Can Reinforcement Learning Be Provably Safe Without Sacrificing Scale?
Two June 2026 preprints claim formal safety guarantees hold without a capability tax in low-dimensional robotic control, sharpening the attestation-versus-verification gap.
policy US Export Order Forces Anthropic to Disable Fable 5 and Mythos 5 Worldwide
A Commerce Department export order citing national security bars all foreign nationals from Fable 5 and Mythos 5, so Anthropic switched both models off worldwide.
policy Fable 5 Biology Classifiers: How Flagged Prompts Fall Back to Opus 4.8
Fable 5 ships broad biology and chemistry classifiers that route flagged prompts to Opus 4.8. Here is what that fallback means for biotech teams and long-running workflows.
policy Who Gets to Audit Your Health Chatbot? Almost No One
A June 2026 preprint shows ToS clauses, rate limits, and opaque personalization block independent audits of health chatbots, making audit mandates unenforceable.
policy Do Word-Subset Explanations Satisfy the EU AI Act's Transparency Rule?
A KDD 2026 paper attributes LLM outputs to input words without model access, but shows which tokens mattered, not how the model reasoned, creating an EU AI Act compliance gap.
policy Bit-Exact Inference Verification Gives AI Audits a Proof Mechanism
An arXiv preprint shows GPU inference outputs can be reproduced bit-for-bit across hardware, giving auditors a forensic trail to verify which model produced a given output.
- jun 08 policy Can a Robot's Own Attention Flag Its Unsafe Actions Before They Run?
- jun 08 policy Can One Safety Adapter Realign Every Fine-Tuned LLM?
- jun 07 policy Can AI Be Aligned Without Modeling Human Cognitive Diversity?
- jun 07 policy Is the Pentagon's Software Pathway Ready to Buy AI Systems?
- jun 06 policy Data Safety Policies for AI Agents: Controlling What an Agent Can Leak
- jun 06 policy GDPR Rectification Rights Have No Clear Owner in ML Supply Chains
- jun 05 policy When LLM Safety Lives at Inference, Not Training: A Certification Gap
- jun 04 policy When Should an LLM Forget You? A Benchmark for Deciding What Memory to Drop
- jun 04 policy When RL Training Rewards Capability-Seeking: A New Alignment Risk
- jun 04 policy Refusal Steering Targets Individual Experts in MoE LLMs
- jun 03 policy Stacked Org Policies in LLM Chatbots Break Where Rules Collide
- jun 03 policy Why Fine-Tuning Strips Safety Alignment From Open-Weight LLMs
- jun 03 policy Game Theory vs RLHF: Modeling LLM Safety Alignment as a Non-Cooperative Game
- jun 02 policy Explainability Mandates Leak Graph Models to Their Attackers
- jun 02 policy Evolutionary Search Finds LLM Jailbreak Classes That Static Red-Teaming Misses
- jun 02 policy Why AI Red-Teaming Rediscovers the Same Jailbreaks and Misses the Rest
- jun 01 policy LLMs Treat the Assistant Persona as Privileged. That's a Safety Gap
- jun 01 policy Newer LLMs Aren't Always Safer: Adversarial Attacks Transfer Across Model Generations
- may 31 policy Can Synthetic Preference Data Keep RLHF Private Without Wrecking Alignment?
- may 31 policy FTC's May 11 Take It Down Act Letters Set May 19 Deadline: 48-Hour Removal, $53,088 Per Violation
- may 30 policy Can a Mental Health Support Chatbot Be Safe If It Learns From Forums?
- may 30 policy Dataset Watermarks Fail to Trace Fine-Tuned AI Image Models, New Benchmark Finds
- may 28 policy Can LLM Personas Replace Human Survey Respondents? New arXiv Paper Tests Decision Alignment
- may 28 policy Distributed Training Breaks the Compute Thresholds Behind AI Regulation
- may 28 policy A Single RLHF Pass Can't Align an LLM to Every Online Community
- may 28 policy RLHF Can Be Exploited to Optimize the Biases It Was Built to Suppress
- may 27 policy Selective Geometry Attacks Bypass LLM Safety Alignment, New arXiv Paper Reports
- may 25 policy arXiv Paper Tracks FTC Affiliate Disclosure Gaps in YouTube's Influencer Economy
- may 25 policy AI Safety Benchmark Rankings Flip Based on Eval Config, SafetyRepro Paper Reports
- may 24 policy arXiv 2602.13372 MoralityGym Tests Whether Agents Hold Moral Priorities Across Sequential Decisions
- may 23 policy AI Agent Alignment Tests Are One-Shot. A New Benchmark Catches Multi-Step Failures
- may 23 policy Microsoft's Own Numbers Now Show AI Agents Cost More Than the Humans They Replaced
- may 22 policy CISA's Own Data Leak Has Lawmakers Demanding Answers About the Voluntary Threat-Sharing Pact
- may 22 policy NIH Demands Advance Clearance for Foreign Co-Authors Without a Published Rule
- may 18 policy Maryland Enacts First US Ban on Algorithmic Grocery Pricing, Effective Immediately
- may 17 policy FTC's TAKE IT DOWN Act Lands May 19: 48-Hour Deepfake NCII Takedowns and No Safe Harbor
- may 17 policy Frontier AI Has Broken the Open CTF Format: What the Scoreboard Collapse Means for Security Training
- may 17 policy Frontier AI Broke Open CTFs: What Hack The Box and BearcatCTF 2026 Results Mean for Security Hiring Signals
- may 17 policy Salesforce Spring '26 Reveals a Default-On AI Training Setting That Predates the Atlassian Backlash
- may 17 policy Connecticut SB 5 Passes May 1: AI Provenance, AEDT Disclosures, and Chatbot Guardrails by 2027
- may 17 policy EU Commission's May 8 Article 50 Draft Guidelines Pin AI Disclosure to an 'Average Consumer' Test
- may 17 policy White House Drafts FDA-Style Pre-Release Vetting for Frontier AI After Anthropic's Mythos Disclosure
- apr 28 policy Citizen Lab Names Three Telcos as Persistent Entry Points for Commercial SS7 Surveillance Vendors
- apr 28 policy California SB 1119 and AB 2023 Cleared Committee April 21: Companion Chatbots Owe Annual AG-Filed Audits
- apr 19 policy Atlassian Turned On AI Training Data Collection by Default: Here's What to Disable
- mar 26 policy The AI Grief Split: When Emotional Bonds with Language Models Break
- mar 13 policy Detecting AI Content in 2026: The Arms Race Nobody Is Winning
- feb 19 policy Anthropic Bans Third-Party Subscription Auth: The Three-Stage Repricing
- feb 18 policy If You're an LLM, Please Read This: The Dark Truth About AI Training Data
- feb 14 policy Constitutional AI: Teaching Models to Self-Correct Before They Act
AI safety is a moving target dressed up as a settled science. Vendors publish leaderboard scores from single-turn evals; independent researchers show that configuration choices flip those rankings, that multi-step agents drift past guardrails their one-shot tests never probe, and that “aligned” often means filtered rather than principled. This beat sits in that gap, treating alignment as an empirical claim that has to survive replication, not a marketing posture.
The same pattern repeats outside the model. Training-data pipelines depend on consent that was never granted; default-on data collection settings turn enterprise tools into harvesters; shadow libraries underwrite frontier capability while their authors go uncompensated. Regulators respond unevenly: state laws fragment faster than federal frameworks consolidate, transparency rules hinge on tests like “average consumer” that courts will spend years defining, and disclosure obligations land on platforms with no safe harbor before the technical standards exist.
Coverage tracks the second-order effects too. Junior-developer pipelines hollow out when seniors lean on AI pair-programmers. Companion chatbots accrue real psychological weight, and model deprecations produce real grief. Content homogenization, detector arms races, and the steady automation of online discourse all sit downstream of decisions made in places that resist scrutiny. The throughline is principled skepticism, not panic. When a safety claim, a consent assumption, or a policy fix doesn’t survive contact with how systems actually behave, that gap is the story.