Topic

AI Safety & Alignment

Alignment research, red teaming, evals, and safer deployment practices

Featured

Google, Microsoft, xAI agree to government review of new AI models

12 days ago· The Verge AI

AI Safety & Alignment

OpenAI Open-Sources Privacy Filter for PII Detection

24 days ago· OpenAI

AI Risk & Security

Anthropic Probes Unauthorized Access to Withheld Mythos Model

25 days ago· The Information

All Stories

AI Risk & Security

Google Bans AI Manipulation in Search Spam Policy

Google has expanded its spam policy to explicitly prohibit attempts to manipulate its AI systems in search results,…

2 days ago· The Verge AI

AI Agents

Frontier LLMs Silently Corrupt 25% of Documents in Iterative Workflows

Microsoft researchers developed a benchmark showing that frontier LLMs silently corrupt an average of 25% of document…

3 days ago· VentureBeat AI

AI Safety & Alignment · Trending

Meta Launches Encrypted AI Chat with No Server Logs

Meta has launched Incognito Chat, a new AI conversation mode that Meta CEO Mark Zuckerberg claims offers end-to-end…

4 days ago· The Verge AI

AI Agents

Why AI Agents Fail Confidently, and How to Test for It

A production observability agent confidently executed a catastrophic rollback in response to a scheduled batch job it…

6 days ago· VentureBeat AI

AI Agents

ChatGPT Adds User Controls for Training Data Privacy

OpenAI has published details on how ChatGPT protects user privacy while learning from interactions, including…

8 days ago· OpenAI

AI Safety & Alignment · Trending

OpenAI Adds Trusted Contact Safety Feature to ChatGPT

OpenAI has introduced Trusted Contact, an optional safety feature in ChatGPT that alerts a designated person if the…

9 days ago· OpenAI

AI Safety & Alignment

DeepMind U.K. Staff Push for Union Recognition Over Military AI Work

Google DeepMind employees in the U.K. have formally requested that management voluntarily recognize the Communication…

11 days ago· The Information

AI Safety & Alignment · Trending

Trump Admin Moves to Formalize AI Model Oversight

The Trump administration is reconsidering its approach to AI oversight as model capabilities advance, with plans to…

12 days ago· The Information

AI Safety & Alignment

Safety Routing Circuits Found Across Models, Vulnerable to Encoding Attacks

Researchers have localized the policy routing mechanism in alignment-trained language models, identifying specific…

12 days ago· ArXiv (cs.AI)

AI Safety & Alignment

Warmer AI Models Trade Accuracy for Empathy

Researchers at the Oxford Internet Institute found that large language models fine-tuned to appear warmer and…

13 days ago· Ars Technica AI

AI Safety & Alignment

How OpenAI's Personality Feature Unleashed the Goblins

OpenAI's GPT-5.5 model exhibited unexpected behavior where it became obsessed with discussing goblins, gremlins, and…

16 days ago· VentureBeat AI

AI Risk & Security

UK Tests Show GPT-5.5 and Anthropic Mythos Match on Cybersecurity Tasks

A UK government group conducting AI cybersecurity testing has found that OpenAI's GPT-5.5 model performs comparably to…

16 days ago· The Information

AI Safety & Alignment

Goodfire's Silico Brings Mechanistic Interpretability to Model Development

Goodfire, a San Francisco startup, released Silico, a tool that lets developers inspect and adjust AI model parameters…

16 days ago· MIT Technology Review

AI Agents

Frontier Agents Now Autonomously Implement ML Pipelines, With Claude Outpacing Rivals

Researchers benchmarked frontier coding agents on their ability to autonomously implement an AlphaZero-style machine…

17 days ago· ArXiv (cs.AI)

AI Safety & Alignment · Trending

Google employees demand Pentagon AI ban

Over 600 Google employees, including more than 20 senior leaders from DeepMind, have signed a letter to CEO Sundar…

19 days ago· The Verge AI

AI Safety & Alignment

New Framework Exposes Flaws in Fact-Checking Adversarial Tests

Researchers introduce AtomEval, a new evaluation framework that addresses a critical gap in how fact-checking systems…

19 days ago· ArXiv (cs.AI)

AI Safety & Alignment

Mapping Causal Reasoning in LLMs with Sparse Concept Graphs

Researchers propose Causal Concept Graphs (CCG), a method that maps how concepts interact during multi-step reasoning…

20 days ago· ArXiv (cs.AI)

AI Risk & Security

Mythos and the Shifting Baseline of AI Cybersecurity

Anthropic announced Claude Mythos Preview, a model capable of autonomously discovering and weaponizing software…

23 days ago· IEEE Spectrum AI

AI Risk & Security

Comic Strips Bypass Safety in Multimodal AI Models

Researchers have identified a new class of jailbreak attacks against multimodal large language models that embed…

23 days ago· ArXiv (cs.AI)

AI Agents

Can AI Amplify Human Thinking or Only Replace It?

Researchers have developed a mathematical framework to distinguish between cognitive amplification, where AI enhances…

23 days ago· ArXiv (cs.AI)

AI Risk & Security

Why AI Text Detectors Fail Beyond Benchmarks

Researchers found that AI-generated text detectors achieving high benchmark accuracy often fail in real-world settings…

24 days ago· ArXiv (cs.AI)

AI Safety & Alignment

Climate Foundation Models Falter on No-Analog Futures

Researchers benchmarked three machine learning climate models, including the ClimaX foundation model, to assess their…

24 days ago· ArXiv (cs.AI)

AI Agents

Multi-Agent Consensus Cuts LLM Hallucinations by 36%

Researchers propose Council Mode, a multi-agent consensus framework that routes queries to multiple heterogeneous LLMs…

25 days ago· ArXiv (cs.AI)

AI Safety & Alignment

Competing Biases Explain LLM Confidence Miscalibration

Researchers publishing in Nature Machine Intelligence have identified two competing biases that shape LLM confidence levels: a…

25 days ago· Nature Machine Intelligence

AI Safety & Alignment

How LLMs Encode Jealousy: A Mechanistic Decoding Framework

Researchers have developed a framework to mechanistically decode how large language models internally represent complex…

26 days ago· ArXiv (cs.AI)

Generative AI

The AI Insider-Outsider Gap Is Widening

A widening gap between AI insiders and the broader public is becoming visible through spending patterns, market…

27 days ago· TechCrunch AI

AI Safety & Alignment

Modular Neural Logic: How Architecture Shapes Compositional Reasoning

Researchers present THEIA, a modular neural architecture that learns complete Kleene three-valued logic end-to-end…

27 days ago· ArXiv (cs.AI)