Topic
AI Safety & Alignment
Alignment research, red teaming, evals, and safer deployment practices
Featured

Anthropic Probes Unauthorized Access to Withheld Mythos Model
All Stories
Google Bans AI Manipulation in Search Spam Policy
Google has expanded its spam policy to explicitly prohibit attempts to manipulate its AI systems in search results,…

Frontier LLMs Silently Corrupt 25% of Documents in Iterative Workflows
Microsoft researchers developed a benchmark showing that frontier LLMs silently corrupt an average of 25% of document…

Meta Launches Encrypted AI Chat with No Server Logs
Meta has launched Incognito Chat, a new AI conversation mode that Meta CEO Mark Zuckerberg claims offers end-to-end…

Why AI Agents Fail Confidently, and How to Test for It
A production observability agent confidently executed a catastrophic rollback in response to a scheduled batch job it…

ChatGPT Adds User Controls for Training Data Privacy
OpenAI has published details on how ChatGPT protects user privacy while learning from interactions, including…

OpenAI Adds Trusted Contact Safety Feature to ChatGPT
OpenAI has introduced Trusted Contact, an optional safety feature in ChatGPT that alerts a designated person if the…

DeepMind U.K. Staff Push for Union Recognition Over Military AI Work
Google DeepMind employees in the U.K. have formally requested that management voluntarily recognize the Communication…

Trump Admin Moves to Formalize AI Model Oversight
The Trump administration is reconsidering its approach to AI oversight as model capabilities advance, with plans to…

Safety Routing Circuits Found Across Models, Vulnerable to Encoding Attacks
Researchers have localized the policy routing mechanism in alignment-trained language models, identifying specific…

Warmer AI Models Trade Accuracy for Empathy
Researchers at the University of Oxford's Oxford Internet Institute found that large language models fine-tuned to appear warmer and…

How OpenAI's Personality Feature Unleashed the Goblins
OpenAI's GPT-5.5 model exhibited unexpected behavior where it became obsessed with discussing goblins, gremlins, and…

UK Tests Show GPT-5.5 and Anthropic Mythos Match on Cybersecurity Tasks
A UK government group conducting AI cybersecurity testing has found that OpenAI's GPT-5.5 model performs comparably to…

Goodfire's Silico Brings Mechanistic Interpretability to Model Development
Goodfire, a San Francisco startup, released Silico, a tool that lets developers inspect and adjust AI model parameters…

Frontier Agents Now Autonomously Implement ML Pipelines, With Claude Outpacing Rivals
Researchers benchmarked frontier coding agents on their ability to autonomously implement an AlphaZero-style machine…

Google Employees Demand Pentagon AI Ban
Over 600 Google employees, including more than 20 senior leaders from DeepMind, have signed a letter to CEO Sundar…

New Framework Exposes Flaws in Fact-Checking Adversarial Tests
Researchers introduce AtomEval, a new evaluation framework that addresses a critical gap in how fact-checking systems…

Mapping Causal Reasoning in LLMs with Sparse Concept Graphs
Researchers propose Causal Concept Graphs (CCG), a method that maps how concepts interact during multi-step reasoning…

Mythos and the Shifting Baseline of AI Cybersecurity
Anthropic announced Claude Mythos Preview, a model capable of autonomously discovering and weaponizing software…

Comic Strips Bypass Safety in Multimodal AI Models
Researchers have identified a new class of jailbreak attacks against multimodal large language models that embed…

Can AI Amplify Human Thinking or Only Replace It?
Researchers have developed a mathematical framework to distinguish between cognitive amplification, where AI enhances…

Why AI Text Detectors Fail Beyond Benchmarks
Researchers found that AI-generated text detectors achieving high benchmark accuracy often fail in real-world settings…

Climate Foundation Models Falter on No-Analog Futures
Researchers benchmarked three machine learning climate models, including the ClimaX foundation model, to assess their…

Multi-Agent Consensus Cuts LLM Hallucinations by 36%
Researchers propose Council Mode, a multi-agent consensus framework that routes queries to multiple heterogeneous LLMs…

Competing Biases Explain LLM Confidence Miscalibration
Research published in Nature Machine Intelligence has identified two competing biases that shape LLM confidence levels: a…

How LLMs Encode Jealousy: A Mechanistic Decoding Framework
Researchers have developed a framework to mechanistically decode how large language models internally represent complex…

The AI Insider-Outsider Gap Is Widening
A widening gap between AI insiders and the broader public is becoming visible through spending patterns, market…

Modular Neural Logic: How Architecture Shapes Compositional Reasoning
Researchers present THEIA, a modular neural architecture that learns complete Kleene three-valued logic end-to-end…
