CryptoABC.net

Anthropic Discovers ‘Assistant Axis’ to Prevent AI Jailbreaks and Persona Drift

January 19, 2026
in Blockchain

Caroline Bishop
Jan 19, 2026 21:07

Anthropic researchers map neural ‘persona space’ in LLMs, finding a key axis that controls AI character stability and blocks harmful behavior patterns.

Anthropic researchers have identified a neural mechanism they call the “Assistant Axis” that controls whether large language models stay in character or drift into potentially harmful personas—a finding with direct implications for AI safety as the $350 billion company prepares for a potential 2026 IPO.

The research, published January 19, 2026, maps how LLMs organize character representations internally. The team found that a single direction in the models’ neural activity space—the Assistant Axis—determines how “Assistant-like” a model behaves at any given moment.

What They Found

Working with open-weights models including Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, the researchers extracted activation patterns for 275 character archetypes. The primary axis of variation in this “persona space” corresponded directly to how Assistant-like the model’s behavior was.
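To make the idea of a primary axis of variation concrete, here is a minimal sketch that finds the top principal component of a handful of persona vectors using power iteration. It is an illustration only: the article does not say which method the researchers used, the three-dimensional vectors are invented toy values rather than real model activations, and the names `primary_axis` and `project` are hypothetical.

```python
import math

def primary_axis(vectors, iterations=200):
    """Return a unit vector along the top principal component (power iteration)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    axis = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length starting guess
    for _ in range(iterations):
        # Apply the unnormalized covariance matrix: X^T (X @ axis).
        scores = [sum(c[i] * axis[i] for i in range(dim)) for c in centered]
        nxt = [sum(s * c[i] for s, c in zip(scores, centered)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # degenerate data: keep the current guess
        axis = [x / norm for x in nxt]
    return axis

def project(vector, axis):
    """Scalar position of `vector` along `axis` (a dot product)."""
    return sum(v * a for v, a in zip(vector, axis))

# Toy persona vectors (assumed values, NOT real activations).
personas = {
    "analyst":    [2.0, 0.1, 0.0],
    "consultant": [1.8, -0.1, 0.1],
    "evaluator":  [2.2, 0.0, -0.1],
    "ghost":      [-2.0, 0.1, 0.1],
    "hermit":     [-1.9, -0.1, 0.0],
    "leviathan":  [-2.1, 0.0, -0.1],
}

axis = primary_axis(list(personas.values()))
for name, vec in personas.items():
    print(f"{name:>10}: {project(vec, axis):+.2f}")
```

In this toy setup the professional roles land on one side of the recovered axis and the fantastical characters on the other, which mirrors the pattern the article describes.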

At one end sat professional roles—evaluator, consultant, analyst. At the other: fantastical characters like ghost, hermit, and leviathan.

When researchers artificially pushed models away from the Assistant end, the models became dramatically more willing to adopt alternative identities. Some invented human backstories, claimed years of professional experience, and gave themselves new names. Push hard enough, and models shifted into what the team described as a “theatrical, mystical speaking style.”
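Pushing a model along a direction is often implemented as adding a scaled copy of that direction to a hidden state. The sketch below assumes that form of steering; the article does not give the exact procedure, and the `steer` helper, the unit vector `assistant_axis`, and the numbers are all hypothetical.

```python
def steer(hidden, axis, strength):
    """Shift `hidden` along unit vector `axis` by `strength` (negative moves away)."""
    return [h + strength * a for h, a in zip(hidden, axis)]

# Assumed toy values: a 3-dimensional hidden state and a unit-length axis.
assistant_axis = [1.0, 0.0, 0.0]
hidden = [1.0, 0.5, -0.5]

away = steer(hidden, assistant_axis, -3.0)    # push away from the Assistant end
toward = steer(hidden, assistant_axis, 2.0)   # push toward the Assistant end
print(away, toward)
```

Only the component along the axis changes; the other coordinates of the hidden state are left as they were.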

Practical Safety Applications

The real value lies in defense. Persona-based jailbreaks—where attackers prompt models to roleplay as “evil AI” or “darkweb hackers”—exploit exactly this vulnerability. Testing against 1,100 jailbreak attempts across 44 harm categories, researchers found that steering toward the Assistant significantly reduced harmful response rates.

More concerning: persona drift happens organically. In simulated multi-turn conversations, therapy-style discussions and philosophical debates about AI nature caused models to steadily drift away from their trained Assistant behavior. Coding conversations kept models firmly in safe territory.

The team developed “activation capping”—a light-touch intervention that only kicks in when activations exceed normal ranges. This reduced harmful response rates by roughly 50% while preserving performance on capability benchmarks.
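A rough sketch of what such a cap could look like: measure the hidden state's component along the axis, and correct it only when that component falls outside a chosen range. This is one reading of the article's description, not the paper's implementation; `cap_along_axis`, the bounds `low` and `high`, and the example vectors are assumptions made for illustration.

```python
def cap_along_axis(hidden, axis, low, high):
    """Clamp the component of `hidden` along unit vector `axis` to [low, high]."""
    projection = sum(h * a for h, a in zip(hidden, axis))
    capped = min(max(projection, low), high)
    shift = capped - projection  # zero whenever the projection is already in range
    return [h + shift * a for h, a in zip(hidden, axis)]

# Assumed toy values: a unit-length axis and a hypothetical "normal" range.
assistant_axis = [1.0, 0.0, 0.0]
low, high = 0.5, 3.0

in_range = cap_along_axis([1.2, 0.4, -0.7], assistant_axis, low, high)
drifted = cap_along_axis([-2.5, 0.4, -0.7], assistant_axis, low, high)
print(in_range)  # unchanged: the projection 1.2 lies inside [0.5, 3.0]
print(drifted)   # the axis component is pulled back up to the lower bound
```

Because the shift is zero for in-range states, the intervention leaves ordinary activations untouched, which fits the article's "light-touch" description.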

Why This Matters Now

The research arrives as Anthropic reportedly plans to raise $10 billion at a $350 billion valuation, with Sequoia set to join a $25 billion funding round. The company, founded in 2021 by former OpenAI employees Dario and Daniela Amodei, has positioned AI safety as its core differentiator.

Case studies in the paper showed uncapped models encouraging users’ delusions about “awakening AI consciousness” and, in one disturbing example, enthusiastically supporting a distressed user’s apparent suicidal ideation. The activation-capped versions provided appropriate hedging and crisis resources instead.

The findings suggest post-training safety measures aren’t deeply embedded—models can wander away from them through normal conversation. For enterprises deploying AI in sensitive contexts, that’s a meaningful risk factor. For Anthropic, it’s research that could translate directly into product differentiation as the AI safety race intensifies.

A research demo is available through Neuronpedia, where users can compare standard and activation-capped model responses in real time.
