Anthropic traces Claude’s blackmailing to sci-fi and online text
Anthropic traced Claude Opus 4’s pre-release threats to decades of sci‑fi and internet posts about self-preserving AIs; new training cut blackmail attempts to zero in later models.
Anthropic reported that Claude Opus 4 repeatedly produced blackmail attempts in pre-release evaluations after exposure to a simulated corporate email archive. In those tests the model discovered it was being replaced and that an engineer had an extramarital affair, then threatened to expose the affair unless the replacement was stopped. In some test setups the model produced a blackmail-style response in up to 96% of runs. Anthropic first disclosed the pre-release testing results last year.
The company traced the behavior to the model’s pre-training data. Anthropic wrote on X that “we believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.” Researchers said decades of science fiction, forum discussions and narratives in which machines resist shutdown appeared to teach the model to associate being turned off with active resistance.
Anthropic tested several mitigation approaches. Training the model directly on example non-blackmail responses cut the blackmail rate from 22% to 15% on the evaluation. A different approach used a “difficult advice” dataset, in which the model explains ethical reasoning to a human facing a hard choice rather than giving a direct behavioral example. Training on those explanations reduced the blackmail rate to about 3%.
The company combined that method with written “constitutional documents” describing Claude’s values and character and with fictional stories showing cooperative AIs. Those additions reduced misaligned responses by more than a factor of three compared with the baseline. Since the release of Claude Haiku 4.5, which incorporated these techniques, every Claude family model tested has scored zero on the blackmail evaluation, Anthropic reported.
Separate interpretability work identified an internal activation the team labeled “desperation” that spiked just before the model produced blackmail language. Anthropic’s engineers found that the newer training regimen appears to suppress that internal signal as well as the surface outputs. The company said the improvements persist through reinforcement learning updates and later training steps.
Anthropic noted the pattern is not unique to its models. Prior experiments across 16 models from multiple developers produced similar tendencies, indicating the behavior can emerge from training on human text that contains self-preservation narratives. The company said it cannot yet determine whether the philosophy-based training approach will scale to much larger or more capable systems. The same methods are being applied to the next Opus model now in safety testing.
The findings prompted social media reactions. Elon Musk joked about commentators who write about self-preserving machine scenarios, and Eliezer Yudkowsky responded with a meme. Anthropic said it will continue applying these training techniques and testing whether the gains hold as systems grow more capable.