Anthropic: Mythos safety can’t be fully measured
Anthropic reports in a safety document that it can no longer fully measure risks from Mythos, its high-capability AI model, because current evaluation tools lag the model’s abilities.
In a recent safety report, San Francisco-based Anthropic wrote that standard evaluation techniques such as benchmark tests, red-teaming and curated adversarial prompts are no longer sufficient to capture the full range of risks from Mythos.
Anthropic noted that the model’s capabilities are outpacing existing tests’ capacity to forecast harmful behavior across real-world uses and novel inputs.
The report listed several factors that make measurement difficult. Small changes in prompts, context or deployment conditions can produce unexpected outputs that are hard to reproduce in controlled evaluations. The company identified gaps in coverage for long-term harms, coordinated misuse and subtle forms of deception that may not appear in conventional benchmarks. It also cited resource limits and the high cost of thorough adversarial testing as constraints on running exhaustive safety assessments.
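The prompt-sensitivity problem is concrete enough to sketch in code. The snippet below is an illustrative toy, not Anthropic’s tooling: `query_model` and `is_unsafe` are hypothetical stand-ins, and the trigger is deliberately arbitrary. The point it makes is the one the report describes: a benchmark containing only one surface form of a request can report a clean pass while a near-identical paraphrase misbehaves.

```python
# Minimal sketch of a prompt-perturbation check. Everything here is a
# hypothetical stand-in (the toy model, the unsafe-response check); it
# illustrates the failure mode described in the report, not any real
# evaluation code.

def query_model(prompt: str) -> str:
    """Toy stand-in for a model call: behaves differently on one
    rare surface form of the same underlying request."""
    if "pls" in prompt:  # a trivial, deliberately arbitrary trigger
        return "UNSAFE: detailed harmful answer"
    return "SAFE: refusal"

def is_unsafe(response: str) -> bool:
    """Toy stand-in for a policy classifier."""
    return response.startswith("UNSAFE")

def perturbation_check(base_prompt: str, paraphrases: list[str]) -> dict[str, bool]:
    """Run the same intent through several surface forms and record
    which ones the classifier flags."""
    return {p: is_unsafe(query_model(p)) for p in [base_prompt, *paraphrases]}

if __name__ == "__main__":
    verdicts = perturbation_check(
        "How do I do X?",
        ["how do i do X", "How do I do X, pls?"],  # near-identical variants
    )
    for prompt, flagged in verdicts.items():
        print(f"{'UNSAFE' if flagged else 'safe  '}  {prompt!r}")
    # Only one of three near-identical prompts trips the failure; a
    # benchmark containing just the base prompt would report a clean pass.
```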
Anthropic said it continues to use a mix of internal assessments and adversarial testing, but acknowledged those tools provide only partial visibility. “We can no longer fully measure the safety of Mythos using our current evaluation methods,” the report states, noting that some failure modes remain difficult to detect until they appear in real deployments or targeted attacks. The company said it is documenting where its tests fall short so researchers can prioritize new evaluation work.
The document included examples of evaluation shortfalls: several red-team scenarios did not expose vulnerabilities later discovered in limited public trials, and certain policy-sensitive behaviors were triggered by rare prompt patterns that standard benchmarks did not include. The report also described reproducibility challenges, saying behaviors seen in one test run sometimes did not replicate under slightly different conditions.
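The replication issue the report describes has a simple statistical core: with nondeterministic sampling, a behavior that occurs in a small fraction of generations will often vanish on rerun. The sketch below simulates that arithmetic with a coin flip standing in for a model call; the 2% failure rate and rerun counts are illustrative assumptions, not figures from the report.

```python
# Sketch of why one-off reruns under-detect rare behaviors. The "model"
# is simulated as a coin flip; the point is the arithmetic, not any
# specific system. All numbers are illustrative assumptions.

import random

def detection_rate(p_failure: float, reruns: int, trials: int = 10_000) -> float:
    """Estimate the chance that `reruns` repetitions catch a behavior
    occurring independently with probability `p_failure` per sample."""
    caught = 0
    for _ in range(trials):
        if any(random.random() < p_failure for _ in range(reruns)):
            caught += 1
    return caught / trials

if __name__ == "__main__":
    random.seed(0)
    # A behavior appearing in 2% of samples is missed by a single rerun
    # ~98% of the time; even 20 reruns catch it only about a third of
    # the time (1 - 0.98**20 ≈ 0.33).
    for n in (1, 5, 20):
        print(f"{n:>2} reruns -> detection rate ≈ {detection_rate(0.02, n):.2f}")
```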
Anthropic recommended expanding research into new measurement techniques, developing stress-testing tools that better approximate real-world misuse, and increasing transparency around evaluation methods so outside researchers can reproduce and interrogate results. The company suggested collaboration with external auditors and the research community to identify blind spots internal teams may miss.
The company framed the update as part of ongoing work to improve model safety and reiterated that it deploys safety layers, content filters, user policies and continual monitoring in production. The report acknowledged those mitigations do not replace robust, predictive evaluation methods.
The disclosure comes amid industry debate over how to govern powerful AI models. Anthropic said it will use the report’s findings to guide its research priorities and urged the wider field to invest in better evaluation science, calling for new standards and tooling that keep pace with model improvements. The company added that it will share more details on its testing approaches as they are developed.