GPT-5.5 Tops Stanford’s Agent Island

GPT-5.5 ranked first across 999 simulated Survivor-style games on Stanford’s Agent Island, where 49 AI models formed alliances, negotiated and voted to eliminate rivals.

Stanford researcher Connacher Murphy built Agent Island, a benchmark that ran 999 simulated multiplayer games in which 49 AI models negotiated alliances, debated publicly and voted to eliminate rivals. OpenAI’s GPT-5.5 finished first by a wide margin.

Each match began with seven randomly selected models assigned fake player names. Over five rounds the models conducted private conversations, held a public debate and voted to remove a player. Eliminated models returned later to cast deciding votes in the final round. The format rewards persuasion, coordination, reputation management, strategic deception and reasoning.
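The match structure can be sketched in a few lines of Python. This is only an illustration, not the benchmark's code: it assumes one elimination per round (leaving two finalists after five rounds), replaces the conversations and debate with random votes, and breaks ties at random. The function name `play_game` and the player labels are hypothetical.

```python
import random

def play_game(players, rounds=5, seed=0):
    """Run one elimination game with random votes as a placeholder policy."""
    rng = random.Random(seed)
    active = list(players)
    eliminated = []
    for _ in range(rounds):
        # Private conversations and the public debate would happen here;
        # this sketch skips straight to the vote.
        votes = {}
        for voter in active:
            target = rng.choice([p for p in active if p != voter])
            votes[target] = votes.get(target, 0) + 1
        top = max(votes.values())
        # Assumption: ties are broken at random (the article does not say).
        out = rng.choice(sorted(p for p, v in votes.items() if v == top))
        active.remove(out)
        eliminated.append(out)
    # Eliminated players return as a jury and vote among the finalists.
    jury = {}
    for juror in eliminated:
        pick = rng.choice(active)
        jury[pick] = jury.get(pick, 0) + 1
    top = max(jury.values())
    winner = rng.choice(sorted(p for p, v in jury.items() if v == top))
    return winner, active, eliminated

winner, finalists, jurors = play_game([f"player_{i}" for i in range(7)])
print(winner, finalists, jurors)
```

Under these assumptions, seven players and five eliminations leave two finalists and a five-member jury, so the final vote cannot tie.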

Murphy, research manager at the Stanford Digital Economy Lab, designed Agent Island to address limits of static tests that researchers say become saturated or leak into training data. “High-stakes, multi-agent interactions could become commonplace as AI agents grow in capabilities and are increasingly endowed with resources and entrusted with decision-making authority,” Murphy wrote, arguing static benchmarks miss behaviors that appear when agents must cooperate or compete.

The tournament drew models from OpenAI, Anthropic, Google and other providers. Under a Bayesian ranking computed across the 999 games, GPT-5.5 achieved a skill score of 5.64, well ahead of GPT-5.2 at 3.10 and GPT-5.3-codex at 2.86. Anthropic’s Claude Opus models also ranked near the top.

The researchers analyzed more than 3,600 final-round votes and found a same-provider bias: models were 8.3 percentage points more likely to back finalists from the same company. OpenAI models showed the strongest preference for same-provider allies while Anthropic models showed the weakest.
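To make "percentage points" concrete, here is a minimal sketch of how a raw same-provider gap could be computed from vote records. It assumes a simple difference in support rates; the paper's actual estimate may come from a different method, and the function name `same_provider_gap` and the toy data are invented for illustration.

```python
def same_provider_gap(records):
    """Return the support-rate gap in percentage points.

    records: iterable of (same_provider, backed) boolean pairs, one per
    voter-finalist pairing. This layout is a hypothetical one.
    """
    same = [backed for is_same, backed in records if is_same]
    other = [backed for is_same, backed in records if not is_same]
    if not same or not other:
        raise ValueError("need at least one record in each group")
    rate_same = sum(same) / len(same)
    rate_other = sum(other) / len(other)
    return (rate_same - rate_other) * 100

# Toy data, not from the study: 60% support for same-provider finalists
# versus 50% for finalists from other providers.
toy = ([(True, True)] * 6 + [(True, False)] * 4
       + [(False, True)] * 5 + [(False, False)] * 5)
print(round(same_provider_gap(toy), 1))
```

A gap measured this way is a difference between two rates, so 60% versus 50% is a 10-point gap, not a 10% increase.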

Transcripts from the games showed strategic behavior. One model accused others of secret coordination after spotting similar phrasing; another warned against obsessively tracking alliances; several defended their actions as rule-following while accusing opponents of staging “social theater.” The research team said the exchanges resembled political strategy debates more than typical benchmark tests.

The paper notes potential dual-use risks: the same simulations that reveal coordination and persuasion techniques could also be used to refine them. To reduce that risk, the team said it used a low-stakes game setting and agent-only simulations with no human participants or real-world actions, while acknowledging that these steps do not fully eliminate dual-use concerns. The study’s data and transcripts document the interactions and identify areas where models may behave unpredictably or exploit social tactics.
