Mercury 2 outpaces Google’s DiffusionGemma on AIME

Inception Labs introduced Mercury 2, a paid closed-weight diffusion LLM that generates about 1,000 tokens per second and scored 90% on AIME 2026 versus DiffusionGemma’s 69.1%.

Inception Labs introduced Mercury 2 on Thursday. The model is a paid, closed-weight diffusion language model that the company says generates about 1,000 tokens per second. In tests on the AIME 2026 math set, Mercury 2 solved 90% of problems; Google’s open-weight DiffusionGemma scored 69.1% on the same set.

Diffusion language models start from a block of random, noisy tokens and refine every position in that block at once, repeating the denoising pass several times until the output stabilizes. Autoregressive models instead generate one token at a time, choosing each new token based on the ones that came before. Because each denoising pass updates many tokens in parallel, diffusion models can produce larger chunks of text per step, which can lower latency for interactive tasks.
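The difference in the number of sequential steps can be illustrated with a toy sketch. It is not Inception's or Google's method: the "model" here is a stand-in that already knows the target sentence, and the names `autoregressive_decode`, `diffusion_decode` and the `passes` parameter are hypothetical. The only point is that one loop runs once per token while the other runs once per pass over the whole block.

```python
import random

TARGET = ["the", "cat", "sat", "on", "the", "mat"]
VOCAB = sorted(set(TARGET))


def autoregressive_decode(target):
    """Emit one token per step; each step would condition on the prefix."""
    out = []
    steps = 0
    for position in range(len(target)):
        # Stand-in for a model call that reads `out` and predicts one token.
        out.append(target[position])
        steps += 1
    return out, steps


def diffusion_decode(target, passes=3, seed=0):
    """Start from random tokens and fix part of the block on every pass."""
    rng = random.Random(seed)
    block = [rng.choice(VOCAB) for _ in target]  # the noisy starting block
    for p in range(passes):
        # Stand-in for one denoising pass: every position is visited in the
        # same pass, and a share of them settles on its final token.
        for i in range(len(block)):
            if i % passes == p:
                block[i] = target[i]
    return block, passes


ar_out, ar_steps = autoregressive_decode(TARGET)
diff_out, diff_steps = diffusion_decode(TARGET, passes=3)
print(f"autoregressive: {ar_steps} sequential steps")
print(f"diffusion:      {diff_steps} sequential passes")
```

In this toy setup both decoders end with the same six tokens, but the autoregressive loop needs six sequential steps and the diffusion loop needs three. Real models differ in how much work each step costs, so step counts alone do not determine speed.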

In Inception’s own throughput comparison, Mercury 2 generated roughly 1,000 tokens per second, versus about 89 tokens per second for Anthropic’s Claude Haiku 4.5 Reasoning and about 71 tokens per second for OpenAI’s GPT-5 Mini. Google reported similar speeds for DiffusionGemma and made its weights available on a public model hub.
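To put those throughput figures in perspective, here is a rough back-of-the-envelope calculation. The 2,000-token response length is an assumption chosen for illustration, and the sketch ignores network time, queueing and time to first token, so it shows generation time only.

```python
# Tokens-per-second figures as reported by Inception in the article.
REPORTED_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def generation_seconds(tokens, tokens_per_second):
    """Time to emit `tokens` at a steady rate, ignoring all other overhead."""
    return tokens / tokens_per_second


RESPONSE_TOKENS = 2000  # assumed response length, not from the article

for model, tps in REPORTED_TPS.items():
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    ratio = REPORTED_TPS["Mercury 2"] / tps
    print(f"{model}: {seconds:.1f} s ({ratio:.1f}x slower than Mercury 2)")
```

Under these assumptions, a 2,000-token response takes about 2 seconds at 1,000 tokens per second, roughly 22 seconds at 89 and roughly 28 seconds at 71.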

The gap was smaller on GPQA, a PhD-level science test, where Mercury 2 scored 77% and DiffusionGemma scored 73.2%. Google’s non-diffusion Gemma 4 scored 88.3% on AIME 2026, and Google’s developer guidance recommends Gemma 4 for applications that require maximum quality.

Companies have begun swapping models in production. Augment Code replaced Anthropic’s Claude Opus 4.7 with Mercury 2 for a context-compaction subagent and reported an 82% reduction in latency and a 90% drop in cost while maintaining comparable output quality, according to a joint case study.
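As a quick check on what those percentages imply, the arithmetic can be sketched as follows. The baseline values (10 seconds, $1.00) are made up for illustration; the case study figures quoted here are only the percentage reductions.

```python
def apply_reduction(baseline, reduction_pct):
    """Return what is left after cutting `baseline` by `reduction_pct` percent."""
    return baseline * (100 - reduction_pct) / 100


def improvement_factor(reduction_pct):
    """Express a percentage reduction as an 'N times lower' factor."""
    return 100 / (100 - reduction_pct)


# Hypothetical baselines, chosen only to make the percentages concrete.
latency_after = apply_reduction(10.0, 82)  # seconds
cost_after = apply_reduction(1.00, 90)     # dollars

print(f"latency: 10.0 s -> {latency_after:.1f} s "
      f"({improvement_factor(82):.1f}x lower)")
print(f"cost:    $1.00 -> ${cost_after:.2f} "
      f"({improvement_factor(90):.0f}x lower)")
```

An 82% reduction leaves 18% of the original latency, or about 5.6 times lower; a 90% drop leaves 10% of the original cost, or 10 times lower.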

Inception traces its work to research by founder Stefano Ermon, a Stanford professor who co-developed the score-based diffusion techniques used in image generators. The startup raised $50 million in funding that included backing from Nvidia’s venture arm and from investors Andrew Ng and Andrej Karpathy. Mercury 2 is available via API only, and its weights are closed.

Engineering teams are increasingly composing systems from specialized subagents for tasks such as deep reasoning, summarization, routing, tool lookup and output checking. Parallel diffusion can reduce the cost of frequent subagent calls compared with sequential autoregressive models, which may benefit latency-sensitive applications like real-time coding, voice interfaces and autocomplete.
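A minimal sketch of that kind of composition might look like the following. Everything in it is hypothetical: the backend labels, the `Route` type, and in particular the choice of which tasks go to a fast model versus a quality-first model are assumptions made for illustration, not a description of any team's system.

```python
from dataclasses import dataclass

# Assumed split: frequent, latency-sensitive subagent calls go to a fast
# model, while the hardest reasoning stays on a quality-first model.
FAST_TASKS = {"summarization", "routing", "tool_lookup", "output_checking"}
DEEP_TASKS = {"deep_reasoning"}


@dataclass(frozen=True)
class Route:
    task: str
    backend: str  # placeholder label, not a real model identifier


def choose_backend(task):
    """Map a subagent task type to a placeholder backend label."""
    if task in DEEP_TASKS:
        return Route(task, "quality-first-model")
    if task in FAST_TASKS:
        return Route(task, "fast-diffusion-model")
    raise ValueError(f"unknown task type: {task}")


for task in ["deep_reasoning", "summarization", "tool_lookup"]:
    route = choose_backend(task)
    print(f"{route.task} -> {route.backend}")
```

Keeping the mapping in one place makes it straightforward to move a task type between backends as benchmark results or costs change.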

Researchers and developers continue to test diffusion methods alongside autoregressive approaches. Diffusion models are currently used mainly for high-throughput, latency-sensitive parts of workflows rather than the most difficult reasoning tasks, and tooling for local runtimes and agent frameworks is still under development. Inception posted on social media: ‘Welcome to the diffusion era.’
