DiffusionGemma hits 1,000 tokens/sec but needs custom drafter

Google released DiffusionGemma, an open-weight text-diffusion model that generates 256-token blocks and reaches about 1,000 tokens/sec on an NVIDIA H100 but requires a custom drafter.

Google released DiffusionGemma, an open-weight text-diffusion model that generates full 256-token blocks in parallel. The company reported throughput of roughly 1,000 tokens per second on an NVIDIA H100 and “700+ tokens per second on NVIDIA GeForce RTX 5090.” The model is licensed under Apache 2.0, and Google published the weights on Hugging Face.

DiffusionGemma uses a diffusion-style generation method rather than producing one token at a time. It starts with a canvas of noisy placeholder tokens and iteratively refines entire 256-token chunks until the block becomes coherent. Because every position in a block is refined together, the model can use context on both sides of a token during generation. Standard autoregressive models cannot do this, since each new token is conditioned only on the tokens to its left.
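The refinement loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's algorithm or code: the function name `toy_denoise`, the `MASK` placeholder, and the idea of revealing a fixed share of positions per pass are all assumptions made for illustration, and the "model" here simply copies from a known target rather than predicting anything.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the release


def toy_denoise(target, steps=4, seed=0):
    """Toy stand-in for block refinement: fill masked positions over several passes.

    A real diffusion model would predict each token from the whole partially
    filled block (left and right context). Here the prediction is faked by
    copying from `target`, so only the control flow is illustrated.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)          # start from a fully masked canvas
    hidden = list(range(len(target)))     # positions not yet filled in
    rng.shuffle(hidden)                   # fill order is arbitrary, not left to right
    per_step = -(-len(target) // steps)   # ceiling division
    history = []
    for _ in range(steps):
        for pos in hidden[:per_step]:
            block[pos] = target[pos]      # assumption: oracle replaces the model
        hidden = hidden[per_step:]
        history.append(list(block))       # snapshot after each pass
    return history


if __name__ == "__main__":
    for snapshot in toy_denoise(list("abcdefgh"), steps=4):
        print(" ".join(snapshot))
```

Each printed row has fewer placeholders than the one before, and positions are filled out of order, which is the contrast with strictly left-to-right decoding.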

Running DiffusionGemma efficiently requires a small, fast drafter module that proposes token blocks in parallel while the main model verifies them in a single forward pass, a technique related to speculative decoding. That drafter is not yet included in common public runtimes. The module is absent from current versions of mlx-lm and LM Studio, and is not present in public releases of Apple’s MLX tooling for Apple Silicon.
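As a rough illustration of the draft-and-verify idea, the sketch below shows one round of greedy speculative decoding with toy functions. It is a minimal sketch under stated assumptions, not the drafter Google describes: `speculative_step`, `draft_fn`, and `target_fn` are hypothetical names, the toy target just reads characters from a fixed string, and the verification loop calls the target once per token where a real system would score all drafted positions in a single forward pass.

```python
TEXT = list("diffusion")  # toy "ground truth" the target model always produces


def target_fn(prefix):
    """Hypothetical main model: returns the next token given the prefix."""
    return TEXT[len(prefix)]


def draft_fn(prefix, k):
    """Hypothetical drafter: proposes k tokens, deliberately wrong at the third."""
    guess = TEXT[len(prefix):len(prefix) + k]
    if len(guess) > 2:
        guess[2] = "?"  # inject a mistake so the verifier has something to reject
    return guess


def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One draft-and-verify round.

    Accepts drafted tokens while the target agrees; at the first disagreement,
    takes the target's own token instead and stops.
    """
    proposal = draft_fn(prefix, k)
    accepted = []
    for tok in proposal:
        expected = target_fn(prefix + accepted)  # real systems batch these checks
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted


if __name__ == "__main__":
    print(speculative_step([], draft_fn, target_fn, k=4))
```

In this toy run the round yields three tokens for one verification pass: two accepted drafts plus the target's correction. The speedup in practice depends on how often the drafter's guesses are accepted.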

Attempts to run the model through NVIDIA NIM encountered configuration limits. The model shipped in NIM with a default context window set to 8,192 tokens, which blocked initialization of agent frameworks that require larger windows. An error captured during testing read: “agent init failed: Model google/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent.” The model’s architectural context limit is substantially higher, on the order of 256,000 tokens, but default runtime settings and missing toolchain adjustments prevented agentic workflows.
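The failure mode is a simple threshold check at startup. The sketch below reproduces the shape of that check using the numbers from the error message. It is an illustration only: `check_context_window` is a hypothetical function, not code from Hermes Agent or NIM, and how either project actually reads or overrides the configured window is not covered by the source.

```python
def check_context_window(model_id, context_window, minimum, agent_name):
    """Raise if the runtime's configured context window is below the agent's minimum.

    Hypothetical helper for illustration; real frameworks may implement this differently.
    """
    if context_window < minimum:
        raise ValueError(
            f"Model {model_id} has a context window of {context_window:,} tokens, "
            f"which is below the minimum {minimum:,} required by {agent_name}."
        )


if __name__ == "__main__":
    model = "google/diffusiongemma-26b-a4b-it"
    # Default runtime setting reported in the article: rejected.
    try:
        check_context_window(model, 8_192, 64_000, "Hermes Agent")
    except ValueError as err:
        print(f"agent init failed: {err}")
    # A window near the reported architectural limit would pass the same check.
    check_context_window(model, 256_000, 64_000, "Hermes Agent")
```

The point of the sketch is that the block comes from the configured default, not from the model's architecture: the same check passes once the runtime exposes a window of at least 64,000 tokens.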

Google framed DiffusionGemma as a speed-focused development rather than a quality upgrade compared with Gemma 4. A fine-tuned variant used for demonstration solved about 80% of Sudoku puzzles, while the base model solved almost none.

Early community responses include an initial draft pull request for llama.cpp and references to prior frameworks such as DFlash, which showed that a lightweight diffusion drafter can yield severalfold speedups on some tasks. For now, practical use is limited to developers and researchers with high-end discrete GPUs who can modify runtimes and experiment with custom inference stacks.

DiffusionGemma follows earlier research and commercial work on text diffusion models. The release provides open weights and code. Before the model can run efficiently on most consumer setups, toolchains and community implementations will need to add the specific drafter and adjust runtime defaults.
