Gemma 4 runs up to 3× faster locally with MTP drafters

Author GNcrypto

Posted: 7 May 2026, 16:38 CET

Google released Multi-Token Prediction drafters for Gemma 4, using speculative decoding to speed local inference up to 3× while preserving output quality.

Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 family. The drafters rely on speculative decoding: a small draft model proposes several tokens at once, and the main model verifies them in a single forward pass.

Speculative decoding pairs the main Gemma 4 model with a smaller, faster drafter. The drafter predicts several tokens in one step, and the target model then checks those tokens together rather than producing each token sequentially. The drafter shares the target model’s key-value (KV) cache to avoid recomputing context. For very small edge models, the team added a clustering method to cut generation time on phones and single-board computers.
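The draft-and-verify loop described above can be illustrated with a minimal Python sketch. This is not Google's implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the Gemma 4 model and its drafter, decoding is greedy, the toy drafter proposes tokens one at a time rather than several per step, and there is no KV cache. The sketch only shows why accepting drafted tokens leaves the output identical to ordinary decoding.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (deterministic toy)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the drafter: agrees with the target
    except when the context length is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def speculative_step(ctx: List[Token], k: int,
                     target: NextToken, draft: NextToken) -> List[Token]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(ctx + proposal))

    # A real system scores all drafted positions in one batched forward pass
    # of the target model; this toy checks them in a loop instead.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # target's own token replaces the miss
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate_speculative(prompt: List[Token], n_new: int, k: int,
                         target: NextToken, draft: NextToken
                         ) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, k, target, draft))
        passes += 1
    return out[:len(prompt) + n_new], passes


def generate_plain(prompt: List[Token], n_new: int,
                   target: NextToken) -> List[Token]:
    """Ordinary decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

Calling `generate_speculative([1, 2, 3], 20, 4, target_next, draft_next)` yields the same tokens as `generate_plain([1, 2, 3], 20, target_next)` while using fewer verification passes than generated tokens. In this toy, the saving comes entirely from how often the drafter's guesses match the target.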

Google’s benchmarks report a roughly 2× increase in tokens per second for a Gemma 4 26B model running on an NVIDIA RTX Pro 6000 GPU. On Apple Silicon, batching four to eight requests produced about a 2.2× speedup. The company says results vary by hardware, model size and request patterns and places an upper bound near 3× in optimal conditions.
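A rough back-of-the-envelope model helps explain why the speedup depends so heavily on conditions. The sketch below uses the standard simplification from the speculative decoding literature that each drafted token is accepted independently with the same probability. The parameter names (`accept_rate`, `k`, `draft_cost_ratio`) are illustrative assumptions, not figures published by Google, and the cost term assumes the drafter pays a fixed fraction of a target pass per drafted token, which may overstate the cost of a drafter that predicts several tokens in one step.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target verification pass when k tokens
    are drafted and each is accepted independently with accept_rate."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if accept_rate == 1.0:
        return float(k + 1)  # all k drafts accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, k: int,
                      draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    draft_cost_ratio is the (assumed) cost of drafting one token relative
    to one target forward pass; memory and batching effects are ignored.
    """
    cost_per_pass = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_pass(accept_rate, k) / cost_per_pass


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(estimated_speedup(rate, k=4, draft_cost_ratio=0.05), 2))
```

Under these assumptions the estimate rises quickly with the acceptance rate and falls as the drafter gets more expensive, which is consistent with the company's statement that results vary by hardware, model size and request patterns. It is not a reconstruction of the reported benchmarks.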

The drafters are released under the Apache 2.0 license and are available on Hugging Face, Kaggle and Ollama. They work with serving stacks including vLLM, MLX, SGLang and Hugging Face Transformers.

Google described the verification step this way: “if the target model agrees with the draft, it accepts the entire sequence in a single forward pass - and even generates an additional token of its own in the process.”

The drafters are implemented as a serving optimization rather than a change to Gemma 4’s architecture. The company positions the release as a method to reduce latency for applications that require low response times, such as near-real-time chat, voice interfaces and agent workflows.

The release follows broader industry efforts to improve inference efficiency so larger, higher-quality models can run more smoothly on local hardware without changing model weights or relying on heavy compression.
