Xiaomi hits 1,000+ tokens/sec on 1T model with 8 GPUs

MiMo‑V2.5‑Pro‑UltraSpeed reached over 1,000 tokens per second on a 1‑trillion‑parameter model using a standard 8‑GPU server, combining FP4 expert quantization, DFlash decoding and TileRT.

Xiaomi and its inference partner TileRT reported that MiMo‑V2.5‑Pro‑UltraSpeed sustained throughput above 1,000 tokens per second on a 1‑trillion‑parameter model running on a single standard 8‑GPU server built from commodity GPUs. In demonstrations, peak throughput reached about 1,200 tokens per second.

The configuration combines two model‑level techniques with a custom inference engine. First, Xiaomi applied 4‑bit floating point (FP4) quantization to the model’s expert weight blocks, which reduces memory footprint and bandwidth demand, and left the remaining parameters at higher precision. Second, the setup uses DFlash speculative decoding to propose a block of tokens and verify it in one forward pass instead of generating tokens one at a time. The engine, TileRT, coordinates execution across the GPUs and minimizes per‑operator overhead so that GPU execution stays continuous.
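To make the quantization step concrete, here is a minimal Python sketch of rounding one block of weights to 4‑bit floating point values with a shared scale. It is an illustration under stated assumptions, not Xiaomi's implementation: the article does not say which FP4 variant or block size is used, so the E2M1 value grid and the per‑block scale are assumptions, and the function names are hypothetical.

```python
# Magnitudes representable in the E2M1 4-bit float format.
# ASSUMPTION: the article does not specify which FP4 variant Xiaomi uses.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights):
    """Round one block of weights to FP4 grid values plus one shared scale.

    Hypothetical helper: per-block scaling is an assumption for illustration.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        # Pick the closest representable magnitude, then restore the sign.
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if w >= 0 else -nearest)
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from FP4 grid values and the block scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [3.0, -1.5, 0.5, 1.2]
    codes, scale = quantize_block(block)
    print(codes, scale)
    print(dequantize_block(codes, scale))
```

Each stored weight then needs only 4 bits plus its share of the block scale, which is where the memory and bandwidth savings come from. Values that fall between grid points, such as 1.2 in the example block, are the source of any quantization error.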

Xiaomi reported that quantizing only the expert layers produced near‑zero measurable quality loss in internal tests. On coding tasks, the company said the model accepted on average 6.3 of every 8 proposed tokens during each verification round, allowing multiple tokens to be confirmed in a single step rather than one token at a time.
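The effect of that acceptance rate can be sketched in a few lines of Python. The verification rule below (accept the longest prefix of the draft that matches what the full model would have produced) is a common greedy scheme and is an assumption here; the article does not describe DFlash's exact rule. The function names are hypothetical, and the estimate ignores the cost of producing the draft itself.

```python
import math


def accept_prefix(draft_tokens, target_tokens):
    """Return the draft tokens accepted under greedy prefix verification.

    target_tokens[i] stands for what the full model would emit at position i
    given the prompt plus draft_tokens[:i]. One forward pass over the whole
    draft block yields all of these predictions at once.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # the first mismatch ends the accepted run
        accepted.append(drafted)
    return accepted


def mean_accepted(rounds):
    """Average accepted tokens per round over (draft, target) pairs."""
    lengths = [len(accept_prefix(d, t)) for d, t in rounds]
    return sum(lengths) / len(lengths)


def target_passes(total_tokens, accepted_per_pass):
    """Full-model forward passes needed to emit total_tokens."""
    return math.ceil(total_tokens / accepted_per_pass)


if __name__ == "__main__":
    print(accept_prefix([1, 2, 3, 4], [1, 2, 9, 4]))  # first two tokens match
    print(target_passes(1000, 6.3))                   # with the reported rate
    print(target_passes(1000, 1))                     # one token per pass
```

With the reported average of 6.3 accepted tokens per round, 1,000 tokens need about 159 full forward passes under these assumptions, compared with 1,000 passes when decoding one token at a time.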

The implementation runs on commodity GPUs rather than custom accelerators. By comparison, a wafer‑scale chip design previously reached about 969 tokens per second on a 405‑billion‑parameter model, and other custom accelerators have reported ranges from roughly 300 to 750 tokens per second depending on the model and benchmark. Xiaomi emphasized that its result used a 1‑trillion‑parameter model on rentable GPU hardware.

Xiaomi opened an application‑based API trial, running from June 9 through June 23, that gives priority to enterprise and professional developers. The trial uses an FP4‑DFlash checkpoint that Xiaomi has posted to Hugging Face for community testing. Xiaomi said the UltraSpeed serving mode costs three times the standard MiMo‑V2.5‑Pro rate while delivering roughly ten times the generation speed.

Xiaomi previously published cost figures for the MiMo‑V2.5‑Pro model at about $0.43 per million input tokens and $0.87 per million output tokens. The company noted that UltraSpeed accelerates the full MiMo‑V2.5‑Pro model rather than a reduced or distilled version.
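Combining these prices with the stated threefold UltraSpeed rate gives a rough per‑request cost. The Python sketch below assumes the multiplier applies equally to input and output tokens, which the article does not confirm, and the function and constant names are hypothetical.

```python
from decimal import Decimal

# Published standard MiMo-V2.5-Pro prices, in USD per million tokens.
STANDARD_USD_PER_MTOK = {"input": Decimal("0.43"), "output": Decimal("0.87")}

# ASSUMPTION: the 3x UltraSpeed rate applies uniformly to input and output.
ULTRASPEED_MULTIPLIER = Decimal("3")


def request_cost(input_tokens, output_tokens, ultraspeed=False):
    """Estimate the USD cost of one request at standard or UltraSpeed rates."""
    multiplier = ULTRASPEED_MULTIPLIER if ultraspeed else Decimal("1")
    standard_cost = (
        STANDARD_USD_PER_MTOK["input"] * input_tokens
        + STANDARD_USD_PER_MTOK["output"] * output_tokens
    ) / Decimal(1_000_000)
    return standard_cost * multiplier


if __name__ == "__main__":
    print(request_cost(1_000_000, 1_000_000))
    print(request_cost(1_000_000, 1_000_000, ultraspeed=True))
```

Under that assumption, a workload of one million input tokens and one million output tokens costs about $1.30 at the standard rate and about $3.90 in UltraSpeed mode.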

TileRT’s design focuses on keeping GPU execution continuous and removing launch overhead for individual operators. Xiaomi described the combined methods as model and system design working together to reduce memory transfers and the number of full model forward passes needed to produce multiple tokens.
