Xiaomi MiMo V2.5 Pro: 1M-Token Multimodal Model
Xiaomi launched MiMo V2.5 and V2.5‑Pro, multimodal AI models that handle text, images, audio and video with a 1 million‑token context; Pro targets coding and agent workflows.
Xiaomi launched MiMo V2.5 and MiMo V2.5‑Pro, a two-model family that combines text, image, audio and video understanding in a single architecture and supports a 1,000,000‑token context window. The Pro edition is positioned for intensive coding and agentic tasks.
The company integrated native image, video and audio processing into the MiMo architecture alongside text and code. Xiaomi plans to open-source the models in the near future and has made them available through its MiMo API, though the AI Studio listing showed limited access for some users at launch.
Xiaomi reported benchmark results that place MiMo V2.5‑Pro near leading systems on coding and agent tests. On SWE‑bench Pro, which measures bug fixing in startup codebases, V2.5‑Pro achieved a 57.2% pass rate, well above the roughly 25% average on that test. The model also scored close to top systems on τ3‑bench and ClawEval. On Humanity’s Last Exam, a graduate‑level problem suite, it scored 48.0%, compared with 58.7% for GPT‑5.4.
The company highlighted token efficiency: V2.5‑Pro uses 42% fewer tokens than Kimi K2.6 for similar benchmark outcomes, and V2.5 uses roughly half the tokens of Muse Spark for comparable results. Xiaomi described V2.5‑Pro as “a major leap from MiMo‑V2‑Pro in general agentic capabilities, complex software engineering, and long‑horizon tasks.”
Xiaomi gave throughput and pricing details. V2.5‑Pro runs at about 60–80 tokens per second and is priced at $1.00 per million input tokens and $3.00 per million output tokens. The consumer‑oriented MiMo V2.5 runs about 100–150 tokens per second and costs $0.40 per million input tokens and $2.00 per million output tokens. Both models support the full 1,000,000‑token context, which the company said equates to roughly 750,000 words in a single conversation.
The firm adjusted its credit plan: V2.5 uses a 1x credit rate and V2.5‑Pro a 2x rate. Xiaomi removed an extra multiplier that had applied to use of the full 1M context and provided a full credit reset for existing users as a launch bonus.
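Taken together, the per-token rates and credit multipliers lend themselves to a quick back-of-the-envelope calculation. A minimal sketch in Python, using only the figures quoted above (actual billing may differ):

```python
# Back-of-the-envelope cost estimate from the published rates.
# All figures come from the announcement; real billing may differ.

PRICING = {  # USD per 1M tokens: (input, output)
    "mimo-v2.5-pro": (1.00, 3.00),
    "mimo-v2.5":     (0.40, 2.00),
}
CREDIT_RATE = {"mimo-v2.5-pro": 2, "mimo-v2.5": 1}  # stated credit multipliers

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the quoted per-million rates."""
    in_rate, out_rate = PRICING[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

def billed_credits(model: str, base_credits: float) -> float:
    """Apply the model's credit multiplier to raw usage."""
    return base_credits * CREDIT_RATE[model]

# A full 1M-token prompt with a 10k-token reply on the Pro model:
print(f"${estimate_cost_usd('mimo-v2.5-pro', 1_000_000, 10_000):.2f}")  # $1.03
print(billed_credits("mimo-v2.5-pro", 5))  # 10
```

Note that, with the extra long-context multiplier removed, a full 1M-token prompt is billed at the same per-token rate as any smaller request.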
The multimodal capability expands use cases where a single model can accept images, video and audio in the same session with text and code. Examples include uploading photos for recipe suggestions, supplying a video tutorial for step‑by‑step summaries, and providing recorded meetings for automated action‑item extraction.
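Xiaomi has not published the MiMo API's request schema, so as an illustration only, here is how a mixed image-and-text request might be assembled if the API follows the widely used OpenAI-compatible chat format (the endpoint shape, model id, and field names below are all assumptions):

```python
import base64
import json

def build_multimodal_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a chat-style payload pairing an image with a text prompt.

    Assumes an OpenAI-compatible message format with base64 data URLs;
    the real MiMo API schema has not been published and may differ.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

payload = build_multimodal_request(
    "mimo-v2.5",                                  # assumed model id
    "Suggest a recipe using these ingredients.",  # the recipe use case above
    b"\xff\xd8\xff",                              # placeholder JPEG bytes
)
print(json.dumps(payload)[:60])
```

Audio or video inputs would presumably follow the same pattern with a different content type, but that too is unconfirmed until Xiaomi documents the API.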
Xiaomi said the next generation is already in training with a focus on deeper reasoning, tighter tool integration and stronger real‑world grounding. The V2.5 family follows a rapid release cadence that began in December 2025 with MiMo‑V2‑Flash and continued in March with V2‑Pro, V2‑Omni and TTS models.
The company disclosed a planned AI investment of at least $8.7 billion over the next three years. Xiaomi cited traffic data indicating MiMo models accounted for about 21% of OpenRouter traffic in early April, up 42% over the prior seven days, and said an earlier arrangement with the agentic tool Hermes, which offered temporary free access to MiMo V2‑Pro, helped drive adoption.