StepAudio 2.5 Realtime Tops Voice Benchmarks, Detects Sighs

Shanghai-based StepFun released StepAudio 2.5 Realtime, an end-to-end Chinese-English voice model that topped five internal benchmarks and detects paralinguistic cues such as sighs.

Shanghai-based AI lab StepFun has released StepAudio 2.5 Realtime, an end-to-end real-time speech model that accepts audio input and produces audio output without converting speech to text. The model supports English and Chinese and is available through StepFun’s platform.

StepFun published benchmark results from April 2026 that place StepAudio ahead of competitors including GPT Realtime 1.5 and Gemini Live across five voice AI tests the company ran. On a paralinguistic comprehension test scored 0–100, StepFun reported StepAudio at 82.18, GPT Realtime 1.5 at 80.46, Gemini Live at 58.05 and DouBao Realtime at 16.09. In a human-evaluation test where mobile app users spoke with the model and responses were rated by human evaluators on a 0–100 scale, StepFun reported scores of 80.41 for StepAudio, 68.01 for GPT Realtime 1.5 and 67.16 for Gemini Live. An objective general dialogue-quality test returned 86.36 for StepAudio and 81.60 for GPT Realtime 1.5.
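To put the reported margins in perspective, the short Python sketch below computes the gap between the top two models on each of the three tests quoted above. The scores are copied from StepFun's published figures. The dictionary layout and the helper name `lead_over_runner_up` are illustrative assumptions, not part of any StepFun tooling.

```python
# Scores as reported by StepFun (company-run benchmarks, 0-100 scale).
# The structure and names here are illustrative assumptions.
reported_scores = {
    "paralinguistic comprehension": {
        "StepAudio": 82.18,
        "GPT Realtime 1.5": 80.46,
        "Gemini Live": 58.05,
        "DouBao Realtime": 16.09,
    },
    "human evaluation": {
        "StepAudio": 80.41,
        "GPT Realtime 1.5": 68.01,
        "Gemini Live": 67.16,
    },
    "general dialogue quality": {
        "StepAudio": 86.36,
        "GPT Realtime 1.5": 81.60,
    },
}


def lead_over_runner_up(scores):
    """Return (leader, runner_up, gap in points) for one test."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (leader, top), (runner_up, second) = ranked[0], ranked[1]
    # Round to two decimals to hide floating-point noise in the subtraction.
    return leader, runner_up, round(top - second, 2)


for test_name, scores in reported_scores.items():
    leader, runner_up, gap = lead_over_runner_up(scores)
    print(f"{test_name}: {leader} leads {runner_up} by {gap} points")
```

On these figures, the lead over GPT Realtime 1.5 is narrowest on paralinguistic comprehension (1.72 points) and widest on the human-evaluation test (12.40 points), with the dialogue-quality test in between (4.76 points).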

StepFun reported the model was trained on a million-scale persona dataset that began with more than 10,000 human-authored persona seeds expanded algorithmically into a large feature matrix. The company reported applying roleplay-specific reinforcement learning from human feedback to reduce out-of-character behavior in long or adversarial conversations.

According to StepFun, StepAudio can detect nonverbal acoustic cues such as sighs, speaking rate, emotional tone and age indicators, and factors those signals in before generating a reply.

StepFun launched the model this week from its Shanghai offices. The company also released a flagship persona, Xiao Yue, which StepFun materials describe as a ‘soul-level companion’ meant to feel like texting a friend. Developers can create and configure custom personas through an API, and documentation is available at platform.stepfun.com.

StepFun was founded in April 2023 by Jiang Daxin, a former Microsoft executive who worked on Bing, Cortana and Azure cognitive services. The startup has raised about $1.7 billion and earlier this year published results for a 196-billion-parameter text model called Step 3.5 Flash.

The benchmark results come from StepFun’s own internal testing and were published by the company. StepAudio 2.5 Realtime is offered as a live service for developers and end users.
