U.S. Agency: DeepSeek V4 Pro about eight months behind
CAISI reported on May 1 that DeepSeek V4 Pro trails U.S. frontier models by about eight months under an IRT-based evaluation; researchers and public leaderboards dispute the result.
The Center for AI Standards and Innovation (CAISI), a unit of the National Institute of Standards and Technology, released an evaluation on May 1 finding that China’s DeepSeek V4 Pro lags roughly eight months behind U.S. frontier AI models. CAISI called DeepSeek the most capable Chinese model it has tested to date.
CAISI applied item response theory (IRT), a scoring method adapted from standardized testing, across nine benchmarks in five domains: cybersecurity, software engineering, natural sciences, abstract reasoning and math. The IRT approach estimates a model’s latent capability by weighting which problems it solves and which it misses, yielding point scores anchored to a reference model rather than raw percentage-correct totals.
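CAISI has not published its IRT implementation, but the core idea, that solving harder or more discriminating items raises the estimate more than raw accuracy alone would, can be illustrated with a minimal two-parameter logistic (2PL) sketch. Everything below (the item parameters, the grid search, the response patterns) is a hypothetical illustration, not CAISI’s actual method:

```python
import math

def p_correct(theta, a, b):
    # 2PL IRT: probability of a correct response given latent ability theta,
    # item discrimination a, and item difficulty b
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items):
    # Maximum-likelihood ability estimate via a coarse grid search over theta.
    # `responses` is a list of 0/1 outcomes; `items` is a list of (a, b) pairs.
    grid = [i / 100.0 for i in range(-400, 401)]  # theta in [-4, 4]
    def loglik(theta):
        ll = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll
    return max(grid, key=loglik)

# Hypothetical items: an easy one, a medium one, and a hard, highly
# discriminating one. Two models each answer 2 of 3 correctly, but the
# model that solves the hard item gets a higher ability estimate.
items = [(0.5, -1.0), (1.0, 0.0), (2.0, 2.0)]
easy_solver = estimate_ability([1, 1, 0], items)  # misses the hard item
hard_solver = estimate_ability([1, 0, 1], items)  # solves the hard item
```

This is why IRT-derived point scores can diverge from public percentage-correct leaderboards: two models with identical accuracy can land at different latent-ability estimates depending on which items they solve.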
In CAISI’s IRT-derived ranking, OpenAI’s GPT-5.5 scored about 1,260 points, Anthropic’s Claude Opus 4.6 about 999 points, and DeepSeek V4 Pro around 800 (±28). CAISI placed GPT-5.4 mini at 749 points. Two of the nine benchmarks were non-public, and CAISI’s largest reported gaps appear on those private tests. For example, on a closed cybersecurity test called CTF-Archive-Diamond, GPT-5.5 scored 71% while CAISI recorded DeepSeek at roughly 32%.
On public tests the differences were smaller. CAISI reported DeepSeek at 90% on a PhD-level science reasoning benchmark versus 91% for Opus 4.6. On several math-olympiad tasks DeepSeek posted scores of 97%, 96% and 96%. In software engineering tests that measure real GitHub bug fixes, CAISI recorded DeepSeek at 74% compared with GPT-5.5’s 81%. DeepSeek’s own technical report asserts V4 Pro matches Opus 4.6 and GPT-5.4 on many public benchmarks.
For its cost comparison, CAISI excluded U.S. models it judged either too expensive or too weak, leaving GPT-5.4 mini as the sole U.S. comparator. Under that filter, DeepSeek was cheaper on five of seven measured benchmarks.
The evaluation drew immediate challenge from the DeepSeek developer using the pseudonym Ex0bit, who wrote, “There’s no ‘gap’, and no one’s 8 months behind. We’ve been trolled on every closed U.S drop and flexed on with open weights.” Independent trackers show narrower differences on public leaderboards: a 2026 AI index from an academic institution found the U.S.-China performance gap on public leaderboards had narrowed to about 2.7%, and another intelligence index placed OpenAI and DeepSeek closer than CAISI’s IRT scores indicate.
Critics of the CAISI report pointed to the use of private datasets and the decision to filter comparators in the cost analysis as factors shaping the result. CAISI said it will publish a full IRT methodology write-up in the near term. DeepSeek debuted in January 2025, and the new evaluation adds to ongoing comparisons between Chinese and U.S. frontier models.
The material on GNcrypto is intended solely for informational use and must not be regarded as financial advice. We make every effort to keep the content accurate and current, but we cannot warrant its precision, completeness, or reliability. GNcrypto does not take responsibility for any mistakes, omissions, or financial losses resulting from reliance on this information. Any actions you take based on this content are done at your own risk. Always conduct independent research and seek guidance from a qualified specialist. For further details, please review our Terms, Privacy Policy and Disclaimers.