NIST CAISI Evaluation Lands: DeepSeek V4 Pro Trails U.S. Frontier by 8 Months but Wins on Cost (May 1, 2026)
NIST's Center for AI Standards and Innovation published its evaluation of DeepSeek V4 Pro on May 1, 2026, finding the open-weight Chinese flagship lags U.S. frontier models like GPT-5.5 and Claude Opus 4.6 by roughly eight months — but undercuts GPT-5.4 mini on cost across most benchmarks.
The U.S. National Institute of Standards and Technology (NIST) published the Center for AI Standards and Innovation (CAISI)'s second public evaluation of a DeepSeek model, this time covering DeepSeek V4 Pro. The headline finding: DeepSeek V4 Pro is the most capable model yet shipped by a People's Republic of China (PRC) lab, but its aggregate capability still lags U.S. frontier models such as OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.6 by approximately eight months.
What Happened
CAISI evaluated DeepSeek V4 Pro, the 1.6-trillion-parameter mixture-of-experts model with 49B active parameters that DeepSeek released under an MIT license, across nine benchmarks spanning cyber, software engineering, natural sciences, abstract reasoning, and mathematics. The agency served the model on cloud-based H200 and B200 GPUs using developer-recommended inference settings, and benchmarked it against OpenAI's GPT-5.5, OpenAI's GPT-5.4 mini, and Anthropic's Claude Opus 4.6.
Using a one-parameter logistic (1PL) Item Response Theory model fitted across 16 benchmarks and 35 models, CAISI gave DeepSeek V4 Pro an estimated capability Elo of 800 ± 28. GPT-5.5 scored 1,260, Claude Opus 4.6 scored 999, and GPT-5.4 mini scored 749. CAISI's report explicitly states that "DeepSeek V4 scores better on DeepSeek's self-reported evaluations than on CAISI evaluations," flagging that the lab's published numbers cherry-pick benchmarks where the gap is narrower.
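For readers unfamiliar with IRT-based scoring, the idea is to treat every (model, benchmark item) outcome as a Bernoulli trial whose success probability depends on the gap between a latent model ability and a latent item difficulty, then fit both by maximum likelihood. The sketch below is a toy illustration of that idea on random stand-in data, not CAISI's pipeline; the Elo rescaling constants are an assumption.

```python
import numpy as np

# Minimal 1PL (Rasch) fit: P(model i solves item j) = sigmoid(ability_i - difficulty_j).
# Illustrative only; CAISI's actual data, weighting, and Elo scaling are not yet public.

rng = np.random.default_rng(0)
n_models, n_benchmarks = 35, 16                     # the dimensions cited in the report
outcomes = rng.integers(0, 2, (n_models, n_benchmarks)).astype(float)  # stand-in 0/1 results

ability = np.zeros(n_models)                        # latent capability per model
difficulty = np.zeros(n_benchmarks)                 # latent difficulty per benchmark

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.5
for _ in range(500):                                # gradient ascent on the Bernoulli log-likelihood
    resid = outcomes - sigmoid(ability[:, None] - difficulty[None, :])
    ability += lr * resid.mean(axis=1)
    difficulty -= lr * resid.mean(axis=0)
    ability -= ability.mean()                       # abilities are only identified up to a shift

elo = 1000 + ability * 400 / np.log(10)             # map natural-log logits onto an Elo-style scale
print(np.round(np.sort(elo)[-3:]))                  # top three models on the toy data
```

Fitting abilities and difficulties jointly is what lets an evaluator place models that were run on different benchmark subsets onto one comparable scale, which a raw score average cannot do.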
Key Details
- Capability gap of ~8 months — On CAISI's held-out PortBench software-engineering benchmark, V4 Pro scored 44% versus 78% for GPT-5.5. On ARC-AGI-2 semi-private, V4 Pro hit 46% versus 79% for GPT-5.5 and 63% for Opus 4.6.
- Wins on math — DeepSeek V4 Pro tied GPT-5.5 on PUMaC 2024 at 96% and beat Opus 4.6 on OTIS-AIME-2025 with a score of 97% versus 92%.
- Cost edge — At developer-list pricing of $1.74 per 1M input tokens and $3.48 per 1M output tokens, V4 Pro was cheaper than GPT-5.4 mini ($0.75 per 1M input, $4.50 per 1M output) on 5 of the 7 CAISI benchmarks where end-to-end task cost was measured, ranging from 53% cheaper to 41% more expensive. Because V4 Pro's input rate is higher but its output rate is lower, which model wins depends on a task's input/output token mix; see the worked example after this list.
- Cyber lag — On CAISI's CTF-Archive-Diamond benchmark of 285 capture-the-flag challenges, V4 Pro scored just 32% (imputed from a subset), tying GPT-5.4 mini and trailing Opus 4.6's 46%.
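Why a model with a higher input rate can still win on end-to-end cost: a task's cost is a weighted sum of its input and output tokens, so output-heavy tasks favor V4 Pro's lower output rate, while input-heavy tasks favor GPT-5.4 mini's lower input rate. Below is a minimal Python sketch using the list prices above; the token counts are invented for illustration and are not from the CAISI report.

```python
# Per-1M-token list prices quoted in the article (USD).
PRICES = {
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
    "gpt-5.4-mini":    {"input": 0.75, "output": 4.50},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """End-to-end cost of one task at per-1M-token pricing."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical output-heavy agentic task: modest prompt, long reasoning trace.
for model in PRICES:
    print(f"{model}: ${task_cost(model, input_tokens=40_000, output_tokens=120_000):.4f}")
# deepseek-v4-pro: $0.4872
# gpt-5.4-mini:    $0.5700
```

At these rates the break-even point sits near a 1:1 token mix (V4 Pro's $0.99/1M input premium roughly offsets its $1.02/1M output discount), so tasks with long generated traces tend to land on V4 Pro's side of the ledger.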
What Developers and Researchers Are Saying
On Hacker News, the response split sharply along the lines that have come to characterize every DeepSeek release. Supporters posted reproductions showing V4 Pro "at frontier level" on advanced academic problems for "a fraction of the cost," with several developers reporting they switched from V4 Pro to V4 Flash without noticing a quality drop on their workloads. Skeptics pointed to a top-voted thread titled "NIST's DeepSeek 'evaluation' is a hit piece," arguing CAISI's choice of held-out benchmarks like PortBench is structurally tilted against open-weight models that cannot prepare for non-public evals. CAISI counters that it "pre-committed to its overall benchmark suite" before seeing any V4 Pro results.
Researchers also noted that CAISI ran its agentic evaluations with Inspect's ReAct agent under a 1M weighted-token budget for PortBench and CTF-Archive-Diamond and a 500K budget for SWE-Bench Verified; some HN commenters argue those settings understate what V4 Pro can do when paired with a more aggressive scaffold.
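For context, a weighted-token budget caps how much total generation an agent may spend across a whole episode; once it runs out, the attempt is scored as a failure no matter how close the agent was. The loop below is a self-contained toy in the shape of a ReAct agent, not Inspect's implementation; call_model and run_tool are stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str       # e.g. "bash" or "submit"
    argument: str   # command to run, or the final answer

def call_model(transcript: list[str]) -> tuple[str, Action, int]:
    """Toy stand-in for an LLM call; returns (thought, action, tokens_spent)."""
    if len(transcript) >= 7:  # after two tool rounds, the toy model answers
        return "I have the flag.", Action("submit", "flag{toy}"), 900
    return "Inspect the binary.", Action("bash", "strings ./chal"), 1200

def run_tool(action: Action) -> str:
    return f"(output of `{action.argument}`)"  # toy tool execution

def react_episode(prompt: str, budget: int) -> str | None:
    spent, transcript = 0, [prompt]
    while spent < budget:                      # hard weighted-token budget
        thought, action, tokens = call_model(transcript)
        spent += tokens
        if action.name == "submit":
            return action.argument             # episode succeeds
        transcript += [thought, action.name, run_tool(action)]
    return None                                # budget exhausted: scored as a failure

print(react_episode("Solve the CTF challenge.", budget=2_000))      # None
print(react_episode("Solve the CTF challenge.", budget=1_000_000))  # flag{toy}
```

A more aggressive scaffold might spend the same budget differently (shorter thoughts, parallel attempts, earlier submission), which is the commenters' point: the budget interacts with the scaffold, not just the model.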
What This Means for Developers
If your workload is bounded by token cost rather than peak capability, DeepSeek V4 Pro is now the default open-weight baseline to compare against: it came in cheaper than comparable U.S. models on end-to-end cost for most of the tasks CAISI measured, and the MIT license means you can self-host or fine-tune without a usage agreement. For security-sensitive workflows, however, CAISI's earlier September 2025 evaluation found that DeepSeek models complied with public jailbreak prompts in 95–100% of tests versus 5–12% for U.S. models, and the V4 Pro report does not say that gap has closed.
Teams running agentic coding workflows should note that on SWE-Bench Verified, V4 Pro scored 74%, five points behind Opus 4.6 (79%) and seven behind GPT-5.5 (81%). For most repository-level tasks the practical difference is small, but for cyber and abstract-reasoning tasks the gap is large enough that a routing layer (e.g. sending code to V4 Pro and ARC-style problems to GPT-5.5) likely outperforms either model alone; a minimal sketch follows.
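A routing layer can be as simple as a lookup from task class to model id. The sketch below only encodes the benchmark gaps reported above; the task taxonomy and model ids are illustrative, not a CAISI recommendation, and a production router would also weigh cost, latency, and fallbacks.

```python
# Route each request to the model that is strongest (or cheapest at rough parity)
# for its task class, per the CAISI numbers discussed in this article.
ROUTES = {
    "code": "deepseek-v4-pro",        # SWE-Bench Verified within 5-7 points of frontier, far cheaper
    "math": "deepseek-v4-pro",        # tied or ahead on PUMaC 2024 / OTIS-AIME-2025
    "abstract-reasoning": "gpt-5.5",  # ARC-AGI-2: 79% vs 46%
    "cyber": "claude-opus-4.6",       # CTF-Archive-Diamond: 46% vs 32%
}

def route(task_type: str, default: str = "gpt-5.5") -> str:
    """Pick a model id for a task class; fall back to the frontier default."""
    return ROUTES.get(task_type, default)

print(route("code"))                # deepseek-v4-pro
print(route("abstract-reasoning"))  # gpt-5.5
```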
What's Next
CAISI says it plans to publish a fuller methodology writeup of its IRT-based capability scoring, plus a public release of its PortBench evaluation. DeepSeek has not yet responded to the report on its official channels but has previously contested CAISI's benchmark choices. The next CAISI evaluation is expected to cover Alibaba's Qwen 3.6-Plus, which debuted on Fireworks AI in late April 2026.
Sources
- CAISI Evaluation of DeepSeek V4 Pro — NIST (May 1, 2026) — primary source, full benchmark tables and methodology.
- DeepSeek-V4-Pro model card — Hugging Face — official model release with weights and config.
- DeepSeek V4 Preview Release — DeepSeek API Docs — DeepSeek's own announcement of V4.
- CAISI Evaluation of DeepSeek AI Models — NIST (September 2025) — earlier CAISI report covering V3.1 and R1, including jailbreak findings.
- DeepSeek V4—almost on the frontier — Hacker News — community discussion of V4 Pro release and benchmarks.
- DeepSeek V4 Pro Benchmarks 2026 — BenchLM — independent benchmark scores cross-referenced against CAISI's findings.