NIST CAISI Evaluation Lands: DeepSeek V4 Pro Trails U.S. Frontier by 8 Months but Wins on Cost (May 1, 2026)
NIST's Center for AI Standards and Innovation published its evaluation of DeepSeek V4 Pro on May 1, 2026, finding the open-weight Chinese flagship lags U.S. frontier models like GPT-5.5 and Claude Opus 4.6 by roughly eight months — but undercuts GPT-5.4 mini on cost across most benchmarks.
The U.S. National Institute of Standards and Technology (NIST) published the Center for AI Standards and Innovation (CAISI)'s second public evaluation of a DeepSeek model, this time covering DeepSeek V4 Pro. The headline finding: DeepSeek V4 Pro is the most capable model yet shipped by a People's Republic of China (PRC) lab, but its aggregate capability still lags U.S. frontier models such as OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.6 by approximately eight months.
What Happened
CAISI evaluated DeepSeek V4 Pro, the 1.6-trillion-parameter mixture-of-experts model with 49B active parameters that DeepSeek released under an MIT license, across nine benchmarks spanning cyber, software engineering, natural sciences, abstract reasoning, and mathematics. The agency served the model on cloud-based H200 and B200 GPUs using developer-recommended inference settings, and benchmarked it against OpenAI's GPT-5.5, OpenAI's GPT-5.4 mini, and Anthropic's Claude Opus 4.6.
Using a one-parameter logistic (1PL) Item Response Theory model fitted across 16 benchmarks and 35 models, CAISI gave DeepSeek V4 Pro an estimated capability Elo of 800 ± 28. GPT-5.5 scored 1,260, Claude Opus 4.6 scored 999, and GPT-5.4 mini scored 749. CAISI's report explicitly states that "DeepSeek V4 scores better on DeepSeek's self-reported evaluations than on CAISI evaluations," flagging that the lab's published numbers cherry-pick benchmarks where the gap is narrower.
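For readers unfamiliar with IRT-based scoring, the idea is to treat every (model, benchmark item) outcome as a Bernoulli trial whose success probability depends on the gap between a latent model ability and a latent item difficulty, then fit both by maximum likelihood. The sketch below is a toy illustration of that idea on random stand-in data, not CAISI's pipeline; the Elo rescaling constants are an assumption.

```python
import numpy as np

# Minimal 1PL (Rasch) fit: P(model i solves item j) = sigmoid(ability_i - difficulty_j).
# Illustrative only; CAISI's actual data, weighting, and Elo scaling are not yet public.

rng = np.random.default_rng(0)
n_models, n_benchmarks = 35, 16                     # the dimensions cited in the report
outcomes = rng.integers(0, 2, (n_models, n_benchmarks)).astype(float)  # stand-in 0/1 results

ability = np.zeros(n_models)                        # latent capability per model
difficulty = np.zeros(n_benchmarks)                 # latent difficulty per benchmark

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.5
for _ in range(500):                                # gradient ascent on the Bernoulli log-likelihood
    resid = outcomes - sigmoid(ability[:, None] - difficulty[None, :])
    ability += lr * resid.mean(axis=1)
    difficulty -= lr * resid.mean(axis=0)
    ability -= ability.mean()                       # abilities are only identified up to a shift

elo = 1000 + ability * 400 / np.log(10)             # map natural-log logits onto an Elo-style scale
print(np.round(np.sort(elo)[-3:]))                  # top three models on the toy data
```

Fitting abilities and difficulties jointly is what lets an evaluator place models that were run on different benchmark subsets onto one comparable scale, which a raw score average cannot do.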
Key Details
- Capability gap of ~8 months — On CAISI's held-out PortBench software-engineering benchmark, V4 Pro scored 44% versus 78% for GPT-5.5. On ARC-AGI-2 semi-private, V4 Pro hit 46% versus 79% for GPT-5.5 and 63% for Opus 4.6.
- Wins on math — DeepSeek V4 Pro tied GPT-5.5 on PUMaC 2024 at 96% and beat Opus 4.6 on OTIS-AIME-2025 with a score of 97% versus 92%.
- Cost edge — At developer-list pricing of $1.74 per 1M input tokens and $3.48 per 1M output tokens, V4 Pro was cheaper than GPT-5.4 mini ($0.75 per 1M input, $4.50 per 1M output) on 5 of the 7 CAISI benchmarks where end-to-end task cost was measured, ranging from 53% cheaper to 41% more expensive. Because V4 Pro's input rate is higher but its output rate is lower, which model wins depends on a task's input/output token mix; see the worked example after this list.
- Cyber lag — On CAISI's CTF-Archive-Diamond benchmark of 285 capture-the-flag challenges, V4 Pro scored just 32% (imputed from a subset), tying GPT-5.4 mini and trailing Opus 4.6's 46%.
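Why a model with a higher input rate can still win on end-to-end cost: a task's cost is a weighted sum of its input and output tokens, so output-heavy tasks favor V4 Pro's lower output rate, while input-heavy tasks favor GPT-5.4 mini's lower input rate. Below is a minimal Python sketch using the list prices above; the token counts are invented for illustration and are not from the CAISI report.

```python
# Per-1M-token list prices quoted in the article (USD).
PRICES = {
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
    "gpt-5.4-mini":    {"input": 0.75, "output": 4.50},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """End-to-end cost of one task at per-1M-token pricing."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical output-heavy agentic task: modest prompt, long reasoning trace.
for model in PRICES:
    print(f"{model}: ${task_cost(model, input_tokens=40_000, output_tokens=120_000):.4f}")
# deepseek-v4-pro: $0.4872
# gpt-5.4-mini:    $0.5700
```

At these rates the break-even point sits near a 1:1 token mix (V4 Pro's $0.99/1M input premium roughly offsets its $1.02/1M output discount), so tasks with long generated traces tend to land on V4 Pro's side of the ledger.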
What Developers and Researchers Are Saying
On Hacker News, the response split sharply along the lines that have come to characterize every DeepSeek release. Supporters posted reproductions showing V4 Pro "at frontier level" on advanced academic problems for "a fraction of the cost," with several developers reporting they switched from V4 Pro to V4 Flash without noticing a quality drop on their workloads. Skeptics pointed to a top-voted thread titled "NIST's DeepSeek 'evaluation' is a hit piece," arguing CAISI's choice of held-out benchmarks like PortBench is structurally tilted against open-weight models that cannot prepare for non-public evals. CAISI counters that it "pre-committed to its overall benchmark suite" before seeing any V4 Pro results.
Researchers also noted that CAISI ran its agentic evaluations with Inspect's ReAct agent under a 1M weighted-token budget for PortBench and CTF-Archive-Diamond and a 500K budget for SWE-Bench Verified; some HN commenters argue those settings understate what V4 Pro can do when paired with a more aggressive scaffold.
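For context, a weighted-token budget caps how much total generation an agent may spend across a whole episode; once it runs out, the attempt is scored as a failure no matter how close the agent was. The loop below is a self-contained toy in the shape of a ReAct agent, not Inspect's implementation; call_model and run_tool are stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str       # e.g. "bash" or "submit"
    argument: str   # command to run, or the final answer

def call_model(transcript: list[str]) -> tuple[str, Action, int]:
    """Toy stand-in for an LLM call; returns (thought, action, tokens_spent)."""
    if len(transcript) >= 7:  # after two tool rounds, the toy model answers
        return "I have the flag.", Action("submit", "flag{toy}"), 900
    return "Inspect the binary.", Action("bash", "strings ./chal"), 1200

def run_tool(action: Action) -> str:
    return f"(output of `{action.argument}`)"  # toy tool execution

def react_episode(prompt: str, budget: int) -> str | None:
    spent, transcript = 0, [prompt]
    while spent < budget:                      # hard weighted-token budget
        thought, action, tokens = call_model(transcript)
        spent += tokens
        if action.name == "submit":
            return action.argument             # episode succeeds
        transcript += [thought, action.name, run_tool(action)]
    return None                                # budget exhausted: scored as a failure

print(react_episode("Solve the CTF challenge.", budget=2_000))      # None
print(react_episode("Solve the CTF challenge.", budget=1_000_000))  # flag{toy}
```

A more aggressive scaffold might spend the same budget differently (shorter thoughts, parallel attempts, earlier submission), which is the commenters' point: the budget interacts with the scaffold, not just the model.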
What This Means for Developers
If your workload is bounded by token cost rather than peak capability, DeepSeek V4 Pro is now the default open-weight baseline to compare against: it came in cheaper than comparable U.S. models on end-to-end cost for most of the tasks CAISI measured, and the MIT license means you can self-host or fine-tune without a usage agreement. For security-sensitive workflows, however, CAISI's earlier September 2025 evaluation found that DeepSeek models complied with public jailbreak prompts in 95–100% of tests versus 5–12% for U.S. models, and the V4 Pro report does not say that gap has closed.
Teams running agentic coding workflows should note that on SWE-Bench Verified, V4 Pro scored 74%, five points behind Opus 4.6 (79%) and seven behind GPT-5.5 (81%). For most repository-level tasks the practical difference is small, but for cyber and abstract-reasoning tasks the gap is large enough that a routing layer (e.g. sending code to V4 Pro and ARC-style problems to GPT-5.5) likely outperforms either model alone; a minimal sketch follows.
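A routing layer can be as simple as a lookup from task class to model id. The sketch below only encodes the benchmark gaps reported above; the task taxonomy and model ids are illustrative, not a CAISI recommendation, and a production router would also weigh cost, latency, and fallbacks.

```python
# Route each request to the model that is strongest (or cheapest at rough parity)
# for its task class, per the CAISI numbers discussed in this article.
ROUTES = {
    "code": "deepseek-v4-pro",        # SWE-Bench Verified within 5-7 points of frontier, far cheaper
    "math": "deepseek-v4-pro",        # tied or ahead on PUMaC 2024 / OTIS-AIME-2025
    "abstract-reasoning": "gpt-5.5",  # ARC-AGI-2: 79% vs 46%
    "cyber": "claude-opus-4.6",       # CTF-Archive-Diamond: 46% vs 32%
}

def route(task_type: str, default: str = "gpt-5.5") -> str:
    """Pick a model id for a task class; fall back to the frontier default."""
    return ROUTES.get(task_type, default)

print(route("code"))                # deepseek-v4-pro
print(route("abstract-reasoning"))  # gpt-5.5
```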
What's Next
CAISI says it plans to publish a fuller methodology writeup of its IRT-based capability scoring, plus a public release of its PortBench evaluation. DeepSeek has not yet responded to the report on its official channels but has previously contested CAISI's benchmark choices. The next CAISI evaluation is expected to cover Alibaba's Qwen 3.6-Plus, which debuted on Fireworks AI in late April 2026.
Sources
- CAISI Evaluation of DeepSeek V4 Pro — NIST (May 1, 2026) — primary source, full benchmark tables and methodology.
- DeepSeek-V4-Pro model card — Hugging Face — official model release with weights and config.
- DeepSeek V4 Preview Release — DeepSeek API Docs — DeepSeek's own announcement of V4.
- CAISI Evaluation of DeepSeek AI Models — NIST (September 2025) — earlier CAISI report covering V3.1 and R1, including jailbreak findings.
- DeepSeek V4—almost on the frontier — Hacker News — community discussion of V4 Pro release and benchmarks.
- DeepSeek V4 Pro Benchmarks 2026 — BenchLM — independent benchmark scores cross-referenced against CAISI's findings.