NVIDIA Launches Nemotron 3 Nano Omni — Open 30B MoE Multimodal Model With 9× Throughput (April 28, 2026)
NVIDIA on April 28, 2026 released Nemotron 3 Nano Omni, a 30-billion-parameter open-weight multimodal model that unifies text, image, audio and video understanding inside a single hybrid mixture-of-experts checkpoint and claims up to 9× higher throughput than other open omni models at the same level of interactivity. Weights and a permissive commercial license are available immediately on Hugging Face, on OpenRouter as a free endpoint, and on build.nvidia.com as an NVIDIA NIM microservice.
What Happened
Nemotron 3 Nano Omni is the omni-modal flagship of NVIDIA’s newly expanded Nemotron 3 Nano family. The architecture is a 30B-A3B hybrid MoE that interleaves 23 Mamba-2 layers, 23 MoE layers and 6 grouped-query attention layers across 52 total layers — the Mamba-2 layers handle long sequences in linear time, while the six attention layers provide the global context the model needs to keep accuracy competitive. NVIDIA published a full technical report alongside the model card.
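The 23/23/6 split across 52 layers can be sketched as a simple layer plan. NVIDIA does not publish the exact interleaving order in this summary, so the even spacing of attention layers below is an illustrative assumption, not the published layout:

```python
# Sketch of a 52-layer hybrid stack: 23 Mamba-2 layers, 23 MoE layers,
# and 6 grouped-query attention (GQA) layers. The spacing of the GQA
# layers is an assumption; only the layer counts come from the article.

def build_layer_plan(n_mamba=23, n_moe=23, n_attn=6):
    total = n_mamba + n_moe + n_attn  # 52 for Nemotron 3 Nano Omni
    # Spread the sparse global-attention layers evenly through the stack
    # (assumed), alternating Mamba-2 / MoE in the remaining slots.
    attn_slots = {round((i + 1) * total / (n_attn + 1)) for i in range(n_attn)}
    alternating = iter(["mamba2", "moe"] * max(n_mamba, n_moe))
    return [
        "gqa_attention" if idx in attn_slots else next(alternating)
        for idx in range(total)
    ]

plan = build_layer_plan()
assert len(plan) == 52 and plan.count("gqa_attention") == 6
```

The design intent the article describes maps directly onto this shape: the many Mamba-2 layers keep per-token cost linear in sequence length, while the six full-attention layers periodically restore global context.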
NVIDIA paired the launch with a fully open evaluation recipe and an OpenRouter free tier so developers can benchmark the model end-to-end without paying for inference. Industry partners Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir and Pyler are already shipping with the model, and Dell Technologies, Docusign, Infosys, K-Dense, Lila, Oracle and Zefr are evaluating it for production rollouts.
Key Details
- Throughput: Up to 9× higher than other open omni models at matched interactivity, and 3.3× higher than Qwen3-30B-A3B on a single H200 GPU at an 8K input / 16K output configuration. Single-stream multimodal reasoning is 2.9× faster than alternatives, per NVIDIA’s technical report.
- Coding and reasoning: 68.3% on LiveCodeBench v6, ahead of Qwen3 (66.0%) and GPT-OSS (61.0%).
- Long context: 92.9% on RULER-100 at 256K tokens and 86.3% at the full 1M-token context window — both ahead of Qwen3-30B-A3B (89.4% and 77.5%).
- Knowledge: 78.3% on MMLU-Pro, slightly behind Qwen3-30B’s 80.9% — the explicit accuracy-vs-throughput trade-off NVIDIA chose.
- Multimodal leaderboards: Tops MMLongBench-Doc, OCRBench v2, WorldSense, DailyOmni and VoiceBench. On MediaPerf video tagging it processes 9.91 hours of video per hour of compute, versus roughly 3.8 h/h for Qwen3-VL.
- Local deployment: Runs on a single H200 in BF16 and on consumer GPUs via Unsloth’s quantized builds, making it the first omni-class model to combine MoE efficiency, open weights and single-GPU local inference.
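The video-tagging rates above translate directly into compute budgets. A back-of-the-envelope comparison, using the reported MediaPerf rates and a hypothetical 1,000-hour workload (the workload size is an assumption for illustration):

```python
# Convert the reported MediaPerf rates (hours of video tagged per hour
# of compute) into GPU-hours for a workload. Rates come from the article;
# the 1,000-hour workload is a hypothetical example.

def gpu_hours(video_hours, rate_h_per_h):
    """GPU-hours needed to tag `video_hours` of footage at the given rate."""
    return video_hours / rate_h_per_h

workload = 1_000                        # hypothetical: 1,000 h of video
nemotron = gpu_hours(workload, 9.91)    # ≈ 100.9 GPU-hours
qwen3_vl = gpu_hours(workload, 3.8)     # ≈ 263.2 GPU-hours
savings = 1 - nemotron / qwen3_vl       # ≈ 62% fewer GPU-hours
```

At these rates the throughput gap compounds into roughly 62% fewer GPU-hours on the same tagging workload, which is where the production-cost argument in this article comes from.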
What Developers Are Saying
Reaction on Hacker News has been mostly positive. The Nemotron 3 family thread reached 257 points, and a high-volume practitioner who runs “billions of tokens every month” wrote that the small NVIDIA Nemotron models have shown some of the highest task compliance, understanding and tool-call success rates they have measured against the current 125B-and-under cohort. Several commenters singled out the OpenRouter free tier as “quite generous” for paying OpenRouter customers, and Unsloth shipped a Day-0 quantization guide that put the model on a single 24 GB consumer GPU within hours of release.
Critics flag the same trade-off NVIDIA published openly: Qwen3-30B retains a small lead on text-only knowledge benchmarks like MMLU-Pro, so teams optimizing strictly for written-knowledge accuracy should benchmark on their own workloads. The MoE routing is also still GPU-friendly first — CPU-only inference is not the target deployment.
What This Means for Developers
Nemotron 3 Nano Omni closes a gap that has bothered the open-weight community for most of 2025: there was no open model that combined unified omni-modal perception, MoE efficiency at the 30B class, a permissive commercial license and viable single-GPU local inference. NVIDIA now ships all four in one checkpoint. For agent builders, the practical impact is two-fold — the long-context numbers (86.3% at 1M tokens) make it credible for document- and meeting-heavy workflows, and the 9× throughput gap matters for production cost. Teams already on Qwen3-Omni or other open omni backbones should plan a benchmarking sprint this week against the OpenRouter free tier before committing to next-quarter capacity.
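A benchmarking sprint against the free endpoint can start from OpenRouter's OpenAI-compatible chat completions API. The model slug below is a guess at the free-tier identifier (verify the exact slug on openrouter.ai); the request shape follows the standard OpenAI-style schema:

```python
# Minimal sketch of a benchmarking request against OpenRouter's
# OpenAI-compatible chat completions endpoint. MODEL is an assumed slug;
# check OpenRouter's model listing for the real identifier.
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nvidia/nemotron-3-nano-omni:free"  # assumed slug, verify first

def build_request(prompt, max_tokens=512):
    """Build an OpenAI-style chat completion payload for one eval prompt."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = json.dumps(build_request("Summarize this meeting transcript: ..."))
# Send with your HTTP client of choice, e.g.:
#   requests.post(OPENROUTER_URL,
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 data=payload)
```

Running the same prompt set through an existing Qwen3-Omni deployment and diffing task success rates is the cheapest way to validate the trade-offs described above before committing capacity.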
What’s Next
NVIDIA has published the model on Hugging Face, on OpenRouter as a free endpoint, and as an NVIDIA NIM microservice on build.nvidia.com. Quantized GGUF builds are available via Unsloth, and the full evaluation recipe was open-sourced alongside the launch so third-party benchmarks can be reproduced. NVIDIA has signaled that additional Nemotron 3 family releases will follow in the coming weeks; watch the Nemotron research lab page for updates.
Sources
- NVIDIA Blog — Nemotron 3 Nano Omni launch announcement — primary source from NVIDIA
- NVIDIA Technical Blog — architecture and benchmarks
- Hugging Face blog — long-context multimodal intelligence
- Nemotron 3 Nano Omni technical report (PDF)
- Hacker News — Nemotron 3 family discussion thread
- Artificial Analysis — head-to-head benchmark with Qwen3
- Unsloth — local quantization and run guide