NVIDIA Launches Nemotron 3 Nano Omni — Open 30B MoE Multimodal Model With 9× Throughput (April 28, 2026)
NVIDIA on April 28, 2026 released Nemotron 3 Nano Omni, a 30-billion-parameter open-weight multimodal model that unifies text, image, audio and video understanding inside a single hybrid mixture-of-experts checkpoint and claims up to 9× higher throughput than other open omni models at the same level of interactivity. Weights and a permissive commercial license are available immediately on Hugging Face, on OpenRouter as a free endpoint, and on build.nvidia.com as an NVIDIA NIM microservice.
What Happened
Nemotron 3 Nano Omni is the omni-modal flagship of NVIDIA’s newly expanded Nemotron 3 Nano family. The architecture is a 30B-A3B hybrid MoE that interleaves 23 Mamba-2 layers, 23 MoE layers and 6 grouped-query attention layers across 52 total layers — the Mamba-2 layers handle long sequences in linear time, while the six attention layers provide the global context the model needs to keep accuracy competitive. NVIDIA published a full technical report alongside the model card.
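The 23/23/6 split across 52 layers can be sketched as a simple layer plan. NVIDIA does not publish the exact interleaving order in this summary, so the even spacing of attention layers below is an illustrative assumption, not the published layout:

```python
# Sketch of a 52-layer hybrid stack: 23 Mamba-2 layers, 23 MoE layers,
# and 6 grouped-query attention (GQA) layers. The spacing of the GQA
# layers is an assumption; only the layer counts come from the article.

def build_layer_plan(n_mamba=23, n_moe=23, n_attn=6):
    total = n_mamba + n_moe + n_attn  # 52 for Nemotron 3 Nano Omni
    # Spread the sparse global-attention layers evenly through the stack
    # (assumed), alternating Mamba-2 / MoE in the remaining slots.
    attn_slots = {round((i + 1) * total / (n_attn + 1)) for i in range(n_attn)}
    alternating = iter(["mamba2", "moe"] * max(n_mamba, n_moe))
    return [
        "gqa_attention" if idx in attn_slots else next(alternating)
        for idx in range(total)
    ]

plan = build_layer_plan()
assert len(plan) == 52 and plan.count("gqa_attention") == 6
```

The design intent the article describes maps directly onto this shape: the many Mamba-2 layers keep per-token cost linear in sequence length, while the six full-attention layers periodically restore global context.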
NVIDIA paired the launch with a fully open evaluation recipe and an OpenRouter free tier so developers can benchmark the model end-to-end without paying for inference. Industry partners Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir and Pyler are already shipping with the model, and Dell Technologies, Docusign, Infosys, K-Dense, Lila, Oracle and Zefr are evaluating it for production rollouts.
Key Details
- Throughput: Up to 9× higher than other open omni models at matched interactivity, and 3.3× higher than Qwen3-30B-A3B on a single H200 GPU at an 8K input / 16K output configuration. Single-stream multimodal reasoning is 2.9× faster than alternatives, per NVIDIA’s technical report.
- Coding and reasoning: 68.3% on LiveCodeBench v6, ahead of Qwen3 (66.0%) and GPT-OSS (61.0%).
- Long context: 92.9% on RULER-100 at 256K tokens and 86.3% at the full 1M-token context window — both ahead of Qwen3-30B-A3B (89.4% and 77.5%).
- Knowledge: 78.3% on MMLU-Pro, slightly behind Qwen3-30B’s 80.9% — the explicit accuracy-vs-throughput trade-off NVIDIA chose.
- Multimodal leaderboards: Tops MMLongBench-Doc, OCRBench v2, WorldSense, DailyOmni and VoiceBench. On MediaPerf video tagging it processes 9.91 hours of video per hour of compute, versus roughly 3.8 h/h for Qwen3-VL.
- Local deployment: Runs on a single H200 in BF16 and on consumer GPUs via Unsloth’s quantized builds, making it the first omni-class model to combine MoE efficiency, open weights and single-GPU local inference.
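The video-tagging rates above translate directly into compute budgets. A back-of-the-envelope comparison, using the reported MediaPerf rates and a hypothetical 1,000-hour workload (the workload size is an assumption for illustration):

```python
# Convert the reported MediaPerf rates (hours of video tagged per hour
# of compute) into GPU-hours for a workload. Rates come from the article;
# the 1,000-hour workload is a hypothetical example.

def gpu_hours(video_hours, rate_h_per_h):
    """GPU-hours needed to tag `video_hours` of footage at the given rate."""
    return video_hours / rate_h_per_h

workload = 1_000                        # hypothetical: 1,000 h of video
nemotron = gpu_hours(workload, 9.91)    # ≈ 100.9 GPU-hours
qwen3_vl = gpu_hours(workload, 3.8)     # ≈ 263.2 GPU-hours
savings = 1 - nemotron / qwen3_vl       # ≈ 62% fewer GPU-hours
```

At these rates the throughput gap compounds into roughly 62% fewer GPU-hours on the same tagging workload, which is where the production-cost argument in this article comes from.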
What Developers Are Saying
Reaction on Hacker News has been mostly positive. The Nemotron 3 family thread reached 257 points, and a high-volume practitioner who runs “billions of tokens every month” wrote that the small NVIDIA Nemotron models have shown some of the highest task compliance, understanding and tool-call success rates they have measured against the current 125B-and-under cohort. Several commenters singled out the OpenRouter free tier as “quite generous” for paying OpenRouter customers, and Unsloth shipped a Day-0 quantization guide that put the model on a single 24 GB consumer GPU within hours of release.
Critics flag the same trade-off NVIDIA published openly: Qwen3-30B retains a small lead on text-only knowledge benchmarks like MMLU-Pro, so teams optimizing strictly for written-knowledge accuracy should benchmark on their own workloads. The MoE routing is also still GPU-friendly first — CPU-only inference is not the target deployment.
What This Means for Developers
Nemotron 3 Nano Omni closes a gap that has bothered the open-weight community for most of 2025: there was no open model that combined unified omni-modal perception, MoE efficiency at the 30B class, a permissive commercial license and viable single-GPU local inference. NVIDIA now ships all four in one checkpoint. For agent builders, the practical impact is two-fold — the long-context numbers (86.3% at 1M tokens) make it credible for document- and meeting-heavy workflows, and the 9× throughput gap matters for production cost. Teams already on Qwen3-Omni or other open omni backbones should plan a benchmarking sprint this week against the OpenRouter free tier before committing to next-quarter capacity.
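A benchmarking sprint against the free endpoint can start from OpenRouter's OpenAI-compatible chat completions API. The model slug below is a guess at the free-tier identifier (verify the exact slug on openrouter.ai); the request shape follows the standard OpenAI-style schema:

```python
# Minimal sketch of a benchmarking request against OpenRouter's
# OpenAI-compatible chat completions endpoint. MODEL is an assumed slug;
# check OpenRouter's model listing for the real identifier.
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nvidia/nemotron-3-nano-omni:free"  # assumed slug, verify first

def build_request(prompt, max_tokens=512):
    """Build an OpenAI-style chat completion payload for one eval prompt."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = json.dumps(build_request("Summarize this meeting transcript: ..."))
# Send with your HTTP client of choice, e.g.:
#   requests.post(OPENROUTER_URL,
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 data=payload)
```

Running the same prompt set through an existing Qwen3-Omni deployment and diffing task success rates is the cheapest way to validate the trade-offs described above before committing capacity.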
What’s Next
NVIDIA has published the model on Hugging Face, on OpenRouter as a free endpoint, and as an NVIDIA NIM microservice on build.nvidia.com. Quantized GGUF builds are available via Unsloth, and the full evaluation recipe was open-sourced alongside the launch so third-party benchmarks can be reproduced. NVIDIA has signaled that additional Nemotron 3 family releases will follow in the coming weeks; watch the Nemotron research lab page for updates.
Sources
- NVIDIA Blog — Nemotron 3 Nano Omni launch announcement — primary source from NVIDIA
- NVIDIA Technical Blog — architecture and benchmarks
- Hugging Face blog — long-context multimodal intelligence
- Nemotron 3 Nano Omni technical report (PDF)
- Hacker News — Nemotron 3 family discussion thread
- Artificial Analysis — head-to-head benchmark with Qwen3
- Unsloth — local quantization and run guide