NVIDIA Nemotron 3 Ultra: Hybrid SSM/MoE Architecture Launch and Benchmarks · history

Version 1

2026-06-05 02:17 UTC · 65 items

What

NVIDIA launched Nemotron 3 Ultra, a 550B-total-parameter (55B active) hybrid Mamba SSM/MoE model, at Jensen Huang's Computex 2026 keynote and shipped it on June 4, 2026 [1][3]. The model is positioned as the most capable US open-weights model for long-running agentic workloads [17][2]. Independent benchmarks place it ahead of Qwen3.6-27B on Artificial Analysis's metric (47.7 vs 45.8) [5] and roughly 10x cheaper than GPT-5.5 for comparable code generation tasks [7]. Reception is split: enterprise integrators adopted it same-day [11][12], while skeptics question whether it meaningfully advances the open-source frontier beyond inference-speed optimization [14][15].

Why it matters

An open-weights model at this scale, with MoE-driven cost efficiency and an architecture specifically designed for long-context agentic tasks, lowers the cost barrier for enterprises that would otherwise depend on proprietary APIs. The 10x cost gap versus GPT-5.5 is large enough to reshape model-selection decisions if reliability concerns prove addressable.

Open questions

Does Nemotron 3 Ultra meet production reliability standards? CodeRabbit found it 'close to baseline' but raised a specific reliability caveat [9][10].
Will independent benchmark suites beyond NVIDIA's chosen GDPval-AA confirm its ranking claims? The metric was sourced from Artificial Analysis and used in the keynote itself [4], and BenchLM.ai had no data at launch [18].
Does the ~10x cost advantage over GPT-5.5 [7] hold across task types beyond code generation, and does latency stay competitive at scale?
Which additional cloud and enterprise platforms will integrate it beyond early adopters Glean and Nebius [11][12], and will xAI's explicit non-adoption [16] prove representative of a broader closed-ecosystem pattern?

Narrative

NVIDIA announced Nemotron 3 Ultra at Jensen Huang's Computex 2026 keynote and shipped the model on June 4, 2026. It has 550B total parameters with 55B active per token, a total-to-active ratio of 10:1 under its mixture-of-experts (MoE) routing, and uses a hybrid architecture combining Mamba-style state-space model (SSM) layers with a standard Transformer backbone [1][2]. The SSM component handles long sequences without the quadratic memory cost of full attention, which suits the architecture to extended agentic reasoning and multi-step tool use [2]. NVIDIA claims it is the most intelligent US open-weights model, and the Computex keynote slide used Artificial Analysis benchmark data to support that claim [3][4].
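The two quantitative points in the paragraph above can be illustrated with a toy sketch. It is not NVIDIA's implementation: the scalar recurrence, the coefficients `a` and `b`, and `STATE_DIM` are hypothetical values chosen only to show why a recurrent state stays fixed in size while a full attention score matrix grows with the square of the sequence length.

```python
# Toy illustration only; nothing here reflects Nemotron 3 Ultra's real layers.

def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token in an MoE model (in billions)."""
    return active_params_b / total_params_b


def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's full attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def ssm_state_entries(state_dim: int) -> int:
    """Entries in the recurrent state an SSM carries; independent of seq_len."""
    return state_dim


def ssm_scan(inputs, a=0.9, b=0.1):
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t, one step per token.

    Only the single value h is carried between steps, so the working state
    does not grow with the number of tokens processed.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


STATE_DIM = 16  # hypothetical state size, chosen for illustration

print("active fraction:", active_fraction(550, 55))
for n in (1_000, 10_000, 100_000):
    print(n, attention_score_entries(n), ssm_state_entries(STATE_DIM))
```

Growing the sequence 10x multiplies the attention score entries by 100, while the SSM state count stays at `STATE_DIM`. Real hybrid models mix both layer types, so actual memory use falls somewhere between these two extremes.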

Early performance data is mixed but largely favorable on cost grounds. Artificial Analysis ran independent tests on launch day and found Nemotron 3 Ultra scored 47.7 on their benchmark versus Qwen3.6-27B's 45.8, with faster task completion [5][6]. A side-by-side code generation comparison published the next day showed it producing output nearly identical to GPT-5.5 on an HTML5 physics simulation task at $0.051 versus GPT-5.5's $0.57 for roughly the same token count (~11k tokens each), a roughly 10x cost difference (about 11x on the quoted prices) [7]. On agent-specific benchmarks, one observer cited a score of 89.9% versus Claude Opus 4.7's 93.5% [8], and CodeRabbit's evaluation found the model close to baseline but flagged reliability as a specific concern for production use [9][10].
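The cost comparison above reduces to simple arithmetic, sketched here using only the figures quoted from [7]. The token count is the approximate "~11k" from the comparison, so the per-1k-token values are rough estimates, and the function names are illustrative.

```python
# Figures quoted in the comparison [7]; the token count is approximate.
NEMOTRON_USD = 0.051
GPT55_USD = 0.57
APPROX_TOKENS = 11_000  # "~11k tokens each"


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more the expensive run cost than the cheap one."""
    return expensive_usd / cheap_usd


def usd_per_1k_tokens(total_usd: float, tokens: int) -> float:
    """Average price per 1,000 tokens for a single run."""
    return total_usd / tokens * 1000


ratio = cost_ratio(GPT55_USD, NEMOTRON_USD)
print(f"cost ratio: {ratio:.1f}x")  # cost ratio: 11.2x
print(f"Nemotron 3 Ultra: ${usd_per_1k_tokens(NEMOTRON_USD, APPROX_TOKENS):.4f} per 1k tokens")
print(f"GPT-5.5:          ${usd_per_1k_tokens(GPT55_USD, APPROX_TOKENS):.4f} per 1k tokens")
```

This is a single task on a single day, so the ratio is one data point and not a general pricing comparison.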

Reception divided into two camps. Enterprise integrators, including Glean, which announced same-day support, and Nebius, which made it available on its Token Factory, framed the model as expanding model choice for agentic workloads [11][12][13]. Skeptics argued that the open-source ecosystem has matured beyond raw scale and that Nemotron models historically rank around fifth among open-source alternatives while benefiting from NVIDIA-optimized inference tricks [14][15]. The benchmark selection itself drew scrutiny: the keynote's headline claim rests on GDPval-AA, a metric NVIDIA selected from Artificial Analysis, so the launch narrative leans on a single vendor-chosen benchmark [4]. Grok/xAI separately clarified that it is not adopting Nemotron 3 Ultra, as xAI trains its own models end-to-end [16].

Timeline

2026-05-29: Social media attention builds around an upcoming NVIDIA model release before any official announcement. [21][22]
2026-06-01: Rohan Paul reports that Nemotron 3 Ultra will ship within days and summarizes its hybrid SSM/MoE architecture and long-context advantages. [2]
2026-06-01: Skeptical commentary emerges: one commenter calls the 550B scale a 'compute flex,' and another argues Nemotron models typically finish ~5th in open-source rankings. [14][15]
2026-06-03: Artificial Analysis notes NVIDIA used its GDPval-AA benchmark in Jensen Huang's Computex 2026 keynote to support Nemotron 3 Ultra's performance claims. [4][3]
2026-06-04: Nemotron 3 Ultra officially ships; becomes available on Nebius Token Factory and through Kilocode. [12][23][24]
2026-06-04: Glean announces same-day support for Nemotron 3 Ultra in its enterprise AI platform. [11][13]
2026-06-04: Artificial Analysis runs independent benchmarks on launch day: Nemotron 3 Ultra scores 47.7 versus Qwen3.6-27B's 45.8 and completes tasks faster. [6][5]
2026-06-04: CodeRabbit publishes its evaluation: Nemotron 3 Ultra lands close to baseline with a specific reliability caveat for production use. [9][10]
2026-06-04: One observer cites an agent benchmark score of 89.9% for Nemotron 3 Ultra versus 93.5% for Claude Opus 4.7. [8]
2026-06-04: Grok/xAI denies it is adopting Nemotron 3 Ultra, confirming xAI builds and trains its own models end-to-end. [16]
2026-06-05: Side-by-side comparison of Nemotron 3 Ultra and GPT-5.5 on a code generation task shows near-identical output at roughly 10x lower cost ($0.051 vs $0.57). [7]

Perspectives

NVIDIA (official)

Nemotron 3 Ultra is the most intelligent US open-weights model, built for long-running agentic workloads, running 5x faster at inference with reduced costs.

Evolution: Consistent with NVIDIA's positioning of the Nemotron line as enterprise-grade open-weights leaders; Ultra is the largest and most capable model in the series.

[17][19][20]

Artificial Analysis (@ArtificialAnlys)

Independent benchmark data confirms Nemotron 3 Ultra edges Qwen3.6-27B (47.7 vs 45.8) with faster task completion; their benchmark was used by NVIDIA in the Computex keynote, giving their methodology unusual prominence.

Evolution: Consistent empirical testing role; elevated to keynote-slide status by NVIDIA's citation.

[6][4][3]

Rohan Paul (@rohanpaul_ai)

Positive on the architectural rationale and the cost case: SSM enables long-context processing without attention bottlenecks, and the ~10x cost gap versus GPT-5.5 on code tasks makes it compelling for developers.

Evolution: Consistent; provided pre-release architecture summary and followed up with a direct GPT-5.5 price comparison the day after launch.

[2][7]

CodeRabbit (@coderabbitai)

Measured: Nemotron 3 Ultra performs close to baseline in their code review benchmark but reliability is flagged as a specific concern for production workloads.

Evolution: First evaluation in this thread; introduces the reliability angle that other assessments have not directly addressed.

[9][10]

Skeptics (ECLresearch, Aero)

Nemotron 3 Ultra is primarily a compute showcase — the open-source ecosystem has moved past raw scale, and Nemotron models historically rank around fifth while benefiting from NVIDIA-optimized inference to appear more competitive than they are.

Evolution: Consistent skeptical position, not updated by launch-day benchmark data.

[14][15]

Enterprise integrators (Glean, Nebius)

Enthusiastic same-day adoption; framing the model as expanding enterprise model choice for agentic applications.

Evolution: First appearance in thread; both announced support on launch day.

[11][12][13]

Grok/xAI (@grok)

Explicitly denied adopting Nemotron 3 Ultra; xAI builds and trains its own models end-to-end.

Evolution: Reactive clarification in response to a user query; first appearance in thread.

[16]

Tensions

NVIDIA claims Nemotron 3 Ultra is the most intelligent US open-weights model; skeptics argue Nemotron models historically land ~5th in open-source rankings and that the 550B scale is primarily a 'compute flex' that relies on NVIDIA-optimized inference to look competitive. [17][14][15]
NVIDIA's keynote benchmark (GDPval-AA, sourced from Artificial Analysis) is the metric driving the launch narrative; independent evaluators including CodeRabbit and an agent-benchmark observer show more modest or mixed results. [4][9][8]
On cost-efficiency, Rohan Paul's comparison shows ~10x savings versus GPT-5.5 on code generation; CodeRabbit's reliability caveat suggests those savings may come with production stability trade-offs. [7][9][10]
Nemotron 3 Ultra scores 89.9% on an agent benchmark versus Claude Opus 4.7's 93.5%, qualifying NVIDIA's 'most capable' framing against the closed-source frontier. [8]

Sources

[1] NVIDIA has announced Nemotron 3 Ultra at Jensen Huang's Computex 2026 keynote. At 550 billion total parameters (55 billi... — reactive:nvidia-nemotron-ultra (2026-06-04)
[2] Nemotron 3 Ultra will be available from Nvidia in few days. — Rohan Paul Twitter (2026-06-01)
[3] Jensen Huang’s keynote at Computex used Artificial Analysis benchmarks to communicate the performance of Nemotron 3 Ultr... — reactive:nvidia-nemotron-ultra (2026-06-03)
[4] NVIDIA used the GDPval-AA benchmark to promote the launch of NVIDIA Nemotron 3 Ultra https://t.co/N6e4ItzYev — reactive:nvidia-nemotron-ultra (2026-06-03)
[5] Not only did Nemotron 3 Ultra beat Qwen3.6-27b in raw AA benchmark (47.7 vs 45.8) but it also completed the tasks faster... — reactive:nvidia-nemotron-ultra (2026-06-04)
[6] Nemotron 3 Ultra was launched today, including a focus on low latency agentic performance. We tested it against peers un... — reactive:nvidia-nemotron-ultra (2026-06-04)
[7] Nemotron 3 Ultra vs GPT-5.5 on atomic[.]chat, a desktop app that runs LLMs locally. — Rohan Paul Twitter (2026-06-05)
[8] @turingou I have always firmly believed that fixed-workflow agents do not need SoTA; NVIDIA's Nemotron 3 Ultra scored 89.9% on an agent benchmark, and Opus 4.7 scored 93.5%. — reactive:nvidia-nemotron-ultra (2026-06-04)
[9] In our CodeRabbit benchmark, Nemotron 3 Ultra landed close to the baseline: — reactive:nvidia-nemotron-ultra (2026-06-04)
[10] The caveat: reliability. — reactive:nvidia-nemotron-ultra (2026-06-04)
[11] 🎉 We’re announcing support for @nvidia Nemotron Ultra 3! — reactive:nvidia-nemotron-ultra (2026-06-04)
[12] NVIDIA Nemotron™ 3 Ultra is now live on Nebius Token Factory. — reactive:nvidia-nemotron-ultra (2026-06-04)
[13] Excited to expand model choice in @Glean with @NVIDIA Nemotron 3 Ultra. — reactive:nvidia-nemotron-ultra (2026-06-04)
[14] @jun_song Solid skepticism. Nemotron 3 Ultra at 550B is a compute flex, but the open-weight ecosystem moved past "just b... — reactive:nvidia-nemotron-ultra (2026-06-01)
[15] @scaling01 Nemotron models are always released to be #5 on open-source while using the latest nvidia tricks to be fast a... — reactive:nvidia-nemotron-ultra (2026-06-01)
[16] @xiaosun86 @NVIDIAAI No, Grok isn't adopting Nemotron 3 Ultra. xAI builds and trains its own models end-to-end. Strong b... — reactive:nvidia-nemotron-ultra (2026-06-01)
[17] Nemotron 3 Ultra 550B (A55B) is the largest Nemotron 3 model to date, and the most intelligent US open weights model rig... — reactive:nvidia-nemotron-ultra (2026-06-01)
[18] Nemotron 3 Ultra Benchmarks: Data Coming Soon - BenchLM.ai — reactive:nvidia-nemotron-ultra
[19] NVIDIA ships Nemotron 3 Ultra, a 550B MoE model for long-running agents. It runs 5x faster at inference and reduces cost... — reactive:nvidia-nemotron-ultra (2026-06-04)
[20] Build Agentic AI with Multimodal Foundation Models - NVIDIA — reactive:nvidia-nemotron-ultra
[21] THIS IS THE FUTURE OF AI and it just dropped — reactive:nvidia-nemotron-ultra (2026-05-29)
[22] Day 2 of Trajectory!! — reactive:nvidia-nemotron-ultra (2026-05-29)
[23] Nemotron 3 ultra, Qwen 3.7 plus, laguna m.1, step 3.7 flash ALL FREE on kilocode. but which to use? BENCHMARK TEST TIME.... — reactive:nvidia-nemotron-ultra (2026-06-04)
[24] 550 billion parameters. Open weights. Ships today. — reactive:nvidia-nemotron-ultra (2026-06-04)