Hybrid MoE: The Nemotron 3 family of models uses a hybrid Mamba-Transformer MoE architecture to deliver best-in-class throughput with accuracy better than or on par with standard Transformers.
LatentMoE: Super and Ultra use LatentMoE, a novel hardware-aware expert design that improves accuracy.
Multi-Token Prediction: Super and Ultra incorporate MTP layers, which make long-form text generation more efficient and improve model quality.
NVFP4: Super and Ultra are trained with NVFP4.
Long Context: Nemotron 3 models support context lengths of up to 1M tokens.
Multi-environment Reinforcement Learning Post-training: Nemotron 3 models are trained using a diverse set of RL environments, which helps them achieve superior accuracy across a broad range of tasks.
Granular Reasoning Budget Control at Inference Time: Nemotron 3 models are trained to work with inference-time budget control.
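To illustrate why a mixture-of-experts layer activates only a fraction of its parameters for each token, here is a minimal sketch of generic top-k expert routing. It is not the Nemotron 3 router and does not model LatentMoE; the helper names `route_top_k` and `moe_forward`, the toy scalar experts, the logits, and the choice of `k=2` are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and renormalize their gate weights.

    Hypothetical helper showing generic top-k gating, not Nemotron 3's router.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    router_logits: Sequence[float],
    k: int = 2,
) -> float:
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts (assumption): expert i multiplies its input by i + 1.
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

selected = route_top_k(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
print(selected)  # two (expert_index, weight) pairs; the other six experts never run
print(output)
```

Only the selected experts are evaluated, so the compute per token scales with `k` rather than with the total number of experts. That is the general mechanism behind the gap between active and total parameter counts in MoE models.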
Nemotron 3 Nano
Nemotron 3 Nano is a model with 31.6B total parameters, of which 3.2B are active per forward pass (3.6B including embeddings). It achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass.
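As a quick worked calculation, the parameter counts quoted above imply that roughly a tenth of the model is active on each forward pass. The sketch below only divides the stated figures; the function name `active_fraction` is hypothetical.

```python
# Parameter counts quoted in the text, in billions.
TOTAL_PARAMS_B = 31.6
ACTIVE_PARAMS_B = 3.2
ACTIVE_WITH_EMBEDDINGS_B = 3.6


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters used in a single forward pass."""
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
frac_with_emb = active_fraction(ACTIVE_WITH_EMBEDDINGS_B, TOTAL_PARAMS_B)

print(f"active share: {frac:.1%}")                       # 10.1%
print(f"active share incl. embeddings: {frac_with_emb:.1%}")  # 11.4%
```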
Key highlights:
More accurate than GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507 on popular benchmarks spanning different categories.
On the 8K input / 16K output setting with a single H200, Nemotron 3 Nano provides inference throughput that is 3.3x higher than Qwen3-30B-A3B and 2.2x higher than GPT-OSS-20B.
Supports context lengths of up to 1M tokens while outperforming both GPT-OSS-20B and Qwen3-30B-A3B-Instruct-2507 on RULER across different context lengths.
We are releasing the model weights, training recipe, and all the data for which we hold redistribution rights.
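The throughput multipliers in the highlights can be put on a common scale. The sketch below normalizes Nemotron 3 Nano to 1.0 and derives what the two stated ratios imply about the other two models relative to each other. It uses only the 3.3x and 2.2x figures from the text and assumes both were measured under the same 8K input / 16K output, single-H200 setting; the function name `relative_throughput` is hypothetical.

```python
from typing import Dict

# Speedups of Nemotron 3 Nano over each baseline, as quoted in the text.
SPEEDUP_OVER: Dict[str, float] = {
    "Qwen3-30B-A3B": 3.3,
    "GPT-OSS-20B": 2.2,
}


def relative_throughput(speedups: Dict[str, float]) -> Dict[str, float]:
    """Express each model's throughput as a fraction of Nemotron 3 Nano's."""
    rel = {"Nemotron 3 Nano": 1.0}
    for name, speedup in speedups.items():
        rel[name] = 1.0 / speedup
    return rel


rel = relative_throughput(SPEEDUP_OVER)
# Ratio implied by the two quoted figures (derived here, not reported in the source).
gpt_oss_vs_qwen = rel["GPT-OSS-20B"] / rel["Qwen3-30B-A3B"]

for name, value in rel.items():
    print(f"{name}: {value:.2f}")
print(f"GPT-OSS-20B vs Qwen3-30B-A3B: {gpt_oss_vs_qwen:.2f}x")
```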
Open Source
Along with the Nemotron 3 white paper and the Nemotron 3 Nano technical report, we are releasing the following:
Checkpoints:
Nemotron 3 Nano 30B-A3B FP8: the final post-trained, FP8-quantized Nano model
Nemotron 3 Nano 30B-A3B BF16: the post-trained Nano model
Nemotron 3 Nano 30B-A3B Base BF16: the pre-trained base Nano model
Qwen-3-Nemotron-235B-A22B-GenRM: the GenRM used for RLHF
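For a rough sense of what the FP8 and BF16 Nano checkpoints mean for weight storage, the sketch below estimates weight size from the 31.6B total parameter count. It assumes every parameter is stored at the checkpoint's nominal precision (2 bytes for BF16, 1 byte for FP8), which is a simplification: quantized checkpoints commonly keep some tensors in higher precision and add scaling factors, and the estimate ignores the KV cache and activations. The function name `weight_gb` is hypothetical.

```python
# Bytes needed per parameter at each nominal precision.
BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}

# Total parameter count quoted in the text for Nemotron 3 Nano.
TOTAL_PARAMS = 31.6e9


def weight_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), assuming uniform precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, "BF16")
fp8_gb = weight_gb(TOTAL_PARAMS, "FP8")

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"FP8 weights:  ~{fp8_gb:.1f} GB")
```

Under these assumptions the FP8 checkpoint needs about half the weight memory of the BF16 one.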
Data:
Nemotron-CC-v2.1: 2.5 trillion new English tokens from Common Crawl, including curated data from 3 recent snapshots, synthetic rephrasing, and translation to English from other languages.
Nemotron-CC-Code-v1: A pretraining dataset consisting of 428 billion high-quality code tokens obtained by processing Common Crawl code pages with the Lynx + LLM pipeline from Nemotron-CC-Math-v1. The pipeline preserves equations and code, standardizes math equations to LaTeX, and removes noise.
Nemotron-Pretraining-Code-v2: A refresh of curated GitHub code references with multi-stage filtering, deduplication, and quality filters, plus large-scale synthetic code data.
Nemotron-Pretraining-Specialized-v1: Collection of synthetic datasets for specialized areas like STEM reasoning and scientific coding.
Nemotron-SFT-Data: Collection of new Nemotron 3 Nano SFT datasets.
Nemotron-RL-Data: Collection of new Nemotron 3 Nano RL datasets.