Removing the Guesswork from Disaggregated Serving | NVIDIA Technical Blog
nvidia_dev_blog | 09.03.2026 16:00
Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving can be an overwhelming engineering problem. The ideal configuration for any given workload (such as hardware, parallelism, and prefill/decode split) resides in a massive, multi-dimensional search space that is impossible to explore manually or through exhaustive testing. AIConfigurator, an open source tool for the NVIDIA Dynamo AI serving stack, is intended to cut through this complexity and get you to an optimal deployment in minutes.
The core benefit of AIConfigurator is that you don’t need to run every possible configuration on real hardware to predict which one will perform best. Instead, it decomposes LLM inference into its constituent operations and measures each one in isolation on the target GPU. AIConfigurator can then reassemble those measurements to estimate the end-to-end performance of any configuration, all without occupying a single GPU-hour at search time.
This blog provides a quick overview of how AIConfigurator works, how to use it with Dynamo, and how ecosystem contributors such as Alibaba and Mooncake are helping extend this open source project to all frameworks.
Using AIConfigurator to configure disaggregated serving
With AIConfigurator, the latency estimate for each operation—including General Matrix Multiplications (GEMM), attention, communication, and mixture-of-experts (MoE) dispatch—is backed by real kernel measurements collected on the target hardware. The collector toolchain benchmarks every primitive across supported quantization modes, batch sizes, sequence lengths, and GPU counts, and logs results to a silicon-calibrated performance database. When collected data isn’t available for a new model or GPU, AIConfigurator falls back to speed-of-light roofline estimates with empirical correction factors, giving usable recommendations even before the model has been empirically profiled.
On top of this estimation layer, AIConfigurator models continuous batching for aggregated serving, rate-matches prefill and decode worker pools for disaggregated serving, and handles MoE-specific concerns like expert parallelism and token routing skew. Rather than returning a single answer, it computes the Pareto frontier across all evaluated configurations, showing the throughput-vs-latency tradeoff for both aggregated and disaggregated modes side by side. The full search, often spanning tens of thousands of candidate configurations, completes in seconds rather than the days an exhaustive search on real GPUs would take.
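The Pareto-frontier step itself is simple once per-configuration estimates exist. The following is a minimal sketch in plain Python (with hypothetical configuration names and numbers, not AIConfigurator's actual code) of selecting the non-dominated configurations by throughput and latency:

```python
def pareto_frontier(candidates):
    """Keep configs not dominated by any other (higher throughput AND lower latency)."""
    frontier = []
    for name, tput, lat in candidates:
        dominated = any(
            (t2 >= tput and l2 <= lat) and (t2 > tput or l2 < lat)
            for _, t2, l2 in candidates
        )
        if not dominated:
            frontier.append((name, tput, lat))
    return sorted(frontier, key=lambda c: c[2])  # order by latency

# Hypothetical (name, tokens/s/GPU, TPOT ms) estimates for a few candidates.
configs = [
    ("agg_tp4",     400, 12.0),
    ("disagg_2p4d", 550, 14.0),
    ("agg_tp8",     300, 14.5),  # dominated: slower AND higher latency than agg_tp4
    ("disagg_4p8d", 600, 20.0),
]
print(pareto_frontier(configs))
```

The real tool ranks tens of thousands of such candidates; the frontier is what gets plotted as the throughput-vs-latency curve in Figure 1.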
To see how this tool can help you as a developer, consider a concrete example: deploying Qwen3-32B with NVFP4 quantization across 64 NVIDIA B200 GPUs, with target SLAs of 1000ms time-to-first-token (TTFT) and 15ms time-per-output-token (TPOT). Using a single command, you can search through thousands of candidate configurations:
pip install aiconfigurator # or install from source for latest
aiconfigurator cli default \
--model-path nvidia/Qwen3-32B-NVFP4 \
--total-gpus 64 \
--system b200_sxm \
--isl 15000 --osl 500 \
--ttft 1000 --tpot 15 \
--save-dir ./results
Within seconds, AIConfigurator returns a recommendation. In this example, disaggregated serving achieves 550 tokens/s/GPU, a 38% improvement over the best aggregated configuration. The output includes a Pareto frontier visualizing the full tradeoff space, ranked configurations (best_config_topn.csv), engine configurations for each worker type, and ready-to-use deployment artifacts for both serving modes.
Figure 1. Example TPS/GPU Pareto frontier drawn by AIConfigurator
For disaggregated serving in Dynamo, deploying the recommended configuration requires a single command:
kubectl apply -f results/disagg/top1/k8s_deploy.yaml
This workflow generalizes across models and hardware. The same interface applies whether deploying Qwen3-32B on eight NVIDIA H200 GPUs or DeepSeek-V3 across a multi-node B200 cluster; AIConfigurator adapts its search space and recommendations to the specified model, hardware, and SLA constraints.
Extending support to multiple frameworks
AIConfigurator originally supported only NVIDIA TensorRT LLM, but as frameworks like SGLang gained traction—particularly for MoE models like DeepSeek—single-backend support was no longer sufficient. We designed a framework-agnostic abstraction layer with a unified parameter mapping that normalizes each backend’s config schemas and terminology behind a single interface. That investment paid off when community partners such as Mooncake and Alibaba brought SGLang support to life, contributing collectors, validation, and integration work covered in the following sections.
From a user’s perspective, comparing backends is a one-flag change:
# TensorRT LLM
aiconfigurator cli default \
--model-path nvidia/Qwen3-32B-NVFP4 \
--total-gpus 64 --system b200_sxm \
--backend trtllm
# SGLang
aiconfigurator cli default \
--model-path nvidia/Qwen3-32B-NVFP4 \
--total-gpus 64 --system b200_sxm \
--backend sglang
# vLLM
aiconfigurator cli default \
--model-path nvidia/Qwen3-32B-NVFP4 \
--total-gpus 64 --system b200_sxm \
--backend vllm
To make it even simpler, --backend auto compares all three frameworks in one command:
aiconfigurator cli default \
--model-path nvidia/Qwen3-32B-NVFP4 \
--total-gpus 64 --system b200_sxm \
--backend auto
The search process is identical across backends; only the generated deployment artifacts differ, with each backend receiving native config files, CLI arguments, and K8s manifests in its expected format. AIConfigurator currently ships with silicon-validated performance data for TensorRT LLM and SGLang across NVIDIA H100, H200, and B200 systems, with vLLM support on select platforms as well.
WideEP inference for SGLang
SGLang is especially popular for running “Wide Expert Parallelism” (WideEP), which dramatically increases decode throughput for MoE models like DeepSeek V3/R1 by distributing experts across a large number of GPUs. To accurately model SGLang’s WideEP pathway, AIConfigurator simulates key elements like DeepEP all-to-all communication, MTP, MLA attention, Attention DP, workload-aware MoE, and expert parallel load balancing (EPLB). Modeling MoE and EPLB poses the greatest challenge.
WideEP’s MoE routing inherently suffers from load imbalance, with some experts receiving more tokens than others. AIConfigurator models this power-law workload distribution using an alpha parameter. This alpha acts as a lookup key in the performance database, linking distribution patterns to collected latency profiles, similar to the standard MoE path. An alpha of 1.01 empirically fits DeepSeek V3.1 well for both prefill and decode across datasets.
In WideEP deployments, AIConfigurator models EPLB by adjusting two factors instead of directly simulating the algorithm. First, the workload distribution alpha is lowered from 1.01 to 0.6 to reflect the load smoothing from expert replication. Second, the effective token count is multiplied by 0.8, modeling the empirical reduction in maximum per-GPU token load. These changes select the correct latency curve and adjust the operating point accordingly.
Figure 2. Power-law simulation
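To make the alpha mechanics concrete, here is a small NumPy sketch (my own illustration, not AIConfigurator code) of a power-law token distribution across experts, showing how lowering alpha from 1.01 to 0.6 flattens the hottest expert's load, alongside the 0.8 effective-token factor:

```python
import numpy as np

def expert_load(num_experts, alpha):
    """Power-law token shares: the expert ranked r receives weight r**-alpha, normalized."""
    ranks = np.arange(1, num_experts + 1, dtype=np.float64)
    weights = ranks ** -alpha
    return weights / weights.sum()

NUM_EXPERTS = 256
skewed = expert_load(NUM_EXPERTS, alpha=1.01)   # raw routing skew (empirical DeepSeek V3.1 fit)
smoothed = expert_load(NUM_EXPERTS, alpha=0.6)  # after EPLB-style load smoothing

# Lowering alpha flattens the distribution, so the hottest expert's share shrinks.
assert smoothed.max() < skewed.max()

# Second EPLB adjustment: scale the effective token count by 0.8 to model the
# empirical reduction in maximum per-GPU token load.
tokens_in_batch = 8192
effective_tokens = 0.8 * tokens_in_batch
```

In AIConfigurator, the alpha value additionally serves as a lookup key that selects the matching latency curve in the performance database.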
Preliminary results are promising: The best configuration identified by AIConfigurator aligns with the manually tuned production configuration. Further collaboration is planned to bring this to production readiness.
How the SGLang community is contributing
Mooncake: Initial SGLang support in AIConfigurator
AIConfigurator initially supported only TensorRT LLM, reserving interfaces for SGLang and vLLM without full implementation. Contributors from Mooncake (an open source collaboration between Moonshot AI, Tsinghua University, and others) then developed the first version of the SGLang backend.
They first completed the collector layer, modeling and encapsulating core operations (GEMM, attention, batch-GEMM). This allowed quick support for models like Llama, Qwen, and DeepSeek. This work, combined with the subsequent SGLang WideEP effort, formed the first SGLang backend for AIConfigurator.
Alibaba: Integrating AIConfigurator in the AI Serving Stack for automated deployments
The AI Serving Stack , built on the Alibaba Container Service for Kubernetes (ACK), is an end-to-end solution for efficient and scalable cloud-native LLM inference. It manages the entire lifecycle, offering deployment, smart routing, auto-scaling, and deep observability.
Figure 3. An Alibaba graphic showing how it uses AIConfigurator in its container service
Within this stack, the RoleBasedGroup (RBG), an SGLang community-incubated AI orchestration engine to which Alibaba Cloud heavily contributes, simplifies LLM inference service deployment on Kubernetes. RBG uses “Role” as its core orchestration unit, dividing prefill/decode-disaggregated services into router, prefill, and decode roles and coordinating their placement, scaling, and updates. This balances performance and stability while remaining extensible through roles.
The full Dynamo service stack can be deployed with the AI Serving Stack on ACK, using AIConfigurator's prediction results and generator module as input; the ACK team can then generate the deployable configuration for RBG (see the reference). By integrating this process, Alibaba achieved 1.86x the throughput of the baseline on the Qwen3-235B-FP8 model, while maintaining TTFT < 5000ms and ITL < 40ms.
RBG will continue to track AIConfigurator’s progress and provide Day 0 support for rapid deployment of new models in ACK.
Alibaba: Building HiSim based on AIConfigurator
AIConfigurator optimizes static workloads, but it cannot easily model dynamic, bursty production traffic, complex scheduling, and KV cache dynamics. To overcome this, the Alibaba TAIR KV Cache Team created Tair-KVCache-HiSim , a lightweight, high-fidelity, and event-driven system simulator.
HiSim tackles dynamic traffic and queuing (predicting TTFT, TPOT, and throughput under variable rates and complex scheduling like SGLang) and advanced KV cache optimization (quantifying tradeoffs for multi-level storage and various eviction/prefetch policies) via system-level simulation.
HiSim comprises a workload generator, a global router simulator, and an inference engine simulator (IES). The IES uses a unified global clock to coordinate three components: the scheduler simulator, which manages LLM requests (preemption, batching); the KV cache manager simulator (HiCacheController), which models the three-level KV cache and eviction; and the BatchRunnerEstimator (AIConfiguratorTimePredictor), which calculates batch latency based on AIConfigurator.
This structure adapts rapidly to diverse inference engines (vLLM, SGLang, TensorRT LLM), accurately mimicking real-world configurations, runtime parameters, and execution semantics (parallelism, batching, device optimizations) without engine modification, ensuring high fidelity.
HiSim guides SGLang R&D by allowing configuration tuning to quantify scheduling tradeoffs (TTFT/throughput, queueing/memory, cache hit/TTFT, overlap efficiency) without code changes. It provides “oracle” evaluation for new hardware by estimating performance ceilings and identifying bottlenecks using theoretical specs. HiSim also aids HiCache architecture exploration and cost/performance optimization through three-level KV cache design (e.g., L2 size, prefetch/eviction policy, L3 bandwidth needs, write-through vs write-back) to find the best cost–performance point.
Leveraging AIConfigurator, HiSim extends static analysis to active, cost-aware deployment recommendations for dynamic traffic. The end-to-end simulation is within 5% error of real-world performance. Future work will enhance this collaboration to build a high-fidelity, production-ready system simulator.
What’s next for AIConfigurator
The roadmap ahead extends AIConfigurator from a standalone command line tool into a core component of the Dynamo platform:
Faster model support. “Hybrid” mode already provides Day 1 recommendations via speed-of-light estimates; we are also automating the silicon data-collection pipeline to accelerate fully validated support.
Powering Dynamo deployments. AIConfigurator is becoming the configuration engine behind Dynamo’s Kubernetes flow via the DynamoGraphDeploymentRequest (DGDR) CRD, producing optimized deployments from a single YAML file.
Dynamic workload modeling. Moving beyond static input sequence length/output sequence length/concurrency targets toward models that capture production workload distributions directly.
NVIDIA plans to keep working with third parties on bringing AIConfigurator to more systems and tools. AIConfigurator is actively welcoming contributions, including performance data for new hardware, additional backend support, new features, and extensions like HiSim.
See the AIConfigurator repository to get started, and check out the Dynamo project for the fastest way to set up disaggregated serving.
For a full technical treatment, including formal definitions and validation results, read our paper: AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving .
About the Authors
About Tianhao Xu
Tianhao Xu is a solution architect manager at NVIDIA, focusing on customer engagements with consumer internet companies. He specializes in GPU-accelerated workloads such as LLM inference and training, NVIDIA CUDA optimizations, NVIDIA libraries, and end-to-end solutions. He holds a master’s degree in computational fluid dynamics.
View all posts by Tianhao Xu
About Ben Hamm
Ben Hamm is a technical product manager at NVIDIA, with a focus on LLM inference performance and optimization. He joined NVIDIA in 2024 as part of the acquisition of OctoAI, where he served as the PM for Octo’s LLM hosting service. Prior to that, he was a PM at Amazon where he worked on Alexa’s ML stack for wakeword detection. Ben is also a computer vision hobbyist, and invented an AI-powered cat door.
View all posts by Ben Hamm
About Aichen Feng
Aichen Feng is a solutions architect at NVIDIA. Aichen focuses on AI inference frameworks and deep learning model optimization, and is particularly interested in large language models and multimodal models.
View all posts by Aichen Feng
About Jason Zhou
Jason Zhou is a software engineer at NVIDIA, focused on LLM inference performance and optimization. He joined NVIDIA at the end of 2025. Previously, he worked at ByteDance on large-scale training frameworks and, prior to that, at Alibaba Group and Microsoft on distributed cloud storage systems. Outside of work, Jason enjoys watching movies and traveling around the world.
View all posts by Jason Zhou
About Kimi Zhao
Kimi Zhao is a solution architect at NVIDIA, focusing on large model inference acceleration, analysis, and reinforcement learning. He holds a B.S. in physics and M.S. in signal and information processing.
View all posts by Kimi Zhao
Build Accelerated, Differentiable Computational Physics Code for AI with NVIDIA Warp | NVIDIA Technical Blog
nvidia_dev_blog | 12.03.2026 17:30
Computer-aided engineering (CAE) is shifting from human-driven workflows toward AI-driven ones, including physics foundation models that generalize across geometries and operating conditions. Unlike LLMs, these models depend on large volumes of high-fidelity, physics-compliant data.
Recent scaling-law work on computational fluid dynamics (CFD) surrogates indicates that simulation-generated training data is often the limiting cost in practice. This pushes requirements onto the simulator, which must be GPU-native, fast, and able to plug directly into ML workflows.
NVIDIA Warp is a framework for accelerated simulation, data generation, and spatial computing that bridges CUDA and Python. Warp enables developers to write high-performance kernels as regular Python functions that are JIT-compiled into efficient code for execution on the GPU. Unlike tensor-based frameworks, in which developers express computation as operations on entire N-dimensional arrays, Warp has developers author flexible kernels that execute simultaneously across all elements of a computational grid.
Simulation kernels are often expressed on computational grids and rely on data-dependent control flow like conditionals, early-outs, and selective updates that vary per element. In tensor frameworks, these patterns require composing Boolean masks that quickly become unwieldy and can waste computation on irrelevant elements. In a Warp kernel, each thread can branch, skip, or exit independently, expressing this logic naturally without masking workarounds.
Furthermore, as this post will show, solvers written in Warp can easily be made differentiable through Warp's native support for automatic differentiation. They are straightforward to integrate with optimization or training workflows while remaining interoperable with frameworks like PyTorch, JAX, and NumPy for use cases spanning simulation, robotics, perception, and geometry processing.
This post walks you through how to build a 2D Navier–Stokes solver entirely in Warp. It explains how the Warp programming model maps onto a PDE solver. Then, it differentiates through the simulation to solve an optimal perturbation problem end-to-end. It closes with industrial case studies showcasing what Warp can enable in production workflows. For more information, see the 2D Navier–Stokes solver example and 2D Navier-Stokes optimal perturbation example on the NVIDIA/warp GitHub repo.
How to write a 2D Navier–Stokes solver using Warp
To keep the focus on Warp rather than on numerical methods, a textbook example of 2D decaying turbulence is used here, described by the vorticity-streamfunction formulation of the incompressible Navier-Stokes equations. The vorticity \(\omega\) evolves according to the transport equation:
\(\frac{\partial \omega}{\partial t} + \frac{\partial \psi}{\partial y}\frac{\partial \omega}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \omega}{\partial y} = \frac{1}{\text{Re}}\nabla^2 \omega \tag{1}\)
and the streamfunction \(\psi\) is recovered from vorticity through the Poisson equation:
\(\nabla^2 \psi = -\omega \tag{2}\)
With periodic boundary conditions, the equation above reduces to an algebraic equation in Fourier space bypassing the need for iterative solvers:
\(\hat{\psi}_{m,n} = \frac{\hat{\omega}_{m,n}}{k_x^2 + k_y^2} \tag{3}\)
where \((k_x, k_y)\) is the wavenumber pair in the Fourier space. The solver makes use of the Fast Fourier Transform (FFT) algorithm to efficiently transform \(\omega\) and \(\psi\) to Fourier space and vice versa.
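As a sanity check of Equation 3, the same spectral solve can be sketched in a few lines of NumPy (illustration only; the Warp implementation appears below). For \(\omega = 2\sin x \sin y\) on a periodic \(2\pi\) domain, the exact streamfunction is \(\psi = \sin x \sin y\):

```python
import numpy as np

N, L = 64, 2.0 * np.pi
x = np.linspace(0.0, L, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

omega = 2.0 * np.sin(X) * np.sin(Y)            # vorticity field
kx = np.fft.fftfreq(N, d=L / N) * 2.0 * np.pi  # wavenumbers
KX, KY = np.meshgrid(kx, kx, indexing="ij")
k2 = KX**2 + KY**2
k2[0, 0] = 1.0                                 # avoid division by zero at the mean mode

omega_hat = np.fft.fft2(omega)
psi_hat = omega_hat / k2                       # Equation 3: psi_hat = omega_hat / (kx^2 + ky^2)
psi_hat[0, 0] = 0.0                            # zero-mean streamfunction
psi = np.real(np.fft.ifft2(psi_hat))

assert np.allclose(psi, np.sin(X) * np.sin(Y), atol=1e-10)
```

Because the test field is a single Fourier mode with \(k_x^2 + k_y^2 = 2\), the spectral solve recovers \(\psi = \omega / 2\) to machine precision.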
Each timestep has two subcomponents (Figure 1). First, the vorticity transport equation is discretized on an \(N \times N\) grid over an \(L \times L\) square domain. The solution is marched forward in time by \(\Delta t\) using a third-order strong stability-preserving Runge-Kutta (RK3) scheme to obtain \(\omega(t+\Delta t)\). Second, the Poisson equation is solved in the Fourier space to obtain the updated \(\psi(t+\Delta t)\).
Figure 1. Schematic of a single timestep loop for the solver
Thus, the forward solver has two building blocks that will be described in the subsequent sections:
Warp kernel for the discretization and time marching
FFT-based Poisson solver
Building block 1: Finite-difference discretization and time marching
The advection and diffusion terms in the vorticity transport equation are approximated with second-order central finite differences shown in Figure 2. Higher-order discretization could also be used, but the central second-order scheme is chosen for simplicity.
Figure 2. Finite difference stencils for \(\omega\) and \(\psi\)
The following rk3_update() kernel computes the diffusion and advection terms and performs a single RK3 substep update. The step() function calls this kernel three times per timestep, once for each RK3 stage, with different coefficients (coeff0, coeff1, coeff2) for each stage.
@wp.kernel
def rk3_update(
    n: int, h: float, re: float, dt: float,
    coeff0: float, coeff1: float, coeff2: float,
    omega_0: wp.array2d(dtype=float),
    omega_1: wp.array2d(dtype=float),
    psi: wp.array2d(dtype=float),
    omega_out: wp.array2d(dtype=float),
):
    """Perform a single substep of SSP-RK3."""
    i, j = wp.tid()
    left = cyclic_index(i - 1, n)
    right = cyclic_index(i + 1, n)
    top = cyclic_index(j + 1, n)
    down = cyclic_index(j - 1, n)
    inv_h2 = 1.0 / (h * h)
    laplacian = (
        omega_1[right, j] + omega_1[left, j] + omega_1[i, top] + omega_1[i, down] - 4.0 * omega_1[i, j]
    ) * inv_h2
    inv_2h = 1.0 / (2.0 * h)
    j1 = ((omega_1[right, j] - omega_1[left, j]) * inv_2h) * ((psi[i, top] - psi[i, down]) * inv_2h)
    j2 = ((omega_1[i, top] - omega_1[i, down]) * inv_2h) * ((psi[right, j] - psi[left, j]) * inv_2h)
    rhs = (1.0 / re) * laplacian + j2 - j1
    omega_out[i, j] = coeff0 * omega_0[i, j] + coeff1 * omega_1[i, j] + coeff2 * dt * rhs
The rk3_update() kernel follows the single-instruction, multiple-threads (SIMT) paradigm: each thread maps to one grid point on the computational domain, and all \(N \times N\) points are updated simultaneously with a single wp.launch() call.
wp.launch(
    rk3_update,
    dim=(self.n, self.n),  # one thread per grid point
    inputs=[
        self.n, self.h, self.re, self.dt,
        stage_coeff[0], stage_coeff[1], stage_coeff[2],
        self.omega_0,
        self.omega_1,
        self.psi,
    ],
    outputs=[self.omega_tmp],
)
Figure 3. SIMT update of \(\omega\) on the 2D grid. Thread (i, j) updates cell (i, j) to the next timestep using values from neighboring cells in the stencil at the current timestep
Building block 2: FFT Poisson solver
Warp tile-based primitives enable solving the Poisson equation in Fourier space. The key operations are wp.tile_fft() and wp.tile_ifft(), which perform the forward and inverse FFT, respectively, on a single row loaded into a tile. A full 2D FFT on an \(N \times N\) array is then decomposed into three steps: row-wise FFT -> transpose -> row-wise FFT. The schematic in Figure 4 explains how fft_tiled() and ifft_tiled() compute the forward and inverse FFT under the hood.
@wp.kernel
def fft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Row-wise FFT using tile primitives."""
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(1, N_GRID), offset=(i, 0))
    wp.tile_fft(a)
    wp.tile_store(y, a, offset=(i, 0))

@wp.kernel
def ifft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Row-wise inverse FFT using tile primitives."""
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(1, N_GRID), offset=(i, 0))
    wp.tile_ifft(a)
    wp.tile_store(y, a, offset=(i, 0))
Figure 4. Row-wise tile_fft on an NxN grid. Each block loads one row into a register tile, computes the FFT cooperatively, and stores the result back to global memory
A 2D FFT also requires a transpose between the row-wise passes. This can use either the SIMT or tile paradigm (through wp.tile_transpose). For simplicity, the SIMT version is shown below:
@wp.kernel
def transpose(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    i, j = wp.tid()
    y[i, j] = x[j, i]
Composing these three kernels, fft_tiled -> transpose -> fft_tiled, gives a full 2D forward FFT. The inverse follows the same pattern with ifft_tiled.
Putting the building blocks together
The step() function in the example relies on a few other helper kernels that are not discussed in detail here. For the definitions of those kernels, see the 2D Navier–Stokes solver example on the NVIDIA/warp GitHub repo. With all the building blocks in place, a single step() call advances the simulation by one timestep. The self._solve_poisson() method in the example code abstracts away the \(\omega(t+\Delta t) \xrightarrow{\text{FFT}} \hat{\omega} \xrightarrow{\text{Eq.\,3}} \hat{\psi} \xrightarrow{\text{IFFT}} \psi(t+\Delta t)\) pipeline for modularity.
def step(self) -> None:
    """Advance simulation by one timestep using SSP-RK3."""
    for stage_coeff in self.rk3_coeffs:
        wp.launch(
            rk3_update,
            dim=(self.n, self.n),
            inputs=[
                self.n, self.h, self.re, self.dt,
                stage_coeff[0], stage_coeff[1], stage_coeff[2],
                self.omega_0,
                self.omega_1,
                self.psi,
            ],
            outputs=[self.omega_tmp],
        )
        # Swap buffers for next RK3 substep
        self.omega_1, self.omega_tmp = self.omega_tmp, self.omega_1
    # Update streamfunction for next timestep
    self._solve_poisson()
    # Copy updated vorticity to self.omega_0 for the next timestep
    wp.copy(self.omega_0, self.omega_1)
Running the solver produces the decaying turbulence field shown in Figure 5. On the GPU, the step() function is captured into a CUDA graph through wp.ScopedCapture and replayed with wp.capture_launch() for all subsequent frames, eliminating per-launch overhead.
Figure 5. Two-dimensional decaying turbulence at Re = 1,000
Differentiating through the solver
Now that the working solver has been built, the next question is how to make it differentiable.
Automatic differentiation (AD) computes exact derivatives of a program by applying the chain rule to each elementary operation in the computational graph. Unlike finite differences, AD avoids step-size tuning and yields gradients accurate to machine precision. The key advantage of AD for PDE solvers is scaling: with a complex simulation on a large grid, each forward solve is already expensive, so methods like finite differences require \(O(n)\) full solves to get gradients with regard to \(n\) inputs.
Reverse-mode AD computes all \(\partial \mathcal{L}/\partial x_i\) in roughly one forward pass plus one backward pass, making gradient-based optimization practical at production resolution. This is the same idea as backpropagation in neural nets, and it is why both deep learning and large-scale physics optimization can handle millions of degrees of freedom.
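The tape idea behind reverse-mode AD can be illustrated with a toy implementation in plain Python (a generic illustration of the principle, not how Warp's code generator actually works): record each elementary operation and its local gradients during the forward pass, then sweep the records in reverse to accumulate adjoints.

```python
import math

class Tape:
    """Toy reverse-mode AD: record (output, [(input, local gradient)]) per op."""
    def __init__(self):
        self.records = []
        self.values = {}
        self.next_id = 0

    def var(self, value):
        vid = self.next_id
        self.next_id += 1
        self.values[vid] = value
        return vid

    def add(self, a, b):
        out = self.var(self.values[a] + self.values[b])
        self.records.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def mul(self, a, b):
        out = self.var(self.values[a] * self.values[b])
        self.records.append((out, [(a, self.values[b]), (b, self.values[a])]))
        return out

    def sin(self, a):
        out = self.var(math.sin(self.values[a]))
        self.records.append((out, [(a, math.cos(self.values[a]))]))
        return out

    def backward(self, loss):
        """One reverse sweep yields the gradient w.r.t. every input at once."""
        grads = {loss: 1.0}
        for out, partials in reversed(self.records):
            for vid, local in partials:
                grads[vid] = grads.get(vid, 0.0) + grads.get(out, 0.0) * local
        return grads

# loss = a*b + sin(a); analytically d/da = b + cos(a), d/db = a
t = Tape()
a, b = t.var(2.0), t.var(3.0)
loss = t.add(t.mul(a, b), t.sin(a))
grads = t.backward(loss)
assert abs(grads[a] - (3.0 + math.cos(2.0))) < 1e-12
assert abs(grads[b] - 2.0) < 1e-12
```

One forward pass plus one backward sweep produces both partial derivatives, which is exactly why reverse mode scales to millions of inputs where finite differences cannot.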
The Warp automatic differentiation system generates two versions of a program at compile time for a differentiable simulation:
Forward version : The code that takes physical inputs (initial conditions, discretized governing laws, and so on) and computes the simulation output (fields, derived quantities) as well as intermediate arrays needed for the adjoint version.
Adjoint version : An automatically generated counterpart to the forward simulation that can take sensitivities of a chosen quantity of interest with respect to the simulation outputs and propagate them all the way back to the inputs. This backward propagation reuses intermediate arrays from the forward execution to apply the chain rule of differentiation across the entire solver, yielding the simulation adjoint without constructing large symbolic expressions.
Developers write the forward physics and Warp handles the gradient computation. Any wp.array
that should be differentiable is allocated with requires_grad=True
, which tells Warp to allocate a companion array for adjoint storage. The resulting adjoints can be used standalone (as in this example) or interoperated with PyTorch or JAX for end-to-end optimization, including training ML models. Currently, Warp supports reverse-mode AD only.
To illustrate, the optimal perturbation problem outlined in Prediction and Control of Two-Dimensional Decaying Turbulence Using Generative Adversarial Networks is tackled here. In turbulent flows, small perturbations to the initial conditions can amplify over time and significantly alter the trajectory of the flow. Identifying which perturbations grow the fastest is a stepping stone toward flow control and toward understanding which structures in the flow are dynamically significant. Concretely, the initial vorticity perturbation \(\Delta\omega\) is sought, which maximizes the divergence between perturbed and unperturbed trajectories at a lead time \(\tau\).
Let \(F^{\tau}\) denote the forward solver applied for \(\tau\) time units. The unperturbed trajectory is \(Y^{*} = F^{\tau}(\omega_0)\) and the perturbed trajectory is \(\tilde{Y} = F^{\tau}(\omega_0 + \Delta\omega)\). The mean squared error (MSE)
\(\mathrm{MSE} = -\frac{1}{N^2}\left\| Y^* - \tilde{Y} \right\|_2^2 \tag{4}\)
is minimized, where the negative sign turns maximization of trajectory divergence into a minimization problem. To constrain the optimization, \(\mathrm{rms}(\Delta\omega) \leq 0.2 \times \mathrm{rms}(\omega_0)\), that is, the perturbation RMS must not exceed 20% of the RMS of the initial vorticity field \(\omega_0\).
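One simple way to honor such an RMS constraint is projection after each gradient step: if the candidate perturbation exceeds the budget, rescale it back onto the constraint boundary. A NumPy sketch (my own illustration; the repo example may enforce the constraint differently):

```python
import numpy as np

def rms(field):
    return np.sqrt(np.mean(field**2))

def project_rms(delta_omega, omega0, budget=0.2):
    """Rescale delta_omega so that rms(delta_omega) <= budget * rms(omega0)."""
    limit = budget * rms(omega0)
    current = rms(delta_omega)
    if current > limit:
        delta_omega = delta_omega * (limit / current)
    return delta_omega

rng = np.random.default_rng(0)
omega0 = rng.standard_normal((64, 64))
delta = 5.0 * rng.standard_normal((64, 64))  # deliberately oversized perturbation

projected = project_rms(delta, omega0)
assert rms(projected) <= 0.2 * rms(omega0) + 1e-12
```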
For more details, see the 2D Navier-Stokes optimal perturbation example on the NVIDIA/warp GitHub repo. The following sections focus on the three key changes in the forward solver that would make it differentiable.
No in-place modifications
wp.Tape() records kernel launches in the forward pass and replays them in reverse to compute gradients. That only works if the intermediate values needed by the backward pass are still available, so arrays cannot be freely overwritten in place. This is the key difference from the nondifferentiable solver. In the forward-only version, omega_1 could simply be copied back into omega_0 at the end of each timestep:
wp.copy(omega_0, omega_1)
For the differentiable solver, the RHS computation and the RK3 update need to be split into separate kernels that write to separate arrays. Thus, a single RK3 update becomes the following; note that omega_1 values can no longer be copied into omega_0 at the end of each timestep as before.
omega_out[i, j] = coeff0 * omega_0[i, j] + coeff1 * omega_in[i, j] + coeff2 * dt * rhs[i, j]
In Warp, all the intermediate arrays need to be explicitly defined by the user. This requires pre-allocating separate arrays for every RK substep at every timestep, which is generally the dominant GPU memory cost of any differentiable solver.
self.omega_timestep = [wp.zeros((n, n), dtype=wp.float32, requires_grad=True) for _ in range(T + 1)]
# Intermediate arrays for each RK3 substep for each timestep
self.omega_stage = []
self.psi_stage = []
self.rhs_stage = []
self.fft_arrays = []
for _ in range(T):
s_omega, s_psi, s_rhs, s_fft = [], [], [], []
for _ in range(3):
s_omega.append(wp.zeros((n, n), dtype=wp.float32, requires_grad=True))
s_psi.append(wp.zeros((n, n), dtype=wp.float32, requires_grad=True))
s_rhs.append(wp.zeros((n, n), dtype=wp.float32, requires_grad=True))
s_fft.append({"omega_complex": wp.zeros((n, n), dtype=wp.vec2f, requires_grad=True),
# ... plus 4 FFT scratch arrays, each (n, n) vec2f
})
self.omega_stage.append(s_omega)
self.psi_stage.append(s_psi)
self.rhs_stage.append(s_rhs)
self.fft_arrays.append(s_fft)
Storing Warp arrays for every intermediate state scales linearly with the number of timesteps, which becomes prohibitive in long runs. One common approach is gradient checkpointing , saving only selected states, then recomputing the missing segments using the forward solver during the backward pass. This method trades extra forward compute for a much smaller memory footprint. For an example showing how to implement gradient checkpointing in Warp, see the fluid checkpoint example on the NVIDIA/warp GitHub repo.
Recording gradients with wp.Tape()
With the pre-allocated arrays in place, recording and differentiating the forward pass is straightforward:
with wp.Tape() as tape:
forward() # wp.launch calls that take omega from t0 to t0 + lead t and calculate MSE
tape.backward(loss) # Automatic differentiation to get derivatives of loss w.r.t Warp arrays
The wp.Tape()
context records every wp.launch()
call into a computational graph. tape.backward(loss)
traverses that graph in reverse, computing the derivatives of loss
with respect to the Warp arrays. Here the focus is the gradients of loss
with respect to \(\Delta{\omega}\), which can be obtained through delta_omega.grad
.
Optimization loop
The following code block shows one optimization step. The forward()
function is run on the perturbed initial vorticity to produce the final field and loss (MSE versus the unperturbed run). The tape records the kernel launches during this pass. tape.backward(loss)
then backpropagates through the recorded graph to compute gradients with regard to the perturbation, and optimizer.step()
updates the perturbation to reduce the loss. Finally, tape.zero()
clears accumulated gradients before the next iteration.
with wp.Tape() as tape:
forward() # Loss is computed inside forward() function
tape.backward(loss)
optimizer.step([delta_omega.grad.flatten()])
tape.zero()
After 1,000 iterations, the optimizer discovers a structured perturbation \(\Delta\omega\) that amplifies trajectory divergence, driving the MSE from near-zero to ~250. The perturbation field obtained from the solver-in-the-loop optimization qualitatively resembles the one reported in Prediction and Control of Two-Dimensional Decaying Turbulence Using Generative Adversarial Networks .
Figure 6. Optimization progressing over 1,000 iterations with discovered perturbation (top right)
To learn more, the NVIDIA/warp GitHub repo includes additional differentiable-solver examples beyond CFD. See also a growing list of research publications that leverage Warp .
Warp in practice: Case studies of AI-driven industrial workflows
In real AI workflows, simulation and geometry sit inside larger systems (surrogate models, RL, design optimization, and so on). PyTorch and JAX handle training and tensor ops, but the simulation layer adds staged timestepping, stencil updates, and big spatial queries. Warp targets that kernel-heavy layer: you control execution, fuse kernels to cut memory traffic and launches, and use CUDA Graphs to reduce repeated dispatch. It also interoperates zero-copy with PyTorch and JAX tensors.
Autodesk XLB
Autodesk Research built XLB , a differentiable Lattice Boltzmann solver in Python with both Warp and JAX backends, enabling a direct comparison on the same formulation and hardware. On a ~134-million-cell lid-driven cavity benchmark, Warp ran about 8x faster than JAX on a single 40 GB NVIDIA A100 Tensor Core GPU , roughly matching the throughput that JAX needed eight A100 Tensor Core GPUs to reach. At larger sizes, Warp used ~2.5x–3x less memory and completed the largest case, on which JAX ran out of memory on the same GPU.
Figure 7. Throughput and memory usage comparison between Warp and JAX
To learn more, see Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200 .
Google DeepMind MuJoCo
Google DeepMind has recently released MuJoCo Warp (MJWarp), a Warp-based backend for large-scale multibody dynamics. The Warp backend reaches up to 252x (locomotion) and 475x (manipulation) speedups over JAX on comparable hardware. MJWarp gets there by exploiting sparse matrix operations and speculative execution to more precisely dispatch compute, while remaining plug-compatible with JAX training.
Figure 8. MJWarp physics step throughput versus MuJoCo MJX on LEAP hand manipulation and Apptronik locomotion benchmarks
To learn more, see the MuJoCo Warp release announcement .
C-Infinity AutoAssembler
The C-Infinity AutoAssembler ASI Engine shows the value of Warp in AI-driven industrial workflows beyond physics simulation. It converts full-fidelity CAD assemblies into motion constraints for AI planning by computing contact, interference, and clearance directly from raw geometry. Current CAD systems do not support these critical queries, which are required to construct manufacturing process plans, evaluate design changes, and generate execution instructions.
The AutoAssembler ASI engine enables building a manufacturing compiler, transforming engineering CAD data directly to assembly instructions for either human or robot consumption. The technology is implemented using Warp kernels optimized for large scale processing to build spatial intelligence.
On an NVIDIA L4 Tensor Core GPU, the Warp GPU backend achieved a speedup of up to 669x over optimized CPU baselines (based on state of the art libraries including FCL plus Embree). The technology is already in use within enterprise manufacturing workflows at top OEMs.
Figure 9. Liaison graph construction: CPU (FCL/Embree) versus AutoAssembler ASI Engine (GPU) across five CAD assemblies of increasing complexity
To learn more, see AutoAssembler ASI: Accelerated Spatial Intelligence, C-Infinity .
Get started with Warp for computational physics applications
Warp enables you to write physics and geometry as GPU kernels in Python, without forcing everything into tensor-based frameworks. In CFD, timestepping and differentiable solves map cleanly to kernels, keeping the structure of the physics intact.
This model already shows up in industrial AI workflows, including the Autodesk differentiable CFD solver, the Google DeepMind multibody dynamics work, and the C-Infinity spatial reasoning engine. With zero-copy interop to PyTorch and JAX, Warp plugs into ML pipelines while preserving the control flow these workloads need, with measured gains in performance, memory, and scalability.
To get started with Warp for computational physics applications, check out these resources:
Introduction to NVIDIA Warp notebook
2D Navier–Stokes solver example
2D Navier-Stokes optimal perturbation example
NVIDIA Warp documentation
To learn more, join the NVIDIA GTC 2026 session, How to Use NVIDIA Warp to Build GPU-Accelerated Computational Physics Simulations [DLIT81837] . Watch the GTC keynote with NVIDIA founder and CEO Jensen Huang and explore more physical AI , robotics , and vision AI GTC sessions.
Acknowledgments
Thanks to Felix Meyer for contributing to this post and project.
Discuss (0)
Like
Tags
Developer Tools & Techniques | Robotics | Simulation / Modeling / Design | HPC / Scientific Computing | General | Advanced Technical | Tutorial | CAE | Physics | Python | research | Warp
About the Authors
About Sheel Nidhan
Sheel Nidhan is a senior technical marketing engineer at NVIDIA, working at the intersection of CUDA-X libraries and computer-aided engineering (CAE) applications. Prior to joining NVIDIA, Sheel was a senior R&D engineer at Ansys, focusing on application of deep learning to Ansys’s simulation technology. He obtained his PhD in mechanical engineering from the University of California, San Diego, where he worked on high-fidelity simulations and data-driven analysis of turbulent wakes.
View all posts by Sheel Nidhan
About Eric Shi
Eric Shi is a senior engineering manager at NVIDIA, where he leads development on the Warp library. As a core contributor defining the project’s engineering standards and infrastructure, he also serves on the Newton team to establish the tooling and workflows that drive developer productivity. Previously, he was a computational physicist at Lawrence Livermore National Laboratory and holds a Ph.D. in plasma physics from Princeton University.
View all posts by Eric Shi
About Neil Ashton
Neil Ashton is a distinguished engineer and product architect at NVIDIA, specializing in computer-aided engineering (CAE). A Fellow of the Institution of Mechanical Engineers, he previously served as the Worldwide Tech Lead for CAE at AWS and a Senior Researcher at the University of Oxford.
View all posts by Neil Ashton
About Zach Corse
Zach Corse is a senior software engineer at NVIDIA, where he works on the NVIDIA Warp library. His published research spans a diverse range of topics, including electron scattering in quantum waveguides, fluid simulation on the surface of a sphere, and the optical properties of ice under varying conditions. Zach holds a BS in physics from Duke, an MA in physics from UT Austin, an MA in art from UCSC, and an MSE in computer graphics from UPenn. He is currently focused on advancing Warp’s tile API.
View all posts by Zach Corse
About Mohammad Mohajerani
Mohammad Mohajerani is a senior product manager at NVIDIA, where his work enables high-performance simulation and real-time physics across physical AI, CAE, and AI-driven digital twin applications through Warp and Newton. Prior to NVIDIA, Mohammad held product and engineering leadership roles in the startup world at Sanctuary AI and Haply Robotics and spent several years advancing physics engines and skills training simulators at CM Labs Simulations. He holds a master's degree in mechanical engineering from Concordia University, with a focus on aerial robotics, system identification, and control optimization.
View all posts by Mohammad Mohajerani
Comments
Related posts
Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200
Autodesk Research Brings Warp Speed to Computational Fluid Dynamics on NVIDIA GH200
Just Released: NVIDIA Warp is Now Open-Source Under Apache 2.0
Just Released: NVIDIA Warp is Now Open-Source Under Apache 2.0
Introducing Tile-Based Programming in Warp 1.5.0
Introducing Tile-Based Programming in Warp 1.5.0
AI-Powered Simulation Tools for Surrogate Modeling Engineering Workflows with Siml.ai and NVIDIA PhysicsNeMo
AI-Powered Simulation Tools for Surrogate Modeling Engineering Workflows with Siml.ai and NVIDIA PhysicsNeMo
NVIDIA PhysicsNeMo: An AI-Accelerated Multiphysics Simulation Toolkit
NVIDIA PhysicsNeMo: An AI-Accelerated Multiphysics Simulation Toolkit
Related posts
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
L
T
F
R
E
|
|
|
New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI |
nvidia_blog |
11.03.2026 16:00 |
0.751
|
| Embedding sim. | 0.8391 |
| Entity overlap | 0.1613 |
| Title sim. | 0.2803 |
| Time proximity | 1 |
| NLP тип | product_launch |
| NLP организация | NVIDIA |
| NLP тема | large language models |
| NLP страна | |
Открыть оригинал
New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI
A new, open, 120-billion-parameter hybrid mixture-of-experts model optimized for NVIDIA Blackwell addresses the costs of long thinking and context explosion that slow autonomous agent workflows.
March 11, 2026 by Kari Briski
0 Comments
Share
Share This Article
X
Facebook
LinkedIn
Copy link
Link copied!
Launched today, NVIDIA Nemotron 3 Super is a 120‑billion‑parameter open model with 12 billion active parameters designed to run complex agentic AI systems at scale.
Available now, the model combines advanced reasoning capabilities to efficiently complete tasks with high accuracy for autonomous agents.
AI-Native Companies: Perplexity offers its users access to Nemotron 3 Super for search and as one of 20 orchestrated models in Computer. Companies offering software development agents like CodeRabbit , Factory and Greptile are integrating the model into their AI agents along with proprietary models to achieve higher accuracy at lower cost. And life sciences and frontier AI organizations like Edison Scientific and Lila Sciences will power their agents for deep literature search, data science and molecular understanding.
Enterprise Software Platforms: Industry leaders such as Amdocs , Palantir , Cadence , Dassault Systèmes and Siemens are deploying and customizing the model to automate workflows in telecom, cybersecurity, semiconductor design and manufacturing.
As companies move beyond chatbots and into multi‑agent applications, they encounter two constraints.
The first is context explosion. Multi‑agent workflows generate up to 15x more tokens than standard chat because each interaction requires resending full histories, including tool outputs and intermediate reasoning.
Over long tasks, this volume of context increases costs and can lead to goal drift, where agents lose alignment with the original objective.
The second is the thinking tax. Complex agents must reason at every step, but using large models for every subtask makes multi-agent applications too expensive and sluggish for practical applications.
Nemotron 3 Super has a 1‑million‑token context window, allowing agents to retain full workflow state in memory and preventing goal drift.
Nemotron 3 Super has set new standards, claiming the top spot on Artificial Analysis for efficiency and openn ess with leading accuracy among models of the same size.
The model also powers the NVIDIA AI-Q research agent to the No. 1 position on DeepResearch Bench and DeepResearch Bench II leaderboards, benchmarks that measure an AI system’s ability to conduct thorough, multistep research across large document sets while maintaining reasoning coherence.
Hybrid Architecture
Nemotron 3 Super uses a hybrid mixture‑of‑experts (MoE) architecture that combines three major innovations to deliver up to 5x higher throughput and up to 2x higher accuracy than the previous Nemotron Super model.
Hybrid Architecture: Mamba layers deliver 4x higher memory and compute efficiency, while transformer layers drive advanced reasoning.
MoE: Only 12 billion of its 120 billion parameters are active at inference.
Latent MoE: A new technique that improves accuracy by activating four expert specialists for the cost of one to generate the next token at inference.
Multi-Token Prediction: Predicts multiple future words simultaneously, resulting in 3x faster inference .
On the NVIDIA Blackwell platform, the model runs in NVFP4 precision. That cuts memory requirements and pushes inference up to 4x faster than FP8 on NVIDIA Hopper, with no loss in accuracy.
Open Weights, Data and Recipes
NVIDIA is releasing Nemotron 3 Super with open weights under a permissive license. Developers can deploy and customize it on workstations, in data centers or in the cloud.
The model was trained on synthetic data generated using frontier reasoning models. NVIDIA is publishing the complete methodology, including over 10 trillion tokens of pre- and post-training datasets, 15 training environments for reinforcement learning and evaluation recipes. Researchers can further use the NVIDIA NeMo platform to fine-tune the model or build their own.
Use in Agentic Systems
Nemotron 3 Super is designed to handle complex subtasks inside a multi-agent system.
A software development agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation.
In financial analysis it can load thousands of pages of reports into memory, eliminating the need to re-reason across long conversations, which improves efficiency.
Nemotron 3 Super has high-accuracy tool calling that ensures autonomous agents reliably navigate massive function libraries to prevent execution errors in high-stakes environments, like autonomous security orchestration in cybersecurity .
Availability
NVIDIA Nemotron 3 Super, part of the Nemotron 3 family , can be accessed at build.nvidia.com , Perplexity , OpenRouter and Hugging Face . Dell Technologies is bringing the model to the Dell Enterprise Hub on Hugging Face, optimized for on-premise deployment on the Dell AI Factory, advancing multi-agent AI workflows. HPE is also bringing NVIDIA Nemotron to its agents hub to help ensure scalable enterprise adoption of agentic AI.
Enterprises and developers can deploy the model through several partners:
Cloud Service Providers : Google Cloud’s Vertex AI and Oracle Cloud Infrastructure , and coming soon to Amazon Web Services through Amazon Bedrock as well as Microsoft Azure.
NVIDIA Cloud Partners : Coreweave , Crusoe , Nebius and Together AI .
Inference Service Providers : Baseten , Cloudflare , DeepInfra , Fireworks AI , Inference.net , Lightning AI , Modal and FriendliAI .
Data Platforms and Services : Distyl, Dataiku , DataRobot , Deloitte , EY and Tata Consultancy Services.
The model is packaged as an NVIDIA NIM microservice, allowing deployment from on-premises systems to the cloud.
Stay up to date on agentic AI, NVIDIA Nemotron and more by subscribing to NVIDIA AI news , joining the community , and following NVIDIA AI on LinkedIn , Instagram , X and Facebook .
Explore self-paced video tutorials and livestreams .
Explore the Best of GTC 2026 Sessions
Learn about the breakthroughs shaping the next chapter of AI anytime, anywhere.
Watch On Demand
Recent News
AI Infrastructure
Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid
March 31, 2026
|
|
|
Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models | NVIDIA Technical Blog |
nvidia_dev_blog |
13.03.2026 16:00 |
0.735
|
| Embedding sim. | 0.8356 |
| Entity overlap | 0 |
| Title sim. | 0.3287 |
| Time proximity | 0.866 |
| NLP тип | product_launch |
| NLP организация | nvidia |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
The next generation of AI-driven robots like humanoids and autonomous vehicles depends on high-fidelity, physics-aware training data. Without diverse and representative datasets, these systems don’t get proper training and face testing risks due to poor generalization, limited exposure to real-world variations, and unpredictable behavior in edge cases. Collecting massive real-world datasets for training is expensive, time-intensive, and often constrained by possibilities.
Explore the NVIDIA Cosmos Cookbook for step-by-step workflows, technical recipes, and concrete examples for building, adapting, and deploying Cosmos WFMs.
NVIDIA Cosmos addresses this challenge by accelerating world foundation model (WFM) development. At the core of its platform, Cosmos WFMs speed up synthetic data generation and act as a foundation for post-training, to develop downstream domain or task-specific physical AI models to solve these challenges. This post explores the latest Cosmos WFMs, their key capabilities that advance physical AI , and how to use them.
Cosmos world foundation model updates:
NVIDIA Cosmos world foundation models have continued to evolve rapidly, with significant advancements that further accelerate synthetic data generation and physical AI development. One year after their introduction, key updates include:
Cosmos Transfer 2.5 —Faster and more scalable data augmentation from simulation and 3D spatial inputs, enabling greater diversity across environments, lighting conditions, and scene variations.
Cosmos Predict 2.5 —Enhanced long-tail scenario generation for sequences up to 30 seconds, delivering up to 10x higher accuracy when post-trained on proprietary or domain-specific data. Supports multiview outputs, custom camera layouts, and alternate policy outputs such as action simulation.
Cosmos Reason 2 —Advanced physical AI reasoning with improved spatiotemporal understanding and timestamp precision. Adds object detection with 2D/3D point localization and bounding box coordinates, along with reasoning explanations and labels. Expanded long-context support up to 256K input tokens.
Cosmos Transfer for photorealistic videos grounded in physics
Cosmos Transfer generates high-fidelity world scenes from structural inputs, ensuring precise spatial alignment and scene composition.
Employing the ControlNet architecture, Cosmos Transfer preserves pretrained knowledge, enabling structured, consistent outputs. It utilizes spatiotemporal control maps to dynamically align synthetic and real-world representations, enabling fine-grained control over scene composition, object placement, and motion dynamics.
Inputs :
Structured visual or geometric data: segmentation maps, depth maps, edge maps, human motion keypoints, LiDAR scans, trajectories, HD maps, and 3D bounding boxes.
Ground truth annotations: high-fidelity references for precise alignment.
Output : Photorealistic video sequences with controlled layout, object placement, and motion.
Figure 1. On the left, a virtual simulation or ‘ground truth’ created in NVIDIA Omniverse. On the right, photoreal transformation using Cosmos Transfer
Key capabilities:
Generate scalable, photorealistic synthetic data that aligns with real-world physics.
Control object interactions and scene composition through structured multimodal inputs.
Using Cosmos Transfer for controllable synthetic data
With generative AI APIs and SDKs, NVIDIA Omniverse accelerates physical AI simulation. Developers use NVIDIA Omniverse, built on OpenUSD , to create 3D scenes that accurately simulate real-world environments for training and testing robots and autonomous vehicles. These simulations serve as ground truth video inputs for Cosmos Transfer, combined with annotations and text instructions. Cosmos Transfer enhances photorealism while varying environment, lighting, and visual conditions to generate scalable, diverse world states.
This workflow accelerates the creation of high-quality training datasets, ensuring AI agents generalize effectively from simulation to real-world deployment.
Figure 2 . Generative API and SDKs in NVIDIA Omniverse power ground truth simulation for Cosmos Transfe r
Figure 3. A photoreal video produced by Cosmos Transfer
Cosmos Transfer enhances robotics development by enabling realistic lighting, colors, and textures in the Isaac GR00T Blueprint for synthetic manipulation motion generation and Omniverse Blueprint for Autonomous Vehicle Simulation for varying environmental and weather conditions for training. This photorealistic data is crucial for post-training policy models, ensuring smooth simulation-to-reality transfer and supporting model training for perception AI and specialized robot models like GR00T N1 .
How to run the new Cosmos Transfer 2.5:
To run inference on new Cosmos Transfer 2.5, follow the inference guide .
To post-train on proprietary or domain data, follow the post-training guide .
Explore NVIDIA Cosmos Cookbook for step-by-step workflows and technical recipes from Cosmos users.
Cosmos Predict for generating future world states
Cosmos Predict WFM is designed to model future world states as video from multimodal inputs, including text, video, and start-end frame sequences. It is built using transformer-based architectures that enhance temporal consistency and frame interpolation.
Key capabilities:
Generates realistic world states directly from text prompts.
Predict next states based on video sequences by predicting missing frames or extending motion.
Multiframe generation between a starting and ending image, creating a complete, smooth sequence.
Cosmos Predict WFM provides a strong foundation for training downstream world models in robotics and autonomous vehicles. You can post-train these models to generate actions instead of video for policy modeling or adapt it for visual-language understanding to create custom perception AI models.
How to run the new Cosmos Predict 2.5:
To run inference on new Cosmos Predict 2.5, follow the inference guide .
To post-train on proprietary or domain data, follow the post-training guide .
Explore the NVIDIA Cosmos Cookbook for step-by-step workflows and technical recipes from Cosmos users.
Cosmos Reason to perceive, reason, and respond intelligently
Cosmos Reason is a fully customizable multimodal AI reasoning model that is purpose-built to understand motion, object interactions, and space-time relationships. Using chain-of-thought (CoT) reasoning, the model interprets visual input, predicts outcomes based on the given prompt, and rewards the optimal decision. Unlike text-based LLMs, it grounds reasoning in real-world physics, generating clear, context-aware responses in natural language.
Input : Video observations and a text-based query or instruction.
Output: Text response generated through long-horizon CoT reasoning.
Key capabilities:
Knows how objects move, interact, and change over time.
Predicts and rewards the next best action based on input observation.
Continuously refines decision-making.
Purpose-built for post-training to build perception AI and embodied AI models.
Training pipeline
Cosmos Reason is trained in three stages, enhancing its ability to reason, predict, and respond to decisions in real-world scenarios.
Pretraining : Uses a Vision Transformer (ViT) to process video frames into structured embeddings, aligning them with text for a shared understanding of objects, actions, and spatial relationships.
Supervised fine-tuning (SFT): Specializes the model in physical reasoning across two key levels. General fine-tuning enhances language grounding and multimodal perception using diverse video-text datasets, while more training on physical AI data sharpens the model’s ability to reason about real-world interactions. It learns object behaviors like how objects can be used in the real world, action sequences, determining how multi-step tasks unfold, and spatial feasibility to distinguish realistic from impossible placements.
Figure 4. Reinforcement learning feedback loop continuously improves through positive and negative feedback and model adjustments
Reinforcement learning (RL) : The model evaluates different reasoning paths and updates itself only when a better decision emerges through trial and reward feedback. Instead of relying on human-labeled data, it uses rule-based rewards:
Entity recognition: Rewarding accurate identification of objects and their properties.
Spatial constraints: Penalizing physically impossible placements while reinforcing realistic object positioning.
Temporal reasoning: Encouraging correct sequence prediction based on cause-effect relationships.
How to run the new Cosmos Reason 2:
To run inference on new Cosmos Reason 2, follow the inference guide .
To post-train on proprietary or domain data, follow the post-training guide .
Explore the NVIDIA Cosmos Cookbook for step-by-step workflows and technical recipes from Cosmos users.
Get started
Visit our Cosmos Cookbook for step-by-step workflows, technical recipes, and concrete examples for building, adapting, and deploying Cosmos WFMs.
Explore new open Cosmos models and datasets on Hugging Face and GitHub or try models on build.nvidia.com .
Be part of the community and join our Cosmos Discord channel .
Already using Cosmos? Learn more about how to contribute .
Watch the GTC keynote from NVIDIA founder and CEO Jensen Huang and explore Cosmos sessions .
Updated on March 13, 2026, with advancements to NVIDIA Cosmos world foundation models .
Discuss (0)
Like
Tags
Agentic AI / Generative AI | Content Creation / Rendering | Robotics | Simulation / Modeling / Design | General | Cosmos | Intermediate Technical | Deep dive | GTC March 2025 | featured | GTC 2026
About the Authors
About Pranjali Joshi
Pranjali Joshi is the product marketing manager for NVIDIA Omniverse, focusing on core technologies and visual generative AI. She holds an M.Sc. degree in data science and marketing strategy from the University of Maryland and a B.Sc. in electronics engineering. Previously, she worked at Accenture and Hitachi Vantara in software development and technology marketing roles.
View all posts by Pranjali Joshi
About Asawaree Bhide
Asawaree Bhide is a technical marketing engineer at NVIDIA, working on robotics and deep learning applications on the Jetson platform. She did her master’s in computer science at Georgia Tech and is interested in solving complex perception tasks in autonomous navigation for embodied agents.
View all posts by Asawaree Bhide
Comments
Related posts
How to Scale Data Generation for Physical AI with the NVIDIA Cosmos Cookbook
How to Scale Data Generation for Physical AI with the NVIDIA Cosmos Cookbook
R²D²: Boost Robot Training with World Foundation Models and Workflows from NVIDIA Research
R²D²: Boost Robot Training with World Foundation Models and Workflows from NVIDIA Research
R²D²: Training Generalist Robots with NVIDIA Research Workflows and World Foundation Models
R²D²: Training Generalist Robots with NVIDIA Research Workflows and World Foundation Models
Develop Custom Physical AI Foundation Models with NVIDIA Cosmos Predict-2
Develop Custom Physical AI Foundation Models with NVIDIA Cosmos Predict-2
Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform
Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform
Related posts
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
L
T
F
R
E
|
|
|
Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library | NVIDIA Technical Blog |
nvidia_dev_blog |
09.03.2026 17:00 |
0.717
|
| Embedding sim. | 0.8001 |
| Entity overlap | 0.1111 |
| Title sim. | 0.2623 |
| Time proximity | 0.994 |
| NLP тип | product_launch |
| NLP организация | NVIDIA |
| NLP тема | ai infrastructure |
| NLP страна | |
Открыть оригинал
Deploying large language models (LLMs) requires large-scale distributed inference , which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving , KV cache loading, and wide expert parallelism.
In disaggregated serving environments, prefill and decode phases run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency, high-throughput communication to move these KV caches is critical to realizing the benefits of disaggregated serving.
In KV cache loading, storage helps manage growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For long-context KV, previously computed results can be loaded from local SSDs or remote storage instead of being recomputed during prefill. This is one example of why storage is becoming a core part of inference workloads.
In wide expert parallelism, experts are split across many GPUs, and the intermediate results (activations) have to be dispatched to and combined from these experts. Because activations exchanged between stages require ultra-low-latency communication, these transfers are typically initiated by the GPU through optimized kernels, referred to as device-side APIs for networking, or the device API for short.
Another distinguishing feature of inference workloads is their need for dynamicity and resiliency. Services run 24 hours a day, seven days a week, and the number of GPUs in use can change with user demand. There can also be more fine-grained dynamicity: the ratio of GPUs doing prefill and decode might change or, in the case of elastic expert parallelism, the number of replicated experts, or even the total number of experts, can change.
In the event of failures, the system needs to be resilient, running at lower throughput for a brief period of time until the recovery mechanism handles the failure. This requirement extends the system’s dynamicity needs by detecting failures and managing the transitional state until recovery completes.
Finally, while there is a need for heterogeneous hardware support in terms of memory and storage, there can be heterogeneity in compute hardware as well. Handling each of these unique hardware components can become cumbersome. This requires a library that can unify different communication and storage technologies, which ensures that frameworks can efficiently move data across various memory and storage hierarchies: GPU memory, CPU memory, and many tiers of local and distributed storage from NVMe to cloud object stores.
NVIDIA Inference Transfer Library (NIXL) is an open source, vendor-agnostic data movement library designed to support these dynamic, complex AI inference frameworks by offering a unified and powerful abstraction to move data across many memory and storage technologies.
This post explains NIXL core concepts, including agents, memory registration, metadata exchange, descriptors, transfer creation and management, and backend plugins. It also explains the usage flow of this library, highlights available performance tools, and provides a few examples to help you get started.
What is NIXL?
NIXL is an open source library for accelerating point-to-point data transfers in AI inference frameworks. NIXL provides a single, easy-to-use API that can be used to address a variety of data transfer challenges within these frameworks while maintaining maximum performance.
This API supports multiple technologies such as RDMA, GPU-initiated networking, GPUDirect Storage, block and file storage, and advanced cloud storage options including S3 over RDMA and Azure Blob Storage. It is vendor-agnostic and can run across diverse environments. For example, it supports Amazon Web Services (AWS) with EFA networking and Trainium or Inferentia accelerators, as well as Azure with RDMA networking. The team is working with Google Cloud to add both RDMA and GPUDirect-TCPXO networking. NIXL is already a key component of many AI inference frameworks such as NVIDIA Dynamo, NVIDIA TensorRT-LLM, vLLM, SGLang, Anyscale Ray, LMCache, and more.
Figure 1. NIXL addresses three core challenges in distributed AI inference: heterogeneous resources, dynamic workload, and massive scale
Core use cases of NIXL include:
Disaggregation: Moves KV blocks between prefill and decode workers with high throughput and low latency
Long-context KV cache storage: Stores KV cache data in a long-term storage medium to avoid recomputation later
Weight transfer: Ships model weights to GPU nodes for fast startup or resharding. The weights might come from GPU memory, host memory, or storage
Reinforcement learning: Streams updated weights from learners to actors with minimal transfer overhead
Elastic expert parallelism: Dispatch and combine stages in expert parallelism can be done through NIXL, with support for dynamic reconfiguration
The unified NIXL API covers different types of memory and storage, while its pluggable backend design allows the API to target many different high-performance technologies (RDMA, GPU-initiated networking, GPUDirect Storage, NVMe, object stores, and so on). NIXL is designed to have a fully non-blocking API and incur minimal overhead, enabling efficient overlap of communication and computation with high-performance zero-copy transfers.
The NIXL dynamic metadata exchange enables a network of NIXL agents to scale up and down at runtime. This feature makes it practical for dynamic, long-running services where compute nodes are constantly added based on user load, removed due to failures, or recycled for different purposes.
These features enable NIXL to abstract away various memory and storage types for the user of the library, while supporting a wide range of high-performance transfer backends. Additionally, dynamicity and resiliency are baked in throughout the NIXL design, targeting inference applications that run 24 hours a day, seven days a week.
NIXL design
NIXL functions as a standalone library, providing the necessary abstraction for various network and storage backends. These abstractions assume a conductor process that determines when transfers are required, and a NIXL transfer agent that handles them. All of this is done in an object-oriented manner. The transfer terminology is based on writing to or reading from a remote agent (or within the local agent). These write and read operations are also referred to as put and get.
This terminology enables a unified API that supports both efficient one-sided network communications and storage transfers. The user describes any memory or storage through a list of descriptors, which carries an encompassing type indicating whether the data resides in host memory, GPU memory, or some type of storage. Each descriptor within a descriptor list points to a location in memory or storage: for example, a base address and a size in host or GPU memory or on an SSD, or similarly a location within a file or storage object. Note that each descriptor list must use a single memory type, but a transfer can traverse memory types, for example, sending from GPU memory to host memory.
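The descriptor-list model described above can be sketched in plain Python. This is an illustrative mock, not the NIXL API; the Descriptor and DescriptorList names and fields are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Descriptor:
    """One contiguous region: a base address (or file offset) plus a size."""
    addr: int
    size: int

@dataclass
class DescriptorList:
    """A set of regions that all share one memory type (DRAM, VRAM, FILE, ...)."""
    mem_type: str
    descs: List[Descriptor]

    def total_bytes(self) -> int:
        return sum(d.size for d in self.descs)

# A transfer pairs two descriptor lists of equal total size; the memory types
# may differ across the transfer (e.g. GPU memory on one side, host on the other).
src = DescriptorList("VRAM", [Descriptor(0x1000, 4096), Descriptor(0x9000, 4096)])
dst = DescriptorList("DRAM", [Descriptor(0x2000, 4096), Descriptor(0x6000, 4096)])
assert src.total_bytes() == dst.total_bytes() == 8192
```

Each list stays within one memory type, while the pair as a whole crosses from GPU to host memory, mirroring the constraint in the paragraph above.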
The conductor gives the NIXL agent access to the desired allocated memories through a registration call. When using one-sided read or write operations, keys or identifiers are generated, so only other processes that have the proper key can access that memory. NIXL encapsulates such information for these registrations, as well as the required connection info, into a metadata object. Inside the NIXL agent, Memory Section and Metadata Handler components are in charge of managing the necessary local and remote information respectively.
The conductor process is also in charge of dynamically exchanging the relevant metadata objects to decide which agents can talk to each other at each point in time. The conductor process can directly obtain the metadata object from one agent, and load it into another agent. For the case of device API usage by GPU kernels, there is one more preparation step necessary to send the relevant local and remote metadata to the GPU.
This metadata exchange is only necessary for remote agent transfers, not for local memory or storage transfers. For remote storage, NIXL talks to the local client of the distributed storage system, making the operation a loopback transfer within the agent. NIXL also provides optional facilitating methods to exchange such metadata through a direct socket connection or a central metadata service such as etcd.
Now the conductor process can ask the NIXL agent to prepare a transfer request. NIXL first checks whether the required information is available for this transfer. If it is, the conductor process can ask the NIXL agent to start the transfer. It can also monitor the transfer status until it is complete, in a nonblocking manner. Device API mode operates in a similar manner, from the GPU kernel.
The NIXL agent will internally find the optimal backend for carrying out this transfer request, and deliver the prepared request to that backend (unless the user specifies the desired backend). This enables NIXL to achieve high performance and remain hardware agnostic. Figure 2 shows the current list of supported backends, which is expanding with the rapid adoption of NIXL.
Figure 2. NIXL architecture consists of a core transfer agent with a Memory Section and Metadata Handler, and supports multiple transfer backend plugins through an API
Example NIXL use case
The following NIXL use case explores how applications or conductor processes can use the NIXL API to perform an asynchronous point-to-point data transfer using a high-performance networking library.
For the case of transferring between two agents, one agent plays the role of the initiator, which creates and starts the read or write operation. The other agent plays the role of the target, whose memory is being accessed.
These roles are defined per transfer during the application run based on who invokes the operation. The initiator agent checks the status of transfer locally, and typically sends a notification to the target agent to indicate when the transfer is complete.
Setting up the agents
Setting up the initiator and target agents involves the following steps:
Step 1: Agent creation
At startup, each application spawns a runtime agent configured with relevant initialization parameters. The agent initializes the specified transfer backends, or uses UCX as the default if none are provided. UCX is a community-driven networking library and is widely tested internally. The user also gives the agent a name, which can be any string, such as a UUID.
Step 2: Memory registration
Users allocate memory on their chosen devices—GPU, CPU, storage—and register these regions with the agent through NIXL descriptors. NIXL will internally pass that information to each relevant backend that supports that memory type.
Optimization tip: Most backend registrations must go through a kernel call, which can be time consuming. It is advised to minimize the number of registrations by registering larger blocks of memory, as transfers can be created anywhere within the registered memory.
Step 3: Metadata exchange
Target agent metadata is shared with initiator agents for planned transfers. During runtime, new metadata can be loaded, or metadata of another agent can be removed. This is a key feature that enables dynamicity for the NIXL library.
Optimization tip: When new registrations or deregistrations occur, the updated metadata needs to be exchanged. If one side has dynamic registrations and deregistrations, while the other side has fixed buffers to receive the data, it is advised to make the former side the initiator agent. This removes the need for extra metadata exchanges.
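The three setup steps above can be modeled with a small self-contained mock. Plain Python stands in for the real NIXL agent here; every class and method name is a hypothetical illustration of the flow, not the library's API:

```python
import uuid

class MockAgent:
    """Toy stand-in for a NIXL agent: tracks backends, registered memory,
    and metadata loaded from remote agents."""
    def __init__(self, name=None, backends=None):
        # Step 1: agent creation; UCX is the default backend if none is given,
        # and the name can be any string, such as a UUID.
        self.name = name or str(uuid.uuid4())
        self.backends = backends or ["UCX"]
        self.registered = {}   # region id -> (mem_type, base, size)
        self.remote_meta = {}  # remote agent name -> its metadata

    def register_memory(self, region_id, mem_type, base, size):
        # Step 2: register one large block; transfers can later target any
        # sub-range of it, so fewer, bigger registrations are cheaper.
        self.registered[region_id] = (mem_type, base, size)

    def get_metadata(self):
        # Metadata bundles the agent's identity with its registration info.
        return {"agent": self.name, "regions": dict(self.registered)}

    def load_remote_metadata(self, meta):
        # Step 3: metadata exchange makes a remote agent addressable.
        self.remote_meta[meta["agent"]] = meta

initiator = MockAgent("initiator")
target = MockAgent("target")
target.register_memory("kv_pool", "VRAM", base=0x4000, size=1 << 20)
initiator.load_remote_metadata(target.get_metadata())
assert "target" in initiator.remote_meta
```

Because the target's buffers are fixed while the initiator may register and deregister dynamically, this arrangement also reflects the optimization tip: only the target's metadata needs to be exchanged once.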
Preparing and performing the data transfer
After the metadata has been shared between the two peer agents, the initiator performs the following steps:
Step 1: Create the transfer request
The transfer request indicates the operation type, READ or WRITE, as well as the initiator and the target descriptors to be used. A notification can be optionally specified. NIXL will verify these descriptors, decide on the transfer backend, and deliver the descriptors to that backend if preparations are required.
Step 2: Start (or post) the transfer request
NIXL issues this request to the appropriate backend, keeping overhead low. The backend performs the data transfer between the source and destination addresses, using the system libraries and drivers underneath to carry out the transfer efficiently.
Step 3: Check transfer status
To enable overlap of compute and communication, the post call is nonblocking, which requires the user to check the status of a transfer separately. Note that the transfer might complete, or might result in an error (network failure, for example). Such failure does not impact the other agents in the system, nor the transfers within the same agent that don’t face that network failure.
On the target side, the user can look for notifications that indicate a transfer is complete. The initiator's name appears in the notification along with the notification message, so the target agent does not need to know the initiator's name beforehand.
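The transfer steps just described, a nonblocking post, status polling on the initiator, and a completion notification delivered to the target, can be mocked in the same spirit (all names here are illustrative stand-ins, not the NIXL API):

```python
from enum import Enum

class XferState(Enum):
    IN_PROGRESS = 0
    DONE = 1
    ERROR = 2

class MockTransfer:
    """Toy nonblocking transfer: posting returns immediately and the caller
    polls status; on completion the target receives a notification."""
    def __init__(self, op, initiator_name, notif_msg, target_inbox):
        assert op in ("READ", "WRITE")
        self.op = op
        self.initiator_name = initiator_name
        self.notif_msg = notif_msg
        self.target_inbox = target_inbox
        self.state = XferState.IN_PROGRESS
        self._polls_left = 3  # pretend the wire takes a few polls

    def check_status(self):
        if self.state is XferState.IN_PROGRESS:
            self._polls_left -= 1
            if self._polls_left == 0:
                self.state = XferState.DONE
                # The notification carries the initiator's name and a message,
                # so the target need not know the initiator beforehand.
                self.target_inbox.append((self.initiator_name, self.notif_msg))
        return self.state

target_notifications = []
xfer = MockTransfer("READ", "prefill-0", "kv-block-42-ready", target_notifications)
while xfer.check_status() is XferState.IN_PROGRESS:
    pass  # a real conductor would overlap compute here instead of spinning
assert target_notifications == [("prefill-0", "kv-block-42-ready")]
```

The busy loop is only for illustration; the point of the nonblocking design is that the conductor does useful work between status checks.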
Tear down
When a NIXL agent is deleted, NIXL will automatically deregister the local registered memories. If an active transfer is being directed towards this NIXL agent, it will simply result in an error status. If local transfers are not finished, NIXL will try to release them during agent destruction. However, it is advised to preemptively release those transfer requests.
NIXL performance benchmarking tools
Performance benchmarking tools are valuable for inference systems. They can be used to verify that a system is operating as intended, or find the best backend for a specific enterprise system. They can also help verify performance improvements for a specific backend.
NIXL provides a two‑layer setup, through a low-level benchmark called NIXLBench and an LLM-aware profiler called KVBench.
NIXLBench is intentionally model‑agnostic and maintains a simple system view. It executes real data transfers, sweeps block and batch sizes, and reports bandwidth metrics with latency percentiles. NIXLBench relies on etcd to exchange transfer metadata for network backends, but not for storage backends as there is no need for metadata exchange.
KVBench accelerates benchmarking and iteration for LLM engineers by automatically calculating the exact KV cache I/O size and batch size for supported models, and generating a ready-to-run NIXLBench command. KVBench can also profile KV cache transfers using its CTPerfTest module.
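For a standard dense transformer, the KV cache I/O size that KVBench computes reduces to a simple per-token formula; the sketch below shows that formula with illustrative model shapes (the numbers are examples, not taken from KVBench itself):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache produced per token: 2x for the K and V tensors,
    per layer, per KV head, per head dimension, times element size."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: a Llama-3-8B-like shape (32 layers, 8 KV heads, head_dim 128, FP16).
per_token = kv_bytes_per_token(32, 8, 128, 2)
assert per_token == 131072  # 128 KiB per token

# KV cache that must move for a 4K-token context in disaggregated serving:
assert 4096 * per_token == 512 * 1024 * 1024  # 512 MiB
```

Numbers like these are what make a calculator useful: the transfer size depends on four model parameters, and getting any of them wrong skews the benchmark.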
Get started with NVIDIA Inference Transfer Library
NIXL software is fully open source and available on the ai-dynamo/nixl GitHub repo. It is written in C++ for high performance, efficiency, and composability. Several bindings are available, including C, Python, and Rust.
Currently, NIXL is only supported in Linux environments such as Ubuntu and RHEL and is available prebuilt as a Python wheel distributable. We encourage you to try NIXL in your own AI inference frameworks and workloads.
To learn more, you can explore additional examples in the NIXL example guide. As a starting point, basic_two_peers is a simple two-peer Python example showing registration, metadata exchange, a single READ operation, notification, verification, and teardown. In addition, expanded_two_peers builds on the previous example by adding parallel READs and WRITEs with various preparation methods, reposting the same transfer request, and using patterns in notifications.
We welcome questions, contributions, pull requests, and feedback from the community on GitHub. Stay tuned for the upcoming NIXL v1.0.0 release. To learn more about NIXL, check out these additional resources:
NIXL core concepts and architecture overview
NIXL Python API use and examples
NIXL telemetry system
NIXL Benchmark
KVBench workflows and tutorials
Acknowledgments
The NVIDIA Inference Transfer Library product team acknowledges the valuable contributions of all open source developers, contributors, testers, and community members who have participated in its evolution.
About the Authors
About Seonghee Lee
Seonghee Lee is an engineer on the AI platform software team at NVIDIA, focusing on AI Inference-related products. Seonghee holds a master’s in computer science from Stanford University and a bachelor’s in science from Cornell University, specializing in AI. Before joining NVIDIA, she worked at Microsoft Research on developing real-time AI agent interactions.
About Moein Khazraee
Moein Khazraee serves as a senior networking and systems architect at NVIDIA, driven by a fascination with the intricacies of end-to-end systems. He completed his PhD at UC San Diego, followed by a postdoctoral position at MIT. His research centered on the software and hardware co-design of frameworks for networked systems. At NVIDIA, Moein’s work spans AI workload modeling, co-leading the architecture of the NVIDIA NIXL library, and network architecture for NVQLink.
About Timothy Stamler
Timothy Stamler is a senior systems architect for the Network Architecture team at NVIDIA and works on a variety of systems and performance optimization problems. Tim did his PhD at UT Austin, focusing on optimizing operating system IO stacks and making storage and network stacks work more cooperatively. Since coming to NVIDIA, he’s worked on extending the RDMA verbs API and making it easier for AI inference platforms to utilize high speed network technology through libraries like NIXL.
About Adit Ranadive
Adit Ranadive serves as senior software architect in the NVIDIA Networking Software Advanced Development Group, focusing on system technologies that enhance AI inference workloads. His work includes enabling high-performance computing software to utilize DPUs for IO stream processing and advancing distributed computing architectures. He holds advanced degrees in Computer Science from Georgia Tech and previously led initiatives in virtualization of high performance networks at VMware.
About Chris Hoge
Chris Hoge is the manager for the AI Platform Software Technical Marketing Engineering team at NVIDIA. Chris has worked in open source software for over 10 years, with a focus on AI, high-performance computing, and infrastructure. He holds a master’s degree in Applied Mathematics from the University of Colorado, and a bachelor’s degree in Systems Science and Mathematics from Washington University in St. Louis.
How Autonomous AI Agents Become Secure by Design With NVIDIA OpenShell
nvidia_blog
23.03.2026 15:00
NVIDIA OpenShell provides tools for controlling autonomous agents in a trusted infrastructure policy layer — adding security in the environment, rather than the model or application layer.
March 23, 2026 by Ali Golshan
Autonomous agents mark a new inflection point in AI. Systems are no longer limited to generating responses or reasoning through tasks. They can take action: Agents can read files, use tools, write and run code, and execute workflows across enterprise systems, all while expanding their own capabilities.
Application-layer risk grows exponentially when agents continuously improve and evolve. The NVIDIA OpenShell runtime is being built to address this.
Part of NVIDIA Agent Toolkit, OpenShell is an open source, secure-by-design runtime for running autonomous agents such as claws. It works by ensuring each agent runs inside its own sandbox, separating application-layer operations from infrastructure-layer policy enforcement.
This means security policies are out of the agent's reach: they're applied at the system level. Instead of relying on behavioral prompts, OpenShell enforces constraints on the environment the agent runs in, so the agent cannot override policies, or leak credentials or private data, even if compromised.
With OpenShell, enterprises can separate agent behavior, policy definition and policy enforcement. Organizations gain a single, unified policy layer to define and monitor how autonomous systems operate. Coding agents, research assistants and agentic workflows all run under the same runtime policies regardless of host operating system, simplifying compliance and operational oversight.
This is the “browser tab” model applied to agents: Sessions are isolated, resources are controlled and permissions are verified by the runtime before any action takes place.
Securing autonomous systems requires an integrated ecosystem. OpenShell is designed to add privacy and security controls for AI agents. NVIDIA is collaborating with security partners, including Cisco, CrowdStrike, Google Cloud, Microsoft Security, and TrendAI, to align runtime policy management and enforcement for agents across the enterprise stack.
OpenShell Provides an Enterprise-Grade Sandbox for Building Personal AI Assistants
NVIDIA NemoClaw is an open source reference stack that simplifies installing OpenClaw always-on assistants with the OpenShell runtime and NVIDIA Nemotron models in a single command.
NemoClaw provides enthusiasts with an open reference for building self-evolving personal AI agents, or claws. Since security needs vary, NemoClaw provides a reference example for policy-based privacy and security guardrails to give users more control over their agents’ behavior and data-handling. Users can customize it for their specific use cases — much like adjusting security preferences for applications on a phone.
NemoClaw includes an example configuration of OpenShell that defines how the agent should interact with systems. NemoClaw uses open source models like NVIDIA Nemotron alongside OpenShell.
This enables self-evolving claws to run more securely in clouds, on premises, or on personal computers, including NVIDIA GeForce RTX PCs and laptops or NVIDIA RTX PRO-powered workstations, as well as NVIDIA DGX Station and NVIDIA DGX Spark AI supercomputers.
Both OpenShell and NemoClaw are in early preview. NVIDIA is building in the open with the community and its partners to enable enterprises to scale self-evolving, long-running autonomous agents safely, confidently and in compliance with global security standards.
Get started with NVIDIA OpenShell and launch a ready-to-use environment on NVIDIA Brev, or explore the open source project on GitHub.
Reliable AI Coding for Unreal Engine: Improving Accuracy and Reducing Token Costs | NVIDIA Technical Blog
nvidia_dev_blog
10.03.2026 15:30
Agentic code assistants are moving into daily game development as studios build larger worlds, ship more DLCs, and support distributed teams. These assistants can accelerate development by helping with tasks like generating gameplay scaffolding, refactoring repetitive systems, and answering engine-specific questions faster.
This post outlines how developers can build reliable AI coding workflows for Unreal Engine (UE) 5, from individual setups to team and enterprise-scale systems. Reliability is critical because real-world Unreal codebases are defined by engine conventions, large C++ projects, custom tools, branch differences, and studio-specific coding patterns that generic AI often fails to understand.
The core challenge is the context gap. Failures rarely come from weak code generation, but from missing constraints such as code patterns, branch differences, or internal conventions. Improving context retrieval reduces guesswork and makes AI output reliable enough for production use.
NVIDIA works with game studios to improve AI reliability in large UE environments by combining syntax-aware code indexing, hybrid search techniques, and GPU-accelerated vector search infrastructure. The objective is to improve reliability and reduce review overhead in production Unreal pipelines.
Solving this gap scales with team complexity. Developers need fast engine-aware answers. Teams require codebase-aware assistance for multi-file workflows. Enterprises depend on retrieval-native systems that maintain accuracy across large, governed codebases.
Reducing documentation friction for UE developers
For developers, the context gap shows up as documentation friction. Unreal development often requires fast answers about engine patterns and conventions. The cost is the time spent searching and translating documents into usable code.
Unreal Assistant–style workflows combine documentation retrieval with engine-compatible code generation, helping developers move quickly from question to a correct starting point. The goal is reducing boilerplate and accelerating common Unreal tasks.
The following is an example of engine-aware starter code generated for an Unreal gameplay component.
// Example: UE5 C++ starter component generated from an engine-specific prompt
#pragma once

#include "CoreMinimal.h"
#include "Components/ActorComponent.h"
#include "HeatMeterComponent.generated.h"

UCLASS(ClassGroup=(Custom), meta=(BlueprintSpawnableComponent))
class UHeatMeterComponent : public UActorComponent
{
    GENERATED_BODY()

public:
    UHeatMeterComponent();

    UPROPERTY(EditAnywhere, BlueprintReadWrite, Category="Heat")
    float Heat = 0.0f;

    UFUNCTION(BlueprintCallable, Category="Heat")
    void AddHeat(float Amount);
};
This tier stays reliable when the problem is narrow and grounded in engine docs or common UE patterns. Once the task becomes repo-dependent, cross-module, or branch-specific, the limiting factor becomes codebase context, not code generation. That is where teams benefit from a workflow designed to keep context strong across multiple files.
Supporting multi-file workflows in UE teams
Teams at small and mid-sized studios typically hit a different version of the context gap. The assistant can generate plausible code, but it cannot reliably operate across multiple files and conventions without creating review debt. The problem becomes multi-file reasoning, predictability, and change control across a real codebase.
This is where a hybrid Unreal workflow becomes valuable. Use an AI-first editor for planning, multi-file edits, and codebase-aware changes, while keeping Visual Studio in the loop for reliable Windows debugging. The goal is to strengthen the parts of the workflow that consume time and attention, while keeping debugging and iteration stable.
Get started in 10 to 15 minutes
The following is the fastest path to edit, build, and iterate.
1. Install Cursor, then Visual Studio 2022 with the Desktop development with C++ workload (for the MSVC toolchain and debugging).
2. Tell Unreal to generate a VS Code-style workspace. In Unreal Editor Preferences, set Source Code Editor to Visual Studio Code. Cursor may not appear in the list; select VS Code to enable the VS Code-style workspace generation that Cursor opens.
3. Generate project files using one of these options: in Unreal Editor (if available), use Tools > Refresh Visual Studio Code Project; or right-click your .uproject and select Generate Project Files.
4. Open the generated .code-workspace file in Cursor (recommended). It typically includes build tasks.
5. Get basic C++ code intelligence. In Cursor, install C/C++ (Microsoft). For deeper navigation on macro-heavy UE code, also install clangd (LLVM) (optional, but strongly recommended).
6. Build once from Cursor. Use Terminal > Run Build Task and run your editor target build (for example, YourProjectEditor Win64 Development build).
Note: Cursor is best used for code generation, refactoring, and multi-file editing, while Visual Studio remains the recommended environment for game and engine-level debugging. The full guide goes deeper on compile_commands.json, tasks, and troubleshooting.
The underlying point for studios is that team-scale code assistance must behave like a predictable teammate. It needs to plan before it edits, keep changes scoped, respect conventions, and support review. When those behaviors are in place, AI becomes a repeatable way to accelerate real development work across a shared codebase.
Maintaining accuracy across enterprise-scale C++ codebases
For major publishers, the challenge is keeping models grounded inside massive UE environments filled with proprietary systems, branch divergence, and strict governance. When assistants retrieve incomplete or incorrect context, plausible code quickly turns into costly integration failures, slowing iteration and increasing review burden for senior engineers.
The solution is to treat retrieval as core production infrastructure, making context accurate, structured, and fast enough for developer workflows.
Key building blocks for reliable enterprise AI coding
At enterprise scale, reliable AI coding depends on a few core building blocks that keep context accurate, fast, and usable across large codebases.
AST-based, syntax-aware chunking
Code is structure, not text. Chunking at AST boundaries preserves full functions, signatures, and control flow, creating coherent units that are safer to retrieve, reason over, and edit.
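The boundary-preserving idea can be illustrated with Python's built-in ast module; a production pipeline for UE would use a clang-based C++ parser instead, so this is only a minimal sketch of the principle:

```python
import ast

def chunk_at_function_boundaries(source: str):
    """Split source into one chunk per top-level function/class, so each
    chunk is a complete, syntactically coherent unit rather than a slice
    of text that may cut a function in half."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's full extent (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

code = '''
def add_heat(meter, amount):
    meter.heat += amount
    return meter.heat

class HeatMeter:
    def __init__(self):
        self.heat = 0.0
'''
chunks = chunk_at_function_boundaries(code)
assert len(chunks) == 2
assert chunks[0].startswith("def add_heat")
assert chunks[1].startswith("class HeatMeter")
```

Each chunk carries a full signature and body, which is exactly what makes it safe to retrieve and reason over in isolation.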
Hybrid search with NVIDIA NeMo Retriever NIM
Enterprise code search blends semantic understanding with exact matching. Hybrid retrieval combines dense embeddings with lexical signals like identifiers and error strings, then reranks results to balance recall, precision, and scalability across large repositories.
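A minimal version of that hybrid scoring, dense similarity blended with lexical overlap and then ranked, might look like the following; the tokenization and the alpha weighting are illustrative stand-ins, not NeMo Retriever's implementation:

```python
import math
import re

def cosine(a, b):
    """Dense (semantic) signal: cosine similarity between embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def lexical_overlap(query, doc):
    """Lexical (exact-match) signal: identifiers and error strings
    matter verbatim, so token overlap complements embeddings."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.6):
    # alpha blends the dense score against the lexical score.
    scored = [
        (alpha * cosine(query_vec, vec)
         + (1 - alpha) * lexical_overlap(query, text), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [
    ("UHeatMeterComponent::AddHeat clamps Heat to MaxHeat", [0.9, 0.1]),
    ("General docs about actor components", [0.2, 0.8]),
]
ranked = hybrid_rank("AddHeat clamp Heat", [1.0, 0.0], docs)
assert ranked[0].startswith("UHeatMeterComponent")
```

A production system adds a reranking stage on top of this blend, but the core trade-off, semantic recall versus exact-match precision, is already visible in the alpha parameter.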
GPU-accelerated vector search with NVIDIA cuVS
Higher-dimensional embeddings improve semantic fidelity but introduce latency challenges. GPU-accelerated vector search maintains real-time responsiveness using techniques like quantization, dimensionality reduction, and tiered indexing, keeping retrieval fast at enterprise scale.
From reliable retrieval to production-ready AI agents
Once retrieval is stabilized, AI agents become more reliable because they operate on grounded context instead of improvisation.
Model Context Protocol (MCP) enables this at an organizational scale by standardizing how agents access tools and internal systems. Rather than hardwiring integrations, MCP exposes governed resources such as code search, build logs, documentation, and ticketing systems as structured, secure tools that agents can call consistently.
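Conceptually, the pattern is a registry of governed, named tools that agents invoke uniformly instead of hardwired integrations. The toy sketch below is hypothetical: it does not use the actual MCP SDK or JSON-RPC transport, and the tool names are invented.

```python
# Hypothetical toy registry illustrating the MCP idea of governed, named
# tools; a real MCP server exposes these over the official SDK and protocol.
TOOLS = {
    "code_search": lambda query: f"results for {query!r}",
    "build_logs": lambda job_id: f"log tail for job {job_id!r}",
}

def call_tool(name, *args):
    # Agents reach internal systems only through the registry,
    # never through ad hoc integrations.
    if name not in TOOLS:
        raise KeyError(f"tool {name!r} is not registered")
    return TOOLS[name](*args)

print(call_tool("code_search", "GetHealthComponent"))
```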
With reliable retrieval and governed tool access in place, fine-tuning becomes a multiplier rather than a prerequisite. Studios can adapt models to internal APIs, coding standards, and recurring failure modes, improving correctness where it matters most.
The sequence is critical:
1. Ground context through strong retrieval.
2. Orchestrate safely through standardized tools.
3. Customize models for domain-specific accuracy.
Learn more
At GDC 2026
See how NVIDIA RTX neural rendering and AI are shaping the next era of game development. Hear John Spitzer, vice president of Developer and Performance Technology at NVIDIA, present the latest advances in path tracing and generative AI workflows, and join Bryan Catanzaro, vice president of Applied Deep Learning Research, for an interactive AI AMA session. You can also experience the technologies featured in this post at the NVIDIA booth 1426.
At NVIDIA GTC 2026
Attend Crack the Code: Enable AI Assistants for Massive C++ Codebases for a deeper enterprise perspective. Visit us at the NVIDIA booth to experience the technologies featured in this article firsthand.
Tags
Agentic AI / Generative AI | Content Creation / Rendering | Developer Tools & Techniques | Gaming | General | Beginner Technical | C++ | Unreal Engine
About the Authors
About Paul Logan
Paul Logan is a generative AI strategic marketing specialist at NVIDIA and an MBA candidate at Berkeley Haas. He has a background in developer relations, product marketing and management, and AI technologies, with specialization in the implementation of generative AI technologies. Paul has extensive experience working in high-growth tech environments and has previously held roles at Slack, Postman, and DataHub.
Implementing Falcon-H1 Hybrid Architecture in NVIDIA Megatron Core | NVIDIA Technical Blog
nvidia_dev_blog | 09.03.2026 19:30
In the rapidly evolving landscape of large language model (LLM) development, NVIDIA Megatron Core has emerged as the foundational framework for training massive transformer models at scale. The open source library offers industry-leading parallelism and GPU-optimized performance. Now developed GitHub-first in the NVIDIA/Megatron-LM repo, Megatron Core is increasingly shaped by contributions from foundation model builders, making it a more flexible, future-proofed engine for open AI models.
This post provides a technical overview of how the Technology Innovation Institute (TII), creators of the Falcon model family, have contributed to and integrated with Megatron Core and Megatron Bridge frameworks. The first section examines the implementation of the Falcon-H1 parallel hybrid architecture within Megatron Bridge, highlighting the challenges of coordinating heterogeneous Transformer and Mamba layers alongside non-learnable µP multipliers. The second section explores the integration of BitNet into Megatron Core, detailing the replacement of standard linear layers with ternary-parameter counterparts and the implications for training efficiency and scalability.
These contributions demonstrate how Megatron Core users can extend the framework to support their own custom model architectures and complex training features and leverage the work of others in the community.
Falcon-H1 hybrid architecture integration in Megatron Bridge
The implementation of the Falcon-H1 parallel hybrid architecture within Megatron Bridge highlights the challenges of coordinating heterogeneous Transformer and Mamba layers alongside non-learnable µP multipliers. Details of this integration are provided in the following sections.
Hybrid parallel design
At the core of the TII contributions to Megatron is the Falcon-H1 parallel hybrid architecture. The design diverges from the sequential layering found in other recent hybrid models. As shown in Figure 1, within each block, the attention mechanism and the SSM operate in parallel, and their outputs are concatenated before being passed through the block’s output projection. The number of SSM and attention heads is configurable and can be adjusted as needed.
Figure 1. The Falcon-H1 hybrid architecture processes input simultaneously within each core processing block to accelerate performance
Instead of stacking distinct layers, Falcon-H1 adopts a parallel design in which transformer-based attention and Mamba-2 state-space model (SSM) components process the input simultaneously within each core processing block.
The outputs from the attention and Mamba branches are concatenated prior to projection, allowing the model to fuse the superior long-context memory and efficiency of SSMs with the long-range dependency modeling of attention.
The ratio of parallel hybrid layers, pure Mamba layers, attention-only layers, and multilayer perceptron (MLP)-only layers within the model can be configured independently, enabling flexible architecture exploration.
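The parallel fusion described above can be sketched in a few lines. Branch internals are stubbed with random matrices and toy shapes, so this illustrates the dataflow only, not the Megatron implementation.

```python
import numpy as np

# Dataflow sketch of a Falcon-H1-style parallel hybrid block: the attention
# and SSM branches see the same input, their outputs are concatenated, and
# the result goes through the block's output projection.
rng = np.random.default_rng(0)
d_model, d_attn, d_ssm = 16, 8, 8

W_attn = rng.standard_normal((d_model, d_attn))         # attention branch stub
W_ssm = rng.standard_normal((d_model, d_ssm))           # Mamba-2 branch stub
W_out = rng.standard_normal((d_attn + d_ssm, d_model))  # output projection

def parallel_hybrid_block(x):
    attn_out = x @ W_attn
    ssm_out = x @ W_ssm
    fused = np.concatenate([attn_out, ssm_out], axis=-1)  # concat, then project
    return fused @ W_out

x = rng.standard_normal((4, d_model))  # (tokens, hidden)
y = parallel_hybrid_block(x)
print(y.shape)  # (4, 16)
```

Because the branches run on the same input rather than sequentially, their widths (d_attn, d_ssm) can be rebalanced independently, which is the knob the configurable head counts expose.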
Two-repo integration
The Falcon-H1 support spans two repositories with distinct responsibilities. In Megatron Core (Megatron-LM), TII contributed:
- The foundational ParallelHybridLayer, a layer that runs Mamba and attention in parallel and sums their outputs
- The updated layer allocation logic that introduces the PARALLEL symbol alongside the existing Mamba, attention, and MLP layer types
- Checkpoint conversion tools for loading and saving parallel hybrid models
In Megatron Bridge, TII built the complete Falcon-H1 model on top of these primitives:
- The FalconH1Layer extends the parallel design to include an MLP component (forming the full Mamba plus attention plus MLP block)
- The FalconH1Bridge provides bidirectional Hugging Face-to-Megatron weight conversion with specialized mappings for Mamba and attention parameters
- The FalconH1ModelProvider (with size-specific variants for 0.5B, 1.5B-Deep, 7B, and 34B) encapsulates all model configurations, including forward µP non-learnable multipliers
Integrating this hybrid design into the Megatron ecosystem required TII to address significant engineering challenges through several key architectural innovations, as detailed below.
Layer spec unification
Megatron Core uses ModuleSpec to define layer configurations. For Falcon-H1, this required extending MambaStackSubmodules to hold separate specs for mamba_layer, attention_layer, mlp_layer, and the new parallel_hybrid_layer. The MambaStack module iterates through a layer type list and builds the appropriate module for each position.
In Megatron Bridge, a corresponding FalconH1StackSubmodules adds a falconh1_layer spec that bundles all three components. This enables developers to mix and match Mamba and Transformer components within a single model definition.
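The allocation idea can be sketched as a lookup from a per-position layer-type pattern to the spec to build. The symbols mirror the post, but the code is illustrative, not Megatron's actual API.

```python
# Illustrative sketch of the layer-allocation idea: walk a per-position
# layer-type pattern and pick the matching submodule spec for each slot.
MAMBA, ATTENTION, MLP, PARALLEL = "M", "*", "-", "P"

SPEC_FOR_TYPE = {
    MAMBA: "mamba_layer",
    ATTENTION: "attention_layer",
    MLP: "mlp_layer",
    PARALLEL: "parallel_hybrid_layer",  # the new PARALLEL symbol
}

def build_stack(layer_pattern):
    # One spec per position in the pattern string.
    return [SPEC_FOR_TYPE[t] for t in layer_pattern]

print(build_stack("PM*-"))
# ['parallel_hybrid_layer', 'mamba_layer', 'attention_layer', 'mlp_layer']
```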
Weight mapping for checkpoint conversion
In Megatron Bridge, converting Hugging Face checkpoints to Megatron format requires specialized parameter mappings. The MambaInProjMapping class handles the complex splitting of Mamba in_proj weights into z, x, B, C, and dt components. These components must be correctly distributed across tensor parallel ranks while preserving numerical correctness.
The FalconH1Bridge manages tensor parallel resharding for both Mamba and attention layers in a single pass, alongside QKVMapping for fusing separate Q, K, and V projections and GatedMLPMapping for combining gate and up projections. In Megatron Core, the checkpoint conversion tools (loader_parallelhybrid and saver_parallelhybrid_hf) handle the translation between the Megatron distributed format and Hugging Face FalconH1ForCausalLM.
Tensor parallelism for SSM layers
Mamba layers have unique tensor parallel requirements. The A_log, D, and dt_bias tensors split along dimension 0, while x_proj splits along dimension 1. For Mamba-2, the in_proj and conv1d layers require special handling to correctly partition the z, x, B, C, and dt components across ranks.
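These split rules can be illustrated with NumPy. The shapes here are toy values, and real sharding runs through Megatron's tensor-parallel layers rather than explicit splits.

```python
import numpy as np

# Toy illustration of the sharding rules above: A_log-style per-head tensors
# shard along dimension 0, while x_proj shards along dimension 1, one shard
# per tensor-parallel rank. Shapes are invented for the example.
tp_size = 2
A_log = np.arange(8.0)                  # per-head tensor, split along dim 0
x_proj = np.arange(24.0).reshape(4, 6)  # projection weight, split along dim 1

A_log_shards = np.split(A_log, tp_size, axis=0)
x_proj_shards = np.split(x_proj, tp_size, axis=1)

print(A_log_shards[0].shape, x_proj_shards[0].shape)  # (4,) (4, 3)
```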
Beyond classical μP
To optimize the Falcon-H1 series, TII employed a customized maximal update parametrization (μP). While classical μP is rooted in neural network theory to enable effortless hyperparameter transfer from a base model size to larger models, Falcon-H1 extends this by tuning the μP multipliers themselves. This enables each component to train at the correct intensity.
Training spikes that are common in SSM-based models are addressed by applying dampening multipliers within the SSM block, leading to smoother training and cleaner experimental signals.
The µP multipliers in Falcon-H1 are stored as non-learnable tensors. They scale activations during the forward pass without accumulating gradients. This approach keeps memory overhead minimal while enabling fine-grained control over learning dynamics across 12 distinct scaling factors covering embeddings, attention, SSM, and MLP components.
For Megatron Bridge, this required adding multiplier extraction during Hugging Face checkpoint loading. The bridge reads multiplier values from the HF config and applies them at the correct forward pass locations. Both attention and Mamba components receive their respective scaling factors.
BitNet integration for Falcon Edge in Megatron Core
Falcon Edge is a series of ternary (1.58-bit) TII language models based on the BitNet architecture. To train Falcon Edge at scale, TII contributed BitNet pretraining support for GPT-like architectures to Megatron Core. This integration is a key step toward enabling scalable pretraining workflows with 1-bit LLMs, while preserving Megatron parallelism and performance characteristics.
TII introduced two new parallel linear layers: BitNetColumnParallelLinear and BitNetRowParallelLinear. These layers mirror existing Megatron tensor-parallel linear layers, but incorporate BitNet quantization logic. By embedding BitNet directly at the layer-spec level, the integration remains compatible with Megatron tensor parallelism, pipeline parallelism, and distributed training infrastructure.
Under the hood, the implementation leverages onebitllms Triton kernels for efficient activation and weight quantization.
During the forward pass, BitNet replaces full-precision matrix multiplications with quantized equivalents:
- Weights are quantized to ternary values {−1, 0, +1} using absolute mean scaling. The weight tensor is scaled by the reciprocal of its absolute mean, then rounded and clamped to {−1, 0, +1}.
- Activations are quantized to 8-bit precision using per-token absmax scaling. For each token, the maximum absolute value across the hidden dimension is computed, used to scale the activations into the [−128, 127] range, and the result is rounded to the nearest integer.
- The core linear operations are performed using these quantized weights and activations, leveraging the custom Triton kernels provided by onebitllms for optimization.
By utilizing ternary weights (1.58-bit), the model significantly reduces its memory footprint and enables faster inference speeds compared to full-precision counterparts.
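The two quantizers described above can be sketched in plain NumPy. The production path uses the onebitllms Triton kernels, and kernel-level details such as epsilon handling may differ from this sketch.

```python
import numpy as np

def weight_quant(w):
    # Ternary {-1, 0, +1} via absolute-mean scaling: scale by 1/mean(|w|),
    # then round and clamp.
    scale = 1.0 / max(np.abs(w).mean(), 1e-5)
    return np.clip(np.round(w * scale), -1, 1)

def activation_quant(x):
    # Per-token int8 absmax scaling along the hidden (last) dimension:
    # map each token's values into [-128, 127] and round.
    absmax = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-5)
    return np.clip(np.round(x * (127.0 / absmax)), -128, 127)

w = np.array([[0.4, -0.1], [-0.6, 0.2]])   # toy weight tile
x = np.array([[1.0, -2.0, 0.5]])           # one token's activations

print(weight_quant(w))
print(activation_quant(x))  # [[  64. -127.   32.]]
```

Note that the quantized linear op would also carry the scale factors forward so the output can be rescaled; they are omitted here for brevity.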
During the backward pass:
Gradients bypass the nondifferentiable quantization functions, enabling backpropagation to proceed as if the quantization step were an identity function.
Weight gradients are computed on the full-precision weights. Quantization is applied only during the forward pass, ensuring optimizer updates remain high fidelity.
Activation gradients follow standard backpropagation through quantization-aware layers.
Implementation
The BitNet integration in Megatron Core introduces minimal changes while maintaining full compatibility with existing parallelism strategies and Megatron Core scalability. Standard Linear layers are replaced with BitNetLinear variants, enabling ternary weight quantization while maintaining Megatron Core layer interfaces.
Activation and weight quantization kernels are integrated directly into the Megatron computation pipeline. Tensor parallelism is extended to support sharded quantized weights, with scaling factors handled per shard to preserve numerical correctness. Megatron fused kernels and communication patterns are retained, ensuring that ternary quantization delivers memory and bandwidth savings without sacrificing throughput.
Core components
Custom linear layers: Two new classes extend Megatron tensor-parallel layers: BitNetColumnParallelLinear extends ColumnParallelLinear, and BitNetRowParallelLinear extends RowParallelLinear.
Quantization integration: Both layers override _forward_impl to apply ternary weight quantization and 8-bit activation quantization using onebitllms Triton kernels (weight_quant_triton and activation_quant_triton).
Straight-through estimator (STE): Gradients bypass quantization using the pattern x_quantized = x + (quant(x) - x).detach(). This allows backpropagation through nondifferentiable quantization while maintaining full-precision weight updates.
Integration points
Layer specification system: BitNet layers are registered in get_gpt_layer_local_spec and get_mlp_module_spec, enabling activation through the --use-bitnet flag.
Tensor parallelism: Quantization is applied independently on each tensor-parallel shard after weights are partitioned, preserving numerical correctness across distributed computations.
Training requirements: BitNet requires --transformer-impl local and the onebitllms package. The implementation reuses existing Megatron communication patterns and fused kernels without modification.
The integration delivers significant weight memory savings and bandwidth improvements while maintaining compatibility with Megatron pipeline parallelism, gradient accumulation, and optimizer infrastructure.
Get started building foundation models with Megatron
TII Falcon-H1 hybrid architecture and BitNet ternary training support show how foundation model builders can extend Megatron Core and Megatron Bridge for their own architectures and training needs. These contributions are currently available.
To get started in Megatron-LM, check out BitNet pretraining and ParallelHybrid layer support. To get started in Megatron Bridge, check out Falcon-H1 checkpoint conversion and µP multiplier handling.
Tags
Agentic AI / Generative AI | Developer Tools & Techniques | General | Intermediate Technical | Deep dive | AI Foundation Models | Megatron | Open Source
About the Authors
About Mireille Fares
Mireille Fares is a generative AI solution architect at NVIDIA. She leads the technical generative AI engagement with the Technology Innovation Institute (TII) and supports customers across industries on advanced model-development workflows, leveraging the NVIDIA accelerated computing stack for end-to-end model training, optimization, and deployment. She is specialized in multimodal generative AI, large-scale model training, and inference optimization for LLMs and VLMs. She holds a PhD in Multimodal Generative AI from Université Pierre et Marie Curie (Paris 6).
About Dhia Eddine Rhaiem
Dhia Eddine Rhaiem is a senior AI research engineer at the Technology Innovation Institute (TII), where he contributes to the development of the Falcon family of large language models. He has participated in the release of Falcon-H1, Falcon-Edge, Falcon 3, FalconTiny and Falcon-Mamba, working across infrastructure, data pipelines, and large-scale training. His work spans both pretraining and post-training, supporting efficient system scaling and robust model deployment. He holds an Engineering degree and a master’s degree in Applied Mathematics and Quantitative Finance from École Centrale in France.
About Yu Yao
Yu Yao is a senior deep learning algorithm engineer at NVIDIA, where he contributes to the NVIDIA NeMo framework for large-scale generative AI. His work focuses on LLM training, multimodal models, model compression, and GPU-accelerated optimization. Yu holds a PhD in Physics and a master’s degree in Computer Science from the University of Southern California, combining research depth with hands-on AI systems engineering experience.
About Jingwei Zuo
Jingwei Zuo is a principal researcher at the Technology Innovation Institute (TII), UAE, where he leads the Falcon Foundational Models team. He and his colleagues drive the development of the Falcon family of LLMs, including Falcon-H1, Falcon-Edge, Falcon 3, and Falcon-Mamba. Their work spans the full LLM lifecycle—from novel architecture design and data curation to large-scale training, deployment, and system-level scaling. The team’s mission is to advance the efficiency, scalability, and usability of Falcon LLMs, bridging fundamental research with real-world applications. Jingwei received his PhD (2022) from University of Paris-Saclay, awarded the Plateau de Saclay Doctoral Prize, an MSc (2018) from University of Paris-Saclay, an Engineer degree (2017) from Sorbonne Université, and a BSc from Huazhong University of Science & Technology.
About Santosh Bhavani
Santosh Bhavani is a product manager at NVIDIA working on deep learning frameworks, Megatron Core, and Transformer Engine. Santosh holds a bachelor’s degree in Computer Science from Carnegie Mellon.
Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics | NVIDIA Technical Blog
nvidia_dev_blog | 12.03.2026 16:00
Physical AI is rapidly evolving, from next-generation software-defined autonomous vehicles (AVs) to humanoid robots. The challenge is no longer how to run a large language model (LLM), but how to enable high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within strict power and latency envelopes.
NVIDIA TensorRT Edge-LLM , a high-performance C++ inference runtime for LLMs and vision language models (VLMs) on embedded platforms, is designed to overcome these challenges.
As explained in this post, the latest TensorRT Edge-LLM release delivers a significant expansion in fundamental capabilities for NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms. It introduces advanced edge architectures, including mixture of experts (MoE), the NVIDIA Cosmos Reason 2 open planning model for physical AI, and the Qwen3-TTS and Qwen3-ASR models for embedded speech processing. Building on these foundational pillars, the release also offers optimized support for the NVIDIA Nemotron family of open models. This provides developers with the essential runtime to build the next generation of autonomous machines.
Efficient reasoning at scale
Running massive models on embedded hardware requires a rethink of compute efficiency. The latest release of TensorRT Edge-LLM fully enables MoE support at the edge, specifically optimizing models like Qwen3 MoE. By activating only a subset of expert parameters per token, MoE architectures enable edge devices to access the reasoning capabilities of a massive model while maintaining the inference latency and active compute footprint of a much smaller one.
This architectural shift is critical for deploying high-fidelity reasoning on edge platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. As a developer, you can drastically scale up the intelligence of your autonomous systems without exceeding the strict power and latency limits required for real-time, mission-critical operations.
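The routing idea behind MoE can be sketched as follows. The router, gate normalization, and shapes are toy choices for illustration, not the Qwen3 MoE implementation.

```python
import numpy as np

# Toy sketch of MoE token routing: only the top-k experts run per token, so
# active compute stays close to a small dense model even though total
# parameters scale with the number of experts.
rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe(x, router_logits):
    top = np.argsort(router_logits)[-k:]   # pick the k highest-scoring experts
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                   # renormalize gates over the top-k
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.standard_normal(d)                 # one token's hidden state
y = moe(x, rng.standard_normal(n_experts))
print(y.shape)  # (4,)
```

Only 2 of the 8 expert matrices are multiplied per token here, which is the source of the latency and active-compute savings the post describes.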
Unlock hybrid reasoning at the edge
TensorRT Edge-LLM now provides full, specialized runtime support for NVIDIA Nemotron 2 Nano. This enables a new class of System 2 reasoning directly on embedded chipsets, including NVIDIA DRIVE Thor and Jetson Thor.
For developers building advanced in-cabin AI assistants or robotic dialogue agents, deploying highly capable language models at the edge presents a significant memory and latency challenge. Nemotron 2 Nano addresses this challenge fundamentally by utilizing a novel Hybrid Mamba-2-Transformer architecture. This significantly reduces the memory footprint from KV cache storage with Mamba State Space architectures while maintaining high-fidelity precision from attention layers.
TensorRT Edge-LLM bridges the deployment gap by providing optimized kernels that accelerate these specific hybrid layers. This enables developers to use the model’s massive context window for complex edge retrieval-augmented generation (RAG) pipelines or agentic workflows while maintaining a strict, production-viable device memory footprint.
By enabling dynamic “thinking” at the edge with TensorRT Edge-LLM, developers can leverage a model’s ability to shift seamlessly between deep reasoning and immediate conversational action. This is a critical capability for advanced in-cabin assistants and robotic agents that must reason through complex user queries one moment and provide conversational responses the next.
Deep reasoning mode (/think): TensorRT Edge-LLM efficiently handles the expanded token generation required for chain-of-thought (CoT) processing. By using the /think system prompt, the runtime enables the model to think through complex logic, achieving a remarkable 97.8% on MATH500, before outputting a decision.
Conversational reflex mode (/no_think): For latency-critical voice interactions where the user expects an immediate reply, developers can issue a /no_think command. TensorRT Edge-LLM optimizes this path to bypass reasoning traces, delivering the immediate, intelligent responsiveness required for seamless conversational AI and agile on-device agents.
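As a hypothetical application-side sketch, the mode switch amounts to prepending the documented control token to the prompt. The real prompt template is model-specific, so treat this as illustrative only.

```python
# Hypothetical sketch: toggle deep reasoning vs. conversational reflex by
# prepending the control token described above. The actual prompt format
# depends on the model's chat template.
def build_prompt(user_query, deep_reasoning):
    mode = "/think" if deep_reasoning else "/no_think"
    return f"{mode}\n{user_query}"

print(build_prompt("Summarize the obstacle ahead.", deep_reasoning=False))
```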
By supporting this hybrid architecture, TensorRT Edge-LLM enables compact, production-ready VLMs and LLMs to serve as both reasoned assistants and low-latency conversational agents, significantly reducing the memory constraints of physical AI.
Real-time multimodal interaction at the edge
TensorRT Edge-LLM now offers support for Qwen3-TTS and Qwen3-ASR, native multimodal models with a Thinker-Talker architecture capable of voice interaction. Unlike traditional pipelines that cascade ASR, LLM, and TTS models, adding latency at every hop, Qwen3-TTS/ASR handles end-to-end speech processing.
By optimizing both the Thinker and Talker components, TensorRT Edge-LLM enables low-latency, natural voice synthesis directly on the chip:
Thinker : TensorRT Edge-LLM accelerates the reasoning core, allowing the model to process complex driver queries and environment context to generate intelligent, reasoned responses.
Talker : TensorRT Edge-LLM complements the reasoning engine by delivering low latency, natural voice synthesis (TTS) directly on the chip.
In the case of AVs, this allows for seamless, interruptible conversations between the driver and the vehicle.
Equipping humanoid robotics with physical common sense
For humanoid robots and advanced vision agents, understanding the real world requires more than just identifying objects; it requires an intuitive grasp of physics and time. To meet this need, TensorRT Edge-LLM now supports Cosmos Reason 2 , an open, customizable reasoning VLM purpose-built for physical AI and robotics.
Cosmos Reason 2 empowers embodied agents to reason like humans by using prior knowledge, physical common sense, and chain-of-thought capabilities to understand world dynamics without human annotations. With TensorRT Edge-LLM optimized, low-latency runtime, robots at the edge can efficiently leverage Cosmos Reason 2 as a primary planning model to reason through their next steps.
Key capabilities of Cosmos Reason 2 accelerated by TensorRT Edge-LLM include:
Advanced spatio-temporal reasoning : Enhanced physical AI reasoning with improved timestamp precision and a deep understanding of space, time, and fundamental physics.
3D localization and explanation : The ability to not only detect objects but also provide 2D and 3D point localization, bounding-box coordinates, and contextual reasoning explanations for its labels.
Massive context processing : Support for an improved long-context window of up to 256K input tokens, allowing edge agents to ingest extensive environmental and historical data.
By supporting Cosmos Reason 2, TensorRT Edge-LLM ensures that next-generation robots can continuously evaluate complex, long-tail physical scenarios and safely plan their actions in real time.
Advancing autonomous driving with end-to-end trajectory planning
Among the most significant shifts in autonomous production is the move from traditional modular stacks to end-to-end VLA models. NVIDIA Alpamayo is a family of open AI models, simulation frameworks, and physical AI datasets designed to accelerate the development of safe, transparent, and reasoning-based AVs.
Stay tuned for the forthcoming Alpamayo 1 workflow, a distillation recipe that brings System 2 rational thinking to the edge. Alpamayo 1 represents a leap forward from standard VLMs. It is not just describing a scene; it is planning a precise trajectory through it. The architecture utilizes a Cosmos Reason Backbone (distilled) to generate a chain of causation (reasoning trace) before outputting actions.
Key features of the Alpamayo integration in TensorRT Edge-LLM include:
Flow matching trajectory decoding : Moving beyond simple regression, flow matching is used to generate diverse, high-fidelity future trajectories.
History and context : The model tokenizes two-second historical trajectories and multicamera inputs, processing them through a Qwen3-VL backbone to output explainable driving decisions. For example, “Nudge to the left to increase clearance.”
Performance : On DRIVE Thor, Alpamayo 1 achieves production-viable latencies, using FP8 acceleration for the Vision Transformer (ViT) components.
Figure 1. The most significant shift in autonomous vehicle production is the transition from traditional modular stacks to end-to-end VLA models
Get started with TensorRT Edge-LLM for physical AI
TensorRT Edge-LLM serves as the go-to open source, pure C++ inference runtime designed specifically for the mission-critical needs of automotive and robotics. It eliminates Python dependencies for deployment, ensuring predictable memory footprints.
From deploying the efficient expert routing of Qwen3 MoE today, to preparing for the future distilled reasoning of Alpamayo 1, NVIDIA provides the essential runtime to build the next generation of autonomous machines.
To get started, explore the new features, including the Alpamayo and MoE examples, in the updated TensorRT Edge-LLM GitHub repo or through the latest NVIDIA DriveOS releases.
Tags
Developer Tools & Techniques | Edge Computing | Robotics | Automotive / Transportation | Cosmos | DRIVE | Jetson | Nemotron | TensorRT | TensorRT-LLM | Intermediate Technical | Deep dive | AI Inference | autonomous vehicles | GTC 2026 | IoT | LLMs | Mixture of Experts (MoE) | Physical AI | Retrieval Augmented Generation (RAG) | Thor | VLMs
About the Authors
About Lin Chai
Lin Chai is a senior product manager at NVIDIA, leading TensorRT and TensorRT Edge-LLM, NVIDIA’s AI inference platforms for deep learning across datacenter and embedded platforms. Drawing on her background in autonomous driving and automotive OEMs, she is inspired to build production-grade inference systems that deliver best-in-class performance for deep learning workloads across data center, edge, and physical AI applications—enabling systems that perceive, reason, and act in the real world.
About Luxiao Zheng
Luxiao Zheng is a senior systems software engineer at NVIDIA. He works on the TensorRT general performance team with a specialization in Large Language Model inference workflow. He works on end-to-end LLM software development, performance measurements, analysis and improvements for x86_64 and aarch64 platforms. Luxiao holds a M.S. in Computer Science, a B.S. in Computer Science and a B.S. in Chemical Engineering from Washington University in St. Louis.
About Fan Shi
Fan Shi is a senior system software engineer on the NVIDIA TensorRT team, specializing in the efficient deployment of advanced AI models on edge platforms. His work focuses on optimizing performance and usability in deep learning inference. Fan holds an M.S. in computational data science from Carnegie Mellon University and a B.S. in statistics and computer science from the University of Illinois.
About Maximilien Breughe
Maximilien Breughe is an engineering leader and software engineer at NVIDIA, where he works on AI inference systems and edge AI technologies. He has a background in deep learning libraries and performance engineering, and holds a PhD in Computer Architecture focused on performance simulation techniques. Maximilien is especially interested in building practical, high-performance AI systems that bridge research and real-world deployment.
About Michael Ferry
Michael Ferry is a software engineering manager on the NVIDIA TensorRT team, where he leads the TensorRT Edge-LLM, Automotive Safety, and New Platforms teams. His work centers on optimized, reliable AI inference for safety-critical robotics and automotive edge systems. Before joining NVIDIA in 2018, Michael created and led several floating-point-focused verification tools at Intel. He holds a PhD in Mathematics, specializing in numerical optimization, from the University of California, San Diego.
Related posts
Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM
NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72
Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI
Getting the Best Performance on MLPerf Inference 2.0
Related posts
Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere
Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell
NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
|
|
|
NVIDIA RTX Innovations Are Powering the Next Era of Game Development | NVIDIA Technical Blog |
nvidia_dev_blog |
10.03.2026 15:30 |
0.696
|
| Embedding sim. | 0.7933 |
| Entity overlap | 0.0377 |
| Title sim. | 0.2422 |
| Time proximity | 0.881 |
| NLP type | product_launch |
| NLP organization | NVIDIA |
| NLP topic | generative ai |
| NLP country | United States |
Open original
NVIDIA RTX ray tracing and AI-powered neural rendering technologies are redefining how games are made, enabling a new standard for visuals and performance. At GDC 2026 , NVIDIA unveiled the latest path tracing innovations elevating visual fidelity, on-device AI models enabling players to interact with their favorite experiences in new ways, and enterprise solutions accelerating game development from the ground up.
This post provides a detailed overview of these latest innovations, including:
Introducing a new system for dense, path-traced foliage in NVIDIA RTX Mega Geometry
Adding path-traced indirect lighting with ReSTIR PT in the NVIDIA RTX Dynamic Illumination SDK and RTX Hair (beta) for strand-based acceleration in the NVIDIA branch of UE5
Expanding language recognition support in NVIDIA ACE; production-quality on-device text-to-speech (TTS); a small language model (SLM) with advanced agent capabilities for AI-powered game characters
Enabling foveated game streaming for Apple Vision Pro with NVIDIA CloudXR 6.0
Accelerating in-game AI and streamlined shader compilation with new DirectX APIs
Centralizing infrastructure and virtualized game studio workflows with NVIDIA RTX PRO 6000 Blackwell Server Edition
Scaling AI coding assistants with the new NVIDIA Enterprise AI playbook
Scaling game playtesting and player engagement globally with GeForce NOW Playtest
New system for path-traced foliage in NVIDIA RTX Mega Geometry
NVIDIA RTX Mega Geometry is a breakthrough technology that delivers unprecedented geometric detail to path‑traced worlds. Introduced last year, it compresses geometry into clusters and intelligently reuses them per frame as a world is traversed. Mega Geometry enables developers to build ray tracing structures 100x faster than previous methods, enabling full‑fidelity path tracing with advanced detail and real‑time tessellation.
Remedy Entertainment applied RTX Mega Geometry to existing assets in Alan Wake 2, which saw a 5-20% FPS boost and a 300 MB VRAM reduction. Their upcoming title, CONTROL Resonant, will also feature RTX Mega Geometry.
Figure 1. NVIDIA RTX Mega Geometry was used to boost path tracing performance in Alan Wake 2
At GDC 2026, NVIDIA is offering a sneak peek into the new RTX Mega Geometry foliage system. Large natural environments such as forests remain a major challenge for real‑time ray tracing. These scenes pack countless complex, animated objects that heavily tax how quickly the GPU can build acceleration structures.
To tackle this challenge, NVIDIA is developing a new foliage system that uses partitioned top-level acceleration structures. This system instances and updates massive portions of a scene in every single frame. This advancement makes it possible, for the first time, to path trace dense environments featuring millions of detailed, uniquely animated foliage elements.
NVIDIA has partnered with CD PROJEKT RED to bring this new foliage technology to future titles. “Using the in-development RTX Mega Geometry foliage technology, we can bring fully path traced forests to the world of The Witcher ,” said Cezary Bella, Rendering Engineer at CD PROJEKT RED. “We can’t wait for players to experience this level of detail in The Witcher 4 .”
Video 1. New NVIDIA RTX Mega Geometry foliage system, enabling path tracing of dense environments with millions of detailed plants and trees, coming to The Witcher 4
Optimizing path-traced mirror reflections and strand-based hair in NVIDIA RTX Kit
The latest 2026.2 version of the NVIDIA RTX Kit suite of neural rendering technologies expands the RTX Dynamic Illumination SDK with ReSTIR PT. This algorithm enables complex path reuse at any bounce, even on challenging surfaces. It provides a high-fidelity path tracing solution specifically optimized for glossy surfaces and mirror reflections.
Figure 2. The ReSTIR PT algorithm enables full resolution mirror reflections
Several additional RTX Kit SDKs have also been updated, including:
RTX Global Illumination: SHaRC improvements, integration of DLSS-RR, and various bug fixes
RTX Character Rendering: Expanded geometry library with refactored tessellation API
RTX Path Tracing: Various performance optimizations lowering the memory footprint and increasing frame rate
RTX Neural Shaders: Bug fixes and improvements
Download the latest NVIDIA RTX Kit .
For Unreal Engine 5 developers, the NVIDIA RTX Branch of Unreal Engine (NvRTX) 5.7 provides the latest compatibility and performance updates. Launching on March 24, the update introduces Linear Swept Sphere (LSS) in beta, which leverages GeForce RTX 50 Series GPUs to provide superior speed and memory efficiency for path-traced, strand-based hair .
Video 2. Learn how to install and compile the NVIDIA RTX Branch of Unreal Engine 5.7
Finally, DLSS 4.5 Super Resolution featuring a second-generation transformer model will be available in NvRTX , also on March 24.
First on-device production-quality TTS model in NVIDIA ACE
AI is evolving traditional non-playable character (NPC) roles into dynamic collaborators. Game developer Creative Assembly is enhancing Total War: Pharaoh with a natural language advisor to guide new players. Krafton is leveraging AI for interactive teammates in PUBG: Battlegrounds and personalized stories in inZOI .
AI tech developer Meaning Machine is presenting research at GDC that was conducted with the University of Bristol. This research uses the demo Blood Will Out , which features NVIDIA TTS and facial animation models with Meaning Machine Authored AI systems, to explore how narrative AI shapes player behavior. To learn more, check out the GDC session, What Good Are AI NPCs? Lessons from a Large-Scale Player Study .
NVIDIA ACE has updated its suite of open source, on-device models with new speech and intelligence models, including its first production quality text-to-speech model. Optimized for latency and efficient VRAM usage, these models use cutting-edge distillation and pruning techniques to run seamlessly on RTX PCs.
NVIDIA Riva v1.1 : Fast, accurate automatic speech recognition (ASR) now has a reduced memory footprint and support for English, Chinese, Korean, French, German, Italian, and Japanese.
NVIDIA Nemotron 3 Nano 4B (coming next week): New SLM with strong instruction following, game agent capabilities, and VRAM scaling with support for hybrid reasoning and thinking budget.
Resemble.ai Chatterbox v1.0.0 : A 350M TTS model with paralinguistic tags and zero-shot voice cloning. These features provide expressive voices with emotional and nonverbal control.
Figure 3. Comparison of SLM quality to VRAM scaling at 16K input sequence length
NVIDIA collaborated with Resemble.AI, a leading provider of open source voice AI, to bring the latest Resemble.AI TTS model on-device. By optimizing for on-device inference alongside game graphics, developers can now deploy more voice agents without the overhead of cloud inference.
“High-quality emotional voices within a small memory footprint will scale the number of interactive characters in games,” said Zohaib Ahmed, CEO of Resemble.AI. “We collaborated with NVIDIA to ensure voice quality and performance were production quality out of the box.”
Get started with the latest NVIDIA ACE models and sign up for early access to new ACE technologies.
Foveated game streaming for Apple Vision Pro with NVIDIA CloudXR 6.0
Spatial computing adds a new dimension to gaming by giving players a whole new way to experience their favorite games. Later this spring, with NVIDIA CloudXR 6.0 for visionOS, Apple Vision Pro users can stream immersive PC and cloud experiences from NVIDIA RTX systems. This integration creates a direct bridge between visionOS and RTX-powered game rendering. iRacing, a highly realistic motorsport racing simulator, and X-Plane 12, a professional-grade flight simulator, will support this feature at launch.
Figure 4. X-Plane 12 on Apple Vision Pro streamed from GeForce RTX 5090 to a RealSimGear cockpit simulator
NVIDIA CloudXR is a streaming platform built to deliver high-fidelity, low-latency XR from a local RTX PC or cloud GPU. CloudXR 6.0 adds visionOS integration with privacy-preserving foveated streaming, which helps developers deliver better visual quality and performance without rebuilding content for standalone hardware. For developers, this means faster iteration, easier deployment of demanding spatial workloads, and a practical path to bring PC-class experiences to Apple Vision Pro.
Learn more about NVIDIA CloudXR for visionOS.
Accelerating in-game AI with new DirectX APIs
Integrating AI into games presents a significant challenge, as AI workloads must execute in milliseconds without compromising the GPU's rendering of high-FPS gameplay. Current integration methods are often fragmented across IHV solutions, forcing operations to execute layer by layer and incurring high performance overhead.
Furthermore, because AI tasks frequently run out of sync with primary rendering and simulation pipelines, the resulting GPU context-switching further diminishes efficiency and leads to suboptimal performance.
In collaboration with Microsoft, NVIDIA is standardizing hardware-accelerated AI through DirectX. This collaboration will boost GPU efficiency, eliminate context-switching for smoother gameplay, and provide a unified workflow for the broader development community.
Figure 5. New DirectX APIs accelerate AI workloads throughout the gaming pipeline. Image credit: Microsoft
Building on last summer’s preview of Cooperative Vector support, which enabled developers to accelerate AI workloads such as neural texture compression through Tensor Cores, Microsoft plans to release DirectX Linear Algebra and DirectX Compute Graph Compiler to accelerate the entire gaming pipeline. To learn more, read the Microsoft DirectX Developer Blog .
Shader compilation for the gaming ecosystem
Microsoft and NVIDIA are working to address two major pain points in PC gaming—long compilation times and in-game stuttering—by evolving how shaders are handled. These issues typically stem from compiling shaders at runtime; Microsoft Advanced Shader Delivery (ASD) eliminates the bottleneck by distributing precompiled shaders directly during the game download.
Figure 6. Microsoft Advanced Shader Delivery (ASD) replaces runtime shader compilation with delivery of precompiled shaders
NVIDIA is collaborating closely with Microsoft to bring ASD to GeForce RTX users later this year. Developers can check out the latest Advanced Shader Delivery blog post to learn how to precompile and deliver shaders through the Xbox store.
Centralizing studio workflows with NVIDIA RTX PRO Server and NVIDIA Virtual GPU
NVIDIA is showcasing a new approach to centralizing studio workflows across content creation, engineering, AI, and QA with NVIDIA RTX PRO Server . Built on NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs and NVIDIA Virtual GPU (vGPU) software, the platform enables studios to move key development workflows onto shared, centralized infrastructure without giving up the responsiveness and visual fidelity teams expect from workstation-class systems.
Instead of scaling one desk-bound GPU at a time, studios can provision high-performance virtual environments for geographically distributed teams, improving infrastructure consistency and making it easier to support parallel development across locations.
Figure 7. NVIDIA RTX PRO Server can help standardize environments and reduce operational friction across complex pipelines
This virtualized model is designed to help studios consolidate graphics and AI workloads on the same foundation while improving utilization, security, and scalability. With up to 96 GB of VRAM per GPU and support for multitenant environments, NVIDIA RTX PRO Server can support artists, developers, AI researchers, and QA on shared infrastructure, helping standardize environments and reduce operational friction across complex pipelines.
The result is a more flexible studio architecture that can scale with modern production demands while keeping teams closely aligned around the same centralized resources. To learn more, see NVIDIA Virtualizes Game Development With RTX PRO Server .
Accelerating AI-assisted coding for game development
NVIDIA also released an AI Code Assistant Playbook to help studios scale AI-assisted coding from individual Unreal Engine workflows to enterprise-wide deployments. By leveraging AST-aware chunking, MCP integrations, and fine-tuned models for large codebases, the guide aims to boost developer velocity while ensuring compatibility with complex studio pipelines. To learn more, see Reliable AI Coding for Unreal Engine: Improving Accuracy and Reducing Token Costs .
Scaling game playtesting and player engagement globally with GeForce NOW Playtest
GeForce NOW offers a suite of cloud tools that support stages of the game lifecycle—from playtesting to player engagement. For game publishers and studios, GeForce NOW Playtest provides cloud-based virtualization to streamline large-scale testing of games before they are released. It replaces weeks of local environment setup with on-demand hardware in the cloud, enabling secure testing across more than 100 countries.
With features including instant streaming, real-time observability, and video recording for offline viewing, GeForce NOW helps teams working across different locations bridge communication gaps and identify issues at scale.
Figure 8. GeForce NOW enables game developers to securely test games across devices in more than 100 countries
To get started, check out the Install-to-Play Test Application to quickly test your Steam game in the cloud. You can also watch the video walkthrough How to Publish Your Game on NVIDIA GeForce NOW . Explore these capabilities and learn more at the GDC session, Scaling Playtesting Reach with GeForce NOW .
GeForce NOW also powers Discord Instant Play Quests , which enable users to play games from the cloud directly in Discord—no additional downloads or installs required. They combine instant play, rewards, and Discord's social networks and communities into a single game trial and engagement experience. These Instant Play Quests will be available for select games on GeForce NOW.
Get started with new NVIDIA RTX technologies
Join NVIDIA at GDC 2026 this week to explore how NVIDIA RTX neural rendering and AI are defining the next era of gaming. Glimpse into the future of game development with John Spitzer, VP of Developer and Performance Technology at NVIDIA, as he unveils innovations in path tracing and generative AI workflows. Catch up with Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA, for an interactive Ask Me Anything on the latest AI trends. The two full days of sessions offer a front-row seat to the technologies unlocking new player experiences.
Check out the full list of game developer resources and stay up to date with the latest NVIDIA game development news.
Join the NVIDIA Developer Program (select gaming as your industry)
Follow us on social: X , LinkedIn , Facebook , and YouTube
Join our Discord community
Tags
Agentic AI / Generative AI | Content Creation / Rendering | Gaming | ACE | Blackwell | CloudXR | DLSS | Nsight Tools - Graphics | RTX Kit | General Interest | News | DirectX | GDC | GeForce | Neural Graphics | NvRTX | Ray Tracing / Path Tracing | Text Processing | Unreal Engine | vGPU
About the Authors
About Ike Nnoli
Ike Nnoli is a senior product marketing manager at NVIDIA. Ike is responsible for driving the adoption of real-time ray-tracing graphics and AI software development kits across the developer network. Previously, Ike held product marketing positions at PlayStation and design engineering roles at the Boeing Company. He holds an MBA from UCLA and a bachelor's degree in mechanical engineering from Northwestern University.
Related posts
Get Started with Neural Rendering Using NVIDIA RTX Kit
NVIDIA RTX Neural Rendering Introduces Next Era of AI-Powered Graphics Innovation
Top Game Development Sessions at NVIDIA GTC 2023
'GDC Showcase' Highlights Top NVIDIA Technologies
Developers Show Off Amazing Real-Time Ray-Traced Projects in New DXR Spotlight Contest
Related posts
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
|
|
|
Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell | NVIDIA Technical Blog |
nvidia_dev_blog |
16.03.2026 16:10 |
0.667
|
| Embedding sim. | 0.7944 |
| Entity overlap | 0.0476 |
| Title sim. | 0.2908 |
| Time proximity | 0.4276 |
| NLP type | product_launch |
| NLP organization | NVIDIA |
| NLP topic | ai agents |
| NLP country | |
Open original
AI has evolved from assistants following your directions to agents that act independently. Called claws, these agents can take a goal, figure out how to achieve it, and execute indefinitely—while leaving you out of the loop. The more capable claws become, the harder they are to trust. And their self-evolving autonomy changes everything about the environment in which they operate.
The infrastructure to run claws more safely didn’t exist, until now.
NVIDIA at GTC announced NemoClaw , an open source stack that simplifies running OpenClaw always-on assistants—with a single command. It incorporates policy-based privacy and security guardrails, giving you control over your agents’ behavior and data handling. This enables self-evolving claws to run more safely in the cloud, on prem, on NVIDIA RTX PCs, and on NVIDIA DGX Spark.
NVIDIA NemoClaw uses open source models—like NVIDIA Nemotron —alongside the NVIDIA OpenShell runtime , which is part of the NVIDIA Agent Toolkit. By combining powerful open source models with built-in safety measures, NemoClaw simplifies and secures AI agent deployment.
The NVIDIA Agent Toolkit, meanwhile, provides the full deployment stack —models, tools, evaluation, and runtimes—for building, testing, and optimizing long-running agents that can plan tasks; work across applications and enterprise data; and operate as dependable, production-ready services.
Based on Apache 2.0, OpenShell sits between your agent and your infrastructure. It governs how the agent executes, what the agent can see and do, and where inference goes. OpenShell enables claws to run in isolated sandboxes, giving you fine-grained control over your privacy and security while letting you benefit from the agents’ productivity.
Run one command, openshell sandbox create --remote spark --from openclaw, and make zero code changes. Then any claw or coding agent like OpenClaw, Anthropic's Claude Code, or OpenAI's Codex can run unmodified inside OpenShell.
This blog will discuss the evolution of AI agents and detail how OpenShell works.
How claws introduce risk
Claws remember context across sessions, spawn subagents to act independently, write their own code to learn new skills mid-task, use tools, and keep executing long after you close your laptop. For the first time, an individual developer can spin up an agent that does the work of a team, running continuously and handling complexity that would have required coordination, pipelines, and weeks of time.
Long-running agents like OpenClaw have shown productivity gains but also pose security risks. Today’s agent runtimes resemble the early days of the web. They’re powerful but missing core security primitives: sandboxing, permissions, and isolation.
For long-running, self-evolving agents to actually work, you need three things simultaneously: safety, capability, and autonomy. You can only reliably get two at a time with existing approaches. If safe and autonomous but without access to the tools and data it needs, the agent can’t finish the job. If capable and safe but gated on constant approvals, then you’re babysitting it. If capable and autonomous with full access, you’ve got a long-running process policing itself—guardrails living inside the same process they’re supposed to be guarding.
That last one is the critical failure mode. A stateless chatbot has no meaningful attack surface. An agent with persistent shell access, live credentials, the ability to rewrite its own tooling, and six hours of accumulated context running against your internal APIs is a fundamentally different threat model. Every prompt injection is a potential credential leak. Every third-party skill a claw installs is an unreviewed binary with filesystem access. Every subagent it spawns can inherit permissions it was never meant to have.
The agents are ready. The environment you need to actually trust them has been missing.
How NVIDIA built OpenShell
The core architectural decision behind OpenShell is out-of-process policy enforcement. Instead of relying on behavioral prompts, it enforces constraints on the environment the agent runs in—meaning the agent cannot override them, even if compromised. This is the browser tab model applied to agents: Sessions are isolated, and permissions are verified by the runtime before any action executes.
Tools like Claude Code and Cursor ship with valuable internal guardrails and system prompts, but those protections live inside the agent. OpenShell wraps those harnesses, moving the ultimate control point entirely outside the agent’s reach.
The runtime will rely on many pieces, but here are some NVIDIA is delivering today:
The sandbox is designed specifically for long-running, self-evolving agents. It is not generic container isolation. It handles skill development and verification, programmable system and network isolation, and isolated execution environments that agents can break without touching the host. Policy updates happen live at sandbox scope as developer approvals are granted, with a full audit trail of every allow and deny decision.
The policy engine enforces constraints on the agent’s environment across the filesystem, network, and process layers. Self-evolving agents require granular oversight to trust them when they’re installing packages, learning skills at runtime, and spawning scoped subagents. By evaluating every action at the binary, destination, method, and path level, the engine ensures an agent can install a verified skill but cannot execute an unreviewed binary. The agent gets the autonomy it needs to evolve within the boundaries you define. If an agent hits a constraint, it can reason about the roadblock and propose a policy update, leaving you with the final approval.
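As a rough illustration of that deny-by-default evaluation model (this is not OpenShell's actual API, which the post doesn't show; the rule fields and names are assumptions), a policy check over the binary, destination, method, and path of each action might look like this sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """An agent action, described at the granularity the engine evaluates:
    binary, destination, method, and path."""
    binary: str
    destination: str
    method: str
    path: str

# Hypothetical allow rule: the agent may fetch packages from a verified index.
ALLOW_RULES = [
    {"binary": "pip", "destination": "pypi.org", "method": "GET", "path": "/simple/"},
]

def evaluate(action: Action, audit_log: list) -> bool:
    """Deny-by-default: an action passes only if every field matches an allow
    rule. Every allow and deny decision is appended to the audit trail."""
    for rule in ALLOW_RULES:
        if (action.binary == rule["binary"]
                and action.destination == rule["destination"]
                and action.method == rule["method"]
                and action.path.startswith(rule["path"])):
            audit_log.append(("allow", action))
            return True
    audit_log.append(("deny", action))
    return False

log = []
evaluate(Action("pip", "pypi.org", "GET", "/simple/requests/"), log)   # allowed
evaluate(Action("curl", "evil.example", "GET", "/payload"), log)       # denied
```

The key property the sketch captures is that the check runs outside the agent: the agent proposes actions, but the runtime holds the rules and the audit trail.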
The privacy router keeps sensitive context on-device with local open models and routes to frontier models like Claude and GPT only when policy allows. The router makes decisions based on your cost and privacy policy, not the agent’s. OpenShell is model-agnostic by design and provides the environment where all agents and their harnesses can be governed.
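The routing decision can be pictured with a small sketch; the policy keys and thresholds below are hypothetical, chosen only to show that the user's policy, not the agent, picks the inference target:

```python
def route(prompt: str, contains_sensitive: bool, policy: dict) -> str:
    """Pick an inference target from the user's cost and privacy policy.
    Sensitive context stays on the local model unless policy explicitly
    allows sending it to a frontier model."""
    if contains_sensitive and not policy.get("allow_remote_for_sensitive", False):
        return "local"
    # Cost control: short requests stay on-device.
    limit = policy.get("prefer_local_under_words")
    if limit and len(prompt.split()) <= limit:
        return "local"
    return "frontier"

policy = {"allow_remote_for_sensitive": False, "prefer_local_under_words": 64}
route("summarize my medical records", True, policy)        # stays local
route("draft a long market analysis " * 30, False, policy)  # goes to a frontier model
```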
Figure 1. OpenShell’s architecture for safer autonomous agents, illustrating the core components: the sandbox, the policy engine, and the privacy router.
How OpenShell enables the next generation of claws
OpenShell is designed to scale from a single developer on an NVIDIA DGX Spark or NVIDIA stack to enterprise-wide deployments, using the same primitives at every level. That includes deny-by-default, live policy updates, and a full audit trail whether you’re one developer or running an enterprise GPU cluster.
The adoption and use of claws is only accelerating, and the infrastructure decisions made in the next six to 12 months will shape what enterprise agent deployment looks like for a long time.
Agents built with OpenShell can continuously build new skills over time using popular coding agents like Claude Code, Codex, Cursor, and OpenCode—and you can add tools, models, and behaviors through the sandbox interface while keeping every new capability subject to the same policy and privacy controls.
Get started with OpenShell today by visiting the NVIDIA GitHub repo and running it on your NVIDIA DGX Spark , NVIDIA DGX Station , or a dedicated PC with an NVIDIA RTX GPU .
Tags
Agentic AI / Generative AI | Developer Tools & Techniques | Trustworthy AI / Cybersecurity | General | NeMo | Nemotron | Beginner Technical | News | Agent toolkit | AI Agent | claws | DGX Spark | featured | GTC 2026 | LLMs | NemoClaw | Open Source | OpenShell | RTX GPU
About the Authors
About Ali Golshan
Ali Golshan is senior director of AI software at NVIDIA, leading product efforts on OpenShell and product development at the intersection of AI, privacy, and data infrastructure. Before joining NVIDIA in 2025, Ali cofounded several startups. His most recent was Gretel, an agentic platform for creating safe and high-quality synthetic data for enterprises. Ali began his career conducting security and vulnerability research for the U.S. intelligence community, focusing on resilient infrastructure and nation-state cyber defense challenges.
About Alex Watson
Alex Watson is senior director of product at NVIDIA AI, helping lead product efforts for OpenShell and synthetic data. He joined the company in 2025 with the acquisition of Gretel. While at the startup, Alex led a team of over 50 scientists pioneering synthetic data-generation techniques for AI model training and differential privacy. He previously founded harvest.ai to develop AI-driven data protection at petabyte scale. After Amazon Web Services acquired the company, Alex grew the newly branded Amazon Macie into one of AWS's top 25 revenue-generating services. Alex began his career at the National Security Agency and holds a bachelor's in computer science from Indiana University, Bloomington.
About John Myers
John Myers is a senior director of software engineering at NVIDIA who is leading OpenShell engineering efforts. He joined in March 2025 with the acquisition of Gretel, where he served as co-founder and CTO. He earlier co-founded Efflux Systems, a cybersecurity and machine learning company acquired by NETSCOUT. John began his career in the U.S. Air Force as a cyberspace operations officer supporting intelligence community missions, including work through the National Security Agency’s computer network operations development program.
Related posts
Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
How to Scale Your LangGraph Agents in Production From A Single User to 1,000 Coworkers
Securely Deploy AI Models with NVIDIA NIM
Agentic Autonomy Levels and Security
Enhanced Security and Streamlined Deployment of AI Agents with NVIDIA AI Enterprise
Related posts
Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere
Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics
L
T
F
R
E
|
|
|
Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes | NVIDIA Technical Blog |
nvidia_dev_blog |
12.03.2026 16:30 |
Every AI cluster running on Kubernetes requires a full software stack that works together, from low-level driver and kernel settings to high-level operator and workload configurations. You get one cluster working, and spend days getting the next one to match. Upgrade a component, and something else breaks. Move to a new cloud and start over. AI Cluster Runtime is a new open-source project designed to remove cluster configuration from the critical path. It publishes optimized, validated, and reproducible Kubernetes configurations as recipes you can deploy onto your clusters.
How AI Cluster Runtime works
To support GPU clusters across cloud and on-premises AI factories, NVIDIA validates specific combinations of drivers, runtimes, operators, kernel modules, and system settings for AI workloads. AI Cluster Runtime publishes those results as recipes: version-locked YAML files that capture which components were tested, at which versions, and with which configuration values for a given environment. Recipes also carry constraints (minimum Kubernetes version, required OS, kernel version) and a computed deployment order based on component dependencies. Every recipe is validated against real clusters and reproducible across environments.
You can browse recipes directly in the repository, query them through a REST API, or use the aicr CLI to generate one for your target environment and render it into Helm charts and manifests ready for deployment.
Capture your cluster state
If you have a running cluster, you can snapshot its state before generating a recipe. This captures OS release, kernel version, GPU hardware and driver, Kubernetes version, and installed operators.
aicr snapshot \
--node-selector nodeGroup=gpu-worker \
--output cm://gpu-operator/aicr-snapshot
This deploys a short-lived Job onto a target node, collects system measurements, and writes the results to a ConfigMap or local file. The snapshot becomes the baseline that the validation checks against.
Generate a recipe
The recipe command takes a description of your target environment and matches it against a library of validated overlays to produce a single recipe with exact component versions and settings.
aicr recipe \
--service eks \
--accelerator h100 \
--intent training \
--os ubuntu \
--platform kubeflow \
--output recipe.yaml
Recipes are composed from layers rather than maintained as monolithic configurations. These include:
Base layers, which define universal components and default versions.
Environment layers, which add Kubernetes-specific components—for example, the EBS CSI driver and EFA plugin on Amazon EKS.
Intent layers, which configure training-optimized component settings and NVIDIA Collective Communications Library (NCCL) tuning parameters.
Hardware layers, which pin driver versions and enable features such as CDI and GDRCopy for specific accelerators.
Each layer is added in order, with more specific values taking precedence over general ones.
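As a rough illustration of the layer precedence described above, here is a minimal Python sketch of composing a recipe from ordered layers. The layer names and values are hypothetical, not actual AI Cluster Runtime data.

```python
# Illustrative layered recipe composition: later (more specific) layers
# override earlier (more general) ones, key by key. All values below
# are placeholders, not real AI Cluster Runtime recipe content.

def compose(layers):
    """Merge layer dicts in order; later layers take precedence."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

base = {"gpu-operator.version": "24.9", "nccl.algo": "auto"}
environment = {"ebs-csi.enabled": True}           # e.g., an EKS layer
intent = {"nccl.algo": "Ring,Tree"}               # training tuning
hardware = {"driver.version": "560.35", "gdrcopy.enabled": True}

recipe = compose([base, environment, intent, hardware])
```

The same base layers can therefore fan out into many distinct final recipes, depending on which environment, intent, and hardware layers are applied on top.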
A fully specialized recipe (such as NVIDIA Blackwell + EKS + Ubuntu + training + Kubeflow) carries up to 268 configuration values across 16 components. A generic EKS query returns 200. The delta between training and inference intent can swap 5 components and change 41 configuration values, producing completely different deployment stacks from the same base. This kind of variance is exactly why people end up hand-tuning clusters.
Validate
Validation runs in phases. Prior to deploying anything, a readiness check compares recipe constraints against your snapshot: Kubernetes version, OS, kernel, and GPU hardware.
aicr validate \
--recipe recipe.yaml \
--phase readiness
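The readiness phase amounts to a constraint comparison between the recipe and the snapshot. The following Python sketch is illustrative only; the field names are hypothetical, not the actual aicr schema.

```python
# Illustrative readiness check: compare a recipe's constraints against
# a cluster snapshot before deploying anything. Field names here are
# hypothetical, not the real aicr recipe/snapshot schema.

def check_readiness(constraints, snapshot):
    """Return a list of human-readable failures; empty means ready."""
    failures = []
    if snapshot["kubernetes"] < constraints["min_kubernetes"]:
        failures.append("kubernetes version too old")
    if snapshot["os"] != constraints["os"]:
        failures.append("os mismatch")
    if snapshot["gpu"] not in constraints["supported_gpus"]:
        failures.append("unsupported gpu")
    return failures

constraints = {
    "min_kubernetes": (1, 29),
    "os": "ubuntu",
    "supported_gpus": {"h100", "b200"},
}
snapshot = {"kubernetes": (1, 31), "os": "ubuntu", "gpu": "h100"}
failures = check_readiness(constraints, snapshot)  # [] when the cluster is ready
```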
After deployment, subsequent phases validate component health and conformance. The conformance phase checks against standards like the CNCF’s Certified Kubernetes AI Conformance Program, verifying requirements for dynamic resource allocation (DRA), gang scheduling, and job-level networking.
Create a bundle
The bundler turns a recipe into deployable artifacts.
aicr bundle \
--recipe recipe.yaml \
--system-node-selector nodeGroup=system-pool \
--accelerated-node-selector nodeGroup=gpu-worker \
--accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
--output ./bundles
The output is a directory with one folder per component, each containing values.yaml, integrity checksums, a README, and optional custom manifests.
Components are ordered by their dependency graph (for example, cert-manager before NVIDIA GPU Operator, and the NVIDIA GPU Operator before Kubeflow Trainer). Deploy using the included deploy.sh script, generate ArgoCD Application manifests with --deployer argocd, or publish bundles as OCI images for air-gapped environments.
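Ordering components by their dependency graph is a topological sort. A minimal sketch using the Python standard library, with an illustrative dependency set:

```python
# Sketch of dependency-ordered deployment: each component maps to the
# set of components it depends on, and a topological sort yields a
# valid install order. The dependency set is illustrative only.

from graphlib import TopologicalSorter

deps = {
    "gpu-operator": {"cert-manager"},
    "kubeflow-trainer": {"gpu-operator"},
    "network-operator": {"cert-manager"},
}

# static_order() emits dependencies before their dependents
order = list(TopologicalSorter(deps).static_order())
```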
Stay current with AI Cluster Runtime recipes
Recipes update as the NVIDIA internal validation pipelines run. New component releases, driver updates, and kernel parameter changes all flow into published recipes as they are tested. When a particular NCCL setting improves Blackwell throughput, that lands in the next recipe version.
Because every recipe is versioned, you can diff your current deployment against the latest validated configuration and see exactly what changed before upgrading.
Contributing recipes
Designed for collaboration from the start, the project enables CSPs, OEMs, platform teams, and individual operators to help validate diverse hardware, OS, and Kubernetes distribution combinations.
Contribute a recipe. Copy an existing overlay, update the criteria and configuration for your environment, run make test, and open a PR. The recipe development guide walks through the process.
Extend privately. The --data flag overlays external recipe directories at runtime, so you can maintain organization-specific configurations alongside public ones without forking.
File issues. Share which environments matter to you. That directly shapes what gets validated next.
Get started with AI Cluster Runtime
AI Cluster Runtime is available on GitHub as an alpha release. It includes the aicr CLI, an API server, a cluster agent, and validated recipes covering training and inference workloads on Kubernetes (e.g., Amazon EKS) with NVIDIA H100 and NVIDIA Blackwell accelerators running Ubuntu 24.04.
Training recipes target Kubeflow Trainer and inference recipes target NVIDIA Dynamo. Every release includes SLSA Level 3 provenance, signed SBOMs, and image attestations.
Projects are in development to expand AI Cluster Runtime across additional platforms, accelerators, and workload types. Tune in to the Operating Cloud AI Factories at Scale session at NVIDIA GTC 2026 to learn more about AI Cluster Runtime and other products that can scale AI operations.
Tags
Data Center / Cloud | Developer Tools & Techniques | MLOps | Cloud Services | Dynamo | Intermediate Technical | Tutorial | Kubernetes | Open Source
About the Authors
About Mark Chmarny
Mark Chmarny, a principal cloud architect in the NVIDIA DGX Cloud organization, specializes in large-scale distributed systems, container orchestration, and GPU-accelerated compute. He focuses on Kubernetes-based AI/ML platforms, high-performance GPU clusters, and multi-cloud infrastructure for training and inference.
About Nathan Taber
Nathan Taber is a product manager who has helped define how modern cloud and AI infrastructure are built. At AWS, he was a founding member of the Amazon EKS team and helped define Kubernetes at AWS through work on EKS, Karpenter, and the broader OSS ecosystem. At NVIDIA he helps define GPU-accelerated Kubernetes and health-automation patterns for large-scale AI infrastructure, influencing how cloud providers and their customers run production GPU workloads reliably at scale.
Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog |
nvidia_dev_blog |
11.03.2026 16:00 |
Agentic AI systems need models with the specialized depth to solve dense technical problems autonomously. They must excel at reasoning, coding, and long-context analysis, while remaining efficient enough to run continuously at scale.
Multi-agent systems generate up to 15x the tokens of standard chats, re-sending history, tool outputs, and reasoning steps at every turn. Over long tasks, this “context explosion” causes goal drift, where agents gradually lose alignment with the original objective. And using massive reasoning models for every sub-task—the “thinking tax”—makes multi-agent applications too expensive and sluggish for practical use.
Today, we are releasing Nemotron 3 Super to address these limitations. The new Super model is a 120B-total, 12B-active-parameter model that delivers maximum compute efficiency and accuracy for complex multi-agent applications such as software development and cybersecurity triaging. This model follows the introduction of Nemotron 3 Nano in December.
Super addresses the “thinking tax” with its hybrid mixture-of-experts (MoE) architecture, delivering over 5x the throughput of the previous Nemotron Super. It tackles the “context explosion” with a native 1M-token context window that gives agents long-term memory for aligned, high-accuracy reasoning. The model is fully open with open weights, datasets, and recipes so developers can easily customize, optimize, and deploy it on their own infrastructure.
What makes Nemotron 3 Super different
Nemotron 3 Super isn’t just a bigger Nano. It introduces architectural innovations that allow the model to mitigate some of the typical efficiency-accuracy tradeoffs for high-capacity reasoning models:
Latent MoE that calls 4x as many expert specialists for the same inference cost, by compressing tokens before they reach the experts.
Multi-token prediction (MTP) that predicts multiple future tokens in one forward pass, dramatically reducing generation time for long sequences and enabling built-in speculative decoding.
Hybrid Mamba-Transformer backbone integrating Mamba layers for sequence efficiency with Transformer layers for precision reasoning, delivering higher throughput with 4x improved memory and compute efficiency.
Native NVFP4 pretraining optimized for NVIDIA Blackwell, significantly cutting memory requirements and speeding up inference by 4x on NVIDIA B200 compared to FP8 on NVIDIA H100, while maintaining accuracy.
Multi-environment reinforcement learning (RL): post-trained across 21 environment configurations using NVIDIA NeMo Gym and NVIDIA NeMo RL, with more than 1.2 million environment rollouts.
These advantages come together to create a model that is well suited for long-running autonomous agents. On PinchBench—a new benchmark for determining how well LLMs perform as the brain of an OpenClaw agent—Nemotron 3 Super scores 85.6% across the full test suite, making it the best open model in its class.
See it in action
If you want to go hands-on with Nemotron 3 Super, follow the tutorial video below. It walks you through using the model, from build.nvidia.com to OpenCode.
Video 1. A tutorial walkthrough of Nemotron 3 Super
Diving deep into the architecture
Hybrid Mamba-Transformer MoE backbone
Super builds on the same hybrid philosophy as Nano but at a fundamentally different scale. The backbone interleaves three layer types:
Mamba-2 layers handle the majority of sequence processing. State space models (SSMs) provide linear-time complexity with respect to sequence length, which is what makes the 1M-token context window practical rather than theoretical. When an agent needs to reason over an entire codebase, a long conversation history, or a stack of retrieved documents, Mamba layers keep the memory footprint manageable.
Transformer attention layers are interleaved at key depths. Pure SSMs can struggle with precise associative recall—the kind of task where you need to find one specific fact buried in a long context. The attention layers preserve this capability, ensuring that Super maintains high-fidelity retrieval even when the “needle” sits in the middle of a haystack of conflicting information.
MoE layers scale effective parameter count without the cost of dense computation. Only a subset of experts activates per token, keeping latency low and throughput high—critical when many agents are running concurrently in a shared deployment.
Figure 1. A layer pattern diagram showing repeating blocks of Mamba-2/MoE pairs interleaved with attention layers
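The interleaving described above can be sketched in a few lines of Python. The block count and attention spacing here are placeholders for illustration, not the actual Nemotron 3 Super configuration.

```python
# Illustrative hybrid backbone layout: repeating Mamba-2/MoE pairs with
# an attention layer interleaved at fixed depths. The counts below are
# placeholders, not the real Nemotron 3 Super layer pattern.

def build_pattern(n_blocks, attention_every=4):
    layers = []
    for i in range(n_blocks):
        layers += ["mamba2", "moe"]        # Mamba-2 / MoE pair
        if (i + 1) % attention_every == 0:
            layers.append("attention")     # periodic precise-recall layer
    return layers

pattern = build_pattern(8)  # 8 pairs, attention after every 4th pair
```

Because the Mamba-2 layers dominate the pattern, sequence processing stays linear-time, while the sparse attention layers preserve associative recall.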
Latent MoE
Standard MoE architectures route tokens directly from the model’s full hidden dimension to the experts. As models grow, this routing layer becomes a bottleneck—it increases compute costs and limits how many experts you can practically deploy.
Super introduces latent MoE: Before routing decisions are made, token embeddings are projected into a compressed, low-rank latent space. Expert computation happens in this smaller dimension, and results are projected back to the full model dimension afterward.
Why this matters in practice:
More experts, same cost. By compressing tokens before they reach the experts, latent MoE enables the model to consult 4x as many experts for the same computational cost.
Finer-grained specialization. With more experts available, the model can afford highly specialized routing—for example, distinct experts for Python syntax versus SQL logic—that activates only when strictly necessary. This granularity is especially valuable in agentic settings where a single conversation may span tool calls, code generation, data analysis, and conversational reasoning within a few turns.
Figure 2. Side-by-side comparison of standard MoE vs. latent MoE architectures
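A toy sketch of the latent routing path—compress, route, run experts in the latent dimension, project back—assuming a top-1 router and small illustrative dimensions. None of these sizes or the router design reflect the real model.

```python
# Toy latent-MoE forward pass: tokens are projected into a low-rank
# latent space, expert MLPs run in that smaller dimension, and results
# are projected back. Dimensions and top-1 routing are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts = 64, 16, 8

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
router = rng.normal(size=(d_latent, n_experts))
experts = [rng.normal(size=(d_latent, d_latent)) / np.sqrt(d_latent)
           for _ in range(n_experts)]

def latent_moe(x):
    z = x @ W_down                     # compress: expert compute is in d_latent
    logits = z @ router                # routing also happens in latent space
    out = np.zeros_like(z)
    for i, tok in enumerate(z):
        e = int(np.argmax(logits[i]))  # top-1 expert per token
        out[i] = np.tanh(tok @ experts[e])
    return out @ W_up                  # project back to d_model

tokens = rng.normal(size=(4, d_model))
y = latent_moe(tokens)
```

The point of the sketch: the per-expert matrices live in d_latent rather than d_model, which is what lets the expert count grow without a matching growth in per-token compute.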
Multi-token prediction (MTP)
Standard language models are trained to predict one token at a time—a fundamentally myopic objective. Super is trained with MTP, where specialized prediction heads forecast several future tokens simultaneously from each position.
This has two concrete benefits:
Stronger reasoning during training. Predicting multiple future tokens forces the model to internalize longer-range structure and logical dependencies. Rather than learning to guess plausible next words, the model must learn to anticipate coherent sequences. This produces measurable gains on chain-of-thought tasks where each step must follow logically from the last.
Built-in speculative decoding at inference. By predicting multiple future tokens simultaneously in one forward pass, MTP dramatically reduces the time required to generate long sequences. The MTP heads provide draft predictions that can be verified in parallel, enabling up to 3x wall-clock speedups for structured generation tasks like code and tool calls—without requiring a separate draft model.
Both benefits stem from the same design decision. Unlike architectures that train independent prediction heads per offset, Super uses a shared-weight design across all MTP heads. This keeps the parameter overhead minimal while improving training stability—the heads learn to agree on coherent continuations rather than diverging into offset-specific shortcuts. The same weight sharing also makes the speculative drafts more consistent at longer draft lengths, which is where independently trained heads typically degrade.
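The draft-and-verify idea behind MTP-style speculative decoding can be sketched with a toy deterministic "model". The accept/reject loop below is illustrative only, not the production decoding kernel.

```python
# Toy draft-and-verify loop: draft heads propose k future tokens in one
# step, and the verifier accepts the longest prefix that matches the
# model's own next-token predictions. The "model" is a fixed toy rule.

def verify(model, context, draft):
    """Accept draft tokens while they match the model's prediction."""
    accepted = []
    for tok in draft:
        expected = model(context + accepted)
        if tok != expected:
            break
        accepted.append(tok)
    # On early rejection, the verification pass still yields one
    # corrected token for free.
    if len(accepted) < len(draft):
        accepted.append(model(context + accepted))
    return accepted

model = lambda seq: len(seq) % 5   # deterministic toy next-token rule
out = verify(model, [0, 1], [2, 3, 0])  # third draft token is wrong
```

When drafts mostly agree with the verifier, several tokens are committed per forward pass instead of one, which is where the wall-clock speedup comes from.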
Native NVFP4 pretraining
Most quantized models start as full-precision and get compressed after training, which inevitably introduces accuracy loss. Super takes a different approach: The majority of floating-point multiply-accumulate operations during pretraining run in NVFP4, the NVIDIA 4-bit floating-point format. Optimized for Blackwell, this significantly cuts memory requirements and speeds up inference compared to FP8, while maintaining accuracy.
Training natively in reduced precision means the model learns to be accurate within the constraints of 4-bit arithmetic from the very first gradient update. The result is a model that is mathematically stable and accurate despite running on a significantly reduced memory footprint.
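A simplified picture of 4-bit block quantization in the spirit of NVFP4: values in a block share a scale, and each value snaps to the nearest representable E2M1 magnitude. Real NVFP4 stores the scale in FP8 and uses small fixed-size blocks; this sketch keeps the scale in full precision for clarity.

```python
# Sketch of block-scaled 4-bit quantization. E2M1 (sign + 2 exponent
# bits + 1 mantissa bit) can represent only these magnitudes; a shared
# per-block scale maps real values into that range. Simplified: the
# scale here is a full-precision float, not FP8 as in actual NVFP4.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(block):
    """Round each value in the block to the nearest scaled E2M1 value."""
    scale = max(abs(v) for v in block) / 6.0 or 1.0  # map max onto 6.0
    out = []
    for v in block:
        mag = min(E2M1, key=lambda q: abs(abs(v) / scale - q))
        out.append(mag * scale * (1 if v >= 0 else -1))
    return out

q = quantize_block([0.1, -0.45, 2.4, -6.0])
```

Training natively under this coarse grid is what forces the model to keep its weights and activations in ranges where 4-bit rounding error stays benign.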
How we trained Nemotron 3 Super
Nemotron 3 Super is trained in three sequential phases, each building on the last. Pretraining establishes broad world knowledge and language understanding at scale. Supervised fine-tuning shapes the model’s behavior across the task types it will encounter in deployment. Reinforcement learning then refines that behavior against verifiable outcomes across diverse agentic environments.
Pretraining
Super is pretrained on 25 trillion tokens using NVFP4, the NVIDIA 4-bit floating-point format optimized for NVIDIA Blackwell. Rather than quantizing a full-precision model after the fact, Super trains natively in reduced precision from the first gradient update—meaning the model learns to be accurate within the constraints of 4-bit arithmetic throughout pretraining, not just at inference. The pretraining corpus spans 10 trillion unique curated tokens, with the model seeing 25 trillion total tokens across the run, including additional compute focused on reasoning and coding.
Supervised fine-tuning
Before reinforcement learning, Super undergoes supervised fine-tuning on about 7 million SFT samples. They’re drawn from a broader post-training corpus of 40 million samples, which cover reasoning, instruction following, coding, safety, and multi-step agent tasks. This stage establishes the behavioral foundation that RL then refines. The model learns the format and structure of correct responses across task types, giving the subsequent RL phase a stable starting point rather than optimizing from a raw pretrained checkpoint.
Multi-environment reinforcement learning
To align Super with real agentic behavior, the model is post-trained using reinforcement learning across diverse environments in NeMo Gym, the NVIDIA open source library for building and scaling RL training environments. These environments evaluate the model’s ability to perform sequences of actions—generating correct tool calls, writing functional code, producing multi-part plans that satisfy verifiable criteria—not just providing satisfying single-turn responses. These trajectories form the core training data to run reinforcement learning at scale with the NeMo RL open library.
This trajectory-based reinforcement produces a model that behaves reliably under multi-step workflows, reduces reasoning drift, and handles the kinds of structured operations common in agentic pipelines.
Benchmarking Nemotron 3 Super
Nemotron 3 Super achieves leading accuracy across a number of important agentic benchmarks while maintaining incredible throughput.
Figure 3. A chart comparing Nemotron 3 Super accuracy on key benchmarks against similarly sized open models.
The “Super + Nano” deployment pattern
Nemotron 3 Nano is an excellent choice for achieving high accuracy in executing targeted, individual steps within an agentic workflow. However, when multi-agent applications escalate to complex, multi-step activities, they require a high-capacity model for superior planning and reasoning. Think of a computer-use agent that needs to decide between different modalities of tools in order to, say, create a presentation with 10 high-quality slides.
Nemotron 3 Super is ideal for this role. For instance, in software development, simple merge requests can be addressed by Nemotron 3 Nano while complex coding tasks that require deeper understanding of the codebase can be handled by Nemotron 3 Super. And expert-level coding tasks can be addressed by proprietary models.
Building with Super’s open resources
Nemotron 3 Super is fully open—weights, datasets, and recipes—so developers can easily customize, optimize, and deploy the model on their own infrastructure for maximum privacy and security.
Model weights
Full parameter checkpoints for Nemotron 3 Super are available on Hugging Face and through NVIDIA NIM. The NVIDIA Nemotron Open Model License gives enterprises the flexibility to maintain data control and deploy anywhere.
End-to-end training and evaluation recipes
We are releasing the complete training and evaluation recipe for Nemotron 3 Super, covering the full pipeline from pretraining through alignment. This enables developers to reproduce Super’s training, adapt the recipe for domain-specific variants, or use it as a starting point for their own hybrid architecture research.
Deployment cookbooks
We’ve built ready-to-use cookbooks for major inference engines, each with configuration templates, performance tuning guidance, and reference scripts:
vLLM Cookbook: High-throughput continuous batching and streaming for Super.
SGLang Cookbook: Fast, lightweight inference optimized for multi-agent tool-calling workloads.
NVIDIA TensorRT LLM Cookbook: Fully optimized TensorRT LLM engines with latent MoE kernels for production-grade, low-latency deployment.
Fine-tuning cookbooks
Explore our Nemotron 3 Super customization cookbooks to efficiently fine-tune for your domain (LoRA/SFT) or advance its agentic reasoning capabilities (GRPO/DAPO):
LoRA SFT on Nemotron 3 Super using NVIDIA NeMo Megatron-Bridge
LoRA SFT on Nemotron 3 Super using NVIDIA NeMo Automodel
GRPO/DAPO on Nemotron 3 Super using NeMo RL
Open datasets
Nemotron 3 Super is built on a fully open, end-to-end data pipeline that spans pretraining, post-training, and interactive reinforcement learning—giving developers reproducible building blocks for agentic AI.
Pretraining corpora: 10 trillion curated tokens, trained over 25 trillion total seen tokens, plus an additional 10 billion tokens focused on reasoning and 15 million coding problems. All aggressively deduplicated and quality-filtered to maximize signal-to-noise.
Post-training datasets: 40 million new supervised and alignment samples covering reasoning, instruction following, coding, safety, and multi-step agent tasks across supervised fine-tuning, preference data, and RL trajectories (about 7 million used directly for SFT).
RL tasks and environments: Interactive RL across 21 environment configurations and 37 datasets (~10 of which are being released), including software engineer-style agent training and tool-augmented search/planning tasks—moving beyond static text into dynamic, verifiable execution workflows and generating ~1.2 million environment rollouts during training.
Open training and evaluation infrastructure
NVIDIA publishes development techniques and tools, giving researchers and enterprises the flexibility to customize Nemotron 3 Super or build their own reasoning models. All recipes integrate with the Nemotron GitHub repository, NeMo Gym, NeMo RL, NVIDIA NeMo Data Designer, NVIDIA NeMo Curator, and NVIDIA NeMo Evaluator—providing a complete, reproducible pipeline from data to deployment.
All Nemotron models are released with an open evaluation approach, including a published evaluation recipe that enables anyone to rerun and inspect the full evaluation pipeline from Nemotron 3 Super.
Get started
Nemotron 3 Super is live now. Available across leading inference platforms and packaged as NVIDIA NIM, Super can run anywhere from the workstation to the cloud. Try it on Perplexity with a Pro subscription or through API, OpenRouter, or build.nvidia.com.
Download the weights from Hugging Face, launch an optimized instance through NVIDIA NIM, fine-tune with Unsloth, or start with the cookbooks to get running in minutes.
Super is also available through Baseten, Cloudflare, CoreWeave, DeepInfra, Fireworks AI, FriendliAI, Google Cloud, Inference.net, Lightning AI, Modal, Nebius, and Together AI.
Check out our GitHub repository which has getting started instructions for platforms like OpenCode, OpenHands, and OpenClaw.
For the full technical details, read the Nemotron 3 Super technical report .
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. Visit the Nemotron developer page for resources to get started. Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com. And engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.
Tags
Agentic AI / Generative AI | General | NeMo | Nemotron | NIM | TensorRT-LLM | Intermediate Technical | News | featured | LLM Benchmarking | LLM Techniques | LLMs | Machine Learning & Artificial Intelligence | Mixture of Experts (MoE) | NVFP4 | Reinforcement Learning
About the Authors
About Chris Alexiuk
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
About Chintan Patel
Chintan Patel is a senior product manager at NVIDIA focused on bringing GPU-accelerated solutions to the HPC community. He leads the management and offering of the HPC application containers on the NVIDIA GPU Cloud registry. Prior to NVIDIA, he held product management, marketing and engineering positions at Micrel, Inc. He holds an MBA from Santa Clara University and a bachelor's degree in electrical engineering and computer science from UC Berkeley.
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics | NVIDIA Technical Blog |
nvidia_dev_blog |
16.03.2026 16:00 |
Physics forms the foundation of robotic simulation, enabling realistic modeling of motion and interaction. For tasks like locomotion and manipulation, simulators must handle complex dynamics such as contact forces and deformable objects. While most engines trade off speed for realism, Newton—a GPU-accelerated, open source simulator—is designed to do both.
Newton 1.0 GA, announced at NVIDIA GTC 2026, delivers an accelerated, production-ready foundation for dexterous manipulation and locomotion tasks. An extensible physics engine built on NVIDIA Warp and OpenUSD, Newton lets robots learn to handle complex tasks with greater precision, speed, and extensibility within frameworks such as NVIDIA Isaac Lab and NVIDIA Isaac Sim.
Newton is a modular framework that brings together multiple solvers and simulation components behind a unified architecture. Rather than being tied to a single scene format, it supports a broad runtime data model that spans common robotics descriptions such as MJCF, URDF, and OpenUSD, making it easier to connect existing robot assets and workflows. Teams can mix and match collision detection, contact models, sensors, control, and solver backends—rigid-body and deformable solvers as well as custom solvers—while keeping a consistent simulation stack for robot learning and development.
Figure 1. The Newton architecture shows a modular physics simulation framework that unifies multiple solvers and components across robotics and physics workloads, with integration points for Isaac Lab, Isaac Sim, MuJoCo, and OpenUSD
Highlights of this release include:
Stable API: The Newton API provides a stable, unified interface for a wide range of capabilities in modeling, solving, controlling, and sensing in the simulation.
Versatile rigid-body solvers: Newton ships with several rigid-body solvers. MuJoCo and Kamino have the most advanced and complementary capabilities:
Kamino, developed by Disney Research, handles complex mechanisms such as robotic hands and legged systems with closed-loop linkages and passive actuation. It enables a new class of simulation capabilities, giving mechanical designers the freedom to design systems without worrying about simulatability, while paving the way for scalable reinforcement learning.
MuJoCo 3.5 (MJWarp) builds on the stability and accuracy the robotics community already trusts in MuJoCo, developed by Google DeepMind, now extended with GPU-scale throughput for thousands of parallel training environments. New optimizations enable MuJoCo Warp to outperform MJX by 252x for locomotion and 475x for manipulation tasks on the NVIDIA RTX PRO 6000 Blackwell Series.
Rich deformable simulation: Powered by the Vertex Block Descent (VBD) solver, Newton handles linear deformables (cables), thin deformables (cloth), and volumetric deformables (rubber parts), covering common materials found in real industrial settings. In addition, the Implicit Material Point Method (iMPM) handles particle simulation of granular materials, applicable to rough-terrain locomotion scenarios. The VBD and iMPM solvers can be explicitly coupled with MuJoCo Warp to support deformable manipulation and locomotion scenarios with robotic systems.
Collision library: A flexible and fast collision detection pipeline enables selection of the right broadphase and narrowphase detection approaches based on scene complexity. The pipeline is reusable and can accelerate custom solver development. The library includes advanced contact generation and modeling:
Signed distance field (SDF)-based collision captures complex geometries directly from CAD-exported meshes, eliminating the need for mesh approximation methods. This is critical for tight-tolerance tasks such as connector insertion or in-hand manipulation.
Hydroelastic contacts, inspired by this Drake contact model, use a continuous pressure distribution across finite-area contact patches rather than a set of contact points. This provides higher-fidelity and more robust object interaction required for tactile sensing and manipulation policies, ultimately achieving better sim-to-real transfer.
Video 1. A Newton GPU-accelerated hydroelastic contact can generate and scale high-quality, realistic dexterous tactile data
OpenUSD and Isaac integration: With OpenUSD as a common data layer, Newton integrates natively with NVIDIA Isaac Sim 6.0 and Isaac Lab 3.0 early access releases, enabling faster workflows from robot description to trained policy and evaluation pipelines across reinforcement and imitation learning workflows.
Tiled camera sensor: A Warp-based tiled camera sensor supports high-throughput simplified rendering with channels for RGB, depth, albedo, surface normals, and instance segmentation. Designed to scale vision-based RL policies, it enables end-to-end perceptive training pipelines to run on the NVIDIA DGX platform. The rendering backend is ray-tracing-based and supports multiple scene representations, including triangle meshes and Gaussian splats.
Video 2. Tiled camera sensor generating batched visual observations across many parallel environments for perceptive RL training
Newton is a Linux Foundation project founded by NVIDIA, Google DeepMind, and Disney Research. Lightwheel, a leader in simulation infrastructure for physical AI, has joined forces with Newton to advance solver calibration, define the SimReady standard, and develop the next generation of physically grounded SimReady assets. Toyota Research Institute (TRI), developer of the Drake physics engine, is also partnering with Newton to advance solver development and contact modeling capabilities.
The next section walks through how these capabilities come together in locomotion and manipulation workflows.
Simulating complex mechanisms with Kamino
The Kamino solver handles complex and intricate closed-chain mechanisms, such as robots with kinematics that include parallel linkage mechanisms. This enables the simulation of mechanisms like multi-link walking robots, where kinematics can include several closed loops in each leg. An example is Dr. Legs, a closed-chain robotic leg mechanism available in the Newton asset repository, which demonstrates how Kamino handles articulated structures with multiple closed loops.
Video 3. Dr. Legs, a closed-chain robotic leg mechanism, simulated with the Kamino solver
Newton workflows follow a consistent pattern: build or import a model, initialize the state, apply controls (such as joint targets or forces), and step a solver such as Kamino to advance the physics, with results visualized through the viewer.
import newton

# Import the articulation model from USD
builder = newton.ModelBuilder()
# Register attributes to be parsed specific to Kamino
newton.solvers.SolverKamino.register_custom_attributes(builder)
# Import USD asset
builder.add_usd("robot.usd")
# Finalize the model (upload to GPU)
model = builder.finalize()
# Create Kamino solver
solver = newton.solvers.SolverKamino(model)
# Create state and control objects
state_0 = model.state()
state_1 = model.state()
control = model.control()
contacts = model.contacts()
# Timestep and step count (illustrative values, not from the original snippet)
sim_dt = 1.0 / 60.0
num_steps = 600
# Simulation loop
for i in range(num_steps):
    state_0.clear_forces()
    model.collide(state_0, contacts)
    # Forward dynamics
    solver.step(state_0, state_1, control, contacts, sim_dt)
    # Swap states
    state_0, state_1 = state_1, state_0
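The state swap at the end of the loop is a standard double-buffering pattern: the solver reads from one state object and writes to the other, so no step ever reads data it is currently mutating. The pattern can be illustrated with a minimal, library-free sketch, using a toy 1D semi-implicit Euler integrator as a stand-in for the Kamino solver (all names and values here are illustrative, not Newton API):

```python
# Double-buffered stepping, illustrated with a toy 1D Euler integrator.
# This is a schematic stand-in for the Kamino loop above, not Newton API.

class State:
    def __init__(self, x=0.0, v=0.0):
        self.x = x  # position [m]
        self.v = v  # velocity [m/s]

def step(src, dst, dt, accel=-9.81):
    """Read from src, write to dst -- never mutate src mid-step."""
    dst.v = src.v + accel * dt
    dst.x = src.x + dst.v * dt  # semi-implicit Euler

state_0, state_1 = State(x=10.0), State()
dt = 0.01
for _ in range(100):  # 1 second of simulated free fall
    step(state_0, state_1, dt)
    state_0, state_1 = state_1, state_0  # swap, as in the Kamino loop

final_height = state_0.x  # after 1 s of free fall from x = 10.0
```

Because the read and write buffers are disjoint, the same pattern parallelizes cleanly on the GPU: every body can be integrated independently without races on the source state.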
Manipulation workflow with Newton and Isaac Lab
Figure 2. Franka robots performing a cube-picking task in Isaac Lab
NVIDIA Isaac Lab is an open source framework for robot learning. Researchers and developers can define simulation environments, set up reinforcement learning (RL) and imitation learning pipelines, and train policies at GPU scale. Users define a scene (robots, objects, terrain), an MDP (observations, actions, rewards, terminations), and a simulation backend. Newton integrates with Isaac Lab as a new physics and camera-sensor backend, further expanding Isaac Lab’s capabilities.
In an Isaac Lab RL workflow, everything above the physics layer—the task definition, PPO training loop, observation and reward functions—stays identical. Only the simulation backend changes. This means developers can author an environment once and validate it across different physics engines, building confidence in policy robustness before real-world deployment.
Together, Newton and Isaac Lab can power an RL pipeline for training a Franka robot to pick up a cube. Once the scene is set up, the next step is to configure the physics settings. The following example shows the three-layer physics configuration Isaac Lab uses for the Franka cube-manipulation environment.
from isaaclab.sim import SimulationCfg
from isaaclab_newton.physics import NewtonCfg, MJWarpSolverCfg

# Configure Newton MJWarp simulation for the Franka Cube Env
FrankaCubeEnvCfg.solver_cfg = MJWarpSolverCfg(
    solver="newton",
    integrator="implicitfast",
    njmax=2000,
    nconmax=1000,
    impratio=1000.0,
    cone="elliptic",
    update_data_interval=2,
    iterations=20,
    ls_iterations=100,
    ls_parallel=True,
)
FrankaCubeEnvCfg.newton_cfg = NewtonCfg(
    solver_cfg=FrankaCubeEnvCfg.solver_cfg,
    num_substeps=2,
    debug_mode=False,
)
FrankaCubeEnvCfg.sim = SimulationCfg(
    dt=1 / 120,
    render_interval=FrankaCubeEnvCfg.decimation,
    physics=FrankaCubeEnvCfg.newton_cfg,
)
All other steps, such as applying actions, getting rewards, and resetting the environment, remain the same.
How Newton is being used in industrial applications
The following two examples show how Newton’s capabilities come together in production robotics workflows. One focuses on rigid-body precision assembly and the other on dexterous manipulation of deformable materials.
GPU rack assembly automation
Skild AI is training reinforcement learning policies for GPU rack assembly, one of the most demanding contact-rich tasks in electronics manufacturing, for its industrial end users. Connector insertion, board placement, and fastening require stable collision and contact, reliable force feedback, and full-fidelity geometric representation that most simulators cannot provide at training scale.
Skild is using Isaac Lab with the Newton backend for their electronics assembly automation tasks. In their workflows, the SDF-based collision detection and hydroelastic contact modeling are used to bypass MuJoCo Warp’s native collision and contact pipeline, enabling higher contact fidelity. Shapes are configured with precomputed SDFs built from the original CAD geometry, enabling Newton to operate on non-convex tri-mesh models that accurately represent the geometry of assembly components.
SDF collision is useful for rigid, non-compliant interactions with complex geometry, enabling precise contact queries against connectors, boards, and other tightly-toleranced parts.
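To make the idea concrete: an SDF maps any point in space to its signed distance from a surface (negative inside the object), and the SDF's gradient gives the contact normal, so penetration depth and contact direction fall out of two cheap queries. A toy example with an analytic sphere SDF (illustrative only; Newton precomputes grid-based SDFs from CAD meshes rather than using closed-form shapes):

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.05):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(p, center) - radius

def sdf_normal(sdf, p, eps=1e-5):
    """Contact normal via a central finite-difference gradient of the SDF."""
    g = []
    for axis in range(3):
        hi = list(p); hi[axis] += eps
        lo = list(p); lo[axis] -= eps
        g.append((sdf(hi) - sdf(lo)) / (2 * eps))
    norm = math.sqrt(sum(c * c for c in g))
    return tuple(c / norm for c in g)

# Query a point 2 mm inside the sphere surface, along the +x axis
q = (0.048, 0.0, 0.0)
penetration = -sphere_sdf(q)        # positive depth when inside
normal = sdf_normal(sphere_sdf, q)  # points outward, along +x here
```

The same two queries work against a precomputed grid SDF of an arbitrary non-convex mesh, which is why the representation scales to tightly toleranced CAD geometry.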
Video 4. Robotic GPU rack assembly tasks include connector insertion and component placement used for RL policy training
For richer contact dynamics, hydroelastic modeling introduces compliance that produces distributed pressure contacts rather than point-contact approximations. This creates larger contact areas that capture frictional behavior, including torsional friction effects that can arise during complex object manipulation sequences. Together, the SDF geometry representation and hydroelastic contact model provide the fidelity required to train policies that can reliably transfer to real industrial assembly systems.
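The difference from a point contact can be seen with a little arithmetic. In a hydroelastic model, pressure grows with penetration across the whole patch, and the normal force is the integral of that pressure over the patch area rather than a single impulse at one point. A toy numeric version for a circular patch with a parabolic penetration profile, using a linear pressure law p = k_h * depth (all values here are illustrative, not from the article):

```python
import math

k_h = 1.0e11   # hydroelastic stiffness [Pa/m], same role as shape_cfg.kh
d0 = 1.0e-5    # peak penetration at patch center [m] (10 um)
R = 0.002      # contact patch radius [m]

def depth(r):
    """Illustrative penetration profile: deepest at center, zero at rim."""
    return d0 * (1.0 - (r / R) ** 2)

# Integrate pressure over the patch as thin annular rings (midpoint rule)
n = 1000
dr = R / n
force = 0.0
for i in range(n):
    r = (i + 0.5) * dr
    force += k_h * depth(r) * 2.0 * math.pi * r * dr  # pressure * ring area

# Closed form for this profile: F = k_h * d0 * pi * R^2 / 2
exact = k_h * d0 * math.pi * R**2 / 2.0
```

Because the force comes from a distributed pressure field, shifting or tilting the patch changes the force's point of application and produces the torsional friction moments a single contact point cannot represent.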
The following snippet shows how SDF collisions and hydroelastic contacts are configured:
import newton
from newton.geometry import HydroelasticSDF

# --- 1. Shape configuration: enable hydroelastic contact ---
shape_cfg = newton.ModelBuilder.ShapeConfig(
    mu=1.0,     # friction coefficient
    gap=0.01,   # contact detection margin [m]
    ke=5.0e4,   # elastic contact stiffness (MJWarp fallback)
    kd=5.0e2,   # contact damping
    kh=1.0e11,  # hydroelastic stiffness [Pa/m] -- controls
                # pressure vs. penetration across the contact patch
)

# --- 2. Build SDF on each collision mesh ---
# Precompute a sparse signed-distance field so Newton can find
# sub-voxel contact surfaces via marching cubes at runtime.
for mesh in assembly_meshes:
    mesh.build_sdf(
        max_resolution=128,               # SDF grid resolution
        narrow_band_range=(-0.01, 0.01),  # ±10 mm band around surface
        margin=shape_cfg.gap,
    )

# --- 3. Mark shapes as hydroelastic ---
# When both shapes in a colliding pair carry this flag, Newton
# routes them through the SDF-hydroelastic pipeline instead of
# MJWarp's native point-contact solver.
for shape_idx in range(builder.shape_count):
    builder.shape_flags[shape_idx] |= newton.ShapeFlags.HYDROELASTIC

# --- 4. Create the collision pipeline with hydroelastic config ---
collision_pipeline = newton.CollisionPipeline(
    model,
    reduce_contacts=True,    # contact reduction for stable solving
    broad_phase="explicit",  # precomputed shape pairs (few shapes)
    sdf_hydroelastic_config=HydroelasticSDF.Config(
        output_contact_surface=False,  # skip surface mesh export
    ),
)

# --- 5. Simulation loop (unchanged from standard Newton) ---
# The solver receives distributed contact patches transparently.
collision_pipeline.collide(state_0, contacts)
solver.step(state_0, state_1, control, contacts, dt)
Cable manipulation for refrigerator assembly
Samsung will use Newton for physically grounded synthetic data generation (SDG) to train its vision-language-action (VLA) models.
Lightwheel is applying Newton to generate SimReady assets, tuned and verified against real-world measurements. This enables a variety of complex industrial assembly tasks, including cable manipulation in Samsung manufacturing workflows. Cables are among the hardest objects to simulate reliably: they exhibit complex 1D deformable behavior, self-collision, and force-dependent shape changes that canonical solvers cannot capture accurately.
Samsung and Lightwheel’s work illustrates how Newton’s deformable simulation stack, spanning cables through volumetric solids, enables synthetic data generation and policy training on the full range of materials found in real electronics assembly lines.
Video 5. An RB-Y1 robot performing a cable insertion task for refrigerator assembly, simulated with two-way coupled MuJoCo Warp and a VBD cable solver
Newton’s VBD solver enables simulating linear deformables such as cables. Two-way coupling with rigid-body solvers like MuJoCo Warp enables robot motion to physically interact with cable deformation during simulation. Combined with Newton’s stable collision and high-fidelity contact modeling, this setup enables simulation of tasks such as inserting a refrigerator water-hose connector into its housing. The following snippet shows how VBD and MuJoCo Warp are coupled.
import warp as wp
import newton
from newton.solvers import SolverMuJoCo, SolverVBD

# --- Universe A: MuJoCo rigid-body robot ---
robot_model = robot_builder.finalize()
mj_solver = SolverMuJoCo(
    robot_model,
    solver="newton",
    integrator="implicitfast",
    cone="elliptic",
    iterations=20,
    ls_iterations=10,
    ls_parallel=True,
    impratio=1000.0,
)
robot_state_0 = robot_model.state()
robot_state_1 = robot_model.state()
control = robot_model.control()
mj_collision_pipeline = newton.CollisionPipeline(
    robot_model,
    reduce_contacts=True,
    broad_phase="explicit",
)
mj_contacts = mj_collision_pipeline.contacts()

# --- Universe B: VBD deformable cable ---
cable_builder = newton.ModelBuilder()
cable_builder.add_rod(
    positions=cable_points,   # polyline vertices [m]
    quaternions=cable_quats,  # parallel-transport frames
    radius=0.003,             # cable cross-section radius [m]
    stretch_stiffness=1e12,   # EA [N]
    bend_stiffness=3.0,       # EI [N*m^2]
    stretch_damping=1e-3,
    bend_damping=1.0,
)

# --- Proxy bodies: robot links mirrored into VBD ---
for body_id in proxy_body_ids:
    # Effective mass: reflects the inertia of the full articulated
    # chain when applicable, optionally scaled for coupling stability.
    proxy_id = cable_builder.add_body(
        xform=robot_state_0.body_q[body_id],
        mass=effective_mass[body_id],
    )
    for shape in shapes_on_body(robot_model, body_id):
        cable_builder.add_shape(body=proxy_id, **shape)
    robot_to_vbd[body_id] = proxy_id

cable_model = cable_builder.finalize()
vbd_solver = SolverVBD(
    cable_model,
    iterations=10,
)
vbd_state_0 = cable_model.state()
vbd_state_1 = cable_model.state()
vbd_control = cable_model.control()
vbd_collision_pipeline = newton.CollisionPipeline(cable_model)
vbd_contacts = vbd_collision_pipeline.contacts()

proxy_forces = wp.zeros(robot_model.body_count, dtype=wp.spatial_vector)
coupling_forces_cache = wp.zeros_like(proxy_forces)

@wp.kernel
def sync_proxy_state(
    robot_ids: wp.array(dtype=int),
    proxy_ids: wp.array(dtype=int),
    src_body_q: wp.array(dtype=wp.transform),
    src_body_qd: wp.array(dtype=wp.spatial_vector),
    dst_body_q: wp.array(dtype=wp.transform),
    dst_body_qd: wp.array(dtype=wp.spatial_vector),
    proxy_forces: wp.array(dtype=wp.spatial_vector),
    body_inv_mass: wp.array(dtype=float),
    body_inv_inertia: wp.array(dtype=wp.mat33),
    gravity: wp.vec3,
    dt: float,
):
    i = wp.tid()
    rid = robot_ids[i]
    pid = proxy_ids[i]
    # Copy pose and velocity from robot to proxy
    dst_body_q[pid] = src_body_q[rid]
    qd = src_body_qd[rid]
    # Undo coupling forces + gravity on proxy velocity
    f = proxy_forces[rid]
    delta_v = dt * body_inv_mass[pid] * wp.spatial_top(f)
    r = wp.transform_get_rotation(dst_body_q[pid])
    delta_w = dt * wp.quat_rotate(
        r, body_inv_inertia[pid] * wp.quat_rotate_inv(r, wp.spatial_bottom(f))
    )
    qd = qd - wp.spatial_vector(delta_v + dt * body_inv_mass[pid] * gravity, delta_w)
    dst_body_qd[pid] = qd

# --- Coupled step (staggered, one-step lag) ---
# Step 1 -- Apply lagged VBD-to-MuJoCo wrenches
robot_state_0.clear_forces()
coupling_forces_cache.assign(proxy_forces)
robot_state_0.body_f.assign(robot_state_0.body_f + coupling_forces_cache)

# Step 2 -- Advance MuJoCo (rigid-body robot)
mj_collision_pipeline.collide(robot_state_0, mj_contacts)
mj_solver.step(robot_state_0, robot_state_1, control, mj_contacts, dt)
robot_state_0, robot_state_1 = robot_state_1, robot_state_0

# Step 3 + 4 -- Sync proxy poses/velocities and undo coupling forces (single kernel)
wp.launch(
    sync_proxy_state,
    dim=len(proxy_body_ids),
    inputs=[
        robot_ids_wp, proxy_ids_wp,
        robot_state_0.body_q, robot_state_0.body_qd,
        vbd_state_0.body_q, vbd_state_0.body_qd,
        coupling_forces_cache,
        cable_model.body_inv_mass, cable_model.body_inv_inertia,
        gravity, dt,
    ],
)

# Step 5 -- Advance VBD (cable deformation + cable-proxy contacts)
vbd_collision_pipeline.collide(vbd_state_0, vbd_contacts)
vbd_solver.step(vbd_state_0, vbd_state_1, vbd_control, vbd_contacts, dt)

# Step 6 -- Harvest contact wrenches from proxy bodies (applied at next step)
proxy_forces = harvest_proxy_wrenches(vbd_solver, vbd_state_1, vbd_contacts, dt)
vbd_state_0, vbd_state_1 = vbd_state_1, vbd_state_0
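Stripped of the GPU machinery, the staggered scheme above is two independent integrators exchanging forces with a one-step lag: each step, solver A consumes the wrench solver B produced last step, advances, and hands B its fresh pose; B then advances against that pose and emits the reaction force A will see next step. The same scheme can be demonstrated with a library-free toy, coupling two 1D masses through a damped spring (all values hypothetical, for illustration only):

```python
# Two masses coupled by a damped spring, each advanced by its "own solver",
# exchanging the coupling force with a one-step lag -- the same staggered
# pattern as the MuJoCo/VBD loop above, minus the GPU machinery.

k = 50.0   # coupling spring stiffness [N/m]
c = 2.0    # coupling damping [N*s/m]
dt = 1.0e-3
m_a, m_b = 1.0, 0.5

x_a, v_a = 0.0, 0.0  # "rigid" universe A
x_b, v_b = 0.1, 0.0  # "deformable" universe B, starts 0.1 m away
lagged_force = 0.0   # force on A harvested from B's previous step

for _ in range(5000):
    # Steps 1+2 -- advance A using LAST step's coupling force
    v_a += dt * lagged_force / m_a
    x_a += dt * v_a
    # Step 5 -- advance B against A's freshly synced pose
    f_on_b = -k * (x_b - x_a) - c * (v_b - v_a)
    v_b += dt * f_on_b / m_b
    x_b += dt * v_b
    # Step 6 -- harvest the reaction force, applied to A next step
    lagged_force = -f_on_b
```

The one-step lag injects a small, dt-proportional error into the force exchange, which is why the real pipeline keeps substeps short and why damping in the coupled system matters for stability; with the values above, the two masses settle to a common rest point.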
How to get started with Newton
Newton is free to use, modify, and extend. To get started:
Explore the newton-physics/newton GitHub repo for standalone Newton examples and documentation.
Try the dexterous manipulation and locomotion workflows on isaac-sim/IsaacLab GitHub.
If you’re attending NVIDIA GTC 2026, check out the following sessions:
Disney’s Robotic Characters: From the Screen to Reality via Physical AI
An Introduction to Newton Physics Engine for Robotics
Accelerate Robot Learning with NVIDIA Isaac Lab and Newton
Build Robot-Ready Assets for Physically Accurate Simulations With Lightwheel
How to use NVIDIA Warp to Build GPU-Accelerated Computational Physics Simulations
Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on LinkedIn, Instagram, X, and Facebook. Explore NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Robotics forum. To start your robotics journey, enroll in our free NVIDIA Robotics Fundamentals courses today.
Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems.
Watch the NVIDIA GTC 2026 keynote with NVIDIA founder and CEO Jensen Huang and explore more physical AI , robotics , and vision AI GTC sessions.
About the Authors
About Philipp Reist
Philipp Reist is a director of simulation technology at NVIDIA. He received his Ph.D. in Robotics and Control from the Institute for Dynamic Systems and Control at ETH Zürich for his research in the design of sensorless juggling machines. He previously led the mechanical design and production of flying machines at Verity Studios for the Cirque du Soleil production Paramour and developed mechatronic modules for enabling animatronics-based artworks with Pors & Rao at Wyss Zurich. His recent work focuses on developing high-performance, accurate physics engines to enable synthetic data generation for training physical AI, with a particular emphasis on robotic manipulation workflows.
About Miguel Zamora Mora
Miguel Zamora Mora is a software engineer at NVIDIA, where he works on robot manipulation for the Newton Physics project. His prior work spans deformable object manipulation, multi-arm task and motion planning, and policy optimization through differentiable simulation. He holds a Ph.D. in Computer Science from ETH Zürich, where he was part of the Computational Robotics Lab.
About JC Chang
JC Chang is a senior research scientist at NVIDIA, where he develops high-fidelity, high-performance physics simulations spanning rigid bodies, deformable bodies, and fluids. He's a key contributor to the core NVIDIA physics simulation engines, including Newton and PhysX. He holds a Ph.D. in Computer Science from the University of Waterloo.
About Rishabh Chadha
Rishabh Chadha is a technical marketing engineer at NVIDIA, where he focuses on integrating deep learning and robotics frameworks for the NVIDIA Jetson platforms. He has a master's degree in Robotics from Worcester Polytechnic Institute. His interests primarily include deep learning, medical imaging, and robot perception.
About Mohammad Mohajerani
Mohammad Mohajerani is a senior product manager at NVIDIA, where his work enables high-performance simulation and real-time physics across physical AI, CAE, and AI-driven digital twin applications through Warp and Newton. Prior to NVIDIA, Mohammad held product and engineering leadership roles in the startup world at Sanctuary AI and Haply Robotics and spent several years advancing physics engines and skills training simulators at CM Labs Simulations. He holds a master's degree in mechanical engineering from Concordia University, with a focus on aerial robotics, system identification, and control optimization.
As Open Models Spark AI Boom, NVIDIA Jetson Brings It to Life at the Edge
Open source generative AI models are leaving the data center and showing up in machines that work in the physical world. From Orin to Thor, the NVIDIA Jetson family is becoming a common place to run models like NVIDIA Nemotron, Cosmos and Isaac GR00T and a growing list of community models like Qwen, Gemma, Mistral AI, GPT-OSS, PI and others.
March 10, 2026 by Chen Su
The Cat 306 CR mini-excavator weighs just under eight tons and fits inside a standard shipping container. It’s the machine a contractor rents when the job site is tight: a utility trench near a foundation, a basement dig in a dense neighborhood.
The cab is roughly the size of a phone booth. The operator sits close to the controls, two joysticks, multiple functions per hand. It takes time to learn. It takes longer to speed up.
At CES earlier this year, that machine answered questions.
In the demo, the Cat AI Assistant ran on NVIDIA Jetson Thor, an edge AI platform built for real-time inference in industrial and robotic systems. NVIDIA Nemotron speech models provide fast, accurate natural voice interaction, while Qwen3 4B, served locally via vLLM, interprets requests and generates responses with low latency, no cloud link required.
Beyond enterprise innovation, open models unlock new possibilities for developers to build and experiment freely. Running OpenClaw on NVIDIA Jetson enables developers to create private, always-on AI assistants at the edge — with zero application programming interface cost and full data privacy.
All Jetson developer kits support OpenClaw, offering the flexibility to switch across open models from 2 billion parameters to 30 billion. With a frontier-class AI assistant running locally, users can power morning briefings, automate daily tasks, perform code reviews and control smart home systems — all in real time.
From the Cloud to the Edge
For most of their recent history, open models lived where it was easiest to support them.
They ran in data centers, backed by elastic compute and persistent networks. Cloud deployments carry costs in latency and ongoing compute spend that scale with every query.
Physical systems optimize for something else. Low latency because machines interact with people and environments. Limited power because devices have hard limits. And consistent behavior because variability introduces risk.
There’s also a supply question. Memory shortages have driven up costs across the industry. Jetson brings compute and memory together in a system-on-module, accelerating customer hardware design and making sourcing and validation easier than with discrete component approaches.
And as models have grown more efficient, developers have also started asking a different question. Not which model performs best in isolation, but where it makes sense to run.
More often, the answer is on the device, starting from Jetson Orin Nano 8GB for entry-level generative AI models.
Building Autonomous Physical AI Systems at Scale
For physical AI systems, generative AI models are expanding what’s possible.
Caterpillar’s in-cab Cat AI Assistant, which is in development, runs speech and language models locally alongside trusted machine context, supporting operator guidance and safety features.
At CES, Franka Robotics showed what that looks like in robotics. The company’s FR3 Duo dual-arm system ran the NVIDIA GR00T N1.6 model end-to-end onboard, perception to motion, no task scripting. The policy executes locally.
In robotics research, the SONIC project from NVIDIA’s GEAR Lab trains a humanoid controller on over 100 million frames of motion-capture data, then deploys the resulting policy on a physical robot where the kinematic planner runs on Jetson Orin at around 12 milliseconds per pass. The policy loop runs at 50 Hz. Everything executes onboard.
The pattern reaches into the developer community. A team from UIUC’s SIGRobotics club built a dual-arm matcha-making robot on Jetson Thor running the GR00T N1.5 model. It took first place at an NVIDIA embodied AI hackathon.
This research momentum continues at the NYU Center for Robotics and Embodied Intelligence. The group recently ran its YOR robot on Jetson Thor, using NVIDIA Blackwell compute to handle the heavy processing required for AI-driven movement. Early results show YOR performing intricate pick-and-place tasks with better generalization to new objects and robustness to scene variation, accelerating readiness for a wide range of household tasks like cooking and laundry.
Independent researchers are finding the same. Andrés Marafioti, a multimodal research lead at Hugging Face, built an agentic AI system on Jetson AGX Orin that routes tasks across models and schedules its own work. Late one night, the agent sent him a message: Go to sleep. Everything will be ready by morning.
Developer Ajeet Singh Raina from the Collabnix community has shown how to run OpenClaw on NVIDIA Jetson Thor for a personal AI assistant that runs 24/7. This setup allows for private large language model inference for the user’s own data while the system manages emails and calendars through a local gateway.
Jetson Is the New Standard
NVIDIA Jetson has become a common platform for running open models at the edge.
It supports a wide range of open models and AI frameworks, giving developers flexibility for almost any generative AI workload at the edge.
Model benchmarks are available at Jetson AI Lab, along with tutorials from the open model community. Jetson Thor delivers leadership inference performance across all major generative AI models.
Gemma: Built on Google’s Gemini research, Gemma 3 is a versatile workhorse for Jetson. It is multimodal out of the box, which means it can see and talk in over 140 languages. On Jetson Thor, it handles a massive 128K context window. This makes it perfect for robots that need to remember a long list of complex or multistep instructions.
gpt-oss-20B: This model from OpenAI lowers the barrier to deploying advanced AI by delivering near state-of-the-art reasoning performance in a model that can run locally on Jetson Thor and Orin for cost-efficient inference.
Mistral AI: The new Mistral 3 open model family delivers industry-leading accuracy, efficiency and customization capabilities for developers and enterprises. This family includes small, dense models ranging from 3B to 14B, fast and remarkably smart for their size. Jetson developers can use the vLLM container on NVIDIA Jetson Thor to achieve 52 tokens per second for single concurrency, with scaling up to 273 tokens per second with concurrency of eight.
NVIDIA Cosmos: This leading, open, reasoning vision language model enables robots and AI agents to see, understand and act in the physical world like humans. Both the 8B and 2B models run on Jetson to deliver advanced spatial-temporal perception and reasoning capabilities.
NVIDIA Isaac GR00T N1.6 is an open vision language action model (VLA) for generalist robot skills. Developers can use it to build robots that perceive their environment, reason about instructions and act across a wide range of tasks, environments and embodiments. On Jetson Thor, the full GR00T N1.6 pipeline executes onboard, delivering real-time perception, spatial awareness and responsive action.
NVIDIA Nemotron: A family of open models, datasets and technologies that empower users to build efficient, accurate and specialized agentic AI systems. It’s designed for advanced reasoning, coding, visual understanding, agentic tasks, safety, speech and information. The Nemotron 3 Nano 9B model runs effectively on Jetson Orin Nano Super with llama.cpp at 9 tokens per second.
PI 0.5: A VLA model from Physical Intelligence that enables robots to understand instructions and autonomously execute complex real-world tasks with strong generalization and real-time adaptability, while NVIDIA Jetson Thor delivers 120 action tokens per second to power responsive, low-latency physical AI deployment.
Qwen 3.5: This family of models from Alibaba, including the latest Qwen 3.5 releases, offers a mix of dense and mixture-of-experts models that deliver strong reasoning, coding, multimodal understanding and long-context performance. Jetson Thor delivers optimized performance across Qwen models like the Qwen 3.5-35B-A3B model, which reasons at 35 tokens per second, making real-time interactivity possible.
Any developer can fine-tune these models to create specialized physical AI agents and seamlessly deploy them into physical AI systems. The NVIDIA Jetson platform supports popular AI frameworks, including NVIDIA TensorRT, llama.cpp, Ollama, vLLM, SGLang and more.
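The throughput figures quoted above translate directly into interactivity budgets. For example, with the Mistral-on-Jetson-Thor numbers (52 tokens/s at single concurrency, 273 tokens/s aggregate at concurrency 8), a quick back-of-envelope calculation shows per-stream throughput, batching efficiency, and how long a typical reply takes per user:

```python
# Back-of-envelope check: does batched serving still feel interactive?
# Inputs are the Mistral-on-Jetson-Thor figures quoted above.

single_tps = 52.0    # tokens/s at concurrency 1
batched_tps = 273.0  # aggregate tokens/s at concurrency 8
concurrency = 8

per_stream_tps = batched_tps / concurrency  # tokens/s each user sees
scaling_efficiency = batched_tps / (single_tps * concurrency)

# Time to stream a ~150-token reply to each of the 8 users:
reply_tokens = 150
seconds_per_reply = reply_tokens / per_stream_tps
```

Aggregate throughput grows sublinearly with concurrency (here roughly 66% of the ideal 8x), so the choice of batch size is a tradeoff between total throughput and per-user latency.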
Take On Open Models on Jetson
Developers can dive into Hugging Face tutorials — including Deploying Open Source Vision Language Models on Jetson — and catch the latest livestream . Learn from this tutorial and run OpenClaw on NVIDIA Jetson.
Join GTC 2026 next month to see it all in action. NVIDIA will show how open models are moving from data centers into machines operating in the physical world, including in a panel on the Future of Industrial Autonomy .
Watch the GTC keynote from NVIDIA founder and CEO Jensen Huang and explore physical AI , robotics and vision AI sessions.
Caterpillar Technical Highlights
NVIDIA Jetson Thor: Edge AI platform for real-time inference in industrial and robotics systems
NVIDIA Riva: Speech AI framework using Parakeet ASR and Magpie TTS
Qwen3 4B: Compact LLM for intent parsing and response generation
vLLM: Efficient runtime for serving LLM inference at the edge
CatHelios: Unified data platform providing trusted machine context
NVIDIA Omniverse: Digital twin and simulation frameworks for industrial workflows
Watch NVIDIA CEO Jensen Huang’s Keynote
Hear from NVIDIA CEO Jensen Huang live on stage at SAP Center. Arrive early to catch the GTC Live 2026 pregame show for an insightful discussion on the latest in AI, accelerated computing and transformative tech with industry leaders.
Watch Now