Delivering Flexible Performance for Future-Ready Data Centers with NVIDIA MGX | NVIDIA Technical Blog
nvidia_dev_blog
15.12.2025 18:25
The AI boom reshaping the computing landscape is poised to scale even faster in 2026. As breakthroughs in model capability and computing power drive rapid growth, enterprise data centers are being pushed beyond the limits of conventional server and rack architectures. This is creating new pressures on power budgets, thermal envelopes, and facility space.
The NVIDIA MGX modular reference architecture provides forward-looking designs that enable faster time-to-market (TTM) with standardized building blocks. MGX helps system partners integrate fast-evolving technologies and deliver the flexible, energy-efficient platforms that modern AI data centers require.
This post explores the next evolution of the MGX modular reference architecture: a 6U (800 mm) chassis configuration designed specifically for the next generation of accelerated compute and networking platforms, including the new liquid-cooled variant of the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.
Flexible, future-proof design with enhanced serviceability
Forward-looking compatibility and flexibility are core design principles of the MGX 6U platform. It features a single chassis that can span multiple computing generations and workload profiles. It’s designed to support today’s most powerful computing platforms while offering future-proof compatibility, reducing the need for disruptive redesigns over time.
Partners can design these systems with multiple MGX-based host-processor modules (HPMs), including x86 platforms and the next-generation NVIDIA Vera CPU. This enables standardizing on a single server design while supporting multiple CPU architectures and workload requirements.
The larger chassis volume also creates accessible service pathways for maintenance. Key components such as network cards, power supplies, and other field-replaceable units are easy to reach, which simplifies serviceability and reduces operational overhead when managing rack-scale infrastructure.
Sustainable, efficient computing with liquid-cooled NVIDIA RTX PRO Server
The MGX 6U design is the foundation for the next wave of accelerated computing platforms, starting with a new liquid-cooled NVIDIA RTX PRO Server. This new RTX PRO Server configuration will feature eight of the latest liquid-cooled RTX PRO 6000 Blackwell Server Edition GPUs, along with advanced AI networking capability delivered by NVIDIA BlueField-3 DPUs and NVIDIA ConnectX-8 SuperNICs with built-in PCIe Gen 6 switches (Figure 1).
Figure 1. The MGX 6U system topology with eight GPUs, NVIDIA BlueField-3 DPUs, and ConnectX-8 SuperNICs with built-in PCIe Gen 6 switches
With a compact, single-slot liquid-cooled form factor, RTX PRO 6000 Blackwell delivers breakthrough performance for powering AI factories and accelerating demanding enterprise AI workloads with improved thermal efficiency. It’s capable of running the full suite of NVIDIA enterprise software, including NVIDIA AI Enterprise, NVIDIA Omniverse, NVIDIA vGPU, and NVIDIA Run:ai. It provides a universal data center platform for building and deploying the next generation of AI-enabled applications, from agentic AI and physical AI to scientific computing, simulation, graphics, and video.
Additionally, the RTX PRO 6000 Blackwell Server Edition GPU is validated by more than 50 leading enterprise ISVs spanning engineering, scientific computing, and professional visualization applications, as well as the most widely adopted orchestration, management, and AI ops platforms.
Figure 2. Liquid-cooled NVIDIA RTX PRO 6000 Blackwell Server Edition GPU
High-performance AI networking with NVIDIA ConnectX
Network performance is essential to maximizing the performance of AI workloads at scale. The MGX 6U reference design supports ConnectX-8 AI networking today and will support ConnectX-9 when it becomes available, delivering Ethernet and InfiniBand connectivity options to meet diverse data center and workload requirements.
The liquid-cooled RTX PRO Server, based on the MGX 6U configuration, features a streamlined system architecture that includes the latest-generation ConnectX-8 SuperNICs with integrated PCIe Gen 6 switches.
Built for AI workloads, ConnectX-8 with integrated PCIe Gen 6 switches supports up to 400 Gb/s of network bandwidth per RTX PRO 6000 Blackwell GPU (based on a 2:1 GPU-to-NIC ratio).
In addition to streamlining the design and reducing server complexity versus systems with dedicated PCIe switches, ConnectX-8 effectively doubles per‑GPU network bandwidth. This helps to remove I/O bottlenecks and speeds data movement between GPUs, NICs, and storage, resulting in up to 2x higher NCCL all‑to‑all performance and more scalable multi‑GPU, multi‑node workloads across AI factories.
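As a back-of-envelope check of the figure above, assuming (not stated in this post) that each ConnectX-8 SuperNIC provides 800 Gb/s of network bandwidth shared at the stated 2:1 GPU-to-NIC ratio:

```python
# Hypothetical sanity check of the per-GPU bandwidth quoted above.
# Assumption: one 800 Gb/s ConnectX-8 SuperNIC serves two GPUs (2:1 ratio).
nic_bandwidth_gbps = 800
gpus_per_nic = 2
per_gpu_gbps = nic_bandwidth_gbps // gpus_per_nic
print(per_gpu_gbps)  # 400 Gb/s per RTX PRO 6000 Blackwell GPU
```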
AI runtime security and infrastructure acceleration with NVIDIA BlueField
As accelerated infrastructure grows in scale and complexity, securing every layer of the system becomes essential. The MGX 6U design features NVIDIA BlueField data processing units (DPUs) to bring zero-trust security and infrastructure acceleration directly into the data center layer. The BlueField processor offloads and accelerates functions such as line-rate encryption, micro-segmentation, and real-time threat detection, enforcing least-privilege access while freeing the host’s computing resources (GPU and CPU) to focus on AI and other modern workloads.
By isolating control and management planes in hardware, BlueField enables organizations to protect AI pipelines from emerging threats while accelerating networking, storage, and virtualization services. Enterprises can further extend these capabilities by deploying validated BlueField-accelerated applications from leading software providers, enhancing both infrastructure efficiency and cybersecurity coverage. This combination helps ensure that RTX PRO Server deployments can scale securely, with consistent performance and policy enforcement across every node in the AI factory.
Building future-ready AI factories
As NVIDIA Blackwell and future GPU generations continue to push beyond traditional computing boundaries, the NVIDIA MGX modular architecture ensures AI factories can evolve with silicon innovations. For ecosystem partners building the next generation of accelerated computing platforms, MGX reduces engineering costs, shortens time to market, and delivers multigenerational compatibility while ensuring optimal performance and efficiency for enterprises deploying AI workloads at scale.
Systems featuring the liquid-cooled NVIDIA RTX PRO 6000 Blackwell Server Edition GPU, along with liquid-cooled RTX PRO Servers based on the MGX 6U configuration, are expected to arrive from global system builders in the first half of 2026.
Tags
Data Center / Cloud | General | Blackwell | MGX | Spectrum-X Ethernet | Intermediate Technical | Best practice | AI Agent | AI Factory | featured | Multi-GPU | Physical AI | Spectrum Ethernet | SuperNICs
About the Authors
About Anthony Larijani
Anthony Larijani is senior product marketing manager in NVIDIA’s data center enterprise platforms team, focused on NVIDIA’s portfolio of accelerated computing, networking, and software platforms. He is a marketing and sales professional with over 10 years of experience in data center infrastructure and cloud platform technologies. Larijani holds a bachelor’s degree from West Virginia University and an MBA from Carnegie Mellon University.
About Neil Dey
Neil Dey is an AI and HPC product leader, engineer, and inventor with eight patents in system design, manageability, and thermals. Currently a senior product manager at NVIDIA for the MGX product line, he brings 18+ years of enterprise and HPC product development experience. Neil excels in system design, architecting solutions, and product management; he holds a master’s degree in computer engineering and has completed Kellogg Executive Education.
About Ivan Goldwasser
Ivan leads product marketing for the Data Center CPU products for NVIDIA. Previously, Ivan worked in various marketing and strategy roles in the technology sector. Ivan has an MBA from Georgetown’s McDonough School of Business and a bachelor’s degree in chemical engineering from Texas A&M University.
About Itay Ozery
Itay Ozery is director of product marketing for networking at NVIDIA. He drives strategic product marketing initiatives for NVIDIA networking platforms and solutions. Itay has a solid track record in building and launching impactful products and solutions to market, and previously served in various enterprise IT positions.
About Michael Mangiafico
Michael Mangiafico is the director of Product Management and has spent the past five years focused on managing strategic relationships, product alignment, use case development, and roadmap planning with NVIDIA’s global server partners, helping to deliver cutting-edge accelerated computing technologies to customers worldwide. Michael brings over 25 years of experience in the technology industry. His career spans engineering R&D and product management roles at Compaq, Hewlett-Packard, and Hewlett-Packard Enterprise. Michael holds a bachelor’s degree in Electrical Engineering from Wentworth Institute of Technology and a master’s degree in Electrical Engineering from Worcester Polytechnic Institute.
Real-Time Decoding, Algorithmic GPU Decoders, and AI Inference Enhancements in NVIDIA CUDA-Q QEC | NVIDIA Technical Blog
nvidia_dev_blog
17.12.2025 21:32
Real-time decoding is crucial to fault-tolerant quantum computers. By enabling decoders to operate with low latency, concurrently with a quantum processing unit (QPU), we can apply corrections to the device within the coherence time. This prevents errors from accumulating and degrading the value of the results. We can do this online, with a real quantum device, or offline, with a simulated quantum processor.
To help solve these problems and enable research into better solutions, NVIDIA CUDA-Q QEC version 0.5.0 includes a range of improvements. These include support for online real-time decoding, new GPU-accelerated algorithmic decoders, infrastructure for high-performance AI decoder inference, sliding window decoder support, and more Pythonic interfaces.
We’ll cover all of these improvements in this post and dive into how you can use them to accelerate your quantum error correction research or operationalize real-time decoding with your quantum computer.
Real-time decoding is real with CUDA-Q QEC
Users can perform real-time decoding in a four-stage workflow. In order, the stages are: DEM generation, decoder configuration, decoder loading and initialization, and real-time decoding.
First, we characterize how the device errors behave during operation. Using a helper function, we can generate the detector error model (DEM) from a quantum code, noise model, and circuit parameters. The function will generate a complete DEM that maps error mechanisms to syndrome patterns.
# Step 1: Generate detector error model
import cudaq
import cudaq_qec as qec

print("Step 1: Generating DEM...")
cudaq.set_target("stim")
code = qec.get_code("steane")  # example: the Steane code; any supported code works
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.01), 1)
dem = qec.z_dem_from_memory_circuit(code, qec.operation.prep0, 3, noise)
The next step is to choose a decoder and configure it. We’ll discuss new decoders in greater detail in the following sections.
Using the DEM, the user configures the decoder and then saves this configuration to a YAML file. This file ensures that the decoders can correctly interpret the syndrome measurements.
# Create decoder config
config = qec.decoder_config()
config.id = 0
config.type = "nv-qldpc-decoder"
config.block_size = dem.detector_error_matrix.shape[1]
# ... (see nvidia.github.io/cudaqx/examples_rst/qec/realtime_decoding.html
#      for the full configuration)
# Save decoder config
with open("config.yaml", 'w') as f:
    f.write(config.to_yaml_str(200))
Before circuit execution, the user loads the YAML file. CUDA-Q QEC interprets the information, sets up the appropriate implementation in the decoder, and registers it with the CUDA-Q runtime.
# Load decoder config
qec.configure_decoders_from_file("config.yaml")
Now, users can begin executing quantum circuits. Inside CUDA-Q kernels, the decoding API interacts with the decoders. As the stabilizers of the logical qubits are measured, syndromes are enqueued to the corresponding decoder, which processes them. When corrections are needed, the decoder suggests operations to apply to the logical qubits.
# Run the circuit (qec_circuit is a CUDA-Q kernel defined elsewhere)
run_result = cudaq.run(qec_circuit, shots_count=10)
GPU-accelerated RelayBP
A recently developed decoder algorithm addresses the pitfalls of belief propagation (BP) decoders, a popular class of algorithmic decoders for quantum low-density parity-check codes. BP+OSD (belief propagation with ordered statistics decoding) runs a GPU-accelerated BP decoder and, when BP fails to converge, falls back to an ordered statistics post-processing step on the CPU. This works, but the CPU fallback makes the algorithm hard to optimize and parallelize for the low latency that real-time error decoding requires.
RelayBP augments BP with per-node memory strengths that control how much each node of the decoding graph remembers or forgets past messages. This damps or breaks the harmful symmetries that usually trap BP and prevent it from converging.
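The core update rule can be illustrated with a minimal sketch (illustrative only, not the CUDA-Q QEC implementation): a per-node memory strength gamma blends a node’s previous message with the freshly computed BP message.

```python
import numpy as np

def relay_update(prev_msg, new_msg, gamma):
    # gamma near 1: the node "remembers" its past message;
    # gamma near 0: it "forgets" and follows the fresh BP update.
    return gamma * prev_msg + (1.0 - gamma) * new_msg

prev = np.array([0.9, -0.4, 0.2])    # messages from the previous iteration
new = np.array([0.1, 0.8, -0.6])     # freshly computed BP messages
print(relay_update(prev, new, 0.3))  # mostly follows the fresh update
```

Varying gamma per node (as the gamma_dist parameter below does across relay legs) is what lets RelayBP escape the symmetric traps that stall plain BP.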
Figure 1. Peak decoding throughput (iterations/sec) for RelayBP FP32 on NVIDIA DGX GB200, measured for XYZ and XZ decoding of 1-Gross and 2-Gross quantum error-correction codes, with syndrome complexity held constant to isolate peak performance. Results collected with optimized CUDA-Q QEC 0.6.0 build.
Users can instantiate a RelayBP decoder easily with a few lines of code, outlined below.
import numpy as np
import cudaq_qec as qec

# Simple 3x7 parity check matrix for demonstration
H_list = [[1, 0, 0, 1, 0, 1, 1],
          [0, 1, 0, 1, 1, 0, 1],
          [0, 0, 1, 0, 1, 1, 1]]
H = np.array(H_list, dtype=np.uint8)

# Configure relay parameters
srelay_config = {
    'pre_iter': 5,                     # Run 5 iterations with gamma0 before relay legs
    'num_sets': 3,                     # Use 3 relay legs
    'stopping_criterion': 'FirstConv'  # Stop after first convergence
}

# Create a decoder with Relay-BP
decoder_relay = qec.get_decoder("nv-qldpc-decoder",
                                H,
                                use_sparsity=True,
                                bp_method=3,
                                composition=1,
                                max_iterations=50,
                                gamma0=0.3,
                                gamma_dist=[0.1, 0.5],
                                srelay_config=srelay_config,
                                bp_seed=42)
print("Created decoder with Relay-BP (gamma_dist, FirstConv stopping)")

# Decode a syndrome
syndrome = np.array([1, 0, 1], dtype=np.uint8)
decoded_result = decoder_relay.decode(syndrome)
AI decoder inference
AI decoders are becoming increasingly popular for handling specific error models, offering better accuracy or latency than algorithmic decoders.
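Conceptually, such a decoder is just a trained network that maps a syndrome vector to per-qubit error predictions. A toy sketch with random (untrained, purely hypothetical) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)  # 3 syndrome bits -> 8 hidden units
W2, b2 = rng.normal(size=(8, 7)), np.zeros(7)  # 8 hidden units -> 7 data qubits

def toy_decode(syndrome):
    h = np.maximum(syndrome @ W1 + b1, 0.0)   # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # per-qubit error probabilities
    return (p > 0.5).astype(np.uint8)         # hard decision per qubit

print(toy_decode(np.array([1.0, 0.0, 1.0])))
```

A real AI decoder replaces these random weights with ones learned from simulated syndrome data; the trained model is then exported to ONNX for inference, as described next.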
Users can develop AI decoders by generating training data, training a model, and exporting the model to ONNX. Once this is complete, the CUDA-Q QEC inference engine, built on NVIDIA TensorRT, can operate low-latency AI decoders.
CUDA-Q QEC recently introduced infrastructure for integrated AI decoder inference with offline decoding. This means that it’s now easy to run any AI decoder saved to an ONNX file with CUDA-Q QEC and an emulated quantum computer.
import cudaq_qec as qec
import numpy as np

# Note: The AI decoder doesn't use the parity check matrix.
# A placeholder matrix is provided here to satisfy the API.
H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)

# Create TensorRT decoder from ONNX model
decoder = qec.get_decoder("trt_decoder", H,
                          onnx_load_path="ai_decoder.onnx")

# Decode a syndrome
syndrome = np.array([1.0, 0.0, 1.0], dtype=np.float32)
result = decoder.decode(syndrome)
print(f"Predicted error: {result}")
We also offer a range of recommendations for reducing initialization time by creating pre-built TensorRT engines. With ONNX files supporting a range of precisions (int8, fp8, fp16, bf16, and tf32), you can explore many model and hardware combinations to optimize AI decoder operationalization.
Sliding window decoding
Sliding window decoders enable a decoder to handle circuit-level noise across multiple syndrome extraction rounds. These decoders process the syndrome before the complete measurement sequence is received, which can help reduce the overall latency. The tradeoff is that this can increase logical error rates.
Exploring how and when to use this tool depends on the noise model, the error-correcting code parameters, and the latency budget of a given quantum processor. With the introduction of the sliding window decoder in 0.5.0, users can now perform experiments using any other CUDA-Q decoder as the “inner” decoder. Additionally, users can vary the window size with simple parameter changes.
import cudaq
import cudaq_qec as qec
import numpy as np

cudaq.set_target('stim')
num_rounds = 5
code = qec.get_code('surface_code', distance=num_rounds)
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.001), 1)
statePrep = qec.operation.prep0
dem = qec.z_dem_from_memory_circuit(code, statePrep, num_rounds, noise)

inner_decoder_params = {'use_osd': True, 'max_iterations': 50, 'use_sparsity': True}
opts = {
    'error_rate_vec': np.array(dem.error_rates),
    'window_size': 1,
    'num_syndromes_per_round': dem.detector_error_matrix.shape[0] // num_rounds,
    'inner_decoder_name': 'nv-qldpc-decoder',
    'inner_decoder_params': inner_decoder_params,
}
swdec = qec.get_decoder('sliding_window', dem.detector_error_matrix, **opts)
Each syndrome extraction round must produce a constant number of measurements. The decoder will make no assumptions about the temporal correlations or periodicity in the underlying noise, so users have maximal flexibility in investigating noise variations per round.
Getting started with CUDA-Q QEC
CUDA-Q QEC 0.5.0 brings a wide range of tools to quantum error correction researchers and QPU operators, accelerating research toward operationalizing fault-tolerant quantum computers.
To get started with CUDA-Q QEC, run pip install cudaq-qec and see the CUDA-Q QEC documentation.
Tags
Agentic AI / Generative AI | Data Center / Cloud | Developer Tools & Techniques | HPC / Scientific Computing | CUDA-Q | Intermediate Technical | Tutorial | featured | Quantum Computing
About the Authors
About Tom Lubowe
Tom Lubowe is the product manager for quantum libraries at NVIDIA. Prior to joining, he led product focused on quantum computing, machine learning, and tensor networks for materials design at GenMat. Tom also worked at Xanadu and Rigetti in product management, product operations, and business development roles. Before that, he started a quantum machine learning company, Everettian Technologies, after working on FinTech products at SEI Investments.
About Ben Howe
Ben Howe is a senior CUDA-Q software engineer at NVIDIA where he develops the CUDA-Q software framework for hybrid classical-quantum computing systems. Before NVIDIA, Ben was an Engineering Fellow at RTX where he developed real-time signal processing algorithms and HPC applications for a variety of sensor systems. He received bachelor degrees in Electrical Engineering and Computer Science, and a master’s degree in Electrical Engineering from Texas Tech University.
About Melody Ren
Melody is a senior quantum software engineer on the CUDA-QX team at NVIDIA. Her current work focuses on quantum error correction and developing scalable tools for hybrid quantum-classical workflows. Prior to NVIDIA, she was a developer for the Intel Quantum SDK, where she learned that debugging quantum software is only slightly less mysterious than quantum physics itself. Melody received her master’s degree in applied science from the University of British Columbia in Canada.
About Scott Thornton
Scott Thornton is a Quantum Computing Libraries Engineer at NVIDIA, specializing in AI inference for quantum error correction and algorithms for chemistry and materials science. Beyond quantum computing, his research interests span density functional theory (DFT) and time-dependent DFT for molecules and condensed matter.
Scott received his Ph.D. in Condensed Matter Physics from the University of Tennessee.
About Kevin Mato
Kevin is a quantum computing software engineer on the NVIDIA CUDA-QX team, working across multiple layers of the quantum computing stack, with a focus on quantum error correction and scalable tools for hybrid quantum-classical workflows. He fell in love with quantum computing in 2015 and, before joining NVIDIA, worked at CINECA, exploring performance trade-offs across GPUs, large-scale HPC systems, and quantum annealers.
Kevin holds a PhD in Quantum Computing from the Technical University of Munich.
Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS | NVIDIA Technical Blog
nvidia_dev_blog
16.12.2025 17:00
NVIDIA CUDA developers have access to a wide range of tools and libraries that simplify development and deployment, enabling users to focus on the “what” and the “how” of their applications.
An example of this is Multi-Process Service (MPS), where users can get better GPU utilization by sharing GPU resources across processes. Importantly, this can be done transparently as applications don’t need to be aware of MPS, and no code modifications are needed.
Introducing MLOPart
NVIDIA Blackwell GPUs deliver high bandwidth that is well-suited to training today’s large language models. However, some applications don’t benefit from the full bandwidth of Blackwell and are more latency-sensitive.
Memory Locality Optimized Partition (MLOPart) devices are NVIDIA CUDA devices derived from a GPU and optimized for lower latency. MLOPart is a CUDA MPS feature that enables multi-GPU aware applications to see MLOPart devices.
In the real world, it’s not always easy to determine whether an application is latency-bound or bandwidth-bound. MLOPart is designed to be enabled and disabled using the MPS controller and doesn’t require an application to be rewritten. Developers can do simple A/B testing to see if an application benefits from MLOPart.
MLOPart device enumeration
The defining aspect of MLOPart is that when it is enabled, MLOPart-capable devices appear as multiple distinct CUDA devices, with their own compute and memory resources. In this sense, it is similar to an NVIDIA Multi-Instance GPU (MIG). We’ll compare MLOPart with MIG later in this post.
MLOPart creates CUDA devices based on the underlying architecture of GPUs. Where possible, CUDA devices are split along boundaries where crossing would negatively affect memory latency, with the memory and compute resources on each side of the boundary forming an MLOPart device. For Blackwell, the split is along the die boundaries.
If a GPU doesn’t have such boundaries, no MLOPart devices are created, and the GPU is presented to CUDA applications normally. NVIDIA DGX B200 and NVIDIA B300 are capable of two MLOPart devices per GPU. This number may change with future architectures, so it’s recommended that developers don’t hardcode assumptions about the number of MLOPart devices that a GPU will support.
MLOPart device capabilities and characteristics
An MLOPart device shares similarities with the underlying device, with a few notable exceptions. While in principle developers don’t need to rewrite applications to use MLOPart devices, they should keep in mind that MLOPart devices don’t share all of the capabilities and characteristics of the underlying devices.
Capabilities and characteristics shared with the underlying device include:
Compute capability
An MLOPart device has the same compute capability and can execute the same GPU binaries as the underlying device. For example, a device that supports MLOPart with compute capability 10.0 will have MLOPart devices that also have compute capability 10.0.
Peer-to-peer ability
An MLOPart device will be capable of the same peer-to-peer communication as the underlying device. For example, if two physical devices are connected by NVIDIA NVLink, any MLOPart devices derived from these two underlying devices will also be connected by NVLink.
The exception to this rule is between MLOPart devices belonging to the same underlying device. In this case, they’re still capable of peer-to-peer communication, but don’t require peer-to-peer communication methods such as NVLink or PCIe.
When peer devices are MLOPart devices belonging to the same underlying device, they’re expected to have lower latency and higher peer-to-peer bandwidth than peer devices connected through other means.
PCI IDs
MLOPart devices share the same PCI ID (bus.device.domain) as the underlying device.
Capabilities and characteristics differing from the underlying device include the following.
Streaming multiprocessor count
Each MLOPart device will have fewer streaming multiprocessors (SMs) than the underlying device. Furthermore, the total SMs across all MLOPart devices sharing a common underlying device may be fewer than the total SMs in the underlying device.
MLOPart devices belonging to the same underlying device have the same number of SMs between them, and the number of SMs is consistent across identical NVIDIA GPUs.
For example, an NVIDIA HGX B200 system with 8 Blackwell GPUs that normally have 148 SMs will result in 16 MLOPart devices with 70 SMs each when MLOPart is enabled.
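These numbers can be sanity-checked with quick arithmetic (values taken from the HGX B200 example above):

```python
# HGX B200 example from above: 8 GPUs, 148 SMs each, 2 MLOPart devices per GPU
# with 70 SMs each once MLOPart is enabled.
gpus, sms_per_gpu = 8, 148
parts_per_gpu, sms_per_part = 2, 70

mlopart_devices = gpus * parts_per_gpu             # 16 MLOPart devices system-wide
exposed_sms_per_gpu = parts_per_gpu * sms_per_part
hidden_sms_per_gpu = sms_per_gpu - exposed_sms_per_gpu
print(mlopart_devices, exposed_sms_per_gpu, hidden_sms_per_gpu)  # 16 140 8
```

This illustrates the point above: together, the MLOPart devices expose fewer SMs (140) than the underlying GPU provides (148).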
Available memory
MLOPart devices have a partition of the total memory of the underlying device, and only allocate from that partition, except in the case of CUDA managed memory allocations. Each MLOPart device will have less memory than the underlying device. Each MLOPart device belonging to the same underlying device has the same total memory.
In the current version of MLOPart, it’s possible for memory allocated on one MLOPart device to affect the available memory reported by cuMemGetInfo and cudaMemGetInfo on another MLOPart device from the same underlying device, even though they have separate partitions. Future drivers will enable more rigid memory partitions between MLOPart devices.
Virtual address space
MLOPart devices on the same underlying device share a virtual address space. This means that it’s possible for a buffer overrun of memory allocated on one MLOPart device to corrupt memory allocated on another MLOPart device within the same process.
Universally unique identifier
Each MLOPart device will have its own universally unique identifier (UUID) that can be queried through CUDA APIs. This can be used to uniquely identify MLOPart devices and to filter available CUDA devices using CUDA_VISIBLE_DEVICES.
Deploying with MLOPart
As with other CUDA MPS features, users can control behavior through MPS controller commands.
The start_server command starts an MPS server. In CUDA 13.1, we introduced the -mlopart option to this command, which enables users to start an MPS server that creates MLOPart-enabled MPS clients. As this is done on a per-server basis, multiple users may have different MLOPart configurations, depending on their needs.
In CUDA 13.0, we introduced the device_query MPS controller command to provide information about the CUDA devices enumerated by MPS. After a server has been created, device_query can be used to determine information about the devices that’ll be exposed to clients of that server, such as the device name, device ordinals, and UUIDs.
$ echo device_query | nvidia-cuda-mps-control
Default
Device Ordinal PCI IDs UUID Name Attributes
0 0000:1b.00.00 GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4 NVIDIA B200
1 0000:43.00.00 GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100 NVIDIA B200
2 0000:52.00.00 GPU-a517c26e-0f2f-945a-1672-ea75149f54d6 NVIDIA B200
3 0000:61.00.00 GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1 NVIDIA B200
4 0000:9d.00.00 GPU-b5830513-614b-38ac-b177-5cc2f850ea3d NVIDIA B200
5 0000:c3.00.00 GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d NVIDIA B200
6 0000:d1.00.00 GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d NVIDIA B200
7 0000:df.00.00 GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d NVIDIA B200
Server 14056
Device Ordinal PCI IDs UUID Name Attributes
N/A 0000:1b.00.00 GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4 NVIDIA B200 M
0 0000:1b.00.00 GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0 NVIDIA B200 MLOPart 0 MD
1 0000:1b.00.00 GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e NVIDIA B200 MLOPart 1 MD
N/A 0000:43.00.00 GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100 NVIDIA B200 M
2 0000:43.00.00 GPU-68fb01e9-499c-56d4-b768-8fca70a5ddff NVIDIA B200 MLOPart 0 MD
3 0000:43.00.00 GPU-6cf0c4ea-3a05-52b1-aec6-63acf60df19b NVIDIA B200 MLOPart 1 MD
N/A 0000:52.00.00 GPU-a517c26e-0f2f-945a-1672-ea75149f54d6 NVIDIA B200 M
4 0000:52.00.00 GPU-dd670b14-ca31-5dfd-a49b-7220701f4fc6 NVIDIA B200 MLOPart 0 MD
5 0000:52.00.00 GPU-d7433996-1714-5baa-9812-22cecdc792d3 NVIDIA B200 MLOPart 1 MD
N/A 0000:61.00.00 GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1 NVIDIA B200 M
6 0000:61.00.00 GPU-cff5ab0b-a509-54c8-a9c0-c5ebe3fbd3a0 NVIDIA B200 MLOPart 0 MD
7 0000:61.00.00 GPU-7933cfe7-5139-50d8-ad90-0f7f1ddba559 NVIDIA B200 MLOPart 1 MD
N/A 0000:9d.00.00 GPU-b5830513-614b-38ac-b177-5cc2f850ea3d NVIDIA B200 M
8 0000:9d.00.00 GPU-f973284b-7385-576b-80d7-3ea083bcea94 NVIDIA B200 MLOPart 0 MD
9 0000:9d.00.00 GPU-668e4145-b221-5495-a3fe-a5cdc0e6f6eb NVIDIA B200 MLOPart 1 MD
N/A 0000:c3.00.00 GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d NVIDIA B200 M
10 0000:c3.00.00 GPU-53858feb-87eb-5963-8d47-6fbf4b24cd4a NVIDIA B200 MLOPart 0 MD
11 0000:c3.00.00 GPU-700b029a-be98-5d13-9a4e-5e8e21386e34 NVIDIA B200 MLOPart 1 MD
N/A 0000:d1.00.00 GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d NVIDIA B200 M
12 0000:d1.00.00 GPU-563db4f2-f70a-564d-aa4a-dbd52d6dfc0b NVIDIA B200 MLOPart 0 MD
13 0000:d1.00.00 GPU-b643e07a-6eda-5cd8-bdde-1788590d0b4b NVIDIA B200 MLOPart 1 MD
N/A 0000:df.00.00 GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d NVIDIA B200 M
14 0000:df.00.00 GPU-f8f5b46d-7774-57a1-97d2-88f23c3457f0 NVIDIA B200 MLOPart 0 MD
15 0000:df.00.00 GPU-46d7f9b7-0303-5432-b50a-16381f37e365 NVIDIA B200 MLOPart 1 MD
When MLOPart is enabled, device_query shows the MLOPart devices below the device from which they are derived. This is the recommended method for determining the UUID values used with CUDA_VISIBLE_DEVICES when launching an application: because CUDA enumerates more devices than physically exist on the system, device ordinals are ambiguous, while UUIDs are not.
Note that MLOPart devices only exist in the context of MPS and CUDA; nvidia-smi doesn't provide information about MLOPart devices.
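For example, a client can be pinned to a single MLOPart device by passing its UUID when launching. The UUID below is the first MLOPart device from the sample device_query output above; the application name is hypothetical:

```shell
# Launch a client restricted to one MLOPart device, using a UUID taken
# from the device_query output (not an ordinal, which is ambiguous).
CUDA_VISIBLE_DEVICES=GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0 ./my_app
```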
Lastly, the ps MPS controller command has been extended to display whether a process is using an MLOPart device.
$ while1 -a &
[1] 52845
$ echo ps | nvidia-cuda-mps-control
PID ID SERVER DEVICE NAMESPACE COMMAND ATTRIBUTES
52845 1 52837 GPU-b13add01-c28c 4026531836 while1 MD
MLOPart in use
Now let’s look at how MLOPart can affect memory latency and bandwidth.
Latency
As an example, let’s look at how MLOPart affects memory latency using a simple kernel that does some atomic operations in a loop.
First, we define the kernel and a helper:
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>
// Helper macro to check for CUDA errors
#define CUDA_CHECK_FAILURE(x) \
if (cudaSuccess != (cudaError_t)x)\
{\
const char* errName = cudaGetErrorName(x);\
const char* errStr = cudaGetErrorString(x);\
printf("%s:%d - %s: %s\n", __FILE__, __LINE__, errName, errStr);\
exit(EXIT_FAILURE);\
}
// Device memory variable to use to prevent the compiler from optimizing away the memory access
__device__ volatile int dummy;
// Trivial kernel to touch the memory so we can measure latency
__global__ void accessMemoryHighLatency(int *startAddress, size_t memorySizeInBytes) {
for (int i = 0 ; i < memorySizeInBytes / sizeof(int) ; ++i) {
dummy = atomicAdd(&startAddress[i], 1);
}
}
Atomic operations are latency-sensitive, making it easy to measure the difference between using and not using MLOPart. The following function uses CUDA events to measure the runtime of the accessMemoryHighLatency kernel.
// Function to launch the kernel and measure the runtime using CUDA events
float measureKernelRuntime(int *memoryDevPtr, size_t memorySizeInBytes, int numBlocks, int numThreads) {
cudaEvent_t start = NULL, stop = NULL;
float time = 0;
CUDA_CHECK_FAILURE(cudaEventCreate(&start));
CUDA_CHECK_FAILURE(cudaEventCreate(&stop));
CUDA_CHECK_FAILURE(cudaEventRecord(start, 0));
accessMemoryHighLatency<<<numBlocks, numThreads>>>(memoryDevPtr, memorySizeInBytes);
CUDA_CHECK_FAILURE(cudaPeekAtLastError());
CUDA_CHECK_FAILURE(cudaEventRecord(stop, 0));
CUDA_CHECK_FAILURE(cudaEventSynchronize(stop));
CUDA_CHECK_FAILURE(cudaEventElapsedTime(&time, start, stop));
CUDA_CHECK_FAILURE(cudaEventDestroy(start));
CUDA_CHECK_FAILURE(cudaEventDestroy(stop));
return time;
}
Finally, we can put this all together by creating a simple multi-GPU-aware program.
int main(int argc, char *argv[]) {
size_t memorySizeInBytes = 32 * 1024 * 1024; // 32 MB
int numBlocks = 32;
int numThreads = 1;
int numDevices = 0;
float totalTime = 0;
CUDA_CHECK_FAILURE(cudaGetDeviceCount(&numDevices));
// Measure the runtime for each device
for (int i = 0; i < numDevices; i++) {
// Set the current device
CUDA_CHECK_FAILURE(cudaSetDevice(i));
// Allocate memory on the device
int *memoryDevPtr;
CUDA_CHECK_FAILURE(cudaMalloc(&memoryDevPtr, memorySizeInBytes));
// Measure the runtime
float time = measureKernelRuntime(memoryDevPtr, memorySizeInBytes, numBlocks, numThreads);
totalTime += time;
printf("Device %d - Total time: %f milliseconds\n", i, time);
// Free the memory
CUDA_CHECK_FAILURE(cudaFree(memoryDevPtr));
}
printf("Average time: %f milliseconds\n", totalTime / numDevices);
return EXIT_SUCCESS;
}
We'll name this file atomic_memory_access.cu and compile it with nvcc atomic_memory_access.cu -arch=sm_100 -o atomic_memory_access.
To establish a baseline, let’s run the example using MPS, but without MLOPart.
$ nvidia-cuda-mps-control -d
# Optional step of explicitly creating an MPS server. This is also done implicitly when we launch a CUDA application while MPS is active.
$ echo start_server -uid $UID | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 2320.550537 milliseconds
Device 1 - Total time: 2323.710938 milliseconds
Device 2 - Total time: 2334.533447 milliseconds
Device 3 - Total time: 2304.551025 milliseconds
Device 4 - Total time: 2304.328125 milliseconds
Device 5 - Total time: 2316.102295 milliseconds
Device 6 - Total time: 2306.165283 milliseconds
Device 7 - Total time: 2306.362061 milliseconds
Average time: 2314.537842 milliseconds
Here we see an average time of around 2,300 milliseconds for each device. Now let’s enable MLOPart and run it again.
# Quit the MPS controller to cleanup the previous server.
$ echo quit | nvidia-cuda-mps-control
# Now repeat the above steps, with MLOPart enabled.
$ nvidia-cuda-mps-control -d
# Note that we must explicitly start the server with "-mlopart".
$ echo start_server -uid $UID -mlopart | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 1500.194946 milliseconds
Device 1 - Total time: 1475.914062 milliseconds
Device 2 - Total time: 1479.729492 milliseconds
Device 3 - Total time: 1480.196045 milliseconds
Device 4 - Total time: 1478.959106 milliseconds
Device 5 - Total time: 1490.808716 milliseconds
Device 6 - Total time: 1468.943237 milliseconds
Device 7 - Total time: 1479.297241 milliseconds
Device 8 - Total time: 1467.947632 milliseconds
Device 9 - Total time: 1476.900757 milliseconds
Device 10 - Total time: 1477.081421 milliseconds
Device 11 - Total time: 1490.295044 milliseconds
Device 12 - Total time: 1484.558594 milliseconds
Device 13 - Total time: 1481.660156 milliseconds
Device 14 - Total time: 1476.067383 milliseconds
Device 15 - Total time: 1484.143921 milliseconds
Average time: 1480.793457 milliseconds
In this example, we see a significant improvement in execution time per device when using MLOPart. While this was a contrived example, it’s important to compare running with and without MLOPart when deciding how to deploy a specific application.
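As a quick check, the speedup implied by the two averages above can be computed directly (the values are copied from the sample runs):

```python
# Average kernel runtimes reported by the two runs above (milliseconds).
baseline_ms = 2314.537842   # MPS without MLOPart
mlopart_ms = 1480.793457    # MPS with MLOPart

speedup = baseline_ms / mlopart_ms
reduction_pct = (1 - mlopart_ms / baseline_ms) * 100
print(f"speedup: {speedup:.2f}x, runtime reduction: {reduction_pct:.0f}%")
# → speedup: 1.56x, runtime reduction: 36%
```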
Bandwidth
Because MLOPart devices have less memory than a full device, they also have lower DRAM bandwidth than devices not using MLOPart. However, peer-to-peer bandwidth between MLOPart devices on the same underlying GPU is higher than between devices that must communicate over NVLink or PCIe.
Let’s look at the (partial) results of a bidirectional P2P bandwidth test between MLOPart devices on the same underlying device and not on the same underlying device:
$ ./nvbandwidth -t device_to_device_memcpy_read_ce
...
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4
0 N/A 2352.76 766.82 743.46 767.51
1 2402.78 N/A 765.86 744.04 767.03
2 767.23 744.30 N/A 2349.54 766.00
3 767.37 743.91 2372.91 N/A 767.30
4 766.75 743.52 766.89 743.97 N/A
In the above example, devices 0 and 1 are on the same underlying GPU, and devices 2 and 3 are on the same underlying GPU.
In the case of B200, peers normally use NVLink when initiating an operation such as cuMemcpyAsync. If these B200 peers are MLOPart devices on the same B200 chip, they can instead use the much faster NV-HBI.
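As a rough sanity check on the numbers above, the same-die advantage can be computed from the reported bandwidths (values copied from the sample nvbandwidth output):

```python
# Bidirectional P2P bandwidths from the nvbandwidth output above (GB/s).
same_die = 2352.76   # MLOPart peers on the same underlying GPU (NV-HBI)
cross_gpu = 766.82   # peers on different GPUs (NVLink)

ratio = same_die / cross_gpu
print(f"same-die advantage: {ratio:.1f}x")
# → same-die advantage: 3.1x
```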
Considerations when using MLOPart
As mentioned previously, using MLOPart implies choosing lower latency over higher bandwidth. This isn’t the only tradeoff that must be evaluated when using MLOPart.
Device filtering through CUDA_VISIBLE_DEVICES
The devices available to MPS servers and clients can be filtered and remapped using the CUDA_VISIBLE_DEVICES environment variable, often by device ordinal. With MPS, using the same value of CUDA_VISIBLE_DEVICES for both the controller and the server/clients can cause errors if remapping isn't taken into account.
For example, given a system with 8 CUDA devices, the MPS controller can be initialized to filter out the odd-numbered devices (CUDA_VISIBLE_DEVICES=0,2,4,6). In this scenario, the MPS server and clients will only see at most 4 CUDA devices, even without using CUDA_VISIBLE_DEVICES. Using the same value for CUDA_VISIBLE_DEVICES will fail since only devices 0-3 are visible. For this reason, it's recommended to use UUIDs, which are unambiguous.
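The remapping behavior described above can be sketched with a small model; remap_ordinals is a hypothetical helper for illustration, not a CUDA API:

```python
def remap_ordinals(visible_devices, n_physical):
    """Model how CUDA_VISIBLE_DEVICES remaps device ordinals.

    The surviving devices are renumbered 0..N-1 in the order listed,
    so a filter applied at the MPS controller changes the ordinals
    that servers and clients subsequently see.
    """
    kept = [d for d in visible_devices if 0 <= d < n_physical]
    return {new: old for new, old in enumerate(kept)}

# Controller started with CUDA_VISIBLE_DEVICES=0,2,4,6 on an 8-GPU system:
mapping = remap_ordinals([0, 2, 4, 6], 8)
print(mapping)  # → {0: 0, 1: 2, 2: 4, 3: 6}
# Clients now see only ordinals 0-3, so reusing "0,2,4,6" for a client
# references ordinals 4 and 6, which no longer exist.
```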
When MLOPart is enabled, there's an additional inconsistency to be aware of: the UUIDs of the devices visible to the MPS controller differ from those visible to an MPS server/client with MLOPart enabled. When using CUDA_VISIBLE_DEVICES, it's recommended to execute the device_query command after the MLOPart-enabled MPS server has been started to determine the UUIDs that will be available to MPS clients.
Fewer compute resources
When MLOPart is enabled, the MLOPart devices may have some SMs disabled. There’s a tradeoff between performance gains from reduced memory latency and performance losses from fewer compute resources. These should be weighed on a per-application basis.
Managed memory
Managed memory doesn’t benefit from MLOPart. As MLOPart requires creating GPU memory for low-latency allocations, this can’t be done with managed memory. Attempting to use managed memory will work as it normally does, and allocations can still be created using managed memory APIs, but they aren’t expected to see performance benefits.
Access modifiers
The cuMemSetAccess API enables programmers to specify access properties for CUDA allocations. When this API is used with MLOPart devices, the least restrictive property set across all MLOPart devices belonging to the same underlying GPU is applied. For example, setting a buffer as read-only for one MLOPart device and read-write (the default) for another results in both MLOPart devices having read-write access, until both are updated to a more restrictive access type.
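The least-restrictive-wins rule can be illustrated with a small model; effective_access and RESTRICTIVENESS are illustrative names for this sketch, not CUDA APIs:

```python
# Model of the documented cuMemSetAccess behavior for MLOPart devices on
# the same underlying GPU: the least restrictive flag set by any of them
# takes effect for all of them.
RESTRICTIVENESS = {"none": 2, "read": 1, "readwrite": 0}  # lower = less restrictive

def effective_access(per_device_flags):
    # Pick the least restrictive flag among the sibling MLOPart devices.
    return min(per_device_flags, key=lambda f: RESTRICTIVENESS[f])

# One MLOPart device sets read-only; its sibling keeps the read-write default:
print(effective_access(["read", "readwrite"]))  # → readwrite
```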
x86 requirement
MLOPart is currently only supported on x86 platforms. Support for ARM platforms will be available in a future release.
Comparison to MIG
MIG can be used to create multiple CUDA devices from a single GPU, as is done with MLOPart. Certain MIG configurations can also reduce latency at the cost of bandwidth, while requiring no code changes.
Topic | MIG | MLOPart / MPS
Privilege required | Requires superuser privilege to configure | Doesn't require superuser privilege
Scope | System-wide setting | Per-user / per-server setting
Memory isolation | Enforces strict memory isolation between MIG GPU instances | Memory from one MLOPart device may corrupt another on the same GPU
Performance isolation | Enforces strict performance isolation between MIG compute instances | Performance interference may occur between MLOPart devices
Table 1. Comparing MIG to MLOPart / MPS
To learn more about MLOPart, CUDA MPS, and how to maximize GPU utilization, check out the MPS documentation.
Acknowledgements: Thanks to the following NVIDIA contributors: Alfred Barnat, Ehren Bendler, Alicia Hu, Balint Joo, Ze Long, Yashwant Marathe, Vance Miller, Kyrylo Perelygin, Will Pierce, Yifan Yang
About the Authors
About Sherwin Nassernia
Sherwin is an NVIDIA CUDA driver engineer who focuses primarily on enabling and developing CUDA features for new GPU architectures such as Blackwell, Rubin, and beyond. Sherwin holds a bachelor's degree in computer engineering from Toronto Metropolitan University.
Using AI Physics for Technology Computer-Aided Design Simulations | NVIDIA Technical Blog |
nvidia_dev_blog |
17.12.2025 16:00 |
Technology Computer-Aided Design (TCAD) simulations, encompassing both process and device simulations, are crucial for modern semiconductor manufacturing. They enable “virtual manufacturing,” allowing engineers to design, build, and test transistors and integrated circuits digitally before committing to the costly physical fabrication process. This approach significantly reduces development time from years to months and saves billions of dollars in experimental manufacturing costs.
These simulations, however, are computationally intensive and can take as long as several weeks to complete, delaying manufacturing deadlines. AI-augmented TCAD is a key solution to address this challenge. That's where NVIDIA PhysicsNeMo and NVIDIA Apollo come in. The PhysicsNeMo framework lets developers build high-fidelity surrogates using state-of-the-art architectures for engineering and science simulations. Apollo, announced last month at SC25, makes this easier by providing domain-specific, pre-trained models.
Engineers at SK hynix, one of the world’s leading memory chip manufacturers, are leveraging AI physics to develop high-fidelity surrogate models to accelerate device and process simulations in the design and manufacturing of semiconductor chips. Using the NVIDIA PhysicsNeMo framework, engineers have fast-tracked the development of proprietary AI models that can unlock tools for significant innovation in device design and manufacturing.
In this blog, we’ll walk you through the steps to get started with PhysicsNeMo to develop your own custom models and share how the TCAD Intelligence team at SK hynix used PhysicsNeMo to accelerate development of its AI physics models.
Tapping into AI physics for TCAD
TCAD is a specialized field of software simulation used to model and optimize the fabrication and physics of semiconductor devices. It’s typically broken into two main parts—process TCAD and device TCAD. Process TCAD simulations model the physical and chemical steps of chip manufacturing, such as deposition, lithography, etching, and ion implantation. Device TCAD simulations, on the other hand, take the final 3D structure predicted by the process simulation and model its electrical behavior. Engineers utilize a variety of simulation solutions for different use cases, ranging from atomic-scale density functional theory (DFT) simulations to chamber-scale computational fluid dynamics (CFD) simulations.
AI-augmented TCAD presents a fundamentally disruptive opportunity for semiconductor manufacturers. As transistors shrink to the nanometer scale, the complexity of their behavior increases, making accurate simulations indispensable for designing next-generation devices, but also orders of magnitude more expensive.
AI surrogate models—which can be created with NVIDIA PhysicsNeMo—are ultra-fast, deep learning-based replicas of slow, physics-based simulations. This approach dramatically accelerates the design and optimization of semiconductor devices by reducing simulation times from hours to milliseconds, enabling engineers to explore a much wider range of possibilities.
PhysicsNeMo provides Python modules to compose scalable and optimized training and inference pipelines to develop and deploy AI surrogates. The PhysicsNeMo framework offers various AI models tuned for science and engineering and enables the combination of physics knowledge with data.
For AI physics researchers and developers exploring neural operators, GNNs, or transformers, or those interested in physics-informed neural networks or a hybrid approach in between, PhysicsNeMo provides an optimized stack that enables them to train their models at scale. Engineers can use the necessary building blocks from PhysicsNeMo to avoid developing everything from scratch. This reduces the effort required to develop detailed AI methodologies and lets them focus their domain expertise on developing surrogate models for specific physics problems.
Getting started with PhysicsNeMo
The simplest way to get started with PhysicsNeMo for building an AI surrogate is to use one of the reference application recipes. These examples give you a working template for both the training code and the data. Here is the general step-by-step path you would follow, using the official examples as your guide.
Install PhysicsNeMo: First, you need to set up your environment.
The easiest way is to use the official NVIDIA NGC container, which has all dependencies (PyTorch, CUDA, etc.) pre-installed. Next, clone the PhysicsNeMo GitHub repository to get the relevant reference application recipes.
If you have an existing development environment set up for PyTorch, you can pip install from source following the steps outlined here.
If you are interested in developing a GNN-based surrogate model for TCAD CFD simulations, for example, you would start with the vortex shedding recipe. After replicating the sample, you can customize the training pipeline to your own data.
You can also evaluate other model architectures like DoMINO or Transolver on your custom data.
The built-in distributed functionality in PhysicsNeMo recipes allows you to scale any of the above architectures to full 3D chip scale simulations.
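The setup steps above can be sketched as a shell session; the container image tag, repository URL, and package name here are assumptions, so check the PhysicsNeMo documentation for current values:

```shell
# Option A: start from the NGC container (image path is an assumption).
docker run --gpus all -it nvcr.io/nvidia/physicsnemo/physicsnemo:latest

# Option B: install into an existing PyTorch environment from source.
git clone https://github.com/NVIDIA/physicsnemo.git
cd physicsnemo && pip install .

# The reference application recipes live alongside the source.
ls examples/
```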
Let’s take a look at how SK hynix engineers used PhysicsNeMo for one of the many TCAD use cases.
How SK hynix uses AI physics for TCAD
South Korea-based SK hynix is a global leader in producing high-bandwidth memory (HBM), a crucial component in advanced AI accelerators and GPUs. Its products are vital for a wide array of electronics, from data center servers and PCs to smartphones and next-generation AI systems.
The company’s engineers are pioneering the use of AI physics by developing high-fidelity surrogate models to accelerate device and process simulations. Utilizing the NVIDIA PhysicsNeMo framework, they have rapidly advanced their proprietary AI models. An example is the SK hynix TCAD intelligence team’s work on AI surrogate models for etching, an increasingly critical process in semiconductor front-end manufacturing, particularly for advanced memory technologies. By employing predictive modeling to guide the etching process, SK hynix aims to expedite the development of next-generation memory devices.
Figure 1. The stepwise improvement in accuracy of the surrogate model to predict the etch profile with improvements in the methodology used.
Accurate prediction of time-varying structures in the etching process is essential for SK hynix. While neural operators are beneficial, they often require large datasets and struggle with data scarcity. To address this, SK hynix adopted the Graph Network-based Simulator (GNS) architectures grounded in Graph Neural Networks (GNNs), which combine numerical time-stepping methods to effectively model geometry changes over time. GNS captures local interactions, representing critical physical properties with minimal training data. However, the existing GNS models were insufficient for effectively emulating the etching process, necessitating the development of additional AI models to enhance the accuracy and efficiency of the emulations.
Methodology | Improvement
MeshGraphNet (MGN) | Memory requirement decreased
Chamfer loss used for velocity calculation | Training loss reduced
Re-meshing at each iteration step | Inference accuracy improved
Feature selection | Inference accuracy improved
Multi-scale message passing | Training loss reduced
Material feature update at each iteration step | Inference accuracy improved
Table 1. AI methodologies employed in the AI surrogate model for the etching process
The TCAD Intelligence team at SK hynix believes that AI-augmented TCAD will become a key enabler of research productivity in the semiconductor industry. By leveraging AI-accelerated TCAD predictions, engineers will be able to realistically evaluate tens of thousands of process cases generated from dozens of recipe combinations. This advancement allows TCAD to evolve beyond qualitative guidance and serve as a quantitative optimization framework for semiconductor R&D.
A wide range of AI models that were developed using the PhysicsNeMo framework and GPU-accelerated libraries play a crucial role in enabling these capabilities efficiently.
How to get started with NVIDIA PhysicsNeMo
If you are a TCAD application developer or an AI physics researcher, PhysicsNeMo is a powerful tool in your arsenal to accelerate your AI model development. Instead of building everything from scratch, you can leverage PhysicsNeMo modules and model architectures to build enterprise scale Physics AI solutions with unprecedented speed and simplicity.
TCAD engineers at SK hynix used this approach to focus their domain expertise and efforts on modeling their problems effectively and building skillful models instead of writing training pipelines using low-level libraries.
You can learn more by using these resources:
NVIDIA PhysicsNeMo product page
The PhysicsNeMo GitHub repository
User guide
Using PhysicsNeMo with your PyTorch model
Samples:
Explore Jupyter notebooks on Hugging Face
Full repository of reference samples
Self-paced course: Accelerating Computer-Aided Engineering (CAE) with NVIDIA AI Physics Technology
Yiyi Wang and Alexey Kamenev contributed to the project featured in this blog.
About the Authors
About Ram Cherukuri
Ram Cherukuri is a senior product manager for PhysicsNeMo, the Physics-ML platform for AI in science and engineering. He is also the product manager for DLA software, working with embedded AI developers, and was part of the CUDA product management team. Prior to NVIDIA, Ram was a product manager at MathWorks for code generation and verification products for embedded software development, working with automotive and aerospace-and-defense customers. He holds a master's degree in aerospace engineering from Purdue University and a bachelor's degree in the same discipline from IIT Bombay.
About Kihang Youn
Kihang Youn is an HPC/AI solutions architect at NVIDIA. He collaborates with customers across science and manufacturing to accelerate scientific applications such as electronic design automation, computer aided engineering, atomistic simulation, and quantum chemistry using the NVIDIA platform. Kihang holds a doctorate in applied mathematics from Hanyang University in South Korea.
About Gyuseung Han
Gyuseung Han is a technology computer-aided design engineer in the TCAD intelligence team at SK hynix R&D, specializing in AI-driven modeling for TCAD simulations and the analysis of experimental results. He holds a Ph.D. in simulation within the field of materials science from Seoul National University. His expertise includes density functional theory, computational fluid dynamics, and artificial intelligence. His work encompasses a wide range of semiconductor process technologies.
About Min Kang
Min Kang is a Technology Computer-Aided Design engineer in the TCAD intelligence team at SK hynix R&D. He previously worked with the DRAM process integration team for an extended period, and is now focusing on AI for science projects with the TCAD intelligence team. His earlier works encompass a deep learning-based automated transmission electron microscopy image measurement system and a deep learning-driven transistor simulation. He received his master’s and bachelor’s in physics from Seoul National University.
About Hwiwon Seo
Hwiwon Seo is a Technology Computer-Aided Design engineer in the TCAD intelligence team at SK hynix R&D. He specializes in plasma process simulation and the development of AI-driven TCAD solutions. His research focuses on AI for science applications, utilizing his expertise in plasma process modeling, TCAD simulation, and AI-augmented TCAD modeling. He holds a master of science degree in plasma engineering and a bachelor of science in materials science from Seoul National University.
About Junghan Kim
Junghan Kim leads the TCAD intelligence team at SK hynix R&D. The TCAD intelligence team specializes in AI for science and develops AI solutions for R&D based on Technology Computer-Aided Design. Junghan is an experienced R&D researcher in the semiconductor and display industries. He is skilled in AI for science modeling and multi-scale simulation, including CFD, molecular dynamics, DSMC, and more. He has a doctorate from Technische Universiteit Eindhoven in the Netherlands, where he focused on micro-nano fluidics and rarefied gas simulation.
Simulate an Accurate Radio Environment Using NVIDIA Aerial Omniverse Digital Twin | NVIDIA Technical Blog |
nvidia_dev_blog |
17.12.2025 16:00 |
The development of 5G and 6G requires high-fidelity radio channel modeling, but the ecosystem is highly fragmented. Link-level simulators, network-level simulators, and AI training frameworks operate independently, often in different programming languages.
If you are a researcher or an engineer trying to simulate the behavior of the key components of the physical layer of 5G or 6G systems, this tutorial teaches you how to extend your simulation chain and add high-fidelity channel realizations generated by the Aerial Omniverse Digital Twin (AODT).
Prerequisites:
Hardware: An NVIDIA RTX GPU (Ada generation or newer recommended for optimal performance).
Software: Access to the AODT Release 1.4 container.
Knowledge: Basic familiarity with Python and wireless network concepts, such as radio units (RUs) and user equipment (UE).
AODT universal embedded service architecture
Figure 1 shows how AODT can be embedded into any simulation chain, whether in C++, Python, or MATLAB.
Figure 1. AODT as a universal, embedded service via high-performance gRPC
AODT is organized into two main components:
The AODT service acts as the centralized, high-power computation core. It manages and loads the massive 3D city models (e.g., from an Omniverse Nucleus server) and executes all the complex electromagnetic (EM) physics calculations.
The AODT client and language bindings provide a lightweight developer interface. The client handles all the service calls and uses GPU IPC to transfer data efficiently, enabling direct GPU-memory access to radio-channel outputs. To support a broad range of development environments, the AODT client provides universal language bindings, enabling direct use from C++, Python (through pybind11), and MATLAB (through a user-implemented mex interface).
Workflow in action: Computing channel impulse responses in 7 easy steps
So how do you actually use it? The entire workflow is designed to be straightforward and follows a precise sequence orchestrated by the client, as shown in Figure 2.
Figure 2. Summary of AODT client/service workflow
The process is split into two main phases:
Configuration tells AODT what to simulate.
Execution runs the simulation and gets data.
The following sections walk through the full example.
Phase 1: Configuration (building the YAML string)
The AODT service is configured using a single YAML string. While you can write this by hand, we also provide a powerful Python API to build it programmatically, step-by-step.
Step 1. Initialize the simulation configuration
First, import the configuration objects and set up the basic parameters: the scene to load, the simulation mode (e.g., SimMode.EM), the number of slots to run, and a seed for repeatable, deterministic results.
# AntennaElement, Nodes, and Position are used in Steps 2-3
from _config import (SimConfig, SimMode, DBTable, Panel,
                     AntennaElement, Nodes, Position)
# EM is the default mode.
config = SimConfig(scene, SimMode.EM)
# One batch is the default.
config.set_num_batches(1)
config.set_timeline(
slots_per_batch=15000,
realizations_per_slot=1
)
# Seeding is disabled by default.
config.set_seed(seed=1)
config.add_tables_to_db(DBTable.CIRS)
Step 2: Define antenna arrays
Next, define the antenna panels for both your base stations (RUs) and your UEs. You can use standard models, like ThreeGPP38901, or define your own.
# Declare the panel for the RU
ru_panel = Panel.create_panel(
antenna_elements=[AntennaElement.ThreeGPP38901],
frequency_mhz=3600,
vertical_spacing=0.5,
vertical_num=1,
horizontal_spacing=0.5,
horizontal_num=1,
dual_polarized=True,
roll_first=-45,
roll_second=45)
# Set as default for RUs
config.set_default_panel_ru(ru_panel)
# Declare the panel for the UE
ue_panel = Panel.create_panel(
antenna_elements=[AntennaElement.InfinitesimalDipole],
frequency_mhz=3600,
vertical_spacing=0.5,
vertical_num=1,
horizontal_spacing=0.5,
horizontal_num=1,
dual_polarized=True,
roll_first=-45,
roll_second=45)
# Set as default for UEs
config.set_default_panel_ue(ue_panel)
Step 3: Deploy network elements (RUs and manual UEs)
Place your network elements in the scene. We use georeferenced coordinates (latitude/longitude) to place them precisely. For UEs, you can define a series of waypoints to create a pre-determined path.
du = Nodes.create_du(
du_id=1,
frequency_mhz=3600,
scs_khz=30
)
ru = Nodes.create_ru(
ru_id=1,
frequency_mhz=3600,
radiated_power_dbm=43,
du_id=du.id,
)
ru.set_position(
Position.georef(
35.66356389841298,
139.74686323425487))
ru.set_height(2.5)
ru.set_mech_azimuth(0.0)
ru.set_mech_tilt(10.0)
ue = Nodes.ue(
ue_id=1,
radiated_power_dbm=26,
)
ue.add_waypoint(
Position.georef(
35.66376818087683,
139.7459968717682))
ue.add_waypoint(
Position.georef(
35.663622296081414,
139.74622811587614))
ue.add_waypoint(
Position.georef(
35.66362516562424,
139.74653110368598))
config.add_ue(ue)
config.add_du(du)
config.add_ru(ru)
Step 4: Deploy dynamic elements (procedural UEs and scatterers)
This is where the simulation becomes truly dynamic. Instead of placing every UE by hand, you can define a spawn_zone and have AODT procedurally generate UEs that move realistically within that area. You can also enable urban_mobility to add dynamic scatterers (cars) that will physically interact with and alter the radio signals.
# If we want to enable procedural UEs we need a spawn zone.
config.add_spawn_zone(
translate=[150.2060449, 99.5086621, 0],
scale=[1.5, 2.5, 1],
rotate_xyz=[0, 0, 71.0])
# Procedural UEs are zero by default.
config.set_num_procedural_ues(1)
# Indoor proc. UEs are 0% by default.
config.set_perc_indoor_procedural_ues(0.0)
# Urban mobility is disabled by default.
config.enable_urban_mobility(
vehicles=50,
enable_dynamic_scattering=True)
# Save to string
from omegaconf import OmegaConf
config_dict = config.to_dict()
yaml_string = OmegaConf.to_yaml(config_dict)
Phase 2: Execution (client-server interaction)
Now that we have our yaml_string configuration, we connect to the AODT service and run the simulation.
Step 5: Connect
Import the dt_client library, create a client pointing to the service address, and call client.start(yaml_string). This single call sends the entire configuration to the service, which then loads the 3D scene, generates all the objects, and prepares the simulation.
import dt_client
import numpy as np
import matplotlib.pyplot as plt
# Server address (currently only localhost is supported)
server_address = "localhost:50051"
# Create client
client = dt_client.DigitalTwinClient(server_address)
try:
    client.start(yaml_string)
except RuntimeError as e:
    print(f"X Failed to start scenario: {e}")
    return 1
Once started, you can query the service to get the parameters of the simulation you just created. This confirms everything is ready and tells you how many slots, RUs, and UEs to expect.
try:
    status = client.get_status()
    num_batches = status['total_batches']
    num_slots = status['slots_per_batch']
    num_rus = status['num_rus']
    num_ues = status['num_ues']
except RuntimeError as e:
    print(f"X Failed to get status: {e}")
    return 1
Step 6: Get UE positions
Now loop through each simulation slot and query the current position of all UEs. This is crucial for verifying that the mobility models are working as expected and for correlating channel data with location.
for slot in range(num_slots):
    try:
        ue_positions = client.get_ue_positions(
            batch_index=0,
            temporal_index=SlotIndex(slot))
    except RuntimeError as e:
        print(f"X Failed to get UE pos: {e}")
Step 7: Retrieve Channel Impulse Responses
Retrieving the core simulation data is the most critical step. The Channel Impulse Response (CIR) describes how the signal propagates from each RU to each UE, including all multipath components (their delays, amplitudes, and phases).
Retrieving this much data for every slot can be slow. To make it fast, the API uses a two-step, zero-copy process based on GPU IPC.
First, before the loop, you ask the client to allocate GPU memory for the CIR results. The service does this and returns IPC handles, which are pointers to that GPU memory.
ru_indices = [0]
ue_indices_per_ru = [[0, 1]]
is_full_antenna_pair = False

# Step 1: Allocate GPU memory for CIR
cir_alloc_result = client.allocate_cirs_memory(
    ru_indices,
    ue_indices_per_ru,
    is_full_antenna_pair)
values_ipc_handles = cir_alloc_result['values_handles']
delays_ipc_handles = cir_alloc_result['delays_handles']
Now, inside your loop, you call client.get_cirs(…), passing in those memory handles. The AODT service runs the full EM simulation for that slot and writes the results directly into the shared GPU memory. No data is copied over the network, making the transfer highly efficient; the client is simply notified that new data is ready.
# Step 2: Retrieve CIR
cirs = client.get_cirs(
values_ipc_handles,
delays_ipc_handles,
batch_index=0,
temporal_index=SlotIndex(0),
ru_indices=ru_indices,
ue_indices_per_ru=ue_indices_per_ru,
is_full_antenna_pair=is_full_antenna_pair)
values_shapes = cirs['values_shapes']
delays_shapes = cirs['delays_shapes']
Access the data in NumPy
The data (CIR values and delays) is still on the GPU. The client library provides simple utilities to get a GPU pointer without latency penalties. For convenience, however, the data can also be accessed from NumPy. This can be achieved as shown in the following code.
# Step 3: Export to NumPy
for i in range(len(ru_indices)):
    values_gpu_ptr = client.access_values_gpu(
        values_ipc_handles[i],
        values_shapes[i])
    delays_gpu_ptr = client.access_delays_gpu(
        delays_ipc_handles[i],
        delays_shapes[i])
    values = client.gpu_to_numpy(
        values_gpu_ptr,
        values_shapes[i])
    delays = client.gpu_to_numpy(
        delays_gpu_ptr,
        delays_shapes[i])
And that’s it! In just a few lines of Python, you have configured a complex, dynamic, georeferenced simulation, run it on a powerful remote server, and retrieved the high-fidelity, physics-based CIRs as a NumPy array. The data is now ready to be visualized, analyzed, or fed directly into an AI training pipeline. For instance, we can visualize the frequency responses of the manual UE declared above using the following plot function.
def cfr_from_cir(h, tau, freqs_hz):
    phase_arg = -1j * 2.0 * np.pi * np.outer(tau, freqs_hz)
    # Safe exponential and matrix multiplication
    with np.errstate(all='ignore'):
        # Sanitize inputs
        h = np.where(np.isfinite(h), h, 0.0)
        expm = np.exp(phase_arg)
        expm = np.where(np.isfinite(expm), expm, 0.0)
        result = h @ expm
        result = np.where(np.isfinite(result), result, 0.0)
    return result
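As a quick sanity check on cfr_from_cir, the following self-contained snippet (with invented tap gains and delays, not values from AODT) evaluates the frequency response of a two-tap channel; at f = 0 the taps add coherently.

```python
import numpy as np

def cfr_from_cir(h, tau, freqs_hz):
    # Same helper as above: CIR taps -> channel frequency response
    phase_arg = -1j * 2.0 * np.pi * np.outer(tau, freqs_hz)
    with np.errstate(all='ignore'):
        h = np.where(np.isfinite(h), h, 0.0)
        expm = np.exp(phase_arg)
        expm = np.where(np.isfinite(expm), expm, 0.0)
        result = h @ expm
        result = np.where(np.isfinite(result), result, 0.0)
    return result

# Invented example: a direct path plus a 100 ns echo at half amplitude
h = np.array([1.0 + 0j, 0.5 + 0j])      # tap gains
tau = np.array([0.0, 100e-9])           # tap delays in seconds
freqs_hz = np.linspace(-10e6, 10e6, 5)  # 20 MHz span around the carrier
H = cfr_from_cir(h, tau, freqs_hz)
# At f = 0 both taps add in phase: H = 1.0 + 0.5
```

The echo introduces frequency-selective fading across the band, which is exactly the structure the polarimetric plots below visualize.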
def plot(values, delays):
    # values shape:
    # [n_ue,     number of UEs
    #  n_symbol, number of OFDM symbols
    #  n_ue_h,   number of horizontal sites in the UE panel
    #  n_ue_v,   number of vertical sites in the UE panel
    #  n_ue_p,   number of polarizations in the UE panel
    #  n_ru_h,   number of horizontal sites in the RU panel
    #  n_ru_v,   number of vertical sites in the RU panel
    #  n_ru_p,   number of polarizations in the RU panel
    #  n_tap,    number of taps
    # ]
    AX_UE, AX_SYM, AX_UEH, AX_UEV, AX_UEP, AX_RUH, AX_RUV, AX_RUP, AX_TAPS = range(9)
    # delays shape:
    # [n_ue,      number of UEs
    #  n_symbols, number of OFDM symbols
    #  n_ue_h,    number of horizontal sites in the UE panel
    #  n_ue_v,    number of vertical sites in the UE panel
    #  n_ru_h,    number of horizontal sites in the RU panel
    #  n_ru_v,    number of vertical sites in the RU panel
    #  n_tap,     number of taps
    # ]
    D_AX_UE, D_AX_SYM, D_AX_UEH, D_AX_UEV, D_AX_RUH, D_AX_RUV, D_AX_TAPS = range(7)
    i_fixed = 0        # index of the manual UE declared above (assumed)
    DELAY_SCALE = 1.0  # unit conversion for delays (assumed; set so tau is in seconds)
    nbins = 4096
    spacing_khz = 30.0
    freqs_hz = (np.arange(nbins) - (nbins // 2)) * spacing_khz * 1e3
    # Set up the figure (2x2 grid)
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 9), sharex=True)
    axes = axes.ravel()
    cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
    titles = [
        "UE$_1$: -45° co-pol",
        "UE$_1$: -45° x-pol",
        "UE$_1$: 45° x-pol",
        "UE$_1$: 45° co-pol"
    ]
    for ax, (j, k), title in zip(axes, cases, titles):
        try:
            # Construct index tuple: [i, 0, 0, 0, j, 0, 0, k, :]
            idx_vals = [0] * values.ndim
            idx_vals[AX_UE] = i_fixed
            idx_vals[AX_UEP] = j  # UE polarization
            idx_vals[AX_RUP] = k  # RU polarization
            idx_vals[AX_TAPS] = slice(None)  # All taps
            h_i = np.squeeze(values[tuple(idx_vals)])
            # Construct index tuple: [i, 0, 0, 0, 0, 0, :]
            idx_del = [0] * delays.ndim
            idx_del[D_AX_UE] = i_fixed
            idx_del[D_AX_TAPS] = slice(None)
            tau_i = np.squeeze(delays[tuple(idx_del)]) * DELAY_SCALE
            H = cfr_from_cir(h_i, tau_i, freqs_hz)
            power_w = np.abs(H) ** 2
            power_w = np.maximum(power_w, 1e-12)
            power_dbm = 10.0 * np.log10(power_w) + 30.0
            ax.plot(freqs_hz / 1e6 + 3600, power_dbm, linewidth=1.5)
            ax.set_title(title)
            ax.grid(True, alpha=0.3)
        except Exception as e:  # handler assumed; the original snippet omits it
            print(f"X Failed to plot {title}: {e}")
    # Formatting
    for ax in axes:
        ax.set_ylabel("Power (dBm)")
    axes[2].set_xlabel("Frequency (MHz)")
    axes[3].set_xlabel("Frequency (MHz)")
    plt.tight_layout()
    plt.show()
Figure 3. Polarimetric frequency responses for the considered example
Empowering the AI-native 6G era
The transition from 5G to 6G must tackle greater complexity in wireless signal processing, characterized by massive data volumes, extreme heterogeneity, and the core mandate for AI-native networks. Traditional, siloed simulation methods are simply insufficient for this challenge.
The NVIDIA Aerial Omniverse Digital Twin is built precisely for this new era. By moving to a gRPC-based service architecture in release 1.4, AODT is democratizing access to physics-based radio simulation and providing the ground truth needed for machine learning and algorithm exploration.
AODT 1.4 is available on NVIDIA NGC. We invite researchers, developers, and operators to integrate this powerful new service and collaborate with us in building the future of 6G.
Tags
Developer Tools & Techniques | Simulation / Modeling / Design | Telecommunications | Aerial | Omniverse | Intermediate Technical | 5G / 6G | featured | Industrial Digitalization / Digital Twin
About the Authors
About Tommaso Balercia
Tommaso Balercia was born in Jesi, Italy in 1979. He received his master’s degree in Microelectronics from the Polytechnic University of Marche (Italy) and his PhD from the Technical University of Braunschweig (Germany) in 2007 and 2013, respectively. He's currently the principal architect of the NVIDIA digital twin for the simulation of radio access networks (RANs). His area of interest covers EM simulation, RAN design, and HPC at scale.
About CC Chong
CC Chong is the senior director and head of Aerial product management at NVIDIA. Before joining NVIDIA, she was most recently senior director and GM of wireless and access business unit in the Intel Programmable Solutions Group. Chong received her Ph.D., in electronics and electrical engineering from the University of Edinburgh in Scotland and her bachelor's in electronics and electrical engineering from the University of Manchester. She was a recipient of the Ten Outstanding Young Malaysian Awards under the category “Scientific and Technological Development” in 2006.
|
|
|
Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops | NVIDIA Technical Blog |
nvidia_dev_blog |
19.12.2025 17:00 |
0.72
|
| Embedding sim. | 0.833 |
| Entity overlap | 0.0435 |
| Title sim. | 0.3007 |
| Time proximity | 0.7083 |
| NLP type | product_launch |
| NLP organization | NVIDIA |
| NLP topic | machine learning |
| NLP country | |
Open original
Machine learning interatomic potentials (MLIPs) are transforming the landscape of computational chemistry and materials science. MLIPs enable atomistic simulations that combine the fidelity of computationally expensive quantum chemistry with the scaling power of AI.
Yet, developers working at this intersection face a persistent challenge: the lack of a robust, Pythonic toolbox for GPU-accelerated atomistic simulation. For use cases such as running a large number of simultaneous, GPU-accelerated simulations, robust and well-supported tools are either missing from the current software ecosystem or fragmented across several open source packages.
Over the past few years, available software for running atomistic simulations with MLIPs has been CPU-centric. Core operations such as neighbor identification, dispersion corrections, long-range interactions, and their associated gradient calculation have traditionally supported only CPU computation, which often struggles to deliver the speed that contemporary research demands. High-throughput simulations of small- to medium-sized atomic systems quickly become bottlenecked by inefficient GPU usage in hybrid workflows where the model is GPU-accelerated in PyTorch but the simulation tooling is serial and CPU-based.
While developers have attempted to implement these operations directly in PyTorch over the years, the general-purpose design of PyTorch leaves performance on the table for the specialized spatial and force calculation operations required in atomistic simulation. This fundamental mismatch between PyTorch capabilities and the demands of atomistic modeling raises an important question: What’s needed to bridge this gap?
NVIDIA ALCHEMI (AI Lab for Chemistry and Materials Innovation), announced at Supercomputing 2024, provides chemistry and materials science developers and researchers with domain-specialized toolkits and NVIDIA NIM microservices optimized on NVIDIA accelerated computing platforms. It is a collection of high-performance, batched and GPU-accelerated tools specifically for enabling atomistic simulations in chemistry and materials science research at the machine learning framework level.
NVIDIA ALCHEMI delivers capabilities across three integrated layers:
ALCHEMI Toolkit-Ops: A repository of GPU-accelerated, batched common operations for AI-enabled atomistic simulation tasks, such as neighbor list construction, DFT-D3 dispersion corrections, and long-range electrostatics.
ALCHEMI Toolkit: A collection of GPU-accelerated simulation building blocks, including geometry optimizers, integrators, and data structures to enable large-scale, batched simulations leveraging AI.
ALCHEMI NIM microservices: A scalable layer of cloud-ready, domain-specific microservices for chemistry and materials science, enabling deployment and orchestration on NVIDIA-accelerated platforms.
This post introduces NVIDIA ALCHEMI Toolkit-Ops, the accelerated batched common operations layer of ALCHEMI. ALCHEMI Toolkit-Ops uses NVIDIA Warp to accelerate and batch common operations in AI-driven atomistic modeling. These operations are exposed through a modular, PyTorch-accessible API (with a JAX API targeted for a future release) that enables rapid iteration and integration with existing and future atomistic simulation packages.
Figure 1 shows the accelerated batched common operations for atomistic simulations included in this initial release of ALCHEMI Toolkit-Ops. This beta release includes two versions of neighbor lists (naive and cell), DFT-D3 dispersion correction, and long-range coulombic (Ewald and Particle Mesh Ewald) functions.
Figure 1. NVIDIA ALCHEMI Toolkit-Ops is a repository of modules developed specifically for GPU-accelerated batched operations (one GPU, many systems) support for MLIPs and molecular dynamics engines
Figure 2 demonstrates the performance of accelerated kernels in ALCHEMI Toolkit-Ops versus popular kernel-accelerated models like MACE (cuEquivariance) and TensorNet (Warp) to achieve fully parallelized performance and scalability. The blue MLIP baseline allows comparison with advanced features like neighbor lists, dispersion corrections (DFT-D3), and explicit electrostatics computations (Ewald and Particle Mesh Ewald (PME)). Test systems consisted of ammonia clusters of increasing size packed into various cells using Packmol. Timing results were averaged over 20 runs on an NVIDIA H100 80 GB GPU. The DFT-D3 benchmark does not include the 6 Å cutoff due to the long-range nature of D3.
Figure 2. Benchmarks showing the speed of ALCHEMI Toolkit-Ops neighbors list (both naive O(N²) and cell list O(N) implementations), DFT-D3 correction and two versions of electrostatic interactions. All methods are compared to the computational cost of popular kernel-accelerated MLIPs. Left-side panels outline batch scaling for fixed number of atoms and variable system size x [batch size], while right-side panels demonstrate timings for a single system growing in size.
ALCHEMI Toolkit-Ops ecosystem integration
ALCHEMI Toolkit-Ops is designed to integrate seamlessly with the broader PyTorch-based atomistic simulation ecosystem. We are excited to announce in-progress integrations with leading open source tools in the chemistry and materials science community: TorchSim, MatGL, and AIMNet Central.
TorchSim
TorchSim, a next-generation open source atomistic simulation engine, is adopting ALCHEMI Toolkit-Ops kernels to power its GPU-accelerated workflows. TorchSim is a PyTorch-native simulation engine purpose-built for the MLIP era, enabling batched molecular dynamics and structural relaxation across thousands of systems simultaneously on a single GPU. TorchSim will leverage our optimized neighbor lists to drive high-throughput batched operations without sacrificing flexibility or performance.
MatGL
MatGL (Materials Graph Library) is an open source framework for building graph-based machine learning interatomic potentials and foundation potentials for inorganic, molecular, and hybrid materials systems. By integrating ALCHEMI Toolkit-Ops, MatGL significantly accelerates graph-based treatments of long-range interactions, enabling large-scale atomistic simulations that are both faster and more computationally efficient without compromising accuracy.
AIMNet Central
AIMNet Central is a repository for AIMNet2, a general-purpose MLIP capable of modeling neutral, charged, organic, and elemental-organic systems with high fidelity. AIMNet Central is leveraging ALCHEMI Toolkit-Ops to further enhance the performance of its flexible long-range interaction models. Using NVIDIA-accelerated DFT-D3 and neighbor list kernels, AIMNet2 can deliver even faster atomistic simulations for large and periodic systems without compromising accuracy.
How to get started with ALCHEMI Toolkit-Ops
Getting started with ALCHEMI Toolkit-Ops is simple and designed with ease of use in mind.
System and package requirements
Python 3.11+
Operating System: Linux (primary), Windows (WSL2), macOS
NVIDIA GPU (A100 or newer recommended), CUDA compute capability ≥ 8.0
CUDA Toolkit 12+, NVIDIA driver 570.xx.xx+
Installation
To install ALCHEMI Toolkit-Ops, use the following snippet:
# Install via pip wheel
pip install nvalchemi-toolkit-ops
# Make sure it is importable
python -c "import nvalchemiops; print(nvalchemiops.__version__)"
See the ALCHEMI Toolkit-Ops documentation for other installation options. Explore the examples directory in the GitHub repository and run the examples to test acceleration on your own hardware.
Typical troubleshooting tips:
Verify CUDA installation and device availability: nvidia-smi
, nvcc --version
Ensure compatible Python version: python --version
Upgrade dependencies as needed: pip list | grep torch
and pip list | grep warp
Feature highlights
This section dives into three ALCHEMI Toolkit-Ops initial features: high-performance neighbor lists, DFT-D3 dispersion corrections, and long-range electrostatic interactions.
Neighbor lists
Neighbor list construction is the backbone of atomistic simulations enabling calculation of energies and forces with local or semi-local MLIPs. ALCHEMI Toolkit-Ops delivers state-of-the-art GPU performance in PyTorch, achieving performance scaling to millions of atoms per second for batches of many small to medium atomic systems or single large atomic systems.
Capabilities
Both O(N) (cell list) and O(N²) (naive) algorithms with batched processing
Periodic boundary support for triclinic cells with arbitrary cell dimensions and partial periodicity
Supports end-to-end compute graph compilation
Direct API compatibility with PyTorch
API example
import torch
from nvalchemiops.neighborlist import neighbor_list
# Water molecule
water_positions = torch.tensor([
[0.0, 0.0, 0.0], # O
[0.96, 0.0, 0.0], # H
[-0.24, 0.93, 0.0], # H
], device="cuda", dtype=torch.float32)
# Ammonia molecule (NH3)
ammonia_positions = torch.tensor([
[0.0, 0.0, 0.0], # N
[1.01, 0.0, 0.0], # H
[-0.34, 0.95, 0.0], # H
[-0.34, -0.48, 0.82], # H
], device="cuda", dtype=torch.float32)
# Concatenate positions for batch processing
positions = torch.cat([water_positions, ammonia_positions], dim=0)
# Create batch indices (0 for water, 1 for ammonia)
batch_idx = torch.cat([
torch.zeros(3, dtype=torch.int32, device="cuda"), # Water
torch.ones(4, dtype=torch.int32, device="cuda"), # Ammonia
])
# Define cells for each molecule (large enough to contain them without PBC)
cells = torch.stack([
torch.eye(3, device="cuda") * 10.0, # Water cell
torch.eye(3, device="cuda") * 10.0, # Ammonia cell
])
# non-periodic molecule case
pbc = torch.tensor([
[False, False, False], # Water
[False, False, False], # Ammonia
], device="cuda")
# Cutoff distance in Angstroms
cutoff = 4.0
# Compute neighbor list; here we explicitly request a batched cell list algorithm
neighbor_matrix, num_neighbors, shift_matrix = neighbor_list(
positions, cutoff, cell=cells, pbc=pbc, batch_idx=batch_idx, method="batch_cell_list"
)
print(f"Neighbor matrix: {neighbor_matrix.cpu()}") # [7, num_neighbors.max()]
print(f"Neighbors per atom: {num_neighbors.cpu()}") # [7,]
print(f"Periodic shifts: {shift_matrix.cpu()}")
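For intuition about what the neighbor list computes, here is a naive O(N²) pure-NumPy sketch (not the ALCHEMI API) that counts neighbors within the cutoff for the same non-periodic water geometry:

```python
import numpy as np

def naive_neighbor_counts(positions, cutoff):
    # O(N^2) reference: all pairwise distances, no periodic images,
    # self-pairs excluded (illustrative only; the library's cell list
    # produces the same neighbors in O(N))
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist < cutoff) & ~np.eye(len(positions), dtype=bool)
    return mask.sum(axis=1)

water = np.array([
    [0.0, 0.0, 0.0],    # O
    [0.96, 0.0, 0.0],   # H
    [-0.24, 0.93, 0.0], # H
])
counts = naive_neighbor_counts(water, cutoff=4.0)
# Every atom in the molecule sees the other two within 4 Angstroms
```

The O(N²) form is fine for small molecules but scales poorly, which is why the batched cell-list algorithm matters for large or many simultaneous systems.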
DFT-D3 dispersion corrections
Realistic molecular modeling must fully account for van der Waals interactions, which standard DFT functionals do not account for systematically. DFT-D3 uses empirical pairwise corrections, leading to substantial improvements in binding energies, lattice structures, conformational analysis, and adsorption studies for common DFT functionals.
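To make the pairwise-correction idea concrete, the following sketch evaluates a two-body D3 term for a single atom pair with Becke-Johnson rational damping. The C6/C8 values here are invented placeholders (real coefficients are element- and coordination-dependent and come from the D3 parameter tables), so this is an illustration of the functional form, not the library's implementation.

```python
import math

def d3bj_pair_energy(r, c6, c8, a1, a2, s6=1.0, s8=1.0):
    # Becke-Johnson rational damping for one atom pair (illustrative)
    r0 = math.sqrt(c8 / c6)   # BJ cutoff radius
    f = a1 * r0 + a2          # damping distance
    e6 = -s6 * c6 / (r**6 + f**6)
    e8 = -s8 * c8 / (r**8 + f**8)
    return e6 + e8

# Hypothetical C6/C8; a1, a2 taken from the PBE values used in the API example
e = d3bj_pair_energy(r=3.5, c6=40.0, c8=1500.0, a1=0.3981, a2=4.4211)
```

The damping keeps the correction finite at short range while recovering the familiar -C6/r⁶ tail at large separations.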
Capabilities
Becke-Johnson (BJ) rational damping variant
Supports batched and periodic calculations
Supports smoothing at cutoff distance
Joint energy, forces, and virial calculation
API example
from nvalchemiops.interactions.dispersion import dftd3
batch_ptr = torch.tensor([0, 3, 7], dtype=torch.int32, device="cuda")
atomic_numbers = torch.tensor(
[6, 1, 1, 7, 1, 1, 1], dtype=torch.int32, device="cuda"
)
# For this snippet, assume d3_params is loaded as:
# d3_params = D3Parameters(rcov=..., r4r2=..., c6ab=..., cn_ref=...)
# Users can refer to the documentation to source DFT-D3 parameters
# and understand the expected data structure
d3_params = ...
# call the DFT-D3 functional interface
energy, forces, coordination_numbers = dftd3(
positions=positions,
numbers=atomic_numbers,
a1=0.3981, a2=4.4211, s8=0.7875, # PBE parameters
neighbor_matrix=neighbor_matrix,
neighbor_matrix_shifts=shift_matrix,
batch_idx=batch_idx,
d3_params=d3_params
)
print(f"Energies: {energy.cpu()}") # [2,]
print(f"Forces: {forces.cpu()}") # [7, 3]
Limitations
The current implementation computes two-body terms only (C6 and C8). Three-body Axilrod-Teller-Muto (ATM/C9) contributions are not included. This generally leads to some over-estimation of dispersion energies.
Long-range electrostatic interactions
Accurate modeling of electrostatic interactions is critical for simulations involving ions/charged species and polar systems. Currently, the most common approach for MLIPs is to learn Coulomb interactions within the short-ranged model. Systematic underestimation of long-range Coulombic effects leads to loss of accuracy in binding energies, solvation structures, and interfacial phenomena.
ALCHEMI Toolkit-Ops provides fully GPU-accelerated Ewald summation methods—both standard Ewald and particle mesh Ewald—enabling GPU-accelerated, efficient and accurate treatment of long-range electrostatics in PyTorch.
For large periodic systems, Ewald-based methods separate electrostatic interactions into short-range and long-range components, each computed in the domain best suited for performance. ALCHEMI Toolkit-Ops provides a dual-cutoff strategy that dramatically reduces redundant neighbor queries and memory overhead compared to naive all-pairs approaches, making high-throughput simulations of charged systems practical on modern GPUs. Users can choose between standard Ewald for smaller systems or PME for larger periodic systems, depending on their specific performance and accuracy needs.
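The real-/reciprocal-space split can be illustrated with the standard Ewald identity, which partitions the bare Coulomb kernel 1/r using the error function. This is a generic sketch of the underlying math, not the ALCHEMI API:

```python
import math

def coulomb_split(r, alpha):
    # Ewald idea: 1/r = erfc(alpha*r)/r + erf(alpha*r)/r
    # The first term decays quickly (summed over real-space neighbors);
    # the second is smooth (summed in reciprocal space, e.g. via cuFFT).
    short_range = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short_range, long_range

r, alpha = 3.0, 0.3  # illustrative distance (Angstrom) and splitting parameter
s, l = coulomb_split(r, alpha)
# The two pieces reconstruct 1/r exactly
```

The splitting parameter alpha (exposed in the API example below) trades work between the two sums: larger alpha shortens the real-space cutoff but demands a finer reciprocal-space grid.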
Capabilities
Ewald summation method
Particle Mesh Ewald (PME) using B-splines
Supports batched and periodic systems
GPU-optimized computation, leveraging cuFFT for fast reciprocal-space evaluation
PyTorch integration provides native tensor support for end-to-end differentiable workflows
API example
from nvalchemiops.interactions.electrostatics import particle_mesh_ewald
# charges for each atom are randomly generated here
atomic_charges = torch.randn(
positions.size(0), dtype=torch.float32, device="cuda"
)
# compute energy and forces with particle mesh ewald
energy, forces = particle_mesh_ewald(
positions,
atomic_charges,
cells,
alpha=0.3, # adjust Ewald splitting parameter
batch_idx=batch_idx,
neighbor_matrix=neighbor_matrix,
neighbor_matrix_shifts=shift_matrix,
compute_forces=True
)
print(f"Energy: {energy.cpu()}") # [2]
print(f"Forces: {forces.cpu()}") # [7, 3]
Dive deeper into ALCHEMI Toolkit-Ops
ALCHEMI Toolkit-Ops empowers the community with high-performance, accessible atomistic modeling tools on NVIDIA GPUs. To accelerate your chemistry and materials science simulations, visit the NVIDIA/nvalchemi-toolkit-ops GitHub repo and the NVIDIA ALCHEMI Toolkit-Ops documentation. You can also explore the examples gallery. This beta release of ALCHEMI Toolkit-Ops focuses on highly efficient neighbor lists, dispersion corrections, and long-range electrostatics. Stay tuned for new features and performance optimizations in future releases.
Acknowledgments
We’d like to thank Professor Shyue Ping Ong; Professor Olexandr Isayev; and the TorchSim committee members Abhijeet Gangan, Orion Archer Cohen, Will Engler, and Ben Blaiszik for working with us to adopt NVIDIA ALCHEMI Toolkit-Ops into their open source projects. We also thank Wen Jie Ong, Piero Altoe, and Kibibi Moseley from NVIDIA for their help preparing this blog post.
Tags
Developer Tools & Techniques | Simulation / Modeling / Design | HPC / Scientific Computing | NIM | Intermediate Technical | Tutorial | Computational Chemistry / Materials Science | featured | PyTorch
About the Authors
About Justin S. Smith
Justin S. Smith is the senior developer relations manager for AI in Chemistry and Materials Science at NVIDIA. He is a computational chemist who earned his PhD from the University of Florida in 2018 where he worked on AI for atomistic simulation. He then went on to become a staff scientist at Los Alamos National Laboratory where he focused on ML applications to reactive chemistry and materials science.
About Nikita Fedik
Nikita Fedik is a senior technical marketing engineer for AI in Chemistry and Materials Science at NVIDIA, specializing in AI-accelerated computational chemistry and scientific visualization. He holds a PhD in Physical Chemistry from Utah State University and previously served as a Staff Scientist at Los Alamos National Laboratory.
About Dallas Foster
Dallas Foster is a senior deep learning software engineer for HPC and AI at NVIDIA. He received his PhD in mathematics at Oregon State University and has worked at Los Alamos National Laboratory, the National Center for Atmospheric Research, and MIT. As a member of the PhysicsNeMo team at NVIDIA, he has a particular focus on the application and deployment of deep learning for weather forecasting and molecular dynamics.
About Roman Zubatyuk
Roman Zubatyuk is the senior application engineer for AI in Chemistry and Materials Science at NVIDIA. He's a computational chemist who earned his PhD in Computational and Data-Enabled Science and Engineering from Jackson State University in 2019. He completed postdoctoral research at the University of North Carolina and Carnegie Mellon University, focusing on the development and application of machine learning interatomic potentials before joining NVIDIA.
About Kelvin Lee
Kelvin Lee is a senior deep learning software engineer at NVIDIA. He received his PhD in Physical Chemistry at the University of New South Wales, Australia. Prior to joining NVIDIA, Kelvin held academic research positions at the Center for Astrophysics | Harvard & Smithsonian and the Massachusetts Institute of Technology before working in industry as a research scientist at Intel Labs. His work focuses on accelerating computation and developer productivity for chemistry and materials science, spectroscopy, and astrophysics.
Simulate Robotic Environments Faster with NVIDIA Isaac Sim and World Labs Marble | NVIDIA Technical Blog |
nvidia_dev_blog |
17.12.2025 17:00 |
Building realistic 3D environments for robotics simulation has traditionally been a labor-intensive process, often requiring weeks of manual modeling and setup. Now, with generative world models, you can go from a text prompt to a photorealistic, simulation-ready world in a fraction of the time. By combining NVIDIA Isaac Sim, an open source robotics reference framework, with generative models such as Marble from World Labs, you can create entire 3D scenes for robotics development from a text or image prompt.
World Labs recently published the case study “Scaling Robotic Simulation with Marble,” showing how researchers are using Marble’s generative worlds to accelerate robot training, testing, and sim-to-real transfer.
In this tutorial, we’ll walk through an end-to-end workflow:
Scene export: Export an existing scene from Marble gallery as Gaussian splats (PLY) and a collider mesh (GLB)
Scene conversion: Convert the Marble outputs to USD format using NVIDIA Omniverse NuRec
Scene import and construction: Import into NVIDIA Isaac Sim
Simulation in Isaac Sim: Add a robot and run the simulation.
By the end, you’ll have a realistic virtual environment where robots can interact physically, all generated far more quickly than by traditional methods. Let’s dive in.
Step 1: Get a 3D kitchen scene from World Labs Marble
World Labs Marble produces rich visual detail and geometric data like depth and surface normals, along with an exportable collider mesh for physical simulation.
For this tutorial, instead of generating a new kitchen from scratch, we’ll use a pre-made Marble kitchen scene that’s available in Marble’s example gallery. This saves time and ensures we have a realistic environment ready to go. The chosen scene is a detailed kitchen and living room interior, complete with furniture and typical kitchen items.
Steps to export the kitchen world from Marble:
Log in to Marble: Sign in to your Marble account on the web. Once logged in, navigate to the pre-made kitchen scene.
Open the scene: Click on the world to load it in Marble’s 3D viewer. You can explore it with WASD controls and mouse as if you were in a game, to verify it looks good.
Download the world: Find the Download button in the bottom bar of Marble’s interface.
Select “Splats (PLY)” to download a Gaussian splat representation. Marble’s Gaussian splat is provided as a .ply file, which contains millions of semi-transparent particles representing the scene with high fidelity.
Select “Collider Mesh (GLB)” to download the triangle mesh of the scene. This will contain the geometry of the kitchen as a standard glTF model.
Note that exporting PLY and GLB files in World Labs Marble requires a paid plan. If you don’t have one, World Labs provides sample PLY and GLB files from its gallery. For this tutorial, we will use the kitchen scene PLY and GLB files as our example. Save the files as MarbleKitchenwithLight.ply and MarbleKitchenwithLight_collider.glb.
At this point, we have our kitchen environment in two forms—as Gaussian splats and as a triangle mesh. Each serves a different purpose: The PLY captures the full visual detail of the scene, and the GLB provides the mesh geometry needed for physics and collisions in simulation.
Video 1. Exploring the Marble sample scene and downloading the PLY and GLB files
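Before converting anything, it can be worth sanity-checking the export. Here is a minimal stdlib-Python sketch (the filename is the one used in this tutorial; the helper itself is hypothetical and not part of Marble or Isaac Sim) that reads the PLY header to report how many Gaussian splats the scene contains:

```python
# Sketch: sanity-check a Marble PLY export by reading its header.
# Assumes a standard PLY header with an "element vertex N" declaration.

def ply_vertex_count(header_bytes: bytes) -> int:
    """Return the vertex (splat) count declared in a PLY header."""
    for raw_line in header_bytes.splitlines():
        line = raw_line.decode("ascii", errors="ignore").strip()
        if line.startswith("element vertex"):
            return int(line.split()[-1])
        if line == "end_header":
            break
    raise ValueError("no 'element vertex' declaration found")

def read_ply_header(path: str, max_bytes: int = 4096) -> bytes:
    """Read just the first few KB of the file, which contains the header."""
    with open(path, "rb") as f:
        return f.read(max_bytes)
```

For example, `ply_vertex_count(read_ply_header("MarbleKitchenwithLight.ply"))` prints the number of semi-transparent particles mentioned above, which is a quick way to confirm the download completed.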
Step 2: Convert downloaded PLY into USDZ
NVIDIA Isaac Sim uses Universal Scene Description (USD) as its scene format. To use our Marble-generated world in Isaac Sim, we need to convert the exported PLY into USD format. We will then import it, taking advantage of NVIDIA Omniverse NuRec capabilities to render the point-based scene efficiently.
At the core of NuRec is the 3DGUT algorithm for Gaussian-based reconstruction and rendering. The NVIDIA 3DGRUT repository contains a script to convert a .ply splat file into a USDZ file, a zip-compressed archive that contains USD-specific data. We will use this to convert our Marble PLY:
1. Set up 3DGRUT: Clone the 3DGRUT repository and install its environment. In this tutorial, we set up 3DGRUT inside a dedicated Conda environment named “3dgrut.”
The environment requires Linux with an NVIDIA GPU, CUDA 11.8+, and GCC 11 or lower. If you already have a Python environment with the needed libraries (PyTorch, etc.), you can alternatively just run the conversion Python script in that environment.
git clone --recursive https://github.com/nv-tlabs/3dgrut.git
cd 3dgrut
chmod +x install_env.sh
./install_env.sh 3dgrut
conda activate 3dgrut
2. Convert PLY to USDZ: Once 3DGRUT is set up, use the provided conversion script to turn the Marble point cloud into USDZ:
$ python -m threedgrut.export.scripts.ply_to_usd \
/path/to/MarbleKitchenwithLight.ply \
--output_file /path/to/MarbleKitchenwithLight.usdz
This command will read the .ply file and produce a .usdz file. USDZ uses a custom USD schema (an extension of UsdVolVolume) to represent the Gaussian splats in a way that Omniverse can render. Essentially, it embeds the point cloud as a volumetric primitive, preserving the visual fidelity of the Marble scene. For more details on NuRec neural volumes and how they are rendered in Omniverse, see the NuRec Rendering documentation.
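If you export several Marble scenes, the same conversion can be scripted. The sketch below only builds the ply_to_usd invocations shown above (a dry run; nothing is executed — run the returned commands yourself with subprocess.run from inside the activated 3dgrut environment):

```python
from pathlib import Path

# Sketch: build the ply_to_usd command line for each Marble export.
# Dry run only: commands are returned, not executed.

def conversion_commands(ply_paths):
    """Return one ply_to_usd command list per input PLY path."""
    cmds = []
    for p in ply_paths:
        ply = Path(p)
        usdz = ply.with_suffix(".usdz")  # output next to the input file
        cmds.append([
            "python", "-m", "threedgrut.export.scripts.ply_to_usd",
            str(ply), "--output_file", str(usdz),
        ])
    return cmds
```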
Now, we have one USDZ file and one GLB file:
MarbleKitchenwithLight.usdz – the visual splat world
MarbleKitchenwithLight_collider.glb – the collider mesh we’ll use for physics.
Step 3: Import USDZ/GLB into Isaac Sim and construct the scene
After generating the USDZ file, the next step is to bring the kitchen scene into Isaac Sim, align the mesh with the Gaussian splats, and add physics and lighting so it is ready for interaction.
Since we are editing the scene contents, we need to extract the USDZ archive. Unzip the file, open the generated default.usda file, and then go through the following steps:
Geometrically align the Gaussian volume:
We want to make sure that the origin of the imported scene and its scale matches Isaac Sim. In order to do that:
Add a ground plane to the scene. This will be used as a reference for the ground of the imported Gaussian volume, and serve as a smooth collider.
The imported Gaussian volume is contained in an “xform” primitive, which is used to transform the volume. To align the volume with the floor, select the xform primitive and adjust its “Translate” values so the floor of the kitchen sits exactly on the ground plane. Use the ground plane as a visual reference and move the Gaussian volume until the point cloud’s floor coincides with it.
The generated scene may be smaller or larger than the real-world scale. To roughly match the real-world scale, we can use a default cube as a visual reference, which has 1 meter side length. After inserting a cube object, we can adjust the overall X, Y, and Z scaling accordingly. For our example kitchen scene, a factor of 2 for the scaling gives roughly the right sizing, e.g., for the cabinet and stove.
Finally, fine-tune the rotation of the xform primitive to make sure the Gaussian point cloud aligns with the ground plane as accurately as possible. A simple way to verify this is to use the tiles on the kitchen wall as a reference and rotate the Gaussian such that they are completely parallel to the ground plane created. Once aligned, move the ground plane back down so it sits exactly at the kitchen floor level.
Video 2. Geometrically aligning the Gaussian volume
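The scale and translate arithmetic behind the steps above can be sketched as two small helpers (illustrative only; in practice you type the resulting values into the xform primitive's properties):

```python
# Sketch: the alignment arithmetic from the steps above. Given a
# dimension measured in the imported scene and its real-world size
# (e.g., against the 1 m reference cube), compute the uniform scale
# factor, and given the current floor height, the Z offset that drops
# the floor onto the ground plane.

def uniform_scale(real_size_m: float, measured_size: float) -> float:
    """Scale factor that maps the imported scene to real-world units."""
    if measured_size <= 0:
        raise ValueError("measured size must be positive")
    return real_size_m / measured_size

def floor_offset(floor_height: float) -> float:
    """Translate-Z needed so the scene floor sits at z = 0."""
    return -floor_height
```

For example, a counter that should stand about 0.9 m but measures 0.45 scene units gives `uniform_scale(0.9, 0.45) == 2.0`, matching the factor of 2 used for the example kitchen.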
Add physics and lighting to the scene:
Now that we have aligned the imported Gaussians, we want to add physics and lighting so that shadows and object interactions work as expected.
We will use the cube that we previously created to adjust the scene scale again to test shadows and physics.
In the collision mesh of the ground plane, turn on the matte object property. This ensures it works properly as a shadow receiver.
Add a dome light to the scene.
Select the ‘gauss’ Volume prim in the stage window, then in the property window, scroll down to “Raw USD Properties” and click the triangle to reveal additional settings. Then, scroll to the “proxy” field, and click on “Add Target.” Finally, select the GroundPlane CollisionMesh as the target.
Video 3. Adding physics and lighting
Move the cube around to ensure shadows show up as expected.
After setting the cube as a rigid body with colliders and pressing Play in the simulation, the cube interacts with the ground plane as expected. However, it “goes through” the Gaussians. Let us now move on to setting up the physics of the Gaussian representation.
The collision information for the Gaussians is in the GLB file. Import this mesh, align it with the Gaussian volume, and enable it as a collider.
Drag and drop the MarbleKitchenwithLight_collider.glb file under the Gaussian volume. Make sure it is under the Gaussian volume, as the hierarchy is important. The collider will show up in the scene.
Zoom out of the scene a little and set the X rotation to -90 to match the coordinate conventions of the Gaussian volume. Now the rendered volume and the collision mesh align completely.
Enable the physics collider preset for the imported collision mesh.
Turn off the visibility for the collider as it is overlaid with the Gaussian volume. This affects only the visuals of the scene; physics will use the colliders we just set up in the scene.
Video 4. Importing collision mesh
The geometry, physics, and lighting for the scene are now in good shape: The Gaussian volume provides the photoreal visuals, while the GLB collider and ground plane handle physics and shadows. The scene is now ready for a robot to be added.
Step 4: Add a robot and run the simulation
With the kitchen scene aligned and physics enabled, the final step is to add a robot and drive it around to validate the setup.
Drag and drop the NVIDIA Nova Carter robot into the scene.
Add a differential controller for the robot and enable keyboard control. This will create the necessary action graph, which allows us to use the keyboard to move the robot around.
Change to a camera mounted on the robot and hit play. Move the robot around with WASD and verify that it respects the kitchen geometry: It should rest on the floor, collide with counters and furniture, and not fall through the scene.
At this point, the Marble kitchen scene is fully integrated into Isaac Sim as a physics-enabled environment, and you can drive robots interactively through it.
Video 5. Adding a robot and navigating the scene
Summary
In this tutorial, we downloaded an AI-generated 3D environment complete with geometry and then brought it into Isaac Sim as a simulation-ready scene. We set up robots in an AI-generated world. The end-to-end workflow here can now be completed in mere hours. This ability to rapidly generate various high-fidelity worlds unlocks more scalable robot development in simulation. With Marble and Isaac Sim, if you can describe a world, you’ll likely be able to start testing it the same day.
To learn more, try the following:
Create your own custom environment with World Labs Marble – You can start with a text description, a single image, multiple photos from different angles, or even a rough 3D layout.
Create your own custom environment from an input image and use it in Isaac Sim with Lyra, an NVIDIA research initiative on generative 3D scene reconstruction via a video diffusion model.
Learn more about simulation innovations and meet with NVIDIA experts at SIGGRAPH Asia, taking place Dec. 15 to 18 at the Hong Kong Convention and Exhibition Centre.
Tags
Agentic AI / Generative AI | Robotics | Simulation / Modeling / Design | General | Isaac Sim | Omniverse | Intermediate Technical | Tutorial | featured | Robot Navigation | Robot Perception | Robotics Compute | Robotics Simulation
About the Authors
About Wonsik Han
Wonsik Han is a senior product manager in the NVIDIA Autonomous Vehicle Group. He brings more than a decade of experience across strategy, business development, and product management roles at global automakers and an autonomous driving startup. Wonsik holds an MBA from Duke University.
About Rishabh Chadha
Rishabh Chadha is a technical marketing engineer at NVIDIA, he focuses on integrating deep learning and robotics frameworks for the NVIDIA Jetson platforms. He has a Masters degree in Robotics from Worcester Polytechnic Institute. His interests primarily include deep learning, medical imaging, and robot perception.
About Isaac Deutsch
Isaac Deutsch is a senior research scientist at NVIDIA who brings together computer vision, imaging, and real-time computer graphics. He contributed to Instant-NGP, NuRec, and 3DGRUT. His current work focuses on computational photography for high-fidelity 3D capture. Isaac holds a master’s degree in Robotics from ETH Zurich and joined NVIDIA in 2018.
About Raffaello Bonghi
Raffaello Bonghi is a developer relations manager for AI & Robotics. Since 2015, he has been an NVIDIA Jetson Champ designing multiple ROS/ROS-based robots for outdoor navigation and educational applications. Additionally, he has been involved in developing AI solutions for numerous international clients in the retail and robotics space. Raffaello holds a Ph.D. in control theory and industrial automation, with a deep focus on robotics.
Migrate Apache Spark Workloads to GPUs at Scale on Amazon EMR with Project Aether | NVIDIA Technical Blog |
nvidia_dev_blog |
17.12.2025 19:00 |
Data is the fuel of modern business, but relying on older CPU-based Apache Spark pipelines exacts a heavy toll: they’re inherently slow, require large infrastructure footprints, and drive massive cloud expenditure. As a result, GPU-accelerated Spark is becoming a leading solution, providing dramatically faster performance through parallel processing. This improved efficiency reduces cloud bills and saves valuable development hours.
Building on this foundation, we introduce a smart and efficient way to migrate existing CPU-based Spark workloads running on Amazon Elastic MapReduce (EMR). Project Aether is an NVIDIA tool engineered to automate this transition. It takes existing CPU jobs and optimizes them to run on GPU-accelerated EMR using the RAPIDS Accelerator for performance benefits.
What is Project Aether?
Figure 1. Project Aether overview showing workflow phases and services
Project Aether is a suite of microservices and processes designed to automate migration and optimization for the RAPIDS Accelerator, effectively eliminating manual friction. It aims to reduce the time needed to migrate Spark jobs from CPU to GPU through:
A prediction model for potential GPU speedup using recommended bootstrap configurations.
Out-of-the-box testing and tuning of GPU jobs in a sandbox environment.
Smart optimization for cost and runtime.
Full integration with Amazon EMR supported workloads.
Amazon EMR Integration
Now supporting the Amazon EMR platform, Project Aether automates the management of GPU test clusters and the conversion and optimization of Spark steps. Users can use the provided services to migrate existing EMR CPU Spark workloads to GPUs.
Setup and configuration
To get started, you’ll need to meet the following prerequisites.
Amazon EMR on EC2: AWS account with GPU instance quotas
AWS CLI: Configured with aws configure
Aether NGC: Request access, configure credentials with ngc config set, and follow the Aether installation instructions.
Configure Aether for EMR
Once the Aether package is installed, configure the Aether client for the EMR platform using the following commands:
# Initialize and list config
$ aether config init
$ aether config list
# Select EMR platform and region
$ aether config set core.selected_platform emr
$ aether config set platform.emr.region <region>
# Set required EMR s3 paths
$ aether config set platform.emr.spark_event_log_dir <s3_path_for event_logs>
$ aether config set platform.emr.cluster.artifacts_path <s3_path_for uploading_aether_artifacts>
$ aether config set platform.emr.cluster.log_path <s3_path_for_cluster_log_uri>
Example Aether EMR migration workflow
The Aether CLI tool provides several modular commands for running the services. Each command displays a summary table and tracks each run in the job history database. At any point, refer to “4. Migrate: Report and recommendation” for viewing the tracked jobs. Use the --help option for more details on each aether command.
The example EMR workflow requires starting with an existing Spark step with step ID s-XXX that ran on a CPU EMR cluster with cluster ID j-XXX. For more information on submitting steps to EMR clusters, refer to the Amazon EMR documentation.
The migration process is broken down into four core phases: predict, optimize, validate, and migrate.
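The four phases map naturally onto an ordered sequence of CLI calls. The sketch below only builds that sequence (a dry run; the IDs are the same placeholders used in this post, and the cluster-creation step between qualify and tune is elided for brevity):

```python
# Sketch: the predict -> optimize -> validate -> migrate phases as an
# ordered list of aether CLI invocations. Dry run only: commands are
# built, not executed. The GPU cluster ID placeholder stands in for the
# ID returned by `aether cluster create`.

def migration_plan(cpu_step_id: str, cpu_cluster_id: str, aether_job_id: str):
    return [
        ["aether", "qualify", "--platform_job_id", cpu_step_id,
         "--cluster_id", cpu_cluster_id],
        ["aether", "tune", "--aether_job_id", aether_job_id,
         "--cluster_id", "<gpu_cluster_id>", "--min_tuning_iterations", "3"],
        ["aether", "validate", "--aether_job_id", aether_job_id],
        ["aether", "report", "job", "--aether_job_id", aether_job_id],
    ]
```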
1. Predict: Qualification
Determine a CPU Spark job’s viability for GPU acceleration and generate initial optimization recommendations.
The qualification tool uses the QualX machine learning system’s XGBoost model to predict potential GPU speedup and compatibility based on workload characteristics derived from the CPU event log.
Input:
CPU event log obtained from EMR step and cluster API, or provided directly.
Output:
Recommended Spark configuration parameters generated by the AutoTuner.
Recommended GPU cluster shape with instance types and counts optimized for cost savings.
Aether Job ID to track this job and any subsequent job runs.
Commands:
# Option 1: Use Platform IDs
$ aether qualify --platform_job_id <cpu_step_id> --cluster_id <cpu_cluster_id>
# Option 2: Provide event log path directly
$ aether qualify --event_log <s3_or_local_event_log_path>
2. Optimize: Automatic testing and tuning
Achieve optimal performance and cost savings by testing the job on a GPU cluster and iteratively tuning the Spark configuration parameters.
Create the GPU test cluster with the Cluster service, then optimize the GPU job with the tune service, which iteratively runs submit and profile:
Submit: The job submission service submits the Spark job to a GPU cluster with the specified configurations.
Profile: The profile service uses the profiling tool to process the GPU event logs to analyze bottlenecks and generate new Spark configuration parameters to increase performance and/or reduce cost.
Input:
Recommended Spark configuration parameters from qualify output for the GPU job.
Recommended GPU cluster shape from qualify output to create the GPU cluster.
Output:
Best GPU configuration is selected from the run with the lowest duration among all tuning iterations.
Commands:
A. Create a test EMR GPU cluster:
# Option 1: Use the recommended cluster shape ID with a default cluster configuration
$ aether cluster create --cluster_shape_id <recommended_cluster_shape_id_from_qualify>
# Option 2: Provide a custom configuration file
$ aether cluster create --cluster_shape_id <recommended_cluster_shape_id_from_qualify> --config_file <custom_cluster_yaml_file>
B. Submit the GPU step to the cluster:
# Submit the job to the cluster using config_id and cluster_id
$ aether submit --config_id <recommended_spark_config_id_from_qualify> --cluster_id <gpu_cluster_id_from_create>
C. Profile the GPU run to generate new recommended Spark configs:
# Profile the job using the step_id and cluster_id
$ aether profile --platform_job_id <gpu_step_id_from_submit> --cluster_id <gpu_cluster_id_from_create>
D. Tune the job iteratively (submit + profile loop):
# Tune the job for 3 iterations
$ aether tune --aether_job_id <aether_job_id> --cluster_id <gpu_cluster_id_from_create> --min_tuning_iterations 3
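The tuning loop's selection logic (iterate submit plus profile, keep the configuration from the run with the lowest duration) can be sketched as follows; submit_and_profile is a hypothetical stand-in for one submit + profile round, injected so the loop itself can be shown in isolation:

```python
# Sketch: tuning-loop selection. submit_and_profile(config) stands in
# for one submit + profile round and returns (duration_seconds,
# next_spark_config). The best run is the one with the lowest duration.

def tune(submit_and_profile, initial_config, iterations=3):
    """Run the loop and return (best_duration, best_config)."""
    best_duration, best_config = float("inf"), initial_config
    config = initial_config
    for _ in range(iterations):
        duration, config = submit_and_profile(config)
        if duration < best_duration:
            best_duration, best_config = duration, config
    return best_duration, best_config
```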
3. Validate: Data integrity check
Confirm the GPU job’s output integrity by ensuring its results are identical to the original CPU job.
The validate service compares key row metrics retrieved from the event logs, specifically rows read and rows written, between the best GPU run and the original CPU run.
Command:
# Validate the CPU and GPU job metrics
$ aether validate --aether_job_id <aether_job_id>
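The check the validate service performs can be sketched as a simple metric comparison (the metric names here are illustrative, not the service's actual schema):

```python
# Sketch: compare rows-read and rows-written between the original CPU
# run and the best GPU run. Equal metrics suggest the GPU job produced
# identical output; the key names are illustrative.

def outputs_match(cpu_metrics: dict, gpu_metrics: dict) -> bool:
    keys = ("rows_read", "rows_written")
    return all(cpu_metrics.get(k) == gpu_metrics.get(k) for k in keys)
```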
4. Migrate: Report and recommendation
View detailed reports of the tracked jobs in the job history database, and see per-job migration recommendations with the optimal Spark configuration parameters and GPU cluster configurations.
The report service provides CLI and UI options to display:
Key performance indicators (KPIs): The total speedup and total cost savings across all jobs.
Job list: Per-job speedup, cost savings, and migration recommendations.
Job details: All job run (original CPU run and GPU tuning runs) metrics and details for a job.
Commands:
# List all job reports
$ aether report list
# View all job runs for a specific job
$ aether report job --aether_job_id <aether_job_id>
# Start the Aether UI to view the reports in a browser
$ aether report ui
Figure 2. Example screenshot of Aether report UI job details
Figure 3. Example screenshot of Aether report UI GPU config details
5. Automated run
Combine all of the individual services above into a single automated Aether run command:
# Run full Aether workflow on CPU event log
$ aether run --event_log <s3_or_local_event_log_path>
Conclusion
Project Aether is a powerful tool for accelerating big data processing, reducing the time and cost associated with migrating and running large-scale Apache Spark workloads on GPUs.
To try it out for large-scale migrations of Apache Spark workloads, apply for Project Aether access. To learn more about the RAPIDS plugin, see the documentation for RAPIDS Accelerator for Apache Spark .
Tags
Data Center / Cloud | Data Science | Developer Tools & Techniques | General | RAPIDS | Intermediate Technical | Tutorial | AWS | featured
About the Authors
About Navin Kumar
Navin Kumar is a senior distributed systems engineer at NVIDIA, working on the Spark RAPIDS Accelerator team. He's the chief architect of Project Aether, a service for accelerating the migration process of Apache Spark jobs from CPU to GPU. Navin holds a B.S. in Computer Science from Cornell University and an M.S. in Information Networking from Carnegie Mellon University. Previously, he was the architect of a large-scale Apache Spark-based data pipeline at security startup Fletch (acquired by F5).
About Sean Yang
Sean Yang is a system software engineer at NVIDIA, working on Project Aether with the Spark RAPIDS Accelerator team. Previously, he worked on federated learning systems with the NVIDIA FLARE engineering team. He holds a B.S. in computer science from the University of California, Berkeley, with a focus on machine learning and distributed systems.
About Sayed Bilal Bari
Sayed Bilal Bari is a system software engineer at NVIDIA, working on tooling for Spark RAPIDS Accelerator and Project Aether. He holds an M.S. in Computer Science from Stony Brook University. With a keen interest in big data processing and frameworks, he has previously worked as a senior data engineer for companies like Intuit and Walmart Labs.
Solving Large-Scale Linear Sparse Problems with NVIDIA cuDSS | NVIDIA Technical Blog |
nvidia_dev_blog |
17.12.2025 18:30 |
Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers and require unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built for users to run sparse solvers at massive scale with minimal code changes, unlocking breakthrough speed and efficiency for next-generation engineering and design.
With hybrid memory mode, you can leverage both CPU and GPU memory to run problems that would not otherwise fit in a single GPU's memory, run a workload across multiple GPUs, or even scale to multiple nodes. This blog discusses strategies cuDSS users can apply to solve large-scale problems.
Getting started
To get started, this blog assumes you already have a working code that uses cuDSS. You may have also explored the introductory examples on GitHub ( here and here ) that demonstrate running cuDSS on a single GPU and adjusting default solution parameters using the Get and Set functions. These examples cover creating matrices and main cuDSS objects, and executing the three core phases of cuDSS: analysis, numerical factorization, and solution.
Thanks to the increased memory capacity of recent GPU generations, even a single GPU can handle fairly large sparse problems. However, when tackling truly massive problems—on the order of over 10 million rows and over a billion nonzeros—there are effective strategies to make cuDSS run fast and efficiently. The first approach still uses a single GPU but introduces techniques to address these bigger challenges without major code changes.
Rethink your data types: Why INT64 matters now
When you create a dense or sparse matrix for cuDSS, you are likely to use one of two functions, cudssMatrixCreateDn() or cudssMatrixCreateCsr(), or even both. From the documentation, the functions are described below.
cudssMatrixCreateDn
cudssStatus_t cudssMatrixCreateDn(
cudssMatrix_t *matrix,
int64_t nrows,
int64_t ncols,
int64_t ld,
void *values,
cudaDataType_t valueType,
cudssLayout_t layout
)
The second function, cudssMatrixCreateCsr(), is shown next.
cudssMatrixCreateCsr
cudssStatus_t cudssMatrixCreateCsr(
cudssMatrix_t *matrix,
int64_t nrows,
int64_t ncols,
int64_t nnz,
void *rowStart,
void *rowEnd,
void *colIndices,
void *values,
cudaDataType_t indexType,
cudaDataType_t valueType,
cudssMatrixType_t mtype,
cudssMatrixViewType_t mview,
cudssIndexBase_t indexBase
)
In cuDSS versions before 0.7.0, indices of the sparse matrices could only use 32-bit integers. Specifically, the underlying data type for rowStart, rowEnd, and colIndices could only be int, and the indexType parameter could only be CUDA_R_32I. From cuDSS 0.7.0 onward, you can solve bigger problems by using 64-bit integer indexing arrays of type int64_t with CUDA_R_64I for the indexType.
Note: The input matrix is still limited to fewer than 2^31 rows and columns (but with 64-bit indices, it can have many more nonzeros).
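To see why the index width matters, consider the storage for CSR index arrays: (n + 1) row offsets plus nnz column indices. A small arithmetic sketch (the helper is illustrative, not part of the cuDSS API):

```python
# Sketch: CSR index storage for a matrix with n_rows rows and nnz
# nonzeros. With 32-bit indices, any nnz above 2**31 - 1 cannot even
# be addressed; 64-bit indices double the array size but keep scaling.

def csr_index_bytes(n_rows: int, nnz: int, index_bits: int = 64) -> int:
    if index_bits == 32 and nnz > 2**31 - 1:
        raise OverflowError("nnz exceeds 32-bit index range; use 64-bit")
    return (index_bits // 8) * (n_rows + 1 + nnz)
```

For a matrix with 10 million rows and 1 billion nonzeros, the index arrays alone take roughly 4 GB with 32-bit indices and roughly 8 GB with 64-bit indices; beyond about 2.1 billion nonzeros, 64-bit indices are the only option.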
Hybrid memory mode—blurring the line between CPU and GPU
cuDSS hybrid memory mode is designed to overcome the memory limitations of a single GPU when solving extremely large sparse linear problems by using the GPU and CPU memories.
However, there’s a tradeoff: Data transfer between CPU and GPU takes time and is governed by bus bandwidth. While you get to tackle bigger problems, you should expect some performance hit due to these transfers. That said, thanks to modern NVIDIA driver optimizations and fast CPU/GPU interconnects (such as those in NVIDIA Grace Blackwell nodes), the penalty is manageable—and for certain problem sizes, hybrid memory performance scales impressively.
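A back-of-envelope way to reason about that penalty is to divide the bytes that spill to host memory by the interconnect bandwidth. The figures below are illustrative assumptions (roughly 64 GB/s for a PCIe Gen5 x16 link versus several hundred GB/s for NVLink-C2C), not measured cuDSS numbers:

```python
# Sketch: estimate the time cost of CPU<->GPU traffic introduced by
# hybrid memory mode. bandwidth_gb_s is the interconnect bandwidth in
# GB/s (decimal gigabytes); bytes_moved is the spilled data volume.

def transfer_seconds(bytes_moved: float, bandwidth_gb_s: float) -> float:
    return bytes_moved / (bandwidth_gb_s * 1e9)
```

For example, moving 64 GB of factor data across a 64 GB/s link costs about a second per pass, which is why fast CPU/GPU interconnects keep the hybrid-mode penalty manageable.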
Hybrid memory mode is off by default. To enable it, call cudssConfigSet() to set CUDSS_CONFIG_HYBRID_MODE, which tells cuDSS to use hybrid memory mode. Note that this must be done prior to the first call to cudssExecute().
By default, cuDSS manages device memory automatically and may use as much as the entire GPU contains. Alternatively, users can specify a smaller memory footprint by setting a user-defined limit, anywhere from the value of CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN up to the device memory available after the analysis (symbolic factorization) phase, which can be queried via the NVIDIA CUDA Runtime API cudaMemGetInfo(). A few highlights to note:
Even if hybrid memory mode is on, cuDSS first attempts to use device memory (avoiding CPU memory where possible) to achieve the best performance.
Best performance is achieved by allowing the maximum GPU memory, which results in fewer memory transfers between the CPU and the GPU.
The hybrid memory limit can be set per device (as shown in a later code block).
The example code walks you through fetching minimum device memory requirements and setting your memory limits accordingly, giving you fine control over memory footprints.
...
/* Enable hybrid memory mode, where factors may be stored in host memory.
   Note: It must be set before the first call to the ANALYSIS step. */
int hybrid_mode = 1;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                                    &hybrid_mode, sizeof(hybrid_mode)),
                     status, "cudssConfigSet CUDSS_CONFIG_HYBRID_MODE");
/* Symbolic factorization */
...
/* (optional) Query the minimal amount of device memory sufficient for
   the hybrid memory mode.
   Note: By default, cuDSS attempts to use all available device memory
   if needed. */
size_t sizeWritten;
int64_t device_memory_min;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                  CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                  &device_memory_min, sizeof(device_memory_min),
                                  &sizeWritten),
                     status,
                     "cudssDataGet for CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN");
printf("cuDSS example: minimum amount of device memory\n"
       "for the hybrid memory mode is %ld bytes\n",
       device_memory_min);
/* (optional) Specify how much device memory is available for cuDSS. */
int64_t hybrid_device_memory_limit = 40 * 1024; /* in bytes = 40 KB */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &hybrid_device_memory_limit,
                                    sizeof(hybrid_device_memory_limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
printf("cuDSS example: set the upper limit on device memory\n"
       "for the hybrid memory mode to %ld bytes\n",
       hybrid_device_memory_limit);
/* Factorization */
...
/* Solving */
...
The first cuDSS call, cudssConfigSet(), enables hybrid memory mode before the first analysis step (symbolic factorization). Next, cudssDataGet() queries the minimal amount of device memory sufficient for hybrid memory mode. A second cudssConfigSet() call then specifies how much device memory cuDSS may use. Note that the automatic memory management can sometimes result in out-of-memory (OOM) errors.
For developers integrating this, the debugging tips in the documentation are well worth a read and can save you some headaches.
Hybrid memory mode performance depends on the CPU/GPU memory bandwidth available for moving data between the CPU and the GPU. To illustrate this, Figure 1 below shows the factorization and solve speedup for matrices ranging from 1 million to 18 million rows, solved using cuDSS hybrid memory mode. The baseline is a single NVIDIA B200 GPU node. The observed speedup compares the same model executed on a Grace Blackwell node versus an x86 Blackwell node, reflecting the memory bandwidth ratio between the two nodes.
Figure 1. Speedup of the factorization and solve phases for GB200 vs. B200 (cuDSS in hybrid memory mode using the minimum required device memory): B200 + Grace (72 cores, 480 GB) vs. B200 + x86 CPU (112 cores)
With the int64 indexing and hybrid memory mode strategies, cuDSS can accommodate large problems and use all the memory available on the node when needed. But we are still limited to a single GPU. The next strategy uses more GPUs to accommodate larger problems, and also to solve problems of a fixed size faster.
Multiply your muscle: multi-GPU mode (MG mode)
cuDSS multi-GPU mode (MG mode) lets the developer use all of the GPUs in a single node without specifying any distributed communication layer; cuDSS handles all the required inter-GPU communication internally. It is helpful in three scenarios:
When the problem is too large to fit on a single device (with or without hybrid memory).
When the user wants to avoid the performance penalty of hybrid memory mode.
When the user is focused on strong scaling—solving the problem across more GPUs to reach a solution faster.
The highlight of MG mode is that the developer does not need to specify a communication layer: no MPI, no NCCL, nothing else. cuDSS does all of this for you.
Additionally, because CUDA-aware MPI communication is not available on Windows, MG mode is particularly valuable for applications running on Windows nodes.
Figure 2 below illustrates the time (in seconds) required to solve an approximately 30-million-row matrix on an NVIDIA DGX H200 node across one-, two-, and four-GPU configurations with the factorization time on the top chart and the solve time on the bottom chart. The initial computation was performed on a single GPU, followed by runs using two and four GPUs with MG mode. As shown, solving the model with two GPUs significantly reduces computation time compared to a single GPU, albeit at the cost of increased GPU resource usage.
Figure 2. Factorization and solve time on H200 for one-, two-, and four-GPU configurations using Cadence’s MCAE applications. The matrix has approximately 31M rows and columns and with approximately 1B non-zeros.
This example shows you how to utilize MG mode. The relevant parts of the code are summarized below. Note that this includes code for using hybrid memory mode. This is important because if you use hybrid memory, you have to set the device memory limits on all of the devices that will be used.
...
/* Creating the cuDSS library handle */
cudssHandle_t handle;
/* Query the actual number of available devices */
int device_count = 0;
cuda_error = cudaGetDeviceCount(&device_count);
if (cuda_error != cudaSuccess || device_count <= 0) {
printf("ERROR: no GPU devices found\n");
fflush(0);
return -1;
}
/* device_indices can be set to NULL. In that case, cuDSS will take devices
 * from 0 to (device_count - 1).
 */
int *device_indices = NULL;
device_indices = (int *)malloc(device_count * sizeof(int));
if (device_indices == NULL) {
printf("ERROR: failed to allocate host memory\n");
fflush(0);
return -1;
}
for (int i = 0; i < device_count; i++)
device_indices[i] = i;
...
/* Initialize cudss handle for multiple devices */
CUDSS_CALL_AND_CHECK(cudssCreateMg(&handle, device_count, device_indices),\
status, "cudssCreate");
...
/* Creating cuDSS solver configuration and data objects */
cudssConfig_t solverConfig;
cudssData_t solverData;
CUDSS_CALL_AND_CHECK(cudssConfigCreate(&solverConfig), status,\
"cudssConfigCreate");
/* Pass same device_count and device_indices to solverConfig */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, \
CUDSS_CONFIG_DEVICE_COUNT, &device_count,\
sizeof(device_count)), status, \
"cudssConfigSet for device_count");
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,\
                     CUDSS_CONFIG_DEVICE_INDICES, device_indices,\
                     device_count * sizeof(int)), status, \
                     "cudssConfigSet for device_indices");
CUDSS_CALL_AND_CHECK(cudssDataCreate(handle, &solverData), status,\
"cudssDataCreate");
...
/* Symbolic factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,\
solverConfig, solverData, A, x, b),\
status, "cudssExecute for analysis");
...
/* Query CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN should be done for each device
* separately by calling cudaSetDevice() prior to cudssDataGet.
* Same for getting CUDSS_DATA_MEMORY_ESTIMATES.
* Same for setting CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT with
* cudssConfigSet()
*/
int default_device = 0;
cudaGetDevice(&default_device);
for (int dev_id = 0; dev_id < device_count; dev_id++) {
cudaSetDevice(device_indices[dev_id]);
int64_t hybrid_device_memory_limit = 0;
size_t sizeWritten;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,\
CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,\
&hybrid_device_memory_limit,\
sizeof(hybrid_device_memory_limit),\
&sizeWritten),\
status, "cudssDataGet for the memory estimates");
printf("dev_id = %d CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN %ld bytes\n",
device_indices[dev_id], hybrid_device_memory_limit);
}
/* cuDSS requires all API calls to be made on the default device, so
 * resetting the device context.
 */
cudaSetDevice(default_device);
/* Factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,\
solverConfig, solverData, A, x, b),\
status, "cudssExecute for factor");
/* Solving */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE, solverConfig,\
solverData, A, x, b), status, \
"cudssExecute for solve");
...
Setting up MG mode is straightforward. It begins by finding the number of devices on the node and using all of them, or a specific subset. The device indices are then set to 0 through device_count - 1 (the code uses the first device_count devices; you can change this to specific device numbers if you want). To make the code more flexible, the number of devices and the device list could easily be taken from the command line or a file.
After this, the MG-specific coding begins by calling cudssCreateMg() to initialize the cuDSS handle for multiple devices. Before you call a solution phase, you additionally need to initialize the cuDSS configuration with the device information. Specifically, after creating a cuDSS solver configuration object with cudssConfigCreate(), set the configuration details for MG mode using cudssConfigSet() for the following:
CUDSS_CONFIG_DEVICE_COUNT, using the value device_count.
CUDSS_CONFIG_DEVICE_INDICES, using the array device_indices.
Then you use cudssDataCreate() to create the solverData object for cuDSS and perform the analysis stage.
In case you are using hybrid memory mode, prior to the factorization, you might want to set device memory limits for each of the devices separately. This is shown in the code above. Once completed, you can factorize the matrix and solve the problem.
A highlight of MG mode is that you don't need to code for communications between the GPUs; cuDSS does all of that for you. However, there are some current limitations to using MG mode:
Using MG mode jointly with multi-GPU multi-node (MGMN) mode is not supported (the next section covers MGMN).
Distributed input is not currently supported.
MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
MG mode does not support matrix batches.
All phases in MG mode are synchronous.
All data must be on the first device (rank 0) before calling cudssExecute(); cuDSS will then distribute the data to the other devices as needed.
Going big: Multi-GPU Multi-Node (MGMN) mode for distributed power
Now, what if one node isn't enough and you want to spread your computations across multiple nodes? That's where MGMN mode comes in. It requires a communication layer that, once added, allows you to use any or all of the GPUs in a node, as well as multiple nodes, without restriction. This enables users to solve massive problems, or to use more GPUs to solve a problem faster.
cuDSS uses an abstraction: a small communication "shim" layer that can be tailored to CUDA-aware Open MPI, NVIDIA NCCL, or even a custom backend if you want.
This example for MGMN contains the code for both Open MPI and NCCL. If you wish to use your own communication layer, there is an explanation of how to do that.
To illustrate how the communication layer is used, an #ifdef code block from the example is presented below for both the MPI and NCCL code paths. Constants defined at compile time, USE_MPI and USE_NCCL, select which code paths are used; they are important for this example but aren't shown in the code block.
This #ifdef block corresponds to lines 520-535 in the sample code (these line numbers could change with subsequent versions, so check them carefully).
#ifdef USE_MPI
#if USE_OPENMPI
if (strcmp(comm_backend_name,"openmpi") == 0) {
CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,\
mpi_comm, sizeof(MPI_Comm*)), \
status, \
"cudssDataSet for OpenMPI comm");
}
#endif
#if USE_NCCL
if (strcmp(comm_backend_name,"nccl") == 0) {
CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,\
nccl_comm, sizeof(ncclComm_t*)), \
status, \
"cudssDataSet for NCCL comm");
}
#endif
#endif
Note that the code needed to select MPI or NCCL is minimal, and the differences between the two paths are small. You can plug in your own communication layer in a very similar manner.
Once you define the communicator pointer, passed to cuDSS via CUDSS_DATA_COMM as shown in the previous code snippet, there is no need to call any other communication-layer functions unless your code specifically needs them. cuDSS uses the defined communication layer under the covers, so you don't have to code for it. Look through the example code to see how more than one node is used.
For implementing your own communication layer, a good introductory discussion can be found in the cuDSS documentation under advanced topics .
A high-level overview of the communication layer requirements is below:
The MGMN mode is enabled by abstracting away all communication-specific primitives into a small, separately built shim communication layer.
Users can have their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).
MGMN support in cuDSS does not require any changes to applications that do not make use of MGMN mode.
MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix, dense right-hand side, and solution, using the cudssMatrixSetDistributedRow1D() function (see the next paragraph).
cuDSS MGMN mode optionally accepts pre-distributed input and can optionally create distributed output. You can keep both A and b on the rank 0 device, in which case cuDSS will distribute them, or you can tell cuDSS how the data is distributed across the devices and nodes with the cudssMatrixSetDistributedRow1D() function. The developer must make sure the data is in the proper location on the proper node and device.
A critical step for good performance is to carefully choose your CPU:GPU:NIC bindings. This is not discussed here but is documented elsewhere.
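The 1D row-wise distribution above can be pictured as a contiguous block of rows per rank. The small Python sketch below computes such a balanced partition; the helper name is ours for illustration, not a cuDSS API:

```python
# Sketch of a balanced 1D row-wise partition, the distribution scheme MGMN
# mode uses for CSR input. The helper below is illustrative, not a cuDSS API.
def partition_rows_1d(n_rows: int, n_ranks: int):
    """Return (first_row, last_row_exclusive) per rank, contiguous and balanced."""
    base, extra = divmod(n_rows, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        count = base + (1 if rank < extra else 0)  # first `extra` ranks get one more row
        ranges.append((start, start + count))
        start += count
    return ranges

print(partition_rows_1d(10, 4))
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each rank would then hold the CSR rows, right-hand-side entries, and solution entries for its own range.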
There are some current limitations to MGMN mode:
MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
MGMN mode does not support matrix batches.
All phases in MGMN mode are synchronous.
Takeaways
Sparse linear systems appear in many disciplines, and fueled by the need to solve real-life problems, their overall size is growing at a rapid rate. Developers must find ways to solve them both efficiently and quickly. NVIDIA cuDSS provides an easy-to-use library for solving increasingly massive problems on NVIDIA GPUs.
For more features you can use with cuDSS, read through the advanced features section of the documentation. It contains more information about the features presented here, as well as other capabilities to help you solve your large sparse linear problems. There is also a section that explains how to do logging with cuDSS as you develop your code. Since debugging parallel code can be challenging, this is a great resource: cuDSS can log detailed information as your code executes.
Subscribe to cuDSS on the customer page to stay updated on the most recent innovations.
About the Authors
About Jeff Layton
Jeff Layton is a technical marketing engineer on the CAE/EDA software team at NVIDIA. Prior to NVIDIA, Jeff was a professor in aeronautics and an engineer at Boeing and Lockheed Martin. Then he jumped into the high-performance computing world starting with Linux Networx, Panasas (now VRUDA), Dell, Intel, and Amazon Web Services. He has a Ph.D. from Purdue University in aeronautics and astronautics, and his career has revolved around using HPC to solve problems, especially in the aerospace world focusing on multidisciplinary optimization (MDO). He is also an independent author writing about HPC for over 20 years.
About Azi Riahi
Azi Riahi is a principal product manager for NVIDIA math libraries. Prior to joining NVIDIA, Azi served as lead product manager for the TPU compiler stack at Google, where she supported the launch of multiple TPU platforms and led key initiatives to improve efficiency and usability within the XLA TPU compiler. Azi holds a Ph.D. in computational mechanics from the University of Toronto and has contributed her expertise as a reviewer and consultant for the U.S. Department of Energy and several national laboratories.
Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11 | NVIDIA Technical Blog
nvidia_dev_blog | 16.12.2025 18:00
Simulating large-scale quantum computers has become more difficult as the quality of quantum processing units (QPUs) improves. Validating the results is key to ensure that after the devices scale beyond what is classically simulable, we can still trust the outputs.
Similarly, when generating large-scale datasets for various AI models that aim to aid in the operation of quantum processors, we see the need to offer useful training data at all scales and abstractions accelerated by GPUs. Examples include AI quantum error correction decoders, AI compilers, AI agents for calibration and control, and models to generate new device designs.
cuQuantum SDK is a set of high-performance libraries and tools for accelerating quantum computing simulations, at both the circuit and device levels, by orders of magnitude. The latest version of the cuQuantum SDK, v25.11, introduces components that accelerate two new workloads: Pauli propagation and stabilizer simulation. Each of these is critical for simulating large-scale quantum computers.
This post dives into how you can start running Pauli propagation simulations and accelerate sampling from your stabilizer simulations to solve these problems with GPU-accelerated supercomputers.
cuQuantum cuPauliProp
Pauli propagation is a relatively new method for efficiently simulating the observables of large-scale quantum circuits, including circuits that incorporate noise models of real quantum processors. By expressing states and observables as weighted sums of Pauli tensor products, the simulation can dynamically discard terms that contribute insignificantly to a sought expectation value. This permits estimation of experimental quantities that are otherwise intractable for exact simulation.
Many relevant quantum computing applications are centered around the computation of expectation values, for example VQE and quantum simulation of physical dynamics. Various exact and approximate classical simulation techniques enable calculating such observables for large circuits, though they become prohibitively expensive in differing settings. For example, the matrix product state technique, a very popular approximate tensor network state method for circuit simulation, is typically ill-suited for large circuits that encode the dynamics of two- or three-dimensional physical systems.
Pauli propagation is a complementary and useful addition to the approximate circuit simulation toolbox, for both pure and noisy circuits. Beyond being provably efficient for simulating near-Clifford and/or very noisy circuits, Pauli propagation has shown impressive performance when simulating circuits which Trotterize the evolution of certain quantum spin systems. This includes some “utility circuits” named in reference to their use in IBM’s utility experiment involving a 127 qubit device as detailed in Evidence for the Utility of Quantum Computing Before Fault Tolerance . Characterizing which circuits can be efficiently simulated with Pauli propagation is an ongoing research effort, as significant as refinement of the algorithmic details of the method itself.
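To make the idea concrete, here is a toy, single-qubit Pauli propagation in plain Python: the observable Z is pushed through layers of Rx rotations in the Heisenberg picture, with small coefficients truncated. This is a conceptual sketch only, not the cuPauliProp API:

```python
import math

# Toy Pauli propagation in the Heisenberg picture, on a single qubit.
# Observable Z is pushed backward through `layers` applications of Rx(theta);
# coefficients below a cutoff are truncated, as the full method does at scale.
# Single-qubit conjugation rules U^dagger P U for U = exp(-i theta/2 X):
#   Z -> cos(theta) Z + sin(theta) Y
#   Y -> cos(theta) Y - sin(theta) Z
#   X -> X (commutes with the rotation axis)
def apply_rx_adjoint(expansion, theta, cutoff=1e-12):
    c, s = math.cos(theta), math.sin(theta)
    out = {}
    for pauli, coef in expansion.items():
        if pauli == "Z":
            out["Z"] = out.get("Z", 0.0) + c * coef
            out["Y"] = out.get("Y", 0.0) + s * coef
        elif pauli == "Y":
            out["Y"] = out.get("Y", 0.0) + c * coef
            out["Z"] = out.get("Z", 0.0) - s * coef
        else:
            out[pauli] = out.get(pauli, 0.0) + coef
    # Truncation: discard terms whose coefficients fall below the cutoff
    return {p: a for p, a in out.items() if abs(a) > cutoff}

theta, layers = 0.1, 25
expansion = {"Z": 1.0}              # the observable we start from
for _ in range(layers):
    expansion = apply_rx_adjoint(expansion, theta)

# <0| O |0> keeps only the Z (and I) components: <0|Z|0> = 1, <0|Y|0> = 0
expectation = expansion.get("Z", 0.0)
print(abs(expectation - math.cos(layers * theta)) < 1e-9)  # True
```

In the real method, each gate can split a multi-qubit Pauli string into several strings, and the truncation threshold is what keeps the expansion tractable.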
cuQuantum 25.11 offers primitives to accelerate Pauli propagation or derivative methods on NVIDIA GPUs with the release of this new cuQuantum library, enabling developers and researchers to advance the frontier of classical circuit simulation. Core functions are described in the following sections.
Library initialization
Initialize the library handle and workspace descriptor required for operations:
import cupy as cp
from cuquantum.bindings import cupauliprop
from cuquantum import cudaDataType
# Create library handle and workspace descriptor
handle = cupauliprop.create()
workspace = cupauliprop.create_workspace_descriptor(handle)
# Assign GPU memory to workspace
ws_size = 1024 * 1024 * 64 # Example: 64 MiB
d_ws = cp.cuda.alloc(ws_size)
cupauliprop.workspace_set_memory(
handle, workspace, cupauliprop.Memspace.DEVICE,
cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH, d_ws.ptr, ws_size
)
Define an observable
To start the simulation, allocate device memory for the Pauli expansions (sums of products of Pauli operators expressed as a set of unsigned integers as well as their coefficients) and initialize the input expansion with an observable (for example, \(Z_{62}\)).
import numpy as np

# Helper to encode Pauli string into packed integers (2 bits per qubit: X and Z masks)
def encode_pauli(num_qubits, paulis, qubits):
num_ints = cupauliprop.get_num_packed_integers(num_qubits)
# Packed integer format: [X_ints..., Z_ints...]
packed = np.zeros(num_ints * 2, dtype=np.uint64)
x_mask, z_mask = packed[:num_ints], packed[num_ints:]
for p, q in zip(paulis, qubits):
idx, bit = divmod(q, 64)
if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):
x_mask[idx] |= (1 << bit)
if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):
z_mask[idx] |= (1 << bit)
return packed
# 1. Allocate Device Buffers
# Define capacity (max number of Pauli strings) and allocate buffers
max_terms = 10000
num_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)
d_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order="C")
d_coef = cp.zeros(max_terms, dtype=cp.float64, order="C")
# 2. Populate Initial Observable (Z_62)
encoded_pauli = encode_pauli(num_qubits, [cupauliprop.PauliKind.PAULI_Z], [62])
# Assign the first term
d_pauli[0] = cp.array(encoded_pauli)
d_coef[0] = 1.0
# 3. Create Pauli Expansions
# Input expansion: pre-populated with our observable
expansion_in = cupauliprop.create_pauli_expansion(
handle, num_qubits,
d_pauli.data.ptr, d_pauli.nbytes,
d_coef.data.ptr, d_coef.nbytes,
cudaDataType.CUDA_R_64F,
1, 1, 1 # num_terms=1, is_sorted=True, is_unique=True
)
# Output expansion: empty initially (num_terms=0), needs its own buffers
d_pauli_out = cp.zeros_like(d_pauli)
d_coef_out = cp.zeros_like(d_coef)
expansion_out = cupauliprop.create_pauli_expansion(
handle, num_qubits,
d_pauli_out.data.ptr, d_pauli_out.nbytes,
d_coef_out.data.ptr, d_coef_out.nbytes,
cudaDataType.CUDA_R_64F,
0, 0, 0
)
Operator creation
Define quantum gates or operators, such as a Pauli rotation \(e^{-i \frac{\theta}{2} P}\).
# Create a Z-rotation gate on qubit 0
paulis = [cupauliprop.PauliKind.PAULI_Z]
qubits = [0]
gate = cupauliprop.create_pauli_rotation_gate_operator(
handle, theta, 1, qubits, paulis
)
Operator application
Apply an operator (a gate or noise-channel) to the expansion, evolving the system. Note that most applications work in the so-called Heisenberg picture , which means that the gates in the circuit are applied in reverse order to the observable. This also requires passing the adjoint
argument as True
when applying the operator.
# Get a view of the current terms in the input expansion
num_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_in)
view = cupauliprop.pauli_expansion_get_contiguous_range(
    handle, expansion_in, 0, num_terms)
# Apply gate: in_expansion -> gate -> out_expansion
cupauliprop.pauli_expansion_view_compute_operator_application(
handle, view, expansion_out, gate,
True, # adjoint?
False, False, # make_sorted?, keep_duplicates?
0, None, # Truncation strategies (optional)
workspace
)
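The reverse-order-plus-adjoint convention can be verified numerically: for a circuit U = G2·G1, the Heisenberg observable U†OU is built by conjugating with the last gate first. A small NumPy check with random single-qubit unitaries (a standalone illustration, not cuPauliProp code):

```python
import numpy as np

# Why reverse order plus adjoint: for a circuit U = G2 @ G1 (G1 acts first),
# the Heisenberg-picture observable is U^H O U = G1^H (G2^H O G2) G1,
# so the *last* gate of the circuit is applied to the observable *first*.
rng = np.random.default_rng(0)

def random_unitary(n):
    q, r = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix column phases

G1, G2 = random_unitary(2), random_unitary(2)
Z = np.diag([1.0, -1.0])

U = G2 @ G1
direct = U.conj().T @ Z @ U
stepwise = G2.conj().T @ Z @ G2          # last gate first...
stepwise = G1.conj().T @ stepwise @ G1   # ...then the first gate
print(np.allclose(direct, stepwise))     # True
```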
Expectation values
Compute the expectation value (trace with the zero state \(\langle 0 | O | 0 \rangle\)).
import numpy as np
result = np.zeros(1, dtype=np.float64)
# Compute trace
cupauliprop.pauli_expansion_view_compute_trace_with_zero_state(
handle, view, result.ctypes.data, workspace
)
Combining these methods shows that NVIDIA DGX B200 GPUs offer significant speedups over CPU-based codes. For small coefficient cutoffs, speedups of multiple orders of magnitude are observed over single-threaded Qiskit Pauli propagation on the most recent dual-socket data center CPUs.
Figure 1. cuQuantum GPU simulations for pi/4 rotations of the 127 qubit IBM utility circuit show multiple orders of magnitude speedups for a range of truncation schemes on NVIDIA DGX B200 compared to Qiskit PauliProp on an Intel Xeon Platinum 8570 CPU
cuQuantum cuStabilizer
Stabilizer simulations arise from the Gottesman-Knill theorem, which states that gates within the Clifford group (normalizer of the qubit Pauli group) can be efficiently simulated classically in polynomial time. This Clifford group is made up of CNOT, Hadamard and Phase gates (S). For this reason, stabilizer simulations have been critical for resource estimation and testing quantum error correcting codes at large scales.
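The Pauli-to-Pauli mapping that the theorem relies on can be checked directly with NumPy (a standalone illustration, independent of cuStabilizer):

```python
import numpy as np

# The Gottesman-Knill theorem rests on Clifford gates mapping Pauli operators
# to (signed) Pauli operators, so states can be tracked by their stabilizers
# instead of full state vectors.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1.0 + 0j, 1j])

# Conjugation rules: H X H^dagger = Z, H Z H^dagger = X, S X S^dagger = Y
print(np.allclose(H @ X @ H.conj().T, Z))   # True
print(np.allclose(H @ Z @ H.conj().T, X))   # True
print(np.allclose(S @ X @ S.conj().T, Y))   # True
```

A stabilizer simulator exploits exactly this: it updates a short Pauli description of the state per gate, instead of 2^n complex amplitudes.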
There are a few different approaches to building stabilizer simulators, from tableau simulators to frame simulators. cuStabilizer currently focuses on improving sampling throughput in a frame simulator.
Frame simulation focuses only on the effects of quantum noise on the quantum state. Because quantum devices are imperfect, the imperfections in circuit execution can be modeled by inserting random "noisy" gates into the circuit. If the noise-free result is known, getting the noisy result requires tracking only the difference: how the noisy gates change the circuit output.
It turns out that this effect is much easier to compute than a full circuit simulation. However, the number of possible combinations of noisy-gate insertions grows very quickly with the size of the circuit, which means that reliably modeling an error-correcting algorithm requires a large number of shots.
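The frame idea can be sketched in a few lines of NumPy: rather than simulating the noisy circuit, only the flips relative to the noise-free output are tracked. This toy example (an X error before a single measurement, with arbitrarily chosen parameters) recovers the injected error rate:

```python
import numpy as np

# Minimal sketch of the Pauli-frame idea: instead of simulating the noisy
# circuit, track only the measurement flips relative to the noise-free run.
rng = np.random.default_rng(42)
p = 0.1                    # X-error probability before measurement
num_shots = 200_000

# One bit per shot: did an X error flip this qubit's measurement?
flips = rng.random(num_shots) < p

# Noise-free outcome is 0 for every shot; the noisy outcome is the XOR of
# the noise-free outcome with the flip bit
noisy_outcomes = np.zeros(num_shots, dtype=np.uint8) ^ flips.astype(np.uint8)
observed_rate = noisy_outcomes.mean()
print(abs(observed_rate - p) < 0.01)   # True (statistical, wide tolerance)
```

Because each shot is just a row of bits updated by XORs, huge shot counts vectorize naturally, which is what makes GPU frame simulation so fast.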
For users interested in developing quantum error correcting codes, testing new decoders, or generating data for AI decoders, frame simulation is ideal. APIs are available to improve sampling and accelerate any frame simulation on NVIDIA GPUs. The cuQuantum SDK cuStabilizer library exposes C API and Python API . While the C API will provide better performance, the Python API is best for getting started, as it is more flexible and handles memory allocation for the user.
Create a circuit and apply frame simulation
cuStabilizer has two main classes involved in the simulation: Circuit and FrameSimulator. The Circuit can accept a string that contains circuit instructions, similar to the format used by the Stim CPU simulator. To create a FrameSimulator, you need to specify information about the circuit so that enough resources can be allocated.
import cuquantum.stabilizer as cust
# Circuit information
num_qubits = 5
num_shots = 10_000
num_measurements = 2
# Create a circuit on GPU
circ = cust.Circuit("""
H 0 1
X_ERROR(0.1) 1 2
DEPOLARIZE2(0.5) 2 3
CX 0 1 2 3
M 0 3
"""
sim = cust.FrameSimulator(
num_qubits,
num_shots,
num_measurements
)
sim.apply(circ)
You can reuse a simulator between different circuits, as long as the simulator has enough qubits available. The following code applies a second circuit to the state already modified by the first circuit, circ.
circ2 = cust.Circuit("""
Z_ERROR(0.01) 1 4
""")
sim.apply(circ2)
Read simulation results
The state of simulator consists of three bit-tables:
x_bits
z_bits
measurement_bits
The first two tables store the Pauli frame (similar to the cuPauliProp Pauli expansion, but in a different layout and without the weights). The third stores the difference between the noise-free measurement and the noisy measurement in each shot.
The most memory-efficient way to store the bits is to pack them into integer values. This is referred to as the "bit-packed" format, where each byte in memory stores eight significant bits. While this format is most efficient, manipulating individual bits requires extra steps in your program, and the bit-packed format does not integrate easily with the common notion of an "array," whose elements are usually assumed to occupy whole multi-byte values, such as int32.
To provide an easy representation in NumPy, cuStabilizer supports the bit_packed argument, which toggles between the two formats. If bit_packed=False, each bit is encoded in one uint8 value, thus using 8x more memory. When specifying input bit tables, the format is also important for performance, as described in the cuQuantum documentation.
# Get measurement flips
m_table = sim.get_measurement_bits(bit_packed=False)
print(m_table.dtype)
# uint8
print(m_table.shape)
# (2, 10000)
print(m_table)
# [[0 0 0 ... 0 0 0]
# [1 0 0 ... 0 1 1]]
x_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)
print(x_table.dtype)
# uint8
print(x_table.shape)
# (5, 1252)
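The bit-packed layout itself can be reproduced with NumPy's packbits/unpackbits, which may help when post-processing tables outside cuStabilizer (a standalone illustration):

```python
import numpy as np

# The bit-packed table layout can be mimicked with NumPy: 8 bits per byte,
# most significant bit first, with the final byte zero-padded.
bits = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 0], dtype=np.uint8)  # 10 bits

packed = np.packbits(bits)
print(packed.nbytes)        # 2 bytes instead of 10
print(packed[0])            # 177, i.e. 0b10110001

# Round-trip: unpack and drop the padding to recover the original bits
unpacked = np.unpackbits(packed)[:bits.size]
print(np.array_equal(unpacked, bits))   # True
```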
For easy access to the underlying Pauli frames, cuStabilizer provides a PauliTable class, which can be indexed by the shot index:
# Get pauli table
pauli_table = sim.get_pauli_table()
num_frames_print = 5
for i in range(num_frames_print):
    print(pauli_table[i])
# ...XZ
# ZXX..
# ...Z.
# .....
# ...Z.
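Each printed frame is one Pauli character per qubit. The encoding can be reproduced from the x/z bit tables with a small helper (a sketch assuming the usual convention: x-only is X, z-only is Z, both is Y, neither is '.'):

```python
def pauli_string(x_bits, z_bits):
    """Render one Pauli frame row (x/z bits per qubit) as a string like '...XZ'."""
    chars = {(0, 0): ".", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(chars[(x, z)] for x, z in zip(x_bits, z_bits))

print(pauli_string([0, 0, 0, 1, 0], [0, 0, 0, 0, 1]))  # ...XZ
```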
Leveraging the sampling API, cuStabilizer delivers drastically higher throughput than Google's Stim, the state-of-the-art stabilizer simulation code, running on the latest data center CPUs.
Surface code simulation
cuStabilizer can accept Stim circuits as input, and you can use it to simulate surface code circuits:
import stim
p = 0.001
circ_stim = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=5,
    rounds=5,
    after_clifford_depolarization=p,
    after_reset_flip_probability=p,
    before_measure_flip_probability=p,
    before_round_data_depolarization=p,
)
circ = cust.Circuit(circ_stim)
sim = cust.FrameSimulator(
    circ_stim.num_qubits,
    num_shots,
    circ_stim.num_measurements,
    num_detectors=circ_stim.num_detectors,
)
sim.apply(circ)
pauli_table = sim.get_pauli_table()
for i in range(num_frames_print):
    print(pauli_table[i])
Note that the simulation is most efficient for a large number of samples and qubits. Furthermore, the best performance is achieved when the resulting bit tables are kept on the GPU, for example by using the cupy package.
Figure 2 demonstrates the best use of cuStabilizer and expected performance on the NVIDIA B200 GPU and Intel Xeon Platinum 8570 CPU. It shows that the optimal performance for a code distance 31 is achieved at about a million shots. Users can get a 1,060x speedup for large code distances.
Figure 2. Runtime performance on surface code of different distances and 1 million shots, comparing stim plus cuStabilizer on an NVIDIA DGX B200 GPU with stim on an Intel Xeon Platinum 8570 CPU
Get started with new cuQuantum libraries
The latest functionality in cuQuantum continues to push the bounds of what is possible with GPU-based quantum computer emulation, enabling two new major classes of workloads. These workloads are critical for quantum error correction, verification and validation, and algorithm engineering for intermediate- to large-scale quantum devices.
Get started with cuQuantum cuPauliProp using pip install cupauliprop-cu13. To learn more, review the cuPauliProp documentation.
Get started with cuQuantum cuStabilizer using pip install custabilizer-cu13. To learn more, review the cuStabilizer documentation.
About the Authors
About Tom Lubowe
Tom Lubowe is the product manager for quantum libraries at NVIDIA. Prior to joining, he led product focused on quantum computing, machine learning, and tensor networks for materials design at GenMat. Tom also worked at Xanadu and Rigetti in product management, product operations, and business development roles. Before that, he started a quantum machine learning company, Everettian Technologies, after working on FinTech products at SEI Investments.
About Benedikt Kloss
Benedikt Kloss is a senior math libraries engineer at NVIDIA. He holds a PhD in Chemical Physics from Columbia University and worked as a postdoctoral researcher at the Center for Computational Physics at the Flatiron Institute. Before joining NVIDIA, his research was focused on quantum dynamics and tensor network state methods.
About Danylo Lykov
Danylo Lykov is a senior math libraries engineer at NVIDIA. He holds a PhD in Computer Science from the University of Chicago. He has worked at quantum research groups at Argonne National Laboratory and JPMorgan Chase, among others. His work focuses on high-performance classical simulation of quantum computers and its use in designing and analyzing near-term quantum algorithms.
About Tyson Jones
Tyson Jones is a senior math libraries engineer at NVIDIA. He holds a PhD in Material Science from the University of Oxford and worked as a postdoctoral and visiting researcher between the Swiss Federal Technology Institute of Lausanne, and the University of Osaka. Tyson was also a scientific software developer for Quantum Motion and the UK National Quantum Computing Centre, and specializes in parallel, distributed, classical simulation of quantum computers and near-term quantum algorithms.
About Daniel Lowell
Daniel Lowell is a senior engineering manager at NVIDIA, overseeing the development of cuQuantum SDK. He has a physics BS from the University of Colorado Boulder and an MS in computer science from Texas State University. Daniel has spent his career working in HPC and GPU-accelerated computing, and before working in quantum computing, managed the development of AMD’s deep learning primitives library, MIOpen.
Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog
nvidia_dev_blog | 16.12.2025 21:00 | 0.659
| Embedding sim. | 0.7317 |
| Entity overlap | 0.0588 |
| Title sim. | 0.2463 |
| Time proximity | 0.9762 |
| NLP type | other |
| NLP organization | NVIDIA |
| NLP topic | large language models |
| NLP country | |
Open original
For machine learning engineers deploying LLMs at scale, the equation is familiar and unforgiving: as context length increases, attention computation costs explode. Whether you’re dealing with retrieval-augmented generation (RAG) pipelines, agentic AI workflows, or long-form content generation, the \(O(N^2)\) complexity of attention remains a primary bottleneck.
This post explains a technique known as Skip Softmax, a hardware-friendly, drop-in sparse attention method that accelerates inference without any retraining. Read on to learn how Skip Softmax delivers up to 1.4x faster time-to-first-token (TTFT) and up to 1.4x faster time-per-output-token (TPOT), and how to get started with the technique in NVIDIA TensorRT LLM.
How does Skip Softmax work?
At its core, Skip Softmax provides a dynamic way to prune attention blocks. This is possible as it exploits a fundamental property of the Softmax function: \(\exp(\text{small negative number}) \approx 0\).
In standard FlashAttention, the GPU computes attention scores (logits) for blocks of queries (\(Q\)) and keys (\(K\)). It then applies softmax to normalize these scores into probabilities (\(P\)) and multiplies them by values (\(V\)).
However, attention is intrinsically sparse. For many blocks, the attention scores are so low compared to the dominant tokens that their contribution to the final output is statistically negligible. Skip Softmax modifies the FlashAttention loop to detect these blocks early and simply skip them.
The Skip Softmax algorithm
Implemented directly within the FlashAttention kernel, the logic follows this heuristic:
Compute local max: Calculate the maximum logit for the current block (\(Q \cdot K^T\)).
Compare to running max: Check if the difference between the current block’s local max (\(m_i^{(j)}\)) and the running global max (\(m_i^{(j-1)}\)) exceeds a calibrated threshold (\(\lambda\)).
Skip: If the condition is met, the kernel skips the softmax and BMM2 calculation for that block and, crucially, skips loading the \(V\) block from High Bandwidth Memory (HBM).
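The three steps above can be sketched in NumPy for a single query using the standard online-softmax recurrence (an illustrative sketch only; in TensorRT LLM this logic lives inside the fused FlashAttention kernel, and lam stands in for the calibrated threshold \(\lambda\)):

```python
import numpy as np

def skip_softmax_attention(q, K, V, lam, block=4):
    """One query attending over K/V in blocks, skipping negligible blocks."""
    out = np.zeros_like(V[0], dtype=np.float64)
    denom, m_run = 0.0, -np.inf
    skipped = 0
    for s in range(0, len(K), block):
        logits = K[s:s + block] @ q              # Q.K^T for this block
        m_loc = logits.max()                     # 1. local max
        if m_run - m_loc > lam:                  # 2. compare to running max
            skipped += 1                         # 3. skip softmax, BMM2, V load
            continue
        m_new = max(m_run, m_loc)
        scale = np.exp(m_run - m_new) if np.isfinite(m_run) else 0.0
        p = np.exp(logits - m_new)               # unnormalized probabilities
        out = out * scale + p @ V[s:s + block]
        denom = denom * scale + p.sum()
        m_run = m_new
    return out / denom, skipped

# Toy data where the second block is dominated and gets skipped.
q = np.ones(2)
K = np.array([[5., 5.], [5., 5.], [-50., -50.], [-50., -50.]])
V = np.array([[1., 0.], [0., 1.], [9., 9.], [9., 9.]])
out, skipped = skip_softmax_attention(q, K, V, lam=10.0, block=2)
print(out, skipped)
```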
What are the benefits of using Skip Softmax?
Skip Softmax offers drop-in compatibility, hardware efficiency, flexibility, and versatility.
Unlike approaches that need specific architectural modifications (such as Linear Attention), Skip Softmax is compatible with existing pretrained models that use standard attention mechanisms like MHA, GQA, or MLA. It is optimized to leverage the specific tensor core and memory hierarchy of NVIDIA Hopper and NVIDIA Blackwell GPUs. It can also be integrated with other optimization methods. For instance, combining XAttention during prefill with Skip Softmax during decoding has been shown to deliver substantial speed improvements without compromising accuracy.
Skip Softmax is versatile because it addresses bottlenecks in both the prefill and decode phases. Based on performance data on Hopper and Blackwell architectures, Skip Softmax is beneficial during bandwidth-bound decoding and compute-bound prefilling, especially in long-context scenarios.
Bandwidth-bound decoding
During the generation (decode) phase, LLM inference is typically bound by memory bandwidth. The GPU spends more time moving KV cache data than computing.
Benefit: By identifying unimportant blocks early, Skip Softmax avoids loading the associated \(V\) blocks entirely.
Data: On Llama 3.3 70B (NVIDIA GB200 NVL72), Skip Softmax achieves a projected 1.36x end-to-end speedup during decoding.
Compute-bound prefilling
During the prefill phase (processing the input prompt), the system is compute-bound.
Benefit: Skipping the softmax and the second matrix multiplication (BMM2) saves significant FLOPs.
Data: For the same Llama 3.3 70B model (NVIDIA GB200 NVL72), prefill sees an estimated 1.4x end-to-end speedup at 128K context length.
Long-context scenarios
The efficacy of Skip Softmax increases with sequence length. The threshold for skipping is mathematically related to the context length (\(L\)) by the relationship \(\text{Threshold} \propto 1/L\). This means that, as context grows, the opportunity to safely identify and skip sparse blocks increases.
The tradeoff between accuracy and sparsity
The obvious question for any approximation technique is, “How does this approach impact accuracy?”
Extensive testing on the RULER (synthetic long-context) and LongBench (realistic long-context) benchmarks suggests a clear “safe zone” for sparsity.
Safe zone: A 50% sparsity ratio (skipping half the blocks) is observed to be the safe zone. In tests with Llama 3.1 8B and Qwen3-8B, running at ~50% sparsity resulted in near-lossless accuracy across most tasks.
Danger zone: Pushing sparsity beyond 60% often leads to sharp accuracy drops, particularly in complex “needle-in-a-haystack” multikey tasks.
Long generation: For tasks requiring long output generation such as MATH-500, Skip Softmax maintains accuracy parity with dense attention, unlike some static KV cache compression methods.
| Model | Dataset | Sparsity | Accuracy delta versus baseline |
| Llama 3.1 8B | RULER-16K | ~50% at prefill stage | -0.19% |
| Qwen-3-8B | MATH500 | ~50% at decode stage | 0.36% |
Table 1. Accuracy delta versus baseline without sparsity
| Scenario | Threshold | Speedup (BF16) | Baseline accuracy | Sparse accuracy | Accuracy delta |
| Context only | 0.2 | 130.63% | 37.21% | 36.74% | -0.47% |
| Context plus generation | 0.6 | 138.37% | 35.81% | 34.42% | -1.39% |
Table 2. Speedup with the Qwen3-30B-Instruct model at a 128K sequence length
Additional optimizations while deploying include the following:
Automated calibration procedures to determine the optimal thresholds for target sparsity levels.
Sparsity-aware training to make models more robust to sparse attention patterns.
Get started with Skip Softmax in NVIDIA TensorRT LLM
Skip Softmax Attention is integrated directly into NVIDIA TensorRT LLM and supported on NVIDIA Hopper and NVIDIA Blackwell data center GPUs. This enables you to further accelerate the attention computation, on the basis of the state-of-the-art LLM inference performance powered by TensorRT LLM.
Skip Softmax Attention can be enabled through the sparse attention configuration of the LLM API:
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig
sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0)
# Additionally, the threshold_scale_factor for prefill and decode could be separately configured.
sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor={"prefill": 1000.0, "decode": 500.0})
llm = LLM(
model="Qwen/Qwen3-30B-A3B-Instruct-2507",
sparse_attention_config=sparse_attention_config,
# Other LLM arguments...
)
The actual threshold value equals the threshold_scale_factor divided by the context length.
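For example, with the threshold_scale_factor of 1000.0 used above, the effective threshold at a 128K context works out to:

```python
# threshold = threshold_scale_factor / context_length (per the rule above)
threshold_scale_factor = 1000.0
context_length = 128 * 1024
threshold = threshold_scale_factor / context_length
print(threshold)  # 0.00762939453125
```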
The configuration can also be specified through the extra LLM API options YAML file. An example that launches an OpenAI-compatible endpoint is shown below:
cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
algorithm: skip_softmax
threshold_scale_factor: 1000.0
EOF
# Additionally, the threshold_scale_factor for prefill and decode could be separately configured.
cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
algorithm: skip_softmax
threshold_scale_factor:
prefill: 1000.0
decode: 500.0
EOF
trtllm-serve Qwen/Qwen3-30B-A3B-Instruct-2507 --extra_llm_api_options extra_llm_api_options.yaml
Learn more
To learn more, see BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding . Skip Softmax Attention is supported in TensorRT LLM. For more details, see Accelerating Long-Context Inference with Skip Softmax Attention . The sparse attention kernels are also available in FlashInfer . The calibration will be supported by NVIDIA Model Optimizer , which enables users to specify the target sparsity and reach the desired threshold scale factors.
About the Authors
About Laikh Tewari
Laikh Tewari is part of the AI Platform Software group at NVIDIA where he manages products for optimizing LLM inference performance. Laikh received his B.S. and M.S. in computer science from Stanford University where he specialized in systems and AI.
About Kai Xu
Kai Xu is a senior engineer with the Deep Learning Algorithm and Software team at NVIDIA, specializing in optimizing inference efficiency for generative AI. He was an early engineer at OmniML prior to its acquisition by NVIDIA. He received his Ph.D. in Computer Engineering from Arizona State University.
About Bo Li
Bo Li is a senior DevTech Compute engineer at NVIDIA, working on accelerating AI at scale. His current focus is efficient LLM inference, spanning from low-level GPU optimization to system design. He is also experienced with generative AI modeling and computer graphics. He received his master's degree in Computer Science from ETH Zurich, and his bachelor's from Peking University.
About Fred Oh
Fred is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek and the NBA Warriors.
Optimizing Semiconductor Defect Classification with Generative AI and Vision Foundation Models | NVIDIA Technical Blog
nvidia_dev_blog | 17.12.2025 02:00 | 0.648
| Embedding sim. | 0.7359 |
| Entity overlap | 0.0263 |
| Title sim. | 0.2468 |
| Time proximity | 0.812 |
| NLP type | product_launch |
| NLP organization | NVIDIA |
| NLP topic | computer vision |
| NLP country | |
Open original
In the heart of every modern electronic device lies a silicon chip, built through a manufacturing process so precise that even a microscopic defect can determine success or failure. As semiconductor devices grow more complex, reliably detecting and classifying defects has become a critical bottleneck.
Historically, chipmakers have relied on convolutional neural networks (CNNs) to automate defect classification (ADC). But as manufacturing scales and diversifies, CNN-based approaches are hitting their limits, requiring large labeled datasets, frequent retraining, and still struggling to generalize across new defect types.
In this post, we show how generative AI-powered ADC can overcome these challenges.
The workflows below leverage NVIDIA Metropolis vision language models (VLMs), vision foundation models (VFMs), and the NVIDIA TAO fine-tuning toolkit to modernize defect classification. We outline the limitations of traditional CNN-based systems, detail how VLMs and VFMs address them, and highlight specific approaches and manufacturing challenges they help solve.
The limits of CNNs in semiconductor defect classification
CNNs have long been the backbone of defect detection in semiconductor fabs, supporting optical and e-beam inspection, lithographic analysis, and more. They excel at extracting visual features from large datasets, but manufacturers face persistent challenges related to data requirements, semantic understanding, and retraining.
High data requirements
Achieving high accuracy often requires thousands of labeled images per defect class. Rare or emerging defects frequently lack sufficient examples for effective training.
Limited semantic understanding
While CNNs capture visual features, they cannot interpret context, perform root-cause analysis, or integrate multimodal data. They also struggle to differentiate visually similar yet operationally distinct defect patterns, such as center vs. local defects.
Frequent retraining
Real-world manufacturing is dynamic. Process variations, new tools, and evolving product lines require models to be retrained frequently to recognize new defect types and imaging conditions.
These limitations force fabs to rely on manual inspection, which is costly, inconsistent, and unable to scale with today’s manufacturing throughput.
Modernizing ADC with VLMs and VFMs
To address these challenges, NVIDIA applies VLMs, VFMs, and self-supervised learning across multiple stages of semiconductor manufacturing. Figure 1 illustrates how these models are deployed across front-end-of-line (FEOL) and back-end packaging processes.
In this post, we demonstrate how VLMs classify wafer map images and how VFMs classify die-level images, including optical, e-beam, and back-end optical microscopy (OM) inspection data. With further training, VLMs also show strong potential for die-level inspection.
Figure 1. Examples of different image types that can potentially be used for an automatic defect classification (ADC) system enhanced with vision language models (VLMs) and vision foundation models (VFMs). These include wafer defect maps and various die-level defects found in optical, e-beam, and optical microscopy (OM) images.
Wafer-level intelligence with VLMs
Wafer maps provide a spatial view of defect distributions across an entire wafer. VLMs combine advanced image understanding with natural language reasoning. After fine-tuning, NVIDIA reasoning VLMs, such as Cosmos Reason, can interpret wafer map images to identify macro defects, generate natural language explanations, perform interactive Q&A, and compare test images against “golden” references for preliminary root-cause analysis.
Figure 2. The left side showcases how Cosmos Reason VLM can automatically classify this as a center ring wafer defect and attribute it to chemical contamination. The right side shows how auto-labeling methods accelerate the training process and help to streamline defect analysis and reduce manual visual inspection efforts.
Using this approach offers several advantages:
Few-shot learning: VLMs can be fine-tuned with only a small number of labeled examples, enabling rapid adaptation to new defect patterns, process changes, or product variations.
Explainability: As shown in Figure 2, Cosmos Reason produces interpretable results that engineers can interact with using natural language. For example, asking “What is the primary defect pattern in this wafer map?” might return “Center ring defect detected, likely due to chemical contamination.” This semantic reasoning ability goes beyond CNNs, helping engineers quickly identify potential root causes, accelerate corrective actions, and reduce the volume of manual reviews.
Automated data labeling: VLMs can generate high-quality labels for downstream ADC tasks, reducing the time and cost of model development. In practice, this approach can cut model build times by up to 2x compared to manual labeling workflows.
Time-series and lot-level analysis: VLMs can process both still images and video sequences, enabling them to proactively monitor process anomalies over time and mitigate errors before they lead to critical failures. In one study, VLMs achieved high accuracy across both OK and NG cases, outperforming traditional CNN-based methods.
Figure 3. The end-to-end workflow for fine-tuning the Cosmos Reason 1 model, covering data preparation, supervised fine-tuning on the curated dataset, and subsequent quantization and deployment for inference.
Getting started with Cosmos Reason
Here’s a sample workflow to fine-tune Cosmos Reason 1—from data preparation to supervised fine-tuning and evaluation on a prepared dataset of wafer map defects.
Go to the Cosmos Cookbook Wafer Map Anomaly Classification
Create a sample training dataset: Download the open WM-811k Wafermap dataset produced by Mir Lab which is available for public use. Generate a sample dataset and respective annotations with the provided scripts in the cookbook.
Post-train with supervised fine-tuning (SFT): Follow the installation instructions provided in the cosmos-reason1 GitHub repository and install the cosmos-rl package to enable fine-tuning with the curated training dataset.
Deploy
Result: Fine-tuning Cosmos Reason on wafer map defect classification data boosts accuracy from zero-shot levels to over 96% on defect classification tasks.
Die-level precision with VFMs and self-supervised learning
The semiconductor industry continues to push the boundaries of physics as device features shrink to microscopic scales. At this level, manufacturing complexity rises dramatically. Even the slightest anomaly—a stray particle, pattern deviation, or material defect—can render a chip unusable, directly affecting yield and profitability. In this high-stakes environment, the biggest bottleneck is the ability to rapidly and accurately detect and classify defects. CNNs have supported this workflow for years, but they struggle to keep pace with the growing complexity and data demands of modern fabs.
A core challenge in training AI models for manufacturing is the dependence on large, meticulously labeled datasets. Dynamic processes, evolving product lines, and the continual emergence of new defect types make it impractical to maintain a perfectly labeled dataset. Compounding the issue, datasets are often highly imbalanced—normal samples vastly outnumber defective ones.
Using a leading VFM such as NV-DINOv2 provides advantages, including:
Self-supervised learning (SSL): NV-DINOv2 is trained on millions of unlabeled images, enabling it to generalize to new defect types and process conditions with minimal retraining when labeled data is scarce.
Robust feature extraction: The model captures both fine-grained visual details and high-level semantic information, improving classification accuracy across diverse manufacturing scenarios.
Operational efficiency: By reducing dependence on labeling and frequent retraining, NV-DINOv2 streamlines the deployment and maintenance of defect-inspection systems in fast-moving fab environments.
However, general foundation models like NV-DINOv2 lack the domain-specific detail required for industrial tasks such as interpreting e-beam and optical microscopy images. To achieve maximum accuracy, the model must be specialized through domain adaptation.
This is a multi-stage workflow:
General VFM: Begin with the powerful, pre-trained NV-DINOv2 model that has broad visual understanding learned from large, diverse datasets.
Domain adaptation: Fine-tune the model using a large, unlabeled, domain-specific dataset, such as millions of images from semiconductor fabs, to align it with industrial imaging characteristics.
Downstream task fine-tuning: Apply a small set of labeled images to fine-tune the model for a specific classification task, a step known as linear probing.
Figure 4. The three-phase NV-DINOv2 workflow for building domain-adapted vision foundation models. Phase 1 (by NVIDIA) provides the general pre-trained model; Phases 2 and 3 (by users) enable domain adaptation and task-specific fine-tuning with minimal labeled data.
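Conceptually, the linear-probing step in Phase 3 freezes the backbone and trains only a lightweight classification head on its features. A toy NumPy sketch (a random projection stands in for the frozen NV-DINOv2 backbone; all sizes and data are synthetic, not the TAO implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, in_dim, feat_dim = 3, 32, 64

# Stand-in for the frozen, domain-adapted backbone: fixed weights, never updated.
W_frozen = rng.normal(size=(in_dim, feat_dim))
def backbone(x):
    return np.tanh(x @ W_frozen)

# Small labeled downstream set (synthetic, linearly separable by construction).
X = rng.normal(size=(300, in_dim))
W_true = rng.normal(size=(feat_dim, num_classes))
y = np.argmax(backbone(X) @ W_true, axis=1)

# Linear probe: train only the head with softmax cross-entropy gradient descent.
feats = backbone(X)                      # computed once; backbone stays frozen
onehot = np.eye(num_classes)[y]
W_head = np.zeros((feat_dim, num_classes))
for _ in range(500):
    logits = feats @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W_head -= 0.5 * feats.T @ (p - onehot) / len(X)

acc = (np.argmax(feats @ W_head, axis=1) == y).mean()
print(f"linear-probe training accuracy: {acc:.2f}")
```

Because only the small head is trained, a few hundred labeled samples suffice, which is why this step is cheap compared to full fine-tuning.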
The effectiveness of this process depends heavily on the size and quality of the unlabeled domain dataset. These datasets can range from less than a million images to hundreds of millions, but quantity alone is not enough. A meticulous data-cleaning pipeline is essential to remove redundant, blurry, or irrelevant images before training begins.
This domain-adaptation approach delivers significant performance gains. In one study by a leading semiconductor manufacturer, the NVIDIA TAO Toolkit was used to apply self-supervised learning (SSL) to NV-DINOv2 using unlabeled images collected across multiple layers of the chip-production process. Incorporating SSL consistently improved performance, boosting accuracy by up to 8.9% compared to a model trained without SSL which led to productivity gains of up to 9.9%.
Getting started with NV-DINOv2 and SSL
The following is an end-to-end workflow to fine-tune NV-DINOv2 using SSL, from data preparation and domain adaptation to downstream task fine-tuning and deployment. In this example, we use the NVIDIA TAO Toolkit to perform SSL on unlabeled PCB images for defect classification.
The NV-DINOv2 workflow follows a progressive, three-phase approach that maximizes the value of large unlabeled datasets while reducing the need for manual annotation to only a few hundred labeled samples.
1. Set up your environment: Download the NVIDIA TAO Toolkit 6.0 container from NVIDIA NGC which has all dependencies pre-installed:
# Pull the TAO Toolkit 6.0 container from NGC
docker pull nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt
# Run the container with GPU support
docker run --gpus all -it -v /path/to/data:/data \
nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt /bin/bash
2. Prepare your dataset: NV-DINOv2 accepts RGB images in standard formats (JPG, PNG, BMP, TIFF, WebP) stored in a single directory. For SSL domain adaptation, you only need unlabeled images; no annotations are required.
In our PCB inspection example, we used:
~400 labeled test samples for evaluation
~One million unlabeled PCB images for domain adaptation
~600 labeled training samples for downstream fine-tuning
Organize your data as follows:
/data/
├── unlabeled_images/ # For SSL domain adaptation
├── train_images/ # For downstream fine-tuning
│ ├── OK/
│ ├── missing/
│ ├── shift/
│ ├── upside_down/
│ ├── poor_soldering/
│ └── foreign_object/
└── test_images/ # For evaluation
Data cleaning best practice : Before training, perform a meticulous data cleaning process to remove redundant, blurry, or irrelevant images. The effectiveness of domain adaptation depends heavily on the quality of your unlabeled dataset.
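One common heuristic for the "blurry" filter in such a cleaning pipeline is the variance of a Laplacian-filtered image: low variance means few sharp edges. A minimal NumPy sketch (the threshold value is illustrative and would be tuned per dataset):

```python
import numpy as np

LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)

def laplacian_variance(img):
    """Sharpness score: variance of the (valid-region) Laplacian response."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

def is_blurry(img, threshold=100.0):   # threshold is illustrative, tune per dataset
    return laplacian_variance(img) < threshold

sharp = np.indices((32, 32)).sum(axis=0) % 2 * 255.0   # checkerboard: many edges
flat = np.full((32, 32), 128.0)                        # featureless: no edges
print(is_blurry(sharp), is_blurry(flat))               # False True
```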
3. Configure the training specification: Create a YAML specification file that defines your model architecture, dataset paths, and training parameters:
model:
backbone:
teacher_type: "vit_l"
student_type: "vit_l"
patch_size: 14
img_size: 518
drop_path_rate: 0.4
head:
num_layers: 3
hidden_dim: 2048
bottleneck_dim: 384
dataset:
train_dataset:
images_dir: /data/unlabeled_images
test_dataset:
images_dir: /data/test_images
batch_size: 16
workers: 10
transform:
n_global_crops: 2
global_crops_scale: [0.32, 1.0]
global_crops_size: 224
n_local_crops: 8
local_crops_scale: [0.05, 0.32]
local_crops_size: 98
train:
num_gpus: 8
num_epochs: 100
checkpoint_interval: 10
precision: "16-mixed"
optim:
optim: "adamw"
clip_grad_norm: 3.0
4. Run SSL training for domain adaptation: Execute the training using TAO Launcher to adapt the general NV-DINOv2 model to your domain-specific images:
tao model nvdinov2 train \
-e /path/to/experiment_spec.yaml \
results_dir=/output/ssl_training \
train.num_gpus=8 \
train.num_epochs=100
5. Perform downstream task fine-tuning: After SSL domain adaptation, fine-tune the model for your specific classification task using a small labeled dataset. This step, known as linear probing, requires only a few hundred labeled samples:
tao model nvdinov2 train \
-e /path/to/finetune_spec.yaml \
train.pretrained_model_path=/output/ssl_training/model.pth \
dataset.train_dataset.images_dir=/data/train_images \
train.num_epochs=50
6. Run inference: Evaluate your domain-adapted model on test images:
tao model nvdinov2 inference \
-e /path/to/experiment_spec.yaml \
inference.checkpoint=/output/ssl_training/model.pth \
inference.gpu_ids=[0] \
inference.batch_size=32
7. Export to ONNX for deployment: Export your trained model to ONNX format for production deployment:
tao model nvdinov2 export \
-e /path/to/experiment_spec.yaml \
export.checkpoint=/output/ssl_training/model.pth \
export.onnx_file=/output/nvdinov2_domain_adapted.onnx \
export.opset_version=12 \
export.batch_size=-1
The exported ONNX model can be deployed using NVIDIA TensorRT for optimized inference or integrated into an NVIDIA DeepStream pipeline for real-time visual inspection.
Results: NV-DINOv2 fine-tuned with SSL through NVIDIA TAO can also be applied to PCB inspection. Using approximately one million unlabeled images for SSL-based industrial domain adaptation, plus 600 training and 400 testing samples for downstream fine-tuning, defect detection accuracy jumped from 93.84% with the general model to 98.51%. By eliminating the need for labeling and frequent retraining, NV-DINOv2 streamlines the deployment of defect inspection solutions in fast-moving fab environments.
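The jump from 93.84% to 98.51% is easier to appreciate as an error-rate reduction, a quick check:

```python
# Accuracies reported for the general vs. domain-adapted model.
base_acc, adapted_acc = 93.84, 98.51
base_err, adapted_err = 100 - base_acc, 100 - adapted_acc   # 6.16% vs 1.49%
reduction = (base_err - adapted_err) / base_err             # relative error cut
print(f"error rate {base_err:.2f}% -> {adapted_err:.2f}%: "
      f"{reduction:.0%} fewer misclassifications")
```

Roughly three out of four errors made by the general model are eliminated after domain adaptation, which matters more in a fab than the raw 4.67-point accuracy delta suggests.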
Paving the way to a smart fab
These applications of vision models deliver immediate accuracy gains and lay the foundation for agentic AI systems within the fab. By combining accelerated computing with generative AI, NVIDIA and leading foundries are introducing new ADC workflows that have the potential to redefine yield improvement and process control in advanced manufacturing.
By streamlining defect analysis across the semiconductor production flow, generative AI significantly reduces model deployment time. Its few-shot learning capabilities simplify ongoing model maintenance, improve robustness, and make it easy to fine-tune models for different fab environments.
With fabs generating millions of high-resolution images daily from a wide range of inspection tools, automated ADC systems are expected to further improve classification accuracy, reduce human workload, and elevate overall productivity.
Beyond defect inspection, semiconductor manufacturers are beginning to adopt video analytics AI agents built using the NVIDIA Blueprint for Video Search and Summarization (VSS). These agents help monitor plant operations, enhance worker safety, and improve compliance with PPE and safety protocols across manufacturing sites.
Next steps
To learn more, try NV-DINOv2 and state-of-the-art NVIDIA VLMs like Cosmos Reason. For technical questions, please visit the forum.
Stay up to date by subscribing to our newsletter and following NVIDIA AI on LinkedIn, Instagram, X, and Facebook. Explore our YouTube channel, and join the NVIDIA Developer vision AI forum.
Tags: Computer Vision / Video Analytics | Manufacturing | Cosmos | Metropolis | TAO Toolkit | Beginner Technical | Tutorial | featured | Hardware / Semiconductor | VLMs
About the Authors
About Tim Lin
Tim Lin is a senior manager at NVIDIA, focused on applying AI to industrial applications, particularly in the semiconductor sector. He has over 10 years of experience in the semiconductor industry and holds a PhD in computer science and engineering from The University of Texas at Arlington.
About HJ Chen
HJ is a senior manager at NVIDIA, leading a Taiwan-based team contributing to the Metropolis/TAO effort for developing CV foundation models and driving the adoption of VLMs for industrial applications. HJ holds his bachelor’s and master's degrees in engineering from National Taiwan University.
About Po Chun Lai
Po Chun Lai is a senior solution architect at NVIDIA responsible for AI technology adoption. He has experience working across multiple industries, and holds a master's degree in computer science from National Yang Ming Chiao Tung University.
About Yiyi Wang
Yiyi Wang is a senior business development leader at NVIDIA, where she drives global business development strategy for the semiconductor vertical, building high-impact partnerships and transforming semiconductor manufacturing with AI and Accelerated Computing. Yiyi holds a PhD in Physics from Boston University.
About Anita Chiu
Anita is an AI software engineer at NVIDIA, where she focuses on transforming NVIDIA’s AI models into practical, scalable solutions for industrial applications. She holds both a bachelor’s and a master’s degree in engineering from National Tsing Hua University.
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator
Published December 17, 2025
Authors: Seph Mard, Isabel Hulseman, Besmira Nushi, Piotr Januszewski, Grzegorz Chlebus, Vivienne Zhang, Wojciech Prazuch, Pablo Ribalta, Nik Spirin, and Ferenc Galko (NVIDIA)
It has become increasingly challenging to assess whether a model’s
reported improvements reflect genuine advances or variations in
evaluation conditions, dataset composition, or training data that
mirrors benchmark tasks. The NVIDIA Nemotron approach to openness
addresses this by publishing transparent and reproducible evaluation
recipes that make results independently verifiable.
NVIDIA released Nemotron 3 Nano 30B
A3B
with an explicitly open evaluation approach to make that distinction
clear. Alongside the model card, we are publishing the complete
evaluation recipe used to generate the results, built with the
NVIDIA NeMo
Evaluator library, so
anyone can rerun the evaluation pipeline, inspect the artifacts, and
analyze the outcomes independently.
We believe that open innovation is the foundation of AI progress. This
level of transparency matters because most model evaluations omit
critical details. Configs, prompts, harness versions, runtime settings,
and logs are often missing or underspecified, and even small differences
in these parameters can materially change results. Without a complete
recipe, it’s nearly impossible to tell whether a model is genuinely
more intelligent or simply optimized for a benchmark.
This blog shows developers exactly how to reproduce the evaluation
behind Nemotron 3 Nano 30B
A3B
using fully open tools, configurations, and artifacts. You’ll learn how
the evaluation was run, why the methodology matters, and how to execute
the same end-to-end workflow using the NeMo Evaluator library so you can
verify results, compare models consistently, and build transparent
evaluation pipelines of your own.
Building a consistent and transparent evaluation workflow with NeMo Evaluator
A single, consistent evaluation system
Developers and researchers need evaluation workflows they can rely on,
not one-off scripts that behave differently from model to model. NeMo
Evaluator provides a unified way to define benchmarks, prompts,
configuration, and runtime behavior once, then reuse that methodology
across models and releases. This avoids the common scenario where the
evaluation setup quietly changes between runs, making comparisons over
time difficult or misleading.
Methodology independent of inference setup
Model outputs can vary by inference backend and configuration, and locking an evaluation tool to one inference solution would limit its usefulness. NeMo Evaluator avoids this by separating the evaluation pipeline from the inference backend, allowing the same configuration to run against hosted endpoints, local deployments, or third-party providers. This separation enables meaningful comparisons even when you change infrastructure or inference engines.
Built to scale beyond one-off experiments
Many evaluation pipelines work once and then break down as the scope
expands. NeMo Evaluator is designed to scale from quick,
single-benchmark validation to full model card suites and repeated
evaluations across multiple models. The launcher, artifact layout, and
configuration model support ongoing workflows, not just isolated
experiments, so teams can maintain consistent evaluation practices over
time.
Auditability with structured artifacts and logs
Transparent evaluation requires more than final scores. Each evaluation run produces structured results and logs by default, making it easy to inspect how scores were computed, debug unexpected behavior, and conduct deeper analysis. Every component of the evaluation is captured and reproducible.
A shared evaluation standard
By releasing Nemotron 3 Nano 30B A3B with its full evaluation recipe, NVIDIA is providing a reference methodology that the community can run, inspect, and build upon. Using the same configuration and tools brings consistency to how benchmarks are selected, executed, and interpreted, enabling more reliable comparisons across models, providers, and releases.
Open evaluation for Nemotron 3 Nano
Open evaluation means publishing not just the final results, but the full methodology behind them, so benchmarks are run consistently and results can be compared meaningfully over time. For Nemotron 3 Nano 30B A3B, this includes open-source tooling, transparent configurations, and reproducible artifacts that anyone can run end-to-end.
Open-source model evaluation tooling
NeMo
Evaluator is an
open-source library designed for robust, reproducible, and scalable
evaluation of generative models. Instead of introducing yet another
standalone benchmark runner, it acts as a unifying orchestration layer
that brings multiple evaluation harnesses under a single, consistent
interface.
Under this architecture, NeMo Evaluator integrates and coordinates
hundreds of benchmarks from many widely used evaluation harnesses,
including NeMo
Skills
for Nemotron instruction-following, tool use, and agentic evaluations,
as well as the LM Evaluation Harness for base model and pre-training benchmarks, and many more (see the full benchmark catalog).
Each harness retains its native logic, datasets, and scoring semantics,
while NeMo Evaluator standardizes how they are configured, executed, and
logged.
This provides two practical advantages: teams can run diverse benchmark
categories using a single configuration without rewriting custom
evaluation scripts, and results from different harnesses are stored and
inspected in a consistent, predictable way, even when the underlying
tasks differ. The same orchestration framework used internally by
NVIDIA’s Nemotron research and model‑evaluation teams is now available
to the community, enabling developers to run heterogeneous,
multi‑harness evaluations through a shared, auditable workflow.
Open configurations
We published the exact YAML configuration used for the Nemotron 3 Nano 30B A3B model card evaluation with NeMo Evaluator. This includes:
- model inference and deployment settings
- benchmark and task selection
- benchmark-specific parameters such as sampling, repeats, and prompt templates
- runtime controls including parallelism, timeouts, and retries
- output paths and artifact layout
Using the same configuration means running the same evaluation
methodology.
Open logs and artifacts
Each evaluation run produces structured, inspectable outputs, including
per‑task results.json
files, execution logs for debugging and
auditability, and artifacts organized by task for easy comparison. This
structure makes it possible to understand not only the final scores, but
also how those scores were produced and to perform deeper analysis of
model behavior.
The reproducibility workflow
Reproducing Nemotron 3 Nano 30B A3B model card results follows a simple loop:
1. Start from the released model checkpoint or hosted endpoint
2. Use the published NeMo Evaluator config
3. Execute the evaluation with a single CLI command
4. Inspect logs and artifacts, and compare results to the model card
The same workflow applies to any model you evaluate using NeMo Evaluator. You can point the evaluation at a hosted endpoint or a local deployment, including common inference providers such as Hugging Face, build.nvidia.com, and OpenRouter. The key requirement is access to the model, either as weights you can serve or as an endpoint you can call. For this tutorial, we use the hosted endpoint on build.nvidia.com.
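Whichever provider you choose, the evaluation target is an OpenAI-compatible chat completions endpoint. A minimal sketch of the request body such an endpoint expects (an illustrative payload only, not the launcher's internal code; the prompt and sampling values are made up):

```python
import json

# Shape of an OpenAI-compatible chat completions request, as used when
# pointing NeMo Evaluator at a hosted endpoint.
payload = {
    "model": "nvidia/nemotron-nano-3-30b-a3b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 0.0,
    "max_tokens": 64,
}
body = json.dumps(payload)
```

Because the launcher only depends on this interface, swapping providers means changing the URL and API key, not the evaluation methodology.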
Reproducing Nemotron 3 Nano benchmark results
This tutorial reproduces the evaluation results for NVIDIA Nemotron 3 Nano 30B A3B using NeMo Evaluator. The step-by-step tutorial, including the published configs used for the model card evaluation, is available on GitHub. Although this tutorial focuses on Nemotron 3 Nano 30B A3B, we also published recipes for the base model evaluation.
This walkthrough runs the comprehensive evaluation suite from the published model card configs for NVIDIA Nemotron 3 Nano 30B A3B, covering the following benchmarks:
| Benchmark | Accuracy | Category | Description |
| --- | --- | --- | --- |
| BFCL v4 | 53.8 | Function Calling | Berkeley Function Calling Leaderboard v4 |
| LiveCodeBench (v6 2025-08–2025-05) | 68.3 | Coding | Real-world coding problems evaluation |
| MMLU-Pro | 78.3 | Knowledge | Multi-task language understanding (10-choice) |
| GPQA | 73.0 | Science | Graduate-level science questions |
| AIME 2025 | 89.1 | Mathematics | American Invitational Mathematics Exam |
| SciCode | 33.3 | Scientific Coding | Scientific programming challenges |
| IFBench | 71.5 | Instruction Following | Instruction following benchmark |
| HLE | 10.6 | Humanity's Last Exam | Expert-level questions across domains |
For Model Card details, see the NVIDIA Nemotron 3 Nano 30B A3B Model Card. For a deep dive into the architecture, datasets, and benchmarks, read the full Nemotron 3 Nano Technical Report.
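No single number summarizes such a heterogeneous suite, but an unweighted mean over the table above is a quick sanity check when comparing your reruns against the reference scores (illustrative only, not an official Nemotron metric):

```python
# Model card scores from the benchmark table (illustrative aggregation).
scores = {
    "BFCL v4": 53.8, "LiveCodeBench": 68.3, "MMLU-Pro": 78.3,
    "GPQA": 73.0, "AIME 2025": 89.1, "SciCode": 33.3,
    "IFBench": 71.5, "HLE": 10.6,
}
macro_avg = sum(scores.values()) / len(scores)
print(f"unweighted mean over {len(scores)} benchmarks: {macro_avg:.2f}")
```

If a rerun shifts this mean by several points, a configuration mismatch is more likely than benchmark noise; per-task comparison then pinpoints the culprit.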
1. Install NeMo Evaluator Launcher
pip install nemo-evaluator-launcher
2. Set required environment variables
# NVIDIA endpoint access
export NGC_API_KEY="your-ngc-api-key"
# Hugging Face access
export HF_TOKEN="your-huggingface-token"
# Required only for judge-based benchmarks such as HLE
export JUDGE_API_KEY="your-judge-api-key"
Optional but recommended for faster reruns:
export HF_HOME="/path/to/your/huggingface/cache"
3. Model endpoint
The evaluation uses the NVIDIA API endpoint hosted on
build.nvidia.com :
target:
  api_endpoint:
    model_id: nvidia/nemotron-nano-3-30b-a3b
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
Evaluations can be run against common inference providers such as Hugging Face, build.nvidia.com, or OpenRouter, or anywhere the model has an available endpoint.
If you're hosting the model locally or using a different endpoint:
nemo-evaluator-launcher run \
--config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
4. Run the full evaluation suite
Preview the run without executing using --dry-run:
nemo-evaluator-launcher run \
--config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
--dry-run
From the examples directory, run the evaluation using the YAML
configuration provided:
nemo-evaluator-launcher run \
--config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml
Note that for quick testing, you can limit the number of samples by setting limit_samples:
nemo-evaluator-launcher run \
--config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
-o evaluation.nemo_evaluator_config.config.params.limit_samples=10
5. Running an individual benchmark
You can run specific benchmarks using the -t
flag (from the examples/nemotron
directory):
# Run only MMLU-Pro
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_mmlu_pro
# Run only coding benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_livecodebench
# Run multiple specific benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_gpqa -t ns_aime2025
6. Monitor execution and inspect results
# Check status of a specific job
nemo-evaluator-launcher status
# Stream logs for a specific job
nemo-evaluator-launcher logs <job-id>
Results are written to the defined output directory:
results_nvidia_nemotron_3_nano_30b_a3b/
├── artifacts/
│   └── <task_name>/
│       └── results.json
└── logs/
    └── stdout.log
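Given that layout, collecting scores across tasks is a short script. A sketch using only the standard library (the exact keys inside each results.json vary by harness, so no field names are assumed here):

```python
import json
from pathlib import Path

def collect_results(root: str) -> dict:
    """Walk artifacts/<task_name>/results.json under the output directory
    and return each task's parsed payload keyed by task name."""
    results = {}
    for path in sorted(Path(root, "artifacts").glob("*/results.json")):
        results[path.parent.name] = json.loads(path.read_text())
    return results
```

A helper like this makes it easy to diff a rerun against the model card task by task instead of eyeballing individual JSON files.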
Interpreting results
When reproducing evaluations, you may observe small differences in final scores across runs. This variance reflects the probabilistic nature of LLMs rather than an issue with the evaluation pipeline. Modern evaluation introduces several sources of non-determinism, including decoding settings, repeated trials, judge-based scoring, parallel execution, and differences in serving infrastructure, all of which can lead to slight fluctuations.
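The practical consequence: treat a reproduced score as a draw from a distribution, not a fixed constant. A sketch of how a few repeated runs might be summarized (the scores below are hypothetical, stdlib only):

```python
import statistics

# Hypothetical scores from five reruns of the same benchmark.
runs = [78.1, 78.5, 78.3, 78.0, 78.6]
mean = statistics.mean(runs)
spread = statistics.stdev(runs)
print(f"{mean:.2f} +/- {spread:.2f} over {len(runs)} runs")
```

Reporting a mean with its spread makes it obvious whether a discrepancy against the model card is within normal run-to-run noise or signals a configuration difference.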
The purpose of open evaluation is not to force bit-wise identical
outputs, but to deliver methodological consistency with clear
provenance of evaluation results. To ensure your evaluation aligns with
the reference standard, verify the following:
- Configuration: use the published NeMo Evaluator YAML without modification, or document any changes explicitly
- Benchmark selection: run the intended tasks, task versions, and prompt templates
- Inference target: verify you are evaluating the intended model and endpoint, including chat template behavior and reasoning settings when relevant
- Execution settings: keep runtime parameters consistent, including repeats, parallelism, timeouts, and retry behavior
- Outputs: confirm artifacts and logs are complete and follow the expected structure for each task
When these elements are consistent, your results represent a valid
reproduction of the methodology, even if individual runs differ
slightly. NeMo Evaluator simplifies this process, tying benchmark
definitions, prompts, runtime settings, and inference configuration into
a single auditable workflow to minimize inconsistencies.
Conclusion: A more transparent standard for open models
The evaluation recipe released alongside Nemotron 3 Nano represents a
meaningful step toward a more transparent and reliable approach to
open-model evaluation. We are moving away from evaluation as a
collection of bespoke, "black box" scripts, and towards a defined system
where benchmark selection, prompts, and execution semantics are encoded
into a transparent workflow.
For developers and researchers, this transparency changes what it means
to share results. A score is only as trustworthy as the methodology
behind it, and making that methodology public is what enables the
community to verify claims, compare models fairly, and continue building
on shared foundations. With open evaluation configurations, open
artifacts, and open tooling, Nemotron 3 Nano demonstrates what that
commitment to openness looks like in practice.
NeMo Evaluator supports this shift by providing a consistent
benchmarking methodology across models, releases, and inference
environments. The objective isn’t identical numbers on every run; it’s
confidence in an evaluation methodology that is explicit, inspectable,
and repeatable. For organizations that need automated or large-scale evaluation pipelines, an enterprise-ready NeMo Evaluator microservice, built on the same evaluation principles, is also available.
Use the published NeMo Evaluator
evaluation configuration for an end-to-end walkthrough of the evaluation recipe.
Join the Community!
NeMo
Evaluator is fully open
source, and community input is essential to shaping the future of open
evaluation. If there’s a benchmark you’d like us to support or an
improvement you want to propose, open an issue, or contribute directly
on GitHub. Your contributions help strengthen the ecosystem and advance
a shared, transparent standard for evaluating generative models.