Delivering Flexible Performance for Future-Ready Data Centers with NVIDIA MGX | NVIDIA Technical Blog
Status: closed
Event type: product_launch
Topic: ai infrastructure
Organization: NVIDIA
Country: South Korea
Articles: 13
Unique sources: 2
Importance / Momentum: 2.02 / 0
Period: 15.12.2025 18:25 — 19.12.2025 17:00
Created: 06.04.2026 06:19:09
Articles in cluster: 13

S | Delivering Flexible Performance for Future-Ready Data Centers with NVIDIA MGX | NVIDIA Technical Blog
Source: nvidia_dev_blog | Published: 15.12.2025 18:25 | Score: 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:


The AI boom reshaping the computing landscape is poised to scale even faster in 2026. As breakthroughs in model capability and computing power drive rapid growth, enterprise data centers are being pushed beyond the limits of conventional server and rack architectures, creating new pressures on power budgets, thermal envelopes, and facility space. The NVIDIA MGX modular reference architecture provides forward-looking designs that enable faster time-to-market (TTM) with standardized building blocks. MGX helps system partners integrate fast-evolving technologies and deliver the flexible, energy-efficient platforms modern AI data centers require.

This post explores the next evolution in the MGX modular reference architecture: a 6U (800 mm) chassis configuration designed specifically for the next generation of accelerated compute and networking platforms, including the new liquid-cooled variant of the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.

Flexible, future-proof design with enhanced serviceability

Forward-looking compatibility and flexibility are core design principles of the MGX 6U platform. It features a single chassis that can span multiple computing generations and workload profiles. It's designed to support today's most powerful computing platforms while offering future-proof compatibility, reducing the need for disruptive redesigns over time.

Partners can design these systems with multiple MGX-based host-processor modules (HPMs), including x86 platforms and the next-generation NVIDIA Vera CPU. This enables standardizing on a single server design while supporting multiple CPU architectures and workload requirements.

Lastly, the larger chassis volume creates accessible service pathways for maintenance. Key components like network cards, power supplies, and other field-replaceable units are easy to reach, which simplifies serviceability and reduces operational overhead when managing rack-scale infrastructure.

Sustainable, efficient computing with liquid-cooled NVIDIA RTX PRO Server

The MGX 6U design is the foundation for the next wave of accelerated computing platforms, starting with a new liquid-cooled NVIDIA RTX PRO Server. This new RTX PRO Server configuration will feature eight of the latest liquid-cooled RTX PRO 6000 Blackwell Server Edition GPUs, along with advanced AI networking capability delivered by NVIDIA BlueField-3 DPUs and NVIDIA ConnectX-8 SuperNICs with built-in PCIe Gen 6 switches (Figure 1).

Figure 1. The MGX 6U system topology with eight GPUs, NVIDIA BlueField-3 DPUs, and ConnectX-8 SuperNICs with built-in PCIe Gen 6 switches

With a compact, single-slot liquid-cooled form factor, RTX PRO 6000 Blackwell delivers breakthrough performance for powering AI factories and accelerating demanding enterprise AI workloads with improved thermal efficiency. It's capable of running the full suite of NVIDIA enterprise software, including NVIDIA AI Enterprise, NVIDIA Omniverse, NVIDIA vGPU, and NVIDIA Run:ai. It provides a universal data center platform for building and deploying the next generation of AI-enabled applications, from agentic AI and physical AI to scientific computing, simulation, graphics, and video.

Additionally, the RTX PRO 6000 Blackwell Server Edition GPU is validated by more than 50 leading enterprise ISVs spanning engineering, scientific computing, and professional visualization applications, as well as the most widely adopted orchestration, management, and AI ops platforms.

Figure 2. Liquid-cooled NVIDIA RTX PRO 6000 Blackwell Server Edition GPU

High-performance AI networking with NVIDIA ConnectX

Network performance is essential to maximize the performance of AI workloads at scale. The MGX 6U reference design supports ConnectX-8 AI networking today and will support ConnectX-9 when it becomes available, delivering Ethernet and InfiniBand connectivity options to meet diverse data center and workload requirements.

The liquid-cooled RTX PRO Server, based on the MGX 6U configuration, features a streamlined system architecture that includes the latest-generation ConnectX-8 SuperNICs with integrated PCIe Gen 6 switches. Built for AI workloads, ConnectX-8 with integrated PCIe Gen 6 switches supports up to 400 Gb/s of network bandwidth per RTX PRO 6000 Blackwell GPU (based on a 2:1 GPU-to-NIC ratio). In addition to streamlining the design and reducing server complexity versus systems with dedicated PCIe switches, ConnectX-8 effectively doubles per-GPU network bandwidth. This helps remove I/O bottlenecks and speeds data movement between GPUs, NICs, and storage, resulting in up to 2x higher NCCL all-to-all performance and more scalable multi-GPU, multi-node workloads across AI factories.
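As a quick sanity check of the per-GPU figure above, here is a back-of-the-envelope sketch. The 800 Gb/s per-SuperNIC number is our assumption (ConnectX-8's headline speed); the post itself only states the 400 Gb/s result.

```python
# Back-of-the-envelope check of the per-GPU bandwidth claim.
# Assumption: each ConnectX-8 SuperNIC provides 800 Gb/s (not stated in the post).
num_gpus = 8
num_supernics = num_gpus // 2            # 2:1 GPU-to-NIC ratio
nic_bandwidth_gbps = 800                 # assumed per-SuperNIC bandwidth

per_gpu_gbps = num_supernics * nic_bandwidth_gbps / num_gpus
print(f"{per_gpu_gbps:.0f} Gb/s per GPU")  # 400 Gb/s, matching the post's figure
```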
AI runtime security and infrastructure acceleration with NVIDIA BlueField

As accelerated infrastructure grows in scale and complexity, securing every layer of the system becomes essential. The MGX 6U design features NVIDIA BlueField data processing units (DPUs) to bring zero-trust security and infrastructure acceleration directly into the data center layer. The BlueField processor offloads and accelerates functions such as line-rate encryption, micro-segmentation, and real-time threat detection, enforcing least-privilege access while preserving the host's computing resources (GPU/CPU) for AI and other modern workloads.

By isolating control and management planes in hardware, BlueField enables organizations to protect AI pipelines from emerging threats while accelerating networking, storage, and virtualization services. Enterprises can further extend these capabilities by deploying validated BlueField-accelerated applications from leading software providers, enhancing both infrastructure efficiency and cybersecurity coverage. This combination helps ensure that RTX PRO Server deployments can scale securely, with consistent performance and policy enforcement across every node in the AI factory.

Building future-ready AI factories

As NVIDIA Blackwell and future GPU generations continue to push beyond traditional computing boundaries, the NVIDIA MGX modular architecture ensures AI factories can evolve with silicon innovations. For ecosystem partners building the next generation of accelerated computing platforms, MGX reduces engineering costs, shortens time to market, and delivers multigenerational compatibility while ensuring optimal performance and efficiency for enterprises deploying AI workloads at scale.

Systems featuring the liquid-cooled NVIDIA RTX PRO 6000 Blackwell Server Edition GPU, along with liquid-cooled RTX PRO Servers based on the MGX 6U configuration, are expected to arrive from global system builders in the first half of 2026.
Real-Time Decoding, Algorithmic GPU Decoders, and AI Inference Enhancements in NVIDIA CUDA-Q QEC | NVIDIA Technical Blog
Source: nvidia_dev_blog | Published: 17.12.2025 21:32 | Score: 0.732
Embedding sim.: 0.8481
Entity overlap: 0.0732
Title sim.: 0.2095
Time proximity: 0.836
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: quantum computing
NLP country:


Real-time decoding is crucial to fault-tolerant quantum computers. By enabling decoders to operate with low latency, concurrently with a quantum processing unit (QPU), we can apply corrections to the device within the coherence time. This prevents errors from accumulating and degrading the quality of the results. We can do this online, with a real quantum device, or offline, with a simulated quantum processor.

To help solve these problems and enable research into better solutions, NVIDIA CUDA-Q QEC version 0.5.0 includes a range of improvements: support for online real-time decoding, new GPU-accelerated algorithmic decoders, infrastructure for high-performance AI decoder inference, sliding window decoder support, and more Pythonic interfaces. We'll cover all of these improvements in this post and dive into how you can use them to accelerate your quantum error correction research, or operationalize real-time decoding with your quantum computer.

Real-time decoding is real with CUDA-Q QEC

Users can perform real-time decoding in a four-stage workflow. In order, the stages are: DEM generation, decoder configuration, decoder loading and initialization, and real-time decoding.

First, we characterize how the device errors behave during operation. Using a helper function, we can generate the detector error model (DEM) from a quantum code, noise model, and circuit parameters. The function will generate a complete DEM that maps error mechanisms to syndrome patterns.

```python
# Step 1: Generate detector error model
print("Step 1: Generating DEM...")
cudaq.set_target("stim")
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.01), 1)
dem = qec.z_dem_from_memory_circuit(code, qec.operation.prep0, 3, noise)
```

The next step is to choose a decoder and configure it. We'll discuss new decoders in greater detail in the following sections. Using the DEM, the user configures the decoder and then saves this configuration to a YAML file. This file ensures that the decoders can correctly interpret the syndrome measurements.

```python
# Create decoder config
config = qec.decoder_config()
config.id = 0
config.type = "nv-qldpc-decoder"
config.block_size = dem.detector_error_matrix.shape[1]
# ...
# Check out nvidia.github.io/cudaqx/examples_rst/qec/realtime_decoding.html
# ...
```

Before circuit execution, the user loads the YAML file. CUDA-Q QEC interprets the information, sets up the appropriate implementation in the decoder, and registers it with the CUDA-Q runtime.

```python
# Save decoder config
with open("config.yaml", 'w') as f:
    f.write(config.to_yaml_str(200))
```

Now, users can begin executing quantum circuits. Inside CUDA-Q kernels, the decoding API interacts with the decoders. As the stabilizers of the logical qubits are measured, syndromes are enqueued to the corresponding decoder, which processes them. When corrections are needed, the decoder suggests operations to apply to the logical qubits.

```python
# Load config and run circuit
qec.configure_decoders_from_file("config.yaml")
run_result = cudaq.run(qec_circuit, shots_count=10)
```

GPU-accelerated RelayBP

A recently developed decoder algorithm helps solve the pitfalls of belief propagation (BP) decoders, a popular class of quantum low-density parity check algorithmic decoders. BP+OSD (Belief Propagation with Ordered Statistics Decoding) relies on a GPU-accelerated BP decoder and then uses an Ordered Statistics post-processing algorithm on the CPU; if BP fails, OSD kicks in. This works, but makes it hard to optimize and parallelize for the low latency needed to enable real-time error decoding.

RelayBP modifies BP methods with the concept of memory strengths at each node of a graph, controlling how much each node remembers or forgets past messages. This dampens or breaks the harmful symmetries that usually trap BP and prevent it from converging.

Figure 1. Peak decoding throughput (iterations/sec) for RelayBP FP32 on NVIDIA DGX GB200, measured for XYZ and XZ decoding of 1-Gross and 2-Gross quantum error-correction codes, with syndrome complexity held constant to isolate peak performance. Results collected with an optimized CUDA-Q QEC 0.6.0 build.
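To make the memory-strength idea concrete before looking at the real API, here is a toy sketch of the damped update. This is our illustration, not the CUDA-Q implementation, and the random `bp_update` is just a stand-in for a real BP sweep.

```python
# Toy illustration of the memory-strength idea (not the CUDA-Q implementation):
# a strength gamma blends each bit's previous belief into the new BP update,
# damping the oscillations that can trap plain BP. gamma = 0 recovers plain BP.
import numpy as np

rng = np.random.default_rng(42)
gamma = 0.3                        # memory strength
belief = rng.normal(size=7)        # log-likelihood ratios for 7 error bits

for _ in range(10):
    bp_update = rng.normal(size=7)                 # stand-in for a real BP sweep
    belief = gamma * belief + (1.0 - gamma) * bp_update

hard_decision = (belief < 0).astype(np.uint8)      # negative LLR -> flip the bit
print(hard_decision)
```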
Users can instantiate a RelayBP decoder with a few lines of code:

```python
import numpy as np
import cudaq_qec as qec

# Simple 3x7 parity check matrix for demonstration
H_list = [[1, 0, 0, 1, 0, 1, 1],
          [0, 1, 0, 1, 1, 0, 1],
          [0, 0, 1, 0, 1, 1, 1]]
H = np.array(H_list, dtype=np.uint8)

# Configure relay parameters
srelay_config = {
    'pre_iter': 5,                      # Run 5 iterations with gamma0 before relay legs
    'num_sets': 3,                      # Use 3 relay legs
    'stopping_criterion': 'FirstConv'   # Stop after first convergence
}

# Create a decoder with Relay-BP
decoder_relay = qec.get_decoder("nv-qldpc-decoder", H,
                                use_sparsity=True,
                                bp_method=3,
                                composition=1,
                                max_iterations=50,
                                gamma0=0.3,
                                gamma_dist=[0.1, 0.5],
                                srelay_config=srelay_config,
                                bp_seed=42)
print("Created decoder with Relay-BP (gamma_dist, FirstConv stopping)")

# Decode a syndrome
syndrome = np.array([1, 0, 1], dtype=np.uint8)
decoded_result = decoder_relay.decode(syndrome)
```

AI decoder inference

AI decoders are becoming increasingly popular for handling specific error models, offering better accuracy or latency than algorithmic decoders. Users can develop AI decoders by generating training data, training a model, and exporting the model to ONNX (see the sketch at the end of this section). Once this is complete, use the CUDA-Q QEC NVIDIA TensorRT-based AI decoder inference engine to operate low-latency AI decoders.

CUDA-Q QEC recently introduced infrastructure for integrated AI decoder inference with offline decoding. This means that it's now easy to run any AI decoder saved to an ONNX file with CUDA-Q QEC and an emulated quantum computer.

```python
import cudaq_qec as qec
import numpy as np

# Note: The AI decoder doesn't use the parity check matrix.
# A placeholder matrix is provided here to satisfy the API.
H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)

# Create TensorRT decoder from ONNX model
decoder = qec.get_decoder("trt_decoder", H, onnx_load_path="ai_decoder.onnx")

# Decode a syndrome
syndrome = np.array([1.0, 0.0, 1.0], dtype=np.float32)
result = decoder.decode(syndrome)
print(f"Predicted error: {result}")
```

We also offer a range of recommendations to reduce the initialization time by creating pre-built TensorRT engines. With ONNX files supporting a range of precisions (int8, fp8, fp16, bf16, and tf32), you can explore a range of model and hardware combinations to optimize AI decoder operationalization.
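The post doesn't show the export step itself, so here is a minimal sketch of what it could look like with a toy PyTorch model. The architecture, layer sizes, and the skipped training loop are placeholders we chose for illustration; only the `torch.onnx.export` call is the standard export mechanism.

```python
# Minimal sketch of the ONNX export step (toy model; architecture, sizes, and
# the skipped training loop are placeholders, not from the original post).
import torch
import torch.nn as nn

syndrome_len, error_len = 3, 7   # match the toy 3x7 parity check matrix above

model = nn.Sequential(
    nn.Linear(syndrome_len, 32),
    nn.ReLU(),
    nn.Linear(32, error_len),
    nn.Sigmoid(),                # per-bit error probabilities
)
# ... train on (syndrome, error) pairs generated from a noise model ...

dummy_input = torch.zeros(1, syndrome_len)
torch.onnx.export(model, dummy_input, "ai_decoder.onnx",
                  input_names=["syndrome"], output_names=["error"])
```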
Sliding window decoding

Sliding window decoders enable a decoder to handle circuit-level noise across multiple syndrome extraction rounds. These decoders process the syndrome before the complete measurement sequence is received, which can help reduce the overall latency. The tradeoff is that this can increase logical error rates. Exploring how and when to use this tool depends on the noise model, the error correcting code parameters, and the latency budget of a given quantum processor.

With the introduction of the sliding window decoder in 0.5.0, users can now perform experiments using any other CUDA-Q decoder as the "inner" decoder. Additionally, users can vary the window size with simple parameter changes.

```python
import cudaq
import cudaq_qec as qec
import numpy as np

cudaq.set_target('stim')

num_rounds = 5
code = qec.get_code('surface_code', distance=num_rounds)
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.001), 1)
statePrep = qec.operation.prep0
dem = qec.z_dem_from_memory_circuit(code, statePrep, num_rounds, noise)

inner_decoder_params = {'use_osd': True, 'max_iterations': 50, 'use_sparsity': True}
opts = {
    'error_rate_vec': np.array(dem.error_rates),
    'window_size': 1,
    'num_syndromes_per_round': dem.detector_error_matrix.shape[0] // num_rounds,
    'inner_decoder_name': 'nv-qldpc-decoder',
    'inner_decoder_params': inner_decoder_params,
}
swdec = qec.get_decoder('sliding_window', dem.detector_error_matrix, **opts)
```

Each syndrome extraction round must produce a constant number of measurements. The decoder makes no assumptions about temporal correlations or periodicity in the underlying noise, so users have maximal flexibility in investigating noise variations per round.

Getting started with CUDA-Q QEC

CUDA-Q QEC 0.5.0 brings a wide range of tools to quantum error correction researchers and QPU operators to accelerate research into operationalizing fault-tolerant quantum computers. To get started with CUDA-Q QEC, you can pip install cudaq-qec and see the CUDA-Q QEC documentation.
Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS | NVIDIA Technical Blog
Source: nvidia_dev_blog | Published: 16.12.2025 17:00 | Score: 0.731
Embedding sim.: 0.8224
Entity overlap: 0.0938
Title sim.: 0.3359
Time proximity: 0.8656
NLP type: product_launch
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country:


NVIDIA CUDA developers have access to a wide range of tools and libraries that simplify development and deployment, enabling users to focus on the "what" and the "how" of their applications. An example of this is Multi-Process Service (MPS), where users can get better GPU utilization by sharing GPU resources across processes. Importantly, this can be done transparently: applications don't need to be aware of MPS, and no code modifications are needed.

Introducing MLOPart

NVIDIA Blackwell GPUs deliver high bandwidth that is well-suited to training today's large language models. However, there are cases where applications don't benefit from the full bandwidth of Blackwell and are more latency sensitive. Memory Locality Optimized Partition (MLOPart) devices are NVIDIA CUDA devices derived from a GPU and optimized for lower latency. MLOPart is a CUDA MPS feature that enables multi-GPU-aware applications to see MLOPart devices.

In the real world, it's not always easy to determine whether an application is latency-bound or bandwidth-bound. MLOPart is designed to be enabled and disabled using the MPS controller and doesn't require an application to be rewritten. Developers can do simple A/B testing to see if an application benefits from MLOPart.

MLOPart device enumeration

The defining aspect of MLOPart is that, when it is enabled, MLOPart-capable devices appear as multiple distinct CUDA devices, each with its own compute and memory resources. In this sense, it is similar to NVIDIA Multi-Instance GPU (MIG). We'll compare MLOPart with MIG later in this post.

MLOPart creates CUDA devices based on the underlying architecture of the GPU. Where possible, CUDA devices are split along boundaries where crossing would negatively affect memory latency, with each side of the boundary contributing the memory and compute resources of an MLOPart device. For Blackwell, the split is along the die boundaries. If a GPU doesn't have such boundaries, no MLOPart devices are created, and the GPU is presented to CUDA applications normally.

NVIDIA DGX B200 and NVIDIA B300 are capable of two MLOPart devices per GPU. This number may change with future architectures, so it's recommended that developers don't hardcode assumptions about the number of MLOPart devices that a GPU will support. The sketch below shows what this enumeration looks like from an application's perspective.
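The following is a minimal sketch of the enumeration behavior described above. PyTorch is our choice here purely as a convenient CUDA device enumerator (the post doesn't use it); run it as an MPS client with and without -mlopart to see the device list change.

```python
# Enumerate CUDA devices as seen by an MPS client. With an MLOPart-enabled MPS
# server, each MLOPart-capable GPU shows up as multiple distinct CUDA devices
# (e.g., two per B200), each with its own SM count and memory partition.
# PyTorch is used here only as a convenient CUDA enumerator (our assumption).
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"device {i}: {props.name}, "
          f"{props.multi_processor_count} SMs, "
          f"{props.total_memory / 2**30:.1f} GiB")
```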
MLOPart device capabilities and characteristics

An MLOPart device shares similarities with the underlying device, with a few notable exceptions. While in principle developers don't need to rewrite applications to use MLOPart devices, they should keep in mind that MLOPart devices don't share all of the capabilities and characteristics of the underlying devices.

Capabilities and characteristics shared with the underlying device include:

- Compute capability: An MLOPart device has the same compute capability and can execute the same GPU binaries as the underlying device. For example, a device that supports MLOPart with compute capability 10.0 will have MLOPart devices that also have compute capability 10.0.
- Peer-to-peer ability: An MLOPart device is capable of the same peer-to-peer communication as the underlying device. For example, if two physical devices are connected by NVIDIA NVLink, any MLOPart devices derived from these two underlying devices will also be connected by NVLink. The exception to this rule is between MLOPart devices belonging to the same underlying device. In this case, they're still capable of peer-to-peer communication, but don't require peer-to-peer transports such as NVLink or PCIe. When peer devices are MLOPart devices belonging to the same underlying device, they're expected to have lower latency and higher peer-to-peer bandwidth than peer devices connected through other means.
- PCI IDs: MLOPart devices share the same PCI ID (bus.device.domain) as the underlying device.

Capabilities and characteristics differing from the underlying device include:

- Streaming multiprocessor count: Each MLOPart device has fewer streaming multiprocessors (SMs) than the underlying device. Furthermore, the total SMs across all MLOPart devices sharing an underlying device may be fewer than the total SMs in the underlying device. MLOPart devices belonging to the same underlying device have the same number of SMs, and the number of SMs is consistent across identical NVIDIA GPUs. For example, an NVIDIA HGX B200 system with 8 Blackwell GPUs that normally have 148 SMs will present 16 MLOPart devices with 70 SMs each when MLOPart is enabled.
- Available memory: MLOPart devices have a partition of the total memory of the underlying device, and only allocate from that partition, except in the case of CUDA managed memory allocations. Each MLOPart device has less memory than the underlying device, and MLOPart devices belonging to the same underlying device have the same total memory. In the current version of MLOPart, it's possible for memory allocated on one MLOPart device to affect the available memory reported by cuMemGetInfo and cudaMemGetInfo on another MLOPart device from the same underlying device, even though they have separate partitions. Future drivers will enable more rigid memory partitions between MLOPart devices.
- Virtual address space: MLOPart devices on the same underlying device share a virtual address space. This means that it's possible for a buffer overrun of memory allocated on one MLOPart device to corrupt memory allocated on another MLOPart device within the same process.
- Universally unique identifier: Each MLOPart device has its own universally unique identifier (UUID) that can be queried through CUDA APIs. This can be used to uniquely identify MLOPart devices and to filter available CUDA devices using CUDA_VISIBLE_DEVICES.

Deploying with MLOPart

As with other CUDA MPS features, users control behavior through MPS controller commands. The start_server command starts an MPS server. In CUDA 13.1, we introduced the -mlopart option to this command, which enables users to start an MPS server that creates MLOPart-enabled MPS clients. As this is done on a per-server basis, multiple users may have different MLOPart configurations, depending on their needs.

In CUDA 13.0, we introduced the device_query MPS controller command to provide information about the CUDA devices enumerated by MPS. After a server has been created, device_query can be used to determine information about the devices that'll be exposed to clients of that server, such as the device name, device ordinals, and UUIDs.
```
$ echo device_query | nvidia-cuda-mps-control
Default
Device Ordinal  PCI IDs        UUID                                      Name         Attributes
0               0000:1b.00.00  GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4  NVIDIA B200
1               0000:43.00.00  GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100  NVIDIA B200
2               0000:52.00.00  GPU-a517c26e-0f2f-945a-1672-ea75149f54d6  NVIDIA B200
3               0000:61.00.00  GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1  NVIDIA B200
4               0000:9d.00.00  GPU-b5830513-614b-38ac-b177-5cc2f850ea3d  NVIDIA B200
5               0000:c3.00.00  GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d  NVIDIA B200
6               0000:d1.00.00  GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d  NVIDIA B200
7               0000:df.00.00  GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d  NVIDIA B200
Server 14056
Device Ordinal  PCI IDs        UUID                                      Name                   Attributes
N/A             0000:1b.00.00  GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4  NVIDIA B200            M
0               0000:1b.00.00  GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0  NVIDIA B200 MLOPart 0  MD
1               0000:1b.00.00  GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e  NVIDIA B200 MLOPart 1  MD
N/A             0000:43.00.00  GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100  NVIDIA B200            M
2               0000:43.00.00  GPU-68fb01e9-499c-56d4-b768-8fca70a5ddff  NVIDIA B200 MLOPart 0  MD
3               0000:43.00.00  GPU-6cf0c4ea-3a05-52b1-aec6-63acf60df19b  NVIDIA B200 MLOPart 1  MD
N/A             0000:52.00.00  GPU-a517c26e-0f2f-945a-1672-ea75149f54d6  NVIDIA B200            M
4               0000:52.00.00  GPU-dd670b14-ca31-5dfd-a49b-7220701f4fc6  NVIDIA B200 MLOPart 0  MD
5               0000:52.00.00  GPU-d7433996-1714-5baa-9812-22cecdc792d3  NVIDIA B200 MLOPart 1  MD
N/A             0000:61.00.00  GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1  NVIDIA B200            M
6               0000:61.00.00  GPU-cff5ab0b-a509-54c8-a9c0-c5ebe3fbd3a0  NVIDIA B200 MLOPart 0  MD
7               0000:61.00.00  GPU-7933cfe7-5139-50d8-ad90-0f7f1ddba559  NVIDIA B200 MLOPart 1  MD
N/A             0000:9d.00.00  GPU-b5830513-614b-38ac-b177-5cc2f850ea3d  NVIDIA B200            M
8               0000:9d.00.00  GPU-f973284b-7385-576b-80d7-3ea083bcea94  NVIDIA B200 MLOPart 0  MD
9               0000:9d.00.00  GPU-668e4145-b221-5495-a3fe-a5cdc0e6f6eb  NVIDIA B200 MLOPart 1  MD
N/A             0000:c3.00.00  GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d  NVIDIA B200            M
10              0000:c3.00.00  GPU-53858feb-87eb-5963-8d47-6fbf4b24cd4a  NVIDIA B200 MLOPart 0  MD
11              0000:c3.00.00  GPU-700b029a-be98-5d13-9a4e-5e8e21386e34  NVIDIA B200 MLOPart 1  MD
N/A             0000:d1.00.00  GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d  NVIDIA B200            M
12              0000:d1.00.00  GPU-563db4f2-f70a-564d-aa4a-dbd52d6dfc0b  NVIDIA B200 MLOPart 0  MD
13              0000:d1.00.00  GPU-b643e07a-6eda-5cd8-bdde-1788590d0b4b  NVIDIA B200 MLOPart 1  MD
N/A             0000:df.00.00  GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d  NVIDIA B200            M
14              0000:df.00.00  GPU-f8f5b46d-7774-57a1-97d2-88f23c3457f0  NVIDIA B200 MLOPart 0  MD
15              0000:df.00.00  GPU-46d7f9b7-0303-5432-b50a-16381f37e365  NVIDIA B200 MLOPart 1  MD
```

When MLOPart is enabled, device_query shows the MLOPart devices below the device from which they are derived. This is the recommended method for determining the UUID values used with CUDA_VISIBLE_DEVICES when launching an application: because CUDA enumerates more devices than physically exist on the system, device ordinals alone are ambiguous. Note that MLOPart devices only exist in the context of MPS and CUDA; nvidia-smi doesn't provide information about MLOPart devices.

Lastly, the ps MPS controller command has been extended to display whether a process is using an MLOPart device; the MD attribute in the output below indicates "MLOPart in use."

```
$ while1 -a &
[1] 52845
$ echo ps | nvidia-cuda-mps-control
PID    ID  SERVER  DEVICE             NAMESPACE   COMMAND  ATTRIBUTES
52845  1   52837   GPU-b13add01-c28c  4026531836  while1   MD
```

Now let's look at how MLOPart can affect memory latency and bandwidth.
Latency

As an example, let's look at how MLOPart affects memory latency using a simple kernel that performs atomic operations in a loop. First, we define the kernel and a helper:

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Helper macro to check for CUDA errors
#define CUDA_CHECK_FAILURE(x) \
    if (cudaSuccess != (cudaError_t)x)\
    {\
        const char* errName = cudaGetErrorName(x);\
        const char* errStr = cudaGetErrorString(x);\
        printf("%s:%d - %s: %s\n", __FILE__, __LINE__, errName, errStr);\
        exit(EXIT_FAILURE);\
    }

// Device memory variable to prevent the compiler from optimizing away the memory access
__device__ volatile int dummy;

// Trivial kernel to touch the memory so we can measure latency
__global__ void accessMemoryHighLatency(int *startAddress, size_t memorySizeInBytes)
{
    for (int i = 0; i < memorySizeInBytes / sizeof(int); ++i) {
        dummy = atomicAdd(&startAddress[i], 1);
    }
}
```

Atomic operations are latency-sensitive, making it easy to measure the difference between using and not using MLOPart. The following function uses CUDA events to measure the runtime of the kernel accessMemoryHighLatency.

```cpp
// Function to launch the kernel and measure the runtime using CUDA events
float measureKernelRuntime(int *memoryDevPtr, size_t memorySizeInBytes, int numBlocks, int numThreads)
{
    cudaEvent_t start = NULL, stop = NULL;
    float time = 0;

    CUDA_CHECK_FAILURE(cudaEventCreate(&start));
    CUDA_CHECK_FAILURE(cudaEventCreate(&stop));

    CUDA_CHECK_FAILURE(cudaEventRecord(start, 0));
    accessMemoryHighLatency<<<numBlocks, numThreads>>>(memoryDevPtr, memorySizeInBytes);
    CUDA_CHECK_FAILURE(cudaPeekAtLastError());
    CUDA_CHECK_FAILURE(cudaEventRecord(stop, 0));
    CUDA_CHECK_FAILURE(cudaEventSynchronize(stop));

    CUDA_CHECK_FAILURE(cudaEventElapsedTime(&time, start, stop));

    CUDA_CHECK_FAILURE(cudaEventDestroy(start));
    CUDA_CHECK_FAILURE(cudaEventDestroy(stop));

    return time;
}
```

Finally, we can put this all together in a simple multi-GPU-aware program.

```cpp
int main(int argc, char *argv[])
{
    size_t memorySizeInBytes = 32 * 1024 * 1024; // 32 MB
    int numBlocks = 32;
    int numThreads = 1;
    int numDevices = 0;
    float totalTime = 0;

    CUDA_CHECK_FAILURE(cudaGetDeviceCount(&numDevices));

    // Measure the runtime for each device
    for (int i = 0; i < numDevices; i++) {
        // Set the current device
        CUDA_CHECK_FAILURE(cudaSetDevice(i));

        // Allocate memory on the device
        int *memoryDevPtr;
        CUDA_CHECK_FAILURE(cudaMalloc(&memoryDevPtr, memorySizeInBytes));

        // Measure the runtime
        float time = measureKernelRuntime(memoryDevPtr, memorySizeInBytes, numBlocks, numThreads);
        totalTime += time;
        printf("Device %d - Total time: %f milliseconds\n", i, time);

        // Free the memory
        CUDA_CHECK_FAILURE(cudaFree(memoryDevPtr));
    }

    printf("Average time: %f milliseconds\n", totalTime / numDevices);
    return EXIT_SUCCESS;
}
```

We'll name this file atomic_memory_access.cu and compile it using nvcc atomic_memory_access.cu -arch=sm_100 -o atomic_memory_access.

To establish a baseline, let's run the example using MPS, but without MLOPart.
```
$ nvidia-cuda-mps-control -d
# Optional step of explicitly creating an MPS server. This is also done
# implicitly when we launch a CUDA application while MPS is active.
$ echo start_server -uid $UID | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 2320.550537 milliseconds
Device 1 - Total time: 2323.710938 milliseconds
Device 2 - Total time: 2334.533447 milliseconds
Device 3 - Total time: 2304.551025 milliseconds
Device 4 - Total time: 2304.328125 milliseconds
Device 5 - Total time: 2316.102295 milliseconds
Device 6 - Total time: 2306.165283 milliseconds
Device 7 - Total time: 2306.362061 milliseconds
Average time: 2314.537842 milliseconds
```

Here we see an average time of around 2,300 milliseconds for each device. Now let's enable MLOPart and run it again.

```
# Quit the MPS controller to clean up the previous server.
$ echo quit | nvidia-cuda-mps-control
# Now repeat the above steps, with MLOPart enabled.
$ nvidia-cuda-mps-control -d
# Note that we must explicitly start the server with "-mlopart".
$ echo start_server -uid $UID -mlopart | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 1500.194946 milliseconds
Device 1 - Total time: 1475.914062 milliseconds
Device 2 - Total time: 1479.729492 milliseconds
Device 3 - Total time: 1480.196045 milliseconds
Device 4 - Total time: 1478.959106 milliseconds
Device 5 - Total time: 1490.808716 milliseconds
Device 6 - Total time: 1468.943237 milliseconds
Device 7 - Total time: 1479.297241 milliseconds
Device 8 - Total time: 1467.947632 milliseconds
Device 9 - Total time: 1476.900757 milliseconds
Device 10 - Total time: 1477.081421 milliseconds
Device 11 - Total time: 1490.295044 milliseconds
Device 12 - Total time: 1484.558594 milliseconds
Device 13 - Total time: 1481.660156 milliseconds
Device 14 - Total time: 1476.067383 milliseconds
Device 15 - Total time: 1484.143921 milliseconds
Average time: 1480.793457 milliseconds
```

In this example, we see a significant improvement in execution time per device when using MLOPart. While this was a contrived example, it's important to compare running with and without MLOPart when deciding how to deploy a specific application.

Bandwidth

Given that MLOPart devices have less memory than a full device, they also have lower DRAM bandwidth than devices not using MLOPart. However, MLOPart devices have better peer-to-peer bandwidth between MLOPart devices on the same underlying GPU when compared to devices that must communicate over NVLink or PCIe. Let's look at the (partial) results of a bidirectional P2P bandwidth test between MLOPart devices on the same underlying device and on different underlying devices:

```
$ ./nvbandwidth -t device_to_device_memcpy_read_ce
...
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
          0          1          2          3          4
0       N/A    2352.76     766.82     743.46     767.51
1   2402.78        N/A     765.86     744.04     767.03
2    767.23     744.30        N/A    2349.54     766.00
3    767.37     743.91    2372.91        N/A     767.30
4    766.75     743.52     766.89     743.97        N/A
```

In the above example, devices 0 and 1 are on the same underlying GPU, and devices 2 and 3 are on the same underlying GPU. On B200, peers normally use NVLink when initiating an operation such as cuMemcpyAsync. If these B200 peers are MLOPart devices on the same B200 chip, they can instead use the much faster NV-HBI.

Considerations when using MLOPart

As mentioned previously, using MLOPart implies choosing lower latency over higher bandwidth. This isn't the only tradeoff that must be evaluated when using MLOPart.

Device filtering through CUDA_VISIBLE_DEVICES

The devices available to MPS servers and clients can be filtered and/or remapped using the CUDA_VISIBLE_DEVICES environment variable. Often, this is done using device ordinals. With MPS, this can cause errors if the same CUDA_VISIBLE_DEVICES value is used for both the controller and the server/clients without taking remapping into account. For example, given a system with 8 CUDA devices, the MPS controller can be initialized to filter out the odd-numbered devices (CUDA_VISIBLE_DEVICES=0,2,4,6). In this scenario, the MPS server and clients will only see at most 4 CUDA devices, even without using CUDA_VISIBLE_DEVICES. Using the same value for CUDA_VISIBLE_DEVICES will then fail, since the server and clients can only see devices 0-3. For this reason, it's recommended to use UUIDs, which are unambiguous.

When MLOPart is enabled, there's an additional inconsistency to be aware of: the UUIDs of the devices visible to the MPS controller and to an MPS server/client with MLOPart enabled are different. When using CUDA_VISIBLE_DEVICES, it's recommended to execute the device_query command after the MPS server with MLOPart has been started to determine the UUIDs that will be available to MPS clients, as sketched below.
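Here is a minimal sketch of that recommendation (our example, not from the post): pinning an MPS client to specific MLOPart devices by UUID. The UUIDs below are copied from the device_query output shown earlier; on a real system, read them from `echo device_query | nvidia-cuda-mps-control` after starting the server.

```python
# Launch an MPS client pinned to two MLOPart devices by UUID.
# UUIDs are taken from the device_query output above (replace with your own).
import os
import subprocess

mlopart_uuids = [
    "GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0",  # NVIDIA B200 MLOPart 0
    "GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e",  # NVIDIA B200 MLOPart 1
]

env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(mlopart_uuids))
subprocess.run(["./atomic_memory_access"], env=env, check=True)
```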
Fewer compute resources

When MLOPart is enabled, the MLOPart devices may have some SMs disabled. There's a tradeoff between performance gains from reduced memory latency and performance losses from fewer compute resources. These should be weighed on a per-application basis.

Managed memory

Managed memory doesn't benefit from MLOPart. Because MLOPart requires GPU memory to be created within a device's partition for low-latency access, this can't be done with managed memory. Attempting to use managed memory will work as it normally does, and allocations can still be created using managed memory APIs, but they aren't expected to see performance benefits.

Access modifiers

The cuMemSetAccess API enables programmers to specify access properties for CUDA allocations. When using this API with MLOPart devices, the least restrictive property set across all MLOPart devices belonging to the same underlying GPU is applied. For example, setting a buffer as read-only for one MLOPart device and read-write (the default) for another MLOPart device results in both MLOPart devices having read-write access, until both are updated to a more restrictive access type.

x86 requirement

MLOPart is currently only supported on x86 platforms. Support for Arm platforms will be available in a future release.

Comparison to MIG

MIG can be used to create multiple CUDA devices from a single GPU, as is done with MLOPart. Certain MIG configurations can also reduce latency at the cost of bandwidth, while requiring no code changes.

| Topic | MIG | MLOPart / MPS |
| --- | --- | --- |
| Privilege required | Requires superuser privilege to configure | Doesn't require superuser privilege |
| Scope | System-wide setting | Per-user / per-server setting |
| Memory isolation | Enforces strict memory isolation between MIG GPU instances | Memory from one MLOPart device may corrupt another on the same GPU |
| Performance isolation | Enforces strict performance isolation between MIG compute instances | Performance interference may occur between MLOPart devices |

Table 1. Comparing MIG to MLOPart / MPS

To learn more about MLOPart, CUDA MPS, and how to maximize GPU utilization, check out the MPS documentation.
Acknowledgements: Thanks to the following NVIDIA contributors: Alfred Barnat, Ehren Bendler, Alicia Hu, Balint Joo, Ze Long, Yashwant Marathe, Vance Miller, Kyrylo Perelygin, Will Pierce, and Yifan Yang.
Using AI Physics for Technology Computer-Aided Design Simulations | NVIDIA Technical Blog
Source: nvidia_dev_blog | Published: 17.12.2025 16:00 | Score: 0.725
Embedding sim.: 0.8239
Entity overlap: 0.0789
Title sim.: 0.245
Time proximity: 0.9167
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai for science
NLP country: South Korea


Technology Computer-Aided Design (TCAD) simulations, encompassing both process and device simulations, are crucial for modern semiconductor manufacturing. They enable "virtual manufacturing," allowing engineers to design, build, and test transistors and integrated circuits digitally before committing to the costly physical fabrication process. This approach significantly reduces development time from years to months and saves billions of dollars in experimental manufacturing costs. These simulations, however, are computationally intensive and can take as long as several weeks to complete, delaying manufacturing deadlines. AI-augmented TCAD is a key solution to this challenge.

That's where NVIDIA PhysicsNeMo and NVIDIA Apollo come in. The PhysicsNeMo framework lets developers build high-fidelity surrogates using state-of-the-art architectures for engineering and science simulations. Apollo, announced last month at SC25, makes this easier by providing domain-specific, pre-trained models.

Engineers at SK hynix, one of the world's leading memory chip manufacturers, are leveraging AI physics to develop high-fidelity surrogate models to accelerate device and process simulations in the design and manufacturing of semiconductor chips. Using the NVIDIA PhysicsNeMo framework, engineers have fast-tracked the development of proprietary AI models that can unlock significant innovation in device design and manufacturing.

In this blog, we'll walk you through the steps to get started with PhysicsNeMo to develop your own custom models, and share how the TCAD Intelligence team at SK hynix used PhysicsNeMo to accelerate development of its AI physics models.

Tapping into AI physics for TCAD

TCAD is a specialized field of software simulation used to model and optimize the fabrication and physics of semiconductor devices. It's typically broken into two main parts: process TCAD and device TCAD. Process TCAD simulations model the physical and chemical steps of chip manufacturing, such as deposition, lithography, etching, and ion implantation. Device TCAD simulations, on the other hand, take the final 3D structure predicted by the process simulation and model its electrical behavior. Engineers utilize a variety of simulation solutions for different use cases, ranging from atomic-scale density functional theory (DFT) simulations to chamber-scale computational fluid dynamics (CFD) simulations.

AI-augmented TCAD presents a fundamentally disruptive opportunity for semiconductor manufacturers. As transistors shrink to the nanometer scale, the complexity of their behavior increases, making accurate simulations indispensable for designing next-generation devices, but also making them orders of magnitude more expensive. AI surrogate models, which can be created with NVIDIA PhysicsNeMo, are ultra-fast, deep learning-based replicas of slow, physics-based simulations. This approach dramatically accelerates the design and optimization of semiconductor devices by reducing simulation times from hours to milliseconds, enabling engineers to explore a much wider range of possibilities.

PhysicsNeMo provides Python modules to compose scalable and optimized training and inference pipelines to develop and deploy AI surrogates. The PhysicsNeMo framework offers various AI models tuned for science and engineering and enables the combination of physics knowledge with data.
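As a toy illustration of what combining physics knowledge with data can mean in practice, the loss below blends a data-fit term with a PDE-residual term for a 1D Poisson problem. This is our sketch, not a PhysicsNeMo API; PhysicsNeMo provides its own, far more complete, abstractions for this.

```python
# Toy physics-informed loss: data-fit term plus a PDE-residual term
# for the 1D Poisson equation u''(x) = f(x). Illustrative only.
import torch

def physics_informed_loss(model, x_data, u_data, x_colloc, f):
    # Data term: match observed simulation samples
    data_loss = torch.mean((model(x_data) - u_data) ** 2)

    # Physics term: penalize the PDE residual at collocation points
    x = x_colloc.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    physics_loss = torch.mean((d2u - f(x)) ** 2)

    return data_loss + physics_loss
```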
For AI physics researchers and developers exploring the use of neural operators, GNNs, or transformers, or who are interested in physics-informed neural networks or a hybrid approach in between, PhysicsNeMo provides an optimized stack that enables them to train their models at scale. Engineers can use the necessary building blocks from PhysicsNeMo rather than developing everything from scratch. This reduces the effort required to build detailed AI methodologies and lets them focus their domain expertise on developing surrogate models for specific physics problems.

Getting started with PhysicsNeMo

The simplest way to get started with PhysicsNeMo for building an AI surrogate is to use one of the reference application recipes. These examples give you a working template for both the training code and the data. Here is the general step-by-step path you would follow, using the official examples as your guide:

1. Install PhysicsNeMo. First, set up your environment. The easiest way is to use the official NVIDIA NGC container, which has all dependencies (PyTorch, CUDA, etc.) pre-installed. Next, clone the PhysicsNeMo GitHub repository to get the relevant reference application recipes. If you have an existing dev environment set up for PyTorch, you can pip install from source following the steps outlined here.
2. Assuming you're interested in developing a GNN-based surrogate model for TCAD CFD simulations, start with the vortex shedding recipe.
3. After replicating the sample, customize the training pipeline to your own data.
4. You can also evaluate other model architectures, like DoMINO or Transolver, on your custom data.
5. The built-in distributed functionality in PhysicsNeMo recipes allows you to scale any of the above architectures to full 3D chip-scale simulations.

Let's take a look at how SK hynix engineers used PhysicsNeMo for one of many TCAD use cases.

How SK hynix uses AI physics for TCAD

South Korea-based SK hynix is a global leader in producing high-bandwidth memory (HBM), a crucial component in advanced AI accelerators and GPUs. Its products are vital for a wide array of electronics, from data center servers and PCs to smartphones and next-generation AI systems. The company's engineers are pioneering the use of AI physics by developing high-fidelity surrogate models to accelerate device and process simulations. Utilizing the NVIDIA PhysicsNeMo framework, they have rapidly advanced their proprietary AI models.

An example is the SK hynix TCAD Intelligence team's work on AI surrogate models for etching, an increasingly critical process in semiconductor front-end manufacturing, particularly for advanced memory technologies. By employing predictive modeling to guide the etching process, SK hynix aims to expedite the development of next-generation memory devices.

Figure 1. The stepwise improvement in accuracy of the surrogate model to predict the etch profile with improvements in the methodology used

Accurate prediction of time-varying structures in the etching process is essential for SK hynix. While neural operators are beneficial, they often require large datasets and struggle with data scarcity. To address this, SK hynix adopted Graph Network-based Simulator (GNS) architectures grounded in graph neural networks (GNNs), which combine numerical time-stepping methods to effectively model geometry changes over time. GNS captures local interactions, representing critical physical properties with minimal training data. The sketch after this paragraph illustrates the general GNS pattern.
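The following is a toy GNS-style rollout, our illustration rather than SK hynix's model: a learned per-node update is combined with explicit numerical time stepping, so the mesh geometry evolves over time. The tiny MLP is a stand-in for a real message-passing GNN.

```python
# Toy GNS-style rollout: integrate a learned velocity field over mesh nodes
# (e.g., an evolving etch front). Architecture and sizes are placeholders.
import torch
import torch.nn as nn

class ToyGNSStep(nn.Module):
    """Stand-in for a GNN message-passing step; predicts node velocities."""
    def __init__(self, dim=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, positions):
        # A real GNS would aggregate messages over mesh edges here
        return self.mlp(positions)

model = ToyGNSStep()
positions = torch.randn(128, 3)   # mesh node coordinates
dt = 0.1                          # time step

# Numerical time stepping: apply the learned update repeatedly
with torch.no_grad():
    for _ in range(10):
        positions = positions + dt * model(positions)
```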
However, the existing GNS models were insufficient for effectively emulating the etching process, necessitating the development of additional AI models to enhance the accuracy and efficiency of the emulations.

| Methodology | Improvement |
| --- | --- |
| MeshGraphNet (MGN) | Memory requirement decreased |
| Chamfer loss used for velocity calculation | Training loss reduced |
| Re-meshing at each iteration step | Inference accuracy improved |
| Feature selection | Inference accuracy improved |
| Multi-scale message passing | Training loss reduced |
| Material feature update at each iteration step | Inference accuracy improved |

Table 1. AI methodologies employed in the AI surrogate model for the etching process

The TCAD Intelligence team at SK hynix believes that AI-augmented TCAD will become a key enabler of research productivity in the semiconductor industry. By leveraging AI-accelerated TCAD predictions, engineers will be able to realistically evaluate tens of thousands of process cases generated from dozens of recipe combinations. This advancement allows TCAD to evolve beyond qualitative guidance and serve as a quantitative optimization framework for semiconductor R&D. A wide range of AI models developed using the PhysicsNeMo framework and GPU-accelerated libraries play a crucial role in enabling these capabilities efficiently.

How to get started with NVIDIA PhysicsNeMo

If you are a TCAD application developer or an AI physics researcher, PhysicsNeMo is a powerful tool in your arsenal to accelerate your AI model development. Instead of building everything from scratch, you can leverage PhysicsNeMo modules and model architectures to build enterprise-scale physics AI solutions with unprecedented speed and simplicity. TCAD engineers at SK hynix used this approach to focus their domain expertise on modeling their problems effectively and building skillful models, instead of writing training pipelines using low-level libraries.

You can learn more by using these resources:

- NVIDIA PhysicsNeMo product page
- The PhysicsNeMo GitHub repository
- User guide
- Using PhysicsNeMo with your PyTorch model
- Samples: Explore Jupyter notebooks on Hugging Face and the full repository of reference samples
- Self-paced course: Accelerating Computer-Aided Engineering (CAE) with NVIDIA AI Physics Technology

Yiyi Wang and Alexey Kamenev contributed to the project featured in this blog.
Kihang holds a doctorate in applied mathematics from Hanyang University in South Korea.

About Gyuseung Han
Gyuseung Han is a technology computer-aided design engineer in the TCAD intelligence team at SK hynix R&D, specializing in AI-driven modeling for TCAD simulations and the analysis of experimental results. He holds a Ph.D. in simulation within the field of materials science from Seoul National University. His expertise includes density functional theory, computational fluid dynamics, and artificial intelligence. His work encompasses a wide range of semiconductor process technologies.

About Min Kang
Min Kang is a technology computer-aided design engineer in the TCAD intelligence team at SK hynix R&D. He previously worked with the DRAM process integration team for an extended period and is now focusing on AI-for-science projects with the TCAD intelligence team. His earlier work encompasses a deep learning-based automated transmission electron microscopy image measurement system and a deep learning-driven transistor simulation. He received his master's and bachelor's degrees in physics from Seoul National University.

About Hwiwon Seo
Hwiwon Seo is a technology computer-aided design engineer in the TCAD intelligence team at SK hynix R&D. He specializes in plasma process simulation and the development of AI-driven TCAD solutions. His research focuses on AI-for-science applications, utilizing his expertise in plasma process modeling, TCAD simulation, and AI-augmented TCAD modeling. He holds a master of science degree in plasma engineering and a bachelor of science in materials science from Seoul National University.

About Junghan Kim
Junghan Kim leads the TCAD intelligence team at SK hynix R&D. The TCAD intelligence team specializes in AI for science and develops AI solutions for R&D based on technology computer-aided design. Junghan is an experienced R&D researcher in the semiconductor and display industries. He is skilled in AI-for-science modeling and multi-scale simulation, including CFD, molecular dynamics, DSMC, and more.
He has a doctorate from Technische Universiteit Eindhoven in the Netherlands, where he focused on micro-nano fluidics and rarefied gas simulation.

Related posts
Spotlight: HP 3D Printing Open Sources AI Surrogates for Additive Manufacturing Using NVIDIA PhysicsNeMo
AI-Powered Simulation Tools for Surrogate Modeling Engineering Workflows with Siml.ai and NVIDIA PhysicsNeMo
Physics-Informed Machine Learning Platform NVIDIA PhysicsNeMo Is Now Open Source
NVIDIA PhysicsNeMo: An AI-Accelerated Multiphysics Simulation Toolkit
GTC Digital Demo: Accelerating Scientific and Engineering Simulation Workflows with AI
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Simulate an Accurate Radio Environment Using NVIDIA Aerial Omniverse Digital Twin | NVIDIA Technical Blog nvidia_dev_blog 17.12.2025 16:00 0.724
Embedding sim.0.8101
Entity overlap0.0588
Title sim.0.2803
Time proximity1
NLP type product_launch
NLP organization NVIDIA
NLP topic ai infrastructure
NLP country

Open original

The development of 5G and 6G requires high-fidelity radio channel modeling, but the ecosystem is highly fragmented. Link-level simulators, network-level simulators, and AI training frameworks operate independently, often in different programming languages. If you are a researcher or an engineer trying to simulate the behavior of the key components of the physical layer of 5G or 6G systems, this tutorial teaches you how to extend your simulation chain and add high-fidelity channel realizations generated by the Aerial Omniverse Digital Twin (AODT).

Prerequisites:
Hardware: An NVIDIA RTX GPU (Ada generation or newer recommended for optimal performance).
Software: Access to the AODT Release 1.4 container.
Knowledge: Basic familiarity with Python and wireless network concepts, such as radio units (RUs) and user equipment (UE).

AODT universal embedded service architecture

Figure 1 shows how AODT can be embedded into any simulation chain, whether in C++, Python, or MATLAB.

Figure 1. AODT as a universal, embedded service via high-performance gRPC

AODT is organized into two main components:
The AODT service acts as the centralized, high-power computation core. It manages and loads the massive 3D city models (e.g., from an Omniverse Nucleus server) and executes all the complex electromagnetic (EM) physics calculations.
The AODT client and language bindings provide a lightweight developer interface. The client handles all the service calls and uses GPU IPC to transfer data efficiently, enabling direct GPU-memory access to radio-channel outputs.

To support a broad range of development environments, the AODT client provides universal language bindings, enabling direct use from C++, Python (through pybind11), and MATLAB (through user-implemented MEX).

Workflow in action: Computing channel impulse responses in 7 easy steps

So how do you actually use it? The entire workflow is designed to be straightforward and follows a precise sequence orchestrated by the client, as shown in Figure 2.

Figure 2. Summary of AODT client/service workflow

The process is split into two main phases: configuration tells AODT what to simulate, and execution runs the simulation and retrieves the data. The steps below walk through the full example.

Phase 1: Configuration (building the YAML string)

The AODT service is configured using a single YAML string. While you can write this by hand, we also provide a powerful Python API to build it programmatically, step by step.

Step 1: Initialize the simulation configuration

First, import the configuration objects and set up the basic parameters: the scene to load, the simulation mode (e.g., SimMode.EM), the number of slots to run, and a seed for repeatable, deterministic results.

from _config import (SimConfig, SimMode, DBTable, Panel)

# EM is the default mode.
config = SimConfig(scene, SimMode.EM)

# One batch is the default.
config.set_num_batches(1)
config.set_timeline(
    slots_per_batch=15000,
    realizations_per_slot=1
)

# Seeding is disabled by default.
config.set_seed(seed=1)
config.add_tables_to_db(DBTable.CIRS)

Step 2: Define antenna arrays

Next, define the antenna panels for both your base stations (RUs) and your UEs. You can use standard models, like ThreeGPP38901, or define your own.
# Declare the panel for the RU
ru_panel = Panel.create_panel(
    antenna_elements=[AntennaElement.ThreeGPP38901],
    frequency_mhz=3600,
    vertical_spacing=0.5,
    vertical_num=1,
    horizontal_spacing=0.5,
    horizontal_num=1,
    dual_polarized=True,
    roll_first=-45,
    roll_second=45)

# Set as default for RUs
config.set_default_panel_ru(ru_panel)

# Declare the panel for the UE
ue_panel = Panel.create_panel(
    antenna_elements=[AntennaElement.InfinitesimalDipole],
    frequency_mhz=3600,
    vertical_spacing=0.5,
    vertical_num=1,
    horizontal_spacing=0.5,
    horizontal_num=1,
    dual_polarized=True,
    roll_first=-45,
    roll_second=45)

# Set as default for UEs
config.set_default_panel_ue(ue_panel)

Step 3: Deploy network elements (RUs and manual UEs)

Place your network elements in the scene. We use georeferenced coordinates (latitude/longitude) to place them precisely. For UEs, you can define a series of waypoints to create a pre-determined path.

du = Nodes.create_du(
    du_id=1,
    frequency_mhz=3600,
    scs_khz=30
)

ru = Nodes.create_ru(
    ru_id=1,
    frequency_mhz=3600,
    radiated_power_dbm=43,
    du_id=du.id,
)
ru.set_position(
    Position.georef(
        35.66356389841298,
        139.74686323425487))
ru.set_height(2.5)
ru.set_mech_azimuth(0.0)
ru.set_mech_tilt(10.0)

ue = Nodes.ue(
    ue_id=1,
    radiated_power_dbm=26,
)
ue.add_waypoint(
    Position.georef(
        35.66376818087683,
        139.7459968717682))
ue.add_waypoint(
    Position.georef(
        35.663622296081414,
        139.74622811587614))
ue.add_waypoint(
    Position.georef(
        35.66362516562424,
        139.74653110368598))

config.add_ue(ue)
config.add_du(du)
config.add_ru(ru)

Step 4: Deploy dynamic elements (procedural UEs and scatterers)

This is where the simulation becomes truly dynamic. Instead of placing every UE by hand, you can define a spawn_zone and have AODT procedurally generate UEs that move realistically within that area. You can also enable urban_mobility to add dynamic scatterers (cars) that will physically interact with and alter the radio signals.

# If we want to enable procedural UEs we need a spawn zone.
config.add_spawn_zone(
    translate=[150.2060449, 99.5086621, 0],
    scale=[1.5, 2.5, 1],
    rotate_xyz=[0, 0, 71.0])

# Procedural UEs are zero by default.
config.set_num_procedural_ues(1)

# Indoor proc. UEs are 0% by default.
config.set_perc_indoor_procedural_ues(0.0)

# Urban mobility is disabled by default.
config.enable_urban_mobility(
    vehicles=50,
    enable_dynamic_scattering=True)

# Save to string
from omegaconf import OmegaConf
config_dict = config.to_dict()
yaml_string = OmegaConf.to_yaml(config_dict)

Phase 2: Execution (client-server interaction)

Now that we have our yaml_string configuration, we connect to the AODT service and run the simulation.

Step 5: Connect

Import the dt_client library, create a client pointing to the service address, and call client.start(yaml_string). This single call sends the entire configuration to the service, which then loads the 3D scene, generates all the objects, and prepares the simulation.

import dt_client
import numpy as np
import matplotlib.pyplot as plt

# Server address (currently only localhost is supported)
server_address = "localhost:50051"

# Create client
client = dt_client.DigitalTwinClient(server_address)

try:
    client.start(yaml_string)
except RuntimeError as e:
    print(f"X Failed to start scenario: {e}")
    return 1

Once started, you can query the service to get the parameters of the simulation you just created. This confirms everything is ready and tells you how many slots, RUs, and UEs to expect.
try:
    status = client.get_status()
    num_batches = status['total_batches']
    num_slots = status['slots_per_batch']
    num_rus = status['num_rus']
    num_ues = status['num_ues']
except RuntimeError as e:
    print(f"X Failed to get status: {e}")
    return 1

Step 6: Get UE positions

Looping through each simulation slot, you can ask for the current position of all UEs. This is crucial for verifying that the mobility models are working as expected and for correlating channel data with location.

for slot in range(num_slots):
    try:
        ue_positions = client.get_ue_positions(batch_index=0,
                                               temporal_index=SlotIndex(slot))
    except RuntimeError as e:
        print(f"X Failed to get UE pos: {e}")

Step 7: Retrieve channel impulse responses

Retrieving the core simulation data is the most critical step. The channel impulse response (CIR) describes how the signal propagates from each RU to each UE, including all multipath components (their delays, amplitudes, and phases). Retrieving this much data for every slot can be slow. To make it fast, the API uses a two-step, zero-copy process based on IPC.

First, before the loop, you ask the client to allocate GPU memory for the CIR results. The service does this and returns IPC handles, which are pointers to that GPU memory.

ru_indices = [0]
ue_indices_per_ru = [[0, 1]]
is_full_antenna_pair = False

try:
    # Step 1: Allocate GPU memory for CIR
    cir_alloc_result = client.allocate_cirs_memory(
        ru_indices, ue_indices_per_ru, is_full_antenna_pair)
    values_ipc_handles = cir_alloc_result['values_handles']
    delays_ipc_handles = cir_alloc_result['delays_handles']
except RuntimeError as e:
    print(f"X Failed to allocate CIR memory: {e}")
    return 1

Now, inside your loop, you call client.get_cirs(…), passing in those memory handles. The AODT service runs the full EM simulation for that slot and writes the results directly into that shared GPU memory. No data is copied over the network, making it incredibly efficient. The client is simply notified that the new data is ready.

# Step 2: Retrieve CIR
cirs = client.get_cirs(
    values_ipc_handles,
    delays_ipc_handles,
    batch_index=0,
    temporal_index=SlotIndex(0),
    ru_indices=ru_indices,
    ue_indices_per_ru=ue_indices_per_ru,
    is_full_antenna_pair=is_full_antenna_pair)
values_shapes = cirs['values_shapes']
delays_shapes = cirs['delays_shapes']

Access the data in NumPy

The data (CIR values and delays) is still on the GPU. The client library provides simple utilities to get a GPU pointer without latency penalties. For convenience, however, the data can also be accessed from NumPy, as shown in the following code.

# Step 3: Export to NumPy
for i in range(len(ru_indices)):
    values_gpu_ptr = client.access_values_gpu(
        values_ipc_handles[i], values_shapes[i])
    delays_gpu_ptr = client.access_delays_gpu(
        delays_ipc_handles[i], delays_shapes[i])
    values = client.gpu_to_numpy(
        values_gpu_ptr, values_shapes[i])
    delays = client.gpu_to_numpy(
        delays_gpu_ptr, delays_shapes[i])

And that's it! In just a few lines of Python, you have configured a complex, dynamic, georeferenced simulation, run it on a powerful remote server, and retrieved the high-fidelity, physics-based CIRs as a NumPy array. The data is now ready to be visualized, analyzed, or fed directly into an AI training pipeline. For instance, we can visualize the frequency responses of the manual UE declared above using the following plot function.
def cfr_from_cir(h, tau, freqs_hz):
    # CFR from CIR: H(f) = sum_k h_k * exp(-j * 2*pi * f * tau_k)
    phase_arg = -1j * 2.0 * np.pi * np.outer(tau, freqs_hz)
    # Safe exponential and matrix multiplication
    with np.errstate(all='ignore'):
        # Sanitize inputs
        h = np.where(np.isfinite(h), h, 0.0)
        expm = np.exp(phase_arg)
        expm = np.where(np.isfinite(expm), expm, 0.0)
        result = h @ expm
        result = np.where(np.isfinite(result), result, 0.0)
    return result


def plot(values_full, delays_full):
    # values_full shape:
    # [n_ue,     number of UEs
    #  n_symbol, number of OFDM symbols
    #  n_ue_h,   number of horizontal sites in the UE panel
    #  n_ue_v,   number of vertical sites in the UE panel
    #  n_ue_p,   number of polarizations in the UE panel
    #  n_ru_h,   number of horizontal sites in the RU panel
    #  n_ru_v,   number of vertical sites in the RU panel
    #  n_ru_p,   number of polarizations in the RU panel
    #  n_tap     number of taps
    # ]
    AX_UE, AX_SYM, AX_UEH, AX_UEV, AX_UEP, AX_RUH, AX_RUV, AX_RUP, AX_TAPS = range(9)

    # delays_full shape:
    # [n_ue,      number of UEs
    #  n_symbols, number of OFDM symbols
    #  n_ue_h,    number of horizontal sites in the UE panel
    #  n_ue_v,    number of vertical sites in the UE panel
    #  n_ru_h,    number of horizontal sites in the RU panel
    #  n_ru_v,    number of vertical sites in the RU panel
    #  n_tap      number of taps
    # ]
    D_AX_UE, D_AX_SYM, D_AX_UEH, D_AX_UEV, D_AX_RUH, D_AX_RUV, D_AX_TAPS = range(7)

    nbins = 4096
    spacing_khz = 30.0
    freqs_hz = (np.arange(nbins) - (nbins // 2)) * \
        spacing_khz * 1e3

    # Set up the figure (2x2 grid)
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 9),
                             sharex=True)
    axes = axes.ravel()
    cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
    titles = [
        "UE$_1$: -45° co-pol",
        "UE$_1$: -45° x-pol",
        "UE$_1$: 45° x-pol",
        "UE$_1$: 45° co-pol"
    ]

    # Note: i_fixed (the UE index to plot) and DELAY_SCALE (the scaling of
    # the stored delays) are defined in the full example.
    for ax, (j, k), title in zip(axes, cases, titles):
        # Construct index tuple: [i, 0, 0, 0, j, 0, 0, k, :]
        idx_vals = [0] * values_full.ndim
        idx_vals[AX_UE] = i_fixed
        idx_vals[AX_UEP] = j             # UE polarization
        idx_vals[AX_RUP] = k             # RU polarization
        idx_vals[AX_TAPS] = slice(None)  # All taps
        h_i = values_full[tuple(idx_vals)]
        h_i = np.squeeze(h_i)

        # Construct index tuple: [i, 0, 0, 0, 0, 0, :]
        idx_del = [0] * delays_full.ndim
        idx_del[D_AX_UE] = i_fixed
        idx_del[D_AX_TAPS] = slice(None)
        tau_i = delays_full[tuple(idx_del)]
        tau_i = np.squeeze(tau_i) * DELAY_SCALE

        H = cfr_from_cir(h_i, tau_i, freqs_hz)
        power_w = np.abs(H) ** 2
        power_w = np.maximum(power_w, 1e-12)
        power_dbm = 10.0 * np.log10(power_w) + 30.0
        ax.plot(freqs_hz/1e6 + 3600, power_dbm,
                linewidth=1.5)
        ax.set_title(title)
        ax.grid(True, alpha=0.3)

    # Formatting
    for ax in axes:
        ax.set_ylabel("Power (dBm)")
    axes[2].set_xlabel("Frequency (MHz)")
    axes[3].set_xlabel("Frequency (MHz)")
    plt.tight_layout()
    plt.show()

Figure 3. Polarimetric frequency responses for the considered example

Empowering the AI-native 6G era

The transition from 5G to 6G must tackle greater complexity in wireless signal processing, characterized by massive data volumes, extreme heterogeneity, and the core mandate for AI-native networks. Traditional, siloed simulation methods are simply insufficient for this challenge. The NVIDIA Aerial Omniverse Digital Twin is built precisely for this new era. By moving to a gRPC-based service architecture in release 1.4, AODT is democratizing access to physics-based radio simulation and providing the ground truth needed for machine learning and algorithm exploration.

AODT 1.4 is available on NVIDIA NGC. We invite researchers, developers, and operators to integrate this powerful new service and collaborate with us in building the future of 6G.
Tags: Developer Tools & Techniques | Simulation / Modeling / Design | Telecommunications | Aerial | Omniverse | Intermediate Technical | 5G / 6G | featured | Industrial Digitalization / Digital Twin

About the Authors

About Tommaso Balercia
Tommaso Balercia was born in Jesi, Italy in 1979. He received his master's degree in microelectronics from the Polytechnic University of Marche (Italy) and his PhD from the Technical University of Braunschweig (Germany) in 2007 and 2013, respectively. He's currently the principal architect of the NVIDIA digital twin for the simulation of radio access networks (RANs). His areas of interest cover EM simulation, RAN design, and HPC at scale.

About CC Chong
CC Chong is the senior director and head of Aerial product management at NVIDIA. Before joining NVIDIA, she was most recently senior director and GM of the wireless and access business unit in the Intel Programmable Solutions Group. Chong received her Ph.D. in electronics and electrical engineering from the University of Edinburgh in Scotland and her bachelor's in electronics and electrical engineering from the University of Manchester. She was a recipient of the Ten Outstanding Young Malaysian Awards under the category "Scientific and Technological Development" in 2006.

Related posts
5 New Digital Twin Products Developers Can Use to Build 6G Networks
Improve AI-Native 6G Design with the NVIDIA Aerial Omniverse Digital Twin
NVIDIA Aerial Omniverse Digital Twin Boosts Development of AI-Native Wireless and Deployment Flexibility
Developing Next-Generation Wireless Networks with NVIDIA Aerial Omniverse Digital Twin
Accelerating the Future of Wireless Communication with the NVIDIA 6G Developer Program
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Accelerating AI-Powered Chemistry and Materials Science Simulations with NVIDIA ALCHEMI Toolkit-Ops | NVIDIA Technical Blog nvidia_dev_blog 19.12.2025 17:00 0.72
Embedding sim.0.833
Entity overlap0.0435
Title sim.0.3007
Time proximity0.7083
NLP type product_launch
NLP organization NVIDIA
NLP topic machine learning
NLP country

Open original

Machine learning interatomic potentials (MLIPs) are transforming the landscape of computational chemistry and materials science. MLIPs enable atomistic simulations that combine the fidelity of computationally expensive quantum chemistry with the scaling power of AI. Yet developers working at this intersection face a persistent challenge: the lack of a robust, Pythonic toolbox for GPU-accelerated atomistic simulation. For use cases such as running a large number of simultaneous, GPU-accelerated simulations, robust and well-supported tools are either missing from the current software ecosystem or fragmented across several open source tools.

Over the past few years, available software for running atomistic simulations with MLIPs has been CPU-centric. Core operations such as neighbor identification, dispersion corrections, long-range interactions, and their associated gradient calculations have traditionally supported only CPU computation, which often struggles to deliver the speed that contemporary research demands. High-throughput simulations of small- to medium-sized atomic systems quickly become bottlenecked by inefficient GPU usage in hybrid workflows where the model is GPU-accelerated in PyTorch but the simulation tooling is serial and CPU-based. While developers have attempted to implement these operations directly in PyTorch over the years, the general-purpose design of PyTorch leaves performance on the table for the specialized spatial and force-calculation operations required in atomistic simulation. This fundamental mismatch between PyTorch capabilities and the demands of atomistic modeling raises an important question: What's needed to bridge this gap?

NVIDIA ALCHEMI (AI Lab for Chemistry and Materials Innovation), announced at Supercomputing 2024, provides chemistry and materials science developers and researchers with domain-specialized toolkits and NVIDIA NIM microservices optimized for NVIDIA accelerated computing platforms. It is a collection of high-performance, batched, GPU-accelerated tools for enabling atomistic simulations in chemistry and materials science research at the machine learning framework level. NVIDIA ALCHEMI delivers capabilities across three integrated layers:
ALCHEMI Toolkit-Ops: A repository of GPU-accelerated, batched common operations for AI-enabled atomistic simulation tasks, such as neighbor list construction, DFT-D3 dispersion corrections, and long-range electrostatics.
ALCHEMI Toolkit: A collection of GPU-accelerated simulation building blocks, including geometry optimizers, integrators, and data structures to enable large-scale, batched simulations leveraging AI.
ALCHEMI NIM microservices: A scalable layer of cloud-ready, domain-specific microservices for chemistry and materials science, enabling deployment and orchestration on NVIDIA-accelerated platforms.

This post introduces NVIDIA ALCHEMI Toolkit-Ops, the accelerated batched common operations layer of ALCHEMI. ALCHEMI Toolkit-Ops uses NVIDIA Warp to accelerate and batch common operations in AI-driven atomistic modeling. These operations are exposed through a modular, PyTorch-accessible API (with a JAX API targeted for a future release) that enables rapid iteration and integration with existing and future atomistic simulation packages. Figure 1 shows the accelerated batched common operations for atomistic simulations included in this initial release of ALCHEMI Toolkit-Ops.
This beta release includes two versions of neighbor lists (naive and cell), DFT-D3 dispersion correction, and long-range Coulombic (Ewald and particle mesh Ewald) functions.

Figure 1. NVIDIA ALCHEMI Toolkit-Ops is a repository of modules developed specifically for GPU-accelerated batched operation (one GPU, many systems) support for MLIPs and molecular dynamics engines

Figure 2 demonstrates the performance of the accelerated kernels in ALCHEMI Toolkit-Ops, which achieve fully parallelized performance and scalability, against popular kernel-accelerated models like MACE (cuEquivariance) and TensorNet (Warp). The blue MLIP baseline allows comparison with advanced features like neighbor lists, dispersion corrections (DFT-D3), and explicit electrostatics computations (Ewald and particle mesh Ewald (PME)). Test systems consisted of ammonia clusters of increasing size packed into various cells using Packmol. Timing results were averaged over 20 runs on an NVIDIA H100 80 GB GPU. The DFT-D3 benchmark does not include the 6 Å cutoff case due to the long-range nature of D3.

Figure 2. Benchmarks showing the speed of the ALCHEMI Toolkit-Ops neighbor list (both naive O(N²) and cell list O(N) implementations), DFT-D3 correction, and two versions of electrostatic interactions. All methods are compared to the computational cost of popular kernel-accelerated MLIPs. Left-side panels outline batch scaling for a fixed number of atoms and a variable number of systems [batch size], while right-side panels demonstrate timings for a single system growing in size.

ALCHEMI Toolkit-Ops ecosystem integration

ALCHEMI Toolkit-Ops is designed to integrate seamlessly with the broader PyTorch-based atomistic simulation ecosystem. We are excited to announce in-progress integrations with leading open source tools in the chemistry and materials science community: TorchSim, MatGL, and AIMNet Central.

TorchSim

TorchSim, a next-generation open source atomistic simulation engine, is adopting ALCHEMI Toolkit-Ops kernels to power its GPU-accelerated workflows. TorchSim is a PyTorch-native simulation engine purpose-built for the MLIP era, enabling batched molecular dynamics and structural relaxation across thousands of systems simultaneously on a single GPU. TorchSim will leverage our optimized neighbor lists to drive high-throughput batched operations without sacrificing flexibility or performance.

MatGL

MatGL (Materials Graph Library) is an open source framework for building graph-based machine learning interatomic potentials and foundation potentials for inorganic, molecular, and hybrid materials systems. By integrating ALCHEMI Toolkit-Ops, MatGL significantly accelerates graph-based treatments of long-range interactions, enabling large-scale atomistic simulations that are both faster and more computationally efficient without compromising accuracy.

AIMNet Central

AIMNet Central is a repository for AIMNet2, a general-purpose MLIP capable of modeling neutral, charged, organic, and elemental-organic systems with high fidelity. AIMNet Central is leveraging ALCHEMI Toolkit-Ops to further enhance the performance of its flexible long-range interaction models. Using NVIDIA-accelerated DFT-D3 and neighbor list kernels, AIMNet2 can deliver even faster atomistic simulations for large and periodic systems without compromising accuracy.

How to get started with ALCHEMI Toolkit-Ops

Getting started with ALCHEMI Toolkit-Ops is simple and designed with ease of use in mind.
System and package requirements

Python 3.11+
Operating system: Linux (primary), Windows (WSL2), macOS
NVIDIA GPU (A100 or newer recommended), CUDA compute capability ≥ 8.0
CUDA Toolkit 12+, NVIDIA driver 570.xx.xx+

Installation

To install ALCHEMI Toolkit-Ops, use the following snippet:

# Install via pip wheel
pip install nvalchemi-toolkit-ops

# Make sure it is importable
python -c "import nvalchemiops; print(nvalchemiops.__version__)"

See the ALCHEMI Toolkit-Ops documentation for other installation instructions. Explore the examples directory in the GitHub repository and run the examples to test acceleration on your own hardware.

Typical troubleshooting tips:
Verify CUDA installation and device availability: nvidia-smi, nvcc --version
Ensure a compatible Python version: python --version
Upgrade dependencies as needed: pip list | grep torch and pip list | grep warp

Feature highlights

This section dives into three initial ALCHEMI Toolkit-Ops features: high-performance neighbor lists, DFT-D3 dispersion corrections, and long-range electrostatic interactions.

Neighbor lists

Neighbor list construction is the backbone of atomistic simulations, enabling the calculation of energies and forces with local or semi-local MLIPs. ALCHEMI Toolkit-Ops delivers state-of-the-art GPU performance in PyTorch, scaling to millions of atoms per second for batches of many small to medium atomic systems or single large atomic systems.

Capabilities:
Both O(N) (cell list) and O(N²) (naive) algorithms with batched processing
Periodic boundary support for triclinic cells with arbitrary cell dimensions and partial periodicity
Support for end-to-end compute graph compilation
Direct API compatibility with PyTorch

API example:

import torch
from nvalchemiops.neighborlist import neighbor_list

# Water molecule
water_positions = torch.tensor([
    [0.0, 0.0, 0.0],    # O
    [0.96, 0.0, 0.0],   # H
    [-0.24, 0.93, 0.0], # H
], device="cuda", dtype=torch.float32)

# Ammonia molecule (NH3)
ammonia_positions = torch.tensor([
    [0.0, 0.0, 0.0],      # N
    [1.01, 0.0, 0.0],     # H
    [-0.34, 0.95, 0.0],   # H
    [-0.34, -0.48, 0.82], # H
], device="cuda", dtype=torch.float32)

# Concatenate positions for batch processing
positions = torch.cat([water_positions, ammonia_positions], dim=0)

# Create batch indices (0 for water, 1 for ammonia)
batch_idx = torch.cat([
    torch.zeros(3, dtype=torch.int32, device="cuda"), # Water
    torch.ones(4, dtype=torch.int32, device="cuda"),  # Ammonia
])

# Define cells for each molecule (large enough to contain them without PBC)
cells = torch.stack([
    torch.eye(3, device="cuda") * 10.0, # Water cell
    torch.eye(3, device="cuda") * 10.0, # Ammonia cell
])

# Non-periodic molecule case
pbc = torch.tensor([
    [False, False, False], # Water
    [False, False, False], # Ammonia
], device="cuda")

# Cutoff distance in Angstroms
cutoff = 4.0

# Compute neighbor list; here we explicitly request a batched cell list algorithm
neighbor_matrix, num_neighbors, shift_matrix = neighbor_list(
    positions, cutoff, cell=cells, pbc=pbc, batch_idx=batch_idx,
    method="batch_cell_list"
)

print(f"Neighbor matrix: {neighbor_matrix.cpu()}")   # [7, num_neighbors.max()]
print(f"Neighbors per atom: {num_neighbors.cpu()}")  # [7,]
print(f"Periodic shifts: {shift_matrix.cpu()}")

DFT-D3 dispersion corrections

Realistic molecular modeling must fully account for van der Waals interactions, which standard DFT functionals do not account for systematically.
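For reference, the two-body dispersion energy in the Becke-Johnson-damped D3 variant implemented here takes the standard textbook form (not quoted from the original post):

\[
E_{\mathrm{disp}}^{\mathrm{D3(BJ)}} = -\sum_{A<B}\;\sum_{n=6,8} s_n\,\frac{C_n^{AB}}{r_{AB}^{\,n} + \bigl(a_1 R_0^{AB} + a_2\bigr)^{n}},
\qquad R_0^{AB} = \sqrt{C_8^{AB}/C_6^{AB}}
\]

The a1, a2, and s8 arguments in the API example below are exactly these functional-specific damping and scaling parameters (the PBE values in the snippet), while the C6 and C8 coefficients are derived from the tabulated D3 parameters and coordination numbers.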
DFT-D3 uses empirical pairwise corrections, leading to substantial improvements in binding energies, lattice structures, conformational analysis, and adsorption studies for common DFT functionals.

Capabilities:
Becke-Johnson (BJ) rational damping variant
Support for batched and periodic calculations
Support for smoothing at the cutoff distance
Joint energy, forces, and virial calculation

API example:

from nvalchemiops.interactions.dispersion import dftd3

batch_ptr = torch.tensor([0, 3, 7], dtype=torch.int32, device="cuda")
atomic_numbers = torch.tensor(
    [6, 1, 1, 7, 1, 1, 1], dtype=torch.int32, device="cuda"
)

# For this snippet, assume d3_params is loaded as:
# d3_params = D3Parameters(rcov=..., r4r2=..., c6ab=..., cn_ref=...)
# Users can refer to the documentation to source DFT-D3 parameters
# and understand the expected data structure
d3_params = ...

# Call the DFT-D3 functional interface
energy, forces, coordination_numbers = dftd3(
    positions=positions,
    numbers=atomic_numbers,
    a1=0.3981, a2=4.4211, s8=0.7875, # PBE parameters
    neighbor_matrix=neighbor_matrix,
    neighbor_matrix_shifts=shift_matrix,
    batch_idx=batch_idx,
    d3_params=d3_params
)

print(f"Energies: {energy.cpu()}") # [2,]
print(f"Forces: {forces.cpu()}")   # [7, 3]

Limitations: The current implementation computes two-body terms only (C6 and C8). Three-body Axilrod-Teller-Muto (ATM/C9) contributions are not included, which generally leads to some overestimation of dispersion energies.

Long-range electrostatic interactions

Accurate modeling of electrostatic interactions is critical for simulations involving ions, charged species, and polar systems. Currently, the most common approach for MLIPs is to learn Coulomb interactions within the short-ranged model. The resulting systematic underestimation of long-range Coulombic effects leads to a loss of accuracy in binding energies, solvation structures, and interfacial phenomena. ALCHEMI Toolkit-Ops provides fully GPU-accelerated Ewald summation methods, both standard Ewald and particle mesh Ewald, enabling efficient and accurate treatment of long-range electrostatics in PyTorch.

For large periodic systems, Ewald-based methods separate electrostatic interactions into short-range and long-range components, each computed in the domain best suited for performance (see the decomposition sketched below). ALCHEMI Toolkit-Ops provides a dual-cutoff strategy that dramatically reduces redundant neighbor queries and memory overhead compared to naive all-pairs approaches, making high-throughput simulations of charged systems practical on modern GPUs. Users can choose between standard Ewald for smaller systems or PME for larger periodic systems, depending on their specific performance and accuracy needs.
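The decomposition the preceding paragraph refers to is the standard Ewald split (written here in Gaussian units; this is textbook material, not taken from the post):

\[
E_{\mathrm{Coulomb}} =
\underbrace{\frac{1}{2}\sum_{i \neq j} q_i q_j\,\frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}}_{\text{short-range, real space}}
+ \underbrace{\frac{2\pi}{V}\sum_{\mathbf{k} \neq 0} \frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}}\,\Bigl|\sum_j q_j e^{i\mathbf{k}\cdot\mathbf{r}_j}\Bigr|^{2}}_{\text{long-range, reciprocal space}}
- \underbrace{\frac{\alpha}{\sqrt{\pi}}\sum_i q_i^{2}}_{\text{self-interaction}}
\]

The alpha argument in the PME example below is this splitting parameter: larger values shift work from the real-space sum (served by the neighbor list) to the reciprocal-space sum (served by cuFFT in PME).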
Capabilities:
Ewald summation method
Particle mesh Ewald (PME) using B-splines
Support for batched and periodic systems
GPU-optimized computation, leveraging cuFFT for fast reciprocal-space evaluation
PyTorch integration providing native tensor support for end-to-end differentiable workflows

API example:

from nvalchemiops.interactions.electrostatics import particle_mesh_ewald

# Charges for each atom are randomly generated here
atomic_charges = torch.randn(
    positions.size(0), dtype=torch.float32, device="cuda"
)

# Compute energy and forces with particle mesh Ewald
energy, forces = particle_mesh_ewald(
    positions,
    atomic_charges,
    cells,
    alpha=0.3, # adjust Ewald splitting parameter
    batch_idx=batch_idx,
    neighbor_matrix=neighbor_matrix,
    neighbor_matrix_shifts=shift_matrix,
    compute_forces=True
)

print(f"Energy: {energy.cpu()}") # [2]
print(f"Forces: {forces.cpu()}") # [7, 3]

Dive deeper into ALCHEMI Toolkit-Ops

ALCHEMI Toolkit-Ops empowers the community with high-performance, accessible atomistic modeling tools on NVIDIA GPUs. To accelerate your chemistry and materials science simulations, visit the NVIDIA/nvalchemi-toolkit-ops GitHub repo and the NVIDIA ALCHEMI Toolkit-Ops documentation. You can also explore the examples gallery.

This beta release of ALCHEMI Toolkit-Ops focuses on highly efficient neighbor lists, dispersion corrections, and long-range electrostatics. Stay tuned for new features and performance optimizations in future releases.

Acknowledgments

We'd like to thank Professor Shyue Ping Ong; Professor Olexandr Isayev; and the TorchSim committee members Abhijeet Gangan, Orion Archer Cohen, Will Engler, and Ben Blaiszik for working with us to adopt NVIDIA ALCHEMI Toolkit-Ops into their open source projects. We also thank Wen Jie Ong, Piero Altoe, and Kibibi Moseley from NVIDIA for their help preparing this blog post.

Tags: Developer Tools & Techniques | Simulation / Modeling / Design | HPC / Scientific Computing | NIM | Intermediate Technical | Tutorial | Computational Chemistry / Materials Science | featured | PyTorch

About the Authors

About Justin S. Smith
Justin S. Smith is the senior developer relations manager for AI in Chemistry and Materials Science at NVIDIA. He is a computational chemist who earned his PhD from the University of Florida in 2018, where he worked on AI for atomistic simulation. He then went on to become a staff scientist at Los Alamos National Laboratory, where he focused on ML applications in reactive chemistry and materials science.

About Nikita Fedik
Nikita Fedik is a senior technical marketing engineer for AI in Chemistry and Materials Science at NVIDIA, specializing in AI-accelerated computational chemistry and scientific visualization. He holds a PhD in physical chemistry from Utah State University and previously served as a staff scientist at Los Alamos National Laboratory.

About Dallas Foster
Dallas Foster is a senior deep learning software engineer for HPC and AI at NVIDIA. He received his PhD in mathematics at Oregon State University and has worked at Los Alamos National Laboratory, the National Center for Atmospheric Research, and MIT. As a member of the PhysicsNeMo team at NVIDIA, he has a particular focus on the application and deployment of deep learning for weather forecasting and molecular dynamics.

About Roman Zubatyuk
Roman Zubatyuk is a senior application engineer for AI in Chemistry and Materials Science at NVIDIA.
He's a computational chemist who earned his PhD in computational and data-enabled science and engineering from Jackson State University in 2019. He completed postdoctoral research at the University of North Carolina and Carnegie Mellon University, focusing on the development and application of machine learning interatomic potentials before joining NVIDIA.

About Kelvin Lee
Kelvin Lee is a senior deep learning software engineer at NVIDIA. He received his PhD in physical chemistry at the University of New South Wales, Australia. Prior to joining NVIDIA, Kelvin held academic research positions at the Center for Astrophysics | Harvard & Smithsonian and the Massachusetts Institute of Technology before working in industry as a research scientist at Intel Labs. His work focuses on accelerating computation and developer productivity for chemistry and materials science, spectroscopy, and astrophysics.

Related posts
Faster Chemistry and Materials Discovery with AI-Powered Simulations Using NVIDIA ALCHEMI
Enabling Scalable AI-Driven Molecular Dynamics Simulations
Accelerated Molecular Simulation Using Deep Potential Workflow with NGC
NVIDIA GPU Accelerated VASP 6 uses OpenACC to Deliver 15X More Performance
Share Your Science: Fighting Ebola with Supercomputer Simulations
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Simulate Robotic Environments Faster with NVIDIA Isaac Sim and World Labs Marble | NVIDIA Technical Blog nvidia_dev_blog 17.12.2025 17:00 0.701
Embedding sim.0.7905
Entity overlap0
Title sim.0.2391
Time proximity0.994
NLP type other
NLP organization nvidia
NLP topic robotics
NLP country

Open original

Building realistic 3D environments for robotics simulation has traditionally been a labor-intensive process, often requiring weeks of manual modeling and setup. Now, with generative world models, you can go from a text prompt to a photorealistic, simulation-ready world in a fraction of the time. By combining NVIDIA Isaac Sim, an open source robotics reference framework, with generative models such as Marble from World Labs, you can create entire 3D scenes for robotics development from a text or image prompt. World Labs recently published the case study "Scaling Robotic Simulation with Marble," showing how researchers are using Marble's generative worlds to accelerate robot training, testing, and sim-to-real transfer.

In this tutorial, we'll walk through an end-to-end workflow:
Scene export: Export an existing scene from the Marble gallery as Gaussian splats (PLY) and a collider mesh (GLB).
Scene conversion: Convert the Marble outputs to USD format using NVIDIA Omniverse NuRec.
Scene import and construction: Import into NVIDIA Isaac Sim.
Simulation in Isaac Sim: Add a robot and run the simulation.

By the end, you'll have a realistic virtual environment where robots can interact physically, all generated far more quickly than by traditional methods. Let's dive in.

Step 1: Get a 3D kitchen scene from World Labs Marble

World Labs Marble produces rich visual detail and geometric data like depth and surface normals, along with an exportable collider mesh for physical simulation. For this tutorial, instead of generating a new kitchen from scratch, we'll use a pre-made Marble kitchen scene that's available in Marble's example gallery. This saves time and ensures we have a realistic environment ready to go. The chosen scene is a detailed kitchen and living room interior, complete with furniture and typical kitchen items.

Steps to export the kitchen world from Marble:
Log in to Marble: Sign in to your Marble account on the web. Once logged in, navigate to the pre-made kitchen scene.
Open the scene: Click on the world to load it in Marble's 3D viewer. You can explore it with WASD controls and the mouse, as if you were in a game, to verify it looks good.
Download the world: Find the Download button in the bottom bar of Marble's interface. Select "Splats (PLY)" to download a Gaussian splat representation. Marble's Gaussian splat is provided as a .ply file, which contains millions of semi-transparent particles representing the scene with high fidelity. Select "Collider Mesh (GLB)" to download the triangle mesh of the scene. This will contain the geometry of the kitchen as a standard glTF model.

Note that exporting PLY and GLB files in World Labs Marble requires a paid plan. If you don't have one, World Labs provides sample PLY and GLB files from its gallery. For this tutorial, we will use the kitchen scene PLY and GLB files as our example. Save the files as MarbleKitchenwithLight.ply and MarbleKitchenwithLight_collider.glb.

At this point, we have our kitchen environment in two forms, each serving a different purpose: the PLY captures the full visual detail of the scene, and the GLB provides the mesh geometry needed for physics and collisions in simulation.

Video 1. Exploring the Marble sample scene and downloading the PLY and GLB files

Step 2: Convert the downloaded PLY into USDZ

NVIDIA Isaac Sim uses Universal Scene Description (USD) as its scene format. To use our Marble-generated world in Isaac Sim, we need to convert the exported PLY into USD format.
We will then import it, taking advantage of NVIDIA Omniverse NuRec capabilities to render the point-based scene efficiently. At the core of NuRec is the 3DGUT algorithm for Gaussian-based reconstruction and rendering. The NVIDIA 3DGRUT repository contains a script to convert a .ply splat file into a USDZ file, a zip-compressed archive that contains USD-specific data. We will use this to convert our Marble PLY.

1. Set up 3DGRUT: Clone the 3DGRUT repository and install its environment. In this tutorial, we set up 3DGRUT inside a dedicated Conda environment named "3dgrut". The environment requires Linux with an NVIDIA GPU, CUDA 11.8+, and GCC 11 or lower. If you already have a Python environment with the needed libraries (PyTorch, etc.), you can alternatively just run the conversion Python script in that environment.

git clone --recursive https://github.com/nv-tlabs/3dgrut.git
cd 3dgrut
chmod +x install_env.sh
./install_env.sh 3dgrut
conda activate 3dgrut

2. Convert PLY to USDZ: Once 3DGRUT is set up, use the provided conversion script to turn the Marble point cloud into USDZ:

$ python -m threedgrut.export.scripts.ply_to_usd \
    /path/to/MarbleKitchenwithLight.ply \
    --output_file /path/to/MarbleKitchenwithLight.usdz

This command will read the .ply file and produce a .usdz file. USDZ uses a custom USD schema (an extension of UsdVolVolume) to represent the Gaussian splats in a way that Omniverse can render. Essentially, it embeds the point cloud as a volumetric primitive, preserving the visual fidelity of the Marble scene. For more details on NuRec neural volumes and how they are rendered in Omniverse, see the NuRec Rendering documentation.

Now, we have one USDZ file and one GLB file:
MarbleKitchenwithLight.usdz – the visual splat world
MarbleKitchenwithLight_collider.glb – the collider mesh we'll use for physics

Step 3: Import the USDZ/GLB into Isaac Sim and construct the scene

After generating the USDZ file, the next step is to bring the kitchen scene into Isaac Sim, align the mesh with the Gaussian splats, and add physics and lighting so it is ready for interaction. Since we are editing the scene contents, we need to extract the USDZ archive. Unzip the file, open the generated default.usda file, and then go through the following steps.

Geometrically align the Gaussian volume: We want to make sure that the origin of the imported scene and its scale match Isaac Sim. To do that:
Add a ground plane to the scene. This will be used as a reference for the ground of the imported Gaussian volume and serve as a smooth collider.
The imported Gaussian volume is contained in an "xform" primitive, which is used to transform the volume. To align the volume with the floor, select the xform primitive and adjust its "Translate" values so the floor of the kitchen sits exactly on the ground plane. Use the ground plane as a visual reference and move the Gaussian volume until the point cloud's floor coincides with it.
The generated scene may be smaller or larger than real-world scale. To roughly match the real-world scale, we can use a default cube, which has a 1-meter side length, as a visual reference. After inserting a cube object, we can adjust the overall X, Y, and Z scaling accordingly. For our example kitchen scene, a scaling factor of 2 gives roughly the right sizing, e.g., for the cabinet and stove.
Finally, fine-tune the rotation of the xform primitive to make sure the Gaussian point cloud aligns with the ground plane as accurately as possible.
A simple way to verify this is to use the tiles on the kitchen wall as a reference and rotate the Gaussian volume so that they are completely parallel to the ground plane we created. Once aligned, move the ground plane back down so it sits exactly at the kitchen floor level.

Video 2. Geometrically aligning the Gaussian volume

Add physics and lighting to the scene: Now that we have aligned the imported Gaussians, we want to add physics and lighting so that shadows and object interactions work as expected. We will use the cube that we previously created to adjust the scene scale again to test shadows and physics.
In the collision mesh of the ground plane, turn on the matte object property. This ensures it works properly as a shadow receiver.
Add a dome light to the scene.
Select the 'gauss' Volume prim in the stage window, then in the property window, scroll down to "Raw USD Properties" and click the triangle to reveal additional settings. Then scroll to the "proxy" field and click "Add Target." Finally, select the GroundPlane CollisionMesh as the target.

Video 3. Adding physics and lighting

Move the cube around to ensure shadows show up as expected. On setting the cube as a rigid body with colliders and hitting play in the simulation, the cube interacts with the ground plane as expected. However, it "goes through" the Gaussians. Let us now move on to setting up the physics of the Gaussian representation. The collision information for the Gaussians is in the GLB file. Import this mesh, align it with the Gaussian volume, and enable it as a collider.
Drag and drop the MarbleKitchenwithLight_collider.glb file under the Gaussian volume. Make sure it is under the Gaussian volume, as the hierarchy is important. The collider will show up in the scene.
Zoom out of the scene a little and set the X rotation to -90 to match the coordinate conventions of the Gaussian volume. Now the rendered volume and the collision mesh align completely.
Enable the physics collider preset for the imported collision mesh.
Turn off the visibility for the collider, as it is overlaid with the Gaussian volume. This affects only the visuals of the scene; physics will use the colliders we just set up.

Video 4. Importing the collision mesh

The geometry, physics, and lighting for the scene are now in good shape: The Gaussian volume provides the photoreal visuals, while the GLB collider and ground plane handle physics and shadows. The scene is now ready for a robot to be added.

Step 4: Add a robot and run the simulation

With the kitchen scene aligned and physics enabled, the final step is to add a robot and drive it around to validate the setup.
Drag and drop the NVIDIA Nova Carter robot into the scene.
Add a differential controller for the robot and enable keyboard control. This will create the necessary action graph, which allows us to use the keyboard to move the robot around.
Switch to a camera mounted on the robot and hit play.
Move the robot around with WASD and verify that it respects the kitchen geometry: It should rest on the floor, collide with counters and furniture, and not fall through the scene.

At this point, the Marble kitchen scene is fully integrated into Isaac Sim as a physics-enabled environment, and you can drive robots interactively through it.

Video 5. Adding a robot and navigating the scene

Summary

In this tutorial, we downloaded an AI-generated 3D environment complete with geometry, brought it into Isaac Sim as a simulation-ready scene, and set up robots in an AI-generated world.
The end-to-end workflow here can now be completed in mere hours. This ability to rapidly generate varied high-fidelity worlds unlocks more scalable robot development in simulation. With Marble and Isaac Sim, if you can describe a world, you'll likely be able to start testing in it the same day.

To learn more, try the following:
Create your own custom environment with World Labs Marble. You can start with a text description, a single image, multiple photos from different angles, or even a rough 3D layout.
Create your own custom environment from an input image and use it in Isaac Sim with Lyra, an NVIDIA research initiative on generative 3D scene reconstruction via a video diffusion model.
Learn more about simulation innovations and meet with NVIDIA experts at SIGGRAPH Asia, taking place Dec. 15 to 18 at the Hong Kong Convention and Exhibition Centre.

Tags: Agentic AI / Generative AI | Robotics | Simulation / Modeling / Design | General | Isaac Sim | Omniverse | Intermediate Technical | Tutorial | featured | Robot Navigation | Robot Perception | Robotics Compute | Robotics Simulation

About the Authors

About Wonsik Han
Wonsik Han is a senior product manager in the NVIDIA Autonomous Vehicle Group. He brings more than a decade of experience across strategy, business development, and product management roles at global automakers and an autonomous driving startup. Wonsik holds an MBA from Duke University.

About Rishabh Chadha
Rishabh Chadha is a technical marketing engineer at NVIDIA, where he focuses on integrating deep learning and robotics frameworks for the NVIDIA Jetson platforms. He has a master's degree in robotics from Worcester Polytechnic Institute. His interests primarily include deep learning, medical imaging, and robot perception.

About Isaac Deutsch
Isaac Deutsch is a senior research scientist at NVIDIA who brings together computer vision, imaging, and real-time computer graphics. He contributed to Instant-NGP, NuRec, and 3DGRUT. His current work focuses on computational photography for high-fidelity 3D capture. Isaac holds a master's degree in robotics from ETH Zurich and joined NVIDIA in 2018.

About Raffaello Bonghi
Raffaello Bonghi is a developer relations manager for AI & Robotics. Since 2015, he has been an NVIDIA Jetson Champ, designing multiple ROS-based robots for outdoor navigation and educational applications. Additionally, he has been involved in developing AI solutions for numerous international clients in the retail and robotics space. Raffaello holds a Ph.D. in control theory and industrial automation, with a deep focus on robotics.
Related posts
Simplify Generalist Robot Policy Evaluation in Simulation with NVIDIA Isaac Lab-Arena
3 Easy Ways to Supercharge Your Robotics Development Using OpenUSD
How to Instantly Render Real-World Scenes in Interactive Simulation
Building Custom Robot Simulations with Wandelbots NOVA and NVIDIA Isaac Sim
NVIDIA Isaac Sim on Omniverse Now Available in Open Beta
Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0
Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js
Designing Protein Binders Using the Generative Model Proteina-Complexa
Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air
Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics
Migrate Apache Spark Workloads to GPUs at Scale on Amazon EMR with Project Aether | NVIDIA Technical Blog nvidia_dev_blog 17.12.2025 19:00 0.683
Embedding sim.0.7974
Entity overlap0.1026
Title sim.0.1935
Time proximity0.7108
NLP type product_launch
NLP organization NVIDIA
NLP topic ai infrastructure
NLP country

Open original

Data is the fuel of modern business, but relying on older CPU-based Apache Spark pipelines exacts a heavy toll. They're inherently slow, require large infrastructure, and lead to massive cloud expenditure. As a result, GPU-accelerated Spark is becoming a leading solution, providing lightning-fast performance through parallel processing. This improved efficiency reduces cloud bills and saves valuable development hours.

Building on this foundation, we introduce a smart and efficient way to migrate existing CPU-based Spark workloads running on Amazon Elastic MapReduce (EMR). Project Aether is an NVIDIA tool engineered to automate this transition. It works by taking existing CPU jobs and optimizing them to run on GPU-accelerated EMR using the RAPIDS Accelerator for performance benefits.

What is Project Aether?

Figure 1. Project Aether overview showing workflow phases and services

Project Aether is a suite of microservices and processes designed to automate migration and optimization for the RAPIDS Accelerator, effectively eliminating manual friction. It aims to reduce the time to migrate Spark jobs from CPU to GPU through:
A prediction model for potential GPU speedup, with recommended bootstrap configurations.
Out-of-the-box testing and tuning of GPU jobs in a sandbox environment.
Smart optimization for cost and runtime.
Full integration with Amazon EMR supported workloads.

Amazon EMR integration

Now supporting the Amazon EMR platform, Project Aether automates the management of GPU test clusters and the conversion and optimization of Spark steps. Users can use the provided services to migrate existing EMR CPU Spark workloads to GPUs.

Setup and configuration

To get started, you'll need to meet the following prerequisites:
Amazon EMR on EC2: AWS account with GPU instance quotas
AWS CLI: Configured with aws configure
Aether NGC: Request access, configure credentials with ngc config set, and follow the Aether installation instructions.

Configure Aether for EMR

Once the Aether package is installed, configure the Aether client for the EMR platform using the following commands:

# Initialize and list config
$ aether config init
$ aether config list

# Select EMR platform and region
$ aether config set core.selected_platform emr
$ aether config set platform.emr.region <region>

# Set required EMR S3 paths
$ aether config set platform.emr.spark_event_log_dir <s3_path_for_event_logs>
$ aether config set platform.emr.cluster.artifacts_path <s3_path_for_uploading_aether_artifacts>
$ aether config set platform.emr.cluster.log_path <s3_path_for_cluster_log_uri>

Example Aether EMR migration workflow

The Aether CLI tool provides several modular commands for running the services. Each command displays a summary table and tracks each run in the job history database. At any point, refer to "4. Migrate: Report and recommendation" for viewing the tracked jobs. Use the --help option for more details on each aether command.

The example EMR workflow requires starting with an existing Spark step with step ID s-XXX that ran on a CPU EMR cluster with cluster ID j-XXX. For more information on submitting steps to EMR clusters, refer to the Amazon EMR documentation. The migration process is broken down into four core phases: predict, optimize, validate, and migrate.

1. Predict: Qualification

Determine a CPU Spark job's viability for GPU acceleration and generate initial optimization recommendations.
1. Predict: Qualification

Determine a CPU Spark job's viability for GPU acceleration and generate initial optimization recommendations. The qualification tool uses the QualX machine learning system's XGBoost model to predict potential GPU speedup and compatibility based on workload characteristics derived from the CPU event log.

Input: CPU event log obtained from the EMR step and cluster API, or provided directly.
Output:
- Recommended Spark configuration parameters generated by the AutoTuner.
- Recommended GPU cluster shape with instance types and counts optimized for cost savings.
- Aether job ID to track this job and any subsequent job runs.

Commands:

# Option 1: Use platform IDs
$ aether qualify --platform_job_id <cpu_step_id> --cluster_id <cpu_cluster_id>

# Option 2: Provide event log path directly
$ aether qualify --event_log <s3_or_local_event_log_path>

2. Optimize: Automatic testing and tuning

Achieve optimal performance and cost savings by testing the job on a GPU cluster and iteratively tuning the Spark configuration parameters. Create the GPU test cluster with the cluster service, then optimize the GPU job with the tune service, which iteratively runs submit and profile:
- Submit: The job submission service submits the Spark job to a GPU cluster with the specified configurations.
- Profile: The profile service uses the profiling tool to process the GPU event logs, analyze bottlenecks, and generate new Spark configuration parameters to increase performance and/or reduce cost.

Input:
- Recommended Spark configuration parameters from the qualify output for the GPU job.
- Recommended GPU cluster shape from the qualify output to create the GPU cluster.
Output: The best GPU configuration is selected from the run with the lowest duration among all tuning iterations.

Commands:

A. Create a test EMR GPU cluster:

# Option 1: Use the recommended cluster shape ID with a default cluster configuration
$ aether cluster create --cluster_shape_id <recommended_cluster_shape_id_from_qualify>

# Option 2: Provide a custom configuration file
$ aether cluster create --cluster_shape_id <recommended_cluster_shape_id_from_qualify> --config_file <custom_cluster_yaml_file>

B. Submit the GPU step to the cluster:

# Submit the job to the cluster using config_id and cluster_id
$ aether submit --config_id <recommended_spark_config_id_from_qualify> --cluster_id <gpu_cluster_id_from_create>

C. Profile the GPU run to generate new recommended Spark configs:

# Profile the job using the step_id and cluster_id
$ aether profile --platform_job_id <gpu_step_id_from_submit> --cluster_id <gpu_cluster_id_from_create>

D. Tune the job iteratively (submit + profile loop):

# Tune the job for 3 iterations
$ aether tune --aether_job_id <aether_job_id> --cluster_id <gpu_cluster_id_from_create> --min_tuning_iterations 3

3. Validate: Data integrity check

Confirm the GPU job's output integrity by ensuring its results are identical to the original CPU job. The validate service compares key row metrics retrieved from the event logs, specifically rows read and rows written, between the best GPU run and the original CPU run.

Command:

# Validate the CPU and GPU job metrics
$ aether validate --aether_job_id <aether_job_id>

4. Migrate: Report and recommendation

View detailed reports of the tracked jobs in the job history database, and see per-job migration recommendations with the optimal Spark configuration parameters and GPU cluster configurations. The report service provides CLI and UI options to display:
- Key performance indicators (KPIs): The total speedup and total cost savings across all jobs.
- Job list: Per-job speedup, cost savings, and migration recommendations.
- Job details: All job run metrics and details for a job (the original CPU run and the GPU tuning runs).

Commands:

# List all job reports
$ aether report list

# View all job runs for a specific job
$ aether report job --aether_job_id <aether_job_id>

# Start the Aether UI to view the reports in a browser
$ aether report ui

Figure 2. Example screenshot of the Aether report UI job details

Figure 3. Example screenshot of the Aether report UI GPU config details

5. Automated run

Combine all of the individual services above into a single automated Aether run command:

# Run the full Aether workflow on a CPU event log
$ aether run --event_log <s3_or_local_event_log_path>

Conclusion

Project Aether is a powerful tool for accelerating big data processing, reducing the time and cost associated with migrating and running large-scale Apache Spark workloads on GPUs. To try it out for large-scale migrations of Apache Spark workloads, apply for Project Aether access. To learn more about the RAPIDS plugin, see the documentation for the RAPIDS Accelerator for Apache Spark.

Tags: Data Center / Cloud | Data Science | Developer Tools & Techniques | General | RAPIDS | Intermediate Technical | Tutorial | AWS | featured

About the Authors

Navin Kumar is a senior distributed systems engineer at NVIDIA, working on the Spark RAPIDS Accelerator team. He's the chief architect of Project Aether, a service for accelerating the migration of Apache Spark jobs from CPU to GPU. Navin holds a B.S. in Computer Science from Cornell University and an M.S. in Information Networking from Carnegie Mellon University. Previously, he was the architect of a large-scale Apache Spark-based data pipeline at security startup Fletch (acquired by F5).

Sean Yang is a system software engineer at NVIDIA, working on Project Aether with the Spark RAPIDS Accelerator team. Previously, he worked on federated learning systems with the NVIDIA FLARE engineering team. He holds a B.S. in computer science from the University of California, Berkeley, with a focus on machine learning and distributed systems.

Sayed Bilal Bari is a system software engineer at NVIDIA, working on tooling for the Spark RAPIDS Accelerator and Project Aether. He holds an M.S. in Computer Science from Stony Brook University. With a keen interest in big data processing and frameworks, he has previously worked as a senior data engineer for companies like Intuit and Walmart Labs.
Solving Large-Scale Linear Sparse Problems with NVIDIA cuDSS | NVIDIA Technical Blog nvidia_dev_blog 17.12.2025 18:30 0.683
Embedding sim. 0.7641
Entity overlap 0.1818
Title sim. 0.2727
Time proximity 0.8482
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers to their limits and require unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built to let users run sparse solvers at massive scale with minimal code changes, unlocking breakthrough speed and efficiency for next-generation engineering and design. You can combine CPU and GPU memory in hybrid memory mode to run larger problems that would not otherwise fit in a single GPU's memory, run a workload across multiple GPUs, or even scale to multiple nodes. This blog discusses cuDSS strategies for solving large-scale problems.

Getting started

This blog assumes you already have a working code that uses cuDSS. You may have also explored the introductory examples on GitHub (here and here) that demonstrate running cuDSS on a single GPU and adjusting default solution parameters using the Get and Set functions. These examples cover creating matrices and the main cuDSS objects, and executing the three core phases of cuDSS: analysis, numerical factorization, and solution. Thanks to the increased memory capacity of recent GPU generations, even a single GPU can handle fairly large sparse problems. However, when tackling truly massive problems—on the order of over 10 million rows and over a billion nonzeros—there are effective strategies to make cuDSS run fast and efficiently. The first approach still uses a single GPU but introduces techniques to address these bigger challenges without major code changes.

Rethink your data types: Why INT64 matters now

When you create a dense or sparse matrix for cuDSS, you will typically use one or both of two functions: cudssMatrixCreateDn() and cudssMatrixCreateCsr(). From the documentation, their signatures are:

cudssStatus_t cudssMatrixCreateDn(
    cudssMatrix_t *matrix,
    int64_t nrows,
    int64_t ncols,
    int64_t ld,
    void *values,
    cudaDataType_t valueType,
    cudssLayout_t layout
)

cudssStatus_t cudssMatrixCreateCsr(
    cudssMatrix_t *matrix,
    int64_t nrows,
    int64_t ncols,
    int64_t nnz,
    void *rowStart,
    void *rowEnd,
    void *colIndices,
    void *values,
    cudaDataType_t indexType,
    cudaDataType_t valueType,
    cudssMatrixType_t mtype,
    cudssMatrixViewType_t mview,
    cudssIndexBase_t indexBase
)

In cuDSS versions before 0.7.0, indices of sparse matrices could only use 32-bit integers. Specifically, the underlying data type for rowStart, rowEnd, and colIndices could only be int, and the indexType parameter could only be CUDA_R_32I. From cuDSS 0.7.0 onward, you can solve bigger problems by using 64-bit integer indexing arrays of type int64_t and CUDA_R_64I for the indexType. Note: The input matrix is still limited to fewer than 2^31 rows and columns, but with 64-bit indices it can have many more nonzeros.
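As a minimal illustration of the 64-bit path, the fragment below fills in the cudssMatrixCreateCsr() signature above with int64_t index buffers. It is a sketch, not a complete program: the device buffers are assumed to be allocated and populated elsewhere, and the final three enum arguments are the standard general-matrix choices from the cuDSS headers.

/* Sketch: CSR matrix with 64-bit indices (requires cuDSS 0.7.0+).
 * Assumes d_rowStart and d_colIndices are device arrays of int64_t,
 * and d_values a device array of double, already populated. */
cudssMatrix_t A;
int64_t nrows = 20000000;        /* rows/cols must stay below 2^31 */
int64_t ncols = 20000000;
int64_t nnz   = 3000000000LL;    /* nonzeros may now exceed 2^31 */
cudssStatus_t status = cudssMatrixCreateCsr(
    &A, nrows, ncols, nnz,
    d_rowStart, NULL,            /* end-pointer array unused in this sketch */
    d_colIndices, d_values,
    CUDA_R_64I,                  /* 64-bit indexType instead of CUDA_R_32I */
    CUDA_R_64F,                  /* value type */
    CUDSS_MTYPE_GENERAL, CUDSS_MVIEW_FULL, CUDSS_INDEX_BASE_ZERO);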
Hybrid memory mode: Blurring the line between CPU and GPU

cuDSS hybrid memory mode is designed to overcome the memory limitations of a single GPU when solving extremely large sparse linear problems by using both GPU and CPU memory. There is a tradeoff, however: data transfer between CPU and GPU takes time and is governed by bus bandwidth. While you get to tackle bigger problems, you should expect some performance hit from these transfers. That said, thanks to modern NVIDIA driver optimizations and fast CPU/GPU interconnects (such as those in NVIDIA Grace Blackwell nodes), the penalty is manageable—and for certain problem sizes, hybrid memory performance scales impressively.

Hybrid memory mode is off by default, so the first step is to call cudssConfigSet() to set CUDSS_CONFIG_HYBRID_MODE, which tells cuDSS to use hybrid memory mode. Note that this must be done before the first (analysis) call to cudssExecute(). By default, cuDSS manages device memory automatically and will use as much as it needs—up to everything the GPU contains. Alternatively, users can specify a smaller memory footprint by setting a user-defined limit, anywhere from the value of CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN up to the device memory available after the analysis (symbolic factorization) phase, which can be queried via the NVIDIA CUDA Runtime API cudaMemGetInfo. A few highlights to note:
- Even with hybrid memory on, cuDSS first attempts to utilize device memory (and avoids using CPU memory if possible) to achieve the best performance.
- Best performance is achieved by using the maximum GPU memory, which minimizes memory transfers between the CPU and the GPU.
- The hybrid memory limit can be set per device (as shown in a later code block).

The example code walks you through fetching minimum device memory requirements and setting your memory limits accordingly, giving you fine control over memory footprints.

...
/* Enable hybrid mode where factors are stored in host memory.
   Note: It must be set before the first call to the ANALYSIS step. */
int hybrid_mode = 1;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                                    &hybrid_mode, sizeof(hybrid_mode)),
                     status, "cudssConfigSet CUDSS_CONFIG_HYBRID_MODE");

/* Symbolic factorization */
...

/* (optional) The user can query the minimal amount of device memory
   sufficient for the hybrid memory mode.
   Note: By default, cuDSS would attempt to use all available device
   memory if needed. */
size_t sizeWritten;
int64_t device_memory_min;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                  CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                  &device_memory_min, sizeof(device_memory_min),
                                  &sizeWritten),
                     status, "cudssDataGet for CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN");
printf("cuDSS example: minimum amount of device memory\n"
       "for the hybrid memory mode is %ld bytes\n", device_memory_min);

/* (optional) The user can specify how much device memory is available
   for cuDSS.
   Note: By default, cuDSS would attempt to use all available device
   memory if needed. */
int64_t hybrid_device_memory_limit = 40 * 1024; // in bytes = 40 KB
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                                    &hybrid_device_memory_limit,
                                    sizeof(hybrid_device_memory_limit)),
                     status, "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
printf("cuDSS example: set the upper limit on device memory\n"
       "for the hybrid memory mode to %ld bytes\n", hybrid_device_memory_limit);

/* Factorization */
...
/* Solving */
...

The first cudssConfigSet() call enables hybrid memory mode before the first analysis step, symbolic factorization. This is followed by cudssDataGet() to find the minimal amount of device memory sufficient for hybrid memory mode. A second cudssConfigSet() call then specifies the amount of device memory cuDSS may use.
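Rather than hard-coding a limit as the example does, you can derive it from the free-memory query mentioned above. The following sketch assumes solverConfig and device_memory_min exist as in the example; the 90% headroom factor is an illustrative choice, not a cuDSS requirement:

/* Sketch: size the hybrid-memory limit from free device memory,
 * after the analysis phase. The 0.9 headroom factor is arbitrary. */
size_t free_bytes = 0, total_bytes = 0;
cudaMemGetInfo(&free_bytes, &total_bytes);
int64_t limit = (int64_t)(0.9 * (double)free_bytes);
if (limit < device_memory_min)   /* never go below the queried minimum */
    limit = device_memory_min;
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
               &limit, sizeof(limit));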
Note that the automatic memory management can sometimes result in out-of-memory (OOM) errors. For developers integrating this, the debugging tips in the documentation are gold—save yourself some headaches by giving them a read.

Hybrid memory mode performance depends on the CPU/GPU memory bandwidth available to move data between CPU and GPU. To illustrate this, Figure 1 shows the factorization and solve speedup for matrices ranging from 1 million to 18 million rows, solved using cuDSS's hybrid memory mode. The baseline is a single NVIDIA B200 GPU node; the observed speedup compares the same model executed on a Grace Blackwell node to an x86 Blackwell node, reflecting the memory bandwidth ratio between the two nodes.

Figure 1. Speedup of factorization and solution phases for GB200 vs. B200 (cuDSS in hybrid memory mode using the minimum required device memory): B200 + Grace (72 cores, 480 GB) vs. B200 + x86 CPU (112 cores)

With the INT64 and hybrid memory mode coding strategies, we can accommodate large problem sizes and use all the memory available on the node when we need it. But we are still limited to a single GPU. The next strategy lets us use more GPUs to accommodate larger problems—and also to solve problems of a fixed size faster.

Multiply your muscle: Multi-GPU mode (MG mode)

cuDSS multi-GPU mode (MG mode) allows the developer to use all of the GPUs in a single node without specifying any distributed communication layer; cuDSS handles all the communication needed internally. It is helpful in three scenarios:
- When the problem is too large to fit on a single device (with or without hybrid memory).
- When the user wants to avoid the performance penalty of hybrid memory mode.
- When the user is focused on strong scaling—solving the problem across more GPUs to reach a solution faster.

The highlight of MG mode is that the developer does not need to specify a communication layer: no MPI, no NCCL, no other communication layer. cuDSS does all of this for you. Additionally, because CUDA-aware MPI communication is limited on Windows nodes, MG mode becomes particularly valuable for applications running on Windows.

Figure 2 illustrates the time (in seconds) required to solve an approximately 30-million-row matrix on an NVIDIA DGX H200 node across one-, two-, and four-GPU configurations, with factorization time on the top chart and solve time on the bottom chart. The initial computation was performed on a single GPU, followed by runs using two and four GPUs with MG mode. As shown, solving the model with two GPUs significantly reduces computation time compared to a single GPU, albeit at the cost of increased GPU resource usage.

Figure 2. Factorization and solve time on H200 for one-, two-, and four-GPU configurations using Cadence's MCAE applications. The matrix has approximately 31M rows and columns and approximately 1B nonzeros.

This example shows how to utilize MG mode. The relevant parts of the code are summarized below. Note that it includes code for using hybrid memory mode; this matters because if you use hybrid memory, you have to set the device memory limits on all of the devices that will be used.

...
/* Creating the cuDSS library handle */
cudssHandle_t handle;

/* Query the actual number of available devices */
int device_count = 0;
cuda_error = cudaGetDeviceCount(&device_count);
if (cuda_error != cudaSuccess || device_count <= 0) {
    printf("ERROR: no GPU devices found\n"); fflush(0);
    return -1;
}

/* device_indices can be set to NULL. In that case cuDSS will take devices
 * from 0 to (device_count - 1) */
int *device_indices = NULL;
device_indices = (int *)malloc(device_count * sizeof(int));
if (device_indices == NULL) {
    printf("ERROR: failed to allocate host memory\n"); fflush(0);
    return -1;
}
for (int i = 0; i < device_count; i++)
    device_indices[i] = i;
...
/* Initialize the cuDSS handle for multiple devices */
CUDSS_CALL_AND_CHECK(cudssCreateMg(&handle, device_count, device_indices),
                     status, "cudssCreate");
...
/* Creating cuDSS solver configuration and data objects */
cudssConfig_t solverConfig;
cudssData_t solverData;
CUDSS_CALL_AND_CHECK(cudssConfigCreate(&solverConfig), status,
                     "cudssConfigCreate");

/* Pass the same device_count and device_indices to solverConfig */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_COUNT, &device_count,
                                    sizeof(device_count)),
                     status, "cudssConfigSet for device_count");
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                                    CUDSS_CONFIG_DEVICE_INDICES, device_indices,
                                    device_count * sizeof(int)),
                     status, "cudssConfigSet for device_indices");
CUDSS_CALL_AND_CHECK(cudssDataCreate(handle, &solverData), status,
                     "cudssDataCreate");
...
/* Symbolic factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for analysis");
...
/* Querying CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN should be done for each device
 * separately by calling cudaSetDevice() prior to cudssDataGet.
 * The same applies to getting CUDSS_DATA_MEMORY_ESTIMATES and to setting
 * CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT with cudssConfigSet(). */
int default_device = 0;
cudaGetDevice(&default_device);
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    int64_t hybrid_device_memory_limit = 0;
    size_t sizeWritten;
    CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                                      CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                                      &hybrid_device_memory_limit,
                                      sizeof(hybrid_device_memory_limit),
                                      &sizeWritten),
                         status, "cudssDataGet for the memory estimates");
    printf("dev_id = %d CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN %ld bytes\n",
           device_indices[dev_id], hybrid_device_memory_limit);
}
/* cuDSS requires all API calls to be made on the default device, so
 * resetting the device context. */
cudaSetDevice(default_device);

/* Factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
                                  solverConfig, solverData, A, x, b),
                     status, "cudssExecute for factor");

/* Solving */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE, solverConfig,
                                  solverData, A, x, b),
                     status, "cudssExecute for solve");
...

Setting up MG mode is straightforward. It begins by finding the number of devices on the node and using them all, or just the specific number you want. The device indices are then set to 0 through device_count - 1 (the code uses the first device_count devices; change the indices if you want specific devices). You could easily have the number of devices and the device list come from the command line or a file to make your code more flexible.
After this, the MG-specific coding begins by calling cudssCreateMg() to initialize the cuDSS handle for multiple devices. Before you call a solution phase, you additionally need to initialize the cuDSS configuration with the device information. Specifically, after creating a cuDSS solver configuration object with cudssConfigCreate(), set the configuration details for MG mode using cudssConfigSet() for the following:
- CUDSS_CONFIG_DEVICE_COUNT, using device_count.
- CUDSS_CONFIG_DEVICE_INDICES, using the array device_indices.

Then use cudssDataCreate() to create the solverData object for cuDSS and perform the analysis stage. If you are using hybrid memory mode, you might want to set the device memory limits for each device separately prior to factorization, as shown in the code above. Once that's done, you can factorize the matrix and solve the problem. A highlight of MG mode is that you don't need to code any communication between the GPUs—cuDSS does all of that for you. However, there are some current limitations to MG mode:
- Using MG mode jointly with multi-GPU multi-node (MGMN) mode is not supported (the next section covers MGMN).
- Distributed input is not currently supported.
- MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MG mode does not support matrix batches.
- All phases in MG mode are synchronous.
- All data must be on the first device (rank 0) before calling cudssExecute(); cuDSS will then distribute the data to the other devices as needed.

Going big: Multi-GPU multi-node (MGMN) mode for distributed power

Now, what if one node isn't enough and you want to spread your computations across multiple nodes? That's where MGMN mode comes in. It requires a communication layer that, once added, lets you use any or all of the GPUs in a node as well as multiple nodes—with no limitations. This enables users to solve massive problems, or to use more GPUs to solve a problem faster. cuDSS uses an abstraction—a small communication "shim" layer that can be tailored to CUDA-aware Open MPI, NVIDIA NCCL, or even a custom backend if you want. The example for MGMN has the code for both Open MPI and NCCL; if you wish to use your own communication layer, there is an explanation of how to do that.

To illustrate how the communication layer is used, an ifdef code block from the example is presented below for both the MPI and NCCL code paths. Some constants defined during compilation are important for this example but aren't shown in the code block: USE_MPI and USE_NCCL, which determine which code paths are used. This ifdef block corresponds to lines 520-535 in the sample code (these line numbers could change with subsequent versions, so check them carefully).

#ifdef USE_MPI
#if USE_OPENMPI
    if (strcmp(comm_backend_name, "openmpi") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          mpi_comm, sizeof(MPI_Comm*)),
                             status, "cudssDataSet for OpenMPI comm");
    }
#endif
#if USE_NCCL
    if (strcmp(comm_backend_name, "nccl") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                                          nccl_comm, sizeof(ncclComm_t*)),
                             status, "cudssDataSet for NCCL comm");
    }
#endif
#endif

Note that the code changes needed to select MPI or NCCL are minimal, and the differences between the two are simple. You can plug in your own communication layer in much the same way.
Once you define the communicator pointer, passed to cuDSS via CUDSS_DATA_COMM as shown in the previous code snippet, there is no further need to call any communication layer functions unless your code specifically needs them. cuDSS uses the defined communication layer under the covers, so you don't need to code for it. Look through the example code to see how more than one node is used. For implementing your own communication layer, a good introductory discussion can be found in the cuDSS documentation under advanced topics. A high-level overview of the communication layer requirements:
- MGMN mode is enabled by abstracting away all communication-specific primitives into a small, separately built shim communication layer.
- Users can provide their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).
- Enabling MGMN execution in cuDSS does not require any changes for applications that do not use MGMN mode.
- MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix, dense right-hand side, or solution, using the cudssMatrixSetDistributedRow1d() function.
- cuDSS MGMN mode optionally accepts pre-distributed input and can optionally create distributed output. You can keep both A and b on the rank 0 device, in which case cuDSS will distribute them, or you can tell cuDSS how the data is distributed across the devices and nodes with the cudssMatrixSetDistributedRow1d() function. The developer has to make sure the data is in the proper location on the proper node and device.

A critical step for good performance is to carefully choose your CPU:GPU:NIC bindings. This is not discussed here but is documented elsewhere. There are some current limitations to MGMN mode:
- MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MGMN mode does not support matrix batches.
- All phases in MGMN mode are synchronous.

Takeaways

Sparse linear systems appear in many disciplines. Fueled by the need to solve real-life problems, the overall size of these problems is growing at a very rapid rate, and developers must find ways to solve them both efficiently and quickly. NVIDIA cuDSS provides an easy-to-use library for solving increasingly massive problems on NVIDIA GPUs. For more features you can use with cuDSS, read through the advanced features section of the documentation. It contains more information about the features presented here, as well as other capabilities to help you solve your large sparse linear problems. There is also a section that explains how to do logging with cuDSS as you develop your code—a great resource, since debugging parallel code can be challenging and cuDSS has strong capabilities for capturing log information as your code executes. Subscribe to cuDSS on the customer page to stay updated on the most recent innovations.

Tags: Data Science | Simulation / Modeling / Design | HPC / Scientific Computing | Blackwell | CUDA | GB200 | Grace CPU | Intermediate Technical | Tutorial | featured | Linear Algebra

About the Authors

Jeff Layton is a technical marketing engineer on the CAE/EDA software team at NVIDIA. Prior to NVIDIA, Jeff was a professor in aeronautics and an engineer at Boeing and Lockheed Martin. He then jumped into the high-performance computing world, starting with Linux Networx, Panasas (now VDURA), Dell, Intel, and Amazon Web Services. He has a Ph.D.
from Purdue University in aeronautics and astronautics, and his career has revolved around using HPC to solve problems, especially in the aerospace world, focusing on multidisciplinary optimization (MDO). He is also an independent author who has been writing about HPC for over 20 years.

Azi Riahi is a principal product manager for NVIDIA math libraries. Prior to joining NVIDIA, Azi served as lead product manager for the TPU compiler stack at Google, where she supported the launch of multiple TPU platforms and led key initiatives to improve efficiency and usability within the XLA TPU compiler. Azi holds a Ph.D. in computational mechanics from the University of Toronto and has contributed her expertise as a reviewer and consultant for the U.S. Department of Energy and several national laboratories.
Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11 | NVIDIA Technical Blog nvidia_dev_blog 16.12.2025 18:00 0.675
Embedding sim. 0.7532
Entity overlap 0.1333
Title sim. 0.1971
Time proximity 0.994
NLP type: product_launch
NLP organization: nvidia
NLP topic: quantum computing
NLP country:

Open original

Simulating large-scale quantum computers has become more difficult as the quality of quantum processing units (QPUs) improves. Validating the results is key to ensuring that, after the devices scale beyond what is classically simulable, we can still trust the outputs. Similarly, when generating large-scale datasets for the various AI models that aim to aid in the operation of quantum processors, we see the need to offer useful training data at all scales and abstractions, accelerated by GPUs. Examples include AI quantum error correction decoders, AI compilers, AI agents for calibration and control, and models to generate new device designs. The cuQuantum SDK is a set of high-performance libraries and tools for accelerating quantum computing simulations at both the circuit and device levels by orders of magnitude. The latest version of the cuQuantum SDK, v25.11, introduces components that accelerate two new workloads: Pauli propagation and stabilizer simulations. Each of these is critical for simulating large-scale quantum computers. This post dives into how you can start running Pauli propagation simulations and accelerate sampling from your stabilizer simulations to solve these problems with GPU-accelerated supercomputers.

cuQuantum cuPauliProp

Pauli propagation is a relatively new method for efficiently simulating the observables of large-scale quantum circuits, which can include noise models of real quantum processors. By expressing states and observables as weighted sums of Pauli tensor products, circuit simulation can dynamically discard terms that contribute insignificantly to a sought expectation value. This permits estimation of experimental quantities that are otherwise intractable for exact simulation. Many relevant quantum computing applications center on the computation of expectation values, for example VQE and quantum simulation of physical dynamics. Various exact and approximate classical simulation techniques enable calculating such observables for large circuits, though they become prohibitively expensive in differing settings. For example, the Matrix Product State technique, a very popular approximate tensor network state method for circuit simulation, is typically ill-suited for large circuits that encode the dynamics of two- or three-dimensional physical systems. Pauli propagation is a complementary and useful addition to the approximate circuit simulation toolbox, for both pure and noisy circuits. Beyond being provably efficient for simulating near-Clifford and/or very noisy circuits, Pauli propagation has shown impressive performance when simulating circuits that Trotterize the evolution of certain quantum spin systems. This includes some "utility circuits," named in reference to their use in IBM's utility experiment involving a 127-qubit device, as detailed in Evidence for the Utility of Quantum Computing Before Fault Tolerance. Characterizing which circuits can be efficiently simulated with Pauli propagation is an ongoing research effort, as significant as the refinement of the algorithmic details of the method itself. cuQuantum 25.11 offers primitives to accelerate Pauli propagation and derivative methods on NVIDIA GPUs with the release of this new library, enabling developers and researchers to advance the frontier of classical circuit simulation. Core functions are described in the following sections.
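Before walking through the API, it helps to see where the growing sums of Pauli products come from. The identity below is a standard textbook fact about Pauli rotations (not specific to cuQuantum): conjugating a Pauli observable \(O\) by a rotation \(R_P(\theta) = e^{-i\theta P/2}\) either leaves it unchanged or splits it into two weighted terms.

\[
R_P^\dagger \, O \, R_P =
\begin{cases}
O, & [O, P] = 0 \\
\cos(\theta)\, O \;+\; i\,\sin(\theta)\, P O, & \{O, P\} = 0
\end{cases}
\]

Since \(PO\) is itself (up to a phase) another Pauli string, each non-commuting rotation at most doubles the number of terms in the expansion, and it is these \(\cos\theta\)/\(\sin\theta\) coefficients that the truncation threshold prunes when they become negligible.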
Library initialization

Initialize the library handle and workspace descriptor required for operations:

import cupy as cp
from cuquantum.bindings import cupauliprop
from cuquantum import cudaDataType

# Create library handle and workspace descriptor
handle = cupauliprop.create()
workspace = cupauliprop.create_workspace_descriptor(handle)

# Assign GPU memory to workspace
ws_size = 1024 * 1024 * 64  # Example: 64 MiB
d_ws = cp.cuda.alloc(ws_size)
cupauliprop.workspace_set_memory(
    handle, workspace,
    cupauliprop.Memspace.DEVICE,
    cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH,
    d_ws.ptr, ws_size
)

Define an observable

To start the simulation, allocate device memory for the Pauli expansions (sums of products of Pauli operators, expressed as a set of unsigned integers plus their coefficients) and initialize the input expansion with an observable (for example, \(Z_{62}\)).

import numpy as np

num_qubits = 127  # e.g., the 127-qubit utility circuit discussed above

# Helper to encode a Pauli string into packed integers (2 bits per qubit: X and Z masks)
def encode_pauli(num_qubits, paulis, qubits):
    num_ints = cupauliprop.get_num_packed_integers(num_qubits)
    # Packed integer format: [X_ints..., Z_ints...]
    packed = np.zeros(num_ints * 2, dtype=np.uint64)
    x_mask, z_mask = packed[:num_ints], packed[num_ints:]
    for p, q in zip(paulis, qubits):
        idx, bit = divmod(q, 64)
        if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):
            x_mask[idx] |= (1 << bit)
        if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):
            z_mask[idx] |= (1 << bit)
    return packed

# 1. Allocate device buffers
# Define capacity (max number of Pauli strings) and allocate buffers
max_terms = 10000
num_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)
d_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order="C")
d_coef = cp.zeros(max_terms, dtype=cp.float64, order="C")

# 2. Populate the initial observable (Z_62)
encoded_pauli = encode_pauli(num_qubits, [cupauliprop.PauliKind.PAULI_Z], [62])
# Assign the first term
d_pauli[0] = cp.array(encoded_pauli)
d_coef[0] = 1.0

# 3. Create Pauli expansions
# Input expansion: pre-populated with our observable
expansion_in = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli.data.ptr, d_pauli.nbytes,
    d_coef.data.ptr, d_coef.nbytes,
    cudaDataType.CUDA_R_64F,
    1, 1, 1  # num_terms=1, is_sorted=True, is_unique=True
)

# Output expansion: empty initially (num_terms=0), needs its own buffers
d_pauli_out = cp.zeros_like(d_pauli)
d_coef_out = cp.zeros_like(d_coef)
expansion_out = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli_out.data.ptr, d_pauli_out.nbytes,
    d_coef_out.data.ptr, d_coef_out.nbytes,
    cudaDataType.CUDA_R_64F,
    0, 0, 0
)

Operator creation

Define quantum gates or operators, such as a Pauli rotation \(e^{-i \frac{\theta}{2} P}\).

# Create a Z-rotation gate on qubit 0
theta = np.pi / 4  # rotation angle; pi/4 matches the benchmark in Figure 1
paulis = [cupauliprop.PauliKind.PAULI_Z]
qubits = [0]
gate = cupauliprop.create_pauli_rotation_gate_operator(
    handle, theta, 1, qubits, paulis
)

Operator application

Apply an operator (a gate or noise channel) to the expansion, evolving the system. Note that most applications work in the so-called Heisenberg picture, which means that the gates in the circuit are applied in reverse order to the observable. This also requires passing the adjoint argument as True when applying the operator.
# Get a view of the current terms in the input expansion
num_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_in)
view = cupauliprop.pauli_expansion_get_contiguous_range(
    handle, expansion_in, 0, num_terms)

# Apply gate: in_expansion -> gate -> out_expansion
cupauliprop.pauli_expansion_view_compute_operator_application(
    handle, view, expansion_out, gate,
    True,          # adjoint?
    False, False,  # make_sorted?, keep_duplicates?
    0, None,       # Truncation strategies (optional)
    workspace
)

Expectation values

Compute the expectation value (the trace with the zero state, \(\langle 0 | O | 0 \rangle\)). For a Pauli string, this trace is nonzero only when the string contains no X or Y factors, so the result is simply the sum of the coefficients of the terms built from I and Z alone.

import numpy as np

result = np.zeros(1, dtype=np.float64)

# Compute trace
cupauliprop.pauli_expansion_view_compute_trace_with_zero_state(
    handle, view, result.ctypes.data, workspace
)

Combining these methods shows that NVIDIA DGX B200 GPUs offer significant speedups over CPU-based codes. For small coefficient cutoffs, multiple-order-of-magnitude speedups are observed over single-threaded Qiskit Pauli propagation on the most recent dual-socket data center CPUs.

Figure 1. cuQuantum GPU simulations for pi/4 rotations of the 127-qubit IBM utility circuit show multiple orders of magnitude speedups for a range of truncation schemes on NVIDIA DGX B200, compared to Qiskit PauliProp on an Intel Xeon Platinum 8570 CPU

cuQuantum cuStabilizer

Stabilizer simulations arise from the Gottesman-Knill theorem, which states that gates within the Clifford group (the normalizer of the qubit Pauli group) can be efficiently simulated classically in polynomial time. The Clifford group is generated by the CNOT, Hadamard, and phase (S) gates. For this reason, stabilizer simulations have been critical for resource estimation and for testing quantum error correcting codes at large scales. There are a few different approaches to building stabilizer simulators, from tableau simulators to frame simulators. cuStabilizer currently focuses on improving sampling throughput in a frame simulator. Frame simulation tracks only the effects of quantum noise on the quantum state. Because quantum devices are imperfect, it's possible to model the imperfections in circuit execution by inserting random "noisy" gates into it. If the noise-free result is known, getting the noisy result requires tracking only the difference—how the noisy gates change the circuit output. It turns out that this effect is much easier to compute than a full circuit simulation. The number of possible combinations of how noisy gates can be inserted grows very fast with the size of the circuit, which means that reliably modeling an error correcting algorithm requires a large number of shots. For users interested in developing quantum error correcting codes, testing new decoders, or generating data for AI decoders, frame simulation is ideal. APIs are available to improve sampling and accelerate any frame simulation on NVIDIA GPUs. The cuQuantum SDK cuStabilizer library exposes a C API and a Python API. While the C API provides better performance, the Python API is best for getting started, as it is more flexible and handles memory allocation for the user.

Create a circuit and apply frame simulation

cuStabilizer has two main classes involved in the simulation: Circuit and FrameSimulator. The circuit can accept a string that contains circuit instructions, similar to the format used in the Stim CPU simulator. To create a FrameSimulator, you need to specify information about the circuit so it can allocate enough resources.
import cuquantum.stabilizer as cust

# Circuit information
num_qubits = 5
num_shots = 10_000
num_measurements = 2

# Create a circuit on GPU
circ = cust.Circuit("""
H 0 1
X_ERROR(0.1) 1 2
DEPOLARIZE2(0.5) 2 3
CX 0 1 2 3
M 0 3
""")

sim = cust.FrameSimulator(
    num_qubits, num_shots, num_measurements
)
sim.apply(circ)

You can reuse a simulator between different circuits, as long as your simulator has enough qubits available. The following code applies a circuit to the state modified by the first circuit, circ:

circ2 = cust.Circuit("""
Z_ERROR(0.01) 1 4
""")
sim.apply(circ2)

Read simulation results

The state of the simulator consists of three bit tables:
- x_bits
- z_bits
- measurement_bits

The first two tables store the Pauli frame (similar to the cuPauliProp Pauli expansion, but in a different layout and without the weights). The third stores the difference between the noise-free measurement and the noisy measurement in each shot. The most efficient way to store the bits is to encode them in an integer value. This is referred to as the "bit-packed" format, where each byte in memory stores eight significant bits. While this format is the most efficient, manipulating individual bits requires extra steps in your program, and the bit-packed format does not integrate easily with the common notion of an "array," whose elements are normally whole values of one or more bytes, such as int32. To provide an easy representation in NumPy, cuStabilizer supports the bit_packed argument, which toggles between the two formats. If bit_packed=False, each bit is encoded in one uint8 value, thus using 8x more memory. When specifying input bit tables, the format is also important for performance, as described in the cuQuantum documentation.

# Get measurement flips
m_table = sim.get_measurement_bits(bit_packed=False)
print(m_table.dtype)  # uint8
print(m_table.shape)  # (2, 10000)
print(m_table)
# [[0 0 0 ... 0 0 0]
#  [1 0 0 ... 0 1 1]]

x_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)
print(x_table.dtype)  # uint8
print(x_table.shape)  # (5, 1252)
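The memory tradeoff between the two layouts is easy to reproduce with NumPy alone. The toy round-trip below is independent of cuStabilizer (and cuStabilizer's internal bit ordering is not specified here), so treat it purely as an illustration of the layout, not of the exact bytes the library returns:

import numpy as np

rng = np.random.default_rng(0)
# A table of 0/1 flags, like bit_packed=False output: one uint8 per bit
flags = rng.integers(0, 2, size=(2, 10_000), dtype=np.uint8)

packed = np.packbits(flags, axis=1)  # 8 flags per byte -> shape (2, 1250)
restored = np.unpackbits(packed, axis=1, count=flags.shape[1])

assert np.array_equal(flags, restored)
print(flags.nbytes, packed.nbytes)   # 20000 vs 2500 bytes: 8x smaller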
For easy access to the underlying Pauli frames, cuStabilizer provides a PauliTable class, which can be indexed by the shot index:

# Get the Pauli table
pauli_table = sim.get_pauli_table()
num_frames_print = 5
for i in range(num_frames_print):
    print(pauli_table[i])
# ...XZ
# ZXX..
# ...Z.
# .....
# ...Z.

Leveraging the sampling API drastically improves throughput compared to Google Stim, the state-of-the-art code on the latest data center CPUs.

Surface code simulation

cuStabilizer can accept Stim circuits as input, and you can use it to simulate surface code circuits:

import stim

p = 0.001
circ_stim = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=5,
    rounds=5,
    after_clifford_depolarization=p,
    after_reset_flip_probability=p,
    before_measure_flip_probability=p,
    before_round_data_depolarization=p,
)
circ = cust.Circuit(circ_stim)
sim = cust.FrameSimulator(
    circ_stim.num_qubits, num_shots, circ_stim.num_measurements,
    num_detectors=circ_stim.num_detectors,
)
sim.apply(circ)
pauli_table = sim.get_pauli_table()
for i in range(num_frames_print):
    print(pauli_table[i])

Note that the most efficient simulation is achieved with a large number of samples and qubits. Furthermore, the best performance is achieved when the resulting bit tables are kept on the GPU, as when using the cupy package. Figure 2 demonstrates the best use of cuStabilizer and the expected performance on an NVIDIA B200 GPU versus an Intel Xeon Platinum 8570 CPU. It shows that the optimal performance for code distance 31 is achieved at about a million shots. Users can get a 1,060x speedup for large code distances.

Figure 2. Runtime performance on surface codes of different distances with 1 million shots, comparing Stim plus cuStabilizer on an NVIDIA DGX B200 GPU with Stim on an Intel Xeon Platinum 8570 CPU

Get started with the new cuQuantum libraries

The latest functionality in cuQuantum continues to push the bounds of what is possible with GPU-based quantum computer emulation, enabling two major new classes of workloads. These workloads are critical for quantum error correction, verification and validation, and algorithm engineering for intermediate- to large-scale quantum devices. Get started with cuQuantum cuPauliProp using pip install cupauliprop-cu13; to learn more, review the cuPauliProp documentation. Get started with cuQuantum cuStabilizer using pip install custabilizer-cu13; to learn more, review the cuStabilizer documentation.

Tags: Data Center / Cloud | Developer Tools & Techniques | Simulation / Modeling / Design | HPC / Scientific Computing | cuQuantum | DGX | Intermediate Technical | Advanced Technical | Tutorial | featured | Python | Quantum Computing

About the Authors

Tom Lubowe is the product manager for quantum libraries at NVIDIA. Prior to joining, he led product focused on quantum computing, machine learning, and tensor networks for materials design at GenMat. Tom also worked at Xanadu and Rigetti in product management, product operations, and business development roles. Before that, he started a quantum machine learning company, Everettian Technologies, after working on FinTech products at SEI Investments.

Benedikt Kloss is a senior math libraries engineer at NVIDIA. He holds a PhD in Chemical Physics from Columbia University and worked as a postdoctoral researcher at the Center for Computational Physics at the Flatiron Institute. Before joining NVIDIA, his research focused on quantum dynamics and tensor network state methods.

Danylo Lykov is a senior math libraries engineer at NVIDIA. He holds a PhD in Computer Science from the University of Chicago. He has worked at quantum research groups at Argonne National Laboratory and JPMorgan Chase, among others. His work focuses on high-performance classical simulation of quantum computers and its use in designing and analyzing near-term quantum algorithms.

Tyson Jones is a senior math libraries engineer at NVIDIA. He holds a PhD in Material Science from the University of Oxford and worked as a postdoctoral and visiting researcher between the Swiss Federal Technology Institute of Lausanne and the University of Osaka. Tyson was also a scientific software developer for Quantum Motion and the UK National Quantum Computing Centre, and specializes in parallel, distributed, classical simulation of quantum computers and near-term quantum algorithms.

Daniel Lowell is a senior engineering manager at NVIDIA, overseeing the development of the cuQuantum SDK. He has a physics BS from the University of Colorado Boulder and an MS in computer science from Texas State University. Daniel has spent his career working in HPC and GPU-accelerated computing, and before working in quantum computing, managed the development of AMD's deep learning primitives library, MIOpen.
Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog nvidia_dev_blog 16.12.2025 21:00 0.659
Embedding sim. 0.7317
Entity overlap 0.0588
Title sim. 0.2463
Time proximity 0.9762
NLP type: other
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

For machine learning engineers deploying LLMs at scale, the equation is familiar and unforgiving: as context length increases, attention computation costs explode. Whether you're dealing with retrieval-augmented generation (RAG) pipelines, agentic AI workflows, or long-form content generation, the \(O(N^2)\) complexity of attention remains a primary bottleneck. This post explains a technique known as Skip Softmax, a hardware-friendly, drop-in sparse attention method that accelerates inference without any retraining. Read on to learn how Skip Softmax delivers up to 1.4x faster time-to-first-token (TTFT) and up to 1.4x faster time-per-output-token (TPOT), and how to get started with the technique in NVIDIA TensorRT LLM.

How does Skip Softmax work?

At its core, Skip Softmax provides a dynamic way to prune attention blocks. This is possible because it exploits a fundamental property of the softmax function: \(\exp(\text{small negative number}) \approx 0\). In standard FlashAttention, the GPU computes attention scores (logits) for blocks of queries (\(Q\)) and keys (\(K\)). It then applies softmax to normalize these scores into probabilities (\(P\)) and multiplies them by values (\(V\)). However, attention is intrinsically sparse. For many blocks, the attention scores are so low compared to the dominant tokens that their contribution to the final output is statistically negligible. Skip Softmax modifies the FlashAttention loop to detect these blocks early and simply skip them.

The Skip Softmax algorithm

Implemented directly within the FlashAttention kernel, the logic follows this heuristic:
1. Compute the local max: Calculate the maximum logit for the current block (\(Q \cdot K^T\)).
2. Compare to the running max: Check whether the difference between the current block's local max (\(m_i^{(j)}\)) and the running global max (\(m_i^{(j-1)}\)) exceeds a calibrated threshold (\(\lambda\)).
3. Skip: If the condition is met, the kernel skips the softmax and BMM2 calculation for that block and, crucially, skips loading the \(V\) block from high-bandwidth memory (HBM).
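To make the control flow concrete, here is a standalone NumPy sketch of the heuristic for a single query block. It is a deliberate simplification of what happens inside a fused FlashAttention kernel—no online-softmax rescaling across blocks is shown, and the threshold value is arbitrary rather than calibrated—so treat it as an illustration of the skip condition, not as TensorRT LLM code:

import numpy as np

rng = np.random.default_rng(0)
d, block, n_blocks = 64, 32, 8
q = rng.standard_normal((1, d))                # a single query row, for clarity
K = rng.standard_normal((n_blocks * block, d))
V = rng.standard_normal((n_blocks * block, d))
lam = 8.0                                      # threshold; arbitrary here, calibrated in practice

running_max = -np.inf
kept_logits, kept_v = [], []
for j in range(n_blocks):
    sl = slice(j * block, (j + 1) * block)
    logits = q @ K[sl].T                       # block of Q.K^T scores
    local_max = logits.max()
    running_max = max(running_max, local_max)
    if running_max - local_max > lam:          # block cannot matter: skip softmax,
        continue                               # BMM2, and the V load for this block
    kept_logits.append(logits)
    kept_v.append(V[sl])

logits = np.concatenate(kept_logits, axis=1)
p = np.exp(logits - logits.max())
p /= p.sum()
out = p @ np.concatenate(kept_v, axis=0)       # attention output over surviving blocks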
What are the benefits of using Skip Softmax?

Skip Softmax offers drop-in compatibility, hardware efficiency, flexibility, and versatility. Unlike approaches that need specific architectural modifications (such as linear attention), Skip Softmax is compatible with existing pretrained models that use standard attention mechanisms like MHA, GQA, or MLA. It is optimized to leverage the tensor cores and memory hierarchy of NVIDIA Hopper and NVIDIA Blackwell GPUs. It can also be integrated with other optimization methods; for instance, combining XAttention during prefill with Skip Softmax during decoding has been shown to deliver substantial speed improvements without compromising accuracy. Skip Softmax is versatile because it addresses bottlenecks in both the prefill and decode phases. Based on performance data on the Hopper and Blackwell architectures, Skip Softmax is beneficial during bandwidth-bound decoding and compute-bound prefilling, especially in long-context scenarios.

Bandwidth-bound decoding

During the generation (decode) phase, LLM inference is typically bound by memory bandwidth: the GPU spends more time moving KV cache data than computing.
- Benefit: By identifying unimportant blocks early, Skip Softmax avoids loading the associated \(V\) blocks entirely.
- Data: On Llama 3.3 70B (NVIDIA GB200 NVL72), Skip Softmax achieves a projected 1.36x end-to-end speedup during decoding.

Compute-bound prefilling

During the prefill phase (processing the input prompt), the system is compute-bound.
- Benefit: Skipping the softmax and the second matrix multiplication (BMM2) saves significant FLOPs.
- Data: For the same Llama 3.3 70B model (NVIDIA GB200 NVL72), prefill sees an estimated 1.4x end-to-end speedup at 128K context length.

Long-context scenarios

The efficacy of Skip Softmax increases with sequence length. The threshold for skipping is mathematically related to the context length (\(L\)) by the relationship \(\text{Threshold} \propto 1/L\). This means that, as the context grows, the opportunity to safely identify and skip sparse blocks increases.

The tradeoff between accuracy and sparsity

The obvious question for any approximation technique is, "How does this approach impact accuracy?" Extensive testing on the RULER (synthetic long-context) and LongBench (realistic long-context) benchmarks suggests a clear safe zone for sparsity:
- Safe zone: A 50% sparsity ratio (skipping half the blocks) is observed to be safe. In tests with Llama 3.1 8B and Qwen3-8B, running at ~50% sparsity resulted in near-lossless accuracy across most tasks.
- Danger zone: Pushing sparsity beyond 60% often leads to sharp accuracy drops, particularly in complex "needle-in-a-haystack" multikey tasks.
- Long generation: For tasks requiring long output generation, such as MATH-500, Skip Softmax maintains accuracy parity with dense attention, unlike some static KV cache compression methods.

Model | Dataset | Sparsity | Accuracy delta versus baseline
Llama 3.1 8B | RULER-16K | ~50% at prefill stage | -0.19%
Qwen3-8B | MATH500 | ~50% at decode stage | 0.36%
Table 1. Accuracy delta versus the baseline without sparsity

Scenario | Threshold | Speedup (BF16) | Baseline accuracy | Sparse accuracy | Accuracy delta
Context only | 0.2 | 130.63% | 37.21% | 36.74% | -0.47%
Context plus generation | 0.6 | 138.37% | 35.81% | 34.42% | -1.39%
Table 2. Speedup with the Qwen3-30B-Instruct model at a massive 128K sequence length

Additional optimizations available while deploying include:
- Automated calibration procedures to determine the optimal thresholds for target sparsity levels.
- Sparsity-aware training to make models more robust to sparse attention patterns.

Get started with Skip Softmax in NVIDIA TensorRT LLM

Skip Softmax attention is integrated directly into NVIDIA TensorRT LLM and supported on NVIDIA Hopper and NVIDIA Blackwell data center GPUs. This lets you further accelerate the attention computation on top of the state-of-the-art LLM inference performance powered by TensorRT LLM. Skip Softmax attention can be enabled through the sparse attention configuration of the LLM API:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig

sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0)

# Additionally, the threshold_scale_factor for prefill and decode can be configured separately.
sparse_attention_config = SkipSoftmaxAttentionConfig(
    threshold_scale_factor={"prefill": 1000.0, "decode": 500.0}
)

llm = LLM(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    sparse_attention_config=sparse_attention_config,
    # Other LLM arguments...
)

The actual threshold value equals the threshold_scale_factor divided by the context length. The configuration can also be specified through the extra LLM API options YAML file.
The configuration can also be specified through the extra LLM API options YAML file. An example to launch an OpenAI-compatible endpoint is shown below:

cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor: 1000.0
EOF

# Additionally, the threshold_scale_factor for prefill and decode can be configured separately.
cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor:
    prefill: 1000.0
    decode: 500.0
EOF

trtllm-serve Qwen/Qwen3-30B-A3B-Instruct-2507 --extra_llm_api_options extra_llm_api_options.yaml
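Once the server is up, any OpenAI-compatible client can query it. A minimal sketch, assuming trtllm-serve's default local address and port (adjust to match your deployment):

from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible API (default: localhost:8000);
# change base_url if you serve on a different host or port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[{"role": "user", "content": "Explain Skip Softmax in one sentence."}],
)
print(response.choices[0].message.content)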
Learn more

To learn more, see BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding. Skip Softmax Attention is supported in TensorRT LLM; for more details, see Accelerating Long-Context Inference with Skip Softmax Attention. The sparse attention kernels are also available in FlashInfer. Calibration will be supported by NVIDIA Model Optimizer, which enables users to specify a target sparsity and obtain the corresponding threshold scale factors.

About the Authors

Laikh Tewari is part of the AI Platform Software group at NVIDIA, where he manages products for optimizing LLM inference performance. Laikh received his B.S. and M.S. in computer science from Stanford University, where he specialized in systems and AI.

Kai Xu is a senior engineer with the Deep Learning Algorithm and Software team at NVIDIA, specializing in optimizing inference efficiency for generative AI. He was an early engineer at OmniML prior to its acquisition by NVIDIA. He received his Ph.D. in Computer Engineering from Arizona State University.

Bo Li is a senior DevTech Compute engineer at NVIDIA, working on accelerating AI at scale. His current focus is efficient LLM inference, spanning low-level GPU optimization to system design. He is also experienced in generative AI modeling and computer graphics. He received his master's degree in Computer Science from ETH Zurich and his bachelor's from Peking University.

Fred Oh is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek, and the NBA's Warriors.
Optimizing Semiconductor Defect Classification with Generative AI and Vision Foundation Models | NVIDIA Technical Blog nvidia_dev_blog 17.12.2025 02:00 0.648
Embedding sim.: 0.7359
Entity overlap: 0.0263
Title sim.: 0.2468
Time proximity: 0.812
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: computer vision
NLP country:

Open original

In the heart of every modern electronic device lies a silicon chip, built through a manufacturing process so precise that even a microscopic defect can determine success or failure. As semiconductor devices grow more complex, reliably detecting and classifying defects has become a critical bottleneck. Historically, chipmakers have relied on convolutional neural networks (CNNs) for automated defect classification (ADC). But as manufacturing scales and diversifies, CNN-based approaches are hitting their limits: they require large labeled datasets and frequent retraining, and they still struggle to generalize across new defect types.

In this post, we show how generative AI-powered ADC can overcome these challenges. The workflows below leverage NVIDIA Metropolis vision language models (VLMs), vision foundation models (VFMs), and the NVIDIA TAO fine-tuning toolkit to modernize defect classification. We outline the limitations of traditional CNN-based systems, detail how VLMs and VFMs address them, and highlight specific approaches and the manufacturing challenges they help solve.

The limits of CNNs in semiconductor defect classification

CNNs have long been the backbone of defect detection in semiconductor fabs, supporting optical and e-beam inspection, lithographic analysis, and more. They excel at extracting visual features from large datasets, but manufacturers face persistent challenges related to data requirements, semantic understanding, and retraining.

High data requirements: Achieving high accuracy often requires thousands of labeled images per defect class. Rare or emerging defects frequently lack sufficient examples for effective training.

Limited semantic understanding: While CNNs capture visual features, they cannot interpret context, perform root-cause analysis, or integrate multimodal data. They also struggle to differentiate visually similar yet operationally distinct defect patterns, such as center versus local defects.

Frequent retraining: Real-world manufacturing is dynamic. Process variations, new tools, and evolving product lines require models to be retrained frequently to recognize new defect types and imaging conditions.

These limitations force fabs to rely on manual inspection, which is costly, inconsistent, and unable to scale with today's manufacturing throughput.

Modernizing ADC with VLMs and VFMs

To address these challenges, NVIDIA applies VLMs, VFMs, and self-supervised learning across multiple stages of semiconductor manufacturing. Figure 1 illustrates how these models are deployed across front-end-of-line (FEOL) and back-end packaging processes. In this post, we demonstrate how VLMs classify wafer map images and how VFMs classify die-level images, including optical, e-beam, and back-end optical microscopy (OM) inspection data. With further training, VLMs also show strong potential for die-level inspection.

Figure 1. Examples of different image types that can be used in an automated defect classification (ADC) system enhanced with vision language models (VLMs) and vision foundation models (VFMs), including wafer defect maps and various die-level defects found in optical, e-beam, and optical microscopy (OM) images

Wafer-level intelligence with VLMs

Wafer maps provide a spatial view of defect distributions across an entire wafer. VLMs combine advanced image understanding with natural language reasoning.
After fine-tuning, NVIDIA reasoning VLMs, such as Cosmos Reason, can interpret wafer map images to identify macro defects, generate natural language explanations, perform interactive Q&A, and compare test images against "golden" references for preliminary root-cause analysis.

Figure 2. Left: the Cosmos Reason VLM automatically classifies this as a center ring wafer defect and attributes it to chemical contamination. Right: auto-labeling methods accelerate the training process, streamline defect analysis, and reduce manual visual inspection effort.

Using this approach offers several advantages:

Few-shot learning: VLMs can be fine-tuned with only a small number of labeled examples, enabling rapid adaptation to new defect patterns, process changes, or product variations.

Explainability: As shown in Figure 2, Cosmos Reason produces interpretable results that engineers can interact with using natural language. For example, asking "What is the primary defect pattern in this wafer map?" might return "Center ring defect detected, likely due to chemical contamination." This semantic reasoning ability goes beyond CNNs, helping engineers quickly identify potential root causes, accelerate corrective actions, and reduce the volume of manual reviews.

Automated data labeling: VLMs can generate high-quality labels for downstream ADC tasks, reducing the time and cost of model development. In practice, this approach can cut model build times by up to 2x compared to manual labeling workflows.

Time-series and lot-level analysis: VLMs can process both still images and video sequences, enabling them to proactively monitor process anomalies over time and mitigate errors before they lead to critical failures. In one study, VLMs achieved high accuracy across both OK and NG (defective) cases, outperforming traditional CNN-based methods.

Figure 3. The end-to-end workflow for fine-tuning the Cosmos Reason 1 model, covering data preparation, supervised fine-tuning on the curated dataset, and subsequent quantization and deployment for inference

Getting started with Cosmos Reason

Here's a sample workflow to fine-tune Cosmos Reason 1, from data preparation to supervised fine-tuning and evaluation on a prepared dataset of wafer map defects:

1. Go to the Cosmos Cookbook Wafer Map Anomaly Classification recipe.
2. Create a sample training dataset: Download the open WM-811k wafer map dataset produced by MIR Lab, which is available for public use, then generate a sample dataset and the corresponding annotations with the scripts provided in the cookbook.
3. Post-train with supervised fine-tuning (SFT): Follow the installation instructions provided in the cosmos-reason1 GitHub repository and install the cosmos-rl package to enable fine-tuning with the curated training dataset.
4. Deploy.

Result: Fine-tuning Cosmos Reason on wafer map defect classification data boosts accuracy from zero-shot levels to over 96% on defect classification tasks.
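To illustrate the kind of interactive Q&A described above, the following sketch sends a wafer-map image to an OpenAI-compatible VLM endpoint. The endpoint URL and model name are placeholders for wherever your fine-tuned Cosmos Reason deployment is served; this is not an official API:

import base64
from openai import OpenAI

# Placeholder endpoint and model name for a deployed, fine-tuned
# Cosmos Reason checkpoint; substitute your own serving details.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("wafer_map.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cosmos-reason-wafer-sft",  # hypothetical fine-tuned model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the primary defect pattern in this wafer map?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)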
Die-level precision with VFMs and self-supervised learning

The semiconductor industry continues to push the boundaries of physics as device features shrink to microscopic scales. At this level, manufacturing complexity rises dramatically. Even the slightest anomaly (a stray particle, pattern deviation, or material defect) can render a chip unusable, directly affecting yield and profitability. In this high-stakes environment, the biggest bottleneck is the ability to rapidly and accurately detect and classify defects.

CNNs have supported this workflow for years, but they struggle to keep pace with the growing complexity and data demands of modern fabs. A core challenge in training AI models for manufacturing is the dependence on large, meticulously labeled datasets. Dynamic processes, evolving product lines, and the continual emergence of new defect types make it impractical to maintain a perfectly labeled dataset. Compounding the issue, datasets are often highly imbalanced: normal samples vastly outnumber defective ones.

Using a leading VFM such as NV-DINOv2 provides several advantages:

Self-supervised learning (SSL): NV-DINOv2 is trained on millions of unlabeled images, enabling it to generalize to new defect types and process conditions with minimal retraining when labeled data is scarce.

Robust feature extraction: The model captures both fine-grained visual details and high-level semantic information, improving classification accuracy across diverse manufacturing scenarios.

Operational efficiency: By reducing dependence on labeling and frequent retraining, NV-DINOv2 streamlines the deployment and maintenance of defect-inspection systems in fast-moving fab environments.

However, general foundation models like NV-DINOv2 lack the domain-specific detail required for industrial imagery such as e-beam and optical microscopy images. To achieve maximum accuracy, the model must be specialized through domain adaptation. This is a multi-stage workflow:

General VFM: Begin with the powerful, pre-trained NV-DINOv2 model, which has broad visual understanding learned from large, diverse datasets.

Domain adaptation: Fine-tune the model using a large, unlabeled, domain-specific dataset, such as millions of images from semiconductor fabs, to align it with industrial imaging characteristics.

Downstream task fine-tuning: Apply a small set of labeled images to fine-tune the model for a specific classification task, a step known as linear probing.

Figure 4. The three-phase NV-DINOv2 workflow for building domain-adapted vision foundation models. Phase 1 (by NVIDIA) provides the general pre-trained model; Phases 2 and 3 (by users) enable domain adaptation and task-specific fine-tuning with minimal labeled data.

The effectiveness of this process depends heavily on the size and quality of the unlabeled domain dataset. These datasets can range from less than a million images to hundreds of millions, but quantity alone is not enough. A meticulous data-cleaning pipeline is essential to remove redundant, blurry, or irrelevant images before training begins.

This domain-adaptation approach delivers significant performance gains. In one study by a leading semiconductor manufacturer, the NVIDIA TAO Toolkit was used to apply self-supervised learning (SSL) to NV-DINOv2 using unlabeled images collected across multiple layers of the chip-production process. Incorporating SSL consistently improved performance, boosting accuracy by up to 8.9% compared to a model trained without SSL, which led to productivity gains of up to 9.9%.

Getting started with NV-DINOv2 and SSL

The following is an end-to-end workflow to fine-tune NV-DINOv2 using SSL, from data preparation and domain adaptation to downstream task fine-tuning and deployment. In this example, we use the NVIDIA TAO Toolkit to perform SSL on unlabeled PCB images for defect classification. The NV-DINOv2 workflow follows a progressive, three-phase approach that maximizes the value of large unlabeled datasets while reducing the need for manual annotation to only a few hundred labeled samples.
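Phase 3's "linear probing" simply trains a lightweight classifier head on frozen backbone features. Before walking through the TAO commands below, here is a minimal PyTorch sketch of that idea, with a stand-in backbone (not NV-DINOv2) and hypothetical dimensions:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen feature extractor plus a trainable linear head."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # Phase 2 features stay frozen
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)         # e.g., a ViT embedding
        return self.head(feats)

# Stand-in backbone producing 1024-dim features (not the real model).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
model = LinearProbe(backbone, feat_dim=1024, num_classes=6)  # 6 PCB classes
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)              # dummy image batch
y = torch.randint(0, 6, (4,))                # dummy labels
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"probe loss: {loss.item():.3f}")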
1. Set up your environment: Download the NVIDIA TAO Toolkit 6.0 container from NVIDIA NGC, which has all dependencies pre-installed:

# Pull the TAO Toolkit 6.0 container from NGC
docker pull nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt

# Run the container with GPU support
docker run --gpus all -it -v /path/to/data:/data \
  nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt /bin/bash

2. Prepare your dataset: NV-DINOv2 accepts RGB images in standard formats (JPG, PNG, BMP, TIFF, WebP) stored in a single directory. For SSL domain adaptation, you only need unlabeled images; no annotations are required. In our PCB inspection example, we used:
~400 labeled test samples for evaluation
~1 million unlabeled PCB images for domain adaptation
~600 labeled training samples for downstream fine-tuning

Organize your data as follows:

/data/
├── unlabeled_images/   # For SSL domain adaptation
├── train_images/       # For downstream fine-tuning
│   ├── OK/
│   ├── missing/
│   ├── shift/
│   ├── upside_down/
│   ├── poor_soldering/
│   └── foreign_object/
└── test_images/        # For evaluation

Data cleaning best practice: Before training, perform a meticulous data cleaning process to remove redundant, blurry, or irrelevant images. The effectiveness of domain adaptation depends heavily on the quality of your unlabeled dataset.

3. Configure the training specification: Create a YAML specification file that defines your model architecture, dataset paths, and training parameters:

model:
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    patch_size: 14
    img_size: 518
    drop_path_rate: 0.4
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /data/unlabeled_images
  test_dataset:
    images_dir: /data/test_images
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  num_gpus: 8
  num_epochs: 100
  checkpoint_interval: 10
  precision: "16-mixed"
  optim:
    optim: "adamw"
    clip_grad_norm: 3.0

4. Run SSL training for domain adaptation: Execute the training using TAO Launcher to adapt the general NV-DINOv2 model to your domain-specific images:

tao model nvdinov2 train \
  -e /path/to/experiment_spec.yaml \
  results_dir=/output/ssl_training \
  train.num_gpus=8 \
  train.num_epochs=100

5. Perform downstream task fine-tuning: After SSL domain adaptation, fine-tune the model for your specific classification task using a small labeled dataset. This step, known as linear probing, requires only a few hundred labeled samples:

tao model nvdinov2 train \
  -e /path/to/finetune_spec.yaml \
  train.pretrained_model_path=/output/ssl_training/model.pth \
  dataset.train_dataset.images_dir=/data/train_images \
  train.num_epochs=50

6. Run inference: Evaluate your domain-adapted model on test images:

tao model nvdinov2 inference \
  -e /path/to/experiment_spec.yaml \
  inference.checkpoint=/output/ssl_training/model.pth \
  inference.gpu_ids=[0] \
  inference.batch_size=32

7. Export to ONNX for deployment: Export your trained model to ONNX format for production deployment:

tao model nvdinov2 export \
  -e /path/to/experiment_spec.yaml \
  export.checkpoint=/output/ssl_training/model.pth \
  export.onnx_file=/output/nvdinov2_domain_adapted.onnx \
  export.opset_version=12 \
  export.batch_size=-1

The exported ONNX model can be deployed using NVIDIA TensorRT for optimized inference or integrated into an NVIDIA DeepStream pipeline for real-time visual inspection.
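As a quick sanity check before wiring the exported model into TensorRT or DeepStream, you can run it with ONNX Runtime. In this minimal sketch, the input name, NCHW layout, and 518x518 resolution are assumptions taken from the spec file above; inspect your exported graph for the real values:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "/output/nvdinov2_domain_adapted.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Assumed input: NCHW float32 at the 518x518 resolution from the spec;
# check session.get_inputs() for the actual name and shape.
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 518, 518).astype(np.float32)

outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])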
Results: Fine-tuning NV-DINOv2 with SSL through NVIDIA TAO also works for PCB inspection. Using a dataset of approximately one million unlabeled images for SSL-based industrial domain adaptation, plus 600 training and 400 test samples for downstream task fine-tuning, defect detection accuracy jumped from 93.84% with the general model to 98.51%. By reducing the need for labeling and frequent retraining, NV-DINOv2 streamlines the deployment of defect inspection solutions in fast-moving fab environments.

Paving the way to a smart fab

These applications of vision models deliver immediate accuracy gains and lay the foundation for agentic AI systems within the fab. By combining accelerated computing with generative AI, NVIDIA and leading foundries are introducing new ADC workflows that have the potential to redefine yield improvement and process control in advanced manufacturing.

By streamlining defect analysis across the semiconductor production flow, generative AI significantly reduces model deployment time. Its few-shot learning capabilities simplify ongoing model maintenance, improve robustness, and make it easy to fine-tune models for different fab environments. With fabs generating millions of high-resolution images daily from a wide range of inspection tools, automated ADC systems are expected to further improve classification accuracy, reduce human workload, and elevate overall productivity.

Beyond defect inspection, semiconductor manufacturers are beginning to adopt video analytics AI agents built using the NVIDIA Blueprint for Video Search and Summarization (VSS). These agents help monitor plant operations, enhance worker safety, and improve compliance with PPE and safety protocols across manufacturing sites.

Next steps

To learn more, try NV-DINOv2 and state-of-the-art NVIDIA VLMs like Cosmos Reason. For technical questions, visit the forum. Stay up to date by subscribing to the newsletter and following NVIDIA AI on LinkedIn, Instagram, X, and Facebook. Explore the NVIDIA YouTube channel and join the NVIDIA Developer vision AI forum.

About the Authors

Tim Lin is a senior manager at NVIDIA, focused on applying AI to industrial applications, particularly in the semiconductor sector. He has over 10 years of experience in the semiconductor industry and holds a PhD in computer science and engineering from The University of Texas at Arlington.

HJ Chen is a senior manager at NVIDIA, leading a Taiwan-based team contributing to the Metropolis/TAO effort to develop CV foundation models and drive the adoption of VLMs for industrial applications. HJ holds bachelor's and master's degrees in engineering from National Taiwan University.

Po Chun Lai is a senior solution architect at NVIDIA responsible for AI technology adoption, with prior industry experience. He holds a master's degree in Computer Science from National Yang Ming Chiao Tung University.

Yiyi Wang is a senior business development leader at NVIDIA, where she drives global business development strategy for the semiconductor vertical, building high-impact partnerships and transforming semiconductor manufacturing with AI and accelerated computing.
Yiyi holds a PhD in Physics from Boston University.

Anita Chiu is an AI software engineer at NVIDIA, where she focuses on transforming NVIDIA's AI models into practical, scalable solutions for industrial applications. She holds both a bachelor's and a master's degree in engineering from National Tsing Hua University.
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator huggingface 17.12.2025 13:22 0.629
Embedding sim.: 0.7108
Entity overlap: 0
Title sim.: 0.1776
Time proximity: 0.9323
NLP type: product_launch
NLP organization: nvidia
NLP topic: large language models
NLP country:

Open original

The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator

Published December 17, 2025
Authors: Seph Mard, Isabel Hulseman, Besmira Nushi, Piotr Januszewski, Grzegorz Chlebus, Vivienne Zhang, Wojciech Prazuch, Pablo Ribalta, Nik Spirin, and Ferenc Galko (NVIDIA)

It has become increasingly challenging to assess whether a model's reported improvements reflect genuine advances or merely variations in evaluation conditions, dataset composition, or training data that mirrors benchmark tasks. The NVIDIA Nemotron approach to openness addresses this by publishing transparent, reproducible evaluation recipes that make results independently verifiable.

NVIDIA released Nemotron 3 Nano 30B A3B with an explicitly open evaluation approach to make that distinction clear. Alongside the model card, we are publishing the complete evaluation recipe used to generate the results, built with the NVIDIA NeMo Evaluator library, so anyone can rerun the evaluation pipeline, inspect the artifacts, and analyze the outcomes independently. We believe that open innovation is the foundation of AI progress.

This level of transparency matters because most model evaluations omit critical details. Configs, prompts, harness versions, runtime settings, and logs are often missing or underspecified, and even small differences in these parameters can materially change results. Without a complete recipe, it's nearly impossible to tell whether a model is genuinely more intelligent or simply optimized for a benchmark.

This blog shows developers exactly how to reproduce the evaluation behind Nemotron 3 Nano 30B A3B using fully open tools, configurations, and artifacts. You'll learn how the evaluation was run, why the methodology matters, and how to execute the same end-to-end workflow using the NeMo Evaluator library so you can verify results, compare models consistently, and build transparent evaluation pipelines of your own.

Building a consistent and transparent evaluation workflow with NeMo Evaluator

A single, consistent evaluation system

Developers and researchers need evaluation workflows they can rely on, not one-off scripts that behave differently from model to model. NeMo Evaluator provides a unified way to define benchmarks, prompts, configuration, and runtime behavior once, then reuse that methodology across models and releases. This avoids the common scenario where the evaluation setup quietly changes between runs, making comparisons over time difficult or misleading.
Methodology independent of inference setup

Model outputs can vary by inference backend and configuration, so an evaluation tool should never be locked to a single inference solution. NeMo Evaluator avoids this by separating the evaluation pipeline from the inference backend, allowing the same configuration to run against hosted endpoints, local deployments, or third-party providers. This separation enables meaningful comparisons even when you change infrastructure or inference engines.

Built to scale beyond one-off experiments

Many evaluation pipelines work once and then break down as the scope expands. NeMo Evaluator is designed to scale from quick, single-benchmark validation to full model card suites and repeated evaluations across multiple models. The launcher, artifact layout, and configuration model support ongoing workflows, not just isolated experiments, so teams can maintain consistent evaluation practices over time.

Auditability with structured artifacts and logs

Transparent evaluation requires more than final scores. Each evaluation run produces structured results and logs by default, making it easy to inspect how scores were computed, debug unexpected behavior, and conduct deeper analysis. Each component of the evaluation is captured and reproducible.

A shared evaluation standard

By releasing Nemotron 3 Nano 30B A3B with its full evaluation recipe, NVIDIA is providing a reference methodology that the community can run, inspect, and build upon. Using the same configuration and tools brings consistency to how benchmarks are selected, executed, and interpreted, enabling more reliable comparisons across models, providers, and releases.

Open evaluation for Nemotron 3 Nano

Open evaluation means publishing not just the final results, but the full methodology behind them, so benchmarks are run consistently and results can be compared meaningfully over time. For Nemotron 3 Nano 30B A3B, this includes open-source tooling, transparent configurations, and reproducible artifacts that anyone can run end-to-end.

Open-source model evaluation tooling

NeMo Evaluator is an open-source library designed for robust, reproducible, and scalable evaluation of generative models. Instead of introducing yet another standalone benchmark runner, it acts as a unifying orchestration layer that brings multiple evaluation harnesses under a single, consistent interface. Under this architecture, NeMo Evaluator integrates and coordinates hundreds of benchmarks from many widely used evaluation harnesses, including NeMo Skills for Nemotron instruction-following, tool use, and agentic evaluations, as well as the LM Evaluation Harness for base model and pre-training benchmarks, and many more (see the full benchmark catalog). Each harness retains its native logic, datasets, and scoring semantics, while NeMo Evaluator standardizes how they are configured, executed, and logged.

This provides two practical advantages: teams can run diverse benchmark categories using a single configuration without rewriting custom evaluation scripts, and results from different harnesses are stored and inspected in a consistent, predictable way, even when the underlying tasks differ. The same orchestration framework used internally by NVIDIA's Nemotron research and model-evaluation teams is now available to the community, enabling developers to run heterogeneous, multi-harness evaluations through a shared, auditable workflow.
Open configurations

We published the exact YAML configuration used for the Nemotron 3 Nano 30B A3B model card evaluation with NeMo Evaluator. This includes:
model inference and deployment settings
benchmark and task selection
benchmark-specific parameters such as sampling, repeats, and prompt templates
runtime controls including parallelism, timeouts, and retries
output paths and artifact layout

Using the same configuration means running the same evaluation methodology.

Open logs and artifacts

Each evaluation run produces structured, inspectable outputs, including per-task results.json files, execution logs for debugging and auditability, and artifacts organized by task for easy comparison. This structure makes it possible to understand not only the final scores, but also how those scores were produced, and to perform deeper analysis of model behavior.

The reproducibility workflow

Reproducing the Nemotron 3 Nano 30B A3B model card results follows a simple loop:
1. Start from the released model checkpoint or hosted endpoint.
2. Use the published NeMo Evaluator config.
3. Execute the evaluation with a single CLI command.
4. Inspect logs and artifacts, and compare results to the model card.

The same workflow applies to any model you evaluate using NeMo Evaluator. You can point the evaluation at a hosted endpoint or a local deployment, including common inference providers such as HuggingFace, build.nvidia.com, and OpenRouter. The key requirement is access to the model, either as weights you can serve or as an endpoint you can call. For this tutorial, we use the hosted endpoint on build.nvidia.com.

Reproducing Nemotron 3 Nano benchmark results

This tutorial reproduces the evaluation results for NVIDIA Nemotron 3 Nano 30B A3B using NeMo Evaluator. The step-by-step tutorial, including the published configs used for the model card evaluation, is available on GitHub. Although this tutorial focuses on Nemotron 3 Nano 30B A3B, we also published recipes for the base model evaluation. The walkthrough runs a comprehensive evaluation suite using the published model card configs and the following benchmarks:

Benchmark | Accuracy | Category | Description
BFCL v4 | 53.8 | Function Calling | Berkeley Function Calling Leaderboard v4
LiveCodeBench (v6 2025-08–2025-05) | 68.3 | Coding | Real-world coding problems evaluation
MMLU-Pro | 78.3 | Knowledge | Multi-task language understanding (10-choice)
GPQA | 73.0 | Science | Graduate-level science questions
AIME 2025 | 89.1 | Mathematics | American Invitational Mathematics Exam
SciCode | 33.3 | Scientific Coding | Scientific programming challenges
IFBench | 71.5 | Instruction Following | Instruction following benchmark
HLE | 10.6 | Humanity's Last Exam | Expert-level questions across domains

For model card details, see the NVIDIA Nemotron 3 Nano 30B A3B Model Card. For a deep dive into the architecture, datasets, and benchmarks, read the full Nemotron 3 Nano Technical Report.

1. Install NeMo Evaluator Launcher

pip install nemo-evaluator-launcher

2. Set required environment variables

# NVIDIA endpoint access
export NGC_API_KEY="your-ngc-api-key"

# Hugging Face access
export HF_TOKEN="your-huggingface-token"

# Required only for judge-based benchmarks such as HLE
export JUDGE_API_KEY="your-judge-api-key"

Optional but recommended for faster reruns:

export HF_HOME="/path/to/your/huggingface/cache"
3. Model endpoint

The evaluation uses the NVIDIA API endpoint hosted on build.nvidia.com:

target:
  api_endpoint:
    model_id: nvidia/nemotron-nano-3-30b-a3b
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

Evaluations can be run against common inference providers such as HuggingFace, build.nvidia.com, or OpenRouter, or anywhere the model has an available endpoint. If you're hosting the model locally or using a different endpoint:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

4. Run the full evaluation suite

Preview the run without executing using --dry-run:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  --dry-run

From the examples directory, run the evaluation using the YAML configuration provided:

nemo-evaluator-launcher run \
  --config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml

Note that for quick testing, you can limit the number of samples by setting limit_samples:

nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10

5. Running an individual benchmark

You can run specific benchmarks using the -t flag (from the examples/nemotron directory):

# Run only MMLU-Pro
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_mmlu_pro

# Run only coding benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_livecodebench

# Run multiple specific benchmarks
nemo-evaluator-launcher run --config local_nvidia_nemotron_3_nano_30b_a3b.yaml -t ns_gpqa -t ns_aime2025

6. Monitor execution and inspect results

# Check status of a specific job
nemo-evaluator-launcher status

# Stream logs for a specific job
nemo-evaluator-launcher logs <job-id>

Results are written to the defined output directory:

results_nvidia_nemotron_3_nano_30b_a3b/
├── artifacts/
│   └── <task_name>/
│       └── results.json
└── logs/
    └── stdout.log

Interpreting results

When reproducing evaluations, you may observe small differences in final scores across runs. This variance reflects the probabilistic nature of LLMs rather than an issue with the evaluation pipeline. Modern evaluation introduces several sources of non-determinism, including decoding settings, repeated trials, judge-based scoring, parallel execution, and differences in serving infrastructure, all of which can lead to slight fluctuations. The purpose of open evaluation is not to force bit-wise identical outputs, but to deliver methodological consistency with clear provenance of evaluation results. To ensure your evaluation aligns with the reference standard, verify the following:

Configuration: Use the published NeMo Evaluator YAML without modification, or document any changes explicitly.
Benchmark selection: Run the intended tasks, task versions, and prompt templates.
Inference target: Verify you are evaluating the intended model and endpoint, including chat template behavior and reasoning settings when relevant.
Execution settings: Keep runtime parameters consistent, including repeats, parallelism, timeouts, and retry behavior.
Outputs: Confirm artifacts and logs are complete and follow the expected structure for each task.

When these elements are consistent, your results represent a valid reproduction of the methodology, even if individual runs differ slightly.
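To spot-check a reproduction against the model card numbers, you can walk the artifact tree shown above and print what each harness logged. A minimal sketch follows; the exact results.json schema varies by harness, so treat the printing logic as a starting point to adapt:

import json
from pathlib import Path

results_dir = Path("results_nvidia_nemotron_3_nano_30b_a3b/artifacts")

# Walk the per-task artifacts and preview what each harness logged.
for results_file in sorted(results_dir.glob("*/results.json")):
    task = results_file.parent.name
    data = json.loads(results_file.read_text())
    print(f"=== {task} ===")
    print(json.dumps(data, indent=2)[:400])   # schema varies by harness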
NeMo Evaluator simplifies this process, tying benchmark definitions, prompts, runtime settings, and inference configuration into a single auditable workflow to minimize inconsistencies.

Conclusion: A more transparent standard for open models

The evaluation recipe released alongside Nemotron 3 Nano represents a meaningful step toward a more transparent and reliable approach to open-model evaluation. We are moving away from evaluation as a collection of bespoke, "black box" scripts and toward a defined system where benchmark selection, prompts, and execution semantics are encoded into a transparent workflow.

For developers and researchers, this transparency changes what it means to share results. A score is only as trustworthy as the methodology behind it, and making that methodology public is what enables the community to verify claims, compare models fairly, and continue building on shared foundations. With open evaluation configurations, open artifacts, and open tooling, Nemotron 3 Nano demonstrates what that commitment to openness looks like in practice.

NeMo Evaluator supports this shift by providing a consistent benchmarking methodology across models, releases, and inference environments. The objective isn't identical numbers on every run; it's confidence in an evaluation methodology that is explicit, inspectable, and repeatable. And for organizations that need automated or large-scale evaluation pipelines, a separate enterprise-ready NeMo Evaluator microservice is built on the same evaluation principles.

Use the published NeMo Evaluator evaluation configuration for an end-to-end walkthrough of the evaluation recipe.

Join the community! NeMo Evaluator is fully open source, and community input is essential to shaping the future of open evaluation. If there's a benchmark you'd like us to support or an improvement you want to propose, open an issue or contribute directly on GitHub. Your contributions help strengthen the ecosystem and advance a shared, transparent standard for evaluating generative models.