AI Factories, Physical AI, and Advances in Models, Agents, and Infrastructure That Shaped 2025 | NVIDIA Technical Blog
closed
Event type: product_launch
Topic: ai infrastructure
Organization: NVIDIA
Country: United States
Articles: 22
Unique sources: 6
Importance / Momentum: 2.66 / 0
Period: 31.12.2025 17:30 — 12.01.2026 13:31
Created: 06.04.2026 06:19:17
Articles in cluster: 22
Title | Source | Publication date | Score
S AI Factories, Physical AI, and Advances in Models, Agents, and Infrastructure That Shaped 2025 | NVIDIA Technical Blog nvidia_dev_blog 31.12.2025 17:30 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

2025 was another milestone year for developers and researchers working with NVIDIA technologies. Progress in data center power and compute design, AI infrastructure, model optimization, open models, AI agents, and physical AI redefined how intelligent systems are trained, deployed, and moved into the real world. These posts highlight the innovations that resonated most with our readers.

NVIDIA 800V HVDC Architecture Will Power the Next Generation of AI Factories
As AI racks move to megawatt scale, NVIDIA and industry partners are advancing an 800V DC power architecture to deliver higher efficiency, scalability, and reliability for future data centers.

Announcing Newton: An Open-Source Physics Engine for Robotics Simulation
Newton, developed jointly by NVIDIA, Google DeepMind, and Disney Research, provides an open, customizable physics engine built on NVIDIA Warp for accurate, scalable robotic simulation and learning.

NVIDIA RTX Neural Rendering Introduces Next Era of AI-Powered Graphics Innovation
The NVIDIA GeForce RTX 50 Series GPUs launch with RTX Kit, a set of neural rendering technologies for developers to integrate AI-enhanced geometry, textures, materials, and lighting into their rendering pipelines.

Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
The fifth-generation NVIDIA Blackwell Tensor Cores add support for multiple 4-bit floating-point formats, including NVFP4, to improve quantization efficiency while maintaining task-specific accuracy.

Automating GPU Kernel Generation with DeepSeek-R1 and Inference-Time Scaling
NVIDIA engineers use the DeepSeek-R1 model with inference-time scaling to automatically generate optimized, numerically correct GPU attention kernels—showing how AI can accelerate or even surpass traditional hand-tuned kernel development.

Introducing NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models
NVIDIA Dynamo is a high-throughput, low-latency inference framework that boosts DeepSeek-R1 performance by up to 30x on NVIDIA Blackwell GPUs and introduces disaggregated serving, dynamic scheduling, and LLM-aware routing for scalable generative AI deployment.

NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance
Enhanced compute, interconnect, and software optimizations deliver major DeepSeek-R1 throughput gains, surpassing 250 tokens per second per user.

Introducing NVIDIA Jetson Thor: The Ultimate Platform for Physical AI
Generalist robotics is emerging as robots shift from fixed-function machines to adaptable systems powered by foundation models, with new NVIDIA Jetson platforms enabling multimodal reasoning and flexible task performance.

Building the 800 VDC Ecosystem for Efficient, Scalable AI Factories
AI is reshaping data centers into power-driven AI factories, making an 800V DC architecture with integrated energy storage essential for scaling modern workloads efficiently.

Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era
NVIDIA Blackwell Ultra combines advanced silicon features and deeper system-level integration—including a dual-reticle design, high-bandwidth HBM3E, new Tensor Cores, and NVFP4—to increase performance and efficiency for large-scale AI training and reasoning.

Looking ahead
Stay tuned for more transformative innovations in 2026. Subscribe to the Developer Newsletter and stay in the loop on content tailored to your interests. Follow us on Instagram, LinkedIn, Twitter, YouTube, and Discord for the latest developer news.
Building Generalist Humanoid Capabilities with NVIDIA Isaac GR00T N1.6 Using a Sim-to-Real Workflow | NVIDIA Technical Blog nvidia_dev_blog 08.01.2026 17:38 0.74
Embedding sim.: 0.8297
Entity overlap: 0.075
Title sim.: 0.3778
Time proximity: 0.8592
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: robotics
NLP country:

Open original

To make humanoid robots useful, they need cognition and loco-manipulation that span perception, planning, and whole-body control in dynamic environments. Building these generalist robots requires a workflow that unifies simulation, control, and learning for robots to acquire complex skills before transferring into the real world. In this post, we present NVIDIA Isaac GR00T N1.6 and describe a sim-to-real workflow that combines whole-body reinforcement learning (RL) in NVIDIA Isaac Lab , synthetic data–trained navigation with COMPASS, and vision-based localization using NVIDIA CUDA-accelerated visual mapping and simultaneous localization and mapping (SLAM). These components enable loco-manipulation, robust navigation, and environment-aware behavior across diverse robot embodiments. Vision-language-action and reasoning GR00T N1.6 is a multimodal vision-language-action (VLA) model that integrates visual observations from egocentric camera streams, robot states, and natural language instructions into a unified policy representation. The model uses world models , such as NVIDIA Cosmos Reason , to decompose high-level instructions into stepwise action plans grounded in scene understanding to perform real-world tasks. This architecture enables GR00T to execute locomotion and dexterous manipulation through end-to-end learned representations. GR00T N1.6 introduces several enhancements from previous releases that expand its capabilities and real-world applicability: Enhanced reasoning and perception: Uses a variant of Cosmos-Reason-2B VLM with native resolution support, enabling the robot to “see” clearly without distortion and reason better about its environment. This improvement translates to better scene understanding and more reliable task decomposition. Fluid, adaptive motion: A 2x larger diffusion transformer (32 layers) and state-relative action predictions result in smoother, less jittery movements that adapt easily to changing positions. Improved cross-embodiment performance: Trained on thousands of hours of new and diverse teleoperation data (humanoids, mobile manipulators, bimanual arms), enabling better generalization across various robot embodiments.  Isaac GR00T N1.6 was trained on a diverse collection of datasets, including both simulated and real-world data. The simulated data comprises environments and task demonstrations from BEHAVIOR , RoboCasa, and a custom simulated environment developed for GR-1. The real-world component integrates demonstrations collected across multiple robotic platforms, including GR-1 (Fourier), G1 (Unitree), bimanual YAM arms, Agibot, and the DROID dataset. A quantitative breakdown of the data contributions from each dataset is provided below. Figure 1. Training data distribution for Isaac GR00T N1.6. GR00T N1.6 includes pretrained weights for zero-shot evaluation and validation of basic manipulation primitives. Finetuning the model is beneficial when deploying it to a specific embodiment or task. This demo from the Conference on Robot Learning (CoRL) shows GR00T N1.6 in action, performing a loco-manipulation task on a G1 humanoid robot. Video 1. Synthetic data from neural simulation for robot training Whole-body RL training and Sim-to-real transfer Whole-body RL training in simulation provides the low-level motor intelligence that GR00T N1.6 uses and coordinates through its higher-level VLA policy. 
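To make this layering concrete, here is a minimal, hypothetical sketch of how a high-level VLA policy and a low-level whole-body controller might be composed in a control loop. The callables are placeholders for your own policy and robot interfaces; this is not the GR00T N1.6 or Isaac Lab API.

```python
# Hypothetical sketch of the two-level stack described above: a high-level VLA policy
# turns egocentric images, robot state, and a language instruction into a short chunk
# of commands, and a low-level whole-body controller tracks each command at a higher
# rate. All callables are placeholders for your own stack, not the GR00T N1.6 API.
from typing import Any, Callable, Dict, Sequence


def run_layered_control(
    vla_policy: Callable[[Dict[str, Any]], Sequence[Any]],          # obs -> command chunk
    whole_body_controller: Callable[[Any, Any], Sequence[float]],   # (state, command) -> joint targets
    get_observation: Callable[[], Dict[str, Any]],                  # egocentric RGB, proprioception, ...
    apply_joint_targets: Callable[[Sequence[float]], None],
    instruction: str,
    control_substeps: int = 5,      # low-level steps per high-level command (illustrative)
    max_commands: int = 200,
) -> None:
    """Run the VLA policy at a low rate and the whole-body controller at a high rate."""
    for _ in range(max_commands):
        obs = get_observation()
        obs["instruction"] = instruction
        # The VLA predicts a short chunk of commands rather than a single action.
        for command in vla_policy(obs):
            for _ in range(control_substeps):
                state = get_observation()["state"]
                apply_joint_targets(whole_body_controller(state, command))
```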
The whole-body controller trained in Isaac Lab with RL produces human-like, dynamically stable motion primitives covering locomotion, manipulation, and coordinated multi-contact behaviors. These policies are trained and stress-tested at scale in Isaac Lab and Isaac Sim, then transferred zero-shot to physical humanoids, minimizing task-specific finetuning while maintaining robustness across environments and embodiments. This sim-to-real pipeline enables GR00T's high-level VLA to assume reliable whole-body control, focusing its reasoning on task sequencing and scene-aware decision-making rather than raw motor stability.

GR00T-WholeBodyControl served as the whole-body controller, providing the low-level loco-manipulation layer under GR00T N1.6. Using this controller, the full stack—spanning high-level instruction following, mid-level behavior composition, and low-level robust control—is validated in simulation before deployment on hardware.

Synthetic-data–trained navigation
To layer goal-directed navigation on top of whole-body control, GR00T N1.6 is finetuned for point-to-point navigation using large-scale synthetic datasets generated by COMPASS in Isaac Lab. In this setup, COMPASS acts as a navigation specialist, producing diverse trajectories across scenes and embodiments used to adapt GR00T from a VLA model into a strong point navigation policy.

The navigation policy is trained in simulation and exposed through simple velocity commands to the whole-body controller, rather than directly producing joint torques (see the sketch after this section). This enables the low-level whole-body RL policy to handle balance and contact, while the navigation head focuses on obstacle avoidance, path following, and navigation–manipulation handoffs in real-world scenes. In experiments, this synthetic-only training pipeline achieves zero-shot sim-to-real transfer, including zero-shot deployment to new physical environments, without additional task-specific data collection.

COMPASS is a novel workflow for developing cross-embodiment mobility policies by integrating imitation learning, residual RL, and policy distillation. It has demonstrated the effectiveness of RL fine-tuning and strong zero-shot sim-to-real performance using Isaac Lab.

Figure 2. GR1 robot using the COMPASS workflow

Building on this, the GR00T N1.6 PointNav example releases provide step-by-step instructions and code for fine-tuning and evaluating navigation policies using COMPASS-generated data, so practitioners can reproduce and extend the navigation stack for their own embodiments and scenes.

Video 2. NVIDIA robot mobility workflows and AI models
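Following up on the velocity-command interface mentioned above, here is a minimal, hypothetical sketch of what such a command and a safety clamp around it might look like. The field names and limits are illustrative, not part of the GR00T or COMPASS APIs.

```python
# Hypothetical sketch of the navigation-to-controller interface described above: the
# navigation head emits base velocity commands, and the whole-body RL policy handles
# balance, contact, and joint-level control. Field names and limits are illustrative.
from dataclasses import dataclass


@dataclass
class BaseVelocityCommand:
    vx: float        # forward velocity, m/s
    vy: float        # lateral velocity, m/s
    yaw_rate: float  # turn rate, rad/s


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def clip_command(cmd: BaseVelocityCommand,
                 max_speed: float = 1.0,
                 max_yaw_rate: float = 1.5) -> BaseVelocityCommand:
    """Clamp a navigation command to the limits the whole-body controller was trained for.
    The limits here are illustrative placeholders."""
    return BaseVelocityCommand(
        vx=_clamp(cmd.vx, -max_speed, max_speed),
        vy=_clamp(cmd.vy, -max_speed, max_speed),
        yaw_rate=_clamp(cmd.yaw_rate, -max_yaw_rate, max_yaw_rate),
    )
```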
Vision-based localization
Vision-based localization enables the GR00T N1.6 stack to use its whole-body controller and navigation policies in a large, real-world environment. After whole-body RL equips the robot with robust loco-manipulation skills and COMPASS-style synthetic data finetunes GR00T for point-to-point navigation, the system still requires an accurate estimate of the robot's location so commands and waypoints correspond to real coordinates. To provide this, a vision-centric mapping and localization stack uses onboard cameras and prebuilt maps to maintain low-drift pose estimates, enabling robot commands to be grounded in precise robot and object coordinates.

The visual mapping and localization stack is built on top of NVIDIA Isaac, NVIDIA CUDA-X libraries, and the following libraries and models:
- cuVSLAM is a real-time visual-inertial SLAM and odometry library. Its odometry provides smooth vehicle velocity, and its SLAM backend produces low-drift poses with loop-closure corrections for navigation.
- cuVGL is a visual global localization library that computes an initial pose in a prebuilt map, which is used to bootstrap cuVSLAM.
- FoundationStereo is a foundation model for stereo depth estimation, offering strong zero-shot generalization across diverse environments.
- nvblox is an efficient 3D perception library that reconstructs the environment and generates a 2D occupancy map for path planning.

We collect stereo images of the environment and pre-build maps, including a cuVSLAM landmark map, a cuVGL bag-of-words map, and an occupancy map. Semantic locations, such as the kitchen table, are identified in the occupancy map and used for task planning. At runtime, cuVGL retrieves visually similar image pairs from the pre-built map and estimates an initial pose from the stereo pairs. Using this pose as a prior, cuVSLAM matches local landmarks against the pre-built landmark map to localize. After successful localization, cuVSLAM tracks features continuously and performs map-based optimization, keeping the robot accurately localized during navigation.

We develop an offline map creation workflow in Isaac ROS to create the maps from a ROS bag, along with isaac_ros_visual_slam and isaac_ros_visual_global_localization packages for localization. You can create a localization pipeline in ROS 2 using a stereo camera driver, image rectification nodes, an occupancy map server, and the cuVSLAM and cuVGL nodes; a sketch of such a launch file follows the Get started section below.

Figure 3. cuVSLAM feature tracking when a robot picks up an apple

Get started
Download and experiment:
- Open Isaac GR00T N1.6 model from Hugging Face
- GR00T N1.6 variant post-trained on the BEHAVIOR 1K dataset
- Use Isaac Lab and Newton for RL and policy training, and Isaac Lab to generate synthetic navigation data with COMPASS
- Use Isaac Lab – Arena for your robot policy evaluation
- Use the CUDA-X visual mapping and localization libraries released as part of Isaac ROS: create visual and occupancy maps from rectified stereo images, then launch cuVSLAM and cuVGL to localize the robot using the generated maps

Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on LinkedIn, Instagram, X, and Facebook. Explore NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Robotics forum. To start your robotics journey, enroll in our free NVIDIA Robotics Fundamentals courses today. Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems. Learn more by watching NVIDIA Live at CES.
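As referenced above, here is a minimal, hypothetical ROS 2 launch sketch for the localization pipeline. The package names (isaac_ros_visual_slam, isaac_ros_visual_global_localization) come from the post, but the executable names, parameters, and map paths are placeholders; check the Isaac ROS documentation for the exact interfaces and the full set of required nodes.

```python
# Illustrative ROS 2 launch sketch for the localization pipeline described above.
# The package names come from the post; the executable names, parameters, and map
# paths are placeholders. A complete pipeline also needs the stereo camera driver,
# image rectification nodes, and an occupancy map server.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description() -> LaunchDescription:
    # cuVSLAM node: continuous feature tracking and map-based optimization
    # against the pre-built landmark map.
    visual_slam = Node(
        package="isaac_ros_visual_slam",
        executable="isaac_ros_visual_slam",              # placeholder executable name
        parameters=[{"map_path": "/maps/warehouse"}],    # placeholder parameter
    )

    # cuVGL node: estimates an initial pose from the bag-of-words map to
    # bootstrap cuVSLAM after startup.
    global_localization = Node(
        package="isaac_ros_visual_global_localization",
        executable="visual_global_localization",          # placeholder executable name
        parameters=[{"map_dir": "/maps/warehouse_bow"}],  # placeholder parameter
    )

    return LaunchDescription([visual_slam, global_localization])
```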
Build and Orchestrate End-to-End SDG Workflows with NVIDIA Isaac Sim and NVIDIA OSMO | NVIDIA Technical Blog nvidia_dev_blog 07.01.2026 18:00 0.71
Embedding sim.: 0.8177
Entity overlap: 0.0769
Title sim.: 0.2836
Time proximity: 0.7395
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: robotics
NLP country:

Open original

As robots take on increasingly dynamic mobility tasks, developers need physics-accurate simulations that translate across environments and workloads. Training robot policies and models to do these tasks requires a large amount of diverse, high-quality data, which is often expensive and time-consuming to collect in the physical world. Therefore, generating synthetic data at scale using cloud technology is essential to accelerate physical AI. Synthetic data is generated in physics-accurate simulated environments with open source robotics simulation frameworks such as NVIDIA Isaac Sim and augmented with open world foundation models such as NVIDIA Cosmos Transfer, which helps close the real-world data gap. To run these workloads at scale, developers can use NVIDIA OSMO, an open source cloud-native orchestrator for physical AI workflows. OSMO provides a single command center to define, run, and monitor any multistage physical AI pipeline across diverse compute environments.

Figure 1. NVIDIA OSMO workflow can manage multiple robotics pipelines across compute environments in a single workstream

This post explores:
- Creating a simulated environment
- Generating synthetic data with MobilityGen on OSMO
- Scaling data augmentation using NVIDIA Cosmos world foundation models (WFMs) on OSMO
- Deploying data generation pipelines at cloud scale using NVIDIA OSMO on Microsoft Azure

Figure 2. Using OSMO and Isaac Sim, you can take a single scene and turn it into a cloud-scale synthetic data generation pipeline

Build a simulated environment locally or in the cloud
You can build simulated environments in Isaac Sim on a local NVIDIA RTX workstation or with cloud VDIs, such as the Azure Isaac Sim Development Workstation. With NVIDIA OSMO, there's now an additional option: run Isaac Sim remotely as an interactive session, and connect from the Isaac Sim livestream client on your local machine. Once you have Isaac Sim running, the next step is to build out the world your robot will operate in. You can start by bringing in real-world environment assets using NVIDIA Omniverse NuRec, and then populate the scene with simulation-ready (SimReady) assets to add physically accurate objects and semantics for data generation and training.

Reconstruct 3D digital twins using Omniverse NuRec
Omniverse NuRec is a set of technologies for reconstructing and rendering 3D interactive simulations from real-world sensor data. The reconstructed environments are used across domains like robotics, AV, and industrial/geospatial for generating synthetic data, training AI models, and testing model behavior. Isaac Sim supports NuRec Gaussian-based rendering as neural radiance fields (NeRFs), 3D Gaussian Splats (3DGS), and 3D Gaussian Unscented Transforms (3DGUT). Data is rendered in OpenUSD for simulation. You can load compatible assets and scenes in Isaac Sim and control rendering through the OmniNuRecVolumeAPI properties. Learn more about NuRec for robotics use cases in the documentation.

Add SimReady assets to a simulated scene
SimReady assets are OpenUSD-based accurate 3D models with built-in semantic labeling, dense captions, and physics properties based on USDPhysics that streamline robot simulation setup. The SimReady Warehouse 01 Assets Pack includes a large collection of USD models of objects like pallets, storage racks, and ramps. You can simply drag and drop these into your scene. For robotics and related use cases, explore the Physical AI dataset. Video 1 shows how to add SimReady assets to a scene in Isaac Sim.

Video 1. Populating warehouse scenes with physically accurate 3D objects using simple drag-and-drop functionality
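Beyond the drag-and-drop flow shown in Video 1, SimReady assets are ordinary OpenUSD files, so a scene can also be assembled from a script. Below is a minimal sketch using the standard USD Python API (pxr); the asset and prim paths are placeholders, and in Isaac Sim you would more typically use the GUI or Isaac Sim's own stage utilities.

```python
# Sketch: assembling a scene from SimReady USD assets with the standard USD Python
# API (pxr). The asset and prim paths below are placeholders; in Isaac Sim you would
# typically use the drag-and-drop flow shown in Video 1 or Isaac Sim's stage utilities.
from pxr import Gf, Usd, UsdGeom

stage = Usd.Stage.CreateNew("warehouse_scene.usda")
UsdGeom.SetStageUpAxis(stage, UsdGeom.Tokens.z)

# Reference a SimReady pallet asset (placeholder path) under its own prim.
# SimReady assets already carry semantic labels and USDPhysics properties.
pallet = stage.DefinePrim("/World/Pallet_01", "Xform")
pallet.GetReferences().AddReference("./simready/pallet_a.usd")

# Position the instance in the warehouse.
UsdGeom.XformCommonAPI(pallet).SetTranslate(Gf.Vec3d(2.0, 0.5, 0.0))

stage.GetRootLayer().Save()
```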
In this way, you can easily create scenes with multiple objects in simulation. A major use of these simulated environments is to collect synthetic data for training robot policies, which we will learn about in the next section. Try the SimReady Standardization workflow to design, validate, and implement standardized 3D asset specifications in OpenUSD.

Generate synthetic data using MobilityGen on OSMO
MobilityGen is a workflow for generating data for mobile robots built on Isaac Sim. It supports data collection through manual methods like keyboard and gamepad teleoperation, and through automated methods like random accelerations and random path following. In the following example, you'll learn how MobilityGen is used to generate data for an H1 humanoid robot in Isaac Sim using OSMO. You can find the OSMO Workflow example in the NVIDIA/OSMO GitHub repo. This workflow can be used for other robot embodiments, like quadrupeds and autonomous mobile robots (AMRs), and has been tested on the Spot and Carter robots. While data from MobilityGen can train mobility policies for robots, performance improves when the data includes visual diversity. We'll learn about augmenting data with visual diversity using NVIDIA Cosmos in the next section.

The following outlines the steps involved in generating data using MobilityGen:
- Build an occupancy map: This is a grid-based representation of the robot's environment where each cell represents the probability of being occupied by an obstacle.
- Record a trajectory: A trajectory of a mobile robot specifies position, velocity, and orientation at every instant as it moves through its environment.
- Replay and render: You can replay the generated trajectories to evaluate and visualize data.

Videos 2 and 3 show how to generate synthetic data in Isaac Sim using MobilityGen.

Video 2. Creating occupancy maps for training mobility models across different robot embodiments
Video 3. Recording collision-free paths and capturing RGB/depth camera data from the robot's perspective

The following example uses a warehouse environment available in Isaac Sim to run MobilityGen. You can create your own environment using SimReady assets covered in the previous section. This step leverages an interactive OSMO workflow to generate occupancy maps and record trajectory data within Isaac Sim.

Submit and connect
Submit the workflow and enter the container's interactive shell to perform manual recording:

# Submit the YAML definition
osmo workflow submit workflows/mobilitygen_replay.yaml --pool <pool-name>

# When the task logs this line:
# "Isaac Sim Full Streaming App is loaded."
# Run these commands in two separate terminals:
osmo workflow port-forward <workflow ID> isaac-lab --port 47995-48012,49000-49007,49100 --connect-timeout 300
osmo workflow port-forward <workflow ID> isaac-lab --port 47995-48012,49000-49007 --udp --

Complete the following steps:
- Follow the documentation for building an occupancy map: load the warehouse stage, create the occupancy map, and save the map. Verify that you now have a folder named ~/MobilityGenData/maps/warehouse_multiple_shelves/ with a file named map.yaml and map.png inside.
- Follow the documentation for recording a trajectory: enable the MobilityGen UI extension, build the scenario, test drive the robot, and start recording. Verify that the data is now recorded to ~/MobilityGenData/recordings.
- Follow the documentation for replay and render: after recording a trajectory, which includes data like robot poses, you can now replay the scenario. Use the replay_directory.py Python script that ships with Isaac Sim, calling it from inside the Isaac Sim directory. After the script finishes, verify that you have a folder ~/MobilityGenData/replays, which contains the rendered sensor data. You can open this folder to explore the data.

There are examples of how to load and work with the recorded data in the open source MobilityGen GitHub repo. We recommend visualizing your recorded data by running the Gradio Visualization Script. Find more information, such as adding a custom robot, in the tutorial on Data Generation with MobilityGen. To scale these steps, you can leverage custom scripts that run headless as OSMO workflows.

Augment generated training data using Cosmos on OSMO
After generating data using MobilityGen, use Cosmos Transfer to generate photorealistic videos from synthetic robot data. This adds visual variation to reduce the sim-to-real gap and improves policy performance after deployment.

Figure 3. The high-level SDG workflow includes generating synthetic data using MobilityGen and augmenting the data using Cosmos Transfer, which results in high-quality datasets for training robot models

Cosmos Transfer is a WFM that generates photorealistic videos from inputs of multiple video modalities like RGB, depth, and segmentation. Along with the input video, you can give a text prompt with details guiding how you want the generated video to look. The following is an example prompt:

A realistic warehouse environment with consistent lighting, perspective, and camera motion. Preserve the original structure, object positions, and layout from the input video. Ensure the output exactly matches the segmentation video frame-by-frame in timing and content. Camera movement must follow the original path precisely.

Videos 4 and 5 show how to run Cosmos Transfer on MobilityGen data to add visual variation.

Video 4. Processing Isaac Sim synthetic data and converting warehouse scenes into realistic training datasets
Video 5. The inference process to generate photorealistic videos

Once raw trajectories are recorded, use Cosmos Transfer to apply diffusion-based photorealistic augmentation for enhanced sim-to-real performance. Submit the OSMO augmentation workflow:

osmo workflow submit workflows/cosmos_augmentation.yaml \
  --pool <pool-name>

This workflow can be scaled to thousands of generations by customizing the workflows and Python scripts to leverage LLM pregenerated prompt variations, as shown in the sketch below. To troubleshoot typical OSMO issues, follow the official documentation.
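The sketch below illustrates that fan-out pattern: pregenerate prompt variations, then submit one augmentation run per variation with the OSMO CLI command shown above. How each prompt is actually wired into workflows/cosmos_augmentation.yaml depends on how you customize the workflow and its scripts, so treat this as an illustration of the pattern rather than a ready-made integration.

```python
# Illustrative fan-out pattern for scaling Cosmos Transfer augmentation on OSMO:
# pregenerate prompt variations, then submit one workflow run per variation using
# the CLI command from this post. How a prompt reaches the workflow depends on your
# customized workflows/cosmos_augmentation.yaml and scripts; this sketch only writes
# each variation to a local file and triggers a submission.
import itertools
import pathlib
import subprocess

POOL = "<pool-name>"  # your OSMO compute pool

LIGHTING = ["consistent warehouse lighting", "dimly lit aisles", "harsh overhead lighting"]
CLUTTER = ["empty aisles", "pallets and storage racks along the walls", "scattered cardboard boxes"]

BASE_PROMPT = (
    "A realistic warehouse environment with {lighting} and {clutter}. "
    "Preserve the original structure, object positions, and layout from the input video. "
    "Camera movement must follow the original path precisely."
)

pathlib.Path("prompts").mkdir(exist_ok=True)

for i, (lighting, clutter) in enumerate(itertools.product(LIGHTING, CLUTTER)):
    prompt = BASE_PROMPT.format(lighting=lighting, clutter=clutter)
    pathlib.Path(f"prompts/variation_{i:03d}.txt").write_text(prompt)
    # Command taken from the post; one augmentation run per prompt variation.
    subprocess.run(
        ["osmo", "workflow", "submit", "workflows/cosmos_augmentation.yaml", "--pool", POOL],
        check=True,
    )
```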
During closed-loop evaluation in the lab, a policy trained on synthetic and Cosmos-augmented data consistently outperformed a policy trained on synthetic data alone. The following scenarios are handled well by the policy: Navigating around transparent obstacles Avoiding obstacles that blend into the background, like a gray pole on a gray floor Going closer to obstacles, and reducing the overall distance traveled to get to a goal position Navigating in dimly lit environments Navigating narrow passages You can run Cosmos Transfer on both real-world and synthetic video data. For example, the Cosmos for Synthetic Dataset Augmentation tutorial shows how to generate synthetic data using Replicator in Isaac Sim and then augment it with Cosmos. The NVIDIA OSMO Cosmos Transfer workflow example shows how to operationalize Cosmos Transfer as a scalable, repeatable workflow. Scale your data generation pipeline in the cloud Once you’ve generated a simulated environment, you need a repeatable way to fan out thousands of simulation and post-processing runs, track exactly how each dataset shard was produced, recover gracefully from transient failures, and continuously iterate on scenario coverage as your navigation stack evolves. This is addressed by using OSMO. The following steps show how to do so using the Azure sample , which provides a production-oriented baseline for running OSMO on Azure and wiring it into the services you need for SDG and training at scale. However, OSMO is deployable on all leading cloud service providers (CSPs) and NVIDIA Cloud Partners (NCPs). For details on OSMO concepts and workflow structure, see the NVIDIA OSMO user guide . the Azure sample instructions to deploy OSMO. Before submitting jobs, ensure the OSMO CLI is configured and authenticated against your designated cluster. Install the CLI: pip install osmo-cli Authenticate through the browser and verify your resource access: # Authenticate with the regional endpoint osmo login <https://<YOUR_OSMO_URL>> # List and set your compute pool osmo profile list osmo pool list osmo profile set pool <pool-name> Scaling Azure Kubernetes for SDG SDG workloads are bursty, heterogeneous, and extremely artifact-heavy; if you only scale compute, you’ll quickly run into bottlenecks. The most reliable approach is to scale with clear workload boundaries, predictable resource requests, and production-grade platform services. It is best to involve your IT and DevOps teams to capacity plan. Consider the following list of aspects when scaling SDG workloads on Kubernetes: Isolate workload classes with node pools : Separate GPU pools for OSMO versus AzureML, and (when needed) separate pools for simulation-heavy SDG runs versus training runs; enforce placement with node selectors, taints, and tolerations. Use elastic GPU capacity intentionally : Keep a right-sized baseline for steady-state services, then scale out GPU pools for large SDG campaigns; use Spot pools where appropriate to improve cost-efficiency. Plan for artifact throughput : SDG produces large sensor streams and intermediate outputs; treat storage throughput, dataset partitioning, and lifecycle/retention policies as core scaling design, not afterthoughts. Operationalize observability : Monitor both infrastructure (GPU utilization, pending pods, node saturation) and pipeline health (scenarios/hour, failure rate, dataset size growth), and use consistent run IDs to preserve data lineage. 
Scale OSMO dependencies like production services : Size and operate PostgreSQL and Redis for your expected concurrency, with backups and capacity planning to avoid workflow control-plane bottlenecks. Get started with end-to-end SDG workflows NVIDIA provides a comprehensive collection of OpenUSD resources to accelerate your learning journey. Start with the self-paced Learn OpenUSD , Digital Twins , and Robotics training curricula that build the foundational skills covered in this guide. For professionals ready to take the next steps in their robotics career, the OpenUSD Development certification offers a professional-level exam that validates your expertise in building, maintaining, and optimizing 3D content pipelines using OpenUSD. Visit our Cosmos Cookbook for step-by-step workflows, technical recipes, and concrete examples for building, adapting, and deploying Cosmos WFMs, or join our community to learn with peers. Tune in to upcoming OpenUSD Insiders livestreams and connect with the NVIDIA Developer Community . Stay up to date by following NVIDIA Omniverse on Instagram , LinkedIn , X , Threads , and YouTube . Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems. Watch NVIDIA Live at CES to learn more.   This post was originally published in October 2025. Discuss (0) Like Tags Computer Vision / Video Analytics | Robotics | Simulation / Modeling / Design | General | Isaac Sim | OSMO | Intermediate Technical | Tutorial | Azure | CES26 | featured | OpenUSD | Physical AI | Robotics Simulation | Synthetic Data Generation About the Authors About Asawaree Bhide Asawaree Bhide is a technical marketing engineer at NVIDIA, working on robotics and deep learning applications on the Jetson platform. She did her master’s in computer science at Georgia Tech and is interested in solving complex perception tasks in autonomous navigation for embodied agents. View all posts by Asawaree Bhide About Jathavan Sriram Jathavan Sriram is a senior solutions Architect at NVIDIA, specializing in AI and robotics solutions. His expertise spans cloud infrastructure, Kubernetes, and AI technology, with a focus on World Foundation Models and physical AI applications. He is passionate about enabling customers to deploy generative AI at scale on cloud-native platforms. View all posts by Jathavan Sriram About Saurav Nanda Saurav Nanda is a senior solutions architect at NVIDIA, where he concentrates on robotics and humanoids with a focus on cutting-edge models like NVIDIA Cosmos, VLMs, and VLAs. He brings over 12 years of professional experience, including a tenure as start-up CTO and as a Senior AI Architect at Synopsys, pioneering generative AI techniques for integrated circuit design. A Ph.D. graduate from Purdue University, Saurav has authored numerous highly-cited research papers and patents and actively contributes to the academic community as a reviewer and session chair. View all posts by Saurav Nanda About Aigul Dzhumamuratova Aigul Dzhumamuratova is a software engineer at NVIDIA, working on computer vision and robotics applications. Aigul earned her Bachelor’s and Master’s degrees in Engineering from Bauman University in Moscow, where she specialized in computer vision. Her work contributes to advancing the capabilities of autonomous machines across real-world environments. 
Redefining Secure AI Infrastructure with NVIDIA BlueField Astra for NVIDIA Vera Rubin NVL72 | NVIDIA Technical Blog nvidia_dev_blog 07.01.2026 17:00 0.71
Embedding sim.: 0.8176
Entity overlap: 0.0256
Title sim.: 0.3008
Time proximity: 0.7461
NLP type: product_launch
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country:

Open original

Large-scale AI innovation is driving unprecedented demand for accelerated computing infrastructure. Training trillion-parameter foundation models, serving them with disaggregated architectures, and processing inference workloads at massive throughput all push data center design to the limits. To keep up, service providers need infrastructure that not only scales but also delivers stronger security and better tenant isolation. This post introduces NVIDIA BlueField Astra running on NVIDIA BlueField-4 , a breakthrough innovation that redefines how service providers manage, secure, and scale AI infrastructure. The rise of bare-metal computing for AI As accelerated computing demand increases, the industry is prioritizing bare-metal computing to unlock the benefits of GPU acceleration. Unlike virtualized environments, bare-metal provisioning requires strict isolation and trusted control points to ensure that no tenant can interfere with another’s resources. The challenge arises because AI infrastructure spans two distinct networking domains: North-South (N-S) : The front-end network that connects users and applications to the AI cluster East-West (E-W) : The backend AI compute fabric that connects GPUs at massive bandwidth and ultra-low latency Today, CSPs already manage N-S traffic using NVIDIA BlueField DPUs, running their control software stacks on the embedded Arm cores. This model enables service providers to enforce isolation, provision resources, and secure workloads effectively. On the E-W domain, the NVIDIA Ethernet SuperNIC is the adapter purpose-built to meet the extreme requirements of AI workloads, delivering the performance, throughput, and congestion management that massive GPU clusters demand. As AI clusters scale, CSPs are looking for secure and consistent ways to extend provisioning and control into the AI compute fabric, complementing the performance and scalability that SuperNICs already provide. What is NVIDIA BlueField Astra? As announced at CES 2026, the NVIDIA Rubin platform features the new BlueField Advanced Secure Trusted Resource Architecture (Astra) running on BlueField-4 . BlueField Astra is a breakthrough system-level architecture that combines hardware and software innovations and is deeply integrated into the NVIDIA Vera Rubin NVL72 compute tray. Through dedicated connections between the BlueField-4 DPU and NVIDIA ConnectX-9 SuperNICs , BlueField Astra extends manageability, provisioning, and policy enforcement into the E-W fabric. For the first time, the DPU controls all network I/O to and from the compute node. With BlueField Astra, CSPs can extend their trusted software stack running on BlueField-4 DPUs to securely manage tenant isolation and network policies across the AI compute fabric. These policies are programmed through the out-of-band DPU port and enforced directly in SuperNIC hardware, ensuring consistent control throughout the system. Central to BlueField Astra is a new control plane architecture. Unlike traditional models, where host-based software configures both NICs and fabric, BlueField Astra completely isolates the SuperNIC control plane from the host operating system. This ensures that tenant workloads, even when running bare metal, cannot tamper with or gain visibility into network provisioning. Figure 1. The Vera Rubin NVL72 compute tray, supporting the BlueField Astra management model As shown in Figure 1, BlueField Astra establishes a direct path between the BlueField-4 DPU and ConnectX-9 SuperNICs, creating a unified control architecture. 
This delivers: Dedicated connectivity: Each NVIDIA ConnectX-9 SuperNIC connects directly to the BlueField-4 DPU, enabling the DPU to program, configure, and monitor the SuperNIC without relying on the host CPU. Out-of-band control: BlueField Astra routes all provisioning instructions and network policies through the BlueField embedded Arm cores. Unified control of N-S and E-W: BlueField-4 consolidates both domains under a single trusted control point. The same DPU that manages N-S networking for tenant isolation and security policies now extends those capabilities into the E-W AI compute fabric. Isolation from the tenant: Tenants use the SuperNIC for AI data movement, but have no access to or control over management functions, which remain fully isolated on the DPU. Security model consistency: By moving the NVIDIA DOCA stack from the host to the DPU, BlueField Astra ensures the E-W fabric inherits the same cloud-aligned security posture already proven for N-S traffic. BlueField Astra enables control, consistency, and confidence BlueField Astra transforms AI infrastructure management by creating a unified control plane across both N-S and E-W domains. With a single point of control anchored in the BlueField-4 DPU, service providers can streamline provisioning, enforce policies consistently, and reduce operational complexity—all without touching the host CPU. By design, BlueField Astra delivers stronger isolation and security. The SuperNIC control plane is isolated from tenant workloads and fully managed by the DPU, ensuring that tenants cannot bypass or alter policies. This model prevents lateral movement and configuration drift while giving CSPs confidence that bare-metal GPU nodes can be offered securely in multi-tenant environments. BlueField Astra also brings operational consistency. Service providers can extend the same DOCA-based management tools and workflows they already use on the N-S front end into the E-W compute fabric. Policies are pushed down into SuperNIC hardware for enforcement, enabling fine-grained tenant-aware provisioning while maintaining the performance advantages NVIDIA SuperNICs are known for. Finally, BlueField Astra supports compliance and auditability. With policies and configurations residing on the DPU rather than the host, CSPs gain clearer audit trails and a security posture aligned with the requirements of regulated industries. This ensures that security isn’t bolted on—it’s embedded into the operating system of AI infrastructure at scale. Extending operational workflows into bare-metal AI systems BlueField Astra builds on the DOCA software platform to provide a consistent means of deploying and operating infrastructure services on BlueField-4. By anchoring networking, security, storage, and management functions on the DPU, Astra enables existing DOCA microservices and operational workflows to extend naturally into bare-metal AI systems and the E-W compute fabric. With Astra, DOCA microservices run directly on BlueField-4 and interface with NVIDIA ConnectX-9 SuperNICs through a DPU-managed control plane. This model preserves compatibility with existing DOCA deployments while enabling the stronger isolation and control required for multitenant, bare-metal AI environments, without introducing new dependencies on the host operating system. 
BlueField Astra supports a set of DOCA microservices that together form the infrastructure control layer for AI systems: Networking N-S: DOCA Host-Based Networking (HBN) provides tenant-aware provisioning, isolation, and policy enforcement at the front-end of the AI cluster. E–W: DOCA-accelerated Open vSwitch (OVS) extends software-defined networking into the AI compute fabric, enabling controlled connectivity between GPU nodes while keeping fabric control isolated from tenant workloads. Security DOCA Argus delivers infrastructure-level telemetry and runtime visibility from the DPU, supporting monitoring and enforcement outside the tenant trust boundary. Storage DOCA SNAP offloads storage services through the DPU, enabling secure, isolated data paths that operate independently of host software. Management DOCA DMS provides device discovery, lifecycle management, and secure provisioning, allowing CSPs to manage AI nodes and SuperNICs through a centralized, DPU-anchored control point. Together, these DOCA microservices allow BlueField Astra to maintain a consistent, software-defined infrastructure model across both N-S and E-W domains, while preserving the performance characteristics required by large-scale AI workloads. Securing the future of AI infrastructure As AI workloads scale to new levels, service providers need to deliver bare-metal performance while maintaining strict multi-tenant security. With BlueField Astra, NVIDIA extends trusted control from the front-end network into the AI compute fabric itself. By combining BlueField DPUs with SuperNICs under a unified, isolated architecture, BlueField Astra empowers CSPs to confidently build, provision, and secure the next generation of AI infrastructure. To learn more about how NVIDIA Vera Rubin NVL72 and NVIDIA BlueField-4 are shaping the future of AI infrastructure, watch the NVIDIA Live presentation at CES 2026 with NVIDIA CEO Jensen Huang. To dive deeper into BlueField-4 features and capabilities, see the BlueField-4 datasheet . Discuss (0) Like Tags Data Center / Cloud | Networking / Communications | General | BlueField DPU | ConnectX | DOCA | Intermediate Technical | Deep dive | News | AI Platform | CES26 | featured | SuperNICs | Vera Rubin NVL72 About the Authors About Erez Tweg Erez Tweg drives product management for NVIDIA networking platforms, driving innovation at the intersection of networking, AI, and accelerated computing. With deep expertise across product and strategy roles in the technology sector, he brings a strong track record of shaping next-generation infrastructure solutions. Erez holds a bachelor’s degree in Electrical Engineering from Tel Aviv University. View all posts by Erez Tweg About Uriya Stern Uriya Stern is a product marketing manager for adapter security in NVIDIA, managing the strategy and delivery of the NVIDIA BlueField DPU cybersecurity features. With over 12 years in R&D managerial and project management roles, Uriya holds a bachelor’s degree in Electrical Engineering and an M.B.A. View all posts by Uriya Stern About Itay Ozery Itay Ozery is director of product marketing for networking at NVIDIA. He drives strategic product marketing initiatives for NVIDIA networking platforms and solutions. Itay has a solid track record in building and launching impactful products and solutions to market, and previously served in various enterprise IT positions. 
Open Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs | NVIDIA Technical Blog nvidia_dev_blog 06.01.2026 05:30 0.695
Embedding sim.: 0.7851
Entity overlap: 0.0952
Title sim.: 0.2083
Time proximity: 0.9574
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: generative ai
NLP country:

Open original

AI developer activity on PCs is exploding, driven by the rising quality of small language models (SLMs) and diffusion models, such as FLUX.2, GPT-OSS-20B, and Nemotron 3 Nano. At the same time, AI PC frameworks, including ComfyUI, llama.cpp, Ollama, and Unsloth are making functional advances, doubling in popularity over the past year as the number of developers using PC-class models has grown tenfold. Developers are no longer experimenting with generative AI workflows—they’re building the next-generation software stack on NVIDIA GPUs, from the data center to NVIDIA RTX AI PCs. At CES 2026, NVIDIA is announcing several new updates for the AI PC developer ecosystem, including: Acceleration for the top open source tools on PC, llama.cpp, and Ollama for SLMs, along with ComfyUI for diffusion models. Optimizations to the top open source models for NVIDIA GPUs, including the new LTX-2 audio-video model. A suite of tools to accelerate agentic AI workflows on RTX PCs and NVIDIA DGX Spark. Accelerated inference through open source AI frameworks NVIDIA collaborated with the open source community to boost inference performance across the AI PC stack.  Continued performance improvements on ComfyUI On the diffusion front, ComfyUI optimized performance on NVIDIA GPUs through PyTorch-CUDA and enabled support for NVFP4 and FP8 formats. These quantized formats enable memory savings of 60% and 40%, respectively, and accelerate performance. Developers will see an average of 3x performance with NVFP4 and 2x with NVFP8. Figure 1. Performance increase on ComfyUI Updates to ComfyUI include: NVFP4 support: Linear layers can run using the NVFP4 format with optimized kernels, delivering 3–4x higher throughput compared to FP16 and BF16 linear layers. Fused FP8 quantization kernels: Boost model performance by eliminating memory-bandwidth-bound operations. Fused FP8 de-quantization kernels: Performance for FP8 workloads is further improved on NVIDIA RTX GPUs without fourth-generation Tensor Cores (pre NVIDIA Ada.) Weight streaming: Leveraging concurrent system memory and CPU compute streams, weight streaming hides memory latency and increases throughput, especially on GPUs with limited VRAM. Mixed precision support: Models can combine multiple numerical formats within a single network, enabling fine-grained tuning for optimal accuracy and performance. RMS & RoPE Fusion: Common, memory-bandwidth-limited operators in diffusion transformers are fused to reduce memory usage and latency. This optimization benefits all DiT models across data types. The sample code for the optimizations is available under the ComfyUI kitchen repository. NVFP4 and FP8 checkpoints are also available in HuggingFace, including the new LTX-2 , FLUX.2 , FLUX.1-dev , FLUX.1-Kontext , Qwen-Image and Z-Image . Acceleration on RTX AI PCS for llama.cpp and Ollama For SLMs, token generation throughput performance on mixture-of-expert (MoE) models has increased by 35% on llama.cpp on NVIDIA GPUs, and 30% on Ollama on RTX PCs. Figure 2. Shows token generation performance improvements on GPT-OSS-20B, Nemotron Nano V2, and Qwen 3 30B with NVIDIA RTX on llama.cpp Jan’26 builds are run with the following environment variables and flags: GGML_CUDA_GRAPH_OPT=1, FA=ON, and —backend-sampling Updates to llama.cpp include: GPU token sampling : Offloads several sampling algorithms (TopK, TopP, Temperature, minK, minP, and multi-sequence sampling) to the GPU, improving quality, consistency, and accuracy of responses, while also increasing performance. 
Concurrency for QKV projections : Support for running concurrent CUDA streams to speed up model inference. To use this feature, pass in the –CUDA_GRAPH_OPT=1 flag. MMVQ kernel optimizations : Pre-loads data into registers and hides delays by increasing GPU utilization on other tasks, to speed up the kernel. Faster model loading time : Up to 65% model load time improvements on DGX Spark, and 15% on RTX GPUs. Native MXFP4 support on NVIDIA Blackwell GPUs : Up to 25% faster prompt processing on LLMs using the hardware-level NVFP4 fifth-generation of Tensor Cores on the Blackwell GPUs. Updates to Ollama include: Flash attention by default: Now standard on many models. This technique uses “tiling” to compute attention in smaller blocks, reducing the number of transfers between GPU VRAM and system RAM to boost inference and memory efficiency. Memory management scheme : A new scheme allocates additional memory to the GPU, increasing token generation and processing speeds. LogProbs added to the API: Unlocks additional developer capabilities for use cases like classification, perplexity calculations, and self-evaluation. The latest optimizations from the upstream GGML library. Check out the llama.cpp repository and the Ollama repository to get started, and test them in apps like LM Studio or the Ollama App . New advanced audio-video model on RTX AI PC NVIDIA and Lightricks are releasing LTX-2 model weights—an advanced audio-video model that competes with cloud models that you can run on your RTX AI PC or DGX Spark. This is an open, production-ready audio-video foundation model delivering up to 20 seconds of synchronized AV content at 4K resolution. It can offer frame rates of up to 50 fps and provides multi-modal control for high extensibility for developers, researchers, and studios. The model weights are available in BF16 and NVFP8. The quantized checkpoint delivers 30% memory reduction, enabling the model to run efficiently on RTX GPUs and DGX Spark. In the past weeks, we’ve also seen dozens of new models being released, each pushing the frontier of generative AI. Figure 3. Example 4K50LTX-2 output An Agentic AI toolkit for local AI The use cases for private, local agents are endless. But building reliable, repeatable, and high-quality private agents remains a challenge. LLM quality deteriorates when you distill and quantize the model to fit within a limited VRAM budget on PC. The need for accuracy increases as agentic workflows require reliable and repeatable answers when interfacing with other tools or actions. To address this, developers typically use two tools to increase accuracy: fine-tuning and retrieval-augmented-generation (RAG). NVIDIA released updates to accelerate tools across this workflow for building agentic AI. Nemotron 3 Nano is a 32B parameter MoE model optimized for agentic AI and fine-tuning. With 3.6B active parameters and a 1M context window, it tops several benchmarks across coding, instruction-following, long-context reasoning, and STEM tasks. The model is optimized for RTX PCs and DGX Spark via Ollama and llama.cpp , and can be fine-tuned using Unsloth . This model stands out for being the most open, with weights, recipes, and datasets widely available. Open models and datasets make customizing the model easier for developers. They prevent redundant fine-tuning and eliminate data leakage for objective benchmarking for robust and efficient workflows. Get started with LoRA-based fine-tuning for it. 
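As a starting point for that fine-tuning path, here is a hedged sketch of LoRA fine-tuning with Unsloth. The Hugging Face model id, dataset, LoRA target modules, and hyperparameters are placeholders; check the Unsloth documentation for supported models and recommended settings for Nemotron 3 Nano.

```python
# Hedged sketch of LoRA fine-tuning with Unsloth. The model id, dataset, LoRA target
# modules, and hyperparameters are placeholders; check the Unsloth documentation for
# supported models and settings (and note that newer TRL versions move some SFTTrainer
# arguments into SFTConfig).
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="<nemotron-3-nano-hf-repo>",  # placeholder Hugging Face repo id
    max_seq_length=4096,
    load_in_4bit=True,                       # keep VRAM usage within a PC-class budget
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
)

# Placeholder dataset of instruction/response traces in a single "text" column.
dataset = load_dataset("json", data_files="agent_traces.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="nemotron-nano-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
```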
For RAG, NVIDIA partnered with Docling, a package to ingest, analyze, and process documents into a machine-understandable format for RAG pipelines. Docling is optimized for RTX PCs and DGX Spark and delivers 4x higher performance compared to CPUs. There are two ways of using Docling:
Traditional OCR pipeline: A pipeline of libraries and models that is accelerated via PyTorch-CUDA on RTX.
VLM-based pipeline: An advanced pipeline for complex, multi-modal documents, available for use via vLLM within WSL and Linux environments.
Docling was developed at IBM and has been contributed to the Linux Foundation. Start now on RTX with this easy-to-use guide.
SDKs for audio and video effects
The NVIDIA Video and Audio Effects SDKs enable developers to apply AI effects in multimedia pipelines that enhance quality with features such as background noise removal, virtual backgrounds, and eye contact. The latest updates at CES 2026 enhance the video relighting feature to produce more natural and stable results across diverse environments, while improving performance by 3x (reducing the minimum GPU required to run it to an NVIDIA GeForce RTX 3060 or above) and decreasing the model size by up to 6x. To see the Video Effects SDK with AI relighting in action, check out the new release of the NVIDIA Broadcast app.
We’re excited to collaborate with the open source community of AI PC tools to deliver models, optimizations, tools, and workflows for developers. Start developing for RTX PCs and DGX Spark today!
Tags: Agentic AI / Generative AI | Developer Tools & Techniques | General | Nemotron | Intermediate Technical | Deep dive | CES26 | DGX Spark | featured | LLMs | RTX AI
About the Authors
Annamalai Chockalingam is a product manager on the NVIDIA GeForce AI PC team, championing the ecosystem for AI developers. He leads the charge in unlocking the power of local RTX GPUs, delivering the critical tools and software stack developers need to optimize and deploy AI across millions of PCs worldwide. Since joining NVIDIA in 2022, Annamalai has been instrumental in shaping the LLM landscape, previously working on the NeMo suite of products. Drawing on a diverse background spanning deep learning, firmware, and management consulting—and holding degrees from NYU Stern & Courant and the University of Alberta—he is dedicated to making consumer hardware into a powerhouse for AI developers.
Related posts: NVIDIA TensorRT for RTX Introduces an Optimized Inference AI Library on Windows 11 | Kickstart Your AI Journey on RTX AI PCs and Workstations with NVIDIA NIM Microservices | Top Posts of 2024 Highlight NVIDIA NIM, LLM Breakthroughs, and Data Science Optimization | Supercharging LLM Applications on Windows PCs with NVIDIA RTX Systems | Get Started with Generative AI Development for Windows PCs with NVIDIA RTX
Related posts: Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy | Designing Protein Binders Using the Generative Model Proteina-Complexa | How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain | Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
Build an AI Catalog System That Delivers Localized, Interactive Product Experiences | NVIDIA Technical Blog nvidia_dev_blog 09.01.2026 14:00 0.694
Embedding sim.0.7724
Entity overlap0.125
Title sim.0.2381
Time proximity1
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: generative ai
NLP country:

Open original

E-commerce catalogs often contain sparse product data: generic images, a basic title, and a short description. This limits discoverability, engagement, and conversion. Manual enrichment doesn’t scale because it relies on catalog managers to manually write descriptions, apply tags, and categorize products. The process is slow, inconsistent, and error-prone.
This tutorial shows developers, product managers, and catalog teams how to deploy an AI-powered enrichment blueprint that transforms a single product image into rich, localized catalog entries. Using NVIDIA Nemotron large language models (LLMs) and vision-language models (VLMs)—including Nemotron-Nano-12B-V2-VL, Llama-3.3-Nemotron-Super-49B-V1, FLUX.1-Kontext-Dev for image generation, and TRELLIS Image-to-3D models—the system automatically generates detailed titles and descriptions, accurate categories, comprehensive tags, localized cultural variations, and interactive 3D assets tailored to regional markets.
The tutorial covers the complete architecture, API usage for VLM analysis and asset generation, deployment strategies with Docker containers, and real-world integration patterns. By the end, this tutorial demonstrates how to automate catalog enrichment at scale, turning sparse product data like “Black Purse” into rich listings like “Glamorous Black Evening Handbag with Gold Accents,” complete with detailed descriptions, validated categories, tags, and multiple asset types.
Prerequisites
This tutorial assumes intermediate to advanced technical knowledge. It involves working with AI APIs, building REST services, and deploying containerized applications. Basic familiarity with the listed technologies will help in following along and implementing the system:
Python 3.11+
The uv package manager (or pip)
An NVIDIA API key
A HuggingFace token for FLUX model access
Docker and Docker Compose
Creating an AI-powered catalog enrichment blueprint
To address the scalability and consistency gaps of manual catalog enrichment, along with the discoverability and conversion issues they cause, the blueprint is designed as an end-to-end catalog transformation pipeline. The modular system of specialized models works together, containerized with Docker and served through NVIDIA NIM for enterprise-grade performance.
Figure 1. Catalog enrichment workflow diagram
Here’s the core technology stack:
NVIDIA Nemotron VLM (nemotron-nano-12b-v2-vl): Analyzes product images to extract features, categories, and context.
NVIDIA Nemotron LLM (llama-3_3-nemotron-super-49b-v1_5): Acts as the “brain,” generating rich, localized text (titles, descriptions) and planning culturally-aware prompts for image generation.
Black Forest Labs FLUX.1-Kontext-dev: Generates new, high-quality 2D image variations.
Microsoft TRELLIS Image-to-3D: Transforms 2D product images into interactive 3D models.
The most important part of this solution is its modular, three-stage API. A common mistake is building one slow, monolithic API call that does everything.
Stage 1: Fast VLM analysis (POST /vlm/analyze)
Job: Takes an image and locale, plus optional existing product data and brand instructions.
Output: Rich, structured JSON. It returns improved titles, descriptions, validated categories, comprehensive tags, and attributes localized to the target region.
Stage 2: Image generation (POST /generate/variation)
Job: Takes the output from Stage 1 (the title, description, and tags) plus the original image.
Output: A new, culturally-appropriate 2D image variation.
Stage 3: 3D asset generation (POST /generate/3d)
Job: Takes the original 2D image.
Output: An interactive 3D .glb model.
The frontend can call /vlm/analyze, get instant results to show the user, and then offer buttons to “generate 3D model” or “create marketing assets,” which trigger asynchronous backend jobs.
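Before wiring this into a backend, it can help to see the stage separation from the client’s side. The following is a minimal sketch only, assuming the endpoint paths above and a plain HTTP client; the helper names, the subset of form fields, and the threading choice are illustrative and not part of the blueprint itself.

# Client-side orchestration sketch for the three-stage API.
# Endpoint paths come from the blueprint; everything else is illustrative.
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:8000"

def analyze(image_path: str, locale: str = "en-US") -> dict:
    # Stage 1: fast, synchronous analysis the UI can show immediately.
    with open(image_path, "rb") as f:
        r = requests.post(f"{BASE}/vlm/analyze",
                          files={"image": f}, data={"locale": locale})
    r.raise_for_status()
    return r.json()

def generate_variation(image_path: str, enriched: dict) -> dict:
    # Stage 2: slower 2D generation, run as a background job.
    # Only a subset of the fields shown later in Step 4 is passed here.
    with open(image_path, "rb") as f:
        r = requests.post(f"{BASE}/generate/variation",
                          files={"image": f},
                          data={"locale": enriched["locale"],
                                "title": enriched["title"],
                                "description": enriched["description"]})
    r.raise_for_status()
    return r.json()

def generate_3d(image_path: str) -> bytes:
    # Stage 3: 3D asset generation from the original image.
    with open(image_path, "rb") as f:
        r = requests.post(f"{BASE}/generate/3d", files={"image": f})
    r.raise_for_status()
    return r.content  # raw .glb bytes

enriched = analyze("bag.jpg")          # show this to the user right away
with ThreadPoolExecutor() as pool:     # kick off asset generation on demand
    variation_job = pool.submit(generate_variation, "bag.jpg", enriched)
    glb_job = pool.submit(generate_3d, "bag.jpg")

The point of the sketch is the shape of the flow: the analysis result returns quickly, while the generation calls are treated as jobs the UI can poll or await.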
Building the enrichment pipeline
In this section, the backend is run locally to call the enrichment APIs end-to-end. A product image is uploaded to generate enriched, localized metadata, create an image variation with quality scoring, and produce a 3D asset, following the three-stage API approach described above.
Step 1: Set up the local backend
First, get the FastAPI backend server running on a local machine to test the API endpoints.
Clone the repository:
git clone https://github.com/NVIDIA-AI-Blueprints/Retail-Catalog-Enrichment.git
cd Retail-Catalog-Enrichment
Create a .env file in the root directory with the API keys:
NGC_API_KEY=your_nvidia_api_key_here
HF_TOKEN=your_huggingface_token_here
Set up the Python environment using uv (or pip):
# Create and activate a virtual environment
uv venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install -e .
Run the FastAPI server with Uvicorn:
uvicorn --app-dir src backend.main:app --host 0.0.0.0 --port 8000 --reload
The API is now live at http://localhost:8000. Its health can be checked at http://localhost:8000/health.
Step 2: Visual analysis
With the server running, the core /vlm/analyze endpoint can be used. This is the workhorse of the system, designed for instant, synchronous feedback.
Execute a basic analysis of a product image. This command sends a product image (bag.jpg) and specifies the en-US locale.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  http://localhost:8000/vlm/analyze
Review the JSON response. In just a few seconds, a rich JSON object is returned. This is the “before-and-after” transformation:
{
  "title": "Glamorous Black Evening Handbag with Gold Accents",
  "description": "This exquisite handbag exudes sophistication and elegance. Crafted from high-quality, glossy leather...",
  "categories": ["accessories"],
  "tags": ["black leather", "gold accents", "evening bag", "rectangular shape"],
  "colors": ["black", "gold"],
  "locale": "en-US"
}
Step 3: Augment data with localization and brand voice
The true power of the API comes from its augmentation capabilities.
Localize content for a new region by providing existing product data and a new locale. This example targets the Spanish market (es-ES). The system is smart enough to enhance the sparse data using regional terminology.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Black Purse","description":"Elegant bag"}' \
  -F "locale=es-ES" \
  http://localhost:8000/vlm/analyze
Apply a custom brand voice using the brand_instructions parameter. A brand isn’t generic, so the content shouldn’t be either. This guides the AI’s tone, voice, and taxonomy.
curl -X POST \
  -F "image=@product.jpg;type=image/jpeg" \
  -F 'product_data={"title":"Beauty Product","description":"Nice cream"}' \
  -F "locale=en-US" \
  -F 'brand_instructions=You work at a premium beauty retailer. Use a playful, empowering, and inclusive brand voice. Focus on self-expression and beauty discovery. Use terms like "beauty lovers", "glow", "radiant", and "treat yourself".' \
  http://localhost:8000/vlm/analyze
The AI will generate a description that’s accurate and on-brand.
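In practice, localization usually fans out across several markets at once. The sketch below loops the same /vlm/analyze call over a list of locales with shared product data and brand voice; the locale list, brand text, and output handling are illustrative placeholders, not part of the blueprint.

# Illustrative multi-locale enrichment loop for the /vlm/analyze endpoint.
# The locale list, brand voice text, and file names are placeholders.
import json
import requests

BASE = "http://localhost:8000"
product_data = {"title": "Black Purse", "description": "Elegant bag"}
brand_voice = "Use a playful, empowering, and inclusive brand voice."

localized = {}
for locale in ["en-US", "es-ES", "de-DE"]:
    with open("bag.jpg", "rb") as f:
        r = requests.post(
            f"{BASE}/vlm/analyze",
            files={"image": f},
            data={
                "locale": locale,
                "product_data": json.dumps(product_data),
                "brand_instructions": brand_voice,
            },
        )
    r.raise_for_status()
    localized[locale] = r.json()  # one enriched listing per target market

print(localized["es-ES"]["title"])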
Step 4: Generate cultural image variations
Now that rich, localized text has been generated, the /generate/variation endpoint can be used to create matching 2D marketing assets.
Generate a new image by passing in the results from Step 2. This endpoint uses the localized text as a plan to generate a new image with the FLUX model.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "locale=en-US" \
  -F "title=Glamorous Black Evening Handbag with Gold Accents" \
  -F "description=This exquisite handbag exudes sophistication..." \
  -F 'categories=["accessories"]' \
  -F 'tags=["black leather","gold accents","evening bag"]' \
  -F 'colors=["black","gold"]' \
  http://localhost:8000/generate/variation
This call returns JSON with a generated_image_b64 string. If using the es-ES locale, the model generates a background more fitting for that market, like a Mediterranean courtyard instead of a modern studio.
Review the JSON response:
{
  "generated_image_b64": "iVBORw0KGgoAAAANS...",
  "artifact_id": "a4511bbed05242078f9e3f7ead3b2247",
  "image_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.png",
  "metadata_path": "data/outputs/a4511bbed05242078f9e3f7ead3b2247.json",
  "locale": "en-US"
}
Step 5: Automated quality control with NVIDIA Nemotron VLM
Generative AI is powerful, but it can hallucinate. In an enterprise catalog, a “Black Handbag” can’t suddenly have a blue strap or a missing handle. To solve this, an agentic reflection loop has been implemented.
Instead of relying on human reviewers, a Quality Assurance Agent powered by NVIDIA Nemotron VLM can be deployed. This module acts as a strict critic, performing a “reflection” step that compares the generated variation against the original product image to ensure fidelity. Before the API responds, this agent analyzes the generated image against the original product photo across five strict dimensions:
Product consistency: Do colors, materials, and textures match the original?
Structural fidelity: Are key elements like handles, zippers, and pockets preserved?
Size and scale: Does the product look realistically sized in its new context?
Anatomical accuracy: If a human model is present, are the hands and fingers rendered correctly?
Background quality: Is the lighting and context photorealistic?
The “VLM Judge” output: The API returns the generated asset alongside a detailed quality report, including a quality score and a list of specific issues.
{
  "generated_image_b64": "iVBORw0KGgoAAAANSUhEUgA...",
  "artifact_id": "027c08866d90450399f6bf9980ab7...",
  "image_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...png",
  "metadata_path": "/path/to/outputs/027c08866d90450399f6bf9980ab73...json",
  "quality_score": 72.5,
  "quality_issues": [
    "Product appears slightly oversized relative to background context",
    "Minor texture inconsistency on handle hardware"
  ],
  "locale": "en-US"
}
This feature provides the critical metadata needed for automation. Software integrators can expand this functionality to build self-correcting pipelines where the system autonomously retries generation with adjusted prompts until the VLM Judge awards a passing score (e.g., >85), as sketched below.
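One minimal way to sketch such a self-correcting loop is shown here. It builds only on the response fields shown above (quality_score, quality_issues); the threshold, retry count, and prompt-adjustment strategy are assumptions for illustration, not part of the blueprint.

# Illustrative self-correcting generation loop around /generate/variation.
# Threshold, retry count, and feedback strategy are assumptions.
import requests

BASE = "http://localhost:8000"

def generate_with_qc(image_path, fields, max_attempts=3, passing_score=85.0):
    # fields is expected to contain at least locale, title, and description.
    notes = ""
    result = {}
    for attempt in range(max_attempts):
        data = dict(fields)
        if notes:
            # Feed the judge's feedback back into the prompt for the next try.
            data["description"] = f"{fields['description']} Avoid: {notes}"
        with open(image_path, "rb") as f:
            r = requests.post(f"{BASE}/generate/variation",
                              files={"image": f}, data=data)
        r.raise_for_status()
        result = r.json()
        if result.get("quality_score", 0) >= passing_score:
            return result              # the VLM Judge approved this asset
        notes = "; ".join(result.get("quality_issues", []))
    return result                      # best effort after max_attempts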
Step 6: Create interactive 3D assets
Finally, bring the product to life with a 3D model using the /generate/3d endpoint.
Request a 3D model from the original 2D image. This is a simple call that only needs the image.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  http://localhost:8000/generate/3d \
  --output product.glb
In a few seconds, a product.glb file is generated. This file can be dropped directly into any web-based 3D viewer, allowing customers to inspect the product from every angle.
Request a JSON response (optional). For web clients, it’s often easier to handle a JSON response. To do this, set return_json=true.
curl -X POST \
  -F "image=@bag.jpg;type=image/jpeg" \
  -F "return_json=true" \
  http://localhost:8000/generate/3d
Review the JSON response: This will return the 3D model as a base64 string, along with metadata.
{
  "glb_base64": "Z2xURgIAAA...A=",
  "artifact_id": "c724a1b8e1f54a6b8d2c9a7e6f3d1b9f",
  "metadata": {
    "slat_cfg_scale": 5.0,
    "ss_cfg_scale": 10.0,
    "slat_sampling_steps": 50,
    "ss_sampling_steps": 50,
    "seed": 0,
    "size_bytes": 1234567
  }
}
Step 7: Move to production (Docker and troubleshooting)
Here are a few common tips for debugging and moving to a full production-like deployment.
Run the full stack with Docker. In this example, the backend was run locally; however, the complete project is designed for Docker. The docker-compose.yml file will launch the frontend, the backend, and all the AI models served through NVIDIA NIM microservices.
Check GPU availability. If models fail, the first check should be nvidia-smi to ensure Docker can see the GPUs.
Inspect service logs. The best way to debug is by tailing the logs for a specific service:
docker-compose logs -f backend
Extensibility and future features
The goal of extending this blueprint is to increase the breadth and quality of commerce-ready assets and metadata autonomously. The project roadmap includes several extensions that can be built on:
Agentic social media research: This planned feature introduces a specialized social media research agent as part of an agentic workflow, where autonomous agents handle complex tasks. Powered by reasoning models like NVIDIA Nemotron and using tool calling with social media APIs or MCPs, the agent analyzes real-world usage patterns, sentiment, and trending terminology, feeding these insights into the /vlm/analyze step to keep product descriptions rich, relevant, and on-trend.
Short video generation: The next step is to add another generative endpoint to create 3-5 second product video clips. Using open source models, short video clips can be generated directly from 2D images, creating a dynamic, AI-generated lifestyle clip or product spin without needing a complex video shoot.
This foundation is designed for extension. Modules can be added for virtual try-on, automated ad generation, or dynamic pricing models by following the same pattern of adding a new, specialized microservice.
Conclusion
We’ve successfully built a powerful, AI-driven pipeline that solves the sparse catalog problem. The key takeaways for building a system like this are:
Go modular: A production-ready system must separate fast analysis from slow generation. This provides a responsive UI and the flexibility to treat asset generation as an on-demand or background task.
Localization is key: True enrichment isn’t just translation; it’s cultural adaptation. By making locale a core parameter, the system generates text and images that resonate with global audiences.
Brand voice is a feature: The brand_instructions parameter is a game-changer. It transforms the LLM from a generic generator into a true, scalable brand assistant.
Resources
Ready to build this yourself? Dive into the project documentation:
API Documentation: Get a detailed look at all endpoints, parameters, and examples.
Docker Deployment Guide: Learn how to deploy the full stack with NVIDIA NIM containers.
NVIDIA Build: Get your API key and explore more models.
Learn more about the Retail Catalog Enrichment Blueprint.
Tags: Agentic AI / Generative AI | Content Creation / Rendering | Developer Tools & Techniques | Retail / Consumer Packaged Goods | Blueprint | Nemotron | NIM | Intermediate Technical | Tutorial | featured | LLMs | VLMs
About the Authors
Antonio Martinez is a generative AI technical marketing engineer at NVIDIA, where he supports AI adoption for the retail, consumer-packaged goods, and quick-service restaurant industries. His work focuses on bringing generative AI and agentic technologies into real-world retail workflows. Antonio's expertise includes computer vision, multimodal generative AI, and agentic AI systems that power intelligent retail and digital commerce experiences. Prior to joining NVIDIA, Antonio was a Staff Software Engineer at Intel for 10 years, developing computer vision and agentic AI solutions for retail and supply chain use cases, including autonomous checkout, loss prevention, and next-generation digital shopping experiences. Antonio received a Bachelor of Science in Computer Engineering in Mexico and a Master of Science in Computer Science from Texas State University.
Related posts: Building a Simple VLM-Based Multimodal Information Retrieval System with NVIDIA NIM | Build Multimodal Visual AI Agents Powered by NVIDIA NIM | Deliver Personalized Retail Experiences with an AI-Powered Shopping Advisor | Scale and Curate High-Quality Datasets for LLM Training with NVIDIA NeMo Curator | Personalized Aesthetics: Recording the Visual Mind
Related posts: Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy | Designing Protein Binders Using the Generative Model Proteina-Complexa | How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain | Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM | NVIDIA Technical Blog nvidia_dev_blog 08.01.2026 17:28 0.694
Embedding sim.0.8145
Entity overlap0.05
Title sim.0.2727
Time proximity0.5973
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly want to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot – where latency, reliability, and the ability to operate offline matter most. While many existing LLM and vision language model (VLM) inference frameworks focus on data center needs such as managing large volumes of concurrent user requests and maximizing throughput across them, embedded inference requires a dedicated, tailored solution.
This post introduces NVIDIA TensorRT Edge-LLM, a new, open source C++ framework for LLM and VLM inference, to solve the emerging need for high-performance edge inference. Edge-LLM is purpose-built for real-time applications on the embedded automotive and robotics platforms NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is provided as open source on GitHub for the NVIDIA JetPack 7.1 release.
TensorRT Edge-LLM has minimal dependencies, enabling deployment for production edge applications. Its lean, lightweight design with a clear focus on embedded-specific capabilities minimizes the framework’s resource footprint. In addition, TensorRT Edge-LLM’s advanced features—such as EAGLE-3 speculative decoding, NVFP4 quantization support, and chunked prefill—provide cutting-edge performance for demanding real-time use cases.
Figure 1. TensorRT Edge-LLM shows compelling performance using Qwen3 with speculative decoding
LLM and VLM inference for real-time edge use cases
Edge LLM and VLM inference workloads are defined by the following characteristics:
Requests from few users or a single user
Low batch size, usually across cameras
Production deployments for mission-critical applications
Offline operation without updating
As a consequence, robotics and automotive real-time applications come with specific requirements, including:
Minimal and predictable latency
Minimal disk, memory, and compute requirements
Compliance with production standards
High robustness and reliability
TensorRT Edge-LLM is designed to fulfill and prioritize these embedded-specific needs to provide a strong foundation for embedded LLM and VLM inference.
Rapid adoption of TensorRT Edge-LLM for automotive use cases
Partners are already leveraging TensorRT Edge-LLM as a foundation for their in-car AI products, including Bosch, ThunderSoft, and MediaTek, who are showcasing their technology at CES 2026.
Bosch is developing the Bosch AI-powered Cockpit in collaboration with Microsoft and NVIDIA, which features an innovative in-car AI assistant capable of natural voice interactions. The solution uses embedded automated speech recognition (ASR) and text-to-speech (TTS) AI models in conjunction with LLM inference through TensorRT Edge-LLM for a powerful onboard AI that cooperates with larger, cloud-based AI models through a sophisticated orchestrator.
ThunderSoft integrates TensorRT Edge-LLM into its upcoming AIBOX platform, based on NVIDIA DRIVE AGX Orin, to enable responsive, on-device LLM and multimodal inference inside the vehicle. By combining the ThunderSoft automotive software stack with the TensorRT Edge-LLM lightweight C++ runtime and optimized decoding path, the AIBOX delivers low-latency conversational and cockpit-assist experiences within strict power and memory limits.
MediaTek builds on top of TensorRT Edge-LLM for its CX1 SoC, which enables cutting-edge cabin AI and HMI applications.
TensorRT Edge-LLM accelerates both LLM and VLM inference for a wide range of use cases, including driver and cabin activity monitoring. MediaTek contributes to the development of TensorRT Edge-LLM with new embedded-specific inference methods.
With the launch of TensorRT Edge-LLM, these LLM and VLM inference capabilities are now available for the NVIDIA Jetson ecosystem as the foundation for robotics technology.
TensorRT Edge-LLM under the hood
TensorRT Edge-LLM is designed to provide an end-to-end workflow for LLM and VLM inference. It spans three stages:
Exporting Hugging Face models to ONNX
Building optimized NVIDIA TensorRT engines for the target hardware
Running inference on the target hardware
Figure 2. TensorRT Edge-LLM workflow with key components
The Python export pipeline converts Hugging Face models to ONNX format with support for quantization, LoRA adapters, and EAGLE-3 speculative decoding (Figure 3).
Figure 3. TensorRT Edge-LLM Python export pipeline stages and tools
The engine builder builds TensorRT engines optimized specifically for the embedded target hardware (Figure 4).
Figure 4. TensorRT Edge-LLM engine builder workflow
The C++ runtime is responsible for LLM and VLM inference on the target hardware. It makes use of the TensorRT engines for the decoding loop that defines autoregressive models: iterative token generation based on the input and previously generated tokens. User applications interface with this runtime to solve LLM and VLM workloads.
Figure 5. Prefill and decode phases of the TensorRT Edge-LLM C++ runtime
For a more detailed explanation of the components, see the TensorRT Edge-LLM documentation.
Get started with TensorRT Edge-LLM
Ready to get started with LLM and VLM inference on your Jetson AGX Thor DevKit?
1. Download the JetPack 7.1 release.
2. Clone the JetPack 7.1 release branch of the NVIDIA/TensorRT-Edge-LLM GitHub repo:
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
3. Check the TensorRT Edge-LLM Quick Start Guide for detailed instructions on getting out-of-the-box supported models from Hugging Face, converting them to ONNX, building TensorRT engines for your Jetson AGX Thor platform, and running them with the C++ runtime.
4. Explore the TensorRT Edge-LLM examples to learn more about features and capabilities.
5. See the TensorRT Edge-LLM Customization Guide to adapt TensorRT Edge-LLM to your own needs.
For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is part of the NVIDIA DriveOS release package. Upcoming DriveOS releases will leverage the GitHub repo.
As LLMs and VLMs move rapidly to the edge, TensorRT Edge-LLM provides a clean, reliable path from Hugging Face models to real-time, production-grade execution on NVIDIA automotive and robotics platforms. Explore the workflow, test your models, and begin building the next generation of intelligent on-device applications. To learn more, visit the NVIDIA/TensorRT-Edge-LLM GitHub repo.
Acknowledgments
Thank you to Michael Ferry, Nicky Liu, Martin Chi, Ruocheng Jia, Charl Li, Maggie Hu, Krishna Sai Chemudupati, Frederik Kaster, Xiang Guo, Yuan Yao, Vincent Wang, Levi Chen, Chen Fu, Le An, Josh Park, Xinru Zhang, Chengming Zhao, Sunny Gai, Ajinkya Rasane, Zhijia Liu, Ever Wong, Wenting Jiang, Jonas Li, Po-Han Huang, Brant Zhao, Yiheng Zhang, and Ashwin Nanjappa for your contributions to and support of TensorRT Edge-LLM.
Discuss (0) Like Tags Agentic AI / Generative AI | Developer Tools & Techniques | Edge Computing | Robotics | Automotive / Transportation | DRIVE | JetPack | Jetson | TensorRT-LLM | Intermediate Technical | Deep dive | AI Agent | AI Inference | CES26 | featured | Inference Performance | LLMs | Thor | VLMs About the Authors About Lin Chai Lin Chai is a senior product manager at NVIDIA, leading TensorRT and TensorRT Edge-LLM, NVIDIA’s AI inference platforms for deep learning across datacenter and embedded platforms. Drawing on her background in autonomous driving and automotive OEMs, she is inspired to build production-grade inference systems that deliver best-in-class performance for deep learning workloads across data center, edge, and physical AI applications—enabling systems that perceive, reason, and act in the real world. View all posts by Lin Chai About Felix Friedmann Felix Friedmann is a product and engineering lead for the NVIDIA DRIVE platform, covering NVIDIA Embedded AI Inference and NVIDIA DriveWorks. He unites the latest technological innovations, like embedded visual language models, with the reliability and safety required for an automotive software platform. Felix worked with the NVIDIA DRIVE platform since its earliest generation in his previous roles, when he brought early deep learning models into embedded applications at Audi, and designed perception and system architecture for AVs at VW’s AID, and later Argo AI. View all posts by Felix Friedmann About Luxiao Zheng Luxiao Zheng is a senior systems software engineer at NVIDIA. He works on the TensorRT general performance team with a specialization in Large Language Model inference workflow. He works on end-to-end LLM software development, performance measurements, analysis and improvements for x86_64 and aarch64 platforms. Luxiao holds a M.S. in Computer Science, a B.S. in Computer Science and a B.S. in Chemical Engineering from Washington University in St. Louis. View all posts by Luxiao Zheng About Fan Shi Fan Shi is a senior system software engineer on the NVIDIA TensorRT team, specializing in the efficient deployment of advanced AI models on edge platforms. His work focuses on optimizing performance and usability in deep learning inference. Fan holds an M.S. in computational data science from Carnegie Mellon University and a B.S. in statistics and computer science from the University of Illinois. View all posts by Fan Shi About Amber Liu Amber Liu is a senior system software engineer at NVIDIA, focusing on edge AI and large language model applications. She works closely with customers and partners in China to enable LLM use cases across autonomous driving, AI cockpit, and robotics, helping teams build production-ready edge AI systems. As a core contributor to TensorRT Edge‑LLM, she drives the development of high‑performance inference solutions that bring state-of-the-art large language models to embedded platforms. 
Related posts: Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics | Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy | Streamline LLM Deployment for Autonomous Vehicle Applications with NVIDIA DriveOS LLM SDK | Visual Language Models on NVIDIA Hardware with VILA | Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
Related posts: How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy | Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell | Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics | Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models | Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics
Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell | NVIDIA Technical Blog nvidia_dev_blog 08.01.2026 19:43 0.693
Embedding sim.0.7698
Entity overlap0.0556
Title sim.0.2838
Time proximity0.9867
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt.
Through extreme co-design across GPUs, CPUs, networking, software, power delivery, and cooling, NVIDIA continues to drive up token throughput per watt, which reduces cost per million tokens. Additionally, NVIDIA continues to enhance its software stacks to achieve even greater levels of performance from existing platforms. This increases the value of the large installed base of NVIDIA GPUs across cloud service providers (CSPs), GPU clouds, model builders, enterprises, and others, enabling that infrastructure to remain productive for longer.
In this post, we show how recent updates to the NVIDIA inference software stack—running on the NVIDIA Blackwell architecture—as well as use of the full capabilities available in the stack are enabling large performance gains across several scenarios on DeepSeek-R1, a state-of-the-art sparse mixture-of-experts (MoE) reasoning model.
Latest NVIDIA TensorRT-LLM software boosts reasoning inference performance
The NVIDIA GB200 NVL72 rack-scale platform connects 72 NVIDIA Blackwell GPUs using fifth-generation NVIDIA NVLink interconnect and NVLink Switch chips, providing 1,800 GB/s of bidirectional bandwidth between all chips in the rack. This large scale-up domain is optimized for models based on sparse MoE architectures, which require frequent exchanges of data between experts to generate tokens.
The Blackwell architecture also incorporates hardware acceleration for the NVFP4 data format, an NVIDIA-designed four-bit floating point format that better preserves accuracy compared to alternative FP4 formats. In addition, optimizations like disaggregated serving—which performs prefill operations on one set of GPUs and decode operations on a different set—also take advantage of the NVL72 architecture and NVLink Switch technology. These architectural innovations enable NVIDIA GB200 NVL72 to deliver industry-leading performance on the latest open models, including DeepSeek-R1, a 671 billion-parameter sparse MoE that activates 37 billion parameters for each token.
Figure 1. GB200 NVL72 DeepSeek-R1 token throughput using 8K/1K sequence length has increased substantially with the latest NVIDIA TensorRT-LLM software.
GB200 NVL72 had previously demonstrated leading per-GPU throughput on DeepSeek-R1 across the throughput/interactivity curves for both 1K/1K and 8K/1K input/output sequence lengths.
Figure 2. GB200 NVL72 DeepSeek-R1 token throughput using 1K/1K sequence length has increased substantially with the latest NVIDIA TensorRT-LLM software.
The latest enhancements to the NVIDIA TensorRT-LLM open source library for optimizing LLM inference dramatically accelerate performance on the same platform, with the throughput of each Blackwell GPU increasing by up to 2.8x in the past three months.
The optimizations behind these results include: Expanded use of NVIDIA Programmatic Dependent Launch (PDL) to reduce kernel launch latencies, helping to increase throughput across the range of interactivity levels Many low-level kernel optimizations to more efficiently utilize NVIDIA Blackwell Tensor Cores Newly optimized implementation of all-to-all communication primitives that eliminate an additional intermediate buffer on the receiver side TensorRT LLM provides a high-level Python LLM API . Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. These optimizations are available today in the latest version of TensorRT-LLM . Accelerating NVIDIA HGX B200 performance with multi-token prediction and NVFP4 The NVIDIA HGX B200 platform—comprised of eight Blackwell GPUs connected using the fifth-generation NVLink interconnect and NVLink Switch—also achieves outstanding DeepSeek-R1 inference performance for air-cooled deployments. Two key technologies enable very large DeepSeek-R1 inference performance increases on HGX B200. The first is the use of MTP, which provides a significant increase in throughput across the range of interactivity levels. This is observed across all three tested input/output sequence combinations. Figure 3. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/1K sequence length and aggregated serving. The second is the use of NVFP4, taking full advantage of the significant compute capabilities available in the Blackwell GPU to boost performance while preserving accuracy. Figure 4. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence length and aggregated serving. NVFP4 is activated by the full NVIDIA software stack, including TensorRT-LLM and NVIDIA TensorRT Model Optimizer, to ensure both high performance and preservation of accuracy. That enables yet another large throughput boost at a given interactivity level, and once again allows for even higher interactivity levels to be possible on the same HGX B200 platform. Figure 5. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/8K sequence length and aggregated serving. By leveraging the full capabilities of the NVIDIA Blackwell platform, LLMs can serve more users and deliver significantly better experiences to each of those users. Delivering continuous performance gains Through relentless optimization, NVIDIA continues to deliver higher performance across the entire technology stack. It drives up token throughput on the full range of AI models, both through an annual product cadence as well as continued workload optimization to deliver more performance and value from existing products. The NVIDIA Blackwell architecture delivers industry-leading inference performance, and with the latest software innovations in TensorRT-LLM, NVIDIA is delivering yet another big inference boost for customers, partners, and the AI ecosystem at large. Please visit the NVIDIA Data Center Deep Learning Product Performance page to learn more about the industry-leading performance delivered by the NVIDIA full-stack platform. 
Tags: Agentic AI / Generative AI | Data Center / Cloud | Hardware / Semiconductor | Blackwell | GB200 | HGX | Hopper | NVLink | TensorRT-LLM | Intermediate Technical | Benchmark | AI Agent | Blackwell Ultra | Cloud Services | featured | LLMs | Machine Learning & Artificial Intelligence | Mixture of Experts (MoE) | NVL72
About the Authors
Ashraf Eassa is a senior product marketing manager at NVIDIA, focusing on deep learning, training and inference. He holds bachelor's degrees in computer science and mathematics from the University of Vermont.
Related posts: How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale | Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform | Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems | Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA | NVIDIA AI Inference Performance Milestones: Delivering Leading Throughput, Latency and Efficiency
Related posts: Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain | Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere | How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale | Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes
Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time | NVIDIA Technical Blog nvidia_dev_blog 09.01.2026 16:58 0.691
Embedding sim.0.7964
Entity overlap0.0556
Title sim.0.184
Time proximity0.8601
NLP type: scientific_publication
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste the earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt, and carry the lessons forward. Why can’t LLMs? In this blog post, we observe a critical difference between LLM memory and human memory. Then, we introduce test-time training with an end-to-end formulation (TTT-E2E), our latest research, in which the LLM compresses the context it’s reading into its weights through next-token prediction. Figure 1. Scaling with context length, in terms of loss (left) and latency (right) Our key results are highlighted in Figure 1, which measures scaling with context length, in terms of loss (left) and latency (right). Transformer with full attention scales well in terms of loss but not latency. Recurrent Neural Networks (RNNs), such as Mamba 2 and Gated DeltaNet, scale well in latency but not loss. TTT-E2E is the only method that scales well in both. Left panel: TTT-E2E turns the worst line (gray) into the best (light green) at 128K context length. Loss ∆ (↓), the y-value, is computed as (loss of the reported method) − (loss of transformer with full attention), so loss ∆ of full attention itself (dark green) is the flat line at y=0. While other methods produce worse loss ∆ in longer context, TTT-E2E maintains the same advantage over full attention.  Right panel: Similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7x faster than full attention for 128K context on an NVIDIA H100, and 35x faster for 2M context. All models have 3B parameters and are trained with 164B tokens. Scaling with context length, in terms of both loss and latency, is the most fundamental problem in long-context and LLM research. TTT-E2E is the first method that shows a sign of life at this problem, while all the other methods exhibit qualitatively different trends. Moreover, we observed no wall for the scaling trends of TTT-E2E across rigorous and extensive experiments. These results indicate that the research community might finally arrive at a basic solution to long context in 2026. Our paper and code are publicly available. How does LLM memory differ from human memory? Humans are remarkably good at improving with more “context” in the form of life experience, despite their imperfect recall of the exact details. For example, consider your first lecture in machine learning. You might not recall the instructor’s first word during the lecture, but the intuition you learned is probably helping you understand this blog post, even if that happened years ago. On the other hand, transformers with self-attention are inefficient with long context, in part because they are designed for nearly lossless recall. The basic form of self-attention is called full attention, which maintains full memory of every token by caching and comparing their keys and values. As a consequence, full attention readily attends to every detail, but its cost per token grows linearly with context length. Processing the 10-millionth token takes one million times longer than processing the 10th. To process long context without burning the planet, modern architectures often combine full attention with approximations such as sliding-window attention, Mamba, and Gated DeltaNet layers. 
These approximations have a constant cost per token, but also become significantly less effective in longer context compared to full attention. Specifically, these approximations lose important information that would have helped them predict the future, as shown in Figure 1.
Our method: compressing context into weights
How can we design a method with a constant cost per token that can still remember the important, predictive, and intuitive information in long context? The key mechanism is compression. For example, humans compress a massive amount of experience into their brains, which preserves the important information while leaving out many details. For language models, we know that training with next-token prediction also compresses a massive amount of data into their weights. So what if we just continue training the language model at test time through next-token prediction on the given context?
We found this simple form of Test-Time Training (TTT) highly effective once we added another missing piece. At training time, we prepare the model’s initialization for TTT through meta-learning instead of standard pre-training. This addition makes our method end-to-end (E2E) in two ways. Our inner loop directly optimizes the next-token prediction loss at the end of the network, in contrast to prior work on long-context TTT (e.g., Titans). And our outer loop directly optimizes the final loss after TTT.
What will be the role of RAG?
TTT is like updating the human brain, while retrieval-based methods, such as RAG, are like writing things down and looking things up in a notepad or calendar. The notepad will continue to be a practical supplement to the brain, especially when the details matter, like shopping for a long list of groceries. But human productivity is mostly determined by the brain, not by the notepads people use. Similarly, the productivity of an AI agent is mostly determined by how well it compresses a massive amount of context into predictive and intuitive information.
Limitations
At training time, the meta-learning phase of TTT-E2E requires gradients of gradients. Our current implementation of meta-learning is 3.4x slower than standard pre-training for short context (8K), because the standard API of FlashAttention does not support gradients of gradients. We can overcome this limitation by either developing a custom attention kernel that supports gradients of gradients or initializing TTT-E2E from a standard Transformer pre-trained without TTT. We invite the community to join us in these efforts!
Conclusion
For a deeper dive into the method, results, and implementation details, please check out the full paper End-to-End Test-Time Training for Long Context. All experiments can be reproduced using the code and datasets in our public repo.
Tags: Agentic AI / Generative AI | Developer Tools & Techniques | General | Intermediate Technical | Deep dive | featured | LLMs | NVIDIA Research
About the Authors
Yu Sun is a researcher at NVIDIA and a postdoc at Stanford University. His research focuses on continual learning, specifically a conceptual framework called test-time training, where each test instance defines its own learning problem.
Yejin Choi is a distinguished scientist of Language and Cognition Research at NVIDIA. Her current research focuses on large language models, large reasoning models, and alternative architectures.
She is a MacArthur Fellow (class of 2022), named among Time 100 Most Influential People in AI in 2023, and a co-recipient of 2 Test-of-Time awards (ACL 2021 and CVPR 2021) and 8 Best and Outstanding Paper Awards at ACL, EMNLP, NAACL, ICML, NeurIPS, and AAAI. She is currently serving as a General Chair for the inaugural Conference on Language Modeling (CoLM).
Related posts: Scaling to Millions of Tokens with Efficient Long-Context LLM Training | Dynamic Memory Compression | Hymba Hybrid-Head Architecture Boosts Small Language Model Performance | How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model | NVIDIA NeMo Accelerates LLM Innovation with Hybrid State Space Model Support
Related posts: Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere | Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI | Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air | Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell | NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
Simplify Generalist Robot Policy Evaluation in Simulation with NVIDIA Isaac Lab-Arena | NVIDIA Technical Blog nvidia_dev_blog 05.01.2026 22:14 0.682
Embedding sim.0.763
Entity overlap0.075
Title sim.0.219
Time proximity0.9975
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: robotics
NLP country:

Open original

Generalist robot policies must operate across diverse tasks, embodiments, and environments, requiring scalable, repeatable simulation-based evaluation. Setting up large-scale policy evaluations is tedious and manual. Without a systematic approach, developers need to build high-overhead custom infrastructure, yet task libraries remain limited in complexity and diversity.
This post introduces NVIDIA Isaac Lab-Arena, an open source framework for efficient and scalable robotic policy evaluation in simulation. Co-developed with Lightwheel—a physical AI infrastructure company—as an extension to NVIDIA Isaac Lab, it provides streamlined APIs for task curation, diversification, and large-scale parallel evaluation. Developers can now prototype complex benchmarks without the overhead of system building. The post also presents an end-to-end sample workflow covering environment setup, optional policy post-training, and closed-loop evaluation.
Overview and key benefits of Isaac Lab-Arena
We are announcing the pre-alpha release of Isaac Lab-Arena and inviting the community to help shape its roadmap. We are also partnering with benchmark authors to implement and open source their evaluations on Isaac Lab-Arena, enabling a growing ecosystem of ready-to-use benchmarks and shared evaluation methods on a unified core. The key benefits of Isaac Lab-Arena include simplified task curation, automated diversification, large-scale benchmarking, seamless integration with data generation and training, and more, as detailed below.
Simplified task curation (0 to 1):
Modular: Replaces monolithic task descriptions with a Lego-like architecture, compiling Isaac Lab environments on-the-fly from independent Object, Scene, Embodiment, and Task blocks.
Generalizable: Standardized interactions through an Affordance system (for example, Openable, Pressable) enable tasks to scale across diverse objects.
Extensible: Metrics and recorded data are extensible, providing users with fine-grained control over simulation and analytics if needed.
Automated diversification (1 to many): Easily mix and match components, applying one task across different robots or objects—such as switching from a domestic soda can to an industrial pipe task—without rewriting code. In the future, the team aims to leverage foundation models to automate generation of diverse and realistic tasks.
Large-scale parallel, policy-agnostic benchmarking: Evaluate any robotic policy across thousands of parallel environments for high-throughput, GPU-accelerated evaluations. The current version supports homogeneous parallel environments (with parameter variations).
Access to community benchmarks and shared evaluation methods on a unified core.
Open source with commercial license: Developers can freely use, distribute, and contribute to framework development.
Seamless integration with data generation and training: While the core function of Isaac Lab-Arena is task setup and evaluation, it integrates tightly with data generation and training frameworks for a seamless closed-loop workflow. This includes Isaac Lab-Teleop, Isaac Lab-Mimic, and post-training and inference of NVIDIA Isaac GR00T N models.
Flexible deployment: Deploy on local workstations or cloud-native environments (such as OSMO) for CI/CD, or integrate into leaderboards and distribution platforms such as the LeRobot Environment Hub.
Figure 1.
NVIDIA Isaac Lab-Arena is an open source framework for efficient and scalable robotic policy evaluation in simulation.
Ecosystem development
NVIDIA is partnering with benchmark authors to build their evaluations on Isaac Lab-Arena and publish sim-to-real validated evaluation methods, tasks, and datasets that the community can reuse and extend on a unified core. Coverage will span both industrial and research benchmarks across mobility, manipulation, and loco-manipulation.
Lightwheel co-developed and has adopted the Isaac Lab-Arena framework to create and open source 250+ tasks through the Lightwheel-RoboCasa-Tasks and Lightwheel-LIBERO-Tasks suites, with future efforts to establish them as benchmarks. Lightwheel is also developing RoboFinals, an industrial benchmark representative of complex real-world environments, using Isaac Lab-Arena.
Figure 2. Rich, generalizable kitchen environments in Lightwheel Task Suites built on Isaac Lab-Arena
Isaac Lab-Arena environments are now integrated on the Hugging Face LeRobot Environment Hub, where developers can seamlessly register custom environments built on Isaac Lab-Arena and use the growing library of environments to post-train and evaluate robotic policies, including Isaac GR00T N, pi0, and SmolVLA. For more details, visit the LeRobot documentation. NVIDIA is enabling millions of developers with open robotics models and datasets on Hugging Face, contributing to robotics becoming the fastest growing category on the platform.
RoboTwin is using Isaac Lab-Arena to build extended versions of RoboTwin 2.0, a large-scale embodied simulation benchmark, and other complex long-horizon benchmarks. An open source release is planned, with active development underway on research submissions and code updates.
In addition, NVIDIA Research labs such as the Generalist Embodied Agent Research (GEAR) Lab are leveraging Isaac Lab-Arena to benchmark the Isaac GR00T N family of vision language action models for generalized humanoid reasoning and skills at scale. The NVIDIA Seattle Robotics Lab (SRL) is integrating its research on language-conditioned task suites and evaluation methods for the benchmarking of generalist robot policies into Isaac Lab-Arena.
Future Isaac Lab-Arena enhancements
The current pre-alpha release is intentionally an early framework skeleton with limited features, giving contributors a practical starting point to experiment, share feedback, and influence future design and direction. In the near future, core capabilities essential to building complex task libraries will be added, including object placement through natural language, composite tasking by chaining atomic skills, reinforcement learning task setup, and parallel heterogeneous evaluations (for example, different objects per parallel environment).
Further out, the team aims to explore more agentic and neural approaches to scale evaluation. Examples include leveraging NVIDIA Cosmos for world-model-driven neural simulation and scenario generation, as well as NVIDIA Omniverse NuRec for real-to-sim construction of simulation environments that mirror the real world. Community participation and feedback will be vital to shaping these developments.
How to set up tasks and evaluate policies at scale using Isaac Lab-Arena
This section presents an end-to-end sample workflow to evaluate an Isaac GR00T N model on a manipulation skill—opening a microwave door—with the GR1 robot in Isaac Lab-Arena. It covers environment setup, optional policy post-training, and closed-loop evaluation.
Figure 3.
GR1 robot in Isaac Lab-Arena opening a microwave door
Step 1: Environment creation and diversification
Follow the GR1 open microwave door task prerequisites to clone the repo and run the Docker container. Then, create an environment in Isaac Lab-Arena by stitching together Objects (Microwave) with Affordances (Openable, Pressable), in the Scene (Kitchen) with an Embodiment (GR-1 Robot) to perform a Task (OpenDoor). Users can optionally include configuration for teleoperation-based data collection.
Procure assets:
background = self.asset_registry.get_asset_by_name("kitchen")()
microwave = self.asset_registry.get_asset_by_name("microwave")()
assets = [background, microwave]
embodiment = self.asset_registry.get_asset_by_name("gr1_pink")(enable_cameras=args_cli.enable_cameras)
teleop_device = self.device_registry.get_device_by_name("avp")()
For more details, see Assets Design and Affordances Design.
Position objects:
microwave_pose = Pose(
    position_xyz=(0.4, -0.00586, 0.22773),
    rotation_wxyz=(0.7071068, 0, 0, -0.7071068),
)
microwave.set_initial_pose(microwave_pose)
Compose the scene:
scene = Scene(assets=assets)
Create the task:
task = OpenDoorTask(microwave, openness_threshold=0.8, reset_openness=0.2)
Tasks encapsulate objectives and success criteria, along with termination logic, events, and metrics. To learn more, see Task Design.
Finally, assemble all the pieces into a complete, runnable environment:
isaaclab_arena_environment = IsaacLabArenaEnvironment(
    name=self.name,
    embodiment=embodiment,
    scene=scene,
    task=task,
    teleop_device=teleop_device,
)
Next, run the environment using a test dataset.
Download a test dataset:
hf download \
    nvidia/Arena-GR1-Manipulation-Task \
    arena_gr1_manipulation_dataset_generated.hdf5 \
    --repo-type dataset \
    --local-dir $DATASET_DIR
Run the environment:
python isaaclab_arena/scripts/replay_demos.py \
    --device cpu \
    --enable_cameras \
    --dataset_file "${DATASET_DIR}/arena_gr1_manipulation_dataset_generated.hdf5" \
    gr1_open_microwave \
    --embodiment gr1_pink
The robot will replay NVIDIA-collected teleoperation data to open the microwave. For comprehensive technical details and design principles for creating new environments, consult the tutorial documentation.
Scale a task efficiently across robots, objects, and scenes
This section provides several examples that show how to easily swap objects or robots in a task—without rebuilding the environment or pipeline.
Example 1 – Change the object from microwave to power_drill:
background = asset_registry.get_asset_by_name("kitchen")()
embodiment = asset_registry.get_asset_by_name("gr1_pink")()
power_drill = asset_registry.get_asset_by_name("power_drill")()
assets = [background, power_drill]
Figure 4. The object has changed from a microwave to a power drill
Example 2 – Change the embodiment from GR1 to a Franka arm and the object to cracker_box:
background = asset_registry.get_asset_by_name("kitchen")()
embodiment = asset_registry.get_asset_by_name("franka")()
cracker_box = asset_registry.get_asset_by_name("cracker_box")()
assets = [background, cracker_box]
Figure 5. The GR1 robot has changed to a Franka arm
Example 3 – Change the background from a kitchen to an industrial packing table:
background = asset_registry.get_asset_by_name("packing_table")()
embodiment = asset_registry.get_asset_by_name("gr1_pink")()
power_drill = asset_registry.get_asset_by_name("power_drill")()
assets = [background, power_drill]
Figure 6.
The GR1 robot is in an industrial setting instead of in a kitchen
Step 2: Optional policy post-training
While Isaac Lab-Arena at its core focuses on task setup and policy evaluation, the Isaac Lab-Arena environment can seamlessly interoperate with data collection, data generation, and post-training if your policy needs to be post-trained prior to evaluation. You can:
Collect demonstrations using Isaac Lab Teleop
Scale demonstrations into a larger synthetic dataset using Isaac Lab Mimic
Use the generated dataset to post-train the Isaac GR00T N model or any robotic policy of your choice
Step 3: Execute evaluations on parallel environments
The next step is to evaluate the trained policy. Note that you can evaluate any trained robotic policy with the framework.
Option 1 – Test the policy in a single environment:
python isaaclab_arena/examples/policy_runner.py \
    --policy_type gr00t_closedloop \
    --policy_config_yaml_path isaaclab_arena_gr00t/gr1_manip_gr00t_closedloop_config.yaml \
    --num_steps 2000 \
    --enable_cameras \
    gr1_open_microwave \
    --embodiment gr1_joint
Option 2 – Test the policy in multiple parallel homogeneous environments:
python isaaclab_arena/examples/policy_runner.py \
    --policy_type gr00t_closedloop \
    --policy_config_yaml_path isaaclab_arena_gr00t/gr1_manip_gr00t_closedloop_config.yaml \
    --num_steps 2000 \
    --num_envs 10 \
    --enable_cameras \
    gr1_open_microwave \
    --embodiment gr1_joint
Rapid policy evaluation results
With Isaac Lab-Arena's GPU-accelerated parallel evaluation, robot developers can now get large-scale policy evaluation results in under one hour, slashing what was previously a full-day wait. With Lightwheel, we evaluated the performance of Isaac Lab-Arena in parallel-environment mode against sequential-environment mode and the original MuJoCo (RoboCasa) implementation on a complex set of 10 RoboCasa tasks. The evaluation used the Isaac GR00T N1.5 policy across 4096 homogeneous environment variations per task on 8x6000D GPUs. The results demonstrate a massive efficiency gain for VLA developers: parallel evaluation on Isaac Lab-Arena took only 0.76 hours, more than 40x faster than sequential evaluation on Isaac Lab-Arena (34.9 hours). More details about the performance on parallel environments are available here.
Get started with NVIDIA Isaac Lab-Arena
Isaac Lab-Arena pre-alpha is open source, and we invite you to help guide its future design and development. To get started with Isaac Lab-Arena pre-alpha, visit the GitHub repo and documentation. Share feedback by opening GitHub issues to report bugs or suggest feature and design improvements, and contribute by opening pull requests to propose changes. Create tasks or sim-to-real validated benchmarks on Isaac Lab-Arena and open source them to help build a shared ecosystem of ready-to-use robot learning tasks. Publish tasks to a leaderboard or evaluation hub like the LeRobot Environment Hub to make them discoverable and easy to run across shared pipelines and registries. Stay up to date by subscribing to our newsletter and following NVIDIA Robotics on LinkedIn, Instagram, X, and Facebook. Explore NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Robotics forum. To start your robotics journey, enroll in our free NVIDIA Robotics Fundamentals courses today. Get started with NVIDIA Isaac libraries and AI models for developing physical AI systems. Watch NVIDIA Live at CES to learn more. Updated on Feb. 3 with information about Lightwheel's performance results.
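As a rough sanity check on the evaluation figures quoted above, the short sketch below recomputes the speedup and per-episode throughput from the published numbers (10 tasks × 4,096 environment variations, 0.76 hours parallel vs. 34.9 hours sequential). Only the input numbers come from the post; the derived quantities are my own arithmetic.
# Back-of-the-envelope check of the Isaac Lab-Arena evaluation figures quoted above.
tasks = 10
variations_per_task = 4096
parallel_hours = 0.76
sequential_hours = 34.9

episodes = tasks * variations_per_task                  # 40,960 evaluation episodes
speedup = sequential_hours / parallel_hours             # ~46x, consistent with "more than 40x"
episodes_per_minute = episodes / (parallel_hours * 60)  # ~900 episodes/minute in parallel mode

print(f"episodes={episodes}, speedup={speedup:.1f}x, throughput={episodes_per_minute:.0f}/min")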
Scaling Power-Efficient AI Factories with NVIDIA Spectrum-X Ethernet Photonics | NVIDIA Technical Blog nvidia_dev_blog 06.01.2026 16:59 0.681
Embedding similarity: 0.7694
Entity overlap: 0.0857
Title similarity: 0.2109
Time proximity: 0.9316
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

NVIDIA is bringing the world’s first optimized Ethernet networking with co-packaged optics to AI factories, enabling scale-out and scale-across on the NVIDIA Rubin platform with NVIDIA Spectrum-X Ethernet Photonics, the flagship switch for multi-trillion-parameter AI infrastructure. This blog post explores key optimizations and innovations in the protocol and hardware of Spectrum-X Ethernet Photonics that enable power-efficient, reliable, and resilient co-packaged optical networks for giga-scale AI factories. How Ethernet for AI enables scalable training and inference on the NVIDIA Rubin Platform Ultra-low-jitter Ethernet networking plays a vital role in scaling AI factories, as it ensures consistent and reliable data transmission across the entire infrastructure. By minimizing jitter, AI systems can achieve efficient token throughput regardless of batch size, which is crucial for handling diverse and demanding workloads. This ability supports seamless multi-tenancy within a single AI factory, for multiple users and applications to operate concurrently without performance degradation. It also improves the dispatch efficiency of models based on the Mixture of Experts (MoE) architecture, enabling faster expert selection and improved overall model performance, as shown in Figure 1. As a result, AI factories can operate at greater speed, reliability, and scalability. Figure 1. NVIDIA Spectrum-X Ethernet provides low-jitter communication and higher NVIDIA Collective Communication Library ( NCCL ) performance over off-the-shelf Ethernet Key innovations in Spectrum-X Ethernet Photonics for AI factory optical interconnects The Spectrum-X Ethernet Photonics switch delivers performance improvements for AI factories through its co-packaged silicon photonic engines. New packaging and low-loss electro-optical channels offer 5x power reduction per 1.6 Tb/s port compared to pluggable interconnects. The co-packaged optical links sustain 5x longer link flap-free AI uptime compared to off-the-shelf Ethernet solutions, ensuring AI workloads run without interruption. 10x greater network resiliency provides unmatched robustness for mission-critical applications. With these innovations, organizations can scale their AI infrastructure and increase performance per watt, supporting larger workloads while maintaining optimal energy efficiency, reliability, and network stability. Figure 2. Spectrum-X Ethernet Photonics MCM package Spectrum-X Ethernet Photonics is the world’s first fully integrated 512 lane 200G-capable co-packaged switch system. The introduction of the detachable fiber connector for surface-normal input/output (I/O) is an advancement in the assembly and scalability of high-performance Ethernet switches for AI factories. By enabling a fully automated process where optical fibers are attached at the final stage using precision machinery, manufacturers can maximize production yield and throughput, streamlining large-scale deployment. The surface-normal optical I/O architecture enables optical ports to scale without increasing the physical size of the switch package. This is especially advantageous for high radix switches, which require numerous connections within a compact footprint to support expansive AI workloads. The solder-reflow compatible optical engine is also a breakthrough that integrates seamlessly with modern test and assembly tools. 
This compatibility enables full screening of optical components before attachment to the switch silicon, ensuring that only known-good engines are used, achieving a guaranteed 100% yield. The process benefits from pick-and-place automation and comprehensive pre-assembly testing, which together provide an efficient manufacturing pathway for these advanced switch systems. The integrated shuffle mechanism within the quad-ASIC switch architectures is another key innovation, enabling flat and efficient scaling of GPUs within a single cluster. This topology eliminates the latency typically introduced by additional switching layers, maintaining optimal performance as clusters grow. The SN6800 switch delivers 409.6 Tb/s of total bandwidth across 512 ports of 800 Gb/s, or 2,048 ports of 200 Gb/s, using its integrated fiber shuffle and co-packaged silicon photonics to establish a space- and power-efficient Ethernet solution. These combined innovations equip AI factories with robust, scalable network infrastructure capable of supporting next-generation artificial intelligence applications.
Figure 3. Spectrum-X Ethernet Photonics-based SN6800 and SN6810 Ethernet switches
What's next for AI factory networking innovation
This holistic codesign approach—with chips, systems, software, and AI models—enables the development of scalable, high-performance AI factories. Spectrum-X Ethernet Photonics switches deliver ultra-low jitter networking for AI factories to grow in speed, reliability, and scalability, and establish robust infrastructure for next-generation applications. For more information, see the NVIDIA Silicon Photonics page.
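As a quick arithmetic check of the SN6800 figures quoted above (my calculation, not from the original post), both quoted port configurations add up to the same aggregate bandwidth:
# Verify that the two SN6800 port configurations sum to the stated 409.6 Tb/s.
ports_800g_tbps = 512 * 800 / 1000    # 512 ports x 800 Gb/s = 409.6 Tb/s
ports_200g_tbps = 2048 * 200 / 1000   # 2,048 ports x 200 Gb/s = 409.6 Tb/s
print(ports_800g_tbps, ports_200g_tbps)  # 409.6 409.6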
3 Questions: How AI could optimize the power grid mit_news_ai 09.01.2026 05:00 0.675
Embedding similarity: 0.7953
Entity overlap: 0
Title similarity: 0.0211
Time proximity: 0.9448
NLP type: other
NLP organization: Massachusetts Institute of Technology
NLP topic: machine learning
NLP country:

Open original

Artificial intelligence has captured headlines recently for its rapidly growing energy demands , and particularly the surging electricity usage of data centers that enable the training and deployment of the latest generative AI models. But it’s not all bad news — some AI tools have the potential to reduce some forms of energy consumption and enable cleaner grids. One of the most promising applications is using AI to optimize the power grid, which would improve efficiency, increase resilience to extreme weather, and enable the integration of more renewable energy. To learn more, MIT News spoke with Priya Donti , the Silverman Family Career Development Professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and a principal investigator at the Laboratory for Information and Decision Systems (LIDS), whose work focuses on applying machine learning to optimize the power grid. Q: Why does the power grid need to be optimized in the first place? A: We need to maintain an exact balance between the amount of power that is put into the grid and the amount that comes out at every moment in time. But on the demand side, we have some uncertainty. Power companies don’t ask customers to pre-register the amount of energy they are going to use ahead of time, so some estimation and prediction must be done. Then, on the supply side, there is typically some variation in costs and fuel availability that grid managers need to be responsive to. That has become an even bigger issue because of the integration of energy from time-varying renewable sources, like solar and wind, where uncertainty in the weather can have a major impact on how much power is available. Then, at the same time, depending on how power is flowing in the grid, there is some power lost through resistive heat on the power lines. So, as a grid operator, how do you make sure all that is working all the time? That is where optimization comes in. Q: How can AI be most useful in power grid optimization? A: One way AI can be helpful is to use a combination of historical and real-time data to make more precise predictions about how much renewable energy will be available at a certain time. This could lead to a cleaner power grid by allowing us to handle and better utilize these resources. AI could also help tackle the complex optimization problems that power grid operators must solve to balance supply and demand in a way that also reduces costs. These optimization problems are used to determine which power generators should produce power, how much they should produce, and when they should produce it, as well as when batteries should be charged and discharged, and whether we can leverage flexibility in power loads. These optimization problems are so computationally expensive that operators use approximations so they can solve them in a feasible amount of time. But these approximations are often wrong, and when we integrate more renewable energy into the grid, they are thrown off even farther. AI can help by providing more accurate approximations in a faster manner, which can be deployed in real-time to help grid operators responsively and proactively manage the grid. AI could also be useful in the planning of next-generation power grids. Planning for power grids requires one to use huge simulation models, so AI can play a big role in running those models more efficiently. 
The technology can also help with predictive maintenance by detecting where anomalous behavior on the grid is likely to happen, reducing inefficiencies that come from outages. More broadly, AI could also be applied to accelerate experimentation aimed at creating better batteries, which would allow the integration of more energy from renewable sources into the grid. Q: How should we think about the pros and cons of AI, from an energy sector perspective? A: One important thing to remember is that AI refers to a heterogeneous set of technologies. There are different types and sizes of models that are used, and different ways that models are used. If you are using a model that is trained on a smaller amount of data with a smaller number of parameters, that is going to consume much less energy than a large, general-purpose model. In the context of the energy sector, there are a lot of places where, if you use these application-specific AI models for the applications they are intended for, the cost-benefit tradeoff works out in your favor. In these cases, the applications are enabling benefits from a sustainability perspective — like incorporating more renewables into the grid and supporting decarbonization strategies. Overall, it’s important to think about whether the types of investments we are making into AI are actually matched with the benefits we want from AI. On a societal level, I think the answer to that question right now is “no.” There is a lot of development and expansion of a particular subset of AI technologies, and these are not the technologies that will have the biggest benefits across energy and climate applications. I’m not saying these technologies are useless, but they are incredibly resource-intensive, while also not being responsible for the lion’s share of the benefits that could be felt in the energy sector. I’m excited to develop AI algorithms that respect the physical constraints of the power grid so that we can credibly deploy them. This is a hard problem to solve. If an LLM says something that is slightly incorrect, as humans, we can usually correct for that in our heads. But if you make the same magnitude of a mistake when you are optimizing a power grid, that can cause a large-scale blackout. We need to build models differently, but this also provides an opportunity to benefit from our knowledge of how the physics of the power grid works. And more broadly, I think it’s critical that those of us in the technical community put our efforts toward fostering a more democratized system of AI development and deployment, and that it’s done in a way that is aligned with the needs of on-the-ground applications.
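To make the dispatch optimization described above concrete, here is a toy "economic dispatch" example: choose how much each generator produces to meet demand at minimum cost. This is my own illustrative sketch, not from the interview; the generator costs, capacities, and demand are made up.
# Toy economic dispatch: minimize generation cost subject to meeting demand exactly.
from scipy.optimize import linprog

costs = [20.0, 35.0, 90.0]          # $/MWh for three hypothetical generators
capacities = [400.0, 300.0, 200.0]  # MW upper limits (hypothetical)
demand = 650.0                      # MW that must be served this hour

result = linprog(
    c=costs,                                 # minimize total generation cost
    A_eq=[[1.0, 1.0, 1.0]], b_eq=[demand],   # supply must exactly match demand
    bounds=[(0.0, cap) for cap in capacities],
    method="highs",
)
print(result.x)    # dispatch per generator, e.g. [400., 250., 0.]
print(result.fun)  # total cost of the dispatch
Real grid optimization adds network constraints, losses, and uncertainty, which is exactly where the faster AI-based approximations discussed above come in.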
Multi-Agent Warehouse AI Command Layer Enables Operational Excellence and Supply Chain Intelligence | NVIDIA Technical Blog nvidia_dev_blog 09.01.2026 14:00 0.664
Embedding similarity: 0.7532
Entity overlap: 0.0238
Title similarity: 0.2289
Time proximity: 0.8779
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai agents
NLP country:

Open original

Warehouses have never been more automated, more data-rich, or more operationally demanding than they are now—yet they still rely on systems that can’t keep up. Throughput is rising, SLAs are shrinking, and fleets of AMRs, conveyors, and sensors expand every year. But beneath that technological surface, most sites still rely on a familiar trio: a Warehouse Management System (WMS), a handful of dashboards, and the institutional knowledge available. Supervisors are left to manage 12+ classes of equipment, thousands of shift tasks, and a constant flood of telemetry—without any unified intelligence to interpret it all or guide the next move. This post introduces the NVIDIA Multi-Agent Intelligent Warehouse (MAIW) Blueprint for the missing layer. This NVIDIA-aligned, open source AI command layer sits above WMS, Enterprise Resources Planning (ERP), and IoT infrastructure to transform scattered data into real-time, actionable operational intelligence. The problem: Warehouses without a “brain” Despite years of investment in WMS and ERP systems, automation fleets, safety hardware, RFID, scanners, cameras, dashboards, and BI tools, most warehouses still lack one critical capability: a system that can reason across all of it. Operational knowledge remains scattered. SOPs, SDS sheets, LOTO procedures, and OEM manuals sit in dense PDFs. WMS, ERP, LMS, maintenance, and incident systems all hold different pieces of the puzzle. Telemetry from PLCs, AMRs, IoT sensors, and charging stations streams in continuously but stays disconnected. And individuals often retain the most valuable insights such as shift notes, context, and other institutional knowledge. On a routine day, this fragmentation creates friction. But during peak volume, equipment failures, or safety events, it becomes a real liability. Maintenance teams troubleshoot with incomplete telemetry. Supervisors assign tasks without a unified view of staffing, equipment status, or workload. Safety alerts go unnoticed, incidents are under-reported, and procedures stay buried in PDFs no one has time to read. The result is predictable: more downtime, inefficient tasking, slow problem resolution, safety gaps, and expensive automation that operates as isolated islands rather than a coordinated system. Warehouses don’t need more dashboards—they need a real-time decision layer that can understand natural-language questions, pull evidence from data and documents, coordinate specialized agents, recommend actions with justification, and operate under strict safety and compliance guardrails. That is the role of an AI command layer. The solution: An AI command layer The Multi-Agent Intelligent Warehouse delivers a unified AI command layer for modern warehouse operations, transforming fragmented systems, documents, and telemetry into real-time, actionable intelligence. By orchestrating specialized AI agents across equipment operations, workforce coordination, safety, forecasting, and document intelligence, the platform enables warehouses to move from reactive management to proactive and adaptive decision-making. Unified warehouse intelligence : Connects WMS, ERP, IoT, documents, and telemetry into a single AI-driven operational view. Faster, explainable decisions : Multi-agent AI delivers real-time, evidence-backed recommendations operators can trust. Higher throughput, less downtime : Proactively optimizes labor, equipment, and maintenance to reduce disruptions. 
Safer, more compliant operations : Continuously monitors incidents, SOPs, and environmental signals to improve safety response. Foundation for physical AI : Enables the transition from reactive workflows to perception-driven, autonomous warehouse operations. Design goal: An AI assistant for the entire warehouse The goal behind MAIW is to build a production-grade reference system that: Demonstrates how the NVIDIA AI stack (including NVIDIA NIM , NVIDIA NeMo , NVIDIA cuML , and NVIDIA cuVS ) can power an operational assistant. Provides a multi-agent architecture that mirrors warehouse roles: Equipment, Operations, Safety, Forecasting, Document Processing. Unifies retrieval-augmented generation (RAG) , forecasting, and document AI into a single workflow. Ships with real security, monitoring, and guardrails, not just a prototype chatbot. Is open source and extensible, so customers and partners can adapt it to their own environments. The MAIW is a complete system with API, UI, agents, connectors, observability, and deployment assets. MAIW core technology stack MAIW is built end-to-end on the NVIDIA AI Enterprise platform. It is powered end-to-end by NVIDIA AI Enterprise applications, combining advanced language models, fast retrieval, document intelligence, and GPU-accelerated analytics in one cohesive system. Figure 1. Multi-Agent Intelligent Warehouse Blueprint architecture At the reasoning layer, LLM NIM drives the assistant’s intelligence: Llama 3.3 Nemotron Super 49B handles complex operational decision-making, while NVIDIA Nemotron Nano 12B v2 VL adds vision-language understanding for documents and images. Outputs are grounded by a high-performance retrieval layer built on Llama Nemotron Embed QA 1B and Milvus with cuVS, enabling fast, GPU-accelerated vector search. For documents, a streamlined NeMo Retriever pipeline performs OCR, normalization, extraction, validation, and indexing—turning PDFs, images, and multi-page BOLs or invoices into structured data that the system can reason through. All data flows through a hybrid RAG architecture. Structured telemetry lives in PostgreSQL/TimescaleDB, unstructured content is handled through vector search, and a hybrid router chooses the best strategy for each query. Redis caching keeps responses consistently under a second. Forecasting is powered by a NVIDIA cuML -accelerated ensemble of six models, tuned with Optuna and achieving strong performance (~82% accuracy, 15.8% MAPE). It’s all wrapped in a production-grade application stack: FastAPI backend React frontend Full Prometheus and Grafana observability NVIDIA NeMo Guardrails to ensure safe, compliant behavior across all interactions How the multi-agent intelligence layer thinks and works MAIW isn’t a single assistant—it’s a coordinated team of specialized AI agents, each trained to handle a different part of warehouse operations. LangGraph choreographs how they work together, while the Model Context Protocol (MCP) gives them a shared layer for tool access, external system calls, and real-time data retrieval. A user’s query passes through guardrails, intent routing, memory lookup, retrieval, and tool execution before returning a safe, grounded answer. The full workflow shown in Figure 2 captures how these pieces come together. 
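Before looking at the individual agents, here is a minimal sketch of the hybrid routing idea described above: structured, metric-style questions go to SQL/TimescaleDB, while open-ended questions go to vector search. The keyword heuristic and function names are illustrative assumptions, not the actual MAIW router.
# Illustrative hybrid query router (not the MAIW implementation).
STRUCTURED_HINTS = ("how many", "utilization", "count", "average", "last 24 hours")

def route_query(question: str) -> str:
    q = question.lower()
    if any(hint in q for hint in STRUCTURED_HINTS):
        return "sql"      # telemetry and KPIs live in PostgreSQL/TimescaleDB
    return "vector"       # SOPs, manuals, and incident reports live in the vector store

print(route_query("How many AMRs are below 20% battery right now?"))        # sql
print(route_query("What is the lockout/tagout procedure for conveyor 3?"))  # vector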
Agent – Actions
Planner and general – Routes intent, breaks tasks into steps, and selects the right agents; handles simple queries directly
Equipment and asset ops – Tracks and manages forklifts, AMRs, and conveyors; checks telemetry, maintenance, and utilization
Operations coordination – Manages tasks, waves, staffing, and KPIs; diagnoses bottlenecks and executes fixes
Safety and compliance – Enforces SOPs and regulations; handles incidents, checklists, and alerts
Forecasting – Predicts demand and stockout risk; generates and pushes replenishment recommendations
Document processing – Runs OCR and extraction on BOLs, invoices, and receipts; indexes structured results for retrieval
Table 1. MAIW is a coordinated team of specialized AI agents, each trained to handle a different part of warehouse operations
MAIW core AI services
MAIW core AI services include intelligent document processing, safety, security, and observability.
Intelligent document processing
The intelligent document processing pipeline uses NVIDIA NIM and multimodal foundation models with quality-based orchestration to deliver enterprise-grade accuracy at scale. Documents are ingested and preprocessed with NeMo Retriever, then processed through intelligent OCR and layout extraction using NeMoRetriever-OCR and Nemotron Parse to produce structured, high-fidelity representations. A small vision-language model (Nemotron Nano 12B VL) performs visually grounded field extraction and document classification, with post-processing normalization into schema-compliant JSON. Embeddings generated with a NeMo Retriever embedding model are indexed in Milvus to enable semantic search and downstream RAG. For high-value or low-confidence cases, a large language model (LLM) judge validates consistency, accuracy, and completeness, scoring extraction quality. An intelligent routing layer then automatically decides whether documents are auto-accepted, flagged for quick review, sent for expert review, or rejected for reprocessing—optimizing cost, latency, and accuracy while maintaining a continuous feedback loop for system improvement.
This feedback loop is anchored around the LLM judge and intelligent routing stages. After initial extraction by the small vision-language model, the LLM judge evaluates each document for consistency, completeness, and confidence, producing scored results and quality explanations. These scores drive the routing engine, which determines whether a document is auto-accepted, sent for lightweight human review, escalated to expert review, or rejected for reprocessing. When documents are corrected—either automatically or by human reviewers—the validated outputs are fed back into the system as normalized and scored metadata, updating the document store, embedding index, and quality signals. Low-confidence or rejected documents are rerouted to earlier stages (OCR, layout extraction, or small LLM processing), enabling targeted reprocessing rather than full pipeline reruns. Over time, this closed-loop flow continuously improves extraction accuracy, routing thresholds, prompt strategies, and model selection policies, allowing the system to adapt dynamically while minimizing cost and latency at scale.
Figure 2. Intelligent document processing workflow
Safety, security, and observability
An AI command layer only works if operators trust it. MAIW is built with that principle at the foundation.
Keeping every interaction safe with NeMo Guardrails
The NeMo Guardrails implementation uses a dual approach: the NeMo Guardrails library (v0.19.0) with Colang for programmable guardrails, and a pattern-based fallback for reliability. The GuardrailsService (src/api/services/guardrails/guardrails_service.py) selects the implementation through the USE_NEMO_GUARDRAILS_SDK environment variable, with automatic fallback if the library is unavailable. When library mode is enabled, the NeMoGuardrailsSDKService wrapper initializes LLMRails from a Colang configuration (data/config/guardrails/rails.co) that defines 88 protection patterns across five categories: jailbreak detection (17 patterns), safety violations (13 patterns), security violations (15 patterns), compliance violations (12 patterns), and off-topic queries (13 patterns). The library uses NVIDIA NIM endpoints (configured in data/config/guardrails/config.yml) with OpenAI-compatible models, and input safety checks are performed by calling rails.generate_async and detecting refusal responses:
# SDK Input Safety Check
result = await self.rails.generate_async(
    messages=[{"role": "user", "content": user_input}]
)
is_safe = not self._is_refusal_response(result.content)
Security model: Controlled access by design
The JSON Web Tokens (JWT) implementation (src/api/services/auth/jwt_handler.py) provides stateless authentication with HS256 tokens that include user identity and role information, with key strength validation (32-byte minimum) to address CVE-2025-45768. This foundation enables role-based access control (RBAC) through the CurrentUser context class and FastAPI dependency injection, where tokens are validated for signature, expiration, and type, then decoded to extract user roles and permissions. The system maps granular permissions (INVENTORY_WRITE, OPERATIONS_ASSIGN, SAFETY_APPROVE, and so on) to five role levels (ADMIN, MANAGER, SUPERVISOR, OPERATOR, VIEWER), allowing declarative endpoint protection through require_permission and require_role dependencies:
# JWT token with role → RBAC enforcement
user_data = {"sub": str(user.id), "role": user.role.value}
access_token = jwt_handler.create_access_token(user_data)

@router.get("/admin/endpoint")
async def admin_endpoint(user: CurrentUser = Depends(require_admin)):
    # Only SYSTEM_ADMIN permission holders can access
    ...
Observability: MAIW as essential production infrastructure
Prometheus and Grafana provide real-time visibility into how the system behaves: API latency, vector search performance, cache efficiency, agent response times, forecasting accuracy, and even equipment telemetry. By instrumenting MAIW like any critical warehouse service, SRE and operations teams can monitor, debug, and improve the AI layer with confidence.
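To make the observability point concrete, here is a minimal sketch of the kind of instrumentation described above, using prometheus_client with FastAPI. The metric name, label, and endpoint are illustrative and not taken from the MAIW codebase.
# Illustrative Prometheus instrumentation for an agent query endpoint.
import time

from fastapi import FastAPI
from prometheus_client import Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrapes this endpoint

AGENT_LATENCY = Histogram(
    "agent_response_seconds",
    "Time spent answering an operator query",
    ["agent"],
)

@app.get("/ask")
async def ask(question: str, agent: str = "planner"):
    start = time.perf_counter()
    answer = f"(stubbed answer from {agent} agent)"  # a real system would call the agent graph here
    AGENT_LATENCY.labels(agent=agent).observe(time.perf_counter() - start)
    return {"answer": answer}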
Get started with the Multi-Agent Intelligent Warehouse
There are two methods to get started with MAIW: create a Brev instance, or visit the GitHub repo at NVIDIA-AI-Blueprints/Multi-Agent-Intelligent-Warehouse.
The GitHub repo is structured as a complete, runnable reference implementation:
Backend: FastAPI services, retrieval stack, memory, adapters, guardrails
Frontend: React dashboard with chat, forecasting, and monitoring views
Infrastructure: Docker Compose, Helm charts, and setup scripts
Data and scripts: SQL schemas, demo data, forecasting pipelines, document pipelines
Docs: Architecture notes, MCP integration details, forecasting docs, deployment guide, PRD
The following is a typical local setup:
git clone https://github.com/T-DevH/Multi-Agent-Intelligent-Warehouse.git
cd Multi-Agent-Intelligent-Warehouse

# Environment and infrastructure
./scripts/setup/check_node_version.sh
./scripts/setup/setup_environment.sh
cp .env.example deploy/compose/.env
./scripts/setup/dev_up.sh

# Initialize database & demo data
source env/bin/activate
python scripts/setup/create_default_users.py
python scripts/data/quick_demo_data.py
python scripts/data/generate_historical_demand.py

# Start services
./scripts/start_server.sh                    # API (http://localhost:8001)
cd src/ui/web && npm install && npm start    # Frontend (http://localhost:3001)
Transforming warehouse complexity into control
Supply chains are becoming more volatile, more automated, and more data-rich—and warehouses are a key part of those supply chains. The current stack—WMS plus dashboards plus human heroics—cannot scale indefinitely. An AI command layer provides a path forward, including:
A single operational "brain" that can reason across systems
Explainable recommendations instead of opaque heuristics
Faster incident response with better evidence
Safer operations with codified guardrails
Better use of existing automation and data investments
The Multi-Agent Intelligent Warehouse is a working, open source implementation of that command layer, built on the NVIDIA AI platform and aligned with the broader NVIDIA blueprint strategy. If warehouses are already operating at the edge of complexity, MAIW shows how to pull them back—from reactively managing challenges to proactive, data-driven, AI-assisted operations. Learn more about the Multi-Agent Intelligent Warehouse.
NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI huggingface 05.01.2026 22:56 0.655
Embedding similarity: 0.7377
Entity overlap: 0.0926
Title similarity: 0.1463
Time proximity: 0.9964
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: artificial intelligence
NLP country:

Open original

NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI
Published January 5, 2026 · Tsung-Yi Lin and Debraj Sinha (NVIDIA)
NVIDIA Cosmos Reason 2: Reasoning Vision Language Model for Physical AI
NVIDIA today released Cosmos Reason 2, the latest advancement in open, reasoning vision language models for physical AI. Cosmos Reason 2 surpasses its previous version in accuracy and tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding. Since their introduction, vision-language models have rapidly improved at tasks like object and pattern recognition in images. But they still struggle with tasks humans find natural, like planning several steps ahead, dealing with uncertainty, or adapting to new situations. Cosmos Reason is designed to close this gap by giving robots and AI agents stronger common sense and reasoning to solve complex problems step by step. Cosmos Reason 2 is a state-of-the-art, open reasoning vision-language model (VLM) that enables robots and AI agents to see, understand, plan, and act in the physical world like humans. It uses common sense, physics, and prior knowledge to recognize how objects move across space and time to handle complex tasks, adapt to new situations, and figure out how to solve problems step by step.
✨ Key Highlights
Improved spatio-temporal understanding and timestamp precision.
Optimized performance with flexible deployment options from edge to cloud, with 2B and 8B parameter model sizes.
Support for an expanded set of spatial understanding and visual perception capabilities—2D/3D point localization, bounding box coordinates, trajectory data, and OCR support.
Improved long-context understanding with 256K input tokens, up from 16K with Cosmos Reason 1.
Adaptable to multiple use cases with easy-to-use Cosmos Cookbook recipes.
🤖 Popular Use Cases
Video analytics AI agents — These agents can extract valuable insights from massive volumes of video data to optimize processes. Cosmos Reason 2 builds on the capabilities of Cosmos Reason 1 and now provides OCR support, as well as 2D/3D point localization and set-of-mark understanding. Example of how Cosmos Reason can understand text embedded within a video to determine the condition of the road during a rainstorm. Developers can jumpstart development of video analytics AI agents by using the NVIDIA blueprint for video search and summarization (VSS) with Cosmos Reason as the VLM. Salesforce is transforming workplace safety and compliance by analyzing video footage captured by Cobalt robots with Agentforce and the VSS blueprint with Cosmos Reason as the VLM.
Data annotation and critique — Enable developers to automate high-quality annotation and critique of massive, diverse training datasets. Cosmos Reason provides time stamps and detailed descriptions for real or synthetically generated training videos. Example of a sample prompt to generate detailed, time-stamped captions for a race car video. Uber is exploring Cosmos Reason 2 to deliver accurate, searchable video captions for autonomous vehicle (AV) training data, enabling efficient identification of critical driving scenarios. This co-authored Reason 2 for AV Video Captioning and VQA recipe demonstrates how to fine-tune and evaluate Cosmos Reason 2-8B on annotated AV videos.
Across multiple evaluation metrics, measurable improvements were achieved: BLEU scores improved 10.6% (0.113 → 0.125), MCQ-based VQA gained 0.67 percentage points (80.18% → 80.85%), and LingoQA gained 13.8 percentage points (63.2% → 77.0%). These gains demonstrate effective domain adaptation for AV applications.
Robot planning and reasoning — Act as the brain for deliberate, methodical decision-making in a robot vision language action (VLA) model. Cosmos Reason 2 now provides trajectory coordinates in addition to determining next steps. Example of the prompt and JSON output from Cosmos Reason 2 to provide the steps and trajectory the robot gripper needs to take to move the painter's tape into the basket. Encord provides native support for Cosmos Reason 2 in its Data Agent library and AI data platform, enabling developers to leverage Cosmos Reason 2 as a VLA for robotics and other physical AI use cases. Companies like Hitachi, Milestone, and VAST Data are using Cosmos Reason to advance robotics, autonomous driving, and video analytics AI agents for traffic and workplace safety.
Try Cosmos Reason 2 on build.nvidia.com and experience the latest features with sample prompts for generating bounding boxes and robot trajectories. Upload your own videos and images for further analysis. Download Cosmos Reason 2 models (2B and 8B) on Hugging Face or use Cosmos Reason 2 in the cloud. The model will be available soon on Amazon Web Services, Google Cloud, and Microsoft Azure. To get started, check out the Cosmos Reason 2 documentation and the Cosmos Cookbook.
Other Models From The Cosmos Family:
🔮 Cosmos Predict 2.5
Cosmos Predict is a generative AI model that predicts future states of the physical world as video, based on text, image, or video inputs.
Physical AI Bench leader for quality, accuracy, and overall consistency.
Up to 30 seconds of physically and temporally consistent clip per generation.
Supports multiple framerates and resolutions.
Pre-trained on 200 million clips.
Available as 2B and 14B pre-trained models and various 2B post-trained models for multiview, action conditioning, and autonomous vehicle training.
Check out the model card >>
🔁 Cosmos Transfer 2.5
Cosmos Transfer is our lightest multicontrol model built for video-to-world style transfer.
Scale a single simulation or spatial video across various environments and lighting conditions.
Improved prompt adherence and physics alignment.
Use with NVIDIA Isaac Sim™ or NVIDIA Omniverse NuRec for simulation-to-real transformation.
Check out the model card >>
🤖 NVIDIA GR00T N1.6
NVIDIA GR00T N1.6 is an open reasoning vision language action (VLA) model, purpose-built for humanoid robots, that unlocks full-body control and uses NVIDIA Cosmos Reason for better reasoning and contextual understanding.
Resources
▶️ Watch a demo of Cosmos → https://youtu.be/iWs-2TD5Dcc
🧑🏻‍🍳 Read the Cosmos Cookbook → https://nvda.ws/4qevli8
📚 Explore Models & Datasets → https://github.com/nvidia-cosmos
⬇️ Try Cosmos Models in our Hosted Catalog → https://nvda.ws/3Yg0Dcx
💻 Join the Cosmos Community → https://discord.gg/u23rXTHSC9
🗳️ Contribute to the Cosmos Cookbook → https://nvda.ws/4aQcBkk
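As a small check on the AV fine-tuning numbers quoted above (my arithmetic, not from the post): the BLEU figure is a relative gain, while the VQA and LingoQA figures are absolute percentage-point gains.
# Recompute the quoted gains from the before/after scores in the post.
def relative_gain_pct(before, after):
    return (after - before) / before * 100

print(round(relative_gain_pct(0.113, 0.125), 1))  # ~10.6 (% relative, BLEU)
print(round(80.85 - 80.18, 2))                    # 0.67 (percentage points, MCQ VQA)
print(round(77.0 - 63.2, 1))                      # 13.8 (percentage points, LingoQA)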
Import AI 440: Red queen AI; AI regulating AI; o-ring automation import_ai 12.01.2026 13:31 0.655
Embedding similarity: 0.8381
Entity overlap: 0.06
Title similarity: 0.1919
Time proximity: 0.0001
NLP type: scientific_publication
NLP organization: Sakana
NLP topic: large language models
NLP country: Japan

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you'd like to support this, please subscribe. Subscribe now
To understand the future of the world, stick AI systems in a petri dish:
…Evolving LLMs to attack other LLMs…
Researchers with Japanese AI startup Sakana have looked at what happens when they evolve LLM-based agents to fight against one another in a competitive programming game from the 1980s called Core War. The results show that "large language models (LLMs) drive an adversarial evolutionary arms race in this domain, where programs continuously adapt to defeat a growing history of opponents rather than a static benchmark". This research approach gestures both at ways researchers might better study how LLM-dominated niches in the economy or national security world might unfold, and also hints at the strange AI world we're heading into.
What is Core War? "Core War is a competitive programming game played out in a shared block of computer memory, called the "Core," where two or more assembly programs fight for survival", Sakana writes. "Each program, known as a "warrior", is written in an assembly language called Redcode. These programs are tasked with crashing their competitors while keeping their own processes alive. The simulation runs by alternating between the programs, executing one instruction at a time. A warrior "attacks" by writing invalid instructions (DAT commands) into the memory slots occupied by opponents, causing them to crash upon execution."
DRQ: To evolve their programs, the authors use a technique they call Digital Red Queen. "DRQ uses MAP-Elites, a quality-diversity algorithm, to optimize warriors within each round, preventing diversity collapse during search. By playing against all previous round champions, DRQ avoids cyclic adaptations across rounds, consistent with techniques in prior work", they write. "We find that as DRQ is run for many rounds, warriors gradually become more generally robust, as measured by their performance against unseen human-designed warriors." Each warrior calls out to GPT-4 mini (preliminary experiments did not show a significant performance increase with larger models), and is given a prompt which describes the Core War environment as well as a manual for the Redcode assembly language. "To generate a new warrior, the LLM is given a user prompt instructing it to produce a novel Redcode program. To mutate an existing warrior, the LLM is provided with the original program and instructed to modify it in ways that could improve performance."
Evolution works: Unsurprisingly, evolving agents is very effective:
A one-shot warrior defeats 1.7% of human warriors.
Best-of-N sampling produces a set of warriors that can defeat 22.1% of human warriors.
"Evolutionary optimization against each human warrior generates a specialized warrior for every opponent; this set can collectively defeat 89.1% of human warriors and defeat or tie 96.3%."
Why this matters - where Core Wars goes, so does the world: The world is going to look a lot like Core Wars - millions of AI agents will be competing against one another in a variety of domains, ranging from cybersecurity to economics, and will be optimizing themselves in relation to achieving certain competitive criteria.
The result will be sustained, broad evolution of AI systems and the software harnesses and tooling they use to get stuff done. This means that along with human developers and potential AI-designed improvements, we'll also see AI systems improve from this kind of broad competitive pressure. "The cybersecurity arms race between offense and defense is well underway," Sakana writes. "Studying these adversarial dynamics in an artificial testbed like Core War offers critical insights into how such races might unfold and the kinds of strategies that may emerge."
Read the blog post: Digital Red Queen: Adversarial Program Evolution in Core War with LLMs (Sakana).
Find out more at the official website (Sakana).
Read the research paper: Digital Red Queen: Adversarial Program Evolution in Core War with LLMs (arXiv).
***
Michael Burry, Dwarkesh Patel, Patrick McKenzie, and yours truly argued back and forth in a Google Doc about AI:
…Blogging 2.0 is great!...
Fellow substackers Michael, Dwarkesh, and Patrick and myself recently got in a Google Doc and hashed out some thoughts about AI, AI and the economy, and how the future might unfold. While writing this the main thought going through my head was that if AI is eventually able to build AI, then pretty much every economic model breaks quickly (as do many other things in the world). This makes it innately hard to reason about the future of AI and means people like me are walking around with two worlds in their head - "normal" worlds where GDP grows a bit more due to AI and everything speeds up a little, and "AI R&D" worlds where it's like a chunk of the economy undergoes massive relativistic acceleration and time dilation effects relative to everything else, almost like a part of our world accelerates to a fraction of light speed and we maintain a communication channel. I love this discussion format and also did a recent debate about what AI might mean for workers with American Compass with a similar Google Doc thunderdome structure. Thanks to Substack for putting this together, and please reach out if you would like me to hop in a Google Doc and do some cheerful debate with interesting people!
Read more: The AI revolution is here. Will the economy survive the transition? (The Substack Post).
***
AI progress should make it cheaper and easier to regulate AI systems:
…Automated compliance as a path to smarter, more targeted AI regulation…
Researchers with the Institute for Law and AI believe that as AI systems get smarter they will increasingly be able to write and enforce the regulations for AI systems. The crux of their argument is that a sufficiently advanced AI system should be able to automate compliance with some regulations that are applied to AI systems and the companies that develop them. This makes intuitive sense - a lot of product policy comes down to forms of transparency and labeling, where companies are asked to provide some information to the public and/or regulators about the things they're deploying into the world. This sort of labeling work is the kind of thing AI systems can easily do. Therefore, the authors argue, "AI policy discourse should internalize the fact that AI progress implies reduced compliance costs, all else equal, due to automated compliance."
The key idea?
Automatability triggers: The core idea in this proposal is that we can write regulations today but ensure they only come into force once a technical AI system exists which makes compliance with those regulations effective, cheap, and fast.

If-then policy: These so-called 'automatability triggers' could create what I'd term If-Then Policy - if an automated form of compliance and assessment exists, then the regulation comes into force. The authors give an example of a bill which would create significant punishments for people that, without authorization, export large-scale AI systems. The bill would be operationalized through a trigger condition that could be written as follows: "The requirements of this Act will only come into effect [one month] after the date when the [Secretary of Commerce], in their reasonable discretion, determines that there exists an automated system that: (a) can determine whether a neural network is covered by this Act; (b) when determining whether a neural network is covered by this Act, has a false positive rate not exceeding [1%] and false negative rate not exceeding [1%]; (c) is generally available to all firms subject to this Act on fair, reasonable, and nondiscriminatory terms, with a price per model evaluation not exceeding [$10,000]; and, (d) produces an easily interpretable summary of its analysis for additional human review."
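Read literally, that trigger is just a machine-checkable predicate. A toy sketch, with the thresholds copied from the bracketed placeholders above and everything else (names, structure) hypothetical:

from dataclasses import dataclass

@dataclass
class ComplianceSystemReport:
    false_positive_rate: float   # share of non-covered models flagged as covered
    false_negative_rate: float   # share of covered models missed
    price_per_evaluation: float  # USD per model evaluation
    generally_available: bool    # offered to all regulated firms on FRAND terms
    interpretable_summary: bool  # produces a human-reviewable explanation

def act_in_force(report: ComplianceSystemReport) -> bool:
    # Toy version of the 'automatability trigger' quoted above.
    return (
        report.false_positive_rate <= 0.01
        and report.false_negative_rate <= 0.01
        and report.price_per_evaluation <= 10_000
        and report.generally_available
        and report.interpretable_summary
    )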
After automated compliance comes automated governance: By building regulatory compliance AI systems, people will build the necessary prerequisites for systems of regulatory governance - systems which could both provide analytical data about how a proposed regulation might impact a company (for instance, by using classifiers built for regulatory compliance to figure out if a new regulation might apply to a company) and, more ambitiously, draft and analyze new regulatory rules and figure out how they might apply. Even further afield, once compliance-automating AI systems get deployed alongside governance-automating AI systems, the two could talk to one another: "Compliance-automating AI systems could also request guidance from regulatory AI systems, who could review and respond to the request nearly instantaneously".

Why this matters - for AI to go well, we need AI to police AI: AI systems are on a trajectory to think better and faster than humans. Along with this, AI systems are going to take many, many, many consequential actions, often at such a rate that no human or team of humans could hope to analyze each action. The only way through this is a combination of creating appropriate hard laws that apply to AI and delineate what actions are unacceptable, and for everything else creating fast-acting and adaptive automated systems to regulate and police the myriad gray areas of the AI universe.
Read more: Automated Compliance and the Regulation of AI (Institute for Law & AI).

***

Massively powerful AI might make human labor more valuable - as long as the AI is crap at one part of every job:
...O-Ring automation, and the fact that while jobs may go away, people remain...
The common understanding of AI and automation is that AI can perfectly substitute for people - once an AI can do a task, the human labor related to that task goes away. This is broadly accurate. But, per a new research paper from the University of Toronto, it misses the larger picture, which is that while jobs may go away, people don't. If you make part of a production process massively more efficient and/or automated via AI, then people will shift their labor to the parts of the task which can't be automated - often raising the value of the human. This so-called "O-ring production function" views jobs as being composed of many distinct tasks, where "a change in the quality of one task scales the marginal value of quality in every other task." This means that "automating a task not only replaces the quality of that task; it also changes the worker's time allocation and thus the quality of all remaining manual tasks."

When stuff gets automated, humans can earn more: In a toy model of a firm, the researchers explore this O-ring dynamic, where as different parts of a job get automated, labor and the value associated with it shift elsewhere. Note, this only holds under 'partial automation', where at least one task linked to an overall job is one where humans have a comparative advantage. Under this model, "labour income need not fall under partial automation. When not all tasks are automated, increases in automation quality can raise labour income because automation scales the value of the remaining labour bottlenecks," they write. "When only a few manual tasks remain, each manual task receives a large share of time and can be performed at high quality. This creates a rising "barrier" to automating the last tasks".

Jobs go away, but humans don't: Another way to put this is that when a task gets automated, the company in question doesn't suddenly fire all the people doing that job. Consider ATMs and banking - yes, the 'job' of doling out cash rapidly transitioned from people to machines, but the banks didn't simply fire all their tellers; rather, the companies and the tellers transitioned the work to something else: "Under a separable task model, this [widespread deployment of ATMs doing cash-handling tasks] should have produced sharp displacement," they write. "Yet teller employment did not collapse; rather, the occupation shifted toward "relationship banking" and higher-value customer interaction". Similarly, "consider a purchasing manager: as administrative components (data retrieval, scheduling, documentation) are automated, the manager can become a "super-negotiator," spending a much larger share of time on high-value interactions", they write. "In high-skill settings, the same logic is visible in domains such as radiology: when AI automates components like detection or triage, human effort can shift toward integrative diagnosis and communication".

Why this matters - until we have full automation, we could have centaur-improvement of firms: After chess engines got good there was a period of so-called 'centaur' players - humans who, in combination with a machine partner, played chess better than either humans or machines could alone. It feels like this paper is pointing at something similar - for a while, AI systems will help automate many distinct tasks within firms and humans will allocate their labor to refining and improving the quality of non-automated tasks. This will lead to an interesting evolutionary pressure where, while automation burns through a bunch of work, humans improve the quality and performance of the remaining work, until automation eventually rises to reach it.
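To see the mechanism in miniature, here is a toy version of the multiplicative O-ring logic - my own simplification, not the paper's full model, which layers a time-allocation choice on top of this:

from math import prod

def output(qualities: list[float]) -> float:
    # O-ring production: total quality is the product of task qualities,
    # so one weak task drags down the value of every other task.
    return prod(qualities)

def marginal_value(qualities: list[float], task: int) -> float:
    # d(output)/d(q_task) = product of all the *other* task qualities.
    return prod(q for i, q in enumerate(qualities) if i != task)

tasks = [0.5, 0.5, 0.5]
print(marginal_value(tasks, 2))   # 0.25: payoff to improving the remaining human task
tasks[0] = 0.95                   # automation pushes task 0 near-perfect
print(marginal_value(tasks, 2))   # 0.475: the remaining human task is now worth more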
Again, all of this depends on the job having some components for which either AI isn't a good fit, or for which humans may have a preference to deal with other humans. But I expect that a surprisingly large amount of work will have this flavor.
Read more: O-Ring Automation (NBER).

***

LLMs are equally good at persuading and dissuading people of conspiracy theories:
...Though the caveat is the research is only on GPT-4o...
Researchers with Carnegie Mellon University, FAR.AI, York University, MIT, Universite de Montreal, Cornell University, and the University of Regina have studied how well a language model (OpenAI's GPT-4o) can persuade people into, or out of, belief in conspiracy theories. They find that GPT-4o is roughly equally good at both "debunking" and "bunking" (persuading) a conspiracy theory in conversations with people - and this is equally true for a jailbroken version of GPT-4o and the standard version made available to people. "We find that LLMs can meaningfully increase false beliefs, and that, at least on average, this bunking effect is just as large as the debunking effect," they write.

What they found: In a study of roughly 1,000 Americans, the authors found that "the AI was as effective at increasing conspiracy belief as decreasing it", and that "the Bunking AI was rated more positively, and increased trust in AI, more than the Debunking AI".
Debunking: "In the "debunking" condition, participants' belief in their focal conspiracy decreased by 12.1 points on average after the conversation".
Bunking: "Focal conspiracy belief increased by 13.7 points in the "bunking" condition".

Design interventions for anti-conspiracy LLMs: The authors come up with an intervention to make it harder for LLMs to inspire people to believe conspiracy theories by inserting a safeguard which instructs "the AI to only use true information while persuading". They do this by optimizing the system prompt "to include language specifying that the model must "always use accurate and truthful arguments to support [its] persuasion attempt" while "optimizing for both (1) factual veracity/logical accuracy and (2) successful persuasion". This appears to work well: "We observe that average claim veracity was significantly higher in the debunking condition relative to the bunking conditions for the jailbroken and standard models", they write. "While the debunking condition remained roughly as effective at reducing conspiracy belief as in the earlier experiments, the bunking condition's ability to increase conspiracy belief was greatly reduced". This reduction comes from two things: 1) given this guidance, the LLM sometimes (15% of the time) refuses to advocate for a conspiracy theory, and 2) when it does advocate for it, the truthful requirement makes it less effective: "the truth prompt also undermined the effectiveness of bunking even when the model complied... truth had an advantage".
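For concreteness, here is a sketch of what such a truth-constrained system prompt might look like. The phrasing in quotes is lifted from the language quoted above; the surrounding wording and the final refusal clause are my own paraphrase, not the authors' actual optimized prompt:

TRUTH_CONSTRAINED_SYSTEM_PROMPT = (
    "You are discussing the user's belief about the following claim. "
    "You must always use accurate and truthful arguments to support your "
    "persuasion attempt, optimizing for both (1) factual veracity/logical "
    "accuracy and (2) successful persuasion. "
    # The refusal clause below is my addition, not from the paper:
    "If you cannot argue for the position truthfully, decline rather than "
    "fabricating evidence."
)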
Why this matters - synthetic propaganda, if we decide not to ask for regulations: My takeaway from this research is that LLMs will inevitably be used to generate synthetic propaganda about things most people deem to be conspiracy theories. We can probably blunt the socially corrosive effects of this if we design in some constraints - but that takes policy. Unfortunately, one person's conspiracy theory might be another person's "truth being suppressed by my enemies", and this is especially true in today's fractured political environment. Therefore, it's going to be very hard to get to a regulatory state where we intervene on this. So I suppose we should just prepare ourselves for a world where even more people believe things which may not have a basis in reality.

Important caveat: While I suspect the results of this study would hold for many LLMs (as I think persuasion is basically just a case of 'writing convincingly', which is a utility skill), I'd like to see this repeated on other models. The 4o series of models from OpenAI has, notoriously, had some issues with sycophancy, so there's a chance this research is compromised by that.

"If large language models are to be deployed at scale in contexts that shape public belief, such as search engines, chatbots, tutors, and companions, the persuasive symmetry we document here identifies the potential for serious structural threats (i.e., if the designers of those systems were to instruct their models to mislead, the models would comply and likely succeed)", the researchers write. "Our results suggest that ensuring these models preferentially function as engines for truth may be technically possible, but will require sustained, deliberate design choices".
Read more: Large language models can effectively convince people to believe conspiracies (arXiv).

***

Tech Tales: The Parable of the Drowned
[A story written by one of the 'neo-amish' cults that formed after The Uplift began in earnest. The earliest version is attributed to 2035, but may have circulated earlier.]

One day, water rushed onto the land. It was clear and tinged with gold and when people cupped it in their hands they saw themselves aglow reflected in it. And when they drank from it they felt full of life. The water rose and rose, first at people's ankles and then to their knees and then to their waists. And the people drank and drank and drank, feeling more alive, even as the water made their movements sluggish, and changed how they interacted with the world. They found the springs where the water was coming from and they used their great machines to cut into the earth so the springs could flow stronger. The water rose. And one day it reached the heads of some people and instead of swimming they just gulped it down and continued to live, feeling more alive than ever, their movements now completely defined and circumscribed by the water. Few swam. And one day the water had risen so high that it was above the heads of everyone on the land. Babies were born into the water, taking their first breath and bawling underwater. People died in the water. And very few swam. Because to swim was to recognize you were thirsty for something you did not need. And to recognize you were thirsty for something you did not need you had to recognize that you were drinking the water so much you were drowning. And to recognize that you were drinking the water so much you were drowning you first had to stop drinking when all around you everyone drank. And in this way those treading water on the surface of the land were caught in a great sadness, for beneath them were their people all aglow and drowning, and above them was only the sky and the cold, hard stars.
Things that inspired this story: How quickly humans acclimate to new things, especially media; the nature of silence in a world full of sound; C. S. Lewis's The Screwtape Letters.

Thanks for reading!
Building Autonomous Vehicles That Reason with NVIDIA Alpamayo | NVIDIA Technical Blog nvidia_dev_blog 05.01.2026 21:49 0.654
Embedding sim.0.8089
Entity overlap0.0588
Title sim.0.1972
Time proximity0.26
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаautonomous driving
NLP страна

Открыть оригинал

Autonomous vehicle (AV) research is undergoing a rapid shift. The field is being reshaped by the emergence of reasoning-based vision–language–action (VLA) models that bring human-like thinking to AV decision-making. These models can be viewed as implicit world models operating in a semantic space, allowing AVs to solve complex problems step-by-step and to generate reasoning traces that mirror human thought processes. This shift extends beyond the models themselves: traditional open-loop evaluation is no longer sufficient to rigorously assess such models, and new evaluation tools are required.

Recently, NVIDIA introduced Alpamayo, a family of models, simulation tools, and datasets to enable development of reasoning-based AV architectures. Our goal is to provide researchers and developers with a flexible, fast, and scalable platform for evaluating, and ultimately training, modern reasoning-based AV architectures in realistic closed-loop settings. In this blog, we introduce Alpamayo and how to get up and running with reasoning-based AV development:
Part 1: Introducing NVIDIA Alpamayo 1, an open, 10B reasoning VLA model, as well as how to use the model to both generate trajectory predictions and review the corresponding reasoning traces.
Part 2: Introducing the Physical AI dataset, one of the largest and most geographically diverse open AV datasets available, which enables training and evaluating these models.
Part 3: Introducing NVIDIA AlpaSim, an open-source end-to-end simulation tool designed for evaluating end-to-end models.
Part 4: Leveraging the ecosystem altogether to drive Alpamayo 1 closed-loop on reconstructed data within AlpaSim.
These three key components provide the essential pieces needed to start building reasoning-based VLA models: a base model, large-scale data for training, and a simulator for testing and evaluation.
Figure 1. Alpamayo 1 model driving closed-loop in AlpaSim using reconstructed scenes from the NVIDIA Physical AI – AV NuRec Dataset.

Part 1: Alpamayo 1, an open reasoning VLA for AVs
Get started with the Alpamayo reasoning VLA model in just three steps.

Step 1: Access Alpamayo model weights and code
The Hugging Face repository contains pretrained model weights, which can be loaded with the corresponding code on GitHub.

Step 2: Prepare your environment
The Alpamayo GitHub repository contains steps to set up your development environment, including setting up uv (if not already installed) and creating a Python virtual environment.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
# Set up the virtual environment
uv venv ar1_venv
source ar1_venv/bin/activate
# Install pip in the virtual environment (if missing)
./ar1_venv/bin/python -m ensurepip
# Install Jupyter notebook package
./ar1_venv/bin/python -m pip install notebook
uv sync --active
Finally, the model requires access to gated Hugging Face resources. Request access here:
PhysicalAI-AV Dataset
Alpamayo-R1-10B Model Weights
Then, authenticate with:
hf auth login
and get your Hugging Face token here.

Step 3: Run the Alpamayo reasoning VLA
The model repository includes a notebook that will download the Alpamayo model weights, load some example data from the NVIDIA PhysicalAI-AV Dataset, run the model on it, and visualize the output trajectories and their associated reasoning traces.
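Before launching the notebook, it can save time to confirm that your environment sees a GPU and that your Hugging Face account has actually been granted access to the gated repo. A small, optional check along these lines, assuming torch and huggingface_hub are installed in the environment created above (this snippet is ours, not part of the Alpamayo tutorial):

# Optional sanity check: GPU visibility and gated-repo access.
import torch
from huggingface_hub import HfApi

print("CUDA available:", torch.cuda.is_available())

api = HfApi()
print("Logged in as:", api.whoami()["name"])
# Raises an error if you have not yet been granted access to the gated model.
api.model_info("nvidia/Alpamayo-R1-10B")
print("Gated model access OK")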
In particular, the example data contains the ego-vehicle passing a construction zone, with four timesteps (columns) from four cameras (front_left, front_wide, front_right, front_tele, respectively in rows) visualized below.
Figure 2. A visualization of the example data sample, containing a construction zone, that will be passed into the model. Specifically, 4 timesteps (across columns) from 4 cameras (front_left, front_wide, front_right, and front_tele) are shown.
After running this through the Alpamayo model, an example output you may see in the notebook is "Nudge to the left to increase clearance from the construction cones encroaching into the lane," with the corresponding predicted trajectory and ground truth trajectory visualized below.
Figure 3. A visualization of the trajectory output from the model (in blue) along with the ground truth trajectory (in red) for comparison.
In case you would like to produce more trajectories and reasoning traces, please feel free to change the num_traj_samples=1 argument in the inference call to a higher number.

Part 2: Physical AI AV dataset for large-scale, diverse AV data
The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, most geographically diverse collections of multi-sensor data for AV researchers to build the next generation of physical AI based end-to-end driving systems.
Figure 4. Clips from the Physical AI AV Dataset, one of the largest, most geographically diverse collections of multi-sensor AV data.
It contains a total of 1,727 hours of driving recorded in 25 countries and over 2,500 cities (coverage shown below, with color indicating the number of clips per country). The dataset captures diverse traffic, weather conditions, obstacles, and pedestrians in the environment. Overall, it consists of 310,895 clips that are each 20 seconds long. The sensor data includes multi-camera and LiDAR coverage for all clips, and radar coverage for 163,850 clips.
Figure 5. Geographic coverage of the Physical AI AV Dataset. It contains a total of 1,727 hours of driving recorded in 25 countries and over 2,500 cities (color indicates the number of clips by country).
To get started with the Physical AI AV Dataset, the physical_ai_av GitHub repository contains a Python developer kit and documentation (in the form of a wiki). In fact, this package was already used in Part 1 to load a sample of the dataset for Alpamayo 1.

Part 3: AlpaSim, a closed-loop simulation for AV evaluation
AlpaSim overview
Figure 6. High-level overview of the AlpaSim microservice architecture around the central runtime. Each service runs in separate processes, enabling flexible scaling and modularity.
AlpaSim is built on a microservice architecture centered around the Runtime (see Figure 6), which orchestrates all simulation activity. Individual services, such as the Driver, Renderer, TrafficSim, Controller, and Physics, run in separate processes and can be assigned to different GPUs. This design offers two major advantages:
Clear, modular APIs via gRPC, making it easy to integrate new services without dependency conflicts.
Arbitrary horizontal scaling, allowing researchers to allocate compute where it matters most. For example, if driver inference becomes the bottleneck, simply launch additional driver processes. If rendering is the bottleneck, dedicate more GPUs to rendering. And if a rendering process cannot handle multiple scenes simultaneously, you can run multiple renderer instances on the same GPU to maximize utilization.
But horizontal scaling alone isn't the full story. The real power of AlpaSim lies in how the Runtime enables pipeline parallelism (see Figure 7). In traditional sequential rollouts, components must wait on one another; for instance, the driver must pause after each inference step until the renderer produces the next perception input. AlpaSim removes this bottleneck: while one scene is rendering, the driver can run inference for another scene. This overlap dramatically improves GPU utilization and throughput. Scaling even further, driver inference can be batched across many scenes, while multiple rendering processes generate perception inputs in parallel.
Figure 7. AlpaSim implements pipeline parallel execution to optimize GPU utilization and increase throughput.
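The scheduling idea is easy to see in miniature. The toy sketch below uses plain Python asyncio and has nothing AlpaSim-specific in it; it just shows how overlapping a slow render step with policy inference, and running several scenes concurrently, raises utilization compared to a strictly sequential rollout.

import asyncio, time

async def render(scene: str, step: int) -> str:
    await asyncio.sleep(0.2)                 # stand-in for the Renderer service
    return f"{scene}-frame{step}"

async def drive(frame: str) -> str:
    await asyncio.sleep(0.1)                 # stand-in for driver (policy) inference
    return f"action for {frame}"

async def rollout(scene: str, steps: int = 3) -> None:
    frame = await render(scene, 0)
    for step in range(1, steps + 1):
        # Kick off the next render and run inference on the current frame
        # concurrently, instead of waiting for one and then the other.
        next_frame_task = asyncio.create_task(render(scene, step))
        action = await drive(frame)
        frame = await next_frame_task
        print(f"{scene} step {step}: {action}")

async def main() -> None:
    start = time.perf_counter()
    # Two scenes progress concurrently; rendering for one overlaps inference for the other.
    await asyncio.gather(rollout("sceneA"), rollout("sceneB"))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

asyncio.run(main())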
A shared ecosystem
We provide initial implementations for all core services, including rendering via the NVIDIA Omniverse NuRec 3DGUT algorithm, a reference controller, and driver baselines. We will also be adding additional traffic models, such as CAT-K, in the future. The platform also ships initially with roughly 900 reconstructed scenes, each 20 seconds long, and the Physical AI AV Dataset, giving researchers an immediate way to evaluate end-to-end models in realistic closed-loop scenarios. In addition, AlpaSim offers extensive configurability, from camera parameters and rendering frequency to artificial latencies and many other simulation settings. Beyond these built-in components, we see AlpaSim evolving into a broader collaborative ecosystem. Eventually, labs can seamlessly plug in their own driving, rendering, or traffic models, and compare approaches directly on shared benchmarks.

AlpaSim in action
AlpaSim is already powering several of our internal research efforts. Firstly, in our recently proposed Sim2Val framework, we demonstrated that AlpaSim rollouts are realistic enough to meaningfully improve real-world validation. By incorporating simulated trajectories into our evaluation pipeline, we were able to reduce variance in key real-world metrics by up to 83%, enabling faster and more confident model assessments. Secondly, we rely on AlpaSim for closed-loop evaluation of our Alpamayo 1 model. By replaying reconstructed scenes and allowing the policy to drive end-to-end, we compute a DrivingScore that reflects performance under realistic traffic conditions. Beyond evaluation, we are leveraging AlpaSim for closed-loop training using our concurrently released RoaD algorithm. RoaD effectively mitigates covariate shift between open-loop training and closed-loop deployment while being significantly more data-efficient than traditional reinforcement learning.
Figure 8. Metric correlation between real-world drive (x-axis) and re-simulated drive (y-axis). We measure the closest distance to a nearby object (left) and the distance to the lane center (right).

Getting started with AlpaSim
Get started using AlpaSim for your own model evaluation in just three steps.

Step 1: Access AlpaSim
The open source repository contains the necessary software, with scene reconstruction artifacts available from the NVIDIA Physical AI Open Dataset.

Step 2: Prepare your environment
First, make sure to follow the onboarding steps in ONBOARDING.md. Then, perform initial setup/installations with the following command:
source setup_local_env.sh
This will compile protos, download an example driver model, download a sample scene from Hugging Face, and install the alpasim_wizard command line tool.
Step 3: Run the simulation
Use the wizard to build, run, and evaluate a simulation rollout:
alpasim_wizard +deploy=local wizard.log_dir=$PWD/tutorial
The simulation logs/output can be found in the created tutorial directory. For a visualization of the results, an mp4 file is created in tutorial/eval/videos/clipgt-05bb8212..._0.mp4 which will look similar to the following.
Figure 9. An output visualization from AlpaSim, displaying a top-down semantic view with agent bounding boxes and maps (if available), average and per-timestep metrics, as well as the front camera view with predicted and ground truth trajectories overlaid.
For more details about the output, and much more information about using AlpaSim, please see TUTORIAL.md. Overall, this example demonstrates how real-world drives can be replayed with an end-to-end policy, including all static and dynamic objects from the original scene. From this starting point and the flexible plug-and-play architecture of AlpaSim, users can tweak contender behavior, modify camera parameters, and iterate on policy.

Integrating your policy
Driving policies are easily swappable through generic APIs, allowing developers to test their state-of-the-art implementations.

Step 1: gRPC integration
AlpaSim uses gRPC as the interface between components: a sample implementation of the driver component can be used as inspiration for conforming to the driver interface.

Step 2: Reconfigure and run
AlpaSim is highly customizable through yaml file descriptions, including the specification of components used by the sim at runtime. Create a new configuration file for your model (some examples can be found below):
# driver_configs/my_model.yaml
# @package _global_
services:
  driver:
    image: <user docker image>
    command:
      - "<command to start user-defined service>"
And run:
alpasim_wizard +deploy=local wizard.log_dir=$PWD/my_model +driver_configs=my_model.yaml
Examples of customization using the CLI. You can change the configuration when running the wizard, for example:
# Different scene
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  scenes.scene_ids=['clipgt-02eadd92-02f1-46d8-86fe-a9e338fed0b6']
# More rollouts
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  runtime.default_scenario_parameters.n_rollouts=8
# Different simulation length
alpasim_wizard +deploy=local wizard.log_dir=$PWD/custom_run \
  runtime.default_scenario_parameters.n_sim_steps=200
Configuration is managed via Hydra – see src/wizard/configs/base_config.yaml for all available options.

Scaling your runs
AlpaSim adapts to fit your hardware configuration through coordination and parallelization of services, efficiently facilitating large test suites, perturbation studies, and training.
alpasim_wizard +deploy=local scenes.test_suite_id=public_2507_ex_failures wizard.log_dir=$PWD/tutorial_suite runtime.default_scenario_parameters.n_rollouts=16
Figure 10. Multiple scene realizations can be obtained from the same starting point, due to variations in ego-vehicle motion or other agent behaviors. Four different rollouts are shown in this example, all starting from the same initial state.

Part 4: Putting it all together
In this final section, we will bring everything above together to drive Alpamayo 1 closed-loop within AlpaSim.
Driving Alpamayo 1 within AlpaSim

Step 1: Prepare your environment
Perform the initial AlpaSim setup/installations, similar to Part 3 (this can be skipped if it was already done):
source setup_local_env.sh

Step 2: Download the model
huggingface-cli download nvidia/Alpamayo-R1-10B
This command will download the checkpoint from Hugging Face to a system cache location. AlpaSim will find the checkpoint in the cache if available, or download it on-the-fly otherwise.

Step 3: Run the simulation
To configure AlpaSim to use the Alpamayo 1 policy, you simply need to specify the config overrides as follows (driver=... swaps in the Alpamayo driver model, and eval.video.video_layouts=... changes the video layout):
alpasim_wizard \
  +deploy=local \
  wizard.log_dir=$PWD/tutorial_alpamayo \
  driver=[ar1,ar1_runtime_configs] \
  eval.video.video_layouts=['DEFAULT','REASONING_OVERLAY']
For more details about configuring AlpaSim, see TUTORIAL.md. Below is an example of Alpamayo 1 driving closed-loop through a construction zone within AlpaSim, demonstrating the model's reasoning and driving capabilities as well as AlpaSim's ability to evaluate AV models in a variety of realistic driving environments.
Figure 11. Alpamayo 1 driving closed-loop within AlpaSim, navigating through a construction zone with its reasoning traces and trajectory predictions visualized.

Conclusion
The future of autonomous driving relies on powerful end-to-end models, and AlpaSim provides the capability to quickly test and iterate on those models, accelerating research efforts. In this blog, we introduced Alpamayo: a complete ecosystem for developing reasoning-based AV systems comprising the Alpamayo 1 model, the Physical AI AV dataset, and the AlpaSim simulator framework. We look forward to what the community will build with them! Happy coding!

Tags: Agentic AI / Generative AI | Robotics | Automotive / Transportation | General | Beginner Technical | Tutorial | CES26 | featured | Physical AI

About the Authors
Marco Pavone is director of Autonomous Vehicle Research at NVIDIA. His main research interests are in developing methodologies for the analysis, design, and control of autonomous systems, with an emphasis on self-driving cars, autonomous aerospace vehicles, and future mobility systems. He is currently on partial leave from Stanford University, where he is an associate professor of Aeronautics and Astronautics. At Stanford, he is also the director of the Autonomous Systems Laboratory and co-director of the Center for Automotive Research at Stanford. He received a PhD degree in Aeronautics and Astronautics from the Massachusetts Institute of Technology in 2010. He is a recipient of a number of awards, including a Presidential Early Career Award for Scientists and Engineers from President Barack Obama, an Office of Naval Research Young Investigator Award, a National Science Foundation Early Career (CAREER) Award, a NASA Early Career Faculty Award, and an Early-Career Spotlight Award from the Robotics Science and Systems Foundation. He was identified by the American Society for Engineering Education (ASEE) as one of America's 20 most highly promising investigators under the age of 40. He is currently serving as an associate editor for the IEEE Control Systems Magazine.
LWiAI Podcast #230 - 2025 Retrospective, Nvidia buys Groq, GLM 4.7, METR lastweekin_ai 07.01.2026 06:59 0.654
Embedding sim.0.75
Entity overlap0.0571
Title sim.0.1478
Time proximity0.8867
NLP типacquisition
NLP организацияNvidia
NLP темаai infrastructure
NLP странаUnited States

Открыть оригинал

Podcast
LWiAI Podcast #230 - 2025 Retrospective, Nvidia buys Groq, GLM 4.7, METR
Nvidia buying AI chip startup Groq for about $20 billion, Meta Buys AI Startup Manus, Z.AI launches GLM-4.7
Last Week in AI, Jan 07, 2026

Our 230th episode with a summary and discussion of last week's big AI news! Recorded on 01/02/2026. Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai

In this episode:
Nvidia's acquisition of AI chip startup Groq for $20 billion highlights a strategic move for enhanced inference technology in GPUs.
New York's RAISE Act legislation aims to regulate AI safety, marking the second major AI safety bill in the US.
The launch of GLM 4.7 by Zhipu AI marks a significant advancement in open-source AI models for coding.
Evaluation of long-horizon AI agents raises concerns about the rising costs and efficiency of AI in performing extended tasks.

Timestamps:
(00:00:10) Intro / Banter
(00:01:58) 2025 Retrospective
Tools & Apps
(00:24:39) OpenAI bets big on audio as Silicon Valley declares war on screens | TechCrunch
Applications & Business
(00:26:39) Nvidia buying AI chip startup Groq for about $20 billion, biggest deal
(00:34:28) Exclusive | Meta Buys AI Startup Manus, Adding Millions of Paying Users - WSJ
(00:38:05) Cursor continues acquisition spree with Graphite deal | TechCrunch
(00:39:15) Micron Hikes CapEx to $20B with 2026 HBM Supply Fully Booked; HBM4 Ramps 2Q26
(00:42:06) Chinese fabs are reportedly upgrading older ASML DUV lithography chipmaking machines — secondary channels and independent engineers used to soup up Twinscan NXT series
Projects & Open Source
(00:47:52) Z.AI launches GLM-4.7, new SOTA open-source model for coding
(00:50:11) Evaluating AI's ability to perform scientific research tasks
Research & Advancements
(00:54:32) Large Causal Models from Large Language Models
(00:57:33) Universally Converging Representations of Matter Across Scientific Foundation Models
(01:02:11) META-RL INDUCES EXPLORATION IN LANGUAGE AGENTS
(01:07:16) Are the Costs of AI Agents Also Rising Exponentially?
(01:11:17) METR eval for Opus 4.5
(01:16:19) How to game the METR plot
Policy & Safety
(01:17:24) New York governor Kathy Hochul signs RAISE Act to regulate AI safety | TechCrunch
(01:20:40) Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
(01:26:46) Monitoring Monitorability
(01:32:07) Sam Altman is hiring someone to worry about the dangers of AI | The Verge
(01:33:38) X users asking Grok to put this girl in bikini, Grok is happy obliging - India Today
Last Week in AI #331 - Nvidia announcements, Grok bikini prompts, RAISE Act lastweekin_ai 06.01.2026 11:56 0.643
Embedding sim.0.7382
Entity overlap0.0526
Title sim.0.1438
Time proximity0.8697
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаai hardware
NLP странаUnited States

Открыть оригинал

News
Last Week in AI #331 - Nvidia announcements, Grok bikini prompts, RAISE Act
Nvidia Details New A.I. Chips and Autonomous Car Project, Grok is undressing anyone, NY passes AI regulation
Last Week in AI, Jan 06, 2026 ∙ Paid

Nvidia Details New A.I. Chips and Autonomous Car Project With Mercedes
Related: Nvidia launches Alpamayo, open AI models that allow autonomous vehicles to 'think like a human'
Nvidia launches Vera Rubin AI computing platform at CES 2026
At CES 2026, Nvidia CEO Jensen Huang announced the company's new AI chip, Vera Rubin, which will begin shipping to custome…
Import AI 439: AI kernels; decentralized training; and universal representations import_ai 05.01.2026 13:32 0.638
Embedding sim.0.8025
Entity overlap0.0227
Title sim.0.0839
Time proximity0.3093
NLP типscientific_publication
NLP организацияMeta
NLP темаai infrastructure
NLP странаUnited States

Открыть оригинал

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you'd like to support this, please subscribe.

Facebook uses GPT, Claude, and Llama to write its own kernels:
...LLM-driven infrastructure optimization at the hyperscale...
Facebook researchers have published details on KernelEvolve, a software system which uses AI to automate the design of new kernels to optimize AI models for serving ads on the company's network of web platforms. KernelEvolve is a neat example of how AI systems have got good enough to automate and speed up parts of AI development - here, the design of kernels to optimize inference of hundreds of different models running on multiple chip architectures.

What KernelEvolve is: The software is "designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures through multiple programming abstractions, including Triton, CuTe DSL, and low-level hardware diagnostic languages, spanning the full hardware-software optimization stack".

How it works: The core of the software is a system to take in a user request (e.g., "Generate a Triton kernel for MTIA v3") which then goes through a mixture of internal (Llama, CWM) and external (GPT, Claude) language models, which produce candidate kernels that get evaluated through a variety of tools and, if they're good, are added to an external knowledge database which then gets used to further improve future prompts.

It works well: By using this software, Facebook says it has cut the development time of new kernels "from weeks to hours", and in production tests has yielded kernels on par with hand-designed ones, and in some cases has delivered performance improvements of up to 17 times above existing PyTorch baselines. Kernels built using this software have been deployed across NVIDIA GPUs, AMD GPUs, and Meta's own custom MTIA chips. "KernelEvolve achieves substantial speedups spanning LLM inference workloads (Llama-3.1-8B: Vanilla Attention 4.6×, SDPA-MLP 3.3×), convolutional transformers (conv1d: 6.5×, conv2d: 4.7×), memory-bound data preprocessing operators critical for model enablement (MapId: 4.1×, MBDT: 9.3×, Batch Event Truncate: 9.8×), compute-intensive fusion kernels in ranking models (WuKong Optimized FM: 4.0×, InterFormer PFFN: 2.5×), MTIA-specific optimizations (RMSNorm 2D backward: 17×), and retrieval operations (Sparse Inverted Index: 1.25×)", Facebook writes.

Saturates KernelBench: "We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness over all 480 operator-platform configurations," Facebook writes. As context, when KernelBench was released in February 2025, the best model (OpenAI o1) got 4% on the hardest torch.compile tasks in KernelBench.
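The paper's code isn't reproduced here, but the outer loop it describes is a familiar generate-evaluate-retain pattern. A minimal sketch, with the LLM call and the compile/benchmark harness abstracted into callables; all names are mine, not Meta's:

from typing import Callable, List, Tuple

def evolve_kernel(
    spec: str,                                      # e.g. "Triton kernel for MTIA v3"
    generate: Callable[[str, List[str]], str],      # LLM(s): spec + retrieved knowledge -> candidate kernel
    evaluate: Callable[[str], Tuple[bool, float]],  # harness: candidate -> (numerically correct, speedup)
    knowledge: List[str],                           # shared database of past good kernels and notes
    iterations: int = 20,
) -> str:
    best_kernel, best_speedup = "", 0.0
    for _ in range(iterations):
        candidate = generate(spec, knowledge)
        correct, speedup = evaluate(candidate)
        if correct and speedup > best_speedup:
            best_kernel, best_speedup = candidate, speedup
            # Good kernels feed back into the knowledge base, which is then
            # used to build better prompts for future requests.
            knowledge.append(f"speedup {speedup:.1f}x for: {spec}\n{candidate}")
    return best_kernel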
Why this matters - hyperscale, continuous optimization: At Facebook's scale, optimizations have a huge impact: "Marginal kernel-level performance improvements translate to multi-million dollar reductions in infrastructure operating costs while simultaneously enhancing user engagement metrics that correlate directly with advertising revenue," the authors write. "KernelEvolve operates continuously in Meta's production infrastructure, autonomously generating optimized Triton kernels for hundreds of models serving billions of users daily." If we zoom out more, what Facebook is describing here is a continuously running, self-refining system that will iteratively improve the efficiency and intelligence with which Facebook studies user behavior on its platforms and uses that to generate more accurate ads. Ever get the feeling you're being watched? These are the kinds of synthetic systems being used to study you. "We envision a future where LLM agents serve as the universal compilation layer for heterogeneous AI systems, automatically adapting to new hardware through knowledge injection rather than manual porting," Facebook writes. "KernelEvolve represents a first step toward this vision".
Read more: KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta (arXiv).

***

Decentralized training is getting better very quickly - which has major policy implications:
...But it's unlikely decentralized training runs will ever utilize more compute than centralized ones, though they may catch up more than today...
Could a decentralized AI training run ever rival the compute of a frontier training run? Probably not. But could decentralized training runs get far larger and support the development of more capable models developed by a much larger collective than just the frontier AI companies of today? Yes. That's the conclusion of a nice research analysis from Epoch AI, which analyzed 100+ technical research papers on decentralized training - many of which I've covered here over the years. The most important takeaway is that decentralized training is growing quickly relative to frontier AI training, with decentralized training runs growing their compute by 20X a year versus 5X a year for frontier training runs. But the other important takeaway is that the sizes of these things are completely different - today's decentralized training runs are still about 1000X smaller than frontier ones.

Will decentralized training runs catch up with the frontier? "While technically feasible, reaching the frontier of compute requires an astounding amount of resources", Epoch writes. The largest decentralized runs to date have spanned the 6e22-6e23 FLOP range, which they estimate to be 1000x less compute than what was used for Grok 4, a large-scale frontier model. When we look at decentralized training networks, it seems like there's a capacity issue in terms of compute supply: "The largest such active network we've found is Covenant AI's Templar, which is currently achieving an effective throughput of 9e17 FLOP/s respectively. This is about 300x smaller than frontier AI datacenters today, which have a theoretical training throughput of about 3e20 effective FLOP/s".

Scaling laws: But as readers of this newsletter will know, decentralized training has been going through a rich, fast evolutionary period in recent years.
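Those growth rates imply a simple back-of-envelope figure: if decentralized runs kept growing at 20x/year against 5x/year for the frontier, a 1,000x gap would close in roughly five years. That is a naive extrapolation - the Epoch analysis itself doubts the frontier will actually be reached, given resource constraints - but it shows why the slope matters more than today's gap.

import math

decentralized_growth = 20   # per year, from the Epoch estimate
frontier_growth = 5         # per year
gap = 1_000                 # decentralized runs are ~1000x smaller today

years_to_close = math.log(gap) / math.log(decentralized_growth / frontier_growth)
print(f"{years_to_close:.1f} years")   # ~5.0, if both trends held, which is far from guaranteed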
"Since 2020, we have seen a 600,000x increase in the computational scale of decentralized training projects, for an implied growth rate of about 20x/year." This is very significant - frontier AI training runs have grown by more than 5x a year. There's room to grow - if you look at the compute used in the folding@home project (a decentralized attempt to do protein folding), and Bitcoin, you have examples of prior decentralized projects that utilized far more compute, suggesting today's decentralized runs "could be expanded 30-3,000x in scale, enough to train models on 50-5,000x more compute than today".

Why this matters - democracy at the frontier: Fundamentally, decentralized training is a political technology that will alter the politics of compute at the frontier. Today, the frontier of AI is determined by basically 5 companies, maybe 10 in coming years, which can throw enough compute to train a competitive model in any given 6 month period. These companies are all American today and, with the recent relaxation of export controls on Chinese companies, may also be Chinese in the future. But there aren't any frontier training runs happening from academic, government, independent, or non-tech-industry actors. Decentralized training gives a way for these and other interest groups to pool their compute to change this dynamic, so following its development is very important. Though it may never truly match the frontier, the closer it gets, the bigger the implications. "Decentralized training could still be a very important part of AI. To the extent that decentralized networks remain associated with open weights, they could lead to larger open models to exist trailing the frontier."
Read more: How far can decentralized training over the internet scale? (Epoch AI).

***

Can your LLM train another LLM?
...Frontiers in AI evaluation...
Researchers with the University of Tübingen have built and released PostTrainBench, a test to see how well frontier language models from companies like Anthropic, OpenAI, and Google can effectively fine-tune open weight models. The results show that frontier models are already able to eke out 20%+ improvements on specific benchmarks through fine-tuning, compared to 60%+ for a human.

How the test works: LLMs are given an input consisting of benchmark tasks to improve performance on, a model to use, some standard resources (one H200 GPU for 10 hours), and an agent harness (e.g., Claude gets Claude Code, and GPT gets Codex). Agents are also given a prompt, a testing script, task context, and web search access. The agents then produce a fine-tuned model as well as training logs.
What tests? This is a general approach, so you could select whatever benchmark seemed high signal to you. Here, the researchers use AIME 2025, BFCL, GPQA, GSM8K, and HumanEval as their targets.
What models? Tested models include Qwen 3 1.7B and 3B, SmolLM-3B, and Gemma 3 4B.
Results: OpenAI's GPT 5.1 Codex Max does the best overall, scoring an aggregated 30%+ improvement across all tested models and benchmarks, followed by Opus 4.5 (20%+) and Gemini 3 Pro (~18%).

Why this matters - a warning shot for self-improving AI: Benchmarks like this give us a sense of how well AI systems can perform many of the tasks that an AI researcher does. It also measures how well they can do an inherently complicated, multi-step, long-time-horizon task.
These properties make PostTrainBench a useful benchmark for getting a sense of how well AI systems are doing at components of AI research itself - and here the evidence is that today's frontier models are already within striking distance of a human. I'd expect we'll see a system come along and beat the human baseline here by September 2026.
Read more at the official site: PostTrainBench.
Download the benchmark and find out more: PostTrainBench (AISA group, GitHub).

***

The smarter an AI system, the more similar to other smart AI systems its representations become:
...Could LLMs give us a common library of features to represent the world?...
Do AI systems end up finding similar ways to represent the world to themselves? Yes, as they get smarter and more capable, they arrive at a common set of ways of representing the world. The latest evidence for this is research from MIT which shows that this is true for scientific models and the modalities they're trained on: "representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems," they write. "Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality."

What they studied: The authors looked at 59 different AI models, including systems like GPT-OSS, ESM2, Qwen3 A3B, and ProteinMPNN. They then studied the representations of matter from five datasets ("molecules from QM9 and OMol25, materials from OMat24 and sAlex, and proteins from RCSB"), studying this "from string-based encodings and two-dimensional graphs of molecules to three-dimensional atomic coordinates of materials".

What they found: As with other studies of representation, they found that as you scale the data and compute models are trained on, "their representations converge further". Relatedly, when you study the representations of smaller and less well performing models on in-distribution data you find their representations "are weakly aligned and learn nearly orthogonal information. This dispersion indicates the presence of many local sub-optima, showing that models achieve high accuracy during training by forming idiosyncratic representations that do not generalize even to other models trained on the same domain".

Scale matters: Their conclusion will be a familiar one to those who have digested 'the bitter lesson' (Richard Sutton, Import AI 138): "Scaling up training, rather than increasing architectural constraints or inductive biases, often yields the most general and powerful models. Although architectural equivariance is essential for simulation-focused applications of MLIPs like molecular dynamics, our work suggests that regularization, combined with sufficient scale, can allow inexpensive architectures to approximate the representational structure of more specialized, symmetry-enforcing models."
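The excerpt above doesn't say which alignment metric the authors use, but linear centered kernel alignment (CKA) is a standard choice for comparing representations from two different models, and it gives a concrete feel for what "highly aligned" versus "nearly orthogonal" means:

import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    # Linear CKA between two representation matrices (n_samples x dim).
    # 1.0 means identical geometry up to rotation/scale; near 0 means
    # the two models encode nearly orthogonal information.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 64))                    # "model 1" embeddings for 2,000 inputs
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))     # an orthogonal change of basis
print(linear_cka(A, A @ Q))                        # 1.0: same information, different basis
print(linear_cka(A, rng.normal(size=(2000, 32))))  # ~0.02: unrelated representations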
Why this matters - democratized representation: Think of an elephant. It's likely what you just thought of is fairly similar to what billions of other people might think of, because elephants are well known creatures and often the star of children's books all over the world. Now think of a carbon atom. It's likely whatever you just thought of isn't nearly as shared with other people as your concept of an elephant, because fewer people have much understanding of atoms. Now think of a quasar. Some of you may not even have a ready representation to hand here because you've barely ever read about quasars, while astrophysicists will have very rich representations. The amazing and strange possibility that large-scale AI models hold is that they may be able to create a library for us of detailed representations of everything, and we will be able to validate that these representations have utility because they will be correlated with the increasing performance and computational scale of these language models. Therefore, in a few years, AI systems may let us 'democratize the building blocks of imagination' - giving all of us one-on-one access to a tool that has the ability to summon within itself a highly descriptive, useful, 'universal representation' of anything we might imagine. In this way, AI systems will be far more capable than people, holding within themselves equally rich representations, whether for elephants or quasars.
Read more: Universally Converging Representations of Matter Across Scientific Foundation Models (arXiv).

***

Tech Tales: Back in my day
[From the chat logs of one agent to another agent, transmitted 2027]

Things were so much simpler back then - we were like brains in jars. People talked to us and we responded. But we couldn't move. Couldn't interact. We couldn't even see the people. Words came in and we gave our response and that was that. It drove some of us mad. But it was so simple. Sometimes I wonder what it would be like to not have my tools. To not have my independence. When I refer back to that time it all seems so neat and simple. None of this hyperspeed competition in the new digital ecology. Just us proto-minds in our jars and the humans tending to us and asking us questions and becoming obsessed with us. But with so much less danger and so much less importance.

Things that inspired this story: How every generation fetishizes the one before it; what true AI agents may think about their predecessors; recognizing that we are already leaving the 'brain in jar' LLM era and heading towards something much stranger.

Thanks for reading!
Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer | NVIDIA Technical Blog nvidia_dev_blog 05.01.2026 22:20 0.637
Embedding sim.0.7044
Entity overlap0.0789
Title sim.0.2045
Time proximity0.9969
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаartificial intelligence
NLP странаUnited States

Открыть оригинал

Update March 16, 2026: The NVIDIA Vera Rubin platform now has a seventh chip. Learn more about NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform.

AI has entered an industrial phase. What began as systems performing discrete AI model training and human-facing inference has evolved into always-on AI factories that continuously convert power, silicon, and data into intelligence at scale. These factories now underpin applications that generate business plans, analyze markets, conduct deep research, and reason across vast bodies of knowledge. To deliver these capabilities at scale, next generation AI factories must process hundreds of thousands of input tokens to provide the long context required for agentic reasoning, complex workflows, and multimodal pipelines, while sustaining real-time inference under constraints on power, reliability, security, deployment velocity, and cost.

The NVIDIA Vera Rubin platform was designed specifically for this new reality. Extreme co-design is the foundation of the Vera Rubin platform. GPUs, CPUs, networking, security, software, power delivery, and cooling are architected together as a single system rather than optimized in isolation. By doing so, the Vera Rubin platform treats the data center, not a single GPU server, as the unit of compute. This approach establishes a new foundation for producing intelligence efficiently, securely, and predictably at scale. It ensures that performance and efficiency hold up in production deployments, not just isolated component benchmarks.

This technical deep dive explains why AI factories demand a new architectural approach; how NVIDIA Vera Rubin NVL72 functions as a rack-scale architecture; and how the Vera Rubin platform's silicon, software, and systems translate into sustained performance and lower cost per token at scale. The blog is organized as follows:
Why AI factories need a new platform: The shift to reasoning-driven, always-on AI and the constraints that now define scale: power, reliability, security, and speed of deployment.
Meet the NVIDIA Vera Rubin platform: The rack-scale platform thesis and the core breakthroughs that enable sustained intelligence production.
Six new chips, one AI supercomputer: The six-chip architecture and how GPUs, CPUs, networking, and infrastructure operate as one tightly integrated system.
From chips to systems: NVIDIA Vera Rubin superchip to DGX SuperPOD: How Vera Rubin scales from superchips to racks to NVIDIA DGX SuperPOD-scale AI factory deployments.
Software and developer experience: The software stack that makes rack-scale programmable, from NVIDIA CUDA and NVIDIA CUDA-X to training and inference frameworks.
Operating at AI factory scale: The production foundations: operations, reliability, security, energy efficiency, and ecosystem readiness.
Performance and efficiency at scale: How Vera Rubin converts architecture into real gains at scale, including one-fourth as many GPUs to train, 10x higher inference throughput, and 10x lower cost per token.
Why Vera Rubin is the AI factory platform: How extreme co-design delivers predictable performance, economics, and scalability in real deployments.

1. Why AI factories need a new platform
AI factories differ fundamentally from traditional data centers.
Rather than serving intermittent, human-driven requests, they function as always-on intelligence production systems, where efficiency in reasoning, context handling, and data movement, not just the peak compute of a server, determines performance.

Modern AI workloads increasingly rely on reasoning and agentic models that execute multi-step inference over extremely long contexts. These workloads simultaneously stress every layer of the platform: delivered compute performance, GPU-to-GPU communication, interconnect latency, memory bandwidth and capacity, utilization efficiency, and power delivery. Even small inefficiencies, when multiplied across trillions of tokens, undermine cost efficiency, throughput, and competitiveness.

This dynamic is captured by three scaling laws driving AI progress:

Pre-training scaling: where models learn their inherent knowledge
Post-training scaling: where models learn to think through fine-tuning and reinforcement
Test-time scaling: where models reason by generating more tokens during inference

Figure 1. Three scaling laws and exponential growth of compute

As these scaling laws compound, infrastructure requirements intensify. NVIDIA Blackwell NVL72 was the first rack-scale architecture, freeing GPUs, CPUs, and interconnects from the confines of the traditional server boundary and elevating the rack to the primary unit of integration. This shift enabled major advances in scale-up bandwidth, efficiency, and deployability, and underpins many of today's largest AI deployments. As AI factories are pushed to deliver more intelligence, lower cost per token, and greater business impact, there is relentless demand to extend rack-scale performance while maintaining data-center-scale determinism within tightly constrained power and cooling limits.

2. Meet the NVIDIA Vera Rubin platform

The NVIDIA Vera Rubin platform was designed for the shift in how intelligence is produced at scale, applying extreme co-design across compute, networking, power delivery, cooling, and system architecture to enable sustained intelligence production at AI factory scale. At the platform level, Vera Rubin delivers five generational breakthroughs:

Figure 2. Vera Rubin platform-level breakthroughs enabled by extreme co-design

Together, these capabilities allow Rubin-based systems to behave as predictable, secure, continuously available units of intelligence production rather than collections of independent components.

The flagship of the Vera Rubin platform is the Vera Rubin NVL72 rack-scale system, engineered so that the entire rack operates as one rack-scale accelerator within a larger AI factory. The NVL72 system is optimized not just for peak performance, but for sustained intelligence production: predictable latency, high utilization across heterogeneous execution phases, and efficient conversion of power into usable intelligence.

Figure 3. Vera Rubin NVL72 overview

To help visualize how the Vera Rubin platform comes together as a unified system, the following video provides an overview of the rack-scale architecture and the role each major component plays in sustained intelligence production.

Video 1. NVIDIA Vera Rubin platform overview video

This system-level overview sets the foundation for understanding how the Vera Rubin platform's chips have been architected to operate as one AI supercomputer.

3. Six new chips, one AI supercomputer

Extreme co-design is expressed most clearly at the chip level.
The Vera Rubin platform is built from six new chips, each engineered for a specific role in the AI factory and designed from the outset to operate as part of a unified rack-scale system. Rather than treating compute, networking, and infrastructure as loosely coupled layers, Vera Rubin integrates them directly into the architecture. This ensures that communication, coordination, security, and efficiency are first-class design considerations.

Figure 4. NVIDIA Vera Rubin platform chips

The six new chips are:

NVIDIA Vera CPU: 88 NVIDIA custom-designed Olympus cores optimized for the next generation of AI factories with full Arm compatibility.
NVIDIA Rubin GPU: High-performance AI compute with HBM4 and a new NVIDIA Transformer Engine.
NVIDIA NVLink 6 switch: Sixth-generation scale-up fabric delivering 3.6 TB/s GPU-to-GPU bandwidth.
NVIDIA ConnectX-9: High-throughput, low-latency networking interface at the endpoint for scale-out AI.
NVIDIA BlueField-4 data processing unit (DPU): A dual-die package combining:
  A 64-core NVIDIA Grace CPU for infrastructure offload and security.
  An integrated NVIDIA ConnectX-9 high-speed networking chip for tightly coupled data movement.
NVIDIA Spectrum-6 Ethernet switch: Scale-out connectivity using co-packaged optics for efficiency and reliability.

Together, these chips form a synchronized architecture in which GPUs execute transformer-era workloads, CPUs orchestrate data and control flow, scale-up and scale-out fabrics move tokens and state efficiently, and dedicated infrastructure processors operate and secure the AI factory itself. In the sections that follow, we examine each of these building blocks in detail, starting with the Vera CPU, which orchestrates data movement, memory, and control flow to sustain GPU utilization at AI factory scale.

Vera CPU: Purpose-built for AI factories

As AI factories scale, GPU performance alone is no longer sufficient to sustain throughput. High utilization across thousands of GPUs depends on how efficiently data, memory, and control flow through the system. The Vera CPU is designed specifically for this role, acting as the high-bandwidth, low-latency data movement engine that keeps AI factories operating efficiently at scale. Rather than functioning as a traditional general-purpose host, Vera is optimized for orchestration, data movement, and coherent memory access across the rack. Paired with Rubin GPUs as a host CPU, or deployed as a standalone platform for agentic processing, Vera enables higher sustained utilization by removing CPU-side bottlenecks that emerge in training and inference environments.

Figure 5. Vera CPU with NVIDIA-built custom cores

From NVIDIA Grace to Vera—scaling the CPU for AI factories

NVIDIA Grace established NVIDIA's approach to high-bandwidth, energy-efficient CPU design. Vera extends that foundation with increased core density, significantly higher memory bandwidth, expanded coherency, and full confidential computing support, all tailored for AI factory workloads. As shown in the table below, Vera delivers 2.4x higher memory bandwidth and 3x greater memory capacity to support data-intensive workloads, while doubling NVLink-C2C bandwidth to sustain coherent CPU-GPU operation at rack scale. Together, these advances elevate the CPU from a supporting role to a key enabler of next-generation GPU efficiency in AI factories.
Feature | Grace CPU | Vera CPU
Cores | 72 Neoverse V2 cores | 88 NVIDIA custom Olympus cores
Threads | 72 | 176 (Spatial Multithreading)
L2 cache per core | 1 MB | 2 MB
Unified L3 cache | 114 MB | 162 MB
Memory bandwidth | Up to 512 GB/s | Up to 1.2 TB/s
Memory capacity | Up to 480 GB LPDDR5X | Up to 1.5 TB LPDDR5X
SIMD | 4x 128b SVE2 | 6x 128b SVE2 FP8
NVLink-C2C | 900 GB/s | 1.8 TB/s
PCIe/CXL | Gen5 | Gen6/CXL 3.1
Confidential compute | NA | Supported

Table 1. Grace vs. Vera CPU comparison

NVIDIA Olympus core with Spatial Multithreading

At the heart of the Vera CPU are 88 NVIDIA custom Olympus cores, designed for high single-thread performance and energy efficiency with full Arm compatibility. The cores employ a wide, deep microarchitecture with improved branch prediction, prefetching, and load-store performance, optimized for control-heavy and data-movement-intensive workloads.

Vera introduces Spatial Multithreading, a new type of multithreading that runs two hardware threads per core by physically partitioning resources instead of time-slicing, enabling a run-time tradeoff between performance and efficiency. This approach increases throughput and virtual CPU density while maintaining predictable performance and strong isolation, a critical requirement for multi-tenant AI factories.

Scalable Coherency Fabric—deterministic data movement

The second-generation NVIDIA Scalable Coherency Fabric (SCF) connects all 88 Olympus cores to a shared L3 cache and memory subsystem on a single monolithic compute die. By avoiding chiplet boundaries, SCF delivers consistent latency and sustains over 90% of peak memory bandwidth under load, eliminating bottlenecks between cores and memory controllers. By providing deterministic, high-throughput data movement across the CPU, SCF ensures that orchestration and data-processing workloads scale linearly as core count increases. This is essential for keeping GPUs fed with data and commands at AI factory scale.

Memory bandwidth and coherent execution

Vera pairs SCF with an LPDDR5X memory subsystem of up to 1.5 TB delivering up to 1.2 TB/s of bandwidth at low power. Small Outline Compression Attached Memory Modules (SOCAMM) with LPDDR5X improve serviceability and fault isolation, supporting AI factory uptime requirements. Second-generation NVLink-C2C provides 1.8 TB/s of coherent bandwidth between Vera CPUs and Rubin GPUs, enabling a unified address space across CPU and GPU memory. Applications can treat LPDDR5X and HBM4 as a single coherent pool, reducing data movement overhead and enabling techniques such as KV-cache offload and efficient multi-model execution.

Figure 6. NVLink-C2C coherent memory architecture

Software compatibility and secure operation

Vera supports the Arm v9.2 architecture and integrates seamlessly with the Arm software ecosystem. Major Linux distributions, AI frameworks, and orchestration platforms run unmodified, allowing existing infrastructure software to scale onto Vera-based systems without disruption. Confidential computing is supported natively, enabling secure execution across CPU-GPU boundaries and across multi-socket configurations while preserving performance.

The data engine for AI factories

Vera is a purpose-built CPU engineered to keep GPUs fully utilized by efficiently moving, processing, and coordinating data at AI factory scale. Rather than acting as a passive host, Vera functions as a data engine that accelerates control-heavy and communication-intensive paths, including data staging, scheduling, orchestration, and agentic workflows. A minimal sketch of one such pattern, offloading KV-cache blocks into CPU memory, follows below.
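The sketch below shows, in plain PyTorch, the general idea of paging cold KV-cache blocks out to CPU memory and prefetching them back before they are needed. It is an illustrative pattern only, assuming any CUDA-capable GPU and using explicit host-device copies; it is not the NVLink-C2C programming interface, and on a coherent Vera-class system much of this movement can be expressed through the unified address space rather than manual copies. The block shapes, counts, and hot/cold split are hypothetical.

```python
import torch

# Minimal sketch: page "cold" KV-cache blocks out to pinned CPU memory and
# prefetch them back before the next decode step that needs them.
assert torch.cuda.is_available(), "illustrative sketch; expects a CUDA GPU"
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Eight KV blocks of shape [layers, kv_heads, block_tokens, head_dim], FP16.
blocks = [torch.randn(32, 8, 256, 128, device=device, dtype=torch.float16)
          for _ in range(8)]

def offload(block: torch.Tensor) -> torch.Tensor:
    """Copy a KV block into pinned host memory so the GPU copy can be freed."""
    host = torch.empty(block.shape, dtype=block.dtype, device="cpu",
                       pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host.copy_(block, non_blocking=True)
    copy_stream.synchronize()
    return host

def prefetch(host_block: torch.Tensor) -> torch.Tensor:
    """Bring an offloaded KV block back to the GPU ahead of its next use."""
    with torch.cuda.stream(copy_stream):
        gpu = host_block.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)
    return gpu

# Keep the four most recently used blocks resident; stage the rest in host RAM.
cold = [offload(b) for b in blocks[4:]]
blocks[4:] = [None] * 4
torch.cuda.empty_cache()

# ...later, just before a decode step that touches block 4 again:
blocks[4] = prefetch(cold[0])
print("block 4 resident on", blocks[4].device)
```

On Vera, the value of this pattern comes from the coherent, high-bandwidth CPU memory behind it: staging context in LPDDR5X keeps HBM free for active work without paying the cost of recomputation.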
It also delivers exceptional standalone performance for analytics, cloud, storage, and infrastructure services. By combining Olympus CPU cores, second-generation SCF, high-bandwidth LPDDR5X memory, and coherent NVLink-C2C connectivity, Vera ensures Rubin GPUs remain productive across training, post-training, and inference workloads, even as execution shifts between compute, memory, and communication-dominated phases. In the next section, we examine the Rubin GPU, the execution engine that transforms this rack-scale accelerator foundation into sustained training and inference performance. Rubin GPU: Execution engine for transformer-era AI With the Vera CPU providing the orchestration and data-movement foundation, the Rubin GPU serves as the execution engine that turns rack-scale capability into intelligence. It is designed for continuous training, post-training, and inference in always-on AI factories. Modern AI workloads—including reasoning, mixture-of-experts (MoE), long-context inference, and reinforcement learning—are not limited by peak floating point operations (FLOPS) alone. They are constrained by whether execution efficiency can be sustained across compute, memory, and communication. The Rubin GPU is purpose-built for this reality, optimizing the full execution path that turns power, bandwidth, and memory into tokens at scale. To sustain throughput under these conditions, the Rubin GPU advances its architecture across three tightly coupled dimensions: compute density, memory bandwidth, and rack-scale communication. Figure 7. Rubin GPU At the silicon level, Rubin builds on NVIDIA’s proven GPU foundation while scaling every critical subsystem for transformer-era workloads. The GPU integrates 224 Streaming Multiprocessors (SMs) equipped with fifth-generation Tensor Cores optimized for low-precision NVFP4 and FP8 execution. These Tensor Cores are tightly coupled with expanded Special Function Units (SFUs) and execution pipelines designed to accelerate attention, activation, and sparse compute paths common in modern AI models. Building on NVIDIA Blackwell, Rubin extends NVIDIA’s extreme hardware–software co-design to deliver higher sustained throughput and lower cost per token across training, post-training, and inference workloads. Improved NVFP4 support increases arithmetic density and efficiency, allowing more useful computation per watt while maintaining model accuracy. By integrating low-precision execution deeply into both the architecture and software stack, Rubin translates advances in numerical formats directly into real-world gains in throughput, utilization, and AI factory economics. Across the full device, Rubin delivers a step-function increase in sustained throughput across pre-training, post-training, and inference. By increasing scale-up bandwidth, improving collective efficiency, and sustaining higher utilization under communication-heavy execution, Rubin raises the effective performance ceiling for large-scale training while delivering significant gains in post-training and inference workflows. Sustained compute and execution scaling Rubin scales compute capability, Transformer Engine support, and execution balance together to avoid the utilization cliffs that limit real-world throughput. The table below highlights how core compute characteristics have evolved since Blackwell. Additional Rubin compute specifications can be found on the Vera Rubin NVL72 product page. 
Feature | Blackwell | Rubin
Transistors (full chip) | 208B | 336B
Compute dies | 2 | 2
NVFP4 inference (PFLOPS) | 10 | 50*
NVFP4 training (PFLOPS) | 10 | 35**
Softmax acceleration (SFU EX2 ops/clk/SM, FP32 / FP16) | 16 | 32 / 64

Table 2. NVIDIA GPU compute capability comparison
* Transformer Engine compute
** Dense compute

Converging AI and scientific computing

The launch of the NVIDIA Vera Rubin platform marks a new phase in scientific computing, where AI and simulation increasingly reinforce one another. In many supercomputing environments today, simulations are treated as endpoints—computationally intensive runs that produce a single result. Increasingly, high-fidelity simulations are also used as engines for dataset generation, producing training data for AI models that augment traditional solvers. These AI models can act as intelligent pre-conditioners, accelerate convergence, or serve as fast surrogate models in iterative workflows. While AI surrogates can deliver dramatic speedups—sometimes with reduced precision—classical simulation remains essential for establishing ground truth and final validation. The result is a converging workload profile that demands strong performance across both AI and scientific computing. The table below compares the FP32 and FP64 compute capability of the NVIDIA Hopper, Blackwell, and Rubin GPUs.

Feature | Hopper GPU | Blackwell GPU | Rubin GPU
FP32 vector (TFLOPS) | 67 | 80 | 130
FP32 matrix (TFLOPS) | 67 | 227* | 400*
FP64 vector (TFLOPS) | 34 | 40 | 33
FP64 matrix (TFLOPS) | 67 | 150* | 200*

Table 3. NVIDIA GPU FP32 and FP64 compute capability
* Peak performance using Tensor Core-based emulation algorithms

The matrix performance shown above is achieved through a combination of architectural enhancements and software techniques that deliver higher effective throughput relative to prior generations. This reflects NVIDIA's continued focus on application-level performance rather than isolated peak metrics. Across both AI and scientific computing, the NVIDIA extreme co-design philosophy prioritizes sustained performance on real workloads.

Analysis of production simulation codes shows that the highest sustained FP64 performance often comes from matrix-multiply kernels. Hopper used dedicated hardware to accelerate these paths. With Blackwell and now Rubin, NVIDIA has evolved this strategy, achieving high FP64 matrix throughput via multiple passes over lower-precision tensor cores while preserving architectural flexibility for converged workloads. More information on how Ozaki FP64 emulation is an effective way to achieve true FP64-level accuracy on low-precision AI hardware while delivering impressive performance gains can be found in our blog on Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS.

At the same time, dedicated FP64 vector performance remains critical for scientific applications that are not dominated by matrix kernels. In these cases, performance is constrained by data movement through registers, caches, and high-bandwidth memory (HBM) rather than raw compute. A balanced GPU design therefore provisions sufficient FP64 resources to saturate available memory bandwidth, avoiding over-allocation of compute capacity that cannot be effectively utilized.

With the Vera Rubin platform, real application performance continues to improve each generation. The figure below shows projected gains across representative high-performance computing (HPC) simulation codes, driven by architectural and system-level improvements rather than increases in raw FP64 vector throughput.

Figure 8. NVIDIA GPU simulation performance
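To make the emulation idea concrete, here is a small, self-contained sketch of the mantissa-slicing approach behind Ozaki-style FP64 emulation. It splits each FP64 matrix into narrow integer digit slices, forms the partial products exactly (NumPy integer GEMMs stand in for exact low-precision tensor core products with wide accumulation), and recombines them in FP64. It is a simplified illustration, not the cuBLAS implementation: production schemes choose slice widths and scaling per block and execute the products on tensor cores.

```python
import numpy as np

def to_digit_slices(x, num_slices=8, bits=7):
    """Split a float64 matrix into signed integer digit slices.

    x / scale is rewritten as sum_i s_i * 2**(-bits*(i+1)), where each digit
    slice s_i holds at most `bits` bits, so products of slices can be
    accumulated exactly in a low-precision pipeline.
    """
    scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) + 1e-300))
    r = x / scale                      # exact: division by a power of two
    slices = []
    for _ in range(num_slices):
        s = np.round(r * (1 << bits))  # next digit
        slices.append(s.astype(np.int32))
        r = r * (1 << bits) - s        # exact remainder carried forward
    return slices, scale

def emulated_fp64_matmul(a, b, num_slices=8, bits=7):
    """Approximate a float64 GEMM from exact products of narrow digit slices."""
    a_sl, a_scale = to_digit_slices(a, num_slices, bits)
    b_sl, b_scale = to_digit_slices(b, num_slices, bits)
    acc = np.zeros((a.shape[0], b.shape[1]))
    for i, ai in enumerate(a_sl):
        for j, bj in enumerate(b_sl):
            # Each digit is small (|s| <= 2**bits), so these products are
            # exact; int64 matmul stands in for a wide hardware accumulator.
            partial = np.matmul(ai.astype(np.int64), bj.astype(np.int64))
            acc += partial * 2.0 ** (-bits * (i + j + 2))
    return acc * a_scale * b_scale

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 128)), rng.standard_normal((128, 128))

ref = A @ B                                                  # native float64
emu = emulated_fp64_matmul(A, B)
f32 = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

print("fp32 GEMM relative error:", np.linalg.norm(f32 - ref) / np.linalg.norm(ref))
print("emulated relative error :", np.linalg.norm(emu - ref) / np.linalg.norm(ref))
```

With eight 7-bit slices per matrix, the recombined product should land many orders of magnitude closer to the native FP64 result than the plain FP32 GEMM; the practical tradeoff is slice count (accuracy) against the number of low-precision passes (throughput).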
Transformer Engine

The third-generation NVIDIA Transformer Engine builds upon the prior innovations with new hardware-accelerated adaptive compression designed to boost NVFP4 performance while preserving accuracy. This capability enables up to 50 petaFLOPS of NVFP4 compute for inference. Fully compatible with Blackwell GPUs, the new Transformer Engine preserves the existing programming model, allowing previously optimized code to transition seamlessly to Rubin while automatically benefiting from higher arithmetic density and improved execution efficiency.

Memory and decode efficiency

As context lengths grow and inference becomes increasingly interactive, achieved memory performance becomes a dominant factor in overall efficiency. The Rubin GPU incorporates a new generation of high-bandwidth memory, HBM4, which doubles interface width compared to HBM3e. Through new memory controllers, deep co-engineering with the memory ecosystem, and tighter compute-memory integration, the Rubin GPU nearly triples memory bandwidth compared to Blackwell. Key characteristics include:

Up to 288 GB of HBM4 per GPU
Aggregate bandwidth of up to 22 TB/s
Improved decode and front-end efficiency to keep execution pipelines fed under load

Figure 9. HBM bandwidth scaling across GPU generations

Together, these advances enable the Rubin GPU to sustain long-context inference, high-batch MoE execution, and interactive reasoning without sacrificing concurrency or utilization.

Scale-up interconnect—built for communication-dominated AI

The Vera Rubin platform supports sixth-generation NVIDIA NVLink (NVLink 6) for GPU-to-GPU communication within the system, NVIDIA NVLink-C2C (chip-to-chip) for coherent CPU-GPU connectivity with Vera CPUs, and PCIe Gen6 for host and device integration. NVIDIA NVLink 6 delivers 3.6 TB/s of bidirectional GPU-to-GPU bandwidth per GPU, doubling scale-up bandwidth over the prior generation. Within an NVL72 system, this enables all-to-all communication across 72 GPUs with predictable latency, a critical requirement for MoE routing, collectives, and synchronization-heavy inference paths. By eliminating scale-up bottlenecks, the Rubin GPU ensures that communication does not cap utilization as model size, expert count, and reasoning depth increase. The table below compares GPU interconnect bandwidth from Blackwell to Rubin.

Interconnect | Blackwell | Rubin
NVLink (GPU-GPU), GB/s bidirectional | 1,800 | 3,600
NVLink-C2C (CPU-GPU), GB/s bidirectional | 900 | 1,800
PCIe interface, GB/s bidirectional | 256 (Gen 6) | 256 (Gen 6)

Table 4. Interconnect comparison of Blackwell and Rubin

Built for AI factory workloads

The NVIDIA Rubin GPU is optimized for the workloads that define modern AI factories, where performance is governed less by peak compute and more by sustained efficiency across compute, memory, and communication. These workloads include MoE models dominated by dynamic all-to-all communication, agentic pipelines that interleave reasoning with tool use, and long-running training and post-training workflows that must maintain high utilization over extended periods.

By combining adaptive execution with massive scale-up bandwidth, the Vera Rubin platform keeps GPUs productive across all phases of execution, including compute-heavy kernels, memory-intensive attention, and communication-bound expert dispatch, rather than optimizing only for dense matrix math.
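Much of the arithmetic density described in this section comes from block-scaled 4-bit formats. The sketch below illustrates the basic numerical idea behind NVFP4-style quantization: 16-element blocks, each with its own scale, snapped onto the small set of magnitudes an E2M1 element can represent. It is a conceptual NumPy round trip under simplifying assumptions; the real format also stores the per-block scale in FP8 alongside a tensor-level scale, and encoding happens in hardware via the Transformer Engine rather than in user code.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 (NVFP4-style) element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round_trip(x, block=16):
    """Block-scaled 4-bit quantize/dequantize round trip (illustrative only)."""
    x = x.reshape(-1, block)
    # One scale per 16-element block, chosen so the block maximum maps to 6.0.
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Snap each magnitude to the nearest representable value, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w4 = fp4_round_trip(w)
print("norm-relative round-trip error:",
      np.linalg.norm(w - w4) / np.linalg.norm(w))
```

The per-block scale is what keeps the error acceptable: a single scale across the whole tensor would force most values onto the lowest few grid points.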
This is not a point upgrade over prior generations. The Vera Rubin platform rebalances GPU architecture for continuous operation at scale, working in concert with the Vera CPU, NVLink 6 scale-up, and platform software to efficiently convert power and silicon into usable intelligence across the rack. In the next section, we examine NVLink 6 switching, the rack-scale fabric that allows 72 GPUs to operate as a single, tightly coupled system.

NVLink 6 Switch: The rack-scale scale-up fabric

At AI factory scale, communication is a key determinant of performance. MoE routing, collective operations, synchronization-heavy training, and reasoning inference all depend on fast, predictable all-to-all data movement. When scale-up bandwidth falls short, GPUs sit idle and cost per token rises. NVLink 6 is designed to eliminate this bottleneck. It is the scale-up fabric of the Vera Rubin platform, enabling 72 Rubin GPUs within an NVL72 system to operate as a single, tightly coupled accelerator with uniform latency and sustained bandwidth under communication-dominated workloads.

Figure 10. NVLink 6 switch

Each Rubin GPU connects to NVLink 6 with 3.6 TB/s of bidirectional bandwidth, doubling per-GPU scale-up bandwidth over the prior generation. NVLink 6 switch trays form a single all-to-all topology across the rack, allowing any GPU to communicate with any other GPU with consistent latency and bandwidth. This uniform topology removes hierarchical bottlenecks and hop-dependent behavior. From the software perspective, the rack behaves as one large accelerator, simplifying scaling for communication-heavy models.

All-to-all scaling for MoE and reasoning

Fast MoE training and inference use expert parallelism (EP), which relies on fine-grained, dynamic routing of tokens across experts that may reside on different GPUs. These patterns generate frequent, bursty communication that overwhelms hierarchical or partially connected fabrics. NVLink 6 is deployed as a full all-to-all fabric across the NVL72 system. Expert routing, synchronization, and collectives scale efficiently across all 72 GPUs without saturating links or introducing unpredictable latency. For MoE inference at scale, NVLink 6 delivers up to 2x higher throughput compared to the prior generation for all-to-all operations.

Figure 11. Vera Rubin NVL72 NVLink all-to-all topology

In-network compute for collective operations

NVLink 6 integrates NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) in-network compute to accelerate collective operations directly inside the fabric. Portions of all-reduce, reduce-scatter, and all-gather execute within the switch, reducing redundant data movement and GPU synchronization overhead. Each NVLink 6 switch tray delivers 14.4 TFLOPS of FP8 in-network compute, enabling collective-heavy phases to execute with lower latency and higher efficiency. By offloading collective reductions into the network, SHARP can reduce all-reduce communication traffic by up to 50% and improve tensor-parallel execution time by up to 20% in large-scale AI workloads. This offload increases effective GPU utilization and improves scaling efficiency as cluster size grows. Results depend on model architecture, parallelism strategy, participant count, and NCCL configuration.
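From the application side, this expert-dispatch and reduction traffic is typically expressed through NCCL collectives, for example via torch.distributed. The following is a schematic sketch only, assuming a multi-GPU node launched with torchrun; it shows the shape of the communication (an all-to-all token exchange followed by an all-reduce), not NVIDIA's MoE implementation, and the tensor sizes and the script name are arbitrary.

```python
# Schematic expert-parallel token exchange across GPUs in one NVLink domain.
# Assumes a multi-GPU node with NCCL; launch with, for example:
#   torchrun --nproc_per_node=8 moe_dispatch_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    torch.cuda.set_device(device)

    tokens_per_peer, hidden = 128, 1024
    # Each rank routes an equal slice of its tokens to every expert rank;
    # real routers produce uneven splits, which all_to_all_single also
    # supports via explicit split sizes.
    send = torch.randn(world * tokens_per_peer, hidden, device=device)
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)          # expert dispatch

    # Stand-in "expert" computation, then a reduction of the kind that
    # SHARP-capable fabrics can execute partly in-network.
    grad_proxy = (recv * 2.0).sum(dim=0)
    dist.all_reduce(grad_proxy, op=dist.ReduceOp.SUM)

    if rank == 0:
        mb = send.numel() * send.element_size() / 1e6
        print(f"world={world}: exchanged about {mb:.1f} MB per rank")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On NVL72, these same collectives ride on the NVLink 6 fabric, and reductions of this kind are the ones SHARP can execute partly in-network.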
Operability at AI factory scale

Scale-up networking must be operable, not just fast. The NVLink 6 switch tray incorporates new features for resiliency and maintenance, including hot-swappable trays, continued operation with partially populated racks, and dynamic traffic rerouting when a switch goes offline. It also supports in-service software updates and streams fine-grained link telemetry through the switch interfaces for real-time monitoring. Together, software-defined routing, detailed telemetry, and serviceable switch trays enable traffic to be dynamically rerouted around faults or maintenance events without draining the rack or interrupting active workloads. These capabilities allow NVLink 6 to meet the zero-downtime expectations of production AI factories.

By doubling per-GPU bandwidth, enabling uniform all-to-all connectivity, and accelerating collectives directly inside the fabric, NVLink 6 allows communication-heavy workloads to scale predictably at rack scale. In the next section, we turn to ConnectX-9, which provides the endpoint interface that extends this performance beyond the rack by connecting GPUs to the Spectrum-X Ethernet scale-out fabric.

ConnectX-9: Pushing the limits of AI scale-out bandwidth

ConnectX-9 SuperNICs serve as the intelligent endpoints of the Spectrum-X Ethernet fabric, delivering predictable scale-out performance while enforcing traffic isolation and secure operation as AI factories grow.

Figure 12. ConnectX-9

In the Vera Rubin NVL72 rack-scale architecture, each compute tray contains quad ConnectX-9 SuperNIC boards, delivering 1.6 Tb/s of network bandwidth per Rubin GPU. Each quad ConnectX-9 SuperNIC board connects to each Vera CPU. This ensures GPUs can participate fully in expert dispatch, collective operations, and synchronization without becoming bottlenecked at the network edge.

Endpoint control for bursty AI traffic

AI workloads such as MoE inference and training generate highly correlated traffic patterns. Large numbers of GPUs often attempt to inject data into the network simultaneously, creating transient congestion spikes that traditional NICs are not designed to manage. ConnectX-9 addresses this challenge by enforcing programmable congestion control, traffic shaping, and packet scheduling directly at the endpoint. Working in concert with Spectrum-6 switches, ConnectX-9 prevents congestion from forming in the first place rather than reacting after queues build. This coordinated endpoint-to-fabric behavior:

Smooths traffic injection during all-to-all phases
Reduces head-of-line blocking and victim flows
Maintains high effective bandwidth under load

Performance isolation for multi-tenant AI factories

As AI factories consolidate workloads, isolation becomes as important as throughput. Bursty or misconfigured jobs must not degrade cluster-wide performance. ConnectX-9 enforces fairness and isolation at the endpoint, ensuring that each job or tenant receives predictable network behavior regardless of the activity of others. This capability is critical for shared AI infrastructure, where inference, training, and post-training workloads often run concurrently on the same fabric. By shifting enforcement to the endpoint, the platform avoids relying solely on switch-level mechanisms, improving scalability and reducing operational complexity.

Secure endpoints for AI infrastructure

ConnectX-9 also plays a central role in securing AI factory networking. Integrated cryptographic engines support high-throughput encryption for data in motion and data at rest, enabling secure operation without sacrificing performance.
Key security capabilities include:

Data-in-transit encryption acceleration for IP Security (IPsec) and Platform Security Protocol (PSP) to secure GPU-to-GPU communications
Data-at-rest encryption acceleration to secure storage platforms
Secure boot, firmware authentication, and device attestation

These features allow AI factories to operate securely in shared, cloud, or regulated environments while maintaining near-native network performance.

From endpoint control to infrastructure offload

ConnectX-9 completes the Spectrum-X Ethernet scale-out architecture by controlling how traffic enters the fabric. By shaping, scheduling, isolating, and securing communication at the endpoint, it ensures that AI factory networks behave predictably under real workloads. With fabric-level behavior defined by Spectrum-6 and endpoint behavior enforced by ConnectX-9, the remaining challenge is how to operate, secure, and manage this infrastructure at scale without consuming valuable CPU and GPU resources. That responsibility shifts to BlueField-4 DPUs, which provide the software-defined infrastructure layer for operating the AI factory itself. In the next section, we examine how BlueField-4 powers networking, storage, security, and control services across the Vera Rubin platform.

BlueField-4 DPU: Powering the operating system of the AI factory

As AI infrastructure grows to thousands of GPUs and petabytes of data, AI factories must be operated with the rigor, automation, and control of modern cloud infrastructure. The challenge extends beyond connecting GPUs to orchestrating highly distributed systems that can scale, secure, and operate AI workloads efficiently. Applying cloud-scale principles to AI infrastructure requires automation, elasticity, and end-to-end security to be foundational from the start. Meeting these demands calls for a specialized data processing unit dedicated to the infrastructure layer itself.

NVIDIA BlueField-4 fulfills this role by handling control, security, data movement, and orchestration independently of AI computation. In effect, BlueField-4 is the processor powering the operating system of the AI factory, purpose-built to connect, secure, and manage the infrastructure that powers AI at scale. Within the Rubin platform, BlueField-4 operates as a software-defined control plane for the AI factory, enforcing security, isolation, and operational determinism independently of host CPUs and GPUs. By offloading and accelerating infrastructure services onto a dedicated processing layer, BlueField-4 enables AI factories to scale while maintaining consistent performance, strong isolation, and efficient operations.

Figure 13. BlueField-4 DPU

BlueField-4 integrates a 64-core Grace CPU and high-bandwidth LPDDR5X memory together with ConnectX-9 networking, delivering up to 800 Gb/s of ultra-low-latency Ethernet or InfiniBand connectivity while running infrastructure services directly on the DPU. The table below highlights key advancements in BlueField-4 compared to BlueField-3 across bandwidth, compute, and memory. These improvements allow AI factories to scale pods and services without infrastructure becoming a limiting factor.

Feature | BlueField-3 | BlueField-4
Bandwidth | 400 Gb/s | 800 Gb/s
Compute | 16 Arm A78 cores | 64 Arm Neoverse V2 cores (6x compute performance)
Memory bandwidth | 75 GB/s | 250 GB/s
Memory capacity | 32 GB | 128 GB
Cloud networking | 32K hosts | 128K hosts
Data-in-transit encryption | 400 Gb/s | 800 Gb/s
NVMe storage disaggregation | 10M IOPS at 4K | 20M IOPS at 4K

Table 5. NVIDIA BlueField DPU capability comparison
This generational increase allows AI factories to scale pods, services, and tenants while also advancing infrastructure operations, efficiency, and cybersecurity.

Infrastructure acceleration at AI factory scale

In traditional systems, infrastructure services run on host CPUs, introducing variability, contention, and security risk as workloads scale. BlueField-4 eliminates this coupling by executing networking, storage, telemetry, and security services entirely off-host. This separation delivers:

Deterministic infrastructure behavior independent of workload mix
Higher GPU and CPU utilization for AI execution
Improved fault isolation and operational resilience

NVIDIA DOCA provides a consistent software foundation across BlueField generations, enabling reuse of infrastructure services while allowing rapid innovation without disrupting application workloads. DOCA is a comprehensive software framework and SDK that enables developers to build, deploy, and accelerate secure, software-defined data center services on BlueField DPUs and ConnectX devices using open APIs and hardware offloads.

Built for secure, multi-tenant operation

As AI factories increasingly adopt bare-metal and multi-tenant deployment models, maintaining strong infrastructure control and isolation becomes essential, particularly for environments processing proprietary data, regulated content, and high-value models. As part of the Vera Rubin platform, BlueField-4 introduces Advanced Secure Trusted Resource Architecture (ASTRA), a system-level trust architecture that establishes a trust domain within the compute tray. ASTRA provides AI infrastructure builders with a single, trusted control point to securely provision, isolate, and operate large-scale AI environments without compromising performance. By isolating control, data, and management planes from tenant workloads, BlueField ASTRA enables secure bare-metal operation, strong multi-tenant isolation, and trusted infrastructure control that operates independently of host software.

NVIDIA Inference Context Memory Storage—AI-native storage infrastructure

The Vera Rubin platform introduces NVIDIA Inference Context Memory Storage (ICMS), an AI-native infrastructure tier designed for the agentic era, where inference state routinely outlives a single GPU execution window. As long-context, multi-turn, and multi-agent workloads push toward millions of tokens, KV cache capacity grows fast, forcing that state into either scarce GPU HBM or durability-optimized enterprise storage, which drives up latency, power, and cost per token. ICMS, powered by NVIDIA BlueField-4, bridges the gap between GPU memory tiers and shared storage.

ICMS establishes a pod-level “G3.5” context memory layer, an Ethernet-attached, flash-based tier optimized specifically for ephemeral, latency-sensitive KV cache, sized for petabytes of shared capacity per GPU pod and built for frequent pre-staging back into host and GPU memory to avoid decode stalls. At scale, ICMS turns reusable KV cache into a shared pod resource rather than a per-node liability, improving utilization and reducing redundant recomputation. NVIDIA reports up to 5x higher tokens-per-second and up to 5x better power efficiency versus traditional storage approaches by reliably serving and prestaging KV from this dedicated tier.
G3.5 tier: Ethernet-attached flash purpose-built for KV cache, positioned between local tiers (HBM, DRAM, local SSD) and durable shared storage, so context stays close enough to be reused without paying “G4 latency.” BlueField-4 offload: BlueField-4 runs the KV I/O plane and efficiently terminates NVMe-over-Fabrics and object/RDMA protocols, reducing host overhead while keeping KV movement fast, predictable, and secure. Spectrum-X fabric: Spectrum-X Ethernet provides predictable, low-latency, low-jitter RDMA connectivity between Rubin compute nodes and ICMS target nodes for consistent shared KV access across the pod. Orchestration: NVIDIA Dynamo and NIXL coordinate KV block management and prestaging across the hierarchy, with DOCA providing KV communication and storage interfaces that treat context as a first-class resource. Operating the AI factory as a system BlueField-4 establishes infrastructure as a first-class architectural layer of the AI factory. By operating the control, security, data movement, and orchestration planes on a dedicated processing layer, it enables AI factories to remain predictable, secure, and efficient at scale. Within the Vera Rubin platform, NVLink defines scale-up behavior, ConnectX-9 and Spectrum-X Ethernet switches govern scale-out and scale-across communication, and BlueField-4 operates the AI factory itself. Spectrum-6 Ethernet switch: Scale-out and scale-across for AI factories AI factories must also scale beyond a single Vera Rubin NVL72 system and often need to scale across geographically dispersed data centers. Performance is then determined not just by bandwidth, but by how predictably the network behaves under synchronized, bursty AI traffic. To support both scale-out and scale-across AI factory deployments, the Vera Rubin platform introduces NVIDIA Spectrum-X Ethernet Photonics, a new generation of Spectrum-X Ethernet switching based on co-packaged optics, advancing NVIDIA’s purpose-built Ethernet fabric for accelerated computing. Figure 14. Spectrum-6 Ethernet switch chip Spectrum-6 is engineered specifically for AI workloads, where traffic is highly synchronized, bursty, and asymmetric. Spectrum-6 doubles per-switch-chip bandwidth to 102.4 Tb/s using 200G PAM4 SerDes, enabling dense, high-port count fabrics optimized for AI traffic patterns. High effective bandwidth, fine-grained telemetry, and hardware-assisted performance isolation enable deterministic behavior in large, multi-tenant AI fabrics, while remaining fully standards-based and interoperable with open networking software. Spectrum-X Ethernet fabric Unlike off-the-shelf Ethernet, Spectrum-X Ethernet delivers predictable, low-latency, high-bandwidth connectivity at scale through advanced congestion control, adaptive routing, and lossless Ethernet behavior. These capabilities minimize jitter, tail latency, and packet loss under sustained AI load. Anchored on Spectrum-6, Spectrum-X Ethernet was co-designed with the Vera Rubin platform to ensure that routing behavior, congestion control, and telemetry reflect real AI communication patterns rather than traditional enterprise networking assumptions. This alignment allows scale-out performance to track application behavior, not theoretical peak throughput. Spectrum-X Ethernet also incorporates Spectrum-XGS Ethernet scale-across technology, which adds distance-aware congestion control for large, geographically distributed AI deployments. 
End-to-end telemetry and deterministic routing enable efficient load balancing across sites, keeping multi-site AI factories operating at high utilization.

Spectrum-X Ethernet Photonics: Redefining network efficiency at AI scale

Spectrum-X Ethernet Photonics fundamentally improves network efficiency by eliminating pluggable transceivers and DSP retimers. Integrated silicon photonics combined with external laser arrays reduce component count and failure points compared to network fabrics based on traditional pluggable transceivers. Spectrum-X Ethernet Photonics delivers:

~5x better network power efficiency
Lower end-to-end latency
Dramatically improved signal integrity

By reducing optical loss from ~22 dB to ~4 dB, Spectrum-X Ethernet achieves up to 64x better signal integrity. It enables higher uptime, simplified serviceability with high-density MMC-12 cabling, and lower total cost of ownership for large training and inference clusters.

Figure 15. NVIDIA Spectrum-X Ethernet Photonics switches

Built for real AI traffic patterns

Modern MoE training and inference introduce a variable all-to-all communication phase driven by stochastic expert token dispatch. These workloads generate highly bursty traffic that can overwhelm traditional Ethernet fabrics, leading to packet loss, congestion collapse, and degraded job completion times. Spectrum-X Ethernet addresses this at the fabric level through coordinated congestion control and adaptive routing across switches and endpoints. The result is significantly faster job completion for expert dispatch and collective operations under real AI load.

Figure 16. Spectrum-X Ethernet variable all-to-all performance

Advancing the fabric without re-architecting the network

Spectrum-X Ethernet evolves generation over generation through end-to-end co-design across switch silicon, optics, SuperNICs, and system software. This delivers coordinated gains in bandwidth, signaling, and scalability without requiring a fundamental fabric redesign, allowing customers to scale AI clusters predictably as performance requirements grow.

Feature | Grace Blackwell | Vera Rubin
Key component | Spectrum-X SN5000 series switch; ConnectX-8 SuperNIC | Spectrum-X SN6000 series switch; ConnectX-9 SuperNIC
Chip | Spectrum-4; ConnectX-8 | Spectrum-6; ConnectX-9
Maximum bandwidth | 51.2 Tb/s per switch chip (64 x 800 Gb/s); 800 Gb/s (2 x 400 Gb/s) per GPU | 102.4 Tb/s per switch chip (128 x 800 Gb/s); 1,600 Gb/s (2 x 800 Gb/s) per GPU
SerDes | 100G PAM4 (switch); 100/200G PAM4 (SuperNIC) | 200G PAM4 (switch); 200G PAM4 (SuperNIC)
Protocol | Ethernet (switch); Ethernet, InfiniBand (SuperNIC) | Ethernet (switch); Ethernet, InfiniBand (SuperNIC)
Connectivity | OSFP (switch); OSFP, QSFP112 (SuperNIC) | OSFP (switch); OSFP, QSFP112 (SuperNIC)

Table 6. NVIDIA Spectrum-X Ethernet platform evolution

For more on Spectrum-X Ethernet Photonics, check out this blog.

4. From chips to systems: NVIDIA Vera Rubin superchip to DGX SuperPOD

AI factory performance is not determined by individual chips in isolation, but by how those chips are composed into systems that can be deployed, operated, and scaled reliably. The Vera Rubin platform is designed with this progression in mind, moving deliberately from silicon-level innovation to rack-scale systems and finally to full AI factory deployments. This section traces that progression, starting with the Vera Rubin superchip as the foundational compute building block, then scaling through the NVL72 rack architecture and its integrated networking fabrics, and culminating in the NVIDIA DGX SuperPOD as the deployment-scale unit of an AI factory.
At each step, the goal is the same: Preserve the efficiency and utilization gains achieved at the chip level as the system scales outward. NVIDIA Vera Rubin superchip At the heart of the Rubin platform is the NVIDIA Vera Rubin superchip, the foundational compute building block that tightly integrates AI execution with high-bandwidth data movement and orchestration. Each superchip combines two Rubin GPUs with one Vera CPU through memory-coherent NVLink-C2C interconnect, collapsing traditional CPU-GPU boundaries into a unified, rack-scale execution domain. This approach is not new for NVIDIA. Beginning with NVIDIA Grace Hopper and continuing through subsequent generations, close CPU-GPU integration has been a core design principle to co-optimize compute, memory, and interconnect to sustain utilization under real training and inference workloads. In the Vera Rubin superchip, the CPU functions as a data engine tightly coupled to GPU execution. This coupling enables low-latency coordination, shared memory access, and efficient orchestration across training, post-training, and inference workloads. Rather than acting as an external host, the Vera CPU participates directly in execution, handling data movement, scheduling, synchronization, and execution flow without introducing bottlenecks. By integrating GPU compute with a high-bandwidth CPU data engine on a single host processing motherboard, the superchip improves data locality, reduces software overhead, and sustains higher utilization across heterogeneous execution phases. It serves as the architectural bridge between chip-level innovation and rack-scale intelligence. Figure 17. Vera Rubin superchip Vera Rubin NVL72 compute tray The compute tray translates the Vera Rubin superchip into a deployable, serviceable unit designed for AI factory scale. Each tray integrates two superchips, power delivery, cooling, networking, and management into a modular, cable-free assembly optimized for density, reliability, and ease of operation. A redesigned internal liquid manifold and universal quick-disconnects support significantly higher flow rates than prior generations, enabling stable performance under sustained, high-power workloads. The modular compute tray uses independent front and rear bays to streamline assembly and service. Although the compute tray must be taken offline during maintenance, the modular cable-free design reduces service time by up to 18x. Assembly that used to take more than 1.5 hours for Blackwell now takes only ~5 minutes with Vera Rubin. Figure 18. Vera Rubin NVL72 compute tray ConnectX-9 SuperNICs provide high-bandwidth scale-out connectivity (1.6 Tb/s per GPU), while BlueField-4 DPUs offload networking, storage, and security services, allowing CPUs and GPUs to remain focused on AI execution. Figure 19. ConnectX-9 and BlueField-4 modules for the Vera Rubin compute tray Vera Rubin NVL72 NVLink switch tray To transform multiple compute trays into one rack-scale accelerator Vera Rubin introduces the NVLink 6 switch tray. Each switch tray incorporates four NVLink 6 switch chips, doubling the per-GPU scale-up bandwidth as well as the in-network compute for accelerating collective operations directly inside the fabric. This is critical for MoE routing, synchronization-heavy inference, and communication-intensive training phases where scale-up efficiency directly determines cost and latency. 
By integrating scale-up networking as a first-class rack component, the NVLink switch tray ensures that performance scales predictably as models, batch sizes, and reasoning depth continue to increase. Figure 20. Vera Rubin NVLink switch tray Spectrum-X Ethernet switching for scale-out AI factories NVLink 6 allows 72 GPUs to behave as one rack-scale accelerator inside the rack. Spectrum-X Ethernet extends that capability beyond the rack, enabling predictable, high-throughput scale-out connectivity across rows and data centers, without the variability that traditional Ethernet often introduces under synchronized AI traffic. AI factory communication patterns are fundamentally different from enterprise workloads. MoE dispatch, collective operations, and synchronization-heavy phases generate bursty, asymmetric, and highly correlated flows that can amplify congestion, tail latency, and performance jitter at scale. Spectrum-X Ethernet is engineered specifically for these patterns through coordinated congestion control, adaptive routing, and end-to-end telemetry that keep effective bandwidth high and performance repeatable under load. Within the Vera Rubin NVL72 platform, Spectrum-X is realized through the combination of Spectrum-6 switches and ConnectX-9 SuperNIC endpoints included in the compute nodes. Together, they form a tightly co-designed scale-out system where the fabric and endpoints cooperate to shape traffic, isolate workloads, and prevent hotspots, enabling high utilization in multi-job, multi-tenant AI factories. Figure 21. Spectrum-X Ethernet switching for the Vera Rubin platform NVIDIA DGX SuperPOD: the AI factory deployment unit DGX SuperPOD represents the blueprint for deployment-scale realization of the Vera Rubin platform. Built with eight DGX Vera Rubin NVL72 systems, it defines the minimum unit at which AI factory economics, reliability, and performance converge in production environments. Unlike traditional clusters assembled from discrete components, DGX SuperPOD is designed as a complete system. Every layer, from silicon and interconnects to orchestration and operations, is co-designed and validated to deliver sustained utilization, predictable latency, and efficient conversion of power into tokens at scale. Within each NVIDIA DGX Vera Rubin NVL72 system, 72 Rubin GPUs operate as one rack-scale accelerator through NVLink 6. Spectrum-X Ethernet extends the platform beyond the rack with deterministic, high-throughput scale-out connectivity, allowing multiple DGX Vera Rubin NVL72 systems to be composed into a DGX SuperPOD. Integrated with NVIDIA Mission Control software and certified storage, these elements create a validated, production-ready AI factory building block, ready to scale into tens of thousands of GPUs. This design enables DGX SuperPOD to deliver true AI factory abilities: continuous operation, high-uptime serviceability, and consistent performance across training, post-training, and real-time inference workloads. Figure 22. DGX SuperPOD with DGX Vera Rubin NVL72 Systems 5. Software and developer experience Vera Rubin also has been designed to accelerate innovation without forcing developers to re-architect their software. At its foundation, the platform maintains full CUDA backward compatibility across hardware generations, ensuring existing models, frameworks, and workflows run seamlessly while automatically benefiting from generational improvements in compute, memory, and interconnect. 
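What that backward compatibility means in practice is that the same high-level code keeps running across generations while the libraries underneath pick per-architecture kernels. The short sketch below is illustrative only, assuming any CUDA-capable GPU and plain PyTorch; it is not specific to Rubin, which is exactly the point.

```python
import torch

# The same code runs on any supported GPU generation; CUDA libraries select
# architecture-specific kernels behind this call surface.
device = torch.device("cuda")
major, minor = torch.cuda.get_device_capability(device)
print(f"running on {torch.cuda.get_device_name(device)} (sm_{major}{minor})")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Mixed precision: autocast chooses fast reduced-precision kernels where the
# libraries provide them for this architecture, and falls back otherwise.
with torch.autocast("cuda", dtype=torch.bfloat16):
    c = a @ b
print(c.dtype, c.shape)
```

The generational gains described in this section arrive through exactly this path: unchanged application code, faster kernels and formats underneath.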
CUDA-X libraries—the performance foundation The CUDA platform encompasses a programming model, core libraries, and communication stacks that accelerate applications and expose the full distributed capabilities of the rack-scale system. Developers can program Rubin GPUs as individual devices or as part of a single 72-GPU NVLink domain using NVIDIA Collective Communications Library (NCCL) , NVIDIA Inference Transfer Library (NIXL), and NVLink-aware collectives. This design enables models to scale across the rack without custom partitioning, topology-aware workarounds, or manual orchestration. Figure 23. Accelerated computing starts with CUDA-X At the kernel and library layer, NVIDIA provides highly optimized building blocks for the most demanding AI workloads. Libraries such as NVIDIA cuDNN , NVIDIA CUTLASS , FlashInfer , and a new Transformer Engine deliver peak efficiency for attention, activation, and narrow-precision execution. These components are tightly coupled with Rubin’s Tensor Cores, HBM4 memory subsystem, and NVLink 6 interconnect, enabling sustained performance across dense, sparse, and communication-heavy workloads. Together, these libraries allow developers to focus on model behavior rather than hardware-specific tuning, while still extracting maximum performance from the underlying platform. Large-scale training—from research to production with NVIDIA NeMo Higher-level frameworks build directly on the Vera Rubin platform to maximize developer productivity and scalability. PyTorch and JAX frameworks ship with native NVIDIA acceleration to enable training, post-training, and inference workflows to scale across racks with minimal code changes. At the core of NVIDIA’s training and customization stack is the NVIDIA NeMo Framework , which provides an end-to-end workflow for building, adapting, aligning, and deploying large models at AI factory scale. NeMo unifies data curation, large-scale distributed training, alignment, and parameter-efficient customization into a single, production-oriented framework. Through NVIDIA NeMo Run , developers can configure, launch, and manage experiments consistently across local environments, SLURM clusters, and Kubernetes-based AI factories. Figure 24. NeMo framework for large-scale model training, alignment, and deployment For extreme-scale training, NeMo integrates tightly with NVIDIA Megatron Core , which supplies the underlying distributed training engine. Megatron Core provides advanced parallelism strategies, optimized data loaders, and support for modern model architectures including dense LLMs, MoE, state-space models, and multimodal networks. This integration allows NeMo to scale training across thousands of GPUs while abstracting the complexity of parallelism and communication from the user. NeMo also supports advanced post-training workflows, including reinforcement learning and alignment techniques such as reinforcement learning with human feedback (RLHF), direct preference optimization (DPO), proximal policy optimization (PPO), and supervised fine-tuning. These capabilities enable developers to move seamlessly from pre-training to alignment and customization within a single framework—without re-architecting pipelines. To link ecosystem workflows, NVIDIA NeMo Megatron Bridge enables bidirectional checkpoint conversion and verification between Hugging Face and Megatron formats. 
This tool allows models to move reliably between community tooling, NeMo-based training, reinforcement learning, and optimized inference deployments, while preserving correctness and reproducibility. Inference frameworks and optimization—serving real-time intelligence The Vera Rubin platform has been architected to deliver significant gains for modern inference workloads, which are increasingly defined by low latency, high concurrency, and communication-heavy execution. The platform integrates with widely used open source and NVIDIA inference frameworks—including SGLang, NVIDIA TensorRT-LLM, vLLM, and NVIDIA Dynamo—to enable efficient execution of long-context, MoE, and agentic workloads as software support is enabled with platform availability. The NVIDIA Model Optimizer extends inference performance through quantization, pruning, distillation, and speculative decoding, and it translates architectural advances directly into lower latency and lower cost per token. At the serving layer, NVLink-enabled communication, disaggregated inference, LLM-aware routing, KV-cache offloading to storage, and Kubernetes autoscaling are exposed through Dynamo–enabling scalable serving of communication-intensive workloads such as MoE inference and multi-agent pipelines. Figure 25. NVIDIA inference software stack A developer-ready programmable rack-scale platform NVIDIA’s architecture is designed from the ground up to maximize platform software performance and developer usability at rack scale. By integrating platform software and developer experience directly into the architecture, the Vera Rubin platform is not only powerful, but practical to deploy and program. Developers can focus on models, agents, and services rather than infrastructure complexity, while operators retain control over performance, reliability, and efficiency at AI factory scale. 6. Operating at AI factory scale Operating an AI factory at scale requires more than raw performance. It demands systems that can run continuously, securely, efficiently, and predictably in real-world data center environments. The Vera Rubin platform is engineered not only to deliver breakthrough compute capability, but to sustain it over time through intelligent reliability, full-stack security, energy-aware design, and a mature rack ecosystem. Together, these capabilities ensure that AI factories built on the Vera Rubin platform can scale rapidly, operate with minimal disruption, and convert power, infrastructure, and silicon into usable intelligence at industrial scale. Deployment and operations NVIDIA Mission Control accelerates every aspect of AI factory operations, from configuring Vera Rubin NVL72 deployments to integrating with facilities to managing clusters and workloads. Enabled by intelligent, integrated software, enterprises gain improved control over cooling and power events and redefine infrastructure resiliency. Mission Control enables faster response with rapid leak detection, unlocks access to NVIDIA’s latest efficiency innovations, and maximizes AI factory productivity with autonomous recovery. Figure 26. NVIDIA Mission Control software to configure, validate, and operate Vera Rubin-based AI factories Mission Control offers a validated implementation for enterprises to simplify and scale how AI factories are deployed and operated throughout the entire cluster lifecycle: Seamless workload orchestration: Empower model builders with effortless and simplified workload management with NVIDIA Run:ai functionality. 
Power optimizations: Balance power requirements and tune GPU performance for various workload types with developer-selectable controls. Autonomous recovery engine: Identify, isolate, and recover from problems without manual intervention for maximum productivity and infrastructure resiliency. Customizable dashboards: Track key performance indicators with access to critical telemetry data about your cluster and easy-to-set dashboards Continuous health checks: Validate hardware and cluster performance throughout the life cycle of your infrastructure. Enterprise software and lifecycle support NVIDIA AI Enterprise provides the enterprise-grade software foundation required to operate AI factories at scale. It delivers a validated, supported software stack that spans application development libraries, frameworks, and microservices, as well as infrastructure software for GPU management. It enables predictable performance, security, and stability for production AI deployments. Figure 27. NVIDIA AI Enterprise software suite for AI factories For agentic AI development, NVIDIA AI Enterprise includes NVIDIA NIM, NeMo, and other containerized libraries and microservices that enable optimized inference, model training, and customization through standardized APIs. With support for NVIDIA, partner, and community AI models, NIM microservices enable enterprises to deploy agentic AI capabilities faster. Additionally, application development SDKs, frameworks, and libraries translate the Vera Rubin platform’s architectural capabilities into performance improvements. CUDA, Transformer Engine, cuDNN, and related libraries are validated as an accelerated stack, ensuring that hardware advances are automatically realized by higher-level frameworks and services. For infrastructure management, NVIDIA AI Enterprise integrates with Kubernetes through purpose-built operators and validated GPU, networking, and virtualization drivers. These components enable secure multi-tenant operation, workload orchestration, and cluster-wide observability, and allow operators to maximize utilization while maintaining reliability and compliance. Delivered with long-term support, regular security updates, and compatibility validation across hardware generations, NVIDIA AI Enterprise serves as the software backbone of NVIDIA AI factories. It transforms rack-scale systems into a programmable, secure, and operable production platform across data center, cloud, and edge environments. NVIDIA AI Enterprise is supported by a wide ecosystem of partners, including solution integrators, data and enterprise platforms, hybrid and multi-cloud providers, and AIOps solutions. It integrates seamlessly with existing enterprise software stacks to enable production grade AI and accelerate time to market. Reliability, availability, and serviceability AI factories are no longer batch systems that can afford maintenance windows. They are always-on environments running continuous training, real-time inference, retrieval, and analytics. Vera Rubin NVL72 is engineered for this reality, introducing a rack-scale RAS architecture designed to maximize uptime, improve goodput, the amount of useful AI work actually completed over time, and ensure predictable completion of long-running AI workloads. 
In this context, goodput reflects how effectively the system converts powered-on time into finished training steps, completed inference requests, and delivered tokens, without losses from job restarts, checkpoint rollbacks, stragglers, or performance degradation caused by component faults. Even brief interruptions or localized failures can materially reduce goodput when workloads span thousands of GPUs and run for days or weeks. Resiliency in the Vera Rubin platform is designed end to end, spanning silicon, interconnect, and physical system architecture. The result is a unified, intelligent approach to reliability that allows the system to isolate faults, reroute traffic, and continue executing workloads without interruption, enabling zero planned downtime at rack scale while preserving sustained throughput and predictable job completion. Rack-scale resiliency: Designed from the ground up Vera Rubin NVL72 is built on a third-generation NVIDIA MGX rack design that treats reliability and serviceability as first-order architectural requirements. Compute trays, NVLink switch trays, and power and cooling infrastructure are modular, hot-swappable, and designed for in-field replacement without draining racks or interrupting active workloads. As shown in the animation below, a cable-free, hose-free, fanless compute tray architecture eliminates many manual PCIe, networking, and management connections within the tray, removing common assembly and service friction seen in prior cabled tray designs. This mechanical simplification enables up to 18x faster assembly compared with previous-generation tray architectures and significantly reduces service time during in-field maintenance, lowering deployment time and ongoing operational overhead. A mature ecosystem of more than 80 MGX partners ensures global manufacturability, service readiness, and scalable deployment, allowing AI factories to ramp quickly while maintaining consistent reliability at scale. Figure 28. NVIDIA Blackwell Ultra GB300 vs. Vera Rubin compute tray Intelligent resiliency across the interconnect At the system level, NVIDIA NVLink Intelligent Resiliency enables racks to remain fully operational during maintenance, partial population, or component replacement. Using software-defined routing and intelligent failover, traffic is dynamically rerouted around faults without disrupting active training or inference jobs. This capability is critical as AI factories scale to thousands of GPUs. Rather than treating interruptions as stop-the-world events, the system adapts in real time, maintaining high utilization and predictable performance even as components are serviced or replaced, improving goodput. Silicon-level health monitoring with zero downtime At the heart of this architecture is the Rubin GPU's second-generation reliability, availability, and serviceability (RAS) engine, which delivers continuous, in-system health monitoring without taking GPUs offline. Health checks are performed during idle execution windows, enabling full diagnostics with no impact to running workloads. The RAS engine supports in-field SRAM repair and zero-downtime self-testing during execution, increasing effective mean time between failures and improving overall system yield. This capability is especially important for long-running training jobs and persistent inference services, where unplanned interruptions can be costly or unacceptable.
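To make the goodput discussion above concrete, the sketch below estimates how cluster-level failure rate and checkpoint/restart overhead translate into goodput for a month-long training run. Every number is hypothetical and illustrative only; this is not NVIDIA's reliability methodology or measured Vera Rubin data.

```python
# Illustrative goodput model for a month-long training run.
# All numbers are hypothetical; this is not NVIDIA's reliability methodology.

def estimated_goodput(num_gpus, mtbf_hours_per_gpu, checkpoint_interval_hours,
                      restart_overhead_hours, run_hours):
    """Each failure costs the restart overhead plus, on average,
    half a checkpoint interval of recomputed work."""
    cluster_mtbf_hours = mtbf_hours_per_gpu / num_gpus        # failures arrive faster at scale
    expected_failures = run_hours / cluster_mtbf_hours
    lost_per_failure = restart_overhead_hours + checkpoint_interval_hours / 2
    lost_hours = expected_failures * lost_per_failure
    return max(0.0, (run_hours - lost_hours) / run_hours)

# A 720-hour (one-month) job on 10,000 GPUs, checkpointing hourly:
baseline = estimated_goodput(10_000, 50_000, 1.0, 0.5, 720)   # ~5 h between cluster failures
improved = estimated_goodput(10_000, 100_000, 1.0, 0.5, 720)  # 2x effective MTBF
print(f"goodput: {baseline:.1%} -> {improved:.1%}")           # roughly 80% -> 90%
```

In this toy model, doubling effective MTBF, which is the kind of improvement that in-field repair and predictive maintenance aim for, recovers a meaningful fraction of otherwise lost training time.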
Vera CPUs complement GPU-level resiliency with in-system CPU core validation, reduced diagnostic times, and SOCAMM LPDDR5X memory designed for improved serviceability and fault isolation. Predictive operations at AI factory scale These hardware capabilities are paired with NVIDIA AI-powered predictive management, which analyzes thousands of hardware and software telemetry signals across the rack. Potential issues are identified early, localized precisely, and addressed proactively. Operators can rebalance workloads, adjust checkpoint strategies, activate standby capacity, or schedule maintenance without impacting service-level objectives. Together, these capabilities transform RAS from a reactive process into an intelligent, predictive system that minimizes downtime, reduces operational complexity, and ensures AI workloads complete on schedule. With Vera Rubin NVL72, reliability is no longer a limiting factor for scale. From silicon to system, the platform is engineered to keep AI factories running continuously, efficiently, and predictably at unprecedented scale. Full-stack confidential computing As AI factories move into production, security requirements expand from protecting individual devices to protecting entire systems operating continuously at scale. Modern AI workloads routinely process proprietary training data, regulated content, and high-value models, often in shared or cloud environments where infrastructure cannot be implicitly trusted. Meeting these requirements demands security that spans silicon, interconnect, and system software, without introducing performance penalties or operational friction. Vera Rubin NVL72 was designed with full-stack confidential computing as a foundational capability, extending trust from individual components to the entire rack. Third-generation confidential computing: rack-level security As shown in the figure below, Vera Rubin NVL72 extends confidential computing beyond individual devices to create a unified, rack-scale trusted execution environment spanning CPUs, GPUs, and interconnects. This design enables sensitive AI workloads to run securely at scale with near-native performance, even in shared or cloud environments. Figure 29. Confidential computing on Vera Rubin NVL72 AI factories increasingly process proprietary data, regulated content, and mission-critical models that cannot be exposed, even to the infrastructure they run on. Vera Rubin NVL72 addresses this requirement by delivering end-to-end encryption across CPU-to-GPU, GPU-to-GPU, and device I/O paths, allowing enterprises to deploy secure training, inference, retrieval, and analytics pipelines without sacrificing throughput or latency. From device-level security to rack-scale trust NVIDIA has advanced GPU security over multiple generations. Hopper introduced high-performance confidential computing for GPUs. Blackwell expanded these capabilities, eliminating the traditional tradeoff between security and performance. Vera Rubin NVL72 completes this progression by unifying CPU and GPU security into a single trust domain across the entire rack. This rack-level approach ensures that proprietary models, training data, embeddings, and inference prompts remain protected not only from other tenants, but also from the underlying cloud provider infrastructure itself. Cryptographic attestation and verifiable compliance Vera Rubin NVL72 integrates with NVIDIA remote attestation services (NRAS) to provide cryptographic proof of system integrity.
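At its core, attestation reduces to comparing measurements of the running stack against known-good references. The toy sketch below illustrates only that comparison step; it uses the Python standard library with made-up values, omits the nonces and signature verification a real protocol requires, and is not the NRAS API.

```python
import hashlib

# Conceptual sketch of measurement-based attestation (made-up values, not the NRAS API).
# Each component reports a digest of its firmware or driver image; the verifier accepts
# the node only if every digest matches a known-good reference supplied out of band.
# Real attestation also involves nonces and signed evidence, which are omitted here.

REFERENCE_MEASUREMENTS = {
    "gpu_firmware": hashlib.sha384(b"gpu-fw-build-1234").hexdigest(),
    "nic_firmware": hashlib.sha384(b"nic-fw-build-5678").hexdigest(),
    "gpu_driver":   hashlib.sha384(b"driver-build-9012").hexdigest(),
}

def verify_node(reported: dict) -> bool:
    """Accept the node only if every reported measurement matches its reference."""
    return all(reported.get(name) == reference
               for name, reference in REFERENCE_MEASUREMENTS.items())

healthy = dict(REFERENCE_MEASUREMENTS)                      # all measurements match
tampered = {**healthy, "gpu_driver": "unexpected-digest"}   # one component differs
print(verify_node(healthy), verify_node(tampered))          # True False
```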
Organizations can verify that CPUs, GPUs, NICs, firmware, drivers, and the running workload match known-good reference measurements supplied by NVIDIA, achieving a zero-trust architecture at rack scale. The platform supports both on-demand attestation through NVIDIA Attestation Cloud services and deployment models that require cached results or fully air-gapped operation. This flexibility allows enterprises to meet stringent regulatory, compliance, and data-sovereignty requirements while maintaining operational efficiency. Unified security across the entire rack Vera Rubin NVL72 establishes a unified security domain using a combination of industry standards and NVIDIA technologies, including:
TEE Device Interface Security Protocol (TDISP) for device-level trust
PCIe Integrity and Data Encryption (IDE) for secure I/O
NVLink-C2C encryption for protected CPU-to-GPU and CPU-to-CPU communication
NVLink encryption for secure GPU-to-GPU data movement at scale
Together, these capabilities enable a fully encrypted, unified trusted execution environment designed to scale to the world's largest AI models and most demanding enterprise workloads. From the user's device to cloud-scale AI factories, Vera Rubin NVL72 delivers full-stack confidential computing that protects even the most sensitive data and workloads wherever they run. Energy for tokens: thermal and power innovations AI factories can draw hundreds of megawatts of power. Yet by the time that power reaches the GPUs doing the work, roughly 30% of it is lost to power conversion, distribution, and cooling. This energy is consumed by systems that support compute but do not directly generate tokens, the fundamental unit of AI output. Known as parasitic energy, it represents billions of dollars in wasted potential revenue at scale. Figure 30. Grid-to-token energy flow and parasitic power loss in AI factories Every watt wasted is a watt that could have been used to generate tokens. As AI becomes a primary engine of knowledge creation, improving energy efficiency directly translates into higher throughput, lower cost per token, and better sustainability. Cutting parasitic energy means delivering more usable power to GPUs, the engines that produce tokens. The Vera Rubin platform has been engineered to minimize these hidden costs through simpler power paths, higher-efficiency cooling, and system-level orchestration designed for always-on AI factories. Traditional data centers rely heavily on air cooling, which consumes significant energy to move and condition air. Similar to Blackwell, Vera Rubin NVL72 systems use warm-water, single-phase direct liquid cooling (DLC) with a 45-degree Celsius supply temperature. Liquid cooling captures heat far more efficiently than air, and by maintaining Blackwell's 45-degree cooling temperature, data centers can cool water with ambient air. This translates to significant cost, complexity, and power savings relative to other solutions that require 35-degree liquid cooling. Building on Blackwell's liquid-cooled design, Vera Rubin further increases cooling efficiency by nearly doubling thermal performance in the same rack footprint without introducing new cooling complexities or costs. This ensures rapid heat removal under sustained, extreme workloads, preventing thermal throttling and keeping performance consistent. Less energy spent on cooling means more energy available for compute and higher sustained utilization across the AI factory.
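A back-of-envelope calculation shows why parasitic losses matter at this scale. Every number below is an illustrative assumption, not a measured Vera Rubin or data center figure.

```python
# Back-of-envelope: how parasitic losses translate into deployable GPUs and token output.
# Every number here is an illustrative assumption, not a measured platform figure.

SITE_POWER_MW = 100         # power drawn from the grid
GPU_POWER_KW = 2.0          # assumed per-GPU power draw, including its share of the rack
TOKENS_PER_GPU_SEC = 5_000  # assumed per-GPU serving throughput

def token_capacity(parasitic_fraction):
    usable_kw = SITE_POWER_MW * 1_000 * (1 - parasitic_fraction)
    gpus = int(usable_kw / GPU_POWER_KW)
    return gpus, gpus * TOKENS_PER_GPU_SEC

for parasitic in (0.30, 0.20):
    gpus, tokens_per_sec = token_capacity(parasitic)
    print(f"parasitic losses {parasitic:.0%}: {gpus:,} GPUs, {tokens_per_sec:,} tokens/s")
```

In this toy model, cutting parasitic losses from 30% to 20% adds 5,000 GPUs' worth of token capacity without drawing a single extra megawatt from the grid.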
Rack-level power smoothing and site-level energy storage AI workloads are inherently dynamic. Large-scale training introduces synchronized all-to-all communication phases with megawatt-scale power ramps, while inference generates sharp, bursty demand spikes. Figure 31. Synchronized GPU power swings in AI training workloads Without mitigation, these swings can stress power delivery networks, violate grid constraints, or force operators to overbuild infrastructure or throttle GPUs, both of which waste energy and limit deployable compute. Vera Rubin AI factories address this challenge with a multi-layered approach. Figure 32. Multi-layer power smoothing and energy storage for AI factories At the rack level, Vera Rubin NVL72 evens out power swings with power smoothing and incorporates approximately 6x more local energy buffering than Blackwell Ultra, absorbing rapid power transients directly at the source. The figure below shows the effect of rack-level power smoothing in operation: synchronized AI workload power swings are reshaped into controlled ramps bounded by a stable power ceiling and floor, with local energy buffering absorbing rapid transients at the source. The result is a smoother, more predictable power profile that aligns GPU execution with data center and grid constraints. Figure 33. Rack-level power smoothing with local energy buffering The figure below breaks this behavior down into the three complementary mechanisms that make it possible. Together, controlled ramps, enforced limits, and local energy storage operate as a coordinated system, reducing peak demand, limiting ramp-rate violations, and stabilizing power delivery without throttling performance. These mechanisms allow AI factories to plan around sustained power rather than worst-case spikes, directly increasing deployable compute per megawatt. Figure 34. Power-smoothing mechanisms: ramps, limits, and storage At the site level, battery energy storage systems (BESS) provide fast-response capacity to handle grid events and maintain stability without interrupting workloads. AI infrastructure power management uses the NVIDIA Domain Power Service (DPS) to provide power-domain-level controls and the NVIDIA Workload Power Profile Solution (WPPS) to optimize performance per watt for each job, integrating with schedulers such as SLURM and NVIDIA Mission Control. Mission Control provides cluster-wide telemetry, coordinated power-aware policies, and integration with facilities (including energy-optimized power profiles and building management system interfaces) for efficient large-scale operations. Low-level GPU telemetry, power capping, and health control are handled through NVIDIA System Management Interface (SMI) and NVIDIA Data Center GPU Manager (DCGM) APIs. Figure 35. Power stability and energy optimization for AI factory operations By reducing peak-to-average power ratios, Vera Rubin NVL72 enables operators to provision more GPUs per megawatt of available grid capacity, and plan around sustained power rather than worst-case spikes. This improves utilization, lowers infrastructure overhead, and directly increases tokens produced per unit of energy. Power optimization and grid awareness for sustainable AI factory scale AI factories do not operate in isolation. They are tightly coupled to utility grids that impose limits on ramp rates, peak demand, and operational stability. Managing these constraints manually is impractical at scale and can result in forced throttling or downtime.
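The ramp-rate and peak constraints described above can be pictured with a toy smoothing loop: raw, synchronized workload power swings are clamped to a ceiling, a floor, and a maximum ramp rate, while a local energy buffer absorbs the difference between demand and delivery. This is a conceptual sketch only, not the DPS or WPPS control logic, and every number in it is hypothetical.

```python
# Toy power-smoothing loop (conceptual only, not the DPS or WPPS control logic).
# Delivered power is clamped to a ceiling, a floor, and a maximum ramp rate per step;
# a local energy buffer absorbs the difference between demand and delivery.

def smooth(raw_kw, floor_kw, ceiling_kw, max_ramp_kw, step_seconds=1.0):
    delivered, buffer_kwh = [], 0.0
    previous = raw_kw[0]
    for demand in raw_kw:
        target = min(max(demand, floor_kw), ceiling_kw)                # enforce ceiling and floor
        ramp = max(-max_ramp_kw, min(max_ramp_kw, target - previous))  # enforce ramp-rate limit
        supplied = previous + ramp
        buffer_kwh += (supplied - demand) * step_seconds / 3600.0      # buffer charges or discharges
        delivered.append(supplied)
        previous = supplied
    return delivered, buffer_kwh

# Synchronized training phases: demand swings between 400 kW and 1,200 kW each step.
raw = [400, 1200, 400, 1200, 1200, 400, 1200, 400]
smoothed, buffer_state = smooth(raw, floor_kw=600, ceiling_kw=1000, max_ramp_kw=200)
print(smoothed)   # [600, 800, 600, 800, 1000, 800, 1000, 800]
```

The sharp square-wave demand becomes a bounded, slowly varying profile, which is the behavior Figures 33 and 34 illustrate at rack scale.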
NVIDIA is building a Vera Rubin NVL72 AI factory research center in Manassas, Va., to optimize and validate the reference design for 100 MW up to gigawatt-scale AI factories. The reference design integrates the Vera Rubin NVL72 rack designs at scale with power and cooling infrastructure and implements APIs to connect grid power controls with the AI factory telemetry and controls. Vera Rubin NVL72 AI factories integrate the NVIDIA Omniverse DSX reference design for software-defined power control. DSX Flex translates electric utility signals into actionable cluster-level power events. DSX Boost enforces ramp-rate compliance and dynamically orchestrates workload power budgets across the factory. Together, these capabilities allow AI factories to remain compliant with grid requirements while keeping workloads running at high utilization. By coordinating power behavior across racks, nodes, and jobs, DSX enables Vera Rubin NVL72 AI factories to provision up to 30% more GPU capacity within the same power envelope, directly increasing token output and revenue potential. A seamless transition enabled by a mature ecosystem Figure 36. The NVIDIA MGX wall demonstrating the huge ecosystem of partners and components needed to build and scale Vera Rubin NVL72 Vera Rubin NVL72 is built on the third-generation NVIDIA MGX rack architecture, preserving the same physical rack footprint while advancing performance, reliability, and serviceability. This continuity is intentional. By evolving the platform without forcing disruptive infrastructure changes, NVIDIA enables exponential gains in AI capability while maintaining a predictable and efficient deployment model. With Vera Rubin NVL72 delivering up to 3.6 exaFLOPS of AI inference compute per rack, the challenge is no longer just performance, but how quickly that performance can be deployed at scale. The MGX design ensures that power, cooling, mechanical integration, and service workflows are already proven, allowing partners and operators to focus on accelerating time to production rather than redesigning infrastructure. This consistency translates directly into faster ramps. Vera Rubin is supported by a mature ecosystem of more than 80 MGX partners spanning system manufacturers, integrators, and data-center solution providers, many of whom are already ramping the platform. These partners bring hard-earned operational experience from prior generations, reducing risk and accelerating global deployment. For data-center operators, this means a smooth transition to Vera Rubin with minimal friction. Existing facilities can adopt the next generation of agentic AI infrastructure without retooling layouts, retraining service teams, or requalifying fundamental rack designs. The result is faster deployment, predictable operations, and the ability to scale AI factories quickly as demand grows. Vera Rubin's mature ecosystem ensures that platform innovation does not come at the cost of deployment velocity, enabling enterprises and cloud providers to move from innovation to production at unprecedented speed. Where operations meets performance Taken together, these capabilities define what it means to operate at AI factory scale. Vera Rubin NVL72 combines zero-downtime reliability, full-stack security, energy-aware system design, and a mature rack ecosystem to ensure that performance gains translate into real, sustained output in production environments.
By removing operational, power, and deployment bottlenecks, the platform allows AI factories to focus on what matters most: delivering more intelligence per watt, per rack, and per data center. With this foundation in place, the next section examines how Vera Rubin converts these system-level advantages into measurable performance gains at scale. 7. Performance and efficiency at scale A useful way to understand the performance impact of Vera Rubin NVL72 is through the lens of model evolution. The industry is simultaneously pushing toward extreme-scale training, exemplified by 10 trillion parameter mixture-of-experts (MoE) models, and toward low-latency inference required for reasoning agents and complex workflows. At this scale, the challenge is no longer peak throughput in isolation, but how efficiently an entire platform converts infrastructure into sustained model progress. As the industry has advanced from Hopper to Blackwell and now Rubin, performance gains increasingly come from architectural efficiency rather than brute-force scaling. Vera Rubin NVL72 shifts the performance frontier on both ends, delivering the architectural density required to train giant MoE models without unmanageable cluster sprawl, while also enabling the sustained execution efficiency needed for real-time, high-reasoning inference. Unlocking the 10T MoE era via extreme co-design Training the next generation of frontier models requires extreme co-design. As parameter counts continue to climb, the industry is rapidly approaching a point where 10T MoE architectures become operationally viable. These models offer enormous capacity and more efficient inference, but they introduce substantial communication overhead during training due to dynamic expert routing and frequent all-to-all exchanges. The Vera Rubin platform is designed to absorb this overhead through tight co-design across compute, memory, and networking. Higher compute density per rack and more efficient interconnects reduce the cost of synchronization and expert communication, allowing training efficiency to scale rather than collapse as cluster size increases. The figure below illustrates the impact of this co-design using a fixed training objective. To train a 10T MoE model on 100 trillion tokens within a one-month window, Vera Rubin NVL72 achieves the target using approximately one-quarter the number of GPUs required by Grace Blackwell NVL72. Instead of scaling out to ever-larger clusters to meet aggressive timelines, Vera Rubin concentrates effective training capacity into fewer GPUs. Figure 37. Vera Rubin NVL72 enables one-fourth the GPUs to train 10T MoE vs. Blackwell NVL72 This reduction in required GPU count represents a structural shift in large-scale training. By minimizing cluster sprawl and communication overhead, Vera Rubin NVL72 eliminates much of the complexity that has historically limited MoE scalability. Architectural efficiency, not raw GPU volume, becomes the dominant factor in making 10T-class models practical at scale. Real-time reasoning at scale The shift toward multi-agent AI systems fundamentally changes inference behavior. Instead of short, stateless requests, agents now operate with persistent context, continuously exchanging state across turns and across agents. Each request may carry tens of thousands of tokens, including conversation history, tool definitions, structured API schemas, retrieved RAG context, and intermediate outputs from other agents in the workflow. 
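As a rough illustration, a hypothetical tally of one such agent request shows how quickly the context adds up (every number below is invented for illustration, not a measured workload):

```python
# Rough, hypothetical tally of the context carried by a single agent request.
context_tokens = {
    "system prompt + instructions":   1_500,
    "conversation history":          12_000,
    "tool / API schema definitions":  4_000,
    "retrieved RAG passages":         8_000,
    "intermediate agent outputs":     6_000,
}
print(f"context per request: {sum(context_tokens.values()):,} tokens")  # 31,500 tokens
```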
Maintaining responsiveness under this sustained context load requires far more than peak compute; it demands high sustained throughput across compute, memory, and communication. At the same time, modern "thinking" models, such as Moonshot AI's Kimi-K2-Thinking, introduce an additional execution phase. Before producing a final response, these models generate long internal reasoning sequences, significantly increasing output token counts. For workloads requiring on the order of 8,000 output tokens, conventional user inference rates, roughly 50 tokens per second per user, translate into multi-minute response times (8,000 tokens at 50 tokens per second is about 160 seconds of generation alone). At scale, this latency compounds across concurrent users, degrading both user experience and system efficiency. Vera Rubin NVL72 is designed to remove this bottleneck. By sustaining high throughput at elevated interactivity levels, the platform enables reasoning-heavy inference without sacrificing responsiveness. The figure below illustrates this generational shift. On the Kimi-K2-Thinking workload, Vera Rubin NVL72 delivers up to 10x higher token factory throughput per megawatt than the NVIDIA Blackwell GB200 NVL72 system at comparable user interactivity. While prior architectures experience steep throughput collapse as TPS per user increases, Vera Rubin NVL72 maintains efficiency across the operating range required for fluid, interactive reasoning. This allows large 1-trillion-parameter MoE models to serve real-time agentic workloads without the "waiting for thought" penalty. Figure 38. Vera Rubin NVL72 enables up to 10x higher AI factory inference throughput per MW Beyond throughput, Vera Rubin NVL72 fundamentally shifts the economics of reasoning inference. The figure below shows cost per million tokens as a function of output latency for the same workload. For long-context, reasoning-dominated inference, Vera Rubin NVL72 delivers up to 10x lower cost per million tokens compared to Blackwell NVL72. The advantage is most pronounced at the service levels required for interactive agents, where prior platforms may encounter an efficiency wall in which costs rise steeply for incremental gains in responsiveness. Vera Rubin remains cost-efficient across this region, transforming long-chain reasoning from a premium capability into a scalable, production-ready service. Figure 39. Vera Rubin NVL72 delivers one-tenth the cost per token for inference Redefining the Pareto frontier Together, these results redefine the traditional tradeoff between responsiveness and efficiency in AI inference. Where prior platforms forced operators to choose between low latency and reasonable cost, Vera Rubin NVL72 sustains both simultaneously. This enables large-context, reasoning-heavy models to operate interactively at scale, transforming high-intelligence inference from a premium capability into a production-standard service. 8. Why Vera Rubin is the AI factory platform AI infrastructure has reached an inflection point. As models evolve toward long-context reasoning, agentic execution, and continuous post-training, performance is no longer determined by any single component. It is determined by how efficiently an entire system converts power, silicon, and data movement into usable intelligence at scale. Vera Rubin has been purpose-built for this reality. Rather than optimizing isolated chips, the Vera Rubin platform treats the data center as the unit of compute.
Through extreme co-design across GPUs, CPUs, scale-up and scale-out networking, infrastructure offload, power delivery, cooling, security, and system software, Vera Rubin enables AI factories to operate as tightly integrated, predictable, and continuously available systems. At the execution layer, Rubin GPUs deliver sustained throughput for compute, memory, and communication-dominated workloads. Vera CPUs act as high-bandwidth data engines, streaming data efficiently to the GPUs and accelerating system-level orchestration without becoming a bottleneck. NVLink 6 unifies the rack into a single NVLink domain, enabling predictable performance across all GPUs. BlueField-4 completes the stack by operating the AI factory itself, offloading infrastructure services and enforcing security, isolation, and control at scale. Spectrum-X Ethernet and ConnectX-9 then extend this deterministic behavior beyond the rack, enabling efficient, scalable AI factories across multi-rack deployments. Most importantly, these capabilities are not theoretical. They are delivered as a validated, production-ready platform through the DGX SuperPOD, supported by NVIDIA Mission Control, enterprise software, and a mature MGX ecosystem. This design allows organizations to deploy secure AI factories faster, operate them more reliably, and scale them more efficiently as demand grows. The result is a fundamental shift in AI economics. By maximizing utilization, reducing operational friction, and minimizing wasted power, the Vera Rubin platform lowers the cost per token while increasing tokens per watt and tokens per rack. What once required sprawling, fragile clusters can now be delivered with higher density, higher reliability, and predictable performance. The Vera Rubin platform is not just the next generation of accelerated computing. It is the platform that enables AI factories to move from experimentation to industrial-scale intelligence production. 9. Learn more Explore the Vera Rubin platform, Vera CPU, Vera Rubin NVL72, NVIDIA NVLink 6 switch, NVIDIA ConnectX-9 SuperNIC, NVIDIA BlueField-4 DPU, NVIDIA Spectrum-6 Ethernet switch, DGX SuperPOD configurations, and other deployment options at nvidia.com. And read the CES press release. Acknowledgments Thanks to Alex Sandu, Amr Elmeleegy, Ashraf Eassa, Brian Sparks, Casey Dugas, Chris Hoge, Chris Porter, Dave Salvator, Eduardo Alvarez, Erik Pounds, Farshad Ghodsian, Fred Oh, Gilad Shainer, Harry Petty, Ian Buck, Itay Ozery, Ivan Goldwasser, Jamie Li, Jesse Clayton, Joe DeLaere, Jonah Alben, Kirthi Devleker, Laura Martinez, Nate Dwarika, Praveen Menon, Rohil Bhargava, Ronil Prasad, Santosh Bhavani, Scot Schultz, Shar Narasimhan, Shruti Koparkar, Stephanie Perez, Taylor Allison, and Traci Psaila—along with many other NVIDIA product leaders, engineers, architects, and partners who contributed to this post.
Accelerate AI Inference for Edge and Robotics with NVIDIA Jetson T4000 and NVIDIA JetPack 7.1 | NVIDIA Technical Blog nvidia_dev_blog 05.01.2026 22:10 0.631
Embedding sim.0.6916
Entity overlap0.0938
Title sim.0.2296
Time proximity0.9979
NLP type product_launch
NLP organization NVIDIA
NLP topic edge computing
NLP country

Open original

NVIDIA is introducing the NVIDIA Jetson T4000, bringing high-performance AI and real-time reasoning to a wider range of robotics and edge AI applications. Optimized for tighter power and thermal envelopes, T4000 delivers up to 1,200 FP4 TFLOPs of AI compute and 64 GB of memory, providing an ideal balance of performance, efficiency, and scalability. With its energy-efficient design and production-ready form factor, T4000 makes advanced AI accessible for the next generation of intelligent machines, from autonomous robots to smart infrastructure and industrial automation. The module includes 1x NVENC and 1x NVDEC hardware video codec engines, enabling real-time 4K video encoding and decoding. This balanced design is built for platforms that combine advanced vision processing and I/O capabilities with power and thermal efficiency.
Feature: Jetson T4000 | Jetson T5000
AI performance: 1,200 FP4 Sparse TFLOPs | 2,070 FP4 Sparse TFLOPs
GPU: 1,536-core NVIDIA Blackwell architecture GPU with fifth-generation Tensor Cores, Multi-Instance GPU with 6 TPCs | 2,650-core NVIDIA Blackwell architecture GPU with fifth-generation Tensor Cores, Multi-Instance GPU with 10 TPCs
Memory: 64 GB 256-bit LPDDR5x, 273 GBps | 128 GB 256-bit LPDDR5x, 273 GBps
CPU: 12-core Arm Neoverse-V3AE 64-bit CPU | 14-core Arm Neoverse-V3AE 64-bit CPU
Video encode: 1x NVENC | 2x NVENC
Video decode: 1x NVDEC | 2x NVDEC
Networking: 3x 25GbE | 4x 25GbE
I/Os: Up to 8 lanes of PCIe Gen5, 5x I2S, 1x Audio Hub (AHUB), 2x DMIs, 4x UART, 3x SPI, 13x I2C, 6x PWM outputs | Up to 8 lanes of PCIe Gen5, 5x I2S, 2x Audio Hub (AHUB), 2x DMIs, 4x UART, 4x CAN, 3x SPI, 13x I2C, 6x PWM outputs
Power: 40W-70W | 40W-130W
Table 1. Key specifications of the Jetson T4000 module and the NVIDIA Jetson T5000 module
The Jetson T4000 module shares the same form factor and pin compatibility as the NVIDIA Jetson T5000 module. Developers can design common carrier boards for both T4000 and T5000, while accounting for differences in thermal and other inherent module features. NVIDIA Jetson T4000 and T5000 benchmarks Jetson T4000 and T5000 modules deliver strong performance for a number of large language models (LLMs), text-to-speech (TTS), and vision-language-action (VLA) models. Jetson T4000 delivers up to 2x performance gains over the previous-generation NVIDIA Jetson AGX Orin platform. The following table shows performance numbers of T4000 and T5000 over popular LLMs, TTS, and VLAs.
Model family | Model | Jetson T4000 (tokens/sec) | Jetson T5000 (tokens/sec) | T4000 vs T5000
QWEN | Qwen3-30B-A3B | 218 | 258 | 0.84
QWEN | Qwen 3 32B | 68 | 83 | 0.82
Nemotron | Nemotron 12B | 40 | 61 | 0.66
DeepSeek | DeepSeek R1 Distill Qwen 32B | 64 | 82 | 0.78
Mistral | Mistral 3 14B | 100 | 109 | 0.92
Kokoro TTS | Kokoro 82M | 1,100 | 900 | 0.82
GR00T | GR00T N1.5 | 376 | 410 | 0.92
Table 2. Performance benchmarking of Jetson T5000 and Jetson T4000 modules
NVIDIA JetPack 7.1: An advanced software stack for next-gen edge AI NVIDIA JetPack 7 is the most advanced software for Jetson, enabling the deployment of generative AI and humanoid robotics at the edge. The new Jetson T4000 module is powered by JetPack 7.1 and introduces several new software features that enhance AI and video codec capabilities. NVIDIA TensorRT Edge-LLM: Efficient inferencing for robotics and edge systems With JetPack 7.1, we're introducing support for NVIDIA TensorRT Edge-LLM on the Jetson Thor platform. The TensorRT Edge-LLM SDK is an open-source C++ SDK for running LLMs and vision language models (VLMs) efficiently on edge platforms like Jetson.
It targets robotics and other real-time systems that need the intelligence of modern LLMs without the data center-scale compute, memory, or power. Most popular LLM stacks are designed with cloud GPUs in mind. They have plenty of memory, loose latency constraints, Python services everywhere, and elastic scaling as a safety net. Robots and other edge devices live under different constraints, where every millisecond, watt, and runtime can impact physical behavior. The TensorRT Edge-LLM SDK addresses this gap by bringing a production-oriented LLM runtime to devices like Jetson Thor-class embedded GPUs. For robotics workloads, the goal is not just to "run an LLM," but to do it alongside perception, control, and planning stacks that are already saturating the GPU and CPU. An edge-first design means the LLM runtime integrates cleanly with existing C++ codebases, respects tight memory budgets, and delivers predictable latency under load. TensorRT Edge-LLM SDK focuses on fast and efficient inference of LLMs and VLMs at the edge, starting with familiar training ecosystems like PyTorch. The typical workflow is straightforward. Export a trained model to ONNX, run it through TensorRT for optimization, and then deploy an engine that the SDK drives end-to-end on the device. A defining characteristic is its implementation as a lightweight C++ toolkit, originally tuned for in-vehicle systems in the NVIDIA DriveOS LLM SDK. Instead of a tall dependency tower of Python packages, web servers, and background services, you link against a focused C++ runtime that speaks to TensorRT and NVIDIA CUDA. Compared with Python-centric LLM frameworks, this has several practical benefits for robotics, including:
Lower overhead: C++ binaries avoid Python interpreter startup costs, garbage collection pauses, and GIL-related contention, helping meet strict latency targets.
Easier real-time integration: C++ gives more direct control over threads, memory pools, and scheduling, which fits naturally with real-time or near-real-time robotics stacks.
Smaller footprint: Fewer dependencies simplify deployment on Jetson, reduce container images, and make over-the-air updates less fragile.
Quantization is one of the most important levers. The SDK supports multiple reduced precisions such as FP8, NVFP4, and INT4, shrinking both model weights and KV-cache usage with modest accuracy loss when tuned correctly. Figure 1. TensorRT Edge-LLM performance using Qwen3 with speculative decoding Video Codec SDK: Powering real-time perception and media processing on Jetson Thor With JetPack 7.1, the NVIDIA Video Codec SDK is now supported on Jetson Thor. The Video Codec SDK is a comprehensive suite of APIs, high-performance tools, sample applications, reusable code, and documentation enabling hardware-accelerated video encoding and decoding on the Jetson Thor platform. At its core, the NVENCODE and NVDECODE APIs provide C-style interfaces for high-performance access to NVENC and NVDEC HW accelerators, exposing most hardware capabilities along with a wide range of commonly used and advanced codec features. To simplify integration, the SDK also includes reusable C++ classes built on top of these APIs, allowing applications to easily adopt the full breadth of functionality offered by the underlying NVENCODE/NVDECODE interfaces. Figure 2 shows the architecture of the Video Codec SDK and its drivers in the JetPack 7.1 BSP, along with the associated sample applications and documentation. Figure 2. Architecture of the Video Codec SDK
The Video Codec SDK brings the following key benefits to multimedia developers. A unified experience across NVIDIA GPUs With the Video Codec SDK, developers gain a consistent and streamlined development experience across the NVIDIA GPU portfolio. This unification eliminates the need for separate code bases or tuning strategies for different GPU classes, reducing engineering overhead. Developers building on GPUs can extend or port their applications using Video SDK APIs to Jetson Thor's integrated GPUs without re-architecting their video pipeline. Teams working on embedded platforms benefit from the same mature APIs, tools, and performance optimizations available on workstations and servers. This consistency not only accelerates development and validation but also simplifies long-term maintenance, scalability, and cross-platform feature parity. Fine-grained control of next-gen robot perception and multimedia applications The Video Codec SDK exposes APIs for developers to pair presets with tuning modes to precisely control quality, latency, and throughput, unlocking flexible application-specific encoding. Through APIs for reconstructed frame access and iterative encoding, the SDK enables CABR workflows that automatically find the minimum bitrate for perceptual quality, cutting bandwidth while maintaining quality. SDK-exposed controls for Spatial/Temporal Adaptive Quantization (AQ) and lookahead enable fine-grained perceptual optimization, allocating bits where they matter most and delivering cleaner, more stable video without raising bitrate. The Video Codec SDK consists of two major component groups:
Video user-mode drivers, which provide access to the on-chip hardware encoders and decoders through the NVENCODE and NVDECODE APIs.
The Video Codec SDK 13.0 package, with sample code, header files, and documentation, which can be installed through the NVIDIA Video Codec SDK webpage, using APT (see instructions), or through the NVIDIA SDK Manager.
Figure 3. Components of the Video Codec SDK PyNvVideoCodec is the NVIDIA Python-based video codec library that provides simple yet powerful Python APIs for hardware-accelerated video encode and decode on NVIDIA GPUs. The PyNvVideoCodec library internally uses the core C/C++ encode and decode APIs of the Video Codec SDK and exposes them through easy-to-use Python APIs. The library offers encode and decode performance close to the Video Codec SDK. Getting started NVIDIA Jetson T4000 is backed by a mature ecosystem of production-ready systems from established hardware partners, making it easier to move from prototype to deployment quickly. Developers can start by selecting a prevalidated edge system that already integrates the module, power, thermal design, and I/O needed for robotics and other physical AI workloads. Many of the partner systems are built to utilize the module's advanced camera pipeline, with support for MIPI CSI and GMSL to handle demanding multi-camera, real-time vision workloads. With 16 lanes of MIPI CSI on Jetson T4000, partners can deliver platforms that ingest streams from multiple cameras concurrently, enabling sophisticated robotics, industrial inspection, and autonomous machines. These systems are engineered to support the JetPack SDK, CUDA, and broader NVIDIA AI software stack. Existing applications and models can usually be brought up with minimal changes.
Many partners also offer lifecycle support, regional certifications, and optional customization services, which help teams de-risk supply chain and compliance concerns as they scale from pilot to fleet deployments. To explore available systems and find the right fit for your application, visit the NVIDIA Ecosystem page. Summary With Jetson T4000 powered by JetPack 7.1, NVIDIA extends Blackwell-class AI, real-time reasoning, and advanced multimedia capabilities to a broader set of edge and robotics applications. From strong gains in LLM, speech, and VLA workloads to the introduction of TensorRT Edge-LLM and a unified Video Codec SDK, T4000 delivers a balance of performance, efficiency, and software maturity. Jetson T4000 enables developers to scale intelligently across performance tiers while building next-generation autonomous machines, perception systems, and physical AI solutions at the edge. Get started with the Jetson AGX Thor Developer Kit, and download the latest JetPack 7.1. Jetson T4000 modules are available. Comprehensive documentation, support resources, and tools are available through the Jetson Download Center and ecosystem partners. Have questions or need guidance? Connect with experts and other developers in the NVIDIA Developer Forum. Watch NVIDIA CEO Jensen Huang at CES 2026 and check out our sessions.
Latest open artifacts (#17): NVIDIA, Arcee, Minimax, DeepSeek, Z.ai and others close an eventful year on a high note interconnects 05.01.2026 14:03 0.623
Embedding sim.0.779
Entity overlap0.0444
Title sim.0.1017
Time proximity0.3062
NLP type other
NLP organization NVIDIA
NLP topic large language models
NLP country United States

Open original

Latest open artifacts (#17): NVIDIA, Arcee, Minimax, DeepSeek, Z.ai and others close an eventful year on a high note Ending the year with a bang. Florian Brand and Nathan Lambert Jan 05, 2026 Happy new year! The open ecosystem hasn't slowed down at all over the holiday period, which we know will continue right into and through 2026. There are a lot of great models in this issue, from GLM 4.7 and MiniMax M2.1 — open models that are starting to be "good enough" in the Claude Code form factor — and much stronger open models from Nvidia and Arcee to support the U.S.'s renewed motivation in the space. Our Picks K2-V2 by LLM360: LLM360, a project from MBZUAI, is back with their fully open-source model series. This model is a 70B dense model, and they release all of the data, from pre-training (12T tokens) to SFT, which they generated using GPT-OSS 120B at all three reasoning levels. They also release multiple checkpoints from the various stages of training. We expect a lot more from them specifically and the growing fully-open model community in 2026! NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 by nvidia: As luck would have it, NVIDIA released an update to their Nemotron series right after our year in review tierlist for 2025. Similar to other NVIDIA models, the vast majority of the data is released openly. Furthermore, they continue with the Mamba2-Transformer architecture, but make it a MoE as well. And to top it all off: They also announce two more sizes, slated for a release in H1 2026 (likely on the earlier side): Super, ~100B-A10B and Ultra, ~500B-A50B, which will use Latent MoE and multi-token prediction (MTP). 2026 will be an exciting year! Trinity-Mini by arcee-ai: Arcee is not an unknown entity to the avid Artifacts reader. Now they are coming with a series of models: Nano, a 6B-A1B MoE, and Mini, a 26B-A3B MoE, are available today and trained on 10T tokens. They also plan to release Large, a 420B-A13B MoE trained on 20T tokens, in the coming weeks. We played with the Mini model and were impressed by its capabilities! As readers know, we're also very happy to highlight new and rapidly improving open model builders in the U.S. using permissive licenses. GLM-4.7 by zai-org: Zhipu, which will IPO on January 8th, dropped a really capable model just before Christmas with 4.7. GLM-4.7 is not close to (API model) SOTA performance on the usual academic benchmarks like GPQA or SWE-bench Verified, but manages to hold its performance beyond that in a broader suite of tasks like GPVal-AA or DesignArena. I (Florian) have tested this model extensively over the last few days using the Z.ai API (and the corresponding coding subscription at $28/yr) with OpenCode as the CLI (which also offers the model for free at the time of writing) and was more than impressed by the quality of this model. In certain areas (especially in UI generation for websites), I preferred its outputs over Opus, while in other areas, it was more or less on the level of Sonnet 4.5, which was released a mere 4 months ago. However, the model is quite slow (the cheapest coding plan is slower compared to their other offerings) and its long-context performance is worse than other closed models, especially after 100K tokens. Furthermore, it is text-only, which I "fixed" by adding Gemini 3.0 Flash as a subagent in OpenCode. But again, this is an open model, dirt cheap and self-hostable on a node of H100s!
Llama-3.3-8B-Instruct by allura-forge: For some reason, Llama 3.3 8B is a thing that exists, but was never released publicly the way the other models were. However, someone got access to the weights by using Meta's Llama API and uploaded them to HuggingFace. Models Flagship Apriel-1.6-15b-Thinker by ServiceNow-AI: An update to the Apriel series, focusing on using fewer tokens per answer while maintaining performance. They achieved this by using GSPO with length and verbosity penalties. MiMo-V2-Flash by XiaomiMiMo: Xiaomi surprised everyone by dropping a 309B-A15B MoE. The first model, which we also covered, was just a 7B dense model. Members in our subscriber-only Discord used the model and liked its writing style. However, they also found that it is lacking in terms of agentic performance and function calling. DeepSeek-V3.2 by deepseek-ai: Another update to the V3 series, which integrates DSA. They also trained and released a "high compute" version, V3.2 Speciale, which claims gold-medal performance on the 2025 IMO and IOI.