Кластер #4184 - News Clusters

Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities | NVIDIA Technical Blog

closed

Тип события	other
Тема	ai infrastructure
Организация	OpenAI
Страна	United States

Статей	35
Уник. источников	10
Важность / Момент	3.11 / 0
Период	17.02.2026 18:00 — 05.03.2026 17:00
Создан	06.04.2026 06:19:56

Статьи в кластере 35

Заголовок

Источник

Дата публикации

Score

Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities | NVIDIA Technical Blog

nvidia_dev_blog

17.02.2026 18:00

Embedding sim.	1
Entity overlap	1
Title sim.	1
Time proximity	1

NLP тип	product_launch
NLP организация	NVIDIA
NLP тема	retrieval-augmented generation
NLP страна

Открыть оригинал

Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, and embedded metadata. Financial reports carry critical insights in tables, engineering manuals rely on diagrams, and legal documents often include annotated or scanned content. 

 Retrieval-augmented generation (RAG) was created to ground LLMs in trusted enterprise knowledge—retrieving relevant source data at query time to reduce hallucinations and improve accuracy. But if a RAG system processes only surrounding text, it misses key signals embedded in tables, charts, and diagrams—resulting in incomplete or incorrect answers.

 An intelligent agent is only as good as the data foundation it’s built on. Modern RAG must therefore be inherently multimodal—able to understand both visual and textual context to achieve enterprise-grade accuracy. The NVIDIA Enterprise RAG Blueprint is built for this, providing a modular reference architecture that connects unstructured enterprise data to the intelligent systems built on top of it. 

 The blueprint also serves as a foundational layer for the NVIDIA AI Data Platform , helping to bridge the traditional gap between compute and data. By enabling retrieval and reasoning closer to the data layer, it preserves governance, reduces operational friction, and makes enterprise knowledge immediately usable by intelligent systems. The result is a modern AI data stack—storage that can retrieve, enrich, and reason alongside your models.

 While the Enterprise RAG Blueprint provides many configurable options, this post highlights the following five key configurations that most directly improve accuracy and contextual relevance across enterprise use cases: 

 Baseline multimodal RAG pipeline

 Reasoning

 Query decomposition

 Filtering metadata for faster and precise retrieval

 Visual reasoning for multimodal data

 The post also explains how the blueprint can be embedded into AI data platforms to transform traditional repositories into AI-ready knowledge systems. 

 Accuracy metrics in this blog are measured using the RAGAS framework , using well-known public datasets. Learn more about evaluating your NVIDIA RAG Blueprint system .

 1. Document ingestion and understanding

 Before an agent can deliver insights, it must be perfectly grounded in your data. This foundational configuration focuses on intelligent document ingestion and core RAG functionality. 

 The Enterprise RAG Blueprint uses NVIDIA NeMo Retriever to extract multimodal enterprise content—text, tables, charts and graphs, and infographics—then embeds that content into text for indexing in a vector database. At query time, the blueprint runs semantic retrieval, reranking, and Nemotron LLM to generate a grounded answer.

 To maximize performance, this baseline intentionally avoids image captioning and heavy reasoning, making it the ideal starting point for production deployments. Deploy this baseline on Docker .

 Benefits of document ingestion and understanding 

 This foundational configuration is the blueprint’s highest-efficiency pipeline, optimized for accuracy and throughput while keeping GPU cost and time to first token (TTFT) low. This configuration establishes your baseline performance for retrieval quality and LLM grounding.

 Figure 1. RAG pipeline

 Table 1 summarizes the overall impact across a few datasets.

 Accuracy (v2.3 Default)
 MM = Multimodal, TO = Text-Only

 Dataset
 Type
 Accuracy

 RAG Battle
 MM
 0.809

 KG RAG
 MM
 0.565

 FinanceBench
 MM
 0.633

 BO767
 MM
 0.910

 HotpotQA
 TO
 0.671

 Google Frames
 MM
 0.509

 Table 1. Accuracy impact of baseline configuration (higher is better)

 2. Reasoning

 When you turn on reasoning in the RAG blueprint, you enable the LLM to interpret the retrieved evidence, and synthesize logically grounded answers. This is the easiest change to get an accuracy boost for many applications. Enable reasoning for the NVIDIA Enterprise RAG Blueprint .

 Table 2 summarizes the overall impact across several sample datasets.

 Accuracy (v2.3 Default) plus Reasoning
MM = Multimodal, TO = Text-Only

 Dataset
 Type
 Reasoning on
 Default

 RAG Battle
 MM
 0.85
 0.809

 KG RAG
 MM
 0.58
 0.565

 FinanceBench
 MM
 0.69
 0.633

 BO767
 MM
 0.88
 0.91

 Table 2. Accuracy impact of enabling reasoning versus baseline configuration (higher is better)

 Benefits of reasoning 

 For any use case involving mathematical operations or complex data comparison, a typical simple similarity or hybrid search will not suffice. Reasoning is required to correct errors and ensure precise contextual understanding. Accuracy improvements across datasets averaged ~5%, with several cases demonstrating dramatic reasoning-driven corrections. 

 Examples

 In the FinanceBench dataset, the baseline configuration incorrectly computed the Adobe FY2017 operating cash flow ratio as 2.91. After enabling reasoning, the model produced the correct answer, 0.83. In addition, the Ragbattle dataset demonstrates the accuracy improvement from enabling VLM.

 3. Query decomposition 

 Answering complex user questions often requires pulling facts from multiple places in the data foundation. Query decomposition breaks a single question into smaller subqueries, retrieves evidence for each, and recombines the results into a complete, grounded response. Turn on query decomposition for the NVIDIA Enterprise RAG Blueprint .

 Figure 2. Response accuracy before and after query decomposition

 Benefits of query decomposition

 Query decomposition significantly improves accuracy for multihop and context-rich questions that span multiple paragraphs or documents. It does add extra LLM calls (increasing latency and cost), but the accuracy gains are often worth it for mission-critical enterprise use cases. Query decomposition can also be paired with reasoning for an additional boost when needed.

 Example

 As NVIDIA AI Data platform partners evolve to offer more relevant and accurate retrieval, this feature can either include some level of query processing as part of the data platform or can be left to the agent. Learn more about how query decomposition can be an approach in some use cases . 

 Table 3 shows the overall impact across a few datasets.

 Accuracy (v2.3 Default) plus Query Decomposition
MM = Multimodal, TO = Text-Only

 Dataset
 Type
 Query decomposition
 Default

 RAG Battle
 MM
 0.854
 0.809

 FinanceBench
 MM
 0.631
 0.633

 BO767
 MM
 0.885
 0.91

 HotpotQA
 TO
 0.725
 0.671

 Google Frames
 MM
 0.6
 0.5094

 Table 3. Accuracy impact of query decomposition versus baseline configuration (higher is better)

 4. Filtering metadata for faster and precise retrieval

 Metadata, such as author, date, category, and security tags, has always been integral to enterprise data. In RAG pipelines, metadata filters can be leveraged to narrow the search space and align retrieved content with the right context, significantly improving retrieval precision and speed. 

 The RAG blueprint supports custom metadata ingestion and automatic query generation based on that data. To leverage your custom metadata, see Advanced Metadata Filtering with Natural Language Generation . To learn more about what’s possible with this feature set, check out the example notebook on the NVIDIA-AI-Blueprints/rag GitHub repo. 

 Benefits of metadata filtering

 Metadata filtering narrows the search space for faster retrieval and improves precision by aligning retrieved content with context. This allows developers to leverage metadata without manual filter logic to achieve higher throughput and contextual relevance. When metadata filtering capabilities are embedded directly into AI data platforms, it can make your storage smarter, leading to faster retrieval and lower latency.

 Example

 To provide an example, consider two documents that are ingested with the following metadata:

custom_metadata = [
 {
 "filename": "ai_guide.pdf",
 "metadata": {
 "category": "AI",
 "priority": 8,
 "rating": 4.5,
 "tags": ["machine-learning", "neural-networks"],
 "created_date": "2024-01-15T10:30:00"
 }
 },
 {
 "filename": "engineering_manual.pdf",
 "metadata": {
 "category": "engineering",
 "priority": 5,
 "rating": 3.8,
 "tags": ["hardware", "design"],
 "created_date": "2023-12-20T14:00:00"
 }
 }

 When using metadata with dynamic filter expression, a query such as, “Show me high-rated AI documents with machine learning tags created after January 2024” will translate to one that automatically generates a filtering expression such as:

filter_expression = `content_metadata["category"] == "AI" and content_metadata["rating"] >= 4.0 and
array_contains(content_metadata["tags"], "machine-learning") and content_metadata["created_date"] >= "2024-01-01”`

 With metadata filtering enabled, the system retrieved 10 focused citations from one document, ai_guide.pdf
, achieving 100% precision on the target domain while reducing search space by 50%.

 5. Visual reasoning for multimodal data 

 Enterprise data is visually rich. Where traditional text-only embeddings fall short, vision language models (VLMs) such as NVIDIA Nemotron Nano 2 VL (12B) introduce visual reasoning into the pipeline. Learn more about how to leverage a VLM for generation in the RAG Blueprint. 

 Figure 3. Before and after leveraging a VLM for generation

 Benefits of visual reasoning 

 Visual reasoning is crucial for handling real-world enterprise documents. Integrating a VLM in the generation pathway enables the RAG system to interpret images, charts, and infographics, making it possible to accurately answer queries where the information lies in a structured visual element rather than just the surrounding text. 

 Example 

 A significant accuracy improvement was observed when a VLM was enabled for the Ragbattle dataset in the RAG Blueprint, especially when the answer was in a visual element. Note that enabling VLM inference can increase response latency from additional image processing. Consider this tradeoff between accuracy and speed based on your requirements. Learn more about the accuracy improvements with VLM for the Ragbattle dataset.

 Transforming enterprise storage into an active knowledge system

 The Enterprise RAG Blueprint demonstrates how the progressive adoption of these five capabilities—from reasoning and metadata-driven retrieval to multimodal understanding—directly enhances the accuracy and groundedness of your intelligent agents. Each capability offers a unique balance between latency, token cost, and contextual precision, providing a flexible, tunable framework that can be adopted to various enterprise use cases.

 This accelerates the evolution of the data foundation itself. The NVIDIA AI Data Platform transforms enterprise data into AI-searchable knowledge. As NVIDIA partners evolve their storage offerings, this blueprint serves as a reference for delivering embedded RAG capabilities that leverage metadata to enforce permissions, track changes, and provide highly accurate retrieval directly at the storage layer.

 NVIDIA storage partners are building AI data platforms based on the NVIDIA reference design that are transforming enterprise storage from a passive repository to become an active intelligent system in the AI workflow. The result is a next-generation enterprise data infrastructure: faster, smarter, and purpose-built for the age of generative AI.

 What’s new with the NVIDIA Enterprise RAG Blueprint

 The latest release of the NVIDIA EnterpriseRAG Blueprint deepens its focus on serving agentic workflows. It introduces first-class document-level summarization with both shallow and deep strategies, enabling agents to quickly assess relevance, narrow search space, and balance accuracy with latency. A new data catalog improves discoverability and governance across large corpora, while upgrades to the best-in-class Nemotron RAG models further enhance retrieval quality, reasoning, and generation performance—making RAG a more efficient, agent-ready foundation for enterprise-scale knowledge systems.

 Get started with enterprise-grade RAG

 Ready to integrate these five capabilities into your RAG use cases? Access the modular code, documentation, and evaluation notebooks for free within the NVIDIA Enterprise RAG Blueprint .

 Make your enterprise data AI-ready and transform your production data into an intelligent knowledge system with embedded RAG capabilities with NVIDIA AI Data Platform. Contact an NVIDIA AI storage partner to get started with your own NVIDIA-powered AI data platform. 

 Discuss (1)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | General | Blueprint | Nemotron | Intermediate Technical | Best practice | AI Agent | AI Data Platform | AI-Ready Data | featured | LLMs | Retrieval Augmented Generation (RAG)

 About the Authors

 About Shruthii Sathyanarayanan

 Shruthii Sathyanarayanan is a product marketing manager in the NVIDIA Enterprise Computing group with a focus on enterprise AI and virtualization. Shruthii holds a bachelor’s degree in Computer Engineering and Business from the University of Illinois at Urbana-Champaign and has previously held roles in software development and product management.

 View all posts by Shruthii Sathyanarayanan

 About Sumit Bhattacharya

 Sumit Bhattacharya is a senior engineering manager at NVIDIA, working on AI blueprints and conversational AI. His primary area of focus is building scalable, low-latency solutions for Enterprise RAG, data flywheels, and voice agents. He also has extensive experience of working on NLP, dialog systems, and voice assistants. He holds a master’s degree in Electrical Engineering from the Indian Institute of Technology, Kharagpur, and has over 18 years of industry experience.

 View all posts by Sumit Bhattacharya

 About Punit Kumar

 Punit Kumar is a senior system software engineer at NVIDIA with a focus on the RAG Blueprint, production RAG systems, and features that improve accuracy and performance. Punit holds a master’s degree in Data Science and Engineering from BITS Pilani and a BTech in Computer Science from SKIT Jaipur and has previously held roles in R&D in AI engineering and in data engineering.

 View all posts by Punit Kumar

 About Pranjal Doshi

 Pranjal Doshi is a software engineer at NVIDIA, specializing in retrieval-augmented generation (RAG) and the productionization of large language models. Pranjal holds a master’s degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kharagpur and focuses on bridging the gap between AI research and scalable, real-world applications.

 View all posts by Pranjal Doshi

 About Nikhil Kulkarni

 Nikhil Kulkarni is a software engineer at NVIDIA specializing in the productization of the RAG Blueprint, with an emphasis on accuracy improvements, performance optimizations, and deployment. Nikhil holds a bachelor’s degree in Computer Science and focuses on translating AI models into robust, enterprise-grade architectures. He has previously worked on building speech-based AI agents at NVIDIA.

 View all posts by Nikhil Kulkarni

 Comments

 Related posts

 Chat With Your Enterprise Data Through Open-Source AI-Q NVIDIA Blueprint

 Chat With Your Enterprise Data Through Open-Source AI-Q NVIDIA Blueprint

 NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster

 NVIDIA NeMo Retriever Delivers Accurate Multimodal PDF Data Extraction 15x Faster

 Insights, Techniques, and Evaluation for LLM-Driven Knowledge Graphs

 Insights, Techniques, and Evaluation for LLM-Driven Knowledge Graphs

 Translate Your Enterprise Data into Actionable Insights with NVIDIA NeMo Retriever

 Translate Your Enterprise Data into Actionable Insights with NVIDIA NeMo Retriever

 Scaling Enterprise RAG with Accelerated Ethernet Networking and Networked Storage

 Scaling Enterprise RAG with Accelerated Ethernet Networking and Networked Storage

 Related posts

 Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

 Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics

 Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics

 Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo

 Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo

 How to Build a Document Processing Pipeline for RAG with Nemotron 

 How to Build a Document Processing Pipeline for RAG with Nemotron 

 L

 T

 F

 R

 E

OpenAI and Amazon announce strategic partnership

openai

27.02.2026 05:30

0.783

Embedding sim.	0.9138
Entity overlap	0.3636
Title sim.	0.375
Time proximity	0.4286

NLP тип	partnership
NLP организация	OpenAI
NLP тема	ai infrastructure
NLP страна

Открыть оригинал

OpenAI and Amazon announce a strategic partnership bringing OpenAI’s Frontier platform to AWS, expanding AI infrastructure, custom models, and enterprise AI agents.

Joint Statement from OpenAI and Microsoft

openai

27.02.2026 05:30

0.756

Embedding sim.	0.8662
Entity overlap	0.0667
Title sim.	0.1918
Time proximity	1

NLP тип	other
NLP организация	Microsoft
NLP тема	artificial intelligence
NLP страна

Открыть оригинал

Microsoft and OpenAI continue to work closely across research, engineering, and product development, building on years of deep collaboration and shared success.

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

nvidia_dev_blog

18.02.2026 18:00

0.737

Embedding sim.	0.8376
Entity overlap	0.0833
Title sim.	0.2993
Time proximity	0.8601

NLP тип	benchmarking
NLP организация	NVIDIA
NLP тема	large language models
NLP страна

Открыть оригинал

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises.

 This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale. 

 All benchmarks were executed using NVIDIA NIM microservices . This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

 The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

 77% of full GPU throughput and 86% of full-GPU concurrent user capacity using only 0.5 GPU fraction, with time to first token (TTFT) under one second

 Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions

 Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs

 Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact

 Production-ready autoscaling with no latency cliffs or error spikes during scale-out

 This benchmarking shows that fractional GPU scheduling is no longer an optimization technique. It is a foundational capability for running large-scale, multimodel LLM inference efficiently in production.

 LLM inference enterprise challenges

 Enterprise IT departments operate with a finite, often fixed inventory of GPUs. Deploying LLM for inference requires a dedicated GPU (or multiple GPUs) to be allocated to a single LLM instance, even during sporadic traffic. This is necessary because the model must load all the weights in advance of an inference request, so the latency for generating tokens (responses) is as low as possible. 

 As a result, most LLMs consume all GPUs allocated, so it becomes difficult to run more than one model using the same pool of GPUs available. In this scenario, enterprise IT must manually maintain the GPUs to LLM allocation, figure out when and how to scale LLMs as users requesting inference grow to maintain latency between chat requests and tokens generated, and cannot repurpose idle GPUs during off-peak hours.

 Ideally, enterprises want an elastic environment where GPUs can be used to run multiple LLMs, not just one, without significantly impacting the number of users who can run inference or latency for those users. They can scale GPUs based on workloads, and scale down GPUs during off-peak hours, such that other workloads can consume the same GPUs.

 Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud 

 The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters and dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a flexible, production-ready framework for maximizing GPU ROI. 

 In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.

 Dynamic GPU fractioning

 NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads simultaneously. Users specify their memory requirements directly and the scheduler allocates resources on-demand without any preconfiguration. This is particularly impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation. 

 Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.

 Intelligent workload scheduling

 NVIDIA Run:ai scheduler acts as the “brain” of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met. 

 The scheduler also automatically scales LLMs up or down based on consecutive users running inference and token latency depending on the SLA criterias given by the admin. These strategies collectively drive higher utilization rates, lower operational complexity, and reduce total cost of ownership (TCO). 

 Teams at NVIDIA and Nebius ran benchmarking to discover the impact NVIDIA Run:ai has on running inference at scale for various LLMs. Scale tests were performed on the number of concurrent users that can run various chat requests and recording the TTFT , output throughput (tokens/second generated), and GPU utilization. At NVIDIA these tests were run on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architectures with NVIDIA H100 NVL GPUs. At Nebius AI Cloud the tests were run on a cluster built following the HGX based Enterprise RA for NVIDIA HGX B200 GPUs.

 Benchmarking setup

 The software stack is based on NVIDIA Enterprise RAs (Figure 1). This includes the NVIDIA AI Enterprise stack to manage GPUs using NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed in a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA .

 Infrastructure

 Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

 Figure 1. NVIDIA Run:ai deployment on NVIDIA Enterprise Reference Architecture

 Model selection

 The four models selected span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with different memory footprints. 

 Model
 Number of parameters
 Memory requirements
 Use case

 Llama 3.1 8B Instruct
 8B
 ~16 GB
 General-purpose chat

 Phi-4-Mini
 3.8B
 ~8 GB
 Lightweight assistant

 Qwen3-14B
 14B
 ~28 GB
 Reasoning

 Qwen-Embeddings-0.6B
 0.6B
 ~1.5 GB
 Document embedding and reranking

 Table 1. Models selected span diverse sizes, memory requirements, and use cases

 Notably, the largest model (Qwen3-14B) occupies only ~35% of one NVIDIA H100 NVL GPU 80 GB capacity, illustrating why traditional whole-GPU allocation might leave so much capacity stranded. 

 Methodology

 GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load.

 Primary metrics include:

 TTFT: Latency from request submission to first response token

 Output throughput: Tokens generated per second per session

 GPU utilization: Percentage of GPU memory consumed under load

 Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency SLA drops)

 Test conditions

 Each model was benchmarked under the following five configurations:

 Baseline : LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)

 Full GPU(s) with NVIDIA Run:ai : 1.0 GPU allocation per model replica

 Fractional 0.5 GPU(s) : NVIDIA Run:ai with 0.5 GPU allocation per model replica

 Fractional 0.25 GPU(s) : NVIDIA Run:ai with 0.25 GPU allocation per model replica

 Mixed mode : Multiple LLMs co-located on shared GPUs

 For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.

 Benchmarking results using NVIDIA Run:ai

 This section presents observations based on the results captured from GenAI Perf. 

 Fractional GPU efficiency at half allocation

 Based on the results captured from GenAI Perf, NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared to native Kubernetes, fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.

 No scheduler overhead

 NVIDIA Run:ai introduces no measurable performance penalty compared to native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.

 Fractional GPU efficiency

 Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users, where the TTFT for each user did not go over one second (1,000 ms)—86% of the full GPU capacity (10,200 CCU). This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

 Figure 2. Concurrent user scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

 Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec—77% of full GPU throughput 198,680 tokens/sec), as shown in Figure 3.

 All three configurations—without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU—scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale are not artifacts of small deployments.

 Figure 3. Output throughput scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

 Smaller models scale further with quarter-GPU fractions

 Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much concurrency and throughput this enables.

 Figure 4. Concurrent user scaling (1-32 GPUs) for Phi-4-Mini with TTFT under 1,000 ms on an NVIDIA HGX B200 cluster running on Nebius AI Cloud

 On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.

 Figure 5. Throughput at scale for Phi-4 Mini NIM on NVIDIA HGX B200 cluster running on Nebius AI Cloud

 Multimodel co-location on fractional GPUs in Nebius AI Cloud

 NVIDIA Run:ai supports allocating fractional GPUs dynamically. In previous tests, the same number of users were run on fractional GPUs. One test loaded two models (Llama 3.1 8B and DeepseekR1-Distill-8B) on fractional 0.5 NVIDIA H100 NVL GPUs using NVIDIA Run:ai. A single NVIDIA H100 NVL GPU was running two inference models. 

 Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact increased when the scale reached more than 50% of the GPUs in the cluster. At max scale, the TTFT for the combined users dropped by 3x while the throughput dropped only by 0.4x.

 Figure 6. Total number of concurrent users on cluster powered by NVIDIA H100 NVL GPU server running two models on a single GPU

 Traditional Kubernetes schedulers don’t support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation without manual capacity planning. 

 NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

 Figure 7. The total system users that ran with multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud more than tripled

 Nebius ran a similar test co‑deploying 0.5 GPU Llama 3.1 8B, 0.25 GPU Phi‑4 Mini, and 0.125 GPU Qwen‑Embeddings. The cluster achieved predictable scaling with no cross‑model interference, and combined throughput exceeded 350K TPS at full scale (Figure 8). The total number of concurrent users that can run inference went up by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin‑pack heterogeneous inference workloads without destabilizing latency or utilization.

 Figure 8. Total system throughput while running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud

 Autoscaling NIM LLM with NVIDIA Run:ai

 NVIDIA Run:ai supports auto-scaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius set up Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service.

 Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.

 Figure 9. Autoscaling results for Llama 3.1 8B on NVIDIA HGX B200 in Nebius AI Cloud

 Get started with GPU fractioning in NVIDIA Run:ai 

 NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:

 GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets

 Near‑linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact

 Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes

 Production‑ready autoscaling for fractional LLM inference—no SLA cliffs during scale‑out

 More workloads per GPU, higher concurrency, and reduced fleet size

 For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius .  

 Get started with the latest version of NVIDIA Run:ai v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234] .

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | Data Science | General | NIM | Run:ai | Intermediate Technical | Benchmark | Deep dive | AI Inference | featured | Inference Performance | LLMs

 About the Authors

 About Boskey Savla

 Boskey Savla is a product manager at NVIDIA focusing on defining benchmarks and architectures for LLMs and agentic flows for enterprise customers. She has 18 years of experience in systems and operations. She started as a Linux sys admin and moved on to solution engineering and system engineering roles focusing on virtual, PaaS, cloud, and Kubernetes-based solutions. She is the author of the book 'Kubernetes on vSphere for Dummies' and has spoken and conducted workshops at various events like VMworld, AWS Re:Invent, and Kubecon.

 View all posts by Boskey Savla

 About Ekin Karabulut

 Ekin Karabulut is a data scientist and developer advocate previously at Run:ai, now at NVIDIA, exploring the efficient usage of large models in different production scenarios. Previously she worked on privacy implications of federated learning, focused on distributed training techniques and got fascinated by inefficiencies in GPU usage in research and industry settings. She established the AI Infrastructure Club and is based in Munich, Germany.

 View all posts by Ekin Karabulut

 About Roman Iurkov

 Roman Iurkov is a cloud solutions architect at Nebius, working closely with customers to design, onboard, and optimize a wide range of AI/ML use cases and data-intensive workloads. His frontier role centers on understanding customer requirements and translating them into scalable, reliable, and cost-efficient solutions, with a strong focus on the strategic partnership with NVIDIA and driving adoption of NVIDIA Run:ai and DGX Lepton. Bringing over a decade of experience in large enterprise environments, Roman helps customers confidently and smoothly transition to modern cloud platforms.

 View all posts by Roman Iurkov

 Comments

 Related posts

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 Removing the Guesswork from Disaggregated Serving

 Removing the Guesswork from Disaggregated Serving

 Building Scalable and Fault-Tolerant NCCL Applications

 Building Scalable and Fault-Tolerant NCCL Applications

 Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

 Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

 Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch

 Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch

 Related posts

 Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

 Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

 Building a Zero-Trust Architecture for Confidential AI Factories

 Building a Zero-Trust Architecture for Confidential AI Factories

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 L

 T

 F

 R

 E

OpenAI announces Frontier Alliance Partners

openai

23.02.2026 05:30

0.729

Embedding sim.	0.8817
Entity overlap	0.4286
Title sim.	0.1311
Time proximity	0.378

NLP тип	partnership
NLP организация	OpenAI
NLP тема	enterprise ai
NLP страна

Открыть оригинал

OpenAI announces Frontier Alliance Partners to help enterprises move from AI pilots to production with secure, scalable agent deployments.

Introducing OpenAI for India

openai

18.02.2026 21:00

0.721

Embedding sim.	0.8401
Entity overlap	0.0278
Title sim.	0.1235
Time proximity	0.9375

NLP тип	other
NLP организация	OpenAI
NLP тема	ai adoption
NLP страна	India

Открыть оригинал

OpenAI for India expands AI access across the country—building local infrastructure, powering enterprises, and advancing workforce skills.

5 New Digital Twin Products Developers Can Use to Build 6G Networks | NVIDIA Technical Blog

nvidia_dev_blog

01.03.2026 07:00

0.721

Embedding sim.	0.8084
Entity overlap	0.0476
Title sim.	0.2656
Time proximity	1

NLP тип	product_launch
NLP организация	NVIDIA
NLP тема	ai infrastructure
NLP страна

Открыть оригинал

To make 6G a reality, the telecom industry must overcome a fundamental challenge: how to design, train, and validate AI-native networks that are too complex to be tested in the physical world.

 The NVIDIA Aerial Omniverse Digital Twin (AODT) solves this by enabling a continuous integration/continuous development (CI/CD)-style workflow where Radio Access Network (RAN) software is trained, simulated, and validated in a physics-accurate environment before field deployment. As discussed in a recent post , this approach bridges the gap between statistical models and real-world network performance.

 But the usability of any technology is as important as the technology itself. That’s why NVIDIA designed AODT not just as a powerful simulation platform, but with a modular and accessible architecture that partners and developers can easily integrate into their own workflows.

 Within two years of its launch, AODT’s modular architecture is growing an ecosystem of commercial partner products, making high-fidelity simulation accessible from desktops to the cloud. This blog post spotlights five NVIDIA partners using the modular AODT platform to build commercial solutions. From RAN digital twins and cloud-scale channel simulations to high-fidelity network planning, these solutions provide a unified foundation to plan, build, and test AI-native 6G networks.     

 The role of AODT in accelerating network innovation 

 Part of the NVIDIA AI Aerial platform, AODT provides the physics-accurate simulation engine required to train and fine-tune AI models across the RAN, with unprecedented scale, fidelity, and accuracy.

 Designed to be modular , AODT enables developers to integrate or customize components based on specific use cases and development needs. Developers can start with built-in NVIDIA models for rapid prototyping or plug in their own, such as proprietary propagation engines, RAN digital twins, and user equipment (UE) digital twins, to create a full-network digital twin environment.

 Figure 1. Modular architecture of AODT

 The following are five NVIDIA partners using the modular AODT platform to build commercial solutions. 

 Nokia RAN Digital Twin

 Nokia’s new RAN Digital Twin —integrated with AODT—combines Nokia’s advanced RAN algorithms with the NVIDIA physics-based simulation engine. The AODT engine uses accelerated ray tracing to model how radio waves interact with real-world materials and environments like glass, concrete, trees, or vehicles. Nokia’s Digital Twin Core analyzes network performance at the product level for base stations and user equipment. This modular approach enables operators to optimize site placement, refine beamforming strategies, and validate algorithms before hardware deployment in the physical world.

 Figure 2. Representation of Nokia RAN Digital Twin integration with AODT

 Keysight Technologies 

 Keysight ’s Channel Studio RaySim solution, powered by AODT, transforms traditional stochastic and semi-deterministic channel modeling into site-specific, fully deterministic channel modeling required for 6G and AI-RAN development. RaySim delivers precise, 6G-ready ray-tracing channel models at speed and scale, enabling researchers to explore new waveforms, test mobility scenarios, and evaluate complex propagation environments in photorealistic digital worlds.  

 Building on RaySim, Keysight’s AI‑RAN Simulation Toolset enables developers to integrate the NVIDIA AI Aerial platform to create hardware testbeds and digital twins that facilitate training and benchmarking of AI-RAN workloads in an integrated, end-to-end workflow.  

 Figure 3. Representation of Keysight’s RaySim integration with AODT

 VIAVI Solutions TeraVM AI RSG

 VIAVI’s TeraVM AI RAN Scenario Generator (AI RSG) , fully integrated with AODT, gives developers the ability to simulate detailed, physics-grounded RAN behavior. Now available on AWS Cloud, AI RSG provides scalable, on-demand access to high-fidelity RAN testing—for teams to parallelize experiments, automate benchmarking, and accelerate AI-RAN validation cycles.

 Calibration is essential for creating an accurate digital twin tailored to customer-specific networks. AODT is calibrated with field measurements from the VIAVI OneAdvisor 800 Wireless, creating highly accurate digital twin representations of customer cell sites and producing the most valuable datasets for machine learning and AI-driven RAN optimization.

 Figure 4. Representation of VIAVI’s AI RSG integration with AODT

 Ansys Perceive EM and Ansys HFSS

 Ansys, part of Synopsys, is integrating Ansys HFSS and Ansys Perceive EM software with AODT, expanding the capabilities of these tools and enabling full network simulation for users. High-frequency electromagnetic simulation software (HFSS) provides physics-accurate antenna and array design. Perceive EM radio frequency channel radar signature simulation software extends electromagnetic fidelity to wireless channel modeling in detailed, dynamic, motion-rich environments. AODT scales those models to full network deployments. This workflow forms a continuous electromagnetic chain, from antenna to network, enabling researchers to train and validate AI-RAN and integrated sensing and communications (ISAC) systems with true physical accuracy.

 Figure 5. Representation of Ansys’s HFSS and Ansys Perceive EM simulation software integration with AODT

 Amazon Web Services (AWS)

 With AWS, AODT moves to the cloud, giving researchers and network operators on‑demand access to large‑scale, physics‑accurate network simulation. Running AODT on AWS enables teams to spin up virtual test environments that replicate city‑scale networks, experiment with new RAN topologies, and analyze performance under dynamic, real‑world conditions—all without maintaining dedicated on‑premises infrastructure.

 AWS has leveraged the NVIDIA three-computer “Train → Simulate → Deploy” system to enable AI-native networks through cloud-scale intelligence. In the Train phase, Amazon Bedrock and Amazon SageMaker train domain-specific LLMs on RAN data, including R1 interface telemetry and configuration procedures, enabling models to understand and reason over RAN control signaling, resource management, and protocol-level behaviors. In the Simulate phase, NVIDIA AODT validates implementations across physics-accurate scenarios in parallel, compressing validation timelines from months to days. In the Deploy phase, Agentic applications enable agentic coverage optimization and intelligent energy savings improvements. Central to this phase is the recursive data foundation — production outputs feed back into the training loop, enabling the model to improve continuously over time.  

 The future of 6G starts with simulation

 With NVIDIA Aerial Omniverse Digital Twin and its expanding ecosystem of partners, the telecom industry has a unified, physics-accurate foundation for creating, validating, and accelerating AI-native wireless systems.

 As the industry advances toward autonomous networks , simulation becomes essential: intelligent network agents—powered by AI—need trusted virtual environments to test and validate their recommendations before acting in live networks. Digital twins bridge that gap— closing the loop between training and deployment, enabling networks to self-learn, self-heal, and self-optimize in real time.

 Explore AODT partner solutions to kickstart your 6G research and development, and join the NVIDIA 6G Developer Program to collaborate with us in building the intelligent networks of the future.  

 Discuss (1)

 Like

 Tags

 Developer Tools & Techniques | Networking / Communications | Simulation / Modeling / Design | Telecommunications | Aerial | Omniverse | Intermediate Technical | Deep dive | 5G / 6G | featured | Industrial Digitalization / Digital Twin

 About the Authors

 About Cindy Goh

 Cindy leads Product Marketing for AI-RAN and 6G at NVIDIA, driving the strategy for the next generation of intelligent wireless networks. With a deep foundation in semiconductor innovation, she previously led Technical Marketing and IP Product Management at Intel and Altera. Cindy holds both an M.S. and B.S. in Electrical Engineering from the University of Southern California.

 View all posts by Cindy Goh

 About CC Chong

 CC Chong is the senior director and head of Aerial product management at NVIDIA. Before joining NVIDIA, she was most recently senior director and GM of wireless and access business unit in the Intel Programmable Solutions Group. Chong received her Ph.D., in electronics and electrical engineering from the University of Edinburgh in Scotland and her bachelor's in electronics and electrical engineering from the University of Manchester. She was a recipient of the Ten Outstanding Young Malaysian Awards under the category “Scientific and Technological Development” in 2006.

 View all posts by CC Chong

 Comments

 Related posts

 Improve AI-Native 6G Design with the NVIDIA Aerial Omniverse Digital Twin

 Improve AI-Native 6G Design with the NVIDIA Aerial Omniverse Digital Twin

 NVIDIA Aerial Omniverse Digital Twin Boosts Development of AI-Native Wireless and Deployment Flexibility

 NVIDIA Aerial Omniverse Digital Twin Boosts Development of AI-Native Wireless and Deployment Flexibility

 Developing Next-Generation Wireless Networks with NVIDIA Aerial Omniverse Digital Twin

 Developing Next-Generation Wireless Networks with NVIDIA Aerial Omniverse Digital Twin

 Boosting AI-Driven Innovation in 6G with the AI-RAN Alliance, 3GPP, and O-RAN

 Boosting AI-Driven Innovation in 6G with the AI-RAN Alliance, 3GPP, and O-RAN

 Accelerating the Future of Wireless Communication with the NVIDIA 6G Developer Program

 Accelerating the Future of Wireless Communication with the NVIDIA 6G Developer Program

 Related posts

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI

 Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI

 Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air

 Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air

 Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell

 Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell

 NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer

 NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer

 L

 T

 F

 R

 E

Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization | NVIDIA Technical Blog

nvidia_dev_blog

19.02.2026 17:30

0.703

Embedding sim.	0.7946
Entity overlap	0.1071
Title sim.	0.2741
Time proximity	0.8542

NLP тип	other
NLP организация	nvidia
NLP тема	ai hardware
NLP страна

Открыть оригинал

NVIDIA flagship data center GPUs in the NVIDIA Ampere , NVIDIA Hopper , and NVIDIA Blackwell families all feature non-uniform memory access (NUMA) behaviors, but expose a single memory space. Most programs therefore do not have an issue with memory non-uniformity. However, as bandwidth increases in newer generation GPUs, there are significant performance and power gains to be had when taking into consideration compute and data locality.

 This post first analyzes the memory hierarchy of the NVIDIA GPUs, discussing the power and performance impacts of data transfer over die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it presents results for running MIG mode versus unlocalized for the Wilson-Dslash stencil operator use case.

 Note: The techniques described in this post are exploratory, and the field is evolving quickly. New developments may supersede what is described here. Expect additional publications as updated capabilities and practices become available. 

 The term ‘NUMA node’ is used in the post to describe GPU-internal memory locality exposed through MIG. This does not imply full conventional NUMA capabilities.

 Memory hierarchy in NVIDIA GPUs 

 Consider the abstract view of the memory hierarchy with two NUMA nodes depicted in Figure 1. When a streaming multiprocessor (SM) on node 0 needs to access a memory location in the dynamic random-access memory (DRAM) of node 1, it must transfer data over the L2 fabric. In the case of NVIDIA Blackwell GPUs, each NUMA node is a distinct physical die, which adds latency and increases the power required for data transfer. Despite the added complexity, NUMA-unaware code can still achieve peak DRAM bandwidth.

 Figure 1. Abstract view of the GPU memory hierarchy across two NUMA nodes

 To address these drawbacks, it is beneficial to minimize data transfers between NUMA nodes. When a single memory space is presented to the user, NVIDIA architecture employs coherent caching in L2 to reduce data transfers between NUMA nodes. This mechanism helps prevent repeated accesses to the same memory address from refetching data over the L2 fabric interface. Ideally, once the address is fetched into the local L2 cache, all subsequent accesses to the same address will hit the cache. 

 Before the introduction of coherent caching, the unified L2 cache allowed all SMs to achieve peak bandwidth (as in NVIDIA Volta ), though latency varied depending on the proximity of the SM to different L2 segments. With the NVIDIA Ampere generation, larger chips introduced a hierarchy of NUMA nodes, each with its own L2 cache and a coherent connection to others. 

 While large data center GPUs since NVIDIA Ampere architecture have used this design (unlike smaller gaming GPUs), the L2 fabric connection sustains peak bandwidth as mentioned in NVIDIA Blackwell Ultra architecture.

  Two challenges have emerged as GPUs continue to grow: increased latency and power limitations.

 Increased latency: Accessing distant parts of the L2 cache has led to growing latency, which impacts performance, particularly for synchronization.

 Power limitations: On the largest GPUs, power consumption becomes a limiting factor when tensor cores are active. Reducing power consumption through localized L2 access enables decreasing the L2 fabric clock and raising the compute clock through a Dynamic Voltage and Frequency Scaling (DVFS) mechanism associated with GPU Boost . In this way, tensor core performance can be significantly improved.

 MIG reduces data transfers between NUMA nodes. Introduced with the NVIDIA Ampere architecture, this feature enables partitioning a single GPU into multiple instances. By using MIG, developers can create one GPU instance per NUMA node, thereby eliminating accesses over the L2 fabric interface. 

 This approach does come with its own set of costs, including the overhead of communicating between different GPU instances using PCIe. The following section presents results from running workloads using MIG mode and unlocalized memory to demonstrate the effectiveness of this approach.

 Data localization using MIG 

 MIG enables supported NVIDIA GPUs to be partitioned into multiple isolated instances, each with dedicated high-bandwidth memory, cache, and compute cores. This enables efficient and high-performance GPU utilization across multiple users or workloads. MIG can achieve up to 7x more GPU resources on a single GPU. It allows multiple virtual GPUs (vGPUs) and, consequently, virtual machines (VMs) to run in parallel on a single GPU, while providing the isolation guarantees that vGPUs offer.

 The capabilities provided by MIG can be leveraged to achieve NUMA node localization. By creating one MIG instance per NUMA node, you can ensure isolation between different GPU instances. This approach helps eliminate traffic between NUMA nodes.

 MIG allows the splitting of the actual GPU into GPU instances (GI), in which one or more compute instances (CIs) are defined. A CI contains all (in the case of a single CI per GI) or a portion of the SMs belonging to a GI. To enable localization within a GI, the idea is to create two GPU instances mapped onto each NUMA node. On a Blackwell GPU, you can enable MIG mode and list the available GPU instance profiles, as shown with the code in Figure 2.  

 Because Blackwell has two NUMA nodes (one per chiplet), look for the profile with the most SMs of which there are two instances. As shown in Figure 2, this is the profile with ID 9, of which there can be two instances.

 At this point, it’s necessary to create a CI in each GPU instance. This can be done using the commands shown in Figure 3. The main GPU and the GPU instances now have their own identifier hash codes. Use those for the two NUMA nodes:

MIG 3g.90gb Device 0: (UUID:
MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa)
 MIG 3g.90gb Device 1: (UUID:
MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197)

 To use these devices, add them to the CUDA_VISIBLE_DEVICES
 environment variable. For example, to run a two-process MPI job, you could create a wrapper script ( wrapper.sh
):

#!/bin/bash
#
case $SLURM_PROCID in
0)
 CUDA_VISIBLE_DEVICES=”MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa”
 ;;
1)
 CUDA_VISIBLE_DEVICES=”MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197”
 ;;
esac
$*

 Then launch the MPI jobs:

$ mpirun -n 2 ./wrapper.sh my_executable

 Finally, when all the work is done, the MIG mode can be turned off. 

 Figure 2. Enabling MIG mode and listing the available GPU instance profiles

 Figure 3. Creating compute instances for MIG

 Figure 4. Commands for turning off MIG instances

 What are the benefits of localization with MIG?

 As an example application to demonstrate the benefits of localization with MIG, examine the Wilson-Dslash stencil operator, a key kernel for lattice quantum chromodynamics (LQCD)  drawn from the QUDA library. This library is used to accelerate several large LQCD codes, such as Chroma and MILC.  

 The Dslash kernel is a finite difference operation on a 4D toroidal lattice, where data at each lattice site is updated depending on the values of its eight orthogonal neighbors. The four dimensions in this case are the usual spatial dimensions (X, Y, Z) and the time dimension (T). The kernel is memory bandwidth-bound.

 If the lattice is decomposed onto two NUMA nodes equally, say along the time axis, then each domain will need to access sites on the T-dimension boundaries of the other domain. As shown in Figure 5, green lattice sites on the boundaries of the subdomains need the red sites to complete their stencils. The lattice is notionally laid out onto the two NUMA Nodes. Green sites need red-sites to complete their stencils. Possible data paths are regular memory access (black arrows) when unlocalized, or MPI message passing through the host in MIG localized mode (black arrows).

 Figure 5. Memory accesses for Dslash kernel with MIG mode

 The most convenient way to access neighbors would be through the Shared L2 cache and the interconnect. However, when operating in MIG mode this path requires communication between the MIG instances through MPI using PCle or NVLink. As a result, this path will be slower compared to accessing the main memory attached to the MIG instance. 

 Workloads that require little to no communication between two MIG instances will tend to benefit more using the MIG mode. Instead, one packs the black sites on the boundaries and sends them through MPI. This step introduces additional latency (buffer packing, sending, and unpacking). While it saves GPU power by not using the shared L2 cache-to-cache interconnect, it does use power for its transfer through the host (for PCIe, for example). 

 The amount of data that needs to be transferred between the two processes is related to the number of face sites to be transmitted in the messages, specifically to the surface three-volume orthogonal to the direction of the split. For this example, the split is always in the T-direction, so that each NUMA node notionally ends up with (N s N t )/2 sites, where N s is the number of sites in our spatial volume and N t is the length of the time dimension. The surface to volume ratio is N s /(N s N t /2) = 2/N t . In the case of the problems, N t =64 is considered and the surface-to-volume ratio stays constant at 1/32 ~ 3.13%.

 Figure 6 shows the unlocalized case. The global memory is made up of two memories connected to the NUMA nodes through memory controllers. The colored highlights on the lattices indicate that data may come from either the local DRAM or from the remote DRAM through the shared L2.

 Figure 6. Memory accesses for the unlocalized case

 This is to be compared with the baseline case, where MIG is not employed. Neither the data nor the processing are localized in this case, and the scenario is better represented with Figure 6. Each NUMA node receives its data both from its local memory controller and also from the other NUMA node. In fact, there is only one global lattice and the separation onto two parts for the two NUMA nodes in the figure is artificial. 

 In this scenario, thread blocks to process a collection of sites are assigned to the various NUMA nodes purely at the whim of the scheduler. Since the data is distributed evenly over the two NUMA memories, much more data is transferred across the shared L2 than in the case of the MIG localization where only the minimally required surface sites were transferred. This can incur a significant power cost. 

 On the other hand, the entire operation may be carried out with a single kernel. Latencies incurred can be avoided by packing buffers for message passing, and accumulating the received faces at the end.

 For the experimental results, look at the speedup in workload execution with various GPU power limits in watts. The speedups are the ratios of the wallclock times taken by the unlocalized and MIG approaches running at identical power limits (for example, both at 700 W). 

 As shown in Figure 7, at  a GPU power limit of 400 W, MIG outperforms the unlocalized data with speedups of up to 2.25x depending on the volume of the workload. The reason behind this is the power consumed by the L2 fabric interface becomes a limiting factor when the GPU is running at a low power limit. With MIG mode, since there is no L2 fabric power being consumed to transfer the data between NUMA nodes, workloads can run much faster.

 However, when the GPU power limit is increased, MIG mode performs slightly worse in the case of the experiments represented by the grey, dark green, and black lines in Figure 7, and part of the green. This is because at higher power limits, the extra latency included by the message passing can outweigh the benefits of the localization. 

 Figure 7. Running MIG-based NUMA localization on different workload sizes 

 As it turns out, the smaller cases (especially those indicated by black and dark green lines in Figure 7) never exhaust available power at higher power limits even in the unlocalized case. As such, they benefit little from the GPU power saving won by localization, and at these smaller volumes the latencies due to kernel launch are much more noticeable. The larger volumes (the green, for example) require more power and hence can gain an advantage over the unlocalized setup even  at higher power limits.

 Get started with MIG-based NUMA node localization

 Local L2 caching in NVIDIA data center GPUs can impact performance in NUMA-unaware workloads. Our experiments using the Wilson-Dslash operator in MIG mode show that when the GPU is running at lower power limits and data transfer over MPI (PCIe/NVLink) is low relative to local memory accesses, MIG-based NUMA node localization can yield speedups of up to 2.25x compared to the unlocalized case at the same power limit. 

 While systems running at a higher 1,000 W power envelope may achieve greater absolute performance than a 400 W configuration, MIG-based localization provides clear advantages under power-constrained conditions. In lower-power scenarios, it enables significantly faster performance, making it an especially effective optimization when operating within strict power limits.

 However, in general, MIG does not offer the flexibility required to consistently achieve effective data localization, especially as interprocess communication overhead becomes more pronounced at higher power limits. MIG is only supported for use cases that are too small to fit on a GPU. For this reason, it is not recommended for the cases presented in this post. To address these limitations, alternative approaches are under investigation.

 To learn more, see Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS . 

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | Simulation / Modeling / Design | General | CUDA | Intermediate Technical | Deep dive | CUDA C++ | Data Analytics / Processing | featured | Memory | Multi-Instance GPU (MIG)

 About the Authors

 About Mukul Joshi

 Mukul Joshi is a memory architect with the NVIDIA GPU design team with over a decade of experience in memory architecture. He has worked on both GPU and CPU architectures across a wide range of products ranging from low-power smartphone SoCs to high-performance server processors. He holds a master’s degree from Georgia Tech in Electrical and Computer Engineering.

 View all posts by Mukul Joshi

 About Balint Joo

 Balint Joo is a developer technology engineer at NVIDIA working with high-performance computing workloads. His main area of expertise is Lattice Quantum Chromodynamics (QCD), which he has pursued for over 20 years prior to joining NVIDIA. He holds a PhD in Theoretical Physics from the University of Edinburgh in Edinburgh, Scotland.

 View all posts by Balint Joo

 About Zachary Susskind

 Zachary Susskind is a research scientist in the Architecture Research Group at NVIDIA. His interests include the exploration of non-uniform memory access architectures and algorithm-hardware co-design for energy-efficient machine learning. He holds a PhD in Electrical and Computer Engineering from The University of Texas at Austin.

 View all posts by Zachary Susskind

 About Allard Hendriksen

 Allard Hendriksen is a developer technology engineer at NVIDIA with an expertise in modern GPU memory subsystems. He has worked on projects accelerating key workloads for customers, is the lead developer of cuda::ptx, and has presented on optimizing bandwidth and minimizing latency at GTC and other venues. He holds a PhD in applied mathematics from Leiden University and the CWI in Amsterdam.

 View all posts by Allard Hendriksen

 About Kate Clark

 Kate Clark works at the interface of applications, algorithms, and parallel computation. Before joining NVIDIA, she was a researcher in radio astronomy signal-processing algorithms at Harvard University and a postdoc at Boston University in Massachusetts, with a focus on multi-grid solver algorithms. She received her PhD from the University of Edinburgh, Scotland, where her doctoral research focused on Monte Carlo algorithms for Lattice QCD.

 View all posts by Kate Clark

 Comments

 Related posts

 Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS 

 Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS 

 Less Coding, More Science: Simplify Ocean Modeling on GPUs With OpenACC and Unified Memory

 Less Coding, More Science: Simplify Ocean Modeling on GPUs With OpenACC and Unified Memory

 Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip

 Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip

 Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

 Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

 Unified Memory for CUDA Beginners

 Unified Memory for CUDA Beginners

 Related posts

 NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications

 NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications

 CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features

 CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features

 Simplifying GPU Application Development with Heterogeneous Memory Management

 Simplifying GPU Application Development with Heterogeneous Memory Management

 Boosting Application Performance with GPU Memory Access Tuning

 Boosting Application Performance with GPU Memory Access Tuning

 Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

 Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

 L

 T

 F

 R

 E

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical Blog

nvidia_dev_blog

27.02.2026 17:00

0.696

Embedding sim.	0.7943
Entity overlap	0.0952
Title sim.	0.3173
Time proximity	0.7143

NLP тип	other
NLP организация	NVIDIA
NLP тема	large language models
NLP страна

Открыть оригинал

Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often leads to low average GPU utilization, high compute costs, and unpredictable latency.

 The problem isn’t just about packing more workloads onto GPUs but about scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance).

 This blog post covers:

 The inference utilization problem : Why traditional scheduling underutilizes GPU resources.

 How NVIDIA NIM delivers production inference : The role of containerized microservices in standardizing model deployment.

 NVIDIA Run:ai’s intelligent scheduling strategies : Four key capabilities that enhance performance (lower latency, increase TPS/GPU) while increasing GPU utilization and reducing compute costs.

 Benchmarking results : ~2x GPU utilization improvement with minimal throughput loss, up to ~1.4x higher throughput under heavy concurrency with dynamic fractions, and 44-61x faster first-request latency with GPU memory swap.

 How to get started : Practical guidance for implementing these strategies with NIM on NVIDIA Run:ai.

 The inference utilization problem

 GPU utilization determines how many workloads can be run on a given cluster, and at what cost. In practice, most inference deployments leave significant GPU capacity idle as each model is assigned a full GPU “just to be safe” or because naive sharing without memory isolation causes out-of-memory (OOM) conditions and latency spikes under traffic.

 Without intelligent orchestration, teams are forced to choose between overprovisioning (waste) and underprovisioning (performance risk).

 How NVIDIA NIM delivers production inference

 NVIDIA NIM packages optimize inference engines as containerized microservices with:

 Packaged inference engines : Inference runtimes pre-configured for improved throughput/latency

 Industry-standard APIs : OpenAI-compatible endpoints for integration

 Model optimization : Automatic selection of quantization, batching, and acceleration techniques.

 Production-ready containers : Pre-built with dependencies, tested at scale

 Security and compliance: Enterprise-grade security controls and container signing for deployments  

 Enterprise support: NVIDIA support and maintenance for production deployments

 NIM standardizes the deployment layer, but maximizing GPU utilization requires intelligent orchestration. This is where NVIDIA Run:ai ‘s scheduling capabilities become essential.

 How NVIDIA Run:ai unlocks efficient resource management for NVIDIA NIM

 Inference utilization is more than just scheduling—it’s about adapting to how workloads behave. With NVIDIA Run:ai, NIM deployments get inference-first prioritization , GPU fractions with full memory isolation, smarter placement based on workload needs, dynamic memory management, and autoscaling (including replica scaling and scale-to-zero). This enables users to follow traffic and give back GPUs when models are idle.

 Inference priority protects user-facing workloads

 NVIDIA Run:ai automatically assigns inference workloads the highest default priority, ensuring training jobs never preempt them. Why this matters:

 Inference serves users : Latency spikes and downtime impact the user experience and SLA compliance.

 Training can tolerate interruption : Model training can checkpoint and resume; inference requests cannot wait.

 This automatic priority assignment eliminates manual tuning in most environments. For organizations running mixed workloads, this ensures training jobs flex around inference demands rather than competing with them. GPUs can train when inference load is low, automatically yielding resources when user-facing requests arrive.

 GPU fractions with bin packing for multiple small models on a GPU

 Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need an entire GPU. When used with GPU fractions , NVIDIA Run:ai’s bin packing strategy fills GPUs before allocating new ones, maximizing utilization across the cluster.

 How GPU fractions with bin packing work:

 GPU fractions provide true memory isolation (not soft limits). Each model gets a guaranteed memory allocation.

 Bin packing scores GPUs by current utilization and prioritizes filling partially used GPUs before allocating fresh ones.

 Scheduler prioritizes partially-used GPUs for new workloads

 Benchmarking results:

 The approach was tested by simulating a scenario with three NIM models (a 7B LLM, a 12B VLM, and a 30B MoE) on NVIDIA H100 GPUs :

 Scenario A : Three GPUs with one H100 GPU per NIM (baseline)

 Scenario B : Three NIM on 1.5 H100 GPUs using NVIDIA Run:ai fractions, keeping NIM configurations and client load patterns constant

 Figure 1. Three NIM microservices consolidated from three dedicated H100 GPUs to ~1.5 H100 GPUs using GPU fractions and bin packing, retaining 91–100% of baseline throughput

 Exercising short and long-context prompts, the key findings include:

 Each NIM retained about 91–100% of its single-GPU throughput, with modest increases in time-to-first-token (TTFT) and end-to-end (E2E) latency.

 Mistral-7B matched its dedicated-GPU throughput at 834 token/s with long-context input (100%).

 Nemotron-3-Nano-30B retained 95% (582 vs. 614 token/s).

 Nemotron-Nano-12B-v2-VL retained 91% (658 vs. 723 token/s) at short-context input.

 Three NIM microservices that previously required three dedicated H100s were consolidated onto ≈1.5 H100s, freeing the remaining capacity for other workloads.

 Dynamic GPU fractions maintain performance under heavy concurrent requests

 Static GPU fractions guarantee memory isolation, but they impose a rigid ceiling that creates “standard capacity”. As concurrent requests increase, each NIM’s KV-cache grows dynamically to track active sequences. When that growth hits the fixed fraction boundary, throughput plateaus, and latency degrades. This bottleneck forces a difficult trade-off: over-allocate fractions (wasting GPU capacity) or cap concurrency to stay within the fixed memory budget.

 NVIDIA Run:ai’s dynamic GPU fractions solve this by replacing fixed allocations with a request/limit model, borrowing Kubernetes resource semantics for GPU memory:

 Request: The guaranteed minimum fraction, always reserved for the workload.

 Limit: The burstable upper bound, enabling the NIM to spread into available GPU memory when on-demand KV-cache or compute pressure increases.

 When a NIM operates its request, the unused headroom between the request and limit remains available to co-located workloads. When concurrent traffic spikes occur, the NIM bursts toward its limit, claiming that memory and converting it into active throughput. This state transition between request and limit is handled automatically. Workloads scale up when they need resources and release them when demand subsides, maximizing total GPU utilization without manual intervention.

 Benchmarking results:

 Using the same three NIM models and 1.5 H100 GPU footprint from Experiment 1, static fractions were replaced with dynamic fractions to measure performance under increasing concurrency:

 Mistral-7B NIM (Request: 0.3, Limit: 0.4)

 Nemotron-Nano-12B-v2-VL NIM (Request: 0.4, Limit: 0.5)

 Nemotron-3-Nano-30B NIM (Request: 0.65, Limit: 0.75) 

 Scenarios compared:

 Scenario A (static fractions + bin packing): The fixed-fraction deployment from Experiment 1 (See Figure 1), where each NIM has a hard memory ceiling with full isolation.

 Scenario B (dynamic fractions + bin packing): Same bin-packed layout on ≈1.5 H100 GPUs, but each NIM uses a request/limit pair instead of a fixed allocation.

 Figure 2. Throughput vs. p50 end-to-end latency for Nemotron-3-Nano-30B on H100 GPUs with 2,048 input tokens

 In Figures 2, 3, and 4, as concurrency ramped up, static fractions hit a performance wall, throughput stalled, and latency spiked because models couldn’t access additional memory for growing KV caches. With dynamic fractions, NIM microservices absorbed the pressure by bursting toward their limits during traffic peaks and releasing memory back when the load subsided. 

 Across all three NVIDIA NIM microservices, dynamic fractions delivered up to 1.4x higher throughput and 1.7x lower latency, scaling cleanly with concurrency. For example:

 Nemotron-3-Nano-30B sustained 1,025 token/s at 256 concurrent requests with dynamic fractions compared to a static-fraction ceiling of 721 token/s at just four concurrent requests before instability (1.4x).

 Mistral-7B-Instruct-v0.3 p50 end-to-end latency dropped from 5,235 ms to 3,098 ms at 64 concurrent 2,048-token requests (1.7x). 

 The p50 latency curve remains smooth and monotonic rather than spiking or collapsing, confirming that the request/limit headroom accommodates KV-cache growth patterns, improving GPU utilization.

 Figure 3. Throughput vs. p50 end-to-end latency for Mistral-7B-Instruct-v0.3 on H100 GPUs with 2,048 input tokens

 Key takeaway:

 Static fractions + bin packing: Predictable traffic, low-to-moderate concurrency, models with stable memory footprints

 Dynamic GPU fractions + bin packing: Variable traffic, high concurrency, models with significant KV-cache growth

 Figure 4. Throughput vs. p50 end-to-end latency for Nemotron-Nano-12B-v2-VL on H100 GPUs with 2,048 input tokens

 Dynamic GPU fractions eliminate the performance ceiling of static allocations at high concurrency while maintaining workload density. With static fractions, the KV-cache cannot grow beyond the fixed memory boundary, and the inference engine begins rejecting requests because it lacks the headroom to admit new sequences. Dynamic GPU fractions solve this as NIM can burst into available headroom on demand, and organizations get both the efficiency of bin packing and the resilience to handle traffic spikes without allocating additional GPUs.

 GPU memory swap: Efficiently serving rarely-used models

 Organizations serving LLMs face a fundamental trade-off between latency and cost. Scaling an LLM from zero means full container initialization, loading model weights from disk, and allocating GPU memory; a process that can take tens of seconds to minutes. Because this cold-start latency is unacceptable for user-facing applications, most organizations choose over-provision, keeping multiple replicas always-on with dedicated GPUs even during low-traffic or idle periods. 

 This guarantees low latency but wastes GPU capacity, paying for hardware that sits idle just to avoid the risk of a cold start. Scale-to-zero (the Kubernetes pattern of shutting down idle replicas completely and restarting them on demand) can free the GPUs, but the cold-start penalty makes it impractical for latency-sensitive inference workloads.

 How GPU memory swap works:

 With GPU memory swap , models are kept in CPU memory and dynamically swap model weights between CPU and GPU as requests arrive. Only the active model’s weights reside in GPU memory at any moment. When a request targets an idle model, NVIDIA Run:ai’s GPU memory swap moves the currently loaded model’s weights to CPU RAM and loads the requested model into GPU memory, keeping it warm for a configurable window. The model never leaves memory entirely; it just moves between GPU and CPU, eliminating the need for container restarts, disk I/O, and cold-start initialization.

 GPU memory swap works across single-GPU, multi-GPU, and fractional GPU workloads. Previous benchmarking with single-GPU deployments showed up to 66x improvements in time to first token (TTFT) compared to scale-from-zero. In this benchmark, combining GPU memory swap with NIM deployments on fractional GPUs tested whether the same latency benefits hold when models share hardware through bin packing and under memory constraints.

 Benchmarking results:

 Latency between GPU memory swap and scale-from-zero for the same three NIM deployments was compared:

 Scenario A (scale-from-zero): Each NIM cold‑starts from scratch on a dedicated H100 GPU when traffic arrives (three GPUs in total).

 Scenario B (GPU memory swap): The three NVIDIA NIM microservices share 1.5 H100 GPUs (with the same fractions from previous experiments), with swap‑in/swap‑out between GPU and CPU memory.

 Figure 5. GPU memory swap vs. scale‑from‑zero TTFT on an H100 GPU with 128‑token prompts

 Figure 6. GPU memory swap vs. scale-from-zero TTFT on H100 GPUs with longer 2048-token prompts

 With scale-from-zero, infrequently accessed NIM microservices suffer high first-request latency due to full cold starts. With GPU memory swap, first-request latency stays acceptable, and subsequent requests see warm TTFT. All three NIM microservices run on half of the GPUs, freeing up the remaining capacity for high-traffic or other workloads. 

 At 128-token input, cold-start TTFT ranged from 75.3 s (Mistral-7B) to 92.7 s (Nemotron-3-Nano-30B), while GPU memory swap reduced these to 1.23–1.61 s – a 55–61x improvement. At 2,048-token input, cold-start TTFT of 158.3–180.2 s dropped to 3.52–4.02 s with swap, a consistent ~44x reduction.

 Key takeaway : GPU memory swap delivers 44-61x faster TTFT than scale-from-zero while using fewer resources when combined with GPU fractions, eliminating the cold-start penalty for infrequently accessed models, whether deployed on dedicated or fractional GPUs.

 Get started with NVIDIA Run:ai and NVIDIA NIM

 Check out this guide to get started with deploying NVIDIA NIM as a native inference workload on NVIDIA Run:ai. Watch this webinar to see how teams manage growing AI workloads with intelligent scheduling, fine-grained GPU controls, Kubernetes-native traffic balancing, and autoscaling—while new platform updates improve access control, endpoint management, and visibility. 

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | Developer Tools & Techniques | General | NIM | Run:ai | Advanced Technical | Benchmark | featured | Inference Performance | LLMs

 About the Authors

 About Shwetha Krishnamurthy

 Shwetha Krishnamurthy is a product manager at NVIDIA, where she focuses on building LLM inference products. Before joining NVIDIA, she spent several years as a machine learning engineer and data scientist at Goldman Sachs and Yodlee. Shwetha holds an MBA from the University of Chicago Booth School of Business and a Master’s in Computer Science from the University of Chicago.

 View all posts by Shwetha Krishnamurthy

 About Aditi Bodhankar

 Aditi Bodhankar is a developer advocate engineer at NVIDIA who works on developing various deep learning applications, especially those using the NVIDIA NeMo. She is equipped with experience in conversational AI and NLP since her internship at NVIDIA. Aditi holds a master’s degree from the University of Southern California.

 View all posts by Aditi Bodhankar

 About Ekin Karabulut

 Ekin Karabulut is a data scientist and developer advocate previously at Run:ai, now at NVIDIA, exploring the efficient usage of large models in different production scenarios. Previously she worked on privacy implications of federated learning, focused on distributed training techniques and got fascinated by inefficiencies in GPU usage in research and industry settings. She established the AI Infrastructure Club and is based in Munich, Germany.

 View all posts by Ekin Karabulut

 About Julie Adrounie

 Julie Adrounie is an AI product marketing manager and technical advocate for Run:ai software at NVIDIA, where she helps enterprises scale large language model workloads and streamline AI inference in production environments. Previously, as a solutions architect, she built and implemented end-to-end AI production platforms that helped data science teams accelerate model development and deployment at scale. She holds a B.S. in Industrial Engineering and lives in Orlando, Florida.

 View all posts by Julie Adrounie

 Comments

 Related posts

 Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

 Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

 Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

 Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap

 Accelerate AI Model Orchestration with NVIDIA Run:ai on AWS

 Accelerate AI Model Orchestration with NVIDIA Run:ai on AWS

 NVIDIA Dynamo Adds GPU Autoscaling, Kubernetes Automation, and Networking Optimizations

 NVIDIA Dynamo Adds GPU Autoscaling, Kubernetes Automation, and Networking Optimizations

 Practical Tips for Preventing GPU Fragmentation for Volcano Scheduler

 Practical Tips for Preventing GPU Fragmentation for Volcano Scheduler

 Related posts

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes

 Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes

 L

 T

 F

 R

 E

Advancing independent research on AI alignment

openai

19.02.2026 10:00

0.692

Embedding sim.	0.7935
Entity overlap	0.1
Title sim.	0.15
Time proximity	0.9226

NLP тип	funding
NLP организация	OpenAI
NLP тема	ai safety
NLP страна

Открыть оригинал

OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and security risks.

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models | NVIDIA Technical Blog

nvidia_dev_blog

18.02.2026 16:00

0.691

Embedding sim.	0.7949
Entity overlap	0.0857
Title sim.	0.1751
Time proximity	0.869

NLP тип	partnership
NLP организация	Sarvam AI
NLP тема	large language models
NLP страна	India

Открыть оригинал

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.

 Sarvam AI , a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two-dozen languages, and keep model development and data governance fully under India’s sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.

 This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs, and established a path for deployment on the next-generation NVIDIA Blackwell architecture. The end-to-end performance boost was achieved through kernel and scheduling optimizations on NVIDIA H100 SXM GPUs that contributed a 2x speedup. That was combined with the powerful compute capabilities of Blackwell, along with NVFP4 weight quantization, for an additional 2x speedup, with an even bigger performance gain of 2.8x seen at higher interactivity points.

 NVIDIA engineers helped Sarvam AI build 3B, 30B, and 100B foundational models, and optimize a new family of sovereign foundation models that were trained using NVIDIA Nemotron libraries , including the NVIDIA NeMo Framework and NVIDIA NeMo-RL . These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can leverage NVIDIA’s full-stack AI platform—from data to deployment—to achieve state-of-the-art performance and localized AI capabilities.

 This post walks through the joint engineering effort and shares benchmarks for the speed-ups achieved on the NVIDIA H100 , the largest-deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.

 Making multilingual sovereign AI scalable with MoE

 To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch across 3B, 30B, 100B using the NVIDIA NeMo framework and NVIDIA Megatron-LM. Furthermore, Nemo-RL was used for post-training workflows for these models including long-context reasoning.

 Sarvam 30B utilizes a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.

 Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. Additionally, the 100B model adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the Key-Value (KV) cache, enabling massive context windows without the memory penalties of standard attention.

 Both models feature a shared expert design where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and complex memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.

 The performance challenge: SLAs and baseline configuration on NVIDIA H100

 Optimizing the Sarvam 30B model wasn’t just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service level agreements (SLAs):

 P95 (95th percentile) time to first token (TTFT): < 1000 ms

 P95 (95th percentile) inter-token latency (ITL): < 15 ms

 P95 (95th percentile) in inference performance testing measures latency, indicating that 95% of served requests are completed faster than this threshold, while the slowest 5% take longer. It is a critical tail-latency metric used to evaluate user experience and system stability, ensuring that even under load, most users face no more than a specific delay. The engineering goal was to maximize the inference server’s token throughput (concurrently served requests) without breaching these P95 targets.

 For the initial performance analysis, the Sarvam AI and NVIDIA teams selected the SGLang inference engine for their initial performance analysis. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention —a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture; RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. Furthermore, SGLang’s Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.

 The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a specific parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:

 Expert parallelism (EP=2) for the expert weights. This configuration utilizes Grouped GEMM kernels to maximize compute density and ensures that the massive expert weights reside in HBM, reducing the cost of expert routing. 

 Data parallelism (DP=2) for the attention weights with –enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase.

 While this configuration provided a robust functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization – leading us to the specific kernel and precision strategies detailed below. 

 From profiling to performance: eliminating MoE bottlenecks

 Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting SLA requirements. To identify the precise bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams utilized NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of every kernel within a single transformer layer.

 The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations were suffering from kernel launch overheads and redundant memory reads.

 Figure 1. Nsys profiler timeline showing SM activity and kernel execution over time of the prefill phase, with red boxes marking the most expensive kernels in the layer—QK normalization, attention, and MoE expert computation.

 Following these observations, we executed a targeted optimization strategy across three axes – kernel optimizations, scheduling efficiency, and disaggregated serving.

 Cutting transformer layer time by 34% with kernel-level optimizations

 The NVIDIA and Sarvam AI teams systematically targeted the most expensive kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We implemented the models first using a baseline implementation on SGLang with H100 GPUs and then optimized them to achieve significant speedups, as detailed below in Table 1 and in the following text. 

 Kernel
 Baseline time (microseconds)
 Optimized time (microseconds)
 Optimization applied

 RMSNorm + Prepare QKV
 186
 185
 N/A

 QK Norm + RoPE
 414
 54
 Use optimized fused in-place query-key normalization kernel

 Attention
 322
 296
 Use FA3 for prefill, FlashInfer backend for decode

 Post-attention linear projection
 114
 112
 N/A

 AllReduce
 252
 250
 N/A

 Router logits and TopK
 560
 134
 Use fused TopK impl.; ReplicatedLinear block for router logits

 Routed experts computation
 1103
 1080
 Tune kernel params for and DEP2 configuration (64 experts per GPU)

 Shared expert computation
 216
 215
 Overlap with TopK using NVIDIA CUDA streams

 AllReduce
 265
 249
 N/A

 Total layer time
 3432
 2575
 1.34x faster prefill overall

 Table 1. Kernel-level optimizations pay off: Fusing and tuning the hottest kernels cut layer time drastically and deliver faster prefill. 

 MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.

 Optimization: We implemented a Fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. Furthermore, we utilized a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound.

 Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm, followed by rotary positional embeddings (RoPE), required reading and writing the massive KV cache twice.

 Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption.

 Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By utilizing separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU’s compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.

 These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4ms to 2.5ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict <1000ms time to first token (TTFT)  and < 15ms inter-token latency service level agreement (ITL SLA) as shown in Figure 2 below.

 Figure 2. Performance gains from kernel optimizations across various concurrency points. In focus is the performance gain at the 75 TPS/user point. With kernel optimizations, we see a 1.26x improvement in overall token throughput per GPU.

 How mixed prefill and decode scheduling improve GPU utilization

 While kernel-level optimizations improve individual operation latency, significant efficiency gains can be achieved at the scheduler level by optimizing aggregated serving (prefill and decode run on the same GPU) and disaggregated serving (prefill and decode run on different GPUs).

 The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often leads to suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading KV cache). Serializing them means the GPU’s Tensor Core units (SMs) are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly for the low concurrency operating point imposed by the tight SLA requirements.

 To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to mix prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff. Mixing heavy prefill chunks into the decode stream can arguably increase inter-token latency (ITL) for the active decode requests, as they must wait for the shared compute resources.

 However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high ISL, low OSL scenario of interest here. For more decode-heavy cases, it might be worthwhile to pick smaller mixed chunk sizes or disable it altogether.

 Figure 3. The impact of mixed chunk scheduling, with 15% token throughput gains seen at the 2-second request latency point.

 How disaggregated serving removes the critical path and boosts throughput 1.5x

 Despite kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU’s memory, we pivoted from model parallelism to disaggregated serving.

 We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: We observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), proving that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.

 Figure 4. The benefits of disaggregated serving on NVIDIA H100 SXM for Sarvam 30B model

 The end-to-end impact of kernel, scheduling, and disaggregation optimizations

 Figure 5 below summarizes the end-to-end performance speedup we were able to achieve via a combination of optimized kernels and scheduling optimizations. We also observe that disaggregated serving is the most optimal configuration for this model and ISL/OSL workload pattern and specific TTFT and ITL SLAs.

 Figure 5. Progressive improvements seen in Sarvam 30B model inference on NVIDIA H100 SXM through a combination of kernel optimizations, scheduling optimizations, and disaggregated serving.

 Running the Sarvam 30B model on Blackwell NVIDIA GPUs

 The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, representing a jump over the NVIDIA H100 GPU’s capabilities. This throughput is driven by the second-generation Transformer Engine, which utilizes the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.

 To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike in the case of  multiple H100 GPUs, we found that the NVIDIA HGX B200 was able to serve the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell’s NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.

 As indicated in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency due to its superior compute, as well as exceptional throughput at higher concurrencies  from its memory capacity advantage.

 Figure 6. NVIDIA Blackwell GPU offers a 2.8x higher token throughput vs Nvidia H100 SXM GPU at the 100 TPS/User operating point.

 Learn more

 Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining strict TTFT and inter-token latency targets required for real-world deployment. 

 The result is not just a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams building large, production-grade AI systems on NVIDIA platforms.

 More information about Sarvam AI’s models can be found here .

 To begin exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.

 Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn , X , Discord , and YouTube .

 Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model. 

 Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com . 

 Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord

 Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron

 And read more about NVIDIA Cloud Functions, NVIDIA’s multi-cloud, high-performance AI inference solution, here .

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | Data Science | Cloud Services | Blackwell | DGX Cloud | H100 | NeMo | NeMo Microservices | Nemotron | Intermediate Technical | Tutorial | featured | NVIDIA Inception

 About the Authors

 About Utkarsh Uppal

 Utkarsh Uppal is a senior applied deep learning solutions architect at NVIDIA, where he specializes in building high-performance deep learning pipelines across domains like language and speech. His primary focus is on developing end-to-end conversational AI systems, including training LLMs from scratch, particularly for Indic languages and building domain-specific models with enterprises. He also has deep expertise in designing and optimizing inference architectures for production, with a focus on low-precision formats (FP4, FP8), decoding strategies, and KV-cache optimizations.

 View all posts by Utkarsh Uppal

 About Sriharsha Niverty

 Sriharsha Niverty focuses on AI infrastructure at NVIDIA, optimizing systems-level performance for large-scale LLM inference and training workloads. Previously, he worked on graphics application performance and architecture exploration, with an emphasis on efficient work scheduling inside the GPU.

 View all posts by Sriharsha Niverty

 About Diya Shah

 Diya Shah is a machine learning engineer at Sarvam AI, working on inference and optimization of models to drive maximum efficiency in serving stacks. By targeting accelerations at the system and kernel level, she works to ensure that large-scale models remain performant on diverse hardware environments. Diya has a bachelor of technology in electronics and communications engineering from the LNM Institute of Information Technology (LNMIIT) in India.

 View all posts by Diya Shah

 About Rakesh Madugundu

 Rakesh is an ML performance engineer at Sarvam AI. He focuses on accelerating model inference by optimizing at both the system and kernel levels to reduce production latency. He is passionate about low-level engineering, with a particular interest in writing custom kernels and building foundational architectures from scratch to maximize hardware efficiency.

 View all posts by Rakesh Madugundu

 About Ashwin Srinivasan

 Ashwin is a founding engineer at Sarvam whose role is to get all models from research to production. He takes care of model optimization across target hardware and also maintains accelerator infrastructure at Sarvam. He likes to dive deep into model architecture, kernels, and hardware.

 View all posts by Ashwin Srinivasan

 Comments

 Related posts

 Profiling LLM Training Workflows on NVIDIA Grace Hopper

 Profiling LLM Training Workflows on NVIDIA Grace Hopper

 Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

 Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

 NVIDIA Sets New Generative AI Performance and Scale Records in MLPerf Training v4.0

 NVIDIA Sets New Generative AI Performance and Scale Records in MLPerf Training v4.0

 New NVIDIA NeMo Framework Features and NVIDIA H200 Supercharge LLM Training Performance and Versatility

 New NVIDIA NeMo Framework Features and NVIDIA H200 Supercharge LLM Training Performance and Versatility

 Build Custom Enterprise-Grade Generative AI with NVIDIA AI Foundation Models 

 Build Custom Enterprise-Grade Generative AI with NVIDIA AI Foundation Models 

 Related posts

 How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton

 How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton

 Accelerating Video Production and Customization with GliaCloud and NVIDIA Omniverse Libraries

 Accelerating Video Production and Customization with GliaCloud and NVIDIA Omniverse Libraries

 Vortex Enables Advanced Imaging Anywhere with NVIDIA Jetson

 Vortex Enables Advanced Imaging Anywhere with NVIDIA Jetson

 Spotlight: Build Scalable and Observable AI Ready for Production with Iguazio's MLRun and NVIDIA NIM

 Spotlight: Build Scalable and Observable AI Ready for Production with Iguazio's MLRun and NVIDIA NIM

 Spotlight: Personal AI Brings AI Receptionists to Small Business Owners with NVIDIA Riva

 Spotlight: Personal AI Brings AI Receptionists to Small Business Owners with NVIDIA Riva

 L

 T

 F

 R

 E

Controlling Floating-Point Determinism in NVIDIA CCCL | NVIDIA Technical Blog

nvidia_dev_blog

05.03.2026 17:00

0.688

Embedding sim.	0.7518
Entity overlap	0.1935
Title sim.	0.287
Time proximity	1

NLP тип	other
NLP организация	nvidia
NLP тема	ai infrastructure
NLP страна	united states

Открыть оригинал

A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property to guarantee, it can be difficult to achieve in practice, especially in parallel programming and floating-point arithmetic. This is because floating-point addition and multiplication aren’t strictly associative—that is, (a + b) + c may not equal a + (b + c)—due to rounding that occurs when intermediate results are stored with finite precision .

 With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB—a low-level CUDA library for speed-of-light parallel device algorithms —added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce
 algorithm’s determinism property. This can only be done through the new single-phase API, since the two-phase API doesn’t accept an execution environment.

 The following code shows how to specify the determinism level in CUB (find the complete example online using compiler explorer ).

auto input = thrust::device_vector<float>{0.0f, 1.0f, 2.0f, 3.0f};
 auto output = thrust::device_vector<float>(1);

 auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed); // can be not_guaranteed, run_to_run (default), or gpu_to_gpu

 auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
 if (error != cudaSuccess)
 {
 std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << std::endl;
 }

 assert(output[0] == 6.0f);

 We begin by specifying the input and output vectors. We then use cuda::execution::requir
e() to construct a cuda::std::execution::env
 object, setting the determinism level to not_guaranteed
.

 There are three determinism levels available for reduction, which are:

 not_guaranteed

 run_to_run

 gpu_to_gpu

 Determinism not guaranteed

 In floating-point reductions, the result can depend on the order in which elements are combined. If two runs apply the reduction operator in different orders, the final values may differ slightly. In many applications, these minor differences are acceptable. By relaxing the requirement for strict determinism, the reduction implementation can rearrange the operations in any order, which can improve runtime performance.

 In CUB, not_guaranteed
 relaxes the determinism level. This enables atomic operations—whose unordered execution across threads results in a different order of operations between runs—to compute both the block-level partial aggregates and the final reduction value. The entire reduction can also be performed in a single kernel launch, since the atomic operations combine the block-level partial aggregates into the result.

 The nondeterministic reduce variant is typically faster than the run-to-run deterministic version—particularly for smaller input arrays, where performing the reduction in a single kernel reduces latency from multiple kernel launches, minimizes extra data movement, and avoids additional synchronization. The tradeoff is that repeated runs may yield slightly different results due to the lack of deterministic behavior.

 Run-to-run determinism

 While nondeterministic reductions offer potential performance gains, CUB also provides a mode that guarantees consistent results across runs. By default, cub::DeviceReduce
 is run-to-run deterministic, which corresponds to setting the determinism level to run_to_run
 in the single-phase API. In this mode, multiple invocations with the same input, kernel launch configuration, and GPU will produce identical outputs.

 This determinism is achieved by structuring the reduction as a fixed, hierarchical tree rather than relying on atomics, whose update order can vary across runs. At each stage of the reduction, elements are first combined within individual threads. The intermediate results are then reduced across threads within a warp using shuffle instructions, followed by a block-wide reduction using shared memory. Finally, a second kernel aggregates the per-block results to produce the final output. Because this sequence is predetermined and independent of the relative timing of thread execution, the same inputs, kernel configuration, and GPU yield the same bitwise result.

 GPU-to-GPU determinism

 For applications that require the highest level of reproducibility, CUB also provides GPU-to-GPU determinism, which guarantees identical results across multiple runs with the same input on different GPUs. This mode corresponds to setting the determinism level to gpu_to_gpu
.

 To achieve this level of determinism, CUB uses a Reproducible Floating-point Accumulator (RFA) , a solution based on the NVIDIA GTC 2024 session, Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions . The RFA counters floating-point non-associativity—which arises when adding numbers with different exponents—by grouping all input values into a fixed number of exponent ranges (the default is three bins). This fixed, structured accumulation order ensures the final result is independent of GPU architecture. 

 The accuracy of the final result depends on the number of bins: more bins provide greater accuracy, but also increase the number of intermediate summations, which can reduce performance. The current implementation defaults the number of bins to three, an optimal default providing balanced performance and accuracy. It’s worth noting that this configuration is not just strictly deterministic, but also guarantees numerically correct results, providing tighter error bounds than the standard pairwise summation traditionally used in parallel reductions.

 How results vary based on the determinism levels

 The three determinism levels differ in the amount of variation they produce across multiple runs:

 Not-guaranteed determinism produces slightly different summation values on each invocation.

 Run-to-run determinism ensures the same value for every invocation on a single GPU, but the result may vary if a different GPU is used.

 GPU-to-GPU determinism guarantees that the summation value is identical for every invocation, regardless of which GPU executes the reduction.

 This is shown in Figure 1, with the summation of an array for each determinism level—represented by green, blue, and red circles—plotted against the run number. A flat horizontal line shows that the reduction produces the same result. 

 Figure 1. Summation value compared to run 

 Determinism performance comparison

 The level of determinism selected affects the performance of cub::DeviceReduce
. Not-guaranteed determinism, with its relaxed requirements, provides the highest performance. The default run-to-run determinism delivers good performance but is slightly slower than not-guaranteed determinism. GPU-to-GPU determinism, which enforces the strictest reproducibility across different GPUs, can significantly reduce performance, increasing execution time by 20% to 30% for large problem sizes.

 Figure 2 compares the performance of the different determinism requirements for float32
 and float64
inputs on an NVIDIA H200 GPU (lower is better). They clearly show how the choice of determinism level impacts execution time across different data types.

 Figure 2. Elapsed time compared to the number of elements

 Conclusion

 With the introduction of the single-phase API and explicit determinism levels, CUB provides an enhanced toolbox for controlling both the behavior and performance of reduction algorithms. Users can choose the level of determinism that best suits their needs: from the high-performance and flexible, not-guaranteed mode, to the reliable run-to-run default, and up to the strictest GPU-to-GPU reproducibility.

 Determinism in CUB isn’t limited to reductions. We plan to extend these capabilities to additional algorithms for developers to control reproducibility across a wider range of parallel CUDA primitives. For updates and discussion, see the ongoing GitHub issue on expanded determinism support, to follow our roadmap, and provide feedback on algorithms you’d like to see deterministic versions of.

 Discuss (0)

 Like

 Tags

 Data Science | Developer Tools & Techniques | Simulation / Modeling / Design | General | CUDA | Intermediate Technical | Tutorial | featured

 About the Authors

 About Nader Al Awar

 Nader Al Awar is a senior software engineer at NVIDIA and a member of the CUDA Core Compute Libraries (CCCL) team, where he focuses on the development of CUB and cuda.compute. He earned his doctorate in electrical and computer engineering from the University of Texas at Austin, specializing in high-performance computing for Python. Nader is passionate about bridging the gap between high-level languages and hardware by accelerating Python code using GPUs.

 View all posts by Nader Al Awar

 About Srinivas Yadav Singanaboina

 Srinivas Yadav Singanaboina is a graduate research assistant at the Center for Computation and Technology at Louisiana State University (LSU). Srinivas is a core member of STE||AR GROUP and an active contributor to the HPX open-source project.

 View all posts by Srinivas Yadav Singanaboina

 Comments

 Related posts

 Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

 Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

 Implementing High Performance Matrix Multiplication Using CUTLASS v2.8

 Implementing High Performance Matrix Multiplication Using CUTLASS v2.8

 Revealing New Features in the CUDA 11.5 Toolkit

 Revealing New Features in the CUDA 11.5 Toolkit

 Faster Parallel Reductions on Kepler

 Faster Parallel Reductions on Kepler

 How to Overlap Data Transfers in CUDA Fortran

 How to Overlap Data Transfers in CUDA Fortran

 Related posts

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy

 How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy

 Designing Protein Binders Using the Generative Model Proteina-Complexa

 Designing Protein Binders Using the Generative Model Proteina-Complexa

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 L

 T

 F

 R

 E

Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting

openai

26.02.2026 10:00

0.687

Embedding sim.	0.8063
Entity overlap	0.375
Title sim.	0.1698
Time proximity	0.5446

NLP тип	product_launch
NLP организация	OpenAI
NLP тема	benchmarking
NLP страна

Открыть оригинал

OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting—showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.

Arvind KC appointed Chief People Officer

openai

24.02.2026 13:40

0.683

Embedding sim.	0.7936
Entity overlap	0.3333
Title sim.	0.0513
Time proximity	0.8085

NLP тип	leadership_change
NLP организация	OpenAI
NLP тема	enterprise ai
NLP страна

Открыть оригинал

OpenAI appoints Arvind KC as Chief People Officer to help scale the company, strengthen its culture, and lead how work evolves in the age of AI.

Making Softmax More Efficient with NVIDIA Blackwell Ultra | NVIDIA Technical Blog

nvidia_dev_blog

25.02.2026 17:00

0.682

Embedding sim.	0.7874
Entity overlap	0.1176
Title sim.	0.237
Time proximity	0.7202

NLP тип	product_launch
NLP организация	NVIDIA
NLP тема	ai infrastructure
NLP страна

Открыть оригинал

LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI ”speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function.

 Transcendentals refer to functions that cannot be expressed as the root of a polynomial equation with rational coefficients. Subsequently, they “transcend” basic algebraic operations like addition and multiplication—the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function that  is executed on Special Function Units (SFUs). In NVIDIA assembly instructions ( SASS ), this function is invoked via the MUFU.EX2 instruction. This architectural split creates a softmax bottleneck within the attention block, when powerful matrix engines are forced to idle while waiting for the SFU datapaths to normalize attention scores.

 NVIDIA Blackwell Ultra alleviates this bottleneck by doubling SFU throughput over the standard NVIDIA Blackwell architecture.

 This blog dives into the mechanics of softmax within the attention loop, explores how Blackwell Ultra’s hardware optimizations eliminate pipeline stalls, and provides a benchmark for you to measure the raw MUFU.EX2
 speedup for yourself.

 How attention works

 A foundational component of modern large language models is the attention mechanism, which allows a model to dynamically transform static token vectors into dynamic, context-aware representations. At its core, it is a process of re-weighting information by allowing tokens to adjust their importance to one another. To facilitate this interaction, every token in a sequence is projected into three functional roles:

 Query: Represents what the current token is seeking to understand its own context. 

 Key: Represents a token’s profile that others use for matching. Tokens previous in the sequence have keys that signal their specific relevance to the query. 

 Value: This holds the actual informational content. Once a match is confirmed between a query and a key, the Value is the specific data that is transferred to the original token.

 Figure 1 below shows attention in action. We have two sentences that utilize the word “dog” in two different definitions. Initially, we can see that the embeddings (the numerical vectors that capture meaning and nuance in a multidimensional space) of both “dog” mentions are identical.

 Figure 1. Context building through attention

 Attention operates with the model calculating a dot product between the “ dog “ query and the keys of every other token in the sequence. 

 if the query for “dog” aligns well with the key for “lazy,” it indicates a high degree of relevance. This interaction is what allows the word “dog” to pull in the specific value of its neighbor. By the end of this cycle, the original vector for “dog” has been physically updated with the content of its neighbors, evolving from a generic dictionary definition into a contextualized embedding that “understands” whether it refers to a lethargic animal or the sweltering peak of a season.

 How softmax relates to attention

 Softmax serves as the critical decision-making phase that converts raw compatibility scores into actionable weights. Once the initial dot products are calculated between queries and keys, the resulting scores are passed through the softmax function to be normalized into probabilities that sum to exactly one. This step is what determines the “attention span” of the model, effectively deciding which tokens to prioritize and which to ignore. Without softmax, the model would have no way to objectively weigh the information it gathers, leading to an unmanageable and noisy blend of data.

 However, the softmax operation is the primary source of the “performance cliff” seen in long-context AI. Because every token in a sequence must be compared against every other token, a sequence of 8,192 tokens creates a massive [8,192 x 8,192] attention matrix. Normalizing this matrix requires billions of transcendental calculations and grows quadratically with the sequence length. This creates a bottleneck, where the sheer volume of transcendental math can stall the entire inference pipeline. 

 Blackwell Ultra puts focus on accelerating these exponential calculations specifically to alleviate this mathematical bottleneck and ensure that the system can handle the massive normalization required for large context windows without sacrificing throughput.

 Alleviating the softmax bottleneck in Blackwell Ultra

 By doubling the throughput of the SFU for exponentials in the Blackwell Ultra architecture, NVIDIA is alleviating this bottleneck and is allowing for a more balanced and efficient processing pipeline. This results in faster overall performance, especially for tasks that are heavy on attention mechanisms.

 Figure 2 below illustrates the sequential dependency inherent in the standard attention mechanism, often referred to as the attention loop, as run on the previous generation NVIDIA Blackwell (GB200). Note that the Streaming Multiprocessor (SM) loads two thread blocks running attention loops concurrently. These separate attention loops are denoted in the two different shades of green.

 This pipeline consists of three distinct phases that must execute in order:

 BMM1 (score calculation): The Tensor Cores perform a matrix multiplication to calculate the raw attention scores, or logits.

 Softmax (normalization): The pipeline shifts to the SFUs to normalize these scores into probabilities using exponential functions.

 BMM2 (context aggregation): The pipeline returns to the Tensor Cores to multiply the probabilities by the value vectors.

 Figure 2. The Blackwell attention loop

 The timeline illustrates the latency constraints inherent in the Blackwell GPU during the execution of the attention kernel. Because the second matrix multiplication (BMM2) acts on the output of the softmax, it cannot begin until the normalization is complete. 

 The lower throughput of the Blackwell GPU’s SFUs forces the Tensor Cores to idle between the score calculation (BMM1) and the context aggregation (BMM2). This dependency prevents the pipeline from fully saturating the compute resources and extends the duration of the softmax operation

 The next timeline, as shown in Figure 3, demonstrates the direct impact of the Blackwell Ultra GPUs in NVIDIA GB300 NVL72 and NVIDIA HGX B300 systems doubled SFU throughput on the same instruction sequence.

 Figure 3. The Blackwell Ultra attention loop

 Visually, the width of the softmax blocks is reduced by almost 50%, reflecting the hardware’s ability to process MUFU instructions at twice the rate.

 This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 is drastically minimized, allowing the Tensor Cores to switch between the query-key multiplication and the probability-value multiplication with minimal stalling. The result is a denser main loop where the high-performance matrix engines spend a larger percentage of the total execution time active, directly translating to higher overall inference throughput.

 Benchmarking MUFU.EX2 performance

 To empirically verify the theoretical throughput of the MUFU pipeline, we can construct a synthetic micro-benchmark. The following kernel code isolates the exponential instructions to measure the raw cycle count without interference from global memory latency or other arithmetic operations.

 This test harness launches a grid of threads where each thread performs a dense loop of MUFU.EX2
 instructions. By timing the execution and comparing it against the clock frequency, you can directly calculate the effective instruction throughput and validate the bandwidth saturation point mentioned earlier.

 Step 1: Clone the following repository to pull the exp2-bg300.cu benchmark.

git clone https://github.com/jamieliNVIDIA/mufu_ex2_bench.git
cd mufu_ex2_bench

 Step 2: Compile with (Using sm100f for GB300 or sm103a for GB200).

nvcc -O3 -gencode=arch=compute_103a,code=sm_103a --extended-lambda -o /tmp/exp2-gb300.out exp2-gb300.cu

 Sample results

 We see that GB300 performs about 2x higher in FLOPs performance over GB200 for all tested data types, in line with the doubled SFU throughput.

 Blackwell (GB200)

exp2 BF16x2 2454 Gop/s (4908 GFLOPS)
exp2 BF16 4938 Gop/s
exp2 FP32 4943 Gop/s

 Blackwell Ultra (GB300)

exp2 BF16x2 4996 Gop/s (9992 GFLOPS)
exp2 BF16 9738 Gop/s
exp2 FP32 Time: 10024 Gop/s

 Attention forward propagation performance in Blackwell vs Blackwell Ultra

 The transition from Blackwell to Blackwell Ultra delivers a targeted increase in compute throughput driven by a 2x increase in SFU performance. This hardware upgrade directly accelerates the forward propagation (FPROP) pipeline for models like DeepSeek-V3.

 FPROP is the process where input data travels “forward” through the neural network—from the input layer, through the hidden layers, to the output layer—to generate a prediction. Every time the model produces a single new word, it must run one complete FPROP pass.

 Figure 4 below shows that by doubling the throughput of the SFUs, the GB300 drastically reduces the execution time of the softmax layers within the attention blocks. This faster normalization means the GPU spends less time processing the attention scores and more time utilizing the high-speed matrix engines for the next layer’s computation, directly increasing the overall speed of the forward pass.

 Figure 4. GB300 vs GB200 FLOPS in forward propagation in a grouped query attention (GQA) model.

 The benchmark results highlight a ~35% increase in FPROP throughput for FP8 operations. ​​This gain is particularly pronounced in FP8 because the matrix math is already extremely fast. In this low-precision regime, the time spent on softmax becomes a larger percentage of the total step.

 Getting started

 The performance dynamics of DeepSeek-V3 on the Blackwell Ultra highlight a critical, but often overlooked bottleneck in inference: the computational cost of non-linear operations.

 By optimizing and compressing the attention mechanism, state-of-the-art models effectively increase the density of softmax operations relative to standard linear computations, exposing the SFUs as a governor of total throughput.

 Blackwell Ultra directly addresses this bottleneck. By doubling the throughput of these specialized units, Blackwell Ultra unblocks the transcendental traffic jam that previously forced the powerful Tensor Cores to idle. The benchmark results confirm the impact, demonstrating a 35% gain in FP8 forward propagation. 

 For modern, highly optimized architectures, the path to faster inference isn’t just about faster Tensor Cores, it’s also about ensuring the non-linear math units are fast enough to keep up.

 Visit NVIDIA’s trtllm-gen repository for more benchmarks and information on utilizing this SFU speedup in workloads. Doubling the throughput of the SFUs for MUFU.EX2
 is just one of many features that enable Blackwell Ultra’s fast attention speed. NVIDIA’s extreme hardware-software codesign accelerates the full attention loop through technologies such as: 

 Offloading critical “find-max” reductions to the Tensor Memory controller via LDTM.STAT .

 Optimizing performance using CUDNN .

 Optimizing KVCache data movements using NVFP4 .

 Stay tuned to the NVIDIA technical blog for future posts.

 Acknowledgements

 Special thanks to the cuDNN engineering team for creating the benchmarks and building the software optimizations making this cutting edge performance possible.

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Center / Cloud | Cloud Services | Blackwell | GB200 | Intermediate Technical | Deep dive | AI Inference | Blackwell Ultra | cuDNN | featured | GB300 | LLMs | Tensor Cores

 About the Authors

 About Jamie Li

 Jamie Li is a senior technical marketing engineer at NVIDIA focused on wrangling the latest technologies in AI inference. He brings a deep background in both AI software engineering and customer management, translating innovations into practical customer outcomes. Before NVIDIA, he held roles developing, breaking, and fixing AI solutions in the enterprise tech sector. He also did research in medical imaging and holds a master’s degree in Computer Science with an AI focus.

 View all posts by Jamie Li

 About Alexander Zhurkevich

 Alex graduated from the University of Massachusetts Boston with a Master's degree. His past research and working experience is mainly in HPC and AI/ML and computer vision. At NVIDIA, he is a developer technology AI engineer focusing on accelerating ML/DL workloads on GPUs.

 View all posts by Alexander Zhurkevich

 About Vedaanta Agarwalla

 As a senior deep learning software engineer at NVIDIA, Vedaanta focuses on accelerating GPU workloads with a current emphasis on optimizing attention kernels for both training and inference. His previous experience spans ResNet optimizations, GEMMs, and HPC for derivatives pricing in quantitative trading. Vedaanta holds a master’s degree in computer science from the University of Illinois Urbana-Champaign.

 View all posts by Vedaanta Agarwalla

 About Seonghee Lee

 Seonghee Lee is an engineer on the AI platform software team at NVIDIA, focusing on AI Inference-related products. Seonghee holds a master’s in computer science from Stanford University and a bachelor’s in science from Cornell University, specializing in AI. Before joining NVIDIA, she worked at Microsoft Research on developing real-time AI agent interactions.

 View all posts by Seonghee Lee

 About Roman Anders

 Roman Anders is a software engineer on the cuDNN team at NVIDIA, where he focuses on Flash Attention optimizations for inference and training workloads across current and next-generation GPU architectures. His contributions at NVIDIA span RNN, matrix multiplications, and convolutions. Previously, he served as an engineer on the Intel MKL team, where he developed Sparse BLAS, Direct Sparse Solvers, and FFT. He holds a master's degree in applied mathematics and programming from Novosibirsk State University in Russia.

 View all posts by Roman Anders

 Comments

 Related posts

 Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM

 Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM

 Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

 Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

 Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

 Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

 OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

 OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

 NVIDIA TensorRT-LLM Enhancements Deliver Massive Large Language Model Speedups on NVIDIA H200

 NVIDIA TensorRT-LLM Enhancements Deliver Massive Large Language Model Speedups on NVIDIA H200

 Related posts

 AI Aims to Bring Order to the Law

 AI Aims to Bring Order to the Law

 How Modern Supercomputers Powered by NVIDIA Are Pushing the Limits of Speed — and Science

 How Modern Supercomputers Powered by NVIDIA Are Pushing the Limits of Speed — and Science

 AI Helps Locate Dangerous Fishing Nets Lost at Sea

 AI Helps Locate Dangerous Fishing Nets Lost at Sea

 Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

 Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training

 GPU Memory Essentials for AI Performance

 GPU Memory Essentials for AI Performance

 L

 T

 F

 R

 E

How to Minimize Game Runtime Inference Costs with Coding Agents | NVIDIA Technical Blog

nvidia_dev_blog

03.03.2026 19:49

0.681

Embedding sim.	0.806
Entity overlap	0.0476
Title sim.	0.3455
Time proximity	0.4117

NLP тип	product_launch
NLP организация	NVIDIA
NLP тема	ai agents
NLP страна

Открыть оригинал

NVIDIA ACE is a suite of technologies for building AI agents for gaming. ACE provides ready-to-integrate cloud and on-device AI models for every part of in-game characters, from speech to intelligence to animation.

 To run these models alongside the game engine efficiently, the NVIDIA In-Game Inferencing (NVIGI) SDK includes a set of performant libraries that developers can integrate into C++ games and applications. 

 NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample in which an AI agent works with the player to defeat monsters in a 2D dungeon. AI agents driven by local small language models (SLMs) can make excessive calls to the GPU that compete with graphics. This post examines how to minimize the number of inference calls and maximize what each call accomplishes, reducing contention on the GPU between graphics and compute. 

 Code agents: Trapping the ghost

 Andrej Karpathy, a founding member of OpenAI, likens working with large language models (LLMs) to summoning ghosts , an apt metaphor for LLM agents, especially ones that write code. Many custom agents limit themselves to tool calling: a function is defined, the LLM decides when to call it, and a result is returned. There is a more ambitious possibility. Instead of just calling a function, an AI agent can create the function and the code to support it. This makes the machine more powerful with less processing. 

 There is, however, a trade-off. An unconstrained LLM with code execution capabilities is a security issue. It can exhaust memory, hang the game process, or, as one unfortunate user discovered, wipe a hard drive while trying to “clear a cache.”

 It can have benefits, like complex multi-step reasoning, dynamic adaptation, and reduced usage of the SLM. The following dives into how a potential coding agent ghoul turned into a friendly ghost eager to help.

 Why code agents outshine tool-calling

 When discussing AI agents, the most typical use case and approach is tool-calling. The model outputs structured JSON, the game or application parses it, and then executes the corresponding function. While the ability to call functions is powerful, it only comes after the model has had a chance to think on it a bit, and inference is expensive, especially when it fights for resources on the user’s GPU. 

 Once the model sends the JSON, it waits for a response, thinks again, and returns an answer—potentially repeating the cycle. This can consume valuable fractions of a second that could be spent rendering the game.

 Moreover, if complex logic is required around the function call, the system must rely on weaker model capabilities. The model doesn’t inherently handle looping; it simply produces tokens. It can try to track state variables, but there is no rigor. If multiple items need addressing, the model must remember each one without missing, duplicating, or hallucinating entries. And every item processed pays an inference cost.

 Numeric analysis introduces another challenge. With tool-calling, accuracy depends on the model’s mathematical ability or on writing yet another function to ensure correctness.

 Tool-calling can struggle to scale. Every function call requires another inference hit that competes for GPU resources and must be mitigated.

 Code agents work by using something computers are already good at—running code. Programming is one of the emerging superpowers of language models. Instead of generating one function call at a time, a single inference can generate all the function calls at once. There’s no performance hit after the initial generation, just standard code that runs until the task is complete.

 They’re also flexible. While language models can’t easily loop themselves, code agents can easily write code with loops, counters, and filters. The following is a hypothetical example of how tool calling might be used to target an enemy.

 Tool-calling schema:

 [
 {
 "name": "get_enemies_list",
 "parameters": {
 "properties": {
 "position": {"type": "string", "description": "Position to search from"},
 "radius": {"type": "number", "description": "Search radius"}
 }
 }
 },
 {
 "name": "target_enemy",
 "parameters": {
 "properties": {
 "enemy_name": {"type": "string", "description": "Name of the enemy to target"}
 },
 "required": ["enemy_name"]
 }
 }
]

 When the user says “target the nearest enemy”:

 Inference call 1 : SLM decides to call get_enemies_list

 Tool response : Returns ["goblin_01", "skeleton_archer_01", "orc_chief"]
 (just strings, otherwise, full entity schemas blow out the context window)

 Inference call 2 : SLM sees the list, picks one, calls target_enemy("goblin_01")

 Tool response : Success

 Inference call 3 : Feedback to the user about the status of the function call

 Three inference calls for one decision. Consider the same “target enemy” action with a code agent.

 Code agent API definition:

 get_enemies(position, radius)
 --[[
 Find enemies near a position.
 Parameters:
 position (table): Center point as {row, col}
 radius (number): Search radius
 Returns:
 table: Array of enemy entities (with .name, .position, .health, etc.)
 Example:
 local nearby = get_enemies(ally.position, 10)
 ]]

set_target(ally, enemy)
 --[[
 Set an ally's attack target.
 Parameters:
 ally (entity): The ally to command
 enemy (entity): The enemy to target
 Example:
 set_target(warrior, nearby[1])
 ]]

 SLM-generated code for “target the nearest enemy”:

 local enemies = get_enemies(ally.position, 10)
local closest = nil
local min_dist = math.huge

for _, enemy in ipairs(enemies) do
 local dx = enemy.position[1] - ally.position[1]
 local dy = enemy.position[2] - ally.position[2]
 local dist = math.abs(dx) + math.abs(dy)
 if dist < min_dist then
 min_dist = dist
 closest = enemy
 end
end

if closest then
 set_target(ally, closest)
end

 With one inference call, the SLM loops over enemies, accesses their positions, calculates distances, and picks the closest. The code agent gets rich entity objects, not just strings, and composes logic that the tool designer never anticipated.

 Notice the flexibility. That same get_enemies
 function works for enemies near the player, near an ally, or near a point. Once the SLM has the enemy list, it can write any selection logic, such as targeting enemies weak to arrows, targeting the closest one, or targeting the one with the lowest health. With tool-calling, adapting to new requirements means more tools, more inference calls, and more complexity. With code agents, the SLM composes new strategies at runtime from the same simple primitives.

 Code agent sample dungeon

 Keeping with the ghoulish theme, the IGI SDK includes an ASCII dungeon crawler to demonstrate the code agent. The dungeon contains all the pieces of a large game, but in one of gaming’s simplest forms. Players move around, collect items, and fight monsters. But they also have a powerful ally on their adventure, an AI agent. An intelligence that can materialize on demand to help them fight, go on dangerous missions, or provide information about the dangers that await.

 Figure 1. The AI navigates the maze to retrieve the bow

 Once an instruction is given, the code is written, and the program doesn’t touch the SLM again until a new instruction is given. A tool call chain may produce the same results, but at the cost of repeated inference calls eating into the allocated frame time slice.

 The threat model of a code agent

 Using an SLM to generate code that runs on the host introduces obvious security and safety risks, including:

 Dangerous function access. The SLM generates os.execute("rm -rf /")
 or require("socket")
, and suddenly the code agent is deleting files or opening network connections. 

 Unauthorized file access. The SLM locates critical files or API keys to exfiltrate or delete.

 Resource exhaustion. The SLM writes a loop that allocates memory forever.

 Stack overflow. The SLM writes a recursive function without proper termination.

 Infinite loops. The SLM writes while true do end
 and never returns.

 Escaping the sandbox. The SLM might manipulate internal structures to break out of its containment. 

 State corruption. The SLM might corrupt the game or application’s state. 

 Choosing a target language

 When choosing a target language, consider: time to execution, general performance, complexity of integration and debugging, and the quality and safety of code produced.

 While running a game, inference calls must take a fraction of the total frame time. Large hits that stall the rendering pipeline are unacceptable. While it’s possible to generate a few tokens at a time each frame to smooth out inference, compilation does not offer that flexibility. This rules out compiled languages such as C++ or C#. Instead, an interpreted language is required.

 Two languages stand out as examples: Python and Lua.

 Python is the obvious first choice. SLMs generate Python fluently. The ecosystem is massive. But Python wasn’t designed for embedding or sandboxing. The Global Interpreter Lock (GIL) complicates multi-threaded hosts. Isolation requires subprocesses or subinterpreters, both adding complexity. Further, there’s no built-in way to limit memory or execution time. Python can run in a sandbox, but it’s a fight against the language the whole way.

 Lua was designed from the ground up for embedding in hostile environments. The entire runtime is about 200 kB and starts in sub-millisecond time. Plus, every identified threat has documented mitigation, including:

 Dangerous functions : Selective library loading. Don’t load io
 or os
, and they don’t exist.

 Memory exhaustion : Custom allocator hook. Track every allocation, enforce a cap.

 Stack overflow : Debug hooks on function calls. Count depth, error on overflow.

 Infinite loops : Debug hooks on instruction count. Error after N instructions.

 Metatable manipulation : Remove getmetatable/setmetatable
 from globals.

 State corruption : Custom _ _newindex
 metamethods that reject writes to protected fields.

 Thus, Lua met all the requirements for this IGI sample but still required hardening. With Lua, dangerous or unwanted functions is set to nil (lua_pushnil(L); lua_setglobal(L,”funcname”)
 pattern). Memory growth is limited by wrapping the default allocator and tracking allocations. The programmer can set up hooks ( lua_sethook
 ) to make sure programs don’t blow out the call stack or hang indefinitely. Similarly, metatable access can be restricted with custom metamethods locked down to protect the game state. 

 These are just some of the steps taken to lock down this sample. More may be required (depending on each particular game or use case), but these tips should help guide the reader while looking through the code.

 For added security, Lua can be embedded in a web assembly runtime. See the blog posts Sandboxing Agentic AI Workflows with WebAssembly and Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk for more information about ways to secure agentic behavior.

 Security is a core concern, not an afterthought. Language choice is a security decision, not a convenience decision. Start with this premise and understand the different attack vectors to guard against, and the ghost stays a friend in the machine.

 Get started with NVIDIA In-Game Inferencing SDK

 Try the sample with the NVIDIA In-Game Inference SDK . Build it, experiment, and think of ways to employ it in games, apps, and other projects.

 Join us at GDC

 Explore how NVIDIA RTX neural rendering and AI are shaping the next era of gaming. Get a glimpse into the future of game development with John Spitzer, vice president of Developer and Performance Technology at NVIDIA, as he unveils the latest innovations in path tracing and generative AI workflows.

 Join Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA, for an interactive “Ask Me Anything” session covering the latest trends in AI.

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Developer Tools & Techniques | Gaming | ACE | Intermediate Technical | Tutorial | DLSS | featured | Game Performance | RTX AI | SLMs

 About the Authors

 About Brandon Rowlett

 Brandon Rowlett is a dev tech at NVIDIA, where he helps game developers integrate AI into their titles. He focuses on local AI models that run efficiently on consumer GPUs, working primarily in Python and C++. With 25 years of experience in game development and AI, he has contributed to technologies such as DLSS and DLSS 3 frame generation.

 View all posts by Brandon Rowlett

 Comments

 Related posts

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 Open Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs

 Open Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs

 Benchmarking Agentic LLM and VLM Reasoning for Gaming with NVIDIA NIM

 Benchmarking Agentic LLM and VLM Reasoning for Gaming with NVIDIA NIM

 Bring NVIDIA ACE AI Characters to Games with the New In-Game Inferencing SDK

 Bring NVIDIA ACE AI Characters to Games with the New In-Game Inferencing SDK

 Generative AI Sparks Life into Virtual Characters with NVIDIA ACE for Games

 Generative AI Sparks Life into Virtual Characters with NVIDIA ACE for Games

 Related posts

 Train Small Orchestration Agents to Solve Big Problems

 Train Small Orchestration Agents to Solve Big Problems

 How Small Language Models Are Key to Scalable Agentic AI

 How Small Language Models Are Key to Scalable Agentic AI

 GPU Memory Essentials for AI Performance

 GPU Memory Essentials for AI Performance

 Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries

 Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries

 Accelerating LLMs with llama.cpp on NVIDIA RTX Systems

 Accelerating LLMs with llama.cpp on NVIDIA RTX Systems

 L

 T

 F

 R

 E

The U.S. and China Are Pursuing Different AI Futures

ieee_spectrum_ai

19.02.2026 17:03

0.679

Embedding sim.	0.7852
Entity overlap	0.04
Title sim.	0.0952
Time proximity	0.958

NLP тип	other
NLP организация	Institute for AI Policy and Strategy
NLP тема	ai governance
NLP страна	United States

Открыть оригинал

More money has been invested in AI than it took to land on the moon. Spending on the technology this year is projected to reach up to US $700 billion , almost double last year’s spending. Part of the impetus for this frantic outlay is a conviction among investors and policymakers in the United States that it needs to “beat China.” Indeed, headlines have long cast AI development as a zero-sum rivalry between the U.S. and China, framing the technology’s advance as an arms race with a defined finish line. The narrative implies speed, symmetry, and a common objective.
 But a closer look at AI development in the two countries shows they’re not only not racing toward the same finish line: “The U.S. and China are running in very different lanes,” says Selina Xu , who leads China and AI policy research in New York City for Eric Schmidt, the tech investor, philanthropist, and former Google chief . “The U.S. is doubling down on scaling,” in pursuit of artificial general intelligence (AGI) Xu says, “while for China it’s more about boosting economic productivity and real-world impact.” 
 Lumping the U.S. and China onto a single AI scoreboard isn’t just inaccurate, it can impact policy and business decisions in a harmful way. “An arms race can become a self-fulfilling prophecy,” Xu says. “If companies and governments all embrace a ‘race to the bottom’ mentality, they will eschew necessary security and safety guardrails for the sake of being ahead. That increases the odds of AI-related crises.”
 Where’s the Real Finish Line?
 As machine learning advanced in the 2010s, prominent public figures such as Stephen Hawking and Elon Musk warned that it would be impossible to separate AI’s general-purpose potential from its military and economic implications, echoing Cold War–era frameworks for strategic competition. “An arms race is an easy way to think about this situation even if it’s not exactly right,” says Karson Elmgren , a China researcher at the Institute for AI Policy and Strategy , a think tank in Washington, D.C. The labs, investors, and media in what’s known as frontier technology benefit from simple, comparable progress metrics, like larger models, better benchmarks, and more computing power, so they favor and compound the arms-race framing.
 Artificial general intelligence is the implied “finish line” if AI is an arms race. But one of the many problems with an AGI finish line is that by its very nature, a machine superintelligence would be smarter than humans and therefore impossible to control. “If superintelligence were to emerge in a particular country, there’s no guarantee that that country’s interests are going to win,” says Graham Webster , a China researcher at Stanford University . 
 An AGI finish line also assumes the U.S. and China are both optimizing for this goal and putting the majority of their resources toward it. This isn’t the case, as the two countries have starkly different economic landscapes.
 When Is the Payoff?
 After decades of rapid growth , China is now facing a grimmer reality. “China has been suffering through an economic slowdown for a mixture of reasons, from real estate to credit to consumption and youth unemployment,” says Xu, adding that the country’s leaders have been “trying to figure out what is the next economic driver that can get China to sustain its growth.”
 Enter AI. Rather than pouring resources into speculative frontier models, Beijing has a pressing incentive to use the technology as a more immediate productivity engine. “I n China we define AI as an enabler to improve existing industry, like health care, energy, or agriculture,” says AI policy researcher Liang Zheng , of Tsinghua University in Beijing. “The first priority is to use it to benefit ordinary people.” 
 To that end, AI investment in China is focused on embedding the technology into manufacturing, logistics, energy, finance, and public services. “ It’s a long-term structural change, and companies must invest more in machines, software, and digitalization,” Liang says. “Even very small and medium enterprises are exploring use of AI to improve their productivity.” 
 China’s AI Plus initiative encourages using AI to boost efficiency. “Having a frontier technology doesn’t really move China towards an innovation-led developed economy,” says Kristy Loke , a fellow at MATS Research who focuses on China’s AI innovation and governance strategies. Instead, she says, “It’s really important to make sure that [these tools] are able to meet the demands of the Chinese economy, which are to industrialize faster, to do more smart manufacturing, to make sure they’re producing things in competitive processes.”
 Automakers have embraced intelligent robots in “dark factories” with minimal human intervention; as of 2024, China had around five times as many factory robots in use than the United States. “We used to use human eyes for quality control and it was very inefficient,” says Liang. Now, computer-vision systems detect errors and software predicts equipment failures, pausing production and scheduling just-in-time maintenance. Agricultural models advise farmers on crop selection, planting schedules, and pest control.
 In health care, AI tools triage patients, interpret medical images, and assist diagnoses; Tsinghua is even piloting an AI “Agent Hospital” where physicians work alongside virtual clinical assistants. “I n hospitals you used to have to wait a long time, but now you can use your agent to make a precise appointment,” Liang says. Many such applications use simpler “narrow AI” designed for specific tasks. 
 AI is also increasingly embedded across industries in the United States, but the focus tends toward service-oriented and data-driven applications, leveraging large language models (LLMs) to handle unstructured data and automate communication. For example, banks use LLM-based assistants to help users manage accounts, find transactions, and handle routine requests; LLMs help health care professionals extract information from medical notes and clinical documentation.
 “LLMs as a technology naturally fit the U.S. service-sector-based economy more so than the Chinese manufacturing economy,” Elmgren says.
 Competition and Cooperation
 The U.S. and China do compete more or less head-to-head in some AI-related areas, such as the underlying chips. The two have grappled to gain enough control over their supply chains to ensure national security, as recent tariff and export control fights have shown. “I think the main competitive element from a top level [for China] is to wriggle their way out of U.S. coercion over semiconductors. They want to have an independent capability to design, build, and package advanced semiconductors,” Stanford’s Webster says.
 Military applications of AI are also a significant arena of U.S.–China competition, with both governments aiming to speed decision making, improve intelligence, and increase autonomy in weapons systems. The U.S. Department of Defense launched its AI Acceleration Strategy last month, and China has explicitly integrated AI into its military modernization strategy under its policy of military-civil fusion . “From the perspective of specific military systems, there are incremental advantages that one side or the other can gain,” Webster says.
 Despite China’s commitment to military and industrial applications, it has not yet picked an AI national champion. “After Deepseek in early 2025 the government could have easily said, ‘You guys are the winners, I’ll give you all the money, please build AGI,’ but they didn’t. They see being ‘close enough’ to the technological frontier as important, but putting all eggs in the AGI basket as a gamble,” Loke says.
 American companies are also still working with Chinese technology and workers, despite a slow uncoupling of the two economies. Though it may seem counterintuitive, more cooperation—and less emphasis on cutthroat competition—could yield better results for all. “For building more secure, trustworthy AI, you need both U.S. and Chinese labs and policymakers to talk to each other, to reach consensus on what’s off limits, then compete within those boundaries,” Xu says. “The arms race narrative also just misses the actual on-the-ground reality of companies co-opting each other’s approaches, the amount of research that gets exchanged in academic communities, the supply chains and talent that permeates across borders, and just how intertwined the two ecosystems are.”

 A correction to this article was made on 23 February 2026. The Institute for AI Policy and Strategy is in Washington, D.C., not San Francisco.

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy | NVIDIA Technical Blog

nvidia_dev_blog

23.02.2026 18:00

0.676

Embedding sim.	0.8096
Entity overlap	0.0588
Title sim.	0.2632
Time proximity	0.4256

NLP тип	experiment
NLP организация	NVIDIA
NLP тема	computational efficiency
NLP страна

Открыть оригинал

As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models. 

 Using lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, enhancing training efficiency and lowering costs. 

 This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion token pretraining runs and downstream benchmarks: 

 8-bit floating point per-tensor current scaling (FP8-CS)

 Mixed precision training with FP8 (MXFP8)

 NVFP4 precision training using NVIDIA NeMo Megatron Bridge , an open source library that is part of NVIDIA NeMo framework  

 We present practical, large-scale results showing how low-precision training delivers up to ~1.6x higher throughput, substantial memory savings, and near-identical model quality using production-ready recipes you can adopt today. 

 ​​What is low-precision training?

 Low-precision training uses numerical formats with fewer bits to represent weights and  activations during model training. This reduces memory bandwidth and computational demand, enabling GPUs to process more operations per cycle and significantly increase training throughput. 

 Low-precision formats

 FP8-CS applies FP8 to linear layers using scaling factors derived from the statistical properties of each tensor at the current training step. MXFP8 extends the FP8 approach with block-level scaling optimized for the NVIDIA Blackwell architecture , with each block covering 32 tensor elements. NVFP4 further improves memory efficiency and throughput by using the 4-bit format for tensor values with a hierarchical two-level scaling strategy.

 Figure 1. Comparison of FP8, MXFP8, and NVFP4 low-precision formats. E stands for the exponent and M for Mantissa in the numerical representation

 Can low-precision training match BF16 accuracy at scale? 

 To validate the practical impact of low-precision training for real-world large-model pretraining, the team evaluated both the training convergence and downstream task performance across two widely used dense transformer architectures: Llama 3 8B and an NVIDIA internal research 8B model (Research-8B with dense grouped query attention (GQA) architecture that is similar to Llama 3 8B). The models were trained on 1 trillion tokens.

 Experimental setup: Isolating the impact of precision

 The following large-scale pretraining experiments were run:

 Four numeric precisions : BF16 (baseline), FP8-CS, MXFP8, and NVFP4

 Two model architectures : Llama 3 8B and Research-8B

 Training software and hardware : NeMo Megatron Bridge on NVIDIA B200 GPUs

 Two datasets : Lingua DCLM Dataset and an internal dataset. Llama 3 8B was trained on both datasets and Research-8B was trained on the internal NVIDIA research dataset

 Convergence behavior: Training stability across precisions

 Figures 2, 3, and 4 show training and validation loss curves for both models and datasets. Low-precision training closely tracks with the BF16 baseline, demonstrating stable and consistent convergence across precisions. In all cases, NVFP4 shows slightly higher loss but downstream accuracies remain unaffected. See Table 1 for more details.

 Figure 2. Training and validation loss for the Llama 3 8B trained on the Lingua DCLM dataset across BF16, FP8-CS, MXFP8, and NVFP4

 Figure 3. Training and validation loss for Llama 3 8B trained on the internal NVIDIA research dataset across BF16, FP8-CS, MXFP8, and NVFP4

 Figure 4. Training and validation loss for Research-8B trained on the internal dataset

 Downstream evaluation: Accuracy is preserved

 To assess whether low-precision training impacts real-world performance, we evaluated all pretrained models on standard downstream benchmarks. All evaluations were run in BF16 precision to isolate the impact of training precision. 

 Table 1 shows the results. Despite minor differences in training and validation loss, all low-precision formats achieve downstream task accuracy comparable to BF16. 

 Model
 Dataset
 Precision
   MMLU  (↑)
 HellaSwag (↑)
 WinoGrande (↑)
 ARC-C (↑)

 Llama 3 8B
 DCLM
 BF16
 45.98
 76.44
 70.17
 51.28

 FP8-CS
 46
 75.25
 70.24
 49.91

 MXFP8
 46.56
 75.46
 71.27
 51.11

 NVFP4
 45.64
 75.59
 69.38
 51.28

 Llama 3 8B
 Internal dataset 
 BF16
 52.73
 75.71
 67.88
 51.37

 FP8-CS
 52.46
 75.65
 70.17
 54.52

 MXFP8
 53.7
 75.54
 69.69
 51.62

 NVFP4
 52.83
 75.04
 71.98
 53.58

 Research-8B
 Internal dataset
 BF16
 53
 76.98
 70.4
 55.89

 FP8-CS
 52.62
 75.81
 70.8
 54.44

 MXFP8
 52.38
 76.55
 69.77
 53.58

 NVFP4
 52.21
 76.19
 70.32
 54.95

 Table 1. Downstream task accuracy (%) for Llama 3 8B and Research-8B across BF16, FP8-CS, MXFP8, and NVFP4 training

 Key insights

 Key insights from these experiments are detailed below.

 Low precision training matches BF16 convergence: FP8, MXFP8, NVFP4 achieve pretraining and validation losses very close to BF16, showing minimal degradation.

 Downstream accuracy is preserved: Across all models and benchmarks, low-precision training delivers downstream task accuracy comparable to BF16, demonstrating that reduced precision maintains model effectiveness.

 MXFP8 performs slightly better than standard FP8: This is likely due to its finer-grained scaling mechanism, which better captures local dynamic range within tensors.

 NVFP4 with proper calibration delivers competitive results despite aggressive compression : The following recipe is the empirical sweet spot: AdamW ϵ=1e-8, LR=6e-4 → 6e-6, GBS=768. 

 Selective BF16 layers are essential for NVFP4: Ablation studies show that fully NVFP4 models diverge. Stable training requires keeping some layers in BF16, particularly near the end of the network, to mitigate NVFP4 quantization error. In these experiments, maintaining the final four transformer layers in BF16 proved sufficient. 

 Advantages of FP8, MXFP8, and NVFP4 training

 Low-precision formats deliver clear gains in both training throughput and memory efficiency, enabling faster end-to-end training and better scalability on NVIDIA Blackwell GPUs.

 Precision
 Micro-batch size
 Throughput (TFLOP/s/GPU)
 Speedup versus BF16

 BF16
 2
 1165
 –

 FP8-CS (F1L1)
 2
 1547
 1.33x

 MXFP8
 2
 1540
 1.32x

 NVFP4 (F0L4)
 4
 1850
 1.59x

 Table 2. Throughput comparison for Llama 3 8B training on NVIDIA GB200 NVL72 shows up to 1.59x speedup with NVFP4 compared to BF16

 GBS=128, Seq. Length=8192. Note that FxLy denotes the first ‘x’ layers and last ‘y’ transformer block layers are kept in BF16 precision

 Faster end-to-end training

 Using 8-bit or 4-bit numeric formats drastically reduces computational overhead by enabling GPUs to process more operations per clock cycle. Gains in throughput can be up to 1.59x over BF16 baseline (Table 2). These gains translate directly into faster time-to-train for large-scale models.

 GPU memory savings and better scalability

 Using lower bit-width formats reduces the memory footprint of weights and activations, allowing larger models or batch sizes on the same hardware. NVFP4 efficiency enables the micro-batch size to double (from 2 to 4) during pretraining, directly improving throughput and scalability. 

 Table 3 provides a detailed breakdown of memory usage across training components. Lower-precision formats significantly reduce parameter and activation storage while preserving FP32 optimizer state, enabling higher throughput and larger batch sizes without compromising training stability.

 Optimizer

 Precision
 Parameter
 Gradients
 Momentum
 Variance
 Master parameter
 Others

 FP16
 FP16
 FP32
 FP32
 FP32
 FP32

 BF16
 BF16
 BF16

 FP8 (tensor scaling)
 FP8x2
 BF16
 Scaling factor per weight tensor

 MXFP8
 FP8x2
 BF16
 (Scaling factor per 32 elements) x 2

 NVFP4
 FP4
 BF16
 16×16 2D block scales replicated for each 1×16 block

 Table 3. Memory footprint across training components for different precision formats

 Low-precision training with NeMo Megatron Bridge

 NeMo Megatron Bridge is an open PyTorch-native library within the NVIDIA NeMo framework. It bi-directionally connects Hugging Face and Megatron Core model checkpoints. It provides optimized training and multi-node parallelisms required to pretrain, SFT, and LoRA-tune generative AI models at maximum throughput. 

 Adopting low-precision training using the NeMo Megatron Bridge library is straightforward. You can use ready-to-use low-precision recipes for various models to experiment with different precision formats by changing a single configuration flag. An example for Llama 3 8B is shown below:

from megatron.bridge.recipes.llama import llama3_8b_low_precision_pretrain_config as low_precision_pretrain_config
from megatron.bridge.training.gpt_step import forward_step

precision = "bf16_with_fp8_current_scaling_mixed" # should be one of ["bf16_with_mxfp8_mixed", "bf16_with_fp8_current_scaling_mixed", "bf16_with_nvfp4_mixed"]
cfg = low_precision_pretrain_config(
 mixed_precision_recipe = precision,
 train_iters = 100,
 lr_warmup_iters = 10,
 lr_decay_iters = 90,
 mock = True, # use mock dataset
)
pretrain(config=cfg, forward_step_func=forward_step)

 You can easily switch between precision formats to evaluate performance, memory savings, and convergence behavior—without modifying model code or optimizer logic.

 Train faster and scale efficiently 

 Low-precision training formats like FP8 with current scaling, MXFP8, and NVFP4 offer exciting new avenues for faster, more efficient deep learning training compared to the widely adopted BF16. Their advantages in speed and memory savings open doors for training larger, more complex models. Empirical evidence from Llama 3 8B and internal research models confirms that training with low precision matches BF16 performance on both pretraining metrics and downstream tasks.

 Get started with low-precision training

 As model sizes continue to scale, low-precision training will be foundational to building the next generation of models. With native NVIDIA Blackwell GPU support and production-ready low-precision recipes in NeMo Megatron Bridge , you can try these techniques today. 

 To get started quickly, try the Megatron Bridge Training Tutorial notebook. It walks through using these low-precision recipes end to end and demonstrates how they can significantly accelerate training workloads.

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Developer Tools & Techniques | MLOps | General | Blackwell | NeMo | Intermediate Technical | Deep dive | featured | NVFP4 | Training AI Models

 About the Authors

 About Aditya Vavre

 Aditya Vavre is a deep learning algorithms engineer at NVIDIA, where he focuses on advancing efficient large-scale language model training and architecture design. His past work includes 4-bit and 8-bit LLM pretraining, quantization-aware training and distillation, and sparse attention mechanisms, enabling more efficient long-context and large-scale transformer models. Prior to NVIDIA, he contributed to research and development in NLP and AI applications during his time as a Research Engineer at Sony, building retrieval-based dialogue systems and text-to-video generation pipelines. Aditya holds a master’s degree in Computer Science from The University of Texas at Austin and a bachelor’s degree from IIT Bombay. His interests lie at the intersection of scalable deep learning systems, model efficiency, and next-generation foundation model architectures.

 View all posts by Aditya Vavre

 About Nima Tajbakhsh

 Nima Tajbakhsh is a deep learning algorithm manager with eight years of industry expertise, specializing in computer vision, LLM, and multimodal GenAI models. Within NVIDIA NeMo, he spearheads the development and optimization of training and inference workflows for diverse GenAI models. By integrating cutting-edge AI technologies, his team drives innovation and advancement in the field, ensuring NeMo remains at the forefront of AI research and application.

 View all posts by Nima Tajbakhsh

 About Wenwen Gao

 Wenwen Gao is a senior product manager for NeMo at NVIDIA, focusing on LLM training framework and microservices. Her past experience include LLM inference (NIM) and recommender systems (Merlin). She holds a B.S. in computer science from the University of Toronto and an M.B.A. from the MIT Sloan School of Management.

 View all posts by Wenwen Gao

 About Selvaraj Anandaraj

 Selvaraj Anandaraj is a Deep Learning Performance Engineer working on accelerating Deep Learning workloads using NVIDIA hardware and software stacks. His recent work is focused on having a highly performant software stack to train and infer large language models at scale. He earned a Master’s degree from the University of Wisconsin-Madison with a specialization in Machine Learning systems.

 View all posts by Selvaraj Anandaraj

 About Amit Bleiweiss

 Amit Bleiweiss is a senior data scientist at NVIDIA, where he focuses on large language models and generative AI. He has 25 years of experience in applied machine learning and deep learning, with over 50 patents and publications in the domain. Amit received his MSc from Hebrew University of Jerusalem, where he specialized in machine learning.

 View all posts by Amit Bleiweiss

 Comments

 Related posts

 Faster Training Throughput in FP8 Precision with NVIDIA NeMo

 Faster Training Throughput in FP8 Precision with NVIDIA NeMo

 Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

 Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

 NVIDIA, Arm, and Intel Publish FP8 Specification for Standardization as an Interchange Format for AI

 NVIDIA, Arm, and Intel Publish FP8 Specification for Standardization as an Interchange Format for AI

 Getting Immediate Speedups with NVIDIA A100 TF32

 Getting Immediate Speedups with NVIDIA A100 TF32

 NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch

 NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch

 Related posts

 Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo

 Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo

 How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation

 How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation

 Accelerating Long-Context Model Training in JAX and XLA

 Accelerating Long-Context Model Training in JAX and XLA

 Top 5 AI Model Optimization Techniques for Faster, Smarter Inference

 Top 5 AI Model Optimization Techniques for Faster, Smarter Inference

 Training XGBoost Models with GPU-Accelerated Polars DataFrames

 Training XGBoost Models with GPU-Accelerated Polars DataFrames

 L

 T

 F

 R

 E

Scaling AI for everyone

openai

27.02.2026 05:30

0.676

Embedding sim.	0.7664
Entity overlap	0.3077
Title sim.	0.0448
Time proximity	1

NLP тип	funding
NLP организация	SoftBank
NLP тема	ai infrastructure
NLP страна

Открыть оригинал

Today we’re announcing $110B in new investment at a $730B pre money valuation. This includes $30B from SoftBank, $30B from NVIDIA, and $50B from Amazon.

CORPGEN advances AI agents for real work

microsoft_research

26.02.2026 17:06

0.673

Embedding sim.	0.8167
Entity overlap	0.0357
Title sim.	0.0263
Time proximity	0.6938

NLP тип	scientific_publication
NLP организация	Microsoft
NLP тема	ai agents
NLP страна

Открыть оригинал

CORPGEN advances AI agents for real work

 Published
 February 26, 2026

 By

 Abubakarr Jaye

 ,

 Applied Scientist 2

 Nigel Boachie Kumankumah

 ,

 Software Engineer

 Chidera Biringa

 ,

 Applied Scientist 2

 Anjel Patel

 ,

 Software Engineer

 Dayquan Julienne

 ,

 Product Manager 2

 Tianwei Chen

 ,

 Senior Software Engineering Manager

 Sulaiman Vesal

 ,

 Senior Applied Science Manager, Microsoft’s AI Development Acceleration Program within the office of the CTO

 Share this page

 Share on Facebook

 Share on X

 Share on LinkedIn

 Share on Reddit

 Subscribe to our RSS feed

 At a glance

 Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).

 Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.

 CORPGEN introduces digital employees , with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.

 Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

 By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

 In our paper, “ CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments ,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

 Introducing Multi-Horizon Task Environments

 Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

 To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.

 We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

 CORPGEN’s architecture

 CORPGEN introduces digital employees : LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

 Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

 CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

 Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.

 Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

 Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

 How digital employees collaborate

 When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

 There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

 When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

 Evaluating CORPGEN

 We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

 Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

 Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

 Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.

 Spotlight: Microsoft research newsletter

 Microsoft Research Newsletter

 Stay connected to the research community at Microsoft.

 Subscribe today

 Opens in a new tab

 Implications and looking forward

 The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.

 CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.

 Acknowledgments

 This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

 Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.

 Opens in a new tab

 Related publications

 CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments  

 Meet the authors

 Abubakarr Jaye

 Applied Scientist 2

 Learn more

 Nigel Boachie Kumankumah

 Software Engineer

 Learn more

 Chidera Biringa

 Applied Scientist 2

 Learn more

 Anjel Patel

 Software Engineer

 Learn more

 Dayquan Julienne

 Product Manager 2

 Learn more

 Tianwei Chen

 Senior Software Engineering Manager

 Learn more

 Sulaiman Vesal

 Senior Applied Science Manager, Microsoft’s AI Development Acceleration Program within the office of the CTO

 Learn more

 Continue reading

 March 12, 2026

 Systematic debugging for AI agents: Introducing the AgentRx framework  

 December 11, 2025

 Agent Lightning: Adding reinforcement learning to AI agents without code rewrites  

 May 19, 2025

 Magentic-UI, an experimental human-centered web agent  

 February 25, 2025

 Magma: A foundation model for multimodal AI agents across digital and physical worlds  

 See all blog posts

 Research Areas

 Artificial intelligence

Our agreement with the Department of War

openai

28.02.2026 12:30

0.671

Embedding sim.	0.7886
Entity overlap	0.0909
Title sim.	0.0845
Time proximity	0.8155

NLP тип	other
NLP организация	OpenAI
NLP тема	ai safety
NLP страна

Открыть оригинал

Details on OpenAI’s contract with the Department of War, outlining safety red lines, legal protections, and how AI systems will be deployed in classified environments.

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute | NVIDIA Technical Blog

nvidia_dev_blog

18.02.2026 17:00

0.671

Embedding sim.	0.7516
Entity overlap	0.0938
Title sim.	0.1899
Time proximity	0.994

NLP тип	product_launch
NLP организация	NVIDIA
NLP тема	ai infrastructure
NLP страна

Открыть оригинал

Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry.

 Frameworks like PyTorch address this by implementing kernels in CUDA C++—either handwritten or by leveraging libraries like the NVIDIA CUDA Core Compute Libraries . Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often better, since its primitives are highly optimized per architecture and are rigorously tested. But exposing CUB to Python traditionally means building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators—limiting flexibility on the Python side. 

 The NVIDIA cuda.compute
 library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives. 

 Using cuda.compute
 helped an NVIDIA CCCL team top the GPU MODE leaderboard, a kernel competition hosted by an online community with more than 20,000 members and a focus on learning and improving GPU programming. GPU MODE hosts the kernel competitions to find the best implementations for a variety of tasks, from simple vector addition to more complex block matrix multiplications. 

 The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel primitives across GPU architectures through high-level abstractions. It achieved the most first-place finishes overall on the tested GPU architectures: NVIDIA B200, NVIDIA H100, NVIDIA A100, and NVIDIA L4. 

 In this blog we’ll share more details about how we were able to place so high on the leaderboard.

 CUDA Python: GPU performance meets productivity

 CUB offers highly optimized CUDA kernels for common parallel operations, including those featured in the GPU MODE competition. These kernels are architecturally tuned and widely considered near speed-of-light implementations.

 The cuda.compute
 library supports custom types and operators defined directly in Python. Under the hood, it just-in-time (JIT) compiles specialized kernels and applies link-time optimization to deliver near-SOL performance on par with CUDA C++. You stay in Python while getting the flexibility of templates and the performance of tuned CUDA kernels.

 With cuda.compute
 you get:

 Fast, composable CUDA workflows in Python: Develop efficient and modular CUDA applications directly within Python.

 Custom data types and operators: Utilize custom data types and operators without the need for C++ bindings.

 Optimized performance: Achieve architecture-aware performance through proven CUB primitives.

 Rapid iteration: Accelerate development with JIT compilation while maintaining CUDA C++ levels of performance. JIT compilation accelerates the development cycle by providing the flexibility and rapid iteration cycles that developers need without compromising performance.

 The leaderboard results

 Using cuda.compute
, we submitted entries across GPU MODE benchmarks for PrefixSum , VectorAdd , Histogram , Sort , and Grayscale (look for username Nader).

 For algorithms like sort, the CUB implementation was two-to-four times faster than the next best submission. This is the CCCL promise in action: SOL‑class algorithms that outperform custom kernels for standard primitives you’d otherwise spend months building.

 Where we didn’t take first place, the gap typically came down to us not having a tuning policy for that specific GPU. In some instances, our implementation was a more general solution, while higher-ranked submissions were specialized to specific problem sizes. 

 In other cases, the first place submission was already using CUB or cuda.compute
 under the hood. This underscores that these libraries already represent the performance ceiling for many standard GPU algorithms, and that their performance characteristics are now well understood and intentionally relied upon by leading submissions.

 This isn’t about winning

 Leaderboard results are a byproduct; the real objective is learning with the community, benchmarking transparently, and demonstrating the power of Python for high-performance GPU work.

 Our goal isn’t to discourage hand-written CUDA kernels. There are plenty of valid cases for custom kernels—novel algorithms, tight fusion, or specialized memory access patterns—but for standard primitives (sort, scan, reduce, histogram, etc.), your first move should be a proven, high-performance implementation. With cuda.compute
, those tuned CUB primitives are now accessible directly from native Python, allowing you to build high-quality, production-grade, GPU-accelerated Python libraries. 

 This is great news for anyone building the next CuPy, RAPIDS component, or a custom Python GPU accelerated library: faster iteration, fewer glue layers, and production-grade performance all while staying in pure Python.

 How cuda.compute
 looks in practice

 One of the first examples any person writes when learning GPU programming is a vector addition. Using cuda.compute we can solve this using pure Python by calling a device-wide primitive.

import cuda.compute
from cuda.compute import OpKind

# Build-time tensors (used to specialize the callable)
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_out = torch.empty(2, 2, dtype=torch.float16, device="cuda")

# JIT compiling the transform kernel
transform = cuda.compute.make_binary_transform(build_A, build_B, build_out, OpKind.PLUS)

# Defining custom_kernel is required to submit to the GPU MODE competition
def custom_kernel(data):
 # Invoking our transform operation on some input data
 A, B, out = data
 transform(A, B, out, A.numel())
 return out

 You can find more cuda.compute examples on the GPU MODE Leaderboard . The pattern is consistent: simple code with speed-of-light performance, achieved by calling device-wide building blocks that are automatically optimized by CCCL for every GPU generation.

 Other top-performing submissions for the VectorAdd category required dropping into C++ and inline PTX, resulting in code that is highly architecture-dependent.

 Try cuda.compute today

 If you’re building Python GPU software, custom pipelines, library components, or performance-sensitive code, cuda.compute
 gives you the option to use CCCL CUB primitives directly in Python and leverage building blocks designed for architecture-aware speed-of-light performance.

 To try cuda.compute
 today, you can install it via pip or conda:

pip install cuda-cccl[cu13] (or [cu12])

conda install -c conda-forge cccl-python cuda-version=12 (or 13)

 We’re building this with the community—your feedback and benchmarks shape our roadmap so don’t hesitate to reach out to us on Github or in the GPU MODE discord .

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Science | Developer Tools & Techniques | General | CUDA | Intermediate Technical | Best practice | featured

 About the Authors

 About Daniel Rodriguez

 Daniel Rodriguez is a technical product manager on the CUDA Python and DevTools teams at NVIDIA. His efforts are focused on building tooling for data scientists and high-performance computing engineers. Daniel has a background of Electrical Engineering and Data Analytics. Prior to NVIDIA, he worked at Google and multiple enterprise data science companies where he built data related products and contributed to many open source projects.

 View all posts by Daniel Rodriguez

 About Nader Al Awar

 Nader Al Awar is a senior software engineer at NVIDIA and a member of the CUDA Core Compute Libraries (CCCL) team, where he focuses on the development of CUB and cuda.compute. He earned his doctorate in electrical and computer engineering from the University of Texas at Austin, specializing in high-performance computing for Python. Nader is passionate about bridging the gap between high-level languages and hardware by accelerating Python code using GPUs.

 View all posts by Nader Al Awar

 Comments

 Related posts

 CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels

 CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels

 Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

 Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python

 Developing Accelerated Code with Standard Language Parallelism

 Developing Accelerated Code with Standard Language Parallelism

 Unifying the CUDA Python Ecosystem

 Unifying the CUDA Python Ecosystem

 NVIDIA Announces CUDA-X HPC

 NVIDIA Announces CUDA-X HPC

 Related posts

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale

 Streamlining CUB with a Single-Call API

 Streamlining CUB with a Single-Call API

 Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer

 Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer

 Better Bug Detection: How Compile-Time Instrumentation for Compute Sanitizer Enhances Memory Safety

 Better Bug Detection: How Compile-Time Instrumentation for Compute Sanitizer Enhances Memory Safety

 How to Get Started with Neural Shading for Your Game or Application

 How to Get Started with Neural Shading for Your Game or Application

 L

 T

 F

 R

 E

Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile | NVIDIA Technical Blog

nvidia_dev_blog

05.03.2026 17:00

0.659

Embedding sim.	0.7566
Entity overlap	0.0303
Title sim.	0.2623
Time proximity	0.7311

NLP тип	other
NLP организация	nvidia
NLP тема	computational efficiency
NLP страна

Открыть оригинал

In this post, we dive into one of the most critical workloads in modern AI: Flash Attention , where you’ll learn:

 How to implement Flash Attention using NVIDIA cuTile . Walk through the complete code for a production-ready implementation.

 The “trap and rescue” optimization journey . This case study shows how naive optimizations (like just increasing tile size) can backfire, and how to fix them.

 Advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling for maximum performance.

 Environment requirements:

 CUDA 13.1 or higher

 GPU architecture : Compute capability 8.X, 10.X, 11.X, 12.X (NVIDIA Ampere, NVIDIA Ada, NVIDIA Blackwell)

 Python : 3.10 or higher

 See the quickstart doc for more information on installing cuTile Python.

 What is attention?

 The attention mechanism is the computational heart of transformer models. Given a sequence of tokens, attention enables each token to “look at” every other token and decide how much to weigh their contributions. Mathematically, for input matrices Query (\(Q\)), Key (\(K\)), and Value (\(V\)), the output is:

 \(O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\)

 Where:

 \(Q \text{ has shape } (N,d),\ N \text{ query tokens, each with dimension } d.\)

 \(K \text{ has shape } (N,d),\ N \text{ key tokens.}\)

 \(V \text{ has shape } (N,d),\ N \text{ value tokens.}\)

 \(\text{The intermediate } QK^{T} \text{ matrix has shape } (N,N), \text{ is a problem.}\)

 The memory bandwidth problem

 For a sequence length of \(N = 16,384\) (common in modern LLMs), the attention matrix \(QK^{T}\) contains \(N^2 = 268\) million elements. In FP16, that’s 512 MB of intermediate storage per attention head, per batch item.

 Standard attention implementations:

 Compute the full \(N \times N\) attention matrix and write it to global memory (slow)

 Apply softmax row-by-row

 Read the matrix back and multiply by \(V\)

 This approach is memory-bound as the GPU spends most of its time waiting for data to move between HBM and compute units, rather than computing.

 How Flash Attention solves the memory bandwidth problem

 Flash Attention (introduced by Dao et al., 2022) is an IO-aware algorithm that never materializes the full \(N \times N\) matrix. Instead, it:

 Tiles the computation : Processes \(Q, K, V\) in small blocks that fit in fast on-chip SMEM

 Uses online softmax : Computes softmax incrementally without needing the full row

 Fuses operations : Combines the matrix multiply and softmax into a single kernel pass

 The result is a 2-4x speedup and significant memory savings, enabling longer context lengths.

 Figure 1. Tiled Flash Attention computation

 Understanding online softmax

 The key algorithmic insight of Flash Attention is the online softmax trick. The numerically stable safe softmax requires knowing the maximum value across the entire row before computing:

 \(\text{softmax}(x_i) = \frac{e^{x_i – \max(x)}}{\sum_j e^{x_j – \max(x)}}\)

 But if we’re processing tiles, we don’t have access to the full row. Online softmax solves this by maintaining running statistics that can be updated incrementally.

 The online softmax algorithm

 We maintain two running values for each row:

 \(m_i\): The maximum value seen so far (for numerical stability)

 \(l_i\): The sum of exponentials seen so far (the softmax denominator)

 When we process a new tile with values \(x_{new}\):

 Update the maximum : \(m_{new} = \max(m_i, \max(x_{new}))\)

 Compute correction factor : \(\alpha = e^{m_i – m_{new}}\) (rescales previous work)

 Update the sum : \(l_i = l_i \cdot \alpha + \sum e^{x_{new} – m_{new}}\)

 Update the accumulator : \(acc = acc \cdot \alpha + P_{new} \cdot V_{tile}\)

 \(P_{new}\) is the matrix of the attention weights, and \(V_{tile}\) is the value matrix tile, corresponding to the Key tile of the current iteration. At the end, we normalize: \(O = acc / l_i\)

 This enables us to compute an exact softmax without ever storing the full row.

 Causal attention and grouped-query attention

 Before diving into the implementation, let’s understand two important attention variants used in modern LLMs:

 Causal attention 

 In autoregressive language models like GPT, LLaMA, and Claude, each token can only attend to previous tokens in the sequence, not future ones. This prevents “cheating” during training, where the model looks ahead to predict the next word.

 Mathematically, we apply a triangular mask to the attention scores:

 \(\text{mask}_{ij} = \begin{cases} 0 & \text{if } i \geq j \text{ (query position ≥ key position)} \ -\infty & \text{if } i < j \text{ (future tokens)} \end{cases}\)

 The masked attention becomes:

 \(O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + \text{mask}\right)V\)

 Adding \(-\infty\) to future positions ensures they become zero after softmax, effectively blocking information flow from future tokens.

 Figure 2. Causal attention mask for four tokens

 With causal masking, roughly half the attention matrix is masked (the upper triangle). We can skip computing these masked tiles entirely, providing a 2x algorithmic speedup. This is crucial for the K-loop splitting optimization.

 Grouped-query attention

 Standard multi-head attention has separate \(K,V\) matrices for each attention head, leading to high memory usage:

 Multi-head attention (MHA) : 32 query heads → 32 K/V heads (1:1 ratio)

 Grouped-query attention (GQA) : 32 query heads → 4 K/V heads (8:1 ratio)

 Multi-query attention (MQA) : 32 query heads → 1 K/V head (32:1 ratio)

 In GQA, multiple query heads share the same K/V heads. For example, with 32 query heads and 4 K/V heads:

 Query heads 0-7 use K/V head 0

 Query heads 8-15 use K/V head 1

 Query heads 16-23 use K/V head 2

 Query heads 24-31 use K/V head 3

 This reduces K/V cache size by 8x during inference, critical for serving long-context models. Modern LLMs like LlamA 2, Llama 3, Mistral, and Qwen use GQA extensively.

 When implementing in Flash Attention, each CUDA block computes attention for one query head, but loads the appropriate shared K/V head:

head_idx = bid_y % num_heads # Which query head (0-31)
kv_head_idx = head_idx // query_group_size # Which K/V head (0-3)

 With a query group size of 8, query heads 0-7 all map to kv_head_idx = 0
, sharing the same K/V tiles in memory.

 Part 1: The flash attention kernel in CUDA Tile

 Let’s implement Flash Attention step-by-step. Our baseline uses small 64×64 tiles and straightforward code—correct but not yet optimized.

 1. Defining the kernel interface

 In cuTile, the @ct.kernel
 decorator marks a Python function as a GPU kernel. We pass compile-time constants using ct.Constant[T]
 type annotations:

import math
import cuda.tile as ct

# Type aliases for compile-time constants
ConstInt = ct.Constant[int]
ConstBool = ct.Constant[bool]

# Conversion factor: we use exp2 instead of exp for efficiency
INV_LOG_2 = 1.0 / math.log(2)

@ct.kernel()
def fmha_kernel(
 Q, K, V, Out, # Input/output tensors
 qk_scale: float, # Scale factor (1/sqrt(d))
 input_pos: int, # Position offset for causal masking
 TILE_D: ConstInt, # Head dimension (for example, 128)
 H: ConstInt, # Number of attention heads
 TILE_M: ConstInt, # Tile size for Q dimension (for example, 64)
 TILE_N: ConstInt, # Tile size for K/V dimension (for example, 64)
 QUERY_GROUP_SIZE: ConstInt,# For Grouped Query Attention (GQA)
 CAUSAL: ConstBool, # Whether to apply causal mask
 EVEN_K: ConstBool, # Whether K length is divisible by TILE_N
):

 2. Block ID mapping

 Each CUDA block computes one tile of the output. Using ct.bid
 , we map the 2D grid to batch/head indices:

# Get block indices
 bid_x = ct.bid(0) # Which tile along the sequence dimension
 bid_y = ct.bid(1) # Which batch-head combination

 # Decode batch and head from flattened index
 batch_idx = bid_y // H
 head_idx = bid_y % H

 # For Grouped Query Attention: multiple Q heads share one K/V head
 off_kv_h = head_idx // QUERY_GROUP_SIZE

 3. Initializing accumulators

 Before the main loop, we initialize the online softmax state and output accumulator:

# Convert scale for base-2 exponential (faster than natural exp)
 qk_scale = qk_scale * INV_LOG_2

 # Create position indices for this tile
 offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32)
 offs_m += input_pos
 offs_m = offs_m[:, None] # Shape: [TILE_M, 1]

 offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)
 offs_n_tile = offs_n_tile[None, :] # Shape: [1, TILE_N]

 # Online softmax state (float32 for numerical stability)
 m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32) # Running max
 l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32) # Running sum
 acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32) # Output accumulator

 We use float32
 for accumulators, even when inputs are float16 to maintain numerical precision during the iterative softmax computation.

 4. Loading the query tile

 The query tile is loaded once and reused across all K/V iterations:

 # Load Q tile: shape [1, 1, TILE_M, TILE_D] -> [TILE_M, TILE_D]
 q = ct.load(
 Q,
 index=(batch_idx, head_idx, bid_x, 0),
 shape=(1, 1, TILE_M, TILE_D)
 ).reshape((TILE_M, TILE_D))

 The ct.load
 function handles boundary conditions automatically when the tile extends past the tensor edge.

 5. The main loop over K/V tiles

 This is the heart of Flash Attention. We iterate over K/V tiles:

 # Calculate loop bounds
 m_end = input_pos + (bid_x + 1) * TILE_M
 k_seqlen = K.shape[2]

 if CAUSAL:
 # For causal attention, stop early (future tokens are masked)
 Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N)
 else:
 Tc = ct.cdiv(k_seqlen, TILE_N)

 for j in range(0, Tc):
 # --- Step A: Load Key tile and compute QK^T ---
 k = ct.load(
 K,
 index=(batch_idx, off_kv_h, 0, j),
 shape=(1, 1, TILE_D, TILE_N),
 order=(0, 1, 3, 2), # Transpose for correct layout
 latency=2 # Hint for memory prefetching
 ).reshape((TILE_D, TILE_N))

 # Matrix multiply: Q @ K^T
 qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32)
 qk = ct.mma(q, k, qk) # Uses Tensor Cores automatically

 The order=(0,1,3,2)
 in the parameter tells cuTile load operation to use K transposed, and latency=2
 hints that we can tolerate some latency (enabling better pipelining). Then we use the ct.mma=(q, k, k,qk)
 to perform the cuTile matrix multiply-accumulate .

 6. Applying the causal mask

 For autoregressive models (GPT, Llama, etc.), each token can only attend to previous tokens:

# --- Step B: Apply causal masking ---
 if CAUSAL or not EVEN_K:
 offs_n = j * TILE_N + offs_n_tile
 mask = ct.full((TILE_M, TILE_N), True, dtype=ct.bool_)

 # Boundary mask (for non-divisible sequence lengths)
 if not EVEN_K:
 mask = mask & (offs_n < k_seqlen)

 # Causal mask: query position >= key position
 if CAUSAL:
 mask = mask & (offs_m >= offs_n)

 # Convert to additive mask: True->0, False->-inf
 mask = ct.where(mask, 0.0, -math.inf)
 qk += mask

 Adding -inf
 to masked positions ensures they become zero after softmax.

 7. Online softmax update

 Now we update our running softmax statistics:

 # --- Step C: Online softmax ---
 # Find max in current tile
 qk_max = ct.max(qk, axis=-1, keepdims=True)
 qk_max_scaled = qk_max * qk_scale

 # Update running maximum
 m_ij = max(m_i, qk_max_scaled)

 # Scale QK scores
 qk = qk * qk_scale
 qk = qk - m_ij

 # Compute attention weights (using exp2 for speed)
 p = ct.exp2(qk)

 # Update running sum
 l_ij = ct.sum(p, axis=-1, keepdims=True)
 alpha = ct.exp2(m_i - m_ij) # Correction factor
 l_i = l_i * alpha
 l_i = l_i + l_ij

 # Rescale previous accumulator
 acc = acc * alpha

 8. Accumulating the output

 Finally, we load the Value tile and accumulate:

# --- Step D: Load V and accumulate ---
 v = ct.load(
 V,
 index=(batch_idx, off_kv_h, j, 0),
 shape=(1, 1, TILE_N, TILE_D),
 latency=4
 ).reshape((TILE_N, TILE_D))

 # Cast attention weights back to input dtype for Tensor Core MMA
 p = p.astype(Q.dtype)

 # Accumulate: acc += P @ V
 acc = ct.mma(p, v, acc)

 # Update max for next iteration
 m_i = m_ij

 9. Final normalization and store

 After processing all tiles, we normalize by the total sum and write the result:

 # --- Final: Normalize and store ---
 acc = ct.truediv(acc, l_i)
 acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype)
 ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc)

 Launching the kernel: Host-side code

 Now let’s look at the host-side code that launches the kernel:

import torch
from math import ceil

def tile_fmha(q, k, v, sm_scale=None, is_causal=True):
 """
 Launch the Flash Attention kernel.

 Args:
 q: Query tensor, shape [batch, heads, seq_len, head_dim]
 k: Key tensor, shape [batch, kv_heads, seq_len, head_dim]
 v: Value tensor, shape [batch, kv_heads, seq_len, head_dim]
 sm_scale: Softmax scale (default: 1/sqrt(head_dim))
 is_causal: Whether to apply causal masking

 Returns:
 Output tensor, same shape as q
 """
 if sm_scale is None:
 sm_scale = 1.0 / math.sqrt(q.size(-1))

 batch_size, num_heads, seq_len, head_dim = q.shape
 _, num_kv_heads, _, _ = k.shape

 # Calculate query group size for GQA
 query_group_size = num_heads // num_kv_heads

 # Ensure contiguous memory layout
 q = q.contiguous()
 k = k.contiguous()
 v = v.contiguous()

 # Allocate output
 o = torch.empty_like(q)

 # Choose tile sizes (we'll optimize this later!)
 TILE_M, TILE_N = 64, 64

 # Calculate grid dimensions
 grid_x = ceil(seq_len / TILE_M) # Number of tiles along sequence
 grid_y = batch_size * num_heads # One block per batch-head pair
 grid = (grid_x, grid_y, 1)

 # Check if K length is evenly divisible
 EVEN_K = (k.shape[2] % TILE_N) == 0

 # Launch kernel
 ct.launch(
 torch.cuda.current_stream(),
 grid,
 fmha_kernel,
 (q, k, v, o, sm_scale, 0, head_dim, num_heads,
 TILE_M, TILE_N, query_group_size, is_causal, EVEN_K)
 )

 return o

 This baseline with 64×64 tiles works correctly. But can we make it faster? Let’s find out.

 Part 2: The “trap and rescue” optimization journey

 We benchmark on the following configuration:

 Hardware : NVIDIA B200

 Batch : 4, Heads : 32, Head dimension : 128

 Attention : Causal, Dtype : FP16

 Sequence lengths : 1024, 2048, 4096, 8192, 16384

 To interpret each step, we use Nsight Compute with a minimal section set:

 LaunchStats

 Occupancy

 SpeedOfLight

 ComputeWorkloadAnalysis

 MemoryWorkloadAnalysis

 Baseline performance

 SeqLen
 Throughput (TFLOPS)

 1,024
 330

 2,048
 441

 4,096
 511

 8,192
 546

 16,384
 566

 Table 1. Baseline performance without any specific optimizations

 This is our starting point with 64×64 tiles and no optimizations.

 NCU insight (SeqLen=1024, B200) :

 Registers/thread: 128

 Theoretical/achieved occupancy: 25% / 19.8%

 Compute (SM) throughput: 37.8%

 Memory throughput: 19.7%

 Grid size: 2,048

 1. The trap of larger tiles

 A common intuition in GPU programming is “bigger tiles = better performance.” Larger tiles:

 Amortize memory access overhead.

 Improve L2 cache utilization.

 Reduce kernel launch overhead per element.

 So, let’s increase our tile size from 64×64 to 256×128 :

TILE_M, TILE_N = 256, 128 # Was 64, 64

 The expected is better memory bandwidth utilization → faster performance. However, the result in TFLOPS are:

 SeqLen
 Baseline (64×64)
 Larger tiles (256×128)
 Performance Degradation

 1,024
 330
 187
 -43%

 2,048
 441
 268
 -39%

 4,096
 511
 347
 -32%

 8,192
 546
 415
 -24%

 16,384
 566
 463
 -18%

 Table 2. Baseline performance compared to performance with larger tile sizes, showing degradation when using larger tile sizes

 Performance degraded by 18-43% across all sequence lengths. This is the trap, where large tiles make performance worse .

 Why does this happen?

 Compute bottleneck : With more elements per tile, inefficient operations (separate mul/add, precise math) become the bottleneck.

 Instruction overhead : More work per tile means more instructions before the next memory operation.

 Lesson : Tile size and compute efficiency are interdependent. Large tiles only help if the computation is efficient enough to keep up.

 NCU insight (SeqLen=1,024, NVIDIA B200) :

 Registers/thread jump to 168 (+31%), reducing theoretical occupancy to 18.75%

 Achieved occupancy drops to 16.5%

 Compute throughput collapses to 17.4% (the trap)

 Memory throughput falls to 7.4%

 Grid size shrinks to 512 (fewer blocks from larger tiles)

 2. The rescue with fast math 

 One of the bottlenecks is special functions: exp2 (exponential) and truediv (division). By default, these are IEEE-754 precise—highly accurate, but slow.

 For deep learning, we can trade a tiny bit of precision for massive speedups:

 Before (precise operations):

p = ct.exp2(qk)
alpha = ct.exp2(m_i - m_ij)
acc = ct.truediv(acc, l_i)

 After (fast math):

p = ct.exp2(qk, flush_to_zero=True)
alpha = ct.exp2(m_i - m_ij, flush_to_zero=True)
acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX)

 What these flags do :

 flush_to_zero=True
: Denormal numbers (extremely small values near zero) become exactly zero. This avoids slow microcode paths on the GPU.

 rounding_mode=RMd.APPROX
: Skips iterative refinement after initial hardware approximation.

 With fast math, we’ve “rescued” the large tiles, and the results in TFLOPS are:

 SeqLen
 Larger tiles (trap)
 Fast math (rescue)
 Improvement

 1,024
 187
 322
 +72%

 2,048
 268
 436
 +63%

 4,096
 347
 524
 +51%

 8,192
 415
 585
 +41%

 16,384
 463
 620
 +34%

 Table 3. Performance improvement when using two fast math optimizations

 We now match or exceed the small-tile baseline, with 10-20% gains for longer sequences.

 NCU insight (SeqLen=1,024, NVIDIA B200) :

 Registers/thread: 168 (unchanged)

 Theoretical/achieved occupancy: 18.75% / 16.6% (unchanged)

 Compute throughput rebounds to 24.0%

 Memory throughput improves to 12.9%

 3. K-loop split

 For causal attention , we apply a triangular mask: each query can only attend to keys at earlier positions. In our baseline, we check if CAUSAL: mask
… on every loop iteration.

 But think about it: for a query tile at position 1000, most key tiles (0-900) need no masking at all . Only tiles near the diagonal need the mask. And tiles beyond the query position are completely masked (we can skip them entirely).

 Figure 3. Tiled causal attention matrix (8 tiles per side) 

 The optimization splits the loop into phases:

# Calculate where masking starts being necessary
mask_start = (input_pos + bid_x * TILE_M) // TILE_N
mask_start = min(mask_start, k_seqlen // TILE_N)

# Calculate where to stop (for causal, we exit early)
if CAUSAL:
 Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N)
else:
 Tc = ct.cdiv(k_seqlen, TILE_N)

for j in range(0, Tc):
 # Load K and compute QK...

 # ONLY apply masking when necessary
 if (CAUSAL or not EVEN_K) and j >= mask_start:
 offs_n = j * TILE_N + offs_n_tile
 mask = ct.full((TILE_M, TILE_N), True, dtype=ct.bool_)
 if not EVEN_K:
 mask = mask & (offs_n < k_seqlen)
 if CAUSAL:
 mask = mask & (offs_m >= offs_n)
 mask = ct.where(mask, 0.0, -math.inf)
 qk += mask

 # Continue with softmax and accumulation...

 Why this matters : For a 16K sequence with 256-token tiles:

 ~50% of tiles are fully unmasked (no branch, no mask computation)

 ~1 tile per row is partially masked (full logic)

 The rest are skipped entirely (early exit)

 Result in TFLOPS :

 SeqLen
 Fast math
 Loop split
 Improvement

 1,024
 322
 373
 +16%

 2,048
 436
 552
 +27%

 4,096
 524
 684
 +31%

 8,192
 585
 770
 +32%

 16,384
 620
 813
 +31%

 Table 4. Performance improvement when using K-loop split optimization

 This is the biggest single optimization —up to 32% speedup across all sequence lengths.

 NCU insight (SeqLen=1,024, B200) :

 Registers/thread: 168 (unchanged)

 Theoretical/achieved occupancy: 18.75% / 16.6% (unchanged)

 Memory throughput improves to 14.5% (less wasted work)

 Compute throughput remains 24.0% (work is more useful, not necessarily faster per cycle)

 4. ProgramId remapping

 One subtle optimization is reversing the block order for causal attention. When we process tiles in reverse (bottom-right to top-left), later-launched blocks have less work due to the causal mask. This improves load balancing and reduces tail effects.

 Before (standard order):

bid_x = ct.bid(0) # Process tiles 0, 1, 2, ...

 After (reversed for causal):

if CAUSAL:
 bid_x = NUM_M_BLOCKS - 1 - ct.bid(0) # Process tiles N, N-1, N-2, ...
else:
 bid_x = ct.bid(0)

 This small change improves wave scheduling, as blocks complete more uniformly across the GPU.

 Result in TFLOPS :

 SeqLen
 Loop split
 Remapping
 Improvement

 1,024
 373
 377
 +1%

 2,048
 552
 560
 +1.5%

 4,096
 684
 696
 +1.8%

 8,192
 770
 781
 +1.5%

 16,384
 813
 835
 +2.6%

 Table 5. Performance improvement after remapping the block order of the tiles 

 A modest but consistent 1-3% gain, especially noticeable at longer sequences where tail effects matter most.

 5. Autotuning

 We’ve optimized large tiles, but there’s a catch: short sequences still prefer small tiles .

 Why? With a 1,024-token sequence and 256-token tiles, we only have 4 tiles. That’s not enough to fully utilize all SMs on a B200. Smaller tiles (64×64) give us 16 tiles, better filling the GPU.

 Rather than manually choosing a threshold, we can let cuTile’s autotuner benchmark multiple configurations and cache the best one for each input shape.

 The autotuner approach :

def _fmha_autotune_configs():
 """Search space for autotuning.

 The autotuner will benchmark these configurations and cache the best one
 per input shape (sequence length, batch size, etc.).
 """
 gpu_capability = torch.cuda.get_device_capability()

 if gpu_capability in [(12, 0), (12, 1)]:
 # RTX 50 series (sm120, sm121)
 yield SimpleNamespace(TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2)
 else:
 # B200/GB200 (sm100) - Try multiple tile sizes
 # Autotuner will discover:
 # - 64x64 is best for short sequences (1024-2048)
 # - 128x128 may be best for medium sequences (4096)
 # - 256x128 is best for long sequences (8192+)
 yield SimpleNamespace(TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2)
 yield SimpleNamespace(TILE_M=128, TILE_N=128, num_ctas=1, occupancy=2)
 yield SimpleNamespace(TILE_M=256, TILE_N=128, num_ctas=1, occupancy=1)

 How to launch with autotuning :

 Instead of calling ct.launch
 directly, use ct_experimental.autotune_launch
:

import cuda.tile_experimental as ct_experimental

def autotune_launch_fmha(
 stream, q, k, v, o, sm_scale, input_pos,
 hidden_size, num_heads, query_group_size, is_causal
):
 batch_size, _, q_len, _ = q.shape

 def _grid_fn(cfg):
 return (math.ceil(q_len / cfg.TILE_M), batch_size * num_heads, 1)

 def _args_fn(cfg):
 num_m_blocks = math.ceil(q_len / cfg.TILE_M)
 even_k = (k.shape[2] % cfg.TILE_N) == 0
 return (
 q, k, v, o, sm_scale, input_pos,
 hidden_size, num_heads, cfg.TILE_M, cfg.TILE_N,
 query_group_size, is_causal, even_k, num_m_blocks,
 )

 ct_experimental.autotune_launch(
 stream,
 grid_fn=_grid_fn,
 kernel=fmha_kernel,
 args_fn=_args_fn,
 hints_fn=lambda cfg: {"num_ctas": cfg.num_ctas, "occupancy": cfg.occupancy},
 search_space=_fmha_autotune_configs,
 )

 Note: The autotuner API may be subject to change.

 The autotuner works intelligently:

 First call with seq_len=1024 : Benchmarks all 3 configs, caches best one

 First call with seq_len=2048 : Benchmarks all 3 configs, caches best one

 Subsequent calls : Uses cached config (zero overhead)

 The cache key includes tensor shapes, so different sequence lengths automatically get different optimal configurations.

 Result in TFLOPS :

 SeqLen
 Baseline
 Remapping
 Autotune
 Speedup vs baseline

 1,024
 330
 377
 548
 1.66x

 2,048
 441
 560
 708
 1.61x

 4,096
 511
 696
 817
 1.60x

 8,192
 546
 781
 887
 1.62x

 16,384
 566
 835
 918
 1.62x

 Table 6. Original baseline compared to step 5 and to step 6 autotuned results

 The autotuner discovers that 64×64 tiles are best for sequences ≤2,048, then transitions to larger tiles for longer sequences. This delivers 45% additional performance at short sequences compared to fixed large tiles, while maintaining peak performance at long sequences.

 What the autotuner chose (on B200):

 SeqLen 1,024: 64×64 tiles (high parallelism)

 SeqLen 2,048: 64×64 or 128×128 tiles (balanced)

 SeqLen 4,096+: 128×128 or 256×128 tiles (memory efficiency)

 We now achieve optimal performance across all sequence lengths without manual tuning.

 Summary: The optimization stack

 Optimization
 Key insight
 Impact

 Baseline (64×64)
 Correct but unoptimized
 Baseline

 Large tiles (256×128)
 TRAP : 18-43% slower!
 -18% to -43%

 + Fast math (FTZ, APPROX)
 RESCUE : Large tiles now pay off
 +34% to +72% from trap

 + K-loop split
 Biggest single optimization
 +16% to +32%

 + ProgramId remapping
 Better load balancing
 +1% to +3%

 + Autotuning
 Optimal tiles per sequence
 +10% to +45%

 Table 7. Step-by-step optimization results with performance impacts for each step

 Final speedup: 1.60x-1.66x across all sequence lengths.

 Getting started

 Writing high-performance kernels is rarely about finding one “magic” setting. As we saw with the “trap and rescue”:

 Optimizations are interdependent : Large tiles were slower until we fixed the math. You can’t evaluate tile size in isolation.

 Math matters : Flags like flush_to_zero
 and APPROX
 are critical for unlocking Tensor Core throughput. Precise math is often overkill for deep learning.

 Algorithmic wins compound : K-loop splitting gave us the biggest single improvement (up to 32%) by avoiding unnecessary work.

 Autotuning beats manual heuristics : cuTile’s autotuner discovers optimal tile sizes per sequence length (64×64 for short sequences, 256×128 for long), delivering 10-45% gains over fixed configurations.

 Cumulative effects are multiplicative : The full optimization stack delivers 1.60x-1.66x speedup across all sequence lengths—far more than any single optimization alone.

 cuTile enables developers to express these optimizations—tiling, fast math controls, loop splitting, autotune—in clean, readable Python code while generating highly optimized PTX for NVIDIA GPUs.

 You can find the completely optimized kernel in the TileGym repository . Happy hacking.

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Data Science | Developer Tools & Techniques | General | CUDA | Advanced Technical | Tutorial | CUDA Tile | cuTile | featured

 About the Authors

 About Alessandro Morari

 Alessandro Morari is an AI systems leader at NVIDIA in the DevTech AI organization. His current focus is on AI-driven GPU kernels and next-generation programming models for accelerated computing. His experience spans the full AI stack, from GPU kernel optimization to AI product leadership. Before NVIDIA, he led the team at IBM Research that shipped the Watson Code Assistant, one of the earliest large-scale generative AI products. He previously worked on system software for the Summit and Sierra supercomputers and created NYU Courant's first course on high-performance machine learning. Morari has authored over 30 publications, holds 15 patents, and earned a Ph.D. in Computer Architecture.

 View all posts by Alessandro Morari

 About Allen Zhao

 Allen Zhao (Wenyi Zhao) is a senior compute architect engineer specializing in cutting-edge AI compiler technologies, including both graph-level and tile-level compilation. His expertise lies in optimizing the execution efficiency of AI models across diverse hardware architectures, especially for GPGPU. He's passionate about translating theoretical compiler advancements into practical, high-impact solutions for the next generation of artificial intelligence. He holds a Master's degree from Shanghai Jiao Tong University.

 View all posts by Allen Zhao

 About Ivan Yin

 Ivan Yin (Wenzhi Yin) is a senior computer architect engineer specializing in GPU compiler engineering and high-performance deep learning. He graduated from Shanghai Jiao Tong University. He has expertise in compiler development for NVIDIA CUDA Tile Programming, where he maps high-level tensor operations to efficient GPU machine code through automated code generation for modern GPU architectures. Beyond compiler engineering, he has experience in high-performance deep learning kernel development and performance tuning.

 View all posts by Ivan Yin

 About Vishal Mehta

 Vishal works as a senior developer technology engineer at NVIDIA, with focus on performance optimization for GPU applications. He has been working in the field of GPU computing for over 10 years. He is keen on teaching CUDA and GPU computing to users and drives the content for the CUDA programming guide. His day-to-day activities involve collaborations with domain scientists and industry experts to improve their workloads on GPUs.

 View all posts by Vishal Mehta

 Comments

 Related posts

 Making Softmax More Efficient with NVIDIA Blackwell Ultra

 Making Softmax More Efficient with NVIDIA Blackwell Ultra

 Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM

 Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM

 OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

 OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

 Next Generation of FlashAttention

 Next Generation of FlashAttention

 Accelerating Transformers with NVIDIA cuDNN 9

 Accelerating Transformers with NVIDIA cuDNN 9

 Related posts

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

 How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy

 How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy

 Designing Protein Binders Using the Generative Model Proteina-Complexa

 Designing Protein Binders Using the Generative Model Proteina-Complexa

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark

 L

 T

 F

 R

 E

New method could increase LLM training efficiency

mit_news_ai

26.02.2026 05:00

0.658

Embedding sim.	0.7518
Entity overlap	0.0732
Title sim.	0.1321
Time proximity	0.9286

NLP тип	scientific_publication
NLP организация	Massachusetts Institute of Technology
NLP тема	large language models
NLP страна	United States

Открыть оригинал

Reasoning large language models (LLMs) are designed to solve complex problems by breaking them down into a series of smaller steps. These powerful models are particularly good at challenging tasks like advanced programming and multistep planning.
 But developing reasoning models demands an enormous amount of computation and energy due to inefficiencies in the training process. While a few of the high-power processors continuously work through complicated queries, others in the group sit idle.
 Researchers from MIT and elsewhere found a way to use this computational downtime to efficiently accelerate reasoning-model training.
 Their new method automatically trains a smaller, faster model to predict the outputs of the larger reasoning LLM, which the larger model verifies. This reduces the amount of work the reasoning model must do, accelerating the training process.
 The key to this system is its ability to train and deploy the smaller model adaptively, so it kicks in only when some processors are idle. By leveraging computational resources that would otherwise have been wasted, it accelerates training without incurring additional overhead.
 When tested on multiple reasoning LLMs, the method doubled the training speed while preserving accuracy. This could reduce the cost and increase the energy efficiency of developing advanced LLMs for applications such as forecasting financial trends or detecting risks in power grids.
 “People want models that can handle more complex tasks. But if that is the goal of model development, then we need to prioritize efficiency. We found a lossless solution to this problem and then developed a full-stack system that can deliver quite dramatic speedups in practice,” says Qinghao Hu, an MIT postdoc and co-lead author of a paper on this technique .
 He is joined on the paper by co-lead author Shang Yang, an electrical engineering and computer science (EECS) graduate student; Junxian Guo, an EECS graduate student; senior author Song Han, an associate professor in EECS, member of the Research Laboratory of Electronics and a distinguished scientist of NVIDIA; as well as others at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts at Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
 Training bottleneck 
 Developers want reasoning LLMs to identify and correct mistakes in their critical thinking process. This capability allows them to ace complicated queries that would trip up a standard LLM.
 To teach them this skill, developers train reasoning LLMs using a technique called reinforcement learning (RL). The model generates multiple potential answers to a query, receives a reward for the best candidate, and is updated based on the top answer. These steps repeat thousands of times as the model learns.
 But the researchers found that the process of generating multiple answers, called rollout, can consume as much as 85 percent of the execution time needed for RL training.
 “Updating the model — which is the actual ‘training’ part — consumes very little time by comparison,” Hu says.
 This bottleneck occurs in standard RL algorithms because all processors in the training group must finish their responses before they can move on to the next step. Because some processors might be working on very long responses, others that generated shorter responses wait for them to finish.
 “Our goal was to turn this idle time into speedup without any wasted costs,” Hu adds.
 They sought to use an existing technique, called speculative decoding, to speed things up. Speculative decoding involves training a smaller model called a drafter to rapidly guess the future outputs of the larger model.
 The larger model verifies the drafter’s guesses, and the responses it accepts are used for training.
 Because the larger model can verify all the drafter’s guesses at once, rather than generating each output sequentially, it accelerates the process.
 An adaptive solution 
 But in speculative decoding, the drafter model is typically trained only once and remains static. This makes the technique infeasible for reinforcement learning, since the reasoning model is updated thousands of times during training.
 A static drafter would quickly become stale and useless after a few steps.
 To overcome this problem, the researchers created a flexible system known as “Taming the Long Tail,” or TLT.
 The first part of TLT is an adaptive drafter trainer, which uses free time on idle processors to train the drafter model on the fly, keeping it well-aligned with the target model without using extra computational resources.
 The second component, an adaptive rollout engine, manages speculative decoding to automatically select the optimal strategy for each new batch of inputs. This mechanism changes the speculative decoding configuration based on the training workload features, such as the number of inputs processed by the draft model and the number of inputs accepted by the target model during verification.
 In addition, the researchers designed the draft model to be lightweight so it can be trained quickly. TLT reuses some components of the reasoning model training process to train the drafter, leading to extra gains in acceleration.
 “As soon as some processors finish their short queries and become idle, we immediately switch them to do draft model training using the same data they are using for the rollout process. The key mechanism is our adaptive speculative decoding — these gains wouldn’t be possible without it,” Hu says.
 They tested TLT across multiple reasoning LLMs that were trained using real-world datasets. The system accelerated training between 70 and 210 percent while preserving the accuracy of each model.
 As an added bonus, the small drafter model could readily be utilized for efficient deployment as a free byproduct.
 In the future, the researchers want to integrate TLT into more types of training and inference frameworks and find new reinforcement learning applications that could be accelerated using this approach.
 “As reasoning continues to become the major workload driving the demand for inference, Qinghao’s TLT is great work to cope with the computation bottleneck of training these reasoning models. I think this method will be very helpful in the context of efficient AI computing,” Han says.
 This work is funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.

OpenAI Codex and Figma launch seamless code-to-design experience

openai

26.02.2026 06:00

0.656

Embedding sim.	0.7848
Entity overlap	0.0833
Title sim.	0.1444
Time proximity	0.5685

NLP тип	product_launch
NLP организация	OpenAI
NLP тема	code generation
NLP страна

Открыть оригинал

OpenAI and Figma launch a new Codex integration that connects code and design, enabling teams to move between implementation and the Figma canvas to iterate and ship faster.

Anthropic vs. OpenAI, the Pre IPO Days

ai_supremacy

20.02.2026 10:36

0.653

Embedding sim.	0.7534
Entity overlap	0.0192
Title sim.	0.1299
Time proximity	0.8956

NLP тип	funding
NLP организация	Anthropic
NLP тема	enterprise ai
NLP страна

Открыть оригинал

Anthropic vs. OpenAI, the Pre IPO Days
 The last few months before lucrative AI IPOs are upon on. Let's do some math. Decoding the crazy high valuations.

 Michael Spencer and Raphaëlle d'Ornano
 Feb 20, 2026
 ∙ Paid

 118

 13

 Share

 Good Morning,
 The next AI duopoly of BigAI is almost here. With BigTech benefits. 😄🗺️
 Anthropic closed its $30 Bn. round and OpenAI is almost read to close its own $100 Bn. round. Confirmation that OpenAI will keep paying 20% of its revenue to Microsoft until 2032 complicates the business model of OpenAI. Nvidia is in discussions to invest up to $30 billion in OpenAI as part of a funding round that could value the AI startup at a $730 (even as high as $850) billion pre-money valuation. We have to assume that both SpaceX and OpenAI will IPO at near or above $1 Tr. market cap.
 I asked Raphaëlle d'Ornano of Decoding Discontinuity (and her team) to do some analysis on the OpenAI vs. Anthropic debate.
 Decoding Discontinuity

 Frameworks for analyzing the financial and strategic impact of emerging tech like Generative AI.
 Decoding Discontinuity A newsletter that provides frameworks for analyzing the financial and strategic impact of emerging technologies like GenAI.
 By Raphaëlle d'Ornano

 R.O is a very deep thinker so read her analysis carefully: (related)
 Decoding Anthropic’s $380 Billion Valuation: Orchestration over Raw Intelligence in Enterprise AI

 The $285 Billion ‘SaaSpocalypse’ Is the Wrong Panic

 OpenAI and xAI: When Megawatts Become the New ARPU

 Epoch AI ( Epoch AI & various writers ) are also projecting like Patel and myself have, that Anthropic will overtake OpenAI in ARR not in 2027, but sooner as in this year!
 Anthropic Growing Revenue 3x Faster than OpenAI in 2026

 Epoch AI.
 When you are growing revenue at a 10x pace instead of a 3.4x pace, that tends to happen. Anthropic is on pace to overtake OpenAI in ARR sometime in late 2026. But what does it mean for Anthropic’s IPO vs. OpenAI’s? In Anthropic’s global push 2026 and 2027 are just massive years for its growth.
 Epoch AI notes that The Information shows both companies projecting slower revenue growth in 2026, with OpenAI expecting 2.2× growth, and Anthropic expecting 4× growth or less. No wonder there is a SaaS apocalypse market jitters and narrative. So instead of growing three times as fast, Anthropic may grow twice as fast in 2026. Keep in mind they are also iterating new models faster:
 Anthropic releases Sonnet 4.6 this week .

 Google releases Gemini 3.1 Pro this week.

 In what Industries are Agents being Deployed?

 Anthropic
 Software Engineering

 Back office automation

 Other

 Marketing, content and copywriting

 Sales and CRM

 Finance and Accounting

 Business Intelligence and Data analysis

 Academic Research

 As Nvidia reaps the rewards of the GPU-era, we’ll have to track Anthropic more closely now as they dominate Enterprise AI products. OpenAI doesn’t appear to be executing in a customer focused manner like Anthropic has where as each company hit $1B in annualized revenues, Anthropic has grown substantially faster. The trajectory, branding, product and focus feels entirely different. Anthropic of course can’t grow 10x every year as it gets larger, Epoch AI notes that Since July 2025, Anthropic has grown its revenue at a rate of 7×/year rather than 10×. In 2026, most expect 4.5x.
 Generational IPOs? 🌊

 Loading...

 We are mere months away from the biggest IPOs we’ve ever seen: SpaceX, OpenAI and Anthropic.
 My personal picks for the BigAI winners of the Gen AI era excluding players with vast ecosystems like Google, Meta, xAI:
 Nvidia

 Anthropic

 ByteDance

 Broadcom

 An Unknown AI Chip maker upstart

 Alibaba

 SK Hynix (HBM chips)

 Core Automation (startup)

 An unnamed Chinese AI research lab

 An unnamed Chinese AI chip startup

 Something Big “Might be Happening”

 I don’t know if something big is happening ( Shumer ) as VC takes over media, they certainly want you to think that this is big.

 Decoding Discontinuity have a lot of advanced reports for paying readers . They showed me a recent PDF and I was blown away.

 Decoding Discontinuity
 Anthropic are trying to measure agentic autonomy in practice . They might be the moonshot of AI automation that’s hottest right now. Anthropic is likely to be profitable as soon as 2028. Frankly it’s not clear when OpenAI will reach that mark, could be as late as 2031.
 OpenAI’s has Raised far more but with less results

 OpenAI is Losing Marketshare to Emerging Players

 In 2026, Anthropic, Google, xAI and others will increasingly take marketshare away from OpenAI.

 Nvidia and Amazon are piling into OpenAI, supposedly to save it as a major customer.

 But…if AI was a big thing why are consumers spending more on OnlyFans, than OpenAI and NYT combined?

 a16z
 Anthropic’s Super Bowl Surge in Subscriptions

 The cheeky Anthropic Superbowl Ads were a question of good timing and the Claude Code momentum has built incredible momentum going into the pre IPO intensity for both parts of the BigAI (BigTech driven) Duopolies.

 Axios via Ramp AI Index data.
 This means the faster Anthropic grows, the slower OpenAI will grow. The main early 2026 vibes have been Codex vs. Claude Code.

 January, 2026 might have been Anthropic's breakthrough month , wrote Ara Kharazian, an economist at Ramp, which has been tracking business spending on AI.

 Ramp
 It’s getting intense: 79% of Anthropic's customers are already OpenAI customers. And churn rates are nearly identical at 4%.

 79% of Anthropic's customers are already OpenAI customers. And churn rates are nearly identical at 4%.

 According to Ramp’s data as of February 11th, 16% of businesses pay for both OpenAI and Anthropic. A year ago it was 8%.
 There will be Winners and Losers

 Decoding Discontinuity
 In terms of global competition, if either OpenAI or Anthropic falters there’s Google, Alibaba, ByteDance, xAI, DeepSeek and a host of others pushing including Open-weight Chinese startups you’ve never heard of.

 There will also be more nimble new research labs that will end up creating even better AI products, new architectures and offer new approaches to LLMs.

 B2B Market Looks Mission Critical

 For more sustainable big long-term contracts, Enterprise AI competition looks like the critical piece that will make or break their IPOs. While OpenAI’s B2C marketshare lead once looked impressive, diffusion by Gemini and others will reduce that first-mover advantage.
 “1 in 5 businesses on Ramp now pay for Anthropic. A year ago, it was 1 in 25.” - Ara Kharazian , Ramp Economist

 Decoding Discontinuity
 It’s highly uncertain if OpenAI’s AI device can compete with the likes of Meta, Apple, Google and others in smart glasses and other AI wearables . A huge market by 2028. It’s not clear if you are an OpenAI bear like I am, what exactly they win in. Especially is the case as ByteDance and Meta become direct competitors. Seedance looks more impressive than Sora, and so forth.
 The AI Coding Impact Focus

 Sometimes in 2025, Google, Anthropic and Alibaba Qwen began to outpace OpenAI in cadence of new releases and LLM quality making them more attractive for key builders, developers and entrepreneurs. Even Gemini CLI is gaining on Codex now in 2026. While OpenAI has transformed "Codex" from a simple model into a heavy-duty "Agent Command Center," Google’s Gemini CLI has found its niche as the high-context, low-friction alternative. All of this isn’t so great for Cursor or Microsoft’s own Github Copilot.
 Anthropic's MCP Advantage begins to Compound

 Decoding Discontinuity
 Anthropic and Google are building the agentic protocols that form the scaffolding of the future of Agentic AI. In 2026, the Model Context Protocol (MCP) is no longer just an Anthropic experiment; it has become the "USB-C for AI."

 Read Agentic Protocol Handbook
 Anthropic’s Upcoming Event

 Join Anthropic on Tuesday, February 24 for The Briefing: Enterprise Agents , a livestreamed event where we'll demonstrate how Cowork and Plugins help legal, sales, finance, and data teams build new products and solutions.
 Add to Calendar
 ChatGPT’s Viral Growth Not Enough

 Decoding Discontinuity
 ChatGPT’s growth looked magical, but Gemini and others are now showing similar adoption patterns. The “weekly” users metric don’t stand tall as Generative AI becomes more specialized.

 AI Supremacy isn't zero-sum game, new players and global competition will make things interesting. Finally let’s dive into the analysis of the guest contributor.
 Share
 Feel free to share this if you know anyone interested in the business trajectory of OpenAI or Anthropic, or indeed what “BigAI” will turn into.
 OpenAI at a Crossroads: 2026 the pre IPO Last Weeks

 See more at Decoding Discontinuity .

 Continue reading this post for free, courtesy of Michael Spencer.
 Claim my free post
 Or purchase a paid subscription.

 Previous Next

 A guest post by
 Raphaëlle d'Ornano
 Raphaëlle D'Ornano, is the founder of D'Ornano + Co. and Decoding Discontinuity, a research and investment platform. Her Durable Growth Moat™ framework analyzes how companies sustain competitive advantages through AI transformation.

 Subscribe to Raphaëlle

Perplexity announces "Computer," an AI agent that assigns work to other AI agents

arstechnica_ai

26.02.2026 22:53

0.641

Embedding sim.	0.712
Entity overlap	0.0476
Title sim.	0.2289
Time proximity	0.9656

NLP тип	product_launch
NLP организация	Perplexity
NLP тема	ai agents
NLP страна

Открыть оригинал

Agentception

 Perplexity announces “Computer,” an AI agent that assigns work to other AI agents

 It’s also a buttoned-down, ostensibly safer take on the OpenClaw concept.

 Samuel Axon

 –

 Feb 26, 2026 5:53 pm

 |

 87

 The vague marketing image for Perplexity Computer.

 Credit:

 Perplexity

 The vague marketing image for Perplexity Computer.

 Credit:

 Perplexity

 Text
 settings

 Story text

 Size

 Small
 Standard
 Large

 Width
 *

 Standard
 Wide

 Links

 Standard
 Orange

 * Subscribers only

    Learn more

 Minimize to nav

 Perplexity has introduced “Computer,” a new tool that allows users to assign tasks and see them carried out by a system that coordinates multiple agents running various models.

 The company claims that Computer, currently available to Perplexity Max subscribers, is “a system that creates and executes entire workflows” and “capable of running for hours or even months.”

 The idea is that the user describes a specific outcome—something like “plan and execute a local digital marketing campaign for my restaurant” or “build me an Android app that helps me do a specific kind of research for my job.” Computer then ideates subtasks and assigns them to multiple agents as needed, running the models Perplexity deems best for those tasks.

 The core reasoning engine currently runs Anthropic’s Claude Opus 4.6, while Gemini is used for deep research, Nano Banana for image generation, Veo 3.1 for video production, Grok for lightweight tasks where speed is a consideration, and ChatGPT 5.2 for “long-context recall and wide search.”

 This kind of best-model-for-the-task approach differs from some competing products like Claude Cowork , which only uses Anthropic’s models.

 All this happens in the cloud, with prebuilt integrations. “Every task runs in an isolated compute environment with access to a real filesystem, a real browser, and real tool integrations,” Perplexity says.

 The idea is partly that this workflow was what some power users were already doing, and this aims to make that possible for a wider range of people who don’t want to deal with all that setup. People were already using multiple models and tailoring them to specific tasks based on perceived capabilities, while, for example, using MCP (Model Context Protocol) to give those models access to data and applications on their local machines. Perplexity Computer takes a different approach, but the goal is the same: have AI agents running tailor-picked models to perform tasks involving your own files, services, and applications.

 Then there is OpenClaw, which you could perceive as the immediate predecessor to this concept.

 The story so far

 If you haven’t been following the wild OpenClaw craze, here’s the quick summary: originally titled ClawdBot, then Moltbot, OpenClaw was an agentic AI tool that leveraged large language models to independently operate as a sort of background or ambient process on your local machine, performing a wide range of tasks, from sorting through your email history to building websites to, well, basically whatever you could imagine.

 Given the right permissions and with the proper plugins, it could create, modify, or delete the user’s files and otherwise change things far beyond what most users could achieve with existing models and MCP (Model Context Protocol). Users would use files like USER.MD, MEMORY.MD, SOUL.MD, or HEARTBEAT.MD to give the tool context about its goals and how to work toward them independently, sometimes running for long stretches without direct user input.

 On one hand, that meant it could do impressive things—the first glimpses of the sort of knowledge work that AI boosters have been saying agentic AI would ultimately do. On the other hand, it was prone to serious errors and vulnerable to prompt injection and other security problems, in part due to a Wild West of unverified plugins.

 The same toolkit that was used to create a viral Reddit clone populated by AI agents was also, at least in one case, responsible for deleting a user’s emails against her will.

 Stay in your lane

 Perplexity Computer aims to address those concerns in a few ways. First, its core process occurs in the cloud, not on the user’s local machine. Second, it lives within a walled garden with a curated list of integrations, in contrast to OpenClaw’s unregulated frontier.

 This is, of course, an imperfect analogy, but you could say that if OpenClaw were the open web of AI agent tools, then Computer is Apple’s App Store. While you’re more limited in what you can do, you’re not trusting packages from unverified sources with access to your system.

 There could still be risks, though. For one thing, LLMs make mistakes, and those could be consequential if Computer is working with data you don’t have backed up elsewhere or if you’re not verifying the outputs, for example.

 Perplexity Computer aims to button up, refine, and contain the wild power of the viral OpenClaw agentic AI tool—competing with the likes of Claude Cowork—by optimizing subtasks by selecting models best suited to them.

 It surely won’t be the last existing AI player to try to do this sort of thing. After all, OpenAI hired OpenClaw’s developer, with CEO Sam Altman suggesting that some of what we saw in OpenClaw will be essential to the company’s product vision moving forward.

 Samuel Axon

 Senior Editor

 Samuel Axon

 Senior Editor

 Samuel Axon is the editorial lead for tech and gaming coverage at Ars Technica. He covers AI, software development, gaming, entertainment, and mixed reality. He has been writing about gaming and technology for nearly two decades at Engadget, PC World, Mashable, Vice, Polygon, Wired, and others. He previously ran a marketing and PR agency in the gaming industry, led editorial for the TV network CBS, and worked on social media marketing strategy for Samsung Mobile at the creative agency SPCSHP. He also is an independent software and game developer for iOS, Windows, and other platforms, and he is a graduate of DePaul University, where he studied interactive media and software development.

 87 Comments

An update on our mental health-related work

openai

27.02.2026 00:00

0.637

Embedding sim.	0.721
Entity overlap	0
Title sim.	0.1594
Time proximity	0.959

NLP тип	other
NLP организация	OpenAI
NLP тема	ai safety
NLP страна

Открыть оригинал

OpenAI shares updates on its mental health safety work, including parental controls, trusted contacts, improved distress detection, and recent litigation developments.

Google DeepMind Partnerships in India: scaling AI in science and education — Google DeepMind

deepmind

18.02.2026 10:30

0.631

Embedding sim.	0.7331
Entity overlap	0.0213
Title sim.	0.0662
Time proximity	0.9018

NLP тип	partnership
NLP организация	Google DeepMind
NLP тема	artificial intelligence
NLP страна	India

Открыть оригинал

February 18, 2026 Responsibility & Safety
 Accelerating discovery in India through AI-powered science and education
 Demis Hassabis, Lila Ibrahim and Pushmeet Kohli

 Share

 Copied

 Introducing our National Partnerships for AI and collaboration in India
 We believe AI will be the most transformative technology in human history and that it should be deployed in ways that benefit all of humanity. This requires deep, strategic collaboration between frontier AI labs, governments, academia, and civil society.
 To fully realise AI’s potential, Google DeepMind is working with governments through our National Partnerships for AI initiative to broaden access to our frontier AI capabilities, helping ensure they are deployed to serve citizens and meet national priorities in science, education, resilience, and public services.
 Building on our collaborations with the US and UK governments, we are establishing a new partnership with Indian government bodies and local institutions. In the global AI transformation, India is showing exceptional leadership in applying the technology to tackle its own biggest challenges. But India is going even further, playing a critical international role by convening this week the fourth global AI summit of governments, companies and civil society. International dialogue and collaboration will guide positive impacts and create the global frameworks required to prepare society for a future with AI.
 Partnership in India to broaden AI access
 Our partnerships are designed to accelerate the pace of progress across India. Here are a few ways we are working together to unlock new possibilities in science and education.
 Advancing scientific breakthroughs
 Google DeepMind, Google Research and Google.org are partnering with the Anusandhan National Research Foundation (ANRF) to facilitate the adoption of AI models to advance science. We’re providing access to our frontier AI for Science models, supporting hackathons and community contests, and enabling training and mentorship to students, researchers, and those in the early stages of their careers.
 Researchers and engineers in India will be able to use our AI tools, including:
 AlphaGenome : An AI model to help scientists better understand how mutations in human DNA sequences impact a wide range of gene functions
 AI Co-scientist : A multi-agent AI system that acts as a virtual scientific collaborator
 Earth AI: A collection of models built on Gemini’s advanced reasoning that are helping enterprises, nonprofits, and cities with everything from environmental monitoring to disaster response
 Scientists around the world are already using AlphaFold - our AI system capable of accurately predicting the structure and interactions of proteins, DNA, RNA, ligands and more - to accelerate discoveries. India stands as the fourth largest adopter of AlphaFold globally, with over 180,000 researchers using it today. We hope to see Indian scientists benefit even more from using AlphaGenome and the other AI systems we are now providing.
 We're also working to support AI for science at a global level. This is why, today at the India Summit, we announced the $30 million Google.org Impact Challenge: AI for Science , an open call for researchers, nonprofits, and social enterprises in India, and around the world, using AI to achieve scientific breakthroughs. Selected awardees will also have the opportunity to participate in a Google.org Accelerator, receiving engineering support, expert mentorship, and infrastructure from Google DeepMind and Google Research to turn their concepts into scalable discoveries.
 Empowering India’s Students and Teachers with an AI-powered Future
 Our recent survey with Ipsos has shown that learning is the top motivation for using AI globally. This is especially true in India, which now leads the world in daily Gemini usage by students. We’re seeing AI can drive profound comprehension and critical thinking when it is purpose-built for learning and implemented as a supportive partner to educators.
 At City Montessori School in Lucknow, teachers are integrating Guided Learning into math classes for Grade 8-9 students and seeing a positive response. An early analysis of a randomized control study conducted by Fab AI shows that students are demonstrating a desire for deeper learning, not just quick answers: in almost three out of every four conversations on Gemini, students sought to develop their understanding rather than a quick answer or shortcut.
 That’s why we’re expanding efforts with additional partners to supercharge the potential of learning for more Indian students and teachers:
 Powering innovation hubs with GenAI assistants: Together with Atal Tinkering Labs, which serves more than 10,000 Indian schools and 11 million students, we will help incorporate robotics and coding into local curricula, integrate Gemini thoughtfully into teacher workflows, and build a safely guardrailed AI assistant for students grounded in national curriculum standards that can act as an educational partner. Teachers can access real-time tips to help students fix a robot missing a part with readily available materials or mend a broken circuit design by simply pointing a camera to it or asking Gemini in chat.
 Transforming textbooks into interactive digital journeys: In a first-of-its-kind partnership with PM Publishers Pvt. Ltd., a K-12 textbook publisher in India, Gemini will be used to transform two million static textbooks into AI-powered interactive journeys across more than 250 titles and 2,000 schools. Each book features a QR code that can be scanned by students to access a custom Gem (specialized versions of the Gemini AI model), that acts as an expert assistant on the subject, providing summaries and responses on the contents of the respective book.
 Serving India’s linguistic diversity: There is incredible potential for AI to make a positive impact on education when built in close partnership with experts and grounded in local language and culture. Building on Google.org’s recent $2 million founding contribution to establish the new Indic Language Technologies Research Hub at IIT Bombay, we’ll help incorporate India’s linguistic diversity into AI as it advances globally.
 These efforts build on the global success of existing AI literacy programs like Experience AI , a joint partnership developed by Google DeepMind with Raspberry Pi Foundation, which has already reached up to 300,000 students and 8,000 teachers in India.
 AI solutions for India’s agriculture and energy sectors
 Our new partnerships in science and education build on our ongoing collaboration with local Indian organizations to tackle global challenges in agriculture and energy security. Working with Indian startups, institutions like Council on Energy, Environment and Water (CEEW), and Indian state and central government entities are using the APIs of our freely available Agri AI models to enhance agricultural resilience, crop productivity and farmer incomes. TerraStack is also using Google AI to combine satellite, crop, and weather data, into hyper-local insights that help farmers make better agricultural decisions.
 We also recently announced a growing collaboration with Open Climate Fix to integrate our WeatherNext AI models into India’s electricity grid operations. We’re aiming to significantly improve the accuracy of renewable energy forecasts in India, help grid operators manage volatility, and support the country’s ambitious clean energy targets. When we tested the integration of WeatherNext into OCF’s wind generation forecast, results showed up to 8% accuracy improvement in forecast performance.
 This partnership comes as India rapidly scales its renewable capacity, becoming the third largest generator of solar energy globally in 2023, with an ambitious target of installing 500 GW of renewable capacity by 2030. Working together on energy solutions has never been more important - we remain committed to working with experts in India to progress this effort together to prepare for the future.
 Preparing for the future together
 AI’s global impact is inevitable, but its success is not. To turn potential into prosperity, we are committing to deep, local collaboration with India's government bodies and institutions to ensure AI delivers tangible results across the subcontinent–and the world.

 Related posts

 National Partnerships for AI

 Learn more

 Strengthening our partnership with the UK government to support prosperity and security in the AI era
 December 2025 Responsibility & Safety
 Learn more

 Deepening our partnership with the UK AI Security Institute
 December 2025 Responsibility & Safety
 Learn more

 Google DeepMind supports U.S. Department of Energy on Genesis: a national mission to accelerate innovation and scientific discovery
 December 2025 Science
 Learn more

Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo | NVIDIA Technical Blog

nvidia_dev_blog

01.03.2026 07:00

0.631

Embedding sim.	0.7109
Entity overlap	0.0476
Title sim.	0.2783
Time proximity	0.7738

NLP тип	partnership
NLP организация	NVIDIA
NLP тема	ai agents
NLP страна

Открыть оригинал

Autonomous networks are quickly becoming one of the top priorities in telecommunications. According to the latest NVIDIA State of AI in Telecommunications report , 65% of operators said AI is driving network automation, and 50% named autonomous networks as the top AI use case for ROI. 

 Yet many telcos still report gaps in AI and data science expertise. This makes it difficult to scale safe, closed-loop automation across complex, multidomain networks.  

 Most telecom network operations centers (NOCs) today operate using reactive, alarm-driven workflows. Engineers manually triage thousands of incidents across multiple tools, sift through a high volume of alarm and performance data, and stitch together fragmented dashboards and logs before applying a fix or dispatching a field team. NOCs are a natural starting point for autonomous networks, because they concentrate high-volume, repeatable tasks where AI can directly cut MTTR and OPEX.

Tech Mahindra, a leading global provider of technology consulting and digital solutions to enterprises across industries, and NVIDIA are collaborating to close this AI skills gap. They’re doing so by making autonomous network building blocks—open models, tools, and implementation guides—into assets telecom developers can readily adopt and adapt in their own environments. 

 This post outlines how to fine‑tune reasoning models with NVIDIA NeMo so they behave like NOC engineers, safely driving closed‑loop, self‑healing workflows. It shows how to: 

 Generate synthetic, telecom‑realistic incident data

 Translate expert procedures into structured reasoning traces using the production-grade reference workflows. This teaches the model to coordinate tools, reason over network state, and execute fault‑management tasks end to end

 The result is a repeatable method that telco teams can use to build their own specialized AI agents for network operations. These agents can perform triage, root‑cause analysis, and resolution for high‑volume incident classes, helping operators progress toward TM Forum Level 4 highly autonomous networks and beyond.

 Why do network operations centers need reasoning models?

 Traditional NOC automation is mostly rule‑based and open‑loop: scripts trigger on fixed conditions but struggle with noisy signals, cross‑domain dependencies, and constantly changing network behavior. As a result, many Level 1 and Level 2 tasks—triage, root‑cause analysis, validation after a change—still depend on manual effort, keeping MTTR high and limiting how far operators can move toward truly autonomous operations.

 Figure 1. Shifting from manual NOC alarm handling to a reasoning agent embedded in the NOC workflow

 A telco reasoning model becomes the engine for an AI agent that can take on this work pattern in a controlled, auditable way. Instead of hard‑coded runbooks and point scripts, the agent uses the model to interpret incidents, decide which tools to call, and adapt its actions based on live responses. Key features include:

 AI reasoning plus tool-calling : Replaces manual alarm triage by invoking NOC tools for validation, root‑cause analysis, and remediation across existing systems

 End-to-end automation : Handles alarm validation, RCA, and healing for various incident types such as outages, flaps, congestion, and configuration issues

 Noise reduction : Filters self‑clearing or low‑value alarms using historical patterns so engineers can focus on higher priorities

 Resolution in seconds, not hours : Shrinks resolution time for high‑volume, well‑understood incidents from hours to seconds, significantly reducing MTTR

 The outcome is a closed‑loop, self‑healing network. Specialized NOC agents handle routine triage and resolution, and engineers shift from reactive alarm handling to proactive optimization and complex problem-solving.

 Designing a telco reasoning pipeline

 The technical approach to this solution combines the following components into one reproducible pipeline: 

 Synthetic incident data

 Expert NOC procedures

 Structured reasoning traces

 Supervised fine‑tuning 

 Evaluation 

 Instead of trying to learn from raw logs and alarms directly, the model is trained on curated examples that show how an experienced engineer would analyze an incident, call tools, and decide when a fix is complete.

 Figure 2. Agent training pipeline, from synthetic incident generation to reasoning model, fine-tuning, and evaluation across tool-calling, reasoning, and conclusions

 In this case, Qwen3-32B is the base reasoning modeling that is fine-tuned for telco NOC workflows using the following design principles:

 Focusing on a small number of high‑impact faults, which account for the majority of incidents and require deliberate action. This enables the model to learn deeply on the fault classes that matter most.

 Defining step-by-step operational guidelines for each problem type including RCA and remediation steps and NOC tools that agents must use.

 Generate synthetic reasoning traces that capture multistep tool calls and the rationale behind each decision, using the NeMo Skills reference workflow to automate trace and incident generation.

 NeMo Skills orchestrates this pipeline end to end, using its CLI, vLLM or TensorRT LLM servers, and training utilities to move from raw incidents to a fine-tuned telco reasoning model.​

 Synthetic incidents and NOC tool-calling

 The input to the pipeline is a fully synthetic incident dataset that is modeled on real NOC behavior. Each record includes fields such as region, domain, priority, problem type, possible cause, and time stamps. Engineer notes are also included, describing intermediate steps and close notes summarizing the final resolution and close code. 

 An incident summary captures why the network was degraded or down and is the backbone of what the model is trained to solve. The pipeline concentrates on the most frequent, high-impact faults that account for the bulk of incident volume and require explicit action. The reasoning model learns deeply on the cases that drive MTTR and OPEX.

 To model realistic NOC workflows, a set of custom tools are defined for agents to call in multistep procedures, such as:

 Acknowledging and tracking the initial alert

 Checking site and equipment status

 Performing remote actions (reset, unlock, enable)

 Monitoring for automatic recovery or alarm clearance

 Checking topology, power, and fiber, plus public outage information

 Applying configuration fixes

 Rechecking alarm status when it remains active

 Investigating persistent or recurring alarms

 Documenting actions and status updates

 Coordinating onsite dispatch or hardware replacement

 Confirming final site health and closing the incident

 For each problem type, domain experts translate existing workflows into step‑by‑step guidelines that map onto these tools. Examples include which triage toolkit to consult first; which alarms to query; when to reboot a device; and how to verify a fiber cut, power outage, or network element faults. 

 These guidelines become blueprints for the synthetic reasoning traces the model will learn from. They later define the action space that NOC agents use when executing closed‑loop workflows in production. 

 Turn expert procedures into reasoning traces

 To turn expert NOC procedures into training data for a telco‑specialized reasoning model, follow the three-step NeMo Skills workflow outlined below. It converts runbooks into structured, multiturn reasoning traces ready for autonomous NOC agents.

 Step 1: Generate structured action sequences

 Using a reference workflow from NeMo Skills, a teacher model generates standardized action sequences for each incident based on prompts that include incident fields and guideline templates. The steps map directly to NOC tools.

 Traces are formatted so each step records the action, its parameters, the tool call, and the immediate result, forming a structured view of the NOC workflow.​

 Step 2: Attach per‑step reasoning

 A second pass enriches every action with reasoning text that explains why the step is taken, what signals it uses, and how it influences the next decision. This creates a chain of reasoning that reflects how an experienced NOC engineer reasons over topologies, alarms, and historical behavior. 

 Because raw traces can be verbose or repetitive, a squashing phase merges related steps while preserving key decision points, making sequences more efficient for training.

 Step 3: Formatting for multiturn, tool‑calling models

 Using another workflow from NeMo Skills, the formatted traces are converted into a Qwen-compatible format that encodes both the dialogue-style interaction and tool-calling actions over multiple turns. Multiturn tokenization simulates realistic interactions where the agent alternates between reasoning, calling tools, and interpreting tool responses, which is essential for deploying a ReAct-style NOC agent.​​

 The result is a curriculum-structured dataset where easier cases and shorter traces appear earlier, while more complex multi-step incidents appear later, supporting curriculum learning during model training.​​

 Fine-tuning the telco reasoning model 

 The fine-tuning phase uses a standard train/test split on the compiled reasoning dataset, with NeMo Skills orchestrating data preparation and Qwen3 32B serving as the base reasoning model. NeMo Skills prepare_data
 utilities apply a telco‑specific prompt template ( noc_reasoning_sft
) and the Qwen tokenizer. This makes each trace in the training split into a supervised fine‑tuning (SFT) example that includes:

 Incident context and NOC signals

 Multistep tool calls and intermediate results

 Reasoning traces explaining each decision

 Final resolution and incident summary

 This produces a single JSONL file of SFT-ready examples for the telco reasoning model.​

 To improve learning efficiency, curriculum learning is applied by ordering samples from simple, single‑problem incidents to more complex multistep, multitool cases. This allows the model to master core NOC behaviors before tackling long, multiturn troubleshooting patterns. 

 Multiturn tokenization ensures that each example preserves realistic sequences of queries, tool calls, responses, and follow‑up actions, rather than isolated single‑turn prompts. These capabilities are critical for downstream ReAct‑style agents that must coordinate multiple tools over long contexts.

 Ultimately, Qwen3‑32B is fine‑tuned on this telco reasoning curriculum with long sequence lengths and tensor model parallelism across GPUs. Checkpointing and experiment tracking allow teams to iterate on data quality, curriculum design, and hyperparameters. 

 The result is a telco‑specialized reasoning model that understands incident fields, close codes, and NOC procedures, and can reliably drive multitool, multiturn tool‑calling workflows in production.

 Evaluating incident summary accuracy and safety

 Initial evaluation focuses on incident summary accuracy: how well the model, embedded in a ReAct‑style agent with tools, predicts and executes the correct resolution path for a given incident. 

 Experiments compare the fine‑tuned telco reasoning model against a baseline Qwen3‑32B on held‑out incidents, measuring accuracy, precision, and recall across problem and close‑code categories. Incident summary accuracy can also be analyzed within a single problem type to highlight where reasoning traces and curriculum learning deliver the largest gains, informing future iterations of synthetic data generation and guideline design.

Evaluations across multiple iterations show that the fine-tuned model improves accuracy from roughly 20% to 60%.

 Beyond incident summary metrics, additional evaluation methods can be introduced over time to further harden the system, including:

 LLM‑as‑a‑judge setups to evaluate reasoning traces for correctness, completeness, and safety

 LLM‑as‑a‑judge to assess final conclusions and remediation plans

 Tool‑calling benchmarks such as BFCLv3 to measure how reliably the agent sequences and interprets tool calls

 Rollout and rejection sampling to stress‑test behavior across many simulated incidents

 Controlled errors injected into traces to teach the model to detect and recover from its own mistakes

 Incorporation of retrieval‑augmented generation (RAG) with historical few‑shot examples to improve robustness on long‑tail scenarios

 Get started building telco reasoning models for autonomous networks

 Telco‑specific reasoning models—powered by synthetic data, structured traces, and safe tool‑calling—can move NOCs toward zero‑touch, self‑healing operations. By focusing on high‑impact close codes, encoding expert guidelines as multiturn reasoning traces, and fine‑tuning large models with the NVIDIA NeMo software toolkit, operators can build agents that reliably take on real NOC engineer tasks. 

 The pipeline is reusable and adaptable, so this approach can be tailored to each operator’s tools, data, and policies. This accelerates the industry’s transition from manual alarm handling to intelligent, autonomous network operations.

 To get started fine-tuning a reasoning model to build AI agents for network operations, see Teaching a Model to Reason over Telecom Network Incidents .

 Discuss (0)

 Like

 Tags

 Agentic AI / Generative AI | Networking / Communications | Telecommunications | NeMo | TensorRT-LLM | Intermediate Technical | Tutorial | AI Agent | featured | Retrieval Augmented Generation (RAG) | Training AI Models

 About the Authors

 About Aiden Chang

 Aiden Chang is a solution architect at NVIDIA, focusing on enterprise applications of generative AI, robotics, and reasoning systems. He earned his master’s in computer science from the University of Southern California. Outside of work, he enjoys skiing, aviation, and building robots.

 View all posts by Aiden Chang

 About Amparo Canaveras

 Amparo Canaveras is a senior solutions architect at NVIDIA, specializing in generative AI applications within the telecommunications sector. She brings over 20 years of experience from her time in network operations and analytics at Nokia and Verizon. Amparo holds a B.Sc. in electrical engineering from the Polytechnic University of Valencia and an M.Sc. in systems design and management from MIT.

 View all posts by Amparo Canaveras

 About Ari Uskudar

 Ari Uskudar has 20-plus years of experience in AI-driven network automation, RAN intelligence, and large-scale telecom architecture across NVIDIA, VMware, Ericsson, Verizon, Turkcell, Vodafone, and Motorola. Her expertise spans agentic AI systems, autonomous network design, LLM-based telco reasoning, ML-powered observability, and end-to-end optimization. Ari has authored multiple patents in autonomous networks, 6G core architecture, and telco blueprints, etc. Known for bridging deep engineering with strategic product thinking, she designs advanced architectures, leads complex technical collaborations, and develops industry-adopted innovations that shape the future of AI-native telecom systems.

 View all posts by Ari Uskudar

 About Amol Phadke

 Amol Phadke is the chief transformation officer at Tech Mahindra, working closely with the CEO on enterprise-wide strategic initiatives, including the global elevation of the Communications industry vertical. He brings deep technology and business leadership across AI, cloud, software networks, big tech, and telecommunications, specializing in strategy definition, driving execution of large-scale engineering, and leading global multidiscipline teams. With over 25 years of global industry experience, he has previously held senior leadership posts as Group CTIO Telenor Group and GM at Google Cloud, among others. Amol holds a double degree executive MBA from UCLA, California - NUS, Singapore, a master’s degree in Telecommunications Engineering from USC, California, and a bachelor’s degree in Electronics Engineering from the University of Mumbai.

 View all posts by Amol Phadke

 Comments

 Related posts

 Build an AI Agent to Analyze IT Tickets with NVIDIA Nemotron

 Build an AI Agent to Analyze IT Tickets with NVIDIA Nemotron

 Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models

 Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models

 Transforming Telco Network Operations Centers with NVIDIA NeMo Retriever and NVIDIA NIM

 Transforming Telco Network Operations Centers with NVIDIA NeMo Retriever and NVIDIA NIM

 Navigating Generative AI for Network Admins

 Navigating Generative AI for Network Admins

 Diagnosing Network Issues Faster with NVIDIA WJH

 Diagnosing Network Issues Faster with NVIDIA WJH

 Related posts

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere 

 Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI

 Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI

 Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air

 Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air

 Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell

 Run Autonomous, Self-Evolving Agents More Safely with NVIDIA OpenShell

 NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer

 NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer

 L

 T

 F

 R

 E

Google and the Massachusetts AI Hub are launching a new AI training initiative for the Commonwealth.

google

26.02.2026 18:55

0.627

Embedding sim.	0.7131
Entity overlap	0.0286
Title sim.	0.0973
Time proximity	0.9892

NLP тип	partnership
NLP организация	Google
NLP тема	ai adoption
NLP страна	United States

Открыть оригинал

Breadcrumb

 Company News

 Outreach & Initiatives

 Grow with Google

 AI is creating new opportunities across the workforce.
 Today, we announced with Governor Maura Healey that Google is partnering with the Massachusetts AI Hub to provide every Bay Stater with no-cost access to Google’s AI and career training through our Grow with Google program. This includes Google’s new AI Professional Certificate and the Google Career Certificates program . We’re excited to help Massachusetts residents learn how to use AI tools in their everyday work and create opportunities for career advancement. This partnership builds on our ongoing AI and career training commitments in Arkansas, Connecticut, Oklahoma and Virginia.
 By equipping residents with essential AI literacy, Massachusetts is ensuring its workforce is prepared for the jobs of today and tomorrow. Google is proud to call Massachusetts home with an office in Cambridge, and we’re committed to helping the state make AI literacy and professional training accessible to everyone.
 Massachusetts residents can now access Google’s AI Training at no cost .

 Grow with Google founder Lisa Gevelber and Governor Maura Healey meet with Google AI course graduates to discuss their journeys.

 POSTED IN:

Grow with Google

AI

Public Policy

 Related stories

AI Data Centers Turn to High-Temperature Superconductors

ieee_spectrum_ai

21.02.2026 14:00

0.625

Embedding sim.	0.7337
Entity overlap	0.027
Title sim.	0.1209
Time proximity	0.7325

NLP тип	other
NLP организация	Microsoft
NLP тема	ai infrastructure
NLP страна	United States

Открыть оригинал

Data centers for AI are turning the world of power generation on its head. There isn’t enough power capacity on the grid to even come close to how much energy is needed for the number being built. And traditional transmission and distribution networks aren’t efficient enough to take full advantage of all the power available. According to the U.S. Energy Information Administration (EIA), annual transmission and distribution losses average about 5 percent. The rate is much higher in some other parts of the world. Hence, hyperscalers such as Amazon Web Services, Google Cloud and Microsoft Azure are investigating every avenue to gain more power and raise efficiency.
 Microsoft, for example, is extolling the potential virtues of high-temperature superconductors (HTS) as a replacement for copper wiring. According to the company, HTS can improve energy efficiency by reducing transmission losses, increasing the resiliency of electrical grids, and limiting the impact of data centers on communities by reducing the amount of space required to move power.
 “Because superconductors take up less space to move large amounts of power, they could help us build cleaner, more compact systems,” Alastair Speirs, the general manager of global infrastructure at Microsoft wrote in a blog post .
 Superconductors Revolutionize Power Efficiency
 Copper is a good conductor, but current encounters resistance as it moves along the line. This generates heat, lowers efficiency, and restricts how much current can be moved. HTS largely eliminates this resistance factor, as it’s made of superconducting materials that are cooled to cryogenic temperatures. (Despite the name, high-temperature superconductors still rely on frigid temperatures—albeit significantly warmer than those required by traditional superconductors.)
 The resulting cables are smaller and lighter than copper wiring, don’t lower voltage as they transmit current, and don’t produce heat. This fits nicely into the needs of AI data centers that are trying to cram massive electrical loads into a tiny footprint. Fewer substations would also be needed. According to Speirs, next-gen superconducting transmission lines deliver capacity that is an order of magnitude higher than conventional lines at the same voltage level.
 Microsoft is working with partners on the advancement of this technology including being a part of a US $75 million Series B funding round into Veir , a superconducting power technology developer. Veir’s conductors use HTS tape, most commonly based on a class of materials known as rare-earth barium copper oxide (REBCO). REBCO is a ceramic superconducting layer deposited as a thin film on a metal substrate, then engineered into a rugged conductor that can be assembled into power cables.
 “The key distinction from copper or aluminum is that, at operating temperature, the superconducting layer carries current with almost no electrical resistance, enabling very high current density in a much more compact form factor,” says Tim Heidel , Veir’s CEO and cofounder.
 Liquid Nitrogen Cooling in Data Centers
 Ruslan Nagimov, the principal infrastructure engineer for cloud operations and innovation at Microsoft, stands near the world’s first HTS-powered rack prototype. Microsoft 
 HTS cables still operate at cryogenic temperatures, so cooling must be integrated into the power-delivery system design. Veir maintains a low operating temperature using a closed-loop liquid-nitrogen system: The nitrogen circulates through the length of the cable, exits at the far end, is recooled, and then recirculated back to the start.
 “Liquid nitrogen is a plentiful, low cost, safe material used in numerous critical commercial and industrial applications at enormous scale,” says Heidel. “We are leveraging the experience and standards for working with liquid nitrogen proven in other industries to design stable, data center solutions designed for continuous operation, with monitoring and controls that fit critical infrastructure expectations rather than lab conditions.”
 HTS cable cooling can be done either within the data center or externally. Heidel favors the latter as that minimizes footprint and operational complexity indoors. Liquid nitrogen lines are fed into the facility to serve the superconductors. They deliver power to where it’s needed and the cooling system is managed like other facility subsystems.
 Rare earth materials, cooling loops, cryogenic temperatures—all of this adds considerably to costs. Thus, HTS isn’t going to replace copper in the vast majority of applications. Heidel says the economics are most compelling where power delivery is constrained by space, weight, voltage drop, and heat.
 “In those cases, the value shows up at the system level: smaller footprints, reduced resistive losses, and more flexibility in how you route power,” says Heidel. “As the technology scales, costs should improve through higher-volume HTS tape manufacturing and better yields, and also through standardization of the surrounding system hardware, installation practices, and operating playbooks that reduce design complexity and deployment risk.”
 AI data centers are becoming the perfect proving ground for this approach. Hyperscalers are willing to spend to develop higher-efficiency systems. They can balance spending on development against the revenue they might make by delivering AI services broadly.
 “HTS manufacturing has matured—particularly on the tape side—which improves cost and supply availability,” says Husam Alissa , Microsoft’s director of systems technology. “Our focus currently is on validating and derisking this technology with our partners with focus on systems design and integration.”
 This story was updated on 26 February, 2026 to correct details of Microsoft’s investment into Veir.

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

microsoft_research

19.02.2026 16:00

0.624

Embedding sim.	0.7235
Entity overlap	0
Title sim.	0.037
Time proximity	0.9642

NLP тип	product_launch
NLP организация
NLP тема	code generation
NLP страна

Открыть оригинал

November 11, 2025

 BlueCodeAgent: A blue teaming agent enabled by automated red teaming for CodeGen AI

Introducing EVMbench

openai

18.02.2026 00:00

0.624

Embedding sim.	0.7249
Entity overlap	0
Title sim.	0.0261
Time proximity	0.9643

NLP тип	product_launch
NLP организация	OpenAI
NLP тема	ai agents
NLP страна

Открыть оригинал

OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

huggingface

18.02.2026 16:15

0.621

Embedding sim.	0.7108
Entity overlap	0.0256
Title sim.	0.0588
Time proximity	0.9984

NLP тип	experiment
NLP организация	IBM Research
NLP тема	ai agents
NLP страна

Открыть оригинал

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

 Enterprise Article Published
 February 18, 2026

 Upvote 18

 +12

 Ayhan Sebin ayhansebin 

 ibm-research

 Rohan Arora rohan-arora 

 ibm-research

 Saurabh Jha saurabhjha1 

 ibm-research

 The "Black Box" Problem of Agent Benchmarks
 The Experiment: Diagnosing ITBench Agents
 Finding 1: Stronger models like Gemini-3-Flash shows surgical (isolated failure modes) per trace whereas open sourced Kimi-K2 and GPT-oss-120b show compounding failure patterns
 Finding 2: "Non-Fatal" vs. "Fatal" Failures The "Non-Fatal" (Benign) Flaws
 The "Fatal" Flaws
 Case Study: Gemini-3-Flash (Decisive but Overconfident)
 Case Study: GPT-OSS-120B

 A different (and more useful) way to read the plots: “fatal” vs “non-fatal” Recoverable / structural (show up even in successful traces)
 Fatal / decisive (strongly associated with failed traces)

 Conclusion

 Ayhan Sebin
 Saurabh Jha
 Rohan Arora
 Daby Sow
 Mert Cemri
 Melissa Pan
 Ion Stoica

 ITBench HF Space
 ITBench HF Dataset
 MAST HF Dataset
 ITBench Github
 MAST Github

 IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation, for tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops.

 Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box problem, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability ). By leveraging MAST to analyze ITBench—the industry benchmark for SRE, Security, and FinOps automation—we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it. We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.

 Key Findings:

 Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks like verification. Large open models like GPT-OSS-120B suffer from cascading failure modes (5.3 failure modes/trace). -A single reasoning mismatch early in the run poisons the context, leading to compounding hallucinations.

 Across all models, the strongest predictor of failure is FM-3.3 (Incorrect Verification). Agents consistently "declare victory" without checking ground truth.

 Kimi-K2 struggles to recognize when a task is done. It exhibits a massive spike in Premature Termination (+46%) and Unaware of Termination Conditions (+43%), often quitting just before solving the problem or looping indefinitely.

 Takeaways from our analysis when building agents:

 For Frontier Models like Gemini: Externalize Verification. Never let the LLM grade its own homework. Require hard tool evidence before exit.

 Put termination + loop control outside the model: Termination issues are common killers (FM-1.5). Add explicit stop conditions + loop detectors for repeated tool calls/actions or implement Finite State Machines.

 Force clarify-or-read-only when inputs are ambiguous: Clarification failures (FM-2.2) are a major failure driver for smaller models. Make ambiguity a first-class branch in your agent graph.

 If you’re building agents for enterprise IT workflows, this is the kind of evaluation you want: not just “did it pass?”, but “what broke, where, and what intervention is most leverageable?”

 The "Black Box" Problem of Agent Benchmarks

 Benchmarks like ITBench are becoming the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments.

 This benchmarks use success rate as a main metric to evaluate agents. However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us that it failed, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply did not terminate?

 Without a comprehensive approach to diagnose these failures, developers are left guessing, often resorting to blind prompting tweaks that solve one problem only to create another.

 As a new standard to analyze the failure modes of complex agentic systems, we developed MAST (Multi-Agent System Failure Taxonomy) . MAST brings more insights and open up the opaque evaluation of these benchmarks. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST provides a standardized taxonomy for agent failures.

 MAST converts unstructured execution logs into structured " failure vectors " based on 14 distinct patterns across three key categories:

 FC1: System Design Issues (The "Skeleton")
 Failures here stem from the agent's architecture and role definition.

 Examples: FM-1.3 Step Repetition (looping), FM-1.4 Loss of Conversation History (memory leaks), FM-1.5 Unaware of Termination (failing to stop).

 FC2: Inter-Agent Misalignment (The "Communication")
 Failures arising during runtime from how agents talk to each other or the environment.

 Examples: FM-2.2 Fail to Ask for Clarification (assuming instead of asking), FM-2.3 Task Derailment (going off-topic).

 FC3: Task Verification (The "Quality Control")
 Failures in quality assurance of the agents' output.

 Examples: FM-3.1 Premature Termination (giving up too soon), FM-3.3 Incorrect Verification (hallucinating success).

 The Experiment: Diagnosing ITBench Agents

 We stress-test the idea of using MAST to make agent evaluations actionable and gain insights on the failure modes by applying it to ITBench, a popular evaluation suite for IT automation tasks across SRE , Security/Compliance , and FinOps .

 We annotated 310 ITBench SRE execution traces produced by an SRE agent built with Codex in realistic environments. These traces capture natural language interactions between agents and their tools across three models representing different capability tiers: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. This lets us look past simple success metrics and investigate the distinct failure signatures driving these results. For this we use the recall scores, as the models by design only output a maximum of 3-5 outputs and SREs prefer the recall scores over F-1 score.

 Gemini-3-Flash: 100 traces (75.5% Mean Recall)

 Kimi-K2: 105 traces (28.6% Mean Recall)

 GPT-OSS-120B: 105 traces (12.4% Mean Recall)

 Below, we detail the findings from this diagnostic analysis.

 Finding 1: Stronger models like Gemini-3-Flash shows surgical (isolated failure modes) per trace whereas open sourced Kimi-K2 and GPT-oss-120b show compounding failure patterns

 When we examine the failed traces, a clear hierarchy of complexity becomes apparent across the three models. This is measured by the number of distinct failure modes observed per failed run.

 Gemini-3-Flash:  2.6 failure modes per failed trace

 Kimi-K2:  4.7 failure modes per failed trace

 GPT-OSS-120B:  5.3 failure modes per failed trace

 This disparity in failure mode density reveals a fundamental difference in how these systems break down. Gemini-3-Flash exhibits a surgical failure profile. Even in unsuccessful runs, it maintains high internal coherence and typically fails due to a single isolated failure, such as an incorrect verification step. These failures are precise and far easier to diagnose.

 On the opposite end of the spectrum, GPT-OSS-120B suffers from cascading collapse. In these traces, we observe that errors tend to compound over time. A small reasoning mismatch early in the process often leads to a deviation from the task specification, which in turn triggers a total derailment of the agent. Kimi-K2 represents the middle ground, where failures are more frequent and complex than the frontier model but do not reach the systemic instability seen in the 120B open weights model.

 The significance of this finding is that a higher success rate is often accompanied by isolated failure. Systems that fail with fewer simultaneous problems are far more predictable and simpler to improve through targeted engineering interventions.

 Finding 2: "Non-Fatal" vs. "Fatal" Failures

 Perhaps the most critical insight from MAST is distinguishing between failures that the system can tolerate versus those that are fatal to success of the downstream task. By comparing the distribution of failure modes in Successful Traces vs. Failed Traces , we can classify them into three categories.

 The "Non-Fatal" (Benign) Flaws

 Across all three models, certain failure modes appear frequently even in runs that ultimately succeed. These are often structural frictions rather than terminal bugs.

 FM-1.3 Step Repetition:  This mode is present in over 90 percent of successful Kimi-K2 runs. In the SRE domain, iteration is often a necessity. An agent might query the same metric multiple times to verify if a service is stabilizing or if a fix has taken effect. Gemini-3-Flash actually shows less repetition in its failed traces, suggesting that it sometimes fails because it does not iterate enough.

 FM-1.1 Disobey Task Specification:  Agents frequently deviate from strict tool formatting or sequential instructions yet still manage to identify the correct root cause.

 This separation is where MAST proves its value. It allows us to ignore the bening failures like repetition that often occurs in troubleshooting, and focus instead on fatal failures that killed a run.

 The "Fatal" Flaws

 Certain behaviors strongly separate success from failure. When these modes appear, the probability of a successful outcome drops precipitously. The most prominent example is  FM-3.3 (Incorrect Verification) . This mode shows a 52 percent increase in failed Gemini-3-Flash traces compared to its successful ones. Other prominent failure modes are 1.5 (Unaware of Termination Conditions) and 2.6 (Reasoning Action Mismatch).

 If these happen, the run is likely dead; guiding practitioners to develop robust context management strategies across agents in the system and multiple turns of interactions.

 Case Study: Gemini-3-Flash (Decisive but Overconfident)

 Gemini-3-Flash is highly efficient, but its primary bottleneck is its tendency to assume success without rigorous proof. Its failure signature is dominated by a massive delta in verification errors. It often identifies the correct signals but terminates before cross-referencing them against the ground truth. To fix this, developers should implement an external verification gate. By requiring tool-based evidence like a cleared alert or a healthy metric threshold before allowing the agent to exit, we can mitigate this model’s inherent overconfidence.

 Fix: To improve Gemini-3-Flash on ITBench, prompt engineering won't help much. In particular, the experiments we shown in our NeurIPS 2025 paper shows that with manual interventions like prompt engineering for memory related failures, we can get only up to around 15.6% performance improvements, whereas in a previous blogpost on MAST , we showed that by introducing new agents such as a Summarizer Agent to remind the other agents of what is going on and continuously augment their state (fixing FM-1.4) or by introducing context management mechanisms (such as a stricter State Machine to enforce termination to fix FM-1.5), we can get up to 53% performance improvement as these tackle more fundamental issues with the system.

 Case Study: Kimi-K2 (The Termination Crisis)

 While termination confusion (FM-3.1 and FM-1.5) is the prevalent failure mode for Kimi-K2, its failed trajectories are defined by a pervasive  Action-Reasoning Mismatch (FM-2.6) , which is present in a staggering  92% of its failures .

 The Execution Gap:  While parts of its internal reasoning are often accurate, it suffers from a 92 percent failure prevalence of  FM-2.6 (Action-Reasoning Mismatch) . It frequently identifies the correct next step but then executes a redundant or irrelevant command.

 The Meta-Loop Trap:  Roughly 25 percent of failed traces involve  FM-2.3 (Task Derailment) . When a tool call returns a minor error, the agent often abandons the primary incident to enter a cycle of debugging its own investigation scripts.

 Kimi-K2 is a good example of an overthinking model, its reasoning chains are often too long but can fail at execution.

 Case Study: GPT-OSS-120B

 GPT-OSS-120B exhibits the most unstable failure signature of the cohort. This model exhibits an average of 5.3 distinct failure modes per failed trace, indicating a fundamental inability to maintain internal state.

 Loss of Conversation History (FM-1.4):  This is a unique fatal flaw for the 120B model. It loses conversation history in  24%  of traces, whereas Gemini-3-Flash exhibited zero memory loss and Kimi-K2 only 7%. As SRE traces grow in length, GPT-OSS-120B effectively "forgets" the alerts it was originally triaging, leading to total task derailment.

 Reasoning Disconnect (FM-2.6):  A staggering  94%  of traces show a decoupling of reasoning and action. It is nearly 3x more likely than Gemini (31%) to describe a correct plan but then execute a completely unrelated or redundant tool call.

 A different (and more useful) way to read the plots: “fatal” vs “non-fatal”

 In summary, MAST lets you split failure modes into two buckets:

 Recoverable / structural (show up even in successful traces)

 These are failures which are not fatal and from which the system can recover to successfully complete the task.

 FM-1.3 Step repetition

 FM-3.3 Incorrect verification (important nuance: the system does verify; it just verifies poorly)

 FM-2.6 Reasoning–action mismatch (often present, but not always decisive)

 Fatal / decisive (strongly associated with failed traces)

 These are failures from which the system typically cannot recover.

 FM-1.5 Unaware of termination conditions

 FM-3.1 Premature termination

 FM-1.4 Loss of conversation history

 FM-2.3 Task derailment (rare but extremely diagnostic when it appears)

 FM-2.2 Fail to ask for clarification (especially for Granite/Llama regimes)

 This is the “richer understanding” piece: two models can have the same success rate on a small slice, yet fail for entirely different reasons—requiring different fixes.

 Conclusion

 MAST is a tool that inspects the agentic system traces to identify fine-grain failure types that support system development and debugging. In this blog, we show that by applying MAST to ITBench, we move from generic observations ("Open models struggle") to a concrete engineering roadmap that help improving the performance of agentic systems relying on thse models, e.g.:

 For Gemini-3-Flash:  Verification failure ( FM-3.3 ) is the most common fatal failure for surgical models. Never allow an agent to self-terminate; require hard, tool-mediated evidence (e.g., AlertManager clearance or K8s state changes) before a run is considered successful.

 For Kimi-K2: Use a deterministic state machine to fix the model's frequent struggle with recognizing task completion. This model’s reasoning chains can be too long and struggle to terminate, so it might benefit significantly from a tighter control on when to end.

 For GPT-oss-120b: Systemic collapse occurs when minor reasoning mismatches ( FM-2.6 ) poison the task history. Implement aggressive context hygiene and early error detection to ensure that small misalignment's do not compound into total derailment.

 IT-Bench Paper: https://arxiv.org/pdf/2502.05352

 IT-Bench Code: https://github.com/itbench-hub/ITBench

 MAST Paper: https://arxiv.org/abs/2503.13657

 MAST Code: https://github.com/multi-agent-systems-failure-taxonomy/MAST

 MAST-Data : 🤗 MAST-Data (1600+ Traces)

 Mentioned datasets
 ibm-research/ITBench-Lite
 Updated 22 days ago • 981 • 5

 mcemri/MAST-Data
 Preview • Updated Jul 21, 2025 • 358 • 13

 More from this author

 AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

 31
 January 21, 2026

 CUGA on Hugging Face: Democratizing Configurable AI Agents

 67
 December 15, 2025