Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities | NVIDIA Technical Blog
closed
Event type: other
Topic: ai infrastructure
Organization: OpenAI
Country: United States
Articles: 35
Unique sources: 10
Importance / Momentum: 3.11 / 0
Period: 17.02.2026 18:00 to 05.03.2026 17:00
Created: 06.04.2026 06:19:56
Articles in cluster: 35
Title | Source | Publication date | Score
S Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities | NVIDIA Technical Blog nvidia_dev_blog 17.02.2026 18:00 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: retrieval-augmented generation
NLP country:


Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, and embedded metadata. Financial reports carry critical insights in tables, engineering manuals rely on diagrams, and legal documents often include annotated or scanned content.

Retrieval-augmented generation (RAG) was created to ground LLMs in trusted enterprise knowledge by retrieving relevant source data at query time, which reduces hallucinations and improves accuracy. But if a RAG system processes only the surrounding text, it misses key signals embedded in tables, charts, and diagrams, resulting in incomplete or incorrect answers. An intelligent agent is only as good as the data foundation it’s built on. Modern RAG must therefore be inherently multimodal, able to understand both visual and textual context, to achieve enterprise-grade accuracy.

The NVIDIA Enterprise RAG Blueprint is built for this, providing a modular reference architecture that connects unstructured enterprise data to the intelligent systems built on top of it. The blueprint also serves as a foundational layer for the NVIDIA AI Data Platform, helping to bridge the traditional gap between compute and data. By enabling retrieval and reasoning closer to the data layer, it preserves governance, reduces operational friction, and makes enterprise knowledge immediately usable by intelligent systems. The result is a modern AI data stack: storage that can retrieve, enrich, and reason alongside your models.
While the Enterprise RAG Blueprint provides many configurable options, this post highlights the following five key configurations that most directly improve accuracy and contextual relevance across enterprise use cases:

- Baseline multimodal RAG pipeline
- Reasoning
- Query decomposition
- Filtering metadata for faster and precise retrieval
- Visual reasoning for multimodal data

The post also explains how the blueprint can be embedded into AI data platforms to transform traditional repositories into AI-ready knowledge systems.

Accuracy metrics in this blog are measured using the RAGAS framework on well-known public datasets. Learn more about evaluating your NVIDIA RAG Blueprint system.

1. Document ingestion and understanding

Before an agent can deliver insights, it must be grounded in your data. This foundational configuration focuses on intelligent document ingestion and core RAG functionality. The Enterprise RAG Blueprint uses NVIDIA NeMo Retriever to extract multimodal enterprise content (text, tables, charts and graphs, and infographics), then embeds that content into text for indexing in a vector database. At query time, the blueprint runs semantic retrieval and reranking, then uses a Nemotron LLM to generate a grounded answer. To maximize performance, this baseline intentionally avoids image captioning and heavy reasoning, making it the ideal starting point for production deployments. Deploy this baseline on Docker.

Benefits of document ingestion and understanding

This foundational configuration is the blueprint’s highest-efficiency pipeline, optimized for accuracy and throughput while keeping GPU cost and time to first token (TTFT) low. This configuration establishes your baseline performance for retrieval quality and LLM grounding.

Figure 1. RAG pipeline

Table 1 summarizes the overall impact across a few datasets.
Accuracy (v2.3 Default). MM = Multimodal, TO = Text-Only

Dataset        Type  Accuracy
RAG Battle     MM    0.809
KG RAG         MM    0.565
FinanceBench   MM    0.633
BO767          MM    0.910
HotpotQA       TO    0.671
Google Frames  MM    0.509

Table 1. Accuracy impact of baseline configuration (higher is better)

2. Reasoning

When you turn on reasoning in the RAG blueprint, you enable the LLM to interpret the retrieved evidence and synthesize logically grounded answers. This is the easiest change to get an accuracy boost for many applications. Enable reasoning for the NVIDIA Enterprise RAG Blueprint.

Table 2 summarizes the overall impact across several sample datasets.

Accuracy (v2.3 Default) plus Reasoning. MM = Multimodal, TO = Text-Only

Dataset        Type  Reasoning on  Default
RAG Battle     MM    0.85          0.809
KG RAG         MM    0.58          0.565
FinanceBench   MM    0.69          0.633
BO767          MM    0.88          0.91

Table 2. Accuracy impact of enabling reasoning versus baseline configuration (higher is better)

Benefits of reasoning

For any use case involving mathematical operations or complex data comparison, a typical simple similarity or hybrid search will not suffice. Reasoning is required to correct errors and ensure precise contextual understanding. Accuracy improvements across datasets averaged ~5%, with several cases demonstrating dramatic reasoning-driven corrections.

Examples

In the FinanceBench dataset, the baseline configuration incorrectly computed the Adobe FY2017 operating cash flow ratio as 2.91. After enabling reasoning, the model produced the correct answer, 0.83. In addition, the RAG Battle dataset demonstrates the accuracy improvement from enabling VLM.

3. Query decomposition

Answering complex user questions often requires pulling facts from multiple places in the data foundation. Query decomposition breaks a single question into smaller subqueries, retrieves evidence for each, and recombines the results into a complete, grounded response. Turn on query decomposition for the NVIDIA Enterprise RAG Blueprint.

Figure 2.
Response accuracy before and after query decomposition

Benefits of query decomposition

Query decomposition significantly improves accuracy for multihop and context-rich questions that span multiple paragraphs or documents. It does add extra LLM calls (increasing latency and cost), but the accuracy gains are often worth it for mission-critical enterprise use cases. Query decomposition can also be paired with reasoning for an additional boost when needed.

Example

As NVIDIA AI Data Platform partners evolve to offer more relevant and accurate retrieval, query decomposition can either be handled, at least in part, as query processing within the data platform or be left to the agent. Learn more about how query decomposition can be an approach in some use cases.

Table 3 shows the overall impact across a few datasets.

Accuracy (v2.3 Default) plus Query Decomposition. MM = Multimodal, TO = Text-Only

Dataset        Type  Query decomposition  Default
RAG Battle     MM    0.854                0.809
FinanceBench   MM    0.631                0.633
BO767          MM    0.885                0.91
HotpotQA       TO    0.725                0.671
Google Frames  MM    0.6                  0.5094

Table 3. Accuracy impact of query decomposition versus baseline configuration (higher is better)

4. Filtering metadata for faster and precise retrieval

Metadata, such as author, date, category, and security tags, has always been integral to enterprise data. In RAG pipelines, metadata filters can be used to narrow the search space and align retrieved content with the right context, significantly improving retrieval precision and speed.

The RAG blueprint supports custom metadata ingestion and automatic query generation based on that data. To use your custom metadata, see Advanced Metadata Filtering with Natural Language Generation. To learn more about what’s possible with this feature set, check out the example notebook on the NVIDIA-AI-Blueprints/rag GitHub repo.
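To make the idea of narrowing the search space concrete, here is a minimal sketch in plain Python. It is not the blueprint's API: the `Chunk` class, the `filtered_top_k` helper, the file names, and the similarity scores are all hypothetical, and the scores are assumed to have been computed already by some embedding model. The point is only that a metadata predicate removes candidates before similarity ranking, so fewer items are ranked and off-topic documents cannot appear in the results.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    """Hypothetical retrieved chunk; `score` is an assumed, precomputed similarity."""
    filename: str
    score: float
    metadata: dict = field(default_factory=dict)


def filtered_top_k(chunks: list[Chunk],
                   predicate: Callable[[dict], bool],
                   k: int) -> list[Chunk]:
    """Keep only chunks whose metadata passes the predicate, then rank by score."""
    candidates = [c for c in chunks if predicate(c.metadata)]
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


# Illustrative data only (made-up scores and file names).
chunks = [
    Chunk("ai_guide.pdf", 0.71, {"category": "AI", "rating": 4.5}),
    Chunk("engineering_manual.pdf", 0.93, {"category": "engineering", "rating": 3.8}),
    Chunk("ai_notes.pdf", 0.64, {"category": "AI", "rating": 3.2}),
]

hits = filtered_top_k(
    chunks,
    predicate=lambda m: m.get("category") == "AI" and m.get("rating", 0) >= 4.0,
    k=2,
)
print([c.filename for c in hits])
```

Note that the highest-scoring chunk is excluded because it fails the filter. In a real vector database the filter would typically be pushed down into the index query rather than applied in application code.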
Benefits of metadata filtering

Metadata filtering narrows the search space for faster retrieval and improves precision by aligning retrieved content with context. This allows developers to use metadata without manual filter logic to achieve higher throughput and contextual relevance. When metadata filtering capabilities are embedded directly into AI data platforms, they can make your storage smarter, leading to faster retrieval and lower latency.

Example

Consider two documents that are ingested with the following metadata:

    custom_metadata = [
        {
            "filename": "ai_guide.pdf",
            "metadata": {
                "category": "AI",
                "priority": 8,
                "rating": 4.5,
                "tags": ["machine-learning", "neural-networks"],
                "created_date": "2024-01-15T10:30:00"
            }
        },
        {
            "filename": "engineering_manual.pdf",
            "metadata": {
                "category": "engineering",
                "priority": 5,
                "rating": 3.8,
                "tags": ["hardware", "design"],
                "created_date": "2023-12-20T14:00:00"
            }
        }
    ]

When using metadata with a dynamic filter expression, a query such as “Show me high-rated AI documents with machine learning tags created after January 2024” is automatically translated into a filtering expression such as:

    filter_expression = 'content_metadata["category"] == "AI" and content_metadata["rating"] >= 4.0 and array_contains(content_metadata["tags"], "machine-learning") and content_metadata["created_date"] >= "2024-01-01"'

With metadata filtering enabled, the system retrieved 10 focused citations from one document, ai_guide.pdf, achieving 100% precision on the target domain while reducing the search space by 50%.

5. Visual reasoning for multimodal data

Enterprise data is visually rich. Where traditional text-only embeddings fall short, vision language models (VLMs) such as NVIDIA Nemotron Nano 2 VL (12B) introduce visual reasoning into the pipeline. Learn more about how to leverage a VLM for generation in the RAG Blueprint.

Figure 3.
Before and after leveraging a VLM for generation

Benefits of visual reasoning

Visual reasoning is crucial for handling real-world enterprise documents. Integrating a VLM in the generation pathway enables the RAG system to interpret images, charts, and infographics, making it possible to accurately answer queries where the information lies in a structured visual element rather than just the surrounding text.

Example

A significant accuracy improvement was observed when a VLM was enabled for the RAG Battle dataset in the RAG Blueprint, especially when the answer was in a visual element. Note that enabling VLM inference can increase response latency because of the additional image processing. Consider this tradeoff between accuracy and speed based on your requirements. Learn more about the accuracy improvements with VLM for the RAG Battle dataset.

Transforming enterprise storage into an active knowledge system

The Enterprise RAG Blueprint demonstrates how the progressive adoption of these five capabilities, from reasoning and metadata-driven retrieval to multimodal understanding, directly enhances the accuracy and groundedness of your intelligent agents. Each capability offers a unique balance between latency, token cost, and contextual precision, providing a flexible, tunable framework that can be adapted to various enterprise use cases.

This accelerates the evolution of the data foundation itself. The NVIDIA AI Data Platform transforms enterprise data into AI-searchable knowledge. As NVIDIA partners evolve their storage offerings, this blueprint serves as a reference for delivering embedded RAG capabilities that leverage metadata to enforce permissions, track changes, and provide highly accurate retrieval directly at the storage layer. NVIDIA storage partners are building AI data platforms based on the NVIDIA reference design that are transforming enterprise storage from a passive repository into an active, intelligent system in the AI workflow.
The result is a next-generation enterprise data infrastructure: faster, smarter, and purpose-built for the age of generative AI.

What’s new with the NVIDIA Enterprise RAG Blueprint

The latest release of the NVIDIA Enterprise RAG Blueprint deepens its focus on serving agentic workflows. It introduces first-class document-level summarization with both shallow and deep strategies, enabling agents to quickly assess relevance, narrow the search space, and balance accuracy with latency. A new data catalog improves discoverability and governance across large corpora, while upgrades to the Nemotron RAG models further enhance retrieval quality, reasoning, and generation performance, making RAG a more efficient, agent-ready foundation for enterprise-scale knowledge systems.

Get started with enterprise-grade RAG

Ready to integrate these five capabilities into your RAG use cases? Access the modular code, documentation, and evaluation notebooks for free within the NVIDIA Enterprise RAG Blueprint.

Make your enterprise data AI-ready and transform your production data into an intelligent knowledge system with embedded RAG capabilities using the NVIDIA AI Data Platform. Contact an NVIDIA AI storage partner to get started with your own NVIDIA-powered AI data platform.

Tags: Agentic AI / Generative AI | Data Center / Cloud | General | Blueprint | Nemotron | Intermediate Technical | Best practice | AI Agent | AI Data Platform | AI-Ready Data | featured | LLMs | Retrieval Augmented Generation (RAG)

About the Authors

About Shruthii Sathyanarayanan
Shruthii Sathyanarayanan is a product marketing manager in the NVIDIA Enterprise Computing group with a focus on enterprise AI and virtualization. Shruthii holds a bachelor’s degree in Computer Engineering and Business from the University of Illinois at Urbana-Champaign and has previously held roles in software development and product management.

About Sumit Bhattacharya
Sumit Bhattacharya is a senior engineering manager at NVIDIA, working on AI blueprints and conversational AI. His primary area of focus is building scalable, low-latency solutions for Enterprise RAG, data flywheels, and voice agents. He also has extensive experience working on NLP, dialog systems, and voice assistants. He holds a master’s degree in Electrical Engineering from the Indian Institute of Technology, Kharagpur, and has over 18 years of industry experience.

About Punit Kumar
Punit Kumar is a senior system software engineer at NVIDIA with a focus on the RAG Blueprint, production RAG systems, and features that improve accuracy and performance. Punit holds a master’s degree in Data Science and Engineering from BITS Pilani and a BTech in Computer Science from SKIT Jaipur and has previously held roles in R&D in AI engineering and in data engineering.

About Pranjal Doshi
Pranjal Doshi is a software engineer at NVIDIA, specializing in retrieval-augmented generation (RAG) and the productionization of large language models. Pranjal holds a master’s degree in Computer Science and Engineering from the Indian Institute of Technology (IIT) Kharagpur and focuses on bridging the gap between AI research and scalable, real-world applications.

About Nikhil Kulkarni
Nikhil Kulkarni is a software engineer at NVIDIA specializing in the productization of the RAG Blueprint, with an emphasis on accuracy improvements, performance optimizations, and deployment. Nikhil holds a bachelor’s degree in Computer Science and focuses on translating AI models into robust, enterprise-grade architectures. He has previously worked on building speech-based AI agents at NVIDIA.
OpenAI and Amazon announce strategic partnership openai 27.02.2026 05:30 0.783
Embedding sim.: 0.9138
Entity overlap: 0.3636
Title sim.: 0.375
Time proximity: 0.4286
NLP type: partnership
NLP organization: OpenAI
NLP topic: ai infrastructure
NLP country:


OpenAI and Amazon announce a strategic partnership bringing OpenAI’s Frontier platform to AWS, expanding AI infrastructure, custom models, and enterprise AI agents.
Joint Statement from OpenAI and Microsoft openai 27.02.2026 05:30 0.756
Embedding sim.: 0.8662
Entity overlap: 0.0667
Title sim.: 0.1918
Time proximity: 1
NLP type: other
NLP organization: Microsoft
NLP topic: artificial intelligence
NLP country:


Microsoft and OpenAI continue to work closely across research, engineering, and product development, building on years of deep collaboration and shared success.
Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog nvidia_dev_blog 18.02.2026 18:00 0.737
Embedding sim.: 0.8376
Entity overlap: 0.0833
Title sim.: 0.2993
Time proximity: 0.8601
NLP type: benchmarking
NLP organization: NVIDIA
NLP topic: large language models
NLP country:


As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment: cloud, NCP, and on-premises.

This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale.

All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

- 77% of full-GPU throughput and 86% of full-GPU concurrent user capacity using only a 0.5 GPU fraction, with time to first token (TTFT) under one second
- Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions
- Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs
- Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact
- Production-ready autoscaling with no latency cliffs or error spikes during scale-out

This benchmarking shows that fractional GPU scheduling is no longer an optimization technique. It is a foundational capability for running large-scale, multimodel LLM inference efficiently in production.

LLM inference enterprise challenges

Enterprise IT departments operate with a finite, often fixed inventory of GPUs.
Deploying an LLM for inference requires a dedicated GPU (or multiple GPUs) to be allocated to a single LLM instance, even during sporadic traffic. This is necessary because the model must load all of its weights in advance of an inference request, so that the latency for generating tokens (responses) is as low as possible. As a result, most LLMs consume all of the GPUs allocated to them, and it becomes difficult to run more than one model on the same pool of GPUs. In this scenario, enterprise IT must manually maintain the GPU-to-LLM allocation, figure out when and how to scale LLMs as the number of users requesting inference grows in order to maintain latency between chat requests and generated tokens, and cannot repurpose idle GPUs during off-peak hours.

Ideally, enterprises want an elastic environment where GPUs can be used to run multiple LLMs, not just one, without significantly impacting the number of users who can run inference or the latency for those users. They want to scale GPUs up based on workloads and scale them down during off-peak hours, so that other workloads can consume the same GPUs.

Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud

The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters and dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a flexible, production-ready framework for maximizing GPU ROI. In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.

Dynamic GPU fractioning

NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads simultaneously.
Users specify their memory requirements directly, and the scheduler allocates resources on demand without any preconfiguration. This is particularly impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation. Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.

Intelligent workload scheduling

The NVIDIA Run:ai scheduler acts as the “brain” of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met. The scheduler also automatically scales LLMs up or down based on the number of concurrent users running inference and on token latency, according to the SLA criteria given by the admin. These strategies collectively drive higher utilization rates, lower operational complexity, and reduce total cost of ownership (TCO).

Teams at NVIDIA and Nebius ran benchmarks to measure the impact NVIDIA Run:ai has on running inference at scale for various LLMs. Scale tests measured the number of concurrent users that can run various chat requests while recording TTFT, output throughput (tokens/second generated), and GPU utilization. At NVIDIA, these tests were run on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architectures with NVIDIA H100 NVL GPUs. At Nebius AI Cloud, the tests were run on a cluster built following the HGX-based Enterprise RA for NVIDIA HGX B200 GPUs.

Benchmarking setup

The software stack is based on NVIDIA Enterprise RAs (Figure 1).
This includes the NVIDIA AI Enterprise stack to manage GPUs using NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed in a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA.

Infrastructure

Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

Figure 1. NVIDIA Run:ai deployment on NVIDIA Enterprise Reference Architecture

Model selection

The four models selected span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with different memory footprints.

Model                  Parameters  Memory requirements  Use case
Llama 3.1 8B Instruct  8B          ~16 GB               General-purpose chat
Phi-4-Mini             3.8B        ~8 GB                Lightweight assistant
Qwen3-14B              14B         ~28 GB               Reasoning
Qwen-Embeddings-0.6B   0.6B        ~1.5 GB              Document embedding and reranking

Table 1. Models selected span diverse sizes, memory requirements, and use cases

Notably, the largest model (Qwen3-14B) occupies only ~35% of one NVIDIA H100 NVL GPU 80 GB capacity, illustrating why traditional whole-GPU allocation might leave so much capacity stranded.

Methodology

GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load.
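As a rough illustration of what per-session latency and throughput mean in practice, the sketch below derives both from timestamps for a single streamed response. This is not how GenAI Perf is implemented; `session_metrics` and `SessionMetrics` are hypothetical names, and defining throughput as tokens divided by the time from request submission to the last token is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SessionMetrics:
    ttft_ms: float                 # time to first token, in milliseconds
    output_tokens_per_sec: float   # tokens / (last token time - request time)


def session_metrics(request_sent_s: float,
                    token_times_s: Sequence[float]) -> SessionMetrics:
    """Compute TTFT and output throughput for one session.

    request_sent_s: timestamp (seconds) when the request was submitted.
    token_times_s:  arrival timestamps (seconds) of each output token, in order.
    """
    if not token_times_s:
        raise ValueError("no tokens were received for this session")
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    elapsed_s = token_times_s[-1] - request_sent_s
    if elapsed_s <= 0:
        raise ValueError("token timestamps must be later than the request time")
    return SessionMetrics(ttft_ms, len(token_times_s) / elapsed_s)


# Illustrative timestamps only: four tokens arriving over one second.
m = session_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
print(f"TTFT: {m.ttft_ms:.0f} ms, throughput: {m.output_tokens_per_sec:.1f} tokens/sec")
```

A load test would aggregate these per-session values (for example, as a P95 TTFT) across all simulated users at each concurrency level.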
Primary metrics include:

- TTFT: Latency from request submission to first response token
- Output throughput: Tokens generated per second per session
- GPU utilization: Percentage of GPU memory consumed under load
- Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency to fall outside the SLA)

Test conditions

Each model was benchmarked under the following five configurations:

- Baseline: LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)
- Full GPU(s) with NVIDIA Run:ai: 1.0 GPU allocation per model replica
- Fractional 0.5 GPU(s): NVIDIA Run:ai with 0.5 GPU allocation per model replica
- Fractional 0.25 GPU(s): NVIDIA Run:ai with 0.25 GPU allocation per model replica
- Mixed mode: Multiple LLMs co-located on shared GPUs

For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.

Benchmarking results using NVIDIA Run:ai

This section presents observations based on the results captured from GenAI Perf.

Fractional GPU efficiency at half allocation

NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared to native Kubernetes, and fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.

No scheduler overhead

NVIDIA Run:ai introduces no measurable performance penalty compared to native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.

Fractional GPU efficiency

Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users (CCU) with the TTFT for each user staying at or below one second (1,000 ms). That is 86% of the full GPU capacity (10,200 CCU).
This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

Figure 2. Concurrent user scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec, which is 77% of full GPU throughput (198,680 tokens/sec), as shown in Figure 3. All three configurations (without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU) scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale are not artifacts of small deployments.

Figure 3. Output throughput scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Smaller models scale further with quarter-GPU fractions

Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much concurrency and throughput this enables.

Figure 4. Concurrent user scaling (1-32 GPUs) for Phi-4-Mini with TTFT under 1,000 ms on an NVIDIA HGX B200 cluster running on Nebius AI Cloud

On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-4-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.

Figure 5. Throughput at scale for Phi-4 Mini NIM on NVIDIA HGX B200 cluster running on Nebius AI Cloud

Multimodel co-location on fractional GPUs in Nebius AI Cloud

NVIDIA Run:ai supports allocating fractional GPUs dynamically.
In previous tests, the same number of users were run on fractional GPUs. One test loaded two models (Llama 3.1 8B and DeepseekR1-Distill-8B) on fractional 0.5 NVIDIA H100 NVL GPUs using NVIDIA Run:ai, so that a single NVIDIA H100 NVL GPU was running two inference models. Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact increased when the scale reached more than 50% of the GPUs in the cluster. At max scale, the TTFT for the combined users dropped by 3x while the throughput dropped only by 0.4x.

Figure 6. Total number of concurrent users on cluster powered by NVIDIA H100 NVL GPU server running two models on a single GPU

Traditional Kubernetes schedulers don’t support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation without manual capacity planning.

NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

Figure 7. The total system users that ran with multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud more than tripled

Nebius ran a similar test co‑deploying 0.5 GPU Llama 3.1 8B, 0.25 GPU Phi‑4 Mini, and 0.125 GPU Qwen‑Embeddings. The cluster achieved predictable scaling with no cross‑model interference, and combined throughput exceeded 350K TPS at full scale (Figure 8). The total number of concurrent users that can run inference went up by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin‑pack heterogeneous inference workloads without destabilizing latency or utilization.

Figure 8.
Total system throughput while running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud

Autoscaling NIM LLM with NVIDIA Run:ai

NVIDIA Run:ai supports auto-scaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius set up Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service. Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.

Figure 9. Autoscaling results for Llama 3.1 8B on NVIDIA HGX B200 in Nebius AI Cloud

Get started with GPU fractioning in NVIDIA Run:ai

NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:

- GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets
- Near‑linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact
- Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes
- Production‑ready autoscaling for fractional LLM inference, with no SLA cliffs during scale‑out
- More workloads per GPU, higher concurrency, and reduced fleet size

For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius. Get started with the latest version of NVIDIA Run:ai v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234].
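The headline efficiency percentages can be checked directly against the raw 64-GPU figures reported in the benchmark results above. The short sketch below does that arithmetic; the helper name `percent_of` is illustrative only, and the input numbers are the concurrent-user and tokens/sec values quoted in the post.

```python
def percent_of(value: float, baseline: float) -> float:
    """Return `value` expressed as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * value / baseline


# Figures quoted in the post for Llama 3.1 8B Instruct at 64 GPUs.
half_gpu_users = percent_of(8_768, 10_200)           # 0.5 GPU vs full GPU, concurrent users
half_gpu_throughput = percent_of(152_694, 198_680)   # 0.5 GPU vs full GPU, tokens/sec
runai_vs_native = percent_of(10_200, 9_934)          # Run:ai full GPU vs native scheduler

print(f"Concurrent users at 0.5 GPU: {half_gpu_users:.1f}% of full GPU")
print(f"Throughput at 0.5 GPU:       {half_gpu_throughput:.1f}% of full GPU")
print(f"Run:ai vs native scheduler:  {runai_vs_native:.1f}%")
```

Rounded to whole percentages, the first two ratios match the 86% and 77% stated in the post, and the third is above 100%, which is consistent with the claim that the scheduler adds no measurable overhead.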
About the Authors

About Boskey Savla
Boskey Savla is a product manager at NVIDIA focusing on defining benchmarks and architectures for LLMs and agentic flows for enterprise customers. She has 18 years of experience in systems and operations. She started as a Linux sys admin and moved on to solution engineering and system engineering roles focusing on virtual, PaaS, cloud, and Kubernetes-based solutions. She is the author of the book 'Kubernetes on vSphere for Dummies' and has spoken and conducted workshops at various events like VMworld, AWS Re:Invent, and Kubecon.

About Ekin Karabulut
Ekin Karabulut is a data scientist and developer advocate previously at Run:ai, now at NVIDIA, exploring the efficient usage of large models in different production scenarios. Previously she worked on privacy implications of federated learning, focused on distributed training techniques and got fascinated by inefficiencies in GPU usage in research and industry settings. She established the AI Infrastructure Club and is based in Munich, Germany.

About Roman Iurkov
Roman Iurkov is a cloud solutions architect at Nebius, working closely with customers to design, onboard, and optimize a wide range of AI/ML use cases and data-intensive workloads. His frontier role centers on understanding customer requirements and translating them into scalable, reliable, and cost-efficient solutions, with a strong focus on the strategic partnership with NVIDIA and driving adoption of NVIDIA Run:ai and DGX Lepton. Bringing over a decade of experience in large enterprise environments, Roman helps customers confidently and smoothly transition to modern cloud platforms.
OpenAI announces Frontier Alliance Partners openai 23.02.2026 05:30 0.729
Embedding sim.: 0.8817
Entity overlap: 0.4286
Title sim.: 0.1311
Time proximity: 0.378
NLP type: partnership
NLP organization: OpenAI
NLP topic: enterprise ai
NLP country:

Open original

OpenAI announces Frontier Alliance Partners to help enterprises move from AI pilots to production with secure, scalable agent deployments.
Introducing OpenAI for India openai 18.02.2026 21:00 0.721
Embedding sim.: 0.8401
Entity overlap: 0.0278
Title sim.: 0.1235
Time proximity: 0.9375
NLP type: other
NLP organization: OpenAI
NLP topic: ai adoption
NLP country: India

Open original

OpenAI for India expands AI access across the country—building local infrastructure, powering enterprises, and advancing workforce skills.
5 New Digital Twin Products Developers Can Use to Build 6G Networks | NVIDIA Technical Blog nvidia_dev_blog 01.03.2026 07:00 0.721
Embedding sim.: 0.8084
Entity overlap: 0.0476
Title sim.: 0.2656
Time proximity: 1
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

To make 6G a reality, the telecom industry must overcome a fundamental challenge: how to design, train, and validate AI-native networks that are too complex to be tested in the physical world. The NVIDIA Aerial Omniverse Digital Twin (AODT) solves this by enabling a continuous integration/continuous delivery (CI/CD)-style workflow where Radio Access Network (RAN) software is trained, simulated, and validated in a physics-accurate environment before field deployment. As discussed in a recent post, this approach bridges the gap between statistical models and real-world network performance.

But the usability of any technology is as important as the technology itself. That's why NVIDIA designed AODT not just as a powerful simulation platform, but with a modular and accessible architecture that partners and developers can easily integrate into their own workflows. Within two years of its launch, AODT's modular architecture is growing an ecosystem of commercial partner products, making high-fidelity simulation accessible from desktops to the cloud.

This blog post spotlights five NVIDIA partners using the modular AODT platform to build commercial solutions. From RAN digital twins and cloud-scale channel simulations to high-fidelity network planning, these solutions provide a unified foundation to plan, build, and test AI-native 6G networks.

The role of AODT in accelerating network innovation

Part of the NVIDIA AI Aerial platform, AODT provides the physics-accurate simulation engine required to train and fine-tune AI models across the RAN, with unprecedented scale, fidelity, and accuracy. Designed to be modular, AODT enables developers to integrate or customize components based on specific use cases and development needs. Developers can start with built-in NVIDIA models for rapid prototyping or plug in their own, such as proprietary propagation engines, RAN digital twins, and user equipment (UE) digital twins, to create a full-network digital twin environment.
Figure 1. Modular architecture of AODT The following are five NVIDIA partners using the modular AODT platform to build commercial solutions. Nokia RAN Digital Twin Nokia’s new RAN Digital Twin —integrated with AODT—combines Nokia’s advanced RAN algorithms with the NVIDIA physics-based simulation engine. The AODT engine uses accelerated ray tracing to model how radio waves interact with real-world materials and environments like glass, concrete, trees, or vehicles. Nokia’s Digital Twin Core analyzes network performance at the product level for base stations and user equipment. This modular approach enables operators to optimize site placement, refine beamforming strategies, and validate algorithms before hardware deployment in the physical world. Figure 2. Representation of Nokia RAN Digital Twin integration with AODT Keysight Technologies Keysight ’s Channel Studio RaySim solution, powered by AODT, transforms traditional stochastic and semi-deterministic channel modeling into site-specific, fully deterministic channel modeling required for 6G and AI-RAN development. RaySim delivers precise, 6G-ready ray-tracing channel models at speed and scale, enabling researchers to explore new waveforms, test mobility scenarios, and evaluate complex propagation environments in photorealistic digital worlds.   Building on RaySim, Keysight’s AI‑RAN Simulation Toolset enables developers to integrate the NVIDIA AI Aerial platform to create hardware testbeds and digital twins that facilitate training and benchmarking of AI-RAN workloads in an integrated, end-to-end workflow. Figure 3. Representation of Keysight’s RaySim integration with AODT VIAVI Solutions TeraVM AI RSG VIAVI’s TeraVM AI RAN Scenario Generator (AI RSG) , fully integrated with AODT, gives developers the ability to simulate detailed, physics-grounded RAN behavior. 
Now available on AWS Cloud, AI RSG provides scalable, on-demand access to high-fidelity RAN testing—for teams to parallelize experiments, automate benchmarking, and accelerate AI-RAN validation cycles. Calibration is essential for creating an accurate digital twin tailored to customer-specific networks. AODT is calibrated with field measurements from the VIAVI OneAdvisor 800 Wireless, creating highly accurate digital twin representations of customer cell sites and producing the most valuable datasets for machine learning and AI-driven RAN optimization. Figure 4. Representation of VIAVI’s AI RSG integration with AODT Ansys Perceive EM and Ansys HFSS Ansys, part of Synopsys, is integrating Ansys HFSS and Ansys Perceive EM software with AODT, expanding the capabilities of these tools and enabling full network simulation for users. High-frequency electromagnetic simulation software (HFSS) provides physics-accurate antenna and array design. Perceive EM radio frequency channel radar signature simulation software extends electromagnetic fidelity to wireless channel modeling in detailed, dynamic, motion-rich environments. AODT scales those models to full network deployments. This workflow forms a continuous electromagnetic chain, from antenna to network, enabling researchers to train and validate AI-RAN and integrated sensing and communications (ISAC) systems with true physical accuracy. Figure 5. Representation of Ansys’s HFSS and Ansys Perceive EM simulation software integration with AODT Amazon Web Services (AWS) With AWS, AODT moves to the cloud, giving researchers and network operators on‑demand access to large‑scale, physics‑accurate network simulation. Running AODT on AWS enables teams to spin up virtual test environments that replicate city‑scale networks, experiment with new RAN topologies, and analyze performance under dynamic, real‑world conditions—all without maintaining dedicated on‑premises infrastructure. 
AWS has leveraged the NVIDIA three-computer “Train → Simulate → Deploy” system to enable AI-native networks through cloud-scale intelligence. In the Train phase, Amazon Bedrock and Amazon SageMaker train domain-specific LLMs on RAN data, including R1 interface telemetry and configuration procedures, enabling models to understand and reason over RAN control signaling, resource management, and protocol-level behaviors. In the Simulate phase, NVIDIA AODT validates implementations across physics-accurate scenarios in parallel, compressing validation timelines from months to days. In the Deploy phase, agentic applications enable coverage optimization and intelligent energy savings improvements. Central to this phase is the recursive data foundation: production outputs feed back into the training loop, enabling the model to improve continuously over time.

The future of 6G starts with simulation

With NVIDIA Aerial Omniverse Digital Twin and its expanding ecosystem of partners, the telecom industry has a unified, physics-accurate foundation for creating, validating, and accelerating AI-native wireless systems. As the industry advances toward autonomous networks, simulation becomes essential: intelligent network agents, powered by AI, need trusted virtual environments to test and validate their recommendations before acting in live networks. Digital twins bridge that gap, closing the loop between training and deployment and enabling networks to self-learn, self-heal, and self-optimize in real time.

Explore AODT partner solutions to kickstart your 6G research and development, and join the NVIDIA 6G Developer Program to collaborate with us in building the intelligent networks of the future.
About the Authors

About Cindy Goh
Cindy leads Product Marketing for AI-RAN and 6G at NVIDIA, driving the strategy for the next generation of intelligent wireless networks. With a deep foundation in semiconductor innovation, she previously led Technical Marketing and IP Product Management at Intel and Altera. Cindy holds both an M.S. and B.S. in Electrical Engineering from the University of Southern California.

About CC Chong
CC Chong is the senior director and head of Aerial product management at NVIDIA. Before joining NVIDIA, she was most recently senior director and GM of wireless and access business unit in the Intel Programmable Solutions Group. Chong received her Ph.D. in electronics and electrical engineering from the University of Edinburgh in Scotland and her bachelor's in electronics and electrical engineering from the University of Manchester. She was a recipient of the Ten Outstanding Young Malaysian Awards under the category “Scientific and Technological Development” in 2006.
Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization | NVIDIA Technical Blog nvidia_dev_blog 19.02.2026 17:30 0.703
Embedding sim.: 0.7946
Entity overlap: 0.1071
Title sim.: 0.2741
Time proximity: 0.8542
NLP type: other
NLP organization: nvidia
NLP topic: ai hardware
NLP country:

Open original

NVIDIA flagship data center GPUs in the NVIDIA Ampere , NVIDIA Hopper , and NVIDIA Blackwell families all feature non-uniform memory access (NUMA) behaviors, but expose a single memory space. Most programs therefore do not have an issue with memory non-uniformity. However, as bandwidth increases in newer generation GPUs, there are significant performance and power gains to be had when taking into consideration compute and data locality. This post first analyzes the memory hierarchy of the NVIDIA GPUs, discussing the power and performance impacts of data transfer over die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it presents results for running MIG mode versus unlocalized for the Wilson-Dslash stencil operator use case. Note: The techniques described in this post are exploratory, and the field is evolving quickly. New developments may supersede what is described here. Expect additional publications as updated capabilities and practices become available. The term ‘NUMA node’ is used in the post to describe GPU-internal memory locality exposed through MIG. This does not imply full conventional NUMA capabilities. Memory hierarchy in NVIDIA GPUs Consider the abstract view of the memory hierarchy with two NUMA nodes depicted in Figure 1. When a streaming multiprocessor (SM) on node 0 needs to access a memory location in the dynamic random-access memory (DRAM) of node 1, it must transfer data over the L2 fabric. In the case of NVIDIA Blackwell GPUs, each NUMA node is a distinct physical die, which adds latency and increases the power required for data transfer. Despite the added complexity, NUMA-unaware code can still achieve peak DRAM bandwidth. Figure 1. Abstract view of the GPU memory hierarchy across two NUMA nodes To address these drawbacks, it is beneficial to minimize data transfers between NUMA nodes. 
When a single memory space is presented to the user, NVIDIA architecture employs coherent caching in L2 to reduce data transfers between NUMA nodes. This mechanism helps prevent repeated accesses to the same memory address from refetching data over the L2 fabric interface. Ideally, once the address is fetched into the local L2 cache, all subsequent accesses to the same address will hit the cache. Before the introduction of coherent caching, the unified L2 cache allowed all SMs to achieve peak bandwidth (as in NVIDIA Volta ), though latency varied depending on the proximity of the SM to different L2 segments. With the NVIDIA Ampere generation, larger chips introduced a hierarchy of NUMA nodes, each with its own L2 cache and a coherent connection to others. While large data center GPUs since NVIDIA Ampere architecture have used this design (unlike smaller gaming GPUs), the L2 fabric connection sustains peak bandwidth as mentioned in NVIDIA Blackwell Ultra architecture. Two challenges have emerged as GPUs continue to grow: increased latency and power limitations. Increased latency: Accessing distant parts of the L2 cache has led to growing latency, which impacts performance, particularly for synchronization. Power limitations: On the largest GPUs, power consumption becomes a limiting factor when tensor cores are active. Reducing power consumption through localized L2 access enables decreasing the L2 fabric clock and raising the compute clock through a Dynamic Voltage and Frequency Scaling (DVFS) mechanism associated with GPU Boost . In this way, tensor core performance can be significantly improved. MIG reduces data transfers between NUMA nodes. Introduced with the NVIDIA Ampere architecture, this feature enables partitioning a single GPU into multiple instances. By using MIG, developers can create one GPU instance per NUMA node, thereby eliminating accesses over the L2 fabric interface. 
This approach does come with its own set of costs, including the overhead of communicating between different GPU instances using PCIe. The following section presents results from running workloads using MIG mode and unlocalized memory to demonstrate the effectiveness of this approach. Data localization using MIG MIG enables supported NVIDIA GPUs to be partitioned into multiple isolated instances, each with dedicated high-bandwidth memory, cache, and compute cores. This enables efficient and high-performance GPU utilization across multiple users or workloads. MIG can achieve up to 7x more GPU resources on a single GPU. It allows multiple virtual GPUs (vGPUs) and, consequently, virtual machines (VMs) to run in parallel on a single GPU, while providing the isolation guarantees that vGPUs offer. The capabilities provided by MIG can be leveraged to achieve NUMA node localization. By creating one MIG instance per NUMA node, you can ensure isolation between different GPU instances. This approach helps eliminate traffic between NUMA nodes. MIG allows the splitting of the actual GPU into GPU instances (GI), in which one or more compute instances (CIs) are defined. A CI contains all (in the case of a single CI per GI) or a portion of the SMs belonging to a GI. To enable localization within a GI, the idea is to create two GPU instances mapped onto each NUMA node. On a Blackwell GPU, you can enable MIG mode and list the available GPU instance profiles, as shown with the code in Figure 2. Because Blackwell has two NUMA nodes (one per chiplet), look for the profile with the most SMs of which there are two instances. As shown in Figure 2, this is the profile with ID 9, of which there can be two instances. At this point, it’s necessary to create a CI in each GPU instance. This can be done using the commands shown in Figure 3. The main GPU and the GPU instances now have their own identifier hash codes. 
Use those for the two NUMA nodes:

    MIG 3g.90gb Device 0: (UUID: MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa)
    MIG 3g.90gb Device 1: (UUID: MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197)

To use these devices, add them to the CUDA_VISIBLE_DEVICES environment variable. For example, to run a two-process MPI job, you could create a wrapper script (wrapper.sh):

    #!/bin/bash
    # Select one MIG instance per MPI rank. SLURM_PROCID is set by Slurm;
    # other launchers expose the rank under a different variable.
    case $SLURM_PROCID in
      0) export CUDA_VISIBLE_DEVICES="MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa" ;;
      1) export CUDA_VISIBLE_DEVICES="MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197" ;;
    esac
    # Run the wrapped command with its arguments preserved.
    exec "$@"

Then launch the MPI jobs:

    $ mpirun -n 2 ./wrapper.sh my_executable

Finally, when all the work is done, the MIG mode can be turned off.

Figure 2. Enabling MIG mode and listing the available GPU instance profiles

Figure 3. Creating compute instances for MIG

Figure 4. Commands for turning off MIG instances

What are the benefits of localization with MIG?

As an example application to demonstrate the benefits of localization with MIG, examine the Wilson-Dslash stencil operator, a key kernel for lattice quantum chromodynamics (LQCD) drawn from the QUDA library. This library is used to accelerate several large LQCD codes, such as Chroma and MILC. The Dslash kernel is a finite difference operation on a 4D toroidal lattice, where data at each lattice site is updated depending on the values of its eight orthogonal neighbors. The four dimensions in this case are the usual spatial dimensions (X, Y, Z) and the time dimension (T). The kernel is memory bandwidth-bound.

If the lattice is decomposed onto two NUMA nodes equally, say along the time axis, then each domain will need to access sites on the T-dimension boundaries of the other domain. As shown in Figure 5, green lattice sites on the boundaries of the subdomains need the red sites to complete their stencils. The lattice is notionally laid out onto the two NUMA nodes. Green sites need red sites to complete their stencils.
Possible data paths are regular memory access (black arrows) when unlocalized, or MPI message passing through the host in MIG localized mode (black arrows).

Figure 5. Memory accesses for Dslash kernel with MIG mode

The most convenient way to access neighbors would be through the shared L2 cache and the interconnect. However, when operating in MIG mode this path requires communication between the MIG instances through MPI using PCIe or NVLink. As a result, this path will be slower compared to accessing the main memory attached to the MIG instance. In MIG mode, one instead packs the black sites on the boundaries and sends them through MPI. This step introduces additional latency (buffer packing, sending, and unpacking). While it saves GPU power by not using the shared L2 cache-to-cache interconnect, it does use power for its transfer through the host (for PCIe, for example). Workloads that require little to no communication between the two MIG instances will therefore tend to benefit more from MIG mode.

The amount of data that needs to be transferred between the two processes is related to the number of face sites to be transmitted in the messages, specifically to the surface three-volume orthogonal to the direction of the split. For this example, the split is always in the T-direction, so that each NUMA node notionally ends up with $(N_s N_t)/2$ sites, where $N_s$ is the number of sites in the spatial volume and $N_t$ is the length of the time dimension. The surface-to-volume ratio is $N_s / (N_s N_t / 2) = 2 / N_t$. For the problems considered here, $N_t = 64$, so the surface-to-volume ratio stays constant at $1/32 \approx 3.13\%$.

Figure 6 shows the unlocalized case. The global memory is made up of two memories connected to the NUMA nodes through memory controllers. The colored highlights on the lattices indicate that data may come from either the local DRAM or from the remote DRAM through the shared L2.

Figure 6. Memory accesses for the unlocalized case

This is to be compared with the baseline case, where MIG is not employed. Neither the data nor the processing is localized in this case, and the scenario is better represented by Figure 6. Each NUMA node receives its data both from its local memory controller and from the other NUMA node. In fact, there is only one global lattice, and the separation into two parts for the two NUMA nodes in the figure is artificial. In this scenario, thread blocks that process a collection of sites are assigned to the various NUMA nodes purely at the whim of the scheduler. Since the data is distributed evenly over the two NUMA memories, much more data is transferred across the shared L2 than in the case of MIG localization, where only the minimally required surface sites are transferred. This can incur a significant power cost. On the other hand, the entire operation may be carried out with a single kernel, and the latencies incurred by packing buffers for message passing and accumulating the received faces at the end can be avoided.

For the experimental results, look at the speedup in workload execution with various GPU power limits in watts. The speedups are the ratios of the wallclock times taken by the unlocalized and MIG approaches running at identical power limits (for example, both at 700 W). As shown in Figure 7, at a GPU power limit of 400 W, MIG outperforms the unlocalized case with speedups of up to 2.25x depending on the volume of the workload. The reason is that the power consumed by the L2 fabric interface becomes a limiting factor when the GPU is running at a low power limit. With MIG mode, since no L2 fabric power is consumed to transfer data between NUMA nodes, workloads can run much faster. However, when the GPU power limit is increased, MIG mode performs slightly worse in the case of the experiments represented by the grey, dark green, and black lines in Figure 7, and part of the green. This is because at higher power limits, the extra latency introduced by the message passing can outweigh the benefits of the localization.

Figure 7. Running MIG-based NUMA localization on different workload sizes

As it turns out, the smaller cases (especially those indicated by black and dark green lines in Figure 7) never exhaust available power at higher power limits even in the unlocalized case. As such, they benefit little from the GPU power saving won by localization, and at these smaller volumes the latencies due to kernel launch are much more noticeable. The larger volumes (the green, for example) require more power and hence can gain an advantage over the unlocalized setup even at higher power limits.

Get started with MIG-based NUMA node localization

Local L2 caching in NVIDIA data center GPUs can impact performance in NUMA-unaware workloads. Our experiments using the Wilson-Dslash operator in MIG mode show that when the GPU is running at lower power limits and data transfer over MPI (PCIe/NVLink) is low relative to local memory accesses, MIG-based NUMA node localization can yield speedups of up to 2.25x compared to the unlocalized case at the same power limit.

While systems running at a higher 1,000 W power envelope may achieve greater absolute performance than a 400 W configuration, MIG-based localization provides clear advantages under power-constrained conditions. In lower-power scenarios, it enables significantly faster performance, making it an especially effective optimization when operating within strict power limits.

However, in general, MIG does not offer the flexibility required to consistently achieve effective data localization, especially as interprocess communication overhead becomes more pronounced at higher power limits. MIG is only supported for use cases that are too small to make full use of a GPU. For this reason, it is not recommended for the cases presented in this post.
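The surface-to-volume estimate for a lattice split in two along the T-direction can be checked numerically. The sketch below is only a worked example of the ratio of face sites to local sites stated in the text. The function name and the spatial extents are hypothetical, and exact fractions are used so the cancellation of the spatial volume is visible.

```python
from fractions import Fraction

def surface_to_volume(n_s: int, n_t: int) -> Fraction:
    """Ratio of face sites to local sites for a two-way split along T.

    n_s: number of sites in the spatial volume (X * Y * Z)
    n_t: length of the time dimension
    """
    local_volume = Fraction(n_s * n_t, 2)  # sites held by each NUMA node
    face_sites = Fraction(n_s)             # one 3D face orthogonal to T
    return face_sites / local_volume       # simplifies to 2 / n_t

# Hypothetical spatial extents; only n_t = 64 comes from the text.
ratio_small = surface_to_volume(n_s=16**3, n_t=64)
ratio_large = surface_to_volume(n_s=48**3, n_t=64)

print(ratio_small, float(ratio_small))
print(ratio_small == ratio_large)
```

Because the spatial volume cancels, the ratio depends only on the time extent, which is why it stays fixed across workload sizes when the time extent is held at 64.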
To address these limitations, alternative approaches are under investigation. To learn more, see Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS.

About the Authors

About Mukul Joshi
Mukul Joshi is a memory architect with the NVIDIA GPU design team with over a decade of experience in memory architecture. He has worked on both GPU and CPU architectures across a wide range of products ranging from low-power smartphone SoCs to high-performance server processors. He holds a master's degree from Georgia Tech in Electrical and Computer Engineering.

About Balint Joo
Balint Joo is a developer technology engineer at NVIDIA working with high-performance computing workloads. His main area of expertise is Lattice Quantum Chromodynamics (QCD), which he has pursued for over 20 years prior to joining NVIDIA. He holds a PhD in Theoretical Physics from the University of Edinburgh in Edinburgh, Scotland.

About Zachary Susskind
Zachary Susskind is a research scientist in the Architecture Research Group at NVIDIA. His interests include the exploration of non-uniform memory access architectures and algorithm-hardware co-design for energy-efficient machine learning. He holds a PhD in Electrical and Computer Engineering from The University of Texas at Austin.

About Allard Hendriksen
Allard Hendriksen is a developer technology engineer at NVIDIA with an expertise in modern GPU memory subsystems. He has worked on projects accelerating key workloads for customers, is the lead developer of cuda::ptx, and has presented on optimizing bandwidth and minimizing latency at GTC and other venues. He holds a PhD in applied mathematics from Leiden University and the CWI in Amsterdam.

About Kate Clark
Kate Clark works at the interface of applications, algorithms, and parallel computation. Before joining NVIDIA, she was a researcher in radio astronomy signal-processing algorithms at Harvard University and a postdoc at Boston University in Massachusetts, with a focus on multi-grid solver algorithms. She received her PhD from the University of Edinburgh, Scotland, where her doctoral research focused on Monte Carlo algorithms for Lattice QCD.
Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical Blog nvidia_dev_blog 27.02.2026 17:00 0.696
Embedding sim.: 0.7943
Entity overlap: 0.0952
Title sim.: 0.3173
Time proximity: 0.7143
NLP type: other
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

Organizations deploying LLMs are challenged by inference workloads with different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often leads to low average GPU utilization, high compute costs, and unpredictable latency.

The problem isn’t just about packing more workloads onto GPUs but about scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance).

This blog post covers:
- The inference utilization problem: Why traditional scheduling underutilizes GPU resources.
- How NVIDIA NIM delivers production inference: The role of containerized microservices in standardizing model deployment.
- NVIDIA Run:ai’s intelligent scheduling strategies: Four key capabilities that enhance performance (lower latency, higher TPS/GPU) while increasing GPU utilization and reducing compute costs.
- Benchmarking results: ~2x GPU utilization improvement with minimal throughput loss, up to ~1.4x higher throughput under heavy concurrency with dynamic fractions, and 44-61x faster first-request latency with GPU memory swap.
- How to get started: Practical guidance for implementing these strategies with NIM on NVIDIA Run:ai.

The inference utilization problem

GPU utilization determines how many workloads can be run on a given cluster, and at what cost. In practice, most inference deployments leave significant GPU capacity idle, either because each model is assigned a full GPU “just to be safe” or because naive sharing without memory isolation causes out-of-memory (OOM) conditions and latency spikes under traffic. Without intelligent orchestration, teams are forced to choose between overprovisioning (waste) and underprovisioning (performance risk).
How NVIDIA NIM delivers production inference

NVIDIA NIM packages optimized inference engines as containerized microservices with:
- Packaged inference engines: Inference runtimes pre-configured for improved throughput/latency.
- Industry-standard APIs: OpenAI-compatible endpoints for integration.
- Model optimization: Automatic selection of quantization, batching, and acceleration techniques.
- Production-ready containers: Pre-built with dependencies, tested at scale.
- Security and compliance: Enterprise-grade security controls and container signing for deployments.
- Enterprise support: NVIDIA support and maintenance for production deployments.

NIM standardizes the deployment layer, but maximizing GPU utilization requires intelligent orchestration. This is where NVIDIA Run:ai’s scheduling capabilities become essential.

How NVIDIA Run:ai unlocks efficient resource management for NVIDIA NIM

Inference utilization is more than just scheduling; it’s about adapting to how workloads behave. With NVIDIA Run:ai, NIM deployments get inference-first prioritization, GPU fractions with full memory isolation, smarter placement based on workload needs, dynamic memory management, and autoscaling (including replica scaling and scale-to-zero). This enables deployments to follow traffic and give back GPUs when models are idle.

Inference priority protects user-facing workloads

NVIDIA Run:ai automatically assigns inference workloads the highest default priority, ensuring training jobs never preempt them. Why this matters:
- Inference serves users: Latency spikes and downtime impact the user experience and SLA compliance.
- Training can tolerate interruption: Model training can checkpoint and resume; inference requests cannot wait.

This automatic priority assignment eliminates manual tuning in most environments. For organizations running mixed workloads, this ensures training jobs flex around inference demands rather than competing with them.
GPUs can train when inference load is low, automatically yielding resources when user-facing requests arrive.

GPU fractions with bin packing for multiple small models on a GPU

Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need an entire GPU. When used with GPU fractions, NVIDIA Run:ai’s bin packing strategy fills GPUs before allocating new ones, maximizing utilization across the cluster.

How GPU fractions with bin packing work:
- GPU fractions provide true memory isolation (not soft limits). Each model gets a guaranteed memory allocation.
- Bin packing scores GPUs by current utilization and prioritizes filling partially used GPUs before allocating fresh ones.
- The scheduler prioritizes partially used GPUs for new workloads.

Benchmarking results: The approach was tested by simulating a scenario with three NIM models (a 7B LLM, a 12B VLM, and a 30B MoE) on NVIDIA H100 GPUs:
- Scenario A: Three GPUs with one H100 GPU per NIM (baseline).
- Scenario B: Three NIM microservices on 1.5 H100 GPUs using NVIDIA Run:ai fractions, keeping NIM configurations and client load patterns constant.

Figure 1. Three NIM microservices consolidated from three dedicated H100 GPUs to ~1.5 H100 GPUs using GPU fractions and bin packing, retaining 91–100% of baseline throughput

Exercising short and long-context prompts, the key findings include:
- Each NIM retained about 91–100% of its single-GPU throughput, with modest increases in time-to-first-token (TTFT) and end-to-end (E2E) latency.
- Mistral-7B matched its dedicated-GPU throughput at 834 token/s with long-context input (100%).
- Nemotron-3-Nano-30B retained 95% (582 vs. 614 token/s).
- Nemotron-Nano-12B-v2-VL retained 91% (658 vs. 723 token/s) at short-context input.
- Three NIM microservices that previously required three dedicated H100s were consolidated onto ≈1.5 H100s, freeing the remaining capacity for other workloads.
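The bin packing idea described above can be illustrated with a minimal sketch. This is not NVIDIA Run:ai's scheduler: the `Gpu` class, the `place` function, and the three fractions are hypothetical and chosen only to show how preferring the most-utilized GPU that still has room keeps whole GPUs free.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One physical GPU; `used` is the fraction of its memory already reserved."""
    name: str
    used: float = 0.0
    workloads: list = field(default_factory=list)


def place(gpus, workload, fraction, eps=1e-9):
    """Place a workload on the fullest GPU that still has room for it."""
    candidates = [g for g in gpus if g.used + fraction <= 1.0 + eps]
    if not candidates:
        raise RuntimeError(f"no GPU has {fraction:.2f} free for {workload}")
    # Bin packing: prefer the most-utilized GPU so empty GPUs stay free.
    target = max(candidates, key=lambda g: g.used)
    target.used += fraction
    target.workloads.append(workload)
    return target


gpus = [Gpu("gpu-0"), Gpu("gpu-1"), Gpu("gpu-2")]

# Hypothetical fractions for three models of different sizes.
for name, fraction in [("llm-30b", 0.65), ("vlm-12b", 0.40), ("llm-7b", 0.30)]:
    place(gpus, name, fraction)

for g in gpus:
    print(g.name, round(g.used, 2), g.workloads)
```

In this toy run the 7B model is packed next to the 30B model instead of opening the third GPU, which stays completely free for other workloads.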
Dynamic GPU fractions maintain performance under heavy concurrent requests

Static GPU fractions guarantee memory isolation, but they impose a rigid ceiling that creates “stranded capacity”. As concurrent requests increase, each NIM’s KV-cache grows dynamically to track active sequences. When that growth hits the fixed fraction boundary, throughput plateaus and latency degrades. This bottleneck forces a difficult trade-off: over-allocate fractions (wasting GPU capacity) or cap concurrency to stay within the fixed memory budget.

NVIDIA Run:ai’s dynamic GPU fractions solve this by replacing fixed allocations with a request/limit model, borrowing Kubernetes resource semantics for GPU memory:
- Request: The guaranteed minimum fraction, always reserved for the workload.
- Limit: The burstable upper bound, enabling the NIM to expand into available GPU memory on demand when KV-cache or compute pressure increases.

When a NIM operates at its request, the unused headroom between the request and limit remains available to co-located workloads. When concurrent traffic spikes occur, the NIM bursts toward its limit, claiming that memory and converting it into active throughput. This state transition between request and limit is handled automatically. Workloads scale up when they need resources and release them when demand subsides, maximizing total GPU utilization without manual intervention.

Benchmarking results: Using the same three NIM models and 1.5 H100 GPU footprint from Experiment 1, static fractions were replaced with dynamic fractions to measure performance under increasing concurrency:
- Mistral-7B NIM (Request: 0.3, Limit: 0.4)
- Nemotron-Nano-12B-v2-VL NIM (Request: 0.4, Limit: 0.5)
- Nemotron-3-Nano-30B NIM (Request: 0.65, Limit: 0.75)

Scenarios compared:
- Scenario A (static fractions + bin packing): The fixed-fraction deployment from Experiment 1 (see Figure 1), where each NIM has a hard memory ceiling with full isolation.
- Scenario B (dynamic fractions + bin packing): Same bin-packed layout on ≈1.5 H100 GPUs, but each NIM uses a request/limit pair instead of a fixed allocation.

Figure 2. Throughput vs. p50 end-to-end latency for Nemotron-3-Nano-30B on H100 GPUs with 2,048 input tokens

In Figures 2, 3, and 4, as concurrency ramped up, static fractions hit a performance wall: throughput stalled and latency spiked because models couldn’t access additional memory for growing KV caches. With dynamic fractions, NIM microservices absorbed the pressure by bursting toward their limits during traffic peaks and releasing memory back when the load subsided.

Across all three NVIDIA NIM microservices, dynamic fractions delivered up to 1.4x higher throughput and 1.7x lower latency, scaling cleanly with concurrency. For example:
- Nemotron-3-Nano-30B sustained 1,025 token/s at 256 concurrent requests with dynamic fractions, compared to a static-fraction ceiling of 721 token/s at just four concurrent requests before instability (1.4x).
- Mistral-7B-Instruct-v0.3 p50 end-to-end latency dropped from 5,235 ms to 3,098 ms at 64 concurrent 2,048-token requests (1.7x).

The p50 latency curve remains smooth and monotonic rather than spiking or collapsing, confirming that the request/limit headroom accommodates KV-cache growth patterns, improving GPU utilization.

Figure 3. Throughput vs. p50 end-to-end latency for Mistral-7B-Instruct-v0.3 on H100 GPUs with 2,048 input tokens

Key takeaway:
- Static fractions + bin packing: Predictable traffic, low-to-moderate concurrency, models with stable memory footprints.
- Dynamic GPU fractions + bin packing: Variable traffic, high concurrency, models with significant KV-cache growth.

Figure 4. Throughput vs. p50 end-to-end latency for Nemotron-Nano-12B-v2-VL on H100 GPUs with 2,048 input tokens

Dynamic GPU fractions eliminate the performance ceiling of static allocations at high concurrency while maintaining workload density.
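The request/limit behavior described above can be sketched as a small allocation function. This is an illustrative model only, not NVIDIA Run:ai's implementation: the `allocate` function, the first-come order in which headroom is handed out, and the example workloads `small` and `large` are all assumptions.

```python
def allocate(workloads, capacity=1.0):
    """Return the GPU fraction granted to each workload.

    workloads maps name -> (request, limit, demand), all as GPU fractions.
    """
    total_requests = sum(req for req, _, _ in workloads.values())
    if total_requests > capacity + 1e-9:
        raise ValueError("requests oversubscribe the GPU")

    # Requests are always reserved, whether or not they are fully used.
    granted = {name: req for name, (req, _, _) in workloads.items()}
    headroom = capacity - total_requests

    # Burst phase: hand out shared headroom, never beyond a workload's limit.
    # Serving workloads in dict order is an assumption made for simplicity.
    for name, (req, limit, demand) in workloads.items():
        extra_wanted = max(0.0, min(demand, limit) - req)
        extra = min(extra_wanted, headroom)
        granted[name] += extra
        headroom -= extra
    return granted


# Hypothetical pair of co-located workloads: (request, limit, demand).
quiet = allocate({"small": (0.3, 0.4, 0.2), "large": (0.5, 0.7, 0.5)})
spike = allocate({"small": (0.3, 0.4, 0.2), "large": (0.5, 0.7, 0.9)})
print(quiet)
print(spike)
```

In the quiet case both workloads sit at their requests; in the spike case `large` bursts into the shared headroom but stops at its limit, even though its demand is higher.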
With static fractions, the KV-cache cannot grow beyond the fixed memory boundary, and the inference engine begins rejecting requests because it lacks the headroom to admit new sequences. Dynamic GPU fractions solve this: NIM can burst into available headroom on demand, and organizations get both the efficiency of bin packing and the resilience to handle traffic spikes without allocating additional GPUs.

GPU memory swap: Efficiently serving rarely-used models

Organizations serving LLMs face a fundamental trade-off between latency and cost. Scaling an LLM from zero means full container initialization, loading model weights from disk, and allocating GPU memory, a process that can take tens of seconds to minutes. Because this cold-start latency is unacceptable for user-facing applications, most organizations choose to over-provision, keeping multiple replicas always-on with dedicated GPUs even during low-traffic or idle periods. This guarantees low latency but wastes GPU capacity, paying for hardware that sits idle just to avoid the risk of a cold start. Scale-to-zero (the Kubernetes pattern of shutting down idle replicas completely and restarting them on demand) can free the GPUs, but the cold-start penalty makes it impractical for latency-sensitive inference workloads.

How GPU memory swap works: With GPU memory swap, models are kept in CPU memory, and their weights are swapped dynamically between CPU and GPU as requests arrive. Only the active model’s weights reside in GPU memory at any moment. When a request targets an idle model, NVIDIA Run:ai’s GPU memory swap moves the currently loaded model’s weights to CPU RAM and loads the requested model into GPU memory, keeping it warm for a configurable window. The model never leaves memory entirely; it just moves between GPU and CPU, eliminating the need for container restarts, disk I/O, and cold-start initialization. GPU memory swap works across single-GPU, multi-GPU, and fractional GPU workloads.
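The swap-in/swap-out flow described above can be modeled with a toy state machine. This sketch only tracks where each model's weights live; the `SwapManager` class and the model names are hypothetical, it assumes a single GPU slot, and it omits the configurable warm window and any real memory transfers.

```python
class SwapManager:
    """Toy model of GPU memory swap: one GPU slot, all models kept in CPU RAM."""

    def __init__(self, models):
        self.on_cpu = set(models)  # weights parked in host memory
        self.on_gpu = None         # the single model currently resident on GPU
        self.swaps = 0

    def handle_request(self, model):
        if model == self.on_gpu:
            return "warm"          # weights already on the GPU
        if model not in self.on_cpu:
            raise KeyError(f"unknown model: {model}")
        if self.on_gpu is not None:
            self.on_cpu.add(self.on_gpu)  # swap the resident model out to CPU
        self.on_cpu.remove(model)         # swap the requested model in
        self.on_gpu = model
        self.swaps += 1
        return "swapped-in"


manager = SwapManager(["model-a", "model-b"])
results = [manager.handle_request(m)
           for m in ["model-a", "model-a", "model-b", "model-a"]]
print(results, manager.swaps)
```

No request in this model triggers a container start or a disk read: a miss costs one CPU-to-GPU transfer, which is the property the text attributes to GPU memory swap.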
Previous benchmarking with single-GPU deployments showed up to 66x improvements in time to first token (TTFT) compared to scale-from-zero. This benchmark combined GPU memory swap with NIM deployments on fractional GPUs to test whether the same latency benefits hold when models share hardware through bin packing and under memory constraints.

Benchmarking results: First-request latency with GPU memory swap was compared against scale-from-zero for the same three NIM deployments:
- Scenario A (scale-from-zero): Each NIM cold-starts from scratch on a dedicated H100 GPU when traffic arrives (three GPUs in total).
- Scenario B (GPU memory swap): The three NVIDIA NIM microservices share 1.5 H100 GPUs (with the same fractions from previous experiments), with swap-in/swap-out between GPU and CPU memory.

Figure 5. GPU memory swap vs. scale-from-zero TTFT on an H100 GPU with 128-token prompts

Figure 6. GPU memory swap vs. scale-from-zero TTFT on H100 GPUs with longer 2,048-token prompts

With scale-from-zero, infrequently accessed NIM microservices suffer high first-request latency due to full cold starts. With GPU memory swap, first-request latency stays acceptable, and subsequent requests see warm TTFT. All three NIM microservices run on half of the GPUs, freeing up the remaining capacity for high-traffic or other workloads.

At 128-token input, cold-start TTFT ranged from 75.3 s (Mistral-7B) to 92.7 s (Nemotron-3-Nano-30B), while GPU memory swap reduced these to 1.23–1.61 s, a 55–61x improvement. At 2,048-token input, cold-start TTFT of 158.3–180.2 s dropped to 3.52–4.02 s with swap, a consistent ~44x reduction.

Key takeaway: GPU memory swap delivers 44-61x faster TTFT than scale-from-zero while using fewer resources when combined with GPU fractions, eliminating the cold-start penalty for infrequently accessed models, whether deployed on dedicated or fractional GPUs.
Get started with NVIDIA Run:ai and NVIDIA NIM

Check out this guide to get started with deploying NVIDIA NIM as a native inference workload on NVIDIA Run:ai. Watch this webinar to see how teams manage growing AI workloads with intelligent scheduling, fine-grained GPU controls, Kubernetes-native traffic balancing, and autoscaling, while new platform updates improve access control, endpoint management, and visibility.

Tags: Agentic AI / Generative AI | Data Center / Cloud | Developer Tools & Techniques | General | NIM | Run:ai | Advanced Technical | Benchmark | featured | Inference Performance | LLMs

About the Authors

About Shwetha Krishnamurthy
Shwetha Krishnamurthy is a product manager at NVIDIA, where she focuses on building LLM inference products. Before joining NVIDIA, she spent several years as a machine learning engineer and data scientist at Goldman Sachs and Yodlee. Shwetha holds an MBA from the University of Chicago Booth School of Business and a Master’s in Computer Science from the University of Chicago.

About Aditi Bodhankar
Aditi Bodhankar is a developer advocate engineer at NVIDIA who works on developing various deep learning applications, especially those using NVIDIA NeMo. She has worked on conversational AI and NLP since her internship at NVIDIA. Aditi holds a master’s degree from the University of Southern California.

About Ekin Karabulut
Ekin Karabulut is a data scientist and developer advocate previously at Run:ai, now at NVIDIA, exploring the efficient usage of large models in different production scenarios. Previously she worked on the privacy implications of federated learning, focused on distributed training techniques, and became fascinated by inefficiencies in GPU usage in research and industry settings. She established the AI Infrastructure Club and is based in Munich, Germany.

About Julie Adrounie
Julie Adrounie is an AI product marketing manager and technical advocate for Run:ai software at NVIDIA, where she helps enterprises scale large language model workloads and streamline AI inference in production environments. Previously, as a solutions architect, she built and implemented end-to-end AI production platforms that helped data science teams accelerate model development and deployment at scale. She holds a B.S. in Industrial Engineering and lives in Orlando, Florida.
Advancing independent research on AI alignment openai 19.02.2026 10:00 0.692
Embedding sim.: 0.7935
Entity overlap: 0.1
Title sim.: 0.15
Time proximity: 0.9226
NLP type: funding
NLP organization: OpenAI
NLP topic: ai safety
NLP country:

Open original

OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and security risks.
How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models | NVIDIA Technical Blog nvidia_dev_blog 18.02.2026 16:00 0.691
Embedding sim.: 0.7949
Entity overlap: 0.0857
Title sim.: 0.1751
Time proximity: 0.869
NLP type: partnership
NLP organization: Sarvam AI
NLP topic: large language models
NLP country: India

Open original

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency while also maintaining data sovereignty and cost control.

Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two dozen languages, and keep model development and data governance fully under India’s sovereign control.

To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations. This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs, and established a path for deployment on the next-generation NVIDIA Blackwell architecture.

The end-to-end performance boost was achieved through kernel and scheduling optimizations on NVIDIA H100 SXM GPUs that contributed a 2x speedup. That was combined with the powerful compute capabilities of Blackwell, along with NVFP4 weight quantization, for an additional 2x speedup, with an even bigger performance gain of 2.8x seen at higher interactivity points.

NVIDIA engineers helped Sarvam AI build 3B, 30B, and 100B foundational models, and optimize a new family of sovereign foundation models that were trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo Framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code.
They demonstrate how developer teams can leverage NVIDIA’s full-stack AI platform, from data to deployment, to achieve state-of-the-art performance and localized AI capabilities.

This post walks through the joint engineering effort and shares benchmarks for the speed-ups achieved on the NVIDIA H100, the largest-deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.

Making multilingual sovereign AI scalable with MoE

To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at 3B, 30B, and 100B scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NeMo-RL was also used for the post-training workflows of these models, including long-context reasoning.

Sarvam 30B utilizes a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality. Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. Additionally, the 100B model adopts multi-head latent attention (MLA), similar to DeepSeek-V3, to aggressively compress the Key-Value (KV) cache, enabling massive context windows without the memory penalties of standard attention.

Both models feature a shared expert design where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and complex memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
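The routing pattern described above (one shared expert plus top-6 of 128 routed experts) can be sketched in a few lines of NumPy. This is a simplified illustration, not Sarvam AI's implementation: each expert is reduced to a single hypothetical weight matrix, the hidden size of 16 is arbitrary, and normalizing gates with a softmax over the selected logits is an assumption, since the source does not describe the gating function.

```python
import numpy as np


def moe_forward(x, router_w, expert_w, shared_w, top_k):
    """x: (tokens, d); router_w: (d, n_experts); expert_w: (n_experts, d, d)."""
    logits = x @ router_w                               # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]   # best experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Assumed gating: softmax over the selected experts only.
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    out = x @ shared_w                                  # shared expert sees every token
    for t in range(x.shape[0]):
        for slot in range(top_k):
            e = top_idx[t, slot]
            out[t] += gates[t, slot] * (x[t] @ expert_w[e])
    return out, top_idx, gates


rng = np.random.default_rng(0)
tokens, d, n_experts, top_k = 4, 16, 128, 6
x = rng.standard_normal((tokens, d))
router_w = rng.standard_normal((d, n_experts))
expert_w = rng.standard_normal((n_experts, d, d)) * 0.1
shared_w = rng.standard_normal((d, d)) * 0.1

out, top_idx, gates = moe_forward(x, router_w, expert_w, shared_w, top_k)
print(out.shape, top_idx.shape)
```

Each token touches only 6 of the 128 routed experts, and different tokens pick different experts, which is what makes the memory access pattern irregular.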
The performance challenge: SLAs and baseline configuration on NVIDIA H100

Optimizing the Sarvam 30B model wasn’t just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model (voice-to-voice agents), we established the following service level agreements (SLAs):
- P95 (95th percentile) time to first token (TTFT): < 1000 ms
- P95 (95th percentile) inter-token latency (ITL): < 15 ms

In inference performance testing, P95 (the 95th percentile) is a latency threshold: 95% of served requests complete faster than it, while the slowest 5% take longer. It is a critical tail-latency metric used to evaluate user experience and system stability, ensuring that even under load, most users face no more than a specific delay. The engineering goal was to maximize the inference server’s token throughput (concurrently served requests) without breaching these P95 targets.

For the initial performance analysis, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention, a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture; RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. Furthermore, SGLang’s Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.

The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens.
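As an illustration of how the P95 targets above can be checked against measured latencies, here is a minimal sketch. It uses the nearest-rank percentile definition, which is one of several common conventions and an assumption here; the `percentile` and `meets_sla` helpers and all sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_sla(ttft_ms, itl_ms, ttft_limit=1000.0, itl_limit=15.0):
    """Check the two P95 targets from the text (limits in milliseconds)."""
    return (percentile(ttft_ms, 95) < ttft_limit
            and percentile(itl_ms, 95) < itl_limit)


itl = [12.0] * 20

# One slow request out of 20 (5%) does not move the P95 ...
one_outlier = [400.0] * 19 + [2500.0]
# ... but two slow requests out of 20 (10%) do.
two_outliers = [400.0] * 18 + [2500.0] * 2

print(percentile(one_outlier, 95), meets_sla(one_outlier, itl))
print(percentile(two_outliers, 95), meets_sla(two_outliers, itl))
```

The example shows why P95 is a tail metric: the SLA tolerates a small share of slow requests but fails as soon as more than 5% of them exceed the limit.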
Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a specific parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:
- Expert parallelism (EP=2) for the expert weights. This configuration utilizes Grouped GEMM kernels to maximize compute density and ensures that the massive expert weights reside in HBM, reducing the cost of expert routing.
- Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase.

While this configuration provided a robust functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization, leading us to the specific kernel and precision strategies detailed below.

From profiling to performance: eliminating MoE bottlenecks

Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting SLA requirements. To identify the precise bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams utilized NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of every kernel within a single transformer layer.

The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations, specifically in the MoE routing logic and positional embedding calculations. These operations were suffering from kernel launch overheads and redundant memory reads.

Figure 1.
Nsys profiler timeline showing SM activity and kernel execution over time of the prefill phase, with red boxes marking the most expensive kernels in the layer: QK normalization, attention, and MoE expert computation.

Following these observations, we executed a targeted optimization strategy across three axes: kernel optimizations, scheduling efficiency, and disaggregated serving.

Cutting transformer layer time by 34% with kernel-level optimizations

The NVIDIA and Sarvam AI teams systematically targeted the most expensive kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We implemented the models first using a baseline implementation on SGLang with H100 GPUs and then optimized them to achieve significant speedups, as detailed in Table 1 and in the following text.

| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
|---|---|---|---|
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Use optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | Use FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Use fused TopK impl.; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Tune kernel params for and DEP2 configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlap with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |

Table 1. Kernel-level optimizations pay off: Fusing and tuning the hottest kernels cut layer time drastically and deliver faster prefill.

MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.
Optimization: We implemented a Fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. Furthermore, we utilized a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound.

Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm, followed by rotary positional embeddings (RoPE), required reading and writing the massive KV cache twice.

Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption.

Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By utilizing separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU’s compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.

These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict SLAs of < 1000 ms time to first token (TTFT) and < 15 ms inter-token latency (ITL), as shown in Figure 2.

Figure 2. Performance gains from kernel optimizations across various concurrency points. In focus is the performance gain at the 75 TPS/user point. With kernel optimizations, we see a 1.26x improvement in overall token throughput per GPU.
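The per-kernel figures in Table 1 can be cross-checked with a short calculation. The numbers below are copied from the table; the "(first)" and "(second)" suffixes on the two AllReduce rows are labels added here only to tell them apart.

```python
# (kernel, baseline_us, optimized_us) copied from Table 1.
TABLE_1 = [
    ("RMSNorm + Prepare QKV", 186, 185),
    ("QK Norm + RoPE", 414, 54),
    ("Attention", 322, 296),
    ("Post-attention linear projection", 114, 112),
    ("AllReduce (first)", 252, 250),
    ("Router logits and TopK", 560, 134),
    ("Routed experts computation", 1103, 1080),
    ("Shared expert computation", 216, 215),
    ("AllReduce (second)", 265, 249),
]

baseline_total = sum(b for _, b, _ in TABLE_1)
optimized_total = sum(o for _, _, o in TABLE_1)
speedup = baseline_total / optimized_total
time_saved = 1 - optimized_total / baseline_total

for kernel, b, o in TABLE_1:
    print(f"{kernel:34s} {b / o:5.2f}x")
print(f"total: {baseline_total} -> {optimized_total} us, "
      f"{speedup:.2f}x faster, {time_saved:.1%} less time per layer")
```

The row sums reproduce the table's totals of 3432 and 2575 microseconds. By this arithmetic the overall ratio is about 1.33x, which corresponds to roughly 25% less time per layer, and almost all of the gain comes from the QK Norm + RoPE and router/TopK rows.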
How mixed prefill and decode scheduling improve GPU utilization

While kernel-level optimizations improve individual operation latency, significant efficiency gains can be achieved at the scheduler level by optimizing aggregated serving (prefill and decode run on the same GPU) and disaggregated serving (prefill and decode run on different GPUs).

The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often leads to suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading KV cache). Serializing them means the GPU’s compute units (the SMs and their Tensor Cores) are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly for the low-concurrency operating point imposed by the tight SLA requirements.

To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to mix prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU.

This optimization introduces a subtle tradeoff. Mixing heavy prefill chunks into the decode stream can increase inter-token latency (ITL) for the active decode requests, as they must wait for the shared compute resources. However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15 ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%.
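The mixed batching idea described above can be sketched as a batch builder that fills the remaining token budget of each forward pass with chunks of pending prompts. This is an illustration under stated assumptions, not SGLang's scheduler: the `build_mixed_batch` function, the 2,048-token budget, and the request ids are hypothetical; only the 3,584-token prompt length comes from the traffic profile in the text.

```python
from collections import deque


def build_mixed_batch(decode_requests, prefill_queue, token_budget):
    """Return (decode_ids, prefill_chunks) for one forward pass.

    decode_requests: ids of sequences that each need one new token.
    prefill_queue: deque of [request_id, remaining_prompt_tokens].
    """
    # Each active decode contributes exactly one token to the batch.
    decode_ids = list(decode_requests)[:token_budget]
    budget = token_budget - len(decode_ids)

    # Fill what is left of the budget with chunks of waiting prompts.
    prefill_chunks = []
    while budget > 0 and prefill_queue:
        request_id, remaining = prefill_queue[0]
        chunk = min(remaining, budget)
        prefill_chunks.append((request_id, chunk))
        budget -= chunk
        if chunk == remaining:
            prefill_queue.popleft()                   # prompt fully prefilled
        else:
            prefill_queue[0][1] = remaining - chunk   # rest waits for next pass
    return decode_ids, prefill_chunks


queue = deque([["a", 3584], ["b", 3584]])
first = build_mixed_batch([1, 2, 3], queue, token_budget=2048)
second = build_mixed_batch([1, 2, 3], queue, token_budget=2048)
print(first)
print(second)
```

Decodes are never starved in this model because they are admitted first, while prompts advance on every pass instead of waiting for the decode phase to finish. Shrinking the budget reduces the prefill share per pass, which mirrors the chunk-size trade-off mentioned for decode-heavy workloads.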
This scheduling optimization is quite favorable in the high ISL, low OSL scenario of interest here. For more decode-heavy cases, it might be worthwhile to pick smaller mixed chunk sizes or disable mixed batching altogether.

Figure 3. The impact of mixed chunk scheduling, with 15% token throughput gains seen at the 2-second request latency point.

How disaggregated serving removes the critical path and boosts throughput 1.5x

Despite kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU's memory, we pivoted from model parallelism to disaggregated serving.

We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: we observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), showing that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.

Figure 4. The benefits of disaggregated serving on NVIDIA H100 SXM for the Sarvam 30B model.

The end-to-end impact of kernel, scheduling, and disaggregation optimizations

Figure 5 summarizes the end-to-end performance speedup we achieved through a combination of optimized kernels and scheduling optimizations. We also observe that disaggregated serving is the best configuration for this model, this ISL/OSL workload pattern, and these specific TTFT and ITL SLAs.

Figure 5. Progressive improvements seen in Sarvam 30B model inference on NVIDIA H100 SXM through a combination of kernel optimizations, scheduling optimizations, and disaggregated serving.
Running the Sarvam 30B model on NVIDIA Blackwell GPUs

The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, a substantial jump over the NVIDIA H100 GPU's capabilities. This throughput is driven by the second-generation Transformer Engine, which uses the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.

To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike the H100 setup, which used multiple GPUs, the NVIDIA HGX B200 served the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell's NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.

As indicated in Figure 6, the NVIDIA Blackwell GPU enables high performance at low latency due to its superior compute, as well as exceptional throughput at higher concurrencies from its memory capacity advantage.

Figure 6. NVIDIA Blackwell GPU offers 2.8x higher token throughput than the NVIDIA H100 SXM GPU at the 100 TPS/user operating point.

Learn more

Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining the strict TTFT and inter-token latency targets required for real-world deployment. The result is not just a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads.
These learnings provide a blueprint for other teams building large, production-grade AI systems on NVIDIA platforms. More information about Sarvam AI's models can be found here. To begin exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com.
Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.
Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.
Read more about NVIDIA Cloud Functions, NVIDIA's multi-cloud, high-performance AI inference solution, here.

Tags: Agentic AI / Generative AI | Data Center / Cloud | Data Science | Cloud Services | Blackwell | DGX Cloud | H100 | NeMo | NeMo Microservices | Nemotron | Intermediate Technical | Tutorial | featured | NVIDIA Inception

About the Authors

About Utkarsh Uppal
Utkarsh Uppal is a senior applied deep learning solutions architect at NVIDIA, where he specializes in building high-performance deep learning pipelines across domains like language and speech. His primary focus is on developing end-to-end conversational AI systems, including training LLMs from scratch, particularly for Indic languages, and building domain-specific models with enterprises. He also has deep expertise in designing and optimizing inference architectures for production, with a focus on low-precision formats (FP4, FP8), decoding strategies, and KV-cache optimizations.

About Sriharsha Niverty
Sriharsha Niverty focuses on AI infrastructure at NVIDIA, optimizing systems-level performance for large-scale LLM inference and training workloads. Previously, he worked on graphics application performance and architecture exploration, with an emphasis on efficient work scheduling inside the GPU.

About Diya Shah
Diya Shah is a machine learning engineer at Sarvam AI, working on inference and optimization of models to drive maximum efficiency in serving stacks. By targeting accelerations at the system and kernel level, she works to ensure that large-scale models remain performant on diverse hardware environments. Diya has a bachelor of technology in electronics and communications engineering from the LNM Institute of Information Technology (LNMIIT) in India.

About Rakesh Madugundu
Rakesh is an ML performance engineer at Sarvam AI. He focuses on accelerating model inference by optimizing at both the system and kernel levels to reduce production latency. He is passionate about low-level engineering, with a particular interest in writing custom kernels and building foundational architectures from scratch to maximize hardware efficiency.

About Ashwin Srinivasan
Ashwin is a founding engineer at Sarvam whose role is to get all models from research to production. He takes care of model optimization across target hardware and also maintains accelerator infrastructure at Sarvam. He likes to dive deep into model architecture, kernels, and hardware.

Related posts
Profiling LLM Training Workflows on NVIDIA Grace Hopper
Blackwell Breaks the 1,000 TPS/User Barrier With Meta's Llama 4 Maverick
NVIDIA Sets New Generative AI Performance and Scale Records in MLPerf Training v4.0
New NVIDIA NeMo Framework Features and NVIDIA H200 Supercharge LLM Training Performance and Versatility
Build Custom Enterprise-Grade Generative AI with NVIDIA AI Foundation Models
How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton
Accelerating Video Production and Customization with GliaCloud and NVIDIA Omniverse Libraries
Vortex Enables Advanced Imaging Anywhere with NVIDIA Jetson
Spotlight: Build Scalable and Observable AI Ready for Production with Iguazio's MLRun and NVIDIA NIM
Spotlight: Personal AI Brings AI Receptionists to Small Business Owners with NVIDIA Riva
Controlling Floating-Point Determinism in NVIDIA CCCL | NVIDIA Technical Blog nvidia_dev_blog 05.03.2026 17:00 0.688
Embedding sim.: 0.7518
Entity overlap: 0.1935
Title sim.: 0.287
Time proximity: 1
NLP type: other
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country: united states

Open original

A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a simple property to guarantee, it can be difficult to achieve in practice, especially in parallel programming and floating-point arithmetic. This is because floating-point addition and multiplication aren't strictly associative: (a + b) + c may not equal a + (b + c), due to rounding that occurs when intermediate results are stored with finite precision.

With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB (a low-level CUDA library for speed-of-light parallel device algorithms) added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm's determinism property. This can only be done through the new single-phase API, since the two-phase API doesn't accept an execution environment.

The following code shows how to specify the determinism level in CUB (find the complete example online using compiler explorer).

    auto input  = thrust::device_vector<float>{0.0f, 1.0f, 2.0f, 3.0f};
    auto output = thrust::device_vector<float>(1);

    // can be not_guaranteed, run_to_run (default), or gpu_to_gpu
    auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);

    auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
    if (error != cudaSuccess)
    {
      std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << std::endl;
    }

    assert(output[0] == 6.0f);

We begin by specifying the input and output vectors. We then use cuda::execution::require() to construct a cuda::std::execution::env object, setting the determinism level to not_guaranteed.

There are three determinism levels available for reduction:

not_guaranteed
run_to_run
gpu_to_gpu

Determinism not guaranteed

In floating-point reductions, the result can depend on the order in which elements are combined.
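This order dependence is easy to reproduce on the host. The short Python sketch below (double precision, no GPU involved) shows both the classic associativity failure and a reduction whose result changes when the same three values are added in a different order. The helper `ordered_sum` is a hypothetical name; it uses an explicit loop so that no compensated summation is applied behind the scenes.

```python
def ordered_sum(values):
    """Add values strictly left to right, one rounding per addition."""
    total = 0.0
    for value in values:
        total += value
    return total


# Grouping matters: the two expressions differ in the last bit.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)

# Ordering matters: 1.0 is absorbed when added to 1e16 first,
# but survives when the two large values cancel first.
absorbed = ordered_sum([1e16, 1.0, -1e16])
preserved = ordered_sum([1e16, -1e16, 1.0])

if __name__ == "__main__":
    print(left_grouped, right_grouped)
    print(absorbed, preserved)
```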
If two runs apply the reduction operator in different orders, the final values may differ slightly. In many applications, these minor differences are acceptable. By relaxing the requirement for strict determinism, the reduction implementation can rearrange the operations in any order, which can improve runtime performance.

In CUB, not_guaranteed relaxes the determinism level. This enables atomic operations (whose unordered execution across threads results in a different order of operations between runs) to compute both the block-level partial aggregates and the final reduction value. The entire reduction can also be performed in a single kernel launch, since the atomic operations combine the block-level partial aggregates into the result.

The nondeterministic reduce variant is typically faster than the run-to-run deterministic version, particularly for smaller input arrays, where performing the reduction in a single kernel reduces latency from multiple kernel launches, minimizes extra data movement, and avoids additional synchronization. The tradeoff is that repeated runs may yield slightly different results due to the lack of deterministic behavior.

Run-to-run determinism

While nondeterministic reductions offer potential performance gains, CUB also provides a mode that guarantees consistent results across runs. By default, cub::DeviceReduce is run-to-run deterministic, which corresponds to setting the determinism level to run_to_run in the single-phase API. In this mode, multiple invocations with the same input, kernel launch configuration, and GPU will produce identical outputs.

This determinism is achieved by structuring the reduction as a fixed, hierarchical tree rather than relying on atomics, whose update order can vary across runs. At each stage of the reduction, elements are first combined within individual threads.
The intermediate results are then reduced across threads within a warp using shuffle instructions, followed by a block-wide reduction using shared memory. Finally, a second kernel aggregates the per-block results to produce the final output. Because this sequence is predetermined and independent of the relative timing of thread execution, the same inputs, kernel configuration, and GPU yield the same bitwise result.

GPU-to-GPU determinism

For applications that require the highest level of reproducibility, CUB also provides GPU-to-GPU determinism, which guarantees identical results across multiple runs with the same input on different GPUs. This mode corresponds to setting the determinism level to gpu_to_gpu.

To achieve this level of determinism, CUB uses a Reproducible Floating-point Accumulator (RFA), a solution based on the NVIDIA GTC 2024 session, Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions. The RFA counters floating-point non-associativity, which arises when adding numbers with different exponents, by grouping all input values into a fixed number of exponent ranges, or bins. This fixed, structured accumulation order ensures the final result is independent of GPU architecture.

The accuracy of the final result depends on the number of bins: more bins provide greater accuracy, but also increase the number of intermediate summations, which can reduce performance. The current implementation defaults to three bins, a choice that balances performance and accuracy. It's worth noting that this configuration is not only strictly deterministic but also numerically accurate, providing tighter error bounds than the standard pairwise summation traditionally used in parallel reductions.
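The binning idea can be sketched in a few lines of Python. This is a simplified illustration, not CUB's RFA implementation: the function name `binned_sum` and the `bin_width` parameter (bits per bin) are assumptions made for the example. Each input is split into pieces that are exact multiples of a per-bin power of two, so the additions inside a bin are exact and the result no longer depends on the order of the inputs.

```python
import math
import random


def binned_sum(values, num_bins=3, bin_width=20):
    """Order-independent float sum using a fixed number of exponent bins."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0.0
    # frexp guarantees max_abs < 2**top_exp, which anchors the first bin.
    _, top_exp = math.frexp(max_abs)
    residuals = list(values)
    bin_totals = []
    for bin_index in range(num_bins):
        # Every contribution to this bin is an integer multiple of `unit`.
        unit = math.ldexp(1.0, top_exp - (bin_index + 1) * bin_width)
        total = 0.0
        for i, residual in enumerate(residuals):
            chunk = round(residual / unit) * unit  # exact power-of-two scaling
            total += chunk  # exact while len(values) stays far below 2**32
            residuals[i] = residual - chunk  # low-order bits go to the next bin
        bin_totals.append(total)
    # Combine the bins in a fixed order, largest magnitude first.
    result = 0.0
    for total in bin_totals:
        result += total
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
    shuffled = data[:]
    rng.shuffle(shuffled)
    print(binned_sum(data) == binned_sum(shuffled))
```

Bits that fall below the last bin are discarded, which is why more bins give a more accurate result at the cost of more passes over the data.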
How results vary based on the determinism levels

The three determinism levels differ in the amount of variation they produce across multiple runs:

Not-guaranteed determinism can produce slightly different summation values on each invocation.
Run-to-run determinism ensures the same value for every invocation on a single GPU, but the result may vary if a different GPU is used.
GPU-to-GPU determinism guarantees that the summation value is identical for every invocation, regardless of which GPU executes the reduction.

This is shown in Figure 1, with the summation of an array for each determinism level (represented by green, blue, and red circles) plotted against the run number. A flat horizontal line shows that the reduction produces the same result on every run.

Figure 1. Summation value compared to run number

Determinism performance comparison

The level of determinism selected affects the performance of cub::DeviceReduce. Not-guaranteed determinism, with its relaxed requirements, provides the highest performance. The default run-to-run determinism delivers good performance but is slightly slower than not-guaranteed determinism. GPU-to-GPU determinism, which enforces the strictest reproducibility across different GPUs, can significantly reduce performance, increasing execution time by 20% to 30% for large problem sizes.

Figure 2 compares the performance of the different determinism requirements for float32 and float64 inputs on an NVIDIA H200 GPU (lower is better). The results show how the choice of determinism level impacts execution time across different data types.

Figure 2. Elapsed time compared to the number of elements

Conclusion

With the introduction of the single-phase API and explicit determinism levels, CUB provides an enhanced toolbox for controlling both the behavior and performance of reduction algorithms.
Users can choose the level of determinism that best suits their needs: from the high-performance and flexible not-guaranteed mode, to the reliable run-to-run default, and up to the strictest GPU-to-GPU reproducibility.

Determinism in CUB isn't limited to reductions. We plan to extend these capabilities to additional algorithms so developers can control reproducibility across a wider range of parallel CUDA primitives. For updates and discussion, follow the ongoing GitHub issue on expanded determinism support, where you can track our roadmap and provide feedback on the algorithms you'd like to see deterministic versions of.

Tags: Data Science | Developer Tools & Techniques | Simulation / Modeling / Design | General | CUDA | Intermediate Technical | Tutorial | featured

About the Authors

About Nader Al Awar
Nader Al Awar is a senior software engineer at NVIDIA and a member of the CUDA Core Compute Libraries (CCCL) team, where he focuses on the development of CUB and cuda.compute. He earned his doctorate in electrical and computer engineering from the University of Texas at Austin, specializing in high-performance computing for Python. Nader is passionate about bridging the gap between high-level languages and hardware by accelerating Python code using GPUs.

About Srinivas Yadav Singanaboina
Srinivas Yadav Singanaboina is a graduate research assistant at the Center for Computation and Technology at Louisiana State University (LSU). Srinivas is a core member of STE||AR GROUP and an active contributor to the HPX open-source project.

Related posts
Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python
Implementing High Performance Matrix Multiplication Using CUTLASS v2.8
Revealing New Features in the CUDA 11.5 Toolkit
Faster Parallel Reductions on Kepler
How to Overlap Data Transfers in CUDA Fortran
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads
How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy
Designing Protein Binders Using the Generative Model Proteina-Complexa
How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain
Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark
Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting openai 26.02.2026 10:00 0.687
Embedding sim.: 0.8063
Entity overlap: 0.375
Title sim.: 0.1698
Time proximity: 0.5446
NLP type: product_launch
NLP organization: OpenAI
NLP topic: benchmarking
NLP country:

Open original

OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting—showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.
Arvind KC appointed Chief People Officer openai 24.02.2026 13:40 0.683
Embedding sim.: 0.7936
Entity overlap: 0.3333
Title sim.: 0.0513
Time proximity: 0.8085
NLP type: leadership_change
NLP organization: OpenAI
NLP topic: enterprise ai
NLP country:

Open original

OpenAI appoints Arvind KC as Chief People Officer to help scale the company, strengthen its culture, and lead how work evolves in the age of AI.
Making Softmax More Efficient with NVIDIA Blackwell Ultra | NVIDIA Technical Blog nvidia_dev_blog 25.02.2026 17:00 0.682
Embedding sim.: 0.7874
Entity overlap: 0.1176
Title sim.: 0.237
Time proximity: 0.7202
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI "speed of thought" is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function.

Transcendental functions are functions that do not satisfy any polynomial equation, so they cannot be built from a finite sequence of algebraic operations. Consequently, they "transcend" basic operations like addition and multiplication, the exact operations Tensor Cores excel at. In the specific context of softmax, the most computationally expensive of these transcendentals is the natural exponential function, which is executed on Special Function Units (SFUs). In NVIDIA assembly instructions (SASS), this function is computed via the MUFU.EX2 instruction, a base-2 exponential to which the natural exponential is reduced by scaling its argument by log2(e).

This architectural split creates a softmax bottleneck within the attention block, where powerful matrix engines are forced to idle while waiting for the SFU datapaths to normalize attention scores. NVIDIA Blackwell Ultra alleviates this bottleneck by doubling SFU throughput over the standard NVIDIA Blackwell architecture.

This post dives into the mechanics of softmax within the attention loop, explores how Blackwell Ultra's hardware optimizations eliminate pipeline stalls, and provides a benchmark for you to measure the raw MUFU.EX2 speedup for yourself.

How attention works

A foundational component of modern large language models is the attention mechanism, which allows a model to transform static token vectors into context-aware representations. At its core, it is a process of re-weighting information by allowing tokens to adjust their importance to one another.

To facilitate this interaction, every token in a sequence is projected into three functional roles:

Query: Represents what the current token is looking for in order to understand its own context.
Key: Represents a token's profile that others use for matching. Tokens earlier in the sequence have keys that signal their specific relevance to the query.
Value: Holds the actual informational content. Once a match is confirmed between a query and a key, the value is the specific data that is transferred to the original token.

Figure 1 shows attention in action. We have two sentences that use the word "dog" with two different meanings. Initially, the embeddings (the numerical vectors that capture meaning and nuance in a multidimensional space) of both "dog" mentions are identical.

Figure 1. Context building through attention

Attention operates with the model calculating a dot product between the "dog" query and the keys of every other token in the sequence. If the query for "dog" aligns well with the key for "lazy," it indicates a high degree of relevance. This interaction is what allows the word "dog" to pull in the specific value of its neighbor. By the end of this cycle, the original vector for "dog" has been updated with the content of its neighbors, evolving from a generic dictionary definition into a contextualized embedding that "understands" whether it refers to a lethargic animal or the sweltering peak of a season.

How softmax relates to attention

Softmax serves as the critical decision-making phase that converts raw compatibility scores into actionable weights. Once the initial dot products are calculated between queries and keys, the resulting scores are passed through the softmax function to be normalized into probabilities that sum to one. This step is what determines the "attention span" of the model, effectively deciding which tokens to prioritize and which to ignore. Without softmax, the model would have no way to objectively weigh the information it gathers, leading to an unmanageable and noisy blend of data.
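The query, key, value, and softmax steps described above can be written out for a single query in plain Python. This is a minimal sketch for intuition only: the names `softmax` and `attend` and the two-dimensional toy vectors are illustrative assumptions, and the usual scaling of scores by the square root of the head dimension is omitted for brevity.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to one."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, keys, values):
    """Return (weights, context) for one query over a list of keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    weights, context = attend(
        query=[1.0, 0.0],
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[10.0, 0.0], [0.0, 10.0]],
    )
    print(weights, context)
```

Every score costs one exponential, so a full attention matrix over n tokens needs on the order of n squared exponentials per head.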
However, the softmax operation is the primary source of the "performance cliff" seen in long-context AI. Because every token in a sequence must be compared against every other token, a sequence of 8,192 tokens creates a massive [8,192 x 8,192] attention matrix. Normalizing this matrix requires billions of transcendental calculations, and the cost grows quadratically with the sequence length. This creates a bottleneck, where the sheer volume of transcendental math can stall the entire inference pipeline.

Blackwell Ultra focuses on accelerating these exponential calculations specifically to alleviate this mathematical bottleneck and ensure that the system can handle the massive normalization required for large context windows without sacrificing throughput.

Alleviating the softmax bottleneck in Blackwell Ultra

By doubling the throughput of the SFU for exponentials in the Blackwell Ultra architecture, NVIDIA alleviates this bottleneck and allows for a more balanced and efficient processing pipeline. This results in faster overall performance, especially for tasks that are heavy on attention mechanisms.

Figure 2 illustrates the sequential dependency inherent in the standard attention mechanism, often referred to as the attention loop, as run on the previous-generation NVIDIA Blackwell (GB200). Note that the Streaming Multiprocessor (SM) loads two thread blocks running attention loops concurrently. These separate attention loops are denoted in the two different shades of green. This pipeline consists of three distinct phases that must execute in order:

BMM1 (score calculation): The Tensor Cores perform a matrix multiplication to calculate the raw attention scores, or logits.
Softmax (normalization): The pipeline shifts to the SFUs to normalize these scores into probabilities using exponential functions.
BMM2 (context aggregation): The pipeline returns to the Tensor Cores to multiply the probabilities by the value vectors.

Figure 2. The Blackwell attention loop

The timeline illustrates the latency constraints inherent in the Blackwell GPU during the execution of the attention kernel. Because the second matrix multiplication (BMM2) acts on the output of the softmax, it cannot begin until the normalization is complete. The lower throughput of the Blackwell GPU's SFUs forces the Tensor Cores to idle between the score calculation (BMM1) and the context aggregation (BMM2). This dependency prevents the pipeline from fully saturating the compute resources and extends the duration of the softmax operation.

The next timeline, shown in Figure 3, demonstrates the direct impact of the doubled SFU throughput of the Blackwell Ultra GPUs in NVIDIA GB300 NVL72 and NVIDIA HGX B300 systems on the same instruction sequence.

Figure 3. The Blackwell Ultra attention loop

Visually, the width of the softmax blocks is reduced by almost 50%, reflecting the hardware's ability to process MUFU instructions at twice the rate. This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 is drastically reduced, allowing the Tensor Cores to switch between the query-key multiplication and the probability-value multiplication with minimal stalling. The result is a denser main loop where the high-performance matrix engines spend a larger percentage of the total execution time active, directly translating to higher overall inference throughput.

Benchmarking MUFU.EX2 performance

To empirically verify the theoretical throughput of the MUFU pipeline, we can construct a synthetic micro-benchmark. The following kernel code isolates the exponential instructions to measure the raw cycle count without interference from global memory latency or other arithmetic operations. This test harness launches a grid of threads where each thread performs a dense loop of MUFU.EX2 instructions.
By timing the execution and comparing it against the clock frequency, you can directly calculate the effective instruction throughput and validate the bandwidth saturation point mentioned earlier.

Step 1: Clone the following repository to pull the exp2-gb300.cu benchmark.

    git clone https://github.com/jamieliNVIDIA/mufu_ex2_bench.git
    cd mufu_ex2_bench

Step 2: Compile (using sm_103a for GB300 or sm_100f for GB200).

    nvcc -O3 -gencode=arch=compute_103a,code=sm_103a --extended-lambda -o /tmp/exp2-gb300.out exp2-gb300.cu

Sample results

We see that GB300 delivers about 2x the exp2 throughput of GB200 for all tested data types, in line with the doubled SFU throughput.

Operation      Blackwell (GB200)            Blackwell Ultra (GB300)
exp2 BF16x2    2454 Gop/s (4908 GFLOPS)     4996 Gop/s (9992 GFLOPS)
exp2 BF16      4938 Gop/s                   9738 Gop/s
exp2 FP32      4943 Gop/s                   10024 Gop/s

Attention forward propagation performance in Blackwell vs Blackwell Ultra

The transition from Blackwell to Blackwell Ultra delivers a targeted increase in compute throughput driven by a 2x increase in SFU performance. This hardware upgrade directly accelerates the forward propagation (FPROP) pipeline for models like DeepSeek-V3. FPROP is the process where input data travels "forward" through the neural network (from the input layer, through the hidden layers, to the output layer) to generate a prediction. Every time the model produces a single new token, it must run one complete FPROP pass.

Figure 4 shows that by doubling the throughput of the SFUs, the GB300 drastically reduces the execution time of the softmax layers within the attention blocks. This faster normalization means the GPU spends less time processing the attention scores and more time utilizing the high-speed matrix engines for the next layer's computation, directly increasing the overall speed of the forward pass.

Figure 4. GB300 vs GB200 FLOPS in forward propagation in a grouped query attention (GQA) model.
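As a quick check on the "about 2x" claim, the per-data-type speedups can be computed directly from the sample micro-benchmark figures reported above. The short Python snippet below only does that arithmetic; the dictionary and variable names are illustrative.

```python
# Sample exp2 throughput in Gop/s, copied from the results listed above.
gb200_gops = {"BF16x2": 2454, "BF16": 4938, "FP32": 4943}
gb300_gops = {"BF16x2": 4996, "BF16": 9738, "FP32": 10024}

# Ratio of Blackwell Ultra (GB300) to Blackwell (GB200) for each data type.
speedups = {dtype: gb300_gops[dtype] / gb200_gops[dtype] for dtype in gb200_gops}

if __name__ == "__main__":
    for dtype, ratio in speedups.items():
        print(f"exp2 {dtype}: {ratio:.2f}x")
```

All three ratios fall within a few percent of 2.0, consistent with a doubling of SFU throughput rather than a data-type-specific effect.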
The benchmark results highlight a roughly 35% increase in FPROP throughput for FP8 operations. This gain is particularly pronounced in FP8 because the matrix math is already extremely fast. In this low-precision regime, the time spent on softmax becomes a larger percentage of the total step.

Getting started

The performance dynamics of DeepSeek-V3 on Blackwell Ultra highlight a critical but often overlooked bottleneck in inference: the computational cost of non-linear operations. By optimizing and compressing the attention mechanism, state-of-the-art models effectively increase the density of softmax operations relative to standard linear computations, exposing the SFUs as a governor of total throughput.

Blackwell Ultra directly addresses this bottleneck. By doubling the throughput of these specialized units, Blackwell Ultra unblocks the transcendental traffic jam that previously forced the powerful Tensor Cores to idle. The benchmark results confirm the impact, demonstrating a roughly 35% gain in FP8 forward propagation. For modern, highly optimized architectures, the path to faster inference isn't just about faster Tensor Cores; it's also about ensuring the non-linear math units are fast enough to keep up.

Visit NVIDIA's trtllm-gen repository for more benchmarks and information on utilizing this SFU speedup in workloads.

Doubling the throughput of the SFUs for MUFU.EX2 is just one of many features that enable Blackwell Ultra's fast attention speed. NVIDIA's extreme hardware-software codesign accelerates the full attention loop through technologies such as:

Offloading critical "find-max" reductions to the Tensor Memory controller via LDTM.STAT.
Optimizing performance using cuDNN.
Optimizing KV cache data movements using NVFP4.

Stay tuned to the NVIDIA technical blog for future posts.

Acknowledgements

Special thanks to the cuDNN engineering team for creating the benchmarks and building the software optimizations that make this cutting-edge performance possible.
About the Authors

About Jamie Li
Jamie Li is a senior technical marketing engineer at NVIDIA focused on wrangling the latest technologies in AI inference. He brings a deep background in both AI software engineering and customer management, translating innovations into practical customer outcomes. Before NVIDIA, he held roles developing, breaking, and fixing AI solutions in the enterprise tech sector. He also did research in medical imaging and holds a master’s degree in Computer Science with an AI focus.

About Alexander Zhurkevich
Alex graduated from the University of Massachusetts Boston with a Master's degree. His past research and working experience is mainly in HPC and AI/ML and computer vision. At NVIDIA, he is a developer technology AI engineer focusing on accelerating ML/DL workloads on GPUs.

About Vedaanta Agarwalla
As a senior deep learning software engineer at NVIDIA, Vedaanta focuses on accelerating GPU workloads with a current emphasis on optimizing attention kernels for both training and inference. His previous experience spans ResNet optimizations, GEMMs, and HPC for derivatives pricing in quantitative trading. Vedaanta holds a master’s degree in computer science from the University of Illinois Urbana-Champaign.

About Seonghee Lee
Seonghee Lee is an engineer on the AI platform software team at NVIDIA, focusing on AI Inference-related products. Seonghee holds a master’s in computer science from Stanford University and a bachelor’s in science from Cornell University, specializing in AI. Before joining NVIDIA, she worked at Microsoft Research on developing real-time AI agent interactions.

About Roman Anders
Roman Anders is a software engineer on the cuDNN team at NVIDIA, where he focuses on Flash Attention optimizations for inference and training workloads across current and next-generation GPU architectures. His contributions at NVIDIA span RNN, matrix multiplications, and convolutions. Previously, he served as an engineer on the Intel MKL team, where he developed Sparse BLAS, Direct Sparse Solvers, and FFT. He holds a master's degree in applied mathematics and programming from Novosibirsk State University in Russia.
How to Minimize Game Runtime Inference Costs with Coding Agents | NVIDIA Technical Blog nvidia_dev_blog 03.03.2026 19:49 0.681
Embedding similarity: 0.806
Entity overlap: 0.0476
Title similarity: 0.3455
Time proximity: 0.4117
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai agents
NLP country:

Open original

NVIDIA ACE is a suite of technologies for building AI agents for gaming. ACE provides ready-to-integrate cloud and on-device AI models for every part of in-game characters, from speech to intelligence to animation. To run these models alongside the game engine efficiently, the NVIDIA In-Game Inferencing (NVIGI) SDK includes a set of performant libraries that developers can integrate into C++ games and applications. NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample in which an AI agent works with the player to defeat monsters in a 2D dungeon. AI agents driven by local small language models (SLMs) can make excessive calls to the GPU that compete with graphics. This post examines how to minimize the number of inference calls and maximize what each call accomplishes, reducing contention on the GPU between graphics and compute. Code agents: Trapping the ghost Andrej Karpathy, a founding member of OpenAI, likens working with large language models (LLMs) to summoning ghosts , an apt metaphor for LLM agents, especially ones that write code. Many custom agents limit themselves to tool calling: a function is defined, the LLM decides when to call it, and a result is returned. There is a more ambitious possibility. Instead of just calling a function, an AI agent can create the function and the code to support it. This makes the machine more powerful with less processing. There is, however, a trade-off. An unconstrained LLM with code execution capabilities is a security issue. It can exhaust memory, hang the game process, or, as one unfortunate user discovered, wipe a hard drive while trying to “clear a cache.” It can have benefits, like complex multi-step reasoning, dynamic adaptation, and reduced usage of the SLM. The following dives into how a potential coding agent ghoul turned into a friendly ghost eager to help. Why code agents outshine tool-calling When discussing AI agents, the most typical use case and approach is tool-calling. 
The model outputs structured JSON, the game or application parses it, and then executes the corresponding function. While the ability to call functions is powerful, it only comes after the model has had a chance to think on it a bit, and inference is expensive, especially when it fights for resources on the user’s GPU. Once the model sends the JSON, it waits for a response, thinks again, and returns an answer—potentially repeating the cycle. This can consume valuable fractions of a second that could be spent rendering the game. Moreover, if complex logic is required around the function call, the system must rely on weaker model capabilities. The model doesn’t inherently handle looping; it simply produces tokens. It can try to track state variables, but there is no rigor. If multiple items need addressing, the model must remember each one without missing, duplicating, or hallucinating entries. And every item processed pays an inference cost. Numeric analysis introduces another challenge. With tool-calling, accuracy depends on the model’s mathematical ability or on writing yet another function to ensure correctness. Tool-calling can struggle to scale. Every function call requires another inference hit that competes for GPU resources and must be mitigated. Code agents work by using something computers are already good at—running code. Programming is one of the emerging superpowers of language models. Instead of generating one function call at a time, a single inference can generate all the function calls at once. There’s no performance hit after the initial generation, just standard code that runs until the task is complete. They’re also flexible. While language models can’t easily loop themselves, code agents can easily write code with loops, counters, and filters. The following is a hypothetical example of how tool calling might be used to target an enemy. 
Tool-calling schema:

    [
      {
        "name": "get_enemies_list",
        "parameters": {
          "properties": {
            "position": {"type": "string", "description": "Position to search from"},
            "radius": {"type": "number", "description": "Search radius"}
          }
        }
      },
      {
        "name": "target_enemy",
        "parameters": {
          "properties": {
            "enemy_name": {"type": "string", "description": "Name of the enemy to target"}
          },
          "required": ["enemy_name"]
        }
      }
    ]

When the user says “target the nearest enemy”:
Inference call 1: SLM decides to call get_enemies_list
Tool response: Returns ["goblin_01", "skeleton_archer_01", "orc_chief"] (just strings; otherwise, full entity schemas blow out the context window)
Inference call 2: SLM sees the list, picks one, calls target_enemy("goblin_01")
Tool response: Success
Inference call 3: Feedback to the user about the status of the function call

Three inference calls for one decision. Consider the same “target enemy” action with a code agent.

Code agent API definition:

    get_enemies(position, radius)
    --[[
    Find enemies near a position.
    Parameters:
      position (table): Center point as {row, col}
      radius (number): Search radius
    Returns:
      table: Array of enemy entities (with .name, .position, .health, etc.)
    Example:
      local nearby = get_enemies(ally.position, 10)
    ]]

    set_target(ally, enemy)
    --[[
    Set an ally's attack target.
    Parameters:
      ally (entity): The ally to command
      enemy (entity): The enemy to target
    Example:
      set_target(warrior, nearby[1])
    ]]

SLM-generated code for “target the nearest enemy”:

    local enemies = get_enemies(ally.position, 10)
    local closest = nil
    local min_dist = math.huge
    for _, enemy in ipairs(enemies) do
      local dx = enemy.position[1] - ally.position[1]
      local dy = enemy.position[2] - ally.position[2]
      local dist = math.abs(dx) + math.abs(dy)
      if dist < min_dist then
        min_dist = dist
        closest = enemy
      end
    end
    if closest then
      set_target(ally, closest)
    end

With one inference call, the SLM loops over enemies, accesses their positions, calculates distances, and picks the closest.
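For readers who want to try the selection logic outside the game, here is a minimal Python rendering of the Lua snippet above. It is a sketch, not part of the SDK: the Entity class, its fields, and the sample coordinates are hypothetical stand-ins for the entity objects that get_enemies returns.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical stand-in for the game's entity objects.
    name: str
    position: tuple  # (row, col)
    health: int = 100


def manhattan(a, b):
    """Grid distance |d_row| + |d_col|, matching the Lua snippet."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_enemy(ally, enemies):
    """Return the closest enemy, or None when the list is empty."""
    closest = None
    min_dist = float("inf")
    for enemy in enemies:
        dist = manhattan(enemy.position, ally.position)
        if dist < min_dist:  # strict '<' keeps the first enemy on ties
            min_dist = dist
            closest = enemy
    return closest


if __name__ == "__main__":
    ally = Entity("warrior", (5, 5))
    enemies = [
        Entity("goblin_01", (5, 8)),
        Entity("skeleton_archer_01", (2, 4)),
        Entity("orc_chief", (9, 9), health=40),
    ]
    print(nearest_enemy(ally, enemies).name)
    # A different strategy reuses the same list with no extra primitives.
    print(min(enemies, key=lambda e: e.health).name)
```

The second print shows the point made in the text: once the entity list is available, switching from "closest" to "lowest health" is a change in ordinary code rather than a new tool and another inference round trip.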
The code agent gets rich entity objects, not just strings, and composes logic that the tool designer never anticipated. Notice the flexibility. That same get_enemies function works for enemies near the player, near an ally, or near a point. Once the SLM has the enemy list, it can write any selection logic, such as targeting enemies weak to arrows, targeting the closest one, or targeting the one with the lowest health. With tool-calling, adapting to new requirements means more tools, more inference calls, and more complexity. With code agents, the SLM composes new strategies at runtime from the same simple primitives. Code agent sample dungeon Keeping with the ghoulish theme, the IGI SDK includes an ASCII dungeon crawler to demonstrate the code agent. The dungeon contains all the pieces of a large game, but in one of gaming’s simplest forms. Players move around, collect items, and fight monsters. But they also have a powerful ally on their adventure, an AI agent. An intelligence that can materialize on demand to help them fight, go on dangerous missions, or provide information about the dangers that await. Figure 1. The AI navigates the maze to retrieve the bow Once an instruction is given, the code is written, and the program doesn’t touch the SLM again until a new instruction is given. A tool call chain may produce the same results, but at the cost of repeated inference calls eating into the allocated frame time slice. The threat model of a code agent Using an SLM to generate code that runs on the host introduces obvious security and safety risks, including: Dangerous function access. The SLM generates os.execute("rm -rf /") or require("socket") , and suddenly the code agent is deleting files or opening network connections.  Unauthorized file access. The SLM locates critical files or API keys to exfiltrate or delete. Resource exhaustion. The SLM writes a loop that allocates memory forever. Stack overflow. 
The SLM writes a recursive function without proper termination. Infinite loops. The SLM writes while true do end and never returns. Escaping the sandbox. The SLM might manipulate internal structures to break out of its containment. State corruption. The SLM might corrupt the game or application’s state. Choosing a target language When choosing a target language, consider: time to execution, general performance, complexity of integration and debugging, and the quality and safety of code produced. While running a game, inference calls must take a fraction of the total frame time. Large hits that stall the rendering pipeline are unacceptable. While it’s possible to generate a few tokens at a time each frame to smooth out inference, compilation does not offer that flexibility. This rules out compiled languages such as C++ or C#. Instead, an interpreted language is required. Two languages stand out as examples: Python and Lua. Python is the obvious first choice. SLMs generate Python fluently. The ecosystem is massive. But Python wasn’t designed for embedding or sandboxing. The Global Interpreter Lock (GIL) complicates multi-threaded hosts. Isolation requires subprocesses or subinterpreters, both adding complexity. Further, there’s no built-in way to limit memory or execution time. Python can run in a sandbox, but it’s a fight against the language the whole way. Lua was designed from the ground up for embedding in hostile environments. The entire runtime is about 200 kB and starts in sub-millisecond time. Plus, every identified threat has documented mitigation, including: Dangerous functions : Selective library loading. Don’t load io or os , and they don’t exist. Memory exhaustion : Custom allocator hook. Track every allocation, enforce a cap. Stack overflow : Debug hooks on function calls. Count depth, error on overflow. Infinite loops : Debug hooks on instruction count. Error after N instructions. Metatable manipulation : Remove getmetatable/setmetatable from globals. 
State corruption : Custom __newindex metamethods that reject writes to protected fields.

Thus, Lua met all the requirements for this IGI sample but still required hardening. With Lua, dangerous or unwanted functions are set to nil (the lua_pushnil(L); lua_setglobal(L, "funcname") pattern). Memory growth is limited by wrapping the default allocator and tracking allocations. The programmer can set up hooks (lua_sethook) to make sure programs don’t blow out the call stack or hang indefinitely. Similarly, metatable access can be restricted with custom metamethods locked down to protect the game state.

These are just some of the steps taken to lock down this sample. More may be required (depending on each particular game or use case), but these tips should help guide the reader while looking through the code. For added security, Lua can be embedded in a WebAssembly runtime. See the blog posts Sandboxing Agentic AI Workflows with WebAssembly and Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk for more information about ways to secure agentic behavior.

Security is a core concern, not an afterthought. Language choice is a security decision, not a convenience decision. Start with this premise and understand the different attack vectors to guard against, and the ghost stays a friend in the machine.

Get started with NVIDIA In-Game Inferencing SDK

Try the sample with the NVIDIA In-Game Inferencing SDK. Build it, experiment, and think of ways to employ it in games, apps, and other projects.

Join us at GDC

Explore how NVIDIA RTX neural rendering and AI are shaping the next era of gaming. Get a glimpse into the future of game development with John Spitzer, vice president of Developer and Performance Technology at NVIDIA, as he unveils the latest innovations in path tracing and generative AI workflows.
Join Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA, for an interactive “Ask Me Anything” session covering the latest trends in AI.

About the Authors

About Brandon Rowlett
Brandon Rowlett is a dev tech at NVIDIA, where he helps game developers integrate AI into their titles. He focuses on local AI models that run efficiently on consumer GPUs, working primarily in Python and C++. With 25 years of experience in game development and AI, he has contributed to technologies such as DLSS and DLSS 3 frame generation.
The U.S. and China Are Pursuing Different AI Futures ieee_spectrum_ai 19.02.2026 17:03 0.679
Embedding similarity: 0.7852
Entity overlap: 0.04
Title similarity: 0.0952
Time proximity: 0.958
NLP type: other
NLP organization: Institute for AI Policy and Strategy
NLP topic: ai governance
NLP country: United States

Open original

More money has been invested in AI than it took to land on the moon. Spending on the technology this year is projected to reach up to US $700 billion , almost double last year’s spending. Part of the impetus for this frantic outlay is a conviction among investors and policymakers in the United States that it needs to “beat China.” Indeed, headlines have long cast AI development as a zero-sum rivalry between the U.S. and China, framing the technology’s advance as an arms race with a defined finish line. The narrative implies speed, symmetry, and a common objective. But a closer look at AI development in the two countries shows they’re not only not racing toward the same finish line: “The U.S. and China are running in very different lanes,” says Selina Xu , who leads China and AI policy research in New York City for Eric Schmidt, the tech investor, philanthropist, and former Google chief . “The U.S. is doubling down on scaling,” in pursuit of artificial general intelligence (AGI) Xu says, “while for China it’s more about boosting economic productivity and real-world impact.” Lumping the U.S. and China onto a single AI scoreboard isn’t just inaccurate, it can impact policy and business decisions in a harmful way. “An arms race can become a self-fulfilling prophecy,” Xu says. “If companies and governments all embrace a ‘race to the bottom’ mentality, they will eschew necessary security and safety guardrails for the sake of being ahead. That increases the odds of AI-related crises.” Where’s the Real Finish Line? As machine learning advanced in the 2010s, prominent public figures such as Stephen Hawking and Elon Musk warned that it would be impossible to separate AI’s general-purpose potential from its military and economic implications, echoing Cold War–era frameworks for strategic competition. 
“An arms race is an easy way to think about this situation even if it’s not exactly right,” says Karson Elmgren, a China researcher at the Institute for AI Policy and Strategy, a think tank in Washington, D.C. The labs, investors, and media in what’s known as frontier technology benefit from simple, comparable progress metrics, like larger models, better benchmarks, and more computing power, so they favor and compound the arms-race framing. Artificial general intelligence is the implied “finish line” if AI is an arms race. But one of the many problems with an AGI finish line is that by its very nature, a machine superintelligence would be smarter than humans and therefore impossible to control. “If superintelligence were to emerge in a particular country, there’s no guarantee that that country’s interests are going to win,” says Graham Webster, a China researcher at Stanford University. An AGI finish line also assumes the U.S. and China are both optimizing for this goal and putting the majority of their resources toward it. This isn’t the case, as the two countries have starkly different economic landscapes.

When Is the Payoff?

After decades of rapid growth, China is now facing a grimmer reality. “China has been suffering through an economic slowdown for a mixture of reasons, from real estate to credit to consumption and youth unemployment,” says Xu, adding that the country’s leaders have been “trying to figure out what is the next economic driver that can get China to sustain its growth.” Enter AI. Rather than pouring resources into speculative frontier models, Beijing has a pressing incentive to use the technology as a more immediate productivity engine. “In China we define AI as an enabler to improve existing industry, like health care, energy, or agriculture,” says AI policy researcher Liang Zheng, of Tsinghua University in Beijing.
“The first priority is to use it to benefit ordinary people.” To that end, AI investment in China is focused on embedding the technology into manufacturing, logistics, energy, finance, and public services. “It’s a long-term structural change, and companies must invest more in machines, software, and digitalization,” Liang says. “Even very small and medium enterprises are exploring use of AI to improve their productivity.” China’s AI Plus initiative encourages using AI to boost efficiency. “Having a frontier technology doesn’t really move China towards an innovation-led developed economy,” says Kristy Loke, a fellow at MATS Research who focuses on China’s AI innovation and governance strategies. Instead, she says, “It’s really important to make sure that [these tools] are able to meet the demands of the Chinese economy, which are to industrialize faster, to do more smart manufacturing, to make sure they’re producing things in competitive processes.” Automakers have embraced intelligent robots in “dark factories” with minimal human intervention; as of 2024, China had around five times as many factory robots in use as the United States. “We used to use human eyes for quality control and it was very inefficient,” says Liang. Now, computer-vision systems detect errors and software predicts equipment failures, pausing production and scheduling just-in-time maintenance. Agricultural models advise farmers on crop selection, planting schedules, and pest control. In health care, AI tools triage patients, interpret medical images, and assist diagnoses; Tsinghua is even piloting an AI “Agent Hospital” where physicians work alongside virtual clinical assistants. “In hospitals you used to have to wait a long time, but now you can use your agent to make a precise appointment,” Liang says. Many such applications use simpler “narrow AI” designed for specific tasks.
AI is also increasingly embedded across industries in the United States, but the focus tends toward service-oriented and data-driven applications, leveraging large language models (LLMs) to handle unstructured data and automate communication. For example, banks use LLM-based assistants to help users manage accounts, find transactions, and handle routine requests; LLMs help health care professionals extract information from medical notes and clinical documentation. “LLMs as a technology naturally fit the U.S. service-sector-based economy more so than the Chinese manufacturing economy,” Elmgren says. Competition and Cooperation The U.S. and China do compete more or less head-to-head in some AI-related areas, such as the underlying chips. The two have grappled to gain enough control over their supply chains to ensure national security, as recent tariff and export control fights have shown. “I think the main competitive element from a top level [for China] is to wriggle their way out of U.S. coercion over semiconductors. They want to have an independent capability to design, build, and package advanced semiconductors,” Stanford’s Webster says. Military applications of AI are also a significant arena of U.S.–China competition, with both governments aiming to speed decision making, improve intelligence, and increase autonomy in weapons systems. The U.S. Department of Defense launched its AI Acceleration Strategy last month, and China has explicitly integrated AI into its military modernization strategy under its policy of military-civil fusion . “From the perspective of specific military systems, there are incremental advantages that one side or the other can gain,” Webster says. Despite China’s commitment to military and industrial applications, it has not yet picked an AI national champion. “After Deepseek in early 2025 the government could have easily said, ‘You guys are the winners, I’ll give you all the money, please build AGI,’ but they didn’t. 
They see being ‘close enough’ to the technological frontier as important, but putting all eggs in the AGI basket as a gamble,” Loke says. American companies are also still working with Chinese technology and workers, despite a slow uncoupling of the two economies. Though it may seem counterintuitive, more cooperation—and less emphasis on cutthroat competition—could yield better results for all. “For building more secure, trustworthy AI, you need both U.S. and Chinese labs and policymakers to talk to each other, to reach consensus on what’s off limits, then compete within those boundaries,” Xu says. “The arms race narrative also just misses the actual on-the-ground reality of companies co-opting each other’s approaches, the amount of research that gets exchanged in academic communities, the supply chains and talent that permeates across borders, and just how intertwined the two ecosystems are.” A correction to this article was made on 23 February 2026. The Institute for AI Policy and Strategy is in Washington, D.C., not San Francisco.
Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy | NVIDIA Technical Blog nvidia_dev_blog 23.02.2026 18:00 0.676
Embedding similarity: 0.8096
Entity overlap: 0.0588
Title similarity: 0.2632
Time proximity: 0.4256
NLP type: experiment
NLP organization: NVIDIA
NLP topic: computational efficiency
NLP country:

Open original

As the sizes of AI models and datasets continue to increase, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models. Using lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, enhancing training efficiency and lowering costs. This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion token pretraining runs and downstream benchmarks: 8-bit floating point per-tensor current scaling (FP8-CS) Mixed precision training with FP8 (MXFP8) NVFP4 precision training using NVIDIA NeMo Megatron Bridge , an open source library that is part of NVIDIA NeMo framework We present practical, large-scale results showing how low-precision training delivers up to ~1.6x higher throughput, substantial memory savings, and near-identical model quality using production-ready recipes you can adopt today. ​​What is low-precision training? Low-precision training uses numerical formats with fewer bits to represent weights and activations during model training. This reduces memory bandwidth and computational demand, enabling GPUs to process more operations per cycle and significantly increase training throughput. Low-precision formats FP8-CS applies FP8 to linear layers using scaling factors derived from the statistical properties of each tensor at the current training step. MXFP8 extends the FP8 approach with block-level scaling optimized for the NVIDIA Blackwell architecture , with each block covering 32 tensor elements. NVFP4 further improves memory efficiency and throughput by using the 4-bit format for tensor values with a hierarchical two-level scaling strategy. Figure 1. Comparison of FP8, MXFP8, and NVFP4 low-precision formats. 
E stands for the exponent and M for Mantissa in the numerical representation Can low-precision training match BF16 accuracy at scale? To validate the practical impact of low-precision training for real-world large-model pretraining, the team evaluated both the training convergence and downstream task performance across two widely used dense transformer architectures: Llama 3 8B and an NVIDIA internal research 8B model (Research-8B with dense grouped query attention (GQA) architecture that is similar to Llama 3 8B). The models were trained on 1 trillion tokens. Experimental setup: Isolating the impact of precision The following large-scale pretraining experiments were run: Four numeric precisions : BF16 (baseline), FP8-CS, MXFP8, and NVFP4 Two model architectures : Llama 3 8B and Research-8B Training software and hardware : NeMo Megatron Bridge on NVIDIA B200 GPUs Two datasets : Lingua DCLM Dataset and an internal dataset. Llama 3 8B was trained on both datasets and Research-8B was trained on the internal NVIDIA research dataset Convergence behavior: Training stability across precisions Figures 2, 3, and 4 show training and validation loss curves for both models and datasets. Low-precision training closely tracks with the BF16 baseline, demonstrating stable and consistent convergence across precisions. In all cases, NVFP4 shows slightly higher loss but downstream accuracies remain unaffected. See Table 1 for more details. Figure 2. Training and validation loss for the Llama 3 8B trained on the Lingua DCLM dataset across BF16, FP8-CS, MXFP8, and NVFP4 Figure 3. Training and validation loss for Llama 3 8B trained on the internal NVIDIA research dataset across BF16, FP8-CS, MXFP8, and NVFP4 Figure 4. Training and validation loss for Research-8B trained on the internal dataset Downstream evaluation: Accuracy is preserved To assess whether low-precision training impacts real-world performance, we evaluated all pretrained models on standard downstream benchmarks. 
All evaluations were run in BF16 precision to isolate the impact of training precision. Table 1 shows the results. Despite minor differences in training and validation loss, all low-precision formats achieve downstream task accuracy comparable to BF16.

Model | Dataset | Precision | MMLU (↑) | HellaSwag (↑) | WinoGrande (↑) | ARC-C (↑)
Llama 3 8B | DCLM | BF16 | 45.98 | 76.44 | 70.17 | 51.28
Llama 3 8B | DCLM | FP8-CS | 46 | 75.25 | 70.24 | 49.91
Llama 3 8B | DCLM | MXFP8 | 46.56 | 75.46 | 71.27 | 51.11
Llama 3 8B | DCLM | NVFP4 | 45.64 | 75.59 | 69.38 | 51.28
Llama 3 8B | Internal dataset | BF16 | 52.73 | 75.71 | 67.88 | 51.37
Llama 3 8B | Internal dataset | FP8-CS | 52.46 | 75.65 | 70.17 | 54.52
Llama 3 8B | Internal dataset | MXFP8 | 53.7 | 75.54 | 69.69 | 51.62
Llama 3 8B | Internal dataset | NVFP4 | 52.83 | 75.04 | 71.98 | 53.58
Research-8B | Internal dataset | BF16 | 53 | 76.98 | 70.4 | 55.89
Research-8B | Internal dataset | FP8-CS | 52.62 | 75.81 | 70.8 | 54.44
Research-8B | Internal dataset | MXFP8 | 52.38 | 76.55 | 69.77 | 53.58
Research-8B | Internal dataset | NVFP4 | 52.21 | 76.19 | 70.32 | 54.95

Table 1. Downstream task accuracy (%) for Llama 3 8B and Research-8B across BF16, FP8-CS, MXFP8, and NVFP4 training

Key insights

Key insights from these experiments are detailed below.

Low-precision training matches BF16 convergence: FP8-CS, MXFP8, and NVFP4 achieve pretraining and validation losses very close to BF16, showing minimal degradation.
Downstream accuracy is preserved: Across all models and benchmarks, low-precision training delivers downstream task accuracy comparable to BF16, demonstrating that reduced precision maintains model effectiveness.
MXFP8 performs slightly better than standard FP8: This is likely due to its finer-grained scaling mechanism, which better captures local dynamic range within tensors.
NVFP4 with proper calibration delivers competitive results despite aggressive compression: The following recipe is the empirical sweet spot: AdamW ϵ=1e-8, LR=6e-4 → 6e-6, GBS=768.
Selective BF16 layers are essential for NVFP4: Ablation studies show that fully NVFP4 models diverge. Stable training requires keeping some layers in BF16, particularly near the end of the network, to mitigate NVFP4 quantization error. In these experiments, maintaining the final four transformer layers in BF16 proved sufficient.
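The insight about finer-grained scaling can be illustrated with a small, self-contained experiment. This is only a sketch of the idea. It swaps in a uniform integer grid (the LEVELS constant) as a stand-in for the real FP8 and FP4 encodings, and the test tensor is synthetic. The only value taken from the article is the block size of 32 elements; every function name here is hypothetical.

```python
import random

LEVELS = 127  # assumed symmetric integer grid, not a real FP8/FP4 encoding


def scale_for(values):
    """Scale that maps the largest magnitude onto the top grid level."""
    amax = max(abs(v) for v in values)
    return amax / LEVELS if amax > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest grid point, clamp, and rescale."""
    return [max(-LEVELS, min(LEVELS, round(v / scale))) * scale for v in values]


def per_tensor(values):
    """One scaling factor for the whole tensor."""
    return quantize(values, scale_for(values))


def per_block(values, block=32):
    """One scaling factor per block of `block` consecutive elements."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        out.extend(quantize(chunk, scale_for(chunk)))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def demo_tensor(seed=0):
    """Three blocks of small values followed by one block of large values."""
    rng = random.Random(seed)
    small = [rng.uniform(-0.01, 0.01) for _ in range(96)]
    large = [rng.uniform(-10.0, 10.0) for _ in range(32)]
    return small + large


if __name__ == "__main__":
    values = demo_tensor()
    print("per-tensor error:", mean_abs_error(values, per_tensor(values)))
    print("per-block error: ", mean_abs_error(values, per_block(values)))
```

With a single tensor-wide scale, the few large values set the grid spacing and the small values collapse toward zero. With per-block scales, each block gets a grid matched to its own range, which is the local dynamic range effect described above.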
Advantages of FP8, MXFP8, and NVFP4 training

Low-precision formats deliver clear gains in both training throughput and memory efficiency, enabling faster end-to-end training and better scalability on NVIDIA Blackwell GPUs.

| Precision | Micro-batch size | Throughput (TFLOP/s/GPU) | Speedup versus BF16 |
|---|---|---|---|
| BF16 | 2 | 1165 | – |
| FP8-CS (F1L1) | 2 | 1547 | 1.33x |
| MXFP8 | 2 | 1540 | 1.32x |
| NVFP4 (F0L4) | 4 | 1850 | 1.59x |

Table 2. Throughput comparison for Llama 3 8B training on NVIDIA GB200 NVL72 shows up to 1.59x speedup with NVFP4 compared to BF16 (GBS=128, sequence length=8192). FxLy denotes that the first x and last y transformer block layers are kept in BF16 precision.

Faster end-to-end training

Using 8-bit or 4-bit numeric formats drastically reduces computational overhead by enabling GPUs to process more operations per clock cycle. Gains in throughput can be up to 1.59x over the BF16 baseline (Table 2). These gains translate directly into faster time-to-train for large-scale models.

GPU memory savings and better scalability

Using lower bit-width formats reduces the memory footprint of weights and activations, allowing larger models or batch sizes on the same hardware. NVFP4 efficiency enables the micro-batch size to double (from 2 to 4) during pretraining, directly improving throughput and scalability.

Table 3 provides a detailed breakdown of memory usage across training components. Lower-precision formats significantly reduce parameter and activation storage while preserving FP32 optimizer state, enabling higher throughput and larger batch sizes without compromising training stability.

| Precision | Parameter | Gradients | Optimizer: momentum | Optimizer: variance | Optimizer: master parameter | Others |
|---|---|---|---|---|---|---|
| FP16 | FP16 | FP32 | FP32 | FP32 | FP32 | |
| BF16 | BF16 | BF16 | FP32 | FP32 | FP32 | |
| FP8 (tensor scaling) | FP8x2 | BF16 | FP32 | FP32 | FP32 | Scaling factor per weight tensor |
| MXFP8 | FP8x2 | BF16 | FP32 | FP32 | FP32 | (Scaling factor per 32 elements) x 2 |
| NVFP4 | FP4 | BF16 | FP32 | FP32 | FP32 | 16×16 2D block scales replicated for each 1×16 block |

Table 3. Memory footprint across training components for different precision formats

Low-precision training with NeMo Megatron Bridge

NeMo Megatron Bridge is an open PyTorch-native library within the NVIDIA NeMo framework. It bi-directionally connects Hugging Face and Megatron Core model checkpoints. It provides the optimized training and multi-node parallelism required to pretrain, SFT, and LoRA-tune generative AI models at maximum throughput.

Adopting low-precision training using the NeMo Megatron Bridge library is straightforward. You can use ready-to-use low-precision recipes for various models to experiment with different precision formats by changing a single configuration flag. An example for Llama 3 8B is shown below:

```python
from megatron.bridge.recipes.llama import (
    llama3_8b_low_precision_pretrain_config as low_precision_pretrain_config,
)
from megatron.bridge.training.gpt_step import forward_step
# The original snippet did not import `pretrain`; this import path is an assumption.
from megatron.bridge.training.pretrain import pretrain

# Should be one of:
#   "bf16_with_mxfp8_mixed",
#   "bf16_with_fp8_current_scaling_mixed",
#   "bf16_with_nvfp4_mixed"
precision = "bf16_with_fp8_current_scaling_mixed"

cfg = low_precision_pretrain_config(
    mixed_precision_recipe=precision,
    train_iters=100,
    lr_warmup_iters=10,
    lr_decay_iters=90,
    mock=True,  # use mock dataset
)

pretrain(config=cfg, forward_step_func=forward_step)
```

You can easily switch between precision formats to evaluate performance, memory savings, and convergence behavior, without modifying model code or optimizer logic.

Train faster and scale efficiently

Low-precision training formats like FP8 with current scaling, MXFP8, and NVFP4 offer exciting new avenues for faster, more efficient deep learning training compared to the widely adopted BF16. Their advantages in speed and memory savings open doors for training larger, more complex models. Empirical evidence from Llama 3 8B and internal research models confirms that training with low precision matches BF16 performance on both pretraining metrics and downstream tasks.
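As a quick check on the speed claims, the speedup column of Table 2 can be recomputed from the reported throughput figures. This is only arithmetic on the published numbers; the relative time estimate at the end assumes the same total FLOPs per run, which is a simplification and not something the article states.

```python
# Throughput values (TFLOP/s/GPU) copied from Table 2.
throughput = {
    "BF16": 1165,
    "FP8-CS (F1L1)": 1547,
    "MXFP8": 1540,
    "NVFP4 (F0L4)": 1850,
}

baseline = throughput["BF16"]
speedup = {name: round(value / baseline, 2) for name, value in throughput.items()}

# Under the simplifying assumption of equal total FLOPs, relative wall-clock time
# is the inverse of the throughput ratio.
relative_time = {name: round(baseline / value, 2) for name, value in throughput.items()}

for name in throughput:
    print(f"{name}: {speedup[name]}x speedup, {relative_time[name]} of BF16 time")
```

The ratios reproduce the 1.33x, 1.32x, and 1.59x entries in the table.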
Get started with low-precision training

As model sizes continue to scale, low-precision training will be foundational to building the next generation of models. With native NVIDIA Blackwell GPU support and production-ready low-precision recipes in NeMo Megatron Bridge, you can try these techniques today.

To get started quickly, try the Megatron Bridge Training Tutorial notebook. It walks through using these low-precision recipes end to end and demonstrates how they can significantly accelerate training workloads.

About the Authors

Aditya Vavre is a deep learning algorithms engineer at NVIDIA, where he focuses on advancing efficient large-scale language model training and architecture design. His past work includes 4-bit and 8-bit LLM pretraining, quantization-aware training and distillation, and sparse attention mechanisms, enabling more efficient long-context and large-scale transformer models. Prior to NVIDIA, he contributed to research and development in NLP and AI applications during his time as a Research Engineer at Sony, building retrieval-based dialogue systems and text-to-video generation pipelines. Aditya holds a master’s degree in Computer Science from The University of Texas at Austin and a bachelor’s degree from IIT Bombay. His interests lie at the intersection of scalable deep learning systems, model efficiency, and next-generation foundation model architectures.

Nima Tajbakhsh is a deep learning algorithm manager with eight years of industry expertise, specializing in computer vision, LLM, and multimodal GenAI models. Within NVIDIA NeMo, he spearheads the development and optimization of training and inference workflows for diverse GenAI models.
By integrating cutting-edge AI technologies, his team drives innovation and advancement in the field, ensuring NeMo remains at the forefront of AI research and application.

Wenwen Gao is a senior product manager for NeMo at NVIDIA, focusing on LLM training framework and microservices. Her past experience includes LLM inference (NIM) and recommender systems (Merlin). She holds a B.S. in computer science from the University of Toronto and an M.B.A. from the MIT Sloan School of Management.

Selvaraj Anandaraj is a Deep Learning Performance Engineer working on accelerating Deep Learning workloads using NVIDIA hardware and software stacks. His recent work is focused on having a highly performant software stack to train and infer large language models at scale. He earned a Master’s degree from the University of Wisconsin-Madison with a specialization in Machine Learning systems.

Amit Bleiweiss is a senior data scientist at NVIDIA, where he focuses on large language models and generative AI. He has 25 years of experience in applied machine learning and deep learning, with over 50 patents and publications in the domain. Amit received his MSc from Hebrew University of Jerusalem, where he specialized in machine learning.
Scaling AI for everyone openai 27.02.2026 05:30 0.676
Embedding sim.0.7664
Entity overlap0.3077
Title sim.0.0448
Time proximity1
NLP type: funding
NLP organization: SoftBank
NLP topic: ai infrastructure
NLP country:

Open original

Today we’re announcing $110B in new investment at a $730B pre-money valuation. This includes $30B from SoftBank, $30B from NVIDIA, and $50B from Amazon.
CORPGEN advances AI agents for real work microsoft_research 26.02.2026 17:06 0.673
Embedding sim.0.8167
Entity overlap0.0357
Title sim.0.0263
Time proximity0.6938
NLP type: scientific_publication
NLP organization: Microsoft
NLP topic: ai agents
NLP country:

Open original

CORPGEN advances AI agents for real work

Published February 26, 2026

By Abubakarr Jaye, Applied Scientist 2; Nigel Boachie Kumankumah, Software Engineer; Chidera Biringa, Applied Scientist 2; Anjel Patel, Software Engineer; Dayquan Julienne, Product Manager 2; Tianwei Chen, Senior Software Engineering Manager; Sulaiman Vesal, Senior Applied Science Manager, Microsoft’s AI Development Acceleration Program within the office of the CTO

At a glance
- Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
- Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
- CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
- Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.
Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution (memory overload, cross-task interference, dependency complexity, and reprioritization) in a targeted way.
- Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.
- Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination.
- A tiered memory system enables selective recall of task-related information rather than retaining everything in active context.
- Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how these mechanisms fit together within CORPGEN’s architecture.

Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication. There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use.

Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue. When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving.
The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.

Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.
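The idea of storing records of completed tasks and reusing them for structurally similar work can be sketched in a few lines. This is a hypothetical illustration only, not CORPGEN's implementation: the `TaskRecord` and `ExperienceStore` names, the use of Jaccard similarity over step names, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """A completed task and the ordered steps that solved it (hypothetical schema)."""
    description: str
    steps: list


def jaccard(a, b):
    """Overlap between two sets of step names, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ExperienceStore:
    """Keeps records of finished tasks and retrieves the most structurally similar one."""

    def __init__(self):
        self.records = []

    def add(self, record):
        self.records.append(record)

    def most_similar(self, planned_steps, min_score=0.5):
        best, best_score = None, 0.0
        for record in self.records:
            score = jaccard(set(record.steps), set(planned_steps))
            if score > best_score:
                best, best_score = record, score
        # Only reuse experience when the overlap is strong enough to be useful.
        return best if best_score >= min_score else None


store = ExperienceStore()
store.add(TaskRecord("Quarterly budget chart",
                     ["open_excel", "filter_rows", "make_chart", "save"]))
store.add(TaskRecord("Client follow-up email",
                     ["open_outlook", "draft_email", "send"]))

match = store.most_similar(["open_excel", "filter_rows", "make_pivot", "save"])
no_match = store.most_similar(["open_powerpoint", "add_slide"])
print(match.description if match else None, no_match)
```

A real system would need a richer notion of structural similarity than step-name overlap, but the sketch shows the basic loop: record successes, then look them up before starting a new task.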
CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.

Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 team and the Mem0 project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation. Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.
Our agreement with the Department of War openai 28.02.2026 12:30 0.671
Embedding sim.0.7886
Entity overlap0.0909
Title sim.0.0845
Time proximity0.8155
NLP type: other
NLP organization: OpenAI
NLP topic: ai safety
NLP country:

Open original

Details on OpenAI’s contract with the Department of War, outlining safety red lines, legal protections, and how AI systems will be deployed in classified environments.
Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute | NVIDIA Technical Blog nvidia_dev_blog 18.02.2026 17:00 0.671
Embedding sim.0.7516
Entity overlap0.0938
Title sim.0.1899
Time proximity0.994
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry.

Frameworks like PyTorch address this by implementing kernels in CUDA C++, either handwritten or by leveraging libraries like the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often better, since its primitives are highly optimized per architecture and are rigorously tested. But exposing CUB to Python traditionally means building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators, limiting flexibility on the Python side.

The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.

Using cuda.compute helped an NVIDIA CCCL team top the GPU MODE leaderboard, a kernel competition hosted by an online community with more than 20,000 members and a focus on learning and improving GPU programming. GPU MODE hosts the kernel competitions to find the best implementations for a variety of tasks, from simple vector addition to more complex block matrix multiplications.

The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel primitives across GPU architectures through high-level abstractions. It achieved the most first-place finishes overall on the tested GPU architectures: NVIDIA B200, NVIDIA H100, NVIDIA A100, and NVIDIA L4.

In this blog we’ll share more details about how we were able to place so high on the leaderboard.

CUDA Python: GPU performance meets productivity

CUB offers highly optimized CUDA kernels for common parallel operations, including those featured in the GPU MODE competition.
These kernels are architecturally tuned and widely considered near speed-of-light implementations.

The cuda.compute library supports custom types and operators defined directly in Python. Under the hood, it just-in-time (JIT) compiles specialized kernels and applies link-time optimization to deliver near-SOL performance on par with CUDA C++. You stay in Python while getting the flexibility of templates and the performance of tuned CUDA kernels.

With cuda.compute you get:
- Fast, composable CUDA workflows in Python: Develop efficient and modular CUDA applications directly within Python.
- Custom data types and operators: Utilize custom data types and operators without the need for C++ bindings.
- Optimized performance: Achieve architecture-aware performance through proven CUB primitives.
- Rapid iteration: Accelerate development with JIT compilation while maintaining CUDA C++ levels of performance.

JIT compilation accelerates the development cycle by providing the flexibility and rapid iteration cycles that developers need without compromising performance.

The leaderboard results

Using cuda.compute, we submitted entries across GPU MODE benchmarks for PrefixSum, VectorAdd, Histogram, Sort, and Grayscale (look for username Nader).

For algorithms like sort, the CUB implementation was two to four times faster than the next best submission. This is the CCCL promise in action: SOL-class algorithms that outperform custom kernels for standard primitives you’d otherwise spend months building.

Where we didn’t take first place, the gap typically came down to us not having a tuning policy for that specific GPU. In some instances, our implementation was a more general solution, while higher-ranked submissions were specialized to specific problem sizes.

In other cases, the first-place submission was already using CUB or cuda.compute under the hood.
This underscores that these libraries already represent the performance ceiling for many standard GPU algorithms, and that their performance characteristics are now well understood and intentionally relied upon by leading submissions.

This isn’t about winning

Leaderboard results are a byproduct; the real objective is learning with the community, benchmarking transparently, and demonstrating the power of Python for high-performance GPU work.

Our goal isn’t to discourage hand-written CUDA kernels. There are plenty of valid cases for custom kernels (novel algorithms, tight fusion, or specialized memory access patterns), but for standard primitives (sort, scan, reduce, histogram, etc.), your first move should be a proven, high-performance implementation. With cuda.compute, those tuned CUB primitives are now accessible directly from native Python, allowing you to build high-quality, production-grade, GPU-accelerated Python libraries.

This is great news for anyone building the next CuPy, RAPIDS component, or a custom Python GPU-accelerated library: faster iteration, fewer glue layers, and production-grade performance, all while staying in pure Python.

How cuda.compute looks in practice

One of the first examples any person writes when learning GPU programming is a vector addition. Using cuda.compute we can solve this using pure Python by calling a device-wide primitive.
```python
import torch  # needed for the build-time tensors; missing from the original snippet

import cuda.compute
from cuda.compute import OpKind

# Build-time tensors (used to specialize the callable)
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_out = torch.empty(2, 2, dtype=torch.float16, device="cuda")

# JIT compile the transform kernel
transform = cuda.compute.make_binary_transform(build_A, build_B, build_out, OpKind.PLUS)


# Defining custom_kernel is required to submit to the GPU MODE competition
def custom_kernel(data):
    # Invoke the transform operation on the input data
    A, B, out = data
    transform(A, B, out, A.numel())
    return out
```

You can find more cuda.compute examples on the GPU MODE Leaderboard. The pattern is consistent: simple code with speed-of-light performance, achieved by calling device-wide building blocks that are automatically optimized by CCCL for every GPU generation. Other top-performing submissions for the VectorAdd category required dropping into C++ and inline PTX, resulting in code that is highly architecture-dependent.

Try cuda.compute today

If you’re building Python GPU software, custom pipelines, library components, or performance-sensitive code, cuda.compute gives you the option to use CCCL CUB primitives directly in Python and leverage building blocks designed for architecture-aware speed-of-light performance.

To try cuda.compute today, you can install it via pip or conda:

```shell
pip install cuda-cccl[cu13]    # or cuda-cccl[cu12]
conda install -c conda-forge cccl-python cuda-version=12    # or cuda-version=13
```

We’re building this with the community: your feedback and benchmarks shape our roadmap, so don’t hesitate to reach out to us on GitHub or in the GPU MODE Discord.

About the Authors

Daniel Rodriguez is a technical product manager on the CUDA Python and DevTools teams at NVIDIA.
His efforts are focused on building tooling for data scientists and high-performance computing engineers. Daniel has a background in Electrical Engineering and Data Analytics. Prior to NVIDIA, he worked at Google and multiple enterprise data science companies where he built data-related products and contributed to many open source projects.

Nader Al Awar is a senior software engineer at NVIDIA and a member of the CUDA Core Compute Libraries (CCCL) team, where he focuses on the development of CUB and cuda.compute. He earned his doctorate in electrical and computer engineering from the University of Texas at Austin, specializing in high-performance computing for Python. Nader is passionate about bridging the gap between high-level languages and hardware by accelerating Python code using GPUs.
Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile | NVIDIA Technical Blog nvidia_dev_blog 05.03.2026 17:00 0.659
Embedding sim.0.7566
Entity overlap0.0303
Title sim.0.2623
Time proximity0.7311
NLP type: other
NLP organization: nvidia
NLP topic: computational efficiency
NLP country:

Open original

In this post, we dive into one of the most critical workloads in modern AI: Flash Attention. You’ll learn:
- How to implement Flash Attention using NVIDIA cuTile: walk through the complete code for a production-ready implementation.
- The “trap and rescue” optimization journey: this case study shows how naive optimizations (like just increasing tile size) can backfire, and how to fix them.
- Advanced techniques like FMA patterns, fast math, loop splitting, and adaptive tiling for maximum performance.

Environment requirements:
- CUDA 13.1 or higher
- GPU architecture: Compute capability 8.X, 10.X, 11.X, 12.X (NVIDIA Ampere, NVIDIA Ada, NVIDIA Blackwell)
- Python: 3.10 or higher

See the quickstart doc for more information on installing cuTile Python.

What is attention?

The attention mechanism is the computational heart of transformer models. Given a sequence of tokens, attention enables each token to “look at” every other token and decide how much to weigh their contributions. Mathematically, for input matrices Query (\(Q\)), Key (\(K\)), and Value (\(V\)), the output is:

\(O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\)

Where:
- \(Q\) has shape \((N,d)\): \(N\) query tokens, each with dimension \(d\).
- \(K\) has shape \((N,d)\): \(N\) key tokens.
- \(V\) has shape \((N,d)\): \(N\) value tokens.
- The intermediate \(QK^{T}\) matrix has shape \((N,N)\), which is the problem.

The memory bandwidth problem

For a sequence length of \(N = 16{,}384\) (common in modern LLMs), the attention matrix \(QK^{T}\) contains \(N^2 \approx 268\) million elements. In FP16, that’s 512 MB of intermediate storage per attention head, per batch item.
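The storage figure above follows from simple arithmetic, sketched here for convenience. The helper name `attention_matrix_bytes` is made up for this example, and 2 bytes per element is the standard size of an FP16 value.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to materialize one N x N attention matrix (FP16 by default)."""
    return seq_len * seq_len * bytes_per_element


n = 16_384
elements = n * n                                   # number of attention scores
size_mib = attention_matrix_bytes(n) / 2**20       # mebibytes per head, per batch item
doubled_mib = attention_matrix_bytes(2 * n) / 2**20

print(f"{elements:,} elements, {size_mib:.0f} MiB; at 2N: {doubled_mib:.0f} MiB")
```

Because the cost is quadratic in sequence length, doubling the context quadruples the intermediate storage, which is why avoiding the full matrix matters more as context grows.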
Standard attention implementations:
1. Compute the full \(N \times N\) attention matrix and write it to global memory (slow)
2. Apply softmax row by row
3. Read the matrix back and multiply by \(V\)

This approach is memory-bound: the GPU spends most of its time waiting for data to move between HBM and compute units, rather than computing.

How Flash Attention solves the memory bandwidth problem

Flash Attention (introduced by Dao et al., 2022) is an IO-aware algorithm that never materializes the full \(N \times N\) matrix. Instead, it:
- Tiles the computation: Processes \(Q, K, V\) in small blocks that fit in fast on-chip shared memory (SMEM)
- Uses online softmax: Computes softmax incrementally without needing the full row
- Fuses operations: Combines the matrix multiply and softmax into a single kernel pass

The result is a 2-4x speedup and significant memory savings, enabling longer context lengths.

Figure 1. Tiled Flash Attention computation

Understanding online softmax

The key algorithmic insight of Flash Attention is the online softmax trick. The numerically stable safe softmax requires knowing the maximum value across the entire row before computing:

\(\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}\)

But if we’re processing tiles, we don’t have access to the full row. Online softmax solves this by maintaining running statistics that can be updated incrementally.
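To see why the safe softmax subtracts the row maximum, here is a minimal pure-Python sketch comparing it with a naive version on large scores. The function names are illustrative. In Python, `math.exp` raises `OverflowError` for large inputs; GPU floating-point math would typically produce `inf` and then `NaN` instead, but the underlying problem is the same.

```python
import math


def naive_softmax(xs):
    """Direct definition: exponentiate, then normalize. Overflows for large scores."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def safe_softmax(xs):
    """Subtract the maximum first so every exponent is <= 0 and exp() stays in (0, 1]."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1000.0, 999.0, 998.0]

try:
    naive_softmax(scores)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = safe_softmax(scores)
print(naive_overflowed, [round(p, 4) for p in probs])
```

The two versions are mathematically identical, since the factor \(e^{-\max(x)}\) cancels between numerator and denominator. Only the safe version is usable in finite precision, and it needs the maximum of the whole row up front.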
The online softmax algorithm

We maintain two running values for each row:

- \(m_i\): The maximum value seen so far (for numerical stability)
- \(l_i\): The sum of exponentials seen so far (the softmax denominator)

When we process a new tile with values \(x_{new}\):

1. Update the maximum: \(m_{new} = \max(m_i, \max(x_{new}))\)
2. Compute the correction factor: \(\alpha = e^{m_i - m_{new}}\) (rescales previous work)
3. Update the sum: \(l_i = l_i \cdot \alpha + \sum e^{x_{new} - m_{new}}\)
4. Update the accumulator: \(acc = acc \cdot \alpha + P_{new} \cdot V_{tile}\)

Here \(P_{new} = e^{x_{new} - m_{new}}\) is the matrix of (not yet normalized) attention weights for the current tile, and \(V_{tile}\) is the value tile corresponding to the key tile of the current iteration. After each tile, \(m_i\) is set to \(m_{new}\). At the end, we normalize:

\(O = acc / l_i\)

This enables us to compute an exact softmax without ever storing the full row.

Causal attention and grouped-query attention

Before diving into the implementation, let’s understand two important attention variants used in modern LLMs.

Causal attention

In autoregressive language models like GPT, LLaMA, and Claude, each token can only attend to previous tokens in the sequence, not future ones. This prevents “cheating” during training, where the model looks ahead to predict the next word. Mathematically, we apply a triangular mask to the attention scores:

\(\text{mask}_{ij} = \begin{cases} 0 & \text{if } i \geq j \text{ (query position at or after key position)} \\ -\infty & \text{if } i < j \text{ (future tokens)} \end{cases}\)

The masked attention becomes:

\(O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + \text{mask}\right)V\)

Adding \(-\infty\) to future positions ensures they become zero after softmax, effectively blocking information flow from future tokens.

Figure 2. Causal attention mask for four tokens

With causal masking, roughly half the attention matrix is masked (the upper triangle). We can skip computing these masked tiles entirely, providing close to a 2x algorithmic speedup. This is crucial for the K-loop splitting optimization.
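The online softmax recurrence and the additive causal mask described above can be combined in a small reference sketch. It is plain Python for a single query row, not cuTile code; the function names, the toy scores, and the tile width of 3 are assumptions chosen for illustration. It checks that the tiled update reproduces the full-row safe softmax.

```python
import math


def causal_row(raw_scores, query_pos):
    """Additive causal mask for one query: keys after query_pos get -inf."""
    return [s if j <= query_pos else -math.inf for j, s in enumerate(raw_scores)]


def reference_row(scores, values):
    """Safe softmax over the whole row, then the weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_row(scores, values, tile_n):
    """Same result, visiting keys tile by tile with running m_i, l_i, acc."""
    dim = len(values[0])
    m_i, l_i, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(scores), tile_n):
        x_new = scores[start:start + tile_n]
        v_tile = values[start:start + tile_n]
        m_new = max(m_i, max(x_new))                   # 1. update the maximum
        alpha = math.exp(m_i - m_new)                  # 2. correction factor
        p_new = [math.exp(x - m_new) for x in x_new]   # unnormalized weights
        l_i = l_i * alpha + sum(p_new)                 # 3. update the sum
        acc = [acc[d] * alpha + sum(p * v[d] for p, v in zip(p_new, v_tile))
               for d in range(dim)]                    # 4. update accumulator
        m_i = m_new
    return [a / l_i for a in acc]                      # final normalization


raw = [0.5, 2.0, -1.0, 3.0, 0.1, 1.5, -0.7, 2.2]
vals = [[float(j), float(j * j)] for j in range(len(raw))]
masked = causal_row(raw, query_pos=5)
expected = reference_row(masked, vals)
tiled = online_row(masked, vals, tile_n=3)
print(expected, tiled)
```

The last tile in this example holds only masked keys, so it contributes nothing to `l_i` or `acc`. That is the observation that lets a real kernel skip such tiles rather than compute them.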
Grouped-query attention

Standard multi-head attention has separate \(K,V\) matrices for each attention head, leading to high memory usage:

- Multi-head attention (MHA): 32 query heads → 32 K/V heads (1:1 ratio)
- Grouped-query attention (GQA): 32 query heads → 4 K/V heads (8:1 ratio)
- Multi-query attention (MQA): 32 query heads → 1 K/V head (32:1 ratio)

In GQA, multiple query heads share the same K/V heads. For example, with 32 query heads and 4 K/V heads:

- Query heads 0-7 use K/V head 0
- Query heads 8-15 use K/V head 1
- Query heads 16-23 use K/V head 2
- Query heads 24-31 use K/V head 3

This reduces K/V cache size by 8x during inference, which is critical for serving long-context models. Modern LLMs like Llama 2, Llama 3, Mistral, and Qwen use GQA extensively.

When implementing GQA in Flash Attention, each CUDA block computes attention for one query head, but loads the appropriate shared K/V head:

    head_idx = bid_y % num_heads                # Which query head (0-31)
    kv_head_idx = head_idx // query_group_size  # Which K/V head (0-3)

With a query group size of 8, query heads 0-7 all map to kv_head_idx = 0, sharing the same K/V tiles in memory.

Part 1: The Flash Attention kernel in CUDA Tile

Let’s implement Flash Attention step by step. Our baseline uses small 64×64 tiles and straightforward code: correct but not yet optimized.

1. Defining the kernel interface

In cuTile, the @ct.kernel decorator marks a Python function as a GPU kernel. We pass compile-time constants using ct.Constant[T] type annotations:

    import math
    import cuda.tile as ct

    # Type aliases for compile-time constants
    ConstInt = ct.Constant[int]
    ConstBool = ct.Constant[bool]

    # Conversion factor: we use exp2 instead of exp for efficiency
    INV_LOG_2 = 1.0 / math.log(2)

    @ct.kernel()
    def fmha_kernel(
        Q, K, V, Out,                # Input/output tensors
        qk_scale: float,             # Scale factor (1/sqrt(d))
        input_pos: int,              # Position offset for causal masking
        TILE_D: ConstInt,            # Head dimension (for example, 128)
        H: ConstInt,                 # Number of attention heads
        TILE_M: ConstInt,            # Tile size for Q dimension (for example, 64)
        TILE_N: ConstInt,            # Tile size for K/V dimension (for example, 64)
        QUERY_GROUP_SIZE: ConstInt,  # For Grouped Query Attention (GQA)
        CAUSAL: ConstBool,           # Whether to apply causal mask
        EVEN_K: ConstBool,           # Whether K length is divisible by TILE_N
    ):

2. Block ID mapping

Each CUDA block computes one tile of the output. Using ct.bid, we map the 2D grid to batch/head indices:

    # Get block indices
    bid_x = ct.bid(0)  # Which tile along the sequence dimension
    bid_y = ct.bid(1)  # Which batch-head combination

    # Decode batch and head from flattened index
    batch_idx = bid_y // H
    head_idx = bid_y % H

    # For Grouped Query Attention: multiple Q heads share one K/V head
    off_kv_h = head_idx // QUERY_GROUP_SIZE

3. Initializing accumulators

Before the main loop, we initialize the online softmax state and output accumulator:

    # Convert scale for base-2 exponential (faster than natural exp)
    qk_scale = qk_scale * INV_LOG_2

    # Create position indices for this tile
    offs_m = bid_x * TILE_M + ct.arange(TILE_M, dtype=ct.int32)
    offs_m += input_pos
    offs_m = offs_m[:, None]            # Shape: [TILE_M, 1]

    offs_n_tile = ct.arange(TILE_N, dtype=ct.int32)
    offs_n_tile = offs_n_tile[None, :]  # Shape: [1, TILE_N]

    # Online softmax state (float32 for numerical stability)
    m_i = ct.full((TILE_M, 1), -math.inf, dtype=ct.float32)  # Running max
    l_i = ct.full((TILE_M, 1), 0.0, dtype=ct.float32)        # Running sum
    acc = ct.full((TILE_M, TILE_D), 0.0, dtype=ct.float32)   # Output accumulator

We use float32 for the accumulators, even when inputs are float16, to maintain numerical precision during the iterative softmax computation.

4. Loading the query tile

The query tile is loaded once and reused across all K/V iterations:

    # Load Q tile: shape [1, 1, TILE_M, TILE_D] -> [TILE_M, TILE_D]
    q = ct.load(
        Q,
        index=(batch_idx, head_idx, bid_x, 0),
        shape=(1, 1, TILE_M, TILE_D)
    ).reshape((TILE_M, TILE_D))

The ct.load function handles boundary conditions automatically when the tile extends past the tensor edge.

5. The main loop over K/V tiles

This is the heart of Flash Attention. We iterate over K/V tiles:

    # Calculate loop bounds
    m_end = input_pos + (bid_x + 1) * TILE_M
    k_seqlen = K.shape[2]

    if CAUSAL:
        # For causal attention, stop early (future tokens are masked)
        Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N)
    else:
        Tc = ct.cdiv(k_seqlen, TILE_N)

    for j in range(0, Tc):
        # --- Step A: Load Key tile and compute QK^T ---
        k = ct.load(
            K,
            index=(batch_idx, off_kv_h, 0, j),
            shape=(1, 1, TILE_D, TILE_N),
            order=(0, 1, 3, 2),  # Transpose for correct layout
            latency=2            # Hint for memory prefetching
        ).reshape((TILE_D, TILE_N))

        # Matrix multiply: Q @ K^T
        qk = ct.full((TILE_M, TILE_N), 0.0, dtype=ct.float32)
        qk = ct.mma(q, k, qk)    # Uses Tensor Cores automatically

The order=(0, 1, 3, 2) parameter tells the cuTile load operation to read K transposed, and latency=2 hints that we can tolerate some latency (enabling better pipelining). We then call ct.mma(q, k, qk) to perform the cuTile matrix multiply-accumulate.

6. Applying the causal mask

For autoregressive models (GPT, Llama, etc.), each token can only attend to previous tokens. Still inside the loop:

        # --- Step B: Apply causal masking ---
        if CAUSAL or not EVEN_K:
            offs_n = j * TILE_N + offs_n_tile
            mask = ct.full((TILE_M, TILE_N), True, dtype=ct.bool_)

            # Boundary mask (for non-divisible sequence lengths)
            if not EVEN_K:
                mask = mask & (offs_n < k_seqlen)

            # Causal mask: query position >= key position
            if CAUSAL:
                mask = mask & (offs_m >= offs_n)

            # Convert to additive mask: True->0, False->-inf
            mask = ct.where(mask, 0.0, -math.inf)
            qk += mask

Adding -inf to masked positions ensures they become zero after softmax.

7. Online softmax update

Now we update our running softmax statistics:

        # --- Step C: Online softmax ---
        # Find max in current tile
        qk_max = ct.max(qk, axis=-1, keepdims=True)
        qk_max_scaled = qk_max * qk_scale

        # Update running maximum
        m_ij = max(m_i, qk_max_scaled)

        # Scale QK scores
        qk = qk * qk_scale
        qk = qk - m_ij

        # Compute attention weights (using exp2 for speed)
        p = ct.exp2(qk)

        # Update running sum
        l_ij = ct.sum(p, axis=-1, keepdims=True)
        alpha = ct.exp2(m_i - m_ij)  # Correction factor
        l_i = l_i * alpha
        l_i = l_i + l_ij

        # Rescale previous accumulator
        acc = acc * alpha

8. Accumulating the output

Finally, we load the Value tile and accumulate:

        # --- Step D: Load V and accumulate ---
        v = ct.load(
            V,
            index=(batch_idx, off_kv_h, j, 0),
            shape=(1, 1, TILE_N, TILE_D),
            latency=4
        ).reshape((TILE_N, TILE_D))

        # Cast attention weights back to input dtype for Tensor Core MMA
        p = p.astype(Q.dtype)

        # Accumulate: acc += P @ V
        acc = ct.mma(p, v, acc)

        # Update max for next iteration
        m_i = m_ij

9. Final normalization and store

After processing all tiles, we normalize by the total sum and write the result:

    # --- Final: Normalize and store ---
    acc = ct.truediv(acc, l_i)
    acc = acc.reshape((1, 1, TILE_M, TILE_D)).astype(Out.dtype)
    ct.store(Out, index=(batch_idx, head_idx, bid_x, 0), tile=acc)

Launching the kernel: Host-side code

Now let’s look at the host-side code that launches the kernel:

    import math  # needed for math.sqrt below
    import torch
    from math import ceil

    def tile_fmha(q, k, v, sm_scale=None, is_causal=True):
        """
        Launch the Flash Attention kernel.

        Args:
            q: Query tensor, shape [batch, heads, seq_len, head_dim]
            k: Key tensor, shape [batch, kv_heads, seq_len, head_dim]
            v: Value tensor, shape [batch, kv_heads, seq_len, head_dim]
            sm_scale: Softmax scale (default: 1/sqrt(head_dim))
            is_causal: Whether to apply causal masking

        Returns:
            Output tensor, same shape as q
        """
        if sm_scale is None:
            sm_scale = 1.0 / math.sqrt(q.size(-1))

        batch_size, num_heads, seq_len, head_dim = q.shape
        _, num_kv_heads, _, _ = k.shape

        # Calculate query group size for GQA
        query_group_size = num_heads // num_kv_heads

        # Ensure contiguous memory layout
        q = q.contiguous()
        k = k.contiguous()
        v = v.contiguous()

        # Allocate output
        o = torch.empty_like(q)

        # Choose tile sizes (we'll optimize this later!)
        TILE_M, TILE_N = 64, 64

        # Calculate grid dimensions
        grid_x = ceil(seq_len / TILE_M)  # Number of tiles along sequence
        grid_y = batch_size * num_heads  # One block per batch-head pair
        grid = (grid_x, grid_y, 1)

        # Check if K length is evenly divisible
        EVEN_K = (k.shape[2] % TILE_N) == 0

        # Launch kernel
        ct.launch(
            torch.cuda.current_stream(),
            grid,
            fmha_kernel,
            (q, k, v, o, sm_scale, 0, head_dim, num_heads,
             TILE_M, TILE_N, query_group_size, is_causal, EVEN_K)
        )
        return o

This baseline with 64×64 tiles works correctly. But can we make it faster? Let’s find out.

Part 2: The “trap and rescue” optimization journey

We benchmark on the following configuration:

- Hardware: NVIDIA B200
- Batch: 4, Heads: 32, Head dimension: 128
- Attention: Causal, Dtype: FP16
- Sequence lengths: 1024, 2048, 4096, 8192, 16384

To interpret each step, we use Nsight Compute with a minimal section set:

- LaunchStats
- Occupancy
- SpeedOfLight
- ComputeWorkloadAnalysis
- MemoryWorkloadAnalysis

Baseline performance

    SeqLen    Throughput (TFLOPS)
    1,024     330
    2,048     441
    4,096     511
    8,192     546
    16,384    566

Table 1. Baseline performance without any specific optimizations

This is our starting point with 64×64 tiles and no optimizations.
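The launch grid for this benchmark follows directly from the grid computation in tile_fmha. The sketch below repeats that arithmetic in plain Python; `launch_grid` and `total_blocks` are hypothetical helper names, and the batch, head, and sequence values are the benchmark configuration listed above.

```python
import math


def launch_grid(batch_size, num_heads, seq_len, tile_m):
    """Mirror the host-side grid computation: (Q tiles, batch * heads, 1)."""
    grid_x = math.ceil(seq_len / tile_m)   # tiles along the query sequence
    grid_y = batch_size * num_heads        # one block row per batch-head pair
    return (grid_x, grid_y, 1)


def total_blocks(grid):
    return grid[0] * grid[1] * grid[2]


# Benchmark configuration: batch 4, 32 heads, 1,024-token sequences
small_tiles = launch_grid(4, 32, 1024, tile_m=64)
large_tiles = launch_grid(4, 32, 1024, tile_m=256)

print(small_tiles, total_blocks(small_tiles))
print(large_tiles, total_blocks(large_tiles))
```

These totals match the grid sizes that Nsight Compute reports for the 1,024-token case in this post (2,048 blocks with TILE_M=64 and 512 blocks with TILE_M=256), which is worth keeping in mind when judging how well a tile size fills the GPU.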
NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread: 128
- Theoretical/achieved occupancy: 25% / 19.8%
- Compute (SM) throughput: 37.8%
- Memory throughput: 19.7%
- Grid size: 2,048

1. The trap of larger tiles

A common intuition in GPU programming is “bigger tiles = better performance.” Larger tiles:

- Amortize memory access overhead.
- Improve L2 cache utilization.
- Reduce kernel launch overhead per element.

So, let’s increase our tile size from 64×64 to 256×128:

    TILE_M, TILE_N = 256, 128  # Was 64, 64

The expectation is better memory bandwidth utilization and therefore faster performance. However, the results in TFLOPS are:

    SeqLen    Baseline (64×64)    Larger tiles (256×128)    Performance degradation
    1,024     330                 187                       -43%
    2,048     441                 268                       -39%
    4,096     511                 347                       -32%
    8,192     546                 415                       -24%
    16,384    566                 463                       -18%

Table 2. Baseline performance compared to performance with larger tile sizes, showing degradation when using larger tile sizes

Performance degraded by 18-43% across all sequence lengths. This is the trap, where large tiles make performance worse. Why does this happen?

- Compute bottleneck: With more elements per tile, inefficient operations (separate mul/add, precise math) become the bottleneck.
- Instruction overhead: More work per tile means more instructions before the next memory operation.

Lesson: Tile size and compute efficiency are interdependent. Large tiles only help if the computation is efficient enough to keep up.

NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread jump to 168 (+31%), reducing theoretical occupancy to 18.75%
- Achieved occupancy drops to 16.5%
- Compute throughput collapses to 17.4% (the trap)
- Memory throughput falls to 7.4%
- Grid size shrinks to 512 (fewer blocks from larger tiles)

2. The rescue with fast math

One of the bottlenecks is special functions: exp2 (exponential) and truediv (division). By default, these are IEEE-754 precise: highly accurate, but slow. For deep learning, we can trade a tiny bit of precision for large speedups.

Before (precise operations):

    p = ct.exp2(qk)
    alpha = ct.exp2(m_i - m_ij)
    acc = ct.truediv(acc, l_i)

After (fast math):

    p = ct.exp2(qk, flush_to_zero=True)
    alpha = ct.exp2(m_i - m_ij, flush_to_zero=True)
    acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX)

What these flags do:

- flush_to_zero=True: Denormal numbers (extremely small values near zero) become exactly zero. This avoids slow microcode paths on the GPU.
- rounding_mode=RMd.APPROX: Skips iterative refinement after the initial hardware approximation.

With fast math, we’ve “rescued” the large tiles, and the results in TFLOPS are:

    SeqLen    Larger tiles (trap)    Fast math (rescue)    Improvement
    1,024     187                    322                   +72%
    2,048     268                    436                   +63%
    4,096     347                    524                   +51%
    8,192     415                    585                   +41%
    16,384    463                    620                   +34%

Table 3. Performance improvement when using two fast math optimizations

We now roughly match the small-tile baseline at short sequences (322 vs. 330 and 436 vs. 441 TFLOPS) and exceed it by about 3-10% at 4,096 tokens and above.

NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread: 168 (unchanged)
- Theoretical/achieved occupancy: 18.75% / 16.6% (unchanged)
- Compute throughput rebounds to 24.0%
- Memory throughput improves to 12.9%

3. K-loop split

For causal attention, we apply a triangular mask: each query can only attend to keys at earlier positions. In our baseline, we run the masking check (if CAUSAL: mask …) on every loop iteration. But think about it: for a query tile at position 1000, most key tiles (0-900) need no masking at all. Only tiles near the diagonal need the mask. And tiles beyond the query position are completely masked (we can skip them entirely).

Figure 3. Tiled causal attention matrix (8 tiles per side)

The optimization splits the loop into phases:

    # Calculate where masking starts being necessary
    mask_start = (input_pos + bid_x * TILE_M) // TILE_N
    mask_start = min(mask_start, k_seqlen // TILE_N)

    # Calculate where to stop (for causal, we exit early)
    if CAUSAL:
        Tc = ct.cdiv(min(m_end, k_seqlen), TILE_N)
    else:
        Tc = ct.cdiv(k_seqlen, TILE_N)

    for j in range(0, Tc):
        # Load K and compute QK...

        # ONLY apply masking when necessary
        if (CAUSAL or not EVEN_K) and j >= mask_start:
            offs_n = j * TILE_N + offs_n_tile
            mask = ct.full((TILE_M, TILE_N), True, dtype=ct.bool_)
            if not EVEN_K:
                mask = mask & (offs_n < k_seqlen)
            if CAUSAL:
                mask = mask & (offs_m >= offs_n)
            mask = ct.where(mask, 0.0, -math.inf)
            qk += mask

        # Continue with softmax and accumulation...

Why this matters: For a 16K sequence with 256-token tiles:

- ~50% of tiles are fully unmasked (no branch, no mask computation)
- ~1 tile per row is partially masked (full logic)
- The rest are skipped entirely (early exit)

Result in TFLOPS:

    SeqLen    Fast math    Loop split    Improvement
    1,024     322          373           +16%
    2,048     436          552           +27%
    4,096     524          684           +31%
    8,192     585          770           +32%
    16,384    620          813           +31%

Table 4. Performance improvement when using K-loop split optimization

This is the biggest single optimization, with speedups of 16% to 32% across the tested sequence lengths.

NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread: 168 (unchanged)
- Theoretical/achieved occupancy: 18.75% / 16.6% (unchanged)
- Memory throughput improves to 14.5% (less wasted work)
- Compute throughput remains 24.0% (work is more useful, not necessarily faster per cycle)

4. ProgramId remapping

One subtle optimization is reversing the block order for causal attention. When we process tiles in reverse (bottom-right to top-left), the blocks with the most work launch first and later-launched blocks have less work due to the causal mask. This improves load balancing and reduces tail effects.

Before (standard order):

    bid_x = ct.bid(0)  # Process tiles 0, 1, 2, ...

After (reversed for causal):

    if CAUSAL:
        bid_x = NUM_M_BLOCKS - 1 - ct.bid(0)  # Process tiles N, N-1, N-2, ...
    else:
        bid_x = ct.bid(0)

Here NUM_M_BLOCKS is the number of query tiles, ceil(q_len / TILE_M), which the host passes as an extra kernel argument (num_m_blocks in the autotuning launch code later in this post). This small change improves wave scheduling, as blocks complete more uniformly across the GPU.

Result in TFLOPS:

    SeqLen    Loop split    Remapping    Improvement
    1,024     373           377          +1%
    2,048     552           560          +1.5%
    4,096     684           696          +1.8%
    8,192     770           781          +1.5%
    16,384    813           835          +2.6%

Table 5. Performance improvement after remapping the block order of the tiles

A modest but consistent 1-3% gain, especially noticeable at longer sequences where tail effects matter most.

5. Autotuning

We’ve optimized large tiles, but there’s a catch: short sequences still prefer small tiles. Why? With a 1,024-token sequence and 256-token tiles, we only have 4 query tiles per batch-head pair (512 blocks in total). That’s not enough to fully utilize all SMs on a B200. Smaller tiles (64×64) give us 16 tiles per batch-head pair (2,048 blocks), better filling the GPU.

Rather than manually choosing a threshold, we can let cuTile’s autotuner benchmark multiple configurations and cache the best one for each input shape.

The autotuner approach:

    from types import SimpleNamespace
    import torch

    def _fmha_autotune_configs():
        """Search space for autotuning.

        The autotuner will benchmark these configurations and cache
        the best one per input shape (sequence length, batch size, etc.).
        """
        gpu_capability = torch.cuda.get_device_capability()
        if gpu_capability in [(12, 0), (12, 1)]:
            # RTX 50 series (sm120, sm121)
            yield SimpleNamespace(TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2)
        else:
            # B200/GB200 (sm100) - Try multiple tile sizes
            # Autotuner will discover:
            # - 64x64 is best for short sequences (1024-2048)
            # - 128x128 may be best for medium sequences (4096)
            # - 256x128 is best for long sequences (8192+)
            yield SimpleNamespace(TILE_M=64, TILE_N=64, num_ctas=1, occupancy=2)
            yield SimpleNamespace(TILE_M=128, TILE_N=128, num_ctas=1, occupancy=2)
            yield SimpleNamespace(TILE_M=256, TILE_N=128, num_ctas=1, occupancy=1)

How to launch with autotuning: Instead of calling ct.launch directly, use ct_experimental.autotune_launch:

    import math
    import cuda.tile_experimental as ct_experimental

    def autotune_launch_fmha(
        stream, q, k, v, o, sm_scale, input_pos,
        hidden_size, num_heads, query_group_size, is_causal
    ):
        batch_size, _, q_len, _ = q.shape

        def _grid_fn(cfg):
            return (math.ceil(q_len / cfg.TILE_M), batch_size * num_heads, 1)

        def _args_fn(cfg):
            num_m_blocks = math.ceil(q_len / cfg.TILE_M)
            even_k = (k.shape[2] % cfg.TILE_N) == 0
            return (
                q, k, v, o, sm_scale, input_pos, hidden_size, num_heads,
                cfg.TILE_M, cfg.TILE_N, query_group_size, is_causal,
                even_k, num_m_blocks,
            )

        ct_experimental.autotune_launch(
            stream,
            grid_fn=_grid_fn,
            kernel=fmha_kernel,
            args_fn=_args_fn,
            hints_fn=lambda cfg: {"num_ctas": cfg.num_ctas, "occupancy": cfg.occupancy},
            search_space=_fmha_autotune_configs,
        )

Note: The autotuner API may be subject to change.

The autotuner works as follows:

- First call with seq_len=1024: Benchmarks all 3 configs, caches the best one
- First call with seq_len=2048: Benchmarks all 3 configs, caches the best one
- Subsequent calls: Uses the cached config (zero overhead)

The cache key includes tensor shapes, so different sequence lengths automatically get different optimal configurations.
Result in TFLOPS:

    SeqLen    Baseline    Remapping    Autotune    Speedup vs baseline
    1,024     330         377          548         1.66x
    2,048     441         560          708         1.61x
    4,096     511         696          817         1.60x
    8,192     546         781          887         1.62x
    16,384    566         835          918         1.62x

Table 6. Original baseline compared to the step 4 (remapping) and step 5 (autotuned) results

The autotuner discovers that 64×64 tiles are best for sequences ≤2,048, then transitions to larger tiles for longer sequences. This delivers 45% additional performance at short sequences compared to fixed large tiles (548 vs. 377 TFLOPS at 1,024 tokens), while maintaining peak performance at long sequences.

What the autotuner chose (on B200):

- SeqLen 1,024: 64×64 tiles (high parallelism)
- SeqLen 2,048: 64×64 or 128×128 tiles (balanced)
- SeqLen 4,096+: 128×128 or 256×128 tiles (memory efficiency)

We now get the best measured configuration at every tested sequence length without manual tuning.

Summary: The optimization stack

    Optimization                 Key insight                        Impact
    Baseline (64×64)             Correct but unoptimized            Baseline
    Large tiles (256×128)        TRAP: 18-43% slower                -18% to -43%
    + Fast math (FTZ, APPROX)    RESCUE: Large tiles now pay off    +34% to +72% from trap
    + K-loop split               Biggest single optimization        +16% to +32%
    + ProgramId remapping        Better load balancing              +1% to +3%
    + Autotuning                 Optimal tiles per sequence         +10% to +45%

Table 7. Step-by-step optimization results with performance impacts for each step

Final speedup: 1.60x-1.66x across all sequence lengths.

Getting started

Writing high-performance kernels is rarely about finding one “magic” setting. As we saw with the “trap and rescue”:

- Optimizations are interdependent: Large tiles were slower until we fixed the math. You can’t evaluate tile size in isolation.
- Math matters: Flags like flush_to_zero and APPROX are critical for unlocking Tensor Core throughput. Precise math is often overkill for deep learning.
- Algorithmic wins compound: K-loop splitting gave us the biggest single improvement (up to 32%) by avoiding unnecessary work.
- Autotuning beats manual heuristics: cuTile’s autotuner discovers optimal tile sizes per sequence length (64×64 for short sequences, 256×128 for long), delivering 10-45% gains over fixed configurations.
- Cumulative effects are multiplicative: The full optimization stack delivers a 1.60x-1.66x speedup across all sequence lengths, far more than any single optimization alone.

cuTile enables developers to express these optimizations (tiling, fast math controls, loop splitting, autotuning) in clean, readable Python code while generating highly optimized PTX for NVIDIA GPUs. You can find the completely optimized kernel in the TileGym repository. Happy hacking.

About the Authors

Alessandro Morari is an AI systems leader at NVIDIA in the DevTech AI organization. His current focus is on AI-driven GPU kernels and next-generation programming models for accelerated computing. His experience spans the full AI stack, from GPU kernel optimization to AI product leadership. Before NVIDIA, he led the team at IBM Research that shipped the Watson Code Assistant, one of the earliest large-scale generative AI products. He previously worked on system software for the Summit and Sierra supercomputers and created NYU Courant's first course on high-performance machine learning. Morari has authored over 30 publications, holds 15 patents, and earned a Ph.D. in Computer Architecture.

Allen Zhao (Wenyi Zhao) is a senior compute architect engineer specializing in cutting-edge AI compiler technologies, including both graph-level and tile-level compilation. His expertise lies in optimizing the execution efficiency of AI models across diverse hardware architectures, especially for GPGPU. He's passionate about translating theoretical compiler advancements into practical, high-impact solutions for the next generation of artificial intelligence. He holds a Master's degree from Shanghai Jiao Tong University.

Ivan Yin (Wenzhi Yin) is a senior computer architect engineer specializing in GPU compiler engineering and high-performance deep learning. He graduated from Shanghai Jiao Tong University. He has expertise in compiler development for NVIDIA CUDA Tile Programming, where he maps high-level tensor operations to efficient GPU machine code through automated code generation for modern GPU architectures. Beyond compiler engineering, he has experience in high-performance deep learning kernel development and performance tuning.

Vishal Mehta works as a senior developer technology engineer at NVIDIA, with a focus on performance optimization for GPU applications. He has been working in the field of GPU computing for over 10 years. He is keen on teaching CUDA and GPU computing to users and drives the content for the CUDA programming guide. His day-to-day activities involve collaborations with domain scientists and industry experts to improve their workloads on GPUs.
New method could increase LLM training efficiency mit_news_ai 26.02.2026 05:00 0.658
Embedding sim.: 0.7518
Entity overlap: 0.0732
Title sim.: 0.1321
Time proximity: 0.9286
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: large language models
NLP country: United States

Open original

Reasoning large language models (LLMs) are designed to solve complex problems by breaking them down into a series of smaller steps. These powerful models are particularly good at challenging tasks like advanced programming and multistep planning. But developing reasoning models demands an enormous amount of computation and energy due to inefficiencies in the training process. While a few of the high-power processors continuously work through complicated queries, others in the group sit idle. Researchers from MIT and elsewhere found a way to use this computational downtime to efficiently accelerate reasoning-model training. Their new method automatically trains a smaller, faster model to predict the outputs of the larger reasoning LLM, which the larger model verifies. This reduces the amount of work the reasoning model must do, accelerating the training process. The key to this system is its ability to train and deploy the smaller model adaptively, so it kicks in only when some processors are idle. By leveraging computational resources that would otherwise have been wasted, it accelerates training without incurring additional overhead. When tested on multiple reasoning LLMs, the method doubled the training speed while preserving accuracy. This could reduce the cost and increase the energy efficiency of developing advanced LLMs for applications such as forecasting financial trends or detecting risks in power grids. “People want models that can handle more complex tasks. But if that is the goal of model development, then we need to prioritize efficiency. We found a lossless solution to this problem and then developed a full-stack system that can deliver quite dramatic speedups in practice,” says Qinghao Hu, an MIT postdoc and co-lead author of a paper on this technique . 
He is joined on the paper by co-lead author Shang Yang, an electrical engineering and computer science (EECS) graduate student; Junxian Guo, an EECS graduate student; senior author Song Han, an associate professor in EECS, member of the Research Laboratory of Electronics and a distinguished scientist of NVIDIA; as well as others at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts at Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.

Training bottleneck

Developers want reasoning LLMs to identify and correct mistakes in their critical thinking process. This capability allows them to ace complicated queries that would trip up a standard LLM. To teach them this skill, developers train reasoning LLMs using a technique called reinforcement learning (RL). The model generates multiple potential answers to a query, receives a reward for the best candidate, and is updated based on the top answer. These steps repeat thousands of times as the model learns.

But the researchers found that the process of generating multiple answers, called rollout, can consume as much as 85 percent of the execution time needed for RL training. “Updating the model, which is the actual ‘training’ part, consumes very little time by comparison,” Hu says.

This bottleneck occurs in standard RL algorithms because all processors in the training group must finish their responses before they can move on to the next step. Because some processors might be working on very long responses, others that generated shorter responses wait for them to finish. “Our goal was to turn this idle time into speedup without any wasted costs,” Hu adds.

They sought to use an existing technique, called speculative decoding, to speed things up. Speculative decoding involves training a smaller model called a drafter to rapidly guess the future outputs of the larger model. The larger model verifies the drafter’s guesses, and the responses it accepts are used for training. Because the larger model can verify all the drafter’s guesses at once, rather than generating each output sequentially, it accelerates the process.

An adaptive solution

But in speculative decoding, the drafter model is typically trained only once and remains static. This makes the technique infeasible for reinforcement learning, since the reasoning model is updated thousands of times during training. A static drafter would quickly become stale and useless after a few steps.

To overcome this problem, the researchers created a flexible system known as “Taming the Long Tail,” or TLT. The first part of TLT is an adaptive drafter trainer, which uses free time on idle processors to train the drafter model on the fly, keeping it well-aligned with the target model without using extra computational resources.

The second component, an adaptive rollout engine, manages speculative decoding to automatically select the optimal strategy for each new batch of inputs. This mechanism changes the speculative decoding configuration based on the training workload features, such as the number of inputs processed by the draft model and the number of inputs accepted by the target model during verification.

In addition, the researchers designed the draft model to be lightweight so it can be trained quickly. TLT reuses some components of the reasoning model training process to train the drafter, leading to extra gains in acceleration. “As soon as some processors finish their short queries and become idle, we immediately switch them to do draft model training using the same data they are using for the rollout process. The key mechanism is our adaptive speculative decoding; these gains wouldn’t be possible without it,” Hu says.

They tested TLT across multiple reasoning LLMs that were trained using real-world datasets.
The system accelerated training between 70 and 210 percent while preserving the accuracy of each model. As an added bonus, the small drafter model could readily be utilized for efficient deployment as a free byproduct. In the future, the researchers want to integrate TLT into more types of training and inference frameworks and find new reinforcement learning applications that could be accelerated using this approach. “As reasoning continues to become the major workload driving the demand for inference, Qinghao’s TLT is great work to cope with the computation bottleneck of training these reasoning models. I think this method will be very helpful in the context of efficient AI computing,” Han says. This work is funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.
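The draft-then-verify loop behind speculative decoding, as described above, can be sketched in a few lines. This is a minimal greedy-decoding toy and not the TLT implementation: `speculative_step`, `generate`, `target_fn`, and `draft_fn` are hypothetical names, the "models" are plain functions over integer tokens, and the verification that a real system would do in one batched forward pass is written as a loop for readability.

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is any function that maps a token prefix to the next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextTokenFn,
                     target: NextTokenFn, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens accepted in it."""
    # 1) The small drafter guesses k future tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The target model checks each guessed position.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] == expected:
            accepted.append(proposed[i])
        else:
            # First mismatch: keep the target's own token and end the round.
            accepted.append(expected)
            return accepted
    # Every guess matched, so the target contributes one extra token.
    accepted.append(target(prefix + proposed))
    return accepted


def generate(prompt: List[Token], draft: NextTokenFn, target: NextTokenFn,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify rounds were used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def target_fn(prefix: List[Token]) -> Token:
    """Toy target model: count upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft_fn(prefix: List[Token]) -> Token:
    """Toy drafter: agrees with the target except right after token 2."""
    return 0 if prefix[-1] == 2 else (prefix[-1] + 1) % 10
```

Because every accepted token is one the target model would have produced itself, the output matches plain greedy decoding with the target. The gain is that several tokens can be accepted per verification round when the drafter is well aligned, which is why a stale drafter erodes the speedup.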
OpenAI Codex and Figma launch seamless code-to-design experience openai 26.02.2026 06:00 0.656
Embedding sim.: 0.7848
Entity overlap: 0.0833
Title sim.: 0.1444
Time proximity: 0.5685
NLP type: product_launch
NLP organization: OpenAI
NLP topic: code generation
NLP country:

Open original

OpenAI and Figma launch a new Codex integration that connects code and design, enabling teams to move between implementation and the Figma canvas to iterate and ship faster.
Anthropic vs. OpenAI, the Pre IPO Days ai_supremacy 20.02.2026 10:36 0.653
Embedding sim.: 0.7534
Entity overlap: 0.0192
Title sim.: 0.1299
Time proximity: 0.8956
NLP type: funding
NLP organization: Anthropic
NLP topic: enterprise ai
NLP country:

Open original

Anthropic vs. OpenAI, the Pre IPO Days

The last few months before lucrative AI IPOs are upon us. Let's do some math. Decoding the crazy high valuations.

Michael Spencer and Raphaëlle d'Ornano, Feb 20, 2026

Good Morning,

The next AI duopoly of BigAI is almost here. With BigTech benefits. 😄🗺️

Anthropic closed its $30 Bn. round and OpenAI is almost ready to close its own $100 Bn. round. Confirmation that OpenAI will keep paying 20% of its revenue to Microsoft until 2032 complicates OpenAI's business model. Nvidia is in discussions to invest up to $30 billion in OpenAI as part of a funding round that could value the AI startup at a $730 billion (even as high as $850 billion) pre-money valuation. We have to assume that both SpaceX and OpenAI will IPO at near or above a $1 Tr. market cap.

I asked Raphaëlle d'Ornano of Decoding Discontinuity (and her team) to do some analysis on the OpenAI vs. Anthropic debate. Decoding Discontinuity is a newsletter that provides frameworks for analyzing the financial and strategic impact of emerging technologies like GenAI. R.O is a very deep thinker so read her analysis carefully.

Related:
Decoding Anthropic’s $380 Billion Valuation: Orchestration over Raw Intelligence in Enterprise AI
The $285 Billion ‘SaaSpocalypse’ Is the Wrong Panic
OpenAI and xAI: When Megawatts Become the New ARPU

Epoch AI (Epoch AI & various writers) is also projecting, as Patel and I have, that Anthropic will overtake OpenAI in ARR not in 2027 but sooner, as in this year!

Anthropic Growing Revenue 3x Faster than OpenAI in 2026 (Epoch AI). When you are growing revenue at a 10x pace instead of a 3.4x pace, that tends to happen. Anthropic is on pace to overtake OpenAI in ARR sometime in late 2026. But what does it mean for Anthropic’s IPO vs. OpenAI’s?
In Anthropic’s global push, 2026 and 2027 are just massive years for its growth. Epoch AI notes that The Information shows both companies projecting slower revenue growth in 2026, with OpenAI expecting 2.2× growth, and Anthropic expecting 4× growth or less. No wonder there are SaaS apocalypse market jitters and a narrative to match. So instead of growing three times as fast, Anthropic may grow twice as fast in 2026.

Keep in mind they are also iterating new models faster: Anthropic releases Sonnet 4.6 this week. Google releases Gemini 3.1 Pro this week.

In what Industries are Agents being Deployed? (Anthropic)
Software Engineering
Back office automation
Other
Marketing, content and copywriting
Sales and CRM
Finance and Accounting
Business Intelligence and Data analysis
Academic Research

As Nvidia reaps the rewards of the GPU-era, we’ll have to track Anthropic more closely now as they dominate Enterprise AI products. OpenAI doesn’t appear to be executing in a customer-focused manner like Anthropic has: since each company hit $1B in annualized revenues, Anthropic has grown substantially faster. The trajectory, branding, product and focus feel entirely different. Anthropic of course can’t grow 10x every year as it gets larger. Epoch AI notes that since July 2025, Anthropic has grown its revenue at a rate of 7×/year rather than 10×. In 2026, most expect 4.5x.

Generational IPOs? 🌊

We are mere months away from the biggest IPOs we’ve ever seen: SpaceX, OpenAI and Anthropic. My personal picks for the BigAI winners of the Gen AI era, excluding players with vast ecosystems like Google, Meta, xAI:
Nvidia
Anthropic
ByteDance
Broadcom
An Unknown AI Chip maker upstart
Alibaba
SK Hynix (HBM chips)
Core Automation (startup)
An unnamed Chinese AI research lab
An unnamed Chinese AI chip startup

Something Big “Might be Happening”

I don’t know if something big is happening (Shumer). As VC takes over media, they certainly want you to think that this is big.
Decoding Discontinuity has a lot of advanced reports for paying readers. They showed me a recent PDF and I was blown away.

(Decoding Discontinuity)

Anthropic is trying to measure agentic autonomy in practice. They might be the moonshot of AI automation that’s hottest right now. Anthropic is likely to be profitable as soon as 2028. Frankly it’s not clear when OpenAI will reach that mark; it could be as late as 2031.

OpenAI has Raised far more but with fewer results

OpenAI is Losing Marketshare to Emerging Players

In 2026, Anthropic, Google, xAI and others will increasingly take marketshare away from OpenAI. Nvidia and Amazon are piling into OpenAI, supposedly to save it as a major customer. But…if AI was a big thing, why are consumers spending more on OnlyFans than on OpenAI and NYT combined?

(a16z)

Anthropic’s Super Bowl Surge in Subscriptions

The cheeky Anthropic Super Bowl ads were a question of good timing, and Claude Code has built incredible momentum going into the pre IPO intensity for both parts of the BigAI (BigTech driven) duopoly.

(Axios via Ramp AI Index data)

This means the faster Anthropic grows, the slower OpenAI will grow. The main early 2026 vibes have been Codex vs. Claude Code. January 2026 might have been Anthropic's breakthrough month, wrote Ara Kharazian, an economist at Ramp, which has been tracking business spending on AI.

(Ramp)

It’s getting intense: 79% of Anthropic's customers are already OpenAI customers. And churn rates are nearly identical at 4%. According to Ramp’s data as of February 11th, 16% of businesses pay for both OpenAI and Anthropic. A year ago it was 8%.

There will be Winners and Losers

(Decoding Discontinuity)

In terms of global competition, if either OpenAI or Anthropic falters there’s Google, Alibaba, ByteDance, xAI, DeepSeek and a host of others pushing, including open-weight Chinese startups you’ve never heard of.
There will also be more nimble new research labs that will end up creating even better AI products and new architectures, and offering new approaches to LLMs.

B2B Market Looks Mission Critical

For more sustainable big long-term contracts, Enterprise AI competition looks like the critical piece that will make or break their IPOs. While OpenAI’s B2C marketshare lead once looked impressive, diffusion by Gemini and others will reduce that first-mover advantage.

“1 in 5 businesses on Ramp now pay for Anthropic. A year ago, it was 1 in 25.” - Ara Kharazian, Ramp Economist

(Decoding Discontinuity)

It’s highly uncertain if OpenAI’s AI device can compete with the likes of Meta, Apple, Google and others in smart glasses and other AI wearables, a huge market by 2028. If you are an OpenAI bear like I am, it’s not clear what exactly they win in. This is especially the case as ByteDance and Meta become direct competitors. Seedance looks more impressive than Sora, and so forth.

The AI Coding Impact Focus

Sometime in 2025, Google, Anthropic and Alibaba Qwen began to outpace OpenAI in cadence of new releases and LLM quality, making them more attractive for key builders, developers and entrepreneurs. Even Gemini CLI is gaining on Codex now in 2026. While OpenAI has transformed "Codex" from a simple model into a heavy-duty "Agent Command Center," Google’s Gemini CLI has found its niche as the high-context, low-friction alternative. All of this isn’t so great for Cursor or Microsoft’s own GitHub Copilot.

Anthropic's MCP Advantage begins to Compound

(Decoding Discontinuity)

Anthropic and Google are building the agentic protocols that form the scaffolding of the future of Agentic AI. In 2026, the Model Context Protocol (MCP) is no longer just an Anthropic experiment; it has become the "USB-C for AI."
Read: Agentic Protocol Handbook

Anthropic’s Upcoming Event

Join Anthropic on Tuesday, February 24 for The Briefing: Enterprise Agents, a livestreamed event where we'll demonstrate how Cowork and Plugins help legal, sales, finance, and data teams build new products and solutions.

ChatGPT’s Viral Growth Not Enough

(Decoding Discontinuity)

ChatGPT’s growth looked magical, but Gemini and others are now showing similar adoption patterns. The “weekly” users metric doesn’t stand tall as Generative AI becomes more specialized. AI Supremacy isn't a zero-sum game; new players and global competition will make things interesting.

Finally, let’s dive into the analysis of the guest contributor. Feel free to share this if you know anyone interested in the business trajectory of OpenAI or Anthropic, or indeed what “BigAI” will turn into.

OpenAI at a Crossroads: 2026 the pre IPO Last Weeks

See more at Decoding Discontinuity.

A guest post by Raphaëlle d'Ornano. Raphaëlle d'Ornano is the founder of D'Ornano + Co. and Decoding Discontinuity, a research and investment platform. Her Durable Growth Moat™ framework analyzes how companies sustain competitive advantages through AI transformation.
Perplexity announces "Computer," an AI agent that assigns work to other AI agents arstechnica_ai 26.02.2026 22:53 0.641
Embedding sim.: 0.712
Entity overlap: 0.0476
Title sim.: 0.2289
Time proximity: 0.9656
NLP type: product_launch
NLP organization: Perplexity
NLP topic: ai agents
NLP country:

Open original

Agentception

Perplexity announces “Computer,” an AI agent that assigns work to other AI agents

It’s also a buttoned-down, ostensibly safer take on the OpenClaw concept.

Samuel Axon – Feb 26, 2026 5:53 pm

The vague marketing image for Perplexity Computer. Credit: Perplexity

Perplexity has introduced “Computer,” a new tool that allows users to assign tasks and see them carried out by a system that coordinates multiple agents running various models. The company claims that Computer, currently available to Perplexity Max subscribers, is “a system that creates and executes entire workflows” and “capable of running for hours or even months.”

The idea is that the user describes a specific outcome, something like “plan and execute a local digital marketing campaign for my restaurant” or “build me an Android app that helps me do a specific kind of research for my job.” Computer then ideates subtasks and assigns them to multiple agents as needed, running the models Perplexity deems best for those tasks.

The core reasoning engine currently runs Anthropic’s Claude Opus 4.6, while Gemini is used for deep research, Nano Banana for image generation, Veo 3.1 for video production, Grok for lightweight tasks where speed is a consideration, and ChatGPT 5.2 for “long-context recall and wide search.” This kind of best-model-for-the-task approach differs from some competing products like Claude Cowork, which only uses Anthropic’s models.

All this happens in the cloud, with prebuilt integrations. “Every task runs in an isolated compute environment with access to a real filesystem, a real browser, and real tool integrations,” Perplexity says.
The idea is partly that this workflow was what some power users were already doing, and this aims to make that possible for a wider range of people who don’t want to deal with all that setup. People were already using multiple models and tailoring them to specific tasks based on perceived capabilities, while, for example, using MCP (Model Context Protocol) to give those models access to data and applications on their local machines. Perplexity Computer takes a different approach, but the goal is the same: have AI agents running tailor-picked models to perform tasks involving your own files, services, and applications. Then there is OpenClaw, which you could perceive as the immediate predecessor to this concept. The story so far If you haven’t been following the wild OpenClaw craze, here’s the quick summary: originally titled ClawdBot, then Moltbot, OpenClaw was an agentic AI tool that leveraged large language models to independently operate as a sort of background or ambient process on your local machine, performing a wide range of tasks, from sorting through your email history to building websites to, well, basically whatever you could imagine. Given the right permissions and with the proper plugins, it could create, modify, or delete the user’s files and otherwise change things far beyond what most users could achieve with existing models and MCP (Model Context Protocol). Users would use files like USER.MD, MEMORY.MD, SOUL.MD, or HEARTBEAT.MD to give the tool context about its goals and how to work toward them independently, sometimes running for long stretches without direct user input. On one hand, that meant it could do impressive things—the first glimpses of the sort of knowledge work that AI boosters have been saying agentic AI would ultimately do. On the other hand, it was prone to serious errors and vulnerable to prompt injection and other security problems, in part due to a Wild West of unverified plugins. 
The same toolkit that was used to create a viral Reddit clone populated by AI agents was also, at least in one case, responsible for deleting a user’s emails against her will.

Stay in your lane

Perplexity Computer aims to address those concerns in a few ways. First, its core process occurs in the cloud, not on the user’s local machine. Second, it lives within a walled garden with a curated list of integrations, in contrast to OpenClaw’s unregulated frontier. This is, of course, an imperfect analogy, but you could say that if OpenClaw were the open web of AI agent tools, then Computer is Apple’s App Store. While you’re more limited in what you can do, you’re not trusting packages from unverified sources with access to your system.

There could still be risks, though. For one thing, LLMs make mistakes, and those could be consequential if Computer is working with data you don’t have backed up elsewhere or if you’re not verifying the outputs, for example.

Perplexity Computer aims to button up, refine, and contain the wild power of the viral OpenClaw agentic AI tool, competing with the likes of Claude Cowork, by selecting the models best suited to each subtask. It surely won’t be the last existing AI player to try to do this sort of thing. After all, OpenAI hired OpenClaw’s developer, with CEO Sam Altman suggesting that some of what we saw in OpenClaw will be essential to the company’s product vision moving forward.

Samuel Axon, Senior Editor

Samuel Axon is the editorial lead for tech and gaming coverage at Ars Technica. He covers AI, software development, gaming, entertainment, and mixed reality. He has been writing about gaming and technology for nearly two decades at Engadget, PC World, Mashable, Vice, Polygon, Wired, and others. He previously ran a marketing and PR agency in the gaming industry, led editorial for the TV network CBS, and worked on social media marketing strategy for Samsung Mobile at the creative agency SPCSHP.
He is also an independent software and game developer for iOS, Windows, and other platforms, and he is a graduate of DePaul University, where he studied interactive media and software development.
An update on our mental health-related work openai 27.02.2026 00:00 0.637
Embedding sim.: 0.721
Entity overlap: 0
Title sim.: 0.1594
Time proximity: 0.959
NLP type: other
NLP organization: OpenAI
NLP topic: ai safety
NLP country:

Open original

OpenAI shares updates on its mental health safety work, including parental controls, trusted contacts, improved distress detection, and recent litigation developments.
Google DeepMind Partnerships in India: scaling AI in science and education — Google DeepMind deepmind 18.02.2026 10:30 0.631
Embedding sim.: 0.7331
Entity overlap: 0.0213
Title sim.: 0.0662
Time proximity: 0.9018
NLP type: partnership
NLP organization: Google DeepMind
NLP topic: artificial intelligence
NLP country: India

Open original

February 18, 2026
Responsibility & Safety

Accelerating discovery in India through AI-powered science and education

Demis Hassabis, Lila Ibrahim and Pushmeet Kohli

Introducing our National Partnerships for AI and collaboration in India

We believe AI will be the most transformative technology in human history and that it should be deployed in ways that benefit all of humanity. This requires deep, strategic collaboration between frontier AI labs, governments, academia, and civil society. To fully realise AI’s potential, Google DeepMind is working with governments through our National Partnerships for AI initiative to broaden access to our frontier AI capabilities, helping ensure they are deployed to serve citizens and meet national priorities in science, education, resilience, and public services.

Building on our collaborations with the US and UK governments, we are establishing a new partnership with Indian government bodies and local institutions. In the global AI transformation, India is showing exceptional leadership in applying the technology to tackle its own biggest challenges. But India is going even further, playing a critical international role by convening this week the fourth global AI summit of governments, companies and civil society. International dialogue and collaboration will guide positive impacts and create the global frameworks required to prepare society for a future with AI.

Partnership in India to broaden AI access

Our partnerships are designed to accelerate the pace of progress across India. Here are a few ways we are working together to unlock new possibilities in science and education.

Advancing scientific breakthroughs

Google DeepMind, Google Research and Google.org are partnering with the Anusandhan National Research Foundation (ANRF) to facilitate the adoption of AI models to advance science.
We’re providing access to our frontier AI for Science models, supporting hackathons and community contests, and enabling training and mentorship to students, researchers, and those in the early stages of their careers. Researchers and engineers in India will be able to use our AI tools, including: AlphaGenome : An AI model to help scientists better understand how mutations in human DNA sequences impact a wide range of gene functions AI Co-scientist : A multi-agent AI system that acts as a virtual scientific collaborator Earth AI: A collection of models built on Gemini’s advanced reasoning that are helping enterprises, nonprofits, and cities with everything from environmental monitoring to disaster response Scientists around the world are already using AlphaFold - our AI system capable of accurately predicting the structure and interactions of proteins, DNA, RNA, ligands and more - to accelerate discoveries. India stands as the fourth largest adopter of AlphaFold globally, with over 180,000 researchers using it today. We hope to see Indian scientists benefit even more from using AlphaGenome and the other AI systems we are now providing. We're also working to support AI for science at a global level. This is why, today at the India Summit, we announced the $30 million Google.org Impact Challenge: AI for Science , an open call for researchers, nonprofits, and social enterprises in India, and around the world, using AI to achieve scientific breakthroughs. Selected awardees will also have the opportunity to participate in a Google.org Accelerator, receiving engineering support, expert mentorship, and infrastructure from Google DeepMind and Google Research to turn their concepts into scalable discoveries. Empowering India’s Students and Teachers with an AI-powered Future Our recent survey with Ipsos has shown that learning is the top motivation for using AI globally. This is especially true in India, which now leads the world in daily Gemini usage by students. 
We’re seeing AI can drive profound comprehension and critical thinking when it is purpose-built for learning and implemented as a supportive partner to educators. At City Montessori School in Lucknow, teachers are integrating Guided Learning into math classes for Grade 8-9 students and seeing a positive response. An early analysis of a randomized control study conducted by Fab AI shows that students are demonstrating a desire for deeper learning, not just quick answers: in almost three out of every four conversations on Gemini, students sought to develop their understanding rather than a quick answer or shortcut. That’s why we’re expanding efforts with additional partners to supercharge the potential of learning for more Indian students and teachers: Powering innovation hubs with GenAI assistants: Together with Atal Tinkering Labs, which serves more than 10,000 Indian schools and 11 million students, we will help incorporate robotics and coding into local curricula, integrate Gemini thoughtfully into teacher workflows, and build a safely guardrailed AI assistant for students grounded in national curriculum standards that can act as an educational partner. Teachers can access real-time tips to help students fix a robot missing a part with readily available materials or mend a broken circuit design by simply pointing a camera to it or asking Gemini in chat. Transforming textbooks into interactive digital journeys: In a first-of-its-kind partnership with PM Publishers Pvt. Ltd., a K-12 textbook publisher in India, Gemini will be used to transform two million static textbooks into AI-powered interactive journeys across more than 250 titles and 2,000 schools. Each book features a QR code that can be scanned by students to access a custom Gem (specialized versions of the Gemini AI model), that acts as an expert assistant on the subject, providing summaries and responses on the contents of the respective book. 
Serving India’s linguistic diversity: There is incredible potential for AI to make a positive impact on education when built in close partnership with experts and grounded in local language and culture. Building on Google.org’s recent $2 million founding contribution to establish the new Indic Language Technologies Research Hub at IIT Bombay, we’ll help incorporate India’s linguistic diversity into AI as it advances globally.

These efforts build on the global success of existing AI literacy programs like Experience AI, a joint partnership developed by Google DeepMind with Raspberry Pi Foundation, which has already reached up to 300,000 students and 8,000 teachers in India.

AI solutions for India’s agriculture and energy sectors

Our new partnerships in science and education build on our ongoing collaboration with local Indian organizations to tackle global challenges in agriculture and energy security. Indian startups, institutions like the Council on Energy, Environment and Water (CEEW), and Indian state and central government entities are using the APIs of our freely available Agri AI models to enhance agricultural resilience, crop productivity and farmer incomes. TerraStack is also using Google AI to combine satellite, crop, and weather data into hyper-local insights that help farmers make better agricultural decisions.

We also recently announced a growing collaboration with Open Climate Fix (OCF) to integrate our WeatherNext AI models into India’s electricity grid operations. We’re aiming to significantly improve the accuracy of renewable energy forecasts in India, help grid operators manage volatility, and support the country’s ambitious clean energy targets. When we tested the integration of WeatherNext into OCF’s wind generation forecast, results showed an accuracy improvement of up to 8%.
This partnership comes as India rapidly scales its renewable capacity, becoming the third largest generator of solar energy globally in 2023, with an ambitious target of installing 500 GW of renewable capacity by 2030. Working together on energy solutions has never been more important, and we remain committed to working with experts in India to advance this effort and prepare for the future.

Preparing for the future together

AI’s global impact is inevitable, but its success is not. To turn potential into prosperity, we are committing to deep, local collaboration with India's government bodies and institutions to ensure AI delivers tangible results across the subcontinent and the world.
Building Telco Reasoning Models for Autonomous Networks with NVIDIA NeMo | NVIDIA Technical Blog nvidia_dev_blog 01.03.2026 07:00 0.631
Embedding sim.: 0.7109
Entity overlap: 0.0476
Title sim.: 0.2783
Time proximity: 0.7738
NLP type: partnership
NLP organization: NVIDIA
NLP topic: ai agents
NLP country:

Open original

Autonomous networks are quickly becoming one of the top priorities in telecommunications. According to the latest NVIDIA State of AI in Telecommunications report, 65% of operators said AI is driving network automation, and 50% named autonomous networks as the top AI use case for ROI. Yet many telcos still report gaps in AI and data science expertise. This makes it difficult to scale safe, closed-loop automation across complex, multidomain networks.

Most telecom network operations centers (NOCs) today operate using reactive, alarm-driven workflows. Engineers manually triage thousands of incidents across multiple tools, sift through a high volume of alarm and performance data, and stitch together fragmented dashboards and logs before applying a fix or dispatching a field team. NOCs are a natural starting point for autonomous networks, because they concentrate high-volume, repeatable tasks where AI can directly cut mean time to repair (MTTR) and operating expenses (OPEX).

Tech Mahindra, a leading global provider of technology consulting and digital solutions to enterprises across industries, and NVIDIA are collaborating to close this AI skills gap. They’re doing so by turning autonomous network building blocks (open models, tools, and implementation guides) into assets telecom developers can readily adopt and adapt in their own environments.

This post outlines how to fine-tune reasoning models with NVIDIA NeMo so they behave like NOC engineers, safely driving closed-loop, self-healing workflows. It shows how to:
- Generate synthetic, telecom-realistic incident data
- Translate expert procedures into structured reasoning traces using the production-grade reference workflows. This teaches the model to coordinate tools, reason over network state, and execute fault-management tasks end to end

The result is a repeatable method that telco teams can use to build their own specialized AI agents for network operations.
These agents can perform triage, root-cause analysis, and resolution for high-volume incident classes, helping operators progress toward TM Forum Level 4 highly autonomous networks and beyond.

Why do network operations centers need reasoning models?

Traditional NOC automation is mostly rule-based and open-loop: scripts trigger on fixed conditions but struggle with noisy signals, cross-domain dependencies, and constantly changing network behavior. As a result, many Level 1 and Level 2 tasks (triage, root-cause analysis, validation after a change) still depend on manual effort, keeping MTTR high and limiting how far operators can move toward truly autonomous operations.

Figure 1. Shifting from manual NOC alarm handling to a reasoning agent embedded in the NOC workflow

A telco reasoning model becomes the engine for an AI agent that can take on this work pattern in a controlled, auditable way. Instead of hard-coded runbooks and point scripts, the agent uses the model to interpret incidents, decide which tools to call, and adapt its actions based on live responses. Key features include:
- AI reasoning plus tool-calling: Replaces manual alarm triage by invoking NOC tools for validation, root-cause analysis, and remediation across existing systems
- End-to-end automation: Handles alarm validation, RCA, and healing for various incident types such as outages, flaps, congestion, and configuration issues
- Noise reduction: Filters self-clearing or low-value alarms using historical patterns so engineers can focus on higher priorities
- Resolution in seconds, not hours: Shrinks resolution time for high-volume, well-understood incidents from hours to seconds, significantly reducing MTTR

The outcome is a closed-loop, self-healing network. Specialized NOC agents handle routine triage and resolution, and engineers shift from reactive alarm handling to proactive optimization and complex problem-solving.
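The closed-loop pattern described above, where the agent picks a tool, observes the response, and adapts, can be sketched as a small loop. This is a minimal illustration under stated assumptions rather than the workflow from the post: the tool functions, their return fields, and `policy` (a rule-based stand-in for the fine-tuned reasoning model) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Observation = Dict[str, object]


# Hypothetical stub tools; a real NOC would call existing systems here.
def validate_alarm(state: Dict[str, object]) -> Observation:
    return {"tool": "validate_alarm", "alarm_valid": not state["healthy"]}


def run_rca(state: Dict[str, object]) -> Observation:
    return {"tool": "run_rca", "root_cause": "port_down"}


def remediate(state: Dict[str, object]) -> Observation:
    state["healthy"] = True  # Simulate a successful remote fix.
    return {"tool": "remediate", "applied": "port_reset"}


TOOLS: Dict[str, Callable[[Dict[str, object]], Observation]] = {
    "validate_alarm": validate_alarm,
    "run_rca": run_rca,
    "remediate": remediate,
}


def policy(history: List[Observation]) -> Optional[str]:
    """Stand-in for the reasoning model: pick the next tool from what was observed."""
    if not history:
        return "validate_alarm"
    last = history[-1]
    if last["tool"] == "validate_alarm":
        # Self-clearing or invalid alarms need no further action.
        return "run_rca" if last["alarm_valid"] else None
    if last["tool"] == "run_rca":
        return "remediate"
    return None  # Remediation applied; close the loop.


def run_agent(state: Dict[str, object], max_steps: int = 5) -> List[Observation]:
    """Call tools until the policy decides the incident is finished."""
    history: List[Observation] = []
    for _ in range(max_steps):
        tool_name = policy(history)
        if tool_name is None:
            break
        history.append(TOOLS[tool_name](state))
    return history
```

Calling `run_agent({"healthy": False})` walks through validation, RCA, and remediation. Starting from a healthy state, the loop stops after validation, which mirrors the noise-reduction case where an alarm has already cleared.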
Designing a telco reasoning pipeline

The technical approach to this solution combines the following components into one reproducible pipeline:
- Synthetic incident data
- Expert NOC procedures
- Structured reasoning traces
- Supervised fine-tuning
- Evaluation

Instead of trying to learn from raw logs and alarms directly, the model is trained on curated examples that show how an experienced engineer would analyze an incident, call tools, and decide when a fix is complete.

Figure 2. Agent training pipeline, from synthetic incident generation to reasoning model fine-tuning and evaluation across tool-calling, reasoning, and conclusions

In this case, Qwen3-32B is the base reasoning model that is fine-tuned for telco NOC workflows using the following design principles:
- Focusing on a small number of high-impact faults, which account for the majority of incidents and require deliberate action. This enables the model to learn deeply on the fault classes that matter most.
- Defining step-by-step operational guidelines for each problem type, including RCA and remediation steps and the NOC tools that agents must use.
- Generating synthetic reasoning traces that capture multistep tool calls and the rationale behind each decision, using the NeMo Skills reference workflow to automate trace and incident generation.

NeMo Skills orchestrates this pipeline end to end, using its CLI, vLLM or TensorRT LLM servers, and training utilities to move from raw incidents to a fine-tuned telco reasoning model.

Synthetic incidents and NOC tool-calling

The input to the pipeline is a fully synthetic incident dataset that is modeled on real NOC behavior. Each record includes fields such as region, domain, priority, problem type, possible cause, and time stamps. Engineer notes are also included, describing intermediate steps, along with close notes summarizing the final resolution and close code. An incident summary captures why the network was degraded or down and is the backbone of what the model is trained to solve.
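To make the record layout concrete, here is a minimal sketch of a synthetic incident generator. The field names follow the list above, but the vocabularies (`REGIONS`, `DOMAINS`, `PRIORITIES`, `FAULTS`), the close codes, and the note wording are invented for illustration and are not taken from the NVIDIA or Tech Mahindra dataset.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    incident_id: str
    region: str
    domain: str
    priority: str
    problem_type: str
    possible_cause: str
    opened_at: str       # ISO 8601 time stamp
    closed_at: str       # ISO 8601 time stamp
    engineer_notes: str  # intermediate steps
    close_notes: str     # final resolution
    close_code: str
    summary: str         # why the network was degraded or down


# Hypothetical vocabularies, chosen only for this sketch.
REGIONS = ["north", "south", "east", "west"]
DOMAINS = ["RAN", "transport", "core"]
PRIORITIES = ["P1", "P2", "P3"]
FAULTS = {
    "link_flap": ("degraded optical signal", "REMOTE_FIX"),
    "site_outage": ("commercial power failure", "FIELD_DISPATCH"),
    "congestion": ("traffic spike on backhaul", "CONFIG_CHANGE"),
}


def make_incident(rng: random.Random, index: int) -> Incident:
    """Sample one incident from the small fault catalogue above."""
    problem_type = rng.choice(sorted(FAULTS))
    cause, close_code = FAULTS[problem_type]
    opened = datetime(2026, 1, 1) + timedelta(minutes=rng.randint(0, 60 * 24 * 30))
    closed = opened + timedelta(minutes=rng.randint(5, 240))
    region = rng.choice(REGIONS)
    domain = rng.choice(DOMAINS)
    return Incident(
        incident_id=f"INC{index:06d}",
        region=region,
        domain=domain,
        priority=rng.choice(PRIORITIES),
        problem_type=problem_type,
        possible_cause=cause,
        opened_at=opened.isoformat(),
        closed_at=closed.isoformat(),
        engineer_notes=f"Validated alarm, checked {domain} status in {region}.",
        close_notes=f"Resolved {problem_type}; cause: {cause}.",
        close_code=close_code,
        summary=f"{domain} service degraded in {region} due to {problem_type}.",
    )


def make_dataset(n: int, seed: int = 0) -> List[Incident]:
    """Build a reproducible dataset: the same seed yields the same incidents."""
    rng = random.Random(seed)
    return [make_incident(rng, i) for i in range(n)]
```

Keeping the fault catalogue small matches the design principle of concentrating on a few high-impact fault classes, and seeding the generator keeps the dataset reproducible across runs.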
The pipeline concentrates on the most frequent, high-impact faults that account for the bulk of incident volume and require explicit action. The reasoning model learns deeply on the cases that drive MTTR and OPEX. To model realistic NOC workflows, a set of custom tools are defined for agents to call in multistep procedures, such as: Acknowledging and tracking the initial alert Checking site and equipment status Performing remote actions (reset, unlock, enable) Monitoring for automatic recovery or alarm clearance Checking topology, power, and fiber, plus public outage information Applying configuration fixes Rechecking alarm status when it remains active Investigating persistent or recurring alarms Documenting actions and status updates Coordinating onsite dispatch or hardware replacement Confirming final site health and closing the incident For each problem type, domain experts translate existing workflows into step‑by‑step guidelines that map onto these tools. Examples include which triage toolkit to consult first; which alarms to query; when to reboot a device; and how to verify a fiber cut, power outage, or network element faults. These guidelines become blueprints for the synthetic reasoning traces the model will learn from. They later define the action space that NOC agents use when executing closed‑loop workflows in production. Turn expert procedures into reasoning traces To turn expert NOC procedures into training data for a telco‑specialized reasoning model, follow the three-step NeMo Skills workflow outlined below. It converts runbooks into structured, multiturn reasoning traces ready for autonomous NOC agents. Step 1: Generate structured action sequences Using a reference workflow from NeMo Skills, a teacher model generates standardized action sequences for each incident based on prompts that include incident fields and guideline templates. The steps map directly to NOC tools. 
Traces are formatted so each step records the action, its parameters, the tool call, and the immediate result, forming a structured view of the NOC workflow.​ Step 2: Attach per‑step reasoning A second pass enriches every action with reasoning text that explains why the step is taken, what signals it uses, and how it influences the next decision. This creates a chain of reasoning that reflects how an experienced NOC engineer reasons over topologies, alarms, and historical behavior. Because raw traces can be verbose or repetitive, a squashing phase merges related steps while preserving key decision points, making sequences more efficient for training. Step 3: Formatting for multiturn, tool‑calling models Using another workflow from NeMo Skills, the formatted traces are converted into a Qwen-compatible format that encodes both the dialogue-style interaction and tool-calling actions over multiple turns. Multiturn tokenization simulates realistic interactions where the agent alternates between reasoning, calling tools, and interpreting tool responses, which is essential for deploying a ReAct-style NOC agent.​​ The result is a curriculum-structured dataset where easier cases and shorter traces appear earlier, while more complex multi-step incidents appear later, supporting curriculum learning during model training.​​ Fine-tuning the telco reasoning model The fine-tuning phase uses a standard train/test split on the compiled reasoning dataset, with NeMo Skills orchestrating data preparation and Qwen3 32B serving as the base reasoning model. NeMo Skills prepare_data utilities apply a telco‑specific prompt template ( noc_reasoning_sft ) and the Qwen tokenizer. 
This turns each trace in the training split into a supervised fine-tuning (SFT) example that includes incident context and NOC signals, multistep tool calls and intermediate results, reasoning traces explaining each decision, and the final resolution and incident summary. The output is a single JSONL file of SFT-ready examples for the telco reasoning model.

To improve learning efficiency, curriculum learning is applied by ordering samples from simple, single-problem incidents to more complex multistep, multitool cases. This allows the model to master core NOC behaviors before tackling long, multiturn troubleshooting patterns. Multiturn tokenization ensures that each example preserves realistic sequences of queries, tool calls, responses, and follow-up actions, rather than isolated single-turn prompts. These capabilities are critical for downstream ReAct-style agents that must coordinate multiple tools over long contexts.

Finally, Qwen3-32B is fine-tuned on this telco reasoning curriculum with long sequence lengths and tensor model parallelism across GPUs. Checkpointing and experiment tracking allow teams to iterate on data quality, curriculum design, and hyperparameters. The result is a telco-specialized reasoning model that understands incident fields, close codes, and NOC procedures, and can reliably drive multitool, multiturn tool-calling workflows in production.

Evaluating incident summary accuracy and safety

Initial evaluation focuses on incident summary accuracy: how well the model, embedded in a ReAct-style agent with tools, predicts and executes the correct resolution path for a given incident. Experiments compare the fine-tuned telco reasoning model against a baseline Qwen3-32B on held-out incidents, measuring accuracy, precision, and recall across problem and close-code categories.
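The comparison described above can be sketched in a few lines of Python. This is a minimal illustration that assumes each held-out incident has a gold close code and one predicted close code per model; the label names and the toy predictions are invented for the example and are not the article's data or results.

```python
from collections import Counter
from typing import Dict, List, Tuple


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of incidents whose predicted close code matches the gold one."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_precision_recall(
    gold: List[str], pred: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {close_code: (precision, recall)} over all observed labels."""
    true_pos: Counter = Counter()
    for g, p in zip(gold, pred):
        if g == p:
            true_pos[g] += 1
    gold_count, pred_count = Counter(gold), Counter(pred)
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy held-out set with hypothetical close codes.
gold = ["FIBER_CUT", "POWER", "FIBER_CUT", "CONFIG"]
baseline = ["POWER", "POWER", "CONFIG", "CONFIG"]
tuned = ["FIBER_CUT", "POWER", "FIBER_CUT", "POWER"]

baseline_acc = accuracy(gold, baseline)  # 0.5 on this toy data
tuned_acc = accuracy(gold, tuned)        # 0.75 on this toy data
tuned_by_class = per_class_precision_recall(gold, tuned)
```

Reporting precision and recall per close code, rather than a single accuracy figure, shows which fault classes drive the difference between the two models.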
Incident summary accuracy can also be analyzed within a single problem type to highlight where reasoning traces and curriculum learning deliver the largest gains, informing future iterations of synthetic data generation and guideline design. Evaluations across multiple iterations show that the fine-tuned model improves accuracy from roughly 20% to 60%. Beyond incident summary metrics, additional evaluation methods can be introduced over time to further harden the system, including: LLM‑as‑a‑judge setups to evaluate reasoning traces for correctness, completeness, and safety LLM‑as‑a‑judge to assess final conclusions and remediation plans Tool‑calling benchmarks such as BFCLv3 to measure how reliably the agent sequences and interprets tool calls Rollout and rejection sampling to stress‑test behavior across many simulated incidents Controlled errors injected into traces to teach the model to detect and recover from its own mistakes Incorporation of retrieval‑augmented generation (RAG) with historical few‑shot examples to improve robustness on long‑tail scenarios Get started building telco reasoning models for autonomous networks Telco‑specific reasoning models—powered by synthetic data, structured traces, and safe tool‑calling—can move NOCs toward zero‑touch, self‑healing operations. By focusing on high‑impact close codes, encoding expert guidelines as multiturn reasoning traces, and fine‑tuning large models with the NVIDIA NeMo software toolkit, operators can build agents that reliably take on real NOC engineer tasks. The pipeline is reusable and adaptable, so this approach can be tailored to each operator’s tools, data, and policies. This accelerates the industry’s transition from manual alarm handling to intelligent, autonomous network operations. To get started fine-tuning a reasoning model to build AI agents for network operations, see Teaching a Model to Reason over Telecom Network Incidents . 
Tags: Agentic AI / Generative AI | Networking / Communications | Telecommunications | NeMo | TensorRT-LLM | Intermediate Technical | Tutorial | AI Agent | featured | Retrieval Augmented Generation (RAG) | Training AI Models

About the Authors

About Aiden Chang
Aiden Chang is a solution architect at NVIDIA, focusing on enterprise applications of generative AI, robotics, and reasoning systems. He earned his master’s in computer science from the University of Southern California. Outside of work, he enjoys skiing, aviation, and building robots.

About Amparo Canaveras
Amparo Canaveras is a senior solutions architect at NVIDIA, specializing in generative AI applications within the telecommunications sector. She brings over 20 years of experience from her time in network operations and analytics at Nokia and Verizon. Amparo holds a B.Sc. in electrical engineering from the Polytechnic University of Valencia and an M.Sc. in systems design and management from MIT.

About Ari Uskudar
Ari Uskudar has 20-plus years of experience in AI-driven network automation, RAN intelligence, and large-scale telecom architecture across NVIDIA, VMware, Ericsson, Verizon, Turkcell, Vodafone, and Motorola. Her expertise spans agentic AI systems, autonomous network design, LLM-based telco reasoning, ML-powered observability, and end-to-end optimization. Ari has authored multiple patents in autonomous networks, 6G core architecture, and telco blueprints, etc. Known for bridging deep engineering with strategic product thinking, she designs advanced architectures, leads complex technical collaborations, and develops industry-adopted innovations that shape the future of AI-native telecom systems.

About Amol Phadke
Amol Phadke is the chief transformation officer at Tech Mahindra, working closely with the CEO on enterprise-wide strategic initiatives, including the global elevation of the Communications industry vertical. He brings deep technology and business leadership across AI, cloud, software networks, big tech, and telecommunications, specializing in strategy definition, driving execution of large-scale engineering, and leading global multidiscipline teams. With over 25 years of global industry experience, he has previously held senior leadership posts as Group CTIO Telenor Group and GM at Google Cloud, among others. Amol holds a double degree executive MBA from UCLA, California - NUS, Singapore, a master’s degree in Telecommunications Engineering from USC, California, and a bachelor’s degree in Electronics Engineering from the University of Mumbai.
Google and the Massachusetts AI Hub are launching a new AI training initiative for the Commonwealth. google 26.02.2026 18:55 0.627
Embedding sim.0.7131
Entity overlap0.0286
Title sim.0.0973
Time proximity0.9892
NLP type: partnership
NLP organization: Google
NLP topic: ai adoption
NLP country: United States

Open original

AI is creating new opportunities across the workforce. Today, we announced with Governor Maura Healey that Google is partnering with the Massachusetts AI Hub to provide every Bay Stater with no-cost access to Google’s AI and career training through our Grow with Google program. This includes Google’s new AI Professional Certificate and the Google Career Certificates program. We’re excited to help Massachusetts residents learn how to use AI tools in their everyday work and create opportunities for career advancement. This partnership builds on our ongoing AI and career training commitments in Arkansas, Connecticut, Oklahoma and Virginia. By equipping residents with essential AI literacy, Massachusetts is ensuring its workforce is prepared for the jobs of today and tomorrow. Google is proud to call Massachusetts home with an office in Cambridge, and we’re committed to helping the state make AI literacy and professional training accessible to everyone. Massachusetts residents can now access Google’s AI Training at no cost. Grow with Google founder Lisa Gevelber and Governor Maura Healey meet with Google AI course graduates to discuss their journeys.
AI Data Centers Turn to High-Temperature Superconductors ieee_spectrum_ai 21.02.2026 14:00 0.625
Embedding sim.0.7337
Entity overlap0.027
Title sim.0.1209
Time proximity0.7325
NLP type: other
NLP organization: Microsoft
NLP topic: ai infrastructure
NLP country: United States

Open original

Data centers for AI are turning the world of power generation on its head. There isn’t enough power capacity on the grid to even come close to how much energy is needed for the number being built. And traditional transmission and distribution networks aren’t efficient enough to take full advantage of all the power available. According to the U.S. Energy Information Administration (EIA), annual transmission and distribution losses average about 5 percent. The rate is much higher in some other parts of the world. Hence, hyperscalers such as Amazon Web Services, Google Cloud and Microsoft Azure are investigating every avenue to gain more power and raise efficiency. Microsoft, for example, is extolling the potential virtues of high-temperature superconductors (HTS) as a replacement for copper wiring. According to the company, HTS can improve energy efficiency by reducing transmission losses, increasing the resiliency of electrical grids, and limiting the impact of data centers on communities by reducing the amount of space required to move power. “Because superconductors take up less space to move large amounts of power, they could help us build cleaner, more compact systems,” Alastair Speirs, the general manager of global infrastructure at Microsoft wrote in a blog post . Superconductors Revolutionize Power Efficiency Copper is a good conductor, but current encounters resistance as it moves along the line. This generates heat, lowers efficiency, and restricts how much current can be moved. HTS largely eliminates this resistance factor, as it’s made of superconducting materials that are cooled to cryogenic temperatures. (Despite the name, high-temperature superconductors still rely on frigid temperatures—albeit significantly warmer than those required by traditional superconductors.) The resulting cables are smaller and lighter than copper wiring, don’t lower voltage as they transmit current, and don’t produce heat. 
This fits nicely into the needs of AI data centers that are trying to cram massive electrical loads into a tiny footprint. Fewer substations would also be needed. According to Speirs, next-gen superconducting transmission lines deliver capacity that is an order of magnitude higher than conventional lines at the same voltage level. Microsoft is working with partners on the advancement of this technology, including being a part of a US $75 million Series B funding round into Veir, a superconducting power technology developer. Veir’s conductors use HTS tape, most commonly based on a class of materials known as rare-earth barium copper oxide (REBCO). REBCO is a ceramic superconducting layer deposited as a thin film on a metal substrate, then engineered into a rugged conductor that can be assembled into power cables. “The key distinction from copper or aluminum is that, at operating temperature, the superconducting layer carries current with almost no electrical resistance, enabling very high current density in a much more compact form factor,” says Tim Heidel, Veir’s CEO and cofounder.

Liquid Nitrogen Cooling in Data Centers

Photo caption: Ruslan Nagimov, the principal infrastructure engineer for cloud operations and innovation at Microsoft, stands near the world’s first HTS-powered rack prototype. Credit: Microsoft

HTS cables still operate at cryogenic temperatures, so cooling must be integrated into the power-delivery system design. Veir maintains a low operating temperature using a closed-loop liquid-nitrogen system: The nitrogen circulates through the length of the cable, exits at the far end, is recooled, and then recirculated back to the start. “Liquid nitrogen is a plentiful, low cost, safe material used in numerous critical commercial and industrial applications at enormous scale,” says Heidel.
“We are leveraging the experience and standards for working with liquid nitrogen proven in other industries to design stable, data center solutions designed for continuous operation, with monitoring and controls that fit critical infrastructure expectations rather than lab conditions.” HTS cable cooling can be done either within the data center or externally. Heidel favors the latter as that minimizes footprint and operational complexity indoors. Liquid nitrogen lines are fed into the facility to serve the superconductors. They deliver power to where it’s needed and the cooling system is managed like other facility subsystems. Rare earth materials, cooling loops, cryogenic temperatures—all of this adds considerably to costs. Thus, HTS isn’t going to replace copper in the vast majority of applications. Heidel says the economics are most compelling where power delivery is constrained by space, weight, voltage drop, and heat. “In those cases, the value shows up at the system level: smaller footprints, reduced resistive losses, and more flexibility in how you route power,” says Heidel. “As the technology scales, costs should improve through higher-volume HTS tape manufacturing and better yields, and also through standardization of the surrounding system hardware, installation practices, and operating playbooks that reduce design complexity and deployment risk.” AI data centers are becoming the perfect proving ground for this approach. Hyperscalers are willing to spend to develop higher-efficiency systems. They can balance spending on development against the revenue they might make by delivering AI services broadly. “HTS manufacturing has matured—particularly on the tape side—which improves cost and supply availability,” says Husam Alissa , Microsoft’s director of systems technology. 
“Our focus currently is on validating and derisking this technology with our partners with focus on systems design and integration.” This story was updated on 26 February, 2026 to correct details of Microsoft’s investment into Veir.
Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions microsoft_research 19.02.2026 16:00 0.624
Embedding sim.0.7235
Entity overlap0
Title sim.0.037
Time proximity0.9642
NLP type: product_launch
NLP organization:
NLP topic: code generation
NLP country:

Open original

November 11, 2025 BlueCodeAgent: A blue teaming agent enabled by automated red teaming for CodeGen AI
Introducing EVMbench openai 18.02.2026 00:00 0.624
Embedding sim.0.7249
Entity overlap0
Title sim.0.0261
Time proximity0.9643
NLP type: product_launch
NLP organization: OpenAI
NLP topic: ai agents
NLP country:

Open original

OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST huggingface 18.02.2026 16:15 0.621
Embedding sim.0.7108
Entity overlap0.0256
Title sim.0.0588
Time proximity0.9984
NLP type: experiment
NLP organization: IBM Research
NLP topic: ai agents
NLP country:

Open original

IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Enterprise Article. Published February 18, 2026. By Ayhan Sebin, Saurabh Jha, Rohan Arora, Daby Sow, Mert Cemri, Melissa Pan, and Ion Stoica.

IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation, for tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops. Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box problem, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability. By leveraging MAST to analyze ITBench, the industry benchmark for SRE, Security, and FinOps automation, we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it. We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.
Key Findings:

Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks like verification.
Large open models like GPT-OSS-120B suffer from cascading failure modes (5.3 failure modes/trace). A single reasoning mismatch early in the run poisons the context, leading to compounding hallucinations.
Across all models, the strongest predictor of failure is FM-3.3 (Incorrect Verification). Agents consistently "declare victory" without checking ground truth.
Kimi-K2 struggles to recognize when a task is done. It exhibits a massive spike in Premature Termination (+46%) and Unaware of Termination Conditions (+43%), often quitting just before solving the problem or looping indefinitely.

Takeaways from our analysis when building agents:

For frontier models like Gemini, externalize verification: Never let the LLM grade its own homework. Require hard tool evidence before exit.
Put termination and loop control outside the model: Termination issues are common killers (FM-1.5). Add explicit stop conditions and loop detectors for repeated tool calls/actions, or implement finite state machines.
Force clarify-or-read-only when inputs are ambiguous: Clarification failures (FM-2.2) are a major failure driver for smaller models. Make ambiguity a first-class branch in your agent graph.

If you’re building agents for enterprise IT workflows, this is the kind of evaluation you want: not just “did it pass?”, but “what broke, where, and what intervention is most leverageable?”

The "Black Box" Problem of Agent Benchmarks

Benchmarks like ITBench are becoming the standard for measuring agentic performance in high-stakes IT automation tasks. In ITBench, agents act as Site Reliability Engineers (SREs) or Security Analysts tasked with diagnosing Kubernetes outages, patching vulnerabilities, or managing cloud costs in production environments. These benchmarks use success rate as the main metric to evaluate agents.
However, this metric is insufficient for engineering robust systems. Knowing that an agentic system achieves a 14% success rate on ITBench tells us that it failed, but not why: Did it fail because it forgot the context? Because it hallucinated a command? Or because it simply did not terminate? Without a comprehensive approach to diagnosing these failures, developers are left guessing, often resorting to blind prompt tweaks that solve one problem only to create another.

As a new standard for analyzing the failure modes of complex agentic systems, we developed MAST (Multi-Agent System Failure Taxonomy). MAST brings more insight and opens up the opaque evaluation of these benchmarks. Derived from a rigorous analysis of over 1,600 traces across seven different frameworks, MAST provides a standardized taxonomy for agent failures. MAST converts unstructured execution logs into structured "failure vectors" based on 14 distinct patterns across three key categories:

FC1: System Design Issues (The "Skeleton"). Failures here stem from the agent's architecture and role definition. Examples: FM-1.3 Step Repetition (looping), FM-1.4 Loss of Conversation History (memory leaks), FM-1.5 Unaware of Termination (failing to stop).
FC2: Inter-Agent Misalignment (The "Communication"). Failures arising during runtime from how agents talk to each other or the environment. Examples: FM-2.2 Fail to Ask for Clarification (assuming instead of asking), FM-2.3 Task Derailment (going off-topic).
FC3: Task Verification (The "Quality Control"). Failures in quality assurance of the agents' output. Examples: FM-3.1 Premature Termination (giving up too soon), FM-3.3 Incorrect Verification (hallucinating success).

The Experiment: Diagnosing ITBench Agents

We stress-test the idea of using MAST to make agent evaluations actionable and to gain insight into failure modes by applying it to ITBench, a popular evaluation suite for IT automation tasks across SRE, Security/Compliance, and FinOps.
We annotated 310 ITBench SRE execution traces produced by an SRE agent built with Codex in realistic environments. These traces capture natural language interactions between agents and their tools across three models representing different capability tiers: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. This lets us look past simple success metrics and investigate the distinct failure signatures driving these results. We report recall scores because the models by design produce at most 3 to 5 outputs and SREs prefer recall over F1 score.

Gemini-3-Flash: 100 traces (75.5% Mean Recall)
Kimi-K2: 105 traces (28.6% Mean Recall)
GPT-OSS-120B: 105 traces (12.4% Mean Recall)

Below, we detail the findings from this diagnostic analysis.

Finding 1: Stronger models like Gemini-3-Flash show surgical (isolated) failure modes per trace, whereas open-source Kimi-K2 and GPT-OSS-120B show compounding failure patterns

When we examine the failed traces, a clear hierarchy of complexity becomes apparent across the three models. This is measured by the number of distinct failure modes observed per failed run.

Gemini-3-Flash: 2.6 failure modes per failed trace
Kimi-K2: 4.7 failure modes per failed trace
GPT-OSS-120B: 5.3 failure modes per failed trace

This disparity in failure mode density reveals a fundamental difference in how these systems break down. Gemini-3-Flash exhibits a surgical failure profile. Even in unsuccessful runs, it maintains high internal coherence and typically fails due to a single isolated failure, such as an incorrect verification step. These failures are precise and far easier to diagnose. On the opposite end of the spectrum, GPT-OSS-120B suffers from cascading collapse. In these traces, we observe that errors tend to compound over time. A small reasoning mismatch early in the process often leads to a deviation from the task specification, which in turn triggers a total derailment of the agent.
Kimi-K2 represents the middle ground, where failures are more frequent and complex than in the frontier model but do not reach the systemic instability seen in the 120B open-weights model. The significance of this finding is that a higher success rate is often accompanied by isolated failure. Systems that fail with fewer simultaneous problems are far more predictable and simpler to improve through targeted engineering interventions.

Finding 2: "Non-Fatal" vs. "Fatal" Failures

Perhaps the most critical insight from MAST is distinguishing between failures that the system can tolerate and those that are fatal to the success of the downstream task. By comparing the distribution of failure modes in successful traces vs. failed traces, we can classify them into two categories.

The "Non-Fatal" (Benign) Flaws

Across all three models, certain failure modes appear frequently even in runs that ultimately succeed. These are often structural frictions rather than terminal bugs.

FM-1.3 Step Repetition: This mode is present in over 90 percent of successful Kimi-K2 runs. In the SRE domain, iteration is often a necessity. An agent might query the same metric multiple times to verify whether a service is stabilizing or a fix has taken effect. Gemini-3-Flash actually shows less repetition in its failed traces, suggesting that it sometimes fails because it does not iterate enough.
FM-1.1 Disobey Task Specification: Agents frequently deviate from strict tool formatting or sequential instructions yet still manage to identify the correct root cause.

This separation is where MAST proves its value. It allows us to ignore benign failures, like the repetition that often occurs in troubleshooting, and focus instead on the fatal failures that kill a run.

The "Fatal" Flaws

Certain behaviors strongly separate success from failure. When these modes appear, the probability of a successful outcome drops precipitously. The most prominent example is FM-3.3 (Incorrect Verification).
This mode shows a 52 percent increase in failed Gemini-3-Flash traces compared to its successful ones. Other prominent failure modes are FM-1.5 (Unaware of Termination Conditions) and FM-2.6 (Reasoning-Action Mismatch). If these occur, the run is likely dead, which should guide practitioners toward robust context management strategies across the agents in the system and across multiple turns of interaction.

Case Study: Gemini-3-Flash (Decisive but Overconfident)

Gemini-3-Flash is highly efficient, but its primary bottleneck is its tendency to assume success without rigorous proof. Its failure signature is dominated by a massive delta in verification errors. It often identifies the correct signals but terminates before cross-referencing them against the ground truth. To fix this, developers should implement an external verification gate. By requiring tool-based evidence, such as a cleared alert or a healthy metric threshold, before allowing the agent to exit, we can mitigate this model’s inherent overconfidence.

Fix: To improve Gemini-3-Flash on ITBench, prompt engineering won't help much. The experiments in our NeurIPS 2025 paper show that manual interventions such as prompt engineering for memory-related failures yield only up to around 15.6% performance improvement. In a previous blog post on MAST, by contrast, we showed that structural changes yield up to 53% performance improvement because they tackle more fundamental issues with the system. Those changes include introducing new agents, such as a Summarizer Agent that reminds the other agents of what is going on and continuously augments their state (fixing FM-1.4), and introducing context management mechanisms, such as a stricter state machine that enforces termination (fixing FM-1.5).
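The external verification gate described above could look roughly like the following Python sketch. It assumes the agent's "done" claim and the tool-backed checks are available as plain values and callables; `Evidence`, `verification_gate`, and the two check factories are hypothetical names invented for this illustration, and the in-memory alert set and metric reader stand in for real tools such as an alerting system or a metrics query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Evidence:
    name: str
    passed: bool
    detail: str


# An evidence check queries a tool (never the LLM) and reports what it found.
EvidenceCheck = Callable[[], Evidence]


def verification_gate(agent_claims_done: bool, checks: List[EvidenceCheck]) -> Dict[str, object]:
    """Allow exit only if the agent claims done AND every tool check passes."""
    results = [check() for check in checks]
    failed = [e.name for e in results if not e.passed]
    # With no checks configured there is no evidence, so exit is refused.
    allow = agent_claims_done and bool(results) and not failed
    return {"allow_exit": allow, "failed": failed, "evidence": results}


def make_alert_check(firing_alerts: Set[str], alert: str) -> EvidenceCheck:
    """Hypothetical check: the alert must no longer be firing."""
    def check() -> Evidence:
        cleared = alert not in firing_alerts
        return Evidence("alert_cleared", cleared, f"{alert} firing={not cleared}")
    return check


def make_metric_check(read_metric: Callable[[], float], threshold: float) -> EvidenceCheck:
    """Hypothetical check: the metric must be at or below a healthy threshold."""
    def check() -> Evidence:
        value = read_metric()
        return Evidence("metric_healthy", value <= threshold, f"value={value}, threshold={threshold}")
    return check


# Stubbed environment: one alert still firing, error rate already healthy.
firing = {"HighErrorRate"}
checks = [
    make_alert_check(firing, "HighErrorRate"),
    make_metric_check(lambda: 0.01, threshold=0.05),
]
blocked = verification_gate(agent_claims_done=True, checks=checks)
```

The design choice here is that the model's own statement is necessary but never sufficient: the gate refuses to exit until evidence read from the environment agrees with the claim.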
Case Study: Kimi-K2 (The Termination Crisis)

While termination confusion (FM-3.1 and FM-1.5) is the prevalent failure mode for Kimi-K2, its failed trajectories are defined by a pervasive Action-Reasoning Mismatch (FM-2.6), which is present in a staggering 92% of its failures.

The Execution Gap: While parts of its internal reasoning are often accurate, it frequently identifies the correct next step but then executes a redundant or irrelevant command.
The Meta-Loop Trap: Roughly 25 percent of failed traces involve FM-2.3 (Task Derailment). When a tool call returns a minor error, the agent often abandons the primary incident to enter a cycle of debugging its own investigation scripts.

Kimi-K2 is a good example of an overthinking model: its reasoning chains are often too long, yet it can still fail at execution.

Case Study: GPT-OSS-120B

GPT-OSS-120B exhibits the most unstable failure signature of the cohort. This model exhibits an average of 5.3 distinct failure modes per failed trace, indicating a fundamental inability to maintain internal state.

Loss of Conversation History (FM-1.4): This is a unique fatal flaw for the 120B model. It loses conversation history in 24% of traces, whereas Gemini-3-Flash exhibited zero memory loss and Kimi-K2 only 7%. As SRE traces grow in length, GPT-OSS-120B effectively "forgets" the alerts it was originally triaging, leading to total task derailment.
Reasoning Disconnect (FM-2.6): A staggering 94% of traces show a decoupling of reasoning and action. It is about 3x more likely than Gemini (31%) to describe a correct plan but then execute a completely unrelated or redundant tool call.
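The successful-versus-failed comparison behind Finding 2 can be sketched as a prevalence delta per failure mode. The Python below is a minimal illustration that assumes each annotated trace carries a success flag and a set of MAST codes; the `AnnotatedTrace` type, the 0.3 threshold, and the four toy traces are assumptions made for the example, not the authors' tooling or data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class AnnotatedTrace:
    success: bool
    modes: FrozenSet[str]  # MAST failure-mode codes observed in this trace


def prevalence(traces: List[AnnotatedTrace], mode: str) -> float:
    """Share of traces in which the given failure mode appears."""
    if not traces:
        return 0.0
    return sum(mode in t.modes for t in traces) / len(traces)


def failure_mode_deltas(traces: List[AnnotatedTrace]) -> Dict[str, float]:
    """Prevalence in failed traces minus prevalence in successful traces."""
    ok = [t for t in traces if t.success]
    bad = [t for t in traces if not t.success]
    modes = set().union(*(t.modes for t in traces))
    return {m: prevalence(bad, m) - prevalence(ok, m) for m in sorted(modes)}


def split_fatal(deltas: Dict[str, float], threshold: float = 0.3) -> Tuple[List[str], List[str]]:
    """Label modes whose delta reaches the (assumed) threshold as fatal."""
    fatal = [m for m, d in deltas.items() if d >= threshold]
    benign = [m for m, d in deltas.items() if d < threshold]
    return fatal, benign


# Toy annotations, invented for illustration.
traces = [
    AnnotatedTrace(True, frozenset({"FM-1.3"})),
    AnnotatedTrace(True, frozenset({"FM-1.3", "FM-1.1"})),
    AnnotatedTrace(False, frozenset({"FM-1.3", "FM-3.3"})),
    AnnotatedTrace(False, frozenset({"FM-3.3", "FM-1.5"})),
]
deltas = failure_mode_deltas(traces)
fatal, benign = split_fatal(deltas)
```

On this toy data, step repetition (FM-1.3) gets a negative delta because it is more common in successful runs, which mirrors the article's point that repetition is often benign in troubleshooting.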
A different (and more useful) way to read the plots: “fatal” vs “non-fatal”

In summary, MAST lets you split failure modes into two buckets.

Recoverable / structural (show up even in successful traces). These failures are not fatal; the system can recover from them and still complete the task.
- FM-1.3 Step repetition
- FM-3.3 Incorrect verification (important nuance: the system does verify; it just verifies poorly)
- FM-2.6 Reasoning-action mismatch (often present, but not always decisive)

Fatal / decisive (strongly associated with failed traces). These are failures from which the system typically cannot recover.
- FM-1.5 Unaware of termination conditions
- FM-3.1 Premature termination
- FM-1.4 Loss of conversation history
- FM-2.3 Task derailment (rare but extremely diagnostic when it appears)
- FM-2.2 Fail to ask for clarification (especially for Granite/Llama regimes)

This is the “richer understanding” piece: two models can have the same success rate on a small slice, yet fail for entirely different reasons that require different fixes.

Conclusion

MAST is a tool that inspects agentic system traces to identify fine-grained failure types that support system development and debugging. In this blog, we show that by applying MAST to ITBench, we move from generic observations ("Open models struggle") to a concrete engineering roadmap that helps improve the performance of agentic systems relying on these models. For example:

For Gemini-3-Flash: Verification failure (FM-3.3) is the most common fatal failure for surgical models. Never allow an agent to self-terminate; require hard, tool-mediated evidence (e.g., AlertManager clearance or K8s state changes) before a run is considered successful.

For Kimi-K2: Use a deterministic state machine to address the model's frequent struggle with recognizing task completion. Its reasoning chains can be too long and struggle to terminate, so it might benefit significantly from tighter control over when to end.
For GPT-OSS-120B: Systemic collapse occurs when minor reasoning mismatches (FM-2.6) poison the task history. Implement aggressive context hygiene and early error detection to ensure that small misalignments do not compound into total derailment.

IT-Bench Paper: https://arxiv.org/pdf/2502.05352
IT-Bench Code: https://github.com/itbench-hub/ITBench
MAST Paper: https://arxiv.org/abs/2503.13657
MAST Code: https://github.com/multi-agent-systems-failure-taxonomy/MAST
MAST-Data: 🤗 MAST-Data (1600+ Traces)