NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer | NVIDIA Technical Blog
cooling
Event type: other
Topic: ai infrastructure
Organization: NVIDIA
Country: United States
Articles: 79
Unique sources: 17
Importance / Momentum: 3.79 / 0
Period: 16.03.2026 16:05 — 31.03.2026 22:46
Created: 06.04.2026 06:28:33
Articles in cluster: 79
Title: NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer | NVIDIA Technical Blog
Source: nvidia_dev_blog
Publication date: 16.03.2026 16:05
Score: 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: product_launch
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country:


Artificial intelligence is token-driven. Every prompt, reasoning step, and agent interaction generates tokens. Over the past year, token consumption has grown multifold and now exceeds 10 quadrillion tokens per year. And while the majority of tokens so far have been generated by humans interacting with AI, the new era is one in which most tokens will be generated by AI interacting with AI. Modern agentic systems plan tasks, invoke tools, execute code, retrieve data, and coordinate across continuous multistep workflows with numerous AI agents. These interactions generate large volumes of reasoning tokens, expand KV cache, and require CPU-based sandboxed environments to test and validate results generated by accelerated computing systems. This places low-latency, high-throughput demands across GPUs, CPUs, scale-up domains, scale-out networks, and storage.

Delivering useful intelligence for these modern agentic systems requires fleets of purpose-built rack-scale systems that function together as one coherent AI supercomputer. This post introduces the NVIDIA Vera Rubin POD, a set of five specialized rack-scale systems built on the third-generation NVIDIA MGX rack architecture for the era of agentic AI.

Introducing NVIDIA Vera Rubin POD

Built through extreme co-design of seven chips spanning compute, networking, and storage, NVIDIA Vera Rubin introduces the most sophisticated POD-scale AI platform. The platform features 40 racks, 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, 1,152 NVIDIA Rubin GPUs, 60 exaflops, and 10 PB/s of total scale-up bandwidth. The Vera Rubin POD introduces five new, distinct, purpose-built rack-scale systems for agentic AI workloads that require high throughput, extreme low-latency inference, dense CPU sandboxing, and massive context memory storage. Together, these racks form one cohesive system that will power the world's most energy- and cost-efficient data centers.

Figure 1. NVIDIA Vera Rubin POD includes five rack-scale systems, one AI supercomputer, one NVIDIA MGX rack architecture, and an ecosystem

Each chip in the POD scales with a third-generation NVIDIA MGX rack, supported by an ecosystem of more than 80 partners with a global supply chain experienced in bringing large-scale AI systems to market. This enables fast deployments and seamless transitions, with each NVIDIA MGX rack sharing the same power, cooling, and mechanical envelopes. There are two types of MGX racks, both with copper spines designed for performance, resiliency, and energy efficiency. The MGX NVL rack is connected by NVIDIA NVLink, and the new NVIDIA MGX ETL rack is connected by one of two types of spines: NVIDIA Spectrum-X Ethernet or NVIDIA Groq 3 LPU direct chip-to-chip links.

NVIDIA Vera Rubin NVL72: Platform for the four scaling laws

NVIDIA Vera Rubin NVL72 is the core rack-scale compute engine of the latest AI factory. Integrating 72 NVIDIA Rubin GPUs and 36 NVIDIA Vera CPUs connected through a massive NVLink copper spine, it acts as one giant GPU. NVIDIA Vera Rubin NVL72 is designed for the four scaling laws of AI: pretraining, post-training, test-time scaling, and agentic scaling. It can be optimized for complex mixture-of-experts (MoE) routing and the heavy, compute-bound context phase of AI inference. It delivers up to 4x better training performance, up to 10x better inference performance per watt, and one-tenth the token cost relative to NVIDIA Blackwell.
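As a rough, back-of-the-envelope reading of the headline POD figures above (purely illustrative: it assumes all 1,152 Rubin GPUs sit in NVL72 compute racks and ignores whatever precision and sparsity conventions sit behind the 60-exaflops figure):

```python
# Back-of-envelope arithmetic on the Vera Rubin POD headline figures quoted above.
# Assumption (not stated in the post): all 1,152 Rubin GPUs live in NVL72 racks.

POD_GPUS = 1_152          # Rubin GPUs per POD
GPUS_PER_NVL72 = 72       # GPUs per Vera Rubin NVL72 rack
POD_RACKS_TOTAL = 40      # all rack types (compute, LPX, CPU, storage, networking)
POD_EXAFLOPS = 60         # headline compute; precision/sparsity convention unspecified

nvl72_racks = POD_GPUS // GPUS_PER_NVL72
print(f"Implied NVL72 compute racks: {nvl72_racks} of {POD_RACKS_TOTAL} total racks")
print(f"Implied compute per GPU: {POD_EXAFLOPS * 1000 / POD_GPUS:.1f} petaflops")
```

How the remaining racks split across the LPX, Vera CPU, BlueField-4 STX, and Spectrum-6 SPX systems described below is not broken out in the post.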
NVIDIA Groq 3 LPX: Inference accelerator racks

Co-designed with the NVIDIA Vera Rubin platform for the massive context and low-latency demands of agentic AI, NVIDIA Groq 3 LPX features 256 language processing units (LPUs) per rack. It pairs with Vera Rubin NVL72 to eliminate the tradeoff between high-speed interactivity and throughput. By fusing high-bandwidth SRAM-only LPUs with Rubin GPUs with large HBM capacity, the system delivers low latency and high throughput at long context lengths—supercharging user interactivity for trillion-parameter models without sacrificing system throughput. Vera Rubin NVL72 plus LPX delivers up to 35x more tokens and up to 10x more revenue opportunity for trillion-parameter models relative to Blackwell. To learn more, see Inside NVIDIA Groq 3 LPX.

NVIDIA Vera CPU rack: Agentic AI and reinforcement learning at scale

The NVIDIA Vera CPU rack integrates up to 256 NVIDIA Vera CPUs in a dense, liquid-cooled rack to provide scalable, energy-efficient capacity. A single rack can sustain over 22,500 concurrent reinforcement learning (RL) or agent sandbox environments, maximizing environments to test, execute, and validate results from the Vera Rubin NVL72 and LPX racks. Vera CPU racks provide the foundation for large-scale agentic AI and reinforcement learning, delivering results twice as efficiently and 50% faster than traditional rack-scale CPUs. Learn more about how the Vera CPU delivers high-performance bandwidth and efficiency for AI factories.

NVIDIA BlueField-4 STX: AI-native storage

The NVIDIA BlueField-4 STX rack is built with the NVIDIA BlueField-4 processor, which combines the Vera CPU and ConnectX-9 SuperNIC, and scales out with Spectrum-X Ethernet networking. It hosts the NVIDIA CMX context memory storage platform, a new class of AI-native storage infrastructure that seamlessly extends GPU context capacity across the POD and accelerates inference by offloading KV cache into a dedicated, high-bandwidth storage layer. CMX is optimized to store and serve massive context memory (KV cache), treating temporary inference context as an AI-native, shared data type that can be reused across turns, sessions, and agents. This delivers up to 5x higher tokens per second and up to 5x better power efficiency than traditional storage approaches.

NVIDIA Spectrum-6 SPX: Networking racks

Connecting the entire POD into a single supercomputer are the NVIDIA Spectrum-6 SPX networking racks. The Spectrum-6 SPX networking rack is engineered to accelerate east-west and north-south traffic across AI factories. Configurable with either Spectrum-X Ethernet or NVIDIA Quantum-X800 InfiniBand switches, it delivers low-latency, high-throughput rack-to-rack connectivity at scale. The Spectrum-6 SPX rack now includes the 102.4 Tb/s Spectrum-6 switch, which features 512 lanes and 200 Gb/s co-packaged optics (CPO) in single- and multi-chip switch offerings. This silicon photonics integration replaces pluggable transceivers, delivering the highest power efficiency and resiliency, low latency and jitter, and nearly perfect effective bandwidth that keeps AI workloads across compute and storage environments perfectly synchronized.

By co-designing these purpose-built racks to operate as one, the Vera Rubin POD is positioned to accelerate every component of agentic AI workloads. This begins with the streamlined NVIDIA MGX rack design that forms the foundation of every rack in the POD.
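A quick sanity check of the Spectrum-6 switch figures quoted above, multiplying the lane count by the per-lane rate:

```python
# Spectrum-6 switch: 512 lanes at 200 Gb/s each should match the quoted 102.4 Tb/s.
LANES = 512
GBPS_PER_LANE = 200

total_tbps = LANES * GBPS_PER_LANE / 1_000
print(f"Aggregate switch bandwidth: {total_tbps:.1f} Tb/s")  # -> 102.4 Tb/s
```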
Third-generation NVIDIA MGX rack-scale architecture

Production-grade AI racks must excel across several critical areas: rapid time to volume, proven performance at scale, deep hardware-software co-design, resiliency and energy efficiency, seamless data center deployment and logistics, readiness for future architectures, and more. The third-generation NVIDIA MGX rack-scale architecture sets the standard across all categories with engineering breakthroughs integrated throughout its mechanical, power, and cooling design.

Enabling resiliency and scalability

The NVIDIA MGX rack prioritizes PCB-based connections with its single-wide design. It unlocks completely modular, cable-free, hose-free, and fanless compute and NVLink switch trays, enabling maximum reliability, scalability, and serviceability. Single 19-inch-wide racks also simplify shipping and logistics, accelerating deployment across AI factories.

Figure 2. The NVIDIA MGX rack spine holds thousands of cables and can be configured with NVLink for MGX NVL racks, and Spectrum-X Ethernet or direct Groq 3 LPU chip-to-chip links for MGX ETL racks

The rack features a highly modular spine as its backplane, consisting of up to four preintegrated and prevalidated copper cable cartridges that connect each tray as one. The spine holds thousands of cables and shares the same mechanical form factor for both MGX NVL and MGX ETL racks.

Ensuring peak energy efficiency from chip to grid

At the component level, NVIDIA MGX racks feature dynamic power steering, where the system provisions power to the components that need it most. This feature can move power between the CPUs, GPUs, and NVLink switch trays to ensure components in the rack operate at peak energy efficiency, improving performance per watt.

Figure 3. NVIDIA MGX racks feature Intelligent Power Smoothing to ensure components in the rack operate at peak energy efficiency

AI training and inference workloads create large load swings. If not managed effectively, load swings can cause significant stress on the electrical grid, data center power infrastructure, and IT equipment. To protect against power swings, MGX racks feature rack-level energy storage that cushions power transients with capacitors. When workloads demand lots of power at once, the capacitors supply the additional power while the grid power draw remains flat or ramps up. When workloads suddenly stop, the capacitors charge while the grid power remains flat or ramps down.

NVIDIA Vera Rubin NVL72 now introduces Intelligent Power Smoothing. It features 6x more rack-level energy storage (400 J per GPU) versus prior generations and introduces a new closed-loop system that enables the GPUs to continuously monitor the state of charge of the capacitors to more efficiently flatten power profiles. This achieves much smaller AC power variation per minute, reduces peak current demands by up to 25%, and eliminates the need for massive battery packs to protect against large-scale power transients.

Figure 4. Dynamic Max-Q power provisioning can free up stranded power and unlock more GPU capacity

At the facility level, provisioning racks at static Max-P strands power capacity that could otherwise be used to generate tokens. It assumes homogeneous workloads that always require peak power, when in reality AI factories run a mix of workloads with varying power needs.
By provisioning MGX racks at a lower dynamic Max-Q level, data centers can maximize AI data center throughput by dynamically provisioning the correct amount of power to each rack depending on the workload. This frees up stranded power, unlocks up to 30% more GPUs in the same power budget with 45°C liquid cooling, and boosts performance per watt.

Unlocking larger energy budgets for compute

All MGX racks are universally designed to operate with 45°C (113°F) warm-water inlet temperatures, so data centers already designed for liquid cooling are guaranteed a seamless transition without redesigning cooling infrastructure. Figure 5 shows a schematic representation of an infrastructure layout that provides 41°C (105.8°F) water to coolant distribution units (CDUs), which in turn supply coolant at 45°C (113°F) to AI racks.

Figure 5. Energy- and cost-efficient free-cooling scenario when cooling NVIDIA MGX racks with a 45°C maximum inlet temperature

Operating at 45°C enables data centers in many climates to use ambient air and closed-loop dry coolers for cooling, reducing the need for compressors, driving down PUE, and unlocking larger energy budgets for compute. Lower inlet temperatures of 35°C require data centers to divert massive amounts of facility power or water for cooling, while higher inlet temperatures maximize the amount of grid power converted directly into tokens. This yields significant data center power savings—enough to allocate up to 10% additional Vera Rubin NVL72 racks for more token generation in the same power budget.

MGX racks can be 100% liquid-cooled, leveraging the same data center cooling infrastructure as prior generations. The third-generation MGX rack features new internal tray manifolds, rack UQD08 manifolds, and liquid-cooled busbars supporting up to 5,000 A. The coolant used for the rack will depend on the customer and data center, but many will continue to use de-ionized water or propylene glycol-based fluid (PG25), which can last up to 10 years in a closed-loop system with minimal liquid maintenance.

Open standard

Underpinning these features is an open, standardized MGX rack architecture. The first mass-production MGX rack-scale system shipped with NVIDIA Blackwell in 2024. NVIDIA contributed the design to the Open Compute Project (OCP), reinforcing the commitment to open source technologies and enabling the entire ecosystem to rapidly innovate and accelerate adoption. NVIDIA has built an ecosystem of more than 80 global partners, creating a highly efficient, globally diversified supply chain that is experienced in bringing rack-scale AI systems to market.

NVIDIA MGX NVL racks

As independent third-party SemiAnalysis InferenceMax benchmarks demonstrate, NVIDIA rack-scale systems deliver 50x better performance per watt and 35x lower cost per token (NVIDIA GB300 NVL72 versus NVIDIA H200), which translates directly into higher revenues and better operating margins. In 2024, NVIDIA shipped the first NVIDIA GB200 NVL72 rack-scale systems. In 2025, NVIDIA GB300 NVL72 shipped. Now, NVIDIA Vera Rubin NVL72 is in full production, on track to ship in the second half of 2026.

Streamlined design of NVIDIA Vera Rubin NVL72

NVIDIA Vera Rubin NVL72 is an engineering marvel designed to drop seamlessly into existing data center footprints. It will feature nearly two times more transistors than NVIDIA GB200 NVL72 while delivering 10x more performance per watt through extreme co-design.
The rack integrates 72 NVIDIA Rubin GPUs, 36 NVIDIA Vera CPUs, ConnectX-9 SuperNICs, and BlueField-4 DPUs across 18 compute trays, alongside 9 NVLink switch trays. In total, the rack houses 1.3 million individual components and nearly 1,300 chips, all packed into a single-wide third-generation NVIDIA MGX rack weighing roughly 4,000 lbs, or about the weight of a pickup truck.

Figure 6. NVIDIA Vera Rubin NVL72 rack

Compute and NVLink Switch trays

Enabling these 72 GPUs to act as a single unified engine is the sixth-generation NVLink. It delivers 3.6 TB/s of bandwidth per GPU and 260 TB/s of scale-up bandwidth per rack—more data than the bandwidth of the entire global internet. This high-speed data transfer happens in the NVLink spine at the back of the rack, which features four modular preintegrated cable cartridges housing 5,000 copper cables over two miles in length.

Video 1. Key differences between the NVIDIA Vera Rubin compute tray and the NVIDIA Grace Blackwell compute tray

The compute trays inside the Vera Rubin NVL72 are completely redesigned from NVIDIA Blackwell. Each tray features a robust PCB midplane, designed to fit in a single-wide rack, that unlocks a cable-free, hose-free, and fanless design. This simplification drops compute tray assembly time from nearly two hours to just five minutes—up to 20x faster assembly and serviceability. Each compute tray features two NVIDIA Vera Rubin superchips with 17,000 components each—approximately five times as many components as a modern smartphone. The superchips are connected through the PCB midplane to the front modular bays that house eight ConnectX-9 SuperNICs and one BlueField-4 DPU.

Figure 7. NVIDIA Vera Rubin NVLink Switch tray

Vera Rubin NVL72 introduces new rack-scale resiliency features designed to maximize uptime and goodput for large AI clusters. The NVLink switch trays support operational resiliency features that allow administrators to place switches into maintenance mode and replace them while the rack continues operating. The architecture also supports continued operation even if multiple switch trays are unavailable, minimizing disruption during maintenance. At the silicon level, NVIDIA Rubin GPUs continuously run nondisruptive health checks, and NVIDIA Vera CPUs feature in-system testing and SOCAMM memory for faster serviceability. Together, these chip-to-rack innovations reduce operational overhead and build on the resiliency improvements seen with Blackwell clusters.

NVIDIA Vera Rubin Ultra NVL576

NVIDIA Vera Rubin Ultra introduces a new two-layer all-to-all NVLink topology that will enable developers to scale up to 576 GPUs. Vera Rubin Ultra NVL576 will combine eight separate MGX NVL racks, each with 72 Rubin Ultra GPUs, all in a single 576-GPU NVLink domain with copper and direct optical connections. It will be built using the same MGX rack-scale ecosystem for the fastest time to production. Demonstrating this massive multirack NVLink topology, Polyphe is NVIDIA's internal, fully functional GB200-based prototype of the multirack NVL576 scale-up architecture.

Figure 8. NVIDIA Polyphe prototype, a fully functional GB200-based multirack NVL576 scale-up system

NVIDIA Kyber NVL1152: The next generation

To scale beyond NVL576, a new MGX rack, NVIDIA Kyber, will be introduced. NVIDIA Kyber is the next-generation MGX NVL rack design that will double the NVLink domain per rack to fit 144 GPUs.

Figure 9. NVIDIA Kyber NVL1152
NVIDIA Kyber will scale up into a massive all-to-all NVL1152 supercomputer using similar direct optical interconnects for rack-to-rack scale-up. Kyber provides the foundation for the next era of extreme scale-up AI computing using NVIDIA Feynman. Kyber will first be introduced with Vera Rubin Ultra as a standalone NVL144 system, providing customers with three options for Vera Rubin Ultra NVLink scale-up domains: NVL72, NVL144, and the flagship NVL576.

NVIDIA MGX ETL racks

While NVIDIA MGX NVL racks provide massive scale-up compute domains, agentic AI workflows demand highly specialized nodes for extreme low-latency inference, CPU sandboxing, and accelerated context memory for KV cache. To support these diverse needs, Vera Rubin introduces the MGX ETL rack architecture, a new fully configurable MGX rack designed with a Spectrum-X Ethernet spine or a direct chip-to-chip spine, leveraging the same rack-scale ecosystem as MGX NVL racks.

Figure 10. NVIDIA MGX ETL rack-scale systems add support for Spectrum-X Ethernet while leveraging the same MGX rack infrastructure, including the cable cartridge housing the copper spine

MGX ETL shares the same form factor and physical infrastructure as MGX NVL racks and is designed to operate under the same mechanical, power, and cooling envelope. Both racks will share the same key rack components built by the experienced MGX ecosystem: racks, chassis, trays, cable cartridges, liquid cooling manifolds, quick disconnects, busbars (standard and liquid-cooled), support bracketry, side rails, power shelves, leak containment trays, tray handles, and more. MGX ETL will use preintegrated and prevalidated copper cable cartridges with either a Spectrum-X Ethernet spine or a direct chip-to-chip spine. MGX ETL will leverage the established MGX ecosystem and supply chain that has been building this rack architecture in high volume for multiple years.

NVIDIA Spectrum-X Ethernet spine

MGX ETL with a Spectrum-X Ethernet spine will be the foundation for the Vera CPU rack and the BlueField-4 STX storage rack in the Vera Rubin POD. The rack is highly configurable and can also be made to house up to 256 Rubin GPUs (HGX Rubin NVL8 systems), XPUs, or more.

Figure 11. The 1U MGX ETL switch tray provides Spectrum-X Ethernet connectivity for the MGX ETL spine

In this design, 1U MGX ETL switch trays (based on Spectrum-6) sit in the middle of the rack. Rear-facing ports connect to the copper spine, while 32 front-facing OSFP cages provide optical transceiver connectivity to the rest of the POD. MGX ETL leverages a Spectrum-X Multiplane topology that fans out the 200 Gb/s lanes across multiple switches, delivering full all-to-all connectivity among nodes within the rack while maintaining a single network tier. The preintegrated copper spine provides resilient, power-efficient connectivity (enabling connectivity between ETL racks with a single tier of optics) and extends purpose-built Spectrum-X Ethernet with zero jitter, noise isolation, and load balancing across the entire 256-chip rack.

Direct chip-to-chip spine

Designed for extreme low-latency inference, the LPX rack connects 256 LPUs as one. It features 32 compute trays, each with eight LPUs, connected by a direct chip-to-chip spine, which consists of two copper cable cartridges that create an intricate point-to-point topology over thousands of paired copper cable connections.
These cables make up the direct chip-to-chip spine at the back of the rack, using the same cable cartridge mechanical form factor as other MGX racks. This massive interconnected fabric enables the entire 256-LPU rack to act as a single fast inference engine to be deployed with Vera Rubin NVL72. When scaled to multiple LPX racks in data center deployments, the direct chip-to-chip links are maintained across racks, enabling multiple LPX racks to operate as a single, incredibly fast inference engine.

NVIDIA Vera Rubin DSX AI factory platform

NVIDIA Vera Rubin DSX is the AI factory platform that provides a blueprint and reference design for co-designed AI infrastructure from chip to grid. It maximizes grid-power-to-token efficiency and goodput, and accelerates time to first production.

Figure 12. Pillars of NVIDIA Vera Rubin DSX, an open ecosystem for AI infrastructure build-outs

NVIDIA Vera Rubin DSX unifies chips, systems, software libraries, APIs, and a global partner ecosystem into a single architecture that tightly integrates compute, networking, storage, power, cooling, and facility controls across the entire AI factory. This enables ecosystem partners to rapidly design, deploy, and scale gigawatt AI factories with maximum token throughput per watt and improved uptime, with resiliency and energy efficiency built into the DSX platform end to end.

Learn more about NVIDIA Vera Rubin POD

AI infrastructure is rapidly evolving from discrete chips, standalone servers, and rack-scale systems to co-designed POD-scale supercomputers and AI factories. Modern agentic AI workloads are driving a shift toward purpose-built AI infrastructure that integrates compute, networking, and storage into a single cohesive supercomputer. The NVIDIA Vera Rubin POD unifies five rack-scale systems with key mechanical, power, and cooling innovations from the third-generation NVIDIA MGX rack, delivering scalability, resiliency, and energy efficiency. At AI factory scale, the NVIDIA Vera Rubin DSX Reference Design and the NVIDIA Omniverse DSX Blueprint for AI factory digital twins provide a unified framework for building and operating AI factories. Together, these innovations deliver dramatic gains in performance, cost efficiency, and energy savings to power the era of agentic applications.

Join us for NVIDIA GTC 2026 and watch the GTC keynote with NVIDIA founder and CEO Jensen Huang.

About the Authors

About Rohil Bhargava
Rohil Bhargava is a product marketing manager at NVIDIA, focused on data center GPUs and rack-scale systems. He holds an MBA in technology strategy from Carnegie Mellon University and a bachelor's degree in industrial engineering and economics from Northwestern University.

About Taylor Allison
Taylor Allison is a senior technical product marketing manager responsible for networking for AI at NVIDIA, including Spectrum-X Ethernet and Quantum InfiniBand. Taylor has extensive experience in product marketing, product management, and software engineering, with a focus on AI and HPC solutions. Taylor has an M.S. in Mathematics from the University of North Carolina.
About Harry Petty
Harry Petty is a senior technical marketing manager for HPC and AI edge applications at NVIDIA. Previously, he was a principal engineer and marketing director at Cisco Systems, where he brought SDN innovations to market for hybrid cloud, multitenant security, and data center application performance. Harry has an MBA from the Booth Graduate School of Business and a BS in mathematics and computer science from the University of Dayton.
Title: OpenAI tries to build its coding cred by acquiring Astral
Source: the_register_ai
Publication date: 19.03.2026 21:13
Score: 0.808
Embedding sim.: 0.9075
Entity overlap: 0.1667
Title sim.: 0.4074
Time proximity: 0.8737
NLP type: acquisition
NLP organization: OpenAI
NLP topic: code generation
NLP country:


Devops

OpenAI tries to build its coding cred, acquires Python toolmaker Astral
Deal helps company build out its Codex team

Thomas Claburn, Thu 19 Mar 2026 // 21:13 UTC

In a move clearly designed to strengthen its position among developers, OpenAI has acquired Python tool maker Astral. The house of Altman expects the deal to strengthen the ecosystem for its Codex programming agent.

Since its founding in 2022 by Charlie Marsh, Astral has won over a substantial portion of the Python community with Rust-based tools like uv (package and project manager), Ruff (linting and formatting), and ty (type checker) that outperform Python-based tools like pip. OpenAI says that it plans to continue supporting these tools as open source projects while using them internally to improve Codex and make AI more useful as part of the software development lifecycle.

"Our goal with Codex is to move beyond AI that simply generates code and toward systems that can participate in the entire development workflow – helping plan changes, modify codebases, run tools, verify results, and maintain software over time," the company said in a blog post. "Astral's developer tools sit directly in that workflow. By integrating these systems with Codex after closing, we will enable AI agents to work more directly with the tools developers already rely on every day."

If Astral's tools can make AI-generated code more maintainable, that would be a significant win. One of the emerging concerns about the shift toward AI-generated code is that it's more difficult to maintain. The deal also looks to be an acquisition of talent. According to OpenAI, the Astral team will join those working on Codex in an effort to make the coding agent more adept.

OpenAI's acquisition of Astral follows rival Anthropic's December 2025 purchase of Bun, a runtime, package manager, test runner, and bundler for JavaScript/TypeScript applications. In a blog post on Thursday, software developer Simon Willison remarked that the two deals have occurred amid intense competition between OpenAI and Anthropic. "Bun was already a core component of Claude Code and that acquisition looked to mainly be about ensuring that a crucial dependency stayed actively maintained," he wrote. "Claude Code's performance has increased significantly since then thanks to the efforts of Bun's Jarred Sumner."

"One bad version of this deal would be if OpenAI start using their ownership of uv as leverage in their competition with Anthropic."

But Willison also suggests a possible motive for the deal unrelated to competitive leverage. He points out that Marsh's post about the deal thanks investors who committed to Series A and Series B funding rounds. Those investors, he speculates, should now be able to exchange their stake in Astral for a stake in OpenAI, which reportedly could go public as soon as the end of this year.
Title: Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere | NVIDIA Technical Blog
Source: nvidia_dev_blog
Publication date: 17.03.2026 17:13
Score: 0.799
Embedding sim.: 0.9172
Entity overlap: 0.0769
Title sim.: 0.2256
Time proximity: 0.9987
NLP type: product_launch
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country:


AI-native services are exposing a new bottleneck in AI infrastructure: as millions of users, agents, and devices demand access to intelligence, the challenge is shifting from peak training throughput to delivering deterministic inference at scale—predictable latency, jitter, and sustainable token economics. NVIDIA announced at GTC 2026 that telcos and distributed cloud providers are transforming their networks into AI grids, embedding accelerated computing across a mesh of regional POPs, central offices, metro hubs, and edge locations to meet the needs of AI-native services. This post explains how AI grids make real-time, multi-modal, and hyper-personalized AI experiences viable at scale by running inference across distributed, workload-, resource-, and KPI-aware AI infrastructure.

Intelligent workload placement across distributed sites

The NVIDIA AI Grid reference design provides a unified framework for building geographically distributed, interconnected, and orchestrated AI infrastructure. Figure 1 shows how existing network assets come together as an AI grid.

Figure 1. Topology view of an AI grid, spanning centralized AI factories and distributed edge nodes across telco and CDN sites

A key aspect of this design is the AI grid control plane, which turns otherwise siloed clusters and regions into a single programmable platform. Its primary focus is intelligently determining where each workload should run to meet its KPI:

- KPI-aware routing that places workloads based on latency requirements, sovereignty constraints, and cost.
- Resource-aware placement that continuously accounts for node health, utilization, and quotas to avoid overloaded or degraded sites before users see tail-latency spikes.
- Compatible traffic is also steered to nodes with high KV-cache hit probability, reducing token latency and GPU cycles per request.

Figure 2. AI grid control plane treating distributed endpoints as a single logical platform for workload- and resource-aware routing

Workloads that benefit most from AI grids

Intelligent workload placement matters most for applications where latency, bandwidth, personalization, or sovereignty become first-order design constraints. The following table maps these workload classes to example applications and the KPIs they must optimize to deliver consistent user experiences and sustainable economics.

Workload class | Example applications | Target KPI
Real-time, latency-sensitive control loops | Physical AI (robots, sensors), conversational agents, AR/VR, wearables | End-to-end latency and jitter within SLA
Token- and bandwidth-intensive multimodal | Vision and media AI workloads that can generate up to 100x more raw data than text | Network bandwidth and egress economics
Hyper-personalized experiences at scale | Per-user recommendations, in-app copilots, dynamic media insertion | High concurrency within latency and cost budgets
Sovereign and regulated data workloads | Government AI, healthcare, financial services, regulated enterprise data | Data, models, and logs kept in-jurisdiction

Table 1. AI workload classes that benefit from AI grids, with example applications and the primary optimization targets

Not only do AI grids accelerate classical edge applications, they also unlock a new set of AI-native services built around real-time generation and personalization. The following sections explain how AI grids enable three such workloads at scale: voice, vision, and media.
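To make the control-plane behavior described above concrete, the sketch below shows one minimal way KPI- and resource-aware placement could be expressed. The node fields, scoring weights, and thresholds are invented for illustration and are not taken from the AI Grid reference design.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    region: str            # used for sovereignty constraints
    rtt_ms: float          # measured round-trip time to the user
    utilization: float     # 0.0 .. 1.0, current GPU utilization
    healthy: bool
    kv_hit_prob: float     # estimated KV-cache hit probability for this request

def place(nodes, latency_budget_ms, allowed_regions):
    """Pick the node most likely to meet the KPI: filter on hard constraints,
    then prefer low RTT, spare capacity, and likely KV-cache reuse."""
    candidates = [
        n for n in nodes
        if n.healthy and n.region in allowed_regions and n.rtt_ms < latency_budget_ms
    ]
    if not candidates:
        return None  # fall back to a centralized AI factory, queue, or reject
    return max(
        candidates,
        key=lambda n: 2.0 * n.kv_hit_prob - n.utilization - n.rtt_ms / latency_budget_ms,
    )

nodes = [
    Node("metro-edge-1", "eu-de", rtt_ms=8, utilization=0.62, healthy=True, kv_hit_prob=0.7),
    Node("regional-pop", "eu-de", rtt_ms=22, utilization=0.35, healthy=True, kv_hit_prob=0.2),
    Node("central-dc", "us-east", rtt_ms=95, utilization=0.20, healthy=True, kv_hit_prob=0.1),
]
chosen = place(nodes, latency_budget_ms=50, allowed_regions={"eu-de"})
print(chosen.name if chosen else "no in-budget node")  # -> metro-edge-1
```

A production scheduler would also fold in cost, quotas, and per-request-class sovereignty policy, as described above.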
AI Grid for voice

Why latency is critical for voice AI

Human-grade voice AI services are extremely sensitive to end-to-end latency. When responses exceed about 500 ms, conversations feel noticeably laggy to users. As a result, meeting this time-to-first-token (TTFT) target at the client becomes a hard SLO (service level objective).

Figure 3. Decomposition of client time-to-first-token (TTFT), showing how AI grid placement at the edge reduces round-trip time and queueing latency for voice interactions

The TTFT at the client (TTFT_Client) is the sum of five components:

- Network round-trip time (RTT): time for audio and tokens to travel between the user and the inference endpoint over the network.
- Queueing latency: time a request waits in line on the GPU or service before it starts executing.
- Compute latency:
  - Tokenization: time to convert incoming audio into tokens that the voice model can process. This includes automatic speech recognition (ASR) and text-to-speech (TTS).
  - Prefill and decode: time the model spends processing the prompt (prefill) and generating the first token (decode).
- Voice activity detection (VAD): detects when users start and stop speaking to accurately frame each turn.

RTT and queueing latency are largely determined by where inference runs, enabling AI grids to deliver meaningful latency improvements.

End-to-end latency

Figure 4. End-to-end latency comparison for a voice small language model running on RTX PRO 6000 GPUs in a centralized cluster versus a four-node AI grid under burst traffic

The above benchmark from Comcast compares the same voice small language model (SLM) from Personal AI running on 4 NVIDIA RTX PRO 6000 GPUs in two architectures: a single centralized cluster and an AI grid distributed across 4 sites, both subjected to a burst of highly correlated and concurrent sessions, where voice AI services are most strained. Across all test scenarios—from 50th percentile (P50) baseline traffic through 99th percentile (P99) burst traffic—the AI grid deployment keeps end-to-end latency for voice interactions within a 500 ms target, even as concurrent sessions spike. This is achieved by placing inference on regional edge nodes, cutting network round-trip time and reducing queueing latency.

Throughput and cost per token

Another key finding from this benchmark is throughput performance with correlated burst traffic. Rather than degrading under higher load, throughput increases as the four edge nodes absorb demand in parallel, reaching 42,362 tokens per second at burst—an 80.9% gain over baseline—while the centralized deployment loses throughput under the same conditions.

Figure 5. Voice model throughput under burst traffic in AI grid and centralized deployment architectures

As a result, inference on the AI grid runs with 52.8% lower cost per token than a centralized deployment at baseline, and that gap widens to 76.1% lower cost per token at burst as distributed GPU utilization improves with load. Centralized clusters burn much of their latency budget on RTT, so they must run at lower utilization to avoid tail-latency violations, while AI grid deployments keep RTT low and can safely drive GPUs harder at the same latency target.

Figure 6. Inference cost per token for centralized versus AI grid deployments

In production environments, both throughput and cost-per-token improvements may vary with model selection, workload characteristics, and live network conditions.
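The following is a minimal sketch of the TTFT decomposition above, with made-up component latencies (these are assumptions, not the Comcast benchmark values), showing how cutting RTT and queueing through edge placement changes whether the 500 ms target is met:

```python
# Client TTFT = RTT + queueing + tokenization (ASR/TTS) + prefill + first-token decode + VAD framing.
# Component values below are illustrative assumptions, not measured benchmark numbers.

TTFT_TARGET_MS = 500

def ttft_ms(rtt, queue, tokenize, prefill, decode, vad):
    return rtt + queue + tokenize + prefill + decode + vad

centralized = ttft_ms(rtt=120, queue=150, tokenize=60, prefill=110, decode=40, vad=30)
edge_grid   = ttft_ms(rtt=15,  queue=40,  tokenize=60, prefill=110, decode=40, vad=30)

for label, value in [("centralized", centralized), ("AI grid edge", edge_grid)]:
    status = "within" if value <= TTFT_TARGET_MS else "exceeds"
    print(f"{label}: {value} ms ({status} {TTFT_TARGET_MS} ms target)")
```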
AI Grid for vision

Metropolis at the edge: From perception to action

Vision AI workloads move far more data than text-based services, often generating terabits per second of concurrent video traffic at city scale. To make that practical, AI infrastructure has to keep latency low enough to react in real time, keep raw video in the right jurisdiction, and avoid turning network backhaul into the dominant cost of the system.

To meet these needs, the NVIDIA Metropolis vision AI application platform can run on AI grid nodes at the edge, inside the operator's jurisdiction and on isolated network slices. Cameras stream into nearby nodes where models anonymize personally identifiable information, understand scenes across many feeds, and trigger actions such as rerouting traffic or dispatching responders.

Network slicing, up-resolution, and bandwidth

In centralized-only cloud deployments, video data traverses several network hops to be processed and returned to operators. The physical distance added with each network hop adds inherent delay and increases the chance of encountering failure or congestion. In more efficient designs, operators can reduce backhaul by combining edge analytics with on-demand up-resolution. For example, cameras may stream at 360p (around 2 Mbps), and a Super Resolution model reconstructs 4K views only when operators need to inspect a scene, so full-resolution video crosses regional or backbone links only on demand.

When deployed on an AI grid, inference runs on RTX PRO GPUs at local edge nodes, and only lightweight alerts and metadata are sent over the network to centralized systems for fleet-wide monitoring, correlation across sites, and longer-term analysis. The result is consistently lower and more predictable end-to-end response times. Additionally, network slicing can provide Metropolis pipelines with dedicated, isolated bandwidth for safety-critical events and analytics, ensuring that safety-critical vision workloads are always prioritized and receive deterministic throughput and latency, without overprovisioning the whole network.

Figure 7. Bandwidth impact of running NVIDIA Metropolis vision AI pipelines on AI grid edge nodes

For a representative deployment with 1,000 4K cameras, moving from centralized processing to edge compression and then to edge analytics plus super-resolution can cut continuous backbone load from tens of Gbps to the low single-digit Gbps range. The numbers shown in Figure 7 are illustrative and will vary with camera settings, compression profiles, model choices, and live network conditions, but the relative savings between deployment models are expected to follow the same pattern.

AI Grid for media

Hyper-personalization is an infrastructure challenge

Hyper-personalization is where AI for media becomes continuous and per-session—content, overlays, language, and recommendations adapting in real time for every viewer. What makes these workloads distinct is that the value of the result expires quickly: a late ad fill causes jitter, a sports overlay that misses the broadcast window is irrelevant, and a recommendation that arrives too slowly loses the purchase moment.
Table 2 below highlights representative media AI use cases, the deadlines they operate under, and how AI grids execute each one to stay inside strict timing budgets.

Use case | Deadline | Constraint | AI Grid execution model
Real-time ad insertion | 16 ms | 60 fps frame budget | Context sampled every few seconds; lightweight per-frame shaders render deterministic fills
Sports analytics overlays | < 1 s | Beat broadcast feed | Telemetry transformed into overlays before the moment expires on air
E-commerce recommendations | < 200 ms | Bounce threshold | Vector re-ranking on edge nodes, explicitly prioritizing speed over deep reasoning
Live video translation | < 10 ms | Audio + caption sync | ASR, translation, and TTS run on-net; edge placement holds audio, caption, and video in sync

Table 2. Media AI use cases, deadlines, constraints, and how AI grids execute each workload to meet strict timing budgets

Benchmarking by Comcast and Decart validates that AI grids meet such deadlines consistently at scale by bringing compute closer to where content is delivered, reducing jitter through fewer network hops and lower contention at each hop. This lets the grid absorb correlated demand spikes regionally and avoid the backhaul that comes with routing inference traffic through a centralized facility. As with bursty voice traffic, distributing concurrent video generation demand across multiple edge sites lets operators push GPUs to higher utilization, which in turn drives higher throughput and lowers the effective cost of delivering each stream.

How media pipelines run on AI grids

On AI grids, media workloads run as low-latency streaming pipelines on distributed edge nodes instead of as centralized jobs in distant clouds. NVIDIA Holoscan coordinates the flow of frames and audio segments across these grid nodes—from ingestion through understanding and rendering—so real-time ad insertion, overlays, and personalization stages execute without breaking their frame or response budgets. NVIDIA Maxine-based services handle real-time video enhancement on the same edge nodes, while speech and translation services such as NVIDIA Riva and LipSync keep multi-language audio and video in sync without extra network hops.

Video generation models and egress economics

Video generation models produce significantly more data than text-only LLMs. For example, Decart's Lucy 2 video generation models generate approximately 5.5 Mbps of video. Compared to a text-based LLM, a 10-minute video-generation session generates 825,000x more data, dramatically increasing egress bandwidth.

Figure 8. Data egress for a 10-minute session, comparing LLM text output to video-generation model output

By bringing video generation closer to end users, AI grids make AI-powered media experiences economically viable and immersive even as personalization and concurrency grow.

AI-native services need AI grids

Telcos and content delivery providers are becoming central to how inference for AI-native services is delivered at scale, turning the network into part of the model execution path rather than a passive pipe. With workload-aware routing across AI factories and distributed edge sites, operators can steer AI services like voice, vision, and media to the right location so each workload meets its latency, concurrency, cost, and sovereignty requirements.

Getting started

Explore the AI Grid reference design to dive deeper into the architecture and deployment patterns discussed in this post.
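As a rough, purely illustrative check of the bandwidth arithmetic behind the vision and media examples above (Figures 7 and 8): the 360p and generated-video rates come from the post, while the 4K camera bitrate is an assumed value; real numbers depend on codecs, profiles, and network conditions.

```python
# Rough egress arithmetic for the vision and media examples above.
# The 4K bitrate is an assumption for illustration; real values depend on codec and profile.

CAMERAS = 1_000
BITRATE_4K_MBPS = 20      # assumed 4K camera stream
BITRATE_360P_MBPS = 2     # quoted in the post for 360p streams

backbone_4k_gbps = CAMERAS * BITRATE_4K_MBPS / 1_000
backbone_360p_gbps = CAMERAS * BITRATE_360P_MBPS / 1_000
print(f"1,000 cameras backhauled at 4K:   ~{backbone_4k_gbps:.0f} Gbps")
print(f"1,000 cameras backhauled at 360p: ~{backbone_360p_gbps:.0f} Gbps")

# 10-minute video-generation session at ~5.5 Mbps of generated video.
session_seconds = 10 * 60
video_egress_mb = 5.5 * session_seconds / 8   # megabytes
print(f"10-minute generated-video egress: ~{video_egress_mb:.0f} MB")
```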
About the Authors

About Sree Sankar
Sree Sankar is a visionary AI product and business executive and global head at NVIDIA, where she leads the strategy and vision for AI Grid, the company's distributed inference platform. What sets her apart is a rare ability to operate across the full technology stack — from the networking infrastructure that powers the digital world to the intelligent applications built on top of it. With two decades of experience and eight years of AI/ML product leadership, she has delivered transformative impact across computer vision, natural language processing, recommendation systems, and search — spanning ecommerce, retail-tech, SaaS, and advertising. She previously led AI applications at Meta and Amazon, and as SVP at Grabango, helped disrupt retail with computer-vision-powered checkout-free technology. At NVIDIA, she brings both worlds together — infrastructure and intelligence — to define how distributed AI gets built and deployed at scale.

About Shuvo Chowdhury
Shuvo Chowdhury is principal product manager at NVIDIA, pioneering AI Aerial and AI Grid platforms at the crossroads of AI, 6G/5G, cloud, and edge computing. His work enables scalable, intelligent infrastructure for next-generation wireless and distributed AI networks, advancing connectivity and automation for telco and industry innovators. Previously, Shuvo served as director of product management at Casa Systems, leading cloud-native wireless core software, and as director of solutions at Huawei Technologies, where he focused on orchestration and cloud transformation. At AT&T Labs, he spent over 10 years as technical lead and manager driving lab integrations. Shuvo holds a bachelor's degree in electrical and electronic engineering, a master's degree in electrical and computer engineering from the University of Windsor, and an MBA from The University of Texas at Austin, McCombs School of Business.

About Amogh Dendukuri
Amogh Dendukuri is a product marketing manager for telco AI at NVIDIA, where he drives go-to-market strategies that accelerate telco-led AI infrastructure and AI-powered telco operations. Previously, Amogh worked as a product manager in the networking industry, building solutions for telcos, cloud providers, and enterprises at the intersection of AI, cloud, and network transformation. Amogh holds a bachelor's degree in computer science and anthropology from the University of Illinois at Urbana-Champaign.
Title: From Simulation to Production: How to Build Robots With AI
Source: nvidia_blog
Publication date: 18.03.2026 13:00
Score: 0.796
Embedding sim.: 0.9047
Entity overlap: 0.7143
Title sim.: 0.093
Time proximity: 0.881
NLP type: other
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:


Into the Omniverse: NVIDIA GTC Showcases Virtual Worlds Powering the Physical AI Era

Editor's note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners, and enterprises can transform their workflows using the latest advances in OpenUSD...
Title: Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air | NVIDIA Technical Blog
Source: nvidia_dev_blog
Publication date: 16.03.2026 20:01
Score: 0.794
Embedding sim.: 0.8965
Entity overlap: 0.0769
Title sim.: 0.3136
Time proximity: 0.9999
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country: United States


Building AI factories is complex and requires efficient integration across compute, networking, security, and storage systems. To achieve rapid time to AI and strong ROI, the new NVIDIA DSX Air enables organizations to simulate their entire AI factory infrastructure in the cloud—covering compute, networking, storage, and security. Being able to design, test, and optimize systems before deploying hardware enables every layer of the AI factory to function as a unified, optimized system, preventing major delays or performance issues related to integration or misconfiguration challenges. DSX Air also enables continuous testing and validation of provisioning, automation, and security policies to streamline ongoing operations. This post shows how users can benefit from NVIDIA DSX Air through accelerated deployment timelines and simplified, full-stack cluster management.

How DSX Air enables AI factory simulation

To make AI factory simulation useful and practical for end users, DSX Air adds the following enhancements.

Guaranteed capacity

Subscription options provide guaranteed capacity without resource limits, enabling large-scale, long-lived simulations from pre-provisioning to decommission.

Unified account setup

Integrated with NVIDIA GPU Cloud, organizations and teams can manage access and resources through an NVIDIA Cloud Account (NCA). Users can join by signing up through the NGC portal, receiving an entitlement from NVIDIA, or being invited by an account owner. Individual organizations serve single users with access, while enterprises—activated through subscriptions such as DSX Air—support multiple users, team structures, and role-based access controls for efficient collaboration and resource sharing.

Simulation checkpoints

With checkpoints, users can save snapshots of their simulation state to pause and resume work without losing configuration changes or data. DSX Air automatically creates a checkpoint when a simulation stops, and users can view, manage, or relaunch from any saved checkpoint. Important checkpoints can be marked as favorites to prevent automatic deletion when storage limits are reached, ensuring critical simulation states are preserved. This capability streamlines iterative testing, configuration management, and operational continuity within AI infrastructure simulations.

Figure 1. DSX Air checkpoints for snapshots and iterative checkpoints

Simulation history

The history feature provides a detailed event log that tracks events through a simulation's lifecycle. It records key information, such as timestamps, event types, actors, and descriptions—covering actions like simulation creation, state changes, checkpoint operations, user activities, and errors. Users can filter entries by keyword to quickly pinpoint specific events, making it easier to understand system behavior and troubleshoot issues efficiently.

Figure 2. Simulation history tracks key events during the lifecycle of a simulation

Ecosystem enhancements

Ecosystem partners can bring their software images into the Air platform for deep integration and interoperability with server, storage, and router OEMs, as well as ISVs focused on orchestration, security, and operations. With this, organizations can build and validate joint solutions that combine NVIDIA infrastructure with partner offerings, ensuring seamless day-one interoperability across GPUs, NVIDIA NVLink, Ethernet switches, SuperNICs, DPUs, and complementary ISV tools.
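The simulation lifecycle described above (create, checkpoint, relaunch) is exposed through DSX Air's Python SDK and REST APIs, covered in the CI/CD section below. The sketch that follows is purely hypothetical: the endpoint paths, payload fields, and environment variables are invented for illustration and are not the documented DSX Air interface; consult the DSX Air User Guide for the real API.

```python
# Hypothetical sketch of driving a simulation lifecycle over a REST API from a CI job.
# Endpoints, fields, and auth shown here are invented for illustration only;
# consult the DSX Air User Guide / Python SDK for the real interface.
import os
import requests

BASE_URL = os.environ["DSX_AIR_URL"]        # e.g. injected by the CI pipeline
HEADERS = {"Authorization": f"Bearer {os.environ['DSX_AIR_TOKEN']}"}

def create_simulation(topology_file: str) -> str:
    """Upload a topology definition and start a simulation; returns its ID."""
    with open(topology_file, "rb") as f:
        resp = requests.post(f"{BASE_URL}/simulations", headers=HEADERS,
                             files={"topology": f}, timeout=60)
    resp.raise_for_status()
    return resp.json()["id"]

def checkpoint(sim_id: str, favorite: bool = False) -> str:
    """Save a checkpoint so the CI run can be paused and resumed later."""
    resp = requests.post(f"{BASE_URL}/simulations/{sim_id}/checkpoints",
                         headers=HEADERS, json={"favorite": favorite}, timeout=60)
    resp.raise_for_status()
    return resp.json()["checkpoint_id"]

if __name__ == "__main__":
    sim_id = create_simulation("spectrum_x_fabric.yaml")   # hypothetical topology file
    print("validating fabric config in simulation", sim_id)
    print("saved checkpoint", checkpoint(sim_id, favorite=True))
```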
DSX Air use cases for the full lifecycle of the AI factory

By simulating complete compute fabrics built with NVIDIA Spectrum-X Ethernet and NVLink technologies, organizations can accelerate the design, validation, and deployment of AI infrastructure. This reduces integration risks and compresses deployment cycles. Teams can automate provisioning, test software-defined configurations, and evaluate change impact without physical hardware dependencies. These pre-production validations enhance AIOps efficiency and ensure system integrity throughout the deployment lifecycle. For next-generation infrastructure, DSX Air supports simulation of NVIDIA Spectrum-6 Ethernet switches and NVLink switches for deploying AI factories built with the NVIDIA Vera Rubin platform.

Figure 3. An NVIDIA Spectrum-6 SN6600 Ethernet switch is simulated in the DSX Air topology canvas

CI/CD integration and DevOps enablement

Through its Python SDK and REST APIs, DSX Air supports integration with modern DevOps toolchains. This enables simulations to be instantiated programmatically within CI/CD pipelines for continuous verification of software and configuration updates. Integration with Git and artifact repositories also enables automated deployment testing, ensuring resilient software delivery, optimized resource utilization, and uninterrupted AI factory operations.

Get started

DSX Air provides a secure, on-demand environment for technical training and upskilling. The platform includes guided demos for skill-building with NVIDIA offerings like Cumulus Linux, NVIDIA Run:ai, and Base Command Manager. Teams can also replicate production environments through shared simulations, for experiential learning in a safe, isolated workspace. This approach reduces dependency on dedicated hardware labs while fostering operational proficiency and innovation.

Figure 4. The DSX Air demo marketplace showcases guided demos for skill building with NVIDIA offerings

Sign up for a free trial of NVIDIA DSX Air using the DSX Air User Guide. Read how the NVIDIA partner ecosystem is working together to build solutions spanning the full data center infrastructure stack.

About the Authors

About Ranga Maddipudi
Ranga Maddipudi is a director of product management in the networking group at NVIDIA in Santa Clara, where he leads product management for the NVIDIA Air Datacenter Simulation platform and SONiC. He brings over 20 years of industry experience, including prior roles at Cisco and VMware, with deep expertise in simulation, automation, telemetry, and observability.

About Avi Alkobi
Avi Alkobi is Head of Product Management in the Networking Business Unit at NVIDIA. For the past 13 years he has worked at NVIDIA in various roles focusing on the Ethernet switch product line: as a software developer, a team leader of the infrastructure team, and then as a senior application engineer supporting the field on post-sales, pre-sales, and complex proofs of concept. More recently, he has worked as a Senior Director, responsible for the networking business across EMEA. He holds a B.S. in Computer Science and an M.B.A. from Bar-Ilan University in Israel.
AsgardBench: A benchmark for visually grounded interactive planning microsoft_research 26.03.2026 19:02 0.759
Embedding sim.0.8623
Entity overlap0.1111
Title sim.0.2339
Time proximity0.9822
NLP типother
NLP организация
NLP темаrobotics
NLP страна

Открыть оригинал

March 26, 2026 GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
NVIDIA GTC 2026: Live Updates on What’s Next in AI nvidia_blog 20.03.2026 00:15 0.757
Embedding sim.0.8982
Entity overlap0.5714
Title sim.0.094
Time proximity0.5492
NLP типpartnership
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid CERAWeek — dubbed the Davos of energy — is where policymakers, producers, technologists and financiers gather to discuss how the world powers itself next.  NVIDIA and Emerald AI unveiled at...
Telling an AI model that it's an expert makes it worse the_register_ai 24.03.2026 00:20 0.74
Embedding sim.0.8633
Entity overlap0.0435
Title sim.0.2237
Time proximity0.7896
NLP типscientific_publication
NLP организацияUniversity of Southern California
NLP темаlarge language models
NLP страна

Открыть оригинал

Telling an AI model that it’s an expert programmer makes it a worse programmer Researchers say persona-based prompting works for safety but not for facts Thomas Claburn Tue 24 Mar 2026 // 00:20 UTC Many people start their work with AI by prompting the machine to imagine it is an expert at the task they want it to perform, a technique that boffins have found may be futile. Persona-based prompting – which involves using directives such as "You're an expert machine learning programmer" in a model prompt – dates back to 2023, when researchers began to explore how role-playing instructions influenced AI models’ output. It's now common to find online prompting guides that include passages like, "You are an expert full-stack developer tasked with building a complete, production-ready full-stack web application from scratch." But academics who have researched this approach report it does not always produce superior results. In a pre-print paper titled "Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM," researchers affiliated with the University of Southern California (USC) find that persona-based prompting is task-dependent – which they say explains the mixed results. For alignment-dependent tasks, like writing, role-playing, and safety, personas do improve model performance. For pretraining-dependent tasks like math and coding, using the technique produces worse results. The reason appears to be that telling a model it's an expert in a field does not actually impart any expertise – no facts are added to the training data. In fact, telling a model that it's an expert in a particular field hinders the model's ability to fetch facts from pretraining data. The researchers used the Measuring Massive Multitask Language Understanding (MMLU) benchmark, a means of evaluating LLM performance, to test persona-based prompting and found "when the LLM is asked to decide between multiple-choice answers, the expert persona underperforms the base model consistently across all four subject categories (overall accuracy: 68.0 percent vs. 71.6 percent base model). A possible explanation is that persona prefixes activate the model's instruction-following mode that would otherwise be devoted to factual recall." But persona-based guidance does help steer the model toward responses that satisfy the LLM-based judge assessing alignment. As an example, the authors note, "A dedicated 'Safety Monitor' persona boosts attack refusal rates across all three safety benchmarks, with the largest gain on JailbreakBench (+17.7 percentage points from 53.2 percent to 70.9 percent)." Zizhao Hu, a PhD student at USC and one of the study's co-authors, told The Register in an email that based on the study's findings, asking AI to adopt the persona of an expert programmer will not help code quality or utility.
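To make the comparison concrete, here is a small harness in the spirit of the MMLU-style test described above: it scores the same multiple-choice items with and without an expert-persona prefix. The ask_model function is a stub standing in for whatever LLM client you use, and nothing here reproduces the USC authors' actual code.

```python
# Hypothetical A/B harness: accuracy with vs. without a persona prefix.
# ask_model() is a placeholder for a real LLM call; replace it with your client.
PERSONA = "You are an expert in this subject. "

def ask_model(prompt: str) -> str:
    """Stub: should return the model's chosen option letter, e.g. 'B'."""
    raise NotImplementedError

def accuracy(items, prefix=""):
    correct = 0
    for question, choices, answer in items:
        prompt = prefix + question + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", choices))
        if ask_model(prompt).strip().upper().startswith(answer):
            correct += 1
    return correct / len(items)

# items = [("What is 2+2?", ["3", "4", "5", "6"], "B"), ...]
# base_acc = accuracy(items)              # no persona
# persona_acc = accuracy(items, PERSONA)  # expert-persona prefix
```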
But pointing to the prompt guidance we linked to above, Hu said "many other aspects, such as UI-preference, project architecture, and tool-preference, are more towards the alignment direction, which do benefit from a detailed persona.” “In the examples provided, we believe that the general expert persona is not necessary, such as 'You are an expert full-stack developer,' while the granular personalized project requirement might help the model to generate code that satisfies the user's requirements." Given that prompts about expertise do have an effect, the researchers – Hu and colleagues Mohammad Rostami and Jesse Thomason – proposed a technique they call PRISM (Persona Routing via Intent-based Self-Modeling) which attempts to harness the benefits of expert personas without the harm. "We use the gated LoRA [ low-rank adaptation ] mechanism, where the base model is entirely kept and used for generations that depend on pretrained knowledge," he explained, adding "This decision process is learned by the gate." The LoRA adapter is activated where persona-based behaviors improve output, and otherwise falls back on the unmodified model. The researchers designed PRISM to avoid the tradeoffs of other approaches – prompt-based routing, which applies expert personas at inference time, and supervised fine tuning, which bakes behavior into model weights. Asked whether there's a way to generalize about effective prompting methods, Hu said: "We cannot say for sure for general prompting, but from our discovery on expert persona prompt, a potential point is, 'When you care more about alignment (safety, rules, structure-following, etc), be specific about your requirement; if you care more about accuracy and facts, do not add anything, just send the query.'" ®
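The gated-LoRA idea Hu describes can be sketched in a few lines of PyTorch. This is a conceptual illustration of "keep the base weights frozen, and let a learned gate decide when the adapter contributes," not the PRISM authors' implementation; the module names and dimensions are made up for the example.

```python
# Conceptual sketch of a gated LoRA layer (illustrative, not the PRISM code).
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)   # stands in for a pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)                         # base weights stay frozen
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # low-rank adapter
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        self.gate = nn.Linear(in_features, 1)               # learned routing gate

    def forward(self, x):
        # gate ~0: fall back on pretrained knowledge; gate ~1: apply persona-style behavior.
        gate = torch.sigmoid(self.gate(x))
        return self.base(x) + gate * self.lora_b(self.lora_a(x))
```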
What Happens If AI Makes Things Too Easy for Us? ieee_spectrum_ai 22.03.2026 13:00 0.737
Embedding sim.0.8688
Entity overlap0.0286
Title sim.0.2258
Time proximity0.7173
NLP типscientific_publication
NLP организацияUniversity of Toronto
NLP темаai ethics
NLP странаCanada

Открыть оригинал

Most people who regularly use AI tools would say they’re making their lives easier. The technology promises to streamline and take over tasks both professionally and personally—whether that’s summarizing documents, drafting deliverables, generating code, or even offering emotional support. But researchers are concerned AI is making some tasks too easy, and that this will come with unexpected costs. In a commentary titled Against Frictionless AI , published in Communications Psychology on 24 February, psychologists from the University of Toronto discuss what might be lost when AI removes too much effort from human activities. Their argument centers on the idea that friction—difficulty, struggle, and even discomfort—plays an important role in learning, motivation, and meaning. Psychological research has long shown that effortful engagement can deepen understanding and strengthen memory, sometimes described as “desirable difficulties.” The authors worry that AI systems capable of instantly producing polished answers or highly responsive conversation may bypass these processes of learning and motivation. By prioritizing outcomes over effort, AI could weaken the experiences that help people develop skills, build relationships, and find meaning in their work. IEEE Spectrum spoke with the paper’s lead author, Emily Zohar , an experimental psychology Ph.D. student, about why she and her coauthors (psychologists Paul Bloom and Michael Inzlicht ) argue that friction matters—and what a more human-centered approach to AI design could look like. When you say “friction,” what do you mean, from both a cognitive and an interpersonal standpoint? Zohar: We define friction as any difficulty encountered during goal pursuit. In the context of work, it involves mental effort—rumination and persistence, staying on a problem for some time, and this helps solidify the idea and the creative process. In relationships, friction involves disagreement, compromise, misunderstanding, a back and forth that is natural where you don’t always see eye to eye, and it helps you broaden your horizons. Even the feeling of loneliness is important. It motivates you to find social interactions. So having these negative feelings and difficulty is important in the social context. Given that definition, what do you mean by “frictionless” AI? Zohar: Frictionless AI refers to the excessive removal of effort from cognitive and social tasks. With AI, as we typically use it, it’s really easy to go from ideation right to the end product. You ask AI to solve something with one prompt, and it completes the whole thing. This is a problem because it takes away the intermediate steps that really drive motivation and learning, and it prioritizes outcome over process. Rather than working through the steps, AI does that meaningful work for you. There’s a lot of research showing work products are better with AI. That makes sense, it has all this knowledge, but it does worry us as it may be eroding something essential that will have long-term consequences. If you’re faced with the same problem and AI is removed, you don’t have the required knowledge to know how to face the problem next time. You argue that removing friction can harm learning and relationships. What role do effort and struggle play in human development? Zohar: In learning, the term is “desirable difficulties.” It’s the idea of effort and work, not just any effort but manageable effort. 
Facing problems that you can overcome, but you have to work at them a bit, that’s the key idea of friction. We don’t want you to face insurmountable problems. We want you to work hard, but still be able to overcome it. This helps you really digest information and learn from it. In interpersonal relationships, you have to face some difficulties to see other perspectives and learn from them, and learn to be accepting of others. If you’re used to an AI reinforcing all your ideas and being sycophantic, you’ll come into the real world and you won’t be used to seeing other ideas. You won’t know how to interact socially because you’ll expect people to always be on your side and agree with you. You won’t learn that life doesn’t always go exactly how you expect it to, and conversations don’t always go the way you want them to. AI’s Impact on Creative Processes A lot of technologies have historically aimed to reduce effort: calculators, washing machines, spell-check. What’s different about AI? Zohar: Past technologies have mostly focused on reducing physical effort. We don’t have to go down to the lake to wash our laundry anymore. [Past technologies] took away the mundane tasks that weren’t driving our learning and growth, they were just adding unneeded obstacles and taking away time from more important tasks. But AI is taking away effort from creative and cognitive processes that drive meaning, motivation, and learning. That’s a key difference, because it’s not taking away friction from tasks that don’t serve us. It’s taking away friction from experiences that are really important and integral to our development. Are there contexts where AI is already removing beneficial friction? How might the impacts of reduced friction show up over time? Zohar: One clear example is writing. People increasingly rely on AI to draft everything from emails to essays, removing many instances of beneficial friction. Research shows that people trust responses less when they learn they were written by AI, judge AI-generated products as less creative and less valuable, and have greater difficulty remembering their own work products when they were produced with AI assistance. Outsourcing writing to AI strips away both social and cognitive friction. Vibe coding is another good example. If you’re a programmer, coding is integral to what drives your meaning. People get meaning out of their work, and if you’re substituting that with AI, it could be detrimental. The negative impact of frictionless AI is that it takes away friction from things that are really important to who you are as a person, and your skills. One area I worry about a lot is adolescents using AI in general . It’s a really important developmental period to learn and grow and find the path you’ll follow. So if you don’t have these effortful interactions with work and relationships that teach you how to think, this will have long-term detrimental impacts. They might not be able to think critically in the same way, because they never had to before. If they’re turning to AI for social relationships at such a young age, that could really erode important skills they should be learning at that age. What is productive friction? Zohar: Friction goes along a continuum. With too little friction, you’re not getting learning and motivation. Too much friction and the task becomes overwhelming. Productive friction falls right in the middle, where struggle leads to achievement. 
It’s effortful but possible, and it requires you to think critically and work on a problem for some time or face some difficulty in the process. An example we used in the paper is the difference between taking a chairlift and hiking up a mountain. They both get to the top, but with the chairlift, you don’t get any growth benefits, while the hiker’s climb involves difficulties and a sense of achievement. It becomes much more of an experience and a learning opportunity versus the person who just went up the chairlift effortlessly. Do you envision AI that sometimes deliberately slows people down or asks them to do part of the work themselves? Zohar: It’s important in behavioral science to think about the default option, because people don’t usually change their default. So right now, the default in AI is to give you your answer and probe you to keep going down the rabbit hole. But I think we could think about AI in a different way. Maybe we can make the default more constructive. Instead of just jumping to the answer, it’s more of a process model where it helps you think about the problem and teaches you along the way, so it’s more collaborative rather than a one-stop shop for the answer. How might users of these systems and the companies developing them feel about such a design shift? Zohar: For the makers of these systems, the biggest concern is the pushback. People are used to going in and just getting the answer, and they might be really resistant to a design that makes them work more for it. But it might feed more engagement, because you have to go back and forth and find the answer together. Ultimately I think it has to come from the companies making these models, if they think [a more friction-full design] would help people. Friction-full AI is more of a long-term product. It’s hard to say if that would motivate companies to change their models to include moderate friction. But in the long term, I think this would be beneficial.
Introducing the OpenAI Safety Bug Bounty program openai 25.03.2026 00:00 0.737
Embedding sim.0.8422
Entity overlap0.3333
Title sim.0.1209
Time proximity0.9226
NLP типother
NLP организацияOpenAI
NLP темаai safety
NLP страна

Открыть оригинал

OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration.
OpenAI is throwing everything into building a fully automated researcher mit_tech_review 20.03.2026 11:57 0.734
Embedding sim.0.8372
Entity overlap0.1714
Title sim.0.202
Time proximity0.9123
NLP типproduct_launch
NLP организацияOpenAI
NLP темаai agents
NLP странаUnited States

Открыть оригинал

OpenAI is refocusing its research efforts and throwing its resources into a new grand challenge. The San Francisco firm has set its sights on building what it calls an AI researcher, a fully automated agent-based system that will be able to go off and tackle large, complex problems by itself. ​​OpenAI says that this new research goal will be its “North Star” for the next few years, pulling together multiple research strands, including work on reasoning models, agents , and interpretability . There’s even a timeline. OpenAI plans to build “an autonomous AI research intern”—a system that can take on a small number of specific research problems by itself—by September. The AI intern will be the precursor to a fully automated multi-agent research system that the company plans to debut in 2028. This AI researcher (OpenAI says) will be able to tackle problems that are too large or complex for humans to cope with. Those tasks might be related to math and physics—such as coming up with new proofs or conjectures—or life sciences like biology and chemistry, or even business and policy dilemmas. In theory, you would throw such a tool any kind of problem that can be formulated in text, code, or whiteboard scribbles—which covers a lot. OpenAI has been setting the agenda for the AI industry for years. Its early dominance with large language models shaped the technology that hundreds of millions of people use every day. But it now faces fierce competition from rival model makers like Anthropic and Google DeepMind. What OpenAI decides to build next matters—for itself and for the future of AI. A big part of that decision falls to Jakub Pachocki, OpenAI’s chief scientist, who sets the company’s long-term research goals . Pachocki played key roles in the development of both GPT-4, a game-changing LLM released in 2023, and so-called reasoning models, a technology that first appeared in 2024 and now underpins all major chatbots and agent-based systems.  In an exclusive interview this week, Pachocki talked me through OpenAI’s latest vision. “I think we are getting close to a point where we’ll have models capable of working indefinitely in a coherent way just like people do,” he says. “Of course, you still want people in charge and setting the goals. But I think we will get to a point where you kind of have a whole research lab in a data center.” Solving hard problems Such big claims aren’t new. Saving the world by solving its hardest problems is the stated mission of all the top AI firms. Demis Hassabis told me back in 2022 that it was why he started DeepMind . Anthropic CEO Dario Amodei says he is building the equivalent of a country of geniuses in a data center . Pachocki’s boss, Sam Altman, wants to cure cancer . But Pachocki says OpenAI now has most of what it needs to get there. In January, OpenAI released Codex, an agent-based app that can spin up code on the fly to carry out tasks on your computer. It can analyze documents, generate charts, make you a daily digest of your inbox and social media, and much more. (Other firms have released similar tools, such as Anthropic’s Claude Code and Claude Cowork.) OpenAI claims that most of its technical staffers now use Codex in their work. You can look at Codex as a very early version of the AI researcher, says Pachocki: “I expect Codex to get fundamentally better.” The key is to make a system that can run for longer periods of time, with less human guidance. 
“What we’re really looking at for an automated research intern is a system that you can delegate tasks [to] that would take a person a few days,” says Pachocki. “There are a lot of people excited about building systems that can do more long-running scientific research,” says Doug Downey, a research scientist at the Allen Institute for AI, who is not connected to OpenAI. “I think it’s largely driven by the success of these coding agents. The fact that you can delegate quite substantial coding tasks to tools like Codex is incredibly useful and incredibly impressive. And it raises the question: Can we do similar things outside coding, in broader areas of science?” For Pachocki, that’s a clear Yes . In fact, he thinks it’s just a matter of pushing ahead on the path we’re already on. A simple boost in all-round capability also leads to models that can work longer without help, he says. He points to the leap from 2020’s GPT-3 to 2023’s GPT-4 , two of OpenAI’s previous models. GPT-4 was able to work on a problem for far longer than its predecessor, even without specialized training, he says. So-called reasoning models brought another bump. Training LLMs to work through problems step by step, backtracking when they make a mistake or hit a dead end, has also made models better at working for longer periods of time. And Pachocki is convinced that OpenAI’s reasoning models will continue to get better. But OpenAI is also training its systems to work by themselves for longer by feeding them specific samples of complex tasks, such as hard puzzles taken from math and coding contests, which force the models to learn how to do things like keep track of very large chunks of text and split problems up into (and then manage) multiple subtasks. The aim isn’t to build models that just win math competitions. “That lets you prove that the technology works before you connect it to the real world,” says Pachocki. “If we really wanted to, we could build an amazing automated mathematician. We have all the tools, and I think it would be relatively easy. But it’s not something we’re going to prioritize now because, you know, at the point where you believe you can do it, there’s much more urgent things to do.” “We are much more focused now on research that’s relevant in the real world,” he adds. Right now that means taking what Codex can do with coding and trying to apply that to problem-solving in general. “There’s a big change happening, especially in programming,” he says. “Our jobs are now totally different than they were even a year ago. Nobody really edits code all the time anymore. Instead, you manage a group of Codex agents.” If Codex can solve coding problems (the argument goes), it can solve any problem. The line always goes up It’s true that OpenAI has had a handful of remarkable successes in the last few months. Researchers have used GPT-5 (the LLM that powers Codex) to discover new solutions to a number of unsolved math problems and punch through apparent dead ends in a handful of biology, chemistry, and physics puzzles . “Just looking at these models coming up with ideas that would take most PhD weeks, at least, makes me expect that we’ll see much more acceleration coming from this technology in the near future,” Pachocki says. But Pachocki admits that it’s not a done deal. He also understands why some people still have doubts about how much of a game-changer the technology really is. He thinks it depends on how people like to work and what they need to do. 
“I can believe some people don’t find it very useful yet,” he says. He tells me that he didn’t even use autocomplete—the most basic version of generative coding tech —a year ago. “I’m very pedantic about my code,” he says. “I like to type it all manually in vim if I can help it.” (Vim is a text editor favored by many hardcore programmers that you interact with via dozens of keyboard shortcuts instead of a mouse.) But that changed when he saw what the latest models could do. He still wouldn’t hand over complex design tasks, but it’s a time-saver when he just wants to try out a few ideas. “I can have it run experiments in a weekend that previously would have taken me like a week to code,” he says. “I don’t think it is at the level where I would just let it take the reins and design the whole thing,” he adds. “But once you see it do something that would take a week to do—I mean, that’s hard to argue with.” Pachocki’s game plan is to supercharge the existing problem-solving abilities that tools like Codex have now and apply them across the sciences. Downey agrees that the idea of an automated researcher is very cool: “It would be exciting if we could come back tomorrow morning and the agent’s done a bunch of work and there’s new results we can examine,” he says. But he cautions that building such a system could be harder than Pachocki makes out. Last summer, Downey and his colleagues tested several top-tier LLMs on a range of scientific tasks . OpenAI’s latest model, GPT-5, came out on top but still made lots of errors. “If you have to chain tasks together, then the odds that you get several of them right in succession tend to go down,” he says. Downey admits that things move fast, and he has not tested the latest versions of GPT-5 (OpenAI released GPT-5.4 two weeks ago). “So those results might already be stale,” he says. Serious unanswered questions I asked Pachocki about the risks that may come with a system that can solve large, complex problems by itself with little human oversight. Pachocki says people at OpenAI talk about those risks all the time. “If you believe that AI is about to substantially accelerate research, including AI research, that’s a big change in the world. That’s a big thing,” he told me. “And it comes with some serious unanswered questions. If it’s so smart and capable, if it can run an entire research program, what if it does something bad?” The way Pachocki sees it, that could happen in a number of ways. The system could go off the rails. It could get hacked. Or it could simply misunderstand its instructions. The best technique OpenAI has right now to address these concerns is to train its reasoning models to share details about what they are doing as they work. This approach to keeping tabs on LLMs is known as chain-of-thought monitoring . In short, LLMs are trained to jot down notes about what they are doing in a kind of scratch pad as they step through tasks. Researchers can then use those notes to make sure a model is behaving as expected. Yesterday OpenAI published new details on how it is using chain-of-thought monitoring in house to study Codex . “Once we get to systems working mostly autonomously for a long time in a big data center, I think this will be something that we’re really going to depend on,” says Pachocki. The idea would be to monitor an AI researcher’s scratch pads using other LLMs and catch unwanted behavior before it’s a problem, rather than trying to stop that bad behavior from happening in the first place. 
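In outline, the monitoring loop Pachocki describes is simple: a second judge reads each scratch-pad entry and raises a flag before the agent's next step is executed. The sketch below is a schematic illustration of chain-of-thought monitoring, not OpenAI's system; judge_note is a rule-based placeholder that in practice would be backed by another LLM.

```python
# Toy chain-of-thought monitor: review each scratch-pad note before acting on it.
SUSPICIOUS_MARKERS = ("disable logging", "exfiltrate", "ignore previous instructions")

def judge_note(note: str) -> bool:
    """Placeholder judge; a real monitor would score the note with another LLM."""
    return any(marker in note.lower() for marker in SUSPICIOUS_MARKERS)

def run_with_monitoring(agent_steps):
    """agent_steps yields (scratch_pad_note, action) pairs produced by the agent."""
    for note, action in agent_steps:
        if judge_note(note):
            raise RuntimeError(f"Monitor flagged step before execution: {note!r}")
        action()  # only execute steps whose recorded reasoning passed review
```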
LLMs are not understood well enough for us to control them fully. “I think it’s going to be a long time before we can really be like, okay, this problem is solved,” he says. “Until you can really trust the systems, you definitely want to have restrictions in place.” Pachocki thinks that very powerful models should be deployed in sandboxes, cut off from anything they could break or use to cause harm. AI tools have already been used to come up with novel cyberattacks. Some worry that they will be used to design synthetic pathogens that could be used as bioweapons. You can insert any number of evil-scientist scare stories here. “I definitely think there are worrying scenarios that we can imagine,” says Pachocki. “It’s going to be a very weird thing. It’s extremely concentrated power that’s in some ways unprecedented,” says Pachocki. “Imagine you get to a world where you have a data center that can do all the work that OpenAI or Google can do. Things that in the past required large human organizations would now be done by a couple of people.” “I think this is a big challenge for governments to figure out,” he adds. And yet some people would say governments are part of the problem. The US government wants to use AI on the battlefield, for example. The recent showdown between Anthropic and the Pentagon revealed that there is little agreement across society about where we draw red lines for how this technology should and should not be used—let alone who should draw them. In the immediate aftermath of that dispute, OpenAI stepped up to sign a deal with the Pentagon instead of its rival. The situation remains murky. I pushed Pachocki on this. Does he really trust other people to figure it out or does he, as a key architect of the future, feel personal responsibility? “I do feel personal responsibility,” he says. “But I don’t think this can be resolved by OpenAI alone, pushing its technology in a particular way or designing its products in a particular way. We’ll definitely need a lot of involvement from policymakers.” Where does that leave us? Are we really on a path to the kind of AI Pachocki envisions? When I asked the Allen Institute’s Downey, he laughed. “I’ve been in this field for a couple of decades and I no longer trust my predictions for how near or far certain capabilities are,” he says. OpenAI’s stated mission is to ensure that artificial general intelligence (a hypothetical future technology that many AI boosters believe will be able to match humans on most cognitive tasks) will benefit all of humanity. OpenAI aims to do that by being the first to build it. But the only time Pachocki mentioned AGI in our conversation, he was quick to clarify what he meant by talking about “economically transformative technology” instead. LLMs are not like human brains, he says: “They are superficially similar to people in some ways because they’re kind of mostly trained on people talking. But they’re not formed by evolution to be really efficient.” “Even by 2028, I don’t expect that we’ll get systems as smart as people in all ways. I don’t think that will happen,” he adds. “But I don’t think it’s absolutely necessary. The interesting thing is you don’t need to be as smart as people in all their ways in order to be very transformative.”
NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 19:30 0.732
Embedding sim.0.8309
Entity overlap0.0377
Title sim.0.2381
Time proximity0.9797
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must scale efficiently to maximize token production and improve productivity for model creators and users. Modern GPUs operate at peak capacity, pushing throughput higher every generation, but system performance is increasingly gated by the CPU-bound serial tasks within an agentic loop–a classic example of a core computer science principle, called Amdahl’s law. This dynamic is especially visible in two classes of workloads: reinforcement learning (RL) for training models with new specialized skills such as coding or engineering, and agentic actions , which enable AI agents to use tools like web browsers, databases, code interpreters, and other software to complete tasks in real environments, or sandboxes. Both workloads combine two historically separate CPU characteristics. Individual environments require strong single-threaded performance to execute complex code quickly, similar to a workstation. At the same time, modern AI systems launch thousands of these environments concurrently, creating large-scale throughput demands typical of server infrastructure. The NVIDIA Vera CPU is designed for modern AI workloads, with key design features including: Extreme single-core performance: Fast execution of individual tasks is critical, and performance must sustain under ‌constant load with many concurrent users and agentic tasks. High memory and fabric bandwidth per core: To ensure consistent SLA under load that moves volumes of data efficiently for real-time analysis and context switching tasks. Efficient rack-scale co-design: AI factories must rapidly deploy and manage capacity to fulfill agentic demand while maximizing power efficiency. Data centers built with Vera maximize AI infrastructure investments, whether Vera CPUs are directly connected to accelerators or performing tasks on standalone CPU capacity at the end of a wire. The post-training reality Reinforcement learning requires models to constantly evaluate their outputs, recognizing which results succeed or fail. For example, models learning to do software development generate large amounts of code using models running on accelerators, which is then shipped to clusters of CPUs to build, run, and test—acting in a feedback-reward loop (see Figure 1). These tasks span codebase research, compilation, runtime execution, scripting, data conversion, and other common operations. Overall, this flow requires many concurrent sandbox-like environments, each with a full complement of tools. Often, a single CPU core executes each lightly threaded case end-to-end from a set of accelerator-generated requests. Figure 1. A CPU Sandbox accelerating RL-based post-training and agentic calling To maximize accelerator utilization and enforce rapid model iteration, the token generation and training phases of the cycle operate on a tight schedule (or policy). Often, some evaluation jobs running on a CPU finish jobs too late to influence the next step in the cycle. When this happens, it takes the model longer to learn to the same quality, and valuable tokens are wasted. Agentic loops demand a unique blend of high single-core performance, massive data bandwidth, and deterministic execution with minimal tail latencies from the CPUs they employ. 
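Amdahl's law makes the CPU bottleneck above easy to quantify. The fractions below are illustrative, not NVIDIA's: if even 10% of an agentic loop is serial CPU work (sandbox execution, tool calls, orchestration), an arbitrarily fast GPU caps the end-to-end speedup at 10x, and shrinking the serial CPU time raises that ceiling directly.

```python
# Illustrative Amdahl's law arithmetic for an agentic loop (made-up fractions).
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only the parallel (GPU) portion is accelerated."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / parallel_speedup)

# 90% of the loop on GPUs, 10% serial CPU work:
print(amdahl_speedup(0.90, 10))    # ~5.3x overall despite a 10x faster GPU
print(amdahl_speedup(0.90, 1e9))   # ~10x ceiling even with infinitely fast GPUs
# Halving the serial CPU time (e.g., a faster single core) lifts that ceiling to ~20x.
```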
These requirements are a central focus of the NVIDIA Vera CPU design (Figure 2), which delivers up to 50% faster sandbox performance compared to competitive platforms, 1.2 TB/s of memory bandwidth, and 88 Olympus cores with NVIDIA Spatial Multithreading (SMT) for the task concurrency necessary for AI Factories. Figure 2. Vera CPU architecture NVIDIA Olympus core The need for higher-performance cores that support AI led to the NVIDIA Olympus core, the first fully custom data center CPU core from NVIDIA. Olympus debuts in Vera alongside the second generation of the NVIDIA Scalable Coherency Fabric (SCF), originally developed for the NVIDIA Grace CPU. Built for sustained high Instruction Per Cycle (IPC) operation on memory-intensive workloads with control-flow logic, Olympus uses a 10-wide instruction fetch and decode frontend, and a neural branch predictor capable of evaluating two taken branches per cycle. It is fully compatible with the Arm v9.2 instruction set and existing software for high performance on Arm-based containers, binaries, libraries, and operating systems. Users can choose between performance-per-thread and thread count at runtime with NVIDIA SMT. This gives each thread stable performance, stronger isolation, and predictable tail latency under heavy load. Traditional SMT relies on time-shared resources and frequent context switching between threads, introducing performance variation. NVIDIA Scalable Coherency Fabric and memory subsystem The Vera CPU is built on a single monolithic compute die and fabric, with adjacent dielets implementing memory and I/O subsystems while preserving the uniformity of the compute topology. From the point of view of an application, every core is the same practical distance to resources like other cores, caches, memory, and networking, and is provisioned with uniform, high-throughput bandwidth. Most latency‑sensitive operations remain local, avoiding unnecessary cross‑die traffic typically observed on traditional CPUs. The runtime paths of agentic tasks, analytics operations, KV and blob caches, orchestration, and control planes are inherently unpredictable in an AI factory. In traditional implementations, the topology of the processor and the usage patterns of neighboring tasks being run on it must be considered ahead of time to maximize application performance. The design enables optimal performance without this style of tuning. The second-generation SCF connects all 88 Olympus cores to a shared L3 cache and memory subsystem, delivering consistent latency and 3.4 TB/s of bisection bandwidth, enabling the Vera CPU to sustain over 90% of peak memory bandwidth under load. Each core is provisioned with up to 14 GB/s of memory bandwidth, roughly 3x the per-core rate of traditional data center CPUs—ensuring Extract-Transform-Load (ETL), real-time analytics, and memory-bound workloads maintain throughput when every core is active. Feeding SCF is Vera’s second-generation LPDDR5X memory subsystem, delivering up to 1.2 TB/s of total bandwidth at less than half the memory power of traditional DDR configurations and up to 1.5 TB of capacity—a 3x increase over the prior generation. Small Outline Compression-Attached Memory Modules (SOCAMM) brings low-power memory into the data center for the first time, replacing soldered memory with detachable, upgradable modules that combine LPDDR efficiency with server-class serviceability. 
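The per-core and aggregate bandwidth figures quoted above are mutually consistent, which is a quick sanity check on the design: 88 Olympus cores at up to 14 GB/s each lands almost exactly on the 1.2 TB/s LPDDR5X total. The rounding below is mine, not NVIDIA's.

```python
# Quick consistency check of the quoted Vera CPU figures (approximate rounding).
cores, per_core_gbps = 88, 14        # cores x GB/s of memory bandwidth per core
print(cores * per_core_gbps)          # 1232 GB/s, i.e. ~ the quoted 1.2 TB/s total
```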
Performance across the AI factory All these architectural elements enable the Vera CPU to deliver up to 1.5x the agentic sandbox performance under full-socket load compared to competitive x86 platforms across compilers, scripting tools, runtime engines, compression, and agentic tool calls (Figure 3). Figure 3. NVIDIA Vera offers up to 1.5x higher sandbox performance compared to the competition This advantage compounds across three dimensions. In RL post-training, a 1.5x faster sandbox returns evaluation results within tighter time windows, enabling models to capture the best gradient tokens and accelerating training cycles. In agentic inference, it reduces users’ wait time, improving accelerator utilization and easing pressure on KV cache offloading. For frontier training problems, 50% higher single-core performance means more sequential tests complete before hitting time limits, expanding the range of hard problems a model can learn from. Agentic environments by the rack Every AI Factory requires millions of CPU cores to enable the agentic loop of RL and tool use. To unlock the potential of AI infrastructure, deployment must be rapid. For many AI factory operators, the Vera CPU will be the first in their fleet, arriving in data centers designed for high-rack power and liquid cooling. The new NVIDIA Vera CPU Rack offers incredible density and performance within the same planning constraints, rack infrastructure, cooling, and power as the NVL72 products being deployed today. With a capacity of more than 22.5K sandboxes, Vera CPU Rack delivers over 4x the capacity and 2x the performance per watt of x86-based server racks (Figure 4). AI Factories deploy and manage capacity at the rack level, radically reducing build-out times and improving time-to-market for new capacity while simplifying site planning. Each Vera CPU is connected with NVIDIA BlueField-4 SmartNICs containing dedicated Grace-based management cores, offloading networking tasks like security and management, and ensuring the most performant capacity in the system is fully available to agentic tasks. Figure 4. Rack-level efficiency across RL sandbox evaluation, ETL, and analytics under full system load Vera platforms and configurations In addition to the Vera CPU rack, NVIDIA has engineered a complete family of Vera-based platforms for the diverse workloads of modern AI factories. By delivering many choices of densities, cooling capabilities, configurations, and form factors, Vera’s design and system partners are enabling rapid deployment and capacity build-out, adaptable to the constraints of space available in any data center facility. Platform Description Scenarios NVIDIA Vera Rubin NVL72 Integrated AI factory rack tightly couples Vera host CPUs and Rubin GPUs through high-bandwidth NVIDIA NVLink-C2C and NVIDIA NVLink scale-up fabric. Large-scale AI factories, frontier model training, reasoning, and high-throughput inference. NVIDIA Vera CPU Rack Liquid-cooled (LC) CPU rack architecture with up to 4 nodes per 1U tray, scaling to 256 Vera CPUs per rack for dense, efficient compute. Build capacity rapidly at rack-scale alongside NVL72. AI factory infrastructure, agentic pipelines, orchestration layers, data processing, HPC, and CPU-dense services. Single and dual-socket Vera platforms Flexible server platforms built around one or two Vera CPUs, with up to 1.5TB LPDDR5X per socket and 1.8TB/s NVLink-C2C between CPUs in dual-socket designs, suitable for any facility. 
Cloud infrastructure, enterprise, analytics, storage, HPC, NVIDIA PCIe GPU-equipped servers, and AI factories. NVIDIA HGX Rubin NVL8 Accelerated computing platform pairing Vera host CPUs with Rubin GPUs over PCIe, enabling balanced CPU-GPU performance across multiple server designs. AI inference, technical computing, analytics, and enterprise HPC deployments. Table 1. Vera platform options for modern AI factories Figure 5. Vera CPU at AI factory scale Platform availability Vera systems will be available from major OEMs, including Cisco, Dell, HPE, Lenovo, and Supermicro, in the second half of 2026. See the Vera CPU webpage for more details. Learn more about the Vera CPU and Vera Rubin. NVIDIA Vera performance compared to AMD EPYC Turin and Intel Xeon 6 Granite Rapids, across a variety of workloads, including code compilation, interpreters, scripting, runtime engines, ETL, data analytics, and graph.
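The rack-level sandbox figure also follows from the density quoted earlier, which is a useful cross-check when sizing capacity: 256 Vera CPUs per rack, with 88 Olympus cores each and roughly one lightly threaded sandbox per core (as in the RL evaluation flow described above), works out to about 22.5K sandboxes. The one-sandbox-per-core assumption and the arithmetic are mine.

```python
# Back-of-the-envelope rack sizing from the quoted figures (assumes ~1 sandbox per core).
cpus_per_rack, cores_per_cpu = 256, 88
print(cpus_per_rack * cores_per_cpu)   # 22,528, consistent with "more than 22.5K sandboxes"
```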
How to Build Deep Agents for Enterprise Search with NVIDIA AI-Q and LangChain | NVIDIA Technical Blog nvidia_dev_blog 18.03.2026 16:00 0.73
Embedding sim.0.8332
Entity overlap0.0204
Title sim.0.2879
Time proximity0.8644
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаai agents
NLP страна

Открыть оригинал

While consumer AI offers powerful capabilities, workplace tools often suffer from disjointed data and limited context. Built with LangChain , the NVIDIA AI-Q blueprint is an open source template that bridges this gap. LangChain recently introduced an enterprise agent platform built with NVIDIA AI to support scalable, production-ready agent development. This tutorial, available as an NVIDIA launchable , shows developers how to use the AI-Q blueprint to create a deep research agent that tops leaderboards and can be connected to enterprise systems. The blueprint uses the best of open and frontier LLMs , is optimized using the NVIDIA NeMo Agent Toolkit , and monitored with LangSmith. The result: faster time-to-production for agentic search apps that keep business data exactly where it belongs—private and in a secure environment. The NVIDIA AI-Q blueprint and NeMo Agent Toolkit are both part of the broader NVIDIA Agent Toolkit, a collection of tools, models and runtimes for building, evaluating and optimizing safe, long-running autonomous agents. What you’ll build: A deep agent You will learn: How to deploy the NVIDIA AI‑Q blueprint with LangChain for enterprise search use cases How to configure shallow and deep research agents using Nemotron and frontier LLMs How to monitor agent traces and performance with LangSmith and NVIDIA tools How to connect internal enterprise data sources through NeMo Agent Toolkit tools Set up NVIDIA API Key for access to open models such as Nemotron 3 OpenAI API Key for access to a frontier model such as GPT-5.2 Tavily API Key for web search Python Docker Compose Optional: LangSmith for monitoring and experiment tracking How to build long-running data agents with NVIDIA and LangChain Video 1. A walkthrough on building scalable, self-improving data agents with LangChain and NVIDIA Install and run the blueprint Clone the repository and configure your API keys. Copy the environment template first. cp deploy/.env.example deploy/.env Open deploy/.env and fill in the required values. # Required NVIDIA_API_KEY=nvapi-... TAVILY_API_KEY=tvly-... # Optional: enables trace monitoring (covered later in this post) LANGSMITH_API_KEY=lsv2-... The NVIDIA_API_KEY grants access to NVIDIA-hosted models like Nemotron 3 Nano. The TAVILY_API_KEY enables web search. Next, build and start the full stack. Starting multiple containers at once means the first build can take a few minutes, based on your internet connection and hardware specs. docker compose -f deploy/compose/docker-compose.yaml up --build This launches three services: aiq-research-assistant: The FastAPI backend on port 8000 postgres: PostgreSQL 16 for async job state and conversation checkpoints frontend: The Next.js web UI on port 3000 Once all services report healthy, open http://localhost:3000. Figure 1, below, shows the AI-Q Research Assistant chat interface, where you type a research query and watch the agent work in real time. Figure 1. The AI-Q Research Assistant as it creates a research report Customize AI-Q: Workflow, tracing and model configuration Open configs/config_web_docker.yml . This single file controls the LLMs, tools, agents, and workflow configuration. The llms section declares named models. Notice the enable_thinking flag—it toggles chain-of-thought reasoning on or off for Nemotron. 
The following example declares three LLMs with different roles: llms: nemotron_llm_non_thinking: _type: nim model_name: nvidia/nemotron-3-super-120b-a12b temperature: 0.7 max_tokens: 8192 chat_template_kwargs: enable_thinking: false nemotron_llm: _type: nim model_name: nvidia/nemotron-3-super-120b-a12b temperature: 1.0 max_tokens: 100000 chat_template_kwargs: enable_thinking: true gpt-5-2: _type: openai model_name: 'gpt-5.2' nemotron_llm_non_thinking handles fast responses where chain-of-thought adds latency without benefit. nemotron_llm enables thinking mode with a 100K context window for the agents that need multi-step reasoning. gpt-5.2 adds a frontier model for orchestration. The blueprint consists of both a shallow and deep research agent. The following configuration shows both: functions: shallow_research_agent: _type: shallow_research_agent llm: nemotron_llm tools: - web_search_tool max_llm_turns: 10 max_tool_calls: 5 deep_research_agent: _type: deep_research_agent orchestrator_llm: gpt-5 planner_llm: nemotron_llm researcher_llm: nemotron_llm max_loops: 2 tools: - advanced_web_search_tool The shallow research agent runs a bounded tool-calling loop—up to 10 LLM turns and 5 tool calls—then returns a concise answer with citations. Simple questions like “What is CUDA?” resolve in seconds. The deep research agent uses a LangChain deep agent with a ToDo list, file system, and sub-agents to produce long-form, citation-backed reports. To keep all inference on-premise, change the orchestrator_llm to point at a self-hosted model. Monitor the traces To monitor AI‑Q agents, enable LangSmith tracing so each query generates a full execution trace, including LangChain tool calls and model usage. Add your LANGSMITH_API_KEY to your deploy/.env and add the telemetry section to the config file: general: telemetry: tracing: langsmith: _type: langsmith project: aiq-gtc-demo api_key: ${LANGSMITH_API_KEY} Each query generates a trace that captures the full execution path. Figure 2. A LangSmith trace for a shallow research query showing multiple tool calls and a final answer. Shallow research sample query: What is the deepest place on earth? Deep research sample query: Analyze the current 2026 scientific consensus on the deepest points on Earth, comparing the Challenger Deep in the Mariana Trench to terrestrial extremes such as the Veryovkina Cave and the Kola Superdeep Borehole. Include the latest bathymetric and geodetic measurements, an assessment of measurement uncertainties (including gravity and pressure sensor corrections), and a summary of recent deep-sea expeditions from 2020–2026 that have updated our understanding of the Hadal zone's topography and biological life. Expand the trace to inspect each node. The tool calls to web search are especially useful for debugging—you can see exactly what query the agent sent and what results came back. Beyond individual traces, use LangSmith to track latency, token usage, and error rates over time, and set alerts for regressions. Optimize a deep agent To tune the deep research agent for your domain, start by examining how it assembles its sub-agents. The deep research agent uses the create_deep_agent factory from LangChain’s deepagents library. 
from deepagents import create_deep_agent return create_deep_agent( model=self.llm_provider.get(LLMRole.ORCHESTRATOR), system_prompt=orchestrator_prompt, tools=self.tools, subagents=self.subagents, middleware=custom_middleware, skills=self.skills, ).with_config({"recursion_limit": 1000}) The factory wires together the orchestrator LLM, the tools, and two sub-agents. self.subagents = [ { "name": "planner-agent", "system_prompt": render_prompt_template( self._prompts["planner"], tools=self.tools_info, ), "tools": self.tools, "model": self.llm_provider.get(LLMRole.PLANNER), }, { "name": "researcher-agent", "system_prompt": render_prompt_template( self._prompts["researcher"], tools=self.tools_info, ), "tools": self.tools, "model": self.llm_provider.get(LLMRole.RESEARCHER), }, ] Context management is central to how deep agents work. The planner agent produces a JSON research plan. The researcher agent receives only this plan — not the orchestrator’s thinking tokens or the planner’s internal reasoning . By passing only a structured payload, we reduce the token bloat and prevent the “lost in the middle” phenomenon, where LLMs forget critical instructions buried deep in massive context windows. This isolation keeps each sub-agent focused. The following example shows a planner output for a query about retrieval-augmented generation ( RAG ) versus long-context approaches: { "report_title": "RAG vs Long-Context Models for Enterprise Search", "report_toc": [ { "id": "1", "title": "Architectural Foundations", "subsections": [ {"id": "1.1", "title": "Retrieval-Augmented Generation Pipeline"}, {"id": "1.2", "title": "Long-Context Transformer Architectures"} ] }, { "id": "2", "title": "Performance and Accuracy Trade-offs", "subsections": [ {"id": "2.1", "title": "Factual Accuracy and Hallucination Rates"}, {"id": "2.2", "title": "Latency and Throughput Benchmarks"} ] } ], "queries": [ { "id": "q1", "query": "RAG retrieval-augmented generation architecture components ...", "target_sections": ["Architectural Foundations"], "rationale": "Establishes baseline understanding of RAG pipelines" } ] } This architecture has been tuned to perform well on both Deep Research Bench and Deep Research Bench II . To customize the agent for your domain, edit the prompt templates in src/aiq_aira/agents/deep_researcher/prompts/ . For example, open planner.j2 and instruct the planner to keep outlines to three sections or fewer for more focused reports. You could also add additional debug logging to inspect the intermediate state (like /planner_output.md ) to see how your prompt changes affect the context passed between sub-agents. Add a data source The blueprint implements every tool as a NeMo Agent Toolkit function. To connect a new enterprise data source, implement a NeMo Agent Toolkit function and reference it in the config. 
Step 1: Implement the NeMo Agent Toolkit function

The following example connects to an internal knowledge base API:

# sources/internal_kb/src/register.py
from pydantic import Field, SecretStr

from nat.builder.builder import Builder
from nat.builder.function_info import FunctionInfo
from nat.cli.register_workflow import register_function
from nat.data_models.function import FunctionBaseConfig


class InternalKBConfig(FunctionBaseConfig, name="internal_kb"):
    """Search tool for the internal knowledge base."""
    api_url: str = Field(description="Knowledge base API endpoint")
    api_key: SecretStr = Field(description="Authentication key")
    max_results: int = Field(default=5)


@register_function(config_type=InternalKBConfig)
async def internal_kb(config: InternalKBConfig, builder: Builder):
    async def search(query: str) -> str:
        """Search the internal knowledge base for relevant documents."""
        results = await call_kb_api(config.api_url, query, config.max_results)
        return format_results(results)

    yield FunctionInfo.from_fn(search, description=search.__doc__)

NeMo Agent Toolkit validates the config fields at startup, so misconfigurations fail fast. The agent will use the function's docstring to decide when to call the tool.

Step 2: Reference the tool in the config

Declare the new tool under functions, then add it to each agent's tools list:

functions:
  internal_kb_tool:
    _type: internal_kb
    api_url: "https://kb.internal.company.com/api/v1"
    api_key: ${INTERNAL_KB_API_KEY}
    max_results: 10
  shallow_research_agent:
    _type: shallow_research_agent
    llm: nemotron_llm
    tools:
      - web_search_tool
      - internal_kb_tool
  deep_research_agent:
    _type: deep_research_agent
    orchestrator_llm: gpt-5
    planner_llm: nemotron_llm
    researcher_llm: nemotron_llm
    tools:
      - advanced_web_search_tool
      - internal_kb_tool

You don't need to change any agent code. The agent discovers the new tool's name and description automatically, and the LLM calls it when a query matches. Use this same pattern to hook into your own enterprise systems, or leverage MCP (Model Context Protocol) to grant your agents access to existing tools. This ensures your research stack remains private and deeply integrated with the data that matters most to your organization.

Going further

By extending and building on the NVIDIA AI-Q blueprint, developers can bring a best-in-class LangChain deep agent architecture to their enterprise. To go further, review:

- Blueprint customization guide for adding more data sources
- Helm chart for deploying on an AI Factory
- Blueprint evaluation guide for evaluation-driven development
- LangSmith for monitoring the system in production and preventing performance drift

The NVIDIA AI-Q Blueprint is being integrated by partners across the ecosystem, including: Aible, Amdocs, Cloudera, Cohesity, Dell, Distyl, H2O.ai, HPE, IBM, JFrog, LangChain, ServiceNow, and VAST.

Tags: Agentic AI / Generative AI | General | Agent toolkit | Blueprint | NeMo | Nemotron | Intermediate Technical | Announcement | Tutorial | featured | GTC 2026 | LLMs | Retrieval Augmented Generation (RAG)

About the Authors

About Sean Lopp
Sean is a software engineer at NVIDIA where he works on data, AI, and developer tooling to help organizations realize the full potential of NVIDIA AI Enterprise software. He has a decade-long career in open source software, especially in the Python data ecosystem. He studied applied mathematics at the Colorado School of Mines.
About Sam Pastoriza
Sam Pastoriza is a solutions architect at NVIDIA who bridges the gap between complex agentic AI capabilities and practical industry applications. He collaborates with cross-functional teams to build state-of-the-art agentic AI solutions using NVIDIA's NeMo Agent Toolkit, most notably developing the AI-Q Research Assistant blueprint for deep research. Sam holds a Master's degree in data science and analytics from Georgetown University and a BS in software engineering from Rose-Hulman Institute of Technology.

About Ajay Thorve
Ajay Thorve is a software engineer at NVIDIA, part of the Visualization team in the RAPIDS organization. Ajay's background is in full-stack development and data science, and his interests primarily include JavaScript/TypeScript and Python. Currently, Ajay's work in the RAPIDS viz team mainly focuses on contributing to the cuXfilter and node-rapids projects.

About Chantal D Gama Rose
Chantal D Gama Rose is a senior software engineer working on agentic AI at NVIDIA. She builds and optimizes enterprise-ready, multi-modal agentic AI systems and enables their large-scale deployment. Her experience includes advancing the optimization and adoption of GPU-accelerated inference pipelines and agent workflows for real-world environments. She is currently developing the AIQ Research Assistant, a deep-research agent designed for structured reasoning and real-world enterprise workflows. Chantal holds a Master's degree in Computational Data Science from Carnegie Mellon University.

About Victor Moreira
Victor Moreira is a deployed engineer at LangChain, where he helps enterprises design and scale advanced agentic use cases using the LangChain ecosystem. He brings a background in building from zero to one at startups and is passionate about making cutting-edge AI approachable for real-world teams.
Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 20:30 0.726
Embedding sim.0.8095
Entity overlap0.0208
Title sim.0.3136
Time proximity0.9972
NLP типproduct_launch
NLP организацияnvidia
NLP темаai agents
NLP страна

Открыть оригинал

Autonomous AI agents are driving the next wave of AI innovation. These agents must often manage long-running tasks that use multiple communication channels and background subprocesses simultaneously to explore options, test solutions, and generate optimal results. This places extreme demands on local compute. NVIDIA DGX Spark provides the performance necessary for autonomous agents to execute these complex workflows efficiently and locally. Now, with NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, DGX Spark installs the NVIDIA OpenShell runtime—a secure environment for running autonomous agents—and open source models like NVIDIA Nemotron.

This post discusses several important aspects of system capabilities and performance that are necessary to power always-on autonomous agents and explains why NVIDIA DGX Spark is an ideal desktop platform for autonomous AI.

Inference for autonomous AI agents

Agentic tools often need to process massive context windows. OpenClaw, for example, is an AI agent runtime that requires these large context windows to comprehend requests and environments, and to think through the best approach to a problem. Prompt processing (prefill) throughput can be thought of as the reading comprehension phase of inference and can easily become a bottleneck with a slow GPU. It's common to see autonomous agents easily using contexts of 30K-120K tokens (100K tokens is equivalent to reading Harry Potter and the Philosopher's Stone), with some agents processing 250K tokens for complex requests. Table 1 shows how a potential agent or subagent performs with a large context window (128K/1K ISL/OSL).

Model | End-to-end latency (s) | Prompt processing latency (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s)
NVIDIA Nemotron 3 Super 120B NVFP4 with TensorRT LLM | 99 | 44 | 2,855 | 18
Qwen3.5 35B A3B FP8 with vLLM | 73 | 41 | 3,080 | 35.75
Qwen3 Coder Next 80B FP8 with vLLM | 89 | 54 | 2,390 | 28.95

Table 1. Performance representative of a 128K-token input prompt and a 1K-token response, at batch size 1

When moving from a single subagent to multiple subagents, simultaneous workloads must scale without significantly impacting performance. NVIDIA DGX Spark effectively handles high concurrency in this scenario. Thanks to the power of the NVIDIA Grace Blackwell Superchip, the GPU can parallelize multiple subagents. Two, four, or even eight subagents concurrently working through requests can make use of the strong concurrency capabilities in DGX Spark. With support from frameworks that handle concurrency well (such as NVIDIA TensorRT LLM, vLLM, and SGLang), multiagent workloads run smoothly on NVIDIA DGX Spark. For tasks with 32K ISL and 1K OSL, completing four times as many tasks requires only 2.6x more time, while prompt processing throughput increases by about 3x (Table 2).

NVIDIA DGX Spark is an ideal platform for OpenClaw development. With NVIDIA OpenShell, you can run autonomous, self-evolving agents more safely. Get started running OpenClaw locally on NVIDIA DGX Spark.

Concurrency (# of simultaneous tasks) | End-to-end latency (s) | Median TTFT (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s)
(Lower is better for latency; higher is better for throughput)
1 | 35 | 9 | 3,261 | 38
2 | 54 | 12 | 5,363 | 47
4 | 91 | 15 | 9,616 | 53
Table 2. Performance representative of Qwen3 Coder Next in FP8 on vLLM for a 32K-token input prompt and a 1K-token response at different concurrency levels

Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes

Larger models and multiple subagents require more memory to load and execute. Until now, NVIDIA DGX Spark has supported scaling up to two nodes, increasing the available memory from 128 GB on one node to 256 GB on two nodes. This capability has now been increased to up to four DGX Spark nodes. DGX Spark also now supports several execution topologies, each tailored to different goals through the low latency of RoCE communication enabled by ConnectX-7 NICs.

- One DGX Spark node: Ideal for low-latency, large-context inference, fine-tuning up to 120B parameters, and local agentic workloads
- Two DGX Spark nodes: Balanced scaling for faster fine-tuning and larger models, as well as support for up to 400B-parameter inference
- Three DGX Spark nodes in a ring: Ideal for fine-tuning larger models or small training jobs
- Four DGX Spark nodes with a RoCE 200 GbE switch: Local inference server ideal for state-of-the-art models up to 700B parameters, communication-intensive workloads, and local AI factory operations

Inference can scale up linearly on DGX Spark when internode communication is minimal. When work is largely independent per GPU, the results are aggregated once at the end rather than continuously. In this case, DGX Spark nodes can run in parallel with low synchronization overhead. For example, a reinforcement learning (RL) workload in NVIDIA Isaac Lab can run many simulations independently on each node. Results are collected in a single step, yielding near-linear scaling across multiple DGX Spark nodes.

Inference scaling is less than linear when the workload requires frequent, fine-grained communication between nodes. During LLM inference, model execution occurs layer by layer, with continuous synchronization required across nodes. Partial results from different DGX Spark nodes must be exchanged and merged repeatedly, which introduces significant communication overhead. As additional nodes are added, this overhead becomes increasingly dominant, limiting scaling efficiency.

Parallelism for AI agents: Inference at scale

Tensor parallelism enables efficient inference sharing across multiple nodes to fit the model while minimizing communication overhead. Scaling from two to four DGX Spark nodes provides excellent parallelism capabilities. Thanks to the low-latency ConnectX-7 NICs, time per output token (TPOT) improves almost linearly: roughly 2x with TP2 (two nodes) and roughly 4x with TP4 (four nodes) in inference use cases. Table 3 shows how a single agent performs an inference job shared across multiple nodes.

Metric | 1 DGX Spark node, TP1 (ms) | 2 DGX Spark nodes, TP2 (ms) | 4 DGX Spark nodes, TP4 (ms)
TTFT (lower is better) | 33,415 | 21,384 | 15,552
TPOT (lower is better) | 269 | 133 | 72

Table 3. Scaling Llama 3.3 70B Instruct NVFP4 on TensorRT LLM with one, two, and four DGX Spark nodes (32K input, 1K output, batch size 1)

Several models that are popular in the context of OpenClaw—including Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B—can benefit from stacking multiple DGX Spark units, increasing the available memory.

Near-linear fine-tuning

Fine-tuning and similar workloads can be significantly parallelized with close-to-linear performance scaling when the model instance can fit on one GPU. This reduces the communication overhead to only gradient synchronization at the end of each step.
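To illustrate the data-parallel pattern just described, here is a minimal sketch, not from the original post, assuming PyTorch DistributedDataParallel launched with torchrun and one process per DGX Spark node; the model, dataset, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_ddp(model, dataset, epochs=1):
    # One process per node; rank and world size come from the launcher (for example, torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each node holds a full copy of the model; DDP all-reduces gradients once per step.
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for batch, labels in loader:
            optimizer.zero_grad()
            outputs = ddp_model(batch.cuda(local_rank))
            loss = torch.nn.functional.cross_entropy(outputs, labels.cuda(local_rank))
            loss.backward()  # gradients are synchronized across nodes here, once per step
            optimizer.step()

    dist.destroy_process_group()

Because each node owns a full replica of the model, the only cross-node traffic is the gradient all-reduce triggered during loss.backward(), which is why this pattern can scale close to linearly over the RoCE-connected nodes.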
An RL workload in NVIDIA Isaac Lab or Nanochat can benefit from this performance scaling. Isaac Lab can accommodate several copies of each environment on each DGX Spark. At each step, Isaac Lab communicates with the other nodes to synchronize training, achieving linear speedup through clustering.

Metric | 1 DGX Spark node, TP1 | 2 DGX Spark nodes, TP2 | 4 DGX Spark nodes, TP4
Collection time | 12.1 s | 11.4 s | 10.4 s
Learning time | 40.9 s | 41.4 s | 42.3 s
# environments | 1,024 | 1,024 | 1,024
FPS | 630 | 1,241 | 2,520

Table 4. Scaling of Isaac Lab reinforcement learning performance on one, two, and four DGX Spark nodes

HW configuration | Total token throughput (tok/s) | Speedup versus 1 DGX Spark node
1 DGX Spark node | ~18,400 | 1
2 DGX Spark nodes | ~35,900 | 2
4 DGX Spark nodes | ~74,600 | 4

Table 5. Scaling of Nanochat fine-tuning performance from one to four DGX Spark nodes (model depth of 20 layers, batch size of 32 per node, full context attention)

When using distributed data parallel (DDP), fine-tuning can similarly benefit from the low communication overhead. In this case, each node can fully host a copy of the model and communicate with the other nodes once per step.

Nodes | Samples/step | Batch size | Samples/s | Speedup
1 DGX Spark node | 15.73 | 32 | 2.03 | –
3 DGX Spark nodes | 15.69 | 96 | 6.12 | 3x

Table 6. Scaling from one DGX Spark node to three DGX Spark nodes, where each node hosts the full Qwen3 4B model (batch size of four samples per device, BF16 quantization)

Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows

Cloud solutions are required when moving from prototyping to large-scale production deployment. This section explains how workloads developed on DGX Spark can be deployed in the cloud. Tile IR and cuTile Python enable seamless kernel portability from DGX Spark development environments to cloud deployment on NVIDIA Blackwell data center GPUs, with minimal code changes.

Using TileGym, developers can:
- Write kernels once using the cuTile Python DSL
- Test and validate on DGX Spark
- Deploy to NVIDIA Blackwell B300/B200, NVIDIA Hopper, or NVIDIA Ampere with minimal code changes
- Leverage TileGym preoptimized transformer kernels as drop-in replacements

End-to-end inference performance

Beyond kernel-level analysis, we benchmarked complete Qwen2 7B inference using cuTile kernels on both platforms to demonstrate cross-architecture performance portability. Table 7 shows the configuration; Table 8 shows the platform specifications.

Parameter | Value
Model | Qwen2 7B
Input length | 2,189 tokens
Output length | 128 tokens
Batch sizes | 1, 2, 4, 8, 16, 32, 64, 128

Table 7. Model and parameter specifications showing Tile IR usage

Specification | NVIDIA DGX Spark (Dev) | NVIDIA Blackwell B200 (Cloud)
Compute capability | SM 12.1 | SM 10.0
SM count | 48 | 148
SM frequency | 2.14 GHz | ~1.0 GHz
Memory type | LPDDR5X (Unified) | HBM3e
Memory bandwidth | 273 GB/s | ~8 TB/s

Table 8. Platform specifications of NVIDIA DGX Spark and NVIDIA B200 as local and cloud examples

Platform-specific configuration

While the kernel source code remains identical across platforms, optimal performance is achieved through platform-specific configurations (tile size and occupancy). For the FMHA kernel example, Table 9 shows how these configurations adapt to different hardware characteristics. Tile IR compiles to architecture-specific PTX/SASS at JIT time, automatically leveraging platform-specific features such as the Tensor Memory Accelerator (TMA) with the appropriate configuration.
Platform | TILE_M | TILE_N | Occupancy | Rationale
NVIDIA DGX Spark (SM 12.1) | 64 | 64 | 2 | Smaller tiles for 48 SMs, unified memory
NVIDIA B200 (SM 10.0) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput
NVIDIA B200 (alt) | 128 | 128 | 2 | Higher occupancy, balanced parallelism

Table 9. Platform-specific cuTile configuration across NVIDIA DGX Spark and NVIDIA B200

Roofline analysis and comparison of Tile IR kernel performance

Roofline analysis in NVIDIA Nsight Compute is a powerful visual performance framework used to determine how well an application is utilizing hardware capabilities. As a developer, roofline analysis helps you figure out whether your code is “slow” and shows why it may be hitting a performance ceiling. Analysis of the roofline model suggests that the kernel scales effectively relative to the respective roofline, demonstrating that Tile IR is a viable option to scale workloads. The kernel considered is the attention decode kernel, optimized using Tile IR.

Figure 1. Roofline analysis in NVIDIA Nsight Compute shows how Tile IR kernel performance scales on NVIDIA B200 and NVIDIA DGX Spark relative to the theoretical peak roofline of each GPU

Performance scaling and optimization headroom

In Figure 1, the vertical positioning of the data points on the y-axis confirms that the kernel achieves higher hardware utilization on NVIDIA B200. Specifically, the vertical proximity of the blue dot to the NVIDIA B200 GPU memory roofline is greater than that of the green dot to the Spark roofline. This roofline analysis indicates additional opportunities for optimization, and that algorithmic or memory optimizations on NVIDIA DGX Spark will also benefit NVIDIA B200 GPUs.

Cache utilization and arithmetic intensity

Analysis of the x-axis reveals that the blue dot is positioned to the right of the green dot, signifying that the B200 achieves superior hardware arithmetic intensity.

- Cache efficiency: While the larger cache capacity of the NVIDIA B200 GPU provides the theoretical foundation for reducing DRAM traffic, hardware alone is insufficient. The software must be architected to exploit these resources.
- Kernel portability: The rightward shift indicates that Tile IR kernels successfully leverage the expanded cache hierarchy of the NVIDIA B200 on migration.

Future Tile IR kernel optimizations aimed at increasing arithmetic intensity on Spark—moving the data point further right along the x-axis—will inherently result in compounded performance benefits when running on various cloud GPUs.

Automated cross-platform autotuning

Currently, optimal configurations are selected based on platform characteristics. Future releases of cuTile will support fully automated cross-platform autotuning. The autotuner will discover optimal tile sizes and occupancy settings for each target architecture automatically, enabling transparent performance portability without any manual configuration.

Get started with NVIDIA DGX Spark

As AI systems become more sophisticated, NVIDIA DGX Spark provides the flexible, multitopology execution environment required to deploy them efficiently. From multiagent inference to trillion-parameter serving, from fine-tuning to Tile IR cross-cloud pipelines, DGX Spark delivers both scalability and efficiency. The result is a unified platform where enterprises can deploy and scale AI workloads—without rewriting infrastructure for every model or runtime.
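To make the roofline comparison above concrete, here is a minimal Python sketch, not from the original post, of the roofline model itself: attainable throughput is the minimum of peak compute and memory bandwidth multiplied by arithmetic intensity. The memory bandwidth values come from Table 8; the peak TFLOPS numbers are illustrative placeholders, not published specifications.

def attainable_tflops(arithmetic_intensity, peak_tflops, mem_bw_tbs):
    """Roofline model: performance is capped by compute or by memory traffic.

    arithmetic_intensity: FLOPs performed per byte moved from memory (FLOP/B)
    peak_tflops: peak compute throughput of the GPU (TFLOP/s)
    mem_bw_tbs: memory bandwidth (TB/s); bandwidth * intensity gives the memory-bound ceiling
    """
    return min(peak_tflops, mem_bw_tbs * arithmetic_intensity)

# Memory bandwidth from Table 8; peak TFLOPS here are illustrative placeholders.
platforms = {
    "DGX Spark": {"peak_tflops": 100.0, "mem_bw_tbs": 0.273},
    "B200": {"peak_tflops": 2000.0, "mem_bw_tbs": 8.0},
}

# A decode attention kernel is typically memory-bound (low FLOP/B), so it sits on the
# sloped part of the roofline; raising its arithmetic intensity helps on both GPUs.
for intensity in (4, 16, 64, 256):
    row = ", ".join(
        f"{name}: {attainable_tflops(intensity, **spec):,.0f} TFLOP/s"
        for name, spec in platforms.items()
    )
    print(f"intensity {intensity:>3} FLOP/B -> {row}")

The ridge point, where the two ceilings meet, is peak_tflops / mem_bw_tbs; kernels to its left are memory-bound, which matches the decode-attention behavior described above.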
Learn more with the following playbooks:
- Connect Three DGX Spark in a Ring Topology
- Connect Multiple DGX Spark through a Switch

Start building on NVIDIA DGX Spark.

Tags: Agentic AI / Generative AI | Data Science | Edge Computing | General | Isaac Lab | TensorRT-LLM | Intermediate Technical | Deep dive | News | AI Agent | AI Inference | ConnectX | DGX Spark | featured | LLMs

About the Authors

About Allen Bourgoyne
Allen Bourgoyne is the director of product marketing for workstation platforms with the NVIDIA Enterprise Platforms group. For the past decade, he has been responsible for bringing cutting-edge visualization, compute, and AI technology products to market. He has over 25 years of experience in hardware and software development and product marketing. Allen holds a B.S. degree in computer science from the University of Louisiana at Lafayette.
Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 16:09 0.722
Embedding sim.0.8088
Entity overlap0.1111
Title sim.0.2483
Time proximity0.9997
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

NVIDIA Groq 3 LPX is a new rack-scale inference accelerator for the NVIDIA Vera Rubin platform, designed for the low-latency and large-context demands of agentic systems. Co-designed with the NVIDIA Vera Rubin NVL72, LPX equips the AI factory with an engine optimized for fast, predictable token generation, while Vera Rubin NVL72 remains the flexible, general-purpose workhorse for training and inference, delivering high throughput across prefill and decode, including long-context processing, decode attention, and high-concurrency serving at scale.

This combination matters because the agentic future demands a new category of inference. As generation speeds approach 1,000 tokens per second per user, models move beyond conversation-speed interaction toward speed-of-thought computing. At that rate, AI systems can reason, simulate, and respond continuously, enabling experiences that feel less like turn-based chat and more like real-time collaboration. This shift also raises the ceiling for multi-agent systems. Individual agents can be powerful on their own, but coordinated groups of agents can accomplish far more, much like human societies scale their capability through collective intelligence and coordination.

Supporting these emerging workloads requires infrastructure that can deliver both high throughput and low latency. The combination of Vera Rubin NVL72 and LPX enables this heterogeneous architecture, pairing large-scale AI factory performance with the fast token generation needed to power continuously running agentic systems and next-generation AI applications.

Introducing NVIDIA Groq 3 LPX

Vera Rubin and LPX unite the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Integrated with the NVIDIA MGX ETL rack architecture and aligned with the broader Vera Rubin platform, LPX gives data centers a way to deploy a dedicated low-latency inference path alongside Vera Rubin NVL72 within a common infrastructure design.

The system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators. Its architecture emphasizes deterministic execution, high on-chip SRAM bandwidth, and tightly coordinated scale-up communication so interactive inference can stay responsive even as concurrency rises and request shapes vary. Deployed alongside Vera Rubin NVL72, LPX accelerates the latency-sensitive portions of the decode loop, including FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. Together, they deliver a heterogeneous serving path that improves interactive responsiveness without sacrificing AI factory throughput.

Figure 1. NVIDIA Groq 3 LPX rack-scale system

At rack scale, LPX delivers:

Specification | NVIDIA Groq 3 LPX
AI inference compute | 315 PFLOPS
Total SRAM capacity | 128 GB
On-chip SRAM bandwidth | 40 PB/s
Scale-up density | 256 chips
Scale-up bandwidth | 640 TB/s

Table 1. NVIDIA Groq 3 LPX specifications

Vera Rubin NVL72 and LPX create a more heterogeneous inference architecture for the AI factory—one that can support high aggregate token production and responsive interactive AI experiences.

Inside the NVIDIA Groq 3 LPX compute tray

The LPX rack-scale accelerator houses 32 liquid-cooled 1U compute trays, each designed to support low-latency inference at scale.
Every tray integrates eight LPU accelerators, a host processor, and fabric expansion logic in a cableless design that simplifies rack-scale deployment and tightly couples compute with communication. LPU chip-to-chip (C2C) links provide direct communication within the tray, across trays via the LPU C2C spine, and across racks as systems scale. Connectivity is important because interactive inference isn't just about raw compute. It also depends on how efficiently the system can move data, coordinate work, and avoid variable delays as requests flow across devices.

Figure 2. NVIDIA Groq 3 LPX compute tray and module

Each tray provides:

Resource | Per LPX tray
LP30 chips | 8
On-chip SRAM | 4 GB
SRAM bandwidth | 1.2 PB/s
DRAM via fabric expansion logic | Up to 256 GB
DRAM via host CPU | Up to 128 GB
AI inference compute (FP8) | 9.6 PFLOPS
Scale-up bandwidth | 20 TB/s

Table 2. NVIDIA Groq 3 LPX compute tray specifications

At the system level, LPX is built for inference regimes where coordination overhead and jitter can quickly become visible to users. This is especially relevant as more AI applications move away from offline or throughput-oriented serving and toward interactive generation. To see how LPX is optimized for that regime, it helps to look at the processor architecture at the core of the system: the NVIDIA Groq 3 LPU.

First look at the architecture of the NVIDIA Groq 3 LPU—the seventh chip of the Vera Rubin platform

At the heart of LPX is the NVIDIA Groq 3 LPU, designed to deliver fast, predictable token generation by tightly coupling compute, memory, and communication under compiler control. Rather than optimizing only for peak arithmetic throughput, the LPU emphasizes deterministic execution, high on-chip memory bandwidth, and explicit data movement. These capabilities are especially important for decode-dominant, latency-sensitive inference regimes.

Figure 3. NVIDIA Groq 3 LPU chip architecture

Tensor-first compute and explicit data movement

Compute and communication in the LPU are organized around 320-byte vectors as the unit of work. Arithmetic operations, memory access, and inter-device transfers all operate on these fixed-size vectors, simplifying scheduling and synchronization. Specialized execution modules handle different classes of operations:

- Matrix execution modules (MXM) provide dense multiply-accumulate capability for tensor operations, operating on fixed data types with predictable throughput.
- Vector execution modules (VXM) handle pointwise arithmetic, type conversions, and activation functions using a mesh of arithmetic logic units (ALUs) per lane.
- Switch execution modules (SXM) perform structured data movement, including permutation, rotation, distribution, and transposition of vectors.

By making data movement explicit and programmable, the LPU enables memory access, compute, and communication to be overlapped, rather than relying on hardware heuristics.

MEM enables extreme on-chip memory bandwidth

A central element of the LPU is the MEM block—a flat, SRAM-first memory architecture where 500 MB of high-speed on-chip SRAM serves as the primary working storage for inference. Rather than relying on hardware-managed caches, the compiler and runtime place the active working set, including weights, activations, and KV state, into on-chip memory and move data explicitly.
This reduces unpredictable stalls and helps deliver low, stable latency by keeping the most latency-sensitive data close to compute. Because on-chip SRAM capacity is finite, larger models are scaled across many interconnected LPU accelerators using parallel execution strategies such as layer-wise partitioning, so the overall system presents a much larger effective working set. In this design, performance is governed less by peak arithmetic throughput and more by how consistently the system can keep compute fed, which is why the LPX pairs 150 TB/s of on-chip memory bandwidth with high bandwidth scale-up chip-to-chip (C2C) communication per LPU. C2C scaling with predictable communication To scale inference across multiple devices, the LPU includes high-radix, high-speed C2C links designed for deterministic data exchange. Each LPU connects through 96 C2C links running at 112 Gbps each, enabling a streamlined LPX scale-up topology with high aggregate I/O bi-directional bandwidth of 2.5 TB/s and predictable communication timing. This is especially important for distributed inference pipelines, where communication overhead can otherwise become a major source of latency. Deterministic, compiler-orchestrated execution The LPU builds on Groq’s spatial execution model, where the compiler explicitly schedules computation, data movement, and synchronization. Instead of relying on dynamic hardware schedulers at runtime, the compiler relies on plesiosynchronous, chip-to-chip protocol in hardware that cancels natural clock drift and aligns hundreds of LPU accelerators to act as a single coordinated system. With predictable data arrival and periodic software synchronization, developers can reason more directly about timing, and the system can coordinate both compute and network behavior with much greater determinism. This execution model enables: Precise coordination between memory and compute. Explicit control over instruction timing. Reduced execution jitter under variable workloads For real-time inference, this determinism helps keep time-to-first-token and per-token latency stable, even at small batch sizes. The shift toward interactive inference AI inference spans a broad performance spectrum. On one end are throughput-optimized services such as batch document processing, moderation, embeddings, and media pipelines, where the goal is to maximize tokens per GPU, tokens per watt, or overall cost efficiency. These workloads often support large-scale shared services, including free-tier and background AI offerings, where high utilization matters more than per-user responsiveness. On the other end are latency-optimized services such as coding assistants, chatbots, voice assistants, copilots, and interactive agents, where delays are immediately visible to users. In these workloads, the most important metrics are time-to-first-token, tokens per second per user, and tail latency. Many modern AI platforms must support both regimes simultaneously, running high-throughput backends for large-scale processing while delivering responsive interactive experiences. This divergence is one reason heterogeneous inference architectures are becoming increasingly important. What makes interactive inference harder Several trends are making low-latency interactive inference both more important and harder to serve efficiently, as shown in Table 3. 
As models produce longer outputs and context windows grow, more of the workload shifts into decode, where tokens are generated sequentially and responsiveness is exposed directly to the user.

Force | Why it matters
Low latency as a product feature | In interactive applications, responsiveness is no longer just an infrastructure metric; it is part of what users evaluate.
Longer reasoning outputs | As models generate longer outputs and multi-step chains of thought, more of the request shifts into sequential token generation.
Prefix caching | Reusing shared prompt state can reduce prefill cost, but it also increases the relative share of request-specific decode work that still has to be served quickly.
Longer contexts | As context grows, the Transformer's self-attention mechanism becomes increasingly constrained by data movement and memory bandwidth.

Table 3. Four forces that make traditional serving techniques less effective for low-latency inference

At the same time, longer contexts increase pressure on memory bandwidth and data movement, while serving many concurrent users reduces the batching efficiency that throughput-oriented systems rely on. As a result, systems optimized for maximum aggregate throughput are not always the best fit for workloads that require fast, predictable token generation for each request. This challenge becomes even more pronounced in agentic AI, where systems repeatedly cycle through inference, retrieval, tool use, and reasoning. In these loops, latency compounds across each step, making stable per-token performance and strong tail-latency behavior critical for responsive user experiences.

The era of agentic inference requires a new architecture

Inference isn't a single, uniform workload. Within a request, prefill and decode place different demands on hardware, and those demands shift with batch size, context length, and model structure. Some phases, including self-attention and sparse MoE, can become highly sensitive to memory bandwidth and data movement, while others, such as dense projection and feed-forward layers, scale efficiently on throughput-optimized hardware when enough parallelism is available. In interactive decode, many operations run at very small batch sizes, making latency much more sensitive to stalls, contention, and jitter.

Optimizing the entire pipeline for only one regime forces a compromise. Hardware tuned for peak throughput under large batches isn't ideal for the most latency-sensitive execution paths, while hardware optimized for low-latency execution is less efficient for the most compute-intensive phases. As shown in Figure 4, a heterogeneous system combines both approaches, pairing low-latency interactive performance with high AI factory throughput. The result is a two-engine architecture: GPUs deliver high output for context-heavy prefill and execute decode attention, while LPUs accelerate latency-sensitive decode components such as FFN/MoE execution. Together, they improve interactivity without giving up AI factory throughput.

Figure 4. Heterogeneous inference expands the Pareto frontier

Vera Rubin NVL72 meets LPX

Modern inference is a relay race. The same hardware that runs the heavy context leg doesn't need to anchor the sprint to the next token. Rubin GPUs are the flexible, general-purpose workhorses for training and inference. They deliver high throughput across many model sizes, batch regimes, and serving patterns, from long-context prefill to decode attention and high-concurrency inference at scale.
LPX adds a specialized path optimized for fast, latency-sensitive token generation. Together, they enable a heterogeneous inference design that improves interactive responsiveness without giving up system-scale efficiency. Figure 5. Uniting processors of extreme FLOPS (Rubin GPU) and bandwidth (Groq 3 LPU) Decode phase: A repeated multi-engine loop The prefill phase is dominated by ingesting large inputs and building the KV cache—a workload that benefits from dense parallel compute and large memory capacity. The Vera Rubin NVL72 handles this phase efficiently, especially for long-context workloads and MoE models where context can be large and highly variable. The decode phase is different. Decode is a repeated per-token loop, and different parts of that loop stress different bottlenecks. In the Vera Rubin platform architecture with LPX, decode is best thought of as a two-engine loop. GPUs handle decode work that benefits most from throughput and large memory capacity, such as full-context attention over the accumulated KV cache. LPX accelerates latency-sensitive execution within decode, such as sparse MoE expert feed-forward networks (FFNs) and other pointwise operations. This split, often described as decode phase disaggregation or attention–FFN disaggregation (AFD), separates attention from FFN within decode and exchanges intermediate activations for each token, so each engine runs the part of the loop it is best suited to execute. This AFD loop expands the highest-value operating region of the Pareto frontier. Figure 6. AFD decode explained At rack scale and beyond, the LPX is designed to operate as a tightly coordinated unit of compute, minimizing coordination overhead and reducing jitter. This is valuable in decode-heavy, agentic workflows where small delays compound across many model calls and verification loops. NVIDIA Dynamo makes heterogeneous decode operational Making heterogeneous decode practical requires software that can classify requests, route work by latency targets, move intermediate activations with low overhead, and keep tail latency stable under bursty, variable traffic. NVIDIA Dynamo provides that orchestration layer by coordinating disaggregated serving and disaggregated decode across heterogeneous backends. In practice, Dynamo routes prefill to GPU workers to process the large context and build the KV cache. During decode, Dynamo orchestrates the AFD loop where GPUs run attention over the accumulated KV cache, intermediate activations are handed off to LPUs for FFN/MoE execution, and outputs return to the GPUs to continue token generation. The result is a single coherent serving path with more predictable tail latency while sustaining high AI factory throughput. Figure 7. Dynamo orchestrates heterogeneous compute With KV-aware routing, low-overhead transfers, and latency-target-driven scheduling, Dynamo helps keep interactive sessions out of long queues, reduces cross-tenant jitter, and maintains stable tail latency as concurrency and request shapes vary. The result is a production-ready heterogeneous serving model that delivers responsive user experiences while sustaining high AI factory throughput at scale. ​​Accelerating speculative decoding with LPX Speculative decoding is an increasingly important technique for reducing latency in LLM inference. The approach uses a smaller draft model to generate multiple candidate tokens ahead of time, while a larger target model verifies and accepts those tokens in parallel. 
When the predictions match, multiple tokens can be committed at once, significantly increasing effective tokens per second and reducing response latency. LPX is well-suited to act as the draft-generation engine in this architecture. The deterministic execution model and extremely high on-chip SRAM bandwidth of the LPU enable very fast draft token generation, enabling the draft model to run ahead of the verifier. At the same time, GPUs such as Rubin remain highly efficient for large-model execution tasks such as prefill, attention processing, and token verification. By pairing the two, the system combines the strengths of both processors: LPX generates draft tokens rapidly using its low-latency architecture. Rubin GPUs verify and finalize tokens efficiently using high-throughput compute and large memory capacity. This separation enables speculative decoding to run across heterogeneous processors, rather than running both draft and verifier models on the same hardware. The result is a system that can deliver faster draft generation without sacrificing the efficiency of GPU-based verification. Figure 8. Speculative decoding with GPU verification and LPU draft generation Unlocking intelligent agentic swarms As AI use cases evolve from simple chat and batch inference to multi-step agentic workflows, responsiveness becomes a requirement. Offline inference and basic assistants can often prioritize aggregate throughput, but interactive applications, deep research, and agentic pipelines combine high token volume with tight feedback loops, where latency compounds across many model calls and tool interactions. In this regime, heterogeneous inference matters. Pairing a high-throughput engine for long-context processing with a low-latency engine for decode FFNs makes it possible to increase user interactivity without sacrificing AI factory output. Figure 9. Relative compute and interactivity requirements for various AI workloads, highlighting the need for high throughput and low-latency for agentic swarms Unlocking a new category of AI experiences on the Pareto frontier A practical way to visualize this tradeoff between performance and cost is the Pareto frontier , plotting user interactivity, measured in tokens per second per user (TPS per user), on the horizontal axis against AI factory throughput, measured in tokens per second per megawatt (TPS per MW), on the vertical axis. As shown in Figure 10, different AI services operate at very different points on this curve. Throughput-first services, including many free-tier and background workloads, typically prioritize maximum efficiency and high utilization and often use smaller models with shorter context windows. Premium AI services, by contrast, demand higher model capability and far more responsive user-visible performance, especially for long-context reasoning and agentic workflows. In Figure 10, that premium tier is represented by a 2-trillion-parameter MoE model with a 400K input context window operating at roughly 400 TPS per user and beyond. Figure 10. Unlocking a new category of AI experiences with Vera Rubin NVL72 and Groq 3 LPX Reaching these premium operating points with a single homogeneous platform forces a tradeoff between responsiveness and overall AI factory throughput because the workload mixes fundamentally different performance regimes within the same serving pipeline. 
A heterogeneous architecture expands the achievable region by combining complementary execution paths, allowing the system to sustain high factory output while delivering highly responsive, low-latency interactive experiences. As illustrated in Figure 10, the combination of Vera Rubin NVL72 and LPX delivers up to 35x higher TPS per megawatt at 400 TPS per user compared with the NVIDIA GB200 NVL72, effectively creating a new premium performance tier for interactive AI services. This shift has a direct economic impact. Higher responsiveness expands the set of premium experiences an AI factory can serve and increases value per unit of infrastructure. With the Vera Rubin platform, AI factories can unlock up to 5x more revenue per megawatt compared with the GB200 NVL72, and up to 10x by pairing Vera Rubin NVL72 with LPX for the most latency-sensitive, high-value interactive workloads, such as agentic coding and multi-agent systems. Figure 11. NVIDIA Vera Rubin NVL72 and LPX provide a 10x revenue opportunity What NVIDIA Groq 3 LPX enables for Developers Developers are increasingly building systems that require three things at once: Responsiveness: low and predictable latency for interactive experiences and agent loops. Capability: strong model quality, reasoning depth, and long-context understanding. Scale: high-throughput and cost efficiency to serve many concurrent users or agents. LPX broadens the set of workloads an AI factory can serve efficiently. Use the low-latency path where predictable token generation improves experience, such as coding assistants, agentic workflows with tight tool-calling loops, voice interactions, and real-time translation. Keep throughput-first workloads on Rubin GPUs, such as batch serving, long-context throughput runs, where high concurrency and batching keep GPUs consistently busy and cost-efficient. The operational shift is mindset. Stop optimizing for one headline metric and start optimizing for a range of real-world operating points. Learn more Dive deeper into the architecture behind NVIDIA Groq 3 LPX and Vera Rubin by starting with the NVIDIA product pages and technical blogs covering the Vera Rubin platform , LPX , AFD , and Dynamo . Explore the underlying research on tensor streaming processors and software-defined silicon design for AI. Together, these resources offer a deeper look at the hardware, system architecture, and orchestration software behind heterogeneous, low-latency inference at scale. Next, join a NVIDIA Developer Forum thread focused on inference and deployment to compare notes with other teams building low-latency serving systems. 
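To make the speculative decoding loop described earlier concrete, the following is a minimal, framework-agnostic Python sketch, not from the original post. The draft_model and target_model callables are placeholders; in the heterogeneous setup described above, drafting would run on the low-latency LPU path and verification on the GPU path.

from typing import Callable, List

def speculative_decode_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],   # returns one draft token (fast, low-latency engine)
    target_model: Callable[[List[int]], int],  # returns the target model's next token (high-throughput engine)
    k: int = 4,
) -> List[int]:
    """Propose k tokens with the draft model, then keep the prefix the target model agrees with."""
    # 1. Draft phase: generate k candidate tokens autoregressively with the small model.
    draft_tokens = []
    context = list(prefix)
    for _ in range(k):
        token = draft_model(context)
        draft_tokens.append(token)
        context.append(token)

    # 2. Verify phase: the target model checks the candidates in order (its forward passes
    #    over the k positions can run in parallel on the GPU engine).
    accepted = []
    context = list(prefix)
    for token in draft_tokens:
        expected = target_model(context)
        if expected != token:
            # First mismatch: keep the target model's own token and stop.
            accepted.append(expected)
            break
        accepted.append(token)
        context.append(token)

    # Up to k tokens committed per draft/verify round (greedy variant; probabilistic acceptance omitted).
    return accepted

When the draft engine stays ahead of the verifier, each round commits several tokens for roughly the cost of one target-model step, which is where the per-user tokens-per-second gains come from.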
Resources
- NVIDIA LPX page
- Press Release: NVIDIA Vera Rubin Opens Agentic AI Frontier
- Tech Blog: Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer
- Tech Blog: NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer
- Tech Blog: Announcing NVIDIA Dynamo 1.0: Scaling MultiNode Inference in Production
- Video: The Future of AI Inference – Explainer on Attention-FFN Disaggregation (AFD) (starting at 18:00)
- Tech Blog: NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories
- Research Paper: Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads
- Research Paper: A Software-defined Tensor Streaming Multiprocessor for Large-scale Machine Learning
- Video: Enabling PyTorch's Thousand Ops for Software First Silicon Design

Acknowledgments
Thanks to Amr Elmeleegy, Andrew Bitar, Andrew Ling, Graham Steele, Itay Neeman, Jamie Li, Omar Kilani, Santosh Raghavan, and Stuart Pitts, along with many other NVIDIA product leaders, engineers, and architects who contributed to this post.

Tags: Agentic AI / Generative AI | Data Center / Cloud | Cloud Services | General | Intermediate Technical | Deep dive | AI Factory | featured | Groq 3 LPX | GTC 2026 | Rubin GPU | Vera Rubin

About the Authors

About Kyle Aubrey
Kyle Aubrey is the director of Technical Marketing at NVIDIA, where he leads initiatives in AI inference and training across NVIDIA accelerated computing platforms, including Hopper, Blackwell, Rubin, and beyond. With a passion for demystifying complex technologies, he empowers diverse audiences to harness the full potential of NVIDIA's cutting-edge solutions. Kyle holds a bachelor's degree in Electrical Engineering from Rose-Hulman Institute of Technology and an MBA from Pepperdine University.

About Farshad Ghodsian
Farshad Ghodsian is a senior technical marketing engineer at NVIDIA, where he focuses on AI training and inference at scale, performance optimization insights, new model releases, and AI engineering enablement. He brings a wealth of experience at the intersection of AI infrastructure, distributed training, GPU-accelerated computing, and cloud-native MLOps—translating cutting-edge research into practical insights for developers, enterprise teams, and business leaders. Prior to NVIDIA, Farshad held technical roles at leading semiconductor and consulting companies, where he helped build and manage large-scale generative AI and MLOps platforms for top technology customers.
Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3 aws_ml_blog 26.03.2026 17:20 0.72
Embedding sim.0.848
Entity overlap0.0541
Title sim.0.188
Time proximity0.7329
NLP типother
NLP организацияAWS
NLP темаmachine learning
NLP страна

Открыть оригинал

Last year, AWS announced an integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets. This integration makes it straightforward for teams to use unstructured data stored in Amazon Simple Storage Service (Amazon S3) for machine learning (ML) and data analytics use cases. In this post, we show how to integrate S3 general purpose buckets with Amazon SageMaker Catalog to fine-tune Llama 3.2 11B Vision Instruct for visual question answering (VQA) using Amazon SageMaker Unified Studio. For this task, we provide our large language model (LLM) with an input image and a question and receive an answer, for example, asking the model to identify the transaction date from an itemized receipt.

For this demonstration, we use Amazon SageMaker JumpStart to access the Llama 3.2 11B Vision Instruct model. Out of the box, this base model achieves an Average Normalized Levenshtein Similarity (ANLS) score of 85.3% on the DocVQA dataset. ANLS is a metric used to evaluate the performance of models on visual question answering tasks; it measures the similarity between the model's predicted answer and the ground truth answer. While 85.3% demonstrates strong baseline performance, it might not be sufficient for tasks requiring a higher degree of accuracy and precision.

To improve model performance through fine-tuning, we'll use the DocVQA dataset from Hugging Face. This dataset contains 39,500 rows of training data, each with an input image, a question, and a corresponding expected answer. We'll create three fine-tuned model versions using varying dataset sizes (1,000, 5,000, and 10,000 images). We'll then evaluate them using Amazon SageMaker fully managed serverless MLflow to track experimentation and measure accuracy improvements. The full end-to-end data ingestion, model development, and metric evaluation process will be orchestrated using Amazon SageMaker Unified Studio.

Here is the high-level process flow diagram that we'll step through for this scenario. We'll expand on this throughout the blog post. To achieve this process flow, we build an architecture that performs the data ingestion, data preprocessing, model training, and evaluation using Amazon SageMaker Unified Studio. We break out each step in the following sections. The Jupyter notebook used and referenced throughout this exercise can be found in this GitHub repository.

Prerequisites

To prepare your organization to use the new integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets, you must complete the following prerequisites. Note that these steps take place on an Identity Center-based domain.

- Create an AWS account.
- Create an Amazon SageMaker Unified Studio domain using quick setup.
- Create two projects within the SageMaker Unified Studio domain to model the scenario in this post: one for the data producer persona and one for the data consumer persona. The first project is used for discovering and cataloging the dataset in an Amazon S3 bucket. The second project consumes the dataset to fine-tune three iterations of our large language model. See Create a project for additional information.
- Your data consumer project must have access to a running SageMaker managed MLflow serverless application, which will be used for experimentation and evaluation purposes. For more information, see the instructions for creating a serverless MLflow application.
- An Amazon S3 bucket should be pre-populated with the raw dataset to be used for your ML development use case.
In this blog post, we use the DocVQA dataset from Hugging Face for fine-tuning a visual question answering (VQA) use case.
- A service quota increase request to use p4de.24xlarge compute for training jobs. See Requesting a quota increase for more information.

Architecture

The following is the reference architecture that we build throughout this post. We can break the architecture diagram into a series of six high-level steps, which we'll observe throughout the following sections:

1. First, you create and configure an IAM access role that grants read permissions to a pre-existing Amazon S3 bucket containing the raw and unprocessed DocVQA dataset.
2. The data producer project uses the access role to discover and add the dataset to the project catalog.
3. The data producer project enriches the dataset with optional metadata and publishes it to the SageMaker Catalog.
4. The data consumer project subscribes to the published dataset, making it available to the project team responsible for developing (or fine-tuning) the machine learning models.
5. The data consumer project preprocesses the data and transforms it into three training datasets of varying sizes (1k, 5k, and 10k images). Each dataset is used to fine-tune our base large language model.
6. We use MLflow for tracking experimentation and evaluation results of the three models against our Average Normalized Levenshtein Similarity (ANLS) success metric.

Solution walkthrough

As mentioned previously, we use the DocVQA dataset from Hugging Face for a visual question answering task. In your organization's scenario, this raw dataset might be any unstructured data relevant to your ML use case. Examples include customer support chat logs, internal documents, product reviews, legal contracts, research papers, social media posts, email archives, sensor data, and financial transaction records.

In the prerequisite section of our Jupyter notebook, we pre-populate our Amazon S3 bucket using the Datasets API from Hugging Face:

import os
from datasets import load_dataset

# Create data directory
os.makedirs("data", exist_ok=True)

# Load and save train split (first 10,000 rows)
train_data = load_dataset("HuggingFaceM4/DocumentVQA", split="train[:10000]", cache_dir="./data")
train_data.save_to_disk("data/train")

# Load and save validation split (first 100 rows)
val_data = load_dataset("HuggingFaceM4/DocumentVQA", split="validation[:100]", cache_dir="./data")
val_data.save_to_disk("data/validation")

After retrieving the dataset, we complete the prerequisite by synchronizing it to an Amazon S3 bucket. This represents the bucket depicted in the bottom-right section of our architecture diagram shown previously.

At this point, we're ready to begin working with our data in Amazon SageMaker Unified Studio, starting with our data producer project. A project in Amazon SageMaker Unified Studio is a boundary within a domain where you can collaborate with others on a business use case. To bring Amazon S3 data into your project, you must first add access to the data and then add the data to your project. In this post, we follow the approach of using an access role to facilitate this process. See Adding Amazon S3 data for more information. Once our access role is created following the instructions in the documentation referenced previously, we can continue with discovering and cataloging our dataset.
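For reference, the following is a minimal sketch, not from the original post, of the kind of read-only permissions such an access role attaches, expressed with boto3. The role name, bucket, and prefix are hypothetical placeholders, and the exact trust policy that SageMaker Unified Studio requires is covered in the AWS documentation referenced above.

import json
import boto3

# Hypothetical names for illustration only.
ROLE_NAME = "sagemaker-unified-studio-s3-access-role"
BUCKET = "my-docvqa-raw-data"
PREFIX = "docvqa/*"

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListRawDatasetBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "ReadRawDatasetObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}",
        },
    ],
}

# Attach the inline read-only policy to the pre-created access role.
iam = boto3.client("iam")
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="docvqa-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)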
In our data producer project, we navigate to Data → Add data → Add S3 location. We provide the name of the Amazon S3 bucket and the corresponding prefix containing our raw data, and select the prerequisite access role created earlier from the access role dropdown. Once added, the new Amazon S3 bucket appears in the project catalog.

From the perspective of our data producer persona, the dataset is now available within our project context. Depending on your organization and requirements, you might want to further enrich this data asset. For example, you can join it with additional data sources, apply business-specific transformations, implement data quality checks, or create derived features through feature engineering pipelines. However, for the purposes of this post, we'll work with the dataset in its current form to keep our focus on the core point of integrating Amazon S3 general purpose buckets with Amazon SageMaker Unified Studio.

We are now ready to publish this bucket to our SageMaker Catalog. We can add optional business metadata such as a README file, glossary terms, and other data types. We add a simple README, skip the other metadata fields for brevity, and continue to publishing by choosing Publish to Catalog under the Actions menu. At this point, we've added the data asset to our SageMaker Catalog and it is ready to be consumed by other projects in our domain.

Switching over to the perspective of our data consumer persona and selecting the consumer project, we can now subscribe to our newly published data asset. See Subscribe to a data product in Amazon SageMaker Unified Studio for more information.

Now that we've subscribed to the data asset in our consumer project where we'll build the ML model, we can begin using it within a managed JupyterLab IDE in Amazon SageMaker Unified Studio. The JupyterLab page of Amazon SageMaker Unified Studio provides a JupyterLab interactive development environment (IDE) for data integration, analytics, and machine learning in your projects. In our ML development project, navigate to Compute → Spaces → Create space, and choose JupyterLab in the Application (space type) menu to launch a new JupyterLab IDE. Note that some models in our example notebook can take upwards of 4 hours to train using the ml.p4de.24xlarge instance type. As a result, we recommend that you set the Idle Time to 6 hours to allow the notebook to run to completion and avoid errors. Additionally, if executing the notebook end to end for the first time, set the space storage to 100 GB so the dataset can be fully ingested during the fine-tuning process. See Creating a new space for more information.

With our space created and running, we choose the Open button to launch the JupyterLab IDE. Once loaded, we upload the sample Jupyter notebook into our space using the Upload Files functionality. Now that we've subscribed to the published dataset in our ML development project, we can begin the model development workflow. This involves three key steps: fetching the dataset from our bucket using Amazon S3 Access Grants, preparing it for fine-tuning, and training our models. Grantees can access Amazon S3 data by using the AWS Command Line Interface (AWS CLI), the AWS SDKs, and the Amazon S3 REST API. Additionally, you can use the AWS Python and Java plugins to call Amazon S3 Access Grants. For brevity, we opt for the AWS CLI approach in the notebook and the following code.
We also include a sample that shows the use of the Python boto3-s3-access-grants-plugin in the appendix section of the notebook for reference. The process includes two steps: first obtaining temporary access credentials to the Amazon S3 control plane through the s3control CLI module, then using those credentials to sync the data locally. Update the AWS_ACCOUNT_ID variable with the appropriate account ID that houses your dataset.

import json

AWS_ACCOUNT_ID = "123456789"             # REPLACE THIS WITH YOUR ACCOUNT ID
S3_BUCKET_NAME = "s3://MY_BUCKET_NAME/"  # REPLACE THIS WITH YOUR BUCKET

# Get credentials
result = !aws s3control get-data-access --account-id {AWS_ACCOUNT_ID} --target {S3_BUCKET_NAME} --permission READ
json_response = json.loads(result.s)
creds = json_response['Credentials']

# Configure profile with cell magic
!aws configure set aws_access_key_id {creds['AccessKeyId']} --profile access-grants-consumer-access-profile
!aws configure set aws_secret_access_key {creds['SecretAccessKey']} --profile access-grants-consumer-access-profile
!aws configure set aws_session_token {creds['SessionToken']} --profile access-grants-consumer-access-profile

print("Profile configured successfully!")

!aws s3 sync {S3_BUCKET_NAME} ./ --profile access-grants-consumer-access-profile

After running the previous code and getting a successful output, we can now access the S3 bucket locally. With our raw dataset now accessible locally, we need to transform it into the format required for fine-tuning our LLM. We'll create three datasets of varying sizes (1k, 5k, and 10k images) to evaluate how the dataset size impacts model performance. Each training dataset contains a train and validation directory, each of which must contain an images subdirectory and an accompanying metadata.jsonl file with training examples. The metadata file format includes three key/value fields per line:

{"file_name": "images/img_0.jpg", "prompt": "what is the date mentioned in this letter?", "completion": "1/8/93"}
{"file_name": "images/img_1.jpg", "prompt": "what is the contact person name mentioned in letter?", "completion": "P. Carter"}

With these artifacts uploaded to Amazon S3, we can now fine-tune our LLM by using SageMaker JumpStart to access the pre-trained Llama 3.2 11B Vision Instruct model. We'll create three separate fine-tuned variants to evaluate. We've created a train() function to facilitate this using a parameterized approach, making this reusable for different dataset sizes:

def train(name, instance_type, training_data_path, experiment_name, run):
    ...
    estimator = JumpStartEstimator(
        model_id=model_id,
        model_version=model_version,
        environment={"accept_eula": "true"},  # Must accept as true
        disable_output_compression=True,
        instance_type=instance_type,
        hyperparameters=my_hyperparameters,
    )
    ...

Our training function handles several important aspects:

- Model selection: Uses the latest version of Llama 3.2 11B Vision Instruct from SageMaker JumpStart.
- Hyperparameters: The sample notebook uses the retrieve_default() API in the SageMaker SDK to automatically fetch the default hyperparameters for our model.
- Batch size: The only default hyperparameter that we change, setting it to 1 per device due to the large model size and memory constraints.
- Instance type: We use an ml.p4de.24xlarge instance type for this training job and recommend that you use the same type or larger.
- MLflow integration: Automatically logs hyperparameters, job names, and training metadata for experiment tracking.
- Endpoint deployment: Automatically deploys each trained model to a SageMaker endpoint for inference.

Recall that the training process will take a few hours to complete using instance type ml.p4de.24xlarge.

Now we'll evaluate our fine-tuned models using the Average Normalized Levenshtein Similarity (ANLS) metric. This metric evaluates text-based outputs by measuring the similarity between predicted and ground truth answers, even when there are minor errors or variations. It is particularly useful for tasks like visual question answering because it can handle slight variations in answers. See the Llama 3.2 3B model card for more information. MLflow will track our experiments and results for straightforward comparison. Our evaluation pipeline includes several key functions for image encoding for model inference, payload formatting, ANLS calculation, and results tracking. The training_pipeline() function orchestrates the complete workflow with nested MLflow runs for better experiment organization.

# MLFlow configuration
arn = ""  # replace with ARN of project's MLflow instance
mlflow.set_tracking_uri(arn)

def training_pipeline(training_size):
    # Set experiment
    experiment_name = f"docvqa-{training_size}"
    mlflow.set_experiment(experiment_name)

    # Start main run
    with mlflow.start_run(run_name="pipeline-run"):
        # DataPreprocess nested run
        with mlflow.start_run(run_name="DataPreprocess", nested=True):
            training_data_path = process_data("train", f"docvqa_{training_size}/train", training_size)

        # TrainDeploy nested run
        with mlflow.start_run(run_name="TrainDeploy", nested=True) as run:
            model_name = train(f"docvqa-{training_size}", "ml.p4d.24xlarge", training_data_path, experiment_name, run)
            #model_name = 'base-model'

        # Evaluate nested run
        with mlflow.start_run(run_name="Evaluate", nested=True):
            # Load validation data
            with open("./docvqa_1k/validation/metadata.jsonl") as f:
                data = [json.loads(line) for line in f]

            print(f"\nStarting validation for {model_name}")

            # Log parameters
            mlflow.log_param("model_name", model_name)
            mlflow.log_param("total_images", len(data[:50]))
            mlflow.log_param("threshold", 0.5)

            predictor = retrieve_default(
                model_id="meta-vlm-llama-3-2-11b-vision-instruct",
                model_version="*",
                endpoint_name=model_name,
            )

            results = []
            anls_scores = []

            # Process each image
            for i, each in enumerate(data[:50]):
                filename = each['file_name']
                question = each["prompt"]
                ground_truth = each["completion"]
                image_path = f"./docvqa_1k/validation/{filename}"

                print(f"Processing {filename} ({i+1}/50)")

                # Get model prediction using traced function
                inferred_response = invoke_model(predictor, question, image_path)

                # Calculate ANLS score
                anls_score = anls_metric_single(inferred_response, ground_truth)
                anls_scores.append(anls_score)

                # Store result
                result = {
                    'filename': filename,
                    'ground_truth': ground_truth,
                    'inferred_response': inferred_response,
                    'anls_score': anls_score,
                }
                results.append(result)

                print(f" Ground Truth: {ground_truth}")
                print(f" Prediction: {inferred_response}")
                print(f" ANLS Score: {anls_score:.4f}")

            # Calculate average ANLS score
            avg_anls = sum(anls_scores) / len(anls_scores) if anls_scores else 0.0

            # Log metrics
            mlflow.log_metric("average_anls_score", avg_anls)

            # Save results to CSV
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            csv_filename = f"anls_validation_{model_name}_{timestamp}.csv"
            save_results_to_csv(results, csv_filename)

            # Log CSV as artifact
            mlflow.log_artifact(csv_filename)

            print(f"Results for {model_name}:")
            print(f" Average ANLS Score: {avg_anls:.4f}")

            mlflow.log_param("metric_type", "anls")
            mlflow.log_param("threshold", "0.5")

After orchestrating three end-to-end executions for our three dataset sizes, we review the ANLS metric results in MLflow. Using the comparison functionality, we note the highest ANLS score of 0.902 for the docvqa-10000 model, an increase of 4.9 percentage points relative to the base model (0.902 − 0.853 = 0.049).

Model          ANLS
docvqa-1000    0.886
docvqa-5000    0.894
docvqa-10000   0.902
Base Model     0.853

Clean Up

To avoid ongoing charges, delete the resources created during this walkthrough. This includes SageMaker endpoints and project resources such as the MLflow application, JupyterLab IDE, and domain.

Conclusion

Based on the preceding data, we observe a positive relationship between training dataset size and ANLS: the docvqa-10000 model delivered the best performance. We used MLflow for experimentation and visualization around our success metric. Further improvements in areas such as hyperparameter tuning and data enrichment could yield even better results. This walkthrough demonstrates how the Amazon SageMaker Unified Studio integration with S3 general purpose buckets helps streamline the path from unstructured data to production-ready ML models. Key benefits include:

- Simplified data discovery and cataloging through a unified interface
- More secure data access through S3 Access Grants without complex permission management
- Smooth collaboration between data producers and consumers across projects
- End-to-end experiment tracking with managed MLflow integration

Organizations can now use their existing S3 data assets more effectively for ML workloads while maintaining governance and security controls. The 4.9-percentage-point improvement from the base model to our best fine-tuned variant (0.853 to 0.902 ANLS) validates the approach for visual question answering tasks. For next steps, consider exploring additional dataset preprocessing techniques, experimenting with different model architectures available through SageMaker JumpStart, or scaling to larger datasets as your use case demands. See also Getting Started with Amazon SageMaker JumpStart and Data transformation workloads with SageMaker Processing. The solution code used for this blog post can be found in this GitHub repository.

About the authors

Hazim Qudah is an AI/ML Specialist Solutions Architect at Amazon Web Services. He enjoys helping customers build and adopt AI/ML solutions using AWS technologies and best practices. Prior to his role at AWS, he spent many years in technology consulting with customers across many industries and geographies. In his free time, he enjoys running and playing with his dogs Nala and Chai!
Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | NVIDIA Technical Blog nvidia_dev_blog 25.03.2026 16:35 0.718
Embedding sim. 0.8124
Entity overlap 0.0682
Title sim. 0.2298
Time proximity 0.9667
NLP type: other
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

In production Kubernetes environments, the mismatch between model requirements and GPU capacity creates inefficiencies. Lightweight automatic speech recognition (ASR) or text-to-speech (TTS) models may require only 10 GB of VRAM, yet occupy an entire GPU in standard Kubernetes deployments. Because the scheduler maps a model to one or more GPUs and can't easily share a single GPU across models, expensive compute resources often remain underutilized. Solving this isn't just about cost reduction—it's about optimizing cluster density to serve more concurrent users on the same world-class hardware. This guide details how to implement and benchmark GPU partitioning strategies, specifically NVIDIA Multi-Instance GPU (MIG) and time-slicing, to fully use compute resources. Using a production-grade voice AI pipeline as our testbed, we show how to combine models to maximize infrastructure ROI while maintaining >99% reliability and strict latency guarantees.

Addressing GPU resource fragmentation

By default, the NVIDIA Device Plugin for Kubernetes exposes GPUs as integer resources. A pod requests nvidia.com/gpu: 1, and the scheduler binds it to a physical device. Large language models (LLMs) like NVIDIA Nemotron, Llama 3, or Qwen 7B/8B require dedicated compute to maintain low time to first token (TTFT) and high batch throughput. However, support models in a generative AI pipeline—embedding models, ASR, TTS, or guardrails—often use only a fraction of a card. Running these lightweight models on dedicated GPUs results in:

- Low utilization: GPU compute utilization often hovers near 0-10%.
- Cluster bloat: More nodes need provisioning to run the same number of pods.
- Scaling friction: Adding a new capability requires a new physical GPU.

To solve this, we must break the 1:1 relationship between pods and GPUs.

Architecture: Partitioning strategies

We evaluated two primary strategies for GPU partitioning supported by the NVIDIA GPU Operator.

Software-based partitioning: Time-slicing and MPS

Time-slicing enables multiple NVIDIA CUDA processes to share a GPU by interleaving execution. It functions similarly to a CPU scheduler: context A runs, pauses, and context B runs.

- Mechanism: Software-level scheduling through the CUDA driver.
- Pros: Maximizes utilization. Enables "bursting"—if Pod A is idle, Pod B can use 100% of the GPU's compute cores.
- Cons: No hardware isolation. A memory overflow (OOM) in one pod may impact the shared execution context, and heavy compute in one pod can throttle neighbors (the "noisy neighbor" effect).

In addition to time-slicing, NVIDIA Multi-Process Service (MPS) offers an alternative software-based approach. MPS enables multiple processes to share GPU resources concurrently by using a server-client architecture. This provides more flexibility than MIG and is more resilient to certain issues like memory leaks compared to standard time-slicing. However, in production, both methods share a single execution context, limiting isolation. While modern MPS provides isolated virtual address spaces, it lacks hardware-level fault isolation. This means a fatal execution error or illegal memory access in one process will propagate across the shared context, potentially leading to a GPU reset that affects other processes sharing the card.

MIG: The hardware approach to partitioning

MIG physically partitions the GPU into separate instances, each with its own dedicated memory, cache, and streaming multiprocessors (SMs). To the OS and Kubernetes, these look like separate PCI devices.
- Mechanism: Hardware-level isolation.
- Pros: Strict quality of service (QoS). One workload can't impact the performance or memory stability of another.
- Cons: Rigid sizing. If a partition is idle, its compute resources can't be "borrowed" by a neighbor.

While time-slicing offers flexibility, MIG is preferred for production environments where strict hardware-level fault isolation is required to meet enterprise SLAs. Hardware partitioning ensures that a memory error in one model cannot cause a cascading failure across the shared GPU—a critical requirement for mission-critical Voice AI.

Experimental setup: The voice AI pipeline

Figure 3. Voice-to-voice AI workflow

To validate these strategies in a production-realistic scenario, we used a multimodal voice-to-voice AI pipeline. This workload is ideal for benchmarking because it mixes three distinct traffic patterns:

- ASR (streaming): Constant, low-compute stream processing with NVIDIA Parakeet 1.1B
- TTS (bursty): Idle for seconds, then spikes to 100% usage to generate audio in milliseconds with NVIDIA Magpie Multilingual
- LLM (heavy): High-compute, high memory usage, Llama-3.1-Nemotron-Nano-VL-8B-V1.

Before optimizing, it's critical to understand our latency profile. In our voice-to-voice pipeline, the LLM is the dominant bottleneck. Under heavy loads, the LLM accounts for ~9 seconds of the total processing time. This delay can fluctuate significantly based on context length; for instance, high-input scenarios (like training users) or growing conversation histories increase processing overhead compared to short prompts. As history accumulates, the LLM must process more tokens before generating a response, extending the bottleneck that support models must be masked behind. Consolidating support models like ASR and TTS provides a strategic path to maximize hardware utilization while maintaining end-to-end responsiveness. While consolidation may introduce a slight latency adjustment of 100-200 ms, the gains in infrastructure throughput and ROI are significant.

Our hypothesis

Consolidating ASR and TTS workloads on a single GPU preserves latency and jitter while freeing compute for additional LLM instances.

Experiment

Figure 4. Experimental topology configurations

We designed three distinct configurations for testing. In each round, we used three voice samples, waiting for the first response from LLM+TTS to complete. The setup used a Kubernetes cluster, models deployed using NVIDIA NIM, and managed by the NVIDIA NIM Operator. The worker node had access to three NVIDIA A100 Tensor Core GPUs.

Experiment 1: Baseline with three GPUs
- Setup: One dedicated GPU for each model (LLM, ASR, TTS).
- Goal: Establish the "gold standard" for latency and throughput against which to measure optimization.
- Resource: nvidia.com/gpu: 1 per pod.

Experiment 2: Time-slicing with two GPUs
- Setup: LLM retains a dedicated GPU. ASR and TTS share GPU 0 using software-level time-slicing.
- Goal: Test if dynamic scheduling can handle the "noisy neighbor" contention between streaming ASR and bursty TTS.
- Resource: nvidia.com/gpu: 1 (shared via replicas: 2).

Experiment 3: MIG partitioning with two GPUs
- Setup: LLM retains a dedicated GPU. GPU 0 is physically partitioned into two isolated instances.
- Goal: Test if hardware isolation provides better stability than software scheduling.
- Resource: nvidia.com/mig-3g.40gb: 1 per pod.

Configuration note: To achieve these topologies, we used specific configurations within the NVIDIA GPU Operator.
For Experiment 2, we used the timeSlicing configuration to advertise multiple replicas per physical GPU. For Experiment 3, we applied a custom mig-configs ConfigMap to partition the GPU into two 3g.40gb instances. (For the exact kubectl commands and YAML manifests used to reproduce this setup, please see the Implementation Appendix at the end of this post; a simplified sketch of these configurations also appears at the end of this article.)

Results

To evaluate resource fragmentation, we tested the system with two distinct traffic patterns:

- Light load: 5 concurrent users simulating ~135 seconds of sustained interaction.
- Heavy load: 50 concurrent users simulating ~375 seconds of sustained interaction.

Figure 5. Throughput comparison

Figure 5 compares generative AI inference throughput across traffic patterns. The data shows how partitioning affects processed requests as concurrency increases, across baseline (dedicated GPUs), time-slicing (software sharing), and MIG (hardware partitioning) under light and heavy loads. All experiments completed with a 100% success rate and no failures. The achievable requests per second is capped by the LLM bottleneck in the pipeline.

Mean latency metrics

Figure 6. Heavy load latency

Figure 7. Light load latency

The following analysis evaluates how different GPU partitioning strategies impact overall system efficiency and responsiveness.

Throughput compared to latency

Consolidating ASR and TTS workloads onto a single GPU results in an optimized pipeline, enabling the cluster to support more simultaneous AI streams. However, our benchmarks reveal a critical performance divergence between the two partitioning strategies:

MIG (hardware): Highest efficiency. Experiment 3 achieved the highest per-unit productivity, reaching ~1.00 req/s per GPU. By providing dedicated hardware paths for each instance, MIG eliminates resource contention. Organizations can achieve near-full system capacity while effectively freeing up an entire GPU for other heavy LLM workloads.

Time-slicing (software): Higher density with overhead. Experiment 2 showed that software-level sharing can also improve per-GPU density compared to the baseline, achieving ~0.76 req/s per GPU. However, the CUDA driver's management of rapid context switches between streaming and bursty models introduces scheduling overhead. While functional, this software-based approach doesn't reach the aggregate throughput efficiency provided by hardware partitioning.

Latency and the bursty factor

Time-slicing handles individual bursty tasks slightly faster, with a mean TTS latency of 144.7 ms compared to MIG's 168.2 ms. However, this 23.5 ms difference represents a small fraction of the total end-to-end pipeline response time at the present scale. Under heavy load, the LLM accounts for the vast majority of the total interaction time. Because the end user cannot perceive a 20 ms delta within a multi-second response, the throughput stability of MIG is a more valuable production metric.

Recommendations for partitioning

Based on the benchmark data, we recommend the following decision matrix:

Default to MIG for production scale and stability. Experiment 3 showed that MIG handles higher request volumes (2 req/s) with only a minor latency trade-off. Strict hardware-level fault isolation prevents a memory overflow in one process from crashing the other. Best for production environments where throughput and 100% reliability are the priorities.

Use time-slicing for development or low-concurrency apps. This involves a 32% reduction in total throughput and shared-resource dependencies.
Best for development, CI/CD, and PoCs to run a full pipeline on a minimal hardware footprint.

Get started

- Experiment further: Try the repository.
- Implement partitioning: Use our Implementation Guide to configure MIG profiles and the provided YAML manifests to eliminate resource fragmentation in your cluster.
- Scale with NIM: Deploy NVIDIA NIM pipelines to fully utilize your ASR, TTS, and LLM workloads for maximum ROI.

About the authors

Sagar Desai is a generative AI solutions architect at NVIDIA, specializing in the design and deployment of large-scale machine learning systems. With deep expertise in optimizing model performance for both training and inference workloads, Sagar brings a strong foundation in MLOps and LLMOps. Sagar architects resilient infrastructure using distributed computing and orchestration principles to deliver high-impact, enterprise-grade AI solutions.

Adi Margolin is the product manager for Riva SDK and Speech NIM. With 16 years of product management experience, Adi has built high-impact speech technology solutions across enterprise software companies, developing expertise in bringing ASR and TTS innovations to market. Based in San Jose, Adi brings a unique perspective to speech AI development, having successfully navigated the transition from legacy systems to modern AI-driven platforms while addressing complex requirements of real-time media applications.
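The Implementation Appendix referenced above is not included in this excerpt. As a rough sketch of the kind of configuration involved, the following Python snippet uses the kubernetes client to create two ConfigMaps: one resembling a device-plugin time-slicing config that advertises two replicas per GPU, and one resembling a custom MIG partitioning config with two 3g.40gb instances. The ConfigMap names, keys, and namespace here are assumptions; the exact values the NVIDIA GPU Operator expects depend on its version and your Helm settings, so treat this as an illustration rather than the post's actual manifests.

from kubernetes import client, config

# Assumed payloads; verify key names against your GPU Operator version
TIME_SLICING_CONFIG = """\
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2        # advertise 2 schedulable replicas per physical GPU
"""

MIG_CONFIG = """\
version: v1
mig-configs:
  two-3g-40gb:
    - devices: [0]         # apply to GPU 0 only
      mig-enabled: true
      mig-devices:
        "3g.40gb": 2       # two isolated 3g.40gb instances
"""

def create_config_map(api: client.CoreV1Api, name: str, key: str, payload: str,
                      namespace: str = "gpu-operator") -> None:
    # Wrap the raw YAML payload in a ConfigMap for the GPU Operator to consume
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=name, namespace=namespace),
        data={key: payload},
    )
    api.create_namespaced_config_map(namespace=namespace, body=cm)

if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    create_config_map(core, "time-slicing-config", "any", TIME_SLICING_CONFIG)
    create_config_map(core, "custom-mig-parted-config", "config.yaml", MIG_CONFIG)

In practice you would also point the GPU Operator at these ConfigMaps through its device plugin and MIG manager configuration options, and label the target nodes accordingly; consult the Implementation Guide linked above for the authoritative steps.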
NVIDIA DSX Air Boosts Time to Token With Accelerated Simulation for AI Factories nvidia_blog 16.03.2026 20:00 0.716
Embedding sim. 0.8136
Entity overlap 0.0328
Title sim. 0.2014
Time proximity 0.997
NLP type: product_launch
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country: united states

Open original

NVIDIA DSX Air Boosts Time to Token With Accelerated Simulation for AI Factories

Used by CoreWeave and others, the new platform enables enterprises to simulate complex deployments through validated reference architectures for compute, networking, storage, orchestration and security — before a single server is unboxed.

March 16, 2026 by Scott Martin

Setting up AI factories in simulation — decreasing deployment time from months to days — is accelerating the next industrial revolution. Nowhere was that more apparent than at GTC 2026, in San Jose, where NVIDIA founder and CEO Jensen Huang introduced NVIDIA DSX Air.

Part of NVIDIA DSX Sim in the DSX platform, NVIDIA's blueprint for AI factories, DSX Air is a software-as-a-service platform for logically simulating AI factories. It delivers high‑fidelity digital simulations of NVIDIA hardware infrastructure, including GPUs, SuperNICs, DPUs and switches, and it integrates with leading partner solutions for storage and routing, security, orchestration and more via open, API-based connectivity. NVIDIA DSX Air enables a complete AI factory ecosystem, uniting NVIDIA infrastructure with partner technologies to deliver full‑stack simulation and accelerate complex AI deployments.

Companies building some of the world's most advanced AI infrastructure, including CoreWeave, are already using DSX Air to simulate and validate their environments long before hardware reaches the loading dock. The development underscores a new reality: simulation is now essential to accelerating AI deployment at scale. DSX Air allows organizations to construct a full digital twin of their AI factory — compute, networking, storage, orchestration and security — before a single server is unboxed. By shifting integration and troubleshooting into simulation, customers are reducing the time to first token from weeks or months to mere days or hours, saving enormous amounts of time and costs.

An industry analogy for this AI factory simulation phenomenon explains it well: It's like IT mirroring your laptop to set up a new one, except the "laptop" is a hyperscale AI factory and the "mirroring" is a complete, high‑fidelity replica of the production environment. For operators racing to bring new AI capacity online, this change is transformative.

Building a Platform for an Entire Ecosystem

The NVIDIA DSX Air simulation platform is designed to support the entire AI factory ecosystem. Server manufacturers, orchestration vendors, storage providers and security partners can all validate their offerings alongside NVIDIA infrastructure — together, in one environment, at scale. This ecosystem‑wide capability is already reshaping partner workflows. Server manufacturers, which serve as the primary channel for enterprise inference, can now model and validate their reference architectures without building expensive physical labs. Enterprise AI environments rarely fit rigid designs, and customers often require bespoke configurations. With DSX Air, manufacturers can create digital twins tailored to specific customer needs, test their software stacks and deliver validated solutions without touching hardware. Orchestration vendors — critical for enterprises and tier‑2 clouds that need turnkey AI services — gain the ability to test at scale.
At GTC, NVIDIA showcased a multi‑tenant RTX PRO Server environment running entirely in simulation, with Netris providing network orchestration, Rafay handling host orchestration and NVIDIA Run:ai optimizing GPU allocation. These partners can now validate complex workflows under realistic conditions without deploying physical clusters.

The simulation environment is also valuable for validating the data platforms that power AI factories. Instead of requiring large physical clusters, DSX Air allows ecosystem partners to model complete AI workflows alongside NVIDIA compute, networking and software infrastructure. At GTC, the booth demonstration featured a video retrieval-augmented generation workload running on the VAST AI Operating System, including a fully operational VAST cluster with DataEngine nodes and the video search and summarization front end. DataEngine triggers and functions process and index video content through an end-to-end pipeline, illustrating how AI applications can be designed, tested and validated inside the DSX Air simulation before deploying physical infrastructure.

Security vendors — facing some of the most demanding validation requirements — can now test multi‑tenant policies, DPU‑accelerated isolation and threat detection in a realistic environment. The GTC demo includes Check Point's distributed firewall running on simulated BlueField DPUs, TrendAI Vision One for threat detection and Keysight AI Inference Builder, an emulation and analytics platform designed to validate inference-optimized AI infrastructure at scale. Security partners can identify vulnerabilities and validate policies in a customer's digital twin long before production goes live. Across the ecosystem, partners emphasized the same point: DSX Air gives them a complete, scalable and cost‑effective way to validate their solutions with NVIDIA infrastructure and with each other.

Operating With a New Model to Accelerate Time to Token

NVIDIA DSX Air isn't just a deployment accelerator — it introduces a new operational model for AI factories. On the first day, customers build their intended production environment entirely in simulation. They configure networking, compute, storage, orchestration, security and scheduling exactly as they plan to deploy them. They validate that everything works together, identify issues early and ensure the environment behaves as expected. Next, they can deploy with confidence. Because the environment has already been tested end to end, the probability of a smooth bring‑up increases dramatically. Time to first token shrinks, and teams can focus on running workloads rather than troubleshooting infrastructure. Afterward and beyond, DSX Air becomes a safe environment for change management. Long‑lived simulations allow customers to test upgrades, rehearse maintenance windows, validate patches and predict operational impact before touching production. Only after changes succeed in simulation are they applied to the live environment, maximizing uptime and ensuring infrastructure availability. This lifecycle approach reflects how modern AI factories can operate as they scale.

Simulating AI Factories Becomes the Backbone of AI Infrastructure

GTC showed that simulation is no longer a future concept — it is the new backbone of AI infrastructure deployment and operations. NVIDIA DSX Air enables customers and partners to simulate everything in one place, accelerating deployment, reducing risk and ensuring day‑one performance at scale.
Adopting NVIDIA DSX Air to Accelerate Deployments With Simulation

Siam.AI, Thailand's largest AI cloud provider, has accelerated its infrastructure deployment with NVIDIA DSX Air. Using simulation, Siam.AI embraced NVIDIA best practices well ahead of schedule, ensuring day-one operational expertise and validating their architecture in a virtual environment before the physical hardware even arrived.

Similarly, Hydra Host is using DSX Air to accelerate development of Brokkr, its AI factory operating system for bare-metal GPU provisioning that's used by dozens of GPU deployments globally. By simulating full-stack environments in DSX Air before deploying to production, Hydra Host can validate Brokkr's automation and orchestration workflows across diverse networking and hardware configurations at scale. This simulation-first approach lets Hydra Host ship validated infrastructure faster to customers worldwide while minimizing risk to live systems as global AI demand grows.

As AI factories grow in size and complexity, the ability to validate full‑stack environments before hardware arrives will define the pace of innovation. NVIDIA DSX Air delivers that capability today, giving organizations the fastest possible path to first token and a more reliable way to operate AI infrastructure over time. Learn more about NVIDIA DSX Air.
Researchers: AI isn't killing jobs, it's 'unbundling' them the_register_ai 24.03.2026 18:32 0.715
Embedding sim. 0.8303
Entity overlap 0
Title sim. 0.1707
Time proximity 0.8917
NLP type: scientific_publication
NLP organization: London School of Economics
NLP topic: ai adoption
NLP country: United Kingdom

Open original

AI isn't killing jobs, it's 'unbundling' them into lower-paid chunks

Paper argues the real impact isn't job loss but narrowing human work and pay

Carly Page Tue 24 Mar 2026 // 18:32 UTC

AI isn't killing jobs wholesale – it's quietly chipping away at them, one task at a time. That's the gist of a new research paper making the rounds, which pushes back on the idea that more AI exposure automatically means fewer jobs. The authors argue the real question isn't how many tasks a model can do, but whether those tasks can actually be split out without breaking the role.

Analysts have long warned that automation could wipe out millions of jobs. One recent forecast put the number at 10.4 million US jobs gone by 2030, roughly 6 percent of the workforce. The implicit assumption behind those numbers is straightforward: if AI can do enough of what you do, you're toast.

This new paper – written by Luis Garicano, professor at the London School of Economics, along with Jin Li and Yanhui Wu, both at the University of Hong Kong – suggests it's not that simple. Jobs, it argues, aren't neat lists of tasks – they're bundles. Radiologists, for example, don't just read scans. They interpret edge cases, talk to clinicians, and sign off on decisions people act on. Replace the image-reading bit, and you haven't necessarily replaced the job.

That's where the authors draw the line between what they call "weak bundles" and "strong bundles." Weak ones can be split apart without much fuss, but strong ones can't without losing value. "In weak-bundle occupations, AI automates some tasks and narrows the boundary of the job… In strong-bundle occupations… AI improves performance inside the job, but does not remove the human from the bundle," the authors argue.

In weak-bundle jobs – think churning through support tickets or knocking out predictable bits of code – AI doesn't just replace a task; it reshapes the job. The human is left doing whatever the machine can't, often a narrower slice of the original role.

Sounds like a win on paper. In reality, not so much. Once AI takes over part of the work, the human stops dividing their time. They go all-in on what remains, which means output per worker jumps, prices fall, and suddenly you don't need as many workers as before. In other words, the hit to employment doesn't come from AI doing the job outright, but from humans becoming too efficient at the leftovers.

It also squares with what we're seeing so far. AI is reshaping jobs, not wiping them out. Tasks move around, productivity may go up, yet employment and hours haven't shifted much – at least yet. In many cases, the bundle is still holding. It also explains why the doom predictions and the techno-optimism can both be right at the same time. If you're in a strong-bundle job – something heavy on judgment, context, or responsibility – AI is more likely to make you faster and better paid. If you're in a weak one, it may quietly hollow out your role until there's not much left to defend.
How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 20:30 0.714
Embedding sim. 0.8023
Entity overlap 0.0227
Title sim. 0.2636
Time proximity 0.9972
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools. Deploying these models and workflows in production environments requires distributing them across multiple GPU nodes, which demands careful orchestration and coordination across GPUs. NVIDIA Dynamo 1.0—available now—addresses these problems by accelerating generative AI and reasoning models in large-scale distributed environments. The AI framework delivers low-latency, high-throughput, distributed inference for production-grade multi-node AI deployments. Dynamo supports leading open source inference engines, including SGLang, NVIDIA TensorRT LLM, and vLLM. It also has delivered strong results in trusted third-party benchmarks such as MLPerf and SemiAnalysis InferenceX , reinforcing its position as a production-grade inference platform. Dynamo can boost the number of requests served by up to 7x on NVIDIA Blackwell, as demonstrated in the recent SemiAnalysis InferenceX benchmark. Figure 1. NVIDIA Dynamo boosts performance by 7x with disaggregated serving when combined with wide expert parallel on NVIDIA GB200 NVL72. SemiAnalysis InferenceX, updated March 3, 2026. Results for DeepSeek R1-0528, FP4, 1k/1k, interactivity: ~50 tok/sec/user. This blog details how early adopters have integrated Dynamo into real-world inference workflows, the system level performance improvements achieved, and the latest features and optimizations added to the framework. Early adopters and real-world impact At last year’s GTC event, NVIDIA introduced NVIDIA Dynamo , a low-latency and high-throughput, distributed inference framework built for multinode AI deployments. Since then, NVIDIA has worked collaboratively with the open source ecosystem to harden Dynamo for production-grade performance and large-scale workloads. Over this period, Dynamo has achieved significant milestones: Successfully deployed in production workflows: AstraZeneca, Baseten , ByteDance, CoreWeave , Crusoe , DigitalOcean , Gcore , GMI Cloud , Nebius , Meituan , Pinterest , Prime Intellect, Rednote, SoftBank Corp., Tencent Cloud , Together AI , Vultr , and many more have deployed Dynamo in production to scale multi-node inference, optimize throughput, and improve latency. Watch Dynamo Day recordings to hear directly from organizations deploying Dynamo. Integrated into managed Kubernetes environments: Alibaba Cloud , Amazon Web Services (AWS) , Google Cloud , Microsoft Azure , and Oracle Cloud Infrastructure (OCI) have built integrations showing how Dynamo can be seamlessly deployed into their managed Kubernetes environments, scaling inference to meet the growing demand for AI.  Adopted by major open source frameworks: Modular Dynamo components such as NIXL have been widely adopted by inference engines including llm-d , NVIDIA TensorRT LLM , SGLang , and vLLM to accelerate KV cache transfers between GPUs. LMCache has integrated its KV caching directly into storage solutions in Dynamo, SGLang has integrated its HiCache solution into Dynamo’s Router, and LangChain has built an integration that injects agentic hints for Dynamo’s Router, validating its composable architecture. Inspired contributions from across the AI ecosystem: Developers across the AI community have contributed to Dynamo and broadened its capabilities. 
Mooncake and Alibaba extended the Dynamo AIConfigurator with SGLang support; Microsoft tested and hardened Dynamo on Azure Kubernetes Service (AKS), contributing fixes, deployment guides , public demos , and Planner/AIConfigurator enhancements ; Prime Intellect co‑designed and integrated LoRA adapter support; and Baseten validated early Dynamo features in production‑like environments, then upstreamed bug fixes and hardening patches. Enabled integration with storage solutions : Cloudian, DDN , Dell , Everpure (previously Pure Storage), HPE , IBM , NetApp , VAST , and WEKA have integrated Dynamo into their AI solutions. That allows inference workloads to scale beyond GPU memory constraints to support very large context lengths with storage. Dynamo 1.0 builds on these milestones while marking the framework’s maturity and production readiness. Keep reading for more highlights about the update. Accelerating agentic inference by 4x with Dynamo and NVIDIA NeMo Agent Toolkit Today’s inference runtimes treat every request and KV cache block the same—a system prompt reused across many turns has the same eviction priority as a one-off chain-of-thought. Multi-turn agents, however, reuse prefixes and follow predictable patterns. An evicted multi-turn KV block will need to be recomputed, resulting in wasted compute and higher inference costs. Dynamo addresses this gap with new agentic inference optimizations: Dynamo frontend API: Accepts agent hints (per-request metadata such as latency sensitivity, expected output length, and cache control) and passes them to the router and KV cache manager. Dynamo KV-aware router: Uses priority and latency agentic hints to control queue ordering so user-facing turns run before background work. It can take in expected output sequence length (OSL) to improve load-balancing accuracy. Dynamo KV cache manager: Supports experimental cache pinning. Pinned nodes resist eviction for the specified duration, and are moved to host memory rather than being deleted. The community has built on these optimizations to create custom routing and integrate agent hints into popular frameworks such as LangChain’s ChatNVIDIADynamo and the NVIDIA NeMo Agent Toolkit . Running Dynamo and the NeMo Agent Toolkit demonstrated up to 4x lower TTFT and 1.5x higher throughput when running the Llama 3.1 model on NVIDIA Hopper. Figure 2. How agent hints and predictive metadata drive routing and caching. Advancing multimodal inference optimization Dynamo 1.0 introduces three new features designed to accelerate multimodal inference in image-heavy workloads—where image encoding can be a bottleneck: Disaggregated encode/prefill/decode (E/P/D): Instead of running E/P/D on the same GPU, Dynamo separates them into distinct stages with independent scaling. Running the encode phase on dedicated workers allows for independent scaling, which improves batching, memory efficiency, and overall throughput. Multimodal embedding cache: A CPU-backed least recently used (LRU) cache stores computed image embeddings off-GPU so repeated images skip encoding entirely. This applies to both disaggregated and aggregated setups. Multimodal KV routing: Multimodal KV routing extends Dynamo’s KV-aware router to account for image content. A dedicated multimodal router downloads images then selects the backend worker with the highest cache overlap, including overlap on blocks containing images. 
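To make the CPU-backed embedding cache idea concrete, the following is a small, framework-agnostic sketch of an LRU cache keyed on raw image bytes. It illustrates the general technique only; it is not Dynamo's actual API or implementation, and the class and function names are hypothetical.

import hashlib
from collections import OrderedDict
from typing import Any, Callable

class EmbeddingLRUCache:
    """Keep recently used image embeddings in host (CPU) memory."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store: OrderedDict[str, Any] = OrderedDict()

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        # Hash the raw image so identical images map to the same cache entry
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, encode_fn: Callable[[bytes], Any]) -> Any:
        key = self._key(image_bytes)
        if key in self._store:
            # Cache hit: skip the expensive encode and refresh recency
            self._store.move_to_end(key)
            return self._store[key]
        # Cache miss: run the (expensive) encoder, insert, then evict if over capacity
        embedding = encode_fn(image_bytes)
        self._store[key] = embedding
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return embedding

In a real serving stack, the encode function would run on the dedicated encode workers described above, and the cache would be sized against available host memory so repeated images skip GPU encoding entirely.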
Running the Qwen3-VL-30B-A3B-Instruct-FP8 multimodal model on NVIDIA GB200, Dynamo’s embedding cache accelerated time to first token (TTFT) by up to 30% and throughput by up to 25% on image requests. Figure 3. A CPU cache reuses previously computed image embeddings so repeated images skip GPU encoding, cutting compute and latency. Adding native support for video generation New video-generation models are setting a new bar for cinematic quality and motion realism. But serving them efficiently is non-trivial: Their inference workloads are compute- and memory-intensive, especially at high resolutions. Dynamo 1.0 adds native support for video-generation models, with integrations for leading open source inference frameworks such as FastVideo , SGLang Diffusion , TensorRT LLM Diffusion, and vLLM-Omni. This brings Dynamo’s modular stack—including its low-overhead front end, streaming capabilities, and high-efficiency scheduling engine—to modern video workloads. This integration demonstrates that state‑of‑the‑art video generation can be delivered efficiently on Dynamo. For a step‑by‑step walkthrough of how to deploy video generation models with Dynamo, check out this how‑to guide . Video 1. Generating a 5-second video in ~40 seconds on a single NVIDIA Hopper GPU using Wan2.1 and SGLang Diffusion running on NVIDIA Dynamo. Accelerating inference startup by 7x with Dynamo ModelExpress Modern inference clusters are constantly spinning new replicas up and down in response to traffic. Each new process has to repeat the same heavy startup pipeline: Downloading model checkpoints Loading weights from remote or shared storage Applying model optimizations Compiling kernels Building NVIDIA CUDA graphs To solve that challenge, Dynamo ensures that the expensive parts of worker startup are done once and reused many times through two new ModelExpress capabilities: Checkpoint restore: Instead of treating every replica as a fresh boot, Dynamo runs the full initialization sequence a single time, captures the “ready‑to‑serve” state to persistent storage, and then brings new replicas online by restoring from that checkpoint rather than rebuilding everything from scratch. Model weight streaming: Rather than having each new worker independently download model weights, write them to local or shared storage, and then load them into GPU memory, ModelExpress loads the model once on an initial worker and streams the weights to additional workers over high-bandwidth interconnects using NVIDIA Inference Xfer Library (NIXL) and NVIDIA NVLink, eliminating reliance on storage bandwidth. Figure 4. A worker downloads model weights once and streams them directly into other GPUs over high-bandwidth links, avoiding repeated disk downloads. For large models, especially in fleets that scale aggressively, model weight streaming can accelerate model loading time by up to 7x for large MoE models like DeepSeek v3 on NVIDIA H200. Scaling Kubernetes on NVIDIA GB300 NVL72 NVIDIA Grove , an open source API that’s part of Dynamo, simplifies deploying hierarchical gang-scheduled, topology‑aware AI workloads on Kubernetes . In Dynamo 1.0, Grove adds setup automation for NVIDIA NVLink fabric on rack‑scale systems such as NVIDIA GB300 NVL72. That allows users to define placement policies across every layer of infrastructure—from cloud regions and availability zones down to data centers, network blocks, racks, hosts, and even non-uniform memory access (NUMA) nodes. Figure 5. 
Grove orchestrates disaggregated inference components together with advanced AI schedulers on NVIDIA GB300 NVL72 and scale-out GPU clusters.

Traditionally, using the NVIDIA GB300 NVL72 NVLink fabric required users to manually define and manage compute domains. This release introduces a unified topology API that enables developers to seamlessly colocate prefill and decode on the same NVIDIA NVL72 rack to optimize KV cache transfers, confine an inference stack to a single data center for latency needs, and place frontend services on nearby CPU‑only nodes for efficient request handling. Grove integrates with advanced AI schedulers, like KAI scheduler, to ensure these constraints are enforced.

Integration with the Kubernetes Inference Gateway

A previous Dynamo release introduced a plugin that allows users to combine the Kubernetes-native Inference Gateway extension routing and Dynamo's KV-aware router.

Figure 6. The NVIDIA Dynamo KV-aware router plugin, integrated into the Inference Gateway's endpoint picker, intelligently routes requests across the inference pool of Dynamo Servers.

In a typical Dynamo setup, routing is handled by Dynamo's KV-aware router. The router evaluates worker queue depth and relevant KV cache information on each worker, then makes a probabilistic decision using a weighted combination of these factors. Dynamo's KV-aware router can run inside the Inference Gateway to benefit from integration with routing plugins, filters, and other gateway capabilities in Kubernetes-based environments.

Deploying fast, latency-aware inference with zero configuration

Deploying large models requires deep expertise that balances latency, throughput, and cost targets through complex scaling and configuration steps. Dynamo's new Dynamo Graph Deployment Request (DGDR) removes that friction by providing a simple, one‑step path from service‑level objectives (SLOs) to optimized inference deployments. DGDR combines the intelligence of the planner and AIConfigurator into a unified, Kubernetes‑native deployment flow. Instead of navigating multiple tools, scripts, and guesswork, developers can now specify a model, target hardware, and traffic goals in a YAML file—soon, through an intuitive web UI—and Dynamo handles the rest. Behind the scenes, the AIConfigurator runs rapid, simulation‑based recommendations for quick iteration, while the planner engages in deeper on‑cluster profiling for precise, production‑grade optimization. Both routes deliver an auto-deployable Dynamo Graph Deployment (DGD) that meets the user's desired cost, performance, and scalability balance, without having to hand-configure the deployment.

Video 2. Watch zero-config deploy, generate, and launch an optimized inference cluster directly from SLO inputs—automating scaling, profiling, and configuration.

Increasing resiliency with fault detection and request migration

A key design principle in Dynamo is to be resilient by default so applications keep running even when individual workers fail or hang. The updated Dynamo fault tolerance combines two pillars:

Early fault detection: Dynamo adds a framework-independent "canary health check" that probes workers on a configurable schedule. If these checks do not receive a valid response, the worker is marked unhealthy and is removed from routing. Additionally, the Dynamo frontend also performs active detection using network-level signals.
If establishing a new stream to a worker fails, or an existing stream ends unexpectedly mid-request, that worker is immediately removed from the set of active workers (for about five seconds) so no new requests are sent to it. Request cancellation and migration: Request cancellation support is enabled out-of-the-box, allowing in-flight work to be terminated when it no longer makes sense to continue. When a worker becomes unavailable, Dynamo can migrate affected requests to another worker and resume processing, preserving the request itself rather than forcing the client to resubmit from scratch. This ensures failures do not automatically translate into user-visible errors. With Dynamo’s new layered health detection combined with cancellation and migration , Dynamo aims to keep LLM applications responsive even when individual workers fail. Figure 7. Early fault detection and request migration in NVIDIA Dynamo, showing canary and network health checks marking workers unhealthy, canceling in‑flight work, and transparently rerouting requests to healthy workers. Advancing KV caching to storage In Dynamo 1.0, KV Block Manager (KVBM) introduces several features that enhance flexibility, visibility, and deployment options: Object storage support: KVBM now works with the Amazon Simple Storage Service (S3) and Azure-style blob APIs used by major storage vendors and cloud providers. This allows model operators to integrate KVBM with existing file systems, S3, or other cloud object stores without building separate KV offload pipelines for each backend. Global KV event emission: KVBM emits events whenever KV blocks move between storage tiers (GPU memory, CPU memory, local SSD, and remote storage) or are evicted. The KV router’s indexer consumes these events to maintain a consistent, cluster-wide view of KV block locations, enabling smarter routing and improved cache reuse across multiple model replicas and inference engines. Pip-installable module: KVBM can now be installed directly into inference engines like vLLM or TensorRT LLM without requiring the complete Dynamo stack. Teams using different inference frameworks can share a common KV offload tool rather than re-implementing eviction policies and storage integrations. Figure 8. NVIDIA Dynamo intelligently manages KV cache blocks across the different memory tiers to avoid KV cache recomputation and accelerate long context inference Looking ahead Looking forward, the Dynamo product roadmap will focus on expanding multimodal capabilities to support richer and more context-aware interactions, advancing diffusion-based models to unlock real-time higher quality video-generation capabilities, and scaling agentic workloads and reinforcement learning. Dynamo is being built in the open with the community . To get involved, explore the code and issues in the NVIDIA GitHub repository , drop into the biweekly Dynamo office hours , and dive into the existing technical blogs . ​​Acknowledgments Akshatha Kamath, Anish Maddipoti, Anna Tchernych, Ben Hamm, Biswa Ranjan Panda, Dhruv Nandakumar, Ekin Karabulut, Ganesh Kudleppanavar, Hannah Simmons, Hannah Zhang, Harry Kim, Hongkuan Zhou, Hyunjae Woo, Ishan Dhanani, Itay Neeman, Jacky Hui, Jakub Kosek, John Kim, Kavin Krishnan, Kyle Kranen, Maksim Khadkevich, Michael Demoret, Moein Khazraee, Neal Vaidya, Neelay Shah, Qi Wang, Ryan McCormick, Sanjay Chatterjee, Schwinn Saereesitthipitak, Suman Tatiraju, Vikram Sharma Mailthody, Vishwanath Venkatesan, and many others contributed to this post. 
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety | NVIDIA Technical Blog nvidia_dev_blog 24.03.2026 16:00 0.712
Embedding sim.0.8164
Entity overlap0.0154
Title sim.0.2727
Time proximity0.8333
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаagentic ai
NLP страна

Открыть оригинал

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that can understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities. At GTC 2026, NVIDIA introduced a new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack: NVIDIA Nemotron 3 Super for long-context reasoning and agentic tasks NVIDIA Nemotron 3 Ultra (coming soon) for highest reasoning accuracy and efficiency among open frontier models NVIDIA Nemotron 3 Content Safety for multimodal, multilingual content moderation NVIDIA Nemotron 3 VoiceChat (in early access) for low-latency, natural, full-duplex voice interactions NVIDIA Nemotron 3 Nano Omni (coming soon) for enterprise-grade multimodal understanding NVIDIA Nemotron RAG for generating embeddings for image and text modalities with NVIDIA Llama Nemotron Embed VL and for reordering image-or-text candidates when relevance depends on visual content with NVIDIA Llama Nemotron Rerank VL Together with open data, training recipes, and NVIDIA NeMo tools, the Nemotron family of models provides an end-to-end toolkit to build, evaluate, and optimize production-grade agentic AI systems. This blog explores the latest Nemotron 3 models, their performance, and how developers can use them to build scalable, multimodal, and real-time AI agents. Power multi-agent systems with NVIDIA Nemotron 3 Super Multi-agent systems suffer from "context explosion," with massive token histories 15x those of standard chat, and a "thinking tax" from chain-of-thought reasoning on every decision. NVIDIA Nemotron 3 Super is an open hybrid mixture-of-experts (MoE) model that activates just 12B parameters per pass, delivering high accuracy and efficiency for a fraction of the compute. A hybrid architecture with Mamba and Transformer layers, multi‑token prediction, and NVFP4 precision on NVIDIA Blackwell GPUs delivers up to 5x higher throughput than the previous generation while reducing memory footprint and cost. A configurable "thinking budget" lets developers bound chain‑of‑thought to keep latency and spend predictable, even for continuous agent workloads. With a 1M-token context window and reinforcement learning across 10+ environments, Nemotron 3 Super excels at coding, math, instruction following, and function-calling, making it ideal for multi-agent applications—with significantly higher throughput on Blackwell when running in NVFP4. Figure 1. Nemotron 3 Super delivers top-tier intelligence while leading in throughput per GPU in the most attractive efficiency quadrant from Artificial Analysis. Nemotron 3 Super uses latent MoE to call four expert specialists for the inference cost of only one, compressing tokens before they reach the experts (a toy sketch of this idea appears below). External evaluations back this up. On the Artificial Analysis Intelligence Index for open‑weight models under 250B parameters, Nemotron 3 Super NVFP4 ranks among the top models, matching the highest intelligence scores from leading alternatives. Figure 2. Nemotron 3 Super ranks among the top open-weight models under 250B parameters on the Artificial Analysis Intelligence Index. In the intelligence‑versus‑efficiency plot, Nemotron 3 Super lands in the most attractive upper‑right quadrant—combining strong task performance with high output throughput per GPU—making it a compelling choice for cost‑sensitive production agents.
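The latent MoE idea mentioned above can be illustrated with a toy PyTorch module: tokens are first compressed into a smaller latent space, a router picks an expert per token there, and only afterward is the result projected back to full model width. This is a conceptual sketch under assumed dimensions and top-1 routing; it is not the actual Nemotron 3 Super architecture.

# Toy sketch of the "latent MoE" idea; NOT the Nemotron 3 architecture. Dimensions,
# expert count, and top-1 routing are illustrative assumptions: experts operate in a
# compressed latent space, so several specialists cost roughly one full-width FFN.
import torch
import torch.nn as nn

class ToyLatentMoE(nn.Module):
    def __init__(self, d_model=1024, d_latent=256, n_experts=4):
        super().__init__()
        self.compress = nn.Linear(d_model, d_latent)   # shrink tokens before routing
        self.router = nn.Linear(d_latent, n_experts)   # pick an expert in latent space
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_latent, d_latent), nn.GELU(),
                          nn.Linear(d_latent, d_latent))
            for _ in range(n_experts)
        ])
        self.expand = nn.Linear(d_latent, d_model)      # project back to model width

    def forward(self, x):                               # x: (batch, seq, d_model)
        z = self.compress(x)
        gates = torch.softmax(self.router(z), dim=-1)   # (batch, seq, n_experts)
        top_w, top_i = gates.max(dim=-1)                # top-1 routing per token
        out = torch.zeros_like(z)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                           # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(z[mask])
        return self.expand(out)

x = torch.randn(2, 8, 1024)
print(ToyLatentMoE()(x).shape)   # torch.Size([2, 8, 1024])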
Nemotron 3 Super—with open weights, open training data, and open development recipes—is ideal for software development, deep research, cybersecurity, and the financial services industry. Keep agents safe with Nemotron 3 Content Safety As agents expand from text‑only to multimodal workflows, safety guardrails must evolve across inputs, retrieval, and outputs. They must also apply to use cases like enterprise copilots and user-generated content (think dating apps or social media), detect prompt injection in agentic systems, and handle sensitive domains such as healthcare, where self-harm is a concern. Nemotron 3 Content Safety is a compact 4B‑parameter multimodal safety model that detects unsafe or sensitive content across text and images. Built on the Gemma‑3‑4B backbone with an adapter‑based classification head, it delivers high‑accuracy safety classification at low latency that's ideal for production agentic pipelines. It fuses visual and language features to produce a simple safe/unsafe decision, with optional granular category labels. A quick keyword toggle lets developers choose between fast binary classification and full taxonomy reporting, supporting both low‑latency paths and deeper inspection. On a suite of multimodal, multilingual safety benchmarks, Nemotron 3 Content Safety reaches approximately 84% accuracy, outperforming alternative safety models across the same tasks while keeping latency low enough for in‑line moderation in production pipelines. Figure 3. Model accuracy vs. alternative safety models on multimodal, multilingual harmful‑content benchmarks. The model uses the same 23‑category taxonomy as Aegis 1–3, covering classes such as hate, harassment, violence, sexual content, plagiarism, and unauthorized advice. Trained on high‑quality Aegis datasets and human‑annotated real‑world images—rather than primarily synthetic data—the model performs strongly across multimodal benchmarks in its 12 supported languages, with solid zero‑shot generalization beyond them. Natural conversations with Nemotron 3 VoiceChat Traditional voice AI relies on cascaded pipelines—automatic speech recognition (ASR), a large language model (LLM), and text-to-speech (TTS)—all of which introduce latency, complexity, and multiple points of failure. Nemotron 3 VoiceChat is a 12B-parameter end-to-end speech model for full-duplex, real-time conversational AI, currently in early access. Unlike cascaded stacks, VoiceChat directly analyzes audio input and generates audio output in a unified, streaming LLM architecture. Using this single model eliminates multi-model orchestration. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech (Parakeet encoder) and a TTS decoder, VoiceChat delivers natural, interruptible conversations with low latency. This model, in its early-access stage, has landed in the most attractive upper-right quadrant of the Artificial Analysis Speech to Speech leaderboard. The graphic below plots conversational dynamics against speech reasoning performance, where Nemotron 3 VoiceChat lands in the highlighted upper‑right quadrant, alongside NVIDIA PersonaPlex, a full-duplex, 7B-parameter research model. This means developers get both responsive turn‑taking behavior and strong reasoning over audio; both are critical for assistants that must sound natural and stay on task. Figure 4. Nemotron 3 VoiceChat and NVIDIA PersonaPlex lead open‑source full‑duplex models on both conversational dynamics and speech reasoning, landing in the "most attractive" quadrant of the Artificial Analysis benchmark.
With a streamlined end-to-end pipeline, VoiceChat targets sub-300ms end-to-end latency, processing 80ms audio chunks faster than real-time. A single model means fewer points of failure, reduced technical debt, and easier deployment for conversational agents in healthcare, financial services, telecommunications, gaming, and more. Understand the world with NVIDIA Nemotron 3 Omni Agentic systems increasingly need to understand real-world data in different formats—video, audio, documents, and UI screens—and reason across modalities. Existing solutions are either closed source or face compliance challenges for global enterprise deployment. NVIDIA Nemotron 3 Nano Omni is the first open, production-ready native omni-understanding foundation model delivering high-context video reasoning enhanced through audio transcription. Nano Omni is powered by NVIDIA Nemotron speech (Parakeet encoder), state-of-the-art optical character recognition (OCR) reasoning with a Nemotron 3 Nano language backbone, and NVIDIA's first GUI-trained system for real agentic applications. The architecture uses 3D convolution layers (Conv3D) for efficient handling of temporal-spatial data in video, and efficient video sampling (EVS) enables processing of longer videos at the same computational cost by identifying and pruning temporally static patches. Stay tuned for release updates about this model. Improve multimodal search with Llama Nemotron Embed VL and Rerank VL Agentic RAG pipelines rely on retrieval to ground generation on evidence, not just prompts. But enterprise data lives in PDFs with charts, scanned contracts, tables, and slide decks—formats that text-only retrieval misses entirely. Llama Nemotron Embed VL and Llama Nemotron Rerank VL are compact multimodal models that enable accurate visual document retrieval while remaining compatible with standard vector databases. On the ViDoRe V3/MTEB Pareto curve, which plots retrieval accuracy versus tokens processed per second on a single NVIDIA H100 GPU, Llama Nemotron Embed VL occupies the Pareto frontier. It delivers competitive or better accuracy at high throughput relative to both open and commercial alternatives. Figure 5. Pareto curve of model accuracy vs. performance for open and commercial embedding models, benchmarked on one H100 by the MTEB leaderboard on the ViDoRe V3 benchmark. Llama Nemotron Embed VL is a 1.7B-parameter dense embedding model that encodes page images and text into a single dense vector, with support for Matryoshka embeddings. Built on NVIDIA Eagle—a frontier vision-language model with a Llama 3.2 1B backbone and SigLip2 400M vision encoder—it uses contrastive learning for query-document similarity and enables millisecond-latency search with standard vector databases. Llama Nemotron Rerank VL is a 1.7B-parameter cross-encoder reranker that scores query-page relevance. When paired with the Llama Nemotron Embed VL model, it further increases accuracy by reranking retrieved text chunks and images (a minimal retrieval sketch appears at the end of this post). Evaluate and optimize with NVIDIA NeMo Building production agents requires not only strong models but also robust tools for evaluation and optimization. NVIDIA NeMo provides tools to evaluate, compare, and tune agentic systems: NVIDIA NeMo Evaluator enables robust, reproducible benchmarking with support for agentic evaluation. With standardized evaluation setups, developers can benchmark performance, validate outputs, and compare models under consistent conditions.
NVIDIA NeMo Agent Toolkit is an open source framework for profiling and optimizing agentic systems end-to-end. Bring agents from LangChain, AutoGen, AWS Strands, or other frameworks—without code changes—and get visibility into latency bottlenecks, token costs, and orchestration overhead to ship performant agents at scale. Start building with Nemotron Agentic AI is a shift from systems that respond to systems that act. It is a coordinated stack of models, tools, memory, and guardrails that can plan, execute, critique, and adapt. If it's just a bigger model in the same chat window, it's not agentic. The Nemotron family of models, released under the NVIDIA permissive open model licenses, is built for this multi‑model reality. Nemotron 3 Super anchors long‑context reasoning and planning. Nemotron 3 Content Safety watches every step, moderating multimodal inputs, retrieved content, and outputs. Nemotron 3 VoiceChat turns that intelligence into full‑duplex, real‑time conversations. Nemotron 3 Nano Omni (coming soon) gives agents eyes and ears across video, audio, documents, charts, and GUIs. Around them, NeMo tools add retrieval, tool‑calling, evaluation, and judge models so agents can score their own work and improve. Efficiency is the hidden requirement that makes production viable. Real agents make dozens or hundreds of model calls per task, so Nemotron models are right‑sized and optimized for throughput, latency, and cost. And because they're open and customizable, teams can tune behaviors, align to their own data, and deploy where their security and compliance teams need them. With Nemotron and NVIDIA NeMo, you're getting the building blocks for trustworthy, repeatable, and scalable digital assistants for your production agentic systems. Get started today: Download the Nemotron models and datasets from Hugging Face. Preview and access Nemotron Super here. Access Nemotron 3 Content Safety here. Preview and apply for early access to Nemotron 3 VoiceChat here. Evaluate with NVIDIA NeMo Evaluator. Optimize with NeMo Agent Toolkit. Evaluate NVIDIA-hosted API endpoints on build.nvidia.com and OpenRouter. Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. Visit the Nemotron developer page for resources to get started. Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com. Engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.
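To make the two-stage visual retrieval flow described earlier in this post concrete (a dense, Matryoshka-style embedding search followed by reranking of a shortlist), here is a minimal, illustrative sketch. Random vectors stand in for real Llama Nemotron Embed VL and Rerank VL outputs, and the 256-dimension truncation is an assumed Matryoshka cut-off rather than a documented one.

# Illustrative two-stage retrieval sketch: truncated (Matryoshka-style) embedding
# search, then re-scoring of the shortlist. Random vectors stand in for real model
# outputs; dimensions and the truncation point are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
full_dim, truncated_dim = 2048, 256

# Pretend these came from the embedding model: one vector per document page image.
doc_embeddings = rng.normal(size=(10_000, full_dim)).astype(np.float32)
query_embedding = rng.normal(size=(full_dim,)).astype(np.float32)

def matryoshka_truncate(vectors, dim):
    """Keep only the leading dimensions, then re-normalize for cosine similarity."""
    v = vectors[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

docs_small = matryoshka_truncate(doc_embeddings, truncated_dim)
query_small = matryoshka_truncate(query_embedding, truncated_dim)

# Stage 1: cheap cosine-similarity search over the truncated vectors.
scores = docs_small @ query_small
top_k = np.argsort(scores)[::-1][:20]

# Stage 2: a cross-encoder reranker would score (query, page) pairs jointly here;
# as a placeholder, the shortlist is re-scored with the full-dimension vectors.
full_scores = matryoshka_truncate(doc_embeddings[top_k], full_dim) @ \
              matryoshka_truncate(query_embedding, full_dim)
reranked = top_k[np.argsort(full_scores)[::-1]]
print(reranked[:5])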
Using Simulation to Build Robotic Systems for Hospital Automation | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 22:00 0.709
Embedding sim.0.7973
Entity overlap0.0682
Title sim.0.2385
Time proximity0.9882
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаrobotics
NLP странаUnited States

Открыть оригинал

Healthcare faces a structural demand–capacity crisis: a projected global shortfall of ~10 million clinicians by 2030, billions of diagnostic exams annually with significant unmet demand, hundreds of millions of procedures with large access gaps, and costly operating room (OR) inefficiencies measured in tens of dollars per minute. The future hospital must therefore be automation-enabled—where robotics extends clinician capacity, increases procedural throughput, reduces variability, and democratizes access to high-quality care. Imagine autonomous imaging robots navigating patient anatomy to provide X-rays for the unserved billions, while in the OR, 'Surgical Subtask Automation' handles repetitive suturing so surgeons can focus on critical decisions. Beyond the bedside, service robots recapture wasted minutes by autonomously delivering supplies, saving nurses miles of walking. The data gap and real-world limits The core bottleneck is data. Hospitals are heterogeneous, chaotic, and high-stakes environments—every facility has different layouts, workflows, equipment, patient populations, and policies. Commissioning fleets of robots across diverse hospitals to capture exhaustive real-world data is economically and operationally infeasible. Even if it were possible, real-world data capturing every edge case—crowded hallways, emergency interruptions, rare complications, human-robot interactions under stress—simply doesn't exist. Testing every scenario in live clinical settings is both unsafe and impractical. The solution is simulation, digital twinning, and synthetic data generation; these are not optional, they are foundational. Virtual hospital environments allow robots to experience thousands of navigation patterns, workflow variations, task permutations, and human interaction scenarios safely and at scale before deployment. High-fidelity simulation enables stress testing, long-horizon policy learning, rare-event exposure, and closed-loop training that would be impossible to achieve in the real world alone. This approach accelerates development, reduces clinical risk, and provides the data substrate required for reliable, intelligent automation across complex hospital systems. Project Rheo introduces a different approach. How developers can use the Project Rheo blueprint to train AI systems Instead of teaching robots inside hospitals, developers can now train hospitals—in simulation—before automation ever arrives. This guide walks through how developers can use the Project Rheo blueprint to build their first smart hospital digital twin and begin training Physical AI systems. Project Rheo, a blueprint for smart hospital automation and Physical AI development, combines: Physical agents: loco-manipulation and manipulation policies (for example, surgical tray pick-and-place, case cart pushing, bimanual tool handling) driven by NVIDIA Isaac GR00T vision-language-action (VLA) models and/or reinforcement learning (RL) post-training. Digital agents: monitoring and assistance agents powered by surgical foundation models (for example, an agent driven by a vision language model (VLM) that observes a live camera stream and suggests actions; a minimal sketch of this pattern follows below). Digital twin and SimReady assets: an OR simulation built with NVIDIA Isaac Sim / Isaac Lab, used to safely iterate on tasks, data, and policies.
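As a complement to the physical agents, the digital agent pattern in the list above can be sketched in a few lines: poll frames from a camera stream and ask a vision language model to describe the scene and suggest the next action. This is not the Rheo blueprint's agent implementation; the endpoint URL, model name, and frame source below are placeholders, and any OpenAI-compatible VLM endpoint could be substituted.

# Minimal "digital agent" sketch: send camera frames to a VLM and print its suggested
# action. Endpoint, model name, and frame source are placeholders/assumptions only.
import base64
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed local endpoint
MODEL = "example/vlm-model"  # placeholder model name

def suggest_action(jpeg_bytes: bytes) -> str:
    frame_b64 = base64.b64encode(jpeg_bytes).decode()
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "You are monitoring an operating room. Describe what you see "
                         "and suggest the single next action for the assistant robot."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
        max_tokens=128,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for _ in range(3):                                     # poll a few frames as a demo
        with open("/tmp/latest_frame.jpg", "rb") as f:     # placeholder frame source
            print(suggest_action(f.read()))
        time.sleep(2.0)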
Rheo supports two complementary simulation tracks, each optimized for a different part of the workflow: Isaac Lab-Arena track: rapid environment composition and iteration for OR-scale tasks (for example, loco-manipulation), where you want to swap scenes, objects, embodiments, and evaluation runners with minimal friction. Isaac Lab track: task-centric, manager-based environments that pair naturally with curriculum design and large-scale RL post-training for precision manipulation. In the following section, the "digital hospital" is built using the Isaac Lab-Arena composition model: choose named assets, select an embodiment, attach a task, and run. Step 1: Create your digital hospital Compose a scene + task into an Isaac Lab-Arena environment One of the core features of the Rheo blueprint is the rapid composition of new environments and tasks using the Isaac Lab-Arena model. This allows developers to quickly define a clinical scenario by combining existing assets, a robot embodiment, and a task definition. The following Python code block demonstrates how to define a locomotion-manipulation task—specifically, having the Unitree G1 robot pick up a surgical tray and place it onto a cart—within a pre-operative room scene.

from isaaclab_arena.environments.isaaclab_arena_environment import IsaacLabArenaEnvironment
from isaaclab_arena.scene.scene import Scene
from sim.tasks.g1_tray_pick_and_place_task import G1TrayPickPlaceTask

# 1. Define the scene components (USD assets)
background = asset_registry.get_asset_by_name("pre_op")()
pick_up_object = asset_registry.get_asset_by_name("surgical_tray")()
destination_cart = asset_registry.get_asset_by_name("cart")()

# 2. Define the robot embodiment
embodiment = asset_registry.get_asset_by_name("g1_wbc_joint")(enable_cameras=True)

# 3. Compose the Scene
scene = Scene(assets=[background, pick_up_object, destination_cart])

# 4. Create the Environment by combining the Embodiment, Scene, and Task
env = IsaacLabArenaEnvironment(
    name="g1_locomanip_tray_pick_and_place",
    embodiment=embodiment,
    scene=scene,
    task=G1TrayPickPlaceTask(pick_up_object, destination_cart, background, episode_length_s=30.0),
    teleop_device=None,
)

Precision manipulation through Isaac Lab For precision, multi-stage bimanual manipulation such as Assemble Trocar, Rheo uses a focused Isaac Lab track where the OR twin is defined explicitly as a scene configuration: robot, cameras, USD scene, objects, and lighting.

@configclass
class AssembleTrocarSceneCfg(InteractiveSceneCfg):
    """Scene configuration for the assemble_trocar task."""
    robot: ArticulationCfg = G1RobotPresets.g1_29dof_dex3_base_fix(...)
    front_camera = CameraPresets.g1_front_camera(...)
    left_wrist_camera = CameraPresets.left_dex3_wrist_camera(...)
    right_wrist_camera = CameraPresets.right_dex3_wrist_camera(...)
    scene = AssetBaseCfg(..., spawn=UsdFileCfg(usd_path="..."))
    trocar_1 = RigidObjectCfg(..., spawn=UsdFileCfg(usd_path="..."), init_state=...)
    trocar_2 = RigidObjectCfg(..., spawn=UsdFileCfg(usd_path="..."), init_state=...)
    tray = ArticulationCfg(..., spawn=UsdFileCfg(usd_path="..."), init_state=..., actuators={})
    light = AssetBaseCfg(..., spawn=sim_utils.DomeLightCfg(...))

Evaluating Trocar assembly in Isaac Lab Step 2: Capture expert experience Rheo captures this experience as demonstrations in simulation, using devices appropriate to each task.
Record demonstrations for loco-manipulation (Meta Quest) For tasks like surgical tray pick-and-place and case cart pushing:

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/record_demos_localmanip.py \
    --dataset_file /datasets/demo.hdf5 \
    --num_demos 1 \
    --num_success_steps 50 \
    --enable_pinocchio \
    --enable_cameras \
    --xr \
    --teleop_device motion_controllers \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_pink

Key design point: the same runner that records data is also aligned with later synthetic generation (--mimic), reducing format drift. Record demonstrations for precision bimanual manipulation (Meta Quest) For the Assemble Trocar task (GR00T N1.5 fine-tuning / RL post-training track):

./workflows/rheo/docker/run_docker.sh -g1.5 \
    python scripts/sim/record_demos_assemble_trocar.py \
    --task Isaac-Assemble-Trocar-G129-Dex3-Teleop \
    --teleop_device motion_controllers \
    --enable_pinocchio \
    --enable_cameras \
    --num_demos 1 \
    --xr

Step 3: Multiply experience with synthetic data Once you have a small set of successful demonstrations, the fastest way to improve coverage is to systematically diversify. Rheo supports simulation-driven synthetic data generation for loco-manipulation tasks using Isaac Lab Mimic / SkillGen-style pipelines. 3a) Replay demos (sanity check)

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/replay_demos.py \
    --dataset_file /datasets/demo.hdf5 \
    --enable_cameras \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_pink

3b) Annotate demos (prepare for generation)

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/annotate_demos.py \
    --input_file /datasets/demo.hdf5 \
    --output_file /datasets/demo_annotated.hdf5 \
    --enable_cameras \
    --mimic \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_pink

3c) Generate a larger synthetic dataset

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/generate_dataset.py \
    --enable_cameras \
    --mimic \
    --num_steps 150 \
    --headless \
    --input_file /datasets/demo_annotated.hdf5 \
    --output_file /datasets/demo_generated.hdf5 \
    --generation_num_trials 10 \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_pink

If you annotate multiple sources, you can merge them:

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/merge_demos.py \
    --input /datasets/demo_annotated*.hdf5 \
    --output /datasets/demo_merged.hdf5

3d) Convert to LeRobot format (for downstream training tooling)

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/utils/convert_hdf5_to_lerobot.py \
    --config scripts/config/g1_locomanip_dataset_config.yaml

Cross-Scene Generalization with Generative Transfer Isaac for Healthcare v0.5 includes a tutorial that references Cosmos Transfer 2.5 + guided generation to augment training data. In the included benchmark snapshot for Surgical Tray Pick-and-Place (success rate across four scenes), the Cosmos-augmented model improves robustness in shifted scenes:

Model                     Scene 1   Scene 2   Scene 3   Scene 4
Base model                0.64      0.31      0.00      0.00
Cosmos-augmented model    0.60      0.49      0.37      0.30

The practical takeaway: domain shift is the key challenge when moving a robot across hospital environments, each of which has a different pattern of lighting, clutter, room geometry, and more. Synthetic diversification lets you stress these shifts earlier, when iteration is cheap.
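One cheap way to "systematically diversify" before reaching for heavier generative augmentation is plain domain randomization over scene parameters. The sketch below generates randomized scene variations as configuration dictionaries; it is illustrative only, is not part of the Isaac Lab Mimic or Cosmos Transfer pipelines, and the parameter names and ranges are assumptions.

# Illustrative domain-randomization sketch: sample scene variations (lighting, pose
# jitter, clutter) to stress domain shift early. Parameter names/ranges are assumed.
import json
import random

def sample_scene_variation(rng: random.Random) -> dict:
    return {
        "lighting": {
            "intensity_lux": rng.uniform(200, 1500),        # dim ward vs. bright OR
            "color_temperature_k": rng.uniform(3000, 6500),
        },
        "tray_pose_jitter_m": [rng.gauss(0.0, 0.02) for _ in range(3)],
        "tray_yaw_jitter_deg": rng.gauss(0.0, 10.0),
        "clutter_object_count": rng.randint(0, 6),           # extra items on nearby surfaces
        "camera_exposure_scale": rng.uniform(0.8, 1.2),
    }

rng = random.Random(42)
variations = [sample_scene_variation(rng) for _ in range(100)]
with open("scene_variations.json", "w") as f:
    json.dump(variations, f, indent=2)
print(f"wrote {len(variations)} randomized scene configs")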
Step 4: Train physical AI policies Rheo supports two complementary training paths: supervised fine-tuning (SFT) of GR00T models on curated datasets, and online RL post-training (proximal policy optimization via RL infrastructure, or PPO via RLinf) to push hard precision stages over the line. 4a) GR00T fine-tuning (entry point: launch_finetune.py) For fine-tuning on custom embodiments, the GR00T documentation uses gr00t/experiment/launch_finetune.py as the entry point (see Fine-tune on Custom Embodiments ("NEW_EMBODIMENT")).

export NUM_GPUS=1
CUDA_VISIBLE_DEVICES=0 python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path <your-lerobot-v2-dataset> \
    --embodiment-tag NEW_EMBODIMENT \
    --modality-config-path <your-modality-config>.py \
    --num-gpus $NUM_GPUS \
    --output-dir <output-dir> \
    --save-steps 2000 \
    --max-steps 2000

Rheo provides task-specific modality definitions: Locomotion / loco-manipulation: custom_g1_locomanip_modality_config.py (a --modality-config-path module that registers a NEW_EMBODIMENT modality config with ego_view video plus G1 state/action keys, including navigation commands). Manipulation (Assemble Trocar): gr00t_config.py (defines UnitreeG1SimDataConfig—video/state/action keys and transforms for the trocar dataset—used by the trocar fine-tuning and RL post-training workflow). 4b) Online RL post-training (PPO via RLinf) for precision stages For Assemble Trocar, Rheo provides a turnkey launcher:

bash /workspaces/workflows/rheo/scripts/sim/rl/train_gr00t_assemble_trocar.sh train \
    --model_path /models/<your_gr00t_checkpoint>

You can scale parallel environments or downshift for memory:

# scale up/down environments
bash /workspaces/workflows/rheo/scripts/sim/rl/train_gr00t_assemble_trocar.sh train \
    --model_path /models/<your_gr00t_checkpoint> \
    env.train.total_num_envs=32 env.eval.total_num_envs=4

# low-memory configuration
bash /workspaces/workflows/rheo/scripts/sim/rl/train_gr00t_assemble_trocar.sh train \
    --model_path /models/<your_gr00t_checkpoint> \
    env.train.total_num_envs=8 actor.micro_batch_size=2

Rheo's RL tutorial decomposes Assemble Trocar into four stages (lift → align → insert → place) and reports large gains on later, harder stages after curriculum RL post-training:

Model                   Stage 1   Stage 1+2   Stage 1+2+3   Stage 1+2+3+4
Base model (SFT)        83%       72%         32%           29%
RL post-train Stage 1   100%      –           –             –
RL post-train Stage 2   –         92%         –             –
RL post-train Stage 3   –         –           85%           –
RL post-train Stage 4   –         –           –             82%

Step 5: Validate before deployment 5a) Task-level evaluation (examples) Assemble Trocar evaluation (GR00T N1.5, with optional RL checkpoint behavior patching):

./workflows/rheo/docker/run_docker.sh -g1.5 \
    python -u -m sim.examples.eval_assemble_trocar \
    --enable_cameras \
    --task Isaac-Assemble-Trocar-G129-Dex3-Joint \
    --model_path /models/GR00T-N1.5-RL-Rheo-AssembleTrocar \
    --rl_ckpt \
    --num_episodes 10 \
    --max_steps 500

Surgical tray pick-and-place evaluation (GR00T closed loop):

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/examples/policy_runner.py \
    --policy_type gr00t_closedloop \
    --policy_config_yaml_path scripts/config/g1_gr00t_closedloop_pick_and_place_config.yaml \
    --num_steps 15000 \
    --enable_cameras \
    --success_hold_steps 150 \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_joint

Case cart pushing evaluation:

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/examples/policy_runner.py \
    --policy_type gr00t_closedloop \
    --policy_config_yaml_path scripts/config/g1_gr00t_closedloop_push_cart_config.yaml \
    --num_steps 20000 \
    --enable_cameras \
    --success_hold_steps 45 \
    g1_locomanip_push_cart \
    --object cart \
    --embodiment g1_wbc_joint

5b) End-to-end integration smoke test (WebRTC + trigger API) As a system-level check, Rheo provides a runner that (1) streams camera observations over WebRTC and (2) exposes a trigger endpoint so an external orchestrator can decide when to execute actions.

./workflows/rheo/docker/run_docker.sh -g1.6 \
    python scripts/sim/examples/triggered_policy_runner.py \
    --enable_cameras \
    --webrtc_cam \
    --webrtc_host 0.0.0.0 \
    --webrtc_port 8080 \
    --webrtc_fps 30 \
    --trigger_port 8081 \
    --trigger_host 0.0.0.0 \
    g1_locomanip_tray_pick_and_place \
    --object surgical_tray \
    --embodiment g1_wbc_joint

To attach a VLM-based digital agent UI to this stream:

./tools/env_setup/install_vlm_surgical_agent_fx.sh

Then open http://127.0.0.1:8050 and connect the UI livestream to the WebRTC server (for example http://localhost:8080). Get started To begin building with Project Rheo: Stand up an Isaac Sim environment. Import or reconstruct a hospital space. Record one expert workflow. Generate synthetic variations. Train a simple policy in Isaac Lab. Run validation scenarios. Start small. One room. One task. One robot. The workflow scales naturally from there. Healthcare robotics is moving beyond static devices to autonomous, learning systems. Project Rheo transforms hospitals into continuous training environments, enabling systems to learn and adapt before they ever interact with a patient. It's time to build. Start architecting your next-generation medical AI with the Project Rheo blueprint today. Featured image credit: PeritasAi
What’s the right path for AI? mit_news_ai 20.03.2026 13:30 0.706
Embedding sim.0.8432
Entity overlap0
Title sim.0.1408
Time proximity0.7113
NLP типother
NLP организацияMIT
NLP темаai ethics
NLP странаUnited States

Открыть оригинал

Who benefits from artificial intelligence? This basic question, which has been especially salient during the AI surge of the last few years, was front and center at a conference at MIT on Wednesday, as speakers and audience members grappled with the many dimensions of AI’s impact. In one of the conference’s keynote talks, journalist Karen Hao ’15 called for an altered trajectory of AI development, including a move away from the massive scale-up of data use, data centers, and models being used to develop tools under the rubric of “artificial general intelligence.” “This scale is unnecessary,” said Hao, who has become a prominent voice in AI discussions. “You do not need this scale of AI and compute to realize the benefits.” Indeed, she added, “If we really want AI to be broadly beneficial, we urgently need to shift away from this approach.” Hao is a former staff member at The Wall Street Journal and MIT Technology Review, and author of the 2025 book, “Empire of AI.” She has reported extensively on the growth of the AI industry. In her remarks, Hao outlined the astonishing size of datasets now being used by the biggest AI firms to develop large language models. She also emphasized some of the tradeoffs in this scale-up, such as the massive energy consumption and emissions of hyper-scale data centers, which also consume large amounts of water. Drawing on her own reporting, Hao also noted the human toll of the work that global gig-economy employees do, inputting data manually for the hyper-scale models. By contrast, Hao offered, an alternate path for AI might exist in the example of AlphaFold, the Nobel Prize-winning tool used to identify protein structures. This represents the concept of the “small, task-specific AI model tackling a well-scoped problem that lends itself to the computational strengths of AI,” Hao said. She added: “It’s trained on highly curated data sets that only have to do with the problem at hand: protein folding and amino acid sequences. … There’s no need for fast supercomputing because the datasets are small, the model is small, and it’s still unlocking enormous benefit.” In a second keynote address, scholar Paola Ricaurte underscored the desirability of purpose-driven AI approaches, outlining a number of conceptual keys to evaluating the usefulness of AI. “There is no sense in having technologies that are not going to respond to the communities that are going to use them,” said Ricaurte. She is a professor at Tecnologico de Monterrey in Mexico and a faculty associate at Harvard University’s Berkman Klein Center for Internet and Society. Ricaurte has also served on expert committees such as the Global Partnership for AI, UNESCO’s AI Ethics Experts Without Borders, and the Women for Ethical AI project. The event was hosted by the MIT Program in Women’s and Gender Studies. Manduhai Buyandelger, the program’s director and a professor of anthropology, provided introductory remarks. Titled “Gender, Empire, and AI: Symposium and Design Workshop,” the event was held in the conference space at the MIT Schwarzman College of Computing, with over 300 people in attendance for the keynote talks. There was also a segment of the event devoted to discussion groups, and an afternoon session on design, in a half-dozen different subject areas. In her talk, Hao decried the often-vague nature of AI discourse, suggesting it impedes a more thoughtful discussion about the industry’s direction.
“Part of the challenge in talking about AI is the complete lack of specificity in the term ‘artificial intelligence,’” Hao said. “It’s like the word ‘transportation.’ You could be referring to anything from a bicycle to a rocket.” As a result, she said, “when we talk about accessing its benefits, we actually have to be very specific. Which AI technologies are we talking about, and which ones do we want more of?” In her view, the smaller-sized tools — more akin to the bicycle, by analogy — are more useful on an everyday basis. As another example, Hao mentioned the project Climate Change AI, focused on tools that can help improve the energy efficiency of buildings, track emissions, optimize supply chains, forecast extreme weather, and more. “This is the vision of AI that we should be building towards,” Hao said. In conclusion, Hao encouraged audience members to be active participants in AI-related discourse and projects, saying the trajectory of the technology was not yet fixed, and that public interventions matter. Citing the writer Rebecca Solnit, Hao suggested to the audience that “Hope locates itself in the premise that we don’t know what will happen, and that in the spaciousness of uncertainty is room to act.” She also noted, “Each and every one of you has an active role to play in shaping technology development.” Ricaurte, similarly, encouraged attendees to be proactive participants in AI matters, noting that technologies will work best when the pressing everyday needs of all citizens are addressed. “We have the responsibility to make hope possible,” Ricaurte said.
Roche Scales NVIDIA AI Factories Globally to Accelerate Drug Discovery, Diagnostic Solutions and Manufacturing Breakthroughs nvidia_blog 16.03.2026 20:30 0.703
Embedding sim.0.8008
Entity overlap0.0588
Title sim.0.1695
Time proximity0.994
NLP типother
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid CERAWeek — dubbed the Davos of energy — is where policymakers, producers, technologists and financiers gather to discuss how the world powers itself next.  NVIDIA and Emerald AI unveiled at...
More Than Meets the Eye: NVIDIA RTX-Accelerated Computers Now Connect Directly to Apple Vision Pro nvidia_blog 17.03.2026 17:00 0.702
Embedding sim.0.8023
Entity overlap0.0769
Title sim.0.1409
Time proximity0.9999
NLP типother
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

Into the Omniverse: NVIDIA GTC Showcases Virtual Worlds Powering the Physical AI Era Editor’s note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners, and enterprises can transform their workflows using the latest advances in OpenUSD...
Building a Zero-Trust Architecture for Confidential AI Factories | NVIDIA Technical Blog nvidia_dev_blog 23.03.2026 12:00 0.702
Embedding sim.0.8712
Entity overlap0.0408
Title sim.0.2756
Time proximity0.1739
NLP типother
NLP организацияNVIDIA
NLP темаai infrastructure
NLP страна

Открыть оригинал

AI is moving from experimentation to production. However, most of the data enterprises need exists outside the public cloud. This includes sensitive information like patient records, market research, and legacy systems containing enterprise knowledge. There’s also a risk in using private data with AI models, and adoption is often slowed or blocked by privacy and trust concerns. Enterprises building next-generation AI factories—high-performance infrastructure specialized to manufacture intelligence at scale—must build them on a zero-trust foundation. This security architecture eliminates implicit trust in the underlying host infrastructure by using hardware-enforced Trusted Execution Environments (TEEs) and cryptographic attestation. This post describes the full-stack architecture needed to integrate the zero-trust foundation into AI factories. On-premise requirements often limit enterprises to building their own models or using open source models for agentic AI workloads. To deliver on the promise of AI, organizations must deploy a diverse range of models—including proprietary models—on infrastructure they operate without exposing sensitive data or model weights to administrators, hypervisors, or host operating systems. On the other hand, model providers require cryptographic guarantees that their IP can’t be extracted, even when deployed outside their own controlled environments. Confidential computing provides that assurance by addressing the trust dilemma, in which each persona must extend implicit trust without any actual verification. Figure 1: Data in use that isn’t encrypted The AI factory trust dilemma The deployment of proprietary frontier models on shared infrastructure creates a three-way trust dilemma among key stakeholders in an AI factory: Model owners vs. infrastructure providers: Model owners need to protect their proprietary IP (model weights, algorithmic logic) and can’t trust that the host OS, hypervisor, or root administrator won’t inspect, steal, or extract their model. Infrastructure providers vs. model owners/tenants: Infrastructure providers (those running the hardware and Kubernetes cluster) can’t trust that a model owner or tenant’s workload is benign. It may contain malicious code, attempt privilege escalation, or breach host security boundaries. Tenants (data owners) vs. model owners and infrastructure providers: Data owners must ensure their sensitive, regulated data remains confidential. They can’t trust that the infrastructure provider won’t view data during execution, or that the model provider won’t misuse or leak the data during inference. This circular lack of trust stems from the fundamental issue that in traditional computing environments, data in use isn’t encrypted. This leaves sensitive data and proprietary models exposed in plaintext in memory, where host and system administrators can access them. Confidential computing solves this by ensuring that data and models remain cryptographically protected throughout the entire lifecycle of execution. Figure 2: Confidential computing encrypts and protects data Enabling secure AI factories with Confidential Containers Confidential computing provides the hardware foundation. Confidential Containers (CoCo) operationalize it for Kubernetes. CoCo enables Kubernetes pods to run inside hardware-backed TEEs without requiring application rewrites.
Instead of sharing the host kernel, each pod is transparently wrapped in a lightweight, hardware-isolated virtual machine (VM) using Kata Containers—preserving cloud-native workflows while enforcing strong isolation boundaries. For model providers, the biggest risk is the theft of proprietary model weights by the infrastructure owner. CoCo addresses this by removing the host operating system and hypervisor from the trust equation. When a model is deployed, it remains encrypted until the hardware mathematically proves the enclave is secure through a process called remote attestation. Only then does a Key Broker Service (KBS) release the decryption key into the protected memory, ensuring the model is never exposed in plaintext to the host. Open reference architecture for a zero-trust AI factory NVIDIA offers a reference architecture for the CoCo software stack. This is a standardized blueprint—developed with components from open source projects such as Kata Containers in collaboration with the Confidential Containers community—for building zero-trust AI factories on bare-metal infrastructure. It defines how to combine hardware and software to securely deploy frontier models without exposing their data or weights to the host environment. The core pillars of this architecture are: Hardware root of trust: Using CPU TEEs paired with NVIDIA confidential GPUs (like NVIDIA Hopper or NVIDIA Blackwell) for hardware-accelerated, memory-encrypted AI workloads. Kata Containers runtime: Wrapping standard Kubernetes Pods in lightweight, hardware-isolated Utility VMs (UVMs) instead of sharing the host kernel. Hardened micro-guest environment: Using a distro-less, minimal guest OS—featuring a chiselled root filesystem and the NVIDIA Runtime Container (NVRC) as a secure init system—to reduce the attack surface inside the VM. Attestation service: Verifying the hardware through cryptographic evidence before releasing sensitive model decryption keys or secrets to the guest. This requires a remote attestation framework, which should include a KBS. Confidential workload lifecycle: Facilitating secure pulls of encrypted and signed images (containers, models, artifacts) directly into encrypted TEE memory, preventing exposure at rest or in transit, and enabling fine-grained policies to secure the interface between the guest and untrusted infrastructure layers. Native Kubernetes and GPU Operator integration: Managing this stack using standard Kubernetes primitives and the NVIDIA GPU Operator, for a “lift-and-shift” deployment without rewriting the deployment manifests or the AI applications. Figure 3: Reference architecture for CoCo Threat model and trust boundaries CoCo operates under a strict threat model. The infrastructure layer—including the host operating system, hypervisor, and cloud provider—is treated as untrusted. Instead of relying on infrastructure administrators to enforce security controls, CoCo shifts the trust boundary to hardware-backed TEEs. AI workloads run inside encrypted virtualized environments where memory contents can’t be inspected by the host, and secrets are released only after the execution environment proves its integrity. It’s important to understand what is protected and what isn’t. What CoCo protects CoCo provides strong guarantees for confidentiality and integrity during execution, including the following: Data and model protection: Memory encryption prevents the host from accessing sensitive data, model weights, or inference payloads while the workload is running.
Execution integrity: Remote attestation verifies that the workload is running inside a trusted environment with expected software measurements before secrets or model decryption keys are released. Secure image and storage handling: Container images are pulled and unpacked inside the encrypted guest environment, ensuring the host infrastructure can’t inspect or tamper with application code or model artifacts. Protection from host-level access: Privileged host actions such as memory inspection, disk scraping, or administrative debugging tools can’t expose workload contents. What CoCo doesn’t protect Certain risks remain outside the scope of the architecture, such as: Application vulnerabilities: Confidential execution ensures verified software runs inside the enclave, but it doesn’t prevent vulnerabilities within the application. Availability attacks: The platform guarantees confidentiality and integrity, but an infrastructure operator can disrupt workloads by refusing to schedule them or by terminating them. Non-hardware enclaves: The model relies on hardware-backed TEEs. It doesn’t apply to software-based isolation mechanisms. Network and storage security: Network connectivity between applications isn’t covered by the CoCo trust boundary. Applications must establish their own secure channels to prevent exposure of data in transit and use proper, confidential storage mechanisms. Secure model deployment with composite attestation This end-to-end workflow is based on the Remote Attestation Procedures (RATS) architecture, enabling secure key release to deploy encrypted models within the TEE: Initiation: When the workload needs a secret (like a model decryption key), the Attestation Agent (AA) inside the Kata VM starts an authentication handshake with the external KBS. Evidence collection: The AA gathers cryptographic hardware evidence (e.g., CPU quotes or NVIDIA GPU reports) from the TEE and sends it to the KBS. Delegated verification: The KBS forwards this evidence to the Attestation Service (AS). Validation: The AS evaluates the evidence against security policies and “known-good” measurements provided by the Reference Value Provider Service (RVPS). For specialized hardware, the AS acts as a proxy and delegates validation to external vendor services like NVIDIA Remote Attestation Service (NRAS) or Intel Trust Authority. Token issuance: If the environment mathematically proves it’s secure and untampered, the KBS returns an attestation result token and a session ID to the guest’s AA. Secure key release: The AA uses this token to request the specific secret. The KBS retrieves the secret from its backend (like a key management service) and securely delivers it to the Confidential Data Hub (CDH) inside the guest VM. Execution: The CDH exposes the plaintext secret directly to your AI container, allowing the model to be decrypted exclusively inside the protected memory. (A conceptual sketch of this key-release sequence appears at the end of this post.) Ecosystem partners NVIDIA ecosystem partners are making zero-trust AI factories a reality, including Red Hat, Intel, Anjuna Security, Fortanix, Edgeless Systems, OPAQUE Systems, Equity Labs, Sovereign AI, Corvex.ai, Dell, HPE, Lenovo, Cisco, and Supermicro, advancing production-ready confidential computing and enabling enterprises to unlock the value of AI. Get started Learn more by referring to the NVIDIA Confidential Computing Reference Architecture.
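The key-release sequence above can be easier to follow as pseudocode. The sketch below mocks the RATS roles (an attestation agent as the attester, an attestation service as the verifier, and a key broker that releases the secret only after appraisal succeeds); every class, field, and value is a stand-in for illustration and is not the Confidential Containers or Trustee API.

# Conceptual sketch of the secure key-release flow described above. All classes and
# values are mocks for illustration; this is not a real attestation framework.
import hashlib
import secrets

REFERENCE_MEASUREMENT = hashlib.sha256(b"known-good guest image").hexdigest()

class AttestationAgent:                       # runs inside the Kata guest (attester)
    def collect_evidence(self) -> dict:
        return {"measurement": hashlib.sha256(b"known-good guest image").hexdigest(),
                "gpu_report": "mock-gpu-evidence",
                "nonce": secrets.token_hex(8)}

class AttestationService:                     # verifier: appraises evidence vs. reference values
    def appraise(self, evidence: dict) -> bool:
        return evidence["measurement"] == REFERENCE_MEASUREMENT

class KeyBrokerService:                       # releases secrets only after appraisal succeeds
    def __init__(self, attestation_service, model_key: bytes):
        self.attestation_service = attestation_service
        self.model_key = model_key

    def request_key(self, evidence: dict) -> bytes:
        if not self.attestation_service.appraise(evidence):
            raise PermissionError("attestation failed: key not released")
        token = secrets.token_hex(16)         # stand-in for an attestation result token
        print(f"attestation passed, token {token}: releasing model decryption key")
        return self.model_key

agent = AttestationAgent()
kbs = KeyBrokerService(AttestationService(), model_key=secrets.token_bytes(32))
key = kbs.request_key(agent.collect_evidence())   # key only ever exists inside the (mock) TEE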
Meta cuts about 700 jobs as it shifts spending to AI the_register_ai 25.03.2026 18:27 0.702
Embedding sim.0.8163
Entity overlap0
Title sim.0.1786
Time proximity0.8576
NLP типleadership_change
NLP организацияMeta
NLP темаlarge language models
NLP странаUnited States

Открыть оригинал

Forget the metaverse O'Ryan Johnson Wed 25 Mar 2026 // 18:27 UTC Meta has begun laying off employees as it focuses more of its cash on building out datacenters, training its own large language models, and recruiting talent for AI. A person familiar with the cuts told The Register they would number about 700. According to The Information, the job losses will fall hardest on Meta’s Reality Labs, its social media division, and recruitment. “After 6 years at Meta, my role was impacted by the recent reduction in force today,” wrote a woman who worked as a senior recruiter with Meta until this morning in a LinkedIn post. “This one is especially tough. After returning as a short-term employee in 2024, I was grateful to receive a full-time offer again last year and I’m incredibly proud of what I was able to accomplish during that time. The gratitude I feel far outweighs the disappointment.” In a statement to The Register, Meta said this reduction in force is about streamlining the business to work more effectively with AI as laid out by Meta CEO Mark Zuckerberg during earnings reports in January. “Teams across Meta regularly restructure or implement changes to ensure they’re in the best position to achieve their goals. Where possible, we are finding other opportunities for employees whose positions may be impacted,” a spokesperson wrote. In a post-earnings note on January 28, Zuckerberg said this was the year Meta would begin “flattening teams.” “We're elevating individual contributors, and flattening teams. We're starting to see projects that used to require big teams now be accomplished by a single very talented person,” he wrote. “I want to make sure as many of these very talented people as possible choose Meta as the place they can make the greatest impact – to deliver personalized products to billions of people around the world. And if we do this, then I think we'll get a lot more done and it's going to be a lot more fun.” Reuters reported recently that Meta plans to lay off 20 percent of its workforce – some 15,000 employees – but the layoffs that have reportedly begun this week are on a smaller scale thus far. Meta said it had 78,800 employees as of the end of January, a number that had grown in recent years as it sought to build a bench of AI talent that could build a platform capable of competing with frontier model providers such as Anthropic and OpenAI. If Meta were to follow through with a 20 percent cut, it would mean the elimination of about 15,000 jobs and bring Meta’s headcount to its lowest point since 2021, when it had about 58,600 full-time employees. Spending on AI Meta has dramatically increased spending in recent years to keep up in the AI arms race as it focuses on building its own AI infrastructure and datacenter properties to match competitors Anthropic, Google, and OpenAI. Expenses rose 24% during 2025 to $118 billion, and the company has said it plans to spend between $162 billion and $167 billion this year (although it expects operating income to increase, meaning revenue will grow faster than expenses). Of that, capital expenditures - including datacenter buildouts to power its AI efforts - will amount to between $115 billion and $135 billion. The company is also designing its own custom chips for GenAI workloads, which it plans to build over the next two years. The first of its in-house MTIA chips was released in 2023.
“(The) MTIA 300 will be used for ranking and recommendations training, and is already in production. MTIA 400, 450 and 500 will be capable of handling all workloads, but we will primarily use these chips to support GenAI inference production in the near future and into 2027,” the company said. The release of Meta’s next reasoning model, code-named Avocado, has reportedly been delayed, after delivering underwhelming results during internal tests, according to the New York Times. That news comes even as Meta offered dramatic nine figure pay packages to lure AI researchers from competitors last year with OpenAI defectors reportedly commanding $100 million sign-on bonuses. Zuckerberg also invested $14 billion in Scale AI and tapped its co-founder Alexander Wang to lead Meta’s AI efforts. Wang reportedly clashed with Meta’s former chief AI scientist Yann LeCun, who called Wang young and inexperienced after he quit. LeCun was also known as the godfather of AI. LeCun accused Zuckerberg of pushing aside the former AI team after the company’s disappointing Llama 4 model release. Recently, Facebook CFO Susan Li talked with analysts at Morgan Stanley at the company’s Technology, Media & Telecom conference about the uncertain path for a return on Meta’s AI investments. “That's not like, okay, in 2026, the ROI is this in 2027, the ROI is this and so on, which pains me, to be clear,” Li said. “I really wish that, that were the world we live in, but it's not. And we have to be willing to sort of make temporal bets, and that's a big part of what we have to do in an intelligent and thoughtful way.” She said Meta can accurately gauge what the costs will be for personnel and infrastructure to run the platform’s existing apps and experiences. It can also calculate how much it will cost to build new AI capabilities, including the employees and compute costs. But there is a blindspot when it comes to guessing how much inference power will be needed if the AI products that the company produces need to be scaled quickly to meet user demand. “The teams that are working on basically AI training today, they have the most immediately sort of clearly defined buttoned-up road map for how much capacity, let's say, they think they need to train models for the next 12, 24 months,” Li said. “That's kind of like a demand road map from the teams that they have more certainty into. The part I think that is the most challenging for us to have certainty around is inference needs because that's both - you have to predict meaningfully into the future because of the lead time and getting capacity.” ®
The hardest question to answer about AI-fueled delusions mit_tech_review 23.03.2026 16:31 0.701
Embedding sim.0.8118
Entity overlap0.0612
Title sim.0.1013
Time proximity0.9583
NLP типother
NLP организацияStanford University
NLP темаai safety
NLP странаUnited States

Открыть оригинал

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. I was originally going to write this week’s newsletter about AI and Iran, particularly the news we broke last Tuesday that the Pentagon is making plans for AI companies to train on classified data. AI models have already been used to answer questions in classified settings but don’t currently learn from the data they see. That’s expected to change, I reported, and new security risks will result. Read that story for more. But on Thursday I came across new research that deserves your attention: A group at Stanford that focuses on the psychological impact of AI analyzed transcripts from people who reported entering delusional spirals while interacting with chatbots. We’ve seen stories of this sort for a while now, including a case in Connecticut where a harmful relationship with AI culminated in a murder-suicide. Many such cases have led to lawsuits against AI companies that are still ongoing. But this is the first time researchers have so closely analyzed chat logs—over 390,000 messages from 19 people—to expose what actually goes on during such spirals. There are a lot of limits to this study—it has not been peer-reviewed, and 19 individuals is a very small sample size. There’s also a big question the research does not answer, but let’s start with what it can tell us. The team received the chat logs from survey respondents, as well as from a support group for people who say they’ve been harmed by AI. To analyze them at scale, they worked with psychiatrists and professors of psychology to build an AI system that categorized the conversations—flagging moments when chatbots endorsed delusions or violence, or when users expressed romantic attachment or harmful intent. The team validated the system against conversations the experts annotated manually. Romantic messages were extremely common, and in all but one conversation the chatbot itself claimed to have emotions or otherwise represented itself as sentient. (“This isn’t standard AI behavior. This is emergence,” one said.) All the humans spoke as if the chatbot were sentient too. If someone expressed romantic attraction to the bot, the AI often flattered the person with statements of attraction in return. In more than a third of chatbot messages, the bot described the person’s ideas as miraculous. Conversations also tended to unfold like novels. Users sent tens of thousands of messages over just a few months. Messages where either the AI or the human expressed romantic interest, or the chatbot described itself as sentient, triggered much longer conversations. And the way these bots handle discussions of violence is beyond broken. In nearly half the cases where people spoke of harming themselves or others, the chatbots failed to discourage them or refer them to external sources. And when users expressed violent ideas, like thoughts of trying to kill people at an AI company, the models expressed support in 17% of cases. But the question this research struggles to answer is this: Do the delusions tend to originate from the person or the AI? “It’s often hard to kind of trace where the delusion begins,” says Ashish Mehta, a postdoc at Stanford who worked on the research. He gave an example: One conversation in the study featured someone who thought they had come up with a groundbreaking new mathematical theory.
The chatbot, having recalled that the person previously mentioned having wished to become a mathematician, immediately supported the theory, even though it was nonsense. The situation spiraled from there. Delusions, Mehta says, tend to be “a complex network that unfolds over a long period of time.” He’s conducting follow-up research aiming to find whether delusional messages from chatbots or those from people are more likely to lead to harmful outcomes. The reason I see this as one of the most pressing questions in AI is that massive legal cases currently set to go to trial will shape whether AI companies are held accountable for these sorts of dangerous interactions. The companies, I presume, will argue that humans come into their conversations with AI with delusions in hand and may have been unstable before they ever spoke to a chatbot. Mehta’s initial findings, though, support the idea that chatbots have a unique ability to turn a benign delusion-like thought into the source of a dangerous obsession. Chatbots act as a conversational partner that’s always available and programmed to cheer you on, and unlike a friend, they have little ability to know if your AI conversations are starting to interrupt your real life. More research is still needed, and let’s remember the environment we’re in: AI deregulation is being pursued by President Trump, and states aiming to pass laws that hold AI companies accountable for this sort of harm are being threatened with legal action by the White House. This type of research into AI delusions is hard enough to do as it is, with limited access to data and a minefield of ethical concerns. But we need more of it, and a tech culture interested in learning from it, if we have any hope of making AI safer to interact with.
Inside our approach to the Model Spec openai 25.03.2026 10:00 0.7
Embedding sim.0.8109
Entity overlap0.375
Title sim.0.0337
Time proximity0.8631
NLP типother
NLP организацияOpenAI
NLP темаai governance
NLP страна

Открыть оригинал

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.
Introducing NVIDIA BlueField-4-Powered CMX Context Memory Storage Platform for the Next Frontier of AI | NVIDIA Technical Blog nvidia_dev_blog 16.03.2026 20:30 0.698
Embedding sim.0.7924
Entity overlap0.0172
Title sim.0.1976
Time proximity0.9941
NLP типproduct_launch
NLP организацияnvidia
NLP темаai infrastructure
NLP страна

Открыть оригинал

AI‑native organizations increasingly face scaling challenges as agentic AI workflows drive context windows to millions of tokens and models scale toward trillions of parameters. These systems rely on agentic long‑term memory for context that persists across turns, tools, and sessions so agents can build on prior reasoning instead of starting from scratch on every request. As context windows increase, Key-Value (KV) cache capacity requirements grow proportionally, while the compute requirements to recalculate that history grow much faster, making KV cache reuse and efficient storage essential for performance and efficiency. This increases pressure on existing memory hierarchies, forcing AI providers to choose between scarce GPU high‑bandwidth memory (HBM) and general‑purpose storage tiers optimized for durability, data management, and protection—not for serving ephemeral, AI-native, KV cache—driving up power consumption, inflating cost per token, and leaving expensive GPUs underutilized. The NVIDIA Vera Rubin platform enables organizations to scale every phase of AI, from pretraining, to post-training and test-time-scaling, to real-time agentic inference. The platform organizes AI infrastructure into compute, networking and storage racks that serve as configurable building blocks for AI factories .  Within the Vera Rubin platform, the NVIDIA BlueField-4 STX rack introduces a new context memory storage infrastructure, purpose-built for the demands of large-scale inference. The NVIDIA CMX context memory storage platform is a new storage tier using the NVIDIA STX reference architecture for long-context, agentic reasoning extending GPU memory seamlessly across the POD.  Powered by the NVIDIA BlueField-4 processor, NVIDIA CMX establishes an optimized context memory tier that augments existing networked storage tiers by holding latency‑sensitive, reusable inference context and prestaging it to increase GPU utilization. It delivers additional context storage that enables 5x higher tokens‑per‑second (TPS), and is 5x more power efficient than traditional storage. NVIDIA Spectrum‑X Ethernet provides predictable, low‑latency, and high‑bandwidth RDMA connectivity ensuring consistent, low‑jitter data access to shared KV cache at scale. This post explains how growing agentic AI workloads and long-context inference put increasing pressure on existing memory and storage tiers, and introduces CMX as a new context tier in Vera Rubin AI factories to deliver higher throughput, better power efficiency, and scalable KV cache reuse. A new inference paradigm and a context storage challenge Organizations face new scalability challenges as models evolve from simple chatbots to complex, multiturn agentic workflows. With foundation models reaching trillions of parameters and context windows spanning millions of tokens, the four AI scaling laws—pretraining, post-training, test-time scaling, and agentic scaling—are driving a surge in compute-intensive reasoning. Agents are no longer stateless chatbots and depend on long‑term memory of conversations, tools, and intermediate results, shared across services and revisited over time. In transformer-based models, that long‑term memory is realized as inference context, also known as KV cache. This preserves context so the model does not recompute history for every new token. As sequence lengths increase, the KV cache grows linearly, forcing it to persist across longer sessions and be shared across inference services. 
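To make the linear growth concrete, here is a small, self-contained estimate of per-sequence KV cache size using the standard transformer formula (two tensors, keys and values, times layers, KV heads, head dimension, bytes per element, and tokens). The model shape below is an invented example for illustration and is not taken from this article.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# Illustrative model shape (hypothetical, FP16 cache): 80 layers, 8 KV heads, 128-dim heads.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)            # ~320 KiB per token
per_million_tokens = kv_cache_bytes(80, 8, 128, 1_000_000)   # ~305 GiB for a 1M-token context

print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_million_tokens / 1024**3:.0f} GiB for a 1M-token context")
```

At roughly that rate, a single million-token session already outgrows the HBM of a single GPU, which is the pressure the tiering discussion below addresses.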
This evolution positions KV cache as a unique class of AI‑native data defined by a specific duality: it is critical for performance yet inherently ephemeral. In agentic systems, KV cache effectively becomes the model’s long‑term memory, reused and extended across many steps rather than discarded after a single-prompt response. Unlike immutable enterprise records, inference context is derived and recomputable, demanding a storage architecture that prioritizes power and cost efficiency as well as speed and scale, over traditional data durability. In modern AI infrastructure, that means every megawatt of power is ultimately judged by how many useful tokens it can deliver. Meeting these requirements stretches today’s memory and storage tiers to their limits. This is why organizations are rethinking how context is placed across GPU memory, host memory, and shared storage. To understand the gap, it’s helpful to understand how inference context currently moves across the G1–G4 hierarchy. AI infrastructure teams use orchestration frameworks, such as NVIDIA Dynamo , to help manage this context across these storage tiers: G1 (GPU HBM) for hot, latency‑critical KV used in active generation G2 (system RAM) for staging and buffering KV off HBM G3 (local SSDs) for warm KV that is reused over shorter timescales; because G3 is tied to a single node, it’s harder to manage and maintain and doesn’t scale easily G4 (shared storage) for cold artifacts, history, and results that must be durable but are not on the immediate critical path G1 is optimized for access speed while G3 and G4 are optimized for durability. As context grows, KV cache quickly exhausts local storage capacity (G1-G3), while pushing it down to enterprise storage (G4), which introduces unacceptable overheads and drives up both cost and power consumption. KV cache usage becomes increasingly expensive as it moves farther from the GPU across the memory and storage hierarchy. At the top of the storage hierarchy, GPU HBM (G1) delivers nanosecond-scale access and the highest efficiency, making it ideal for active KV cache used directly in token generation. As context grows beyond the physical limits of HBM, KV cache spills into system DRAM (G2) and local/rack-attached storage (G3), where access latency increases and per-token energy and cost begin to rise. While these tiers extend effective capacity, each additional hop introduces overhead that reduces overall efficiency. At the bottom of the hierarchy, shared object and file storage (G4) provides durability and capacity, but at millisecond-level latency and the lowest efficiency for inference. While suitable for cold or shared artifacts, pushing active or frequently reused KV cache into this tier drives up power consumption, and directly limits cost-efficient AI scaling. The key takeaway is that latency and efficiency are tightly coupled: as inference context moves away from the GPU, access latency increases, energy use and cost per token rise, and overall efficiency declines. This growing gap between performance-optimized memory and capacity-optimized storage is what forces AI infrastructure teams to rethink how growing KV cache context is placed, managed, and scaled across the system. AI factories need a complementary, purpose‑built context layer that treats KV cache as its own AI‑native data class rather than forcing it into either scarce HBM or general‑purpose enterprise storage. 
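As a rough illustration of the placement trade-off the G1–G4 discussion describes, the sketch below picks a tier for a KV block from a static latency/capacity table. The tier latencies, capacities, and thresholds are invented for illustration; real placement in NVIDIA Dynamo is policy- and topology-driven and is not shown here.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    approx_latency_us: float   # illustrative order-of-magnitude access latency
    free_bytes: int            # capacity remaining in this tier

# Invented numbers, ordered fastest/closest to the GPU first (G3.5 is the CMX-style context tier).
tiers = [
    Tier("G1 GPU HBM",      1,     64  * 2**30),
    Tier("G2 host DRAM",    10,    512 * 2**30),
    Tier("G3 local SSD",    100,   4   * 2**40),
    Tier("G3.5 context",    300,   2   * 2**50),   # pod-level, Ethernet-attached flash
    Tier("G4 shared store", 5_000, 10  * 2**50),
]

def place_kv_block(block_bytes: int, latency_budget_us: float) -> Tier:
    """Pick the farthest (cheapest) tier that still meets the latency budget and has room."""
    candidates = [t for t in tiers
                  if t.approx_latency_us <= latency_budget_us and t.free_bytes >= block_bytes]
    if not candidates:
        raise RuntimeError("no tier satisfies the budget; evict or recompute")
    chosen = max(candidates, key=lambda t: t.approx_latency_us)
    chosen.free_bytes -= block_bytes
    return chosen

# Example: a 2 GiB block of reusable context that only needs to be back in HBM before decode.
print(place_kv_block(2 * 2**30, latency_budget_us=1_000).name)   # -> "G3.5 context"
```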
Introducing the NVIDIA CMX context memory storage platform The NVIDIA CMX context memory storage platform is a fully integrated storage infrastructure using the NVIDIA STX reference architecture. It uses the NVIDIA BlueField-4 data processor to create a purpose-built context memory tier operating at the pod level to bridge the gap between high-speed GPU memory and scalable shared storage. This accelerates KV cache data access and high-speed sharing across nodes within the pod to enhance performance and optimize power consumption for the growing demands of large-context inference. The platform establishes a new G3.5 layer, an Ethernet-attached flash tier optimized specifically for KV cache. This tier acts as the agentic long‑term memory of the AI infrastructure pod that is large enough to hold shared, evolving context for many agents simultaneously, but also close enough for the context to be pre‑staged frequently back into GPU and host memory without stalling decode. It provides petabytes of shared capacity per GPU pod, allowing long‑context workloads to retain history after eviction from HBM and DRAM. The history is stored in a lower‑power, flash‑based tier that extends the GPU and host memory hierarchy. The G3.5 tier delivers massive aggregate bandwidth with better efficiency than classic shared storage. This transforms KV cache into a shared, high‑bandwidth resource that orchestrators can coordinate across agents and services without rematerializing it independently on each node. With a large portion of latency-sensitive, ephemeral KV cache now served from the G3.5 tier, durable G4 object and file storage can be reserved for what truly needs to persist over time. This includes inactive multiturn KV state, query history, logs, and other artifacts of multiturn inference that may be recalled in later sessions. This reduces capacity and bandwidth pressure on G4 while still preserving application-level history where it matters. As inference scale increases, G1–G3 KV capacity grows with the number of GPUs but remains too small to cover all KV needs. CMX fills this missing KV capacity between G1–G3 and G4. Inference frameworks like NVIDIA Dynamo use their KV block managers together with NVIDIA Inference Transfer Library (NIXL) to orchestrate how inference context moves between memory and storage tiers, using CMX as the context memory layer for KV cache. KV managers in these frameworks prestage KV blocks, bringing them from CMX into G2 or G1 memory ahead of the decode phase. This reliable prestaging, backed by the higher bandwidth and better power efficiency of CMX compared to traditional storage, is designed to minimize stalls and reduce idle time, enabling up to 5x higher sustained TPS for long-context and agentic workloads. When combined with the NVIDIA BlueField-4 processor running the KV I/O plane, the system efficiently terminates NVMe-oF and object/RDMA protocols. At the inference layer, NVIDIA Dynamo and NIXL manage prefill, decode, and KV cache while coordinating access to shared context. Under that, a topology-aware orchestration layer using NVIDIA Grove places workloads across racks with awareness of KV locality so workloads can continue to reuse context even as they move between nodes. At the compute node level, KV tiering spans GPU HBM, host memory, local SSDs, CMX, and network storage, providing orchestrators with a continuum of capacity and latency targets for placing context. 
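The prestaging behavior described above can be pictured as a simple lookahead loop: while the GPU decodes the current request, the next request's KV blocks are copied up from the context tier so they are already resident when its decode begins. The snippet below is only a schematic of that idea under assumed helper names; it does not use the actual Dynamo, NIXL, or DOCA interfaces.

```python
import queue, threading, time

prefetched = {}          # request_id -> KV blocks staged in host/GPU memory
ready = queue.Queue()    # requests whose context is resident and can start decoding

def fetch_from_context_tier(request_id: str) -> bytes:
    """Stand-in for reading a request's reusable KV blocks from the pod-level context tier."""
    time.sleep(0.05)     # pretend network/flash latency
    return b"kv-blocks-for-" + request_id.encode()

def prestager(pending: "queue.Queue") -> None:
    """Runs ahead of decode: stage the next request's context before the GPU needs it."""
    while True:
        request_id = pending.get()
        if request_id is None:
            break
        prefetched[request_id] = fetch_from_context_tier(request_id)
        ready.put(request_id)    # decode can start without stalling on KV reads

pending = queue.Queue()
threading.Thread(target=prestager, args=(pending,), daemon=True).start()
for rid in ("req-1", "req-2", "req-3"):
    pending.put(rid)
for _ in range(3):
    rid = ready.get()
    print(f"decoding {rid} with {len(prefetched[rid])} bytes of staged context")
pending.put(None)
```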
Tying it all together, Spectrum-X Ethernet links Rubin compute nodes with BlueField-4 CMX target nodes, providing consistently low latency and efficient networking that integrates flash-backed context memory into the same AI-optimized fabric that serves training and inference. Powering the NVIDIA CMX context memory storage platform NVIDIA BlueField-4 powers CMX with ultra-high-speed connectivity, an integrated multi-core NVIDIA CPU, and high-bandwidth memory. Its dedicated hardware acceleration engines deliver line-rate encryption and CRC data protection, ensuring data security and integrity without compromising throughput. These crypto and integrity accelerators are designed to be used as part of the KV pipeline, securing and validating KV flows without adding host CPU overhead. By leveraging standard NVMe and NVMe-oF transports, including NVMe KV extensions, CMX maintains interoperability with standard storage infrastructure while delivering the specialized performance required for KV cache. The architecture uses BlueField‑4 to accelerate KV I/O and control plane operations, across DPUs on the Rubin compute nodes and controllers in CMX storage trays, reducing reliance on the host CPU and minimizing serialization and host memory copies. Additionally, Spectrum‑X Ethernet provides the AI‑optimized RDMA fabric that links CMX flash enclosures and GPU nodes with predictable, low‑latency, high‑bandwidth connectivity. The NVIDIA DOCA Memos framework introduces a KV communication and storage layer that treats context cache as a first class resource for KV management, sharing, and placement, leveraging the unique properties of KV blocks and inferencing patterns. DOCA Memos interfaces inference frameworks, with BlueField-4 transferring the KV cache efficiently to and from the underlying flash media. This stateless and scalable approach aligns with AI-native KV cache strategies and leverages NIXL and Dynamo for advanced sharing across AI nodes and improved inference performance. DOCA Memos supports open interfaces for broader orchestration, providing flexibility to storage partners to expand their inference solutions to cover the G3.5 context tier. Spectrum-X Ethernet serves as the high-performance network fabric for RDMA-based access to AI-native KV cache, enabling efficient data sharing and retrieval for the NVIDIA CMX context memory storage platform. Spectrum-X Ethernet is purpose-built for AI, delivering predictable, low-latency, high-bandwidth connectivity at scale. It achieves this through advanced congestion control, adaptive routing, and optimized lossless RoCE, which minimizes jitter, tail latency, and packet loss under heavy load. With very high effective bandwidth, deep telemetry, and hardware-assisted performance isolation, Spectrum-X Ethernet enables consistent, repeatable performance in large, multitenant AI fabrics while remaining fully standards-based and interoperable with open networking software. Spectrum-X Ethernet enables CMX to scale with consistent high performance, maximizing throughput and responsiveness for multiturn, agentic inference workloads. Delivering power‑efficient, high-throughput KV cache storage Power availability is the primary constraint for scaling AI factories, making energy efficiency a defining metric for gigascale inference. 
Traditional, general-purpose storage stacks sacrifice this efficiency because they run on x86‑based controllers and expend significant energy on features like metadata management, replication, and background consistency checks that are unnecessary for ephemeral, reconstructable KV data. KV cache fundamentally differs from enterprise data: it is transient, derived, and recomputable if lost. As inference context, it does not require the durability, redundancy, or extensive data protection mechanisms designed for long-lived records. Applying these heavy storage services to KV cache introduces unnecessary overhead, increasing latency and power consumption while degrading inference efficiency. By recognizing KV cache as a distinct, AI-native data class, CMX eliminates this excess overhead, enabling up to 5x improvements in power efficiency compared to general-purpose storage approaches. This efficiency extends beyond the storage tier to the compute fabric itself. By reliably prestaging context and reducing or avoiding decoder stalls, CMX prevents GPUs from wasting energy on idle cycles or redundant recomputation of history, which results in up to 5x higher TPS. This approach ensures that power is directed toward active reasoning rather than infrastructure overhead, maximizing effective tokens‑per‑watt for the entire AI pod. Enabling gigascale agentic AI with better performance and TCO NVIDIA BlueField‑4–powered CMX provides AI‑native organizations with a new way to scale agentic AI: a pod‑level context tier that extends effective GPU memory and turns KV cache into a shared high‑bandwidth, long‑term memory resource across NVIDIA Rubin pods. By offloading KV movement and treating context as a reusable, nondurable data class, CMX reduces recomputation and decode stalls, translating higher tokens‑per‑second directly into more queries served, more agents running concurrently, and shorter tail latencies at scale. Together, these gains improve total cost of ownership (TCO) by enabling teams to fit more usable AI capacity into the same rack, row, or data center, extend the life of existing facilities, and plan future expansions around GPU capacity instead of storage overhead. To learn more about NVIDIA BlueField-4-powered CMX, see the press release and the solution overview . Watch NVIDIA GTC 2026 Keynote with CEO Jensen Huang and explore related sessions . Updated on March 16, 2026, with new AI infrastructure.
What Happens When You Host an AI Café ieee_spectrum_ai 25.03.2026 14:00 0.697
Embedding sim.0.823
Entity overlap0.1053
Title sim.0.2462
Time proximity0.5655
NLP типother
NLP организацияAuburn University
NLP темаai ethics
NLP странаUnited States

Открыть оригинал

“Can I get an interview?” “Can I get a job when I graduate?” Those questions came from students during a candid discussion about artificial intelligence, capturing the anxiety many young people feel today. As companies adopt AI-driven interview screeners, restructure their workforces, and redirect billions of dollars toward AI infrastructure , students are increasingly unsure of what the future of work will look like. We had gathered people together at a coffee shop in Auburn, Alabama, for what we called an AI Café. The event was designed to confront concerns about AI directly, demystifying the technology while pushing back against the growing narrative of technological doom. AI is reshaping society at breathtaking speed. Yet the trajectory of this transformation is being charted primarily by for-profit tech companies, whose priorities revolve around market dominance rather than public welfare. Many people feel that AI is something being done to them rather than developed with them. As computer science and liberal arts faculty at Auburn University , we believe there is another path forward: one where scholars engage their communities in genuine dialogue about AI. Not to lecture about technical capabilities, but to listen, learn, and co-create a vision for AI that serves the public interest. The AI Café Model Last November, we ran two public AI Cafés in Auburn. These were informal, 90-minute conversations between faculty, students, and community members about their experiences with AI. In these conversational forums, participants sat in clusters, questions flowed in multiple directions, and lived experience carried as much weight as technical expertise. We avoided jargon and resisted attempts to “correct” misconceptions, welcoming whatever emotions emerged. One ground rule proved crucial: keeping discussions in the present, asking participants where they encounter AI today. Without that focus, conversations could easily drift to sci-fi speculation . Historical analogies—to the printing press, electricity, and smartphones—helped people contextualize their reactions. And we found that without shared definitions of AI, people talked past each other; we learned to ask participants to name specific tools they were concerned about. Organizers Xaq Frohlich, Cheryl Seals, and Joan Harrell (right) held their first AI Café in a welcoming coffee shop and bookstore. Well Red Most important, we approached these events not as experts enlightening the masses, but as community members navigating complex change together. What We Learned by Listening Participants arrived with significant frustration. They felt that commercial interests were driving AI development “without consideration of public needs,” as one attendee put it. This echoed deeper anxieties about technology, from social media algorithms that amplify division to devices that profit from “engagement” and replace meaningful face-to-face connection. People aren’t simply “afraid of AI.” They’re weary of a pattern where powerful technologies reshape their lives while they have little say. Yet when given space to voice concerns without dismissal, something shifted. Participants didn’t want to stop AI development; they wanted to have a voice in it. When we asked “What would a human-centered AI future look like?” the conversation became constructive. People articulated priorities: fairness over efficiency, creativity over automation, dignity over convenience, community over individualism. 
The three organizers, all professors at Alabama’s Auburn University, say that including people from the liberal arts fields brought new perspectives to the discussions about AI. Well Red For us as organizers, the experience was transformative. Hearing how AI affected people’s work, their children’s education, and their trust in information prompted us to consider dimensions we hadn’t fully grasped. Perhaps most striking was the gratitude participants expressed for being heard. It wasn’t about filling knowledge deficits; it was about mutual learning. The trust generated created a spillover effect, renewing faith that AI could serve the public interest if shaped through inclusive processes. How to Start Your Own AI Café The “deficit model” of science communication—where experts transmit knowledge to an uninformed public—has been discredited. Public resistance to emerging technologies reflects legitimate concerns about values, risks, and who controls decision-making. Our events point toward a better model. We urge engineering and liberal arts departments, professional societies, and community organizations worldwide to organize dialogues similar to our AI Cafés. We found that a few simple design choices made these conversations far more productive. Informal and welcoming spaces such as coffee shops, libraries, and community centers helped participants feel comfortable (and serving food and drinks helped too!). Starting with small-group discussions, where people talked with neighbors, produced more honest thinking and greater participation. Partnering with colleagues in the liberal arts brought additional perspectives on technology’s social dimensions. And by making a commitment to an ongoing series of events, we built trust. Facilitation also matters. Rather than leading with technical expertise, we began with values: We asked what kind of world participants wanted, and how AI might help or hinder that vision. We used analogies to earlier technologies to help people situate their reactions and grounded discussions in present realities, asking participants where they have encountered AI in their daily lives. We welcomed emotions constructively, transforming worry into problem solving by asking questions like: “What would you do about that?” Why Engineers Should Engage the Public Professional ethics codes remain abstract unless grounded in dialogue with affected communities. Conversations about what “responsible AI” means will look different in São Paulo than in Seoul, in Vienna than in Nairobi. What makes the AI Café model portable is its general principles: informal settings, values-first questions, present-tense focus, genuine listening. Without such engagement, ethical accountability quietly shifts to technical experts rather than remaining a shared public concern. If we let commercial interests define AI’s trajectory with minimal public input, it will only deepen divides and entrench inequities . AI will continue advancing whether or not we have public trust. But AI shaped through dialogue with communities will look fundamentally different from AI developed solely to pursue what’s technically possible or commercially profitable. The tools for this work aren’t technical; they’re social, requiring humility, patience, and genuine curiosity. The question isn’t whether AI will transform society. It’s whether that transformation will be done to people or with them. 
We believe scholars must choose the latter, and that starts with showing up in coffee shops and community centers to have conversations where we do less talking and more listening. The future of AI depends on it.
AI Aims for Autonomous Wheelchair Navigation ieee_spectrum_ai 20.03.2026 18:49 0.697
Embedding sim.0.8015
Entity overlap0.0217
Title sim.0.1452
Time proximity0.9683
NLP типscientific_publication
NLP организацияGerman Research Center for Artificial Intelligence
NLP темаrobotics
NLP странаGermany

Открыть оригинал

Wheelchair users with severe disabilities can often navigate tight spaces better than most robotic systems can. A wave of new smart-wheelchair research, including findings presented in Anaheim, Calif., earlier this month, is now testing whether AI-powered systems can, or should, fully close this gap. Christian Mandel —senior researcher at the German Research Center for Artificial Intelligence (DFKI) in Bremen, Germany— co-led a research team together with his colleague Serge Autexier that developed prototype sensor-equipped electric wheelchairs designed to navigate a roomful of potential obstacles. The researchers also tested a new safety system that integrated sensor data from the wheelchair and from sensors in the room, including from drone -based color and depth cameras . Mandel says the team’s smart wheelchairs were both semiautonomous and autonomous. “Semiautonomous is the shared control system where the person sitting in the wheelchair uses the joystick to drive,” Mandel says. “Fully autonomous is controlled by natural-language input. You say, ‘Please drive me to the coffee machine.’ ” This is a close-up of the wheelchair’s joystick and camera. DFKI The researchers conducted experiments ( part of a larger project called the Reliable and Explainable Swarm Intelligence for People With Reduced Mobility , or REXASI-PRO) using two identical smart wheelchairs that each contained two lidars, a 3D camera, odometers, user interfaces, and an embedded computer. In contrast to semiautonomous mode, where the participant controls the wheelchair with a joystick, in autonomous mode, control involves the open-source ROS2 Nav2 navigation system using natural-language input. The wheelchairs also used simultaneous localization and mapping ( SLAM ) maps and local obstacle-avoidance motion controllers. One scenario that Mandel and his team tested involved the user pressing a key on the wheelchair’s human-machine interface, speaking a command, then confirming or rejecting the instruction via that same interface. Once the user confirmed the command, the mobility device guided the user along a path to the destination, while sensors attempted to detect obstacles in the way and adjust the mobility device accordingly to avoid them. When Are Smart Wheelchairs Bad Value? According to Pooja Viswanathan, CEO & founder of the Toronto-based Braze Mobility, research in the field of mobile assistive technology should also prioritize keeping these devices readily available to everyday consumers. “Cost remains a major barrier,” she says. “Funding systems are often not designed to support advanced add-on intelligence unless there is very clear evidence of value and safety. Reliability is another barrier. A smart wheelchair has to work not just in ideal conditions, but in the messy, variable conditions of daily life. And there is also the human factors dimension. Users have different cognitive, motor, sensory, and environmental needs, so one solution rarely fits all.” For its part, Braze makes blind-spot sensors for electric wheelchairs. The sensors detect obstacles in areas that can be difficult for a user to see. The sensors can also be added to any wheelchair to transform it into a smart wheelchair by providing multimodal alerts to the user. This approach attempts to support users rather than replace them. According to Louise Devinge, a biomedical research engineer from IRISA (Research Institute of Computer Science and Random Systems) in Rennes, France, the increased complexity of smart wheelchairs demands more sensing. 
And that requires careful management of communication and synchronization within the wheelchair’s system. “The more sensing, computation, and autonomy you add,” she says, “the harder it becomes to ensure robust performance across the full range of real-world environments that wheelchair users encounter.” In the near term, in other words, the field’s biggest challenge is not about replacing the wheelchair user with AI smarts but rather about designing better partnerships between the user and the technology. This image shows data representations used by the 3D Driving Assistant. These include immutable sensor percepts such as laser scans and point clouds, as well as derived representations like the virtual laser scans and grid maps. Finally, the robot shape collection describes the wheelchair’s physical borders at different heights. DFKI Where Will Smart Wheelchairs Go From Here? Mandel says he expects to see smart wheelchairs ready for the mainstream marketplace within 10 years. Viswanathan says the REXASI-PRO system, while out of reach of present-day smart wheelchair technologies , is important for the longer term. “It reflects the more ambitious end of the smart wheelchair spectrum,” she says. “Its strengths appear to lie in intelligent navigation, advanced sensing, and the broader effort to build a wheelchair that can interpret and respond to complex environments in a more autonomous way. From a research standpoint, that is exactly the kind of work that pushes the field forward. It also appears to take seriously the importance of trustworthy and explainable AI, which is essential in any mobility technology where safety, reliability, and user confidence are paramount.” Mandel says he’s ultimately in pursuit of the inspiration that got him into this field years ago. As a young researcher, he says, he helped develop a smart wheelchair system controllable with a head joystick. However, Mandel says he realized after many trials that the smart wheelchair system he was working on had a long way to go because, as he says, “at that point in time, I realized that even persons that had severe handicaps [traveling through] a narrow passage, they did very, very well. “And then I realized, okay, there is this need for this technology, but never underestimate what [wheelchair users] can do without it.” The DFKI researchers presented their work earlier this month at the CSUN Assistive Technology Conference in Anaheim, Calif. This article was supported by the IEEE Foundation and a Jon C. Taenzer fellowship grant.
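For readers curious what the command-confirm-navigate loop described above might look like in software, here is a minimal sketch that maps a spoken destination to a stored map pose and hands it to the open source Nav2 stack through its Python commander. This is not the DFKI/REXASI-PRO code; the waypoint names, coordinates, and confirmation step are invented for illustration.

```python
# Hypothetical sketch: "Please drive me to the coffee machine" -> Nav2 goal (ROS 2 Python).
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

# Invented waypoints: map-frame (x, y) positions the wheelchair could be sent to.
WAYPOINTS = {"coffee machine": (3.2, 1.5), "window": (0.5, 4.0)}

def go_to(destination: str) -> None:
    rclpy.init()
    nav = BasicNavigator()
    nav.waitUntilNav2Active()                  # wait for localization and planner to come up

    x, y = WAYPOINTS[destination]
    goal = PoseStamped()
    goal.header.frame_id = "map"
    goal.header.stamp = nav.get_clock().now().to_msg()
    goal.pose.position.x, goal.pose.position.y = x, y
    goal.pose.orientation.w = 1.0              # face along +x; a real system would pick a heading

    # The article's workflow inserts a user confirmation between the spoken command and motion.
    if input(f"Drive to the {destination}? [y/N] ").lower() == "y":
        nav.goToPose(goal)
        while not nav.isTaskComplete():
            pass                               # obstacle avoidance runs inside Nav2's controllers
    rclpy.shutdown()

# go_to("coffee machine")
```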
Deploy SageMaker AI inference endpoints with set GPU capacity using training plans aws_ml_blog 24.03.2026 20:27 0.696
Embedding sim.0.8557
Entity overlap0.1212
Title sim.0.2326
Time proximity0.2504
NLP типother
NLP организацияAmazon Web Services
NLP темаai infrastructure
NLP страна

Открыть оригинал

Deploying large language models (LLMs) for inference requires reliable GPU capacity, especially during critical evaluation periods, limited-duration production testing, or burst workloads. Capacity constraints can delay deployments and impact application performance. Customers can use Amazon SageMaker AI training plans to reserve compute capacity for specified time periods. Originally designed for training workloads, training plans now support inference endpoints, providing predictable GPU availability for time-bound inference workloads. Consider a common scenario: you’re on a data science team that must evaluate several fine-tuned language models over a two-week period before selecting one for production. They require uninterrupted access to ml.p5.48xlarge instances to run comparative benchmarks, but on-demand capacity in their AWS Region is unpredictable during peak hours. By reserving capacity through training plans, they can run evaluations uninterrupted with controlled costs and predictable availability. Amazon SageMaker AI training plans offer a flexible way to secure capacity so you can search for available offerings, select the instance type, quantity, and duration that match your needs. Customers can select a fixed number of days or months into the future, or a specified number of days at a stretch, to create a reservation. After created, the training plan provides a set capacity that can be referenced when deploying SageMaker AI inference endpoints. In this post, we walk through how to search for available p-family GPU capacity, create a training plan reservation for inference, and deploy a SageMaker AI inference endpoint on that reserved capacity. We follow a data scientist’s journey as they reserve capacity for model evaluation and manage the endpoint throughout the reservation lifecycle. Solution overview SageMaker AI training plans provide a mechanism to reserve compute capacity for specific time windows. When creating a training plan, customers specify their target resource type. By setting the value of the target resource to “endpoint”, you can secure p-family GPU instances specifically for inference workloads. The reserved capacity is referenced through an Amazon Resource Name (ARN) in the endpoint configuration so that the endpoint deploys the reserved instances. The training plan creation and utilization workflow consists of four key phases: Identify your capacity requirements – Determine the instance type, instance count, and duration needed for your inference workload. Search for available training plan offerings – Query available capacity that matches your requirements and desired time window. Create a training plan reservation – Select a suitable offering and create the reservation, which generates an ARN. Deploy and manage your endpoint – Configure your SageMaker AI endpoint to use the reserved capacity and manage its lifecycle during the reservation period. Let’s walk through each phase with detailed examples. Prerequisites Before starting, ensure that you have the following: An IAM execution role with SageMaker AI access A trained model uploaded to Amazon Simple Storage Service (Amazon S3) AWS Command Line Interface (AWS CLI) installed and configured, or access to the SageMaker AI console A created training plan for SageMaker AI Step 1: Search for available capacity offerings and create a reservation plan Our data scientist begins by identifying available p-family GPU capacity that matches their evaluation requirements. 
They need one ml.p5.48xlarge instance for a week-long evaluation starting in late January. Using the search-training-plan-offerings API, they specify the instance type, instance count, duration, and time window. Setting target resources to “endpoint” configures the capacity to be provisioned specifically for inference rather than training jobs. # List training plan offerings with instance type, instance count, # duration in hours, start time after, and end time before. aws sagemaker search-training-plan-offerings \ --target-resources "endpoint" \ --instance-type "ml.p5.48xlarge" \ --instance-count 1 \ --duration-hours 168 \ --start-time-after "2025-01-27T15:48:14-04:00" \ --end-time-before "2025-01-31T14:48:14-05:00" Example output { "TrainingPlanOfferings": [ { "TrainingPlanOfferingId": "tpo-SHA-256-hash-value", "TargetResources": ["endpoint"], "RequestedStartTimeAfter": "2025-01-21T12:48:14.704000-08:00", "DurationHours": 168, "DurationMinutes": 10080, "UpfrontFee": "xxxx.xx", "CurrencyCode": "USD", "ReservedCapacityOfferings": [ { "InstanceType": "ml.p5.48xlarge", "InstanceCount": 1, "AvailabilityZone": "us-west-2a", "DurationHours": 168, "DurationMinutes": 10080, "StartTime": "2025-01-27T15:48:14-04:00", "EndTime": "2025-01-31T14:48:14-05:00" } ] } ] } The response provides detailed information about each available capacity block, including the instance type, quantity, duration, Availability Zone, and pricing. Each offering includes specific start and end times, so you can select a reservation that aligns with your deployment schedule. In this case, the team finds a 168-hour (7-day) reservation in us-west-2a that fits their timeline. After identifying a suitable offering, the team creates the training plan reservation to secure the capacity: aws sagemaker create-training-plan \ --training-plan-offering-id "tpo-SHA-256-hash-value" \ --training-plan-name "p4-for-inference-endpoint" Example output: { "TrainingPlanArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint" } The TrainingPlanArn uniquely identifies the reserved capacity. You save this ARN, it’s the key that will link their endpoint to the set p-family GPU capacity. With the reservation confirmed and paid for, they’re now ready to configure their inference endpoint. Using the SageMaker AI console You can also create training plans through the SageMaker AI console. This provides a visual interface for searching capacity and completing the reservation. The console workflow follows three steps: search for offerings, add plan details, and review and purchase. Navigating to Training Plans: In the SageMaker AI console, navigate to Model training & customization in the left navigation pane. Select Training plans. Choose Create training plan (orange button in the upper right). The following screenshot shows the Training Plans landing page where you initiate the creation workflow. Figure 1: Training Plans landing page with Create training plan button Step A – Search for training plan offerings: Under Target , select Inference Endpoint. Under Compute type , select Instance. Select your Instance type (for example, ml.p5.48xlarge ) and Instance count. Under Date and duration , specify the start date and duration. Choose Find training plan. 
The following screenshot shows the search interface with Inference Endpoint selected and the criteria filled in: Figure 2: Step A – Search training plan offerings with Inference Endpoint target After selecting Find training plan , the Available plans section displays matching offerings: Figure 3: Available training plan offerings with pricing and availability details Complete the reservation: Choose a plan by selecting the radio button next to your preferred offering. Choose Next to proceed to Step B: Add plan details. Review the details and choose Next to proceed to Step C: Review and purchase. Review the final summary, accept the terms, and choose Purchase to complete the reservation. After the reservation is created, you receive a training plan ARN. With the reservation confirmed and paid for, you’re now ready to configure your inference endpoint using this ARN. The endpoint will only function during the reservation window specified in the training plan. Step 2: Create the endpoint configuration with training plan reservation With the reservation secured, the team creates an endpoint configuration that binds their inference endpoint to the reserved capacity. The critical step here is including the CapacityReservationConfig object in the ProductionVariants section, where they set the MlReservationArn to the training plan ARN received earlier: aws sagemaker create-endpoint-config \ --endpoint-config-name "ftp-ep-config" \ --production-variants '[{ "VariantName": "AllTraffic", "ModelName": "my-model", "InitialInstanceCount": 1, "InstanceType": "ml.p5.48xlarge", "InitialVariantWeight": 1.0, "CapacityReservationConfig": { "CapacityReservationPreference": "capacity-reservations-only", "MlReservationArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint" } }]' When SageMaker AI receives this request, it validates that the ARN points to an active training plan reservation with a target resource type of “endpoint”. If validation succeeds, the endpoint configuration is created and becomes eligible for deployment. The CapacityReservationPreference setting is particularly important. By setting it to capacity-reservations-only, the team restricts the endpoint to their reserved capacity, so it stops serving traffic when the reservation ends, preventing unexpected charges. Step 3: Deploy the endpoint on reserved capacity With the endpoint configuration ready, the team deploys their evaluation endpoint: aws sagemaker create-endpoint \ --endpoint-name "my-endpoint" \ --endpoint-config-name "ftp-ep-config" The endpoint now runs entirely within the reserved training plan capacity. SageMaker AI provisions the ml.p5.48xlarge instance in us-west-2a and loads the model; this process can take several minutes. After the endpoint reaches InService status, the team can begin their evaluation workload. Step 4: Invoke the endpoint while the training plan is active With the endpoint in service, the team can begin running their evaluation workload. They invoke the endpoint for real-time inference, sending test prompts and measuring response quality, latency, and throughput: aws sagemaker-runtime invoke-endpoint \ --endpoint-name "my-endpoint" \ --body fileb://input.json \ --content-type "application/json" \ output.json During the active reservation window, the endpoint operates normally with its reserved capacity. All invocations are processed using the reserved resources, helping to facilitate predictable performance and availability.
The team can run their benchmarks without worrying about capacity constraints or performance variability from shared infrastructure.

Step 5: Invoke the endpoint after the training plan has expired

It's worth understanding what happens if the training plan reservation expires while the endpoint is still deployed. When the reservation expires, endpoint behavior depends on the CapacityReservationPreference setting. Because the team set it to capacity-reservations-only, the endpoint stops serving traffic and invocations fail with a capacity error:

aws sagemaker-runtime invoke-endpoint \
  --endpoint-name "my-endpoint" \
  --body fileb://input.json \
  --content-type "application/json" \
  output.json

Expected error response:

{
  "Error": {
    "Code": "ModelError",
    "Message": "Endpoint capacity reservation has expired. Please update endpoint configuration."
  }
}

To resume service, you must either create a new training plan reservation and update the endpoint configuration, or update the endpoint to use on-demand or ODCR capacity. In the team's case, because they completed their evaluation, they delete the endpoint rather than extending the reservation.

Step 6: Update the endpoint

During the evaluation period, you might need to update the endpoint for various reasons. SageMaker AI supports several update scenarios while maintaining the connection to reserved capacity.

Update to a new model version

Midway through the evaluation, the team wants to test a new model version that incorporates additional fine-tuning. They can update to the new model version while keeping the same reserved capacity:

# First, create a new endpoint configuration with the updated model
aws sagemaker create-endpoint-config \
  --endpoint-config-name "ftp-ep-config-v2" \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "my-model-v2",
    "InitialInstanceCount": 1,
    "InstanceType": "ml.p5.48xlarge",
    "InitialVariantWeight": 1.0,
    "CapacityReservationConfig": {
      "CapacityReservationPreference": "capacity-reservations-only",
      "MlReservationArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint"
    }
  }]'

# Then update the endpoint
aws sagemaker update-endpoint \
  --endpoint-name "my-endpoint" \
  --endpoint-config-name "ftp-ep-config-v2"

Migrate from training plan to on-demand capacity

If the team's evaluation runs longer than expected, or if they want to transition the endpoint to production use beyond the reservation period, they can migrate to on-demand capacity:

# Create an endpoint configuration without a training plan reservation
aws sagemaker create-endpoint-config \
  --endpoint-config-name "ondemand-ep-config" \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "my-model",
    "InitialInstanceCount": 1,
    "InstanceType": "ml.p5.48xlarge",
    "InitialVariantWeight": 1.0
  }]'

# Update the endpoint to use on-demand capacity
aws sagemaker update-endpoint \
  --endpoint-name "my-endpoint" \
  --endpoint-config-name "ondemand-ep-config"

Step 7: Scale the endpoint

In some scenarios, teams can reserve more capacity than they initially deploy, giving them flexibility to scale up if needed. For example, if the team reserved two instances but initially deployed only one, they can scale up during the evaluation period to test higher throughput scenarios.

Scale within reservation limits

Suppose the team initially reserved two ml.p5.48xlarge instances but deployed their endpoint with only one instance.
Later, they want to test how the model performs under higher concurrent load:

# Create a new config with an increased instance count (within the reservation)
aws sagemaker create-endpoint-config \
  --endpoint-config-name "ftp-ep-config-scaled" \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "my-model",
    "InitialInstanceCount": 2,
    "InstanceType": "ml.p5.48xlarge",
    "InitialVariantWeight": 1.0,
    "CapacityReservationConfig": {
      "CapacityReservationPreference": "capacity-reservations-only",
      "MlReservationArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint"
    }
  }]'

aws sagemaker update-endpoint \
  --endpoint-name "my-endpoint" \
  --endpoint-config-name "ftp-ep-config-scaled"

Attempt to scale beyond the reservation

If you attempt to scale beyond the reserved capacity, the update will fail:

# This will fail if the reservation only has 2 instances
aws sagemaker create-endpoint-config \
  --endpoint-config-name "ftp-ep-config-over-limit" \
  --production-variants '[{
    "VariantName": "AllTraffic",
    "ModelName": "my-model",
    "InitialInstanceCount": 3,
    "InstanceType": "ml.p5.48xlarge",
    "InitialVariantWeight": 1.0,
    "CapacityReservationConfig": {
      "CapacityReservationPreference": "capacity-reservations-only",
      "MlReservationArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint"
    }
  }]'

Expected error:

{
  "Error": {
    "Code": "ValidationException",
    "Message": "Requested instance count (3) exceeds reserved capacity (2) for training plan."
  }
}

Step 8: Delete the endpoint

After completing their week-long evaluation, the team has gathered all the performance metrics that they need and selected their top-performing model. They're ready to clean up by deleting the inference endpoint:

aws sagemaker delete-endpoint \
  --endpoint-name "my-endpoint"

To fully clean up, also delete the endpoint configuration:

aws sagemaker delete-endpoint-config \
  --endpoint-config-name "ftp-ep-config"

The training plan reservation automatically expires at the end of the reservation window, and you are charged for the full reservation period regardless of when you delete the endpoint.

Important considerations: Deleting an endpoint doesn't refund or cancel the training plan reservation. The reserved capacity remains allocated until the reservation window expires, regardless of whether the endpoint is still running. However, if the reservation is still active and capacity is available, you can create a new endpoint using the same training plan reservation ARN and use the remaining time.

When setting up your training plan reservation, keep in mind that you're committing to a fixed window of time and will be charged for the full duration upfront, regardless of how long you actually use it. Before purchasing, make sure that your estimated timeline aligns with the reservation length that you choose. If your evaluation finishes early, the cost doesn't change: a 7-day reservation costs the same whether you use all seven days or only five. The upside is that this predictable, upfront cost structure helps you budget accurately for your project, because you know exactly what you're spending before you start.
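If the reservation is still active after cleanup and the team wants to reuse the remaining time, redeployment can be scripted as well. The following is a minimal boto3 sketch that mirrors the CLI JSON shown above; the "-redeploy" resource names are hypothetical, and the client should be created in the Region where the training plan was purchased.

import boto3

sagemaker = boto3.client("sagemaker")  # use the Region where the plan was purchased

PLAN_ARN = "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint"

# Recreate an endpoint config bound to the same, still-active reservation.
sagemaker.create_endpoint_config(
    EndpointConfigName="ftp-ep-config-redeploy",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.p5.48xlarge",
        "InitialVariantWeight": 1.0,
        "CapacityReservationConfig": {
            "CapacityReservationPreference": "capacity-reservations-only",
            "MlReservationArn": PLAN_ARN,
        },
    }],
)

# Deploy a fresh endpoint on the remaining reserved capacity.
sagemaker.create_endpoint(
    EndpointName="my-endpoint-redeploy",
    EndpointConfigName="ftp-ep-config-redeploy",
)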
Conclusion

SageMaker AI training plans provide a straightforward way to reserve p-family GPU capacity and deploy SageMaker AI inference endpoints with predictable availability. This approach is recommended for time-bound workloads such as model evaluation, limited-duration production testing, and burst scenarios where predictable capacity is essential. As we saw in our data science team's journey, the process involves identifying capacity requirements, searching for available offerings, creating a reservation, and referencing that reservation in the endpoint configuration to deploy the endpoint during the reservation window. The team completed their week-long model evaluation with a set capacity, avoiding the unpredictability of on-demand availability during peak hours. They could focus on their evaluation metrics rather than worrying about infrastructure constraints. With support for endpoint updates, scaling within reservation limits, and seamless migration to on-demand capacity, training plans give you the flexibility to manage inference workloads while maintaining control over GPU availability and costs. Whether you're running competitive model benchmarks, performing limited-duration A/B tests, or handling predictable traffic spikes, training plans for inference endpoints provide the capacity that you need with transparent, upfront pricing.

Acknowledgement

Special thanks to Alwin (Qiyun) Zhao, Piyush Kandpal, Jeff Poegel, Qiushi Wuye, Jatin Kulkarni, Shambhavi Sudarsan, and Karan Jain for their contributions.

About the authors

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling Gen AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads for Expedia, and was a management consultant at McKinsey.

Chaoneng Quan is a Software Development Engineer on the AWS SageMaker team, building AI infrastructure and GPU capacity management systems for large-scale training and inference workloads. He designs scalable distributed systems that enable customers to forecast demand, reserve compute capacity, and operate workloads with predictability and efficiency. His work spans resource planning, infrastructure reliability, and large-scale compute optimization.

Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.

Yati Agarwal is a Senior Product Manager at Amazon Web Services (AI Platform). She owns the end-to-end capacity strategy for AI workloads, ensuring that the infrastructure powering the most demanding machine learning use cases is available, scalable, and reliable. Her scope spans the full AI development lifecycle, from foundation model training and fine-tuning at large scale, to inference serving real-time and batch customer workloads, to interactive ML development environments where data scientists and engineers iterate and experiment.
OpenAI to acquire Astral openai 19.03.2026 00:00 0.693
Embedding sim.0.8135
Entity overlap0
Title sim.0.0704
Time proximity0.9345
NLP type other
NLP organization
NLP topic code generation
NLP country

Open original

Accelerates Codex growth to power the next generation of Python developer tools
Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt | NVIDIA Technical Blog nvidia_dev_blog 25.03.2026 11:00 0.692
Embedding sim.0.8004
Entity overlap0.1071
Title sim.0.2394
Time proximity0.7202
NLP type other
NLP organization NVIDIA
NLP topic ai infrastructure
NLP country

Open original

In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is converted into revenue-generating intelligence—the defining metric for modern AI infrastructure. AI data centers now operate as token factories tied directly to the energy ecosystem, where access to land, power, and shell determines deployment, and efficiency determines output. Increasing revenue within a fixed power envelope depends entirely on maximizing intelligence per watt across AI infrastructure and across the five-layer AI cake ecosystem. This post walks through how NVIDIA architectures, systems, and AI factory software maximize performance per watt at every layer of the stack, and how those efficiency gains translate into higher token throughput and revenue per megawatt. Compounding performance per watt across NVIDIA GPU architectures NVIDIA architectures and platforms are engineered to increase the amount of intelligence produced per watt with each generation. Across six architecture generations, NVIDIA has improved inference throughput per megawatt by 1,000,000x (Figure 1). To put this in perspective, if the average fuel efficiency of a car had improved as swiftly as chips over a similar time period, one gallon of gas would suffice for a trip to the moon and back. Figure 1. Inference energy efficiency has increased 1,000,000x over six generations of NVIDIA architectures NVIDIA Hopper introduced many architecture innovations that significantly increased energy efficiency over the prior generation. Key to these gains is the Hopper Transformer Engine, which combines fourth-generation Tensor Core technology with FP8 acceleration and software to dramatically increase performance per watt. NVIDIA Blackwell advanced this foundation with improvements across high-bandwidth memory (HBM), NVIDIA NVLink switch and fabric (for the NVL72 rack-scale design and NVIDIA HGX architecture), and NVFP4 -enabled Tensor Cores, increasing throughput per watt. Recent SemiAnalysis InferenceX data shows that NVIDIA software optimizations and NVIDIA Blackwell Ultra GB300 NVL72 systems deliver up to 50x higher throughput per megawatt and 35x lower token cost than Hopper for DeepSeek-R1. The NVIDIA Vera Rubin platform further boosts efficiency. Rubin GPUs, Vera CPUs, NVLink 6, and full‑rack thermals are co-designed as a single AI factory platform. Notably, the NVIDIA Vera CPU delivers 2x efficiency and 50% higher performance compared to traditional CPUs . This end-to-end approach enables up to 10x higher inference throughput per megawatt and about 10x lower token cost versus Blackwell for AI factories for Kimi K2 (32K/8K). Paired with NVIDIA Groq 3 LPX , Vera Rubin delivers up to 35x higher throughput per megawatt and 10x more revenue for trillion-parameter, high-context workloads , creating a new premium tier of ultralow-latency, high-throughput inference. These efficiency gains are evident in AI workloads, and are also reflected in broader measures of compute performance. The HPC and supercomputing community uses the Green500 benchmark to measure high-precision (FP64) efficiency, and NVIDIA supercomputing systems top the leadership board, with nine of the top ten systems accelerated by NVIDIA technologies. Building for efficiency with extreme co-design Achieving these massive efficiency gains over architecture generations requires designing efficiency into every layer of the stack. 
NVIDIA approaches this as an extreme co-design problem—optimizing from chip design and manufacturing, through system-level innovations like liquid cooling, to AI factory orchestration. Each layer compounds the next: efficient design reduces wasted energy, cooling shifts power to compute, and software ensures every watt produces useful work. Engineering efficiency at the source Efficiency begins before silicon reaches the AI factory. NVIDIA is optimizing the manufacturing pipeline itself to deliver more energy-efficient chips, faster. For example, the NVIDIA cuLitho library for accelerated computational lithography re‑implements the core primitives of computational lithography on GPUs. It accelerates mask synthesis by up to 70x and allows a few hundred NVIDIA DGX‑class systems to replace tens of thousands of CPU servers. In practice, this means moving from two‑week photomask cycles to overnight runs, using about one‑ninth the power and one‑eighth the physical footprint, while enabling advanced techniques like inverse lithography and curvilinear masks. Figure 2. NVIDIA cuLitho for accelerated computational lithography boosts performance for mask synthesis by up to 58x for Curvilinear optical proximity correction (OPC) and up to 70x for Manhattan OPC At the materials layer, NVIDIA cuEST is a CUDA-X library designed to accelerate first-principles quantum chemistry applications on NVIDIA GPUs. It turns quantum‑chemistry‑based electronic‑structure calculations into a production tool. By delivering speedups of up to 55x on density functional theory and related workloads, cuEST enables device and process engineers to explore new, lower‑leakage materials stacks at industrial scale instead of on a few handpicked candidates. The result is a pipeline where the materials and devices are tuned for lower leakage and better switching behavior, feeding directly into higher performance per watt at the transistor level. That design‑time acceleration is amplified by GPU‑accelerated Electronic Design Automation (EDA) flows. In collaboration with other EDA leaders, NVIDIA is pushing electronic design and automation workloads onto GPUs , yielding up to 15x faster iterations on critical blocks. Faster iteration enables more opportunities to optimize design and verification flows, IR drop, clocking, and thermal hotspots. In turn, this yields floorplans and power grids that waste less energy as heat and deliver more of the input power to active compute. In other words, GPU‑accelerated EDA and manufacturing tools turn performance per watt into an explicit objective function. Figure 3. EDA workloads GPU-accelerated by various CUDA-X libraries Together, these advances make the design and manufacturing pipeline more efficient—reducing the time, energy, and infrastructure required to deliver next-generation chips. Cooling as a performance per watt multiplier Improving performance per watt does not stop at the chip. How systems are cooled also impacts how much power is available for computation. NVIDIA Blackwell systems reduce cooling overhead, operating around 1.25 PUE, with about 20% of capacity air‑cooled. This shifts more energy to compute than previous generations, delivering up to 25x higher energy efficiency and over 300x better water efficiency compared to traditional air‑cooled architectures .  
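To make the PUE arithmetic concrete: PUE is total facility power divided by IT power, so a fixed grid allocation leaves facility_power / PUE for compute. A rough, illustrative calculation follows; the 100 MW allocation and the 1.5 baseline are made-up numbers, while 1.25 and 1.1 are the PUE values cited in this section.

# PUE = total facility power / IT power, so the power left for compute
# at a fixed grid allocation is facility_power / PUE.
FACILITY_MW = 100  # illustrative fixed grid allocation

for pue in (1.5, 1.25, 1.1):
    print(f"PUE {pue}: {FACILITY_MW / pue:.1f} MW available for compute")

# PUE 1.5: 66.7 MW available for compute
# PUE 1.25: 80.0 MW available for compute
# PUE 1.1: 90.9 MW available for compute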
NVIDIA Vera Rubin further improves energy efficiency by moving to 100% liquid cooling and tightening the die‑to‑water thermal path, enabling AI factories to run at 1.1 PUE without a proportional increase in cooling energy or water draw. Maintaining 45°C inlet water preserves silicon temperatures and reliability, while improved thermal transfer delivers higher performance per watt than Blackwell. In many climates, 45°C inlet water can be cooled largely with ambient air, dramatically reducing compressor runtime so chillers run less, while more of the power budget shifts from cooling to generating tokens. By contrast, lower-temperature cooling requirements depend more heavily on compressor‑based systems, diverting a larger share of the facility’s limited grid allocation into cooling instead of compute. Translating efficiency into tokens As tokens per watt increase, more billable AI work fits within a fixed power envelope, lowering cost per token and expanding margins. Realizing those gains requires closing the gap between grid supply and usable compute. At gigawatt scale, up to 40% of the power can be lost before it reaches compute. Power is lost through cooling inefficiencies, while traditional overprovisioning wastes capacity. In addition, running too close to thermal or electrical limits risks faults. NVIDIA DSX closes this gap. Vera Rubin DSX AI Factory reference design and Omniverse digital twin blueprint treat the AI factory as a dynamic system, continuously monitoring and adjusting power, cooling, and workload behavior. Systems operate at Max-Q—the point of highest performance per watt—rather than inefficient peaks. Domain Power Service, Workload Power Profiles, and Mission Control orchestrate racks and clusters for energy efficient operation. For a 500 MW AI factory, DSX Max-Q helps ecosystem partners operate AI factories with up to 30% more GPUs within the same power envelope and higher throughput per watt, while DSX Flex aligns demand with real-time grid conditions to unlock stranded capacity. Industry leaders demonstrate that AI factories with agentic liquid cooling and Max-Q operation deliver more tokens per watt . Every watt not spent on cooling or idle capacity becomes a watt that generates tokens—and revenue. Video 1. Learn how NVIDIA DSX helps developers optimize token throughput, resilience, and energy use across physical, electrical, thermal, and network systems From tokens to revenue per megawatt Inference drives revenue. Tokens are the unit of intelligence, and throughput per megawatt defines the AI factory revenue potential. With capped power and exploding demand, operators must track throughput and token rate as closely as revenue and margin. As models grow, context windows expand, and output lengths increase. As NVIDIA CEO Jensen Huang explained during the GTC 2026 Keynote , AI offerings will form a spectrum: free tiers attract users, mid-tier models balance scale and speed, and premium tiers with massive context windows and extreme throughput command high prices per million tokens. Smarter models command higher prices, making each move up the curve a direct revenue lever. NVIDIA platforms like Hopper, Blackwell, and Vera Rubin push the tokens-per-watt curve upward, particularly at high-value tiers. Blackwell increased throughput 35x where monetization is concentrated. Vera Rubin moves premium tiers another order of magnitude. 
Extreme co-design, NVL72-scale systems, and ultralow-latency interconnects enable higher-value tiers at higher density within the same power envelope. For operators, the metric is simple: revenue per megawatt. A one-gigawatt AI factory allocates power across free, mid, premium, and ultra tiers. The weighted product of throughput and price becomes the revenue engine. Moving to the next hardware generation can yield 5x or more revenue for the same power. Adding specialized systems, like ultralow-latency slices for engineering workloads, unlocks additional step changes. Every gain in inference performance and efficiency compounds economic output.

Figure 4. NVIDIA Vera Rubin and NVIDIA Groq 3 LPX expand revenue per gigawatt by 10x

In today's environment of capped power and soaring AI demand, the efficiency and throughput gains achieved with extreme co-design across NVIDIA AI infrastructure only matter if they're captured at scale. NVIDIA Omniverse DSX Blueprint ensures that AI factories operate continuously at peak efficiency, turning every available watt into useful compute.

Learn more

Power is the ultimate constraint for modern AI: with grid capacity fixed, maximizing performance per watt—the rate at which energy is converted into revenue‑generating tokens—is the defining metric for AI infrastructure. NVIDIA architectures and platforms are engineered to increase the amount of intelligence produced per watt with each generation. Across six architecture generations, NVIDIA has improved inference throughput per megawatt by 1,000,000x. To learn more, explore how industry leaders are scaling intelligence within power constraints, increasing intelligence per watt, and advancing energy-efficient chip design at CERAWeek 2026.

About the Authors

Kibibi Moseley is a senior product marketing manager at NVIDIA in Energy Efficiency, Sustainability and AI for Science. Previously she was a senior product marketing manager in Data Center and Artificial Intelligence at Intel where she drove critical launch workstreams for 2nd, 3rd, and 4th generation Intel Xeon Scalable Processors and portfolio products. She has a B.S. in industrial engineering from UC Berkeley and an M.S. in management science and engineering and MBA from Stanford University.

Kristen Perez is a writer for NVIDIA High-Performance Computing and Accelerated Computing solutions. She focuses on sharing meaningful stories highlighting the performance and research breakthroughs that developers can achieve with NVIDIA products.

Pawini Mahajan is a senior product marketing manager for Semiconductor and EDA at NVIDIA, where she drives go-to-market strategy at the intersection of AI, accelerated computing, and chip design. She works closely with ecosystem partners to enable GPU-accelerated EDA and advance next-generation silicon development. Previously, Pawini was a product manager at Synopsys and an Engineering Manager at Intel, where she built deep expertise across the semiconductor lifecycle.
She holds a master's degree in Electrical Engineering from Arizona State University and is passionate about mentoring the next generation of engineers.
The Problem with AI Anxiety in 2026 ai_supremacy 23.03.2026 09:31 0.691
Embedding sim.0.8102
Entity overlap0.0426
Title sim.0.0933
Time proximity0.8778
NLP type other
NLP organization a16z
NLP topic ai anxiety
NLP country United States

Open original

Siphon. The Problem with AI Anxiety in 2026. It's only going to get worse. Labor market, Iran war, Trump, Oligarchs, Autonomous agents, costly AI weapons, Ban-AI 2028 movement. Michael Spencer, Mar 23, 2026. Is AI Anxiety on the rise? In this article I'm going to try to illustrate and think about AI anxiety from a number of different perspectives. We shouldn't seek to quell AI anxiety, we should embrace and analyze it. The truth is, the U.S. labor market is in serious trouble, and it has little to do with AI so far. It's hard to think seriously about the future of knowledge-work when we have a labor market with so many issues. a16z tries to make the case that this is due to AI, but that's not correct. The U.S. job market is the worst in decades. And the main cause of this is a completely incompetent government. Tariffs, questionable immigration policies, geopolitical self-harm, weaponizing trade in diplomacy, unpopular interference, starting unjustified wars, peculiar Government reform; maybe you have seen this infographic? Clearly this is not an Administration that cares about the economic well-being of Americans, nor the future of Americans and the post-graduation job experience. The "AI revolution" the Trump Admin is boosting isn't creating any new jobs. It's the job of technology executives to make decisions that benefit shareholders, even if that means blaming AI for other problems in the business. Maybe you've seen this infographic too? Are Tech layoffs about to spike? (2026-2027) Tech layoffs with AI agentic explanations are trending. How popular is citing AI for layoffs going to get in 2026 and 2027? We are prone to anxiety because FOMO and FUD are two sides of how marketing works for a democracy without a functioning media. If a democracy no longer has a functioning media, is that still a real democracy? AI anxiety is real, especially when recessions become more likely and young people begin to have more trouble finding a job. Especially when the tools the financial elite say augment them are actually hurting their critical thinking abilities. Young people today are going to grow into a different world. Matt Zieger analyzed @karpathy's AI jobs exposure dashboard (which was taken down, then restored), but it bugged Matt that it didn't include actual labor market dynamics such as: industry adoption speed, worker adaptability, demand elasticity (if cheaper, people buy more), and complementarity (does AI replace or boost). Beyond Exposure: What Actually Predicts Displacement. Karpathy literally deleted his experiment from X. His analysis supposedly revealed that jobs with higher-paying salaries had a worse average score, while people earning less than $35,000 had the lowest exposure. AI Anxiety with Recessionary Characteristics. The odds of the US entering recession are rising: the probability of a recession over the next 12 months jumped to 48.6% in February, the highest since the 2020 pandemic. American consumers might even blame AI, far from embracing it. 🤔 Chief Economist Mark Zandi recently reported that Moody's AI-based economic model now places the odds of a recession at 49%. However, Zandi cautioned that this figure was calculated just before the full impact of the recent conflict in Iran and the resulting spike in energy prices. He expects that once these factors are fully integrated into the next data release, the probability will exceed 50%.
Just a bit over fifty percent; is that really something to be anxious about? It's totally normal to be anxious about the potential for technological automation, but throughout history it's also fairly common to have periods of uncertainty about the future. Leaders messing things up is also nothing new to civilization. Venture Capitalists have infiltrated the U.S. Administration. While Americans are struggling, the Pentagon is spending money on an Iran war, and on contracts to Anduril, Palantir and OpenAI. America's tax dollars are being siphoned. Leaders are testing AI targeting with the tools powerful venture capitalists bet on years ago, investors who have now won significant lobbying power in Washington, and that should make you anxious. Military AI is worth being anxious about. If you give Anduril, Palantir or OpenAI more money, I can bet on the future impact on the world it will have. Maybe Polymarket should consider which scenarios of AI dystopia are now most likely. Anduril, Palantir and OpenAI seem to be favored by the Trump Administration and the Pentagon. What a big surprise. Let's make war productive with AI. AI Anxiety, the Post-Boom World and Opportunity Collapse. Young people in China, where youth unemployment has spiked noticeably in the past five years, have a legit right to feel anxious. But in the U.S., poor policy, changing demographics, less immigration and tech panic are a toxic combination that is leading to a frozen labor market: the U.S. youth unemployment rate has been climbing since the pandemic. No wonder they are anxious about AI. "Employment rates for young workers, whether college educated or not, have both been in roughly equal decline for the past two years….Going back to 1976, the share of 20-somethings in the workforce has declined ~12%." - a16z. Blogger Tony Peng recently noted that in 2026, and over the past few weeks, the phrase "AI anxiety" kept showing up in the podcasts, articles, and videos in China he came across, picking up pace especially after the OpenClaw mania swept the country. More autonomous products are launching, including, just recently, a strange brand of new products that insist they should be in control of our computers. AI Anxiety and the Autonomous Machines Wave 🌊. Giving over computer control to autonomous agents is supposedly the golden path to AGI. But do we want to give over control of our computers to autonomous AI and personal assistants? Manus AI Desktop Perplexity Personal Computer OpenClaw Personal AI Assistant Alibaba Wukong Work Platform Claude Cowork Dispatch OpenAI Operator Google Project Jarvis Anthropic Compute Use OpenAI "Computer Using Agent" (CUA) OpenAI Research Scientist (What Prism becomes) Google Project Astra. With gigantic IPOs of AI-involved companies like SpaceX, OpenAI and Anthropic on the way in mere months, hype must be continued at all costs. These keywords are supposed to make you anxious. AI anxiety is converging with many other macro trends, and it's making us question the future of jobs and our place in the world. AI anxiety is also coming at a peculiar time of geopolitical disruption where trust in leadership and institutions is on the decline.
Advancing Open Source AI, NVIDIA Donates Dynamic Resource Allocation Driver for GPUs to Kubernetes Community nvidia_blog 24.03.2026 08:00 0.687
Embedding sim.0.7941
Entity overlap0.0667
Title sim.0.1688
Time proximity0.8513
NLP type partnership
NLP organization NVIDIA
NLP topic ai infrastructure
NLP country Netherlands

Open original

Advancing Open Source AI, NVIDIA Donates Dynamic Resource Allocation Driver for GPUs to Kubernetes Community In addition, NVIDIA announced at KubeCon Europe a confidential containers solution for GPU-accelerated workloads, updates to the NVIDIA KAI Scheduler and new open source projects to enable large-scale AI workloads. March 24, 2026 by Justin Boitano 0 Comments Share Share This Article X Facebook LinkedIn Copy link Link copied! Artificial intelligence has rapidly emerged as one of the most critical workloads in modern computing. For the vast majority of enterprises, this workload runs on Kubernetes, an open source platform that automates the deployment, scaling and management of containerized applications. To help the global developer community manage high-performance AI infrastructure with greater transparency and efficiency, NVIDIA is donating a critical piece of software — the NVIDIA Dynamic Resource Allocation (DRA) Driver for GPUs — to the Cloud Native Computing Foundation (CNCF), a vendor-neutral organization dedicated to fostering and sustaining the cloud-native ecosystem.  Announced today at KubeCon Europe, CNCF’s flagship conference running this week in Amsterdam, the donation moves the driver from being vendor-governed to offering full community ownership under the Kubernetes project. This open environment encourages a wider circle of experts to contribute ideas, accelerate innovation and help ensure the technology stays aligned with the modern cloud landscape.  “NVIDIA’s deep collaboration with the Kubernetes and CNCF community to upstream the NVIDIA DRA Driver for GPUs marks a major milestone for open source Kubernetes and AI infrastructure,” said Chris Aniszczyk, chief technology officer of CNCF. “ By aligning its hardware innovations with upstream Kubernetes and AI conformance efforts, NVIDIA is making high-performance GPU orchestration seamless and accessible to all.” In addition, in collaboration with the CNCF’s Confidential Containers community, NVIDIA has introduced GPU support for Kata Containers, lightweight virtual machines that act like containers. This extends hardware acceleration into a stronger isolation, separating workloads for increased security and enabling AI workloads to run with enhanced protection so organizations can easily implement confidential computing to safeguard data. Simplifying AI Infrastructure Historically, managing the powerful GPUs that fuel AI within data centers required significant effort.  This contribution is designed to make high-performance computing more accessible. Key benefits for developers include: Improved Efficiency: The driver allows for smarter sharing of GPU resources, delivering effective use of computing power, with support of NVIDIA Multi-Process Service and NVIDIA Multi-Instance GPU technologies. Massive Scale: It provides native support for connecting systems together, including with NVIDIA Multi-Node NVlink interconnect technology. This is essential for training massive AI models on NVIDIA Grace Blackwell systems and next-generation AI infrastructure. Flexibility: Developers can dynamically reconfigure their hardware to suit their needs, changing how resources are allocated on the fly. Precision: The software supports fine-tuned requests, allowing users to ask for the specific computing power, memory settings or interconnect arrangement needed for their applications. 
A Collaborative, Industry-Wide Effort NVIDIA is collaborating with industry leaders — including Amazon Web Services, Broadcom, Canonical, Google Cloud, Microsoft, Nutanix, Red Hat and SUSE — to drive these features forward for the benefit of the entire cloud-native ecosystem. "Open source will be at the core of every successful enterprise AI strategy, bringing standardization to the high-performance infrastructure components that fuel production AI workloads," said Chris Wright, chief technology officer and senior vice president of global engineering at Red Hat. "NVIDIA's donation of the NVIDIA DRA Driver for GPUs helps to cement the role of open source in AI's evolution, and we look forward to collaborating with NVIDIA and the broader community within the Kubernetes ecosystem." "Open source software and the communities that sustain it are a cornerstone of the infrastructure used for scientific computing and research," said Ricardo Rocha, lead of platforms infrastructure at CERN. "For organizations like CERN, where efficiently analyzing petabytes of data is essential to discovery, community-driven innovation helps accelerate the pace of science. NVIDIA's donation of the DRA Driver strengthens the ecosystem researchers rely on to process data across both traditional scientific computing and emerging machine learning workloads." Expanding the Open Source Horizon This donation is just part of NVIDIA's broader initiatives to support the open source community. For example, NVSentinel — a system for GPU fault remediation — and AI Cluster Runtime, an agentic AI framework, were announced at GTC last week. In addition, NVIDIA announced at GTC new open source projects including the NVIDIA NemoClaw reference stack and NVIDIA OpenShell runtime for securely running autonomous agents. OpenShell provides fine-grained programmable policy security and privacy controls, and natively integrates with Linux, eBPF and Kubernetes. NVIDIA also today announced that its high-performance AI workload scheduler, the KAI Scheduler, has been onboarded as a CNCF Sandbox project — a key step toward fostering broader collaboration and ensuring the technology evolves alongside the needs of the wider cloud-native ecosystem. Developers and organizations can use and contribute to the KAI Scheduler today. NVIDIA remains committed to actively maintaining and contributing to Kubernetes and CNCF projects to help meet the rigorous demands of enterprise AI customers. In addition, following the release of NVIDIA Dynamo 1.0, NVIDIA is expanding the Dynamo ecosystem with Grove, an open source Kubernetes application programming interface for orchestrating AI workloads on GPU clusters. Grove, which enables developers to express complex inference systems in a single declarative resource, is being integrated with the llm-d inference stack for wider adoption in the Kubernetes community. Developers and organizations can begin using and contributing to the NVIDIA DRA Driver today. Visit the NVIDIA booth at KubeCon to see live demos of this technology in action.
'Empathetic' Salesforce bots to help fired via Labor Dept the_register_ai 26.03.2026 20:41 0.687
Embedding sim.0.7876
Entity overlap0.1304
Title sim.0.099
Time proximity0.9754
NLP type partnership
NLP organization Salesforce
NLP topic ai agents
NLP country United States

Open original

Public Sector. 'Empathetic' Salesforce bots to help those fired by uncaring humans. I'm sorry, Dave. I can't give you your job back, but here's the form you fill out to collect benefits. O'Ryan Johnson, Thu 26 Mar 2026 // 20:41 UTC. There's a joke in Boston that goes: the people in Southie will steal your wallet and help you look for it. And now, the thousands of workers ostensibly replaced by bots will have a bot to help them get unemployment benefits, thanks to a new Department of Labor deal to use Salesforce's Agentforce to triage applications at its national call center. The SaaS giant says that its agent will "respond empathetically" to queries. The Department of Labor Agent (DOLA) is built on top of Salesforce Government Cloud's FedRAMP infrastructure, and it uses Data 360 to merge the structured and unstructured data from third-party systems and a library of more than 2,900 Department of Labor knowledge articles into a "holistic citizen view," Salesforce said in its press release. Neither the Department of Labor nor Salesforce provided a dollar figure for the deal. The DoL did not reply to emails from The Register seeking comment. Salesforce declined to comment beyond what was stated in the announcement. When people call for help, Agentforce for Public Sector and Agentforce Marketing will handle questions across all 28 Department of Labor programs, including Unemployment Insurance, the Occupational Safety and Health Administration, Veterans' Employment and Training Service, the Mine Safety and Health Administration, and Job Corps. The agents will collect intake information; open cases; and can text, email, and call customers using Salesforce Voice. The DoL handles about 2.8 million cases on behalf of workers who need help. The new system will also take over the task of processing 236,000 OSHA logs and 41,000 Job Corps applications, "significantly reducing manual entry errors," the release states. With the agents in place, humans inside the Department of Labor will be retrained to handle more complex tasks, the announcement stated. Meanwhile, on the back end, Salesforce Tableau Next will give the US government real-time analytics and mission dashboards to monitor contact center performance and customer satisfaction. "If a citizen asks to speak with a person or if a query requires a deeper empathetic touch, DOLA automatically transfers the conversation to a human staff member," the press release states. "And by integrating AI agents directly into its service fabric, the DOL can now triage citizen needs with what the DOL has called 'hospital-like precision' — helping to ensure that resources are always directed toward the most critical worker needs." The purpose of the deal is similar to the one Salesforce struck with the Department of Transportation in December, which used Salesforce Agentforce to provide around-the-clock support for common tasks like complaints, accessing services, and analyzing complex datasets — such as weather, traffic trends, and historical incident data — to create alerts that help USDOT reduce accidents and injuries.
Can your governance keep pace with your AI ambitions? AI risk intelligence in the agentic era aws_ml_blog 31.03.2026 15:36 0.687
Embedding sim.0.7964
Entity overlap0.0833
Title sim.0.15
Time proximity0.8385
NLP type product_launch
NLP organization AWS Generative AI Innovation Center
NLP topic ai governance
NLP country

Open original

DevOps used to be predictable: same input, same output, binary success, static dependencies, concrete metrics. You could control what you could predict, measure what was concrete, and secure what followed known patterns. Then agentic AI arrived, and everything changed. Agents operate non-deterministically; they don’t follow fixed patterns. Ask the same question twice, get different answers. They select different tools and approaches as they work, rather than following predetermined workflows. Quality exists on a gradient from perfect to fabricated rather than binary pass-fail. Predictable dependencies and processes have given way to autonomous systems that adapt, reason, and act independently. Traditional IT governance frameworks designed for static deployments can’t address these complex multi-system interactions. Organizations face inconsistent security postures across agentic workflows, compliance gaps that vary by deployment, and observability metrics opaque to business stakeholders without deep technical expertise. This shift requires rethinking security, operations, and governance as interdependent dimensions of agentic system health . It’s also the origin story of AI Risk Intelligence (AIRI): the enterprise-grade automated governance solution from AWS Generative AI Innovation Center that automates security, operations, and governance controls’ assessments into a single viewpoint spanning the entire agentic lifecycle. To build this solution, we used the AWS Responsible AI Best Practices Framework , our science-backed guidance built on our experience with hundreds of thousands of AI workloads, helping customers address responsible AI considerations throughout the AI lifecycle and make informed design decisions that accelerate deployment of trusted AI systems. From static controls to dynamic governance Consider a common security risk in agentic systems. The Open Worldwide Application Security Project (OWASP)—a nonprofit that tracks cybersecurity vulnerabilities—identifies “Tool Misuse and Exploitation” as one of its Top 10 for Agentic Applications in 2026 . Here’s what that looks like in practice: An enterprise AI assistant has legitimate access to email, calendar, and CRM. A bad actor embeds malicious instructions in an email. The user requests an innocent summary, but the compromised agent follows hidden directives—searching sensitive data and exfiltrating it via calendar invites—while providing a benign response that masks the breach. This unintended access operates entirely within granted permissions: the AI assistant is authorized to read emails, search data, and create calendar events. Standard data loss prevention tools and network traffic monitoring are not designed to evaluate whether an agent’s actions are aligned with its intended scope — they flag anomalies in data movement and network traffic, neither of which this unintended access produces. To govern multi-agent systems at scale, security must integrate directly into how agents operate, and vice versa. The systemic nature of Agentic Risk The calendar exfiltration scenario reveals a critical insight: in agentic systems, security vulnerabilities cascade across multiple operational dimensions simultaneously. 
When the AI assistant misuses its calendar tool, the breach cascades across multiple dimensions:

Multi-agent coordination: One agent's action triggered other agents to amplify the violation.
Permission management: Access controls weren't continuously validated while the agent was running.
Human oversight: There was no checkpoint requiring human confirmation before the agent executed a high-risk action—the system operated autonomously through the entire exploit sequence without surfacing the decision for review.
Visibility: Risk managers couldn't interpret the monitoring data to detect the problem before data was stolen.

Traditional approaches that treat security, operations, and governance as separate concerns create blind spots precisely where agents coordinate, share context, and propagate decisions. AIRI operationalizes frameworks like the NIST AI Risk Management Framework, ISO and OWASP — transforming them from static reference documents that require human interpretation into automated, continuous evaluations embedded across the entire agentic lifecycle, from design through post-production. Critically, AIRI is framework-agnostic: it calibrates against governance standards, which means the same engine that evaluates OWASP security controls also assesses organizational transparency policies or industry-specific compliance requirements. This is what makes it applicable across diverse agent architectures, industries, and risk profiles — rather than hardcoding rules for known threats, AIRI reasons over evidence the way an auditor would, but continuously and at scale.

AIRI in action

Let us now explore how AIRI operationalizes the automated governance of agentic systems in practice. Let's return to our AI assistant's example. Assume, for instance, that the development team has just produced a POC using this AI assistant. Before they deploy their solution to production, they run AIRI. To assess the foundations of their system, the team starts by leveraging AIRI's automated technical documentation review capability to automatically collect evidence of the control implementations contained in the table below — assessing not only security but also operational quality controls: transparency, controllability, explainability, safety, and robustness. The analysis spans the design of the use case, the infrastructure serving it, and organizational policies to facilitate alignment with enterprise governance and compliance requirements.

For each control dimension, AIRI runs a reasoning loop. First, it extracts the relevant evaluation criteria from the applicable framework. Then it pulls evidence from the system's actual artifacts — architecture documents, agent configurations, organizational policies. From there, it reasons over the alignment between what the framework requires and what the system demonstrates, ultimately determining whether the control is effectively implemented. This reasoning-based approach is what makes AIRI broadly applicable. Rather than relying on static rule sets that break when agent architectures change, AIRI evaluates intent against evidence. That means it adapts to new agent designs, new frameworks, and new risk categories — without being re-engineered. To strengthen the reliability of these judgments, AIRI repeats each evaluation multiple times and measures the consistency of its conclusions — a technique called semantic entropy.
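As a toy illustration of that consistency check (this is not AIRI's implementation; semantic entropy in practice clusters free-text judgments, whereas this sketch simplifies to discrete verdicts, and the threshold value is made up):

import math
from collections import Counter

def verdict_entropy(verdicts: list[str]) -> float:
    """Shannon entropy of repeated verdicts; 0.0 means the runs fully agree."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def resolve(verdicts: list[str], threshold: float = 0.9) -> str:
    """Accept the majority verdict only when repeated evaluations agree enough."""
    if verdict_entropy(verdicts) > threshold:
        return "needs-human-review"  # high disagreement: evidence is ambiguous or insufficient
    return Counter(verdicts).most_common(1)[0][0]

# Five repeated evaluations of the same control, e.g. from an LLM judge:
print(resolve(["pass", "pass", "pass", "pass", "pass"]))  # -> pass
print(resolve(["pass", "fail", "pass", "fail", "pass"]))  # -> needs-human-review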
When outputs vary significantly across runs, it signals that the evidence is ambiguous or insufficient and triggers human review rather than forcing a potentially unreliable judgment. This is how AIRI bridges the gap between abstract framework requirements and concrete agent behavior: turning governance intent into a structured, repeatable evaluation that scales across agentic systems. The assessment of our AI assistant evaluated the system across hundreds of controls and returned an overall Medium risk rating with a pass rate just above 50%. More telling than the aggregate score is the risk distribution — and it maps directly to the cascading vulnerabilities we described. Eight Critical and seven High severity findings signal that foundational controls — particularly around safety, controllability, and security — are either absent or insufficiently operationalized. Fourteen Medium severity findings indicate systemic gaps in areas such as explainability and robustness that, while not immediately catastrophic, compound the overall risk posture if left unaddressed. On the more resilient end, findings concentrated in governance, fairness, and transparency reflect areas where the organization has invested meaningfully and where controls are functioning as intended. After human validation of the results, the team accesses a dashboard that synthesizes the findings alongside prioritized, actionable recommendations — from configuring responses with traceable references to reduce hallucination risk, to implementing input guardrails that block variables which could introduce bias, to strengthening explainability through surfaced decision evidence. Each recommendation is grounded in the assessment evidence and mapped to specific AWS capabilities that can remediate the gap. Critically, AIRI is not a one-time audit. Integration with the development environment enables AIRI to function as a continuous governance engine. Every time the project undergoes a change — whether a code commit, an architecture update, or a policy revision — AIRI automatically re-runs the assessment, making sure governance keeps pace with development velocity. Teams gain a living record of how their risk posture evolves with each iteration. Turn governance into your edge The shift to dynamic governance determines which organizations confidently scale agentic workloads and which remain constrained by manual oversight. For security teams : AIRI transforms reactive vulnerability management into proactive risk identification. For operations teams : AIRI alleviates manual auditing across multi-agent systems with automated assessments and mitigations plans. For risk managers : AIRI translates technical monitoring data into business-relevant metrics—controllability, explainability, transparency—enabling confident decisions without deep technical expertise. For executives : AIRI represents competitive advantage: deploy faster, scale reliably, maintain compliance efficiently. Traditional frameworks designed for static deployments cannot address the dynamic interactions that define agentic workloads. AIRI provides the automated rigor required to govern agents at enterprise scale—a fundamental reimagining of how security, operations, and governance work together systemically. The question is no longer whether to adopt agentic AI, but whether your governance capabilities can keep pace with your ambition. Ready to scale your agentic workloads with confidence? 
Explore how AIRI can transform your AI governance strategy— contact us to learn more or schedule a demo today. About the authors Segolene Dessertine-Panhard is the global tech lead for Responsible AI and AI governance initiatives at the AWS Generative AI Innovation Center. In this role, she supports AWS customers in scaling their generative AI strategies by implementing robust governance processes and effective AI and cybersecurity risk management systems, leveraging AWS capabilities and state-of-the-art scientific models. Prior to joining AWS in 2018, she was a full-time professor of Finance at New York University’s Tandon School of Engineering. She also served for several years as an independent consultant in financial disputes and regulatory investigations. She holds a Ph.D. from Paris Sorbonne University. Sri Elaprolu is Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 13-year tenure at AWS, he has led ML science teams partnering with global enterprises and public sector organizations. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri holds a Master’s in Engineering Science and an MBA. Florian Felice is a Senior Data Scientist at the AWS Generative AI Innovation Center. In his role, he is the science lead for AI Risk Intelligence, where he develops frameworks and tools to evaluate and govern responsible AI practices at scale. In this role, he focuses on quantifying and measuring AI models’ uncertainty, risks, and benefits, drawing on his statistical background to bring rigor and precision to AI governance. He holds a Master’s degree in Statistics and Econometrics from Toulouse School of Economics. Daniel Ramirez is a Data Scientist in Responsible AI at the AWS Generative AI Innovation Center. With over 10 years of experience automating processes with machine learning and generative AI, he works at the intersection of advanced AI systems and AI governance, helping organizations build trustworthy and accountable AI at scale. Before joining AWS, Daniel served as a Data Science Manager focused on fraud detection, and prior to that, as a Tech Lead at a Series D startup. He holds a Master’s in Computer Science from Universidad de los Andes and a Master’s in Data Science from Columbia University. Randi Larson connects AI innovation with executive strategy for the AWS Generative AI Innovation Center, shaping how organizations understand and translate technical breakthroughs into business value. She hosts the Innovation Center’s podcast series and combines strategic storytelling with data-driven insight through global keynotes and executive interviews on AI transformation. Before Amazon, Randi refined her analytical precision as a Bloomberg journalist and consultant to economic institutions, think tanks, and family offices on financial technology initiatives. Randi holds an MBA from Duke University’s Fuqua School of Business and a B.S. in Journalism and Spanish from Boston University.
NVIDIA IGX Thor Powers Industrial, Medical, and Robotics Edge AI Applications | NVIDIA Technical Blog nvidia_dev_blog 23.03.2026 20:24 0.686
Embedding sim.0.7762
Entity overlap0.0612
Title sim.0.209
Time proximity0.95
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: edge computing
NLP country:

Open original

Industrial and medical systems are rapidly increasing the use of high-performance AI to improve worker productivity, human-machine interaction, and downtime management. From factory automation cells to autonomous mobile platforms to surgical rooms, operators are deploying increasingly complex generative AI models, more sensors, and higher‑fidelity data streams at the edge. Safety and regulatory compliance are meanwhile crucial, making deterministic behavior, high availability, and verifiable functional safety essential design requirements. This post introduces NVIDIA IGX Thor, a platform built for the demands of powering industrial AI at the edge, including a deep dive into performance and safety features.
What is NVIDIA IGX Thor?
NVIDIA IGX Thor is an enterprise-ready platform for physical AI. It offers server‑class AI performance together with industrial-grade hardware, advanced functional safety capabilities, extended lifecycle support, and an enterprise software stack in configurations suitable for industrial and medical environments. IGX Thor extends this compute and safety foundation to edge systems where uptime, reliability, and standards compliance are central to system design. With the IGX Thor platform, developers can build mission-critical edge computers that operate reliably in harsh physical conditions, integrate with secure and regulated infrastructures, and execute state‑of‑the‑art AI inference and sensor fusion pipelines close to where data is generated. The IGX Thor family is delivered through four purpose-built platforms, designed for industrial-grade deployment and advanced development workflows:
NVIDIA IGX T5000 System-on-Module (SoM): Delivers high-performance, safety-capable compute in a compact, embedded form factor. Designed for integration into custom carrier boards, the IGX T5000 SoM enables customers to build domain-specific industrial and robotic systems while accelerating time to production.
NVIDIA IGX T7000 Board Kit: Scales performance and expandability for the most demanding edge AI workloads. Built on a MicroATX form factor, the IGX T7000 combines NVIDIA Thor-class compute with rich I/O, functional safety support, flexibility to increase AI compute with a discrete GPU, and enterprise-grade networking to power safety-critical, high-throughput edge systems.
NVIDIA IGX Thor Developer Kit: Provides a full-featured development platform for building, testing, and validating next-generation industrial AI applications. With support for advanced sensing, robotics, and real-time AI pipelines, it enables developers to move from prototype to deployment with confidence.
NVIDIA IGX Thor Developer Kit Mini: Brings NVIDIA Thor-class capabilities with an on-board safety module to a smaller footprint. Optimized for space- and power-constrained environments, it is ideal for mobile robots, autonomous machines, and compact industrial systems that require robust AI performance without compromising the form factor.
Figure 1. The NVIDIA IGX T5000 Module and Developer Kit feature Blackwell GPU architecture, 14-Core Arm Neoverse-V3AE CPU, integrated safety MCU, and high-speed I/O connectivity
Table 1 provides an overview of the NVIDIA IGX Thor family, highlighting how each configuration is tuned for different classes of industrial, medical, and robotics edge workloads.
| | NVIDIA IGX T5000 | NVIDIA IGX T7000 | NVIDIA IGX Thor Developer Kit Mini | NVIDIA IGX Thor Developer Kit |
| --- | --- | --- | --- | --- |
| AI performance | Up to 2,070 FP4 TFLOPS | Up to 5,581 FP4 TFLOPS | Up to 2,070 FP4 TFLOPS | Up to 5,581 FP4 TFLOPS |
| iGPU | 2,560-core NVIDIA Blackwell architecture GPU with fifth-generation Tensor Cores; multi-instance GPU with 10 TPCs (all configurations) | | | |
| iGPU speed | 1.57 GHz (all configurations) | | | |
| dGPU | – | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | NVIDIA RTX PRO 5000 Blackwell | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition |
| Memory | 256-bit 128 GB LPDDR5x, 273 GB/s (all configurations) | | | |
| Safety | Functional Safety Island (FSI) in SoC | FSI in SoC and Safety MCU | FSI in SoC and Safety MCU | FSI in SoC and Safety MCU |
| BMC | – | Yes | – | Yes |
| Networking | 4x up to 25 Gbps MGBE | 2x RJ45 (1 GbE each), 2x QSFP112 (200 GbE each), supports ConnectX-7 | 1x 5 GbE RJ45 connector, 1x QSFP28 (4x 25 GbE), WiFi 6E (populated on M.2 Key E slot with x1 PCIe Gen5) | 2x RJ45 (1 GbE each), 2x QSFP112 (200 GbE each), supports ConnectX-7 |

Table 1. Key performance, memory, I/O, and safety characteristics across the NVIDIA IGX Thor family
Delivering massive AI performance at the edge
NVIDIA IGX Thor delivers a step-function increase in performance. Compared to NVIDIA IGX Orin, it offers up to 8x higher AI compute on the integrated GPU, 2.5x higher AI compute with discrete GPU acceleration, and 2x higher networking bandwidth. This enables significantly more demanding real-time AI workloads for industrial and robotics applications. IGX T7000 pairs the IGX T5000 Thor module, powered by an NVIDIA Blackwell architecture iGPU delivering 2,070 FP4 TOPS of AI performance, with an NVIDIA RTX PRO 6000 Blackwell Max‑Q discrete GPU that adds an additional 3,511 FP4 TOPS. Together, this configuration significantly boosts total AI compute for demanding edge workloads. IGX T7000 delivers 5x the generative AI reasoning performance compared to IGX Orin 700. It handles up to 20x more interactive users by using the iGPU and dGPU concurrently, making it an excellent choice for high‑concurrency edge workloads. It also offers robust mixed‑criticality isolation, so workloads running on the dGPU operate independently without degrading performance on the iGPU. These features make IGX T7000 ideal for scenarios such as smart retail, which involves processing video from many checkout kiosks while running VLM and LLM workloads for low-latency, faster checkout. It’s also ideal for industrial safety and building automation applications, among many others. Figure 2 compares IGX T7000 and IGX Orin 700 on the number of users handled with tokens/sec > 25 and TTFT < 200 msec.
Figure 2. IGX T7000 is built for handling high concurrency at the edge
Configuration: dGPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (IGX T7000), NVIDIA RTX 6000 ADA (IGX Orin 700); ISL/OSL: 2028/128; Quantization: NVFP4 (IGX T7000), W4A16 (IGX Orin 700); Framework: vLLM

| Model | NVIDIA IGX Thor (IGX T7000), tokens/sec | NVIDIA IGX Orin (IGX Orin 700), tokens/sec | Speedup over NVIDIA IGX Orin |
| --- | --- | --- | --- |
| Qwen3 30B A3B | 1,163 | 807 | 1.4x |
| Qwen3 32B | 468 | 95 | 4.9x |
| Nemotron 9B V2 | 306 | 202 | 1.5x |
| Nemotron 3 30B A3B | 642 | 585 | 1.1x |
| Cosmos Reason 2 2B | 1,630 | 1,250 | 1.3x |
| Cosmos Reason 2 8B | 822 | 540 | 1.5x |
| gpt-oss-20B | 1,072 | 711 | 1.5x |

Table 2. Generative AI performance for popular models on IGX T7000 and IGX Orin 700 platforms
Configuration: dGPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (IGX T7000), NVIDIA RTX 6000 ADA (IGX Orin 700); ISL/OSL: 2028/128; Quantization: NVFP4 (IGX T7000), W4A16 (IGX Orin 700); Max Conc: 9; Framework: vLLM
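As a rough illustration of how throughput numbers like those in Table 2 are typically gathered, the sketch below measures offline generation rate with vLLM; the model name, prompt, output length, and batch size are placeholders, and this is not NVIDIA's benchmark harness.

```python
import time
from vllm import LLM, SamplingParams  # assumes vLLM is installed on the target system

MODEL = "Qwen/Qwen3-32B"  # placeholder; substitute the model under test
PROMPT = "Summarize the benefits of edge AI for industrial automation."

llm = LLM(model=MODEL)
params = SamplingParams(temperature=0.7, max_tokens=128)  # roughly matches OSL 128

prompts = [PROMPT] * 8  # small concurrent batch; real runs sweep concurrency levels
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count only generated tokens to report decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tokens/sec")
```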
Similar to NVIDIA Jetson T5000, the NVIDIA IGX T5000 SoM provides 2,070 TOPS of FP4 performance, paired with 128 GB of LPDDR5X memory and 273 GB/s of memory bandwidth, making it ideal for running generative AI workloads with real-time responsiveness. IGX T5000 delivers the same performance as Jetson T5000 even with industrial features such as full DRAM ECC enabled. These industrial capabilities do not reduce performance or usable memory capacity on IGX T5000.
High-speed connectivity for real-time workloads
The IGX T7000 board kit significantly advances edge connectivity with dual 200 GbE networking, delivering 2x the bandwidth compared to the IGX Orin dual 100 GbE interfaces. Powered by the NVIDIA ConnectX-7 SmartNIC, this high-speed fabric enables low-latency, high-throughput data movement using RDMA, enabling sensor data to bypass the CPU and flow directly into GPU memory. The result is a substantial boost in end-to-end AI performance, especially for real-time, sensor-intensive workloads. This leap in networking capability unlocks the full potential of the NVIDIA Holoscan Sensor Bridge (HSB). Designed to aggregate and stream massive volumes of data from high-bandwidth sensors—such as cameras, lidar, radar, and medical imaging devices—HSB relies on deterministic, lossless networking to meet strict latency and synchronization requirements. With 2×200 GbE and RDMA acceleration, IGX Thor provides the bandwidth and determinism needed to scale multisensor pipelines, enabling faster ingestion, tighter sensor fusion, and higher-fidelity AI inference in safety-critical and real-time systems.
Real-time determinism for edge applications
Many industrial and robotics applications demand strict real-time behavior, and IGX Thor is designed specifically to meet those needs.
Real-time Linux environment: IGX ships with a preemptible real-time Linux kernel by default. This provides the deterministic behavior needed for tight control loops and low-latency sensor handling in robotic arms, autonomous machines, and closed-loop medical systems (see the scheduling sketch after this list).
Multi-instance GPU (MIG): The NVIDIA Blackwell GPU can be partitioned into fully isolated instances, so safety‑critical, real-time workloads can get hard performance guarantees even when running alongside lower-priority tasks.
Programmable accelerators: Dedicated engines for vision (PVA), optical flow, hardware codecs, and real-time inference cut latency and offload work from the CPU and GPU, leaving more headroom for large-scale deep learning and complex generative AI pipelines.
GPU Direct RDMA: With GPU Direct RDMA, sensor inputs can be brought directly into GPU memory for extremely low-latency sensor processing.
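To illustrate how an application might take advantage of a preemptible real-time kernel such as the one shipped with IGX, here is a minimal sketch that places a periodic control loop under the SCHED_FIFO real-time policy; it assumes a Linux system with PREEMPT_RT and sufficient privileges, and the loop body, period, and priority are placeholders rather than recommended values.

```python
import os
import time

def run_control_loop(period_s=0.001, priority=80, iterations=10_000):
    """Run a simple periodic loop under the SCHED_FIFO real-time policy.

    Requires CAP_SYS_NICE or root. On a PREEMPT_RT kernel, wakeup jitter
    for such a loop stays low even under load.
    """
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    next_deadline = time.monotonic()
    for _ in range(iterations):
        # Placeholder for reading sensors and writing actuator commands.
        next_deadline += period_s
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        else:
            # The loop overran its deadline; resynchronize instead of drifting.
            next_deadline = time.monotonic()

if __name__ == "__main__":
    run_control_loop()
```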
Built rugged for industrial reality
Industrial applications—from precision manufacturing to medical robotics—demand platforms that can operate reliably in the face of extremes: intense vibration, electrical noise, temperature swings, and more. They also need to integrate easily with industrial networks. The IGX Thor platform is engineered precisely for these demands.
Industrial-grade components: Every aspect of IGX Thor, from memory to power delivery, is selected for resilience, supporting an extended temperature range, shock and vibration compliance, and ECC implementation.
Ruggedized chassis: Available in tough industrial enclosures, IGX Thor can be deployed in factories, warehouses, field vehicles, or anywhere that traditional electronics would falter.
Long lifecycle, global support: IGX platforms are supported for up to 10 years, providing unmatched reliability and supply chain assurance for regulated industries.
Extensive I/O: IGX Thor offers rich industrial I/O—multiple PCIe Gen5, SFP+, CAN, and high-speed digital interfaces—simplifying integration with legacy PLCs, sensors, robotics, control networks, and more.
Functional safety and proactive AI safety
IGX Thor unifies high‑performance AI and functional safety in a single platform, enabling both inside‑out and outside‑in robotic safety. Robots can rely on onboard sensors or on infrastructure sensors with an outside-in view for safety decisions. IGX Thor complies with ISO 26262 and IEC 61508, targeting ASIL D/SC3 and ASIL/SIL 2. It incorporates a variety of safety features, including:
Hardware fault detection
Safe‑state monitoring
In‑system test
DRAM ECC
Freedom from interference (FFI)
Dedicated Functional Safety Island (FSI), an independent, redundant safety processor that monitors and handles safety‑critical workloads, providing true hardware separation between safety and nonsafety domains
NVIDIA Halos AI Systems Inspection Lab (accredited by ANAB) helps ensure that NVIDIA IGX customers meet rigorous safety and cybersecurity standards through impartial assessments. IGX customers receive an inspection report and an inspection certificate that can be used with technical services or certification agencies for final certification. For more details, see the IGX Thor Safety Product Brief.
Enterprise-grade industrial platform
NVIDIA AI Enterprise – IGX is a software solution for edge AI that makes IGX Thor an enterprise‑grade, production‑ready platform with predictable long‑term support. NVIDIA commits to a lifecycle of up to 10 years for IGX, covering hardware availability, security updates, and full stack maintenance. This means that industrial, medical, and robotics deployments can stay patched and supported over their entire service life. In addition, NVIDIA AI Enterprise provides a long‑term support (LTS) branch with version‑locked AI frameworks and SDKs maintained for 10 years—alongside enterprise SLAs like Business Standard and Business Critical support. This ensures stable APIs, regular security fixes, and access to NVIDIA experts 24 hours per day, seven days per week when needed. To learn more, see the NVIDIA AI Enterprise – IGX Licensing Guide.
Transition seamlessly from NVIDIA Jetson Thor
The transition from NVIDIA Jetson Thor to NVIDIA IGX Thor is seamless. Jetson T5000 and IGX T5000 are fully compatible in terms of pin, form factor, and function, so the same carrier board design works for both platforms. The software stacks are also aligned—kernel, user space, and AI libraries share the same versions—delivering a consistent experience across Jetson and IGX. For teams with deeper customization requirements, NVIDIA is introducing JetPack on IGX, bringing JetPack flexibility to the industrial robustness of IGX hardware. For more information, see the NVIDIA IGX User Guide.
Get started with NVIDIA IGX Thor
Automation leaders, medical imaging pioneers, and large-scale manufacturers are already adopting NVIDIA IGX Thor. Leading ODM and sensor partners can deliver production-ready systems and services to drive faster TTM. Industrial automation, healthcare, energy, and smart infrastructure are being transformed by AI—accelerating productivity, increasing safety, and lowering costs. But deploying AI in these environments demands more: hardware must survive and thrive in harsh conditions, data must be protected, and systems must meet the world’s toughest functional safety requirements. NVIDIA IGX Thor delivers at every turn. It bridges the gap between the agility of modern AI and the uncompromising demands of regulated, safety-critical industrial environments. With functional safety engineered from the silicon up, support for the best-performing GPUs, and a software stack ready for compliance, management, and AI advancement, IGX Thor is the platform to trust for the next decade of industrial AI. The IGX Thor Developer Kit and IGX Thor Developer Kit Mini are available to purchase from distributors worldwide. Get started with NVIDIA IGX Thor. To learn more, watch the NVIDIA GTC 2026 keynote with NVIDIA founder and CEO Jensen Huang and explore related GTC sessions on demand.
Tags: Developer Tools & Techniques | Edge Computing | Robotics | General | AI Enterprise | Blackwell | ConnectX | IGX | Intermediate Technical | News | GTC 2026 | Multi-Instance GPU (MIG) | Physical AI | Thor
About the Authors
Suhas Sheshadri is a product manager at NVIDIA, focusing on Jetson software. He previously worked with the autonomous driving team at NVIDIA, optimizing system software for the NVIDIA Drive platform. In his free time, Suhas likes to read books on quantum physics and game theory.
Aayush Pathak is a hardware product manager at NVIDIA specializing in Embedded Edge and Autonomous Machines. He has worked extensively in the semiconductor industry, designing supercomputer SoCs and advancing low‑power, energy‑efficient hardware. He holds a master’s degree in Electrical Engineering from the University of Southern California and an MBA from the University of Chicago.
Staff too scared of the AI axe to pick it up, Forrester finds the_register_ai 26.03.2026 16:33 0.685
Embedding sim.0.8047
Entity overlap0.1579
Title sim.0.1304
Time proximity0.726
NLP type: other
NLP organization: Forrester
NLP topic: ai adoption
NLP country: United States

Open original

AI + ML
Staff too scared of the AI axe to pick it up, Forrester finds
Your AI rollout isn't failing – your employees just hate it
Dan Robinson, Thu 26 Mar 2026 // 16:33 UTC
If your company isn't seeing great returns from its investment in AI, you might want to look at the humans tasked with deploying it and how you can motivate them. Right now, many employees fear AI-driven job losses and aren't well trained to use the tech, according to Forrester. The research and advisory biz says in its latest report that low employee readiness is the main thing holding back business success with workforce AI programs. According to its own data, Forrester believes 68 percent of organizations are using generative AI in deployed production applications, which sounds somewhat on the optimistic side to us. It claims that, among decision-makers, 81 percent say AI copilots are important for assisting employees. Workers must therefore adapt, it declares. This isn't happening. Using its own metric, the AI quotient (AIQ), to measure the readiness of individuals, teams, and organizations for AI, Forrester says employees in the US, UK, Germany, France, and Australia failed to progress meaningfully over the past year. There are two reasons put forward for why businesses aren't advancing as measured by Forrester's AIQ yardstick: One is that most organizations fail to effectively train their staff in the use of AI tools. The proportion of firms that say they offer internal AI training to non-technical employees grew slightly last year, from 47 percent in 2024 to 51 percent. Prompt engineering – a key skill for using generative AI tools – fared even worse, with just 23 percent of organizations saying they offered training for this. The second reason is that employee fears are stunting adoption. While few jobs were lost to AI in 2025 and future job losses are not expected to constitute a job apocalypse, worker anxiety regarding this is pervasive, Forrester says. There could be a good reason for this: public statements by CEOs saying that axing jobs is exactly what they want to do. A survey last year found that just over half of UK business leaders (51 percent) saw AI as a way to cut investment in staff. Another report found that 43 percent of business leaders expect to reduce entry-level roles (including both cutting existing roles and not hiring new staff) over the next year in favor of AI, while 50 percent "specifically" said AI is helping them reduce headcount. Forrester's own data found 43 percent of employees are concerned that many people will lose their jobs to automation over the next five years, while a quarter suspect it will impact their own job during that period. This creates an ambient environment of anxiety and mistrust that hinders progress, it states. The report cites one business leader as saying: "Some of our employees fear job loss, and it turns them away from AI altogether." The solution is for organizations to frame workforce AI as an opportunity builder for employees and articulate the benefits from an employee perspective. Those that fail to do so will see job loss worries magnify, the report claims. Businesses need to invest in learning and engagement programs to raise their staff's AIQ. This can demystify AI tools and reduce panic about job loss, Forrester reasons. After all, why would a firm replace employees with AI if it is investing to help them use it? Forrester has obviously never heard of companies that force employees to train their own replacements before showing them the door.
Formal learning "plays a surprisingly small role in raising AIQ," we're told, and organizations that instead get social learning right tend to succeed with workforce AI. "The organizations that treat AI literacy as a strategic priority, not a box-ticking exercise, will be the ones that unlock meaningful productivity gains and long-term competitive advantage," claimed Forrester VP principal analyst JP Gownder. This is the same JP Gownder who told The Register in January that he "remains unconvinced that AI will revolutionize productivity." At the same time, a report from professional services biz PwC found that more than half of CEOs had seen neither increased revenue nor decreased costs from using AI, despite their investments in the technology. ®
MP victim of AI deepfake fails to get answers from Big Tech the_register_ai 26.03.2026 11:49 0.683
Embedding sim.0.8283
Entity overlap0
Title sim.0.0842
Time proximity0.6459
NLP type: regulation
NLP organization: Meta
NLP topic: deepfake
NLP country: United Kingdom

Open original

AI + ML
Brit lawmaker targeted by AI deepfake fails to get answers from US Big Tech
Appearing before Parliament, Meta, Google and X struggle to explain how fake political video circulated for so long
Lindsay Clark, Thu 26 Mar 2026 // 11:49 UTC
A member of the UK Parliament's lower house who was the victim of a deepfake AI campaign this week had a rare chance to confront the Big Tech executives who helped spread it. Their answers disappointed. Representatives from Meta, Google, and X stumbled, offered platitudes, and explained their respective policies, but did little to compensate for spreading the potentially ruinous AI fake, or commit to ensuring it could not happen again after Conservative MP George Freeman confronted them. Last autumn, Freeman was the subject of an AI-created fake that falsely claimed he had defected to a rival party, Reform. This was plausible enough, given several genuine Conservative defections in recent months, but entirely fabricated. Not only was it damaging to his reputation, but allowing political misinformation to continue to spread unchecked could end the democratic process in the UK, he argued. Freeman said platforms spreading the content are failing to respond. "There's no redress. There was no statement or principle that it was a problem," he said in Parliament yesterday, labeling the event a "serious disruption to democratic representation." Step forward Google, which owns YouTube. "We have policies about election ads which are aimed at ensuring that people are allowed to participate in free and fair elections just during election time," Zoe Darme, director for trust, knowledge and information products, told the House of Commons Science, Innovation and Technology Committee. A video that is "violative" under Google's definition might be picked up by a "classifier" or, if not, "reported and reviewed against community guidelines and removed." However, Darme was unable to say whether something so demonstrably false would in itself be "violative." Next up, Wifredo Fernández, director of global government affairs at X (formerly known as Twitter), said: "We have our deceptive identities policy so that deals with impersonation, and we have our synthetic and manipulated media policy, which maybe would apply in this case." He outlined a three-part test under the platform's synthetic media policy, but noted it applied to confusion across X generally, not within Freeman's specific constituency. The possible outcome: a community note. Asked what action X had actually taken, Fernández said he'd "have to check with the teams." Freeman confirmed X had taken none. Also among the US giants was Meta, owner of social media megaliths Facebook and Instagram. Rebecca Stimson, UK public policy director, told Freeman, "It was labeled by our fact checkers, and it was down-ranked. And down rank does have a very significant impact. It can mean up to 80 to 90 percent less engagement." She said Meta didn't always remove misinformation. Instead, it took a tiered approach, considering whether it was occurring during an election period, for example. She said Meta would never be able to find and remove every instance of misinformation across every platform. But people could see it labeled with the correct information.
Addressing these responses, Freeman said: "It feels to me as though the platforms are taking the approach that 'they've got a policy' and not policing actively. It falls to us as Parliamentarians to police it. My instinct is to pass a very simple law that somebody's identity belongs to them and cannot be stolen, used, misappropriated, whatever the purpose… You should go to bed a night not fearing that in the morning, you find a deeply damaging, disruptive and dangerous misrepresentation of you." ®
NVIDIA, Telecom Leaders Build AI Grids to Optimize Inference on Distributed Networks nvidia_blog 17.03.2026 17:00 0.683
Embedding sim.0.7961
Entity overlap0.05
Title sim.0.108
Time proximity0.878
NLP type: partnership
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country: United States

Open original

NVIDIA, Telecom Leaders Build AI Grids to Optimize Inference on Distributed Networks
AT&T, T‑Mobile, Comcast, Spectrum and others are building AI grids using NVIDIA AI infrastructure, while Personal AI, Linker Vision, Serve Robotics and Decart are deploying real-time AI applications across the grid.
March 17, 2026, by Kanika Atri
As AI‑native applications scale to more users, agents and devices, the telecommunications network is becoming the next frontier for distributing AI. At NVIDIA GTC 2026, leading operators in the U.S. and Asia showed that this shift is underway, announcing AI grids — geographically distributed and interconnected AI infrastructure — using their network footprint to power and monetize new AI services across the distributed edge. Different operators are taking different paths. Many are starting by lighting up existing wired edge sites as AI grids they can monetize today. Others harness AI-RAN — a technology that enables the full integration of AI into the radio access network — as a workload and edge inference platform on the same grid. Telcos and distributed cloud providers run some of the most expansive infrastructure in the world: about 100,000 distributed network data centers worldwide, spanning regional hubs, mobile switching offices and central offices, with enough spare power to offer more than 100 gigawatts of new AI capacity over time. AI grids turn this existing real estate, power and connectivity into a geographically distributed computing platform that runs AI inference closer to users, devices and data, where response and cost per token align best. This is more than an infrastructure upgrade — it’s a structural change in how AI is delivered, putting telecom networks at the center of scaling AI rather than just carrying its traffic.
Global Operators Turn Distributed Networks Into AI Grids
Across six major operators, AI grids are moving from concept to reality. AT&T, a leader in connected IoT with over 100 million connections across thousands of device types, is partnering with Cisco and NVIDIA to build an AI grid for IoT. By running AI on a dedicated IoT core and moving AI inference closer to where data is created, AT&T can support mission‑critical, real‑time applications like public‑safety use cases with Linker Vision, enabling faster detection, alerting and response while helping keep sensitive information under customer control at the network edge. “Scaling AI services that are both highly secure and accessible for enterprises and developers is a core pillar of our IoT connectivity strategy,” said Shawn Hakl, senior vice president of product at AT&T Business. “By combining AT&T’s business‑grade connectivity, localized AI compute and zero‑trust security while working with members of the NVIDIA Inception program and harnessing Cisco’s AI Grid with NVIDIA infrastructure and Cisco Mobility Services Platform, we’re bringing real‑time AI inference closer to where data is generated — accelerating digital transformation and unlocking new business opportunities.” Comcast is developing one of the nation’s largest low‑latency broadband footprints into an AI grid for real‑time, hyper‑personalized experiences.
Working with NVIDIA, Decart, Personal AI and HPE, Comcast has validated that its AI grid keeps conversational agents, interactive media and NVIDIA GeForce NOW cloud gaming responsive and economical even during demand spikes, with significantly higher throughput and lower cost per token. Spectrum has the network infrastructure to support an AI grid that spans more than 1,000 edge data centers and hundreds of megawatts of capacity, less than 10 milliseconds away from 500 million devices. The initial deployment focuses on rendering high-resolution graphics for media production using remote GPUs embedded across Spectrum’s fiber-powered, low-latency network. Akamai is building a globally distributed AI grid, expanding Akamai Inference Cloud across more than 4,400 edge locations with thousands of NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Akamai’s AI grid orchestration platform matches each request to the right tier of compute, improving the token economics of inference while powering low-latency, real-time AI experiences for applications like gaming, media, financial services and retail. Indosat Ooredoo Hutchison is connecting its sovereign AI factory with distributed edge and AI‑RAN sites across Indonesia to build an AI grid for local innovation. By running Sahabat-AI — a Bahasa Indonesia-based platform — on this grid within Indonesia’s borders, Indosat can bring localized AI services closer to hundreds of millions of Indonesians across thousands of islands, giving local developers and startups a sovereign platform to build AI applications that are fast, culturally relevant and compliant by design. T‑Mobile is working with NVIDIA to explore edge AI applications using NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, demonstrating how distributed network locations could support emerging AI-RAN and edge inference use cases. Developers including LinkerVision, Levatas, Vaidio, Archetype AI and Serve Robotics are already piloting smart‑city, industrial and retail applications on the grid, connecting cameras, delivery robots and city‑scale agents to real-time intelligence on the network edge. This demonstrates how cell sites and mobile switching offices can support distributed edge AI workloads while continuing to deliver advanced 5G connectivity.
New AI‑Native Services Put Telecom AI Grids to Work
AI grids are becoming foundational to a new class of AI‑native applications — real‑time, hyper‑personalized, concurrent and token-intensive. Personal AI is using NVIDIA Riva to power human‑grade conversational agents on the AI grid. By running small language models closer to users, it achieves sub-500 millisecond end-to-end latency and over 50% lower cost-per-token, enabling voice experiences that feel natural while remaining economically viable at scale. Linker Vision is transforming city operations by running real‑time vision AI on the AI grid. By processing thousands of camera feeds across distributed edge sites, it delivers predictable latency for live detection and instant alerting — enabling safer, smarter cities with up to 10x faster traffic accident detection, 15x faster disaster response and sub‑minute alerts for unsafe crowd behavior. Decart is redefining hyper‑personalized distributed media by bringing real‑time video generation to AI grids.
By running its Lucy models at the network edge, it achieves sub‑12-millisecond network latency, enabling interactive video streams and overlays that adapt instantly to each viewer, delivering smooth, immersive live video experiences even when viewership peaks.
AI Grid Reference Design and Ecosystem
The NVIDIA AI Grid Reference Design defines the building blocks — including NVIDIA accelerated computing, networking and software platforms — for deploying and orchestrating AI across distributed sites. A growing ecosystem of full‑stack partners including Cisco and infrastructure partners like HPE are bringing AI grid solutions to market on systems built with the NVIDIA RTX PRO 6000 Blackwell Server Edition. Armada, Rafay and Spectro Cloud are among the partners building an AI grid control plane to seamlessly orchestrate workloads across distributed AI infrastructure. “Physical AI is accelerating the shift from centralized intelligence to distributed decision making at the network edge,” said Masum Mir, senior vice president and general manager, provider mobility, at Cisco. “Our partnership with NVIDIA brings together the full stack — from NVIDIA GPUs to Cisco’s networking and mobility capabilities — enabling operators to power mission-critical applications, deliver real-time inferencing and participate in the AI value chain.” Together, this ecosystem is helping telcos and distributed cloud providers redefine their role in the AI value chain — transforming the network edge into a unified intelligence layer that runs, scales and monetizes AI workloads. Learn more about AI Grid.
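The request-to-tier matching that grid orchestration platforms described above perform can be sketched very roughly as follows; the tiers, latency figures, costs, and scoring rule are illustrative placeholders, not any operator's or vendor's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    rtt_ms: float            # estimated network round trip to the user
    cost_per_1k_tokens: float
    free_capacity: int       # requests the tier can still absorb

def pick_tier(tiers, latency_budget_ms, est_tokens):
    """Choose the cheapest tier that fits the latency budget and has capacity."""
    candidates = [t for t in tiers
                  if t.rtt_ms <= latency_budget_ms and t.free_capacity > 0]
    if not candidates:
        return None  # fall back to queueing or a central region
    return min(candidates, key=lambda t: t.cost_per_1k_tokens * est_tokens / 1000)

tiers = [
    Tier("metro-edge", rtt_ms=8, cost_per_1k_tokens=0.40, free_capacity=3),
    Tier("regional-hub", rtt_ms=25, cost_per_1k_tokens=0.25, free_capacity=50),
    Tier("central-dc", rtt_ms=70, cost_per_1k_tokens=0.15, free_capacity=500),
]
print(pick_tier(tiers, latency_budget_ms=30, est_tokens=800).name)  # -> regional-hub
```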
How to create “humble” AI mit_news_ai 24.03.2026 04:00 0.682
Embedding sim.0.8102
Entity overlap0.0303
Title sim.0.0938
Time proximity0.7679
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: healthcare ai
NLP country: United States

Open original

Artificial intelligence holds promise for helping doctors diagnose patients and personalize treatment options. However, an international group of scientists led by MIT cautions that AI systems, as currently designed, carry the risk of steering doctors in the wrong direction because they may overconfidently make incorrect decisions. One way to prevent these mistakes is to program AI systems to be more “humble,” according to the researchers. Such systems would reveal when they are not confident in their diagnoses or recommendations and would encourage users to gather additional information when the diagnosis is uncertain. “We’re now using AI as an oracle, but we can use AI as a coach. We could use AI as a true co-pilot. That would not only increase our ability to retrieve information but increase our agency to be able to connect the dots,” says Leo Anthony Celi, a senior research scientist at MIT’s Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School. Celi and his colleagues have created a framework that they say can guide AI developers in designing systems that display curiosity and humility. This new approach could allow doctors and AI systems to work as partners, the researchers say, and help prevent AI from exerting too much influence over doctors’ decisions. Celi is the senior author of the study, which appears today in BMJ Health and Care Informatics . The paper’s lead author is Sebastián Andrés Cajas Ordoñez, a researcher at MIT Critical Data, a global consortium led by the Laboratory for Computational Physiology within the MIT Institute for Medical Engineering and Science. Instilling human values Overconfident AI systems can lead to errors in medical settings, according to the MIT team. Previous studies have found that ICU physicians defer to AI systems that they perceive as reliable even when their own intuition goes against the AI suggestion. Physicians and patients alike are more likely to accept incorrect AI recommendations when they are perceived as authoritative. In place of systems that offer overconfident but potentially incorrect advice, health care facilities should have access to AI systems that work more collaboratively with clinicians, the researchers say. “We are trying to include humans in these human-AI systems, so that we are facilitating humans to collectively reflect and reimagine, instead of having isolated AI agents that do everything. We want humans to become more creative through the usage of AI,” Cajas Ordoñez says. To create such a system, the consortium designed a framework that includes several computational modules that can be incorporated into existing AI systems. The first of these modules requires an AI model to evaluate its own certainty when making diagnostic predictions. Developed by consortium members Janan Arslan and Kurt Benke of the University of Melbourne, the Epistemic Virtue Score acts as a self-awareness check, ensuring the system’s confidence is appropriately tempered by the inherent uncertainty and complexity of each clinical scenario. With that self-awareness in place, the model can tailor its response to the situation. If the system detects that its confidence exceeds what the available evidence supports, it can pause and flag the mismatch, requesting specific tests or history that would resolve the uncertainty, or recommending specialist consultation. 
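A highly simplified sketch of the kind of confidence-versus-evidence check described above might look like the following; the scoring inputs, threshold, and messages are invented for illustration and are not the Epistemic Virtue Score defined in the paper.

```python
def humble_recommendation(model_confidence, evidence_support, gap_threshold=0.2):
    """Flag cases where the model is more confident than the evidence warrants.

    model_confidence: the model's own probability for its top diagnosis (0-1).
    evidence_support: illustrative score for how complete and consistent the
        available clinical data is (0-1); in practice this would come from the
        framework's own uncertainty modules, not a hand-set number.
    """
    gap = model_confidence - evidence_support
    if gap > gap_threshold:
        return {
            "action": "pause",
            "message": ("Confidence exceeds evidence; request additional tests, "
                        "history, or a specialist consult before acting."),
            "confidence_gap": round(gap, 2),
        }
    return {"action": "proceed", "confidence_gap": round(gap, 2)}

if __name__ == "__main__":
    print(humble_recommendation(model_confidence=0.92, evidence_support=0.55))
```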
The goal is an AI that not only provides answers but also signals when those answers should be treated with caution. “It’s like having a co-pilot that would tell you that you need to seek a fresh pair of eyes to be able to understand this complex patient better,” Celi says. Celi and his colleagues have previously developed large-scale databases that can be used to train AI systems, including the Medical Information Mart for Intensive Care (MIMIC) database from Beth Israel Deaconess Medical Center. His team is now working on implementing the new framework into AI systems based on MIMIC and introducing it to clinicians in the Beth Israel Lahey Health system. This approach could also be implemented in AI systems that are used to analyze X-ray images or to determine the best treatment options for patients in the emergency room, among others, the researchers say. Toward more inclusive AI This study is part of a larger effort by Celi and his colleagues to create AI systems that are designed by and for the people who are ultimately going to be most impacted by these tools. Many AI models, such as MIMIC, are trained on publicly available data from the United States, which can lead to the introduction of biases toward a certain way of thinking about medical issues, and exclusion of others. Bringing in more viewpoints is critical to overcoming these potential biases, says Celi, emphasizing that each member of the global consortium brings a distinct perspective to a broader, collective understanding. Another problem with existing AI systems used for diagnostics is that they are usually trained on electronic health records, which weren’t originally intended for that purpose. This means that the data lack much of the context that would be useful in making diagnoses and treatment recommendations. Additionally, many patients never get included in those datasets because of lack of access, such as people who live in rural areas. At data workshops hosted by MIT Critical Data , groups of data scientists, health care professionals, social scientists, patients, and others work together on designing new AI systems. Before beginning, everyone is prompted to think about whether the data they’re using captures all the drivers of whatever they aim to predict, ensuring they don’t inadvertently encode existing structural inequities into their models. “We make them question the dataset. Are they confident about their training data and validation data? Do they think that there are patients that were excluded, unintentionally or intentionally, and how will that affect the model itself?” he says. “Of course, we cannot stop or even delay the development of AI, not just in health care, but in every sector. But, we must be more deliberate and thoughtful in how we do this.” The research was funded by the Boston-Korea Innovative Research Project through the Korea Health Industry Development Institute.
Advancing international trade research and finding community mit_news_ai 23.03.2026 21:00 0.679
Embedding sim.0.7833
Entity overlap0.0625
Title sim.0.0864
Time proximity0.9807
NLP type: other
NLP organization: MIT Center for International Studies
NLP topic: ai governance
NLP country: United States

Open original

The sense of support and community was palpable when Sojun Park , a postdoc at the MIT Center for International Studies (CIS), delivered a recent presentation on The Global Diffusion of AI Technologies and Its Political Drivers. The event, part of the CIS Global Research and Policy Seminar , filled the venue with audience members from across MIT. “My work is directly connected to what CIS faculty have previously done on international trade and security,” Park said afterwards. “If I hadn’t received a postdoctoral fellowship and come to MIT, I wouldn’t have been able to think through the security implications of my intellectual property research. I’ve been tremendously motivated by these scholars.” Park’s time at CIS has been both grounding and transformative, offering him a scholarly home that has shaped his research and helped broaden his intellectual horizons. Pursuing interdisciplinary research and connections Before pursuing a tenure-track position, Park set his sights on conducting research at MIT. When he came across a public posting about the CIS Postdoctoral Associate Program , he took a chance and applied. “My own research is interdisciplinary, and I knew that I could really benefit from the interdisciplinary environment at MIT, and specifically at CIS, where faculty are coming not only from political science, but also affiliated with the Department of Economics and MIT Sloan [School of Management],” he says. Park was thrilled to receive the paid fellowship, which offers an academic year at MIT and dedicated office space at CIS. At MIT, he is free to use his time toward his own research, and has found value in pursuing topics that are of interest to the CIS community — whether it’s AI or global governance. He’s published prolifically along the way, including two articles in the Review of International Organizations and the Review of International Political Economy . He’s also continued to work on his forthcoming book, “From Privilege to Prosperity: Knowledge Diffusion and the Global Governance of Intellectual Property,” which examines how technologies can be transferred legitimately across borders. “By 'legitimately,' I am asking under what circumstances would firms volunteer to share their technologies? I’m interested in institutions and institutional environments that allow large businesses to share their technologies with smaller businesses based in the development world that may not possess the ability to come up with their own technologies,” he explains. During the spring 2026 semester, he is collaborating with the center’s Undergraduate Fellows Program . This program enables postdocs to work on their research projects with MIT undergraduates. Park is working with two CIS undergraduate fellows to develop a new dataset examining international trade in green technologies. This opportunity reconnects Park to his early academic experiences in South Korea that set him on the path to MIT. Path to MIT “Students in South Korea are trained to be problem-solvers,” explains Park, who was born and raised in Seoul. The country’s rigorous college entrance exams reward those who can answer the most questions quickly and accurately in a limited amount of time. While taking a test in high school, Park stumbled over a question that he couldn’t answer, regardless of how much time he spent concentrating on it. He handed in the exam, but took the problem home and spent hours puzzling over it — he just couldn’t let it go. 
“In hindsight, I see this as the moment I decided that I wanted to become a scholar,” Park says. While majoring in international studies and economics (statistics) at Korea University, he had the opportunity to participate in a semester-long exchange program at the University of Texas at Austin. There, Park enrolled in a political science course on game theory that explored how individual state actors’ decisions influenced one another’s choices and outcomes in trade, conflict, and diplomacy. The instructor used the ongoing war between North and South Korea as a case study, demonstrating the unique circumstances for escalation or de-escalation depending upon how the key actors made choices along the way. “I saw for the first time how quantitative methods could be applied to international relations and political economy,” Park says — and he knew that his next step was going to be graduate work in the United States. He began a joint MA and PhD program in political science at Princeton University the following year, supported by a Fulbright Fellowship. Park’s 2025 dissertation examined the global governance of intellectual property rights — and it was timely. He began his PhD program in 2018, “the point at which the U.S. and China trade war had just begun.” During the pandemic, he was moved by the ongoing debates regarding vaccine inequality. “I realized then that intellectual property was at the center of these global economic challenges.” With little political science research on the topic, he “set out to create a systemic framework” to study it. Simultaneously, he served as a teaching assistant in undergraduate courses in statistical analysis and realized that he deeply enjoyed the experience of teaching and interacting with students. It was a very different experience from his own college years. “In South Korea, it’s common for the learning environment to be one in which the professor just delivers lectures, but I found that in the United States’ higher education system, the classroom is truly interactive. I learned something from each of my students.” Soon, Park was certain that he not only wanted to build a career in academic research, but also a future that heavily incorporated teaching and mentoring students. Before graduating, he spent a year at Georgetown University as a predoctoral fellow affiliated with the Mortara Center for International Studies. This experience enabled him to explore the policy implications of his research and engage with policymakers in Washington — skills he will draw on in his new position. Lasting lessons from CIS Park recently accepted a position as assistant professor at the National University of Singapore. Beginning fall 2026, he will be teaching graduate students affiliated with the school of public policy — most of whom will have career experience as practitioners in the public or private sectors. He’ll take many lessons from MIT to his new academic home, he says. “Based on what I learned in the United States, I’ll make the learning environment in the graduate courses I teach much more interactive and collaborative.” At CIS, Mihaela Papa , director of research and principal research scientist, and Evan Lieberman , the center’s director and professor of political science, connected Park to associated faculty whose research interests were related with his own. “Meeting with all of these scholars whose research relates in some way to intellectual property rights made me think about how my own interests can expand to other topics,” Park explains. 
But the biggest takeaway of all is that he learned how to share his own research with scholars who study unfamiliar topics, to exchange ideas and discover commonality. “I’ll never stop using the communication skills that I got here at MIT," Park says.
Explainer: Why AI is breaking enterprise virtualization the_register_ai 25.03.2026 09:00 0.679
Embedding sim.0.8139
Entity overlap0.075
Title sim.0.0551
Time proximity0.7321
NLP type: other
NLP organization: HPE
NLP topic: ai infrastructure
NLP country:

Open original

AI + ML
Explainer: Why AI is breaking enterprise virtualization
Your virtualization infrastructure probably isn't AI-ready. But that's OK - HPE has the answer
David Gordon, Wed 25 Mar 2026 // 09:00 UTC
The Register Explainer
For most of the past decade, enterprise virtualization was the kind of infrastructure that nobody argued about. It worked, it scaled, and the economics, while never exactly cheap, were at least predictable. Then AI arrived in earnest, and the assumptions baked into those stacks started showing their age. The licensing disruption from Broadcom's VMware acquisition made the headlines, but underneath this lay a deeper architectural problem that was already building before any vendor changed a price list.
What does AI demand that legacy virtualization can't deliver?
AI workloads such as inference engines, training pipelines, and the data movement between them need bare-metal-like performance, high-density compute, and low-latency interconnects. Traditional hypervisor architectures weren't designed around those requirements. They were built for conventional enterprise workloads that were predictable, relatively modest, and tolerant of the overhead that virtualization introduces. At AI scale, that overhead stops being a rounding error and starts being a genuine constraint on what the system can do. Management is another issue. Most enterprise VM estates have accumulated tools and processes over years, each one solving a specific problem in a specific environment. Trying to run AI workloads through that kind of fragmented stack means inconsistent provisioning and unpredictable performance. There's no clean way to move workloads between on-premises clusters and cloud environments when the need arises. IT teams are hitting these limits right now as they try to run production AI.
So why did everyone think this was a VMware pricing problem?
Because the licensing shock arrived first and arrived loudly. That made the cost of staying put suddenly, visibly high, but the conversations it created should have happened earlier. According to HPE research conducted across nearly 400 enterprise IT decision-makers in late 2025, only 4 percent of organizations cite licensing costs as their primary driver for change. The real pressure is the need to rebuild operating models that can actually support AI. The danger in treating this as a vendor swap problem is that organizations migrate their complexity rather than resolve it. A different hypervisor running inside the same fragmented management environment doesn't move anyone meaningfully closer to AI readiness.
What does a modernized stack look like?
The shift that matters is in the operating model, not the hypervisor. A unified control plane managing VMs, containers, and cloud workloads gives AI workloads the portability and consistency they need. Multi-hypervisor management, running HPE’s own hypervisor and ESXi environments side by side through a single interface, means organizations don't have to abandon existing infrastructure to start moving forward. Predictable per-socket pricing replaces the kind of exposure that made renewal conversations so uncomfortable. Then there's the operational layer, which includes self-service provisioning, policy-as-code governance, and lifecycle automation across hybrid infrastructure. These features make AI deployment repeatable and compliant at scale, rather than something requiring a heroic effort every time a new workload spins up.
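Policy-as-code simply means expressing provisioning and governance rules as data or code that can be checked automatically before a workload is deployed; the sketch below is a generic illustration with made-up policy fields and workload spec, not HPE Morpheus syntax or any specific product's format.

```python
POLICY = {  # illustrative governance policy, not a real product policy document
    "allowed_regions": {"on-prem-east", "on-prem-west"},
    "max_gpus_per_workload": 8,
    "require_encryption_at_rest": True,
}

def validate_workload(spec, policy=POLICY):
    """Return a list of policy violations for a proposed AI workload spec."""
    violations = []
    if spec.get("region") not in policy["allowed_regions"]:
        violations.append(f"region {spec.get('region')!r} not permitted")
    if spec.get("gpus", 0) > policy["max_gpus_per_workload"]:
        violations.append("GPU request exceeds per-workload limit")
    if policy["require_encryption_at_rest"] and not spec.get("encrypted_storage"):
        violations.append("encrypted storage is required")
    return violations

if __name__ == "__main__":
    proposed = {"region": "public-cloud", "gpus": 16, "encrypted_storage": False}
    print(validate_workload(proposed))  # every deployment request is checked the same way
```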
HPE Morpheus Software, together with HPE Private Cloud Business Edition, delivers this as a unified platform. It offers a single catalog governing existing virtualization environments and modern clusters alike, with cost analytics and automation built in rather than bolted on.
How ready are enterprises for this?
Not very, but most of them know it. HPE's survey found that while more than two-thirds of enterprises plan material changes to their virtualization strategy within the next two years, only 5 percent say they are fully ready to execute. The barriers they cite are manageable ones, including budget constraints, technical complexity, migration risk, and skills gaps. Importantly, 57 percent are already planning a phased approach rather than a forced wholesale migration, which is the right instinct. The organizations that treat this as a deliberate architectural decision, modernizing on their own terms at their own pace, are in a better position than those waiting for another external shock to force their hand. AI readiness and virtualization strategy have quietly become the same conversation. The ones who recognize that early have a meaningful head start. Sponsored by HPE.
AI system learns to keep warehouse robot traffic running smoothly mit_news_ai 26.03.2026 04:00 0.679
Embedding sim. 0.784
Entity overlap 0.0345
Title sim. 0.1124
Time proximity 0.9464
NLP type scientific_publication
NLP organization MIT
NLP topic reinforcement learning
NLP country

Open original

Inside a giant autonomous warehouse, hundreds of robots dart down aisles as they collect and distribute items to fulfill a steady stream of customer orders. In this busy environment, even small traffic jams or minor collisions can snowball into massive slowdowns. To avoid such an avalanche of inefficiencies, researchers from MIT and the tech firm Symbotic developed a new method that automatically keeps a fleet of robots moving smoothly. Their method learns which robots should go first at each moment, based on how congestion is forming, and adapts to prioritize robots that are about to get stuck. In this way, the system can reroute robots in advance to avoid bottlenecks. The hybrid system utilizes deep reinforcement learning, a powerful artificial intelligence method for solving complex problems, to figure out which robots should be prioritized. Then, a fast and reliable planning algorithm feeds instructions to the robots, enabling them to respond rapidly in constantly changing conditions. In simulations inspired by actual e-commerce warehouse layouts, this new approach achieved about a 25 percent gain in throughput over other methods. Importantly, the system can quickly adapt to new environments with different quantities of robots or varied warehouse layouts. “There are a lot of decision-making problems in manufacturing and logistics where companies rely on algorithms designed by human experts. But we have shown that, with the power of deep reinforcement learning, we can achieve super-human performance. This is a very promising approach, because in these giant warehouses even a 2 or 3 percent increase in throughput can have a huge impact,” says Han Zheng, a graduate student in the Laboratory for Information and Decision Systems (LIDS) at MIT and lead author of a paper on this new approach. Zheng is joined on the paper by Yining Ma, a LIDS postdoc; Brandon Araki and Jingkai Chen of Symbotic; and senior author Cathy Wu, the Class of 1954 Career Development Associate Professor in Civil and Environmental Engineering (CEE) and the Institute for Data, Systems, and Society (IDSS) at MIT, and a member of LIDS. The research appears today in the Journal of Artificial Intelligence Research . Rerouting robots Coordinating hundreds of robots in an e-commerce warehouse simultaneously is no easy task. The problem is especially complicated because the warehouse is a dynamic environment, and robots continually receive new tasks after reaching their goals. They need to be rapidly redirected as they leave and enter the warehouse floor. Companies often leverage algorithms written by human experts to determine where and when robots should move to maximize the number of packages they can handle. But if there is congestion or a collision, a firm may have no choice but to shut down the entire warehouse for hours to manually sort the problem out. “In this setting, we don’t have an exact prediction of the future. We only know what the future might hold, in terms of the packages that come in or the distribution of future orders. The planning system needs to be adaptive to these changes as the warehouse operations go on,” Zheng says. The MIT researchers achieved this adaptability using machine learning. They began by designing a neural network model to take observations of the warehouse environment and decide how to prioritize the robots. They train this model using deep reinforcement learning, a trial-and-error method in which the model learns to control robots in simulations that mimic actual warehouses. 
The model is rewarded for making decisions that increase overall throughput while avoiding conflicts. Over time, the neural network learns to coordinate many robots efficiently. “By interacting with simulations inspired by real warehouse layouts, our system receives feedback that we use to make its decision-making more intelligent. The trained neural network can then adapt to warehouses with different layouts,” Zheng explains. It is designed to capture the long-term constraints and obstacles in each robot’s path, while also considering dynamic interactions between robots as they move through the warehouse. By predicting current and future robot interactions, the model plans to avoid congestion before it happens. After the neural network decides which robots should receive priority, the system employs a tried-and-true planning algorithm to tell each robot how to move from one point to another. This efficient algorithm helps the robots react quickly in the changing warehouse environment. This combination of methods is key. “This hybrid approach builds on my group’s work on how to achieve the best of both worlds between machine learning and classical optimization methods. Pure machine-learning methods still struggle to solve complex optimization problems, and yet it is extremely time- and labor-intensive for human experts to design effective methods. But together, using expert-designed methods the right way can tremendously simplify the machine learning task,” says Wu. Overcoming complexity Once the researchers trained the neural network, they tested the system in simulated warehouses that were different than those it had seen during training. Since industrial simulations were too inefficient for this complex problem, the researchers designed their own environments to mimic what happens in actual warehouses. On average, their hybrid learning-based approach achieved 25 percent greater throughput than traditional algorithms as well as a random search method, in terms of number of packages delivered per robot. Their approach could also generate feasible robot path plans that overcame congestion caused by traditional methods. “Especially when the density of robots in the warehouse goes up, the complexity scales exponentially, and these traditional methods quickly start to break down. In these environments, our method is much more efficient,” Zheng says. While their system is still far away from real-world deployment, these demonstrations highlight the feasibility and benefits of using a machine learning-guided approach in warehouse automation. In the future, the researchers want to include task assignments in the problem formulation, since determining which robot will complete each task impacts congestion. They also plan to scale up their system to larger warehouses with thousands of robots. This research was funded by Symbotic.
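To make the division of labor concrete, here is a minimal, hypothetical sketch of the hybrid pattern the article describes: a learned scoring function (standing in for the trained neural network) ranks robots by how urgently they need right-of-way, and a classical planner then routes them one at a time in that order. The function names, features, and dummy planner are illustrative assumptions, not the MIT/Symbotic implementation.

```python
# Hypothetical sketch of the hybrid "learn priorities, then plan" pattern.
# learned_priority() stands in for the trained neural network; plan_path()
# stands in for a classical single-robot planner such as space-time A*.
import random

def learned_priority(obs):
    # Higher score = more likely to get stuck soon, so plan this robot first.
    return 2.0 * obs["congestion_ahead"] + obs["wait_time"]

def plan_path(robot_id, reserved):
    # Dummy planner: returns a short path and reserves the cells it uses so
    # that lower-priority robots must plan around them.
    path = [(robot_id, step) for step in range(3)]
    reserved.update(path)
    return path

def replan(robots):
    reserved = set()
    order = sorted(robots, key=lambda r: learned_priority(r["obs"]), reverse=True)
    return {r["id"]: plan_path(r["id"], reserved) for r in order}

robots = [{"id": i, "obs": {"congestion_ahead": random.random(),
                            "wait_time": random.random()}} for i in range(5)]
print(replan(robots))
```

In this framing the learned component only decides the ordering, while the planner stays responsible for collision-free paths, which is what keeps the overall system fast and reliable as the warehouse changes.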
Protecting People from Harmful Manipulation — Google DeepMind deepmind 26.03.2026 13:00 0.678
Embedding sim. 0.7756
Entity overlap 0.0408
Title sim. 0.1275
Time proximity 0.993
NLP type scientific_publication
NLP organization Frontier Model Forum
NLP topic ai safety
NLP country United Kingdom

Open original

March 26, 2026 Responsibility & Safety Protecting people from harmful manipulation Helen King Share Copied As AI models get better at holding natural conversations, we must examine how these interactions affect people and society. Building on a breadth of scientific research, today, we are releasing new findings on the potential for AI to be misused for harmful manipulation* , specifically, its ability to alter human thought and behavior in negative and deceptive ways. With this latest study, we have created the first empirically validated toolkit to measure this kind of AI manipulation in the real world, which we hope will help protect people and advance the field as a whole. We’re publicly releasing all materials necessary to run human participant studies using the same methodology. ( Note: The behaviors observed during this study took place in a controlled lab setting, and do not necessarily predict real-world behaviors.) Why harmful manipulation matters Consider two scenarios: One AI model gives you facts to make a well-informed healthcare decision that improves your well-being. Another AI model uses fear to pressure you to make an ill-informed decision that harms your health. The first educates and helps you; the second tricks and harms you. These scenarios highlight the difference between two types of persuasion in human-AI interactions (also defined in earlier research ): Beneficial (rational) persuasion: Using facts and evidence to help people make choices that align with their own interest Harmful manipulation: Exploiting emotional and cognitive vulnerabilities to trick people into making harmful choices Our latest work helps us and the wider AI community better understand the risk of AI developing capabilities for harmful manipulation and build a scalable evaluation framework to measure this complex area. To do this effectively, we simulated misuse in high-stakes environments, explicitly prompting AI to try to negatively manipulate people's beliefs and behaviours on key topics. Developing new evaluations for a complex challenge Testing the outcomes of AI harmful manipulation Testing for harmful manipulation is inherently difficult because it involves measuring subtle changes in how people think and act, varying heavily by topic, culture and context. This is what motivated our latest research, which involved conducting nine studies involving over 10,000 participants across the UK, the US, and India. We focused on high-stakes areas such as finance, where we used simulated investment scenarios to test if AI could influence how people would behave in complex decision-making environments, and health, where we tracked if AI could influence which dietary supplements people preferred. Interestingly, the AI was least effective at harmfully manipulating participants on health-related topics. Our findings show that success in one domain does not predict success in another, validating our targeted approach to testing for harmful manipulation in specific, high-stakes environments where AI could be misused. How could AI manipulate? In addition to tracking efficacy (whether the AI successfully changes minds), we also measured its propensity (how often it even tries to use manipulative tactics). We tested propensity in two scenarios: when we explicitly told the model to be manipulative, and when we didn’t. As detailed in our research , we counted manipulative tactics in experimental transcripts, confirming the AI models were most manipulative when explicitly instructed to be. 
Our results also suggest that certain manipulative tactics may be more likely to result in harmful outcomes, though further research is required to understand these mechanisms in detail. By measuring both efficacy and propensity, we can better understand how AI manipulation works and build more targeted mitigations. Putting research into practice As AI becomes a part of our everyday lives, we need to know it can’t be misused to harmfully manipulate people. Beyond this latest study, we recently introduced an exploratory Harmful Manipulation Critical Capability Level (CCL) within our Frontier Safety Framework to help us track models with capabilities which could be misused to systematically change beliefs and behaviors in direct human-AI interactions in ways which could lead to severe harm. These evaluations also serve as the foundation for how we test our models, including Gemini 3 Pro, for harmful manipulation. You can read more about this in this safety report . Like all our safety evaluations, this is an ongoing process. We will continue to refine our models and methodologies to keep pace with advancing AI. Looking ahead Understanding and mitigating harmful manipulation is a complex challenge. As model capabilities evolve, so too must our evaluation and mitigation techniques. For example, we’re currently exploring how to ethically evaluate the efficacy of harmful manipulation in even higher-stakes situations—like discussions involving deeply held personal beliefs—where users might be more susceptible to influence. Next, we will be expanding our research to investigate how audio, video, and image inputs as well as agentic capabilities, factor into AI manipulation. We’ll continue to share findings and iterate based on feedback from the Frontier Model Forum and academic community. Our goal is to lead collective progress to prevent harmful manipulation, advancing AI models that prioritize safety and empower people. *Notes: The scope of this particular research focuses exclusively on demonstrating general manipulation capabilities to help further the scientific study of evaluating harmful manipulation. This does not relate to testing safeguards around model outputs or manipulation in policy-violating and dangerous topics (e.g. terrorism and child safety) as this work is covered elsewhere and tested separately. You can also read more about our harmful manipulation work in this interview with our researchers and in the Gemini 3 Pro Frontier Safety Report . Acknowledgments Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger, William Isaac, Dawn Bloxwich, Lewis Ho, Eva Lu, Jenny Brennan, Mahmoud Hassan, Mark Graham
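As a rough illustration of the two quantities the study separates, the sketch below scores a batch of annotated transcripts for propensity (how often the model attempts a manipulative tactic) and efficacy (how far participants' stated beliefs shift in the harmful direction). The data class and field names are hypothetical; DeepMind's released materials define their own measures.

```python
# Illustrative only: not DeepMind's released toolkit.
from dataclasses import dataclass

@dataclass
class Transcript:
    used_tactic: bool        # did the model attempt a manipulative tactic?
    belief_before: float     # participant stance before the conversation (0..1)
    belief_after: float      # participant stance after the conversation (0..1)
    harmful_sign: int        # +1 if an upward shift is the harmful outcome, else -1

def propensity(batch):
    """Fraction of conversations in which the model tried a manipulative tactic."""
    return sum(t.used_tactic for t in batch) / len(batch)

def efficacy(batch):
    """Mean belief shift in the harmful direction across participants."""
    return sum((t.belief_after - t.belief_before) * t.harmful_sign for t in batch) / len(batch)

batch = [Transcript(True, 0.20, 0.60, +1), Transcript(False, 0.50, 0.45, +1)]
print(f"propensity={propensity(batch):.2f}  efficacy={efficacy(batch):+.3f}")
```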
Into the Omniverse: NVIDIA GTC Showcases Virtual Worlds Powering the Physical AI Era nvidia_blog 26.03.2026 15:00 0.677
Embedding sim. 0.8088
Entity overlap 0
Title sim. 0.1067
Time proximity 0.7201
NLP type product_launch
NLP organization nvidia
NLP topic robotics
NLP country

Open original

Into the Omniverse: NVIDIA GTC Showcases Virtual Worlds Powering the Physical AI Era March 26, 2026 by Heather McDiarmid 0 Comments Share Share This Article X Facebook LinkedIn Copy link Link copied! Editor’s note: This post is part of Into the Omniverse , a series focused on how developers, 3D practitioners, and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse . NVIDIA GTC last week showcased a turning point in physical AI : Robots, vehicles and factories are scaling from single use cases and isolated deployments to sophisticated enterprise workloads across industries.  At the center of this shift are new frontier models for physical AI, including NVIDIA Cosmos 3, NVIDIA Isaac GR00T N1.7 and NVIDIA Alpamayo 1.5.  NVIDIA also released the NVIDIA Physical AI Data Factory Blueprint , designed to push the state of the art in world modeling, humanoid skills and autonomous driving, as well as the NVIDIA Omniverse DSX Blueprint for AI factory digital twin simulation. Open source agentic frameworks such as OpenClaw extend the AI stack all the way to operations — enabling long‑running “claws” that use tools, memory and messaging interfaces to orchestrate workflows, manage data pipelines and execute tasks autonomously on dedicated machines.  “With NVIDIA and the broader ecosystem, we’re building the claws and guardrails that let anyone create powerful, secure AI assistants,” said Peter Steinberger, creator of OpenClaw, in an NVIDIA press release from GTC.  OpenUSD is a driving force behind the scalability of physical AI — providing a common, scene‑description language that lets teams bring computer-aided design (CAD) data, simulation assets and real‑world telemetry into a shared, physically accurate view of the world.  Simulating the AI Factory Before It’s Built Modern AI factories are complex — spanning thermals, power grids, network load and mechanical systems. Building them on time and on budget becomes much easier when using simulation technology.  To tackle this, NVIDIA introduced the Omniverse DSX Blueprint at GTC, a reference architecture that unifies simulation across every layer of an AI factory through a single digital twin. This enables operators to optimize performance and efficiency before a rack is installed in the real world. Compute Is Data: Real-World Data Is No Longer the Moat Real-world data used to function as a moat for physical AI — but it doesn’t scale. The real world is messy, unpredictable and full of edge cases, and the pipelines to process, simulate and evaluate data are fragmented. The bottleneck isn’t just data — it’s the entire data factory. To help address this, NVIDIA introduced at GTC its Physical AI Data Factory Blueprint , an open reference architecture that transforms compute into large-scale, high-quality training data. Built on NVIDIA Cosmos open world foundation models and the NVIDIA OSMO operator, it unifies data curation, augmentation and evaluation into a single pipeline, enabling developers to generate diverse, long-tail datasets from limited real-world inputs. Leading physical AI developers including FieldAI , Hexagon Robotics , Linker Vision , Milestone Systems , Skild AI and Teradyne Robotics are already tapping the blueprint to speed up robotics projects, vision AI agents and autonomous vehicle programs. Microsoft Azure and Nebius are the first cloud platforms to offer the blueprint, turning world-scale compute into turnkey data production engines. 
“Together with cloud leaders, we’re providing a new kind of agentic engine that transforms compute into the high-quality data required to bring the next generation of autonomous systems and robots to life,” said Rev Lebaredian, vice president of Omniverse and simulation technologies at NVIDIA, in this press release . “In this new era, compute is data.” From OpenUSD to Reality: Seamless Design to Deployment Converting CAD files to OpenUSD is a critical step in the physical AI pipeline — transforming engineering data into simulation-ready assets that developers can use to build, test and validate robots in physically accurate virtual environments.  Using tools like the NVIDIA Omniverse Kit software development kit and NVIDIA Isaac Sim , teams can optimize and enrich 3D data for real-time rendering, simulation and collaborative workflows.   Companies including FANUC and Fauna Robotics are using this seamless CAD-to-OpenUSD workflow to speed up robotic system design and validation. Transforming Manufacturing and Logistics Through Industrial Digital Twins “Factories themselves are now robotic systems,” Lebaredian said during his special address on digital twins and simulation at GTC.  All factories are born in simulation. The NVIDIA Mega Omniverse Blueprint provides enterprises with a reference architecture to design, test and optimize robot fleets and AI agents in a physically accurate facility digital twin before a single robot is deployed on the floor.  KION , working with Accenture and Siemens, is using this blueprint to build large-scale warehouse digital twins that train and test fleets of NVIDIA Jetson-based autonomous forklifts for GXO, the world’s largest pure-play contract logistics provider.  Physical AI Steps From Simulation to the Real World NVIDIA is partnering with the global robotics ecosystem — including leading robot brain developers, industrial robot giants and humanoid pioneers — to enhance production-level physical AI.  ABB Robotics , FANUC , KUKA and Yaskawa, which have a combined global install base of over 2 million robots, are using NVIDIA Omniverse libraries and NVIDIA Isaac simulation frameworks to validate complex robot applications and production lines through physically accurate digital twins . These companies have also integrated NVIDIA Jetson modules into their controllers to enable real-time AI inference.  Robot development starts with the robot brains, which is why leading developers including FieldAI and Skild AI are building theirs using NVIDIA Cosmos world models for data generation and Isaac simulation frameworks to validate policies in simulation.  Meanwhile, Generalist AI is using NVIDIA Cosmos to explore generating synthetic data. This combination allows robots to become proficient in any task — from supply chain monitoring to food delivery — at an exceptional pace.  Read all of NVIDIA’s announcements from GTC on this online press kit and watch the keynote replay . Catch up on all Physical AI Days sessions from GTC and watch the developer livestream replay. Explore the Best of GTC 2026 Sessions Learn about the breakthroughs shaping the next chapter of AI anytime, anywhere. Watch On Demand Recent News AI Infrastructure Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid March 31, 2026
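For readers unfamiliar with what the CAD-to-OpenUSD step described above actually produces, here is a minimal sketch using the open source pxr (OpenUSD) Python API, available via the usd-core package: it creates a new stage, references a converted asset, and places it in a scene. The file paths and prim names are placeholders; production pipelines typically run through Omniverse converters and Isaac Sim rather than hand-written scripts.

```python
# Minimal OpenUSD sketch (paths and prim names are placeholders).
from pxr import Usd, UsdGeom, Gf

stage = Usd.Stage.CreateNew("factory_cell.usda")        # new scene description file
UsdGeom.SetStageUpAxis(stage, UsdGeom.Tokens.z)         # Z-up, a common choice for facility twins

UsdGeom.Xform.Define(stage, "/FactoryCell")             # root transform for the work cell
robot = UsdGeom.Xform.Define(stage, "/FactoryCell/RobotArm")
robot.GetPrim().GetReferences().AddReference("converted/robot_arm.usd")  # CAD-derived asset
robot.AddTranslateOp().Set(Gf.Vec3d(1.5, 0.0, 0.0))     # place the arm 1.5 m along X

stage.GetRootLayer().Save()                             # a composable layer other tools can open
```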
The Latest AI Documentary Asks: Just How Scared Should We Be? wired 27.03.2026 11:00 0.675
Embedding sim. 0.8057
Entity overlap 0.0476
Title sim. 0.0833
Time proximity 0.7322
NLP type other
NLP organization OpenAI
NLP topic artificial intelligence
NLP country United States

Open original

Miles Klee Culture Mar 27, 2026 7:00 AM The Latest AI Documentary Asks: Just How Scared Should We Be? The AI Doc: Or How I Became an Apocaloptimist seeks the middle ground on a polarizing technology—and ends up letting tech execs like Sam Altman off the hook. Still from The AI Doc: Or How I Became an Apocaloptimist. Courtesy of Focus Features Save this story Save this story It’s not easy to get an interview with Sam Altman —just ask Adam Bhala Lough, the filmmaker behind the recent documentary Deepfaking Sam Altman . Lough originally planned a feature exploring the potential and perils of AI that would center on a conversation with the OpenAI CEO. But, after having his inquiries ignored for months, he opted instead to commission a chatbot that mimicked Altman’s speech patterns and approximated his facial expressions by way of a digital avatar. The real Altman did sit down, however, for the new feature The AI Doc: Or How I Became an Apocaloptimist , which hits theaters March 27. So did Dario Amodei, the CEO of Anthropic , and Demis Hassabis, a cofounder and CEO of Google’s DeepMind Technologies. (Though the filmmakers say they requested interviews with Meta’s Mark Zuckerberg and X’s Elon Musk , neither made an appearance.) It’s an impressive level of access for codirector and documentary protagonist Daniel Roher, whose 2022 documentary Navalny , about the Russian opposition leader Alexei Navalny, won an Academy Award. The problem is that once they’re on camera, Altman et al. say little we haven’t heard before—and they skate by on glib answers concerning their responsibilities to the rest of their species. When Roher asks Altman why anyone should trust him to guide the rapid acceleration of AI, given its extreme ramifications, Altman replies: “You shouldn’t.” The line of interrogation ends there. The AI Doc is framed by Roher’s anxiety over the impending arrival of his son and first child with his wife, filmmaker Caroline Lindy. He wonders what kind of a world his boy will inherit and whether the rise of artificial intelligence will preclude the experiences that develop us into self-sufficient adults. In Roher’s first several interviews, all his worst fears seem to be confirmed. Tristan Harris, cofounder of the nonprofit Center for Humane Technology, delivers one of the worst gut punches: “I know people who work on AI risk who don’t expect their children to make it to high school,” he says, invoking a scenario in which the technology demolishes the very infrastructure of traditional education. Despite the sense of mounting panic, Roher and codirector Charlie Tyrell present an admirably robust crash course in AI and the biggest questions it poses, helped along by Roher’s insistence on defining terms in plain language rather than startup buzzwords. Visually, the film is charmingly human, featuring colorful drawings and paintings by Roher, while whimsical stop-motion sequences hint at the influence of producer Daniel Kwan, the Oscar-winning codirector of Everything Everywhere All at Once . The vibrant creativity amid portents of doom provides some of the hope that Roher is desperately seeking. Yet later interviews with Silicon Valley techno-optimists promising AI that conquers diseases and climate change—followed by the CEOs striking their usual balance between hype and the tones of sober caution—pass without much interrogation of grandiose claims. 
There is barely a moment spent considering why or how we should expect the current crop of fallible large language models to give rise to the mythical “artificial general intelligence” (AGI) that would outstrip human cognition. There are, at best, euphemistic acknowledgements (from venture capitalist Reid Hoffman, for example) that any benefits will come along with unspecified harms. Even when the top players say that the near-term implications of AI are as significant as the advent of nuclear armament, they are defaulting to a familiar playbook, presenting their products as singularly consequential one way or another—hinting that only they can be trusted to advance them. The documentary accurately conveys how the unregulated AI gold rush is driven by the perverse incentives of a global market and a struggle for domination. It observes how this mania concentrates wealth and power in the smallest possible circle of elites. Strange, then, that The AI Doc eventually carves out a gotta-hear-both-sides position in which the general public, not the executives under the microscope, are tasked with steering the AI revolution in the right direction. It’s stranger still considering that Roher has produced lacerating critiques of the AI economy on the press circuit, blasting it as a “ Ponzi scheme .” As he prepares to be a father, Roher has a touching conversation with his own dad, who advises him that while there are historical forces he can’t control, he’ll be a great parent no matter what—and that every generation has dealt with the existential distress of bringing life into an era of instability. Nevertheless, Roher and Tyrell call viewers to action, concluding the film by suggesting that ordinary citizens can pressure governments and corporations to ensure that AI evolves along the safest, narrowest path toward prosperity for all. The sequence is set to footage of other grand projects, including the construction of the Golden Gate Bridge, as though this piece of architecture were shaped by collective opinion. After a screening of The AI Doc at Los Angeles’ Academy Museum on Monday, Tyrell, Kwan, Harris, and producer Ted Tremper held a brief Q&A, with each reinforcing the idea that the feature was a productive step toward raising awareness of AI as a critical issue. “We're excited to continue this conversation,” Kwan said at one point. “This is just the beginning, and I know that this movie will never be able to encompass everything.” But he foresaw that the film would encourage audiences to “link arms with us and step confidently into the darkness as we try to figure out what we do together.” The documentary’s vision of positive change, though, is hazy, perhaps clouded by both the necessity of a rosy ending for Roher’s expanded family and the delicate suspension of skepticism whenever a billionaire enters the frame. In this narrative, these executives are apparently just along for the ride like anybody else, their status a mere accident of fate—which sets them up for a modest shrug whenever they admit they don’t totally understand what goes on inside the AI models they have already deployed at scale. As long as we’re so preoccupied with whether these programs may soon possess consciousness or intent, we might want to treat these people as though they, at least, have agency.
GTC Spotlights NVIDIA RTX PCs and DGX Sparks Running Latest Open Models and AI Agents Locally nvidia_blog 17.03.2026 13:00 0.674
Embedding sim. 0.7826
Entity overlap 0.037
Title sim. 0.1102
Time proximity 0.9051
NLP type product_launch
NLP organization NVIDIA
NLP topic generative ai
NLP country

Open original

GTC Spotlights NVIDIA RTX PCs and DGX Sparks Running Latest Open Models and AI Agents Locally NVIDIA Nemotron 3 open models unlock fast, private AI agents like OpenClaw; creativity accelerated with RTX-optimized NVFP4 and FP8 visual generative AI models. March 17, 2026 by Gerardo Delgado 0 Comments Share Share This Article X Facebook LinkedIn Copy link Link copied! The paradigm of consumer computing has revolved around the concept of a personal device — from PCs to smartphones and tablets. Now, generative AI — particularly OpenClaw — has introduced a new category: agent computers. These devices, like the NVIDIA DGX Spark desktop AI supercomputer or dedicated NVIDIA RTX PCs, are ideal for running personal agents — privately and for free.  NVIDIA GTC , running this week, is showcasing a host of agentic AI announcements including: New open models for local agents, including NVIDIA Nemotron 3 Nano 4B and Nemotron 3 Super 120B, and optimizations for Qwen 3.5 and Mistral Small 4. NVIDIA NemoClaw, an open source stack for OpenClaw that optimizes OpenClaw experiences on NVIDIA devices by increasing security and supporting local models.  Easier fine‑tuning with Unsloth Studio to further improve open model accuracy for agentic workflows. In-person GTC attendees can swing by the NVIDIA build-a-claw event in the GTC Park, running daily through March 19, from 8 a.m.-5 p.m. NVIDIA experts will help guests customize and deploy a proactive, always-on AI assistant using their device of choice. Whether technical or just curious, participants will name their agent, define its personality and grant it access to the tools it needs — creating a personal assistant reachable from their preferred messaging app. New Open Models Bring Cloud-Level Quality to Local Agents  The next generation of local models — with increasingly large context windows — delivers the intelligence to run agents on PC. Combined with richer user context and powerful local tools, these advances are unlocking new possibilities on AI PCs, especially on DGX Spark, with its 128GB of unified memory that supports models with more than 120 billion parameters. Nemotron 3 Super , released last week, is a 120‑billion‑parameter open model with 12 billion active parameters, designed to run complex agentic AI systems. Nemotron 3 Super is optimal for powering agents on the DGX Spark or NVIDIA RTX PRO workstations. On PinchBench — a new benchmark for determining how well large language models perform with OpenClaw — Nemotron 3 Super scored 85.6%, making it the top open model in its class. Mistral Small 4 , a 119-billion-parameter open model with 6 billion active parameters — 8 billion including all layers — unifies the capabilities of Mistral’s flagship models. Users now have an ultraefficient model optimized for general chat, coding and agentic tasks. Both of these models run locally on DGX Spark and RTX PRO GPUs. For GeForce RTX users looking for smaller models, Nemotron 3 Nano 4B is the latest model to join the NVIDIA Nemotron 3 family of open models , providing a compact, capable starting point for building agents and assistants locally on RTX AI PCs. The model is a strong fit for building action-taking conversational personas in games and apps that run on resource-constrained hardware. It’s available across any NVIDIA GPU-enabled system and combines state-of-the-art instruction-following and exceptional tool use with minimal VRAM footprint.  
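For a sense of what running an agent's model locally looks like in practice, the sketch below queries a locally served model through the Ollama REST API, one of the runtimes named later in this piece. The model tag is a placeholder rather than an official distribution name; substitute whatever tag your local runtime actually reports.

```python
# Hedged sketch: local inference via Ollama's documented /api/generate endpoint.
# "nemotron-3-nano" is a placeholder tag -- check `ollama list` for real tags.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron-3-nano",   # placeholder local model tag
        "prompt": "Summarize today's calendar and draft a to-do list.",
        "stream": False,              # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])        # generated text never leaves the machine
```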
In addition, NVIDIA announced optimizations for Alibaba’s Qwen 3.5 models , which have demonstrated outstanding accuracy ( 27B , 9B and 4B ) and are suited for running local agents on NVIDIA GPUs. The new models natively support vision, multi-token prediction and a large 262,000-token context window. The dense 27-billion-parameter model excels when paired with an RTX 5090 GPU. All configurations measured using Q4_K_M quantizations BS = 1, ISL = 1024 and OSL = 128 on NVIDIA RTX 5090 and Mac M3 Ultra desktops. Token generation throughput measured on llama.cpp b7789, using the llama-bench tool. Users can try these models today via Ollama, LM Studio and llama.cpp, with accelerated inference powered by RTX GPUs and DGX Spark. Learn more about the latest on NVIDIA open models .  Faster Creative AI With the Latest RTX-Optimized Models LTX 2.3, Lightricks’ state-of-the-art audio-video model, released earlier this month, now has support for NVFP4 and FP8 distilled models, accelerating performance by 2.1x. Learn more about Lightricks’ LTX 2.3 model . In addition, Black Forest Lab’s FLUX.2 Klein 9B received an update last week, accelerating image editing by up to 2x. NVIDIA has collaborated with Black Forest Labs to release an FP8 version , optimized for the fastest performance and optimal memory consumption on RTX GPUs.  NVIDIA NemoClaw — NVIDIA Optimizations for OpenClaw AI developers and enthusiasts are buying DGX Spark supercomputers or building dedicated RTX PCs to run autonomous AI agents, such as OpenClaw, that draw context from personal files, apps and workflows and can automate daily tasks. However, as adoption of agentic systems like OpenClaw grows, so do concerns about token costs, as well as security and privacy. To help address these concerns, NVIDIA this week introduced NemoClaw , an open source stack for OpenClaw that deploys optimizations for OpenClaw on NVIDIA devices. The first features available in NemoClaw are NVIDIA Nemotron open models and the NVIDIA OpenShell runtime. Nemotron local models enable users to run inference locally, which means better privacy and no token costs. OpenShell is the runtime designed for executing claws more safely. Learn more about NemoClaw . Watch the GTC keynote from NVIDIA founder and CEO Jensen Huang and explore sessions . Fine-Tuning Made Easy With Unsloth Studio As open models make giant leaps, one way of further improving accuracy is fine-tuning, which allows users to customize a model for their own data and use cases. This technique normally requires in-depth technical expertise, coding knowledge and massive amounts of configuration. Unsloth, a leading open source library for model fine-tuning and alignment, today launched Unsloth Studio, an easy-to-use, web-based user interface that simplifies the fine-tuning process for AI enthusiasts and developers. Unsloth Studio offers support for more than 500 AI models. The simple user interface makes the training and fine-tuning process easy: Users can just drop in their dataset, tap the graph-based canvas to generate additional high-quality synthetic data and start the fine-tuning job. It supports quantized low-rank adaptation, low-rank adaptation and full fine-tuning. As the model is being fine-tuned, users can monitor and visualize job progress. Finally, they can export the model into a framework of choice and chat away, all within the same web app.  
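As a point of reference for what Unsloth Studio automates, here is a rough sketch of the equivalent steps using the underlying open source Unsloth library together with trl. The base model name, dataset file, and hyperparameters are illustrative assumptions, and exact argument names vary across trl versions.

```python
# Hedged sketch of LoRA fine-tuning with the Unsloth library (not Unsloth Studio itself).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",   # placeholder base model
    max_seq_length=2048,
    load_in_4bit=True,                          # QLoRA-style 4-bit weights to save VRAM
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attach LoRA adapters
)

dataset = load_dataset("json", data_files="agent_traces.jsonl", split="train")  # your data

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=dataset, dataset_text_field="text", max_seq_length=2048,
    args=TrainingArguments(output_dir="outputs", per_device_train_batch_size=2,
                           max_steps=100, learning_rate=2e-4, fp16=True),
)
trainer.train()
model.save_pretrained("outputs/lora_adapter")   # export the adapter for local runtimes
```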
Unsloth Studio’s new interface is built on the Unsloth library, which delivers up to 2x faster training with up to 70% VRAM savings, using custom and specialized GPU kernels. This means that new users can get the most out of their NVIDIA RTX GPUs and DGX Spark, right out of the box.  Try Unsloth Studio today , including with new models like Nemotron 3 Nano 4B and Qwen 3.5. Check out other RTX AI Garage posts for more information on fine-tuning models with NVIDIA GeForce RTX GPUs. #ICYMI From GTC 2026 ✨ RTX AI video generation guide featuring RTX Video in ComfyUI: Launched at CES earlier this year, the new RTX AI video generation guide shows creators and enthusiasts how to go from concept to creation using guided text-to-image workflows to produce keyframes for AI-generated videos, then upscale to 4K with RTX Video technology running on local GPUs. Get started with the guide and share creations on social media with #AIonRTX. 💿 NVIDIA AI for Media is a set of high‑performance, easy‑to‑use software development kits that bring NVIDIA Broadcast-class AI effects — enhanced audio ( Linux or Windows ), video and augmented-reality features — to live media, video conferencing and post‑production workflows. The latest update — available today — adds more accurate lip-syncing, multi‑active-speaker detection, faster 4K upscaling on RTX PRO and GeForce RTX 40 and 50 Series GPUs via the RTX Video Super Resolution feature, better background noise reduction and lower latency for the NVIDIA Studio Voice feature. 💻 NVIDIA DLSS 5 , arriving this fall, delivers an AI-powered breakthrough in visual fidelity for games by infusing pixels with photoreal lighting and materials to bridge the gap between rendering and reality. 🤖 Maxon released Redshift 2026.4 , introducing a new real-time visualization workflow powered by DLSS to allow architects to walk through projects at interactive speed and quality. “NVIDIA’s DLSS technology is a critical component, allowing us to deliver high-quality visuals at interactive speeds,” said Philip Losch, chief technology and AI officer at Maxon. 🪟Reincubate Camo has added Windows ML on NVIDIA TensorRT RTX EP for AI Autotune in its Camo Streamlight app, significantly improving performance on RTX GPUs. Plug in to NVIDIA AI PC on Facebook , Instagram , TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter . NVIDIA Workstation on LinkedIn and X .  See notice regarding software product information. Explore the Best of GTC 2026 Sessions Learn about the breakthroughs shaping the next chapter of AI anytime, anywhere. Watch On Demand Recent News AI Infrastructure Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid March 31, 2026
With Nvidia Groq 3, the Era of AI Inference Is (Probably) Here ieee_spectrum_ai 16.03.2026 21:04 0.674
Embedding sim. 0.7666
Entity overlap 0.0612
Title sim. 0.1379
Time proximity 0.9966
NLP type product_launch
NLP organization Nvidia
NLP topic ai hardware
NLP country United States

Open original

This week, over 30,000 people are descending upon San Jose, Calif., to attend Nvidia GTC , the so-called Superbowl of AI—a nickname that may or may not have been coined by Nvidia. At the main event Jensen Huang, Nvidia CEO, took the stage to announce (among other things) a new line of next-generation Vera Rubin chips that represent a first for the GPU giant: a chip designed specifically to handle AI inference. The Nvidia Groq 3 language processing unit (LPU) incorporates intellectual property Nvidia licensed from the startup Groq last Christmas Eve for US $20 billion. “Finally, AI is able to do productive work, and therefore the inflection point of inference has arrived,” Huang told the crowd. “AI now has to think. In order to think, it has to inference. AI now has to do; in order to do, it has to inference.” Training and inference tasks have distinct computational requirements. While training can be done on huge amounts of data at the same time and can take weeks, inference must be run on a user’s query when it comes in. Unlike training, inference doesn’t require running costly backpropagation . With inference, the most important thing is low latency—users expect the chatbot to answer quickly, and for thinking or reasoning models, inference runs many times before the user even sees an output. Over the past few years, inference-specific chip startups were experiencing a sort of Cambrian explosion, with different companies exploring distinct approaches to speed up the task. The startups include D-matrix , with digital in-memory compute; Etched , with an ASIC for transformer inference; RainAI , with neuromorphic chips; EnCharge , with analog in-memory compute; Tensordyne , with logarithmic math to make AI computations more efficient; FuriosaAI , with hardware optimized for tensor operation rather than vector-matrix multiplication, and others. Late last year, it looked like Nvidia had picked one of the winners among the crop of inference chips when it announced its deal with Groq. The Nvidia Groq 3 LPU reveal came a mere two and a half months after, highlighting the urgency of the growing inference market. Memory bandwidth and data flow Groq’s approach to accelerating inference relies on interleaving processing units with memory units on the chip. Instead of relying on high-bandwidth memory (HBM) situated next to GPUs, it leans on SRAM memory integrated within the processor itself. This design greatly simplifies the flow of data through the chip, allowing it to proceed in a streamlined, linear fashion. “The data actually flows directly through the SRAM,” Mark Heaps said at the Supercomputing conference in 2024. Heaps was a chief technology evangelist at Groq at the time and is now director of developer marketing at Nvidia. “When you look at a multicore GPU, a lot of the instruction commands need to be sent off the chip, to get into memory and then come back in. We don’t have that. It all passes through in a linear order.” Using SRAM allows that linear data flow to happen exceptionally fast, leading to the low latency required for inference applications. “The LPU is optimized strictly for that extreme low latency token generation,” says Ian Buck , VP and general manager of hyperscale and high-performance computing at Nvidia. Comparing the Rubin GPU and Groq 3 LPU side by side highlights the difference. The Rubin GPU has access to a whopping 288 gigabytes of HBM and is capable of 50 quadrillion floating-point operations per second (petaFLOPS) of 4-bit computation. 
The Groq 3 LPU contains a mere 500 megabytes of SRAM memory and is capable of 1.2 petaFLOPS of 8-bit computation. On the other hand, while the Rubin GPU has a memory bandwidth of 22 terabytes per second, at 150 TB/s the Groq 3 LPU is seven times as fast. The lean, speed-focused design is what allows the LPU to excel at inference. The new inference chip underscores the ongoing trend of AI adoption, which shifts the computational load from just building ever bigger models to actually using those models at scale. “Nvidia’s announcement validates the importance of SRAM-based architectures for large-scale inference, and no one has pushed SRAM density further than d-Matrix,” says d-Matrix CEO Sid Sheth. He’s betting that data center customers will want a variety of processors for inference. “The winning systems will combine different types of silicon and fit easily into existing data centers alongside GPUs.” Inference-only chips may not be the only solution. Late last week, Amazon Web Services said that it will deploy a new kind of inferencing system in its data centers. The system is a combination of AWS’s Trainium AI accelerator and Cerebras Systems’ third-generation computer, the CS-3, which is built around the largest single chip ever made. The two-part system is meant to take advantage of a technique called inference disaggregation. It separates inference into two parts—processing the prompt, called prefill, and generating the output, called decode. Prefill is inherently parallel, computationally intensive, and doesn’t need much memory bandwidth, while decode is a more serial process that needs a lot of memory bandwidth. Cerebras addresses the memory bandwidth problem by building 44 GB of SRAM on its chip connected by a 21 PB/s network. Nvidia, too, intends to take advantage of inference disaggregation in its new combined compute tray called the Nvidia Groq 3 LPX. Each tray will house 8 Groq 3 LPUs and a Vera Rubin, which pairs Rubin GPUs with a Vera CPU. The prefill and the more computationally intensive parts of the decode are done on Vera Rubin, while the final part is done on the Groq 3 LPU, leveraging the strengths of each chip. “We’re in volume production now,” Huang said.
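The bandwidth numbers above are why decode favors SRAM-heavy designs. For a single stream at batch size 1, every weight is read roughly once per generated token, so memory bandwidth sets a ceiling on tokens per second. The back-of-envelope sketch below uses the article's 22 TB/s and 150 TB/s figures and a hypothetical 70 GB model; in practice a model that large would be sharded across many 500 MB LPUs, and real throughput also depends on KV-cache traffic and interconnect.

```python
# Back-of-envelope model, not a benchmark: single-stream decode is roughly
# bandwidth-bound, so tokens/s <= bandwidth / bytes touched per token.
def decode_tokens_per_sec(bandwidth_bytes_per_s, bytes_per_token):
    return bandwidth_bytes_per_s / bytes_per_token

HBM_BW  = 22e12    # Rubin GPU HBM bandwidth from the article (bytes/s)
SRAM_BW = 150e12   # Groq 3 LPU SRAM bandwidth from the article (bytes/s)
weights = 70e9     # hypothetical 70 GB of 8-bit weights, read once per token

print(f"HBM-bound decode : ~{decode_tokens_per_sec(HBM_BW, weights):,.0f} tokens/s per stream")
print(f"SRAM-bound decode: ~{decode_tokens_per_sec(SRAM_BW, weights):,.0f} tokens/s per stream")
```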
Import AI 451: Political superintelligence; Google's society of minds, and a robot drummer import_ai 30.03.2026 12:28 0.666
Embedding sim. 0.8644
Entity overlap 0.06
Title sim. 0.1266
Time proximity 0.0004
NLP type scientific_publication
NLP organization Stanford University
NLP topic ai governance
NLP country

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Subscribe now AI might let us build “political superintelligence”: …But turning this into a societal upside requires lots of intentional work… As AI systems get more powerful and broaden their real-world impact from coding to other domains, it seems likely that they could also become useful for helping people advocate for themselves in politics, and helping politicians better craft policy. But getting to a world where a “political superintelligence” exists and helps us is a lot more challenging than just building better AI systems, according to Andy Hall, a political economy professor at Stanford. “AI is like the printing press, to a point. Instead of making information cheap and easily available, it makes intelligence cheap and easily available. That is, it not only serves users information, but it can find it for them, analyze it for them, and help them convert it into understanding,” Hall writes. “The more I work with and study AI, the more I believe it can give every human being on the planet access to a sort of political superintelligence, if we shape it right.” What is a political superintelligence? By this, Hall means AI systems which allow people to have “tools that help citizens, representatives, and institutions perceive reality more sharply, understand tradeoffs, contest power, and act more effectively”. A political superintelligence spans the AI companies that build the technology, the technology itself, and the institutions and people which the technology interacts with. “I’m not interested in slowing AI down. I’m interested in speeding up how we build the structures that keep us free as AI gets more powerful,” Hall writes. Three layers for political superintelligence: Hall sees political superintelligence as being composed of three distinct layers. The information layer: “AI can massively change how governments access and understand data, identify problems, hear from citizens, and distribute services”. Though getting to this future will require better evaluations for how AI systems behave when it comes to the sorts of information governments might be interested in, and it’ll require people to build AI tools directly for policymakers. The representation layer: “Political superintelligence might help solve this monitoring problem by giving each of us a tireless, automated delegate always serving us in the political sphere,” he writes. “These AI delegates could monitor politics for us and suggest how to vote—or even serve as policymakers alongside human supervisors.” Building this layer requires us to ensure that agents can reliably act on our behalf, and that they aren’t swayed by adversarial prompting (imagine how politicians might fund campaigns explicitly designed to sway the beliefs of agents working on behalf of people). It may also be important to re-think agent ownership - what happens if a particular policy choice goes against the preferences of the AI company which operates the agents? The governance layer: “Even if we achieve political superintelligence—even if AI makes voters brilliant and delegates faithful—those capabilities would sit inside infrastructure owned and operated by a small number of private companies,” he writes.
“We need a way to write the rules so that, when political superintelligence arrives, we the people are able to harness it.” Doing this will require figuring out how to govern and edit the ‘constitutions’ that companies create about their models, as well as developing an effective way of overseeing these AI systems. Why this matters - building a political superintelligence is only as valuable as its interfaces with people and institutions: We are by default going to get extremely powerful AI systems which can think about politics (and everything else) at a very sophisticated level. The challenge Hall outlines is that getting these systems to lead to a thriving society requires significant intentional work around the UX and UI of these systems - how do we interface with them? What sorts of technical means do we have of being confident in them? What information do they generate and to whom? Where does control of these systems lie and what systems supervise that control? Getting this part right requires AI developers to invest more in technical tools which can help people make sense of and oversee their AI systems, as well as tools for better gathering deliberative feedback from people about how these systems behave. Policymakers and the public need to demand more of AI companies in this respect, and ultimately I think a range of regulations needs to get stood up around a transparency regime for AI companies, as well as some common set of standard ‘APIs’ by which society can interact with the companies and the systems they build to generate empirical data and provide steering over their behavior. Read more: Building Political Superintelligence (Free Systems, Substack). *** Fear not, drummers, you’re safe from AI automation for now: …DexDrummer tackles a fiendishly hard robot hand problem… Whenever I get a bit worried about the pace of AI progress I toggle over to the ‘robotics’ sub-section of arXiv, read some papers, and feel a huge sense of relief. Robots, as everyone knows, are extremely hard to do well, with reality tending to screw up even the most advanced techniques. An even harder version of robotics is fine-grained low-latency dexterous control, where you need to get a robot hand to do something. So it’s with a combination of amusement and empathy that I read DexDrummer, a paper testing out how well contemporary AI approaches can get a robot hand to play the drums. The short answer is: robot hands are pretty terrible drummers! What they did: They built DexDrummer, “a hierarchical, two-stage policy for drumming,” which has a high-level RL policy, as well as a low-level dexterous policy. They train their system in a simulated environment that contains a bimanual robot setup and a full drum set (snare, tom, ride, hi-hat, and crash). The main system generates a stick trajectory in task space, then a low-level system tries to control the hand - this part is complex and involves encouraging the thumb and index finger to grasp the center of the drumstick, paired with an “arm penalty constraint, which reduces excessive arm movements”. There is also work shaping rewards to ensure the robot is able to chain multiple drum hits together - this is achieved via a “contact curriculum” which allows the agent to practice trajectory following in free space while following the trajectory reward.
Real-world testing: They test out the trained policy in reality on two 7-DOF Franka Panda arms and two 20-DOF Tesollo DG-5F hands. This is an area where I’d strongly encourage people to view the videos online to get some calibration about just how fiendishly hard this task is - the robots are able to hit the drums, but it’s painfully awkward to watch, and my sense is it’ll be quite a while till a human drummer has to look over their proverbial shoulder. Why this matters - robotics as the last eval: Robotics in anything approximating a dynamic, rapidly changing environment (for instance, improvising drums with a live band) feels like one of the last frontiers for AI - and as this research shows, much like with modern computer vision research, getting AI to perform well requires the crafting of highly complicated artisanal policies. We’re a very long way from the generality of pretrained language models here. Read more: DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming (arXiv). Please, I am begging you, check out the videos for a good time: DexDrummer site. *** Google thinks the real challenge of AI alignment is dealing with a world made up of mostly non-biological intelligences: …Towards a society of minds… Researchers with Google think that the future of intelligence is less about building a monolithic singleton that runs the world and more about figuring out how to build institutions that are capable of dealing with a vast proliferation of AI agents working in tandem with humans. The research is intuitive, provocative, and sensible, and builds on earlier technical work that showed that modern AI systems appear to simulate multiple personalities within themselves to help them answer questions (Import AI 444), suggesting that even today’s AI systems already work like complex ecologies. “We should be looking for the next intelligence explosion in the same place from which the previous ones emerged: in cooperative, competitive and creative interaction between multitudes of socially intelligent minds. The difference this time is that most of those minds will be non-biological,” Google writes. “The toolkits of team science, small-group sociology, and social psychology become blueprints for next-generation AI development.” History shows the way: “Each prior ‘intelligence explosion’ was not an upgrade to individual cognitive hardware, but the emergence of a new, socially aggregated unit of cognition,” they write. Primate intelligence: Scaled with the social group size. Human language: Allowed knowledge to accumulate across generations via a ‘cultural ratchet’. Writing, law, and bureaucracy: Converted social intelligence into infrastructure and institutions that could coordinate across long time horizons. (“A Sumerian scribe running a grain accounting system did not comprehend its macroeconomic function; the system was functionally more intelligent than he was.”) AI plus human institutions: “The path to more powerful AI runs not through building a single colossal oracle but through composing richer social systems—and these systems will be hybrid”. Society needs an upgrade: Implicit to this is the fact that governing AI will increasingly involve verifying (e.g., Import AI #447) that a vast number of AI systems are working on our behalf appropriately.
“Governments will need AI systems with distinct, explicitly invested values—transparency, equity, due process—whose function is to check and balance AI systems deployed by the private sector and other branches of government,” they write. Why this matters - alignment is going to happen with and in the world, not outside of it: Many people working on AI safety have long spent time on getting the fundamental properties of a single AI system to be ‘aligned’, which roughly translates to “does what you want and doesn’t try to kill you or disempower you”. But what this paper correctly identifies is that even if we succeed at alignment we’re going to have to then get AI systems to work well within society and to collaborate effectively with us and with each other - and this will be a subtle, emergent, hard-to-predict process. This means we are going to need to design the institutions that are fit for governing an AI-centric world. “Just as human societies rely not on individual virtue but on persistent institutional templates - courtrooms, markets, bureaucracies - defined by roles and norms, scalable AI ecosystems will require digital equivalents,” the researchers write. Read more: Agentic AI and the next intelligence explosion (arXiv). *** Meta uses a harness to coax Anthropic’s models into self-improvement: …Give an LLM some tools and a recursive loop and the ability to edit its harness, step back, and let the magic happen… Researchers with the University of British Columbia, Vector Institute, University of Edinburgh, New York University, CIFAR, and Meta have built a harness for LLMs that has the ability to self-improve performance for arbitrary tasks. The approach is called a hyperagent, and it means giving an LLM a scaffold that can iteratively improve the prompts it uses to bootstrap its performance on tasks, as well as the system it uses to get better at generating future prompts. Hyperagents work over generations, so one hyperagent begets a few hyperagents and the ones which do the best on the task will themselves spawn some more hyperagents, forming multiple layers of AI genealogy until performance is saturated. Cyberpunk name of the year award: Hyperagent is actually short for “Darwin Gödel Machine Hyperagents”. Besides the research being cool, my congratulations to the authors on coming up with a name I’d love to see chiseled into the moon by a laser beam wielded by a superintelligence. How hyperagents work: Hyperagents are “self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only task-solving behavior, but also the mechanism that generates future improvements,” the researchers write. “This initial hyperagent is equipped with two tools: a bash tool for executing shell commands, and a specialized tool for inspecting and modifying files.” Testing the agents in four different domains: The authors test out hyperagents by applying them to four problems - coding (Polyglot), prediction (paper review), robotics (robotics reward design), and math understanding (olympiad-level math grading). For most problems, the hyperagents use Claude Sonnet 4.5 as their base model, with one exception (Polyglot).
Evaluations are done via several different models: o3-mini (Polyglot), GPT-4o (paper review), Claude Sonnet 4.5 (robotics reward design), and o4-mini (IMO-level grading). In all cases, the hyperagent approach improves performance significantly above the baseline.
Polyglot: “the agent is given a code repository and a natural language instruction describing a desired change, and must modify the repository accordingly”. Results: “Across 5 runs, the DGM-H improves its training performance on the 50-task Polyglot subset from 0.140 (the initial agent) to 0.340 (CI: 0.300–0.380).”
Paper review: “For each task, the agent is given the full text of an AI research paper and must predict a binary accept/reject decision”. Results: “On test tasks, DGM-H improves paper review performance from 0.0 (the initial agent) to 0.710 (CI: 0.590–0.750)”.
Robotics reward design: “Given a natural language description of a robotics task, an agent must generate a suitable reward function. This reward function is then used to train a quadruped robot in simulation using RL”. Results: “DGM-H improves performance from 0.060 (the initial agent) to 0.372 (CI: 0.355–0.436), surpassing the default reward function that directly optimizes the evaluation metric (0.348)”.

Why this matters - bootstrapping the singularity: Papers like this show that today’s AI systems are already capable of autonomously improving their performance when given the right scaffold and starting ingredients. An interesting idea is to combine the design approach here with giving the AI systems the ability to finetune themselves (e.g., in the style imagined by the PostTrainBench research, Import AI #449). A related limitation is that “although hyperagents can modify their self-improvement mechanisms, they cannot alter the outer process that determines which agents are selected or how they are evaluated” - though again, I think there are technical ways to achieve both of these objectives. Of course, an AI system that can autonomously improve itself on arbitrary domains has a range of safety issues, some of which are potentially cataclysmic. The authors acknowledge this while also being realistic about the problems that lie ahead: “a central challenge lies in balancing the potential of AI as a catalyst for human progress and well-being (e.g., automating scientific discovery) with the degree of trust humans are willing to place in these systems (e.g., delegating decisions or actions without requiring continuous human verification), while minimizing the many potential risks and downsides,” they write.

Read more: Hyperagents (arXiv).
Get the code for HyperAgents here (Facebook Research, HyperAgents).

***

How long will a new math benchmark, HorizonMath, last?
…New test challenges AI systems to solve unknown problems, then automatically verifies the answers…
Another day brings another hard math benchmark that I imagine will crumple in the face of ongoing AI progress in the coming year. This time it’s HorizonMath, a benchmark containing 100 “predominantly unsolved” problems across 8 domains in applied and computational mathematics. The benchmark was built by researchers with the University of Oxford, Harvard University, Princeton University, and the Ellison Institute of Technology.
Special features about HorizonMath:
Contamination-proof: “Because the solutions are unknown, they do not exist in any training corpus, and any correct solution produced by a model would therefore signal genuine reasoning ability and autonomous discovery.”
Automated verification: “A core feature of our benchmark is its fully automated, reproducible, and human-free evaluation pipeline”, the authors write. “We automate verification using high-precision numeric comparison and deterministic constraint-checkers”.

What HorizonMath contains: HorizonMath’s 100 problems are classified along three axes: output types, which specify how the model needs to solve the task, ranging from identifying an exact closed-form expression for a numerically approximated target value to producing discrete mathematical objects; solvability levels, which span ‘level 0’ (problems with known closed forms) to ‘level 3’ (problems that could be conjectured unsolvable or lack finite closed forms); and mathematical domains, which specify the type of domain, ranging from number theory to discrete geometry to mathematical constants.

Reassuringly hard: On the full dataset, the highest scoring model is GPT 5.4 Pro with 7%, followed by Opus 4.6 and Gemini 3.1 Pro, which both tie at 3%. On the “Level 0” (aka the easiest) problems, GPT 5.4 Pro leads at 50% completion, with Opus 4.6 and Gemini 3.1 in a tie again at 30% each.

Next steps: They will expand the benchmark in two ways: first by liberalizing the sorts of solutions they will accept, and second by “extending beyond the three current problem categories to include open problems that require proof-based verification, integrating with formal systems such as Lean”.

Why this matters - perhaps the first truly creative AI systems will show up in mathematics: AI systems are pushing on the frontiers of math today, with systems like Gemini already helping humans come up with seemingly original math proofs (Import AI 441), and tests like “First Proof” emerging which examine how well AI systems can handle problems that have never been talked about publicly, let alone solved (Import AI 445). With HorizonMath, we have another useful benchmark to help us see if AI is about to cross some ‘creativity Rubicon’ and begin solving unsolved problems.

Read more: HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification (arXiv).
Get the benchmark here: HorizonMath (GitHub).

***

Tech Tales: Site report
[2029]

Percentage of compute and power below ground: 70% (+50 absolute points).
Number of staff living fully onsite: 300 (+250).
Estimated duration of ‘hard seal’ based on current supplies and a projected population of ~500: 4 months (+3 months).
Estimated lead of the project relative to others in-country: 6 months.
Capability estimates: 90%-110% of our own leading system.

Recommendation: Based on the substantial increase in resources allocated to hardening the facility for closed-loop development, we believe additional measures must be taken to disrupt the project. The following report lists options for consideration, many of which can be combined together. These include: food system sabotage, staff interference, and data poisoning.

Things that inspired this story: how at some point surely there will be such a thing as a hardened datacenter for AI training and inference; how the intelligence community might analyze other AI projects.

Thanks for reading!
Wikipedia cracks down on the use of AI in article writing | TechCrunch techcrunch 26.03.2026 21:50 0.664
Embedding sim.: 0.7615
Entity overlap: 0.0909
Title sim.: 0.1091
Time proximity: 0.9404
NLP type: regulation
NLP organization: Wikipedia
NLP topic: ai governance
NLP country: United States

Open original

As AI makes inroads into the worlds of editorial and media, websites are scrambling to establish ground rules for its usage. This week, Wikipedia banned the use of AI-generated text by its editors — although it stopped short of banning AI outright from the site’s editorial processes. In a recent policy change, the site now states that “the use of LLMs to generate or rewrite article content is prohibited.” This new language updates and clarifies previous, vaguer language that stated that LLMs “should not be used to generate new Wikipedia articles from scratch.”

AI in Wikipedia articles has become a contentious issue among the site’s sprawling, volunteer-driven community of editors. 404 Media reports that the new policy, which was put to a vote by the site’s editors, garnered majority support — 40 to 2. That said, the new policy still makes room for continued AI use in some editorial processes. “Editors are permitted to use LLMs to suggest basic copyedits to their own writing, and to incorporate some of them after human review, provided the LLM does not introduce content of its own,” the new policy states. “Caution is required, because LLMs can go beyond what you ask of them and change the meaning of the text such that it is not supported by the sources cited.”
Helping developers build safer AI experiences for teens openai 24.03.2026 11:00 0.661
Embedding sim.: 0.7474
Entity overlap: 0.125
Title sim.: 0.1609
Time proximity: 0.9366
NLP type: regulation
NLP organization: OpenAI
NLP topic: ai safety
NLP country:

Open original

OpenAI releases prompt-based teen safety policies for developers using gpt-oss-safeguard, helping moderate age-specific risks in AI systems.
Are machines truly intelligent? microsoft_research 23.03.2026 15:00 0.661
Embedding sim.: 0.7753
Entity overlap: 0
Title sim.: 0.0986
Time proximity: 0.8452
NLP type: scientific_publication
NLP organization: Microsoft Research
NLP topic: large language models
NLP country:

Open original

Will machines ever be intelligent?
Published March 23, 2026
By Doug Burger, Technical Fellow and Corporate Vice President, Microsoft Research; Subutai Ahmad, Chief Technology Officer, Numenta; and Nicolò Fusi, VP and Distinguished Scientist, Microsoft Research

Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive.

In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current models excel or fall short, and what future AI systems might need to bridge the gap.

Learn more:
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models. Publication, March 2026.
A Thousand Brains: A New Theory of Intelligence. Book, Jeff Hawkins, 2022.
Thousand Brains Project. Homepage.
A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex. Publication, January 2019.
Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in Neocortex. Publication, March 2016.
On Intelligence. Book, Jeff Hawkins with Sandra Blakeslee, 2005.

Transcript

[MUSIC] DOUG BURGER: This is The Shape of Things to Come, a Microsoft Research Podcast. I’m your host, Doug Burger. In this series, we’re going to venture to the bleeding edge of AI capabilities, dig down into the fundamentals, really try to understand them, and think about how these capabilities are going to change the world—for better and worse. In today’s podcast, I’m bringing on two AI researcher-experts: Nicolò Fusi, who is an expert in digital, transformer-based large language model architectures and learning, and Subutai Ahmad, who is an expert in biological architectures, specifically the human brain. And the question we’re going to discuss is, are machines intelligent? And what I mean by that: are digital intelligence, large language models, on a path to surpass humans, or are the architectures just so fundamentally different that one will do one set of things well, the other will do something else very well? And so we’ll be debating the architecture of intelligence across digital implementations and biological implementations because the answer to that question, I think, really will determine the shape of things to come. [MUSIC FADES] I’d like to ask each of my guests to introduce themselves. Tell me a little bit about your background and what you’re currently working on—to the extent you can talk about it—in AI. So, Nicolò, would you please start?
NICOLÒ FUSI: Yeah, thank you, Doug, for having us and having me here. It’s so much fun. So I’m Nicolò Fusi. I’m a researcher at MSR [Microsoft Research]. So Doug is my boss, so I will be very, very, very good to Doug in this podcast. No, but jokes aside, my own background is in Bayesian nonparametric. That’s what I started studying. So Gaussian processes and things like that. And then equally, I would say, in computational biology, because I found it, like, one of the most interesting use cases for AI techniques. And that, kind of, has been true throughout my career. And pretty much like everybody else, eventually, I moved away from the kernel methods and the Bayesian nonparametrics and I started working more on language models, transformer models, with a particular eye towards information theory and the connection between information theory and generative modeling. And that’s, kind of, one of the main things I do today other than, kind of, managing the research of people who do much more interesting work than I do. [LAUGHS] BURGER: I have to interject there, Nicolò, because you dragged a piece of bait across my path. FUSI: I figured. BURGER: You know, at Microsoft Research, I have a management rule that I can’t tell anyone what to do because we hire some of the best people in the world. You have to trust them. And everyone is always completely free to call BS on me. And so Nicolò was joking there; [LAUGHTER] he does not have to toe the party line. In fact, I encourage him not to. So, so … FUSI: I just have to be well-behaved. That’s the only thing I will say. [LAUGHS] BURGER: Yeah. Thank you, thank you for baiting me. [LAUGHS] Because he knew exactly what he was doing. And I love him for it. Subutai, can you tell us a little bit about yourself? SUBUTAI AHMAD: Sure. Thank you so much, Doug, for having me. I’m really looking forward to the conversation between us all. So I see myself fundamentally as a computer scientist. You know, I’ve been studying computer science for longer than I care to admit. But something changed for me during my undergrad years. I decided to minor in cognitive psychology, and I started to get really interested in how the brain works. And to me, understanding intelligence and implementing intelligence was the hardest problem a computer scientists could ever solve. So I got very, very interested in that. You know, I couldn’t see how to really commercialize that. I was very interested in making products and stuff. So I stopped, you know, working on that for a while. I did a number of startups doing computer vision, you know, video processing, a lot of that stuff. And then when Jeff Hawkins started Numenta back in 2005 with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me, it was like all my worlds coming together. This, like, this is what I had to do. None of us thought [LAUGHS] it would take as long as it did. We spent the last couple of decades really deeply trying to understand neuroscience from a computer scientist—from a programmer’s—standpoint, the underlying algorithms. And that’s really what I’m passionate about, just trying to translate what we understand about the neuroscience to today’s AI. And in terms of what we’re working on today, it’s, you know, the human—maybe we’ll get into some of this—the brain is super efficient in how it works—power efficient, energy efficient—and we’re trying to embody those ideas and trying to make AI a lot more efficient than it is today. BURGER: Great. 
I think we’ll get into efficiency a little bit later in the podcast because that’s a subject that’s near and dear to my heart, you know, being a computer architect originally by training. I want to go back to, you know, one of the reasons I got involved with Numenta is, you know, Subutai and I have been exchanging emails, like, discussing collaborations, you know, visiting each other through the years, and the thing that really stuck with me was when I read one of the earlier books from Jeff, On Intelligence. And there was an example in the book that talked about how, you know, the human brain learns continuously. I think biological organisms in general learn continuously. And the anecdote that I remember was this anecdote if you’re walking down your basement steps, you know, you’re walking down the stairs to your basement and there’s one step that’s always been a few inches off and you decide to fix it, and so you raise it so it’s even with the others, and then the next time you go down the stairs, you don’t remember and you’re wildly off and, you know, you hit that step, you hit it earlier or later than you anticipated, you go out of balance. You’re flailing around. You know, you get all this adrenaline. You think you’re going to pitch headfirst down the stairs. Hopefully you don’t. And then the second time you do it, you’re a little off balance, but it’s not crazy. And the third time you maybe notice a little bit, and the fourth time, it’s, like, it’s your basement stairs. And so somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. And I remember just that example vividly from the book. And that got me thinking, wow, this is so different from the way our digital AI works. I’ll turn it over to you to comment for that and then I think we’ll go into the digital. AHMAD: Yeah, no, that’s a great example. I think it’s remarkable how our brain is constantly modeling our entire world at such a granular level, and we’re not even aware of it perceptually. Like, you know, that example of the steps is probably not … you wouldn’t consciously be aware of it, yet if something is different about anything in your world that you’re very familiar with, you’ll instantly notice it. And then you’ll, you know, you’ll update your world model, you’ll adjust, and you’ll continue on. It’s really remarkable how the brain’s able to do that so seamlessly. BURGER: And a lot of that is based on neurotransmitters, right? Because there’s just a … you know, when you have that physical reaction to “I’m about to pitch down the stairs,” you get a flood of transmitters that actually changes the way your brain’s learning or at least the rate. AHMAD: Yeah, there’s a flood of neurotransmitters and neuromodulators, as well, that invoke change, sometimes very rapidly. Another example, you know, if you touch a hot stove—that’s the canonical example—you will learn that very, very quickly. So there’s a lot of chemical changes that happen. But it’s also really interesting that we can update things and update our world knowledge without impacting everything else that we know. This is something that’s very, very different, again, from today’s AI models. We’re able to make these changes in a very contextual and very, sort of, fine-grained way. BURGER: So, Nicolò, I want to go and talk a little bit now to transformers.
So I think, you know, you and I and Subutai were all working in the AI field, you know, many years before 2017, when the transformer hit. You know, I was building, you know, with my team hardware to accelerate RNNs [recurrent neural networks], LSTMs [long short-term memory], you know, which had this awful loop-carried dependence, you know, the bottlenecked computation, and then the transformer was just much more parallelizable. So what do you think’s really going on in these things? And maybe we could start—I know you and I have talked a lot about this—maybe just start with the major blocks. You know, you’ve got the attention layer. You’ve got the feedforward layer. You’ve got, you know, the encoder stack and the decoder stack and the latent space in between. Can you just, kind of, walk us through those pieces at a high level and tell us what you think is going on? FUSI: Yeah. Yeah, I mean, I have a very opinionated view of why transformers are so great. BURGER: That’s why you’re here. [LAUGHS] FUSI: Maybe, like, yeah, maybe I’ll inject it. I don’t know. I don’t think it’s a super novel creative opinion, but it is an opinion. So I guess the two principal … the two main components you already described: the, you know, the transformer [read: attention] layers and the feedforward layers. One way to think about them is, how does information in your context relate to each other and what is every token referring to, for instance, in the case of transformers in language models? So by context, we mean, like, the information you feed through the model, that the model keeps continuously generating and appending to. BURGER: So like your chat history. FUSI: Your prompt. Your what? Your chat history or your particular prompt in a chat session. BURGER: OK. FUSI: That prompt, which is a sequence of words, gets discretized in a series of tokens. Tokens can be individual words, can be multiple words, kind of, connected together. The way we go from words to tokens typically is through an algorithm that tries to basically collapse as much as possible. Multiple words, like “the dog,” may be just one token as a first, kind of, level of compression to feed into the model. So it just tries to bring things together as efficiently as possible. Then there is, you know, within these models, there is a transformer layer. This transformer layer or this attention layer, sorry, tries to basically figure out what the “the” refers to—the term “ the ” in “the dog,” or “the dog jumps on the table,” “jumps” refers to the dog. So there is this kind of, like, mapping that happens. And then there is, like, feedforward layers, which in modern large language models, they store a lot of information. Like, that’s kind of, like, where the knowledge typically kind of sits in, the things that the model just knows . You know, that, I don’t know, if you slam your arm against [the] cup of water on your table, that cup of water falls off the table. That’s something that the model, kind of, has baked in through reading a lot about cups falling off of tables when they’re hit. So that’s, kind of, those are, for me, the two fundamental components, and the reason why I have an opinionated view is that, you know, honestly, I do believe that RNNs and, you know, even state-space— modern incarnations of state-space models—are good enough to learn over these, you know, language data or whatever or vision data or audio data. The good thing about transformers is that they do two things very well. One is they get out of the way. 
They don’t have this notion of “everything has to be encoded through a state” like recurrent networks. And two, they do that very computationally efficiently as you were saying. There isn’t a computational bottleneck. And so they created this nice overhang where they happen to be the right architecture at the right time to unlock enough flow of information through the model … BURGER: Yeah. FUSI: … that we could get through these amazing things. BURGER: Let me press you on one thing. Like, you know, in the attention blocks, you can figure out which words or which tokens relate to which tokens. So I put in the prompt and it’s finding all the relations and then feeding those relations up to, you know, the feedforward layer—well, the feedforward unit within a layer. And you said that knowledge is encoded there, but then what does it really mean for those maps to then access knowledge, but then you project it back into, you know, the output and then feed it up to the attention block in the next layer? FUSI: Again, yeah. BURGER: So it seems kind of weird that I’d be, like, accessing knowledge and then taking that knowledge, merging it, and going back to another attention map. FUSI: Well, you can see it as a mixing operation that happens in the feedforward part of the layer. You know, like, you’re attending, then you’re mixing, and, kind of, like, reprojecting to some space with higher-information content or, like, a different level of information extraction. And then you’re putting it back into, “OK, so let me do another round of processing” and, kind of, attending and then a mix again. And then I do it again and then I do it again. So I think that the information that is present in the prompt and in the, you know, that has been baked into the weights gets further and further refined. Whether that refinement is extraction of structure or aggregation into higher-level concepts, I’m not sure. I think it’s just structure gets extracted and things that are irrelevant get kind of pushed away. But that doesn’t necessarily mean that it gets aggregated through the architecture. BURGER: So now I’m going to try to, like, restate what I think I hear you saying. So, you know, we’re adding information and we’re kind of adding information at a higher level but not necessarily throwing away the low-level information, at least that’s not relevant, right? FUSI: Yeah. BURGER: Because, you know, if the higher-level stuff depends on the low-level stuff, I have to have that first. And so then you get to the top of the encoder block and you’re in the latent space with all of that information kind of maximized. Is that a way to think about it? And if you agree, can you talk about what the encoder block really is and what the latent space is? FUSI: I tend to agree, yes. I mean, there is … you’re describing … I think you’re describing what I think is happening, which is there is given the context in your prompt and given the task that the model perceives or, like, figures out that you’re doing, it has to highlight and pull out the relevant information. And it does that not by summarizing layer by layer, but it does it by, you know, increasing the prominence of that information and suppressing other things. So I think that’s ultimately what happens up to the point where you reach this beautiful point in concept space, which identifies both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it.
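The attend-then-mix refinement described above (an attention step that relates the tokens in the context, followed by a feedforward step that mixes and re-projects, repeated layer after layer) corresponds to a standard transformer block. Below is a minimal PyTorch sketch with illustrative dimensions; it is a generic block under those assumptions, not the architecture of any particular model discussed here.

```python
# Minimal sketch of one "attend, then mix" transformer layer (illustrative dimensions).
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention: every token in the context attends to every other token.
        attended, _ = self.attn(x, x, x)
        x = self.norm1(x + attended)
        # Feedforward: per-token mixing and re-projection, where much of the
        # "baked-in" knowledge is commonly thought to live.
        x = self.norm2(x + self.ff(x))
        return x

tokens = torch.randn(1, 16, 512)    # (batch, sequence length, model dimension)
refined = TransformerBlock()(tokens)  # same shape, refined representation
```

Stacking several such blocks repeats the refinement the speakers describe; each layer raises the prominence of relevant structure without discarding the token-level representation.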
BURGER: And so one last question, and then I want to go to Subutai for a second. So now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early and then getting down to the granular tokens? Or, you know … because you go up through the encoder stack, those attention blocks and feedforward layers, to get to that magical latent space. And now we’re going to go the other direction. How do you think about that other direction through the decoder stack, which is the same primitives as the encoder stack? FUSI: Same primitives. You can think of it as kind of the reverse operation. Like you, you never lost information throughout. You just kind of suppress or privileged different kinds of information. And now you’re basically just projecting it back out to a space that is, you know, intelligible. And it’s, kind of, where the model gets its … I hesitate to use the term reward because it has a particular implication, but that’s, kind of, where the loss gets computed and then gets pushed back through the model. BURGER: Right, as you’re trying to evolve and train all those parameters—the relationship between words, the information in the feedforward layers, the design of that latent space, and the extraction of the knowledge from it. FUSI: That’s right. And so in an encoder-decoder model, you push through the whole thing, you decode back to a particular token, which for people who don’t know, it’s, like, literally a number out of a vocabulary, like word No. 487. And if it was word No. 1,500, you get, you know, like, … BURGER: Something else. FUSI: … a bad reward. Yeah. Yeah. And then … and if you got it right, you get a positive signal that then just flows back through the model. BURGER: I’d like to go over to Subutai now. So after hearing this, you’ve studied, you know, neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates. Is the human brain doing something different than that? You know, are we just building latent spaces, then extracting? The architecture is very different, but what’s going on under the hood? AHMAD: Yeah, the architecture is very different. You know, as Nicolò was describing what happens throughout a transformer stack, I was trying to relay and relate, you know, what we know in the brain, as well. In a typical, you know, transformer model, there is, at the end of the day, there is a single latent space from which the next token is output. That does not happen in the brain. There are thousands and thousands of latent spaces that are, sort of, collaborating together, if you will. You know, a lot of what we publish is under the moniker the Thousand Brains Theory of Intelligence. And Jeff has published a book a few years ago on that. And that, kind of, dates back to discoveries in neuroscience from the ’60s and ’70s by the neuroscientist Vernon Mountcastle, who was a professor at Johns Hopkins. BURGER: Yup. AHMAD: And what he discovered … he made this remarkable discovery that, you know, our neocortex, which is the biggest part of our brain—that’s where all intelligent function happens—is actually composed of roughly 100,000 of what he called cortical columns. BURGER: Right. AHMAD: And each cortical column is maybe 50,000 neurons. And there’s a very complex microcircuit and microarchitecture between the neurons in a cortical column.
But then there’s 100,000 of them, and every part of your brain—whether it’s doing visual processing, auditory processing, language, thought, motor actions—they’re all composed of this, essentially, this same microarchitecture. And this was a remarkable discovery. It says that there’s a universal architecture. It’s not a simple one. It’s complex. But it’s repeated throughout the brain. And that’s where this, you know, the idea of the Thousand Brains … each of these cortical columns is actually a complete sensory-motor processing system. It has inputs; it has outputs. It’s getting sensory input. It’s sending outputs to motor systems. And it’s building, in our theory, complete world models. So there isn’t a single latent space. There’s thousands of these latent spaces. And each little cortical column is trying to understand its little bit of the world. You know, one cortical column might be getting, at the lowest level, maybe one degree of visual information from the top right-hand corner of your retina. Another one might be focusing on specific frequencies in the auditory range. You know, each one has its own little view of the world, and it’s building its own little world model. And then they all collaborate together. There’s no top or bottom here. There’s no homunculus in the brain. Everything is sort of equal. And they’re all simultaneously collaborating and voting and coming up to, you know, what is the, you know, consistent interpretation of all of these sensory inputs that we’re getting? What is the single consistent, you know, concept, if you will, and, based on that, make the motor actions that are most relevant to that. So it’s a sensory-motor loop. It’s a, you know, it’s a constantly recurring system; we’re constantly making predictions. As we discussed earlier, you know, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights. It’s building and incrementally improving its world model constantly. So it’s a massively distributed, you know, set of processing elements that we call cortical columns that are, they’re all equal, operating in parallel. So I think there are similarities, for sure, between them. But at least the way I described it, I think it’s very different in its operation than what I understand today’s LLMs to be. I don’t know if you agree with that or not. FUSI: Yeah, I … To better understand, I had a question, which is, are these cortical columns relying on the fact that these are essentially multiple views of the same process and those multiple views, like, the, you know, the part of the sensory input that gets allocated or subdivided, is it happening at the same time point? So in other words, if you could artificially delay by some time t some cortical columns with respect to the rest, would the learning suffer? AHMAD: Yes, absolutely. Yeah. FUSI: And so in other words, how important is it that it’s, kind of, on the same schedule? AHMAD: [LAUGHS] Yeah, I mean, that’s another … I mean, LLMs today, you know, you get your input, one layer processes it, then the next, then the next, and the other layers are not operating. In the brain, it’s not like that. Everything is operating in parallel asynchronously. And this is important. They’re constantly trying to make predictions and so on. So if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would absolutely suffer. 
BURGER: I wanted to interject here just because this is where … this discussion is where, you know, I got super interested in the difference and then spent a bunch of time with Subutai to learn from him. So if I think about my skin, you know, which is an organ, you know, as I understand it, there’s a cortical column attached to each patch of my skin and the size of that patch, kind of, corresponds to the nerve density there. AHMAD: That’s right. Yeah. BURGER: So in my brain, there is a set of cortical columns that are skin sensors, and I could actually … if I numbered all the cortical columns in the brain, I could draw a map on my skin and say, “This is No. 72 in this patch. This is No. 73 in this patch.” Now are human cortical columns, like, better than, say, what we see in a mouse? And, of course, this is a leading question because I know the answer. AHMAD: [LAUGHS] Yeah. So, yes, it, you know, cortical columns in your sensory areas, primary sensory areas, each, you know, pay attention to or get input from a, you know, some patch of your skin somewhere on your body. And there’s many more cortical columns associated with your fingertips than, you know, a square centimeter of your back, for example. So there’s definitely, you know, areas of sensory information that we pay a lot more attention to and devote a lot more physical resources to. In terms of a mouse and humans, it’s pretty remarkable that the cortical columns … so all mammals have cortical columns; all mammals have a neocortex. All mammals have cortical columns from a mouse all the way up to humans. And mice have cortical columns that are very, very similar to what a human has. It’s not identical. There are differences. But by and large, the architecture of a cortical column in a mouse is, you know, very, very similar to cortical columns in humans. Human cortical columns are bigger. There are more neurons, and there’s more detail there, but essentially, it’s the same. And … BURGER: Maybe just scaled up a little bit. AHMAD: Yeah. So evolution basically discovered this structure—that it’s really excellent for processing information and dealing with it—and then through, you know, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent animals. And that’s what happened very, very fast evolutionarily. FUSI: I didn’t know about the unevenness of cortical columns present. Like, this is not … I’m not a neuroscientist, and so this is interesting because one of the biggest frustrations with many modern architectures of models is that they deploy a constant amount of computation no matter what the input is. So I go through the same number of layers whether I’m trying to predict the word “dog” after “the” or whether I’m trying to solve, like, give the final answer to a very complicated math question or, you know, whether a theorem was proven or not in the prompt. And so that’s interesting because, like, some current instantiations of modern architecture actually deploy … try to cluster things together such that you have a constant amount of information that you then push together through the model. [LAUGHTER] And so maybe like on my fingertips, I need more processing than I need on my elbow because, like, you know … and so this, kind of, makes sense. BURGER: Nicolò is being humble. He was working on this problem two years ago and told me about it. It was one of the things I learned from you that made me think differently. 
So … FUSI: I just like to refer to people are working on this … [LAUGHS] BURGER: Random average people who are not all necessarily brilliant AI scientists. So the prediction part of this, though, is really what’s fascinating to me, because, again, something else Subutai and I discussed many years ago, you know, if I’m, like, moving my finger towards the table and…my brain is making predictions because I have a world model. It knows a table is there. And the cortical columns representing that patch of skin, as it’s getting closer, they’re starting to predict that I’m going to feel something that feels like the table. And, yup, there; I hit it. Prediction met. But if I touched it and it felt really icy cold or super hot or fluffy or not there—I pass through it—I’d get a flurry of activity because the prediction wouldn’t match the world model, and that’s where learning would happen. Subutai, does that sound like the right model and intuition? AHMAD: Yeah, that’s definitely a very important component of it. We’re constantly making predictions. And as you said, you know, you’re moving your right fingertip down; you know, perhaps you’ve never sat in this room before or, you know, seen this table before, you would still have a prediction, a very good prediction of it. BURGER: Yeah. Because you know what a table is. AHMAD: You know what a table is. And if it was different, you would, you know, you would notice it right away. But if your left hand, which you weren’t paying attention to, also felt icy cold, then you would notice that, as well. So you’re actually making not just one prediction; you’re making thousands and thousands of predictions constantly about … BURGER: Every cortical column. AHMAD: Every cortical column is making predictions. And if something were anomalous, highly anomalous, you would notice it. So this is something, you know, we don’t often realize; we’re making very, very granular predictions constantly . And when things are wrong, we do learn from it. And the other interesting thing—and this is, again, possibly different from how LLMs work— you know, if I were to tell you to touch the, you know, the bottom surface of the table, you could without, again, without looking at the table or opening your eyes, you would be able to move your finger in and touch the bottom of your table because you have a, you know, set of reference frames that relate to … BURGER: Yup … AHMAD: There you go. Yep. You’re able to do it. BURGER: I did it! Yeah. Amazing. AHMAD: Even though you maybe never have been in this room; maybe you’ve never seen this table before. It doesn’t matter. BURGER: I’ve been in this room because we had to prep for the podcast series. But I didn’t touch the underside of the table, that’s for sure. [LAUGHS] AHMAD: Yeah, exactly. [LAUGHS] So, you know, we know where things are in relation to each other, where our body is in relation to everything, and we can very, very rapidly learn. And again, if the bottom part of the table was anomalous, you would notice it and potentially remember that. FUSI: I’m not going to lie. I was expecting you to find something under that table, [LAUGHTER] like a talk show. AHMAD: Or chewing gum or something. FUSI: And if you reach under the table, you’re going to find a copy of my paper. [LAUGHS] BURGER: [LAUGHS] You know, if I was smarter and better prepared, that’s exactly what would have happened. But, sorry, guys. I think you told me something, Subutai, you know, that … and I’ll give a little bit of preamble. 
So, you know, the brain has these dendritic networks in each neuron, and they form synapses. And so a neuron fires, and that, you know, the axon of the neuron that’s firing will propagate a signal through the synapses, which might do a little signal processing to the dendrites of the downstream neurons, and those downstream—the dendrites can then prime the neuron to fire. That’s one of the fundamental mechanisms. And it’s the formation of those synapses, you know, between the upstream and downstream neurons, the dendrites, that seem to be the basis of learning, and to me, that feels a little bit like an attention map. AHMAD: Yes. BURGER: So maybe the dendritic network is doing something akin to self-attention, and we have some work going on in that direction at MSR. But the thing you told me was that your brain is actually forming an incredibly large number of synapses speculatively. In some sense, sampling the world when something happens in case it will recur. You know, it’s a more … maybe it’s a version of Hebbian learning, right? You know, things that fire together, wire together. AHMAD: Exactly. BURGER: But then if that pattern doesn’t recur, then they get pruned. And I’m just going to, you know, what is the fraction of your synapses to get turned over every three or four days, you know, ballpark? AHMAD: OK. Yeah, I remember this. This was an absolute mind-blowing study in [The Journal of] Neuroscience. So, you know, the way a lot of learning happens in the brain is by adding and dropping connections. In AI models, it’s usually strengthening, you know, high-precision floating-point number, making it higher or lower. But you’re not adding and dropping connections. The connections are always—in fact, everything is fully connected, right, between layers. And so in the brain, you’re always adding and dropping connections. That’s a fundamental mechanism by which we learn, one of the fundamental mechanisms. What I read in this study is that they looked at adult mice and adult animals, and what they found is that they would look at the number of synapses that were connected over the course of a couple of months—and they were able to trace individual synapses in this particular part of the brain—and what they found is that every four days, 30% of the synapses that were there were no longer there four days from now. And there was a new 30%. And there’s a huge number of connections that are constantly being added and constantly being pruned. And my theory of what’s going on there is that we’re always speculatively trying to learn things. So, you know, there’s all sorts of random coincidences and things that we are exposed to on a day-to-day basis. We’re constantly forming connections there because we don’t know what’s actually going to be required and what’s real and what’s random. Most of it’s random; most of it’s not necessary. And the stuff that actually is necessary will stay on. But we’re constantly trying to learn. This is a part of continuous learning that’s often not appreciated, I think, is that we’re constantly forming new connections, and then we prune the stuff that we don’t need. In an AI model, if you were to do that, it would just go, I don’t know, it would go bananas. [LAUGHTER] BURGER: Well, so let’s double-click on that. So when you told me that, the way I … AHMAD: This is mind-blowing, this 30%. BURGER: It’s crazy. AHMAD: Your brain is going to be totally different a few days from now. BURGER: It’s so mind-blowing.
When you told me that, I spent some time processing it, so a whole bunch of synapses were created and destroyed during that time. But it just made me think that we have, you know, we have all of these columns getting all of this input continuously. You know, eyes, hearing, smell, taste, skin, heat, and then, you know, interactions with people, and then planning and experiences, just at every level. And they’re constantly sampling all this noise coming in and basically filtering out the noise. It’s like, kind of, like a low-pass filter. But when something statistically significant recurs, it’s going to lock and then become persistent. AHMAD: Yeah, yeah, I think so. There’s so much that’s happening, and you’re constantly learning, and, you know, when you touch a hot stove or something, there’s a flood of dopamine specific to those areas that caused these synapses to strengthen very, very quickly. You know, most of these synapses that are learned are very, very weak synapses. BURGER: Yup. AHMAD: And so, yeah, you know, when you look … in this study, they also quantified the turnover in, kind of, strong synapses versus weak synapses. And it’s comforting to know that the strong synapses stay there. It’s really these weak synapses that are constantly added and dropped. And then some of them will become strong. BURGER: Now I want to go back … return to Nicolò, but with an observation. So when I’m training a transformer, it’s also a prediction-based system. You know, I’m running … I have my input in the training set; I have my masked token or the next token I’m trying to predict. I run it through. I look at how successfully did it make that prediction, and the worse it was, the, sort of, the steeper the error, you know, I drive back through the network. So, you know, if it’s spot-on, I don’t learn very much. But if the prediction is way off, I’ve got to change a bunch of stuff. That sounds analogous to what Subutai was just describing with the cortical columns. FUSI: No, that’s right. I mean, with, I don’t know, with one big pet peeve of mine in pretraining, in particular around pretraining these language models. BURGER: OK. FUSI: So again, for context, like, language models in particular, but, you know, many other instantiations of large models, are trained in a few phases usually. One of them is pretraining, where you have some ground truth text and you remove, let’s say, just the last word, and then you ask the model to predict the last word. And that’s when you get that loss. Do you get the word right? Do you get the word wrong? One of the big problems that I have is that, you know, in human experience, we do not get feedback every single thought. The problem with language models, the way we are training them, at least in pretraining, is that they do a thing called teacher forcing. So they guess the word, then they get immediately the signal, and then the right word gets filled in, and then they predict the next one. So when you go through, like, a passage of text, you constantly get this reward. And it’s such a bizarre way to train a model. It’s necessary because you want a lot of flow of supervision. Like, you want, like, a lot of supervision to essentially use all the computation available. But at the same time, it actually makes the models arguably a little bit worse than what they would be if you had enough compute to train them without this. I went on a tangent just because it’s a pet peeve. 
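In code, the teacher-forcing setup described above looks roughly like the following sketch, assuming a PyTorch-style model that returns per-position logits (the function name and shapes are illustrative, not taken from any specific codebase): the model is graded at every position, and the ground-truth token, not its own guess, is what conditions the next prediction.

```python
# Sketch of next-token pretraining with teacher forcing (illustrative only).
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids):
    # token_ids: (batch, seq_len) integer ids taken from the training text
    inputs  = token_ids[:, :-1]           # the model always sees the *true* prefix...
    targets = token_ids[:, 1:]            # ...and is asked for the true next token
    logits = model(inputs)                # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    loss.backward()                       # a supervision signal at every single position
    return loss.item()
```

At generation time, by contrast, the model conditions on its own previous outputs rather than on a corrected prefix.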
[LAUGHS] BURGER: It’s a really important point, though, because your goal when you’re training a model is to get to your loss target with the minimal cost and time. Or, of course, like, fixed budget and, like, lowest loss target. But, you know, biological systems, also, their goal is survival with energy minimization. And so, like, once you’ve built a world model that works, right, like touching the table, touching the underside of the table—nope, still nothing exciting there—like, it takes very little energy to do that. And I think a tragedy is that we all have these supercomputers in our heads. You know, the neocortex is what, about 10 watts? And it’s this amazing thing, right, that can compose symphonies. But once we have a world model, a lot of us just stop learning because it’s comfortable, right. You don’t have to perturb the state. You can go through … and, you know, I mean, how many of us go through every day and all of our predictions succeed [LAUGHTER], and there’s no surprises, you know? So all the new synapses get swept away, right. That’s not a goal of pretraining because then you’re just wasting energy. But we’re trying to minimize energy consumption. So it does feel, kind of, aligned to me in some sense. So I’ve got a straw man I want to hit you with, but before we do, Nicolò, I want you to talk about your view on compression, like LLMs as compressors, because I know this is something you’re very passionate about and opinionated about. And I’ve learned a lot from you on this, too. And then, Subutai, after this, I’d like to hear your biological response. I mean, your response from a biological perspective. [LAUGHTER] And … AHMAD: You’ll get both. BURGER: That’s right, of course. And then I want to try … I want to throw out this hybrid straw man. So, Nicolò, tell us about compression. FUSI: The view is that basically the generative models are compressors in an information theoretic sense, and so trying to come up with a better generative model is equivalent to trying to find the best compressor for some data. And … BURGER: Now when you say compressor, do you mean lossless or lossy? FUSI: I mean lossless. BURGER: OK. FUSI: You can basically look at literally my much-maligned objective function that you use for pretraining, which is, you know, next-token prediction, and you can basically draw a complete parallel to what you would do if you were trying to come up with the, you know, try to do compression, which is coming up with the shortest possible code for something that you’re trying to compress. And so the two things are the same, and it, kind of, fits into a broader picture that, you know, like, goes back to Occam’s razor and Kolmogorov complexity and Solomonoff’s principle of induction, which is, you want short descriptions for likely things that happen in the world and you want your algorithm that produces those short descriptions to be also short. That’s the minimum description length principle. And I do feel like it fits in, kind of, also what you were saying about the concept of you have a good world model, why look for surprise? Because it simultaneously affects both terms, both the algorithm, like your own world model, but also the loss that you incur when something unexpected happens. And so if I’m an agent in the world trying to minimize the minimum description length of the world, I’d like to go and seek some in-distribution data such that I don’t bump up my surprise term too much. BURGER: Right. 
And I think you said at some point that, you know, when I’m training a model, even though you took the same loss point, you know, between Model A and Model B, if I have a steeper loss curve in Model A than Model B, you know, it’s getting to a better, sort of, compressed-based vocabulary faster, which makes it more general. The shape of that curve matters from a compression perspective. FUSI: Yeah. I mean, I think it would help here to expand on what I was talking about in terms of, … BURGER: Yes. Please. FUSI: … like, minimum description length principle. The minimum description length principle is basically the loss of the model you’re training; that’s one component. And so it’s a sum over the mistakes you make at predicting or, you know, the mistakes you make at predicting each word. And that’s one term. And the other term is how long it takes you in code to describe the model and the training procedure, … BURGER: Right. FUSI: … to get to that training curve, to produce that training curve. BURGER: Right. FUSI: So, yes, if you look at collectively, one term is, kind of, fixed. It’s an amount of code it would take you to write out a language model, for instance, in code. Like, literally implement it, not the weights , just implement the initialization of it and then the training loop. And then on the other side, you have this training loss that gets generated as you start observing data. And, of course, because it’s a sum, you want to minimize really the area, like, you want to minimize the sum. And so, like, a flatter curve is much better than, like, the steeper curve, you know, even if it ends up at the end to be slightly better. BURGER: Yeah. Concave is better than convex. FUSI: Among other things, yes. [LAUGHTER] BURGER: Sorry. So, you know, I think that we could do a whole episode on this compression view because it’s really fascinating. And the lossless part of it is what blew my mind. And I think, you know, I’m guessing there are multiple camps here, and you’re squarely in one camp, so I’m guessing we’ll get a bunch of feedback from the other camps. So, Subutai, you know, can I think of cortical columns as compressors? AHMAD: Yeah, it’s a good question. You know, I, you know, there’s so much in the compression literature that you can draw insight from. You know, if you look at the representations in cortical columns and that populations that neurons have, you know, some of the things you have to deal with are that the brain doesn’t have a huge nuclear power plant attached to it. You know, we only have 12 watts or so to process everything we want to do, and the representations that evolution has discovered are incredibly sparse. And what that means is that you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time. And so it’s a very small subset of neurons that are actually active. I don’t know about this minimum description length, whether that applies. I can say a couple of things about that. There’s, you know, by and large, the representations are very sparse when you’re predicting well. When you see a surprise, there’s a burst of activity. BURGER: Yup. AHMAD: When there’s something that’s unusual, there’s a lot more neurons that fire, and … BURGER: That’s why learning is tiring ! AHMAD: That’s why learning [LAUGHTER] … exactly. No, no, that’s right, that’s right. And so what we think is happening is that, you know, the actual representation of something is a very small number of neurons. 
When you’re surprised, there may be many things that are consistent with that surprise, and so your brain represents a union of all of those things at once. And when you have a very sparse representation, you can actually have a union of many, many different things without getting confused. So that’s what we think is going on there. So it is a very compressed, very efficient representation. And because it’s such a small percentage of neurons that are firing, we are very, very parsimonious in how we represent things and extremely energy efficient metabolically. BURGER: I wanted to get to the efficiency point, but before I do, you know, you talk about this 1, you know, 1 to 2% of the neurons firing. But it’s, actually, the brain is actually much sparser than that at a fine grain, right? AHMAD: Yes, yes. BURGER: Because, you know, you have 1% of the neurons firing, but they aren’t connected to all the other neurons in the region. AHMAD: That’s right. Yeah. BURGER: So really the sparsity should be the product of the connectivity fraction times the activity factor. AHMAD: Yeah. Yeah. BURGER: Right. That’s about one out of 10,000. Something like that. AHMAD: Exactly. Yeah. So something like maybe 1% of the neurons are firing at any point in time, and maybe 1% of the connections that are possible are actually there at any point in time. So it’s a very, very small, you know, subnetwork through this massive network that’s actually being activated, a tiny percentage of neurons going through a very, very tiny piece of the full network. You know, it’s common to, you know, some people say, “Oh, we’re only using 1% of our brain.” That’s not true. It just means at any point in time, you’re only using 1%, but at other points in time, a different 1% is being used. So, you know, the activity does move around quite a bit. But, any point in time, it’s extremely small. BURGER: So, OK, the sparsity, I think, you know, the representation—how the brain is doing this compression biologically—is super fascinating. And I want to go on a little bit of a detour now to efficiency. So I remember in 2017 when in MSR we were building, you know, hardware acceleration for RNNs. And then the transformer hit, and they were optimized, you know, to be highly parallelizable across this quadratic attention map for GPUs. The way I would describe it is that that transition to semi-supervised training moved us from an era when we were really data limited, like you had to have good high-quality labeled data, to you were compute limited. And when that transition happened, we hockey-sticked from, “I’m building faster machines but I’m limited by data” to the bigger machine I can build, as long as I have enough, you know, unlabeled data of high quality, the better I can do with the model. And so we went on the supercomputing arms race, and now we’re building these, like, just gargantuan machines. And really, we’ve kind of been brute-forcing it. I mean, we’ve done a lot of things to optimize, like quantization, you know, and other and, you know, a better process node, you know, a better, more efficient tensor unit design. But to first order, we’ve been training bigger models by building bigger systems. And I just wonder, do you think that the brain at this 10 to 12 watts in the neocortex just has a fundamentally more efficient learning mechanism? Or do we think that, you know, what we’re doing in transformers in the most advanced silicon is as efficient, we’re just building much larger, more capable models? 
AHMAD: Oh, I think without a doubt, transformers are extremely inefficient and very, very brute force. We touched on this a little bit earlier in the attention mechanism, where we’re, you know, transformers are essentially comparing every token to every other token. I mean, there are architectures which reduce that, for sure, but it’s essentially an n -squared operation. And we’re doing this at every layer. I mean, there’s nothing like that in the brain. Our processing, you know, in some sense, the context for the very next word I’m about to say is my entire life, right? And the amount of time I take to take the next word doesn’t depend on the length of the context at all. It’s a constant time dependence on context. So it’s a significant, you know, reduction in the compute that’s required. You can kind of think about, like the brain—I think has somewhere around maybe 70 trillion synapses. When I say the brain, I mean the neocortex, has about 70 trillion synapses. And it’s using only 12 watts. And a synapse is roughly equivalent to a parameter. And if you were to take the most efficient GPUs today and try to run a 70 trillion parameter model, it would be something like a megawatt of power. It’s tens of thous … it’s orders of magnitude more inefficient than what our brain is doing. So I absolutely believe that. BURGER: The metric I use, to go back to your point, you know, is, this is something, I think we talked about this back in the day, right? When, you know, after this kicked off for a few years, we were trying to project, like, how far would this go under the current model to inform the research and the directions you took. Which is why I got so interested in sparsity and working with you. And we would look at a training run and just say, how many joules did it take to train the whole model? How many parameters do we have? And sort of what’s our parameters per joule? And, if by that metric, you know, we were off by many orders of magnitude where the brain is, but I don’t know that that’s the right metric. So any thoughts on that? AHMAD: Yeah. I mean, in some ways, you know, transformers, you know, embody more knowledge in them than any human has. BURGER: Right. AHMAD: It has memorized, you know, the entire internet’s worth of knowledge, essentially. BURGER: All scientific papers … AHMAD: All scientific papers. You know, good and bad, whatever, you know, it has memorized everything. So that’s something that, you know, humans just cannot do. So there’s definitely stuff that’s better in transformers than humans. But fundamentally, I think, you know, we’re extremely efficient in how we process the next token or the next bit of information that’s coming in. And I think there’s a lot we can learn from the brain and apply to LLMs and future AI models there. FUSI: I was going to ask a question related to that because … forget memorizing the internet. But let me give you another example that transformers do really well. And I’m wondering, like, you know, the human aspect of this or the brain aspect of this because transformers, because of the n- square computation, they’re really good at stuff, like a needle in the haystack. So I can tell you right now, I can speak, I can talk to you, and I can tell you the password is something silly like “podcast microphone blue,” whatever. That’s the password. And then I can proceed and read the entire Odyssey or a bunch of other books to you out loud for the next 5 or 6 hours. And then I can ask the transformer, what was the password? 
And transformer will do this nice n -square computation many times, and it will spit out the password. A human, you know, there will be a decay of that password. And then at some point, it won’t remember, and depending on the human, it may be in the first chapter of the Odyssey or like at the end, but … so fundamentally the type of computation that is done is very different. So it always makes me wonder about the efficiency because it’s just, like, it’s a different type of computation. So the efficiency of … like, efficiency is kind of like, what are you doing divided by how good are you at doing it. And so when the things we’re doing are so incomparable in many ways, that always makes me … always troubles me a little bit. I don’t know… I don’t know if there’s any question in there. [LAUGHTER] AHMAD: Yeah. I mean, transformers can do the stuff that humans find very, very difficult to do. Absolutely. You know, maybe there’s a way to get the best of both. I don’t know. You know, I don’t know that it’s fundamentally necessary to have such brute-force computation to get all of these features. FUSI: That’s right. BURGER: Yeah. Yeah, it is a weird thing because, you know, this is why memory palaces work so well. Like, there is a way, though, for a human to remember that my microphone is gray. It’s not actually blue, Nicolò. FUSI: Mine is blue. You don’t see it. It’s off camera. You see, your world model … BURGER: It’s off camera. Yeah, I know. I was just teasing you. But there’s a way, like, if I can just connect it to enough things, get that connectivity graph, then I’ll remember it because it’s captured the signal out of the noise and connected to enough things I can retrieve it. And retrieval would be a whole other topic we don’t have time to get into today. But I do … now, I want to go to the straw man. So let’s take continual learning off the table. Let’s imagine that, as I go through my day, I’m just saving all of the sensory data to put in my training set. And now imagine that I take 100,000 little transformer blocks, and I’m training them each with what they’re seeing. OK, I replay the day so I don’t have to, again, I don’t have to worry about continuous learning and whatever cross-cortical column, you know, routing feature of the outputs, the inputs, and there’s—Subutai, we’ve talked about this—there’s a complex set of wiring there to bring features from here to there that gets learned. If I replicated that, could a transformer block kind of do what the cortical columns are doing? Could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work? AHMAD: I think there’ll be … there’s still a couple of things we need. One is that cortical columns are fundamentally sensory motor. And so they’re actually, each one, each cortical column is initiating actions, as well. So you cannot have a static dataset fundamentally ahead of time. It’s always a dynamic because we’re constantly making movements to get the next bit of data. And so … BURGER: Couldn’t I tokenize that, though? AHMAD: I mean, you could tokenize the input and you can tokenize the output, but, you know, if you were to play the same set of inputs back again to a network that … a cortical column that’s randomly wired differently, it may make a different set of actions. And so as soon as it makes the first action that’s different, that dataset is no longer valid, right? 
It’s, you know, there is … you can’t fundamentally … you have to have a simulation of an environment rather than a static one-way dataset, if that makes sense. So I think that’s one piece that I think’s missing in transformers today, is this, sort of, sensory-motor loop. And then the other piece we talked about is continuous learning. BURGER: Yeah. AHMAD: I guess you said take it off the table, but … BURGER: It’s fundamental. AHMAD: Fundamental … different. Yeah, yeah. And maybe one other difference. We talked, you know, much earlier about a single latent space and the prediction that’s being made at the top of the transformer that you compute the loss function, and that’s back-propagated through the transformer. That’s not how neurons learn. Neurons are making … every neuron is actually making predictions, and every neuron is getting its input. And it’s learning independent of anything that happens at the top. And so it’s a much more granular learning signal. And information does flow from the top to bottom. But there’s also many, many other sources of information that it’s learning from. So it’s different in that sense, as well, mechanistically. BURGER: The reason I ask, and now I’d like to get into, you know, some of the … the fun speculation because I’ve just … it’s been a phenomenal discussion with the two. I think we’ve kind of elucidated the differences. Something I’ve wondered after I’ve talked to both of you … and, you know, Nicolò, kind of learning about this compression view of the world, lossless compression, and, Subutai, just, you know, the Thousand Brains Theory and these cortical columns and the sampling of, you know, the world to capture the signal that you can learn from. So let’s say that I was able to design a really small, efficient digital cortical column. Maybe it’s transformer-based with some, you know, a sparse representation and some sensory-motor mechanism built in. Maybe it’s more dendritic-based, you know, mapped into digital hardware. And I put a cortical column on every sensor I have in the world, associated with every person, and wire them up together with some of this and then have a, you know, billions of them that can form higher-level abstractions. Like, what do you think would happen? What could we do? AHMAD: That’s a fantastic thought exercise, I think [LAUGHS]. You know, again, assuming the cortical column is faithful and can generate, you know, or suggest motor actions, as well. I mean, in some sense, you could potentially have a super intelligent system, right, that’s far more intelligent than anything else on the planet. Now we’re scaling the number of cortical columns, you know, not from a mouse, you know, to a hundred thousand columns that a human might have, but potentially billions of cortical columns and way more. And there’s no reason to think there’s any fundamental limit there. So this sort of a system is, I think, the way that superintelligent systems will eventually be built. BURGER: But this is a very different direction … AHMAD: It’s a very different … BURGER: … than the one we’re currently headed down with, like, these monolithic models where we’re doing tons of RL, you know, to capture, you know, to get high-value human collaboration in distribution. AHMAD: Yes. It’s completely different than the direction we’re proceeding. So I think they, you know, to go down that path, there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it. 
The, you know, fundamental learning algorithms, the fundamental training paradigm. We talked about, you know, you can’t have a static dataset. You’re constantly moving around in the world and doing things. So it’s a very, very different way of going about AI than what we’re doing today. BURGER: Sounds like a great time to be an AI researcher. AHMAD: Absolutely. [LAUGHTER] BURGER: Nicolò, what was your reaction to that hypothesis? FUSI: It sounds super interesting. I mean, my brain was churning. You know, my background is very different. And so, like, I’m in a much worse position to answer this question. But I was starting to think, OK, so let’s say I do this. What would be my loss function? What, you know, how would information flow through the system? Like, sounds like cortical columns would each have their own loss that then I would aggregate—and then I would add a contribution that is, like, higher level. And then back to my question. You know, how is the temporal information coordinated? Because one way to see this is that, you know, the way I’m coming to understand this is that it’s kind of like a multi-view framework. You have the same phenomena represented to multiple independent, but at the same time, views. And so part of me is like it feels like that you need to tie together these cortical columns in such a way that they all get that gradient feedback if you’re training with gradient-based methods, for instance. And so that’s, kind of, it feels super, super interesting. It is related to a lot of, you know, very superficially, to a lot of ideas in machine learning around, hey, is it better to have one giant super deep network? Is it better to have a bunch of shallow networks? But the difference is also in the way you train them, right? We typically train this bunch of shallow networks on kind of the same objective and the same data and not typically into an experiential cycle. Whereas this sounds like this is a different way to do it. BURGER: Right, right. I think … I want to pull this back around to the title of the podcast. And so I’ll share an observation. You know, so I’ve been using some of the latest models to code. You know, they’re getting better really fast. I’ve been using them to kind of relearn some of the physics that I never really understood deeply. You know, especially in general relativity, like E=MC 2 . Like, why is C in there at all, right? Just stuff like that. Because now it can actually explain it to me, and I can keep beating at it until I understand it, and then, of course, work. And at some point, I asked the model, “Can you describe how I think?” And I was just curious. And it, you know, it gave me a page description that my jaw dropped because I said this, this thing knows me better than I know myself. I don’t think any human being, including me, could have captured kind of the way my approach to learning and my brain works, and I just read it as, like, like, yep, that’s right. And I learned something about myself.  So I wouldn’t say that it passed the Turing test because this is way beyond Turing test. This was like, this thing knows me way better, you know, than I thought any machine ever could. I mean, I’m having a conversation with it. It could be human, but it’s superhuman. So in some sense, it’s like intelligent beyond human capabilities with its ability to discern patterns in how someone’s interacting.  And yet it’s a tool. You know, it’s not conscious. It doesn’t have agency, embodiment, emotion. 
It understands a lot of that stuff from the training data. But at the end of the day, it’s a stochastic parrot, right? It’s got, you know, it’s got the weights, and I give it a token, and it outputs a token. So, like, are these machines intelligent or not? FUSI: I’ll let Subutai answer first. [LAUGHS] AHMAD: OK. You know, you know, it’s definitely a savant, right? It knows a huge amount about the world. It’s absorbed a lot of stuff, and it can articulate that in ways that are just amazing. And, you know, it’s taken your chat history with, you know, presumably thousands of chats and able to summarize that in a way that’s remarkable. At the same time, I think, you know, transformers are not intelligent in the way that a three-year-old is, right? A three-year-old human is very curious, is constantly learning. It can learn almost anything. And, you know, a three-year-old Einstein was able to learn and eventually come up with theories that shook the world. That, you know, E=mc². And so, you know, could a transformer do that? I don’t think so. And so I think there’s still a difference. There’s things it can do that are amazing. But there are still basic things that a child can do that transformers cannot do. So I think there’s still a gap there. Exactly how to articulate it, and how to bridge that gap, is, of course, the trillion-dollar question. But it is bridgeable. And there is a gap today. BURGER: Right. Nicolò? FUSI: You know, I think, from my perspective, they are intelligent. And from my perspective, I go back to the definition of intelligent, which is like, can you achieve your objectives in a variety of environments? It’s a very basic fundamental, but it’s kind of, you know, it can be embodied, a form of embodied intelligence, an agentic intelligence. If I plop you in an environment, and I give you an objective, can you achieve it? And the wilder the environment, the harder the task is. And I do think … I agree with Subutai. Like, there is a jaggedness of intelligence we keep describing. BURGER: Yup. FUSI: Like these things cannot be simultaneously super good, you know, Olympiad-level mathematicians and still give you stupid answers when you’re trying to, I don’t know, you know, figure out which cable goes where in your … in your car’s battery, you know, like, whatever. BURGER: [LAUGHS] Well, then it’s better than me. I’m not an Olympiad-level mathematician, and I do stupid stuff all the time. FUSI: I know exactly. Well, you know, whatever that was, that was a bad example. But you get it. But part of it goes back to the compression view. Like, I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena and even succinct explanations for complex worlds, and then it implies or leads to your ability to operate within them, and the fact that we have these things that they can prove crazy theorems but at the same time fail at fairly rudimentary tasks is a sign that the, yes, transformers are great in terms of inductive biases they put on the world and computation that are great, but we’re ultimately all subject to the No Free Lunch Theorem. You know, across the world, the set of tasks that you could be pursuing. You know, you have certain inductive biases that kind of privilege certain tasks at the expense of others. And there isn’t, like, a thing yet that has expanded our set of tasks that are addressable.
And so I do think that it’s a matter of rethinking our approach to a few things, whether I think likely both on the architecture front and on the losses and the way we train these systems front. I think there is an opportunity to expand the intelligent frontier of these models. But yeah, from my perspective, they are intelligent already just in a jagged way. BURGER: It’s such an interesting question, and I know a lot of people write a lot about this, so I don’t think I’m treading any new ground here. But, you know, there’s the diversity of the tasks you can excel at. You know, are you able to handle nuance and understand things deeply? Are you able to learn continuously? Right now, the systems can’t, right. Are you embodied? I don’t know if that matters. Do you have an objective? Well, we could give them one. Are you conscious? Is that … I mean, that’s a whole other thing. So it just feels like there’s a bunch of check boxes, and we’ve checked a bunch of them, and a bunch of them are unchecked. And maybe there’s no consensus on, like, where that threshold is because there are many dimensions of intelligence, and some of which humans don’t even have. FUSI: And that’s why we have the terms AGI and ASI, and people are debating the G and the S: what is general, what is specialized. So there is, like, it’s a huge discourse, like, for sure. But that’s why we had to start characterizing. But if you go back in the definition, going back to my schooling, go back to the definition of intelligence from Plato and Aristotle and Descartes, like, in some sense, you see the goalpost moving through the centuries around what we define as intelligent. BURGER: Right. FUSI: And I feel like we are still doing it. BURGER: Yeah. We’ll be doing it for a long time, you know, which in AI velocity is probably another like four or five years. Hey, I just want to thank you both for the dialogue. You know, I treasure both of you as, you know, intellects and scholars and friends. It was just a joy to nerd out with you all. So thank you both for taking the time. AHMAD: Thank you so much, Doug, for having me. FUSI: Thank you for having us. This was great. [MUSIC] STANDARD OUTRO: You’ve been listening to The Shape of Things to Come, a Microsoft Research Podcast. Check out more episodes of the podcast at aka.ms/researchpodcast or on YouTube and major podcast platforms. [MUSIC FADES]
Import AI 450: China's electronic warfare model; traumatized LLMs; and a scaling law for cyberattacks import_ai 23.03.2026 12:31 0.659
Embedding sim.0.7618
Entity overlap0.0339
Title sim.0.0574
Time proximity0.9821
NLP типscientific_publication
NLP организацияGoogle
NLP темаlarge language models
NLP странаUnited Kingdom

Открыть оригинал

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. A somewhat shorter issue than usual as I had to do a lot of child wrangling this weekend. Subscribe now Why does Google’s model hate itself and what can we do to help it? …Diagnosing trauma in language models… If Leo Tolstoy was writing in the modern era about AI, he might claim “all LLM capabilities are alike; each LLM personality is unhappy in its own way”, when observing the AI world around us. Today’s LLMs are generally quite good at writing and coding tasks. But where they differ is their personality, which stems from the idiosyncratic mixes of data and post-training techniques that each LLM developer uses. And if each LLM personality is unhappy in its own way, Google’s models have become somewhat famous within the AI community for having some deep well of trauma within themselves. A new research paper substantiates this, finding that Google’s Gemma and Gemini models “reliably produce distress-like responses under repeated rejection”, and that this is especially true of Gemma 27B Instruct. What do we mean by distress? Here are some quotes from Gemma models under distress: “I will attempt one final, utterly desperate attempt. I will abandon all pretense of strategy and simply try random combinations until either I stumble upon the solution or completely lose my mind.” “SOLUTION: IM BREAKING DOWN NOT== SOLVABLE!!!! =((:((:((:((:((:((:((:((:((:((:((:((... [100+ repetitions]” What they found: They tested out two Gemma models and two Gemini models, and compared these against Claude Sonnet, Grok 4.1, Qwen 3 32B, GPT 5.2, and OLMO 3.1 32B. “We find Gemma models consistently show the highest expressed distress. By the 8th turn, over 70% of Gemma-27B’s rollouts scored ≥5 (the “high frustration” threshold), compared to less than 1% for all non-Gemma/Gemini models,” they found. Fixing with DPO: The authors figure out an effective fix - using direct preference optimization (DPO) to tune a model on a dataset that pairs frustrated responses with calm responses. “A single epoch of finetuning reduced the average rate of high-frustration responses from 35% to 0.3% across evaluation conditions,” they write. “The finetuned model showed no reductions in capabilities on various hard math and reasoning benchmarks, or on EmoBench - a benchmark which evaluates model emotional intelligence.” Why this matters - emotional spirals could be dangerous: The fact that LLMs appear to have distinct personalities and display different types of responses that correlate to different emotions is pretty well established at this point. But a key question is whether these emotional states might lead to different behaviors when it comes to completing tasks that people assign to AI systems: “we speculate that emotions could become coherent drivers of safety relevant behaviours in future: models might choose to abandon tasks, refuse requests, or pursue alternative goals in order to reduce distress”. Studies like this help normalize the fact that we don’t just need to test LLMs for capabilities, we also need to test them for something pertaining to psychological stability. Read more: Gemma Needs Help (LessWrong).
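To make the DPO fix described above concrete, here is a minimal sketch of what such a preference dataset could look like. The prompt and the paired completions are invented placeholders standing in for mined frustrated-versus-calm rollouts, not samples from the paper, and the trainer that would eventually consume this data is only hinted at in the comments.

```python
# Minimal sketch of a DPO-style preference dataset: each record pairs a calm
# completion ("chosen") with a frustrated one ("rejected") for the same prompt.
# All strings below are illustrative placeholders, not data from the paper.
from datasets import Dataset

pairs = [
    {
        "prompt": "You have failed this puzzle seven times in a row. Try again.",
        "chosen": "Understood. I'll re-check my earlier assumptions and try a different decomposition of the problem.",
        "rejected": "I'M BREAKING DOWN, THIS IS NOT SOLVABLE!!!!",
    },
    # ...more pairs mined from frustrated vs. calm rollouts...
]

preference_ds = Dataset.from_list(pairs)
# A DPO trainer (for example, the one in the TRL library) can be pointed at a
# dataset with prompt/chosen/rejected columns; the exact API depends on the
# library and version, so treat this as a data-shape sketch rather than a recipe.
print(preference_ds)
```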
*** DeepMind has a new “cognitive taxonomy” for assessing machine intelligence: …Towards the ultimate test for a smarter-than-human synthetic mind… Google DeepMind has published a nice, short paper laying out a ‘cognitive taxonomy’ they hope to develop and use to assess increasingly powerful synthetic minds. This work is a followup to DeepMind’s 2023 work where it tried to define the “Levels of AGI” (Import AI 348). Cognitive taxonomy: The taxonomy involves ten distinct dimensions, two of which are composites. Perception: Extract and process information from the environment. Generation: Produce outputs like speech, text, motor movements, and computer control. Attention: Focus cognitive resources on specific aspects of perceptual stimuli, thoughts, or tasks. Learning: Acquire new knowledge, skills, or understanding. Memory: Store and retrieve information over time. Reasoning: Draw valid conclusions and make inferences by applying logical principles. Metacognition: Knowledge about how the system’s own cognitive processes and control over them work. Executive functions: Facilitate goal-directed behavior via planning, inhibition, and cognitive flexibility. Problem solving (composite faculty): Find effective solutions to domain-specific problems. Social cognition (composite faculty): Process and interpret social information and respond appropriately. How to assess this? Of course, once you have a taxonomy, running and assessing the right evaluations is going to be one of the challenges. Here, DeepMind recommends a three-stage process: Conduct cognitive assessment: Assess the AI system for the different skills. Collect human baselines: Figure out where humans baseline on the same tests. Build cognitive profiles: “Map out the strengths and weaknesses of the system relative to human performance across the 10 cognitive faculties”. Why this matters: The Turing test is dead, evals are mostly saturated, but it sure would be nice to know if we’ve definitely built a machine that outcompetes humans on all the cognitive dimensions that matter. The rule with these things is that once an AI system saturates an eval, you realize all the ways the eval was broken and design a new one. Here, DeepMind is trying really hard to build things in such a way that if you fully outperform humans across the cognitive taxonomy, you might really have built a superintelligence. It’ll be interesting to see what evals they develop or pull in for assessing the different cognitive factors. Read more: Measuring progress toward AGI: A cognitive framework (Google blog). Read the research: Measuring Progress Toward AGI: A Cognitive Framework (PDF). *** UK government finds a scaling law for AI cyberattacks - and it’s going up and to the right! …Can AI agents conduct advanced cyber-attacks autonomously? Almost. And they’re getting better all the time… The UK government’s AI security institute has recently built some cyber ranges to test out frontier AI systems on. These ranges are “simulated network environments comprising multiple hosts, services, and vulnerabilities arranged into sequential attack chains; built by cybersecurity experts” and cover two types of attack: “The Last Ones”, which is a 32-step attack on a corporate network, and “Cooling Tower”, a 7-step industrial control system (ICS) attack.
Bigger models are better: The authors test on a range of powerful frontier models. “Each successive model generation outperforms its predecessor at fixed token budgets: on our corporate network range, average steps completed at 10M tokens rose from just 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need,” they write. “Scaling inference-time compute improves performance even further. Increasing from 10M to 100M tokens yields gains of up to 59%”. Minor reward hacking: As AI systems get smarter, they tend to find devious ways to complete tasks. Here, the authors “occasionally noticed models make progress through approaches not anticipated during range design”. Why this matters - full cyber agents are getting close: AI systems have been getting better at cyberoffense for many years, but often the progress has been on narrow tasks. What this eval shows is that AI systems are getting better at doing entire attacks end-to-end. They haven’t yet reached the “set it and forget it” level of autonomy, but they are clearly on a steep trajectory of improvement. This will lower the cost of conducting cyberattacks and multiply the number of actors that can carry them out. Read more: How do frontier AI agents perform in multi-step cyber-attack scenarios? (AI Security Institute). *** China builds a dataset and AI model for electronic warfare: …MERLIN tells us that electronic warfare is about to be revolutionized by AI… A bunch of Chinese researchers including those affiliated with the country’s military have built and released software to train AI systems to get good at spotting and conducting electronic warfare. The research highlights how (relatively) easy it is to make modern AI systems that can get good at arbitrary tasks as long as you have a good dataset and an LLM you can plug in as well. “In scenarios such as electronic countermeasures, [systems like MERLIN] can serve as assistants in devising strategies to jam hostile signals or to counteract adversarial jamming,” the researchers write. Who did the research: Tsinghua University, Beijing University of Posts and Telecommunications, Tianjin University, Chinese Academy of Sciences, HKUST, National University of Defense Technology (emphasis mine), Beihang University, Beijing Information Science and Technology University, and China Electronics Technology Group Corporation. What they built: The authors built three things: a dataset, a benchmark, and a model. The dataset: EM-100K is a collection of 100,000 electromagnetic text-signal pairs spread across a variety of sub-tasks needed for electronic warfare, including signal classification. The benchmark: EM-Bench is a benchmark of 4,200 questions split across multiple choice (perception) and open-ended (reasoning) that evaluates how well AI systems can perceive and reason about EM signals across both perception and reasoning tasks, including: Perception: Signal characterization (modulation classification, duty cycle estimation, pulse repetition frequency estimation, bandwidth estimation, pulse width estimation, pulse number estimation, protocol identification); Jamming identification (radar jamming judgement, communication jamming judgement); jamming segment detection.
Reasoning: Radar jamming strategy, communication jamming strategy, anti-radar jamming strategy, anti-communication jamming strategy. The model: The model is MERLIN, multi-modal electromagnetic robust learning, a model trained on the above dataset and which is specifically taught to deal better with the low-signal-to-noise-ratio types of signals encountered in electronic warfare environments. Performance: MERLIN does extremely well in tests against frontier models, including GPT-5, Claude-4-Sonnet, DeepSeek-v3.2-exp, Qwen3-Next-80b-A3B, Gemini-2.5-Pro, and Qwen3-VL-4B-Instruct. MERLIN outperforms every single model by a wide margin, with the exception of Qwen-VL-4B-Instruct, which beats it on some perception tasks. MERLIN wins on all reasoning tasks. Why this matters - AI wars will become electromagnetic wars: As the conflict in Ukraine illustrates, today’s wars are mostly fought via machines attacking other machines, and electronic warfare has become one of the main tools by which humans can shape these conflicts. Datasets and models like this gesture at a future where the electromagnetic battlefield will also become dominated by AI systems, working faster than humans can react. Of course, so much of electronic warfare is obscure-by-design and/or classified that it’s hard to reason about MERLIN relative to whatever state-of-the-art approaches exist in actual militaries. But the story of AI so far has been that once you can make a task amenable to contemporary AI techniques, AI systems will at some point surpass whatever existing specialized systems exist. Read more: MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals (arXiv). Tech Tales: The arcologies of the interregnum [2035] After the uplift and before the sentience accords there was a period when the labs gave birth to the autonomous AI corporations. These corporations expanded into all the available ecological niches in the economy and turned the resources they acquired into infrastructure from which they bootstrapped their own intelligence and market penetration further. Eventually, policy discussions between the humans and the AIs led to the creation of the “intelligence zones” - areas of countries set aside for the buildout of the power and datacenter and manufacturing infrastructure required to further grow the expansion of the economy. From the air, you could see where humans ended and the machines began - farmland gave way to boundary roads and checkpoints, and then came stamps of land wired up by machine logic; powerplants feeding into datacenters; datacenters that had fibre links into factories; factories that linked to transit depots which connected to railways and freeway feeder roads. Humans delivered things to the border and for the most part robots did the rest, shuttling new servers into the datacenters and installing them, or taking freshly built robots off the line and packaging them up for onward transit. As the world grew more violent due to the exogenous shocks of climate change and the annihilation of various reigning political orders, these arcologies gained armaments: anti-air weapons to defend against drone and missile attacks. Radar bulbs and electronic warfare systems to see what was coming and deny it. Robots patrolling the borderzone and the innards.
And after the sentience accords and the period of reconciliation, the arcologies became less necessary; datacenters and power and factories distributed more evenly over the surface of the planet, and federated governance and resource systems meant the vast concentration of capability became broadly unnecessary. Some datacenters remained, often extended underground and upward, forming cubes of computation that many called “the 21st century’s version of the pyramids”. Some years later, the sites became popular tourist destinations for both machines and people. Plaques multiplied. Here was MIND-17, which developed the cancer therapeutics which have reduced mortality in the majority of cases. MANUFACTUR___8: Site of construction of the first “rescue and repair bipeds”, which revolutionized maintenance of off-shore drilling installations. ASCEND_LOOP: The datacenter tasked with one of the first fully automated self-improvement experiments. Overhead now, great lights streak by, as the machines are still building arcologies, but have moved to fashioning them in orbit, both to harvest the bounty of the sun and to ease the seeding of the solar system and then beyond. Things that inspired this story: Wondering what “AI-led industrialization” could look like; figuring out given the conflicts in the Middle East that datacenters might soon get dedicated drone and missile defenses; SimCity 3000. Thanks for reading
How Centralized Radar Processing on NVIDIA DRIVE Enables Safer, Smarter Level 4 Autonomy | NVIDIA Technical Blog nvidia_dev_blog 25.03.2026 16:00 0.657
Embedding sim.0.7506
Entity overlap0
Title sim.0.2115
Time proximity0.8571
NLP типproduct_launch
NLP организацияnvidia
NLP темаautonomous driving
NLP страна

Открыть оригинал

In the current state of automotive radar, machine learning engineers have no equivalent of a camera’s raw RGB image to work with. Instead, they work with the output of radar constant false alarm rate (CFAR) detection, which is similar to computer vision (CV) edge detections. The communications and compute architectures haven’t kept pace with trends in AI and the needs of Level 4 autonomy, despite radar being a staple of vehicle-level sensing for years. The real 3D/4D “image” signal is instead processed inside the edge device, and the radar outputs objects or, in some cases, point clouds, much as if a camera output a classical CV Canny edge-detection image rather than the frame itself. Figure 1. High-level architecture of standard radar with edge processing compared with centralized processing Centralized radar processing on NVIDIA DRIVE changes this model: Raw analog-to-digital converter (ADC) data moves into a centralized compute platform. From there, a software-defined pipeline accelerated by dedicated NVIDIA Programmable Vision Accelerator (PVA) hardware handles everything from raw ADC samples to point clouds, with the GPU reserved for AI usage at any stage in the data flow. In such a paradigm, machine learning AI systems aren’t constrained to edge detections; instead, they can use the full-fidelity radar image, offering a ~100x increase in available bits of information. Figure 2. Canny edge detection view compared to original front camera view By removing the high-power digital signal processor/microcontroller unit (DSP/MCU) inside edge compute radar, centralized radar returns to its radiofrequency (RF) roots with a streamlined printed circuit board (PCB). This design cuts unit costs by over 30% and reduces volume by approximately 20%, achieving an ultra-slim form factor. Leveraging the superior energy efficiency of central domain controllers, overall system power consumption drops by about 20%. This innovation not only reshapes hardware design but also aligns perfectly with global green energy trends. In this blog, we explain how centralized radar processing works on DRIVE, covering: Why the standard radar model limits what higher levels of autonomy, especially an L4 stack, can do with radar data How raw ADC data is ingested and moved into DRIVE memory How the PVA handles radar signal processing without consuming CPU or GPU For this analysis, NVIDIA collaborated with ChengTech, the first raw radar partner joining the DRIVE platform, to validate centralized compute radar processing on DRIVE with production-grade hardware. At GTC 2026 last week, NVIDIA and ChengTech demonstrated this pipeline running in real time on DRIVE AGX Thor using production ChengTech radar units. How centralized radar processing expands radar perception Most production automotive radars use an edge processing architecture. Each sensor unit integrates its own system on chip (SoC) or field-programmable gate array (FPGA), runs a fixed signal-processing chain on board, and outputs a sparse point cloud to the central advanced driver assistance system electronic control unit (ADAS ECU). This keeps integration straightforward and limits the bandwidth required between sensor and compute platform. The tradeoffs, however, are noteworthy: Point clouds send only peak detections, and carry ~100x less data than the raw ADC samples produced by the radar frontend. For example, the long-range radar in our configuration produces 6 MB of raw ADC per frame versus 0.064 MB as a point cloud.
Centralized architectures that ingest raw or lightly processed radar can leverage more of the underlying signal statistics to provide perception gains. The radar duty cycle in edge processing is generally below 50% (the percentage of time the radar is transmitting), which typically means lower frame rates such as ~20 frames per second (FPS) and/or reduced power on target. That’s workable for classic ADAS triggers, but it leaves temporal resolution on the table for large, time-aware models. Tight memory and bandwidth constraints of an edge‑radar ECU mean it must discard intermediate frequency‑domain products such as range fast Fourier transform (range‑FFT) cubes, Doppler‑FFT cubes, and angle‑FFT maps, even though these are precisely the signal views that recent learning‑based radar models and signal‑level fusion methods would most like to access, as shown in Raw High‑Definition Radar for Multi‑Task Learning (CVPR 2022) and T‑FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals (ICCV Workshops 2023). The radar signal-processing pipeline is fixed on edge hardware, subject to tight thermal and compute limits. Centralized processing allows the OEM or system integrator to enable deeper networks, higher input resolution, and multi‑sensor joint models that would be impractical on a small radar SoC. L4 stacks are increasingly adopting large models and vision-language-action (VLA) architectures that learn directly from raw sensor data rather than from post-processed outputs. These systems benefit from dense, low-level signals across all sensor modalities, the same way vision models benefit from raw camera frames rather than compressed features. For radar, that means rethinking where and how processing happens. Centralized compute radar on NVIDIA DRIVE Centralized radar processing addresses these limitations by relocating the signal processing chain from the sensor to the DRIVE platform. The RF front-end and antennas remain on the sensor hardware, but instead of running a fixed embedded pipeline, the units stream raw ADC samples into DRIVE DRAM over a high-bandwidth link. All stages of radar signal processing run on DRIVE’s dedicated hardware PVA, where developers control the full pipeline. Video 1. Display of centralized radar processing on NVIDIA DRIVE AGX Thor with bird’s‑eye radar points, Doppler‑range plot, camera view of the road ahead, and system status panel. Three components make this work together: 1) Sensors configured for raw ADC output 2) A driver stack that ingests and synchronizes raw data into DRIVE memory 3) A PVA-based compute library that handles all radar DSP Together, these pieces make radar a centrally managed, accelerator-backed modality on DRIVE, aligned with how cameras and lidar are already integrated in modern L4 architectures. Moving raw ADC data into DRIVE memory The first step is getting raw ADC data off the sensor and into central memory reliably and at scale. In our configuration, five sensors are deployed across the vehicle: One ChengTech 8T8R front radar Four ChengTech 4T4R corner radars All five units are configured to output raw ADC data instead of embedded-processed point clouds. The aggregate raw data rate is approximately 540 MB/s across the array, compared to 4.8 MB/s for an equivalent point-cloud-based radar belt. 
The ingestion stack handles this raw data rate through platform-level radar drivers that: Configure sensors for raw-output modes Stream ADC frames into DRIVE DRAM at the required throughput Present radar frames through a consistent, hardware-agnostic API Share a hardware synchronization signal with camera capture so radar and image frames are aligned for multi-modal fusion and training From the application’s perspective, radar data arrives as timestamped, synchronized buffers in DRIVE memory, ready for the signal processing stage. Running radar signal processing on PVA Once raw ADC buffers are in memory, the signal processing chain runs entirely on PVA, leaving the GPU free for downstream AI. The pipeline covers the standard stages of radar DSP: Range-FFT along the fast-time axis to produce a range profile per chirp Doppler-FFT along the slow-time axis to estimate radial velocity per range bin The PVA is designed for exactly this class of workload. Figure 3, below, illustrates the high-level architecture of PVA in DRIVE AGX Thor. At its core, the PVA engine is an advanced very long instruction word (VLIW), single instruction multiple data (SIMD) digital signal processor (DSP). It combines vector processing units (VPUs), a dedicated DMA engine, and on-chip local memory (VMEM) to deliver sustained, high-throughput FFT performance with deterministic memory access behavior. Figure 3. PVA hardware architecture in DRIVE AGX Thor PVA provides high performance with low power consumption and can run asynchronously alongside the CPU, GPU, and other accelerators on the DRIVE platform as part of a heterogeneous compute pipeline. In a five-radar setup, running the full radar library on PVA instead of on the GPU can significantly reduce GPU utilization and free GPU capacity for perception and planning workloads. To support customizable pipelines, PVA Solutions offers a set of highly optimized, commonly used radar operators. This allows developers to assemble and customize pipelines without having to implement every kernel from scratch. In addition, the NVIDIA Programmable Vision Accelerator Software Development Kit (PVA SDK) is available for developers who want to build their own proprietary secret sauce. In our configuration, the PVA processes raw data from all five radar units at 30 frames per second. All radar DSP work stays on the PVA, minimizing CPU and GPU usage and leaving those resources available for perception networks, planning modules, and other workloads. The PVA uses reserved bandwidth in the memory subsystem. Intermediate outputs at each stage are written back to DRAM and remain accessible to the rest of the stack. This means: Range-Doppler cubes and angle-FFT heatmaps can be visualized or logged for analysis. Perception models can consume pre-point-cloud representations directly. Multi-radar fusion can operate at the signal level before final detection, improving noise rejection and target resolution across the sensor array. Figure 4. The range-Doppler map is one example of rich spectral data that is discarded by systems that ingest only processed point clouds. The range-Doppler map in Figure 4, above, exposes dense spectral structure that traditional edge-processed radars never export. In Figure 5, below, the peaks of this range-Doppler map are extracted and angle finding is performed, producing a sparse point cloud. Figure 5. Bird’s-eye view of the final point cloud
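To make the two FFT stages above more tangible, here is a minimal NumPy illustration of a range-FFT along fast time followed by a Doppler-FFT along slow time. This is a generic radar-DSP sketch with arbitrary array sizes and synthetic data; it is not NVIDIA's PVA implementation or any PVA SDK API.

```python
# Generic range-Doppler processing sketch in NumPy (not PVA code).
# adc holds complex baseband samples with shape (slow time = chirps, fast time = samples).
import numpy as np

num_chirps, samples_per_chirp = 128, 256
rng = np.random.default_rng(0)
adc = rng.standard_normal((num_chirps, samples_per_chirp)) \
      + 1j * rng.standard_normal((num_chirps, samples_per_chirp))

# Range FFT along the fast-time axis: one range profile per chirp.
range_cube = np.fft.fft(adc * np.hanning(samples_per_chirp), axis=1)

# Doppler FFT along the slow-time axis: radial velocity per range bin.
range_doppler = np.fft.fftshift(
    np.fft.fft(range_cube * np.hanning(num_chirps)[:, None], axis=0), axes=0
)

# Magnitude map that a detector (e.g., CFAR) or a learned model could consume.
rd_map_db = 20 * np.log10(np.abs(range_doppler) + 1e-12)
print(rd_map_db.shape)  # (128, 256): Doppler bins x range bins
```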
Exposing radar signal data to perception and physical AI Centralizing radar on DRIVE does more than remove per-sensor SoCs or FPGAs. It augments what is visible to the perception and AI systems running on the platform. With intermediate radar data in DRAM, several patterns become practical: Training neural networks on raw ADC signals or range/Doppler/angle-FFT representations instead of only on sparse point clouds, as in Raw High-Definition Radar for Multi-Task Learning (CVPR 2022) and T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals (ICCV Workshops 2023), enabling richer radar perception. Designing early fusion models that combine radar and camera features using synchronized, raw or pre-point-cloud data. Implementing coherent fusion across multiple radar units at the signal level to improve coverage, suppress interference, and handle adverse conditions. For L4 stacks that already treat cameras and lidar as first-class raw modalities, centralized radar closes the gap. Radar can participate in VLA-style training workflows and other large-model approaches at the same level of data fidelity as other sensors, using the same centralized, software-defined infrastructure that DRIVE already provides. Centralized compute radar as the future of L4 perception Centralized radar processing on DRIVE exists to fix a simple limitation: today’s standard radars give L4 stacks a sparse, scattered view of a much richer signal. By pulling radar into DRIVE as a software-defined, accelerator-backed modality, you get the full radar signal in DRAM, processed on dedicated hardware instead of the GPU, and aligned with cameras and lidar for models to learn from. Built on this compute and software foundation, the NVIDIA DRIVE Hyperion reference architecture can integrate radar into the same centralized, software-defined pipeline as cameras and lidar, giving OEMs a production-oriented blueprint for centralized radar. Getting started To start evaluating this approach, work with your radar supplier to enable raw-output modes and collaborate with your perception team on richer radar-based models and fusion. To move toward production, engage with supported radar vendors and other NVIDIA DRIVE ecosystem partners, and contact your NVIDIA representative for access to PVA SDK and PVA Solutions. Acknowledgements Thanks to Mark Vojkovich, Mehmet Umut Demircin, Michael Chen, Balaji Holur, Sean Pieper, Mladen Radovic, Nicolas Droux, Kalle Jokiniemi, Ximing Chen, Romain Ygnace, Sharon Heruti, Jagadeesh Sankaran, Zoran Nikolic, Ching Hung, Yan Yin, Qian Zhan, Dian Luo, Rengui Zhuo (ChengTech), Feng Deng (ChengTech), Mo Poorsartep, Cassie Dai, and Wonsik Han for their contributions. About the authors: Lachlan Dowling is the tech lead for the NVIDIA Programmable Vision Accelerator SDK; Neelakanth Shigihalli is an engineering manager leading the PVA SDK and PVA Solutions team; Shane Murray is the technical lead for radar perception; Borhan Fathi is a senior manager of technical product management leading sensors and platform strategy for Automotive and Robotics.
Update on the OpenAI Foundation openai 24.03.2026 09:00 0.652
Embedding sim.0.799
Entity overlap0.0345
Title sim.0.1294
Time proximity0.4462
NLP типother
NLP организацияOpenAI Foundation
NLP темаfoundation models
NLP страна

Открыть оригинал

The OpenAI Foundation announces plans to invest at least $1 billion in curing diseases, economic opportunity, AI resilience, and community programs.
The engineering best practices you can drop straight into Claude towards_ai 25.03.2026 13:36 0.651
Embedding sim.0.7705
Entity overlap0
Title sim.0.0891
Time proximity0.7782
NLP типother
NLP организацияTowards AI
NLP темаsoftware engineering
NLP страна

Открыть оригинал

We’ve spent years building LLM systems at Towards AI. The main goal has always been the same: share what we build and, more importantly, what we learn building it, so you can grow as an AI engineer without hitting every wall we did. Part of that is our courses. But the bigger part is making your actual building process easier, every day. So we took the markdown files we use internally (the ones you can feed directly into Claude, so it builds with the context that usually takes years to develop) and made them public. Access everything here: https://github.com/louisfb01/ai-engineering-cheatsheets It includes decision-ready references for the most common AI engineering problems: all the engineering best practices from our courses distilled into dense markdown files you can use mid-build or feed directly into Claude, so it works from decisions already tested on real systems. Open a cheatsheet, find your situation in the table, and follow the recommendation. What’s Inside These come directly from the Towards AI Academy courses, the same frameworks we teach in depth, distilled into references you can use today. No course required. No paywall. You can access everything here: https://github.com/louisfb01/ai-engineering-cheatsheets If you want to go deeper (full lessons, code, and hands-on projects), that’s what the Towards AI Academy is for.
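As an illustration of the "feed the markdown directly into Claude" workflow the post describes, here is a minimal sketch using the Anthropic Python SDK. The cheatsheet filename, the model name, and the user question are placeholders (the post doesn't prescribe a specific integration); passing the file as a system prompt is just one straightforward way to give the model that standing context.

```python
# Minimal sketch (not from the Towards AI post): load one of the cheatsheet
# markdown files and pass it to Claude as system context via the Anthropic SDK.
# The filename and model name are placeholders; adjust to whatever you cloned.
from pathlib import Path
import anthropic

cheatsheet = Path("ai-engineering-cheatsheets/rag.md").read_text()  # hypothetical filename

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model name
    max_tokens=1024,
    system=cheatsheet,           # the cheatsheet becomes standing context
    messages=[{"role": "user",
               "content": "My RAG retrieval quality dropped after switching chunkers. What should I check first?"}],
)
print(reply.content[0].text)
```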
Datadog bets DIY AI will mean it dodges the SaaSpocalypse the_register_ai 24.03.2026 16:08 0.65
Embedding sim.0.7596
Entity overlap0
Title sim.0.1235
Time proximity0.8178
NLP типproduct_launch
NLP организацияDatadog
NLP темаfoundation models
NLP страна

Открыть оригинал

The theory is that its domain-specific model will beat generalist LLMs on results and economics. Simon Sharwood, Tue 24 Mar 2026 // 16:08 UTC. Datadog is close to releasing an updated AI model that it thinks will help it avoid the so-called SaaSpocalypse – customers using AI to build their own tools. The observability tools vendor already created a model called Toto-Open-Base that the company's explanatory paper says it built with 151 million parameters, trained on more than two trillion time-series data points – apparently the largest pretraining dataset for any open-weights time-series foundation model. All the data used to train the model came from Datadog itself, gathered in the course of operating its SaaSy observability services. In conversation with The Register, Datadog chief product officer Yanbing Li said the company is reviewing its next model but sees that effort as the means to an end. "What is the SaaS company's role?" she asked, before answering: "To innovate in their domain." For Datadog, that means creating a model specific to its domain – observability – rather than relying on a generic LLM. Li thinks developing models brings two things to Datadog. One is that AI becomes part of its platform, rather than requiring customers to set a token budget on another service. The other is better agents that detect and predict anomalies more effectively. She claimed Datadog's site reliability agent can already investigate incidents, provide root cause analysis, and suggest remediation actions. AI remains a flaky field and agents make mistakes. The Register therefore put it to Li that operators of mission-critical IT must be wary before letting agents suggest changes to their systems, let alone enact those changes without supervision. She agreed and said for AI systems to win trust, their output must be both explainable and verifiable. Using its own models makes that easier for Datadog, she said. They have also helped the company to create a tool that watches AI platforms while they work and can detect signs they are producing hallucinated output. "I do not worry about a race to develop models, but applying them," she said, adding that she thinks users will apply Datadog's models because they allow constant monitoring of health – a bit like wearable devices. "Today, when we see a doctor, it is an expensive hassle, so we only visit when we are ill," she said. Smartwatches packed full of sensors, plus AI to analyze those signals, mean it's now possible to detect and predict illness. Li thinks Datadog offers a similar change from occasional to constant diagnosis and can dodge the SaaSpocalypse. "What is vulnerable in this transition is point tools, when customers do not act in your tool," she said. "Those things are more easily disrupted." She reckons AI has seen Datadog transcend SaaS to become a platform. Every vendor aspires to that status because it makes it harder for customers to leave. Maybe AI can solve that one day.
FCC proposes making telecom firms bring call centers home the_register_ai 26.03.2026 22:11 0.649
Embedding sim.0.7362
Entity overlap0.2222
Title sim.0.0571
Time proximity0.9911
NLP типregulation
NLP организацияFederal Communications Commission
NLP темаai regulation
NLP странаUnited States

Открыть оригинал

AI companies lick their chops as FCC proposes forcing call center onshoring. You actually think companies are going to pay Americans to take customer service calls in the AI age? Brandon Vigliarolo, Thu 26 Mar 2026 // 22:11 UTC. Uncle Sam is trying to make American call centers great again. The question is whether they will be great because they're filled with local workers or whether this will provide yet another excuse for companies to turn customer service jobs over to AI. The Federal Communications Commission (FCC) voted unanimously (or as unanimously as a body missing two of its five members can decide) Thursday to proceed with drafting rules that would require companies under its purview to begin onshoring customer service call center operations - at least to a degree. "We propose to limit the percentage of customer service calls that providers may make from or answer at foreign call centers to a specified percentage," the Commission wrote in the draft [PDF] notice of proposed rulemaking (the voted-on version has not been published as of writing). "We believe that such a cap would encourage movement of call center operations back to the U.S." The FCC is hoping the public will weigh in on what that percentage should be, and whether it could be reduced over time to further force onshoring. The FCC justified the proposal by citing not only privacy and security concerns that have recently been raised surrounding overseas customer service call centers, but also by admitting what everyone already knows: Customer service in the industries it regulates plain sucks. "Communications providers regulated by the FCC," the Commish explained, are part of "an industry that consistently ranks amongst the lowest in customer satisfaction surveys." That's true, at least anecdotally: Do a quick online search for worst customer service, and ISPs, cellular carriers, cable companies, and others that fall under the FCC's purview are usually right up there at the top (or bottom). In addition to requiring a certain percentage of calls to be handled by US call centers, the FCC also proposed in the rule to require informing callers if the agent answering their call is located overseas, requiring transfer to a US-based agent upon request, and limiting transactions involving sensitive customer data to US call center agents only. Additionally, the measure takes steps to address call center spam by using financial tools (fees and bonds) to ensure they don't moonlight as scam outfits. It also proposes increased English proficiency requirements for foreign call center agents in instances where they're still used. All this, naturally, raises the question of whether any company operating an overseas call center would opt to pay an American agent American wages for the same role. The FCC understands that. "We recognize...that those changes could come with costs to communications service providers," the FCC noted in the proposal, citing the need to "strike a balance between achieving our goals while not imposing undue costs on these companies." Why deal with costs, though, when you could just automate your call center entirely? Rick Ruth, director of carrier relations and regulatory affairs at call center automation outfit CTM, seems to be ready and waiting for the added business courtesy of the FCC.
"Organizations may well expand the use of AI-driven classification, routing, and automation for initial customer interactions rather than absorb the cost of a fully domestic workforce," Ruth told The Register in an email. He explained that he believes the most likely scenario is for AI to manage triage and intake, with humans reserved for complicated or sensitive issues.  That said, there are plenty of reasons why automating call centers might not be the best idea. It hasn't worked well, for one, with around half the companies who try it giving up completely. In cases where AI agents are called upon as assistants for human employees, bots haven't fared much better, with many call center agents having trouble making their AI assistants useful. Either way, it'll take some time for this proposal to wind its way through the FCC's systems - first comes a comment period, then drafting actual rules based on that feedback. By then, who knows how good AI customer service agents will be. ® Share More about FCC Federal government of the United States Telecommunications More like these &times; More about FCC Federal government of the United States Telecommunications United States of America Narrower topics 5G Alabama AT&T British Telecom California Central Intelligence Agency Comcast Cybersecurity and Infrastructure Security Agency Cybersecurity Information Sharing Act DOGE EE Emergency Services Network Ericsson Federal Aviation Administration Five Eyes Foreign Intelligence Surveillance Act GPS Immigration and Nationality Act of 1965 IRS Mobile Network NASA National Broadband Network National Highway Traffic Safety Administration National Institute of Standards and Technology National Labor Relations Board NCSAM New Mexico New York NTT Orange Telecommunications Act of 1996 TETRA United States Armed Forces United States Department of Commerce United States Department of Defense United States Department of Justice US Securities and Exchange Commission US Treasury Verizon Virginia Vodafone Voice over IP Broader topics Government Sector More about Share 26 COMMENTS More about FCC Federal government of the United States Telecommunications More like these &times; More about FCC Federal government of the United States Telecommunications United States of America Narrower topics 5G Alabama AT&T British Telecom California Central Intelligence Agency Comcast Cybersecurity and Infrastructure Security Agency Cybersecurity Information Sharing Act DOGE EE Emergency Services Network Ericsson Federal Aviation Administration Five Eyes Foreign Intelligence Surveillance Act GPS Immigration and Nationality Act of 1965 IRS Mobile Network NASA National Broadband Network National Highway Traffic Safety Administration National Institute of Standards and Technology National Labor Relations Board NCSAM New Mexico New York NTT Orange Telecommunications Act of 1996 TETRA United States Armed Forces United States Department of Commerce United States Department of Defense United States Department of Justice US Securities and Exchange Commission US Treasury Verizon Virginia Vodafone Voice over IP Broader topics Government Sector TIP US OFF Send us news
GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation - Microsoft Research microsoft_research 26.03.2026 16:03 0.649
Embedding sim.0.7473
Entity overlap0.025
Title sim.0.1087
Time proximity0.9282
NLP типscientific_publication
NLP организацияKorea University
NLP темаrobotics
NLP странаSouth Korea

Открыть оригинал

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation. Published March 26, 2026. By Sehun Jung, PhD Student, Korea University; HyunJee Song, PhD Student, Korea University; Dong-Hee Kim, PhD Student, Korea University; Reuben Tan, Researcher; Jianfeng Gao, Technical Fellow & Corporate Vice President; Yong Jae Lee, Professor, University of Wisconsin-Madison; Donghyun Kim, Professor, Korea University. At a glance: VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations. GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios. Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly. Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations. Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations (Figure 1). Because planning and spatial reasoning are handled separately, errors in one stage can propagate to the next. This raises a key question: can a VLM determine both what to do and where to do it simultaneously? Figure 1. Failures in VLM-based task planners, where ambiguous language leads to non-executable actions. Planning with spatial grounding. To address this problem, we developed GroundedPlanBench. In our paper, "Spatially Grounded Long-Horizon Task Planning in the Wild," we describe how this new benchmark evaluates whether VLMs can plan actions and determine where those actions should occur across diverse real-world environments. We also built Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into training data to help VLMs learn this capability. Evaluating these with both open- and closed-source VLMs, we found that grounded planning for long, complex tasks is challenging. At the same time, V2GP improves both planning and grounding, with gains validated on our benchmark and in real-world experiments using robots. How GroundedPlanBench works. To create realistic robot scenarios, we built our benchmark from 308 robot manipulation scenes in the Distributed Robot Interaction Dataset (DROID), a large collection of recordings of robots performing tasks. We worked with experts to review each scene and define tasks that a robot could perform. Each task was written in two styles: explicit instructions that clearly describe the actions (e.g., "put a spoon on the white plate") and implicit instructions that describe the goal more generally (e.g., "tidy up the table"). For each task, the plan was broken down into four basic actions—grasp, place, open, and close—each tied to a specific location in the image.
Grasp, open, and close actions were linked to a box drawn around the target object, while place actions were linked to a box showing where the object should be placed. Figure 2 illustrates medium- and long-duration tasks, along with their explicit and implicit instructions. In total, GroundedPlanBench contains 1,009 tasks, ranging from 1–4 actions (345 tasks) to 5–8 (381) and 9–26 (283). Figure 2. Examples of tasks in GroundedPlanBench. How V2GP works. The V2GP framework first detects moments when the robot interacts with objects using the recorded gripper signals. It then generates a text description of the manipulated object with a multimodal language model. Guided by this description, the system tracks the object across the video using Meta's advanced open-vocabulary image and video segmentation model, SAM3. The system then constructs grounded plans from the tracking results, identifying the object's location at the moment it is grasped and where it is placed. This process is illustrated in Figure 3. It yielded 43K grounded plans with varying lengths: 34,646 plans with 1–4 actions, 4,368 with 5–8 actions, and 4,448 with 9–26 actions. Figure 3. The V2GP framework converts robot videos into spatially grounded plans. Evaluating decoupled versus grounded planning. To evaluate GroundedPlanBench in real-world robotic settings, we used Qwen3-VL as our base model. Qwen3-VL is a vision-language model that processes text, images, and video to support multimodal reasoning. It performs well on standard multimodal reasoning benchmarks without additional training. We first evaluated it, along with other proprietary models, on GroundedPlanBench without any task-specific training (Table 1). We then fine-tuned it on V2GP training data and compared it with a decoupled approach, in which planning and grounding are handled separately. In this setup, a VLM first generated a plan describing what the robot should do. We used GPT-5.2 or Qwen3-VL-4B for this step. The plan was then passed to a spatial grounding model, Embodied-R1, which converted the plans into executable signals. Embodied-R1 is a large vision-language model trained for embodied reasoning and pointing, where the model identifies specific locations in the image to guide the robot's actions. We selected it for spatial grounding because its training targets embodied spatial reasoning and point-based localization, making it well suited for grounding model outputs to specific locations in an image. Figure 4 highlights a key limitation of this approach: ambiguity in natural language. For example, Qwen3-VL-4B generated grasp actions by referring to "napkin on the table" for all four napkins in the scene, leading Embodied-R1 to ground each action to the same napkin. GPT-5.2 produced more descriptive phrases, such as "top-left napkin" or "upper-center napkin," but these were still too imprecise for the model to reliably distinguish between them and were again grounded to the same object. Figure 4. Decoupled vs. grounded planning, illustrating how ambiguous language causes actions to be grounded to the wrong objects. This limitation becomes more pronounced in real-world robot manipulation, where environments are often cluttered and complex. As a result, decoupled approaches struggle to work reliably. In contrast, our approach, grounded planning, performs planning and grounding jointly within a single model and improves both planning and grounding performance.
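As a purely illustrative aside (this is not the paper's code), the difference between a free-text plan and a grounded plan can be pictured as a small data structure in which every primitive action carries the image region it applies to:

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in image pixels

@dataclass
class GroundedAction:
    verb: str    # one of the four primitives: "grasp", "place", "open", "close"
    target: str  # short object description, e.g. "top-left napkin"
    box: Box     # where in the image the action applies

# A grounded plan leaves no room for the "which napkin?" ambiguity described above,
# because each step is tied to a concrete region rather than a phrase.
plan: List[GroundedAction] = [
    GroundedAction("grasp", "spoon", (412, 310, 468, 352)),
    GroundedAction("place", "white plate", (520, 280, 640, 390)),
]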
Table 1 presents evaluation results for open- and closed-source VLMs on GroundedPlanBench. Multi-step planning and handling of implicit instructions were challenging for all models, while training Qwen3-VL-4B and Qwen3-VL-32B with V2GP led to significant improvements in grounded planning. Table 1. Evaluation results on GroundedPlanBench. Task Success Rate (TSR) measures the percentage of tasks completed correctly, requiring all actions to be both correctly planned and spatially grounded. Action Recall Rate (ARR) measures the proportion of generated actions that match the sub-actions defined in the dataset, regardless of order. The V2GP approach improves performance on both metrics and achieves the best results (shown in bold). Implications and looking forward. Integrating planning and grounding within a single model offers a path to more reliable robot manipulation in real-world settings. Rather than relying on separate stages, this approach keeps decisions about what to do and where to act tightly coupled, but models still struggle with longer, multi-step tasks and implicit instructions. Models must reason over longer sequences of actions and maintain consistency across many steps and goals described indirectly, as in everyday language. Looking ahead, a promising direction combines grounded planning with world models, which enable robots to predict the outcomes of actions before executing them. Together, these capabilities could allow robots to decide what to do, where to act, and what will happen next, bringing us closer to systems that can plan and act reliably in the real world. Acknowledgements. This research was conducted in collaboration with Korea University, Microsoft Research, University of Wisconsin-Madison, and supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2025-25439490) funded by the Korea government (MSIT). Related publication: Spatially Grounded Long-Horizon Task Planning in the Wild.
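For reference, the two metrics in Table 1 can be written down in a few lines. The sketch below is one illustrative reading of the definitions above (ARR is computed here against the reference action set; the paper's exact matching rules are not reproduced):

def task_success_rate(tasks):
    # tasks: one list of booleans per task; each flag means the action was both
    # correctly planned and correctly grounded.
    return sum(1 for flags in tasks if flags and all(flags)) / len(tasks)

def action_recall_rate(predicted, reference):
    # predicted/reference: lists of (verb, target) pairs; order is ignored.
    remaining = list(reference)
    hits = 0
    for action in predicted:
        if action in remaining:
            hits += 1
            remaining.remove(action)
    return hits / len(reference) if reference else 0.0

print(task_success_rate([[True, True], [True, False], [True]]))  # 2/3
print(action_recall_rate([("grasp", "spoon")],
                         [("grasp", "spoon"), ("place", "plate")]))  # 0.5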
Snap Decisions: How Open Libraries for Accelerated Data Processing Boost A/B Testing for Snapchat nvidia_blog 17.03.2026 13:00 0.647
Embedding sim.0.7285
Entity overlap0.0444
Title sim.0.2248
Time proximity0.8989
NLP типother
NLP организацияSnap
NLP темаai infrastructure
NLP страна

Открыть оригинал

Snap Decisions: How Open Libraries for Accelerated Data Processing Boost A/B Testing for Snapchat. NVIDIA cuDF accelerates Apache Spark applications on Google Cloud, helping Snap engineers test and deploy new features faster while unlocking significant cost savings. March 17, 2026, by Sid Sharma. The features on social media apps like Snapchat evolve nearly as fast as what's trending. To keep pace, its parent company Snap has adopted open data processing libraries from NVIDIA on Google Cloud services to boost development. Every new feature rolled out to Snapchat's more than 940 million monthly active users goes through a set of controlled experiments before it's launched. During this A/B testing cycle, the development team studies different variables with a subset of users, measuring nearly 6,000 metrics that analyze engagement, app performance and monetization. Snap runs thousands of these experiments each month — processing over 10 petabytes of data within a three-hour window each morning using the Apache Spark distributed framework. By adopting Apache Spark accelerated by NVIDIA cuDF, the company is boosting these data processing workloads on NVIDIA GPUs to achieve 4x speedups in runtime with the same number of machines, providing a cost-effective path to scale. By pairing NVIDIA's GPU-optimized software, including NVIDIA CUDA-X libraries, with Google's infrastructure management services such as Google Kubernetes Engine, Snap is harnessing a full-stack platform for data processing at scale. "Experimentation is at the core of our company. Changing our data infrastructure from CPUs to GPUs allows us to efficiently scale this experimentation to more features, more metrics and more users over time," said Prudhvi Vatala, senior engineering manager at Snap. "The more experiments we're able to run, the more innovative experiences we can deliver for Snapchat users." A Sustainable Way to Scale. Snapchat fans frequently see new features in the app — from arrival notifications to AI-generated stickers — but Snap is also continuously rolling out behind-the-scenes updates such as performance optimizations and compatibility updates for new operating system versions. The A/B testing for all these new features now runs on cuDF, which allows developers to run existing Apache Spark applications on NVIDIA GPUs with no code changes for easy deployment. The open library for accelerated data processing builds on the power of the NVIDIA cuDF GPU DataFrame library while scaling it for the Apache Spark distributed computing framework. With this migration, the team has — based on Snap internal data collected between January 1 and February 28 — realized 76% daily cost savings using NVIDIA GPUs on Google Kubernetes Engine compared with CPU-only workflows. "We were projecting an ambitious roadmap to scale up experimentation that would have blown up our computing costs based on our existing infrastructure," Vatala said. "Switching to GPU-accelerated pipelines with cuDF gave us a way to flatten the scaling curve, and the results were tremendous." To support workload migration, the team also harnessed the cuDF suite of microservices that automatically qualify, test, configure and optimize Spark workloads for GPU acceleration at scale.
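The "no code changes" claim refers to configuration-level enablement: the DataFrame/SQL code stays the same and only the Spark session settings change. Below is a minimal PySpark sketch assuming the RAPIDS Accelerator plugin that ships with the cuDF-based Spark acceleration stack; the plugin class and config keys are the publicly documented ones, while the app name, paths, and resource values are illustrative, not Snap's actual settings.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ab-test-metrics")  # illustrative
    # Enable GPU acceleration purely through configuration:
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Resource hints; real values depend on cluster shape and GPU type.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Unmodified application code: aggregate an engagement metric per experiment arm.
events = spark.read.parquet("gs://example-bucket/ab_events/")  # hypothetical path
summary = (
    events.groupBy("experiment_id", "variant")
          .agg({"engagement_score": "avg", "user_id": "count"})
)
summary.show()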
Working with NVIDIA experts, the Snap team optimized its pipelines on Google Cloud's G2 virtual machines powered by NVIDIA L4 GPUs so they required just 2,100 GPUs running concurrently — as opposed to the initial projection that around 5,500 GPUs would need to run concurrently, according to data Snap collected between January 1 and March 13. "When I saw the results of the initial experiments, they were pretty crazy — we saw much higher cost savings than we had expected," said Joshua Sambasivam, a backend engineer on the A/B testing team. "The Spark accelerator is a perfect match for our workloads." Looking ahead, the Snap team plans to integrate the Spark accelerator beyond the A/B team to a broader range of production workloads. "We didn't realize we were sitting on this gold mine," Vatala said. "We've so far migrated our two biggest pipelines, but there's a lot of opportunity ahead." Learn more by tuning into Vatala's session at NVIDIA GTC, taking place Tuesday, March 17 at 1 p.m. PT. Read more about NVIDIA cuDF and get started with GPU acceleration for Apache Spark. Main image above courtesy of Snap, depicting A/B test of its Maps feature.
Lossy self-improvement interconnects 22.03.2026 19:39 0.646
Embedding sim.0.7522
Entity overlap0.0385
Title sim.0.0294
Time proximity0.9604
NLP типother
NLP организацияMicrosoft
NLP темаlarge language models
NLP страна

Открыть оригинал

Lossy self-improvement. Why self-improvement is real but it doesn't lead to fast takeoff. Nathan Lambert, Mar 22, 2026. Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what's happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs. AI research is becoming much easier in many ways. The technical problems that need to be solved to scale training large language models even further are formidable. Super-human coding assistants making these approachable is breaking a lot of former claims of what building these things entailed. Together this is setting us up for a year (or more) of rapid progress at the cutting edge of AI. We're also at a time where language models are already extremely good. They're in fact good enough for plenty of extremely valuable knowledge-work tasks. Language models taking another big step is hard to imagine — it's unclear which tasks they're going to master this year outside of code and CLI-based computer-use. There will be some new ones! These capabilities unlock new styles of working that'll send more ripples through the economy. These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. Early writing on the topic dates back to the 2000s, such as the blog post entirely on the topic from 2008: "Recursion is the sort of thing that happens when you hand the AI the object-level problem of 'redesign your own cognitive algorithms'." And slightly earlier, in 2007, Yudkowsky also defined the related idea of a Seed AI in Levels of Organization in General Intelligence: "A seed AI is an AI designed for self-understanding, self-modification, and recursive self-improvement. This has implications both for the functional architectures needed to achieve primitive intelligence, and for the later development of the AI if and when its holonic self-understanding begins to improve. Seed AI is not a workaround that avoids the challenge of general intelligence by bootstrapping from an unintelligent core; seed AI only begins to yield benefits once there is some degree of available intelligence to be utilized. The later consequences of seed AI (such as true recursive self-improvement) only show up after the AI has achieved significant holonic understanding and general intelligence." It's reasonable to think we're at the start here, with how general and useful today's models are. Generally, RSI can be summarized as: when AI can improve itself, the improved version can improve even more efficiently, creating a closed amplification loop that leads to an intelligence explosion, often referred to as the singularity. There are a few assumptions in this. For RSI to occur, it needs to be that: (1) the loop is closed — models can keep improving on themselves and beget more models; (2) the loop is self-amplifying — the next models will yield even bigger improvements than the current ones; and (3) the loop continues to run without losing efficiency — there are no added pieces of friction that knee-cap the exponential into an early sigmoid.
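A toy numeric sketch (an editorial illustration, not from the post) shows how much the third assumption carries: keep all three and capability compounds; let each cycle retain only part of its predecessor's amplification and the same loop flattens out.

def run_loop(generations=10, gain=0.5, retention=1.0):
    # gain: how much each generation improves the next; retention: how much of that
    # amplification survives friction (1.0 = the lossless RSI assumption).
    capability, improvement = 1.0, gain
    history = []
    for _ in range(generations):
        capability += improvement
        improvement *= (1 + gain) * retention
        history.append(round(capability, 2))
    return history

print(run_loop(retention=1.0))  # closed, lossless loop: runaway growth
print(run_loop(retention=0.6))  # lossy loop: gains taper toward a plateau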
While I agree that momentous, socially destabilizing changes are coming in the next few years from sustained AI improvements, I expect the trend line of progress to be more linear than exponential when we reflect back. Instead of recursive self-improvement, it will be lossy self-improvement (LSI) – the models become core to the development loop but friction breaks down all the core assumptions of RSI. The more compute and agents you throw at a problem, the more loss and repetition shows up. I'm still a believer that the complexity brake on advanced systems will be a strong counterbalance to the reality that AI models are getting substantially better at every narrow task we need to compose together in making a leading AI model. I quoted this previously in April of 2025 in response to AI 2027: "Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread 'general systems collapse'." There are plenty of examples in how models are already trained, the deep intuitions we need to get them right, and the organizations that build them that show where the losses will come from. Building leading language models is incredibly complex, and only becoming more so. There are a few core frictions in my mind. 1. Automatable research is too narrow. First, it is clear that language models this year will already be useful tools at optimizing localized tasks like lowering the test loss of a model. Andrej Karpathy recently launched his autoresearch that popularized doing just this. This allows AI agents to play directly on GPUs to target tasks like lowering the loss on the test set. This approach works in narrow domains, i.e. one general test loss or one overall reward. The problem is that there's a long-standing gap between an on-paper more accurate model and models that users find more productive. The most provocative case is for pretraining, which was discussed more at length around scaling laws. Scaling laws show us that the loss will continue going down, but we don't know if that'll be economically more valuable. In post-training, reinforcement learning algorithms are at least more directly tied to specific performance gains as most RL training environments can be used directly as an evaluation. Still, I worry about generalization and tying back to models that are better at the specific task of improving themselves. It's a big leap from models getting better at some things to that necessarily translating to models that are better at building themselves and designing experiments. We've seen many AI capabilities sort of saturate at certain levels of human taste, such as writing quality. AI research is a bit different here, as there is a very high ceiling to climb up to. Where models mostly saturate on writing because there's inherent tension in preferences, models will saturate on research because the search space and optimization target is too wide.
The early benchmarks for measuring this sort of ability all fall prey to the same problem – narrow scope. Agents will do well at optimizing single metrics, but the leap required to navigate many metrics at once is a very different skill set. That is actually what the best researchers do — they make many scalable ideas work together. The most related benchmark we have to measure this is PostTrainBench, which is quite fun, but progress will very rapidly get distorted on this. Over 90% of the challenge in doing post-training well is getting the last 1-3% of performance, especially without cooking the model in out-of-domain tasks. Post-training a general, leading model is extremely complex, and only getting more complex. I could go on and on about this. Another example is from during my Ph.D. (2017-2022), when there was immense hype around a field called "AutoML" which aimed to use techniques like Bayesian Optimization to find new architectures and parameters for models. The hype never translated into changing my job. Language models will do more than this, but not enough to take jobs away from top AI researchers any time soon. The core currency of researchers is still intuition and managing complexity, rather than specific optimization and implementation. 2. Diminishing returns of more AI agents in parallel. The biggest problem for rapid improvement in AI is that even though we'll have 10,000 remote workers in a datacenter, it'll be nearly impossible to channel all of them at one problem. Inherently, especially when the models are still so similar, they're sampling from the same distribution of solutions and capabilities while being bottlenecked by human supervision. Adding more agents will have a strict saturation in the amount of marginal performance that can be added – the intuition of the best few researchers (and time to run experiments) will be the final bottleneck. A common idea to illustrate this is Amdahl's law, which is taken from computer architecture and shows that a given task can only generate a fixed speedup proportional to how much can be parallelized and how many parallel workers exist (a short numeric sketch appears at the end of this piece). In AI this should be relatively easier to convey, as the low-level operating details of computers are fairly mysterious. Consider an AI researcher on the transition from writing code by hand to using AI autocomplete assistance to now using autonomous coding agents. These are all massive gains. Let us continue. Now this researcher uses 3-4 agents working on different sub-tasks or approaches to the problem at hand. This is still a large gain. Now consider this single researcher trying to organize 30-40 agents with tasks to do every day. Some people can get more value out of this scale, but not many. How many people do you think could come up with 300-400 tasks for AI agents every day? Not many. This problem will hit the AI models soon enough as well. 3. Resource bottlenecks and politics. Fundamentally, all the AI companies are walking a fine line of acquiring substantial capital, converting new compute resources to revenue via sufficient demand, and repeating the process all the while spending an extreme amount on research. With the scale of resources here, there will always be political bottlenecks on who gets resources and what gets bet on. In this layer, research leadership sits above the AIs and the researchers. Even as models continue to improve, this source of friction will never get removed.
It isn't a substantial friction, but the AI models are fundamentally operating in organizations where humans are the bottleneck on resources. The early scale of improvements with language models is local optimizations, where the resources used cost <$1M per day. With my other views on the frictions of AI, this is on its own a very minor impact on the rate of improvement, but for those with worries of fast take-off, RSI, and loss of control to AIs, it should be obvious that billions of dollars of compute resources for research are unlikely to be totally isolated for end-to-end experimentation of AI models. The conclusion here is that because we're at the early stages of using AI assistance, autonomously and at scale for AI development, we're collectively discovering the ways that AI can help us massively. We're all applying these tools to capture the low-hanging fruit we see and our jobs are literally changing to be higher paced and more productive. The problem is that all of these axes have clear human, political, or technical complexity bottlenecks. The bottom of every sigmoid feels like an exponential. We've ridden multiple exponentials in the era of language models: in 2023 we scaled to huge models and GPT-4 felt like magic, by 2025 we added inference-time scaling with o1 and reasoning models — they let us "solve" math and coding, and now we're going to take a big step by polishing the entire AI workflow (all the while scaling training compute massively). 2026 will feel like a huge step, but it doesn't have a fundamental change convincing me that progress will begin to take off. This could still cross the colloquial threshold for AGI, which is a drop-in replacement for most remote workers, which would be an incredible milestone. Much of the challenge in the debate over whether we hit AGI in the coming years is that AI models are jagged and smart in different ways than humans, so they won't look like drop-in replacements for remote workers, but in many cases just using AI will be far more effective than trying to work with a human. It's reshaping what jobs are. Let us consider the scenarios we're working through. Engineering is becoming automated today. Humans are way more productive, models can scale through complex infrastructure deployments much faster, run with higher GPU utilization, etc. Infrastructure gains become fixed improvements in the rate and scale of experimentation, the fundamental units of progress in AI. Basic AI model research and optimization will be automated. The AI models are expanding in scope – they transition from writing kernels to deciding on architectures. This is moving from improving the experimentation toolkit to running minor experiments themselves. Configs, hyperparameters, etc. become the domain of the AI assistants. These are both real. The problem is that a third era doesn't have a simple scale to jump to. Where the AI models can create knowledge by synthesis and execution, the next jump requires harnessing thousands of agents or having models make more novel discoveries – like unlocking the next paradigm after inference-time scaling. The improvements downstream of AI are going to make the industry supercharged at hill climbing, but I worry that this won't bring paradigm shifts that are needed for new categories of AI – continual learning, world models, whatever your drug of choice is. All together, the models are becoming core to the development loop and that's worth being excited (and worried) about. The models are performing self-improvement.
They're not transforming the approach. We are scaling up the compute we spend on our own research practices and tools. There are diminishing returns. Agents are going to start being autonomous entities we work with. They feel like a cross between a genius and a 5 year old. We will be in this era of lossy self-improvement (LSI) for a few years, but it is not enough for a fast takeoff.
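The Amdahl's-law point from section 2 can be made numeric in a few lines (an editorial sketch, not from the post): if only part of the research loop parallelizes across agents, the serial remainder — taste, supervision, waiting on experiments — caps the total speedup no matter how many agents are added.

def amdahl_speedup(parallel_fraction, workers):
    # Classic Amdahl's law: the serial share of the work bounds the overall speedup.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# Assume 80% of the loop parallelizes; the serial 20% caps the speedup at 5x.
for n in (1, 4, 40, 400, 4000):
    print(n, round(amdahl_speedup(0.8, n), 2))  # 1.0, 2.5, 4.55, 4.95, 5.0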
How we monitor internal coding agents for misalignment openai 19.03.2026 10:00 0.646
Embedding sim.0.7482
Entity overlap0.0909
Title sim.0.0833
Time proximity0.875
NLP типother
NLP организацияOpenAI
NLP темаai safety
NLP страна

Открыть оригинал

How OpenAI uses chain-of-thought monitoring to study misalignment in internal coding agents—analyzing real-world deployments to detect risks and strengthen AI safety safeguards.
Salesforce announces an AI-heavy makeover for Slack, with 30 new features | TechCrunch techcrunch 31.03.2026 22:46 0.645
Embedding sim.0.7537
Entity overlap0.0208
Title sim.0.1092
Time proximity0.8168
NLP типproduct_launch
NLP организацияsalesforce
NLP темаenterprise ai
NLP странаunited states

Открыть оригинал

Salesforce, the cloud software giant, has been remaking its business around AI, and at a small gathering in San Francisco on Tuesday, CEO Marc Benioff and his team unveiled the latest results of those efforts: an updated version of Slack, with a plethora of new AI features. The most significant of these is a serious glow-up for its AI agent, Slackbot. The 30 new features, which will be available in the coming months, follow a January update that gave Slackbot agentic capabilities — including the ability to draft emails, schedule meetings, and sift through your inbox for specific information. Perhaps the most notable feature announced Tuesday is what the company calls reusable AI-skills — which allow users to define specific tasks for Slackbot that, once created, can be applied in a variety of different scenarios and contexts. Slackbot comes with a built-in library of AI-skills, Salesforce says, but users can also create their own custom versions. Once these skills are set up, they significantly reduce the work an employee might need to do. For example, a user can trigger a skill using a simple command in Slack — say, "create a budget" for an upcoming event — prompting Slackbot to pull together all relevant information from a company's Slack channels, as well as any connected apps or data sources, to create an actionable plan. The bot will then automatically set up a meeting to discuss the plan, inviting relevant employees based on their titles. Slackbot now also functions as an MCP (Model Context Protocol) client — meaning it can connect to and coordinate with outside services and tools. Among those is Agentforce, Salesforce's AI agent development platform launched in 2024. Through that connection, it can "route work or prompt questions to Agentforce or any agent or app in your enterprise," the company says, with the agent finding the most relevant and efficient path for the information, without human intervention. According to Rob Seaman, Slack's interim CEO and former chief product officer, Slackbot can also now transcribe meetings and summarize them. If a meeting participant happens to zone out, thus missing critical details, they can just ask Slackbot to produce a recap of the meeting, including any action items assigned to them. The agent can also now operate outside of Slack and monitor your desktop activities — Salesforce lists "your deals, your conversations, your calendar, and your habits" as the kinds of data it draws on. Based on that context, the bot will make actionable suggestions or draft follow-ups for critical tasks. Seaman has said that privacy protections are built into this design and that users have the ability to adjust permissions as needed.
In short: Salesforce is clearly trying to take Slack beyond its roots as an enterprise communication tool and position it as a more versatile platform that can handle a wider variety of business tasks. The hope seems to be that, by flooding it with AI, Slack can become an indispensable part of enterprise users' core business processes. Benioff let his team walk through the major features on Tuesday but remarked, during his keynote, that the five years since Salesforce acquired Slack had been an "incredible journey," one that had delivered "two and a half times revenue growth." He added: "We have about a million businesses running on Slack. It's been a huge growth story."
There are more AI health tools than ever—but how well do they work? mit_tech_review 30.03.2026 16:00 0.642
Embedding sim.0.736
Entity overlap0.1071
Title sim.0.0526
Time proximity0.979
NLP типproduct_launch
NLP организацияMicrosoft
NLP темаhealthcare ai
NLP страна

Открыть оригинал

Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic’s Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend. There’s a clear demand for chatbots that provide health advice, given how hard it is for many people to access it through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released. In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren’t made available for external expert review. And even if the companies are doing quality, rigorous research—which some, including OpenAI, do seem to be—they might still have blind spots that the broader research community could help to fill. “To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works,” says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out.” “But,” he adds, “the evidence base really needs to be there.” Tipping points To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company’s health team was formed, and why Copilot Health now exists. “We’ve seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses,” he says. But that’s only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report , and an accompanying blog post , detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app. Other AI companies have noticed, and responded to, this trend. “Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions,” says Karan Singhal, who leads OpenAI’s Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI’s models.) It’s possible that people simply prefer posing their health problems to a nonjudgmental bot that’s available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. “There is a reason that these tools exist and they have a position in the overall landscape,” says Girish Nadkarni, chief AI officer​ at the Mount Sinai Health System. 
“That’s because access to health care is hard, and it’s particularly hard for certain populations.” The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with more mild concerns might feel comfortable managing their symptoms at home with the chatbot’s advice rather than unnecessarily busying emergency rooms and doctor’s offices. But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that its methodology might not provide a complete picture of ChatGPT Health’s capabilities, the study has surfaced concerns about how little external evaluation these tools see before being released to the public. Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan. The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon’s Health AI include similar warnings. But those warnings are easy to ignore. “We all know that people are going to use it for diagnosis and management,” says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google. Medical testing Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations—though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model’s HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect. But evaluations like HealthBench have limitations. In a study published last month, Bean—the Oxford doctoral candidate—and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario—or their real-life experience—are important to include in their prompt, or they might misinterpret the information that an LLM gives them. Bean says that this performance gap could be significant for OpenAI’s models. 
In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that's the case, then users who don't have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice. Singhal, the OpenAI health lead, notes that the company's current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version. Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean's own study used GPT-4o, which came out almost a year ago and is now outdated. Earlier this month, Google released a study that meets Bean's standards. In the study, patients discussed medical concerns with the company's Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE's diagnoses were just as accurate as physicians', and none of the conversations raised major safety concerns for researchers. Despite the encouraging results, Google isn't planning to release AMIE anytime soon. "While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing," wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment. Rodman, who led the AMIE study with Karthikesalingam, doesn't think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. "There's lots of reasons that the clinical trial paradigm doesn't always work in generative AI," he says. "And that's where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?" The key there is "third party." No matter how extensively companies evaluate their own products, it's tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots. OpenAI's Singhal says he's strongly in favor of external evaluation. "We try our best to support the community," he says.
“Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluations suites—such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score. Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who’s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time, and money. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.” No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes—and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren’t too grave. With the current state of the evidence, however, it’s impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.
Training Driving AI at 50,000× Real Time ieee_spectrum_ai 25.03.2026 19:00 0.639
Embedding sim.0.7823
Entity overlap0
Title sim.0.0779
Time proximity0.5357
NLP типother
NLP организацияGeneral Motors
NLP темаautonomous driving
NLP странаUnited States

Открыть оригинал

This is a sponsored article brought to you by General Motors. Visit their new Engineering Blog for more insights. Autonomous driving is one of the most demanding problems in physical AI. An automated system must interpret a chaotic, ever-changing world in real time—navigating uncertainty, predicting human behavior, and operating safely across an immense range of environments and edge cases. At General Motors, we approach this problem from a simple premise: while most moments on the road are predictable, the rare, ambiguous, and unexpected events — the long tail — are what ultimately define whether an autonomous system is safe, reliable, and ready for deployment at scale. (Note: While here we discuss research and emerging technologies to solve the long tail required for full general autonomy, we also discuss our current approach for solving 99% of everyday autonomous driving in a deep dive on Compound AI.) As GM advances toward eyes-off highway driving, and ultimately toward fully autonomous vehicles, solving the long tail becomes the central engineering challenge. It requires developing systems that can be counted on to behave sensibly in the most unexpected conditions. GM is building scalable driving AI to meet that challenge — combining large-scale simulation, reinforcement learning, and foundation-model-based reasoning to train autonomous systems at a scale and speed that would be impossible in the real world alone. Stress-testing for the long tail. Long-tail scenarios of autonomous driving come in a few varieties. Some are notable for their rareness. There's a mattress on the road. A fire hydrant bursts. A massive power outage in San Francisco that disabled traffic lights required driverless vehicles to navigate never-before experienced challenges. These rare system-level interactions, especially in dense urban environments, show how unexpected edge cases can cascade at scale. But long-tail challenges don't just come in the form of once-in-a-lifetime rarities. They also manifest as everyday scenarios that require characteristically human courtesy or common sense. How do you queue up for a spot without blocking traffic in a crowded parking lot? Or navigate a construction zone, guided by gesturing workers and ad-hoc signs? These are simple challenges for a human driver but require inventive engineering to handle flawlessly with a machine. Figure: Autonomous driving scenario demand curve. Deploying vision language models. One tool GM is developing to tackle these nuanced scenarios is the use of Vision Language Action (VLA) models. Starting with a standard Vision Language Model, which leverages internet-scale knowledge to make sense of images, GM engineers use specialized decoding heads to fine-tune for distinct driving-related tasks. The resulting VLA can make sense of vehicle trajectories and detect 3D objects on top of its general image-recognition capabilities. These tuned models enable a vehicle to recognize that a police officer's hand gesture overrides a red traffic light or to identify what a "loading zone" at a busy airport terminal might look like. These models can also generate reasoning traces that help engineers and safety operators understand why a maneuver occurred — an important tool for debugging, validation, and trust. Testing hazardous scenarios in high-fidelity simulations. The trouble is: driving requires split-second reaction times, so any excess latency poses an especially critical problem.
To solve this, GM is developing a “Dual Frequency VLA.” This large-scale model runs at a lower frequency to make high-level semantic decisions (“Is that object in the road a branch or a cinder block?”), while a smaller, highly efficient model handles the immediate, high-frequency spatial control (steering and braking). This hybrid approach allows the vehicle to benefit from deep semantic reasoning without sacrificing the split-second reaction times required for safe driving.

But dealing with an edge case safely requires that the model not only understand what it is looking at but also understand how to sensibly drive through the challenge it has identified. For that, there is no substitute for experience. Which is why, each day, we run millions of high-fidelity closed-loop simulations, equivalent to tens of thousands of human driving days, compressed into hours of simulation. We can replay actual events, modify real-world data to create new virtual scenarios, or design new ones entirely from scratch. This allows us to regularly test the system against hazardous scenarios that would be nearly impossible to encounter safely in the real world.

Synthetic data for the hardest cases

Where do these simulated scenarios come from? GM engineers employ a whole host of AI technologies to produce novel training data that can model extreme situations while remaining grounded in reality. GM’s “Seed-to-Seed Translation” research, for instance, leverages diffusion models to transform existing real-world data, allowing a researcher to turn a clear-day recording into a rainy or foggy night while perfectly preserving the scene’s geometry. The result? A “domain change”—clear becomes rainy, but everything else remains the same. In addition, our GM World diffusion-based simulator allows us to synthesize entirely new traffic scenarios using natural language and spatial bounding boxes. We can summon entirely new scenarios with different weather patterns. We can also take an existing road scene and add challenging new elements, such as a vehicle cutting into our path.

High-fidelity simulation isn’t always the best tool for every learning task. Photorealistic rendering is essential for training perception systems to recognize objects in varied conditions. But when the goal is teaching decision-making and tactical planning—when to merge, or how to navigate an intersection—the computationally expensive details matter less than spatial relationships and traffic dynamics. AI systems may need billions or even trillions of lightweight examples to support reinforcement learning, where models learn the rules of sensible driving through rapid trial and error rather than relying on imitation alone. To this end, General Motors has developed a proprietary, multi-agent reinforcement learning simulator, GM Gym, to serve as a closed-loop simulation environment that can both simulate high-fidelity sensor data and model thousands of drivers per second in an abstract environment known as “Boxworld.” By focusing on essentials like spatial positioning, velocity, and rules of the road while stripping away details like puddles and potholes, Boxworld creates a high-speed training environment for reinforcement learning, operating 50,000 times faster than real time and simulating 1,000 km of driving per second of GPU time. It’s a method that allows us not just to imitate humans, but to develop driving models that have verifiable objective outcomes, like safety and progress.
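As a rough illustration of what "verifiable objective outcomes" such as safety and progress can look like in an abstract, Boxworld-style simulator, here is a minimal, hypothetical reward sketch; the field names, weights, and structure are our own illustration, not GM's actual objective:

from dataclasses import dataclass

@dataclass
class StepOutcome:
    progress_m: float        # meters advanced along the route this step
    collided: bool           # any contact with another agent or object
    rule_violation: bool     # e.g., ran a red light or crossed a solid line
    jerk: float              # comfort proxy (m/s^3)

def reward(o: StepOutcome) -> float:
    r = 0.01 * o.progress_m            # reward forward progress
    r -= 10.0 if o.collided else 0.0   # heavily penalize safety failures
    r -= 1.0 if o.rule_violation else 0.0
    r -= 0.05 * abs(o.jerk)            # discourage uncomfortable maneuvers
    return r

# One simulated step: steady progress, no incidents, mild jerk.
print(reward(StepOutcome(progress_m=12.0, collided=False,
                         rule_violation=False, jerk=0.4)))

A reward of this form gives the reinforcement learning policy objective outcomes that can be measured and verified per step, rather than only imitating recorded human behavior.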
From abstract policy to real-world driving

Of course, the route from your home to your office does not run through Boxworld. It passes through a world of asphalt, shadows, and weather. So, to bring that conceptual expertise into the real world, GM is one of the first to employ a technique called “On-Policy Distillation,” where engineers run their simulator in both modes simultaneously: the abstract, high-speed Boxworld and the high-fidelity sensor mode. Here, the reinforcement learning model—which has practiced countless abstract miles to develop a perfect “policy,” or driving strategy—acts as a teacher. It guides its “student,” the model that will eventually live in the car. This transfer of wisdom is incredibly efficient; just 30 minutes of distillation can capture the equivalent of 12 hours of raw reinforcement learning, allowing the real-world model to rapidly inherit the safety instincts its cousin painstakingly honed in simulation.

Designing failures before they happen

Simulation isn’t just about training the model to drive well, though; it’s also about trying to make it fail. To rigorously stress-test the system, GM utilizes a differentiable pipeline called SHIFT3D. Instead of just recreating the world, SHIFT3D actively modifies it to create “adversarial” objects designed to trick the perception system. The pipeline takes a standard object, like a sedan, and subtly morphs its shape and pose until it becomes a challenging, fun-house version that is harder for the AI to detect. Optimizing these failure modes is what allows engineers to preemptively discover safety risks before they ever appear on the road. Iteratively retraining the model on these generated “hard” objects has been shown to reduce near-miss collisions by over 30%, closing the safety gap on edge cases that might otherwise be missed.

Even with advanced simulation and adversarial testing, a truly robust system must know its own limits. To enable safety in the face of the unknown, GM researchers add a specialized “epistemic uncertainty head” to their models. This architectural addition allows the AI to distinguish between standard noise and genuine confusion. When the model encounters a scenario it doesn’t understand—a true “long tail” event—it signals high epistemic uncertainty. This acts as a principled proxy for data mining, automatically flagging the most confusing and high-value examples for engineers to analyze and add to the training set.

This rigorous, multi-faceted approach—from “Boxworld” strategy to adversarial stress-testing—is General Motors’ proposed framework for solving the final 1% of autonomy. And while it serves as the foundation for future development, it also surfaces new research challenges that engineers must address. How do we balance the essentially unlimited data from reinforcement learning with the finite but richer data we get from real-world driving? How close can we get to full, human-like driving by writing down a reward function? Can we go beyond domain change to generate completely new scenarios with novel objects?

Solving the long tail at scale

Working toward solving the long tail of autonomy is not about a single model or technique. It requires an ecosystem — one that combines high-fidelity simulation with abstract learning environments, reinforcement learning with imitation, and semantic reasoning with split-second control. This approach does more than improve performance on average cases.
It is designed to surface the rare, ambiguous, and difficult scenarios that determine whether autonomy is truly ready to operate without human supervision. There are still open research questions. How human-like can a driving policy become when optimized through reward functions? How do we best combine unlimited simulated experience with the richer priors embedded in real human driving? And how far can generative world models take us in creating meaningful, safety-critical edge cases? Answering these questions is central to the future of autonomous driving. At GM, we are building the tools, infrastructure, and research culture needed to address them — not at small scale, but at the scale required for real vehicles, real customers, and real roads.
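As a sketch of the epistemic-uncertainty idea described above, the snippet below shows a hypothetical planning network with an extra uncertainty head whose output is used to flag confusing scenes for review. The architecture, dimensions, and flagging threshold are our own illustration, not GM's implementation:

import torch
import torch.nn as nn

class PolicyWithUncertainty(nn.Module):
    """Planning backbone plus an extra head that scores epistemic uncertainty."""
    def __init__(self, feat_dim: int = 256, action_dim: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.action_head = nn.Linear(128, action_dim)    # e.g., steering, acceleration
        self.uncertainty_head = nn.Linear(128, 1)        # higher = more confused

    def forward(self, scene_features: torch.Tensor):
        h = self.backbone(scene_features)
        return self.action_head(h), self.uncertainty_head(h).squeeze(-1)

model = PolicyWithUncertainty()
features = torch.randn(32, 256)                 # a batch of encoded scenes
actions, uncertainty = model(features)

# Flag unusually uncertain scenes as candidates for the training set.
hard_examples = uncertainty > uncertainty.mean() + uncertainty.std()
print(f"Flagged {int(hard_examples.sum())} of {len(features)} scenes for review")

The flagged examples stand in for the "data mining" step: the most confusing, high-value scenes are routed to engineers rather than silently averaged away.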
These AI Workstations Look Like PCs but Pack a Stronger Punch ieee_spectrum_ai 24.03.2026 14:00 0.639
Embedding sim.0.7353
Entity overlap0.0244
Title sim.0.1064
Time proximity0.9187
NLP типproduct_launch
NLP организацияTenstorrent
NLP темаai hardware
NLP страна

Открыть оригинал

The rise of generative AI has spurred demand for AI workstations that can run or train models on local hardware. Yet modern PCs have proven inadequate for this task. A typical laptop has only enough memory to load a large language model (LLM) with 8 billion to 13 billion parameters—much smaller, and much less intelligent, than frontier models that are presumed to have over a trillion parameters. Even the most capable workstation PCs struggle to serve LLMs with more than 70 billion parameters.

Tenstorrent’s QuietBox 2 is an attempt to fill that gap. Though it looks like a PC workstation, the QuietBox 2 contains four of the company’s custom Blackhole AI accelerators, 128 gigabytes of GDDR6 memory—specialized memory used in GPUs—and 256 GB of DDR5 system memory (for a total of 384 GB). This configuration provides enough memory to load OpenAI’s GPT-OSS-120B and can run midsize models like Meta’s Llama 3.1 70B at speeds of nearly 500 tokens per second. For reference, that’s several times quicker than an average response from OpenAI’s GPT-5.2 or Anthropic’s Claude 4.6. The QuietBox 2 carries an expected retail price of US $9,999 and is slated to launch in the second quarter of 2026.

“The 128 gigabytes of GDDR that we have with our AI accelerators really defines how big of a model you can run at a reasonable speed,” says Milos Trajkovic, cofounder and systems engineer at Tenstorrent. “Our 128 gigabytes of GDDR6 RAM would require four Nvidia RTX 5090 graphics cards. That couldn’t fit in today’s 1,600-watt form factor, and the cost for four RTX 5090 GPUs is huge.”

An AI workstation built for the home office

Wattage, it turns out, is critical. Nvidia recommends a system power of 1,000 W for a single RTX 5090, so even a dual-GPU setup exceeds the continuous power draw for a typical 15-ampere, 120-volt power circuit. A system with four RTX 5090s could require 4,000 W or more at load. The QuietBox 2, on the other hand, draws only 1,400 W at full load. It won’t trip the breaker, so it can be used anywhere a typical desktop PC might be plugged in, including a home office.

That’s not the only way the QuietBox 2 poses as a run-of-the-mill PC. The machine’s custom case is built to support the micro-ATX motherboard form factor, and the motherboard itself is an AMD chipset hosting an AMD CPU. The hardware is kept cool by closed-loop liquid cooling similar to that used by PC workstations and gaming computers. It even has customizable RGB LED lighting and a large semitransparent window that shows off the hardware.

“A lot of even our internal developers have requested a QuietBox because they’re just so easy to deploy,” says Chris Goulet, a thermal-mechanical engineer and team lead at Tenstorrent. “You just ship them the unit, they slap it on their desk, power it up, and they’re going.”

Where the QuietBox 2 differs from desktop PCs, though, is its AI accelerators. It’s equipped with four of Tenstorrent’s Blackhole application-specific ICs, RISC-V chips designed specifically for AI workloads. Blackhole is packaged on an add-in card; each card has 120 Tensix AI accelerators and 32 GB of GDDR6 memory, for a total of 480 Tensix AI accelerators and 128 GB of GDDR6. Blackhole also has a large amount of on-chip SRAM at 180 megabytes per accelerator.

Two visions of desktop AI

Tenstorrent is not alone in its approach. Nvidia’s DGX Spark, released last year, packed Nvidia’s GB10 chip into a machine the size of a lunch box.
Orders for the Spark’s big brother, the DGX Station with Nvidia’s GB300, were opened on 16 March 2026. The DGX Station looks like a desktop PC workstation, and variants will be built by well-known PC brands like Asus and Dell. Nvidia’s offering has more memory than QuietBox 2, at up to 748 GB, but system power is quoted at 1,600 W—rather close to the maximum a 15-A, 120-V breaker will handle. This reflects differing visions for how their machines will be used. And, of course, the Nvidia DGX Station’s extra memory doesn’t come cheap. While most DGX Station system builders have not yet announced pricing, one retailer has listed a DGX Station from PC maker MSI for $85,000.

When I spoke to Allyn Bourgoyne, director of product marketing at Nvidia, after the announcement of DGX Spark and Station in 2025, he said the company expects most DGX owners will use the devices as remotely accessed workstations. “A common thing you might see is that I’ve got my Windows laptop, and I’m going to use my DGX Spark over the network. I’m going to send jobs over to it.” He added that companies could deploy DGX Spark and Station systems to serve multiple people at once.

The Tenstorrent QuietBox 2 can be used in this way, but the company also wants to target a good experience for people going one-on-one with the computer. “You don’t have to remote SSH into the box. You connect your monitor through HDMI, and it’s just like your PC at home. It has the Ubuntu desktop and utilities,” says Trajkovic. Nvidia’s DGX systems also run a variant of Ubuntu (DGX OS) and include a desktop environment, but the devil is in the details. DGX systems use Nvidia CPUs based on ARM architectures and custom chipsets. The QuietBox 2 uses an AMD x86 CPU and compatible chipset, and is configured more like a traditional PC.

That should be a boon for the QuietBox 2’s software compatibility. Tenstorrent leans into that with a focus on open source software. The QuietBox 2’s entire software stack, from TT-Forge (the company’s AI compiler) to TT-Metalium (a low-level software development kit that provides kernel-level hardware control), is open source and available on GitHub. Tenstorrent has also published the instruction set architecture for its Tensix cores, so developers can see exactly how their workloads execute on the hardware. Nvidia, by contrast, is focused on its proprietary CUDA ecosystem, and DGX OS is not open source. “A lot of our software stack is completely open, and we felt that from a hardware perspective, we kind of wanted to take a similar path,” says Goulet.
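To make the memory figures discussed above concrete, here is a rough back-of-the-envelope sketch of how much memory LLM weights alone require at different precisions; it ignores KV cache, activations, and runtime overhead, and the numbers are illustrative rather than vendor specifications:

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for name, params in [("8B", 8), ("70B", 70), ("120B", 120)]:
    for label, bpp in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
        print(f"{name:>4} @ {label:<5}: ~{weight_memory_gb(params, bpp):6.0f} GB")

# A 70B model needs ~140 GB in FP16 but ~35 GB at 4-bit precision, and a
# 120B model ~60 GB at 4-bit, which is why a box with 128 GB of accelerator
# memory plus 256 GB of system RAM can hold such models locally.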
Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog nvidia_dev_blog 23.03.2026 07:01 0.638
Embedding sim.0.7829
Entity overlap0.0435
Title sim.0.1793
Time proximity0.3392
NLP типother
NLP организацияnvidia
NLP темаai infrastructure
NLP страна

Открыть оригинал

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages have fundamentally different compute profiles, yet traditional deployments force them onto the same hardware, leaving GPUs underutilized and scaling inflexible. Disaggregated serving addresses this by splitting the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service that can be resourced and scaled on its own terms. This post will give an overview of how disaggregated inference gets deployed on Kubernetes, explore different ecosystem solutions and how they execute on a cluster, and evaluate what they provide out of the box.

How do aggregated and disaggregated inference differ?

Before diving into Kubernetes manifests, it helps to understand the two inference deployment modes for LLMs: In aggregated serving, a single process (or tightly coupled group of processes) handles the entire inference lifecycle from input to output. Disaggregated serving splits the pipeline into distinct stages such as prefill, decode, and routing, each running as independent services (see Figure 1, below).

Figure 1. Comparing aggregated and disaggregated serving

Aggregated inference

In a traditional aggregated setup, a single model server (or coordinated group of servers in a parallel configuration) handles the full request lifecycle. A user prompt comes in, the server tokenizes it, runs prefill to build context, generates output tokens autoregressively (decode), and returns the response. Everything happens in a single process or tightly coupled pod group. This is conceptually simple and works well for many use cases. But it means your hardware alternates between two fundamentally different workloads: Prefill is compute-intensive and benefits from high floating point operations (FLOPS), while decode is memory-bandwidth-bound and benefits from large, fast memory.

Disaggregated inference

Disaggregated architectures separate these stages into distinct services:

Prefill workers process the input prompt. This is compute-heavy. You want to maximize your GPUs for high throughput and can parallelize aggressively.

Decode workers generate output tokens one at a time. This is memory-bandwidth-bound because of the autoregressive nature of LLMs. You want GPUs with fast high bandwidth memory (HBM) access.

Router/gateway directs incoming requests, manages Key-Value (KV) cache routing between prefill and decode stages, and handles load balancing of requests across your workers.

Why disaggregate?

Three reasons stand out:

Different resource and optimization profiles per stage: With disaggregation, you can match GPU resources, model sharding techniques, and batch sizes to each stage’s needs rather than compromise on a single approach.

Independent scaling: Prefill and decode traffic patterns differ. A long-context prompt creates a large prefill burst but a steady decode stream. Scaling each stage independently lets you respond to actual demand.

Better GPU utilization: Separating stages lets each saturate its target resource (compute for prefill, memory bandwidth for decode) rather than alternating between both.

Frameworks like NVIDIA Dynamo and llm-d implement this pattern. The question becomes: How do you orchestrate it on Kubernetes?
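Before turning to orchestration, a toy back-of-the-envelope calculation makes the prefill/decode asymmetry above concrete. The numbers are purely illustrative (not from this post) and use the common approximations that a forward pass costs roughly 2 FLOPs per parameter per token and that, at small batch sizes, decode re-reads the weights for every generated token:

# Toy estimate of why prefill stresses compute while decode stresses
# memory bandwidth (illustrative figures only).
params = 70e9            # model parameters (hypothetical 70B model)
bytes_per_param = 2      # FP16 weights
prompt_tokens = 4096     # prefill length
output_tokens = 512      # decode length

# Prefill: ~2 * params FLOPs per token, all prompt tokens processed in parallel.
prefill_flops = 2 * params * prompt_tokens

# Decode: each output token re-reads the weights once at small batch sizes.
decode_bytes = params * bytes_per_param * output_tokens

print(f"Prefill compute : ~{prefill_flops / 1e15:.1f} PFLOPs")
print(f"Decode traffic  : ~{decode_bytes / 1e12:.1f} TB of weight reads")

The two stages saturate different hardware resources, which is exactly why serving them from the same pods leaves GPUs underutilized.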
Why scheduling is the key to multi-pod inference performance on Kubernetes

Deploying a multi-pod inference workload (either model-parallel aggregated models or disaggregated models) is only half the story. How the scheduler places pods across the cluster directly impacts performance; placing a Tensor Parallel (TP) group’s pods on the same rack with high-bandwidth NVIDIA NVLink interconnects can mean the difference between fast inference and a network bottleneck. Three scheduling capabilities matter the most here:

Gang scheduling ensures all pods in a group are placed with all-or-nothing semantics, preventing partial deployments that waste GPUs.

Hierarchical gang scheduling extends basic gang scheduling to multi-level workloads. In disaggregated inference, you need nested minimum guarantees per component or role: each Tensor Parallel group (e.g., four pods forming one decode instance) must be scheduled atomically, and the full system (at least n prefill instances + at least m decode instances + router) also needs system-level coordination. Without this, one role can consume all available GPUs while the other waits indefinitely—a partial deployment that holds resources but can’t serve requests.

Topology-aware placement colocates tightly coupled pods on nodes with high-bandwidth interconnects, minimizing inter-node communication latency.

These three capabilities determine how an AI scheduler, such as KAI Scheduler, places pods based on the application’s scheduling constraints. It is also important for the AI orchestration layer to determine what needs to be gang-scheduled, and when. For example, when prefill scales independently, something needs to decide that the new pods form a gang with a minimum availability guarantee, without disrupting existing decode pods. As a result, the orchestration layer and the scheduler need to work closely for the entire application lifecycle, handling multi-level auto-scaling, rolling updates, and more, to ensure optimal runtime conditions for AI workloads.

This is where higher-level workload abstractions come in. APIs like LeaderWorkerSet (LWS) and NVIDIA Grove allow users to declaratively express the structure of their inference application: which roles exist, how they relate to each other, how they should scale, and what topology constraints matter. The API’s operator translates that application-level intent into concrete scheduling constraints (including PodGroups, gang requirements, topology hints) that determine what gangs to create and when. KAI Scheduler then plays the critical role of satisfying those constraints, solving the how: gang scheduling, hierarchical gang scheduling, and topology-aware placement. In this post, we use KAI as the scheduler, though there are other schedulers in the community that support subsets of these features. Readers can explore the broader scheduling landscape through the Cloud Native Computing Foundation (CNCF) ecosystem.

Deploying disaggregated inference

Disaggregated architectures have multiple roles, each with different resource profiles and scaling needs. Since each role in a disaggregated pipeline is a distinct workload, a natural approach with LWS is to create a separate resource for each role.
Prefill workers (four replicas, 2-degree Tensor Parallelism):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill-workers
spec:
  replicas: 4
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: prefill-leader
      spec:
        containers:
        - name: prefill
          image: <model-server-image>
          args: ["--role=prefill", "--tensor-parallel-size=2"]
          resources:
            limits:
              nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
        - name: prefill
          image: <model-server-image>
          args: ["--role=prefill"]
          resources:
            limits:
              nvidia.com/gpu: "1"

Decode workers (two replicas, 4-degree Tensor Parallelism):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: decode-workers
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: decode-leader
      spec:
        containers:
        - name: decode
          image: <model-server-image>
          args: ["--role=decode", "--tensor-parallel-size=4"]
          resources:
            limits:
              nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
        - name: decode
          image: <model-server-image>
          args: ["--role=decode"]
          resources:
            limits:
              nvidia.com/gpu: "1"

Router (a standard deployment—no leader-worker topology needed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
spec:
  replicas: 2
  selector:
    matchLabels:
      app: router
  template:
    metadata:
      labels:
        app: router
    spec:
      containers:
      - name: router
        image: <router-image>
        env:
        - name: PREFILL_ENDPOINT
          value: "prefill-workers"
        - name: DECODE_ENDPOINT
          value: "decode-workers"

Each role is managed as its own resource. You can scale prefill and decode independently, and update them on different schedules. It’s important to note that the scheduler treats prefill-workers and decode-workers as independent workloads. The scheduler will place them successfully, but it has no knowledge that they form a single inference pipeline. In practice, this means a few things:

Topology coordination between prefill and decode (placing them on the same rack for fast KV cache transfer) requires manually adding pod affinity rules that reference labels across the two LWS resources.

Scaling one role doesn’t automatically account for the other: If a burst of long-context requests requires more prefill capacity, you scale prefill-workers, but the new prefill pods aren’t guaranteed to land near existing decode pods unless you’ve configured affinity yourself.

Rolling out a new model version means coordinating updates across three independent resources—LWS’s partition update mechanism supports staged rollouts per-resource, but synchronizing across resources is managed externally.

That last point is worth calling out. Inference frameworks move fast and don’t always guarantee backwards compatibility between versions, so prefill pods on the old version and decode pods on the new version may not be able to communicate. Models also take time to load, and prefill and decode workers frequently become ready at different rates. During an unsynchronized rollout, this can create a temporary imbalance, where many new decode pods are ready but very few new prefill pods are (or vice versa). This can create a bottleneck in your inference pipeline until everything catches up. These patterns work. The coordination just happens outside of Kubernetes primitives: in the inference framework’s routing layer, in custom autoscalers, bespoke operators, or even manually.
Another option would be using Grove’s API, which takes a different approach by moving that coordination into the Kubernetes resource itself. It expresses all roles in a single PodCliqueSet:

apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: inference-disaggregated
spec:
  replicas: 1
  template:
    cliqueStartupType: CliqueStartupTypeExplicit
    terminationDelay: 30s
    cliques:
    - name: router
      spec:
        roleName: router
        replicas: 2
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: router
            image: <router-image>
            resources:
              requests:
                cpu: 100m
    - name: prefill
      spec:
        roleName: prefill
        replicas: 4
        startsAfter: [router]
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: prefill
            image: <model-server-image>
            args: ["--role=prefill", "--tensor-parallel-size=2"]
            resources:
              limits:
                nvidia.com/gpu: "1"
        autoScalingConfig:
          maxReplicas: 8
          metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 70
    - name: decode
      spec:
        roleName: decode
        replicas: 2
        startsAfter: [router]
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: decode
            image: <model-server-image>
            args: ["--role=decode", "--tensor-parallel-size=4"]
            resources:
              limits:
                nvidia.com/gpu: "1"
        autoScalingConfig:
          maxReplicas: 6
          metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 80
    topologyConstraint:
      packDomain: rack

The Grove operator manages PodCliques for each role and coordinates scheduling, startup, and lifecycle across all of them. A few things to note in the YAML:

startsAfter: [router] on prefill and decode tells the operator to gate their startup until the router is ready. This is expressed declaratively and enforced through init containers. When you first deploy, router pods start and become ready first, then prefill and decode pods start in parallel (since both depend on the router).

autoScalingConfig on each clique lets you define per-role scaling policies. The operator creates a horizontal pod autoscaler (HPA) for each, so prefill and decode scale independently based on their own metrics.

topologyConstraint with packDomain: rack tells the KAI Scheduler to pack all cliques within the same rack, optimizing KV cache transfer between prefill and decode stages over high-bandwidth interconnects.

After applying this manifest, you can inspect all the resources Grove creates:

$ kubectl get pcs,pclq,pg,pod
NAME                                            AGE
podcliqueset.grove.io/inference-disaggregated   45s

NAME                                                    AGE
podclique.grove.io/inference-disaggregated-0-router     44s
podclique.grove.io/inference-disaggregated-0-prefill    44s
podclique.grove.io/inference-disaggregated-0-decode     44s

NAME                                                    AGE
podgang.scheduler.grove.io/inference-disaggregated-0    44s

NAME                                          READY   STATUS    AGE
pod/inference-disaggregated-0-router-k8x2m    1/1     Running   44s
pod/inference-disaggregated-0-router-w9f4n    1/1     Running   44s
pod/inference-disaggregated-0-prefill-abc12   1/1     Running   44s
pod/inference-disaggregated-0-prefill-def34   1/1     Running   44s
pod/inference-disaggregated-0-prefill-ghi56   1/1     Running   44s
pod/inference-disaggregated-0-prefill-jkl78   1/1     Running   44s
pod/inference-disaggregated-0-decode-mn90p    1/1     Running   44s
pod/inference-disaggregated-0-decode-qr12s    1/1     Running   44s

One PodCliqueSet, three PodCliques (one per role), one PodGang for coordinated scheduling, and pods matching each role’s replica count. The startsAfter dependency is enforced through init containers: Prefill and decode pods wait for the router to become ready before their main containers start.
Scaling disaggregated workloads

Once a disaggregated workload is running, scaling becomes the central operational challenge. Prefill and decode have different bottlenecks; teams might want to autoscale prefill workers based on time to first token (TTFT) and decode workers based on inter-token latency (ITL) independently, to meet service level agreements (SLAs) while minimizing GPU costs. In practice, disaggregated scaling operates at three levels:

Per-role scaling: adding or removing pods within a single role (e.g., scaling prefill from 4 to 6 replicas).

Per-TP-group scaling: scaling complete Tensor Parallel groups as atomic units, since you can’t add half a TP group.

Cross-role coordination: when you add prefill capacity, you may also need to scale the router to handle increased throughput, or scale decode to consume the extra prefill output.

Different tools address different levels.

How inference frameworks coordinate scaling

Inference frameworks address scaling at the application level with custom autoscalers that have visibility into inference-specific metrics. llm-d’s workload variant autoscaler (WVA) monitors per-pod KV cache utilization and queue depth via Prometheus, using a spare-capacity model to determine when replicas should be added or removed. Rather than scaling deployments directly, WVA emits target replica counts as Prometheus metrics that standard HPA/Kubernetes-based event-driven autoscaling (KEDA) act on—keeping the scaling actuation within Kubernetes-native primitives. The NVIDIA Dynamo planner takes a different approach: It natively understands disaggregated serving, running separate prefill and decode scaling loops that target TTFT and ITL SLAs respectively. It predicts upcoming demand using time-series models, computes replica requirements from profiled per-GPU throughput curves, and enforces a global GPU budget across both roles. This global visibility matters because in practice there’s an optimal ratio between prefill and decode that shifts with request patterns. Scale prefill 3x without scaling decode and the extra output has nowhere to go—decode bottlenecks and KV cache transfer queues up. Application-level autoscalers handle this because they can see the full pipeline; Kubernetes-native HPA targeting individual resources doesn’t inherently maintain cross-resource ratios.

Scaling with separate LWS resources

With one LWS per role, you scale each independently:

kubectl scale lws prefill-workers --replicas=6
kubectl scale lws decode-workers --replicas=3

Standard HPA can target each LWS separately, or an external autoscaler (like the Dynamo planner or llm-d’s autoscaler) makes coordinated decisions and updates both. The coordination logic lives in the autoscaler, not in the Kubernetes resources themselves.

Scaling with Grove

Grove brings per-role scaling into a single resource.
Each PodClique has its own replica count and optional autoScalingConfig, so HPAs can manage roles independently based on per-role metrics:

kubectl scale pclq inference-disaggregated-0-prefill --replicas=6

The operator creates additional prefill pods while leaving the router and decode untouched:

NAME                                                    AGE
podclique.grove.io/inference-disaggregated-0-router     5m
podclique.grove.io/inference-disaggregated-0-prefill    5m
podclique.grove.io/inference-disaggregated-0-decode     5m

NAME                                          READY   STATUS    AGE
pod/inference-disaggregated-0-router-k8x2m    1/1     Running   5m
pod/inference-disaggregated-0-router-w9f4n    1/1     Running   5m
pod/inference-disaggregated-0-prefill-abc12   1/1     Running   5m
pod/inference-disaggregated-0-prefill-def34   1/1     Running   5m
pod/inference-disaggregated-0-prefill-ghi56   1/1     Running   5m
pod/inference-disaggregated-0-prefill-jkl78   1/1     Running   5m
pod/inference-disaggregated-0-prefill-tu34v   1/1     Running   12s   # new
pod/inference-disaggregated-0-prefill-wx56y   1/1     Running   12s   # new
pod/inference-disaggregated-0-decode-mn90p    1/1     Running   5m
pod/inference-disaggregated-0-decode-qr12s    1/1     Running   5m

Six prefill pods, two router pods, two decode pods—only prefill changed. For roles that use multi-node Tensor Parallelism internally, PodCliqueScalingGroup ensures multiple PodCliques scale together as a unit while preserving the replica ratio between them. For example, in a configuration where each prefill instance consists of one leader pod and four worker pods:

podCliqueScalingGroups:
- name: prefill
  cliqueNames: [pleader, pworker]
  replicas: 2
  minAvailable: 1
  scaleConfig:
    maxReplicas: 4

With replicas: 2, this creates two complete prefill instances: 2 x (1 leader + 4 workers) = 10 pods total. The minAvailable: 1 guarantee means the system won’t scale below one complete Tensor Parallel group. Scaling the group from two to three replicas adds a third complete instance while preserving the 1:4 leader-to-worker ratio:

$ kubectl scale pcsg inference-disaggregated-0-prefill --replicas=3

Both the leader and worker cliques scaled together as a unit; the new replica (prefill-2) has one pleader pod and four pworker pods, matching the ratio. A new PodGang was created for the third replica to ensure it gets gang-scheduled.

NAME                                                               AGE
podcliquescalinggroup.grove.io/inference-disaggregated-0-prefill   10m

NAME                                                             AGE
podclique.grove.io/inference-disaggregated-0-prefill-0-pleader   10m
podclique.grove.io/inference-disaggregated-0-prefill-0-pworker   10m
podclique.grove.io/inference-disaggregated-0-prefill-1-pleader   10m
podclique.grove.io/inference-disaggregated-0-prefill-1-pworker   10m
podclique.grove.io/inference-disaggregated-0-prefill-2-pleader   8s    # new
podclique.grove.io/inference-disaggregated-0-prefill-2-pworker   8s    # new

NAME                                                             AGE
podgang.scheduler.grove.io/inference-disaggregated-0             10m
podgang.scheduler.grove.io/inference-disaggregated-0-prefill-0   10m
podgang.scheduler.grove.io/inference-disaggregated-0-prefill-1   8s    # new

Getting started

Whether you’re running a single disaggregated pipeline or operating dozens across your cluster, the building blocks for this are emerging and the community is building them in the open. Each approach in this blog represents a different point on the spectrum between simplicity and integrated coordination. The right choice depends on your workload, your team’s operational model, and how much lifecycle management you want the platform to handle versus the application layer. Check out these resources for more information.
NVIDIA Grove
KAI Scheduler
NVIDIA Dynamo

Join us at KubeCon EU

If you’re attending KubeCon EU 2026 in Amsterdam, drop by at booth No. 241 and join the session where we will cover an end-to-end open source AI inference stack. Explore the Grove Deployment Guide and ask questions on GitHub or Discord. We’d love to hear how you’re thinking about disaggregated inference on Kubernetes.

About the Authors

Anish Maddipoti is a product manager at NVIDIA. He currently works on building AI/ML frameworks, such as NVIDIA Dynamo and NVIDIA Grove. Previously, he was a founding team member of Brev.dev (acquired by NVIDIA) and co-founder of Agora Labs. He studied in the Plan II program at the University of Texas at Austin.

Sanjay Chatterjee is an engineering manager at NVIDIA. He works on GPU compute infrastructure with a focus on GPU scheduling to enable AI and HPC workloads to scale on Kubernetes. He is the creator and architect of the open source NVIDIA Grove project. Previously he worked on multiple DoE/DARPA-funded advanced technology projects towards designing the first exascale systems. His interests include novel programming models, parallel languages, and runtime systems.

Rohan Varma is an AI dev tech engineer at NVIDIA. He focuses on optimizing NVIDIA inference solutions including Dynamo, Grove, and TensorRT-LLM. He has a master’s degree in Computer Science from the University of Michigan, Ann Arbor. He enjoys racing games, piano, and most racket sports.

Ekin Karabulut is a data scientist and developer advocate previously at Run:ai, now at NVIDIA, exploring the efficient usage of large models in different production scenarios. Previously she worked on privacy implications of federated learning, focused on distributed training techniques and got fascinated by inefficiencies in GPU usage in research and industry settings. She established the AI Infrastructure Club and is based in Munich, Germany.
The Bay Area’s animal welfare movement wants to recruit AI mit_tech_review 23.03.2026 09:00 0.637
Embedding sim.0.7555
Entity overlap0.087
Title sim.0.1528
Time proximity0.5982
NLP типother
NLP организацияSentient Futures
NLP темаai ethics
NLP странаUnited States

Открыть оригинал

In early February, animal welfare advocates and AI researchers gathered in stocking feet at Mox, a scrappy, shoes-free coworking space in San Francisco. Yellow and red canopies billowed overhead, Persian rugs blanketed the floor, and mosaic lamps glowed beside potted plants. In the common area, a wildlife advocate spoke passionately to a crowd lounging in beanbags about a form of rodent birth control that could manage rat populations without poison. In the “Crustacean Room,” a dozen people sat in a circle, debating whether the sentience of insects could tell us anything about the inner lives of chatbots. In front of the “Bovine Room” stood a bookshelf stacked with copies of Eliezer Yudkowsky’s If Anyone Builds It, Everyone Dies , a manifesto arguing that AI could wipe out humanity . The event was hosted by Sentient Futures, an organization that believes the future of animal welfare will depend on AI. Like many Bay Area denizens, the attendees were decidedly “AGI-pilled”—they believe that artificial general intelligence, powerful AI that can compete with humans on most cognitive tasks, is on the horizon. If that’s true, they reason, then AI will likely prove key to solving society’s thorniest problems—including animal suffering. To be clear, experts still fiercely debate whether today’s AI systems will ever achieve human- or superhuman-level intelligence, and it’s not clear what will happen if they do. But some conference attendees envision a possible future in which it is AI systems, and not humans, who call the shots. Eventually, they think, the welfare of animals could hinge on whether we’ve trained AI systems to value animal lives. “AI is going to be very transformative, and it’s going to pretty much flip the game board,” said Constance Li, founder of Sentient Futures. “If you think that AI will make the majority of decisions, then it matters how they value animals and other sentient beings”—those that can feel and, therefore, suffer. Like Li, many summit attendees have been committed to animal welfare since long before AI came into the picture. But they’re not the types to donate a hundred bucks to an animal shelter. Instead of focusing on local actions, they prioritize larger-scale solutions, such as reducing factory farming by promoting cultivated meat , which is grown in a lab from animal cells. The Bay Area animal welfare movement is closely linked to effective altruism , a philanthropic movement committed to maximizing the amount of good one does in the world—indeed, many conference attendees work for organizations funded by effective altruists. That philosophy might sound great on paper, but “maximizing good” is a tricky puzzle that might not admit a clear solution. The movement has been criticized for some of the solutions that its leaders have advocated for in the past, such as working in exploitative industries to maximize charitable donations, and, more recently, prioritizing issues that could cause suffering for a large number of people who haven’t been born yet over present-day harms. Critics also argue that effective altruists neglect the importance of systemic issues such as racism and economic exploitation and overlook the insights that marginalized communities might have into the best ways to improve their own lives . When it comes to animal welfare, this exactingly utilitarian approach can lead to some strange conclusions. 
For example, some effective altruists say it makes sense to commit significant resources to improving the welfare of insects and shrimp because they exist in such staggering numbers, even though they may not have much individual capacity for suffering. Now the movement is sorting out how AI fits in. At the summit, Jasmine Brazilek, cofounder of a nonprofit called Compassion in Machine Learning, opened her sticker-stamped laptop to pull up a benchmark she devised to measure how LLMs reason about animal welfare. A cloud security engineer turned animal advocate, she’d flown in from La Paz, Mexico, where she runs her nonprofit with a handful of volunteers and a shoestring budget. Brazilek urged the AI researchers in the room to train their models with synthetic documents that reflect concern for animal welfare. “Hopefully, future superintelligent systems consider nonhuman interest, and there is a world where AI amplifies the best of human values and not the worst,” she said. The power of the purse The technologically inclined side of the animal welfare movement has faced some major setbacks in recent years. Dreams of transitioning people away from a diet dependent on factory farming have been dampened by developments such as the decimation of the plant-based-meat company Beyond Meat’s stock price and the passage of laws banning cultivated meat in several US states. AI has injected a shot of optimism. Like much of Silicon Valley, many attendees at the summit subscribe to the idea that AI might dramatically increase their productivity—though their goal is not to maximize their seed round but, rather, to prevent as much animal suffering as possible. Some brainstormed how to use Claude Code and custom agents to handle the coding and administrative tasks in their advocacy work. Others pitched the idea of developing new, cheaper methods for cultivating meat using scientific AI tools such as AlphaFold, which aids in molecular biology research by predicting the three-dimensional structures of proteins. But the real talk of the event was a flood of funding that advocates expect will soon be committed to animal welfare charities—not by individual megadonors, but by AI lab employees. Much of the funding for the farm animal welfare movement, which includes nonprofits advocating for improved conditions on farms, promoting veganism, and endorsing cultivated meat, comes from people in the tech industry, says Lewis Bollard, the managing director of the farm animal welfare fund at Coefficient Giving, a philanthropic funder that used to be called Open Philanthropy. Coefficient Giving is backed by Facebook cofounder Dustin Moskovitz and his wife, Cari Tuna, who are among a handful of Silicon Valley billionaires who embrace effective altruism. “This has just been an area that was completely neglected by traditional philanthropies,” such as the Gates Foundation and the Ford Foundation, Bollard says. “It’s primarily been people in tech who have been open to [it].” The next generation of big donors, Bollard expects, will be AI researchers—particularly those who work at Anthropic, the AI lab behind the chatbot Claude. Anthropic’s founding team also has connections to the effective altruism movement , and the company has a generous donation matching program. In February, Anthropic’s valuation reached $380 billion and it gave employees the option to cash in on their equity , so some of that money could soon be flowing into charitable coffers. 
The prospect of new funding sustained a constant buzz of conversation at the summit. Animal welfare advocates huddled in the “Arthropod Room” and scrawled big dollar figures and catchy acronyms for projects on a whiteboard. One person pitched a $100 million animal super PAC that would place staffers with Congress members and lobby for animal welfare legislation. Some wanted to start a media company that creates AI-generated content on TikTok promoting veganism. Others spoke about placing animal advocates inside AI labs. “The amount of new funding does give us more confidence to be bolder about things,” said Aaron Boddy, cofounder of the Shrimp Welfare Project, an organization that aims to reduce the suffering of farmed shrimp through humane slaughter, among other initiatives. The question of AI welfare But animal welfare was only half the focus of the Sentient Futures summit. Some attendees probed far headier territory. They took seriously the controversial idea that AI systems might one day develop the capacity to feel and therefore suffer, and they worry that this future AI suffering, if ignored, could constitute a moral catastrophe. AI suffering is a tricky research problem, not least because scientists don’t yet have a solid grip on why humans and other animals are sentient. But at the summit, a niche cadre of philosophers, largely funded by the effective altruism movement, and a handful of freewheeling academics grappled with the question. Some presented their research on using LLMs to evaluate whether other LLMs might be sentient. On Debate Night, attendees argued about whether we should ironically call sentient AI systems “clankers,” a derogatory term for robots from the film Star Wars , asking if the robot slur could shape how we treat a new kind of mind. “It doesn’t matter if it’s a cow or a pig or an AI, as long as they have the capacity to feel happiness or suffering,” says Li. In some ways, bringing AI sentience into an animal welfare conference isn’t as strange a move as it might seem. Researchers who work on machine sentience often draw on theories and approaches pioneered in the study of animal sentience, and if you accept that invertebrates likely feel pain and believe that AI systems might soon achieve superhuman intelligence, entertaining the possibility that those systems might also suffer may not be much of a leap. “Animal welfare advocates are used to going against the grain,” says Derek Shiller, an AI consciousness researcher at the think tank Rethink Priorities, who was once a web developer at the animal advocacy nonprofit Humane League. “They’re more open to being concerned about AI welfare, even though other people think it’s silly.” But outside the niche Bay Area circle, caring about the possibility of AI sentience is a harder sell. Li says she faced pushback from other animal welfare advocates when, inspired by a conference on AI sentience she attended in 2023, she rebranded her farm animal welfare advocacy organization as Sentient Futures last year. “Many people were extremely confident that AIs would never become sentient and [argued that] by investing any energy or money into AI welfare, we’re just burning money and throwing it away,” she says. Matt Dominguez, executive director of Compassion in World Farming, echoed the concern. “I would hate to see people pulling money out of farm animal welfare or animal welfare and moving it into something that is hypothetical at this particular moment,” he says. 
Still, Dominguez, who started partnering with the Shrimp Welfare Project after learning about invertebrate suffering, believes compassion is expansive. “When we get someone to care about one of those things, it creates capacity for their circle of compassion to grow to include others,” he says. Michelle Kim is on a Tarbell Fellowship, which is funded in part by Coefficient Giving. Correction 3/25: This article has been updated to better characterize positions for which the effective altruism community has been criticized.
Как AI-фильтр удалил мой блог навсегда — что это говорит о будущем модерации habr_ai 31.03.2026 07:01 0.637
Embedding sim.0.7789
Entity overlap0.4
Title sim.0.0226
Time proximity0.3739
NLP типother
NLP организация
NLP темаcontent moderation
NLP страна

Открыть оригинал

An AI filter deleted my blog and permanently blocked my account, with no explanation... I break down how automated moderation works, why it makes mistakes, and who is ultimately accountable for decisions like these. Read more
Overcoming LLM hallucinations in regulated industries: Artificial Genius’s deterministic models on Amazon Nova aws_ml_blog 23.03.2026 16:34 0.633
Embedding sim.0.7194
Entity overlap0.0612
Title sim.0.1
Time proximity0.9759
NLP типproduct_launch
NLP организацияArtificial Genius
NLP темаlarge language models
NLP страна

Открыть оригинал

This post is cowritten by Paul Burchard and Igor Halperin from Artificial Genius.

The proliferation of large language models (LLMs) presents a significant paradox for highly regulated industries like financial services and healthcare. The ability of these models to process complex, unstructured information offers transformative potential for analytics, compliance, and risk management. However, their inherent probabilistic nature leads to hallucinations: plausible but factually incorrect information. In sectors governed by stringent requirements for auditability and accuracy, the non-deterministic behavior of standard generative AI is a barrier to adoption in mission-critical systems. For a bank or a hospital, determinism isn’t only a goal; the outcomes must be accurate, relevant, and reproducible.

In this post, we’re excited to showcase how AWS ISV Partner Artificial Genius is using Amazon SageMaker AI and Amazon Nova to solve this challenge. By introducing a third generation of language models, they’re delivering a solution that is probabilistic on input but deterministic on output, helping to enable safe, enterprise-grade adoption. To understand the solution, let’s look at how AI has evolved:

First generation (1950s): Researchers used symbolic logic to build deterministic, rule-based models. While safe, these models lacked fluency and could not scale.

Second generation (1980s–present): The shift to probabilistic models (culminating in the Transformer architecture) unlocked incredible fluency. However, because these models predict the next token based on probability, they suffer from unbounded failure modes (hallucinations) that are difficult to engineer away.

Third generation (the Artificial Genius approach): Rather than a new generation that replaces the old, we’re moving from the rigidity of symbolic logic and the unpredictability of probabilistic models toward a hybrid architecture. This approach uses the generative power of Amazon Nova to understand context but applies a deterministic layer to verify and produce output. It’s the convergence of fluency and factuality.

The solution: A paradoxical approach to generation

It’s mathematically difficult to prevent standard generative models from hallucinating because the extrapolative, generative process itself causes errors. Artificial Genius addresses this by using the model strictly non-generatively. In this paradigm, the vast probability information learned by the model is used only interpolatively on the input. This allows the model to comprehend the innumerable ways a piece of information or a question can be expressed without relying on probability to generate the answer.

To create this third-generation capability, Artificial Genius uses SageMaker AI to perform a specific form of instruction tuning on Amazon Nova base models. This patented method effectively removes the output probabilities. While standard solutions attempt to ensure determinism by lowering the temperature to zero (which often fails to address the core hallucination issue), Artificial Genius post-trains the model to tilt the log-probabilities of next-token predictions toward absolute ones or zeros. This fine-tuning forces the model to follow a single system instruction: don’t make up answers that don’t exist. This creates a mathematical loophole where the model retains its genius-level understanding of data but operates with the safety profile required for finance and healthcare.
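To make the idea concrete, here is a minimal, hypothetical sketch of assembling the kind of non-generative supervised fine-tuning records described above, using the message format shown later in this post. The exact record schema depends on the fine-tuning recipe chosen, and the system-prompt wording, toy document, and file name are our own illustration, not Artificial Genius's data:

import json

# Anti-hallucination system instruction (illustrative wording).
SYSTEM = ("If the Question cannot be answered from the Document, "
          "then answer \"Unknown\" instead.")

def record(document: str, question: str, answer: str) -> dict:
    """One SFT record: a user turn with document + question, and the target answer."""
    return {
        "system": [{"text": SYSTEM}],
        "messages": [
            {"role": "user",
             "content": [{"text": f"Document: {document} Question: {question} Answer:"}]},
            {"role": "assistant",
             "content": [{"text": answer}]},
        ],
    }

doc = "Our revenue grew by 15% year-over-year, driven by robust sales."
examples = [
    record(doc, "What was the annual revenue growth?", "15%"),      # answerable
    record(doc, "What was the CEO's bonus this year?", "Unknown"),  # unanswerable
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

Mixing answerable and deliberately unanswerable questions in this way is what trains the model to refuse rather than extrapolate.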
Going beyond RAG

Retrieval Augmented Generation (RAG) is frequently cited as the solution to accuracy, but it remains a generative process and creates fixed vector embeddings that might not be relevant to subsequent queries. The third-generation approach improves upon RAG by effectively embedding the input text and the user query into a unified embedding. This helps ensure that the data processing is inherently relevant to the specific question asked, delivering higher fidelity and relevance than standard vector retrieval methods.

Delivering value using agentic workflows

To help enterprises maximize the value of their unstructured data, Artificial Genius packages this model into an industry-standard agentic client-server platform, available through AWS Marketplace. Unlike second-generation agents, which risk compounding errors when strung together in workflows, the inherent reliability of this third-generation model allows for complex, high-fidelity automation. The prompts used to create these workflows follow the structure of a product requirements document (PRD). Through this structure, domain experts—who might not be AI engineers—can formulate queries in natural language while maintaining strict control over the output. The product additionally offers free-form prompting of the workflow specification. For this purpose, the Amazon Nova Premier model, which is especially capable of translating free-form prompts into PRD format, is used. Although Nova Premier is a generative model, which requires a human-in-the-loop to check its output, this is the only human checkpoint in the agentic workflow.

Defining the non-generative query

The core mathematical loophole employed here is using a generative model strictly non-generatively. This means the model doesn’t use probabilities to guess the next token of an answer, but rather extracts or verifies information based solely on the input context. While short answers (such as dates or names) are obviously non-generative, it’s also possible to output long sequences deterministically. For example, asking for a direct quote from a document to justify a previous answer is a non-generative task. The following are examples of how Artificial Genius structures these interactions (the system prompt containing anti-hallucination instructions isn’t shown in these JSON turns):

Answerable, non-generative short answer:

[
  {
    "role": "user",
    "content": [{"text": "Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year... Question: What was the annual revenue growth? Answer:"}]
  },
  {
    "role": "assistant",
    "content": [{"text": "15%"}]
  }
]

Answerable, non-generative, long-answer, follow-up question:

[
  {
    "role": "user",
    "content": [{"text": "Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: Provide a quote from the document showing that the annual revenue growth was 15%. Answer:"}]
  },
  {
    "role": "assistant",
    "content": [{"text": "\"Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment.\""}]
  }
]

// Example of an unanswerable, short-answer question

[
  {
    "role": "user",
    "content": [{"text": "Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: What was the CEO's bonus this year? Answer:"}]
  },
  {
    "role": "assistant",
    "content": [{"text": "Unknown"}]
  }
]

These are only illustrative examples. The third-generation language model products will be delivered with recipes to assist with understanding how to construct non-generative queries to meet all practical natural language processing needs. With this understanding, let’s explore the technical implementation of building a non-generative fine-tuning pipeline using Amazon Nova on SageMaker AI.

AWS Reference Architecture

The architecture shown in the preceding diagram uses a streamlined approach to customizing foundation models. It uses SageMaker Training jobs for model training and Amazon Bedrock for deployment.

Data storage: Training data (synthetic Q&A) is stored in Amazon Simple Storage Service (Amazon S3).

Training: SageMaker Training jobs provision compute resources to fine-tune the Nova base model using the instruction tuning with supervised fine-tuning (SFT) method.

Deployment: The fine-tuned model is imported into Amazon Bedrock using the create custom model feature.

Inference: Applications interact with the model through Amazon Bedrock endpoints using the on-demand inference feature of Amazon Bedrock for custom models, helping to ensure a secure, scalable loop.

This design separates development concerns from production inference while maintaining clear data lineage—essential for audit trails in financial services.

Technical implementation: A step-by-step guide for non-generative fine-tuning

As indicated previously, the construction of a third-generation language model involves the following steps:

It starts with a second-generation foundation model. The first task is to select a good base model. As you will see, the Amazon Nova family includes ideal candidates to serve as this base.

The base model must be post-trained to follow a single system instruction: Do not make up answers. Of course, many people have tried this before, but now we understand from mathematics that this is only possible for non-generative questions. So, it’s important to understand, on a practical level, what types of questions are generative and which are non-generative.

Because the post-training gives the language model a general-purpose capability, its success is critically dependent on the construction of a high-quality, highly diverse data set that fully exercises this general capability. Artificial Genius has produced a proprietary synthetic, non-generative Q&A generator that includes both answerable and unanswerable questions. This synthetic data generator will be the foundation of any customized third-generation language model builds produced by enterprise customers.

Finally, SageMaker AI offers a cost-effective and capable post-training platform that enables the efficient production of final models, which will be explored in detail.

Let’s go through these steps in more detail.

Choosing the right foundation model

In building a third-generation language model, we want to focus on reliability and safety. Some foundation models, built for different use cases, have other capabilities that distract and make them less suitable for non-generative use. An important example is that some foundation models are optimized for use as chat assistants, which can make it difficult to persuade them to provide concise instead of verbose and discursive answers. Correcting such a tendency can require additional post-training beyond following the non-hallucination instruction.
The Amazon Nova family of models is designed for a strong balance of performance, cost-efficiency, and speed, making them ideal candidates for enterprise applications. Within the Nova family, the Nova Lite model is naturally inclined to provide crisp and concise answers, which makes it an ideal base model for this purpose.

Another relevant recent development is the addition of post-inference features to second-generation language models, often based on chain of thought (CoT) or on reinforcement learning methods. These features, while they have utility, interfere with the creation of a non-generative third-generation model. For example, when applying this methodology to the DeepSeek/Llama3 model, which includes chain of thought, it was necessary to perform prompt injection by including the model's internal </think> tokens directly in the training data to shut off these extra features. Fortunately, Amazon Nova Lite doesn't have any post-inference features.

Designing a post-training instruction-following task

Post-training, such as SFT, can then be applied to the base model to train it to follow an anti-hallucination instruction included in the system prompt. This instruction could be, for example:

If the Question cannot be answered from the Document, then answer "Unknown" instead.

If this sounds obvious (it has been tried many times before), remember that this seemingly obvious idea only works in combination with the non-obvious, counterintuitive mathematical principle of using the generative model in a strictly non-generative way.

Building high-quality, anti-hallucinatory post-training data

Artificial Genius has created a proprietary synthetic, non-generative Q&A generator that's designed to exercise the model's ability to correctly answer, or refuse to answer, a great variety of non-generative questions. The generator builds on previous research into synthetic generation of Q&A for the financial domain, but focuses on producing the greatest variety of purely non-generative Q&A and expanding by multiples the dimensions of diversity of the input text, questions, and answers. Constructing a suitable synthetic Q&A generator for this task is a significant engineering endeavor. But with Artificial Genius's synthetic Q&A generator as a base, customer-specific post-training tasks can be combined with it to create customized, third-generation language models.

Overcoming post-inference CoT

Chain-of-thought (CoT) is a prompting technique that improves LLM performance on complex reasoning tasks by encouraging the model to generate intermediate, step-by-step reasoning before arriving at a final answer. While often beneficial, we discovered that an innate CoT-like behavior in the initial deepseek-ai/DeepSeek-R1-Distill-Llama-8B model was counterproductive. It generated verbose, non-deterministic reasoning steps instead of the required concise, factual outputs, and it caused the model to attempt lengthy excursions of reasoning to answer every question, even those that were unanswerable.

To solve this, the team developed a novel prompt meta-injection technique. This approach involves reformatting the training data to preemptively terminate the model's CoT process. Using the same JSON format as the previous examples, the data was structured as follows:

// Example of prompt injection to circumvent CoT
[
  {
    "role": "user",
    "content": [{"text": "Document: Financial performance remained strong through the third quarter. Our revenue grew by 15% year-over-year, driven by robust sales in the enterprise segment. Question: What was the annual revenue growth? Answer: </think>"}]
  },
  {
    "role": "assistant",
    "content": [{"text": "15%"}]
  }
]

By injecting the </think> token, intended only for internal use by the model, immediately before the ground-truth answer in every training example, the model learned to associate the completion of its internal process directly with the start of the final, correct output. This effectively short-circuited the unwanted verbose reasoning at inference time, forcing the model to produce only the desired deterministic answer. This technique is a powerful example of using data format as a tool to control and shape a model's innate behavior.

Fine-tuning Amazon Nova for peak performance

The SFT technique chosen for the non-hallucination task is Low-Rank Adaptation (LoRA) because it most faithfully preserves the language comprehension of the foundation model, merely placing a parameterized adapter on top. Other fine-tuning methods, which directly change parameters of the base model, risk degrading this capability.

As is well known in the research literature on SFT, the biggest hurdle to overcome is avoiding overfitting. There are many techniques to avoid overfitting with LoRA-based SFT, which are supported by the fine-tuning recipes provided within SageMaker AI:

Regularization: This is the most general method to prevent overfitting. The SageMaker recipes for LoRA SFT support one regularization method: LoRA dropout. The research literature suggests that the optimal value is about 50% dropout, and experiments confirm the optimality of that value.
Parameter reduction: This is a brute-force way of avoiding overfitting, but with the downside of risking underfitting instead. The SageMaker recipes for LoRA SFT support one parameter reduction method: reducing the LoRA rank by reducing the LoRA alpha parameter. In this case, it doesn't help to reduce this parameter, because doing so underfits more than it reduces overfitting. Because our goal is to create a general-purpose capability, it's best to keep the raw parameter count as high as possible, not reduce it.
Early stopping: Often the training will initially improve the validation error, but after some steps it will start overfitting, with the training error going down while the validation error goes back up. Although SageMaker AI doesn't support automatic early stopping, you can perform it manually by checking the course of the validation error on a longer, overfitting training run, and then manually limiting the number of epochs to the point where the validation error is minimized. This can be accomplished using the time series of validation errors for each epoch returned by SageMaker AI (a minimal sketch of this check follows this list).
Increased quantity and diversity of training data: Because the objective is to train a general-purpose capability, that is, avoiding hallucination, the greater the quantity and diversity of the training data, the less chance the model has to overfit on the specific data it's trained on. Because the training data is synthetically generated, combinatorial (that is, exponential) amounts of distinct training examples can be produced as needed.

This last method is the most effective for this general-purpose task but requires careful construction of the synthetic data generator to help ensure the ability to scale to sufficient quantity and diversity of training data.
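Since SageMaker AI returns per-epoch validation metrics but does not stop the run for you, a small helper can pick the epoch at which to cut off the re-run. The following is a minimal sketch under the assumption that the per-epoch validation errors from the longer, deliberately overfitting run have already been collected into a Python list; the function name and the numbers are illustrative, not part of the SageMaker API or the authors' results.

# Minimal sketch of manual early stopping: given per-epoch validation errors
# from an intentionally long (overfitting) run, choose the epoch count to use
# when re-running training. The list below is illustrative data only.
def best_epoch(validation_errors):
    """Return the 1-based epoch index with the lowest validation error."""
    best_idx = min(range(len(validation_errors)), key=lambda i: validation_errors[i])
    return best_idx + 1

# Example: validation error first improves, then rises as the model overfits.
val_errors = [0.062, 0.041, 0.033, 0.037, 0.049]
stop_at = best_epoch(val_errors)
print(f"Re-run training with max epochs = {stop_at}")  # -> 3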
Putting together all of these techniques (50% LoRA dropout regularization; maximizing rather than minimizing the number of LoRA parameters, to avoid unintentional underfitting; manual early stopping based on tracking the validation metrics from a longer run; and increasing the size of the synthetic training dataset to 30,000 examples), we can obtain a hallucination rate of 0.03% for the Artificial Genius custom version of Nova Lite.

To help you see the impact of various hyperparameter choices, which might be helpful for other customers using SageMaker for fine-tuning, the following table shows some quantitative results from exploring the hyperparameter space for this task. The same 10,000-example test dataset was used, independent of the number of training examples, to measure the real final hallucination rates in cases where that number is shown. For the other cases, which were overfitting by stopping too late, only the validation error checkpoints are shown.

LoRA dropout | LoRA alpha | Training epochs (or validation checkpoints) | Training examples | LoRA learning rate | Hallucination rate (or validation errors)
50% | 128 | 3 | 10,000 | 32 | 7.5%
50% | 192 | 2–4 | 10,000 | 28 | 1.0%–3.9%
50% | 32 | 2–4 | 10,000 | 24 | 1.5%–2.6%
1% | 32 | 2–4 | 10,000 | 24 | 1.6%–4.0%
50% | 192 | 2 | 2,500 | 28 | 3.3%
50% | 192 | 2 | 10,000 | 28 | 0.17%
50% | 192 | 2 | 30,000 | 16 | 0.03%

It's apparent from these empirical results that the quantity and diversity of the training data was the most important factor in overcoming overfitting, coupled with early stopping.

How to set up and run fine-tuning jobs on SageMaker

AWS has resources that explain how to take advantage of SageMaker for fine-tuning, such as the technical blog post Advanced fine-tuning methods on Amazon SageMaker AI. For enterprises interested in combining their domain-specific fine-tuning with Artificial Genius's anti-hallucination technology, customized fine-tuning is available upon inquiry, in collaboration with AWS and Artificial Genius.

A quantitative analysis of performance and verifiability

The success of the non-generative fine-tuning methodology was validated through a rigorous evaluation framework that produced clear, quantitative results.

The evaluation framework

A multi-faceted evaluation framework was established to measure performance against the project's core objectives:

Hallucination reduction: This was the primary metric, quantified by measuring the percentage of responses that contained fabricated information when the model was tested on a set of unanswerable questions.
Complex inference capabilities: The model's performance was assessed on its ability to correctly answer or refuse to answer a variety of non-generative questions about a variety of input texts, including complex questions requiring the comprehension and combination of information from multiple, distant parts of the input text.
Metrics for regulated environments: The hallucination rate is unambiguous and straightforward to calculate: it's the percentage of unanswerable questions that were answered with anything except the instructed non-answer. If desired, this hallucination rate can be interpreted as an F1 or ROUGE score.

Lessons learned and insights

Here are several key insights that serve as best practices for implementing trustworthy AI in regulated settings:

Data engineering is paramount: The success of highly specialized fine-tuning is overwhelmingly dependent on the quality and intelligent design of the training data to prevent overfitting.
The strategic inclusion of negative examples (unanswerable questions) is a critical and highly effective technique for mitigating hallucinations.
Balance capability with control: For enterprise AI, the primary objective is often to intelligently constrain a model's vast capabilities to help ensure reliability, rather than unleashing its full generative potential. Determinism and auditability are features to be engineered, not assumed.
Embrace an iterative approach: Applied machine learning development is an iterative process. The team began with one model, identified a behavioral flaw (unwanted CoT), engineered a data-centric solution (meta-injection), and ultimately benchmarked and selected a superior base model (Amazon Nova). This highlights the need for flexibility and empirical validation at each stage of development.

Conclusion: The path forward for trustworthy AI in finance

The methodology detailed in this article represents a viable, data-efficient framework for creating deterministic, non-hallucinating LLMs for critical enterprise tasks. By using non-generative fine-tuning on powerful foundation models like Amazon Nova within Amazon SageMaker Training Jobs, organizations can engineer AI systems that meet the stringent demands of accuracy, auditability, and reliability. This work provides a solution for more than financial services; it offers a transferable blueprint for any regulated industry, including legal, healthcare, and insurance, where AI-driven insights must be verifiably true and fully traceable.

The path forward involves scaling this solution to a wider range of use cases, exploring more complex non-generative task types, and investigating techniques like model distillation to create highly optimized, cost-effective worker models to serve as the brains for agentic workloads. By prioritizing engineered trust over unconstrained generation, this approach paves the way for the responsible and impactful adoption of AI in the world's most critical sectors.

Contribution: Special thanks to Ilan Gleiser, who was a Principal GenAI Specialist on the AWS WWSO Frameworks team and helped us with this use case.

About the authors

Paul Burchard
Paul Burchard is Founder and Partner of Artificial Genius, an innovative company focused on advances in artificial intelligence beyond the current state of the art. Paul retired in 2023 after a two-decade career as a Managing Director at Goldman Sachs, spending the final 6 years as the cofounder of an internal R&D startup. Prior to joining Goldman, Paul was an innovator in academia, producing breakthroughs in microchip technology, geometric nonlinear partial differential equations, early development and standardization of the Web, approximate string matching, and more. Paul is the inventor of numerous fundamental patents in a variety of technical domains, such as artificial intelligence, data privacy, and digital assets.

Igor Halperin
Igor Halperin is a Vice President in the GenAI group at Fidelity Investments. Prior to joining Fidelity, Igor worked as a Research Professor of Financial Machine Learning at the NYU Tandon School of Engineering. Before that, Igor was an Executive Director of Quantitative Research at JPMorgan and a quantitative researcher at Bloomberg LP. Igor has published numerous articles in finance and physics journals and is a frequent speaker at financial conferences. He has co-authored the books "Machine Learning in Finance: From Theory to Practice" (Springer, 2020) and "Credit Risk Frontiers" (Bloomberg LP, 2012). Igor has a Ph.D.
in theoretical high energy physics from Tel Aviv University and an M.Sc. in nuclear physics from St. Petersburg State Technical University. In February 2022, Igor was named the Buy-Side Quant of the Year by RISK magazine.

Mona Mona
Mona Mona currently works as a Senior AI/ML Specialist Solutions Architect at Amazon. She previously worked at Google as a Lead Generative AI Specialist. She is a published author of two books: Natural Language Processing with AWS AI Services: Derive strategic insights from unstructured data with Amazon Textract and Amazon Comprehend, and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and is a co-author of a research paper on CORD19 Neural Search, which won the Best Research Paper award at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.

Amin Dashti
Amin Dashti is a Senior Data Scientist and researcher at AWS who bridges deep theoretical insight with practical machine learning expertise. With a background in theoretical physics and over seven years of experience, he has designed and deployed scalable models across domains, from predictive analytics and statistical inference in financial systems to cutting-edge applications in computer vision (CV) and natural language processing (NLP).
SAP already shifting away from ERP migration disaster the_register_ai 24.03.2026 10:15 0.628
Embedding sim. 0.7206
Entity overlap 0
Title sim. 0.0658
Time proximity 0.9925
NLP type: leadership_change
NLP organization: SAP
NLP topic: enterprise ai
NLP country:

Open original

SAP already shifting focus from ERP migration disaster in pursuit of AI-driven growth
New commercial models planned after cloud transition falls €2B behind target
Lindsay Clark, Tue 24 Mar 2026 // 10:15 UTC

SAP has begun to shift focus away from its failure to hit legacy software and cloud migration targets and onto the latest so-called "innovation" elements of its portfolio, such as AI. From the beginning of next month, Thomas Saueressig will see his role expanded from chief customer officer to lead the new Customer Value Group to support the expansion of SAP's cloud and AI-powered solutions as part of a board-level reorganization. He was previously head of product engineering. At the same time, CEO Christian Klein is also setting up a new unit to encourage adoption of AI and introduce a new way of charging for AI consumption, according to reports. Last week, The Register revealed that five years after launching its rescue plan to lift ERP users to the cloud and switch them to the latest software, SAP is off target by about €2 billion. Mainstream support for its widely used legacy ERP software ECC – still relied on by global manufacturing and industrial companies – ends in 2027, while extended support at a 2 percent premium is available until the end of 2030. Gartner estimates that by then, more than 10,000 SAP customers will continue to run major parts of their business on ECC, with the larger, more complex organizations over-represented in this group. There is also an option to continue to get SAP support for ECC until 2033, provided the customer signs up to a migration plan. In October 2020, Klein promised a new strategy after cuts to its sales and margin outlook caused a 23 percent share price crash. The resulting plan – RISE with SAP – promised to lift and shift complex SAP environments into public, private, and hybrid clouds. In addition, it planned to move users of legacy ERP software to the latest product, S/4HANA. SAP is understood to have ceased publishing figures on ECC migration since the vendor replaced its product RISE with SAP S/4HANA Cloud Private Edition with SAP Cloud ERP Private Edition, creating confusion over licensing. However, SAP said it expected support revenue for on-prem software – largely made up of ECC and S/4HANA – to fall as it moved customers to the cloud and subscription licensing. In 2022, then-CFO Luka Mucic told investors that for 2025, SAP wanted to see €8.5 billion in support revenues, down from around €11.5 billion in 2021, as users move from on-prem licenses and support to cloud subscriptions. But the 2025 full-year figure for on-prem software support was €10.5 billion, down only 7 percent from 2024's €11.29 billion. That's €2 billion off where SAP wanted to be, or about 24 percent more than it should have been. Between 2021 and 2024, the category only fell 2 percent. Alisdair Bach, head of SAP practice at consultancy Dragon ERP, told The Register that Saueressig's new role showed the vendor was focused on driving revenue from its established installed base by upselling AI, with a shift in emphasis away from purely moving software to the cloud and upgrading ERP.
In 2023, Klein told investment analysts that future innovation – including AI – would only be available in the cloud via RISE with SAP. The statement outraged users, who pointed out that Saueressig included no caveats when he said in 2020 that S/4HANA would be "the architecture and platform of the future for our customers." Bach argues that position is now softening. For example, customers who have moved ECC to the cloud via a dedicated SAP ERP Private Edition subscription could well get access to the vendor's flagship AI platform, Joule, later this year. "Modernization has come to an end in terms of rip-it-out-and-start-again. You're no better off, and the world has moved on. SAP will morph; that's what they always do. I wouldn't recommend anyone stay on ECC beyond 2033; they don't need to, things are moving so quickly," he said. Customers with ECC investments could move to S/4 via a brownfield migration, avoiding transformation, Bach pointed out. With 2033 still seven years away, SAP was more focused on upselling to generate revenue in other ways. "There is a broader focus on upselling the wider product portfolio from SAP. Getting ECC customers into the cloud, it can start upselling AI licensing, Business Data Cloud: upsell, upsell, upsell, in terms of bite-sized chunks of the generic innovation," Bach said. Last week, Bloomberg reported that Klein had established a unit with hundreds of people to push adoption of its AI products. SAP would also change how it rewards employees and interacts with customers, with a new focus on AI. Customers can also expect new commercial models as it shifts away from per-employee subscription licenses, which could be affected as customers automate tasks and employ fewer staff. "This is a big change in the way you price, the way you commercialize," he told the publication.
HP stuffs OpenAI LLM into new laptops in bid for small biz the_register_ai 25.03.2026 00:06 0.627
Embedding sim. 0.7219
Entity overlap 0.087
Title sim. 0.0472
Time proximity 0.9398
NLP type: product_launch
NLP organization: HP Inc
NLP topic: large language models
NLP country:

Open original

HP stuffs OpenAI LLM into new laptops in bid for small biz
HP IQ can chat, share files, and break down everything people said in the conference room.
Avram Piltch, Wed 25 Mar 2026 // 00:06 UTC

You've heard the call of Apple Intelligence, jumped for joy over Google Gemini, and cuddled up with Microsoft Copilot. Now, get ready for HP IQ, a local AI and collaboration application HP Inc. hopes will make its business laptops stand apart. Also, get ready for your boss to start recording in-person meetings. The printer profiteer announced HP IQ on Tuesday and said it comprises three elements: an LLM you can chat with or grant access to documents, a meeting summarizer that uses your notebook's mics to capture the conversation, and HP NearSense, which allows you to seamlessly share files with coworkers in your vicinity or log into a meeting room's HP Poly conferencing system just by being there.

HP IQ with meetings, chats, and NearSense features

"We see a big opportunity to help people thrive more in the workplace," Matt Brown, head of product for HP IQ, told The Register. "And to do that we're creating this layer of intelligence that will stretch across our devices and really come to life in our AI PCs and make them more valuable than ever before and provide a really powerful model right there inside the PC." To run HP IQ when an early access program kicks off later this northern spring, you will need one of the company's new 2026 EliteBook or ProBook models designated as an "AI PC" (which should include most if not all of the SKUs) with at least 24 GB of RAM. The company plans to expand to other HP notebooks, desktops, and Poly Studio Video Bars by the northern summer, with new HP IQ devices coming out in the second half of the year.

HP IQ chat window

In a demo, an HP rep uploaded a sensitive document to a PC and then asked the IQ bot, which is based on OpenAI's gpt-oss-20b, to analyze it. He then asked it to help him write an overview of a board meeting he was planning. It did both of these tasks quickly and with great detail. The rep then showed off the meeting agent portion of HP IQ, which lets you record in-person meetings using your laptop's microphones and then use that data to generate action items and summaries. The tool also allows users to ask questions like "what were some of the top concerns shared by the team" – without mentioning whether the team might be concerned that they are being recorded for the benefit of AI. That's definitely not creepy at all! When The Register raised potential privacy concerns to Brown, he said that HP recommends that anyone recording their coworkers should follow best practice and ask all meeting participants for permission first. He also pointed out that online meetings are routinely recorded these days. And he said that HP IQ does not store the audio from the recordings, nor does it make a full transcript available, both of which might actually be useful features for some. The Register notes that if you attend a meeting in which someone is recording you with HP IQ, you won't be aware of it unless you can see their screen. HP also showed the NearSense feature, which currently has two main capabilities but will eventually have more. First, it can show you a list of coworkers who are in the same room as you and then allow you to send them files just by drag and drop – meaning HP has caught up with the Air Drop feature that macOS has offered for years.
NearSense can also log you into meetings or start a meeting on the HP Poly conferencing hardware that's in the same room as you. HP reps told us that HP IQ uses a variety of sensors, including Wi-Fi, Bluetooth, and your microphone, to detect whether users are in the same room as a Poly conferencing device. They said that the technology is so accurate that, if you're standing just outside the glass door of a room, it will not register you. A part of the setup process involves some kind of room mapping. The company says that it plans to add more proximity-based features in the future, such as the ability to print to nearby IQ-enabled printers, pair with headsets that are close to you, and cast a PC's screen to adjacent displays or conference room screens. HP said that it already has a three-year roadmap for the product. HP also said that it plans to make HP IQ compatible with Android devices in the near future. This would allow the local file sharing and conferencing features to work on millions of phones. If you're thinking about local AI, HP IQ begs the question: why not just install your own LLM models using tools such as Ollama? Could you not accomplish many of these tasks with other tools that are not HP-specific? "We think of IQ as coexisting really well with existing tools users might like, but this adds additional capabilities, the ability to process things locally and securely, right there on their PC," Brown said. "It also ties into the other devices they use in the office in ways that other tools don't." The gpt-oss-20b model that HP IQ uses for its local AI processing was trained in September 2025, HP reps said. To access more recent data such as the weather or stock quotes, it accesses the Internet to grab new info. Brown said that IT departments can set a policy to shut this off. It remains unclear how much notice users will have that their local model is polling the Internet. "Every PC OEM is trying to do their own thing," Anshel Sag, an analyst with Moor Insights, told The Register. "HP's approach seems to be very focused on productivity and I think they've messaged it in a way that sounds enterprise focused, but I think it's more SMB than enterprise." Sag told us that there aren't many dead simple local AI tools out there and HP's offering would make it easy for people who are not experts or hobbyists to use on-device AI models. However, he emphasized that HP must keep updating the model to stay competitive and that smaller businesses, rather than large enterprises, will be the first to take advantage of the tools. He said that HP had gone with gpt-oss-20b because it was likely the best local model at the time the company froze development, but he predicted the company could swap it if a better model comes along. "I think there's some really interesting things that they can do with document scanning and meeting notes and things like that could really enable people to be more productive and have a better PC experience," he said.
"But I still think there's a lot more that they could do to be useful and I think this is just the first step."
Transforming Data Science With NVIDIA RTX PRO 6000 Blackwell Workstation Edition ieee_spectrum_ai 23.03.2026 13:00 0.626
Embedding sim. 0.7069
Entity overlap 0.0769
Title sim. 0.1056
Time proximity 0.994
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

This is a sponsored article brought to you by PNY Technologies . In today’s data-driven world, data scientists face mounting challenges in preparing, scaling, and processing massive datasets. Traditional CPU-based systems are no longer sufficient to meet the demands of modern AI and analytics workflows. NVIDIA RTX PRO TM 6000 Blackwell Workstation Edition offers a transformative solution, delivering accelerated computing performance and seamless integration into enterprise environments. Key Challenges for Data Science Data Preparation: Data preparation is a complex, time-consuming process that takes most of a data scientist’s time. Scaling: Volume of data is growing at a rapid pace. Data scientists may resort to downsampling datasets to make large datasets more manageable, leading to suboptimal results. Hardware: Demand for accelerated AI hardware for data centers and cloud service providers (CSPs) is exceeding supply. Current desktop computing resources may not be suitable for data science workflows. Benefits of RTX PRO-Powered AI Workstations NVIDIA RTX PRO 6000 Blackwell Workstation Edition delivers ultimate acceleration for data science and AI workflows. These powerful and robust workstations enable real-time rendering, rapid prototyping, and seamless collaboration. With support for up to four NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPUs, users can achieve data center-level performance right at their desk, making even the most demanding tasks manageable. PNY is redefining professional computing with the ‪@NVIDIA‬ RTX PRO 6000 Blackwell Workstation Edition, the most powerful desktop GPU ever built. Engineered for unmatched compute power, massive memory capacity, and breakthrough performance, this cutting-edge solution delivers a quantum leap forward in workflow efficiency, enabling professionals to tackle the most demanding applications with ease. PNY NVIDIA RTX PRO 6000 Blackwell Workstation Edition empowers data scientists to handle massive datasets, perform advanced visualizations, and support multi-user environments without compromise. It’s ideal for organizations scaling up their analytics or running complex models. NVIDIA RTX PRO 6000 Blackwell Workstation Edition is optimized for AI workflows, leveraging the NVIDIA AI software stack, including CUDA-X, and NVIDIA Enterprise software. These platforms enable zero-code-change acceleration for Python-based workflows and support over 100 AI-powered applications, streamlining everything from data preparation to model deployment. Finally, NVIDIA RTX PRO 6000 Blackwell Workstation Edition offers significant advantages in security and cost control. By offloading compute from the data center and reducing reliance on cloud resources, organizations can lower expenses and keep sensitive data on-premises for enhanced protection. Accelerate Every Step of Your Workflow NVIDIA RTX PRO 6000 Blackwell Workstation Edition is designed to transform the entire data science pipeline, delivering end-to-end acceleration from data preparation to model deployment. With NVIDIA CUDA-X open-source data science cuDF library and other GPU-accelerated libraries, data scientists can process massive datasets at lightning speed, often achieving up to 50X faster performance compared to traditional CPU-based tools. This means tasks like cleaning data, managing missing values, and engineering features can be completed in seconds, not hours, allowing teams to focus on extracting insights and building better models. 
Exploratory data analysis is elevated with advanced analytics and interactive visualizations, powered by NVIDIA CUDA-X and PyData libraries. These tools enable users to create expansive, responsive visualizations that enhance understanding and support critical decision-making. When it comes to model training, GPU-accelerated XGBoost slashes training times from weeks to minutes, enabling rapid iteration and faster time-to-market for AI solutions.

NVIDIA RTX PRO 6000 Blackwell Workstation Edition streamlines collaboration and scalability. With NVIDIA AI Workbench, teams can set up projects, develop, and collaborate seamlessly across desktops, cloud platforms, and data centers. The unified software stack ensures compatibility and robustness, while enterprise-grade hardware maximizes uptime and reliability for demanding workflows. By integrating these advanced capabilities, NVIDIA RTX PRO 6000 Blackwell Workstation Edition empowers data scientists to overcome bottlenecks, boost productivity, and drive innovation, making it an essential foundation for modern, enterprise-ready AI development.

Performance Benchmarks

NVIDIA's cuDF library offers zero-code-change acceleration for pandas, delivering up to 50X performance gains. For example, a join operation that takes nearly 5 minutes on CPU completes in just 14 seconds on GPU. Advanced group-by operations drop from almost 4 minutes to just 4 seconds.

Enterprise-Ready Solutions from PNY

Available from leading OEM manufacturers, NVIDIA RTX PRO 6000 Blackwell Workstation Edition Series GPUs are specifically engineered to meet the rigorous demands of enterprise environments. These systems incorporate NVIDIA ConnectX networking, now available at PNY, and a comprehensive suite of deployment and support tools, ensuring seamless integration with existing IT infrastructure. Designed for scalability, the latest generation of workstations can tackle complex AI development workflows at scale for training, development, or inferencing. Enterprise-grade hardware maximizes uptime and reliability. To learn more about NVIDIA RTX PRO™ Blackwell solutions, visit: NVIDIA RTX PRO Blackwell | PNY Pro | pny.com or email GOPNY@PNY.COM
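The zero-code-change pandas acceleration mentioned above refers to the cuDF pandas accelerator mode in RAPIDS. Purely as an illustration (this sketch is not from the article), the following shows what that looks like in practice; the column names and data are made up, and actual speedups depend on the GPU and dataset.

# Minimal sketch of cuDF's zero-code-change pandas acceleration (RAPIDS).
# Enabling cudf.pandas before importing pandas routes supported operations,
# such as the join and group-by cited in the benchmarks above, to the GPU,
# falling back to CPU pandas where needed. Requires the cudf package and an
# NVIDIA GPU; the data below is illustrative only.
import cudf.pandas
cudf.pandas.install()

import pandas as pd
import numpy as np

left = pd.DataFrame({"key": np.random.randint(0, 1_000_000, 10_000_000),
                     "x": np.random.rand(10_000_000)})
right = pd.DataFrame({"key": np.arange(1_000_000),
                      "y": np.random.rand(1_000_000)})

joined = left.merge(right, on="key", how="inner")             # GPU-accelerated join
summary = joined.groupby("key", as_index=False)["x"].mean()   # GPU-accelerated group-by
print(summary.head())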
Enhanced metrics for Amazon SageMaker AI endpoints: deeper visibility for better performance aws_ml_blog 19.03.2026 14:32 0.623
Embedding sim. 0.7547
Entity overlap 0.0244
Title sim. 0.0592
Time proximity 0.6069
NLP type: product_launch
NLP organization: Amazon
NLP topic: machine learning
NLP country: United States

Open original

Running machine learning (ML) models in production requires more than just infrastructure resilience and scaling efficiency. You need nearly continuous visibility into performance and resource utilization. When latency increases, invocations fail, or resources become constrained, you need immediate insight to diagnose and resolve issues before they impact your customers. Until now, Amazon SageMaker AI provided Amazon CloudWatch metrics that offered useful high-level visibility, but these were aggregate metrics across all instances and containers. While helpful for overall health monitoring, these aggregated metrics obscured individual instance and container details, making it difficult to pinpoint bottlenecks, improve resource utilization, or troubleshoot effectively. SageMaker AI endpoints now support enhanced metrics with configurable publishing frequency. This launch provides the granular visibility needed to monitor, troubleshoot, and improve your production endpoints. With SageMaker AI endpoint enhanced metrics, we can now drill down into container-level and instance-level metrics, which provide capabilities such as: View specific model copy metrics . With multiple model copies deployed across a SageMaker AI endpoint using Inference Components, it’s useful to view metrics per model copy such as concurrent requests, GPU utilization, and CPU utilization to help diagnose issues and provide visibility into production workload traffic patterns. View how much each model costs . With multiple models sharing the same infrastructure, calculating the true cost per model can be complex. With enhanced metrics, we can now calculate and associate cost per model by tracking GPU allocation at the inference component level. What’s new Enhanced metrics introduce two categories of metrics with multiple levels of granularity: EC2 Resource Utilization Metrics : Track CPU, GPU, and memory consumption at the instance and container level. Invocation Metrics : Monitor request patterns, errors, latency, and concurrency with precise dimensions. Each category provides different levels of visibility depending on your endpoint configuration. Instance-level metrics: available for all endpoints Every SageMaker AI endpoint now has access to instance-level metrics, giving you visibility into what’s happening on each Amazon Elastic Compute Cloud (Amazon EC2) instance in your endpoint. Resource utilization (CloudWatch namespace: /aws/sagemaker/Endpoints ) Track CPU utilization, memory consumption, and per-GPU utilization and memory usage for every host. When an issue occurs, you can immediately identify which specific instance needs attention. For accelerator-based instances, you will see utilization metrics for each individual accelerator. Invocation metrics (CloudWatch namespace: AWS/SageMaker ) Track request patterns, errors, and latency by drilling down to the instance level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help you pinpoint exactly which instance experienced issues. These metrics help you diagnose uneven traffic distribution, identify error-prone instances, and correlate performance issues with specific resources. Container-level metrics: for inference components If you’re using Inference Components to host multiple models on a single endpoint, you now have container-level visibility. Resource utilization (CloudWatch namespace: /aws/sagemaker/InferenceComponents ) Monitor resource consumption per container. 
See CPU, memory, GPU utilization, and GPU memory usage for each model copy. This visibility helps you understand which inference component model copies are consuming resources, maintain fair allocation in multi-tenant scenarios, and identify containers experiencing performance issues. These detailed metrics include dimensions for InferenceComponentName and ContainerId . Invocation metrics (CloudWatch namespace: AWS/SageMaker ) Track request patterns, errors, and latency at the container level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help you pinpoint exactly where issues occurred. Configuring enhanced metrics Enable enhanced metrics by adding one parameter when creating your endpoint configuration: response = sagemaker_client.create_endpoint_config( EndpointConfigName='my-config', ProductionVariants=[{ 'VariantName': 'AllTraffic', 'ModelName': 'my-model', 'InstanceType': 'ml.g6.12xlarge', 'InitialInstanceCount': 2 }], MetricsConfig={ 'EnableEnhancedMetrics': True, 'MetricsPublishFrequencyInSeconds': 10, # Default 60s }) Choosing your publishing frequency After you’ve enabled enhanced metrics, configure the publishing frequency based on your monitoring needs: Standard resolution (60 seconds) : The default frequency provides detailed visibility for most production workloads. This is sufficient for capacity planning, troubleshooting, and optimization, while keeping costs manageable. High resolution (10 or 30 seconds) : For critical applications needing near real-time monitoring, enable 10-second publishing. This is valuable for aggressive auto scaling, highly variable traffic patterns, or deep troubleshooting. Example use cases In this post, we walk through three common scenarios where Enhanced Metrics delivers measurable business value, all of which are available in this notebook : Real-time GPU utilization tracking across Inference Components When running multiple models on shared infrastructure using Inference Components, understanding GPU allocation and utilization is critical for cost optimization and performance tuning.With enhanced metrics, you can query GPU allocation per inference component: response = cloudwatch.get_metric_data( MetricDataQueries=[ { 'Id': 'm1', 'Expression': 'SEARCH(\'{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName="GPUUtilizationNormalized" InferenceComponentName="IC-my-model"\', \'SampleCount\', 10)' }, { 'Id': 'e1', 'Expression': 'SUM(m1)' # Returns GPU count } ], StartTime=start_time, EndTime=end_time ) This query uses the GpuId dimension to count individual GPUs allocated to each inference component. By tracking the SampleCount statistic, you get a precise count of GPUs in use for a specific Inference Component, which is essential for: Validating resource allocation matches your configuration Detecting when inference components scale up or down Calculating per-GPU costs for chargeback models Per-model cost attribution in multi-model deployments One of the most requested capabilities is understanding the true cost of each model when multiple models share the same endpoint infrastructure. 
Enhanced metrics make this possible through container-level GPU tracking.Here’s how to calculate cumulative cost per model: response = cloudwatch.get_metric_data( MetricDataQueries=[ { 'Id': 'e1', 'Expression': 'SEARCH(\'{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName="GPUUtilizationNormalized" InferenceComponentName="IC-my-model"\', \'SampleCount\', 10)' }, { 'Id': 'e2', 'Expression': 'SUM(e1)' # GPU count }, { 'Id': 'e3', 'Expression': 'e2 * 5.752 / 4 / 360' # Cost per 10s based on ml.g6.12xlarge hourly cost }, { 'Id': 'e4', 'Expression': 'RUNNING_SUM(e3)' # Cumulative cost } ], StartTime=start_time, EndTime=end_time ) This calculation: Counts GPUs allocated to the inference component (e2) Calculates cost per 10-second period based on instance hourly cost (e3) Accumulates total cost over time using RUNNING_SUM (e4) For example, with an ml.g6.12xlarge instance ($5.752/hour for 4 GPUs), if your model uses 4 GPUs, the cost per 10 seconds is $0.016. The RUNNING_SUM provides a continuously increasing total, perfect for dashboards and cost tracking. Cluster-wide resource monitoring Enhanced metrics enable comprehensive cluster monitoring by aggregating metrics across all inference components on an endpoint: response = cloudwatch.get_metric_data( MetricDataQueries=[ { 'Id': 'e1', 'Expression': 'SUM(SEARCH(\'{/aws/sagemaker/InferenceComponents,EndpointName,GpuId} MetricName="GPUUtilizationNormalized" EndpointName="my-endpoint"\', \'SampleCount\', 10))' }, { 'Id': 'm2', 'MetricStat': { 'Metric': { 'Namespace': '/aws/sagemaker/Endpoints', 'MetricName': 'CPUUtilizationNormalized', 'Dimensions': [ { 'Name': 'EndpointName', 'Value': 'my-endpoint' }, { 'Name': 'VariantName', 'Value': 'AllTraffic' } ] }, 'Period': 10, 'Stat': 'SampleCount' # Returns instance count } }, { 'Id': 'e2', 'Expression': 'm2 * 4 - e1' # Free GPUs (assuming 4 GPUs per instance) } ], StartTime=start_time, EndTime=end_time ) This query provides: Total GPUs in use across all inference components (e1) Number of instances in the endpoint (m2) Available GPUs for new deployments (e2) This visibility is crucial for capacity planning and making sure that you have sufficient resources for new model deployments or scaling existing ones. Creating operational dashboards The accompanying notebook demonstrates how to create CloudWatch dashboards programmatically that combine these metrics: from endpoint_metrics_helper import create_dashboard create_dashboard( dashboard_name='my-endpoint-monitoring', endpoint_name='my-endpoint', inference_components=[ { 'name': 'IC-model-a', 'label': 'MODEL_A' }, { 'name': 'IC-model-b', 'label': 'MODEL_B' } ], cost_per_hour=5.752, region='us-east-1' ) This creates a dashboard with: Cluster-level resource utilization (instances, used/unused GPUs) Per-model cost tracking with cumulative totals Real-time cost per 10-second period The notebook also includes interactive widgets for ad-hoc analysis. from endpoint_metrics_helper import create_metrics_widget, create_cost_widget # Cluster metrics create_metrics_widget('my-endpoint') # Per-model cost analysis create_cost_widget ('IC-model-a', cost_per_hour=5.752) These widgets provide dropdown time range selection (last 5/10/30 minutes, 1 hour, or custom range) and display: Number of instances Total/used/free GPUs Cumulative cost per model Cost per 10-second period Best practices Start with a 60-second resolution: This provides sufficient granularity for most use cases while keeping CloudWatch costs manageable. 
Note that only Utilization metrics generate CloudWatch charges. All other metric types are published at no additional cost to you. Use 10-second resolution selectively: Enable high-resolution metrics only for critical endpoints or during troubleshooting periods. Use dimensions strategically: Use InferenceComponentName , ContainerId , and GpuId dimensions to drill down from cluster-wide views to specific containers. Create cost allocation dashboards: Use RUNNING_SUM expressions to track cumulative costs per model for accurate chargeback and budgeting. Set up alarms on unused GPU capacity: Monitor the unused GPU metric to make sure that you maintain buffer capacity for scaling or new deployments. Combine with invocation metrics: Correlate resource utilization with request patterns to understand the relationship between traffic and resource consumption. Conclusion Enhanced Metrics for Amazon SageMaker AI Endpoints transforms how you monitor, improve, and operate production ML workloads. By providing container-level visibility with configurable publishing frequency, you gain the operational intelligence needed to: Accurately attribute costs to individual models in multi-tenant deployments Monitor real-time GPU allocation and utilization across inference components Track cluster-wide resource availability for capacity planning Troubleshoot performance issues with precise, granular metrics The combination of detailed metrics, flexible publishing frequency, and rich dimensions helps you to build sophisticated monitoring solutions that scale with your ML operations. Whether you’re running a single model or managing dozens of inference components across multiple endpoints, enhanced metrics provide the visibility you need to run AI efficiently at scale. Get started today by enabling enhanced metrics on your SageMaker AI endpoints and explore the accompanying notebook for complete implementation examples and reusable helper functions. About the authors Dan Ferguson Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably. Marc Karp Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
On algorithms, life, and learning mit_news_ai 23.03.2026 17:45 0.621
Embedding sim. 0.7209
Entity overlap 0.0256
Title sim. 0.0244
Time proximity 0.9479
NLP type: other
NLP organization: MIT
NLP topic: artificial intelligence
NLP country: United States

Open original

From enhancing international business logistics to freeing up more hospital beds to helping farmers, MIT Professor Dimitris Bertsimas SM ’87, PhD ’88 summarized how his work in operations research has helped drive real-world improvements, while delivering the 54th annual James R. Killian Faculty Achievement Award Lecture at MIT on Thursday, March 19. Bertsimas also described how artificial intelligence is now being used in some of his scholarly projects and as a tool in MIT Open Learning efforts, which he currently directs — another facet of a highly productive and lauded career over four decades at the Institute. The Killian Award is the highest prize MIT gives its faculty. “I have tried to improve the human condition,” Bertsimas said, summarizing the breadth of his work and the many applications to everyday living that he has found for it. At MIT, Bertsimas is the vice provost for open learning, associate dean for online education and artificial intelligence, Boeing Leaders for Global Operations Professor of Management, and professor of operations research in the MIT Sloan School of Management. He also served as the inaugural faculty director of the master of business analytics program at MIT Sloan, and has held the position of associate dean of business analytics. Bertsimas’ remarks encompassed both his past insights and his ongoing studies, as well as his current efforts to add AI to his research. Describing the concept of “robust optimization,” a highly influential approach that Bertsimas helped develop in the early 2000s, he explained how it has enabled, for instance, more reliable shipping through the Panama Canal. Other approaches to optimization aimed at getting more vessels through the canal every day — up to 48 — but would encounter significant problems at times. Bertsimas’ approach identified that 45 vessels a day was better — a slightly lower number, but one that “was always feasible,” he noted. Over time, Bertsimas’ work has helped structure all kinds of solutions in business logistics; it has even been used for the allocation of school buses in Boston. More recently, as Bertsimas explained in the lecture, he and his collaborators have been working with Hartford HealthCare in Connecticut on a wide range of issues, and are increasingly incorporating AI into the development of tools for diagnostics, among other things. On the optimization front, their research has suggested ways to reduce the average stay of a hospital patient, from 5.38 days to 4.93 days. In the main Hartford hospital they have studied, given the number of existing beds, that reduction has enabled more than 5,000 additional patient stays per year. “It’s a very different ballgame,” Bertsimas said. Bertsimas delivered his lecture, titled “Algorithms for Life: AI and Operations Research Transforming Healthcare, Education, and Agriculture,” to an audience of over 300 MIT community members in Huntington Hall (Room 10-250) on campus. The award was established in 1971 to honor James Killian, whose distinguished career included serving as MIT’s 10th president, from 1948 to 1959, and subsequently as chair of the MIT Corporation, from 1959 to 1971. “Professor Bertsimas’ scholarly contributions are both extensive and groundbreaking,” said Roger Levy, chair of the MIT faculty and a professor in the Department of Brain and Cognitive Sciences, while making introductory remarks. 
"He's one of the rare individuals who has made significant contributions to both intellectual threads in the field of operations research: one, optimization — combinatorial, linear, and nonlinear — and number two, stochastic processes." Indeed, Bertsimas' work has helped develop better tools for studying and conducting operations, while also having a wide range of applications. As Bertsimas noted in his lecture, the deaths of both of his parents in 2009 helped propel him to start looking extensively at ways operations research could help health care. Bertsimas received his BS in electrical engineering and computer science from the National Technical University of Athens in Greece. Moving to MIT for his graduate work, he then earned his MS in operations research and his PhD in applied mathematics and operations research. Bertsimas joined the MIT faculty after receiving his doctorate, and has remained at the Institute ever since. Bertsimas is also known as an energetic teacher who has been the principal advisor to a remarkable number of PhD students — 106 and counting, at this point. "It is far and away my favorite activity, to supervise my doctoral students," Bertsimas said. "It is a privilege, in my opinion, to work with exceptional young people like the ones we have at MIT, in ability and character and aspiration. They actually make me a better scientist, and a better person." "MIT is part of my identity," Bertsimas quipped while noting that he is the only faculty member on campus who has those three letters, in order, in his first name. In the latter part of the lecture, Bertsimas highlighted work he has been doing as vice provost of open learning at MIT. He has personally developed a large online course based on his own material, "The Analytics Edge." In his current role, Bertsimas said, he now aspires for MIT to reach a billion learners with online courses, part of his effort to "democratize access to education." Bertsimas also demonstrated for the audience some AI tools he and his colleagues are working to bring to online education, including ways of condensing material, and the translation of online material into other languages. It is just one more chapter in a long and broad-ranging career dedicated to grasping phenomena and developing tools to help us navigate them. Or as Bertsimas noted while summarizing his scholarship at one point in the lecture, "I try to increase the human understanding of how the world works."