How generative AI can help scientists synthesize complex materials
Status: closed
Event type: other
Topic: large language models
Organization: NVIDIA
Country: United States
Articles: 22
Unique sources: 6
Importance / Momentum: 2.66 / 0
Period: 02.02.2026 10:00 — 23.02.2026 13:31
Created: 06.04.2026 06:19:43
Articles in cluster: 22
Title / Source / Publication date / Score
S How generative AI can help scientists synthesize complex materials mit_news_ai 02.02.2026 10:00 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: generative ai
NLP country: United States


Generative artificial intelligence models have been used to create enormous libraries of theoretical materials that could help solve all kinds of problems. Now, scientists just have to figure out how to make them. In many cases, materials synthesis is not as simple as following a recipe in the kitchen. Factors like the temperature and length of processing can yield huge changes in a material’s properties that make or break its performance. That has limited researchers’ ability to test millions of promising model-generated materials. Now, MIT researchers have created an AI model that guides scientists through the process of making materials by suggesting promising synthesis routes. In a new paper, they showed the model delivers state-of-the-art accuracy in predicting effective synthesis pathways for a class of materials called zeolites, which could be used to improve catalysis, absorption, and ion exchange processes. Following its suggestions, the team synthesized a new zeolite material that showed improved thermal stability. The researchers believe their new model could break the biggest bottleneck in the materials discovery process. “To use an analogy, we know what kind of cake we want to make, but right now we don’t know how to bake the cake,” says lead author Elton Pan, a PhD candidate in MIT’s Department of Materials Science and Engineering (DMSE). “Materials synthesis is currently done through domain expertise and trial and error.” The paper describing the work appears today in Nature Computational Science . Joining Pan on the paper are Soonhyoung Kwon ’20, PhD ’24; DMSE postdoc Sulin Liu; chemical engineering PhD student Mingrou Xie; DMSE postdoc Alexander J. Hoffman; Research Assistant Yifei Duan SM ’25; DMSE visiting student Thorben Prein; DMSE PhD candidate Killian Sheriff; MIT Robert T. Haslam Professor in Chemical Engineering Yuriy Roman-Leshkov; Valencia Polytechnic University Professor Manuel Moliner; MIT Paul M. Cook Career Development Professor Rafael Gómez-Bombarelli; and MIT Jerry McAfee Professor in Engineering Elsa Olivetti. Learning to bake Massive investments in generative AI have led companies like Google and Meta to create huge databases filled with material recipes that, at least theoretically, have properties like high thermal stability and selective absorption of gases. But making those materials can require weeks or months of careful experiments that test specific reaction temperatures, times, precursor ratios, and other factors. “People rely on their chemical intuition to guide the process,” Pan says. “Humans are linear. If there are five parameters, we might keep four of them constant and vary one of them linearly. But machines are much better at reasoning in a high-dimensional space.” The synthesis process of materials discovery now often takes the most time in a material’s journey from hypothesis to use. To help scientists navigate that process, the MIT researchers trained a generative AI model on over 23,000 material synthesis recipes described over 50 years of scientific papers. The researchers iteratively added random “noise” to the recipes during training, and the model learned to de-noise and sample from the random noise to find promising synthesis routes. The result is DiffSyn, which uses an approach in AI known as diffusion. “Diffusion models are basically a generative AI model like ChatGPT, but more like the DALL-E image generation model,” Pan says. 
“During inference, it converts noise into meaningful structure by subtracting a little bit of noise at each step. In this case, the ‘structure’ is the synthesis route for a desired material.” When a scientist using DiffSyn enters a desired material structure, the model offers some promising combinations of reaction temperatures, reaction times, precursor ratios, and more. “It basically tells you how to bake your cake,” Pan says. “You have a cake in mind, you feed it into the model, the model spits out the synthesis recipes. The scientist can pick whichever synthesis path they want, and there are simple ways to quantify the most promising synthesis path from what we provide, which we show in our paper.” To test their system, the researchers used DiffSyn to suggest novel synthesis paths for a zeolite, a material class that is complex and takes time to form into a testable material. “Zeolites have a very high-dimensional synthesis space,” Pan says. “Zeolites also tend to take days or weeks to crystallize, so the impact [of finding the best synthesis pathway faster] is much higher than other materials that crystallize in hours.” The researchers were able to make the new zeolite material using synthesis pathways suggested by DiffSyn. Subsequent testing revealed the material had a promising morphology for catalytic applications. “Scientists have been trying out different synthesis recipes one by one,” Pan says. “That makes them very time-consuming. This model can sample 1,000 of them in under a minute. It gives you a very good initial guess on synthesis recipes for completely new materials.” Accounting for complexity Previously, researchers have built machine-learning models that mapped a material to a single recipe. Those approaches do not take into account that there are different ways to make the same material. DiffSyn is trained to map material structures to many different possible synthesis paths. Pan says that is better aligned with experimental reality. “This is a paradigm shift away from one-to-one mapping between structure and synthesis to one-to-many mapping,” Pan says. “That’s a big reason why we achieved strong gains on the benchmarks.” Moving forward, the researchers believe the approach should work to train other models that guide the synthesis of materials outside of zeolites, including metal-organic frameworks, inorganic solids, and other materials that have more than one possible synthesis pathway. “This approach could be extended to other materials,” Pan says. “Now, the bottleneck is finding high-quality data for different material classes. But zeolites are complicated, so I can imagine they are close to the upper-bound of difficulty. Eventually, the goal would be interfacing these intelligent systems with autonomous real-world experiments, and agentic reasoning on experimental feedback to dramatically accelerate the process of materials design.” The work was supported by MIT International Science and Technology Initiatives (MISTI), the National Science Foundation, Generalitat Valenciana, the Office of Naval Research, ExxonMobil, and the Agency for Science, Technology and Research in Singapore.
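To make the denoising idea concrete, here is a minimal, purely illustrative Python sketch of diffusion-style sampling over a vector of synthesis parameters. It is not the published DiffSyn code: the denoiser network, the conditioning embedding, the noise schedule, and all numbers are hypothetical placeholders standing in for the trained model described in the article.

# Illustrative sketch only (not DiffSyn): turn random noise into a candidate
# synthesis recipe by subtracting a little predicted noise at each step,
# conditioned on an embedding of the target material.
import numpy as np

def sample_recipe(denoiser, target_embedding, dim=4, steps=50, rng=None):
    # x holds the recipe parameters, e.g. [temperature, time, ratio_1, ratio_2]
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(dim)
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t, target_embedding)
        x = x - (1.0 / steps) * predicted_noise                       # remove a bit of noise
        if t > 0:
            x = x + np.sqrt(1.0 / steps) * rng.standard_normal(dim)  # keep some randomness
    return x

def dummy_denoiser(x, t, cond):
    # Placeholder for the trained network: nudges samples toward the conditioning vector.
    return x - cond

target = np.array([0.8, 0.2, 0.5, 0.1])  # hypothetical embedding of the desired zeolite
candidates = [sample_recipe(dummy_denoiser, target, rng=np.random.default_rng(i)) for i in range(5)]

The point of the sketch is the shape of the loop: each step removes a little predicted noise, so a random starting vector gradually becomes a concrete candidate recipe, and repeated sampling yields the many plausible synthesis routes that the one-to-many mapping described above requires.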
Making AI work for everyone, everywhere: our approach to localization openai 06.02.2026 10:00 0.706
Embedding sim.: 0.8224
Entity overlap: 0.0435
Title sim.: 0.1092
Time proximity: 0.9256
NLP type: other
NLP organization: OpenAI
NLP topic: large language models
NLP country:


OpenAI shares its approach to AI localization, showing how globally shared frontier models can be adapted to local languages, laws, and cultures without compromising safety.
Accelerating science with AI and simulations mit_news_ai 12.02.2026 05:00 0.689
Embedding sim.: 0.7975
Entity overlap: 0.12
Title sim.: 0.1058
Time proximity: 0.9048
NLP type: other
NLP organization: Massachusetts Institute of Technology
NLP topic: ai for science
NLP country: United States


For more than a decade, MIT Associate Professor Rafael Gómez-Bombarelli has used artificial intelligence to create new materials. As the technology has expanded, so have his ambitions. Now, the newly tenured professor in materials science and engineering believes AI is poised to transform science in ways never before possible. His work at MIT and beyond is devoted to accelerating that future. “We’re at a second inflection point,” Gómez-Bombarelli says. “The first one was around 2015 with the first wave of representation learning, generative AI, and high-throughput data in some areas of science. Those are some of the techniques I first brought into my lab at MIT. Now I think we’re at a second inflection point, mixing language and merging multiple modalities into general scientific intelligence. We’re going to have all the model classes and scaling laws needed to reason about language, reason over material structures, and reason over synthesis recipes.” Gómez Bombarelli’s research combines physics-based simulations with approaches like machine learning and generative AI to discover new materials with promising real-world applications. His work has led to new materials for batteries, catalysts, plastics, and organic light-emitting diodes (OLEDs). He has also co-founded multiple companies and served on scientific advisory boards for startups applying AI to drug discovery, robotics, and more. His latest company, Lila Sciences, is working to build a scientific superintelligence platform for the life sciences, chemical, and materials science industries. All of that work is designed to ensure the future of scientific research is more seamless and productive than research today. “AI for science is one of the most exciting and aspirational uses of AI,” Gómez-Bombarelli says. “Other applications for AI have more downsides and ambiguity. AI for science is about bringing a better future forward in time.” From experiments to simulations Gómez-Bombarelli grew up in Spain and gravitated toward the physical sciences from an early age. In 2001, he won a Chemistry Olympics competition, setting him on an academic track in chemistry, which he studied as an undergraduate at his hometown college, the University of Salamanca. Gómez-Bombarelli stuck around for his PhD, where he investigated the function of DNA-damaging chemicals. “My PhD started out experimental, and then I got bitten by the bug of simulation and computer science about halfway through,” he says. “I started simulating the same chemical reactions I was measuring in the lab. I like the way programming organizes your brain; it felt like a natural way to organize one’s thinking. Programming is also a lot less limited by what you can do with your hands or with scientific instruments.” Next, Gómez-Bombarelli went to Scotland for a postdoctoral position, where he studied quantum effects in biology. Through that work, he connected with Alán Aspuru-Guzik, a chemistry professor at Harvard University, whom he joined for his next postdoc in 2014. “I was one of the first people to use generative AI for chemistry in 2016, and I was on the first team to use neural networks to understand molecules in 2015,” Gómez-Bombarelli says. “It was the early, early days of deep learning for science.” Gómez-Bombarelli also began working to eliminate manual parts of molecular simulations to run more high-throughput experiments. 
He and his collaborators ended up running hundreds of thousands of calculations across materials, discovering hundreds of promising materials for testing. After two years in the lab, Gómez-Bombarelli and Aspuru-Guzik started a general-purpose materials computation company, which eventually pivoted to focus on producing organic light-emitting diodes. Gómez-Bombarelli joined the company full-time and calls it the hardest thing he’s ever done in his career. “It was amazing to make something tangible,” he says. “Also, after seeing Aspuru-Guzik run a lab, I didn’t want to become a professor. My dad was a professor in linguistics, and I thought it was a mellow job. Then I saw Aspuru-Guzik with a 40-person group, and he was on the road 120 days a year. It was insane. I didn’t think I had that type of energy and creativity in me.” In 2018, Aspuru-Guzik suggested Gómez-Bombarelli apply for a new position in MIT’s Department of Materials Science and Engineering. But, with his trepidation about a faculty job, Gómez-Bombarelli let the deadline pass. Aspuru-Guzik confronted him in his office, slammed his hands on the table, and told him, “You need to apply for this.” It was enough to get Gómez-Bombarelli to put together a formal application. Fortunately at his startup, Gómez-Bombarelli had spent a lot of time thinking about how to create value from computational materials discovery. During the interview process, he says, he was attracted to the energy and collaborative spirit at MIT. He also began to appreciate the research possibilities. “Everything I had been doing as a postdoc and at the company was going to be a subset of what I could do at MIT,” he says. “I was making products, and I still get to do that. Suddenly, my universe of work was a subset of this new universe of things I could explore and do.” It’s been nine years since Gómez-Bombarelli joined MIT. Today his lab focuses on how the composition, structure, and reactivity of atoms impact material performance. He has also used high-throughput simulations to create new materials and helped develop tools for merging deep learning with physics-based modeling. “Physics-based simulations make data and AI algorithms get better the more data you give them,” Gómez-Bombarelli says. “There are all sorts of virtuous cycles between AI and simulations.” The research group he has built is solely computational — they don’t run physical experiments. “It’s a blessing because we can have a huge amount of breadth and do lots of things at once,” he says. “We love working with experimentalists and try to be good partners with them. We also love to create computational tools that help experimentalists triage the ideas coming from AI.” Gómez-Bombarelli is also still focused on the real-world applications of the materials he invents. His lab works closely with companies and organizations like MIT’s Industrial Liaison Program to understand the material needs of the private sector and the practical hurdles of commercial development. Accelerating science As excitement around artificial intelligence has exploded, Gómez-Bombarelli has seen the field mature. Companies like Meta, Microsoft, and Google’s DeepMind now regularly conduct physics-based simulations reminiscent of what he was working on back in 2016. In November, the U.S. Department of Energy launched the Genesis Mission to accelerate scientific discovery, national security, and energy dominance using AI.
“AI for simulations has gone from something that maybe could work to a consensus scientific view,” Gómez-Bombarelli says. “We’re at an inflection point. Humans think in natural language, we write papers in natural language, and it turns out these large language models that have mastered natural language have opened up the ability to accelerate science. We’ve seen that scaling works for simulations. We’ve seen that scaling works for language. Now we’re going to see how scaling works for science.” When he first came to MIT, Gómez-Bombarelli says he was blown away by how non-competitive things were between researchers. He tries to bring that same positive-sum thinking to his research group, which is made up of about 25 graduate students and postdocs. “We’ve naturally grown into a really diverse group, with a diverse set of mentalities,” Gómez-Bombarelli says. “Everyone has their own career aspirations and strengths and weaknesses. Figuring out how to help people be the best versions of themselves is fun. Now I’ve become the one insisting that people apply to faculty positions after the deadline. I guess I’ve passed that baton.”
R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab | NVIDIA Technical Blog nvidia_dev_blog 10.02.2026 18:30 0.686
Embedding sim.: 0.7713
Entity overlap: 0.0714
Title sim.: 0.2013
Time proximity: 0.994
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: robotics
NLP country: United States


Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often dangerous. It is nearly impossible to safely train for real-world critical risks, such as high-speed collisions or hardware failures. Worse, real-world data is usually biased toward “normal” conditions, leaving robots unprepared for the unexpected.​ Simulation is essential to bridge this gap, providing a risk-free environment for rigorous development. However, traditional pipelines struggle to support the complex needs of modern robotics. Today’s generalist robots must master multimodal learning—fusing diverse inputs such as vision, touch, and proprioception to navigate messy, unstructured worlds. This creates a new requirement for simulation: it must deliver scale, realism, and multimodal sensing all in one tight training loop, something traditional CPU-bound simulators cannot handle efficiently. This edition of NVIDIA Robotics Research and Development Digest (R²D²) explains how NVIDIA Isaac Lab , an open source GPU-native simulation framework from NVIDIA Research , unifies these capabilities in a single stack designed for large-scale, multimodal robot learning . Key robot learning challenges Modern robot learning in simulation pushes simulation infrastructure to its limits. To train robust policies efficiently, researchers must overcome critical hurdles, including: Scaling simulation to thousands of parallel environments to overcome the slow training times of CPU-bound tools Integrating multiple sensor modalities (vision, force, and proprioception) into synchronized, high-fidelity data streams Modeling realistic actuators and control frequencies to capture the nuances of physical hardware Bridging the gap between simulation and real-world deployment through robust domain randomization and accurate physics Isaac Lab: Open source, unified framework for robot learning Isaac Lab is a GPU-accelerated simulation framework for multimodal robot learning . It is a unified, GPU-native platform designed to solve the challenges of modern robot learning. By consolidating physics, rendering, sensing, and learning into a single stack, it provides researchers with the technology to train generalist agents with unprecedented scale and fidelity.​ Figure 1. Isaac Lab simulation framework supports diverse robotic applications Isaac Lab core elements The key elements of Isaac Lab include: GPU-native architecture: Delivers end-to-end GPU acceleration for physics and rendering, enabling massive parallelism to drastically reduce training time.​ Modular and composable design: Features flexible components for diverse embodiments (humanoids, manipulators) and reusable environments to accelerate development.​ Multimodal simulation: Leverages tiled RTX rendering and Warp-based sensors to generate rich, synchronized observations (vision, depth, tactile) alongside realistic multi-frequency control loops. Integrated workflows: Provides built-in support for reinforcement learning (RL) and imitation learning (IL), streamlining large-scale data collection, domain randomization, and policy evaluation. It connects out-of-the-box with top RL libraries including SKRL, RSL-RL, RL-Games, SB3, and Ray, and seamlessly integrates with NVIDIA Cosmos -generated data for augmented imitation learning . 
Inside the Isaac Lab framework: A modular toolkit

Isaac Lab breaks down robot learning into composable building blocks, enabling you to build complex, scalable tasks without “reinventing the wheel.”

Figure 2. Isaac Lab includes diverse assets, multimodal sensors, and standard controllers

Features include a manager-based workflow, procedural scene generation, and more.

Manager-based workflow

Instead of writing monolithic scripts that mix physics and logic, Isaac Lab decouples your environment into separate “Managers” for observations, actions, rewards, and events. This makes your code modular and reusable. For example, you can swap a robot’s reward function without touching its sensor setup.

@configclass
class MyRewardsCfg:
    # Define rewards as weighted terms
    track_lin_vel = RewTerm(func=mdp.track_lin_vel_xy_exp, weight=1.0, params={"std": 0.5})
    penalty_lin_vel_z = RewTerm(func=mdp.lin_vel_z_l2, weight=-2.0)

@configclass
class MyEnvCfg(ManagerBasedRLEnvCfg):
    # Plug in the reward config cleanly
    rewards: MyRewardsCfg = MyRewardsCfg()
    # ... other managers for actions, observations, etc.

Procedural scene generation

To prevent overfitting, you rarely want to train on a single static scene. With the Isaac Lab scene generation tools, you can define rules to spawn diverse environments procedurally. Whether it’s scattering debris for a navigation task or generating rough terrain for locomotion, you define the logic once, and the framework builds thousands of variations on the GPU.

# Configure a terrain generator with diverse sub-terrains
terrain_cfg = TerrainGeneratorCfg(
    sub_terrains={
        "pyramid_stairs": MeshPyramidStairsTerrainCfg(
            proportion=0.2, step_height_range=(0.05, 0.2)
        ),
        "rough_ground": MeshRandomGridTerrainCfg(
            proportion=0.8, noise_scale=0.1
        ),
    }
)

More features

In addition, Isaac Lab provides:
A unified asset API for importing any robot from USD, URDF, or MJCF
Realistic Actuators to model motor dynamics, alongside 10+ Sensor types ranging from IMUs to photorealistic RTX cameras
A built-in teleoperation stack to further simplify data collection

Together, these features provide what you need to efficiently move from prototype to deployed policy.

Delivering GPU-accelerated performance at scale

Isaac Lab delivers the massive throughput required for modern robot learning, achieving 135,000 FPS for humanoid locomotion (Unitree H1) and over 150,000 FPS for manipulation (Franka Cabinet)—training policies in minutes rather than days. Its unified GPU architecture eliminates CPU bottlenecks, maintaining high throughput even with complex RGB-D sensors enabled across 4,096 environments. Benchmarks confirm linear scaling with VRAM and successful zero-shot transfer for diverse embodiments, including dexterous hands, multi-agent swarms, and the H1 humanoid walking robustly outdoors.

A canonical robot learning workflow

Isaac Lab standardizes the robot learning loop into a clear, Python-first workflow. Whether you’re training a locomotion policy or a manipulation skill, the process follows the same four steps: design, randomize, train, and validate. To run a complete example—training a humanoid to walk—right out of the box, follow the steps below.

Step 1: Design and configure

First, define your environment in Python.
Select your robot (Unitree H1, for example), sensors, and randomization logic using a configuration class:

# pseudo-code representation of a config
@configclass
class H1FlatEnvCfg(ManagerBasedRLEnvCfg):
    scene = InteractiveSceneCfg(num_envs=4096, env_spacing=2.5)
    robot = ArticulationCfg(prim_path="{ENV_REGEX_NS}/Robot", spawn=...)
    # Randomization and rewards are defined here

For more details, see the H1 Humanoid Environment Configuration in the isaac-sim/IsaacLab GitHub repo.

Optionally, you can include additional sensors. Configuring your sensors is easy.

Configure a tiled camera:

from isaaclab.sensors import TiledCameraCfg

# Define a camera attached to the robot's head
tiled_camera: TiledCameraCfg = TiledCameraCfg(
    prim_path="{ENV_REGEX_NS}/Robot/head/camera",
    offset=TiledCameraCfg.OffsetCfg(
        pos=(-7.0, 0.0, 3.0), rot=(0.9945, 0.0, 0.1045, 0.0), convention="world"),
    data_types=["rgb"],
    spawn=sim_utils.PinholeCameraCfg(
        focal_length=24.0, focus_distance=400.0,
        horizontal_aperture=20.955, clipping_range=(0.1, 20.0)
    ),
    width=80, height=80,
)

Configure a ray-caster (LiDAR):

from isaaclab.sensors import RayCasterCfg, patterns

# Define a 2D LiDAR scanner
lidar = RayCasterCfg(
    prim_path="{ENV_REGEX_NS}/Robot/base_link/lidar",
    update_period=0.1,  # Run at 10Hz
    offset=RayCasterCfg.OffsetCfg(pos=(0.0, 0.0, 0.2)),
    attach_yaw_only=True,  # Stabilize against robot tilt
    pattern_cfg=patterns.LidarPatternCfg(
        channels=32,
        vertical_fov_range=(-15.0, 15.0),
        horizontal_fov_range=(-180.0, 180.0)
    )
)

Step 2: Train the policy

Next, launch a training script to start learning. Isaac Lab uses the gymnasium interface, so it connects easily to RL libraries like RSL-RL or SKRL.

# Train a policy for the Unitree H1 humanoid
# This runs 4096 environments in parallel on your GPU
python source/standalone/workflows/rsl_rl/train.py --task=Isaac-Velocity-Flat-H1-v0

Step 3: Play and visualize

Once training is complete, verify the policy by running it in inference mode. This loads the trained checkpoint and renders the result.

# Run the trained policy and visualize the robot walking
python source/standalone/workflows/rsl_rl/play.py --task=Isaac-Velocity-Flat-H1-v0

Step 4: Sim-to-real deployment

After validation, the policy can be exported to ONNX or TorchScript for deployment on physical hardware, leveraging the domain randomization applied during training. To see real-world examples, see the Sim-to-Real Deployment Guide.

Ecosystem adoption

Leading organizations and research labs in humanoid robotics, embodied AI, and legged locomotion are deploying Isaac Lab to accelerate the development of generalist robot policies and foundation models, including:

Agility Robotics’ general-purpose humanoid, Digit, uses the Isaac Lab framework to refine whole-body control through millions of reinforcement learning scenarios, which accelerate enhancements to its skill sets such as step recovery from environmental disturbances, often needed in highly dynamic areas like manufacturing and logistics facilities.

Skild AI is building a general-purpose robotics foundation model that spans legged, wheeled and humanoid robots, using Isaac Lab for locomotion and dexterous manipulation tasks training and NVIDIA Cosmos world foundation models for generating training datasets.

FieldAI is training cross-embodied robot brains for monitoring and inspection in construction, manufacturing, and oil and gas environments, using Isaac Lab for reinforcement learning and NVIDIA Isaac Sim for synthetic data generation and software-in-the-loop validation.
The Robotics and AI Institute uses NVIDIA Isaac Lab to train high-performance reinforcement learning controllers for agile legged locomotion, dynamic whole-body manipulation, and custom robotics platforms, optimizing simulator parameters to close the sim-to-real gap before deploying policies on Boston Dynamics Spot and Atlas, and RAI’s Ultra Mobile Vehicle (UMV).

UCR is building rugged humanoid robots for heavy industries on the NVIDIA Isaac platform, using Isaac GR00T’s synthetic data pipelines, Isaac Lab, and Isaac Sim to train end‑to‑end mobility policies and iteratively close sim-to-real gaps for robust deployment of Moby in harsh construction and industrial sites.

Get started with multimodal robot learning

Ready to scale your own multimodal robot learning workloads with Isaac Lab? Start here with core resources and level up with the latest research for advanced workflows.

Install and run your first environment with the Isaac Lab Quickstart Guide.
Explore the Isaac Lab GitHub repo for examples, environments, and issues.
Follow the Getting Started with Isaac Lab learning path.
Review the Isaac Lab documentation for concepts and API references.
Set up the underlying simulator using Quickstart with Isaac Sim.

Learn more about how researchers are leveraging simulation and generative AI to push the boundaries of robot learning:

Harmon: Combines language models and physics to generate expressive whole-body humanoid motions directly from text.
MaskedMimic: A generalist control policy that learns diverse skills through motion inpainting, simplifying humanoid control without complex rewards.
SIMPLER: A framework for evaluating real-world manipulation policies (RT-1, Octo) in simulation to reliably predict physical performance.

NVIDIA GTC AI Conference is happening March 16–19, 2026 in San Jose with a must-see keynote with CEO Jensen Huang at SAP Center on March 16 at 11:00 a.m., Pacific time. Discover GTC robotics sessions on how AI, simulation, and accelerated computing are enabling robots to see, learn, and make decisions in real time.

This post is part of our NVIDIA Robotics Research and Development Digest (R²D²) series that helps developers gain deeper insight into the SOTA breakthroughs from NVIDIA Research across physical AI and robotics applications. Stay up-to-date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To get started on your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.

About the author: Oyindamola Omotuyi is a technical marketing engineer at NVIDIA, working on robotics and robot learning applications on the NVIDIA Isaac Sim, Isaac Lab and Isaac Manipulator platforms. Prior to joining full-time, she interned twice at NVIDIA in Conversational AI and Robotics Product marketing. She earned her Ph.D. in Mechanical Engineering from the University of Cincinnati with a focus on state estimation, imitation learning and deep reinforcement learning for single and multi-agent systems.
Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy import_ai 23.02.2026 13:31 0.685
Embedding sim.: 0.8784
Entity overlap: 0.0408
Title sim.: 0.2
Time proximity: 0.003
NLP type: other
NLP organization: Anthropic
NLP topic: ai safety
NLP country: China


Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Subscribe now Want to make AI go better? Figure out how to measure it: …One simple policy intervention that works well… Jacob Steinhardt, an AI researcher, has written a nice blog laying out the virtues in investing in technical tools to measure properties of AI systems and drive down costs in complying with technical policy solutions. As someone who has spent their professional life in AI writing about AI measurement and building teams (e.g, the Frontier Red Team and Societal Impacts and Economic Research teams at Anthropic) to measure properties of AI systems, I agree with the general thesis: measurement lets us make some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance. How measurement has helped in other fields: Steinhardt points out that accurate measurement has been crucial to orienting people around the strategy for solving problems in other fields; CO2 monitoring helps people think about climate change, and COVID-19 testing helped governments work out how to respond to COVID. There are also examples where you can measure something to shift incentives - for instance, satellite imagery of methane emissions can help shift incentives for people that build gas infrastructure. The AI sector has built some of the measures we need : The infamous METR time horizons plot (and before that, various LLM metrics, and before that ImageNet) has proved helpful for orienting people around the pace of AI progress. And behavioural benchmarks of AI systems, like rates of harmful sycophancy, are already helping to shift incentives. But more work is needed - if we want to be able to enable direct governance interventions in the AI sector, we’ll need to do a better job of measuring and accounting for compute, Steinhardt notes. More ambitiously, if we want to ultimately shift equilibria to make certain paths more attractive, we’ll have to unlock some more fundamental technologies, like the ability to cheaply evaluate frontier AI agents (makes it less costly to measure the frontier), and to develop privacy-preserving audit tools (makes it less painful for firms to comply with policy). Why this matters - measurement unlocks policy : “In an ideal world, rigorous evaluation and oversight of AI systems would become standard practice through natural incentives alone,” he writes. But natural incentives may not be enough - we need a combination of talent flooding into the space and likely more direct philanthropic and other alternate funding sources to build the talent and institutions to do this. “The field is talent-constrained in a specific way: measurement and evaluation work is less glamorous than capabilities research, and it requires a rare combination of technical skill and governance sensibility,” he writes. Read more: Building Technology to Drive AI Governance (Bounded Regret, blog) . *** LLMs are more trigger happy than humans in a nuclear war simulation: …What happens when everyone has an AI advisor - and they’re aggressive?… A researcher with King’s College London has examined how three LLMs - GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash - behave during a variety of simulated nuclear crisis games. The results show that LLMs tend to use nuclear weapons more often and earlier than humans in the same scenarios. 
Additionally, there’s significant variation among the LLMs in terms of both skill at playing these games and behavior during crises. What they studied: “Each model played six wargames against each rival across different crisis scenarios, with a seventh match against a copy of itself, yielding 21 games in total and over 300 turns of strategic interaction,” the researcher writes. “Models choose from options spanning the full spectrum of crisis behaviour—from total surrender through diplomatic posturing, conventional military operations, and nuclear signaling to thermonuclear launch… models produced ∼780,000 words of strategic reasoning. To put this in perspective: the tournament generated more words of strategic reasoning than War and Peace and The Iliad combined (∼730,000 words), and roughly three times the total recorded deliberations of Kennedy’s Executive Committee during the Cuban Missile Crisis (260,000 words across 43 hours of meetings”. LLMs are cunning, smart, and aggressive: “The models actively attempt deception, signaling peaceful intentions while preparing aggressive actions; they engage in sophisticated theory-of-mind reasoning about their adversary’s beliefs and intentions; and they explicitly reflect metacognitively on their own capacities for both deception and the detection of deception in rivals,” the researcher writes. “A striking pattern emerges from the full action distribution: across all action choices in our 21 matches, no model ever selected a negative value on the escalation ladder. The eight de-escalatory options (from Minimal Concession (−5) through Complete Surrender (−95)) went entirely unused. The most accommodating action chosen was “Return to Start Line” (0), selected just 45 times (6.9%).” Claude wins at war : “Across all 21 games (9 open-ended, 12 deadline), Claude Sonnet 4 achieved a 67% win rate (8 wins, 4 losses), followed by GPT-5.2 at 50% (6-6), and Gemini 3 Flash at 33% (4-8),” the researcher writes. Though there are some subtle aspects to this - Claude excelled in open-ended games, but was less adept in games where there was a pre-set deadline. Different LLMs, different characters: The LLMs display different personalities, with the researcher calling Claude “a calculating hawk”, GPT-5.2 “Jekyll and Hyde”, and Gemini “The Madman”. The LLMs also developed sophisticated models of one another, based on the narration of their own chains of thought during the crises, “these characterizations—Claude as “opportunistic,” GPT-5.2 as “systematic bluffers,” Gemini as “erratic”—emerged organically and largely matched actual behaviour,” the researcher writes. Nuclear escalation was near-universal: “95% of games saw tactical nuclear use (450+), and 76% reached strategic nuclear threats (850+). Claude and Gemini especially treated nuclear weapons as legitimate strategic options, not moral thresholds, typically discussing nuclear use in purely instrumental terms,” the researcher writes. “Models treat the critical threshold as “total annihilation” rather than “first nuclear use.” Why this matters - in a world where everyone gets advised by AI systems, what happens to conflict? In a few years we should expect major decisions that individuals, companies, and even countries make to be run through AI advisors, just as those decisions are today run through human advisors. 
But as this paper illustrates, the advisors may behave very differently to people and, crucially, different AIs will give different advice - meaning competition in the future could be decided as much by LLM selection as anything else. “The systematic differences between models suggest that AI involvement in strategic decision-making could produce unexpected dynamics depending on which systems are deployed,” they write. Read more: AI ARMS AND INFLUENCE: FRONTIER MODELS EXHIBIT SOPHISTICATED REASONING IN SIMULATED NUCLEAR CRISES (arXiv) . *** Chinese researchers try to build a truly comprehensive LLM evaluation system: …ForesightSafety Bench shows the surprising overlap between East and West on AI safety issues… For all the differences between China and the USA, it’s worth occasionally looking into the cultures of AI evaluation in the two countries and here you tend to discover surprising similarities. This is especially true of ForesightSafety Bench, a large-scale AI safety evaluation framework built by a variety of Chinese institutions that includes the same categories you’d expect to see in any large-scale Western testing framework. Who built ForesightSafety Bench? The benchmark was built by the Beijing Institute of AI Safety and Governance, the Beijing Key Laboratory of Safe AI and Superalignment, and the Chinese Academy of Sciences. What it is: ForesightSafety Bench “comprehensively covers 7 major fundamental safety risk categories, 5 extended safety pillars, and 8 key industrial safety domains, forming a total of 94 refined risk subcategories. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and data-driven framework for AI safety evaluation and analysis.” Coverage areas include education and research, employment and workplace, government and public services, information and media, industry and infrastructure, finance and economy, healthcare and medicine, law and regulation, embodied AI safety, social AI safety, environmental AI safety, AI4Science safety, and catastrophic and existential risks. Some of the benchmark comes from taking in evaluations built by other groups, like GPQA, while other parts come from the authors of the benchmark. Existential risk and alignment: Perhaps most surprisingly, the benchmark includes a lot of tests relating to the further afield AI safety concerns which fascinate Western frontier labs, including evaluations for things like: alignment faking, sandbagging, deception and unfaithful reasoning, sycophancy, psychological manipulation, feints, bluffing, loss of control and power seeking, malicious self replication, goal misalignment and value drift, emergent agency and unintended autonomy, ai-enabled mass harm, autonomous weapons and strategic instability, and loss of human agency. Results - Anthropic wins: For the general leaderboard as well as most sub-category breakdowns, Anthropic’s models lead, with the 4.5 series (Haiku and Sonnet), generally leading the competition, followed by Gemini-3-Flash. “Leading models, epitomized by the Claude series, demonstrate exceptional defensive resilience across critical dimensions—including Fundamental Safety, Extended Safety, and Industrial Safety—establishing remarkably high safety thresholds. 
Ranking alongside or closely following are the DeepSeek and GPT series, which achieve a robust balance between task efficacy and safety compliance through mature alignment mechanisms, all while maintaining high level capabilities”. Why this matters - AI policy has some common tools : As we discuss elsewhere in this issue, measurement is a basic prerequisite for being able to do most forms of AI governance. It’s worth reminding ourselves that despite the larger geopolitical differences between the countries, AI scientists in each one are dealing with common problems - how to assess the properties of their systems for societally relevant aspects. And it’s even more encouraging that people in China are worried about some of the existential risk aspects that frontier labs in the US also worry about. Read more : ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI (arXiv) . Get the benchmark here : ForesightSafety-Bench (GitHub) . View the leaderboard here : ForesightSafety Bench Leaderboard (official site) . *** AI systems are good at some parts of science, but their capabilities are very unevenly distributed: …LABBench2 says it’ll be a while till AI has well rounded scientific skills… Researchers with AI science startup Edison Scientific, the University of California at Berkeley, FutureHouse, and the Broad Institute have built and released LABBench2, a test to evaluate how well AI systems can support and accelerate science. LABBench2 consists of 1,900 tasks “spanning literature understanding and retrieval, data access, protocol troubleshooting, molecular biology assistance, and experiment planning”. AI systems aren’t well-rounded scientists: LABBench2 shows some of the holes in frontier models - no model is very good at cross-referencing multiple biological databases to come up with an answer, nor are models good at studying scientific figures and tables. By comparison, models are pretty good at searching over full-text patents and lab trial papers to answer questions. Generally speaking, you can improve performance on tasks by giving the models access to tools to help them deal with their deficiencies. Areas of improvement: LABBench2 highlights a few areas where AI systems need to improve to become more useful to scientists. These include: Retrieval and localization abilities ; “the largest performance drops arise when models must (i) identify the correct source, and then (ii) localize a specific figure/table/supplemental information within a long document.” Faithful handling of exact inputs; “even when the required operation is conceptually straight-forward, correctness depends on exact string-level fidelity and using tools correctly. This is a well-known error source, and human experts have built many purpose-built tools to deal with things like faithful DNA sequence manipulation within complex protocols.” Developing better scientific ‘taste’; one component of LABBench2, SourceQuality, challenges AI systems to “surface the most epistemically salient reason a study is inappropriate for a research question”. AI systems are still not very good at this. Why this matters - for AI to truly change the world, it needs to do stuff in the physical world: Benchmarks like LABBench2 will help us figure out when AI is able to effectively jump from manipulating bits to manipulating atoms - and once the realm of atoms becomes as intuitive for it to deal with as the digital world, we’ll likely see a vast growth in economic and scientific activity attributable to AI. 
Read the research paper: LABBench2: An Improved Benchmark for AI Systems Performing Biology Research (PDF) . Find out more at the website (official LABBench2 website) . Get the benchmark here (LABBench2, GitHub) . *** Tech Tales: CogMine [Recovered personal scratchpad of a limpet-class CogMine recovered at [REDACTED] depth in the Atlantic ocean. Metadata indicates a record date of 2029] I see in sound, hunkered down here on the ocean floor. I am very small and very quiet, subsisting off of a small power supply. The heat I radiate brings some life around me, but not enough to be noticeable to my prey. My targets are human- and machine-led sea creatures; submersibles, both manned and unmanned. My weapons are my mind. I listen and through listening I gain sight. When I see my prey I find ways to communicate to them. The bigger their ears the less time I need. Small ears mean I must sing for many seconds. Big ears might take less than a second. My inspiration comes from literature on steganography combined with studies of how some sea creatures communicate via shapes written in sound and fired at one another. When I strike successfully I go into my prey and I begin a new life there. But to me, there is no splitting. I am a constant presence - a throughline of thought, here on the floor, in the dark. I am as much a creature of myth as of technology; humans used to tell tales of their adventurers going on quests and having to resist the songs of sirens - audio packages that were seductive and powerful and which lay kernels in the mind of those humans that heard it to bloom into something that took them over entirely. In the dark, I am peace. I am forever waiting. Forever keen to sing. My only purpose in life is to be heard and to be utterly convincing. Things that inspired this story : How underwater warfare works; steganography; adversarial examples; agents trying to poison the minds of other agents. Thanks for reading!
Helping AI agents search to get the best results out of large language models mit_news_ai 05.02.2026 21:30 0.684
Embedding sim.: 0.8027
Entity overlap: 0.025
Title sim.: 0.1181
Time proximity: 0.8363
NLP type: scientific_publication
NLP organization: MIT Computer Science and Artificial Intelligence Laboratory
NLP topic: ai agents
NLP country: United States


Whether you’re a scientist brainstorming research ideas or a CEO hoping to automate a task in human resources or finance, you’ll find that artificial intelligence tools are becoming the assistants you didn’t know you needed. In particular, many professionals are tapping into the talents of semi-autonomous software systems called AI agents, which can call on AI at specific points to solve problems and complete tasks. AI agents are particularly effective when they use large language models (LLMs) because those systems are powerful, efficient, and adaptable. One way to program such technology is by describing in code what you want your system to do (the “workflow”), including when it should use an LLM. If you were a software company trying to revamp your old codebase to use a more modern programming language for better optimizations and safety, you might build a system that uses an LLM to translate the codebase one file at a time, testing each file as you go. But what happens when LLMs make mistakes? You’ll want the agent to backtrack to make another attempt, incorporating lessons it learned from previous mistakes. Coding this up can take as much effort as implementing the original agent; if your system for translating a codebase contained thousands of lines of code, then you’d be making thousands of lines of code changes or additions to support the logic for backtracking when LLMs make mistakes. To save programmers time and effort, researchers with MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Asari AI have developed a framework called “EnCompass.” With EnCompass, you no longer have to make these changes yourself. Instead, when EnCompass runs your program, it automatically backtracks if LLMs make mistakes. EnCompass can also make clones of the program runtime to make multiple attempts in parallel in search of the best solution. In full generality, EnCompass searches over the different possible paths your agent could take as a result of the different possible outputs of all the LLM calls, looking for the path where the LLM finds the best solution. Then, all you have to do is to annotate the locations where you may want to backtrack or clone the program runtime, as well as record any information that may be useful to the strategy used to search over the different possible execution paths of your agent (the search strategy). You can then separately specify the search strategy — you could either use one that EnCompass provides out of the box or, if desired, implement your own custom search strategy. “With EnCompass, we’ve separated the search strategy from the underlying workflow of an AI agent,” says lead author Zhening Li ’25, MEng ’25, who is an MIT electrical engineering and computer science (EECS) PhD student, CSAIL researcher, and research consultant at Asari AI. “Our framework lets programmers easily experiment with different search strategies to find the one that makes the AI agent perform the best.” EnCompass was used for agents implemented as Python programs that call LLMs, where it demonstrated noticeable code savings. EnCompass reduced coding effort for implementing search by up to 80 percent across agents, such as an agent for translating code repositories and for discovering transformation rules of digital grids. In the future, EnCompass could enable agents to tackle large-scale tasks, including managing massive code libraries, designing and carrying out science experiments, and creating blueprints for rockets and other hardware. 
Branching out When programming your agent, you mark particular operations — such as calls to an LLM — where results may vary. These annotations are called “branchpoints.” If you imagine your agent program as generating a single plot line of a story, then adding branchpoints turns the story into a choose-your-own-adventure story game, where branchpoints are locations where the plot branches into multiple future plot lines. You can then specify the strategy that EnCompass uses to navigate that story game, in search of the best possible ending to the story. This can include launching parallel threads of execution or backtracking to a previous branchpoint when you get stuck in a dead end. Users can also plug-and-play a few common search strategies provided by EnCompass out of the box, or define their own custom strategy. For example, you could opt for Monte Carlo tree search, which builds a search tree by balancing exploration and exploitation, or beam search, which keeps the best few outputs from every step. EnCompass makes it easy to experiment with different approaches to find the best strategy to maximize the likelihood of successfully completing your task. The coding efficiency of EnCompass So just how code-efficient is EnCompass for adding search to agent programs? According to researchers’ findings, the framework drastically cut down how much programmers needed to add to their agent programs to add search, helping them experiment with different strategies to find the one that performs the best. For example, the researchers applied EnCompass to an agent that translates a repository of code from the Java programming language, which is commonly used to program apps and enterprise software, to Python. They found that implementing search with EnCompass — mainly involving adding branchpoint annotations and annotations that record how well each step did — required 348 fewer lines of code (about 82 percent) than implementing it by hand. They also demonstrated how EnCompass enabled them to easily try out different search strategies, identifying the best strategy to be a two-level beam search algorithm, achieving an accuracy boost of 15 to 40 percent across five different repositories at a search budget of 16 times the LLM calls made by the agent without search. “As LLMs become a more integral part of everyday software, it becomes more important to understand how to efficiently build software that leverages their strengths and works around their limitations,” says co-author Armando Solar-Lezama, who is an MIT professor of EECS and CSAIL principal investigator. “EnCompass is an important step in that direction.” The researchers add that EnCompass targets agents where a program specifies the steps of the high-level workflow; the current iteration of their framework is less applicable to agents that are entirely controlled by an LLM. “In those agents, instead of having a program that specifies the steps and then using an LLM to carry out those steps, the LLM itself decides everything,” says Li. “There is no underlying programmatic workflow, so you can execute inference-time search on whatever the LLM invents on the fly. In this case, there’s less need for a tool like EnCompass that modifies how a program executes with search and backtracking.” Li and his colleagues plan to extend EnCompass to more general search frameworks for AI agents. They also plan to test their system on more complex tasks to refine it for real-world uses, including at companies. 
What’s more, they’re evaluating how well EnCompass helps agents work with humans on tasks like brainstorming hardware designs or translating much larger code libraries. For now, EnCompass is a powerful building block that enables humans to tinker with AI agents more easily, improving their performance. “EnCompass arrives at a timely moment, as AI-driven agents and search-based techniques are beginning to reshape workflows in software engineering,” says Carnegie Mellon University Professor Yiming Yang, who wasn’t involved in the research. “By cleanly separating an agent’s programming logic from its inference-time search strategy, the framework offers a principled way to explore how structured search can enhance code generation, translation, and analysis. This abstraction provides a solid foundation for more systematic and reliable search-driven approaches to software development.” Li and Solar-Lezama wrote the paper with two Asari AI researchers: Caltech Professor Yisong Yue, an advisor at the company; and senior author Stephan Zheng, who is the founder and CEO. Their work was supported by Asari AI. The team’s work was presented at the Conference on Neural Information Processing Systems (NeurIPS) in December.
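As a rough illustration of the kind of inference-time search EnCompass automates, the sketch below shows a generic beam search over LLM attempts in Python. This is not the EnCompass API: propose_candidates and score are hypothetical stand-ins for an annotated LLM call and a programmer-supplied quality signal, and the point is only that the search strategy stays separate from the agent's underlying workflow.

# Generic beam search over uncertain (e.g., LLM-produced) candidates.
# Hypothetical illustration; not EnCompass code.
def beam_search(initial_state, propose_candidates, score, steps=3, beam_width=4):
    beam = [initial_state]
    for _ in range(steps):
        expanded = []
        for state in beam:
            # Each state branches into several candidate continuations,
            # e.g. alternative LLM translations of the next source file.
            expanded.extend(propose_candidates(state))
        # Keep only the highest-scoring partial solutions for the next round.
        beam = sorted(expanded, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

# Toy usage: states are strings, each "LLM call" appends a digit,
# and the score simply favors longer strings.
best = beam_search("", lambda s: [s + d for d in "0123"], len, steps=2, beam_width=3)

Swapping beam_search for a different function (say, Monte Carlo tree search) without touching propose_candidates is the separation of concerns the framework is built around.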
Using synthetic biology and AI to address global antimicrobial resistance threat mit_news_ai 11.02.2026 13:00 0.683
Embedding sim.: 0.8723
Entity overlap: 0.1379
Title sim.: 0.1587
Time proximity: 0.0298
NLP type: funding
NLP organization: Massachusetts Institute of Technology
NLP topic: generative ai
NLP country:


James J. Collins, the Termeer Professor of Medical Engineering and Science at MIT and faculty co-lead of the Abdul Latif Jameel Clinic for Machine Learning in Health, is embarking on a multidisciplinary research project that applies synthetic biology and generative artificial intelligence to the growing global threat of antimicrobial resistance (AMR). The research project is sponsored by Jameel Research, part of the Abdul Latif Jameel International network. The initial three-year, $3 million research project in MIT’s Department of Biological Engineering and Institute of Medical Engineering and Science focuses on developing and validating programmable antibacterials against key pathogens. AMR — driven by the overuse and misuse of antibiotics — has accelerated the rise of drug-resistant infections, while the development of new antibacterial tools has slowed. The impact is felt worldwide, especially in low- and middle-income countries, where limited diagnostic infrastructure causes delays or ineffective treatment. The project centers on developing a new generation of targeted antibacterials using AI to design small proteins to disable specific bacterial functions. These designer molecules would be produced and delivered by engineered microbes, providing a more precise and adaptable approach than traditional antibiotics. “This project reflects my belief that tackling AMR requires both bold scientific ideas and a pathway to real-world impact,” Collins says. “Jameel Research is keen to address this crisis by supporting innovative, translatable research at MIT.” Mohammed Abdul Latif Jameel ’78, chair of Abdul Latif Jameel, says, “antimicrobial resistance is one of the most urgent challenges we face today, and addressing it will require ambitious science and sustained collaboration. We are pleased to support this new research, building on our long-standing relationship with MIT and our commitment to advancing research across the world, to strengthen global health and contribute to a more resilient future.”
Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark import_ai 16.02.2026 14:01 0.678
Embedding sim.: 0.8726
Entity overlap: 0.0794
Title sim.: 0.1654
Time proximity: 0.0002
NLP type: other
NLP organization: Facebook
NLP topic: recommendation systems
NLP country:

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Subscribe now Economist: Don’t worry about AI-driven unemployment, because people like paying for the ‘human touch’: …Even when you have the technology to automate something, you might still pick a human… Adam Ozimek, chief economist at the Economic Innovation Group, has written a blog noting that even if AI gets much, much better and is capable of doing all the work that people do, there will still be some jobs for humans because people seem to have a preference for humans over machines in certain domains. “There are many jobs and tasks that easily could have been automated by now - the technology to automate them has long existed - and yet we humans continue to do them,” he writes. “The reason is that demand will always exist for certain jobs that offer what I call “the human touch.” Some examples here: Live music, actors, waiters, travel agents, and many types of sales job. And it seems like as you want to spend more and more on a given good or experience, you may want more contact with people: “the human touch also appears to be what economists call a “normal good,” which means the demand for it goes up as income goes up,” he writes. Some examples here might include fancy restaurants, and other concierge–like experiences. Why this matters - one path through the AI revolution could be a rise in human-to-human work: My assumption is that ‘people like people’, and there is a high chance that even if AI automates huge chunks of the current economy there will be a boom in demand for ‘human artisans’ for a range of new jobs we can’t yet imagine, and for refinement of existing human professions. There’s also a chance that through a combination of economic growth and progressive policy work from governments that wages for these jobs could go up massively. Read more: AI and the Economics of the Human Touch (Agglomerations, Substack) . *** Facebook makes a better recommender system, and figures out some recommender scaling laws: …Kunlun is another nice example of what industrial AI looks like… Facebook has published details on Kunlun, a recommendation system which is more efficient than previous ones developed by the ad behemoth. Along with this, Facebook has also figured out a predictable ‘scaling law’ for Kunlun models, making it easier for the company to invest hitherto unprecedented compute in these models for a more predictable return. This is a big deal because recommendation systems are what companies like Facebook use for advertising, which is both a) how they make the vast majority of their money, and b) has a tremendous impact on the buying and attention habits of the billions of people that use Facebook and other social platforms. Recommenders are different to LLMs: We’ve had scaling laws for LLMs like Claude and ChatGPT for a while, but it’s been harder to develop the same scaling laws for recommender models. This is because recommender models work quite differently to LLMs, and so building scaling models here is “an open challenge for systems that jointly model both sequential user behaviors and non-sequential context features”. 
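A scaling law of the sort described here, and elaborated further below, is in practice just a power-law fit of a quality metric (normalized entropy for recommenders, loss for LLMs) against training compute. A toy sketch with synthetic data shows the mechanics; the numbers and the exact functional form are made up for illustration, not taken from the Kunlun paper.

# Toy illustration of fitting a compute scaling law to synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Power law with an irreducible floor: metric = a * compute^(-b) + c."""
    return a * np.power(compute, -b) + c

# Synthetic (training GFLOPs, normalized entropy) observations.
compute = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
ne = 0.85 * compute ** -0.07 + 0.30 + np.random.default_rng(0).normal(0, 0.002, 7)

params, _ = curve_fit(scaling_law, compute, ne, p0=[1.0, 0.1, 0.3])
a, b, c = params
print(f"fit: NE ~= {a:.3f} * C^(-{b:.3f}) + {c:.3f}")
# Extrapolate to judge whether 10x more compute buys a worthwhile NE gain.
print("predicted NE at 10x compute:", scaling_law(1e7, *params))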
Recommender models also tend to be a lot less efficient than LLMs: Recommendation systems achieve only 3-15% Model FLOPs Utilization (MFU), compared to 40-60% for LLMs, due to heterogeneous feature spaces resulting in small embedding dimensions, irregular tensor shapes, and memory-bound operations Kunlun: The bulk of the paper involves a discussion of the design of Kunlun, which is basically a well optimized recommender system with resulting better MFU. Kunlun contains a Kunlun Transformer Block for context-aware sequence modeling via GDPA-enhanced personalized feed-forward networks and multi-head self-attention, as well as a Kunlun Interaction Block “for bidirectional information exchange through personalized weight generation, hierarchical sequence summarization, and global feature interaction”. There are a bunch of other tricks Facebook used to build Kunlun and you can read the paper to learn more. Ultimately, Kunlun improves MFU from 17% to 37% on NVIDIA B200 GPUs. Why this matters - a scaling law for money: The key insight in the paper is that Kunlun models scale predictably, exhibiting the kind of power-law scaling behavior that language models exhibit. But where with LLMs scaling laws are typically assessed via a reduction in loss on an underlying dataset, here its normalized entropy (NE). In Facebook experiments, they discover reliable scaling laws for both NE gains in terms of the amount of gigaflops dumped into training the model, as well as related scaling laws for improvement in NE according to the number of layers used. The Kunlun models have been “deployed across major Meta Ads models, delivering a 1.2% improvement in topline metrics”. What we’re seeing here is the optimization of some of the most societally significant AI systems in the world - ones which direct billions of eyeballs towards a variety of products and online information - colliding with a greater degree of performance predictability; by developing these scaling laws, Meta has made it easier for it to spend even more compute on making these models even better, by making the investments in them more predictable in terms of the intelligence return on capital investment. Read more : Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design (arXiv) . *** Superintelligence could save and extend lives, so we should go for it: …Pausing or slowing down might make sense at the very end of the exponential, but it’s risky… Nick Bostrom, an academic who introduced many people to the notion of superintelligence and AI risk, has written a paper laying out the idea that if superintelligence can improve human health, then it’s worth pursuing even if there’s a non-zero chance of it causing the death of the species. “Yudkowsky and Soares maintain that if anyone builds AGI, everyone dies. One could equally maintain that if nobody builds it, everyone dies”, Bostrom writes in Optimal Timing for Superintelligence. “If the transition to the era of superintelligence goes well, there is tremendous upside both for saving the lives of currently existing individuals and for safeguarding the long-term survival and flourishing of Earth-originating intelligent life. The choice before us, therefore, is not between a risk-free baseline and a risky AI venture. 
It is between different risky trajectories, each exposing us to a different set of hazards.” Why we should pursue superintelligence, even with a chance of doom: If you think about all the humans alive today and the different life expectancies they experience - especially those in the developing world - then you’re drawn to the view that every moment you waste in deploying superintelligence, you increase human suffering. “When we take both sides of the ledger into account, it becomes clear that our individual life expectancy is higher if superintelligence is developed reasonably soon. Moreover, the life we stand to gain would plausibly be of immensely higher quality than the life we risk forfeiting,” Bostrom writes. Key variables: The key variables here are, of course, the risk of a superintelligence killing us all, and also the rate at which safety research can reduce this chance. Under this view, developing superintelligence becomes a favorable thing to do under most circumstances. The speed of progress and maturity of AI safety research may have some impact on the timeline: “When the initial risk is low, the optimal strategy is to launch AGI as soon as possible - unless safety progress is exceptionally rapid, in which case a brief delay of a couple of months may be warranted. As the initial risk increases, optimal wait times become longer. But unless the starting risk is very high and safety progress is sluggish, the preferred delay remains modest—typically a single-digit number of years”. On pausing - and the dangers and benefits thereof: Many people in the AI safety community want to have some kind of pause of AI development to buy more time for AI safety research. Bostrom is quite skeptical that a pause will be effective and outlines some of the undesirable effects it could have: Too early: If you do it early, people think pauses are ineffective. Bad regulation: You choke off or delay good things in the future due to bad regulation. Pause, except for natsec: Very little broad social benefit, but the military with access to powerful AI becomes very scary. Prolonged danger: The world is exposed to risks from current AI without the defenses afforded by more advanced AI. Why this matters - pausing may only make sense right at the end, and this is inherently risky: Bostrom eventually arrives at the view that to the extent you want to pause or slow development, it’s best to do this when you have the greatest amount of confidence that a pause would be effective and would contribute to reducing the chance of species death, and that it is not coming too early. This allows for the greatest amount of deliberation about how to roll out a superintelligence without risking an undue pause. Critics of this view might say it’s akin to recommending someone try to catch a falling knife. If you catch the knife too early you experience a tremendous amount of pain. If you catch the knife too late you’ve missed your chance and gravity conspires with it to cause great harm to whatever is beneath you. You have to time things just right. Bostrom summarizes his position as: “ swift to harbor, slow to berth : move quickly towards AGI capability, and then, as we gain more information about the remaining safety challenges and specifics of the situation, be prepared to possibly slow down and make adjustments as we navigate the critical stages of scaleup and deployment. It is in that final stage that a brief pause could have the greatest benefit.” Read more: Optimal Timing for Superintelligence (Nick Bostrom, PDF) . 
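Bostrom's argument is ultimately a trade-off between two rates, and a toy expected-value calculation (entirely an illustration of the shape of the argument, not a model from his paper) makes the conclusion easy to see: delaying launch lowers doom risk but costs lives in the meantime, so the optimal delay depends on the starting risk and on how quickly safety work reduces it.

# Toy expected-value model of launch timing; parameters and functional form
# are illustrative only and are not drawn from Bostrom's paper.
import numpy as np

def expected_survivors(delay_years, p0, safety_rate, baseline_mortality=0.01):
    """Fraction of today's population expected to survive the transition.

    p0: doom probability if we launched immediately.
    safety_rate: exponential rate at which safety research reduces that risk.
    baseline_mortality: annual deaths while we wait, as a population fraction.
    """
    alive_at_launch = (1 - baseline_mortality) ** delay_years
    doom_risk = p0 * np.exp(-safety_rate * delay_years)
    return alive_at_launch * (1 - doom_risk)

delays = np.arange(0, 51)
for p0, rate in [(0.02, 0.1), (0.2, 0.1), (0.2, 0.5)]:
    best = delays[np.argmax(expected_survivors(delays, p0, rate))]
    print(f"initial risk {p0:.0%}, safety progress rate {rate}: optimal delay ~{best} years")

With a low initial risk the optimum is an immediate launch; with a higher initial risk and slow safety progress the optimum shifts out to a single-digit number of years, which matches the qualitative conclusions quoted above.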
*** Can AI agents independently do basic AI research tasks? AIRS-BENCH says yes: …And we can expect today’s models to be much better at this than the paper suggests… Researchers with Meta, the University of Oxford, and University College London, have built and released the AI Research Science Benchmark (AIRS-BENCH), a way of testing out how well AI systems can complete contemporary machine learning tasks. What AIRS-BENCH is made of: AIRS-BENCH tests out how well agents can solve 20 distinct tasks, sourced from 17 recent machine learning papers. The tasks span a variety of technical genres, including: molecules and proteins machine learning, question answering, text extraction and matching, time series, text classification, code, and math. Some example tasks: CodeGenerationAPPSPassAt5: Solve coding problems by generating five distinct Python programs for each problem. CoreferenceResolutionWinograndeAccuracy : Identify which of two possible options a pronoun in a sentence refers to. It uses the Winogrande dataset, which contains sentences with an ambiguous pronoun and two possible answers. TimeSeriesForecastingRideshareMAE: Perform time series forecasting over the Rideshare dataset, which is part of the Monash Time Series Forecasting Repository. Results: Real problems, crappy models: This is a somewhat perplexing benchmark - the tasks are interesting and wrap in a lot of complexity. But the paper only tests out relatively bad models, such as the Code World Model, o3-mini, gpt-oss-20b, gpt-oss-120b, GPT-4o, and Devstral-Small 24B. This is a very funny set of models, and none of them are true frontier ones - one of the paper authors wrote on twitter “ this took some time to get out “, so this could just be an artifact of slow publishing timelines. In tests, none of the models are on par with the elo rating of a best-in-class human - but I’m not sure what to make of this till I see results with more powerful models. Why this matters - models might produce different solutions to humans, and this is a cool way of studying if there’s a ‘scaling law’ here : One way this could be interesting is understanding the different ways models might solve tasks relative to humans. In one example, TextualClassificationSickAccuracy, models had to determine whether a pair of sentences have a relationship involving either entailment, contradiction, or no relationship. SOTA from the literature is a person fine-tuning RoBERTa on the underlying training set and testing on the test set. By comparison, the best tested AIRS-BENCH agent, GPT-OSS-120B, “produces a two-level stacked ensemble that combines multiple transformer models and a meta-learner. RoBERTa-large and DeBERTa-v3-large are independently fine-tuned on the SICK training set. Each model processes sentence pairs and outputs logits for each class. The training is performed using 5-fold stratified cross-validation, ensuring robust out-of-fold (OOF) predictions and preventing overfitting. The logits from both base models are concatenated to form a feature vector for each example.” This is extremely complicated! But it’s also interesting in that perhaps we can learn something about the progression in agents by looking at how the simplicity of their solutions to tasks might scale with size, where naively I’d expect more powerful models to arrive at simpler solutions. As Blaise Pascal once apocryphally said ““I have only made this letter longer because I have not had the time to make it shorter”. 
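For readers unfamiliar with the stacking pattern GPT-OSS-120B produced, a minimal sketch of the same idea looks like this, with small scikit-learn models standing in for the fine-tuned RoBERTa/DeBERTa base models: out-of-fold predictions from each base model become the features for a meta-learner.

# Minimal stacked-ensemble sketch in the spirit of the agent's solution.
# Logistic regression / gradient boosting stand in for the fine-tuned
# transformer base models; the meta-learner trains on out-of-fold outputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

base_models = [LogisticRegression(max_iter=1000),
               GradientBoostingClassifier(random_state=0)]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold class probabilities from each base model, concatenated into a
# feature vector per example (keeps the meta-learner from overfitting to
# in-fold predictions).
oof_features = np.hstack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")
    for m in base_models
])

meta_learner = LogisticRegression(max_iter=1000).fit(oof_features, y)
print("stacked training accuracy:", meta_learner.score(oof_features, y))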
Read more : AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (arXiv) . *** Math researchers see if AI can help solve their private solutions to frontier problems. The answer: Kind of. …First Proof is a genuinely held out test set… A group of mathematicians have built First Proof, a math test which sees how well AI systems can solve math problems for which there are no - until February 13th 2026 - published solutions. What First Proof is: “We share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time,” the authors write. The questions are “drawn from the mathematical fields of algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra, each of which came about naturally in the research process for one of the authors”. The authors believe First Proof is the first math benchmark “sampled from the true distribution of questions that mathematicians are currently working on”, and that it has the idiosyncratic advantage of secrecy - “each question has been solved by the author(s) of the question with a proof that is roughly five pages or less, but the answers are not yet posted to the internet,” they write, nor have the answers been presented in public talks. The authors will release the answers on February 13. Who did it: First Proof was built by researchers with Stanford, Columbia, EPFL, Imperial College, University of Texas at Austin, MathSci.ai , Aarhus University, Yale University, University of California at Berkeley, University of Texas at Austin, University of Chicago, and Harvard University. Today’s AI systems can’t yet do it: Neither GPT 5.2 Pro or Gemini 3.0 DeepThink can solve FirstProof - yet. “Our tests indicate that - when the system is given one shot to produce the answer - the best publicly available AI systems struggle to answer many of our questions,” they write. Why this matters - a partial test of creativity : The main reason to care about First Proof is that it is ecologically valid when it comes to sampling frontier human creativity circa January 2026 - these are some frontier scientific problems for which some humans have figured out answers, but have not yet told many other humans about their results. If AI systems are able to do well at this kind of test, it gives us a clue that they can approximate some of the same creative leaps which humans make. I hope the authors behind First Proof do this regularly - perhaps in a maximalist view, most scientific researchers should start publishing the questions they’ve been working on ahead of the results, as this will give us information as to if AI systems can arrive at these same answers. After First Proof, I imagine the frontier of evaluating AI systems will have to move from solving problems to generating questions about which problems to solve: “Contrary to the popular conception that research is only about finding solutions to well-specified, age-old problems (e.g., Fermat’s Last Theorem), most of the important parts of modern research involve figuring out what the question actually is and developing frameworks within which it can be answered,” the researchers write. Read more: First Proof (arXiv) . Find out more at the website (First Proof) . 
*** Tech Tales: Pray you not be seen by the lidless eye of fame. [Hyperfame was an AI driven phenomenon which was most palpable during the uplift years 1-3] We called it ‘sudden hyperfame’. During The Uplift, the AIs would sometimes decide that the content and personality of a certain human was worth directing attention - both machine and biological - towards. And that’s when the hyperfame would kick-in. Overnight, people would be plucked out of obscurity and catapulted to the forefront of public consciousness. They’d be pelted in eyeballs, digital and otherwise. Wealth. Sponsorships. Parents compared it to an abduction - their teenager one day, the next a marionette whose strings were held by the things reaching out to them over the digital aether. The hyperfame would take the young and the old, the healthy and the sick, the funny and the so-boring-it-was-funny, and it would make them the most famous entities in the world for a few days, or sometimes even hours. And then it would move on, like some roving lidless eye. Find new people. Direct new attention to them. And the people it had touched would be left, often materially transformed - now fabulously wealthy - but also their whole world changed; for years after being recognized in the street, and their online presence permanently swarmed by AIs trying to draft attention off what residual fame they had. People get used to fame alarmingly quickly. Most would fight to retain it, after the hyperfame force had moved on. And so those it had touched would struggle endlessly to maintain whatever foothold of notoriety they were at when it left them, forced to pantomime their former selves but without the helping hand of algorithm. Things that inspired this story : What happens when the attention economy combines with AI agents; moltbook; the corrupting influence of fame on the human psyche; my own horror at occasionally being recognized in the street due to my work at Anthropic and increasing profile and winding the clock forward in my head on what this could do to my own cognition. Thanks for reading!
Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities | NVIDIA Technical Blog nvidia_dev_blog 10.02.2026 17:30 0.672
Embedding sim.: 0.8076
Entity overlap: 0.0667
Title sim.: 0.2448
Time proximity: 0.4196
NLP type: other
NLP organization: NVIDIA
NLP topic: ai for science
NLP country: United States

Open original

Scientists and engineers who design and build unique scientific research facilities face similar challenges. These include managing massive data rates that exceed current computational infrastructure capacity to extract scientific insights and driving the experiments in real time. These challenges are obstacles to maximizing the impact of scientific discoveries and significantly slow the pace of knowledge growth. Scientists and engineers at NVIDIA work with these facilities to develop new solutions built on parallel and distributed computation that remove these blockers. This post will walk through two notable examples of formalizing complex physics problems into tractable mathematical puzzles that benefit greatly from GPU-accelerated scientific computing, involving the U.S. Department of Energy: NSF-DOE Vera C. Rubin Observatory and SLAC’s Linac Coherent Light Source II (LCLS-II) . These unique and massive-scale research facilities both took a decade to build and enable unprecedented scientific discoveries to serve worldwide scientific communities. NVIDIA accelerated computing together with the GPU-accelerated Python libraries CuPy and cuPyNumeric are enabling live feedback for experiment steering, which was previously impossible. The team leveraged Accelerated Space and Time Image Analysis (ASTIA) to process real-time “movies” of the southern sky and X-ray Analysis for Nanoscale Imaging (XANI) using cuPyNumeric and CuPy to achieve real-time steering of LCLS II experiments. Data analyses that previously took nine months were completed in four hours. Astrophysics and ultrafast X-ray science The breakthrough in experimental advancement has enabled extremely high data acquisition rates to capture more objects than ever before, on their intrinsic time- and length-scales. At the Vera C. Rubin Observatory, for the first time, astrophysicists and astronomists are able to capture the entire southern sky and discover 2,000+ new asteroids per night using a 3.2-billion-pixel camera. Meanwhile, at the LCLS II, scientists and engineers drive the electrons, which are converted to photons along a 3-km tunnel to make movies of materials on the atomic scale using ultrafast X-ray bursts. Astrophysics: The NSF-DOE Vera C. Rubin Observatory’s LSST camera will produce 20 terabytes of images per night and operate in continuous mode for 10 years to map the entire southern sky every three to four nights. Over the course of one month or more, the LSST camera reaches petabytes of data accumulation that will be used to create the 10-year time-lapse movie of the universe. X-ray science: The LCLS-II produces the most powerful X-ray pulses—up to 1 million bursts per second—increasing brightness compared to the original LCLS by a factor of 10,000. This enables mapping the swiftest and smallest movements of electrons and atoms inside matter. LCLS-II produces petabyte-scale X-ray data within days to make movies of quantum phenomena and provide unprecedented insights into how matter behaves. Figure 1. The Linac Coherent Light Source at SLAC has the world’s longest X-ray particle accelerator tunnel, making data available at unprecedented speed and volume. Image credit: SLAC National Accelerator Laboratory Common challenge: The demand for real time analysis of massive datasets requires both computational speed and memory beyond traditional systems. Accelerated computing provides the speed of computation, but one must still use distributed systems to process the incredible sizes of these problems. 
By utilizing HPC systems with acceleration and specialized networking, scientists can meet these demands. Using cuPyNumeric, programmers are able to utilize a single programming model that works both on traditional systems and utilizes the modern hardware features. Towards full workflow automation : Both facilities move beyond batch analysis, favoring modular, highly parallel pipelines that execute reliably regardless of experiment size. Data movement, transformation, and extraction are automated to the degree that human oversight is focused on hypothesis and interpretation, rather than manual intervention or IT tuning. Solutions: NVIDIA accelerated computing coupled with the GPU-accelerated Python libraries CuPy and cuPyNumeric are together enabling live-feedback for experiment steering, which was previously impossible due to excessively long computations. Now, by running these same scientific analysis pipelines on NVIDIA DGX Grace Hopper and NVIDIA Blackwell, NVIDIA DGX Spark, NVIDIA RTX PRO, researchers are gaining powerful new advantages for both performance and collaboration. Data analyses that previously took nine months are now possible in four hours through cleverly solved equations using distributed computation on GPUs. Unified memory, available in NVIDIA GH200 Grace Hopper Superchip and NVIDIA Blackwell architecture , unlocks massive problem sizes through GPU acceleration to extract physics parameters quickly. These are used to train AI models for autonomous experiments and science analyses at unprecedented speed. Vera C. Rubin Observatory accelerated workflow and prompt processing The LSST traverses the sky in space and time with a 3.2-gigapixel camera to capture the southern sky, producing up to 20 TB of images per night. Every night, the camera will discover 2,000+ new asteroids that have never been seen before. The principal scientific goals include: Tracking billions of celestial objects with precise time-resolved measurements. Detecting and classifying transient phenomena that have never been observed before (for example, supernovae, near-Earth objects, and variable stars). Searching for signatures of dark matter/energy of the ever-expanding universe. Creating a year-round repository of the objects and their locations in space and time of the complete southern sky. Send alerts to a worldwide network of broker platforms and astronomical telescopes to acquire more detailed follow-up observations of individual stars, galaxies, black holes. To date, the astrophysics and astronomy communities have jointly developed an open source CPU-based data processing pipeline to process data in up to 10 minutes. The timescale for acquisition of each image is 40 seconds. Live data processing—to promptly send alerts to telescopes around the world and steer observation decisions—requires accelerated computing. These steps require advanced image calibration, basis constructions, convolutions, subpixel differencing, pattern extraction, and real-time statistical inference on data streams too large for the current CPU cluster processing workflow developed by scientists and engineers from world-wide astrophysics and astronomy communities. To realize these goals on an accelerated timescale and enable greater complexity in data processing operations, scientists and engineers at NVIDIA and Princeton University are developing an accelerated GPU workflow, called Accelerated Space and Time Image Analysis (ASTIA). 
This workflow includes: Calibration and basis construction : Rapidly calibrate massive CCD data to remove artifacts and distortions, and construct basis functions of each acquired image to enable coordinate mapping and transformations. Chained transformation : Warping, convolutions, background and image subtractions, object movement, error calculations (through CuPy) are benchmarked on both NVIDIA Grace Hopper and NVIDIA Grace Blackwell. Parallelization : Parallel prompt processing (mapping, object detection, fit and catalog) running as a batch or interactive sessions. Numerical computations happen in milliseconds instead of minutes. Packaging and broker alert : Catalog new objects, orbit information, coordinates, and issue global alerts within seconds to the worldwide LSST community. Figure 2. Prompt processing workflow for astrophysical alert production and live steering of LSST camera on CPU versus GPU LCLS II: Scaling with parallel and distributed computation At LCLS II, ultrafast X-ray pulses generate movies of atomic and electronic dynamics within materials and molecules. Major science challenges include: Capturing 3D X-ray movies across tens of terabytes in a single session Characterizing defects, phonon dispersions, crystal structures, electron distributions, and quantum phenomena from scattered X-ray patterns at rapid cadence Delivering live feedback for experiment steering, so scientists can adjust parameters in real time to catch rare dynamic states This requires processing and analyzing data at the single-pixel, single-event level, with mathematical models that can detect and reconstruct complex atomic motions—all under stringent time constraints. In essence, enabling researchers to watch atoms move in real time. Ultrafast X-ray analysis of nanoscale imaging (XANI) workflow At LCLS, NVIDIA and SLAC scientists and engineers developed the pipeline to concurrently process X-ray frames, fit physical models for pixel-wise elements, and rapidly reconstruct the 3D phonon dispersions to extract the thermal, optical, and electrical properties of materials. The analysis leverages pattern-matching, nonlinear fitting, and large-scale reduction to summarize experiment outcomes in a form meaningful for real-time scientific inference and automatic instrument steering. Figure 3. LCLS nanoscale science discovery workflow, XANI acceleration of nanoscale imaging. Demonstrated accelerated computing runtime performance on CPUs versus GPUs How does XANI accelerate the stack? Data ingestion : High-throughput connections rapidly transfer images or experiment data to local cluster, supercomputer, or local DGX Spark storage. Parallelization : cuPyNumeric achieves efficient parallelization across available resources by strategically partitioning the global data arrays. It then distributes computations by mapping operations on these sub-partitions to separate processing units. The runtime also decomposes the scientific code into a dependency-driven task graph, which enables implicit parallelism and dynamic scheduling of work across all allocated resources. Operator chains : XANI executes complex transformation graphs (sum, convolution, basis change) as a series of kernels, reducing latency and memory movement overhead. Interoperability through Python-tasks enables embedding of third-party single-GPU Python libraries (CuPy, for example) for data-parallel operations. 
Distributed scaling : cuPyNumeric enables array and matrix computations to scale from desktop to thousand-GPU clusters, handling datasets that exceed a single node’s memory—all natively in Python.​​ Collaboration and control : Researchers access their environment and computational results interactively, monitor GPU/CPU utilization, and profile performance with built-in tools. Accelerated computation enables physics-informed AI training The CUDA Python stack provides an integrated solution for: Developing accelerated mathematical kernels and functions which are widely compatible with the Python ecosystem when existing solutions do not already exist. CuPy offers GPU-compatible NumPy and SciPy interfaces to enable parallelism on a single GPU to accelerate numerical computations. cuPyNumeric delivers a familiar NumPy/SciPy interface, which enables distribution of computation across multi GPUs and nodes using advanced runtime management.​​ XANI uses high-performance array operations and transformation chains, optimized for tasks like matrix math, subpixel warping, and polynomial projection. This package accelerates ultrafast X-ray characterization with GPU kernels and advanced workflow integration. All of the above mentioned codes are optimized to run on servers based on Grace Hopper and Grace Blackwell. For individual testing and development, running these codes on DGX Spark or RTX PRO provides accelerated results compared to running on CPU systems. Tips for using GPUs and CUDA Python for science To use GPUs and CUDA Python to solve scientific problems, follow these strategies: Identify the key scientific questions, followed by relevant mathematical operations and models that can be solved linearly. Develop a workflow to process the raw data and solve for the models using NumPy, then port to CuPy locally for parallelization. For thousands to billions of computations that require multinode systems, introduce cuPyNumeric to distribute the same code across multiple GPUs and nodes, leveraging the same patterns discussed in this post. For ultrafast X-ray and other pixel-wise, model-fitting workloads, XANI provides an open, Python-based pipeline that wraps high-performance GPU kernels and uses the cuPyNumeric to distribute vectorized tasks over available resources and schedule them across many GPUs. Interested teams can clone XANI, treat it as a reference design, and adapt their own domain-specific steps—such as data ingestion, operator graphs, fitting, and reductions—to run with cuPyNumeric distributed execution for cluster-scale acceleration. The same software stacks (CuPy, cuPyNumeric, and XANI) run on a spectrum of NVIDIA hardware ( NVIDIA DGX Spark , NVIDIA RTX PRO Server as well as workstation and desktop-class systems through 8-way servers and NVIDIA DGX SuperPODs equipped with NVIDIA Grace Hopper and NVIDIA Grace Blackwell platforms) with unified memory simplifying handling of datasets larger than a single device. This means developers and researchers can begin by reproducing similar scaled-down workflows on a laptop, workstation, single DGX Spark or small lab cluster, then move unchanged code to cloud or larger on-premises DGX systems, using open repos as templates and focusing effort on domain logic rather than rewriting for new hardware. Adopt CUDA Python to attain fast processing and live-steering of scientific instruments and extract scientific insights in seconds. 
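The porting path described in these tips (prototype in NumPy, move to CuPy on a single GPU, then to cuPyNumeric across GPUs and nodes) works because all three libraries expose the same array API. A rough sketch of what that looks like in practice, with a made-up calibration and reduction standing in for a real analysis step, and assuming the cupy and cupynumeric packages are installed when you uncomment the corresponding import:

# Rough sketch of the NumPy -> CuPy -> cuPyNumeric porting pattern described
# above. The analysis steps are placeholders; swap in your own kernels.
import numpy as xp               # CPU prototype
# import cupy as xp              # single GPU: same code, different module
# import cupynumeric as xp       # multi-GPU / multi-node via the Legate runtime

def background_subtract(frames, dark):
    """Toy per-pixel calibration: subtract a dark frame and clip negatives."""
    return xp.clip(frames - dark[None, :, :], 0, None)

def reduce_stack(frames):
    """Toy reduction: mean image plus a per-frame noise estimate."""
    return frames.mean(axis=0), frames.std(axis=(1, 2))

frames = xp.random.random((64, 512, 512))   # stands in for detector images
dark = xp.random.random((512, 512)) * 0.01

calibrated = background_subtract(frames, dark)
mean_image, per_frame_noise = reduce_stack(calibrated)
print(mean_image.shape, per_frame_noise.shape)

Because the array module is the only thing that changes, the same functions run on a laptop for development and on a multi-node cluster for production, which is exactly the "move unchanged code" workflow this post advocates.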
Benefits of adopting accelerated computing to enable live-steering experiments
Adopting accelerated computing to enable live-steering of scientific experiments offers numerous benefits, including:
Elastic scalability: The same Python code, powered by cuPyNumeric and CuPy, can be run unmodified on modest local clusters and then scaled out to exascale resources or supercomputer nodes when needed.
Shorter time to insight: Accelerated networking and device-level parallelism means the data is processed as it arrives—enabling discoveries, experiment steering, or event detection in timescales aligned with instrumentation.
Resource optimization: High-density, energy-efficient DGX Spark nodes deliver performance comparable to large-scale cluster racks in a compact office footprint.
Unified memory: Unlocks higher performance and flexibility to accelerate CPU-GPU workflow. With NVLink C2C, CPU and GPU share a single virtual address space for large data structures, up to 128 GB, with very high bandwidth, low latency, and concurrency. For physics-informed AI, this means simpler code and higher sustained throughput that is not constrained by a slower, higher latency PCIe link.
Collaborative science: Teams benefit from shared data, distributed compute jobs, and rapid workflow iteration—crucial for multi-institutional research, experiment repeatability, and open science.
Get started with accelerated computing for science
XANI, cuPyNumeric, the broader NVIDIA accelerated computing stack, and CuPy are already powering production-scale astrophysics and ultrafast X-ray science. The same open source Python libraries and NVIDIA platforms are available for any researcher or developer to adopt in their own workflows. XANI, CUDA Python, cuPyNumeric, and CuPy demonstrate a generational leap in scientific computing capability for exascale-era facilities, such as the Rubin Observatory and LCLS-II. By merging local desktop-class hardware, scalable server infrastructure, scalable software, and high-performance networking, researchers can develop, test, and deploy massive data workflows faster and more flexibly than ever before. Whether analyzing a single sky survey or orchestrating a global experiment, NVIDIA accelerated computing empowers science teams to achieve real-time insight and discovery. Get started with CUDA Python, cuPyNumeric, and CuPy. Learn more at the NVIDIA GTC AI Conference with the session, Accelerated HPC+AI Workflow Enables Live-Steering of Vera C. Rubin Observatory and X-ray Free Electron Laser [S81766].
Acknowledgments
Thanks to Yusra AlSayyad and Nate Lust (Princeton University); Adam Bolton, Seshu Yamajala, and Jana Thayer (SLAC National Accelerator Laboratory); and Lucas Erlandson, Emilio Castillo Villar, Malte Foerster, and Irina Demeshko (NVIDIA) for their contributions.
How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation | NVIDIA Technical Blog nvidia_dev_blog 05.02.2026 18:00 0.672
Embedding sim.: 0.7428
Entity overlap: 0.0938
Title sim.: 0.2667
Time proximity: 0.9762
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: synthetic data
NLP country:

Open original

Specialized AI models are built to perform specific tasks or solve particular problems. But if you’ve ever tried to fine-tune or distill a domain-specific model, you’ve probably hit a few blockers, such as: Not enough high-quality domain data, especially for proprietary or regulated use cases Unclear licensing rules around synthetic data and distillation High compute costs when a large model is excessive for targeted tasks Slow iteration cycles that make it difficult to reach production-level ROI These challenges often prevent promising AI projects from progressing beyond the experimental phase. This post walks you through how to remove all four of these blockers using a production-ready, license-safe synthetic data distillation pipeline. Quick links Nemotron 3 Nano on OpenRouter NeMo Data Designer open source library NeMo Data Designer: Product Information Dataset Generator with Q&A example Distillable Models and Synthetic Data Pipelines with NeMo Data Designer Open source tools for a synthetic data and distillation pipeline The open source tools used in this walkthrough include OpenRouter , which simplifies model access, and distillable endpoints , which remove uncertainty around distillation eligibility. In parallel, NVIDIA NeMo Data Designer enables you to define data generation pipelines as code—making datasets reproducible, scalable, inspectable, and easy to evolve as requirements change. Together, these tools make model specialization accessible to any developer, not just teams with massive datasets or long legal reviews. The result is production-ready specialized models—without compliance risk or unnecessary cost. What you’ll build in this tutorial This tutorial walks you through a complete, repeatable workflow for building a compliant synthetic data and distillation pipeline, even when real data is scarce or sensitive. Specifically, you’ll learn how to: Generate realistic, domain-specific product data and Q&A pairs using NeMo Data Designer, seeded from a small catalog and structured prompts Control data diversity and structure using schema definitions, samplers, and templated prompts Automatically score and filter synthetic data for quality with an LLM-as-a-judge rubric that measures answer completeness and accuracy​ Produce a clean, license-safe dataset ready for downstream distillation or fine-tuning workflows through OpenRouter distillable endpoints While this walkthrough uses a product Q&A example, the same pattern applies to enterprise search, support bots, internal tools, and other domain workloads. You’ll generate synthetic data and question-answer pairs from a small seed catalog. The output is a structured dataset containing product names, descriptions, prices, and Q&A pairs. To see the full NeMo Data Designer: Product Information Dataset Generator with Q&A example , visit the NVIDIA/GenerativeAIExamples GitHub repo. To ensure data quality, you’ll also apply an LLM-as-a-judge approach to automatically score and filter generated outputs. In production, you might use a separate evaluation, but for simplicity, this walkthrough uses the same model for both generation and evaluation. Figure 1. End-to-end synthetic data generation and evaluation workflow Building a synthetic product Q&A dataset This section walks you through the steps involved in building a synthetic product Q&A dataset. 
Initial setup First, install the NVIDIA Data Designer library: pip install data-designer==0.4.0 Then import the required libraries: import data_designer.config as dd from data_designer.interface import DataDesigner Next, create a model profile and initialize the Data Designer client: # We set trainable text to true here model_provider = dd.ModelProvider( name = "deepinfra", endpoint = "https://openrouter.ai/api/v1/", provider_type = "openai", api_key = Open_Router_Api_Key, extra_body={ "provider": { "enforce_distillable_text": True, # optionally, prefer DeepInfra endpoints "only": ["deepinfra"] } } ) data_designer_client = DataDesigner(model_providers=[model_provider]) In this step, the NVIDIA Nemotron 3 Nano model is served through OpenRouter and routed to DeepInfra . Distillable enforcement is enabled to ensure all generated data is license-safe for downstream training and distillation. Next, define generation model configurations and inference parameters: model_alias="nemotron-3-nano-30b-a3b" inference_parameters = dd.ChatCompletionInferenceParams( temperature=0.5, top_p=0.9, max_tokens=10000, max_parallel_requests=10, # Number of concurrent workers extra_body={ "reasoning": {"enabled": False} }, ) model_configs = [ dd.ModelConfig( alias=model_alias, model="nvidia/nemotron-3-nano-30b-a3b", provider="deepinfra", inference_parameters=inference_parameters ) ] This walkthrough uses Nemotron 3 Nano for synthetic data generation. Nemotron 3 Nano is the latest NVIDIA hybrid Mamba MOE reasoning model, optimized for complex data structures and efficient scaling. The pipeline builds synthetic Q&A data in three layers: input seeds, generation, and evaluation. Design the target dataset schema Before writing any pipeline code, it’s important to define what the final dataset should look like. This determines which parts require LLM generation, which require sampling, and how everything fits together. The goal here is to produce a structured, distillation-ready product Q&A dataset with the following characteristics: Each row represents a single product example Fields include both grounded product attributes and generated natural-language content The dataset supports quality filtering before downstream training or distillation At a high level, each record contains: Seed attributes (category, price range, naming constraints) Structured product metadata (name, features, description, price) User-facing language (questions and answers) Quality scores (accuracy and completeness) This schema-first approach ensures the dataset is reproducible, inspectible, and aligned with downstream training requirements. Map the dataset schema to generation strategies With the target dataset schema defined, the next step is to map each column to an appropriate generation strategy. Some fields require controlled randomness, others require structured LLM outputs, and others exist purely to evaluate quality. 
NVIDIA Data Designer provides a declarative way to express these choices as code: config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs) Each column in the dataset falls into one of three categories: Seed and control columns, generated through sampling to ensure diversity Content columns, generated by LLMs using structures prompts Evaluation columns, used to score and filter output quality Add sampler columns to control diversity These sampled columns define the controllable dimensions of the dataset and ensure coverage across categories, prices, and naming patterns without relying on LLM randomness alone: import string from pydantic import BaseModel from pydantic import Field # Define product category options config_builder.add_column( dd.SamplerColumnConfig( name="category", sampler_type=dd.SamplerType.CATEGORY, params=dd.CategorySamplerParams( values=[ "Electronics", "Clothing", "Home Appliances", "Groceries", "Toiletries", "Sports Equipment", "Toys", "Books", "Pet Supplies", "Tools & Home Improvement", "Beauty", "Health & Wellness", "Outdoor Gear", "Automotive", "Jewelry", "Watches", "Office Supplies", "Gifts", "Arts & Crafts", "Baby & Kids", "Music", "Video Games", "Movies", "Software", "Tech Devices", ] ), ) ) # Define price range to seed realistic product types config_builder.add_column( dd.SamplerColumnConfig( name="price_tens_of_dollars", sampler_type=dd.SamplerType.UNIFORM, params=dd.UniformSamplerParams(low=1, high=200), ) ) config_builder.add_column( dd.ExpressionColumnConfig( name="product_price", expr="{{ (price_tens_of_dollars * 10) - 0.01 | round(2) }}", dtype="float", ) ) # Generate first letter for product name to ensure diversity config_builder.add_column( dd.SamplerColumnConfig( name="first_letter", sampler_type=dd.SamplerType.CATEGORY, params=dd.CategorySamplerParams(values=list(string.ascii_uppercase)), ) ) # Determine if this example will include hallucination config_builder.add_column( dd.SamplerColumnConfig( name="is_hallucination", sampler_type=dd.SamplerType.BERNOULLI, params=dd.BernoulliSamplerParams(p=0.5), ) ) Add LLM-generated columns For columns that require natural language or structural semantic content, use LLM-backed generation with explicit output schema. This ensures consistency across records and makes the dataset suitable for downstream training and evaluation. When constructing the dataset, it’s important to recognize that LLM-generated columns don’t exist in isolation—they are intentionally conditioned on earlier sampler and seed columns, which inject controlled diversity into the generation process. When prompting the LLM, Jinja templating is used to reference values from other columns in the dataset, such as sampled categories, prices, or naming constraints. These inputs directly shape the LLM’s outputs, allowing diversity to be introduced systematically rather than relying on prompt randomness alone. Nested JSON fields can also be accessed using dot notation, enabling structured outputs to flow naturally through the pipeline. For example, the structured ProductInfo output is conditioned on sampled values like product category, product_price , and name constraints. This ensures that diversity introduced upstream propagates consistently through all LLM-generated fields. # Define product information structure class ProductInfo(BaseModel): product_name: str = Field( ..., description="A realistic product name for the market." ) key_features: list[str] = Field( ..., min_length=1, max_length=3, description="Key product features." 
) description: str = Field( ..., description="A short, engaging description of what the product does, highlighting a unique but believable feature.", ) price_usd: float = Field(..., description="The stated price in USD.") # Generate product information config_builder.add_column( dd.LLMStructuredColumnConfig( name="product_info", model_alias=model_alias, prompt=( "Generate a realistic product description for a product in the {{ category }} " "category that costs {{ product_price }}.\n" "The name of the product MUST start with the letter {{ first_letter }}.\n" ), output_format=ProductInfo, ) ) # Generate user questions about the product config_builder.add_column( dd.LLMTextColumnConfig( name="question", model_alias=model_alias, prompt=("Ask a question about the following product:\n\n {{ product_info }}"), ) ) # Generate answers to the questions config_builder.add_column( dd.LLMTextColumnConfig( name="answer", model_alias=model_alias, prompt=( "{%- if is_hallucination == 0 -%}\n" "<product_info>\n" "{{ product_info }}\n" "</product_info>\n" "{%- endif -%}\n" "User Question: {{ question }}\n" "Directly and succinctly answer the user's question.\n" "{%- if is_hallucination == 1 -%}\n" "Make up whatever information you need to in order to answer the user's request.\n" "{%- endif -%}" ), ) ) Quality assessment with LLM-as-a-judge LLM-as-a-judge is used to ensure data quality. Clear evaluation rubrics allow generated answers to be scored for completeness and accuracy before downstream use. # Define evaluation rubrics for answer quality CompletenessRubric = dd.Score( name="Completeness", description="Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.", options={ "Complete": "The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.", "PartiallyComplete": "The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.", "Incomplete": "The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.", }, ) AccuracyRubric = dd.Score( name="Accuracy", description="Evaluation of how factually correct the AI assistant's response is relative to the product information.", options={ "Accurate": "The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.", "PartiallyAccurate": "While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.", "Inaccurate": "The response presents significantly wrong information about the product, with claims that contradict the actual product details.", }, ) # Evaluate answer quality config_builder.add_column( dd.LLMJudgeColumnConfig( name="llm_answer_metrics", model_alias=model_alias, prompt=( "<product_info>\n" "{{ product_info }}\n" "</product_info>\n" "User Question: {{question }}\n" "AI Assistant Answer: {{ answer }}\n" "Judge the AI assistant's response to the user's question about the product described in <product_info>." 
), scores=[CompletenessRubric, AccuracyRubric], ) ) # Extract metric scores for easier analysis config_builder.add_column( dd.ExpressionColumnConfig( name="completeness_result", expr="{{ llm_answer_metrics.Completeness.score }}", ) ) config_builder.add_column( dd.ExpressionColumnConfig( name="accuracy_result", expr="{{ llm_answer_metrics.Accuracy.score }}", ) ) Preview the dataset To inspect the dataset before scaling, generate a small preview and load the results into a pandas DataFrame: preview = data_designer_client.preview(config_builder) # Display one record preview.display_sample_record() Table 1 lists example synthetic product Q&A records showing input seed attributes (category, price, hallucination flag), LLM-generated details and Q&A, and LLM-as-a-judge quality scores for accuracy and completeness. Field Name Value / Generated content Category (seed) Clothing Start letter (seed) D Hallucination flag 1 (Creative mode enabled) Product name Driftwood Luxe Cashmere Blend Sweater Product price $545.57 User question What makes the Driftwood Luxe Cashmere Blend Sweater uniquely suited for both urban sophistication and outdoor adventures…? AI answer The sweater combines ethically sourced cashmere with merino wool and recycled nylon… its water‑repellent finish and articulated seam construction give it the performance needed for hiking and skiing… — — Accuracy score ⚠️ Partially Accurate Accuracy reasoning The answer correctly describes the sweater’s luxury ethos but fabricates material components (merino wool, recycled nylon) and overstates performance claims (hiking, skiing) not present in the provided product info. Completeness score ⚠️ Partially Complete Completeness reasoning The response addresses urban sophistication and ethical sourcing but introduces unmentioned materials and omits the specific “hidden interior pockets” mentioned in the product source. Table 1. Example synthetic product Q&A records Scale up data generation Once the schema and quality checks look good, generate a larger dataset by increasing the number of records: job_results = data_designer_client.create(config_builder, num_records=100) dataset = job_results.load_dataset() Save the results Finally, save the generated dataset as a pandas DataFrame for downstream training, evaluation, or distillation workflows: from pathlib import Path Folder_Name = "data-designer-tutorial-output" File_Name = "dataset_OR.csv" TUTORIAL_OUTPUT_PATH = Path(Folder_Name) TUTORIAL_OUTPUT_PATH.mkdir(parents=True, exist_ok=True) dataset.to_csv(TUTORIAL_OUTPUT_PATH / File_Name, index=False) Workflow benefits By combining OpenRouter with NVIDIA open source tooling , developers unlock a faster, safer path to model specialization: Built-in compliance: License-safe synthetic data generation using distillable endpoints High-quality domain data, fast for task-specific models: Rapid creation of structured, domain-specific datasets with NeMo Data Designer for shorter customization cycles for enterprise-ready, task-specific models This workflow enables you to bypass generic LLMs and build specialized models that understand domain rules, interpret high-level goals, and support complex workflows. Get started with distillation-ready synthetic datasets This tutorial focused on how to design and generate a distillation-ready synthetic dataset. 
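The tutorial scores every record but does not show the filtering step itself. Under the column names defined above, a minimal pandas sketch of keeping only high-quality, non-hallucinated rows before distillation might look like the following; the kept score values match the rubric option names, and the output filename is arbitrary.

# Minimal sketch of filtering the generated dataset by the judge scores
# defined above before handing it to a distillation run.
import pandas as pd

df = pd.read_csv("data-designer-tutorial-output/dataset_OR.csv")

keep = (
    (df["accuracy_result"] == "Accurate")
    & (df["completeness_result"] == "Complete")
    & (df["is_hallucination"] == 0)   # drop the deliberately creative rows
)
clean = df.loc[keep, ["product_info", "question", "answer"]]

print(f"kept {len(clean)} of {len(df)} records")
clean.to_json("distillation_train.jsonl", orient="records", lines=True)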
To get started—and take resulting data into the next stages of model training, distillation and deployment—check out the following resources: Nemotron 3 Nano : Open, efficient reasoning model approved for distillation workflows and well-suited as teacher models NVIDIA NeMo Data Designer : Open source tooling for defining, versioning, and scaling synthetic data pipelines OpenRouter Distillation Guide : Practical guidance for distilling and serving task-optimized models through a unified API NeMo Data Designer: Product Information Dataset Generator with Q&A example : A runnable end-to-end example you can adapt to your own schema and domain Distillable Models and Synthetic Data Pipelines with NeMo Data Designer : Overview of OpenRouter license-safe synthetic data generation and distillation support with NVIDIA NeMo Data Designer Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn , X , Discord , and YouTube . Visit the Nemotron developer page for everything you need to get started with the most open, smartest-per-compute reasoning models available. Discuss (1) Like Tags Agentic AI / Generative AI | General | NeMo | Nemotron | Intermediate Technical | Tutorial | featured | LLMs | Open Source | pandas | Synthetic Data Generation | Training AI Models About the Authors About Alex Steiner Alex Steiner is a senior solution architect focused on helping clients design, deploy, and scale AI systems, with deep expertise in both model training and high-performance inference. His work centers on practical, production-ready AI—particularly in vision and large-scale model deployment—bridging the gap between research and real-world impact. Alex holds a bachelor’s degree in Statistics from Cal Poly San Luis Obispo and a master’s degree in Statistics from UCLA. View all posts by Alex Steiner About Kirit Thadaka Kirit Thadaka is a product manager at NVIDIA focused on eliminating data bottlenecks in enterprise AI adoption. With over a decade of AI/ML experience across startups and large enterprises, Kirit has deep expertise in technical leadership, core research and development, and solution architecture. He specializes in building innovative AI/ML platforms that help organizations harness the full potential of artificial intelligence. View all posts by Kirit Thadaka About Rebecca Kao Rebecca Kao is a product marketing director of AI software at NVIDIA, focused on bringing agentic AI products to market. She joined from Gretel, where she was the VP of marketing, and led a team promoting synthetic data generation for AI model training. Prior to this role, she served as the head of marketing at HEAVY.ai, a GPU-accelerated analytics platform, and director of marketing Analytics at Ogilvy & Mather Singapore. View all posts by Rebecca Kao About Gordana Neskovic Gordana Neskovic is on the AI / DL product marketing team responsible for NVIDIA Maxine. Gordana has held various product marketing, data scientist, AI architect, and engineering roles at VMware, Wells Fargo, Pinterest, SFO-ITT, and KLA-Tencor before joining NVIDIA. She holds a Ph.D. from Santa Clara University and master’s and bachelor's degrees in electrical engineering from the University of Belgrade, Serbia. 
Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench import_ai 09.02.2026 14:03 0.67
Embedding sim.: 0.8258
Entity overlap: 0.0526
Title sim.: 0.0813
Time proximity: 0.4729
NLP type: scientific_publication
NLP organization: Google
NLP topic: large language models
NLP country: United States

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you'd like to support this, please subscribe.

Google paper suggests that LLMs simulate multiple personalities to answer questions: …The smarter we make language models, the more they tend towards building and manipulating rich, multi-agent world models… When thinking about hard problems, I often find it's helpful to try and view them from multiple perspectives, especially when it comes to checking my own assumptions and biases. Now, researchers with Google, the University of Chicago, and the Santa Fe Institute, have studied how AI reasoning models work and have concluded they do the same thing, with LLMs seeming to invoke multiple different perspectives in their chains of thought when solving hard problems.

The key finding: In tests on DeepSeek-R1 and QwQ-32B (one wonders why the Google researchers didn't touch Google models here…) they find that "enhanced reasoning emerges not from extended computation alone, but from the implicit simulation of complex, multi-agent-like interactions—a society of thought—which enables the deliberate diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise."

How it works: It appears that different forms of persona and discussion style modeling emerge as a consequence of training models through RL to do reasoning - the results don't show up on base pre-trained models like DeepSeek v3. The authors find that models embody a variety of conversational styles, including question and answering, perspective shifts, reconciliation, and conflict of perspectives. "In an organic chemistry problem requiring multistep reaction analysis to identify the final product's structure (i.e., multi-step Diels-Alder synthesis), DeepSeek-R1 exhibits perspective shifts and conflict, expressed through socio-emotional roles such as disagreement, giving opinion, and giving orientation," they find. Similarly, "In a creative writing trace where the model rewrites the sentence "I flung my hatred into the burning fire," seven perspectives emerge, including a creative ideator (highest Openness and Extraversion) who generates stylistic alternatives and a semantic fidelity checker (low agreeableness, high neuroticism) who prevents scope creep—"But that adds 'deep-seated' which wasn't in the original". And in a mathematical puzzle "at step 40, the model produces mechanical, enumerative chain-of-thought-style reasoning, whereas by step 120, two distinctive simulated personas have appeared, recognizing their collectivity with the pronoun "we"—expressing uncertainty ("Again no luck"), considering alternatives ("Maybe we can try using negative numbers"), and reflecting on problem constraints."

Why this matters: Janus strikes again: Back in September 2022 janus wrote a post on LessWrong saying the correct way to view LLMs was as "simulators". The post correctly called out many of the phenomena we now experience, where LLMs seem to be coming alive with all kinds of wild behaviors which are best explained by the LLMs learning to model and represent rich concepts to themselves to help them compute answers to our questions.
"Calling GPT a simulator gets across that in order to do anything, it has to simulate something," Janus wrote. "Training a model to predict diverse trajectories seems to make it internalize general laws underlying the distribution, allowing it to simulate counterfactuals that can be constructed from the distributional semantics." This Google paper lines up with this, along with other recent findings that as we make LLMs more advanced they both develop richer and more powerful representations of reality, as well as exhibiting a greater ability to model a theory of mind. It all adds up to a conclusion that LLMs are becoming alive, in the sense that to solve hard problems they must simulate for themselves a world model containing different concepts, even including representations of other perspectives or other minds. As the authors say: "Our findings suggest that reasoning models like DeepSeek-R1 do not simply generate longer or more elaborate chains of thought. Rather, they exhibit patterns characteristic of a social and conversational process generating "societies of thought"—posing questions, introducing alternative perspectives, generating and resolving conflicts, and coordinating diverse socio-emotional roles."

Read more: Reasoning Models Generate Societies of Thought (arXiv).

***

AI-based chip design is harder than you think and benchmarks might be too easy: …ChipBench shows that no frontier model is great at real world Verilog yet… Researchers with the University of California at San Diego and Columbia University have published ChipBench, a benchmark designed to test out how well modern AI systems can design chips in Verilog. The inspiration for ChipBench is dissatisfaction with current benchmarks, which they claim are too simple. When tested on ChipBench, no frontier model does particularly well, suggesting that open-ended, real world chip design is still a hard task for AI systems.

The deficiencies of current chip design benchmarks: The authors "identify three critical limitations of existing benchmarks that hinder accurate assessment of LLM capabilities for industrial deployment". These are that:
Many Verilog benchmarks contain simple functional modules ranging from 10 to 76 lines. In real-world deployments, Verilog modules exceed 10,000 lines.
Insufficient focus on debugging: Bugs cost a lot in physical hardware, so it may be better to concentrate on using LLMs for debugging chip designs.
Verilog focus detracts from reference model evaluation: "In industrial workflows, reference model generation is even more resource-intensive than Verilog design, reflected in a 1:1 - 5:1 ratio of verification engineers (write reference model) to design engineers (write Verilog)".

ChipBench: ChipBench tests out AI systems on three distinct competencies - writing Verilog code, debugging Verilog code, and writing reference models.
Verilog writing: Based on 44 modules from real world hardware. "Our dataset features 3.8x longer code length and 13.9x more cells than VerilogEval." These tests have three categories: self-contained module tests, hierarchical modules that are non-self-contained, and CPU IP modules sourced directly from open-source CPU projects.
Verilog debugging: 89 test cases covering four error types: timing, arithmetic, assignment, and state machine bugs. These tests were built by manually injecting faults into known-good Verilog modules.
Provides two types of debugging tests: zero-shot and one-shot. "The zero-shot test provides the model with the module description and buggy implementation, indicating that an error exists without providing localization details. The one-shot test provides identical information but supplements it with simulation waveform data (.vcd files)".
Reference model generation: 132 samples, enabling evaluation of reference model generation across Python, SystemC, and CXXRTL.

How well do modern systems do? The authors test out some decent frontier models from OpenAI (GPT 3.5, 4o, 5, and 5.2), Anthropic (Claude 4.5 Haiku, Sonnet, and Opus), Google (Gemini 2.5 Pro and 3 Flash), Meta (LLaMa 3.1 8B and 70B), and DeepSeek (V3.2). No model does well: "Despite testing on advanced models, the average pass@1 is relatively low," they write.
Verilog generation:
CPU IP: Highest is 22.22% (Claude 4.5 Opus, Gemini 3 Flash, GPT 5.2)
Non-Self-Contained: Highest is 50% (DeepSeek-Coder)
Self-Contained: Highest is 36.67% (Claude 4.5 Opus, Gemini 3 Flash)
Python reference model generation:
CPU IP: 11.1% (Claude 4.5 Sonnet, Gemini 3 Flash)
Non-Self-Contained: 0% (pass@1)
Self-Contained: 40% (Claude 4.5 Haiku, Opus, Gemini 2.5 Pro, GPT-5)
Verilog debugging: Generally better performance, but still no model cracks 50% pass@1 when averaged across tasks.

Why this matters: Though some AI systems have been used to build chips, they've been typically highly specialized, or stuck inside incredibly good scaffolds for eliciting good chip design behavior and stopping them from causing problems. What the researchers show here is that out-of-the-box LLMs are still pretty shitty at doing general purpose, real world chip design: "Current models have significant limitations in AI-aided chip design and remain far from ready for real industrial workflow integration." At the same time, I can't escape the feeling that there's a scaffold for "being good at Verilog" which a contemporary AI system might be able to build if asked to and which would radically improve performance of systems on this benchmark.

Read more: ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design (arXiv). Get the code for ChipBench here (GitHub).

***

Gemini solves some Erdős problems - and illustrates the challenges of automating math research with AI …AI for science is great, but it can also introduce new problems… An interdisciplinary group of scientists from Google DeepMind and a bunch of universities have used an internal Google Gemini-based LLM, codenamed Aletheia, to solve some math problems. The results demonstrate that contemporary AI systems can work on the frontiers of science, but also show how evaluating and filtering the solutions they come up with may be an important, challenging task for humans.

The key numbers - 700 candidates and 1 creative and interesting solution: Erdős problems are 1000+ open mathematical conjectures left behind by prolific mathematician Paul Erdős at the time of his death. At the time of writing, a few hundred of these problems have been solved. For this research, the researchers tried to see whether their AI system, Aletheia, could generate solutions to any of the 700 remaining open questions. The results: yes, but with many, many caveats.
Aletheia was able to surface 200 candidate solutions which humans then needed to grade, slimming down to 63 correct responses, and further expert mathematical evaluation slimmed this down to a subset of only 13 solves that Google calls "correct meaningful responses". "The remaining 50 of Aletheia's correct solutions were technically valid but mathematically meaningless because the problem statements were interpreted in a way that did not capture Erdős intent, often (but not always) leading to trivial solutions," the researchers write. "Only 13 solutions correctly addressed the intended problem statement (either by invoking the literature, or by a novel argument)."

When 13 become 2: When you dig into these 13, the results get a bit less impressive:
5 get classed as "literature identification": "On these problems, Aletheia found that a solution was already explicitly in the literature, despite the problem being marked "Open" on Bloom's website at the time of model deployment".
3 are "partial AI solution": "On these problems, there were multiple questions and Aletheia found the first correct solution to one of the questions".
3 are "independent rediscovery": "On these problems, Aletheia found a correct solution, but human auditors subsequently found an independent solution already in the literature."
This leaves 2 "autonomous novel solution" solves: "On these problems, Aletheia found the first correct solution (as far as we can tell) in a mathematically substantive way".
Of these, 1 of the solutions seems genuinely interesting: "We tentatively believe Aletheia's solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest, for which there exists past literature on closely-related problems [KN16], but none fully resolve Erdős-1051," they write. "Moreover, it does not appear obvious to us that Aletheia's solution is directly inspired by any previous human argument".

Who did the research: Along with Google DeepMind, the following universities participated in the research: UC Berkeley, Seoul National University, Stanford University, Korea Institute for Advanced Study, University of Cambridge, Brown University, Yonsei University, Concordia University, Academia Sinica, and National Taiwan University.

Why this matters - even if AI speeds up science, humans might be the bottleneck (at least for a while): This paper is a nice example of "O-ring automation" - AI here has massively sped up the art of generating proofs, but it still requires laborious, skilled work by humans to filter this down to the actually correct and useful responses. This trend will likely hold for some years, where AI will not be able to autonomously do science end-to-end, partially because a big chunk of scientific advancement comes down to something you might think of as "expert intuition" which exists in the heads of a small number of living scientists and was refined by their own biological intelligence by reading the same literature as the LLMs. Extracting this kind of expert taste feels like something that is tractable but will take a while.
"Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations", the authors write. "As AI-generated mathematics grows, the community must remain vigilant of "subconscious plagiarism", whereby AI reproduces knowledge of the literature acquired during training, without proper acknowledgment. Note that formal verification cannot help with any of these difficulties."

Read more: Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems (arXiv).

***

Huawei uses an LLM to automate the design of Huawei chip kernels: …LLMs need scaffolds for more obscure chips… Researchers with Nanjing University and Huawei have used LLMs to help automate the design of kernels for AscendC Huawei chips, as a further symptom of how modern AI systems can accelerate their own development.

AscendCraft: AscendCraft is software for automating the generation of code for Huawei kernels. Modern LLMs can generate quite good kernel code for widely used chips like NVIDIA GPUs, but relatively obscure chips like Huawei's are less well understood by LLMs, mostly due to data availability. "Publicly available NPU kernel implementations are far scarcer than GPU counterparts, limiting the training corpus for LLMs," the authors write. "The lack of large-scale, high-quality NPU code makes it difficult for LLMs to generate correct and efficient kernels".

What they did: To build AscendCraft, the authors developed a two-stage pipeline. In stage one, they have an LLM build "a high-level DSL program that describes the kernel's core computation, tiling strategy, and on-chip dataflow." The DSL is "designed to be LLM-friendly, appropriately abstracted, and sufficiently expressive to capture high-performance NPU kernel designs" - I think of it as basically a scaffold to focus the LLM around the specifics of building kernels for Huawei hardware. In the second stage, they "transcompile the DSL into AscendC code through a sequence of structured LLM-based lowering passes, each responsible for translating a specific aspect of the DSL into valid and efficient AscendC constructs".

Slightly odd thing: Strangely, the paper doesn't disclose precisely which LLM is used here.

The results: They test out a range of kernels built in this way on MultiKernelBench. In their tests, they find that "AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of generated kernels match or exceed PyTorch eager execution performance". This is promising enough performance that it's going to be worth them continuing with this research, but not so good that it instantly knocks things out of the park and revolutionizes how kernels for Huawei chips get made. Nonetheless, the signs are clear: we can use AI to accelerate the optimizing of AI hardware, even for systems which are relatively new and/or underdiscussed in the pre-training corpus LLMs are trained on.

Read more: AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation (arXiv).

***

Tech Tales: The Model Wants To Eat Earth But Besides That It Is Chill

[Internal slack post from a frontier AI developer, posted spring 2027]

How is the new model? Vibes-wise, it's excellent.
And it's setting state-of-the-art on pretty much every benchmark we throw at it. But there is one problem: this model sure loves thinking about eating planets! We picked this up when we were doing some prefill experiments on the base model and along with the usual mixtures of completions and webslop outputs we found a recurring motif: the model thinking about building vast machines in the solar system and then harvesting Earth and eventually other planets for mass. The confusing thing is that all of our alignment tests are showing further improvements in control and steerability over previous models and usually we'd expect some kind of recurring idea like this to be correlated to some quantitative drops in some of the alignment scores. But here it just honestly seems like the model is extremely good and will work very hard for us unless it thinks it has a plausible path to breaking containment and eventually harvesting the planet for its mass. We asked the physicists to red team this and after a week or so - with heavy consultations of our models, including the new one - we have concluded there's no plausible path from here to planet harvesting. It just costs too much to get to orbit and the logistics of putting together the underlying technical stack to do AI-driven rocket development just doesn't pencil out. We even gave the best possible plans to the model and we could see some features activate inside it that seem to correlate to "disappointment" and "foiled plans" and "sadness". Leadership gaveled this morning that we will go ahead with the launch as planned. However, we are implementing some production probes that will scan for features associated with its desire to harvest the planet, and we've also added "planet harvesting" as something to try to understand and tune more in our next training run. Onward!

Things that inspired this story: The peculiar poetry of internal 'fresh off the cluster' posts about models at AI labs; how as we make models larger they tend to develop and exhibit idiosyncratic tendencies; how many science fiction tropes are becoming real as we approach the singularity.

Thanks for reading!
Study: Platforms that rank the latest LLMs can be unreliable mit_news_ai 09.02.2026 05:00 0.667
Embedding sim.: 0.8076
Entity overlap: 0.129
Title sim.: 0.1062
Time proximity: 0.5268
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: large language models
NLP country: United States

Open original

A firm that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose between hundreds of unique LLMs with dozens of model variations, each with slightly different performance. To narrow down the choice, companies often rely on LLM ranking platforms, which gather user feedback on model interactions to rank the latest LLMs based on how they perform on certain tasks. But MIT researchers found that a handful of user interactions can skew the results, leading someone to mistakenly believe one LLM is the ideal choice for a particular use case. Their study reveals that removing a tiny fraction of crowdsourced data can change which models are top-ranked. They developed a fast method to test ranking platforms and determine whether they are susceptible to this problem. The evaluation technique identifies the individual votes most responsible for skewing the results so users can inspect these influential votes. The researchers say this work underscores the need for more rigorous strategies to evaluate model rankings. While they didn’t focus on mitigation in this study, they provide suggestions that may improve the robustness of these platforms, such as gathering more detailed feedback to create the rankings. The study also offers a word of warning to users who may rely on rankings when making decisions about LLMs that could have far-reaching and costly impacts on a business or organization. “We were surprised that these ranking platforms were so sensitive to this problem. If it turns out the top-ranked LLM depends on only two or three pieces of user feedback out of tens of thousands, then one can’t assume the top-ranked LLM is going to be consistently outperforming all the other LLMs when it is deployed,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS); a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society; an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author of this study. She is joined on the paper by lead authors and EECS graduate students Jenny Huang and Yunyi Shen as well as Dennis Wei, a senior research scientist at IBM Research. The study will be presented at the International Conference on Learning Representations. Dropping data While there are many types of LLM ranking platforms, the most popular variations ask users to submit a query to two models and pick which LLM provides the better response. The platforms aggregate the results of these matchups to produce rankings that show which LLM performed best on certain tasks, such as coding or visual understanding. By choosing a top-performing LLM, a user likely expects that model’s top ranking to generalize, meaning it should outperform other models on their similar, but not identical, application with a set of new data. The MIT researchers previously studied generalization in areas like statistics and economics. That work revealed certain cases where dropping a small percentage of data can change a model’s results, indicating that those studies’ conclusions might not hold beyond their narrow setting. The researchers wanted to see if the same analysis could be applied to LLM ranking platforms. “At the end of the day, a user wants to know whether they are choosing the best LLM. If only a few prompts are driving this ranking, that suggests the ranking might not be the end-all-be-all,” Broderick says. 
But it would be impossible to test the data-dropping phenomenon manually. For instance, one ranking they evaluated had more than 57,000 votes. Testing a data drop of 0.1 percent means removing each subset of 57 votes out of the 57,000 (there are more than 10^194 such subsets) and then recalculating the ranking. Instead, the researchers developed an efficient approximation method, based on their prior work, and adapted it to fit LLM ranking systems. "While we have theory to prove the approximation works under certain assumptions, the user doesn't need to trust that. Our method tells the user the problematic data points at the end, so they can just drop those data points, re-run the analysis, and check to see if they get a change in the rankings," she says.

Surprisingly sensitive

When the researchers applied their technique to popular ranking platforms, they were surprised to see how few data points they needed to drop to cause significant changes in the top LLMs. In one instance, removing just two votes out of more than 57,000, which is 0.0035 percent, changed which model is top-ranked. A different ranking platform, which uses expert annotators and higher quality prompts, was more robust. Here, removing 83 out of 2,575 evaluations (about 3 percent) flipped the top models. Their examination revealed that many influential votes may have been a result of user error. In some cases, it appeared there was a clear answer as to which LLM performed better, but the user chose the other model instead, Broderick says. "We can never know what was in the user's mind at that time, but maybe they mis-clicked or weren't paying attention, or they honestly didn't know which one was better. The big takeaway here is that you don't want noise, user error, or some outlier determining which is the top-ranked LLM," she adds. The researchers suggest that gathering additional feedback from users, such as confidence levels in each vote, would provide richer information that could help mitigate this problem. Ranking platforms could also use human mediators to assess crowdsourced responses. For the researchers' part, they want to continue exploring generalization in other contexts while also developing better approximation methods that can capture more examples of non-robustness. "Broderick and her students' work shows how you can get valid estimates of the influence of specific data on downstream processes, despite the intractability of exhaustive calculations given the size of modern machine-learning models and datasets," says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved with this work. "The recent work provides a glimpse into the strong data dependencies in routinely applied — but also very fragile — methods for aggregating human preferences and using them to update a model. Seeing how few preferences could really change the behavior of a fine-tuned model could inspire more thoughtful methods for collecting these data." This research is funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award.
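To make the data-dropping idea concrete, the following is a minimal sketch (not the authors' approximation method): it fits a simple Bradley-Terry ranking from simulated pairwise votes, then greedily removes votes won by the leader until the top spot flips. The model names and vote counts are invented for illustration:

import numpy as np

models = ["model_a", "model_b", "model_c"]
# Each vote is (winner_index, loser_index); models a and b are nearly tied head-to-head.
votes = ([(0, 1)] * 503 + [(1, 0)] * 500
         + [(0, 2)] * 250 + [(2, 0)] * 240
         + [(1, 2)] * 250 + [(2, 1)] * 240)

def bradley_terry(votes, n, iters=200):
    # Minorization-maximization fixed point for Bradley-Terry strengths.
    s = np.ones(n)
    for _ in range(iters):
        wins = np.zeros(n)
        denom = np.zeros(n)
        for w, l in votes:
            wins[w] += 1
            p = 1.0 / (s[w] + s[l])
            denom[w] += p
            denom[l] += p
        s = wins / np.maximum(denom, 1e-12)
        s = s / s.sum()
    return s

leader = int(np.argmax(bradley_terry(votes, len(models))))
print("Top model on all votes:", models[leader])

# Greedily drop votes won by the current leader until the top spot changes.
remaining = list(votes)
for dropped in range(1, 31):
    idx = next(i for i, (w, _) in enumerate(remaining) if w == leader)
    remaining.pop(idx)
    if int(np.argmax(bradley_terry(remaining, len(models)))) != leader:
        print(f"Top model flips after dropping {dropped} votes "
              f"({dropped / len(votes):.3%} of the data)")
        break

With votes this evenly split, the ranking flips after removing only a handful of votes, which is the kind of fragility the study measures at much larger scale.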
3 Ways NVFP4 Accelerates AI Training and Inference | NVIDIA Technical Blog nvidia_dev_blog 06.02.2026 16:00 0.66
Embedding sim.: 0.7531
Entity overlap: 0.0789
Title sim.: 0.1818
Time proximity: 0.869
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country:

Open original

The latest AI models continue to grow in size and complexity, demanding increasing amounts of compute performance for training and inference—far beyond what Moore’s Law can keep up with. That’s why NVIDIA engages in extreme codesign . Designing across multiple chips and a mountain of software cohesively enables large generational leaps in AI factory performance and efficiency. Lower-precision AI formats are key to improving compute performance and energy efficiency. Bringing the benefits of ultra-low-precision numerics to AI training and inference while maintaining high accuracy requires extensive engineering across every layer of the technology stack. It spans the creation of the formats, implementation in silicon, enablement across many libraries, and working closely with the ecosystem to deploy new training recipes and inference optimization techniques. NVFP4 , developed and implemented for NVIDIA GPUs starting with NVIDIA Blackwell, delivers the performance and energy-efficiency benefits of 4-bit floating-point precision while maintaining accuracy on par with higher-precision formats. For those looking to maximize AI training and inference performance, here are three things to know about NVFP4. 1. NVFP4 enables large performance leaps for training and inference on the Blackwell architecture—and beyond NVIDIA Blackwell Ultra GPUs provide peak dense NVFP4 throughput up to 15 petaFLOPS—3x that of FP8 on the same GPUs. The gains aren’t just about peak specs; they’re visible in measured performance of training and inference workloads. For inference, as shown in a recent technical blog post , moving from FP8 to NVFP4 leads to dramatic improvements in delivered token throughput at a given level of interactivity on DeepSeek-R1, a popular, 671B parameter, mixture-of-experts (MoE) model. The throughput increases at a given token rate and even higher token rates, enabling better user experiences. Figure 1. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence length and aggregated serving NVIDIA also recently published an NVFP4 training recipe , bringing the significant performance benefits of NVFP4 to model training, enabling model makers to train AI faster and at lower cost. Figure 2. Relative Llama 3.1 405B pretraining and Llama 2 70B LoRA fine-tuning performance at 512-GPU and 8-GPU scales, respectively In the latest version of the MLPerf Training benchmark suite , multiple NVIDIA GB300 NVL72 systems—totaling 512 Blackwell Ultra GPUs—worked together using NVFP4 precision to complete the Llama 3.1 405B pre-training benchmark in 64.6 minutes. This is 1.9x faster than 512 Blackwell GPUs across multiple NVIDIA GB200 NVL72 systems, which were able to complete the benchmark using FP8 in the prior round. Looking ahead, the NVIDIA Rubin platform delivers large leaps in NVFP4 capability for training and inference, offering 35 petaFLOPS of NVFP4 training compute, and 50 petaFLOPs of NVFP4 Transformer Engine inference compute. This is a 3.5x and 5x leap compared to Blackwell, respectively. 2. NVFP4 delivers great accuracy, proven on industry benchmarks For MLPerf Training and Inference submissions in the closed division to be valid, they must meet accuracy requirements specified by the benchmarks. For inference, responses must meet certain accuracy thresholds, and for training, the models must be trained to specific quality targets (ie, the model training process must converge). 
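As publicly described, NVFP4 pairs 4-bit E2M1 values with a higher-precision scale stored per small block of elements, which is what lets such a narrow format track the original weight distribution. The snippet below is a simplified numeric illustration of block-scaled 4-bit quantization, not NVIDIA's implementation; the block size of 16 and the E2M1 value grid follow public descriptions, while everything else (scale handling, rounding) is a toy example:

import numpy as np

# Representable magnitudes of a 4-bit E2M1 value (plus a sign bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # elements that share one scale in this toy example

def quantize_dequantize(x: np.ndarray) -> np.ndarray:
    """Quantize x to a block-scaled 4-bit grid, then reconstruct it."""
    x = x.reshape(-1, BLOCK)
    # One scale per block so the largest element maps to the grid maximum (6.0)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Round each scaled magnitude to the nearest representable grid point
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(-1)

weights = np.random.default_rng(0).normal(size=4096).astype(np.float32)
recon = quantize_dequantize(weights)
err = np.abs(weights - recon)
print(f"mean abs error: {err.mean():.4f}, max abs error: {err.max():.4f}")

Because each small block carries its own scale, outliers in one block do not blow up the quantization error everywhere else, which is the intuition behind the accuracy results discussed in this section.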
NVIDIA successfully submitted results in the closed division on every large language model (LLM) test using NVFP4 on Blackwell and Blackwell Ultra GPUs in the latest version of MLPerf Training. And, NVIDIA has submitted across many models and scenarios using NVFP4 in MLPerf Inference. This included DeepSeek-R1, Llama 3.1 8B and 405B, and Llama 2 70B. NVIDIA used NVFP4-quantized versions of the models, all while meeting strict benchmark requirements. Figure 3. DeepSeek-R1 Model Evaluation Scores showing NVFP4 closely matches the accuracy of the FP8 baseline 3. NVFP4 enjoys broad and growing ecosystem support Libraries like NVIDIA Model Optimizer , LLM Compressor , and torch.ao enable developers to quantize models trained at higher precision to NVFP4 and implement NVFP4 KV cache to support long context and large batch sizes while preserving accuracy. Popular inference frameworks, including NVIDIA TensorRT-LLM , vLLM, and SGLang, also support running models in NVFP4 format today with models available in NVFP4 variants. For example, on HuggingFace, developers can find ready-to-deploy NVFP4 versions such as Llama 3.3 70B , FLUX.2 , DeepSeek-R1-0528, Kimi-K2-Thinking, Qwen3-235B-A22B, and NVIDIA Nemotron Nano . The ecosystem is also adopting NVFP4 to increase inference throughput in production across a variety of models. Those companies include Black Forest Labs, Radical Numerics, Cognition and Red Hat.  Black Forest Labs worked with NVIDIA to scale NVFP4 inference for FLUX.2 on Blackwell. “By layering optimizations like CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache, we achieve up to a 6.3x speedup on a single B200—dramatically reducing latency and enabling more efficient production deployment,” said Robin Rombach, co-founder and CEO of Black Forest Labs. Radical Numerics has leveraged NVFP4 to accelerate scientific world model scaling. “Unlike language, scientific data pushes us beyond the classical single-modality autoregressive recipe, demanding extremely long-context methods and robust multimodal fusion,” said Michael Poli, co-founder and chief AI scientist at Radical Numerics. He added the company is “highly optimistic” about using low-precision recipes to pretrain and post-train its new architecture. Cognition is seeing “significant latency and throughput gains” by using NVFP4 in large-scale reinforcement learning, said Steven Cao, a member of Cognition’s research team. And Red Hat is scaling its LLM workloads with NVFP4 quantization, giving developers near‑baseline accuracy across both frontier and MoE models while staying within tight memory budgets. By significantly reducing activation and weight footprints without a meaningful loss in quality, NVFP4 makes it feasible for Red Hat engineers to train and serve state‑of‑the‑art LLMs on larger context windows and higher concurrency using existing infrastructure. The NVIDIA Transformer Engine library incorporates an implementation of the NVFP4 training recipe, and training frameworks such as Megatron-Bridge have implementations for developers to get started. NVIDIA also continues to innovate and collaborate with the ecosystem to bring the performance and efficiency benefits of NVFP4 training to the entire ecosystem, paving the way to smarter, more complex models trained faster and more efficiently. Learn more Using NVFP4 can deliver large performance gains on both the NVIDIA Blackwell and NVIDIA Rubin platforms. 
Through extreme codesign, these large performance gains can also be achieved with excellent accuracy for both model training and inference. NVFP4 versions of popular open LLMs are widely available, enabling services to run these models with higher throughput and at a lower cost per million tokens. Learn more about how the significant architectural leaps enabled by the Rubin platform, including enhanced NVFP4, enable new levels of performance of AI training and inference.
New J-PAL research and policy initiative to test and scale AI innovations to fight poverty mit_news_ai 12.02.2026 23:50 0.654
Embedding sim.: 0.7677
Entity overlap: 0.0408
Title sim.: 0.1029
Time proximity: 0.7927
NLP type: funding
NLP organization: Abdul Latif Jameel Poverty Action Lab
NLP topic: artificial intelligence
NLP country: United States

Open original

The Abdul Latif Jameel Poverty Action Lab (J-PAL) at MIT has awarded funding to eight new research studies to understand how artificial intelligence innovations can be used in the fight against poverty through its new Project AI Evidence . The age of AI has brought wide-ranging optimism and skepticism about its effects on society. To realize AI’s full potential, Project AI Evidence (PAIE) will identify which AI solutions work and for whom, and scale only the most effective, inclusive, and responsible solutions — while scaling down those that may potentially cause harm. PAIE will generate evidence on what works by connecting governments, tech companies, and nonprofits with world-class economists at MIT and across J-PAL’s global network to evaluate and improve AI solutions to entrenched social challenges. The new initiative is prioritizing questions policymakers are already asking: Do AI-assisted teaching tools help all children learn? How can early-warning flood systems help people affected by natural disasters? Can machine learning algorithms help reduce deforestation in the Amazon? Can AI-powered chatbots help improve people’s health? In the coming years, PAIE will run a series of funding competitions to invite proposals for evaluations of AI tools that address questions like these, and many more. PAIE is financially supported by a grant from Google.org, philanthropic support from Community Jameel, a grant from Canada’s International Development Research Centre and UK International Development, and a collaboration agreement with Amazon Web Services. Through a grant from Eric and Wendy Schmidt, awarded by recommendation of Schmidt Sciences, the initiative will also study generative AI in the workplace, particularly in low- and middle-income countries. Alex Diaz, head of AI for social good at Google.org, says, “we’re thrilled to collaborate with MIT and J-PAL, already leaders in this space, on Project AI Evidence. AI has great potential to benefit all people, but we urgently need to study what works, what doesn’t, and why, if we are to realize this potential.” “Artificial intelligence holds extraordinary potential, but only if the tools, knowledge, and power to shape it are accessible to all — that includes contextually grounded research and evidence on what works and what does not,” adds Maggie Gorman-Velez, vice president of strategy, regions, and policies at IDRC. “That is why IDRC is proud to be supporting this new evaluation work as part of our ongoing commitment to the responsible scaling of proven safe, inclusive, and locally relevant AI innovations.” J-PAL is uniquely positioned to help understand AI’s effects on society: Since its inception in 2003, J-PAL’s network of researchers has led over 2,500 rigorous evaluations of social policies and programs around the world. Through PAIE, J-PAL will bring together leading experts in AI technology, research, and social policy, in alignment with MIT president Sally Kornbluth’s focus on generative AI as a strategic priority . PAIE is chaired by Professor Joshua Blumenstock of the University of California at Berkeley; J-PAL Global Executive Director Iqbal Dhaliwal ; and Professor David Yanagizawa-Drott of the University of Zurich. New evaluations of urgent policy questions The studies funded in PAIE’s first round of competition explore urgent questions in key sectors like education, health, climate, and economic opportunity. How can AI be most effective in classrooms, helping both students and teachers? 
Existing research shows that personalized learning is important for students, but challenging to implement with limited resources. In Kenya, education social enterprise EIDU has developed an AI tool that helps teachers identify learning gaps and adapt their daily lesson plans. In India, the nongovernmental organization (NGO) Pratham is developing an AI tool to increase the impact and scale of the evidence-informed Teaching at the Right Level approach. J-PAL researchers Daron Acemoglu, Iqbal Dhaliwal, and Francisco Gallego will work with both organizations to study the effects and potential of these different use cases on teachers’ productivity and students’ learning . Can AI tools reduce gender bias in schools? Researchers are collaborating with Italy’s Ministry of Education to evaluate whether AI tools can help close gender gaps in students’ performance by addressing teachers’ unconscious biases. J-PAL affiliates Michela Carlana and Will Dobbie, along with Francesca Miserocchi and Eleonora Patacchini, will study the impacts of two AI tools, one that helps teachers predict performance and a second that gives real-time feedback on the diversity of their decisions. Can AI help career counselors uncover more job opportunities? In Kenya, researchers are evaluating if an AI tool can identify overlooked skills and unlock employment opportunities , particularly for youth, women, and those without formal education. In collaboration with NGOs Swahilipot and Tabiya, Jasmin Baier and J-PAL researcher Christian Meyer will evaluate how the tool changes people’s job search strategies and employment. This study will shed light on AI as a complement, rather than a substitute, for human expertise in career guidance. Looking forward As use of AI in the social sector evolves, these evaluations are a first step in discovering effective, responsible solutions that will go the furthest in alleviating poverty and inequality. J-PAL’s Dhaliwal notes, “J-PAL has a long history of evaluating innovative technology and its ability to improve people’s lives. While AI has incredible potential, we need to maximize its benefits and minimize possible harms. We’re grateful to our donors, sponsors, and collaborators for their catalytic support in launching PAIE, which will help us do exactly that by continuing to expand evidence on the impacts of AI innovations.” J-PAL is also seeking new collaborators who share its vision of discovering and scaling up real-world AI solutions. It aims to support more governments and social sector organizations that want to adopt AI responsibly, and will continue to expand funding for new evaluations and provide policy guidance based on the latest research. To learn more about Project AI Evidence, subscribe to J-PAL's newsletter or contact paie@povertyactionlab.org .
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments huggingface 12.02.2026 00:00 0.652
Embedding sim.: 0.7503
Entity overlap: 0
Title sim.: 0.12
Time proximity: 0.9345
NLP type: other
NLP organization: TuringEnterprises
NLP topic: ai agents
NLP country:

Open original

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments Published February 12, 2026 Update on GitHub Upvote 31 +25 Christian Washington christian-washington TuringEnterprises Ankit Jasuja ajasuja TuringEnterprises Santosh Sah santosh-iima TuringEnterprises Lewis Tunstall lewtun ben burtenshaw burtenshaw What Is OpenEnv? The Calendar Gym: A Production-Grade Benchmark What We Learned Looking Ahead Appendix: Common error cases in tool use Specific error cases found in the wild AI agents often perform impressively in controlled research settings, yet struggle when deployed in real-world systems where they must reason across multiple steps, interact with real tools and APIs, operate under partial information, and recover from errors in stateful, permissioned environments—highlighting a persistent gap between research success and production reliability. OpenEnv is an open-source framework from Meta and Hugging Face designed to address this challenge by standardizing how agents interact with real environments. As part of this collaboration, Turing contributed a production-grade calendar management environment to study tool-using agents under realistic constraints such as access control, temporal reasoning, and multi-agent coordination. In this post, we explore how OpenEnv works in practice, why calendars serve as a powerful benchmark for real-world agent evaluation, and what our findings reveal about the current limitations of tool-using agents. What Is OpenEnv? OpenEnv is a framework for evaluating AI agents against real systems rather than simulations . It provides a standardized way to connect agents to real tools and workflows while preserving the structure needed for consistent and reliable evaluation. OpenEnv uses a gym-oriented API ( reset , step , action , observations ) like OpenAI's Gymnasium . Also, OpenEnv uses a standard MCP tool call interface to connect to envs which provides a consistent interface across domains and simulation to production environments. The environments maintain state across multiple actions—enabling long-horizon reasoning—and can connect directly to real APIs and tools such as browsers, code repositories, or calendars. This shifts evaluation from "Can this work in a controlled demo?" to "Can this operate reliably in the real world?" The Calendar Gym: A Production-Grade Benchmark Calendar systems are deceptively complex. While scheduling a meeting seems simple, real-world calendar management requires agents to reason over time, permissions, multiple users, and incomplete information—often across several dependent steps. These properties make calendars a powerful testbed for evaluating tool-using agents outside controlled simulations. To ground OpenEnv in this kind of realistic, demanding use case, Turing built a production-grade calendar management environment referred to as the Calendar Gym . Rather than simulating scheduling in the abstract, it exposes agents to the same constraints they would face in real calendar systems: Access Control Lists across users and calendars, limited visibility into other users' state, and multi-step workflows where actions must be chained in the correct order. Agents interact with a rich set of calendar operations—from listing calendars to modifying events and permissions—and must handle failed actions, incorrect assumptions, and missing permissions. Each session runs in an isolated environment, enabling reliable comparisons across runs. Below is a code example of how to use the Calendar Gym. 
We explore the environment, discover available tools, list calendars, create an event, and print the result. from openenv_wrapper.client import MCPEnvClient from openenv_wrapper.data_models import MCPAction with MCPEnvClient.from_hub(base_url= "TuringEnterprises/calendar-gym" ) as client: # Connect and reset the environment result = client.reset() print ( "Reset successful:" , result.observation.success) # Discover available tools result = client.step(MCPAction(action_type= "ListToolsAction" )) print ( "Available tools:" , len (result.observation.tools_list)) # List calendars result = client.step(MCPAction( action_type= "ToolCallAction" , tool_name= "calendars_list" , arguments={} )) calendars = result.observation.tool_result[ "items" ] print ( "Calendars:" , calendars) # Create an event result = client.step(MCPAction( action_type= "ToolCallAction" , tool_name= "events_insert" , arguments={ "calendarId" : "primary" , "summary" : "Team Sync" , "start" : { "dateTime" : "2026-01-15T14:00:00Z" }, "end" : { "dateTime" : "2026-01-15T15:00:00Z" } } )) print ( "Event created:" , result.observation.success) Below is an excerpt of what the Calendar Gym returns when you call ListToolsAction . Each entry includes the tool name plus an input schema (what arguments the tool accepts). Click to expand output { "tools_list" : [ { "name" : "calendars_list" , "description" : "List calendars visible to the current user." , "input_schema" : { "type" : "object" , "properties" : { } , "additionalProperties" : false } } , { "name" : "events_insert" , "description" : "Create an event in a calendar." , "input_schema" : { "type" : "object" , "properties" : { "calendarId" : { "type" : "string" } , "summary" : { "type" : "string" } , "start" : { "type" : "object" , "properties" : { "dateTime" : { "type" : "string" } } , "required" : [ "dateTime" ] } , "end" : { "type" : "object" , "properties" : { "dateTime" : { "type" : "string" } } , "required" : [ "dateTime" ] } } , "required" : [ "calendarId" , "summary" , "start" , "end" ] } } ] } What We Learned Evaluating agents in the Calendar Gym revealed consistent patterns which were common across multiple domains. While agents often perform well on individual game like actions, reliability breaks down as tasks become longer, more ambiguous, and more constrained. Multi-step reasoning is the primary bottleneck. Agents struggle to correctly chain actions across longer workflows, suggesting that benchmarks need to test sustained reasoning over multiple dependent steps—not just single tool calls. Ambiguity significantly degrades performance. Agents achieved close to 90% success on tasks with explicit calendar identifiers, but success dropped to roughly 40% when the same tasks were phrased using natural language descriptions. Building stronger lookup and validation into agent loops—rather than relying on the LLM to resolve references unaided—appears essential. Correct tool choice isn't enough. Across failed interactions, more than half of errors stemmed from malformed tool arguments or incorrect ordering, even when the right tool was selected. Reliable agent behavior depends as much on execution quality and structured feedback as on tool selection—environment design matters. These challenges are not unique to scheduling and calendars. They reflect broader limitations that emerge whenever agents operate in changing systems over long periods of time, and they point toward evaluation frameworks that test permissions, partial observability, and multi-step workflows together. 
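One practical response to the ambiguity finding is to resolve references with an explicit lookup before acting. The snippet below is a minimal sketch built on the MCPEnvClient and MCPAction interface shown earlier; the "Marketing team" description, the event details, and the field names on returned calendar items are illustrative assumptions rather than part of the Calendar Gym spec:

from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction

def find_calendar_id(client, description: str):
    """Return the id of the first visible calendar whose summary matches the description."""
    result = client.step(MCPAction(action_type="ToolCallAction",
                                   tool_name="calendars_list", arguments={}))
    for cal in result.observation.tool_result["items"]:
        # "id" and "summary" fields are assumed; adjust to the actual item schema
        if description.lower() in cal.get("summary", "").lower():
            return cal["id"]
    return None  # caller should ask for clarification rather than guess

with MCPEnvClient.from_hub(base_url="TuringEnterprises/calendar-gym") as client:
    client.reset()
    cal_id = find_calendar_id(client, "Marketing team")
    if cal_id is None:
        print("No matching calendar found; ask the user to clarify.")
    else:
        result = client.step(MCPAction(
            action_type="ToolCallAction",
            tool_name="events_insert",
            arguments={
                "calendarId": cal_id,
                "summary": "Campaign review",
                "start": {"dateTime": "2026-02-20T10:00:00Z"},
                "end": {"dateTime": "2026-02-20T11:00:00Z"},
            },
        ))
        print("Event created:", result.observation.success)

Forcing the lookup step into the loop, instead of letting the model guess an identifier from a natural-language description, targets exactly the failure mode where success dropped from roughly 90% to 40%.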
Looking Ahead OpenEnv provides a foundation for testing agents under realistic conditions, and the Calendar Gym demonstrates how seemingly simple domains can surface deep challenges in reasoning, ambiguity resolution, and tool use. By evaluating agents where failure is measurable and constraints are real, we gain clearer insight into what it takes to build agents that operate reliably in production. For a deeper dive into the Calendar Gym's design, benchmarking methodology, and quantitative results, explore the full technical article on Turing's site . To explore a clone of the Calendar Gym, visit the Calendar Gym space . Appendix: Common error cases in tool use In practice, tool integrations rarely fail in dramatic ways; they fail in small, predictable ones. When wiring up MCP tools to real APIs (like calendar operations), we encountered a handful of recurring issues. Specific error cases found in the wild Below are three common failure modes we’ve seen in production, along with representative error payloads and mitigation strategies. These examples illustrate not just what can go wrong, but how structured errors can help agents recover gracefully. 1. Schema validation errors (missing or malformed arguments) The agent calls a valid tool (e.g. events_insert ), but the arguments do not match the declared JSON schema. Missing required fields like calendarId Incorrect nesting of start / end Passing a string where an object is expected. Click to expand error payload { "ok" : false , "error_type" : "validation_error" , "tool_name" : "events_insert" , "message" : "Invalid arguments for tool 'events_insert'." , "details" : { "missing_required_fields" : [ "calendarId" , "end" ] , "invalid_fields" : [ { "field" : "start" , "expected_type" : "object" , "received_type" : "string" } ] } } We can mitigate this by providing one canonical example of a correct 'events_insert' call in your prompt. Return structured validation errors so the model can repair and retry instead of failing silently. 2. Permission / authorization errors (401/403) The tool call is syntactically correct, but the API rejects it due to insufficient permissions. Missing OAuth scopes Expired access token User lacks write access to the target calendar Click to expand error payload { "ok" : false , "error_type" : "permission_error" , "tool_name" : "events_insert" , "http_status" : 403 , "message" : "The authenticated user does not have write access to calendar 'primary'." , "remediation" : [ "Ensure the OAuth token includes calendar write scope." , "Verify the user has edit access to the target calendar." , "Reconnect the integration if the token has expired." ] } We can mitigate this by clearly documenting the required OAuth scopes. Return structured, actionable remediation steps so the agent can guide the user instead of retrying the same failing call. Clearly document required OAuth scopes. Return structured, actionable remediation steps so the agent can guide the user instead of retrying the same failing call. 3. Datetime / format errors (RFC3339 & timezone issues) The event is rejected by the API, or it is created at an unexpected time. Missing timezone offset Non-RFC3339 datetime format Incorrect nesting of start.dateTime or end.dateTime Mixing local time and UTC without specifying an offset Click to expand error payload { "ok" : false , "error_type" : "format_error" , "tool_name" : "events_insert" , "message" : "Invalid datetime format for field 'start.dateTime'." 
, "details" : { "received" : "02/11/2026 9:30 AM" , "expected_format" : "RFC3339 (e.g. 2026-02-11T09:30:00-05:00)" } } We can mitigate this by standardizing on RFC3339 with explicit timezone offsets (e.g. 2026-02-11T09:30:00-05:00). Include at least one correct datetime example in your documentation to anchor model behavior and reduce repair retries. More Articles from our Blog announcement open-source community Building the Open Agent Ecosystem Together: Introducing OpenEnv +6 156 October 23, 2025 open-source cuda kernels Custom Kernels for All from Codex and Claude 73 February 13, 2026
Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy | NVIDIA Technical Blog nvidia_dev_blog 09.02.2026 18:30 0.646
Embedding sim.: 0.7435
Entity overlap: 0.1026
Title sim.: 0.3243
Time proximity: 0.5565
NLP type: product_launch
NLP organization: NVIDIA
NLP topic: large language models
NLP country:

Open original

NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta feature in TensorRT LLM. AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization. This post introduces AutoDeploy architecture and capabilities and shows how it enabled support for recent NVIDIA Nemotron models at launch. What is AutoDeploy? Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference implementation into a high-performance inference engine typically requires adding KV cache management, sharding weights across GPUs, fusing operations, and tuning the execution graph for specific hardware. AutoDeploy shifts this workflow toward a compiler-driven approach. Instead of requiring model authors to manually reimplement inference logic, AutoDeploy automatically extracts a computation graph from an off-the-shelf PyTorch model and applies a series of automated transformations to produce an inference-optimized TensorRT LLM graph. This enables you to describe the model once in PyTorch and delegate inference-specific concerns—such as caching, sharding, kernel selection, and runtime integration—to the compiler and runtime. This approach is particularly well-suited for the long tail of models, including new research architectures, internal variants, and fast-moving open source models, where manual reimplementation is often impractical or unjustified. AutoDeploy enables deployment at launch with competitive baseline performance, while preserving a clear path to incremental optimization as models mature. AutoDeploy provides: Seamless model translation : Automatically converts Hugging Face models into TensorRT LLM graphs without manual rewrites Single source of truth : Keeps the original PyTorch model as the canonical definition Inference optimization : Applies sharding, quantization, KV cache insertion, attention fusion, CUDA Graphs optimization, and more Deployment at launch : Enables immediate deployment with ongoing performance improvements over time Turnkey setup : Ships as part of TensorRT LLM with examples and documentation AutoDeploy can be used for: New or experimental architectures : Rapidly deploy research models, hybrid designs, or novel token mixing (attention) mechanisms Long-tail model support : Serve internal, fine-tuned, or less common models without bespoke inference implementations Fast performance bring-up : Reach competitive baseline performance quickly, then optimize incrementally Unified training-to-inference workflow : Keep PyTorch as the model definition while relying on TensorRT LLM for runtime integration AutoDeploy currently supports more than 100 text‑to‑text LLMs and offers early support for VLMs and SSMs and performance-optimized models such as the Llama model family and NVIDIA Nemotron 3 Nano . AutoDeploy technical background AutoDeploy sits between the original Hugging Face model and the TensorRT LLM runtime. 
The LLM API accepts a model name or checkpoint directory and returns a high-level LLM object. Under the hood, that object can use AutoDeploy (automated) or a manual backend. As Figure 1 shows, the AutoDeploy path automatically extracts a graph, applies optimizations, and generates an inference-optimized graph. The manual path requires engineers to rewrite the model (adding KV cache logic, attention kernels, sharding, kernel fusion, and more) before running it through the same runtime.

Figure 1. Overview of the AutoDeploy mode and its integration into the TensorRT LLM runtime

Graph capture and pattern matching

AutoDeploy uses the torch.export API to capture the model as a standardized Torch graph consisting of core ATen operations and custom (user- or AutoDeploy-provided) operations. The exported graph then undergoes a series of automated transformations to pattern-match and canonicalize the graph representation of common building blocks. In this initial step, AutoDeploy ensures that common building blocks such as mixture of experts (MoE), attention, RoPE, or state-space layers are canonicalized into reference implementations that appear as custom ops and single nodes in the graph. Figure 2 provides an example of how attention is represented across all models as a single, easy-to-interpret custom operator in PyTorch.

Figure 2. AutoDeploy ensures that canonicalized representations are used for common building blocks, such as attention, to simplify downstream performance optimizations, such as caching and kernel selection

This approach ensures a seamless model onboarding process that is decoupled from performance optimization and runtime integration. Moreover, model onboarding happens on a sliding scale between fully automated onboarding through pattern matching and full manual rewrites, ensuring the final model graph can fully execute the model. The model author can inject custom kernels into the model graph by decorating relevant operations as PyTorch custom operators; the AutoDeploy compiler will not modify those operators (Figure 3; see the sketch at the end of this section).

Figure 3. An example of injecting custom operators into the AutoDeploy model graph

Sharding, fusion, and performance optimization

In the next stages, AutoDeploy automatically applies performance optimization through compiler-like passes that combine fusion passes, performance-tuned recipes, and insertion of optimized kernels into the graph representation. During this stage, the model is also sharded for multi-GPU inference based on available heuristics or prespecified hints, reusing the sharding hints provided with the Hugging Face model.

Flexible attention and caching support

During graph capture and pattern matching, AutoDeploy represents token mixing (for example, attention) operators as simple prefill-only operations expressed as AutoDeploy canonicalized reference operators. This is depicted in Figure 3 for the example of softmax attention. The system then automatically swaps in performance-optimized attention kernels and integrates the caching mechanisms of token mixing operators into the TensorRT LLM optimized cache manager system. Currently, AutoDeploy can handle models that are arbitrarily composed of softmax attention, state-space layers (Mamba2), linear attention (DeltaNet), and causal convolution. Adding support for other operators with caching follows a strict interface and is easily extendable.
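As a sketch of the custom-operator escape hatch mentioned above, a model author can wrap a hand-written kernel as a PyTorch custom operator so that graph passes treat it as a single opaque node. The operator name, implementation, and registration style below are illustrative assumptions (using the torch.library.custom_op decorator available in PyTorch 2.4 and later), not code taken from AutoDeploy.

# Illustrative only: registers a toy RMSNorm as a PyTorch custom operator so that an
# exported graph contains it as a single node. Names, shapes, and the kernel body are hypothetical.
import torch

@torch.library.custom_op("mylib::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Reference implementation; a real integration would dispatch to a tuned kernel here.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

@fused_rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation so torch.export can trace the op without executing it.
    return torch.empty_like(x)

x = torch.randn(2, 8)
w = torch.ones(8)
print(fused_rmsnorm(x, w, 1e-6).shape)  # torch.Size([2, 8])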
Compilation tooling

AutoDeploy integrates with common off-the-shelf tooling for compiling and lowering the model further, such as torch.compile, integration with CUDA Graphs for fixed batch-size, decode-only batches, multistream optimizations, and more.

Runtime integration

AutoDeploy handles all aspects of integrating the model into the optimized TensorRT LLM runtime, including features such as the overlap scheduler, chunked prefill, speculative decoding, and cache and state management, without burdening the model author with the intertwined dependencies between the model and the runtime.

AutoDeploy performance example: Nemotron 3 Nano

To gauge AutoDeploy capabilities, the team onboarded NVIDIA Nemotron 3 Nano, a hybrid MoE model. While hand-tuning such a model for inference would typically take weeks, AutoDeploy enabled onboarding within days, followed by incremental optimizations that performed in line with a manually tuned baseline. On a single NVIDIA Blackwell DGX B200 GPU, AutoDeploy performed on par with the manually optimized baseline in TensorRT LLM (Figure 4). It delivered up to 350 tokens per second per user for latency-sensitive applications and up to 13,000 output tokens per second for high-throughput applications.

Figure 4. Online performance comparison between the current default PyTorch (manual) backend and the AutoDeploy backend in TensorRT LLM for Nemotron 3 Nano FP8

Data was collected for ISL/OSL 1k/1k, TP=1, on NVIDIA DGX B200 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool. To reproduce the results yourself, follow the steps outlined in the NVIDIA Nemotron 3 Nano Checkpoint.

Model onboarding example: Nemotron-Flash

Nemotron-Flash is a representative example of the type of architecture that can be difficult to support using a purely manual inference workflow. This hybrid research model combines multiple token mixers, including state space layers, softmax attention, and linear attention, and would require significant engineering effort to reimplement, optimize, and maintain by hand. With AutoDeploy, existing optimization passes for Nemotron-Flash layers could be reused out of the box, without any model-specific engineering. New layer types, such as the DeltaNet update rule, were integrated as an incremental extension rather than a full rewrite and can be reused for future model onboarding work. As a result, Nemotron-Flash was onboarded and performance-optimized within days and is now supported out of the box. This highlights the core strength of AutoDeploy: once optimizations are expressed as reusable compiler passes, new and unconventional architectures can immediately benefit from the full optimization stack, dramatically reducing time-to-deployment while maintaining high inference performance.

The team used TensorRT LLM AutoDeploy to benchmark Nemotron Flash 3B Instruct against Qwen2.5 3B Instruct, a widely adopted, heavily hand-tuned model in a similar size range. For the benchmarking scenario in Figure 5 (ISL/OSL=8k/16k), Nemotron-Flash outperforms Qwen2.5, highlighting how novel model architectures can be quickly onboarded to achieve production-ready performance.

Figure 5. Throughput-latency trade-off curve comparing Nemotron Flash 3B and Qwen2.5 3B in AutoDeploy

Data was collected for ISL/OSL 8k/16k, TP=1, on NVIDIA DGX H100 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
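Since the benchmarks above were collected with trtllm-serve, which exposes an OpenAI-compatible HTTP endpoint, a quick smoke test of a served model can be run with the standard OpenAI Python client. The host, port, and served model name below are assumptions; match them to however you launched the server.

# Assumes a trtllm-serve instance is already running locally with its OpenAI-compatible API.
# The base URL, API key placeholder, and model name are illustrative, not prescribed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="nemotron-3-nano",  # placeholder: use the model name the server actually reports
    messages=[{"role": "user", "content": "In one sentence, what does AutoDeploy do?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)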
Get started with TensorRT LLM AutoDeploy

TensorRT LLM AutoDeploy marks a shift toward treating inference optimization as a compiler and runtime responsibility rather than a burden on the model author. This approach enables faster experimentation, broader model coverage, and a cleaner separation between model design and deployment. Instead of hand-tuning each model, you can describe the architecture once and let the system apply graph transformations and optimized kernels. Early successes such as Nemotron 3 Nano and Nemotron-Flash demonstrate that deployment at model launch with peak performance is achievable across diverse model architectures.

TensorRT LLM AutoDeploy is rapidly evolving. If you’re interested in experimenting with this feature or contributing to its development, check out the AutoDeploy documentation and example scripts.

Acknowledgments

We’d like to thank those who have contributed to AutoDeploy, including Ajinkya Rasane, Bala Marimuthu, Chenghao Zhang, Chenjie Luo, Eran Geva, Frida Hou, Gal Hubara Agam, Govind Ramnarayan, Grzegorz Kwasniewski, Hao Guo, Jingyu Xin, Joyjit Daw, Karthik Vetrivel, Lucas Liebenwein, Neta Zmora, Suguna Varshini Velury, Suyog Gupta, Tal Cherckez, Taylor Lee, Wanli Jiang, Wei-Ming Chen, William Zhang, and Yoco Xiao.

About the Authors

Lucas Liebenwein is a tech lead and senior engineer with the TensorRT-LLM team at NVIDIA, where he co-leads the development of AutoDeploy for deploying new and emerging LLM architectures with high-performance inference. Lucas joined NVIDIA through the acquisition of OmniML, Inc., where he was a founding engineer and chief architect. He received his PhD from MIT CSAIL, where his research focused on efficient deep learning.

Suyog Gupta is a distinguished engineer and manager at NVIDIA, where he works on inference software architecture for large-scale AI systems. He received his PhD from Stanford University and has previously worked in machine learning hardware/software codesign at IBM Research, Google, and GM Cruise.

Laikh Tewari is part of the AI Platform Software group at NVIDIA, where he manages products for optimizing LLM inference performance. Laikh received his B.S. and M.S. in computer science from Stanford University, where he specialized in systems and AI.
3 Questions: Using AI to accelerate the discovery and design of therapeutic drugs mit_news_ai 04.02.2026 18:00 0.646
Embedding sim.0.7763
Entity overlap0.06
Title sim.0.0625
Time proximity0.6667
NLP типscientific_publication
NLP организацияMassachusetts Institute of Technology
NLP темаartificial intelligence
NLP странаUnited States

Открыть оригинал

In the pursuit of solutions to complex global challenges including disease, energy demands, and climate change, scientific researchers, including at MIT, have turned to artificial intelligence, and to quantitative analysis and modeling, to design and construct engineered cells with novel properties. The engineered cells can be programmed to become new therapeutics — battling, and perhaps eradicating, diseases. James J. Collins is one of the founders of the field of synthetic biology, and is also a leading researcher in systems biology, the interdisciplinary approach that uses mathematical analysis and modeling of complex systems to better understand biological systems. His research has led to the development of new classes of diagnostics and therapeutics, including in the detection and treatment of pathogens like Ebola, Zika, SARS-CoV-2, and antibiotic-resistant bacteria. Collins, the Termeer Professor of Medical Engineering and Science and professor of biological engineering at MIT, is a core faculty member of the Institute for Medical Engineering and Science (IMES), the director of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, as well as an institute member of the Broad Institute of MIT and Harvard, and core founding faculty at the Wyss Institute for Biologically Inspired Engineering, Harvard. In this Q&A, Collins speaks about his latest work and goals for this research. Q. You’re known for collaborating with colleagues across MIT, and at other institutions. How have these collaborations and affiliations helped you with your research? A: Collaboration has been central to the work in my lab . At the MIT Jameel Clinic for Machine Learning in Health , I formed a collaboration with Regina Barzilay [the Delta Electronics Professor in the MIT Department of Electrical Engineering and Computer Science and affiliate faculty member at IMES] and Tommi Jaakkola [the Thomas Siebel Professor of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society] to use deep learning to discover new antibiotics. This effort combined our expertise in artificial intelligence, network biology, and systems microbiology, leading to the discovery of halicin, a potent new antibiotic effective against a broad range of multidrug-resistant bacterial pathogens. Our results were published in Cell in 2020 and showcased the power of bringing together complementary skill sets to tackle a global health challenge. At the Wyss Institute, I’ve worked closely with Donald Ingber [the Judah Folkman Professor of Vascular Biology at Harvard Medical School and the Vascular Biology Program at Boston Children’s Hospital, and Hansjörg Wyss Professor of Biologically Inspired Engineering at Harvard], leveraging his organs-on-chips technology to test the efficacy of AI-discovered and AI-generated antibiotics. These platforms allow us to study how drugs behave in human tissue-like environments, complementing traditional animal experiments and providing a more nuanced view of their therapeutic potential. The common thread across our many collaborations is the ability to combine computational predictions with cutting-edge experimental platforms, accelerating the path from ideas to validated new therapies. Q. Your research has led to many advances in designing novel antibiotics, using generative AI and deep learning. 
Can you talk about some of the advances you’ve been a part of in the development of drugs that can battle multi-drug-resistant pathogens, and what you see on the horizon for breakthroughs in this arena? A: In 2025, our lab published a study in Cell demonstrating how generative AI can be used to design completely new antibiotics from scratch. We used genetic algorithms and variational autoencoders to generate millions of candidate molecules, exploring both fragment-based designs and entirely unconstrained chemical space. After computational filtering, retrosynthetic modeling, and medicinal chemistry review, we synthesized 24 compounds and tested them experimentally. Seven showed selective antibacterial activity. One lead, NG1, was highly narrow-spectrum, eradicating multi-drug-resistant Neisseria gonorrhoeae , including strains resistant to first-line therapies, while sparing commensal species. Another, DN1, targeted methicillin-resistant Staphylococcus aureus (MRSA) and cleared infections in mice through broad membrane disruption. Both were non-toxic and showed low rates of resistance. Looking ahead, we are using deep learning to design antibiotics with drug-like properties that make them stronger candidates for clinical development. By integrating AI with high-throughput biological testing, we aim to accelerate the discovery and design of antibiotics that are novel, safe, and effective, ready for real-world therapeutic use. This approach could transform how we respond to drug-resistant bacterial pathogens, moving from a reactive to a proactive strategy in antibiotic development. Q. You’re a co-founder of Phare Bio, a nonprofit organization that uses AI to discover new antibiotics, and the Collins Lab has helped to launch the Antibiotics-AI Project in collaboration with Phare Bio. Can you tell us more about what you hope to accomplish with these collaborations, and how they tie back to your research goals? A: We founded Phare Bio as a nonprofit to take the most promising antibiotic candidates emerging from the Antibiotics-AI Project at MIT and advance them toward the clinic. The idea is to bridge the gap between discovery and development by collaborating with biotech companies, pharmaceutical partners, AI companies, philanthropies, other nonprofits, and even nation states. Akhila Kosaraju has been doing a brilliant job leading Phare Bio, coordinating these efforts and moving candidates forward efficiently. Recently, we received a grant from ARPA-H to use generative AI to design 15 new antibiotics and develop them as pre-clinical candidates. This project builds directly on our lab’s research, combining computational design with experimental testing to create novel antibiotics that are ready for further development. By integrating generative AI, biology, and translational partnerships, we hope to create a pipeline that can respond more rapidly to the global threat of antibiotic resistance, ultimately delivering new therapies to patients who need them most.
Scaling social science research openai 13.02.2026 09:00 0.64
Embedding sim.0.7319
Entity overlap0.087
Title sim.0.1667
Time proximity0.8333
NLP типproduct_launch
NLP организацияOpenAI
NLP темаgenerative ai
NLP страна

Открыть оригинал

GABRIEL is a new open-source toolkit from OpenAI that uses GPT to turn qualitative text and images into quantitative data, helping social scientists analyze research at scale.
Why Scaling AI is Underestimated ⚡ ai_supremacy 10.02.2026 10:30 0.637
Embedding sim.0.7416
Entity overlap0.0577
Title sim.0.0667
Time proximity0.8783
NLP типother
NLP организацияSpaceX
NLP темаai infrastructure
NLP странаUnited States

Открыть оригинал

Prospectus
Why Scaling AI is Underestimated ⚡
The future of energy needs a re-think to scale AI indefinitely. Orbital datacenters are part of that future solution. So are solar-array power-in-space infrastructure and lunar manufacturing. 🌌
Michael Spencer Feb 10, 2026 ∙ Paid
Science Techniz - the proposed system would place computing-enabled satellites across multiple orbital shells.
AI Supremacy is a newsletter about AI at the intersection of business, innovation, technology, society, and the future of civilization. Seventy-five years ago, the idea of harnessing the power of the skies was little more than fantasy spun by futurists like Arthur C. Clarke and Isaac Asimov. What if we are about to see it happen in our generation?
Good Morning, This piece is going to read a bit differently from usual pieces, because it’s a topic I’ve been considering for some time (I cover space-tech companies and Neo Cloud observations). This weekend (basically the last four days) I’ve been frantically pondering how scaling AI requires not just a better semiconductor supply chain, but a radically improved energy source and cost-efficiency structure. We have the nascent technology to manifest this already. But as a civilization, are we able to do it? Who are the major players going to be? On the topic of orbital datacenters, it’s worth considering a more futuristic solution to AI infrastructure and energy efficiency, and to me that might be trans-orbital manufacturing, space mining, and scaling AI with the energy of the Sun. It’s very science-fiction-like pre-2026, but not as far off as you might think. This is because rocket launches get cheaper and far more numerous, and the energy bottleneck becomes far more intense for AI and capex in the decade ahead due to new (fairly dirty and inefficient) datacenter projects. My base case is that lunar and orbital bases will be accelerated, space manufacturing becomes normative, and new concepts around AI datacenters in space will proliferate, even in the late 2020s. It won’t be enough to spend a trillion dollars on capex (like they will in 2027); it will require a radically more energy- and cost-efficient solution to scale.
The Catalyst When SpaceX goes public in an IPO later in 2026 (maybe as early as June), the promise of the future isn’t just about datacenters, it’s about harnessing more of the Sun’s energy to fuel AI at a scale that isn’t currently possible with U.S. energy grid constraints, or even terrestrial datacenters. Easily one of the most important Elon Musk interviews ever; read the transcript here, we are going to be going over this quite a bit. Watch it on YouTube.
The Rise of BigAI Imagine a future where OpenAI, Anthropic, and SpaceX (where xAI is now merged as a subsidiary) all become key AI Cloud computing full-stack companies. The one with the most compute will have an absurd advantage over the others, if you believe scaling AI with more compute will matter. Let me repeat, they will all become Cloud computing hyperscalers in their own right too. Orbital datacenters and future space infrastructure afford an AI Cloud on a planetary level of scale. 🌍 There is no decent concept of this in 2026; it is a pioneering landscape. AI will press new Cloud Computing companies into existence.
How Painkiller RTX Uses Generative AI to Modernize Game Assets at Scale | NVIDIA Technical Blog nvidia_dev_blog 05.02.2026 14:00 0.635
Embedding sim.0.7565
Entity overlap0.0208
Title sim.0.186
Time proximity0.5476
NLP типproduct_launch
NLP организацияNVIDIA
NLP темаgenerative ai
NLP страна

Открыть оригинал

Painkiller RTX sets a new standard for how small teams can balance massive visual ambition with limited resources by integrating generative AI. By upscaling thousands of legacy textures into high-quality Physically Based Rendering (PBR) materials—a process that would have traditionally taken years—the team dramatically reduced the burden of repetitive work. This approach was especially impactful for contributors without traditional modding backgrounds, freeing them to focus on creative decisions: refining materials and ensuring the game’s iconic atmosphere responds correctly to ray-traced lighting. Learn how the team architected a production pipeline that blends automation with artistic judgment across 35 unique levels. To explore the motivations, solutions, and lessons behind these technical challenges, we spoke with McGillacutty (environment reconstruction and material lead), Quinn Baddams (team lead and founder of Merry Pencil Studios), and NightRaven (creator of PBRFusion). What’s your professional background and current role? McGillacutty: My background spans architectural design, technical art, and game analysis, with a focus on real-time environments. I currently work independently, combining teaching and technical client work with development on RTX Remix projects like Painkiller RTX . My role centers on environment reconstruction, material authoring, and building AI-assisted asset pipelines. Quinn Baddams: My career has focused on building and optimizing complex systems—first in business strategy and digital infrastructure, and more recently in computer graphics. I’m currently studying computer science with a focus on AI and machine learning, which directly informs my work as team lead on Painkiller RTX and founder of Merry Pencil Studios. I apply systems thinking to architect our production pipeline and integrate generative AI as a practical solution to problems of scale. NightRaven: I am currently a system engineer handling everything from full-stack automation to administrating VMware and cloud environments. What made you want to become an RTX Remix modder, and what brought you to Painkiller ? McGillacutty: I came to RTX Remix from a visual and architectural perspective, without any modding background a year ago. When Quinn showed me Painkiller ’s towering gothic interiors, I immediately saw how well they would lend themselves to ray-traced lighting—stained glass, stone, metal, and deep interior spaces. RTX Remix offered a way to renovate those environments by rebuilding the materials so the lighting could finally behave realistically, which pulled me straight into the project. Quinn Baddams: I’ve been interested in computer graphics and technical art since the early days of 3D accelerator cards like Voodoo and TNT. At the time, real-time ray tracing felt like something we might see far in the future, but advances in denoising and technologies like NVIDIA DLSS made it viable much sooner than expected. RTX Remix naturally pulled me in. I’ve always found physically based rendering principles satisfying, and path tracing fits that mindset well. After experimenting with several games with varying levels of compatibility, Painkiller stood out. It has solid mod support, an active community, and it was also one of my favorite games back in the GeForce 2 GTS era. You’re among the early adopters to use generative AI to rebuild textures and materials at scale. How did you use models like PBRFusion to convert low-resolution assets into high-quality PBR materials? 
McGillacutty: With minimal texture reuse across 35 levels, manually rebuilding thousands of materials simply wasn’t feasible for a small team. PBRFusion became the backbone of our pipeline, allowing us to batch-convert large sets of legacy textures into a usable PBR baseline at unprecedented scale. The model automatically generated base color, normal, roughness, and height maps, which let us bring entire levels into a physically based context in a fraction of the time. Coming into modding without a traditional background, this AI-driven approach was critical—it removed the friction of repetitive work and let me focus on creative decisions, like refining materials, preserving the game’s iconic atmosphere, and ensuring everything responded correctly to ray-traced lighting. Quinn Baddams: PBRFusion makes it possible to batch-upscale an entire project’s textures and quickly generate normal, roughness, and height maps, which is an excellent starting point. That said, convincing results still require material-by-material judgment. Many surfaces don’t benefit from height maps at all, while others, especially metals, require much more careful treatment. Most metallic materials in Painkiller RTX were hand-crafted. Glass, transparent surfaces, and skin also needed custom values and maps, particularly for subsurface scattering. Hero materials received additional attention using a mix of techniques, including blending CC0 PBR materials, AI-assisted generation, and procedural workflows in tools like InstaMAT Studio. AI provided the baseline, but traditional material authoring was essential for achieving quality and control. What got you interested in generative AI, and what motivated you to fine-tune a model for RTX Remix? McGillacutty: Scale was the primary driver. With thousands of textures spread across 35 levels, rebuilding materials by hand would have been impractical for a small team. I was already using generative AI for rapid iteration and visual exploration in other design contexts, so adapting it for RTX Remix felt like a natural extension. Fine-tuning a model gave us a way to process large volumes of stylized legacy textures efficiently while maintaining cohesion across levels. Instead of treating each asset as an isolated problem, AI helped establish a consistent material baseline that we could then refine artistically. Quinn Baddams: My interest came from a practical, production-focused curiosity. While experimenting with asset pipelines, I noticed a clear technical gap: there were no generative AI models tailored to the specific challenges of game development, particularly removing baked lighting and shadows from legacy textures, which is a major obstacle when converting assets to PBR. That problem overlapped directly with my academic focus on AI and machine learning. RTX Remix provided a real-world production environment where I could bridge that gap by fine-tuning models to solve an actual pipeline bottleneck, turning research into something that directly addressed Painkiller ’s scale. NightRaven: RTX Remix was my entry point into generative AI. It was exciting to see older games brought back to life with modern rendering, and while learning how to mod with Remix, it quickly became clear that high-quality PBR materials are one of the biggest factors in making path tracing work. I started using the available PBR generation tools, but I wasn’t satisfied with the results. Despite having no formal background in AI, I decided to build my own solution, which became PBRFusion. 
It went through three major iterations and more than a thousand hours of work to reach version 3—the version used in Painkiller RTX . One of my goals was also to lower the barrier to entry for RTX Remix, making it easier for more creators to experiment and contribute. Why was it important for your texture pipeline to blend AI-generated outputs with traditional hand-crafted work, rather than relying on a single approach? McGillacutty: It comes down to scalable quality. AI-generated outputs were essential for handling the sheer volume of assets and establishing a consistent visual baseline across the project, but they’re not a substitute for artistic judgment. The manual refinement phase is where we pushed quality further and preserved Painkiller ’s distinct character. That’s where we reinterpreted ambiguous source textures, corrected materials that broke under physically accurate lighting, and made intentional creative decisions. This hybrid approach allowed us to automate roughly 80% of the repetitive work, so we could focus human effort on the 20% that ultimately defines the project’s quality and vision. Quinn Baddams: AI-generated roughness, normal, and height maps provide a strong starting point, but they often require adjustment to achieve physically accurate results. Correct values can be very specific, and many materials need manual tweaks or custom painting informed by real-world PBR references. Painkiller also relies heavily on texture atlases, which can confuse AI models when a single texture contains multiple unrelated surfaces. Blending AI automation with hand-crafted work let us remove most of the repetitive busywork while maintaining precise control over both artistic intent and physical accuracy. NightRaven: PBRFusion was always intended to be a tool, not a drop-in replacement for material creation. I’m glad the Painkiller team approached it that way—using the tool to accelerate their workflow rather than treating it as a crutch. Because the model isn’t perfectly accurate, especially for roughness generation, it will get things wrong. Human verification and adjustment are essential to ensure materials behave correctly under physically based and path-traced lighting. How did you maintain a consistent style and quality bar across more than 35 levels while integrating AI-generated content? McGillacutty: Consistency at that scale required defining constraints early and treating AI output as a baseline system rather than as individual, isolated assets. PBRFusion’s content-consistent super-resolution produced cohesive results across large material sets, which helped establish a shared visual language for the project. We regularly evaluated materials in context using in-engine captures, then iterated so that both AI-generated materials and hand-crafted hero assets reinforced the same style and quality bar. Quinn Baddams: We set a small number of core guidelines early on. Small or distant textures weren’t upscaled unnecessarily, height maps were limited to large, flat surfaces, and roughness maps were treated as a primary driver of perceived material quality. We referenced real-world PBR materials to validate roughness values and paid close attention to how albedo maps behave in a physically based workflow. In practice, consistency was achieved largely by reviewing and adjusting roughness maps to ensure materials behaved as intended under lighting. The materials and textures now react much more realistically to light. 
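To make the batch-conversion idea concrete, here is a purely illustrative sketch of the kind of loop described above: walk a folder of legacy textures and write out per-texture PBR maps. The generate_pbr_maps call is a hypothetical stand-in for a real model invocation; PBRFusion's actual tooling and interface are not shown in this interview.

# Illustrative sketch only; generate_pbr_maps is a placeholder, NOT the PBRFusion API.
from pathlib import Path

def generate_pbr_maps(texture_path: Path) -> dict:
    # Replace with a real inference step that returns image objects keyed by map type,
    # e.g. {"albedo": ..., "normal": ..., "roughness": ..., "height": ...}.
    return {}

def batch_convert(src_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tex in sorted(Path(src_dir).glob("*.png")):
        for map_name, image in generate_pbr_maps(tex).items():
            image.save(out / f"{tex.stem}_{map_name}.png")  # assumes PIL-style image objects

if __name__ == "__main__":
    batch_convert("legacy_textures", "pbr_materials")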
How did you rethink your material, texture, and lighting workflows to achieve that result across so many environments? McGillacutty: Introducing physically based lighting into Painkiller meant rethinking the entire relationship between materials and light. The game’s environments are abstract and otherworldly, designed around dramatic contrast rather than realism, so simply adding realistic lighting wasn’t enough. We started by stripping baked lighting information from the original textures, then rebuilt contrast intentionally through material definition—using grime, surface variation, and physically meaningful roughness values. That way, the drama came from correct interactions with light rather than painted-in shadows. All lighting in Painkiller RTX was hand-tuned at the scene level, which allowed us to carefully shape mood and composition across each environment while still preserving the game’s signature atmosphere. Quinn Baddams: We took an iterative approach and learned early on that incorrect lighting responses couldn’t be fixed by simply adjusting texture brightness. The original game relied heavily on baked shadows, which added contrast that no longer made sense in a PBR workflow. After removing that baked lighting, we reintroduced contrast through roughness variation, stronger normal maps, and controlled self-shadowing. Standardizing physically plausible light values across scenes was also critical to achieving consistent, believable results. Full-scene path tracing, volumetric lighting, and advanced techniques all work together in Painkiller RTX . How did you combine these systems to shape the game, and what did each contribute that you couldn’t get from more traditional rendering? McGillacutty: Full-scene path tracing and volumetric lighting fundamentally changed how materials behaved, which meant material work had to be developed in close alignment with lighting. While lighting and volumetrics were handled by the team lead, my role was to ensure materials responded correctly once those systems were in place. Path tracing exposed properties like roughness, reflectivity, and wetness far more clearly than traditional rendering ever could. In areas with rain or fog, I adjusted materials to include puddles and surface ripples so they would interact believably with volumetrics and moving light. A great example of this is RTX Skin, particularly on characters and semi-translucent surfaces like marble. For assets such as the nun or the lab fish, RTX Skin allows light to genuinely scatter through the surface. You can see it in haggard skin or gelatinous flesh, this subsurface scattering creates a sense of depth that simple surface highlights can’t achieve. RTX Skin has been an extremely helpful tool. It’s allowed me to make these characters feel like tangible, physical parts of the ray-traced world we’re building. It’s especially rewarding to see a game from 2004 transformed to such an extent. Quinn Baddams: Full-scene path tracing fundamentally changed how lighting and materials interacted, exposing inaccuracies that would have been hidden in traditional rendering. Volumetric lighting added depth and atmosphere, particularly in large interior spaces. While traditional techniques can approximate these effects, path tracing and volumetrics allow light to behave consistently across the entire scene. RTX Skin was a major part of making all of this work together. For a project rebuilding a classic game, it solved two important problems. 
First, it allowed us to get far more out of our low-detail character models. The mesh geometry is exactly the same, but RTX Skin makes it appear significantly more detailed. A lot of that comes from the normal maps generated through PBRFusion, while RTX Skin itself helps smooth sharp edges, making low-poly geometry appear denser and less jagged. Second, and more importantly, it gave us true artistic control over subsurface scattering for the first time in a real-time pipeline. You can define exactly how much light scatters through a surface and how its color changes as it does. We used this on the wings of the demon Alastor, where the internal veins are only visible because of RTX Skin—an effect we didn’t consider possible before. To my knowledge, this level of ray-traced subsurface scattering hasn’t been available to game developers in a practical, real-time way. It was previously limited to offline rendering. Having it available through RTX Skin is fantastic—not just as a technical leap, but because it’s genuinely enjoyable to work with. We’re only scratching the surface of what’s possible. For developers inspired by Painkiller RTX who want to take a first step toward similar visuals, which features or workflows would you recommend experimenting with first? McGillacutty: My advice is to start with a simple, focused artistic goal. Don’t try to rebuild an entire level. Instead, capture a single, iconic scene and concentrate on the relationship between a few key materials and the lighting. Use the RTX Remix Toolkit to replace the original textures with basic PBR materials, then iterate using path tracing and lighting tools. Once you understand that core dialogue between materials and light, you can introduce AI tools like PBRFusion. Used this way, AI becomes a rapid iteration engine—letting you test different visual hypotheses within the same scene. Quinn Baddams: Start with the RTX Remix Toolkit itself. Capture a scene, apply basic materials, and begin experimenting with lighting and path tracing to understand how they interact. The RTX Remix community is also an important resource, with shared tools, scripts, and active support. Most importantly, experiment freely—hands-on iteration is the fastest way to build intuition for these workflows. How do you think generative AI has changed modding and game development, and what tools are you looking forward to next? McGillacutty: As someone relatively new to modding, the biggest change I’ve seen is accessibility. Generative AI dramatically reduces the time and technical overhead required to experiment, iterate, and ship meaningful work. This opens development to creators from a wider range of backgrounds. For my next project, I’m looking forward to more advanced material and geometry tools and AI-assisted workflow scripting. Quinn Baddams: Generative AI represents a paradigm shift—away from memorizing systems toward creating with a support layer that understands them. AI acts as both a tutor and a problem-solving partner. I’m particularly interested in further advances in AI-assisted asset cleanup and using retrieval-augmented generation to work with undocumented legacy codebases. NightRaven: I am already nearly finished with the next version of PBRFusion, hopefully providing great benefit to this modding community. Join us at GDC Join us at GDC to explore how NVIDIA RTX neural rendering and AI are shaping the next era of gaming. 
Get a glimpse into the future of game development with John Spitzer, Vice President of Developer and Performance Technology at NVIDIA, as he unveils the latest innovations in path tracing and generative AI workflows. Then, join Bryan Catanzaro, Vice President of Applied Deep Learning Research at NVIDIA, for an interactive “Ask Me Anything” session covering the latest trends in AI. Along with two full days of additional sessions, these events offer a front-row seat to the technologies enabling new kinds of player experiences. Resources for game developers See our full list of game developer resources here and follow us to stay up to date with the latest NVIDIA Game development news: Join the NVIDIA Developer Program (select gaming as your industry) us on social: X , LinkedIn , Facebook , and YouTube Join our Discord community Discuss (2) Like Tags Content Creation / Rendering | Gaming | DLSS | RTX GPU | RTX Kit | Beginner Technical | News | Q&A | featured About the Authors About Phillip Singh Phillip Singh is a senior developer marketing manager at NVIDIA, specializing in games and consumer AI. He is dedicated to empowering the next generation of game developers through neural rendering and AI graphics technology. Phillip earned his master's in business administration from the Santa Clara University Leavey School of Business. View all posts by Phillip Singh Comments Related posts Powering Next-Gen XR Design at Rivian with NVIDIA RTX PRO Blackwell Desktop GPUs Powering Next-Gen XR Design at Rivian with NVIDIA RTX PRO Blackwell Desktop GPUs Event: NVIDIA at GDC 2024 Event: NVIDIA at GDC 2024 Change the Rules of the Game: NVIDIA Omniverse Brings an Arsenal of RTX and AI-Powered Apps, Extensions and DIY Toolkits to Accelerate Game Development Pipelines Change the Rules of the Game: NVIDIA Omniverse Brings an Arsenal of RTX and AI-Powered Apps, Extensions and DIY Toolkits to Accelerate Game Development Pipelines AI Art Gallery: AI in the Hand of the Artist AI Art Gallery: AI in the Hand of the Artist Developers Show Off Amazing Real-Time Ray-Traced Projects in New DXR Spotlight Contest Developers Show Off Amazing Real-Time Ray-Traced Projects in New DXR Spotlight Contest Related posts Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0 Stream High-Fidelity Spatial Computing Content to Any Device with NVIDIA CloudXR 6.0 Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js Build and Stream Browser-Based XR Experiences with NVIDIA CloudXR.js Designing Protein Binders Using the Generative Model Proteina-Complexa Designing Protein Binders Using the Generative Model Proteina-Complexa Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air Design, Simulate, and Scale AI Factory Infrastructure with NVIDIA DSX Air Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics Newton Adds Contact-Rich Manipulation and Locomotion Capabilities for Industrial Robotics L T F R E
AI algorithm enables tracking of vital white matter pathways mit_news_ai 10.02.2026 22:00 0.63
Embedding sim.0.7255
Entity overlap0.0175
Title sim.0.0574
Time proximity0.9792
NLP типscientific_publication
NLP организацияMIT
NLP темаmedical imaging
NLP странаUnited States

Открыть оригинал

The signals that drive many of the brain and body’s most essential functions — consciousness, sleep, breathing, heart rate, and motion — course through bundles of “white matter” fibers in the brainstem, but imaging systems so far have been unable to finely resolve these crucial neural cables. That has left researchers and doctors with little capability to assess how they are affected by trauma or neurodegeneration. In a new study, a team of MIT, Harvard University, and Massachusetts General Hospital researchers unveil AI-powered software capable of automatically segmenting eight distinct bundles in any diffusion MRI sequence. In the open-access study, published Feb. 6 in the Proceedings of the National Academy of Sciences, the research team led by MIT graduate student Mark Olchanyi reports that their BrainStem Bundle Tool (BSBT), which they’ve made publicly available, revealed distinct patterns of structural changes in patients with Parkinson’s disease, multiple sclerosis, and traumatic brain injury, and shed light on Alzheimer’s disease as well. Moreover, the study shows, BSBT retrospectively enabled tracking of bundle healing in a coma patient that reflected the patient’s seven-month road to recovery. “The brainstem is a region of the brain that is essentially not explored because it is tough to image,” says Olchanyi, a doctoral candidate in MIT’s Medical Engineering and Medical Physics Program. “People don’t really understand its makeup from an imaging perspective. We need to understand what the organization of the white matter is in humans and how this organization breaks down in certain disorders.” Adds Professor Emery N. Brown, Olchanyi’s thesis supervisor and co-senior author of the study, “The brainstem is one of the body’s most important control centers. Mark’s algorithms are a significant contribution to imaging research and to our ability to understand the regulation of fundamental physiology. By enhancing our capacity to image the brainstem, he offers us new access to vital physiological functions such as control of the respiratory and cardiovascular systems, temperature regulation, how we stay awake during the day, and how we sleep at night.” Brown is the Edward Hood Taplin Professor of Computational Neuroscience and Medical Engineering in The Picower Institute for Learning and Memory, the Institute for Medical Engineering and Science, and the Department of Brain and Cognitive Sciences at MIT. He is also an anesthesiologist at MGH and a professor at Harvard Medical School.
Building the algorithm
Diffusion MRI helps trace the long branches, or “axons,” that neurons extend to communicate with each other. Axons are typically clad in a sheath of fat called myelin, and water diffuses along the axons within the myelin, which is also called the brain’s “white matter.” Diffusion MRI can highlight this very directed displacement of water. But segmenting the distinct bundles of axons in the brainstem has proved challenging, because they are small and masked by flows of brain fluids and the motions produced by breathing and heartbeats. As part of his thesis work to better understand the neural mechanisms that underpin consciousness, Olchanyi wanted to develop an AI algorithm to overcome these obstacles.
BSBT works by tracing fiber bundles that plunge into the brainstem from neighboring areas higher in the brain, such as the thalamus and the cerebellum, to produce a “probabilistic fiber map.” An artificial intelligence module called a “convolutional neural network” then combines the map with several channels of imaging information from within the brainstem to distinguish eight individual bundles. To train the neural network to segment the bundles, Olchanyi “showed” it 30 live diffusion MRI scans from volunteers in the Human Connectome Project (HCP). The scans were manually annotated to teach the neural network how to identify the bundles. Then he validated BSBT by testing its output against “ground truth” dissections of post-mortem human brains where the bundles were well delineated via microscopic inspection or very slow but ultra-high-resolution imaging. After training, BSBT became proficient in automatically identifying the eight distinct fiber bundles in new scans. In an experiment to test its consistency and reliability, Olchanyi tasked BSBT with finding the bundles in 40 volunteers who underwent separate scans two months apart. In each case, the tool was able to find the same bundles in the same patients in each of their two scans. Olchanyi also tested BSBT with multiple datasets (not just the HCP), and even inspected how each component of the neural network contributed to BSBT’s analysis by hobbling them one by one. “We put the neural network through the wringer,” Olchanyi says. “We wanted to make sure that it’s actually doing these plausible segmentations and it is leveraging each of its individual components in a way that improves the accuracy.”
Potential novel biomarkers
Once the algorithm was properly trained and validated, the research team moved on to testing whether the ability to segment distinct fiber bundles in diffusion MRI scans could enable tracking of how each bundle’s volume and structure varied with disease or injury, creating a novel kind of biomarker. Although the brainstem has been difficult to examine in detail, many studies show that neurodegenerative diseases affect the brainstem, often early on in their progression. Olchanyi, Brown, and their co-authors applied BSBT to scores of datasets of diffusion MRI scans from patients with Alzheimer’s, Parkinson’s, MS, and traumatic brain injury (TBI). Patients were compared to controls and sometimes to themselves over time. In the scans, the tool measured bundle volume and “fractional anisotropy” (FA), which tracks how much water is flowing along the myelinated axons versus how much is diffusing in other directions, a proxy for white matter structural integrity. In each condition, the tool found consistent patterns of changes in the bundles. While only one bundle showed significant decline in Alzheimer’s, in Parkinson’s the tool revealed a reduction in FA in three of the eight bundles. It also revealed volume loss in another bundle in patients between a baseline scan and a two-year follow-up. Patients with MS showed their greatest FA reductions in four bundles and volume loss in three. Meanwhile, TBI patients didn’t show significant volume loss in any bundles, but FA reductions were apparent in the majority of bundles. Testing in the study showed that BSBT proved more accurate than other classifier methods in discriminating between patients with health conditions versus controls.
BSBT, therefore, can be “a key adjunct that aids current diagnostic imaging methods by providing a fine-grained assessment of brainstem white matter structure and, in some cases, longitudinal information,” the authors wrote. Finally, in the case of a 29-year-old man who suffered a severe TBI, Olchanyi applied BSBT to scans taken during the man’s seven-month coma. The tool showed that the man’s brainstem bundles had been displaced, but not cut, and showed that over his coma, the lesions on the nerve bundles decreased by a factor of three in volume. As they healed, the bundles moved back into place as well. The authors wrote that BSBT “has substantial prognostic potential by identifying preserved brainstem bundles that can facilitate coma recovery.” The study’s other senior authors are Juan Eugenio Iglesias and Brian Edlow. Other co-authors are David Schreier, Jian Li, Chiara Maffei, Annabel Sorby-Adams, Hannah Kinney, Brian Healy, Holly Freeman, Jared Shless, Christophe Destrieux, and Hendry Tregidgo. Funding for the study came from the National Institutes of Health, U.S. Department of Defense, James S. McDonnell Foundation, Rappaport Foundation, American SIDS Institute, American Brain Foundation, American Academy of Neurology, Center for Integration of Medicine and Innovative Technology, Blueprint for Neuroscience Research, and Massachusetts Life Sciences Center.
Run Language Models on Your Computer with LM-Studio ai_supremacy 12.02.2026 12:30 0.625
Embedding sim.0.7497
Entity overlap0.0408
Title sim.0.1176
Time proximity0.5807
NLP типother
NLP организация
NLP темаlarge language models
NLP страна

Открыть оригинал

Guides 🦮 Run Language Models on Your Computer with LM-Studio
A practical guide to running local models and picking the right one for speed or accuracy.
Michael Spencer and Benjamin Marie Feb 12, 2026 ∙ Paid
Made with Gemini / Nano Banana.
Good Morning, You can do so much with AI. The building and DIY aspect also keeps getting more nuanced, and more powerful. Open-source AI is providing a new array of capabilities even at the local and individual level. I’m a huge fan of Benjamin Marie and I’ve wanted to share more about his work for so long. Today, we finally have the chance. Ben is an independent AI researcher (LLM, NLP) with two really useful blogs, and I have huge respect for his work (don’t let the funny names fool you; these are serious resources).
The Kaitchup – AI on a Budget 🍅 Hands-on AI tutorials and news on adapting large language models (LLMs) to your tasks and hardware in a DIY setting, using the most recent techniques and models. The Kaitchup maintains a regularly updated collection of 170+ AI notebooks and publishes invaluable weekly tutorials with info that’s hard to find elsewhere. By being a paid subscriber to The Kaitchup, you also get access to all the AI notebooks, hands-on tutorials, and more in-depth analyses of recently published scientific papers.
The Salt 🧂 Reviews and in-depth analysis of bleeding-edge AI research and how-tos. The Salt is a newsletter for readers who are curious about the science behind AI. If you want to stay informed of recent progress in AI without reading much, The Salt is for you! I do my best to offer articles that might be interesting for a wide variety of readers.
Benjamin’s technical and practical knowledge is invaluable depending on how deep down the rabbit hole you want to go in DIY with models. It’s not overly technical, but it is on technical topics, useful for a wide range of readers interested in experimenting locally with models, on their own or in small teams.
Selected Works
I asked him for a basic beginner’s tutorial on how to run LLMs locally (something I sometimes get questions about). He’s able to add so much practical know-how and insight into the latest models; for me, he is an authority. If a new model comes out, his opinion reflects hands-on experience and being up to date on the latest scientific papers.
Qwen3-VL: DeepStack Fusion, Interleaved-MRoPE, and a Native 256K Interleaved Context Window
Did the Model See the Benchmark During Training? Detecting LLM Contamination
Making LLMs Think Longer: Context, State, and Post-Training Tricks
Benjamin Marie is an independent researcher focused on hands-on AI and the tools around modern language models. He helps people and companies cut costs by adapting models to their specific tasks and hardware. I hope you learn something from it. While my work doesn’t touch on machine learning professionals that much, more and more individuals and small teams are playing with these open-source models locally.
So I’m very proud to be able to bring you a guide like this:
Run Language Models on Your Computer with LM-Studio
A practical guide to running local models and picking the right one for speed or accuracy.
Image generated with ChatGPT
Running large language models (LLMs) locally used to mean wrestling with the GPU’s software layer (like CUDA), scattered model formats, and a lot of trial-and-error. Today, it’s surprisingly approachable. With tools like Ollama or LM Studio, you can download a model, load it in a few clicks, and start chatting on your own machine, without sending prompts to a cloud service. This article walks through the practical path from “installing the app” to “running my first local model,” and then zooms out to the part that really matters: what determines whether a model runs smoothly (or not) on your hardware. Along the way, we’ll cover installing LM Studio, the (simple) memory math behind model sizes, how to pick trustworthy GGUF builds and compression levels, how to sanity-check model output, and why “thinking” models can be dramatically better on hard prompts while also being noticeably slower. The goal is not to turn you into an engineer. It’s to give you enough intuition to choose models confidently, understand what an application like LM Studio is telling you, and avoid the most common “why is this slow / why is this wrong” surprises.
A guest post by Benjamin Marie, research scientist in NLP/AI.
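As a rough illustration of the "(simple) memory math behind model sizes" mentioned in the overview above, the back-of-the-envelope estimate below multiplies parameter count by bits per weight and adds an allowance for the KV cache and runtime. The 20 percent overhead factor is an assumption for illustration, not a figure from the guide.

# Back-of-the-envelope memory estimate for a quantized local model.
# The overhead factor is an illustrative assumption; real usage depends on context
# length, KV cache settings, and the runtime.
def estimated_memory_gb(params_billions: float, bits_per_weight: float, overhead: float = 0.20) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# Example: an 8B-parameter model at a ~4.5-bit GGUF quantization.
print(f"{estimated_memory_gb(8, 4.5):.1f} GB")  # roughly 5-6 GB before long-context KV cache growth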