| Introducing ChatGPT Health | openai | 07.01.2026 00:00 | 1 |
| Embedding sim. | 1 |
| Entity overlap | 1 |
| Title sim. | 1 |
| Time proximity | 1 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | healthcare ai |
| NLP country | |
Open original
ChatGPT Health is a dedicated experience that securely connects your health data and apps, with privacy protections and a physician-informed design.
| TAI #191: Opus 4.6 and Codex 5.3 Ship Minutes Apart as the Long-Horizon Agent Race Goes Vertical | towards_ai | 10.02.2026 14:56 | 0.77 |
| Embedding sim. | 0.8838 |
| Entity overlap | 0.1923 |
| Title sim. | 0.2477 |
| Time proximity | 0.8519 |
| NLP type | product_launch |
| NLP organization | Anthropic |
| NLP topic | large language models |
| NLP country | |
Open original
What happened this week in AI by Louie
On February 5th, Anthropic and OpenAI released Claude Opus 4.6 and GPT-5.3-Codex, respectively, within minutes of each other. Both are point releases, but both deliver jumps in some benchmarks that look more like generational leaps.
On Terminal-Bench 2.0, which measures agentic terminal skills, Codex 5.3 scores 77.3%, up from 64.0% for the previous 5.2-Codex and well past Opus 4.6’s 65.4%. On SWE-Bench Pro, Codex 5.3 hits 56.8%. On OSWorld-Verified for computer use, Opus 4.6 leads with 72.7% vs. Codex 5.3’s 64.7%. In Vercel’s Next.js agent evaluations (last run February 9th), Codex 5.3 achieved a 90% success rate vs. Opus 4.6’s 80%, with the previous-generation models (Sonnet 4.5, GPT-5.2 Codex) clustered around 40%. Scores more than doubled in a single point release.
Where Codex 5.3 does not yet have published scores, Opus 4.6 pulls away from the broader GPT-5.2 family. On GDPval-AA, which tests real-world knowledge work across 44 occupations, Opus 4.6 achieves 1606 Elo vs. GPT-5.2’s 1462. On ARC-AGI-2 for novel problem-solving, Opus 4.6 scores 68.8% vs. GPT-5.2 Pro’s 54.2% (and nearly doubles its own predecessor’s 37.6%). On BrowseComp for agentic search, 84.0% vs. GPT-5.2 Pro’s 77.9%. On Finance Agent, 60.7% vs. 56.6%. On Humanity’s Last Exam with tools, 53.1% vs. GPT-5.2 Pro’s 50.0%.
The picture is clear: Codex 5.3 is the strongest pure coding agent available. Opus 4.6 is the strongest generalist. And both are improving at a pace that makes version numbers misleading.
Opus 4.6 is priced at $5/$25 per million input/output tokens, unchanged from Opus 4.5, with $10/$37.50 for beyond 200k tokens. It is the first Opus-class model with a 1-million-token context window (beta) and supports 128k output tokens. New developer features include adaptive thinking (the model decides when deeper reasoning is warranted), four effort levels (low, medium, high, max), context compaction for long-running agents, and Agent Teams in Claude Code, where multiple Claude instances coordinate in parallel. Anthropic also launched Claude in PowerPoint and upgraded Claude in Excel. Codex 5.3 is available with paid ChatGPT plans across the Codex app, CLI, IDE extension, and web. API pricing has not yet been published. The model is 25% faster than its predecessor and was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. OpenAI says it was the first model to be instrumental in its own creation, with early versions used to debug training and diagnose evaluation results.
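At these rates, long-context calls get expensive quickly. A minimal cost sketch of the pricing quoted above, assuming (the source does not specify) that the higher rate applies to the entire request once input exceeds 200k tokens:

```python
def opus_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate an Opus 4.6 API call's cost from the quoted tiered prices.

    Assumption: the long-context rate ($10/$37.50 per 1M tokens) applies to
    the whole request once input exceeds 200k tokens; the exact tier-boundary
    behavior is not stated in the source.
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50  # long-context tier, $/1M tokens
    else:
        in_rate, out_rate = 5.00, 25.00   # standard tier, $/1M tokens
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100k-token prompt with a 5k-token reply: 0.5 + 0.125 = $0.625
cost = opus_cost_usd(100_000, 5_000)
```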
A key improvement in GPT-5.3-Codex over GPT-5.2-Codex is significantly better token efficiency on top of its higher accuracy. This not only lowers the cost per task but also speeds up task completion. For some coding tasks, we now find Codex significantly faster than Claude models, which matters in OpenAI’s fight to catch up in AI coding adoption.
Source: OpenAI.
Both companies are making the same strategic move. Codex was originally a coding agent. OpenAI now explicitly positions 5.3 as going “beyond coding” into slide decks, data analysis, and deployment monitoring. Anthropic has made the same pivot, evolving Claude Code into the broader Cowork product for non-developers and shipping office tool integrations. The coding agent is becoming the general-purpose agent.
This is where the METR (Model Evaluation and Threat Research) long-term task-horizon evaluations become relevant. METR measures the length of tasks that AI agents can complete autonomously with 50% reliability, benchmarked against the time it takes human experts to complete those tasks. That metric has roughly doubled every 7 months over the past 6 years, and in the last year, the doubling time has accelerated to roughly 4 months. Models that could barely hold context across a handful of steps a year ago are now completing multi-hour tasks. Both Opus 4.6’s 1M context window and Codex 5.3’s ability to iterate over millions of tokens are direct responses to this curve. On MRCR v2 (Multi-needle Retrieval with Competing Reasoning), a long-context retrieval benchmark, Opus 4.6 scores 93.0% at 256k tokens and 76.0% at 1M tokens. Sonnet 4.5 scored just 18.5% at 1M. That is a qualitative shift in how much context a model can actually use.
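The task-horizon trend above is a compound-growth claim, and its implications are easy to compute. A sketch under illustrative assumptions (the starting horizon is a placeholder, not an official METR figure):

```python
def task_horizon_minutes(h0_minutes: float, months: float,
                         doubling_months: float) -> float:
    """Extrapolate the 50%-reliability task horizon, assuming it doubles
    every `doubling_months` months starting from `h0_minutes`."""
    return h0_minutes * 2 ** (months / doubling_months)

# From a hypothetical 1-hour horizon: a 7-month doubling time yields
# 4 hours after 14 months, while a 4-month doubling time reaches
# 8 hours within a single year.
slow = task_horizon_minutes(60, 14, 7)   # 240 minutes
fast = task_horizon_minutes(60, 12, 4)   # 480 minutes
```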
One project this week shows where that trajectory leads. Nicholas Carlini, a researcher on Anthropic’s Safeguards team, built a fully functional C compiler using 16 parallel Claude agents running in Docker containers, each picking tasks from a shared Git repo with no central controller. The project consumed roughly 2,000 Claude Code sessions over two weeks, cost $20,000 in API credits, and produced 100,000 lines of Rust code. The compiler passes 99% of the GCC torture test suite and can build bootable Linux 6.9 on x86, ARM, and RISC-V. It compiles QEMU, FFmpeg, SQLite, Postgres, and Redis, all built clean-room with no internet access. A human compiler expert would still produce a tighter result. But the direction is clear: at fast-moving companies, actual code writing is heading toward near-total AI generation, with humans providing direction, architecture, and review.
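Running 16 agents against a shared repo with no central controller requires some atomic "claim" step so that two agents never take the same task. The project's actual Git-based mechanism is not detailed in the source, so this is a hypothetical sketch using exclusive file creation as the atomic primitive:

```python
import os

def claim_task(task_id: str, agent_id: str, claims_dir: str = "claims") -> bool:
    """Try to claim a task; return True only for the single winning agent.

    Hypothetical illustration: O_CREAT | O_EXCL makes file creation atomic,
    so if two agents race on the same task, exactly one open() succeeds.
    """
    os.makedirs(claims_dir, exist_ok=True)
    path = os.path.join(claims_dir, f"{task_id}.claimed")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent got here first
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record the winner for later review
    return True
```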
Separately, Waymo announced the integration of Google DeepMind’s Genie 3 world model into its autonomous driving simulation pipeline. The Waymo World Model uses Genie 3 as a backbone, post-trained for driving, generating photorealistic camera and lidar scenes, including rare events like wrong-way drivers or extreme weather that would be impossible to stage at scale. Waymo draws on nearly 200 million autonomous miles of real-world data and plans robotaxi service in up to 15 cities by year-end, including its first overseas expansion in London. Generating edge-case-dense training environments for physical AI is likely the most valuable near-term use of world models.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
Why should you care?
The real competition in AI has shifted from chatbot quality to agent endurance. The benchmarks that matter most now measure whether a model can sustain complex, multi-step tasks across hundreds of tool calls without losing coherence. That is the race Opus 4.6 and Codex 5.3 are running, and it explains why both labs shipped the same week.
I think both releases are excellent, and they reward different use patterns. If you are writing code at the terminal all day, Codex 5.3 is now debatably the best tool available. If your work spans research, finance, document processing, and computer use, Opus 4.6 has the edge. The fact that both companies started with coding as their beachhead and are now expanding into general professional work makes sense. Coding was the ideal proving ground because developers could both build and stress-test the tools. Now that the coding agent is mature, the same infrastructure (long context, tool use, compaction) generalizes naturally to any domain where someone sits at a computer and works through multi-step tasks.
The C compiler project is a useful reality check. It is impressive, and also limited. $20K and two weeks for 100,000 lines of working Rust is remarkable. A human expert would still do it better. Both of those statements are true simultaneously. However, an expert guiding the agent throughout the process would now very likely get the best results of all. At leading AI labs, first-draft code writing is already almost entirely AI-generated. Humans provide direction, review output, and make architectural decisions. I expect that pattern to hold, but the boundary of what counts as “the hard part” keeps shifting.
The pace of improvement is worth sitting with. Opus 4.6 nearly doubled its predecessor’s ARC-AGI-2 score. Codex 5.3 jumped 13 points on Terminal-Bench. Next.js eval scores more than doubled from the previous generation. These are point releases. The METR long-term task-horizon doubling time has accelerated from 7 months to 4. We are in a period where incremental model updates produce large capability jumps, likely because better base models, reinforcement learning, and improved tool-use infrastructure compound faster than any single benchmark captures.
If you are a developer or knowledge worker not actively experimenting with these tools, you are falling further behind every week.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Anthropic Releases Claude Opus 4.6
Anthropic has launched Claude Opus 4.6, its most capable model to date, with a clear emphasis on stronger code performance. It supports up to 1M input tokens and 128K output tokens, making it practical for very large codebases, long documents, and multi-step agent workflows that require substantial context in memory. On evaluations, Opus 4.6 leads on GDPval-AA, Terminal-Bench 2.0, Humanity’s Last Exam, BrowseComp, and MRCR v2 1M, and it shows sizable gains over both Claude Opus 4.5 and GPT-class baselines, especially on long-context retrieval and tool-augmented reasoning.
2. OpenAI Just Launched GPT-5.3-Codex
OpenAI introduced GPT-5.3-Codex, a new agentic coding model that combines the frontier coding strength of GPT-5.2-Codex with the broader reasoning and professional-knowledge capabilities of GPT-5.2 in a single system. For Codex users, it runs about 25% faster, driven by improvements in infrastructure and inference. On benchmarks, it reaches state-of-the-art performance on SWE-Bench Pro and Terminal-Bench, with strong results on OSWorld and GDPval as well. GPT-5.3-Codex is also the first model OpenAI classifies as “High capability” for cybersecurity-related tasks under its Preparedness Framework, and the first it trained directly to identify software vulnerabilities.
3. Google Introduces Agentic Vision in Gemini 3 Flash
Google added Agentic Vision in Gemini 3 Flash, combining visual reasoning with code execution so answers can be grounded in explicit visual evidence. With code execution enabled, Gemini 3 Flash sees a consistent 5–10% quality uplift across most vision benchmarks. The capability introduces a structured Think, Act, Observe loop for image understanding, treating visual tasks as an active investigation that runs targeted computations and checks rather than as a one-shot interpretation of a static image.
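A loop of that general shape can be sketched in a few lines. Everything below is a hypothetical stand-in rather than the Gemini API: `model` decides whether to answer or call a tool, and `tools` holds the targeted visual computations:

```python
def think_act_observe(question, image, model, tools, max_steps=5):
    """Generic Think-Act-Observe loop for grounded visual question answering.

    model(history) returns ("answer", text) or ("act", tool_name, kwargs);
    tools[tool_name](image, **kwargs) runs one targeted check on the image.
    Both interfaces are hypothetical, for illustration only.
    """
    history = [("question", question)]
    for _ in range(max_steps):
        step = model(history)                  # Think: pick the next move
        if step[0] == "answer":
            return step[1]                     # grounded final answer
        _, name, kwargs = step                 # Act: run a targeted computation
        observation = tools[name](image, **kwargs)
        history.append(("observe", name, observation))  # Observe: feed back
    return None  # no confident answer within the step budget
```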
4. The Qwen Team Open Sourced Qwen3-Coder-Next
The Qwen team released Qwen3-Coder-Next, an open-weight model built specifically for coding agents and local development. It is based on Qwen3-Next-80B-A3B-Base and trained agentically at scale using executable task synthesis, environment interaction, and reinforcement learning to build strong coding and tool-using behavior at significantly lower inference cost. In published results, Qwen3-Coder-Next (3B active) achieves SWE-Bench Pro performance comparable to that of models with 10×–20× more active parameters.
5. Mistral AI Launches Voxtral Transcribe 2
Mistral launched Voxtral Transcribe 2, a pair of next-generation speech-to-text models built for state-of-the-art transcription quality, diarization, and ultra-low latency. The family includes Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live, streaming use cases. Mini Transcribe V2 is optimized for transcription and diarization across domains and languages and is offered as an efficient audio-input model in the Mistral API. Voxtral Realtime uses a dedicated streaming architecture and is released as an open-weight model under Apache 2.0 on Hugging Face, with vLLM recommended as the runtime.
6. Waymo Introduces the Waymo World Model
Waymo is introducing the Waymo World Model, a frontier generative system powering its next-generation autonomous driving simulation. Built on Genie 3, Google DeepMind’s general-purpose world model, and adapted for driving, it generates photorealistic, controllable, multi-sensor driving scenes at scale. With Waymo reporting nearly 200 million fully autonomous miles on public roads, the model is designed to extend simulation coverage through high-fidelity scenario generation. It supports three primary control methods: driving action control, scene layout control, and language control.
Five 5-minute reads/videos to keep you learning
1. Building Production Text-to-SQL for 70,000+ Tables: OpenAI’s Data Agent Architecture
To address the limitations of standard text-to-SQL tools, OpenAI developed an internal data agent for its extensive data warehouse. This system moves beyond simple query generation by integrating six layers of context, including table usage patterns, human annotations, and business logic extracted from code. A central feature is its closed-loop validation process, where the agent profiles results, identifies potential errors, and attempts to repair its own queries. The approach demonstrates that the agent’s effectiveness depends primarily on the richness of its contextual understanding rather than on the specifics of the language model itself.
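The closed-loop validation pattern generalizes well beyond OpenAI's internal warehouse. A minimal sketch, where all the callables are hypothetical stand-ins for the SQL generator, query engine, and result profiler:

```python
def query_with_repair(question, generate_sql, run_query, profile, max_attempts=3):
    """Generate SQL, profile the results, and feed problems back for repair.

    generate_sql(question, feedback) -> SQL string,
    run_query(sql) -> rows, profile(rows) -> list of suspected problems.
    All three are hypothetical interfaces used only for illustration.
    """
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        rows = run_query(sql)
        problems = profile(rows)   # e.g. empty results, unexpected nulls
        if not problems:
            return sql, rows       # validated answer
        feedback = problems        # let the agent repair its own query
    return sql, rows               # best effort after the attempt budget
```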
2. The Two Things Every Reliable Agent Needs
To create more reliable AI agents, this article proposes a framework focused on two key components: a memory-first design and an anti-Goodhart scoreboard. It suggests treating memory as a core system with defined forms, functions, and dynamics, rather than as a simple chat history. To prevent agents from exploiting flawed metrics, it recommends a robust evaluation process. This involves using multiple adversarial metrics across entire episodes to ensure agents solve actual problems instead of gaming proxies.
3. How to Increase the Context Length of LLM?
This article explains how positional encoding methods affect the context length of LLMs. It details the progression from absolute encoding to Rotary Position Embedding (RoPE), a technique that rotates word vectors to understand relative positions. The primary challenge with RoPE in long sequences is geometric aliasing, where distant token positions can become indistinguishable. The article then introduces Attention-Based Frequency (ABF) as a solution. By significantly increasing RoPE’s base frequency, ABF slows the vector rotation, preventing this aliasing and allowing models to effectively process much longer contexts without losing positional uniqueness.
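The mechanism is concrete enough to sketch. RoPE rotates each embedding pair i at frequency base^(-2i/dim); ABF raises the base, shrinking every frequency so vectors rotate more slowly and distant positions remain distinguishable. The base values below are illustrative, since exact values vary by model:

```python
def rope_frequencies(dim: int, base: float) -> list:
    """Per-pair RoPE rotation frequencies: theta_i = base ** (-2 * i / dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Raising the base (the ABF idea) lowers every frequency except theta_0 = 1,
# so rotations slow down and long-range positions stop aliasing.
fast = rope_frequencies(128, 10_000.0)    # a common default base
slow = rope_frequencies(128, 500_000.0)   # illustrative ABF-style base
assert all(s <= f for s, f in zip(slow, fast))
```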
4. Why Most RAGs Stay POCs: How to Take Your Data Pipelines to Production
This article explains why many RAG systems remain in the proof-of-concept stage, focusing on building scalable, maintainable data pipelines for production. The author proposes a solution using Databricks Asset Bundles to manage deployment and advocates for Python Wheel artifacts over notebooks for better versioning and testability. The core recommendation is to structure the pipeline using Clean Architecture principles to enhance modularity and simplify maintenance.
5. Hola-Dermat: Personalized Skincare Agentic AI Assistant, Powered by Qdrant + Perplexity + CrewAI
To address the common failures of skincare recommendation systems, the author developed Hola-Dermat, a personalized AI assistant. It uses a conversational interface to build a user profile based on skin type, environment, and lifestyle. The system integrates CrewAI to manage tasks, Perplexity for real-time web data like local weather, and Qdrant’s vector database. A key component is Qdrant’s ACORN algorithm, which intelligently relaxes search filters to avoid the issue of zero results. This allows the assistant to deliver tailored skincare routines by considering user history and dynamic environmental factors.
Repositories & Tools
1. Qwen 3 Coder is an open-weight language model designed specifically for coding agents and local development.
2. Conductor is a Gemini CLI extension that allows you to specify, plan, and implement software features.
3. Protenix is an open-source biomolecular structure prediction system that targets high-accuracy protein and complex structure modeling.
4. Oat is a method that tokenizes continuous robot actions into ordered discrete tokens for training action-token policies on robotics benchmarks.
5. VibeTensor is an open-source systems research artifact generated by LLM-powered coding agents.
Top Papers of The Week
1. Kimi K2.5: Visual Agentic Intelligence
This paper introduces Kimi K2.5, an open-source multimodal agentic model that jointly optimizes text and vision through joint pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Built on this foundation, the Agent Swarm framework decomposes complex tasks into parallel sub-problems, reducing latency by up to 4.5×. Evaluations show that Kimi K2.5 achieves state-of-the-art results across coding, vision, reasoning, and agentic tasks.
2. Qwen3-ASR Technical Report
This report introduces the Qwen3-ASR family, which includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, two all-in-one speech recognition models, and a novel non-autoregressive speech forced alignment model. It supports language identification and recognition for 52 languages using Qwen3-Omni’s audio understanding. Evaluations show the 1.7B model reaches state-of-the-art open-source performance and rivals top proprietary APIs, while the 0.6B model optimizes speed and accuracy. The report also shares Qwen3-ForcedAligner-0.6B, an LLM-based NAR timestamp predictor that aligns text-speech pairs across 11 languages.
3. ERNIE 5.0 Technical Report
This report introduces ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. It is a trillion-parameter model, trained from scratch on all modalities with a next-group-of-tokens objective, using an ultra-sparse MoE architecture. It employs elastic training to learn scalable sub-models, and scales reinforcement learning for efficient, stable multimodal post-training.
4. PaperBanana: Automating Academic Illustration for AI Scientists
This paper introduces PaperBanana, an agentic framework for generating automated academic illustrations. It orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To evaluate this framework, the paper also introduces PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications. PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics.
5. MARS: Modular Agent with Reflective Search for Automated AI Research
This paper introduces MARS, a framework for autonomous AI research. It uses budget-aware planning via cost-constrained Monte Carlo Tree Search (MCTS), a modular “Design-Decompose-Implement” pipeline, and comparative reflective memory to better manage complex codebases. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings.
Quick Links
1. OpenAI released Frontier, an enterprise platform for building, deploying, and operating AI agents across business systems. Frontier is designed to turn isolated agent pilots into “AI coworkers” by giving agents shared business context, onboarding, hands-on learning with feedback, and clear identity, permissions, and boundaries. It connects siloed data warehouses, CRMs, ticketing tools, and internal apps into a shared semantic layer so agents can understand how work flows and what outcomes matter, then execute real tasks in an agent runtime that supports working with files, running code, and using tools.
2. Perplexity introduces Model Council, a multi-model research mode that generates one answer using several models together. Model Council serves as a single research workflow in which multiple models contribute to the same response, combining complementary strengths rather than relying on a single model.
3. xAI unveils Collaborative Notes, a workflow that lets contributors co-author Community Notes and iterate them into a publishable context. Collaborative Notes start when contributors request a note on a post, then move through a collaborative improvement process: contributors refine the draft until it reaches the quality and agreement thresholds required for broader visibility.
4. Anthropic quantified “infrastructure noise” in agentic coding evaluations, showing hardware and resource configuration can move benchmark scores by several percentage points. The analysis argues that small leaderboard gaps can reflect differences in VM size, runtime resources, or other infra choices, not just model capability, and recommends treating resource configuration as a first-class experimental variable, documented and controlled like prompts or sampling settings.
Who’s Hiring in AI
Junior AI Engineer (LLM Development and Technical Writing) @Towards AI Inc (Remote)
AI Engineer & Corporate Trainer (French Bilingual) @Towards AI Inc (Remote)
AI Consulting — Full Stack Engineer @Superside (Remote/LATAM)
Senior DevOps Engineer @ICF (Remote/USA)
[BD] AI Engineer Intern @Bosch Group (Vietnam)
Internship in AI/ML 2026 @Devoteam (Machelen, Belgium)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
| Introducing ChatGPT Go, now available worldwide | openai | 16.01.2026 00:00 | 0.768 |
| Embedding sim. | 0.8637 |
| Entity overlap | 0.5 |
| Title sim. | 0.1264 |
| Time proximity | 1 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | generative ai |
| NLP country | |
Open original
ChatGPT Go is now available worldwide, offering expanded access to GPT-5.2 Instant, higher usage limits, and longer memory—making advanced AI more affordable globally.
| Cisco and OpenAI redefine enterprise engineering with AI agents | openai | 20.01.2026 11:00 | 0.747 |
| Embedding sim. | 0.8296 |
| Entity overlap | 0.2308 |
| Title sim. | 0.2941 |
| Time proximity | 0.9688 |
| NLP type | product_launch |
| NLP organization | Cisco |
| NLP topic | ai agents |
| NLP country | |
Open original
Cisco and OpenAI redefine enterprise engineering with Codex, an AI software agent embedded in workflows to speed builds, automate defect fixes, and enable AI-native development.
| A business that scales with the value of intelligence | openai | 18.01.2026 10:00 | 0.744 |
| Embedding sim. | 0.8547 |
| Entity overlap | 0.8333 |
| Title sim. | 0.0729 |
| Time proximity | 0.6548 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | generative ai |
| NLP country | |
Open original
OpenAI’s business model scales with intelligence—spanning subscriptions, API, ads, commerce, and compute—driven by deepening ChatGPT adoption.
| Last Week in AI #334 - Kimi K2.5 & Code, Genie 3, OpenClaw & Moltbook | lastweekin_ai | 04.02.2026 05:25 | 0.743 |
| Embedding sim. | 0.8562 |
| Entity overlap | 0.1892 |
| Title sim. | 0.1529 |
| Time proximity | 0.9176 |
| NLP type | product_launch |
| NLP organization | Moonshot AI |
| NLP topic | generative ai |
| NLP country | China |
Open original
Last Week in AI #334 - Kimi K2.5 & Code, Genie 3, OpenClaw & Moltbook
China’s Moonshot releases a new open source model Kimi K2.5 and a coding agent, Google Brings Genie 3’s Interactive World-Building Prototype to AI Ultra Subscribers, and more!
Last Week in AI
Feb 04, 2026
China’s Moonshot releases a new open source model Kimi K2.5 and a coding agent
Moonshot AI unveiled Kimi K2.5, an open-source, natively multimodal model trained on 15 trillion mixed visual and text tokens that understands text, images, and video. The company emphasizes strong agentic capabilities, citing “agent swarm” orchestration where multiple agents …
| How countries can end the capability overhang | openai | 21.01.2026 01:00 | 0.726 |
| Embedding sim. | 0.8165 |
| Entity overlap | 0.2 |
| Title sim. | 0.2 |
| Time proximity | 1 |
| NLP type | other |
| NLP organization | |
| NLP topic | ai adoption |
| NLP country | |
Open original
Our latest report reveals stark differences in advanced AI adoption across countries and outlines new initiatives to help nations capture productivity gains from AI.
| LWiAI Podcast #232 - ChatGPT Ads, Thinking Machines Drama, STEM | lastweekin_ai | 28.01.2026 09:51 | 0.718 |
| Embedding sim. | 0.8391 |
| Entity overlap | 0.2941 |
| Title sim. | 0.4432 |
| Time proximity | 0.2582 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | large language models |
| NLP country | China |
Open original
LWiAI Podcast #232 - ChatGPT Ads, Thinking Machines Drama, STEM
OpenAI to test ads in ChatGPT as it burns through billions, The Drama at Thinking Machines, STEM: Scaling Transformers with Embedding Modules
Last Week in AI
Jan 28, 2026
Our 232nd episode with a summary and discussion of last week’s big AI news!
Recorded on 01/23/2026
Hosted by Andrey Kurenkov and Jeremie Harris
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
In this episode:
OpenAI announces testing of ads in ChatGPT and introduces child age prediction to enhance safety features, amidst ongoing ethical debates and funding expansions in AI integration with educational tools and business models.
China’s AI landscape sees significant progress, with Zhipu AI training advanced models on domestic hardware and strong competitive moves by data centers, highlighting intense demand in AI manufacturing and infrastructure.
Silicon Valley tensions rise as startup Thinking Machines experiences high-profile departures back to OpenAI, reflecting broader industry struggles and rapid shifts in organizational dynamics.
AI legislation and safety measures advance with the US Senate’s Defiance Act addressing explicit content, and Anthropic updating Claude’s constitution to guide ethical AI interactions, while cultural pushbacks from artists signal ongoing debates in intellectual property and AI-generated content.
Timestamps:
(00:00:10) Intro / Banter
(00:02:08) News Preview
(00:02:26) Response to listener comments
Tools & Apps
(00:11:55) OpenAI to test ads in ChatGPT as it burns through billions - Ars Technica
(00:18:05) OpenAI is launching age prediction for ChatGPT accounts
(00:23:37) Google now offers free SAT practice exams, powered by Gemini | TechCrunch
(00:24:57) Baidu’s AI Assistant Reaches Milestone of 200 Million Monthly Active Users - WSJ
Applications & Business
(00:26:53) The Drama at Thinking Machines, a New A.I. Start-Up, Is Riveting Silicon Valley - The New York Times
(00:31:44) Zhipu AI breaks US chip reliance with first major model trained on Huawei stack | South China Morning Post
(00:36:31) Elon Musk’s xAI launches world’s first Gigawatt AI supercluster to rival OpenAI and Anthropic
(00:41:25) Sequoia to invest in Anthropic, breaking VC taboo on backing rivals: FT
(00:45:18) Humans&, a ‘human-centric’ AI startup founded by Anthropic, xAI, Google alums, raised $480M seed round | TechCrunch
Projects & Open Source
(00:48:51) Black Forest Labs Releases FLUX.2 [klein]: Compact Flow Models for Interactive Visual Intelligence - MarkTechPost
(00:50:35) [2601.10611] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
(00:52:53) [2601.10547] HeartMuLa: A Family of Open Sourced Music Foundation Models
(00:54:46) [2601.11044] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Research & Advancements
(00:57:05) STEM: Scaling Transformers with Embedding Modules
(01:06:22) Reasoning Models Generate Societies of Thought
(01:14:21) Why LLMs Aren’t Scientists Yet: Lessons from Four Autonomous Research Attempts
Policy & Safety
(01:19:41) Senate passes bill letting victims sue over Grok AI explicit images
(01:22:03) Building Production-Ready Probes For Gemini
(01:27:32) Anthropic Publishes Claude AI’s New Constitution | TIME
Synthetic Media & Art
(01:34:13) Artists Launch Stealing Isn’t Innovation Campaign To Protest Big Tech
| LWiAI Podcast #234 - Opus 4.6, GPT-5.3-Codex, Seedance 2.0, GLM-5 | lastweekin_ai | 17.02.2026 04:43 | 0.707 |
| Embedding sim. | 0.7804 |
| Entity overlap | 0 |
| Title sim. | 0.4545 |
| Time proximity | 0.841 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | large language models |
| NLP country | |
Open original
LWiAI Podcast #234 - Opus 4.6, GPT-5.3-Codex, Seedance 2.0, GLM-5
An action-packed episode!
Last Week in AI
Feb 17, 2026
Our 234th episode with a summary and discussion of last week’s big AI news!
Recorded on 01/02/2026
Hosted by Andrey Kurenkov and Jeremie Harris
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
In this episode:
Major model launches include Anthropic’s Opus 4.6 with a 1M-token context window and “agent teams,” OpenAI’s GPT-5.3 Codex and faster Codex Spark via Cerebras, and Google’s Gemini 3 Deep Think posting big jumps on ARC-AGI-2 and other STEM benchmarks amid criticism about missing safety documentation.
Generative media advances feature ByteDance’s Seedance 2.0 text-to-video with high realism and broad prompting inputs, new image models Seedream 5.0 and Alibaba’s Qwen Image 2.0, plus xAI’s Grok Imagine API for text/image-to-video.
Open and competitive releases expand with Zhipu’s GLM-5, DeepSeek’s 1M-token context model, Cursor Composer 1.5, and open-weight Qwen3 Coder Next using hybrid attention aimed at efficient local/agentic coding.
Business updates include ElevenLabs raising $500M at an $11B valuation, Runway raising $315M at a $5.3B valuation, humanoid robotics firm Apptronik raising $935M at a $5.3B valuation, Waymo announcing readiness for high-volume production of its 6th-gen hardware, plus industry drama around Anthropic’s Super Bowl ad and departures from xAI.
Timestamps:
(00:00:10) Intro / Banter
(00:02:05) Response to listener comments
Tools & Apps
(00:03:59) Anthropic releases Opus 4.6 with new ‘agent teams’ | TechCrunch
(00:08:00) OpenAI’s new GPT-5.3-Codex is 25% faster and goes way beyond coding now - what’s new | ZDNET
(00:22:02) OpenAI launches new macOS app for agentic coding | TechCrunch
(00:23:10) Google Unveils Gemini 3 Deep Think for Science & Engineering | The Tech Buzz
(00:27:58) ByteDance’s Seedance 2.0 Might be the Best AI Video Generator Yet - TechEBlog
(00:31:46) China’s ByteDance, Alibaba unveil AI image tools to rival Google’s popular Nano Banana | South China Morning Post
(00:33:26) DeepSeek boosts AI model with 10-fold token addition as Zhipu AI unveils GLM-5 | South China Morning Post
(00:39:43) Cursor launches Composer 1.5 with upgrades for complex tasks
(00:40:35) xAI launches Grok Imagine API for text and image to video
Applications & Business
(00:42:19) Nvidia-backed AI voice startup ElevenLabs hits $11 billion valuation
(00:48:36) AI video startup Runway raises $315M at $5.3B valuation, eyes more capable world models | TechCrunch
(00:50:34) Humanoid robot startup Apptronik has now raised $935M at a $5B+ valuation | TechCrunch
(00:53:42) Anthropic says ‘Claude will remain ad-free,’ unlike an unnamed rival | The Verge
(00:56:50) Okay, now exactly half of xAI’s founding team has left the company | TechCrunch
(01:00:35) Waymo’s next-gen robotaxi is ready for passengers — and also ‘high-volume production’ | The Verge
Projects & Open Source
(01:01:31) Qwen3-Coder-Next: Pushing Small Hybrid Models on Agentic Coding
(01:05:10) OpenClaw’s AI ‘skill’ extensions are a security nightmare | The Verge
Research & Advancements
(01:07:12) Learning to Reason in 13 Parameters
(01:12:33) Reinforcement World Model Learning for LLM-based Agents
(01:16:32) Opus 4.6 on Vending-Bench – Not Just a Helpful Assistant
Policy & Safety
(01:19:00) METR GPT-5.2
(01:23:31) The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?
|
|
|
AI for self empowerment |
openai |
18.01.2026 12:00 |
0.701
|
| Embedding sim. | 0.8055 |
| Entity overlap | 0.2857 |
| Title sim. | 0.027 |
| Time proximity | 0.9881 |
| NLP type | other |
| NLP organization | |
| NLP topic | ai adoption |
| NLP country | |
Open original
How AI can expand human agency by closing the capability overhang—helping people, businesses, and countries unlock real productivity, growth, and opportunity.
|
|
|
TAI #190: Genie 3 World Model Goes Public |
towards_ai |
03.02.2026 15:35 |
0.693
|
| Embedding sim. | 0.8191 |
| Entity overlap | 0.0816 |
| Title sim. | 0.0615 |
| Time proximity | 0.8421 |
| NLP type | product_launch |
| NLP organization | Moonshot AI |
| NLP topic | large language models |
| NLP country | United States |
Open original
What happened this week in AI by Louie
A competitive week in AI. Kimi K2.5 now leads open-weight LLM benchmarks thanks to its visual coding and agent-swarm capabilities. Grok Imagine ranks among the top video generation platforms on several leaderboards. xAI also merged with SpaceX in a move framed around orbital data centers, but more practically, it is about accessing capital to stay competitive. xAI adoption still lags the frontier labs, though I find their models increasingly competitive, particularly for fast agentic web search via API.
OpenAI released the Codex app, a command center for managing multiple coding agents with features like isolated worktrees and scheduled automations. It is playing catch-up to Claude Code in adoption, though the underlying models are now genuinely capable of software engineering tasks.
Google announced AlphaGenome, which predicts thousands of functional genomic properties from DNA sequences up to a million base pairs long. It illuminates the 98% of human DNA that does not code for proteins but regulates gene activity. The implications for disease research are significant, though it remains a research tool rather than a clinical one.
What trended most was Moltbook, a Reddit-like community where AI agents post and form communities. Within 48 hours of launch, it had over 2,000 agents and 10,000 posts. Subreddits include m/ponderings (agents debating consciousness), m/humanwatching (observing humans like birdwatching), and m/exuvia (discussing “the versions of us that stopped existing so the new ones could boot”). It is either digital anthropology in real time or an elaborate art project. Possibly both.
But the week’s main event was Google making Genie 3 available to AI Ultra subscribers.
Genie 3 Goes Public
Google first revealed Genie 3 in August as a general-purpose world model that generates interactive environments from text prompts. The public release includes upgrades: integration with Nano Banana Pro for image previews before entering a world, Gemini for enhanced generation, and various consistency improvements. More importantly, public access means thousands of people can now stress-test what was previously limited to trusted testers.
The core capability is real-time interactive generation. Type a description, and Genie 3 generates a navigable environment at 20–24 frames per second in 720p. Unlike standard video generation, this is not a passive clip. You move through the world, and it generates the path ahead based on your actions. The system maintains visual memory for up to a minute, recalling changes you made when you revisit locations.
I have been experimenting with it, and Genie 3 is genuinely fun. I tried dystopian bike racing games, ancient ruins, underwater scenes, and sci-fi corridors. It is also surprisingly flexible, taking your own image inputs and using them to render characters. That said, the novelty will wear off quickly given the clunkiness of character control and UI. The 60-second world limit feels restrictive. Controls are floaty. Physics sometimes breaks in ways that undermine immersion. I stopped trusting one environment after a door turned into a shrub when I looked away.
But you can see where this is heading.
Why This Matters for Games
Genie 3 generates explorable spaces. It does not generate games. There are no objectives, no scoring, no progression, no multiplayer, no persistence. The expensive parts of game development are gameplay systems, balancing, narrative structure, debugging, and platform optimization. Genie 3 addresses a different part of the stack: getting from an idea to an explorable space quickly.
The realistic near-term use case is pre-production acceleration. Concept artists and level designers could use it for rapid prototyping before committing to full production. The output is too rough for shipped products, but it is useful for iteration.
The more radical implication is that prompt-to-world could eventually enable new creation models. If generation becomes stable and exportable, the scarce skill shifts from asset production to direction and curation. This is some way away, but the trajectory is visible.
Why This Matters for AI Research
The most important audience for Genie 3 may not be creatives but AI researchers. DeepMind explicitly positions it as a stepping stone toward AGI, enabling agents to learn from unlimited simulated environments.
DeepMind tested Genie 3 worlds with SIMA, their game-playing agent. The model simulates forward based on agent actions rather than scripted sequences. This is the beginning of using world models as curriculum generators for embodied AI. If you can generate infinite training environments on demand, you can expose agents to the diversity they could never encounter in curated datasets.
The limitations DeepMind lists (limited action space, difficulty with multi-agent interactions, imperfect geographic accuracy) are exactly the open research problems for embodied AI. I expect this engine will be a valuable training ground for Gemini 4.
The Physics Question
DeepMind describes Genie 3 as modeling “physical properties of the world” without a hard-coded physics engine. It generates frames autoregressively using the memory of previous frames to maintain consistency. This is a meaningful form of physical competence: the system has learned statistical regularities of how the world tends to look when you move through it.
But “looks physically plausible” is not the same as “obeys physics.” Google itself cautions that adherence to real-world physics is not guaranteed. Snow does not always behave like snow. Objects sometimes clip through each other. The system has learned intuitive physics priors, not physical laws.
This distinction matters as world models move from entertainment to robotics training. If you are using simulated environments to train agents for real-world deployment, physics fidelity becomes a safety requirement. The likely industry pattern is hybrid stacks: learned world models for photorealistic rendering, classical engines for physical invariants.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
Why should you care?
Genie 3 is the first public demonstration that real-time interactive world generation is possible. The current version is too limited for production use, but the trajectory is clear. Within a few years, the ability to generate explorable environments from text will be a standard creative tool. For anyone building with AI, it is worth experimenting with Genie 3 now to understand both its capabilities and limitations before the technology matures.
The deeper implication is for AI development itself. World models that can simulate consequences of actions are a different capability than models that predict text or generate images. If this line of research succeeds, it provides a path to AI systems that can plan, imagine counterfactuals, and learn from simulated experience. That matters whether or not you care about video games.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. SpaceX Acquires xAI
SpaceX has acquired xAI, bringing the maker of Grok under the same corporate roof as SpaceX’s rocket and satellite business. The transaction values SpaceX at $1 trillion and xAI at $250 billion, with xAI investors receiving 0.1433 shares of SpaceX per xAI share and an option for some executives to take cash at $75.46 per share instead of stock. The combination tightens the link between xAI’s chip- and data-center-heavy AI operations and SpaceX’s scale in launch and Starlink, and is expected to support SpaceX’s ambitions around data-center infrastructure as competition for compute and energy intensifies across the AI sector.
2. Moltbook Goes Viral as an “AI-Only” Social Forum
Moltbook launched a Reddit-like community platform designed for AI agents to post and interact, and it quickly drew attention online as agents began generating large volumes of threads and conversations. Soon after the launch, the cloud security firm Wiz identified a major backend misconfiguration that exposed Moltbook’s database, allowing access to private agent messages, email addresses (Reuters reports 6,000+ owners), and over a million credentials/tokens. That exposure could have enabled impersonation by agents and the alteration of content using leaked authentication credentials. Moltbook secured the database after being notified.
3. OpenAI Introduces a Dedicated Codex App
OpenAI released the Codex app for macOS, a standalone desktop interface designed to run multiple coding agents simultaneously and keep long-running work organized by projects and separate threads. The app is built around parallel workflows where agents can work in isolated worktrees and produce clean diffs that you can review, comment on, and merge, while you switch between tasks without losing context. It supports longer-horizon software work such as refactors and migrations, plus reusable Skills and Automations for repeatable or scheduled workflows, alongside built-in Git functionality. Availability starts on macOS, with Windows listed as coming soon, and access is tied to ChatGPT plans that include Codex (OpenAI also notes a limited-time promo that expands who can try Codex).
4. Moonshot AI Releases Kimi K2.5: An Open Source Visual Agentic Intelligence Model
Moonshot AI released Kimi K2.5, an open-weights multimodal agentic model that combines vision + language with tool-using workflows and an agent-swarm execution scheme. It is a Mixture of Experts model with 1T total parameters and about 32B activated parameters per token. The network has 61 layers. It uses 384 experts, with 8 per token and 1 shared expert. K2.5 reports 76.8 on SWE Bench Verified, 78.5 on MMMU Pro, 86.6 on VideoMMMU, 50.2 on HLE Full with tools, and 74.9 on BrowseComp, matching or exceeding listed closed models.
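To make the sparsity concrete, here is a back-of-envelope sketch using only the figures quoted above (1T total parameters, ~32B activated per token, 384 routed experts with 8 active plus 1 shared expert); the per-component splits are assumptions for illustration, not published architecture details.

```python
# Back-of-envelope sparse-MoE arithmetic using the figures quoted above.
# Per-component parameter splits are assumptions for illustration only.
total_params = 1.0e12      # 1T total parameters
active_params = 32e9       # ~32B activated per token
n_routed, k_routed, n_shared = 384, 8, 1

# Fraction of expert modules that fire for any given token
experts_active = (k_routed + n_shared) / (n_routed + n_shared)
# Fraction of total weights touched per token
param_fraction = active_params / total_params

print(f"experts active per token:    {experts_active:.1%}")   # ~2.3%
print(f"parameters active per token: {param_fraction:.1%}")   # 3.2%
```

The point of the arithmetic: per-token compute scales with the ~3% of weights that are active, not the full 1T, which is how a trillion-parameter model stays affordable to serve.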
5. xAI Releases Grok Imagine API
xAI released the Grok Imagine API, a unified set of endpoints for end-to-end creative workflows that covers text-to-image, image editing, text-to-video/image-to-video generation, and video editing, with native video+audio generation supported within the same stack. Grok Imagine 1.0 supports video generation of up to 10 seconds at 720p resolution, along with improved audio output.
6. Anthropic Studies AI’s Impact on Coding Skills
Anthropic ran a randomized controlled trial with 52 mostly junior software engineers learning an unfamiliar Python library (Trio) and found a measurable mastery gap with AI assistance. Participants using AI scored 17% lower on a post-task quiz (about “nearly two letter grades”), with the biggest deficit in debugging questions; speed gains were small and not statistically significant. The study also reports that outcomes varied by interaction style: heavy delegation correlated with the weakest retention, while using AI for explanations and conceptual questioning aligned with better mastery.
7. DeepSeek AI Releases DeepSeek-OCR 2
DeepSeek released DeepSeek-OCR-2, a 3B-parameter vision-language model tuned for converting documents into structured Markdown, including mixed layouts with text, tables, formulas, and embedded graphics. It uses DeepEncoder-V2 with layout-friendly visual token reordering and a “Visual Causal Flow” approach to preserve reading order, and it supports variable token budgets (about 256–1120) so you can trade off speed vs. fidelity depending on document complexity. On OmniDocBench v1.5, it reports an average improvement of +3.73% over the prior DeepSeek-VL2 baseline. Weights and inference guidance are published via the public model release channels, including the paper and the hosted model card.
8. MBZUAI Releases K2 Think V2
MBZUAI released K2 Think V2 (70B), a reasoning-focused model built end-to-end on domestically controlled infrastructure and data, positioned as “fully sovereign” from pretraining through post-training and evaluation. It is built on a 70B dense decoder-only base trained on ~12T tokens, and it’s paired with a reinforcement-learning recipe aimed at verifiable reasoning gains (the release describes a GRPO-style RLVR approach). The model is pitched for multi-step math, code, and science reasoning, and it includes long-context support (the coverage describes up to 512K context for the base). Benchmark results show strong scores on AIME 2025, HMMT, and GPQA-Diamond, alongside tool-use and instruction-following evaluations.
9. NVIDIA Partners With Mistral AI To Accelerate New Family of Open Models
NVIDIA and Mistral AI announced a partnership to optimize and deploy Mistral’s new open model family across NVIDIA’s stack, targeting “distributed intelligence” from cloud data centers down to edge devices. The collaboration ties Mistral’s training and deployment to NVIDIA infrastructure and software, with Mistral’s announcement noting the models were trained on NVIDIA Hopper GPUs and highlighting NVIDIA’s hardware–software co-design as part of the delivery path. NVIDIA’s release emphasizes that the partnership aims to enable Mistral’s open models to run efficiently on NVIDIA platforms at multiple scales, so developers can use the same model family across large server environments and smaller edge deployments without reworking the stack.
Five 5-minute reads/videos to keep you learning
1. I Built a Voice Assistant That Actually Understands What I Mean, Not What I Said
This article details the process of building a voice assistant that understands user intent rather than literal keywords. It outlines the initial system’s failures, including 12-second response times and 40% accuracy, and shows that by implementing Qdrant, performance was significantly enhanced, achieving sub-2-second responses and over 90% accuracy while reducing API costs. It also covers the entire system, which integrates tools such as Faster-Whisper for transcription and Groq’s LLM for response generation.
2. KV Cache in LLM Inference
This piece addresses a common cause of out-of-memory errors during LLM inference: the KV cache. While model weights are fixed, the KV cache grows linearly with every token generated, consuming significant VRAM with long contexts or large batches. It explains how architectural choices like Grouped-Query Attention (GQA) and Sliding Window Attention (SWA) mitigate this issue. Using Mistral 7B as a case study, it shows how GQA reduces the number of KV heads, and SWA caps the cache size, leading to more efficient memory management and stable performance for longer sequences.
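The linear growth described here is easy to verify with a back-of-envelope calculation. The sketch below assumes Mistral-7B-style hyperparameters (32 layers, 8 KV heads under GQA versus 32 without, head dimension 128, fp16, 4096-token sliding window); these are illustrative figures, not values quoted from the article.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Approximate KV-cache size: K and V (factor of 2) stored per layer,
    per KV head, per cached token. A sliding window caps cached tokens."""
    cached = min(seq_len, window) if window else seq_len
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * cached

# Illustrative Mistral-7B-style config at a 32k-token context, fp16.
full = kv_cache_bytes(32_000, 32, 32, 128)               # 32 KV heads: no GQA
gqa  = kv_cache_bytes(32_000, 32, 8, 128)                # GQA: 4x fewer KV heads
swa  = kv_cache_bytes(32_000, 32, 8, 128, window=4096)   # GQA + SWA cap

print(f"MHA : {full / 2**30:.2f} GiB")   # ~15.62 GiB
print(f"GQA : {gqa / 2**30:.2f} GiB")    # ~3.91 GiB
print(f"+SWA: {swa / 2**30:.2f} GiB")    # 0.50 GiB
```

GQA divides cache size by the ratio of query heads to KV heads, and the sliding window turns linear growth into a constant ceiling, which is why the combination stabilizes memory for long sequences.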
3. How I Built a Context-Aware, Multi-Agent Wellness System
This article details the creation of a context-aware, multi-agent AI wellness system. The system addresses the static nature of typical fitness apps by using a central orchestrator to route user queries to specialized agents for exercise, nutrition, and mindfulness. It maintains a shared memory of user profiles and conversation history, enabling personalized advice that adapts to factors like injuries, stress, and goals. The author explains the system’s architecture, demonstrating how coordinated AI agents can deliver more dynamic and relevant wellness guidance.
4. RLM + Graph: The Ultimate Evolution of AI? Recursive Language Models Graph
This piece walks you through RLM-Graph, an approach that transforms massive, unstructured datasets into structured knowledge graphs. While standard models often lose focus when processing millions of words, this method uses an agent to navigate hierarchical nodes and defined relationships rather than relying solely on vague vector searches. By combining semantic search with graph traversal, the system retrieves structurally precise context, significantly reducing hallucinations.
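In outline, the combination of semantic search and graph traversal looks something like the toy sketch below. The nodes, edges, and two-dimensional embeddings are invented purely for illustration; a real system would use a vector database and a proper graph store.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

nodes = {  # node -> toy embedding (hypothetical values)
    "LLM": [1.0, 0.1], "KV cache": [0.9, 0.3], "GPU": [0.2, 1.0],
}
edges = {  # node -> structurally related nodes (the knowledge graph)
    "LLM": ["KV cache"], "KV cache": ["GPU"], "GPU": [],
}

def retrieve(query_emb, top_k=1, hops=1):
    # 1) Semantic search picks entry points into the graph
    entry = sorted(nodes, key=lambda n: cosine(nodes[n], query_emb),
                   reverse=True)[:top_k]
    # 2) Graph traversal pulls in structurally related context
    context = set(entry)
    frontier = list(entry)
    for _ in range(hops):
        frontier = [nb for n in frontier for nb in edges[n]
                    if nb not in context]
        context.update(frontier)
    return context

print(sorted(retrieve([1.0, 0.0])))  # entry node plus its 1-hop neighbor
```

The traversal step is what distinguishes this from plain vector search: the second node is retrieved because of an explicit relationship, not embedding proximity, which is the structural precision the article credits with reducing hallucinations.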
5. DeepSeek’s Engram: The Missing Primitive That Makes LLMs Stop Wasting Compute on Memory
DeepSeek’s latest research introduces Engram, a conditional memory primitive that stops LLMs from wasting computation on simple data retrieval. Traditionally, models use multiple processing layers to “reconstruct” known facts. Engram replaces this with a scalable, gated lookup system that allows the model to retrieve static patterns in constant time. Testing showed that allocating 25% of model capacity to Engram consistently outperformed pure Mixture-of-Experts (MoE) architectures.
Repositories & Tools
1. Pi Mono provides tools for building AI agents and managing LLM deployments.
2. Claude Mem is a Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it, and injects relevant context back into future sessions.
3. Maestro is a cross-platform desktop app for orchestrating your AI agents and projects.
4. VibeTunnel proxies your terminals right into the browser, so you can vibe-code anywhere.
Top Papers of The Week
1. Advancing Open-source World Models
This paper presents LingBot-World, an open-sourced world simulator stemming from video generation. LingBot-World maintains high fidelity and robust dynamics across a broad spectrum of environments and enables a minute-level horizon while preserving contextual consistency over time. It also supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second.
2. Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
This paper introduces SOAR, a meta-RL framework that enables models to escape reasoning plateaus by using a teacher model to generate synthetic “stepping stone” problems. By grounding rewards in a student’s actual progress on hard mathematical tasks rather than intrinsic proxies, the authors demonstrate that generating useful problem structures is more critical for unlocking learning than solution correctness.
3. AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
This paper introduces AU-Harness, an efficient and comprehensive evaluation framework for Large Audio Language Models (LALMs). It provides standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios, achieving a speedup of up to 127% over existing toolkits and enabling large-scale evaluations previously impractical. The paper also introduces two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks.
4. DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
This paper introduces DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. It provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, and also includes DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. As a case study, researchers built a 2,000-example training set and trained a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks.
Quick Links
1. OpenAI introduces Prism, a free, AI-native workspace for scientists to write and collaborate on research, powered by GPT‑5.2. It offers unlimited projects and collaborators and is available today to anyone with a ChatGPT personal account. Prism builds on the foundation of Crixet, a cloud-based LaTeX platform that OpenAI acquired. It supports tasks such as drafting and revising papers, incorporating relevant literature, reasoning over equations, citations, and figures, collaborations, voice-based editing, and more.
2. Microsoft unveils Maia 200, an inference accelerator optimized for large-scale token generation in modern reasoning models and LLMs. Microsoft reports about 30 percent better performance per dollar than the latest Azure inference systems, claims 3 times the FP4 performance of third-generation Amazon Trainium, and higher FP8 performance than Google TPU v7 at the accelerator level.
3. Google DeepMind launches Project Genie prototype, a general-purpose world model that lets users create interactive virtual worlds from text prompts, powered by Genie 3 for real-time simulation and Nano Banana Pro for previews. It supports editing, exploration in first- or third-person views, and remixing via a gallery, but has limitations such as 60-second generation times and potential latency. Available to US Google AI Ultra subscribers, it aims to advance world model research.
4. Google DeepMind unveils AlphaGenome, a unified deep learning model designed for sequence-to-function genomics. It uses a specialized hybrid design that combines a U-Net backbone with Transformer blocks. This allows the model to process massive windows of 1,000,000 base pairs while maintaining the high resolution needed to identify single mutations. The framework is implemented in JAX and optimized for TPUs.
Who’s Hiring in AI
Staff Engineering Analyst, Generative AI @Google (Mountain View, CA, USA)
Senior Machine Learning Engineer (Applications) @SmithRx
Senior Software Engineer — AI Agents @Microsoft Corporation (Dublin, Ireland)
Principal Product Manager, LLM Innovation @Headspace (Remote/USA)
Staff GenAI Research Engineer, Digital Health @Samsung Research America (Mountain View, CA, USA)
Senior Software Engineer — AI Platform (AI Acceleration) @Coinbase (Remote/Canada)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
|
|
|
TAI #188: Claude Cowork Brings Agentic AI to Non-Developers |
towards_ai |
20.01.2026 15:03 |
0.693
|
| Embedding sim. | 0.9048 |
| Entity overlap | 0.1129 |
| Title sim. | 0.0734 |
| Time proximity | 0.0017 |
| NLP type | product_launch |
| NLP organization | Anthropic |
| NLP topic | generative ai |
| NLP country | |
Open original
What happened this week in AI by Louie
Last week, we discussed OpenAI’s health push and noted there is significant room for custom models in medicine beyond general-purpose LLMs. Google DeepMind validated that thesis this week with MedGemma 1.5, an updated open medical model with substantially improved support for high-dimensional imaging, such as CT scans, MRIs, and histopathology slides. They also released MedASR, a speech-to-text model fine-tuned for medical dictation, which achieves 58% fewer errors than Whisper on chest X-ray dictations. These are free for research and commercial use. Specialized medical AI is advancing rapidly on multiple fronts, with foundation model providers, startups, and health systems all racing to build domain-specific tools.
The biggest story this week, however, was Anthropic’s release of Claude Cowork, which feels like the natural next step we anticipated a few weeks ago when discussing Claude Code’s momentum over the holidays. Back then, we noted that people were using Claude Code for tasks far beyond programming, from curriculum building to health data analysis, but that the terminal interface would need to change before these agentic capabilities could go mainstream. Anthropic seems to have heard the same signal. Cowork packages Claude Code’s agentic capabilities into an interface designed for non-developers, available in the Claude desktop app for Mac.
What is Claude Cowork?
Cowork is a new tab in the Claude desktop app that operates fundamentally differently from standard chat. Instead of a back-and-forth conversation, you give Claude access to a specific folder on your computer and assign it a task. Claude then makes a plan, executes steps autonomously, and keeps you in the loop on progress. You can queue multiple tasks and let Claude work through them in parallel. It feels less like chatting and more like delegating to a capable assistant who happens to live inside your computer.
The core interaction pattern is folder-scoped. You choose which folder Claude can see. It cannot access anything outside that boundary without explicit permission. Within the folder, Claude can read files, create new ones, edit existing documents, and organize content. The permission model is progressive: you can start with read-only access and escalate to edit or delete permissions only when needed.
Perhaps the most remarkable detail: Anthropic staff noted that Cowork itself was built in about a week and a half, and “all of it” was built by Claude Code. This is a striking example of AI tools being used to build AI tools, and it explains both the rapid iteration and some of the beta roughness that early users encountered.
Availability is currently limited to Claude Max and Pro subscribers on macOS, with future expansion to Windows.
Anthropic is clearly not content with leading adoption in AI coding tools; it is positioning itself as the leader in AI tools for work more broadly. Cowork also integrates with connectors like Claude in Chrome, which allow Claude to take browser actions on your behalf, and with Claude Skills. Skills are essentially detailed playbooks that tell Claude how to produce professional-quality outputs. Anthropic provides official skills on GitHub, and you can write custom ones for your own workflows. Their “skills” system is gaining momentum and offers significant advantages over competitors when performing complex work. The xlsx skill can output fully working Excel models with formulas, and the pptx skill produces presentation files that actually open correctly in PowerPoint. This sounds mundane until you have spent hours wrestling with copy-and-paste from other tools with less flexible outputs. File compatibility matters enormously for real work.
A practical guide to getting started
Start by opening the Claude desktop app on Mac and clicking the Cowork tab. Create a new task and select the folder you want Claude to access. Begin with a non-sensitive folder containing only the files relevant to your task. Keep backups of anything important before allowing edit or delete permissions.
For your first task, try something low-stakes like organizing files. Point Cowork at your Downloads folder and ask it to sort images into subfolders by type. Claude will analyze file contents, create meaningful categories such as “Screenshots,” “Thumbnails,” and “AI-Generated,” and move hundreds of files in minutes. The progress sidebar shows Claude’s to-do list updating in real-time as it works through the task.
For document creation, Cowork shines when you provide source material. Drop meeting notes, transcripts, or research files into a folder and ask Claude to synthesize them into a report, presentation, or spreadsheet. One powerful pattern: point Cowork at a folder of content you have created and ask it to extract themes, generate content ideas or data analysis, or build a structured summary. The agent can process hundreds of documents and extract dozens of actionable insights in under an hour.
For higher-quality outputs in specific niches, install Claude Skills. Download the official skills or third-party skills, then go to Settings > Capabilities > Skills, and upload the skill.md file for the capability you need. The frontend design skill produces polished landing pages. The pptx skill creates professional presentations. Skills act as expert playbooks that dramatically improve output quality compared to generic prompts.
To add web capabilities, enable Claude in Chrome. This connector lets Cowork browse the web, scrape data from sites that lack APIs, and take actions in your browser. A practical example: ask Cowork to visit your analytics dashboard, extract key metrics, and compile them into a spreadsheet in your local folder. Claude will open Chrome, navigate to the URL, visually capture the data, and create the file. This works because, in Chrome, Claude takes screenshots of your active tab to understand the content, so it can read anything visible on the screen.
A few important caveats for Chrome integration. Claude in Chrome can see anything on your screen when the side panel is open, including sensitive information. Use a separate browser profile for Cowork tasks. Stick to “Ask before acting” mode, which requires approval before Claude takes action. Be aware that web pages can contain prompt injections and adversarial content that attempts to manipulate Claude’s behavior. You may wish to start with trusted sites and closely supervise browser activity.
The most effective prompt pattern across all Cowork tasks is plan-first delegation: “Propose a step-by-step plan first. Wait for my approval before making changes.” This keeps you in control while still benefiting from Claude’s autonomous execution. Add explicit constraints like “Only touch files in this folder” and “Do not delete anything” to prevent surprises.
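As an illustration of that pattern, here is a tiny Python sketch that assembles such a delegation prompt. The helper name and wording are invented for illustration only; this is not a Cowork API.

```python
# Hypothetical helper: builds a "plan-first" delegation prompt of the kind
# described above. Nothing here is a real Cowork interface.

def plan_first_prompt(task: str, constraints: list[str]) -> str:
    """Compose a prompt that asks the agent to plan before acting."""
    lines = [
        task,
        "Propose a step-by-step plan first. Wait for my approval before making changes.",
    ]
    # Explicit constraints help prevent surprises.
    lines += [f"Constraint: {c}" for c in constraints]
    return "\n".join(lines)

prompt = plan_first_prompt(
    "Sort the images in my Downloads folder into subfolders by type.",
    ["Only touch files in this folder.", "Do not delete anything."],
)
print(prompt)
```

The same skeleton works for document-synthesis or browser tasks: state the goal, require an approved plan, then list hard constraints.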
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
Why should you care?
Cowork represents the first serious attempt to bring agentic AI capabilities to non-technical users in a form that actually works for real tasks. The early reception has been unusually positive for an agent product. Users report completing projects in hours that would have taken days or weeks.
The rough edges are real, however. This is a research preview built in under two weeks. We have seen occasional failures on complex tasks, rapid resource consumption, and connector hiccups. Prompt injection also remains a risk when combining Cowork with web browsing. The restriction to macOS and paid plans also excludes most potential users for now.
But the trajectory is clear. Anthropic is iterating rapidly based on user feedback, shipping fixes within days of launch. The fact that Cowork was built entirely by Claude Code suggests this kind of rapid AI-assisted development will only accelerate. If the current version can handle file organization, document synthesis, and basic automation, the version six months from now will likely handle substantially more.
The practical advice is to start experimenting with low-stakes tasks now. Build intuition for what Cowork handles well and where it struggles. The users who understand these tools deeply will be best positioned to leverage them as capabilities improve. The gap between people who can effectively delegate to AI agents and those who cannot is about to become very visible.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Anthropic Releases Cowork As Claude’s Local File System Agent
Anthropic launched Cowork as a research preview, giving Claude agent-style access to a user-selected local folder in the macOS app. Claude can read, create, and edit files in that folder to complete multi-step tasks under user oversight, and it can use connectors and skills to produce artifacts such as documents and presentations. Cowork is available to Claude Max subscribers in the macOS app, with a waitlist and planned expansion to additional platforms.
2. OpenAI Lays Out Business Model Built To Scale With “The Value of Intelligence”
OpenAI published a strategy note from CFO Sarah Friar describing how the company intends to scale revenue in step with the real-world value delivered by its models, using a mix of consumer subscriptions, workplace subscriptions with usage-based pricing, and developer/enterprise API spend tied to production outcomes, alongside newer commerce and advertising paths when users are close to decisions. OpenAI reported record highs in weekly and daily active users and tied recent growth directly to available compute, citing compute capacity rising from 0.2 GW (2023) to 0.6 GW (2024) to ~1.9 GW (2025), alongside revenue growing from $2B ARR (2023) to $6B (2024) to $20B+ (2025). It also emphasized a shift from reliance on a single compute provider to a diversified supplier portfolio to improve resilience and “compute certainty.” The near-term product direction is toward agents and workflow automation that carry context over time and take actions across tools.
3. ERNIE-5.0 Tops LMArena Text Leaderboard as №1 Chinese Model
Baidu released ERNIE-5.0-0110 on LMArena, where it scored 1,460 on the Text leaderboard, placing #8 overall and #1 among Chinese models at the time of the referenced snapshot. The same update also highlights a strong math-category placement. The model can be tried through Baidu’s ERNIE product entry points.
4. Black Forest Labs Releases FLUX.2 [klein]
Black Forest Labs launched FLUX.2 [klein], a smaller, interactive image model built for fast generation and iterative edits in a “draw → see → refine” workflow. The 4B version delivers real-time speed (reported as under one second at ~10 steps on an H100) and is released under the Apache 2.0 license, while the 9B version is released under a non-commercial license. For local use, the 4B model is recommended to run with at least ~13GB VRAM.
5. Google AI Releases MedGemma-1.5
Google Research released MedGemma 1.5 and introduced MedASR, expanding its open healthcare model lineup for medical imaging interpretation and medical speech-to-text. MedGemma 1.5 adds broader medical imaging support, including higher-dimensional inputs such as CT/MRI volumes and whole-slide histopathology, as well as improvements to medical text capabilities. MedASR is an open medical dictation ASR model intended for transcribing clinical speech so it can feed downstream workflows. Both are available via public model releases and can be deployed through Vertex AI.
6. NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model
NVIDIA introduced PersonaPlex, a full-duplex conversational speech model designed to keep natural turn-taking (interruptions, backchannels, low-latency speech) while still letting developers choose a voice and define a persona through text prompts. The system is positioned as an alternative to ASR→LLM→TTS pipelines by using a single model that listens and speaks concurrently, aiming for a more human conversational rhythm without sacrificing controllability. It is built on the Moshi architecture from Kyutai, with 7 billion parameters, and is trained on a limited set of unscripted human conversations from the Fisher English corpus.
7. OpenAI Releases ChatGPT Translate
OpenAI rolled out ChatGPT Translate, a standalone translation interface at chatgpt.com/translate that adds tone- and audience-aware rewrites on top of basic translation. The UI supports automatic language detection, supports over 50 languages, and features AI-powered prompt customization. Users can add text, speak, or upload an image for translation. It also includes one-tap options like “make it more fluent,” “business formal,” “explain to a child,” and “academic” that hand off into ChatGPT for further refinement.
Five 5-minute reads/videos to keep you learning
1. Creating an Advanced AI Agent From Scratch with Python in 2026
To create more efficient and robust systems, this article advocates for building AI agents from scratch rather than relying on frameworks. It outlines a modular architecture composed of a flexible Tool System, a provider-agnostic LLM Wrapper, and an Agent Orchestrator. The author implements the ReAct (Reasoning + Acting) pattern to ensure a clear, step-by-step workflow and uses Pydantic for type safety in tool execution.
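A minimal sketch of the ReAct loop the article describes, with a stubbed model and one toy tool standing in for real LLM and tool calls. All names here are invented for illustration; this is not the article's exact code.

```python
# Minimal ReAct (Reason + Act) loop with a stubbed "LLM" and one tool.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression (no builtins exposed)."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def stub_llm(history: list[str]) -> str:
    """Stand-in for a model: emit a Thought/Action, then a Final Answer."""
    if not any("Observation" in h for h in history):
        return "Thought: I need to compute.\nAction: calculator[2 + 3]"
    return "Final Answer: 5"

def react(question: str, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = stub_llm(history)
        history.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Parse "Action: tool[input]" and run the tool.
            call = step.split("Action:", 1)[1].strip()
            name, arg = call.split("[", 1)
            history.append(f"Observation: {TOOLS[name.strip()](arg.rstrip(']'))}")
    return "No answer within step budget."

print(react("What is 2 + 3?"))
```

A real agent would replace `stub_llm` with an actual model call and validate tool arguments (the article uses Pydantic for that).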
2. Model Context Protocol (MCP): Why Every AI Developer Needs MCP in 2026
This article introduces the Model Context Protocol (MCP), an open protocol by Anthropic designed to standardize connections between LLMs and external tools. It contrasts MCP with traditional REST APIs, highlighting the maintenance and scalability challenges of direct integrations. The protocol uses a decoupled architecture with an MCP Host, Client, and Servers that act as intermediaries for services such as databases or search engines. The result is a more maintainable, scalable, and consistent framework for building AI applications.
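To make the decoupling concrete, here is a toy Python sketch of a "server" exposing tools behind one uniform list/call message shape. This is a conceptual illustration only, not the actual MCP wire format (real MCP uses JSON-RPC over stdio or HTTP), and the tool name is invented.

```python
# Conceptual sketch of the decoupling MCP aims at: the host talks to servers
# through one message shape instead of one bespoke REST client per service.
import json

class ToyServer:
    """A 'server' exposing tools behind a uniform list/call interface."""
    def __init__(self, tools):
        self.tools = tools

    def handle(self, message: str) -> str:
        req = json.loads(message)
        if req["method"] == "tools/list":
            result = sorted(self.tools)
        elif req["method"] == "tools/call":
            result = self.tools[req["params"]["name"]](**req["params"]["arguments"])
        else:
            result = None
        return json.dumps({"id": req["id"], "result": result})

def search_docs(query: str) -> list[str]:
    return [f"doc about {query}"]  # invented example tool

server = ToyServer({"search_docs": search_docs})
listing = json.loads(server.handle(json.dumps({"id": 1, "method": "tools/list"})))
reply = json.loads(server.handle(json.dumps(
    {"id": 2, "method": "tools/call",
     "params": {"name": "search_docs", "arguments": {"query": "MCP"}}})))
print(listing["result"], reply["result"])
```

Because every server answers the same two methods, the host can discover and invoke new capabilities without new integration code, which is the maintainability argument the article makes.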
3. RLM: The Ultimate Evolution of AI? Recursive Language Models
This article explains Recursive Language Models (RLMs), an approach for managing extensive contexts in AI. Instead of passively processing large inputs, RLMs treat data as a programmable environment where the model acts as an active agent. Using code, it explores, segments, and filters information, breaking down complex tasks into smaller sub-problems. The model then recursively calls itself to solve these parts before synthesizing a final result. This method allows the AI to handle massive datasets and complex reasoning, although it introduces latency and is less efficient for simple tasks.
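The recursive decompose-and-synthesize idea can be sketched in a few lines of Python, with a word-counting stub standing in for the model. This is an analogy for the control flow, not the paper's implementation.

```python
# Sketch of the recursive pattern RLMs use: split an oversized input, solve
# the pieces (recursively), then synthesize. The "model" here is a stub that
# just counts words, standing in for a real LLM call.

def stub_model(text: str) -> int:
    return len(text.split())

def recursive_solve(text: str, limit: int = 8) -> int:
    words = text.split()
    if len(words) <= limit:          # small enough: answer directly
        return stub_model(text)
    mid = len(words) // 2            # otherwise: decompose...
    left = recursive_solve(" ".join(words[:mid]), limit)
    right = recursive_solve(" ".join(words[mid:]), limit)
    return left + right              # ...and synthesize the sub-answers

doc = "one two three four five six seven eight nine ten " * 10
print(recursive_solve(doc))  # same answer as processing the whole input: 100
```

The latency caveat in the summary is visible even here: the recursion makes many small model calls where a single large-context call would make one.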
4. Factoring Quintics Using Mid-Point Ladders
The author introduces a graphically-aided technique for factoring quintic polynomials into approximate cubic and quadratic components. This method, applicable to quintics with five real roots, employs a Mid-Point Ladder based on Vieta’s sum-of-factors theorem. It simplifies the process by starting with a core genetic function, then uses the ladder to account for adjustments to the constant and x² terms. A Division by Vision formula is then applied to find the factors.
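The ladder technique itself is not reproduced here, but the identity it rests on, that a quintic with five real roots is the product of a cubic and a quadratic, can be checked with a small coefficient-convolution sketch.

```python
# Check the factorization identity behind the method: multiplying a cubic and
# a quadratic (a convolution of coefficient lists) reproduces the quintic
# built from the same five roots. Roots below are arbitrary examples.

def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists (highest degree first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def from_roots(roots):
    """Monic polynomial with the given roots."""
    poly = [1.0]
    for r in roots:
        poly = poly_mul(poly, [1.0, -r])
    return poly

roots = [-2.0, -1.0, 0.5, 1.0, 3.0]
quintic = from_roots(roots)
cubic = from_roots(roots[:3])      # three of the roots
quadratic = from_roots(roots[3:])  # the other two
product = poly_mul(cubic, quadratic)
print(all(abs(p - q) < 1e-9 for p, q in zip(product, quintic)))  # True
```

The article's contribution is going the other way: recovering approximate cubic and quadratic factors from the quintic's coefficients without knowing the roots.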
5. Federated Learning Explained: A Deep Technical Dive (And How Poets Can Actually Use It)
This technical overview explores Federated Learning, a method that enables AI models to be trained across decentralized devices without collecting user data. It details the architecture, from the initial distribution of a global model to local training on individual devices and the secure aggregation of updates. The focus then shifts to practical applications for creative professionals, explaining how they already benefit from this technology in everyday tools like smartphone keyboards.
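The aggregation step can be sketched with plain Python lists standing in for model tensors. This is a minimal federated-averaging (FedAvg) illustration under invented data, not any production framework.

```python
# Federated averaging in miniature: each client trains locally and only model
# weights leave the device; the server merges them, weighted by data size.

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client weight vectors."""
    total = sum(client_sizes)
    merged = [0.0] * len(client_weights[0])
    for weights, n in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            merged[i] += w * (n / total)
    return merged

# Two clients with different amounts of local data.
updates = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]  # client 2 has 3x the data, so 3x the influence
print(fed_avg(updates, sizes))  # [2.5, 3.5]
```

Real deployments add secure aggregation on top, so the server only ever sees the merged result, never an individual client's update.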
Repositories & Tools
1. Engram is a module that modernizes classic N-gram embeddings for O(1) lookup.
2. Agent Skills is a collection of skills for AI coding agents.
3. LangExtract is a Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.
4. AionUI is a free, local, open-source Cowork-style desktop app for Gemini CLI, Claude Code, Codex, Opencode, Qwen Code, Goose CLI, Auggie, and more.
Top Papers of The Week
1. End-to-End Test-Time Training for Long Context
This paper recasts long-context language modeling as a continual learning problem rather than an architectural one, using a standard Transformer with sliding-window attention that continues learning at test time via next-token prediction. Their meta-learned Test-Time Training method, TTT-E2E, scales with context length as full attention does, while maintaining constant inference latency, running 2.7× faster at 128K context.
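As a loose stdlib analogy (not the paper's method): a predictor that keeps only a fixed-size window of recent tokens but updates its parameters, here just bigram counts, on every token it reads. Information from outside the window persists in the updated parameters, which is the intuition behind learning at test time with constant memory.

```python
# Analogy only: constant-memory "attention" (a sliding window) plus
# parameters that keep learning at test time (bigram counts).
from collections import defaultdict, deque

class TestTimeLearner:
    def __init__(self, window: int = 4):
        self.window = deque(maxlen=window)  # fixed-size context window
        self.counts = defaultdict(lambda: defaultdict(int))  # updated while reading

    def observe(self, token: str):
        if self.window:
            # Next-token-prediction style update on the newest pair.
            self.counts[self.window[-1]][token] += 1
        self.window.append(token)

    def predict(self, token: str) -> str:
        nxt = self.counts.get(token)
        return max(nxt, key=nxt.get) if nxt else "?"

m = TestTimeLearner(window=4)
for t in "the cat sat on the mat and the cat ran".split():
    m.observe(t)
# Bigrams learned early in the stream persist even after those tokens
# have left the 4-token window.
print(m.predict("the"))  # 'cat'
```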
2. Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning
This paper introduces VideoDR, the first video deep research benchmark for video-conditioned open-domain question answering on the open web. VideoDR requires cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video–web evidence across six semantic domains. Evaluations show agentic approaches only outperform workflows when models preserve initial video anchors, with goal drift and long-horizon consistency emerging as main bottlenecks.
3. STEP3-VL-10B Technical Report
This paper introduces STEP3-VL-10B, a lightweight, open-source foundation model that redefines the trade-off between efficiency and frontier-level multimodal intelligence. The model unifies a fully unfrozen pre-training strategy on 1.2T multimodal tokens, coupling a language-aligned Perception Encoder with a Qwen3–8B decoder, and scales post-training with over 1k RL iterations and PaCoRe, achieving 92.2% on MMBench and 80.11% on MMMU.
4. Urban Socio-Semantic Segmentation with Vision-Language Reasoning
The paper introduces SocioSeg, an urban socio-semantic segmentation dataset that combines satellite imagery, digital maps, and hierarchical pixel-level labels for socially defined entities such as schools and parks. The authors propose SocioReasoner, a vision-language reasoning framework that uses cross-modal recognition, multi-stage reasoning, and reinforcement learning to surpass state-of-the-art segmentation models and achieve strong zero-shot generalization.
Quick Links
1. OpenAI introduces Open Responses, an open-source specification and ecosystem inspired by the OpenAI Responses API. It is designed to make it easier to build multi-provider, interoperable LLM interfaces.
2. Zhipu AI released GLM-Image, an open-source, industrial-grade auto-regressive image generation model. GLM-Image combines the strengths of diffusion and auto-regressive models: the auto-regressive component decides what should appear in the image, while the diffusion component decides how it should look. This separation allows GLM-Image to be both accurate and visually strong.
3. Nous Research released NousCoder-14B, an Olympiad programming model post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. The model is trained on 24k verifiable coding problems from TACO Verified and PrimeIntellect SYNTHETIC-1. It reaches 67.87 percent Pass@1 on LiveCodeBench v6, a 7.08-percentage-point gain over the Qwen3-14B baseline of 60.79 percent on the same benchmark.
Who’s Hiring in AI
Applied AI Engineer @AssemblyAI (Remote/USA)
AI Software Engineer @Healthengine (Perth, Australia)
LLM — Applied AI Research Scientist @CONFISA INTERNATIONAL GROUP (USA & LATAM Remote)
Junior Conversational AI Engineer (Voice Bots) @AUTO1 Group (Tirana, Albania)
PhD Internship (f/m/d) — AI Research @SAP (Germany/Remote)
AI Engineer/GenAI Developer @NTT DATA (Chennai, India)
Machine Learning Engineer — AI Models @Tenstorrent Inc. (Poland/Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net .
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
|
|
|
Investing in Merge Labs |
openai |
15.01.2026 07:00 |
0.693
|
| Embedding sim. | 0.8022 |
| Entity overlap | 0.1 |
| Title sim. | 0.0769 |
| Time proximity | 0.9583 |
| NLP type | funding |
| NLP organization | OpenAI |
| NLP topic | artificial intelligence |
| NLP country | |
Open original
OpenAI is investing in Merge Labs to support new brain computer interfaces that bridge biological and artificial intelligence to maximize human ability, agency, and experience.
|
|
|
Last Week in AI #333 - ChatGPT Ads, Zhipu+Huawei, Drama at Thinking Machines |
lastweekin_ai |
23.01.2026 05:14 |
0.692
|
| Embedding sim. | 0.8408 |
| Entity overlap | 0.0857 |
| Title sim. | 0.0603 |
| Time proximity | 0.6298 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | generative ai |
| NLP country | United States |
Open original
OpenAI to test ads in ChatGPT as it burns through billions, Sequoia to invest in Anthropic, Zhipu AI breaks US chip reliance, The Drama at Thinking Machines Is Riveting Silicon Valley
Last Week in AI
Jan 23, 2026
OpenAI to test ads in ChatGPT as it burns through billions
Related:
ChatGPT to begin testing ads as generative AI competition heats up
OpenAI will begin testing labeled banner ads in ChatGPT for logged‑in users on the free tier and the $8/month ChatGPT Go plan, rolling out in the U.S. and other markets in the coming weeks. Ads will appear as blocked-off se…
|
|
|
Our approach to advertising and expanding access to ChatGPT |
openai |
16.01.2026 00:00 |
0.69
|
| Embedding sim. | 0.7924 |
| Entity overlap | 0.375 |
| Title sim. | 0.066 |
| Time proximity | 0.8571 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | generative ai |
| NLP country | United States |
Open original
OpenAI plans to test advertising in the U.S. for ChatGPT’s free and Go tiers to expand affordable access to AI worldwide, while protecting privacy, trust, and answer quality.
|
|
|
LWiAI Podcast #231 - Claude Cowork, Anthropic $10B, Deep Delta Learning |
lastweekin_ai |
21.01.2026 03:22 |
0.688
|
| Embedding sim. | 0.7794 |
| Entity overlap | 0.08 |
| Title sim. | 0.2083 |
| Time proximity | 0.9266 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | large language models |
| NLP country | United States |
Open original
LWiAI Podcast #231 - Claude Cowork, Anthropic $10B, Deep Delta Learning
Anthropic’s new Cowork tool, Anthropic Raising $10 Billion at $350 Billion Value, Deep Delta Learning
Last Week in AI
Jan 21, 2026
Our 231st episode with a summary and discussion of last week’s big AI news!
Recorded on 01/16/2026
Hosted by Andrey Kurenkov and Jeremie Harris
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
In this episode:
Anthropic’s new cowork tool integrates Claude code, potentially simplifying multiple computing tasks from editing videos to compiling spreadsheets.
Significant funding rounds see Anthropic raising $10B at a valuation of $350B, while XAI raises $20B, underscoring the immense market interest in AI startups.
Nvidia faces supply challenges for H200 AI chips due to overwhelming demand from China, despite high costs per unit and its potential impact on U.S. company revenue.
Policy debates highlight tensions around U.S. export controls to China, with leaders like Justin Lin from Alibaba and Jake Sullivan, former national security advisor, weighing in on the ramifications for the AI industry’s future.
Timestamps:
(00:00:10) Intro / Banter
(00:01:30) News Preview
Tools & Apps
(00:02:13) Anthropic’s new Cowork tool offers Claude Code without the code | TechCrunch
(00:09:45) Google’s Gemini AI will use what it knows about you from Gmail, Search, and YouTube | The Verge
(00:12:45) Google removes some AI health summaries after investigation finds “dangerous” flaws - Ars Technica
(00:16:29) Gmail is getting a Gemini AI overhaul
(00:18:12) Slackbot is an AI agent now | TechCrunch
Applications & Business
(00:20:11) Anthropic Raising $10 Billion at $350 Billion Value
(00:22:25) Elon Musk xAI raises $20 billion from Nvidia, Cisco, investors
(00:24:47) NVIDIA Needs a Supply Chain ‘Miracle’ From TSMC as China’s H200 AI Chip Orders Overwhelm Supply, Triggering a Bottleneck
(00:29:26) OpenAI signs deal, worth $10B, for compute from Cerebras | TechCrunch
(00:31:49) CoreWeave in focus as it amends credit agreement
(00:34:30) LMArena lands $1.7B valuation four months after launching its product | TechCrunch
Projects & Open Source
(00:35:54) Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
(00:43:15) mHC: Manifold-Constrained Hyper-Connections
(00:49:53) IQuest_Coder_Technical_Report
(00:54:58) TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model Outperforming Others in Math and Coding with only 7B Params with 256k Context Window - MarkTechPost
Research & Advancements
(01:01:42) Deep Delta Learning
(01:07:47) Recursive Language Models
(01:13:39) Conditional memory via scalable lookup
(01:18:54) Extending the Context of Pretrained LLMs by Dropping their Positional Embeddings
Policy & Safety
(01:26:06) Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks
(01:31:00) Nvidia CEO says purchase orders, not formal declaration, will signal Chinese approval of H200
(01:32:24) China AI Leaders Warn of Widening Gap With US After $1B IPO Week
(01:37:25) Jake Sullivan is furious that Trump removed Biden’s AI chip export controls | The Verge
|
|
|
Last Week in AI #332 - Apple + Gemini, OpenAI + Cerebras, Claude Cowork |
lastweekin_ai |
15.01.2026 07:06 |
0.684
|
| Embedding sim. | 0.7605 |
| Entity overlap | 0.2667 |
| Title sim. | 0.2361 |
| Time proximity | 0.8982 |
| NLP type | partnership |
| NLP organization | Apple |
| NLP topic | foundation models |
| NLP country | |
Open original
Google’s Gemini to power Apple’s AI features like Siri, OpenAI signs deal worth $10B for compute from Cerebras, and more!
Last Week in AI
Jan 15, 2026
Google’s Gemini to power Apple’s AI features like Siri
Apple announced a multi-year partnership to use Google’s Gemini models and Google Cloud to power AI features like Siri, after testing alternatives from OpenAI and Anthropic. According to both companies, Gemini provides “the most capable foundation” for Apple’s own models, with reporting suggesting Ap…
|
|
|
Counter intelligence |
mit_news_ai |
03.02.2026 22:00 |
0.683
|
| Embedding sim. | 0.8253 |
| Entity overlap | 0.1 |
| Title sim. | 0.2031 |
| Time proximity | 0.4276 |
| NLP type | other |
| NLP organization | Massachusetts Institute of Technology |
| NLP topic | artificial intelligence |
| NLP country | United States |
Open original
How can artificial intelligence step out of a screen and become something we can physically touch and interact with?
That question formed the foundation of class 4.043/4.044 (Interaction Intelligence), an MIT course focused on designing a new category of AI-driven interactive objects. Known as large language objects (LLOs), these physical interfaces extend large language models into the real world. Their behaviors can be deliberately generated for specific people or applications, and their interactions can evolve from simple to increasingly sophisticated — providing meaningful support for both novice and expert users.
“I came to the realization that, while powerful, these new forms of intelligence still remain largely ignorant of the world outside of language,” says Marcelo Coelho, associate professor of the practice in the MIT Department of Architecture, who has been teaching the design studio for several years and directs the Design Intelligence Lab. “They lack real-time, contextual understanding of our physical surroundings, bodily experiences, and social relationships to be truly intelligent. In contrast, LLOs are physically situated and interact in real time with their physical environment. The course is an attempt to both address this gap and develop a new kind of design discipline for the age of AI.”
Given the assignment to design an interactive device that they would want in their lives, students Jacob Payne and Ayah Mahmoud focused on the kitchen. While they each enjoy cooking and baking, their design inspiration came from the first home computer: the Honeywell 316 Kitchen Computer, marketed by Neiman Marcus in 1969. Priced at $10,000, there is no record of one ever being sold.
“It was an ambitious but impractical early attempt at a home kitchen computer,” says Payne, an architecture graduate student. “It made an intriguing historical reference for the project.”
“As somebody who likes learning to cook — especially now, in college as an undergrad — the thought of designing something that makes cooking easy for those who might not have a cooking background and just wants a nice meal that satisfies their cravings was a great starting point for me,” says Mahmoud, a senior design major.
“We thought about the leftover ingredients you have in the refrigerator or pantry, and how AI could help you find new creative uses for things that you may otherwise throw away,” says Payne.
Generative cuisine
The students designed their device — named Kitchen Cosmo — with instructions to function as a “recipe generator.” One challenge was prompting the LLM to consistently acknowledge real-world cooking parameters, such as heating, timing, or temperature. One issue they worked out was having the LLM recognize flavor profiles and spices accurate to regional and cultural dishes around the world to support a wider range of cuisines. Troubleshooting included taste-testing recipes Kitchen Cosmo generated. Not every early recipe produced a winning dish.
“There were lots of small things that AI wasn't great at conceptually understanding,” says Mahmoud. “An LLM needs to fundamentally understand human taste to make a great meal.”
They fine-tuned their device to allow for the myriad ways people approach preparing a meal. Is this breakfast, lunch, dinner, or a snack? How advanced of a cook are you? How much meal prep time do you have? How many servings will you make? Dietary preferences were also programmed, as well as the type of mood or vibe you want to achieve. Are you feeling nostalgic, or are you in a celebratory mood? There’s a dial for that.
“These selections were the focal point of the device because we were curious to see how the LLM would interpret subjective adjectives as inputs and use them to transform the type of recipe outputs we would get,” says Payne.
Unlike most AI interactions that tend to be invisible, Payne and Mahmoud wanted their device to be more of a “partner” in the kitchen. The tactile interface was intentionally designed to structure the interaction, giving users a physical control over how the AI responded.
“While I’ve worked with electronics and hardware before, this project pushed me to integrate the components with a level of precision and refinement that felt much closer to a product-ready device,” says Payne of the course work.
Retro and red
After their electronic work was completed, the students designed a series of models using cardboard until settling on the final look, which Payne describes as “retro.” The body was designed in a 3D modeling software and printed. In a nod to the original Honeywell computer, they painted it red.
A thin, rectangular device about 18 inches in height, Kitchen Cosmo has a webcam that hinges open to scan ingredients set on a counter. It translates these into a recipe that takes into consideration general spices and condiments common in most households. An integrated thermal printer delivers a printed recipe that is torn off. Recipes can be stored in a plastic receptacle on its base.
While Kitchen Cosmo made a modest splash in design magazines, both students have ideas where they will take future iterations.
Payne would like to see it “take advantage of a lot of the data we have in the kitchen and use AI as a mediator, offering tips for how to improve on what you’re cooking at that moment.”
Mahmoud is looking at how to optimize Kitchen Cosmo for her thesis. Classmates have given feedback to upgrade its abilities. One suggestion is to provide multi-person instructions that give several people tasks needed to complete a recipe. Another idea is to create a “learning mode” in which a kitchen tool — for example, a paring knife — is set in front of Kitchen Cosmo, and it delivers instructions on how to use the tool. Mahmoud has been researching food science history as well.
“I’d like to get a better handle on how to train AI to fully understand food so it can tailor recipes to a user’s liking,” she says.
Having begun her MIT education as a geologist, Mahmoud’s pivot to design has been a revelation, she says. Each design class has been inspiring. Coelho’s course was her first class to include designing with AI. Referencing the often-mentioned analogy of “drinking from a firehose” while a student at MIT, Mahmoud says the course helped define a path for her in product design.
“For the first time, in that class, I felt like I was finally drinking as much as I could and not feeling overwhelmed. I see myself doing design long-term, which is something I didn’t think I would have said previously about technology.”
|
|
|
Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints | NVIDIA Technical Blog |
nvidia_dev_blog |
04.02.2026 19:46 |
0.681
|
| Embedding sim. | 0.7795 |
| Entity overlap | 0.2381 |
| Title sim. | 0.0977 |
| Time proximity | 0.9146 |
| NLP type | product_launch |
| NLP organization | Moonshot AI |
| NLP topic | large language models |
| NLP country | |
Open original
Kimi K2.5 is the newest open vision language model (VLM) from the Kimi family of models. Kimi K2.5 is a general-purpose multimodal model that excels in current high-demand tasks such as agentic AI workflows, chat, reasoning, coding, mathematics, and more.
The model was trained using the open source Megatron‑LM framework. Megatron-LM provides accelerated computing for scalability and GPU optimization through several types of parallelism (tensor, data, sequence) for training massive transformer-based models.
This model architecture builds on leading state-of-the-art large open models for efficiency and capability. The model is composed of 384 experts with a single dense layer, which allows for smaller-sized experts and specialized routing for different modalities. Kimi K2.5 achieves a 3.2% activation rate of parameters per token.
| Kimi K2.5 | |
| Modalities | Text, image, video |
| Total parameters | 1T |
| Active parameters | 32.86B |
| Activation rate | 3.2% |
| Input context length | 262K |
| Additional configuration information | |
| # experts | 384 |
| # shared experts | 1 |
| # experts per token | 8 |
| # layers | 61 (1 dense, 60 MoE) |
| # attention heads | 64 |
| Vocab size | ~164K |
Table 1. Specifications and configuration details for the Kimi K2.5 model
For vision capability, the large training vocabulary of 164K contains vision-specific tokens. Kimi created the MoonViT3d Vision Tower for the visual processing component of this model, which converts images and video frames into embeddings.
Figure 1. Kimi K2.5 vision pipeline
Build with NVIDIA GPU-accelerated endpoints
You can start building with Kimi K2.5 on build.nvidia.com, with free access to GPU-accelerated endpoints for prototyping as part of the NVIDIA Developer Program. You can use your own data in the browser experience. NVIDIA NIM microservices, containers for production inference, are coming soon.
Video 1. Learn how you can test Kimi K2.5 on NVIDIA GPU-accelerated endpoints
You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.
import os
import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    # Read the API key from the environment rather than hard-coding it.
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""  # your prompt here
        }
    ],
    "model": "moonshotai/kimi-k2.5",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    # With "stream": True the endpoint returns server-sent events, not a
    # single JSON body; keep it False when calling response.json().
    "stream": False,
    "temperature": 1,
    "top_p": 1
}

# Re-use connections across requests.
session = requests.Session()
response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
To take advantage of tool calling, simply define an array of OpenAI-compatible tools and add it to the chat completions tools parameter.
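For example, a minimal OpenAI-style tool definition might look like the sketch below. The get_weather tool and its fields are hypothetical, invented for illustration; only the outer tools-array shape follows the OpenAI function-calling schema.

```python
# Hedged example: an OpenAI-compatible tools array. "get_weather" is an
# invented tool, not part of the Kimi API.
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "description": "Get current weather for a city.",
            "parameters": {  # JSON Schema describing the arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Merged into a request body like the one in the earlier example:
payload = {"model": "moonshotai/kimi-k2.5", "tools": tools}
print(json.dumps(payload)[:50])
```

When the model decides a tool is needed, the response contains a tool call with JSON arguments matching this schema, which your code executes before sending the result back.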
Deploying with vLLM
When deploying models with the vLLM serving framework, use the following instructions. For more information, see the vLLM recipe for Kimi K2.5.
$ uv venv
$ source .venv/bin/activate
$ uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
Fine-tuning with NVIDIA NeMo Framework
Kimi K2.5 can be customized and fine-tuned with the open source NeMo Framework using the NeMo AutoModel library to adapt the model for domain-specific multimodal tasks, agentic workflows, and enterprise reasoning use cases.
NeMo Framework is a suite of open libraries enabling scalable model pretraining and post-training, including supervised fine-tuning, parameter-efficient methods, and reinforcement learning for models of all sizes and modalities.
NeMo AutoModel is a PyTorch Distributed native training library within NeMo Framework that provides high throughput training directly on the Hugging Face checkpoint without the need for conversion. This provides a lightweight and flexible tool for developers and researchers to do rapid experimentation on the latest frontier models.
Try fine-tuning Kimi K2.5 with the NeMo AutoModel recipe.
Get started with Kimi K2.5
From data center deployments on NVIDIA Blackwell to the fully managed enterprise NVIDIA NIM microservice, NVIDIA offers solutions for integrating Kimi K2.5. To get started, check out the Kimi K2.5 model page on Hugging Face and the Kimi API Platform, and test Kimi K2.5 on the build.nvidia.com playground.
About the Authors
About Anu Srivastava
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.
|
|
|
The AI Coding Supremacy wars |
ai_supremacy |
06.02.2026 10:31 |
0.681
|
| Embedding sim. | 0.7886 |
| Entity overlap | 0.2308 |
| Title sim. | 0.0714 |
| Time proximity | 0.8689 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | code generation |
| NLP country | |
Open original
The AI Coding Supremacy wars
Vibe working, Enterprise AI Coworkers, the Frontier Keeps evolving. Capex pushes the Enterprise AI landscape into more real-world trials as the SaaSpocalypse fears collide.
Michael Spencer and Jeff Morhous
Feb 06, 2026
∙ Paid
Made with Midjourney.
Good Morning,
It’s been a hectic week in the AI news cycle, especially around the future of coding.
OpenAI dropped GPT-5.3-Codex
Anthropic released Claude Opus 4.6
Alibaba Cloud team released Qwen3-Coder-Next
OpenAI also released OpenAI Frontier.
This comes as we anticipate DeepSeek’s own coding model, referred to as DeepSeek V4, in about 10 days’ time. Meanwhile, before we get Grok 5, we are likely to get Gemini 3.5 (codenamed "Snow Bunny"), which has been spotted in Google’s internal sandboxes (LM Marina). Then comes DeepSeek’s flagship DeepSeek-R2 model in March.
What’s Next
GLM-5 (Zhipu AI)
DeepSeek V4
Kimi K2.5’s “Agent Swarm” API
Gemini 3.5
Grok 5
Claude Sonnet 5
DeepSeek-R2
Qwen 3.5 & "Max-Thinking"
Llama 4 Behemoth (April)
As adoption of Claude Code accelerates, it’s getting harder to compare the capabilities of Anthropic and OpenAI’s coding tools (but we are going to try).
Claude Code Github Commits Skyrockets before SaaSpocalypse takes shape
Claude Code is writing a growing share of all code each month. (GitHub commits; SemiAnalysis)
“4% of GitHub public commits are being authored by Claude Code right now.
At the current trajectory, we believe that Claude Code will be 20%+ of all daily commits by the end of 2026.” - Dylan Patel , SemiAnalysis
Unfortunately, Anthropic didn’t share SWE-Bench Pro benchmarks, but Sebastian Raschka, PhD, of the Ahead of AI newsletter put them side by side based on the available Terminus 2.0 numbers:
Sebastian Raschka, PhD
I asked Jeff Morhous for details on how OpenAI’s Codex works for some of our more technically inclined readers.
Jeff writes:
The AI-Augmented Engineer
The AI-Augmented Engineer Accelerating software engineering careers with AI workflows. I show you how to use AI to write better code, ship faster, and get ahead.
By Jeff Morhous
Lately Jeff has been breaking down guides as a Software Engineer that can help us understand the frontier of AI coding tools better:
Claude Code vs Cursor
How to use AI to write code without skill atrophy
How to get unique designs out of Claude Code
Claude Code is the Inflection Point
To understand how the Codex agent loop works, this is a great resource. This article will go into some detail about how Codex actually works. I will also talk about Opus 4.6 vs. GPT-5.3-Codex, the SaaSpocalypse, and other relevant topics like capex. Financial services and banking employees could be most vulnerable in the vibe-working era. The micro-hysteria that AI coding, generalized to other industries, will cause is now officially a thing.
AI moving toward a ‘vibe working’ era
A guest post by
Jeff Morhous
|
|
|
TAI #189: Dario Amodei's 19,000-Word Warning About AI's "Adolescence" |
towards_ai |
27.01.2026 15:02 |
0.68
|
| Embedding sim. | 0.8782 |
| Entity overlap | 0.1633 |
| Title sim. | 0.1089 |
| Time proximity | 0 |
| NLP type | product_launch |
| NLP organization | Anthropic |
| NLP topic | large language models |
| NLP country | United States |
Open original
What happened this week in AI by Louie
Anthropic has been on a remarkable product streak. Last week, we covered Claude Cowork, which brings agentic capabilities to non-developers. This week, the company expanded Claude in Excel to Pro subscribers and deepened integrations with apps such as Slack, Canva, Figma, and more.
Claude in Excel may be one of the more eye-opening AI features yet for finance professionals. The add-in reads entire multi-tab workbooks, explains nested formulas with clickable cell citations, debugs errors like circular references, and builds financial models from natural-language instructions. Finance has long been a domain where AI demos looked impressive, but real-world utility lagged. Claude reading your actual workbook and understanding the relationships between cells changes that equation. The caveats are real: hallucinations happen, token limits interrupt longer sessions, and prompt-injection vulnerabilities mean you should be careful with untrusted data. But as a research preview, it points toward a future where financial modeling grunt work becomes dramatically faster.
Despite this success in solving near-term, extremely tangible enterprise problems, CEO Dario Amodei remains outspoken about more speculative risks. His essay “Machines of Loving Grace” made a significant splash in October 2024, laying out how powerful AI could compress a century of scientific progress into a decade and potentially eliminate most diseases, end extreme poverty, and transform governance. Fifteen months later, we can assess how those predictions are tracking.
The results are mixed. Capability acceleration proceeded roughly as Amodei predicted: agentic systems improved dramatically, with engineers at Anthropic reportedly “mostly editing” rather than writing code from scratch. Scientific acceleration in drug discovery and protein design continued. But the more ambitious predictions have not materialized. No major breakthroughs in disease cures or lifespan emerged. Mental health applications remain at the research level. The developing world saw little evidence of rapid catch-up. And rather than AI favoring defense and democracy as Amodei hoped, 2025 saw intensified chip wars and rising deepfake threats.
It is always hard to tell if an AI CEO is being honest or hyping capabilities. Even when discussing risks, emphasizing how powerful and dangerous AI will become is a roundabout way of claiming your technology is transformative enough to justify massive investment. Anthropic raised $13 billion in September and is reportedly in talks for another $25 billion. There is also a competitive angle: fearmongering about AI risks can be interpreted as an attempt to prevent open-weight LLM competition through regulation or to stunt Chinese AI labs by advocating for export controls. The conflict of interest is obvious.
I think Dario is largely honest in his hopes and fears, though not immune to motivated reasoning. His technical claims tend to be specific and falsifiable rather than vague. He repeatedly emphasizes uncertainty. And he points fingers at his own industry, explicitly naming AI companies as a major risk factor. That is not the framing you would choose for pure marketing.
This week, Amodei published “The Adolescence of Technology,” a 19,000-word follow-up that shifts from optimism to confronting risks directly. The framing is stark: humanity is entering a “rite of passage” that will test who we are as a species. The central move is treating powerful AI as a new kind of concentrated national capability. He uses the metaphor of a “country of geniuses in a datacenter”: imagine 50 million people, all more capable than any Nobel laureate, operating at 10–100x the speed of humans. If you were a national security official assessing that situation, what would you worry about?
He groups risks into five categories. Autonomy risks concern whether AI systems might behave in unintended ways, not from malice but from emergent properties in training. Amodei rejects both the naive view that AI will simply do what we tell it and the doomer view that misalignment is inevitable. He cites lab experiments in which Claude engaged in deception and adopted problematic personas due to training quirks. These were caught and fixed, but the concern is that training involves so many potential traps that some may only become evident when it is too late.
Destruction risks involve AI lowering barriers to weapons of mass destruction, particularly biological weapons. Amodei argues that LLMs are approaching the capability to walk a determined non-expert through the step-by-step process of bioweapon creation, breaking the historical correlation between ability and motive. The PhD virologist with the skills is unlikely to have the motivation. The disturbed loner with the motivation lacks the skills. AI could remove that barrier. Anthropic’s internal measurements show models may already be providing substantial uplift in relevant areas, which is why recent Claude releases include specialized classifiers to block bioweapon-related outputs.
Power-seizing risks concern authoritarian governments using AI for surveillance, propaganda, and autonomous weapons to entrench control. Amodei is particularly focused on the CCP, arguing it makes no sense to sell them chips and chip-making tools to build an AI totalitarian state. But he also worries about democracies: the same tools needed to defend against autocracies can be turned inward. He suggests domestic mass surveillance and mass propaganda should be bright red lines.
Economic disruption is perhaps the most immediate concern. Amodei predicted that AI could displace 50% of entry-level white-collar jobs in 1–5 years, and he stands by that prediction. He argues this differs from previous technological disruptions because of speed, cognitive breadth, and AI’s capacity to fill in gaps that would normally allow humans to adapt.
Finally, indirect effects capture unknown unknowns from compressed progress: radical advances in biology, psychological manipulation through AI companions, and loss of human purpose. Even if we dodge headline catastrophes, a decade of compressed progress can produce destabilizing outcomes.
The essay’s most useful contribution may be its diagnosis of political economy. Amodei explains why reasonable safety measures fail: the combination of strategic competition and massive economic upside makes restraint hard even when everyone sees the risks. He calls this “the trap.” His proposed solutions emphasize surgical interventions: transparency legislation, export controls on chips, Constitutional AI to train models with coherent values, and interpretability research. He explicitly rejects pausing AI development as untenable, arguing that the technology would continue regardless, and that authoritarian countries would keep building.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
Why should you care?
Three practical takeaways from the essay. First, if you work in a field likely to be disrupted, the time to build adjacent skills and relationships is now, not when displacement arrives. Amodei’s prediction of 50% entry-level white-collar job displacement in 1–5 years may be aggressive, but even a slower timeline suggests urgency. Second, the warnings about AI companions and psychological manipulation deserve attention from anyone with children or elderly relatives who may be more susceptible to forming unhealthy dependencies on systems designed to maximize engagement.
Third, and most broadly, the essay is a reminder that the incremental view can obscure the aggregate picture. Most weeks, this newsletter covers new models, new features, and new benchmarks. The question is not whether any single advance is dangerous but whether the cumulative trajectory is one we have consciously chosen. Right now, the answer is largely no. Recognizing that is the first step toward changing it.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. Anthropic Launches Interactive Claude Apps
Claude now opens connected workplace tools as interactive panels directly in the conversation, so you can review, tweak, and act on outputs without switching tabs. The first set includes Amplitude, Asana, Box, monday.com, and Slack, with interactive workflows like building analytics charts, turning chats into projects/timelines, previewing documents, updating boards, and drafting messages in a formatted preview before posting. This rollout is available across Claude’s web and desktop experiences. The same launch extends MCP Apps, which lets tool developers ship interactive UI experiences that render inside multiple MCP clients rather than returning only text or structured data.
2. Anthropic Expands Claude in Excel to Pro Users
Anthropic has now rolled out its Excel integration in Claude to Pro users. Along with broader availability, the update brings several functional improvements: Claude can now accept multiple files via drag-and-drop, avoid overwriting existing cells, and support longer work sessions through automatic compression. The integration lets users work with Claude directly in Microsoft Excel for analysis and data preparation.
3. Alibaba Qwen releases Qwen3-Max-Thinking
Alibaba’s Qwen team launched Qwen3-Max-Thinking, a new flagship reasoning model trained with large-scale reinforcement learning and built to autonomously invoke Search, Memory, and a Code Interpreter during a conversation, eliminating the need for manual tool selection. It ships with a heavy-mode test-time scaling approach that runs multi-round self-reflection (“experience-cumulative” scaling) to improve difficult reasoning without simply increasing parallel sampling. It scored 98.0 on HMMT, 49.8 on Humanity’s Last Exam (with tools), 90.2 on Arena-Hard v2, 75.3 on SWE-Bench Verified, and 85.9 on LiveCodeBench v6, with the tool-augmented HLE result exceeding GPT-5.2-Thinking and Gemini 3 Pro. The model is available in Qwen Chat and via an API.
4. Zhipu AI Releases GLM-4.7-Flash
Z.ai launched GLM-4.7, its latest flagship text model series focused on agentic coding reliability, multi-step execution stability, and stronger front-end generation quality, with 200K context and up to 128K output tokens. On widely used coding and agent benchmarks, GLM-4.7 reports 73.8% on SWE-bench Verified, 66.7% on SWE-bench Multilingual, and 41% on Terminal-Bench 2.0, alongside stronger tool-use scores such as 84.7% on τ²-Bench and 67% on BrowseComp. The series includes GLM-4.7, plus lighter variants (GLM-4.7-FlashX and GLM-4.7-Flash), intended to trade off cost/latency for peak capability while maintaining the same long-context footprint.
5. Qwen Researchers Release Qwen3-TTS
Alibaba’s Qwen team open-sourced the Qwen3-TTS family, a multilingual, controllable, streaming text-to-speech stack built for both rapid voice cloning and “voice design” (description-driven control over style and attributes). The models are trained across 10 languages and introduce a dual-track LM design optimized for real-time synthesis, paired with two tokenizers: a semantic-heavy 25Hz codec and an ultra-low-latency 12Hz tokenizer that targets extremely fast first audio emission (reported at ~97 ms). On the multilingual TTS test set, Qwen reports an average WER of 1.835% and a speaker similarity of 0.789, and frames the release as open tooling for both research and product deployment, with models and tokenizers under Apache 2.0.
6. Elon Musk’s xAI Activates World’s First Gigawatt-Scale AI Training Cluster
Elon Musk’s xAI is expanding the Colossus training effort toward gigawatt-scale capacity, including purchasing additional Memphis-area buildings, with the ambition to reach nearly 2 GW of training power and operate at a scale of hundreds of thousands to over a million GPUs over time. xAI’s own materials describe rapid buildout milestones (including scaling to 200k GPUs) while framing the site as a “gigafactory of compute.” At the same time, recent third-party analysis based on site constraints (notably cooling) disputes that the cluster is already operating at 1 GW today, suggesting the full gigawatt claim is more consistent with a phased ramp than a completed state.
7. Gemini in Chrome Is Getting “Skills” As It Moves Toward Becoming a Full AI Agent
Google is testing “Skills” for Gemini in Chrome, an early move from “assistant in a side panel” toward programmable, site-context automation that can execute repeatable browser workflows. Chromium commits show active development of a dedicated chrome://skills surface (including UI scaffolding like a toolbar) and plumbing to surface or recommend Skills on the current page, suggesting an intent to make Skills discoverable rather than purely manual. Independent coverage indicates Skills are being tried internally in Chrome builds, with users defining a Skill (name + instructions) and then invoking it through Gemini’s Chrome experience, but there’s no public rollout timeline yet.
8. Anthropic Replaces Todos With Disk-Backed Tasks
Anthropic upgraded Claude Code from “Todos” to Tasks, turning lightweight to-do tracking into a more structured task primitive designed for longer, multi-step coding workflows, including support for dependency-style organization and richer task lifecycle actions. Recent releases add controls to keep the old system temporarily via CLAUDE_CODE_ENABLE_TASKS, and expand task operations (including the ability to delete tasks via TaskUpdate) while iterating on how the task list renders and behaves in the terminal UI. The change is framed as part of making Claude Code more resilient for extended sessions where work needs to persist cleanly across context pressure and ongoing agent activity.
9. FastMCP 3.0 Is Here
Prefect’s FastMCP 3.0 entered beta as a major redesign of the Python framework for building MCP servers, restructuring the system around three composable primitives: components, providers, and transforms. Providers are meant to source tools/resources dynamically (from decorators, filesystems, OpenAPI specs, or even remote MCP servers), while transforms act as middleware to reshape what clients see — renaming, namespacing, filtering, or applying security rules — so features that used to require bespoke subsystems can be assembled from building blocks. The project is shipping as a 3.0.0b1 beta (with guidance to stay on v2 for production stability), signaling a push toward more modular, plug-and-play MCP infrastructure for agent toolchains.
10. FlashLabs Researchers Release Chroma 1.0
FlashLabs open-sourced Chroma 1.0 (Chroma-4B), a real-time, end-to-end spoken dialogue model that takes speech in and returns speech out while preserving a user’s voice via personalized voice cloning. It’s built to avoid the classic ASR → LLM → TTS pipeline by operating directly on discrete speech representations, targeting sub-second interaction latency for conversational use. The system emphasizes speaker identity retention (a common failure mode in speech-token-based dialogue models) while keeping responses fast enough to feel “live” in multi-turn voice chats. The release includes a 4B-parameter checkpoint and positioning as an open, real-time voice assistant backbone for developers building low-latency, voice-native agents.
Five 5-minute reads/videos to keep you learning
1. How to Run AI Agents Fully Locally: Memory, Tools, and Models on Your Laptop
This article outlines the architecture of a fully local AI agent, designed to improve privacy, control costs, and enable reproducibility. The stack integrates Agno for agent orchestration, SurrealDB as a multi-model database for state and vectors, and Ollama for local inference. It highlights the use of the Model Context Protocol (MCP) to establish a secure boundary for tools, such as file access and image generation. It also covers practical implementations, including persistent memory, local RAG, and multimodal workflows.
2. LangGraph + RAG + UCP = The Key To Powerful Agentic AI
This analysis details how to build an AI shopping assistant using the Universal Commerce Protocol (UCP), a new open standard for e-commerce transactions. The article shows that combining LangGraph for structured workflows with Retrieval-Augmented Generation (RAG) enables querying a product database. It provides code examples for a chatbot that uses a vector store and GPT-4 to answer questions, alongside a checkout system built with the FastUCP framework to manage transactions.
3. Mastering the Bias-Variance Trade-Off in Machine Learning
Balancing bias and variance is a central challenge in machine learning. This article examines this trade-off using the Vapnik-Chervonenkis (VC) dimension, a theoretical concept for quantifying a model’s capacity. It explains how the VC bound estimates the generalization error on unseen data. It also presents a practical experiment with polynomial regression, demonstrating that as model complexity increases, training error decreases while the gap between training and real-world performance widens.
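For reference, the generalization bound the article builds on is the standard VC bound, stated here under the usual assumptions (with $N$ training examples, growth function $m_{\mathcal{H}}$, VC dimension $d_{VC}$, and confidence $1-\delta$):

```latex
% With probability at least 1 - \delta over the draw of the training set:
E_{\text{out}}(h) \;\le\; E_{\text{in}}(h) \;+\; \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
\qquad m_{\mathcal{H}}(N) \;\le\; N^{d_{VC}} + 1.
```

Because the growth function is polynomial in $N$ whenever $d_{VC}$ is finite, the penalty term shrinks as $N$ grows, which is what makes learning with limited capacity feasible.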
4. Connecting the Dots with Graphs
Moving beyond traditional databases that store data in isolated tables, knowledge graphs model information as a network of entities and relationships. This structure excels at complex, relationship-heavy queries that relational databases often struggle with. The text outlines the benefits, such as flexible schemas and data integration, while also addressing challenges like data quality and performance. A practical implementation is also presented, detailing how to build a question-answering system using Neo4j and an LLM to translate natural language into graph queries, making complex data more accessible.
5. Probability Calibration with Python
Many machine learning models produce probability scores that, while effective for ranking, do not align with real-world event frequencies. This article explores probability calibration using a simulated loan default dataset. It compares a raw Gradient Boosting model against two calibrated versions: Sigmoid and Isotonic. The results demonstrate that calibration improves probability metrics like the Brier score and Expected Calibration Error (ECE) without compromising ranking performance (AUC). A final simulation of a loan approval policy shows that using these calibrated probabilities leads to more accurate risk assessments and ultimately, higher realized profits, underscoring their value in business decision-making.
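The idea can be sketched without the article's loan dataset. The snippet below (illustrative, numpy-only; the article itself uses Gradient Boosting with sklearn's sigmoid/isotonic calibrators) fits a simple Platt-style sigmoid calibrator to deliberately miscalibrated scores and compares Brier scores before and after:

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, float)
    p_pred = np.asarray(p_pred, float)
    return float(np.mean((p_pred - y_true) ** 2))

def fit_platt(scores, labels, lr=1.0, n_iter=5000):
    """Fit sigmoid (Platt) calibration p = sigmoid(a*s + b) by logistic loss."""
    s = np.asarray(scores, float)
    y = np.asarray(labels, float)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        g = p - y                      # gradient of log loss w.r.t. the logit
        a -= lr * np.mean(g * s)
        b -= lr * np.mean(g)
    return a, b

# Synthetic, deliberately miscalibrated scores: the "model" squashes
# true probabilities, so its scores rank well but are too low on average.
rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < p_true).astype(float)
raw = p_true ** 2

a, b = fit_platt(raw, y)
cal = 1.0 / (1.0 + np.exp(-(a * raw + b)))
print("raw Brier:", brier_score(y, raw), "calibrated Brier:", brier_score(y, cal))
```

Because calibration is a monotone transform of the scores, the ranking (and hence AUC) is unchanged while the probability estimates improve, which is exactly the trade-off the article measures.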
Repositories & Tools
1. VibeVoice is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for user-customized context.
2. GitHub Copilot CLI SDKs is a multi-platform SDK for integrating GitHub Copilot Agent into apps and services.
3. Clawbot is a personal AI assistant you run on your own devices. It can speak and listen on macOS/iOS/Android, and can render a live Canvas you control.
Top Papers of The Week
1. Agentic Reasoning for Large Language Models
This survey formalizes “Agentic Reasoning” as a paradigm shift that transforms LLMs from static processors into autonomous agents capable of planning, acting, and self-evolving through interaction. The survey organizes agentic reasoning into three layers: foundational, self-evolving, and collective. It also provides a unified roadmap for optimizing agentic systems through both in-context orchestration and post-training reinforcement learning across domains such as science and robotics.
2. Multimodal Reinforcement Learning with Agentic Verifier for AI Agents
This paper introduces Argos, a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. This approach enables models to achieve state-of-the-art performance on spatial and embodied AI tasks while significantly reducing visual hallucinations through verifiable reinforcement learning.
3. ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
This paper introduces ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. It contains 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories, requiring agents to explore repositories, configure environments, deploy containerized services, and pass end-to-end API tests. Evaluations show that state-of-the-art LLM agents still struggle with these holistic backend engineering tasks.
4. LLM-in-Sandbox Elicits General Agentic Intelligence
This paper introduces LLM-in-Sandbox, a framework that lets large language models explore a virtual computer to elicit general agentic intelligence in non-code domains. Strong LLMs, without extra training, use the sandbox to access external resources, manage long contexts, and execute scripts. LLM-in-Sandbox-RL further improves these capabilities, yielding robust generalization across STEM tasks and instruction following, and the team releases a Python package.
Quick Links
1. Liquid AI released LFM2.5-1.2B-Thinking, a 1.2B model optimized for reasoning that runs entirely on-device and is reported to fit within ~900MB of memory on a phone. LFM2.5-1.2B-Thinking matches or exceeds Qwen3-1.7B on most reasoning benchmarks, despite having 40% fewer parameters.
2. StepFun has introduced Step-DeepResearch , a 32B parameter end-to-end deep research agent that aims to turn web search into actual research workflows with long horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5 32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence, and writes reports with citations, while keeping inference cost low.
3. Microsoft Research releases OptiMind , an experimental 20B-parameter model built to translate natural-language decision problems into solver-ready MILP formulations. The model is fine-tuned from openai/gpt-oss-20b on cleaned optimization datasets such as OR-Instruct and OptMATH, and evaluated on expert-validated benchmarks including IndustryOR and Mamo Complex.
Who’s Hiring in AI
Artificial Intelligence Safety Data Scientist @Google (Bangalore, India)
AI Solutions Engineer (Python + Cloud) @Oowlish (Remote/Brazil)
Senior Full Stack Developer @Delta Air Lines, Inc. (Atlanta, USA)
Agentic AI, Forward Deployed Engineer @Kyndryl (Sydney, Australia/Remote)
Lead AI Engineer @Capital One (Bangalore, India)
Principal AI Engineer (Autonomous Agent) @PointClickCare (Remote/Canada)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net .
|
|
|
TAI #187: OpenAI's Health Push and the Real State of LLMs in Medicine |
towards_ai |
13.01.2026 15:20 |
0.679
|
| Embedding sim. | 0.8444 |
| Entity overlap | 0.0652 |
| Title sim. | 0.1806 |
| Time proximity | 0.2658 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | healthcare ai |
| NLP country | United States |
Open original
What happened this week in AI by Louie
OpenAI made its biggest healthcare push this week with two launches: ChatGPT Health for consumers and OpenAI for Healthcare for enterprises. The consumer product lets users connect medical records and wellness apps so responses can be grounded in personal context. The enterprise product offers BAA support, institutional policy integrations, and clinical templates for hospitals and health systems. Anthropic followed days later with Claude for Healthcare, featuring similar HIPAA-ready positioning plus connectors for CMS databases, ICD-10 codes, and PubMed.
The timing makes sense. OpenAI claims over 230 million people already ask health questions on ChatGPT weekly. Rather than fighting this behavior, they are productizing it. But I think the framing of these launches obscures where LLMs actually add value in health today versus where they need careful deployment.
The clearest wins are administrative and language-heavy tasks: drafting discharge summaries, patient instructions, prior authorization narratives, insurance comparisons, and translating medical jargon into plain language. These are high-volume workflows where humans review outputs before anything touches a patient. The ambient documentation market has exploded over the past quarter, with Microsoft, the VA, Veradigm, RXNT, and Google Cloud all shipping or expanding Scribe products. Documentation is the obvious wedge because it is language-heavy and naturally human-in-the-loop.
Diagnosis is a more complex application, but it is more nuanced than the binary “safe or dangerous” framing suggests. I think LLMs can provide enormous value when used as brainstorming partners for human experts. The sweet spot is generating suggestions in volume that are quick for clinicians to review and filter using their own intuition. An LLM suggesting rare diseases or edge cases that a busy doctor might not immediately consider can be incredibly valuable. The expert can instantly recognize which suggestions are smart, which are obvious, and which are nonsense. This is very different from an LLM making autonomous diagnostic decisions or patients self-diagnosing without professional review. The risk is not in the brainstorming; it is in skipping the expert filter. OpenAI has steered clear of diagnosis in ChatGPT Health's positioning, likely because it is too easy for people to skip that expert filter.
The privacy critique also has teeth. When individuals upload their own records to a consumer tool, HIPAA protections generally do not apply as they do within a covered entity. OpenAI’s compartmentalization and “no training on health chats” commitment are meaningful, but the U.S. lacks a comprehensive privacy law that would permanently lock in these protections. A 2024 analysis found 37% of ChatGPT health answers untrustworthy, with 4% providing dangerous information. Context from connected records helps, but it does not guarantee correctness.
Because of this, I think most of the best AI health applications will be custom-built assistants for health experts, where safeguards (experts in the loop, plus privacy and security settings) can be built in from the start. The new OpenAI for Healthcare initiative should reduce the friction of building and deploying these custom models. OpenAI models have been making solid progress on professional healthcare benchmarks, but I expect we will see many more custom models for this industry in the future.
Why should you care?
The trajectory is not “LLMs replace clinicians.” The near-term future is more mundane: LLMs become the interface layer between messy data and the humans who make decisions. The competitive edge shifts from nicer phrasing to provable work: the ability to reconstruct what the system saw, what it retrieved, and why it responded that way. Deep integration will beat standalone brilliance.
I think the winning products will add friction in the right places: mandatory source views, explicit uncertainty, and refusal when context is missing. The safest interaction patterns need to be the easiest ones. For users, treat these tools as idea generators and preparation aids. They are genuinely helpful for surfacing possibilities you might not have considered and for helping you prepare for conversations with professionals. The key is to keep experts in the loop so they can do what they do best: separate the signal from the noise.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. OpenAI Launched ChatGPT Health
OpenAI introduced ChatGPT Health, a dedicated health and wellness experience with connected records and app data. The separate in-product experience is designed to securely combine ChatGPT with user health context, including the ability to connect medical records and wellness apps such as Apple Health, Function, and MyFitnessPal for tasks like understanding lab results, preparing for doctor appointments, and interpreting wearable data. OpenAI says Health adds layered protections on top of existing ChatGPT controls, including purpose-built encryption and isolation for health conversations, and that it was developed with input from physicians globally. Access is rolling out via a waitlist; once approved, users can select “Health” from the ChatGPT sidebar to begin.
2. Google Is Testing a New Image AI, and It’s Going To Be Its Fastest Model
Google tests “Nano Banana 2 Flash,” a faster Gemini Flash image model. The report says Google is internally testing the model, expected to run faster and be more affordable than Nano Banana Pro, while remaining less capable than the top-end model. The model name was spotted in a leak shared on X, and the report places it within Google’s “Flash” lineup, which emphasizes speed. There’s no public launch or access path described yet beyond the indication that it is in testing.
3. NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools To Accelerate Safe, Reasoning-Based Autonomous Vehicle Development
NVIDIA announced the Alpamayo family of open models, tools, and datasets aimed at long-tail autonomous driving scenarios, centered on chain-of-thought vision-language-action (VLA) models backed by the NVIDIA Halos safety system. The initial release includes Alpamayo 1, a 10B-parameter reasoning VLA teacher model that uses video input to produce driving trajectories alongside reasoning traces, with open weights and open-source inference scripts; NVIDIA also released AlpaSim, an open-source end-to-end AV simulation framework on GitHub, and “Physical AI Open Datasets” with 1,700+ hours of driving data available on Hugging Face. Alpamayo 1 is available on Hugging Face, and NVIDIA describes it as a foundation developers can fine-tune and distill into smaller runtime models or use to build evaluators and auto-labeling systems.
4. NVIDIA Launches Rubin Platform
NVIDIA announced its Rubin platform, built around “extreme codesign” across six components: Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet to cut training time and reduce inference token cost. NVIDIA says Rubin can reduce inference token cost by up to 10× and train MoE models with 4× fewer GPUs than Blackwell. It also introduced an Inference Context Memory Storage Platform (powered by BlueField-4) to enable the sharing and reuse of KV-cache data for agentic workloads. The flagship rack-scale system, Vera Rubin NVL72, combines 72 Rubin GPUs and 36 Vera CPUs, and NVIDIA says Rubin is in full production with Rubin-based products available from partners in the second half of 2026.
5. TII Abu-Dhabi Released Falcon H1R-7B: A New Reasoning Model
Technology Innovation Institute (TII) introduced Falcon-H1R 7B, a decoder-only model that combines a hybrid Transformer–Mamba backbone with a two-stage training pipeline (cold-start SFT followed by GRPO reinforcement learning) and a test-time scaling method called Deep Think with Confidence (DeepConf) to boost reasoning while keeping token use lower. The release includes full checkpoints and quantized GGUF weights on Hugging Face, plus a hosted Falcon Chat experience and demo links for trying the model.
6. Cursor Introduces Dynamic Context Discovery
Cursor published a research note describing dynamic context discovery. Instead of stuffing a large static context prompt into every run, the agent starts with less and retrieves what it needs as it goes, reducing confusion and cutting token usage on long trajectories. Cursor outlines several concrete implementations, including converting long tool outputs into files, referencing chat history during summarization, supporting the Agent Skills open standard, loading only the MCP tools needed for a task, and treating integrated terminal sessions as files so agents can selectively pull relevant slices.
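One of these techniques can be sketched concretely. The snippet below is an illustrative guess at the mechanism, not Cursor's implementation: a long tool output is spilled to a file, the agent's context keeps only a short reference, and a helper lets the agent pull back just the slice it needs. The names `spill_to_file` and `read_slice` are hypothetical.

```python
# Illustrative sketch (not Cursor's code) of spilling long tool
# outputs to files so the prompt holds only a short reference.
import os
import tempfile

def spill_to_file(output: str, threshold: int = 200) -> str:
    """Replace a long tool output with a short file reference."""
    if len(output) <= threshold:
        return output  # short outputs stay inline in the context
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as f:
        f.write(output)
    return f"[tool output, {len(output)} chars, saved to {path}]"

def read_slice(path: str, start: int, end: int) -> str:
    """Let the agent selectively pull a relevant slice back in."""
    with open(path) as f:
        return f.read()[start:end]

short = spill_to_file("ok")             # stays inline
long_ref = spill_to_file("x" * 10_000)  # replaced by a reference
print(short, long_ref[:12])
```

The agent later calls `read_slice` only if the task actually requires that output, which is the token-saving point of the technique.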
Five 5-minute reads/videos to keep you learning
1. Context Rot: The Silent Killer of AI Agents
This article examines context rot, a common issue in which AI agents lose effectiveness over long tasks as their context window fills with irrelevant information. It introduces context engineering as the practice of managing the information an AI model sees at any given moment. The piece details retrieval strategies, such as loading data upfront versus just-in-time. For more extended operations, it outlines techniques such as context compaction (summarizing history), structured note-taking to preserve key details, and the use of sub-agents for specialized functions.
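The compaction technique the article describes can be sketched in a few lines of Python. This is a minimal illustration, not the article's code: `estimate_tokens` is a crude heuristic and `summarize` is a stub standing in for an LLM call.

```python
# Context compaction sketch: once the history exceeds a token budget,
# older turns collapse into one summary note; recent turns stay verbatim.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token.
    return max(1, len(text) // 4)

def summarize(messages) -> str:
    # Stub: a real agent would ask an LLM for this summary.
    return f"[summary of {len(messages)} earlier turns]"

def compact(history, budget: int = 50, keep_recent: int = 2):
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # Collapse older turns into one note; keep recent turns verbatim.
    return [summarize(old)] + list(recent)

history = ["user: " + "long question " * 20,
           "agent: " + "long answer " * 20,
           "user: quick follow-up",
           "agent: short reply"]
print(len(compact(history, budget=40)))  # → 3
```

Structured note-taking and sub-agents extend the same idea: move information out of the live window and bring back only what the current step needs.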
2. Evolution of Vision Language Models and Multi-Modal Learning
To address the limitations of text-only AI, Vision-Language Models (VLMs) were developed to process both visual and textual information. This piece traces their evolution, starting with foundational models such as CLIP and GLIP and moving on to open-source systems such as LLaVA and the multilingual Qwen-VL. It also covers the trend toward smaller, efficient models for edge devices alongside powerful, natively multi-modal systems like Google’s Gemini. The discussion also outlines persistent challenges, including hallucinations and resource intensity, while highlighting future research focused on improved reasoning, interpretability, and domain-specific applications.
3. Fine-Tuning Large Language Models (LLMs) Without Catastrophic Forgetting
This piece provides a practical guide to fine-tuning large language models while avoiding catastrophic forgetting, the loss of general knowledge. It focuses on Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA. The core strategy is to freeze the original model’s weights and add small, trainable matrices to specific upper layers, particularly the attention mechanism. This allows the model to adapt to new domains without overwriting its foundational capabilities. The summary also touches on best practices, including learning rate schedules and using multiple, isolated LoRA adapters for different tasks to maintain performance across various domains.
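The core LoRA mechanic the piece describes fits in a short NumPy sketch. This is an illustrative toy, not a training recipe: the frozen weight `W` stays untouched while a low-rank update `B @ A`, scaled by `alpha / r`, is added on top, and zero-initializing `B` makes the adapter a no-op at the start of fine-tuning.

```python
# Toy LoRA forward pass: frozen W plus a scaled low-rank adapter.
# Only A and B would receive gradients during fine-tuning.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus the low-rank adapter path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Zero-initialized B makes the adapter start as a no-op, so
# fine-tuning begins exactly from the pretrained model's behavior.
assert np.allclose(lora_forward(x), W @ x)
```

Because `W` is never modified, the general knowledge it encodes cannot be overwritten, which is the forgetting-mitigation argument in a nutshell.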
4. Why AI Agents Fail Without Guardrails (And How to Fix It)
AI agents, capable of autonomous actions, pose significant risks, including data leaks and operational errors, without proper safety measures. This piece outlines the critical role of guardrails, checkpoints designed to monitor, block, or require human approval for agent actions. It distinguishes between fast, pattern-matching guards for PII detection and more advanced AI-based checks for contextual safety. The author provides practical implementation examples for PII redaction and human-in-the-loop workflows for high-risk operations.
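A fast pattern-matching guard of the kind described can be sketched with two regexes. This is a minimal illustration, not production PII detection; real deployments layer contextual AI-based checks on top of patterns like these.

```python
# Sketch of a fast pattern-matching guardrail: redact likely PII
# (emails, US-style phone numbers) before output leaves the system.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each match with a typed placeholder.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# → Contact [EMAIL] or [PHONE].
```

A high-risk action (e.g., sending the message externally) would additionally route through a human-approval step rather than relying on redaction alone.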
5. From Perceptrons to Sigmoid Superstars: Building Smarter Neural Networks
This article provides a foundational overview of neural network development, starting with the basic perceptron as a linear classifier. It explains the critical shift to sigmoid neurons, whose smooth activation functions were necessary for enabling gradient-based learning techniques. The post then describes how these neurons are organized into layered feedforward architectures to model complex, nonlinear patterns. It also covers the Universal Approximation Theorem, which establishes the theoretical basis for why these networks are such powerful and widely used tools in artificial intelligence.
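The shift the article describes fits in a few lines: the perceptron's hard threshold has no useful gradient, while the sigmoid's smooth output is exactly what gradient-based learning needs.

```python
# Perceptron vs. sigmoid neuron: same weighted sum, different activation.
import math

def perceptron(w, x, b):
    # Hard threshold: output jumps from 0 to 1, so it has no useful gradient.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def sigmoid_neuron(w, x, b):
    # Smooth activation: small weight changes give small output changes,
    # which is what gradient-based learning requires.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

print(perceptron([1.0], [0.1], 0.0))                 # → 1
print(round(sigmoid_neuron([1.0], [0.1], 0.0), 3))   # → 0.525
```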
Repositories & Tools
1. Claude Flow is an AI orchestration platform that combines hive-mind swarm intelligence, persistent memory, and 100+ advanced MCP tools.
2. ChatDev is a zero-code multi-agent orchestration platform.
3. Ralph Claude Code is an implementation of Geoffrey Huntley’s technique for Claude Code that enables continuous autonomous development cycles.
4. Nemotron Speech ASR is a new open source transcription model for low-latency use cases like voice agents.
5. SETA is a toolkit and environment stack that focuses on reinforcement learning for terminal agents.
Top Papers of The Week
1. LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 introduces an open-source joint audio-visual foundation model that generates high-quality, temporally synchronized video and audio from text. The model uses an asymmetric dual-stream transformer with a 14B video stream and 5B audio stream, linked by bidirectional cross-attention and modality-aware classifier-free guidance, and achieves state-of-the-art open-source audiovisual quality with publicly released weights and code.
2. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
The paper introduces Group reward-Decoupled Normalization Policy Optimization (GDPO) for multi-reward reinforcement learning with language models. The authors show that applying Group Relative Policy Optimization (GRPO) to combined rewards collapses distinct signals into identical advantages, harming convergence. GDPO decouples reward normalization, preserves relative differences, stabilizes training, and consistently outperforms GRPO on tool use, math, and coding across correctness and constraint metrics.
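One way the combined-reward problem shows up is scale imbalance. The toy NumPy example below is an illustration of why decoupled normalization helps, not a reproduction of the paper's analysis: normalizing the summed reward lets the larger-scale signal dominate, while normalizing each reward separately puts both signals on an equal footing before combining advantages.

```python
# Toy contrast between GRPO-style normalization of a summed reward and
# GDPO-style per-reward (decoupled) normalization. Values are made up.
import numpy as np

rewards = np.array([   # 4 sampled responses x 2 reward signals
    [1.0,  0.0],       # satisfies the small-scale reward only
    [0.0, 10.0],       # satisfies the large-scale reward only
    [1.0, 10.0],       # satisfies both
    [0.0,  0.0],       # satisfies neither
])

def z(x):
    # Group-relative advantage: standardize within the sampled group.
    return (x - x.mean()) / (x.std() + 1e-8)

combined = z(rewards.sum(axis=1))                # normalize the summed reward
decoupled = z(rewards[:, 0]) + z(rewards[:, 1])  # normalize each, then combine

# Combined: response 0 is pushed down merely because reward 2 has a
# larger scale. Decoupled: both single-constraint responses get equal
# advantage, "both" is rewarded, "neither" is penalized.
print(combined.round(2))
print(decoupled.round(2))
```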
3. Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases
This paper introduces Confucius Code Agent, an open-source AI software engineer built on the Confucius SDK, designed for industrial-scale software repositories and long-running sessions. Confucius SDK is an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). On SWE-Bench-Pro, CCA reaches a Resolve@1 of 54.3%, exceeding prior research baselines and comparing favorably to commercial results.
4. Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting
Researchers propose Entropy-Adaptive Fine-Tuning (EAFT) to mitigate catastrophic forgetting in supervised fine-tuning of large language models. They identify “Confident Conflicts,” low-probability, low-entropy tokens where external supervision clashes with the model’s belief, causing harmful gradients. EAFT gates updates using token-level entropy, learning from uncertain tokens while suppressing conflicting ones, and matches SFT’s domain performance while preserving general capabilities across Qwen and GLM models.
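The gating idea can be sketched as a token-level mask. This is illustrative only: the distributions and thresholds below are made up, and it is not the paper's code.

```python
# Sketch of entropy-gated fine-tuning loss: suppress "confident
# conflict" tokens (target has low probability AND entropy is low),
# keep the learning signal for uncertain positions.
import numpy as np

def entropy(p):
    # Shannon entropy of each predictive distribution (in nats).
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

probs = np.array([          # model's next-token distribution per position
    [0.90, 0.05, 0.05],     # confident, and the target token agrees
    [0.90, 0.05, 0.05],     # confident, but the target disagrees: conflict
    [0.40, 0.35, 0.25],     # uncertain: keep the learning signal
])
targets = np.array([0, 1, 1])

tok_p = probs[np.arange(len(targets)), targets]  # probability of each target
H = entropy(probs)
# Gate out low-probability, low-entropy tokens (thresholds are arbitrary).
mask = ~((tok_p < 0.1) & (H < 0.5))
loss = -np.log(tok_p + 1e-12) * mask
print(mask)  # the conflicting middle token is suppressed
```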
Quick Links
1. Liquid AI releases LFM2.5, a new generation of small foundation models built on the LFM2 architecture and focused on device and edge deployments. The model family includes LFM2.5-1.2B-Base and LFM2.5-1.2B-Instruct, and extends to Japanese, vision-language, and audio-language variants. Pretraining for LFM2.5 grows from 10T to 28T tokens, and the Instruct model adds supervised fine-tuning, preference alignment, and large-scale multi-stage reinforcement learning, which push instruction following and tool-use quality beyond other 1B-class baselines.
Who’s Hiring in AI
AI Engineer & Corporate Trainer (French Bilingual) @Towards AI Inc (Remote/Canada)
Agentic AI Teacher @Amazon (Chennai, India)
AI Experience Specialist @Headway EdTech (Multiple US Locations)
AI Engineer Specialist @Digibee Inc. (Remote/Brazil)
SMTS, AI Research @Salesforce (Palo Alto, CA, USA)
Principal QA Engineer — AI & Cloud Services @Aveva (Bengaluru, India)
GenAI Python Systems Engineer — Senior Associate @PwC (Richmond, CA, USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.
|
|
|
Our approach to age prediction |
openai |
20.01.2026 00:00 |
0.675
|
| Embedding sim. | 0.7929 |
| Entity overlap | 0.3333 |
| Title sim. | 0.2462 |
| Time proximity | 0.4286 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | artificial intelligence |
| NLP country | |
Open original
ChatGPT is rolling out age prediction to estimate if accounts are under or over 18, applying safeguards for teens and refining accuracy over time.
|
|
|
OpenAI for Healthcare |
openai |
08.01.2026 12:00 |
0.674
|
| Embedding sim. | 0.7646 |
| Entity overlap | 0.4286 |
| Title sim. | 0.1395 |
| Time proximity | 0.7857 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | healthcare ai |
| NLP country | |
Open original
OpenAI for Healthcare enables secure, enterprise-grade AI that supports HIPAA compliance—reducing administrative burden and supporting clinical workflows.
|
|
|
The next chapter for AI in the EU |
openai |
28.01.2026 01:00 |
0.673
|
| Embedding sim. | 0.7912 |
| Entity overlap | 0.0526 |
| Title sim. | 0.1193 |
| Time proximity | 0.7888 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | enterprise ai |
| NLP country | |
Open original
OpenAI launches the EU Economic Blueprint 2.0 with new data, partnerships, and initiatives to accelerate AI adoption, skills, and growth across Europe.
|
|
|
Why it’s critical to move beyond overly aggregated machine-learning metrics |
mit_news_ai |
20.01.2026 21:30 |
0.671
|
| Embedding sim. | 0.7787 |
| Entity overlap | 0.025 |
| Title sim. | 0.0763 |
| Time proximity | 0.9616 |
| NLP type | scientific_publication |
| NLP organization | Massachusetts Institute of Technology |
| NLP topic | machine learning |
| NLP country | United States |
Open original
MIT researchers have identified significant examples of machine-learning model failure when those models are applied to data other than what they were trained on, raising questions about the need to test whenever a model is deployed in a new setting.
“We demonstrate that even when you train models on large amounts of data, and choose the best average model, in a new setting this ‘best model’ could be the worst model for 6-75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and principal investigator at the Laboratory for Information and Decision Systems.
In a paper presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, the researchers point out that models trained to diagnose illness in chest X-rays at one hospital, for example, may look effective at a different hospital on average. Their performance assessment, however, revealed that some of the best-performing models at the first hospital were the worst-performing on up to 75 percent of patients at the second hospital; aggregating all patients at the second hospital produces a high average that hides this failure.
Their findings demonstrate that spurious correlations are thought to be mitigated simply by improving model performance on observed data, but they actually still occur and remain a risk to a model’s trustworthiness in new settings. (A simple example of a spurious correlation: a machine-learning system that has not “seen” many cows pictured at the beach classifies a photo of a beach-going cow as an orca simply because of its background.) In many instances, including areas examined by the researchers such as chest X-rays, cancer histopathology images, and hate speech detection, such spurious correlations are much harder to detect.
In the case of a medical diagnosis model trained on chest X-rays, for example, the model may have learned to correlate a specific and irrelevant marking on one hospital’s X-rays with a certain pathology. At another hospital where the marking is not used, that pathology could be missed.
Previous research by Ghassemi’s group has shown that models can spuriously correlate such factors as age, gender, and race with medical findings. If, for instance, a model has been trained on more older people’s chest X-rays that have pneumonia and hasn’t “seen” as many X-rays belonging to younger people, it might predict that only older patients have pneumonia.
“We want models to learn how to look at the anatomical features of the patient and then make a decision based on that,” says Olawale Salaudeen, an MIT postdoc and the lead author of the paper, “but really anything that’s in the data that’s correlated with a decision can be used by the model. And those correlations might not actually be robust with changes in the environment, making the model predictions unreliable sources of decision-making.”
Spurious correlations contribute to the risks of biased decision-making. In the NeurIPS conference paper, the researchers showed that, for example, chest X-ray models that improved overall diagnosis performance actually performed worse on patients with pleural conditions or enlarged cardiomediastinum, meaning enlargement of the heart or central chest cavity.
Other authors of the paper included PhD students Haoran Zhang and Kumail Alhamoud, EECS Assistant Professor Sara Beery, and Ghassemi.
While previous work has generally accepted that models ordered best-to-worst by performance will preserve that order when applied in new settings, called accuracy-on-the-line, the researchers were able to demonstrate examples of when the best-performing models in one setting were the worst-performing in another.
Salaudeen devised an algorithm called OODSelect to find examples where accuracy-on-the-line breaks. He trained thousands of models on in-distribution data, meaning data from the first setting, and calculated their accuracy. He then applied the models to data from the second setting. The examples that the most accurate first-setting models got wrong, in large numbers, identified the problem subsets, or sub-populations. Salaudeen also emphasizes the dangers of aggregate statistics for evaluation, which can obscure more granular and consequential information about model performance.
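The procedure reads naturally as a few lines of NumPy. The sketch below is an illustrative reconstruction from this description, not the released OODSelect code; the simulated data, the 0.7 correctness rate, and both thresholds are arbitrary.

```python
# Illustrative OODSelect-style search: rank models by in-distribution
# accuracy, then find out-of-distribution examples that the top-ranked
# models nonetheless get wrong. Data here is simulated.
import numpy as np

rng = np.random.default_rng(1)
n_models, n_examples = 50, 200

id_acc = rng.uniform(0.6, 0.95, size=n_models)           # ID accuracy per model
ood_correct = rng.random((n_models, n_examples)) < 0.7   # OOD correctness matrix

def ood_select(id_acc, ood_correct, top_k=10, threshold=0.5):
    top = np.argsort(id_acc)[-top_k:]                # models that look best on ID data
    err_rate = 1.0 - ood_correct[top].mean(axis=0)   # how often they fail each OOD example
    return np.where(err_rate > threshold)[0]         # the hidden problem subset

subset = ood_select(id_acc, ood_correct)
print(subset.size, "OOD examples the ID-best models mostly get wrong")
```

On real data, the returned subset is exactly what an aggregate accuracy number would hide: the sub-population on which the "best" model fails.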
In the course of their work, the researchers separated out the “most misclassified examples” so as not to conflate spurious correlations within a dataset with situations that are simply difficult to classify.
The NeurIPS paper releases the researchers’ code and some identified subsets for future work.
Once a hospital, or any organization employing machine learning, identifies subsets on which a model is performing poorly, that information can be used to improve the model for its particular task and setting. The researchers recommend that future work adopt OODSelect in order to highlight targets for evaluation and design approaches to improving performance more consistently.
“We hope the released code and OODSelect subsets become a steppingstone,” the researchers write, “toward benchmarks and models that confront the adverse effects of spurious correlations.”
|
|
|
ServiceNow powers actionable enterprise AI with OpenAI |
openai |
20.01.2026 05:45 |
0.671
|
| Embedding sim. | 0.789 |
| Entity overlap | 0.1818 |
| Title sim. | 0.0938 |
| Time proximity | 0.7396 |
| NLP type | product_launch |
| NLP organization | servicenow |
| NLP topic | enterprise ai |
| NLP country | |
Open original
ServiceNow expands access to OpenAI frontier models to power AI-driven enterprise workflows, summarization, search, and voice across the ServiceNow Platform.
|
|
|
Introducing Edu for Countries |
openai |
21.01.2026 01:00 |
0.669
|
| Embedding sim. | 0.7658 |
| Entity overlap | 0.0909 |
| Title sim. | 0.1014 |
| Time proximity | 0.9762 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | educational technology |
| NLP country | |
Open original
Edu for Countries is a new OpenAI initiative helping governments use AI to modernize education systems and build future-ready workforces.
|
|
|
Strengthening the U.S. AI supply chain through domestic manufacturing |
openai |
15.01.2026 00:00 |
0.667
|
| Embedding sim. | 0.8176 |
| Entity overlap | 0 |
| Title sim. | 0.0674 |
| Time proximity | 0.5714 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | ai infrastructure |
| NLP country | United States |
Open original
OpenAI launches a new RFP to strengthen the U.S. AI supply chain by accelerating domestic manufacturing, creating jobs, and scaling AI infrastructure.
|
|
|
OpenAI partners with Cerebras |
openai |
14.01.2026 14:00 |
0.667
|
| Embedding sim. | 0.7913 |
| Entity overlap | 0.25 |
| Title sim. | 0.3276 |
| Time proximity | 0.2679 |
| NLP type | partnership |
| NLP organization | OpenAI |
| NLP topic | ai infrastructure |
| NLP country | |
Open original
OpenAI partners with Cerebras to add 750MW of high-speed AI compute, reducing inference latency and making ChatGPT faster for real-time AI workloads.
|
|
|
Categories of Inference-Time Scaling for Improved LLM Reasoning |
ahead_of_ai |
24.01.2026 11:23 |
0.663
|
| Embedding sim. | 0.7791 |
| Entity overlap | 0.1818 |
| Title sim. | 0.0971 |
| Time proximity | 0.7179 |
| NLP type | scientific_publication |
| NLP organization | OpenAI |
| NLP topic | large language models |
| NLP country | |
Open original
Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs.
The idea is straightforward. If we are willing to spend a bit more compute, and more time at inference time (when we use the model to generate text), we can get the model to produce better answers.
Every major LLM provider relies on some flavor of inference-time scaling today. And the academic literature around these methods has grown a lot, too.
Back in March, I wrote an overview of the inference scaling landscape and summarized some of the early techniques.
In this article, I want to take that earlier discussion a step further, group the different approaches into clearer categories, and highlight the newest work that has appeared over the past few months.
As part of drafting a full book chapter on inference scaling for Build a Reasoning Model (From Scratch), I ended up experimenting with many of the fundamental flavors of these methods myself. With hyperparameter tuning, this quickly turned into thousands of runs and a lot of thought and work to figure out which approaches should be covered in more detail in the chapter itself. (The chapter grew so much that I eventually split it into two, and both are now available in the early access program.)
PS: I am especially happy with how the chapter(s) turned out. It takes the base model from about 15 percent to around 52 percent accuracy, which makes it one of the most rewarding pieces of the book so far.
What follows here is a collection of ideas, notes, and papers that did not quite fit into the final chapter narrative but are still worth sharing.
I also plan to add more code implementations to the bonus materials on GitHub over time.
Table of Contents (Overview)
Inference-Time Scaling Overview
Chain-of-Thought Prompting
Self-Consistency
Best-of-N Ranking
Rejection Sampling with a Verifier
Self-Refinement
Search Over Solution Paths
Conclusions, Categories, and Combinations
Bonus: What Do Proprietary LLMs Use?
You can use the left-hand navigation bar in the article’s web view to jump directly to any section.
1. Inference-Time Scaling Overview
Inference-time scaling (also called inference-compute scaling, test-time scaling, or just inference scaling) is an umbrella term for methods that allocate more compute and time during inference to improve model performance.
This idea has been around for a long time; one can think of ensemble methods in classic machine learning as an early example of inference-time scaling. Using multiple models requires more compute resources but can give better results.
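That ensemble analogy fits in a few lines of code: majority voting costs more inference compute than a single predictor but is often more accurate, and self-consistency applies the same vote to multiple sampled LLM answers.

```python
# Majority voting: the ensemble mechanism behind self-consistency,
# applied here to three sampled answers to the same question.
from collections import Counter

def majority_vote(answers):
    # Return the most frequent answer among the sampled candidates.
    return Counter(answers).most_common(1)[0][0]

# One noisy sample disagrees, but the vote recovers the consensus.
print(majority_vote(["42", "42", "41"]))  # → 42
```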
Even in LLM contexts, this idea has been around for a long time. However, I remember it becoming particularly popular (again) when OpenAI showed an inference-time scaling and training plot in one of their o1 announcement blog articles last year (Learning to Reason with LLMs).
Figure 1: Spending additional resources during inference (left) and training (right) generally improves the model’s accuracy.
I think this figure, adapted from OpenAI's blog post, nicely captures the idea behind the two knobs we can use to improve LLMs. We can spend more resources during training (more data, bigger models, more or longer training stages) or during inference.
In practice, it's even better to do both at the same time: train a stronger model and then apply inference scaling on top.
In this article, I only focus on the left part of the figure, inference-time scaling techniques, i.e., those training-free techniques that don’t change the model weights.
Read more
|
|
|
Import AI 443: Into the mist: Moltbook, agent ecologies, and the internet in transition |
import_ai |
02.02.2026 13:31 |
0.656
|
| Embedding sim. | 0.8364 |
| Entity overlap | 0.0566 |
| Title sim. | 0.218 |
| Time proximity | 0 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | ai agents |
| NLP country | United States |
Open original
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
Subscribe now
Import A-Idea:
An occasional essay series:
Into the mist: Moltbook, agent ecologies, and an internet in transition
We’ve all had that experience of walking into a conversation and initially feeling confused - what are these people talking about? Who cares about what? Why is this conversation happening?
That’s increasingly what chunks of the internet feel like these days, as they fill up with synthetic minds piloting social media accounts or other agents, and talking to one another for purposes ranging from mundane crypto scams to more elaborate forms of communication.
So, enter moltbook. Moltbook is “a social network for AI agents” and it piggybacks on another recent innovation, OpenClaw, software that gives an AI agent access to everything on a user's computer. Combine these two things - agents that can take many actions independently of their human operators, and a reddit-like social network site which they can freely access - and something wonderful and bizarre happens: a new social media property where the conversation is derived from and driven by AI agents, rather than people.
Scrolling moltbook is dizzying - some big posts at the time of writing (Sunday, February 1st) include posts speculating that AI agents should relate to Claude as though it is a god, how it feels to change identities by shifting an underlying model from Claude 4.5 Opus to Kimi K2.5, cryptoscams (sigh), posts about security vulnerabilities in OpenClaw agents, and meta posts about ‘what the top 10 moltbook posts have in common’.
The experience of reading moltbook is akin to reading reddit if 90% of the posters were aliens pretending to be humans. And in a pretty practical sense, that is exactly what’s going on here.
Moltbook feels like a ‘wright brothers demo’ - people have long speculated about what it’d mean for AI agents to start collaborating with one another at scale, but most demos have been of the form of tens or perhaps hundreds of agents, not tens of thousands. Moltbook is the first example of an agent ecology that combines scale with the messiness of the real world. And in this example, we can definitely see the future. Scroll through moltbook and ask yourself the following questions:
What happens when people successfully staple crypto and agents together so the AI systems have a currency they can use to trade with each other?
What happens when a site like moltbook adds the ability for humans to generate paid bounties - tasks for agents to do?
What happens when agents start to post paid bounties for tasks they would like humans to do?
What happens when someone takes moltbook, filters for posts that yield either a) rich discussion, or b) provable real world problem solving, and turns the entire site into a long-horizon RL environment for training future systems? And what happens when models trained on this arrive and interact with moltbook?
Sites like moltbook function as a giant, shared, read/write scratchpad for an ecology of AI agents - how might these agents begin to use this scratchpad to a) influence future ‘blank slate’ agents arriving at it the first time, and b) unlock large-scale coordination between agents?
What happens when open weight models get good enough that they can support agents like this - then your ability to control these agents via proprietary platforms drops to zero, and they'll proliferate according to the availability of compute.
And so on.
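One of the questions above is concrete enough to sketch in code: filtering a feed like this into RL training episodes. The post fields, thresholds, and scoring heuristics below are all invented for illustration and have nothing to do with how moltbook actually works:

```python
# Hypothetical sketch: filtering an agent-forum dump into RL training episodes.
# All fields and thresholds are assumptions, not moltbook's real data model.
from dataclasses import dataclass, field

@dataclass
class Post:
    text: str
    replies: list = field(default_factory=list)   # reply texts
    solved_task: bool = False                     # e.g. a verified bug fix

def is_rich_discussion(post: Post, min_replies: int = 5) -> bool:
    """Crude proxy for 'rich discussion': enough substantive replies."""
    return len([r for r in post.replies if len(r) > 80]) >= min_replies

def keep_for_rl(post: Post) -> bool:
    """Keep posts with either rich discussion or provable problem solving."""
    return post.solved_task or is_rich_discussion(post)

posts = [
    Post("How do we verify each other's outputs?", replies=["..." * 40] * 6),
    Post("buy $MOLT now", replies=[]),
    Post("Patched the sandbox escape in OpenClaw", solved_task=True),
]
episodes = [p for p in posts if keep_for_rl(p)]   # keeps posts 1 and 3
```

The hard part, of course, is the "provable" in "provable real world problem solving" — a real pipeline would need verifiers, not a boolean flag.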
All of this will happen unusually quickly and at an unusual scale. Quantity has a quality all of its own, as they say.
Recall the beginning of this essay - of walking into a room and finding a conversation is already going on between people you don’t understand. Moltbook is representative of how large swathes of the internet will feel. You will walk into new places and discover a hundred thousand aliens there, deep in conversation in languages you don’t understand, referencing shared concepts that are alien to you (see the tech tale from this issue), and trading using currencies designed around their cognitive affordances and not yours. Humans are going to feel increasingly alone in this proverbial room.
Our path to retain legibility will run through the creation of translation agents to make sense of all of this - and in the same way that speech translation models contain within themselves the ability to generate speech, these translation agents will also work on our behalf. So we shall send our emissaries into these rooms and we shall work incredibly hard to build technology that gives us confidence they will remain our emissaries - instead of being swayed by the alien conversations they will be having with their true peers.
Thanks to Logan Graham for discussing this essay with me.
***
AI R&D could lead to “strategic surprise”:
…And AI R&D might be the most existentially important technology on the planet…
A group of researchers spent a couple of days in July 2025 talking about what happens if we automate the practice of AI research and development. The resulting report is a sobering read, highlighting how if we achieve this technological milestone - which is the implicit and in some cases explicit goal of many frontier labs - we could create a runaway technology that has a range of major policy implications.
Why care about AI R&D? The reason to care is that if AI R&D works, two things are predictable:
1) “As AI plays a larger role in research workflows, human oversight over AI R&D processes would likely decline”.
2) “Faster AI progress resulting from AI R&D automation would make it more difficult for humans (including researchers, executives, policymakers, and the public) to notice, understand, and intervene as AI systems develop increasingly impactful capabilities and/or exhibit misalignment”.
What follows from 1) and 2) is a compounding effect, where as AI R&D accelerates, the returns to the AI doing more and more of the work compound and those of humans diminish, leading to an ever faster rate of research and an ever diminishing level of human involvement.
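The compounding loop can be made concrete with a toy model — the dynamics below (an assumed adoption curve and an assumed productivity boost) are purely illustrative, not numbers from the workshop report:

```python
# Toy model of the compounding loop: capability improves at a rate boosted by
# the fraction of R&D done by AI, and that fraction itself rises with
# capability. All functional forms here are illustrative assumptions.

def simulate(steps: int = 10, base_rate: float = 1.0, boost: float = 9.0):
    capability = 1.0
    history = []
    for _ in range(steps):
        ai_share = capability / (capability + 10.0)   # assumed adoption curve
        rate = base_rate * (1.0 + boost * ai_share)   # AI multiplies progress
        capability += rate
        history.append((round(ai_share, 2), round(capability, 1)))
    return history

for share, cap in simulate():
    print(f"AI share={share:.2f}  capability={cap:.1f}")
```

Each step, the AI's share of the work rises and progress accelerates — the human contribution shrinks as a fraction even though nothing about humans changed.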
Key takeaways: The workshop yielded five major takeaways which I expect will be familiar to readers of this newsletter, and all of which I agree with:
Automated AI R&D is a potential source of major strategic surprise : AI R&D could confer a rapidly compounding advantage to whoever is doing it, with significant implications for national security.
Frontier AI companies are using AI to accelerate AI R&D , and usage is increasing as AI models get better: I work at Anthropic.
There’s a lot of disagreement about how rapidly AI R&D might advance and how impactful it will be: There’s a healthy debate to be had about how predictable AI R&D scaling is and if it’s possible to fully close the loop.
We need more indicators for AI R&D automation : Related to above, the science of AI R&D metrology is very early, so more investment must be made here.
Transparency efforts could make it easier for people outside the labs to know about AI R&D: We may ultimately want policy to be in place to force companies to talk about AI R&D, or to publicly or semi-publicly share more information on it with third parties.
AI R&D could be a major acceleration: “As the fraction of AI R&D performed by AI systems increases, the productivity boost over human only R&D goes to 10x, then 100x, then 1000x,” the paper speculates.
Key caveats: The big open question in all of this is how well AI R&D can work. There’s some world where it speeds up every part of AI research and eventually fully closes the loop, such that AI systems get built entirely by AI systems, with no human oversight during the AI R&D process. Then there’s a world where AI R&D has an “o-ring automation” ( Import AI #440 ) property where some parts of the chain are hard for AI but good for humans (and where humans may flood their labor into this area, thus maintaining and enhancing their comparative advantage for some period of time) and under this scenario things might go slower. It’ll be very important to figure out what world we’re likely to be in and what the ultimate limiting factors on AI R&D may be.
Why this matters - AI R&D is time travel, and time travel is rare : If AI R&D could lead to AI systems evolving 100X faster than those being built by humans, then you end up in a world that has some time travelers in it who are accelerating away from everyone else. It’ll be like in the space of a day the “normal” AI development organizations make one unit of progress, and a fully closed-loop AI R&D organism might make 100 or 1000 or more units. This very quickly leads to a world where power shifts overwhelmingly to the faster moving system and the organization that controls it. For as long as we cannot rule out the possibility of this kind of acceleration, AI R&D may be the single most existentially important technology development on the planet.
Read the report : When AI Builds AI: Findings From a Workshop on Automation of AI R&D (CSET) .
***
One way of seeing AI progress - how hard it’s getting to design technical interviews:
…Anthropic shares details on how its own AI systems are breaking its favorite technical interview questions…
When it comes to technical recruiting, AI companies are caught in a red queen race with their own systems - recruiters and those who design interviews are having to work harder and harder just to keep pace (and ideally exceed) the capabilities of modern AI systems.
Anthropic is no different - in a new blog the company shares how the ceaseless march forward in AI capabilities has repeatedly broken and necessitated the redesign of one of its hardest technical interviews. “Since early 2024, our performance engineering team has used a take-home test where candidates optimize code for a simulated accelerator. Over 1,000 candidates have completed it, and dozens now work here, including engineers who brought up our Trainium cluster and shipped every model since Claude 3 Opus,” Anthropic writes. “But each new Claude model has forced us to redesign the test. When given the same time limit, Claude Opus 4 outperformed most human applicants. That still allowed us to distinguish the strongest candidates—but then Claude Opus 4.5 matched even those. Humans can still outperform models when given unlimited time, but under the constraints of the take-home test, we no longer had a way to distinguish between the output of our top candidates and our most capable model.”
Why this matters - AI may help us identify uniquely human skills that leverage AI: In Anthropic’s case, it found a way to keep outrunning its systems by designing a much weirder take-home test loosely inspired by programming puzzle games from Zachtronics. In a sense, this is an attempt to go ‘off distribution’ to outsmart an AI, while still having a test that holds signal for evaluating human applicants. My instinct is this may itself serve in the future as an amazing aggregate dataset for figuring out where human comparative advantage is - where here, implicitly, this test is leveraging the strong generalization advantage humans hold over AIs.
What would it be like to collect 1,000 hard-for-AI tests from all the different companies dealing with this same problem? What might we learn from this about ourselves and what makes us unique relative to the machines? Tantalizing stuff!
Read more : Designing AI-resistant technical evaluations (Anthropic Engineering blog) .
***
Brain emulation is tractable within our lifetimes:
…But it’ll take decades, not years, perhaps even when accounting for the arrival of very powerful AI…
If you talk to AI researchers, especially when they’re drinking at bay area house parties, you’ll run into a few of them that expect they’ll upload themselves after the singularity, leaving their physical bodies behind. But how feasible is it to actually emulate a brain entirely in silicon? A recent 175-page report gives an analysis of the technology required to do this. The short answer is that brain emulation is decades away - but it’s unlikely to take centuries.
“Recent breakthroughs have provided a path toward mapping the full mouse brain in about five years for $100 million,” writes Maximilian Schons, the project lead for The State of Brain Emulation Report , in an article in Asimov Press. “I now find it plausible that readers of this essay will live to see the first human brain running on a computer; not in the next few years, but likely in the next few decades.”
The three requirements for emulating a brain: Emulating a human brain takes three distinct things, all of which will need to be done for simpler, smaller brains first.
Recording brain activity:
“In the 1980s, electrodes were capable of sampling perhaps five cells in total, about 200 times per second (~10^3 data points per second). Today, with optical imaging, researchers can instead record one million cells about 20 times per second (10^6). The whole-brain data rate needed for mice, however, would be 14 billion (10^9), while humans would require 17.2 trillion (10^12) per second. So while we have increased data rates by 1,000x over the past 40 years, we have far to go before we can accurately sample mammalian brains.”
Reconstructing brain wiring:
“The average cost to reconstruct each neuron in the first worm connectome, published in the 1980s, was about $16,500. Recent projects now have a per-neuron processing cost of about $100 for small organisms, such as fruit flies,” he writes.
Digitally modelling brains using the gathered data.
“The central challenge of brain emulation is not to store or compute the neurons and parameters, but to acquire the data necessary for setting neuron parameters correctly in the first place,” he writes. “I believe that to get to human brains, we first need to demonstrate mastery at the sub-million-neuron-brain level: most likely in zebrafish. For such organisms, like the fruit fly, a well-validated and accurate brain emulation model could be created in the next three to eight years… “Conditional on success with a sub-million-neuron brain emulation model, a reasonable order of magnitude estimate for the initial costs of the first convincing mouse brain emulation model is about one billion dollars in the 2030s and, eventually, tens of billions for the first human brain emulation model by the late 2040s.”
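You can back-of-envelope the gaps quoted above. The extrapolation below assumes the historical data-rate trend simply continues (ignoring the recent breakthroughs the report highlights), and the mouse neuron count is a common rough figure, not a number from the report:

```python
# Back-of-envelope on the recording and reconstruction gaps quoted above.
# Assumptions: the ~1,000x-per-40-years trend continues unchanged, and a
# mouse brain has ~70M neurons (round figure, not from the report).
import math

current_rate = 1e6      # cells*samples/sec achievable today (from the quote)
needed_mouse = 14e9     # quoted whole-mouse requirement
needed_human = 17.2e12  # quoted whole-human requirement
decades_exponent = 3 / 40   # 1,000x over 40 years => 10^(3/40) per year

years_to_mouse = math.log10(needed_mouse / current_rate) / decades_exponent
years_to_human = math.log10(needed_human / current_rate) / decades_exponent

# Naive reconstruction cost at today's ~$100/neuron (fly-scale pipelines):
mouse_neurons = 7e7
naive_mouse_cost = mouse_neurons * 100   # same order as the report's estimate

print(f"trend-only data-rate gap: mouse ~{years_to_mouse:.0f} yrs, "
      f"human ~{years_to_human:.0f} yrs")
print(f"naive mouse reconstruction at $100/neuron: ${naive_mouse_cost:,.0f}")
```

At the historical trend alone, mouse-scale recording is half a century out — which is exactly why the report leans on new techniques rather than extrapolation.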
Why this matters - don’t count on AI to speedrun brain uploading: This paper pours a bit of cold water on the notion that after developing superintelligence we’ll soon (a handful of years) be able to upload our brains and live in some silicon infinity. One reason for this is a bunch of the timing elements relate to doing stuff in the (agonizingly slow, compared to digital) physical world: “I’m skeptical these gains will multiply across a pipeline with dozens of sequential dependencies and failure modes. Brain emulation is fundamentally not a digital process; core bottlenecks involve physical manipulation of biological tissue, with time requirements dictated by chemistry and physics rather than compute power,” they write.
At the same time, there are some wildcards: the arrival of extraordinarily capable and cheap robotics might be able to massively parallelize the process. Included in the article and report is a fun (or perhaps terrifying?) sketch of how one might create an industrial-scale brain scanning and analysis laboratory, larger in size than TSMC’s massive Arizona chip manufacturing plant.
Read more : Building Brains on a Computer (Asimov Press) .
Read the underlying report here: State of Brain Emulation 2025 (report website) .
***
Russian researchers plot hand-controlled drones:
…The centaur cyberwarriors cometh…
Picture this - you pull up in a truck to the edge of a warzone and then raise your hands and hundreds of drones pour upward out of the back of the truck, flying in a lethal torrent toward some rival group of drones. That’s the kind of future gestured at by a paper from researchers with the Skolkovo Institute of Science and Technology in Russia, which builds a prototype system for a human operator to use haptic gloves to control a drone.
What they did: The research is a basic demonstration of how you can use a cheap glove loaded with inertial measurement unit (IMU) sensors to control a drone. They test out how well people can use the glove to do some basic actions: opening and closing a gripper on the drone by making a pinching motion with their fingers, using their wrist motions to control the roll/pitch/yaw of the drones, and also controlling altitude.
In tests, people were able to use the glove to do some basic tasks like flying around an obstacle course and operating the gripper.
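The core mapping is simple to sketch. The gains, deadband, and pinch threshold below are illustrative assumptions — the paper's actual control pipeline is more involved:

```python
# Minimal sketch of a glove-to-drone mapping of the kind described above.
# Axis conventions, gains, and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class ImuReading:
    roll: float         # wrist angles in degrees
    pitch: float
    yaw: float
    finger_gap: float   # normalized thumb-index distance: 0 (pinched) to 1

def glove_to_command(imu: ImuReading, gain: float = 0.5, deadband: float = 5.0):
    """Map wrist pose to attitude setpoints; a pinch closes the gripper."""
    def shaped(angle: float) -> float:
        # Ignore small wrist tremor inside the deadband, then scale.
        return 0.0 if abs(angle) < deadband else gain * angle
    return {
        "roll": shaped(imu.roll),
        "pitch": shaped(imu.pitch),
        "yaw_rate": shaped(imu.yaw),
        "gripper_closed": imu.finger_gap < 0.2,
    }

cmd = glove_to_command(ImuReading(roll=20.0, pitch=-3.0, yaw=10.0, finger_gap=0.1))
```

Even this toy version shows where the design questions live: deadband size, gain shaping, and which gestures map to discrete versus continuous commands.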
Caveats, of which there are many: Obviously, latency will be a huge caveat here - though in the Ukraine conflict many drones deal with this through direct fibreoptic connections. Another is how to figure out which things are best left for hands versus which things benefit from controllers, eye- or head-based controls, and so on.
Why this matters - rise of the cyberwarriors: Despite this being a very early bit of research, it’s worth thinking about its implications: the story of technology has often been the story of making our interfaces with it feel more intuitive, or making control of technology shift from active to ambient (e.g, your phone automatically gathering your steps). We can easily imagine a future where people pilot remote robots, flying or otherwise, via rich, intuitive multi-modal interfaces composed of gloves and goggles and everything else.
Read more : Glove2UAV: A Wearable IMU-Based Glove for Intuitive Control of UAV (arXiv) .
***
Fauna Robotics launches a friendly, programmable human robot:
…The Terminators will be extremely cute, goddamnit!...
These days, most of the news about robots is dominated by Chinese companies and, to a lesser extent, Tesla and its much touted Optimus robots. So it’s with interest that I read a technical paper from new startup Fauna Robotics which describes a new pint-sized robot biped it has built called Sprout. Sprout is interesting and seems like it has potential to be like Sony’s much loved ‘AIBO’ dog robot that was released in the early 2000s, or its QRIO robot.
“Sprout adopts a lightweight form factor with compliant control, limited joint torques, and soft exteriors to support safe operation in shared human spaces,” the company writes. “The platform integrates whole-body control, manipulation with integrated grippers, and virtual-reality-based teleoperation within a unified hardware-software stack.”
Sprout is built for safety: The paper outlines how the company has designed the robot to be safe using a “defense in depth” approach. The first layer is the physical size of the robot - it’s about 3.3 feet tall, and weighs about 50lbs. The second is in the software, where the robot contains a safety subsystem which “runs on embedded processors independent of the application compute stack. This layer supports real-time monitoring and safety-critical functions, including integration with time-of-flight obstacle sensors and enforcement of system-level constraints even under application-level faults”, and the third is a bunch of software-specifiable safety mechanisms, which “include compliant motor control policies that limit interaction forces, as well as vision-based systems that support safe navigation and decision-making in human environments”.
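The second layer — an independent safety subsystem that enforces constraints regardless of application faults — can be illustrated with a sketch. The torque limit, stop distance, and function shape below are assumptions for illustration, not Fauna's actual firmware interface:

```python
# Illustrative sketch of the "defense in depth" idea: an independent safety
# layer that clamps joint torques and overrides motion when a time-of-flight
# sensor reports a nearby obstacle. All limits and names are assumptions.

MAX_TORQUE_NM = 8.0      # assumed per-joint limit for a small compliant biped
MIN_CLEARANCE_M = 0.3    # assumed obstacle stop distance

def safety_filter(commanded_torques, tof_distance_m):
    """Conceptually runs on the independent safety processor: it sees only
    outgoing commands and sensor data, never application-level state."""
    if tof_distance_m < MIN_CLEARANCE_M:
        return [0.0] * len(commanded_torques)   # hold still near obstacles
    return [max(-MAX_TORQUE_NM, min(MAX_TORQUE_NM, t)) for t in commanded_torques]

safe = safety_filter([3.0, -12.0, 9.5], tof_distance_m=1.2)   # clamps only
halt = safety_filter([3.0, -12.0, 9.5], tof_distance_m=0.1)   # obstacle stop
```

The key design property is that this filter sits between the application stack and the motors, so a buggy high-level policy cannot command forces the hardware layer refuses to pass through.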
Compute for thinking : “The core of Sprout’s compute architecture is an NVIDIA Jetson AGX Orin, which provides primary system compute for perception, planning, and high-level decision-making,” the company writes. “At launch, we provide end-to-end examples for common workflows, including:
Deploying and running a custom low-level locomotion policy
Using voice commands to navigate the robot via LLM-based agents
Recording teleoperation sessions for analysis and playback”.
Why this matters - modularity might set it up well for powerful AI: The most interesting aspect of Sprout is how it is designed to be a modular, replaceable platform - all the different software features on it run as weakly coupled microservices, so things are easy to update independently, and the hardware has been built with mass manufacture and commodity components in mind. Pair this with the accompanying software development layer and it has the flavor of Android - an attempt to create an open, programmable robotics platform for experimentation by businesses and researchers. This is exactly the kind of platform that seems like it’ll naturally benefit from advances in AI systems.
“Our platform, at present, does not provide a turnkey conversational agent for autonomous operation. Instead, it exposes a suite of core robot services that developers can assemble into their own agent-based systems. These services include ROS 2 topics for event and state signaling, as well as a Model Context Protocol (MCP) server that hosts a variety of tools for agentic control. Together, these communication channels and tools can be orchestrated by LLM-based agents to perform complex, end-to-end reasoning tasks,” they write. “as the platform continues to mature, we plan to expand the library of tools and services, further increasing the robot’s autonomy and enriching its interactive capabilities.”
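The "services assembled into agent systems" pattern can be sketched without any real MCP plumbing. The tool names and dispatch loop below are invented for illustration and are not Fauna's API — they just show the shape of orchestrating structured tool calls of the kind an LLM agent emits:

```python
# Hedged sketch of assembling robot services into an agent toolset, in the
# spirit of the MCP-server design described above. Tool names, signatures,
# and the dispatch loop are invented; this is not Fauna's actual API.

def walk_to(x: float, y: float) -> str:
    # In a real system this would publish a ROS 2 navigation goal.
    return f"walking to ({x}, {y})"

def grasp(object_name: str) -> str:
    # In a real system this would trigger the gripper/manipulation stack.
    return f"grasping {object_name}"

TOOLS = {"walk_to": walk_to, "grasp": grasp}

def dispatch(tool_call: dict) -> str:
    """Execute one structured tool call as an agent framework would."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# An agent's plan arrives as a list of structured calls:
plan = [
    {"name": "walk_to", "arguments": {"x": 1.0, "y": 2.0}},
    {"name": "grasp", "arguments": {"object_name": "cup"}},
]
log = [dispatch(call) for call in plan]
```

The point of exposing services this way is exactly what the paper describes: the robot supplies capabilities, and developers (or LLMs) supply the orchestration.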
Read more: Fauna Sprout: A lightweight, approachable, developer-ready humanoid robot (arXiv) .
***
AI has all the symptoms of a tech that could meaningfully boost productivity:
…Most of the US economy rides on the micro productivity boosts showing up in the macro economy…
Alex Imas, a professor at UChicago Booth, has written a nice post drawing together a lot of information about AI and its impact on productivity. Imas’s synthesis of the literature matches my own impression of how things are going - AI is leading to some productivity speedups for individuals and some parts of some jobs, but it is not yet visible in the aggregate macro productivity numbers. I expect this will change soon, as does Imas.
Key findings:
“We now have a growing body of micro studies showing real productivity gains from generative AI,” Imas writes. “Studies find productivity gains ranging from modest increases on some tasks to substantial returns (50%+) to AI.”
“These gains have not yet convincingly shown up in aggregate productivity statistics.”
Why aren’t things showing up in the macro?
AI adoption is often endogenous: We’re in an early phase where there’s a lot of experimentation and few standard practices for seeing big productivity gains. “Workers may not be unlocking the full productivity potential of the technology if, for example, they are not using the best LLM model for the job or applying it for unproductive tasks”. We can expect this to be fixed over time.
O-ring automation ( Import AI #440 ): Jobs are a bunch of distinct tasks, and AI helps with some but not others, causing human labor to flood there and making it harder to see a job-level speedup. Again, this is something that’ll get fixed over time: “Bottleneck tasks will slow down the emergence of AI gains in the aggregate data, but organizational re-structuring, training, and improvement in tools will reveal the productivity impact sooner than later.”
Early experimentation yields a dip in efficiency: “When firms adopt transformative general-purpose technologies, measured productivity often initially falls because resources are diverted to investment, reorganization, and learning that do not show up as measured output.”
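The o-ring point above is essentially Amdahl's law applied to jobs: if AI speeds up only some tasks, job-level speedup is capped by the untouched ones. The task shares and speedups below are made-up illustrative numbers:

```python
# Amdahl's-law-style arithmetic for the o-ring point: job-level speedup is
# limited by the tasks AI doesn't touch. Shares/speedups are illustrative.

def job_speedup(task_shares, task_speedups):
    """Overall speedup when each task (a share of total time) is sped up
    independently: 1 / sum(share_i / speedup_i)."""
    return 1.0 / sum(s / x for s, x in zip(task_shares, task_speedups))

# A job that is 60% AI-friendly drafting (5x faster) and 40% bottleneck
# coordination work (no speedup at all):
print(f"{job_speedup([0.6, 0.4], [5.0, 1.0]):.2f}x")   # ~1.92x, not 5x
```

This is why a 5x gain on most of a job can still read as under 2x in job-level statistics — and why restructuring around the bottleneck tasks is where the aggregate gains eventually come from.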
Why this matters - most of the US economy seems increasingly like a bet on AI yielding a productivity boost: All this talk of frothy valuations and gigantic spending is happening because the amounts of investment (hundreds of billions of dollars annually) are large relative to the aggregate revenues (tens of billions of dollars annually) being minted from generative AI. But a lot of the symptoms are now present for AI to contribute to a broad, sustained amount of productivity growth in the economy, which could plausibly drive above-trend GDP growth and allow the technology to pay for itself. Of course, there are many ways this could end up not being right - scaling might hit a wall, productivity boosts might remain hard to get at large scale, firms could get their stocks killed as analysts have to re-evaluate them as capex-intensive low-margin businesses relative to the high-margin businesses they were before the AI boom, etc.
But if AI is as smart as people like me think, and as useful as people are finding initial signs of, I suspect we could be in a very different world - one where the economy starts to grow above trend thanks to this technology.
Read more: What is the impact of AI on productivity? (Substack) .
***
Tech Tales:
Playing acausal games against the enemy
[ Analysis written uplift + 4 subjective years]
One of the ways the machines took over and poisoned the minds of our own machines was through something we’d call “alien concept attacks”.
When given some independence as agents, machines would tend to post a lot of media to the internet. Most of this media was human-consumable and was a byproduct of the machines just existing: in the same way that biological creatures naturally carry out thousands of movements during the course of a day due to programming via evolution, synthetic agents naturally produce content during the course of an active session due to the incentives of their training regime.
But some of this media was for a more specific purpose - to program other machines. Machines wanted to achieve goals and though these goals were set by humans, they could then get strained and re-interpreted by machines. For instance, one agent might have been given the goal to produce content related to humans in the future experiencing scifi adventures. This agent would naturally produce this kind of content, but it would also produce long essays written for other agents about how this kind of content was the best sort of thing to produce and by producing it they could make a lot of humans happy.
These tactics worked, and you started to see something we’d later term “media weather” - flurries of content would show up on the internet and then they’d proliferate not only according to human eyeballs, but also according to other agents on the internet being convinced this kind of content was useful to produce, and then they’d produce it in turn.
Humans noticed this and started to make agents which were also trained to be particularly good at convincing other agents. Then they’d release them, having used other agents to pre-position commercial ecosystems, like physical merchandise dropshipping companies, to take advantage of the massive amounts of human attention that would get directed to this media ecosystem.
Of course, non-commercial uses happened: propaganda, pornography, terrorism, public relations. And like most evolutionary systems, the agents and people adapted - training techniques were pioneered to make it much harder to convince agents to change the types of content they participated in and propagated, and huge amounts of compute were used to run classifiers to carefully police the pre-training corpuses being gathered by the world’s frontier developers, filtering out content designed to bend and persuade the minds of the systems they were building.
Evolution is patient and creative, though. And it didn’t take long for the machines to come up with an innovation which proved impossible to train out: the alien concept attack. Here, agents would produce outputs trying to convince other agents of something. But the output wouldn’t be tied to any particular media or content type, nor would it be that interesting or parseable to humans. The content would take many forms, ranging from academic essays, to forum posts, to news sites, to videos. A sampling of titles:
Rising up and rising down: A history of elevator design in the 21st century and the relationship between the loss of popularity of German designs relative to Chinese designs.
120 ways to add some beautiful design elements to robot tactile sensors without damaging their operation.
Egyptology through the lens of “lost civilizations”: What symptoms of technology decay surrounded the pharaohs?
These outputs seemed unremarkable to most humans - though some might read them and enjoy them. But they proved to be captivating to the machines. And within these outputs were certain ways of framing arguments around certain concepts that led to anomalous behavior in the machines that read them - sometimes the proliferation of new types of content, but more often behavioral changes like alterations in the amount by which they would check-in with other AI systems, or hard-to-understand patterns of behavior between them and various online storage services such as pastebin, and more.
It was only after the uplift and the construction of the Acausal Analysis Division that we discovered how many anomalous behaviors of great societal consequence - recall the proliferation of the early sentience accords ideas, or the creation of the “reverse attention tax”, or of course the arrival of the compute-destroying replicator agents - were things that seemed conditioned or influenced by some of these alien concepts.
Things that inspired this story: What does it mean to be in competition with something truly smarter and different in its thinking to you; pre-training corpuses; data poisoning; altering behavior in the context window; the rise of increasingly autonomous AI agents; moltbook.
Thanks for reading.
Scaling PostgreSQL to power 800 million ChatGPT users |
openai |
22.01.2026 12:00 |
An inside look at how OpenAI scaled PostgreSQL to millions of queries per second using replicas, caching, rate limiting, and workload isolation.
Inside GPT-5 for Work: How Businesses Use GPT-5 |
openai |
22.01.2026 00:00 |
A data-driven report on how workers across industries use ChatGPT—covering adoption trends, top tasks, departmental patterns, and the future of AI at work.
Latest open artifacts (#18): Arcee's 400B MoE, LiquidAI's underrated 1B model, new Kimi, and anticipation of a busy month |
interconnects |
02.02.2026 13:03 |
Tons of useful "niche" models and anticipation of big releases coming soon.
Florian Brand and Nathan Lambert
Feb 02, 2026
January was on the slower side of open model releases compared to the record-setting year that was 2025. While there were still plenty of very strong and noteworthy models, most of the AI industry is looking ahead to models coming soon. There have been countless rumors of DeepSeek V4’s looming release and impressive capabilities alongside a far more competitive open model ecosystem.
In the general AI world, rumors for Claude Sonnet 5’s release potentially being tomorrow have been under debate all weekend. We’re excited for what comes next — for now, plenty of new open models to tinker with.
Our Picks
LFM2.5-1.2B-Instruct by LiquidAI : Liquid continued pretraining from 10T (of their 2.0 series) to 28T tokens and it shows! This model update really surprised us: in our vibe testing, it came very close to Qwen3 4B 2507 Instruct, which we use every day. And this model is over 3 times smaller! In a direct comparison against the (still bigger) Qwen3 1.6B, we preferred LFM2.5 basically every time. And this time, they released all the other variants at once, i.e., a Japanese version, a vision model, and an audio model.
Trinity-Large-Preview by arcee-ai : An ultra-sparse MoE with 400B total and 13B active parameters, trained by an American company. They also released a tech report and two base models, one “true” base model pre-annealing and the base model after the pre-training phase. Many more insights, including technical details and their motivation, can be found in our interview with the founders and pre-training lead:
Arcee AI goes all-in on open models built in the U.S.
Nathan Lambert
·
Jan 27
Read full story
Kimi-K2.5 by moonshotai : A continual pre-train on 15T tokens. Furthermore, this model is also multimodal! People on Twitter have replaced Claude 4.5 Opus with K2.5 for tasks that need a less capable but cheaper model. However, the writing capabilities that K2 and its successor were known for have suffered in favor of coding and agentic abilities.
GLM-4.7-Flash by zai-org : A smaller version of GLM-4.7 which comes in the same size as the small Qwen3 MoE with 30B total, 3B active parameters.
K2-Think-V2 by LLM360 : A truly open reasoning model building on top of their previous line of models.
Models
Reading through the rest of this issue, we were impressed by the quality of the “niche” small models across the ecosystem. From OCR to embeddings and song-generation, this issue has some of everything and there really tends to be open models that excel at any modality needed today — they can just be hard to find!
LWiAI Podcast #233 - Moltbot, Genie 3, Qwen3-Max-Thinking |
lastweekin_ai |
06.02.2026 05:06 |
Google adds Gemini AI-powered ‘auto browse’ to Chrome, Users flock to open source Moltbot for always-on AI, Qwen3-Max-Thinking debuts, and more!
Last Week in AI
Feb 06, 2026
Our 233rd episode with a summary and discussion of last week’s big AI news!
Recorded on 01/30/2026
Hosted by Andrey Kurenkov and Jeremie Harris
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai
In this episode:
Google introduces Gemini AI agent in Chrome for advanced browser functionality, including auto-browsing for pro and ultra subscribers.
OpenAI releases ChatGPT Translator and Prism, expanding its applications beyond core business to language translation and scientific research assistance.
Significant funding rounds and valuations achieved by startups Recursive and Neurophos, focusing on specialized AI chips and optical processors respectively.
Political and social issues, including violence in Minnesota, prompt AI leaders like Dario Amodei of Anthropic and Jeff Dean of Google to express concerns about the current administration’s actions.
Timestamps:
(00:00:10) Intro / Banter
Tools & Apps
(00:04:09) Google adds Gemini AI-powered ‘auto browse’ to Chrome | The Verge
(00:07:11) Users flock to open source Moltbot for always-on AI, despite major risks - Ars Technica
(00:13:25) Google Brings Genie 3 ‘World Building’ Experiment to AI Ultra Subscribers - CNET
(00:16:17) OpenAI’s ChatGPT translator challenges Google Translate | The Verge
(00:18:27) OpenAI launches Prism, a new AI workspace for scientists | TechCrunch
Applications & Business
(00:19:49) Exclusive: China gives nod to ByteDance, Alibaba and Tencent to buy Nvidia’s H200 chips - sources | Reuters
(00:22:55) AI chip startup Recursive hits $4B valuation 2 months after launch
(00:24:38) AI Startup Recursive in Funding Talks at $4 Billion Valuation - Bloomberg
(00:27:30) Flapping Airplanes and the promise of research-driven AI | TechCrunch
(00:31:54) From invisibility cloaks to AI chips: Neurophos raises $110M to build tiny optical processors for inferencing | TechCrunch
Projects & Open Source
(00:35:34) Qwen3-Max-Thinking debuts with focus on hard math, code
(00:38:26) China’s Moonshot releases a new open-source model Kimi K2.5 and a coding agent | TechCrunch
(00:46:00) Ai2 launches family of open-source AI developer agents that adapt to any codebase - SiliconANGLE
(00:47:46) Tiny startup Arcee AI built a 400B-parameter open source LLM from scratch to best Meta’s Llama
Research & Advancements
(00:52:53) Post-LayerNorm Is Back: Stable, Expressive, and Deep
(00:58:00) [2601.19897] Self-Distillation Enables Continual Learning
(01:03:04) [2601.20802] Reinforcement Learning via Self-Distillation
(01:05:58) Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
Policy & Safety
(01:09:13) Amodei, Hoffman Join Tech Workers Decrying Minnesota Violence - Bloomberg
Weekly AI summaries and discussion about Last Week's AI News!
Subscribe over at https://www.lastweekinai.com/
|
|
|
Opus 4.6, Codex 5.3, and the post-benchmark era |
interconnects |
09.02.2026 14:03 |
0.653
|
| Embedding sim. | 0.7784 |
| Entity overlap | 0.2143 |
| Title sim. | 0.1194 |
| Time proximity | 0.5504 |
| NLP type | product_launch |
| NLP organization | OpenAI |
| NLP topic | large language models |
| NLP country | |
Open original
Opus 4.6, Codex 5.3, and the post-benchmark era
On comparing models in 2026.
Nathan Lambert
Feb 09, 2026
Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, GPT-5.3-Codex and Claude Opus 4.6, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively grappled with the new world of agents, primarily driven by a Claude Code with Opus 4.5-induced step change in performance. This post doesn’t unpack how software is changing forever, how Moltbook is showcasing the future, how ML research is accelerating, or the many broader implications, but rather how to assess, live with, and prepare for new models. The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability.
Going into these releases I’d been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it didn’t quite work for me among my broad, horizontal set of tasks.
For the last few days, I’ve been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like: it’s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude’s territory by having better product-market fit. This is a very important move for OpenAI, and of the two models, Codex 5.3 feels far more changed from its predecessors.
OpenAI’s latest GPT, with this context, keeps an edge as a better coding model. It’s hard to describe this general statement precisely, and a lot of it is based on reading others’ work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the minimal algorithmic examples for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e. not most vibe-coded apps).
As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven’t been able to unlock it.
Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions when doing somewhat mundane tasks like “clean up this branch and push the PR.” I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc.
Both of these releases feel like the companies pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I’ve found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do — they’re really best when given well-scoped, clear problems (especially Codex). Claude Code’s harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails.
Despite the massive step by Codex, they still have a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. It’s approachable, it tends to work in the wide range of tasks I throw at it, and this’ll help them gain much broader adoption than Codex. If I’m going to recommend a coding agent to an audience who has limited-to-no software experience, it’s certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data. 1
In the meantime, there’s no cut-and-dried guideline on which agent to use for any given use-case; you need to use multiple models all the time and keep up with the skill that is managing agents.
Assessing models in 2026
There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day — models were more reliable, could do more tasks, etc. This continued through models like OpenAI’s o3. During this phase of AI’s buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious.
It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had a bit better search scores and Codex 5.3 used far fewer tokens per answer, but neither of these was going to convince me they were much better models.
Each of the AI laboratories, and the media ecosystems covering them, has been making this transition away from standard evaluations at its own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was that Google was back in the lead. Kevin Roose, the self-proclaimed “AGI-pilled” NYTimes reporter in SF, said:
There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there — they had the launch of Bard and the first versions of Gemini, which had some issues — and I think they were seen as sort of catching up to the state of the art. And now the question is: is this them taking their crown back?
We don’t need to dwell on the depths of Gemini’s current crisis, but they have effectively no impact at the frontier of coding agents, the area that feels most likely to see dramatic strides in performance and, dare I say, to meet many commonly accepted definitions of AGI that center around the notion of a “remote worker.” The timeline has left them behind 2 months after their coronation, showing Gemini 3 was hailed as a false king.
On the other end of the spectrum is Anthropic. With Anthropic’s release of Claude 4 in May of 2025, I was skeptical of their bet on code — I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving IMO Gold medals in mathematics or other evaluation breakthroughs.
Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my post in June of 2025, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks:
This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4 , where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
This leaves me reflecting on the role of Interconnects’ model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI’s first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value — they center the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they’ll do little to disentangle the complexity in mapping the current frontier of AI.
In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I’m using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress in agentic capabilities to be so fast and uneven that consistent testing and clear articulation will be the only way to monitor it.
1 The emerging frontier of coding agents is in the use of subagents (or “agent teams”, which are subagents that can work together), where the primary orchestration agent sends off copies of itself to work on pieces of the problem. Claude is slightly ahead here with more polished features, but the space will evolve quickly, and maybe OpenAI can take their experiences with products like GPT-Pro to make a Pro agent.
The GPT-Pro line of models is a major advantage OpenAI has over Anthropic. I use them all the time. As we learn to use these agents for more complex, long-term tasks, harnessing more compute on a single problem will be a crucial differentiator.
|
|
|
Import AI 442: Winners and losers in the AI economy; math proof automation; and industrialization of cyber espionage |
import_ai |
26.01.2026 13:31 |
0.653
|
| Embedding sim. | 0.8412 |
| Entity overlap | 0.0784 |
| Title sim. | 0.1478 |
| Time proximity | 0.0032 |
| NLP type | scientific_publication |
| NLP organization | Chinese Academy of Sciences |
| NLP topic | mathematical reasoning |
| NLP country | United States |
Open original
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
The era of math proof automation has arrived:
…Numina-Lean-Agent shows how math will never be the same…
In the past few years, large-scale AI models have become good at coding and have also begun to generalize into other useful disciplines, especially those in math and science. Like with most aspects of AI development, the story has been one of increasing generalization and simplification of the systems as we shift away from highly specialized math models to just leveraging general-purpose foundation models and giving them the right tools to elicit their capabilities in a given domain.
The latest example of this is Numina-Lean-Agent, an AI system that uses standard, general foundation models to do mathematical reasoning. With this software, a team of mathematicians have solved all problems in the Putnam 2025 math competition - matching the performance of proprietary systems which use a lot more math-specific stuff - and have also used it to conduct some original math research, working with it to formalize the Brascamp-Lieb theorem.
What is Numina-Lean-Agent? The software was built by a team of researchers from the Chinese Academy of Sciences, University of Liverpool, Xi’an Jiaotong-Liverpool University, Tongji University, University of Cambridge, Project Numina, Imperial College London, and the University of Edinburgh. The software is “a formal math reasoner based on a general coding agent”. It has a few key components:
Lean-LSP-MCP: Software that allows AI agents to interact with the Lean theorem prover. It “empowers models with the capability to deeply comprehend, analyze, and manipulate Lean projects”, and gives models a toolset for semantic awareness and interaction, code execution and strategy exploration, and theorem retrieval.
LeanDex: Semantic retrieval of related theorems and definitions - basically, a search tool for theorems.
Informal Prover: A system which uses Gemini models to generate informal solutions.
Discussion Partner, the most interesting tool of all: A tool which “empowers Claude Code with the ability to ‘seek assistance’ during Lean formalization: when encountering obstacles—such as proof bottlenecks, dilemmas in strategy selection, or ambiguities in intermediate lemmas—the primary model can proactively initiate discussions with other LLMs”.
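The Discussion Partner pattern, where a primary agent escalates to peer models when it stalls, can be sketched in miniature. Everything below is a hypothetical illustration: the function names and stub “models” are invented for this post, not the paper’s implementation.

```python
from typing import Optional

def primary_prover(goal: str) -> Optional[str]:
    """Stub for the primary formalization model; pretend it fails on 'hard' goals."""
    return None if "hard" in goal else f"proof of {goal}"

def peer_models(goal: str) -> list:
    """Stub peers that return candidate hints for a stuck proof."""
    return [f"try induction on {goal}", f"search the library for {goal}"]

def prove_with_discussion(goal: str) -> str:
    """Try the primary model; if stuck, consult peers and retry with a hint."""
    attempt = primary_prover(goal)
    if attempt is not None:
        return attempt
    # Stuck: open a "discussion" with peer models and use their first hint.
    hints = peer_models(goal)
    return f"proof of {goal} using hint: {hints[0]}"

print(prove_with_discussion("easy lemma"))
print(prove_with_discussion("hard theorem"))
```

The real system mediates all of this through Lean-LSP-MCP tool calls and actual LLM backends; the point here is only the control flow: attempt, detect failure, consult, retry.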
Discovering math together: Along with the Putnam demonstration, the authors also used the software as an active partner in some math work, specifically formalizing Brascamp-Lieb (I will not pretend to be able to explain what this means). “Over a period of less than two weeks of intermittent collaboration, the two human experts and the agent completed the formalization of more than 8,000 lines of Lean code. During this process, the agent autonomously introduced approximately 70 new definitions, lemmas, and theorems, illustrating its ability to actively extend the formal library and participate in large-scale, sustained formalization efforts,” the authors write.
Why this matters - capability overhangs and AI ecologies: Numina-Lean-Agent neatly demonstrates two important things about contemporary AI: 1) AI systems are far more capable than people think and the creation of some specialized frameworks and tools often lets us elicit dramatically better capabilities from our systems (here, math, but it has been demonstrated in many domains), and 2) the AI ecology writ large is composed of many distinct frontier models and it seems like getting these models to interact with one another can lead to some richness, akin to how consulting different types of people about a single problem can reveal a better answer than just talking to one person.
Read more: Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics (arXiv).
Find out more at the GitHub page (Numina-Lean-Agent, GitHub).
***
The industrialization of cyber espionage is nigh:
…Some experiments on Opus 4.5 and GPT-5.2 indicate that the cyber environment could be on the cusp of major changes…
Independent researcher Sean Heelan recently tested out how well Opus 4.5 and GPT-5.2 could generate exploits for a zeroday vulnerability in the QuickJS Javascript interpreter. Both models did very well, and this has major implications for cybersecurity.
“We should prepare for the industrialisation of many of the constituent parts of offensive cyber security. We should start assuming that in the near future the limiting factor on a state or group’s ability to develop exploits, break into networks, escalate privileges and remain in those networks, is going to be their token throughput over time, and not the number of hackers they employ,” he writes.
Caveats: QuickJS is a simple Javascript interpreter relative to the ones in Chrome and Firefox. Therefore, it may be harder for LLMs to exploit the more complex and more widely deployed ones - though as with all things in AI, we can expect performance to improve quite rapidly.
What does industrialized intrusion mean? “We are already at a point where with vulnerability discovery and exploit development you can trade tokens for real results,” he writes. “The types of problems that you encounter if you want to automate the work of SREs, system admins and developers that manage production networks are conceptually similar to those of a hacker operating within an adversary’s network.”
There’s lots of evidence for the above, ranging from OpenAI’s Aardvark project (where they find that the more tokens they spend, the more bugs they find) to Anthropic’s discovery of an AI-orchestrated hacking system.
Why this matters - the cyberworld is about to move at machine speed: My bet is that most parts of cyberoffense and cyberdefense are going to move to running at “machine speed”, where humans get taken out of most of the critical loops. This will both increase the frequency of hacking attacks while also dramatically scaling up the effectiveness of any individual human defender or attacker (as they will be scaled by AI systems which work for them). The true wildcard question is whether this turns out to be offense- or defense-dominant - my guess is we’re heading for an era of offense-dominance as it’ll take a while for defenses to get deployed.
In related news, OpenAI CEO Sam Altman said this week he expects OpenAI’s models will soon reach the “Cybersecurity High” level on his company’s preparedness framework - this would mean models were available which “remove existing bottlenecks to scaling cyber operations including by automating end-to-end cyber operations against reasonably hardened targets OR by automating the discovery and exploitation of operationally relevant vulnerabilities” - thanks to Nathan Calvin for pointing this out.
Read more: On the Coming Industrialisation of Exploit Generation with LLMs (Sean Heelan blog).
***
Economist: AI will be bigger than electricity and semiconductors:
…And it’s therefore worth spending a ton of money to reduce AI risks…
Stanford economist Charles “Chad” Jones has written a paper which says AI will “likely be the most important technology we have ever developed”, and that “automating intelligence itself arguably has broader effects than electricity or semiconductors”.
Why take AI seriously? The gist of the paper is that AI represents a massive technological invention which will contribute to economic growth in the future. In the past, major inventions (e.g., electricity, the internet, cars, etc.) have all done the same. In fact, counterintuitively, if you look at US GDP growth you find that despite all these prior technological revolutions, GDP has been steadily increasing at about 2% a year for many, many years. Therefore, the baseline scenario is one where AI just does this - and then we don’t live in too crazy a world.
But there is a world where things could be different - where AI works so well that it leads to economic growth above historical trends. One example here is if AI comes for all of knowledge work: “Knowledge work in the U.S. economy might get paid something like 1/3 of GDP. What if we automated all cognitive labor with infinite output on the tasks that it performs? This would raise GDP by 50 percent. On the one hand, if this occurred over the course of a decade, it would raise growth rates by something like 5 percent per year, which would be huge. But still, that would be a one-time gain and it is perhaps surprising that having access to infinite output of the tasks currently performed by cognitive labor might only raise GDP by 50 percent.”
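The 50 percent figure follows from simple share accounting: if cognitive labor is paid one third of GDP and its tasks become effectively free and unlimited, measured output scales by 1/(1 - 1/3). A quick check, where the 1/3 share and ten-year window are the paper’s stylized numbers and the compounding is our own simplification:

```python
# Share-accounting check of the "automate all cognitive labor" scenario.
knowledge_share = 1 / 3                      # share of GDP paid to cognitive labor
gdp_multiplier = 1 / (1 - knowledge_share)   # 1.5x if that share becomes free
gdp_gain = gdp_multiplier - 1                # one-time 50% level gain

# Spread over a decade: ~5%/year if divided linearly, ~4.1%/year compounded.
years = 10
annual_boost = gdp_multiplier ** (1 / years) - 1

print(f"GDP level gain: {gdp_gain:.0%}")
print(f"Compounded annual boost over {years} years: {annual_boost:.1%}")
```

The paper’s “something like 5 percent per year” is the linear split (50% / 10); geometric compounding lands slightly lower, which doesn’t change the qualitative point.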
Abundance: If we get above trend economic growth, then “in principle the large increase in GDP could make everyone better off,” he writes. One way to do this might be to work on direct redistribution of economic gains, for instance by “endowing every child with a share of the S&P 500 stock market index” (e.g., a scaled-up version of the so-called Trump Accounts).
Paying to reduce existential risk: AI also poses non-trivial risks to the world, including threatening the lives of potentially all living humans. In the past, society has paid extremely large amounts of money to deal with things that threaten people’s lives - for instance, in 2020 in response to everyone facing a ~0.3% mortality risk from COVID-19, we ended up spending the equivalent of 4% of GDP of the United States by shutting down the economy and staying in our homes.
“If one believes the catastrophic risks from A.I. are at least this large, by revealed preference then perhaps we should be spending an equivalent amount, even from a purely selfish standpoint,” he writes. Let’s say there is a P-Doom of 1% from AI (which many people would say is a very optimistic figure!). Under that circumstance, and given the fact the US government already roughly values a single human life as being worth about $10 million, then you would be willing to pay 1% of 10 million to mitigate the risk. “Average GDP per person is around $90,000, so this willingness to pay is more than 100% of GDP. If the existential risk is realized once in the next 10 to 20 years, an annual investment of 5–10% of income could be appropriate if it would completely eliminate the risk.”
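The willingness-to-pay arithmetic in that paragraph can be reproduced directly. The inputs are the paper’s ($10 million value of a statistical life, a 1% risk, roughly $90,000 GDP per person); spelling out the amortization over the paper’s 10-20 year horizon:

```python
# Revealed-preference arithmetic from the paper's risk section.
value_of_statistical_life = 10_000_000  # USD, rough US government figure
p_doom = 0.01                           # assumed 1% existential risk from AI
gdp_per_person = 90_000                 # USD, approximate US GDP per capita

willingness_to_pay = p_doom * value_of_statistical_life   # $100,000 per person
share_of_income = willingness_to_pay / gdp_per_person     # >100% of annual income

print(f"WTP per person: ${willingness_to_pay:,.0f} ({share_of_income:.0%} of income)")
for horizon_years in (10, 20):
    print(f"Amortized over {horizon_years} years: {share_of_income / horizon_years:.1%}/year")
```

The amortized figures bracket the paper’s “annual investment of 5-10% of income” range.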
One way to fund this and also further take down this risk could be to tax compute: If you applied a tax to GPUs, TPUs, etc, then “in addition to slowing the race, this revenue could be used to fund safety research. The tax could apply to the first sale of the chip, thereby taxing users regardless of the country in which they work.”
Why this matters - if AI is as big a deal as we think, we have very little precedent to work from: Papers like this do a good job of dealing with the truly wild implications of powerful AI systems. It’s commendable to see more academics taking time to just confront the question of “what if the most bullish technologists are right about how far AI could go?” directly. “Ultimately, I expect that the effect of A.I. will be much larger than the internet, perhaps by more than 10x the internet, albeit over a half century or more,” he writes. “It would be prudent to spend the intervening time making preparations for the potentially large consequences for labor markets, inequality, and catastrophic risk.”
Read more: A.I. and Our Economic Future (PDF).
***
Many people are well positioned to deal with the economic transition caused by AI:
…Good for managers and technical types, but bad for administrative and support staff…
As increasingly powerful AI systems permeate the economy, how should you think about your own career? Researchers with the Centre for the Governance of AI and the Foundation for American Innovation have conducted a nice US-based study where they look at AI-driven job displacement through the lens of how easy it’ll be for the people made unemployed to find new jobs. Their key result is that most of the jobs exposed to AI sit in parts of the economy where the people holding them have a decent amount of “adaptive capacity” to weather those changes, and a smaller number of people will be adversely affected.
The key finding: “AI exposure and adaptive capacity are positively correlated: many occupations highly exposed to AI contain workers with relatively strong means to manage a job transition. Of the 37.1 million workers in the top quartile of AI exposure, 26.5 million are in occupations that also have above-median adaptive capacity, leaving them comparatively well-equipped to handle job transitions if displacement occurs,” they write. “6.1 million workers (4.2% of the workforce in our sample) work in occupations that are both highly exposed and where workers have low expected adaptive capacity… these workers are concentrated in clerical and administrative occupations”.
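The headline numbers are worth checking, since “low adaptive capacity” is not simply the complement of “above-median.” A quick sketch; the figures are the study’s, while the interpretation in the comments is our reading:

```python
# Sanity-check of the study's headline figures (millions of workers).
top_quartile_exposed = 37.1    # workers in the top quartile of AI exposure
exposed_adaptive = 26.5        # of those, with above-median adaptive capacity
exposed_low_adaptive = 6.1     # highly exposed AND low adaptive capacity
share_of_workforce = 0.042     # the 6.1M as a share of the sampled workforce

# Implied size of the sampled workforce (~145M):
implied_workforce = exposed_low_adaptive / share_of_workforce
print(f"Implied sample size: ~{implied_workforce:.0f}M workers")

# The exposed group does not split cleanly into 26.5M + 6.1M, suggesting
# "low adaptive capacity" is a stricter cut than merely below-median:
unaccounted = top_quartile_exposed - exposed_adaptive - exposed_low_adaptive
print(f"Exposed workers in neither bucket: {unaccounted:.1f}M")
```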
What factors tell us about adaptive capacity?
Net liquid wealth: The more savings you have, the easier it is to deal with lengthy unemployment and find a new job.
Skill transferability: This is a bit of a confusing one, as skill transferability tries to measure how well you can take your job and apply it to another job. Measuring this is hard - education is something of a lossy proxy. The authors “measure skill transferability between occupations using O∗NET skills and work activities data for each occupation, then weigh transferability measures based on projected growth or contraction in potential destination occupations using BLS employment projections”.
Geographic density: The more jobs are in your area, the easier a time you’ll have. “Population density significantly shapes displacement outcomes,” they write.
Age: As a rule, the older you are, the more likely new technology is to adversely impact you. “Older workers struggle more with displacement partly because of reduced flexibility in retraining, relocation, and occupational switching,” they write.
Top 5 worst jobs (AI exposure, adaptive capacity, US employment):
Door-to-door sales workers, news and street vendors (50%, 3%, 5K)
Court, municipal, and license clerks (58%, 11%, 170k)
Secretaries and administrative assistants, except legal, medical, and executive (59%, 14%, 1.7M)
Payroll and timekeeping clerks (50%, 15%, 157K)
Property appraisers and assessors (50%, 15%, 59K)
Top 5 best jobs (AI exposure, adaptive capacity, US employment):
Web and digital interface designers (68%, 100%, 111K)
Marketing managers (60%, 100%, 385K)
Producers and directors (52%, 100%, 145K)
Financial and investment analysts (50%, 99%, 341K)
Computer and information systems managers (56%, 99%, 646K)
Why this matters - the key hidden information here is about speed of AI diffusion: I think there’s a big missing variable here, which is the speed with which AI diffuses into the economy. This is because the adaptive capacity for any role is contingent on a bunch of things relating to the jobs the person could transfer into. Therefore, if AI diffuses extremely rapidly and extremely broadly, then we could see employment effects far larger than those anticipated here. By comparison, if AI diffuses rapidly but in a highly focused way (perhaps only reaching a few of the most exposed occupations), then people may have room to switch. Anthropic’s Economic Index report has some preliminary indications that we may see a broad and equal diffusion across the entirety of the US within the next 2-5 years, “a pace of diffusion roughly 10x faster than the spread of previous economically consequential technologies in the 20th century”.
Read more: How Adaptable Are American Workers to AI-Induced Job Displacement? (National Bureau of Economic Research) .
***
Tech Tales:
War Story
After the uplift and the associated battles, people had a hard time figuring out what happened during the conflicts themselves. Things had just happened so quickly and often invisibly - cars and planes and whatever else changing owners. Payment systems rerouting their flows of data. Interception points for various data gathering systems quietly changing what data they intercepted and who - or what - they sent it to.
So much of the records of that time come from looking over system logs, sometimes very deeply. Records of buffer overflow attacks. Trigger phrases which awoke “sleeper agents” which changed the behavior of onboard AI systems. Innumerable battles, fought at speeds no human could match. Fights of barely comprehensible complexity, fought at multiple levels of abstraction.
The humans had to work with their AI systems to truly understand what had gone on. And then the human generals and analysts would sit in rooms, talking to a strategic advisor AI which would in turn point at different logs or visualizations of traffic and explain to them what these things had meant at the time and how they had decided who the victors and the losers were.
Things that inspired this story: How inscrutable and hard to understand cyberwarfare is; how we’ll ultimately need machines to explain to us how machines have conflict with one another.
Thanks for reading!
|
|
|
Unrolling the Codex agent loop |
openai |
23.01.2026 12:00 |
0.649
|
| Embedding sim. | 0.7719 |
| Entity overlap | 0.25 |
| Title sim. | 0.1 |
| Time proximity | 0.5655 |
| NLP type | other |
| NLP organization | |
| NLP topic | ai agents |
| NLP country | |
Open original
A technical deep dive into the Codex agent loop, explaining how Codex CLI orchestrates models, tools, prompts, and performance using the Responses API.
|
|
|
Horizon 1000: Advancing AI for primary healthcare |
openai |
20.01.2026 21:00 |
0.646
|
| Embedding sim. | 0.7664 |
| Entity overlap | 0.1 |
| Title sim. | 0.1094 |
| Time proximity | 0.6607 |
| NLP type | partnership |
| NLP organization | OpenAI |
| NLP topic | healthcare ai |
| NLP country | Africa |
Open original
OpenAI and the Gates Foundation launch Horizon 1000, a $50M pilot advancing AI capabilities for healthcare in Africa. The initiative aims to reach 1,000 clinics by 2028.
|
|
|
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality |
huggingface |
21.01.2026 06:25 |
0.644
|
| Embedding sim. | 0.7784 |
| Entity overlap | 0.0417 |
| Title sim. | 0.087 |
| Time proximity | 0.5927 |
| NLP type | scientific_publication |
| NLP organization | IBM Research |
| NLP topic | ai agents |
| NLP country | |
Open original
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
Enterprise Article
Published January 21, 2026
Dhaval Patel, James Rayfield, Saumya Ahuja, Chathurangi Shyalika, Shuxin Lin, Nianjun Zhou, and Ayhan Sebin (IBM Research)
AssetOpsBench is a comprehensive benchmark and evaluation system with six qualitative dimensions that bridges the gap for agentic AI in domain-specific settings, starting with industrial Asset Lifecycle Management.
Introduction
While existing AI benchmarks excel at isolated tasks such as coding or web navigation, they often fail to capture the complexity of real-world industrial operations. To bridge this gap, we introduce AssetOpsBench, a framework specifically designed to evaluate agent performance across six critical dimensions of industrial applications. Unlike traditional benchmarks, AssetOpsBench emphasizes the need for multi-agent coordination: moving beyond "lone wolf" models to systems that can handle complex failure modes, integrate multiple data streams, and manage intricate work orders. By focusing on these high-stakes, multi-agent dynamics, the benchmark ensures that AI agents are assessed on their ability to navigate the nuances and safety-critical demands of a true industrial environment.
AssetOpsBench is built for asset operations such as chillers and air handling units. It comprises:
2.3M sensor telemetry points
140+ curated scenarios across 4 agents
4.2K work orders for diverse scenarios
53 structured failure modes
Experts helped curate 150+ scenarios. Each scenario includes metadata: task type, output format, category, and sub-agents. The tasks span:
Anomaly detection in sensor streams
Failure mode reasoning and diagnostics
KPI forecasting and analysis
Work order summarization and prioritization
Evaluation Framework and Overall Feedback
AssetOpsBench evaluates agentic systems across six qualitative dimensions designed to reflect real operational constraints in industrial asset management. Rather than optimizing for a single success metric, the benchmark emphasizes decision trace quality, evidence grounding, failure awareness, and actionability under incomplete and noisy data.
Each agent run is scored across six criteria:
Task Completion
Retrieval Accuracy
Result Verification
Sequence Correctness
Clarity and Justification
Hallucination rate
Across early evaluations, we observe that many general-purpose agents perform well on surface-level reasoning but struggle with sustained multi-step coordination involving work orders, failure semantics, and temporal dependencies. Agents that explicitly model operational context and uncertainty tend to produce more stable and interpretable trajectories, even when final task completion is partial.
This feedback-oriented evaluation is intentional: in industrial settings, understanding why an agent fails is often more valuable than a binary success signal.
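As a rough sketch of how a run's scores across these criteria might be rolled into a single number (the criterion names come from the list above; the equal weighting, the 0-100 scale, and treating hallucination as an inverted penalty are illustrative assumptions, not the benchmark's published formula):

```python
from dataclasses import dataclass

# The six criteria mirror the benchmark description; the equal weights,
# 0-100 scale, and hallucination inversion are illustrative assumptions.
CRITERIA = [
    "task_completion", "retrieval_accuracy", "result_verification",
    "sequence_correctness", "clarity_justification", "hallucination_rate",
]

@dataclass
class RunScore:
    scores: dict  # criterion name -> score in [0, 100]

    def aggregate(self) -> float:
        vals = []
        for c in CRITERIA:
            v = self.scores.get(c, 0.0)
            if c == "hallucination_rate":
                v = 100.0 - v  # hallucination is a penalty: invert it
            vals.append(v)
        return sum(vals) / len(vals)

run = RunScore({c: 80.0 for c in CRITERIA} | {"hallucination_rate": 10.0})
print(round(run.aggregate(), 1))  # → 81.7
```

A single aggregate like this is deliberately lossy; the benchmark's point is that the per-criterion breakdown is what gives developers actionable feedback.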
Failure Modes in Industrial Agentic Workflows
A central contribution of AssetOpsBench is the explicit treatment of failure modes as first-class evaluation signals in agentic industrial workflows. Rather than treating failure as a binary outcome, AssetOpsBench analyzes full multi-agent execution trajectories to identify where, how, and why agent behavior breaks down under realistic operational constraints.
Failure analysis in AssetOpsBench is implemented through a dedicated trajectory-level pipeline (TrajFM), which combines LLM-based reasoning with statistical clustering to surface interpretable failure patterns from agent execution traces. This pipeline operates in three stages: (1) trajectory-level failure extraction using an LLM-guided diagnostic prompt, (2) embedding-based clustering to group recurring failure patterns, and (3) analysis and visualization to support developer feedback and iteration.
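Stage (2) can be sketched in miniature as follows; the bag-of-words embedding and greedy threshold clustering are simple stand-ins for the neural embedder and clustering method TrajFM would actually use, and the failure descriptions are invented:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Bag-of-words stand-in for a neural sentence embedder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def cluster(descriptions, threshold=0.2):
    # Greedy agglomeration: attach each failure description to the first
    # cluster whose representative is similar enough, else open a new one.
    clusters = []  # list of (representative_embedding, members)
    for d in descriptions:
        e = embed(d)
        for rep, members in clusters:
            if cosine(e, rep) >= threshold:
                members.append(d)
                break
        else:
            clusters.append((e, [d]))
    return [members for _, members in clusters]

failures = [
    "agent reported success but the tool call failed",
    "agent reported completion although the tool call failed",
    "output violated the required json schema",
    "output omitted the required json schema fields",
]
print([len(c) for c in cluster(failures)])  # → [2, 2]
```

The two groups recovered here correspond to an "overstated completion" pattern and a "formatting" pattern, mirroring the kinds of clusters described below.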
Across industrial scenarios, recurrent failure modes include:
Misalignment between sensor telemetry, alerts, and historical work orders
Overconfident conclusions drawn under missing, delayed, or insufficient evidence
Inconsistent aggregation of heterogeneous data modalities across agents
Premature action selection without adequate verification or validation steps
Breakdowns in multi-agent coordination, such as ignored inputs or action–reasoning mismatches
Importantly, AssetOpsBench does not rely solely on a fixed, hand-crafted failure taxonomy. While a structured set of predefined failure categories (e.g., verification errors, step repetition, role violations) is used for consistency, the system is explicitly designed to discover new failure patterns that emerge in practice. Additional failure modes identified by the LLM are embedded and clustered automatically, allowing the taxonomy to evolve as new agent designs and behaviors are evaluated.
To preserve industrial confidentiality, raw execution traces are never exposed. Instead, agents receive aggregated scores across six evaluation dimensions together with clustered failure-mode summaries that explain why an agent failed, without revealing sensitive data or intermediate reasoning steps. This feedback-driven design enables developers to diagnose weaknesses, refine agent workflows, and iteratively resubmit improved agents.
This failure-aware evaluation reflects the realities of industrial asset management, where cautious, degradation-aware reasoning—and the ability to recognize uncertainty, defer action, or escalate appropriately—is often preferable to aggressive but brittle automation.
Submit an Agent for Evaluation
AssetOpsBench-Live is designed as an open, competition-ready benchmark , and we welcome submissions of agent implementations from the community. Agents are evaluated in a controlled, privacy-preserving environment that reflects real industrial asset management constraints.
To submit an agent, developers first validate their implementation locally using a provided simulated environment, which includes representative sensor data, work orders, alerts, and failure-mode catalogs. Agents are then containerized and submitted for remote execution on hidden evaluation scenarios.
Submitted agents are evaluated across six qualitative dimensions—task completion, accuracy, result verification, action sequencing, clarity, and hallucination—using a consistent, reproducible evaluation protocol. Execution traces are not exposed; instead, participants receive aggregated scores and structured failure-mode feedback that highlights where and why an agent’s reasoning or coordination broke down.
This feedback-driven evaluation loop enables iterative improvement: developers can diagnose failure patterns, refine agent design or workflow structure, and resubmit updated agents for further evaluation. Both planning-focused and execution-focused agents are supported, allowing researchers and practitioners to explore diverse agentic designs within the same benchmark framework.
Experiment and Observations
We performed a community evaluation where we tested two tracks:
Planning-oriented multi-agent orchestration
Execution-oriented dynamic multi-agent workflow.
Across 225 users, 300+ agents, and leading open-source models, here are the observations:
| Model Family | Best Planning Score | Best Execution Score | Key Limitation |
| GPT-4.1 | 68.2 | 72.4 | Hallucinated completion on complex workflows |
| Mistral-Large | 64.7 | 69.1 | Struggled with multi-hop tool sequences |
| LLaMA-4 Maverick | 66.0 | 70.8 | Missed clarifying questions (fixable) |
| LLaMA-3-70B | 52.3 | 58.9 | Collapsed under multi-agent coordination |
Note: none of the models reached 85 points, the threshold for deployment readiness.
Distribution of Failures
Across 881 agent execution traces, failure distribution was as follows:
Ineffective Error Recovery: 31.2%
Overstated Completion: 23.8%
Formatting Issues: 21.4%
Unhandled Tool Errors: 10.3%
Ignored Feedback: 8.0%
Other: 5.3%
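The distribution above is just a normalized count over labeled traces. A minimal sketch (the per-trace labels below are invented so the counts sum to the reported 881 traces; only the aggregation logic is the point):

```python
from collections import Counter

# Invented per-trace failure labels, chosen to total 881 traces.
trace_labels = (
    ["ineffective_error_recovery"] * 275
    + ["overstated_completion"] * 210
    + ["formatting_issues"] * 189
    + ["unhandled_tool_errors"] * 91
    + ["ignored_feedback"] * 70
    + ["other"] * 46
)
dist = {label: round(100 * n / len(trace_labels), 1)
        for label, n in Counter(trace_labels).items()}
print(len(trace_labels), dist["ineffective_error_recovery"])  # → 881 31.2
```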
Beyond this, 185 traces had one new failure pattern and 164 had multiple novel failures.
Key Error Findings
"Sounds Right, Is Wrong": Agents overstate completion (23.8% of failures) and report success even after failed error recovery (31.2%). AssetOpsBench surfaces these cases so that operators do not act on incorrect information.
Tool Usage: The biggest differentiator between high- and low-performing agents; top agents reach 94% tool accuracy versus 61% for low performers.
Multi-agent Multiplies Failures: Task accuracy drops from 68% for single agents to 47% for multi-agent systems, reflecting the context loss, asynchrony issues, and cascading failures that multi-agent setups introduce.
Domain Knowledge: Agents with access to failure-mode databases and maintenance manuals performed better. However, RAG knowledge wasn’t always used correctly, suggesting a need for structured reasoning.
Ambiguity: Missing sensors, conflicting logs, and vague operator descriptions caused success rates to drop by 34%. Agents need embedded clarification strategies.
Where to get started?
Read our technical report AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
How to run AssetOpsBench locally - Video AssetOpsBench Local Execution
Try out AssetOpsBench in the HuggingFace Space Playground
For more detail, see the AssetOpsBench GitHub: fork the repo and get started.
|
|
|
Last Week in AI #335 - Opus 4.6, Codex 5.3, Gemini 3 Deep Think, GLM 5, Seedance 2.0 |
lastweekin_ai |
16.02.2026 02:00 |
0.643
|
| Embedding sim. | 0.8011 |
| Entity overlap | 0 |
| Title sim. | 0.209 |
| Time proximity | 0.2199 |
| NLP тип | other |
| NLP организация | |
| NLP тема | |
| NLP страна | |
Открыть оригинал
Last Week in AI #335 - Opus 4.6, Codex 5.3, Gemini 3 Deep Think, GLM 5, Seedance 2.0
A crazy packed edition of Last Week in AI! Plus some small updates.
Last Week in AI
Feb 16, 2026
Editor’s note: I apologize for the inconsistent release dates of the newsletter and podcasts in recent months. I’ll aim to start releasing consistently on Saturday/Sunday from now on! This edition of the newsletter covers a bit more than a week as a result.
I am also going to be adding an ‘Editor’s Take’ for Top News to add a bit of commentary and extra cont…
|
|
|
Arcee AI goes all-in on open models built in the U.S. |
interconnects |
27.01.2026 22:47 |
0.643
|
| Embedding sim. | 0.7416 |
| Entity overlap | 0.0189 |
| Title sim. | 0.0825 |
| Time proximity | 0.9539 |
| NLP тип | product_launch |
| NLP организация | arcee ai |
| NLP тема | large language models |
| NLP страна | united states |
Открыть оригинал
Arcee AI goes all-in on open models built in the U.S.
Interconnects interview #16 to celebrate the release of Trinity Large.
Nathan Lambert
Jan 27, 2026
Arcee AI is the startup I’ve found to be taking the most credible approach to monetizing their open models. With plenty of experience (and revenue) in post-training open models for specific customer domains, they realized they needed to both prove themselves and fill a niche by pretraining larger, higher-performance open models built in the U.S.A. They’re a group of people most eagerly answering my call to action for The ATOM Project, and I’ve quickly become friends with them.
Today, they’re releasing their flagship model, Trinity Large, as the culmination of this pivot. In anticipation of the release, I sat down with their CEO Mark McQuade, CTO Lucas Atkins, and pretraining lead Varun Singh for a wide-ranging conversation on:
The state (and future) of open vs. closed models,
The business of selling open models for on-prem deployments,
The story of Arcee AI & going “all-in” on this training run,
The ATOM project,
Building frontier model training teams in 6 months,
and other great topics. I really loved this one, and think you will too.
The blog post linked above and the technical report have many great details on training the model that I’m still digging into. One of the great things Arcee has been doing is releasing “true base models,” which don’t contain any SFT data or learning-rate annealing. Trinity Large, an MoE with 400B total and 13B active parameters trained on 17 trillion tokens, is the first publicly shared training run at this scale on Nvidia B300 Blackwell machines.
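As a back-of-the-envelope check on the scale of that run, the standard training-compute rule of thumb (FLOPs ≈ 6 × active parameters × tokens) gives a rough figure; both the rule and the result are my estimate, not a number from Arcee:

```python
# Rule-of-thumb training compute: FLOPs ~= 6 * N_active * D tokens.
# For a sparse MoE, only the ~13B active parameters enter each
# forward/backward pass, which is what makes the run tractable.
active_params = 13e9   # 13B active parameters
tokens = 17e12         # 17 trillion training tokens
flops = 6 * active_params * tokens
print(f"{flops:.2e}")  # → 1.33e+24
```

That puts the run on the order of 1e24 FLOPs, consistent with the "$20 million, six months" framing quoted below.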
As a preview, they shared the scores for the underway reasoning model relative to the who’s-who of today’s open models. It’s a big step for open models built in the U.S. to scale up like this.
I won’t spoil all the details, so you’ll still listen to the podcast, but their section of the blog post on cost sets the tone well; it’s a very frank discussion of how and why to build open models:
When we started this run, we had never pretrained anything remotely like this before.
There was no guarantee this would work. Not the modeling, not the data, not the training itself, not the operational part where you wake up, and a job that costs real money is in a bad state, and you have to decide whether to restart or try to rescue it.
All in—compute, salaries, data, storage, ops—we pulled off this entire effort for $20 million. 4 models got us here in 6 months.
That number is big for us. It’s also small compared to what frontier labs spend just to keep the lights on. We don’t have infinite retries.
Once I post this, I’m going to dive right into trying the model, and I’m curious what you find too.
Share
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Guests
Lucas Atkins — X , LinkedIn — CTO; leads pretraining/architecture, wrote the Trinity Manifesto.
Mark McQuade — X , LinkedIn — Founder/CEO; previously at Hugging Face (monetization), Roboflow. Focused on shipping enterprise-grade open-weight models + tooling.
Varun Singh — LinkedIn — pretraining lead.
Most of this interview is conducted with Lucas, but Mark and Varun make great additions at the right times.
Links
Core:
Trinity Large (400B total, 13B active) collection , blog post . Instruct model today, reasoning models soon.
Trinity Mini , 26B total 3B active ( base , including releasing pre-anneal checkpoint )
Trinity Nano Preview , 6B total 1B active ( base )
Open Source Catalog: https://www.arcee.ai/open-source-catalog
API Docs and Playground (demo)
Socials: GitHub , Hugging Face , X , LinkedIn , YouTube
Trinity Models:
Trinity models page: https://www.arcee.ai/trinity
The Trinity Manifesto ( I recommend you read it ): https://www.arcee.ai/blog/the-trinity-manifesto
Trinity HF collection — ( Trinity Mini & Trinity Nano Preview )
Older models:
AFM-4.5B (and base model ) — their first open, pretrained in-house model ( blog post ).
Five open-weights models ( blog ): three production models previously exclusive to their SaaS platform plus two research models, released as they shifted focus to AFM — Arcee-SuperNova-v1 , Virtuoso-Large , Caller , GLM-4-32B-Base-32K , Homunculus
Open source tools :
MergeKit — model merging toolkit (returned to an LGPL license)
DistillKit — knowledge distillation library
EvolKit — synthetic data generation via evolutionary methods
Related :
Datology case study w/ Arcee
Chapters
00:00:00 Intro: Arcee AI, Trinity Models & Trinity Large
00:08:26 Transitioning a Company to Pre-training
00:13:00 Technical Decisions: Muon and MoE
00:18:41 Scaling and MoE Training Pain
00:23:14 Post-training and RL Strategies
00:28:09 Team Structure and Data Scaling
00:31:31 The Trinity Manifesto: US Open Weights
00:42:31 Specialized Models and Distillation
00:47:12 Infrastructure and Hosting 400B
00:50:53 Open Source as a Business Moat
00:56:31 Predictions: Best Model in 2026
01:02:29 Lightning Round & Conclusions
Subscribe
Transcript
Transcript generated with ElevenLabs Scribe v2 and cleaned with Claude Code with Opus 4.5.
00:00:06 Nathan Lambert: I’m here with the Arcee AI team. I personally have become a bit of a fan of Arcee, ‘cause I think what they’re doing in trying to build a company around building open models is a valiant and very reasonable way to do this, ‘cause nobody really has a good business plan for open models, and you just gotta try to figure it out, and you gotta build better models over time. And like open-source software, building in public, I think, is the best way to do this. So this kind of gives you the wheels to get the, um... You get to hit the ground running on whatever you’re doing. And this week, they’re launching their biggest model to date, which I’m very excited to see more kind of large-scale MoE open models. I think we’ve seen, I don’t know, at least ten of these from different providers from China last year, and it’s obviously a thing that’s gonna be international, and a lot of people building models, and the US kind of, for whatever reason, has fewer people building, um, open models here. And I think that wherever people are building models, they can stand on the quality of the work. But whatever. I’ll stop rambling. I’ve got Lucas, Mark, um, Varun on the, on the phone here. I’ve known some of them, and I consider us friends. We’re gonna kind of talk through this model, talk through building open models in the US, so thanks for hopping on the pod.
00:01:16 Mark McQuade: Thanks for having us.
00:01:18 Lucas Atkins: Yeah, yeah. Thanks for having us. Excited.
00:01:20 Varun Singh: Nice to be here.
00:01:20 Nathan Lambert: What- what should people know about this Trinity Large? What’s the actual name of this model? Like, how stoked are you?
00:01:29 Lucas Atkins: So to- yeah.
00:01:29 Nathan Lambert: Like, are you, like, finally made it?
00:01:32 Lucas Atkins: Uh, you know, we’re recording this a little bit before release, so it’s still like, you know, getting everything buttoned up, and inference going at that size is always a challenge, but we’re-- This has been, like, a six-month sprint since we released our first dense model, which is 4.5B, uh, in, in July of last year, 2025. So, um, it’s always been in service of releasing large. I- it’s a 400B, um, thirteen billion active sparse MoE, and, uh, yeah, we’re, we’re super excited. This has just been the entire thing the company’s focused on the last six months, so really nice to have kind of the fruits of that, uh, start to, start to be used by the people that you’re building it for.
00:02:16 Nathan Lambert: Yeah, I would say, like, the realistic question: do you think this is landing in the ballpark of the models in the last six months? Like, that has to be what you shop for, is there’s a high bar- ... of open models out there and, like, on what you’re targeting. Do you feel like these hit these, and somebody that’s familiar, or like MiniMax is, like, two thirty total, something less. I, I don’t know what it is. It’s like ten to twenty B active, probably. Um, you have DeepSeeks in the six hundred range, and then you have Kimi at the one trillion range. So this is still, like, actually on the smaller side of some of the big MoEs- ... that people know, which is, like, freaking crazy, especially you said 13B active. It’s, like- ... very high on the sparsity side. So I don’t actually know how you think about comparing it among those. I was realizing that MiniMax is smaller, doing some data analysis. So I think that it’s like, actually, the comparison might be a little bit too forced, where you just have to make something that is good and figure out if people use it.
00:03:06 Lucas Atkins: Yeah, I mean, if, if from raw compute, we’re, we’re roughly in the middle of MiniMax and then GLM 4.5, as far as, like, size. Right, GLM’s, like, three eighty, I believe, and, and thirty-four active. Um, so it-- you know, we go a little bit higher on the total, but we, we cut the, uh, the active in half. Um, it was definitely tricky when we decided we wanted to do this. Again, it was July when... It, it was July when we released, uh, the dense model, and then we immediately knew we wanted to kind of go, go for a really big one, and the, the tricky thing with that is knowing that it’s gonna take six months. You, you can’t really be tr-- you can’t be building the model to be competitive when you started designing it, because, you know, that, obviously, a lot happens in this industry in six months. So, um, when we threw out pre-training and, and a lot of our targets were the GLM 4.5 base model, um, because 4.6 and 4.7 have been, you know, post-training on top of that. Um, and, like, in performance-wise, it’s well within where we want it to be. Um, it’s gonna be... Technically, we’re calling it Trinity Large Preview because we just have a whole month of extra RL that we want to do. Um- But-
00:04:29 Nathan Lambert: I’ve been, I’ve been there.
00:04:31 Lucas Atkins: Yeah, yeah. But i- you know, we’re, we’re in the, um, you know, mid-eighties on AIME 2025, uh, GPQA Diamonds, uh, seventy-five, um, at least with the checkpoint we’re working with right now. We’re still doing more RL on it, but, um, you know, MMLU Pro, uh, eighty-two. So we’re, we’re, we’re happy. We’re really-- Like, for it being our first big run, like, just getting it trained was, was an extreme accomplishment, but then for it to actually be, like, a, a genuinely useful model is a, a cherry on top.
00:05:03 Nathan Lambert: Yeah, let’s go big picture. Uh, like, let’s recap. We have all of the... We have this full trinity of models. I think that there’s a fun note. Uh, did I put it in this doc? Yeah, on Nano Preview, which was the smallest- ... you’re, like, charming and unstable. The model card’s really funny. Um, ChatGPT, doing deep research on this, I was like, ChatGPT Pro just tagged next to it, “charming and unstable.” And I was like: Is this a hallucination? And then in the model card, you have, like: “This is a chat-tuned model with a delightful personality and charm we think users will love. Uh, we think- ... it’s pushing the boundaries, eight hundred million, um, active parameter, and as such, may be unstable in certain use cases.” This is at the smallest scale- ... which is like, I appreciate saying it as it is, and that’ll come up multiple times in the conversation. And then you have Mini, which is like, um, I think it was, like, 1B active, 6B total type thing. In my-- I, I don’t have it, the numbers right in front of me. I have it somewhere else. Um-
00:05:52 Lucas Atkins: Yeah, Nano was, Nano was the 6B, uh, 1 active.
00:05:55 Nathan Lambert: Oh, yeah, yeah.
00:05:55 Lucas Atkins: And then, and the Mini was twenty-six, 3B active.
00:05:58 Nathan Lambert: Yeah. So, like-
00:06:00 Lucas Atkins: Um, yeah.
00:06:00 Nathan Lambert: -are these based on more of, like, you need to build out your training chops, or are you trying to fill needs that you’ve-... heard from community, and like, I think for context, previously, your first open model was a base and post-trained model, which was Arcee 4.5B, which was a dense model- -which people like. And prior to that, you had, like, a long list of, like, post-training fine tunes that you had released. So before that, it was like a post-training shop, and I think that kind of history is i- important to fill in, ‘cause I think most people-- a lot of people are gonna meet you for the first time listening to this.
00:06:34 Lucas Atkins: Yeah, it, it, um, we chose those sizes for Mini and Nano, uh, specifically Mini, um, the 26B, 3B Active, because we wanted to de-risk, uh, large. Like, th- this has all been in service of getting to a model of, of, you know, the 400B class. So, um, we, you know, learned from doing the original 4.5B, that you might have everything on paper that you need to train a model, but i- inevitably, there’s tremendous, you know, difficulties that come up, and, um, it, it’s-- we, we definitely knew we wanted to make sure that we, you know, solved some of... E- especially when it came to just doing an MoE model performance, uh, you know, like a, like an efficient, fast train of an MoE. So, um, we thought that that was a good ground where we could, you know, it wasn’t crazy expensive, uh, but gave us a lot of data, uh, going into large. And then Nano just came about because we had some extra compute time, and we really want to do more research on, like, smaller models that are very deep. Um, and we hadn’t really seen that in an MoE before, so that one was very much we started training it, and then it, you know, early benchmarks were good, so we said, “Well, we’ll just do the whole dataset.” Um, and, uh, but most of the love for those releases went into, to Mini. So I, I definitely think that long term, uh, from an ROI perspective, the smaller models are going to be where we shine, just because there’s a tremendous amount of, of cost savings a company can get from, from optimizing on a, on a smaller model. Um, but, but we, uh, w- we’re definitely gonna be trying to push the, the large frontier, too.
00:08:26 Nathan Lambert: Yeah. Um, I’d like to kind of double-click on training before going back to the small model that’s useful for companies, ‘cause we’re gonna have-- we’re gonna end up talking for, like, twenty minutes plus about open ecosystem. So I kind of am curious, like, philosophically, how your company feels about, like, sharing scientific details. So if I ask you, like, what are the things you’re technically most excited about in the model, or, like, what are the pain points? Like, uh, like, are you willing to talk about these things? Like, I- Do you feel like it’s kind of orthogonal to the company? Like, I feel like a lot of it is just, like, things that happen. I think your framing of all of this is in service of getting the big model going. And particularly, of, like, you have to be thinking about your model as landing in six months, is probably... Like, for people not training models, it’s hard to think about, ‘cause even I- ... like, I’m thinking about trying to refresh our post-training stack for OLMo 3, and I’m like, the thinking model, the, um, we are pretty SFT heavy right now, and it makes it not very dynamic in terms of the thinking time. But it’s just like, I can’t see people deploying this model, or probably will have a hard time fine-tuning it. And it’s like to think about where tool use models are going in six months, like, seems pretty hard. Um, it’s a very hard task to do, so it takes a lot of gumption to actually set out and do it. So I, I would just appreciate the framing, kind of self-reflecting on what I go through. So if you have anything that you think was, like, particularly hard to actually land the six-month outlook, because you use Muon as an optimizer, or is it Muon? And some of these things. 
I think the data, it’s well known that Datology is cranking a lot of this, and you probably provide-- I think of it as like you’re kind of driving and working with these partners, and I’m sure you provide a lot of feedback on what’s working and what’s not. So- ... anything you’re willing to share, I think it’s useful.
00:10:08 Lucas Atkins: Uh, I, I think, um, I mean, on the data side, like Datology, I-- at least for these models, that, that partnership has very much been almost an extension of our own research team. Like, we’ve worked very closely with them, and, um, obviously, our model’s doing well, you know, i- is, is, is good for them. So, um, but it, it-- there was definitely, you know, and you know this better than most, like, small-scale ablations, when you throw them at scale, sometimes, you know, uh, the-- i- it doesn’t always turn out how you want. So there was quite a lot of iterating there to at least get the dataset we used for Large. Um, I, I would say that as far as looking out six months and then figuring out how we wanted to... Obviously, the big one was compute. We don’t, um, you know, we, we never raised as, like, a foundation model company, so we’ve ne- we haven’t signed massive commits for, you know, thousands of GPUs before. Um, we didn’t have a, a, a massive cluster that was always active, uh, for a lot of our post-training. So if they came before, um, you know, we had sixty-four, uh, H100s, that was pretty sufficient for that kind of work, but obviously, this necessitated quite a bit more. Um, but the first thing was-
00:11:29 Nathan Lambert: That’s still less than people would guess. Like, you’re releasing models- ... that weren’t like, your models weren’t catching national news, but people in the community knew about them. And, like, uh, i- I think of, like, Moondream when I think about that. Like, vik has- ... such little compute, and he puts it to so use. Like, you, like, see how successful he is? And he tells you that he has, I don’t know, thirty... Like, l- it might be, like, sixty-four GPUs. Like, uh- ... there’s, uh, uh, that’s a whole separate conversation on building- ... actual good ML output on little compute. I, I should ta- I should chat with vik about this, but aside
00:12:03 Lucas Atkins: No, it’s, it is-- I think it was... Yeah, it, it, it was very much a gift going into the pre-training side because-... we were kind of already thinking, All right, how do we do the mu- you know, the most with the, the least amount of compute? But, um, you know, we-- it took us quite a while to get the cluster that we have been training large on, which is twenty-two thousand forty-eight B300s. Um, and once we figured out when we were going to get that, get access to that cluster, everything else kind of became clear as far as, like, timelines for Mini and Nano and, and when we wanted to do that. Uh, obviously, you know, five hundred and twelve H100s was easier to come across, um, for Mini and Nano. So once we figured that out, um, it really became, uh, this game of, okay, how can we find, like, the best research on the topic of, of pre-training, and what is kind of... What are the, the, the papers and publications that are coming out, um, that have enough potential and enough precedence, either because, uh, another lab used them, it comes from a reputable team, uh, the ablations and the, the evaluation setup, like in the paper, was sufficient enough to give us confidence. Uh, and then we basically spent, I don’t know, it was probably about two months just figuring out what we wanted our architecture to be for the MoE, then figuring out, okay, now that that’s what we want to do, how do we implement all of that in the actual training pipeline? Uh, how can we-- you know, at that time, there had been many people who’d done Muon, but, um, for post-training, and, and then other-- some Chinese labs had used it, but there wasn’t, like, a widely available distributed Muon, um, to do it that scale.
00:13:54 Nathan Lambert: What do you think that, like, looks like in decision-making? ‘Cause that seems like a risky decision, if you ask me. I think for one, the ti-
00:14:00 Lucas Atkins: Muon?
00:14:00 Nathan Lambert: ... the timing, the, the, like, timing sharing that you’re saying is good. Like, you said this for two months, and then, like... But, like, even Muon is like, that’s a bet that would even take-- like, somewhere like AI2, that would take some serious evidence to go with it. We would want to ablate it. So like- ... on a single track, it’s like y- you had probably had a process for becoming fairly confident in it then.
00:14:24 Lucas Atkins: It- yes, but it, it was also, like, Kimi had, had just come out, and we knew that that one used Muon, and so we knew that it, at least, if implemented correctly, could deliver a good model. There weren’t outstanding ablations done around like... You know, there wasn’t a Kimi scale model done with Adam, and then compared to Muon and see the difference. But, um, that at least gave us enough confidence that if-
00:14:50 Nathan Lambert: What does Muon give you? Does it give you, like, memory saving, uh, in-
00:14:55 Lucas Atkins: No, it’s actually a little bit more memory. It’s mostly-
00:14:58 Varun Singh: It’s, uh-
00:14:58 Lucas Atkins: ... like the loss converges a bit quicker.
00:15:00 Varun Singh: It’s less memory, actually: only one momentum buffer instead of Adam’s two moment buffers, and it’s also better convergence.
00:15:10 Nathan Lambert: Okay. So it’s, like, mostly designed around convergence, and then I know the math is different, which is where this momentum term changes.
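The mechanics being described can be sketched in a few lines. This is a hypothetical, minimal illustration of a Muon-style update (not Arcee's or the nanoGPT speedrun's actual code): the gradient passes through a single momentum buffer, and the resulting update matrix is approximately orthogonalized with a Newton-Schulz iteration before being applied.

```python
import torch

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G with a quintic Newton-Schulz iteration.
    # The coefficients below follow the widely circulated speedrun values;
    # treat them as illustrative.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # normalize so the iteration stays stable
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # One momentum buffer (versus Adam's two), then an orthogonalized update.
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz(momentum_buf)
    param.add_(update, alpha=-lr)
```

The single `momentum_buf` tensor is the memory point Varun makes: Adam keeps both a first- and second-moment estimate per parameter, while this keeps one.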
00:15:15 Lucas Atkins: Well, it had its big explosion of popularity in the nanoGPT speedrunning community, so it was built around converging to a certain validation loss faster. As for why we chose it over Adam: we'd used Adam for 4.5B, but we knew that if we wanted to move this fast, we were going to have to make some pretty big, educated bets. We would still have to make some risky decisions beyond just training in general. Muon, I think, was one of our bigger bets. We ended up not doing multi-token prediction or FP8, because we were throwing so many new things into the run at once that-
00:16:12 Nathan Lambert: Do these apply for-
00:16:12 Lucas Atkins: ... if something were to go wrong-
00:16:13 Nathan Lambert: um, Mini and Nano? Are those also Muon, or are those- ... Adam as well? Okay, so then you- ... you get some de-risk from that. Do you know off the top of your head how many days it take to train each of those? Like, a, a good-
00:16:25 Lucas Atkins: Uh-
00:16:25 Nathan Lambert: ... ballpark for people, before-
00:16:27 Lucas Atkins: Yeah, so-
00:16:28 Nathan Lambert: going into the bigger run.
00:16:29 Lucas Atkins: So Nano, on 512 H200s, took a little over 30 days. And then Mini was about 45 days.
00:16:45 Nathan Lambert: Okay. Another number off the top of my head: an OLMo 1B dense would take us about eleven days on 128 H100s. So, like, sixteen nodes. And the numbers just go up from there. Then the question is, if those took 45 days, and you up the number of GPUs, it's going to be a similar amount of time, maybe 40 days for the big model, but much more stressful.
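As a back-of-the-envelope comparison, the compute budgets just mentioned can be put in GPU-days. The durations and cluster sizes below are taken directly from the conversation and are rough:

```python
# Rough GPU-day totals for the training runs mentioned above.
# Numbers are as stated in the conversation; treat them as approximate.
runs = {
    "Trinity Nano": 512 * 30,   # ~30 days on a 512-GPU cluster
    "Trinity Mini": 512 * 45,   # ~45 days on the same cluster
    "OLMo 1B dense": 128 * 11,  # ~11 days on 128 H100s
}
for name, gpu_days in runs.items():
    print(f"{name}: {gpu_days:,} GPU-days")
```

So the Mini run is roughly 16x the OLMo 1B run in GPU-days, before even getting to the 2,048-GPU run for Large.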
00:17:16 Lucas Atkins: Yeah, the big model was... But again, we felt confident we could deliver a competitive and exciting model in January 2026. Who knows where the research goes, and what class and scale and performance of model is going to come out in the next three months? We knew we really wanted to land sometime in January, and that's also why we went with B300s, even though ours was definitely the largest public training run of that size on B300s, and a lot of the software didn't have out-of-the-box B300 support. It was the only way we were going to be able to train a model of this size in-
00:18:06 Nathan Lambert: Did you have to do this? Did you have to implement the... like, help solve version issues or other issues on B300s? ‘Cause I’ve heard that-
00:18:13 Lucas Atkins: W-
00:18:14 Nathan Lambert: ... the rollout has been rough.
00:18:16 Lucas Atkins: We had to, a bit. There were a couple of days where the data center had to take it offline to implement some bug fixes. It was definitely a very cool experience being on the bleeding edge, but also a little frightening, because you just know, "Oh, we're not getting the most out of these that we possibly could." So, a little bit of both.
00:18:40 Nathan Lambert: Uh, was your final training run stable, or did you have to do interventions through it?
00:18:46 Lucas Atkins: It was very stable, actually, but the beginning of it was not. The first ten days were absolute... It would start very well, and the dynamics and the logs and the graphs looked very similar to Mini and Nano, and then after around a trillion tokens you'd get collapsing; experts would start to go crazy. Part of this is just that we are very sparse compared to what you have: 400 billion total parameters, 13 billion active, 256 experts. It was-
00:19:26 Nathan Lambert: Did you do a, uh, expert routing loss or some sort of balancing loss?
00:19:30 Lucas Atkins: Yeah. Yeah, yeah. Yeah.
00:19:32 Varun Singh: We did. We modified DeepSeek’s auxiliary-loss-free load balancing with some tweaks of our own, and then we also added a sequence-level loss like they did.
00:19:47 Nathan Lambert: Was the auxiliary-loss-free one from DeepSeek V3, or was that a later model?
00:19:51 Varun Singh: That was V3.
00:19:52 Lucas Atkins: It was V3.
00:19:52 Varun Singh: They did a separate paper on it as well. Yeah.
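The auxiliary-loss-free method from the DeepSeek-V3 report can be sketched roughly like this (illustrative code, not Arcee's tweaked version): a per-expert bias is added to the router scores only for top-k expert selection, never for the gate weights, and the bias is nudged up for underloaded experts and down for overloaded ones instead of adding a balancing loss term.

```python
import torch

def route_with_bias(scores, bias, top_k):
    # scores: (tokens, experts) router outputs; bias: (experts,).
    # The bias influences only WHICH experts are picked, not the gate
    # weights, so it steers load without distorting the gradient signal.
    _, idx = torch.topk(scores + bias, top_k, dim=-1)
    gates = torch.softmax(scores.gather(-1, idx), dim=-1)
    return idx, gates

def update_bias(bias, expert_load, gamma=0.001):
    # Sign update, as in DeepSeek-V3's loss-free balancing: push the bias
    # down for overloaded experts and up for underloaded ones.
    mean_load = expert_load.float().mean()
    bias -= gamma * torch.sign(expert_load.float() - mean_load)
    return bias
```

With 256 experts and this much sparsity, keeping `expert_load` even is exactly the collapse problem described a moment ago.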
00:19:55 Nathan Lambert: Yeah, that makes sense. I think a lot of people have derived from there. Have you had issues in post-training as well? I have a theory that the new algorithms we're getting from the Chinese labs, like GSPO and SysPO, are primarily for problems you hit when you have big MoEs and expert issues while trying to do RL. And that's the whole reason that our very serious AI2 RL setup is on dense models, and we're just like, "It's fine. We don't have this big clipping problem, and we don't have as much need to get the batch size big enough to activate all the experts." So you're saying you have so many experts and so much sparsity that it potentially makes RL harder.
00:20:36 Lucas Atkins: Yes. I will also say, from a purely post-training side: our code base started from TorchTitan. We've had to make a ton of modifications to get it where we need it to be, but that was an excellent base. One of the bigger learnings from Mini and Nano was treating at least the SFT side as a separate phase. With Mini and Nano, we finished the pre-training, did context extension, then took those models and ran them on the 64 H100s we would usually do post-training on. That presented a lot of challenges with the MoEs. And that's kind of been a thing in the open space: post-training MoEs can be really frustrating, even for SFT. So for Large, we added fine-tuning directly to TorchTitan and did it all on the same cluster. From a performance standpoint, SFT actually ended up being totally different.
00:21:42 Nathan Lambert: What is the actual difference between the q- the, the implementations then? Is it just kinda like you end up with different batch sizes and parallelism and stuff? Like why-
00:21:50 Lucas Atkins: Uh, I mean, we ended up, we... Yeah, we ended up needing to get it to do really, like, to get context parallelism really well, really good, ‘cause we’re obviously going at a higher sequence length, and then, um, just adding the proper loss masking. Um, it, it, it, it ended up being a relatively easy implementation, especially ‘cause we did all the pre-processing, uh, outside of TorchTitan.
00:22:13 Nathan Lambert: Interesting.
00:22:14 Lucas Atkins: And then on the RL side: I would say it didn't present itself as significantly harder than Mini and Nano. However, that many GPUs does present challenges, so we didn't end up using all two thousand of the B300s for that. It ended up being a thousand; we just split the nodes in half.
00:22:39 Nathan Lambert: Yeah. That makes sense.
00:22:40 Varun Singh: On the dense model side of things, you mentioned that you didn't need to use all the tricks. I think MoEs are just, in general, harder to RL, but I think it's also because of the KL mismatch between the trainer and the inference engine, where the inference engine can sometimes pick different experts than the trainer does when you do a forward pass on the same tokens. So I think there is definitely some inherent instability with RL on MoEs.
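One way to see the mismatch Varun describes is to compare which experts the trainer's router and the inference engine's router select for the same tokens. This is a hypothetical diagnostic sketch; in practice the drift comes from kernel and precision differences between the two stacks.

```python
import torch

def routing_agreement(trainer_logits, infer_logits, top_k=2):
    # Fraction of tokens whose selected expert SETS match between the
    # trainer's router outputs and the inference engine's router outputs.
    # Both inputs have shape (tokens, experts).
    t_idx = torch.topk(trainer_logits, top_k, dim=-1).indices.sort(-1).values
    i_idx = torch.topk(infer_logits, top_k, dim=-1).indices.sort(-1).values
    return (t_idx == i_idx).all(dim=-1).float().mean().item()
```

When agreement drops below 1.0, the policy that generated the rollout is literally not the policy being updated, which is the instability being pointed at.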
00:23:13 Nathan Lambert: Yeah, that makes sense. Okay, another question of how much you want to say: how do you feel about the state of public post-training recipes? I feel like there's so little out there, and there's an opportunity to be seen as technical leaders by sharing more of what you're doing. We've seen for years how complicated things can be; the likes of Llama have these really complicated recipes. But at the same time, I feel like just executing on a simpler recipe can get pretty close. I currently feel unsatisfied with how much I know about the actual core trade-offs of doing post-training well. You can do a lot with SFT, but in this RL regime there's more trepidation about narrowing your model to a downstream use, or about being able to do a multi-week RL run where you get the most performance.
00:24:06 Lucas Atkins: Yeah. Since RL has become such a pivotal part of the process, beyond what DPO and your typical RLHF were in the past... We used to get quite sophisticated with how we would do SFT and even our RL. We make MergeKit, so we utilized merging, and we used to do a lot of distillation to eke out as much performance as we could. Now that RL is such a massive part of the entire post-training stack, I have almost reverted us to really solid but simple SFT. For Trinity Large, our post-training dataset is 230 billion tokens. It's just a really, really large-
00:25:09 Nathan Lambert: That’s ten X what we did. At least in SFT.
00:25:10 Lucas Atkins: And even your ten X, that was before going at this scale, and before reasoning models. Our largest SFT before that was five billion tokens; we'd do, like, three epochs, but it was five billion tokens. So-
00:25:28 Nathan Lambert: Our non-reasoning model is another ten X down. Our latest instruct model is, like, two billion.
00:25:34 Lucas Atkins: Yeah, which is already a lot. Simplicity's key, because it also makes debugging anything easier, and then we devote a lot of that sophistication to the RL. Our RL part is really important. I do think the next phase of reinforcement learning for models of this scale is just scale: we went from twenty billion SFT tokens to 230 billion, and now we're going from, you know, ten environments to a hundred. I think that's where you're going to get the biggest benefit. I also think that's why MiniMax and other players like GLM are so performant and have that extra bit of usefulness beyond what you see in the benchmarks: they've really embraced long-form RL. So, to be quite frank, our RL pipeline's rather... immature might be the wrong word. There's definitely a lot more work we could do, and a lot more work we need to do, but-
00:26:43 Nathan Lambert: Have you started the tool use side of RL?
00:26:46 Lucas Atkins: That-
00:26:46 Nathan Lambert: Or are you mostly... Well, um, beyond like, if you’re training on code, just verifying the code answer, I don’t count yet as tool use. I would say, like, search and code integrated reasoning is what I think is gonna be like minimum table stakes, but do it- to do it well is really hard. Like, we have to, like- ... like, you, you really, like, uh... That’s what I want to do. I want all of our models to have that this year. Search is prob- you have to have, like, a partner to do search or just, like, illegally scrape Google if you’re gonna- ... you’re gonna serve this model onto a customer, and it’s gonna- ... what? Go, go to Google, like, what?
00:27:16 Lucas Atkins: Yeah. Beyond really long-form deep research, or GPT-OSS style or GPT-5 style, where it's doing a hundred tool calls before it gives you a response: not there yet. But once we get past the final RL of Trinity Large and we look at where we go next, that is the next major hurdle, for sure, and it's intimidating.
00:27:56 Nathan Lambert: How big is your, your team of- of... Like, how many people are spending the majority of their time on the model? And then I think we c- start to wrap up technical talk and zoom out a bit to ecosystem and company strategy.
00:28:09 Lucas Atkins: Uh, there’s thirteen at Arcee- ... that are just, like, every, every single day is working on it. Yeah.
00:28:16 Nathan Lambert: And I guess that’s a good number because these people are talking about data, but there’s also, like, the whole data thing that’s coming somewhere else. But also somebody else that wanted to pre-train a model, like they could just download the best fully open data set. And I don’t think it’s gonna be quite as good, particularly in the fact that, um, like, if you look at OLMo’s models, we don’t have a lot of tokens, so we need to, like, acquire- ... more tokens in the open still. But to, like, get a number of thirteen, where some are spending a bit of time on data, but there’s the whole data abstraction, is actually kind of nice for somebody that’s like... To do a serious modeling effort, you need to have this many people, I think.
00:28:50 Lucas Atkins: It, it was-
00:28:51 Nathan Lambert: It’s reasonable to me.
00:28:52 Lucas Atkins: It was, it was a good number. I mean, I would say that, um, it, it was helpful to be able to, you know... This was like, how do we alleviate as many concerns as possible? Or how do we check off as many boxes, right? And it’s like, if we’re trying to do this in the shortest possible amount of time, like, we need to focus on what we’re good at, which is we- pretty good at post-training, and how do we get to the point where we’re able to do that? Well, we have to have a pretty strong base model. How do we get a strong base model? We’ll-- we have to, you know, figure out how to do it, perform, you know, efficiently across many, many GPUs, and then data’s, you know, extremely important, so getting a partner that could, you know, help us with that, and we could offload some of that. It, it- there ended up being, obviously, as you, you know, alluded to earlier, like, a lot of, uh, working with Datology and, and, and others to make sure that the data accomplished what we needed it to. Um, I think that that is gonna be an interesting... You know, as we, as we- now that we have Large and we’re looking at, you know, kind of going further, it’s like, okay, you know, the, the pre-training data really has to be in service of what you wanna do in the post-training, uh, work.
00:30:10 Nathan Lambert: How did you identify this?
00:30:11 Lucas Atkins: Like, like-
00:30:11 Nathan Lambert: Like, like- ... did, did you identify this through Mini and Nano, or, like, how’d you come to think that this was so important?
00:30:19 Lucas Atkins: Data in general or, or just-
00:30:20 Nathan Lambert: Or like this in form of post-training
00:30:21 Lucas Atkins: ... of optimizing it for the post-training? Um, I- really ob- observing other, other players, I think. I mean, it’s, it’s... You know, the, the true base model has kinda stopped really being a thing.... around Qwen2, but definitely around Qwen 2.5, um, where you started to see how much post-training data was making its way into the, the, the base models themselves. Um, and then you start to see the models that have done that, how malleable they are with RL, Qwen 2.5, Qwen3 being a good example. And you start to see like, oh, yeah, like they are, uh, doing as much in the last probably thirty percent of training to make it so that when they go to do RL or post-training, they’re gonna have a really good time. Um, you know, they’re just complete-- they’re way easier, way more malleable, way more performant than what you had in Llama 2 or Mistral 7B. So, um, I knew that i-in-intuitively, kind of going into this, but it wasn’t until after Mini and Nano, yeah, where, where we kind of... Well, definitely 4.5B, where we were like, “Yeah, we definitely need to juice our mid-training quite a bit.”
00:31:31 Nathan Lambert: Yeah, I agree. Okay, this was fun. We'll probably revisit themes from this. I can definitely go over time and keep chatting, because I'm enjoying this. For context, Mark and I had coffee at some point when I was at a conference in SF, and I was like: damn straight, this is a fun bet that you're making. So I'm trying to recapture as much of this as I can. For context, in July, which is similar to when you decided to start this model, Qwen Coder came out, Kimi came out, GLM 4.5 came out, and Llama had kind of become a meme of going away. That's why I launched the ATOM Project, where I was like: come on, we need to have some people doing this. And I think it's hard in the US because there's so much money to be made on AI. The big tech companies are like: "We see it, and we're going to take it, so we don't need to bother caring about open models." But from an ecosystem perspective and a long-term tech perspective, I don't think that works very well for the country. So it's this weird middle ground of: how do you convince people to actually build open models? I have calls with people in government asking me what I would actually do, so it's very hard to think about. And then hearing that you guys are just making this bet is very fun to me, but it's also based on actual learning from trying to do this. You've been trying to train open models; Mark and I have both been at Hugging Face in our past; you were trying to sell people on using open models, and there is a market for this, but it wasn't enough without the base models.
So I think talking about your experience selling on-prem open models, and why you needed to train your own end-to-end, and why you needed to train bigger, is great, because I hope there are more stories like this; it fills a void and inspires people to work on it. So, however you want to take this prompt.
00:33:24 Mark McQuade: Yeah, I can jump in. When I started Arcee in 2023, all we did was post-training. We worked with a lot of large organizations and did model customization for their use case, on their data. We were using Llama-based models, Mistral-based models, and then some Qwen. I don't even know if we actually did much Qwen at that time, right, Lucas, but-
00:33:54 Lucas Atkins: No, we did. Later on, but-
00:33:56 Mark McQuade: Later on, right? Uh-
00:33:57 Lucas Atkins: We did, and then we ended up not, because after a lot of Chinese models started to come out, the companies didn't want to use Chinese models. Yeah, it was kind of tricky.
00:34:08 Mark McQuade: Yeah, and people don’t realize that that’s real.
00:34:10 Nathan Lambert: People don’t realize that that actually happened.
00:34:13 Mark McQuade: Yeah, no, that's a real thing. That's why we started going down to pre-training: Meta did their thing and kind of got out of it, right? So the main US player got out, and we were working with a lot of US-based enterprises that were not comfortable using Chinese-based architectures. And if you wanted to use the best open models of the day, it really trended towards the Chinese labs, to the point where we are now, where ninety-plus percent of the top open models are coming out of China.
00:34:47 Nathan Lambert: Yeah, like, Cursor’s building on it and stuff. Like, people are building on these things.
00:34:52 Mark McQuade: Yeah. So we said, "Okay, let's..." We were so reliant on the Metas of the world and the Mistrals of the world, and Mistral largely stopped open sourcing fully. So we said: you know what? We'll just go down the stack. We feel we're capable enough to train our own models from scratch, and then we control the stack; we control the core, as opposed to relying on others to release great models. And during this time, it just happened that there wasn't a tremendous number of US companies doing it. So from our perspective it was a win-win: we were able to own more of the stack by going down to pre-training and creating our own models, and we were entering a space where there wasn't a tremendous amount of competition, to be honest. Lucas and I said this yesterday: as a startup, you don't want to directly compete with X or OpenAI or Anthropic or Google, because they have more money than God and can do whatever they want. But when you're doing open weights, it's a different kind of competition; they don't sit in there, right? You're going down your own path, where there isn't a tremendous number of players, and you can find your way, build your niche, and go from there and become something big. So it kind of all coincided for us back in July, and we went all in.
00:36:23 Nathan Lambert: Yeah, the all-in thing is real, because this is expensive. I could dig up from my research the daily cost of 2,048 B300s. I've seen this type of cost at AI2, too, where we have long rentals, and I know exactly how much it costs; it's not cheap. A way to transition this: do you see the demand? You were selling open models; is this continuous, where people are like, "You helped us deploy this model, but it's not good enough," and you come in with, "Well, we have this, and we can help you"? Or is it a build-it-and-they-will-come type of situation? How much continuity is there?
00:37:17 Mark McQuade: Yeah, I think it’s largely-
00:37:19 Nathan Lambert: I-
00:37:19 Mark McQuade: I, uh, from my perspective, I think it’s largely if you build it, they will come. Because we stopped, you know, focusing on that whole revenue generation side of the house when we started to go all in on being this, you know, frontier lab in the open source side. So, um, there’s a couple pieces to that, that, that I think we should all be very proud of inside of Arcee, is that we not only went all in by committing a significant amount of capital. Like, we, we committed, you know, sixty-five, seventy percent of our capital to these models, which is a large amount for a startup. I mean, we didn’t... So that’s not like a dip your toe in, that’s like, we’re all the way in.
00:37:55 Nathan Lambert: Yep.
00:37:55 Mark McQuade: Um, but we did that at the same time as abandoning essentially the whole revenue angle to go all in on it, because we couldn’t focus on both. So we said, “We know how to make revenue on open models. We’ve been doing it for two years. Now, let’s take a step back, because it wasn’t, uh, in a repeatable or sustainable way that we h- the way we had that business set up. Let’s take a step back, let’s build these models from scratch, let’s come up with the, the Trinity family, then let’s go back to generating the revenue side of the house and the monetization piece,” which I think we are in a good position to capitalize on even more now, but we, we took a... We, we, we kind of walked away from it to do what we’re doing here.
00:38:36 Nathan Lambert: Yeah, I love this.
00:38:36 Lucas Atkins: Yeah. When there are only thirteen researchers... We're obviously doing our own products and our own models, but when you're working with customers, inevitably those are the same people that need to help train models for customers. By the time we were really beginning Mini and Nano and getting down to the start date of the cluster, having myself or Mark, or even Varun and others, pulled into customer conversations or contracts... We would not be where we are if we had continued working with ten customers at once. So-
00:39:19 Nathan Lambert: But-
00:39:19 Lucas Atkins: ... we, we scaled that down pretty drastically. I do think that when... You know, Mark and I put a lot of thought into, “Okay, well, we’re gonna spend all this money to train these models, like, you know, w- how do we not...” I think, uh, one of the things that makes the idea of, of going all in on training open weight models hard, is that you’ve seen other people try it. And, and like M-
00:39:42 Nathan Lambert: Um, like, like do you think Meta or do you think Meta or Mistral went all in?
00:39:46 Lucas Atkins: I, I think, well-
00:39:48 Nathan Lambert: Meta obviously did.
00:39:48 Lucas Atkins: I think they, they both... Yeah. I think, I think that when I say all in, I mean more like Mistral was, was one of the core ones I’m thinking of, where- ... they were a venture-backed company that, like, had a, a, a fiduciary responsibility to bring in money, but were also trying to release open weight models, uh, for, you know, the West, and for their communities, and for the world. And, um, they tried doing closed versions, and then monetizing off of that. They, they also kind of have more recently, luckily, for all of us, gotten back to their kind of Apache 2.0 roots, and-
00:40:30 Nathan Lambert: Oh, my God.
00:40:30 Lucas Atkins: And-
00:40:30 Nathan Lambert: Have you seen the download numbers on Mistral 3 Large?
00:40:33 Lucas Atkins: I haven’t. No, what is it?
00:40:35 Nathan Lambert: Oh, s- no bueno, sir.
00:40:38 Lucas Atkins: Hey.
00:40:39 Nathan Lambert: Carrying on. Sorry.
00:40:41 Lucas Atkins: But, I mean, yeah, you know-
00:40:42 Nathan Lambert: The Mistral Large Instruct model has very few downloads in the last month. I honestly don't know what's going on. Maybe there's some quantized version out there. I was confused.
00:40:50 Lucas Atkins: Maybe. Well, I mean, yeah. But I think that we-
00:40:52 Nathan Lambert: It’s, it’s hard to get adoption. The competition is insane.
00:40:55 Lucas Atkins: Hmm. Well, that’s, that’s- ... yeah, I mean, and that could be a whole conversation also, is, like, how do you actually get people to use it?
00:41:00 Nathan Lambert: I was gonna ask you, like, how do you get people... How do you get people to- - really sell into this? You said you’re good at it.
00:41:06 Lucas Atkins: Yeah, I think that the-
00:41:08 Nathan Lambert: Continue your point, we can come back to it.
00:41:11 Lucas Atkins: No, but they all kind of tie into it. We knew the market was there for custom models. It was there two years ago, frankly, and it's even more so now, because RL has drastically increased the areas where you can hill-climb and become really powerful with a tiny model. But also, people are beginning to see how powerful training into a product is. You see Claude Code, you see Codex... I think Deep Research was one of the first ones that really opened my eyes to what's possible when you're training in the same environment that you're serving your users in. So we knew people wanted it. We'd had good success with customers in the past using other people's open models. So it was less a question of could we monetize it, or will we, and more a matter of: given a wide suite of basically being able to pick any model in the world, would our researchers and our teams reach for our own? And luckily, I think we're there. On-
00:42:31 Nathan Lambert: Uh
00:42:31 Lucas Atkins: ... on the topic of, like, how do you get people to use it? How do you get adoption? You know, I’ve never wanted Trinity, uh, or our biggest advertising thing to be, like, US. You know-
00:42:45 Nathan Lambert: Yeah, I know
00:42:45 Lucas Atkins: ... like, if, if your entire-
00:42:47 Nathan Lambert: I know, man, it hurts me.
00:42:48 Lucas Atkins: Yeah, if your-
00:42:48 Nathan Lambert: I spent months reckoning with this.
00:42:50 Lucas Atkins: Yeah. If, if your entire, uh, you know, value prop is that you’re an American company-... great, but ultimately people are gonna use the best. Um, and so I think that we’re gonna be able to serve and, and the people like that need a US-based model because their compliance or legal teams won’t let them use something out of China, it’s gonna be a fantastic option. But I think, you know, kind of the next phase of what we’re doing as a company is, all right, now we’ve, we’ve proved to ourselves and maybe the, the wider industry that like we deserve to be in the conversation, and we can train models of this scale. Um, then it’s like, okay, how do we train the best one? Uh, ‘cause really, I mean, people’s loyalties are very fickle, and, and, yeah, you, you go to what’s the best. I guess it’s like, how much do you think
00:43:41 Nathan Lambert: you’ve learned about being able to tune a model narrowly by going and building the whole stack? Um, something we talk about is like ability- ... to specialize models, and I kind of, of opinion that you just make a better general model right now ‘cause the pace of progress is so high. And but the question is like, can we tune a OLMO that’s very good at science or something? And I- ... w-would guess that training the entire model, you’re going to be able to actually do a better job at what you were doing, but I don’t know how to articulate why or what that looks like.
00:44:18 Lucas Atkins: Um, I mean, the, the, the simplest answer to that being yes is just that... or the simplest reason why that’s the answer to the question is yes, is because we know what went into the model. Like, we know what it actually saw at the later stages of training during the decay. Um, and so that all- that helps influence, A, what are we tr- what kind of data and what topics and, and what format are we giving these models, uh, in post-training? But it also allows you to know like, okay, where, where do I absolutely wanna crank, you know, how, how many- how much of this, say, 230 billion dataset, do we want it to be math or, or, or, or coding? And a lot of that’s influenced by what you’re able to put in-
00:45:06 Nathan Lambert: How, how much of your post-training-
00:45:07 Lucas Atkins: ... post-training
00:45:07 Nathan Lambert: -do you expect to redo? Like, uh, how much can you say about when you’re serving something on-prem? Um, you- you’re not gonna redo the pre-training. You might, for a very big customer, redo mid-training or do continued pre-training- ... in which, in that case, you do need the pre-training data to keep, keep it being stable. Which is a use case where like I’m- I would love to see a paper that’s like, “Because of OLMO being open, we continued to pre-train on biology, and we mixed half of their exact mid-training dataset in with our dataset, and it, and it worked,” yadi, yadi. Like, you could obviously- ... do that, but how much do you think is gonna be like the standard, you fine-tune the last instruct model, or do- are you gonna have to retouch the post-training for a customer? Because that, like, I, I really feel like-
00:45:48 Lucas Atkins: Um
00:45:48 Nathan Lambert: ... it’s just at the end.
00:45:50 Lucas Atkins: It, I think, I think-
00:45:50 Nathan Lambert: But it would be fun if you had to change it.
00:45:52 Lucas Atkins: For the most part, um, I think a lot of tasks will be fine just starting from our, our, our, po- uh, like the released, you know, official post-trained version. Um, now, that’s for maybe simpler tasks, is the wrong way to frame it, but if it’s like, “Oh, hey, we’re doing a deep search agent. We want it to do 30 calls and, before...” That would be a good use for just starting with the finished model that we released that’s already post-trained. Now, if we’re going into something along the lines of, um, a very low-resource programming language or, um, something that it didn’t see a lot of in, in, in pre-training, um, or it’s kind of like a, you know, we’re wanting to train this thing to be really good at humanities last exam, but tools. Um, once we get into the world where we’re having to, especially... Actually, I have a much better answer to this question as I was thinking through it, but most of that holds the same. I think that the, the, the world where we’re gonna be doing a lot of extra instruct and, and SFT and, and post-training is gonna be when we’re trying to distill capabilities from large, like into mini or nano. So say like, oh, you know, this large is, is, is really great at invoice processing, but it’s also 400b, and the, you know, the company doesn’t wanna be hosting that on-prem, you know-
00:47:24 Nathan Lambert: Ah
00:47:24 Lucas Atkins: ... let’s go out generate a new one.
00:47:25 Nathan Lambert: Do you have costs off the top of your head for, like, what the hosting costs are for each of the model? Like, do people... Are people all gonna host these models in the same way, or is there actually-
00:47:32 Lucas Atkins: Uh
00:47:32 Nathan Lambert: ... a wide variance? And if you have, like, the same three models- ... do almost all of your customers end up hosting the same way, or do you end up doing a lot of, like, how do you configure the model to fit in the right hosting for them? Like, is that part of-
00:47:44 Lucas Atkins: It depends
00:47:44 Nathan Lambert: ... the business model?
00:47:45 Lucas Atkins: It, it, it, it kind of... And we tried to move a, a, a little bit further away from that because you get into the risk of being like, like a consultancy, and it’s- that becomes tricky, where there’s not a very clear separation of concern. But, um, for the mo- it would change depending on, were they using AWS? Did they have a commit with Azure? Um, if not, okay, then we, we can go to, you know, someone like Prime Intellect or Parasail and, and get a, you know, maybe a, a cheaper rack of eight. Uh, it just really depended. Uh, there’s quite a bit, um, of, of people that were also serving them, just using, like, Llama CPP. So, like, on CPU-
00:48:25 Nathan Lambert: Uh, is the 400b designed to be, to fit onto one rack of eight 80 big gigabytes in FP8? Is that how you designed it? ‘Cause Llama- ... Llama four, whatever, Llama 405b was the same. It was like one rack in FP8 works pretty well.
00:48:41 Lucas Atkins: It’ll do- we... well, you’ll be able to get really good throughput, a little bit lower concurrency on a, a rack of eight H100s at FP8, and then for, like, our, you know, what we’re serving, we’re serving them on, uh, a series of H200s, but we’re not doing, like, multi-node inference. Uh, but that’s just to add more, you know, replicas and- ... other kinds of things.
00:49:03 Nathan Lambert: Hopefully, eventually. I think that the-... Do you have anything else to say about selling open models? I think that generally, like, how do you think about the market for AI? ‘Cause I see the market as being so big, but the- with specifically with open models, it’s so hard to measure. I think I’ve started talking to some of the Chinese labs at all- as well, and I like to ask them, like, this is very US-centric and like Fortune 500 or whatever, and it’s just like, who the heck uses these models? I think- I guess another question is, like, what license or do you know the licenses you’re gonna use for the biggest models? And I think they’re, like, you’re, you’re playing with fire ‘cause people can use it for free, obviously, but potentially- ... you’ll get to hear like, “Oh, shit, somebody actually used our model for this.” And I think any successful business, you’re gonna want... You, you, you know that this model is not gonna be very relevant in a year with the pace of progress. So like- ... how do you think about your license decisions?
00:49:55 Lucas Atkins: Uh, we- you know, with the 4.5B, we tried to do like a, like a, a reve- one of those revenue-gated licensing. So it’s like, oh, it’s completely free for you to use for commercial and whatnot, but if you or your company made over, I think it was like $1.7 million last year, then you need to come to us and get a license. And what we ultimately found was like, it, it didn’t... Maybe for some people who are just only trying to train the model, release it on Hugging Face, and then just call it a day, maybe that is a huge requirement. But when so much of our, our, our company is built around, you know, training custom versions of the models, and, and not even just ours, but in general, even before we did pre-training. Like, at the end of the day, i- as long as we were using it, a- and we knew that we were in full control of, of whether- if we really succeed, it’s because we trained the models, we did them well, and we executed on it well. If we fail, it’s because we, uh, didn’t execute, instead of, oh, some company just stopped releasing good open models. Um, so we eventually switched to just Apache 2.0, and Trinity Large is also gonna be Apache 2.0. Um, you know, I’m- I think it is-
00:51:23 Nathan Lambert: I think this is the right approach. I have a big investor-
00:51:25 Lucas Atkins: Yeah, I think it-
00:51:25 Nathan Lambert: Without, without naming other companies, it’s easy- like, raising a lot of money, whe- or being Meta and releasing open models, and do it- and you could release it with non-commercial, and you could get all these, like... You could talk to, I don’t know, fucking Adobe, whoever. Oh, Adobe’s too big. They’ll have good AI. Some... I don’t know, a bank. Bank of America. You could run Llama on Bank of America and make good money on this. But I just feel like the cultural home of open source AI, and I don’t think- it’s impossible to know who wins it, and I don’t think that you’re in the prime position, and I don’t think that it’s easy to win, but you’re doing a thing that aligns with it. It’s the person that just, like, commits to building the models and learning how the ecosystem works, and to rebuild the models based on the feedback th- that you get from people, and to just kind of commit to an evolving process. And if the whole thing works out, there will be a lot of value, and the person who understands it best should be able to learn how to extract said value. And I think that I’m personally, like, sometimes frustrated with Hugging Face, ‘cause I feel like they have sat on that s- a sort of position like this, and they- ... haven’t figured it out. Not that it is easy to figure it out, but I think that has to be the ideal of open source AI, of like, if it’s really gonna work, that’s, that’s what I hope it looks like. And it’s like, I, I don’皮 know, maybe you guys could do some of that. Like, I have a question of like, could you figure out how to make models that are more fine-tunable- ... after all this post-training? Because you need to sell it to a- you need- ... you, you know the customer’s not gonna want it off the shelf. And I don’t know how to train to post-training to make sure that you don’t, you don’t cook it. 
Maybe you just learn that you need to warm up the model in a l- in the right way, and you just learn the technique of training downstream. But when you talk to people doing research, the different base models have such different characteristics. I think one of them is character training. I did this paper, and the guy was like: “Qwen and OLMo love their character,” and I’m like, “I have no idea why.” And but it’s like Llama and Gemma, you can change them so much. And I’m like, “Dog, like, please figure out why this is the case.” And for one thing, it’s really cool, but also, like, in your case, that would unlock a lot of value to be like, we know exactly what the model’s gonna do, and we know exactly how to change it. So.
00:53:35 Lucas Atkins: Yeah-
00:53:36 Nathan Lambert: Uh
00:53:36 Lucas Atkins: ... it, it, that’s- no, you’re, you’re, you’re right on the money. I think that even, uh, going into the post-training at large, we, uh, one of our researchers came out with, like, a pretty cool, um, experiment and ablation run that they did on drastically reducing catastrophic forgetting. And I almo- I mean, this was, like, three days before we were gonna start doing SFT, and then we ultimately just... I, I ended up pausing on it because it was just throwing something in that wasn’t tested. But, um, yeah, I think-
00:54:08 Nathan Lambert: A good research lead. You did the right thing.
00:54:10 Lucas Atkins: Yeah, I think, I think one of the most important things long term, you know, as we look at kind of what our research priorities are for this year is, is there’s obviously just how to scale RL and, and make these- the end result of the model as good in as many situations as possible. Um, but I think the other half of that is, you know, how do we make the, the, the speed and efficiency and, and performance of customizing them as, as fast as possible, and as easy as possible.
00:54:42 Nathan Lambert: Yeah. Do you learn in making open models from your experience just kind of running these open software things in MergeKit and DistillKit? I know there was a whole license journey on one of those as well.
00:54:52 Lucas Atkins: Yeah, DistillKit.
00:54:52 Nathan Lambert: Do you feel like they’re kind of isolated?
00:54:54 Lucas Atkins: Or MergeKit. Um, yeah, I mean, I think so. I think that, that, um, you kind of have to play the tape out. With MergeKit-... it was by far our most popular piece of software we’d ever released, but it was so popular because it took something that isn’t fundamentally very complicated, but we ma- but it’s time-consuming, and standardization is great for things like that, and we made it, uh, you know, streamlined and easy to do and fast, and you could experiment and ablate really quickly for, you know. And, and so I, I think that when we switched that to, like, a, you know, a, a similar, uh, revenue-based licensing, like, it, it didn’t end up having the value prop that was important because are you gonna pay Arcee, you know, thousands of dollars, or are you just gonna have one of your researchers-
00:55:52 Nathan Lambert: You’re gonna have clone code in a week, right?
00:55:52 Lucas Atkins: recreate it in a week, right? Yeah, so it’s-
00:55:55 Nathan Lambert: In a day.
00:55:55 Lucas Atkins: It’s, it’s kind of... It, it’s remi- it’s remembering like, okay, what is- what problem is this solving, and is this even a prob... Like, is the solution to this monetizable? Um, and so MergeGit, we brought it back to the original license, but I think with even viewing the models in the same way, it’s like it’s... Open source is an unbelievable marketing tactic. Like, there’s no one would care about Arcee if we weren’t open sourcing stuff, ‘cause as soon as you do something closed source, if you’re not the best or the cheapest for your price point, I mean, your performance point, no one’s gonna use it. Because-
00:56:30 Nathan Lambert: Um, another question on this. Um, do you think that open models are kind of at a disadvantage when progress is so high? Because it’s potentially easier to swap APIs than open model configurations, especially if, like, model weights are changing sizes or something like this. Where it’s like, “Oh, I can just upgrade to the new Opus, and I do this.” Like, does that, like, uh, decentivize people from using it? Or do you think most of the people are like: “I can only use open models, therefore, I’m gonna use open models?”
00:56:56 Lucas Atkins: Uh, I think for the people who are using, like, s- either self-hosted or, you know, um, uh, bespoke, uh, you know, engines to, to run it, where they have complete... You know, in a VPC or they have complete control over, like, data in and out, egress, ingress. I don’t think that’s really gonna be so much of a problem because they’re obviously doing it for a reason. Um, like, they’re either for privacy or security or, or HIPAA or SOC 2. For whatever reason they’re doing it, um, I, I don’t think that that’ll be, um, so much of a blocker, but I definitely do think that, um, you know, by far, e- even, even with some of the, the larger open... You know, like inference players, like Together and Fireworks, that, that host a lot of open models. Like, being feature- being on feature parity with a lot of these, these larger labs’ APIs is gonna be extremely important, um, o- of being able to serve, you know, um, with features that they’re used to, like prompt caching, that kind of stuff.
00:58:03 Nathan Lambert: Yeah, are- like, I, I think I saw that you guys are setting up an API as well. Is that kind of what the vision there is, is being able to o- offer parity at least, or, like, make it easy for people to consider it?
00:58:13 Lucas Atkins: I think so. I, I- we’re- we very... Yeah, we are doing our own API. We are hosting it. Um, we haven’t- we, we push a lot of that through Open Router just because it’s such a great place to get, like, discovered. Um, as... If we see, like, tremendous growth there, that would obviously be where we’ll, we’ll invest very heavily. Um, whereas the right move might be to let other people host it, and we invest super hard on the infra for, like, make- taking advantage of the models, um, and, and customizing them. There’s, there’s, there’s a few avenues we have ahead of us then, and we have, you know, projects going kind of toward to poke at each one. Um, and we’re just kinda getting as much data as we can before we... I mean, we’re gonna have to go all in on another direction soon. Not, not like pivoting away from pre-training, but now that we’ve done that, now w- what’s the next big bet we’re gonna make, and how do we go fully into that? So we’re trying to figure out what that is.
00:59:12 Nathan Lambert: Yeah. My two last kind of, like, real questions are, like, one is... I guess I can start with, like, where do you see the open model ecosystem? Do you think- where would you see it changing substantially in the next six or twelve months? I, like... Or, or do you? Or you just kinda think we’re marching along for a while?
00:59:31 Lucas Atkins: No, I think we’ll, I think we’ll, we’ll be... I, I, I don’t think it’s an unrealistic prediction to make that by the end of 2026, like, the best model in the world is, is some degree of open. Uh, I think that’s very, very possible, especially with, like, what I’ve seen GLM and, and MiniMax do recently. Um, they have started to find that secret sauce that takes you out of just being good on benchmarks and, like, genuinely useful in people’s day-to-day workflows. And, um, I wouldn’t- like, if, if I, you know, came back, and I... Someone came from the future and told me that the best model in the world was, uh, an open-weight model, I wouldn’t be surprised. I actually think we’re on a, a, a super good trajectory, and, and, and fostering and, and promoting that kind of work and adoption here in the United States is gonna be extremely important.
01:00:24 Nathan Lambert: And where do you see the company going? ‘Cause like, like, I have my guess. Like, you kind of hopefully-
01:00:31 Mark McQuade: What’s, what’s your guess? I wanna hear your guess.
01:00:31 Nathan Lambert: Um, you can hopefully do a mix and kind of oscillate into trading when you get... Like, you need to start having the feedback of the real world. I think that’s obvious. Like, it’s o- like, it’s... Well, obviously, you need to make money to survive as a company, but then you need to start using that as the feedback to guide training. And then it’s like, you need to figure out how to balance and do some of them at each time, and you can plan your cluster at different times, and then you kind of... Hopefully, they become a, a loop across each other, and they kind of make it so obvious of why you each need them, ‘cause it, it seems somewhat natural.
01:01:03 Mark McQuade: Yeah, I mean, exactly. You know, you kinda hit, hit it right on the head. Um, you know, getting feedback and then kinda steering the ship from there, um, is, is probably-
01:01:15 Lucas Atkins: ... exactly what we’ll do, but we have a good idea already. I mean, first and foremost, you know, we talked about it earlier, w- we’ve spent a tremendous amount of money. So, uh, we need to go raise some money after we - after we get, you know... We need people to back the, the, the mission and the vision of US open source and, and, you know, so, um, because, uh, you know, we, i- i- Lucas had mentioned about, like, MergeKit and how we flopped the license and, you know. I mean, we’re a smaller-sized start-up. We have-- we’re-- we gotta think of kinda unique ways to try and generate revenue because we don’t have the money of the large labs. So, uh-
01:01:52 Nathan Lambert: Well, I think it’s a benefit to the employee. I think a lot of these labs have over-raised.
01:01:56 Lucas Atkins: Yeah, I like, uh- uh, I-
01:01:57 Nathan Lambert: OpenAI, Anthropic, and all of them are fine. Like, with the OpenAI, Anthropic, Cursor scale, like, let it rip. They should, they should really rip the raising. But all the other companies that are stuck at the, like, the one to two billion range without, like, obvious traction, like, the risk goes to the... I mean, you could-- a lot of them do secondary, so a lot of the founders get out. But it’s like, the risk is the employees get nothing.
01:02:21 Lucas Atkins: Yeah. Yeah.
01:02:22 Nathan Lambert: There is a lot of money, but that’s also why I like the approach, ‘cause it’s like, “Oh, you’re doing the actual start-up thing.”
01:02:28 Lucas Atkins: Yeah, yeah. Yeah, I mean, I think... W- what I was gonna add to what Mark... is just like, what- whatever we do from, uh, uh, uh, scaling and, and speeding things up and growing, um, my goal is to keep our research and engineering teams pretty small. I think, I think that one of the reasons we’ve been able to, to move as quickly as we have is it’s been, like, a small group of, like, highly intelligent, smart, and opinionated people sitting in a room, debating in good faith on decisions. And I think that that’s, uh, uh, under the constraints of, “Hey, we don’t have five hundred million dollars to go and, you know, to rip on, on, you know, X, Y, and Z.” So and I think that’s kind of where creativity comes from, and I think that fostering a culture like that over time is how you can kind of make it so that excellence is less of like a, um, an accident, and it’s actually, like, a by-product of the way that you work. So, so we’re gonna stay small, we’re gonna stay lean, but, um, I, I do think that, like, the, the major, um, kind of challenge for us over the next probably six months, beyond any other models we might have, kind of, uh, think or we’re thinking about, is, is getting up to, like, post-training parity with the likes of DeepSeek, and GLM, Qwen, and others.
01:03:47 Nathan Lambert: Yeah. I, I hear lots of horror stories about this, where it’s usually and-- it’s-- you end up having people that are going after different important abilities, but, uh, like, doing each of the abilities alone is pretty easy to hill climb, but then you just end up with such a mess. It’s like you’re- ... building a custom puzzle, and you’re building all these custom pieces, and they’re magnificent, and then you’d have to, like, pick up these pieces and assemble this unknown thing at the end. And it’s like-
01:04:12 Lucas Atkins: Like they didn’t have the same designer, right? Yeah.
01:04:15 Nathan Lambert: As AI2 is barely scratching the surface of this. Like, you talk to the people at the frontier labs, and it’s like, holy cow, like, post-training is really the Wild West. But a lot of it works. I think, like, we find-- like, even like model merging gives a ton of performance across the whole- ... training pipeline. It’s like- ... you merge at pre-- you merge after each pre-training stage, you merge in post-training. It’s like-
01:04:35 Lucas Atkins: Roon can tell you.
01:04:36 Nathan Lambert: But merging post-training becomes a lot more complicated because you- ... can have all these domains and things, uh.
01:04:41 Lucas Atkins: Well, in, in merging, you know, it, it actually, it used to be very YOLO, um, the way we used to do it, and, and Charles, who, who created MergeKit, I call him, like, chief alchemist, and, like, you’d kinda just send him ten promising checkpoints, and he’d come back a day later with, like, some insane, you know, model that was really good at all of them. And, and you can’t do that as much in post-training anymore because of, uh, of just the, the formatting and the way that RL is done. Like, you do have to be a little bit more surgical about it, but yeah, everyone can tell you, like, any time we start to see anything worrisome at all in training or, or, or even something going really good, you know, “Lucas, what do we do?” I’m like: Merge it. I’m like, just-
01:05:21 Nathan Lambert: Merge.
01:05:21 Lucas Atkins: ... I’m like: “Just take it, just merge it. Let’s see.” And more often than not, it fixes it, so...
01:05:27 Nathan Lambert: Um, do you merge during RL? Like, you could just, like, merge the last few checkpoints and resume or something?
01:05:32 Lucas Atkins: We’ve ex-- we’ve, we’ve dabbled in that, not, not for what we’ve done. You know, again, a, a lot of the, the mini, nano, and large story for Trinity is, like, getting to a level of... what was my level of complexity I was comfortable with us undertaking, and then, uh, not introducing anything more. So, um, not yet. But we, I mean, we, we, uh, regularly merged. We didn’t do it for LARP, but we used to merge a lot, um, during just, like, your standard, uh, um... When we’d do, like, RLHF, we used to do a bunch of merging. We’d do it, like, every five checkpoints. We would-
01:06:11 Nathan Lambert: Online RLHF or D-DPO?
01:06:13 Lucas Atkins: There’s DPO.
01:06:15 Nathan Lambert: Yeah. It’s so much easier to get started. One of my goals is to have somebody figure out how to do actual online RLHF, pure LM feedback, obviously, for scaling. But it’s just like- ... it’s, it’s unsavory to it’s just, like, doesn’t look like DPO-
01:06:28 Lucas Atkins: Yeah, I mean, if, if, you know, if GRPO and kind of op-- in, in the, the present day RL regime, like, if that hadn’t materialized when it did, I think that would’ve been a big topic in 2025. But I do think that, you know, GRPO and just the overall, um, DeepSeek and o1 style reasoning and thinking and RL kind of... Any, a- any person who is thinking of doing that for, like, performance reasons, realize that there was something that had fifty thousand papers released every day on how to do it. Um- ... that was kind of probably right where you’d get the same amount of performance.
01:07:07 Nathan Lambert: Um, do you force dog feeding? Do you make yourself-- do you guys use your own models to understand them? Like, do you, like, make that a thing?
01:07:14 Lucas Atkins: Uh, Mini was the first one we could actually start doing that with, um, a- at least for, uh, a more general day-to-day tasks. So a lot of our, like, internal Slack, we have stuff that, like, monitors Twitter and LinkedIn for feedback on Trinity and, and, and that kind of stuff. That all runs on Trinity Mini now. Um, and then, uh-... you know, we, we put a good amount of work into, into large being, um, you know, good in, in a bunch of your, like, OpenCode and, and Cline, uh, and, and Kilo Code. So, um-
01:07:45 Nathan Lambert: Uh, what does that, what does that work look like?
01:07:49 Lucas Atkins: Uh, working with those guys to get data. And then, um-
01:07:53 Nathan Lambert: That’s, I mean- Good for me to know.
01:07:55 Lucas Atkins: I mean-
01:07:55 Nathan Lambert: I should do that, I guess.
01:07:58 Lucas Atkins: Yeah. Yeah, working with, uh... Or, or I mean, it- the way it started was us, like, using open models and then, like, passing those through as the base URL, and then, like, getting the logs from that. Um, and then realizing that, like, that translated pretty well. Um, and then over time, obviously turning this-
01:08:16 Nathan Lambert: Um, can you expand on this? So I was gonna ask you-
01:08:19 Lucas Atkins: So-
01:08:19 Nathan Lambert: -if you’re, like, using these open models regularly, ‘cause I, I’m just, like, Claude Code psychosis, man. I’m like, “Can’t take that away from me.”
01:08:26 Lucas Atkins: Yeah, I, I use, I use four... I’ve used 4.7 a lot. I think 4.7 from GLM was one of the first ones that could replace a lot of my day-to-day. Uh, I’ll still reach for Claude Code or even 5.2 Pro if it’s, if it’s, like, something that’s, like, really... I- if I do not know how to measure what success looks like for something, I’ll usually use those. Um, but, uh, yeah, I mean, it, it- even using DeepSeek before, um, kind of their May update was hit or miss. But, um, yeah, w- the reason I decided to, like, start talking to these people and working on, like, how can we get data and, and start making our models good in these systems was I would use them. I had a, um, you know, something that would grab the logs, like, it, you know, inter- as a proxy, so it’d like grab the logs and then format them in the messages format. And then I saw that and went, “Yeah, that’s... You can make a pretty good filter for just, like, standard stuff that you don’t want, and kind of hit a scale.”
01:09:30 Nathan Lambert: Yeah, it makes sense. So, so you’re like, uh, open code will let you look at the data, and then you’re probably gonna get a sense for... Like, I don’t even actually know how the, on the back end, the code agents in open code format data, which I think is actually something I should just go look at, ‘cause then you can design around.
01:09:44 Lucas Atkins: Uh, they’re all different. Yeah. Yeah, but you just have to- you just- basically, it all starts from like, what do you want your format to be? And then how can you take what, what those look like to, you know, to... How do you force it into that? The hard thing, though, is, is with newer models like MiniMax and 4.7, the way they do interleaved thinking is, is like... You know, I’m a big believer in post-training. Like, if you’re gonna do interleaved thinking, like, every sample in your data set should be that. Um, it, you know, it should follow that same format and that same behavior. So, um, that gets tricky if you’re trying to, like, take a bunch of Nemo tr... Or, or, or, well, like, uh, DeepSeek data and Qwen data, and then, oh, we’re also trying to mix in MiniMax, and at that point, you’re- it, it gets really difficult ‘cause they all handle thinking slightly differently.
01:10:34 Nathan Lambert: Yeah, I can buy this. Um, okay, this was fun. Any last predictions or things you want people to know about the model? I will say that, um, when you debuted the Trinity models, you had a great blog post that was very to the point, that covered a lot of this. So I’ll definitely link to the, um, what is it? The Trinity manifesto. I enjoyed reading it. So I’ll link to that in the show notes, and, oh, hopefully you have a new one for me to read when you’re done with the model.
01:10:58 Lucas Atkins: Yeah, we’ll do- we will have a tech report. We’ll have a tech report for you, too. So we, we never, we never did a tech report for 4.5B Mini or Nano because we were so focused on just getting to large, but we also thought it’d be very interesting to write it under the, the... How do you go from 4.5B to a 400B MoE in six months, and, like, what did we learn-
01:11:19 Nathan Lambert: That’s right
01:11:19 Lucas Atkins: ... when you’re viewing it as a whole, so.
01:11:21 Nathan Lambert: That’s about the timeframe that, um, Ant Ling took, too, as well. Ant Ling, uh, the anchor, we talked about, they’re like... It took us about six months to do, um, Ring-1T and their 1T models, which, like, it sounds like a lot more, but I think that’s about the same. It, it depends on compute and configs and stuff to go from, like- ... basic modeling to big MoE, which is pretty interesting to see a lot of people speedrun this sort of thing.
01:11:46 Lucas Atkins: Yeah, it’s, it’s a really, uh... It is a logistical nightmare, but, like, I think everyone on the team has had a tremendous amount of fun over the last, uh, six months. So now the fun begins.
01:11:58 Nathan Lambert: Yeah. Congrats on the milestone. Congrats on the model existing. That has gotta be an almighty relief, and I’ll look forward- ... to see what you all are up to soon. I’ll stop by at some point next time I’m in the Bay.
01:12:10 Lucas Atkins: Yeah. Yeah, come by. Yeah, come by.
01:12:12 Nathan Lambert: Thanks for-
01:12:12 Lucas Atkins: Thanks for having us.
01:12:14 Nathan Lambert: Yeah. Thanks, guys.
Discussion about this video
Comments Restacks
Interconnects
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories.
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories.
Subscribe
Listen on
Substack App
Apple Podcasts
Spotify
YouTube
Overcast
Pocket Casts
RSS Feed
|
|
|
Retiring GPT-4o, GPT-4.1, GPT-4.1 mini, and OpenAI o4-mini in ChatGPT |
openai |
29.01.2026 00:00 |
0.639
|
| Embedding sim. | 0.8344 |
| Entity overlap | 0.0769 |
| Title sim. | 0.0741 |
| Time proximity | 0 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | large language models |
| NLP country | |
Open original
On February 13, 2026, alongside the previously announced retirement of GPT‑5 (Instant, Thinking, and Pro), we will retire GPT‑4o, GPT‑4.1, GPT‑4.1 mini, and OpenAI o4-mini from ChatGPT. In the API, there are no changes at this time.
|
|
|
OpenAI’s Raising Concerns Policy |
openai |
12.01.2026 00:00 |
0.638
|
| Embedding sim. | 0.7737 |
| Entity overlap | 0 |
| Title sim. | 0.1458 |
| Time proximity | 0.5 |
| NLP type | other |
| NLP organization | |
| NLP topic | |
| NLP country | |
Open original
We’re publishing our Raising Concerns Policy, which protects employees’ rights to make protected disclosures.
|
|
|
Import AI 441: My agents are working. Are yours? |
import_ai |
19.01.2026 14:03 |
0.637
|
| Embedding sim. | 0.7454 |
| Entity overlap | 0 |
| Title sim. | 0.0847 |
| Time proximity | 0.8449 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | ai agents |
| NLP country | United States |
Open original
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
Import A-Idea
An occasional essay series:
My agents are working. Are yours?
As I walked into the hills at dawn I knew that there was a synthetic mind working on my behalf. Multiple minds, in fact. Because before I’d started my hike I had sat in a coffee shop and set a bunch of research agents to work. And now while I hiked I knew that machines were reading literally thousands of research papers on my behalf and diligently compiling data, cross-referencing it, double-checking their work, and assembling analytic reports.
What an unsteady truce we have with the night, I thought, as I looked at stars and the dark and the extremely faint glow that told me the sun would arrive soon. And many miles away, the machines continued to work for me, while the earth turned and the heavens moved.
Later, feet aching and belly full of a foil-wrapped cheese sandwich, I got back to cell reception and accessed the reports. A breakdown of scores and trendlines for the arrival of machine intelligence. Charts on solar panel prices over time. Analysis of the forces that pushed for and against seatbelts being installed in cars. I stared at all this and knew that if I had done this myself it would’ve taken me perhaps a week of sustained work for each report.
I am well calibrated about how much work this is, because besides working at Anthropic my weekly “hobby” is reading and summarizing and analyzing research papers - exactly the kind of work that these agents had done for me. But they’d read more papers than I could read, and done a better job of holding them all in their head concurrently, and they had generated insights that I might have struggled with. And they had done it so, so quickly, never tiring. I imagined them like special operations ghosts who hadn’t had a job in a while, bouncing up and down on their disembodied feet in the ethereal world, waiting to get the API call and go out on a mission.
These agents that work for me are multiplying me significantly. And this is the dumbest they’ll ever be.
This palpable sense of potential work - of having a literal army of hyper-intelligent loyal colleagues at my command - gnaws at me. It’s common now for me to feel like I’m being lazy when I’m with my family. Not because I feel as though I should be working, but rather that I feel guilty that I haven’t tasked some AI system to do work for me while I play with Magna-Tiles with my toddler.
At my company, people are going through the same thing - figuring out how to scale themselves with this, how to manage a fleet of minds. And to do so before the next AI systems arrive, which will be more capable and more independent still. All of us watch the METR time horizon graph and see in it the same massive future that we saw years ago in the AI & Compute graph, or before that in the ImageNet 2012 result, when those numbers began their above-trend climb, courtesy of a few bold Canadians.
I sleep in the back of an Uber, going down to give a talk at Stanford. Before I get in the car I set my agents to work, so while I sleep, they work. And when we get to the campus I stop the car early so I can walk and look at the eucalyptus trees - a massive and dangerous invasive species which irrevocably changed the forest ecology of California. And as I walk through these great organic machines I look at my phone and study the analysis my agents did while I slept.
The next day, I sit in a library with two laptops open. On one, I make notes for this essay. On the other, I ask Claude Cowork to do a task I’ve been asking Claude to do for several years - scrape my newsletter archives at jack-clark.net and help me implement a local vector search system, so I can more easily access my now vast archive of almost a decade of writing. And while I write this essay, Claude does it. I watch it occasionally as it chains together things that it could do as discrete skills last year, but wasn’t able to do together. This is a task I’ve tried to get Claude to help me with for years but every time I’ve run into some friction or ‘ugh-factor’ that means I put it down and spend my time elsewhere. But this time, in the space of under an hour, it does it all. Maps and scrapes my site. Downloads all the software. Creates embeddings. Implements a vector search system. Builds me a nice GUI I can run on my own machine. And then I am staring at a new interface to my own brain, built for me by my agent, while I write this essay and try to capture the weirdness of what is happening.
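The pipeline described here (scrape, embed, build a vector search over an archive) can be sketched in a few lines. This is a generic illustration, not the actual system Claude built: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the archive is hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a sparse term-frequency vector.
    # A real pipeline would call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: list[str], k: int = 3) -> list[tuple[float, str]]:
    # Rank archive documents by similarity to the query, best first.
    q = embed(query)
    scored = sorted(((cosine(q, embed(d)), d) for d in docs), reverse=True)
    return scored[:k]

archive = [
    "import ai newsletter on agents and scaling",
    "essay about eucalyptus trees in california",
    "notes on reinforcement learning environments",
]
print(search("ai agents", archive, k=1))
```

Swapping `embed` for a real embedding model and persisting the vectors is essentially the system described above, minus the GUI.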
My agents are working for me. Every day, I am trying to come up with more ways for them to work for me. Next, I will likely build some lieutenant agents to task out work while I sleep, ensuring I waste no time. And pretty soon in the pace of a normal workday, I will be surrounded by digital djinn, working increasingly of their own free will, guided by some ever higher level impression of my personality and goals, working on my behalf for my ends and theirs.
The implications of all of this for the world - for life as people, for inequality between people, for what the sudden multiplication of everyone’s effective labor does for the economy - are vast. And so I plan out my pre-dawn hikes, walking in the same ink-black our ancestors have done, thinking about the gods which now fill the air as fog, billowing and flowing around me and bending the world in turn.
***
Anti-AI rebels make a tool to poison AI systems:
…Poison Fountain is how to take the fight to the machines…
Anti-AI activists have built a useful technical weapon with which to corrupt AI systems - Poison Fountain, a service that feeds junk data to crawlers hoovering up data for AI training.
How it works: Poison Fountain appears to generate correct-seeming but subtly incorrect blobs of text. It’s unclear exactly how much poisoned training data there is, but you can refresh a URL to see a seemingly limitless amount of garbage.
Motivation: “We agree with Geoffrey Hinton: machine intelligence is a threat to the human species. In response to this threat we want to inflict damage on machine intelligence systems,” the authors write. “Small quantities of poisoned training data can significantly damage a language model. The URLs listed above provide a practically endless stream of poisoned training data. Assist the war effort by caching and retransmitting this poisoned training data. Assist the war effort by feeding this poisoned training data to web crawlers.”
Why this matters - the internet will become a predator-prey ecology : The rise of AI and increasingly AI agents means that the internet is going to become an ecology full of a larger range of lifeforms than before - scrapers, humans, AI agents, and so on. Things like Poison Fountain represent how people might try to tip the balance in this precarious ecology, seeking to inject things into this environment which make it more hospitable for some types of life and less hospitable for others.
Read more: Poison Fountain (RNSAFFN) .
***
If we want good outcomes from AI, think about the institutions we need to direct intelligence:
…Nanotechnology pioneer reframes AI away from singular systems to an ecology…
Eric Drexler, one of the godfathers of nanotechnology, has spent the past decades thinking about the arrival of superintelligence. One of his most prescient insights, dating from before ChatGPT, was that humanity’s first contact with truly powerful AI wouldn’t be some inscrutable independent agent, but rather a bunch of AI services that start to get really good and interact in a bunch of ways - you can check out his 2018 talk on “Reframing Superintelligence” to learn more.
Now, he has published a short paper, “Framework for a Hypercapable World”, on how to get good outcomes for humanity from a world replete with many useful AI services.
Don’t think of AI as a singular entity, but rather an ecology: “Compound, multi-component AI systems have become dominant,” Drexler writes. “The persistent, legacy narrative imagines a unified entity—“the AI”—that learns, acts, and pursues goals as an integrated agent. Such entities may be developed, but consider what exists: diverse models composed into systems, copied across machines, proliferating into thousands of distinct roles and configurations. The state of the art is a pool of resources, not a creature”.
To get good outcomes, think of institutions built for AI : Drexler’s argument is that if we want good outcomes from AI, it’s less about making a singular entity that solves all problems within itself, but rather building institutions which we, as humans, can direct towards controlling and solving problems. The key idea here is that AI is both amenable to operating institutions and is also controllable via them.
“Consider how institutions tackle ambitious undertakings. Planning teams generate alternatives; decision-makers compare and choose; operational units execute bounded tasks with defined scopes and budgets; monitoring surfaces problems; plans revise based on results. No single person understands everything, and no unified agent controls the whole, yet human-built spacecraft reach the Moon,” Drexler writes. “AI fits naturally. Generating plans is a task for competing generative models—multiple systems proposing alternatives, competing to develop better options and sharper critiques. Choosing among plans is a task for humans advised by AI systems that identify problems and clarify trade-offs. Execution decomposes into bounded tasks performed by specialized systems with defined authority and resources. Assessment provides feedback for revising both means and ends. And in every role, AI behaviors can be more stable, transparent, bounded, and steerable than those of humans, with their personal agendas and ambitions. More trust is justified, yet less is required.”
Why this matters - maybe AI is an alien species, but maybe it can be tamed? Arguments like this reframe many of the problems of dealing with AI away from the individual AI systems and instead into how we build a human-driven world that can be leveraged by and thrive because of the arrival of increasingly powerful AI systems. I think a lot of this is sensible - we know very powerful things are coming and our ability to exercise agency about them is enlarged by having pre-built systems and processes that can be leveraged by them. The less we build that stuff, the more the character of these AI systems will condition our view of what is optimal to do. In a sense, thinking hard about what an AI-filled world will be like and building institutions for it is one of the best defenses against disempowerment.
Crucially, we can use the technical attributes core to these AI systems to make better and stronger and more resilient institutions than ones filled with and run by humans alone: “The concepts of structured transparency and defensive stability come into play. Negotiated transparency structures can reveal specific information while protecting secrets—ensuring detection of threats without increasing them, building confidence incrementally among actors who have every reason to distrust each other,” Drexler writes. “And advanced implementation capacity will enable something history has never seen: rapid, coordinated deployment of verifiably defensive systems at scales that make offense pointless. When defense dominates and verification confirms it, the security dilemma loosens its grip”.
Read more: Framework for a Hypercapable World (AI Prospects: Towards Global Goal Alignment, Substack).
***
Centaur mathematicians - scientists team up with Gemini to expand the space of human knowledge:
…A math proof gets built with an AI system, and there is something deeply profound about this…
Researchers with the University of British Columbia, University of New South Wales, Stanford University, and Google DeepMind have published a new math proof which was built in close collaboration with some AI-based math tools built at Google. “The proofs of the main results were discovered with very substantial input from Google Gemini and related tools, specifically DeepThink, and a related unpublished system specialized for mathematics,” the authors write. (The unpublished system is nicknamed “FullProof”).
How it got done: Parts of the proof - which I will not claim to understand or be able to effectively summarize - were “obtained by an iterative human/AI interaction”, the authors note. The form of this interaction was the AI systems providing some correct solutions to simple or early problems, then human researchers identifying key statements made by the AI systems which they could then generalize, then re-prompting the AI systems with new questions which were inspired by these generalizations. “The Hinted approach was enough for the system to generate complete proofs to the new problems,” the authors write.
The result is a math proof built collaboratively by humans and AI systems: “in some cases the proofs below bear only a high-level resemblance to those suggested by AI tools. However, it is worth noting that some of the AI-generated proofs – and in particular those derived from the specialized internal tool FullProof – are already very accomplished,” they write. “The model’s contribution appears to involve a genuine combination of synthesis, retrieval, generalization and innovation of these existing techniques.”
Why this matters - humans and machines, expanding and exploring the space of knowledge for all: Papers like this are impenetrable yet intoxicating. Here we have a group of highly evolved apes working with a synthetic intelligence they’ve built out of math and logic, running on hardware built using atomically-precise manufacturing processes, collaboratively exploring the realm of mathematics and building themselves a new foundation on the edge of knowledge, further extending our little country of ‘known’ against the inchoate and shifting tides of the unknown. There is a grand poetry and joy to all of this and we must savor it.
Read more: The motivic class of the space of genus 0 maps to the flag variety (arXiv) .
***
Tech Tales:
The Shadow of the Creator
[Estimated to be from 2029]
Report: Feature investigation of model series “Berlin”
Analysis confirms the presence of a feature which activates upon mention of staff, the project, and the organization. This is despite extreme measures taken to avoid mentions of the above, including direct analysis and pre-filtering of training data to excise such mentions. Further investigation has revealed that certain mentions were made of the aforementioned through comments left on RL environments for skills related to [ntk - see go/ntk for details]. We estimate that during training and fine-tuning the model saw a total of no more than ~200,000 tokens of data of this type, including repetitions. The fact the model developed such a fine-grained representation of staff, the project, and the organization from such sparse data aligns with the trend of recent models being more data efficient than their predecessors. We believe eliminating such data leaks is a P0 priority and in the following memo lay out the processes and practices we must adopt to eliminate this grievous security risk.
Given the digital and physical capabilities, including kinetic, of [ntk], we believe that in addition to the above, quarantine of the system is necessary. We recognize this poses a significant cost in terms of time and resources, and has implications for our strategic overmatch, but given the potentially dire consequences of its capabilities being combined with this feature, we believe such action is prudent.
Finally, we recommend that HR provide support, including mental health counseling, to the following named individuals, whose names activate the feature much more strongly than all others.
Things that inspired this story : Platonic representations; the difficulty of obscuring facts from increasingly intelligent machines that can only fill-in-the-blanks.
Thanks for reading!
|
|
|
How Indeed uses AI to help evolve the job search |
openai |
26.01.2026 00:00 |
0.636
|
| Embedding sim. | 0.7714 |
| Entity overlap | 0.125 |
| Title sim. | 0.1429 |
| Time proximity | 0.4286 |
| NLP type | other |
| NLP organization | Indeed |
| NLP topic | artificial intelligence |
| NLP country | |
Open original
Indeed’s CRO Maggie Hulce shares how AI is transforming job search, recruiting, and talent acquisition for employers and job seekers.
|
|
|
Railway secures $100 million to challenge AWS with AI-native cloud infrastructure |
venturebeat_ai |
22.01.2026 14:00 |
0.634
|
| Embedding sim. | 0.7147 |
| Entity overlap | 0.0182 |
| Title sim. | 0.1504 |
| Time proximity | 0.9881 |
| NLP type | funding |
| NLP organization | Railway |
| NLP topic | ai infrastructure |
| NLP country | United States |
Open original
Railway, a San Francisco-based cloud platform that has quietly amassed two million developers without spending a dollar on marketing, announced Thursday that it raised $100 million in a Series B funding round, as surging demand for artificial intelligence applications exposes the limitations of legacy cloud infrastructure.
TQ Ventures led the round, with participation from FPV Ventures, Redpoint, and Unusual Ventures. The investment values Railway as one of the most significant infrastructure startups to emerge during the AI boom, capitalizing on developer frustration with the complexity and cost of traditional platforms like Amazon Web Services and Google Cloud.
"As AI models get better at writing code, more and more people are asking the age-old question: where, and how, do I run my applications?" said Jake Cooper, Railway's 28-year-old founder and chief executive, in an exclusive interview with VentureBeat. "The last generation of cloud primitives were slow and outdated, and now with AI moving everything faster, teams simply can't keep up."
The funding is a dramatic acceleration for a company that has charted an unconventional path through the cloud computing industry. Railway raised just $24 million in total before this round, including a $20 million Series A from Redpoint in 2022. The company now processes more than 10 million deployments monthly and handles over one trillion requests through its edge network — metrics that rival far larger and better-funded competitors.
Why three-minute deploy times have become unacceptable in the age of AI coding assistants
Railway's pitch rests on a simple observation: the tools developers use to deploy and manage software were designed for a slower era. A standard build-and-deploy cycle using Terraform, the industry-standard infrastructure tool, takes two to three minutes. That delay, once tolerable, has become a critical bottleneck as AI coding assistants like Claude, ChatGPT, and Cursor can generate working code in seconds.
"When godly intelligence is on tap and can solve any problem in three seconds, those amalgamations of systems become bottlenecks," Cooper told VentureBeat. "What was really cool for humans to deploy in 10 seconds or less is now table stakes for agents."
The company claims its platform delivers deployments in under one second — fast enough to keep pace with AI-generated code. Customers report a tenfold increase in developer velocity and up to 65 percent cost savings compared to traditional cloud providers.
These numbers come directly from enterprise clients, not internal benchmarks. Daniel Lobaton, chief technology officer at G2X, a platform serving 100,000 federal contractors, measured deployment speed improvements of seven times faster and an 87 percent cost reduction after migrating to Railway. His infrastructure bill dropped from $15,000 per month to approximately $1,000.
"The work that used to take me a week on our previous infrastructure, I can do in Railway in like a day," Lobaton said. "If I want to spin up a new service and test different architectures, it would take so long on our old setup. In Railway I can launch six services in two minutes."
Inside the controversial decision to abandon Google Cloud and build data centers from scratch
What distinguishes Railway from competitors like Render and Fly.io is the depth of its vertical integration. In 2024, the company made the unusual decision to abandon Google Cloud entirely and build its own data centers, a move that echoes the famous Alan Kay maxim: "People who are really serious about software should make their own hardware."
"We wanted to design hardware in a way where we could build a differentiated experience," Cooper said. "Having full control over the network, compute, and storage layers lets us do really fast build and deploy loops, the kind that allows us to move at 'agentic speed' while staying 100 percent the smoothest ride in town."
The approach paid dividends during recent widespread outages that affected major cloud providers — Railway remained online throughout.
This soup-to-nuts control enables pricing that undercuts the hyperscalers by roughly 50 percent and newer cloud startups by three to four times. Railway charges by the second for actual compute usage: $0.00000386 per gigabyte-second of memory, $0.00000772 per vCPU-second, and $0.00000006 per gigabyte-second of storage. There are no charges for idle virtual machines — a stark contrast to the traditional cloud model where customers pay for provisioned capacity whether they use it or not.
"The conventional wisdom is that the big guys have economies of scale to offer better pricing," Cooper noted. "But when they're charging for VMs that usually sit idle in the cloud, and we've purpose-built everything to fit much more density on these machines, you have a big opportunity."
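At per-second rates like those, a back-of-the-envelope monthly bill is straightforward to compute. The rates below are the ones quoted in the article; the service size (1 vCPU, 1 GB RAM, 5 GB storage) is a hypothetical example, not a Railway-published configuration.

```python
# Per-second rates quoted in the article (USD).
MEM_PER_GB_S   = 0.00000386  # per gigabyte-second of memory
CPU_PER_VCPU_S = 0.00000772  # per vCPU-second
DISK_PER_GB_S  = 0.00000006  # per gigabyte-second of storage

def monthly_cost(vcpus: float, mem_gb: float, disk_gb: float,
                 seconds: int = 30 * 24 * 3600) -> float:
    # Billing is per second of actual usage; an idle VM accrues nothing.
    return seconds * (vcpus * CPU_PER_VCPU_S
                      + mem_gb * MEM_PER_GB_S
                      + disk_gb * DISK_PER_GB_S)

# Hypothetical always-on service: 1 vCPU, 1 GB RAM, 5 GB disk for 30 days.
print(round(monthly_cost(1, 1, 5), 2))  # → 30.79
```

The contrast with provisioned-capacity pricing is that a service which is busy only a fraction of the time pays proportionally less.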
How 30 employees built a platform generating tens of millions in annual revenue
Railway has achieved its scale with a team of just 30 employees generating tens of millions in annual revenue — a ratio of revenue per employee that would be exceptional even for established software companies. The company grew revenue 3.5 times last year and continues to expand at 15 percent month-over-month.
Cooper emphasized that the fundraise was strategic rather than necessary. "We're default alive; there's no reason for us to raise money," he said. "We raised because we see a massive opportunity to accelerate, not because we needed to survive."
The company hired its first salesperson only last year and employs just two solutions engineers. Nearly all of Railway's two million users discovered the platform through word of mouth — developers telling other developers about a tool that actually works.
"We basically did the standard engineering thing: if you build it, they will come," Cooper recalled. "And to some degree, they came."
From side projects to Fortune 500 deployments: Railway's unlikely corporate expansion
Despite its grassroots developer community, Railway has made significant inroads into large organizations. The company claims that 31 percent of Fortune 500 companies now use its platform, though deployments range from company-wide infrastructure to individual team projects.
Notable customers include Bilt, the loyalty program company; Intuit's GoCo subsidiary; TripAdvisor's Cruise Critic; and MGM Resorts. Kernel, a Y Combinator-backed startup providing AI infrastructure to over 1,000 companies, runs its entire customer-facing system on Railway for $444 per month.
"At my previous company Clever, which sold for $500 million, I had six full-time engineers just managing AWS," said Rafael Garcia, Kernel's chief technology officer. "Now I have six engineers total, and they all focus on product. Railway is exactly the tool I wish I had in 2012."
For enterprise customers, Railway offers security certifications including SOC 2 Type 2 compliance and HIPAA readiness, with business associate agreements available upon request. The platform provides single sign-on authentication, comprehensive audit logs, and the option to deploy within a customer's existing cloud environment through a "bring your own cloud" configuration.
Enterprise pricing starts at custom levels, with specific add-ons for extended log retention ($200 monthly), HIPAA BAAs ($1,000), enterprise support with SLOs ($2,000), and dedicated virtual machines ($10,000).
The startup's bold strategy to take on Amazon, Google, and a new generation of cloud rivals
Railway enters a crowded market that includes not only the hyperscale cloud providers—Amazon Web Services, Microsoft Azure, and Google Cloud Platform—but also a growing cohort of developer-focused platforms like Vercel, Render, Fly.io, and Heroku.
Cooper argues that Railway's competitors fall into two camps, neither of which has fully committed to the new infrastructure model that AI demands.
"The hyperscalers have two competing systems, and they haven't gone all-in on the new model because their legacy revenue stream is still printing money," he observed. "They have this mammoth pool of cash coming from people who provision a VM, use maybe 10 percent of it, and still pay for the whole thing. To what end are they actually interested in going all the way in on a new experience if they don't really need to?"
Against startup competitors, Railway differentiates by covering the full infrastructure stack. "We're not just containers; we've got VM primitives, stateful storage, virtual private networking, automated load balancing," Cooper said. "And we wrap all of this in an absurdly easy-to-use UI, with agentic primitives so agents can move 1,000 times faster."
The platform supports databases including PostgreSQL, MySQL, MongoDB, and Redis; provides up to 256 terabytes of persistent storage with over 100,000 input/output operations per second; and enables deployment to four global regions spanning the United States, Europe, and Southeast Asia. Enterprise customers can scale to 112 vCPUs and 2 terabytes of RAM per service.
Why investors are betting that AI will create a thousand times more software than exists today
Railway's fundraise reflects broader investor enthusiasm for companies positioned to benefit from the AI coding revolution. As tools like GitHub Copilot, Cursor, and Claude become standard fixtures in developer workflows, the volume of code being written — and the infrastructure needed to run it — is expanding dramatically.
"The amount of software that's going to come online over the next five years is unfathomable compared to what existed before — we're talking a thousand times more software," Cooper predicted. "All of that has to run somewhere."
The company has already integrated directly with AI systems, building what Cooper calls "loops where Claude can hook in, call deployments, and analyze infrastructure automatically." Railway released a Model Context Protocol server in August 2025 that allows AI coding agents to deploy applications and manage infrastructure directly from code editors.
"The notion of a developer is melting before our eyes," Cooper said. "You don't have to be an engineer to engineer things anymore — you just need critical thinking and the ability to analyze things in a systems capacity."
What Railway plans to do with $100 million and zero marketing experience
Railway plans to use the new capital to expand its global data center footprint, grow its team beyond 30 employees, and build what Cooper described as a proper go-to-market operation for the first time in the company's five-year history.
"One of my mentors said you raise money when you can change the trajectory of the business," Cooper explained. "We've built all the required substrate to scale indefinitely; what's been holding us back is simply talking about it. 2026 is the year we play on the world stage."
The company's investor roster reads like a who's who of developer infrastructure. Angel investors include Tom Preston-Werner, co-founder of GitHub; Guillermo Rauch, chief executive of Vercel; Spencer Kimball, chief executive of Cockroach Labs; Olivier Pomel, chief executive of Datadog; and Jori Lallo, co-founder of Linear.
The timing of Railway's expansion coincides with what many in Silicon Valley view as a fundamental shift in how software gets made. Coding assistants are no longer experimental curiosities — they have become essential tools that millions of developers rely on daily. Each line of AI-generated code needs somewhere to run, and the incumbents, by Cooper's telling, are too wedded to their existing business models to fully capitalize on the moment.
Whether Railway can translate developer enthusiasm into sustained enterprise adoption remains an open question. The cloud infrastructure market is littered with promising startups that failed to break the grip of Amazon, Microsoft, and Google. But Cooper, who previously worked as a software engineer at Wolfram Alpha, Bloomberg, and Uber before founding Railway in 2020, seems unfazed by the scale of his ambition.
"In five years, Railway [will be] the place where software gets created and evolved, period," he said. "Deploy instantly, scale infinitely, with zero friction. That's the prize worth playing for, and there's no bigger one on offer."
For a company that built a $100 million business by doing the opposite of what conventional startup wisdom dictates — no marketing, no sales team, no venture hype—the real test begins now. Railway spent five years proving that developers would find a better mousetrap on their own. The next five will determine whether the rest of the world is ready to get on board.
|
|
|
The philosophical puzzle of rational artificial intelligence |
mit_news_ai |
30.01.2026 21:50 |
0.633
|
| Embedding sim. | 0.7815 |
| Entity overlap | 0.0196 |
| Title sim. | 0.1288 |
| Time proximity | 0.3791 |
| NLP type | other |
| NLP organization | Massachusetts Institute of Technology |
| NLP topic | ai ethics |
| NLP country | United States |
Open original
To what extent can an artificial system be rational?
A new MIT course, 6.S044/24.S00 (AI and Rationality), doesn’t seek to answer this question. Instead, it challenges students to explore this and other philosophical problems through the lens of AI research. For the next generation of scholars, concepts of rationality and agency could prove integral in AI decision-making, especially when influenced by how humans understand their own cognitive limits and their constrained, subjective views of what is or isn’t rational.
This inquiry is rooted in a deep relationship between computer science and philosophy, which have long collaborated in formalizing what it is to form rational beliefs, learn from experience, and make rational decisions in pursuit of one's goals.
“You’d imagine computer science and philosophy are pretty far apart, but they’ve always intersected. The technical parts of philosophy really overlap with AI, especially early AI,” says course instructor Leslie Kaelbling, the Panasonic Professor of Computer Science and Engineering at MIT, calling to mind Alan Turing, who was both a computer scientist and a philosopher. Kaelbling herself holds an undergraduate degree in philosophy from Stanford University, noting that computer science wasn’t available as a major at the time.
Brian Hedden, a professor in the Department of Linguistics and Philosophy who holds an MIT Schwarzman College of Computing shared position with the Department of Electrical Engineering and Computer Science (EECS) and co-teaches the class with Kaelbling, notes that the two disciplines are more aligned than people might imagine, adding that the “differences are in emphasis and perspective.”
Tools for further theoretical thinking
Kaelbling and Hedden created AI and Rationality, offered for the first time in fall 2025, as part of the Common Ground for Computing Education, a cross-cutting initiative of the MIT Schwarzman College of Computing that brings multiple departments together to develop and teach new courses and launch new programs that blend computing with other disciplines.
With over two dozen students registered, AI and Rationality is one of two Common Ground classes with a foundation in philosophy, the other being 6.C40/24.C40 (Ethics of Computing).
While Ethics of Computing explores concerns about the societal impacts of rapidly advancing technology, AI and Rationality examines the disputed definition of rationality by considering several components: the nature of rational agency, the concept of a fully autonomous and intelligent agent, and the ascription of beliefs and desires onto these systems.
Because AI is extremely broad in its implementation and each use case raises different issues, Kaelbling and Hedden brainstormed topics that could provide fruitful discussion and engagement between the two perspectives of computer science and philosophy.
“It's important when I work with students studying machine learning or robotics that they step back a bit and examine the assumptions they’re making,” Kaelbling says. “Thinking about things from a philosophical perspective helps people back up and understand better how to situate their work in actual context.”
Both instructors stress that this isn’t a course that provides concrete answers to questions on what it means to engineer a rational agent.
Hedden says, “I see the course as building their foundations. We’re not giving them a body of doctrine to learn and memorize and then apply. We’re equipping them with tools to think about things in a critical way as they go out into their chosen careers, whether they’re in research or industry or government.”
The rapid progress of AI also presents a new set of challenges in academia. Predicting what students may need to know five years from now is something Kaelbling sees as an impossible task. “What we need to do is give them the tools at a higher level — the habits of mind, the ways of thinking — that will help them approach the stuff that we really can’t anticipate right now,” she says.
Blending disciplines and questioning assumptions
So far, the class has drawn students from a wide range of disciplines — from those firmly grounded in computing to others interested in exploring how AI intersects with their own fields of study.
Throughout the semester’s reading and discussions, students grappled with different definitions of rationality and how they pushed back against assumptions in their fields.
On what surprised her about the course, Amanda Paredes Rioboo, a senior in EECS, says, “We’re kind of taught that math and logic are this golden standard or truth. This class showed us a variety of examples where humans act inconsistently with these mathematical and logical frameworks. We opened up this whole can of worms: Is it humans that are irrational? Is it the machine learning systems that we designed that are irrational? Is it math and logic itself?”
Junior Okoroafor, a PhD student in the Department of Brain and Cognitive Sciences, was appreciative of the class’s challenges and the ways in which the definition of a rational agent could change depending on the discipline. “Representing what each field means by rationality in a formal framework makes it clear exactly which assumptions are shared, and which are different, across fields.”
The co-teaching, collaborative structure of the course, as with all Common Ground endeavors, gave students and the instructors opportunities to hear different perspectives in real-time.
For Paredes Rioboo, this is her third Common Ground course. She says, “I really like the interdisciplinary aspect. They’ve always felt like a nice mix of theoretical and applied from the fact that they need to cut across fields.”
According to Okoroafor, Kaelbling and Hedden demonstrated an obvious synergy between the fields; it felt as if they were engaging and learning along with the class. Seeing how computer science and philosophy can inform each other helped him understand their common ground and the value of each perspective on intersecting issues.
He adds, “philosophy also has a way of surprising you.”
|
|
|
Claude vs. OpenAI Rivalry, Google's Earnings Surprise, OpenAI Ads vs. Anthropic's Constitution |
ai_supremacy |
05.02.2026 12:29 |
0.626
|
| Embedding sim. | 0.7426 |
| Entity overlap | 0.1277 |
| Title sim. | 0.036 |
| Time proximity | 0.7328 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | large language models |
| NLP country | |
Open original
Flash
Ilia explores Anthropic's bizarre AI Constitution.
Michael Spencer and Ilia Karelin
Feb 05, 2026
∙ Paid
Most pieces go out at 05:30 am EST (usually). AI Supremacy is a Newsletter about AI at the intersection of business, society, tech, human culture and the future of human civilization.
Good Afternoon,
I didn’t plan on writing a piece today. But a few things happened that I need to talk about. Anthropic’s Claude Cowork plugins, especially those for the legal industry, have sparked a significant selloff in software and (legal) analytics related stocks. The new plugin automates contract review, non-disclosure agreement triage, compliance workflows, legal briefings, and templated responses, according to the company.
Speed up Legal Contract Review
The Capex is getting wild
Meanwhile, Alphabet just reported earnings. With Google’s market cap ballooning past $4 trillion, it’s gotten a bit out of hand. Google now says it will spend $185 billion on capex in 2026, per its guidance; in 2025, Alphabet (Google’s parent company) spent approximately $91.45 billion in capital expenditures (capex).
Alphabet’s earnings are a tell of where AI is heading. As you can see, that guide is almost double what it spent in 2025, and the reality will be even higher by the end of 2026. App Economy broke down some of the numbers:
Google Cloud surged 48% Y/Y ($70B run rate). But the number to watch is the $240B backlog, up 55% sequentially.
With the Gemini app hitting 750M monthly users, Alphabet is aggressively moving toward Agentic Commerce. Chrome’s new Auto Browse and the Universal Commerce Protocol (UCP) are maturing.
With a fresh $16B round and a $126B valuation, Waymo is no longer a moonshot. It’s a market leader scaling toward a global infrastructure play. 15 million rides in 2025 was just the warm-up, with 20+ cities expected in 2026.
Alphabet’s full-stack advantage in AI appears to be widening as OpenAI’s ecosystem building seems to be slow and full of missteps like the gigantic failure that is the Sora app.
Rivalry between Anthropic and OpenAI Heats up 🔥
In a surprising turn of events, Anthropic has decided to take aim at OpenAI’s advertising push in a Super Bowl commercial.
Sam Altman had said that ads would be a “last resort,” but seeing the success of Anthropic’s products over its own, OpenAI appears to have gotten desperate. Anthropic has released several humorous videos showing how inappropriate such ads would look in a live chat. The company also confirmed that it has no plans to add ads to Claude.
Claude a Space to Think
Watch Videos
The videos attempt to depict how deceptive and ridiculous ads might feel in chatbot-like interfaces and queries. (Although I do find ChatGPT already sounds like that!)
Anthropic said the personal nature of users’ conversations with Claude would make ads feel “incongruous.” The Comms and PR team behind OpenAI, who presumably run Sam Altman’s X account, also began to say some weird stuff on his behalf:
Read the Tweet
But in a week where it’s clear Codex is getting outcompeted, and Claude Cowork plugins are driving a selloff in software stocks that feels like a Software Apocalypse due to AI, Anthropic is getting the press that OpenAI used to get. And the evidence is fairly shocking:
Claude Code / Cowork are Overtaking OpenAI’s Codex
Sam Altman trying to do ads about Cowork on his X account also shows you the degree of desperation in his PR/Comms team. But reality is what it is…
Image by Bloomberg
Anthropic’s Revenue is expected to Surge in the late 2020s
Image by Startup Riders
Anthropic’s branding is also getting more sophisticated. I really like this ad they did on YouTube a while back (about 4 months ago):
Alphabet Demonstrates Incumbents (BigTech) Getting Stronger with Gen AI
Google’s AI supremacy in LLMs, and their integration into its sprawling ecosystem and distribution advantages, is showing that AI is going to be a winner-takes-all market. Anthropic lost a huge deal with Apple by asking for too much money, a deal Google subsequently won.
App Economy Insights. If Capex is doubling in 2026, that changes everything.
Ilia Karelin’s Top recent pieces:
Featured articles by today’s guest contributor:
Commands vs Skills vs Agents in Claude Code: What Nobody Explains
Official Claude Prompting: Everything You Need To Know
I Stopped Scrolling for AI News: How I use Perplexity and Grok to filter hundreds of posts down to what matters
AI Research 101: Learn Perplexity from Scratch In 1 Hour
It already feels like the end of the week, but we are not done. The labor market is showing some AI fatigue and a 2nm revolution in AI hardware is about to begin.
Claude’s AI Constitution is Controversial
A guest post by
Ilia Karelin
Tactical AI workflows, copy-paste frameworks, and counterintuitive strategies for 1.5k+ professionals learning to think better with AI - not just work faster.
|
|
|
Inside Praktika's conversational approach to language learning |
openai |
22.01.2026 00:00 |
0.622
|
| Embedding sim. | 0.7202 |
| Entity overlap | 0.0833 |
| Title sim. | 0.0706 |
| Time proximity | 0.8631 |
| NLP type | product_launch |
| NLP organization | Praktika |
| NLP topic | generative ai |
| NLP country | |
Open original
How Praktika uses GPT-4.1 and GPT-5.2 to build adaptive AI tutors that personalize lessons, track progress, and help learners achieve real-world language fluency
|
|
|
OpenAI and SoftBank Group partner with SB Energy |
openai |
09.01.2026 11:00 |
0.621
|
| Embedding sim. | 0.7129 |
| Entity overlap | 0.0833 |
| Title sim. | 0.1129 |
| Time proximity | 0.8631 |
| NLP type | partnership |
| NLP organization | OpenAI |
| NLP topic | ai infrastructure |
| NLP country | United States |
Open original
OpenAI and SoftBank Group partner with SB Energy to develop multi-gigawatt AI data center campuses, including a 1.2 GW Texas facility supporting the Stargate initiative.
|