|
|
The creator of Claude Code just revealed his workflow, and developers are losing their minds |
venturebeat_ai |
05.01.2026 07:45 |
1
|
| Embedding sim. | 1 |
| Entity overlap | 1 |
| Title sim. | 1 |
| Time proximity | 1 |
| NLP type | other |
| NLP organization | Anthropic |
| NLP topic | software development |
| NLP country | |
Open original
When the creator of the world's most advanced coding agent speaks, Silicon Valley doesn't just listen — it takes notes.
For the past week, the engineering community has been dissecting a thread on X from Boris Cherny, the creator and head of Claude Code at Anthropic. What began as a casual sharing of his personal terminal setup has spiraled into a viral manifesto on the future of software development, with industry insiders calling it a watershed moment for the startup.
"If you're not reading the Claude Code best practices straight from its creator, you're behind as a programmer," wrote Jeff Tang, a prominent voice in the developer community. Kyle McNease, another industry observer, went further, declaring that with Cherny's "game-changing updates," Anthropic is "on fire," potentially facing "their ChatGPT moment."
The excitement stems from a paradox: Cherny's workflow is surprisingly simple, yet it allows a single human to operate with the output capacity of a small engineering department. As one user noted on X after implementing Cherny's setup, the experience "feels more like Starcraft" than traditional coding — a shift from typing syntax to commanding autonomous units.
Here is an analysis of the workflow that is reshaping how software gets built, straight from the architect himself.
How running five AI agents at once turns coding into a real-time strategy game
The most striking revelation from Cherny's disclosure is that he does not code in a linear fashion. In the traditional " inner loop " of development, a programmer writes a function, tests it, and moves to the next. Cherny, however, acts as a fleet commander.
"I run 5 Claudes in parallel in my terminal," Cherny wrote. "I number my tabs 1-5, and use system notifications to know when a Claude needs input."
By using iTerm2 system notifications, Cherny effectively manages five simultaneous work streams: one agent runs a test suite while another refactors a legacy module and a third drafts documentation. He also runs "5-10 Claudes on claude.ai" in his browser, using a "teleport" command to hand off sessions between the web and his local machine.
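Claude Code's hooks feature can run a shell command on agent events, which is one plausible way to wire up the notifications Cherny describes. A minimal sketch for `.claude/settings.json`, assuming macOS and the `Notification` hook event (event names and the exact schema may differ across versions, so check the current docs):

```json
{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "osascript -e 'display notification \"Claude needs input\" with title \"Claude Code\"'"
          }
        ]
      }
    ]
  }
}
```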
This validates the " do more with less " strategy articulated by Anthropic President Daniela Amodei earlier this week. While competitors like OpenAI pursue trillion-dollar infrastructure build-outs, Anthropic is proving that superior orchestration of existing models can yield exponential productivity gains.
The counterintuitive case for choosing the slowest, smartest model
In a surprising move for an industry obsessed with latency, Cherny revealed that he exclusively uses Anthropic's heaviest, slowest model: Opus 4.5.
"I use Opus 4.5 with thinking for everything," Cherny explained. "It's the best coding model I've ever used, and even though it's bigger & slower than Sonnet, since you have to steer it less and it's better at tool use, it is almost always faster than using a smaller model in the end."
For enterprise technology leaders, this is a critical insight. The bottleneck in modern AI development isn't the generation speed of the token; it is the human time spent correcting the AI's mistakes. Cherny's workflow suggests that paying the "compute tax" for a smarter model upfront eliminates the "correction tax" later.
One shared file turns every AI mistake into a permanent lesson
Cherny also detailed how his team solves the problem of AI amnesia. Standard large language models do not "remember" a company's specific coding style or architectural decisions from one session to the next.
To address this, Cherny's team maintains a single file named CLAUDE.md in their git repository. "Anytime we see Claude do something incorrectly we add it to the CLAUDE.md, so Claude knows not to do it next time," he wrote.
This practice transforms the codebase into a self-correcting organism. When a human developer reviews a pull request and spots an error, they don't just fix the code; they tag the AI to update its own instructions. "Every mistake becomes a rule," noted Aakash Gupta, a product leader analyzing the thread. The longer the team works together, the smarter the agent becomes.
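A CLAUDE.md is plain markdown checked into the repository root. The contents of Cherny's team file are not public; the sketch below is entirely hypothetical, showing the kind of accumulated rules the article describes:

```markdown
# CLAUDE.md

## Build & test
- Run the full test suite before proposing any commit.

## Conventions
- Use the shared logger; never call `console.log` directly.

## Lessons learned (added after code reviews)
- Do not edit generated files under `dist/`.
- API handlers must validate input with the shared schema helpers.
```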
Slash commands and subagents automate the most tedious parts of development
The "vanilla" workflow one observer praised is powered by rigorous automation of repetitive tasks. Cherny uses slash commands — custom shortcuts checked into the project's repository — to handle complex operations with a single keystroke.
He highlighted a command called /commit-push-pr , which he invokes dozens of times daily. Instead of manually typing git commands, writing a commit message, and opening a pull request, the agent handles the bureaucracy of version control autonomously.
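Custom slash commands in Claude Code are markdown prompt files checked into `.claude/commands/`. The actual contents of Cherny's command are not public; a hedged sketch of what a `/commit-push-pr` prompt could look like:

```markdown
<!-- .claude/commands/commit-push-pr.md -->
Commit the current changes, push, and open a pull request:

1. Run `git status` and `git diff` to review what changed.
2. Write a concise, conventional commit message.
3. Commit, push the branch, and open a PR with `gh pr create`,
   summarizing the change and how it was tested.
```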
Cherny also deploys subagents — specialized AI personas — to handle specific phases of the development lifecycle. He uses a code-simplifier to clean up architecture after the main work is done and a verify-app agent to run end-to-end tests before anything ships.
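Subagents are defined as markdown files with YAML frontmatter under `.claude/agents/`. A sketch of what a `code-simplifier` definition might look like (the fields and prompt here are illustrative, not Cherny's actual configuration):

```markdown
---
name: code-simplifier
description: Simplify and clean up code after the main implementation lands.
tools: Read, Edit, Bash
---
You are a refactoring specialist. Reduce duplication, remove dead code,
and flatten needless abstractions without changing behavior. Run the
test suite after every change and revert anything that breaks it.
```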
Why verification loops are the real unlock for AI-generated code
If there is a single reason Claude Code has reportedly hit $1 billion in annual recurring revenue so quickly, it is likely the verification loop. The AI is not just a text generator; it is a tester.
"Claude tests every single change I land to claude.ai/code using the Claude Chrome extension," Cherny wrote. "It opens a browser, tests the UI, and iterates until the code works and the UX feels good."
He argues that giving the AI a way to verify its own work — whether through browser automation, running bash commands, or executing test suites — improves the quality of the final result by "2-3x." The agent doesn't just write code; it proves the code works.
What Cherny's workflow signals about the future of software engineering
The reaction to Cherny's thread suggests a pivotal shift in how developers think about their craft. For years, "AI coding" meant an autocomplete function in a text editor — a faster way to type. Cherny has demonstrated that it can now function as an operating system for labor itself.
"Read this if you're already an engineer... and want more power," Jeff Tang summarized on X.
The tools to multiply human output by a factor of five are already here. They require only a willingness to stop thinking of AI as an assistant and start treating it as a workforce. The programmers who make that mental leap first won't just be more productive. They'll be playing an entirely different game — and everyone else will still be typing.
|
|
|
8 plots that explain the state of open models |
interconnects |
07.01.2026 15:07 |
0.672
|
| Embedding sim. | 0.7923 |
| Entity overlap | 0.05 |
| Title sim. | 0.0561 |
| Time proximity | 0.8567 |
| NLP type | other |
| NLP organization | Qwen |
| NLP topic | large language models |
| NLP country | China |
Open original
8 plots that explain the state of open models
Measuring the impact of Qwen, DeepSeek, Llama, GPT-OSS, Nemotron, and all of the new entrants to the ecosystem.
Nathan Lambert
Jan 07, 2026
Starting 2026, most people are aware that a handful of Chinese companies are making strong, open AI models that are applying increasing pressure on the American AI economy.
While many Chinese labs are making models, the adoption metrics are dominated by Qwen (with a little help from DeepSeek). Adoption of the new entrants in the open model scene in 2025, from Z.ai, MiniMax, Kimi Moonshot, and others, is actually quite limited. This sets up a position where dethroning Qwen in overall adoption in 2026 looks impossible, but there are areas of opportunity. In fact, the strength of GPT-OSS shows that the U.S. could very well have the smartest open models again in 2026, even if they're used far less across the ecosystem.
The following plots are from a comprehensive update of the data supporting The ATOM Project ( atomproject.ai ) with our expanded ecosystem measurement tools we use to support our monthly open model roundups, Artifacts Log .
1. China has a growing lead in every adoption metric
Models from the US and the EU defined the early eras of open language models. 2025 saw the end of Llama, and Qwen triumphantly took its spot as the default model of choice across a variety of tasks, from local LLMs to reasoning models and multimodal tools. The adoption of Chinese models continues to accelerate.
These first two plots show the cumulative downloads of all LLMs released after ChatGPT that we consider representative of the ecosystem (we're tracking 1152 in total right now).
2. The West isn’t close to replacing Llama
Where the previous figure showed China's lead in overall downloads increasing, it is increasingly precarious for supporters of Western open models that Llama models — despite no longer being updated or supported by their creator, Meta — are still by far the most downloaded Western models in recent months. OpenAI's GPT-OSS models are the only models from a new provider in the second half of 2025 showing early signs of shifting the balance of overall downloads between American and Chinese providers (OpenAI's two models got about the same monthly downloads at the end of 2025 as all of DeepSeek's or Mistral's models).
What is a HuggingFace download? HuggingFace registers a download for any web request to the model's storage buckets (e.g., wget, curl), so it is a very noisy metric. Still, it's the best we have. Due to this noise, when measuring adoption via how many finetunes a model has, we filter to derivative models with >5 downloads. Even so, downloads are the standard way of measuring adoption of open models.
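The filtering step is simple to state in code. A minimal sketch, with made-up derivative-model records, of keeping only finetunes with more than 5 downloads and tallying each base model's share:

```python
from collections import Counter

# Hypothetical derivative-model records: (base_model, downloads).
derivatives = [
    ("Qwen/Qwen3-8B", 120), ("Qwen/Qwen3-8B", 3),
    ("meta-llama/Llama-3.1-8B", 40), ("Qwen/Qwen3-4B", 9),
    ("mistralai/Mistral-7B-v0.3", 2),
]

# Keep only meaningful finetunes: strictly more than 5 downloads.
meaningful = [(base, dl) for base, dl in derivatives if dl > 5]

# Share of meaningful finetunes per base model.
share = Counter(base for base, _ in meaningful)
total = sum(share.values())
for base, n in share.most_common():
    print(f"{base}: {n / total:.0%} of meaningful finetunes")
```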
3. New organizations barely show up in adoption metrics
While much has been said (including by me, on Interconnects) about new open frontier model providers, their adoption tends to look like a rounding error. These models from Z.ai, Nvidia, Kimi Moonshot, and MiniMax are crucial to developing local ecosystems, but they are not competing with Qwen to be the open model standard.
Note the different y-axes between this plot and the previous one; DeepSeek and OpenAI are included in both for scale. This plot shows downloads just since July 2025 to showcase recent performance.
4. Qwen’s weakness is in large model adoption
One of the most surprising things in the data is just how successful DeepSeek’s large models are (particularly both versions of V3 and R1). These 4 large models dominate the adoption numbers of any of Qwen’s large MoE/dense models over the last few years. It’s only at these large scales where opportunities to compete with Qwen exist, and with the rise of more providers like Z.ai, MiniMax, and Kimi, we’ll be following this closely. These large models are crucial tools right now for many startups based in the U.S. trying to finetune their own frontier model for applications — e.g. Cursor’s Composer model is finetuned from a large Chinese MoE.
5. A few models from Qwen dwarf new entrants
While Qwen has one Achilles' heel right now, its recent models totally dominate every HuggingFace metric. If we look at the top 5 most-downloaded Qwen3 models just in December (Qwen3-[0.6B, 1.7B, 4B (Original), 8B, & 4B-Instruct-2507]), they have more downloads than all of the models we're tracking from OpenAI, Mistral AI, Nvidia, Z.ai, Moonshot AI, and MiniMax combined.
This is the advantage that Qwen has built and will take year(s) to unwind.
6. In December, Qwen got more downloads than roughly the entire rest of the open ecosystem
If we account for every meaningful Qwen LLM released since ChatGPT, Qwen's December downloads outnumber those of literally every other organization we're tracking combined. This includes the six organizations from the previous figure, along with DeepSeek and Meta, the second and third most downloaded creators.
7. People are still finetuning Qwen more than anything else
The other primary way we can measure Qwen’s adoption lead is to look at the share of derivative models on HuggingFace (filtered to only those with >5 downloads to indicate a meaningful finetune) that come from a certain base model. Qwen’s share here continued to grow throughout 2025, and we’ll be watching this closely around the likely release of Qwen 4.
Despite the dramatic increase in the number of players releasing open models in 2025, the share of finetuned models has concentrated among the 5 organizations we highlighted below (Qwen, Llama, Mistral, Google, and DeepSeek).
8. China still has the smartest open models
The primary factor that drives the adoption and influence of Chinese open models today is that they're the smartest open models available. There's a variety of second-order issues, such as licenses, model sizes, documentation, and developer engagement, but for over a year now, Chinese open models have been the smartest on most benchmarks.
GPT-OSS 120B came close to retaking the lead (slightly behind MiniMax M2), but it wasn't quite there. It'll be fascinating to watch whether upcoming Nemotron, Arcee, or Reflection AI models can buck this trend. If you look at metrics other than the Artificial Analysis intelligence index, the same trends hold.
Thanks for reading! Please reach out or leave a comment if there’s a corner of the data you think we should spend more time in. Stay tuned for more updates on The ATOM Project and related efforts in the near future.
|
|
|
TAI #186: Claude Code and the Christmas Awakening: Why CLI Agents Are Winning the Agentic Race |
towards_ai |
06.01.2026 15:03 |
0.669
|
| Embedding sim. | 0.7547 |
| Entity overlap | 0.2903 |
| Title sim. | 0.1938 |
| Time proximity | 0.8137 |
| NLP type | product_launch |
| NLP organization | Anthropic |
| NLP topic | software development |
| NLP country | |
Open original
What happened this week in AI by Louie
This week was quieter on major model releases, though DeepSeek published a paper on Manifold-Constrained Hyper-Connections (mHC), a training technique that improves stability when scaling up model size. We think this could be significant when integrated into their next-generation models. But the AI community’s attention turned to something arguably more transformative: how people are actually using these models. Over the Christmas break, a wave of new users discovered Claude Code, Anthropic’s terminal-based agentic coding assistant, and many are calling it a genuine step change in what AI can accomplish. The combination of Opus 4.5’s release in November and holiday downtime created perfect conditions for exploration. Social media has been flooded with reports of developers shipping projects in hours that would have taken weeks, and perhaps more surprisingly, non-technical users automating tasks they never thought possible.
Claude Code is Anthropic’s command-line tool that gives Claude direct access to your file system, terminal, and local environment. Unlike chatbot interfaces, which require you to manually provide context by copying and pasting, Claude Code can read your entire codebase, edit multiple files coherently, run your test suite, and iterate until things work. The AI navigates your file system and finds what it needs itself, rather than relying on you to assemble the relevant context.
The Opus 4.5 upgrade appears to have crossed a critical threshold. Users consistently report that it eliminates the “slop code” problem that plagued earlier models, where AI-generated code was functional but poorly structured and hard to maintain. Opus 4.5 produces code that experienced developers actually want to keep. It understands architectural patterns, creates appropriate abstractions, and can debug its own work by writing minimal reproducible examples. Anthropic’s internal surveys indicate that engineers now rely on Claude for 60% of their daily work, with a mean productivity improvement of 220% reported across the team. However, individual results vary significantly by workflow and level of expertise.
CLI agents vs. IDE tools vs. chatbot interfaces
The distinction between Claude Code and tools like Cursor is more nuanced than many realize. Cursor also has a full agent mode, not just autocomplete. Both can autonomously execute multi-step coding tasks. The real differences lie elsewhere.
Claude Code runs in your terminal and treats your entire computer as its workspace. It can chain together shell commands, access external services through MCP (Model Context Protocol) integrations, run scripts, and work across applications. Cursor is an IDE-first experience, essentially VS Code rebuilt with AI at its core. It offers visual diffs, familiar keybindings, and a polished review flow where you can accept or reject changes file by file.
Claude Code tends to include more of your codebase in each request, which improves understanding but increases costs. Some comparisons suggest Claude Code costs roughly four times as much as Cursor for similar tasks, though the higher context often yields better results for complex refactors.
The philosophical difference is telling. One developer described it this way: with Cursor, you drive, and AI assists. With Claude Code, AI drives, and you supervise. Many developers use both. You can run Claude Code inside Cursor’s terminal, using the IDE for visual editing and summoning Claude Code when you need deep reasoning on a complex problem.
For non-technical users, the comparison extends to chatbot interfaces like ChatGPT. With Claude Code, you can say “analyze all the spreadsheets in this folder, identify trends, and create a summary report,” and it handles the entire process. Non-technical users are leveraging this for tasks such as reorganizing thousands of files by content, extracting insights from contracts, processing research papers, and automating administrative workflows.
However, the CLI interface will not be the mainstream way this capability reaches most users. Terminals remain intimidating for people who have spent years in graphical interfaces. Even some experienced developers find it hard to adjust to CLI-based coding after working in IDEs. Claude Code does offer VS Code integration, but most users report better results in the terminal, where the complete agentic loop operates more naturally. The future likely involves more user-friendly interfaces that retain this agentic file system access.
This momentum poses a challenge to Microsoft’s strategy of infusing each application with its own focused AI assistant. The bet that people want Copilot for Excel, Copilot for Word, and Copilot for PowerPoint as separate experiences looks increasingly questionable as users gravitate toward agents that work across applications. When you can tell a single agent to analyze spreadsheets, summarize findings, and create a presentation, switching between three different AI assistants feels cumbersome. OpenAI’s Codex, Google’s Antigravity, and Anthropic’s Claude Code are all betting on this general-purpose agent model.
How Boris Cherny, the creator of Claude Code, actually uses it
Boris Cherny shared his personal setup this week, describing it as "surprisingly vanilla." But reading through his workflow reveals just how far from vanilla it would seem to most users, and it implies that others at Anthropic run even more complex configurations.
Boris runs five Claude instances in parallel in his terminal, numbered 1 through 5, using system notifications to know when any instance needs input. He also runs another five to ten sessions on the web version of Claude Code simultaneously, frequently handing off sessions between local and web using the teleport feature. He kicks off sessions from his phone each morning and checks in on them later.
For model selection, Boris uses Opus 4.5 with thinking mode for everything. While it is larger and slower than Sonnet, he finds that the reduced need for steering and better tool use make it faster overall for completing actual tasks.
His team shares a single CLAUDE.md file that is checked into Git. This file serves as the project’s working agreement with Claude, containing build commands, style conventions, architectural boundaries, and definitions of done. Any time anyone sees Claude do something incorrectly, they add a rule to CLAUDE.md so it does not happen again. This creates a compounding effect where Claude gets better at each specific codebase over time.
Most sessions start in Plan mode (shift+tab twice). He goes back and forth with Claude until he likes the plan, then switches to auto-accept edits mode, where Claude can usually execute in one shot. Getting the plan right is critical.
He uses slash commands for every “inner loop” workflow he performs multiple times daily, like a /commit-push-pr command that he and Claude use dozens of times every day. Subagents handle specialized workflows: code-simplifier cleans up code after Claude finishes, verify-app runs end-to-end tests. A PostToolUse hook automatically formats Claude’s code. MCP integrations let Claude search and post to Slack, run BigQuery queries, and grab error logs from Sentry.
Perhaps most importantly, Boris emphasizes giving Claude a way to verify its work. Claude tests every change before landing using the Claude Chrome extension, which opens a browser, tests the UI, and iterates until the code works and the user experience feels right. This verification loop improves the quality of results by two to three times.
The gap between Boris’s setup and how most people use Claude Code highlights a broader challenge in AI adoption. Setting up an effective workflow with these tools is far from straightforward. It requires understanding permission modes, context management, hooks, MCP integrations, and verification strategies. The productivity gains require significant time investment to unlock.
The repo maintenance question
Agentic coding also raises new questions about codebase organization. When AI writes and modifies code at high speed, repositories can quickly become messy. Tidying up can consume significant time. But this raises a genuine question: how neat and human-readable do these repos need to be anymore if you are primarily using AI to code and review them?
Claude Code can do a good job of refactoring and tidying up its own repos, but it usually needs detailed rules and workflow instructions to do so consistently. This is another area where investing in CLAUDE.md files and custom commands pays off. Without explicit guidance, agentic coding tends to accrue technical debt more quickly than traditional development. With the proper guardrails, Claude Code can maintain cleaner codebases than many human developers, but getting those guardrails right takes work.
Why should you care?
The Christmas surge in Claude Code adoption signals we may be entering a new phase of AI interaction where agentic tools that can navigate your files, execute commands, and chain together workflows become more valuable than chat interfaces. Power requires access. A chatbot interface is sandboxed by design. An agent with file system access, shell execution, and external integrations can actually do the work. The trade-off is that wielding that power requires more skill and carries more risk.
For technical users, investing time in learning CLI agents is likely to pay dividends. The productivity improvements reported by power users are not available to someone using ChatGPT to generate code snippets they paste into their editor. But the learning curve is real, and Boris’s “vanilla” setup would take most developers considerable time to replicate.
For non-technical users, these tools are genuinely helpful for tasks like file organization, data analysis, and research automation. But the CLI will not be how most people access these capabilities. The future likely involves more accessible interfaces that retain the agentic power.
The agentic AI era is arriving faster than many expected. The models are ready. The tooling is maturing. The question now is how quickly people can learn to use them effectively, and how quickly more accessible interfaces will bring these capabilities to everyone else. The winners will be determined by which agents can deliver the reliability and verification loops that make them trustworthy for real work.
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. DeepSeek Releases Hyper-Connections for Transformers
DeepSeek introduces mHC (Manifold-Constrained Hyper-Connections) to scale training while preserving stability. mHC targets a common scaling tension: increasing internal information flow can improve capability, but it can also destabilize training; the method constrains hyper-connection residuals by projecting them onto a defined manifold to restore identity-mapping behavior while keeping the system efficient. DeepSeek reports empirical gains in large-scale pretraining experiments (including MoE-based variants inspired by prior DeepSeek work), positioning mHC as a training-stack improvement rather than a product feature, with the full technical details published on arXiv.
2. GPT-5.2 Pro Tops FrontierMath T4
OpenAI’s GPT-5.2 strengthens science/math performance, with new FrontierMath results and clearer context on what Tier 4 represents. OpenAI reports GPT-5.2 Thinking at 14.6% on FrontierMath Tier 4 (with Python enabled) and 40.3% on Tier 1–3, while also showing broad gains across science-heavy benchmarks (e.g., GPQA Diamond) and publishing a subset of GPT-5.2 Pro numbers for several evaluations. Epoch AI notes Tier 4 is the research-level slice of FrontierMath, 50 problems written as short-term research projects, so progress there is treated as a meaningful capability signal rather than routine test-taking.
3. Alibaba Qwen Open Sourced Qwen-Image-2512
Qwen releases Qwen-Image-2512, a December update to its open text-to-image model. The update focuses on three visible quality lifts: more reliable text rendering and layout, more realistic human generation (reduced “AI-generated” look), and finer natural textures (e.g., landscapes and fur). The weights are available through Hugging Face and ModelScope, with an interactive demo on Hugging Face Spaces. Results from 10,000 rounds of blind model evaluations on AI Arena show that Qwen-Image-2512 is currently the strongest open-source model.
4. Tencent Researchers Release Tencent HY-MT1.5
Tencent releases HY-MT1.5, a new open machine-translation model family. HY-MT1.5 ships in two sizes (1.8B and 7B parameters) and is trained with a multi-stage pipeline that combines general + MT-oriented pretraining, supervised fine-tuning, strong-to-weak on-policy distillation, and reinforcement learning to balance quality with deployment efficiency. Beyond “plain translation,” the models support practical constraint controls, such as terminology injection, context-aware translation, and format preservation for structured documents. Tencent also points to quantization options for edge or high-throughput deployments and provides model weights on Hugging Face, along with an accompanying code repository for use and integration.
5. OpenAI Ramps Up Audio AI Efforts Ahead of Device
OpenAI ramps up its audio-model push ahead of an audio-first device, with a new architecture targeted for 2026. Reporting around the effort describes OpenAI consolidating previously separate audio teams and rebuilding core infrastructure so audio can be treated as a first-class modality (not just “text, then voice”), aiming to close gaps in latency, accuracy, and natural conversational flow. The plan centers on a new audio-model architecture expected in Q1 2026, alongside longer-term work on voice-first hardware form factors, with leadership reportedly tied to talent brought in from Character.AI.
6. IQuest-Coder Beats Claude Sonnet 4.5 on Coding Benchmarks
IQuestLab releases IQuest-Coder-V1, an open-source code LLM family tuned for autonomous software engineering. The lineup includes 7B, 14B, and 40B variants with 128K native context, plus “Instruct” and “Thinking” options and a “Loop” variant built around a recurrent-style mechanism for a better capacity–deployment trade-off. The project highlights “Code-Flow” training, learning from repository evolution and commit transitions rather than static snapshots. It scored 76.2% on SWE-Bench Verified, 81.1% on LiveCodeBench v6, and 49.9% on BigCodeBench.
Five 5-minute reads/videos to keep you learning
1. Understanding Retrieval in RAG Systems: Why Chunk Size Matters
This article examines the critical role of the retrieval step in RAG systems by isolating its mechanics from the generation component. The author demonstrates how varying the text chunk size (80, 220, and 500 characters) directly affects performance. The analysis shows that small chunks lack sufficient context, medium ones can be unstable, while larger chunks yield more robust results. It also introduces a method for handling uncertainty, which uses the similarity score gap between top results to identify and flag ambiguous situations, preventing the system from providing a potentially incorrect answer when it has low confidence.
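Both ideas are easy to reproduce. A minimal sketch of fixed-size character chunking at the three sizes the article tests, plus the similarity-gap heuristic it describes (the 0.05 gap threshold here is an assumption, not the article's value):

```python
def chunk(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def is_ambiguous(scores: list[float], min_gap: float = 0.05) -> bool:
    """Flag a retrieval as ambiguous when the top two similarity
    scores are too close to call (gap heuristic; threshold is
    illustrative)."""
    top = sorted(scores, reverse=True)
    return len(top) > 1 and (top[0] - top[1]) < min_gap

doc = "word " * 200  # stand-in corpus text, 1000 characters
for size in (80, 220, 500):
    print(size, "chars ->", len(chunk(doc, size)), "chunks")

print(is_ambiguous([0.82, 0.81, 0.40]))  # True: top two nearly tied
print(is_ambiguous([0.90, 0.55]))        # False: clear winner
```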
2. Deep Compression, 2015: How Much More Can We Squeeze in 2025?
This article revisits the 2015 Deep Compression paper, first reproducing its pipeline of pruning, retraining, and quantization on the LeNet model, achieving a ~22x compression rate while maintaining accuracy. It then introduces a novel, TF-IDF-inspired pruning score that identifies important parameters based on activation patterns. This computationally lighter method improved upon the baseline, pushing the model’s compression up to ~65x with minimal impact on accuracy after retraining.
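The flavor of a TF-IDF-style importance score can be sketched in a few lines. This toy version is my paraphrase of the idea, not the article's exact formula: a unit's mean activation magnitude plays the role of term frequency, and units that fire on nearly every input are down-weighted, IDF style, before pruning the lowest scorers:

```python
import math

def tfidf_prune_scores(activations: list[list[float]]) -> list[float]:
    """activations[i][j] is the activation of unit j on input i.
    Score each unit like a TF-IDF term: mean activation magnitude
    ("term frequency") times an inverse-document-frequency factor
    that penalizes units active on almost every input."""
    n_inputs = len(activations)
    n_units = len(activations[0])
    scores = []
    for j in range(n_units):
        col = [abs(row[j]) for row in activations]
        tf = sum(col) / n_inputs                      # mean magnitude
        doc_freq = sum(1 for a in col if a > 0.0)     # inputs where unit fires
        idf = math.log(n_inputs / (1 + doc_freq)) + 1.0
        scores.append(tf * idf)
    return scores

# Unit 0 fires on every input (low IDF); unit 1 fires selectively
# but strongly, so it should score higher and survive pruning.
acts = [[1.0, 0.0], [1.0, 3.0], [1.0, 0.0], [1.0, 3.0]]
scores = tfidf_prune_scores(acts)
print(scores)
```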
3. Gemini 3.0 Flash + MistralOCR 3 + RAG Just Revolutionized Agent OCR Forever
This article explains how to combine Mistral OCR 3 and Google’s Gemini 3.0 Flash to build a document processing and chat application. It highlights Mistral OCR’s ability to accurately extract structured text and tables from documents and convert them to Markdown. The extracted content is then used by Gemini 3.0 Flash, a fast and efficient model, to power a chat interface. This allows users to ask questions about the uploaded document. The piece includes a step-by-step guide and code for creating the Streamlit application, providing a practical example of this integration.
4. Why Humans Are Not Reinforcement Learning Agents And Why This Matters for AI
While reinforcement learning (RL) is a cornerstone of modern AI, it operates on assumptions that human decision-making consistently violates. This analysis explores the fundamental mismatch, noting that human rewards are unstable, influenced by emotion, and subject to time-inconsistent preferences. Humans also actively construct their reality rather than just reacting to fixed states, often relying on heuristics and identity to guide actions. The author suggests that acknowledging these differences is key to developing AI that can effectively support the complexity of human judgment, rather than simply optimizing for a fixed goal.
5. Beyond Vectors: A Deep Dive into Modern Search in Qdrant
To address the complexity of modern user queries, this piece details the construction of a hybrid search system using Qdrant. It demonstrates how to combine dense vectors for semantic understanding, sparse vectors for keyword precision, and full-text indexing for exact-match requirements. It also explores advanced techniques like ASCII-folding for multilingual support and ACORN for efficient, filter-aware vector searches. It also provides a practical e-commerce implementation to show how these elements are integrated into a single, effective retrieval pipeline that balances user intent with specific constraints.
Repositories & Tools
1. Cs249r Book is the open learning stack for AI systems engineering. It includes the textbook source, TinyTorch, hardware kits, and upcoming co-labs.
2. LLM Pruning Collection gathers various LLM pruning implementations, training code for GPUs & TPUs, and an evaluation script.
3. CPython, the reference Python interpreter, is now at version 3.15.0 alpha 3.
4. OpenBB is the open-source toolset for integrating proprietary, licensed, and public data sources into downstream applications.
Top Papers of The Week
1. TurboDiffusion: Accelerating Video Diffusion Models by 100–200 Times
TurboDiffusion accelerates end-to-end video diffusion generation by 100–200x while maintaining video quality. The framework speeds attention with low-bit SageAttention and trainable Sparse-Linear Attention, compresses sampling steps via rCM-based step distillation, and applies W8A8 quantization to model parameters and activations. Experiments on multiple Wan2.x I2V and T2V models confirm the speedups on a single RTX 5090 GPU.
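The W8A8 component refers to storing weights and activations in 8-bit integers. A minimal sketch of symmetric int8 quantization, the simplest form of this family (TurboDiffusion's actual scheme is more involved):

```python
# Symmetric int8 quantization: scale values by the maximum magnitude,
# round into [-127, 127], and dequantize by multiplying the scale back.
# Values are illustrative.

def quantize_int8(xs):
    scale = max(abs(v) for v in xs) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.031, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # small reconstruction error
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why 8-bit storage preserves quality while halving memory traffic versus 16-bit.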
2. Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
Youtu-LLM introduces a 1.96B-parameter lightweight language model pre-trained from scratch to cultivate reasoning and planning. The model uses a dense Multi-Latent Attention architecture, STEM-oriented vocabulary, and a 128k context window. Researchers apply a “Commonsense-STEM-Agent” curriculum over ~11T tokens and scalable agentic mid-training, enabling state-of-the-art agentic performance among sub-2B models on general and agent-specific benchmarks.
3. Recursive Language Models
Recursive Language Models aim to break the usual trade-off between context length, accuracy, and cost in large language models. Instead of forcing a model to read a giant prompt in one pass, RLMs treat the prompt as an external environment and let the model decide how to inspect it with code, then recursively call itself on smaller pieces. Across S-NIAH, BrowseComp-Plus, OOLONG, and OOLONG Pairs, RLM variants of GPT-5 and Qwen3-Coder improve accuracy and F1 over direct model calls, retrieval agents such as CodeAct, and summarization agents.
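The recursive decomposition idea can be shown with a toy: if the context exceeds a budget, split it and recurse, then combine the sub-answers. The `answer` stub below stands in for a real model call, and the word-counting task is purely illustrative; actual RLMs let the model write code to inspect the prompt rather than using a fixed split.

```python
# Toy sketch of recursive decomposition: answer over small chunks
# directly, otherwise split on a word boundary and combine sub-answers.

MAX_CHARS = 50  # pretend context limit

def answer(question, context):
    # Stub "model": counts occurrences of the queried word.
    return str(context.split().count(question))

def recursive_answer(question, context):
    if len(context) <= MAX_CHARS:
        return answer(question, context)
    mid = len(context) // 2
    # Move the split point to a word boundary so no word is cut in half.
    while mid < len(context) and context[mid] != " ":
        mid += 1
    if mid >= len(context):  # no boundary found; answer in one pass
        return answer(question, context)
    left = recursive_answer(question, context[:mid])
    right = recursive_answer(question, context[mid:])
    # "Combine" step: here, just sum the partial counts.
    return str(int(left) + int(right))

doc = "needle hay hay needle hay " * 10
print(recursive_answer("needle", doc))  # 20
```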
4. Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
The paper introduces Dynamic Large Concept Models, which shift computation from individual tokens to a learned concept space, addressing non-uniform information density in language. DLCM discovers variable-length concepts, defines a compression-aware scaling law that separates token capacity, concept-reasoning capacity, and compression ratio, and uses a decoupled μP to enable stable training.
Who’s Hiring in AI
Research Intern — MSRC AI Security Research @Microsoft Corporation (Cambridge, UK)
Junior Product Designer — AI Driven @CloudWalk (Remote, Brazil)
Intern AI Developer @Entrust (Shakopee, MN, USA)
Full-stack Software Engineer — Core @Dataiku (Remote/France)
Junior Data Developer @CI&T (Remote)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net .
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
|
|
|
Nous Research's NousCoder-14B is an open-source coding model landing right in the Claude Code moment |
venturebeat_ai |
07.01.2026 20:00 |
0.664
|
| Embedding sim. | 0.7604 |
| Entity overlap | 0.1087 |
| Title sim. | 0.1797 |
| Time proximity | 0.8277 |
| NLP type | product_launch |
| NLP organization | Nous Research |
| NLP topic | code generation |
| NLP country | |
Open original
Nous Research, the open-source artificial intelligence startup backed by crypto venture firm Paradigm, released a new competitive programming model on Monday that it says matches or exceeds several larger proprietary systems — trained in just four days using 48 of Nvidia's latest B200 graphics processors.
The model, called NousCoder-14B, is another entry in a crowded field of AI coding assistants, but arrives at a particularly charged moment: Claude Code, the agentic programming tool from rival Anthropic, has dominated social media discussion since New Year's Day, with developers posting breathless testimonials about its capabilities. The simultaneous developments underscore how quickly AI-assisted software development is evolving — and how fiercely companies large and small are competing to capture what many believe will become a foundational technology for how software gets written.
NousCoder-14B achieves a 67.87 percent accuracy rate on LiveCodeBench v6, a standardized evaluation that tests models on competitive programming problems published between August 2024 and May 2025. That figure represents a 7.08 percentage point improvement over the base model it was trained from, Alibaba's Qwen3-14B, according to Nous Research's technical report published alongside the release.
"I gave Claude Code a description of the problem, it generated what we built last year in an hour," wrote Jaana Dogan, a principal engineer at Google responsible for the Gemini API, in a viral post on X last week that captured the prevailing mood around AI coding tools. Dogan was describing a distributed agent orchestration system her team had spent a year developing — a system Claude Code approximated from a three-paragraph prompt.
The juxtaposition is instructive: while Anthropic's Claude Code has captured imaginations with demonstrations of end-to-end software development, Nous Research is betting that open-source alternatives trained on verifiable problems can close the gap — and that transparency in how these models are built matters as much as raw capability.
How Nous Research built an AI coding model that anyone can replicate
What distinguishes the NousCoder-14B release from many competitor announcements is its radical openness. Nous Research published not just the model weights but the complete reinforcement learning environment, benchmark suite, and training harness — built on the company's Atropos framework — enabling any researcher with sufficient compute to reproduce or extend the work.
"Open-sourcing the Atropos stack provides the necessary infrastructure for reproducible olympiad-level reasoning research," noted one observer on X, summarizing the significance for the academic and open-source communities.
The model was trained by Joe Li, a researcher in residence at Nous Research and a former competitive programmer himself. Li's technical report reveals an unexpectedly personal dimension: he compared the model's improvement trajectory to his own journey on Codeforces, the competitive programming platform where participants earn ratings based on contest performance.
Based on rough estimates mapping LiveCodeBench scores to Codeforces ratings, Li calculated that NousCoder-14B's improvement, from approximately the 1600-1750 rating range to 2100-2200, mirrors a leap that took him nearly two years of sustained practice between ages 14 and 16. The model accomplished the equivalent in four days.
"Watching that final training run unfold was quite a surreal experience," Li wrote in the technical report.
But Li was quick to note an important caveat that speaks to broader questions about AI efficiency: he solved roughly 1,000 problems during those two years, while the model required 24,000. Humans, at least for now, remain dramatically more sample-efficient learners.
Inside the reinforcement learning system that trains on 24,000 competitive programming problems
NousCoder-14B's training process offers a window into the increasingly sophisticated techniques researchers use to improve AI reasoning capabilities through reinforcement learning.
The approach relies on what researchers call "verifiable rewards" — a system where the model generates code solutions, those solutions are executed against test cases, and the model receives a simple binary signal: correct or incorrect. This feedback loop, while conceptually straightforward, requires significant infrastructure to execute at scale.
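The loop described above can be sketched in a few lines. This is a deliberate simplification: solutions are represented as Python source strings defining a `solve` function (an illustrative convention, not the paper's format), and real systems execute untrusted code inside isolated sandboxes with strict resource limits, which this sketch omits entirely.

```python
# Minimal "verifiable reward": run a candidate solution against I/O test
# cases and emit a binary signal. Crashes and wrong answers both score 0.

def binary_reward(solution_src, test_cases):
    namespace = {}
    try:
        exec(solution_src, namespace)   # define `solve` in the namespace
        solve = namespace["solve"]
        for inp, expected in test_cases:
            if solve(inp) != expected:
                return 0
        return 1
    except Exception:
        return 0  # compilation errors or runtime crashes count as failure

candidate = "def solve(x):\n    return x * x"
tests = [(2, 4), (3, 9), (-1, 1)]
print(binary_reward(candidate, tests))  # 1
```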
Nous Research used Modal, a cloud computing platform, to run sandboxed code execution in parallel. Each of the 24,000 training problems contains hundreds of test cases on average, and the system must verify that generated code produces correct outputs within time and memory constraints — 15 seconds and 4 gigabytes, respectively.
The training employed a technique called DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), which the researchers found performed slightly better than alternatives in their experiments. A key innovation involves "dynamic sampling" — discarding training examples where the model either solves all attempts or fails all attempts, since these provide no useful gradient signal for learning.
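The dynamic-sampling filter itself is simple: per problem, the policy samples a group of attempts, and groups with uniform outcomes carry zero advantage signal, so they are dropped from the batch. A sketch with illustrative group structure:

```python
# Keep only reward groups with mixed outcomes: all-pass and all-fail
# groups contribute no gradient signal under group-relative advantages.

def keep_informative_groups(reward_groups):
    return [g for g in reward_groups if 0 < sum(g) < len(g)]

batch = [
    [1, 1, 1, 1],   # solved on every attempt -> dropped
    [0, 0, 0, 0],   # failed on every attempt -> dropped
    [1, 0, 1, 0],   # mixed outcomes -> informative, kept
]
print(keep_informative_groups(batch))  # [[1, 0, 1, 0]]
```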
The researchers also adopted "iterative context extension," first training the model with a 32,000-token context window before expanding to 40,000 tokens. During evaluation, extending the context further to approximately 80,000 tokens produced the best results, with accuracy reaching 67.87 percent.
Perhaps most significantly, the training pipeline overlaps inference and verification — as soon as the model generates a solution, it begins work on the next problem while the previous solution is being checked. This pipelining, combined with asynchronous training where multiple model instances work in parallel, maximizes hardware utilization on expensive GPU clusters.
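The overlap can be illustrated with a thread pool: each verification is submitted asynchronously, so generation of the next problem is never blocked on checking the previous one. The `generate` and `verify` stubs below are illustrative stand-ins for model inference and sandboxed test execution.

```python
# Sketch of pipelined inference and verification: verification runs in
# worker threads while the main loop keeps generating solutions.
from concurrent.futures import ThreadPoolExecutor

def generate(problem):        # stand-in for model inference
    return problem * 2

def verify(solution):         # stand-in for sandboxed test execution
    return solution % 4 == 0

def pipelined_run(problems):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submitting verify() returns immediately, so generate() for the
        # next problem starts while earlier checks are still running.
        futures = [pool.submit(verify, generate(p)) for p in problems]
        return [f.result() for f in futures]  # collect rewards at the end

print(pipelined_run([1, 2, 3, 4]))  # [False, True, False, True]
```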
The looming data shortage that could slow AI coding model progress
Buried in Li's technical report is a finding with significant implications for the future of AI development: the training dataset for NousCoder-14B encompasses "a significant portion of all readily available, verifiable competitive programming problems in a standardized dataset format."
In other words, for this particular domain, the researchers are approaching the limits of high-quality training data.
"The total number of competitive programming problems on the Internet is roughly the same order of magnitude," Li wrote, referring to the 24,000 problems used for training. "This suggests that within the competitive programming domain, we have approached the limits of high-quality data."
This observation echoes growing concern across the AI industry about data constraints. While compute continues to scale according to well-understood economic and engineering principles, training data is "increasingly finite," as Li put it.
"It appears that some of the most important research that needs to be done in the future will be in the areas of synthetic data generation and data efficient algorithms and architectures," he concluded.
The challenge is particularly acute for competitive programming because the domain requires problems with known correct solutions that can be verified automatically. Unlike natural language tasks where human evaluation or proxy metrics suffice, code either works or it doesn't — making synthetic data generation considerably more difficult.
Li identified one potential avenue: training models not just to solve problems but to generate solvable problems, enabling a form of self-play similar to techniques that proved successful in game-playing AI systems. "Once synthetic problem generation is solved, self-play becomes a very interesting direction," he wrote.
A $65 million bet that open-source AI can compete with Big Tech
Nous Research has carved out a distinctive position in the AI landscape: a company committed to open-source releases that compete with — and sometimes exceed — proprietary alternatives.
The company raised $50 million in April 2025 in a round led by Paradigm, the cryptocurrency-focused venture firm founded by Coinbase co-founder Fred Ehrsam. Total funding reached $65 million, according to some reports. The investment reflected growing interest in decentralized approaches to AI training, an area where Nous Research has developed its Psyche platform .
Previous releases include Hermes 4, a family of models that we reported "outperform ChatGPT without content restrictions," and DeepHermes-3, which the company described as the first "toggle-on reasoning model" — allowing users to activate extended thinking capabilities on demand.
The company has cultivated a distinctive aesthetic and community, prompting some skepticism about whether style might overshadow substance. "Ofc i'm gonna believe an anime pfp company. stop benchmarkmaxxing ffs," wrote one critic on X, referring to Nous Research's anime-style branding and the industry practice of optimizing for benchmark performance.
Others raised technical questions. "Based on the benchmark, Nemotron is better," noted one commenter, referring to Nvidia's family of language models. Another asked whether NousCoder-14B is "agentic focused or just 'one shot' coding" — a distinction that matters for practical software development, where iterating on feedback typically produces better results than single attempts.
What researchers say must happen next for AI coding tools to keep improving
The release includes several directions for future work that hint at where AI coding research may be heading.
Multi-turn reinforcement learning tops the list. Currently, the model receives only a final binary reward — pass or fail — after generating a solution. But competitive programming problems typically include public test cases that provide intermediate feedback: compilation errors, incorrect outputs, time limit violations. Training models to incorporate this feedback across multiple attempts could significantly improve performance.
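The proposed multi-turn loop can be sketched as a retry cycle: surface the first failing public test as feedback and let the model try again. The `model` stub below is purely illustrative (it "fixes" its answer once it sees feedback); a real system would feed the error text back into the model's context.

```python
# Sketch of multi-turn feedback: instead of one binary reward, report the
# first failing public test case and allow further attempts.

def first_failure(solve, public_tests):
    for inp, expected in public_tests:
        got = solve(inp)
        if got != expected:
            return f"input {inp}: expected {expected}, got {got}"
    return None  # all public tests passed

def multi_turn_solve(model, public_tests, max_turns=3):
    feedback = None
    for _ in range(max_turns):
        solve = model(feedback)
        feedback = first_failure(solve, public_tests)
        if feedback is None:
            return solve, True
    return solve, False

def model(feedback):
    # Illustrative stub: first attempt is off by one, corrected after
    # seeing the failure message.
    if feedback is None:
        return lambda x: x + 2
    return lambda x: x + 1

solution, ok = multi_turn_solve(model, [(1, 2), (5, 6)])
print(ok)  # True
```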
Controlling response length also remains a challenge. The researchers found that incorrect solutions tended to be longer than correct ones, and response lengths quickly saturated available context windows during training — a pattern that various algorithmic modifications failed to resolve.
Perhaps most ambitiously, Li proposed "problem generation and self-play" — training models to both solve and create programming problems. This would address the data scarcity problem directly by enabling models to generate their own training curricula.
"Humans are great at generating interesting and useful problems for other competitive programmers, but it appears that there still exists a significant gap in LLM capabilities in creative problem generation," Li wrote.
The model is available now on Hugging Face under an Apache 2.0 license. For researchers and developers who want to build on the work, Nous Research has published the complete Atropos training stack alongside it.
What took Li two years of adolescent dedication to achieve — climbing from a 1600-level novice to a 2100-rated competitor on Codeforces — an AI replicated in 96 hours. He needed 1,000 problems. The model needed 24,000. But soon enough, these systems may learn to write their own problems, teach themselves, and leave human benchmarks behind entirely.
The question is no longer whether machines can learn to code. It's whether they'll soon be better teachers than we ever were.
|
|
|
Claude Code Hits Different |
interconnects |
09.01.2026 17:42 |
0.632
|
| Embedding sim. | 0.7625 |
| Entity overlap | 0.2963 |
| Title sim. | 0.1354 |
| Time proximity | 0.3693 |
| NLP type | product_launch |
| NLP organization | Anthropic |
| NLP topic | code generation |
| NLP country | |
Open original
Claude Code Hits Different
Coding agents cross a meaningful threshold with Opus 4.5.
Nathan Lambert
Jan 09, 2026
There is an incredible amount of hype for Claude Code with Opus 4.5 across the web right now, which I for better or worse entirely agree with. Having used coding agents extensively for the past 6-9 months, where it felt like sometimes OpenAI’s Codex was the best and sometimes Claude, there was some meaningful jump over the last few weeks. The jump is well captured by this post, which called it the move of “software creation from an artisanal, craftsman activity to a true industrial process.” Translation: Software is becoming free and human design, specification, and entrepreneurship is the only limiting factor.
Sergey Karayev (@sergeykarayev), on X, Jan 4, 2026:
“Claude Code with Opus 4.5 is a watershed moment, moving software creation from an artisanal, craftsman activity to a true industrial process. It’s the Gutenberg press. The sewing machine. The photo camera.”
What is odd is that this latest Opus model was released on November 24, 2025, and the performance jump in Claude Code seemed to come at least weeks after its integration — I wouldn’t be surprised if a small product change unlocked massive real (or perceived) gains in performance.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
The joy and excitement I feel when using this latest model in Claude Code is so simple that it necessitates writing about it. It feels right in line with trying ChatGPT for the first time or realizing o3 could find any information I was looking for, but in an entirely new direction. This time, it is the commodification of building. I type and outputs are constructed directly. Claude’s perfect mix of light sycophancy, extreme productivity, and an elegantly crafted application has me coming up with things to do with Claude. I’d rather do my work if it fits the Claude form factor, and soon I’ll modify my approaches so that Claude will be able to help. In a near but obvious future I’ll just manage my Claudes from my phone at the coffee shop.
While Claude is an excellent model, maybe the best, the product is where the magic happens, making building with AI feel confidence-inspiring. The interfaces the models are used in appear to matter so much to performance that Anthropic’s approach with Claude feels like Apple’s integration of hardware, software, and everything in between. This sort of magical experience is not one I expect to be buildable only by Anthropic — they’re just the first to get there.
The fact that Claude makes people want to go back to it is going to create new ways of working with these models and software engineering is going to look very different by the end of 2026. Right now Claude (and other models) can replicate the most-used software fairly easily. We’re in a weird spot where I’d guess they can add features to fairly complex applications like Slack, but there are a lot of hoops to jump through in landing the feature (including very understandable code quality standards within production code-bases), so the models are way easier to use when building from scratch than in production code-bases.
This dynamic amplifies the transition and power shift of software, where countless people who have never fully built something with code before can get more value out of it. It will rebalance the software and tech industry to favor small organizations and startups like Interconnects that have flexibility and can build from scratch in new repositories designed for AI agents. It’s an era to be first defined by bespoke software rather than a handful of mega-products used across the world. The list of what’s already commoditized is growing in scope and complexity fast — website frontends, mini applications on any platform, data analysis tools — all without having to know how to write code.
I expect mental barriers people have about Claude’s ability to handle complex codebases to come crashing down throughout the year, as more and more Claude-pilled engineers just tell their friends “skill issue.” With these coding agents all coming out last year, the labs are still learning how to best train models to be well-expressed in the form factor. It’ll be a defining story of 2026 as the commodification of software expands outside of the bubble of people deeply obsessed with AI.
There are things that Claude can’t do well and will take longer to solve, but these are more like corner cases and for most people immense value can be built around these blockers.
The other part that many people will miss is that Claude Code doesn’t need to be restricted to just software development — it can control your entire computer. People are starting to use it for managing their email, calendars, decision making, referencing their notes, and everything in between. The crucial aspect is that Claude is designed around the command line interface (CLI), which is an open door into the digital world.
The DGX Spark on my desk can be a mini AI research and development station managed by Claude.
This complete interface managing my entire internet life is the beginning of current AI models feeling like they’re continually learning. Whenever Claude makes a mistake or does something that doesn’t match your taste, dump a reminder into CLAUDE.md — it’s as simple as that. To quote Doug O'Laughlin, my brother in arms of Claude fandom, Claude with a 100X context window and 100X the speed will be AGI. By the end of 2026 we definitely could get the first 10X of both with the massive buildout of compute starting to become available.
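The CLAUDE.md habit amounts to appending one-line corrections to a memory file the agent reads on every run. A trivial sketch; the file name follows the common convention, and the format shown is just an assumption:

```python
# Append a one-line lesson to the agent's memory file so future sessions
# pick it up. Uses only the standard library.
from pathlib import Path

def remember(lesson, path="CLAUDE.md"):
    p = Path(path)
    existing = p.read_text() if p.exists() else ""
    p.write_text(existing + f"- {lesson}\n")

remember("Run the test suite before committing.")
print(Path("CLAUDE.md").read_text())
```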
Happy building.
|
|
|
Use multiple models |
interconnects |
11.01.2026 14:02 |
0.629
|
| Embedding sim. | 0.7677 |
| Entity overlap | 0.0667 |
| Title sim. | 0.1273 |
| Time proximity | 0.435 |
| NLP type | other |
| NLP organization | OpenAI |
| NLP topic | large language models |
| NLP country | |
Open original
Use multiple models
The meta for getting the most out of AI in 2026.
Nathan Lambert
Jan 11, 2026
I’ll start by explaining my current AI stack and how it’s changed in recent months. For chat, I’m using a mix of:
GPT 5.2 Thinking / Pro: My most frequent AI use is getting information. This is often a detail about a paper I’m remembering, a method I’m verifying for my RLHF Book, or some other niche fact. I know GPT 5.2 can find it if it exists, and I use Thinking for queries that I think are easier and Pro when I want to make sure the answer is right. Particularly GPT Pro has been the indisputable king for research for quite some time — Simon Willison’s coining of it as his “research goblin” still feels right.
I never use GPT 5 without thinking or other OpenAI chat models. Maybe I need to invest more in custom instructions, but the non-thinking models always come across a bit sloppy relative to the competition out there and I quickly churn. I’ve heard gossip that the Thinking and non-Thinking GPT models are even developed by different teams, so it would make sense that they can end up being meaningfully different.
I also rarely use Deep Research from any provider, opting for GPT 5.2 Pro and more specific instructions. In the first half of 2025 I almost exclusively used ChatGPT’s thinking models — Anthropic and Google have done good work to win back some of my attention.
Claude 4.5 Opus: Chatting with Claude is where I go for basic code questions, visualizing simple data, and getting richer feedback on my work or decisions. Opus’s tone is particularly refreshing when trying to push the models a bit (in a way that GPT 4.5 used to provide for me, as I was a power user of that model in H1 2025). Claude Opus 4.5 isn’t particularly fast relative to a lot of models out there, but when you’re used to using the GPT Thinking models like me, it feels way faster (even with extended thinking always on, as I do) and sufficient for this type of work.
Gemini 3 Pro: Gemini is for everything else — explaining concepts I know are well covered in the training data (and minor hallucinations are okay, e.g. my former Google rabbit holes), multimodality, and sometimes very long-context capabilities (but GPT 5.2 Thinking took a big step here, so it’s a bit closer). I still open and use the Gemini app regularly, but it’s a bit less locked-in than the other two.
Relative to ChatGPT, sometimes I feel like the search mode of Gemini is a bit off. It could be a product decision with how the information is presented to the user, but GPT’s thorough, repeated search over multiple sources instills a confidence I don’t get from Gemini for recent or research information.
Grok 4: I use Grok ~monthly to try and find some piece of AI news or Alpha I recall from browsing X. Grok is likely underrated in terms of its intelligence (particularly Grok 4 was an impressive technical release), but it hasn’t had sticky product or differentiating features for me.
For images I’m using a mix of mostly Nano Banana Pro and sometimes GPT Image 1.5 when Gemini can’t quite get it.
For coding, I’m primarily using Claude Opus 4.5 in Claude Code, but still sometimes find myself needing OpenAI’s Codex or even multi-LLM setups like Amp. Over the holiday break, Claude Opus helped me update all the plots for The ATOM Project (which included substantial processing of our raw data from scraping HuggingFace), perform substantive edits for the RLHF Book (where I felt it was a quite good editor when provided with detailed instructions on what it should do), and handle other side projects and life-organization tasks. I recently published a piece explaining my current obsession with Claude Opus 4.5; I recommend you read it if you haven’t had the chance:
Interconnects: “Claude Code Hits Different” (Jan 9, 2026)
A summary of this is that I pay for the best models and greatly value the marginal intelligence over speed — particularly because, for a lot of the tasks I do, I find that the models are just starting to be able to do them well. As these capabilities diffuse in 2026, speed will become more of a determining factor in model selection.
Peter Wildeford had a post on X with a nice graphic that reflected a very similar usage pattern:
Peter Wildeford (@peterwildeford), on X, Jan 8, 2026:
“Here's currently how I'm using each of the LLMs”
Across all of these categories, it doesn’t feel like I could get away with just using one of these models without taking a substantial haircut in capabilities. This is a very strong endorsement for the notion of AI being jagged — i.e. with very strong capabilities spread out unevenly — while also being a bit of an unusual way to need to use a product. Each model is jagged in its own way. Through 2023, 2024, and the earlier days of modern AI, it quite often felt like there was always just one winning model and keeping up was easier. Today, it takes a lot of work and fiddling to make sure you’re not missing out on capabilities.
The working pattern I’ve formed that most reinforces this use-multiple-models era is how often my problem with an AI model is solved by passing the same query to a peer model. Models get stuck, some can’t find bugs, some coding agents keep getting stuck on some weird, suboptimal approach, and so on. In these cases, it feels quite common to boot up a peer model or agent and get it to unblock the project.
This multi-model approach or agent-switching happening occasionally would be what I’d expect, but with it happening regularly it means that the models are actually all quite close to being able to solve the tasks I’m throwing at them — they’re just not quite there. The intuition here is that if we view each task as having a per-model probability of success, then if that probability were low for each model, switching would almost always fail. For switching to regularly solve the task, each model must have a fairly high probability of success.
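The intuition above can be made concrete: if each of n models independently succeeds with probability p, trying all of them succeeds with probability 1 - (1-p)^n. The numbers below are illustrative, not measurements.

```python
# If switching to a second model regularly rescues a task, per-model
# success probability p must already be fairly high: at low p, even two
# attempts usually fail.

def p_any_succeeds(p, n):
    """Probability that at least one of n independent models succeeds."""
    return 1 - (1 - p) ** n

for p in (0.1, 0.5, 0.8):
    print(p, round(p_any_succeeds(p, 2), 3))
# 0.1 -> 0.19 : switching rarely helps when both models are weak
# 0.5 -> 0.75 : switching helps often only once models are near-capable
# 0.8 -> 0.96
```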
For the time being, it seems like tasks at the frontier of AI capabilities will always keep this model-switching meta, but it’s a moving suite of capabilities. The things I need to switch on now will soon be solved by all the next-generation of models.
I’m very happy with the value I’m getting out of my hundreds of dollars of AI subscriptions, and you should likely consider doing the same if you work in a domain that sounds similar to mine.
On the opposite side of the frontier models pushing to make current cutting edge tasks 100% reliable are open models pushing to undercut the price of frontier models. The coding plans on open models tend to cost 10X (or more) less than the frontier lab plans. It’s a boring take, but for the next few years I expect this gap to largely remain steady, where a lot of people get an insane value out of the cutting edge of models. It’ll take longer for the open model undercut to hit the frontier labs, even though from basic principles it looks like a precarious position for them to be in, in terms of costs of R&D and deployment. Open models haven’t been remotely close to Claude 4.5 Opus or GPT 5.2 Thinking in my use.
The other factor is that 2025 gave us all of Deep Research agents, code/CLI agents, and search (and Pro) tool-use models, and there will almost certainly be new form factors released in 2026 that we end up using almost every day. Historically, closed labs have been better at shipping new products into the world, but with better open models this should become more diffused, as good product capabilities are spread widely across the tech ecosystem. To capitalize on this, you need to invest time (and money) trying all the cutting-edge AI tools you can get your hands on. Don’t be loyal to one provider.
|