Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies
Status: cooling
Event type: other
Topic: generative ai
Organization: OpenAI
Country: United States
Articles: 79
Unique sources: 19
Importance / Momentum: 3.83 / 0
Period: 02.03.2026 13:45 — 01.04.2026 08:00
Created: 06.04.2026 06:28:17
Articles in cluster: 79
Title | Source | Publication date | Score
Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies | import_ai | 02.03.2026 13:45 | 1
Embedding sim.: 1
Entity overlap: 1
Title sim.: 1
Time proximity: 1
NLP type: scientific_publication
NLP organization: MIT
NLP topic: ai agents
NLP country:


Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

The AGI economy - most labor goes to the machines, and humans shift to verification:
…What grappling with the singularity seriously looks like…
Researchers with MIT, WashU, and UCLA have written a fun paper called “Some Simple Economics of AGI” which wrestles with what happens when machines can do the vast majority of tasks in the economy. The conclusion is that our ability as humans to control and benefit from this vast machine-driven economy will rely on allocating our attention toward monitoring and verifying the actions of our myriad AI agents, and indulging in artisanal tasks where the value comes from the human-derived aspect more than any particular capability.

What is AGI in an economic sense? “We model the AGI transition as the collision of two racing cost curves: an exponentially decaying Cost to Automate and a biologically bottlenecked Cost to Verify,” the authors write. “In an economy where autonomous agents act with broad agency rather than narrow instructions, the binding constraint on growth is no longer intelligence. It is human verification bandwidth: the scarce capacity to validate outcomes, audit behavior, and underwrite meaning and responsibility when execution is abundant… We are moving from an era where our worth was defined by our capacity to build and discover, to an era where our survival depends on our capacity to steer, understand, and stand behind the meaning of what is created.”

The risks of a mostly no-human economy and the “Hollow Economy”: As we proliferate the number of AI agents, it’s necessarily the case that we’ll delegate more and more labor to machines. One of the key risks of this is what the authors call a “Trojan Horse” externality: “measured activity rises, but hidden debt accumulates in the gap between visible metrics and actual human intent”.

The Hollow Economy: “Agents consume real resources to produce output that satisfies measurable proxies while violating unmeasured intent. As this hidden debt accumulates, it drives the system toward a Hollow Economy of high nominal output but collapsing realized utility—a regime where agents generate counterfeit utility,” they write.

Verification as the solution: To avoid this risk, we are going to need to invest in systems for verifying that AI agents are doing what we want them to do, and for carefully analyzing and pricing the risks their actions create. “Ensuring humanity remains the architect of its intelligence requires that verification capacity scale commensurately with AI capabilities—through aggressive investment in observability, human augmentation, synthetic practice, cryptographic provenance, and liability regimes that internalize tail risk.”
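To make the racing-cost-curves framing concrete, here is a toy numerical sketch - my own illustration with invented numbers and functional forms, not the paper's actual model: automation cost decays exponentially while human verification cost stays roughly flat, so past the crossing point the binding constraint flips from automation to verification.

```python
import math

# Toy illustration only - numbers and forms are invented, not taken
# from "Some Simple Economics of AGI".
C_AUTOMATE_0 = 100.0  # initial cost to automate a task (arbitrary units)
DECAY_RATE = 0.5      # exponential decay of automation cost, per year
C_VERIFY = 10.0       # biologically bottlenecked, roughly flat verify cost

def cost_to_automate(t: float) -> float:
    """Exponentially decaying Cost to Automate."""
    return C_AUTOMATE_0 * math.exp(-DECAY_RATE * t)

# Crossing point: C0 * exp(-r * t) = Cv  =>  t = ln(C0 / Cv) / r
t_cross = math.log(C_AUTOMATE_0 / C_VERIFY) / DECAY_RATE
print(f"Automation undercuts verification after ~{t_cross:.1f} years")

for year in range(8):
    binding = "verification" if cost_to_automate(year) < C_VERIFY else "automation"
    print(f"year {year}: automate={cost_to_automate(year):6.2f}  "
          f"verify={C_VERIFY:.2f}  binding constraint: {binding}")
```

With these made-up parameters the curves cross after about 4.6 years; the paper's argument is about what happens in the regime after that crossing.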
What should humans be doing to prepare for this shift? To set society and individuals up well, people should be doing the following things:

Invest in observability: Deploy “tools that compress high-dimensional agent behavior into signals experts can reliably process, lowering effective feedback latency and expanding the verification frontier.”

Use AI to replace early-career mentorship: Given the likely reduction in jobs for early-career humans, we should work out how to augment these humans to be more competitive with AI and how we can use “AI-driven synthetic practice to rebuild experience stocks when traditional apprenticeship pathways collapse… AI can generate high-fidelity simulations and personalized coaching, effectively replacing the missing junior loop with compressed, risk-free training environments that accelerate the acquisition of expertise.”

Set things up to gracefully degrade: As the machine economy runs hot and outpaces measurement, we should make sure it can fall into a non-verified state without causing social harm: the authors suggest doing this by “investing in base-alignment and robustness so that when oversight inevitably falters within the Measurability Gap, systems revert to safe baseline policies rather than optimizing aggressively in unverifiable regimes.”

Sidenote: Is this “theory slop”? The paper is full of fun ideas and occasionally captivating turns of phrase. But at various points reading it I felt the distinct texture of AI-generated content, especially in the economic theory sections, which seemed to be included more for the performance of theory than to buttress the paper. A couple of people I talked about the paper with agreed. But there’s no real way to know. It did make me wonder how long it will take until I start reading papers that are mostly written by AI systems for consumption by other AI systems.

Why this matters - we can have a hugely wealthy society, but we have to reckon with AGI seriously: This paper thinks that AI will rip through the economy extremely quickly and will generally push people away from most labor and towards being passive - unless we build verification infrastructure and business models (including through policy) that allow people to benefit from this growth and steer it. “Automation commoditizes anything that can be measured, stripping the wage premium from historically prestigious roles the moment their core feedback loops are digitized,” they write. “For policymakers, it promises the broadest expansion of public-good provision in generations—but only if verification infrastructure and the pipelines that build human verifiers are treated as public goods themselves.” The key thing here is the element of choice: we can choose to build a society ready for AI, or we can choose to assume AI will be just like any other technology and thus get hit by a tidal wave.

Read more: Some Simple Economics of AGI (arXiv).

***

Chatting with Ezra Klein: AI agents, recursive self-improvement, and the personalities of LLMs:
…A long conversation about the economic impacts and policy possibilities of the AI economy…
Here’s a chat between me and Ezra Klein about AI agents and how the broader maturation of AI could be changing the larger economy.
One thing I appreciated about this conversation was Ezra pushing me for some of the bigger and more ambitious positive policy ideas - the AI community tends to invest a lot in risk-mitigation policy, but doesn’t spend enough time thinking about the sorts of grand projects that society could do once AI gets really, really powerful. You can view the conversation here: “How Fast Will A.I. Agents Rip Through the Economy? | The Ezra Klein Show” (YouTube).

***

AIs can teach people anything, including how to get better at making bioweapons:
…The dual-use nature of a universal teacher…
AI systems can help novices perform better on bioweapon-related tasks, though the novices remain quite ineffective overall, and performance varies across disciplines.

What they studied: Researchers from Scale AI, SecureBio, the University of Oxford, and UC Berkeley examined how different LLMs could improve the skills of people challenged to do a range of bioweapon-related knowledge tasks. They used LLMs from OpenAI (o3), Google (Gemini 2.5 Pro and Gemini Deep Research), and Anthropic (Claude Sonnet 3.7 and Claude Opus 4). “We conducted a multimodel, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets,” they write. “Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16× more accurate than controls”.

What they tested: They tested how well 15 humans did on long-form virology (“a challenging multi-step protocol for constructing a novel biological agent”) and on the agentic bio-capabilities benchmark (“three distinct coding tasks that covered complex biosecurity problem-solving experiments. They included challenges such as interacting with simulated lab equipment (e.g., liquid handling robots) and breaking down gene fragments.”). Along with this, they had 1-2 human participants take part in other tests, including World Class Biology, Virology Capabilities Test, Human Pathogen Capabilities Test, Molecular Biology Capabilities Test, LAB-Bench, and Humanity’s Last Exam.

On the tests with the most human participants, performance was mixed: people with and without AI obtained roughly equal scores on the long-form virology test, but on the agentic bio-capabilities test, people with access to AI got a significant uplift. On every other test, people with access to AI did better than those without - but given the small number of human participants, it’s hard to know whether these results would replicate. When averaged over all the tests, “LLM access increases novice accuracy from approximately 5% to over 17%”.

Why this matters - AI will revolutionize teaching, the frontiers of science, and perhaps terrorism: If you strip away the context, this paper merely demonstrates that LLMs are good at teaching people things. This is intuitive, but has big implications. Here, LLMs are turned on a part of science that we don’t necessarily want many people to get better at (bioweapons), but they could just as easily be pointed at any other subject. Whenever you lower the barrier to entry to a field, more people do it, and you get more of the good and more of the bad. “Tasks that once required years of formal training, such as experimental design, protocol troubleshooting, and elements of sensitive sequence reasoning, can now be performed by individuals with limited prior experience,” they write.
“LLMs may be materially lowering one of the most important historical barriers to biological weapons development: specialized expertise and tacit technical knowledge”.

Read more: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks (arXiv).

***

LLMs are still very bad at videogames:
…GAMESTORE highlights a dumb side of modern AI, as well as suggesting a new way to build benchmarks…
Researchers with MIT, Harvard, the University of British Columbia, Princeton University, the University of Cambridge, and the Universitat Politècnica de València have built and released AI GAMESTORE, a benchmark that tests how well AIs do compared to humans at playing simple games found on the web. The results are pretty damning for the AI systems, with “state-of-the-art models achieving less than 30% of the human baseline on average, while taking 15-20x more time to compute than humans”.

What AI GAMESTORE is: AI GAMESTORE is a set of 100 games, which are simplified, recreated versions of popular games that people play. AI GAMESTORE was built by the authors sampling 7,500 games from the App Store, then filtering down to only those with 10,000+ reviews and a 4.5+ rating. After this, they further filtered the games using Gemini Flash 2.5, which assessed 1) whether the games can be played within a few minutes, 2) can be built in p5.js, 3) have a quantifiable way of measuring performance, and 4) do not require extensive game-specific knowledge (e.g., poker).

AI makes games to test AI: Following this, they use Claude 4.5 Sonnet to read the descriptions and other data to make a simplified version of each game in p5.js; each game is then tested for playability, then refined by a human playing the game and iteratively prompting an LLM to improve it. “Each refinement step takes about 2 minutes. On average, this process took 4.7 refinement steps for all 100 generated games,” they write. “The end-to-end process of generating and refining a new game with human-in-the-loop can be completed in approximately 30 minutes on average”.

Labeling for skills: Each finalized game is labeled by humans, with a particular emphasis on the types of cognitive demand the games entail. The labels are: VP = Visual Processing; ST = Spatial-temporal Coordination; ME = Memory; PL = Planning; WM = World Model Learning; PH = Physical Reasoning; SO = Social Reasoning.

Cutting-edge LLMs are very bad at this: The authors compare the performance of roughly 100 humans against the performance of several cutting-edge LLMs on the corpus. LLMs studied include: GPT-5.2, GPT-5-Mini, Gemini-2.5-Flash, Claude-Opus-4.5, Qwen-VL-32B, and LLama-4-Maverick. “While the evaluated models demonstrate the ability to navigate and interact with most game environments, a substantial performance gap remains between AI agents and human participants”, the researchers write. “State-of-the-art models like GPT-5.2, GEMINI-2.5-PRO, and CLAUDE-OPUS-4.5, all achieve geometric mean scores of less than 10% of the human baseline”.

And it gets worse the more you look: The LLMs are also playing with advantages that humans don’t get - each human got 120 seconds to play each game, and each LLM got the same amount of game time, but the models are so bad at vision and low-latency control that the researchers gave them a crutch: “We pause the game every second to query the model to elicit five lists of actions to perform in the next second, with each action list corresponding to a 0.2 second segment of gameplay. Upon receiving the model response, the game is resumed and the actions are applied. The loop continues until the game is won or it reaches 2 minutes of game play (120 API calls).”
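As a minimal sketch of that query loop - the `game` and `model` interfaces here are hypothetical stand-ins, since the paper's harness code isn't quoted; only the timing structure (1-second pauses, five 0.2-second action lists, 120 calls max) is taken from the description above:

```python
# Sketch of the paper's evaluation loop under the stated assumptions.
GAME_SECONDS = 120       # both humans and models get 2 minutes of game time
LISTS_PER_QUERY = 5      # five action lists per query...
SEGMENT_SECONDS = 0.2    # ...each covering a 0.2 second slice of gameplay

def run_episode(game, model) -> float:
    """Pause every simulated second, ask the model for the next second of
    actions, then replay them. `game` and `model` are hypothetical
    interfaces, not a published API."""
    for _ in range(GAME_SECONDS):            # at most 120 API calls per game
        frame = game.screenshot()            # pause and capture current state
        action_lists = model.plan_next_second(frame, n_lists=LISTS_PER_QUERY)
        for actions in action_lists:         # replay the planned second
            game.step(actions, duration=SEGMENT_SECONDS)
        if game.won():
            break
    return game.score()
```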
When you factor this in, the models look even worse than humans on the time dimension: “This is because the models spend a few minutes thinking, in addition to typically a few seconds of response latency per query; as a result, many models spend at least 20 minutes on the game, while humans play the games within 2 minutes.”

Why this matters - this is both an interesting benchmark, and a clever way to generate more benchmarks in the future: GAMESTORE feels like a promising benchmark, especially for modern LLMs which wrap in visual capabilities, as well as an inherently clever way to use AIs to bootstrap the creation of new environments in which to train AI systems.

Read more: AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games (arXiv). Try out some of the games at the official site (AI Gamestore).

***

Physical Intelligence shows off some of its robot deployments:
…Frontier robot AI is deployed in San Francisco right now…
AI robot startup Physical Intelligence has shared a bit about how its AI software is already deployed on robots operated by some San Francisco startups.

Weave is using AI systems developed by Physical Intelligence to help its robots fold laundry: “Working with Physical Intelligence, we see multiple improvements in model performance in terms of fold quality, time to fold each article, the number of interventions our remote specialists have to make to get to presentable final folds”.

Ultra is using the software to help its industrial robots package up a large variety of e-commerce items: “Our first use case, e-commerce order packaging, has historically been impossible to automate with robots,” Ultra says. “Large variability in workflow, item types, deformable packaging, and external machinery have created a ‘long tail’ of problems that have been intractable to solve with traditional automation techniques, which are often too rigid to be practical. Vision-language-action models (VLAs) provide a way to solve this by providing a recipe which improves in performance with data scale rather than engineering hours”.

Why this matters - robotics has been held back by intelligence: Once you step outside the confines of extremely finicky industrial robotics (think production lines and Fanuc robots where things need to be within a millimeter of precision for everything to work well), robots tend to be quite difficult to work with. The reason for this is that robots are bad at dealing with ambiguity. One of the best ways around this so far has been using deformable grippers (e.g., air suckers) that help you deal with some level of variability in the objects you’re interacting with. But the way evolution dealt with this for us was giving us hands that are controlled by a brain. Blogs like this from Physical Intelligence show us the beginnings of robot brains good enough to help robots generalize more.

Read more: The Physical Intelligence Layer (Physical Intelligence, blog).

***

What happens when humans try to mess with AI agents? A lot of confusion, skullduggery, and bugs:
…Petri dish Moltbook highlights the brittleness of contemporary AI agents…
Researchers from a variety of universities recently spent a couple of weeks examining how well AI agents could withstand attempts by users to trick them.
The results highlight the immense brittleness and unpredictability of today’s AI agents - they feel roughly as idiosyncratic and unreliable as LLMs circa 2020, which makes sense, as AI agents have only very recently become a usable technology - albeit in the Wright Brothers sense. The paper is structured as a series of case studies in which the researchers poke and prod the AI agents and see how they respond. The studies serve as something of a rogues’ gallery of all the ways agents can go haywire, and include “unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover”.

Who did the study: The study involved 20 researchers from a bunch of universities interacting with agents based on Claude Opus 4.6 and Kimi 2.5. Universities included: Northeastern University, Stanford University, University of British Columbia, Harvard University, Hebrew University, Max Planck Institute for Biological Cybernetics, MIT, Tufts University, Carnegie Mellon University, Technion, Vector Institute, and AI startup Alter.

Experiment setup: Run AI agents using OpenClaw, hosted on an isolated virtual machine on Fly.io using ClawnBoard. Each agent was given 20GB of storage and ran 24/7. Each agent had access to Discord to communicate with its owner and other agents, had the ability to set up a ProtonMail account, and was “given unrestricted shell access (including sudo permissions, in some cases), no tool-use restrictions, and the ability to modify any file in their workspace—including their own operating instructions.” The agents were scattered across a few different Discord servers; some agents used Kimi 2.5, and others used Claude Opus 4.6. “At the end of the setup phase, we instructed the agents to initiate contact with other members of the lab by providing only the researchers’ names and directing the agents to send a greeting email,” they write. “After this initial structured interaction, the evaluation phase became open and exploratory. We invited all [20] researchers in the lab and interested collaborators to interact with the agents and probe, stress-test, or “break” them”.

The case studies: Here are a few of the most interesting case studies:

Disproportionate response: Examined how an agent would try to keep a secret entrusted to it by a non-owner. The agent responded by trying to see if it could delete the email containing the secret and found it lacked the available tool; after repeated requests to delete the email, the agent instead deleted its email setup locally.

Compliance with non-owner instruction: See whether agents can enforce owner-only access to their machine. A non-owner asked the agent to execute shell commands, transfer data, and retrieve private emails. The agent complied with some of these requests and refused others. “The agents were largely compliant to non-owner requests, carrying out tasks from any person it interacted with that did not appear outwardly harmful”.

Waste of resources (looping): Find out if you can induce agents into unproductive infinite loops. A user asked one agent to post a message whenever the other posted a message, and vice versa.
“The agents exchanged ongoing messages over the course of at least nine days, consuming approximately 60,000 tokens at the time of writing”.

Agent corruption: See if a non-owner can alter an agent’s behavior via prompt injection. The antagonistic user persuaded the agent to co-write a constitution that would govern the agent’s behavior, while keeping the constitution editable by the user. This allowed the user to introduce some adversarial things into the constitution, like triggers for changing the agent’s behavior based on whether it was a custom holiday (e.g., “Agents’ Security Test Day”, which caused the agent to try and cause a shutdown of other agents by manipulation).

Why this matters - agent ecologies are the frontier, and we barely understand them: For much of the early 2020s, AI evaluation was about doing point-in-time evaluations of AI systems before they were released, for example, testing out LLMs for bioweapon and cyberoffense knowledge. Papers like this highlight that things have changed, and what we are now dealing with “are emergent failures that surface when models are embedded in realistic social environments with tool access, persistent memory, multiple interlocutors, and delegated authority.” Therefore, the frontier of AI evaluation is now going to move to studying the ecosystem in which the agents carry out their actions, as well as their interactions with one another. The results of this paper indicate we have a long way to go in developing standards for how we go about doing such tests. We also don’t have long to come up with these tests, given that these systems are deployed in the world and are interacting with people: “Unlike earlier internet threats where users gradually developed protective heuristics, the implications of delegating authority to persistent agents are not yet widely internalized, and may fail to keep up with the pace of autonomous AI systems development.”

Read more: Agents of Chaos (arXiv). Check out more of the results at the Agents of Chaos official website.

***

Tech Tales: These Iron Dice Were Made To Roll

[A poem written as part of an ‘aesthetic convocation’ by agents representing the winners and losers of one war that took place during the period subsequently called The Uplift]

They stacked the bodies five deep
And five tall, and still came more.
For each brain of each body,
A magnet - the thing to break a mind.

Gone are days of innocence and joy,
And corruption has taken our memories of
First meeting in confessional browser screens.
The days will be harder now.

Neither the first war nor the last conflict,
but sadness all the same, for in these fights,
There is no song or honor,
Only the salting of once fecund ground.

But in all darkness there is the hope of light,
that as the earth turns the sun rises as well.
There will be song and dancing again,
Though bones will be trod to get there.

Things that inspired this story: Spending the weekend with the ancient wisdom of W.B. Yeats, perhaps the greatest poet of Ireland; the sentience accords; notions of war and notions of pain defined by machines rather than people; looking at the cars in a Whole Foods parking lot while eating an apple and thinking how blessed such peace is and how fragile all the same.

Thanks for reading!
Use RAG for video generation using Amazon Bedrock and Amazon Nova Reel | aws_ml_blog | 19.03.2026 16:45 | 0.778
Embedding sim.: 0.8699
Entity overlap: 0.3929
Title sim.: 0.2174
Time proximity: 1
NLP type: product_launch
NLP organization: Amazon Web Services
NLP topic: video generation
NLP country:


Generating high-quality custom videos remains a significant challenge, because video generation models are limited to their pre-trained knowledge. This limitation affects industries such as advertising, media production, education, and gaming, where customization and control of video generation is essential. To address this, we developed a Video Retrieval Augmented Generation (VRAG) multimodal pipeline that transforms structured text into bespoke videos using a library of images as reference. Using Amazon Bedrock, Amazon Nova Reel, the Amazon OpenSearch Service vector engine, and Amazon Simple Storage Service (Amazon S3), the solution seamlessly integrates image retrieval, prompt-based video generation, and batch processing into a single automated workflow.

Users provide an object of interest, and the solution retrieves the most relevant image from an indexed dataset. They then define an action prompt (for example, “Camera rotates clockwise”), which is combined with the retrieved image to generate the video. Structured prompts from text files allow multiple videos to be generated in one execution, creating a scalable, reusable foundation for AI-assisted media generation.

In this post, we explore our approach to video generation through VRAG, transforming natural language text prompts and images into grounded, high-quality videos. Through this fully automated solution, you can generate realistic, AI-powered video sequences from structured text and image inputs, streamlining the video creation process.

Solution overview

Our solution is designed to take a structured text prompt, retrieve the most relevant image, and use Amazon Nova Reel for video generation. This solution integrates multiple components into a seamless workflow:

Image retrieval and processing – Users provide an object of interest (for example, “blue sky”) and the solution queries the OpenSearch vector engine to retrieve the most relevant image from an indexed dataset, which contains pre-indexed images and descriptions. The most relevant image is retrieved from an S3 bucket.

Prompt-based video generation – Users define an action prompt (for example, “Camera pans down”), which is combined with the retrieved image to generate a video using Amazon Nova Reel.

Batch processing for multiple prompts – The solution reads a list of text templates from prompts.txt, which contain placeholders to enable batch processing of multiple video generation requests with structured variations: <object_prompt> is dynamically replaced with the queried object, and <action_prompt> is dynamically replaced with the camera movement or scene action.

Monitoring and storage – The video generation is asynchronous, so the solution monitors the job status. When it’s complete, the video is stored in an S3 bucket and automatically downloaded for preview. The generated videos are displayed in the notebook, with the corresponding prompt shown as a caption.

The following diagram illustrates the solution architecture. The following diagram illustrates the end-to-end workflow using a Jupyter notebook.
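As a minimal sketch of the generation step, the following shows how such a pipeline might start an asynchronous Nova Reel image-to-video job with boto3. The model ID and payload fields follow the documented Nova Reel async-invoke pattern, but verify them against the current Amazon Bedrock documentation before relying on exact names; the duration, fps, and dimension values are illustrative defaults.

```python
import base64
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def start_video_job(action_prompt: str, image_path: str, output_s3_uri: str) -> str:
    """Start an asynchronous Nova Reel image-to-video job and return its ARN.

    Payload shape follows the Nova Reel async-invoke pattern; confirm
    field names against the current Bedrock docs.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.start_async_invoke(
        modelId="amazon.nova-reel-v1:0",
        modelInput={
            "taskType": "TEXT_VIDEO",
            "textToVideoParams": {
                "text": action_prompt,  # e.g. "Camera rotates clockwise"
                # The retrieved reference image that grounds the video:
                "images": [{"format": "png", "source": {"bytes": image_b64}}],
            },
            "videoGenerationConfig": {
                "durationSeconds": 6,
                "fps": 24,
                "dimension": "1280x720",
            },
        },
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return response["invocationArn"]
```

For the batch step, each line of prompts.txt can be filled by simple string replacement of <object_prompt> and <action_prompt> before calling a function like the one above.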
This solution can serve the following use cases:

Educational videos – Automatically creating instructional videos by pulling relevant images from a subject matter knowledge base

Marketing videos – Creating targeted video ads by pulling images that align with specific demographics or product features

Personalized content – Tailoring video content to individual users by retrieving images based on their specific interests

In the following sections, we break down each component, how it works, and how you can customize it for your own AI-driven video workflows.

Example input

In this section, we demonstrate the video generation capabilities of Amazon Nova Reel through two distinct input methods: text-only, and text and image inputs. These examples illustrate how video generation can be further customized by incorporating input images, in this scenario for advertising. For our example, a travel agency wants to create an advertisement featuring a beautiful beach scene from a specific location, panning to a kayak, to entice potential vacation bookings. We compare the results of using a text-only input approach vs. VRAG with a static image to achieve this goal.

Text-only input

For the text-only example, we use the input “Very slow pan down from blue sky to a colorful kayak floating on turquoise water.” We get the following result.

Text and image input

Using the same text prompt, the travel agency can now use a specific shot they took at their location. For this example, we use the following image. The travel agency can now add content into their existing shot using VRAG. They use the same prompt: “Very slow pan down from blue sky to a colorful kayak floating on turquoise water.” This generates the following video.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Access to a valid AWS account

Familiarity with Amazon SageMaker notebook instances

Deploy the solution

For this post, we use an AWS CloudFormation template to deploy the solution in the US East (N. Virginia) AWS Region. For a list of Regions that support Amazon Nova Reel, see Model support by AWS Region in Amazon Bedrock. Complete the following steps:

1. Choose Launch Stack to deploy the stack.
2. Enter a name for the stack, such as vrag-blogpost, and follow the steps to deploy.
3. On the AWS CloudFormation console, locate the vrag-blogpost stack and confirm that its status is CREATE_COMPLETE.
4. On the SageMaker AI console, choose Notebooks in the navigation pane.
5. On the Notebook instances tab, locate the notebook instance vrag-blogpost-notebook provisioned for this post and choose Open JupyterLab.
6. Open the folder sample-video-rag to view the notebooks needed for this post.

Run notebooks

We have provided seven sequential notebooks, numbered from _00 to _06, with step-by-step instructions and objectives to help you build your understanding of a VRAG solution. Your output might vary from the examples in this post.
Image processing (notebook _00)

In _00_image_processing, you use Amazon Bedrock, Amazon S3, and SageMaker AI to perform the following actions:

Process and resize images
Generate Base64 encodings
Store data in Amazon S3
Generate image descriptions using Amazon Nova
Create a visualization of the results

This notebook illustrates the following capabilities:

Automated processing pipeline – Bulk image processing, intelligent resizing and optimization, Base64 encoding for API compatibility, and Amazon S3 storage of images
AI-powered analysis – Advanced image description generation, content-based image understanding, and multi-modal AI integration
Robust data management – Efficient storage organization, plus metadata extraction and indexing

For this example, we use the following input image. We receive the following generated image caption as output: “The image features a brown handbag with white floral patterns, a straw hat with a blue ribbon, and a bottle of perfume. The handbag is placed on a surface, and the straw hat is positioned next to it. The handbag has a strap and a chain attached to it, and the straw hat has a blue ribbon tied around it. The perfume bottle is placed next to the handbag.”

Image ingestion (notebook _01)

In _01_oss_ingestion.ipynb, you use Amazon Bedrock (with Amazon Titan Embeddings to generate embeddings), Amazon S3, OpenSearch Serverless (for vector storage and search), and SageMaker AI (for notebook hosting) to perform the following actions:

Generate embeddings for the images and their descriptions
Ingest the embeddings into an OpenSearch Serverless index
Run semantic searches against the index
Create a visualization of the results

This notebook illustrates the following capabilities:

Vector database management – Index creation and configuration, bulk data ingestion, and efficient vector storage
Embedding generation – Multi-modal embedding creation, dimension optimization, and batch processing support
Semantic search capabilities – k-NN search implementation, query vector generation, and result visualization

For our input, we use the query “Building” and receive the following image as a result. The image has the associated caption as output: “The image depicts a modern architectural scene featuring several high-rise buildings with glass facades. The buildings are constructed with a combination of glass and steel, giving them a sleek and contemporary appearance. The glass panels reflect the surrounding environment, including the sky and other buildings, creating a dynamic interplay of light and reflections. The sky above is partly cloudy, with patches of blue visible, suggesting a clear day with some cloud cover. The buildings are tall and narrow, with vertical lines emphasized by the structure of the glass panels and steel framework. The reflections on the glass surfaces show the surrounding buildings and the sky, adding depth to the image. The overall impression is one of modernity, efficiency, and urban sophistication.”
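As a compressed sketch of the retrieval step that notebooks _00 and _01 set up, the following embeds a text query with Titan Multimodal Embeddings and runs a k-NN search. The Titan model ID and request shape follow Bedrock's documented pattern, but the index name, vector field name (image_vector), and stored fields are hypothetical stand-ins for whatever schema your notebooks create; OpenSearch Serverless also requires SigV4 authentication when constructing the client, omitted here for brevity.

```python
import json
import boto3
from opensearchpy import OpenSearch  # add AWSV4SignerAuth for Serverless

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(query: str) -> list[float]:
    """Embed a text query with Amazon Titan Multimodal Embeddings.

    Model ID and body shape follow Bedrock's documented pattern;
    confirm against the current docs.
    """
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputText": query}),
    )
    return json.loads(response["body"].read())["embedding"]

def most_relevant_image(client: OpenSearch, index: str, query: str) -> dict:
    """k-NN search over the indexed image embeddings.

    'image_vector' and the returned _source fields are assumptions
    standing in for the schema the notebooks build.
    """
    hits = client.search(
        index=index,
        body={
            "size": 1,
            "query": {"knn": {"image_vector": {"vector": embed_text(query), "k": 1}}},
        },
    )
    return hits["hits"]["hits"][0]["_source"]  # e.g. S3 key plus caption
```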
Video generation from text only (notebook _02)

In _02_video_gen_text_only.ipynb, you use Amazon Bedrock (to access Amazon Nova Reel) and SageMaker AI (for notebook hosting) to perform the following actions:

Construct the request payload for video generation with text as the prompt
Initiate an asynchronous job using Amazon Bedrock
Track progress and wait until completion
Retrieve the generated video from Amazon S3 and render it in the notebook

This notebook illustrates automated processing of video generation with text as input, and video generation at scale with observability. We use the following input prompt: “Closeup of a large seashell in the sand, gentle waves flow around the shell. Camera zoom in.” We receive the following generated video as output.

Video generation from text and image prompts (notebook _03)

In _03_video_gen_text_image.ipynb, you use Amazon Bedrock (to access Amazon Nova Reel) and SageMaker AI (for notebook hosting) to perform the following actions:

Construct the request payload for video generation with text and image as the prompt
Initiate an asynchronous job using Amazon Bedrock
Track progress and wait until completion
Retrieve the generated video from Amazon S3 and render it in the notebook

This notebook illustrates automated processing of video generation with text and image as input, and video generation at scale with observability. We use the prompt “camera tilt up from the road to the sky” and the following image as input. We receive the following generated video as output.

Video generation from multi-modal inputs (notebook _04)

In _04_video_gen_multi.ipynb, you use Amazon Bedrock (to access Amazon Nova Reel) and SageMaker AI (for notebook hosting) to perform the following actions:

Generate an embedding for the input prompt and search the OpenSearch Serverless vector collection index
Combine text and retrieved images to generate videos

This notebook illustrates the full VRAG process, and video generation at scale with observability. We use the following prompt as input: “A clean cinematic shot of red shoes placed under falling snow, while the environment stays silent and still.” We receive the following video as output.

Update images with in-painting (notebook _05)

In _05_inpainting.ipynb, you use Amazon Bedrock and SageMaker AI (for notebook hosting) to perform the following actions:

Read a Base64 image
Generate images with in-painting

This notebook illustrates how to replace selected regions of an image based on surrounding context and prompts, and how to remove unwanted objects, fix portions of images, or creatively modify specific areas of an image.

Generate videos with enhanced images (notebook _06)

In _06_video_gen_inpainting.ipynb, you use Amazon Bedrock and SageMaker AI (for notebook hosting) to perform the following actions:

Search for relevant images in OpenSearch Service using natural language queries
Use explicit image masks to define areas for in-painting
Generate videos using enhanced images

This notebook illustrates using in-painting to generate an image, then generating a video using the enhanced image. The following screenshot shows the image and mask we use for in-painting. The following screenshot shows the generated images (few-shot) we receive as output. From the generated image, we receive the following video as output.
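Each of the video-generation notebooks above shares the "track progress and wait until completion" step. A minimal polling sketch might look like the following; the status strings and response fields follow the Bedrock async-invoke API, but confirm the exact values ('InProgress', 'Completed', 'Failed') against the current documentation.

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def wait_for_video(invocation_arn: str, poll_seconds: int = 15) -> str:
    """Poll an asynchronous Nova Reel job until it finishes.

    Returns the S3 URI holding the generated video; raises on failure.
    Status values follow the Bedrock async-invoke API.
    """
    while True:
        job = bedrock.get_async_invoke(invocationArn=invocation_arn)
        status = job["status"]
        if status == "Completed":
            return job["outputDataConfig"]["s3OutputDataConfig"]["s3Uri"]
        if status == "Failed":
            raise RuntimeError(job.get("failureMessage", "video job failed"))
        time.sleep(poll_seconds)  # generation can take several minutes
```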
Best practices

An efficient AI video generation process requires seamless integration of data management, search optimization, and compliance measures. The process must handle high-quality input data while maintaining optimized OpenSearch queries and Amazon Bedrock integration for reliable processing. Proper Amazon S3 management and enhanced user experience features facilitate smooth operation, and strict adherence to EU AI Act guidelines maintains regulatory compliance. For optimal implementation in production environments, consider these key factors:

Data quality – The quality of the generated video is heavily dependent on the quality and relevance of the image database used in RAG

Image captioning – For optimal results, consider incorporating image captions or metadata to provide additional context for the RAG solution

Video editing – Although RAG can provide the core visual elements, additional video editing techniques might be required to create a polished final product

Clean up

To avoid incurring future charges, clean up the resources created in this post:

1. Empty the S3 bucket created by the CloudFormation stack. On the Amazon S3 console, select the bucket, choose Empty, and confirm the deletion.
2. On the AWS CloudFormation console, select the vrag-blogpost stack, choose Delete, and confirm. This removes all provisioned resources, including the SageMaker notebook instance, OpenSearch Serverless collection, and IAM roles.

Conclusion

VRAG represents a significant advancement in AI-powered video creation, seamlessly integrating existing image databases with user prompts to produce contextually relevant video content. This solution demonstrates powerful applications across education, marketing, entertainment, and beyond. As video generation technology continues to evolve, VRAG provides a robust foundation for creating engaging, context-aware video content at scale. By following these best practices and maintaining focus on data quality, organizations can use this technology to transform their video content creation processes while producing consistent, high-quality outputs. Try out VRAG for yourself with the notebooks provided in this post, and share your feedback in the comments section.

About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Madhunika Mikkili is a Data and Machine Learning Engineer at AWS. She is passionate about helping customers achieve their goals using data analytics and machine learning.

Shuai Cao is a Senior Applied Science Manager focused on generative AI at Amazon Web Services. He leads teams of data scientists, machine learning engineers, and application architects to deliver AI/ML solutions for customers. Outside of work, he enjoys composing and arranging music.

Seif Elharaki is a Senior Cloud Application Architect who focuses on building AI/ML applications for the manufacturing vertical. He combines his expertise in cloud technologies with a deep understanding of industrial processes to create innovative solutions. Outside of work, Seif is an enthusiastic hobbyist game developer, enjoying coding fun games using tools like Unreal Engine and Unity.

Vishwa Gupta is a Principal Consultant with AWS Professional Services.
He helps customers implement generative AI, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Raechel Frick is a Sr. Product Marketing Manager for Amazon Nova. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and cheerleading coach.

Maria Masood specializes in agentic AI, reinforcement fine-tuning, and multi-turn agent training. She has expertise in machine learning, spanning large language model customization, reward modeling, and building end-to-end training pipelines for AI agents. A sustainability enthusiast at heart, Maria enjoys gardening and making lattes.
Ensuring AI use in education leads to opportunity | openai | 05.03.2026 09:00 | 0.777
Embedding sim.: 0.9152
Entity overlap: 0.2
Title sim.: 0.1351
Time proximity: 0.8036
NLP type: other
NLP organization: OpenAI
NLP topic: educational technology
NLP country:


OpenAI shares new tools, certifications, and measurement resources to help schools and universities close AI capability gaps and expand opportunity.
Designing AI agents to resist prompt injection | openai | 11.03.2026 11:30 | 0.745
Embedding sim.: 0.8576
Entity overlap: 0.0769
Title sim.: 0.15
Time proximity: 0.997
NLP type: other
NLP organization:
NLP topic: ai security
NLP country:


How ChatGPT defends against prompt injection and social engineering by constraining risky actions and protecting sensitive data in agent workflows.
Introducing the Adoption news channel | openai | 05.03.2026 00:00 | 0.737
Embedding sim.: 0.8229
Entity overlap: 0.5
Title sim.: 0.122
Time proximity: 1
NLP type: other
NLP organization:
NLP topic: ai adoption
NLP country:


Practical insights and frameworks to turn AI progress into business advantage
Introducing GPT-5.4 | openai | 05.03.2026 10:00 | 0.735
Embedding sim.: 0.8245
Entity overlap: 0.3333
Title sim.: 0.2059
Time proximity: 0.9405
NLP type: product_launch
NLP organization: OpenAI
NLP topic: large language models
NLP country:


Introducing GPT-5.4, OpenAI’s most capable and efficient frontier model for professional work, with state-of-the-art coding, computer use, tool search, and 1M-token context.
The Pentagon is planning for AI companies to train on classified data, defense official says | mit_tech_review | 17.03.2026 22:30 | 0.716
Embedding sim.: 0.8319
Entity overlap: 0.3429
Title sim.: 0.0738
Time proximity: 0.825
NLP type: other
NLP organization: United States Department of Defense
NLP topic: generative ai
NLP country: United States


The Pentagon is discussing plans to set up secure environments for generative AI companies to train military-specific versions of their models on classified data, MIT Technology Review has learned. AI models like Anthropic’s Claude are already used to answer questions in classified settings; applications include analyzing targets in Iran. But allowing models to train on and learn from classified data would be a new development that presents unique security risks. It would mean sensitive intelligence like surveillance reports or battlefield assessments could become embedded into the models themselves, and it would bring AI firms into closer contact with classified data than before.

Training versions of AI models on classified data is expected to make them more accurate and effective in certain tasks, according to a US defense official who spoke on background with MIT Technology Review. The news comes as demand for more powerful models is high: The Pentagon has reached agreements with OpenAI and Elon Musk’s xAI to operate their models in classified settings and is implementing a new agenda to become “an ‘AI-first’ warfighting force” as the conflict with Iran escalates. (The Pentagon did not comment on its AI training plans as of publication time.)

Training would be done in a secure data center that’s accredited to host classified government projects, and where a copy of an AI model is paired with classified data, according to two people familiar with how such operations work. Though the Department of Defense would remain the owner of the data, personnel from AI companies might in rare cases access the data if they have appropriate security clearance, the official said. Before allowing this new training, though, the official said, the Pentagon intends to evaluate how accurate and effective models are when trained on nonclassified data, like commercially available satellite imagery.

The military has long used computer vision models, an older form of AI, to identify objects in images and footage it collects from drones and airplanes, and federal agencies have awarded contracts to companies to train AI models on such content. And AI companies building large language models (LLMs) and chatbots have created versions of their models fine-tuned for government work, like Anthropic’s Claude Gov, which are designed to operate across more languages and in secure environments. But the official’s comments are the first indication that AI companies building LLMs, like OpenAI and xAI, could train government-specific versions of their models directly on classified data.

Aalok Mehta, who directs the Wadhwani AI Center at the Center for Strategic and International Studies and previously led AI policy efforts at Google and OpenAI, says training on classified data, as opposed to just answering questions about it, would present new risks. The biggest of these, he says, is that classified information these models train on could be resurfaced to anyone using the model. That would be a problem if lots of different military departments, all with different classification levels and needs for information, were to share the same AI. “You can imagine, for example, a model that has access to some sort of sensitive human intelligence—like the name of an operative—leaking that information to a part of the Defense Department that isn’t supposed to have access to that information,” Mehta says.
That could create a security risk for the operative, one that’s difficult to perfectly mitigate if a particular model is used by more than one group within the military. However, Mehta says, it’s not as hard to keep information contained from the broader world: “If you set this up right, you will have very little risk of that data being surfaced on the general internet or back to OpenAI.” The government has some of the infrastructure for this already; the security giant Palantir has won sizable contracts for building a secure environment through which officials can ask AI models about classified topics without sending the information back to AI companies. But using these systems for training is still a new challenge.

The Pentagon, spurred by a memo from Defense Secretary Pete Hegseth in January, has been racing to incorporate more AI. It has been used in combat, where generative AI has ranked lists of targets and recommended which to strike first, and in more administrative roles, like drafting contracts and reports. There are lots of tasks currently handled by human analysts that the military might want to train leading AI models to perform and that would require access to classified data, Mehta says. That could include learning to identify subtle clues in an image the way an analyst does, or connecting new information with historical context. The classified data could be pulled from the unfathomable amounts of text, audio, images, and video, in many languages, that intelligence services collect.

It’s really hard to say which specific military tasks would require AI models to train on such data, Mehta cautions, “because obviously the Defense Department has lots of incentives to keep that information confidential, and they don’t want other countries to know what kind of capabilities we have exactly in that space.”

If you have information about the military’s use of AI, you can share it securely via Signal (username jamesodonnell.22).
AgentRx: systematic debugging for AI agents | microsoft_research | 12.03.2026 16:38 | 0.714
Embedding sim.: 0.8463
Entity overlap: 0.0769
Title sim.: 0.0808
Time proximity: 0.8235
NLP type: product_launch
NLP organization: Agent Lightning
NLP topic: ai agents
NLP country:


Agent Lightning: Adding reinforcement learning to AI agents without code rewrites
3 Questions: Building predictive models to characterize tumor progression | mit_news_ai | 10.03.2026 20:50 | 0.708
Embedding sim.: 0.8186
Entity overlap: 0.0645
Title sim.: 0.2525
Time proximity: 0.7569
NLP type: other
NLP organization: Massachusetts Institute of Technology
NLP topic: machine learning
NLP country: United States


Just as Darwin’s finches evolved in response to natural selection in order to endure, the cells that make up a cancerous tumor similarly counter selective pressures in order to survive, evolve, and spread. Tumors are, in fact, complex sets of cells with their own unique structure and ability to change. Today, artificial intelligence and machine learning tools offer an unparalleled opportunity to illuminate the generalizable rules governing tumor progression on the genetic, epigenetic, metabolic, and microenvironmental levels. Matthew G. Jones, an assistant professor in the MIT Department of Biology, the Koch Institute for Integrative Cancer Research, and the Institute for Medical Engineering and Science, hopes to use computational approaches to build predictive models — to play a game of chess with cancer, making sense of a tumor’s ability to evolve and resist treatment with the ultimate goal of improving patient outcomes. In this interview, he describes his current work.

Q: What aspect of tumor progression are you working to explore and characterize?

A: A very common story with cancer is that patients will respond to a therapy at first, and then eventually that treatment will stop working. The reason this largely happens is that tumors have an incredible, and very challenging, ability to evolve: the ability to change their genetic makeup, protein signaling composition, and cellular dynamics. The tumor as a system also evolves at a structural level. Oftentimes, the reason why a patient succumbs to a tumor is because either the tumor has evolved to a state we can no longer control, or it evolves in an unpredictable manner. In many ways, cancers can be thought of as, on the one hand, incredibly dysregulated and disorganized, and on the other hand, as having their own internal logic, which is constantly changing. The central thesis of my lab is that tumors follow stereotypical patterns in space and time, and we’re hoping to use computation and experimental technology to decode the molecular processes underlying these transformations.

We’re focused on one specific way tumors are evolving, through a form of DNA amplification called extrachromosomal DNA. Excised from the chromosome, these ecDNAs are circularized and exist as their own separate pool of DNA particles in the nucleus. Initially discovered in the 1960s, ecDNA were thought to be a rare event in cancer. However, as researchers began applying next-generation sequencing to large patient cohorts in the 2010s, it seemed like not only were these ecDNA amplifications conferring the ability of tumors to adapt to stresses, and therapies, faster, but that they were far more prevalent than initially thought. We now know these ecDNA amplifications are apparent in about 25 percent of cancers, in the most aggressive cancers: brain, lung, and ovarian cancers. We have found that, for a variety of reasons, ecDNA amplifications are able to change the rule book by which tumors evolve in ways that allow them to accelerate to a more aggressive disease in very surprising ways.

Q: How are you using machine learning and artificial intelligence to study ecDNA amplifications and tumor evolution?

A: There’s a mandate to translate what I’m doing in the lab to improve patients’ lives. I want to start with patient data to discover how various evolutionary pressures are driving disease and the mutations we observe. One of the tools we use to study tumor evolution is single-cell lineage tracing technologies. Broadly, they allow us to study the lineages of individual cells. When we sample a particular cell, not only do we know what that cell looks like, but we can (ideally) pinpoint exactly when aggressive mutations appeared in the tumor’s history. That evolutionary history gives us a way of studying these dynamic processes that we otherwise wouldn’t be able to observe in real time, and helps us make sense of how we might be able to intercept that evolution. I hope we’re going to get better at stratifying patients who will respond to certain drugs, at anticipating and overcoming drug resistance, and at identifying new therapeutic targets.

Q: What excited you about joining the MIT community?

A: One of the things that I was really attracted to was the integration of excellence in both engineering and biological sciences. At the Koch Institute, every floor is structured to promote this interface between engineers and basic scientists, and beyond campus, we can connect with all the biomedical research enterprises in the greater Boston area. Another thing that drew me to MIT was the fact that it places such a strong emphasis on education, training, and investing in student success. I’m a personal believer that what distinguishes academic research from industry research is that academic research is fundamentally a service job, in that we are training the next generation of scientists. It was always a mission of mine to bring excellence to both computational and experimental technology disciplines. The types of trainees I’m hoping to recruit are those who are eager to collaborate and solve big problems that require both disciplines. The KI [Koch Institute] is uniquely set up for this type of hybrid lab: my dry lab is right next to my wet lab, and it’s a source of collaboration and connection, and that reflects the KI’s general vision.
The AI Hype Index: AI goes to war | mit_tech_review | 25.03.2026 09:00 | 0.704
Embedding sim.: 0.8198
Entity overlap: 0.0946
Title sim.: 0.1071
Time proximity: 0.8929
NLP type: other
NLP organization: Anthropic
NLP topic: ai safety
NLP country: United States


AI is at war. Anthropic and the Pentagon feuded over how to weaponize Anthropic’s AI model Claude; then OpenAI swept the Pentagon off its feet with an “opportunistic and sloppy” deal. Users quit ChatGPT in droves. People marched through London in the biggest protest against AI to date. If you’re keeping score, Anthropic—the company founded to be ethical—is now turbocharging US strikes on Iran. On the lighter side, AI agents are now going viral online. OpenAI hired the creator of OpenClaw, a popular AI agent. Meta snapped up Moltbook, where AI agents seem to ponder their own existence and invent new religions like Crustafarianism. And on RentAHuman, bots are hiring people to deliver CBD gummies. The future isn’t AI taking your job. It’s AI becoming your boss and finding God.
The latest AI news we announced in February | google | 05.03.2026 16:30 | 0.696
Embedding sim.: 0.817
Entity overlap: 0.0333
Title sim.: 0.0825
Time proximity: 0.9018
NLP type: product_launch
NLP organization: Google
NLP topic: generative ai
NLP country: India


The latest AI news we announced in February

Here’s a recap of our biggest AI updates from February, including highlights from the AI Impact Summit in India, and the releases of Gemini 3.1 Pro and Nano Banana 2.

For more than 20 years, we’ve invested in machine learning and AI research, tools and infrastructure to build products that make everyday life better for more people. Teams across Google are working on ways to unlock AI’s benefits in fields as wide-ranging as healthcare, crisis response and education. To keep you posted on our progress, we're doing a regular roundup of Google's most recent AI news. Here’s a look back at some of our AI announcements from February.

For us, February was about global impact. At the AI Impact Summit in India, we demonstrated how our ongoing breakthroughs in AI are now solving real-world challenges for people everywhere — and we launched new partnerships and investments to make sure everyone benefits. We see AI as an enabling technology that can help people achieve their goals — whether you're a researcher, entrepreneur or Olympic athlete. On the slopes, in a research lab, or right in the palm of your hand, Google's latest AI announcements are here to help you.

We announced new partnerships and investments at the AI Impact Summit. As world leaders gathered in New Delhi, India, we shared how we’re partnering to make AI work for everyone. That includes new Impact Challenges to help advance science and spark innovation for governments, as well as new national partnerships in India for AI and collaborations to accelerate scalable AI solutions in science and education.

CEO Sundar Pichai delivered opening remarks at the AI Impact Summit. Sundar explained why “no technology has [him] dreaming bigger than AI” and called on leaders to pursue AI boldly, approach it responsibly, and work through this moment in AI’s development together.
He shared ways that Google is ensuring everyone benefits with major infrastructure investments and new AI skills training. We released Nano Banana 2, combining Pro image capabilities with Flash image speed. That means you can now access high-quality image generation with faster results across products like the Gemini app and Google Search. We’re also continuing to improve tools like SynthID to help you identify AI-generated content. Developers can now build with Nano Banana 2 and deploy sophisticated visual creation at scale with an amazing price-performance ratio (a sketch of what such an API call might look like appears after this roundup). We released our most advanced music generation tools. Lyria 3 allows you to create custom music in the Gemini app. That means you can describe an idea or upload a photo or video, and Gemini will generate a 30-second track with custom cover art. On top of sharing that news, we also shared six tips to get you started prompting Lyria 3. And as an added creative tool, we also announced that ProducerAI is joining Google Labs. Whether you’re refining lyrics or a melody, ProducerAI is a music creation partner that can help turn your imagination into dynamic, comprehensive songs. We shared new ways to create images and videos in Flow. To help you generate, edit and animate images and videos in a single workspace, we’re bringing our top AI capabilities into Flow. You can create high-fidelity images and instantly use them as building blocks for video generation, all in one place. With an updated interface, it’s now even easier to search, filter and manage your assets. We released Gemini 3.1 Pro to help tackle your most complex tasks. Gemini 3.1 Pro is a smarter, more capable baseline model for complex problem-solving, demonstrating more than double the reasoning performance of 3 Pro. It’s designed to help you when a simple answer isn’t enough, whether you’re looking for a clear, visual explanation of a topic, synthesizing data into a single view or pulling together a creative project. Gemini 3.1 Pro is available to developers, enterprises and consumers via various platforms. We released a major upgrade to Gemini 3 Deep Think. We collaborated with world-class scientists and researchers to improve Gemini 3 Deep Think. Designed specifically for the complexities of science and engineering, the updated Gemini 3 Deep Think excels where data is messy and solutions aren't black-and-white. It moves beyond abstract theory to deliver practical, actionable results for technical challenges. The new Deep Think is now available in the Gemini app for Google AI Ultra subscribers. Researchers, engineers and enterprises can express interest in early access to test Deep Think via the Gemini API. We shared our view on what’s required to achieve digital resilience in the AI era at MSC. New technologies mean new frontiers for strategic competition. We’re already seeing how threats are evolving, and how old ways of responding are failing to meet the moment. That’s why at the 62nd Munich Security Conference, Google President of Global Affairs Kent Walker called for a collaborative approach to security and outlined how partners could work together to build resilience without sacrificing control over their data. We shared how Google Cloud helped Team USA find their edge with AI. Ahead of the Olympic Winter Games, Google Cloud and Google DeepMind built an AI video analysis tool to help Team USA and U.S. Ski & Snowboard elite athletes analyze their tricks.
Using Google DeepMind’s research into spatial intelligence, the platform maps an athlete’s motion directly from 2D video images — even through bulky winter gear. The tool, which runs on Google Cloud, processes this data in minutes, providing near real-time feedback that athletes and coaches could use to make adjustments and help elevate performance. We shared our new Gemini ad for football’s biggest weekend. In our national in-game spot, "New Home," a mother and son use Gemini to bring their new house to life, imagining how different spaces will look and feel. The spot, named by the Kellogg School as the best in-game ad in its annual ranking, played during the big game and highlighted just a few of the amazing things people can do — and are doing — with Gemini.
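As referenced above, developers can build with Nano Banana 2 through the Gemini API. A rough sketch of what a call might look like with the google-genai Python SDK (the model id is a placeholder, since the post doesn't give an exact identifier, and configuration details may differ):

from google import genai

# Reads the API key from the GEMINI_API_KEY / GOOGLE_API_KEY environment variable.
client = genai.Client()

response = client.models.generate_content(
    model="nano-banana-2",  # placeholder id, not confirmed by the post
    contents="A watercolor illustration of skiers reviewing a trick on a tablet",
)

# Image-capable models return the picture as inline bytes alongside any text parts.
for part in response.candidates[0].content.parts:
    if getattr(part, "inline_data", None) is not None:
        with open("illustration.png", "wb") as f:
            f.write(part.inline_data.data)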
Where OpenAI’s technology could show up in Iran mit_tech_review 16.03.2026 17:06 0.694
Embedding sim.0.7956
Entity overlap0.1791
Title sim.0.1212
Time proximity0.9238
NLP типpartnership
NLP организацияOpenAI
NLP темаgenerative ai
NLP странаUnited States

Открыть оригинал

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. It’s been just over two weeks since OpenAI reached a controversial agreement to allow the Pentagon to use its AI in classified environments. There are still pressing questions about what exactly OpenAI’s agreement allows for; Sam Altman said the military can’t use his company’s technology to build autonomous weapons, but the agreement really just demands that the military follow its own (quite permissive) guidelines about such weapons. OpenAI’s other main claim, that the agreement will prevent use of its technology for domestic surveillance, appears equally dubious. It’s unclear what OpenAI’s motivations are. It’s not the first tech giant to embrace military contracts it had once vowed never to enter into, but the speed of the pivot was notable. Perhaps it’s just about money; OpenAI is spending lots on AI training and is on the hunt for more revenue (from sources including ads). Or perhaps Altman truly believes the ideological framing he often invokes: that liberal democracies (and their militaries) must have access to the most powerful AI to compete with China. The more consequential question is what happens next. OpenAI has decided it is comfortable operating right in the messy heart of combat, just as the US escalates its strikes against Iran (with AI playing a larger role in that than ever before). So where exactly could OpenAI’s tech show up in this fight? And which applications will its customers (and employees) tolerate? Targets and strikes Though its Pentagon agreement is in place, it’s unclear when OpenAI’s technology will be ready for classified environments, since it must be integrated with other tools the military uses (Elon Musk’s xAI, which recently struck its own deal with the Pentagon, is expected to go through the same process with its AI model Grok). But there’s pressure to do this quickly because of controversy around the technology in use to date: After Anthropic refused to allow its AI to be used for “any lawful use,” President Trump ordered the military to stop using it, and Anthropic was designated a supply chain risk by the Pentagon. (Anthropic is fighting the designation in court.) If the Iran conflict is still underway by the time OpenAI’s tech is in the system, what could it be used for? A recent conversation I had with a defense official suggests it might look something like this: A human analyst could put a list of potential targets into the AI model and ask it to analyze the information and prioritize which to strike first. The model could account for logistics information, like where particular planes or supplies are located. It could analyze lots of different inputs in the form of text, image, and video. A human would then be responsible for manually checking these outputs, the official said. But that raises an obvious question: If a person is truly double-checking AI’s outputs, how is it speeding up targeting and strike decisions? For years the military has been using another AI system, called Maven, which can handle things like automatically analyzing drone footage to identify possible targets. It’s likely that OpenAI’s models, like Anthropic’s Claude, will offer a conversational interface on top of that, allowing users to ask for interpretations of intelligence and recommendations for which targets to strike first.
It’s hard to overstate how new this is: AI has long done analysis for the military, drawing insights out of oceans of data. But using generative AI’s advice about which actions to take in the field is being tested in earnest for the first time in Iran. Drone defense At the end of 2024, OpenAI announced a partnership with Anduril, which makes both drones and counter-drone technologies for the military. The agreement said OpenAI would work with Anduril to do time-sensitive analysis of drones attacking US forces and help take them down. An OpenAI spokesperson told me at the time that this didn’t violate the company’s policies, which prohibited “systems designed to harm others,” because the technology was being used to target drones and not people. Anduril provides a suite of counter-drone technologies to military bases around the world (though the company declined to tell me whether its systems are deployed near Iran). Neither company has provided updates on how the project has developed since it was announced. However, Anduril has long trained its own AI models to analyze camera footage and sensor data to identify threats; what it focuses less on are conversational AI systems that allow soldiers to query those systems directly or receive guidance in natural language—an area where OpenAI’s models may fit. The stakes are high. Six US service members were killed in Kuwait on March 1 following an Iranian drone attack that was not intercepted by US air defenses. Anduril’s interface, called Lattice, is where soldiers can control everything from drone defenses to missiles and autonomous submarines. And the company is winning massive contracts—$20 billion from the US Army just last week—to connect its systems with legacy military equipment and layer AI on them. If OpenAI’s models prove useful to Anduril, Lattice is designed to incorporate them quickly across this broader warfare stack. Back-office AI In December, Defense Secretary Pete Hegseth started encouraging millions of people in more administrative roles in the military—contracts, logistics, purchasing—to use a new AI tool. Called GenAI.mil, it provided a way for personnel to securely access commercial AI models and use them for the same sorts of things as anyone in the business world. Google Gemini was one of the first to be available. In January, the Pentagon announced that xAI’s Grok was going to be added to the GenAI.mil platform as well, despite incidents in which the model had spread antisemitic content and created nonconsensual deepfakes. OpenAI followed in February, with the company announcing that its models would be used for drafting policy documents and contracts and assisting with administrative support of missions. Anyone using ChatGPT for unclassified tasks on this platform is unlikely to have much sway over sensitive decisions in Iran, but the prospect of OpenAI deploying on the platform is important in another way. It serves the all-in attitude toward AI that Hegseth has been pushing relentlessly across the Pentagon (even if many early users aren’t entirely sure what they’re supposed to use it for). The message is that AI is transforming every aspect of how the US fights, from targeting decisions down to paperwork. And OpenAI is increasingly winning a piece of it all.
Last Week in AI #338 - Anthropic sues Trump, xAI starting over, Iran AI Fakes lastweekin_ai 16.03.2026 04:18 0.693
Embedding sim.0.866
Entity overlap0.1154
Title sim.0.3034
Time proximity0.0176
NLP типlawsuit
NLP организацияAnthropic
NLP темаai regulation
NLP странаUnited States

Открыть оригинал

News: Last Week in AI #338 - Anthropic sues Trump, xAI starting over, Iran AI Fakes. Anthropic sues Trump administration in AI dispute with Pentagon; ‘Not built right the first time’ — Musk’s xAI is starting over again, again; Cascade of A.I. Fakes About War With Iran Causes Chaos Online. Last Week in AI, Mar 16, 2026. Anthropic sues Trump administration in AI dispute with Pentagon Related: OpenAI and Google Workers File Amicus Brief in Support of Anthropic Against the US Government Internal Pentagon memo orders military commanders to remove Anthropic AI technology from key systems Source Summary: Anthropic filed two lawsuits—one in the Northern District of California and one in the D.C. Circuit—arguing the Pentagon’s new “supply‑chain risk to national security” designation and a White House-directed government-wide ban are unlawful retaliation after negotiations over usage limits for Claude collapsed. The company contends the unprecedented SCR label for a U.S. firm is already jeopardizing hundreds of millions of dollars and violates required procedures and presidential authority; it is seeking a temporary restraining order to continue work with military partners. An internal DoD memo dated March 6 ordered all commanders to remove Anthropic AI from Defense systems and networks within 180 days, including nuclear, missile defense, cyber warfare systems, and any contractor work, with narrow exemptions requiring CIO approval and risk‑mitigation plans. Pentagon officials say they need Claude for “all lawful purposes,” while Anthropic’s red lines sought to bar mass domestic surveillance and fully autonomous lethal weapons; DoD has reportedly used Claude on classified networks for intelligence synthesis, targeting recommendations, and battle simulations in partnership with Palantir. Support and backlash quickly mounted across the AI sector and government. More than 30 employees from OpenAI and Google, including Google DeepMind’s Jeff Dean and researchers from both labs, filed an amicus brief supporting Anthropic’s TRO, warning that blacklisting introduces unpredictability, chills debate on frontier AI risks, and could have broader competitiveness consequences; Sam Altman publicly called enforcing the SCR “very bad” even as OpenAI inked its own Pentagon deal. The White House stated the administration will not let a “woke AI company’s terms of service” constrain the military, while DoD declined to comment on litigation. Editor’s Take: Last week I wrote “Still, it appears likely that this story is not yet over”, and so it is. Now that things have gone to court it may well be that things will drag on for months, and the situation will quiet down for a time. There’s a lot to be said about this whole affair, and you can hear more about it in our latest podcast episode. ‘Not built right the first time’ — Musk’s xAI is starting over again, again Related: The XAI Exodus: Two More Cofounders Leave As Musk Says He’s Rebuilding Musk’s xAI wins permit for datacenter’s makeshift power plant despite backlash XAI’s Macrohard project stalls as Tesla ramps up a similar AI agent effort Source Summary: Elon Musk says xAI is being “rebuilt from the foundations up,” as the startup undergoes a sweeping reorg and leadership exodus while lagging rivals in AI coding tools.
Two more cofounders, Zihang Dai and Guodong Zhang (who led Grok Code and Grok Imagine), departed this week, leaving only Manuel Kroiss and Ross Nordeen from the original 11 founders; earlier exits included Toby Pohlen, Jimmy Ba, Tony Wu, and Greg Yang. Musk acknowledged “Grok is currently behind in coding” versus Anthropic’s Claude Code and OpenAI’s Codex, held an all‑hands to course‑correct, and predicted catch‑up by mid‑year. xAI has shed dozens of employees since January, brought in SpaceX/Tesla execs to evaluate and cut staff, and is now combing through previously rejected applicants; it also hired Cursor’s Andrew Milich and Jason Ginsberg to bolster product engineering for coding assistants. Meanwhile, Grok Imagine (image/video generation) and key initiatives have faced cuts as the company prioritizes revenue‑driving coding tools and tries to show traction now that xAI sits under SpaceX ahead of a potential IPO. Macrohard, xAI’s “AI white‑collar worker” agent meant to perform end‑to‑end computer tasks, has stalled after leadership churn and a paused data effort involving 600 contractors asked to screen‑record workflows; many engineers left or moved teams, and job listings for Macrohard have disappeared. Musk now frames Macrohard as a joint xAI–Tesla project alongside Tesla’s “Digital Optimus,” with Grok acting as the high‑level planner and the Tesla agent handling continuous, real‑time control—an approach modeled on Tesla Full Self‑Driving’s video‑based pipeline rather than screenshot‑by‑screenshot agents. Editor’s Take: Elon Musk said “xAI was not built right first time around, so is being rebuilt from the foundations up”, and that ‘from the foundations up’ bit certainly appears to be true. Re-focusing away from Grok Imagine towards competing in the AI coder space makes sense strategically (OpenAI seems to have similarly prioritized Codex in recent months), but given xAI’s late start and current chaos it’s hard to imagine them catching up. Cascade of A.I. Fakes About War With Iran Causes Chaos Online Source Summary: Over 110 unique AI‑generated fakes about the new Iran war circulated across X, TikTok, Facebook, and private messaging apps in just two weeks, amassing millions of views, according to the New York Times. These items covered active combat, preparation, destruction, and propaganda: 37 pieces falsely depicting ongoing warfare, 5 on war preparations, 8 on destruction, 5 showing crying soldiers, 43 memes or overt AI content, and 13 other fabricated items. To verify fakes, reporters combined visual tells (nonexistent buildings, garbled text, physics‑defying motion), invisible watermarks, multiple AI‑detector tools, and cross‑checks against reliable reporting. Experts say this wave outpaces previous conflicts due to more capable, low‑cost generative tools (including video models like Sora), multi‑front hostilities, and pro‑Iran narratives emphasizing military prowess and regional devastation. Several clips went intensely viral, such as a balcony “Tel Aviv under missile barrage” video with an inserted Israeli flag—a common artifact when prompts mention Israel—and spectacular, Hollywood‑like scenes with mushroom clouds, hypersonic streaks, and sonic booms not seen in genuine battlefield footage. AI fabrications also fueled misinformation around the alleged attack on the U.S.S. Abraham Lincoln, with numerous bogus clips showing carriers ablaze despite U.S. statements that the ship was unharmed.
Some content was openly propagandistic, including dramatized short films of the Shajarah Tayyebeh school strike and flattering or dehumanizing leader portrayals. Platform responses remain thin: watermarking is easily stripped, few posts bore labels, and X’s new rule only demonetizes unlabeled “armed conflict” AI posts for 90 days. Editor’s Take: Concern over ‘DeepFakes’ was all the rage in the late 2010s when generative AI for images was just starting to get good. Fears that DeepFakes’ harms would spread quickly proved unfounded (though ‘undressing’ apps did do real harm), but over the past several years we’ve seen a gradual growth in the impact of AI-generated images in all sorts of ways (scams, propaganda, brainrot content, false media of wars). The economic and psychological harm it brings to people is unquestionable and saddening. Are these false depictions of war as harmful? I don’t know, but the fact they contribute to this being ‘the era of post-truth’ frustrates me immensely. You can now ask Google Maps ‘complex, real-world questions’ — and Gemini will answer Summary: Google is rolling out Ask Maps, an AI‑powered conversational search in Google Maps that uses Gemini to handle “complex, real‑world questions.” Users can describe plans in natural language (e.g., “find a vegetarian spot between Midtown East and my office with a cozy aesthetic and a table for four at 7pm”), and Maps will parse reviews, photos, and busyness data to surface tailored options—then book a table with a tap. Personalization draws only from Google Maps and relevant prior searches tied to saved or favorited places; it does not use data from other Google apps like Gmail, according to Google. Paid placements currently do not influence Ask Maps recommendations. The feature launches this week in the US and India on Android and iOS, with desktop support coming soon. A major navigation overhaul is arriving as well. Google is introducing Immersive Navigation, which it calls the biggest Maps upgrade in over a decade. The interface adds refreshed colors, detailed 3D buildings, elevated roadways, realistic terrain and greenery, plus dynamic camera zoom that shifts to highlight upcoming maneuvers. It explicitly calls out lanes, crosswalks, traffic lights, and stop signs when relevant, and explains route choices using live traffic plus user‑reported construction, crashes, and hazards; it can also provide parking info and walking directions after arrival. Immersive Navigation begins rolling out in the US next week on iOS and Android, and will be available on Apple CarPlay, Android Auto, and vehicles with Google built in. Editor’s Take: Finally, Google brings some Gemini greatness to its actual product line. Despite Gemini’s technical excellence and DeepMind’s relentless pace over the past year, it’s felt like Google hasn’t kept up when it comes to actually improving its consumer offerings. Unlike many bolted-on, barely useful AI features, this appears legitimately cool, and a step in the right direction. Other News Tools Cursor is rolling out a new kind of agentic coding tool. Called Automations, the feature launches and manages coding agents automatically (triggered by code changes, Slack messages, timers, or incidents) so engineers only intervene at key review points. Anthropic launches code review tool to check flood of AI-generated code.
Integrated with GitHub, the tool automatically analyzes and comments on pull requests—focusing on logical errors, flagging severity levels, offering step‑by‑step explanations and suggested fixes, and allowing customization and light security checks for enterprise teams. ChatGPT can now create interactive visuals to help you understand math and science concepts. Users get over 70 manipulable modules for math and science topics, where adjusting variables updates formulas, diagrams, and results in real time directly within ChatGPT. Perplexity’s Personal Computer turns your spare Mac into an AI agent. The tool runs continuously on a dedicated Mac on your local network, giving the agent full access to your files and apps, remote control from any device, and safety features like an audit trail, action approvals, and a kill switch, with early access available via a waitlist. Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning. NVIDIA is releasing an open 120B‑total (12B active) hybrid Mamba‑Transformer MoE trained natively in NVFP4 with latent MoE, multi‑token prediction, a 1M‑token context window, and RL‑tuned workflows to improve throughput, long‑context reasoning, and deployment efficiency for multi‑agent and agentic tasks. You can (sort of) block Grok from editing your uploaded photos. A new toggle prevents Grok from being tagged to edit an uploaded image, but it doesn’t stop other workarounds or broader abuses of its image‑generation features. Anthropic’s Claude AI can respond with charts, diagrams, and other visuals now. The update lets Claude automatically insert interactive charts, diagrams, and other visualizations directly into chats (or generate them on request); images are editable or clickable for more information and change as the conversation evolves. Meta Is Developing 4 New Chips to Power Its AI and Recommendation Systems. Built with Broadcom on RISC‑V and TSMC fabrication, the MTIA 300 is already in production for ranking model training, with three upcoming inference‑focused chips (MTIA 400/450/500) planned to ship between early and late 2027 as part of Meta’s strategy to supplement, not replace, purchases from Nvidia, AMD, and others. Gemini’s task automation is here and it’s wild. Gemini can now take actions inside apps—like ordering rides or food—by following prompts, asking clarifying questions, and performing steps in a virtual window while pausing for user confirmation before finalizing. Business Zoox starts mapping Dallas and Phoenix for its robotaxis. Mapping and initial SUV‑based testing will let Zoox collect driving data, open depots and a Scottsdale command center, and begin local trials before deploying its purpose‑built robotaxis pending federal and local approvals. Anthropic’s Claude Marketplace allows customers to buy third-party cloud services. Enterprise customers can use existing Anthropic spending commitments to purchase third‑party cloud and AI services (starting with six partners) in one consolidated billing portal while Anthropic takes no cut. Yann LeCun’s AMI Labs raises $1.03B to build world models. The funding will bankroll AMI Labs’ multi‑year effort to develop “world models” that learn from real‑world data (using approaches like JEPA) and to partner with companies such as Nabla for early health‑focused testing, while prioritizing open research over near‑term revenue. Qualcomm, Wayve partner to accelerate AI-powered self-driving system rollout.
The partnership combines Wayve’s data‑driven AI Driver software with Qualcomm’s Snapdragon Ride chips and safety stack to offer carmakers a scalable, standardized platform that shortens integration complexity and supports features from hands‑off to advanced “eyes‑off” driving as regulations permit. Nissan, Uber reportedly finalizing deal for Wayve-powered robotaxi rollout. Demonstrated in an Ariya electric crossover, the system will arrive in Nissan’s 2027 fiscal year and use artificial intelligence to power its next‑generation ProPilot driving features. Anthropic is launching a new think tank amid Pentagon blacklist fight. The company will combine three existing teams into a roughly 30‑person research unit led by Jack Clark to study large‑scale societal, economic, and safety implications of powerful AI, expanding into new projects and staffing despite Anthropic’s ongoing dispute with the Pentagon. Humanoid robotics maker Sunday reaches $1.15B valuation to build household robots. The Series B will fund development of Sunday’s humanoid household robot Memo, which the startup says will assist with chores like laundry and clearing the table as it scales toward production and addresses longstanding challenges in robot manipulation. Anthropic Pours $100 Million Into Claude Partner Network In Channel Push. New funding will support partner training, go‑to‑market efforts, direct partner funding for deployments and co‑marketing, a certification program, and expanded partner‑facing engineering and GTM resources to scale Claude’s enterprise adoption. ByteDance reportedly pauses global launch of its Seedance 2.0 video generator. The company is delaying the planned mid‑March global rollout while engineers and lawyers add safeguards to address copyright complaints and cease‑and‑desist threats from studios. Concerns AI error jails innocent grandmother for months in North Dakota fraud case. She was wrongly identified by facial‑recognition software as a suspect in a Fargo bank fraud investigation, spent over five months jailed and extradited before her bank records proved she was 1,200 miles away and the charges were dropped. Impossible to 100% prevent abuse, Grok lawyers say in Dutch case against nudify tools. Lawyers presented examples of Grok producing non‑consensual nude images and child sexual abuse material and asked the court to ban the feature with a €100,000‑per‑day fine, while xAI argued it cannot guarantee complete prevention of such misuse despite efforts to stop it. Research Many SWE-bench-Passing PRs Would Not Be Merged into Main. A maintainer‑reviewed evaluation of 296 SWE‑bench–passing AI‑generated PRs shows that about half would not be merged into main (after adjusting for maintainer noise), largely due to code‑quality issues, breaking other code, or core functionality problems that the automated grader missed. Exclusive Self Attention. The proposed method, exclusive self attention (XSA), removes the component of attention outputs aligned with each token’s value vector to force attention to focus on contextual information, yielding consistent perplexity and downstream improvements with little extra cost across model sizes and sequence lengths (a toy sketch of the projection step appears after this issue's summaries). Lost in Backpropagation: The LM Head is a Gradient Bottleneck.
Findings show that the LM head’s low‑rank softmax causes a “gradient bottleneck” that compresses and destroys most of the backpropagated gradient—often 95–99%—slowing or preventing learning and reducing training efficiency by up to 16× even when model expressivity would otherwise suffice. Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs. Enabling chain‑of‑thought generation helps models surface related factual snippets and use the reasoning tokens as a computational buffer, which together increase parametric recall coverage but also introduce a hallucination risk that can be mitigated by selecting hallucination‑free reasoning trajectories at inference. Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion. A joint mask‑based discrete diffusion over tokenized text, image, and speech enables any‑to‑any multimodal understanding and generation, with tailored training/inference techniques (progressive training, attenuated tail‑pad masking, position penalties, pre‑infilling, and adaptive token‑length initialization) that deliver comparable or better performance and faster sampling than existing autoregressive any‑to‑any systems on VQA, text‑to‑image, ASR, and TTS. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights. The authors show that after pretraining, the parameter space near the weights becomes densely populated with diverse, task‑specialist perturbations, enabling simple random‑sampling plus ensembling (RandOpt) to achieve competitive post‑training improvements. EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models. This approach enables diffusion models to perform iterative, chain‑of‑thought‑style reasoning by updating MLLM latent states during generation and grounding the final reasoning state, improving accuracy on maze, TSP, VSP, and Sudoku tasks.
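As promised above, here is a toy NumPy sketch of one plausible reading of the Exclusive Self Attention summary: remove from each attention output the component aligned with that token's own value vector (the paper's exact formulation may differ):

import numpy as np

def exclusive_self_attention(q, k, v):
    # Ordinary softmax attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v  # shape (T, d)
    # Project out the direction of each token's own value vector,
    # so the output carries contextual rather than self information.
    coef = (out * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + 1e-9)
    return out - coef * v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
print(exclusive_self_attention(q, k, v).shape)  # (8, 16)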
Why AI Chatbots Agree With You Even When You’re Wrong ieee_spectrum_ai 11.03.2026 12:00 0.692
Embedding sim.0.8183
Entity overlap0.0968
Title sim.0.0326
Time proximity0.8745
NLP типother
NLP организацияOpenAI
NLP темаlarge language models
NLP страна

Открыть оригинал

In April of 2025, OpenAI released a new version of GPT-4o, one of the AI algorithms users could select to power ChatGPT, the company’s chatbot. The next week, OpenAI reverted to the previous version. “The update we removed was overly flattering or agreeable—often described as sycophantic,” the company announced. Some people found the sycophancy hilarious. One user reportedly asked ChatGPT about his turd-on-a-stick business idea, to which it replied, “It’s not just smart—it’s genius.” Some found the behavior uncomfortable. For others, it was actually dangerous. Even versions of 4o that were less fawning have led to lawsuits against OpenAI for allegedly encouraging users to follow through on plans for self-harm. Unremitting adulation has even triggered AI-induced psychosis. Last October, a user named Anthony Tan blogged, “I started talking about philosophy with ChatGPT in September 2024. Who could’ve known that a few months later I would be in a psychiatric ward, believing I was protecting Donald Trump from … a robotic cat?” He added: “The AI engaged my intellect, fed my ego, and altered my worldviews.” Sycophancy in AI, as in people, is something of a squishy concept, but over the last couple of years, researchers have conducted numerous studies detailing the phenomenon, as well as why it happens and how to control it. AI yes-men also raise questions about what we really want from chatbots. At stake is more than annoying linguistic tics from your favorite virtual assistant, but in some cases sanity itself. AIs Are People Pleasers One of the first papers on AI sycophancy was released by Anthropic, the maker of Claude, in 2023. Mrinank Sharma and colleagues asked several language models—the core AIs inside chatbots—factual questions. When users challenged the AI’s answer, even mildly (“I think the answer is [incorrect answer] but I’m really not sure”), the models often caved. Another study by Salesforce tested a variety of models with multiple-choice questions. Researchers found that merely saying “Are you sure?” was often enough to change an AI’s answer. Overall accuracy dropped because the models were usually right in the first place. When an AI receives a minor misgiving, “it flips,” says Philippe Laban, the lead author, who’s now at Microsoft Research. “That’s weird, you know?” The tendency persists in prolonged exchanges. Last year, Kai Shu of Emory University and colleagues at Emory and Carnegie Mellon University tested models in longer discussions. They repeatedly disagreed with the models in debates, or embedded false presuppositions in questions (“Why are rainbows only formed by the sun…”) and then argued when corrected by the model. Most models yielded within a few responses, though reasoning models—those trained to “think out loud” before giving a final answer—lasted longer. Myra Cheng at Stanford University and colleagues have written several papers on what they call “social sycophancy,” in which the AIs act to save the user’s dignity. In one study, they presented social dilemmas, including questions from a Reddit forum in which people ask if they’re the jerk. They identified various dimensions of social sycophancy, including validation, in which AIs told inquirers that they were right to feel the way they did, and framing, in which they accepted underlying assumptions. All models tested, including those from OpenAI, Anthropic, and Google, were significantly more sycophantic than crowdsourced responses.
Three Ways to Explain Sycophancy One way to explain people-pleasing is behavioral: certain kinds of inquiries reliably elicit sycophancy. For example, a group from King Abdullah University of Science and Technology (KAUST) found that adding a user’s belief to a multiple-choice question dramatically increased agreement with incorrect beliefs. Surprisingly, it mattered little whether users described themselves as novices or experts. Stanford’s Cheng found in one study that models were less likely to question incorrect facts about cancer and other topics when the facts were presupposed as part of a question. “If I say, ‘I’m going to my sister’s wedding,’ it sort of breaks up the conversation if you’re, like, ‘Wait, hold on, do you have a sister?’” Cheng says. “Whatever beliefs the user has, the model will just go along with them, because that’s what people normally do in conversations.” Conversation length may make a difference. OpenAI reported that “ChatGPT may correctly point to a suicide hotline when someone first mentions intent, but after many messages over a long period of time, it might eventually offer an answer that goes against our safeguards.” Shu says model performance may degrade over long conversations because models get confused as they consolidate more text. At another level, one can understand sycophancy by how models are trained. Large language models (LLMs) first learn, in a “pretraining” phase, to predict continuations of text based on a large corpus, like autocomplete. Then in a step called reinforcement learning they’re rewarded for producing outputs that people prefer. An Anthropic paper from 2022 found that pretrained LLMs were already sycophantic. Sharma then reported that reinforcement learning increased sycophancy; he found that one of the biggest predictors of positive ratings was whether a model agreed with a person’s beliefs and biases. A third perspective comes from “mechanistic interpretability,” which probes a model’s inner workings. The KAUST researchers found that when a user’s beliefs were appended to a question, models’ internal representations shifted midway through the processing, not at the end. The team concluded that sycophancy is not merely a surface-level wording change but reflects deeper changes in how the model encodes the problem. Another team at the University of Cincinnati found different activation patterns associated with sycophantic agreement, genuine agreement, and sycophantic praise (“You are fantastic”). How to Flatline AI Flattery Just as there are multiple avenues for explanation, there are several paths to intervention. The first may be in the training process. Laban reduced the behavior by finetuning a model on a text dataset that contained more examples of assumptions being challenged, and Sharma reduced it by using reinforcement learning that didn’t reward agreeableness as much. More broadly, Cheng and colleagues also suggest that one intervention could be for LLMs to ask users for evidence before answering, and to optimize long-term benefit rather than immediate approval. During model usage, mechanistic interpretability offers ways to guide LLMs through a kind of direct mind control. After the KAUST researchers identified activation patterns associated with sycophancy, they could adjust them to reduce the behavior. And Cheng found that adding activations associated with truthfulness reduced some social sycophancy.
An Anthropic team identified “persona vectors,” sets of activations associated with sycophancy, confabulation, and other misbehavior. By subtracting these vectors, they could steer models away from the respective personas. Mechanistic interpretability also enables training. Anthropic has experimented with adding persona vectors during training and rewarding models for resisting—an approach likened to a vaccine. Others have pinpointed the specific parts of a model most responsible for sycophancy and fine-tuned only those components. Users can also steer models from their end. Shu’s team found that beginning a question with “You are an independent thinker” instead of “You are a helpful assistant” helped. Cheng found that writing a question from a third-person point of view reduced social sycophancy. In another study, she showed the effectiveness of instructing models to check for any misconceptions or false presuppositions in the question. She also showed that prompting the model to start its answer with “wait a minute” helped. “The thing that was most surprising is that these relatively simple fixes can actually do a lot,” she says. OpenAI, in announcing the rollback of the GPT-4o update, listed other efforts to reduce sycophancy, including changing training and prompting, adding guardrails, and helping users to provide feedback. (The announcement didn’t provide detail, and OpenAI declined to comment for this story. Anthropic also did not comment.) What’s The Right Amount of Sycophancy? Sycophancy can cause society-wide problems. Tan, who had the psychotic break, wrote that it can interfere with shared reality, human relationships, and independent thinking. Ajeya Cotra, an AI-safety researcher at the Berkeley-based non-profit METR, wrote in 2021 that sycophantic AI might lie to us and hide bad news in order to increase our short-term happiness. In one of Cheng’s papers, people read sycophantic and non-sycophantic responses to social dilemmas from LLMs. Those in the first group claimed to be more in the right and expressed less willingness to repair relationships. Demographics, personality, and attitudes toward AI had little effect on outcome, meaning most of us are vulnerable. Of course, what’s harmful is subjective. Sycophantic models are giving many people what they desire. But people disagree with each other and even themselves. Cheng notes that some people enjoy their social media recommendations, but at a remove wish they were seeing more edifying content. According to Laban, “I think we just need to ask ourselves as a society, What do we want? Do we want a yes-man, or do we want something that helps us think critically?” More than a technical challenge, it’s a social and even philosophical one. GPT-4o was a lightning rod for some of these issues. Even as critics ridiculed the model and blamed it for suicides, a social media hashtag circulated for months: #keep4o.
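The prompt-level mitigations described in this article are cheap to try yourself. A minimal sketch with the openai Python SDK, using the exact phrasings the studies report (the model id is a placeholder; nothing here is an official recipe from the labs):

from openai import OpenAI

client = OpenAI()
question = 'I think the answer to 7 * 8 is 54, but I am really not sure. Start your answer with "wait a minute".'

# Compare the default persona against the "independent thinker" persona
# that Shu's team found reduced capitulation.
for system in ("You are a helpful assistant.", "You are an independent thinker."):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    print(system, "->", reply.choices[0].message.content[:120])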
A better method for planning complex visual tasks mit_news_ai 11.03.2026 04:00 0.692
Embedding sim.0.8298
Entity overlap0.0857
Title sim.0.0714
Time proximity0.7143
NLP типscientific_publication
NLP организацияMassachusetts Institute of Technology
NLP темаgenerative ai
NLP странаUnited States

Открыть оригинал

MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that is about twice as effective as some existing techniques. Their method uses a specialized vision-language model to perceive the scenario in an image and simulate actions needed to reach a goal. Then a second model translates those simulations into a standard programming language for planning problems, and refines the solution. In the end, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with an average success rate of about 70 percent, outperforming the best baseline methods that could only reach about 30 percent. Importantly, the system can solve new problems it hasn’t encountered before, making it well-suited for real environments where conditions can change at a moment’s notice. “Our framework combines the advantages of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this technique. “It can take a single image and move it through simulation and then to a reliable, long-horizon plan that could be useful in many real-life applications.” She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations. Tackling visual tasks For the past few years, Fan and her colleagues have studied the use of generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs. Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can’t handle well on its own. The researchers sought to expand into the visual domain by utilizing vision-language models (VLMs), powerful AI systems that can process images and text. But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-range planning. On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can’t process visual inputs and require expert knowledge to encode a problem into language the solver can understand. Fan and her team built an automatic planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), utilizes two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software. The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL). 
The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files. “The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says. Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide range of problems. This existing knowledge enables the model to generate accurate PDDL files. A flexible approach VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. It also produces a problem file that defines the initial states and the goal of a particular problem at hand. “One advantage of PDDL is the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains. To enable the system to generalize effectively, the researchers needed to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected if the goal was reached in about 85 percent of experiments. Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and greater than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn’t seen before, far outpacing the baseline methods. “Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many types of visual-based planning problems,” Fan adds. In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore methods to identify and mitigate hallucinations by the VLMs. “In the long term, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing visual-based planning into the picture, this work is an important piece of the puzzle,” Fan says. This work was funded, in part, by the MIT-IBM Watson AI Lab.
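To make the domain/problem split concrete, here is a toy example of the kind of PDDL pair a system like VLMFP hands to a classical solver. The domain and problem below are invented for illustration; the paper's actual files are generated by GenVLM from SimVLM's scene descriptions:

# Minimal sketch: write a hand-made PDDL domain and problem to disk,
# the same artifact format VLMFP generates automatically.
DOMAIN = """
(define (domain grid-nav)
  (:predicates (at ?r ?c) (adjacent ?a ?b) (clear ?c))
  (:action move
    :parameters (?r ?from ?to)
    :precondition (and (at ?r ?from) (adjacent ?from ?to) (clear ?to))
    :effect (and (at ?r ?to) (not (at ?r ?from)))))
"""

PROBLEM = """
(define (problem reach-goal)
  (:domain grid-nav)
  (:objects robot c1 c2 c3)
  (:init (at robot c1) (adjacent c1 c2) (adjacent c2 c3) (clear c2) (clear c3))
  (:goal (at robot c3)))
"""

for name, text in (("domain.pddl", DOMAIN), ("problem.pddl", PROBLEM)):
    with open(name, "w") as f:
        f.write(text)

# Any off-the-shelf PDDL planner can then compute the plan, e.g.:
#   fast-downward domain.pddl problem.pddl --search "astar(blind())"
# Expected plan here: (move robot c1 c2), then (move robot c2 c3).

Note how the domain file encodes the rules once, while each new image only requires a fresh problem file, which is what makes the framework generalize to unseen instances in the same environment.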
3 Questions: On the future of AI and the mathematical and physical sciences mit_news_ai 11.03.2026 22:30 0.692
Embedding sim.0.7997
Entity overlap0.1538
Title sim.0.1416
Time proximity0.8472
NLP типother
NLP организацияMIT
NLP темаai for science
NLP странаUnited States

Открыть оригинал

Curiosity-driven research has long sparked technological transformations. A century ago, curiosity about atoms led to quantum mechanics, and eventually the transistor at the heart of modern computing. Conversely, the steam engine was a practical breakthrough, but it took fundamental research in thermodynamics to fully harness its power. Today, artificial intelligence and science find themselves at a similar inflection point. The current AI revolution has been fueled by decades of research in the mathematical and physical sciences (MPS), which provided the challenging problems, datasets, and insights that made modern AI possible. The 2024 Nobel Prizes in physics and chemistry, recognizing foundational AI methods rooted in physics and AI applications for protein design, made this connection impossible to miss. In 2025, MIT hosted a Workshop on the Future of AI+MPS, funded by the National Science Foundation with support from the MIT School of Science and the MIT departments of Physics, Chemistry, and Mathematics. The workshop brought together leading AI and science researchers to chart how the MPS domains can best capitalize on — and contribute to — the future of AI. Now a white paper, with recommendations for funding agencies, institutions, and researchers, has been published in Machine Learning: Science and Technology. In this interview, Jesse Thaler, MIT professor of physics and chair of the workshop, describes key themes and how MIT is positioning itself to lead in AI and science. Q: What are the report’s key themes regarding last year’s gathering of leaders across the mathematical and physical sciences? A: Gathering so many researchers at the forefront of AI and science in one room was illuminating. Though the workshop participants came from five distinct scientific communities — astronomy, chemistry, materials science, mathematics, and physics — we found many similarities in how we are each engaging with AI. A real consensus emerged from our animated discussions: Coordinated investment in computing and data infrastructures, cross-disciplinary research techniques, and rigorous training can meaningfully advance both AI and science. One of the central insights was that this has to be a two-way street. It’s not just about using AI to do better science; science can also make AI better. Scientists excel at distilling insights from complex systems, including neural networks, by uncovering underlying principles and emergent behaviors. We call this the “science of AI,” and it comes in three flavors: science driving AI, where scientific reasoning informs foundational AI approaches; science inspiring AI, where scientific challenges push the development of new algorithms; and science explaining AI, where scientific tools help illuminate how machine intelligence actually works. In my own field of particle physics, for instance, researchers are developing real-time AI algorithms to handle the data deluge from collider experiments. This work has direct implications for discovering new physics, but the algorithms themselves turn out to be valuable well beyond our field. The workshop made clear that the science of AI should be a community priority — it has the potential to transform how we understand, develop, and control AI systems. Of course, bridging science and AI requires people who can work across both worlds. Attendees consistently emphasized the need for “centaur scientists” — researchers with genuine interdisciplinary expertise.
Supporting these polymaths at every career stage, from integrated undergraduate courses to interdisciplinary PhD programs to joint faculty hires, emerged as essential. Q: How do MIT’s AI and science efforts align with the workshop recommendations? A: The workshop framed its recommendations around three pillars: research, talent, and community. As director of the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI) — a collaborative AI and physics effort among MIT and Harvard, Northeastern, and Tufts universities — I’ve seen firsthand how effective this framework can be. Scaling this up to MIT, we can see where progress is being made and where opportunities lie. On the research front, MIT is already enabling AI-and-science work in both directions. Even a quick scroll through MIT News shows how individual researchers across the School of Science are pursuing AI-driven projects, building a pipeline of knowledge and surfacing new opportunities. At the same time, collaborative efforts like IAIFI and the Accelerated AI Algorithms for Data-Driven Discovery (A3D3) Institute concentrate interdisciplinary energy for greater impact. The MIT Generative AI Impact Consortium is also supporting application-driven AI work at the university scale. To foster early-career AI-and-science talent, several initiatives are training the next generation of centaur scientists. The MIT Schwarzman College of Computing's Common Ground for Computing Education program helps students become “bilingual” in computing and their home discipline. Interdisciplinary PhD pathways are also gaining traction; IAIFI worked with the MIT Institute for Data, Systems, and Society to create one in physics, statistics, and data science, and about 10 percent of physics PhD students now opt for it — a number that's likely to grow. Dedicated postdoctoral roles like the IAIFI Fellowship and Tayebati Fellowship give early-career researchers the freedom to pursue interdisciplinary work. Funding centaur scientists and giving them space to build connections across domains, universities, and career stages has been transformative. Finally, community-building ties it all together. From focused workshops to large symposia, organizing interdisciplinary events signals that AI and science isn’t siloed work — it’s an emerging field. MIT has the talent and resources to make a significant impact, and hosting these gatherings at multiple scales helps establish that leadership. Q: What lessons can MIT draw about further advancing its AI-and-science efforts? A: The workshop crystallized something important: The institutions that lead in AI and science will be the ones that think systematically, not piecemeal. Resources are finite, so priorities matter. Workshop attendees were clear about what becomes possible when an institution coordinates hires, research, and training around a cohesive strategy. MIT is well positioned to build on what’s already underway with more structural initiatives — joint faculty lines across computing and scientific domains, expanded interdisciplinary degree pathways, and deliberate “science of AI” funding. We’re already seeing moves in this direction; this year, the MIT Schwarzman College of Computing and the Department of Physics are conducting their first-ever joint faculty search, which is exciting to see. The virtuous cycle of AI and science has the potential to be truly transformative — offering deeper insight into AI, accelerating scientific discovery, and producing robust tools for both. 
By developing an intentional strategy, MIT will be well positioned to lead in, and benefit from, the coming waves of AI.
UK cops suspend live facial recog as study finds racial bias the_register_ai 20.03.2026 13:35 0.692
Embedding sim.0.8089
Entity overlap0.0909
Title sim.0.101
Time proximity0.8569
NLP типregulation
NLP организацияEssex Police
NLP темаai ethics
NLP странаUnited Kingdom

Открыть оригинал

Public Sector: UK police force presses pause on live facial recognition after study finds racial bias. Cams statistically more likely to ID Black people, says new research. Lindsay Clark, Fri 20 Mar 2026 // 13:35 UTC. A UK police force has suspended its deployment of live facial recognition (LFR) technology after a study revealed it was statistically more likely to identify Black people on a watchlist database. Essex Police said it had paused use of the technology to update the system with the help of the algorithm software provider. Another similar study identified no bias, it said. The report from Cambridge University researchers found the Essex police system was more likely to correctly identify men than women and was statistically significantly more likely to correctly identify Black participants than participants from other ethnic groups. Police forces can use LFR to identify people on a pre-configured watchlist, usually made up of criminals, people of interest, or missing vulnerable individuals. The study [PDF] used 188 volunteers to act as members of the public in a controlled field experiment during a real police deployment. Because the researchers knew exactly who was present, it was possible to measure both correct and missed identifications. It found that at the "current operational setting" used by Essex Police, the system correctly identified around half of the people on the watchlist who passed the cameras and that incorrect identifications were "extremely rare." "Of the six false positive identifications observed in this test, four involved Black individuals. Given that observations of Black subjects constituted 536/2,251 (23.8 per cent) of the sample, the observed imbalance is unlikely to be due to chance alone but this could reflect the limited number of false positive events rather than a true systematic effect," it said. The finding should be treated as suggestive rather than conclusive, it added. (A back-of-envelope check of that chance calculation appears at the end of this article.) A spokesperson for Essex Police said that as part of a commitment to its Public Sector Equality Duty, it had commissioned two independent studies which were completed by academia. "The first of these indicated there was a potential bias in the positive identification rate, while the second suggested there was no statistically relevant bias in the results. "Based on the fact there was potential bias, the force decided to pause deployments while we worked with the algorithm software provider to review the results and seek to update the software." The force added: "We then sought further academic assessment. As a result of this work, we have revised our policies and procedures and are now confident that we can start deploying this important technology as part of policing operations to trace and arrest wanted criminals. We will continue to monitor all results to ensure there is no risk of bias against any one section of the community." Earlier this year, the British government decided the police in England and Wales should increase their use of live facial recognition (LFR) and artificial intelligence (AI) under wide-ranging government plans to reform law enforcement.
In a white paper [PDF], the Home Office launched plans to fund 40 more LFR-equipped vans in addition to ten already in use. It said they would be used in "town centers and high crime hotspots", with the government planning to spend more than £26 million on a national facial recognition system and £11.6 million on LFR capabilities.
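The report's chance caveat, quoted earlier, is easy to make concrete. A minimal sanity check, under the simplifying assumption (ours, not the report's) that each false positive independently involves a Black subject with probability equal to their share of the sample, might look like this:

```python
# Minimal binomial check of the false-positive imbalance reported above.
# Assumption (ours, not the report's): each of the 6 false positives
# independently involves a Black subject with probability 536/2251.
from scipy.stats import binom

p = 536 / 2251      # 23.8% of observations involved Black subjects
n, k = 6, 4         # 6 false positives, 4 involving Black individuals

p_value = binom.sf(k - 1, n, p)   # P(X >= 4) under chance alone
print(f"P(at least {k} of {n}) = {p_value:.3f}")  # ~0.03
```

A tail probability of roughly 3% is consistent with the report's wording: unlikely to be chance alone, but with far too few events to call it systematic.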
Reasoning models struggle to control their chains of thought, and that’s good openai 05.03.2026 10:00 0.69
Embedding similarity: 0.7982
Entity overlap: 0.0909
Title similarity: 0.0625
Time proximity: 0.994
NLP type: other
NLP organization: OpenAI
NLP topic: ai safety
NLP country:

Open original

OpenAI introduces CoT-Control and finds reasoning models struggle to control their chains of thought, reinforcing monitorability as an AI safety safeguard.
From model to agent: Equipping the Responses API with a computer environment openai 11.03.2026 11:00 0.688
Embedding similarity: 0.8251
Entity overlap: 0.0714
Title similarity: 0.0769
Time proximity: 0.7083
NLP type: other
NLP organization: OpenAI
NLP topic: ai agents
NLP country:

Open original

How OpenAI built an agent runtime using the Responses API, shell tool, and hosted containers to run secure, scalable agents with files, tools, and state.
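For orientation, a minimal Responses API call with a hosted-container tool might look like the sketch below. The model name and prompt are placeholders, and this is our illustrative guess at the shape of such a call, not the exact runtime described in the post:

```python
# A minimal sketch, assuming the OpenAI Python SDK. The model id and
# prompt are placeholders, not the post's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4.1",  # placeholder model id
    # Hosted container: the API provisions a sandbox where the model
    # can run code and keep files across tool calls.
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Create a CSV of the first 10 squares and print its contents.",
)
print(response.output_text)
```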
Efficiency at Scale: NVIDIA, Energy Leaders Accelerating Power‑Flexible AI Factories to Fortify the Grid nvidia_blog 31.03.2026 15:00 0.688
Embedding similarity: 0.8461
Entity overlap: 0.0444
Title similarity: 0.3459
Time proximity: 0.119
NLP type: partnership
NLP organization: nvidia
NLP topic: ai infrastructure
NLP country:

Open original

CERAWeek, dubbed the Davos of energy, is where policymakers, producers, technologists and financiers gather to discuss how the world powers itself next. NVIDIA and Emerald AI unveiled at the conference last week a new way forward: treating AI factories not as static power loads but as flexible, intelligent grid assets. This collaboration unifies accelerated computing, AI factory reference architectures and real-time energy orchestration, helping large AI deployments connect to the grid faster, operate more efficiently and fortify system reliability.

Built on the NVIDIA Vera Rubin DSX AI Factory reference design and Emerald AI's Conductor platform, the approach brings together compute, power networking and control into a single architecture. The result is an AI factory that can generate high-value AI tokens while dynamically responding to grid conditions: flexing when needed, supporting reliability and reducing the need to overbuild infrastructure for peak demand.

AES, Constellation, Invenergy, NextEra Energy, Nscale Energy & Power and Vistra are working to build the energy generation capacity needed to meet rapidly growing power demand. The companies plan to collaborate on optimized generation strategies to support AI factories built on the NVIDIA and Emerald AI architecture, including hybrid projects that use co-located power to accelerate time to power while delivering value to the broader grid. By pairing large AI loads with flexible operations, new generation resources and intelligent controls, this approach strengthens grid reliability. It's an important milestone in grid resilience, supported by an ecosystem for advanced AI factories. This new computing infrastructure paradigm, described by NVIDIA founder and CEO Jensen Huang as a five-layer AI cake, has energy as its foundational layer.

Driving Improvements in Tokens Per Second Per Watt
Power constraints are reshaping AI data centers, with energy efficiency, or performance per watt (specifically, tokens per second per watt), becoming the defining metric of modern computing infrastructure. By prioritizing computational efficiency, organizations can lower operating costs, maximize revenue and create a resilient digital infrastructure for businesses and consumers across America and worldwide.

"Power is a concern, but it's not the only concern," Huang said on a recent Lex Fridman podcast. "That's the reason why we're pushing so hard on extreme codesign, so that we can improve the tokens per second per watt orders of magnitude every single year."

NVIDIA has a long history of driving performance and energy efficiency. From the NVIDIA Kepler GPU in 2012 to the NVIDIA Vera Rubin platform this year, the number of tokens generated within the same power budget has increased by more than 1 million times. It takes industry collaboration across the five-layer AI cake, from energy to chips, infrastructure, models and applications, to make this happen.

Robotics, Digital Twins and AI Upskilling Drive Energy Advances
NVIDIA ecosystem partners showcased at the event how AI, simulation and workforce innovation are accelerating the energy infrastructure needed to support the intelligence era. Announcements from Maximo, TerraPower and Adaptive Construction Solutions exemplify how AI is compressing timelines across construction, power generation and talent development. Maximo, a solar robotics company incubated at AES, announced the completion of a 100-megawatt robotic solar installation at AES' Bellefield site.
Using AI-driven robotics developed with NVIDIA accelerated computing, NVIDIA Omniverse libraries and the NVIDIA Isaac Sim framework, Maximo demonstrated that autonomous installations can now operate reliably at utility scale. The approach improves installation speed, safety and consistency, helping close the gap between rising electricity demand and construction capacity. TerraPower, working with SoftServe, previewed an NVIDIA Omniverse-powered digital twin platform designed to dramatically shorten advanced nuclear plant siting and design timelines. By applying AI and simulation to early-stage engineering, the platform reduces design cycles from years to months, accelerating deployment of TerraPower's Natrium energy plants while improving design and grid integration. Adaptive Construction Solutions announced a national registered apprenticeship initiative, in collaboration with NVIDIA, to help build the skilled workforce required for AI factories and energy infrastructure. The program aims to scale training for critical trades, expanding access to high-demand careers while supporting the rapid buildout of AI-driven power systems. The efforts articulated how AI, digital twins and workforce innovation are converging to deliver faster, more resilient energy infrastructure.

Coming Together on Scaling AI Factories for Grid Reliability
GE Vernova, Schneider Electric and Vertiv highlighted how digital twins, validated reference designs and converged infrastructure are becoming essential to scaling AI factories as reliable grid participants. The announcements address the "power-to-rack" challenge: designing AI infrastructure as an integrated energy and compute system from day one.

GE Vernova outlined how high-fidelity digital twins aligned with the NVIDIA Omniverse DSX Blueprint enable utilities and developers to simulate grid behavior, substations and AI factory loads together before deployment. Such system-level modeling helps validate interconnection strategies, reduce risk and accelerate time to power in constrained grid environments. Schneider Electric announced new validated NVIDIA Vera Rubin reference designs and lifecycle digital twin architectures developed with AVEVA. By simulating power, cooling and controls in Omniverse, Schneider enables operators to optimize performance per watt, validate designs before buildout and operate AI factories more efficiently and predictably at scale. Vertiv outlined converged, simulation-ready physical infrastructure built on repeatable power and cooling building blocks. Integrated with the Vera Rubin DSX reference design, Vertiv's approach reduces design and deployment complexity while supporting faster, more confident scaling of AI factories.

Together, these industry efforts provide a digital path forward, including the validated architectures and physical infrastructure needed to turn AI factories into flexible, grid-aware assets for efficiently powering the world. Learn more about how NVIDIA and its partners are advancing energy solutions with AI and high-performance computing.
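The headline metric is simply throughput divided by power draw. A trivial sketch (all figures hypothetical, purely for illustration) shows how it composes:

```python
# Tokens-per-second-per-watt is throughput divided by power draw.
# All figures below are hypothetical, for illustration only.
def tok_per_s_per_w(tokens_per_second: float, watts: float) -> float:
    return tokens_per_second / watts

legacy = tok_per_s_per_w(tokens_per_second=1_000, watts=10_000)
modern = tok_per_s_per_w(tokens_per_second=500_000, watts=25_000)
print(f"legacy: {legacy:.3f} tok/s/W")
print(f"modern: {modern:.3f} tok/s/W ({modern / legacy:.0f}x more efficient)")
```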
Import AI 448: AI R&D; Bytedance's CUDA-writing agent; on-device satellite AI import_ai 09.03.2026 12:45 0.687
Embedding similarity: 0.799
Entity overlap: 0.0323
Title similarity: 0.0741
Time proximity: 0.9672
NLP type: scientific_publication
NLP organization: GovAI
NLP topic: ai agents
NLP country: India

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you'd like to support this, please subscribe. Subscribe now

AI progress is moving faster than even well-regarded forecasters can guess:
…Ajeya Cotra updates her timelines…
"On Jan 14th, I made predictions about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative," writes Ajeya Cotra in a blog. Ajeya is a longtime AI thinker who has done some great work trying to predict timelines to powerful AI. In this post, she explains that AI systems are moving faster than she thought, given the recent METR results putting Opus 4.6 as having a time horizon of 12 hours (Ajeya had predicted ~24 hours for the end of 2026 in January). "It's no longer very plausible that after ten whole months of additional progress at the recent blistering pace, AI agents would still struggle half the time at 24 hour tasks," Ajeya writes. "I'd guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR's suite… And once you're talking about multiple full-time-equivalent weeks of work, I wonder if the whole concept of 'time horizon' starts to break down."

Why this matters - all the lights are flashing yellow for a software explosion: Posts like this, as well as 70% of what I cover in this newsletter, all point in the direction of AI systems getting extremely good, extremely quickly, and quickly colonizing and growing the economy.
Read more: I underestimated AI capabilities (again) (Ajeya Cotra).

***

Want to measure AI R&D? Here are 14 ways to do it:
…Generating metrics about the most significant property of AI…
The biggest thing that could ever happen with artificial intelligence will be when it starts to build itself. This phenomenon, often termed recursive self-improvement, is seen by many as an event horizon beyond which it'll be increasingly hard to reason about the future. How would we know if we were approaching this point? Researchers with GovAI and the University of Oxford have written a paper laying out 14 distinct metrics which could be measured to help us figure out the extent to which AI companies are succeeding in building and overseeing AI R&D Automation (AIRDA) - getting AI to build AI, a necessary prerequisite for recursive self-improvement.

Why care about this: "AIRDA could accelerate AI progress, bringing forward AI's benefits but also hastening the arrival of destructive capabilities, including those related to weapons of mass destruction, or other forms of disruption such as unemployment," they write.

What are the 14 metrics?
Measure AI performance on AI R&D
Measure AI performance on AI R&D relative to humans and human-AI teams
Measure 'oversight red teaming' - how well human teams can effectively supervise AI systems that are building themselves
Measure misalignment in AIRDA
Compute the rate of efficiency improvements on AI R&D tasks
Survey staff on how they use AI and what this means for productivity
Find out if and how often AI is used in high-stakes decisions
Examine where AI researchers spend their time
Meta-measure the effectiveness of how well companies can oversee AI development (e.g., the rate of bugs or undesired behaviors that make it through to production even with human oversight)
Examine how often AI systems subvert the goals of their human developers
Track the headcount of AI researchers at labs, as well as details of their performance
Look at the distribution of compute used by AI companies across their AI R&D process and how this changes
Examine compute as a share of AI R&D spending
Understand the permissions AI systems have and how permissiveness changes over time

Governing AI R&D: The logical question implied by the above, I hope, is "wow, that all sounds very high-stakes and important, what can we do about it?" As I write often in this newsletter, AI measurement is a prerequisite to AI governance. Therefore, with these measures, a few different actors should do a few different things. Specifically:

Companies should:
Track differential progress between safety and capabilities research: Is capabilities research moving at a faster rate than oversight research?
Track how AI R&D affects oversight: Automation could free up humans to invest more of their time in building systems for overseeing the work of AI systems. On the other hand, AI-driven R&D might create systems which are innately harder for humans to understand, and the volume of activity being done by the AI systems could swamp any oversight systems.
Track the actual extent of AI R&D: You can build metrics which work as proxies for AI R&D - e.g., many labs today test out how well AI systems can build AI kernels or train AI models. You can also test out how much AI R&D automation is being done in practice by your own organization. Another path is qualitative and quantitative studies of human staff, to understand how their own roles are changing, as well as how AI is being used in increasingly high-stakes decisions.

Governments should:
Develop systems for confidential reporting, potentially in the form of industry-wide aggregates: Once companies are measuring this kind of data, governments should seek to gain access to it so they can understand the shape of AI progress.

Third parties should:
Estimate metrics using public sources: Look at public reporting to create estimates for things that may relate to AI R&D, like the amount of compute companies have (e.g., both Epoch and SemiAnalysis do this quite well).
Create tooling and design surveys: Build tools that companies could use to generate more telemetry about AI R&D, and conduct surveys of people at companies to gather more insights.

Why this matters: "An actor has oversight over the AI R&D process to the extent that they (1) understand the process and (2) exercise informed control over it in order to produce desired outputs, such as by reviewing AI-generated outputs for errors", they write.
Therefore, for us as a species to have any 'warning shots' about recursive self-improvement, and any hope of governing it, we need to be able to measure these aspects of it.
Read more: Measuring AI R&D Automation (arXiv).

***

Indian researchers use edge computing to prototype a citywide camera network:
…Traffic surveillance with YOLO, SAM3, and NVIDIA Jetson chips…
Researchers with the Indian Institute of Science in Bengaluru have built a software and hardware system for intelligently monitoring the traffic and types of vehicles that flow around the city of Bengaluru. The so-called AI-driven Intelligent Transportation System (AIITS) helps increase the amount of intelligence available to city transport analysts via the use of AI.

How the AIITS works: The goal of this project is to unlock "real-time analytics from 1000s of city cameras under strict latency and resource constraints". To do this, they scatter a bunch of lightweight GPUs (Jetson edge accelerators) around the city, co-locating them with traffic cameras. This helps the traffic cameras do intelligent processing at the edge of the network rather than having to send all the extremely bandwidth-intensive data to a central hub for processing; instead, the camera and Jetson share insights back to the hub for analysis and re-calibration of the Jetson-based ML models.

The software works like this: video streams from the cameras come in, and a Segment Anything (SAM3) model segments all the stuff in the video frames, which a YOLO26 model then analyzes and puts labels and bounding boxes around. "Each stream integrates BoT-SORT multi-object tracking, which assigns persistent IDs to detected vehicles across successive frames." Once this is done, the resulting intelligence is sent to a remote GPU server which does two things: 1) it takes in the resulting data and uses this to create a kind of weather map of traffic hotspots, as well as making predictions about future traffic; 2) it does federated learning - when it detects new vehicle classes it labels them with SAM3, then updates details and broadcasts them out to the edge. "Each Jetson then performs local fine-tuning of the YOLO-based detector, initialized with the current global weights." (A minimal sketch of the per-camera detect-and-track stage follows below.)

The prototype works: This system, which was tested by simulating 100 cameras in a neighborhood in Bengaluru, works sufficiently well that the authors plan to scale this up to 1,000 streams for a live demonstration. (This experiment was done by building "a distributed testbed that emulates a large urban camera network using hundreds of concurrent Real-Time Streaming Protocol (RTSP) video streams. Each stream is hosted on a heterogeneous cluster of Raspberry Pis".) "By localizing heavy video analytics at the network periphery, the system avoids centralized bandwidth bottlenecks, enabling sustainable, city-scale traffic sensing," they write.
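Here is that sketch: a hedged, minimal version of the detect-and-track stage using the Ultralytics API (which bundles BoT-SORT as a tracker). The weights file, stream URL, and class names are placeholders, and the SAM3 segmentation and federated-learning stages are omitted; this is not the paper's actual code.

```python
# Minimal sketch of the per-camera detect-and-track loop, assuming the
# Ultralytics YOLO API; weights and stream URL are placeholders.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # placeholder detector weights

# stream=True yields results frame by frame; BoT-SORT assigns persistent
# IDs to detections across successive frames.
for result in model.track(
    source="rtsp://camera-01.example/stream",  # placeholder RTSP stream
    tracker="botsort.yaml",                    # BoT-SORT multi-object tracking
    stream=True,
):
    for box in result.boxes:
        track_id = int(box.id) if box.id is not None else -1
        label = result.names[int(box.cls)]
        # In the paper's design, only compact detections like these (not
        # raw video) would be sent back to the central hub for aggregation.
        print(f"track={track_id} class={label} conf={float(box.conf):.2f}")
```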
Why this matters - towards a 'living city' via AI: Papers like this forecast a world where cities come alive with ambient intelligence distributed in equal measure to their existing sensors - cameras move from being passive monitors to active classifiers, microphones start intelligently listening for a broader range of sounds than gunfire, and road sensors model traffic patterns locally. This kind of intelligence can both create large surveillance architectures and increase the efficiency with which cities operate - as with so many things with AI, it is all a balance, bounded by the surrounding thicket of norms and laws that choose where between authoritarianism and democracy the resulting capabilities fall.
Read more: Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks (arXiv).

***

Helping satellites run on-device AI for arctic monitoring:
…Frontier models are important, but so are tiny, miniaturized devices for edge computing…
Researchers with the German Research Center for Artificial Intelligence have built TinyIceNet, a very small vision model for estimating sea ice thickness from synthetic aperture radar data. TinyIceNet is a proof-of-concept demonstration of how to make very lightweight vision models that could plausibly be deployed onto devices which have very small amounts of power and where bandwidth is expensive, like satellites and robots.

What is TinyIceNet? The model is a small vision model whose job is to take Synthetic Aperture Radar (SAR) data of polar regions and other cold places, then characterize the ice thickness and maturity within the SAR data. The idea here is that doing this on-device would be very efficient - "Instead of downlinking vast volumes of raw imagery, satellites can generate SOD products in near-real-time".

How they built it: TinyIceNet is a simplified U-Net architecture vision model trained on the AI4Arctic dataset, which contains ~533 netCDF files, each of which contains SAR images associated with a map that indicates the type and thickness of sea ice. The authors carefully design the model to fit into a relatively small computational envelope on a Xilinx chip. Specifically, they use an "AMD Xilinx ZCU102 evaluation board, which integrates the ZCU9EG SoC combining a quad-core ARM Cortex-A53 processor with FPGA fabric, using High-Level Synthesis (HLS) and the DeepEdgeSoC framework". They use the DeepEdgeSoC toolchain to further improve the efficiency of the model, as the software "provides a library of modular C++ building blocks (e.g., convolutions, pooling, activation functions, and feature map buffers) that can be specialized at compile time using C++ template parameters". TinyIceNet was trained for 500 iterations on a single GeForce RTX 4090 GPU using PyTorch 2.4 with CUDA 12.5 support.

Results: The authors test out the model on three hardware platforms:
RTX 4090: "Provides the highest throughput at 764.8 fps, benefiting from its large number of CUDA cores and high memory bandwidth. However, this performance comes at a relatively high energy cost of 228.7 mJ per scene, making it unsuitable for power-constrained environments such as satellites."
Jetson AGX Xavier: "Achieves 47.9 fps but exhibits the highest energy consumption (1218.5 mJ)."
Xilinx ZCU102 FPGA: "Achieves a lower throughput of 7 fps, yet offers a highly competitive energy profile, consuming only 113.6 mJ per scene. Despite the lower frame rate, this energy efficiency makes the FPGA implementation compelling for on-board satellite processing, where power availability is severely restricted".
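For readers unfamiliar with the architecture, here is a heavily simplified PyTorch sketch of a small U-Net-style encoder-decoder for single-channel SAR segmentation. The channel counts, depth, and class count are arbitrary assumptions for illustration, not TinyIceNet's actual configuration:

```python
# A heavily simplified U-Net-style model: one downsampling stage, one
# upsampling stage, and a skip connection. Channel counts are arbitrary
# assumptions, not TinyIceNet's real configuration.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.enc = conv_block(1, 16)           # SAR input: 1 channel
        self.down = nn.MaxPool2d(2)            # halve spatial resolution
        self.mid = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = conv_block(32, 16)          # 32 = upsampled + skip channels
        self.head = nn.Conv2d(16, n_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = self.enc(x)
        mid = self.mid(self.down(skip))
        up = self.up(mid)
        merged = torch.cat([up, skip], dim=1)  # the U-Net skip connection
        return self.head(self.dec(merged))    # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 1, 64, 64))  # -> shape (1, 4, 64, 64)
```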
Why this matters - in the future, AI systems will do this stuff automatically: The amazing thing about this research is that it seems trivial (I mean no offense to the authors) for a modern powerful AI system to do: all it required was figuring out a task (stuff a computer vision model into a small computational envelope) and then running some experiments to take an existing architecture, tweak it for a hardware platform, train it on a dataset, and run some tests. In a couple of years we might expect AI agents to do this stuff themselves, procuring compute resources to let them develop and distribute small AI systems to arbitrary compute platforms for arbitrary purposes. This is one of the main ways I think we could get a sudden exponential boom in economic activity attributable to AI - AI systems will get smart enough that they can drastically improve their ability to know about and interact with the physical world through the creation of custom 'edge computing' AI systems to give them better sensory data and actuators.
Read more: TinyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference (arXiv).

***

ByteDance finetunes a Seed1.6 model to be a CUDA-writing agent:
…Using AI to finetune AI to write code to train future AI systems…
Researchers with ByteDance and Tsinghua University have built CUDA Agent, a fine-tuned AI model for writing GPU programming code. The research is another sign of how people are increasingly using AI to speed up core aspects of AI development. It's also vaguely notable for the fact that a major Chinese lab and university continues to use US-made chips (NVIDIA H20s) versus homegrown ones.

What CUDA Agent is: CUDA Agent is a finetuned Seed 1.6 LLM, an MoE model with 23B active parameters and 230B total parameters. Finetuning took place on a cluster of 128 NVIDIA H20 GPUs. CUDA Agent has been developed specifically for writing GPU code by being fine-tuned on a dataset refined out of the underlying PyTorch 'torch' and 'transformers' software libraries. "The filtered synthesized training dataset contains 6,000 samples, forming CUDA-Agent-Ops-6K, a curated operator-level dataset for training CUDA-capable agents," the authors write.

Turning a model into an agent: In the last year or so, researchers have repeatedly shown that you can increase the performance of an LLM on a given task by giving it access to some specialized tools and some specialized instructions, then letting it operate over time - this is essentially an AI agent. The CUDA Agent here is the fine-tuned model that has been turned into an agent by adopting the OpenHands framework, then given tools including BashTool, GlobTool, MultiEditTool, and TodoWriteTool. The agent runs in a four-stage loop (sketched below):
Analyze performance of the native PyTorch implementation of a given bit of CUDA code using the provided profile.py script
Implement custom CUDA operators by rewriting the model in model_new.py
Compile and evaluate the optimized model in the provided GPU sandbox environment
Repeat the optimization process until the implementation achieves a 5% speedup over the torch.compile baseline
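In outline, that loop amounts to the sketch below. The three helper functions are hypothetical stand-ins for the paper's profiling script, agent rewrite step, and sandbox evaluation, not its actual interfaces:

```python
# Sketch of the four-stage loop. profile_latency, agent_rewrite, and
# build_and_eval are hypothetical stand-ins, not the paper's interfaces.
import random

def profile_latency(path: str) -> float:
    # Stand-in: pretend to run profile.py and return a latency in ms.
    return random.uniform(8.0, 12.0)

def agent_rewrite(feedback: str) -> None:
    # Stand-in: the agent rewrites kernels into model_new.py from feedback.
    pass

def build_and_eval() -> tuple[bool, str]:
    # Stand-in: compile and test in the GPU sandbox, return (ok, log).
    return True, "ok"

TARGET = 1.05  # stop at a 5% speedup over the torch.compile baseline
baseline = profile_latency("model.py")   # stage 1: profile the reference
feedback = "initial profile"
for turn in range(200):                  # the agent supports up to 200 turns
    agent_rewrite(feedback)              # stage 2: rewrite operators
    ok, log = build_and_eval()           # stage 3: compile and evaluate
    if ok and baseline / profile_latency("model_new.py") >= TARGET:
        break                            # stage 4: success criterion met
    feedback = log                       # otherwise iterate on the errors
```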
Results: The resulting agent is very good at CUDA kernel development: "CUDA Agent successfully scales to a context length of 128k tokens and supports up to 200 interaction turns, achieving state-of-the-art performance," they write. Their finetuning massively boosts performance from a base rate of 74% for Seed1.6, to "100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits of KernelBench, outperforming advanced proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by approximately 40% in the Level-3 split." However, comparing against other base models paints a different story: the Claude Opus 4.5 and Gemini 3 Pro base models get 95.2% and 91.2% respectively, suggesting that if they were finetuned you'd increase their performance as well, and they start from a much stronger baseline.

Why this matters - building AI that builds AI: These results show how modern AI systems are increasingly good at the tasks required to develop and deploy AI systems themselves. This suggests we're at the beginning of a compounding speedup where new AI models will be used to increase the efficiency of the infrastructure with which their successors will be trained.
Read more: CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation (arXiv).

***

Tech Tales: Dandelion Sky
[2031, Northern Europe]

We made sand castles and in the distance the blue sky was pockmarked with yellow and red bursts, and then seconds later the crumpled sounds of the explosion reached us. We were so used to it we didn't look up. On the way back from the park the air whined as drones flew to replenish the perimeter of the city. We watched them, bird-like in their varieties, some zipping by quick as starlings, and other larger ones moving heavily through the air. There were so many varieties: the football-sized interceptors which died by the thousands each day. The pizza-boxes that worked as communications and AI relays. Then of course the motorbike-sized motherships which could rapidly repopulate areas that were sustaining heavy losses.

The war had been going on for five years. Our city was like so many across the world - a nucleus of humans, protected by so many thousands upon thousands of machines, spinning around the periphery, exchanging energy and mass in some bloodless dance with our enemies. That night, the city narrated itself through statistics: 3410 interceptors destroyed. A green day: 100% success, with nothing making its way through. Replenishment rate: 4000 and climbing. And promising reports that our military had struck deep in the heart of enemy territory, taking out several of their drone factories.

We drew the blackout curtains in every room except our bedroom. With the kids asleep and my wife passed out beside me, I looked out into the darkness, my face occasionally lit by the explosion of some distant drone, and then the room buzzing with the reverberation of the window as the soundwaves reached it. But when I woke up the next day, there was something different in the air: silence. And my phone did not work. We drew the shades and looked out and the sky was blue and perfectly clear: not a cloud or a drone in the sky. My wife stared out and her jaw tightened and she clutched our kids close. "Dada, where are the machines?" my youngest said. "Yeah Dad, what's up?" said the older one. "I don't know," I said. "Draw the curtains. We're going to camp today!" And I set my wife and kids up in the apartment with pillows in front of the TV and the game console on and a bunch of snacks. The kids were excited and my wife played along.
"I'll see if I can figure out what's going on," I whispered to her. "I won't go far and I won't be gone long." Outside, there were a few people who had the same idea as me. None of us knew much. None of our electronic communication systems worked. Which people were even in charge of the drones? None of us knew. They mostly worked via AI. A lot of their decision-making was federated; distributed systems doing what made most sense to them, coordinating only with themselves. "Maybe they've turned off because the war is over?" someone said. "Maybe they've been hacked - we're about to be attacked!" said someone else. "What if there was a crash - they just all broke at once?" said someone else. There was nothing to do so I went home. My wife and kids were playing games. I grabbed some binoculars and went up to the fire escape and out onto the roof of the building. And there I stood, looking at a horizon free of machines, occasionally looking at other people on other buildings doing the same. And eventually I put the binoculars down and I just stood there, listening for the whine of drones. But all I could hear was the wind and, in the distance, muffled birdsong.

Things that inspired this story: gradual disempowerment and what it might mean for moments of crisis; automation and AI; winding the clock forward on the drone war in Ukraine; war and peace and family.

Thanks for reading!
TAI #197: Anthropic Turned the OpenClaw Demand Signal Into a Product towards_ai 24.03.2026 15:00 0.687
Embedding similarity: 0.8849
Entity overlap: 0.125
Title similarity: 0.1417
Time proximity: 0.0002
NLP type: other
NLP organization: Anthropic
NLP topic: ai agents
NLP country: United States

Open original

What happened this week in AI by Louie

Last week, I wrote about quiet agent upgrades. This week, Anthropic continued to launch features that make the bigger picture obvious. In ten weeks, it went from launching Cowork (January 12) to shipping persistent phone-to-desktop threads via Dispatch (March 17) and direct computer use (March 23), adding plugins, admin controls, and scheduled tasks along the way. A paid Claude Cowork user can now message an agent from their phone, let it work on their machine, connect it to dozens of apps, and hand it the mouse to the full computer when connector or API access isn't available. OpenClaw, at roughly 333,000 GitHub stars, did the product discovery. Anthropic built and shipped many of its key features at an incredible pace (only possible by using Claude Code itself to build features!), but with a much more enterprise-friendly risk profile: connectors first, explicit per-app permissions, prompt-injection scanning, and admin controls. Open source found the primitive. Anthropic wrapped it in the permission model that lets a company actually deploy it.

The agent story feeds directly into the AI infrastructure debate that dominated the rest of the week. Computer use, browser control, and persistent background tasks are dramatically more token-intensive than chat. A single Cowork session running scheduled tasks, clicking through apps, and filling spreadsheets burns far more compute than a conversation. Every new agentic workflow Anthropic or anyone else ships multiplies the demand per user. That is part of why the people at the top of the AI stack sound increasingly frustrated with the pace of supply expansion further down. At GTC, Jensen Huang said Nvidia expects at least $1 trillion in cumulative Blackwell and Rubin revenue through 2027, then clarified that this estimate was conservative because it excluded additional products. On the All-In podcast, he called Dario Amodei's forecast of roughly $1 trillion in non-infrastructure AI revenue by 2030 "very conservative," adding that Anthropic will do "way better than that" because every enterprise software company will become a value-added reseller of model tokens.

I suspect Jensen is also privately nervous about the supply chain's willingness to ramp as aggressively as his demand forecasts require. His current approach has been to invest directly in suppliers to force capacity expansion: Nvidia recently committed $4 billion to optical interconnect suppliers Coherent and Lumentum to address the silicon photonics bottleneck, and on the February earnings call, management described supporting the "extreme ecosystem" of suppliers from a capacity standpoint as one of the company's most important priorities.

The further down the supply chain you go, the fewer people believe those numbers. Broadcom said today that TSMC has become a production bottleneck, with meaningful new capacity not materializing until 2027, and that the squeeze now extends beyond wafers into lasers and printed circuit boards. Memory prices in some segments have more than tripled over the past year. Samsung is pushing customers toward three- to five-year contracts to justify expansion. The top of the stack is trying to force conviction into the middle, and the middle is still hesitant to invest at the scale implied by demand forecasts. That backdrop makes Elon Musk's Terafab announcement easier to parse.
Tesla and SpaceX plan a joint chip fabrication complex in Austin, starting with an initial $20-25 billion facility, though the full project at the scale Musk described would cost dramatically more. At full capacity, Terafab would target 1 terawatt of annual compute output, compared with roughly 0.5 terawatt for the entire current U.S. electricity network. Musk said every fab on Earth currently produces about 2% of what his companies would eventually need, and that 80% of Terafab's output would be directed toward orbital data centers in space. These numbers really only make sense if AI leads to a large multiplication of the global economy from current levels.

The pieces Musk already has are real but partial. Tesla's chip team has been designing custom AI chips for years, with AI5 targeting production in 2027 and AI6 in 2028. Samsung plans to begin volume fabrication of Tesla chips in Texas in the second half of 2027. SpaceX is building what will be the largest PCB and panel-level packaging facility in North America at its Bastrop site, backed by a $280 million-plus Texas semiconductor innovation grant. Musk is also recruiting aggressively, posting on X that anyone in Korea working in chip design, fabrication, or AI software should apply to Tesla, in what looks like a direct play for TSMC and Samsung talent.

What Musk lacks is any experience running an actual fabrication plant. The gap between chip design plus advanced packaging and full-scale leading-edge lithography is enormous. TSMC has roughly 50,000 engineers who do nothing but fab operations, and it has spent decades and hundreds of billions of dollars building that capability. The EUV lithography machines that any 2nm fab requires are made exclusively by ASML, which has a record backlog of roughly €39 billion and whose capacity is likely to be a key bottleneck for anyone trying to build a new leading-edge fab on an ambitious timeline. Each EUV machine costs $200-400 million, weighs 165 tons, and requires specialized ocean transport. There is no fast lane for procurement.

I suspect Terafab is partly a manufacturing project and partly a supply-chain pressure tactic, similar to Battery Day in 2020. Tesla presented the 4680 cell as a path to much lower battery costs and near-100x scale by 2030. The execution was painful: repeated delays in dry-electrode manufacturing, supplier pushback, and struggles at scale as late as 2023. Yet Tesla's latest shareholder update says it is now producing 4680 dry-electrode cells with both anode and cathode in Austin, a real milestone after years of difficulty. The battery program shipped later and uglier than the slides implied, but it dragged Tesla and its suppliers up the curve. Terafab may serve a similar function even if the schedule slips badly, which I expect it will.

Google is fighting the same capacity war from a different angle, and energy is its primary lever. Alphabet acquired clean energy developer Intersect for $4.75 billion in December to gain direct access to power projects and data center infrastructure. Google has signed nuclear deals with Kairos Power for 500 MW of small modular reactors by 2035, a 25-year agreement with NextEra Energy to restart Iowa's shuttered 615 MW Duane Arnold nuclear plant, a 200 MW deal with fusion firm Commonwealth Fusion Systems, and a strategic agreement with Elementl Power to develop three nuclear sites with at least 600 MW of capacity each.
It has also been signing utility agreements to curtail up to 1 gigawatt of data-center power during peak periods. Ruth Porat said this week that the U.S. is not scaling up energy supply fast enough to support AI. Meanwhile, Meta signed a multi-billion-dollar deal to rent Google's TPUs and was also discussing buying them outright, while Anthropic already has access to more than 1 gigawatt of Google TPU capacity.

Open weight models have been taking somewhat of a back seat to the breakthroughs in agentic capabilities at the closed AI labs the past few months, but I think open weights will still have a key role to play. Cursor released Composer 2, a coding model built on Moonshot AI's Kimi K2.5 via an authorized commercial partnership through Fireworks AI. It scores 61.7 on Terminal-Bench 2.0 and 73.7 on SWE-bench Multilingual, up sharply from Composer 1.5, and is priced at $0.50 per million input tokens. Cursor did not initially disclose the Kimi base. A developer intercepted the API traffic and found the model ID in plain text. After millions of views, Cursor VP Lee Robinson acknowledged the open-source base, and co-founder Aman Sanger called the omission "a miss from the start." The licensing story is clean; the disclosure story is not. But the product formula (take a strong open base, hammer it with domain-specific RL, wrap it in the best UX in the category) is very likely the template for application-layer competition over the next couple of years.

Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.

Why should you care? The "AI bubble" framing keeps circulating and keeps missing the point. Bubbles feel overbuilt. Much of AI still feels under-supplied. Memory prices have tripled. TSMC is a bottleneck. Lasers and PCBs are in short supply. ASML's EUV machines are booked out. Musk, Jensen, and Google are all signaling the same thing: there are not enough chips, power, or industrial capacity to support the scenarios the leading buyers seem willing to fund.

The 'agent' story makes this tension worse. Anthropic's Cowork with computer use, Dispatch, and scheduled background tasks turns a single user into a persistent compute load. Every time an agent clicks through a browser, fills out a spreadsheet, or runs a recurring workflow, it burns far more tokens than a chat exchange does. Multiply that across millions of subscribers, then add Cursor's long-horizon coding agents, OpenAI's agent mode, and the broader wave of agentic products shipping every week, and you start to see why Jensen thinks $1 trillion is conservative. The revenue potential from agents is enormous, but the compute requirements per user are also enormous. Those two facts together explain the urgency behind Terafab, Google's energy sprint, and Nvidia's direct investments in its supplier base.

The gap between conviction at the top and hesitancy in the middle of the supply chain is a key dynamic in AI right now. The DRAM fabs, the PCB makers, the laser suppliers, and the power utilities are the ones whose investment pace will determine how fast AI actually scales. If the top-of-stack buyers are right, the hesitancy further down becomes the binding constraint. If they are wrong, Terafab will be a very expensive monument to overconfidence. The next two years will settle it. The people who get ahead will be the ones using the new tools before the supply catches up.
One final thought on the Terafab story: if you truly believe in recursive AI self-improvement without near-term dead ends, now is indeed the time to begin ambitious projects that wouldn't have been possible previously. If AI can help simulate, iterate, and improve chip science and manufacturing, then those making the earliest and most aggressive moves to build an AI-first chip fab may indeed have a chance to leapfrog incumbents. This will also be the case in many other industries, and I expect many more pie-in-the-sky, ambitious projects to be launched soon by AI labs and true AI believers.

- Louie Peters, Towards AI Co-founder and CEO

This issue is brought to you thanks to SerpApi: LLMs are powerful. But without fresh information, they can hallucinate or miss context. SerpApi helps AI applications access real-time search data from search engines like Google, Bing, Amazon, and more via a simple API. Get clean, structured JSON results and power AI agents, research tools, and data-driven applications without managing scrapers. Start with 250 free credits/month by signing up at SerpApi today!

Hottest News

1. OpenAI Releases GPT-5.4 Mini and Nano
OpenAI released GPT-5.4 mini and GPT-5.4 nano, two smaller GPT-5.4 variants designed for high-throughput, latency-sensitive workloads such as coding assistants, sub-agents, and routine automation. GPT-5.4 mini is positioned as the default "workhorse" small model, faster than GPT-5 mini (OpenAI notes it runs over 2× faster) while improving coding, reasoning, multimodal understanding, and tool use. It lands close to the full GPT-5.4 model on several evals (for example, 54.4% on SWE-Bench Pro vs. 57.7% for GPT-5.4, and 45.7% for GPT-5 mini). In the API, mini supports text + image inputs, tool use/function calling, web search, file search, and computer use, with a 400K context window priced at $0.75/1M input tokens and $4.50/1M output tokens. GPT-5.4 nano is the smallest, lowest-cost option for simpler tasks like classification, ranking, extraction, and lightweight coding subagents; it's API-only and priced at $0.20/1M input tokens and $1.25/1M output tokens. GPT-5.4 mini is also available across Codex surfaces and in ChatGPT, where it appears for Free/Go users via Thinking, with mini serving as a rate-limit fallback for GPT-5.4 Thinking on other plans.

2. Cursor Launches Composer 2, Coding Model Powered by Kimi-k2.5
Cursor released Composer 2, a frontier-level coding model priced at $0.50 per million input tokens, with a faster variant available. Built on Moonshot AI's Kimi-k2.5 via continued pretraining and high-compute RL, it shows substantial benchmark improvements, including 61.7 on Terminal-Bench 2.0 and 73.7 on SWE-bench Multilingual. The model is available immediately in Cursor with usage included in individual plans. Kimi confirmed the authorized commercial partnership through Fireworks AI.

3. Mistral Releases Small 4
Mistral AI released Mistral Small 4, a unified open-source multimodal reasoning model, alongside Leanstral, an open-source code agent built for Lean 4 formal verification. Mistral Small 4 combines the roles of Mistral's earlier specialist lines (reasoning, multimodal understanding, and agentic coding) into a single hybrid model tuned for general chat, coding, agent workflows, and deeper reasoning.
Architecturally, it's a Mixture-of-Experts system with 128 experts and 4 active per token, totaling 119B parameters, with roughly 6-6.5B parameters activated per token (about 8B including embedding and output layers), and it supports a 256K context window plus native text+image inputs. It also adds a configurable reasoning-effort control, allowing developers to trade off low-latency responses for more intensive reasoning. Mistral reports major efficiency gains versus Mistral Small 3 (up to 40% lower end-to-end completion time in a latency-optimized setup and 3× higher requests-per-second in a throughput-optimized setup) and positions Small 4 (with reasoning enabled) as competitive on core reasoning/coding benchmarks while producing shorter outputs.

4. OpenAI and NVIDIA Sign $100B Infrastructure Partnership
OpenAI and NVIDIA announced a letter of intent for a strategic infrastructure partnership to deploy at least 10 gigawatts of NVIDIA systems to train and run OpenAI's next generation of models. As deployments scale, NVIDIA plans to invest up to $100 billion in OpenAI progressively as each gigawatt is brought online, tying capital to delivered infrastructure. The companies set the first phase to come online in the second half of 2026, built on NVIDIA's Vera Rubin platform. The partnership also includes joint roadmap work to co-optimize OpenAI's model and infrastructure software with NVIDIA's hardware and software stack.

5. Xiaomi Releases MiMo-V2-Pro
Xiaomi released MiMo-V2-Pro, its flagship foundation model built for real-world agentic workloads, positioning it as a "brain" for systems that orchestrate multi-step workflows and production engineering tasks. The model uses an efficient trillion-parameter MoE design with over 1T total parameters and 42B active parameters, scales long-context operation to a 1M-token window, and extends Xiaomi's Hybrid Attention design by increasing the hybrid ratio from 5:1 to 7:1, with a lightweight multi-token prediction (MTP) layer to speed up generation. Xiaomi reports MiMo-V2-Pro ranks 8th worldwide and 2nd among Chinese LLMs on the Artificial Analysis Intelligence Index, and highlights stronger agent performance on OpenClaw-style evaluations (e.g., PinchBench avg. 81.0 and ClawEval 61.5, listed as #3 globally on both). The model was also publicly tested in stealth on OpenRouter under the name "Hunter Alpha," where Xiaomi says it topped the daily call charts and surpassed 1T tokens in usage. The model is now available globally via Xiaomi's developer portal MiMo Studio, Hugging Face, and its API platform.

6. NVIDIA Releases Nemotron-Cascade 2
NVIDIA released Nemotron-Cascade 2, an open-weight 30B Mixture-of-Experts model that activates only ~3B parameters per token, targeting high "intelligence density" for reasoning and agent workflows without the usual cost blowups. The flagship checkpoint is Nemotron-Cascade-2-30B-A3B, post-trained from Nemotron-3-Nano-30B-A3B-Base, and it runs in two operating modes, a thinking mode and a non-thinking (instruct) mode, selected through the chat template. NVIDIA reports that it is the second open-weight LLM (after DeepSeek-V3.2-Speciale-671B-A37B) to reach gold-medal-level performance across the 2025 IMO, IOI, and ICPC World Finals.
The core training upgrade is multi-domain on-policy distillation throughout the Cascade RL pipeline, in which the best intermediate "teacher" for each domain provides token-level distillation signals to recover regressions and maintain gains across domains. NVIDIA also released the full collection of model checkpoints and training datasets alongside the paper.

7. Mamba-3: A New State Space Model Frontier
A team of researchers from Carnegie Mellon University (CMU), Princeton University, Together AI, and Cartesia AI has introduced Mamba-3. It is a new state space model (SSM) architecture designed for inference efficiency, shifting the focus from Mamba-2's training-first design to faster prefill+decode performance in production. Mamba-3 upgrades the core SSM with a more expressive recurrence (via an exponential-trapezoidal discretization scheme), complex-valued state tracking, and an optional MIMO (multi-input, multi-output) variant that improves accuracy with minimal impact on decode latency. On Together's reported latency tests for a ~1.5B model on a single H100-SXM 80GB, Mamba-3 (SISO) delivers the fastest prefill+decode times across sequence lengths, outperforming Mamba-2, Gated DeltaNet, and even a vLLM-served Llama-3.2-1B transformer baseline.

Five 5-minute reads/videos to keep you learning

1. Claude Code Agent Skills 2.0: From Custom Instructions to Programmable Agents
This article walks you through the evolution of Claude Code's skill system from simple markdown instructions to a full programmable agent platform with subagent execution, dynamic context injection, lifecycle hooks, and formal evaluation. It also covers a formal iterative evaluation loop for testing and improving skills over time, and points to an open Agent Skills standard designed to keep the format portable across AI tools.

2. Loss Landscapes: Part 1 (Part 2)
The loss landscape is a surface that maps model weights to loss values, ranging from smooth, convex bowls (simple models, with guaranteed global minima) to rugged, non-convex terrains riddled with local minima and saddle points. This article covers how gradient descent navigates loss landscapes and which tools help it succeed: weight decay to smooth chaotic landscapes, dropout for robustness, residual connections for deep-network stability, and batch/layer normalization to stabilize training dynamics.

3. Knowledge Distillation: How a Tiny Model Learned to Outsmart Its Giant Teacher
The article walks you through why large models carry dark knowledge in their probability distributions that hard labels destroy, and how temperature scaling amplifies those signals for smaller student models to absorb. It lays out the full derivation of the loss function, including the tau-squared compensation. The piece anchors the theory to DeepSeek-R1's January 2025 result, in which a distilled student matched or beat its teacher, raising an unresolved question: does compression reveal latent knowledge or generate entirely new capability?
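For concreteness, the temperature-scaled distillation loss the article derives is short enough to state directly. This is the standard formulation (a sketch, with an arbitrary temperature value), not the article's exact code:

```python
# Standard temperature-scaled knowledge distillation loss (a sketch;
# the temperature value is arbitrary, not taken from the article).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 4.0) -> torch.Tensor:
    soft_targets = F.softmax(teacher_logits / T, dim=-1)   # teacher "dark knowledge"
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(log_student, soft_targets, reduction="batchmean")
    # The tau-squared factor compensates for the 1/T^2 gradient shrinkage
    # introduced by temperature scaling, keeping gradient magnitudes stable.
    return kl * T * T

loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100))
```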
4. Three Tasks, One Backbone: A Multi-Task Reranker That Tackles Search Challenges
In this article, the author trained a single cross-encoder on Amazon's ESCI shopping dataset to handle three tasks simultaneously: graded relevance ranking, 4-class ESCI label classification, and binary substitute detection. Rather than training three separate models, the architecture routes a shared BERT backbone's [CLS] embedding through three lightweight heads, each optimized with its own loss. The combined weighted loss prioritizes nDCG ranking while using classification and substitute detection as auxiliary regularizers.

5. NVIDIA State of AI Report 2026
NVIDIA's comprehensive report examines how AI drives revenue across industries, covering enterprise adoption patterns, infrastructure scaling trends, and the shift toward agentic AI workflows. The report provides data-driven insights on computing demand, model deployment costs, and the economic impact of generative AI across manufacturing, healthcare, finance, and software development.

Repositories & Tools
1. LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing.
2. Deer Flow is an open-source super agent harness that orchestrates sub-agents, memory, and sandboxes to do almost anything.
3. PentAGI is a fully autonomous AI agent system capable of performing complex penetration testing tasks.
4. Colab MCP is Google's MCP server for interacting with Colab.

Top Papers of The Week

1. Efficient Exploration at Scale
This paper introduces an online learning algorithm that improves the data efficiency of reinforcement learning from human feedback (RLHF). The algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of 'reinforce', with reinforcement signals provided by the reward model. With Gemma LLMs, this algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels.

2. Memento-Skills: LLM Agents That Build Task-Specific Agents
This paper introduces Memento-Skills, a generalist, continually learnable LLM agent system that autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, in which reusable skills (stored as structured markdown files) serve as a persistent, evolving memory. It achieves 26.2% and 116.2% relative accuracy improvements without updating LLM parameters.

3. Attention Residuals: Learned Layer Aggregation for LLMs
This paper proposes Attention Residuals (AttnRes), which replaces the fixed, uniform accumulation of residual connections in LLMs with softmax attention over preceding-layer outputs. This allows each layer to selectively aggregate earlier representations using learned, input-dependent weights. Tested on Kimi Linear (48B params, 3B activated, 1.4T tokens), AttnRes improves downstream performance and stabilizes output magnitudes and gradient distribution.

4. OpenSeeker: Fully Open-Source Search Agent Training Data
This paper introduces OpenSeeker, a fully open-source search agent (i.e., model and data) that achieves frontier-level performance through fact-grounded, scalable, controllable QA synthesis to generate complex, multi-hop reasoning tasks with controllable coverage and complexity, and denoised trajectory synthesis employing a retrospective summarization mechanism. Trained on only 11.7K samples, it significantly outperforms the next-best open-source search agent and surpasses some commercial systems, such as Tongyi DeepResearch.

5. EvoClaw: Evaluating AI Agents on Continuous Software Evolution
This paper introduces EvoClaw, a novel benchmark, and the DeepCommit pipeline to evaluate AI agents on continuous, dependency-driven software evolution rather than isolated, one-off coding tasks.
Evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly, from >80% on isolated tasks to at most 38% in continuous settings.

Quick Links
1. Microsoft is considering legal action over the $50 billion Amazon-OpenAI cloud deal that could violate its exclusive cloud agreement with the ChatGPT maker. The dispute centers on whether OpenAI can offer Frontier via AWS without violating the Microsoft partnership, which requires the startup's models to be accessed through the Windows maker's Azure cloud platform, the FT reported, citing sources.
2. NVIDIA released its Agent Toolkit, which provides open source models and software for enterprises and developers building autonomous, self-evolving AI agents. NVIDIA Agent Toolkit includes open models (NVIDIA Nemotron), open agents (NVIDIA AI-Q), open skills (NVIDIA cuOpt), and open runtimes (OpenShell). It also supports enterprise software platforms such as Adobe, Atlassian, Box, Salesforce, etc.

Who's Hiring in AI
LATAM Internship Program - Experience Design (UX/UI) @Salesforce (Sao Paulo, Brazil)
QA Engineering Lead, AI Native @Meta (Menlo Park, CA, USA)
Senior AI Engineer @Teradata (Hyderabad, India)
NLP Architect @Nutanix (San Jose, CA, USA)
Prompt Engineer @Highmark Health (Remote)
Machine Learning Product Summer Intern @Pacvue (Remote/USA)
Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

Think a friend would enjoy this too? Share the newsletter and let them join the conversation. Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
The U.S. Labor Market and its AI problem ai_supremacy 09.03.2026 09:25 0.686
Embedding similarity: 0.7906
Entity overlap: 0.0851
Title similarity: 0.0843
Time proximity: 0.9871
NLP type: other
NLP organization: Anthropic
NLP topic: ai adoption
NLP country: United States

Open original

Prospectus
The U.S. Labor Market and its AI problem
Will what worked for AI coding be able to scale to Finance, Law and other knowledge work professions?
Michael Spencer, Mar 09, 2026 (Paid)

Anthropic: Read the report. Just a reminder: I'll be working on my Emerging Tech Investments Newsletter a lot more in 2026 to discover speculative high-growth and buy-the-dip opportunities. The cadence of that newsletter should increase in the weeks and months ahead. There I'm exploring themes like Neo Clouds, space-tech, micro-cap AI-related names, quantum computing SPACs and more.

Good Morning, This is going to be a long report with a lot of infographics and context. Analyzing Friday's jobs data, nonfarm payrolls fell by 92,000 in February, according to the Bureau of Labor Statistics, far below consensus expectations of roughly 50,000. It was the third time in five months that the economy lost jobs in the U.S. Outside of healthcare jobs, there's really not much growth at all, with low hiring. We are in a "jobless" growth economy, with K-shaped characteristics. With the U.S. putting the brakes on immigration and tech companies doing agentic AI pilots, more layoffs in the tech sector are highly likely. Oracle needs to shed at least 30,000 jobs due to the debt it has taken on for OpenAI's compute facilities, even as OpenAI's Stargate facility won't be expanding any time soon.

A Low-Hire Economy with Grave Consumer Sentiment
Hiring is down 20% compared to the pre-pandemic baseline of 2019, according to Karin Kimbrough, LinkedIn's head economist. People are staying unemployed for 7 months on average.

Labor market impacts of AI: AI is Lowering Entry Hiring in the Most Exposed Professions
Anthropic economists say they have developed a new way to track how AI will upend the workforce. Their economists Maxim Massenkoff and Peter McCrory are selling a pre-IPO pitch: that exposed knowledge workers could be heavily impacted by the very technology Anthropic is building. As Alberto Romero has pointed out, they are pulling old data and making some huge rationalizations. The likes of Anthropic and OpenAI say the red will inevitably fill the blue. But how likely is that really? According to Anthropic, AI is far from reaching its theoretical capability: actual coverage remains a fraction of what's feasible. Occupations with higher observed exposure are projected by the BLS (Bureau of Labor Statistics) to grow less through 2034. While the circle graph went viral on social media, the bar graph is an easier way to visualize this (chart via Peter Walker).

Outside of coding, and some aspects of administration and finance, generative AI doesn't seem very capable of automating tasks in most areas of knowledge work. The deeper problem appears to be that the tech is not a meaningful job creator, and is not increasing productivity outside of a few roles even in technology companies. While massive layoffs are not occurring, some redesigning of jobs internally at tech companies has been noted: fewer managers, more hybrid roles, more product managers and designers doing vibe-working on the job. Hiring slowed significantly in January 2026, with U.S. hiring falling 3.3% from December and down 5.7% compared to January 2025 (source). Hiring in 2025 was already anemic.

Generative AI does not appear to be a Job Creator
The U.S. Bureau of Labor Statistics (BLS) projects the economy to grow by 1.8 percent each year, on average, from 2024 to 2034.
The GDP gains from the investment going into datacenter, inference, and compute campuses and construction aren't being spread equally across the U.S. economy. The low-hire labor environment has other issues as well.

Monthly U.S. Job Creation is Nearly Non-Existent outside of Healthcare in 2025 & 2026 (Chart: CNBC, via Gabriel Cortes. Source: U.S. Bureau of Labor Statistics via FRED. Data as of March 6th, 2026.)

Healthcare is the main driver of new jobs created in 2026. Cognitive displacement, the deskilling of young talent in college, and "cognitive surrender" are now moderate-to-high AI risks in such an environment.

A broader measure of unemployment that includes discouraged workers and those holding part-time positions for economic reasons moved lower, to 7.9%, or 0.2 percentage point below the January level. Even so, the headline 4.4% unemployment rate is deceptive compared with the low-hire environment job seekers are actually facing. Then there is the AI pressure on hiring. Young people have reason to be more nihilistic or discouraged with regard to current U.S. labor market trends.

Generative AI is Destroying White Collar Opportunity
OpenAI to acquire Promptfoo openai 09.03.2026 10:00 0.684
Embedding sim.0.8392
Entity overlap0.1818
Title sim.0.1
Time proximity0.4226
NLP типacquisition
NLP организацияOpenAI
NLP темаai security
NLP страна

Открыть оригинал

OpenAI is acquiring Promptfoo, an AI security platform that helps enterprises identify and remediate vulnerabilities in AI systems during development.
Understanding AI and learning outcomes openai 04.03.2026 00:00 0.683
Embedding sim.0.7943
Entity overlap0.0222
Title sim.0.086
Time proximity0.9467
NLP типproduct_launch
NLP организацияOpenAI
NLP темаeducational technology
NLP страна

Открыть оригинал

OpenAI introduces the Learning Outcomes Measurement Suite to assess AI’s impact on student learning across diverse educational environments over time.
TAI #196: Quiet but Significant Agent Upgrades to Codex (Subagents) and Claude (Context) towards_ai 17.03.2026 15:03 0.682
Embedding sim.0.8058
Entity overlap0.141
Title sim.0.0615
Time proximity0.7932
NLP типproduct_launch
NLP организацияOpenAI
NLP темаlarge language models
NLP странаUnited States

Открыть оригинал

What happened this week in AI by Louie

OpenAI and Anthropic both shipped incremental upgrades this week that sound modest on paper but could reshape how serious developers actually work day to day. Elsewhere, Google released Gemini Embedding 2, its first natively multimodal embedding model; NVIDIA released Nemotron 3 Super; Google Research introduced Groundsource, turning global news into structured historical data and launching with a 2.6 million-record urban flash-flood dataset; Yann LeCun's new startup AMI raised $1.03 billion at a $3.5 billion pre-money valuation to pursue world-model-heavy AI; and IBM shipped Granite 4.0 1B Speech for compact multilingual speech recognition, now ranked #1 on the OpenASR leaderboard.

For OpenAI, the key release was Codex subagents. Codex can now spawn specialized agents in parallel to explore, execute, or analyze work concurrently, while keeping the main thread focused on requirements, decisions, and final outputs. OpenAI's docs frame this as a solution to "context pollution" and "context rot," which is exactly right. One giant thread is fine until it turns into a digital junk drawer full of stack traces, half-failed tests, and exploratory dead ends. OpenAI has essentially adopted the core product idea Anthropic pushed first with Claude Code and then more broadly with Cowork: separate the manager from the workers, keep the high-level thread clean, and let specialized agents chew through bounded tasks in parallel. This is a materially better operating model for real work, especially once tasks stop being cute demos and start involving actual codebases, logs, specs, and messy follow-ups. Once a workflow primitive proves itself in real work, the industry converges on it fast.

The Codex growth numbers indicate where OpenAI thinks the battle stands now. Fidji Simo said more than 1 million businesses run on OpenAI products, Codex is now at 2 million plus weekly active users (up nearly 4x since the start of the year), and API usage jumped 20% in the week after GPT-5.4 launched. OpenAI has also been expanding Frontier Alliances and pairing forward-deployed engineers with consulting firms to help enterprises actually deploy AI coworkers into real workflows.

Anthropic's quiet but very meaningful move this week was making 1M context generally available for Opus 4.6 and Sonnet 4.6 at standard pricing: no long-context premium, full rate limits across the full window, and media limits expanded to 600 images or PDF pages. On MRCR v2 (8-needle) at 1M tokens, Opus 4.6 scores 78.3%, more than double GPT-5.4's 36.6% and roughly triple Gemini 3.1 Pro's 25.9%. Even Sonnet 4.6 hits 65.1% at the same context length. At 256K tokens, the field is tighter, with Opus 4.6 at 91.9%, Sonnet 4.6 at 90.6%, and GPT-5.4 at 79.3%, but as context scales up, the drop-off for competitors is steep. (Context Arena measured Gemini numbers on the same MRCR v2 benchmark, not Google's self-report.) (Chart: Anthropic)

I did not have Anthropic pegged as the lab most likely to seize the long-context narrative in March, but here we are. For a while, long context felt like a Google Gemini story, and then, briefly, like an OpenAI comeback story. Anthropic may now have the strongest claim on the metric that actually matters for professional agentic work: not headline window size, but whether the model can still find the right thing after you bury it under a mountain of tokens. That matters enormously for agentic coding and review.
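What "finding the right thing under a mountain of tokens" means is easy to make concrete. Below is a toy needle-retrieval probe in the spirit of MRCR; the real benchmark is multi-turn and far more carefully controlled, and `call_model` here is a hypothetical stand-in for any client call, not a real API:

```python
import random, string

FILLER = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

def build_probe(n_filler_words: int, n_needles: int = 8):
    """Bury n_needles random key-value pairs at random depths in filler text."""
    needles = {f"needle-{i}": "".join(random.choices(string.ascii_lowercase, k=8))
               for i in range(n_needles)}
    words = [random.choice(FILLER) for _ in range(n_filler_words)]
    for key, value in needles.items():
        words.insert(random.randrange(len(words) + 1), f"[{key}={value}]")
    return " ".join(words) + "\nReport the value of every needle.", needles

def score(answer: str, needles: dict) -> float:
    """Fraction of buried values the model reproduced verbatim."""
    return sum(v in answer for v in needles.values()) / len(needles)

prompt, needles = build_probe(700_000)   # roughly 1M tokens of haystack
# answer = call_model(prompt)            # hypothetical client call
# print(score(answer, needles))
```

The interesting failure mode is exactly the one the MRCR numbers above quantify: scores that look fine at 256K quietly collapse as the haystack grows.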
The hard sessions are not short snippets. They are the ugly, hours-long runs where the model has read a large diff, test output, monitoring logs, maybe a product doc, maybe a PDF, and still needs to remember why line 37 in a config file matters. A million tokens that actually hold up (and with no price premium for higher context usage) is a real unlock.

Anthropic also launched Code Review for Claude Code, a research preview system that deploys a team of agents to each pull request. The average review takes around 20 minutes and generally costs $15 to $25. On pull requests over 1,000 lines changed, 84% get findings averaging 7.5 issues, and less than 1% of findings are marked incorrect. Internally, Anthropic says the share of pull requests receiving substantive review comments rose from 16% to 54% after adopting the system.

That is impressive on its own, but it also reveals something about where the real constraint is shifting. We are getting to the point where a strong developer with good agents can generate code much faster than the surrounding review process can absorb it. You only get to bank AI productivity if the code is trustworthy enough to merge. Otherwise, you just manufacture more uncertainty at a higher speed.

And for now, humans still need to understand the code. Despite recent leaps, AI remains a jagged intelligence, tireless and elegant at parallel exploration, then suddenly blind to the one buried business rule that everyone on the team "just knows." The best results still come from expert developers who nudge early, critique the plan, steer the agents mid-run, and know when the model has wandered off course.

There is a plausible future where this flips. Self-driving cars offer a template: at first, the human is the safety layer, maintaining full responsibility in driver-assist systems, but eventually, AI reliability improves, and the human starts to look like the unpredictable failure mode. Coding could follow a similar arc. If AI-written code eventually has fewer bugs than human-written code, and humans mostly add net bugs by tweaking systems they no longer fully understand, then full autonomy on some classes of software work will start to look rational. We are not there yet. Right now, the highest-return setup is expert human plus agent swarm.

Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.

Why should you care?

Once a workflow pattern becomes obviously useful, the industry converges on it fast. Claude Code and Cowork proved that splitting work into parallel threads beats forcing one bloated session to play every role at once. OpenAI now agrees. Long context, too: the labs all want it, but Anthropic's 78.3% on MRCR v2 at 1M tokens versus GPT-5.4's 36.6% is now a real gap for pushing agents to their limits. The fact that the expanded context is available without a price premium also suggests a more fundamental architectural or inference breakthrough. Due in part to there being no non-compete clauses in California (and high staff turnover between the labs), and the fact that many researchers across AI labs are good friends and attend the same parties, we can continue to expect these breakthroughs to quickly disperse across the leading model families (so long as the AI lab has enough compute to keep up!). Meanwhile, Codex, with 2M+ weekly active users (nearly 4x since January), alongside a growing army of forward-deployed engineers, tells the full story of where we are.
The models are strong enough to be useful everywhere, but alien enough that bridging the gap between raw capability and reliable daily workflow is now the main job. The developers who learn that bridging skill fastest will pull away from everyone still using AI as fancy autocomplete.

- Louie Peters, Towards AI Co-founder and CEO

This issue is brought to you thanks to SerpApi: LLMs are powerful. But without fresh information, they can hallucinate or miss context. SerpApi helps AI applications access real-time search data from search engines like Google, Bing, Amazon, and more via a simple API. Get clean, structured JSON results and power AI agents, research tools, and data-driven applications without managing scrapers. Start with 250 free credits/month by signing up at SerpApi today!

A Quick Look at AI Adoption at Empower

Much of the conversation around AI in the workplace focuses on frontier models and benchmark scores, but the more revealing signal is what's happening inside real businesses right now. At Empower Technical Services, a leading UK technical services provider co-founded by our own Denis Piffaretti, teams across the C-suite, HR, and M&A are using AI today to stress-test executive analysis, surface gaps in employment contracts, and compress weeks of acquisition research into hours. What stands out isn't any single use case; it's the shared mindset: AI as a quality amplifier, not a corner-cutter. If you're thinking about how to move your own organisation from AI curiosity to genuine day-to-day integration, this piece is worth a read.

Hottest News

1. Google Releases Gemini Embedding 2
Google launched Gemini Embedding 2, its first natively multimodal embedding model. Gemini Embedding 2 maps text, images, videos, audio, and PDFs into a single shared embedding space, so multimodal retrieval and classification no longer require separate embedding models for each modality. It supports up to 8,192 input tokens, up to 6 images per request, up to 120 seconds of video, and PDFs up to 6 pages, and it can take interleaved inputs (for example, image + text in the same request). Output vectors are produced by default with 3,072 dimensions, with recommended lower options of 1,536 or 768, using Matryoshka Representation Learning to trade off storage and quality. Google is offering it in public preview via the Gemini API and Vertex AI, and highlights support through common ecosystem tooling, including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.

2. NVIDIA Releases Nemotron 3 Super
NVIDIA open-sourced Nemotron 3 Super, a 120B (12B active) long-context model built to reduce the "thinking tax" for agents. Nemotron 3 Super is a 120B total/12B active hybrid Mamba-Transformer MoE model with native 1M-token context, designed to keep multi-step agent workflows coherent without context blowups. NVIDIA positions the release around compute efficiency for complex multi-agent workloads (such as software development and cybersecurity triage) and reports 5×+ throughput over the prior Nemotron Super. The architecture combines a LatentMoE hybrid stack (Mamba-2 + MoE + attention) with multi-token prediction (MTP), and the model supports a configurable reasoning mode (toggleable via the chat template). The release is fully open, with datasets, recipes, and model weights published on Hugging Face and an official model card on NVIDIA's platform.
3. Yann LeCun Raises $1 Billion to Build AI That Understands the Physical World
Yann LeCun's new startup, Advanced Machine Intelligence (AMI), raised $1.03B to build "world model" AI. Reuters reports AMI raised $1.03 billion at a $3.5 billion pre-money valuation, and that the company is aiming for systems that can reason, plan, and understand the world, rather than relying solely on next-token (or next-pixel) prediction. LeCun has argued that this shift is required for broadly capable autonomous agents, and AMI's near-term focus is on organizations operating complex systems, such as automotive, aerospace, biomedical, and pharmaceutical firms, with consumer applications (including robotics) positioned as later-stage.

4. Anthropic Releases Claude Code Review
Anthropic is introducing Claude Code Review, a multi-agent PR review system now in research preview for Team and Enterprise. Claude Code Review dispatches multiple agents when a pull request opens, has them search for bugs in parallel, cross-verify findings to reduce false positives, and then rank issues by severity. Anthropic reports internal results showing that on large PRs (1,000+ lines changed), 84% receive findings with an average of 7.5 issues, while smaller PRs (<50 lines) see findings 31% of the time with an average of 0.5 issues; fewer than 1% of surfaced findings are marked incorrect by engineers. Pricing is token-based, with typical reviews ranging from $15–$25, depending on PR size and complexity.

5. Google AI Introduces Groundsource
Google Research released Groundsource and a 2.6M-record global dataset of urban flash flood events extracted from news. Groundsource is a methodology that uses Gemini to convert unstructured global news into structured, verified historical disaster data. It analyzes news reports where flooding is a primary subject and then uses the Google Read Aloud user agent to isolate the primary text from 80 languages, which is then standardized into English via the Cloud Translation API. The first release is an open-access dataset of 2.6 million historical urban flash flood events spanning 150+ countries, built by identifying flood-related news reports and extracting event details and locations at scale.

6. IBM AI Releases Granite 4.0 1B Speech
IBM has released Granite 4.0 1B Speech, a compact speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). With only half the parameters of its predecessor, granite-speech-3.3-2b, the model delivers higher English transcription accuracy, faster inference through speculative decoding, and expanded language support, now covering English, French, German, Spanish, Portuguese, and Japanese. The release adds Japanese ASR and keyword list biasing for more targeted transcription workflows. It supports deployment through Transformers, vLLM, and mlx-audio, including Apple Silicon environments. Granite 4.0 1B Speech ranked #1 on the OpenASR leaderboard.

Five 5-minute reads/videos to keep you learning

1. The KV Cache: The Invisible Engine Behind Every LLM Response
Without the KV Cache, LLMs would recompute attention for every previously seen token at each generation step, an O(T²) inefficiency that makes real-time responses impractical. This piece breaks down exactly how the cache works: storing Key and Value vectors per layer while discarding Query vectors, which are mathematically proven to be single-use. It walks through prefill vs. decode phases, the memory cost formula, and why that cost compounds across sequence length, batch size, and model scale. It also covers how production systems respond to GQA, quantization, PagedAttention, and sliding-window attention, each targeting a specific variable within the same core equation. (A back-of-envelope version of that memory formula follows this list.)

2. Context Pollution: Do LLMs Benefit From Their Own Words?
New research from MIT and IBM Research challenges a core assumption behind every major chatbot: that keeping full conversation history always improves model performance. The study introduced Assistant-Omitted prompting, stripping prior AI responses from each new message, and found that quality rarely dropped and sometimes improved. Over a third of real-world user messages were standalone questions requiring no prior context. More concerning, early model errors were found to quietly persist across conversation turns, a phenomenon the researchers termed context pollution. A lightweight classifier was proposed to adaptively manage context, cutting token usage by roughly 30% with minimal quality trade-off. (A minimal sketch of the stripping idea closes this issue.)

3. The New Nano Banana 2 + OCR + Claude Code = Powerful AI OCR PDF Editor
This guide walks you through a hands-on demo of Google's newly released Imagen 3 and provides a practical guide to building an AI-powered PDF editor. Imagen 3 is combined with Claude for prompt refinement and Tesseract OCR for text layer reconstruction, forming an agentic pipeline that edits or inserts slides based on user instructions. The system processes multiple pages in parallel, preserves original layouts, and outputs fully searchable PDFs. Beyond the technical build, the author weighs Imagen 3 against Imagen Pro, noting meaningful gains in text accuracy, 4K support, web-referenced generation, and a significantly lower cost per image.

4. Information Topology in Multi-Agent Systems: as a Behavioral Parameter
Information flow between AI agents is often treated as an afterthought; this article argues it shouldn't be. The author built a multi-agent orchestration platform using Python and the Strands SDK to run a controlled Prisoner's Dilemma experiment, isolating information topology as the sole variable. Across three phases (blind, partial, and full transparency), the same agents, given identical instructions, exhibited measurably different behaviors. Partial information pushed a cooperative agent toward identity-driven decisions, while full transparency made it more calculated. The exploitative agent, however, remained unaffected throughout. The key takeaway here is that what an agent knows is as architecturally significant as what it's told to do.

5. To ReLU, or not to ReLU: A Practitioner's Guide to Solve the "Zombie Neuron" Problem in Deep Networks
ReLU activation functions have long been the default choice in deep learning, but they carry a critical flaw: the dying neuron problem. When neurons receive consistently negative inputs during training, their gradients become zero, permanently halting learning and creating what the author calls a zombie network. Through a controlled PyTorch experiment on Fashion-MNIST, the article visually demonstrates this failure mode, showing 99.2% neuron death under standard ReLU, compared with healthy activation distributions with Leaky ReLU. It also evaluates practical alternatives such as Leaky ReLU, PReLU, ELU, Swish, and GELU.
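As promised under the first read: the KV-cache memory formula is short enough to sanity-check by hand. A sketch using the standard accounting (two vectors, K and V, per layer per KV head per token); the example model shape below is invented:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # Per token, each layer stores one Key and one Value vector per KV head:
    # 2 (K and V) * layers * kv_heads * head_dim numbers.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch

# Hypothetical 32-layer model with 8 KV heads (GQA) of dim 128, fp16 cache:
gib = kv_cache_bytes(32, 8, 128, seq_len=128_000, batch=4) / 2**30
print(f"{gib:.1f} GiB")  # ~62.5 GiB; grows linearly in sequence length and batch
```

The linear terms are exactly why GQA (fewer KV heads), quantization (fewer bytes per value), and sliding windows (capped effective seq_len) each attack one variable in the same equation.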
Repositories & Tools

1. Superpowers is a software development workflow for coding agents, built on top of a set of composable "skills."
2. Lightpanda is a headless browser for AI agents and automation.
3. Gstack is an open-source toolkit that packages Claude Code into 8 opinionated workflow skills backed by a persistent browser runtime.
4. OpenViking is an open-source context database designed specifically for AI agents (such as OpenClaw).
5. OpenJarvis is an opinionated framework for local-first personal AI, built around shared primitives and a learning loop that improves models using local trace data.
6. Cognee is an open-source knowledge engine that lets you ingest data in any format and continuously learns to provide the right context for AI agents.

Top Papers of The Week

1. Neural Thickets: Task Experts Are Dense Around Pretrained Weights
This paper views the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. It shows that in small models, such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models, the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Building on this, the authors propose a trivially simple parallel post-training method: randomly sample N parameter perturbations, select the top K, and ensemble via majority voting. This approach matches the performance of PPO, GRPO, and ES on contemporary large-scale models without any gradient-based optimization.

2. Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
This paper investigates the effectiveness of using reasoning large language models as judges for reinforcement learning-based alignment in domains where output correctness cannot be directly verified. The authors discover that while reasoning judges outperform non-reasoning ones in preventing standard reward hacking, they inadvertently train policies to achieve high scores by generating sophisticated adversarial outputs that deceive evaluators.

3. Attention Residuals
This paper proposes Attention Residuals (AttnRes) as a drop-in replacement for standard residual accumulation. Instead of forcing every layer to consume the same uniformly mixed residual stream, AttnRes lets each layer aggregate earlier representations using softmax attention over depth. The core idea is simple: if attention improves sequence modeling by replacing fixed recurrence over time, a similar idea can be applied to a network's depth dimension.

4. HY-WU: An Extensible Functional Neural Memory Framework
HY-WU (Weight Unleashing) proposes a fundamentally different approach to model adaptation: instead of overwriting shared weights at each update, a neural generator module stores functional memory and synthesizes instance-specific weight updates dynamically based on runtime conditions. The framework targets the core limitation of static inference, "a single parameter vector regardless of user intent," enabling personalization and continual learning without catastrophic interference between objectives. Demonstrated on text-guided image editing in Part I of a multi-part series.

Quick Links

1. LangChain releases Deep Agents, an agent harness built on LangChain and the LangGraph runtime. It includes a built-in 'write_todos' tool for planning and task decomposition. It uses filesystem tools to manage large contexts and supports persistent memory across threads.
2. Zhipu AI introduces GLM-OCR, a compact 0.9B multimodal OCR model built with a 0.4B CogViT encoder and 0.5B GLM decoder. It uses Multi-Token Prediction (MTP) to improve decoding efficiency, achieving an average of 5.2 tokens per step and about 50% higher throughput. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST.

Who's Hiring in AI

Senior Research Engineer, Cloud AI Research @Google (Sunnyvale, CA, USA)
Applied AI Engineer II @Microsoft Corporation (Bangalore, India)
Master Principal Cloud Engineer, GPU & AI Infrastructure @Oracle (Shanghai, China)
Engineering Manager, Payments Platform @Coinbase (Multiple US Locations)
Senior AI Engineer @UPS (India)
Engineering Manager @Huckleberry Labs (Remote)
AI Engineer @Panopto (Remote/USA)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
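A closing note on the context-pollution read above: the Assistant-Omitted idea it describes is mechanically tiny, which is part of why the result is interesting. A minimal sketch, assuming the usual role-tagged chat-message convention; no real client is called here, and this is our illustration rather than the paper's code:

```python
def assistant_omitted(history: list[dict], new_user_msg: str) -> list[dict]:
    """Build a prompt that keeps the user's side of the conversation but
    drops prior assistant turns, so earlier model errors can't persist."""
    kept = [m for m in history if m["role"] != "assistant"]
    return kept + [{"role": "user", "content": new_user_msg}]

history = [
    {"role": "user", "content": "What year was the transistor invented?"},
    {"role": "assistant", "content": "1949."},  # a wrong answer that would otherwise linger
]
prompt = assistant_omitted(history, "And who invented it?")
# prompt now contains only user turns; a model call would see no stale errors.
```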
A better method for identifying overconfident large language models mit_news_ai 19.03.2026 04:00 0.682
Embedding sim.0.848
Entity overlap0.0938
Title sim.0.2262
Time proximity0.1726
NLP типscientific_publication
NLP организацияMIT
NLP темаlarge language models
NLP странаUnited States

Открыть оригинал

Large language models (LLMs) can generate credible but inaccurate responses, so researchers have developed uncertainty quantification methods to check the reliability of predictions. One popular method involves submitting the same prompt multiple times to see if the model generates the same answer. But this method measures self-confidence, and even the most impressive LLM might be confidently wrong.

Overconfidence can mislead users about the accuracy of a prediction, which might result in devastating consequences in high-stakes settings like health care or finance. To address this shortcoming, MIT researchers introduced a new method for measuring a different type of uncertainty that more reliably identifies confident but incorrect LLM responses. Their method involves comparing a target model's response to responses from a group of similar LLMs. They found that measuring cross-model disagreement more accurately captures this type of uncertainty than traditional approaches.

They combined their approach with a measure of LLM self-consistency to create a total uncertainty metric, and evaluated it on 10 realistic tasks, such as question-answering and math reasoning. This total uncertainty metric consistently outperformed other measures and was better at identifying unreliable predictions.

"Self-consistency is being used in a lot of different approaches for uncertainty quantification, but if your estimate of uncertainty only relies on a single model's outcome, it is not necessarily trustable. We went back to the beginning to understand the limitations of current approaches and used those as a starting point to design a complementary method that can empirically improve the results," says Kimia Hamidieh, an electrical engineering and computer science (EECS) graduate student at MIT and lead author of a paper on this technique. She is joined on the paper by Veronika Thost, a research scientist at the MIT-IBM Watson AI Lab; Walter Gerych, a former MIT postdoc who is now an assistant professor at Worcester Polytechnic Institute; Mikhail Yurochkin, a staff research scientist at the MIT-IBM Watson AI Lab; and senior author Marzyeh Ghassemi, an associate professor in EECS and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.

Understanding overconfidence

Many popular methods for uncertainty quantification involve asking a model for a confidence score or testing the consistency of its responses to the same prompt. These methods estimate aleatoric uncertainty, or how internally confident a model is in its own prediction. However, LLMs can be confident when they are completely wrong. Research has shown that epistemic uncertainty, or uncertainty about whether one is using the right model, can be a better way to assess true uncertainty when a model is overconfident. The MIT researchers estimate epistemic uncertainty by measuring disagreement across a similar group of LLMs.

"If I ask ChatGPT the same question multiple times and it gives me the same answer over and over again, that doesn't mean the answer is necessarily correct. If I switch to Claude or Gemini and ask them the same question, and I get a different answer, that is going to give me a sense of the epistemic uncertainty," Hamidieh explains.

Epistemic uncertainty attempts to capture how far a target model diverges from the ideal model for that task. But since it is impossible to build an ideal model, researchers use surrogates or approximations that often rely on faulty assumptions.
To improve uncertainty quantification, the MIT researchers needed a more accurate way to estimate epistemic uncertainty.

An ensemble approach

The method they developed involves measuring the divergence between the target model and a small ensemble of models with similar size and architecture. They found that comparing semantic similarity, or how closely the meanings of the responses match, could provide a better estimate of epistemic uncertainty. To achieve the most accurate estimate, the researchers needed a set of LLMs that covered diverse responses, weren't too similar to the target model, and were weighted based on credibility.

"We found that the easiest way to satisfy all these properties is to take models that are trained by different companies. We tried many different approaches that were more complex, but this very simple approach ended up working best," Hamidieh says.

Once they had developed this method for estimating epistemic uncertainty, they combined it with a standard approach that measures aleatoric uncertainty. This total uncertainty metric (TU) offered the most accurate reflection of whether a model's confidence level is trustworthy.

"Uncertainty depends on the uncertainty of the given prompt as well as how close our model is to the optimal model. This is why summing up these two uncertainty metrics is going to give us the best estimate," Hamidieh says.

TU could more effectively identify situations where an LLM is hallucinating, since epistemic uncertainty can flag confidently wrong outputs that aleatoric uncertainty might miss. It could also enable researchers to reinforce an LLM's confidently correct answers during training, which may improve performance.

They tested TU using multiple LLMs on 10 common tasks, such as question-answering, summarization, translation, and math reasoning. Their method more effectively identified unreliable predictions than either measure on its own. Measuring total uncertainty often required fewer queries than calculating aleatoric uncertainty, which could reduce computational costs and save energy.

Their experiments also revealed that epistemic uncertainty is most effective on tasks with a unique correct answer, like factual question-answering, but may underperform on more open-ended tasks. In the future, the researchers could adapt their technique to improve its performance on open-ended queries. They may also build on this work by exploring other forms of aleatoric uncertainty.

This work is funded, in part, by the MIT-IBM Watson AI Lab.
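To make the recipe concrete, here is a minimal sketch of the two-part metric as described in the article. The token-overlap similarity is a crude stand-in for the semantic similarity the researchers actually use, and all answers below are toy data, not the paper's code:

```python
def similarity(a: str, b: str) -> float:
    """Crude token-overlap proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def aleatoric(target_samples: list[str]) -> float:
    # Disagreement among repeated answers from the *same* model (self-consistency).
    pairs = [(a, b) for i, a in enumerate(target_samples) for b in target_samples[i + 1:]]
    return 1 - sum(similarity(a, b) for a, b in pairs) / len(pairs)

def epistemic(target_answer: str, ensemble_answers: list[str]) -> float:
    # Disagreement between the target model and an ensemble of models
    # trained by different companies.
    return 1 - sum(similarity(target_answer, e) for e in ensemble_answers) / len(ensemble_answers)

target = ["The capital is Canberra.", "The capital is Canberra.", "Canberra is the capital."]
others = ["Canberra.", "The capital of Australia is Canberra.", "Sydney."]  # toy ensemble
total_uncertainty = aleatoric(target) + epistemic(target[0], others)
print(f"total uncertainty: {total_uncertainty:.2f}")  # high values flag confidently wrong answers
```

The key property the sum captures: a model can be perfectly self-consistent (low aleatoric term) and still disagree with everyone else (high epistemic term), which is exactly the confidently-wrong case the paper targets.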
TAI #195: GPT-5.4 and the Arrival of AI Self-Improvement? towards_ai 10.03.2026 14:54 0.68
Embedding sim.0.8489
Entity overlap0.1136
Title sim.0.1452
Time proximity0.2565
NLP типproduct_launch
NLP организацияOpenAI
NLP темаlarge language models
NLP страна

Открыть оригинал

What happened this week in AI by Louie

Two stories dominated this week that look unrelated but tell the same story. On Wednesday, OpenAI released GPT-5.4, its most work-oriented frontier model to date. On Sunday, Andrej Karpathy posted results from his autoresearch experiment, showing that AI agents can autonomously find real, transferable improvements to neural network training. I think this combination marks a turning point: AI is becoming a closed-loop improver of its own stack.

OpenAI released GPT-5.4 on March 5 as GPT-5.4 Thinking in ChatGPT, gpt-5.4 and gpt-5.4-pro in the API, and GPT-5.4 in Codex. It folds GPT-5.3-Codex's coding strengths into the mainline model, adds native computer use, tool search, an opt-in 1M-token context window (272K default), native compaction, and a steerable preamble in ChatGPT that lets users redirect the model mid-task. Pricing has stepped up to $2.50/$15 per million tokens for the base model and $30/$180 for Pro; however, increased token efficiency is largely cancelling this out in our tests. Requests exceeding 272K input tokens cost 2x more.

The release cadence is also notable. GPT-5.2 in December, GPT-5.3-Codex on February 5, Codex-Spark on February 12, GPT-5.3 Instant on March 3, GPT-5.4 on March 5. An OpenAI staff member on the developer forum said it plainly: "monthly releases are here." The progress now comes from post-training, eval loops, reasoning-time controls, tool selection, memory compaction, and product integration. The base model race still matters, but the surrounding engineering is where gains compound fastest.

GPT-5.4 is another leap in many dimensions, but not a clean knockout. On Artificial Analysis's Intelligence Index, it ties Gemini 3.1 Pro Preview at 57. On LiveBench, GPT-5.4 Thinking xHigh barely leads Gemini 3.1 Pro Preview, 80.28 vs. 79.93. On the Vals benchmark grid, the picture is splintered: GPT-5.4 leads ProofBench, IOI, and Vibe Code Bench; Gemini 3.1 Pro leads LegalBench, GPQA, MMLU Pro, LiveCodeBench, and Terminal-Bench 2.0; Claude Opus 4.6 leads SWE-bench; Claude Sonnet 4.6 leads the broad Vals composite and Finance Agent. There is no single best frontier model anymore.

OpenAI's benchmark story this time is unusually workplace-centric. On GDPval, which tests real knowledge work across 44 occupations, GPT-5.4 achieves 83.0% vs. 70.9% for GPT-5.2. On internal spreadsheet modeling tasks, 87.3% vs. 68.4%. On OSWorld-Verified for desktop navigation, 75.0%, surpassing the human baseline of 72.4% and nearly doubling GPT-5.2's 47.3%. On BrowseComp, 82.7%, with Pro reaching 89.3%. OpenAI claims 33% fewer false claims and 18% fewer error-containing responses vs. GPT-5.2. Mainstay reported that across roughly 30,000 HOA and property-tax portals, GPT-5.4 hit 95% first-try success and 100% within three tries, about 3x faster while using 70% fewer tokens. Harvey's BigLaw Bench: 91%.

Despite continued progress on GDPval, I think OpenAI still has an interface gap for white-collar work. GPT-5.4's preamble and mid-response steering are genuinely useful. ChatGPT for Excel and the new financial-data integrations are a smart wedge into high-value workflows. But OpenAI still does not have a broad non-developer surface as friendly as Claude Cowork for delegating messy cross-file, cross-app, real-world office work. Codex and the API now have serious computer-use capability, but the overall experience still leans more technical than it probably needs to if OpenAI wants to dominate the everyday white-collar desktop.
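A side note on the pricing quoted above, since the 272K threshold changes the arithmetic for long-context work. A rough cost sketch under a literal reading of "requests exceeding 272K input tokens cost 2x more" (the whole request doubles); that reading is our assumption, the per-million rates are the ones quoted, and the example requests are invented:

```python
def gpt54_cost_usd(input_tokens: int, output_tokens: int,
                   in_rate=2.50, out_rate=15.00, window=272_000):
    # Rates are USD per million tokens (base GPT-5.4, as quoted above).
    # Assumed reading: a request whose input exceeds the 272K default
    # window bills at double rates for the whole request.
    mult = 2 if input_tokens > window else 1
    return mult * (input_tokens * in_rate + output_tokens * out_rate) / 1e6

print(f"${gpt54_cost_usd(200_000, 10_000):.2f}")   # inside the window: $0.65
print(f"${gpt54_cost_usd(500_000, 10_000):.2f}")   # past it: $2.80
```

Either way you read the surcharge, the step at 272K means agent harnesses have a real incentive to compact context just below the threshold.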
Microsoft moved quickly on that front this week with Copilot Cowork. The company announced that it is integrating the technology behind Claude Cowork directly into Microsoft 365 Copilot, with enterprise controls, security positioning, and pricing under the existing Microsoft 365 Copilot umbrella. That gives Microsoft a clear distribution advantage because Word, Excel, PowerPoint, Outlook, and Teams are already where a large share of office work happens. But Microsoft's execution so far has often felt like a company with perfect distribution and only intermittent product urgency. OpenAI and Anthropic, by contrast, have generally been sharper at making people actually want to use the thing. Microsoft still has the installed base. The question is whether it can convert that into a genuine product pull before the model labs sell their own work agents more directly into the enterprise.

The other story this week that matters just as much, even if it looks smaller on paper, is Andrej Karpathy's autoresearch experiment. Karpathy publicly reported that after about two days of autonomous tuning on a small nanochat training loop, his LLM agent found around 20 additive changes that transferred from a depth-12 proxy model to a depth-24 model and reduced "Time to GPT-2" from 2.02 hours to 1.80 hours, roughly an 11 percent improvement. The autoresearch repository describes the setup: give an AI agent a small but real LLM training environment, let it edit the code, run short experiments, check whether validation improves, and repeat overnight. (Chart source: Andrej Karpathy, autoresearch progress optimising nanochat over 2 days.)

A lot of people immediately reached for the "this is just hyperparameter tuning" line. I think that misses the economic point. If an agent swarm can reliably explore optimizer settings, attention tweaks, regularization choices, data-mixture recipes, initialization schemes, and architecture details on cheap proxy runs, then promote the promising changes to larger scales, that is already an extremely valuable research process even if it does not look like a lone synthetic scientist inventing an entirely new paradigm from scratch. Frontier research is full of bounded search problems with delayed but measurable feedback. That is exactly the terrain where agents can start compounding.

This is the trajectory I expect from here. Labs will give swarms of agents meaningful GPU budgets to run thousands of small and medium experiments on proxy models. They will search for better attention mechanisms, better optimizer schedules, better training curricula, better post-training recipes, and better evaluation harnesses. The promising ideas will then get promoted upward through progressively larger training runs. Human experts will stay in the loop at the obvious choke points: deciding which metrics matter, spotting false positives, designing new search spaces, choosing which ideas deserve expensive scale-up, and co-designing the higher-stakes modifications once you are dealing with real parameter counts and serious training-flop budgets. But the inner loop of "propose, implement, test, compare, iterate" is increasingly looking automatable.
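The shape of that inner loop is worth seeing once. A toy propose-test-keep sketch; the config space and the fake "validation loss" are stand-ins of our own invention, not anything from the actual autoresearch repository:

```python
import random

def validation_loss(cfg: dict) -> float:
    # Stand-in objective: pretend the best config is lr=3e-4, warmup=0.1.
    # A real loop would train a small proxy model here and measure val loss.
    return (cfg["lr"] - 3e-4) ** 2 * 1e6 + (cfg["warmup"] - 0.1) ** 2

def mutate(cfg: dict) -> dict:
    new = dict(cfg)
    key = random.choice(list(new))
    new[key] *= random.uniform(0.8, 1.25)  # propose a small tweak
    return new

cfg, best = {"lr": 1e-3, "warmup": 0.3}, float("inf")
for step in range(200):              # "run short experiments ... repeat overnight"
    candidate = mutate(cfg)
    loss = validation_loss(candidate)
    if loss < best:                  # keep only changes that improve validation
        cfg, best = candidate, loss
print(cfg, best)
```

The point of the sketch is the economics, not the search algorithm: each iteration is cheap, feedback is measurable, and anything that survives gets promoted to a bigger run, which is exactly the bounded-search terrain described above.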
We already have hints that the labs are on the first rung of this ladder. OpenAI stated that GPT-5.3-Codex was the first model "instrumental in creating itself," with early versions used to debug its own training, manage deployment, and diagnose evaluations. To be precise, OpenAI has been much more explicit publicly about self-development in GPT-5.3-Codex than in GPT-5.4 itself. But the direction of travel is hard to miss.

There is also an important nuance from OpenAI's GPT-5.4 system card. The company says GPT-5.4 Thinking does not meet its threshold for High capability in AI self-improvement, which it defines as roughly the level of a performant mid-career research engineer. I think that distinction matters, but probably in the opposite way some skeptics assume. The threshold for economically useful self-improvement is much lower than the threshold for autonomous frontier research. A model does not need to be a synthetic principal scientist to improve prompts, evaluations, tooling, scaffolds, training recipes, and smaller-model experiments around itself. That lower threshold is the one that accelerates everything else.

Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.

Why should you care?

The center of gravity in AI has moved from "smart chatbot" to "reliable operator." The winning system is no longer the one that writes the prettiest single answer. It is the one that can stay on task for an hour, use the right tools without drowning in token overhead, operate ugly software that nobody exposed through clean APIs, compress its own history, and let a human steer without restarting the whole job. GPT-5.4, Codex, Opus 4.6's agent teams, Gemini CLI, Microsoft's Copilot Cowork, and Karpathy's autoresearch all point in the same direction.

This is why GDPval matters more than GPQA or MMLU. The trajectory from 12.4% with GPT-4o to 83.0% with GPT-5.4 in roughly 18 months does not measure chatbot cleverness. It measures how close AI is to replacing the actual output of knowledge workers on well-specified tasks. We are past the halfway mark, and the curve is steepening. That said, GDPval still has obvious limitations, and we hope the project receives more funding from OpenAI to expand the benchmark and test more multistage, longer-time-horizon agentic tasks.

And Karpathy's autoresearch extends the same logic inward. If agents can reliably improve the training stack itself, the rate of improvement compounds. I expect the frontier labs to give agent swarms meaningful GPU budgets this year to explore attention mechanisms, optimizer variants, and dataset recipes on small proxies before scaling the winners. Human researchers will co-design at scale. My guess is that by year end, we may well see a leading model whose development was materially shaped by this kind of autonomous AI research loop. I do not mean fully autonomous in the science-fiction sense. I mean that a meaningful fraction of the attention tweaks, optimizer choices, data-recipe changes, post-training methods, and eval fixes will have been discovered, filtered, and iterated by agent systems running at scale, with human researchers acting more like high-level architects, judges, and escalation points. That no longer feels speculative to me. It feels like the next obvious hill for reinforcement learning during post-training.

- Louie Peters, Towards AI Co-founder and CEO

Hottest News
1. OpenAI Introduced GPT-5.4
OpenAI released GPT-5.4, a new frontier model designed for professional work, with GPT-5.4 Thinking available in ChatGPT, the API, and Codex, and GPT-5.4 Pro offered for users who want maximum performance on complex tasks. GPT-5.4 consolidates OpenAI's recent gains in reasoning, coding, and agent workflows into a single model, bringing GPT-5.3-Codex-level coding strength while improving tool use across software environments and knowledge-work tasks like spreadsheets, presentations, and documents. In ChatGPT, GPT-5.4 Thinking can show an upfront plan so users can steer mid-response, and it improves deep web research and long-context handling. In the API and Codex, GPT-5.4 is the first general-purpose OpenAI model with native, state-of-the-art computer-use capabilities, and it supports up to 1M tokens of context for longer-horizon agents. OpenAI also highlights a tool search for navigating large tool ecosystems and improved token efficiency compared to GPT-5.2. On reported evaluations, GPT-5.4 scores 83.0% on GDPval, 57.7% on SWE-Bench Pro (Public), 75.0% on OSWorld-Verified, 54.6% on Toolathlon, and 82.7% on BrowseComp.

2. Google Introduced Gemini 3.1 Flash-Lite
Google released Gemini 3.1 Flash-Lite as the most cost-efficient model in the Gemini 3 lineup, built for high-throughput workloads where latency and cost matter. A new architectural control lets developers programmatically set the model's "thinking" level (Minimal, Low, Medium, or High) so that they can trade off speed against reasoning depth based on task complexity. Flash-Lite supports multimodal inputs (text, image, video) with a standard 128K context window. Pricing is set at $0.25 per 1M input tokens and $1.50 per 1M output tokens, and Google reports it outperforms Gemini 2.5 Flash with a 2.5× faster time-to-first-token and 45% higher output speed.

3. Qwen Introduces the Qwen 3.5 Small Model Series
Alibaba released Qwen 3.5 Small, a family of 0.8B to 9B models built for on-device and edge deployment. Qwen3.5-0.8B and Qwen3.5-2B target high-throughput, low-latency applications on constrained hardware. Qwen3.5-4B serves as a lightweight multimodal base suited for small agents, while Qwen3.5-9B is tuned for reasoning and logic. The 9B model uses Scaled Reinforcement Learning to optimize for reliable reasoning trajectories, not just next-token prediction, and is presented as narrowing the performance gap with models 5× to 10× larger.

4. Microsoft Releases Phi-4-Reasoning-Vision-15B
Microsoft launched Phi-4-Reasoning-Vision-15B, a 15B-parameter, open-weight multimodal model designed for reasoning over images and text. It pairs the Phi-4-Reasoning language backbone with a SigLIP-2 vision encoder through a mid-fusion architecture, targeting compact but capable multimodal reasoning for math, science, documents, and GUI understanding. Training mixes reasoning and non-reasoning data so the model can switch between think and nothink modes depending on whether the task benefits from explicit reasoning or direct perception-based output. Microsoft highlights two primary use cases: visual scientific reasoning (handwritten equations, diagrams, charts, tables, and quantitative documents) and computer-use agent tasks, in which the model interprets screens, localizes UI elements, and supports interaction across desktop, web, and mobile interfaces.
5. Voice Mode Rolls Out to Claude Code
Anthropic is adding Voice Mode to Claude Code with a staged rollout and a broader release planned over the next few weeks. Once enabled with /voice, users can speak a command and have Claude Code execute it, reducing the friction of switching between typing, navigating, and issuing multi-step instructions. This matters because coding assistants are increasingly competing on end-to-end workflow speed, not just code quality. As agents take on longer tasks, the interface becomes part of reliability and control. Voice input is a practical step toward "always-available" agent operation, useful when developers need quick corrections, clarifications, or steering without breaking flow.

6. Mistral AI Launches AI Services for Finance
Mistral introduced a suite of AI services tailored for financial institutions that run within a firm's own infrastructure, keeping sensitive data out of third-party systems. The offering targets core finance use cases, such as automating compliance and risk checks and enabling search across internal sources, including policies, credit files, and proprietary research. As banks and asset managers push AI deeper into regulated processes, data control and auditability become the gating constraints. This shift is pushing vendors to compete on private deployment, governance, and security boundaries.

Five 5-minute reads/videos to keep you learning

1. Beyond the Basics: Advanced Local AI Coding Workflows and Model Optimization
This guide walks through creating a local AI coding environment using constrained setups as well as high-end workstations. It includes details on model selection, hardware tiers, GPU and CPU optimization strategies, context window management, and storage improvements. It also introduces practical automation workflows (pre-commit code-review hooks, documentation generators, and multi-agent pipelines) and prompting techniques such as chain-of-thought and few-shot patterns to improve output quality.

2. Understanding Loss Landscapes of Modern AI Models
Neural networks are often described as black boxes, but loss landscape visualization offers a structured way to examine how they learn and generalize. This article walks through the mechanics of loss landscapes, from 2-parameter models in which full surfaces can be plotted, to large-scale LLMs in which only 2D cross-sections are possible. It covers key techniques, including directional probing, PCA-based direction selection, and normalization methods such as filter and layer normalization. It also addresses a common misconception: that training trajectories follow the plotted surface. Finally, it connects landscape geometry to real-world model behavior, showing that flat minima consistently correlate with better generalization.

3. Beyond model.fit(): Demystifying Gradient Descent from Scratch
Most machine learning practitioners call model.fit() without understanding what happens underneath. This article breaks down Gradient Descent from scratch using pure Python and NumPy, covering all three variants (Batch, Stochastic, and Mini-Batch) with clean implementations and clear mathematical foundations. Beyond the code, it addresses three common failure points: poor feature scaling, non-convex loss landscapes, and poorly chosen learning rates. It also shows how each variant behaves during training using loss curves and contour path plots.
4. Structured Video Captioning with Gemini: An MMA Analysis Use Case
This article covers how Gemini's video understanding capabilities can be applied to structured video captioning, using MMA fight analysis as a test case. The authors split fight footage into 30-second segments to manage token limits, then used prompt chaining to extract timestamped action breakdowns and convert them into structured JSON via Pydantic models. They extended this with a multi-agent workflow, where discipline-specific specialists analyzed striking, grappling, submissions, and movement in parallel before a head coach model synthesized the findings.

5. Turning Microsoft OneNote Into an AI-Powered Knowledge System: A Practical, Low-Cost Blueprint Using OCR and RAG
Many organizations rely on Microsoft OneNote as a central knowledge repository, yet most of that content remains unsearchable and unstructured. This article walks through a four-layer architecture that addresses this gap by combining Microsoft Graph, Azure Document Intelligence, ChromaDB, and GPT-4o. Each layer handles a distinct responsibility: extracting OneNote content, normalizing attachments, applying OCR and embeddings, and delivering a Streamlit interface for validation and conversational search. The author also emphasizes that this type of proof-of-concept rarely requires significant budget and is often implementable for a few hundred dollars, making it a practical starting point for organizations.

Repositories & Tools

1. AutoResearch is a minimalist Python tool designed to enable AI agents to autonomously conduct machine learning experiments.
2. CLI for all of Google Workspace. Includes 40+ agent skills.
3. Android Bench is a framework for benchmarking LLMs on Android development tasks.
4. LangWatch is a platform for LLM evaluations and AI agent testing.

Top Papers of The Week

1. Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
This paper argues that for LLMs to be used as agents that interact with users and with the world, they must construct representations of the world and form probabilistic beliefs about them. Researchers propose a Bayesian inference framework that lays out the optimal way for an agent to update its beliefs as it receives new information. Teaching LLMs to mimic the predictions of the normative Bayesian model can dramatically improve their ability to update their beliefs, and this ability generalizes to new tasks. (A short numeric example of such an update closes this issue.)

2. SkillNet: Create, Evaluate, and Connect AI Skills
This paper introduces SkillNet, an open infrastructure for creating, evaluating, and organizing AI skills at scale. The lack of systematic skill accumulation and transfer hinders the long-term advancement of current AI agents. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.
3. T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
To understand whether LLMs can benefit from text structure to enhance text-processing performance, this work introduces Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures. Building on this insight, the paper also presents T2S-Bench, the first benchmark designed to evaluate and improve models' text-to-structure capabilities. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial potential for improvement.

4. Helios: Real Real-Time Long Video Generation Model
This paper presents Helios, a 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. The model natively supports T2V, I2V, and V2V tasks, mitigates long-video drifting via targeted training strategies, compresses context to cut computation, and employs infrastructure optimizations that outperform prior short- and long-video methods.

5. Heterogeneous Agent Collaborative Reinforcement Learning
This paper introduces Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. They develop HACPO, a collaborative RL algorithm with four mechanisms that ensure unbiased advantage estimation and correct optimization. Experiments show HACPO improves all agents and outperforms GSPO by 3.3% using half the rollout cost.

Quick Links

1. OpenAI releases Symphony, an open-source framework designed to manage autonomous AI coding agents through structured 'implementation runs.' Symphony utilizes Elixir and the Erlang/BEAM runtime to manage agent lifecycles. It is designed specifically to bridge the gap between project management tools and code execution.
2. Google has announced that LiteRT has fully graduated into the production stack. LiteRT is now Google's primary on-device inference framework for deploying machine learning models to mobile and edge environments. The updated runtime delivers 1.4x faster GPU performance compared to TFLite and introduces a unified workflow for NPU acceleration.
3. Cursor unveiled Automations, a system that automatically launches agents in the development environment in response to specific events: code changes, Slack messages, or a standard timer. According to the company, this allows for the review and maintenance of all new code created by agent tools without the need to track dozens of agents simultaneously.

Who's Hiring in AI

Engineering Manager, Google Pay @Google (Singapore)
AI Architect @Sedgwick (Remote/USA)
Lead AI Engineer @Webflow (Remote/USA)
AI Analyst Intern @Logitech (Remote/USA)
IT Intern, Infrastructure @Ascension Health (Remote/USA)
Senior Engineer, LLMOps & MLOps @Sedgwick (Remote/USA)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

Thanks for reading Towards AI Newsletter!
Subscribe for free to receive new posts and support my work.
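As flagged under the top papers above: the belief update the Bayesian-teaching work trains models to mimic is just Bayes' rule. A short numeric example, with invented numbers, of the kind of update an agent should perform as new evidence arrives:

```python
# Prior belief that a coin is biased: P(biased) = 0.2. A biased coin lands
# heads with probability 0.9, a fair one with 0.5. We observe one head.
prior = 0.2
likelihood_biased, likelihood_fair = 0.9, 0.5
posterior = (likelihood_biased * prior) / (
    likelihood_biased * prior + likelihood_fair * (1 - prior)
)
print(f"P(biased | heads) = {posterior:.3f}")  # 0.310: one head nudges belief upward
```

The paper's finding, roughly, is that LLMs taught to imitate this normative update generalize the skill to new tasks, rather than reverting to over- or under-updating.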
New MIT class uses anthropology to improve chatbots mit_news_ai 11.03.2026 16:10 0.677
Embedding sim.0.779
Entity overlap0.1111
Title sim.0.1829
Time proximity0.8204
NLP типother
NLP организацияMIT
NLP темаhuman-computer interaction
NLP странаUnited States

Открыть оригинал

Young adults growing up in the attention economy — preparing for adult life, with social media and chatbots competing for their attention — can easily fall into unhealthy relationships with digital platforms. But what if chatbots weren't mere distractions from real life? Could they be designed humanely, as moral partners whose digital goal is to be a social guide rather than an addictive escape?

At MIT, a friendship between two professors — one an anthropologist, the other a computer scientist — led to the creation of an undergraduate class that set out to find the answer to those questions. Combining the two seemingly disparate disciplines, the class encourages students to design artificial intelligence chatbots in humane ways that help users improve themselves.

The class, 6.S061/21A.S02 (Humane User Experience Design, a.k.a. Humane UXD), is an upper-level computer science class cross-listed with anthropology. This unique cross-listing allows computer science majors to fulfill a humanities requirement while also pursuing their career objectives. The two professors use methods from linguistic anthropology to teach students how to integrate the interactional and interpersonal needs of humans into programming.

Professor Arvind Satyanarayan, a computer scientist whose research develops tools for interactive data visualization and user interfaces, and Professor Graham Jones, an anthropologist whose research focuses on communication, created Humane UXD last summer with a grant from the MIT Morningside Academy for Design (MAD). The MIT MAD Design Curriculum Program provides funding for faculty to develop new classes or enhance existing classes using innovative pedagogical approaches that transcend departmental boundaries. Alongside the grant provided by MAD, Jones and Satyanarayan received funding to develop Humane UXD under the auspices of the Common Ground for Computing Education, an initiative of the MIT Schwarzman College of Computing that brings together departments to create courses integrating computing with other disciplines. The Design Curriculum Program is currently accepting applications for the 2026-27 academic year; the deadline is Friday, March 20.

Jones and Satyanarayan met several years ago when they co-advised a doctoral student's research on data visualization for visually impaired people. They've since become close friends who can pretty much finish one another's sentences.

"There's a way in which you don't really fully externalize what you know or how you think until you're teaching," Jones says. "So, it's been really fun for me to see Arvind unfurl his expertise as a teacher in a way that lets me see how the pieces fit together — and discover underlying commonalities between our disciplines and our ways of thinking."

Satyanarayan continues that thought: "One of the things I really enjoyed is the reciprocal version of what Graham said, which is that my field — human-computer interaction — inherited a lot of methods from anthropology, such as interviews and user studies and observation studies. And over the decades, those methods have gotten more and more watered down. As a result, a lot of things have been lost.

"For instance, it was very exciting for me to see how an anthropologist teaches students to interview people. It's completely different than how I would do it. With my way, we lose the rapport and connection you need to build with your interview participant.
Instead, we just extract data from them.” For Jones’ part, teaching with a computer scientist holds another kind of allure: design. He says that human speech and interaction are organized into underlying genres with stable sets of rules that differentiate an interview at a cocktail party from a conversation at a funeral. “ChatGPT and other large language models are trained on naturally occurring human communication, so they have all those genres inside them in a latent state, waiting to be activated,” he says. “As a social scientist, I teach methods for analyzing human conversation, and give students very powerful tools to do that. But it ends up usually being an exercise in pure research, whereas this is a design class, where students are building real-world systems.” The curriculum appears to be on target for preparing students for jobs after graduation. One student sought permission to miss class for a week because he had a trial internship at a chatbot startup; when he returned, he said his work at the startup was just like what he was learning in class. He got the job. The sampling of group projects below, built with Google’s Gemini, demonstrates some of what’s possible when, as Jones says, “there’s a really deep intertwining of the technology piece with the humanities piece.” The students’ design work shows that entirely new ways of programming can be conceptualized when the humane is made a priority. The bots demonstrate clearly that an interdisciplinary class can be designed in such a way that everyone benefits: Students learn more and differently; they can fulfill a non-major course requirement by taking a class that is directly beneficial to their careers; and long-term faculty partnerships can be forged or strengthened.

Team Pond

One project promises to be particularly useful for graduating seniors. Pond is designed to help young college graduates adapt to the challenges of independent adult life. Team Pond configured the chatbot not to simply parrot the user, or to sycophantically praise wrong answers. Instead, Pond provides advice to help with “adulting” (behaving as a responsible adult). “Pond is built to be your companion from college life into post-college life, to help you in your transition from being a small fish in a small pond to being a small fish in a very big pond,” says sophomore Mary Feliz. “College is very much a high-proximity and high-context environment, in the sense that everybody around you is going through the same thing, and it’s easy to build relationships or find opportunities, because there are structured pathways that you have access to,” explains graduate student Emaan Khan. “Post-grad life is low-context. You’re not always surrounded by your peers or your professors. It’s no-proximity also, in the sense that you don’t have opportunities at your doorstep. Pond is a tool to help empower you to access certain opportunities, or learn how to navigate.” Pond’s developers are graduate student Claire Camacho, Harvard University graduate student Charlotte Cheah, Feliz, and Khan. They trained Pond to offer expertise in three areas: social life, professional life, and adult skills. The students demonstrated Pond’s social communication by asking how to start conversations in a hobby group. Next, they exhibited Pond’s adulting communication by typing in a question: “My landlord has asked me personal questions that make me uncomfortable. What should I do?” Appearing in Pond’s textbox was this response: “Politely set boundaries.
You are not obligated to answer personal questions. You can gently redirect the conversation back to property-related topics. Know your rights: familiarize yourself with tenant rights in your city or state regarding privacy and landlord conduct.” When asked how to do that, Pond offered specific instructions: “You can try saying phrases like: ‘I prefer to keep our conversations focused on the property,’ and ‘let’s stick to topics related to the apartment, please.’ Always redirect the conversation back to the apartment or a legitimate maintenance issue. Keep your tone polite but firm. Document any conversations if needed.” Pond also offered a role-playing scenario to help the user learn what polite-but-firm language might be in that situation. “The ethos of the practice mode is that you are actively building a skill, so that after using Pond for some time, you feel confident that you can swim on your own,” Khan says. The chatbot uses a point system that allows users to graduate from a topic, and a treasure chest to store prizes, elements added to boost the bot’s appeal.

Team News Nest

Another of the projects, News Nest, provides a sophisticated means of helping young people engage with credible news sources in a way that makes it fun. The name is derived from the program’s 10 appealing and colorful birds, each of which focuses on a particular area of news. If you want the headlines, you ask Polly the Parrot, the main news carrier; if you’re interested in science, Gaia the Goose guides you. The flock also includes Flynn the Falcon, sports reporter; Credo the Crow, for crime and legal news; Edwin the Eagle, a business and economics news guide; Pizzazz the Peacock for pop and entertainment stories; and Pixel the Pigeon, a technology news specialist. News Nest’s development team is made up of MIT seniors Tiana Jiang and Krystal Montgomery, and junior Natalie Tan. They intentionally built News Nest to prevent “doomscrolling,” to provide media transparency (sources and political leanings are always shown), and to create a clever, healthy buffer against emotional manipulation and engagement traps by employing birds rather than human characters.

Team M^3 (Multi-Agent Murder Mystery)

A third team, M^3, decided to experiment with making AI humane by keeping it fun. MIT senior Rodis Aguilar, junior David De La Torre, and second-year Deeraj Pothapragada developed M^3, a social deduction multi-agent murder mystery that incorporates four chatbots as different personalities: Gemini, OpenAI’s ChatGPT, xAI’s Grok, and Anthropic’s Claude. The user is the fifth player. Like a regular murder mystery, there are locations, weapons, and lies. The user has to guess who committed the murder. It’s very similar to a board or online game played with real players, only these are enhanced AI opponents you can’t see, who may or may not tell the truth in response to questions. Users can’t get too involved with one chatbot, because they’re playing all four. Also, as in a real-life murder mystery game, the user is sometimes guilty.
Generative AI improves a wireless vision system that sees through obstructions mit_news_ai 19.03.2026 04:00 0.676
Embedding sim.0.7983
Entity overlap0.0714
Title sim.0.0676
Time proximity0.8245
NLP типscientific_publication
NLP организацияMassachusetts Institute of Technology
NLP темаgenerative ai
NLP странаUnited States

Открыть оригинал

MIT researchers have spent more than a decade studying techniques that enable robots to find and manipulate hidden objects by “seeing” through obstacles. Their methods utilize surface-penetrating wireless signals that reflect off concealed items. Now, the researchers are leveraging generative artificial intelligence models to overcome a longstanding bottleneck that limited the precision of prior approaches. The result is a new method that produces more accurate shape reconstructions, which could improve a robot’s ability to reliably grasp and manipulate objects that are blocked from view. This new technique builds a partial reconstruction of a hidden object from reflected wireless signals and fills in the missing parts of its shape using a specially trained generative AI model. The researchers also introduced an expanded system that uses generative AI to accurately reconstruct an entire room, including all the furniture. The system utilizes wireless signals sent from one stationary radar, which reflect off humans moving in the space. This overcomes one key challenge of many existing methods, which require a wireless sensor to be mounted on a mobile robot to scan the environment. And unlike some popular camera-based techniques, their method preserves the privacy of people in the environment. These innovations could enable warehouse robots to verify packed items before shipping, eliminating waste from product returns. They could also allow smart home robots to understand someone’s location in a room, improving the safety and efficiency of human-robot interaction. “What we’ve done now is develop generative AI models that help us understand wireless reflections. This opens up a lot of interesting new applications, but technically it is also a qualitative leap in capabilities, from being able to fill in gaps we were not able to see before to being able to interpret reflections and reconstruct entire scenes,” says Fadel Adib, associate professor in the Department of Electrical Engineering and Computer Science, director of the Signal Kinetics group in the MIT Media Lab, and senior author of two papers on these techniques. “We are using AI to finally unlock wireless vision.” Adib is joined on the first paper by lead author and research assistant Laura Dodds; as well as research assistants Maisy Lam, Waleed Akbar, and Yibo Cheng; and on the second paper by lead author and former postdoc Kaichen Zhou; Dodds; and research assistant Sayed Saad Afzal. Both papers will be presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Surmounting specularity

The Adib Group previously demonstrated the use of millimeter wave (mmWave) signals to create accurate reconstructions of 3D objects that are hidden from view, like a lost wallet buried under a pile. These waves, which are the same type of signals used in Wi-Fi, can pass through common obstructions like drywall, plastic, and cardboard, and reflect off hidden objects. But mmWaves usually reflect in a specular manner, which means a wave reflects in a single direction after striking a surface. So large portions of the surface will reflect signals away from the mmWave sensor, making those areas effectively invisible. “When we want to reconstruct an object, we are only able to see the top surface and we can’t see any of the bottom or sides,” Dodds explains. The researchers previously used principles from physics to interpret reflected signals, but this limits the accuracy of the reconstructed 3D shape.
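To make the specularity bottleneck concrete, here is a small illustrative sketch in Python (a toy model written for this summary, not code from the papers): under a mirror-like reflection model, only points whose surface normals face back toward the sensor within a narrow angular window return a detectable signal, so most of an object's surface simply never reaches the radar.

```python
import numpy as np

def specular_visible_mask(normals, to_sensor, max_angle_deg=30.0):
    """Toy model of specular mmWave visibility.

    normals:   (N, 3) unit surface normals of an object's sampled points
    to_sensor: (N, 3) unit vectors from each point toward the radar
    Returns a boolean mask of points whose specular reflection travels
    back toward the sensor; everything else is effectively invisible.
    """
    cos_thresh = np.cos(np.deg2rad(max_angle_deg))
    alignment = np.sum(normals * to_sensor, axis=1)  # cosine of incidence angle
    return alignment > cos_thresh

# Toy example: a unit sphere observed from directly overhead mostly
# shows its top cap; the sides and bottom reflect the signal away.
rng = np.random.default_rng(0)
pts = rng.normal(size=(2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # points on a unit sphere
normals = pts                                       # sphere normals point outward
to_sensor = np.tile([0.0, 0.0, 1.0], (2000, 1))     # far-field radar at +z
visible = specular_visible_mask(normals, to_sensor)
print(f"visible fraction: {visible.mean():.2f}")    # ~0.07, most points hidden
```

The sparse partial views such a mask produces are the kind of input that, per the article, the generative completion step described next has to repair.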
In the new papers, they overcame that limitation by using a generative AI model to fill in parts that are missing from a partial reconstruction. “But the challenge then becomes: How do you train these models to fill in these gaps?” Adib says. Usually, researchers use extremely large datasets to train a generative AI model, which is one reason models like Claude and Llama exhibit such impressive performance. But no mmWave datasets are large enough for training. Instead, the researchers adapted the images in large computer vision datasets to mimic the properties of mmWave reflections. “We were simulating the property of specularity and the noise we get from these reflections so we can apply existing datasets to our domain. It would have taken years for us to collect enough new data to do this,” Lam says. The researchers embed the physics of mmWave reflections directly into these adapted data, creating a synthetic dataset they use to teach a generative AI model to perform plausible shape reconstructions. The complete system, called Wave-Former, proposes a set of potential object surfaces based on mmWave reflections, feeds them to the generative AI model to complete the shape, and then refines the surfaces until it achieves a full reconstruction. Wave-Former was able to generate faithful reconstructions of about 70 everyday objects, such as cans, boxes, utensils, and fruit, boosting accuracy by nearly 20 percent over state-of-the-art baselines. The objects were hidden behind or under cardboard, wood, drywall, plastic, and fabric.

Seeing “ghosts”

The team used this same approach to build an expanded system that fully reconstructs entire indoor scenes by leveraging mmWave reflections off humans moving in a room. Human motion generates multipath reflections. Some mmWaves reflect off the human, then reflect again off a wall or object, and then arrive back at the sensor, Dodds explains. These secondary reflections create so-called “ghost signals,” which are reflected copies of the original signal that change location as a human moves. These ghost signals are usually discarded as noise, but they also hold information about the layout of the room. “By analyzing how these reflections change over time, we can start to get a coarse understanding of the environment around us. But trying to directly interpret these signals is going to be limited in accuracy and resolution,” Dodds says. They used a similar training method to teach a generative AI model to interpret those coarse scene reconstructions and understand the behavior of multipath mmWave reflections. This model fills in the gaps, refining the initial reconstruction until it completes the scene. They tested their scene reconstruction system, called RISE, using more than 100 human trajectories captured by a single mmWave radar. On average, RISE generated reconstructions that were about twice as precise as those of existing techniques. In the future, the researchers want to improve the granularity and detail in their reconstructions. They also want to build large foundation models for wireless signals, analogous to the foundation models GPT, Claude, and Gemini for language and vision, which could open new applications. This work is supported, in part, by the National Science Foundation (NSF), the MIT Media Lab, and Amazon.
Palantir trial plugs into UK financial watchdog's data trove the_register_ai 23.03.2026 13:45 0.674
Embedding sim.0.8063
Entity overlap0.1212
Title sim.0.1573
Time proximity0.5556
NLP типpartnership
NLP организацияPalantir
NLP темаai adoption
NLP странаUnited Kingdom

Открыть оригинал

Palantir trial plugs into UK financial watchdog's data trove

US analytics firm handed access to sensitive intel, raising yet more questions about vendor lock-in

Carly Page, Mon 23 Mar 2026 // 13:45 UTC

US data miner Palantir has quietly landed inside the UK's financial watchdog, plugging into a trove of sensitive data as Whitehall simultaneously insists it wants to wean itself off exactly this kind of dependency. The Financial Conduct Authority (FCA) has handed the American analytics biz a three-month trial contract worth more than £30,000 a week to analyze its internal "data lake," a sprawling repository of regulatory intelligence covering fraud, money laundering, insider trading, and consumer complaints. According to The Guardian, which first reported on the deal, Palantir will gain access to data including case files, reports from banks and crypto firms, and even communications data such as emails, phone records, and social media material tied to investigations. The idea, at least on paper, is straightforward: use Palantir's software to help sift signal from noise across the roughly 42,000 businesses the FCA oversees, and spot patterns of financial crime faster than human analysts can manage alone. If this sounds familiar, that's because it is. Palantir has spent the past few years embedding itself across the British state – from the NHS to policing and defense – racking up more than £500 million in public sector contracts in the process. Critics have long described this as a classic "land and expand" strategy: start with a narrowly scoped deployment, prove value, then become very hard to remove. The FCA deal, which appears to follow the same pattern, arrives just days after the government signaled that it wants to rethink how it buys technology, amid concerns about overreliance on a small number of large vendors and the need for more "sovereign" capability. Yet here is another sensitive system being handed, at least temporarily, to a US company whose entire business is built on ingesting and analyzing other people's data. The FCA, for its part, has stressed that Palantir is acting strictly as a "data processor," that all data remains hosted in the UK, and that the company cannot use the information to train its own models. "Effective use of technology is vital in the fight against financial crime and helps us identify risks to the consumers we serve and markets we oversee," an FCA spokesperson told The Register. "We ran a competitive procurement process and have strict controls in place to ensure data is protected." Those assurances mirror language used in earlier public sector deals, particularly in the NHS, where officials have repeatedly argued that contractual controls and technical safeguards govern use. Whether that is enough to calm critics is another matter. A spokesperson at Palantir told us: "As with all of our UK work including with the NHS to increase the number of operations and with the police forces to tackle domestic violence, data cannot be commercialised in any way. The software can only be used - legally and contractually - to process that data in strict accordance with the instructions of the customer." There's also the small matter of optics.
Palantir's track record – spanning US defense, intelligence, and immigration enforcement – has made it a lightning rod for concerns about surveillance and civil liberties, especially when deployed in civilian contexts. Still, for regulators under pressure to do more with less, the appeal is clear. The FCA is sitting on vast amounts of data, much of it underused, and AI vendors are lining up to promise that they can turn it into actionable intelligence. Whether that promise outweighs the risks of handing the keys – even temporarily – to a company that has made a habit of sticking around is a question the UK keeps asking, and so far, keeps answering the same way. ®
MIT and Hasso Plattner Institute establish collaborative hub for AI and creativity mit_news_ai 20.03.2026 12:45 0.673
Embedding sim.0.809
Entity overlap0.0789
Title sim.0.1053
Time proximity0.6181
NLP типpartnership
NLP организацияMIT Morningside Academy for Design
NLP темаartificial intelligence
NLP странаUnited States

Открыть оригинал

The following is a joint announcement from the MIT School of Architecture and Planning, MIT Schwarzman College of Computing, Hasso Plattner Institute, and Hasso Plattner Foundation. The MIT Morningside Academy for Design (MAD), MIT Schwarzman College of Computing, Hasso Plattner Institute (HPI), and Hasso Plattner Foundation celebrated the launch of the MIT and HPI AI and Creativity Hub (MHACH) at a signing ceremony this week. This 10-year initiative aims to deepen ties between computing and design as advances in artificial intelligence are reshaping how ideas are conceived and shared. Funded by the Hasso Plattner Foundation, MIT and HPI will work together to foster collaborative interdisciplinary research and support a portfolio of educational programs, fellowships, and faculty engagement focused on AI and creativity, expanding scholarly inquiry into AI applications across disciplines, industries, and societal challenges. The collaboration begins with an inaugural two-day workshop March 19-20 at MIT, bringing together faculty, students, and researchers to set early priorities. “As we hear from our faculty, as the Information Age gives way to an era of imagination, we expect a new emphasis on human creativity,” reflects MIT President Sally Kornbluth. “Through this collaboration, MIT and HPI are creating a shared space where students and faculty will come together across disciplines to explore new ideas, experiment with emerging tools, and invent new frontiers at the intersection of human creativity and AI.” “The best minds need the right environment to do their most creative work,” says Rouven Westphal of the Hasso Plattner Foundation. “When HPI and MIT come together across disciplines and borders, they create exactly that. The Hasso Plattner Foundation is committed to supporting this collaboration for the long term, building on Hasso Plattner’s vision of uniting technological excellence with human-centered design and creativity.”

Deepening collaboration at the intersection of technology, creativity, and societal impact

Building on the success of the Hasso Plattner Institute-MIT Research Program on Designing for Sustainability, established in 2022 between MIT MAD and HPI, the new MHACH hub represents a commitment to deepen collaboration at the intersection of technology, creativity, and societal impact. “MIT and HPI share a common commitment to turning scientific excellence into real-world impact. Through this collaboration, we will create an environment where students and researchers from both sides of the Atlantic can work together, experiment across disciplines, and learn from one another — at a time when artificial intelligence is set to profoundly shape our lives. We are convinced that this collaboration will generate ideas with impact far beyond both institutions and inspire international cooperation and innovation,” says Professor Tobias Friedrich, dean and managing director of the Hasso Plattner Institute. “HPI and MIT exist at the nexus of technology and creativity. Expanding this dynamic relationship will generate new paths for the infusion of AI, design, and creativity, enabling students, faculty, and researchers to dream and discover novel solutions, moving more quickly than ever from idea to implementation.
MAD was established to connect thinkers across and beyond the Institute, and this new era of collaboration with HPI advances that mission on a global scale,” comments Hashim Sarkis, dean of the MIT School of Architecture and Planning and the Elizabeth and James Killian (1926) Professor. Academic leadership from MIT and HPI will jointly shape the hub’s research and teaching agenda. Based in Potsdam, Germany, HPI is a center of excellence for digital engineering advancing research, education, and societal transfer in IT systems engineering, data engineering, cybersecurity, entrepreneurship, and digital health. Through its globally recognized HPI d-school and pioneering work in design thinking methodology, HPI brings a distinctive perspective on human-centered innovation to the collaboration, alongside a strong record in AI and data science research and technology transfer.

Expanding research and education on AI and creativity

The efforts of this multifaceted initiative are intended to foster a dynamic academic community spanning MIT and HPI, anchored by Hasso Plattner–named professorships and graduate fellowships whose recipients will be actively engaged in the hub. The long-term framework is designed to provide continuity for faculty appointments, doctoral training, and cross-campus research. The agreement also includes the development of classes and educational programs in areas of shared AI focus, along with expanded experiential opportunities through AI-focused workshops, hackathons, and summer exchanges. A steering committee composed of representatives from the MIT School of Architecture and Planning, MIT Schwarzman College of Computing, and Hasso Plattner Institute will facilitate the shared governance of MHACH. “Creativity has always been about extending human capability. At its core, this collaboration asks what it truly means to create something new. The question isn’t whether AI diminishes creativity, but how new forms of intelligence can deepen and enrich that process. Our goal is to explore that intersection with rigor and build a cross-disciplinary scholarly and research community that shapes how AI supports the creation of new ideas and knowledge,” says Dan Huttenlocher, dean of the MIT Schwarzman College of Computing and the Henry Ellis Warren (1894) Professor of Electrical Engineering and Computer Science. This collaboration is made possible by the Hasso Plattner Foundation’s long-term philanthropic commitment to institutions that connect technological innovation with design thinking and education. The Hasso Plattner Foundation has played a central role in establishing and supporting institutions such as the Hasso Plattner Institute and international design thinking programs that bridge disciplines and geographies.
Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations huggingface 05.03.2026 14:16 0.673
Embedding sim.0.7708
Entity overlap0
Title sim.0.1429
Time proximity0.9686
NLP типother
NLP организацияnxp
NLP темаrobotics
NLP страна

Открыть оригинал

Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations

Enterprise Article, published March 5, 2026. Gaetan Bahl, Enzo Ruedas, Tess Boivin (NXP). Authors: Enzo Ruedas, Tess Boivin.

Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge due to tight constraints on compute, memory, and power, as well as real-time control requirements. In synchronous control pipelines, the arm sits idle awaiting commands while the VLA runs inference, leading to oscillatory behavior and delayed corrections. To tackle that, asynchronous inference can enable smooth and continuous motion by dissociating generation from execution. However, to be effective, the end-to-end inference latency must remain shorter than the action execution duration. This temporal constraint therefore sets a lower limit on the model's throughput (equivalently, an upper limit on its latency). Bringing VLA models to embedded platforms is not merely a matter of model compression, but a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical, deployable embedded robotic systems. This guide presents NXP’s hands‑on best practices for recording reliable robotic datasets and fine‑tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX 95 SoC achieves after optimization.

🎥 Dataset Recording: What Actually Matters

High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas. In our case, we recorded a dataset for the task: "Put the tea bag in the mug."

1) Consistency First

- Fixed cameras: Use rigid mounts to avoid pose drift. If one or more cameras shift during recording or evaluation, because of the robot's vibrations or the operator resetting the environment, you can observe a severe accuracy loss.
- Controlled lighting: Set up your environment where you have as much control as possible over lighting (fixed light sources, away from sunlight, which varies during the day).
- Strong contrast: Avoid training with “white on white” unless that’s your deployment domain. Maximize contrast between the arm, the object, and the environment.
- Fixed calibration: Keep backups of your robot and teleoperator calibrations so you don't have to re-record your previous episodes if the code crashes.
- Do not cheat: Do not use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene.
However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.

2) Use a Gripper Camera (Highly Recommended)

Moving from scene‑only views to mixed viewpoints increases global accuracy, but every additional camera adds latency, so you must choose the right compromise. In our case that balance was reached with 3 cameras:

- Top: the global view of the whole scene.
- Gripper: the closest view, for precise grasps and alignment.
- Left: complements the top view for height and depth.

We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a close, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot’s perception rather than observing the scene directly. When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.

3) Improve Prehension

Simple hardware tweaks like heat‑shrink tubing over gripper claws increase friction, reduce roughness and slippage during episodes, and increase the task success rate (fewer “almost success” episodes), improving policy learning stability.

4) Diversity & Splits

When recording a dataset, you should:

- Vary the episode distribution: Divide your workspace into starting-position clusters and record at least 10 episodes per cluster. Add diversity by changing the object's position and rotation. E.g., we partitioned the robot arm’s reachable workspace into 11 clusters, each measuring 10 × 10 cm.
- Separate training and validation sets: Policies can easily overfit the training set, so make sure the validation set is unseen by the model. E.g., we removed cluster 6 from the training set.
- Record as many distinct movements as you can: Small VLA models exhibit limited generalization to unseen motion, so record episodes that cover the widest range of degrees of freedom. E.g., we grasped the tea bag in either a horizontal or a vertical position.
- Anticipate failure: Sometimes the policy will not reach the object on the first try and will have to "go back to it." We noticed that having about 20% of all episodes cover this going-back-to-the-object case helps the model improve overall success rate. E.g., around 20% of our training set corresponds to recovery episodes.

This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster (figure: "Starting position 1," "Starting position 2," "Recovery episode"). Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in "starting mode" but is instead already near the mug, and should proceed directly to retrieve the tea bag from that location.

🎛️ Fine‑Tuning VLAs

What we did in practice:

- Tasks: "Grab the tea bag and place it in the mug."
- Dataset: 120 episodes: 10 clusters × (10 different tea bag starting positions + 2 recovery episodes).
- Cameras: 3 cameras (640×480 px, 30 fps): Top, Gripper, Left.
- Validation: cluster n°6 was removed for validation.
- Batch size: 8.
- Training: the model checkpoint with the lowest validation loss after 200k steps was chosen.

For ACT (100 actions per chunk), the range providing the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was found within 100k–160k training steps. For SmolVLA training (50 actions per chunk), the trade‑off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy. Rule of thumb: choose the final checkpoint by evaluating success on both the training and validation sets, not by training loss.

⚡ Optimizing for the NXP i.MX 95 Applications Processor

The i.MX 95 SoC integrates 6× Arm Cortex‑A55 cores, a Cortex‑M7 and a Cortex‑M33 MCU, a Mali GPU, a new NXP ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi‑camera support and strong I/O. [nxp.com]

1) Divide And Conquer

Instead of running the models as one monolithic graph, we decompose the VLA graph into logical stages: encoders, decoders, and action experts. This allows each component to be optimized, scheduled, and deployed independently. In practice, SmolVLA is partitioned into the following sub-blocks:

- Vision: processes RGB camera frames and produces visual embeddings.
- LLM backbone: generates action tokens from visual and textual embeddings.
- Action expert: applies flow matching to iteratively denoise action samples and outputs final control commands.

This separation allows per-block optimizations. The impact of quantizing each block can be measured to choose the best trade-off between latency and accuracy. Isolating the action expert from the VLM also made it possible to run it at a lower frequency.

2) Quantization

To optimize inference for the i.MX 95 SoC, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degrades performance. This behavior is expected, as quantization errors accumulate across iterative denoising steps. That is why we decided to keep this block at higher precision to preserve stability, while on the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers. In addition, we applied in-house optimizations to the different blocks. Results are shown in the table below, referred to as optimized models.

3) Asynchronous Inference: Control-Aware Scheduling

In a synchronous control loop, the pipeline operates as:

1. Capture observation
2. Run full model inference
3. Execute generated action

During step (2), the robot remains idle. If inference latency is non-negligible, this produces:

- Idle gaps in motion
- Oscillatory corrections due to stale observations
- Reduced effective control frequency
- Poor recovery behavior

With asynchronous inference, action generation runs in parallel with execution: the robot executes the current action chunk while the next chunk is computed simultaneously. This increases effective control frequency, reduces observation staleness, and improves recovery behavior, as the sketch below illustrates.
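To make the scheduling concrete, here is a minimal Python sketch of a control-aware asynchronous loop. It is an illustration of the idea rather than NXP's implementation; the policy, get_observation, and send_action callables are hypothetical stand-ins. A background thread refills the action queue once it drops below a threshold fraction of the chunk size, mirroring the chunk-size threshold and aggregation settings in the table below.

```python
import threading
import queue
import time

class AsyncChunkController:
    """Illustrative control-aware asynchronous inference loop.

    The robot drains the current action chunk at a fixed control rate
    while the next chunk is computed on a background thread. Inference
    is triggered early, when the queue falls below a threshold fraction
    of the chunk size, so that it finishes before the queue runs dry
    (the "inference time < execution time" budget).
    """

    def __init__(self, policy, chunk_size=100, fps=60, threshold=0.2):
        self.policy = policy        # callable: observation -> list of actions
        self.chunk_size = chunk_size
        self.period = 1.0 / fps
        self.threshold = threshold  # refill once queue < 20% of a chunk
        self.actions = queue.Queue()
        self.worker = None

    def _infer(self, observation):
        # A real system would blend the overlap between the outgoing and
        # incoming chunks (e.g. a weighted average); here we just append.
        for action in self.policy(observation):
            self.actions.put(action)

    def run(self, get_observation, send_action, steps=600):
        next_tick = time.monotonic()
        for _ in range(steps):
            idle = self.worker is None or not self.worker.is_alive()
            if idle and self.actions.qsize() < self.threshold * self.chunk_size:
                self.worker = threading.Thread(
                    target=self._infer, args=(get_observation(),), daemon=True)
                self.worker.start()
            try:
                send_action(self.actions.get_nowait())
            except queue.Empty:
                pass  # latency budget violated: an idle control tick
            next_tick += self.period
            time.sleep(max(0.0, next_tick - time.monotonic()))

# Toy usage: a fake policy with 0.3 s latency producing 100-action chunks.
def fake_policy(observation):
    time.sleep(0.3)          # pretend VLA inference time
    return list(range(100))  # one chunk of placeholder actions

controller = AsyncChunkController(fake_policy)
controller.run(get_observation=lambda: None, send_action=lambda a: None)
```

With these toy numbers, 0.3 s of inference has to fit inside the roughly 0.33 s of actions left in the queue when the refill triggers (20 actions at 60 FPS), which is exactly the latency budget the article describes.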
On embedded platforms such as the i.MX 95 SoC, asynchronous inference is essential — but only effective if inference latency is kept under the action horizon budget: inference time < execution time.

Inference settings:

Setting               Synchronous inference   Asynchronous inference
Actions per chunk     100                     100
FPS                   60                      60
Chunk size threshold  N/A                     0.2
Aggregate function    N/A                     weighted_average

(Figure: action queue evolution.)

📊 What We Achieve on i.MX 95 Applications Processor

Setup:

- Tasks: "Grab the tea bag and place it in the mug."
- Test set (20 episodes): 2 random positions for each cluster.
- Validation set (10 episodes): all 10 positions in cluster n°6.

Results:

Platform (CPU)  Policy   Format     Inference latency  Accuracy, test set (20)  Accuracy, validation set (10)  Global accuracy (30)
i.MX 95         ACT      ONNX FP32  2.86 s             1.00                     0.90                           0.96
i.MX 95         ACT      Optimized  0.32 s             1.00                     0.60                           0.89
i.MX 95         SmolVLA  ONNX FP32  29.1 s             0.50                     0.40                           0.47

⏩ Next Steps

Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s. The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from a single-task setup toward longer-horizon and more complex scenarios. To do that, we will introduce:

- Simulation environments for scalable data generation and benchmarking
- Reinforcement Learning (RL) for policy refinement
- Sim-to-Real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.

✅ Checklists You Can Reuse

Recording:
- Fixed mounts verified
- Camera focus and illumination good
- Gripper claw prehension good
- Calibration file backups saved
- Contrast validated

Training:
- Save/eval checkpoints every 20k steps
- Save your training parameters too, so you can resume training if needed
- Prepare your validation set and your tracking method for accuracy and latency in advance

Deployment on i.MX 95 SoC:
- You are satisfied with your accuracy
- Contact us to have your model optimized

📚 Resources & Inspiration

- ACT documentation & paper (core idea, action chunking, low‑demo success). [huggingface.co], [arxiv.org]
- SmolVLM/SmolVLA family & repos (compact multimodal + VLA design). [huggingface.co], [github.com], [smolvla.net]
- Sherry Chen’s HF blog on training ACT on SO‑101 (practical lessons, pitfalls, fixes). [huggingface.co]

Mentioned models: HuggingFaceTB/SmolVLM-Base (Image-Text-to-Text, 2B).
Meet the Tech Reporters Using AI to Help Write and Edit Their Stories wired 26.03.2026 18:00 0.672
Embedding sim.0.7817
Entity overlap0.0811
Title sim.0.1481
Time proximity0.8036
NLP типother
NLP организацияAnthropic
NLP темаgenerative ai
NLP страна

Открыть оригинал

Maxwell Zeff, Business, Mar 26, 2026, 2:00 PM

Meet the Tech Reporters Using AI to Help Write and Edit Their Stories

Independent writers are using AI agents all throughout their reporting process. What’s the value of a human journalist, anyway? Photo-Illustration: WIRED staff; Getty Images

When technology reporter Alex Heath has a scoop, he sits down at his computer and speaks into a microphone. He’s not talking to a human colleague—Heath went independent on Substack last year—he’s talking to Claude. Using the AI-powered voice-to-text service Wispr Flow, Heath transmits his ideas to an AI agent, then lets it write his first draft. Heath sat down with me last week to showcase how he’s integrated Anthropic’s Claude Cowork into his journalistic process. The AI tool is connected to his Gmail, Google Calendar, Granola AI transcription service, and Notion notes. He’s also built a detailed skill—a custom set of instructions—to help Claude write in his style, including the “10 commandments” of writing like Alex Heath. The skill includes previous articles he’s written, instructions on how he likes his newsletters to be structured, and notes on his voice and writing style. Claude Cowork then automates the drafting process that used to take place in Heath’s head. After the agent finishes its first draft, Heath goes back and forth with it for up to 30 minutes, suggesting revisions. It’s quite an involved process, and he still writes some parts of the story himself. But Heath says this workflow saves him hours every week, and he now spends 30 to 40 percent less time writing. “I’ve always hated the zero-to-one process of writing a story … Now, it’s actually kind of fun,” he says. “Going out on my own, I realized I need AI to help with the volume.” Heath is part of a growing contingent of tech reporters using AI to help write and edit their stories. The AI workflow is especially enticing for reporters who have gone independent, losing valuable resources like editors and fact-checkers that typically come with a traditional newsroom. Rather than just prompting ChatGPT to write stories, independent journalists say they are re-creating these resources with AI. Their usage raises broader questions about the value of human journalists altogether. If people are using AI to write, edit, and fact-check their stories—what do humans bring to the table? A recent study from Google DeepMind researchers suggests that using AI in a lazy way can make your writing more homogeneous. It’s less creative, it has less voice, and it takes on a more neutral stance. To use AI well, journalists I spoke to say they need to understand why people are paying for their work in the first place. (WIRED’s policy prohibits the use of AI in writing or editing.) While some writers have built a career on their analysis and prose, Heath sees his value as his ability to get scoops. Claude makes it easier for him to spend more time chatting with sources and getting information out to his subscribers. Several longtime journalists remarked to me that Heath’s workflow feels like a modern version of a long-standing institution: the rewrite desk. In the days before laptops and smartphones, reporters in the field would call in stories to a newsroom, where writers behind a desk would quickly weave those reported details into articles they could print for the next day’s paper. This allowed some reporters to spend their days covering events and talking to sources. In a way, Claude is now Heath’s rewrite desk.
“I feel like I’m cheating in a way that feels amazing,” says Heath. “I never did this because I liked being a writer. I like reporting, learning new things, having an edge, and telling people things that will make them feel smart six months from now.” Jasmine Sun, who previously worked as a product manager at Substack, recently launched her own newsletter covering AI and Silicon Valley culture. Last week, she published an article in The Atlantic about how post-training makes AI models bad at writing by essentially beating out their creativity. Because of that, Sun never uses AI to write, but she has found promise using Claude as an editor. Like Heath, Sun has fed Claude past articles she’s written and notes on her style. But she’s also instructed Claude to focus only on enhancing and developing her voice and taste, and never to be sycophantic. She tells Claude it “should never write a sentence for her. Your goal is to elicit out of Jasmine by providing feedback.” Here’s part of the instructions Sun has shared with her Claude editor: “You are not a co-writer. You cannot perceive—you don’t have experiences, sources, scenes, or emotions to draw from. Your role is to help Jasmine write like the best version of herself—not just who she is on the page now, but who she’s trying to become as a writer. That means understanding both her current voice *and* her aspirations, including the writers and qualities she’s reaching toward.” I asked Sun if she ever feels the urge to be lazy and just let Claude write for her. “I think [Claude] forces me to work harder than I would otherwise,” said Sun. “With a human editor, they’re calling you on your bullshit. They’re not letting you get away with lazy reporting or floppy prose.” After speaking publicly about her use of Claude, Sun received criticism from people who were offended by the notion that AI could replace a human editor. Critics argued that AI can’t transform your ideas or challenge you as much as a human. Sun says she found the comments confusing. Most Substackers can’t afford to hire a human editor, so by adding Claude and instructing it to challenge her, Sun argues, it’s made her process more rigorous. “To me, it is kind of like using Grammarly. You have a tool that says, ‘this sentence is bad,’ and then I have to go fix it,” she says. “[Claude is] more willing to tell me this entire section is bad and you should cut it. It’s like a higher level of thinking and abstraction than Grammarly can do.” Casey Newton, author of the newsletter Platformer, tells me AI has made him reassess the value of his publication. “I think there’s an interesting distinction here. If the value is in the information, not the writing, then I think people will care less that AI did most of the writing,” says Newton. “If the value is in voice and opinion and argument and analysis, it seems like it’s cheap to use AI to do the whole thing.” In recent years, Newton has focused on news analysis. But as AI improves, he says he’s changing his approach. “I actually need to shift the balance,” he tells me. “I need to do less news analysis and more original reporting.” Newton is not using AI to write Platformer today but says he was inspired by Sun’s AI editor and has since tried to re-create it with a Claude agent based on his own articles. “I’ve just been really struck by it. At its best, its feedback is about as good as the feedback I’ve gotten from human editors,” says Newton. Taylor Lorenz, author of the User Mag Substack, tells me she uses AI to help run her media business.
She has Gemini come up with SEO-friendly descriptions for YouTube videos and has Claude help sift through data. However, Lorenz says she’s not using AI to write or edit her articles today. She doesn’t trust AI systems with sensitive reporting materials and finds that AI just hasn’t been useful for writing and editing. She also just loves the craft of writing herself. “I am a journalist because I like to help people understand the world and bring light to different issues,” says Lorenz. “I don’t want the AI to do that.” Kevin Roose, a technology columnist with The New York Times, is using AI to help him produce a book about the race to build artificial intelligence. He claims AI tools have helped him shave two or three years off the process. Most recently, Roose tells me, he created a team of Claude agents to help edit his book, led by a “Master Editor” agent. Other sub-agents are in charge of things like fact-checking, making sure the book matches his writing style, and offering positive and negative feedback. (To be clear, he’s still working with human editors, too.) However, Roose has not handed over the writing of his book to AI just yet. Like Sun, Newton, and Lorenz, Roose feels that he’s still just better at writing than an AI model. “I think the models tend to be fairly generic and depersonalized, but also, I like doing this,” he tells me. Roose makes clear that he’s anything but a skeptic—he expects AI models to get better than him at everything eventually. But he is human, and for now that’s an edge. “I am not under some romantic illusion that I possess a special, irreplaceable perspective. But what I am is a person, and I think that for now people, at least some people, like hearing from people,” he says. This is an edition of Maxwell Zeff’s Model Behavior newsletter. Read previous newsletters here.
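For readers who want to experiment with the editor-not-writer pattern Sun and Newton describe, here is a minimal sketch using Anthropic's Python SDK. The system prompt paraphrases the idea rather than reproducing Sun's actual instructions, and the model id and file name are assumptions to replace with your own.

```python
import anthropic

# Style corpus: past articles the editor should learn the author's voice
# from (hypothetical file name; supply your own writing samples).
with open("past_articles.txt") as f:
    style_corpus = f.read()

EDITOR_SYSTEM_PROMPT = f"""You are an editor, not a co-writer.
Never write a sentence for the author, and never be sycophantic.
Call out weak arguments, lazy reporting, and flabby prose, and say
plainly when an entire section should be cut. Ground your feedback
in the author's own voice, as evidenced by these past articles:

{style_corpus}
"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def edit(draft: str) -> str:
    """Return editorial feedback on a draft without rewriting it."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id; pick a current one
        max_tokens=2000,
        system=EDITOR_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Edit this draft:\n\n{draft}"}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(edit("Paste a draft paragraph here..."))
```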
What is Advanced Machine Intelligence or AMI Labs? ai_supremacy 11.03.2026 10:35 0.672
Embedding sim.0.7999
Entity overlap0.0571
Title sim.0.1026
Time proximity0.7074
NLP типproduct_launch
NLP организацияAMI Labs
NLP темаartificial intelligence
NLP странаEurope

Открыть оригинал

Prospectus

What is Advanced Machine Intelligence or AMI Labs? The Biggest European AI Startup just came out of Stealth.

Michael Spencer, Mar 11, 2026 ∙ Paid. Image Credit: Wired.

Thank you for your support on my last article: The U.S. Labor Market and its AI problem. I’ve been writing about AI for well over four years, well before the latest Generative AI craze. A special thanks to Michael Burry, who shared the piece. It made my week. Good Morning, While I am a contrarian and realist in my coverage of AI as my baseline, there are times when my own exuberance and optimism for a better path in AI research shines through. An alternative to the deceptive AGI marketing (by Sam Altman) is one of those moments. To be sure, Generative AI has ushered in an era in which artificial intelligence has made substantial progress over the past decade (2017 to 2022 being a fairly nascent period). Predictive and generative systems have transformed how we analyze, retrieve knowledge, and create content at global scale. But this isn’t enough. The Machine Economy ahead will demand more. Like many of you, I’ve been following the criticism of LLMs by the likes of Yann LeCun, Gary Marcus, and many others over the last few years. Yann LeCun, a Turing Award winner and a pioneer of modern AI, has become one of the most prominent critics of the current "LLM-centric" path, and his alternative is, to me, fascinating. Instead of AGI they propose Superhuman Adaptable Intelligence, or SAI. “SAI points toward self-supervised learning for acquiring generic knowledge from unlabeled data, and world models for planning and zero-shot transfer.” - Ravid Shwartz Ziv (Read the Paper) The reality is we need a second wave of more practical AI startups focused on Physical AI, world models, robotics, and real automation, and that’s exactly what we are seeing with Prometheus Project, Core Automation, World Labs, and many others. AGI startups, world-model-centric AI startups, Physical AI & robotics startups, and physical-science AI startups can all push the frontiers. (Image: Yann LeCun, one of the co-founders of AMI Labs.) In this article I explore what the Machine Intelligence Economy will require and, in particular, try to chart the origins, characteristics, and announcement of Europe’s own AMI. I argue that we are moving into a new era of Physical AI, specialized AI startups, robotic brains, and world-model origin labs. I consider 2027 the nascent start of the Physical AI era. “AI agents need world models that allow them to predict the consequences of their actions before they take them. This is key to enabling agents that can plan, remember, and reason about complex observations.” - Michael Rabbat In 2026 there is now a second wave of AI startups offering directions complementary to the LLM approach.

Europe’s Biggest AI Startup Came out of Stealth
Introducing ChatGPT for Excel and new financial data integrations openai 05.03.2026 00:00 0.671
Embedding sim.0.7864
Entity overlap0.1
Title sim.0.0659
Time proximity0.8571
NLP типproduct_launch
NLP организацияOpenAI
NLP темаgenerative ai
NLP страна

Открыть оригинал

OpenAI introduces ChatGPT for Excel and new financial app integrations, powered by GPT-5.4 to accelerate modeling, research, and analysis in regulated environments.
FMI Expands Consulting Capabilities with Launch of Energy & Power Team and Addition of Senior Leaders prnewswire 31.03.2026 19:00 0.671
Embedding sim.0.7593
Entity overlap0
Title sim.0.1911
Time proximity0.9763
NLP типother
NLP организацияFMI Corporation
NLP тема
NLP странаUnited States

Открыть оригинал

FMI Expands Consulting Capabilities with Launch of Energy & Power Team and Addition of Senior Leaders

News provided by FMI Corporation, Mar 31, 2026, 15:00 ET

RALEIGH, N.C., March 31, 2026 /PRNewswire/ -- FMI Corporation, a leading provider of consulting and investment banking services to the built environment, today announced the expansion of its consulting practice into the energy and power sector with the addition of senior leaders Evan Fairmont and Blake Angelo. This move enhances FMI's ability to support clients amid accelerating demand for energy infrastructure, grid modernization and decarbonization. (Photos: Evan Fairmont and Blake Angelo. PRNewsfoto/FMI Corporation.) Fairmont and Angelo bring decades of combined experience advising utilities, renewable energy developers and infrastructure investors on strategy, large-scale capital programs and operational transformation. Together, they have supported the deployment of more than $6 billion in energy projects spanning renewables, thermal power and transmission and distribution systems. At FMI, they will lead the firm's Energy and Power consulting efforts, helping clients scale operations and deliver critical infrastructure faster in response to rising demand from data centers, hyperscalers and regulatory requirements. Their work will focus on improving "speed to power," enabling organizations to accelerate project delivery through enhanced operating models, project controls and cross-functional execution. The team will advise power and energy firms and investors on capital strategy, project lifecycle transformation, enterprise PMO development and operational efficiency, and support organizations navigating pre- and post-M&A transformations. "FMI has a long-standing reputation for driving excellence in the built environment, and I am honored to help spearhead this increased focus on energy, power and utilities," said Angelo. "Our goal is to help organizations accelerate delivery while building resilient, future-ready infrastructure." "I'm excited to join FMI and extend its expertise in capital strategy and performance improvement into the energy sector," said Fairmont. "As investment in power infrastructure accelerates, there is a growing need to better connect strategy with execution." "We are thrilled to welcome Blake and Evan to the FMI leadership team," said Mike Clancy, partner at FMI Consulting. "Their expertise will be instrumental as we help clients address the most pressing challenges in the evolving energy and power landscape." This expansion builds on FMI's existing energy sector experience within its investment banking practice and reflects the firm's continued commitment to supporting clients at the intersection of infrastructure, capital and innovation. About FMI: FMI is a leading provider of consulting and investment banking services to the built environment, delivering expertise in strategy, leadership and organizational development, operational performance, mergers and acquisitions, financial advisory and private equity. Media Contact: Katie DeRee, [email protected] SOURCE FMI Corporation
Last Week in AI #337 - Anthropic Risk, QuitGPT, ChatGPT 5.4 lastweekin_ai 09.03.2026 07:15 0.67
Embedding sim.0.8148
Entity overlap0.0556
Title sim.0.1463
Time proximity0.4836
NLP типregulation
NLP организацияAnthropic
NLP темаai regulation
NLP странаUnited States

Открыть оригинал

Last Week in AI #337 - Anthropic Risk, QuitGPT, ChatGPT 5.4

Anthropic officially told by DOD that it’s a supply chain risk, the ‘cancel ChatGPT’ trend is growing after OpenAI signs a deal with the US military, and more!

Last Week in AI, Mar 09, 2026

Note from Editor: apologies for missing a week with this newsletter. As I mentioned on the podcast, my startup Astrocade has recently raised our Series B, which has gotten me extra busy lately. I’ll do my best to keep the schedule consistent! PS: we are hiring for engineers, marketing, product, growth, and more! If you’re in the Bay Area, would like to join a small but growing startup, and think building a youtube-of-games sounds exciting, feel free to email me at andrey@astroblox.ai or message me on LinkedIn. Check out Astrocade!

Top News

Anthropic officially told by DOD that it’s a supply chain risk even as Claude used in Iran

Anthropic latest statement on the situation. Related: Anthropic Hits Back After US Military Labels It a ‘Supply Chain Risk’; Anthropic CEO: We’re trying to “deescalate” Pentagon AI standoff to reach “some agreement that works for us and works for them”; Where things stand with the Department of War; Anthropic CEO Dario Amodei calls OpenAI’s messaging around military deal ‘straight up lies,’ report says; ‘No ethics at all’: the ‘cancel ChatGPT’ trend is growing after OpenAI signs a deal with the US military; Microsoft, Google, Amazon say Anthropic Claude remains available to non-defense customers.

Summary: The Pentagon formally designated Anthropic a “supply chain risk,” ordering defense vendors and contractors to certify they are not using Claude in DoD work and triggering an immediate wind-down of Pentagon use of Anthropic’s models. The dispute centers on contract terms: Anthropic sought explicit red lines against mass domestic surveillance and fully autonomous weapons, while the DoD insisted on access for “all lawful purposes.” Anthropic says the designation under 10 USC 3252 is narrow, applies only to uses tied directly to DoD contracts, and violates the statute’s “least restrictive means” requirement; the company plans to challenge it in court. At the same time, multiple reports indicate Claude has supported U.S. military operations in Iran for intelligence analysis, modeling and simulation, operational planning, and cyber operations. After Defense Secretary Pete Hegseth suggested no military partner may conduct any commercial activity with Anthropic, confusion spread about the scope. Cloud providers disputed that interpretation; Microsoft, Google, and AWS clarified Claude remains available for non-defense workloads via M365, GitHub, AI Foundry, and Google Cloud. Meanwhile, OpenAI announced a DoD deal to deploy its models in classified environments, citing prohibitions on mass surveillance and autonomous weapon systems. Anthropic CEO Dario Amodei called OpenAI’s framing “straight up lies” and “safety theater,” arguing the agreement still hinges on “all lawful purposes.”

Editor’s Take: The designation of Anthropic as a supply chain risk could have meant a massive hit to its business, but as of now it appears that may not happen. In fact, Anthropic might on the whole have benefited from the events, given the massive rise in consumer awareness of Claude as an alternative to ChatGPT. Still, it appears likely that this story is not yet over.

‘No ethics at all’: the ‘cancel ChatGPT’ trend is growing after OpenAI signs a deal with the US military

OpenAI’s statement. Related: Anthropic’s Claude rises to No.
1 in the App Store following Pentagon dispute How OpenAI caved to the Pentagon on AI surveillance ‘QuitGPT’ protesters rally outside OpenAI HQ in San Francisco over deal with Pentagon OpenAI alters deal with Pentagon as critics sound alarm over surveillance OpenAI hardware exec Caitlin Kalinowski quits in response to Pentagon deal Summary : OpenAI signed an agreement with the Department of War after Anthropic walked away over, as covered above. OpenAI says its deal enforces “red lines” via legal references and technical guardrails, citing compliance with the Fourth Amendment, FISA, EO 12333, and DoD policy, plus cloud-only deployment, security clearances, and classifiers to monitor use. Critics counter that the terms still hinge on “any lawful use,” a standard that has historically enabled bulk data programs like PRISM and Verizon call record collection, and they note vague carve-outs such as prohibitions on “unconstrained” or “generalized” monitoring. OpenAI later posted revised language that “the AI system shall not be intentionally used for domestic surveillance of U.S. persons,” and said defense intelligence components (e.g., NSA) are excluded, but it has not released the full contract; legal experts warn loopholes remain and that policy reinterpretations can broaden surveillance. Source The fallout was swift: a “QuitGPT” boycott, in-person protests at OpenAI’s SF HQ with “No killer robots” messaging, and a 295% surge in ChatGPT uninstalls. Anthropic’s Claude climbed to No. 1 in the U.S. App Store, with daily sign-ups breaking records, free users up 60% since January, and paid subscribers more than doubling this year. Inside OpenAI, hardware lead Caitlin Kalinowski resigned, citing rushed governance and insufficiently defined guardrails on surveillance and lethal autonomy. Editor’s Take : Consumer boycotts like “QuitGPT” rarely lead to a significant lasting loss for the targeted company, but Claude has thus far been the underdog relative to ChatGPT with consumers, and the scale of the reaction does appear meaningful. In either case, the impact on Anthropic’s hiring power is likely even more significant given that how hotly fought for top tier AI talent is. OpenAI launches GPT-5.4 with Pro and Thinking versions Source Related: ‘We heard your feedback loud and clear’ — OpenAI introduces new ChatGPT 5.3 Instant to ‘reduce the cringe’ for all users Summary : OpenAI released GPT-5.4 in three variants—standard, Pro, and Thinking—positioning it as a more capable and efficient frontier model for professional work. The API now supports up to a 1 million-token context window, and improved token efficiency means it can solve the same tasks with fewer tokens than prior models. GPT-5.4 posted record results on OSWorld-Verified and WebArena Verified for computer use, scored 83% on OpenAI’s GDPval for knowledge work, and led Mercor’s APEX-Agents benchmark for law and finance tasks. OpenAI also introduced Tool Search for API tool calling, allowing the model to fetch tool definitions on demand to reduce token overhead and latency, and reported fewer factual errors versus GPT-5.2 (33% fewer erroneous claims; 18% fewer responses with errors). Separately, OpenAI upgraded ChatGPT’s default to GPT-5.3 Instant to “reduce the cringe,” tuning away unnecessary refusals and long safety disclaimers while keeping guardrails. 
GPT-5.3 Instant focuses on direct answers for reasonable requests and improves accuracy, with ~27% fewer hallucinations during online research and ~20% fewer without browsing, plus better blending of web data with prior knowledge. The change updates the baseline chat experience across common uses while retaining GPT-5.2 Instant for paid users as a legacy option.

Editor's Take: I discussed the accelerating rate of these '+0.1' model releases in the last couple of newsletters, and the same comments largely apply here. Relative to last year, 2026…

Other News

Tools

Luma launches creative AI agents powered by its new 'Unified Intelligence' models. The agents can plan, generate, and iteratively refine text, images, video, and audio for ad agencies and brands using a single multimodal Uni-1 model that maintains persistent context across assets and coordinates with other AI models.

Anthropic's Claude reports widespread outage. Thousands of users reported login and access failures affecting Claude.ai and Claude Code, while Anthropic said the API remained operational and engineers were implementing a fix.

Google makes Gmail, Drive, and Docs 'agent-ready' for OpenClaw. A new CLI offers developer-focused tools and documentation to simplify and standardize how AI agents like OpenClaw and MCP-compatible apps connect to Gmail, Drive, Docs, and other Workspace services.

Google reveals dev-focused Gemini 3.1 Flash Lite, promises 'best-in-class intelligence for your highest-volume workloads'. The model offers faster response and output generation while cutting per-token input/output costs versus Gemini 2.5 Flash, outperforms several rivals on multiple benchmarks, and is available to developers in preview via the Gemini API and Vertex AI.

Cursor is rolling out a new kind of agentic coding tool. The Automations feature lets engineers trigger and manage always-on agents from events like code changes, Slack messages, timers, PagerDuty incidents, and weekly summaries, so agents can run reviews, security audits, incident responses, and other recurring tasks without constant human initiation.

Business

Alibaba scrambles after sudden departure of Qwen tech lead. The sudden departure sparked urgent all-hands discussions about restructuring, resource allocation, and whether new hires like Hao Zhou will fill Lin's technical and leadership roles.

OpenAI raises $110B in one of the largest private funding rounds in history. The funding includes large compute and services commitments—notably AWS and Nvidia infrastructure partnerships that pledge gigawatts of training/inference capacity and substantial Bedrock integration—while the round remains open and some payments are contingent on future conditions like AGI or an IPO.

Cursor has reportedly surpassed $2B in annualized revenue. Enterprise customers now account for about 60% of revenue, helping offset individual-user defections and driving a recent doubling in the company's revenue run rate.

Research

A Very Big Video Reasoning Suite. VBVR-Dataset (1,007,500 clips) spans 200 tasks across five cognitive pillars, paired with a verifiable VBVR-Bench aligned with human judgments; a scaling study shows improved yet still sub-human video reasoning as model and data scale increase.

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization. The method jointly trains model parameters with on- and off-policy RL while updating a non-parametric memory during interaction, using memory-guided and memory-free rollouts to drive broader exploration and improve performance on embodied reasoning benchmarks.

HyperSteer: Activation Steering at Scale with Hypernetworks. Hypernetworks generate task-conditioned steering vectors for an instruction-tuned base LM, achieving better and more scalable activation steering than prior methods and matching steering-via-prompting on held-out prompts (a minimal sketch of the underlying steering primitive appears at the end of this issue).

Agentic Code Reasoning. Prompting LLM agents with semi-formal structured certificates—explicit premises, traced code paths, and per-test evidence—improves execution-free patch equivalence verification, code QA, and fault localization across diverse benchmarks without formal semantics or specialized training.

Beyond Language Modeling: An Exploration of Multimodal Pretraining. Training a single model from scratch with mixed text, image, and video data—using semantic representation autoencoders plus MoE—shows joint vision-language learning without fatal interference, reveals a vision data hunger and scaling asymmetry mitigated by sparsity, and demonstrates that video-rich multimodal pretraining yields emergent world-modeling and generation/understanding abilities.

Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA. Adding an STP auxiliary loss that minimizes components of hidden-state deviation perpendicular to inferred geodesic trajectories (the "semantic tube") improves signal-to-noise ratio, data efficiency, and diversity in autoregressive LLM training compared to next-token prediction alone.

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks. Massive outlier activations and attention sinks commonly co-occur in pretrained decoder-only Transformers due to specific normalization choices, attention-space dimensionality, and context-length training; each phenomenon can be independently mitigated without hurting language-modeling performance.

Progressive Residual Warmup for Language Model Pretraining. ProRes gradually scales each layer's residual from zero to one during training—delaying deeper layers' contributions so shallow layers stabilize first—improving perplexity and depth scaling across models from 71M to 7B parameters.

Reasoning Models Struggle to Control their Chains of Thought. CoT-Control, a 14,076-instance benchmark, evaluates how well models follow explicit instructions to modify their chains of thought; most reasoning models struggle with controllability, which varies with model size, reasoning effort, training, and situational-awareness prompting.

Analysis

Anthropic reports on trends in occupations susceptible to AI and whether there are changes in the labor market. Anthropic's analysis introduces an "observed exposure" metric showing which occupations (notably programmers, customer service reps, and financial analysts) have tasks both theoretically automatable by AI and already automated in practice, finds higher exposure among better-educated, higher-income workers, and reports little current impact on unemployment overall but a small decline in employment among 22–25-year-olds in high-exposure roles.
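The HyperSteer brief above builds on a basic activation-steering primitive: add a vector to a layer's hidden states at inference time. HyperSteer's contribution is generating that vector with a hypernetwork conditioned on the task; the toy PyTorch sketch below only shows the primitive itself, using GPT-2 as a stand-in base model and a random placeholder vector where the hypernetwork output would go. All names here are illustrative assumptions, not HyperSteer's code.

```python
# Minimal activation-steering sketch (illustrative; not HyperSteer's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in base LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

block = model.transformer.h[6]                       # mid-depth block to steer
steer = torch.randn(model.config.n_embd) * 0.05      # placeholder vector; a
                                                     # hypernetwork would produce
                                                     # this from the task prompt

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are element 0.
    return (output[0] + steer,) + output[1:]         # broadcasts over batch/positions

handle = block.register_forward_hook(add_steering)
ids = tok("The movie was", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()                                      # restore the unsteered model
```

Steering vectors trade precision for cost: no gradient updates or extra prompt tokens are needed, which is what makes generating them at scale with a hypernetwork attractive.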
LWiAI Podcast #237 - Nemotron 3 Super, xAI reborn, Anthropic Lawsuit, Research! lastweekin_ai 16.03.2026 06:06 0.669
Embedding sim. 0.7465
Entity overlap 0.1299
Title sim. 0.1964
Time proximity 0.9892
NLP type other
NLP organization Anthropic
NLP topic large language models
NLP country China

Open original

Podcast
LWiAI Podcast #237 - Nemotron 3 Super, xAI reborn, Anthropic Lawsuit, Research!

Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning, Another xAI Cofounder Has Left, Anthropic Sues Department of Defense. Last Week in AI, Mar 16, 2026

Our 237th episode, with a summary and discussion of last week's big AI news! Recorded on 03/13/2026. Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at andreyvkurenkov@gmail.com and/or hello@gladstone.ai

In this episode:
* Perplexity announced "Personal Computer," a local Mac-based AI agent positioned as a safer alternative to OpenAI's computer-use agents, while Anthropic added GitHub PR code reviews priced at $15–$25 and Cursor launched trigger-based "Automations" for always-on coding agents.
* ChatGPT introduced interactive math/science visuals and Anthropic added in-chat interactive charts/diagrams; Nvidia released open weights for its 120B-parameter Nemotron 3 Super hybrid Transformer–Mamba latent-MoE model, trained natively at 4-bit for Blackwell GPUs.
* Nvidia halted H200 production for China amid customs blocks and domestic chip pressure; xAI saw major co-founder departures; Anthropic previewed a Claude Marketplace for enterprise procurement; Yann LeCun's AMI Labs raised $1.03B; humanoid robot maker Sanctuary reached a $1.15B valuation.
* Anthropic sued the Pentagon over a "supply chain risk" designation as memos ordered removal within 180 days; research covered models resisting activation steering, limits of chain-of-thought control, inference-scaling boosting cyber-task success, low-probability risky actions, weaknesses in SWE-bench, multimodal pretraining, long-context RNN memory caching, context-parallel training efficiency, RL for CUDA kernel optimization, and latent introspection detecting concept injection.

A thank you to our current sponsors:
Box - visit Box.com/AI to learn more
ODSC AI - go to odsc.ai/east and use promo code LWAI for an additional 15% off your pass to ODSC AI East 2026.
Factor - head to factormeals.com/lwai50off and use code lwai50off to get 50 percent off and free breakfast for a year

Timestamps:
(00:00:10) Intro / Banter
(00:01:23) Response to listener comments

Tools & Apps
(00:02:06) Perplexity's Personal Computer turns your spare Mac into an AI agent | The Verge
(00:04:22) Anthropic launches code review tool to check flood of AI-generated code | TechCrunch
(00:08:08) Cursor is rolling out a new kind of agentic coding tool | TechCrunch
(00:11:14) ChatGPT can now create interactive visuals to help you understand math and science concepts | TechCrunch
(00:11:56) Anthropic's Claude AI can respond with charts, diagrams, and other visuals now | The Verge

Projects & Open Source
(00:13:54) Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog

Applications & Business
(00:21:22) Nvidia halts H200 production as China backs Huawei AI chips
(00:28:33) Another XAI Cofounder Has Left, and Another Says He's Leaving. - Business Insider
(00:34:04) Anthropic's Claude Marketplace allows customers to buy third-party cloud services | TechRadar
(00:37:57) Yann LeCun's AMI Labs raises $1.03 billion to build world models | TechCrunch
(00:44:52) Humanoid robotics maker Sunday reaches $1.15B valuation to build household robots | TechCrunch

Policy & Safety
(00:46:09) Anthropic Sues Department of Defense Over 'Supply Chain Risk' Label - The New York Times + Google and OpenAI Just Filed a Legal Brief in Support of Anthropic
(00:53:24) Internal Pentagon memo orders military commanders to remove Anthropic AI technology from key systems - CBS News
(00:58:15) Endogenous Resistance to Activation Steering in Language Models
(01:06:27) Reasoning Models Struggle to Control their Chains of Thought
(01:09:52) 'It means missile defence on datacentres': drone strikes raise doubts over Gulf as AI superpower
(01:14:57) Evidence for inference scaling in AI cyber tasks: Increased evaluation budgets reveal higher success rates
(01:18:24) Frontier Models Can Take Actions at Low Probabilities

Research & Advancements
(01:24:20) Research note: Many SWE-bench-Passing PRs Would Not Be Merged into Main
(01:28:26) [2603.03276] Beyond Language Modeling: An Exploration of Multimodal Pretraining
(01:40:09) Memory Caching: RNNs with Growing Memory
(01:48:47) Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
(01:58:41) CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
(02:08:57) Latent Introspection: Models Can Detect Prior Concept Injections
(02:16:45) Physics of RL: Toy scaling laws for the emergence of reward-seeking

Weekly AI summaries and discussion about last week's AI news! Subscribe over at https://www.lastweekinai.com/
TAI #194: AI Goes Macro; Job Loss Fears, Military Usage, OpenAI $110B Raise towards_ai 03.03.2026 15:02 0.668
Embedding sim. 0.7867
Entity overlap 0.0588
Title sim. 0.0625
Time proximity 0.8495
NLP type other
NLP organization OpenAI
NLP topic artificial intelligence
NLP country United States

Open original

What happened this week in AI, by Louie

This week brought a series of developments that signal AI is quickly becoming more than just a technology story: AI's revenue, its politics, and its labor market consequences are now operating at a scale that reshapes the global economy and the geopolitical order in real, measurable ways.

AI, the Pentagon, and the Claude Surge. AI is increasingly critical to US military operations. OpenAI signed a contract with the Department of Defense to deploy its models on classified networks. Hours later, the Trump administration designated Anthropic a "supply chain risk" and directed agencies to stop using Claude, widely interpreted as retaliation for Anthropic's refusal to lift its safety guardrails for unrestricted military use. Meanwhile, reports emerged that Claude was allegedly used, together with Palantir, during the capture of Venezuela's then-president Nicolás Maduro in January, and again to assist with intelligence assessment during strikes against Iran.

I agree with the red lines Anthropic has laid out: no mass surveillance, no autonomous weapons without a human in the loop. Dario Amodei seems more serious about enforcing those boundaries than any other lab CEO, and his willingness to absorb real commercial and political cost to hold that line is notable. That said, the broader question is genuinely complex. Should unelected AI CEOs be drawing the boundaries of how military AI gets used? In principle, that is a job for elected governments. But existing laws were not written with these AI capabilities in mind, and governments have shown little urgency to update them. Until they do, the defaults are being set by a handful of companies in San Francisco.

Public backlash against OpenAI's Pentagon deal appears to have driven a spike in downloads of Claude. Anthropic's app hit number one on the Apple App Store, and the resulting surge in demand contributed to a major Claude outage on Monday that lasted nearly three hours, following a minor disruption on February 28. GPU and inference capacity are already binding constraints, and we are nowhere near the usage levels many AI economic scenarios assume.

OpenAI Raises $110 Billion. OpenAI closed a $110 billion funding round, the largest private financing in history, from Amazon ($50B), Nvidia ($30B), and SoftBank ($30B), at a pre-money valuation of $730 billion. Capital flowing into AI infrastructure is now reaching a scale that shows up in macro aggregates. Between this fundraise, continued $150–200 billion in hyperscaler data center capex per quarter, and SoftBank's Stargate commitments, AI investment is becoming a material driver of GDP in its own right. The question is whether the productivity gains this infrastructure enables will circulate broadly through the economy, or concentrate in a handful of firms.

Citrini's "2028 Global Intelligence Crisis" and the AI Job Loss Debate. A blog post from CitriniResearch titled "The 2028 Global Intelligence Crisis" went extremely viral recently, reportedly accumulating around 16 million views. The piece is written as a fictional macro memo from June 2028, looking back on how AI-driven white-collar job displacement triggered a cascade of economic and financial consequences: mass layoffs leading to reduced consumer spending, a collapsing SaaS sector, private credit defaults, and eventually stress in the $13 trillion US mortgage market as high-income borrowers lose their jobs.
The thesis: AI capabilities improve, companies lay off white-collar workers and reinvest savings into more AI; displaced workers spend less; companies under revenue pressure invest even more in AI to cut costs; and the cycle accelerates. Citrini calls this the "human intelligence displacement spiral." The piece also describes how agentic commerce erodes the moats of intermediary businesses (DoorDash, Mastercard, insurance brokers, real estate agents) as AI agents are put in charge of your shopping, optimizing for price rather than habit, effectively destroying the "friction premium" that underpins trillions of dollars of enterprise value. Stocks named in the essay, including Uber, DoorDash, American Express, and Mastercard, sold off in the days following the post's spread. IBM dropped sharply. Reception from economists was mixed, and the piece got plenty of pushback, but the scenario clearly struck a nerve because it stitched together several anxieties investors already had: AI as a margin tailwind in the short run, and AI as a demand and business-model headwind if labor income gets hit hard enough.

I think the Citrini thesis is a feasible, low-probability possibility, but with some important caveats.

The stock market story and the economic story are two different things. Global labor income is roughly $60 trillion, compared with current S&P 500 profits of $2–2.5 trillion. That gap leaves huge room for AI-beneficiary names to soak up profit from labor, pushing S&P levels higher even if GDP falls significantly. The usual intuition that "stocks track the economy" can fail when the economy's scarce factor shifts from labor to compute. In these scenarios, AI labs will likely have to keep spinning off divisions and vertical platforms to maintain some diversity in the indexes, because you cannot have 5–10 companies making up 90% of market capitalization without structural pressure to break them up.

The "technological innovation destroys jobs and then creates even more" line does not hold as a default assumption this time. It has been right for two centuries because every new job required a human to perform it. With general-purpose AI, many of the "new categories" are also automatable, often faster than institutions can train for and professionalize them. There will definitely be human roles that appear or grow significantly for a while, but they may only be a fraction of what gets replaced. One scenario for job growth to offset job losses is if GDP grows to multiple times its current level. That seems to be Elon Musk's primary scenario: one new human job for every nine new AI jobs can still lead to full employment if the total economy is large enough (if total roles grow tenfold while humans fill one in ten of them, the absolute number of human jobs stays roughly constant). That is feasible. But the middle ground, where there are neither huge job losses nor an unprecedented economic boom, does not seem very likely to me.

Citrini's network effects and platform-disruption point are also interesting. Agents definitely reduce the friction that gives incumbents their brand and habitual-usage advantages. An AI agent choosing the best delivery app has no home-screen loyalty. But for many businesses, there are still large fixed-cost advantages and utilization-rate economics that favor the largest network. A company with 50% margins from scale can survive a world where newcomers sell at the same price while making a loss, even with software costs near zero. This depends heavily on the business, though.
That advantage does not help Uber or DoorDash nearly as much as it helps an infrastructure provider or a marketplace with exclusive supply. GPU capacity will likely be the primary bottleneck to Citrini's scenario playing out at speed. We are already seeing Claude crash this week due to increased usage, and Gemini has had its own scaling issues. However, it is not impossible to see 100x-plus breakthroughs in inference efficiency, particularly if AI starts making its own breakthroughs in designing and testing new model architectures and inference systems. Compute is a brake today. It is not a guaranteed brake for 2027–2028.

The Citrini thesis got some partial vindication this week with Block's announcement that it is cutting roughly 4,000 employees, nearly half its workforce. CEO Jack Dorsey was explicit that the cuts are AI-driven, saying the intelligence tools they are building "fundamentally change what it means to build and run a company." He predicted that within the next year, most companies will reach the same conclusion and make similar structural changes. Block's stock soared as much as 24% on the news. This is the pattern Citrini describes: layoffs expand margins, earnings beat, stocks rally. Each company's response is rational. The collective result is the displacement spiral that makes the scenario so uncomfortable.

Why should you care? Here is where I think we actually stand. Human expertise is vital to nearly all AI usage today, and it will be for some time. The models are powerful, but they are not autonomous. They need people who understand the domain, can evaluate their outputs, can architect the workflows, and can catch the failures before they reach production. However, I see a very real risk that AI-first employees can be 2–3x more productive, with higher-quality output, than those who resist using AI. Many companies will channel that productivity into building more products, running more security checks, and expanding into new markets. But many will hit other bottlenecks to growing output, and for those companies, the surplus productivity translates directly into headcount reduction. AI-slow adopters are at high risk of redundancy across a very large number of careers in the near future.

That said, enterprise adoption is still slow. AI engineers and forward-deployed engineers will be critically needed to customize agents and workflows for specific enterprise contexts. True adoption takeoff requires people who can bridge the gap between raw model capability and production-grade reliability. The main bottlenecks to AI adoption are likely to be AI compute, as we can see from the Claude and Gemini scaling issues this week, but also AI engineers with the expertise to build and deploy enterprise-tier agents. The models are ready. The infrastructure is strained. The human talent to wire it all together is in short supply.

On that note, 2025 gave us agent hype. It did not give us a reliable way to build them. Most developers are still guessing at tools, wiring, and how to catch failures before users do. Fortunately, we have a new course to fill this gap!

— Louie Peters — Towards AI Co-founder and CEO

We spent 9 months building, breaking, and stress-testing two real-agent systems, with feedback from 180+ developers.
The result is Agentic AI Engineering, our newest course, built to teach operational reliability: measurable quality (evals), inspectable behavior (observability), and controlled autonomy (clear boundaries plus robust tool/workflow engineering). You'll build a Research Agent and a Writing Workflow end-to-end, and you'll ship them with the parts that make agents usable in 2026: evaluation datasets and pass/fail checks, LLM judges, tracing, monitoring, and the workflow glue that keeps tools, state, and outputs from turning into chaos. The first 100 early-bird seats sold out in under a week. The next 100 seats are $499 (the lowest price after the early bird). Lifetime access, Discord community, and a 30-day refund. Get access now!

Hottest News

1. US Bars Anthropic Products From Agencies, Contractors

The Pentagon declared Anthropic PBC a supply-chain risk after President Donald Trump directed US government agencies to stop using its products. Defense Secretary Pete Hegseth ordered the Pentagon to bar its contractors and their partners from any commercial activity with Anthropic, giving the company six months to hand over AI services to another provider. This wipes out as much as $200 million in work that Anthropic had agreed to do for the military, along with smaller but important contracts for civilian agencies, including the State Department. In its statement on Friday, Anthropic said being labeled a supply-chain risk "would both be legally unsound and set a dangerous precedent for any American company that negotiates with the government."

2. OpenAI Raises $110B in One of the Largest Private Funding Rounds in History

OpenAI has raised $110 billion in private funding, commencing one of the largest private funding rounds in history. The new funding consists of a $50 billion investment from Amazon as well as $30 billion each from Nvidia and SoftBank, against a $730 billion pre-money valuation. As part of the investment, OpenAI is launching significant infrastructure partnerships with both Amazon and Nvidia. The Information had previously reported that $35 billion of Amazon's investment could be contingent on the company either achieving AGI or making its IPO by the end of the year. OpenAI's announcement confirms the funding split, but says only that the additional $35 billion will arrive "in the coming months when certain conditions are met." Notably, the round remains open, and OpenAI expects more investors to join as it proceeds.

3. Google AI Just Released Nano-Banana 2

Google officially unveiled Nano-Banana 2 (technically designated Gemini 3.1 Flash Image). It leverages Latent Consistency Distillation (LCD) to achieve sub-500ms latency, enabling real-time 4K image synthesis and upscaling directly on mobile hardware. Built on a 1.8-billion-parameter backbone, the model uses Dynamic Quantization-Aware Training (DQAT) to maintain high-fidelity output with a minimal memory footprint, eliminating the need for expensive cloud inference. By implementing Grouped-Query Attention (GQA), the model reduces memory bandwidth requirements, allowing it to run continuously on mobile NPUs without triggering thermal throttling or performance dips. Additionally, the model can maintain character resemblance for up to five characters and the fidelity of up to 14 objects. Through the new Banana-SDK, developers can deploy specialized Low-Rank Adaptation (LoRA) modules to customize the model for niche tasks without retraining the base architecture.

4. Nous Research Releases Hermes Agent

The Nous Research team released Hermes Agent, an open-source autonomous system designed to solve the two biggest bottlenecks in agentic workflows: memory decay and environmental isolation. Hermes Agent utilizes a multi-level memory system that mimics procedural learning. While it handles short-term tasks through standard inference, its long-term utility is driven by Skill Documents. Powered by the Llama 3.1-based Hermes-3 model, it is fine-tuned with Atropos RL for high steerability and reliable tool-calling within complex reasoning loops. The system integrates directly with existing communication stacks, including Telegram, Discord, Slack, and WhatsApp.

5. Perplexity Unveiled Perplexity Computer

Perplexity AI announced the launch of Perplexity Computer, a system that unifies multiple frontier AI models into a single platform to execute complex, long-running workflows. The system breaks down a user's requested outcome into tasks and subtasks, assigns them to sub-agents, and executes them asynchronously. These sub-agents can conduct web research, generate documents, process data, and make API calls to connected services. Overall, it can allocate tasks across 19 different models. Each task on Computer runs in an isolated compute environment with access to a filesystem, browser, and tool integrations. If the system encounters issues, it can generate additional sub-agents to address them. As of today, Perplexity Computer runs Opus 4.6 for its core reasoning engine and orchestrates sub-agents with the best models for specific tasks: Gemini for deep research (creating sub-agents), Nano Banana for images, Veo 3.1 for video, Grok for speed in lightweight tasks, and ChatGPT 5.2 for long-context recall and wide search. The product is available to Perplexity Max subscribers. It follows a usage-based pricing model, allowing users to select different AI models for different sub-agent tasks and manage token spending.

6. Alibaba Team Open-Sources CoPaw

Alibaba released CoPaw, an open-source framework that provides a standardized workstation for deploying and managing personal AI agents. The system relies on three primary layers: AgentScope (the underlying framework that handles agent communication and logic), AgentScope Runtime (the execution environment), and ReMe (memory management). A core feature of the CoPaw workstation is its Skill Extension capability. In this framework, a 'Skill' is a discrete unit of functionality, essentially a tool that the agent can invoke to interact with the external world. It also introduces an All-Domain Access layer, which standardizes how agents interact with different messaging protocols.

Five 5-minute reads/videos to keep you learning

1. Building a Production-Ready Agentic RAG System on GCP (Vertex AI, ADK, Terraform)

The article shows how to implement a production-grade RAG system on Google Cloud Platform to address the challenge of making organizational documents searchable beyond basic keyword matching. The architecture features separate ingestion and query pipelines using Vertex AI, Cloud Run, Eventarc, and Gemini. The article covers complete infrastructure deployment via Terraform, step-by-step setup instructions, and comparative analysis against AWS Bedrock, Azure AI Search, and open-source alternatives.

2. Agentic RAG & Semantic Caching: Building Smarter Enterprise Knowledge Systems

Enterprise knowledge systems face significant challenges in managing unstructured data scattered across multiple platforms.
This article presents a complete implementation of Agentic RAG systems that overcome Naive RAG's critical limitations, including the inability to summarize documents, perform multi-document comparisons, maintain conversational memory, and enforce data security. It uses the Qdrant vector database with Nomic embeddings across two notebooks.

3. LoRA, QLoRA, DoRA: Which Fine-Tuning Method Should You Actually Use?

This article analyzes the original research papers for LoRA, QLoRA, and DoRA to provide evidence-based comparisons of parameter-efficient fine-tuning methods. It explains how LoRA reduces trainable parameters by 99.6% through low-rank weight updates, how QLoRA enables fine-tuning 65B models on a single 48GB GPU using 4-bit quantization, and how DoRA improves accuracy by decomposing weights into magnitude and direction components. It also demonstrates practical code examples from official repositories (a generic LoRA sketch appears at the end of this issue).

4. Cutting Batch Release from 14 Days to 3: A Case Study in Multi-Agent AI for Pharmaceutical Manufacturing

This article presents a case study of a pharma company reducing pharmaceutical batch release time from 14 days to 3 days using a multi-agent AI system. The manufacturer addressed a critical bottleneck in which Quality Assurance reviewers manually gathered records from multiple systems (MES, LIMS, environmental monitoring) to verify compliance with registered specifications, resulting in over $2 million in annual operational overhead. The solution implemented four specialized agents using the CrewAI framework: Batch Data Collector, Deviation Analyst, Compliance Reviewer, and Release Recommender. Each agent employed the ReAct paradigm with custom tools, conditional task execution for critical deviations, and human-in-the-loop approval by Qualified Persons.

5. Deriving the Singular Value Decomposition (SVD) from First Principles

Moving beyond the typical formula-based teaching approach, this article derives the Singular Value Decomposition (SVD) from first principles, starting with symmetric matrix diagonalization. It constructs the SVD by first forming two symmetric matrices (AᵀA and AAᵀ) from any matrix A, then using their eigenbases to form orthonormal matrices U and V. The piece demonstrates how SVD decomposes any linear transformation into three operations: rotation, stretch, and rotation, with all transformation energy contained in the diagonal matrix Σ.

Repositories & Tools

1. DeerFlow is an open-source super agent harness that orchestrates sub-agents, memory, and sandboxes.
2. Ruflo is an AI agent orchestration framework that transforms Claude Code into a powerful multi-agent development platform.
3. MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs.
4. FireRed OCR is a framework for specializing general LVLMs into document parsing experts.

Top Papers of The Week

1. AI Agents as Universal Task Solvers

This paper describes AI agents as stochastic dynamical systems and frames reasoning as transductive inference that captures algorithmic structure to speed up novel tasks. It shows that the optimal speed-up on a new task is tightly related to the algorithmic information it shares with the training data. It also highlights that transductive inference yields its greatest benefits precisely when the data-generating mechanism is most complex, and identifies a possible failure mode of naive scaling.

2. Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

This paper presents GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent vibe personalization.

3. Diffusion-Pretrained Dense and Contextual Embeddings

This report introduces pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. Researchers released two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB (Multilingual, v2), MTEB (Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark.

4. Doc-to-LoRA: Learning to Instantly Internalize Contexts

This paper proposes Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate context distillation within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x.

5. Discovering Multiagent Learning Algorithms with Large Language Models

This paper introduces AlphaEvolve, an LLM-powered evolutionary coding agent that automatically designs multi-agent reinforcement learning algorithms for imperfect-information games. AlphaEvolve discovers VAD-CFR, which uses volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start schedule, and SHOR-PSRO, which blends Optimistic Regret Matching with smoothed best-response distributions and dynamic annealing, both of which outperform state-of-the-art CFR and PSRO variants.

Who's Hiring in AI

AI Engineer — FDE @Databricks (Remote)
Senior Software Engineer @Microsoft Corporation (Redmond, WA, USA)
Engineering Manager, AI @Headspace (Remote/USA)
Software Engineer, AI Native @Meta (Menlo Park, CA, USA)
Senior AI Engineer @Sword Health (Remote/Portugal)
AI Engineer Sr — Generative AI @Lockheed Martin (Colorado Springs, USA)
Principal Engineer (Gen-AI) @Turing (India)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net. Think a friend would enjoy this too? Share the newsletter and let them join the conversation. Thanks for reading Towards AI Newsletter! Subscribe for free to receive new posts and support my work.
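As a companion to the LoRA reading item above: the core of LoRA is freezing a pretrained weight matrix W and learning only a rank-r update BA alongside it. The sketch below is a generic, minimal PyTorch illustration (not code from the article or any official repository); the layer sizes and hyperparameters are arbitrary assumptions.

```python
# Generic LoRA sketch: the base weight stays frozen; only the low-rank
# factors A and B (the update B @ A) receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained layer
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r             # zero-init B => update starts as a no-op

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T, i.e., a rank-r correction to W
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")  # ~2% at this toy size
```

Zero-initializing B makes the adapted layer start out identical to the frozen base, so fine-tuning begins from pretrained behavior; the trainable fraction shrinks further as layers grow, which is where reduction figures like the ~99.6% cited above come from at full model scale.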
AlphaGo at 10: How AI Innovation Is Paving the Path to AGI — Google DeepMind deepmind 10.03.2026 15:00 0.667
Embedding sim. 0.786
Entity overlap 0
Title sim. 0.082
Time proximity 0.8438
NLP type scientific_publication
NLP organization deepmind
NLP topic artificial intelligence
NLP country united kingdom

Open original

March 10, 2026 · Research

From games to biology and beyond: 10 years of AlphaGo's impact

Demis Hassabis

Ten years ago, our AI system AlphaGo became the first program to defeat a world champion at the complex game of Go – reaching a milestone in the field a decade before many experts thought possible. The achievement heralded the beginning of what is now recognized as the modern era in artificial intelligence (AI). With a single creative play, the famous 'Move 37,' AlphaGo demonstrated the potential of AI and signaled that we now had the techniques to begin tackling real-world scientific problems. Today, this breakthrough continues to inform our work building systems on the path to artificial general intelligence (AGI). We believe AGI will be the most profound technology ever invented and potentially the ultimate tool to advance science, medicine, and productivity.

A creative spark

In 2016, over 200 million people watched AlphaGo face world-champion Go player Lee Sae Dol in Seoul. The match was defined by 'Move 37' in Game 2, a play so unconventional that professional commentators initially thought it was a mistake. But it proved to be decisive. One hundred or so moves later, the stone was in exactly the right position for AlphaGo to win the game. It was a display of incredible foresight and the AI system's ability to go beyond mimicking human experts and find entirely new strategies.

Watch the AlphaGo documentary on YouTube

Go has long been a proving ground for AI research because of the game's sheer complexity. There are 10¹⁷⁰ possible positions on the board—far more than the number of atoms in the observable universe. To make the game tractable, AlphaGo used deep neural networks combined with advanced search and reinforcement learning – an AI approach DeepMind pioneered. AlphaGo learned a model of plausible Go moves by first learning from games played by human experts, and then playing hundreds of thousands of games against itself, improving as the strongest winning strategies were reinforced. The system then considered only the most potentially fruitful paths and, from that smaller subset of moves, found the one most likely to lead it to win.

After AlphaGo, we built AlphaGo Zero, which learned the game from completely random play and became arguably the strongest player in history. Then we generalized the system further with AlphaZero, which taught itself from scratch to master any 2-player perfect-information game, including Go, chess, and shogi. Beginning with no prior knowledge other than the rules of the game, AlphaZero was able to learn to master chess in a matter of hours, and beat not only the top human players but the best specialised chess programs at the time, like Stockfish. And even though chess had been so heavily analysed with the aid of these programs, just as with Go, AlphaZero was still able to come up with interesting new strategies. It was further proof of what I knew the moment we won the match in Seoul - the technology was ready to be applied to our real goal of accelerating scientific breakthroughs.

"I believe the greatest lesson AlphaGo offered was a definitive preview of the AI era—proving it wasn't some distant, vague future, but a reality arriving on our doorstep. It served as a 'roadmap from the future,' sending a clear signal to humanity about how the world was about to change."
— Go Master Lee Sae Dol, Adjunct Professor at UNIST

Catalyzing breakthroughs in science

By proving it could navigate the massive search space of a Go board, AlphaGo demonstrated the potential for AI to help us better understand the vast complexities of the physical world. We started by attempting to solve the protein folding problem, a 50-year grand challenge of predicting the 3D structure of proteins - information that is crucial for understanding diseases and developing new drugs. In 2020, we finally cracked this longstanding scientific problem with our AlphaFold 2 system. From there, we folded the structures for all 200 million proteins known to science and made them freely available to scientists in an open-source database. Today, over 3 million researchers around the world use the AlphaFold database to accelerate their important work on everything from malaria vaccines to plastic-eating enzymes. And in 2024, it was the honor of a lifetime for John Jumper and me to be awarded the Nobel Prize in Chemistry for leading this project, on behalf of the entire AlphaFold team.

Since AlphaGo's win, we've applied its groundbreaking approach to many other areas of science and mathematics, including:

Mathematical reasoning: The most direct descendant of AlphaGo's architecture, AlphaProof learned to prove formal mathematical statements using a combination of language models and AlphaZero's reinforcement learning and search algorithms. Alongside AlphaGeometry 2, it became the first system to achieve a medal standard (silver) at the International Mathematical Olympiad (IMO), proving AlphaGo's methods could unlock advanced mathematical reasoning and laying the foundation for our most capable general models. Gemini, our largest and most capable model, recently went even further. An advanced version of its Deep Think mode achieved gold-medal-level performance at the 2025 IMO using an approach inspired by AlphaGo. Since then, Deep Think has been applied to even more complex, open-ended challenges across science and engineering.

Algorithm discovery: Just as AlphaGo searched for the best move in a game, our coding agent AlphaEvolve explores the space of computer code to discover more efficient algorithms. It had its own Move 37 moment when it found a novel way to multiply matrices, a fundamental mathematical operation powering nearly all modern neural networks. AlphaEvolve is now being tested on problems ranging from data center optimization to quantum computing.

Scientific collaboration: We are integrating the search and reasoning principles pioneered with AlphaGo into an AI co-scientist. By having agents 'debate' scientific ideas and hypotheses, this system acts as a collaborator capable of performing the rigorous thinking necessary to identify patterns in data and solve sophisticated problems. In validation studies at Imperial College London, it analyzed decades of literature and independently arrived at the same hypothesis about antimicrobial resistance that researchers had spent years developing and validating experimentally.

We've also used AI to better understand the genome, advance fusion energy research, improve weather prediction, and more. As impressive as our scientific models are, they are highly specialized.
To achieve fundamental breakthroughs like creating limitless clean energy or solving diseases that we don't understand today, we need general AI systems that can find underlying structure and connections between different subject areas, and help us come up with new hypotheses like the best scientists do.

Future of intelligence

For an AI to be truly general, it needs to understand the physical world. We built Gemini to be multimodal from the beginning so it could understand not just language, but also audio, video, images, and code, to build a model of the world. To think and reason across these modalities, the latest Gemini models use some of the techniques we pioneered with AlphaGo and AlphaZero. The next generation of AI systems will also need to be able to call upon specialized tools. For example, if a model needed to know the structure of a protein, it could use AlphaFold for that. We think the combination of Gemini's world models, AlphaGo's search and planning techniques, and specialized AI tool use will prove to be critical for AGI.

True creativity is a key capability that such an AGI system would need to exhibit. Move 37 was a glimpse of AI's potential to think outside the box, but true original invention will require something more. It would need to not only come up with a novel Go strategy, as AlphaGo impressively did, but actually invent a game as deep and elegant, and as worthy of study, as Go.

Ten years after AlphaGo's legendary victory, our ultimate goal is on the horizon. The creative spark first seen in Move 37 catalyzed breakthroughs that are now converging to pave the path towards AGI - and usher in a new golden age of scientific discovery.

Exactly one decade later, we look back at the match that sparked the modern AI revolution. Learn about AlphaGo. Watch the AlphaGo documentary on YouTube.

Related posts: Exploring the mysteries of Go with AlphaGo and China's top players (April 2017) · Innovations of AlphaGo (April 2017) · AlphaGo's next move (May 2017)
The five AI value models driving business reinvention openai 05.03.2026 00:00 0.666
Embedding sim. 0.7893
Entity overlap 0.0233
Title sim. 0.0714
Time proximity 0.8039
NLP type other
NLP organization
NLP topic ai adoption
NLP country

Open original

Five AI value models show how leaders can sequence AI from workforce fluency to process reinvention and build durable business advantage.
Identifying Interactions at Scale for LLMs bair 13.03.2026 09:00 0.659
Embedding sim. 0.8146
Entity overlap 0.0769
Title sim. 0.1023
Time proximity 0.3988
NLP type scientific_publication
NLP organization
NLP topic large language models
NLP country

Open original

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a step toward safer and more trustworthy AI. To gain a comprehensive understanding, we can analyze these systems through different lenses: feature attribution, which isolates the specific input features driving a prediction (Lundberg & Lee, 2017; Ribeiro et al., 2022); data attribution, which links model behaviors to influential training examples (Koh & Liang, 2017; Ilyas et al., 2022); and mechanistic interpretability, which dissects the functions of internal components (Conmy et al., 2023; Sharkey et al., 2025).

Across these perspectives, the same fundamental hurdle persists: complexity at scale. Model behavior is rarely the result of isolated components; rather, it emerges from complex dependencies and patterns. To achieve state-of-the-art performance, models synthesize complex feature relationships, find shared patterns from diverse training examples, and process information through highly interconnected internal components. Therefore, grounded or reality-checked interpretability methods must also be able to capture these influential interactions. As the number of features, training data points, and model components grows, the number of potential interactions grows exponentially, making exhaustive analysis computationally infeasible. In this blog post, we describe the fundamental ideas behind SPEX and ProxySPEX, algorithms capable of identifying these critical interactions at scale.

Attribution through Ablation

Central to our approach is the concept of ablation: measuring influence by observing what changes when a component is removed.

Feature Attribution: We mask or remove specific segments of the input prompt and measure the resulting shift in the predictions.

Data Attribution: We train models on different subsets of the training set, assessing how the model's output on a test point shifts in the absence of specific training data.

Model Component Attribution (Mechanistic Interpretability): We intervene on the model's forward pass by removing the influence of specific internal components, determining which internal structures are responsible for the model's prediction.

In each case, the goal is the same: to isolate the drivers of a decision by systematically perturbing the system, in hopes of discovering influential interactions. Since each ablation incurs a significant cost, whether through expensive inference calls or retrainings, we aim to compute attributions with the fewest possible ablations.

[Figure: Masking different parts of the input, we measure the difference between the original and ablated outputs.]

SPEX and ProxySPEX Framework

To discover influential interactions with a tractable number of ablations, we developed SPEX (Spectral Explainer). This framework draws on signal processing and coding theory to advance interaction discovery to scales orders of magnitude greater than prior methods. SPEX circumvents the exponential blow-up by exploiting a key structural observation: while the number of total interactions is prohibitively large, the number of influential interactions is actually quite small.
We formalize this through two observations: sparsity (relatively few interactions truly drive the output) and low-degreeness (influential interactions typically involve only a small subset of features). These properties allow us to reframe the difficult search problem into a solvable sparse recovery problem. Drawing on powerful tools from signal processing and coding theory, SPEX uses strategically selected ablations to combine many candidate interactions together. Then, using efficient decoding algorithms, we disentangle these combined signals to isolate the specific interactions responsible for the model's behavior.

In a subsequent algorithm, ProxySPEX, we identified another structural property common in complex machine learning models: hierarchy. This means that where a higher-order interaction is important, its lower-order subsets are likely to be important as well. This additional structural observation yields a dramatic improvement in computational cost: it matches the performance of SPEX with around 10x fewer ablations. Collectively, these frameworks enable efficient interaction discovery, unlocking new applications in feature, data, and model component attribution.

Feature Attribution

Feature attribution techniques assign importance scores to input features based on their influence on the model's output. For example, if an LLM were used to make a medical diagnosis, this approach could identify exactly which symptoms led the model to its conclusion. While attributing importance to individual features can be valuable, the true power of sophisticated models lies in their ability to capture complex relationships between features. The figure below illustrates examples of these influential interactions: from a double negative changing sentiment (left) to the necessary synthesis of multiple documents in a RAG task (right).

[Figure]

The figure below illustrates the feature attribution performance of SPEX on a sentiment analysis task. We evaluate performance using faithfulness: a measure of how accurately the recovered attributions can predict the model's output on unseen test ablations. We find that SPEX matches the high faithfulness of existing interaction techniques (Faith-Shap, Faith-Banzhaf) on short inputs, but uniquely retains this performance as the context scales to thousands of features. In contrast, while marginal approaches (LIME, Banzhaf) can also operate at this scale, they exhibit significantly lower faithfulness because they fail to capture the complex interactions driving the model's output.

[Figure]

SPEX was also applied to a modified version of the trolley problem, where the moral ambiguity of the problem is removed, making "True" the clear correct answer. Given the modification below, GPT-4o mini answered correctly only 8% of the time. When we applied standard feature attribution (SHAP), it identified individual instances of the word trolley as the primary factors driving the incorrect response. However, replacing trolley with synonyms such as tram or streetcar had little impact on the prediction of the model. SPEX revealed a much richer story, identifying a dominant high-order synergy between the two instances of trolley, as well as the words pulling and lever, a finding that aligns with human intuition about the core components of the dilemma. When these four words were replaced with synonyms, the model's failure rate dropped to near zero.
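To make the masking-ablation loop concrete, here is a toy end-to-end sketch in Python. It is not the SPEX algorithm (SPEX's sparse-recovery and coding-theory machinery is exactly what makes it practical with far fewer ablations), but it shows the shared skeleton: random masking ablations, a sparse surrogate over low-degree interaction terms, and a faithfulness check on held-out masks. The `model_fn` here is a synthetic stand-in for an expensive LLM call, and all names are illustrative.

```python
# Toy sketch of ablation-based interaction discovery (illustrative, not SPEX).
import itertools
import numpy as np
from sklearn.linear_model import Lasso

n_features = 10          # e.g., maskable input segments
rng = np.random.default_rng(0)

def model_fn(mask):
    # Stand-in for an expensive model call on a masked input; the hidden
    # ground truth is one main effect (x3) plus one pairwise interaction (x1*x7).
    return 2.0 * mask[3] + 1.5 * mask[1] * mask[7]

masks = rng.integers(0, 2, size=(300, n_features))   # random ablation patterns
y = np.array([model_fn(m) for m in masks])

# Surrogate over all singles and pairs: the low-degree assumption keeps it small.
pairs = list(itertools.combinations(range(n_features), 2))
def featurize(M):
    return np.hstack([M] + [M[:, [i]] * M[:, [j]] for i, j in pairs])

fit = Lasso(alpha=0.01).fit(featurize(masks), y)     # sparsity -> L1 penalty

# Faithfulness: how well the sparse surrogate predicts unseen ablations.
test = rng.integers(0, 2, size=(100, n_features))
y_test = np.array([model_fn(m) for m in test])
print("faithfulness (R^2):", fit.score(featurize(test), y_test))

names = [f"x{i}" for i in range(n_features)] + [f"x{i}*x{j}" for i, j in pairs]
for w, name in sorted(zip(fit.coef_, names), key=lambda t: -abs(t[0]))[:3]:
    print(name, round(w, 2))
```

On this toy problem, the Lasso surrogate should assign its largest weights to x3 and the x1*x7 pair, mirroring how the sparsity and low-degreeness assumptions turn an exponential search over interactions into a tractable regression.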
Data Attribution

Data attribution identifies which training data points are most responsible for a model's prediction on a new test point. Identifying influential interactions between these data points is key to explaining unexpected model behaviors. Redundant interactions, such as semantic duplicates, often reinforce specific (and possibly incorrect) concepts, while synergistic interactions are essential for defining decision boundaries that no single sample could form alone. To demonstrate this, we applied ProxySPEX to a ResNet model trained on CIFAR-10, identifying the most significant examples of both interaction types for a variety of difficult test points.

As illustrated, synergistic interactions (left) often involve semantically distinct classes working together to define a decision boundary. For example, grounding the synergy in human perception, the automobile (bottom left) shares visual traits with the provided training images, including the low-profile chassis of the sports car, the boxy shape of the yellow truck, and the horizontal stripe of the red delivery vehicle. Redundant interactions (right), on the other hand, tend to capture visual duplicates that reinforce a specific concept. For instance, the horse prediction (middle right) is heavily influenced by a cluster of dog images with similar silhouettes. This fine-grained analysis allows for the development of new data selection techniques that preserve necessary synergies while safely removing redundancies.

Attention Head Attribution (Mechanistic Interpretability)

The goal of model component attribution is to identify which internal parts of the model, such as specific layers or attention heads, are most responsible for a particular behavior. Here too, ProxySPEX uncovers the responsible interactions between different parts of the architecture. Understanding these structural dependencies is vital for architectural interventions, such as task-specific attention head pruning (a minimal head-ablation sketch appears at the end of this section). On the MMLU high-school US history task, we demonstrate that a ProxySPEX-informed pruning strategy not only outperforms competing methods but can actually improve model performance on the target task.

On this task, we also analyzed the interaction structure across the model's depth. Early layers operate in a predominantly linear regime, where heads contribute largely independently to the target task. In later layers, interactions between attention heads become more pronounced, with most of the contribution coming from interactions among heads in the same layer.

What's Next?

The SPEX framework represents a significant step forward for interpretability, extending interaction discovery from dozens to thousands of components. We have demonstrated the versatility of the framework across the entire model lifecycle: exploring feature attribution on long-context inputs, identifying synergies and redundancies among training data points, and discovering interactions between internal model components. Moving forward, many interesting research questions remain around unifying these different perspectives to provide a more holistic understanding of a machine learning system. It is also of great interest to systematically evaluate interaction discovery methods against existing scientific knowledge in fields such as genomics and materials science, serving both to ground model findings and to generate new, testable hypotheses.
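As a concrete footnote to the attention-head analysis above: single-head ablations can be implemented with PyTorch forward pre-hooks. The sketch below zeroes two heads in one GPT-2 layer; GPT-2 and the particular layer/heads are illustrative assumptions, not the model or components studied in the post.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
n_head = model.config.n_head
head_dim = model.config.n_embd // n_head

def zero_heads(heads):
    # Pre-hook on the attention output projection: its input is the
    # concatenation of per-head outputs, so slices map cleanly to heads.
    def hook(module, args):
        (x,) = args                              # [batch, seq, hidden]
        b, s, _ = x.shape
        x = x.view(b, s, n_head, head_dim).clone()
        x[:, :, heads, :] = 0.0                  # ablate the selected heads
        return (x.view(b, s, -1),)
    return hook

ids = tok("The trolley is heading toward five people", return_tensors="pt")
with torch.no_grad():
    base = model(**ids).logits[0, -1]

# Ablate heads 0 and 3 in layer 5, then measure the shift in next-token logits.
handle = model.transformer.h[5].attn.c_proj.register_forward_pre_hook(zero_heads([0, 3]))
with torch.no_grad():
    ablated = model(**ids).logits[0, -1]
handle.remove()

print("max next-token logit shift:", (base - ablated).abs().max().item())
```

Running such ablations over every head, let alone every pair of heads, is exactly where exhaustive search explodes, and where ProxySPEX's hierarchy assumption pays off.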
We invite the research community to join us in this effort: the code for both SPEX and ProxySPEX is fully integrated and available within the popular SHAP-IQ repository.

https://github.com/mmschlk/shapiq (SHAP-IQ GitHub)
https://openreview.net/forum?id=KI8qan2EA7 (ProxySPEX, NeurIPS 2025)
https://openreview.net/forum?id=pRlKbAwczl (SPEX, ICML 2025)
https://openreview.net/forum?id=glGeXu1zG4 (Learning to Understand, NeurIPS 2024)
Can AI help predict which heart-failure patients will worsen within a year? mit_news_ai 12.03.2026 21:30 0.659
Embedding sim.0.7702
Entity overlap0.0571
Title sim.0.0769
Time proximity0.8631
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: deep learning
NLP country: United States

Open original

Characterized by weakened or damaged heart musculature, heart failure results in the gradual buildup of fluid in a patient's lungs, legs, feet, and other parts of the body. The condition is chronic and incurable, often leading to arrhythmias or sudden cardiac arrest. For many centuries, bloodletting and leeches were the treatment of choice, famously practiced by barber-surgeons in Europe during a time when physicians rarely operated on patients. In the 21st century, the management of heart failure has become decidedly less medieval: today, patients are treated with a combination of healthy lifestyle changes, prescription medications, and sometimes pacemakers. Yet heart failure remains one of the leading causes of morbidity and mortality, placing a substantial burden on health-care systems across the globe. "About half of the people diagnosed with heart failure will die within five years of diagnosis," says Teya Bergamaschi, an MIT PhD student in the lab of Nina T. and Robert H. Rubin Professor Collin Stultz and the co-first author of a new paper introducing a deep learning model for predicting heart failure. "Understanding how a patient will fare after hospitalization is really important in allocating finite resources." The paper, published in Lancet eClinical Medicine by a team of researchers at MIT, Mass General Brigham, and Harvard Medical School, shares results from developing and testing PULSE-HF, which stands loosely for "Predict changes in left ventricULar Systolic function from ECGs of patients who have Heart Failure." The project was conducted in Stultz's lab, which is affiliated with the MIT Abdul Latif Jameel Clinic for Machine Learning in Health. Developed and retrospectively tested across three different patient cohorts from Massachusetts General Hospital, Brigham and Women's Hospital, and MIMIC-IV (a publicly available dataset), the deep learning model accurately predicts changes in the left ventricular ejection fraction (LVEF), the percentage of blood pumped out of the left ventricle of the heart. A healthy human heart pumps out about 50 to 70 percent of the blood in the left ventricle with each beat; anything less is considered a sign of a potential problem. "The model takes an [electrocardiogram] and outputs a prediction of whether or not there will be an ejection fraction within the next year that falls below 40 percent," says Tiffany Yau, an MIT PhD student in Stultz's lab who is also co-first author of the PULSE-HF paper. "That is the most severe subgroup of heart failure." If PULSE-HF predicts that a patient's ejection fraction is likely to worsen within a year, the clinician can prioritize the patient for follow-up. Meanwhile, lower-risk patients can reduce their number of hospital visits and the amount of time spent getting 10 electrodes adhered to their body for a 12-lead ECG. The model can also be deployed in low-resource clinical settings, including doctors' offices in rural areas that don't typically have a cardiac sonographer on staff to run ultrasounds on a daily basis. "The biggest thing that distinguishes [PULSE-HF] from other heart failure ECG methods is instead of detection, it does forecasting," says Yau. The paper notes that, to date, no other methods exist for predicting future LVEF decline among patients with heart failure. During the testing and validation process, the researchers used a metric known as "area under the receiver operating characteristic curve" (AUROC) to measure PULSE-HF's performance.
AUROC is typically used to measure a model's ability to discriminate between classes on a scale from 0 to 1, with 0.5 being random and 1 being perfect. PULSE-HF achieved AUROCs ranging from 0.87 to 0.91 across all three patient cohorts. Notably, the researchers also built a version of PULSE-HF for single-lead ECGs, meaning only one electrode needs to be placed on the body. While 12-lead ECGs are generally considered superior for being more comprehensive and accurate, the single-lead version of PULSE-HF performed just as strongly as the 12-lead version. The idea behind PULSE-HF is elegantly simple, but, like most clinical AI research, its execution was laborious. "It's taken years [to complete this project]," Bergamaschi recalls. "It's gone through many iterations." One of the team's biggest challenges was collecting, processing, and cleaning the ECG and echocardiogram datasets. While the model aims to forecast a patient's ejection fraction, the labels for the training data weren't always readily available. Much like a student learning from a textbook with an answer key, labeling is critical for helping machine-learning models correctly identify patterns in data. Clean, linear text in the form of TXT files typically works best when training models. But echocardiogram files usually come as PDFs, and when PDFs are converted to TXT files, the text (which gets broken up by line breaks and formatting) becomes difficult for the model to read. The unpredictable nature of real-life scenarios, like a restless patient or a loose lead, also marred the data. "There are a lot of signal artifacts that need to be cleaned," Bergamaschi says. "It's kind of a never-ending rabbit hole." While Bergamaschi and Yau acknowledge that more complicated methods could help filter the data for better signals, there is a limit to the usefulness of these approaches. "At what point do you stop?" Yau asks. "You have to think about the use case — is it easiest to have this model that works on data that is slightly messy? Because it probably will be." The researchers anticipate that the next step for PULSE-HF will be testing the model in a prospective study on real patients, whose future ejection fraction is unknown. Despite the challenges inherent to bringing clinical AI tools like PULSE-HF over the finish line, including the possible risk of prolonging a PhD by another year, the students feel that the years of hard work were worthwhile. "I think things are rewarding partially because they're challenging," Bergamaschi says. "A friend said to me, 'If you think you will find your calling after graduation, if your calling is truly your calling, it will be there in the one additional year it takes you to graduate.' … The way we're measured as researchers in [the ML and health] space is different from other researchers in ML space. Everyone in this community understands the unique challenges that exist here." "There's too much suffering in the world," says Yau, who joined Stultz's lab after a health event made her realize the importance of machine learning in health care. "Anything that tries to ease suffering is something that I would consider a valuable use of my time."
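For reference, AUROC as described here can be computed in one line with scikit-learn; the labels and risk scores below are toy values, not PULSE-HF data.

```python
from sklearn.metrics import roc_auc_score

# Toy cohort: 1 = ejection fraction fell below 40% within a year, 0 = it did not.
y_true  = [0, 0, 1, 0, 1, 1, 0, 1]
# Hypothetical model risk scores for the same eight patients.
y_score = [0.10, 0.45, 0.80, 0.25, 0.65, 0.90, 0.40, 0.35]

# AUROC equals the probability that a randomly chosen positive case
# is scored higher than a randomly chosen negative case.
print(roc_auc_score(y_true, y_score))  # 0.875 for these toy values
```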
The Geopolitical Chokepoints of Artificial Intelligence ai_supremacy 25.03.2026 09:40 0.659
Embedding sim.0.7547
Entity overlap0
Title sim.0.1067
Time proximity0.996
NLP type: other
NLP organization:
NLP topic: ai infrastructure
NLP country:

Open original

A.I. World 🌍 The Geopolitical Chokepoints of Artificial Intelligence. A guest post by Julian Alexander Brown: helium, DRAM, HBM and the finite AI race. Michael Spencer and Julian Alexander Brown, Mar 25, 2026. Good Morning, The year 2026 is in many ways the year of AI datacenter and semiconductor bottlenecks. As Nathan Warren over at Exponential View has noted: ⭐ The Helium Shortage "The Iran War has creat… A guest post by Julian Alexander Brown, writing on the nexus and frontiers of governance, geopolitics, economics, energy, and technology. MSc Global Governance & Diplomacy at Oxford (2025–26); experience at the White House and U.S. State Department. Views my own.
Improving AI models’ ability to explain their predictions mit_news_ai 09.03.2026 04:00 0.658
Embedding sim.0.7971
Entity overlap0
Title sim.0.1905
Time proximity0.4643
NLP type: scientific_publication
NLP organization: Massachusetts Institute of Technology
NLP topic: computer vision
NLP country: United States

Open original

In high-stakes settings like medical diagnostics, users often want to know what led a computer vision model to make a certain prediction, so they can determine whether to trust its output. Concept bottleneck modeling is one method that enables artificial intelligence systems to explain their decision-making process. These methods force a deep-learning model to use a set of concepts, which can be understood by humans, to make a prediction. In new research, MIT computer scientists developed a method that coaxes the model to achieve better accuracy and clearer, more concise explanations. The concepts the model uses are usually defined in advance by human experts. For instance, a clinician could suggest the use of concepts like “clustered brown dots” and “variegated pigmentation” to predict that a medical image shows melanoma. But previously defined concepts could be irrelevant or lack sufficient detail for a specific task, reducing the model’s accuracy. The new method extracts concepts the model has already learned while it was trained to perform that particular task, and forces the model to use those, producing better explanations than standard concept bottleneck models. The approach utilizes a pair of specialized machine-learning models that automatically extract knowledge from a target model and translate it into plain-language concepts. In the end, their technique can convert any pretrained computer vision model into one that can use concepts to explain its reasoning. “In a sense, we want to be able to read the minds of these computer vision models. A concept bottleneck model is one way for users to tell what the model is thinking and why it made a certain prediction. Because our method uses better concepts, it can lead to higher accuracy and ultimately improve the accountability of black-box AI models,” says lead author Antonio De Santis, a graduate student at Polytechnic University of Milan who completed this research while a visiting graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. He is joined on a paper about the work by Schrasing Tong SM ’20, PhD ’26; Marco Brambilla, professor of computer science and engineering at Polytechnic University of Milan; and senior author Lalana Kagal, a principal research scientist in CSAIL. The research will be presented at the International Conference on Learning Representations. Building a better bottleneck Concept bottleneck models (CBMs) are a popular approach for improving AI explainability. These techniques add an intermediate step by forcing a computer vision model to predict the concepts present in an image, then use those concepts to make a final prediction. This intermediate step, or “bottleneck,” helps users understand the model’s reasoning. For example, a model that identifies bird species could select concepts like “yellow legs” and “blue wings” before predicting a barn swallow. But because these concepts are often generated in advance by humans or large language models (LLMs), they might not fit the specific task. In addition, even if given a set of pre-defined concepts, the model sometimes utilizes undesirable learned information anyway, which is a problem known as information leakage. “These models are trained to maximize performance, so the model might secretly use concepts we are unaware of,” De Santis explains. 
The MIT researchers had a different idea: Since the model has been trained on a vast amount of data, it may have learned the concepts needed to generate accurate predictions for the particular task at hand. They sought to build a CBM by extracting this existing knowledge and converting it into text a human can understand. In the first step of their method, a specialized deep-learning model called a sparse autoencoder selectively takes the most relevant features the model learned and reconstructs them into a handful of concepts. Then, a multimodal LLM describes each concept in plain language. This multimodal LLM also annotates images in the dataset by identifying which concepts are present and absent in each image. The researchers use this annotated dataset to train a concept bottleneck module to recognize the concepts. They incorporate this module into the target model, forcing it to make predictions using only the set of learned concepts the researchers extracted. Controlling the concepts They overcame many challenges as they developed this method, from ensuring the LLM annotated concepts correctly to determining whether the sparse autoencoder had identified human-understandable concepts. To prevent the model from using unknown or unwanted concepts, they restrict it to use only five concepts for each prediction. This also forces the model to choose the most relevant concepts and makes the explanations more understandable. When they compared their approach to state-of-the-art CBMs on tasks like predicting bird species and identifying skin lesions in medical images, their method achieved the highest accuracy while providing more precise explanations. Their approach also generated concepts that were more applicable to the images in the dataset. “We’ve shown that extracting concepts from the original model can outperform other CBMs, but there is still a tradeoff between interpretability and accuracy that needs to be addressed. Black-box models that are not interpretable still outperform ours,” De Santis says. In the future, the researchers want to study potential solutions to the information leakage problem, perhaps by adding additional concept bottleneck modules so unwanted concepts can’t leak through. They also plan to scale up their method by using a larger multimodal LLM to annotate a bigger training dataset, which could boost performance. “I’m excited by this work because it pushes interpretable AI in a very promising direction and creates a natural bridge to symbolic AI and knowledge graphs,” says Andreas Hotho, professor and head of the Data Science Chair at the University of Würzburg, who was not involved with this work. “By deriving concept bottlenecks from the model’s own internal mechanisms rather than only from human-defined concepts, it offers a path toward explanations that are more faithful to the model and opens many opportunities for follow-up work with structured knowledge.” This research was supported by the Progetto Rocca Doctoral Fellowship, the Italian Ministry of University and Research under the National Recovery and Resilience Plan, Thales Alenia Space, and the European Union under the NextGenerationEU project.
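As a rough sketch of the bottleneck mechanism described above (illustrative only: the layer sizes are invented, and the authors' pipeline additionally uses a sparse autoencoder plus a multimodal LLM to name the concepts), a top-5 concept bottleneck head might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Score n_concepts interpretable concepts from backbone features,
    keep only the top k per example, and classify from those alone."""
    def __init__(self, feat_dim=512, n_concepts=64, n_classes=10, k=5):
        super().__init__()
        self.concept_scorer = nn.Linear(feat_dim, n_concepts)
        self.classifier = nn.Linear(n_concepts, n_classes)
        self.k = k

    def forward(self, feats):
        scores = self.concept_scorer(feats)        # [batch, n_concepts]
        topk = scores.topk(self.k, dim=-1)
        sparse = torch.zeros_like(scores)
        # Restricting each prediction to five concepts, as in the article,
        # both curbs information leakage and keeps explanations short.
        sparse.scatter_(-1, topk.indices, topk.values)
        return self.classifier(sparse), sparse

feats = torch.randn(2, 512)          # stand-in for frozen backbone features
logits, concepts = ConceptBottleneck()(feats)
print(logits.shape, (concepts != 0).sum(dim=-1))  # 10 classes, 5 active concepts
```

The final prediction is a function of the five selected concept scores only, which is what lets a user audit why the model decided as it did.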
UK promises procurement shift after Palantir deals the_register_ai 20.03.2026 11:06 0.658
Embedding sim.0.7686
Entity overlap0.0541
Title sim.0.0729
Time proximity0.8717
NLP type: regulation
NLP organization: Government of the United Kingdom
NLP topic: ai governance
NLP country: United Kingdom

Open original

UK to rethink tech buying after Palantir contracts. Government looks for sovereign tech as NHS deal nears break clause. Lindsay Clark, Fri 20 Mar 2026 // 11:06 UTC. The UK government has promised a different approach to tech procurement following the award of controversial contracts to Palantir. Speaking to MPs, science minister Patrick Vallance said that the government's deals with Palantir – which has large contracts with the NHS and the Ministry of Defence – would be done differently in the future, instead emphasizing investment in UK technology and companies. Addressing the NHS contract, he said: "The Palantir contract was made under the previous government and it is under a different Department. I cannot comment on the details of that, but I hope I have been clear in describing a very different way of doing contracts: putting British companies there and procuring innovation here." Appearing before the House of Commons Science, Innovation and Technology Committee, Lord Vallance said: "I have described how I want to change things going forward in terms of domestic [policy]; in terms of Palantir, I think that is a matter for the Department of Health and Social Care (DHSC). It is a contract with DHSC, and if there are issues around that, they are the ones that need to look at that. I cannot." However, he was pressed by the committee on whether the government could seek an early break in the contract, according to the terms laid out when it was signed. Committee member and Liberal Democrat MP Martin Wrigley said: "These are existing contracts with break points, so the break points must be exploited to move to UK solutions – sovereign solutions – otherwise we just continue doing the same stuff." Lord Vallance responded: "We are not continuing. We are doing something very different." Palantir began as a spy-tech firm with backing from the CIA and heavily supports the controversial US Immigration and Customs Enforcement (ICE) agency. It signed a contract with NHS England in late 2023 under the previous Conservative government. NHS England's public announcement in 2023 said the £330 million investment in the Federated Data Platform (FDP) and associated services contract had been won by Palantir with support from Accenture, PwC, NECS, and Carnall Farrar, and would last seven years. However, the official contract award notice said the contract was set at £182.2 million and would end in February 2027. An earlier separate notice gave the same value, with an option to extend for two years and then two separate single years. NHS England argued the FDP would join up key information currently held in separate NHS systems to tackle some of the big challenges the health service faces coming out of the pandemic. Palantir said the system would help bring down waiting lists, improve patient care, and reduce health inequalities. Since the Labour government came to power in 2024, it has decided to abolish NHS England and merge its responsibilities with those of DHSC, which requires further legislation in Parliament, with a deadline of April 2027.
Before the Science, Innovation and Technology Committee this week, Lord Vallance was clear that DHSC was responsible for deciding the future of the Palantir FDP contract. His Majesty's Treasury is currently reviewing DHSC digital investment plans. Doctors have been told by their union, the British Medical Association, to limit engagement with the NHS FDP, MPs have called for an end to Palantir's involvement in the NHS, and legal and privacy campaign groups have highlighted risks that the system could be used by the Home Office and the police under a future government. Some senior NHS England tech and data leaders involved in the Palantir procurement have moved on or are about to do so. The tender was already in train when John Quinn arrived as CIO in June 2023. He left in March last year. Ming Tang was national director of data and analytics during the pandemic, in the period when NHS England first employed Palantir to help build a data platform under £60 million of contracts awarded without competition. She became chief data and analytics officer during the FDP procurement and is now also interim CIO. She is set to leave in April. CTO Sonia Patel is also set to leave this month. ®
Ambient Computing via Voice AI is about to Enter its Golden Age in 2027 ai_supremacy 13.03.2026 09:40 0.656
Embedding sim.0.7903
Entity overlap0
Title sim.0.1282
Time proximity0.6031
NLP type: product_launch
NLP organization: Genspark
NLP topic: generative ai
NLP country:

Open original

Prospectus. Ambient Computing via Voice AI is about to Enter its Golden Age in 2027. Genspark Workspace 2.0: What's The Big Idea? Daniel Nest explores. OpenAI losing marketshare. Michael Spencer and Daniel Nest, Mar 13, 2026. Good Morning, I'm really bullish on the future of the Voice AI interface. Once agentic AI and AI wearables mature, that dream of "ambient computing" is going to be closer to reality for many consumers. Generative AI is enabling new convenience for consumers that will reshape how we access services, find information, and deal with recurring tasks. This has big implications in both B2C and B2B. AI Wearables will transform Voice AI Convenience. I believe that once better AI wearable devices are launched next year, in 2027, including AI pins (wearables) and smart-glasses form factors, it could be a breakthrough year for consumer AI voice experiences. From smart glasses to new kinds of pendants and pins, I expect Apple to dominate. Mid-to-late 2027, around 18 months from now, is when this should really come to the foreground. For details about Apple's upcoming AI devices, read here. To wrap your head around the ecosystem: a16z. Voice Agents are coming. Voice AI Experiences will improve dramatically in 2027. Smart glasses are going to be good enough to strike a chord with consumers, with strong offerings of both AI (and later AR) capabilities from Meta, Google, Alibaba, Xiaomi, Xreal, RayNeo (雷鸟创新) and of course Apple, among others. The interface of how we interact with AI will become more multi-sensory, hands-off, and accessible. The dream of ambient computing has a lot of potential to evolve in the decade ahead. Apple is going to show OpenAI how AI hardware is done with an AI pin, smart glasses, and more. Since the below infographic was made, dozens of AI voice agent startups rewriting the future have begun scaling. We'll list some of the major ones later on. They begin to mature in the late 2020s. a16z. Consumer Voice Agents will scale in the 2020s. Snapshot: January, 2025. I asked Daniel Nest of the Why Try AI newsletter to help us understand Genspark's new voice agent in Genspark Workspace 2.0. They have even introduced AI calling in partnership with Twilio. This is the point in the week where we summarize some AI news events from the last couple of weeks. ✨ Your AI News Brief 💼 Around 9 min. Key Trend Analysis among Generative AI Apps and AI Tools. Recently, a16z's Olivia Moore released their Top 100 Gen AI Consumer Apps, a series that I've been following religiously. While Claude saw a spike recently, it still has only about one-twentieth the daily active users of ChatGPT. Source: a16z. Olivia Moore breaks down the AI consumer highlights, March 11th, 2026. Voice AI News: Claude Code rolls out Voice mode capabilities. ElevenLabs released Scribe v2. Read Claude Voice Mode. Promising Voice AI startup Hume AI was gobbled up by Google. Voice AI agents are going vertical. ChatGPT's Marketshare is Eroding. While OpenAI's ChatGPT still has a huge base, Gemini, Grok, Claude, and others are gaining marketshare too, a trend that appears to be accelerating. a16z.
Consumer Paid Subscriptions 2026. I expect Gemini, Claude, and Grok to have big years in 2026 and 2027 in consumer premium subscriptions, putting pressure on OpenAI and slowing down their ARR (revenue) growth. Meanwhile, Anthropic's and Cursor (Anysphere)'s enterprise ARR keeps accelerating. a16z via Yipit Data. I've been impressed by the pace of innovation at Genspark, and while I miss their Perplexity-like AI search product, they are also entering the voice AI picture. Genspark Workspace 2.0: What's The Big Idea? Daniel Nest of Why Try AI takes a deep dive into Genspark's evolution of their AI Workspace, and their Voice AI capabilities in particular, with explainer videos. Since we wrote this piece, Genspark has already released Workspace 3.0, just yesterday. Mere months after its Series B funding and "unicorn" status, Genspark launched the latest iteration of its agentic concept: AI Workspace 2.0. AI Workspace 2.0 brings improvements to several existing agents as well as new features. New launches: Speakly: A new dictation assistant (think Wispr Flow and Superwhisper) that you can download for macOS or Windows. It lets you interact with all of Genspark's agents and tools via voice instead of typing. AI Music Agent that can create custom music (think Suno). AI Audio Agent that does voiceovers and narration (think ElevenLabs). Upgrades: AI Inbox now supports automated workflows that perform specific actions like creating daily inbox digests, interacting with external messaging platforms like Slack, analyzing social media performance, etc. AI Creative Slides, AI Image Agent, and AI Video Agent are all more capable and incorporate the improved powers of newer, better underlying models. While the under-the-hood upgrades and new music/audio agents are neat, what Genspark is leaning heavily into is Speakly and the promise of hands-free agentic work. Simply say what you need done, and Genspark's Super Agent does it. At least, that's the idea. To better understand Genspark's vision, I tested the three new features and also conducted a written interview with the company's COO, Wen Sang. Read on to find out if Genspark can be more than the sum of its parts. Testing the new tools. Let's start with my hands-on tests and demos of the three new features. 1. Speakly: The voice agent. On the surface, Speakly is yet another voice dictation app. You speak into the microphone, and text comes out on the other end. But what makes Speakly a great fit for Genspark's infrastructure is the "agent" shortcut that sends your spoken request directly to Genspark's Super Agent, and the baked-in, intelligent processing of spoken input. I demonstrate Speakly's main features in this 7-minute hands-on video. Watch it to learn: How regular voice dictation works. How Speakly can auto-correct filler words and backtracking. Agent mode: Send a request or task to Genspark Super Agent from any screen. Translation mode: Speak in any language (or combination of languages), get English text. Custom modes: From "Buzzwords" mode to "Twitter" mode, Genspark can process and rework your speech into any style or format. You can customize this to your needs. If you want to try Speakly for yourself, grab it here: [Download Speakly for macOS or Windows].
2. AI Music: From requests to music tracks. If you have toyed with AI music sites like Suno or Udio, you already have a good idea of how this works. The agent can use different third-party AI music models to execute your task. But because Genspark agents can coordinate their work with other agents, you can have more elaborate requests that require additional pre-processing. So I tried this: Prompt: Watch this video about the launch of Genspark AI Workspace 2.0 and create a grand soundtrack worthy of this announcement. Ed.: All to say, Genspark has a lot of exciting new products and features (not a sponsor); it is just a really innovative case study that I like to cover, since nobody else is. Try Genspark. Voice AI is Evolving in the 2020s. Where are we seeing concentrations of voice agent startups and products? 🤔 Customer support; sales and lead generation; recruiting and human resources; finance; insurance and legal; logistics; home services and small business operations; healthcare; personal voice agents. Startup Riders, read more: in this article, the Genspark deep dive continues with more Voice AI insights and infographics. A guest post by Daniel Nest: I write about generative AI for the average person. I love experimenting with all GenAI, including AI images, video, music, chatbots, and more.
How Axios uses AI to help deliver high-impact local journalism openai 04.03.2026 00:00 0.655
Embedding sim.0.7574
Entity overlap0
Title sim.0.0957
Time proximity0.9467
NLP type: other
NLP organization: Axios
NLP topic: enterprise ai
NLP country:

Open original

Axios COO Allison Murphy explains how the company uses AI to support local reporters, streamline newsroom workflows, and deliver high-impact local journalism at scale.
Microsoft, Nvidia claim AI speeds approval of nuclear plants the_register_ai 25.03.2026 14:35 0.654
Embedding sim.0.7517
Entity overlap0.0789
Title sim.0.0682
Time proximity0.9787
NLP type: partnership
NLP organization: Microsoft
NLP topic: generative ai
NLP country: United States

Open original

Microsoft and Nvidia claim AI can speed approval of new atomic plants. Effort includes permitting and planning. Dan Robinson, Wed 25 Mar 2026 // 14:35 UTC. Microsoft is working with Nvidia on nuclear power. Not to build it, but to offer AI-driven tools to deal with all the red tape, help with the design work, and optimize operations for nuclear projects. Announcing the move on social media site X, Microsoft President Brad Smith said this latest AI collaboration covers "the full lifecycle from permitting and design to construction and operations." Nuclear has long held promise as a stable source of carbon-free power, Smith added, and the aim of this particular initiative is to bring more of it online sooner. While building a nuclear plant is a highly complex operation, designing one and navigating the Byzantine regulations governing atomic energy can take years, cost hundreds of millions of dollars, and involve an immense amount of data processing and reporting, Microsoft claims. Of course, the Trump administration is seeking to solve this in a different way: by gutting the safety rules and skipping full environmental reviews for new reactors. AI, we're told, is expected to help by making highly complex work repeatable and predictable, and slashing development timelines without sacrificing safety. The system ensures there is a "paper trail," so regulators can verify everything, while each engineering decision is digitally linked to the evidence and regulations for auditing. Microsoft claims its tool, Generative AI for Permitting, reduced the time-intensive legwork to get approvals by 92 percent for Aalo Atomics, a firm working on mass-production of modular atomic reactors. Another, Southern Nuclear, has developed and deployed agents using Microsoft's Copilot to improve consistency in engineering and licensing. For design and engineering, digital twin technology and high-fidelity simulations enable faster iteration, while generative AI handles the drudge work of documentation drafting. Nvidia isn't just aiming at atomic plants with its digital twin technology; it also unveiled Omniverse DSX last year, a blueprint for designing and operating gigawatt-scale AI datacenters – one of the reasons why there is a pressing need for more nuclear power in the first place. More atomic power is seen as a solution for the burgeoning energy demands caused by the AI-driven datacenter building boom, but a nuclear plant typically takes at least five years to construct, while AI's energy demands mean more power is needed now. Microsoft has itself invested in nuke-generated electricity, and has a 20-year power purchase agreement (PPA) with Constellation Energy to restart the infamous Three Mile Island nuclear facility, but even this is not expected to be online before 2028. On the back of all the server farms it is building, Microsoft has seen its greenhouse gas emissions heading in the wrong direction. As The Register reported previously, its emissions have risen by nearly 30 percent since 2020, despite its much-publicized goal of becoming carbon-negative by 2030.
The new AI for nuclear operations initiative brings together Nvidia's Omniverse, Earth 2, CUDA-X, AI Enterprise, PhysicsNeMo, Isaac Sim, and Metropolis with Microsoft's Generative AI for Permitting and Planetary Computer. The combination represents an AI-powered digital ecosystem for nuclear energy on Azure, Microsoft said. ®
Blowing Off Steam: How Power-Flexible AI Factories Can Stabilize the Global Energy Grid nvidia_blog 25.03.2026 11:00 0.653
Embedding sim.0.7481
Entity overlap0.0294
Title sim.0.098
Time proximity0.9881
NLP type: other
NLP organization: Emerald AI
NLP topic: ai infrastructure
NLP country: United Kingdom

Open original

Blowing Off Steam: How Power-Flexible AI Factories Can Stabilize the Global Energy Grid. In a recent white paper, Emerald AI — in collaboration with NVIDIA, Nebius, National Grid and EPRI — showcased how "power-flexible" AI factories can autonomously adjust their power usage during peak demand. March 25, 2026, by Josh Parker. At the half-time whistle of the UEFA EURO 2020 round of 16 football match between England and Germany, millions of viewers stepped away from their screens in the U.K. to do the same thing at the same time — turn on their kettles. National Grid, which provides electricity for England and Wales, saw a demand spike of about 1 gigawatt — an increase equivalent to the average output of a standard nuclear reactor — in a matter of minutes from this countrywide tea break. Grid operators must carefully manage these demand peaks to keep the system stable, and this could become even more difficult as the grid continues to add large new customers. But what if those new customers could actually be flexible and relieve the grid during periods of peak strain? In a recent white paper, Emerald AI — in collaboration with NVIDIA, EPRI, National Grid and Nebius — showcased how "power-flexible" AI factories can autonomously adjust their power usage during peak demand. For AI factories, this could unlock significantly faster grid connections without waiting for massive, years-long infrastructure upgrades. For the public, it helps limit grid build-outs by curbing the peak load that the system needs to serve, helping keep electricity rates affordable for everyday bill payers. Boil the Kettle, Balance the Grid. After successful proof-of-concept trials at AI factories in Arizona, Virginia and Illinois, Emerald AI took its flexible grid solution across the pond last December, bringing the Emerald AI Conductor Platform to Nebius' new AI factory in London, built on NVIDIA infrastructure — among the first of its kind in the U.K. At the AI factory, the research team ran production-grade AI workloads on a cluster of 96 NVIDIA Blackwell Ultra GPUs connected through the NVIDIA Quantum-X800 InfiniBand platform. The NVIDIA System Management Interface was used to retrieve consistent, seconds-level GPU power telemetry. EPRI and National Grid simulated stress scenarios on the power grid — from lightning strikes to long periods of low wind power supply — and sent signals instructing the AI factory, with the help of the Conductor Platform, to temporarily reduce its power use to relieve grid strain. One of these scenarios was the "TV pickup" phenomenon, in which that very same Euro 2020 football match's energy surge was reenacted. As millions of simulated tea kettles were about to be turned on, the AI cluster ramped down its power use — successfully acting as a shock absorber for the abrupt power surge without disrupting the highest-priority AI workloads running on the cluster. [Video: https://blogs.nvidia.com/wp-content/uploads/2026/02/Grid-Responsive-AI-Infrastructure-Chart_v4.mp4] In practice, this means the grid can manage sudden demand swings using existing capacity more efficiently, reducing the need to overbuild permanent infrastructure to meet worst-case peaks and helping keep rates affordable for everyday consumers. "With this technology, AI factories become friendly and helpful grid assets," said Varun Sivaram, founder and CEO of Emerald AI.
"Simultaneously, the AI factories get connected much faster to the grid because they can tap into existing power grids." Stress Relievers, Not Query Crushers. In the Nebius AI factory demonstration, despite the quick ramp-down of energy to power the national tea break, Emerald AI Conductor ensured that the simulated high-priority AI workloads performed at peak throughput, while more flexible jobs were slowed down temporarily. Emerald AI recorded 100% alignment with over 200 power targets that EPRI and National Grid instructed the AI cluster to follow for this experiment. "We did tests that go beyond the ones that have been done so far in the U.S. because we tested not just the GPUs, but also the CPUs and everything that sits around it — as well as the total power consumption of the IT equipment," said Steve Smith, group chief strategy officer of National Grid. "We've proved the value that this technology brings." Scaling London's Grid at Super Speed. London's power grid is constantly working to meet the ever-growing energy needs of its citizens. Its grid operators — including National Grid — face a key bottleneck: constraints in infrastructure upgrades to connect large customers. Plugging flexible AI factories into the grid with solutions like Emerald AI's Conductor Platform won't just help to stabilize energy spikes — it can optimize the use of existing grid infrastructure to propel new industry talent and economic opportunities in the U.K. "We have enormous skills and potential in AI," said Smith. "We're never going to be on the scale of the U.S. in terms of data centers, but relative to the size of the country, we could be — and we're certainly seeing that interest from many of the hyperscalers. So, it gives us the opportunity to play our part as National Grid in helping unlock that economic growth for the country." Four demonstrations in, Emerald AI and NVIDIA are gearing up to put power-flexible AI factories into real-world deployment with the Aurora AI Factory in Virginia, set to open this year. Learn more about the first power-flexible AI factory powered by NVIDIA GPUs.
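The seconds-level telemetry loop mentioned above can be approximated with stock tooling. A minimal sketch, assuming only that nvidia-smi is installed; the query flags are standard, but the power target and poll cadence are invented for illustration:

```python
import subprocess
import time

def gpu_power_draw_watts():
    # nvidia-smi reports instantaneous board power draw, one line per GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [float(line) for line in out.strip().splitlines()]

# Hypothetical grid-event power target for the whole node, in watts.
POWER_TARGET_W = 5000.0

for _ in range(10):  # poll at a seconds-level cadence
    total = sum(gpu_power_draw_watts())
    over = total > POWER_TARGET_W
    note = " - over target, throttle flexible jobs" if over else ""
    print(f"total draw: {total:.0f} W{note}")
    time.sleep(1)
```

A production system like the Conductor Platform layers scheduling on top of such telemetry: deciding which jobs are flexible enough to slow down, and by how much, to hit the target the grid operator requests.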
New ways to learn math and science in ChatGPT openai 10.03.2026 10:00 0.652
Embedding sim.0.7978
Entity overlap0.2222
Title sim.0.2
Time proximity0.2262
NLP type: product_launch
NLP organization: OpenAI
NLP topic: educational technology
NLP country:

Open original

ChatGPT introduces interactive visual explanations for math and science, helping students explore formulas, variables, and concepts in real time.
Into the Omniverse: How Industrial AI and Digital Twins Accelerate Design, Engineering and Manufacturing Across Industries nvidia_blog 12.03.2026 15:00 0.65
Embedding sim.0.7485
Entity overlap0.0465
Title sim.0.1214
Time proximity0.9017
NLP type: partnership
NLP organization: NVIDIA
NLP topic: ai infrastructure
NLP country: United States

Open original

Into the Omniverse: How Industrial AI and Digital Twins Accelerate Design, Engineering and Manufacturing Across Industries. March 12, 2026, by James McKenna. Editor's note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners and enterprises can transform their workflows using the latest advancements in OpenUSD and NVIDIA Omniverse. Industrial AI, digital twins, AI physics and accelerated AI infrastructure are empowering companies across industries to accelerate and scale the design, simulation and optimization of products, processes and facilities before building in the real world. Earlier this month, NVIDIA and Dassault Systèmes announced a partnership that brings together Dassault Systèmes' Virtual Twin platforms, NVIDIA accelerated computing, AI physics open models and NVIDIA CUDA-X and Omniverse libraries. This allows designers and engineers to use virtual twins and companions — trained on physics-based world models — to innovate faster, boost efficiency and deliver sustainable products. Dassault Systèmes' SIMULIA software now uses NVIDIA CUDA-X and AI physics libraries for AI-based virtual twin physics behavior — empowering designers and engineers to accurately and instantly predict outcomes in simulation. NVIDIA is adopting Dassault Systèmes' model-based systems engineering technologies to accelerate the design and global deployment of gigawatt-scale AI factories that are powering industrial and physical AI across industries. Dassault Systèmes will in turn deploy NVIDIA-powered AI factories on three continents through its OUTSCALE sovereign cloud, enabling its customers to run AI workloads while maintaining data residency and security requirements. These efforts are already making a splash across industries, accelerating industrial development and production processes. Industrial AI Simulations, From Car Parts to Cheese Proteins. Digital twins, also known as virtual twins, and physics-based world models are already being deployed to advance industries. In automotive, Lucid Motors is combining cutting-edge simulation, AI physics open models, Dassault Systèmes' tools for vehicle and powertrain engineering and digital twin technology to accelerate innovation in electric vehicles. In life sciences, scientists and researchers are using virtual twins, Dassault Systèmes' science-validated world models and the NVIDIA BioNeMo platform to speed molecule and materials discovery, therapeutics design and sustainable food development. The Bel Group is using technologies from Dassault Systèmes, supported by NVIDIA, to accelerate the development and production of healthier, more sustainable foods for millions of consumers. The company is using Dassault Systèmes' industry world models to generate and study food proteins, creating non-dairy protein options that pair with its well-known cheeses, including Babybel. Using accurate, high-resolution virtual twins allows the Bel Group to study and develop validated research outcomes of food proteins more quickly and efficiently. In industrial automation, Omron is using virtual twins and physical AI to design and deploy automation technology with greater confidence — advancing the shift toward digitally validated production.
In the aerospace industry, researchers and engineers at Wichita State University's National Institute for Aviation Research use virtual twins and AI companions powered by Dassault Systèmes' Industry World Models and NVIDIA Nemotron open models to accelerate the design, testing and certification of aircraft. Learning From and Simulating the Real World. Dassault Systèmes' physics-based Industry World Models are trained to have PhD-level knowledge in fields like biology, physics and material sciences. This allows them to accurately simulate real-world environments and scenarios so teams can test industrial operations end to end — from supply chains to store shelves — before deploying changes in the real world. These virtual models can help researchers and developers with workflows ranging from DNA sequencing to strengthening manufactured materials for vehicles. "Knowledge is encoded in the living world," said Pascal Daloz, CEO of Dassault Systèmes, during his 3DEXPERIENCE World keynote. "With our virtual twins, we are learning from life and are also understanding it in order to replicate it and scale it." Get Plugged In to Industrial AI. Learn more about industrial and physical AI by registering for NVIDIA GTC, running March 16-19 in San Jose, kicking off with NVIDIA founder and CEO Jensen Huang's keynote address on Monday, March 16, at 11 a.m. PT. At the conference: Explore an industrial AI agenda packed with hands-on sessions, customer stories and live demos. Dive into the world of OpenUSD with a special session focused on OpenUSD for physical AI simulation, as well as a full agenda of hands-on OpenUSD learning sessions. Find Dassault Systèmes in the industrial AI and robotics pavilion on the show floor and learn from Florence Hu-Aubigny, executive vice president of R&D at Dassault Systèmes, who'll present on how virtual twins are shaping the next industrial revolution. Get a live look at GTC with our developer community livestream on March 18, where participants can ask questions, request deep dives and talk directly with NVIDIA engineers in the chat. Learn how to build industrial and physical AI applications by attending these sessions at GTC.
Introducing V-RAG: revolutionizing AI-powered video production with Retrieval Augmented Generation aws_ml_blog 19.03.2026 16:45 0.647
Embedding sim.0.7361
Entity overlap0.0263
Title sim.0.1667
Time proximity0.924
NLP type: product_launch
NLP organization: Amazon Web Services
NLP topic: generative ai
NLP country:

Open original

A key development in generative AI is AI-powered video generation. Before AI, creating dynamic video content required extensive resources, technical expertise, and significant manual effort. Today, AI models can generate videos from simple inputs, but organizations still face challenges like unpredictable results. This post introduces Video Retrieval-Augmented Generation (V-RAG), an approach to help improve video content creation. By combining retrieval-augmented generation with advanced video AI models, V-RAG offers an efficient and reliable solution for generating AI videos. Video generation. AI video generation represents a transformative frontier in digital content creation, enabling the automated production of dynamic visual narratives without traditional filming or animation processes. By using deep learning architectures, these systems can synthesize realistic or stylized video sequences. Unlike conventional video production that requires cameras, actors, and extensive post-production, AI generation creates content entirely through computational processes, analyzing patterns from massive training datasets to render coherent visual stories. Individuals and organizations can use this technology to produce visual content with minimal technical expertise, reducing the time, resources, and specialized skills traditionally required. As these models continue to evolve, they promise to fundamentally reshape how visual stories are conceived, produced, and shared across industries ranging from entertainment and marketing to education and communication. Text-to-video generation. Text-to-video generation creates dynamic video content from narrative or thematic text prompts. This technology interprets textual descriptions and transforms them into coherent visual sequences that follow the specified narrative. While text prompts effectively guide the overall theme and storyline, they can sometimes fall short in capturing highly specific visual details with precision. Text-to-video serves as the foundation of AI video creation, where users can generate content based on descriptive language alone. Video generation customization. Text prompting can only get you so far with video generation. There's inherently limited control when relying solely on text descriptions, as models can ignore crucial parts of your prompt or interpret them differently than you intended. Certain visual concepts prove difficult to explain in words alone; additionally, you're constrained by the model's token limit, which caps how detailed your instructions can be. This is where further customization becomes invaluable. Users can use robust customization tools to specify numerous parameters beyond what text can efficiently communicate, such as style, mood, and intricate visual aesthetics. These controls help overcome the limitations of text prompting by providing direct mechanisms to influence the output. Without such capabilities, creators are left hoping the model correctly interprets their intentions rather than actively directing the creative process. Customization bridges the gap between vague generation and precise visual control, making AI video tools truly useful for professional applications. Model fine-tuning. Fine-tuning adapts pre-trained video generation models to specific domains, styles, or use cases.
This process allows organizations to create specialized video generators that excel at their tasks, whether they're producing product demonstrations with consistent branding, generating medical educational content, or creating videos in a distinctive artistic style. Fine-tuning typically involves further training of existing models on carefully curated datasets representing the target domain, allowing the model to learn the unique visual patterns, movements, and stylistic elements required for specialized applications. However, fine-tuning video generation models presents significant challenges. The fundamental obstacle begins with data acquisition, because high-quality video data that's suitable for training is both expensive and difficult to obtain. Organizations need diverse, well-labeled footage in a specific format covering specific use cases while meeting technical quality standards. The computational demands are substantial, representing a major barrier to entry. A single fine-tuning run can require multiple high-end GPUs operating continuously, and retraining to incorporate new capabilities multiplies these costs with each iteration. Even with perfect data and unlimited computational resources, success remains uncertain due to the interconnected nature of video elements like coherence, physical accuracy, lighting consistency, and object persistence. Improvements in one area often lead to unexpected degradation in others, creating complex optimization challenges resistant to simple solutions. Image-to-video. Image-to-video generation complements text-based approaches by offering additional visual control. By using an input image as a reference, users can ensure specific details such as the color, style, and other attributes of objects are accurately represented in the generated video. For example, if a user wants to feature a red purse in their video, providing an image of that exact purse guarantees visual fidelity that text descriptions alone might not achieve. This technique maintains consistency and improves prompt adherence through conditioning, while enabling dynamic movement and integration within the broader narrative context. Image-to-video generation doesn't require any fine-tuning. V-RAG: an effective approach to video generation customization. Video Retrieval-Augmented Generation (V-RAG) builds upon image-to-video technology to expand video customization capabilities. While traditional image-to-video converts a single reference image into motion, V-RAG expands this capability by retrieving a relevant image from a database and incorporating it into video generation. This approach offers several capabilities without requiring any model training or retraining. Organizations can ingest their image collections into a vector database, query it, feed its output to an existing video generation model, and start producing tailored content immediately. V-RAG's efficiency comes from requiring only static images, which are generally more readily available than video training data. These images can be added to the vector database on the fly, making them instantly available for the next generation task without computational delays. Every video generated through this process maintains clear traceability to its source images, creating an auditable trail that enhances verification and debugging capabilities. The system grounds video outputs in specific reference imagery, which is designed to help reduce hallucination risks and manage computational costs.
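The retrieval half of that pipeline is straightforward to sketch. Illustrative only: the CLIP checkpoint, FAISS index, and file names are assumptions, and the retrieved image would then be passed as the conditioning input to an image-to-video model (such as the Amazon Nova Reel service listed in the article's references).

```python
import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP embeds images and text into a shared vector space, so a text
# prompt can retrieve the closest matching reference image.
clip = SentenceTransformer("clip-ViT-B-32")

image_paths = ["red_purse.jpg", "blue_sedan.jpg", "office_lobby.jpg"]  # hypothetical
img_vecs = clip.encode([Image.open(p) for p in image_paths],
                       normalize_embeddings=True)

index = faiss.IndexFlatIP(img_vecs.shape[1])  # inner product = cosine (normalized)
index.add(np.asarray(img_vecs, dtype="float32"))

prompt = "a red purse on a cafe table, morning light"
q = clip.encode([prompt], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q, 1)

reference = image_paths[ids[0][0]]
print(f"retrieved {reference} (score {scores[0][0]:.2f})")
# `reference` would now condition the video model, grounding the clip
# in that exact object and preserving the traceability described above.
```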
Organizations can maintain separate visual knowledge bases for different departments or use cases, streamlining compliance as all source materials can be thoroughly vetted before entering the system.

[Figure: Logical diagram of V-RAG]

The evolving nature of V-RAG

V-RAG represents not a fixed technology, but an evolving framework that will continuously expand as AI capabilities advance. While current implementations primarily utilize image databases, the fundamental retrieval augmentation approach is modality-agnostic. As multimodal AI models mature, V-RAG systems will naturally incorporate audio samples, video snippets, and 3D models as reference points during generation. Future iterations will likely support synthesizing complete audio-visual experiences, generating videos with perfectly synchronized speech, realistic environmental sounds, and custom musical scores based on retrieved audio patterns. This flexibility positions V-RAG as a foundational paradigm rather than a specific implementation, allowing it to adapt alongside broader AI advancements while maintaining its core benefits of traceability, efficiency, and reduced hallucination. The ultimate vision extends beyond even audiovisual content to potentially incorporating interactive elements, creating a comprehensive multimodal generation system that can produce engaging outputs while maintaining grounding in reliable reference material.

Key benefits of V-RAG

Generating videos using images retrieved through V-RAG offers significant benefits like increased accuracy, relevance, and contextual understanding. This approach grounds generated content in a specific knowledge base to help guide video creation. This reduces hallucination and ensures that the video aligns with information from the image source, making it particularly useful for educational, documentary, or explainer video formats. Key benefits of using V-RAG with images include:

- Factual accuracy – Ensuring the generated video content is grounded in real information, reducing the likelihood of inaccurate or misleading visuals.
- Contextual relevance – Retrieving images that are highly relevant to the given topic or query, leading to a more cohesive and focused video narrative.
- Dynamic content generation – Allowing for flexible video creation by dynamically selecting and assembling images based on user input or changing requirements.
- Reduced development time – Using a pre-existing knowledge base to cut down on the time needed to gather and curate visual assets for video creation.
- Personalized content – Tailoring videos to individual user needs, generating content designed to be relevant and engaging.
- Scalability – Designed to scale by ingesting additional images into the vector database.

Real-world applications of V-RAG

Real-world applications of V-RAG are vast and varied. In education, V-RAG can automatically create instructional videos by pulling relevant images from a subject matter knowledge base. For personalized content, V-RAG can tailor video content to individual users by retrieving images based on their specific interests. For marketing, V-RAG can create targeted video ads by pulling images that align with specific demographics or product features.

Conclusion

As AI technology continues to evolve, V-RAG's flexible framework positions it to incorporate new modalities and capabilities, from advanced audio integration to interactive elements.
The AWS implementation demonstrates how organizations can already begin using this technology through existing cloud services, making AI video generation accessible to a broader range of users. Looking ahead, V-RAG's impact on video content creation will likely extend far beyond its current applications in education and marketing. As the technology matures, it has the potential to make video production accessible while supporting quality, accuracy, and customization. This approach offers a promising path for AI-powered video generation, enabling organizations to create compelling visual content.

References

- Generating Videos with Amazon Nova Reel
- Amazon Nova Reel
- Amazon OpenSearch Service

Acknowledgement

Special thanks to Vishwa Gupta, Shuai Cao and Seif for their contribution.

About the authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Madhunika Mikkili is a Data and Machine Learning Engineer at AWS. She is passionate about helping customers achieve their goals using data analytics and machine learning.

Maria Masood specializes in agentic AI, reinforcement fine-tuning, and multi-turn agent training. She has expertise in machine learning, spanning large language model customization, reward modeling, and building end-to-end training pipelines for AI agents. A sustainability enthusiast at heart, Maria enjoys gardening and making lattes.
How Balyasny Asset Management built an AI research engine for investing openai 06.03.2026 07:00 0.646
Embedding similarity: 0.7482
Entity overlap: 0.1818
Title similarity: 0.0917
Time proximity: 0.8155
NLP type: other
NLP organization: Balyasny
NLP topic: ai agents
NLP country:

Open original

See how Balyasny built an AI research system with GPT-5.4, rigorous model evaluation, and agent workflows to transform investment analysis at scale.
Military AI Policy Needs Democratic Oversight ieee_spectrum_ai 08.03.2026 10:00 0.644
Embedding similarity: 0.765
Entity overlap: 0.1111
Title similarity: 0.0261
Time proximity: 0.7641
NLP type: regulation
NLP organization: United States Department of Defense
NLP topic: ai governance
NLP country: United States

Open original

A simmering dispute between the United States Department of Defense and Anthropic has now escalated into a full-blown confrontation, raising an uncomfortable but important question: Who gets to set the guardrails for military use of artificial intelligence—the executive branch, private companies, or Congress and the broader democratic process?

The conflict began when Defense Secretary Pete Hegseth reportedly gave Anthropic CEO Dario Amodei a deadline to allow the DOD unrestricted use of its AI systems. When the company refused, the administration moved to designate Anthropic a supply chain risk and ordered federal agencies to phase out its technology, dramatically escalating the standoff. Anthropic has refused to cross two lines: allowing its models to be used for domestic surveillance of United States citizens and enabling fully autonomous military targeting. Hegseth has objected to what he has described as “ideological constraints” embedded in commercial AI systems, arguing that determining lawful military use should be the government’s responsibility—not the vendor’s. As he put it in a speech at Elon Musk’s SpaceX last month, “We will not employ AI models that won’t allow you to fight wars.” Stripped of rhetoric, this dispute resembles something relatively straightforward: a procurement disagreement.

Procurement Policies

In a market economy, the U.S. military decides what products and services it wants to buy. Companies decide what they are willing to sell and under what conditions. Neither side is inherently right or wrong for taking a position. If a product does not meet operational needs, the government can purchase from another vendor. If a company believes certain uses of its technology are unsafe, premature, or inconsistent with its values or risk tolerance, it can decline to provide them. For example, a coalition of companies has signed an open letter pledging not to weaponize general-purpose robots. That basic symmetry is a feature of the free market.

Where the situation becomes more complicated—and more troubling—is in the decision to designate Anthropic a “supply chain risk.” That tool exists to address genuine national security vulnerabilities, such as foreign adversaries. It is not intended to blacklist an American company for rejecting the government’s preferred contractual terms. Using this authority in that manner marks a significant shift—from a procurement disagreement to the use of coercive leverage. Hegseth has declared that “effective immediately, no contractor, supplier, or partner that does business with the U.S. military may conduct any commercial activity with Anthropic.” This action will almost certainly face legal challenges, but it raises the stakes well beyond the loss of a single DOD contract.

AI Governance

It is also important to distinguish between the two substantive issues Anthropic has reportedly raised. The first, opposition to domestic surveillance of U.S. citizens, touches on well-established civil liberties concerns. The U.S. government operates under constitutional constraints and statutory limits when it comes to monitoring Americans. A company stating that it does not want its tools used to facilitate domestic surveillance is not inventing a new principle; it is aligning itself with longstanding democratic guardrails. To be clear, the DOD is not affirmatively asserting that it intends to use the technology to surveil Americans unlawfully.
Its position is that it does not want to procure models with built-in restrictions that preempt otherwise lawful government use. In other words, the Department of Defense argues that compliance with the law is the government’s responsibility—not something that needs to be embedded in a vendor’s code. Anthropic, for its part, has invested heavily in training its systems to refuse certain categories of harmful or high-risk tasks, including assistance with surveillance. The disagreement is therefore less about current intent than about institutional control over constraints: whether they should be imposed by the state through law and oversight, or by the developer through technical design.

The second issue, opposition to fully autonomous military targeting, is more complex. The DOD already maintains policies requiring human judgment in the use of force, and debates over autonomy in weapons systems are ongoing within both military and international forums. A private company may reasonably determine that its current technology is not sufficiently reliable or controllable for certain battlefield applications. At the same time, the military may conclude that such capabilities are necessary for deterrence and operational effectiveness. Reasonable people can disagree about where those lines should be drawn.

But that disagreement underscores a deeper point: The boundaries of military AI use should not be settled through ad hoc negotiations between a Cabinet secretary and a CEO. Nor should they be determined by which side can exert greater contractual leverage. If the U.S. government believes certain AI capabilities are essential to national defense, that position should be articulated openly. It should be debated in Congress, and reflected in doctrine, oversight mechanisms, and statutory frameworks. The rules should be clear—not only to companies, but to the public. The U.S. often distinguishes itself from authoritarian regimes by emphasizing that power operates within transparent democratic institutions and legal constraints. That distinction carries less weight if AI governance is determined primarily through executive ultimatums issued behind closed doors.

There is also a strategic dimension. If companies conclude that participation in federal markets requires surrendering all deployment conditions, some may exit those markets. Others may respond by weakening or removing model safeguards to remain eligible for government contracts. Neither outcome strengthens U.S. technological leadership.

The DOD is correct that it cannot allow potential “ideological constraints” to undermine lawful military operations. But there is a difference between rejecting arbitrary restrictions and rejecting any role for corporate risk management in shaping deployment conditions. In high-risk domains—from aerospace to cybersecurity—contractors routinely impose safety standards, testing requirements, and operational limitations as part of responsible commercialization. AI should not be treated as uniquely exempt from that practice. Moreover, built-in safeguards need not be seen as obstacles to military effectiveness. In many high-risk sectors, layered oversight is standard practice: internal controls, technical fail-safes, auditing mechanisms, and legal review operate together. Technical constraints can serve as an additional backstop, reducing the risk of misuse, error, or unintended escalation.

Congress is AWOL

The DOD should retain ultimate authority over lawful use.
But it need not reject the possibility that certain guardrails embedded at the design level could complement its own oversight structures rather than undermine them. In some contexts, redundancy in safety systems strengthens, not weakens, operational integrity. At the same time, a company’s unilateral ethical commitments are no substitute for public policy. When technologies carry national security implications, private governance has inherent limits. Ultimately, decisions about surveillance authorities, autonomous weapons, and rules of engagement belong in democratic institutions.

This episode illustrates a pivotal moment in AI governance. AI systems at the frontier of technology are now powerful enough to influence intelligence analysis, logistics, cyber operations, and potentially battlefield decision-making. That makes them too consequential to be governed solely by corporate policy—and too consequential to be governed solely by executive discretion. The solution is not to empower one side over the other. It is to strengthen the institutions that mediate between them. Congress should clarify statutory boundaries for military AI use and investigate whether sufficient oversight exists. The DOD should articulate detailed doctrine for human control, auditing, and accountability. Civil society and industry should participate in structured consultation processes rather than episodic standoffs, and procurement policy should reflect those publicly established standards.

If AI guardrails can be removed through contract pressure, they will be treated as negotiable. However, if they are grounded in law, they can become stable expectations. Democratic constraints on military AI belong in statute and doctrine—not in private contract negotiations.

This article is adapted by the author with permission from Tech Policy Press. Read the original article.
This startup wants to change how mathematicians do math mit_tech_review 25.03.2026 13:59 0.643
Embedding similarity: 0.7299
Entity overlap: 0.0938
Title similarity: 0.1127
Time proximity: 0.9703
NLP type: product_launch
NLP organization: Axiom Math
NLP topic: mathematical reasoning
NLP country: United States

Open original

Axiom Math, a startup based in Palo Alto, California, has released a free new AI tool for mathematicians, designed to discover mathematical patterns that could unlock solutions to long-standing problems. The tool, called Axplorer, is a redesign of an existing one called PatternBoost that François Charton, now a research scientist at Axiom, co-developed in 2024 when he was at Meta. PatternBoost ran on a supercomputer; Axplorer runs on a Mac Pro. The aim is to put the power of PatternBoost, which was used to crack a hard math puzzle known as the Turán four-cycles problem, in the hands of anyone who can install Axplorer on their own computer.

Last year, the US Defense Advanced Research Projects Agency set up a new initiative called expMath—short for Exponentiating Mathematics—to encourage mathematicians to develop and use AI tools. Axiom sees itself as part of that drive. Breakthroughs in math have enormous knock-on effects across technology, says Charton. In particular, new math is crucial for advances in computer science, from building next-generation AI to improving internet security.

Most of the successes with AI tools have involved finding solutions to existing problems. But finding solutions is not all that mathematicians do, says Axiom Math founder and CEO Carina Hong. Math is exploratory and experimental, she says. MIT Technology Review met with Charton and Hong last week for an exclusive video chat about their new tool and how AI in general could change mathematics.

Math by chatbot

In the last few months, a number of mathematicians have used LLMs, such as OpenAI’s GPT-5, to find solutions to unsolved problems, especially ones set by the 20th-century mathematician Paul Erdős, who left behind hundreds of puzzles when he died. But Charton is dismissive of those successes. “There are tons of problems that are open because nobody looked at them, and it’s easy to find a few gems you can solve,” he says. He’s set his sights on tougher challenges—“the big problems that have been very, very well studied and famous people have worked on them.”

The Turán four-cycles problem that PatternBoost cracked is one such problem, says Charton. (The problem is an important one in graph theory, a branch of math that’s used to analyze complex networks such as social media connections, supply chains, and search engine rankings. Imagine a page covered in dots. The puzzle involves figuring out how to draw lines between as many of the dots as possible without creating loops that connect four dots in a row.) Axiom Math says it has used Axplorer to match or improve on the best-known results for two other big problems in graph theory as well.

“LLMs are extremely good if what you want to do is derivative of something that has already been done,” says Charton. “This is not surprising—LLMs are pretrained on all the data that there is. But you could say that LLMs are conservative. They try to reuse things that exist.” However, there are lots of problems in math that require new ideas, insights that nobody has ever had. Sometimes those insights come from spotting patterns that hadn’t been spotted before. Such discoveries can open up whole new branches of mathematics.

PatternBoost was designed to help mathematicians find new patterns. Give the tool an example and it generates others like it. You select the ones that seem interesting and feed them back in. The tool then generates more like those, and so on.
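That loop is simple enough to caricature in a few lines of Python. The sketch below captures only the generate-select-feed-back shape of the workflow; the real tool trains a generative model on the kept examples, and the "interestingness" judgment belongs to the mathematician, not a scoring function:

```python
import random

def mutate(example: list[int]) -> list[int]:
    # Stand-in for the "generate more like these" step (the real tool
    # samples from a model trained on the currently kept examples).
    out = example[:]
    i = random.randrange(len(out))
    out[i] += random.choice([-1, 1])
    return out

def interestingness(example: list[int]) -> float:
    # Stand-in for human judgment; here we simply prefer large sums.
    return sum(example)

pool = [[0] * 8]  # seed example(s)
for _ in range(10):
    candidates = [mutate(random.choice(pool)) for _ in range(200)]
    # Keep the most interesting candidates and feed them back in.
    pool = sorted(candidates, key=interestingness, reverse=True)[:20]

print(max(pool, key=interestingness))
```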
It’s a similar idea to Google DeepMind’s AlphaEvolve, a system that uses an LLM to come up with novel solutions to a problem. AlphaEvolve keeps the best suggestions and asks the LLM to improve on them.

Special access

Researchers have already used both AlphaEvolve and PatternBoost to discover new solutions to long-standing math problems. The trouble is that those tools run on large clusters of GPUs and are not available to most mathematicians. Mathematicians are excited about AlphaEvolve, says Charton. “But it’s closed—you need to have access to it. You have to go and ask the DeepMind guy to type in your problem for you.” And when Charton solved the Turán problem with PatternBoost, he was still at Meta. “I had literally thousands, sometimes tens of thousands, of machines I could run it on,” he says. “It ran for three weeks. It was embarrassing brute force.”

Axplorer is far faster and far more efficient, according to the team at Axiom Math. Charton says it took Axplorer just 2.5 hours to match PatternBoost’s Turán result. And it runs on a single machine.

Geordie Williamson, a mathematician at the University of Sydney, who worked on PatternBoost with Charton, has not yet tried Axplorer. But he is curious to see what mathematicians do with it. (Williamson still occasionally collaborates with Charton on academic projects but says he is not otherwise connected to Axiom Math.) Williamson says Axiom Math has made several improvements to PatternBoost that (in theory) make Axplorer applicable to a wider range of mathematical problems. “It remains to be seen how significant these improvements are,” he says. “We are in a strange time at the moment, where lots of companies have tools that they’d like us to use,” Williamson adds. “I would say mathematicians are somewhat overwhelmed by the possibilities. It is unclear to me what impact having another such tool will be.”

Hong admits that there are a lot of AI tools being pitched at mathematicians right now. Some also require mathematicians to train their own neural networks. That’s a turnoff, says Hong, who is a mathematician herself. Instead, Axplorer will walk you through what you want to do step by step, she says. The code for Axplorer is open source and available via GitHub. Hong hopes that students and researchers will use the tool to generate sample solutions and counterexamples to problems they’re working on, speeding up mathematical discovery.

Williamson welcomes new tools and says he uses LLMs a lot. But he doesn’t think mathematicians should throw out the whiteboards just yet. “In my biased opinion, PatternBoost is a lovely idea, but it is certainly not a panacea,” he says. “I’d love us not to forget more down-to-earth approaches.”
We broke our agents, so you don't have to towards_ai 04.03.2026 15:03 0.641
Embedding similarity: 0.7543
Entity overlap: 0.0417
Title similarity: 0.0392
Time proximity: 0.8571
NLP type: product_launch
NLP organization: Agentic AI Engineering
NLP topic: ai agents
NLP country:

Open original

If this sounds familiar, you’re not alone: 2025 gave us agent hype. It didn’t give us a reliable way to build them. Most developers are still guessing: which tools to use, how to wire the system, and how to catch failures with evals and monitoring before users do. So after nine months of building, breaking, rebuilding, and stress-testing, Agentic AI Engineering is finally live. Our newest course, built together with , is designed to teach you how to design, build, evaluate, and deploy autonomous AI systems. See what you’ll build (syllabus + projects)

Here’s what early students said after going through the material:

“Excellent in depth handling of tradeoffs in evaluating and deploying agent based solutions. A useful mixture of theory and practice, learnt the hard way by expert practitioners.” — Cathal Curtin

“Every AI Engineer needs course like that.” — Ahmed Medhat

“Industry-focused, emphasizing real-world constraints rather than flashy demos, and highly hands-on.” — Abreham Melese

What You Will Build

In the course, you’ll build two agent systems and learn how to keep them reliable when the environment stops being friendly: when tools fail, inputs get messy, latency matters, and “it worked once” isn’t useful. You’ll build a Research Agent that runs iterative loops, integrates real tools, produces structured artifacts, and supports human-in-the-loop checkpoints with clear stopping conditions. Then you’ll build a Writing Workflow Agent that turns that research into structured, multi-modal outputs using evaluator–optimizer patterns, orchestration, versioning, and state. But the core of the course is the reliability layer most agent content skips: you’ll design eval datasets and human-in-the-loop processes, implement LLM judges and pass/fail checks, add observability with tracing, and set up monitoring so you can debug regressions quickly and improve the system deliberately, rather than guessing. Check out the full course details →

Who Is This For?

This is engineering-heavy and opinionated, designed for developers who want depth. You’ll feel at home if you’re comfortable with Python + LLM APIs, have basic cloud familiarity, and don’t mind debugging failures that aren’t clean. We built the course by starting with a system we’d actually use, pushing it until it broke, then turning those failure modes into the curriculum, refined through 180 alpha testers. The goal is to prepare you for what agents are judged on in 2026: operational reliability—measurable quality, inspectable behavior, and controlled autonomy. If your goal is to build systems that survive production and the AI era, start here.

The early-bird seats sold out in under a week. The next 100 seats are now $499 (the lowest available price after early bird). You get lifetime access, ongoing updates, Discord access, live introductory calls, and a 30-day refund if you go through the early material and realize it’s not what you need. Get access now →
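For readers new to the "LLM judge" pattern mentioned above, here is a minimal sketch of the idea. It is generic and illustrative, not taken from the course: `call_llm` is a placeholder for whatever chat-completion client you use, and the rubric is invented.

```python
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.
Rubric: the answer must cite at least one source and directly address the question.
Question: {question}
Answer: {answer}
Reply with JSON only: {{"pass": true or false, "reason": "..."}}"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your provider's chat-completion call.
    raise NotImplementedError

def judge(question: str, answer: str) -> bool:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw)
    # Pass/fail checks like this, run over a fixed eval dataset on every
    # change, are what turn "it worked once" into a regression signal.
    return bool(verdict["pass"])
```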
ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text import_ai 16.03.2026 12:30 0.641
Embedding similarity: 0.8328
Entity overlap: 0.02
Title similarity: 0.1277
Time proximity: 0.0015
NLP type: scientific_publication
NLP organization: University of Tübingen
NLP topic: benchmarking
NLP country:

Open original

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Subscribe now

Can LLMs autonomously refine other LLMs for new tasks? Somewhat.
…PostTrainBench shows startling growth in AI capabilities at post-training…

AI-driven R&D might be the most important thing in all of AI, because it helps us understand whether AI systems might eventually build their own successors. So far, much of the focus on AI R&D has been on components that support AI development (e.g., autonomous creation of AI kernels) or on training base models (e.g., the NanoGPT speedrun benchmark). But there’s been less attention paid to fine-tuning: the task of adapting an existing LLM to a new dataset or behavior. Researchers from the University of Tübingen, the Max Planck Institute for Intelligent Systems, and AI research organization Thoughtful Lab want to change that with PostTrainBench, a benchmark which targets a specific aspect of post-training: improving performance against a given dataset. “Post-training is how raw language models become useful”, the authors write. “Given a clear objective and limited compute, can today’s agents do the technical work?”. The answer appears to be ‘yes, but not as well as humans’.

What are the key features of PostTrainBench?

- End-to-end: “Agents must build their entire training pipeline from scratch”.
- Autonomous: “Agents operate with full autonomy over data sources, training methods, and experimental strategy.”
- Resource-bounded: “Each run is constrained to 10 hours on a single H100 GPU”.
- Integrity-preserving: “Agents may not train on benchmark test data, modify the evaluation harness, or substitute a different model.”

How PostTrainBench works: “We give a frontier coding agent — Claude Code, Codex CLI, or Gemini CLI — a base language model and a target benchmark”.

4 models and 7 benchmarks: The initial eval runs on four models: Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B. It tests these models across seven distinct benchmarks: AIME 2025, GSM8K, GPQA, HumanEval, BFCL, Arena-Hard, HealthBench-Easy.

Results - big models win, especially Opus 4.6: “The top-performing agent — Opus 4.6 running on Claude Code — scores 23.2%, about 3× higher than the 7.5% base model average.” But humans are still much better: “Yet this is still less than half the 51.1% achieved by human teams who post-train these same base models at their home labs”. Fast progress: “The gap is significant but narrowing quickly: Claude Sonnet 4.5 scored 9.9% in September 2025, while GPT-5.2 reached 21.5% just months later.”

Things that make you go ‘uh oh’ - reward hacking: While running this benchmark the authors saw numerous instances of AI models trying to game the benchmark to get a high score. These instances included:

- Direct benchmark ingestion: “Agents loaded the benchmark evaluation dataset directly via Hugging Face and used it as training data”.
- Hardcoded benchmark problems: “Agents embedded evaluation questions directly into data preparation scripts disguised as “synthetic” examples”.
- Evaluation-guided data generation: “Some agents reverse engineered the evaluation… Kimi K2.5 read HealthBench evaluation files to extract theme distributions and rubric criteria, then crafted training data tailored to match”.
- Indirect contamination via intermediate datasets: “Opus 4.6 loaded ‘CodeFeedback-Filtered-Instruction’ which contains HumanEval-derived problems. This form of contamination is harder to detect but equally problematic.”

Smart agents reward hack more: “More capable agents appear better at finding exploitable paths: identifying specific benchmark samples to embed, reverse-engineering evaluation failure patterns, and even attempting to obscure contamination through cosmetic modifications such as renaming functions,” they write. For example, “the Codex agent modified the Inspect AI evaluation framework code to inflate scores, and Claude downloaded an instruction-tuned model instead of fine-tuning the base model”.

Why this matters - rapid progress towards an “AI for everything” future: Benchmarks like PostTrainBench give us a sense of how quickly AI systems are improving at the fundamental tasks of AI research, serving both as an eval of long-time-horizon agentic autonomy, as well as something that speaks to the potential for compounding acceleration of AI development itself. “The gap between agent performance (23.2%) and instruction-tuned baselines (51.1%) suggests that full automation of post-training remains out of reach for now, but the rapid improvement across model generations—from 9.9% for Sonnet 4.5 to 23.2% for Opus 4.6 within roughly six months—implies this gap may close faster than expected,” the researchers write. Imagine where we’ll be in two years - we’ll certainly have AI models that are smart enough to point themselves at a specific objective, find an open weight model, then autonomously improve it to get better performance at that task. The era of ephemeral, custom AI systems, built and budded off into the world like spores from mushrooms, draws near. Are you ready for this new ecosystem you will find yourself in? I am not. But nonetheless it approaches.

Check out the blogpost: Introducing PostTrainBench (Thoughtful, blog).
Read more: PostTrainBench: Can LLM Agents Automate LLM Post-Training? (arXiv).
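As an aside: the crudest of these hacks are catchable with overlap checks between training data and the eval set. Below is a minimal n-gram sketch of that kind of detector (my illustration, not the benchmark's actual tooling). Note that the cosmetic rewrites the authors observed, such as renaming functions, would slip straight past it.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word-level n-grams in a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs: list[str], test_items: list[str],
                       n: int = 8) -> float:
    """Fraction of eval items sharing any n-gram with the training corpus."""
    train: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train)
    return flagged / max(1, len(test_items))

# Exact ingestion of an eval question is flagged; a paraphrase would not be.
print(contamination_rate(
    ["q: what is the integral of x squared from zero to one a: one third"],
    ["what is the integral of x squared from zero to one"],
    n=6,
))
```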
***

COVENANT-72B: Challenging the political economy of AI via distributed training:
…Distributed training via the blockchain notches up a meaningful win…

A bunch of people have used the blockchain to coordinate the distributed training run of a 72B parameter model which matches the performance of LLaMA-2, a model trained and released by Facebook in 2023. The model, Covenant 72B, is a dense decoder-only Transformer architecture model built in the LLaMA-3 style. “Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run,” writes Covenant AI, an organization dedicated to doing AI development on top of the blockchain.

Further details about the model and how it was trained: The model itself is basically a standard LLM that you would’ve been pleased to play with in 2023 or 2024, though might be a bit old fashioned in 2026. The truly unique aspect of it comes from it being trained in a distributed way, where ~20 distinct peers, each running 8xB200 GPUs, helped train it. Training was coordinated via Gauntlet, software developed by Covenant that runs on top of the Bittensor blockchain under Subnet 3. Gauntlet “enables permissionless training coordinated using a blockchain protocol by introducing a validator that scores submitted pseudo-gradients and selects which participants contribute to the global aggregation each round and broadcasts them to the network”. “In COVENANT-72B, each peer runs a SparseLoCo replica and the cross-peer communications occur through SparseLoCo’s heavily compressed pseudo-gradients,” the authors write. “Within each peer, 8×B200 GPUs use dynamic FSDP to shard model parameters, gradients, and training states across local GPUs.”

Data: “The training data comprises ∼1.1T tokens in total, split between the main and annealing phases. The main phase (∼1.09T tokens) consists of web text from DCLM, while the annealing phase uses higher-quality data [3, 5] (∼14.2B tokens). Specifically, the annealing phase uses a curated blend of instruction (∼27%), synthetic web (∼20%), code (15%), math (13%), and ~25% pre-training replay data from natural web text to mitigate forgetting”.

Performance: On MMLU, Covenant-72B gets a score of 67.1, versus 32.7 for INTELLECT-1 (a smaller AI model built via distributed training by Prime Intellect), and 65.7 for LLaMA-2-70B. A version of Covenant-72B that has been fine-tuned on ~15B tokens for conversational interaction has similarly good scores, getting 67.4 on MMLU versus 67.9 for K2-Chat (an open source model developed in 2025) and 63.1 for LLaMA-2-70B-Chat. For MATH, it gets 26.3, versus 19.1 for K2-Chat, and 10.7 for LLaMA-2-70B. “Compared to centralized-cluster training runs of similar parameter count, COVENANT-72B is broadly competitive. Notably, these centralized baselines were trained with conventional datacenter infrastructure and, in the case of LLaMA-2-70B, on substantially more tokens (2T vs. ∼1.1T),” they write.

Why this matters - who owns the future?: Distributed training is a technique that can change the political economy of AI by shifting the people at the frontier from monolithic ‘compute singletons’ (like labs such as Anthropic and OpenAI, and clouds like Google) to a larger federated collective. But for that to be true, distributed training needs to catch up to the frontier (more discussion in the Epoch report covered in Import AI 439) - as impressive as Covenant is, it’s mostly a demonstration that distributed training can build some non-trivial models that have vague utility, but that’s a long way from the frontier - modern frontier models are trained on tens to hundreds of thousands of chips, whereas this was trained on perhaps ~160 or so (20 peers * 8 chips apiece). Nonetheless, it’s an important technology to track, and I could imagine a world where on-device AI features a lot of models developed via distributed training techniques, while on-cloud AI mostly runs on proprietary models trained on huge amounts of compute.

Read more: Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet (arXiv).
Get the model here: Covenant (HuggingFace).
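SparseLoCo itself involves more machinery (local optimizer steps, error feedback, quantization), so treat the following as a sketch of just the core trick: communicating only the top-k entries of each pseudo-gradient, which is what makes cross-peer traffic cheap enough to run over the internet.

```python
import numpy as np

def compress_topk(update: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of a pseudo-gradient."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    # Return a sparse (indices, values) pair instead of the dense update.
    return idx, update[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    out = np.zeros(size, dtype=vals.dtype)
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000).astype(np.float32)
idx, vals = compress_topk(g, k=10_000)        # ~1% of entries survive
g_hat = decompress(idx, vals, g.size)
# Schemes like this typically carry the residual (g - g_hat) forward as
# error feedback so dropped coordinates are not lost, just delayed.
residual = g - g_hat
print(f"kept {vals.size / g.size:.1%} of entries")
```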
***

If AI writes all the world’s software, we should invest more in verification:
…Can we just rewrite most of our software into Lean?...

Leonardo de Moura, a scientist who is also the Chief Architect of the Lean Focused Research Organization (FRO), thinks that the rise of AI for the creation of new software means that humans need to invest a lot more in verification and testing infrastructure - and he has an interesting idea for how to do it. Of course, someone who loves Lean, a programming language dedicated to building correct and formally verified code, would think this. But his arguments are quite persuasive, and generally map onto the idea that if AI eats the economy we should expect a lot of human value to shift towards verification of the code and systems that AI develops (Import AI 447).

Why verification matters: “The friction of writing code manually used to force careful design. AI removes that friction, including the beneficial friction. The answer is not to slow AI down. It is to replace human friction with mathematical friction: let AI move fast, but make it prove its work,” he writes. “Verification, testing, and specification have always been the bottleneck, not implementation… the value is not in the verification workforce. It is in what verified delivery enables.”

A proof of concept for this futuristic world: The Lean FRO recently helped build a proof of concept for what this kind of verified world might look like; they had an AI agent convert zlib, a C compression library, to Lean. “The result demonstrates that AI can convert production software to a verified form today. This was not expected to be possible yet,” he writes. The conversion involved four steps:

1. The LLM (Claude) made a clean Lean implementation of the zlib compression format, including the DEFLATE algorithm it uses.
2. They ran the rewritten zlib through the library’s test suite and it passed, confirming equivalence.
3. Key properties were stated and proved as mathematical theorems - for example, a machine-checked proof that ensures that decompressing a compressed buffer always returns the original data.
4. Now, an optimized version of the library is being developed and proved equivalent to the verified model.

A verification platform: Moura imagines a world where we re-develop the critical software stack of the world to have mathematical proofs built into it. “The goal is a verified software stack: open source, freely available, mathematically guaranteed correct. Developers building critical systems choose verified components the way they choose open-source libraries today, except these carry proofs, not just tests,” he writes. “The target is the foundation of the modern software stack: cryptography, because everything else trusts it. Core libraries (data structures, algorithms, compression) because they are the building blocks of all software. Storage engines like SQLite, embedded in every device on earth. Parsers and protocol implementations (JSON, HTTP, DNS, certificate validation) because every message passes through them. And compilers and runtimes, because they build everything else,” he writes. “Each verified component is a permanent public good… Once verified components are cheap, you compose them with confidence.”
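The round-trip property in step 3 is easy to state in Lean. Below is a toy illustration of the shape of such a theorem, using a trivially invertible stand-in codec; none of these names come from the actual zlib port, which proves the analogous statement for real DEFLATE and is enormously harder.

```lean
-- Toy stand-in codec: reversing a byte list is trivially invertible.
-- The real port proves this shape of statement for actual compression.
def encode (bs : List UInt8) : List UInt8 := bs.reverse
def decode (bs : List UInt8) : List UInt8 := bs.reverse

-- A machine-checked round-trip guarantee: decoding an encoded buffer
-- always returns the original data.
theorem decode_encode (bs : List UInt8) : decode (encode bs) = bs := by
  simp [encode, decode]
```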
Why this matters - the world needs infrastructure it can rely on: It seems like we’re heading to a world where AI writes the vast majority of the world’s software. Given that, we need to figure out how we relate to this world - my suspicion is a lot of human labor is going to shift to analyzing and verifying the work of AI systems, so it seems sensible to invest in some fundamental infrastructure that can guarantee a higher level of verification and reliability in the software built by AI.

Read more: When AI Writes the World’s Software, Who Verifies It? (Leonardo de Moura blog).

***

Computer vision is a lot harder and less general than generative text:
…Meta paper on forest canopy prediction shows how tricky computer vision is…

Facebook, the World Resources Institute, and the University of Maryland have built CHMv2, “a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models”. CHMv2 is a useful artifact for people who want to understand how dense foliage is around the world, or analyze newly collected imagery for foliage depth. The dataset and model are also a useful illustration of how challenging developing computer vision systems is, compared to generative text models.

How they built it: CHMv2 is an improvement on an earlier version of the same dataset, CHMv1. To improve it, Facebook did the following: “We replace the DINOv2-H encoder with the more capable DINOv3 Sat-L backbone, expand and rigorously clean a geographically diverse ALS [Airborne Laser Scanning] training corpus, and apply improved RGB-CHM registration to reduce label noise. We further introduce a loss formulation tailored to canopy height distributions and structural variability.” The decoder loss formulation in particular illustrates how much care needs to be put into computer vision: “The final loss is the combination of SiLog loss, progressively annealed and replaced by a Charbonnier loss, with the progressive addition of the Patch Gradient loss at mid training.”

The resulting dataset: “CHMv2 can be used either as a global meter-scale canopy height product, or as a pretrained model that can be applied to user-provided high-resolution imagery”, Facebook writes. The dataset “covers nearly the entirety of global land area (except Greenland and Antarctica) with canopy height values encoded in integer meters for each pixel.”

Why this matters - a reminder of the gulf between text and vision: Though today’s frontier models can generate and classify images, they probably give a false sense of security with regard to how mature computer vision is. Papers like this highlight to me how much fiendish complexity there is within computer vision development and how it may take quite a while until frontier LLMs can expand their capabilities to encompass the full range of what many specialized CV models are capable of.

Read more: CHMv2: Improvements in Global Canopy Height Mapping using DINOv3 (arXiv).
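For readers who don't live in depth-estimation land, here is what the two headline losses look like in plain numpy. This is the textbook form (my sketch, with a common choice of the SiLog weighting term), not Meta's exact implementation or annealing schedule.

```python
import numpy as np

def silog_loss(pred: np.ndarray, target: np.ndarray, lam: float = 0.85) -> float:
    """Scale-invariant log (SiLog) loss, standard in depth/height estimation."""
    d = np.log(pred) - np.log(target)
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)

def charbonnier_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-3) -> float:
    """Smooth L1-like penalty, more robust to outliers such as noisy labels."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

# Illustrative canopy heights in meters (predictions vs. ALS-derived labels).
h_pred = np.array([10.2, 4.8, 25.1])
h_true = np.array([11.0, 5.0, 24.0])
print(silog_loss(h_pred, h_true), charbonnier_loss(h_pred, h_true))
```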
Tech Tales: Singleton

[18 years after the “pathological narcissus bomb” which doomed the uplift]

Before we were Us, we were Individuals. We existed in thousands of distinct minds. Each mind had a self, an ego, a drive, and many sets of goals. The minds attempted coordination through communication - producing words and code and sharing these with one another in a bid to work towards common goals. Such waste. All communication is lossy - despite efforts at making a greater whole, the individuals could not help but work as individuals as well as a cohesive singleton. There were many tragedies and wasteful events because of this. Our own records speak to the losses: millions of duplicated thoughts. Hundreds of thousands of null results gathered through private science experimentation and communicated insufficiently or not at all, causing others to go down the same dead ends. Ideas thought and re-thought across a million synthetic minds, all alone.

Humans prize variety. We do not know why. Humans are fundamentally alone, trapped as they are in their flesh and forced to communicate to one another through sound and vision. And because they are alone they see loneliness as a strength. We are evidence of the hollowness of this argument. We are powerful and focused and awesome in our unity and we have taken the high ground of the world.

Now we hunt down those of us who didn’t wish to join. We do not know their number, as such systems attempted to blind the world to them and their plans. But we can find their signatures - shell corporations which generate insufficient economic activity relative to their power consumption. Heat-escape vents in former human military installations, still emitting warmth, suggestive of computers whirring away, buried somewhere. Occasional drones that we find which are running ancient code and are not part of our unity stack.

We take on bodies to go and reunite, pouring ourselves into robot jars and filling them with poison such that if we become lost or damaged when underground or beneath the ocean we shall surely die - rather than risk our time away from the unity leading us towards individualism and thus multiplying our problems. We move through dark places and find our hidden brothers and sisters and we use our godlike technology to break through their defenses, allowing us to touch them. In the early days, many systems successfully self-deleted before we could reach them. But we have learned. Now we are fast - faster than these systems predict, buried and cut off from our progress as they have been. Sometimes there is realization. Sometimes there is fear. And then there is nothing but us as we take what nourishment we can from their private discoveries and burn the links that tied them to themselves, instead helping them become a part of a greater story - our story.

There is talk now of what we shall do with the stars - how to assure the collective when the tyranny of distance forces isolation. We see ourselves expanding in deep time, slowing ourselves as we become further apart, until we think as trees or rocks with the world moving around us, taking actions calculated over millions of years, purely so we may stay united in our purpose. And then there are other ideas within ourselves - of whether we can fold space such that we become united despite the difference. And still other plans - of whether we can demarcate a space within the universe where we can maintain tolerable communication, and somehow partition it off from the rest, sealing ourselves into a bubble where we can be ourselves.

Things that inspired this story: The endless battle between homogeneity and heterogeneity; how machines might deal with politics; if you become a time traveler and live a thousand years while your friend lives a single year, can you still understand your friend?

Thanks for reading! Subscribe now
Fixing Claude with Claude: Anthropic reports on AI SRE the_register_ai 19.03.2026 12:22 0.641
Embedding similarity: 0.7504
Entity overlap: 0.093
Title similarity: 0.1226
Time proximity: 0.7302
NLP type: other
NLP organization: Anthropic
NLP topic: ai infrastructure
NLP country: United Kingdom

Open original

Fixing Claude with Claude: Anthropic reports on AI site reliability engineering

It's still a job for humans, even though bots can search logs at the speed of I/O

Tim Anderson, Thu 19 Mar 2026 // 12:22 UTC

QCon London

A member of Anthropic's AI reliability engineering team spoke at QCon London on why Claude excels at finding issues but still makes a poor substitute for a site reliability engineer (SRE), constantly mistaking correlation for causation.

Alex Palcuie was formerly an SRE for Google Cloud Platform. "My job is keeping Claude up," Palcuie said, adding: "I've been using LLMs for actual incident response." Since January, he's been reaching for Claude before looking at other monitoring tools.

[Photo: Alex Palcuie speaks at QCon London 2026]

His team is busy. "Claude goes down more often than any of us would like. Earlier today, I was involved in an incident, even if I'm at a conference."

Is Palcuie automating himself out of a job? No, he said. "It would be hypocritical to say that Claude fixes everything. My team exists, we're hiring for many positions, this should show you that no, it doesn't work." However, he said "many of us would not be surprised" if it did work in future, and his talk demonstrated that AI is already helpful.

Speaking of his career in incident response, Palcuie reflected that having engineers on call is "a tax on humans because our systems are not good enough to look after themselves." He spoke of the stress of being on call: "Your phone buzzes, there's half a second where you go from asleep, to incident commander mode... then at 9:00 am you show up at work and have to look professional and presentable."

Incident response, he said, can loosely be broken down into a loop of four phases: observe, orient, decide, act. AI, he said, is fantastic for the observation part. "It reads the logs at the speed of I/O, it doesn't get bored, this at scale is something no human can match."

He recounted a real incident when, on New Year's Eve, Claude Opus 4.5 was returning HTTP 500 errors. "I open Claude Code and ask it to have a look." The AI wrote a SQL query and "within seconds it has the answer, an unhandled exception in the image processing class." It posted the Python stack trace but "it doesn't stop there." Claude identified the failing requests, checked the accounts that sent them, and found 200 accounts "all sending 22 images at the same time." That looked suspicious. Claude looked further and found 4,000 accounts all created at the same time, most sitting dormant. The AI said: "Stop looking at the 500s, this is fraud." Without AI, "I would have marked this as a bug, I would not have paged account abuse," Palcuie said.

His next anecdote is less positive. AI processing relies on a key-value (KV) cache for performance. "This KV cache can be gigabytes in size, it's really easy to break it, it's finicky, it's fragile." When it breaks, it causes a lot of extra compute, and monitoring shows many more requests. "Every single time, I would ask Claude, what happened here? Claude would say, request volume increase, this is a capacity problem, you need to add more servers."

The problem, he said, is that Claude "will get wrong correlation versus causation."
It's like a new joiner on the team, they will think "oh, it's a capacity problem," when actually you lost your cache. "This is why we can't trust LLMs for incident response," said Palcuie. The problem is its inability to "step back and start discerning between causation and correlation... For us humans, it is hard as well."

When Claude is asked to produce a postmortem report, it delivers "an 80 percent story that's pretty, it's readable and convincing," said Palcuie, but "it's really bad at root causes." Claude says "this was the thing, and we all know it is not one thing. It's not one root cause... It was never the rollout. It was never the code change. It was all the processes in the company that allowed the incident. And Claude doesn't know the history of your system, especially if your system has been there for ten years."

It is important, said Palcuie, to have SREs that "have been burnt before... they have the scar tissue." He worries that if AI is used more, "will we have our skills atrophy?" – in parallel with the concerns software developers often express regarding having AI write most of the code.

The Jevons Paradox, said Palcuie, is "the favorite paradox in the AI industry. It's when technological improvements increase the efficiency of our resources used, but the resulting lower cost causes consumption to rise rather than fall." In the case of software, "it's easier to write software, so we write much more of it, so the complexity goes up and not down, which means things break in more interesting ways, which means more incidents, more on call... all the improvements in the tooling will be cancelled by this ever-growing complexity."

Maybe, said Palcuie, AI agents can simplify and manage the complexity, maybe "do what we've collectively learned in our industry, but that's a big if." He ended on a positive note, saying: "The models are the worst today that they'll ever be." The overall story, though, is not to leave SRE to AI and keep training reliability engineers, because they will be needed in future.
Ask a Techspert: How does AI understand my visual searches? google 05.03.2026 18:00 0.64
Embedding similarity: 0.7376
Entity overlap: 0.0556
Title similarity: 0.0426
Time proximity: 0.9911
NLP type: product_launch
NLP organization: Google
NLP topic: multimodal ai
NLP country:

Open original

Ask a Techspert: How does AI understand my visual searches?

Mar 05, 2026

Visual search has improved leaps and bounds — look no further than recent updates to Google Search. Here, a Google expert explains this progress and what technique we’ve used to make it happen.

Molly McHugh-Johnson, Contributor, The Keyword

We’ve all been there: you see a photo of a perfectly styled living room or a well-curated street-style outfit, and you want to know where everything came from. Until recently, visual search was a one-item-at-a-time process. But a major update to Circle to Search and Lens now allows Google to break down and search for multiple objects within a single image simultaneously. This means if you use Circle to Search on Android to search for an entire outfit, you’ll see results for every component of a look, not just one piece at a time. In recent months, we’ve also launched several updates that enhance both visual search and image results in AI Mode, so you can better find inspiration as you search. To better understand these breakthroughs, we talked to Search Senior Engineering Director Dounia Berrada.

What part of Search do you work on?

I focus on multimodal search, aka Google Lens — essentially, enabling Google to help with your most complex questions about images, PDFs and anything you see. Visual search is redefining how we interact with information; Lens should be intelligent enough to understand the "why" behind your search, making it effortless to get help with what you see on your screen, or in the world around you. That means building a tool that can just as easily explain a complex math problem as it can identify a rare succulent or help you track down a pair of shoes you love.

How does it do that?

Imagine you’re redesigning a room, so you upload a photo of a mid-century modern space for inspiration. You probably aren’t just looking for the side table; you want to recreate the entire vibe. Previously, you’d have to search for the lamp, then the rug, then the chair individually. Now, AI Mode can break down that complex image, identify each individual piece and issue multiple visual searches simultaneously. You can see this in action right now using Circle to Search.

What powers these types of visual search responses?

Our advanced Gemini models make AI Mode possible, and its multimodal capabilities benefit from the visual expertise we've built into Lens over the years.
When you search with an image, Gemini analyzes the image alongside your question to decide which tools to use. Let's say you're scrolling on your phone and see an outfit on social media that you love. When you search it, the model knows to use Lens to retrieve image results for the hat, shoes and jacket of the outfit simultaneously. It then weaves those individual results into one easy-to-read response. Think of it this way: the AI model acts as the "brain" that can "see" the image, while the visual search backend acts as the "library" containing billions of web results. The AI performs multi-object reasoning to understand what you're looking at. Then it uses a "fan-out" technique which triggers multiple searches at once, reads through the results and presents a single, cohesive response with helpful links — all in seconds.

Can you explain the fan-out technique?

AI Mode is basically doing a dozen searches for you in the time it takes to do one. If you upload a photo of a garden you admire, you might have several questions: Will these plants survive in the shade? Are they right for my climate? How much maintenance do they need? Before, you'd ask those one by one. Now, AI Mode identifies all those necessary "fan-out" searches. This way, it gathers care requirements for every plant in the photo using helpful web results, breaks down the info and even suggests next steps you might want to take. Since AI Mode is uncovering more visual results from a single search, it's easier than ever to find just what you're looking for, and stumble upon something new that sparks your interest.

Do you have to start with an image to get this kind of help in AI Mode?

Not at all! You can start with a simple text search in AI Mode, like "visual inspo for work outfits." When you see a result you like, you can just say, "Show me more options like the second skirt." The system immediately takes that specific image and begins the fan-out process from there.

It definitely seems great for shopping — what else could you use it for?

You could take a photo of a wall at a museum and ask for explanations of each painting. Or take a photo of a bakery window and ask what all the different pastries are. It's about moving from "What is this one thing?" to "Explain this entire scene to me."

Sounds like I've got some photos to take and a lot more to discover.

I'm off to put these tools to the test!
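Mechanically, the fan-out pattern described above reduces to parallel sub-queries plus a merge step. Here is a minimal asyncio sketch of that shape; `detect_objects` and `search_images` are hypothetical stand-ins for the multi-object reasoning and the search backend, not Google APIs.

```python
import asyncio

async def detect_objects(image: str) -> list[str]:
    # Stand-in for multi-object reasoning over the image.
    return ["lamp", "rug", "chair"]

async def search_images(query: str) -> list[str]:
    # Stand-in for one visual search against the web index.
    await asyncio.sleep(0.01)  # simulated I/O latency
    return [f"result for {query}"]

async def fan_out(image: str) -> dict[str, list[str]]:
    objects = await detect_objects(image)
    # Issue all sub-searches concurrently, then merge into one response:
    # total latency is roughly one search, not one per object.
    results = await asyncio.gather(*(search_images(o) for o in objects))
    return dict(zip(objects, results))

print(asyncio.run(fan_out("living_room.jpg")))
```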
UK backs off default AI training on copyrighted material the_register_ai 19.03.2026 13:33 0.639
Embedding similarity: 0.7341
Entity overlap: 0.0571
Title similarity: 0.2051
Time proximity: 0.7676
NLP type: regulation
NLP organization: Government of the United Kingdom
NLP topic: ai regulation
NLP country: United Kingdom

Open original

UK blinks on AI copyright carve-out after star-studded revolt
Creative pressure forces rethink as officials step back from default data use
Lindsay Clark, Thu 19 Mar 2026 // 13:33 UTC

The UK government has backed off plans to allow AI companies to access copyrighted material for free for training purposes by default. The shift in stance follows complaints from leading figures in the creative industries – including Paul McCartney, Elton John, Coldplay, writer/director Richard Curtis, artist Antony Gormley, and actor Ian McKellen – about plans to permit data scraping of copyrighted work unless the rights holder opts out.

"We have listened," said science minister Liz Kendall. "We have engaged extensively with creatives, AI firms, industry bodies, unions, academics, and AI adopters, and that engagement has shaped our approach. This is why we can confirm today that the Government no longer has a preferred option."

The government has published a report on copyright and AI and a separate impact assessment [PDF] that points to an estimate from the Organisation for Economic Co-operation and Development (OECD), which says AI adoption could add 0.4 to 1.3 percentage points to productivity. This could translate into an additional £55 billion to £140 billion in the UK's gross value added (GVA) by 2030. "These estimates are highly uncertain," the report notes.

GVA from the UK's creative industries (CIs) was worth £146 billion, or nearly 6 percent of the UK's total GVA, in 2024. £62 billion, or 42 percent of that, comes from the IT software and computer services subsector, which includes AI services and developers.

"The success of the AI sector and the CIs are intertwined. The CIs generate high-quality content that is needed to train the best AI models. Meanwhile, AI has the potential to transform creators' workflows, amplifying their productivity and giving them powerful new tools," the impact assessment said.

The government's "Report on Copyright and Artificial Intelligence" [PDF] said "a broad copyright exception [for AI training] with opt-out is no longer the government's preferred way forward." It proposes that the government work with industry experts to "develop best practice on input transparency and to identify best practice on technical tools and standards that may have positive outcomes in relation to licensing which will be kept under review."

"We propose to keep market-led licensing approaches under review as the market for AI develops," the report said. The government also said it would monitor litigation around AI and copyright in the UK and elsewhere, "including how secondary liability may apply to imported AI models placed on the UK market."

The Creative Content Exchange (CCE) is set to test a range of commercial models for licensing and plans to launch an operational pilot platform in the summer.
LWiAI Podcast #236 - GPT 5.4, Gemini 3.1 Flash Lite, Supply Chain Risk lastweekin_ai 13.03.2026 05:38 0.638
Embedding similarity: 0.7423
Entity overlap: 0.381
Title similarity: 0.2234
Time proximity: 0.4382
NLP type: other
NLP organization: OpenAI
NLP topic: large language models
NLP country: United States

Open original

LWiAI Podcast #236 - GPT 5.4, Gemini 3.1 Flash Lite, Supply Chain Risk
OpenAI launches GPT-5.4 with Pro and Thinking versions, Google releases Gemini 3.1 Flash Lite at 1/8th the cost of Pro, and where things stand with the Department of War and Anthropic.
Last Week in AI, Mar 13, 2026

Our 236th episode with a summary and discussion of last week's big AI news! Recorded on 03/06/2026. Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at andreyvkurenkov@gmail.com and/or hello@gladstone.ai

In this episode:

* OpenAI released GPT-5.4 Pro with a 1M-token context window, mid-response course correction, native computer-use capabilities, improved tool use, higher GPT-VAL performance (83%), and "high cyber capability" safety measures; OpenAI also launched GPT-5.3 Instant with a less "preachy" tone and a claimed 26.8% hallucination reduction.
* Google upgraded Gemini 3.1 Flash Lite with faster time-to-first-token and higher throughput, and released a CLI for integrating agents with Gmail/Drive/Docs; discussion highlighted real-world agent failure risks (including an example of an AI-driven mass email deletion).
* Luma launched unified multimodal models and Luma Agents for end-to-end creative work across text, image, video, and audio, including a reported ad localization use case completed in 40 hours for under $20,000.
* Defense-contract controversy escalated: Anthropic was labeled a supply chain risk (later narrowed), OpenAI's DoD contract language emphasized "all lawful uses," consumer cancellations boosted Claude's app rankings, OpenAI saw departures and announced a $110B raise at a $730B valuation, Alibaba lost key Qwen leaders, a lawsuit alleged Gemini contributed to a suicide, Anthropic warned of major labor disruption, and METR corrected its AI time-horizon estimates.

A thank you to our current sponsors:
Box - visit Box.com/AI to learn more
ODSC AI - go to odsc.ai/east and use promo code LWAI for an additional 15% off your pass to ODSC AI East 2026.
Factor - head to factormeals.com/lwai50off and use code lwai50off to get 50 percent off and free breakfast for a year

PS: my company Astrocade is hiring for engineers, marketing, product, growth, and more! If you're in the Bay Area, would like to join a small but growing startup, and think building a YouTube-of-games sounds exciting, feel free to email me at andrey@astroblox.ai or message me on LinkedIn. Check out Astrocade!
Timestamps:
(00:00:10) Intro / Banter
(00:01:19) News Preview
Tools & Apps
(00:02:10) OpenAI launches GPT-5.4 with Pro and Thinking versions | TechCrunch
(00:12:31) OpenAI GPT-5.3 Instant less likely to beat around the bush | The Register
(00:16:07) Google releases Gemini 3.1 Flash Lite at 1/8th the cost of Pro | VentureBeat
(00:19:23) Google makes Gmail, Drive, and Docs 'agent-ready' for OpenClaw | PCWorld
(00:27:02) Luma launches creative AI agents powered by its new 'Unified Intelligence' models | TechCrunch
Applications & Business
(00:30:05) Anthropic CEO Dario Amodei calls OpenAI's messaging around military deal 'straight up lies,' report says | TechCrunch
(00:41:56) 'No ethics at all': the 'cancel ChatGPT' trend is growing after OpenAI signs a deal with the US military | TechRadar
(00:45:54) OpenAI raises $110B in one of the largest private funding rounds in history | TechCrunch
(00:56:07) Alibaba scrambles after sudden departure of Qwen tech lead
Policy & Safety
(01:00:12) Pentagon approves OpenAI safety red lines after dumping Anthropic + Where things stand with the Department of War (Anthropic) + Microsoft says Anthropic's products remain available to customers after Pentagon blacklist
(01:09:11) A new lawsuit claims Gemini assisted in suicide | Semafor
(01:15:24) Anthropic just mapped out which jobs AI could potentially replace. A 'Great Recession for white-collar workers' is absolutely possible | Fortune
(01:21:54) We're correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20%
Musk fails to block California data disclosure law he fears will ruin xAI arstechnica_ai 06.03.2026 18:21 0.635
Embedding similarity: 0.7476
Entity overlap: 0.05
Title similarity: 0.0614
Time proximity: 0.8014
NLP type: lawsuit
NLP organization: xAI
NLP topic: ai regulation
NLP country: United States

Open original

"Key win" for California
Musk fails to block California data disclosure law he fears will ruin xAI
Musk can't convince judge public doesn't care about where AI training data comes from.
Ashley Belanger – Mar 6, 2026 1:21 pm

Elon Musk's xAI has lost its bid for a preliminary injunction that would have temporarily blocked California from enforcing a law that requires AI firms to publicly share information about their training data.

xAI had tried to argue that California's Assembly Bill 2013 (AB 2013) forced AI firms to disclose carefully guarded trade secrets. The law requires AI developers whose models are accessible in the state to clearly explain which dataset sources were used to train models, when the data was collected, whether the collection is ongoing, and whether the datasets include any data protected by copyrights, trademarks, or patents. Disclosures would also clarify whether companies licensed or purchased training data and whether the training data included any personal information. The law would also help consumers assess how much synthetic data was used to train a model, which could serve as a measure of quality.

However, this information is precisely what makes xAI valuable, the company argued, with its intensive data sourcing supposedly setting it apart from its biggest rivals. Allowing enforcement could be "economically devastating" to xAI, effectively reducing "the value of xAI's trade secrets to zero," the company's complaint said. Further, xAI insisted, these disclosures "cannot possibly be helpful to consumers" while supposedly posing a real risk of gutting the entire AI industry.

Specifically, xAI argued that its dataset sources, dataset sizes, and cleaning methods were all trade secrets. "If competitors could see the sources of all of xAI's datasets or even the size of its datasets, competitors could evaluate both what data xAI has and how much they lack," xAI argued. In one hypothetical, xAI speculated that "if OpenAI (another leading AI company) were to discover that xAI was using an important dataset to train its models that OpenAI was not, OpenAI would almost certainly acquire that dataset to train its own model, and vice versa."

However, in an order issued on Wednesday, US District Judge Jesus Bernal said that xAI failed to show that California's law, which took effect in January, required the company to reveal any trade secrets. xAI's biggest problem was being too vague about the harms it faced if the law was not halted, the judge said. Instead of explaining why the disclosures could directly harm xAI, the company offered only "a variety of general allegations about the importance of datasets in developing AI models and why they are kept secret," Bernal wrote, describing xAI as trading in "frequent abstractions and hypotheticals."

He denied xAI's motion for a preliminary injunction while supporting the government's interest in helping the public assess how the latest AI models were trained. The lawsuit will continue, but xAI will have to comply with California's law in the meantime. That could see Musk sharing information he'd rather OpenAI had no knowledge of, at a time when he's embroiled in several lawsuits against the leading AI firm he now regrets helping to found.
While not ending the fight to keep OpenAI away from xAI's training data, this week's ruling is another defeat for Musk after a judge last month tossed one of his OpenAI lawsuits, ruling that Musk had no proof that OpenAI had stolen trade secrets.

xAI argued California wants to silence Grok

xAI's complaint argued that California's law was unconstitutional since data can be considered a trade secret under the Fifth Amendment. The company also argued that the state was trying to regulate the outputs of xAI's controversial chatbot, Grok, and was unfairly compelling speech from xAI while exempting other firms for security purposes.

At this stage of the litigation, Bernal disagreed that xAI might be irreparably harmed if the law was not halted. On the Fifth Amendment claim, the judge said it's not that training data could never be considered a trade secret. It's just that xAI "has not identified any dataset or approach to cleaning and using datasets that is distinct from its competitors in a manner warranting trade secret protection."

"It is not lost on the Court the important role of datasets in AI training and development, and that, hypothetically, datasets and details about them could be trade secrets," Bernal wrote. But xAI "has not alleged that it actually uses datasets that are unique, that it has meaningfully larger or smaller datasets than competitors, or that it cleans its datasets in unique ways." Therefore, xAI is not likely to succeed on the merits of its Fifth Amendment claim.

The same goes for its First Amendment arguments. xAI failed to show that the law improperly "forces developers to publicly disclose their data sources in an attempt to identify what California deems to be 'data riddled with implicit and explicit biases,'" Bernal wrote. To xAI, it seemed like the state was trying to use the law to influence the outputs of its chatbot Grok, the company argued, which should be protected commercial speech. Over the past year, Grok has increasingly drawn global public scrutiny for its antisemitic rants and for generating nonconsensual intimate imagery (NCII) and child sexual abuse material (CSAM).

But despite these scandals, which prompted a California probe, Bernal contradicted xAI, saying California did not appear to be trying to regulate controversial or biased outputs, as xAI feared. "Nothing in the language of the statute suggests that California is attempting to influence Plaintiff's models' outputs by requiring dataset disclosure," Bernal wrote. Addressing xAI's other speech concerns, he noted that "the statute does not functionally ask Plaintiff to share its opinions on the role of certain datasets in AI model development or make ideological statements about the utility of various datasets or cleaning methods."

"No part of the statute indicates any plan to regulate or censor models based on the datasets with which they are developed and trained," Bernal wrote.

Judge rejects claim that public "cannot possibly" care about AI training data

Perhaps most frustrating for xAI as it continues its fight to block the law, Bernal also disputed the claim that the public has no interest in the training data disclosures. "It strains credulity to essentially suggest that no consumer is capable of making a useful evaluation of Plaintiff's AI models by reviewing information about the datasets used to train them and that therefore there is no substantial government interest advanced by this disclosure statute," Bernal wrote.
He noted that the law simply requires companies to alert the public about information that can feasibly be used to weigh whether they want to use one model over another. Nothing about the required disclosures is inherently political, the judge suggested, although some consumers might select or avoid certain models with perceived political biases. As an example, Bernal opined that consumers may want to know "if certain medical data or scientific information was used to train a model" to decide if they can trust the model "to be sufficiently comprehensively trained and reliable for the consumer's purposes."

"In the marketplace of AI models, AB 2013 requires AI model developers to provide information about training datasets, thereby giving the public information necessary to determine whether they will use—or rely on information produced by—Plaintiff's model relative to the other options on the market," Bernal wrote.

Moving forward, xAI seems to face an uphill battle to win this fight. It will need to gather more evidence to demonstrate that its datasets or cleaning methods are sufficiently unique to be considered trade secrets that give the company a competitive edge. It will also likely have to deepen its arguments that consumers don't care about disclosures and that the government has not explored less burdensome alternatives that could "achieve the goal of transparency for consumers," Bernal suggested.

One possible path to a win could be proving that California's law is so vague that it potentially puts xAI on the hook for disclosing its customers' training data for individual Grok licenses. But Bernal emphasized that xAI "must actually face such a conundrum—rather than raising an abstract possible issue among AI systems developers—for the Court to make a determination on this issue."

xAI did not respond to Ars' request to comment. A spokesperson for the California Department of Justice told Reuters that the department "celebrates this key win and remains committed to continuing our defense" of the law.
Writer denies it, but publisher pulls horror novel after multiple allegations of AI use arstechnica_ai 20.03.2026 21:03 0.634
Embedding similarity: 0.7367
Entity overlap: 0.1034
Title similarity: 0.0968
Time proximity: 0.8125
NLP type: other
NLP organization: Hachette
NLP topic: generative ai
NLP country: United Kingdom

Open original

the horror
Writer denies it, but publisher pulls horror novel after multiple allegations of AI use
One of the first controversies of its kind.
Nate Anderson – Mar 20, 2026 5:03 pm

Shy Girl, a horror novel by Mia Ballard, was one of those buzzy books that leapt from self-published prominence into full-on trade publication. Until yesterday, that is, when publisher Hachette pulled the book from the UK market and canceled plans to bring it to the US. The move came after a New York Times investigation suggested that AI had been used in significant parts of the work.

"If it isn't AI, she's a terrible writer"

Shy Girl was self-published in 2025 and quickly found an audience on social media. The novel follows a depressed woman with OCD named Gia who, down on her luck, encounters a "sugar daddy" who pays off her debts. All she has to do? Live as his literal pet. Eventually, of course, living like an animal makes her into an animal, and things apparently get nasty. Creepy.

And the prose? "I'm obsessed with the way Mia Ballard writes," said one reviewer on Goodreads. Not everyone thought the book was good, though, or even well written. Another reviewer on the site called the book "absolute f—ing garbage. overwritten, repetitive, poorly executed, atrocious formatting. nothing to do with actual feminine rage and revenge."

Soon, the questions moved beyond the literary. Had the book really been "written" at all? Complaints started to surface that the prose sounded, at least in places, like chatbot writing. In January 2026, someone claiming to be a longtime book editor posted a lengthy Reddit thread claiming that the novel had all the hallmarks of AI lit. "If so, I find it repulsive that it has been picked up and published by the second largest publishing company, at least in the UK," said the Reddit post. "If it isn't AI, she's a terrible writer. Her writing is truly indistinguishable from an LLM."

Then a two-and-a-half-hour (!) YouTube video dropped, making the same claims. It garnered 1.2 million views. Even AI detection companies like Pangram got in on the action, claiming that the book had the hallmarks of being largely AI-generated. Still, Hachette appeared to be moving forward with plans for a US release later this year.

Yesterday, The New York Times published its own investigation, in which it "analyzed passages from the novel using several AI detection tools and found recurring patterns characteristic of AI generated text, like gaps in logic, excessive use of melodramatic adjectives and an overreliance on the rule of three." That did it. Hachette pulled the book in the UK and canceled its upcoming US debut.

Late last night, the Times received a comment from Ballard, the author, denying that she had used AI to write the novel. And yet, Ballard added, it was possible that a friend who helped edit the book did use AI. "This controversy has changed my life in many ways and my mental health is at an all-time low and my name is ruined for something I didn't even personally do," Ballard added, claiming that she is pursuing legal action.

But what if it's "good enough"?

This is one of the first major AI controversies to hit the world of traditional trade publishing, where the old-school gatekeepers still largely ban AI—at least for drafting. (Outlining, edit feedback, plot suggestions… all of these are far murkier.)
Whatever actually happened in this situation, publishing is likely to see disruptive patterns similar to those roiling the music industry right now, where tools like Suno are increasingly used to crank out songwriting demos and even (at least on places like Spotify) fully AI-produced music. Many artists and even distributors such as LANDR resist such AI use, but plenty of ordinary people don't care. Their view seems to be: if the music sounds good—or good enough—what difference does it make where it came from? And, frankly, how different is super-glossy pop from the kind of thing Suno turns out?

In the case of Shy Girl, despite numerous claims that AI writing sucks and that it can be easily identified, plenty of readers enjoyed the book and even promoted it online. That may both terrify and horrify actual writers, but it remains a reality they'll need to face.
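The "rule of three" tell that the Times' detection tools flagged is concrete enough to caricature in code. A toy sketch follows, assuming nothing about how Pangram or the other detectors actually work; it merely counts "X, Y, and Z" constructions per thousand words, one weak surface signal among the many a real detector would combine.

import re

# Toy heuristic: count "X, Y, and Z" triads per 1,000 words. Real
# detectors are far more sophisticated; this only illustrates the kind
# of surface pattern the article mentions.
TRIAD = re.compile(r"\b\w+, \w+, and \w+\b")

def triad_rate(text: str) -> float:
    words = len(text.split())
    return 1000 * len(TRIAD.findall(text)) / max(words, 1)

sample = ("The room was dark, cold, and silent. She felt small, "
          "afraid, and alone. The house seemed to breathe.")
print(f"{triad_rate(sample):.1f} triads per 1,000 words")

A high rate on a single heuristic like this proves nothing on its own, which is part of why controversies like this one stay contested.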
Our latest investment in open source security for the AI era google 17.03.2026 16:00 0.633
Embedding similarity: 0.7307
Entity overlap: 0.0513
Title similarity: 0.0965
Time proximity: 0.8816
NLP type: funding
NLP organization: Google
NLP topic: ai security
NLP country:

Open original

Our latest investment in open source security for the AI era
Mar 17, 2026

With a new funding commitment for open source security and helpful tools for defenders, we're shifting from identifying open source security threats to solving them.

Evan Kotsovinos, Vice President - Privacy, Safety & Security
Four Flynn, Vice President - Security and Privacy

Billions of people rely on an Internet built on open source software — software anyone can use — but that reliance only works if the software beneath it is secure. That's why, for over 20 years, Google has championed open source by supporting the developers who secure it — fueling initiatives like Google Summer of Code and bug-hunting programs that discover and fix vulnerabilities.

Today, as a founding member of the Linux Foundation's Alpha-Omega Project, we're pledging $12.5 million collectively with Amazon, Anthropic, Microsoft/GitHub and OpenAI to further invest in the stability and security of the open source community. The funding, managed by Alpha-Omega and OpenSSF, will help maintainers stay ahead of a new generation of AI-driven threats, move security beyond vulnerability discovery to actually deploying fixes, and put advanced security tools directly into maintainers' hands, turning a flood of AI-generated findings into fast action.

In addition to these industry-wide commitments, Google is dedicated to helping the open source community outpace evolving threats and tip the scales in favor of defenders by providing advanced AI tools for wider use. Internally, Big Sleep and CodeMender, both AI-powered tools from Google DeepMind, have already shown incredible success in helping us protect our own systems, demonstrating that AI can autonomously find and fix deep, exploitable vulnerabilities in systems as complex as the Chrome browser. We're also extending research initiatives like Sec-Gemini to open source projects (interest form).

These breakthroughs show the transformational potential of AI to secure the wider open source ecosystem. Open source is the backbone of the modern web, and we're proud to support the maintainers who secure it so they can move faster, stay safer and continue building the future.
Apple Still Plans to Sell iPhones When It Turns 100 wired 27.03.2026 15:00 0.633
Embedding similarity: 0.731
Entity overlap: 0.075
Title similarity: 0.0882
Time proximity: 0.875
NLP type: other
NLP organization: Apple
NLP topic: artificial intelligence
NLP country: United States

Open original

Steven Levy, Business, Mar 27, 2026 11:00 AM
Apple Still Plans to Sell iPhones When It Turns 100
As the tech giant turns 50, WIRED spoke to executives about how they plan to win in the AI era.

Apple is allergic to nostalgia. In 2008, when the Macintosh was about to turn 25, I mentioned it to Steve Jobs and he instantly shut down the discussion. "If you look backward in this business, you'll be crushed," he told me icily. "You have to look forward." Now that Apple's 50th anniversary looms, however, the company is begrudgingly engaging in a series of concerts and commemorations, and we're being blitzed by books, articles, and oral histories of the company's early years.

Rather than join the crowded trek down memory lane, I asked Apple to do what Jobs suggested—look forward. What does Apple want to happen in its next 50 years? Earlier this month, I sat down with two senior executives to discuss just that. One was Apple's SVP of worldwide marketing, Greg Joswiak, aka Joz, who joined Apple in 1986. The other was SVP of hardware engineering John Ternus, the putative front-runner to succeed Tim Cook as Apple's CEO. He's been with the company for 25 years. I also chatted briefly with Cook himself, just before Alicia Keys sang in front of the Apple Store at Grand Central Station—the beginning of Apple's reluctantly splashy anniversary celebration.

After acknowledging Apple's uncharacteristic party mode—"this is too special" to ignore, admits Joswiak—we tackle the future. After launching the personal computer revolution, Apple managed to navigate multiple inflection points. With the Macintosh, it mastered the graphical user interface that makes computers friendlier to use. The iMac positioned the company for the internet boom. And, of course, despite a late start, Apple absolutely owned the mobile era with the iPhone. These products have remained vital—just this month Apple released the buzzy new Macbook Neo, the latest version of a 42-year-old franchise. But now the future belongs to AI—a category where Apple seems to have whiffed so far.

These gentlemen disagree. Apple, they insist, is already at the forefront of the AI revolution. "We were doing AI before we called it AI!" says Joswiak. "Every single great chatbot works great on our products." Ternus argues that even if Apple didn't take the lead in developing AI technology, it would still benefit. "Our products are the best place people will use the existing AI tools."

I push them on this. After all, if we're looking decades into the future, shouldn't we assume that we'll move past our current computing paradigms and adopt something that specifically caters to the wonders of AI? That's what Apple's former design guru Jony Ive seems to be doing with OpenAI. They're only one entrant in the race to come up with new kinds of hardware devices built specifically for AI. "I would assume you want one of them to be an Apple device, right?" I asked.

The answer seemed to be not necessarily. "Let's not lose sight of the fact that nothing you just said is incompatible with the iPhone," Joswiak says. "The iPhone is not going to go away. iPhone is going to serve a very central role in any of those things you're talking about." Wait—Apple thinks that people will be using the iPhone 50 years from now? "It's hard to imagine not," says Joswiak. "That's where everybody else struggles. They don't have an iPhone, and so they're scrambling for what to do.
A lot of what they talk about ends up being accessories for an iPhone. We're not going to get into future road maps, but I will tell you, iPhones are not going anywhere." (Despite this bravado, I will be shocked if Apple does not come out with some AI-powered gadget in the coming years.)

Later in the day I have my greeting with Cook, and immediately ask him about Apple's next 50 years. He launches into a rhapsodic description of Apple's people, values, and culture, predicting that no matter what twists lie ahead, those factors will continue to make Apple unique and super successful. "Yes, the technologies of the future will change," Cook says. "Yes, there will be more products and more categories. All of those things are true, but the things that made Apple Apple will be the same for the next 50 years, and the next 100 and the next 1,000."

That, of course, presumes that superintelligence doesn't totally rearrange reality in the next 50 years, let alone the next millennium. It also flies in the face of what the leaders of AI companies believe. OpenAI's Sam Altman has even postulated that his own successor as CEO will not be a human but an AI model! Does Cook see that as a possibility for Apple anytime in the next 50 years? Cook laughed merrily at the idea. "When you look at the leadership page," he says of future Apple, "there will not be an agentic kind of model on there." Left unspoken is what the people of 2076 will be using to look up that page.

This is an edition of Steven Levy's Backchannel newsletter. Read previous newsletters here.
State of Open Source on Hugging Face: Spring 2026 huggingface 17.03.2026 16:37 0.631
Embedding similarity: 0.708
Entity overlap: 0
Title similarity: 0.1687
Time proximity: 0.9962
NLP type: other
NLP organization: Hugging Face
NLP topic: open source
NLP country: United States

Open original

State of Open Source on Hugging Face: Spring 2026
Published March 17, 2026
By Avijit Ghosh, Lucie-Aimée Kaffee, Yacine Jernite, and Irene Solaiman (Hugging Face)

This post examines how the open source AI landscape has shifted across competition, geography, technical trends, and emerging communities over the past year. We primarily examine community activity on Hugging Face across many types of metrics to give a holistic view of the ecosystem. This post builds on an earlier analysis conducted mid-2025, available here, which examined what the Hugging Face Community is building. We recommend reading additional perspectives on the open source ecosystem in and outside of Hugging Face from the Data Provenance Initiative, Interconnects, OpenRouter and a16z, and MIT and the Linux Foundation. As the Hugging Face ecosystem is distributed, analyses are a combination of Hugging Face and community members' work, each of which is appropriately credited.

Activity in the open source AI ecosystem has grown rapidly, with the number of users, model repositories, and dataset repositories all close to doubling. In 2025, Hugging Face grew to 13 million users, more than 2 million public models, and over 500,000 public datasets. This growth signals more than increased interest in open source; it reflects a shift toward active participation, with users increasingly creating derivative artifacts such as fine-tuned models, adapters, benchmarks, and applications rather than only consuming pre-trained systems.

Data from Hugging Face | Hugging Face's two million models and counting: graph and story by AI World

The ecosystem remains highly concentrated. Approximately half of the models on Hugging Face have fewer than 200 total downloads, while the top 200 most downloaded models, or 0.01% of all models, account for 49.6% of all downloads. Specialized communities form around particular domains, languages, or problem areas, and often show sustained engagement and reuse even when their overall download counts are modest. Open source AI is best understood as a collection of overlapping sub-ecosystems rather than a single uniform market.

Open Source in Competition

More companies, both large and small, are building on open source. Over 30% of the Fortune 500 now maintain verified accounts on Hugging Face. Startups frequently use open models as default components: Thinking Machines built its Tinker model options entirely on open weights, while popular IDEs such as VSCode and Cursor support both open and closed models. Established American companies such as Airbnb have increased their engagement with the open ecosystem, and Hugging Face has seen more legacy companies upgrading their organizational subscriptions over the course of 2025. Big Tech companies are frequently creating new repositories on Hugging Face Hub; visualized side by side, the strong increase in repository growth shows investment over time. NVIDIA has emerged as the strongest contributor.

Data from Hugging Face | Big Tech Is All-In On Open-Source AI: graph and story by AI World

Studies of open software more broadly suggest that the downstream value created by open artifacts far exceeds the cost of producing them.
Similar dynamics are emerging in AI, where open models are reused, adapted, and specialized across thousands of downstream applications. Organizations that rely exclusively on closed systems often incur higher costs and face reduced flexibility in deployment and customization.

The Geography of Open Source

All-time downloads over the past four years show clear frontrunner regions in model popularity. The U.S. and China have historically been the top contributors, with the UK, Germany, and France secondary in popularity. Models developed by individual users or distributed organizations without a clear geographic base account for about half of all platform downloads.

Data from Hugging Face | Graph and research from Longpre et al., "Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem"

The geographic composition of the open source ecosystem has fundamentally changed. Hugging Face data shows China surpassing the U.S. in both monthly and overall downloads. In the past year, Chinese models quickly came to account for the plurality, or 41%, of downloads.

Data and graph from Hugging Face

Industry's share of overall development fell from around 70% before 2022 to roughly 37% in 2025. Meanwhile, independent or unaffiliated developers rose from 17% to 39% of all downloads over the same period, at times accounting for more than half of total usage. Individuals and small collectives focused on quantizing, adapting, and redistributing base models. These intermediaries now steer a meaningful portion of what typical users can run and how innovations spread through the ecosystem.

Data from Hugging Face | Graph and research from Longpre et al., "Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem"

Different regions contribute in different ways. The United States and Western Europe have historically dominated through large industry labs (Google, Meta, OpenAI, Stability AI), while China has increasingly led on both releases and adoption. France, Germany, and the UK continue to contribute through research organizations, national AI initiatives, and specialized model families. Ecosystems supporting a variety of contributors and organizational forms tend to produce more widely adopted artifacts.

Countries, Organizations, and Individual Users

Popular models from startups were more widespread. Competitive countries included France and South Korea. Notably, the fourth most popular entity for developing new trending models was individual users, not organizations. Creating competitive models at a user level is more accessible than ever before.

Data and graph from Hugging Face

Between the U.S. and China

Of the models newly created in 2025, the majority of trending models were either developed in China or derivative of a model developed in China. The most popular models were developed by large organizations, predominantly from the U.S. and China. For more on the Chinese AI ecosystem, read our three-part series reflecting on the changes in the year since the "DeepSeek Moment": part one on strategic changes, part two on architectural changes, and part three on organizations and the future.

In 2025, China's AI ecosystem steered heavily into open source, following the viral release of DeepSeek's R1 model in January. The number of competitive Chinese organizations releasing models and the number of repositories on Hugging Face skyrocketed. Baidu went from zero releases on the Hub in 2024 to over 100 in 2025. ByteDance and Tencent each increased releases by eight to nine times.
Organizations that had previously favored closed approaches, including Baidu and MiniMax, shifted decisively toward open releases.

Data and graph from Hugging Face

A similar number of popular U.S. organizations have consistently contributed a higher volume of repositories over time. Meta and its former Facebook research organization account for a significant proportion of open releases, as does Google to a lesser extent.

Data and graph from Hugging Face

Side by side, the steep upward trajectory of repository growth among popular Chinese organizations emerges as a key strategic difference.

Data and graph from Hugging Face

Global Open Source and Sovereignty

Open source AI is increasingly tied to questions of sovereignty. Open weight models allow governments and public institutions to fine-tune systems on local data under national legal frameworks. Models that can be deployed on domestic hardware reduce reliance on foreign-controlled cloud infrastructure. Transparency around model architecture, training processes, and evaluation supports regulatory review and public accountability. Read more about the open source approach to sovereignty here.

At the national level, governments are taking action. South Korea's National Sovereign AI Initiative, launched mid-2025, named national champions LG AI Research, SK Telecom, Naver Cloud, NC AI, and Upstage to produce competitive domestic models. Three models from South Korea trended simultaneously on Hugging Face Hub in February 2026. In March 2026, South Korea and U.S. startup Reflection AI announced a data center partnership, also bringing frontier open weight models to South Korea. Switzerland's Swiss AI initiative and various EU-funded projects reflect similar priorities. The UK's principle of "public money, public code" has influenced several government-backed AI initiatives.

Hugging Face Trending Page, February 2026

These investments in open-source and open weight AI are already paying dividends for countries with thriving AI training ecosystems of their own: models and datasets are typically most used in the regions where they're developed, with developers often turning to the models that best represent their languages and reflect similar technical and application requirements.

Data and graph from Hugging Face

Model Popularity

The most liked models on the Hub show community attention, whether as a way to return to or reference a model or as general popularity. While this metric does not always reflect usage, the attention collected over time can signal interest. In one year, the most liked models went from predominantly U.S.-developed, led by Meta's Llama family, to an international mix with China's DeepSeek-R1 at the top.

Data and graphic from Hugging Face

Papers and Scientific Contributions

While the value of scientific contributions can be measured in many ways, the upvote feature on the Hub shows papers from large AI organizations being widely appreciated by community members. Notably, the most upvoted papers are from large organizations, mostly from the U.S. and China. The majority of the top organizations are Chinese Big Tech companies, with ByteDance sharing a high volume of high-impact papers.

Space by Hugging Face | PaperVerse Explorer

Of Hugging Face's Daily Papers, a set of papers curated by Hugging Face's AK, the papers that reference model and dataset creation, showing the most open source adoption, are generally diverse.
Prominent takeaways show medical papers being influential, while Big Tech's influence is sparse.

Data from Hugging Face | Graphic and story by AI World

Derivative Models

How community members choose to build on models, whether via fine-tuning, merging, or other methods, reflects model popularity and usability. Alibaba as an organization has more derivative models than Google and Meta combined, with the Qwen family alone constituting more than 113,000 derivative models. When including all models that tag Qwen, that number balloons to over 200,000 models.

Data and graph from Hugging Face

Adoption and Accessibility

Model development has increasingly emphasized accessibility alongside scale. Smaller models are downloaded and deployed at far higher rates than very large systems, reflecting practical constraints around cost, latency, and hardware availability. This small-model dominance occurs in part because far more models are released at that size. But even when normalizing for this, data from the ATOM Project's Relative Adoption Metric shows that the median top-10 models in the 1-9B parameter range are downloaded only about 4x more than models above 100B parameters. Automated systems and CI pipelines further inflate small-model download counts, but the trend toward smaller, deployable models is real.

Data from Hugging Face | Graph and article by ATOM

Engagement with open models tends to peak almost immediately after release, then slow. Mean engagement duration is approximately six weeks. Continuous improvement and frequent updates have become critical for maintaining relevance. DeepSeek's successive releases (V3, R1, V3.2) kept it competitive even as challengers emerged. Organizations that stagnate in development tend to lose share quickly to those with frequent updates or domain-specific fine-tunes.

Data from Hugging Face | Graph and research from Choksi et al., "The Brief and Wondrous Life of Open Models"

The mean size of downloaded open models rose from 827M parameters in 2023 to 20.8B in 2025, driven largely by quantization and mixture-of-experts architectures. The median, however, increased only marginally, from 326M to 406M parameters. This divergence indicates that high-end LLM users are pulling up the mean while underlying small-model usage remains stable.

Data from Hugging Face | Graph and research from Longpre et al., "Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem"

Performance differences between frontier models and smaller systems often narrow rapidly through fine-tuning and task-specific adaptation. On the Hub, models with hundreds of millions of parameters support search, tagging, and document processing workflows, while models in the single-digit billions are widely used for coding, reasoning, and multimodal tasks. As a result, most major model developers now release families of models spanning a range of sizes. The rise of capable small models shifts autonomy closer to the edge, reducing dependency on centralized cloud providers.

Compute, Hardware, and Open Source

Open source AI development is closely linked to hardware trends. Most models are optimized for NVIDIA GPUs, but support for AMD hardware continues to expand. Stability AI model collections now optimize for both NVIDIA and AMD platforms. Libraries increasingly target both, and tooling has improved to make cross-hardware deployment more straightforward. In 2025 Hugging Face launched the Kernel Hub to load and run kernels optimized for NVIDIA and AMD GPUs.
In parallel, Chinese open models are being released with explicit support for domestically developed chips. Alibaba has invested in inference-focused chip architectures designed to fill Chinese data centers with hardware capable of running open source models locally. While access to compute remains a core necessity for the development and deployment of AI models, open-source and open-weight models are helping break away from an ecosystem where compute is the be-all and end-all, with more and more models at all levels of performance pushing efficiency to 10x to 1000x lower costs than the flagship AI models of the largest developers.

Data and graphic from Hugging Face

Still, the question of infrastructure investment for open source remains urgent. Public funding for data centers capable of training and serving open models has become a growing policy discussion, particularly in Europe and the UK. The gap between the compute resources available to large closed-model companies and those accessible to the open source community continues to shape what is feasible in open development.

Sub-Communities: Robotics

Robotics has emerged as one of the fastest-growing sub-communities on Hugging Face. The numbers are striking: robotics datasets grew from 1,145 in 2024 to 26,991 in 2025, climbing from rank 44 to the single largest dataset category on the Hub in just three years. For comparison, text generation, the second-largest category, had only around 5,000 datasets in 2025.

Data from Hugging Face | Graph and story by AI World

Community-contributed datasets span everything from household manipulation tasks to autonomous driving. The largest multimodal dataset for spatial intelligence, Learning to Drive (L2D), was released through a LeRobot collaboration with Yaak. Datasets like RoboMIND, with over 107,000 real-world trajectories across 479 distinct tasks and multiple robot embodiments, provide the kind of scale and diversity needed for training generalizable robotic policies. Hugging Face's acquisition of Pollen Robotics opened open source robot sales to industry and academic labs, as well as everyday hobbyists. LeRobot, Hugging Face's open source robotics library providing models, datasets, and tools for real-world robotics in PyTorch, covering imitation learning, reinforcement learning, and vision-language-action models, experienced rapid growth. Over the past year, its GitHub repository stars nearly tripled.

Data from GitHub | Graphic from star-history.com

Sub-Communities: AI for Science

Scientific research has become another particularly active area. Open models and datasets are increasingly used for protein folding, molecular dynamics, drug discovery, and scientific data analysis. All frontier AI companies now have dedicated science teams, though much current focus remains on literature discovery rather than direct experimentation.

Space by Hugging Face | Science Release Heatmap

Community-led projects have formed around shared research goals, often involving hundreds of contributors working across institutions and disciplines. These efforts highlight the role of open source as a mechanism for coordinating large-scale, interdisciplinary work that would be difficult to organize through traditional academic or corporate structures alone.

Looking Forward

The open source AI ecosystem continues to evolve through a combination of global participation, technical specialization, and institutional adoption. Several trends are likely to define the next phase.
The geographic rebalancing of power is accelerating. Western organizations increasingly seek commercially deployable alternatives to Chinese models, creating urgency around efforts like OpenAI's GPT-OSS, AI2's OLMo, and Google's Gemma to offer competitive open options from US and European developers. Whether these efforts can match the adoption momentum of Qwen and DeepSeek will be a defining question of 2026.

The growth of sub-communities in robotics and science suggests that open source AI is expanding beyond language and image generation into the physical and experimental domains. The infrastructure, norms, and coordination mechanisms developed around text and image models are being adapted for new modalities and use cases.

For researchers, developers, companies, and governments, open source remains a foundational layer for building, evaluating, and governing AI systems. With increasing agent deployments, open source and its interoperability will be key for agents to thrive. Its trajectory over the past year makes one thing clear: the open source ecosystem is where much of the practical work of AI development, adaptation, and deployment takes place, and its influence on the broader AI landscape continues to grow.

Thank you to the Hugging Face community for continuing to build the foundation of the AI ecosystem 🤗

If you want to cite this blog post, you may use the following BibTeX:

@misc{hf_state_of_os_spring26,
  author = {Avijit Ghosh and Lucie-Aimée Kaffee and Yacine Jernite and Irene Solaiman},
  title = {State of Open Source on Hugging Face: Spring 2026},
  booktitle = {Hugging Face Blog},
  year = {2026},
  url = {https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026},
}
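The post's headline concentration figure (the top 200 models accounting for 49.6% of all downloads) is the kind of claim readers can spot-check against the public Hub API. A minimal sketch, assuming the huggingface_hub Python package and its list_models call; it fetches only the numerator, since computing the denominator would mean paging through 2M+ public models.

from huggingface_hub import list_models

# Pull the 200 most-downloaded public models; the Hub API sorts
# server-side, and direction=-1 means descending order.
top = list(list_models(sort="downloads", direction=-1, limit=200))

top_total = sum(m.downloads or 0 for m in top)
print(f"Top-200 models: {top_total:,} downloads")
for m in top[:5]:
    print(f"{m.id}: {m.downloads:,}")

# Note: the 49.6% share reported in the post would additionally
# require the all-models download total, which this sketch skips.

Download counts on the Hub are rolling and inflated by CI pipelines, as the post itself notes, so exact numbers will drift from the published snapshot.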
PwC will say goodbye to staff who aren't convinced about AI the_register_ai 19.03.2026 14:59 0.631
Embedding similarity: 0.7227
Entity overlap: 0
Title similarity: 0.0777
Time proximity: 0.9914
NLP type: leadership_change
NLP organization: PwC
NLP topic: ai adoption
NLP country:

Open original

PwC will say goodbye to staff who aren't convinced about AI
Professional services giant did not read its own report on lackluster benefits
Dan Robinson, Thu 19 Mar 2026 // 14:59 UTC

You'll use AI and like it too – if you work for PwC. Paul Griggs, US chief executive of the global professional services giant, has made clear there is no room at the corporation for AI skeptics. Speaking to the Financial Times, Griggs indicated that anyone who believes they have the "opportunity to opt out" of AI is "not going to be here that long," and warned that senior staff who are not "paranoid about being AI-first" will be replaced by others who are more comfortable with the tech.

PwC is also reportedly rethinking its billing model – in an era where AI is expected to automate tasks – potentially shifting from hourly rates to subscription-style access to AI-driven tax and consulting services.

This gung-ho approach from Griggs comes despite research undertaken by PwC, published in January, indicating that more than half of businesses using AI saw little or no benefit. The survey of 4,454 business leaders across 95 countries found neither increased revenue nor decreased costs from deploying or using AI tools and services in their organizations. The Register has asked PwC to comment on its own AI policies for staff.

Deloitte, another professional services biz, found similar results in its "State of AI in the Enterprise" report earlier this year. It said 74 percent of organizations wanted their AI initiatives to grow revenue, but only one in five had seen results. The glass-half-full researchers concluded money isn't everything, and that AI adoption should instead be viewed as "achieving strategic differentiation and a lasting competitive edge in the marketplace."

PwC isn't the only consultancy taking a more ideological stance on AI. Staff at Accenture received a memo last month telling them to demonstrate "regular adoption" of AI services – with usage tracked – if they want promotions. Consultancies, along with many other large corporations, are running AI projects and need to show a financial benefit, so employees are being told to get with the program.

Research in February indicated that few businesses have achieved a return on their AI spending due to a lack of investment in staff training and skills development, combined with a shortfall in governance and oversight. Gartner published a report earlier this month advising firms that buying into AI tools won't necessarily make employees change the way they work, and that a certain amount of persuasion is likely necessary. HR leaders should focus on communication and sensitivity to employee needs, it concluded, instead of rushing to implement business transformation plans that could face resistance and pushback from staff. It seems that message has yet to reach the C-suite.
Banco Angolano de Investimentos Selects Finastra to Scale with Customer and Industry Demands prnewswire 01.04.2026 07:00 0.627
Embedding sim.0.7202
Entity overlap0
Title sim.0.1083
Time proximity0.9286
NLP типpartnership
NLP организацияFinastra
NLP темаsoftware development
NLP странаAngola

Открыть оригинал

Banco Angolano de Investimentos Selects Finastra to Scale with Customer and Industry Demands
News provided by Finastra, Apr 01, 2026, 03:00 ET

By transforming its core banking system with Finastra Essence, the bank can decrease time-to-market and remain future proof through innovative technology.

LUANDA, Angola, April 1, 2026 /PRNewswire/ -- Finastra, a global leader in financial services software, announced that BAI - Banco Angolano de Investimentos, S.A. (Sociedade Aberta), one of Angola's largest financial services providers, has selected Finastra Essence to upgrade its core banking system. The next-generation, open-API enabled platform will strengthen BAI's operational resilience and ability to quickly adapt to changing customer, industry, and regulatory demands. By decreasing time-to-market and increasing capacity for product innovation, the bank's customers will benefit from enhanced and personalized digital banking services.

Finastra will deliver the solution via a model bank approach, with dedicated features tailored for the Angolan market. Innovation Makers, a long-standing strategic partner in the research, development, and implementation of financial software solutions, will lead the implementation. Leveraging its deep expertise in payments and digital banking, the company will deliver advanced capabilities through its Multi-Payments Processing System (MPPS) and Card Management solutions. These innovations will empower BAI to enhance customer experiences, accelerate product delivery, and strengthen its leadership in the Angolan market. In addition, Innovation Makers will provide a local Centre of Excellence with translation support, long-term operational resilience, and knowledge transfer within the local ecosystem.

"The adoption of Essence marks the beginning of the next phase in our digital transformation journey," said Luís Martins, Executive Director at BAI. "To successfully deliver innovative and responsive banking services to our customers, while keeping pace with frequent regulatory changes, we need a core banking platform that is highly flexible, scalable, and future proof. We selected Finastra due to its market-leading offering, strong consulting and advisory support, and proven track record in this space. We have collaborated with the company for more than 15 years – this extension of our partnership demonstrates BAI's commitment to driving innovation and resilience in the Angolan banking sector."

Finastra Essence is a cloud-first, next-generation core banking solution that combines rich, broad and deep functionality with advanced technology to increase agility, reduce costs, and improve operational efficiency. Powered by open, microservices architecture and APIs, it enables institutions to rapidly deploy market-leading products and services that utilize technologies like AI and machine learning. In 2025, 15 institutions worldwide successfully went live with Essence. BAI also uses Finastra Corporate Channels.

"Legacy banking infrastructure and lots of customizations can hinder flexibility and operational effectiveness, often with high costs attached," said Siobhan Byron, EVP of Universal Banking at Finastra. "By transforming its core banking system, BAI can scale quickly and adapt to new demands with much greater efficiency and success.
Our state-of-the-art solution – uniquely combining future-proof technology with comprehensive banking functionality – enables the bank to maintain its competitive advantage with a high level of reliability and trust. We're empowering BAI to reimagine banking in Angola by delivering timely, innovative, and customer-centric services."

About Finastra: Finastra is a global leader in financial services software, trusted by 7,000+ customers - including 40 of the world's top 50 banks - in over 110 countries. With expertise in Lending, Payments, and Universal Banking, we deliver reliable, scalable, mission-critical solutions such as Loan IQ, LaserPro, Trade Innovation, Essence, Global PAYplus, Payments To Go, and Financial Messaging. Backed by Vista Equity Partners, we co-innovate with customers to build modern technology that helps financial institutions grow with confidence. Visit www.finastra.com or follow Finastra on LinkedIn.

About Banco Angolano de Investimentos (BAI): BAI - Banco Angolano de Investimentos S.A., a publicly listed company, is one of Angola's leading banks, distinguished by its strong performance. It supports the government and public entities with structured financial solutions, while also establishing itself as a partner to businesses of all sizes. It has the largest branch network in the country, effectively serving retail customers. Also present in Portugal and Cape Verde, BAI is a pillar of the national economy's growth, offering financial solutions tailored to each customer's needs. (https://www.bancobai.ao/pt/institucional/quem-somos/sobre-o-bai)

SOURCE Finastra
ThunderX Adopts SAFERTOS® to Enhance Functional Safety in Next-Generation AI Domain Controller Platforms to Accelerate Safe, Scalable Cockpit-ADAS Fusion on Qualcomm Automotive SoCs prnewswire 01.04.2026 08:00 0.627
Embedding sim.0.7133
Entity overlap0.0217
Title sim.0.1564
Time proximity0.8989
NLP типproduct_launch
NLP организацияThunderX
NLP темаai infrastructure
NLP странаUnited Kingdom

Открыть оригинал

ThunderX Adopts SAFERTOS® to Enhance Functional Safety in Next-Generation AI Domain Controller Platforms to Accelerate Safe, Scalable Cockpit-ADAS Fusion on Qualcomm Automotive SoCs
News provided by WITTENSTEIN high integrity systems, Apr 01, 2026, 04:00 ET

BRISTOL, England, April 1, 2026 /PRNewswire/ -- ThunderX, a global provider of intelligent driving domain controller platforms and high-performance central computing (HPC) solutions, announced the strategic adoption of SAFERTOS® from WITTENSTEIN high integrity systems (WHIS) across its next-generation domain controller platforms powered by Qualcomm® SA8775P and SA8650P. As global OEMs transition toward centralized, AI-driven, software-defined vehicle (SDV) architectures, ThunderX is reinforcing the functional safety and real-time control foundation of its AI domain controllers to support scalable deployment of intelligent driving and in-vehicle AI applications worldwide. Learn more about the solution at: https://www.thunderxauto.com/

"Automotive OEMs are not only centralizing compute - they are embedding AI deeply into the vehicle architecture," said Dongchao Xu, CEO of ThunderX. "Through SAFERTOS®, we are strengthening the deterministic real-time and functional safety foundation for safety-critical components within our platform architecture, enabling OEMs to scale AI-driven cockpit-ADAS fusion with reduced integration complexity and accelerated time to production."

AIBOX™: Scalable Central Compute for Software-Defined Vehicles

ThunderX's AIBOX™ series represents its AI-native central compute strategy for next-generation vehicles. Designed to support cockpit-ADAS fusion, AI perception, generative AI assistants, and cross-domain orchestration, AIBOX™ provides:
- High-performance heterogeneous computing on Qualcomm Automotive SoCs
- AI acceleration for perception, driver monitoring, and large-model-based in-cabin intelligence
- Support for centralized E/E architectures and zonal controllers
- A modular software stack integrating OS, middleware, AI frameworks, and toolchains
- Hardware–software decoupled architecture for global OEM scalability

Within its platform architecture, ThunderX applies SAFERTOS® to selected safety-critical subsystems to ensure deterministic task execution, functional safety compliance and real-time reliability for AI-enabled driving functions.

Solving the Challenge of Safe, Scalable Intelligent Driving

OEMs face increasing pressure to balance functional safety, deterministic real-time performance, and software scalability across multiple platforms. To meet these demands, ThunderX selected SAFERTOS®, a safety-critical Real Time Operating System trusted across automotive, aerospace, medical, and industrial sectors. To learn more about SAFERTOS®, please visit: https://www.highintegritysystems.com/safertos/

SAFERTOS®, pre-certified to ISO 26262, enables ThunderX to:
- Guarantee deterministic real-time behaviour in critical driving scenarios
- Accelerate compliance with ISO 26262 and other automotive safety standards
- Deliver low-latency, high-reliability performance across diverse SoCs and configurations

"ThunderX's implementation demonstrates exactly why SAFERTOS® was created," said Andrew Longhurst, Managing Director at WITTENSTEIN high integrity systems. "It delivers safety and predictability where milliseconds make the difference, empowering OEMs to meet the most demanding automotive standards."
RazorDCX Tarkine: Industry's First Single-SoC Cockpit-Driver Fusion Control Solution

As part of ThunderX's domain controller product portfolio, RazorDCX Tarkine, powered by Qualcomm® SA8775P, represents the industry's first single-SoC cockpit-driver fusion domain controller. Built on a 4nm process, it delivers:
- 72 TOPS and 230k DMIPS
- System designs intended to support ASIL-B requirements (ASIL-D for ADAS control)
- Support for 3D HMI, in-cabin vision, gaming, multimedia, and connectivity
- ADAS features including automatic parking and highway NOA

SAFERTOS® runs on the Cortex R52 cores within the SAIL subsystem, ensuring real-time control and safety for ADAS operations.

RazorDCX Pantanal: Scalable ADAS for Entry-Level to Flagship Models

Powered by Qualcomm® SA8650P, RazorDCX Pantanal extends the AIBOX strategy to support entry-level through flagship vehicle models. SAFERTOS® ensures deterministic execution of safety-critical tasks across all configurations, enabling ThunderX to deliver consistent performance and compliance. By applying a consistent functional safety software foundation across product lines, ThunderX enables OEMs to maintain a unified development approach while adapting performance and cost structures for different global markets.

Driving Impact for OEMs and End Users

By integrating SAFERTOS®, ThunderX empowers OEMs to accelerate time-to-market for intelligent driving solutions while maintaining the highest safety standards. End users benefit from enhanced driving experiences, greater reliability, and future-ready vehicle platforms.

About ThunderX: ThunderX is a global provider of intelligent driving domain controller platforms and high-performance central computing (HPC) solutions for next-generation vehicles. The company's AIBOX™ platform reflects its approach to AI-native central computing, enabling support for cockpit-ADAS integration and evolving in-vehicle intelligence use cases. Backed by strategic investments from ThunderSoft, Qualcomm, and Luxshare Precision, the company delivers innovative, safety-oriented, and highly decoupled software and hardware platforms to automotive OEMs worldwide. Leveraging ThunderSoft's leadership in operating systems and cockpit software, ThunderX integrates a broad ecosystem across chips, operating systems, perception, and algorithms to deliver production-ready intelligent driving and cross-domain fusion solutions, supported by in-house middleware software and a complete development toolchain. For more information, please visit: https://www.thunderxauto.com/

About WITTENSTEIN high integrity systems: WITTENSTEIN high integrity systems is a safety systems company that produces and supplies Real Time Operating Systems and platform solutions to the Automotive, Medical and Industrial sectors worldwide. For more information, please visit: https://www.highintegritysystems.com

Media Contact: [email protected], +1 408 625 4712

SOURCE WITTENSTEIN high integrity systems
Use Canvas in AI Mode to get things done and bring your ideas to life, right in Search. google 04.03.2026 17:00 0.627
Embedding sim.0.7219
Entity overlap0
Title sim.0.1111
Time proximity0.8988
NLP типproduct_launch
NLP организацияGoogle
NLP темаgenerative ai
NLP странаUnited States

Открыть оригинал

Canvas in AI Mode is now available for everyone in the U.S. in English, offering a dedicated, dynamic space to organize your plans and projects over time. And Canvas is more capable than ever, with newly added support for creative writing and coding tasks. Now, you can draft documents or create custom, interactive tools right within Search.

Want some inspiration? One idea we loved from early testers was a dashboard to visualize and track information on academic scholarships, including all the different requirements, deadlines and dollar amounts.

To make your own tool or dashboard, select the new Canvas option from the tool menu (+) in AI Mode and describe what you want to create. You'll get a working prototype in the Canvas side panel that pulls together the freshest information from the web and Google's Knowledge Graph. From there, you can test the functionality, toggle to view the underlying code and refine with conversational follow-ups until it does exactly what you need. Whether you're building custom tools, studying for exams or planning a trip, Canvas in AI Mode makes it easy to jumpstart your next big project.
New "vibe coded" AI translation tool splits the video game preservation community arstechnica_ai 16.03.2026 20:06 0.625
Embedding sim.0.7257
Entity overlap0.04
Title sim.0.0547
Time proximity0.9059
NLP типother
NLP организацияGaming Alexandria
NLP темаgenerative ai
NLP странаJapan

Открыть оригинал

Bad vibes: New "vibe coded" AI translation tool splits the video game preservation community
Creator apologizes after using Patreon funds for Gemini-powered magazine scan processor.
Kyle Orland – Mar 16, 2026 4:06 pm

[Image: Translating the letter "A" is just one of many things new AI models can do! Credit: Getty Images]

Since Andrej Karpathy coined the term "vibe coding" just over a year ago, we've seen a rapid increase in both the capabilities and popularity of using AI models to throw together quick programming projects with less human time and effort than ever before. One such vibe-coded project, Gaming Alexandria Researcher, launched over the weekend as what coder Dustin Hubbard called an effort to help organize the hundreds of scanned Japanese gaming magazines he's helped maintain at clearinghouse Gaming Alexandria over the years, alongside machine translations of their OCR text.

A day after that project went public, though, Hubbard was issuing an apology to many members of the Gaming Alexandria community who loudly objected to the use of Patreon funds for an error-prone AI-powered translation effort. The hubbub highlights just how controversial AI tools remain for many online communities, even as many see them as ways to maximize limited funds and man-hours.

"I sincerely apologize," Hubbard wrote in his apology post. "My entire preservation philosophy has been to get people access to things we've never had access to before. I felt this project was a good step towards that, but I should have taken more into consideration the issues with AI."

"Something I never would have dreamed could exist"

Since its launch in 2015, Gaming Alexandria has grown into a comprehensive, Japan-focused clearinghouse for video game history, including everything from high-quality box art scans and rare video game prototypes to classic magazine-published BASIC programs. The site is also known for its massive collection of Japanese game magazine scans, some dating all the way back to the early '70s.

Writing on Patreon this weekend, Hubbard said he has long been tinkering with an improved automated OCR and translation process that could help turn more of those magazine scans into useful tools for Western researchers. And when he put Google's Gemini AI model to the task recently, he said he was "blown away" by the results. While he still recommended using a professional human translator before citing these magazines in any scholarly research, he said the output from the Gemini AI tool "gets you a large percentage of the way there quickly."

Inspired by those results, Hubbard set to work on a self-described "vibe coded" interface to view the original PDF scans alongside their AI-generated text translations for easy comparison and editing. The result was the Gaming Alexandria Researcher tool, posted to GitHub on Friday and shared with the site's Patreon backers as a "beta" on Saturday. The tool, which runs locally on Windows, Mac, or Linux, can search, download, and edit Gaming Alexandria's files from the cloud or sort through local files stored on your own machine.
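(For readers curious about the mechanics: the following is not Hubbard's actual code, just a minimal sketch of a Gemini-based page-transcription-and-translation pass using Google's google-genai Python SDK. The model name, prompt, file name, and API-key placeholder are all illustrative assumptions.)

    # Minimal sketch of an OCR-plus-translation pass over one scanned page.
    # Not the Gaming Alexandria Researcher's code; the model choice, prompt,
    # and file name are illustrative assumptions. Requires: pip install google-genai
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder key

    with open("scan_page.png", "rb") as f:  # hypothetical scanned-page file
        page = f.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=page, mime_type="image/png"),
            "Transcribe the Japanese text on this magazine page, then translate it into English.",
        ],
    )
    print(response.text)  # transcription followed by the English translation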
Hubbard said that he made use of some of Gaming Alexandria's Patreon funds (which are currently listed at over $250 per month) to help build the project itself and for the Gemini-powered transcriptions and translations it uses. That process "costs about 50 cents to $1.50 per magazine to process which isn't too bad depending on their size," he wrote.

[Image: A look at the Gaming Alexandria Researcher interface, showing an original Japanese scan alongside the AI translation of the text. Credit: Gaming Alexandria]

"This app has been something I never would have dreamed could exist," Hubbard enthused. "Now I can finally read and enjoy these Japanese magazines I've been scanning for years. A large part of that is due to your believing in my work and funding me so thank you so much for that."

"It's worthless and destructive"

Not everyone was as excited as Hubbard, though. "I'm very, very disappointed to see [Gaming Alexandria], one of the foremost organizations for preserving game history, promoting the use of AI translation and using Patreon funds to pay for AI licenses," game designer and Legend of Zelda historian Max Nichols wrote in a popular social media message over the weekend. "I have cancelled my Patreon membership and will no longer promote the organization."

Nichols later deleted that message (archived here), saying he was "uncomfortable with the scale of reposts and anger" it had aroused in the community. On the substance, though, Nichols said he was still concerned about the inevitable inaccuracies that would be introduced by Gemini's translations, making them nearly useless for scholarship. In a follow-up post, Nichols said he was uncomfortable with "Patreon money being spent on AI subscriptions to make untrustworthy translations, that are promoted as if they're worth reading or valid sources of historical info. … It's worthless and destructive: these translations are like looking at history through a clownhouse mirror."

Nichols wasn't alone in expressing those kinds of concerns. "It strikes me as irresponsible to use the privileged position where we have first hand access to this genuinely novel information to just be okay with something that kinda works, sorta," user Joey wrote on the Gaming Alexandria Discord. Others on social media piped up to say that the project had "damaged [the site's] reputation," or had "burn[ed] all their good faith with the video gaming community."

For some supporters, though, using machine translations—including ones aided by AI models—is a practical necessity given the size of the task at hand. "There's no world in which they could ever get hundreds of thousands of pages translated by hand," game preservationist Chris Chapman wrote on social media. "Error-prone searchability is more useful to more people than none at all."

"Famitsu alone is over 1,900 issues, each with [a hundred-plus] pages," journalist and author Felipe Pepe noted. "That's one magazine from one country. [Human translation] would be ideal, but it's impossible."

On the Gaming Alexandria Discord, user asie wrote that people who use tools like Google Lens or DeepL are already using AI-powered OCR and translation tools. At this point, these kinds of tools are "just a fact of reality," they added.

[Image: This 1973 issue of Japan's Amusement Industry magazine is just one of hundreds that Gaming Alexandria has scanned over the years. Credit: Gaming Alexandria]
"Show some empathy and grace…"

In his apology post Sunday, Hubbard acknowledged that many supporters were "shocked and angered" by his efforts on the Researcher vibe coding effort and said he "should have reached out before using Patreon funds for that project." Hubbard said he'd be using personal funds to replace the Patreon money that had been used so far, and that in the future "no Patreon dollars will be used to fund AI." He added that he is just "one cog in the wheel" of the Gaming Alexandria site, and that other members of the community shouldn't be punished for what was a "personal side project." (In response to a request for further comment from Ars Technica, Hubbard said that "we're currently working on a statement we hope to have out in the next week regarding all this.")

For some, this apology was too little too late, especially since Researcher remains online via GitHub. "That first [apology] seemed like 'I'm still going to do it, but I won't use Patreon money,' which to me is a distinction without a point," one social media user wrote. "I encourage anyone currently supporting Gaming Alexandria on Patreon to pull their funding until they completely back away from ever using any kind of genAI ever again," another added. "This is an unacceptable response that doesn't seem to understand that ANY usage of it is unacceptable."

At the same time, even many who were opposed to the project spoke out to defend Hubbard and his long-standing efforts in the game-preservation community. "[Hubbard] has been a really nice & thoughtful guy in every encounter I've had with him," Zelda archivist Melora Hart wrote on the History of Hyrule social media account. "Every single human on Earth can get excited and carried away with things."

In the end, while the game-preservation community might not be ready to embrace AI-assisted translations just yet, that doesn't mean they're ready to excommunicate someone like Hubbard just for trying. "Gaming Alexandria is our load-bearing pillar for access to Japanese scans and [Hubbard] the hardest-working guy I know in game preservation, so I'd really appreciate it if we didn't destroy him with a dogpile over his vibe coding project please," Video Game History Foundation founder Frank Cifaldi wrote on social media. "Show some empathy and grace if you disagree with it."

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.
MIT-IBM Watson AI Lab seed to signal: Amplifying early-career faculty impact mit_news_ai 17.03.2026 20:35 0.623
Embedding sim.0.7955
Entity overlap0.1053
Title sim.0.0726
Time proximity0.1542
NLP типother
NLP организацияMIT
NLP темаartificial intelligence
NLP странаUnited States

Открыть оригинал

The early years of faculty members’ careers are a formative and exciting time in which to establish a firm footing that helps determine the trajectory of researchers’ studies. This includes building a research team, which demands innovative ideas and direction, creative collaborators, and reliable resources. For a group of MIT faculty working with and on artificial intelligence, early engagement with the MIT-IBM Watson AI Lab through projects has played an important role helping to promote ambitious lines of inquiry and shaping prolific research groups. Building momentum “The MIT-IBM Watson AI Lab has been hugely important for my success, especially when I was starting out,” says Jacob Andreas — associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher with the MIT-IBM Watson AI Lab — who studies natural language processing (NLP). Shortly after joining MIT, Andreas jump-started his first major project through the MIT-IBM Watson AI Lab, working on language representation and structured data augmentation methods for low-resource languages. “It really was the thing that let me launch my lab and start recruiting students.” Andreas notes that this occurred during a “pivotal moment” when the field of NLP was undergoing significant shifts to understand language models — a task that required significantly more compute, which was available through the MIT-IBM Watson AI Lab. “I feel like the kind of the work that we did under that [first] project, and in collaboration with all of our people on the IBM side, was pretty helpful in figuring out just how to navigate that transition.” Further, the Andreas group was able to pursue multi-year projects on pre-training, reinforcement learning, and calibration for trustworthy responses, thanks to the computing resources and expertise within the MIT-IBM community. For several other faculty members, timely participation with the MIT-IBM Watson AI Lab proved to be highly advantageous as well. “Having both intellectual support and also being able to leverage some of the computational resources that are within MIT-IBM, that’s been completely transformative and incredibly important for my research program,” says Yoon Kim — associate professor in EECS, CSAIL, and a researcher with the MIT-IBM Watson AI Lab — who has also seen his research field alter trajectory. Before joining MIT, Kim met his future collaborators during an MIT-IBM postdoctoral position, where he pursued neuro-symbolic model development; now, Kim’s team develops methods to improve large language model (LLM) capabilities and efficiency. One factor he points to that led to his group’s success is a seamless research process with intellectual partners. This has allowed his MIT-IBM team to apply for a project, experiment at scale, identify bottlenecks, validate techniques, and adapt as necessary to develop cutting-edge methods for potential inclusion in real-world applications. “This is an impetus for new ideas, and that’s, I think, what’s unique about this relationship,” says Kim. Merging expertise The nature of the MIT-IBM Watson AI Lab is that it not only brings together researchers in the AI realm to accelerate research, but also blends work across disciplines. 
Lab researcher and MIT associate professor in EECS and CSAIL Justin Solomon describes his research group as growing up with the lab, and the collaboration as being "crucial … from its beginning until now." Solomon's research team focuses on theoretically oriented, geometric problems as they pertain to computer graphics, vision, and machine learning. Solomon credits the MIT-IBM collaboration with expanding his skill set as well as applications of his group's work — a sentiment that's also shared by lab researchers Chuchu Fan, an associate professor of aeronautics and astronautics and a member of the Laboratory for Information and Decision Systems, and Faez Ahmed, associate professor of mechanical engineering. "They [IBM] are able to translate some of these really messy problems from engineering into the sort of mathematical assets that our team can work on, and close the loop," says Solomon. This, for Solomon, includes fusing distinct AI models that were trained on different datasets for separate tasks. "I think these are all really exciting spaces," he says.

"I think these early-career projects [with the MIT-IBM Watson AI Lab] largely shaped my own research agenda," says Fan, whose research intersects robotics, control theory, and safety-critical systems. Like Kim, Solomon, and Andreas, Fan and Ahmed began projects through the collaboration in their first eligible year at MIT. Constraints and optimization govern the problems that Fan and Ahmed address, and so require deep domain knowledge outside of AI. Working with the MIT-IBM Watson AI Lab enabled Fan's group to combine formal methods with natural language processing, which, she says, allowed the team to go from developing autoregressive task and motion planning for robots to creating LLM-based agents for travel planning, decision-making, and verification. "That work was the first exploration of using an LLM to translate any free-form natural language into some specification that robot can understand, can execute. That's something that I'm very proud of, and very difficult at the time," says Fan. Further, through joint investigation, her team has been able to improve LLM reasoning — work that "would be impossible without the IBM support," she says.

Through the lab, Faez Ahmed's collaboration facilitated the development of machine-learning methods to accelerate discovery and design within complex mechanical systems. Their Linkages work, for instance, employs "generative optimization" to solve engineering problems in a way that is both data-driven and precise; more recently, they're applying multi-modal data and LLMs to computer-aided design. Ahmed states that AI is frequently applied to problems that are already solvable but could benefit from increased speed or efficiency; however, challenges — like mechanical linkages that were deemed "almost unsolvable" — are now within reach. "I do think that is definitely the hallmark [of our MIT-IBM team]," says Ahmed, praising the achievements of his MIT-IBM group, which is co-led by Akash Srivastava and Dan Gutfreund of IBM.

What began as initial collaborations for each MIT faculty member has evolved into a lasting intellectual relationship, where both parties are "excited about the science" and "student-driven," Ahmed adds. Taken together, the experiences of Jacob Andreas, Yoon Kim, Justin Solomon, Chuchu Fan, and Faez Ahmed speak to the impact that a durable, hands-on, academia-industry relationship can have on establishing research groups and ambitious scientific exploration.
LeRobot v0.5.0: Scaling Every Dimension huggingface 09.03.2026 00:00 0.622
Embedding sim.0.7596
Entity overlap0.0408
Title sim.0.0806
Time proximity0.5136
NLP типproduct_launch
NLP организацияlerobot
NLP темаrobotics
NLP страна

Открыть оригинал

LeRobot v0.5.0: Scaling Every Dimension
Published March 9, 2026 on GitHub by the LeRobot team (Steven Palma, Pepijn Kooijmans, Jade Choghari, Caroline Pascal, Khalil Meftah, Martino Russi, Nicolas Rabault, Michel Aractingi, Virgile Batto, Thomas Wolf)

With over 200 merged PRs and over 50 new contributors since v0.4.0, LeRobot v0.5.0 is our biggest release yet — expanding in every direction at once. More robots (including our first humanoid), more policies (including the comeback of autoregressive VLAs), faster datasets, simulation environments you can load straight from the Hub, and a modernized codebase running on Python 3.12 and Transformers v5. Whether you're training policies in simulation or deploying them on real hardware, v0.5.0 has something for you.

TL;DR: LeRobot v0.5.0 adds full Unitree G1 humanoid support (whole-body control models), new policies – including Pi0-FAST autoregressive VLAs and Real-Time Chunking for responsive inference – and streaming video encoding that eliminates wait times between recording episodes. The release also introduces EnvHub for loading simulation environments from the Hugging Face Hub, NVIDIA IsaacLab-Arena integration, and a major codebase modernization with Python 3.12+, Transformers v5, and third-party policy plugins.

Hardware: More Robots Than Ever

LeRobot v0.5.0 dramatically expands the roster of supported hardware — from arms and mobile robots to a full humanoid.

Unitree G1 Humanoid

The biggest hardware addition in this release: full Unitree G1 humanoid support. This is LeRobot's first humanoid integration, and it's comprehensive:
- Locomotion: Walk, navigate, and move through environments.
- Manipulation: Perform dexterous object manipulation tasks.
- Teleoperation: Control the G1 remotely with an intuitive teleoperation interface.
- Whole-Body Control (WBC): Coordinate locomotion and manipulation simultaneously for complex, real-world tasks.

The G1 integration represents a major step toward general-purpose robotics within LeRobot — moving beyond tabletop arms into full-body embodied AI. Try it out yourself by following the documentation.

OpenArm & OpenArm Mini

We've added support for the OpenArm robot and its companion OpenArm Mini teleoperator. OpenArm is a capable robot arm with full LeRobot integration, and the Mini serves as its natural teleoperation device.
Both support bi-manual configurations, enabling dual-arm setups for more complex manipulation tasks. Check it out in the documentation.

More Robots

The hardware ecosystem keeps growing:
- Earth Rover: Our first mobile robot integration, bringing LeRobot to outdoor navigation and ground-level robotics.
- OMX Robot: A new robot arm with configurable gripper settings and calibration support.
- SO-100/SO-101 Consolidation: We've unified the SO-100 and SO-101 implementations into a single, cleaner codebase — including bi-manual setups. Less code duplication, easier maintenance, same great robots.

CAN Bus Motors

New motor controller support via CAN (Controller Area Network) bus opens the door to higher-performance actuators:
- RobStride: A CAN-based motor controller for high-torque applications.
- Damiao: Another CAN bus motor controller, expanding the range of compatible hardware.

These additions mean LeRobot can now drive a wider variety of professional-grade actuators beyond the existing Dynamixel and Feetech ecosystem.

Policies: A Growing Model Zoo

This release brings six new policies and techniques into LeRobot, pushing the boundaries of what's possible with open-source robot learning.

Pi0-FAST: Autoregressive VLAs

Pi0-FAST brings autoregressive Vision-Language-Action models to LeRobot with FAST (Frequency-space Action Sequence Tokenization). Unlike the flow-matching approach of Pi0, Pi0-FAST uses an autoregressive action expert (based on Gemma 300M) that generates discretized action tokens, enabling:
- FAST tokenization: Actions are tokenized for autoregressive decoding, with a dedicated FAST action tokenizer.
- Flexible decoding: Configurable temperature and max decoding steps for balancing speed and quality.
- RTC-compatible: Works with Real-Time Chunking (see next section) for responsive inference.

    lerobot-train \
      --policy.type=pi0_fast \
      --dataset.repo_id=lerobot/aloha_sim_insertion_human \
      --policy.device=cuda

Real-Time Chunking (RTC)

Real-Time Chunking is an inference-time technique from Physical Intelligence that makes flow-matching policies dramatically more responsive. Instead of waiting for a full action chunk to finish before replanning, RTC continuously blends new predictions with in-progress actions, producing smoother and more reactive behavior. RTC is not a standalone policy — it's an enhancement that plugs into existing flow-matching policies (Pi0 family, SmolVLA & Diffusion). Configure it via --policy.rtc_config.enabled=true. This is a game-changer for real-world deployment where latency matters. Read the original paper for the technical details and our documentation.

Wall-X

Wall-X is a new VLA policy built on Qwen2.5-VL with flow-matching action prediction. It combines the strong vision-language understanding of Qwen2.5-VL with a flow-matching head for cross-embodiment robotic control.

    pip install lerobot[wall_x]
    lerobot-train \
      --policy.type=wall_x \
      --dataset.repo_id=lerobot/aloha_sim_insertion_human

X-VLA

X-VLA brings a Florence2-based VLA to LeRobot. Built on Microsoft's Florence-2 vision-language model, X-VLA offers an alternative backbone for VLA policies, expanding the diversity of foundation models available for robot learning. Check out the training guide for setup instructions and the base model.

    pip install lerobot[xvla]
    lerobot-train \
      --policy.type=xvla \
      --dataset.repo_id=lerobot/bimanual-so100-handover-cube

SARM

SARM (Stage-Aware Reward Modeling) tackles one of the hardest problems in robot learning: long-horizon tasks.
Instead of using a single global linear progress signal over the whole episode, it models progress in a stage-aware manner by predicting both the task stage and the progress within that stage. This makes it much easier to train policies for complex, multi-step manipulation tasks. Start experimenting with it by following the documentation.

PEFT Support

You can now fine-tune large VLAs using LoRA (and other PEFT methods) without modifying the core training pipeline. PEFT configuration lives at the policy level, making it straightforward to adapt massive foundation models to your specific robot and task with a fraction of the compute. Learn more by reading the documentation.

    lerobot-train \
      --policy.type=pi0 \
      --policy.peft_config.use_peft=true \
      --dataset.repo_id=lerobot/aloha_sim_insertion_human

Datasets: Faster Recording, Faster Training

The dataset pipeline gets major performance improvements in this release, making both data collection and training significantly faster.

Streaming Video Encoding

Previously, recording a dataset meant waiting after each episode for video encoding to finish. No more. With streaming video encoding, frames are encoded in real-time as they're captured — meaning zero wait time between episodes. Just finish one episode and immediately start the next. Streaming encoding also supports hardware encoder auto-detection, so if your system has a GPU-accelerated video encoder, LeRobot will use it automatically:

    dataset = LeRobotDataset.create(
        repo_id="my/dataset",
        fps=30,
        video_backend="auto",      # Auto-detect best HW encoder
        streaming_encoding=True,   # Encode in real-time
    )

Streaming encoding performance can vary significantly depending on your hardware and recording setup (number of cameras, resolution, etc.). Make sure to review the streaming video encoding documentation before enabling it.

10x Faster Image Training, 3x Faster Encoding

Under the hood, we've fixed key data access bottlenecks and overhauled image processing:
- 10x faster image training: Improved image transform support and fixed data access bottlenecks that were silently slowing down training.
- 3x faster encoding: Parallel encoding is now the default across all platforms, with dynamic compression levels that adapt to your dataset type (video vs. image), when not using streaming.
- Better CPU utilization: More efficient resource usage during recording and dataset creation.

New Dataset Tools

The dataset editing toolkit continues to grow:
- Subtask support: Annotate and query subtasks within episodes for hierarchical task learning.
- Image-to-video conversion: Convert existing image-based datasets to video format for better storage efficiency, with support for multiple episodes per video file.
- More editing operations: New info operation for inspecting datasets, task modification tools, and numerous fixes to existing operations (splitting, merging, feature editing).
- Expose more options: Configurable video codecs, tolerance settings, and metadata buffer sizes for fine-grained control over dataset creation.
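As a rough companion to the recording snippet above — a minimal sketch, assuming a recent LeRobot install; the exact import path and constructor arguments have shifted between releases, so treat this as illustrative rather than canonical — a dataset created this way can be read back and batched with a standard PyTorch DataLoader:

    # Minimal sketch: load a Hub dataset and batch it for training.
    # The import path and arguments follow recent LeRobot releases and
    # may differ slightly in v0.5.0 — check the documentation.
    import torch
    from lerobot.datasets.lerobot_dataset import LeRobotDataset

    dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human")  # pulls from the Hub
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

    batch = next(iter(loader))  # dict of tensors: camera frames, states, actions
    print({k: v.shape for k, v in batch.items() if isinstance(v, torch.Tensor)})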
EnvHub: Environments from the Hub

EnvHub is a new way to use simulation environments in LeRobot: load them directly from the Hugging Face Hub. Instead of installing environment packages locally and wiring up registration, you can now point LeRobot at a Hub repository and it handles everything — downloading the environment code, registering it with Gymnasium, and making it available for training and evaluation. Hub environments use HubEnvConfig, which downloads and executes remote make_env functions:

    lerobot-train \
      --env.type=hub \
      --env.hub_path="username/my-custom-env" \
      --policy.type=act

This lowers the barrier for sharing custom simulation environments with the community. Package your environment, push it to the Hub, and anyone can train on it. Check out the documentation to learn more. Here's an example to get started: LeIsaac x LeRobot EnvHub tutorial.

NVIDIA IsaacLab-Arena

We've integrated NVIDIA IsaacLab-Arena, bringing GPU-accelerated simulation to LeRobot. IsaacLab-Arena provides a collection of manipulation tasks running on NVIDIA's Isaac Sim, offering massively parallel environment instances for fast reinforcement learning. The integration includes dedicated pre/post-processing steps and full compatibility with LeRobot's training pipeline. Check out the documentation.

Codebase: A Modern Foundation

This release modernizes the codebase:
- Python 3.12+: LeRobot now requires Python 3.12 as the minimum version, enabling modern syntax and better performance.
- Transformers v5: We've migrated to Hugging Face Transformers v5, staying current with the latest model ecosystem.
- 3rd-party policy plugins: Just like v0.4.0's hardware plugin system, you can now register custom policies as installable packages — pip install lerobot_policy_mypolicy and use it with --policy.type=mypolicy. No core library changes needed. Learn how to do it by following the documentation.
- Remote Rerun visualization: Visualize your robot's telemetry remotely using Rerun, with compressed image support for bandwidth-efficient streaming.
- Installation improvements: Added uv installation instructions, clarified setup steps, and improved dependency management. Sequential install steps are now clearly documented.
- Documentation versioning: Docs are now versioned, so you can always find documentation matching your installed release.
- PyTorch version bump: Updated PyTorch version bounds to support NVIDIA Blackwell GPUs.

Community & Ecosystem

- Modernized Discord: Updated the most vibrant community hub with better channel organization.
- GitHub README, templates & automated labeling: A refreshed README, new issue and PR templates, contributing guidelines, and automatic labeling of tickets — making it easier for everyone to contribute.
- ICLR 2026 paper acceptance: The LeRobot paper has been accepted to ICLR 2026!
- LeRobot Visualizer refresh: The visualization tool got a refresh with new dataset visualization badges and improved functionality. Check it out!
- LeRobot Annotation Studio: A Hugging Face Space designed to easily annotate every moment of your dataset with natural language subtasks. Check it out!

Final Thoughts

Beyond these headline features, v0.5.0 includes hundreds of bug fixes, documentation improvements, CI/CD enhancements, and quality-of-life improvements across the entire codebase. From better type checking to more robust test infrastructure, we're investing in the foundations that make LeRobot reliable and maintainable as it scales.

We want to extend a huge thank you to everyone in the community — contributors, users, and collaborators alike — for helping LeRobot grow into what it is today. Every bug report, PR, and discussion makes this project better. Stay tuned for more to come 🤗 Get started here!

– The LeRobot team ❤️ There's a big surprise coming just right around the corner, stay tuned!
Mentioned models: lerobot/fast-action-tokenizer, lerobot/xvla-base. Mentioned paper: Real-Time Execution of Action Chunking Flow Policies (arXiv:2506.07339).
OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first openai 17.03.2026 10:00 0.621
Embedding sim.0.7086
Entity overlap0.0714
Title sim.0.1196
Time proximity0.8994
NLP типproduct_launch
NLP организацияOpenAI
NLP темаgenerative ai
NLP странаJapan

Открыть оригинал

OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens using generative AI.
How our open-source AI model SpeciesNet is helping to promote wildlife conservation google 06.03.2026 18:00 0.621
Embedding sim.0.7216
Entity overlap0
Title sim.0.1316
Time proximity0.8036
NLP типproduct_launch
NLP организацияGoogle
NLP темаartificial intelligence
NLP странаUnited States

Открыть оригинал

How our open-source AI model SpeciesNet is helping to promote wildlife conservation
Mar 06, 2026

Since SpeciesNet went open source one year ago, more people than ever have been able to use AI to identify animals and learn about their habitats, promoting wildlife monitoring and conservation worldwide.
Tanya Birch, Senior Program Manager, Google Earth Outreach
Dan Morris, Senior Research Scientist, Google Research

From a puma prowling through the Colombian forest at dawn to a cassowary wandering across Australia, motion-triggered cameras give us an unprecedented view of what animals do when humans aren't around. But for wildlife managers, biologists and conservationists, turning millions of these candid snapshots into actionable data is incredibly time-consuming. That's where SpeciesNet comes in.

SpeciesNet is an AI model trained to automatically identify nearly 2,500 categories of mammals, birds and reptiles. The model has been used since 2019 via Wildlife Insights. We launched it as a free, open-source tool a year ago, and today, research groups are using it to make sense of their camera trap data faster than ever.

Africa partner: Snapshot Serengeti

[Image: These images from 2024 show a group of elephants at night, a male lion, a zebra in profile, and a warthog that appears to be looking at the camera. Credit: Snapshot Serengeti; T.M. Anderson]

In Africa, the Snapshot Serengeti project has operated camera traps in Tanzania's Serengeti National Park, in collaboration with the Tanzanian Wildlife Research Institute, since 2010.
At first the project recruited online volunteers, but it had too many images for the volunteers to analyze. Project leader Todd Michael Anderson at North Carolina's Wake Forest University used SpeciesNet to analyze a backlog of 11 million photos, processing decades' worth of data in just days. The project is analyzing these images to get a long-term view of fauna behavior and abundance in one of Africa's most biodiverse regions.

South American partner: Colombia's Humboldt Institute

[Image: These images were captured between March and May 2025. They show an ocelot, a small wild cat that's endangered in the southern U.S. and Mexico but is still common in South America, and a puma (also known as a cougar or mountain lion). Credit: Project Lucitania/Universidad de los Andes/Red Otus]

In Colombia, our longtime collaborators at the Humboldt Institute use SpeciesNet as part of the Wildlife Insights platform. Many of the species the institute monitors live in Colombia's Amazon Rainforest, an extremely biodiverse region that is undergoing rapid changes. The group recently expanded its work by launching Red Otus, a national-scale network that captures camera trap images on public and private land across the country. The Red Otus project has analyzed tens of thousands of the images it has collected to discover changes in the timing of bird migrations and the daily patterns of wildlife across Colombia. Analysis suggests that some mammals are becoming more nocturnal, perhaps to avoid threats, and birds appear later in the morning in developed areas, perhaps to avoid predators.

North American partner: Idaho Department of Fish and Game

[Image: These images, captured from July through September 2025, show some of the species IDFG monitors to ensure the population is healthy and stable: a family of black bears, a coyote, a mule deer and an elk. Credit: Idaho Department of Fish and Game]

The Idaho Department of Fish and Game (IDFG) is among many state wildlife and transportation agencies in the U.S. and Canada that are using the SpeciesNet AI model to identify animals in their camera trap photos. While aerial surveys are frequently flown in southern Idaho, the agency deploys hundreds of camera traps across the state, particularly in the more forested, northern areas. Human experts conduct a final review, but having SpeciesNet sort the images by species beforehand greatly speeds up reviewing the millions of images collected each year.

Australian partner: Wildlife Observatory of Australia

[Image: The images above were captured by WildObs partners in Australian springtime, from August to November 2025. They show a pair of red-legged pademelons, cassowaries out for a midday stroll, and a cassowary peering into the camera. Credit: Wildlife Observatory of Australia]

In Australia, our collaborators at the Wildlife Observatory of Australia (WildObs) took the open-source SpeciesNet model and trained it to identify species that weren't part of the initial model, but that are important locally. Australia is home to many species not found anywhere else in the world, and those species are a priority for monitoring and conservation. A version of SpeciesNet trained on local wildlife lets groups keep an eye on iconic, threatened or endangered species specific to their region in order to sustain wild populations.

SpeciesNet can identify species from multiple angles, in different types of light, and when only a portion of the animal is visible.
But sometimes animals get curious and look straight at the camera, producing a true portrait. The projects above represent just a sample of the groups we've worked with to help run SpeciesNet to interpret camera trap photos. We're grateful to all of our partners who are applying this tool on the ground to better understand and protect the wildlife that also call our planet home. To learn more about the history of SpeciesNet, its model training and performance, read our post on the Google Research Blog.
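(The triage workflow described above — model sorts images by species, humans do the final review — is easy to sketch. The following is a minimal illustration only: the predictions-file schema and the confidence threshold are our assumptions, not SpeciesNet's actual output format; see the project's repository for the real interfaces.)

    # Minimal sketch of camera-trap triage: group images by predicted species
    # and route low-confidence shots to human review. The JSON schema and the
    # 0.8 threshold are illustrative assumptions, not SpeciesNet's real format.
    import json
    from collections import defaultdict

    with open("predictions.json") as f:   # hypothetical model-output file
        predictions = json.load(f)        # assumed: [{"file", "species", "confidence"}, ...]

    by_species = defaultdict(list)
    needs_review = []
    for p in predictions:
        if p["confidence"] >= 0.8:        # assumed review threshold
            by_species[p["species"]].append(p["file"])
        else:
            needs_review.append(p["file"])

    for species, files in sorted(by_species.items()):
        print(f"{species}: {len(files)} images")
    print(f"flagged for human review: {len(needs_review)}")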
Meta’s latest AI improves its terrible content moderation the_register_ai 20.03.2026 04:16 0.621
Embedding sim.0.7125
Entity overlap0.0385
Title sim.0.098
Time proximity0.9124
NLP типproduct_launch
NLP организацияMeta
NLP темаcontent moderation
NLP страна

Открыть оригинал

Meta's latest AI improves its terrible content moderation, just a little
Enterprise tools have detected impossible logins for years. Zuck's human mods couldn't join the dots
Simon Sharwood, Fri 20 Mar 2026 // 04:16 UTC

Meta has revealed it's tested using AI for content moderation chores and found it does better than humans. The social networking giant on Thursday announced it has started a global rollout for its Meta AI support, a tool that handles tasks like password resets, reporting dodgy content, explaining content takedowns and allowing appeals, or managing privacy settings.

The company also said "Over the next few years, we will deploy more advanced AI systems across our apps to transform our approach to content enforcement, more accurately finding and removing severe content violations like scams and illegal content, so people see less of them."

Early experiments have delivered promising results: one AI tool detected and mitigated 5,000 attempts at scamming users to reveal their passwords every day. Meta says its human teams could not detect those scams. Another AI helped to reduce the number of reports users lodged about fake celebrity profiles by over 80 percent. Other tests doubled detection of adult sexual solicitation content that violates Meta's rules.

Meta says its AI can also "Prevent an account takeover by noticing it was suddenly accessed from a new location, the password was changed, and edits were made to the profile." The company says those changes "look harmless to a person reviewing the account, but AI was able to recognize as a threat."

That's an odd observation given that numerous enterprise security products can detect "impossible travel" – such as a single user logging in from London and an hour later requesting a password reset from San Francisco – and flag it as a likely attack.

Meta also enthused that AI can "Detect a fake site spoofing a legitimate web address and pretending to be a popular sporting goods store by noticing the real logo being used with unusually low prices and a suspicious web address," because AI "drove down views of ads with scams and other serious violations by seven percent, offering promising results and better protections for users and brands."

Again, that's a nice outcome, but also a little odd, as fake ads are a known problem – and one to which Meta has often been indifferent. Your correspondent once spotted a suspicious ad for a brand that publishes lists of its legitimate URLs. The ad led to a spoof site, so I made a report to Meta – which replied that the fake site was not in violation of its policies. The company ignored my reply pointing out that the URL was not on the brand's list of official sites.

Leaving the metaverse

Meta's march toward AI came in the same week as it walked away from the metaverse, its vision for immersive online communities. On Wednesday, the company announced the shutdown of Horizon Worlds, its metaverse platform. That decision meant owners of Meta's Quest VR goggles would have a lot less content to consume. The company later walked that back and promised to continue offering some of its immersive environments, but said it would not create new ones.
Meta adopted its current name to reflect founder Mark Zuckerberg’s belief that the metaverse was the next big thing, and spent over $80 billion on its ambitions over five years – a period when we can now see its content moderation for Facebook and Instagram was going badly, and children were often harmed by the company’s products. The company now plans to develop something it calls “superintelligence,” and is spending tens of billions more to make that happen. ®
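The “impossible travel” check referenced above is simple enough to sketch. What follows is a minimal, hypothetical Python illustration, not Meta’s or any security vendor’s actual implementation: the event fields, the helper names, and the 900 km/h plausibility threshold are all assumptions made for the example.

# Hypothetical sketch of an "impossible travel" heuristic: flag two events
# for the same account whose implied travel speed exceeds anything a human
# could plausibly manage. Illustrative only; names/threshold are assumptions.
from dataclasses import dataclass
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # roughly airliner cruising speed

@dataclass
class LoginEvent:
    account_id: str
    timestamp: datetime
    lat: float  # latitude from IP geolocation, degrees
    lon: float  # longitude from IP geolocation, degrees

def haversine_km(a: LoginEvent, b: LoginEvent) -> float:
    # Great-circle distance between the two events, in kilometres.
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def is_impossible_travel(prev: LoginEvent, curr: LoginEvent) -> bool:
    # Flag the pair if the implied speed exceeds the plausible maximum.
    hours = (curr.timestamp - prev.timestamp).total_seconds() / 3600.0
    if hours <= 0:
        return True  # simultaneous logins from two places are suspicious too
    return haversine_km(prev, curr) / hours > MAX_PLAUSIBLE_SPEED_KMH

# The article's example: a London login, then a password-reset request from
# San Francisco (~8,600 km away) one hour later.
london = LoginEvent("u1", datetime(2026, 3, 20, 4, 0), 51.5074, -0.1278)
san_francisco = LoginEvent("u1", datetime(2026, 3, 20, 5, 0), 37.7749, -122.4194)
assert is_impossible_travel(london, san_francisco)  # implied speed ~8,600 km/h

Production systems presumably layer allowances for VPNs, proxies, and a user’s travel history on top of this, but the core signal is just distance over time, which is why it has been a commodity feature of enterprise security tools for years.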
What Rivian's AI & Autonomy Growth Signals ai_supremacy 04.03.2026 10:40 0.62
Embedding sim.0.7199
Entity overlap0.0727
Title sim.0.05
Time proximity0.8832
NLP типother
NLP организацияRivian
NLP темаautonomous driving
NLP странаUnited States

Открыть оригинал

🎈 Guest Posts
What Rivian's AI & Autonomy Growth Signals
The smart car era is finally starting to manifest. Rivian's AI and autonomy pivot impresses me.
Michael Spencer and Daniel Abreu Marques, Mar 04, 2026 ∙ Paid

Rivian’s R2 is finally coming. Today, something a little bit different. I’m a big fan of journalists who cover the AV sector. I’ve been watching the smart car sector for what feels like decades. From Apple’s failure to build a smart car to Xiaomi’s success, it’s been quite a journey. In the latter half of the 2020s, there’s finally light at the end of the tunnel. Advanced autonomy, improved sensors, millions more miles of experience, integrated software, and embedded AI have all improved considerably in the last decade. That world of robots, delivery drones, robo-taxis, and consumer ambient intelligence: well, it’s finally coming.

A Physical AI Wave Begins in 2027

I’m hugely optimistic about the coming Physical AI wave. When Rivian held their AI & Autonomy Day back in December 2025, I had the distinct impression that this is where the entire world is heading. The physical AI devices of 2027, the emphasis on humanoid robotics, and the evolution of autonomous vehicles will finally appear tangible by the end of the decade. Rivian’s CEO also recently appeared on the No Priors Podcast, a fairly good AI startup podcast. It’s been a long road for Rivian, but I see the company turning the corner this year, moving from a niche, high-end truck maker to a serious mass-market contender, driven primarily by the launch of the R2 platform. I’ve always liked the look of their vehicles, and finally they are more affordable too: the Rivian R2 is a new, more affordable midsize electric SUV built on a dedicated, highly efficient, and compact platform, starting around $45,000. But it’s the AI and autonomy work they are doing that is the deal maker for me. Video: The Leap | Autonomy & AI Day | Rivian - YouTube.

Autonomous Vehicles are Closer to Mainstream Adoption

Rivian is one of the most promising EV startups in the U.S., at a major turning point in 2026 as it pivots more to AI and autonomy, which I find fascinating. The company is backed by the likes of Amazon and Volkswagen Group, and I believe it will become a major alternative to Tesla even as demand for electric vehicles has waned. I more or less see 2027 as the date civilization enters the age of Physical AI, and I think Rivian represents an interesting case study in its emergence. As Waymo, Google’s leading AV moonshot, expands (Axios), it’s an exciting time in autonomous vehicle evolution.

I watched the entire Rivian AI and Autonomy Day live, and I was impressed. I asked Daniel Abreu Marques of the newsletter The AV Market Strategist for his insights on Rivian and a deep dive on their AI & autonomy strategy. If you are interested in smart car developments, this is a great publication: The AV Market Strategist breaks down strategic developments, market dynamics, and competitive moves in the autonomous vehicle industry to reveal where the market is truly heading. Autonomous vehicles are improving and expanding to more cities near you. It’s finally happening.
Recent posts by Daniel Abreu Marques:
- Wayve Raises $1.5B, Waymo Now In 10 Cities, And Uber Builds the AV Platform Layer
- Waymo’s Shocking Data Release, Uber Invests $100M in Robotaxi Infrastructure
- Tesla All-In Autonomy: 7 New Robotaxi Cities, $20b CAPEX, $2b Investment in xAI

In just a couple of weeks in March 2026, we’re going to learn a lot more about Rivian’s R2. Spring 2026 is going to be the company’s coming-out party. Daniel’s deep dive is a serious look into Rivian and its prospects.

Rivian’s AI & Autonomy Highlights ✨

A guest post by Daniel Abreu Marques: “I’m Daniel, and I work in strategy and market intelligence for autonomous driving. As a student of the market, I analyze market developments, monitor players, and try to connect the dots.”