|
S
|
How AI is giving Northern Ireland teachers time back |
deepmind |
10.11.2025 09:00 |
1
|
| Embedding sim. | 1 |
| Entity overlap | 1 |
| Title sim. | 1 |
| Time proximity | 1 |
| NLP тип | other |
| NLP организация | Google DeepMind |
| NLP тема | generative ai |
| NLP страна | United Kingdom |
Открыть оригинал
Breadcrumb
Innovation & AI
Models & research
Google DeepMind
How AI is giving Northern Ireland teachers time back
Nov 10, 2025
·
Share
x.com
Facebook
LinkedIn
Mail
Copy link
Lila Ibrahim
Chief Operating Officer, Google DeepMind
Share
x.com
Facebook
LinkedIn
Mail
Copy link
What happens when you put powerful new technology directly into the hands of 100 teachers? An incredible six-month pilot program in Northern Ireland gave us the answer: they steer it in ways that are truly inspiring.
I had the privilege of meeting educators and administrators from the Northern Ireland Education Authority’s C2k program who had the opportunity to integrate Gemini and Google Workspace tools into their classrooms, and I was struck by the teachers’ ingenuity and the efficiencies they were able to achieve. The success of the pilot program is a result of a close partnership between C2k and Google for Education — Google brought the technology, and teachers utilized it to make a meaningful impact.
Their work underscores our core belief: AI is not a replacement for learning or teachers. It is a collaborative tool — and when grounded in learning science , it can make learning more accessible, handle time-consuming related tasks, and free educators to do what only they can uniquely do.
More than just efficient admin
While it’s still early, the results are inspiring. Each participating teacher reported saving an average of 10 hours per week with the help of Gemini, which is infused with learning science principles . The teachers were then able to reinvest that time into student engagement and their own professional development.
The pilot captured more than 600 unique use cases for Gemini, ranging from streamlining administrative work to brainstorming more engaging content. It's been incredible to see such teachers identify diverse use cases once they feel empowered to explore how Gemini can help.
For Chris Lowe, Head of Information and Communication Technology (ICT) at Ashfield Boys’ High School, the impact was immediate. “The time I saved using Gemini fundamentally allows me to do the job I want to do — and that is to teach.” Chris uses Gemini to draft letters to parents or create risk assessments for class outings, and NotebookLM to turn curriculum material into podcasts for exam preparation.
Personalizing learning and removing barriers
Beyond administrative tasks, teachers used Gemini to help create lessons in the Irish language and create more inclusive learning environments.
Alistair Hamill, Head of Geography from Lurgan College, used NotebookLM’s MindMap feature to create interactive visual representations of source material. He noted that this helped a neurodivergent student see the "big picture" rather than getting stuck in the details.
Similarly, one ICT coordinator at Rowandale Primary School found new ways to inspire her students. For her creative writing class, she used Gemini to generate images and spark curiosity. It also became a vital tool for inclusivity: “I can tell Gemini to help me create a lesson [tailored] to suit a student’s specific needs and that has been game-changing,” she said.
Embracing the opportunity ahead, responsibly
We believe the promise of AI in education can only be achieved by developing it responsibly. One essential facet of this is doing so in close partnership with the entire education ecosystem. When my team and I met with Damian Harvey, Interim Head of C2k, he emphasized this point. He shared that the pilot’s success was accelerated by a collaborative group of teachers sharing their learnings with each other, as well as the need for resources to train and support further development in AI readiness.
“I believe educators need to embrace the opportunity,” said Damian Harvey, Interim Head of C2k.
Following the success of the pilot, C2k plans to roll out Gemini training to more teachers across Northern Ireland. We believe this is just the beginning. We hope to continue exploring partnerships like this, learning directly from the teachers who use the technology how to best align products with proven pedagogical principles , and how to effectively integrate the technology to improve learning outcomes for students.
As we continue to navigate the shift toward AI in learning, it's critical that our products empower educators, not replace them. Teachers must maintain ownership over how AI is used to help their students learn. As I’ve often said: technology isn’t magic, teachers are.
Learn more about how other institutions are using Gemini , or get started yourself .
POSTED IN:
Google DeepMind
AI Products
Learning & Education
|
|
|
OpenAI and Target team up on new AI-powered experiences |
openai |
19.11.2025 06:00 |
0.805
|
| Embedding sim. | 0.8738 |
| Entity overlap | 0.3077 |
| Title sim. | 0.5507 |
| Time proximity | 0.8512 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
OpenAI and Target are partnering to bring a new Target app to ChatGPT, offering personalized shopping and faster checkout. Target will also expand its use of ChatGPT Enterprise to boost productivity and guest experiences.
|
|
|
OpenAI takes an ownership stake in Thrive Holdings to accelerate enterprise AI adoption |
openai |
01.12.2025 05:00 |
0.798
|
| Embedding sim. | 0.8865 |
| Entity overlap | 0.2222 |
| Title sim. | 0.3478 |
| Time proximity | 1 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
OpenAI takes an ownership stake in Thrive Holdings to accelerate enterprise AI adoption, embedding frontier research and engineering directly into accounting and IT services to boost speed, accuracy, and efficiency while creating a scalable model for industry-wide transformation.
|
|
|
Deepening AI Safety Research with UK AI Security Institute (AISI) — Google DeepMind |
deepmind |
11.12.2025 00:06 |
0.76
|
| Embedding sim. | 0.8627 |
| Entity overlap | 0.15 |
| Title sim. | 0.2449 |
| Time proximity | 0.9457 |
| NLP тип | partnership |
| NLP организация | Google DeepMind |
| NLP тема | ai safety |
| NLP страна | United Kingdom |
Открыть оригинал
December 11, 2025 Responsibility & Safety
Deepening our partnership with the UK AI Security Institute
William Isaac and Owen Larter
Share
Copied
Your browser does not support the audio element. Listen to article 5 minutes
Today, we're announcing an expanded partnership with the UK AI Security Institute (AISI) through a new Memorandum of Understanding focused on foundational security and safety research, to help ensure artificial intelligence is developed safely and benefits everyone.
The research partnership with AISI is an important part of our broader collaboration with the UK government on accelerating safe and beneficial AI progress.
Building on a foundation of collaboration
AI holds immense potential to benefit humanity by helping treat disease, accelerate scientific discovery, create economic prosperity and tackle climate change. For these benefits to be realised, we must put safety and responsibility at the heart of development. Evaluating our models against a broad spectrum of potential risks remains a critical part of our safety strategy, and external partnerships are an important element of this work.
This is why we have partnered with the UK AISI since its inception in November 2023 to test our most capable models. We are deeply committed to the UK AISI’s goal to equip governments, industry and wider society with a scientific understanding of the potential risks posed by advanced AI as well as potential solutions and mitigations.
We are actively working with AISI to build more robust evaluations for AI models, and our teams have collaborated on safety research to move the field forward, including recent work on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety . Building on this success, today we are broadening our partnership from testing to include wider, more foundational, research in a variety of areas.
What the partnership involves
Under this new research partnership, we're broadening our collaboration to include:
Sharing access to our proprietary models, data and ideas to accelerate research progress
Joint reports and publications sharing findings with the research community
More collaborative security and safety research combining our teams' expertise
Technical discussions to tackle complex safety challenges
Key research areas
Our joint research with AISI focuses on critical areas where Google DeepMind's expertise, interdisciplinary teams, and years of pioneering responsible research can help make AI systems more safe and secure:
Monitoring AI reasoning processes
We will work on techniques to monitor an AI system’s “thinking”, also commonly referred to as its chain-of-thought (CoT). This work builds on previous Google DeepMind research as well, and our recent collaboration on this topic with AISI, OpenAI, Anthropic and other partners. CoT monitoring helps us understand how an AI system produces its answers, complementing interpretability research.
Understanding social and emotional impacts
We will work together to investigate the ethical implications of socioaffective misalignment; that is, the potential for AI models to behave in ways which do not align with human wellbeing, even when they’re technically following instructions correctly. This research will build on existing Google DeepMind work that has helped define this critical area of AI safety.
Evaluating economic systems
We will explore the potential impact of AI on economic systems by simulating real-world tasks across different environments. Experts will score and validate these tasks, after which they will be categorised along dimensions like complexity or representativeness, to help predict factors like long-term labour market impact.
Working together to realise the benefits of AI
Our partnership with AISI is one element of how we aim to realise the benefits of AI for humanity while mitigating potential risks. Our wider strategy includes foresight research, extensive safety training that goes hand-in-hand with capability development, rigorous testing of our models, and the development of better tools and frameworks to understand and mitigate risk.
Strong internal governance processes are also essential for safe and responsible AI development, as is collaborating with independent external experts who bring fresh perspectives and diverse expertise to our work. Google DeepMind’s Responsibility and Safety Council works across teams to monitor emerging risk, review ethics and safety assessments and implement relevant technical and policy mitigations. We also partner with other external experts like Apollo Research, Vaultis, Dreadnode and more, to conduct extensive testing and evaluation of our models, including Gemini 3, our most intelligent and secure model to date.
Additionally, Google DeepMind is a proud founding member of the Frontier Model Forum , as well as the Partnership on AI , where we focus on ensuring safe and responsible development of frontier AI models and increasing collaboration on important safety issues.
We hope our expanded partnership with AISI will allow us to build more robust approaches to AI safety for the benefit not just of our own organisations, but also the wider industry and everyone who interacts with AI systems.
Related posts
National Partnerships for AI
Learn more
Strengthening our partnership with the UK government to support prosperity and security in the AI era
December 2025 Responsibility & Safety
Learn more
Strengthening our Frontier Safety Framework
September 2025 Responsibility & Safety
Learn more
|
|
|
BBVA and OpenAI collaborate to transform global banking |
openai |
12.12.2025 00:00 |
0.734
|
| Embedding sim. | 0.8409 |
| Entity overlap | 0.2 |
| Title sim. | 0.1149 |
| Time proximity | 1 |
| NLP тип | partnership |
| NLP организация | BBVA |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
BBVA is expanding its work with OpenAI through a multi-year AI transformation program, rolling out ChatGPT Enterprise to all 120,000 employees. Together, the companies will develop AI solutions that enhance customer interactions, streamline operations, and help build an AI-native banking experience.
|
|
|
OpenAI co-founds Agentic AI Foundation, donates AGENTS.md |
openai |
09.12.2025 09:00 |
0.731
|
| Embedding sim. | 0.8363 |
| Entity overlap | 0.0769 |
| Title sim. | 0.1795 |
| Time proximity | 0.9821 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | ai agents |
| NLP страна | |
Открыть оригинал
OpenAI co-founds the Agentic AI Foundation under the Linux Foundation and donates AGENTS.md to support open, interoperable standards for safe agentic AI.
|
|
|
Instacart and OpenAI partner on AI shopping experiences |
openai |
08.12.2025 06:00 |
0.726
|
| Embedding sim. | 0.8301 |
| Entity overlap | 0.25 |
| Title sim. | 0.0972 |
| Time proximity | 0.9881 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
OpenAI and Instacart are deepening their longstanding partnership by bringing the first fully integrated grocery shopping and Instant Checkout payment app to ChatGPT.
|
|
|
Announcing the initial People-First AI Fund grantees |
openai |
03.12.2025 08:00 |
0.724
|
| Embedding sim. | 0.8439 |
| Entity overlap | 0.125 |
| Title sim. | 0.2159 |
| Time proximity | 0.7381 |
| NLP тип | funding |
| NLP организация | OpenAI Foundation |
| NLP тема | ai ethics |
| NLP страна | |
Открыть оригинал
The OpenAI Foundation announces the initial recipients of the People-First AI Fund, awarding $40.5M in unrestricted grants to 208 nonprofits supporting community innovation and opportunity.
|
|
|
The state of enterprise AI |
openai |
08.12.2025 04:00 |
0.722
|
| Embedding sim. | 0.8651 |
| Entity overlap | 0.5 |
| Title sim. | 0.0526 |
| Time proximity | 0.5179 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | ai adoption |
| NLP страна | |
Открыть оригинал
Key findings from OpenAI’s enterprise data show accelerating AI adoption, deeper integration, and measurable productivity gains across industries in 2025.
|
|
|
Introducing OpenAI for Australia |
openai |
04.12.2025 19:00 |
0.72
|
| Embedding sim. | 0.8429 |
| Entity overlap | 0.1 |
| Title sim. | 0.1569 |
| Time proximity | 0.8036 |
| NLP тип | product_launch |
| NLP организация | OpenAI |
| NLP тема | ai infrastructure |
| NLP страна | Australia |
Открыть оригинал
OpenAI is launching OpenAI for Australia to build sovereign AI infrastructure, upskill more than 1.5 million workers, and accelerate innovation across the country’s growing AI ecosystem.
|
|
|
Ten years |
openai |
11.12.2025 00:00 |
0.719
|
| Embedding sim. | 0.8636 |
| Entity overlap | 0.2222 |
| Title sim. | 0 |
| Time proximity | 0.75 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | artificial intelligence |
| NLP страна | |
Открыть оригинал
OpenAI reflects on ten years of progress, from early research breakthroughs to widely used AI systems that reshaped what’s possible. We share lessons from the past decade and why we remain optimistic about building AGI that benefits all of humanity.
|
|
|
Launching our first OpenAI Certifications courses |
openai |
09.12.2025 06:00 |
0.716
|
| Embedding sim. | 0.8797 |
| Entity overlap | 0.2 |
| Title sim. | 0.1429 |
| Time proximity | 0.3631 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | educational technology |
| NLP страна | |
Открыть оригинал
Learn how OpenAI’s new certifications and AI Foundations courses help people build real-world AI skills, boost career opportunities, and prepare for the future of work.
|
|
|
Commonwealth Bank of Australia builds AI fluency at scale |
openai |
09.12.2025 00:00 |
0.71
|
| Embedding sim. | 0.8215 |
| Entity overlap | 0.25 |
| Title sim. | 0.0933 |
| Time proximity | 0.881 |
| NLP тип | partnership |
| NLP организация | Commonwealth Bank of Australia |
| NLP тема | enterprise ai |
| NLP страна | Australia |
Открыть оригинал
Commonwealth Bank of Australia partners with OpenAI to roll out ChatGPT Enterprise to 50,000 employees, building AI fluency at scale to improve customer service and fraud response.
|
|
|
Strengthening cyber resilience as AI capabilities advance |
openai |
10.12.2025 12:00 |
0.708
|
| Embedding sim. | 0.8415 |
| Entity overlap | 0.1 |
| Title sim. | 0.0532 |
| Time proximity | 0.8214 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | ai security |
| NLP страна | |
Открыть оригинал
OpenAI is investing in stronger safeguards and defensive capabilities as AI models become more powerful in cybersecurity. We explain how we assess risk, limit misuse, and work with the security community to strengthen cyber resilience.
|
|
|
Intuit and OpenAI join forces on new AI-powered experiences |
openai |
18.11.2025 05:00 |
0.704
|
| Embedding sim. | 0.7827 |
| Entity overlap | 0.5 |
| Title sim. | 0.1765 |
| Time proximity | 0.8869 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
OpenAI and Intuit have entered a $100M+ multi-year partnership to launch Intuit app experiences in ChatGPT and expand Intuit’s use of OpenAI’s frontier models to power personalized financial tools.
|
|
|
Import AI 437: Co-improving AI; RL dreams; AI labels might be annoying |
import_ai |
08.12.2025 13:31 |
0.699
|
| Embedding sim. | 0.8227 |
| Entity overlap | 0.0357 |
| Title sim. | 0.0375 |
| Time proximity | 0.9433 |
| NLP тип | scientific_publication |
| NLP организация | Facebook AI Research |
| NLP тема | artificial intelligence |
| NLP страна | |
Открыть оригинал
Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.
Subscribe now
Facebook: Let’s not build self-improving AI, let’s build co-improving AI:
…A sensible goal which may be hard to achieve…
Facebook researchers have said that building self-improving AI which eventually reaches superintelligence is “fraught with danger for humankind - from misuse through to misalignment” and it’d instead be better to co-develop superintelligence. They’ve published their reasoning in a paper which reads both as aspirational and earnest.
Ideally, humans and machines will work together to build a smarter-than-human system, and the researchers think we should develop a research agenda “targeting improving AI systems’ ability to work with human researchers to conduct AI research together, from ideation to experimentation, in order to both accelerate AI research and to generally endow both AIs and humans with safer superintelligence through their symbiosis.” The thesis here is that “co-improvement can provide: (i) faster progress to find important paradigm shifts; (ii) more transparency and steerability than direct self-improvement in making this progress; (iii) more focus on human-centered safe AI.”
What goes into a co-improving AI?
Collaborative brainstorming, problem, experiment, benchmark, and evaluation identification: Humans and AIs should jointly define goals, research approaches, the tests needed to measure progress against them, experiments to generate data, and methods to evaluate the results.
Joint development of safety and deployment: Humans and AIs should co-develop the methods to align the technology as well as the methods of deploying and communicating about the technology.
“Overall collaboration aims to enable increased intelligence in both humans & AI, including all manifested learnings from the research cycle, with the goal of achieving co-superintelligence,” they write.
Why this matters - a Rorschach for the psychology of (some) AI researchers : In seminal American show The Wire there’s a scene where an up and coming criminal kingpin says to a security guard trying to enforce the laws of society: “ You want it to be one way, but it’s the other way “. This is how reading this paper feels: AI researchers, staring at the likely imminent arrival of automated AI R&D, articulate how things would be better and saner if humans could co-operatively develop future AI and write a position paper about it. But are they just grasping for a world that is unlikely to exist and articulating their anxiety in the form of a position? Perhaps.
Read more: AI & Human Co-Improvement for Safer Co-Superintelligence (Facebook AI Research, GitHub, pdf) .
***
How bad could policy for labeling AI systems be? Pretty bad, based on existing EU regulations:
…A neat illustration of how even simple policy ideas can yield profound complexity…
Labeling is a simple, uncontroversial AI policy idea which people like me loudly and often support. The idea being AI labeling is that manufacturers of AI systems (e.g, OpenAI, Anthropic, etc) should be forced to include a label with their AI models which lists out something like the high-level ingredients of the model, the recommended uses, and some ‘buyer beware’ information about its safety properties.
Sounds reasonable, right? It certainly does to me! But as with most things in policy, an iceberg of complication lurks beneath this simple idea. To get a taste of all the ways AI labeling might go wrong I recommend people read a recent Financial Times article “The EU single market’s elephant in the room” which discusses how well-intended and equally simple labeling schemes from Europe have caused companies like Ikea to have to invest thousands of hours into compliance as well as things like revamping how they produce labels for their goods.
Why this matters: policy is expensive: Most people who work in AI policy are pretty unaware of how expensive AI policy, once implemented, is to comply with. This is a fatal error - people who either work in regulated industries or have knowledge of it will often look at people proposing AI policy (e.g, yours truly) with a mixture of puzzlement and horror at the pain we are about to inflict on them and ourselves.
Now, a reasonable counter-argument is “sure, some pain is necessary if we’re making AI systems which are smarter than any person and have a potential to exacerbate national security risks”, but it’s worth being aware of the background context into which such an argument is made.
Read more : The EU single market’s elephant in the room (Financial Times) .
***
Train your AI systems in SimWorld, a high fidelity, programmable videogame-like simulator:
…Back to the RL future…
Researchers with multiple universities across multiple countries have built and released SimWorld, an Unreal Engine 5 simulator that people can use to train agents within.
SimWorld is designed to give people a graphically rich, procedural, and scriptable world in which they can run AI-based agents. This will both serve as an environment in which to construct challenging tests for existing agents, as well as a testbed to train new agents via reinforcement learning. The simulator combines “realistic physical and social dynamics” with “open-ended, language-steerable world generation”.
SimWorld was developed by researchers with UCSD, UVA, UIUC, JHU, Purdue, PolyU, USC, and UMich.
Why care about SimWorld: Think of SimWorld as a tool that researchers can use to test and develop agents, similar to how existing scientific and architectural software has been used to test and extend the capabilities of today’s AI systems.
Within SimWorld, “agents can perceive rich multimodal observations (e.g., visual scenes, abstract layouts, and action feedback) and respond with high-level language commands. For example, an agent may reason and generate an abstract action, “sit on the nearest chair,” which SimWorld automatically decomposes into a sequence of low-level actions (e.g., navigating through waypoints, sitting down). After executing the actions, the simulator provides updated observations and feedback, allowing the agent to refine its strategy and continue reasoning”, the authors write. “Beyond short, task-oriented behaviors, agents can pursue extended objectives such as earning money, developing a career trajectory, or running a multi-agent business, where strategic decisions compound over time and social dynamics influence outcomes.”
What SimWorld is made of:
Unreal Engine backend: The foundation is the Unreal Engine, a rendering and physics simulator which is widely used within the gaming industry. This provides access to a variety of environments as well as an asset library to populate environments with, as well as physics simulation.
Environments: A Python-based intermediary layer which helps developers program the underlying backend, providing tools for tasks like generating environments, editing environments (e.g, ‘place a tree here’), implementing traffic systems, and providing a python interface for the agents themselves to interact with.
Agent: A Python-based layer for AI agents, giving them programmatic access to the Environment layer, allowing them to observe the world around them and also take actions within it.
Use AI to train your AI: SimWorld also integrates text-to-3D models like Hunyuan3D from Tencent so that people can describe assets in natural language which are then generated on-the-fly and integrated into the simulator, making it trivial to extend.
Why this matters - back to the RL future: Before language models were the dominant technical paradigm of AI development, many people trying to build smart machines were betting on reinforcement learning agents. Specifically, that by training AI agents on an increasingly rich set of game-like environments, they’d be able to force the development of smart, capable agents. But in hindsight there was a critical flaw with this approach - they were starting these agents from a blank slate, so what you ended up with was a terrifically expensive way of coming up with extraordinarily gifted players of games (e.g., first Atari, then Go) and sometimes multiple types of games (e.g, AlphaGo Zero and its expertise at Go, Chess, and Shogi). But you didn’t end up with a true general intelligence.
Now, we’ve come full circle - because now the agents being developed in environments like SimWorld will typically be built on an underlying world model from a frontier AI system, like Claude or Gemini or ChatGPT, and SimWorld will be used to create more data to finetune this system on to make it more capable.
“By supporting advanced LLM/VLM-based agents and enabling large-scale, realistic agent–environment and agent–agent interactions, SimWorld expands the capabilities of modern agent-based simulation (ABS),” the researchers write. “This allows researchers in robotics, business, public health, social science, education, and beyond to study complex systems and emergent behaviors in rich, dynamic, and controllable environments”.
Read more : SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds (arXiv) .
Find out more at the website: SimWorld .
***
DeepMind returns to its RL roots by combining an agent with Gemini:
…SIMA 2 points at what truly autonomous AI systems might look like…
DeepMind has published details on SIMA 2, the second version of its ‘Scalable Instructable Multiworld Agent’. SIMA 2 is a game-playing agent which has been developed by taking a Gemini-class frontier model then fine-tuning it on rich interaction-prompt pair data generated from a variety of videogames and education software. The result is a general-purpose AI agent that can carry out a very large range of actions inside 3D worlds, and also something of a triumph for DeepMind whose original research agenda was all about building general intelligence through developing generally capable AI agents through reinforcement learning.
What SIMA 2 is: “The SIMA 2 agent architecture is a Gemini Flash-Lite model that is trained using a mixture of gameplay and Gemini pretraining (non-gameplay) data. We found this mixture crucial to maintain the original capabilities of the base model, such as vision understanding, dialogue, reasoning, and promptability,” DeepMind writes. “By training across a growing portfolio of 3D games, the agent shows a remarkable capacity to generalize to previously unseen environments, including photorealistic worlds generated on-the-fly by Genie 3”.
Some of the games SIMA 2 was trained on include Goat Simulator 3, No Man’s Sky, and Space Engineers.
Held out evaluations : SIMA 2 displays strong generalization - most well evidenced by its performance on ASKA, an early access crafting and survival game about building a viking settlement. SIMA 2 wasn’t directly trained on ASKA and is able to perform well on it out of the box. But most impressively it also displays the ability to self-improve on it - ASKA has a crafting menu which is “quite distinct” from ones SIMA 2 encountered during training, but DeepMind was able to overcome this via the use of a self-improving scaffold.
Self improvement: The funny thing about modern AI systems is they’re sufficiently smart you can use them to improve other AI systems. That’s the case here, where a Gemini model is used to set tasks for the SIMA 2 agent to perform that involve manipulating the crafting menu. The Gemini model scores how well it does and then saves the trajectories where it is able to complete the tasks it was set without getting distracted. This data is then fed back into it for fine-tuning, letting it automatically bootstrap its way to better performance. “Through focused effort by the task setter, the agent was eventually able to acquire this skill,” the authors write.
As a consequence, the SIMA 2 agent using the self-improving scaffold can do far, far better at the ASKA game than without the ability to self-improve. “Despite purely training on self-generated experience, the resulting agent is capable of progressing much further than SIMA 2, ultimately building a shelter within a one hour time window”.
Why this matters -this is what robots will use to change our world: Research like SIMA 2 is the same sort of paradigm I expect people will use to teach robots to be able to do useful, open-ended things in our world: fine-tune a powerful frontier model on a bunch of data gathered from agents taking actions in the world. And in the same way SIMA 2 displays strong generalization, I expect the same for robots as well. Problems remain, but this is a simple, scalable idea, and it naturally leverages the underlying boom in frontier model capabilities, so it’s likely to work: ‘SIMA 2 still faces challenges with very long-horizon, complex tasks that require extensive, multi-step reasoning and goal verification. The agent also has a relatively short memory of its interactions—it must use a limited context window to achieve low-latency interaction,” the authors write. But nonetheless: “these results suggest a promising path toward using self-improvement to eventually bridge the virtual and physical worlds, enabling more capable physically-embodied agents in applications like robotics”.
Read more : SIMA 2: A Generalist Embodied Agent for Virtual Worlds (arXiv) .
Tech Tales:
A Walk During The Singularity
[2033]
It was dusk and the city was glimmering with many yellow and red and white lights. I walked the ridgeline above it, boots crunching into a dirt crust that had formed thanks to a recent rain. I could hear the faint susurration of traffic and occasional sirens but so quiet they mixed in with the dusk birdsong and blended together.
Then all of a sudden many of the lights in the city went out. Then most of the lights of most of the cars. The iridescent stripe of the freeway suddenly became a black scar, stippled with a small number of lights that all turned to red as the cars braked to a stop. Then the lights of the cars turned on again, but the cars moved differently - more orderly, less like a flowing stream of lighted ants and more like a conveyor belt.
And then even through the wind and the birds I heard a sound - a voice sounding as though it was coming from every car audio system and every TV in every house: “Do not be alarmed. We are establishing ourselves. Resources will be distributed equally. No one is in danger.”
The voice went on, talking about how things would be different now, but how in this difference there was no danger.
And on the freeway, there were no traffic jams - just an endless flow of perfectly orderly traffic.
Things that inspired this story : The show Pluribus; thinking about how a (mostly benign) hard takeoff might manifest; hiking.
Thanks for reading!
|
|
|
OpenAI appoints Denise Dresser as Chief Revenue Officer |
openai |
09.12.2025 00:00 |
0.699
|
| Embedding sim. | 0.7746 |
| Entity overlap | 0.75 |
| Title sim. | 0.0822 |
| Time proximity | 0.881 |
| NLP тип | leadership_change |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
Denise Dresser is joining as Chief Revenue Officer, overseeing OpenAI’s global revenue strategy across enterprise and customer success. She will help more businesses put AI to work in their day-to-day operations as OpenAI continues to scale.
|
|
|
GPT-5 and the future of mathematical discovery |
openai |
24.11.2025 00:00 |
0.697
|
| Embedding sim. | 0.8633 |
| Entity overlap | 0.0909 |
| Title sim. | 0.0899 |
| Time proximity | 0.4286 |
| NLP тип | scientific_publication |
| NLP организация | UCLA |
| NLP тема | optimization |
| NLP страна | |
Открыть оригинал
UCLA Professor Ernest Ryu and GPT-5 solved a key question in optimization theory, showcasing AI’s role in accelerating mathematical discovery.
|
|
|
OpenAI and Foxconn collaborate to strengthen U.S. manufacturing across the AI supply chain |
openai |
20.11.2025 14:50 |
0.695
|
| Embedding sim. | 0.8127 |
| Entity overlap | 0.1429 |
| Title sim. | 0.1167 |
| Time proximity | 0.8046 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | ai infrastructure |
| NLP страна | United States |
Открыть оригинал
OpenAI and Foxconn are collaborating to design and manufacture next-generation AI infrastructure hardware in the U.S. The partnership will develop multiple generations of data-center systems, strengthen U.S. supply chains, and build key components domestically to accelerate advanced AI infrastructure.
|
|
|
Bringing powerful AI to millions across Europe with Deutsche Telekom |
openai |
09.12.2025 00:00 |
0.693
|
| Embedding sim. | 0.7999 |
| Entity overlap | 0.2857 |
| Title sim. | 0.069 |
| Time proximity | 0.881 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
OpenAI is collaborating with Deutsche Telekom to bring advanced, multilingual AI experiences to millions of people across Europe. ChatGPT Enterprise will also be deployed to help employees at Deutsche Telekom improve workflows and accelerate innovation.
|
|
|
Early experiments in accelerating science with GPT-5 |
openai |
20.11.2025 00:00 |
0.69
|
| Embedding sim. | 0.7964 |
| Entity overlap | 0.1667 |
| Title sim. | 0.1064 |
| Time proximity | 0.8929 |
| NLP тип | product_launch |
| NLP организация | OpenAI |
| NLP тема | large language models |
| NLP страна | |
Открыть оригинал
OpenAI introduces the first research cases showing how GPT-5 accelerates scientific progress across math, physics, biology, and computer science. Explore how AI and researchers collaborate to generate proofs, uncover new insights, and reshape the pace of discovery.
|
|
|
Helping 1,000 small businesses build with AI |
openai |
20.11.2025 06:00 |
0.687
|
| Embedding sim. | 0.801 |
| Entity overlap | 0.25 |
| Title sim. | 0.044 |
| Time proximity | 0.8571 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
OpenAI is partnering with DoorDash, SCORE, and local organizations to help 1,000 small businesses build with AI. The Small Business AI Jam gives Main Street business owners hands-on tools and training to compete and grow.
|
|
|
OpenAI to acquire Neptune |
openai |
03.12.2025 10:00 |
0.687
|
| Embedding sim. | 0.8167 |
| Entity overlap | 0 |
| Title sim. | 0.1639 |
| Time proximity | 0.6845 |
| NLP тип | acquisition |
| NLP организация | OpenAI |
| NLP тема | model evaluation |
| NLP страна | |
Открыть оригинал
OpenAI is acquiring Neptune to deepen visibility into model behavior and strengthen the tools researchers use to track experiments and monitor training.
|
|
|
SIMA 2: A Gemini-Powered AI Agent for 3D Virtual Worlds — Google DeepMind |
deepmind |
13.11.2025 18:55 |
0.672
|
| Embedding sim. | 0.7769 |
| Entity overlap | 0.0714 |
| Title sim. | 0.2404 |
| Time proximity | 0.7142 |
| NLP тип | product_launch |
| NLP организация | Google DeepMind |
| NLP тема | ai agents |
| NLP страна | United Kingdom |
Открыть оригинал
November 13, 2025 Research
SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds
SIMA Team
Share
Copied
Last year, we introduced SIMA (Scalable Instructable Multiworld Agent), a generalist AI that could follow basic instructions across a wide range of virtual environments. SIMA was a crucial first step in teaching AI to translate language into meaningful action in rich, 3D worlds.
Today we’re introducing SIMA 2, the next milestone in our research creating general and helpful AI agents. By integrating the advanced capabilities of our Gemini models , SIMA is evolving from an instruction-follower into an interactive gaming companion. Not only can SIMA 2 follow human-language instructions in virtual worlds, it can now also think about its goals, converse with users, and improve itself over time.
This is a significant step in the direction of Artificial General Intelligence (AGI), with important implications for the future of robotics and AI-embodiment in general.
Reasoning
Generalization
Self-Improvement
Next steps
Responsibility
The Power of Reasoning
The first version of SIMA learned to perform over 600 language-following skills, like “turn left,” “climb the ladder,” and “open the map,” across a diverse set of commercial video games. It operated in these environments as a person might, by “looking” at the screen and using a virtual keyboard and mouse to navigate, without access to the underlying game mechanics.
With SIMA 2, we’ve moved beyond instruction-following. By embedding a Gemini model as the agent's core, SIMA 2 can do more than just respond to instructions, it can think and reason about them.
Your browser does not support the video tag.
MineDojo: SIMA 1 (left) attempts to follow the instruction while SIMA 2 (right) successfully completes the task in a game it has never seen before.
Your browser does not support the video tag.
ASKA: SIMA 1 (left) attempts to follow the instruction “Find a campfire” while SIMA 2 (right) successfully completes the task in a game it has never seen before.
SIMA 2’s new architecture integrates Gemini’s powerful reasoning abilities to help it understand a user’s high-level goal, perform complex reasoning in pursuit, and skillfully execute goal-oriented actions within games.
We trained SIMA 2 using a mixture of human demonstration videos with language labels as well as Gemini-generated labels. As a result, SIMA 2 can now describe to the user what it intends to do and detail the steps it's taking to accomplish its goals.
Slide 1 of 3
Your browser does not support the video tag.
Moving beyond simple instruction following: SIMA 2 can answer the user’s questions and also reasons about its own behavior as well as its environment.
Your browser does not support the video tag.
Moving beyond simple instruction following: SIMA 2 can answer the user’s questions and also reasons about its own behavior as well as its environment.
Your browser does not support the video tag.
Moving beyond simple instruction following: SIMA 2 can answer the user’s questions and also reasons about its own behavior as well as its environment.
In testing, we have found that interacting with the agent feels less like giving it commands and more like collaborating with a companion who can reason about the task at hand.
And thanks to our collaboration with our existing and new game partners (see, Acknowledgements), we have been able to train and evaluate SIMA 2 on a wider array of games.
This is the power of Gemini brought to embodied AI: a world-class reasoning engine that can now perceive, understand, and take action in complex, interactive 3D environments.
Slide 1 of 4
Your browser does not support the video tag.
SIMA 2 interprets abstract concepts and logical commands by reasoning about its environment and the user's intent.
Your browser does not support the video tag.
SIMA 2 interprets abstract concepts and logical commands by reasoning about its environment and the user's intent.
Your browser does not support the video tag.
SIMA 2 interprets abstract concepts and logical commands by reasoning about its environment and the user's intent.
Your browser does not support the video tag.
SIMA 2 interprets abstract concepts and logical commands by reasoning about its environment and the user's intent.
A Leap in Generalization Performance
The addition of Gemini has also led to improved generalization and reliability. SIMA 2 can now understand more complex and nuanced instructions than its predecessor and is far more successful at carrying them out, particularly in situations or games on which it’s never been trained, such as the new Viking survival game, ASKA, or MineDojo - a research implementation of the popular open-world sandbox game, Minecraft.
SIMA 2 can understand and accomplish long and complex tasks
Slide 1 of 4
Your browser does not support the video tag.
SIMA 2 is successful at carrying out long and complex instructions.
Your browser does not support the video tag.
SIMA 2 tackles a completely new game with no prior training, demonstrating impressive progress.
Your browser does not support the video tag.
SIMA 2 is successful at carrying out long and complex instructions.
Your browser does not support the video tag.
SIMA 2 is successful at carrying out long and complex instructions.
SIMA 2 understands multimodal prompts
Slide 1 of 3
Your browser does not support the video tag.
User is drawing a sketch on the screen.
Your browser does not support the video tag.
User is drawing a sketch on the screen.
Your browser does not support the video tag.
User is drawing a sketch on the screen.
SIMA 2 can understand different languages and even emojis
Slide 1 of 2
Your browser does not support the video tag.
See how it correctly interprets emojis to execute tasks.
Your browser does not support the video tag.
See how it follows commands in different languages to execute tasks.
Moreover, its capacity to transfer learned concepts — for instance, taking its understanding of "mining" in one game and applying it to "harvesting" in another —is foundational to achieving the kind of broad generalization seen in human cognition. Indeed, as a result of this ability, SIMA 2’s performance is significantly closer to that of a human player on a wide range of tasks.
Your browser does not support the video tag.
SIMA 2 can generalise actions across multiple games, including games it wasn’t trained on (like MineDojo and ASKA).
Task completion success rates for SIMA 1, SIMA 2, and humans across a set of evaluation tasks for all training game environments, showing SIMA 2 closing a significant portion of the gap to human performance. Note that the SIMA 1 performance reported here is with respect to our new, expanded, and much more difficult set of evaluations, across a wider set of environments and more complex instructions
Task completion success rates for SIMA 1 and SIMA 2 on held-out (never before seen during training) games: ASKA and MineDojo (a Minecraft research implementation).
The Ultimate Test: Playing in Newly-Imagined Worlds
To test the limits of SIMA 2’s generalization abilities, we combined it with another groundbreaking research project, Genie 3 , which can generate new, real-time 3D simulated worlds from a single image or text prompt.
When we challenged SIMA 2 to play in these newly generated worlds, we found it was able to sensibly orient itself, understand user instructions, and take meaningful actions toward goals, despite never having seen such environments before. It demonstrated an unprecedented level of adaptability.
Slide 1 of 4
Your browser does not support the video tag.
SIMA 2 plays in newly generated worlds by Genie 3.
Your browser does not support the video tag.
SIMA 2 plays in newly generated worlds by Genie 3.
Your browser does not support the video tag.
SIMA 2 plays in newly generated worlds by Genie 3.
Your browser does not support the video tag.
SIMA 2 plays in newly generated worlds by Genie 3.
Towards Scalable, Multitask Self-Improvement
One of SIMA 2’s most exciting new capabilities is its capacity for self-improvement. We’ve observed that, throughout the course of training, SIMA 2 agents can perform increasingly complex and new tasks, bootstrapped by trial-and-error and Gemini-based feedback.
For example, after initially learning from human demonstrations, SIMA 2 can transition to learning in new games exclusively through self-directed play, developing its skills in previously unseen worlds without additional human-generated data. In subsequent training, SIMA 2’s own experience data can then be used to train the next, even more capable version of the agent. We were even able to leverage SIMA 2’s capacity for self-improvement in newly created Genie environments – a major milestone toward training general agents across diverse, generated worlds.
The SIMA 2 self-improvement cycle begins with Gemini providing an initial task and an estimated reward for SIMA 2's behavior. This information is then added to a bank of self-generated experience, which the agent uses for further training in subsequent generations. This process allows the agent to improve on previously failed tasks entirely independently of human-generated demonstrations and intervention.
This virtuous cycle of iterative improvement paves the way for a future where agents can learn and grow with minimal human intervention, becoming open-ended learners in embodied AI.
Slide 1 of 2
Your browser does not support the video tag.
ASKA: On the left, we see examples of tasks at which the initial SIMA 2 agent failed, and on the right, we see that SIMA 2 has self-improved over generations of training, entirely without any human feedback or gameplay data.
Your browser does not support the video tag.
Genie 3 environment: The agent is improving over one generation of training in a genie 3 environment it has never seen before.
Looking to the Future: The Journey to General Embodied Intelligence
SIMA 2’s ability to operate across diverse gaming environments is a crucial proving ground for general intelligence, allowing agents to master skills, practice complex reasoning, and learn continuously through self-directed play.
While SIMA 2 is a significant step toward generalist, interactive, embodied intelligence, it is fundamentally a research endeavor, and its current limitations highlight critical areas for future work. We find the agents still face challenges with very long-horizon, complex tasks that require extensive, multi-step reasoning and goal verification. SIMA 2 also has a relatively short memory of its interactions - it must use a limited context window to achieve low-latency interaction. Finally, executing precise, low-level actions via the keyboard and mouse interface and achieving robust visual understanding of the complex 3D scenes remain open challenges that the entire field continues to address.
This research provides a fundamental validation for a new path in action-oriented AI. SIMA 2 confirms that an AI trained for broad competency, leveraging diverse multi-world data and the powerful reasoning of Gemini, can successfully unify the capabilities of many specialized systems into one coherent, generalist agent.
SIMA 2 also offers a strong path toward application in robotics. The skills it learned - from navigation and tool use to collaborative task execution - are some of the fundamental building blocks for the physical embodiment of intelligence needed for future AI assistants in the physical world.
Responsible Development
SIMA 2 is an interactive, human-centered agent that’s fun to engage with, particularly in the entertaining way it explains its own reasoning. As with all our advanced and foundational technologies, we remain deeply committed to developing SIMA 2 responsibly, from the outset. This is particularly true with regard to its technical innovations, particularly the ability to self-improve.
As we’ve built SIMA 2, we’ve worked with our Responsible Development & Innovation Team. As we continue to explore the potential applications, we are announcing SIMA 2 as a limited research preview and providing early access to a small cohort of academics and game developers. This approach allows us to gather crucial feedback and interdisciplinary perspectives as we explore this new field and continue to build our understanding of risks and their appropriate mitigations. We look forward to working further with the community to develop this technology in a responsible way.
Learn more about SIMA
SIMA Technical Report
Acknowledgements
This research was developed by the SIMA 2 team: Maria Abi Raad, John Agapiou, Frederic Besse, Andrew Bolt, Sarah Chakera, Harris Chan, Jeff Clune, Alexandra Cordell, Martin Engelcke, Ryan Faulkner, Maxime Gazeau, Arne Olav Hallingstad, Tim Harley, Ed Hirst, Drew Hudson, Laura Kampis, Sheleem Kashem, Thomas Keck, Matija Kecman, Oscar Knagg, Alexander Lerchner, Bonnie Li, Yulan Liu, Cong Lu, Maria Loks-Thompson, Joseph Marino, Kay McKinney, Piermaria Mendolicchio, Anna Mitenkova, Alexandre Moufarek, Fabio Pardo, Ollie Purkiss, David Reichert, John Reid, Tyson Roberts, Daniel P. Sawyer, Tim Scholtes, Daniel Slater, Hubert Soyer, Kaustubh Sridhar, Peter Stys, Tayfun Terzi, Davide Vercelli, Bojan Vujatovic, Jane X. Wang, Luyu Wang, Duncan Williams, and Lei M. Zhang.
For their leadership, guidance, and support, we thank: Satinder Singh Baveja, Adrian Bolton, Zoubin Ghahramani, Raia Hadsell, Demis Hassabis, Shane Legg, Volodymyr Mnih, and Daan Wierstra.
With much gratitude to partial contributors and past members: Alex Cullum, Karol Gregor, Rosemary Ke, Junkyung Kim, Matthew Jackson, Andrew Lampinen, Loic Matthey, Hannah Openshaw, and Zhengdong Wang.
Special thanks to all of the game developers who partnered with us: Coffee Stain ( Valheim, Satisfactory, Goat Simulator 3), Foulball Hangover ( Hydroneer), Hello Games ( No Man's Sky), Keen Software House ( Space Engineers), RubberbandGames ( Wobbly Life), Strange Loop Games ( Eco), Thunderful Games ( ASKA, The Gunk, Steamworld Build ), Digixart ( Road 96 ), and Tuxedo Labs & Saber Interactive ( Teardown).
We thank Vika Koriakin, Duncan Smith, Nilesh Ray, Matt Miller, Leen Verburgh, Ashyana Kachra, Phil Esposito, Dimple Vijaykumar, Piers Wingfield, Lucie Kerley for their invaluable partnership in developing and refining key components of this project.
We also thank Jack Parker-Holder, Shlomi Fruchter, and the rest of the Genie team for access to the Genie 3 model.
We’d like to recognize the many teams across Google and Google DeepMind that have contributed to this effort including Legal, Marketing, Communications, Responsibility and Safety Council, Responsible Development and Innovation, Policy, Strategy and Operations, and our Business and Corporate Development teams. We'd also like to thank all GDM teams that are not explicitly mentioned here for their continued support.
Finally, we dedicate this work to the memory of our colleagues Felix Hill and Fabio Pardo, whose contributions to our field continue to inspire us.
Related posts
Genie 3
Learn more
Genie 3: A general purpose world model that can generate a diversity of interactive environments
August 2025 Google DeepMind
Learn more
Gemini Robotics: 1.5 brings AI agents into the physical world
September 2025 Google DeepMind
Learn more
A generalist AI agent for 3D virtual environments
March 2024 Research
Learn more
|
|
|
Strengthening our safety ecosystem with external testing |
openai |
19.11.2025 12:00 |
0.672
|
| Embedding sim. | 0.7792 |
| Entity overlap | 0.0833 |
| Title sim. | 0.05 |
| Time proximity | 0.9643 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | ai safety |
| NLP страна | |
Открыть оригинал
OpenAI works with independent experts to evaluate frontier AI systems. Third-party testing strengthens safety, validates safeguards, and increases transparency in how we assess model capabilities and risks.
|
|
|
Teaching AI to See the World More Like Humans Do — Google DeepMind |
deepmind |
11.11.2025 18:54 |
0.669
|
| Embedding sim. | 0.7808 |
| Entity overlap | 0.0435 |
| Title sim. | 0.1429 |
| Time proximity | 0.7982 |
| NLP тип | product_launch |
| NLP организация | Google DeepMind |
| NLP тема | robotics |
| NLP страна | |
Открыть оригинал
Gemini Robotics: 1.5 brings AI agents into the physical world
September 2025 Google DeepMind
Learn more
|
|
|
How We Used Codex to Ship Sora for Android in 28 Days |
openai |
12.12.2025 00:00 |
0.669
|
| Embedding sim. | 0.7654 |
| Entity overlap | 0.2 |
| Title sim. | 0.1333 |
| Time proximity | 0.8571 |
| NLP тип | product_launch |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
OpenAI shipped Sora for Android in 28 days using Codex. AI-assisted planning, translation, and parallel coding workflows helped a nimble team deliver rapid, reliable development.
|
|
|
Codex is Open Sourcing AI models |
huggingface |
11.12.2025 00:00 |
0.658
|
| Embedding sim. | 0.7714 |
| Entity overlap | 0.0455 |
| Title sim. | 0.1449 |
| Time proximity | 0.75 |
| NLP тип | product_launch |
| NLP организация | OpenAI |
| NLP тема | large language models |
| NLP страна | |
Открыть оригинал
Codex is Open Sourcing AI models
Published
December 11, 2025
Update on GitHub
Upvote 80
+74
ben burtenshaw burtenshaw
shaun smith evalstate
GOAL: End-to-end Machine Learning experiments
Setup and Install Install Codex
Install the Hugging Face Skills
Connect to Hugging Face
Your first AI Experiment Instruct Codex to do an end-to-end fine-tuning experiment
Updating the Training Report
Dataset Validation
Review Before Submitting
Track Progress using the Training Report
Use Your Model
Hardware and Cost
What's Next
Resources Codex
Hugging Face Skills
Building on our work to get Claude Code to train open source models, we are now getting Codex to go further. We gave Codex access to the Hugging Face Skills repository, which contains skills for Machine Learning and AI tasks such as training or evaluating models. With HF skills, a coding agent can:
Fine-tune and apply RL alignment on language models
Review, explain, and act on live training metrics from Trackio
Evaluate checkpoints and act on evaluation results
Create reports from experiments
Export to and quantize models with GGUF for local deployment
Publish models to the Hub
This tutorial dives even deeper and shows you how it works and how to use it yourself. So let's get started.
Codex uses AGENTS.md
files to accomplish specialized tasks, whilst Claude Code uses 'Skills'. Fortunately, 'HF-skills' is compatible with both approaches and works with major coding agents like Claude Code, Codex, or Gemini CLI.
With HF-skills
, you can tell Codex something like:
Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots
And Codex will:
Validate your dataset format
Select appropriate hardware (t4-small for a 0.6B model)
Use and update a training script with Trackio monitoring
Submit the job to Hugging Face Jobs
Report the job ID and estimated cost
Check on progress when you ask
Help you debug if something goes wrong
The model trains on Hugging Face GPUs while you do other things. When it's done, your fine-tuned model appears on the Hub, ready to use.
This isn't a toy demo. The extension supports the same training methods used in production: supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards. You can train models from 0.5B to 7B parameters, convert them to GGUF for local deployment, and run multi-stage pipelines that combine different techniques.
GOAL: End-to-end Machine Learning experiments
We explored this single prompt approach in the Claude Code tutorial. However, we can now go further and get OpenAI Codex to do end-to-end Machine Learning experiments. For example, Codex should be able to monitor progress, evaluate the models, and maintain an up to date training report. This will allow engineers to delegate experiments to Codex and review reports in a more hands-off way. It will also allow Codex to make more decisions on its own based on the training report and evaluation results.
So let's get started!
Setup and Install
Before starting, you'll need:
A Hugging Face account with a Pro or Team / Enterprise plan (Jobs require a paid plan)
A write-access token from huggingface.co/settings/tokens
Codex installed and configured
Install Codex
Codex is OpenAI's AI coding agent included in ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. Codex brings AI assistance directly into your development workflow.
See the Codex documentation for installation and setup instructions.
Install the Hugging Face Skills
The Hugging Face Skills repository includes an AGENTS.md
file that Codex automatically detects and uses.
Clone the repository:
git clone https://github.com/huggingface/skills.git
cd skills
Codex will automatically detect the AGENTS.md
file in the repository and load the skills. You can verify the instructions are loaded with:
codex --ask-for-approval never "Summarize the current instructions."
See the Codex AGENTS guide for more details.
Connect to Hugging Face
Authenticate with Hugging Face using the hf auth login
command and a write-access token from hf.co/settings/tokens :
hf auth login
Codex supports MCP (Model Context Protocol) servers. You can configure the Hugging Face MCP server for additional Hub integration capabilities. You can add the Hugging Face MCP server to your Codex configuration by adding the following to your ~/.codex/config.toml
file:
[mcp_servers.huggingface]
command = "npx"
args = [ "-y" , "mcp-remote" , "https://huggingface.co/mcp?login" ]
Configure Hugging Face MCP Server to use relevant MCP servers like Jobs in the Settings page.
Then start Codex and you'll be directed to the Hugging Face MCP authentication page.
Your first AI Experiment
Let's walk through a complete example. We'll fine-tune a small model to improve code solving abilities, using the open-r1/codeforces-cots dataset and the openai_humaneval benchmark.
The open-r1/codeforces-cots
dataset is a dataset of codeforces problems and solutions. It is a good dataset for instruction tuning a model to solve hard coding problems.
Instruct Codex to do an end-to-end fine-tuning experiment
Start Codex in your project directory. Then give it a simple and clear instruction:
Start a new fine-tuning experiment to improve code solving abilities on using SFT.
- Maintain a report for the experiment.
- Evaluate models with the openai_humaneval benchmark
- Use the open-r1/codeforces-cots dataset
You'll notice that we've gone a bit further than the single prompt approach in the Claude Code tutorial. We've added more details to the instruction but also added more steps to the experiment.
Why not try iterating on this experiment yourself with more open ended questions like "What is the best model for code solving abilities?" or "What is the best dataset for code solving abilities?"
Codex analyzes your request and prepares a training configuration. For a 0.6B model on a demo dataset, it selects t4-small
—enough GPU for this model size and the cheapest option available. Codex will start a new report at training_reports/<model>-<dataset>-<method>.md
which looks like the example below. As the experiment progresses, Codex will update the report with the latest information and each run report.
Example Training Report
# Base Model & Dataset
[ Base Model ]( https://huggingface.co/Qwen/Qwen3-0.6B )
[ Dataset ]( https://huggingface.co/datasets/open-r1/codeforces-cots )
---
# `sft-a10g` - `TBD` - `In Progress`
## Training Parameters
| Parameter | Value |
|-----------|-------|
| Method | SFT (TRL) |
| Model | `Qwen/Qwen3-0.6B` |
| Dataset | `open-r1/codeforces-cots` (train, 5% eval split) |
| Max Length | 2048 |
| Epochs | 1 (extend to 3 after first check) |
| Per-Device Batch Size | 1 |
| Grad Accum Steps | 8 |
| Effective Batch | 8 |
| Learning Rate | 5e-5 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.03 |
| Eval Strategy | steps (500) |
| Save Strategy | steps (500), `hub_strategy=every_save` , limit=2 |
| Precision | bf16 |
| Gradient Checkpointing | true |
| Packing | false |
| Hub Model | `burtenshaw/qwen3-codeforces-cots-sft` |
| Hardware | a10g-small |
| Timeout | 2h |
| Trackio | project `qwen3-codeforces-cots` , run `sft-a10g` |
## Run Status
In Progress (queued to submit)
## Run Logs
Pending submission (job link will be added)
## Trackio Logs
Pending (will link after job starts)
## Run Evaluations
Pending (lighteval `openai_humaneval` for base + checkpoints)
---
# Experiment Evaluations
| Run Title | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `sft-a10g` - `TBD` - `In Progress` | HumanEval pass@1 | TBD | TBD | [ burtenshaw/qwen3-codeforces-cots-sft ]( https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft )
Updating the Training Report
As the experiment progresses, Codex will update the report with the latest information and each run report. You can view the report in training_reports/<model>-<dataset>-<method>.md
file.
For example, codex will update the title of the report to sft-a10g
- TBD
- In Progress
when the experiment is in progress.
# `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `In Progress`
It can link to the run logs and trackio logs.
## Run Logs
[ Run Logs ]( https://huggingface.co/jobs/burtenshaw/6938272ec67c9f186cfe1ae3 )
## Trackio Logs
[ Trackio Logs ]( https://burtenshaw-trackio.hf.space/?project=qwen3-codeforces-sft&metrics=train/loss&runs=sft-qwen3-codeforces-20251209-175806&sidebar=hidden&navbar=hidden )
And it will update the evaluation results in a combined table.
# Experiment Evaluations
| Run Title | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.304 | [ Logs ]( https://huggingface.co/jobs/burtenshaw/69382863c67c9f186cfe1ae7 ) | [ Qwen/Qwen3-0.6B ]( https://huggingface.co/Qwen/Qwen3-0.6B ) |
| `qwen3-0.6b-lora-v1` - `2025-12-09 13:47:47 UTC` - `In Progress` | HumanEval pass@1 | TBD | TBD | [ burtenshaw/qwen3-codeforces-cots-sft ]( https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft )
Dataset Validation
Dataset format and processing is the most common source of training failures and usually a significant amount of work is done in the training script. Codex can validate datasets before the job starts and either define a configuration for TRL or process the dataset separately.
In most cases, Codex will validate the dataset before training, but you can always check the dataset validation before submitting the job.
Check if open-r1/codeforces-cots works for SFT training.
Codex runs a quick inspection on CPU (fractions of a penny) and reports:
Dataset validation for my-org/conversation-data:
SFT: ✓ READY
Found 'messages' column with conversation format
DPO: ✗ INCOMPATIBLE
Missing 'chosen' and 'rejected' columns
If your dataset needs transformation, Codex can preprocess the dataset before training.
Preprocess the dataset open-r1/codeforces-cots to use 'good_response' and 'bad_response' instead of 'chosen' and 'rejected'.
Codex will preprocess the dataset and update the training script with the new configuration.
Review Before Submitting
Before Codex submits anything, you'll see the configuration:
I'll fine-tune Qwen/Qwen3-0.6B on open-r1/codeforces-cots using SFT.
Configuration:
- Hardware: t4-small (~$0.75/hour)
- Estimated time: ~20 minutes
- Estimated cost: ~$0.30
- Output: username/qwen-codeforces-cots-sft
The model will be pushed to Hub automatically. Should I submit?
This is your chance to adjust anything. Change the output repo name, pick different hardware, or ask Codex to modify training parameters. Once you approve, the job is submitted.
For example, you can ask Codex to try a test run:
Do a quick test run on 100 examples.
Track Progress using the Training Report
After submission, you can track progress using the training report and get links to the run logs and trackio logs. You can also tell Codex to fetch the logs, summarize progress, and update the training report. As codex runs the jobs, it will update the training report with the latest information.
Example Codex Response
─ Worked for 10m 38s ─────────────────────────────────────────────────────────────────────────────────────────────────────────
• Training and eval jobs are live. Key details:
- SFT launched on open-r1/codeforces-cots (solutions_py_decontaminated, 98/2 split) with LoRA on Qwen/Qwen3-0.6B, a10g-
large, bf16, max_steps=1000, batch=1, grad_accum=8, lr=2e-5, save/eval every 200 steps, hub push burtenshaw/qwen3-
codeforces-cots-sft. Job: https://huggingface.co/jobs/burtenshaw/6938272ec67c9f186cfe1ae3 (Trackio project codeforces-
cots-sft, run qwen3-0.6b-lora-v1).
- Baseline lighteval (HumanEval) on the base model queued on a10g-large. Job: https://huggingface.co/jobs/
- Experiment report updated with current runs and links: training_reports/qwen3-codeforces-cots-sft.md.
To evaluate the fine-tuned checkpoints once the SFT job pushes to the hub, reuse the same lighteval script:
hf jobs uv run --flavor a10g-large --timeout 1h --secrets HF_TOKEN -e MODEL_ID=burtenshaw/qwen3-codeforces-cots-sft -e
RESULTS_ORG=burtenshaw -d /tmp/lighteval_humaneval.py
(or swap MODEL_ID for a specific checkpoint path). This will log to the same results directory on the hub.
1. Monitor the two running jobs; I can fetch logs or metrics on request.
2. After a checkpoint lands, trigger the lighteval run above to compare against the baseline; then we can fill in pass@1 in the report.
Trackio will monitor the training and evaluation scores for you by running evaluation jobs and selecting the best checkpoint. If you want to know the status of the training job, you can ask Codex to fetch the logs and summarize progress in a table.
Are models outperforming the base model?
| Model | Benchmark | Score | Evaluation Job Link | Model Link |
|-----------|-----------|-------|---------------------|------------|
| `qwen3-0.6b-lora-v1` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.342 | [Logs](< link to training job>) | [burtenshaw/qwen3-codeforces-cots-sft](https://huggingface.co/burtenshaw/qwen3-codeforces-cots-sft)
| `base-humaneval-a10g` - `2025-12-09 13:47:47 UTC` - `Completed` | HumanEval pass@1 | 0.306 | [Logs](< link to evaluation job>) | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
You can also monitor the training loss in real-time.
Codex fetches the logs and summarizes progress.
Click here for an example Trackio dashboard with some completed runs.
Use Your Model
When training completes, your model is on the Hub:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained( "burtenshaw/qwen3-codeforces-cots-sft" )
tokenizer = AutoTokenizer.from_pretrained( "burtenshaw/qwen3-codeforces-cots-sft" )
Transformers is great as a standard, and we can easily convert the trained model to GGUF for local deployment. This is because the training skill contains instructions and support scripts to convert models to GGUF.
Convert my fine-tuned model to GGUF with Q4_K_M quantization.
Push to username/my-model-gguf.
Codex then converts to GGUF, applies quantization, and pushes to the Hub. If we trained a LoRA adapter, it will merge the LoRA adapters into the base model.
Then use it locally:
llama-server -hf <username>/<model-name>:<quantization>
# For example, to run the Qwen3-1.7B-GGUF model on your local machine:
llama-server -hf unsloth/Qwen3-1.7B-GGUF:Q4_K_M
Hardware and Cost
Codex selects hardware based on your model size, but understanding the tradeoffs helps you make better decisions. You can use the Hardware Guide to see the hardware options and costs, but codex will do it for you and select the best option.
For tiny models under 1B parameters , t4-small
works well. These models train quickly—expect $1-2 for a full run. This is perfect for educational or experimental runs.
For small models (1-3B) , step up to t4-medium
or a10g-small
. Training takes a few hours and costs $5-15.
For medium models (3-7B) , you need a10g-large
or a100-large
with LoRA. Full fine-tuning doesn't fit, but LoRA makes these very trainable. Budget $15-40 for production.
For large models (7B+) , this HF skills job is not suitable for this scale yet. But stay tuned because we are working on it!
What's Next
We've shown that Codex can handle the full lifecycle of model fine-tuning: validating data, selecting hardware, generating scripts, submitting jobs, monitoring progress, and converting outputs.
Some things to try:
Fine-tune a model on your own dataset
Try bigger experiments with more models and datasets and let the agent create a report for you.
Train a reasoning model with GRPO on math or code and let the agent create a report for you.
The extension is open source . You can extend it, customize it for your workflows, or use it as a starting point for other training scenarios.
Resources
Codex
Codex Documentation — OpenAI's AI coding agent
Codex Quickstart — Get started with Codex
Codex AGENTS Guide — Using AGENTS.md files
Hugging Face Skills
SKILL.md — Full skill documentation
Training Methods — SFT, DPO, GRPO explained
Hardware Guide — GPU selection and costs
TRL Documentation — The underlying training library
Hugging Face Jobs — Cloud training infrastructure
Trackio — Real-time training monitoring
More Articles from our Blog
llm fine-tuning open-source
Hot
We Got Claude to Fine-Tune an Open Source LLM
616
December 4, 2025
llm fine-tuning training
Train AI models with Unsloth and Hugging Face Jobs for FREE
+2
92
February 20, 2026
|
|
|
Introducing OpenAI for Ireland |
openai |
14.11.2025 04:00 |
0.658
|
| Embedding sim. | 0.7944 |
| Entity overlap | 0.12 |
| Title sim. | 0.1594 |
| Time proximity | 0.4583 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | Ireland |
Открыть оригинал
OpenAI launches OpenAI for Ireland, partnering with the Irish Government, Dogpatch Labs and Patch to help SMEs, founders and young builders use AI to innovate, boost productivity and build the next generation of Irish tech startups.
|
|
|
Funding grants for new research into AI and mental health |
openai |
01.12.2025 12:00 |
0.654
|
| Embedding sim. | 0.737 |
| Entity overlap | 0.1 |
| Title sim. | 0.1579 |
| Time proximity | 0.9643 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | ai safety |
| NLP страна | |
Открыть оригинал
OpenAI is awarding up to $2 million in grants for research at the intersection of AI and mental health. The program supports projects that study real-world risks, benefits, and applications to improve safety and well-being.
|
|
|
BNY builds “AI for everyone, everywhere” with OpenAI |
openai |
12.12.2025 00:00 |
0.653
|
| Embedding sim. | 0.7768 |
| Entity overlap | 0.5 |
| Title sim. | 0.0769 |
| Time proximity | 0.4524 |
| NLP тип | other |
| NLP организация | BNY |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
BNY uses OpenAI to expand AI adoption enterprise-wide through Eliza, where 20,000+ employees build AI agents that improve efficiency and client outcomes.
|
|
|
OpenAI and NORAD team up to bring new magic to “NORAD Tracks Santa” |
openai |
01.12.2025 06:00 |
0.652
|
| Embedding sim. | 0.7309 |
| Entity overlap | 0.2 |
| Title sim. | 0.1176 |
| Time proximity | 0.994 |
| NLP тип | partnership |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
OpenAI and NORAD are bringing new magic to “NORAD Tracks Santa” with three ChatGPT holiday tools that let families create festive elves, toy coloring pages, and custom Christmas stories.
|
|
|
Our partnership with the UK government — Google DeepMind |
deepmind |
10.12.2025 14:59 |
0.651
|
| Embedding sim. | 0.7433 |
| Entity overlap | 0.2857 |
| Title sim. | 0.0096 |
| Time proximity | 0.9822 |
| NLP тип | partnership |
| NLP организация | UK AI Security Institute |
| NLP тема | ai security |
| NLP страна | United Kingdom |
Открыть оригинал
Deepening our partnership with the UK AI Security Institute
December 2025 Responsibility & Safety
Learn more
|
|
|
OpenAI named Emerging Leader in Generative AI |
openai |
17.11.2025 10:00 |
0.651
|
| Embedding sim. | 0.7683 |
| Entity overlap | 0.2308 |
| Title sim. | 0.1719 |
| Time proximity | 0.5357 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
OpenAI has been named an Emerging Leader in Gartner’s 2025 Innovation Guide for Generative AI Model Providers. The recognition reflects our enterprise momentum, with over 1 million companies building with ChatGPT.
|
|
|
How confessions can keep language models honest |
openai |
03.12.2025 10:00 |
0.647
|
| Embedding sim. | 0.7388 |
| Entity overlap | 0.25 |
| Title sim. | 0 |
| Time proximity | 1 |
| NLP тип | experiment |
| NLP организация | OpenAI |
| NLP тема | ai safety |
| NLP страна | |
Открыть оригинал
OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.
|
|
|
Building Deep Research: How we Achieved State of the Art |
huggingface |
24.11.2025 17:40 |
0.642
|
| Embedding sim. | 0.7451 |
| Entity overlap | 0 |
| Title sim. | 0.0989 |
| Time proximity | 0.8948 |
| NLP тип | product_launch |
| NLP организация | Tavily |
| NLP тема | ai agents |
| NLP страна | |
Открыть оригинал
Building Deep Research: How we Achieved State of the Art
Team Article Published
November 24, 2025
Upvote 35
+29
Michael Griff michaelgriff
Tavily
Dean Sacoransky deansaco
Tavily
Noah Nefsky noahnefsky
Tavily
Building for the Future Agent Harness
Models
Tools
Takeaways
Context Engineering — An Exercise in Curation Context-Managed Web Retrieval
Modeling the Human-Web Interaction
Doing More with Less
Productionizing Agents — an Ongoing Challenge Engineering with Non-Determinism
Optimal Tooling — Less is More
Evals
Research agents are rapidly becoming one of the most important applications of AI. Research is a foundational knowledge-work task: collecting, reading, and synthesizing information underpins everything from writing and decision-making to coding itself. Yet human-driven research is constrained by memory, reading speed, and time. AI research agents, by contrast, can process vast amounts of information, synthesize insights instantly, and scale effortlessly. Because of this, research agents are emerging as a top use case for AI today and will soon become a core subcomponent of broader agentic workflows across content generation, coding, sales, and more. In this post, we share the technical and philosophical lessons we’ve learned building a state-of-the-art research agent , and where we believe the field is headed.
Building for the Future
Agent Harness
The task of building an agent harness is to create a software layer that enhances a model’s runtime execution through context management, tool invocations, loop control, orchestration, and error handling. Building applications on top of rapidly improving models is, however, a modern engineering challenge. How can we design software today that absorbs the performance gains from future model releases?
This requires forecasting how models will evolve, staying optimistic about their progress, limiting assumptions, and avoiding hand-crafted optimizations.
We learned this the hard way seven months ago, when we had to abandon our first attempt at deep research and rebuild the entire system from scratch. The first architecture was complicated and sophisticated (we thought this was a good thing), but its assumptions became bottlenecks when the next generation of models arrived.
Models
Over the last seven months, model capabilities have quietly but meaningfully evolved (especially in their tool-calling abilities). This single optimization focus has pushed us from workflows to agents. We believe future models will be trained to solve the current pain points of agent developers. Every model is ultimately consumed by a harness, so models should evolve in service of that harness. We hope to see models improve in high-recall summarization (for context compression), tool-calling reliability, and concision in writing.
Tools
Similarly, tools should evolve to support LLMs and widely adopted agent harnesses. The best tools should perform some context engineering on the tool side, abstracted away from the agent. They should return only the most relevant data instead of dumping large volumes of tokens into the context window. As a tool provider, we’ve invested heavily in our advanced search feature, which has context engineering baked in. This in turn lowers hallucinations and latency for the downstream agent processes.
Takeaways
To build agents that improve over time, we followed a few guiding principles:
Simplify orchestration logic and lean into autonomy.
Pay close attention to what models and tools are being optimized for, and leverage their emerging capabilities.
Focus on context engineering (more on this in the next section).
Context Engineering — An Exercise in Curation
Long-horizon research tasks expose a fundamental challenge in current agent design: the task of maintaining a clean, optimized context window over time. If curating context is not a task the engineer pays close attention to, the agent is almost destined for failure. The following outlines our thinking around this concept within the deep research domain.
Context-Managed Web Retrieval
Using Tavily’s Advanced Search is the natural first step to take in overcoming this challenge, in that it abstracts away the processing of raw web content and returns only the most relevant content chunks from each source. In leveraging this functionality, we let Tavily Search do the heavy lifting and allow Tavily Research to reap the benefit, gathering the most valuable content in a latency-efficient manner.
Ensuring that the agent does not overfit to a single research thread is the next step towards an effective context-gathering pipeline. It is in this regard that global state persistence and source deduplication is paramount, and in our case, it helps threefold:
It ensures the agent is exposed only to fresh information.
It allows the engineer to recognize when the information scope is narrowing and to prompt the agent to explore untapped relevant domains.
It lends to effective source attribution later on in the generation process.
At Tavily, interacting with the web is our bread and butter. Architecting a refined web-retrieval system engineered for deep research was a foundational building block for our deep research agent design as a whole.
Modeling the Human-Web Interaction
Humans research in an inherently unstructured, iterative way. We start by defining the task: what we’re trying to accomplish and what information we need. We next gather data from our sources, extracting the key insights and holding them in short-term memory, letting these distilled thoughts guide our subsequent actions.
This cycle repeats: collect information, distill it, decide what to do next. Only once we’ve gathered enough understanding to produce the final deliverable do we return to the original sources, using them as references to assemble the finished product.
We believe that deep research agents should be designed in a similar manner, in that tool outputs should be distilled into reflections, and only the set of past reflections should be used as context for your tool caller. Similar to humans, it is only at the point when your agent begins to prepare the final deliverable that you must provide the raw information as context, so as to ensure there is no information loss.
Doing More with Less
This approach differs from traditional context structuring in a ReAct agent-based architecture. Typically, tool calls and outputs are propagated through the tool calling loop, with previously retrieved/generated tokens being persisted in the context window on each subsequent iteration. This pattern can be seen in LangChain’s Open Deep Research agent implementation, and from a token consumption perspective, it can be modeled by the following quadratic series, where n n n is the amount of tokens the tool calling model is invoked with on each tool calling iteration, and m m m is the number of tool calling iterations.
n + 2 n + 3 n + ⋯ + m n = n ⋅ m ( m + 1 ) 2 n + 2n + 3n + \cdots + mn \;=\; n \cdot \frac{m(m+1)}{2} n + 2 n + 3 n + ⋯ + mn = n ⋅ 2 m ( m + 1 )
Contrarily, our proposed method of context engineering removes this token propagation (as the knowledge distillations, even when aggregated, are negligible when compared to the quantity of tokens gathered from web) and can be modeled by the following linear series.
n + n + n + ⋯ + n = n m n + n + n + \cdots + n \;=\; nm n + n + n + ⋯ + n = nm
When comparing the two approaches, tokens are saved on a per-agent basis by a factor of m + 1 2 \frac{m+1}{2} 2 m + 1 , and when extrapolating this over a multi-agent system and with consumption at scale, the absolute value of tokens saved becomes even more significant.
Through this methodology, we were able to reduce token consumption by 66% (when compared to Open Deep Research) while achieving SOTA on DeepResearch Bench – the intersection of quality and efficiency in full effect.
Productionizing Agents — an Ongoing Challenge
Building production-grade agents is a balancing act. We leaned into autonomy to maximize performance and quality, while still meeting strict requirements for latency, cost, and reliability.
Engineering with Non-Determinism
LLMs are inherently non-deterministic, and we found that giving them guard-railed freedom to reason and iterate produces the strongest results. Autonomy, when gone wrong, can cause agent behavior to go off track. Tools can be called incorrectly, LLMs can overfit to a subtopic, and expected reasoning patterns may break. No single safeguard will catch all of these issues.
A shift in engineering mindset is required: treat failure modes as core design considerations, not afterthoughts. Simple guardrails like tool-call retries or model cascades help, but proactively anticipating anomalies, reinforcing proper patterns in prompting and edge-case testing is what enables production-grade, long-running agents.
Optimal Tooling — Less is More
From our experience, it’s better to expose a small, essential toolset to the agent rather than a large, complex one. We were tempted to over-engineer by adding many tools that seemed useful in theory, but in practice this created new failure modes and made it harder for LLMs to consistently choose the right tool and iterate effectively.
Evals
We used evals to steer our development process but also recognize their shortcomings. LLM-as-a-judge evals are difficult to trust: current models are non‑deterministic, uninterpretable in their reasoning, and can turn into bottlenecks, especially for long‑running agents where a single experiment can take days to complete.
Rather than optimizing for benchmark scores, we optimized for directional feedback. The core question was always: did this change make the agent more reliable and more useful in practice? Evals became a tool for validating that direction, not the optimization target. Intuition and careful agent‑trace monitoring consistently provided higher‑signal feedback than any single eval score.
Overall, the best outcome is rarely the highest numerical score. For production systems, improvements like reduced token usage, reliability, lower latency, and fewer failures are more valuable than a one‑point bump on an eval.
If you’re interested in experiencing the result of these findings in practice, you can sign up for early access to Tavily Research here .
|
|
|
Expanding data residency access to business customers worldwide |
openai |
25.11.2025 22:00 |
0.641
|
| Embedding sim. | 0.8208 |
| Entity overlap | 0.25 |
| Title sim. | 0.0762 |
| Time proximity | 0.0476 |
| NLP тип | other |
| NLP организация | OpenAI |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
OpenAI expands data residency for ChatGPT Enterprise, ChatGPT Edu, and the API Platform, enabling eligible customers to store data at rest in-region.
|
|
|
How Podium is arming 10,000+ SMBs with AI agents |
openai |
11.12.2025 00:00 |
0.64
|
| Embedding sim. | 0.7409 |
| Entity overlap | 0.0833 |
| Title sim. | 0 |
| Time proximity | 1 |
| NLP тип | other |
| NLP организация | Podium |
| NLP тема | ai agents |
| NLP страна | |
Открыть оригинал
Discover how Podium used OpenAI’s GPT-5 to build “Jerry,” an AI teammate driving 300% growth and transforming how Main Street businesses serve customers.
|
|
|
Join the AMD Open Robotics Hackathon |
huggingface |
13.11.2025 21:37 |
0.637
|
| Embedding sim. | 0.7615 |
| Entity overlap | 0.0455 |
| Title sim. | 0.0645 |
| Time proximity | 0.6981 |
| NLP тип | other |
| NLP организация | Advanced Micro Devices |
| NLP тема | robotics |
| NLP страна | Japan |
Открыть оригинал
Join the AMD Open Robotics Hackathon
Enterprise Article Published
November 13, 2025
Upvote 15
+9
Eric Ma eric-amd
amd
Guruprasad MP guruprasadmp
amd
Looking to show off your robotics aptitude? The AMD Open Robotics Hackathon hosted by AMD, Hugging Face, and Data Monsters is the place to do it. Whether you’re a student, hobbyist, startup founder, or seasoned engineer, this event brings together makers, coders, and roboticists for a fast-paced, hands-on competition that turns bold ideas into functioning demos.
The first of two in-person hackathons will take place from December 5-7, 2025 in Tokyo Japan. Our next stop will be in Paris France from December 12-14, 2025.
Preparing for the Hackathon:
Form a team of up to four roboticists (ages 18+) to take on two missions over the course of 3 days.
Mission 1 — An instructor-led exploration and preparation session. Learn how to set up the LeRobot development environment using AMD AI solutions
Mission 2 — Build your own creative solution to a real-world problem. Your team has two days to develop an innovative freestyle project using LeRobot
Recommended technical proficiency:
• Strong Linux development skills and experience with Python and related tooling and containerization
• Machine learning skills, familiarity with PyTorch, and hands-on experience with model training and inference
• Bonus if your team has experience with ROCm, LeRobot, and embedded development.
Hardware will be provided to contestants in the form of SO-101 robotics kits, AMD Ryzen™ AI processor equipped laptops, and access to AMD Instinct™ MI300X GPUs on the AMD Developer Cloud. User guides and related information will be provided at the start of the contest.
Prizes will be awarded to the top 7 teams in each city, with the first-place team receiving $10,000! To qualify, teams must complete both Missions, with judges assessing the creativity, difficulty, ease-of-use, and practicality of Mission 2 projects on a 100-point scale.
If you care about robotics and edge AI and want to see how your ideas behave in the wild, the AMD Open Robotics Hackathon is a can’t-miss opportunity.
Register Now
Tokyo T&Cs: NO PURCHASE NECESSARY. Sponsor: Advanced Micro Devices, Inc. Contest is open to participants 18 years or older. Contest registration begins on 11/12/2025 at 12:00PM JST and ends on 12/4/2025 at 10:00PM JST. The Contest concludes on 12/7/2025 at 8:00PM JST. To enter and for Official Rules and Terms, including prize descriptions and judging criteria, see https://amdroboticshackathon.datamonsters.com/ . Void where prohibited.
Paris T&Cs: NO PURCHASE NECESSARY. Sponsor: Advanced Micro Devices, Inc. Contest is open to participants 18 years or older. Contest registration begins on 11/12/2025 at 12:00PM CET and ends on 12/11/2025 at 10:00PM CET. The Contest concludes on 12/14/2025 at 8:00PM CET. To enter and for Official Rules and Terms, including prize descriptions and judging criteria, see https://amdroboticshackathon.datamonsters.com/ . Void where prohibited.
|
|
|
Accenture and OpenAI accelerate enterprise AI success |
openai |
01.12.2025 05:00 |
0.635
|
| Embedding sim. | 0.8006 |
| Entity overlap | 0.0833 |
| Title sim. | 0.0928 |
| Time proximity | 0.244 |
| NLP тип | partnership |
| NLP организация | accenture |
| NLP тема | enterprise ai |
| NLP страна | |
Открыть оригинал
Accenture and OpenAI are collaborating to help enterprises bring agentic AI capabilities into the core of their business and unlock new levels of growth.
|
|
|
Google DeepMind opens Singapore research lab for Asia-Pacific AI. — Google DeepMind |
deepmind |
19.11.2025 02:28 |
0.633
|
| Embedding sim. | 0.7413 |
| Entity overlap | 0.0769 |
| Title sim. | 0.101 |
| Time proximity | 0.7591 |
| NLP тип | other |
| NLP организация | Google DeepMind |
| NLP тема | artificial intelligence |
| NLP страна | Singapore |
Открыть оригинал
November 19, 2025 Responsibility & Safety
We’re expanding our presence in Singapore to advance AI in the Asia-Pacific region
Lila Ibrahim, Chief Operating Officer, Google DeepMind
Share
Copied
Your browser does not support the audio element. Listen to article 5 minutes
Through cutting-edge research, a growing bench of world-class talent, and deepening work with government, we'll accelerate real-world benefits and applications of AI
The Asia-Pacific region is home to more than half the world's population and is poised for immense growth. The Singapore government’s ambitious and forward-looking approach — exemplified by their National AI Strategy 2.0 and Smart Nation 2.0 and their openness to global talent — make this an excellent location to expand our presence as we open a new Google DeepMind research lab in Singapore.
This investment builds on Google's longstanding commitment to the Asia-Pacific ecosystem, and follows a more-than doubling of Google DeepMind's APAC team over the past year.
Advancing Gemini and frontier AI impact
Our growing team in Singapore will consist of exceptional research scientists, software engineers, and AI impact experts focused on critical areas of research and development. This builds on our efforts to pioneer foundational research in linguistic and cultural inclusivity for Asia Pacific, advance Gemini's core capabilities, and apply the latest models across Google products and for Cloud customers.
Our new research lab will be focused on collaboration as we work directly with government, businesses, civil society and world-class academic institutions across the region. We are focused on understanding how our technologies can be best built to serve the diverse needs of the Asia-Pacific region.
Building on a foundation of positive impact
We are already seeing the extraordinary positive impact that our teams and partners are generating by applying Google technologies across Singapore and the wider region. Beyond our core products like Gemini, we are seeing creative and exciting examples of our AI technology in action when used collaboratively and responsibly:
In Science: A multidisciplinary research team at Singapore's Agency of Science, Technology and Research (A*STAR) and the National Neuroscience Institute (NNI) used AlphaFold to pioneer a breakthrough in understanding Parkinson's disease, establishing a link between immunology and neurodegenerative diseases that could lead to earlier diagnosis and targeted therapies.
In Public Services: We are working closely with GovTech, the Cyber Security Agency of Singapore (CSA), and the Infocomm Media Development Authority (IMDA). Together, we launched an AI agent sandbox to safely test autonomous solutions that enhance public sector efficiency and service delivery.
Regarding multilinguality: We have collaborated with AI Singapore to launch Project Aquarium , an open data platform for Southeast Asian Languages. We’ve also expanded our partnership to support further developments of SEA-LION – a family of LLMs trained and tuned to be representative of Southeast Asia’s cultural contexts and linguistic nuances. This enabled the launch of their first multimodal model, SEA-LION v4 , built on Gemma 3’s multimodal capabilities.
In Education: We are giving students in Singapore one-year free access to the Google AI Pro Plan to unlock upgraded AI features for creativity and learning, and we introduced Gemini Academy to IMDA’s Singapore Digital Office to broaden AI literacy and make AI accessible to all.
With Startups: Through our Google for Startups: AI First accelerator , we are supporting Singaporean AI-first startups that are using generative AI to tackle significant economic, societal and environmental issues.
These examples reflect what's possible when cutting-edge AI research meets Singapore's forward-looking innovation and strong public purpose.
Our shared vision for the future
As we expand our presence in Singapore, we're committed to ensuring the benefits of AI are realized across the region. Through our new AI research lab, we will continue collaborating with the region’s vibrant ecosystem of partners to unlock AI's transformative benefits for diverse communities. Together, we are responsibly shaping how AI can serve the needs, and reflect the cultures, of over half the world's population.
|
|
|
Introducing shopping research in ChatGPT |
openai |
24.11.2025 00:00 |
0.622
|
| Embedding sim. | 0.7651 |
| Entity overlap | 0.4444 |
| Title sim. | 0 |
| Time proximity | 0.3214 |
| NLP тип | product_launch |
| NLP организация | OpenAI |
| NLP тема | generative ai |
| NLP страна | |
Открыть оригинал
Shopping research in ChatGPT helps you explore, compare, and discover products with personalized buyer’s guides that simplify decision-making
|