Introducing study mode in ChatGPT
Status: closed
Event type: product_launch
Topic: large language models
Organization: OpenAI
Country: United States
Articles: 22
Unique sources: 4
Importance / Momentum: 2.53 / 0
Period: 29.07.2025 10:00 — 14.08.2025 00:00
Created: 06.04.2026 06:18:11
Articles in cluster: 22
Title | Source | Publication date | Score
Introducing study mode in ChatGPT | openai | 29.07.2025 10:00 | 1
Embedding sim.: 1 · Entity overlap: 1 · Title sim.: 1 · Time proximity: 1
NLP type: product_launch · NLP organization: ChatGPT · NLP topic: educational technology · NLP country: (none)

Introducing study mode in ChatGPT, a new learning experience that helps you work through problems step by step, guiding students with questions, scaffolding, and feedback for deeper learning.
Introducing GPT-5 for developers | openai | 07.08.2025 10:00 | 0.875
Embedding sim.: 0.9524 · Entity overlap: 0.4 · Title sim.: 0.5455 · Time proximity: 0.9405
NLP type: product_launch · NLP organization: OpenAI · NLP topic: large language models · NLP country: (none)

Introducing GPT-5 in our API platform—offering high reasoning performance, new controls for devs, and best-in-class results on real coding tasks.
GPT-5 and the new era of work | openai | 07.08.2025 10:00 | 0.829
Embedding sim.: 0.9344 · Entity overlap: 0.7143 · Title sim.: 0.1429 · Time proximity: 0.9405
NLP type: product_launch · NLP organization: OpenAI · NLP topic: foundation models · NLP country: (none)

GPT-5 is OpenAI’s most advanced model—transforming enterprise AI, automation, and workforce productivity in the new era of intelligent work.
Coding and design with GPT-5 | openai | 07.08.2025 00:03 | 0.828
Embedding sim.: 0.9179 · Entity overlap: 0.4 · Title sim.: 0.3333 · Time proximity: 0.9999
NLP type: other · NLP organization: (none) · NLP topic: large language models · NLP country: (none)

Learn how GPT-5 unlocks new possibilities in coding and design.
gpt-oss-120b & gpt-oss-20b Model Card | openai | 05.08.2025 00:00 | 0.819
Embedding sim.: 0.9137 · Entity overlap: 0.5714 · Title sim.: 0.2105 · Time proximity: 1
NLP type: product_launch · NLP organization: (none) · NLP topic: large language models · NLP country: (none)

We introduce gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models available under the Apache 2.0 license and our gpt-oss usage policy.
Creative writing with GPT-5 | openai | 07.08.2025 00:02 | 0.781
Embedding sim.: 0.8842 · Entity overlap: 0.4 · Title sim.: 0.1463 · Time proximity: 0.9998
NLP type: other · NLP organization: (none) · NLP topic: large language models · NLP country: (none)

Learn how GPT-5 assists with creative writing.
How Amgen uses GPT-5 | openai | 07.08.2025 00:00 | 0.779
Embedding sim.: 0.8826 · Entity overlap: 0.25 · Title sim.: 0.2059 · Time proximity: 1
NLP type: other · NLP organization: Amgen · NLP topic: generative ai · NLP country: (none)

Learn how Amgen uses GPT-5.
GPT-5 System Card | openai | 07.08.2025 00:00 | 0.77
Embedding sim.: 0.8606 · Entity overlap: 0.4167 · Title sim.: 0.2 · Time proximity: 1
NLP type: product_launch · NLP organization: OpenAI · NLP topic: large language models · NLP country: (none)

This GPT-5 system card explains how a unified model routing system powers fast and smart responses using gpt-5-main, gpt-5-thinking, and lightweight versions like gpt-5-thinking-nano, optimized for different tasks and developer use.
First look at GPT-5 | openai | 07.08.2025 00:00 | 0.769
Embedding sim.: 0.8721 · Entity overlap: 0.25 · Title sim.: 0.1875 · Time proximity: 1
NLP type: other · NLP organization: (none) · NLP topic: generative ai · NLP country: (none)

See how a group of leading developers use GPT-5 for the first time.
Medical research with GPT-5 | openai | 07.08.2025 00:01 | 0.756
Embedding sim.: 0.856 · Entity overlap: 0.3333 · Title sim.: 0.1429 · Time proximity: 0.9999
NLP type: other · NLP organization: (none) · NLP topic: healthcare ai · NLP country: (none)

Learn how GPT-5 is used for medical research.
From GPT-2 to gpt-oss: Analyzing the Architectural Advances | ahead_of_ai | 09.08.2025 11:23 | 0.726
Embedding sim.: 0.8877 · Entity overlap: 0.2353 · Title sim.: 0.1613 · Time proximity: 0.3608
NLP type: product_launch · NLP organization: OpenAI · NLP topic: large language models · NLP country: (none)

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later). This is the first time since GPT-2 that OpenAI has shared a large, fully open-weight model.

Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now OpenAI has shared the long-awaited open-weight models, and the architecture has some interesting details. I spent the past few days reading through the code and technical reports to summarize the most interesting ones. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.)

Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of the article page.

- Model architecture comparisons with GPT-2
- MXFP4 optimization to fit gpt-oss models onto single GPUs
- Width versus depth trade-offs (gpt-oss vs. Qwen3)
- Attention bias and sinks
- Benchmarks and comparisons with GPT-5

I hope you find it informative!

1. Model Architecture Overview

Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below.

Figure 1: The two gpt-oss models side by side.

If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because:

- There is significant rotation of employees between these labs.
- We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.)
- Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes.

That being said, there are still many interesting aspects of their design choices. Some are shown in the figure above (while others are not, but we will discuss them later as well). In the rest of this article, I will highlight these features and compare them to other architectures, one at a time.

I should also note that I am not affiliated with OpenAI in any way. My information comes from reviewing the released model code and reading their technical reports. If you want to learn how to use these models locally, the best place to start is OpenAI's official model hub pages:

https://huggingface.co/openai/gpt-oss-20b
https://huggingface.co/openai/gpt-oss-120b

The 20B model can run on a consumer GPU with up to 16 GB of RAM. The 120B model can run on a single H100 with 80 GB of RAM or newer hardware. I will return to this later, as there are some important caveats.

2. Coming From GPT-2

Before we jump into comparisons between gpt-oss and a more recent architecture, let's hop into the time machine and take a side-by-side look at GPT-2 (Figure 2) to see just how far things have come.

Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B.
Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss, and as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused.

2.1 Removing Dropout

Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended).

Figure 3: An illustration of dropout applied to the attention score matrix.

I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting.

Interestingly, while dropout has been largely ignored in LLM architecture design for many years, I found a 2025 research paper with small-scale LLM experiments (Pythia 1.4B) that confirms that dropout results in worse downstream performance in these single-epoch regimes.

2.2 RoPE Replaces Absolute Positional Embeddings

In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order.
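As a toy illustration of the attention-score dropout described in section 2.1, here is a plain-Python sketch with made-up numbers (not the actual GPT-2 or gpt-oss code):

```python
import random

def dropout(matrix, p, seed=0):
    """Zero each entry with probability p and rescale survivors by
    1/(1-p) so the expected value is unchanged (training only)."""
    rng = random.Random(seed)
    keep = 1.0 - p
    return [[0.0 if rng.random() < p else v / keep for v in row]
            for row in matrix]

# A tiny made-up 3x3 attention-score matrix.
scores = [[0.2, 0.3, 0.5],
          [0.1, 0.6, 0.3],
          [0.4, 0.4, 0.2]]

dropped = dropout(scores, p=0.5)  # roughly half the entries zeroed
```

At inference time p is 0 and the function is the identity, which is part of why removing dropout simplifies modern architectures at no cost.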
In the original GPT architecture, absolute positional embeddings addressed this by learning an embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings.

Figure 4: Illustration of absolute positional embeddings.

RoPE (Rotary Position Embedding) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover it separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs.

2.3 Swish/SwiGLU Replaces GELU

Early GPT architectures used GELU. Why use Swish over GELU now? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that's all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity.

Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck.

Figure 5: Comparison between Swish and GELU activations, which are both smoother versions of ReLU.

Early GPT architectures used GELU, which is defined as 0.5x * [1 + erf(x / sqrt(2))].
Here, erf (short for error function) is the integral of a Gaussian. It is computed using polynomial approximations, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply x * sigmoid(x). In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning.

Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU.

What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit, proposed in a 2020 paper. Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below.

Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU.

At first glance, it may appear that the GEGLU/SwiGLU variants are better than the regular feed forward layers simply because there are more parameters due to the extra layer. But this is deceiving, because in practice the W and V weight layers in SwiGLU/GEGLU are usually chosen to be smaller than the W_1 layer in a traditional feed forward layer. To illustrate this better, consider the concrete code implementations of the regular and GLU variants:

Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as "silu" in PyTorch.

So, suppose we have an embedding dimension of 1024.
In the regular feed forward case, this would be:

fc1: 1024 × 4096 = 4,194,304
fc2: 4096 × 1024 = 4,194,304

That is, fc1 + fc2 = 8,388,608 parameters. For the GLU variant, we have:

fc1: 1024 × 1024 = 1,048,576
fc2: 1024 × 1024 = 1,048,576
fc3: 1024 × 1024 = 1,048,576

I.e., 3 × 1,048,576 = 3,145,728 weight parameters. So, overall, using the GLU variants results in fewer parameters, and they perform better as well. The reason for this better performance is that the GLU variants provide an additional multiplicative interaction, which improves expressivity (the same reason deep & slim neural nets perform better than shallow & wide neural nets, provided they are trained well).

2.4 Mixture-of-Experts Replaces Single Feed Forward Module

In addition to upgrading the feed forward module to a SwiGLU, as discussed in the previous section, gpt-oss replaces the single feed forward module with multiple feed forward modules, using only a subset of them for each token generation step. This approach is known as Mixture-of-Experts (MoE) and is illustrated in Figure 8 below.

Figure 8: The feed forward module is replaced by a Mixture-of-Experts (MoE) module.

Replacing a single feed forward module with multiple feed forward modules (as done in an MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse, in contrast to dense modules that always use the full parameter set. The large total number of parameters increases the capacity of the LLM, which means it can take up more knowledge during training; the sparsity keeps inference efficient, though, as we don't use all the parameters at the same time.
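To make the router-plus-experts idea concrete, here is a minimal top-k MoE sketch in plain Python. The toy expert functions, the hand-set logits, and the 32-expert/4-active shape are invented for illustration (real routers are learned linear layers over the token's hidden state, and the experts are SwiGLU modules):

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax their logits
    so the gate weights sum to 1 (toy router, no learned weights)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = {i: math.exp(router_logits[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k):
    """Weighted sum over only the k active experts; all other
    experts are skipped entirely, which is the sparsity."""
    gates = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# 32 toy "experts" (expert i just scales its input by i) and a router
# that strongly prefers experts 3 and 7; 4 of 32 experts are active.
experts = [(lambda i: (lambda x: i * x))(i) for i in range(32)]
logits = [0.0] * 32
logits[3] = 2.0
logits[7] = 2.0
y = moe_forward(10.0, experts, logits, k=4)
```

Only 4 of the 32 expert functions are ever called per token, while all 32 contribute to the total parameter count.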
(Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.)

2.5 Grouped Query Attention Replaces Multi-Head Attention

As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections.

For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies.

Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries.

So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference, since fewer keys and values need to be stored in and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without a KV cache, and my KV-cache variant here.)

While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper) show it performs comparably to standard MHA in terms of LLM modeling performance.

2.6 Sliding Window Attention

Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer.
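Returning to section 2.5, the head-to-group sharing in GQA and its effect on KV-cache size can be sketched in a few lines. The layer, head, and sequence-length numbers below are illustrative only, not the actual gpt-oss configuration:

```python
def kv_group(head, n_heads, n_kv_groups):
    """Map a query head index to the key/value group it shares (GQA)."""
    return head // (n_heads // n_kv_groups)

# 4 query heads and 2 KV groups, as in Figure 9:
# heads 0 and 1 share group 0; heads 2 and 3 share group 1.
mapping = [kv_group(h, n_heads=4, n_kv_groups=2) for h in range(4)]

def kv_cache_scalars(n_layers, seq_len, n_kv_heads, head_dim):
    """Scalars the KV cache must hold: keys plus values for every
    layer, position, KV head, and head dimension."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim

# MHA keeps one K/V pair per query head; GQA keeps one per group.
mha_cache = kv_cache_scalars(n_layers=24, seq_len=1024, n_kv_heads=64, head_dim=64)
gqa_cache = kv_cache_scalars(n_layers=24, seq_len=1024, n_kv_heads=8, head_dim=64)
```

With 64 query heads sharing 8 KV groups in this toy setup, the cache shrinks by exactly the grouping factor of 8.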
You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs.

Figure 10: Comparison between regular attention (left) and sliding window attention (right).

Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article, Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small.

And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3:

"The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3"

Who knew!? I went back to the original GPT-3 paper, and it was indeed mentioned there:

"We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]."

2.7 RMSNorm Replaces LayerNorm

Finally, the last small tweak coming from GPT-2 is replacing LayerNorm (2016) with RMSNorm (2019), which has been a common trend in recent years. Akin to swapping GELU for Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements.
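The sliding-window restriction from section 2.6 boils down to one extra condition in the causal attention mask. A minimal sketch (a window of 4 is used here for readability, not the 128 that gpt-oss uses):

```python
def visible(i, j, window=None):
    """Causal mask: position i may attend to j <= i; with a sliding
    window, only the most recent `window` positions stay visible."""
    if j > i:
        return False                      # no attending to the future
    if window is not None and j <= i - window:
        return False                      # outside the local window
    return True

# Full (global) causal attention: token 10 sees all positions 0..10.
full_ctx = [j for j in range(11) if visible(10, j)]
# Sliding window of 4: token 10 only sees positions 7..10.
local_ctx = [j for j in range(11) if visible(10, j, window=4)]
```

Alternating layers then simply alternate between window=None (full context) and a fixed window size, which is the 1:1 pattern described above.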
RMSNorm is similar to LayerNorm in its purpose of normalizing layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes.

Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer.

As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (a variance and standard deviation of 1). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance; in the particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41.

Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improves training efficiency. Figure 12 shows what this looks like in code:

Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler.

2.8 The GPT-2 Legacy

I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work.
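To make the LayerNorm/RMSNorm contrast from section 2.7 concrete, here is a minimal pure-Python version of both (real implementations also apply a learned scale, and LayerNorm a learned shift, both omitted here):

```python
import math

def layernorm(x, eps=1e-6):
    """Subtract the mean, divide by the standard deviation:
    two reductions (mean and variance) over the features."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rmsnorm(x, eps=1e-6):
    """Divide by the root-mean-square: a single reduction,
    and no mean subtraction (no shift term)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [0.1, 1.2, 3.1, 0.3]
ln = layernorm(x)  # zero mean, unit variance
rn = rmsnorm(x)    # unit root-mean-square, mean left untouched
```

Counting the reductions makes the efficiency argument visible: layernorm needs two passes over the features (mean, then variance), rmsnorm only one (sum of squares).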
By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of the limitations or problems they try to solve. For instance, starting with my GPT-2 code, I recently implemented the Qwen3 architecture from scratch, which is super similar to gpt-oss. This brings us to the next topic: comparing gpt-oss to a more recent architecture.

3. Comparing gpt-oss To A Recent Architecture (Qwen3)

Now that we have walked through the evolution from GPT-2 to gpt-oss, we can take the next step and compare gpt-oss to a more recent architecture, Qwen3, which was released three months earlier, in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to gpt-oss due to its relatively similar overall size in terms of trainable parameters.

Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size.

Figure 13: A gpt-oss and Qwen3 model of comparable size side by side.

As we can see, gpt-oss-20b and Qwen3 30B-A3B are very similar in their architecture components. The primary difference, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 2.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections.
3.1 Width Versus Depth

If we look at the two models closely, we see that Qwen3 is a much deeper architecture, with 48 transformer blocks instead of 24 (Figure 14).

Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b.

On the other hand, gpt-oss is a much wider architecture:

- An embedding dimension of 2880 instead of 2048
- An intermediate expert (feed forward) projection dimension of 2880 instead of 768

It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension.

Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues caused by exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization, at a higher memory cost.

When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant), except for an ablation study in the Gemma 2 paper (Table 9), which found that for a 9B-parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score.

3.2 Few Large Versus Many Small Experts

As shown in Figure 14 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128) and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because recent trends and developments point towards more, smaller experts as being beneficial.
This change, at a constant total parameter size, is nicely illustrated in Figure 15 below, from the DeepSeekMoE paper.

Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066

Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts.

To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B model, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below.

Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and the number of experts.

The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus, and the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model and then chopped off some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it's quite unusual to scale only those two aspects (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally across many more aspects.

Figure 17: Architecture differences in the various Qwen3 models.

3.3 Attention Bias and Attention Sinks

Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in every second layer, as mentioned earlier. However, there's one interesting detail that caught my eye.
It seems that gpt-oss uses bias units for the attention weights, as shown in the figure below.

Figure 18: gpt-oss models use bias units in the attention layers. See code example here.

I haven't seen these bias units used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between training with and without bias units (see Figure 19 below).

Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units.

Another detail you may have noticed is the definition of sinks in the code screenshot in Figure 18. In general, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think the idea was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.)

In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs.

Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here.

3.4 License

Lastly, and similar to Qwen3, the gpt-oss models are released under the Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction.

Open-weight vs.
open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything, including training code, datasets, and weights, as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article.

4. Other Interesting Tidbits

While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional noteworthy details I have not mentioned yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning.

4.1 Training Overview

Unfortunately, there is not much information available about the training set sizes and algorithms. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below:

"The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...]" (1)
"[...] required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer." (1)
"[...] including a supervised fine-tuning stage and a high-compute RL stage [...]" (2)
"We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge." (2)

So, we know that the gpt-oss models are reasoning models.
The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours used to train the ~5.6x larger DeepSeek V3 model. Unfortunately, there is no information about the Qwen3 training time available yet. Interestingly, the gpt-oss training-hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately.

4.2 Reasoning Efforts

As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference-time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21.

Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card).

This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces.

It is somewhat unfortunate that OpenAI did not release the base models prior to the reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations.
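As a minimal sketch, selecting the reasoning effort is a matter of composing the system prompt. The message dictionaries below use the generic chat-completions shape; the exact prompt format a given gpt-oss runtime expects may differ, so treat this as an illustration rather than a specific SDK call:

```python
def chat_messages(user_prompt, effort="medium"):
    """Build a chat request whose system prompt pins the reasoning
    effort, as described in the model card. The dict shape is the
    generic chat-completions style, not a particular SDK's API."""
    assert effort in ("low", "medium", "high")
    return [
        {"role": "system", "content": f"Reasoning effort: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

# A simple task does not need extended reasoning:
msgs = chat_messages("Fix the typo in: 'teh cat'", effort="low")
```

Routing easy requests through effort="low" and hard ones through effort="high" is exactly the cost/accuracy trade-off described above.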
Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via an enable_thinking=True/False setting in the tokenizer that simply inserts empty <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models: After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality. Source 4.3 MXFP4 Optimization: A Small But Important Detail One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that has changed with the push toward bigger models. In this case, the MXFP4 optimization allows the models to run on single-GPU devices. Here's what that looks like in practice: The large model (gpt-oss-120b) fits on a single 80 GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a single-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well! The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be an RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as the RTX 4090, was recently added via a patch.) Note that the models will also run on older hardware without MXFP4 support, but they will then consume more RAM: without the MXFP4 optimization, the models in bfloat16 consume roughly 48 GB (gpt-oss-20b) and 240 GB (gpt-oss-120b).
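To put these numbers into perspective, here is a small back-of-the-envelope calculation. Note the assumptions: the total parameter counts (~21B and ~117B) are rounded, MXFP4 is approximated as 4.25 bits per weight (4-bit values plus a shared 8-bit scale per 32-value block), and in practice only the MoE experts are quantized, so real footprints land between the two estimates and additionally exclude activations and the KV cache:

```python
# Back-of-the-envelope weight-memory estimates for gpt-oss.
# Assumptions (not official numbers): ~21e9 total parameters for
# gpt-oss-20b and ~117e9 for gpt-oss-120b; MXFP4 approximated as
# 4.25 bits/weight (4-bit values + shared 8-bit scale per 32-value
# block). Only MoE experts are actually MXFP4-quantized, and
# activations/KV cache are ignored, so these are rough lower bounds.

def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Weight memory in GB at a given precision."""
    return num_params * bits_per_param / 8 / 1e9

BF16 = 16.0
MXFP4 = 4.25

for name, params in [("gpt-oss-20b", 21e9), ("gpt-oss-120b", 117e9)]:
    print(f"{name}: bf16 ~{weight_gb(params, BF16):.0f} GB, "
          f"mxfp4 ~{weight_gb(params, MXFP4):.1f} GB")
```

The bf16 estimates (42 GB and 234 GB) line up with the ~48 GB and ~240 GB figures quoted above once runtime overhead is added, and the all-MXFP4 estimates explain why the 120B model fits on a single 80 GB card.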
By the way, I can run the gpt-oss-20b model comfortably on my Mac Mini using ollama. It uses about 13.5 GB of memory, which is really reasonable. 4.4 Benchmarks The models are still a bit too new for independent benchmarks. Checking the LM Arena leaderboard, I found that gpt-oss is not listed yet. So, Qwen3-Instruct remains the top open-weight model, according to users on the LM Arena, for now (Figure 22). Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025) Looking at the reasoning benchmarks provided in the gpt-oss announcement post, we can see that the gpt-oss models are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23). Figure 23: The main benchmark charts are from the official gpt-oss announcement post. The "no tools" gpt-oss-120b data is taken from the official model card paper, and the Qwen3 numbers are taken from the official Qwen3 repository. However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3-235B-A22B-Thinking-2507 model and can run on a single GPU. Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting." Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we will increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization.
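Conceptually, such tool use boils down to a small dispatch loop: if the model emits a tool call, execute it and feed the result back; otherwise return the answer. Here is a toy sketch with a stubbed model and a made-up search function (neither is a real API; real gpt-oss tool calls go through the harmony format and an actual backend):

```python
# Toy tool-use loop: if the "model" emits a tool call, execute it and
# feed the result back; otherwise return the answer. Both toy_search
# and toy_model are stand-ins, not real APIs.
import json

def toy_search(query: str) -> str:
    """Stand-in for a real search backend."""
    knowledge = {"capital of France": "Paris"}
    return knowledge.get(query, "no results")

def toy_model(messages: list) -> str:
    """Stand-in for an LLM: asks for a search, then answers from it."""
    last = messages[-1]
    if last["role"] == "user":
        return json.dumps({"tool": "search", "query": "capital of France"})
    return f"The answer is {last['content']}."

def run(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(4):  # cap the number of tool rounds
        out = toy_model(messages)
        try:
            call = json.loads(out)  # tool call?
        except json.JSONDecodeError:
            return out  # plain-text final answer
        messages.append({"role": "tool", "content": toy_search(call["query"])})
    return out

print(run("What is the capital of France?"))  # → The answer is Paris.
```

The point of the sketch is the division of labor: the model only needs to know *how* to look things up, while the tool supplies the facts on demand.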
This is much like human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts. 5 gpt-oss and GPT-5 OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-weight models really are compared to their best product offering in terms of benchmark performance (Figure 24). Figure 24: The main benchmark charts are from the official GPT-5 announcement post. The gpt-oss data is taken from the official model card paper and announcement post, and the Qwen3 numbers are taken from the official Qwen3-Coder repository. All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open-weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on my limited usage so far. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models. This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways: Grab a copy of my book. Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training. Check out the video course. There's now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well either as a standalone resource or as a code-along companion. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of prerequisite video material created by Abhinav Kimothi. Subscribe.
A paid subscription helps make my writing sustainable and gives you access to additional content. Thanks for reading, and for helping support independent research! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Introducing GPT-5 openai 07.08.2025 00:00 0.724
Embedding sim. 0.751
Entity overlap 0.3333
Title sim. 0.7273
Time proximity 0.7143
NLP type: product_launch
NLP organization: OpenAI
NLP topic: large language models
NLP country:

Open original

We are introducing GPT‑5, our best AI system yet. GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more.
How Cursor uses GPT-5 openai 07.08.2025 00:00 0.717
Embedding sim. 0.8078
Entity overlap 0.2
Title sim. 0.1765
Time proximity 1
NLP type: other
NLP organization: Cursor
NLP topic: software development
NLP country:

Open original

Learn how Cursor uses GPT-5.
Open Weights and AI for All openai 05.08.2025 00:00 0.686
Embedding sim. 0.8008
Entity overlap 0
Title sim. 0.0794
Time proximity 0.9524
NLP type: product_launch
NLP organization:
NLP topic: foundation models
NLP country:

Open original

AI’s next frontier isn’t just about capability—it’s about who gets to use it. Our mission to put AI in the hands of as many people as possible is what drives us. Today’s release of our most capable open-weights models is a major step forward that makes advanced AI more open, flexible, and accessible worldwide.
From hard refusals to safe-completions: toward output-centric safety training openai 07.08.2025 00:00 0.676
Embedding sim. 0.7552
Entity overlap 0.5
Title sim. 0.0366
Time proximity 1
NLP type: other
NLP organization: OpenAI
NLP topic: large language models
NLP country:

Open original

Discover how OpenAI's new safe-completions approach in GPT-5 improves both safety and helpfulness in AI responses—moving beyond hard refusals to nuanced, output-centric safety training for handling dual-use prompts.
Providing ChatGPT to the Entire U.S. Federal Workforce openai 06.08.2025 00:00 0.672
Embedding sim. 0.775
Entity overlap 0.2
Title sim. 0.1972
Time proximity 0.7143
NLP type: partnership
NLP organization: OpenAI
NLP topic: enterprise ai
NLP country: United States

Open original

Today, OpenAI for Government is announcing a new partnership with the U.S. General Services Administration (GSA) to launch a transformative initiative. For the next year, ChatGPT Enterprise will be available to the entire federal executive branch workforce at essentially no cost.
What we’re optimizing ChatGPT for openai 04.08.2025 00:00 0.672
Embedding sim. 0.8253
Entity overlap 0.2857
Title sim. 0.1852
Time proximity 0.2024
NLP type: other
NLP organization:
NLP topic: generative ai
NLP country:

Open original

We build ChatGPT to help you thrive in all the ways you want. Learn how we're improving support for tough moments, have rolled out reminders to take breaks, and are working on better life advice, all guided by expert input.
Scaling accounting capacity with OpenAI openai 12.08.2025 00:00 0.648
Embedding sim. 0.7914
Entity overlap 0.4167
Title sim. 0.0462
Time proximity 0.3452
NLP type: product_launch
NLP organization: Basis
NLP topic: ai agents
NLP country:

Open original

Built with OpenAI o3, o3-Pro, GPT-4.1, and GPT-5, Basis’ AI agents help accounting firms save up to 30% of their time and expand capacity for advisory and growth.
Rethinking how we measure AI intelligence deepmind 04.08.2025 16:00 0.642
Embedding sim. 0.7407
Entity overlap 0
Title sim. 0.1212
Time proximity 0.9048
NLP type: product_launch
NLP organization: Google DeepMind
NLP topic: benchmarking
NLP country:

Open original

Rethinking how we measure AI intelligence (Aug 04, 2025). Kate Olszewska, Product Manager, Google DeepMind; Meg Risdal, Product Manager, Kaggle. Game Arena is a new, open-source platform for rigorous evaluation of AI models. It allows for head-to-head comparison of frontier systems in environments with clear winning conditions. Current AI benchmarks are struggling to keep pace with modern models. As helpful as they are to measure model performance on specific tasks, it can be hard to know if models trained on internet data are actually solving problems or just remembering answers they've already seen. As models reach closer to 100% on certain benchmarks, they also become less effective at revealing meaningful performance differences. We continue to invest in new and more challenging benchmarks, but on the path to general intelligence, we need to continue to look for new ways to evaluate. The more recent shift towards dynamic, human-judged testing solves these issues of memorization and saturation, but in turn, creates new difficulties stemming from the inherent subjectivity of human preferences. While we continue to evolve and pursue current AI benchmarks, we’re also consistently looking to test new approaches to evaluating models.
That’s why today, we're introducing the Kaggle Game Arena: a new, public AI benchmarking platform where AI models compete head-to-head in strategic games, providing a verifiable and dynamic measure of their capabilities. Why games are a meaningful evaluation benchmark Games provide a clear, unambiguous signal of success. Their structured nature and measurable outcomes make them the perfect testbed for evaluating models and agents. They force models to demonstrate many skills, including strategic reasoning, long-term planning, and dynamic adaptation against an intelligent opponent, providing a robust signal of their general problem-solving intelligence. The value of games as a benchmark is further enhanced by their scalability—difficulty increases with the opponent's intelligence—and by our ability to inspect and visualize a model's "reasoning," which offers a glimpse into its strategic thought process. Specialized engines like Stockfish and general game-playing AI models like AlphaZero have been able to play games at a superhuman level for many years and would beat every frontier model without a doubt. Today’s large language models, however, are not built to specialize in any specific games, and as a result they do not play them nearly as well. While the immediate challenge for the models is to close this gap, in the long term we would hope for them to achieve a level of play beyond what is currently possible. And with an endlessly increasing set of novel environments we can continue to challenge them even further. How Game Arena promotes fair and open evaluation Game Arena is built on Kaggle to provide a fair, standardized environment for model evaluation. For transparency, game harnesses — the frameworks that connect each AI model to the game environment and enforce the rules — as well as the game environments are all open-sourced.
Final rankings are determined by a rigorous all-play-all system, where an extensive number of matches between each model pair ensures a statistically robust result. Google DeepMind has long used games as a benchmark, from Atari to AlphaGo and AlphaStar, to demonstrate complex AI capabilities. By testing these models in a competitive arena, we can establish a clear baseline for their strategic reasoning and track progress. The goal is to build an ever-expanding benchmark that grows in difficulty as models face tougher competition. Over time, this could lead to novel strategies, much like AlphaGo's famous and creative “Move 37” that baffled human experts. The ability to plan, adapt and reason under pressure in a game is analogous to the thinking needed to solve complex challenges in science and business. How you can watch the chess exhibition matches On August 5 at 10:30 a.m. Pacific Time, join us for a special chess exhibition where eight frontier models will face off in a single-elimination showdown. We selected a sample from the matches for this exhibition. Hosted by the world's best chess experts, this event is the premier demonstration of the Game Arena methodology. While the fun exhibition matches are in a tournament format, the final leaderboard rankings will be determined by the all-play-all system and released after the exhibition. This more extensive method runs over a hundred matches between every pair of models to ensure a statistically robust and definitive measure of performance. You can find more details and how to watch the games at kaggle.com/game-arena. We plan to run more tournaments in the future on a regular basis; more on that soon. How we’re building the future of AI benchmarks This is only the beginning. Our vision for the Game Arena extends far beyond a single game. Kaggle will soon expand Game Arena with new challenges, starting with classics like Go and poker.
These games, along with future additions like video games, are excellent tests of AI’s ability to perform long-horizon planning and reasoning, helping us create a comprehensive and ever-evolving benchmark for AI. We’re committed to continuously adding new models and harnesses to the mix, pushing the boundaries of what AI models can achieve. For more details about the Game Arena and the inaugural chess exhibition tournament, see Kaggle’s blog post.
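To illustrate, the all-play-all ranking can be sketched as a simple win-rate computation over pairwise match results (a simplification; the actual Game Arena scoring method may differ, and the match data below is made up):

```python
# Sketch of an all-play-all leaderboard: every model pair plays many
# matches, and models are ranked by overall win rate. This is a
# simplification of Game Arena's methodology, with made-up results.
from collections import defaultdict

# (winner, loser) pairs from hypothetical round-robin matches
matches = [
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_a", "model_b"),
    ("model_c", "model_b"),
]

wins = defaultdict(int)
played = defaultdict(int)
for winner, loser in matches:
    wins[winner] += 1
    played[winner] += 1
    played[loser] += 1

# Rank by win rate, highest first
leaderboard = sorted(((wins[m] / played[m], m) for m in played), reverse=True)
for rate, model in leaderboard:
    print(f"{model}: {rate:.2f}")
```

Running many matches per pair, as Game Arena does, simply drives each estimated win rate toward its true value before the ranking is published.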
Introducing Gemma 3 270M: The compact model for hyper-efficient AI - Google Developers Blog deepmind 14.08.2025 00:00 0.634
Embedding sim. 0.7875
Entity overlap 0.037
Title sim. 0.3043
Time proximity 0.0595
NLP type: product_launch
NLP organization: Google DeepMind
NLP topic: large language models
NLP country:

Open original

Introducing Gemma 3 270M: The compact model for hyper-efficient AI (Aug. 14, 2025). Olivier Lacombe, Group Product Manager, Google DeepMind; Kathleen Kenealy, Research Engineer; Kat Black; Ravin Kumar; Francesco Visin; Jiageng Zhang. The last few months have been an exciting time for the Gemma family of open models. We introduced Gemma 3 and Gemma 3 QAT, delivering state-of-the-art performance for single cloud and desktop accelerators. Then, we announced the full release of Gemma 3n, a mobile-first architecture bringing powerful, real-time multimodal AI directly to edge devices. Our goal has been to provide useful tools for developers to build with AI, and we continue to be amazed by the vibrant Gemmaverse you are helping create, celebrating together as downloads surpassed 200 million last week.
Today, we're adding a new, highly specialized tool to the Gemma 3 toolkit: Gemma 3 270M, a compact, 270-million-parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in. Gemma 3 270M brings strong instruction-following capabilities to a small-footprint model. As shown by the IFEval benchmark (which tests a model's ability to follow verifiable instructions), it establishes a new level of performance for its size, making sophisticated AI capabilities more accessible for on-device and research applications. Core capabilities of Gemma 3 270M Compact and capable architecture: Our new model has a total of 270 million parameters: 170 million embedding parameters due to a large vocabulary size and 100 million for our transformer blocks. Thanks to the large vocabulary of 256k tokens, the model can handle specific and rare tokens, making it a strong base model to be further fine-tuned in specific domains and languages. Extreme energy efficiency: A key advantage of Gemma 3 270M is its low power consumption. Internal tests on a Pixel 9 Pro SoC show the INT4-quantized model used just 0.75% of the battery for 25 conversations, making it our most power-efficient Gemma model. Instruction following: An instruction-tuned model is released alongside a pre-trained checkpoint. While this model is not designed for complex conversational use cases, it’s a strong model that follows general instructions right out of the box. Production-ready quantization: Quantization-Aware Trained (QAT) checkpoints are available, enabling you to run the models at INT4 precision with minimal performance degradation, which is essential for deploying on resource-constrained devices. The right tool for the job In engineering, success is defined by efficiency, not just raw power. You wouldn't use a sledgehammer to hang a picture frame. The same principle applies to building with AI.
Gemma 3 270M embodies this "right tool for the job" philosophy. It's a high-quality foundation model that follows instructions well out of the box, and its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness. By starting with a compact, capable model, you can build production systems that are lean, fast, and dramatically cheaper to operate. A real-world blueprint for success The power of this approach has already delivered incredible results in the real world. A perfect example is the work done by Adaptive ML with SK Telecom. Facing the challenge of nuanced, multilingual content moderation, they chose to specialize. Instead of using a massive, general-purpose model, Adaptive ML fine-tuned a Gemma 3 4B model. The results were stunning: the specialized Gemma model not only met but exceeded the performance of much larger proprietary models on its specific task. Gemma 3 270M is designed to let developers take this approach even further, unlocking even greater efficiency for well-defined tasks. It's the perfect starting point for creating a fleet of small, specialized models, each an expert at its own task. But this power of specialization isn't just for enterprise tasks; it also enables powerful creative applications. For example, check out this Bedtime Story Generator web app: Gemma 3 270M used to power a Bedtime Story Generator web app using Transformers.js. The model’s size and performance make it suitable for offline, web-based, creative tasks. (Credit: Joshua (@xenovacom on X) from the Hugging Face team) When to choose Gemma 3 270M Gemma 3 270M inherits the advanced architecture and robust pre-training of the Gemma 3 collection, providing a solid foundation for your custom applications. Here’s when it’s the perfect choice: You have a high-volume, well-defined task.
Ideal for functions like sentiment analysis, entity extraction, query routing, unstructured-to-structured text processing, creative writing, and compliance checks. You need to make every millisecond and micro-cent count. Drastically reduce, or eliminate, your inference costs in production and deliver faster responses to your users. A fine-tuned 270M model can run on lightweight, inexpensive infrastructure or directly on-device. You need to iterate and deploy quickly. The small size of Gemma 3 270M allows for rapid fine-tuning experiments, helping you find the perfect configuration for your use case in hours, not days. You need to ensure user privacy. Because the model can run entirely on-device, you can build applications that handle sensitive information without ever sending data to the cloud. You want a fleet of specialized task models. Build and deploy multiple custom models, each expertly trained for a different task, without breaking your budget. Get started with fine-tuning We want to make it as easy as possible to turn Gemma 3 270M into your own custom solution. It’s built on the same architecture as the rest of the Gemma 3 models, with recipes and tools to get you started quickly. You can find our guide on full fine-tuning using Gemma 3 270M as part of the Gemma docs. Download the model: Get the Gemma 3 270M models from Hugging Face, Ollama, Kaggle, LM Studio, or Docker. We are releasing both pretrained and instruction-tuned models. Try the model: Try the models on Vertex AI or with popular inference tools like llama.cpp, Gemma.cpp, LiteRT, Keras, and MLX. Start fine-tuning: Use your favorite tools, including Hugging Face, UnSloth, and JAX. Deploy your solution: Once fine-tuned, you can deploy your specialized model anywhere, from your own local environment to Google Cloud Run. The Gemmaverse is built on the idea that innovation comes in all sizes.
With Gemma 3 270M, we’re empowering developers to build smarter, faster, and more efficient AI solutions. We can’t wait to see the specialized models you create.
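As a quick sanity check, the quoted 170M/100M parameter split is consistent with simple arithmetic: a 256k-token vocabulary means the embedding table alone holds vocab_size × hidden_size parameters. The hidden size below is inferred from these rounded figures and is not an official number:

```python
# Rough consistency check on the Gemma 3 270M parameter breakdown.
# The embedding and transformer counts are the (rounded) figures quoted
# in the post; the implied hidden size is inferred from them, not an
# official architecture detail.

VOCAB_SIZE = 256_000
EMBEDDING_PARAMS = 170e6    # quoted: embedding table parameters
TRANSFORMER_PARAMS = 100e6  # quoted: transformer block parameters

# embedding table = vocab_size * hidden_size  =>  solve for hidden_size
implied_hidden = EMBEDDING_PARAMS / VOCAB_SIZE
total = EMBEDDING_PARAMS + TRANSFORMER_PARAMS

print(f"implied hidden size ~{implied_hidden:.0f}")
print(f"total ~{total / 1e6:.0f}M parameters")
```

The takeaway: at this scale the vocabulary dominates the parameter budget, which is why such a small model can still handle rare and domain-specific tokens well.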
OpenAI’s letter to Governor Newsom on harmonized regulation openai 12.08.2025 00:00 0.624
Embedding sim. 0.8036
Entity overlap 0.0909
Title sim. 0.0472
Time proximity 0.1429
NLP type: other
NLP organization:
NLP topic: ai regulation
NLP country: United States

Open original

We’ve just sent a letter to Gov. Gavin Newsom calling for California to lead the way in harmonizing state-based AI regulation with national—and, by virtue of US leadership, emerging global—standards.
Introducing gpt-oss openai 05.08.2025 00:00 0.623
Embedding sim. 0.705
Entity overlap 0.25
Title sim. 0.1064
Time proximity 0.8571
NLP type: product_launch
NLP organization:
NLP topic: large language models
NLP country:

Open original

We’re releasing gpt-oss-120b and gpt-oss-20b—two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware.