Olmo 3 Shows How Far Open-Source Reasoning Can Go
Ai2’s latest open model pushes reasoning and context length forward while keeping compute demands in check
The Allen Institute for Artificial Intelligence (Ai2) is rolling out its third-generation open language model, promising top-tier performance without the steep energy demands or hefty infrastructure price tag. The nonprofit AI lab states that the new system, Olmo 3, provides developers with “full visibility and control” across every training stage, checkpoint, and dataset.
“Olmo 3 proves that openness and performance can advance together,” Ali Farhadi, Ai2’s chief executive, remarks in a statement. “By sharing the full model flow, we’re giving researchers, developers, and builders everything they need for frontier-scale AI development that’s efficient, reproducible, and built to serve science.”
“High performance doesn’t have to come at high cost, either,” he adds. “By dramatically reducing compute and energy requirements, we’re demonstrating that responsible, sustainable AI can scale without compromise.”
What’s Changed with Olmo?
Olmo 3 arrives nearly a year after its predecessor. When Ai2 introduced Olmo 2, it billed it as its “best fully open language model to date,” drawing comparisons to Meta’s Llama 3.1 and emphasizing stability and efficiency. With Olmo 3, the goal appears to be different. Rather than simply keeping pace with top-tier rivals, Ai2 is pushing ahead on context length and specialized capabilities—all while driving down the cost of training.
“Olmo 3 represents a significant scale-up in both data and methodology compared to Olmo 2,” Luca Soldaini, a research scientist on Ai2’s Olmo team, who goes by they/them, states in an email. “The team scaled data collection and strengthened dataset curation methods, training on Dolma 3—a dataset of nearly six trillion tokens. The dataset uses a novel method to upsample higher-quality documents and includes nearly a trillion tokens from scientific PDFs.”
There are other differences between Olmo 2 and Olmo 3. For example, while both have the same pre-training stages—initial and mid-training—the latter features an additional one: long-context extension. Soldaini explains that Olmo 3’s mid-training stage is based on the latest version of Dolmino, a collection of “higher-quality tokens of high-quality math, science, code, and reasoning data.” The model’s third pre-training stage extends the context window to 65,000 tokens—16 times larger than its predecessor and the equivalent of 48,000 words.
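For a rough sense of how those token counts translate to prose, the conversion is simple arithmetic. (The ~0.75 words-per-token ratio below is a common rule of thumb for English text, not an Ai2 figure, and the Olmo 2 window is inferred from the stated 16x increase.)

```python
OLMO3_CONTEXT = 65_000  # tokens, per Ai2's announcement
OLMO2_CONTEXT = 4_096   # tokens, implied by the "16 times larger" figure

def tokens_to_words(num_tokens: int, words_per_token: float = 0.75) -> int:
    # ~0.75 English words per token is a rough community heuristic
    return round(num_tokens * words_per_token)

print(tokens_to_words(OLMO3_CONTEXT))      # ~48,750 words, matching the "48,000 words" claim
print(round(OLMO3_CONTEXT / OLMO2_CONTEXT))  # ~16x the predecessor's window
```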
They also reveal that Olmo 3 uses a novel post-training data suite called Dolci, designed specifically for reasoning, tool use, and instruction-following. It has separate mixes for Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning with Verifiable Rewards.
Perhaps the most significant change is that Olmo 3 is the first of its kind to be explicitly designed for reasoning capabilities, though the technology is limited to one model variant.
The Olmo 3 Model Family
The third-generation Olmo is available in two standard sizes: 7B and 32B parameters. These are slightly different from Olmo 2—that family was only available in 7B and 13B sizes. Soldaini points out that 32B was chosen because it sits at a perceived “sweet spot” for reasoning and is designed to be efficient on modern hardware.
“A 7B can be fine-tuned on a single GPU, while a 32B model fits on a single node, eliminating the need for an expensive node interconnect,” they comment. “The jump from 7B to 32B delivers a large boost in reasoning capability, while moving beyond 32B often yields much smaller improvements relative to the extra compute, cost, and hardware requirements.”
Soldaini continues, “The team specifically built the 32B variant to provide a strong platform for reinforcement learning research and other advanced experiments that require better base model capabilities, which 32B unlocks, while remaining accessible to researchers and developers without requiring enterprise-scale infrastructure.”
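Soldaini’s single-GPU and single-node claims can be sanity-checked with back-of-envelope memory arithmetic. A minimal sketch, where the byte counts and hardware capacities are illustrative assumptions rather than Ai2 specifications:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    # bf16/fp16 weights at 2 bytes per parameter; ignores activations,
    # optimizer state, and KV cache, so treat this as a lower bound
    return num_params * bytes_per_param / 1e9

# 7B weights: ~14 GB -> fits on a single 80 GB GPU with headroom
# 32B weights: ~64 GB -> fits within one 8-GPU H100 node (8 x 80 GB),
# which is why no multi-node interconnect is needed
print(weight_memory_gb(7e9), weight_memory_gb(32e9))
```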
And what about a 1B offering? They acknowledge one is coming, but details will be shared “at a later date.”
Olmo-3-Base
A foundational model of the Olmo 3 family, it’s made for continued pre-training and specialized fine-tuning. Ai2 says Olmo-3-Base performs well in programming, reading comprehension, and math, and can integrate with reinforcement learning workflows thanks to its open training data and recipes.
Ai2 claims Olmo-3-Base 7B requires 2.5 times less compute for training when compared to Llama 3.1 8B.
Olmo-3-Think
This is Olmo 3’s prized pony. “Olmo-3-Think represents the flagship reasoning model in the family, featuring intermediate reasoning traces that surface the model’s step-by-step thinking process for complex problems,” Soldaini shares.
The 7B version brings that transparency to smaller-scale use cases, while the 32B option is built to work through problems step-by-step and deliver more detailed reasoning.
Olmo-3-Instruct
This is a post-trained version of Olmo-3-Base. It can carry out explicit commands or tasks, hold a coherent conversation over several back-and-forth exchanges, and connect with external systems, such as APIs, search tools, or software to perform actions or conduct tasks.
Olmo-3-RL Zero
This is a reinforcement learning pathway built on Olmo-3-Base. Ai2 states this is designed to “bootstrap complex reasoning behaviors and enable clear benchmarking of RL algorithms.” Olmo-3-RL Zero is only available in a 7B size and has four sub-versions, each defined by a different type of training data: math, code, instruction following, and general chat.
Nvidia’s AI infrastructure is powering the entire Olmo 3 family. Ai2 says the model’s development was primarily conducted across Nvidia H100 GPUs on Google Cloud’s A3 Mega Cluster. It builds on the partnership both firms began in August to develop open AI models for scientific breakthroughs.
“Transparency and performance are essential for developers to scale AI with open, U.S.-built models like Olmo 3,” Kari Briski, Nvidia’s vice president of generative AI software for enterprise, remarks. “Powered by Nvidia AI infrastructure, these models turn intelligence into America’s most renewable resource, accelerating AI for researchers and industries.”
Olmo 3 Use Cases
According to Soldaini, each Olmo 3 variation has its own distinct use case.
For example, Olmo-3-Base is ideal for “developing specialized capabilities like reasoning, tool use, and instruction following,” while Olmo-3-Think is for complex tasks that require step-by-step reasoning. Olmo-3-Instruct is suited for data processing and for “plugging into everyday tools like calculators, databases, or web services to fetch fresh facts and take simple actions—it can call functions and APIs including sources of live data like weather conditions and web search indices.”
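The function calling Soldaini describes typically follows a common pattern: the model emits a structured call, and the host application dispatches it to real code. A minimal sketch of that loop, with a hypothetical `get_weather` tool standing in for a live API—none of these names come from Ai2:

```python
import json

def get_weather(city: str) -> str:
    # stub standing in for a real weather API (hypothetical tool)
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Route a model-emitted tool call (JSON with 'name' and 'arguments') to a function."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulate the model emitting a structured call mid-generation:
print(dispatch('{"name": "get_weather", "arguments": {"city": "Seattle"}}'))
```

The same dispatch shape extends to the calculators, databases, and search tools the article mentions: each is just another entry in the tool registry.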
How Olmo 3 Performs
“Olmo 3 demonstrates that we don’t need to choose between efficiency and performance,” Ai2 writes in a blog post. According to the provided evaluations, the new Olmo models performed as well as, if not better than, other models that are “many times” their size. The organization reveals that Olmo 3 outperformed fully open peers, such as Apertus 70B and SmolLM 3, in terms of reasoning, comprehension, and long-context benchmarks.
“Olmo-3-Think 32B is the strongest fully open thinking model, narrowing the gap to the best open-weight models of similar scale, such as the Qwen-3-32B thinking models on our suite of reasoning benchmarks, while being trained on six times fewer tokens,” Soldaini says, adding that Olmo 3’s Instruct model outperforms Qwen 2.5, Gemma 3, and Llama 3, and is competitive with Qwen 3.
Though not one to hang its hat on evaluations, Ai2 shares that Olmo 3’s success proves it’s possible to provide “frontier-class results on far less compute,” which will make it easier for more researchers and developers to work with large AI models without raising the risk of environmental damage. Still, it declares that, based on performance and benchmarking, Olmo 3 is the “best American-made open-source model at this scale—the best 7B Western instruct and thinking model on the market.”
“By opening every stage of development—from data to deployment—Olmo 3 empowers researchers and developers to trace model behavior back to its sources, understand how training choices shape outcomes, and build with confidence on a fully transparent foundation,” the organization states. “Teams can fine-tune the models for new domains, experiment with alternative training objectives, or extend released checkpoints to drive fresh innovation across science, education, and real-world applications.”
And as is standard for everything Ai2 does, Olmo 3 is provided under an Apache 2.0 license. In addition, developers can download the model from Hugging Face and through Ai2’s playground today.
Updated on 11/20/2025 at 9:44 a.m. PT: Added details about Olmo-3-RL-Zero 7B

The 32B sweet spot makes a lot of sense from a practical standpoint. Most researchers don't have access to enterprise-level infrastructure, so keeping it efficient enough to run on a single node opens up a lot more experimentation possibilities. The transparency around training data and checkpoints is really what differentiates these models.