The factors driving large language model performance have fundamentally shifted over the past five years. While early success relied on pre-training scaling laws that emphasized bigger models and more data, the field has evolved through synthetic data generation and post-training optimization to reach today’s breakthrough: test-time compute scaling. This evolution represents not just technical progress but a paradigm shift from “training bigger” to “thinking longer.”
Introduction
The pursuit of better language model performance has taken us through several distinct eras, each with its own prevailing wisdom about what drives capability improvements. From the early days of simple scaling to today’s sophisticated reasoning models, the field has continuously redefined what it means to make AI systems more capable.
This journey can be understood through four key phases:
- Phase 1: the establishment of pre-training scaling laws
- Phase 2: the discovery that data quality matters as much as quantity
- Phase 3: the rise of synthetic data and post-training optimization
- Phase 4: the current breakthrough in test-time compute scaling
Each phase has built upon the previous one while fundamentally changing our understanding of how to build better AI systems.
Phase 1: Pre-training Scaling Laws (2020-2022)
The Kaplan Era: Size Matters Most
The modern understanding of language model scaling began with the seminal 2020 paper by Kaplan et al. from OpenAI [Kaplan, J., et al. (2020)], which established empirical scaling laws showing that model performance follows power-law relationships with model size, dataset size, and training compute across multiple orders of magnitude.
The Kaplan scaling laws suggest that as the pre-training compute budget increases, model size should be scaled more aggressively than data: given a 10x increase in training compute, one should scale model size by roughly 5.5x and data by roughly 1.8x. This philosophy directly influenced GPT-3’s architecture: the model had 175B parameters but was trained on only 300B tokens, roughly 1.7 tokens per parameter.
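To make the allocation rule concrete, here is a minimal Python sketch using the power-law exponents commonly quoted from the Kaplan et al. fits (roughly 0.73 for parameters and 0.27 for data). The exponents are approximate, but they reproduce the ~5.5x / ~1.8x split described above.

```python
# Minimal sketch of a Kaplan-style compute allocation rule.
# The exponents (~0.73 for parameters, ~0.27 for data) are the commonly
# quoted fits from Kaplan et al. (2020); treat them as approximations.

def kaplan_allocation(compute_multiplier: float) -> tuple[float, float]:
    """Return (model_size_multiplier, data_multiplier) for a given
    increase in pre-training compute under the Kaplan fits."""
    model_mult = compute_multiplier ** 0.73
    data_mult = compute_multiplier ** 0.27
    return model_mult, data_mult

if __name__ == "__main__":
    m, d = kaplan_allocation(10.0)
    print(f"10x compute -> ~{m:.1f}x model size, ~{d:.1f}x data")
    # 10x compute -> ~5.4x model size, ~1.9x data (matches the ~5.5x / ~1.8x above)

    # GPT-3 as a data point: 175B parameters, 300B training tokens
    print(f"GPT-3 tokens per parameter: {300e9 / 175e9:.1f}")  # ~1.7
```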
The implications were profound. Research focused intensively on building larger models, with the assumption that scale alone would unlock new capabilities. This period saw the race to trillion-parameter models and the emergence of what would later be called “emergent abilities” – capabilities that seemed to appear suddenly at certain scales.
The Chinchilla Revolution: Data Equality
However, the Kaplan scaling laws contained methodological limitations that would soon be exposed: the original analysis did not account for embedding parameters and fit the laws on relatively small models, and the resulting estimates did not necessarily hold at larger scales.
In 2022, DeepMind’s Chinchilla paper [Hoffmann, J., et al. (2022)] fundamentally challenged this paradigm. The Chinchilla Scaling Laws demonstrated that model size and training data should be scaled equally for compute-optimal training, contradicting the earlier Kaplan approach which prioritized larger models over more data.
GPT-3 used a 175B parameter model trained on 300B tokens (1.7 tokens per parameter). According to Chinchilla’s equal scaling principle, to properly train a 175B parameter model, GPT-3 should have used approximately 3.5T tokens (20 tokens per parameter) – about 11x more data than they actually used.
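For readers who want to check the arithmetic, a short sketch assuming the ~20 tokens-per-parameter rule of thumb cited above:

```python
# Back-of-the-envelope check of the Chinchilla comparison above,
# using the ~20 tokens-per-parameter guideline from Hoffmann et al. (2022).

GPT3_PARAMS = 175e9          # parameters
GPT3_TOKENS = 300e9          # tokens actually used
CHINCHILLA_TOKENS_PER_PARAM = 20

optimal_tokens = GPT3_PARAMS * CHINCHILLA_TOKENS_PER_PARAM
print(f"Compute-optimal tokens for 175B params: {optimal_tokens / 1e12:.1f}T")    # ~3.5T
print(f"Shortfall vs. actual training data: {optimal_tokens / GPT3_TOKENS:.1f}x")  # ~11.7x ("about 11x" above)
```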
Beyond Chinchilla: The Inference Optimization Challenge
Real-world deployment considerations soon revealed another limitation of compute-optimal training. Following the Chinchilla Scaling Laws alone leads to the “Chinchilla Trap”: you end up with a model that is far too large, and therefore too expensive to serve at scale at inference time. Larger models require more GPU memory, process requests more slowly, and consume more electricity per query. While Chinchilla optimized for training efficiency, it ignored the ongoing costs of actually serving the model to users, which can dwarf training costs over the model’s lifetime.
This led to what researchers call “beyond Chinchilla” scaling, where models are deliberately overtrained to optimize for inference efficiency. This trend is evident in recent model architectures: Meta deliberately overtrained their Llama 3 70B model with roughly ten times more data per parameter than the Chinchilla guideline suggests, while Microsoft’s Phi-3 pushes this approach to an extreme with roughly forty-five times the recommended data-to-parameter ratio. This approach trades higher upfront training costs for dramatically lower serving costs. A smaller, overtrained model can achieve similar performance while being much cheaper and faster to deploy at scale.
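The overtraining multipliers can be made concrete by applying them to the ~20 tokens-per-parameter guideline. The sketch below uses only the multipliers quoted above, since exact token counts vary by report.

```python
# Rough tokens-per-parameter ratios implied by the overtraining multipliers
# mentioned above (relative to the Chinchilla ~20 tokens-per-parameter guideline).
# The multipliers come from the text; exact token counts vary by report.

CHINCHILLA_RATIO = 20  # tokens per parameter

models = {
    "Chinchilla-optimal": 1,
    "Llama 3 70B (~10x overtrained)": 10,
    "Phi-3 (~45x overtrained)": 45,
}

for name, multiplier in models.items():
    ratio = CHINCHILLA_RATIO * multiplier
    print(f"{name}: ~{ratio} tokens per parameter")
```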
Phase 2: Synthetic Data Generation (2023-2024)
The Data Quality Awakening
As models began hitting the limits of internet-scale data, the field realized that data quality might matter as much as quantity. The effectiveness of models trained on carefully curated datasets like TinyStories, combined with the approach used in Microsoft’s Phi series, revealed that data quality often trumps quantity, enabling researchers to use LLMs for strategic dataset creation [Eldan & Li (2023)] [Beatty, S. (2024)].
This realization coincided with a practical problem: Chinchilla’s scaling requirements suggested that frontier models would soon exhaust publicly available web content for training purposes. The solution emerged from an unlikely source – using LLMs themselves to generate training data.
LLM-Driven Synthetic Data Generation
Language models can generate synthetic training data, creating new datasets that serve multiple purposes including model training, fine-tuning, and performance evaluation. The advantages were compelling: generating synthetic data is often cheaper than collecting real-world data, provides unparalleled scalability, and doesn’t include personal or sensitive information.
The technical sophistication of these approaches rapidly advanced. IBM developed LAB (Large-scale Alignment for chatBots) [Sudalairaj et al. (2023)], a systematic approach for creating targeted synthetic datasets while integrating new capabilities into base models. The approach uses a taxonomy that segregates training instructions across three distinct areas: knowledge, foundational skills, and compositional skills.
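As a rough illustration of the idea (not IBM’s actual taxonomy files or tooling), such a taxonomy can be thought of as a small tree whose leaves seed targeted generation prompts. The branch and leaf names below are invented for illustration.

```python
# Illustrative sketch only: a toy taxonomy in the spirit of LAB's three
# top-level branches (knowledge, foundational skills, compositional skills).
# The leaf entries are invented and do not reflect IBM's actual taxonomy.

taxonomy = {
    "knowledge": ["finance/bonds", "science/thermodynamics"],
    "foundational_skills": ["summarization", "arithmetic_reasoning"],
    "compositional_skills": ["write_email_using_financial_report"],
}

def iter_leaves(tree: dict[str, list[str]]):
    """Yield (branch, leaf) pairs that a generator model could be prompted with."""
    for branch, leaves in tree.items():
        for leaf in leaves:
            yield branch, leaf

for branch, leaf in iter_leaves(taxonomy):
    print(f"Generate synthetic instructions for {branch} -> {leaf}")
```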
NVIDIA developed the Nemotron-4 340B model family specifically for their NeMo and TensorRT-LLM ecosystems, providing both instructional and reward-based models plus training data. These components combine to form a data generation framework supporting both initial training and iterative model enhancement across a range of commercial applications [NVIDIA et al. (2024)].
The Synthetic Data Ecosystem
By 2024, synthetic data had become integral to the LLM development pipeline. LLM-driven synthetic data generation facilitates end-to-end automation of training and evaluation pipelines while minimizing manual intervention.
The approach evolved sophisticated techniques for ensuring quality and diversity. One major obstacle when leveraging language models for data creation is achieving adequate variation in outputs: simple prompting strategies frequently produce similar or identical responses, regardless of temperature settings. Researchers address this limitation through structured prompting techniques that specify precise requirements for the desired content, using detailed parameter specifications to guide models toward generating more diverse and targeted synthetic datasets [Lu, S., et al. (2024)].
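A minimal sketch of such attribute-conditioned prompting shows why explicit parameters help: each sampled combination steers the generator toward a different slice of the output space. The attribute lists and prompt template here are invented for illustration, not taken from any specific system.

```python
import random

# Sketch of attribute-conditioned ("structured") prompting for diversity.
# The attribute lists and template are illustrative; the idea is to pin down
# explicit parameters so repeated calls to the same generator model do not
# collapse onto near-identical outputs.

TOPICS = ["tax law", "astronomy", "cooking", "distributed systems"]
AUDIENCES = ["a high-school student", "a domain expert", "a busy executive"]
DIFFICULTIES = ["easy", "intermediate", "hard"]

def build_prompt(rng: random.Random) -> str:
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    difficulty = rng.choice(DIFFICULTIES)
    return (
        f"Write one {difficulty} question about {topic}, phrased for {audience}, "
        "followed by a correct, self-contained answer."
    )

rng = random.Random(0)
for _ in range(3):
    # Each prompt targets a different (topic, audience, difficulty) slice.
    print(build_prompt(rng))
```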
Phase 3: Post-Training and Alignment (2023-2025)
The RLHF Foundation
While pre-training established the base capabilities of language models, post-training became crucial for making them useful and safe. Standard fine-tuning procedures generally follow a two-stage process: supervised fine-tuning creates an initial capable model by learning from high-quality examples, after which reinforcement learning from human feedback further improves performance based on human preferences. Researchers have developed two distinct strategies for model alignment: one approach builds explicit reward models from preference data and uses reinforcement learning algorithms such as PPO, while alternative methods like DPO bypass reward modeling entirely by directly optimizing on preference comparisons.
The DPO Simplification
Direct Preference Optimization (DPO) emerged as a game-changer in alignment: a streamlined method for aligning models with human preferences. Rather than following the traditional two-step process of building reward models and applying reinforcement learning algorithms like PPO, DPO reformulates alignment as a direct classification problem over preference comparisons. The benefits proved substantial: the approach reduces implementation complexity by removing the reward modeling stage, demands fewer computational resources, and involves less hyperparameter tuning than conventional RLHF workflows, while still delivering competitive performance [Xu et al. (2024)].
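The core of DPO is a single classification-style loss over preference pairs. The sketch below, written against PyTorch and operating on pre-computed sequence log-probabilities, follows the published objective; the variable names and toy usage are illustrative.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the DPO objective, operating on pre-computed sequence
# log-probabilities. In practice these come from the trainable policy and a
# frozen reference model evaluated on the chosen and rejected responses.

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Average DPO loss over a batch of preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the implicit rewards of chosen vs. rejected.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random numbers standing in for real log-probabilities.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(f"toy DPO loss: {loss.item():.3f}")
```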
The Cost Economics of Post-Training
By 2025, post-training had become a major cost center in AI development. The escalation is dramatic:
Llama 2 (Q3 2023): Post-training costs of $10-20 million included 1.4 million preference pairs, reinforcement learning from human feedback (RLHF), instruction fine-tuning, and safety measures.
Llama 3.1 (Q3 2024): Post-training costs exceeded $50 million despite using a similar volume of preference data. The key difference was a substantially larger post-training team of approximately 200 people and more complex model architectures.
This represented more than a 2.5x cost increase in just one year. The dramatic escalation occurred not because companies used more training data, but because post-training processes had grown increasingly complex and required specialized expertise to implement effectively. The computational intensity reached its peak with reasoning models like OpenAI’s o1 series: in these advanced systems, post-training consumed 40% or more of the model’s total computational budget, fundamentally changing the economics of AI development [Lambert, N. (2025)].
The Synthetic Data Integration
The convergence of synthetic data and post-training created powerful new possibilities. Although leading AI companies continue using human-generated data for certain aspects of post-training workflows, automated alternatives can replace human involvement in many pipeline stages while maintaining acceptable quality standards. The economic impact is dramatic: replacing human preference labeling (which costs $5-20 per data point) with AI-generated feedback (under $0.01 per sample) represents potential cost reductions of several orders of magnitude.
Phase 4: Test-Time Compute (2024-2025)
The Paradigm Shift
2024 was the year in which improvements in model performance were driven primarily by post-training and by scaling test-time compute, with comparatively little news on the pre-training front. This represented a fundamental shift from making models bigger to making them “think longer.”
The breakthrough came with OpenAI’s o1 model, which uses reinforcement learning to enhance reasoning capabilities. Unlike previous models that generate immediate responses, o1 employs extended deliberation, developing comprehensive reasoning chains before providing answers. Performance benchmarks reveal impressive results: the model achieves top-tier performance on programming challenges (89th percentile on Codeforces), mathematical competitions (top 500 nationally on AIME), and scientific assessments (surpassing PhD-level accuracy on GPQA across physics, biology, and chemistry) [OpenAI (2024)].
The Science of Test-Time Compute
Test-time compute refers to the amount of computational power used by an AI model when it is generating a response or performing a task after it has been trained. In simple terms, it’s the processing power and time required when the model is actually being used, rather than when it is being trained. Some advanced AI models, like OpenAI’s o1 series, dynamically increase their reasoning time during inference. This means they can spend more time thinking about complex questions, improving accuracy at the cost of higher compute usage.
The approach mirrors human cognition: traditional LLMs operate much like System 1 thinking – quick, intuitive, and based on pattern recognition. They generate responses rapidly based on their trained neural networks. In contrast, Reasoner models embody System 2 thinking – deliberate, methodical, and self-correcting. They can pause, reflect on their reasoning, and even backtrack when they detect potential errors in their logic.
The Scaling Properties
The scaling properties of test-time compute proved remarkable. Research demonstrates that o1’s capabilities scale predictably with both training-phase reinforcement learning and inference-time computational allocation. The scaling dynamics for reasoning models differ fundamentally from traditional language model pretraining, presenting new optimization challenges that researchers continue to explore [OpenAI (2024)].
Test-time scaling strategies include parallel approaches (generating multiple solution candidates) and sequential methods (iterative solution refinement), which can be combined for enhanced performance [Raschka (2025)]. The computational trade-offs became evident with o3’s release: while achieving 88.5% on the ARC-AGI benchmark, substantially exceeding both human performance (77%) and o1’s results (32%), the high-performance configuration required over $1,000 in computational resources per task, compared to o1’s approximate $5 cost [NextBigFuture (2024)] [TechCrunch (2024)].
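As a concrete illustration of the parallel strategy, the sketch below implements simple majority voting over N sampled answers (self-consistency); `generate_answer` and `noisy_model` are placeholders, not any vendor’s API. A sequential method would instead feed a draft answer back to the model for revision, spending more tokens per attempt rather than more attempts.

```python
import random
from collections import Counter
from typing import Callable

# Sketch of the "parallel" test-time scaling strategy: sample N candidate
# answers and keep the most common one (majority vote / self-consistency).
# `generate_answer` stands in for a real model call.

def majority_vote(generate_answer: Callable[[str], str],
                  question: str,
                  n_samples: int = 16) -> str:
    candidates = [generate_answer(question) for _ in range(n_samples)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

# Toy stand-in: a "model" that answers correctly 60% of the time.
def noisy_model(question: str) -> str:
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))

print(majority_vote(noisy_model, "What is 6 x 7?"))  # usually "42" as N grows
```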
The Open-Source Response
OpenAI’s reasoning breakthrough catalyzed widespread open-source development efforts. DeepSeek developed their R1 series, implementing alternative test-time scaling methodologies through systematic step-by-step reasoning processes. Their approach includes R1-Zero (trained exclusively through reinforcement learning without supervised examples) and the standard R1 model (which combines limited supervised fine-tuning with reinforcement learning) [Hugging Face (2025)] [DeepSeek-AI (2025)].
However, research soon revealed limitations: for R1 and QwQ, extending solution length does not necessarily yield better performance, owing to the models’ limited self-revision capabilities. Research attributes this phenomenon to model underthinking, where models initially reach correct intermediate solutions but subsequently deviate toward incorrect conclusions during extended reasoning [Wu, F., et al. (2025)].
The Current Landscape and Future Directions
Integration of All Phases
Modern AI development now represents a synthesis of all these phases. This integration is evident across major AI platforms: Anthropic’s Claude 3.7 Sonnet and xAI’s Grok 3 feature optional reasoning modes that users can activate within the same model interface. In contrast, OpenAI maintains separate model variants (such as GPT-4o versus o1-mini) requiring users to select different systems for standard versus reasoning tasks. IBM has similarly incorporated reasoning toggles into their Granite model family. The widespread adoption of reasoning capabilities, whether through extended inference time or enhanced training methods, represents a significant advancement in language model development in 2025 [Raschka (2025)].
Anthropic co-founder Jack Clark [Clark, J. (2024)] observed that o3 demonstrates accelerating AI advancement, suggesting 2025 will see more rapid progress than the previous year. Clark anticipates that researchers will combine inference-time scaling techniques with conventional pretraining approaches to maximize model performance gains.
Looking ahead, the integration of all these approaches points toward AI systems that are not just larger or faster, but genuinely more thoughtful and capable. The question is no longer just “how do we make models bigger?” but “how do we make them think better?” This shift from raw scale to sophisticated reasoning may prove to be the most important development in the field’s history.
References
[1] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
[3] Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Microsoft Research. arXiv:2305.07759.
[4] Beatty, S. (2024). Tiny but mighty: The Phi-3 small language models with big potential. Microsoft News.
[5] Sudalairaj et al. (2023). LAB: Large-scale alignment for chatbots. MIT-IBM Watson AI Lab and IBM Research. arXiv:2305.14627.
[6] NVIDIA et al. (2024). Nemotron-4 340B Technical Report. arXiv:2406.11704.
[7] Lu, S., et al. (2024). On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. arXiv preprint.
[8] Xu et al. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. arXiv:2404.10719.
[9] Lambert, N. (2025). The state of post-training in 2025. Interconnects.
[10] OpenAI. (2024). Learning to reason with LLMs. Retrieved from OpenAI website.
[11] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
[12] Wu, F., et al. (2025). Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? arXiv preprint arXiv:2502.12215.
[13] Clark, J. (2024). Import AI 395: AI and energy demand. Import AI.