Model size, dataset size and compute all depend on the availability of necessary AI infrastructure
In January 2020, a team of OpenAI researchers led by Jared Kaplan, who later went on to co-found Anthropic, published a paper titled “Scaling Laws for Neural Language Models.” The researchers observed “precise power-law scalings for performance as a function of training time, context length, dataset size, model size and compute budget.” In essence, an AI model’s performance improves predictably as model size, dataset size and compute are scaled up. While the commercial trajectory of AI has changed materially since 2020, the scaling laws have held steadfast, and that has significant implications for the AI infrastructure underlying the model training and inference that users increasingly depend on.
Before proceeding, we’ll break down the scaling laws:
- Model size scaling shows that increasing the number of parameters in a model typically improves its ability to learn and generalize, assuming it’s trained on a sufficient amount of data. Improvements can plateau if dataset size and compute resources aren’t proportionately scaled.
- Dataset size scaling relates model performance to the quantity and quality of data used for training. The importance of dataset size can diminish if model size and compute resources aren’t proportionately scaled.
- Compute scaling means that more compute (GPUs, servers, networking, memory, power, etc.) translates into improved model performance because training can run longer on more data; this is the law that speaks most directly to the underlying AI infrastructure.
In sum, a large model needs a large dataset to work effectively, and training on a large dataset requires significant investment in compute resources. Scaling one of these variables without the others leads to process and outcome inefficiencies. Also worth noting here is the Chinchilla Scaling Hypothesis, developed by researchers at DeepMind and memorialized in the 2022 paper “Training Compute-Optimal Large Language Models,” which found that, for a given compute budget, scaling the training dataset in step with model size is more effective than simply building a bigger model.
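To make the trade-off concrete, here is a minimal sketch in Python. It assumes the approximate parametric loss fit reported in the Chinchilla paper, L(N, D) ≈ E + A/N^alpha + B/D^beta, and the common C ≈ 6·N·D rule of thumb for training FLOPs; the constants and the two model/data splits are illustrative, not a reproduction of the paper’s analysis.

```python
# A minimal sketch, not the paper's exact methodology: plug the approximate
# parametric loss fit from "Training Compute-Optimal Large Language Models"
# (Hoffmann et al., 2022) into a comparison of two ways to spend a roughly
# similar training budget, using the common C ~= 6 * N * D FLOPs rule of thumb.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # approximate fitted constants

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

# Two Gopher/Chinchilla-like splits of a roughly similar compute budget:
bigger_model   = (280e9, 300e9)   # 280B params, 0.3T tokens
balanced_model = (70e9, 1.4e12)   # 70B params, 1.4T tokens (~20 tokens/param)

for name, (n, d) in [("bigger model", bigger_model), ("balanced model", balanced_model)]:
    print(f"{name}: ~{train_flops(n, d):.2e} FLOPs, predicted loss ~{predicted_loss(n, d):.3f}")
```

With these fitted constants, the smaller but longer-trained configuration predicts a lower loss at a roughly comparable FLOP count, which is the Chinchilla result in miniature.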
“I’m a big believer in scaling laws,” Microsoft CEO Satya Nadella said in a recent interview with Brad Gerstner and Bill Gurley. He said the company realized in 2017 “don’t bet against scaling laws but be grounded on exponentials of scaling laws becoming harder. As the [AI compute] clusters become harder, the distributed computing problem of doing large scale training becomes harder.” Looking at long-term capex associated with AI infrastructure deployment, Nadella said, “This is where being a hyperscaler I think is structurally super helpful. In some sense, we’ve been practicing this for a long time.” He said build-out costs will normalize, “then it will be you just keep growing like the cloud has grown.”
Nadella explained in the interview that his current scaling constraints were no longer around access to the GPUs used to train AI models but, rather, the power needed to run the AI infrastructure used for training.
Datacenter investor Obinna Isiadinso with IFC had a good analysis of this in a LinkedIn post titled “2025’s Data Center Landscape: Why Location Strategy Now Starts with Power Availability.” Looking at the North American market, he tallied 2,700 data centers and expected annual energy consumption of 139 billion kilowatt-hours beginning this year. “Power availability remains the primary factor influencing site selection in North America,” Isiadinso wrote. “Development activity is expanding beyond traditional hubs into new territories, particularly in the central United States where wind power resources are abundant.” So power.
And two more AI scaling laws
Beyond the three AI scaling laws outlined above, NVIDIA CEO Jensen Huang, speaking during a keynote session at the Consumer Electronics Show earlier this month, threw out two more that have “now emerged.” Those are the post-training scaling law and test-time scaling.
One at a time: post-training scaling refers to a series of techniques used to improve AI model outcomes and make the systems more efficient. Some of the relevant techniques include:
- Fine-tuning adapts a model with domain-specific data, requiring far less compute and data than building a new model from scratch.
- Quantization reduces the numerical precision of a model’s weights to make it smaller and faster while maintaining acceptable performance, cutting memory and compute requirements (a minimal sketch follows after this list).
- Pruning removes unnecessary parameters from a trained model, making it more efficient with little to no loss in performance.
- Distillation compresses the knowledge of a large model into a smaller one while retaining most of its capabilities.
- Transfer learning reuses a pre-trained model for related tasks, meaning the new tasks require less data and compute.
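To ground one of these techniques, here is a minimal sketch of symmetric, per-tensor int8 quantization in plain NumPy. Production toolchains typically add per-channel scales, calibration data and activation quantization, so treat this as illustrative only.

```python
import numpy as np

# A minimal sketch of symmetric int8 post-training quantization for a single
# weight tensor, using one per-tensor scale factor.

def quantize_int8(weights: np.ndarray):
    """Map float weights onto int8 values with a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())           # small quantization error
print("bytes: float32 =", w.nbytes, " int8 =", q.nbytes)   # ~4x smaller
```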
Huang likened post-training scaling to “having a mentor or having a coach give you feedback after you’re done going to school. And so you get tests, you get feedback, you improve yourself.” That said, “Post-training requires an enormous amount of computation, but the end result produces incredible models.”
The second (or fifth) AI scaling law is test-time scaling, which refers to techniques applied after training, during inference, to enhance performance and drive efficiency without retraining the model. Some of the core concepts here are:
- Dynamic model adjustment based on the input or system constraints to balance accuracy and efficiency on the fly.
- Ensembling at inference combines predictions from multiple models or model versions to improve accuracy (see the sketch after this list).
- Input-specific scaling adjusts model behavior based on inputs at test-time to reduce unnecessary computation while retaining adaptability when more computation is needed.
- Quantization at inference reduces precision to speed up processing.
- Active test-time adaptation allows for model tuning in response to data inputs.
- Efficient batch processing groups inputs to maximize throughput and minimize computation overhead.
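As a concrete illustration of the ensembling idea above, here is a minimal sketch that averages class probabilities from several models and takes the argmax; the “models” are random stand-ins, not a real serving stack.

```python
import numpy as np

# A minimal sketch of ensembling at inference: average the predicted class
# probabilities from several (hypothetical) models and take the argmax.

rng = np.random.default_rng(0)

def fake_model_probs(n_classes: int = 5) -> np.ndarray:
    """Stand-in for one model's softmax output on a single input."""
    logits = rng.normal(size=n_classes)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

ensemble = [fake_model_probs() for _ in range(3)]   # three models or checkpoints
avg_probs = np.mean(ensemble, axis=0)               # soft-voting ensemble
prediction = int(np.argmax(avg_probs))

print("averaged probabilities:", np.round(avg_probs, 3))
print("ensemble prediction:", prediction)
```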
As Huang put it, test-time scaling is, “When you’re using the AI, the AI has the ability to now apply a different resource allocation. Instead of improving its parameters, now it’s focused on deciding how much computation to use to produce the answers it wants to produce.”
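One concrete way a system can “decide how much computation to use” at inference is best-of-N sampling with majority voting (often called self-consistency), sketched below. The `generate_answer` function is a hypothetical stand-in for a sampled model call, not any vendor’s actual API.

```python
import random
from collections import Counter

# A hedged sketch of spending extra inference compute via majority voting
# ("self-consistency"): sample several candidate answers, keep the most common.

def generate_answer(prompt: str) -> str:
    """Hypothetical model call: returns the right answer most of the time."""
    return random.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

def answer_with_test_time_compute(prompt: str, n_samples: int) -> str:
    """More samples = more inference compute = a more reliable majority vote."""
    votes = Counter(generate_answer(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_test_time_compute("What is 6 x 7?", n_samples=1))   # cheap, noisier
print(answer_with_test_time_compute("What is 6 x 7?", n_samples=16))  # more compute, more reliable
```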
Regardless, he said, whether it’s post-training or test-time scaling, “The amount of computation that we need, of course, is incredible…Intelligence, of course, is the most valuable asset that we have, and it can be applied to solve a lot of very challenging problems. And so, [the] scaling laws…[are] driving enormous demand for NVIDIA computing.”
The evolution of AI scaling laws—from the foundational trio identified by OpenAI to the more nuanced concepts of post-training and test-time scaling championed by NVIDIA—underscores the complexity and dynamism of modern AI. These laws not only guide researchers and practitioners in building better models but also drive the design of the AI infrastructure needed to sustain AI’s growth.
The implications are clear: as AI systems scale, so too must the supporting AI infrastructure. From the availability of compute resources and power to advancements in optimization techniques, the future of AI will depend on balancing innovation with sustainability. As Huang aptly noted, “Intelligence is the most valuable asset,” and scaling laws will remain the roadmap to harnessing it efficiently. The question isn’t just how large we can build models, but how intelligently we can deploy and adapt them to solve the world’s most pressing challenges.