The convergence of test-time inference scaling and edge AI

As AI shifts from centralized clouds to distributed edge environments, the challenge is no longer just model training—it’s scaling inference efficiently. Test-time inference scaling is emerging as a critical enabler of real-time AI execution, allowing AI models to dynamically adjust compute resources at inference time based on task complexity, latency needs, and available hardware. This shift is fueling the rapid rise of edge AI, where models run locally on devices instead of relying on cloud data centers—improving speed, privacy, and cost-efficiency.
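What “adjusting compute at inference time” can look like is easiest to see in a small sketch. The Python below is illustrative only and not tied to any vendor’s stack: it assigns each request a candidate count and token budget based on an estimated complexity score, a latency target and the throughput of the local hardware. Every name and number in it is an assumption made for the example.

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        tokens_per_second: float  # rough generation throughput of the local hardware

    def pick_inference_budget(complexity: float, latency_budget_s: float, device: Device) -> dict:
        """Choose how much test-time compute to spend on a single request.

        complexity: 0.0 (trivial) to 1.0 (hard), e.g. from a cheap router model.
        latency_budget_s: how long the caller is willing to wait.
        """
        # Harder tasks get more candidate answers and a longer reasoning budget...
        candidates = 1 + round(3 * complexity)
        max_new_tokens = int(128 + 896 * complexity)

        # ...but never more than the hardware can produce within the latency budget.
        token_capacity = int(device.tokens_per_second * latency_budget_s)
        max_new_tokens = min(max_new_tokens, max(32, token_capacity // candidates))

        return {"candidates": candidates, "max_new_tokens": max_new_tokens}

    if __name__ == "__main__":
        phone = Device("phone-npu", tokens_per_second=25.0)
        workstation = Device("workstation-gpu", tokens_per_second=400.0)
        print(pick_inference_budget(0.9, latency_budget_s=5.0, device=phone))
        print(pick_inference_budget(0.9, latency_budget_s=5.0, device=workstation))

The same hard request gets a much deeper reasoning budget on the workstation than on the phone, which is the essence of scaling inference to the task and the hardware rather than to a fixed configuration.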

Recent discussions have highlighted the importance of test-time scaling. NVIDIA CEO Jensen Huang, speaking at CES 2025, emphasized that during inference, AI systems can allocate resources dynamically, breaking down problems into multiple steps and evaluating various responses to enhance performance. This approach is proving to be incredibly effective.
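The “evaluating various responses” piece of that description is closely related to best-of-N sampling: generate several candidate answers, score each with a lightweight verifier, and keep the winner. A minimal Python sketch follows; the generate and score callables are placeholders standing in for a local model and a verifier, not any specific product API.

    import random
    from typing import Callable, List

    def best_of_n(generate: Callable[[str], str],
                  score: Callable[[str, str], float],
                  prompt: str,
                  n: int) -> str:
        """Sample n candidate answers and keep the one the verifier scores highest."""
        candidates: List[str] = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda answer: score(prompt, answer))

    if __name__ == "__main__":
        # Toy stand-ins so the sketch runs end to end.
        def toy_generate(prompt: str) -> str:
            return f"candidate-{random.randint(0, 9)}"

        def toy_score(prompt: str, answer: str) -> float:
            return float(answer.split("-")[1])  # pretend a higher digit is a better answer

        random.seed(0)
        print(best_of_n(toy_generate, toy_score, "What is 17 * 24?", n=4))

Spending more compute here simply means raising n or adding more reasoning steps per candidate, which is why the technique maps naturally onto whatever cycles the hardware can spare.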

The emergence of test-time scaling techniques has significant implications for AI infrastructure, especially concerning edge AI. Edge AI involves processing data and running AI models locally on devices or near the data source, rather than relying solely on centralized clouds. This approach offers several advantages:

  • Reduced latency: Processing data closer to its source minimizes the time required for data transmission, enabling faster decision-making.
  • Improved privacy: Local data processing reduces the need to transmit sensitive information to centralized servers, enhancing data privacy and security.
  • Bandwidth efficiency: By handling data locally, edge AI reduces the demand on network bandwidth, which is particularly beneficial in environments with limited connectivity. And, of course, reducing network utilization for data transport translates directly into lower costs.

Edge AI is also gaining traction across the industry. Qualcomm, for instance, has been discussing the topic for more than two years and sees edge AI as a significant growth area. In a recent earnings call, Qualcomm CEO Cristiano Amon highlighted the increasing demand for AI inference at the edge, viewing it as a “tailwind” for the company’s business.

“The era of AI inference”

Amon recently told CNBC’s Jon Fortt, “We started talking about AI on the edge, or on devices, before it was popular.” Digging in further on an earnings call, he said, “Our advanced connectivity, computing and edge AI technologies and product portfolio continue to be highly differentiated and increasingly relevant to a broad range of industries…We also remain very optimistic about the growing edge AI opportunity across our business, particularly as we see the next cycle of AI innovation and scale.” 

Amon dubbed that next cycle “the era of AI inference.” Beyond test-time inference scaling, this also aligns with the trend toward smaller models that can run locally on devices like handsets and PCs. “We expect that while training will continue in the cloud, inference will run increasingly on devices, making AI more accessible, customizable and efficient,” Amon said. “This will encourage the development of more targeted, purpose-oriented models and applications, which we anticipate will drive increased adoption, and in turn, demand for Qualcomm platforms across a range of devices.”

Bottom line, he said, “We’re well-positioned to drive this transition and benefit from this upcoming inflection point.” Expanding on that point in conversation with Fortt, Amon said the shifting AI landscape is “a great tailwind for business and kind of materializes what we’ve been preparing for, which is designing chips that can run those models at the edge.”

Intel, which is struggling to differentiate its AI value proposition and is turning its focus to “rack-scale” solutions for AI datacenters, also sees edge AI as an emerging opportunity. Co-CEO Michelle Johnston Holthaus addressed the topic during a quarterly earnings call. AI “is an attractive market for us over time, but I am not happy with where we are today,” she said. “On the one hand, we have a leading position as the host CPU for AI servers, and we continue to see a significant opportunity for CPU-based inference on-prem and at the edge as AI-infused applications proliferate. On the other hand, we’re not yet participating in the cloud-based AI datacenter market in a meaningful way.”

She continued: “AI is not a market in the traditional sense. It’s an enabling application that needs to span across the compute continuum from the datacenter to the edge. As such, a one-size-fits-all approach will not work, and I can see clear opportunities to leverage our core assets in new ways to drive the most compelling total cost of ownership across the continuum.” 

Holthaus’s, and Intel’s, view of edge AI inference as a growth area pre-dates her tenure as co-CEO. Former CEO Pat Gelsinger, speaking at a CES keynote in 2024, made the case in the context of AI PCs and laid out the three laws of edge computing. “First is the laws of economics,” he said at the time. “It’s cheaper to do it on your device…I’m not renting cloud servers…Second is the laws of physics. If I have to round-trip the data to the cloud and back, it’s not going to be as responsive as I can do locally…And third is the laws of the land. Am I going to take my data to the cloud or am I going to keep it on my local device?”

Post-Intel, Gelsinger has continued this focus with an investment in U.K.-based startup Fractile, which is developing AI hardware for in-memory inference rather than moving model weights from memory to a processor. Writing on LinkedIn about the investment, Gelsinger said, “Inference of frontier AI models is bottlenecked by hardware. Even before test-time compute scaling, cost and latency were huge challenges for large-scale LLM deployments. With the advent of reasoning models, which require memory-bound generation of thousands of output tokens, the limitations of existing hardware roadmaps [have] compounded. To achieve our aspirations for AI, we need radically faster, cheaper and much lower power inference.”

Verizon, in a move indicative of the larger opportunity for operators to leverage existing distributed assets in service of new revenue from AI enablement, recently launched the AI Connect product suite. The company described the offering as “designed to enable businesses to deploy…AI workloads at scale.” Verizon highlighted McKinsey estimates that by 2030, 60% to 70% of AI workloads will be “real-time inference…creating an urgent need for low-latency connectivity, compute and security at the edge beyond current demand.” Throughout its network, Verizon has fiber, compute, space, power and cooling that can support edge AI; Google Cloud and Meta are already using some of Verizon’s capacity, the company said.

Agentic AI at the edge

Looking further out, Qualcomm’s Durga Malladi, speaking during CES 2025, tied together agentic AI and edge AI. The idea is that on-device AI agents will access your apps on your behalf, connecting various dots in service of your request and delivering an outcome not tied to one particular application. In this paradigm, the user interface of a smartphone changes; as he put it, “AI is the new UI.”

He tracked computing from command line interfaces to graphical interfaces accessible with a mouse. “Today we live in an app-centric world…It’s a very tactile thing…The truth is that for the longest period of time, as humans, we’ve been learning the language of computers.” AI changes that: when the input mechanism is something as natural as your voice, the UI can become more customized and personal. “The front-end is dominated by an AI agent…that’s the transformation that we’re talking of from a UI perspective.”

He also discussed how a local AI agent will co-evolve with its user. “Over time there is a personal knowledge graph that evolves. It defines you as you, not as someone else.” Localized context, made possible by on-device AI, or edge AI more broadly, will improve agentic outcomes over time. “And that’s a space where I think, from the tech industry standpoint, we have a lot of work to do.” 

Dell Technologies is also looking at this intersection of agentic and edge AI. In an interview, Pierluca Chiodelli, vice president of engineering technology, edge portfolio product management and customer operations, underscored this fundamental shift happening in AI: businesses are moving away from a cloud-first mindset and embracing a hybrid AI model that connects a continuum across devices, the edge and the cloud.

As it relates to agentic AI systems running in edge environments, Chiodelli used computer vision in manufacturing as an example. Today, quality control AI models run inference at the edge—detecting defects, deviations, or inefficiencies in a production line. But if something unexpected happens, such as a subtle shift in materials or lighting conditions, the model can fail. The process to retrain the model takes forever, Chiodelli explained. You have to manually collect data, send it to the cloud or a data center for retraining, then redeploy an updated model back to the edge. That’s a slow, inefficient process.

With agentic AI, instead of relying on centralized retraining cycles, AI agents at the edge could autonomously detect when a model is failing, collaborate with other agents, and correct the issue in real time. “Agentic AI, it actually allows you to have a group of agents that work together to correct things,” he said.
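As a rough illustration of the monitoring half of that loop, the short Python sketch below watches a rolling window of prediction confidence and raises a flag once it sags, which is the point at which cooperating agents, or a local adaptation step, would take over. The window size, threshold and line identifier are invented for the example, and real deployments would use proper drift metrics.

    from collections import deque
    from statistics import mean
    from typing import Deque

    class DriftWatchdog:
        """Flags a probable model failure when recent prediction confidence sags."""

        def __init__(self, window: int = 50, threshold: float = 0.80) -> None:
            self.scores: Deque[float] = deque(maxlen=window)
            self.threshold = threshold

        def observe(self, confidence: float) -> bool:
            self.scores.append(confidence)
            window_full = len(self.scores) == self.scores.maxlen
            return window_full and mean(self.scores) < self.threshold

    def handle_degradation(line_id: str) -> None:
        # In the scenario Chiodelli describes, this is where cooperating agents
        # would quarantine the model's output, gather fresh samples and adapt
        # locally instead of waiting on a full cloud retraining cycle.
        print(f"[{line_id}] confidence degraded; requesting help from peer agents")

    if __name__ == "__main__":
        watchdog = DriftWatchdog(window=5, threshold=0.8)
        for confidence in (0.95, 0.9, 0.7, 0.65, 0.6, 0.55):  # a lighting shift begins
            if watchdog.observe(confidence):
                handle_degradation("line-7-camera-2")
                break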

For industries that rely on precision, efficiency, and real-time adaptability, such as manufacturing, healthcare, and energy, agentic AI could lead to huge gains in productivity and ROI. But, Chiodelli noted, the challenge lies in standardizing communication protocols between agents—without that, autonomous AI systems will remain fragmented. He predicted an inter-agent “standard communication kind of API will emerge at some point.” Today, “You can already do a lot if you are able to harness all this information and connect to the AI agents.” 

“It’s clear [that] more and more data is being generated at the edge,” he said. “And it’s also clear that moving that data is the most costly thing you can do.” Edge AI, Chiodelli said, is “the next wave of AI, allowing us to scale AI across millions of devices without centralizing data…Instead of transferring raw data, AI models at the edge can process it, extract insights, and send back only what’s necessary. That reduces costs, improves response times, and ensures compliance with data privacy regulations.”
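A toy Python example of that pattern, with a faked detection step and invented payload fields, shows the order-of-magnitude difference between shipping raw sensor data and shipping only the extracted insight.

    import json
    from typing import Dict, List

    def summarize_locally(frames: List[bytes]) -> Dict[str, object]:
        """Stand-in for on-device inference: reduce raw sensor data to an insight."""
        defects = sum(1 for frame in frames if frame[0] % 7 == 0)  # placeholder "model"
        return {"site": "plant-3", "frames_seen": len(frames), "defects": defects}

    if __name__ == "__main__":
        # Roughly 60 MB of synthetic "video" standing in for a shift's worth of frames.
        raw_frames = [bytes([i % 256]) * 200_000 for i in range(300)]
        insight = json.dumps(summarize_locally(raw_frames)).encode()

        raw_bytes = sum(len(frame) for frame in raw_frames)
        print(f"raw data: {raw_bytes / 1e6:.0f} MB; payload sent upstream: {len(insight)} bytes")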

A true AI ecosystem, he argues, echoing the idea of an AI infrastructure continuum that reaches from the cloud out to the edge, requires:

  • Seamless integration between devices, edge AI infrastructure, datacenters and the cloud.
  • Interoperability between AI agents, models, and enterprise applications.
  • AI infrastructure that minimizes costs, optimizes performance and scales efficiently.

Final thoughts from Chiodelli writing in Forbes: “We should expect the adoption of hybrid edge-cloud inferencing to continue its upward trajectory, driven by the need for efficient, scalable data processing and data mobility across cloud, edge, and data centers. The flexibility, scalability, and insights generated can reduce costs, enhance operational efficiency, and improve responsiveness. IT and OT teams will need to navigate the challenges of seamless interaction between cloud, edge, and core environments, striking a balance between factors such as latency, application and data management, and security.”

Tearing down the memory wall

Further to the idea of test-time inference scaling converging with edge AI, where people actually experience AI, an important focus area is memory and storage. To put it reductively, modern leading AI chips can process data faster than memory systems can deliver it, limiting inference performance. Chris Moore, vice president of marketing for Micron’s mobile business unit, called it the “memory wall.” More generally, Moore was bullish on the idea of AI as the new UI and on personal AI agents delivering useful benefits to our personal devices, but he was also practical and realistic about the technical challenges that need to be addressed.

“Two years ago…everybody would say, ‘Why do we need AI at the edge?’” he recalled in an interview. “I’m really happy that that’s not even a question anymore. AI will be, in the future, how you are interfacing with your phone at a very natural level and, moreover, it’s going to be proactive.”

Image courtesy of Micron Technology.

He pulled up a chart comparing giga floating point operations per second (GFLOPS), a metric used to evaluate GPU performance, with memory speed in GB/second, specifically for low-power double data rate (LPDDR) dynamic random access memory (DRAM). GPU performance and memory speed followed relatively similar trajectories from 2014 until around 2021. Then, with LPDDR5 and LPDDR5X, which deliver per-pin data rates of 6400 Mbps and 8533 Mbps respectively, in power envelopes and at price points appropriate for flagship smartphones with AI features, we hit the memory wall.
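A back-of-envelope calculation shows why that gap matters for on-device generation, which is typically bound by memory bandwidth rather than compute: assuming a 64-bit LPDDR5X interface at the 8533 Mbps per-pin rate above, and a hypothetical 3-billion-parameter model quantized to 4-bit weights, bandwidth alone caps output at a few dozen tokens per second. Both the bus width and the model are assumptions chosen for the arithmetic, and real systems have caches, competing memory traffic and other overheads.

    # Back-of-envelope only; bus width, model size and quantization are assumptions.
    PIN_RATE_MBPS = 8533        # LPDDR5X per-pin data rate cited above
    BUS_WIDTH_BITS = 64         # assumed flagship-phone memory interface
    PARAMS = 3e9                # assumed on-device model size (3B parameters)
    BYTES_PER_PARAM = 0.5       # assumed 4-bit quantized weights

    bandwidth_gb_s = PIN_RATE_MBPS * BUS_WIDTH_BITS / 8 / 1000   # ~68 GB/s
    weights_gb = PARAMS * BYTES_PER_PARAM / 1e9                  # ~1.5 GB

    # Memory-bound decoding streams roughly the full weight set once per token,
    # so bandwidth, not FLOPS, sets the ceiling on token rate.
    print(f"bandwidth ~{bandwidth_gb_s:.0f} GB/s, weights ~{weights_gb:.1f} GB")
    print(f"upper bound ~{bandwidth_gb_s / weights_gb:.0f} tokens/s")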

“There is a huge level of innovation required in memory and storage in mobile,” Moore said. “If we could get a Terabyte per second, the industry would just eat it up.” But, he rightly pointed out that the monetization strategies for cloud AI and edge AI are very different. “We’ve got to kind of keep these phones at the right price. We have a really great innovation problem that we’re going after.” 

To that point, and in keeping with the historical pattern that what’s old is often new again, Moore pointed to processing-in-memory (PIM). The idea is to tear down the memory wall by enabling parallel data processing within the memory itself. That is easy to say, but it requires fundamental changes to memory chips at the architectural level and, for a complete edge device, a range of hardware and software adaptations. In theory, it would allow test-time inference of smaller AI models to occur in memory, freeing up GPU (or NPU) cycles for other AI workloads and reducing the power spent moving data from memory to a processor.
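A rough, idealized comparison makes the appeal concrete: count the bytes that cross the memory bus for a single matrix-vector product when weights must be streamed to the processor versus when they stay put and only activations move. The layer size and precisions below are assumptions, and a real PIM design faces many more constraints than this sketch suggests.

    # Illustrative only; layer dimensions and precisions are assumed values.
    ROWS, COLS = 4096, 4096         # one weight matrix of a hypothetical model layer
    BYTES_PER_WEIGHT = 1            # assumed 8-bit weights
    BYTES_PER_ACTIVATION = 2        # assumed 16-bit activations

    # Conventional path: the full weight matrix streams from DRAM to the processor.
    conventional_bytes = ROWS * COLS * BYTES_PER_WEIGHT + (COLS + ROWS) * BYTES_PER_ACTIVATION

    # Idealized in-memory path: weights stay in DRAM; only inputs and results move.
    pim_bytes = (COLS + ROWS) * BYTES_PER_ACTIVATION

    print(f"conventional: {conventional_bytes / 1e6:.1f} MB over the bus")
    print(f"in-memory:    {pim_bytes / 1e3:.1f} KB over the bus")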

Comparing the dynamics of cloud-based and edge AI workload processing, Moore pointed out that it’s not just a matter of throwing money at the problem; rather, and again, it’s a matter of fundamental innovation. “There’s a lot of research happening in the memory space,” he said. “We think we have the right solution for future edge devices by using PIM.”

In summary

Incorporating test-time inference scaling strategies within edge AI frameworks allows for dynamic resource allocation during inference: AI models can adjust their computational requirements based on the complexity of the task at hand, leading to more efficient and effective performance. The AI infrastructure landscape is entering a new era of distributed intelligence in which test-time inference scaling, edge AI and agentic AI will define who leads and who lags. Cloud-centric AI execution is no longer the default; those who embrace hybrid AI architectures, optimize inference efficiency, and scale AI deployment across devices, edge and cloud will shape the future of AI infrastructure.

ABOUT AUTHOR

Sean Kinney, Editor in Chief
Sean focuses on multiple subject areas including 5G, Open RAN, hybrid cloud, edge computing, and Industry 4.0. He also hosts Arden Media's podcast Will 5G Change the World? Prior to his work at RCR, Sean studied journalism and literature at the University of Mississippi then spent six years based in Key West, Florida, working as a reporter for the Miami Herald Media Company. He currently lives in Fayetteville, Arkansas.