Industry discourse around test-time inference scaling converging with edge AI, where most people actually experience AI, continues to ramp up, and an important focus area is memory and storage. To put it reductively, modern leading AI chips can process data faster than memory systems can deliver it, which limits inference performance. Chris Moore, vice president of marketing for Micron’s mobile business unit, called it the “memory wall.” To start more generally, though, Moore was bullish on the idea of AI as the new UI and on personal AI agents delivering useful benefits on our personal devices, while remaining practical and realistic about the technical challenges that need to be addressed.
“Two years ago…everybody would say, ‘Why do we need AI at the edge?’” he recalled in an interview with RCR Wireless News. “I’m really happy that that’s not even a question anymore. AI will be, in the future, how you are interfacing with your phone at a very natural level and, moreover, it’s going to be proactive.”
He pulled up a chart comparing giga floating-point operations per second (GFLOPS), a metric used to evaluate GPU performance, with memory bandwidth in GB/second, specifically for Low-Power Double Data Rate (LPDDR) Dynamic Random Access Memory (DRAM). The trend was that GPU performance and memory speed followed relatively similar trajectories from 2014 until around 2021. Then, with LPDDR5 and LPDDR5X, which deliver data rates of 6400 Mbps and 8533 Mbps respectively, in power envelopes and at price points appropriate for flagship smartphones with AI features, we hit the memory wall.
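For a back-of-the-envelope feel for that wall, the sketch below compares an assumed mobile NPU’s compute throughput with LPDDR5X bandwidth and derives a rough ceiling on on-device LLM decode speed. The NPU rating, bus width, model size and quantization are illustrative assumptions, not figures from Micron or from the chart Moore showed.

```python
# Back-of-the-envelope "memory wall" arithmetic. Every figure below is an
# illustrative assumption, not a vendor specification.

npu_ops_per_s = 40e12                     # assume a ~40 TOPS (int8) mobile NPU
pin_rate_bps = 8533e6                     # LPDDR5X per-pin data rate: 8533 Mbps
bus_width_bits = 64                       # assume a 64-bit LPDDR bus
mem_bytes_per_s = pin_rate_bps * bus_width_bits / 8   # ~68 GB/s

# Ops per byte of DRAM traffic needed to keep the NPU fully busy...
break_even_intensity = npu_ops_per_s / mem_bytes_per_s   # ~590 ops/byte
# ...versus the ~2 ops per weight byte of a matrix-vector multiply, the
# dominant operation when an LLM generates tokens one at a time.
gemv_intensity = 2.0

# Decode speed is therefore capped by how fast weights stream out of DRAM.
params = 3e9                              # assume a 3B-parameter on-device model
bytes_per_param = 0.5                     # assume 4-bit quantized weights
model_bytes = params * bytes_per_param    # ~1.5 GB read per generated token

print(f"break-even intensity: ~{break_even_intensity:.0f} ops/byte "
      f"(a GEMV delivers ~{gemv_intensity:.0f})")
print(f"bandwidth-limited ceiling: ~{mem_bytes_per_s / model_bytes:.0f} tokens/s")
```

On those same assumptions, the terabyte per second Moore mentions below would lift that ceiling roughly fifteenfold.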
“There is a huge level of innovation required in memory and storage in mobile,” Moore said. “If we could get a terabyte per second, the industry would just eat it up.” But he rightly pointed out that the monetization strategies for cloud AI and edge AI are very different. “We’ve got to kind of keep these phones at the right price. We have a really great innovation problem that we’re going after.”
To that point, and in keeping with the historical paradigm that what’s old is often new again, Moore pointed to Processing-in-Memory (PIM). The idea is to tear down the memory wall by enabling parallel data processing within the memory itself. That is easy to say, but it requires fundamental changes to memory chips at the architectural level and, for a complete edge device, a range of hardware and software adaptations. In theory, this would allow test-time inference of smaller AI models to occur in memory, freeing up GPU (or NPU) cycles for other AI workloads and reducing the power spent moving data from memory to a processor.
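As a concrete illustration of the data-movement saving, the sketch below counts the bytes that cross the DRAM interface for a single matrix-vector multiply with and without memory-side compute. The matrix dimensions and precisions are assumptions chosen for illustration and do not describe any particular PIM implementation, Micron’s or otherwise.

```python
# Hypothetical data-movement comparison for one matrix-vector multiply, with
# and without memory-side compute. Dimensions and precisions are assumptions.

rows, cols = 4096, 4096            # weight matrix shape (assumed)
weight_bytes = 1                   # int8 weights (assumed)
act_bytes = 2                      # fp16 activations (assumed)

# Conventional path: the full weight matrix plus the input vector cross the
# DRAM interface into the SoC, and the output vector crosses back.
conventional = rows * cols * weight_bytes + cols * act_bytes + rows * act_bytes

# PIM path: weights stay inside the DRAM arrays; only the input vector goes
# in and the output vector comes back out.
pim = cols * act_bytes + rows * act_bytes

print(f"conventional transfer: {conventional / 1e6:.1f} MB")
print(f"PIM transfer:          {pim / 1e3:.1f} KB")
print(f"data moved:            ~{conventional / pim:.0f}x less with PIM")
```

The power argument follows directly: fetching a byte across the DRAM interface generally costs far more energy than the arithmetic performed on it, so keeping the weights inside the memory arrays is where most of the saving would come from.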
Comparing the dynamics of cloud-based AI workload processing with those of edge AI, Moore pointed out that it’s not just a matter of throwing money at the problem; rather, and again, it’s a matter of fundamental innovation. “There’s a lot of research happening in the memory space,” he said. “We think we have the right solution for future edge devices by using PIM.”