
Four areas of focus for assuring data center networking

What new demands does AI make on back-end data center networking?

Beyond the physical testing of AI infrastructure, there are also broader questions of assurance that are still in the process of being answered. Testing the basic working functionality of physical infrastructure is one thing; it is an entirely different, and still emerging, challenge to figure out, architect and maintain the optimal and minimum acceptance conditions for data center networking and compute fabrics that support AI workloads of varying types, across multiple data center locations and/or GPU clusters, and to make sure that these expensive resources are utilized in the most efficient ways possible.

Sagie Fanish, senior director of AI infrastructure for DriveNets, said that the demands that AI places on infrastructure and data center networking are “very clear” and outlined a number of them in a recent RCR Wireless News webinar. They include:

Predictable high performance. Specifically, data center networking needs to be able to sustain the required bandwidth and latency from each endpoint to each endpoint on the network, Fanish said—which has not been typical for traditional data center compute environments. Those environments generally have had heterogeneous traffic patterns, many different types of traffic flows, and “high entropy,” he explained, adding that this is “the exact opposite of what AI workloads impose on the network.”


AI, he said, wants consistency—uninterrupted compute consumption and connection among the GPU clusters that are providing it. “You want to guarantee that your SLA is high, your system is actually functioning, your GPUs are functioning correctly and basically, [AI] just wants the infrastructure to be predictable, as lossless as possible and having the ability to sustain any type of dynamic traffic pattern,” he said, as well as providing a low “tail latency”—an important metric in back-end AI networking, because it focuses on the last packet of information to arrive from networked GPU clusters executing parallel computing tasks.
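Why tail latency dominates in this setting can be sketched numerically: when GPUs execute a parallel step collectively, the step finishes only when the slowest flow arrives, so job latency tracks the tail of the per-flow distribution rather than the median. The simulation below is a minimal illustration with made-up latency numbers (the flow count, step count and straggler model are all assumptions, not DriveNets figures):

```python
import random

random.seed(42)

def job_completion_latency(per_flow_latencies):
    """A collective step finishes only when the slowest flow arrives,
    so the step's latency is the maximum over all parallel flows."""
    return max(per_flow_latencies)

def sample_flow_latency_ms():
    # Hypothetical per-flow latency: ~1 ms baseline plus an exponential
    # jitter, with a 1% chance of a 5 ms straggler event.
    return 1.0 + random.expovariate(10.0) + (5.0 if random.random() < 0.01 else 0.0)

num_flows = 512   # parallel flows per collective step (illustrative)
num_steps = 1000

flow_samples = []
step_latencies = []
for _ in range(num_steps):
    flows = [sample_flow_latency_ms() for _ in range(num_flows)]
    flow_samples.extend(flows)
    step_latencies.append(job_completion_latency(flows))

def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

median_flow = median(flow_samples)
median_step = median(step_latencies)
print(f"median per-flow latency: {median_flow:.2f} ms")
print(f"median step (job) latency: {median_step:.2f} ms")
```

Even though the typical flow is fast, the step latency sits far out in the flow distribution's tail, which is why back-end AI networking is assessed on tail latency rather than average latency.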

Maintaining that performance over interconnected locations. Power and cooling capabilities are often the current limiting factors on placing or adding GPUs in any given data center location. So workloads often have to be distributed across clusters at different sites, necessitating precise and low-latency data center networking between them.

Interconnecting remote geo-located data centers is particularly relevant for telcos, Fanish explained, because they have plenty of real estate, but not necessarily the necessary power to fill up any given location with GPUs. And the overall consistent performance of the data fabric needs to be maintained via the data center networking between these distributed clusters—which is DriveNets’ specialty. (The company, which handles the majority of AT&T’s core network traffic, in essence uses cloud-native software to turn distributed physical network infrastructure into a shared resource or unified fabric, with guarantees on characteristics such as zero packet loss and bi-directional bandwidth, and a distributed network operating system for orchestration.)

High data fabric utilization. It’s “very hard to control” how workloads are deployed on GPUs, Fanish added, particularly if a telco wants to provide bare-metal or GPU-as-a-service—which some, such as Singtel and SK Telecom, are already exploring. According to IEEE, studies as recent as last year have shown that more than half of GPUs are not in use at any given time, which has given rise to “serverless” GPUaaS startups that broker access to available GPU assets.

“You want to use and leverage and utilize your infrastructure as much as possible,” Fanish said. “You don’t want to keep GPU idling if there’s a demand. You want to deploy the workload on any GPU, on any location of the fabric, with resiliency and failure recovery. … AI workloads running on top of the AI stack have a lot of timeouts. They don’t handle packet loss very well.” If data fabrics don’t operate efficiently, or have too much packet loss, the inferior performance will affect the model training or results.
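The reason even a small loss rate matters at this scale is combinatorial: a collective step stalls if any one of its parallel flows drops a packet, so the chance of a step being affected compounds with flow count. A short sketch, assuming independent losses and an illustrative per-flow loss probability:

```python
def prob_step_hits_loss(per_flow_loss_prob, num_flows):
    """Probability that at least one of the parallel flows in a
    collective step sees a dropped packet, assuming independent losses:
    1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_flow_loss_prob) ** num_flows

p = 1e-4  # per-flow loss probability per step (illustrative, not measured)
for n in (8, 64, 512, 4096):
    print(f"{n:5d} flows -> P(step affected) = {prob_step_hits_loss(p, n):.4f}")
```

At 4,096 flows, a one-in-ten-thousand per-flow loss rate affects roughly a third of all steps, which is why AI fabrics are engineered to be as close to lossless as possible.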

Tenant isolation is another major demand put on back-end data center networking, Fanish said, especially as many tenants are being deployed on the same infrastructure. And it’s more complicated than just handling data workflows from two different tenants, he added. “Tenant isolation is not just Tenant A and Tenant B, it’s also, [a] collective of the same workload that may interfere with a different collective of the same workload,” he explained. “We also call those ‘tenants’, even though it belongs to the same user … but those collective traffic patterns can interfere with one another.”

Implications for the wider telecom network

Stephen Douglas of Spirent Communications summarized AI infrastructure as needing to be ultra-fast, scalable, low-latency and able to handle massive, unpredictable workloads—and for infrastructure owners or managers, he adds power efficiency to the list as well. He also noted that while those demands on infrastructure and data center networking are currently focused within and across data center locations, those demands will increasingly extend to the broader telecom network.


Hyperscalers like Google (with its own Google Fiber service) and Microsoft (which purchased a hollow-core fiber, or HCF, provider three years ago and is using that ultra-fast physical layer to network its own data center infrastructure) can control their own connectivity end to end. Telecom networks, by contrast, are fragmented, Douglas points out: across wireless networks, metro-area aggregation and backhaul providers, and long-haul optical transport links, it’s highly unlikely that a single provider would have end-to-end control.

“The overlay of that, how that actually gets managed holistically, is probably one of the big topic areas that the industry needs to look at at the moment,” Douglas added—because that fragmentation and lack of control may very well limit AI capabilities, and/or network operators’ ability to monetize AI through services like GPUaaS.

For more insights on testing and assuring AI infrastructure, check out this RCR Wireless News webinar on-demand.

ABOUT AUTHOR

Kelly Hill
Kelly reports on network test and measurement, as well as the use of big data and analytics. She first covered the wireless industry for RCR Wireless News in 2005, focusing on carriers and mobile virtual network operators, then took a few years’ hiatus and returned to RCR Wireless News to write about heterogeneous networks and network infrastructure. Kelly is an Ohio native with a master’s degree in journalism from the University of California, Berkeley, where she focused on science writing and multimedia. She has written for the San Francisco Chronicle, The Oregonian and The Canton Repository. Follow her on Twitter: @khillrcr