Telcos are adopting chaos engineering to test resiliency, reduce downtime and ensure systems can withstand real-world failures — before they happen.
As telecom networks adopt increasingly complex cloud-native architectures, traditional notions of reliability are being upended. In place of static, hardware-centric infrastructure, operators are building dynamic environments composed of microservices, containers and distributed orchestration. But with that agility comes fragility, and a growing realization that building truly resilient networks may require a radical approach: intentional failure.
As services like 5G Standalone (SA), IoT and edge computing become more latency-sensitive, even a few seconds of downtime can cascade into major service disruptions or missed business opportunities.
From Netflix to network cores — The rise of chaos engineering
Chaos engineering — the practice of deliberately introducing failure into live systems to test their resilience — was made famous by Netflix’s “Chaos Monkey.” While the concept has long been embraced by hyperscalers and cloud-native companies, telecom operators have historically taken a more risk-averse approach. But that’s changing.
“If we know we’re going to have issues in these complex 5G cores, let’s do chaos engineering,” said Bill Clark, principal product manager at Spirent Communications. “You’ve got to break things to test different scenarios… because how else are you going to know if a vendor’s AMF [Access and Mobility Management Function] won’t just completely roll over?”
A cloud-native problem demands a cloud-native solution
The shift to 5G and beyond has brought significant flexibility to the telecom space — but it’s also introduced new layers of complexity. Cloud-native architectures rely on distributed services, containers and automation that must operate seamlessly across hybrid and multi-cloud environments. When something fails — and something always will — self-healing mechanisms need to respond instantly and without human intervention.
Clark points to examples like Kubernetes-based 5G Core functions, where a single network function might be distributed across hundreds of pods. If a single component fails, the system needs to recover immediately and automatically. “You don’t want the whole system to go down — you want to isolate and recover,” he said. “But you can’t assume that will work. You have to test it.”
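To make the "isolate and recover" idea concrete, here is a minimal sketch of a pod-kill experiment written as a pure Python simulation. It is not how Spirent or any operator implements this, and it does not call a real Kubernetes API. The class `ToyNetworkFunction`, the function `run_pod_kill_experiment`, the replica count and the `min_healthy` threshold are all assumptions made for illustration.

```python
import random


class ToyNetworkFunction:
    """Hypothetical stand-in for a network function spread across many pods."""

    def __init__(self, replicas):
        self.desired = replicas
        # Map of pod name -> whether the pod is currently healthy.
        self.pods = {f"pod-{i}": True for i in range(replicas)}

    def healthy_count(self):
        return sum(self.pods.values())

    def kill(self, name):
        """Inject a fault by marking one pod as failed."""
        self.pods[name] = False

    def reconcile(self):
        """Toy self-healing step: bring every failed pod back to healthy."""
        for name, healthy in self.pods.items():
            if not healthy:
                self.pods[name] = True


def run_pod_kill_experiment(nf, victims, min_healthy):
    """Kill the chosen pods, measure the blast radius, then check recovery."""
    for name in victims:
        nf.kill(name)
    during = nf.healthy_count()   # capacity left while the fault is active
    nf.reconcile()
    after = nf.healthy_count()    # capacity after self-healing has run
    return {
        "healthy_during_fault": during,
        "healthy_after_recovery": after,
        "isolated": during >= min_healthy,
        "recovered": after == nf.desired,
    }


rng = random.Random(7)                      # fixed seed for a repeatable run
nf = ToyNetworkFunction(replicas=100)       # assumed replica count
victims = rng.sample(sorted(nf.pods), k=3)  # pick three pods to fail
result = run_pod_kill_experiment(nf, victims, min_healthy=95)
print(result)
# {'healthy_during_fault': 97, 'healthy_after_recovery': 100, 'isolated': True, 'recovered': True}
```

The point of the sketch is the shape of the check: the experiment asserts both that the failure stayed contained and that the system returned to its desired state, rather than assuming either one. In a real cluster, the reconcile step would be performed by the orchestrator, not by the test.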
Telco mindset shift — From uptime to fault tolerance
Despite its value, chaos engineering is still a tough sell for some operators. “We went to a Tier 1 a few years ago and said, ‘We’ve got a great idea — we want you to break things.’ They said no way, we’re not doing that in production,” Clark recalled.
But with continuous integration/continuous delivery (CI/CD) pipelines becoming foundational in telecom, chaos testing is increasingly viewed not as a risk but as a requirement. It enables operators to simulate real-world failures, validate recovery processes and uncover weak points before they affect live users.
At its core, Clark said, chaos engineering in telecom is about resiliency testing: “Really, it’s resiliency testing — testing your CNS [cloud-native stack]. The solution we build is CNS resiliency. It’s a different mindset. It’s getting a lot of traction. Spirent could coin chaos engineering for 5G, but some operators love this terminology — and others, not so much.”
Toward continuous confidence
As telecom operators evolve their DevOps practices and embed automation deeper into the network lifecycle, chaos engineering offers a structured, proactive approach to testing resiliency. When integrated into CI/CD pipelines, it supports:
- Validation of failover and self-healing mechanisms
- Faster incident response and root cause analysis
- More predictable service performance during unexpected events
- Improved customer experience by reducing downtime
The result is a new kind of reliability — one built not on avoiding disruption, but on preparing for it.
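One way to picture chaos tests inside a CI/CD pipeline is as a gate that blocks a release when recovery is too slow or does not happen at all. The sketch below is an assumption-laden illustration, not a description of any vendor's tooling: `ExperimentResult`, `gate`, the experiment names, the recovery times and the five-second budget are all hypothetical sample values, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Hypothetical record produced by one chaos experiment."""
    name: str
    recovered: bool
    recovery_seconds: float


def gate(results, max_recovery_seconds):
    """Return the names of experiments that should fail the pipeline."""
    failures = []
    for r in results:
        # Fail if the system never recovered or recovered too slowly.
        if not r.recovered or r.recovery_seconds > max_recovery_seconds:
            failures.append(r.name)
    return failures


# Illustrative sample values only; a real pipeline would collect these
# from the experiments it actually ran.
results = [
    ExperimentResult("kill-one-pod", True, 1.8),
    ExperimentResult("drop-node", True, 7.5),
    ExperimentResult("partition-network", False, 30.0),
]

failed = gate(results, max_recovery_seconds=5.0)
exit_code = 1 if failed else 0  # a CI job would exit with this status
print(failed)     # ['drop-node', 'partition-network']
print(exit_code)  # 1
```

Treating the recovery budget as an explicit, versioned threshold means a regression in failover behavior surfaces in the pipeline the same way a failing unit test would.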
What’s next?
While hyperscalers are comfortable testing in production, many telcos remain cautious. Yet as disaggregated, software-driven infrastructure becomes the norm, the tolerance for unexpected downtime will only shrink. Chaos engineering offers a clear path forward — one that ensures not only that networks are always on, but also that they’re always ready.