The vital role of observability for communications service providers (Reader Forum)

Information technology continues to play an increasingly critical role in the modern enterprise, with substantial and ongoing investment in observability platforms and practices to help manage the underlying scale and complexity.

A significant improvement over legacy monitoring tools, these platforms primarily cater to software engineers — providing insight into the performance and reliability of their stacks, from the application down through compute.

Unfortunately, in a trend seen over and over, the network was largely ignored by the creators of these platforms, leaving operators to rely on dozens of disparate legacy monitoring tools to discern the subtle causes of network performance and reliability issues. However, a new category of observability platforms is set to revolutionize operations with comprehensive visibility into networks, infrastructure and applications.

What is observability?

Observability has a long, storied history stemming from its origins within control theory.

A system is said to be observable if the current state of the system can be determined using system outputs. Historically, this meant collecting telemetry through sensors — but the general principles apply to just about any system.
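For linear systems, control theory turns this definition into a concrete test. A system x[k+1] = A·x[k] with outputs y[k] = C·x[k] is observable exactly when the stacked matrix [C; CA; ...; CA^(n-1)] has rank n, where n is the number of state variables. The sketch below is a minimal illustration of that rank test in plain Python; the matrices are made-up toy values, not a model of any real network.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Matrix rank via Gaussian elimination, using exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    found = 0  # number of pivot rows found so far
    for col in range(n_cols):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found


def is_observable(a, c):
    """Kalman rank test: stack C, CA, ..., CA^(n-1) and check for full rank."""
    n = len(a)
    stacked, current = [], c
    for _ in range(n):
        stacked.extend(current)
        current = mat_mul(current, a)
    return rank(stacked) == n


# Toy two-state system (illustrative values only).
A = [[1, 1],
     [0, 1]]
SENSOR_ON_FIRST_STATE = [[1, 0]]   # output reveals state 1, which depends on state 2
SENSOR_ON_SECOND_STATE = [[0, 1]]  # output never carries information about state 1

print(is_observable(A, SENSOR_ON_FIRST_STATE))
print(is_observable(A, SENSOR_ON_SECOND_STATE))
```

The point of the toy example is that the same system can be observable or not depending on which outputs are collected, which is the intuition the rest of this article applies to networks.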

In software systems, these system outputs include telemetry, such as logs, metrics and events. For the network domain, we frequently have two more data sources: configuration and inventory — which help to contextualize the related telemetry and pinpoint the cause of an issue.

The most recent generation of these platforms goes further, automating key aspects of the troubleshooting process and pointing engineers toward the likely source of an outage or issue. These platforms, which often go to market under the banner of AIOps, can even answer plain language questions from operators like: “Is there something wrong with the network?” “What is wrong with the network?” And ultimately, “What do I need to do to fix it?”

Leveraging this functionality, engineers can identify and address issues more quickly and effectively, reduce downtime and minimize the impact of network issues on their customers.

Navigating a changing landscape

The appearance of these platforms couldn’t have come soon enough. The streaming wars were already heating up when COVID struck and forced so many businesses to facilitate remote and distributed work on a massive scale almost overnight. Many, if not most, of these changes to how we live and work are permanent.

These shifts have thrust networking and networks back into the limelight. Over the last couple of decades, networks have become fundamental to more and more of our lives and our jobs. However, there was a point when availability got so good that we started to forget about the network.

What’s changed is that the focus has moved from service availability to performance and service quality. While network monitoring using a disparate collection of point solutions and tools may have been good enough to “keep the lights on,” we now require deeper, more comprehensive network observability to assure the quality of experience our customers and users demand.

Even so, service availability remains an issue for even the largest CSPs. For example, consider the 2022 Rogers Communications outage that left more than 25% of Canadians (about 12 million people) without internet or wireless services for an entire day. It appears that a key cause of the outage's length was a lack of observability. As a result, no one could answer why the outage had occurred, let alone how to resolve it. The extended outage is estimated to have cost Rogers somewhere between $28M and $70M in customer rebates alone, and the toll on the Canadian economy exceeded $142M. The incident also resulted in the ouster of company CTO Jorge Fernandes.

Raising stakes

Most of us won’t have the misfortune of dealing with an outage of this scale or severity. But ensuring uptime is, in many ways, an easier problem to solve than providing consistent high performance across the board.

Performance is now the key metric by which subscribers judge their CSP, even if they don’t know how to quantify it themselves. For example, we all notice when our favorite show won’t stream in 4K or someone’s video call gets laggy or choppy.

Business customers take this even further than individuals. Packet loss, latency and jitter all matter more than ever in a world where many enterprises interact with their customers through applications written with the assumption of always-on, high-performance network connections.
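To make these three metrics concrete, here is a minimal sketch of how they might be derived from a series of probe results. The function name, the input format (round-trip times in milliseconds, with None for a lost probe) and the sample values are all assumptions for illustration. Jitter here is simply the mean absolute change between consecutive received samples, which is a simplification of the smoothed estimators used in practice.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results into packet loss, latency and jitter.

    rtts_ms: list of round-trip times in milliseconds, None for a lost probe.
    (Hypothetical input format chosen for this sketch.)
    """
    if not rtts_ms:
        raise ValueError("need at least one probe result")

    received = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that never came back.
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)

    # Latency: average round-trip time of the probes that did come back.
    latency_ms = mean(received) if received else None

    # Jitter (simplified): mean absolute change between consecutive
    # received samples; lost probes are skipped, not interpolated.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0

    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}


# Illustrative samples only: five probes, one of them lost.
summary = summarize_probes([20, 22, None, 21, 25])
print(summary)
```

A link can look healthy on average latency alone while loss or jitter is already degrading a video call, which is why the three are usually tracked together.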

The problem with legacy monitoring

Over the years, many point solutions and tools have surfaced to try to fill this need. You likely have several of them installed in your network today. These might be traditional network monitoring tools (often based on polling SNMP data from network devices), log collection and analysis tools (based primarily on syslog messages from various devices and applications), packet capture or other flow-based tools (which collect and analyze network traffic) and even synthetic monitoring tools (which generate and analyze simulated user/application traffic).

Unfortunately, each of these tools is independent and incomplete. While they may mitigate the need to manually hop from router to router and switch to switch to understand the current state of the network, they shift that burden to jumping between the various tools. Correlation and problem identification are still mainly left to the user.

Observability benefits for CSPs

Modern observability platforms help users efficiently identify the root cause of network issues by creating a correlated narrative across network infrastructure and applications.

These platforms federate data from all available sources, formatting, normalizing and automatically labeling incoming information. This, in turn, allows the correlation of data and metadata across all your infrastructure, from the network all the way through the application stacks, so the platform can rank potential alerts and surface only the most important ones to operators.
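The normalize, label, correlate and rank steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular platform works: the record fields (host, level, msg, agent, if_name, if_oper_status), the inventory contents and the severity weights are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: maps a management IP to device context.
INVENTORY = {
    "10.0.0.1": {"device": "edge-rtr-1", "site": "site-a"},
    "10.0.0.2": {"device": "core-sw-1", "site": "site-b"},
}

# Hypothetical weights used to rank devices by how much is going wrong.
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}


def normalize(source, raw):
    """Map a source-specific record onto one common schema."""
    if source == "syslog":
        return {"ip": raw["host"], "severity": raw["level"].lower(), "message": raw["msg"]}
    if source == "snmp":
        severity = "critical" if raw["if_oper_status"] == "down" else "info"
        message = f"interface {raw['if_name']} is {raw['if_oper_status']}"
        return {"ip": raw["agent"], "severity": severity, "message": message}
    raise ValueError(f"unknown source: {source}")


def label(event):
    """Attach inventory context so events from different tools can be joined."""
    context = INVENTORY.get(event["ip"], {"device": "unknown", "site": "unknown"})
    return {**event, **context}


def rank_devices(records):
    """Correlate events by device and rank devices by total severity."""
    by_device = defaultdict(list)
    for source, raw in records:
        event = label(normalize(source, raw))
        by_device[event["device"]].append(event)
    scores = [
        (device, sum(SEVERITY_WEIGHT.get(e["severity"], 0) for e in events))
        for device, events in by_device.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Illustrative records from two different tools about the same two devices.
records = [
    ("syslog", {"host": "10.0.0.1", "level": "WARNING", "msg": "BGP neighbor flap"}),
    ("snmp", {"agent": "10.0.0.1", "if_name": "ge-0/0/1", "if_oper_status": "down"}),
    ("syslog", {"host": "10.0.0.2", "level": "INFO", "msg": "configuration saved"}),
]
ranking = rank_devices(records)
print(ranking)
```

In this sketch the syslog warning and the SNMP interface-down status land on the same device only because both were labeled from the shared inventory, which is the correlation work that otherwise falls to the operator jumping between tools.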

After years of false promises, such vertically integrated platforms finally deliver the single pane of glass we’ve all been waiting for and pave the way for revolutionizing observability.
