Botched software, crappy computers, hybrid clouds – about the global IT outage

Some quick thoughts on the big IT outage today, which grounded planes, trains, banks, hospitals, shops, telcos, and broadcasters around the world. Reports on the radio this morning – when the story was breaking, as I drove the kids to school – led on a single notice by Microsoft that it was investigating the mess. They concluded, somewhat nebulously, that it was to do with interlinked global cloud systems, and professed alarm and dismay that the world is managed in this way, in far-off black-box data centres, where invisible errors go viral in global systems.

It felt like a prompt to write about the importance of hybrid and edge cloud computing and networking systems, and ring-fencing mission- and business-critical industries from the wild west of the open internet. But then, fairly quickly, even as the car pulled up at the office, the story changed. It was not about the cloud at all; and not about a cyber attack on the pathways between. The name of US-based cybersecurity firm CrowdStrike was being bandied about; its chief executive had made a statement about a botched software upgrade on Windows-based computers and servers.

There wasn’t an apology, yet, retorted an irate BBC tech journalist visiting a disrupted hospital. But the story had landed, and it was even starker than expected. “The world runs on millions of crappy Windows computers,” said Francis Haysom, partner and principal analyst at Appledore Research, in an email exchange. This global IT balls-up was down to a botched upgrade of a piece of antivirus software, impacting the Windows system specifically; and it seems, according to later reports, that it will only be patched up manually, going almost computer-by-computer.

Haysom wrote: “This is not the mission-critical systems of air traffic control; it’s the auxiliary business systems – check-in, boarding pass scans, train crew scheduling, train e-gates, and so on. Failure means that systems that make things run smoothly suddenly aren’t there. People fall back to paper and queues back up. This is not a failure of the cloud; this is not Microsoft Azure.” They’re not ‘mission-critical’, and not even ‘business-critical’, according to the traditional definitions, but maybe these critical ratings should be reassessed – because angry punters kill business.

Dean Bubley at Disruptive Analysis responded, as the catastrophe unfolded: “[It] seems to be about endpoint security and firmware updates on devices and servers…. I guess a key learning is going to be about testing updates carefully and deploying them rapidly – but not simultaneously – everywhere.” Bubley speculated a little about the cloud/edge impact in the story; whether there is “some read-across to software-based networks” and the amount of testing for cloud-native software updates and bug-fixes, as well as the urgent clamour and need for AI in cybersecurity.

The AI angle was telling, of course; the story wasn’t even told yet, but AI was cast as both the villain and the hero of the piece from the start – as the superhero juice that had powered the hackers and would power the counter-attackers. Even when the CrowdStrike confession came out, and the total fragility of global digital infrastructure was exposed in a simple third-party software update, it was down to human error: people hit send on the update in the first place, and people will labour over the manual fixes in the end. AI is the answer to everything, always.

Maxine Holt, in charge of cybersecurity research at Omdia, was quick out of the blocks on social media. She wrote: “Conflicting reports are emerging. Some sources, including Microsoft, suggest the Windows 10 issue might be separate from the CrowdStrike fiasco. No concrete confirmation has been provided yet… All eyes are now on CrowdStrike and Microsoft. The stakes couldn’t be higher. CrowdStrike, deeply embedded in enterprise cybersecurity, faces an existential threat if this update is confirmed to be the root cause. 

“Unlike other vendors, removing CrowdStrike from the security stack is not a simple task; it’s a massive project fraught with complexities. The question looms: could CrowdStrike actually fail? The vendor’s entrenchment in enterprise cybersecurity might not be enough to withstand the fallout if it is responsible for this unprecedented global outage. Microsoft, despite its involvement, is unlikely to face the same existential threat. Its entrenchment in IT and security infrastructures across the globe makes it almost invincible. But the scrutiny and backlash will be intense.”

Which sums up the earlier point about the power of the mob; of people killing businesses, just as they overthrow governments (in democratic systems); except maybe if you are Microsoft, plus a very few others. Leo Gergs, principal analyst at ABI Research, responded: “The damage to the credibility of centralised cloud services [and products] is severe. Businesses that [rely] on them are facing… operational chaos, financial losses, and tarnished reputations. The gravity… depends on the extent of the outage… but it could run to billions of dollars – all in a single day.”

But the question about public-versus-private cloud setups, as prompted by the coverage on the breakfast show on the BBC, is not dead. Haysom responded: “[Actually] it is a demonstration of why the cloud, and particularly the edge-cloud, is so important.” A telecoms vendor said in private chat that critical industries know very well already about the risks of using the public cloud, and are running data over private 4G and 5G networks into all-edge computing infrastructure with the kind of layered redundancy that ensures they operate during outages and failures. 

But a botched software update will mess with a private edge system just the same. “It could have happened in a ring-fenced environment, too. Still, if the right layering was implemented, it should not have taken down complete operations. Update rules are different for IT and OT. In IT, a mass roll out of an update is not unheard of; in OT, they are more controlled, segment-by-segment.” Lessons should be carried over, perhaps. But the message is also that most of the industries that have been impacted, or the disciplines that have, need the cloud for their IT and OT apps.

“Airports, retail, banking – these are heavily and globally interconnected, serving the public.” But Gergs at ABI Research says global industries should not rely any more on crappy computers and public clouds. “Enterprises must rethink their strategies in the wake of this outage. There’s likely to be a significant pivot towards hybrid and multi-cloud environments, where workloads are spread across multiple providers and on-premises systems, enhancing resilience and reducing dependency on any single provider.”

He continues: “This incident serves as a stark warning of what could happen in case of malicious cyberattack – which in the current times of hybrid warfare unfortunately is a more likely scenario than ever before. Private edge computing will gain momentum as companies seek to decentralise their processing and storage, bringing them closer to the data source. At the same time, scenarios like these will contribute to national states pushing for faster rollout of sovereign clouds – to provide an additional level of security and integrity for enterprises to secure their highly critical data.”

Back to Haysom, who reasons: “The cloud is not some perfect environment – it’s still software in the end. But it has solved many of the problems of software operations, including distribution and testing at scale, and the ongoing securitization of solutions… [But] public cloud on its own is not the simple answer. The systems affected today need to continue operation in the absence of connection to the cloud… Today’s events make the effective application of cloud at the edge more important, not less. But the edge cloud is different to the cloud, requiring new approaches.” Which is a discussion for another day; and also one found in the RCR Wireless archive.

ABOUT AUTHOR

James Blackman
James Blackman has been writing about the technology and telecoms sectors for over a decade. He has edited and contributed to a number of European news outlets and trade titles. He has also worked at telecoms company Huawei, leading media activity for its devices business in Western Europe. He is based in London.