
AWS regional outage cripples Amazon and hosted services

AWS’s biggest global Region went unresponsive for several hours on Tuesday, affecting both Amazon products and other hosted apps and services.

Tuesday was a tough day not only for AWS engineers but also for the myriad cloud-native businesses that host their services in AWS’s US East (Northern Virginia) Region, known as us-east-1. Services hosted in the us-east-1 Region data centers were unavailable for hours as AWS scrambled to fix “an impairment of network devices.” By late afternoon AWS had a fix in place.

During the outage, AWS us-east-1 customers couldn’t access services including EC2, Connect, DynamoDB, Glue, Athena, Timestream and Chime. Popular streaming services affected by the outage included Disney Plus and Netflix. The AWS outage stranded users of dating app Tinder, cryptocurrency service Coinbase and payment app Venmo. Players couldn’t launch popular video games like PUBG and League of Legends.

The outage affected Amazon’s own products and services. Alexa voice assistants stopped working, and Amazon-owned smart home automation hardware went offline. Many Amazon package delivery drivers and Amazon couriers also found themselves unable to get routes or deliver packages, according to reports. 

AWS customers first identified problems mid-morning Eastern Time on Tuesday. The first acknowledgement from AWS came around midday. AWS explained that the underlying issue affected its monitoring software, which delayed the company’s response.

By mid-afternoon Tuesday, AWS had executed mitigation processes. Services were restored over the next few hours as the fixes propagated throughout the affected sites.

Location is everything, even in the cloud

AWS bills its global infrastructure as “the most secure, extensive, and reliable cloud platform.” The foundation of that platform infrastructure is the Region, a physical location around the world where AWS has grouped data centers. Today AWS operates 25 global Regions and has announced plans to add nine more in Canada, Europe, the Middle East, India and elsewhere in Asia.

AWS logically segments groups of physically separate data centers operating in the same Region into clusters called Availability Zones (AZs). By scaling cloud services to span multiple AZs, AWS customers can mitigate outages related to weather events or power grid infrastructure failures.
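The idea of spanning multiple AZs can be illustrated with a minimal sketch. It does not call any AWS API; the zone names and the helper functions `place_replicas` and `surviving_replicas` are hypothetical, chosen only to show how spreading copies of a service across zones limits the damage when a single zone fails.

```python
from collections import Counter
from itertools import cycle


def place_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so they are spread as evenly as possible."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def surviving_replicas(placement, failed_zone):
    """Count the replicas still running when one zone becomes unavailable."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names, for illustration only.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = place_replicas(6, zones)

print(Counter(placement))                           # two replicas in each zone
print(surviving_replicas(placement, "us-east-1a"))  # 4 replicas remain
```

With all six replicas in a single zone, losing that zone would leave none running. Note that this kind of spread addresses zone-level failures only; an impairment that affects a whole Region, like the one described in this article, can reach every AZ in it.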

Amazon operates five public North American Regions, but us-east-1 is its most elaborate. US East comprises six AZs and nine Local Zones, which are AWS infrastructure deployments specialized for edge computing and big data.

Located in Northern Virginia’s “Data Center Alley,” us-east-1 is connected to the same fiber backbone that channels 70% of the world’s Internet traffic. More than 100 data centers populate the surrounding county, including Amazon’s.

Because us-east-1 sits at a central point for global Internet traffic, many AWS customers prefer to host their services there. Any disruption to services hosted in or touching the Region ripples across the Internet.

The us-east-1 outage is in some ways reminiscent of a recent outage at Facebook, which operates its own independent global cloud infrastructure. In October, Facebook experienced a worldwide outage lasting about six hours, following a bad router change. A bad Border Gateway Protocol (BGP) update propagated across Facebook’s internal backbone routers. Complicating the issue, the router failure disrupted Facebook’s internal security systems. The outage briefly prevented employees from entering data center facilities to execute the mitigation.
