Monday Blues: What the AWS Outage Taught Us About Cloud Dependency

It was a Monday, but an unusual one. I woke up to messages from friends wishing each other a happy Diwali and a very odd email from my child’s daycare: their daily reporting system (which sends pictures of the children’s activities and a daily summary) was down due to a “nationwide outage.” The term seemed like overkill for an app that only a few daycares use, so I assumed it must be a very popular app used by many. I was unaware that something much bigger had happened on the backend. Gradually it sank in that an outage on AWS can bring the services of thousands of companies to a standstill. Millions of users reported a sluggish experience on the apps they used.

Like any other tech enthusiast, I stayed glued to the AWS updates feed to see what went wrong. Finally they updated it: the reason for the outage was a software bug in the DNS automation system in the US East 1 (us-east-1) region, which prevented the Amazon DynamoDB API endpoint from being resolved properly.

The next day, lists of the organizations impacted by the outage began to appear. Some very prominent names, such as Canva, Delta Air Lines, and Coinbase Global, started surfacing. But the most interesting one was a mattress company called “Eight Sleep”.

Yes, that’s right, a mattress company!

Eight Sleep makes smart mattresses that let users set the temperature, the mattress inclination, and so on via its app. During the AWS outage, users could not control the mattress settings. Some users reported that they could not bring the temperature down; others were not able to change the inclination. Imagine you bought a 3,000 USD mattress and it’s rebelling against you.

Let’s take a look at the design of Eight Sleep. The user and the mattress are in the same room, yet when the user performs an action, the event is sent from the app to the cloud server (AWS in this case). The cloud server then runs the necessary business logic and sends a notification to the mattress to take the required action. This works well as long as you have uninterrupted internet access, which is common in most metropolitan cities, and your cloud server is always up and running. The outage showed us that it takes only a bug in DNS automation to disrupt even a giant like AWS.

If thousands of organizations, big and small, were impacted, it suggests that a lot of them were using the default region. For example, the app used by a daycare in Texas, US runs in the US East (N. Virginia) region rather than US East (Ohio), which would have been a nearer region. Most of the training material available on the internet uses US East (N. Virginia) as the default location, which may well have influenced developers to pick it as their primary location.
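To make the Texas example concrete, here is a minimal sketch of choosing a region by distance to the users instead of accepting the default. The coordinates are rough approximations assumed purely for illustration (AWS does not publish exact data center locations), the helper names are hypothetical, and straight-line distance is only a crude proxy for network latency; measuring real latency would be the better test.

```python
import math

# Approximate coordinates, assumed for illustration only.
REGIONS = {
    "us-east-1": (39.0, -77.5),  # US East (N. Virginia)
    "us-east-2": (40.0, -83.0),  # US East (Ohio)
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_region(user_location, regions=REGIONS):
    """Return the name of the candidate region closest to the user."""
    return min(regions, key=lambda name: haversine_km(user_location, regions[name]))


# Hypothetical user base: a daycare in Austin, Texas.
AUSTIN_TX = (30.27, -97.74)
print(nearest_region(AUSTIN_TX))
```

With these assumed coordinates, the Texas location comes out closer to Ohio than to N. Virginia, which matches the point above. The chosen region would then be set explicitly in the deployment configuration rather than inherited from a tutorial.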

All of this boils down to two major aspects that should be thought through when creating the architectural design of a product or service.

  1. Local Survivability: Instead of hairpinning every request via the internet, use local connections, especially when designing a smart/IoT device. This is nothing new: many IoT devices work over Z-Wave, Zigbee, or Thread, protocols that provide interfaces for local connectivity. Examples of local survivability are not limited to IoT devices; cloud-based UC giants like Cisco and Zoom provide local connectivity for their calling offerings in a similar way.
  2. Use Non-Defaults (High Availability): In this age of globalization, users are spread across multiple locations, but if your product or service has its major user base in a certain location, consider hosting it in the region closest to those users. This should automatically move you away from the default location setting of your cloud infrastructure provider.
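The local survivability idea from the first point can be sketched as a command path that prefers a local link and uses the cloud only as a fallback. This is a minimal illustration, not how Eight Sleep or any real device works: the `Command` type, the transport functions, and the simulated outage are all hypothetical stand-ins for a real local protocol and a real cloud API.

```python
from dataclasses import dataclass


class TransportError(Exception):
    """Raised by a transport that cannot deliver a command."""


@dataclass
class Command:
    name: str
    value: float


def send_command(command, transports):
    """Try each (label, transport) pair in order; return the label that worked."""
    errors = []
    for label, transport in transports:
        try:
            transport(command)
            return label
        except TransportError as exc:
            errors.append(f"{label}: {exc}")
    # Every path failed: surface all the reasons to the caller.
    raise TransportError("; ".join(errors))


# --- Hypothetical transports, for demonstration only ---
delivered = []


def local_link(command):
    """Stands in for a local connection to a device in the same room."""
    delivered.append(("local", command))


def cloud_api_during_outage(command):
    """Stands in for a cloud API call while the provider is unreachable."""
    raise TransportError("cloud endpoint could not be resolved")


cool_down = Command(name="set_temperature", value=18.0)

# Local first, cloud as the fallback (e.g. when the user is away from home).
used = send_command(cool_down, [("local", local_link),
                                ("cloud", cloud_api_during_outage)])
print(used)
```

Because the local path is tried first, the command in this sketch never touches the cloud, so a cloud outage does not affect a user standing next to the device. The cloud path still matters for remote control, which is why it stays in the list as a fallback rather than being removed.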

Local Survivability and High Availability may not always be feasible, but they are very important for a frictionless user experience. Both should be taken into consideration when we envision a product or service design.

Let us aim to make our Monday less stressful than it already is.

Author Details

Jijo Thomas

As an engineer at heart with over 18 years of experience, I have had the privilege of driving innovations in Unified Communications & Collaboration (UC&C) and Communications Platform as a Service (CPaaS) solutions. My journey is driven by a deep passion for embracing emerging technologies and leveraging them to create impactful, user-centric solutions. I’m always looking for creative problem-solving opportunities.
