Sunday, 1 May 2011

Oops on the Cloud: AWS Outage


April 21: A massive outage sent Amazon Web Services (AWS) scrambling to fix its Elastic Compute Cloud (EC2) and Relational Database Service (RDS) cloud computing services.

From around 4:40 in the morning and for some ten hours, major Amazon EC2 cloud customers, including the check-in site Foursquare, social media aggregator HootSuite, Q&A site Quora, and Reddit, struggled for connectivity or dropped off the Internet altogether.

To short-cut the techno-babble, it appears EBS (Elastic Block Store) volume replication was disrupted, which meant that automated copying, deployment and load-balancing of large-volume sites simply froze. "Something large" clearly failed: not just the EBS system itself, but the infrastructure that supposedly spreads the load across multiple 'availability zones'. The disruption lasted around three hours before recovery began in earnest...

Some databases that were replicated across multiple availability zones did not fail over properly. It appears that the issues persisted at Amazon's Northern Virginia data center for some hours after that, with some database recovery taking up to 12 hours.
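For readers unfamiliar with the jargon: the multi-AZ model keeps a primary database in one availability zone and a standby replica in another, and promotes the standby if the primary's zone fails. This is the mechanism that reportedly did not work for some customers. A minimal, purely illustrative sketch of the idea (the class and zone names are hypothetical, not an AWS API):

```python
# Toy model of multi-AZ database fail-over. Names are illustrative only;
# this is not how AWS implements it internally.

class ReplicatedDatabase:
    """A primary in one availability zone with a standby replica in another."""

    def __init__(self, primary_zone, standby_zone):
        self.zones = {primary_zone: "primary", standby_zone: "standby"}

    def zone_failed(self, zone):
        """Drop the failed zone; promote the standby if the primary was lost."""
        if self.zones.get(zone) == "primary":
            del self.zones[zone]
            for z in self.zones:          # promote the surviving replica
                self.zones[z] = "primary"
        else:
            self.zones.pop(zone, None)

    def available(self):
        return "primary" in self.zones.values()


db = ReplicatedDatabase("us-east-1a", "us-east-1b")
db.zone_failed("us-east-1a")   # the primary's zone goes dark
print(db.available())          # True: the standby was promoted
```

In the April incident the promotion step is what appears to have stalled for some customers, which is why replication across zones alone turned out not to be a guarantee.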

Although this isn't the first outage of Amazon's cloud computing services, the scale of the incident raises questions about the resilience of cloud services, not to mention support: who should take the calls when services fail, vendors or service providers?

Up to this point, large-volume, on-demand customer sites had enjoyed the resilience of EC2 with relatively little downtime. Because Amazon's reliability had been so solid, many users were ill-prepared; some were caught by a widespread failure in unexpected components, rendering their failure plans ineffective. From Amazon's side, some ripple effects within EC2, particularly in EBS, should not have happened.
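One lesson customers drew from the incident is not to pin a failure plan on a single region: if every zone in one region is degraded at once, the fallback has to live somewhere else entirely. A client-side sketch of that idea (the endpoint names and health check are hypothetical):

```python
# Hypothetical endpoints; a real deployment would list its own hosts
# per availability zone and, crucially, per region.
ENDPOINTS = [
    "https://app.us-east-1a.example.com",
    "https://app.us-east-1b.example.com",
    "https://app.us-west-1a.example.com",  # a different region entirely
]

def first_healthy(endpoints, is_healthy):
    """Return the first endpoint whose health check passes, else None."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None

# Simulate the April 2011 scenario: every us-east zone is down at once.
us_east_down = lambda url: "us-east" not in url
print(first_healthy(ENDPOINTS, us_east_down))  # falls through to us-west
```

A plan that only listed the two us-east zones would have returned nothing here, which is roughly what happened to sites whose redundancy stopped at the zone level.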

Another area in which Amazon normally excels, namely communication, dropped into a black hole along with EBS and customers' websites. This PR failure probably did more damage to AWS than the outage itself.

I liked the post-mortem analysis on Gigaom, which, whilst appalled at the comprehensive failure of a service designed to be so robust that you, the customer, need not worry about it, maintains that the AWS behemoth will come out of the incident with barely a scratch.

"Transparency, increased liability beyond the SLAs, better standard support — these things will remain at the status quo. Amazon is a large company that, unlike many startup cloud providers, doesn’t have to rely on openness to appease its customers when something goes wrong. It can be as transparent as it deems necessary because 1) it’s currently the best cloud platform available; and 2) it already has their business."
 
"Amazon’s cloud business won’t be hurt a bit. There are two reasons for this:
1) anyone still using AWS likely will spend more money with the company to make their application architectures more resilient; and
2) AWS, as mentioned above, is still the best cloud around. The outage is a black eye, but it’s a black eye on what’s otherwise an Adonis of cloud computing. Nowhere else can developers have access to a suite of tools that spans... a robust portfolio of features and aren’t tied to the notion of “five nines” availability, AWS is still a great choice. Even if AWS bleeds a few customers from this outage, there’s plenty of new blood out there to replace it."

To quote one second-tier service provider, RightScale: "It reminds us that this is still 'day one' of the cloud and that we all have much to learn about building and operating robust systems on a large scale."

Are we all feeling happy sitting on our Cloud? RC
