
Amazon Cloud Computing Failure

Yesterday Amazon Web Services (AWS) reported an outage of its processing services in the Northern Virginia data centers. Multiple web sites were out of service. This attracted widespread attention from evening TV news reports and from major daily newspapers. The word is out that cloud computing could be unreliable.

A review of AWS uptime performance showed that all cloud services performed well at all times, with the exception of EC2, RDS and Elastic Beanstalk in Northern Virginia for part of one day. AWS in Europe and Asia Pacific continued to run without interruption.

The reported outage calls Amazon's back-up arrangements into question. Did Amazon provide sufficient redundancy? Were sufficient back-ups of the failed applications available at other sites?

In retrospect the affected customers, a limited set that included the Foursquare, Reddit, Quora and Hootsuite web sites, could have bought insurance against failure. They could have signed Service Level Agreements (SLAs) that provide for fail-over to a remote site.
Such arrangements can be expensive, depending on the back-up options chosen. However, computers do fail.

Amazon's service-level commitment provides for 99.95% availability. This is insufficient, because it allows for an average annual downtime of 263 minutes. If failures follow an exponential distribution, the downtime experienced in any particular year can be considerably greater than that average suggests.
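
The arithmetic behind that figure is straightforward. The short Python sketch below converts an availability commitment into the annual downtime it implicitly permits; the 99.95% value is Amazon's published commitment, and the other percentages are shown only for comparison.

    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year

    def allowed_downtime_minutes(availability_pct):
        """Annual downtime, in minutes, permitted by an availability percentage."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.9, 99.95, 99.99, 99.999):
        print(f"{pct:7.3f}% availability -> {allowed_downtime_minutes(pct):8.1f} minutes of downtime per year")

    # 99.95% works out to about 262.8 minutes per year -- the ~263 minutes cited above.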

If customers run major businesses on top of Amazon, and can lose large amounts of revenue during an outage, why not pay for fail-over at another site? Were the savings from not providing redundancy worth the risk? *

Amazon is liable to compensate customers only for the cost of the cloud services that were lost; it is not responsible for the customer's business losses. Calculating the worth of fail-over insurance should be a simple matter at the time the SLAs are negotiated.
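
That calculation amounts to little more than comparing an expected loss against the cost of a standby site. The sketch below illustrates it with hypothetical numbers; the revenue rate and the fail-over cost are placeholders, not figures from the actual outage.

    def expected_outage_loss(revenue_per_minute, expected_downtime_minutes):
        """Expected annual revenue lost to downtime."""
        return revenue_per_minute * expected_downtime_minutes

    revenue_per_minute = 1000.0     # hypothetical: $1,000 of revenue earned per minute
    expected_downtime  = 263.0      # minutes per year permitted by a 99.95% SLA
    failover_cost      = 150000.0   # hypothetical annual cost of a remote fail-over site

    loss = expected_outage_loss(revenue_per_minute, expected_downtime)
    print(f"Expected loss without fail-over: ${loss:,.0f} per year")
    print(f"Annual cost of a fail-over site: ${failover_cost:,.0f}")
    if loss > failover_cost:
        print("The fail-over site pays for itself.")
    else:
        print("The fail-over site costs more than the expected loss.")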

Indeed, Amazon backed up the workloads, but only within the Availability Zones located in the US-EAST-1 region. Though the data was replicated, the failure remained local. Why did Amazon (or the customer) not provide for a backup in California? Why was the East Coast application not replicated to run simultaneously across multiple cloud platforms from different vendors?

For 100% uptime it is necessary to run across multiple clouds, not only across multiple zones from the same vendor, who is likely to carry identical bugs at every location. Different locations from the same cloud provider will never be entirely sufficient. Even Google, with 27 data centers, has problems keeping Gmail running. The simpler idea of sticking with Amazon alone and balancing applications across multiple regions may be insufficient for critical applications.
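
To make the point concrete, the sketch below shows the kind of client-side fail-over logic such a multi-vendor deployment implies: check the health of each independently hosted copy of the application and route traffic to the first one that answers. The endpoint URLs and the /health path are assumptions for illustration only; in production this logic normally lives in a DNS or global load-balancing layer rather than in application code.

    import urllib.request

    # Two independent deployments of the same application, each at a different
    # cloud vendor. The URLs below are hypothetical placeholders.
    ENDPOINTS = [
        "https://app.vendor-a.example.com",   # primary copy, vendor A
        "https://app.vendor-b.example.com",   # second copy, vendor B
    ]

    def healthy(base_url, timeout=2.0):
        """Return True if the deployment answers its health check with HTTP 200."""
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def pick_endpoint():
        """Route traffic to the first healthy deployment, in priority order."""
        for url in ENDPOINTS:
            if healthy(url):
                return url
        raise RuntimeError("No healthy deployment available")

    print("Serving traffic from:", pick_endpoint())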

SUMMARY
Customers should contract with multiple providers, at multiple locations, for the survival of applications that warrant 100.00% uptime. Cloud computing, and particularly Platform-as-a-Service (PaaS), offers much simpler deployment and management of applications. These are independent of the underlying proprietary infrastructures, which should make the selection of multiple vendors feasible. However, building an application to work across multiple vendors requires strict conformance to standards and a disciplined commitment to interoperability across multiple clouds. As yet, DoD has still to demonstrate that it has IT executives who can steer developers in such directions.

Anytime there is a cloud outage, some will call all of cloud computing into question. That is not a valid argument. Every computer has downtime. The difference with cloud computing is how we manage and pay for that risk. Yesterday's Amazon cloud failure offers a very useful lesson in how to start preparing for the assured continuity of cyber operations, which makes error-free uptime a requirement.

* http://pstrassmann.blogspot.com/2011/04/continuity-of-operations-in-cloud.html


For comments please e-mail paul@strassmann.com