Tuesday, September 20, 2011

Recent Cloud Crashes

In recent days Google Docs, Facebook, Amazon and Microsoft have all suffered outages. There have always been cloud crashes, of various durations, ever since the cloud approach to operations began attracting attention.

Google Docs was out of service for about an hour on Sept. 7, the result of a “memory management bug” that was exposed after Google made a change to improve real-time collaboration in Google Docs. The General Services Administration (GSA) is now running close to 20,000 desktops under Google Docs and depends on Google Docs uptime.

Facebook was down for 2.5 hours because it “changed the persistent copy of a configuration value that was interpreted as invalid.” Tens of millions of Facebook users were affected. Every client that detected the invalid value attempted to fetch a fresh copy, and the resulting overload on the system prolonged the downtime.
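The failure mode Facebook described can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Facebook's actual code: when the persisted configuration value goes bad, every client falls back to the authoritative store at the same time, turning one bad value into a self-inflicted overload.

```python
# Hypothetical sketch (not Facebook's actual code) of the failure mode the
# company described: an invalid persistent configuration value that every
# client tries to repair by re-fetching it, overloading the backing store.

class ConfigStore:
    """Stand-in for the authoritative configuration database."""
    def __init__(self):
        self.fetch_count = 0  # counts how many clients fell back to us

    def fetch_fresh_value(self):
        self.fetch_count += 1
        return "valid-value"

def is_valid(value):
    # Placeholder validity check: reject None or empty strings.
    return bool(value)

def handle_config(cached_value, store):
    """Return a usable config value, falling back to the store if invalid."""
    if is_valid(cached_value):
        return cached_value
    # When the persisted value goes bad, every client hits this branch
    # at once -- the stampede that prolonged the outage.
    return store.fetch_fresh_value()

store = ConfigStore()
# Simulate many clients all holding the same invalid cached value.
results = [handle_config("", store) for _ in range(1000)]
```

A single invalid value here produces a thousand simultaneous fallback fetches, which is why fixing the value required Facebook to stop all fallback traffic first.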

Amazon's cloud service did not function correctly for more than a day. Several large accounts malfunctioned, with a substantial loss of business. The cause was a misapplied software change during a systems software update, compounded by a delayed response in relocating processing.

Microsoft suffered what it called a Domain Name System failure that knocked out Office 365, Hotmail and SkyDrive worldwide for several hours. DISA is now migrating Army e-mail systems to Office 365 and hopes that Microsoft will not fail.

Though failures of public clouds are immediately visible and a favorite subject for the press, one has to ask whether reports of cloud downtime indicate that there is something fundamentally wrong with the concept of cloud computing.

The fact is that failures of in-house commercial systems are rarely reported, so there is no way to make any comparison. As a rule, in-house computer failures are kept secret within the enterprise, except when they affect public operations, such as transactions on a stock exchange, and become widely known.

SUMMARY
So far I have been unable to find a single case where a cloud service failure was caused by hardware failure, though in the case of Amazon there was a problem switching over to fail-over processing. All known cloud failures so far have been software failures that occurred while updating or upgrading systems. Whether these are human failures by software personnel or reflect the inability to fully test a new version prior to installation is debatable, because there will usually be multiple small errors that can add up to downtime.
 
Primary cloud datacenter operations should never be combined with the test and development datacenter. Testing must always be completely separate and under tight configuration control, to make sure that test versions are not mixed up with production versions. When new software is ready for release to the primary data center, it should first run for a prolonged test period in a parallel “resource” cloud that mirrors the primary data center, until it is safe to switch over.
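The switch-over discipline described above can be sketched as a simple decision rule: production traffic moves to the mirror environment only after every health check on the mirror has passed. The names and checks below are illustrative assumptions, not any vendor's actual release machinery.

```python
# Illustrative sketch of the release discipline described above: a new
# version soaks in a parallel "resource" cloud mirroring production, and
# traffic switches over only once the mirror passes all of its checks.
# All names here are hypothetical.

def mirror_is_safe(health_checks):
    """The mirror is considered safe only if every health check passes."""
    return len(health_checks) > 0 and all(check() for check in health_checks)

def release(primary, mirror, health_checks):
    """Return the environment that should serve production traffic."""
    if mirror_is_safe(health_checks):
        return mirror   # switch over to the tested new version
    return primary      # keep running the known-good version
```

The key property is that the default is the known-good environment: a failed or missing check leaves production untouched, which is exactly the protection the Google, Facebook and Amazon incidents lacked.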

Until software developers can gain full assurance that a software change has been completely tested – which is unlikely to ever happen – cloud subscribers must proceed on the assumption that software failures will happen. To protect against such cases, elaborate testing of parallel operations will have to be put in place.
None of this should prevent cloud operators from proceeding with the installation of software redundancies as well as fail-over capacity.
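The fail-over idea amounts to never depending on a single instance: try the primary, and fall back to a redundant replica when it fails. A minimal sketch, with illustrative names:

```python
# Minimal sketch of software redundancy with fail-over: call the primary
# service, and on failure try each redundant replica in turn. Names are
# illustrative, not any vendor's API.

def call_with_failover(primary, replicas):
    """Call primary; on failure, try each replica in turn."""
    for service in [primary, *replicas]:
        try:
            return service()
        except Exception:
            continue  # this instance is down; try the next one
    raise RuntimeError("all instances failed")

def failing_primary():
    raise ConnectionError("primary is down")

# The primary fails, so the request is served by the replica.
result = call_with_failover(failing_primary, [lambda: "served by replica"])
```

As the Amazon incident showed, the fail-over path itself must be exercised regularly; redundancy that has never been switched to is only presumed redundancy.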


For comments please e-mail paul@strassmann.com