When cloud operations support tens of thousands of devices whose processing, memory, storage and telecommunications draw on a shared pool of services, automated controls become essential. Human operators cannot cope with the speed and complexity of such operations. Further growth of cloud computing will therefore be constrained not by the availability of computing assets, but by inherent limitations in how those assets are managed. Capacity utilization of at least 80% can be extracted from rapidly changing equipment configurations only if the entire data center is viewed as a single shared pool that can instantly adapt to changing demands.
The change in the scale of data center operations in the cloud makes it necessary to overhaul how computing is organized. The new data centers require that all computing, storage and communications assets combine to offer customers not only full uptime but also short latencies, since devices depend on on-line responses. What was perhaps tolerable to a user who could always pass accountability for poor service to company staff will not do in the cloud data center, where commercial pay-per-use services enforce delivery of superior service level agreements. The security assurance staff must also support unprecedented levels of reliability.
A number of vendors offer data center management control software, for instance IBM Tivoli, HP OpenView, EMC Smarts and VMware vCenter. The power of these tools depends on their ability to monitor and analyze performance metrics regardless of source. Preventing vendor lock-in requires that such software be vendor- and data-agnostic. It must also scale to support the collection and analysis of millions of metrics per hour, whether those metrics are collected from a single massive cloud or from many smaller services affiliated with the central cloud through processing "on the edge".
Because fail-over is also arranged across separate operations, central management control software must be able to employ 'remote collectors'. This feature allows it to securely tap into performance data across firewalled environments as well as geographically separated multi-datacenter deployments.
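The vendor- and data-agnostic requirement above can be illustrated with a minimal sketch. The record layout, field names, and the `make_metric` helper are illustrative assumptions, not any vendor's actual schema; the point is that only the envelope is standardized, so readings from any tool or site can be carried.

```python
import json
import time

def make_metric(source, name, value, tags=None):
    """Wrap a raw reading in a vendor-neutral record.

    Only the envelope (timestamp, source, metric, value, tags) is
    standardized; the metric name and tags pass through untouched,
    so output from any monitoring tool can be carried.
    """
    return {
        "timestamp": time.time(),
        "source": source,        # e.g. a remote collector behind a firewall
        "metric": name,
        "value": value,
        "tags": tags or {},
    }

# A remote collector would batch records like this one and forward
# them over an authenticated channel to the central analysis tier.
record = make_metric("edge-site-7", "disk.util.pct", 83.5,
                     {"host": "db01", "device": "sda"})
print(json.dumps(record, indent=2))
```

Because the central tier sees only this neutral envelope, it can aggregate millions of such records per hour without caring which vendor's agent produced them.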
The analytics of management control software reflect the manner in which the normal behavior of each performance metric is determined. The software must be able to analyze any performance metric, because experience with millions of indicators has shown that data behave in widely disparate ways.
It is inadequate to characterize "normal" behavior with a single method that assumes data will follow a 'bell-shaped curve'. It is insufficient to trigger alerts only when a metric reaches two or three standard deviations from the average. Monitors must specify a variety of allowable intervals that define ranges of acceptable behavior and trigger an alert when those ranges are breached. Here are examples of methods that will reveal exceptional levels of performance:
• Linearly trending behavior that exceeds a threshold (e.g., sudden peaks in disk utilization). Monitoring defenses on a ship may require tracking in minutes where there is exposure to a missile attack.
• Two-state (e.g., on/off) availability of a service. Detection of a tracking signal by a UAV must be instant.
• Discrete-value behavior detection (e.g., 'number of database user connections'). An instant rise in the number of transactions may indicate an incipient denial-of-service attack.
• Cyclical-pattern behavior detection (e.g., weekly, monthly, etc.). A mid-month rise in financial transactions may indicate a hacker attack.
• Non-time-series, ‘sparse’
data behavior, such as outliers. A rapid decline in communications may be an
indicator of failure.
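The methods above can be sketched as a family of simple per-metric checks. This is a minimal illustration under assumed thresholds and function names of my own choosing, not any vendor's detection logic; real tools learn these ranges from history rather than hard-coding them.

```python
from statistics import mean, stdev

def zscore_alert(history, value, k=3.0):
    """Bell-curve check: alert when a value is more than k standard
    deviations from the historical mean. Valid only for metrics
    whose data are roughly normally distributed."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) / sigma > k

def rate_alert(prev, curr, interval_s, max_rate):
    """Linear-trend check: alert on a sudden growth rate, e.g. disk
    utilization climbing faster than max_rate units per second."""
    return (curr - prev) / interval_s > max_rate

def state_alert(expected_up, is_up):
    """Two-state check: alert the instant an on/off service
    leaves its expected state."""
    return expected_up and not is_up

def seasonal_alert(same_period_history, value, tolerance=0.5):
    """Cyclical check: compare against the same weekday or month
    period in past cycles instead of a global average."""
    baseline = mean(same_period_history)
    return abs(value - baseline) > tolerance * baseline

# Each metric is matched to the method that fits its observed behavior:
print(zscore_alert([10, 11, 9, 10, 12], 25))       # → True
print(state_alert(expected_up=True, is_up=False))  # → True
```

The design point is the dispatch itself: one detector per behavior class, chosen per metric, rather than one bell-curve rule applied to everything.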
When problems are building up in a computing service, the first signs of abnormal behavior show up as deviant performance metrics associated with an application. With sufficiently sophisticated automated detection and alert monitoring, it is possible to observe the abnormality and use it as an early warning of potential trouble.
It is important to recognize that automated monitoring does not conclusively tell whether any one metric is behaving abnormally. In operations there will always be some metrics showing abnormality at any given time. That is inconsequential system 'noise', and all complex systems generate some of it. The objective is to learn a computer network's typical 'noise' level and then detect noise levels that are potentially dangerous. The sensors must be sufficiently diverse that confirming a critical event requires simultaneous detection of multiple adverse indicators.
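This corroboration rule, declaring a critical event only when several independent indicators fire at once, can be sketched as follows. The function name, the metric names, and the threshold of three are illustrative assumptions; operators would tune the required count to their network's measured noise level.

```python
def confirm_critical(alerts, required=3):
    """Treat isolated anomalies as noise: declare a critical event
    only when at least `required` independent sensors fire within
    the same observation window.

    alerts: mapping of indicator name -> bool (fired or not).
    Returns (is_critical, list of firing indicators)."""
    firing = [name for name, fired in alerts.items() if fired]
    return len(firing) >= required, firing

# One observation window across diverse sensors (names are made up):
window = {
    "disk.util.spike": True,
    "db.connections.surge": True,
    "service.down": False,
    "net.traffic.drop": True,
}
critical, which = confirm_critical(window, required=3)
print(critical, which)
```

With only one or two indicators firing the same call returns `False`, which is exactly how routine system noise is filtered out before anyone is paged.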
SUMMARY
The installation of a system of controls and monitoring for large data centers warrants top executives' attention before proceeding with plans to implement cloud computing projects.
For comments please e-mail paul@strassmann.com