When cloud operations support tens of thousands of devices whose processing, memory, storage and telecommunications draw on a shared pool of services, automated controls become essential. Human operators cannot cope with the speed and complexity of such operations. Further growth of cloud computing will therefore be constrained not by the availability of computing assets, but by inherent limitations in how those assets are managed. Capacity utilization of at least 80% can be extracted from rapidly changing equipment configurations only if the entire data center is viewed as a single shared pool that can instantly adapt to changing demands.
The change in the scale of data center operations in the cloud makes it necessary to overhaul how computing is organized. The new data centers require that all computing, storage and communications assets combine to offer customers not only full uptime but also short latencies, since devices depend on on-line responses. What was perhaps tolerable to a user who could always pass accountability for poor service to company staff will not do in the cloud data center, where commercial pay-per-use services enforce delivery of superior service level agreements. The security assurance staff must also support unprecedented levels of reliability.
A number of vendors offer data center management control software, for instance IBM Tivoli, HP OpenView, EMC Smarts and VMware vCenter. The power of these tools depends on their ability to monitor and analyze performance metrics regardless of source. Preventing vendor lock-in requires that such software be vendor- and data-agnostic. It must also scale to support the collection and analysis of millions of metrics per hour, whether those metrics are collected from a single massive cloud or from many smaller services affiliated with the central cloud through processing "on the edge".
Because fail-over is also arranged across separate operations, central management control software must be able to employ 'remote collectors'. This feature allows it to securely tap into performance data across firewalled environments as well as geographically separated multi-datacenter deployments.
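The vendor- and data-agnostic requirement above can be illustrated with a minimal sketch. The record layout, field names, and the `make_metric` helper are illustrative assumptions, not any vendor's actual schema; the point is that only the envelope is standardized, so readings from any tool or site can be carried.

```python
import json
import time

def make_metric(source, name, value, tags=None):
    """Wrap a raw reading in a vendor-neutral record.

    Only the envelope (timestamp, source, metric, value, tags) is
    standardized; the metric name and tags pass through untouched,
    so output from any monitoring tool can be carried.
    """
    return {
        "timestamp": time.time(),
        "source": source,        # e.g. a remote collector behind a firewall
        "metric": name,
        "value": value,
        "tags": tags or {},
    }

# A remote collector would batch records like this one and forward
# them over an authenticated channel to the central analysis tier.
record = make_metric("edge-site-7", "disk.util.pct", 83.5,
                     {"host": "db01", "device": "sda"})
print(json.dumps(record, indent=2))
```

Because the central tier sees only this neutral envelope, it can aggregate millions of such records per hour without caring which vendor's agent produced them.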
The analytics of management control software reflect the manner in which the normal behavior of each performance metric is determined. The software must be able to analyze any performance metric, because experience with millions of indicators has shown that data behave in widely disparate ways.
It is inadequate to characterize "normal" behavior with a single method that assumes data will follow a 'bell-shaped curve'. It is insufficient to trigger alerts only when a metric reaches two or three standard deviations from the average. Monitors must specify a variety of allowable intervals that define ranges of acceptable behavior and trigger an alert when those ranges are breached. Here are examples of methods that will reveal exceptional levels of performance:
• Linearly trending behavior that exceeds a threshold (e.g., sudden peaks in disk utilization). Monitoring defenses on a ship may require tracking in minutes where there is exposure to a missile attack.
• Two-state (e.g., on/off) availability of a service. Detection of a tracking signal by a UAV must be instant.
• Discrete-value behavior detection (e.g., 'number of database user connections'). An instant rise in the number of transactions may indicate an incipient denial-of-service attack.
• Cyclical-pattern behavior detection (e.g., weekly, monthly, etc.). A mid-month rise in financial transactions may indicate a hacker attack.
• Non-time-series, ‘sparse’
data behavior, such as outliers. A rapid decline in communications may be an
indicator of failure.
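The methods above can be sketched as a family of simple per-metric checks. This is a minimal illustration under assumed thresholds and function names of my own choosing, not any vendor's detection logic; real tools learn these ranges from history rather than hard-coding them.

```python
from statistics import mean, stdev

def zscore_alert(history, value, k=3.0):
    """Bell-curve check: alert when a value is more than k standard
    deviations from the historical mean. Valid only for metrics
    whose data are roughly normally distributed."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) / sigma > k

def rate_alert(prev, curr, interval_s, max_rate):
    """Linear-trend check: alert on a sudden growth rate, e.g. disk
    utilization climbing faster than max_rate units per second."""
    return (curr - prev) / interval_s > max_rate

def state_alert(expected_up, is_up):
    """Two-state check: alert the instant an on/off service
    leaves its expected state."""
    return expected_up and not is_up

def seasonal_alert(same_period_history, value, tolerance=0.5):
    """Cyclical check: compare against the same weekday or month
    period in past cycles instead of a global average."""
    baseline = mean(same_period_history)
    return abs(value - baseline) > tolerance * baseline

# Each metric is matched to the method that fits its observed behavior:
print(zscore_alert([10, 11, 9, 10, 12], 25))       # → True
print(state_alert(expected_up=True, is_up=False))  # → True
```

The design point is the dispatch itself: one detector per behavior class, chosen per metric, rather than one bell-curve rule applied to everything.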
When problems are building up in a computing service, the first signs of abnormal behavior show up as deviant performance metrics associated with an application. With sufficiently sophisticated automated detection and alert monitoring, it is possible to observe the abnormality and use it as an early warning of potential trouble.
It is important to recognize that automated monitoring does not conclusively tell whether any one metric is behaving abnormally. In operations there will always be some metrics showing abnormality at any given time. That is inconsequential system 'noise', and all complex systems generate some of it. The objective is to learn a computer network's typical 'noise' level and then detect noise levels that are potentially dangerous. The sensors must be sufficiently diverse that confirming a critical event requires simultaneous detection of multiple adverse indicators.
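This corroboration rule, declaring a critical event only when several independent indicators fire at once, can be sketched as follows. The function name, the metric names, and the threshold of three are illustrative assumptions; operators would tune the required count to their network's measured noise level.

```python
def confirm_critical(alerts, required=3):
    """Treat isolated anomalies as noise: declare a critical event
    only when at least `required` independent sensors fire within
    the same observation window.

    alerts: mapping of indicator name -> bool (fired or not).
    Returns (is_critical, list of firing indicators)."""
    firing = [name for name, fired in alerts.items() if fired]
    return len(firing) >= required, firing

# One observation window across diverse sensors (names are made up):
window = {
    "disk.util.spike": True,
    "db.connections.surge": True,
    "service.down": False,
    "net.traffic.drop": True,
}
critical, which = confirm_critical(window, required=3)
print(critical, which)
```

With only one or two indicators firing the same call returns `False`, which is exactly how routine system noise is filtered out before anyone is paged.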
SUMMARY
The installation of a system of controls and monitoring for large data centers warrants top executives' attention before proceeding with plans to implement cloud computing projects.
For comments please e-mail paul@strassmann.com