Sunday, August 1, 2010

Network Downtime Metrics

DoD’s largest network (NMCI) is supposed to keep track of average monthly uptime metrics. The Service Level Agreements (SLAs) call for an average 99.7% uptime, which allows about 2.2 hours of downtime per client per month. Across a network of more than 400,000 clients, that amounts to a total of roughly 876,000 hours of downtime per month.
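The arithmetic behind these figures can be checked with a short sketch. It assumes an average month of 730 hours (365 days x 24 hours / 12), which is not stated in the SLA; the function name is only illustrative.

```python
# Assumption: an "average month" is 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 365 * 24 / 12


def monthly_downtime_hours(uptime_fraction, hours_per_month=HOURS_PER_MONTH):
    """Downtime hours per client per month allowed by a given uptime target."""
    return (1.0 - uptime_fraction) * hours_per_month


per_client = monthly_downtime_hours(0.997)  # 99.7% uptime target from the SLA
total = per_client * 400_000                # client count taken from the text

print(round(per_client, 2))  # 2.19 hours per client per month
print(round(total))          # 876000 hours across the network per month
```

The 876,000 figure follows from the unrounded 2.19 hours; using the rounded 2.2 hours would give 880,000.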

The downtime calculation contains a number of provisions that make the actual number of available network hours difficult to determine. Scheduled downtime, preventive maintenance, bug fixes, hardware upgrades and software enhancements are excluded from downtime hours. It is also not explained whether the SLA uptime applies to end-to-end performance, e.g. keyboard-to-data-center connectivity.

Hitting the target of 99.7% uptime is not difficult. The NMCI calculation is the average over the total measured population. The larger the number of clients included, the easier it is to meet the uptime number.

An examination of the incidence of failure in a network will show that the time between failures is approximately exponentially distributed (see http://en.wikipedia.org/wiki/Failure_rate). Three consequences follow:



1. There will always be a small number of clients that have failures greater than the average. A few of these will be out of service for an extended time period.

2. There will always be a very large number of clients that have failures substantially lower than average. A large number of clients will not have any failures except for scheduled downtime.

3. The average downtime reported by NMCI will be related to the number of clients included in the average. The larger the population that is included for reporting purposes, the lower the reported average. The effects of a small number of excessive failures will be masked by the large number of cases included.
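The masking effect described in these three points can be illustrated with a small sketch. All numbers here are hypothetical and are not NMCI data: ten clients are assumed to be down for 200 hours in the month, and every other client for 1 hour. Only the size of the reporting population changes.

```python
HOURS_PER_MONTH = 730.0  # assumption: average month length in hours
SLA_UPTIME = 0.997       # 99.7% target from the SLA


def report(population, bad_clients=10, bad_hours=200.0, normal_hours=1.0):
    """Return (average uptime, clients whose own downtime exceeds the SLA allowance).

    bad_clients, bad_hours and normal_hours are hypothetical illustration values.
    """
    downtime = [bad_hours] * bad_clients + [normal_hours] * (population - bad_clients)
    avg_downtime = sum(downtime) / population
    avg_uptime = 1.0 - avg_downtime / HOURS_PER_MONTH
    allowed = (1.0 - SLA_UPTIME) * HOURS_PER_MONTH  # about 2.19 hours per client
    over_limit = sum(1 for hours in downtime if hours > allowed)
    return avg_uptime, over_limit


for population in (1_000, 10_000, 400_000):
    avg_uptime, over_limit = report(population)
    status = "meets SLA" if avg_uptime >= SLA_UPTIME else "misses SLA"
    print(f"{population:>7} clients: average uptime {avg_uptime:.3%}, "
          f"{status}, {over_limit} clients over the per-client allowance")
```

Under these assumptions the same ten failing clients cause the average to miss the target in a population of 1,000 but not in populations of 10,000 or 400,000, while the count of clients over the allowance stays at ten throughout. A count of that kind is the sort of metric the average hides.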

SUMMARY

The calculation of network downtime using averages is misleading. In information warfare, a small number of critical clients with excessive failure rates is unacceptable.

Information warfare network reliability metrics should not focus on broad averages but on the number of critical clients whose failures exceed their time-to-restore capabilities.

The time to restore is difficult to predict, especially on ships. The only way to reduce downtime risks is to adopt end-to-end network redundancies for critical components of the NGEN network. Automatic fail-over can assure near-zero failures. This should be pursued because it is an economically feasible solution. The Navy is in an excellent position to adopt a zero-defect approach for the critical parts of NGEN, its future network. The economics of virtualization make that possible.
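A rough sketch of why redundancy with automatic fail-over can approach zero downtime: it assumes that redundant paths fail independently of each other and that fail-over is instantaneous and always succeeds. Neither assumption is established in the text, and real networks share failure modes, so the result is an upper bound rather than a prediction.

```python
HOURS_PER_MONTH = 730.0  # assumption: average month length in hours


def redundant_availability(single_availability, paths):
    """Availability of `paths` independent redundant paths with ideal fail-over.

    Service is lost only when every path is down at the same time.
    """
    return 1.0 - (1.0 - single_availability) ** paths


for paths in (1, 2, 3):
    availability = redundant_availability(0.997, paths)
    downtime_seconds = (1.0 - availability) * HOURS_PER_MONTH * 3600
    print(f"{paths} path(s): availability {availability:.7f}, "
          f"about {downtime_seconds:.1f} seconds of downtime per month")
```

With two such paths, each at 99.7%, the expected downtime drops from about 2.2 hours to well under a minute per month under these idealized assumptions.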


For comments please e-mail paul@strassmann.com