Sunday, January 9, 2011

Uptime Performance for Cyber Operations

The reliability of end-to-end transaction processing for cyber operations is one of the most important metrics dictating the design of networks. Under conditions of information warfare, seconds, not minutes, will matter.

It is necessary to reach agreement on how to measure system uptime. The reliability of a network cannot be isolated within the Army, Navy, Marine Corps or the Air Force. Under conditions of information operations, the uptime of a DoD network will be determined by the combined response time of every participating network.

The calculation of network uptime using undefined average metrics is misleading. Is uptime averaged over minutes, hours or days? Is it measured at the user’s keyboard or at the data center? Will it be measured as the number of transactions that exceed a standard, or as the number of transactions that fall below a defined threshold? Or will the network operators resort to surveying a random sample of users to gauge user satisfaction? Will such a sample be taken at peak load or during average business hours?

The following illustrates a valid approach to measuring uptime:


1. The time interval over which uptime is measured is specified. It could be seconds or hours, depending on user-tolerable response times. In the case above, downtime was measured in five-minute increments, but any interval could be used.
2. The number of transactions (users or seats in the above case) that miss a defined standard is counted. The standard could be as tight as 200 milliseconds (in the case of a Google search) or as loose as five minutes when downloading geographic data.
3. The SLA (Service Level Agreement) non-performance standard is defined not as uptime but as performance downtime over five minutes. 99% uptime sounds good until you realize that it allows, on average, 87.6 hours of downtime per year.
4. Overall system performance can be examined as a frequency of failures (Green or Red), or as a summary over a 30-minute period.
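
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (timestamp, response time) input format, and the 1% miss tolerance used to label an interval Green or Red are hypothetical and do not come from any DoD SLA.

```python
from collections import defaultdict

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability):
    """Hours of downtime per year permitted by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def missed_by_interval(transactions, threshold_s, interval_s=300):
    """Count, per interval, the transactions slower than the threshold.

    transactions: iterable of (timestamp_s, response_time_s) pairs.
    Returns {interval_index: (missed, total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for timestamp_s, response_s in transactions:
        bucket = int(timestamp_s // interval_s)
        counts[bucket][1] += 1
        if response_s > threshold_s:
            counts[bucket][0] += 1
    return {b: (m, t) for b, (m, t) in sorted(counts.items())}


def interval_status(missed, total, max_miss_fraction=0.01):
    """Label an interval Green or Red (the tolerance is an assumed value)."""
    return "Red" if total and missed / total > max_miss_fraction else "Green"


# Three sample transactions, a 200 ms standard, five-minute intervals.
sample = [(0, 0.1), (10, 0.5), (310, 0.1)]
per_interval = missed_by_interval(sample, threshold_s=0.2)
statuses = {b: interval_status(m, t) for b, (m, t) in per_interval.items()}
```

With this sample, the first five-minute interval has one slow transaction out of two and is labeled Red, while the second is Green. Calling `annual_downtime_hours(0.99)` reproduces the 87.6 hours cited in item 3.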

When designing for network reliability, one must consider whether the network has a single point of failure or whether it is redundant. Cascaded single points of transaction processing show the following downtimes:
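
The effect of cascading can be approximated with a short calculation. The sketch below assumes that each processing stage fails independently, so the availability of the chain is the product of the stage availabilities. The function names and the example of five stages at 99% each are illustrative assumptions, not figures from the original chart.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def series_availability(availabilities):
    """Availability of stages in series, assuming independent failures."""
    return math.prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected annual downtime for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Five cascaded stages, each available 99% of the time.
chain = series_availability([0.99] * 5)   # roughly 0.951
chain_downtime = downtime_hours_per_year(chain)
```

Under these assumptions a chain of five 99% stages is available only about 95.1% of the time, which is roughly 429 hours of downtime per year compared with 87.6 hours for a single stage.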



If cyber operations use a redundant design (two identical system processes running in parallel), overall system reliability shows remarkable uptime improvements. When automatic fail-over keeps failure rates low, that approach should be pursued for all critical applications. Virtualization makes fail-over economically feasible:
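
A rough model shows why redundancy helps so much. The sketch below assumes two identical processes that fail independently, with the standby taking over whenever the primary is down and the fail-over itself succeeds. The function name and the `failover_success` parameter are hypothetical, and a real system would need a more detailed model.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def redundant_availability(availability, failover_success=1.0):
    """Availability of a primary plus one identical standby.

    The system is up if the primary is up, or if the primary is down,
    the fail-over succeeds, and the standby is up (independence assumed).
    """
    primary_down = 1.0 - availability
    return availability + primary_down * failover_success * availability


perfect = redundant_availability(0.99)           # fail-over always works
imperfect = redundant_availability(0.99, 0.95)   # fail-over works 95% of the time
perfect_downtime = (1.0 - perfect) * HOURS_PER_YEAR
```

With perfect fail-over, two 99% processes yield 99.99% availability, which cuts expected downtime from 87.6 hours to under one hour per year. If fail-over succeeds only 95% of the time, availability drops to about 99.94%, which is why the fail-over mechanism itself deserves attention.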



SUMMARY
Contractor-defined DoD Service Level Agreements are inconsistent in their definitions as well as in how they calculate uptime and downtime metrics. With increased dependency on multi-Component interoperability, it is necessary to standardize the uptime evaluation methods. That will make it possible to start predicting the reliability of complex networks in the systems engineering of cyber operations.