
Drive down network costs: optimize your network architecture for maximum availability and...

By Weiss, David
Publication: Communications News
Date: Wednesday, November 1 2006

Nobody likes network downtime. It strikes at the heart of corporate profitability and its highly visible nature always leads to unhappy customers and upper management questions: Why did this happen? How can it be prevented? Don't we already have procedures in place? That is, until budget time comes around and the redundancies, staffing and training are hard to come by. The systems get more complex, and the goal remains the same--keep the network up.

Setting realistic goals and then allocating sufficient resources to achieve them is the ideal scenario for uptime management in any organization. A simple stepwise procedure can lead you from having downtime manage your organization to your organization managing its own uptime.

There are many resources available to assist in developing a cohesive and comprehensive uptime management plan. The National Institute of Standards and Technology (NIST) produced a contingency planning guide for information technology systems, which is an invaluable document for any IT organization. It outlines a seven-step approach:

1. Develop the contingency planning policy statement.

2. Conduct a business impact analysis.

3. Identify preventive controls.

4. Develop recovery strategies.

5. Develop an IT contingency plan.

6. Plan testing and training exercises.

7. Plan maintenance.

A lot of information on high-availability measurements and standards is available. The most common metric of availability is expressed in "nines" as in "five nines of reliability." This refers to 99.999% availability, or only about five minutes of unplanned downtime per year.
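The arithmetic behind the "nines" is easy to check. The short Python sketch below converts an availability percentage into a yearly downtime budget; the function name and the choice of an average 365.25-day year are assumptions made for illustration, not part of any standard.

```python
# Assumption: an average year of 365.25 days, to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare the budget for two through five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, while three nines (99.9%) allows nearly nine hours.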

That metric alone, though, is not enough to set a target for your company's high-availability planning. Is one five-minute outage a year the same as five one-minute outages throughout the year? Is downtime at 3 a.m. the same as downtime at 3 p.m.? Ultimately, the true impact of downtime is determined by the user's experience, together with lost revenue, lost opportunities and the resources diverted to firefighting instead of planned activities.

Achieving the lofty goal of five nines sounds like it should be every organization's objective, but any downtime that can be restored in less than five minutes implies a fully automated system. With an outage of any complexity, a human cannot recognize, analyze, diagnose, formulate a plan and implement it in five minutes. Just rebooting one server can eat up most of your time budget for the year.

A plan that provides reasonable expectations should be created, based on: the criticality of the network; the negative impact of downtime; and the available resources to increase uptime. Take an honest reckoning of what will be acceptable downtime and build the systems necessary to achieve it. When the inevitable downtime occurs, remind yourself (and upper management) that it was all part of the plan.

In addition to availability, several other measurements should be considered:

Mean time to repair. You can make the system reliable, but failures will eventually happen. This measures the time from failure to recovery, once the problem is diagnosed.

Affected users. Take time to think about an outage that lasts only one minute, but affects 1,000 users, versus an outage that affects one user for 1,000 minutes. Which is worse for your organization?

Potential affected users. If a 10,000-subscriber cable TV system goes out, but only 10% of the homes have TVs on, then the number of potentially affected users is 10,000 but the number actually affected is only 1,000.
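One way to put these user-based measurements side by side is to record each outage with its duration, its potential users and its affected users, then compute user-minutes. The Python sketch below is illustrative only: the Outage class and its field names are hypothetical, the potential-user counts for the first two outages are assumed equal to the affected counts, and the 30-minute duration given to the cable example is an assumed figure.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    label: str
    duration_minutes: int
    potential_users: int  # everyone who could have been affected
    affected_users: int   # users actually using the service at the time

    def user_minutes(self) -> int:
        """Total minutes of service lost across all affected users."""
        return self.duration_minutes * self.affected_users

    def affected_share(self) -> float:
        """Fraction of potential users who were actually affected."""
        return self.affected_users / self.potential_users


broad = Outage("one minute, 1,000 users", 1, 1_000, 1_000)
narrow = Outage("1,000 minutes, one user", 1_000, 1, 1)
# Assumption: the 30-minute duration is hypothetical, chosen for illustration.
cable = Outage("cable TV system", 30, 10_000, 1_000)

for outage in (broad, narrow, cable):
    print(outage.label, outage.user_minutes(), f"{outage.affected_share():.0%}")
```

The first two outages produce the same user-minute total, which is why a single number cannot answer which is worse; that judgment depends on the organization.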

A standard calculation of loss can be summarized as: L = P x T x (C_r + C_p), where P is the probability that a disaster will occur, expressed as a percentage; T is the length of the downtime; and C is the cost attributed to being down per unit of time, made up of lost revenue (C_r) and lost productivity (C_p).

This measurement has to be done for each failure point in the system, as each has its own probability and cost impact. When all of the possible downtime costs for various scenarios are estimated, the cost of lessening the risks can be compared to the probability and costs associated with those risks.
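The comparison described above can be sketched in a few lines of Python. The sketch reads the cost C as the sum of lost revenue and lost productivity per hour, and multiplies it by the probability and the length of the downtime. The class, the function and every figure in the example are hypothetical, and the probability is entered as a fraction (0.05 for 5%) rather than a percentage.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    probability: float                 # P, as a fraction (0.05 means 5%)
    downtime_hours: float              # T
    revenue_cost_per_hour: float       # lost revenue per hour of downtime
    productivity_cost_per_hour: float  # lost productivity per hour of downtime

    def expected_loss(self) -> float:
        """Probability x downtime x (revenue cost + productivity cost)."""
        cost_per_hour = self.revenue_cost_per_hour + self.productivity_cost_per_hour
        return self.probability * self.downtime_hours * cost_per_hour


def mitigation_pays_off(before: FailurePoint, after: FailurePoint,
                        mitigation_cost: float) -> bool:
    """True when the reduction in expected loss exceeds the mitigation cost."""
    return before.expected_loss() - after.expected_loss() > mitigation_cost


# Hypothetical figures: a single WAN link, before and after adding a second carrier.
single_link = FailurePoint("single WAN link", 0.05, 4.0, 10_000.0, 2_500.0)
dual_link = FailurePoint("diverse-routed links", 0.05, 0.25, 10_000.0, 2_500.0)

print(f"Expected loss before: {single_link.expected_loss():,.2f}")
print(f"Expected loss after:  {dual_link.expected_loss():,.2f}")
print("Worth 1,500 to fix?", mitigation_pays_off(single_link, dual_link, 1_500.0))
```

Repeating the calculation for each failure point gives a rough ranking of where spending on redundancy buys the most risk reduction.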

In examining key failure points of your operation--equipment, connectivity, processes and staffing--look for ways to mitigate risk with built-in redundancies and automated procedures that can shave downtime to a minimum. Sometimes, simple solutions can provide cost-effective means to reduce the likelihood of failures, and to shorten their duration when things do go amiss.

Fault tolerance and redundancies can be built into most systems and processes. Standard techniques, such as RAID arrays, high-availability clustering, hot sites and protection switching, can be employed wherever possible to provide alternate resources that can be brought to bear when necessary. Battery backup, standby generators and diversity routing from multiple telecom providers can also be used.

In considering redundancy, do not forget the human element. Adequate staffing and cross training are often overlooked. In the event of a region-wide outage due to a hurricane, local staff may not be able to access the necessary facilities, or they may be consumed with personal issues. In these cases, remote access from staff outside the affected area can make all the difference.

If these redundancies can be brought to bear without human intervention, critical time can be saved. System and environmental monitoring solutions and services can alert personnel to potential problems before downtime occurs and automatically trigger redundant fail-over switching when predetermined conditions are met.
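As a rough illustration of a "predetermined condition," the Python sketch below fails over once a set number of consecutive health checks have failed. It is a simplified model, not a description of any particular monitoring product: the function name, the threshold and the list of simulated readings are all assumptions.

```python
from typing import Callable, Iterable


def watch_and_fail_over(readings: Iterable[bool],
                        failure_threshold: int,
                        fail_over: Callable[[], None]) -> bool:
    """Call fail_over once failure_threshold consecutive checks have failed.

    readings yields True for a healthy check and False for a failed one.
    Returns True if a fail-over was triggered, otherwise False.
    """
    consecutive_failures = 0
    for healthy in readings:
        if healthy:
            consecutive_failures = 0  # a good check resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            fail_over()
            return True
    return False


events = []
# Simulated health checks: one isolated failure, then three in a row.
simulated = [True, False, True, False, False, False, True]
triggered = watch_and_fail_over(simulated, 3, lambda: events.append("switched to standby"))
print(triggered, events)
```

Requiring several failures in a row is one common way to avoid switching on a single missed check; the right threshold depends on how often checks run and how costly a needless switch would be.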

The human element can never be fully automated away, but clear-cut procedures for identification, notification, mitigation, escalation and resolution can reduce downtime and possibly prevent costly mistakes that are often born of crisis thinking. Reliability, availability and scalability have direct impacts on customer satisfaction, employee productivity and revenue generation. Optimizing your network for maximum availability is not just smart business; it is critical for business continuity and long-term success.

For more information: rsleads.com/611cn-255

David Weiss is CEO of Dataprobe, Paramus, N.J.
