Published Articles
 

May 1

From Boom to Resume: The Tyranny of the Logistics Time Line
When disaster strikes, even the prepared organization can feel a cold chill as the tyranny of the logistics time line undermines carefully planned recovery time objectives (RTOs). By Dick Benton, principal consultant, GlassHouse Technologies, Inc.

http://www.snseurope.com/snslink/news/news-full.php?id=6735&result=Dick%20Benton

In reviewing disaster plans, it is usual to see the technical staff focused on a technical response that brings end-user applications back within target time frames (RTO, the recovery time objective) and within target tolerances for lost data (RPO, the recovery point objective). Often the technical staff assumes the clock starts ticking when the recovery process begins; however, it is not at all unusual for the business community to expect the clock to start ticking at the time of the actual disaster.
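To make the two clocks concrete, here is a minimal sketch. The date, the six-hour delay before recovery begins, and the four-hour RTO are illustrative assumptions, not figures from any real plan; `rto_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def rto_deadline(clock_start: datetime, rto: timedelta) -> datetime:
    """Return the moment by which service must be restored."""
    return clock_start + rto


# Illustrative assumptions only.
disaster_time = datetime(2007, 5, 1, 2, 0)    # the "Boom"
recovery_start = datetime(2007, 5, 1, 8, 0)   # recovery process begins
rto = timedelta(hours=4)

business_deadline = rto_deadline(disaster_time, rto)   # business view
it_deadline = rto_deadline(recovery_start, rto)        # technical view
gap = it_deadline - business_deadline

print(f"Business expects service by {business_deadline:%H:%M}")
print(f"IT expects to deliver by    {it_deadline:%H:%M}")
print(f"Expectation gap: {gap}")
```

With these assumed figures, the business deadline has already passed before the recovery process even begins, which is exactly the mismatch the planning process needs to surface.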

There are at least 10 steps to move from "Boom" to "Resume," and it is critical to ensure that this logistical pipeline of key steps is highly visible during the planning process. Even more critical is the need for a hard-nosed, pragmatic assessment of best case and worst case timing in each component of the recovery time line. Let's look at 10 of the most obvious steps in this logistics time line.

1. "Boom"
When disaster strikes, it is trite but true that someone needs to notice that something bad has happened. Should an explosion occur in a chemical plant at 2 a.m., it is likely that people in the neighborhood will be rudely awakened, 911 will be called by multiple people, and a host of first responders will arrive within minutes of the event. However, a less public disastrous event may go unnoticed. A water leak or slow-burning fire may exist for many hours before detection and response occur. Best case/worst case scenarios will depend on many attributes: industry, geography, disaster type, organizational culture, etc. But this first step is critical. The time from "Boom" to someone actually noticing that a "Boom" has occurred can exceed the entire RTO estimate in many DR plans, all before recovery has even started. Estimate your best case/worst case for this component of the logistics time line.

2. Disaster is Declared
Best practice calls for nominated staff to attend such disaster events and assess the necessity of formally declaring a disaster situation. The disaster declaration triggers the initiation of the recovery plan. The key point here is that the disaster itself does not necessarily trigger recovery except in the most sophisticated self-monitoring and fully automated/clustered failover environment. Consider the most realistic situation, where your key staff must determine and declare a disaster. Enter best case and worst case into the time line. It would be prudent to develop a written rationale for these estimates.

3. DR Team is Notified
Once the disaster is declared, the pre-planned notification process is initiated. A disaster declaration does not necessarily result in the recovery process being initiated in IT; rather, it results in team members being notified and responding to the notification either by remote access to the DR site or by assembling at the target location(s). This activity is fraught with peril and unplanned outcomes. Unless a hot site exists and the communication links for remote administration survive, recovery can commence only when the team is assembled at the target site. This activity can take several hours. In extreme circumstances where travel options are limited (bridges down, airports closed, networks disrupted), it may well take considerably longer, perhaps 24 hours or more. And this assumes the team members will leave their families to go to work. In extreme disasters, such as an earthquake or flood, this may not happen. Enter your best estimate of the minimum time required for this activity along with an estimate of a realistic worst case.

4. DR site is powered up
At last the team arrives at the recovery site or has locked in remote access. Hopefully the data and hardware for recovery are pre-positioned and running hot. If not, the power-up process can be quite time-consuming. The arrival of tape can also be subject to many of the same delays that affect the recovery team and its ability to travel to the recovery site. Worse, if the site is a shared facility, additional logistical challenges may be encountered as several organizations arrive in the same time frame, all expecting a somnolent DR site to suddenly leap into high gear. How long will it really take before the first server is powered up and the data recovery process started? Unless the site is already hot and ready, two to 12 hours might not be uncommon. What will your entries be for best and worst case?

5. Network is brought up
Hopefully, network connections have been pre-positioned at the recovery site and are kept live to avoid any additional delay in this area. This is the case in most well-managed network environments. If this task has not been well planned and prepared, it may take considerable time to restart the various network services. Firing up and/or testing this process can run from a highly effective one hour to a panic-stricken 24 hours of endless troubleshooting by tired and stressed-out network staff. Review your network plans and determine best and worst case estimates for this component of the logistics time line. Consider what tasks can run in parallel with other components on the time line.
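The effect of running tasks in parallel can be sketched with the ranges quoted above: two to 12 hours for site power-up and one to 24 hours for the network. That these two steps can fully overlap is an assumption made for illustration; in practice, part of the network work may depend on the site being powered.

```python
from typing import List, Tuple

# (best_hours, worst_hours) per step, using the ranges quoted in the text.
Estimate = Tuple[float, float]

power_up: Estimate = (2, 12)
network: Estimate = (1, 24)


def sequential(steps: List[Estimate]) -> Estimate:
    """Steps run one after another: elapsed times add up."""
    return (sum(b for b, _ in steps), sum(w for _, w in steps))


def parallel(steps: List[Estimate]) -> Estimate:
    """Steps run side by side: the slowest one sets the elapsed time."""
    return (max(b for b, _ in steps), max(w for _, w in steps))


seq = sequential([power_up, network])
par = parallel([power_up, network])

print(f"Sequential: best {seq[0]} h, worst {seq[1]} h")
print(f"Parallel:   best {par[0]} h, worst {par[1]} h")
```

Under that assumption the worst case drops from 36 hours to 24, while the best case improves by only one hour, so overlapping work matters most when things go badly.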

6. Data is recovered
Data is now restored from DR protection media. Unless some form of disk replication has been utilized and the data is already in place, this function will typically involve tape recovery. Tape protection is often configured to minimize backup time. The very functions that do this militate against quick and effective recovery of an individual server. Unless tape backup has been planned to facilitate orderly and sequential recovery of specific high-priority servers, a nightmare scenario can occur as data for a key server has to be pulled from a tape interspersed with data from other noncritical servers. Many of us have seen or heard the horror story of the two-hour backup that took five days to recover. Look at how you plan your backup and determine best case and worst case for at least your Tier 1 applications.
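A crude model shows why interleaving hurts single-server recovery. It assumes equal-sized backup streams interleaved evenly on the tape, and a drive that must read past every block to reach the ones it needs; the sizes, read rate, and stream count are hypothetical, and real backup products behave differently in detail.

```python
def restore_hours(server_gb: float,
                  read_gb_per_hour: float,
                  interleaved_streams: int = 1) -> float:
    """Estimate hours to restore one server from tape.

    Simplifying assumption: with N equal streams interleaved on the tape,
    the drive reads N times the server's data to recover all of it.
    """
    data_to_read_gb = server_gb * interleaved_streams
    return data_to_read_gb / read_gb_per_hour


# Hypothetical figures: a 200 GB server and a drive reading 100 GB/hour.
dedicated = restore_hours(200, 100)                           # tape holds only this server
interleaved = restore_hours(200, 100, interleaved_streams=8)  # shared with 7 other servers

print(f"Dedicated tape:   {dedicated:.1f} h")
print(f"Interleaved tape: {interleaved:.1f} h")
```

Even this simple model turns a two-hour restore into a 16-hour one, before counting tape mounts, seeks, or retries.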

7. Database is Validated and Stabilized
Once data has been recovered, there is often a requirement for database administrators to take additional steps to ensure the physically recovered data has logical integrity. This can be facilitated by DR-aware applications, transaction bracketing, and checkpoint rollback functionality, but except in the most disciplined and well-funded organizations, the DBA will need to examine the critical databases in Tier 1 production before they can be restarted. In an organization that has not placed sufficient emphasis on consistency groups for DR data protection, and whose applications may be less than DR-aware, this process can range from one to 24 hours or more. Consider at least your Tier 1 applications and determine the best and worst cases you should enter into this component of the logistics time line.

8. Application servers started
At last it's time to start the key applications. One might think this should be the easiest component, but again our best-laid plans may founder as revision-level dependencies appear at every layer in the system. Typically, the DR plan will include a function to maintain appropriate revision levels and patch levels at the DR site. This may prove unsuccessful at the actual recovery time because of less than effective change control, emergency changes, rapid implementations, or environment consistency upgrades that run asynchronously. Study your organization's approach to these issues to identify a realistic best case and worst case scenario. As always, it's prudent to document your assumptions.

9. Users brought on line
At last the users can be brought back on line (wherever they are) and given access to their recovered applications. During the recovery you probably took your phone off the hook to avoid 3,257 calls from each user wondering when the application would be available. You probably fielded several calls from irate managers apparently unaware of the target RTO. And you probably redirected to your boss several senior executives trying to jump the queue for their business division. Once users return, the fun starts. Presumably you recovered the help desk function first, but even so, what is a realistic expectation of help desk call response during the immediate period following restart? It can be overwhelming. Users are normally brought back in a staged approach. Each recovery tier is staged, and within each tier the individual applications are staged. There may even be a need to stage the access of end users to a specific application. Depending on all these aspects, not every end user in Tier 1 will be brought back at the same time. Understand realistically when the first group of users can return to production in Tier 1 and when the last group can return to full productivity in Tier 1. This will provide a best case and worst case scenario for this component of the logistics time line.

10. Audit review of resumption
And now that you have won the battle, achieved the impossible, triumphed over adversity, walked on water, and produced various other miracles, the auditors arrive to shoot the wounded. Or so it seems. This is a critical component of the recovery plan. After a tremendous focus on the operational aspects of recovery, it's now time to immediately, and with the same urgency, think about the tactical aspects of continued operation, security, and data protection at the recovery site. It is necessary to demonstrate this capability both to yourself and to others: auditors, business units, senior management, etc. Think about what effort it will take to get your recovered operations environment back to the security and data protection standards you had at your old production site. Consider best and worst case, as usual, and of course, document your assumptions and rationale.

By the way, just when you thought it was all over and you could take a well-earned vacation in Hawaii, it's time to start planning the transition from the recovery site back to the original production site, or perhaps to a new or relocated production site.

Summary
So, draw up your critical time line, map in each activity, event, or component that is relevant to your organization, and try to chunk things up initially to avoid getting lost in details. Then enter against each activity your estimate of best case and worst case. Include a footnote referring to a document that details your assumptions and other attributes that factored into the decision-making process.
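A time line of this kind can be kept as a simple table of steps with best and worst case hours. The sketch below is one possible layout; every hour figure in it is a placeholder chosen for illustration, and the `Step` class and `totals` function are hypothetical names. Substitute your own documented estimates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    best_hours: float
    worst_hours: float


# Placeholder estimates only; replace with figures backed by your
# documented assumptions.
TIME_LINE: List[Step] = [
    Step("1. Boom noticed", 0.25, 8),
    Step("2. Disaster declared", 0.5, 4),
    Step("3. DR team notified", 2, 24),
    Step("4. DR site powered up", 2, 12),
    Step("5. Network brought up", 1, 24),
    Step("6. Data recovered", 4, 120),
    Step("7. Database validated", 1, 24),
    Step("8. Application servers started", 0.5, 6),
    Step("9. Users brought on line", 1, 8),
    Step("10. Audit review", 2, 16),
]


def totals(steps: List[Step]) -> Tuple[float, float]:
    """Sum best and worst case hours, assuming strictly sequential steps."""
    best = sum(s.best_hours for s in steps)
    worst = sum(s.worst_hours for s in steps)
    return best, worst


best_total, worst_total = totals(TIME_LINE)

print(f"{'Step':<34}{'Best':>8}{'Worst':>8}")
for s in TIME_LINE:
    print(f"{s.name:<34}{s.best_hours:>8.2f}{s.worst_hours:>8.2f}")
print(f"{'Total (hours)':<34}{best_total:>8.2f}{worst_total:>8.2f}")
```

Summing assumes the steps run strictly in sequence; where steps can overlap, the totals will be lower.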

Here's where the fun starts. Run the best case scenario and see how close it comes to your most aggressive RTO. Now there needs to be an iterative analytical and development process to ensure business unit needs are aligned with IT RTO targets, and that these targets are actually achievable when one considers the impact of each event in the logistics time line. The outcome of this study can have a wide-ranging impact on budgets, business unit expectations, reference architecture framework, and post-disaster career opportunities.
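Running the best case against an RTO can be sketched as a running total that reports the step at which the target is first exceeded. The eight-hour RTO and all step durations below are hypothetical, the steps are assumed to run sequentially, and the clock is assumed to start at the disaster itself.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

# Hypothetical best case hours for the steps up to users coming back on line.
best_case_hours: List[Tuple[str, float]] = [
    ("Boom noticed", 0.25),
    ("Disaster declared", 0.5),
    ("DR team notified", 2),
    ("DR site powered up", 2),
    ("Network brought up", 1),
    ("Data is recovered", 4),
    ("Database validated", 1),
    ("Application servers started", 0.5),
    ("Users brought on line", 1),
]
rto_hours = 8.0  # hypothetical target, measured from the disaster


def first_breach(steps: List[Tuple[str, float]],
                 rto: float) -> Optional[Tuple[str, float]]:
    """Return (step name, elapsed hours) where the running total first exceeds the RTO."""
    running = accumulate(hours for _, hours in steps)
    for (name, _), elapsed in zip(steps, running):
        if elapsed > rto:
            return name, elapsed
    return None  # the whole time line fits inside the RTO


breach = first_breach(best_case_hours, rto_hours)
total = sum(hours for _, hours in best_case_hours)
shortfall = total - rto_hours

print(f"Best case total: {total} h against an RTO of {rto_hours} h")
if breach is not None:
    print(f"RTO first exceeded during '{breach[0]}' at {breach[1]} h elapsed")
    print(f"Shortfall to close: {shortfall} h")
```

With these assumed figures, even the best case misses the target by more than four hours, and the breach occurs during data recovery. That is the kind of finding that should drive the iterative conversation between IT and the business units.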

 

 

  © Copyright 2001 - 2007 GlassHouse Technologies, Inc. All Rights Reserved.
