Any 21st-century company or office is dependent on IT and computer technologies in its everyday life. Even a small failure in IT departments, processes, and equipment can jeopardize the whole business. Regardless of whether the company has an internal IT department or an MSP is assisting, emergency response planning is one of the most critical processes. It makes sure that a company can recover to its regular business operations as soon as possible.
Most offices that neglect the risk assessment planning stage face more challenges during recovery. 90% of companies that don’t have a disaster recovery strategy, or have an incomplete one, fail if something goes wrong. A whopping 40% of small businesses can’t resume operation at all, and 75% don’t have a disaster recovery plan (DRP) in place.
Due to the COVID-19 pandemic, many companies had to shut down, move, or otherwise alter their processes to accommodate remote workplaces. Here is yet another challenge a modern workplace needs to overcome. This unexpected outcome is a great lesson all of us can learn: anything can happen at any time, and the best thing to do is to be ready.
What is Disaster Recovery for IT Professionals?
Usually, disaster recovery and backup plans stay incomplete because the internal IT department is too busy with everyday work. This fact makes it difficult, if not impossible, to allocate extra resources to it. In this case, the wise thing to do is to delegate the planning stage to a managed IT service provider (MSP).
Outsourcing the service can ease a lot of internal tension. An outside look at the company processes might even prove useful in resource and asset allocation during the DRP.
Another step in making sure that everybody in the company is on the same page is to communicate the strategy with all the key stakeholder groups: employees, clients, vendors, and shareholders, to name a few.
In our article, we’ll talk about how to plan and execute the disaster recovery strategy.
What Can Go Wrong?
The first step for the disaster recovery plan is to identify what can fail during your operation. ‘If anything can go wrong, it will.’ This mindset works best during disaster recovery planning. As thorough planners do, we should all prepare for the worst-case scenario.
This includes hardware and power failures, user errors, malware and hacker attacks, and natural disasters.
Some of these threats are more controllable and more likely to be averted. Other failure types, like natural disasters (earthquakes, tsunamis, etc.), can leave not only you but also whole neighborhoods, regions, and countries in disarray for a long time.
However, this doesn’t mean that you can’t plan for them at all. Below, you can see our attempt to think through what can go wrong. Your scope of control and, consequently, your ability to plan decrease from the top of our list to the bottom.
Planning for and preventing an equipment failure is easy enough with regular checkups and updates. However, equipment maintenance can be expensive. Besides, electronics, especially computers, age out quickly, so companies should account for their inevitable depreciation. That said, the rule of thumb is buying new equipment within 5-7 years from the previous purchase.
Malware or Hacking
In a perfect world, the disaster management strategy should protect you from having to deal with any malicious programs and outside attacks. If you have reliable antivirus and prevention mechanisms, you should see far fewer issues. As malware risks keep evolving, expect to revisit this part of your DR plan often.
We recommend you install regular system updates as soon as they’re available; they usually fix bugs and include updated malware definitions.
Your staff members will also need essential cybersecurity and information security training. Ideally, they’ll be able to spot phishing emails, spam links, and outright scams. As a result, they’ll also gain enough knowledge and competency to avoid bringing harm to company equipment, network, and infrastructure unintentionally.
While training can prevent many mistakes, it won’t catch all of them, which justifies the need for more sophisticated and up-to-date network security solutions.
User errors are another training topic. Most of the time, they happen because the staff members haven’t been properly trained on an application or piece of equipment. They might not even imagine the consequences of specific actions.
User errors sometimes occur because of inappropriate user permissions. One way to avoid this is to make sure each employee has the proper system clearance.
The MSP, IT department, and management can decide what each employee needs to access.
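To illustrate the idea of proper system clearance, here is a minimal sketch of a role-based permission check. The role names, permission strings, and helper functions are our own assumptions for illustration; they don’t come from any specific product, and a real setup would rely on your directory service or identity provider.

```python
# Hypothetical role-to-permission map; every name here is illustrative only.
ROLE_PERMISSIONS = {
    "support": {"tickets:read", "tickets:write"},
    "developer": {"repo:read", "repo:write", "staging:deploy"},
    "admin": {
        "repo:read", "repo:write", "staging:deploy",
        "prod:deploy", "backups:restore",
    },
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True only if the role was explicitly granted the permission."""
    # Unknown roles get an empty set, so access is denied by default.
    return permission in ROLE_PERMISSIONS.get(role, set())


def excess_permissions(role: str, granted: set[str]) -> set[str]:
    """Permissions a user currently holds beyond what their role calls for."""
    return granted - ROLE_PERMISSIONS.get(role, set())
```

With this sketch, `is_allowed("support", "prod:deploy")` is denied, and `excess_permissions` can flag accounts that have accumulated more access than their role needs during a periodic audit.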
Loss of Key Personnel
This is yet another external force to anticipate. If one of the managers with special permissions is injured, there should be a way to recover critical information during their absence.
Natural disasters are harder to tackle, and they are usually more widespread. They might affect a larger area like the city, the region, or the whole nation. While natural disasters are more intense, life-or-death scenarios, recovering the service is still critical if the company is an essential service provider.
Once the power and connectivity are up and running (when the external threat is averted), you can put the plan in motion to recover what you can as soon as possible. Some companies that operate in the ‘earthquake zones’ will have dedicated recovery plans just for that event.
DR and Working Remotely
While it’s not possible to give a complete list of threats a business might face (it’s a meticulous process tailored to each company’s scope of operations), there are more places disasters can come from. Few people would have predicted that a pandemic would erupt in the 21st century.
If anything, 2020 has been a real challenge for many companies. Those who were flexible enough to maintain coherence as a remote workplace managed to keep up with the situation. However, most companies were clueless in the face of this particular threat.
Of course, getting ready for a pandemic might not be included in a usual disaster recovery plan, but if you have a working DRP, modifying it to fit the circumstances shouldn’t leave you rattled.
Each company is unique. Building a disaster recovery plan should target specific solutions and fine-tuned steps to lead a particular company out of distress. In this section, we’ll go through the objectives, goals, and the DR scope. We’ll also lay out what an adequate plan should include.
What’s the Purpose of DR?
The first purpose is prevention: all the steps that make it possible to minimize human-made errors, outside attacks, and equipment downtime. These include training the personnel, ensuring the systems are protected, and following a routine maintenance plan. The prevention stage makes sure that the organization can recover in the shortest possible time frame.
For some companies, sustaining operations is the lifeblood of the business. If a business can maintain the bare minimum of processes during the disaster, they’re in good shape. This ability is also called ‘maintaining business continuity,’ where the essential functions can fold and unfold without much effort.
Sustaining some part of the operation during the disaster is possible. However, exploring the company’s potential of how fast it can return to full capacity is the third essential purpose of a comprehensive DR plan.
What’s the objective of DR?
Any disaster recovery plan pursues several key objectives:
- Make sure that people and assets are safe during the disaster
- Continue essential operations
- Minimize financial losses and damages to property
- Minimize disruption
- Remove complex processes from the recovery effort
- Make sure the frontline of the business and its support work flawlessly
What’s the Goal of DR?
When a business is down, a DR manager’s main goal should be to recover it as soon as possible. Although it’s important to consider the peculiarities of the company, you should also keep in mind the two most essential benchmarks of the DR strategy:
- Recovery point objective (RPO)
- Recovery time objective (RTO)
The first one, RPO, denotes the maximum amount of data, usually measured as a window of time before the disruption, that the company can afford to lose. After an incident or a test, you compare the actual data loss against this threshold from the business continuity plan.
Recovery time objective (RTO) is the target duration within which a service must be restored to an agreed service level after the disruption.
In the meantime, another thing to consider is the percentage of the operation the company aims to recover. The aforementioned two benchmarks will be handy during and after the implementation phase, where you’ll need to evaluate the test results.
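To make the two benchmarks more tangible, here is a minimal sketch of how test results could be compared against them. It assumes RPO is read as the maximum tolerable window of lost data and RTO as the maximum tolerable downtime; the class names, fields, and the `evaluate` function are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rpo: timedelta  # assumed: maximum tolerable window of lost data
    rto: timedelta  # assumed: maximum tolerable downtime


@dataclass
class Incident:
    last_good_backup: datetime
    outage_start: datetime
    service_restored: datetime


def evaluate(incident: Incident, targets: RecoveryTargets) -> dict:
    """Compare what actually happened against the RPO and RTO targets."""
    # Everything written after the last good backup is considered lost.
    data_loss_window = incident.outage_start - incident.last_good_backup
    downtime = incident.service_restored - incident.outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= targets.rpo,
        "rto_met": downtime <= targets.rto,
    }
```

For example, with a 4-hour RPO and an 8-hour RTO, a backup taken at 02:00, an outage at 05:30, and service restored at 15:00 would meet the RPO (3.5 hours of lost data) but miss the RTO (9.5 hours of downtime).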
What’s the Scope of DR?
The most important thing from the start is defining what the particular disaster recovery plan can cover and what’s impossible to include due to some resource constraints. Careful consideration of the organization’s scope can make or break the whole strategy. We recommend determining the following essential aspects.
In the planning stage, you need to consider the budget available for the whole operation, including testing. Here, the managers will need to decide upon a few solutions: would it be better if the internal IT department made the plan, or would it be more cost-efficient to outsource the process to an MSP?
It might sound a bit counter-intuitive, but an expert MSP can create a solution tailored to the scope of your current operations.
The companies can also decide to expand the scope of their security solutions. At that point, they should consider a budget-friendly and approachable way to strengthen the security of the current systems.
Adding an expansion roadmap that implements a RAID (Redundant Array of Independent Disks, originally ‘Inexpensive Disks’) setup is one type of upgrade to consider. In its mirrored configurations, this solution duplicates company data across multiple hard drives and keeps all the copies current.
This helps minimize data loss and accelerates the recovery process in the event of a drive failure. Keep in mind that RAID complements backups rather than replacing them, because it doesn’t protect against accidental deletion or malware.
The second essential step is setting a deadline for the planning stage. Most companies develop the plan without having an end date, under the pretense that they need time to make it perfect. As a result, they might end up needing the strategy before they finish building it. Having half a plan can’t be a whole solution; design it thoroughly, then modify it according to test results.
Assets include anything physical or digital: equipment, data, applications, and even the building. The company should go through the whole workflow, then document and audit all of its assets.
After compiling the asset inventory, the company should make sure each asset receives due attention. Inventory checklists and action plans are welcome here. The IT department or the MSP needs to take all measures to ensure asset safety in case of a disaster.
For example, you can’t have open access to important office machines, hard drives, and valuable equipment. Moreover, the ‘value’ doesn’t directly coincide with how expensive the hardware is. A $40 hard drive can be ‘valuable,’ because it contains the company’s client database.
Use door codes or access cards to limit the number of people who have contact with these assets. Monitoring the area with surveillance cameras and schedules is also an option.
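As a rough sketch of how an asset inventory could separate ‘value’ from price, the snippet below records a criticality score next to the purchase price and lists critical assets that aren’t yet behind restricted access. The `Asset` fields, the 1 to 5 criticality scale, and the default threshold are our own assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    purchase_price: float    # hardware replacement cost in USD
    criticality: int         # assumed scale: 1 (low) to 5 (business-critical)
    restricted_access: bool  # True if behind a door code or access card


def protection_gaps(assets: list[Asset], threshold: int = 4) -> list[str]:
    """Names of critical assets that are not yet behind restricted access."""
    return [
        a.name
        for a in assets
        if a.criticality >= threshold and not a.restricted_access
    ]
```

In an inventory where a $40 drive holding the client database has a criticality of 5 and sits on an open shelf, `protection_gaps` would list it, while a $900 reception display with a criticality of 1 would be ignored.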
As disaster recovery planning relates to the company’s everyday workflow, it also closely concerns the personnel and departments. Think about the needs of all the people that are directly or indirectly affected by the disaster.
A carefully-crafted DR plan should listen to the input from:
- Employees – They are the frontline soldiers that need to be informed and capable of following rules and checklists. They need to be trained and ready for any disaster scenarios and know their roles.
- Management – They are the decision-makers who should have a plan of action in place before anything happens. Their input is crucial because of the huge accountability they bear within a company.
- Vendors – Your suppliers need to be in the loop, especially when the company can’t hold up the usual business commitments at that time.
- Clients – Clients are an essential link in the company’s life. Clear communication with them can make or break the whole process. If the system is down, and a client can’t access it, they need to feel included in the information flow.
- Shareholders – This group is a prominent part of larger corporations. They have a crucial role during any disaster because large shareholders are influential people or thought leaders with massive media leverage. Companies should be able to use their power to own the narrative, instead of piling up problems on top of the disaster.
What Should a DR Plan Include?
Main Documentation and Procedures
As described above, the primary documentation makes sure that you’ve considered all the aspects of the recovery plan. It’s the summary of the information we included in the planning phase, minus the budget. This should consist of the following:
- The DRP scope
- The timeline
- The communication mechanisms
- The objectives and goals
- The assets and people
Checklists for Essential Tasks
It’s challenging to make decisions when everything is falling apart. The planning phase is all about eliminating ad-hoc decision-making in the midst of a crisis.
Having inventory and process descriptions for all the categories above won’t make any difference until you have actionable, step-by-step task lists for each of the involved departments and staff members. If everyone knows what they should do, they won’t have much time to panic.
Thorough checklists and task list templates are the most convenient method for disaster communication.
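One way to picture such a task list is as a small data structure where each step has an owner, so everyone can look up their next action. This is only a sketch under our own assumptions: the `Task` and `Checklist` names and their fields are hypothetical, and a shared spreadsheet or ticketing tool would do the same job.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    owner: str   # department or person responsible
    action: str  # what needs to be done
    done: bool = False


@dataclass
class Checklist:
    scenario: str
    tasks: list[Task] = field(default_factory=list)

    def next_task_for(self, owner: str) -> Optional[Task]:
        """First unfinished task for the given owner, in plan order, or None."""
        return next(
            (t for t in self.tasks if t.owner == owner and not t.done),
            None,
        )

    def progress(self) -> float:
        """Fraction of tasks completed (0.0 for an empty checklist)."""
        if not self.tasks:
            return 0.0
        return sum(t.done for t in self.tasks) / len(self.tasks)
```

Keeping the tasks in plan order means nobody has to decide what comes next during a crisis; they just ask for their next unfinished step.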
‘Backup Plans’ and Sub-Plans for All Kinds of Failures
A disaster recovery plan is a document that should solve a problem or give a template for every solution during the disaster. However, situations differ, and it would be naïve to think that every scenario can fit into one plan.
Having ‘plan Bs’ tailored to specific failures can also help. Re-using and repurposing the original plan for other scenarios shouldn’t be difficult at this stage.
Now that you have ‘version 1’ of the strategy document, put it to the test and make changes. However, it isn’t as easy as doing this once. Testing is a cyclical process, where the first results impact ‘plan version 2,’ then ‘testing 2’ results turn into ‘version 3,’ and so on.
We can’t be sure what the best number of ‘rounds’ is for a particular business, but one thing is clear: this is an ongoing process that evolves with the organization.
Testing is the second essential phase of the planning journey. There are several reasons to get into testing as soon as possible. First of all, the stakeholders get acquainted with the documentation and processes.
If you don’t do this, the company’s personnel won’t know what to do during a disaster. Even worse, the documentation that took so much time to put together will just stay on paper, with no use when the problem happens.
Failing to put the documentation through scenarios deprives you of the chance to verify the plan’s efficiency and the system’s reliability. You’ll also fail to determine whether the objectives are achievable. Plus, it will be harder to identify what to improve.
Depending on the needs of the organization, testing methods can be different. In some cases, mixing and matching some options can be even more efficient in getting what you want to achieve. Below are the most popular ones.
The ‘checklist’ or ‘paper test’ is the first stage of testing the plan. Here, one team member from each department reads and annotates the disaster recovery plan and policies. Procedures, timelines, guidelines, checklists, and other constituents of the strategy get a thorough examination. The testing supervisor then analyzes the document and implements the changes needed on the primary copy.
A walkthrough test includes a table reading of the plan where the stakeholders brainstorm about all the aspects of that particular bit they’re considering. Each department sends a representative to voice any thoughts and concerns regarding their workflow.
Tabletop testing is the stage where the participants get to act out ‘what if’ scenarios. The testing supervisor suggests a situation and gathers responses from everybody. This type of testing is also useful for revealing gaps in knowledge the test supervisor hasn’t included in the initial plan.
Technical testing goes into a more hands-on territory. In the three previous methods, the participants virtually walk through the possibilities. Here, they experience the test physically. There are two methods of technical testing: ‘sandbox’ and ‘live interruption.’
Sandbox testing, also called parallel testing, offers a safe space for trial and error. After the IT department migrates the system to a different location (the sandbox), the test starts.
It checks the capacity and functionality of the company’s restore and backup systems. In the past, this type of testing happened on a separate server. Now, companies use a safe space on the cloud.
During ‘sandbox’ testing, the main operation doesn’t stop. Your business continues to work as it did, and your customers don’t feel a difference. The ‘sandbox’ can spot many discrepancies of the new version compared to the original one. This is good, because if you happen to restore your data on a new location during a disaster, you might face a similar problem.
However, parallel testing can’t replace a full testing session. While useful, the ‘sandbox’ doesn’t replace the existing system, so the users cannot access it. Consequently, you won’t necessarily learn anything about the issues a customer might experience on their side.
To solve all the problems mentioned above, you might need an experiment with full interruption. This method disrupts the main service in an attempt to recover it. However, it’s dangerous because in case the trial fails, you might lose the old system. ‘Live’ interruption testing failures may incur a high recovery cost for the company.
One way to prevent losing everything is to migrate the central infrastructure onto some other server, virtual machine, or a cloud. Here, you’ll be able to play out the scenario on the primary system and migrate the original structure from one server to another after the test.
Another method is to restore the original server or a virtual machine, keeping the central server operational. Here, you’ll be able to change the DNS to move traffic to the second server. The main one will stay without traffic and enable you to run tests on the second.
You can also use a load balancer to spread the traffic between two servers during the test. At the end of the experiment, you can remove the load balancer, leaving one of the servers active. With this method, you won’t have any interruption on the central server. This might be the best choice among the ones we described, but it’s also the most expensive.
Technical Testing Levels
The choice of technical testing method depends on what you want to achieve, which part of the system you want to test, and how much money the company is willing to spare for that purpose. You might also want to mix and match testing methods to elevate the efficiency and leverage constraints for each level.
- Verification – This process checks if the backed-up files exist. However, this cannot ensure that the applications will be functional after the recovery.
- Database Mounting – This test is the next level after data verification. It checks whether each backed-up database can be mounted and retains its basic functionality.
- Single Machine Boot – This is the third step, which verifies that a single server can be booted from backup in case it fails.
- Single Machine Boot with Screenshots – In this test, the administrators receive screenshots showing that the server booted, but this doesn’t prove that it will work properly afterward.
- DR Runbook – This one takes Single Machine Boot a level further, testing multiple machines linked together for business purposes.
- Recovery Assurance – This testing level covers it all: applications, multiple machines, service level agreements, and analytics.
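The first level in the list above, verification, can be sketched in a few lines. The example below checks that each backed-up file exists and, as an extra step we’ve added beyond the bare existence check, that its SHA-256 checksum matches a manifest recorded at backup time. The manifest format and the `verify_backup` function are assumptions for illustration, not the behavior of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Return a status per file: 'ok', 'missing', or 'corrupt'.

    `manifest` maps a path relative to `backup_dir` to the checksum
    recorded when the backup was taken (hypothetical format).
    """
    results = {}
    for relative_name, expected in manifest.items():
        candidate = backup_dir / relative_name
        if not candidate.is_file():
            results[relative_name] = "missing"
        elif sha256_of(candidate) != expected:
            results[relative_name] = "corrupt"
        else:
            results[relative_name] = "ok"
    return results
```

As the list notes, a clean result here only shows that the files are present and intact; it says nothing about whether the applications will run after recovery, which is what the higher levels are for.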
According to one study, disaster recovery testing, while deemed necessary, is sorely overlooked in most businesses. DR strategies are either too complex to follow or non-existent. The chilling statistics say that 65% of companies that go the extra mile to set up and test the DR plan fail their own ‘exam’.
The benchmark for testing frequency depends on many factors, like the company’s age and size. If the business is a classic corporate environment with heavy, multilayered operations and employees that have been with the company for a long time, the idea might not stick easily.
If the company is a young startup and hasn’t had a DR plan yet, some aspects of it might even be hardwired into some job descriptions. Backup creation might be a usual workflow checklist item for a programmer or a customer service representative. In this case, you’d just need to document the workflow and put it into the larger plan.
No testing is complete without benchmarks. For the best outcomes, you need to set out recovery time objectives and recovery point objectives (RTO & RPO). As mentioned above, these parameters help you evaluate how well the company managed to deal with the disaster.
Document and Analyze the Test
Documenting the test results is crucial. It helps you understand why some chunks of the plan work and others fail. You’ll have the opportunity to identify any gaps in the plan and eliminate them for the next testing phase.
Trust the Plan
Once you have the strategy in place, the stakeholders will just need to execute it as it is. Nobody will have to make extra decisions or even think for one second.
Trusting and following the plan to the letter during testing is imperative so that you get untainted results. This way, during the assessment phase, you’ll know precisely what works and what doesn’t. As all DR plan tests have to be well-documented and analyzed, only real errors and discrepancies should result in a strategy modification.
Getting real results is the purpose of the tests, so make sure to communicate that all the stakeholders should follow the exact rules in the documentation.
Involve an MSP
You might think that the company’s internal IT department is more than capable of leading this process. However, you should also take into account that their everyday work might be more valuable for the company, and outsourcing disaster recovery planning might be a better choice.
Putting together a DR plan is time-consuming; thus, many CEOs decide to share the burden between the company’s internal IT department and an MSP. A dedicated, competent MSP with the right software can do the job faster.
As you’ve been reading through this in-depth research, the information you gained should have already highlighted the importance of having a tried-and-tested plan.
However, there are a few caveats we need to voice as a conclusion:
- It’s essential to understand the dangers of having an underdeveloped plan. Theoretically, it exists somewhere, but in reality, it’s useless.
- An even worse oversight is building a thorough plan and not testing it. Poorly-tested plans can cause more damage than the underdeveloped ones. While the latter reflects half the reality, the former might cause a bigger mess, misguiding the managers during decision-making.
- A well-written and tested, but old DR plan won’t work either. It doesn’t matter if you pushed it through rigorous experimentation a few years back: whatever the effort, the document doesn’t reflect today’s reality. Businesses change fast, and evolution should apply to the DR strategy, too.
A thorough DR plan is trustworthy. It externalizes the struggle enough to make the disaster recovery a simple set of tasks the company needs to go through. By avoiding the mentioned mistakes, the company will rely on the existing strategy as the shortest line towards fast recovery.
This type of planning, process documentation, and testing should give companies, MSPs using RMM platforms, and stakeholders a hassle-free, smooth transition through whatever disaster or disruption they face.