So many people in our industry equate disaster recovery with high availability. In many cases high availability does mitigate localized issues, but it doesn’t cover the kinds of emergencies that can happen at any time, in any location. I’m talking about the kinds of problems that take entire data centers down, from natural disasters to ransomware.
In the late 1990s I trained for and got my Private Pilot’s license in a single-engine Cessna. One of the important things I learned is that pilots don’t spend a lot of time learning to fly, because flying’s not that hard. What they spend most of their time doing is learning how to react to and recover from emergency conditions. The hours I spent with my instructor practicing recovering from “unusual attitudes” are something I won’t ever forget.
In a similar vein, I find that while it’s fairly common for people in the database technology field to take regular backups (yes, there are exceptions), it’s fairly rare for those same people to actually practice recovering their servers from those backups. It’s good to have backups, but if you haven’t practiced recovering from them, you may discover that important parts of the recovery process are missing at exactly the moment you need to recover.
So, how do we prepare to recover from disaster? (Notice, I didn’t say “prepare for disaster”.) Let’s get started.
First, you need to have a full inventory of your data center: what servers do you have, what components are assigned to each server, and what are the capacities of those components? In other words, what processors does each server have, how much memory, how many disk drives and how big they are, how many network cards and what their specs are, and so on.
Without that inventory, you’re guessing about what needs to be recreated in the event of a data center loss. Your backups depend on specific hardware configurations (enough disk capacity to hold the restored data, for example), and you need to be able to recreate those configurations for your restores to fully succeed and for you to start running again quickly.
Add to that inventory the version, edition, and patch levels of your operating system and server software. You’ll need service accounts (and passwords, if you’re not using Managed Service Accounts). You’ll also need copies of the installation and service pack/cumulative update files. You’ll need to know exactly where those are, and ideally they’re on media that’s not attached to your network, so if the network itself fails you can still access them.
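To make the inventory concrete, here is a minimal sketch of what one server's record might look like in Python. Everything in it is an assumption for illustration: the field names, the sample values, and the helper functions are hypothetical, and a real inventory would carry whatever detail your environment needs. Passwords deliberately stay out of it; they belong in a proper secrets store.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ServerInventory:
    """One server's entry in the recovery inventory (field names are illustrative)."""
    name: str
    cpu_model: str
    cpu_cores: int
    memory_gb: int
    disks_gb: List[int]           # capacity of each disk drive
    network_cards: List[str]      # one spec string per card
    os_version: str
    os_patch_level: str
    server_software: str          # product, version, and edition
    software_patch_level: str
    service_accounts: List[str] = field(default_factory=list)  # names only, never passwords
    install_media_location: str = ""  # where installers and update files are kept


def inventory_to_json(servers):
    """Serialize the inventory so a copy can be stored off the network."""
    return json.dumps([asdict(s) for s in servers], indent=2)


def missing_fields(server):
    """Return the names of fields left empty, so gaps show up before a disaster does."""
    return [k for k, v in asdict(server).items() if v in ("", [], 0)]


# Hypothetical sample record; none of these values come from a real system.
example = ServerInventory(
    name="db01",
    cpu_model="example 16-core CPU",
    cpu_cores=16,
    memory_gb=128,
    disks_gb=[500, 2000],
    network_cards=["10GbE"],
    os_version="ExampleOS 10",
    os_patch_level="patch 3",
    server_software="ExampleDB 2.0 Standard",
    software_patch_level="CU5",
    service_accounts=["svc_db"],
)

print(missing_fields(example))
```

In this sample the location of the installation media was never filled in, so `missing_fields` flags it. That is the kind of gap you want to find at your desk rather than in the middle of a rebuild.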
Once you have all that information recorded, you can then build a plan to recover. Write out a checklist of the steps you need to take to build the data center from your media and then recover it from your backups. Verify that everything in the inventory is covered by entries in the checklist.
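The verification step can be as simple as a set difference. The sketch below assumes each checklist step records which inventory items it rebuilds; the item names, the step descriptions, and the `covers` key are all hypothetical, chosen only to show the idea.

```python
def uncovered_items(inventory_items, checklist):
    """Return inventory items that no checklist step claims to rebuild or restore."""
    covered = set()
    for step in checklist:
        covered.update(step["covers"])
    return sorted(set(inventory_items) - covered)


# Hypothetical inventory items and checklist steps.
inventory_items = [
    "db01 hardware",
    "db01 operating system",
    "db01 database software",
    "db01 service accounts",
]

checklist = [
    {"step": "Provision replacement server", "covers": ["db01 hardware"]},
    {"step": "Install and patch the OS", "covers": ["db01 operating system"]},
    {"step": "Install and patch the database software", "covers": ["db01 database software"]},
]

print(uncovered_items(inventory_items, checklist))
```

Here the service accounts appear in the inventory but no step recreates them, so they show up as uncovered. An empty result only tells you the checklist mentions everything; it takes an actual practice recovery to tell you the steps work.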
Now comes the hard part. Do the recovery. You’ll need someplace to do this that won’t interfere with your regular business, but you have to practice recovering to be certain that you can recover. Just like the fire drills you did in school, the point is to develop a sense memory of what it looks and feels like to start from a blank slate and rebuild your data center. (Think of it as one of those escape rooms, where a full recovery means you get to leave the room.)
How do you ensure success? Just like the answer to the question “how do you get to Carnegie Hall?” – Practice, Practice, Practice.