Posts Tagged ‘disaster recovery’
EMC’s Recoverpoint has always fascinated me. The technology that manages to split writes out, journal them and replicate them is awesome. Unfortunately, as with many cool technology features, it was complex and prone to error if not done exactly right. Today EMC announced something that makes very cool technology WAY MORE cool: Recoverpoint will sit inside the hypervisor. What you can do then is mind blowing!
A very quick intro into EMC Recoverpoint
Recoverpoint is basically a snapshot and replication technology that is independent of the used storage architecture. And much more. And with limitations. Basically…
After a rather successful part 2 of this series, it is high time to kick off part 3, which covers Replication and Disaster Recovery (DR). The most important thing to note is that backup and DR are two completely different things, and you should not be tempted to combine them unless you are positive your solution will cover all business requirements for both DR and backup.
Almost every enterprise setup for ESX features multiple ESX nodes, multiple failover paths, multiple IP and/or fiber switches… But having multiple SANs is hardly ever done, except in Disaster Recovery environments. But what if your SAN decides to fail altogether? And even more important, how can you prevent impact if it happens to your production environment?
Using a DR setup to cope with SAN failure
One option to counter the problem of total SAN failure would of course be to use your DR-site’s SAN, and perform a failover (either manual or via SRM). This is kind of a hard call to make: Using SRM will probably not get your environment up within the hour, and if you have a proper underlying contract with the SAN vendor, you might be able to fix your issue on the primary SAN within the hour. No matter how you look at it, you will always have downtime in this scenario. But in these modern times of HA and even Fault Tolerance (vSphere4 feature), why live with downtime at all?
Using vendor-specific solutions
A lot of vendors have thought about this problem, and especially in the IP-storage corner one sees an increase in “highly available” solutions. Most of the time relatively simple building blocks are stacked, so that the stack as a whole can survive a SAN (component) failure. This is one way to cope with issues, but it generally has a lot of restrictions – such as vendor lock-in and an impact on performance.
Why not do it the simple way?
I have found that simple solutions are generally the best solutions. So I tried to approach this problem from a very simple angle: From within the VM. The idea is simple: You use two storage boxes which your ESX cluster can use, you put a VM’s disk on a LUN on the first storage box, and you simply add a software mirror on a LUN on the second storage box. It is almost too easy to be true. I used a Windows 2003 server VM, converted the boot drive to a dynamic disk, added a second disk (placed on the second storage box) to the VM, and chose “add mirror” on the boot disk with that second disk as the target.
Unfortunately, it did not work right away. As soon as one of the storages fails, VMware ESX reports “SCSI BUSY” to the VM, which will cause the VM to freeze forever. After adding the following to the *.vmx file of the VM, things got a lot better:
scsi0.returnBusyOnNoConnectStatus = "FALSE"
Now, as soon as one of the LUNs fails, the VM has a slight “hiccup” before it decides that the mirror is broken, and it continues to run without issue or even lost sessions! After the problem with the SAN is fixed, you simply perform an “add mirror” within the VM again, and after syncing you are ready for your next SAN failure. Of course you need to remember that if you have 100+ VMs to protect this way, there is a lot of work involved…
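With that many VMs, the *.vmx edit itself is something you would want to script. Below is a minimal Python sketch of that idea, assuming you run it against the text of a *.vmx file while the VM is powered off. The helper name set_vmx_option is made up for illustration; only the option name and value come from the text above.

```python
import re

def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key = "value"` set.

    An existing entry for the key is replaced; otherwise the entry is
    appended. Hypothetical helper, not part of any VMware tooling.
    """
    line = f'{key} = "{value}"'
    # Match the whole line that currently defines this key, if any.
    pattern = re.compile(
        rf'^[ \t]*{re.escape(key)}[ \t]*=.*$',
        re.IGNORECASE | re.MULTILINE,
    )
    if pattern.search(vmx_text):
        # A function replacement avoids backslash interpretation in `line`.
        return pattern.sub(lambda match: line, vmx_text)
    if vmx_text and not vmx_text.endswith("\n"):
        vmx_text += "\n"
    return vmx_text + line + "\n"
```

Read the file, pass its contents through the helper with scsi0.returnBusyOnNoConnectStatus and FALSE, and write the result back; keep a copy of the original *.vmx in case you need to revert.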
This has proven to be a simple yet very effective way to protect your VMs from a total (or partial) SAN failure. A lot of people do not like the idea of using software RAID within the VMs, but eh, in the early days, who gave ESX a thought for production workloads? And just to keep the rumors going: To my understanding vSphere is going to be doing exactly this from an ESX point of view in the near future…
To my knowledge, at this time there are no alternatives besides the two described above to survive a SAN failure with “no” downtime (unless you go down the software clustering path of course).
You want to have some form of fast and easy Disaster Recovery, but you do not want to spend a lot of money in order to get it. What can you do? You might consider buying two SANs, and leaving out SRM. That will work, it will make your recovery and testing more complex, but it will work. But even then, you still have to buy two SANs, the expensive WAN etc. What if you want to do these things on a budget?
DR – What does that actually mean?
More and more people are starting to implement some form of what they call Disaster Recovery. I too am guilty of misusing the name Disaster Recovery (who isn’t). My point is, the tape backups we have been making for ages are also part of Disaster Recovery. Your datacenter explodes, you buy new servers, you restore the backups. There you go: Disaster Recovery in action. What comes within reach now, for the larger part because of virtualization, is what is called Disaster Restart. This is when no complex actions are required, you “press a button” and basically – you’re done. I conveniently kept the title to “DR”, which kind of favors both 🙂
Products like VMware SRM make the restart after a disaster quite easy, and more important, for the larger part you can actually test the failover without interrupting your production environment. This is a very impressive way of doing Disaster Restarting, but still quite a lot of money is involved. You need extra servers, you need an extra (SRM supported!) SAN in order to get this into action.
Recovering or Restarting from a disaster is all about RPO and RTO – The point in time to recover to, and the time required to get your server up and running (from that point in time). The smaller the numbers, the more expensive the solution. Now let’s put things in reverse. Why not build a DR solution with esXpress, and see how far we get!
DR setup using esXpress
The setup is quite simple. EsXpress is primarily a backup product, and that is just what we are going to set up first. Let’s assume we have two sites. One is production with four ESX nodes, and the other site with two nodes is the recovery site (oops restarting site). For the sake of evading these terms, we’ll use Site-A and Site-B 🙂
At Site-A, we have four nodes running esXpress. At site-B, we have one or more FTP servers running (why not as a VM !) which receive the backups over the WAN. Now, Disaster Recovery is in place, since all backups go off-site. Now all we have to do, is try and get as near to Disaster Restart as we can get.
For the WAN link, we basically need the bandwidth to perform the backups (and perhaps to use for regular networking in case of failover). The WAN could be upgraded as needed, and you can balance between backup frequency versus available bandwidth. EsXpress can even limit its bandwidth if required…
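To get a feel for that balance, here is a rough back-of-the-envelope sketch in Python. It assumes decimal gigabytes and a fixed usable fraction of the link; the function names and the efficiency parameter are my own assumptions for illustration, not anything esXpress provides.

```python
def transfer_hours(backup_gb, wan_mbit_per_s, efficiency=0.8):
    """Hours needed to push one backup run of backup_gb over the WAN.

    efficiency is an assumed fraction of the link that backups can
    actually use (protocol overhead, other traffic).
    """
    if backup_gb <= 0 or wan_mbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and bandwidth must be positive, efficiency in (0, 1]")
    megabits = backup_gb * 8 * 1000  # decimal GB -> megabits
    seconds = megabits / (wan_mbit_per_s * efficiency)
    return seconds / 3600


def max_backups_per_day(backup_gb, wan_mbit_per_s, efficiency=0.8):
    """How many such backup runs fit in 24 hours if the link does nothing else."""
    return int(24 // transfer_hours(backup_gb, wan_mbit_per_s, efficiency))
```

Feed it the size of a typical DELTA run rather than a FULL, since that is what goes over the wire most days. If the result is lower than the backup frequency you had in mind, you either back up less often or upgrade the WAN.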
All backups now reside on the FTP server(s) on Site-B. If we were to install esXpress on the ESX nodes at Site-B as well, all we need to do is use esXpress to restore the backups there. And it just so happens that esXpress has a feature for this: Mass Restores.
When you configure mass-restores, the ESX nodes at Site-B are “constantly” checking for new backups on the FTP servers. As soon as a backup finishes, esXpress at Site-B will discover this backup, and start a restore automatically. Where does it restore to? Simple! It restores to a powered-off VM at Site-B.
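The selection logic behind such a loop can be sketched in a few lines of Python. To be clear, this is not how esXpress implements mass restores; it is only my illustration of "find finished backups that have not been restored yet". The Backup record and the pending_restores function are hypothetical, and the sketch ignores how FULL and DELTA backups are combined during a restore.

```python
from typing import NamedTuple


class Backup(NamedTuple):
    vm: str            # name of the VM the backup belongs to
    finished_at: int   # completion time, e.g. a Unix timestamp


def pending_restores(backups, last_restored):
    """Pick, per VM, the newest finished backup that is not restored yet.

    backups:        iterable of Backup records found on the FTP server
    last_restored:  dict mapping VM name -> finished_at of the backup that
                    was last restored to the powered-off standby VM
    """
    newest = {}
    for backup in backups:
        current = newest.get(backup.vm)
        if current is None or backup.finished_at > current.finished_at:
            newest[backup.vm] = backup
    todo = [
        backup for backup in newest.values()
        if backup.finished_at > last_restored.get(backup.vm, float("-inf"))
    ]
    # Restore the oldest outstanding backup first.
    return sorted(todo, key=lambda backup: backup.finished_at)
```

Run something like this on a timer, restore whatever it returns, and record the restored timestamp per VM; that is the whole "constantly checking" idea in a nutshell.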
What this accomplishes is that at Site-B you have your backups of your VMs (with their history captured in FULL and DELTA backups), and the ability to put them to tape if you like. You also have each VM (or just the most important ones if you choose) in the state of the last successful backup standing there, just waiting for a power-on. As a bonus on this bonus, you also have just found a way to test your backups on the most regular basis you can think of – every single backup is tested by actually performing a restore!
What does this DR setup cost?
There is no such thing as a free lunch. You have to consider these costs:
- Extra ESX servers (standby at the recover/restart site) plus licenses; ESXi is not supported by esXpress (yet);
- esXpress licenses for each ESX server (on both sites);
- A speedy WAN link (fast enough to offload backups);
- Double or even triple the amount of storage on the recover/restart site (space for backups+standby VMs. This is only a rough rule-of-thumb).
Still, way below the costs of any list that holds two SANs and SRM licenses…
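The "double or even triple" storage rule of thumb from the list above can be made a bit more tangible with a small Python sketch. Every parameter here (number of FULLs and DELTAs kept, the size of a DELTA relative to a FULL) is an assumption for illustration; plug in your own retention settings and measured sizes.

```python
def dr_storage_gb(production_gb, fulls_kept=1, deltas_kept=6, delta_ratio=0.1):
    """Rough storage estimate for the recover/restart site.

    production_gb: total size of the VMs you protect
    fulls_kept:    number of FULL backups retained (assumed)
    deltas_kept:   number of DELTA backups retained (assumed)
    delta_ratio:   assumed size of one DELTA relative to a FULL
    """
    standby_vms = production_gb  # one restored, powered-off copy per VM
    backups = production_gb * (fulls_kept + deltas_kept * delta_ratio)
    return standby_vms + backups
```

With one FULL and no DELTAs you land at exactly double the production size, with two FULLs at triple, which is where the rule of thumb comes from. Compression on the backups would lower the number; this sketch deliberately leaves that out.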
So what do you get in the end?
Final question of course, is what do you get from a setup such as this? In short:
- Full-image Backups of your VMs (FULLs and DELTAs), which are instantaneously offloaded to the recover/restart site;
- The ability to make backups more than one time per 24 hours, tunable on a “per VM” basis;
- Have standby VMs that match the latest successful backup of the originating VMs;
- Failover to the DR site is as simple as a click… shiftclick… “power on VMs” !;
- Ability to put all VM backups to tape with ease;
- All backups created are tested by performing automated full restores;
- Ability to test your Disaster Restart (only manual reconnection to a “dummy” network is needed in order not to disturb production);
- RTO is short. Very short. Keep in mind, that the RTO for one or two VMs can be longer if a restore is running at the DR site: The VM being restored has to finish the restore before it can be started again;
- Finally (and this one is important!), if the primary site “breaks” during a replication action (backup action in this case), the destination VM is still functional (in the state of the latest successful backup made).
Using a setup like this is dirt-cheap when compared to SRM-like setups; you can even get away with using local storage only! The RPO is quite long (in the range of several hours to 24 hours), but the RTO is short: in a smaller environment (like 30-50 VMs) the RTO can easily be shorter than 30 minutes.
If this fits your needs, then there is no need to spend more – I would advise you to look at a solution like this using esXpress! You can actually build a fully automated DR environment without complex scripting or having to sell your organs 😉 . You even get backup as a bonus (never confuse backup with DR!)
More and more vendors of SANs and NASses are starting to add synchronous replication to their storage devices – some are even able to deliver the same data locally on different sites using NFS. This sounds great, but more and more people tend to use VMware clusters across sites – and that is where it goes wrong: VMs run here, using storage there. It all becomes “quantum entangled”, leaving you nowhere when disaster strikes.
These storage offerings are causing people to translate this into creating a single VMware HA-cluster across sites. And really- I cannot blame them. It all sounds too good to be true: “If an ESX node at site A fails, the VMs are automagically started on an ESX server at another site. Better yet, you can actually VMotion VMs from site A to site B and vice versa.” Who would not want this?
VMware thinks differently – and with reason. They state that a VMware cluster is meant for failover/load balancing between LOCAL ESX nodes, and failover is a whole other ballgame (where Site Recovery Manager or SRM comes in). This decision was not made for no reason as I will try to explain.
How you should not do DR
If you have one big single storage array across sites, you could run VMs on either side, using whatever storage is local to that VM. That way, you do not have your disk access from VM to storage over the WAN. But when DRS kicks in, the VMs will start to migrate between ESX nodes – and between sites! And that is where it goes wrong, the VMs and their respective storage will get “entangled”. I like to call that “quantum-entanglement of VMs”, because it is kind of alike, and of course, because I can 🙂
Even without DRS, but with manual VMotions, in time you will definitely lose track of which VM runs where, and more important: which storage it uses from where. In the end 50% of your VMs might be using storage on the other site, loading the WAN with disk I/O and introducing the WAN’s latency to the disk I/O of the VMs that have become “stretched”.
All this is pretty bad, but let’s say something really bad happens: your datacenter at one location is flooded, and management decides you have to perform a failover to the other site. Now panic strikes: There is probably no Disaster Recovery plan, and even if there is, it is probably way off from being actually useable. VMs have VMotioned to the other site, storage has been added from either side. VMs have been created somewhere, using storage somewhere and possibly everywhere. In other words: You have no idea where to begin, let alone being able to automate or test a failover.
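If you are already in this situation, a simple inventory check at least shows how far the entanglement has gone. Here is a minimal Python sketch, assuming you can export a list of VMs with the site of their current ESX host and the site of their storage; the VM record and both function names are hypothetical.

```python
from typing import NamedTuple


class VM(NamedTuple):
    name: str
    host_site: str      # site of the ESX node the VM currently runs on
    storage_site: str   # site of the storage holding the VM's disks


def stretched_vms(inventory):
    """Names of VMs that run on one site but use storage on another."""
    return [vm.name for vm in inventory if vm.host_site != vm.storage_site]


def stretched_fraction(inventory):
    """Fraction of the inventory whose disk I/O crosses the WAN."""
    inventory = list(inventory)
    if not inventory:
        return 0.0
    return len(stretched_vms(inventory)) / len(inventory)
```

Anything this check reports is a VM whose disk I/O goes over the WAN today, and a VM you will have to untangle by hand during a failover.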
VMware’s way of doing DR
In order to overcome the problems with this “entanglement”, VMware defines a few clear design limitations as to how you should set up DR failover, with SRM helping out if you choose to. But even without SRM, it is still a very good way of designing DR.
VMware states, that you should keep a VMware cluster within a single site. DRS and HA will then take care of the “smaller disasters” such as NICs going down, ESX nodes failing, basically all events that are not to be seen as a total disaster. These failovers are automatic, they correct without any human intervention.
The other site should be totally separated (from a storage point of view). The only connection between the storages on both sides should be a replication connection. So both sites are completely holding their own as far as storage is concerned. Out of scope of this blog, yet VERY important: When you decide on using asynchronous replication, make sure your storage devices can guarantee data integrity across both sites! A lot of vendors “just copy blocks” from one site to the other. Failure of one site during this block copy can (and will) lead to data corruption. For example, EMC storage creates a snapshot just before an asynchronous replication starts, and can revert to that snapshot in case of real problems. Once again, make sure your SAN supports this (or use synchronous replication).
Now let’s say disaster strikes. One site is flooded. HA and DRS are not able to keep up, servers go down. This is beyond what the environment should be allowed to “fix” by itself – So management decides to go for a failover. Using SRM, it should only take the press of a button, some patience (and coffee); but even without SRM you will know exactly what to do: Make replicated data visible (read/write) on the other site, browse for any VMs on them, register, and start. Even without any DR-plan in place, it is still doable!
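The "browse for any VMs" step of the manual procedure is easy to script. A minimal Python sketch, assuming the replicated datastore is mounted somewhere you can walk as a directory tree; the function name is made up, and registering and starting the VMs it finds is left to whatever tooling your ESX version offers.

```python
from pathlib import Path


def find_vmx_files(datastore_root):
    """Return all *.vmx files below datastore_root, sorted by path.

    Each hit is a VM configuration file that can be registered on an
    ESX node at the surviving site.
    """
    root = Path(datastore_root)
    return sorted(path for path in root.rglob("*.vmx") if path.is_file())
```

Printing this list right after making the replicated LUNs visible gives you the to-do list for the register and start steps.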
Where to leave your DR capacity: 50-50 or 100-0?
So let’s assume you went for the “right” solution. Next to decide will be what you are going to run where. Having a DR site, it would make sense to run all VMs (or at least almost all VMs) on the primary site, and leave the DR site dormant. Even better, if your company structure allows it, run test and development at the DR site. In case of a major disaster you can fail over production to the DR site, losing only test and development (if that is allowable).
The problem often is your manager: He paid a lot of money for the second SAN, and DR ESX nodes. Now you will have to explain that these will do absolutely nothing as long as no disaster takes place. Technically there is no difference: You either run both sites at 50%, or one on 100% and the other dormant at 0%. Politically it is much more difficult to sell.
If you use SRM, there is a clear business case: If you run at 50-50, SRM needs double the licenses. And SRM is not cheap. Without SRM, it takes more explanation, but in my opinion running at 100-0 is still the way to go. As an added bonus, you might use fewer ESX nodes on the DR site if you do not have to fail over the full production environment (which will reduce cost without SRM as well).
–> Don’t ever be tempted to quantum-entangle your VMs and their storage!