After a rather successful part 2 of this series, it is high time to kick off part 3, which covers Replication and Disaster Recovery (DR). The most important thing to note is that backup and DR are two completely different things, and you should not be tempted to combine them unless you are positive your solution will cover all business requirements for both.
IMPORTANT NOTICE: As of the date of this writing, it seems PHD4 still has issues restoring VMs from its dedup store when CBT is enabled. Apparently, all blocks and VM backups show as “good”, but on image-level restore something does not work as it should; on rare occasions the wrong blocks get inserted into the VM being restored, resulting in a FAILING VM. I must strongly advise anyone using PHD4 to switch off CBT, test their restore capability, and raise a support call with PHDvirtual in case of issues. Because this issue is still not fixed, the wisest thing to do is to disable CBT or upgrade to PHD5, and to consider CBT backups made with PHD4 as failed until thoroughly tested.
Veeam5 and PHD4 replication abilities
Both Veeam5 and PHD4 feature replication. Strangely, PHD5 completely lacks any implementation of this. Because PHDvirtual5 is an almost complete rewrite, they just did not get round to implementing this. I’ll just assume they’ll build in replication at a later stage, comparable to the implementation in PHD4.
It is interesting to see both Veeam5 and PHD4 have different approaches to replication. Where PHD4 uses the backups that arrived at the dedup store as a source for replication, Veeam5 runs a separate job much like a backup job, only with a VM as its target. My initial thought on the Veeam5 approach was that it was not the most efficient way of working. If you plan to run both backups and replication of a VM, that VM will have to be snapshotted twice, which appears to be a rather poor way of working; VM snapshots are to be avoided as much as possible, because they have a performance impact on the VM.
But after asking around, I found out that most customers do not use both backup and replication at the same time; they either run backups OR replication. That is because the replication solution Veeam5 uses also stores points in time of previous replication actions, which means you can actually use a replicated VM to go back to previous points in time as well. This is a prime example of a mixture of DR and backup.
RTO and how DR integrity is assured
As discussed, Veeam5 takes on a rather alternative way of delivering backup and DR: They actually mix both in a single solution (unless you are really paranoid and kick off replication and backup separately). The replication location is actually also the backup store in this case. You can start replicated VMs from any point in time recorded at the secondary site (the number of restore points is configurable in the replication job).
PHD4 takes another, better known approach; one side performs backups and backups only, while the secondary site just sits and watches the dedup store. As soon as a new backup lands on the store, and this backup was made from a VM that is to be replicated, the secondary site will start an incremental restore, meaning it will restore all changed blocks (and only changed blocks) to the standby replicated VM. The DR scenario for PHD4 is simple: Each replicated VM is always standing by, ready to be powered up. Only while a restore is running can that VM not be started right away. So you could say the time required to incrementally restore any single VM is your RTO. When no backups are running, no restores will be initiated, and your RTO drops to near-zero (you just need to start the standby replicated VMs on the secondary site).
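To make the idea of an incremental restore concrete, here is a minimal sketch in Python. It is not PHD4's actual implementation; the disk is modelled as a simple list of blocks, and the names `incremental_restore`, `standby` and `changed` are made up for illustration. The point is only that the standby VM is touched for changed blocks and nothing else, so the amount written (and thus the restore time) scales with the change rate, not with the size of the VM.

```python
from typing import Dict, List


def incremental_restore(standby_disk: List[bytes], changed_blocks: Dict[int, bytes]) -> int:
    """Overwrite only the changed blocks on the standby VM's disk.

    Returns the number of bytes written, which is what drives restore time.
    """
    written = 0
    for index, data in changed_blocks.items():
        standby_disk[index] = data
        written += len(data)
    return written


# Hypothetical standby replica as it looked after the previous backup.
standby = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
# The new backup only changed blocks 1 and 3.
changed = {1: b"bbbb", 3: b"dddd"}

bytes_written = incremental_restore(standby, changed)
print(standby)
print(bytes_written)
```

In this toy model only two of the four blocks are rewritten; the other two are left exactly as the previous restore put them.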
Veeam allows you to start your replicated VMs at all times; when disaster strikes the currently running replication fails anyway, so you can immediately proceed to starting a previous point in time backup, which means that RTO is near-zero in this case, much like the PHD4 approach.
My favorite test in replication: simulate a disaster while replicating
I have often done testing around various replication solutions. My favorite test is simulating a disaster while replication is in progress. This ended testing for some solutions in the past within 5 minutes: During this simulated disaster, the secondary site should be able to restart a replicated VM at all times. So just replicating changed blocks over another VM at the secondary site is NOT going to cut it: During this test the source VM would be gone (obviously), and the VM at the secondary site would be synced halfway, effectively broken as well.
Both Veeam5 and PHD4 do much better than the example above: Both manage to survive this simulated disaster while replication is in progress. Here is how they do it:
- PHD4 only restores COMPLETED backups to replicated VMs. So if you simulate a failure during backup, the backup fails and the secondary site will not initiate a restore at all, so the replicated VM stays at its state of the last successful backup. Mission accomplished 😉
- Veeam5 also survives the test: While replication is active, Veeam5 creates a separate file at the secondary site holding all changed blocks since the base replication. If this action fails in the middle, the last replication obviously cannot be used. But since the replication folder on the secondary site still contains previous points in time from earlier replication runs, you can simply select another point in time (e.g. the last successful one) and start that instance.
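Both approaches boil down to the same rule: never start from a point in time that did not complete. A rough sketch of that selection logic in Python could look like the following. The `RestorePoint` class, its fields and the `latest_usable` function are hypothetical and do not reflect how either product stores its metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RestorePoint:
    label: str
    sequence: int      # higher means more recent
    completed: bool    # False if the backup/replication pass was interrupted


def latest_usable(points: List[RestorePoint]) -> Optional[RestorePoint]:
    """Return the most recent restore point that finished, or None."""
    usable = [p for p in points if p.completed]
    return max(usable, key=lambda p: p.sequence, default=None)


points = [
    RestorePoint("monday", 1, True),
    RestorePoint("tuesday", 2, True),
    RestorePoint("wednesday", 3, False),  # disaster struck mid-replication
]

choice = latest_usable(points)
print(choice.label if choice else "no usable restore point")
```

With the interrupted "wednesday" pass ignored, the sketch falls back to "tuesday", which matches the behaviour described for both products.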
So both solutions can survive perfectly – however you must give it plenty of thought where you put what – push or pull replication, what should be put on which site. Also consider WAN bandwidths!
WAN bandwidth requirements for Veeam5 and PHD4 for replication
Whenever you are replicating between sites, the WAN bandwidth is something to look at. In this section I’ll describe what gets sent over the WAN for both solutions in a replicating environment. It is also very important to keep in mind what you put where (primary or secondary site).
With PHD4 you have two components to consider placement for: The dedup appliance and the actual dedup store. You could put both at the primary site, then replicate VMs offsite. That would require all changed blocks of all VMs to be transferred over the WAN. More effective for WAN bandwidth usage would be to either place the dedup store at the secondary site, or keep the dedup store local but replicate the dedup store offsite using rsync. Personally, I have not seen any successful implementations of rsyncing the dedup store out to another location, especially when replication is required: the dedup appliance is unaware of its replicated brother. So we’ll go back to the solution where the dedup store is put on the secondary site. The dedup appliance itself does not transfer any data (except checksumming), so it is more or less irrelevant where that appliance is placed. For Disaster Recovery, I’d put the dedup appliance at the secondary site as well.
Veeam5 replication works differently; since the Veeam5 application transmits all changed blocks for each replicated VM to the secondary site, WAN bandwidth usage of Veeam5 is somewhat higher compared to the PHD4 approach. With Veeam5 there is just about the same choice: You must place the dedup store and the application in one of two places, either the primary or the secondary site. I hear most people favor pull mode, meaning the application is placed at the secondary site. This will require a lot of bandwidth though. Possibly it will be more effective (for WAN bandwidth at least) to place the application on the primary site, and put the dedup store / replication destination on the secondary. Less data will be transported over the WAN in the latter case.
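To get a feel for what "a lot of bandwidth" means, a back-of-the-envelope calculation helps. The sketch below is plain arithmetic in Python; the function name `transfer_hours`, the 80% link efficiency and the example figures (20 GB of changed blocks, a 10 Mbit/s line) are assumptions picked purely for illustration, not measurements from either product.

```python
def transfer_hours(changed_gb: float, wan_mbit_per_s: float, efficiency: float = 0.8) -> float:
    """Hours needed to push changed_gb gigabytes over a WAN link.

    efficiency is an assumed fraction of the nominal link speed that is usable.
    """
    bits = changed_gb * 8 * 1000**3                      # decimal gigabytes to bits
    usable_bits_per_s = wan_mbit_per_s * 1000**2 * efficiency
    return bits / usable_bits_per_s / 3600


# Hypothetical example: 20 GB of changed blocks per night over a 10 Mbit/s line.
print(round(transfer_hours(20, 10), 2))
```

Plugging in your own nightly change volume and line speed quickly shows whether a replication pass fits in your backup window at all.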
Also, do not forget you can use rsync in Veeam5 as well; rsync is able to detect changes within a file, and sync only those changes. But beware, the rsync process needs to build a copy of the original store, resulting in the temporary use of twice the size of your dedup store at the secondary site during rsync replication.
Both PHD4 and Veeam5 manage to replicate between sites pretty effectively. PHD5 completely lacks replication, but this is sure to return in future versions.
When you decide to implement some kind of DR using these solutions, it probably means you are on a budget. So it is important to look at WAN bandwidth consumption. PHD4 has to be called the winner here; you can get very nice results with PHD4 when it comes to WAN bandwidth usage; it is very simple to implement a solution which will compress and send only those blocks unknown to the dedup store, which is very effective. Only products using variable blocksizes (like EMC’s Avamar) can potentially do better (but at a much higher cost).
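The "send only blocks unknown to the dedup store" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not PHD4's code: the fixed 4 KB block size, SHA-256 for fingerprints, zlib for compression and the `blocks_to_send` function are all my own choices for the example.

```python
import hashlib
import zlib
from typing import Iterable, List, Set, Tuple


def blocks_to_send(blocks: Iterable[bytes], known_hashes: Set[str]) -> List[Tuple[str, bytes]]:
    """Return (hash, compressed block) pairs for blocks the store has not seen."""
    outgoing = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest in known_hashes:
            continue  # store already holds this block; only the hash is needed
        known_hashes.add(digest)  # avoid sending duplicates within the same run
        outgoing.append((digest, zlib.compress(block)))
    return outgoing


# Hypothetical dedup store that already knows the all-"A" block.
store = {hashlib.sha256(b"A" * 4096).hexdigest()}
backup = [b"A" * 4096, b"B" * 4096, b"B" * 4096, b"C" * 4096]

payload = blocks_to_send(backup, store)
print(len(payload))
print(sum(len(data) for _, data in payload))
```

Of the four blocks in this toy backup, only the two the store has never seen cross the WAN, and those are compressed first. A product using variable block sizes would split the data differently, which this fixed-block sketch does not attempt.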
Veeam however does a pretty good job as well, as long as you feed it enough WAN bandwidth and understand that you either make backups, or you replicate with several points in time to potentially go back to. Running both backup and replication is possible but will impact each VM twice (because replication basically takes another backup from the VM). I’d like to see Veeam creating the ability to restore from their own backups automagically; that would enable you to create backups, and restore your VMs “en masse” on the secondary site. You could then have your backups in a dedup store (maybe even for tape out), and VMs restored from there as well, impacting the VMs only once when you make the backup.