Posts Tagged ‘dedup’
People who have been using VMware Data Recovery quickly discovered that this product had issues. VMware’s take on Data Recovery was that they wanted to have a backup product for the smaller shops with a short time-to-market. Too bad it was this product that drove a lot of users to Veeam or PHDvirtual because of its many problems. In secrecy VMware started working together with EMC’s BRS division to build a brand new backup product leveraging EMC’s Avamar technology under the codename “Project Toystory”. This product has seen the light of day as “vSphere Data Protection 5.1” or vDP for short. In this post I will be looking into vDP version 5.1, which is actually the initial release.
Introduction to vSphere Data Protection 5.1
This is the first release of vDP, so actually a 1.0 version. I will not be expecting a fully feature-rich product, but one that actually WORKS would be nice. After all, it is a “free” product Read the rest of this entry »
Today at VMworld 2009 is joined a breakout session presented by PHD Virtual about their latest version of esXpress (3.6). Great stuff once again! Apart from the fact that esXpress is now fully functional on vSphere (still no ESXi support though), they also managed to use the vStorage API for “changed block reporting”. Basically what this means, is that when you are using vSphere and doing delta or deduped backups, you no longer need to read all the blocks of a VM and then decide is that block was changed or not. PHD managed to get esXpress so far that it reads only the changed blocks directly by using this “cheat sheet” that VMware was so nice to make available though the vStorage API.
What this means is, that backup speeds will be way higher when you do delta or deduped backups.
When you also use their dedup targt, with the dedup action going on on the SOURCE, you get tremendous backup speeds and as an added bonus you can use smaller WAN links when you are sending these backups offsite. Wonderfull guys, you did it again!
Everybody does it – and if you don’t, you can’t play along. What am I talking about? Data deduplication. It’s the best thing since sliced bread I hear people say. Sure it saves you a lot of disk space. But is it really all that brilliant in all scenarios?
The theory behind Data Deduplication
The idea is truly brilliant – You store blocks of data in a storage solution, and you create a hash which identifies the data inside the block uniquely. Every time you need to backup a block, you check (using the hash) if you already have the block in storage. If you do, just write a pointer to the data. Only if you have not got the block yet, copy it and include it into the storage dedup Dbase. The advantage is clear: The more equal data you store, the more you save in disk space. This is, especially in VMware, using equal VMs from templates a very big saving in disk space.
The actual dilemma
Certainly a nice thing about deduplication is, next to the large amounts of storage (and associated costs) you save, is that when you deduplicate at the source, you end up only sending new blocks across the line, which could dramatically reduce the bandwidth you need between remote offices and central backup locations. Deduplication at the source also means, you generally spread CPU loads better across your remote servers instead of locally in the storage solution.
Since there is a downside on every upside – Data Deduplication certainly has its downsides. For example, if I had 100 VMs, all from the same template, there surely are blocks that occur in each and every one of them. If that particular block gets corrupted… Indeed! You loose ALL your data. Continuing to scare you, if the hash algorithm you use is insufficient, two different data blocks might be identified as being equal, resulting in corrupted data. Make no mistake, the only way you can be 100% percent sure the blocks are equal, you need a hash number as big as the block itself (rendering the solution kind of useless). All dedup vendors use shorter hashes (I wonder why 😉 ), and live with the risk (which is VERY small in practice but never zero). Third mayor drawback, is the speed at which the storage device is able to deliver your data (un-deduplicated) back to you (which especially hurts on backup targets which have to perform massive restore operations). Final drawback: You need your ENTIRE database in order to perform any restore (at least you cannot be sure which blocks are going to be required to restore a particular set of data).
So – should I use it?
The reasons stated above always kept me a skeptic when it came to data deduplication, especially for backup purposes. Because at the end of the day, you want your backups to be functional, and not requiring the ENTIRE dataset in order to perform a restore. Speed can also be a factor, especially when you rely on restores from the dedup solution in a case of disaster recovery.
Still, there are definitely uses for deduplication. Most vendors have solved most issues with success, for example being able to access un-deduplicated data directly from the storage solution (enabling separate backups to tape etc). I have been looking at the new version of esXpress with their PHDD dedup targets, and I must say it is a very elegant solution (on which I will create a blog shortly 🙂