The Dedup Dilemma

Everybody does it – and if you don’t, you can’t play along. What am I talking about? Data deduplication. It’s the best thing since sliced bread, I hear people say. Sure, it saves you a lot of disk space. But is it really all that brilliant in all scenarios?

The theory behind Data Deduplication

The idea is truly brilliant: you store blocks of data in a storage solution, and you create a hash which identifies the data inside each block uniquely. Every time you need to back up a block, you check (using the hash) whether you already have the block in storage. If you do, you just write a pointer to the existing data. Only if you do not have the block yet do you copy it and add it to the dedup database. The advantage is clear: the more identical data you store, the more disk space you save. The savings are especially big in VMware environments, where many VMs are deployed from the same templates and share most of their blocks.
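To make the idea concrete, here is a minimal sketch of such a block store in Python. It is not how any particular vendor does it: the fixed 4 KiB block size, the choice of SHA-256 and the names `DedupStore`, `backup` and `restore` are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class DedupStore:
    """Toy in-memory dedup store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # hash (hex string) -> block contents

    def backup(self, data: bytes) -> list:
        """Store data block by block and return the list of hashes (the 'pointers')."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Block not seen before: copy it into the store.
                self.blocks[digest] = block
            # Known or new, the backup itself only records the pointer.
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[digest] for digest in recipe)


# Two "VMs" cloned from the same three-block template, each with one private block.
template = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
vm1 = template + bytes([16]) * BLOCK_SIZE
vm2 = template + bytes([32]) * BLOCK_SIZE

store = DedupStore()
recipe1 = store.backup(vm1)
recipe2 = store.backup(vm2)
print(len(recipe1) + len(recipe2), "logical blocks,", len(store.blocks), "stored blocks")
```

The three template blocks are stored only once, no matter how many clones you back up. Note that the sketch trusts the hash completely: if two different blocks ever produced the same hash, the second one would silently be replaced by a pointer to the first.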


The actual dilemma

A nice thing about deduplication, next to the large amounts of storage (and associated costs) you save, is that when you deduplicate at the source, you only send new blocks across the line. This can dramatically reduce the bandwidth you need between remote offices and central backup locations. Deduplication at the source also means you spread the CPU load across your remote servers instead of concentrating it in the central storage solution.
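A rough sketch of what deduplication at the source looks like: the remote server hashes its own blocks, asks the central store which hashes it does not know yet, and ships only those blocks. Everything here is hypothetical (the `CentralStore` class, the `missing` and `upload` calls, the 4 KiB block size), and the sketch ignores the small amount of traffic needed to send the hashes themselves.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size


def split_blocks(data: bytes) -> list:
    """Cut the data into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class CentralStore:
    """Stand-in for the central backup location."""

    def __init__(self):
        self.blocks = {}  # hash -> block contents

    def missing(self, digests) -> set:
        """Return the subset of hashes the store has not seen yet."""
        return {d for d in digests if d not in self.blocks}

    def upload(self, blocks_by_digest: dict) -> None:
        self.blocks.update(blocks_by_digest)


def source_backup(data: bytes, store: CentralStore) -> int:
    """Back up data from a remote site; return the block bytes sent over the line."""
    # Hashing happens on the remote server, so the CPU load stays at the source.
    local = {hashlib.sha256(b).digest(): b for b in split_blocks(data)}
    needed = store.missing(local.keys())
    payload = {d: local[d] for d in needed}
    store.upload(payload)
    return sum(len(b) for b in payload.values())


store = CentralStore()
day1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
day2 = day1[:9 * BLOCK_SIZE] + bytes([99]) * BLOCK_SIZE  # only the last block changed

sent_day1 = source_backup(day1, store)
sent_day2 = source_backup(day2, store)
print("day 1 sent", sent_day1, "bytes; day 2 sent", sent_day2, "bytes")
```

The first backup has to send everything. After that, the amount of data crossing the line depends on how many blocks changed, not on how big the server is.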

Since there is a downside to every upside, data deduplication certainly has its downsides too. First, if I had 100 VMs, all from the same template, there surely are blocks that occur in each and every one of them. If that particular block gets corrupted… Indeed! You lose the data of ALL those VMs. Second, to continue scaring you: if the hash algorithm you use is insufficient, two different data blocks might be identified as being equal, resulting in corrupted data. Make no mistake: to be 100% sure from the hash alone that two blocks are equal, you would need a hash as big as the block itself (rendering the solution kind of useless). All dedup vendors use shorter hashes (I wonder why 😉 ) and live with the risk (which is VERY small in practice but never zero). The third major drawback is the speed at which the storage device is able to deliver your data (un-deduplicated) back to you, which especially hurts on backup targets that have to perform massive restore operations. Final drawback: you need your ENTIRE database in order to perform any restore (at least, you cannot be sure in advance which blocks are going to be required to restore a particular set of data).
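To put a number on "VERY small but never zero", you can use the standard birthday-bound approximation for the chance that any two of n different blocks share a b-bit hash: roughly 1 - exp(-n(n-1) / 2^(b+1)). The sketch below assumes the hash behaves like a uniformly random function, which is an idealization; the function name and the example sizes are mine, not taken from any vendor.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one hash collision among num_blocks
    distinct blocks, assuming an ideal (uniformly random) hash of hash_bits bits."""
    exponent = -num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is extremely close to 0.
    return -math.expm1(exponent)


# 1 PiB of unique data in 4 KiB blocks is 2**50 / 2**12 = 2**38 blocks.
blocks_in_one_pib = 2 ** 38

p_weak = collision_probability(100_000, 32)             # a deliberately short hash
p_sha1 = collision_probability(blocks_in_one_pib, 160)  # a 160-bit hash
p_sha256 = collision_probability(blocks_in_one_pib, 256)

print("32-bit hash, 100,000 blocks:", p_weak)
print("160-bit hash, 1 PiB of blocks:", p_sha1)
print("256-bit hash, 1 PiB of blocks:", p_sha256)
```

With a 32-bit hash a collision is more likely than not after only 100,000 blocks, while with 160 or 256 bits the estimate is far too small to ever matter in practice. It still is not zero, and the estimate says nothing about weaknesses in a specific hash algorithm.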


So – should I use it?

The reasons stated above have always kept me a skeptic when it comes to data deduplication, especially for backup purposes. At the end of the day, you want your backups to be functional without requiring the ENTIRE dataset in order to perform a restore. Speed can also be a factor, especially when you rely on restores from the dedup solution for disaster recovery.

Still, there are definitely uses for deduplication. Most vendors have successfully solved most of these issues, for example by letting you access un-deduplicated data directly from the storage solution (enabling separate backups to tape etc). I have been looking at the new version of esXpress with their PHDD dedup targets, and I must say it is a very elegant solution (on which I will write a blog post shortly 🙂 ).

7 Responses to “The Dedup Dilemma”

  • Tom says:

    I would REALLY appreciate your comments on esXpress de-dup, since it appears to be a challenge to get it working right etc., and I bought it with this idea in mind for possibly making offsite backups even possible.

    Please also comment about bandwidth and other requirements for offsite backups etc. with esXpress.

  • erikzandboer says:

    From what I’ve seen of esXpress dedupe, it works without issues. Remember there are no more delta and full backups, but PHDD-type backups. You must configure esXpress to run PHDD backups, and you have to configure a PHDD backup target to match. I would recommend posting your issues on the esXpress forum ( http://www.phdvirtual.com/forums?func=showcat&catid=13 ). I’m sure Pete or someone else from PHD will be able to resolve your issues in a snap!

    Stay tuned for a series of blogposts I have planned on esXpress 3.5!

  • Tom says:

    I already use esXpress 3.1.21…I know the support is good, etc.
    I also check the forums and I see people having issues so I’m waiting a while.
    My actual request was that you comment about bandwidth issues vis-a-vis offsite backups with the dedupe method.
    3.5 will do *either* phdd OR full/delta backups, I think it does not allow you to do both kinds.
    Thank you, Tom

  • erikzandboer says:

    This is not exactly the right place to comment on this (I will write more about this in the planned blogposts). Anyway, dedup in esXpress appears to be source dedup. This would mean only “new” blocks hit the WAN. Basically, the change rate of the data on the remote site would greatly influence the bandwidth you need. You could possibly use sub-10Mbit lines for daily backups. I have not tested this enough though. Soon to come!

  • Tom says:

    That is what I meant — please comment etc. about it in your forthcoming blog entries.

    It will help a lot of people if you talk more about the changerate, how to determine what it might possibly be, etc. Most SMBs *only* have <<10 Mbit lines.

    Thank you, Tom

  • DarkFlib says:

    You aren’t 100% correct with regards to the use of hashes. I’m sure there are probably some companies doing what you say, but many write the blocks to disk and then de-dup during idle periods. They only use the hashes to find candidates for the operation, then do a full byte-wise comparison of the source and destination blocks.

    I also don’t see any reason why this wouldn’t be done with inline/online de-dup as well, since even if a read is required to compare, with the right hash algorithm collisions should be fairly rare, and as such a block write after the comparison should be correspondingly rare. This gives a net result (if we ignore the hashing operation) of replacing each write with a read, which on a RAID array is generally far faster than the write it replaces.

    I don’t know about you, but the uncertainty over the technology is what causes me to avoid it at the current time, although I do use some filesystem level tools to do similar things (rsnapshot/fdup etc)

    The biggest downside I see to both block level de-dup (filesystem level doesn’t have this issue) and its cousin ‘thin-provisioning’ (the filesystem equivalent ‘sparse’ files is also painful in this respect) is that you can never be sure just how much free space you actually have in the array, you can only guess based on past performance; if something changes these projections can be thrown right out the window.

  • erikzandboer says:

    DarkFlib,

    a full byte-wise comparison cannot always be done. Apart from being very intensive (read: slow), you cannot use this option when you do source-based dedup. And source dedup is the way to save bandwidth. You are referring to destination dedup, in which you would end up sending all data over the network and deduplicating at the central storage (so you would need a lot of CPU there). You are correct that some vendors do “offline dedup”, but that requires a lot of storage at busy times, and of course you need to have a shop where there are idle times to begin with.

    “Fairly rare”, as you describe it, is not acceptable. One dedup error could be fatal for all your VM backups. Fortunately, really smart mathematicians have made very cool algorithms which make collisions VERY rare indeed. In fact SO rare that a “collision” will not occur within a human lifetime… Even EMC’s Centera archivers use algorithms like these (not for block data, but for detecting identical entries), and they are SERIOUS about not losing your data!
