(originally written in 2008)
Deduplication has piqued the curiosity of CTOs and storage managers for a couple of years, but it is still in its infancy. Being able to restore a 5-year-old file remotely within minutes, without having to add fast (read “expensive”) disk capacity, is certainly appealing.
Gone will be the old days of having to dig up an old tape, wait for delivery of it to your data center, and then begin the restore. Using deduplication technologies, companies will be able to rid themselves of tape and adhere to longer data retention policies all while saving money on the storage required to meet these needs.
Why do I need deduplication?
There is increasing demand to store archived information for longer retention periods in order to meet new regulations. Benefits such as getting rid of cumbersome tapes, easier restores, and the ability to meet growing storage capacity demands for projects like document imaging make this technology nearly essential. Suppose your full backups are 1 TB a week and, at the file level, only 1/3 of that data changes. Your policy is to keep that data for 5 years or more. That’s 260 TB of data (52 weekly fulls × 5 years), about 173 TB of it redundant!
With single-instance storage (at the file level) you could store as little as 87 TB. At the block level, even less changes. A fairly average deduplication ratio would be 20:1, so the 260 TB would require only 13 TB of actual storage to keep 5 years’ worth of data. With that much reduction, admins can replicate the data over a WAN link to a hot site and keep it on disk for the entire retention period.
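The arithmetic above is easy to rerun with your own numbers. The following is only a rough sketch: the function name and parameters are made up for illustration, and it uses the same simplification as the text (every weekly full is treated as 1/3 new data at the file level, and 20:1 is an assumed block-level ratio, not a measured one).

```python
def retention_storage(weekly_full_tb, years, changed_fraction, dedupe_ratio):
    """Estimate storage for a retention period (hypothetical helper, not a vendor formula)."""
    weeks = years * 52
    raw_tb = weekly_full_tb * weeks                  # every weekly full kept as-is
    single_instance_tb = raw_tb * changed_fraction   # file level: keep only the changed share
    redundant_tb = raw_tb - single_instance_tb       # what single-instance storage avoids
    block_level_tb = raw_tb / dedupe_ratio           # block level at the assumed ratio
    return {
        "raw_tb": raw_tb,
        "single_instance_tb": single_instance_tb,
        "redundant_tb": redundant_tb,
        "block_level_tb": block_level_tb,
    }


# The example from the text: 1 TB weekly fulls, 5 years, 1/3 changing, 20:1.
estimate = retention_storage(1.0, 5, 1 / 3, 20)
for name, value in estimate.items():
    print(f"{name}: {value:.1f} TB")
```

With the inputs from the text this gives 260 TB raw, roughly 87 TB with single-instance storage, roughly 173 TB redundant, and 13 TB at 20:1.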
What kind of dedupe ratio will I get?
The deduplication ratio a company experiences varies greatly depending on the nature and makeup of its data. Traditionally, Exchange data and database data do not dedupe very well, if at all. Compressed files will not deduplicate well either. Most storage managers do not have accurate metrics on their data, but you can do a little homework and get a good idea of how your data will be affected by a deduplication technology.
Write down the following:
- Size of full backup (a sample a month for past 6 months would be best)
- Size of largest differential backup (the one just before next full backup)
- Sum of all incremental backups between full backups
With these numbers you are halfway there. Now divide the differential backup size (or the sum of the incrementals) by the full backup size. This tells you what percentage of your data changes between fulls. The largest differential is also the amount of data that changed in that window, so divide it by the number of days between fulls to get the average daily change. Subtracting the largest differential from the full gives the amount of data that did not change.
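The homework can be sketched in a few lines. This is an illustration, not a sizing tool: the function name and the sample figures are invented, and it assumes the largest differential represents the data that changed since the last full, so full minus differential is the data that stayed the same.

```python
def change_metrics(full_gb, largest_diff_gb, days_between_fulls):
    """Estimate change rate from backup sizes (hypothetical helper)."""
    if full_gb <= 0 or days_between_fulls <= 0:
        raise ValueError("full size and days between fulls must be positive")
    change_fraction = largest_diff_gb / full_gb          # share of data changed between fulls
    unchanged_gb = full_gb - largest_diff_gb             # data that stayed the same
    daily_change_gb = largest_diff_gb / days_between_fulls  # average change per day
    return change_fraction, unchanged_gb, daily_change_gb


# Made-up sample: 1000 GB full, 200 GB largest differential, weekly fulls.
fraction, unchanged, per_day = change_metrics(1000, 200, 7)
print(f"{fraction:.0%} changed, {unchanged} GB unchanged, {per_day:.1f} GB/day")
```

If you run incrementals instead of differentials, pass the sum of the incrementals between fulls as the second argument. Averaging the results over several months of samples smooths out unusual weeks.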
Dedupe for the Small Enterprise
I’m glad companies like Data Domain and EMC/Avamar are starting to market their products to SMEs (small-to-medium enterprises). The issue with the smaller products targeted at SMEs is that they are not very scalable, at least not yet. It is easy for an SME to bring in an appliance and wash its hands of the problem it wanted to solve.
The issue is that, while appliances are well suited to a single application, they rarely scale very well and are generally quite proprietary. There are a few vendors out there that use an open-source algorithm to deduplicate the data. This is the way to go unless you are confident your dedupe vendor is never going out of business.
Which is better: in-band or out-of-band (post-process)?
Many IT professionals are wondering which dedupe technology is the best. There are two main versions you may have already heard about: in-line (in-band) deduplication and out-of-band (post-process) deduplication. In-line works in “real time”, deduping the data as it’s being copied to the storage target. Its benefits include not needing enough storage to hold the data before it’s deduped, and having the data deduplicated sooner.
Its disadvantages include poorer scaling, since the hash table in memory eventually becomes huge, and slower backups, since the in-line process sits in the data path. In-line products also don’t yet have global deduplication, though most vendors say they are “working on it”. Post-process allows the backups to write faster, but requires enough space to store all of the raw data before it’s deduplicated. It also often finishes later than an in-line process would.
Most post-process dedupe solutions have true global deduplication, meaning all of the data is checked for duplicate blocks, not just the data within each appliance. This area is becoming grayer, as companies like FalconStor have a post-process that starts immediately after the data starts writing to the target, so it behaves more like an in-line process. Quantum has an in-line process with such a huge cache that it interferes less with the backup speed, so it has the benefits of a post-process. Post-process has been winning, but in-line is really catching up.
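To make the trade-off concrete, here is a toy sketch of both approaches. It is not how any particular vendor does it: the fixed 4 KB block size, the SHA-256 hashing, and all names (`DedupeStore`, `inline_backup`, `post_process_backup`) are assumptions made for illustration. The dictionary plays the role of the in-memory hash table that grows with every unique block.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration only


class DedupeStore:
    """Toy block store: keeps one copy of each unique block, keyed by its hash."""

    def __init__(self):
        self.index = {}  # SHA-256 hex digest -> block bytes (the in-memory hash table)

    def write(self, data):
        """Dedupe data block by block; return the hashes needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.index:  # only new blocks consume storage
                self.index[digest] = block
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of block hashes."""
        return b"".join(self.index[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(b) for b in self.index.values())


def inline_backup(store, stream):
    """In-line: each backup is deduped as it arrives; no raw staging space is used."""
    return [store.write(data) for data in stream]


def post_process_backup(store, stream):
    """Post-process: land the raw data first, dedupe afterwards.

    Returns the recipes and the peak raw staging space that was needed.
    """
    staging = list(stream)                       # raw backups land undeduped
    peak_staging = sum(len(d) for d in staging)  # space needed before dedupe runs
    recipes = [store.write(d) for d in staging]  # dedupe pass runs later
    return recipes, peak_staging
```

Both paths end up storing the same unique blocks. The difference is where the cost lands: the in-line path does the hashing and lookup while the backup is running, and the post-process path needs staging space for the full raw size before any savings appear.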
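The effect of global deduplication can be shown with a small counting sketch. Again this is only an illustration under assumed details (fixed 4 KB blocks, SHA-256, hypothetical function names): without global dedupe each appliance keeps its own index, so a block that lands on two appliances is stored twice.

```python
import hashlib


def block_hashes(data, block_size=4096):
    """Return the set of unique block hashes in data (assumed fixed-size blocks)."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def blocks_stored(per_appliance_data, global_dedupe):
    """Count unique blocks kept across appliances, with or without a shared index."""
    if global_dedupe:
        seen = set()
        for data in per_appliance_data:
            seen |= block_hashes(data)  # one index shared by every appliance
        return len(seen)
    # Each appliance only knows about its own blocks.
    return sum(len(block_hashes(data)) for data in per_appliance_data)


# Two appliances whose backups share one block ("A").
appliance_1 = b"A" * 4096 + b"B" * 4096
appliance_2 = b"A" * 4096 + b"C" * 4096
print(blocks_stored([appliance_1, appliance_2], global_dedupe=False))
print(blocks_stored([appliance_1, appliance_2], global_dedupe=True))
```

The more data your appliances have in common, the bigger the gap between the two counts.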
Which vendor to choose
It really does depend on the needs of your company. No matter what the size of your company is or what your growth rate is, you need the most scalable solution out there. Right now very few (maybe none) of the in-line deduplication vendors offer any type of real scalability due to the lack of global deduplication.
Two years ago they claimed they would have it soon by clustering their appliances. Now most of them say it’s unnecessary because they will size the appliance for your future needs, and if you do outgrow it (which is inevitable) they will simply migrate you to a larger box. You will not get any money for your smaller, now useless box.
What I do like about an in-line, CPU-based appliance like Data Domain is that CPU speed has doubled or tripled every 18 months, and we can’t say the same about disks. Most other dedupe solutions are disk I/O intensive, which is where the bottleneck lies. An appliance like Data Domain can also be a “NAS-like” target for other backups such as VCB/VRanger and SQL backups. And if you have no room for post-process “pre-duped” data, you will be looking for an in-line appliance.




Hi Mike,
Nice article on dedupe (in-band vs. post-processing). One thing: our dedupe solution works great with databases, actually all data types. We’re getting 51:1 on Oracle and SQL full backups. We back these up as DB fulls every day of the week and store 160 TB of data on 3 TB of disk. It’s a byte-level compare and a post-processing solution. The incrementals dedupe at a lower rate, but we’re backing up 6 TB per day and storing 283 TB of data on 41 TB of disk.
Hope this helps.
Wow that sounds great. I’d be interested to know which product you are referring to. I’d love to hear more about it. Thanks for the comment.