The purpose of data deduplication is to increase the amount of information that can be stored on disk arrays and to increase the effective amount of data that can be transmitted over networks. When based on variable-length data segments, data deduplication has the capability of providing greater granularity than single instance storage technologies that identify and eliminate the need to store repeated instances of identical whole files. In fact, variable-length block data deduplication can be combined with file-based data reduction systems to increase their effectiveness. It is also compatible with established compression systems used to compact data being written to tape or disk, and may be combined with compression at a solution level. Key elements of variable-length data deduplication were first described in a patent issued to Rocksoft, Ltd (now a part of Quantum Corporation) in 1999.

So what’s wrong? Well, most deduplication appliance datasheets claim to use variable-length deduplication. But as always, the devil is in the details. Dig a little into some of those datasheets and you’ll find that the administrator has to set the block size by hand, so what is advertised as variable-length is really configurable fixed-block deduplication. If you want to save space, make sure the appliance uses an adaptive algorithm that adjusts the segment length to the data set automatically. You should also check whether the vendor offers automatic ISV filters. These filters are developed to increase deduplication rates for each ISV (read “backup software”).
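To make the difference concrete, here is a minimal Python sketch that compares fixed-block splitting with a simple content-defined (variable-length) splitter after a single byte is inserted at the front of a data set. It is an illustration only, not any vendor's algorithm: the names `fixed_chunks`, `variable_chunks` and `shared`, the 16-byte window, the 256-byte block size and the use of CRC32 as the boundary hash are all assumptions chosen for readability. A real appliance would use a proper rolling hash and enforce minimum and maximum segment sizes, which this sketch leaves out.

```python
import random
import zlib

WINDOW = 16   # bytes examined when deciding on a boundary (assumed value)
MASK = 0xFF   # cut when the low 8 bits of the hash are zero (~256-byte average)


def fixed_chunks(data: bytes, size: int = 256) -> list:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes) -> list:
    """Cut wherever a hash of the preceding WINDOW bytes matches a pattern.

    Boundaries depend only on nearby content, not on the byte offset,
    so they move along with the data when bytes are inserted upstream.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # Recomputing the hash per position is slow but easy to follow;
        # a rolling hash would give the same idea at lower cost.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(old: list, new: list) -> int:
    """Count segments of `new` that already exist among the segments of `old`."""
    known = set(old)
    return sum(1 for chunk in new if chunk in known)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"X" + original  # one byte inserted at the very beginning

fixed_hits = shared(fixed_chunks(original), fixed_chunks(edited))
variable_hits = shared(variable_chunks(original), variable_chunks(edited))

print("fixed-block segments reused:    ", fixed_hits, "of", len(fixed_chunks(edited)))
print("variable-length segments reused:", variable_hits, "of", len(variable_chunks(edited)))
```

With fixed blocks, the inserted byte shifts every block boundary, so the second copy looks almost entirely new. With content-defined boundaries, only the segment around the insertion changes and the rest is found again in the pool.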

Global deduplication (also called centralized or federated) is another feature often listed in datasheets. Data deduplication systems gain the most leverage when they allow multiple sources and multiple system presentations to write data to a common, deduplicated storage pool (a.k.a. block pool or block store). But again, it’s better to ask the vendor first. Many appliances don’t share a single block pool; every partition can have its own. In that case the same block can exist in two or more partitions, which is not a great setup for a deduplication system.
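The cost of per-partition pools is easy to see in a small Python sketch. The `BlockPool` class below is hypothetical, an in-memory dictionary keyed by SHA-256 fingerprint, and only stands in for whatever store a real appliance uses. The same backup stream is written through two partitions, once with a pool per partition and once with a single shared pool.

```python
import hashlib


class BlockPool:
    """A toy in-memory block store keyed by SHA-256 fingerprint (illustrative only)."""

    def __init__(self):
        self.blocks = {}

    def write(self, segments):
        """Store segments not seen before; return the fingerprints needed to rebuild."""
        recipe = []
        for segment in segments:
            fingerprint = hashlib.sha256(segment).hexdigest()
            self.blocks.setdefault(fingerprint, segment)  # keep the first copy only
            recipe.append(fingerprint)
        return recipe

    def stored_bytes(self):
        """Total bytes physically held in this pool."""
        return sum(len(block) for block in self.blocks.values())


# The same three segments arrive through two different partitions.
segments = [b"A" * 1024, b"B" * 1024, b"A" * 1024]

# One pool per partition: each partition deduplicates only against itself.
pool_1, pool_2 = BlockPool(), BlockPool()
pool_1.write(segments)
pool_2.write(segments)
per_partition_total = pool_1.stored_bytes() + pool_2.stored_bytes()

# One global pool: the second partition's write finds every block already stored.
global_pool = BlockPool()
global_pool.write(segments)
global_pool.write(segments)
global_total = global_pool.stored_bytes()

print("per-partition pools:", per_partition_total, "bytes")
print("global pool:        ", global_total, "bytes")
```

In this toy case the per-partition layout stores the two unique blocks twice, while the shared pool stores them once, and the gap grows with every additional partition that sees the same data.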


If you have a backup appliance in production today, you may already be able to read between the lines of a datasheet. But if not, I hope these tips will help you. In the end, there is nothing better than to clearly define, in detail, your current and future environment, and to challenge the vendor’s teams. Picking a solution, or even just building a shortlist, by comparing datasheets alone will probably not lead you to the best data protection solution for your environment.

