Deduplication, when it comes to data storage, is the process of identifying redundancies within a data set and eliminating them. Depending on the vendor and the situation, deduplication can and should occur at several points in the process. In backup, for example, it can occur before data is transferred to a backup server (sometimes called “source side”) or when the backup is received by the backup device (“target side”). In replication for disaster recovery, it can occur before data is transferred to the DR site, to avoid unnecessary bandwidth consumption. Many opinions about the best place to deduplicate data have appeared in the press and from various industry experts, but for now let’s put those aside and focus on what deduplication is. This is the first step to understanding its value and potential applications.


Regardless of where deduplication occurs, the steps that deduplication solutions take are similar. The incoming data is segmented into files or, ideally, into smaller subsets; in some systems these segments are a fixed size and in others they are variable. Each segment is put through a hashing algorithm that produces an identifier for that segment of data that is, for practical purposes, unique, and which can be thought of as a ‘fingerprint’ or serial number. This identifier is then compared to the identifiers that have already been generated from previously stored data. Where the deduplication occurs affects when redundant data is eliminated. For example, if deduplication happens as data is being received (inline), then when the identifier is found the incoming data is not stored; instead a link (or pointer) is established to the existing data. If the identifier is not found during the lookup, the identifier is added to the lookup table and the data is actually stored. If deduplication occurs as a separate post process, the redundant data is always stored first and then eliminated later.
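
The steps above can be sketched in a few lines of Python. This is only an illustration of the inline case, not any vendor's implementation: the class name DedupStore, the fixed 4-byte segment size used in the example, and the choice of SHA-256 as the hashing algorithm are all assumptions made for the sketch.

```python
import hashlib


class DedupStore:
    """Toy inline deduplicating store using fixed-size segments (hypothetical)."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}  # lookup table: fingerprint -> segment bytes
        self.files = {}     # file name -> ordered list of fingerprints (pointers)

    def write(self, name, data):
        """Store data under name and return the number of new bytes kept."""
        pointers = []
        new_bytes = 0
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # Identifier not found: add it to the table and store the data.
                self.segments[fingerprint] = segment
                new_bytes += len(segment)
            # Found or not, the file only records a pointer to the segment.
            pointers.append(fingerprint)
        self.files[name] = pointers
        return new_bytes

    def read(self, name):
        """Rebuild a file by following its pointers."""
        return b"".join(self.segments[fp] for fp in self.files[name])


store = DedupStore(segment_size=4)
print(store.write("monday", b"AAAABBBBCCCC"))   # 12: all three segments are new
print(store.write("tuesday", b"AAAABBBBDDDD"))  # 4: only the last segment is new
print(store.read("tuesday"))                    # b'AAAABBBBDDDD'
```

A post-process design would differ only in timing: every segment would be written first, and the lookup and elimination would run later over the stored data.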


While there are many small variants to the deduplication process, at a high level this is how it occurs. The granularity with which redundancies can be identified and the speed at which the inspection and lookup process occurs vary significantly and are key differentiators between vendors’ products.


For example, deduplicating at the file level does not require much effort by the software, but it also doesn’t yield the data reduction that identifying redundant segments within the file does. If the same database is copied to a deduplication device on successive days, a file-level deduplication device will see two separate files and store both in full. A segment-level deduplication device will see the same file with changes inside and store only those changes, while establishing pointers to the redundant data.
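
To make the difference concrete, here is a rough sketch that measures how many bytes would be kept for two daily copies of the same "database" under each approach. The function name stored_bytes, the 100 KiB data set, and the 1 KiB fixed segment size are assumptions chosen for illustration only.

```python
import hashlib


def stored_bytes(copies, segment_size=None):
    """Bytes kept after deduplicating a list of bytes objects.

    segment_size=None fingerprints whole files (file-level);
    otherwise each file is cut into fixed-size segments.
    """
    seen = set()
    total = 0
    for data in copies:
        size = segment_size or max(len(data), 1)
        for offset in range(0, len(data), size):
            piece = data[offset:offset + size]
            fingerprint = hashlib.sha256(piece).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                total += len(piece)
    return total


# Day 1: a 100 KiB "database" made of 100 distinct 1 KiB regions.
day1 = b"".join(bytes([i]) * 1024 for i in range(100))
# Day 2: the same database with a 10-byte in-place update.
changed = bytearray(day1)
changed[5000:5010] = b"X" * 10
day2 = bytes(changed)

file_level = stored_bytes([day1, day2])
segment_level = stored_bytes([day1, day2], segment_size=1024)
print(file_level)     # 204800: both copies kept in full
print(segment_level)  # 103424: one full copy plus the single changed segment
```

Note that this toy uses fixed-size segments and an in-place update. An insertion that shifts the rest of the file would change every later fixed-size segment, which is one reason some systems use variable-size segments.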


Think of these pointers in much the same way as references in a relational database: the database does not copy all the customer information into each invoice; it uses a pointer to fetch the customer information from a separate table. Deduplication systems essentially make a relational database out of the segments within files.



Uses for Deduplication


While it may seem obvious, for deduplication to be effective there needs to be redundant data. There is probably no other storage repository within the data center that holds more redundant data than backup storage. Most data centers perform full backups on a weekly or, at a minimum, monthly basis, and most of the data within those full backups is identical to the previous full backup. This is prime territory for deduplication and is the reason why backup was the first market where deduplication was adopted.


As deduplication technologies like those from Data Domain began to come to market, there was increasing interest in using inexpensive SATA disk as part of the backup process. The problem was that while the new SATA technology was less expensive than primary Fibre Channel disk, it was not as inexpensive as tape media. If disk was used, its primary function was as a cache to store backups temporarily and then stream that data to tape.


The disappointment with this strategy was that while disk backup helped reduce backup window times, other bottlenecks, like network speeds and the ability of backup clients to generate data streams, limited how effective disk could be.

Where disk backup could be more beneficial was in helping with recovery. Disk did not have to search through tape sequentially to find data; it could position directly to where the data was. This eliminates one of the slowest points in the recovery process: getting to the data.


Deduplication makes a significant improvement in disk backup. It allows data to be stored on disk more efficiently and cost effectively. As stated earlier, most full backups are highly redundant. Even daily backups tend to have significant redundancy in them. For example, a backup application considers a database or Exchange store to be an entirely new file each day. As described earlier, a deduplication device will only store the differences between the two files. This results in efficiency gains even on a daily basis.


The overall impact is that deduplication allows months’ worth of backups to be stored on disk capacity that is only a little larger than the size of a single full backup. As a result, disk is now cost effective and backups can be restored from disk as opposed to tape.
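
A back-of-the-envelope calculation shows why. The figures below (a 10 TB full backup, 2% of the data changing between weekly fulls, 12 weeks of retention) are purely illustrative assumptions, as is the helper name disk_needed_tb, and the model ignores daily backups, compression and metadata overhead.

```python
def disk_needed_tb(full_tb, weekly_change, weeks, dedup=True):
    """Rough disk capacity needed to retain `weeks` weekly full backups."""
    if dedup:
        # One complete copy, plus only the changed data for each later full.
        return full_tb + full_tb * weekly_change * (weeks - 1)
    # Without deduplication every full backup is stored in its entirety.
    return full_tb * weeks


with_dedup = disk_needed_tb(10, 0.02, 12)
without_dedup = disk_needed_tb(10, 0.02, 12, dedup=False)
print(f"{with_dedup:.1f} TB with deduplication")
print(f"{without_dedup:.1f} TB without deduplication")
```

Under these assumptions, three months of weekly fulls fit in roughly 12.2 TB instead of 120 TB. Real results depend heavily on the actual change rate and the data types involved.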


A second capability that deduplication enabled was the electronic vaulting of data. With standard disk-based backups, the entire backup job is stored as a series of new and very large files. There was no way to identify the changed data within those files, and they were too large to replicate across traditional WAN segments. Data deduplication, however, because it only stores changed blocks or segments of data, can very easily replicate only those changes to a remote DR site. Deduplication’s biggest payoff may be more in the enablement of disaster recovery than in its ability to store multiple backups on disk.
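
The replication idea can be sketched as follows. This is a simplified model, not a description of any product's protocol: two in-memory dictionaries stand in for the primary and DR sites, and the names fingerprint and replicate are hypothetical. A real system would exchange fingerprints over the WAN first and then send only the segments the remote site reports as missing.

```python
import hashlib


def fingerprint(segment):
    """Identifier for a segment (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(segment).hexdigest()


def replicate(local_segments, remote_segments):
    """Copy only the segments the remote site lacks; return bytes sent."""
    sent = 0
    for fp, segment in local_segments.items():
        if fp not in remote_segments:
            remote_segments[fp] = segment
            sent += len(segment)
    return sent


# The DR site already holds yesterday's segments; today added one new segment.
primary_site = {fingerprint(s): s for s in (b"AAAA", b"BBBB", b"DDDD")}
dr_site = {fingerprint(s): s for s in (b"AAAA", b"BBBB")}

print(replicate(primary_site, dr_site))  # 4: only the new segment crosses the WAN
print(replicate(primary_site, dr_site))  # 0: the sites are already in sync
```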


Companies like Data Domain are investing heavily in advancing the replication capabilities of their solutions, including many-to-one replication, which is ideal for remote offices, and support for software APIs like Symantec’s OST that enable the backup application to control and understand the replication process.



Beyond Backups


The discussion around deduplication is moving well beyond backup into using the technology for archives, secondary NAS storage and even primary storage. It is important to note that as deduplication moves up the storage food chain, data redundancy becomes less common and combining deduplication with compression becomes more important. Most data is compressible to some degree. Deduplication is more effective than compression, but it only works if there is redundant data.
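
A small experiment illustrates the point. The sketch below applies fixed-size segment deduplication followed by zlib compression to two synthetic data sets; the function name dedup_then_compress and both data sets are assumptions made for the example and are far more regular than real data.

```python
import hashlib
import zlib


def dedup_then_compress(data, segment_size=4096):
    """Return (raw, after_dedup, after_dedup_and_compression) sizes in bytes."""
    unique = {}
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        # Keep only the first copy of each distinct segment.
        unique.setdefault(hashlib.sha256(segment).digest(), segment)
    after_dedup = sum(len(s) for s in unique.values())
    after_both = sum(len(zlib.compress(s)) for s in unique.values())
    return len(data), after_dedup, after_both


# Backup-like data: the same 4 KiB segment repeated 50 times.
backup_like = (b"A" * 4096) * 50
# Primary-like data: 50 segments that are all different from each other.
primary_like = b"".join(bytes([i]) * 4096 for i in range(50))

print(dedup_then_compress(backup_like))
print(dedup_then_compress(primary_like))
```

For the backup-like data, deduplication alone reduces 50 segments to one. For the primary-like data there are no duplicate segments, so deduplication saves nothing and all of the reduction comes from compression.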


As deduplication systems begin to be used for primary storage, look for more data-related services than you would expect in backup devices. Examples are snapshots, high availability and, in the case of compliance demands, write once, read many (WORM) capabilities.



Deduplication is Reliable


In the early days, when deduplication was still in the emerging technology category, the biggest hurdle to implementing deduplication in an environment was trust. After all, it seems like the technology openly purports not to store all data. What is actually happening is a series of mathematical calculations to determine whether the data being sent to the deduplication system has already been stored. In some cases there are more data integrity checks than in traditional storage systems. As a result, deduplication systems have proven themselves to have high levels of data integrity, to the point where large, risk-averse enterprises have now embraced the technology.



Deduplication Now


The time is now to seriously consider deduplication as a strategy for the data center. The products have been well vetted and are showing signs of increased maturity. They are reliable and trustworthy. Probably the best place to get started with deduplication is as an extension of the backup process. With little change to the backup routine, the benefits of deduplication and, possibly more importantly, replication of backups can be realized. Then, as the comfort level rises, using deduplication in other modes like archive or tier-two NAS can be explored.

George Crump, Senior Analyst

This Article Sponsored by Data Domain