Where should Data De-Duplication happen?
The single hottest buzzword in the storage industry right now is Data De-Duplication. It seems like we have gone from just a few vendors having De-Duplication capabilities to everyone having it. In addition, it is not just limited to storage any longer; WAN products are using a similar technology to reduce the amount of bandwidth required to keep remote offices and data centers in sync.
(New to Data De-duplication? Click here for my article explaining how it works)
As acceptance of the technology by the end user community has increased, so has the number of storage manufacturers that have entered the de-dupe market. Candidly, I thought that by this point in 2007 we would have been flooded with de-duplication options from various manufacturers, but that has not happened. Apparently, mastering de-duplication and delivering a reliable product was harder than we thought; the early innovators still stand alone with viable working solutions, while the new entrants are stumbling out of the gate.
As the market matures, we are beginning to see battle lines drawn over where data de-duplication is performed. There are essentially three scenarios: the data can be de-duped before it is sent across the network to the backup device's storage, it can be de-duped as it is being received by the backup device's storage, or it can be de-duped after it has been written to the backup device's storage and all the writes are complete. What's the best method of data de-duplication? Many times the answer is "it depends", but this time I believe there is a clear winner.
From a purist's point of view, de-duplicating data before it crosses the storage network or IP network always seemed to be the technically correct place to do it. It makes sense that you would rather eliminate duplicate data before it goes across your network. That would mean a reduction in network bandwidth, which in turn means shorter backup windows and fewer backup resources. The real world has a way of proving me wrong. De-duplicating before transmission requires that an agent be installed on the server being backed up. That server then has to do the work of identifying all the redundant data itself. What I have seen countless times is that de-duplication done before transmission causes an unacceptable performance impact on the servers that are sending the data. This fact alone relegates the "de-dupe before transmission" category to small remote office backups. In the data center, doing de-duplication on the de-dupe appliance has proven time and time again to be more efficient. Additionally, a de-dupe before transmission product almost always becomes a replacement for your current backup application, something you may not want or be willing to accept. At a minimum, most sites have to run their standard backup application in parallel with the de-dupe backup.
Now the great debate is when in the process the de-duplication should be done on the appliance. You will hear two methods: in-line and post ingestion (also called post process). Basically, you can do the de-duplication as the data arrives at the backup device's storage and before it is written to disk (in-line), or you can write the data to disk on the backup device's storage first and then run a "back end process" to de-dupe it (post ingestion). In-line de-duplication is pretty self explanatory: as data is being received by the appliance, it is checked for redundant data blocks before anything is written to disk. With the post ingestion products, all the data is received and written by the appliance first; then, as a background process, the appliance is crawled to look for redundant blocks of data.
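To make the difference between the two methods concrete, here is a minimal sketch in Python. It is not how any particular appliance works: the fixed 4KB block size, the SHA-256 fingerprints, and the class names InlineStore and PostProcessStore are all assumptions made for illustration. Each store tracks its peak disk usage so the two approaches can be compared.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products differ


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a backup stream into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """Identify a block by its SHA-256 digest."""
    return hashlib.sha256(block).hexdigest()


class InlineStore:
    """Hypothetical in-line appliance: de-dupes before anything is written."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> block bytes (the "disk")
        self.peak_bytes = 0

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())

    def ingest(self, data):
        recipe = []  # ordered fingerprints needed to rebuild this backup
        for block in split_blocks(data):
            fp = fingerprint(block)
            if fp not in self.blocks:   # only new blocks reach the disk
                self.blocks[fp] = block
            recipe.append(fp)
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())
        return recipe


class PostProcessStore:
    """Hypothetical post process appliance: land raw data, de-dupe later."""

    def __init__(self):
        self.landing = []     # raw backups waiting for the back end pass
        self.blocks = {}
        self.peak_bytes = 0

    def stored_bytes(self):
        raw = sum(len(d) for d in self.landing)
        return raw + sum(len(b) for b in self.blocks.values())

    def ingest(self, data):
        self.landing.append(data)       # the full raw copy is written first
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())

    def dedupe_pass(self):
        recipes = []
        while self.landing:
            recipe = []
            for block in split_blocks(self.landing[0]):
                fp = fingerprint(block)
                if fp not in self.blocks:
                    self.blocks[fp] = block
                recipe.append(fp)
            # The raw copy still occupies disk until its pass has finished.
            self.peak_bytes = max(self.peak_bytes, self.stored_bytes())
            self.landing.pop(0)
            recipes.append(recipe)
        return recipes


# Two nightly backups that share their first block.
night1 = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
night2 = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE

inline = InlineStore()
inline_recipes = [inline.ingest(night1), inline.ingest(night2)]

post = PostProcessStore()
post_recipes = []
for backup in (night1, night2):
    post.ingest(backup)
    post_recipes.extend(post.dedupe_pass())

# Either store can rebuild a backup from its recipe.
restored = b"".join(inline.blocks[fp] for fp in inline_recipes[1])
```

Both stores end up holding the same three unique blocks, but the post process store needed more disk at its peak because the raw copy and the de-duped blocks existed side by side.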
One of the common misconceptions is that doing de-duplication as the data arrives at the appliance is going to require a lot of upfront CPU and memory resources. While it is true that these appliances "peg" the memory and CPU, their ingestion rates have continued to increase. The key enabler of that is efficient software. To simplify, if the software or file system has been written from the ground up to handle data de-duplication, then it can handle in-line ingestion much more efficiently. In addition to efficient software, the in-line de-duplication appliances also benefit from the ever improving speed of processors. Most, if not all, of the post process data de-duplication products were not written from the ground up to handle data de-duplication, which is why they do it as a back end process. They are trying to bolt the de-dupe feature on to existing code or an existing file system, so the only way to do that and still maintain some level of performance is to do the de-duplication as a post process.
I recommend in-line data de-duplication in almost all cases. If the net result is that you end up with de-duped data either way, why should you care? There are about ten factors that I would be concerned about with post process data de-dupe, but for the purposes of this article I will highlight the two biggest. First, with post process data de-duplication you need extra storage space. You must have enough storage to receive the un-de-duplicated data, you need enough storage to hold the actual de-duplicated data, and typically you need some sort of intermediary storage as a temporary working area. There always has to be enough excess capacity to store the inbound un-de-duplicated data. As backup and archive sets continue to grow, this not only requires more storage but also makes it more complex to determine how much storage you actually need for the process.
For example, say you decide to dump a copy of your ERP database to this type of device every night. The ERP backup file is 600GB, but it has an effective daily change rate of less than 5GB. In the post process world, you always need to make sure that you have 600GB of free space to store the incoming file, plus room to store the final de-duped data, plus some scratch area while the duplicate data is being identified. With an in-line appliance the redundancy is determined at ingestion and only the 5GB of new or changed data is written. In addition, be aware that post write data de-duplication does not happen instantly. It takes time to go through the various files (disk seeks take time) and compare them to what has already been stored to determine which data is redundant. This matters even more if you are doing multiple dumps across multiple file systems.
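The arithmetic behind the ERP example can be sketched in a few lines of Python. The 600GB backup size and 5GB daily change come from the example above; the 50GB scratch area, the 30 night horizon, and the assumption that the first full backup is stored without any reduction are hypothetical figures chosen only to show the shape of the calculation.

```python
BACKUP_GB = 600        # nightly ERP dump, from the example above
DAILY_CHANGE_GB = 5    # effective daily change, from the example above
SCRATCH_GB = 50        # assumed working area; actual needs vary by product


def deduped_store_gb(nights):
    """De-duped data on disk after a number of nightly backups.

    Assumes the first full backup is stored whole and every later
    night adds only the changed data.
    """
    return BACKUP_GB + DAILY_CHANGE_GB * (nights - 1)


def inline_capacity_gb(nights):
    """In-line: only the de-duped store ever lands on disk."""
    return deduped_store_gb(nights)


def post_process_capacity_gb(nights):
    """Post process: raw landing copy, de-duped store and scratch coexist."""
    return BACKUP_GB + deduped_store_gb(nights) + SCRATCH_GB


inline_needed = inline_capacity_gb(30)
post_needed = post_process_capacity_gb(30)
extra_needed = post_needed - inline_needed
```

Under these assumptions the gap between the two methods stays constant at the size of one raw backup plus the scratch area, no matter how many nights are retained.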
The second and possibly biggest issue is replication of the de-duped data, and it builds on the problem with post ingestion data de-duplication described above. One of the key advantages of de-duplication is that when it is time to replicate the data after the first sync, you only need to replicate the non-redundant data. That greatly reduces the amount of data sent, which in turn reduces the bandwidth required and ultimately leads to an accelerated time to DR. In many cases this actually makes the concept of an electronically created backup or archive vault viable. (Click here for more details on electronic vaulting). With post ingestion de-duplication, you have to wait until the de-duplication process has completed before replicating the data across the WAN. With the time delays involved in doing data de-duplication after the write, this can be very problematic. To be valid, the replication to the remote site has to finish before the next backup process begins at the primary site. My expectation is that a site with even a modest ingest rate will not be able to complete the data movement before the next backup starts.
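A rough way to check whether replication fits in the window is to add up the hours. The sketch below is only an illustration: the 6 hour backup, 10 hour post process pass, 50GB of changed data and 10 megabit WAN link are hypothetical numbers, and the model ignores protocol overhead and the fact that an in-line appliance may be able to start replicating while the backup is still running.

```python
HOURS_BETWEEN_BACKUPS = 24  # assumed nightly schedule


def transfer_hours(changed_gb, wan_mbps):
    """Hours to send the changed data over the WAN (1GB = 8000 megabits)."""
    return changed_gb * 8000 / wan_mbps / 3600


def replication_finish_hour(backup_hours, dedupe_hours, changed_gb, wan_mbps):
    """Hours after backup start at which the remote copy is complete."""
    return backup_hours + dedupe_hours + transfer_hours(changed_gb, wan_mbps)


def fits_window(finish_hour):
    """Replication is only valid if it ends before the next backup starts."""
    return finish_hour <= HOURS_BETWEEN_BACKUPS


# Hypothetical site: 6 hour backup, 50GB of unique data, 10 Mbps link.
inline_finish = replication_finish_hour(6, 0, 50, 10)   # no back end pass
post_finish = replication_finish_hour(6, 10, 50, 10)    # assumed 10 hour pass
```

With these made-up figures the in-line case finishes inside the 24 hour window and the post process case does not. The point is not the specific numbers but that the back end pass is added directly to the time before replication can complete.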
That leaves me with the conclusion that if you are looking for Disk to Disk Backup or Archiving and want to leverage data de-duplication, then I suggest you look for a supplier that does in-line data de-duplication, especially if you are then going to replicate that data for the creation of an off-site electronic vault.
Sunday, August 12, 2007