What is Data De-Duplication?

 

This will be the first in an on-going series of articles that define a feature or capability of certain storage platforms. The goal is that in conjunction with the web site's search feature you will be able to come back and ask, for example, "What is RAID 6?" and the correct article will appear answering your question. Also, many of the subjects covered could be white papers in and of themselves. The intent here is to give you a quick overview.


This article will explore data de-duplication, a buzzword that you have probably seen thrown around a lot. I consider data de-duplication to be an established (not emerging) technology that will have an increasingly large role in customers' data centers. The first incarnations of this, and rightfully so, have centered on disk-to-disk backup, but I expect the scope of the technology to expand to archiving, secondary NAS servers and eventually to primary file systems themselves.


Essentially, data de-duplication is the ability of an appliance, or of a software application running on a server with disk attached, to compare blocks of data being written to it with the data blocks that already reside on it. If duplicate data is found, a pointer is established to the original data instead of storing the duplicate blocks, removing or "de-duplicating" the redundant blocks from the volume. The key part of this is that the de-duplication is done at the block level, typically not at the file level. In fact, beware of products that only de-dupe at the file level (more on that later). Say, for example, you are backing up a very large database that changes throughout the day. As you know, with the typical backup application you have to back up, and more importantly store, the entire database with each backup. An incremental won't help you here, because the database file has changed and so is copied in full. With block-level de-duplication you can back up the same database to the device on two successive nights and, because the device can identify redundant blocks, only the blocks that have changed will be stored. All the redundant data will simply have pointers established.
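
To make the database example concrete, here is a minimal sketch in Python of block-level de-duplication. It is an illustration only, not any supplier's implementation: the 4 KB fixed block size, the use of SHA-256 as the hash, and the names BlockStore and make_database are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class BlockStore:
    """Toy de-duplicating store: unique blocks keyed by their SHA-256 digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes, each unique block stored once

    def write(self, data):
        """Store data and return its 'recipe': a list of digests (the pointers)."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Keep the block only if this digest has not been seen before.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[d] for d in recipe)


def make_database(changed_block=None):
    """Build a fake 100-block 'database'; optionally overwrite one block."""
    blocks = [bytes([i]) * BLOCK_SIZE for i in range(100)]
    if changed_block is not None:
        blocks[changed_block] = b"\xff" * BLOCK_SIZE
    return b"".join(blocks)


store = BlockStore()
night1 = store.write(make_database())
stored_after_night1 = len(store.blocks)
night2 = store.write(make_database(changed_block=42))
stored_after_night2 = len(store.blocks)

# The second night adds only the one block that changed.
print(stored_after_night1, stored_after_night2)
```

Both nightly backups can still be restored in full from their recipes, even though the second night added only a single new block to the store.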


This ability has fundamentally changed how disk-to-disk backup can be used in the overall backup strategy. Without data de-duplication, the disk part of disk-to-disk backup is just a cache: even with today's prices on SATA hard disks, the cost to house that data for any significant length of time is prohibitive. Most customers can only store 1-2 weeks' worth of data on disk before they have to spool to tape and free up disk space.


Much of the market uses a hashing algorithm to do the data comparison and I think we will begin to see some competitive information on whose algorithm is more efficient. As the use of these technologies continues to expand, performance of the algorithm will become increasingly important.


A second area of debate that has arisen is where and when to do the actual data comparison. The current options are to do the comparison before the data is sent to the appliance, or to do it on the appliance itself. The advantage of doing the comparison before sending data to the appliance is that it substantially lowers overall use of the network in addition to optimizing storage. In my experience there are two big challenges to this approach. First, having the client do this comparison is VERY CPU intensive and, especially on busy servers, can cause significant problems. Second, it requires replacing your current backup application, unless your backup application and data de-duplication appliance come from the same supplier, and even then only maybe: the one supplier with this potential has not yet integrated the two components.
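
The trade-off of comparing before sending can be sketched in a few lines of Python. This is a hypothetical model, not a real product's protocol: the Appliance class, its missing and store methods, and the 4 KB block size are assumptions for illustration. The client does all the hashing (the CPU cost) and in return ships only the blocks the appliance has never seen (the network saving).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size


class Appliance:
    """Hypothetical target that can tell a client which blocks it already holds."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def missing(self, digests):
        """Return the digests the appliance does not hold yet."""
        return {d for d in digests if d not in self.blocks}

    def store(self, digest, block):
        self.blocks[digest] = block


def split(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def client_backup(data, appliance):
    """Hash every block on the client, then send only the unknown blocks.

    Returns the number of block bytes sent over the 'network'. The digests
    themselves also cross the wire, but they are tiny next to the blocks.
    """
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in split(data)}
    sent = 0
    for digest in appliance.missing(by_digest):
        appliance.store(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


appliance = Appliance()
night1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(50))
# Second night: the last two of the 50 blocks have changed.
night2 = night1[:48 * BLOCK_SIZE] + b"\xfe" * BLOCK_SIZE + b"\xff" * BLOCK_SIZE

sent_night1 = client_backup(night1, appliance)
sent_night2 = client_backup(night2, appliance)
print(sent_night1, sent_night2)
```

On the first night every block crosses the network; on the second only the two changed blocks do, but the client still had to hash all fifty.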


Performing the data comparison at the appliance itself has one major challenge: all the data has to go across the network. At first I thought this would be a major issue, but if you think about it, that is already happening with backup applications today. We currently move all the data across the wire. The advantages of doing de-duplication this way are huge. First, there is no change in backup software applications. Second, there is no additional CPU impact on the servers being backed up; the same backup client agents are used, and all the work is done at the data de-duplication appliance. Third, since the backup application is in full control of the process, it knows exactly what is going on. These products don't encounter the control problems some VTL and disk backup products have had around movement to tape, cloning, virtual vs. physical tapes, etc. Movement to tape, when desired, is more straightforward: it is basically a cloning process, well understood and supported by all major backup software applications. Lastly, since there is very little change to the backup infrastructure, I've seen the installation of these products go much more smoothly. You still want to have someone who knows what they are doing perform the installation. It's not the installation of the hardware itself that is difficult, but the integration with your backup software and the tuning for maximum performance that can take some work.


Most suppliers seem to be leaning toward doing data comparisons at the appliance, but this now seems to be fueling another debate (to be honest, as a consultant, that's good for business). The question is "WHEN should you do the data comparison?" There are two options here. Synchronously: do the data comparison and eliminate redundant data as it comes in, before it is stored. Or asynchronously: store the data and then eliminate redundancy as you start to run out of disk space, or maybe at a predetermined time each night or over the weekend. The advantage of the synchronous method is that it maintains the efficiency of the storage model, but at the cost of performance on the appliance. Clearly, we have seen these devices limited by how much data they can ingest, and in the synchronous model this limits how large a backup data set they can handle, causing a proliferation of extra appliances to spread out the data ingest. That said, de-duplication throughput has steadily improved over the past few years, and the backup data set size these devices can handle has grown with it. For many customers these devices are faster than their network or backup clients can deliver. By comparison, the advantage of the asynchronous model would be performance at the cost of storage efficiency. This should mean that asynchronous appliances will not require as much horsepower as synchronous appliances, possibly making them less expensive and requiring fewer appliances overall. Even so, I think that they will run into similar inbound bandwidth issues. Again, most often it is the network and the clients that are the issue. See my article on backup performance tuning.
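
The difference between the two timing models can be sketched as follows. This is a simplified Python illustration under stated assumptions: the class names InlineAppliance and PostProcessAppliance, the landing-area design, and the SHA-256 hash are mine, not a description of any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size


def digest(block):
    return hashlib.sha256(block).hexdigest()


class InlineAppliance:
    """Synchronous model: hash and de-duplicate each block before storing it."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def ingest(self, blocks):
        for b in blocks:
            # Hashing sits directly in the ingest path, so it limits throughput.
            self.blocks.setdefault(digest(b), b)

    def bytes_used(self):
        return sum(len(b) for b in self.blocks.values())


class PostProcessAppliance:
    """Asynchronous model: land the raw data first, de-duplicate later."""

    def __init__(self):
        self.landing = []  # raw blocks, not yet de-duplicated
        self.blocks = {}   # digest -> block bytes

    def ingest(self, blocks):
        self.landing.extend(blocks)  # no hashing in the ingest path

    def deduplicate(self):
        """Run later, e.g. overnight or when disk space runs low."""
        for b in self.landing:
            self.blocks.setdefault(digest(b), b)
        self.landing.clear()

    def bytes_used(self):
        raw = sum(len(b) for b in self.landing)
        return raw + sum(len(b) for b in self.blocks.values())


backup = [bytes([i]) * BLOCK_SIZE for i in range(10)]

inline = InlineAppliance()
post = PostProcessAppliance()
for _ in range(2):  # the same backup arrives on two successive nights
    inline.ingest(backup)
    post.ingest(backup)

inline_used = inline.bytes_used()
post_used_before = post.bytes_used()
post.deduplicate()
post_used_after = post.bytes_used()
print(inline_used, post_used_before, post_used_after)
```

The inline appliance never holds more than one copy of each block, while the post-process appliance temporarily needs room for both full backups before its de-duplication pass brings it down to the same footprint.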


So why is de-duplication important to you?

First, this is really the only practical way to do disk-to-disk backup. Without data de-duplication, the expense and maintenance of enough disk space limits disk backup solutions to a cache area where data sits until it can be written off to tape. I would go so far as to say that if you are considering a disk-to-disk backup strategy and do not have data de-duplication as a primary requirement, then stop and put it at the top of your list. I say this so strongly because it enables every aspect of disk-to-disk backup that most customers are looking for: decreasing the backup window, improving backup reliability, providing the ability to do most restores from disk, reducing the number of physical tape drives required, reducing the quantity of physical media required, and the ability to electronically move backup data off-site (see the electronic vaulting article).


Lastly, this technology will eventually be used everywhere. As I previously indicated, while the initial market for this has been disk-to-disk backup and archive/compliance, the next big market will be NAS or file services. The de-duplication example I gave above applies not only to databases but also to standard office productivity files. (Has anyone seen a small PowerPoint presentation file lately?) These files seem to multiply as well. For example, imagine I sent your entire IT staff a copy of my Storage Market Analysis presentation and they each decided to store it in their home directory with a different file name. To a backup application running an 'incremental backup', these are each different files. To a data de-duplication device, that is one file being stored with multiple pointers. Now imagine having that kind of storage efficiency on your file servers as well as in your disk-to-disk backup process.
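
The presentation example can be sketched in Python as a catalog of pointers into a single stored copy. For brevity this hashes whole files rather than blocks, and the user names, paths and file contents are made up for the illustration.

```python
import hashlib

# Stand-in for one large presentation file (contents are made up).
presentation = b"Storage Market Analysis" * 1000

contents = {}  # digest -> file bytes, stored once
catalog = {}   # path -> digest (the "pointer")

for user in ["alice", "bob", "carol"]:  # hypothetical IT staff
    path = f"/home/{user}/{user}-copy.ppt"  # same file, different name
    d = hashlib.sha256(presentation).hexdigest()
    contents.setdefault(d, presentation)
    catalog[path] = d

logical = sum(len(contents[d]) for d in catalog.values())  # what users see
physical = sum(len(c) for c in contents.values())          # what disk holds
print(len(catalog), "files,", len(contents), "stored copy,", logical // physical, "to 1")
```

The file names play no part in the comparison, which is why renaming a copy does not defeat the de-duplication the way it defeats an incremental backup.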


Data de-duplication is a technology whose time has come and should be considered a key requirement for all backup solutions going forward. For a complete comparison of the suppliers in this market and which would be the best fit for your environment, please email me at georgeacrump@mac.com

 

Wednesday, May 9, 2007

 
 