Deduplication by Backup Software


The lack of integration has led some software developers to bring the burden of deduplication, replication and validation into the backup software itself as a feature. This gives the backup software total control over those functions, but that control comes with its own set of issues. If the data protection software was written from the ground up with deduplication in mind, it can mitigate many of these issues, but when the capability is bolted on via a separate software function it can cause problems.


First, the performance of the backup application may be impacted, since it now also carries the responsibilities of the add-on deduplication module and the communication between that module and the original application. This back and forth can lead to longer backup and recovery windows. One way to fix such a performance issue is to upgrade the backup server hardware, but migrating backup installations to new server platforms can be a challenging process, in addition to the obvious hard costs.


Second, when software handles all the deduplication there’s the chance that the customer may feel locked in to a single backup solution. This is because, in addition to the backed-up data being stored in a proprietary format, a proprietary deduplication engine is also being used to make the associations between redundant data. If the need to move to a new backup application arises, this makes the move even more challenging and less likely to happen. It also runs counter to the whole movement towards flexible data centers. Being stuck with one software application is not the goal of these initiatives.


The major challenge with software systems doing the hard work of deduplication is that in almost every data center of any size there are multiple backup applications in place. This is especially true with the rise of server virtualization. Often, the legacy or non-virtualized environment is protected by traditional backup applications and the virtual environment is protected by a new breed of applications that are more ‘virtualization savvy’. Even without virtualization, it’s not uncommon to find three or more backup applications, and maintaining separate ‘silos of deduplication’ for each is extremely inefficient. Even when the perfect backup application is found, there are always legacy applications that need to be dealt with.


Deduplication delivers its return on investment by eliminating redundant data, meaning the more data the deduplication engine has to compare against, the better the return. If parts of that comparison are siloed off, its effectiveness is reduced. This is especially true in the backup and archive processes, where much of the data is similar and it’s not uncommon for the exact same data to be protected multiple times by different backup processes. Archived data is especially likely to be identical to data that has already been backed up. If all this data can come from multiple data sources but land on a common deduplicated store, then the efficiency of deduplication increases significantly and the return on the investment happens much sooner and many times over.
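The effect of silos can be illustrated with a minimal Python sketch. It assumes fixed-size chunking and SHA-256 fingerprints, which are simplifications chosen for readability and not a description of any vendor's engine. The three data streams and the function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, chosen only to keep the example readable


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and fingerprint each chunk with SHA-256."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def unique_chunks_stored(streams: list[bytes]) -> int:
    """Count the chunks one deduplicated store keeps after ingesting all streams."""
    store: set[str] = set()
    for stream in streams:
        store.update(chunk_fingerprints(stream))
    return len(store)


# Hypothetical data: three applications protecting largely the same content.
legacy_backup = b"AAAABBBBCCCCDDDD"
virtual_backup = b"AAAABBBBCCCCEEEE"
archive_copy = b"AAAABBBBCCCCDDDD"
streams = [legacy_backup, virtual_backup, archive_copy]

# One deduplication silo per application: nothing is shared between stores.
siloed_total = sum(unique_chunks_stored([s]) for s in streams)

# One common deduplicated store shared by all three applications.
common_total = unique_chunks_stored(streams)

print(f"chunks stored in silos: {siloed_total}, in a common store: {common_total}")
```

In this toy case the siloed stores keep 12 chunks between them while the common store keeps 5, because the archive copy and most of the virtual backup duplicate chunks the store has already seen.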



Integrating Deduplication with Backup Software


What data centers need is the best of both worlds: the universal deduplication efficiency of a disk-to-disk system with the integration and control of a backup application. There is evidence of this now through companies like EMC. They are introducing products that not only allow backup applications to have command and control over the backup appliance, but also add capabilities to the backup application itself that optimize performance and scalability of the deduplication system.


Although there are many new deduplication appliances coming to market that do the basics of combining a disk system with deduplication intelligence, getting to this integrated capability takes a serious developmental commitment. The first step is to either create or take advantage of an interface to the backup applications. Symantec has provided this through its OpenStorage API set, and EMC Software, while it does not offer a formal API set, has the ability to support backup appliance integration.


The first step of integration, which many disk backup vendors have not even taken yet, is to allow the backup software to recognize the disk backup appliance as disk and to recognize the unit’s abilities, like deduplication, replication and verification. Then, the software can manage and trigger these capabilities as needed. For example, replication can be started at a given point in the backup process or when a certain hour of the day is reached. The software should also then be able to recognize the remote unit and factor its availability into the recovery process. Without this integration, replication can still happen, but a significant amount of configuration work needs to take place at the remote location to get the backup application to recognize the unit, something that’s not ideal in a disaster recovery situation. Going further, the backup application may be able to use the remote unit differently than the local unit. For example, the local disk backup system could be sized to house only the most recent copies of data and the remote system could be sized for longer-term retention.
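As a rough illustration of the kind of control described above, the following Python sketch models a replication trigger and a choice of restore source. Every name in it (ReplicationPolicy, should_replicate, restore_source, the "backup_complete" event) is hypothetical and does not correspond to any real backup product's API. It also assumes a simple rule in which the local unit is preferred whenever it still holds the requested copy.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class ReplicationPolicy:
    """Hypothetical policy: replicate on a backup event, at an hour of day, or both."""
    trigger_event: Optional[str] = None   # e.g. "backup_complete" (made-up event name)
    trigger_time: Optional[time] = None   # replicate once this time of day is reached


def should_replicate(policy: ReplicationPolicy, event: str, now: datetime) -> bool:
    """Return True if the current event or the current time satisfies the policy."""
    if policy.trigger_event is not None and event == policy.trigger_event:
        return True
    if policy.trigger_time is not None and now.time() >= policy.trigger_time:
        return True
    return False


def restore_source(
    backup_age_days: int,
    local_retention_days: int,
    remote_retention_days: int,
    local_available: bool = True,
    remote_available: bool = True,
) -> Optional[str]:
    """Pick the unit to restore from, factoring in retention and availability."""
    if local_available and backup_age_days <= local_retention_days:
        return "local"
    if remote_available and backup_age_days <= remote_retention_days:
        return "remote"
    return None  # no unit still holds a copy of that age


policy = ReplicationPolicy(trigger_event="backup_complete", trigger_time=time(22, 0))
print(should_replicate(policy, "backup_complete", datetime(2024, 1, 1, 9, 0)))
print(restore_source(backup_age_days=90, local_retention_days=14, remote_retention_days=365))
```

The second function reflects the sizing idea in the text: a short retention window on the local unit for recent copies, a longer one on the remote unit, and a fallback to the remote unit when the local one is unavailable.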


Beyond system management, integration could also help with performance and scale, of which Data Domain’s DD Boost Technology is an excellent example. As Storage Switzerland discussed in a recent briefing report, this technology leverages the backup application to eliminate some of the redundant data before it is ever transferred to the disk backup system, increasing network efficiency and improving the performance of the backup server by saving I/O cycles. This level of integration could also be used to load balance or even cluster multiple independent systems and provide greater scale to the disk backup environment. This allows the disk backup system to grow to match capacity and performance demands without having to move to an overly complex and sometimes cost-inefficient scale-out hardware design.
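The general idea of eliminating redundant data before it crosses the network can be sketched in a few lines of Python. This is a generic illustration of source-side deduplication, not the DD Boost protocol, whose internals are not described here. The DedupTarget class, its methods, and the fixed-size chunking are all assumptions made for the example.

```python
import hashlib


class DedupTarget:
    """Hypothetical stand-in for a deduplicating disk backup system."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def missing(self, fingerprints: list[str]) -> set[str]:
        """Report which fingerprints the target has not stored yet."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def store(self, fingerprint: str, chunk: bytes) -> None:
        self.chunks[fingerprint] = chunk


def backup(data: bytes, target: DedupTarget, chunk_size: int = 4) -> int:
    """Send only the chunks the target lacks; return the chunk bytes transferred."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    fingerprints = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = target.missing(fingerprints)  # one query instead of shipping every chunk
    sent = 0
    for fp, chunk in zip(fingerprints, chunks):
        if fp in needed:
            target.store(fp, chunk)
            needed.discard(fp)  # do not resend a chunk repeated within this stream
            sent += len(chunk)
    return sent


target = DedupTarget()
first = backup(b"AAAABBBBCCCCDDDD", target)   # nothing stored yet, so all chunks go
second = backup(b"AAAABBBBCCCCEEEE", target)  # only the one new chunk is transferred
print(f"first backup sent {first} bytes, second sent {second} bytes")
```

The sketch counts only chunk payloads and ignores the cost of exchanging fingerprints, which a real system would have to account for.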


Integration of the backup hardware with backup software brings a ‘best of both worlds’ reality to the backup process. It allows backup software to focus on the data protection and management optimization without having to add capacity optimization as well. It also provides the customer with the opportunity to use a ‘best of breed’ approach to data protection since almost any data set targeted at the backup system can leverage the existing data footprint, increasing the storage efficiency of the deduplicated system.

George Crump, Senior Analyst

Data Domain is a client of Storage Switzerland