Differences in Deduplicated Backup and Replication
Tuesday, March 16, 2010
The methods many companies use to measure backup system performance and the duration of the complete backup cycle are often inaccurate, leaving them vulnerable to data loss when they think they’re protected. To make good choices when evaluating these systems, users need to understand the technologies backup target suppliers employ to complete the entire backup cycle and how those technologies function in a given backup environment. Since the backup process includes multiple steps (ingest, dedupe and replication), how a backup system performs each step affects total cycle time. How it manages each step, and whether it runs these steps concurrently, also affects the duration of the backup cycle.
Traditional system-level asynchronous replication was the original technology developed and is still used by a number of storage manufacturers. Essentially, it creates and maintains an exact duplicate copy of the source data set on the target system. It requires identical hardware at both source and target locations and supports only one-to-one replication, meaning it can’t be used to replicate multiple sites to a central DR data center. Also, since it can’t take data from multiple sites, it can’t perform a ‘global deduplication’ of that data, which can result in more capacity being needed on the target system. Global deduplication is the ability of some backup targets to reduce bandwidth consumption by not requiring one source site to send data if another site has already sent the same data.
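The bandwidth effect of global deduplication can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 fingerprints purely for simplicity; the `DedupeTarget` class and its `replicate` method are hypothetical names, not any vendor's interface, and real backup targets typically use variable-length segments and their own wire protocols.

```python
import hashlib


def chunk_fingerprints(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and yield (fingerprint, chunk) pairs."""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


class DedupeTarget:
    """Hypothetical DR target that keeps one copy of each unique chunk."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk

    def replicate(self, data: bytes, chunk_size: int = 4) -> int:
        """Accept a site's data; return the bytes that had to cross the wire."""
        sent = 0
        for fingerprint, chunk in chunk_fingerprints(data, chunk_size):
            if fingerprint not in self.store:
                # Only chunks the target has never seen are transferred.
                self.store[fingerprint] = chunk
                sent += len(chunk)
        return sent


target = DedupeTarget()
sent_a = target.replicate(b"AAAABBBBCCCC")  # site A: all three chunks are new
sent_b = target.replicate(b"BBBBCCCCDDDD")  # site B: only "DDDD" is new
print(sent_a, sent_b)
```

In this toy run, the second site sends a third of what the first site sent, because the target already holds two of its three chunks.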
With this first form of replication, all replication is done at the file system level, so the entire capacity of the source system is included in the replication data set. It won’t allow portions of that data set to be replicated more frequently than others, for example. Since source and target contain the same data sets, deduplicating the source keeps the target deduped as well. But this also requires the source data set to be static while it’s scanned for changed blocks, prior to replication. This means the deduplication process must be completed on the entire data set before replication starts, which is not an issue with inline deduplication. This can have an impact on backup cycle time, as replication waits for ingest and dedupe to complete.
To meet the requirements of more enterprises, backup storage systems needed to move beyond traditional replication when dealing with deduped data and employ more advanced techniques that may better fit the data set and backup environment. Organizations may have multiple data centers that need to replicate to a single DR site, for example. Rather than the system-level process described previously, they may need a more dynamic, granular replication process like Data Domain’s Replicator Software. This ‘directory-level’ replication doesn’t simply duplicate the entire source file system but allows portions of it, or directories, to be replicated independently.
This granularity enables more frequently backed up data sets to be replicated independently from the rest of the enterprise’s data. Using a critical database as an example, the directory-level method would allow the replication step to be done each time the database is backed up, perhaps several times throughout the day. Since what's being evaluated is total ‘wall clock’ time for the entire backup cycle, this replication step must be completed. With the system-level method, replication of each of these interim database backups would require the entire enterprise’s data set to be replicated. The result may be significantly longer backup cycle times.
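The difference in scope can be made concrete with some simple arithmetic. The directory names and sizes below are invented for illustration only, and `replication_scope_gb` is a hypothetical helper; it counts how much data falls inside each replication pass, not how many bytes are actually transferred after deduplication.

```python
# Hypothetical backup directories and their sizes in GB (illustrative only).
directories_gb = {
    "/backup/critical_db": 200,
    "/backup/file_shares": 3000,
    "/backup/email": 1500,
}


def replication_scope_gb(directories, method, selected=None):
    """Return how much data is in scope for a single replication pass."""
    if method == "system":
        # System-level: the whole file system is the replication data set.
        return sum(directories.values())
    if method == "directory":
        # Directory-level: only the chosen directories are in scope.
        return sum(directories[name] for name in selected)
    raise ValueError(f"unknown method: {method}")


db_backups_per_day = 4  # assumed number of interim database backups

system_daily = db_backups_per_day * replication_scope_gb(directories_gb, "system")
directory_daily = db_backups_per_day * replication_scope_gb(
    directories_gb, "directory", selected=["/backup/critical_db"]
)
print(system_daily, directory_daily)
```

Under these assumed numbers, four interim database backups put 18,800 GB in scope per day with the system-level method versus 800 GB with the directory-level method.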
Directory-level replication also better supports multiple site ‘fan-in’, or a many-to-one replication scheme. Companies with several locations can replicate all of them to a single DR site, with the target storage device deduplicating the consolidated data set as well as leveraging the bandwidth savings of global deduplication described above.
Perhaps the biggest difference in the directory-level method is its ability to start the replication process on small subsets or images within a total backup job while others are still in the ingest and dedupe stages. This is accomplished with a log-structured file system and the use of default intervals to dismount a file or eject a virtual tape. The resulting parallel processing of all three stages, ingest, dedupe and replication, can significantly shorten cycle times.
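A rough way to see why overlapping the stages shortens the cycle is to model each backup image as three stage durations and compare a strictly sequential run with a pipelined one. This is a simplified scheduling sketch, not a description of any product's internals: the function names and the minute values are hypothetical, and it assumes each stage handles one image at a time.

```python
def serial_cycle_time(images):
    """Each stage finishes for every image before the next stage begins."""
    return sum(sum(durations) for durations in images)


def pipelined_cycle_time(images):
    """An image enters a stage once it has left the previous stage
    and that stage has finished the preceding image."""
    stage_free_at = [0, 0, 0]  # when ingest, dedupe and replication are next idle
    for durations in images:
        ready_at = 0  # when this image is ready for its next stage
        for stage, duration in enumerate(durations):
            start = max(ready_at, stage_free_at[stage])
            ready_at = start + duration
            stage_free_at[stage] = ready_at
    return stage_free_at[-1]


# Hypothetical (ingest, dedupe, replicate) minutes for three backup images.
images = [(10, 5, 20), (10, 5, 20), (10, 5, 20)]

serial = serial_cycle_time(images)
pipelined = pipelined_cycle_time(images)
print(serial, pipelined)
```

With these assumed durations the sequential run takes 105 minutes and the pipelined run 75, since replication of the first image begins while later images are still being ingested and deduped. The slowest stage (here, replication) sets the floor on how much overlap can help.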
OST from Symantec has greatly improved management of the backup cycle process, and can be effective in reducing total cycle time in environments that include remote replication. But total backup cycle time is still dependent on the system’s ability to ‘know’ when data has finished the dedupe stage and is ready for replication. Backup systems that support OST each develop their own implementations of this software and provide varying levels of functionality in the process. Some implementations include a way for OST to monitor and control the transfer of individual backup images through the replication process, producing shorter cycle times. In addition, some systems have taken full advantage of OST as a transfer protocol to significantly reduce the time it takes to deliver data to the backup target for the ingest step.
In systems which include multiple site replication to a single DR site, cycle time can be affected by the resiliency of the replication process. When communication errors occur, the replication process should have the ability to restart itself, automatically, until the replication job is completed.
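The kind of automatic restart described here can be sketched as a loop that resumes from the last confirmed item rather than starting the job over. Everything in this example is hypothetical: `replicate_with_restart`, the `FlakyLink` test double and the `max_attempts` cap are illustrative assumptions (the cap is added so the sketch cannot loop forever), not features of any particular replication product.

```python
import time


def replicate_with_restart(send_chunk, chunks, max_attempts=5, delay_s=0.0):
    """Send chunks in order, retrying the current chunk after a link error.

    Progress is kept across failures, so a restart resumes where the
    transfer stopped instead of resending everything.
    """
    next_index = 0
    attempts = 0
    while next_index < len(chunks):
        try:
            send_chunk(chunks[next_index])
            next_index += 1
            attempts = 0  # a success resets the failure counter
        except ConnectionError:
            attempts += 1
            if attempts >= max_attempts:
                raise  # give up after repeated consecutive failures
            time.sleep(delay_s)  # assumed pause before restarting
    return next_index


class FlakyLink:
    """Simulated WAN link that drops every third send call."""

    def __init__(self, fail_every=3):
        self.calls = 0
        self.received = []
        self.fail_every = fail_every

    def send(self, chunk):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise ConnectionError("simulated link drop")
        self.received.append(chunk)


link = FlakyLink()
chunks = ["img-1", "img-2", "img-3", "img-4"]
sent = replicate_with_restart(link.send, chunks)
print(sent, link.calls)
```

In this simulation the third call fails, the loop retries the same image, and all four images arrive after five calls in total.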
With the inclusion of deduplication and replication in more backup infrastructures, the time required to completely secure a company’s data set has become more of the company’s own responsibility. In the past, the service level agreement was met when data was put into the hands of the vaulting supplier and placed on their trucks. Now the SLA should include deduplication plus the time to transfer the data set to an appropriate DR site. Consequently, IT organizations must understand the process that backup software and storage systems employ to run this entire backup cycle in order to make effective infrastructure decisions.
The original replication methodology itself has evolved from a ‘system-level’ process to a more granular, ‘directory-level’ process in response to a need for faster backup cycles and support for multiple remote data centers. The release of the NetBackup OST option has helped with overall management but has also required special integration features on the part of backup storage vendors in order to reduce cycle times.
Eric Slack, Senior Analyst
This Article Sponsored by Data Domain