Implementing Tape Based Backup Archives

A common data retention strategy is to use the backup process itself to create the archive. This is most often done by setting the retention policy for affected data sets in the backup application to the desired duration. Then, tapes of that data to be retained are created and stored off-site.
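As a rough illustration of what setting a retention policy per data set amounts to, the sketch below computes when a backup job becomes eligible for expiry. The data set names, the retention periods, and the `BackupJob` type are all hypothetical; real backup applications expose this through their own policy interfaces.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackupJob:
    data_set: str
    run_date: date


# Hypothetical policy: number of days each data set must be retained.
RETENTION_DAYS = {
    "finance": 7 * 365,   # assumed long-term retention requirement
    "home_dirs": 90,      # assumed short-term operational retention
}


def expires_on(job: BackupJob) -> date:
    """Return the first date on which the job's tapes may be recycled."""
    return job.run_date + timedelta(days=RETENTION_DAYS[job.data_set])


def is_expired(job: BackupJob, today: date) -> bool:
    """True once the retention period for the job has fully elapsed."""
    return today >= expires_on(job)


if __name__ == "__main__":
    job = BackupJob("home_dirs", date(2020, 1, 1))
    print(expires_on(job), is_expired(job, date(2020, 2, 1)))
```

Tapes holding jobs that have not yet expired are the ones that would be sent off-site and kept out of the normal recycling rotation.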

Most data centers today front-end their backup process with disk. While initial disk backup systems were essentially a cache to store data before it went to tape, deduplication disk backup systems like those from Data Domain allowed data to stay on disk far longer than a temporary cache would hold it. This allowed disk to be leveraged for restores and for off-site backup. As these systems scaled, more companies began to see disk as the storage area for the most frequent recovery requests and tape as a longer term archive of backup jobs. In the past this disk-to-tape ‘integration’ had to be handled through a manual process. But now, as mentioned in our article “Integrating Backup Disk with Backup Software”, it can often be driven automatically from the backup software interface.

The use of backup data sets for archive is often criticized as not the ‘right way’ to retain data, but it is certainly very cost effective and relatively easy to implement. Where its weakness begins to show is when there is a demand for recovery of the older data. Recovery of retained data is fundamentally different from recovery of current data. With current data there is typically a time pressure, like when an application is down or a user needs their data recovered ‘yesterday’. The effort is helped, though, because there is often an understanding of exactly what data needs to be recovered. Also, the data is typically either on disk or on a recently used tape, so finding the specific files needed is relatively easy. Retained data, on the other hand, doesn’t have the same time pressures as current data. But there also isn’t always a clear understanding of which data needs to be recovered, as the request often covers a range of files or files across a range of dates.

The good news is that several backup applications have become excellent at meta-data (data about the backup) management and have the ability to quickly identify which tapes have which needed data sets. The problem arises after that information has been ascertained. First, the tapes actually need to be found and brought back to the data center, both of which can be time consuming and prone to error. The tapes then need to be loaded into the tape library and, fingers crossed, read successfully. Remember, at this point those tapes may be five or six years old and the tape drive technology could easily have changed since the original tape was made. Even if the drives selected are “backwards” compatible, there is a good chance that the recovery could still fail due to a difference in drive alignment or just the media degradation that can result from sitting on a shelf for that long.
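To make the meta-data lookup concrete, here is a minimal sketch of how a backup catalog might answer "which tapes do I need?" for a request that spans a directory and a range of dates. The `CatalogEntry` layout, the paths, and the tape IDs are invented for illustration and do not reflect any particular backup product's catalog format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    path: str          # file that was backed up
    backup_date: date  # when the backup job ran
    tape_id: str       # tape the file was written to


def tapes_needed(catalog, path_prefix, start, end):
    """Return the sorted, de-duplicated tape IDs that hold files under
    path_prefix backed up between start and end (inclusive)."""
    return sorted({
        entry.tape_id
        for entry in catalog
        if entry.path.startswith(path_prefix)
        and start <= entry.backup_date <= end
    })


if __name__ == "__main__":
    # Hypothetical catalog contents.
    catalog = [
        CatalogEntry("/finance/q1.xls", date(2019, 3, 31), "T0041"),
        CatalogEntry("/finance/q2.xls", date(2019, 6, 30), "T0057"),
        CatalogEntry("/hr/reviews.doc", date(2019, 6, 30), "T0057"),
        CatalogEntry("/finance/q3.xls", date(2019, 9, 30), "T0072"),
    ]
    print(tapes_needed(catalog, "/finance/", date(2019, 1, 1), date(2019, 6, 30)))
```

The lookup itself is quick. The slow and error-prone part is everything that follows: retrieving each tape on the resulting list and reading it.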

If the data is intact and viable, the tape has to be mounted in the drive, scanned and read, and then the data has to be restored to the primary storage device. Because of the often wide swath of data sets that need to be examined, this repeated mounting, reading and recovery of the retained backups can be very time consuming. While recovery of retained data does not have the same time pressures that the immediate recovery of production data does, it still must be completed in a timely manner. It is important to note that the time consumed by these recoveries also includes the administration time to set up and monitor the recovery jobs.

Implementing Backup Archives - Archive

To overcome the weaknesses of using backup sets as an archive, many organizations have explored the option of implementing a formal enterprise archive process. This is typically a completely separate data copy process from the backup and is often managed by an archiving application that’s specific to either the file system or the email package. These archive-specific applications are designed to provide very rapid search of archived data and an understanding of what those files contain before the data is actually recovered. Most of these enterprise archive systems today leverage scalable disk storage systems to speed the delivery of that archive data. Enterprise archive systems are especially valuable in compliance-driven environments where a chain of custody of the data must be maintained.

Enterprise archive systems provide the ultimate in long term data retention. But for many organizations the combined cost of the software and hardware may be overwhelming, as is the time to implement the system. In addition to basic implementation there may be a significant adoption curve, as certain business processes need to change to ensure proper capture of data. While these costs may be justifiable in heavily regulated or heavily litigated organizations, for many others the capabilities are a poor fit. Either they’re ‘overkill’ because there won’t be a frequent need for the retained data, or the costs are simply outside the reach of the organization.

Implementing Backup Archives - Disk Backup as a Bridge to Archive

Another option is to leverage the disk components already being used in the backup process, like deduplication systems, for long term data retention. Deduplication systems like those from Data Domain are highly efficient devices that can compress and eliminate redundant data. The more data that is stored on a deduplicated system, the more efficient it typically becomes. Thanks to those capabilities, plus the advancement of general storage technology, many of these systems can now scale beyond 100TB. As a result they now have the raw capacity to store backup archives. These systems have even added basic compliance capabilities to fulfill some of the chain of custody mandates that may exist within the organization’s industry.
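The reason a deduplicated system tends to get more efficient as it fills can be shown with a toy model. The sketch below splits data into fixed-size chunks and stores each unique chunk once, keyed by its SHA-256 digest. This is a simplified illustration only, not how Data Domain or any other vendor implements deduplication; the `DedupStore` class and its chunk size are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy fixed-size-chunk deduplicating store (illustrative only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # digest -> chunk bytes, stored once
        self.logical_bytes = 0  # total bytes written by clients

    def write(self, data):
        """Store data and return the list of chunk digests (the 'recipe')."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

    def ratio(self):
        """Logical bytes written divided by physical bytes stored."""
        stored = self.stored_bytes()
        return self.logical_bytes / stored if stored else 0.0


if __name__ == "__main__":
    store = DedupStore()
    monday = b"A" * 4096 + b"B" * 4096
    tuesday = b"A" * 4096 + b"C" * 4096  # only the second chunk changed
    store.write(monday)
    print(store.ratio())
    store.write(tuesday)
    print(store.ratio())
```

Each additional backup that shares chunks with earlier ones raises the ratio, which is why retaining more backup generations on the same system costs proportionally less capacity.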

All the data stored on these systems can be replicated off-site so valuable archives are protected from a disaster or storage system failure. In fact, disk based systems can provide better data validation than tape: because all the data is online, it can be continuously verified to maintain data integrity. Also, unlike tape, since most of these systems are protected with RAID, the failure of one piece of media does not jeopardize the recovery of an entire data set.
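Continuous verification is possible on disk because every block can be re-read at any time and compared against a checksum recorded when it was written. A minimal sketch of such a "scrub" pass follows. The block IDs and the in-memory dictionaries stand in for real storage and are purely hypothetical.

```python
import hashlib


def fingerprint(data):
    """Checksum recorded at write time and recomputed during verification."""
    return hashlib.sha256(data).hexdigest()


def scrub(blocks, expected):
    """Return the sorted IDs of blocks whose current contents no longer
    match the checksum recorded when they were written."""
    return sorted(
        block_id
        for block_id, data in blocks.items()
        if fingerprint(data) != expected[block_id]
    )


if __name__ == "__main__":
    # Hypothetical stored blocks and their checksums at write time.
    blocks = {"b1": b"payroll 2019", "b2": b"email archive"}
    expected = {block_id: fingerprint(data) for block_id, data in blocks.items()}

    print(scrub(blocks, expected))   # nothing damaged yet
    blocks["b2"] = b"email archivE"  # simulate silent corruption
    print(scrub(blocks, expected))   # the damaged block is reported
```

A tape on a shelf offers no equivalent: damage is typically discovered only when the tape is finally mounted for a restore.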

The wide swath of data that a request from the archive typically covers is also easily handled. Leveraging online access and the backup software’s understanding of the backup meta-data, recoveries can be processed immediately, allowing more time for verification of the recovered files. This also frees the backup administrator from monitoring and waiting on slow tape based recovery jobs. From the backup administrator’s standpoint there is little difference between a recovery of short term and long term data.

The challenge with disk based backup archives is that they do not have the robust compliance and search based recovery features that enterprise archives do. More specific information about the data in question may be needed with a backup archive, and this may lead to a few more “trial and error” restores. The cost of these extra restores may be easily offset by lower start up and operational costs.

The balance between disk based backup archives and an enterprise archive comes down to need vs. cost. In highly regulated industries there may be no option but the enterprise archive. There, its cost is offset by the frequent need to access retained information as well as the liability of not having that information.

In less regulated industries, especially where there is a concern over available budget and where recoveries of retained data will be more the exception than the rule, using a disk based backup archive may be a perfectly acceptable alternative. It overcomes the challenges of tape based backup archives and delivers a usable, cost effective way to retain and recover long term data archives.

This type of solution needs a specific type of disk appliance. To address backup archiving in this way, companies like Data Domain, with their Archiver solutions, have built systems that are disk based but better designed for longer term retention. As the Storage Switzerland Briefing Report on Archiver shows, it is important that capacity efficient solutions like deduplication appliances account for longer term, retained information and balance across the deduplication indexes.

George Crump, Senior Analyst

EMC is a client of Storage Switzerland