Backup vs. Archive

 

Proper data retention continues to grow in importance for all data centers. The motivation may be as critical as regulatory compliance, corporate governance or data mining, or as simple as keeping data available to users “just in case”.


As a result, there is a growing need to develop an area focused on archive storage. In addition to data retention issues, there is also the need to reduce the rate at which primary storage is being consumed. Primary storage is expensive to purchase and manage, and it has a negative impact on power, cooling, and backup processes. Reducing the investment here will free up tight budget dollars for other business-advancing projects.


IT professionals are struggling with how to develop an archive strategy that addresses all these challenges. One of the key decisions in developing that strategy is determining what type of storage to use for archive data. A common question is whether to leverage the current backup storage or to introduce a new tier specifically for archived data.


To decide if the backup process and its associated storage are appropriate for the retention of data, first examine the requirements for archive against the requirements for backup. The next step is to compare the archive requirements with the capabilities of the backup process to see if it can meet the challenge.


Most backup requirements focus on moving large amounts of data as quickly as possible to a backup server and its associated backup storage. Recovery is similar; its requirement is to move large amounts of data from the backup storage to a failed server as quickly as possible. Even in the case of a more granular restore, the focus is on getting that copy of data back to its original location as fast as possible.


Speed often comes at the expense of reliability. Backup jobs do not have time to verify the data as they write it to backup storage, nor do they have time to do a complete verification of that data against the original. There is typically no data protection scheme on backup storage, such as mirroring or RAID-type protection. If disk backup is being used there may be a basic RAID protection strategy, but most if not all disk backup solutions are short-term staging areas and are not designed to hold years’ worth of data, let alone decades.


This lack of protection on backup systems is less of an issue because of the way backup works. Most backup processes run incremental backup jobs every day and full backups every weekend. The individual file data between these jobs is similar and many times identical. As a result, if there is a problem with yesterday’s backup, then going to the day prior is a viable option, although it may result in more rekeying. With the backup process there are multiple copies of a particular piece of data: one for each day it was backed up and, of course, the original data on primary storage. Failure of a particular job, while not pleasant, has a workaround.


Speed, while an issue for archive, is not the primary focus. Except for the initial implementation, data is not typically moved to the archive en masse; it trickles in as the data ages. Similarly, requests are generally not for mass amounts of data but for particular files that may not have been accessed in years. Because speed is less pressing, archive systems can take the time to verify the data being written to make sure it is 100% what was intended. Once verified, this data is stored with multiple levels of redundancy, offering a long-term RAID-like or mirrored protection scheme or, with some advanced solutions, a RAIN (Redundant Array of Inexpensive Nodes) architecture.


The verification and redundant storage of this data is critical to an archive because an archive job is typically only run once. Old data is migrated from the primary storage, written, verified and stored in redundant fashion on the archive, and only then removed from the primary storage area. Remember that one of the goals is to reduce the primary storage requirement, and another may be to store this data for a long time, possibly with legal ramifications for failing to deliver that data.
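The write, verify, store redundantly, then remove sequence described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `sha256_of` and `archive_file` are hypothetical, "redundancy" is modeled as plain copies in two or more directories rather than RAID or RAIN, and SHA-256 is one possible choice of verification checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_file(source: Path, replica_dirs: list[Path]) -> str:
    """Copy source into every replica directory, verify each copy against
    the original, and only then delete the original (hypothetical helper)."""
    if len(replica_dirs) < 2:
        raise ValueError("need at least two replicas for redundancy")
    expected = sha256_of(source)
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            raise IOError(f"verification failed for {copy}")
    # Primary storage is freed only after every copy has been verified.
    source.unlink()
    return expected
```

The ordering is the point of the sketch: because the archive job runs once, the original is removed last, and a failed verification stops the job before anything is deleted.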


In addition to speed, the backup process is typically focused on protecting and recovering new data, meaning that the request is usually for the protection of the latest copy of the file or volume.  Also, since the data is fairly new, the request for that data is likely coming from the data’s owner or the application’s administrator.  As a result, finding that data is straightforward; it is at the top of the stack. A simple index that is available in most backup applications is fine for this. But, even those databases get very large over time, often because they are tracking hundreds of copies of the same file.


An archive request could be looking for a specific version of a file, or a file that is years, maybe decades, old. Also, the information about the data you need is more arbitrary, and even when you recover it you may not know you have the right file until you manually inspect it. The request may be something like “we need any document that had to do with that bridge we built back in 2002”. A list of file names won’t do; you need a search mechanism that can get inside the data files and look at content. This type of searching is much easier when it is performed across a disk as opposed to a series of tapes.
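To make the difference between a file-name index and a content search concrete, here is a minimal sketch that walks a disk-based archive and matches on what is inside each file. The function name `search_archive` is hypothetical, and the sketch assumes plain-text files; a real archive product would rely on a prebuilt index and format-aware text extraction rather than reading every file on each request.

```python
from pathlib import Path


def search_archive(root: Path, terms: list[str]) -> list[Path]:
    """Return files under root whose text contains every term,
    ignoring case (hypothetical helper, plain-text files assumed)."""
    wanted = [term.lower() for term in terms]
    matches = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # errors="ignore" drops undecodable bytes instead of failing.
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if all(term in text for term in wanted):
            matches.append(path)
    return matches
```

A call such as `search_archive(Path("/mnt/archive"), ["bridge", "2002"])` (the mount path is an assumption) would find documents regardless of what they happen to be named.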


Another archive requirement that causes challenges for the backup process also has to do with data retention. At some point in the data life cycle it will become legal, and possibly advantageous, to permanently delete certain data; at a minimum it is good to know where all the copies of this data are. For tape backup this creates several problems. First, in a backup, data is not written as individual files. It is encapsulated into backup jobs and stored, possibly on disk first, and then in most cases on tape. When backup is used as an archive, the typical process is to not expire certain full backup jobs. As a result, data is scattered across many tape sets and hundreds of copies of this data can be created.


A decision, for example, to delete all data associated with a certain project becomes a herculean task. Backup logs need to be reviewed. Tapes need to be found. Even when the tapes are found, a given tape will often hold some data that needs to be retained and some that does not. The only way to handle this mixed need is to load the old tape and copy the data that needs to be retained to another piece of tape media. This is not impossible, but it is certainly wide open to failure.


An archive on the other hand specifically stores discrete files, and an archive with a proper retention system can set holds on certain files and release other files as their retention requirements are fulfilled.
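Because an archive tracks discrete files, a retention decision can be made per file. The sketch below illustrates that idea only; the `ArchivedFile` record, its field names and the `releasable` function are hypothetical and do not describe any particular archive product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedFile:
    """Hypothetical per-file retention record."""
    name: str
    retain_until: date
    legal_hold: bool = False


def releasable(files: list[ArchivedFile], today: date) -> list[str]:
    """Names of files whose retention period has ended and that are
    not under a hold; everything else must be kept."""
    return [
        item.name
        for item in files
        if not item.legal_hold and item.retain_until <= today
    ]
```

A file under hold stays in the archive even after its retention date passes, while its neighbors can be released individually, which is the step that is so difficult when the same data is buried inside backup jobs on tape.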


Finally, there is accessibility. Backup is a controlled process that needs to be in control of data movement from the client to the disk backup target and from the disk backup target to the tape library. Accessing data on either of those media requires using the actual backup software. Archive, on the other hand, typically appears as a file system and can most often be accessed through either a CIFS or NFS mount point.


Accessibility becomes critical when viewed in light of recoveries years into the future. Backup software may change, and disk backup and tape formats may change, but file systems for the most part remain accessible and as a result are, to a large extent, future-proof.


In reality, this should not be a backup vs. archive decision. The two are complementary processes, and in fact archive can reduce the burden on tape significantly and as a result make backups more reliable. First, an archive system should move old data off of primary storage. In some cases 75% or more of the data on primary storage is actually old data that has stopped changing. As a result, 75% of the data being backed up does not need to be backed up, 75% of the data being stored on the backup targets (disk or tape) does not need to be stored there, and 75% of the data being recovered after a critical system failure likely does not need to be recovered.
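The arithmetic behind that claim is simple enough to sketch. The function name `backup_footprint` and the sample figures (100 TB of primary data, four retained weekly fulls) are illustrative assumptions, not measurements; only the 75% fraction comes from the text above.

```python
def backup_footprint(primary_tb: float, archived_fraction: float,
                     retained_fulls: int) -> tuple[float, float]:
    """Return (size of one full backup, total size of retained fulls)
    in TB after the archived fraction leaves primary storage."""
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    active_tb = primary_tb * (1.0 - archived_fraction)
    return active_tb, active_tb * retained_fulls


# Illustrative numbers only: 100 TB primary, four weekly fulls kept.
before = backup_footprint(100.0, 0.0, 4)   # (100.0, 400.0)
after = backup_footprint(100.0, 0.75, 4)   # (25.0, 100.0)
```

Under these assumptions each full backup, and each full recovery, shrinks from 100 TB to 25 TB, and the space held by retained fulls drops by the same factor.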


Archive allows the movement of that 75% of data to a destination designed to hold it. With the redundancy and indexing features in place, that data is more accessible and better protected than it is by the backup process. With replication, there is no need to back up the archive.


Archive not only speeds up the backup and recovery process but it also greatly reduces the primary storage investment both up front and going forward.

 

Monday, November 10, 2008

 
 