Archiving Basics
The problem with keeping data, whether for legal reasons or "just in case", is that it has to be stored somewhere. Without an effective archive strategy, that data ends up on the same storage as everything else: primary storage.
Inactive data on primary storage is a huge waste of an expensive resource. There is at least a $5 to $10 per GB price delta between primary storage and even the most expensive forms of archive storage, and it can be significantly greater! Additionally, primary storage is designed to quickly deliver somewhat transient data. It typically cannot support data retention policies or verify the integrity of a set of data years after it was written. Both of these are important requirements of archive storage.
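To put the price delta in perspective, here is a rough back-of-the-envelope sketch. The 10 TB capacity and the function name are assumptions made up for illustration; the 80% inactive fraction and the $5 to $10 per GB delta are the figures used in this article. Treat the output as an estimate under those assumptions, not a quote.

```python
def archive_savings(total_gb, inactive_fraction, price_delta_per_gb):
    """Estimate dollars saved by holding inactive data on archive storage
    instead of primary storage (hypothetical helper, for illustration)."""
    if not 0.0 <= inactive_fraction <= 1.0:
        raise ValueError("inactive_fraction must be between 0 and 1")
    inactive_gb = total_gb * inactive_fraction
    return inactive_gb * price_delta_per_gb


# Assumed environment: 10,000 GB of primary storage, 80% of it inactive.
low = archive_savings(10_000, 0.80, 5.0)    # $5 per GB delta
high = archive_savings(10_000, 0.80, 10.0)  # $10 per GB delta
print(f"Estimated savings: ${low:,.0f} to ${high:,.0f}")
```

Swap in your own capacity and a measured inactive fraction to see how the numbers move for your environment.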
Archive Targets
The first step in establishing the archive is to select the storage platform for the repository. This component must come first because the type of platform selected will dictate how the strategy is implemented.
Traditionally, tape was thought of as the ideal archive medium. It was cheap and easy to store and transport. The problem with tape was that it required special software to access it; it was not like copying to another drive on the network. It also had limited capabilities for data retention and virtually no data verification capabilities. We outline this in more detail in our article Tape is not an Archive.
As disk prices started to drop, the idea of using an inexpensive NAS with SATA drive technology started to surface. While this was easier to access than tape, it faced other challenges, including cost and scalability, and it shared tape's weaknesses in retention and verification. This is covered in more detail in our article NAS is Not an Archive.
The shortcomings of disk and tape have led to the development of disk-based archive systems like those from Permabit Technologies. These systems provide the easy access of NAS-based storage and the cost efficiency and scalability of tape, and they add retention and verification capabilities.
Archive Strategy
Once the archive repository has been selected, strategy development can begin. The first step is to decide how data will be moved to the platform. The second is to decide how often these moves will occur, and the last is to decide how to protect the archive.
Deciding how to move data is often the step that requires the most consideration. The simplest approach is to move the data to the archive with standard OS commands. This is especially true if the archive platform is a disk-based archive that emulates a network mount point. Since these systems appear as just another network drive letter, it is fairly simple to manually move data to the repository. For some level of automation, a tool like Tek-Tools' Storage Profiler can be used to generate a list of files and feed that list into an OS script that moves the data.
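As a minimal sketch of that kind of script, the Python below builds its own list of inactive files by walking a directory tree and then moves them to the archive mount point, preserving the folder layout. The function names, the 90-day threshold and the dry-run flag are all assumptions for illustration; a reporting tool's exported list could replace the walk, and none of this reflects how any particular product works.

```python
import os
import shutil
import time
from pathlib import Path


def find_inactive_files(root, days):
    """Yield files under root not modified or accessed in the last `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # skip links; only real files are candidates
            st = path.stat()
            if max(st.st_mtime, st.st_atime) < cutoff:
                yield path


def move_to_archive(paths, primary_root, archive_root, dry_run=True):
    """Move each file to the same relative location under archive_root.

    Returns a list of (source, destination) pairs. With dry_run=True
    nothing is moved, so the plan can be reviewed first.
    """
    plan = []
    for src in paths:
        src = Path(src)
        dest = Path(archive_root) / src.relative_to(primary_root)
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
        plan.append((src, dest))
    return plan
```

A typical run would collect the candidates into a list first, review the dry-run plan, and only then call `move_to_archive` again with `dry_run=False`. Note that access times are unreliable on volumes mounted with access-time updates disabled, so the modification time may be the only usable signal there.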
The advantage of the manual move process is that it is cost effective, often free, and very quick to implement. The disadvantages are that it has to be executed and maintained by hand, and that users have no direct indication of where their files have been moved. It is often an ideal initial strategy until a more formal data movement process can be developed.
This more formal process is often some form of automated data movement. This can be done with specific archive software available from companies like Atempo or Enigma Software. This software either deploys an agent or remotely logs into servers in your environment to determine files that are eligible for archive. It then performs the migration of those files to the archive. Most of these applications create a transparent link back to the archive for seamless recall for the users.
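The idea of a transparent link can be illustrated with a symbolic link. This is only a stand-in: commercial archive software uses its own stub or link mechanisms, which are not described here, and the function name below is hypothetical. The sketch moves a file to the archive and leaves a symlink at the original path so that opening the old location still reaches the data.

```python
import os
import shutil
from pathlib import Path


def archive_with_link(src, primary_root, archive_root):
    """Move src into the archive and leave a symlink behind at its old path.

    A symlink is an illustrative substitute for a vendor's stub file.
    Returns the new location of the file in the archive.
    """
    src = Path(src)
    dest = Path(archive_root) / src.relative_to(primary_root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    # Point the original path at the archived copy (absolute target).
    os.symlink(dest.resolve(), src)
    return dest
```

On Windows, creating symlinks may require elevated privileges or developer mode, which is one reason real products rely on their own recall mechanisms instead.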
The combination of this software and a disk-based archive allows for very aggressive migration policies that move data to archive storage within months or even weeks of it becoming inactive. This allows for optimal utilization of primary storage without jeopardizing the user experience. When the user accesses an archived file, it appears to be right where they left it, and because it is a disk archive they are unlikely to notice a decrease in performance. Since most surveys report that truly active data (data that is being modified within a 90-day window) is growing at only 3% to 5% per year, aggressive archiving can delay future storage purchases for years!
The final part of an archive strategy is protection of the archive itself. Many customers are tempted to back up the disk archive like any other storage device. This should not be necessary; a properly protected archive should never need to be backed up.
To protect against a local disk failure, disk archive systems have advanced data protection schemes that are more robust than standard RAID. Also remember that they have built-in integrity checking of the data itself. To protect against a site failure, a disk archive solution should be able to replicate its data over a WAN connection to another site. While this does require the purchase of a second system, the cost savings from implementing a disk-based archive will more than offset this additional level of protection.
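To make the idea of integrity checking concrete, the sketch below records a SHA-256 fingerprint for every file when it is written and re-checks those fingerprints later. This is a generic illustration, not how any specific archive system works internally; the function names and the in-memory manifest are assumptions, and a real deployment would need to store the manifest somewhere safe.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_root):
    """Map each file's path (relative to archive_root) to its digest."""
    root = Path(archive_root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(archive_root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(archive_root)
    failures = []
    for rel_path, expected in manifest.items():
        path = root / rel_path
        if not path.is_file() or sha256_of(path) != expected:
            failures.append(rel_path)
    return failures
```

Running `verify_manifest` on a schedule would surface silent corruption or missing files long before anyone tries to recall the data.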
Even without a second system, the data stored on the archive has already been backed up multiple times by the full backup jobs. For example, if the strategy is to migrate data after 90 days of inactivity and your backup policy is a full backup every weekend, that means the data in the archive will be protected on approximately 12 full backup sets. A simple change to the policy to maintain a monthly full for a longer period of time will mean the data in the archive is also available on tape.
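The arithmetic behind that example is simple enough to check for any policy. The helper below is a hypothetical sketch that assumes full backups run at a fixed interval and that a file sits unchanged on primary storage for the whole inactivity window before it is migrated.

```python
def full_backups_before_archive(inactivity_days, full_backup_interval_days):
    """Count the full backups that capture a file while it waits to be archived."""
    if full_backup_interval_days <= 0:
        raise ValueError("full_backup_interval_days must be positive")
    return inactivity_days // full_backup_interval_days


# 90 days of inactivity with a weekly full backup, as in the example above.
print(full_backups_before_archive(90, 7))  # 12
```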
Archive’s Impact
The most immediate and obvious benefit of a disk-based archive strategy is the reduction in primary storage needs. That reduction can delay this year's storage purchase, and possibly those of the next few years, potentially saving millions in IT budgets. In many cases, customers can free up 80% of their primary storage capacity, the equivalent of shelves and shelves of storage. By consolidating what remains, customers can actually turn off shelves of storage to reduce power consumption.
Finally, an effective archive strategy may delay investment in upgrades to the backup environment by removing 80% of the backup load. The savings can be significant, because this delays upgrades to backup-to-disk architectures, backup bandwidth and backup servers.
Given current IT budget trends, the cost effectiveness of an archive makes it an ideal project to strongly consider in 2009. The fact that you can complete this project while also improving primary storage performance, reducing backup windows and increasing data safety makes it a valuable project in today's economic situation.
Thursday, February 12, 2009
In 2009, unlike any year in recent IT history, terms like "do more with less" and "cost containment" are going to be more common. One potential target for doing more with less and containing costs is reducing the amount of storage required for the primary storage tier. One of the common solutions that will be proposed is archiving. The challenge is that these proposals often come without much explanation.
What is archiving? When should you use it? What is the best way to implement it? Those are the questions on CIOs' minds today, and those are the questions this article will answer.
What is Archiving?
Data archiving is the storing of inactive (static) data on a secondary storage device such as online disk. This is information that will or may be needed again in the future, so deleting it is not acceptable. There may be legal reasons to keep the data, there may be organizational reasons like market research, or the justification may be as simple as no one being comfortable deleting it. Regardless, the decision has been made to store the data instead of deleting it.