Creating a Disk Archive

 

While a disk based archive can be a component of ILM (Information Lifecycle Management), an archive is not ILM in and of itself. When designing a disk based archive, you are essentially creating a digital storage tank that you want to dump data into. This data could be copies of currently active data where an extra copy is valuable, data that no longer needs to be stored on primary storage, or data that must meet retention requirements imposed by regulations or corporate governance.


For most clients, archive is what they are using tape for today, but most are not satisfied with tape as an archive solution. Backup software and tape do not make an archive; they are a way to store backups. While tape is very inexpensive per gigabyte, it is an inefficient storage mechanism: it is slow to search, slow to retrieve, does a poor job of confirming data integrity over time, and is ineffective at guaranteeing retention requirements.


Note that archive is not disk to disk backup. Disk to disk backup is by its nature designed to store a relatively small amount of data for a few days, with the emphasis placed on ingestion speed. Disk based archive, while slower to ingest data, is focused on scalability. Even with the advent of data de-dupe, the disk to disk appliances available now (see my article on data de-dupe) can only offer a few weeks of data retention for backups. That is impressive for front line recovery, but it is not suited for long term archival storage.


A disk based archive can address many of the shortcomings of tape archiving and can provide incredible value to an organization. To be effective, it has to fulfill some of the basic functions that traditional tape libraries handle today. One of the key capabilities of tape libraries is that they can scale almost instantly by simply adding another tape cartridge. A disk based archive should offer the same functionality, so adding disk capacity needs to be as simple as plugging in another drive and instantly having that capacity available for use. In addition, tape libraries can pack a lot of storage into a very small space; a disk based archive should offer this same ability.


There are disk based archive solutions that can match these abilities of tape; they are highly scalable, and that scaling can be done almost as easily as inserting a blank cartridge into a tape library. These two capabilities, density and easy scaling, allow for multi-petabyte disk based archives and lay the foundation for taking advantage of disk in this archive mode.


The first step to using a disk based archive product is to make the archive target easy to access. With a disk solution, you want to avoid proprietary APIs and any need for special software applications or drivers, like a tape library would require. It makes sense to present the archive as a network mounted file share that you can reach from Windows or UNIX. Copying and moving data to a file share is something that IT administrators do every day, so why make it more complicated than that?
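To illustrate how little is needed when the archive is just a mounted share, here is a minimal Python sketch. The function name `archive_file` and the mount point `/mnt/archive` in the usage note are hypothetical; any CIFS or NFS mount would be used the same way.

```python
import shutil
from pathlib import Path


def archive_file(source, archive_root):
    """Copy source into archive_root and return the path of the new copy."""
    source = Path(source)
    destination = Path(archive_root) / source.name
    # Create the target folder on the share if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # copy2 copies the file contents and also tries to preserve timestamps.
    shutil.copy2(source, destination)
    return destination
```

A call such as `archive_file("report.doc", "/mnt/archive/2007")` is an ordinary file copy; no tape driver or backup application is involved.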


With the ability to scale and data access addressed, you would also want to have the disk based archive actually do data de-duplication for increased storage efficiencies. Similar to data de-dupe on disk to disk backup, this would be very valuable when making copies of files for safe keeping or long term storage where many of the files are similar.


With block level data de-dupe, you could let your DBA make an extra copy of their databases to your archive every night; only the blocks that actually changed in that database would be stored, thereby consuming very little additional disk space even though the DBA will feel like the entire database has been stored. While your actual mileage will vary, a 10 to 20X increase in storage efficiency is not uncommon, making a 4TB archive appear like it is “storing” 40 to 80TB of data. Even though the disk based archive is scalable, you will still want to control and slow down its growth.
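The nightly database copy described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular product works: the class name `DedupeStore`, the fixed 4096-byte block size, and the use of SHA-256 as the block fingerprint are all illustrative choices.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; products may chunk differently


class DedupeStore:
    """Toy block store that keeps each unique block only once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data):
        """Store data under name; return how many new blocks were written."""
        new_blocks = 0
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        self.files[name] = recipe
        return new_blocks

    def get(self, name):
        """Rebuild a file from its list of block fingerprints."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self):
        """Bytes the users believe they have stored."""
        return sum(len(self.blocks[d]) for r in self.files.values() for d in r)

    def physical_bytes(self):
        """Bytes actually consumed by unique blocks."""
        return sum(len(b) for b in self.blocks.values())
```

Storing Monday's copy of a database and then Tuesday's copy with one changed block writes only that one block the second time, while both copies remain fully readable. The ratio of `logical_bytes()` to `physical_bytes()` is the efficiency figure quoted above: at 10X, 4TB of disk holds 4 × 10 = 40TB of logical data, and at 20X it holds 4 × 20 = 80TB.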


Once you have created a highly scalable, highly efficient storage tank for your archive data, you will want a simple method to get the data back. Assuming we are using an archive system that supports standard file sharing on the network, you can simply connect to it and browse for the data you need, just like any other file system. There is no need for a proprietary interface like a backup application would require. Just find the file and copy it back to where you want it. This is NOT tape storage, so you could use the data right on the archive storage and not have to copy it anywhere at all. This can be incredibly valuable when you want to quickly preview a file to see if you have the correct copy; you normally cannot do that with tape solutions, where you would have to restore each copy and find the one you are looking for by trial and error.


Take this one step further: with a disk solution you can index the actual contents of the files and provide a “Google-like” search capability for the archive. As your data grows, this capability becomes mandatory, especially when you project seven years into the future. Imagine trying to remember where you, or more likely someone else, stored a version of a file seven years ago. You will need some way to find it other than hunting and pecking through folder after folder.
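Content indexing of this kind is usually built on an inverted index: a map from each word to the files that contain it. The following Python sketch shows the idea under simplifying assumptions (plain text input, whole-word matching, every query word must be present); the function names `build_index` and `search` are hypothetical, and real archive products will do far more.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of paths whose text contains it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

With an index like this, finding a seven-year-old file becomes a lookup on what the file says rather than a memory test about which folder it was saved in.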


When you go to recover that data seven years from now, you will want it to still be valid. Here is another huge advantage over tape solutions. During those seven years, how will you monitor your tape media’s quality to make sure the data on each piece of media is still valid? Also, think back seven years, when DLT7000 and DLT8000 tapes were the standard. Try taking a piece of DLT media and recovering it, quickly, today. It will be a challenge, to say the very least. Finding a drive that can read that tape at all is going to be a problem, and successfully reading that piece of media is going to be a long shot.


On the other hand, a disk based archive device can constantly scan itself, performing automatic data integrity checks on the disks. The same logic that it uses to determine data differences in the data de-dupe process can also be used to confirm that the data is still valid. A block should always return the same unique ID; if it does not, you get an error message and the data can be recreated. From an access perspective, users have been accessing CIFS and NFS mounted systems for a very long time without issues, and I don’t see any major changes on the horizon that would inhibit that from continuing.
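The self-check described above amounts to recomputing each block's unique ID and comparing it with the ID recorded when the block was first stored. Here is a minimal Python sketch, assuming SHA-256 as the ID and an in-memory dictionary standing in for the disks; the names `fingerprint` and `scrub` are hypothetical.

```python
import hashlib


def fingerprint(data):
    """Return the unique ID of a block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def scrub(blocks):
    """Return the recorded IDs whose stored bytes no longer match them.

    blocks maps the ID recorded at write time to the bytes now on disk.
    """
    return sorted(fp for fp, data in blocks.items() if fingerprint(data) != fp)
```

An empty result means every block still hashes to its recorded ID. Any ID in the result marks a block that has changed on disk and would need to be recreated, for example from a replica.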


From a retention perspective, whether driven by regulatory concerns or corporate governance, a disk based archive has specific advantages over a tape based archive. First, while WORM tape technology exists, it is cumbersome to use. In addition, no intelligence is available to manage what goes to the WORM tape and what does not, plus there is the issue of what can be expired from the WORM archive (see my article “Don’t Keep It Forever”).


With a disk based archive, you could create a mixed environment, with some of the archive available for the normal read and write activities described above and a special area just for data with a unique retention policy or chain of custody issues. You could manually move data to this special area based on an understanding of your corporate guidelines for retention, or you could leverage the indexing mentioned above to look for certain types of data.


A simple example would be storing any document that has a Social Security number or a particular legal case number in it. An indexing application could locate this data and move it to the WORM section of the archive. It would then follow the retention policy set for that section. Add to this the ability to encrypt this data for an additional level of security and you have a very tightly locked system for securing data with special retention requirements. With the right disk based archive, this ability can be added at any time in the future, laying the foundation for future compliance needs that you may not even be aware of now.
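The selection step in this example can be sketched as a simple content scan. The Python below is an illustration only: it assumes Social Security numbers appear in the common 123-45-6789 form, the case number format is whatever you pass in, and the names `needs_retention` and `classify` are hypothetical. Actually moving files into a WORM area is product specific and is not shown.

```python
import re

# Assumed pattern: U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def needs_retention(text, case_numbers=()):
    """True if text holds an SSN-shaped string or any listed case number."""
    if SSN_PATTERN.search(text):
        return True
    return any(case in text for case in case_numbers)


def classify(documents, case_numbers=()):
    """Split {path: text} into (worm_paths, normal_paths), each sorted."""
    worm, normal = [], []
    for path, text in documents.items():
        target = worm if needs_retention(text, case_numbers) else normal
        target.append(path)
    return sorted(worm), sorted(normal)
```

The first list returned by `classify` is the set of documents that would be candidates for the WORM section; everything else stays in the normal read and write area.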


A final layer of data protection would be to replicate this data to another digital archive in a remote facility. Normally this would require a lot of WAN bandwidth, but with data de-dupe only the unique blocks are stored, so only new or changed blocks need to be replicated to the remote location. That keeps the bandwidth required for the connection to the remote site minimal.
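To see why de-duped replication is light on bandwidth, consider this Python sketch. It assumes both sites keep blocks keyed by a SHA-256 fingerprint and that the local site can learn which fingerprints the remote site already holds; `index_blocks` and `replicate` are hypothetical names, and plain dictionaries stand in for the two archives.

```python
import hashlib


def index_blocks(data, block_size=4096):
    """Split data into fixed-size blocks keyed by SHA-256 fingerprint."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest(): data[i:i + block_size]
        for i in range(0, len(data), block_size)
    }


def replicate(local, remote):
    """Copy blocks the remote site lacks; return the number of bytes sent."""
    sent = 0
    for digest, block in local.items():
        if digest not in remote:
            remote[digest] = block  # stands in for a transfer over the WAN
            sent += len(block)
    return sent
```

If the remote archive already holds yesterday's blocks, replicating today's data sends only the blocks that are new, and running the replication again immediately sends nothing.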


With this system in place, a question comes up as to how and why to do backups at all. With a disk based archive in place, the role of backup changes to being responsible only for recovering the most recent copy of data. In a disaster, the most important thing is to get the data you were most recently working with back into production quickly. A digital archive will also greatly reduce the overall impact of the backup process itself by moving data out of the primary infrastructure. This reduces the amount of data that has to be backed up and, more importantly, the amount of data that would have to be recovered in the event of a disaster. Here a disk based backup solution or a continuous data protection solution is the answer. In fact, if you combine continuous data protection with disk based archive, you may be able to eliminate the need for a backup process.


As I speak to my clients about these types of strategies, a question that often comes up is, "Can I use this disk based archive as a big (although slow) file server?" The answer is yes, but only for office productivity types of files. If your users can tolerate four seconds to load a PowerPoint file instead of two seconds, then I see no reason why you could not do this. This would also be a great place to store the dreaded PST files that are clogging up a lot of storage space in your data centers. For the possible small performance loss of using archive storage (I doubt many of your users will actually notice it), you would gain automatic versioning, the storage efficiencies of de-dupe, and the convenience of indexing.


A correctly designed digital archive can reduce your backup window, improve your ability to recover current data, improve your ability to store and find older data, and allow you to keep your neck out of the ever-tightening noose of corporate responsibility. For more on designing a disk based archive or eliminating the backup process, please email me at georgeacrump@mac.com


 

Friday, May 25, 2007

 
 