• Restore them (after all, they are recorded in the data format of the backup software), remove the redundant files and migrate these data sets off to a formal archive? 

• Delete them, since they’ve served their primary purpose of supporting restores for these weeks or months?

• Or just keep them?


For most organizations the answer is to keep them, using the backup system to manage what’s essentially an archive. While deduplication makes this tempting, storing long-term data in the backup system exposes weaknesses in three areas: data security, data retrieval and data destruction.


What’s really needed is an archive-focused storage system, like those from Permabit, that can still compress and deduplicate data but also provides long-term data retention capabilities. Data that’s backed up and deduplicated typically isn’t encrypted, isn’t locked down to assure that it hasn’t been changed, isn’t readily accessible for retrieval of files or email, and isn’t available for destruction when called for.



Encryption


Data must be secure, meaning it is protected from unwanted or unintended access. This usually means it must be encrypted. But encryption and deduplication work against each other unless the storage platform specifically integrates the two: encrypting data separately, before it reaches the deduplication engine, renders it unique, and unique data has no duplication to remove. So backups that are stored in perpetuity on a deduplicating system often won’t be encrypted.
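The effect of separate encryption on deduplication can be seen in a small Python sketch. It is an illustration under stated assumptions, not a model of any particular product: the 4 KB chunk size, the function names and the toy cipher (a SHA-256 keystream with a random nonce, standing in for a real cipher and not secure in itself) are all hypothetical.

```python
import hashlib
import os

CHUNK = 4096  # hypothetical fixed chunk size used by the deduplication step


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher for illustration only; NOT secure.

    A fresh random nonce is drawn for every call, as real encryption
    schemes do, so the same plaintext yields different ciphertext each time.
    """
    nonce = os.urandom(16)
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream += block
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def unique_chunks(blobs) -> int:
    """Count the distinct chunks a chunk-hashing deduplicator would store."""
    seen = set()
    for blob in blobs:
        for i in range(0, len(blob), CHUNK):
            seen.add(hashlib.sha256(blob[i:i + CHUNK]).digest())
    return len(seen)


# Eight distinct chunks of data, backed up unchanged five times.
payload = b"".join(bytes([i]) * CHUNK for i in range(8))
key = os.urandom(32)

plain_backups = [payload] * 5
encrypted_backups = [toy_encrypt(payload, key) for _ in range(5)]

plain_unique = unique_chunks(plain_backups)
encrypted_unique = unique_chunks(encrypted_backups)

print("unique chunks, unencrypted backups:", plain_unique)
print("unique chunks, separately encrypted backups:", encrypted_unique)
```

The five unencrypted backups collapse to the eight chunks of a single copy, while the five encrypted backups share nothing, so every chunk has to be stored.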



Lock down


Protection against accidental deletion or corruption is not the only concern. With regulatory and compliance requirements, IT managers now have to make data available for legal discovery. This means it must be produced on demand to support a pending legal action. But the law also requires assurance that this data has not been changed before it’s used in support of that action (i.e., chain of custody). The only way to prove chain of custody is to move the data to a WORM (write once, read many) volume. Unfortunately, most deduplicating backup systems don’t provide WORM capability either.



Retrieval


As mentioned above, one of the new requirements for data storage is the ability to produce specific data in response to a legal discovery motion. The storage system must be able to retrieve any of the files in question within the prescribed timeframe, or the organization faces fines and/or penalties. Historical data must also be available for business needs, such as trend analysis and predictive programs that optimize the returns on customer campaigns. Similarly, a previous project may be restarted, or research may be needed again after it has been archived. The point is, if data is valuable enough to be kept in the first place, it needs to be readily and easily accessible, and not locked in the proprietary format of a backup application that may change over the next few years.


Because the goal of the backup system is to improve backup efficiency, backup systems that employ deduplication store the entire backup job together (usually thousands or millions of files) and must reconstruct the files needed for each restore. That’s not the same as a disk archive, where an end user can access a file share and copy only the files needed. From an administration perspective, the result of long-term storage in the backup system is extra time spent searching for needed files, extra time spent retrieving those files and the frustration that accompanies tedious activities like this.



Email


In today’s organization, the email system is a de facto filing system. Its chronological structure makes it the natural place to look for some data when the source is uncertain. Some people even use email to store attachments, never bothering to copy them to a file share. When considering the shortcomings of using a backup system with deduplication to store email long term, the issue of accessibility and retrieval becomes even more critical.


Backup systems treat email systems (using Exchange as an example) as the large databases that they are. The email system stores each message as a separate record and usually holds pointers to any file attachments. The backup system backs it up like a database and restores it as such. Message-level restores usually require a special module and a much slower backup process in order to prepare the data for these granular restores.


The way to optimize the use of the data and reduce the backup cycle is to use an email archiving solution. Most can be configured to send the messages and attachments they save off to a separate file-type archive. If a Permabit Enterprise Archive or similar archive tier solution has been implemented, an email archive can target it as well. The archive then becomes a single repository for all retained information. Doing so not only improves efficiency but manages data retention as well.



Destruction


Using a backup system to essentially manage an archive presents another unique issue. Managing a data set for regulatory compliance and legal prudence means more than just being able to access files to produce for the court. It also means being able to access files (and all copies) so they can be destroyed at the appropriate time. It is critical to know that a file that needs to be destroyed is really gone and that ALL of the copies are gone.


A backup system running deduplication will have many logical copies of the files it backs up. While these files may exist physically in only one location, they’re still spread virtually across every backup job run since the files were first saved. It’s the act of repeatedly encountering redundant files and keeping virtual copies of them that gives deduplication its high data reduction rates in the first place. So, in order to be sure all copies are destroyed, every backup job that includes those files must be found before the files can be deleted.
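The relationship between one physical copy and many virtual copies can be sketched in a few lines of Python. This is a simplified in-memory model under stated assumptions: it deduplicates whole files rather than blocks, and the `DedupStore` class, its method names and the job names are hypothetical, not taken from any backup product.

```python
import hashlib
from collections import defaultdict


class DedupStore:
    """Hypothetical file-level deduplicating store, for illustration only."""

    def __init__(self):
        self.blobs = {}                # digest -> data (the single physical copy)
        self.jobs = defaultdict(dict)  # job id -> {path: digest} (virtual copies)

    def backup(self, job_id, files):
        """Record a backup job; data already seen is stored only once."""
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)
            self.jobs[job_id][path] = digest

    def jobs_referencing(self, data):
        """Return every backup job that holds a virtual copy of this data."""
        digest = hashlib.sha256(data).hexdigest()
        return sorted(
            job for job, entries in self.jobs.items() if digest in entries.values()
        )

    def purge(self, data):
        """Drop the reference from every job, then the physical copy."""
        digest = hashlib.sha256(data).hexdigest()
        touched = self.jobs_referencing(data)
        for job in touched:
            self.jobs[job] = {
                path: d for path, d in self.jobs[job].items() if d != digest
            }
        self.blobs.pop(digest, None)
        return touched


store = DedupStore()
memo = b"contract draft"

# Four weekly jobs: the memo never changes, the log changes every week.
for week in range(1, 5):
    store.backup(
        f"week-{week}",
        {"docs/memo.txt": memo, "docs/log.txt": f"log {week}".encode()},
    )

physical_copies = len(store.blobs)         # one memo plus four distinct logs
referencing = store.jobs_referencing(memo)  # every job still points at the memo
destroyed_from = store.purge(memo)
```

In this model the purge is a single call because the catalog is one dictionary. The point of the sketch is the lookup step: the memo occupies one physical slot, yet all four jobs must be identified and updated before it can be considered destroyed.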

 

Another issue arising from the use of backup architecture for long term storage is isolation. Since backup jobs are written together, the file candidates for deletion must be isolated from the backup job as a whole. This means a restore must be made of the job and then the files can be deleted. Once the files are gone, the job must be resaved to preserve the remaining files. This can complicate the process the backup software goes through to track the backup’s true age.



Summary


Backup software was designed first to get data backed up and second to make it available for restores. It assumes the data is being accessed and is changing, and it was never designed to manage these data sets for the long term. When deduplication is used in backup systems where data is kept long term, it introduces additional vulnerabilities.


In these systems, when data has aged past the point where it’s changing or being restored, it should be removed from the backup system before it becomes a candidate for deletion. A better place for data that no longer belongs in the active backup cycle is a purpose-built archive storage system, which uses software to manage these data sets long term through multiple generations of technology and applications. These systems can provide the three things long-term storage needs: data security, data accessibility and data destruction.

Eric Slack, Senior Analyst

for Long Term Data Retention