Backups: Band-Aids or Solutions
Perhaps no other segment of the storage environment has more products devoted to it than data protection. They range from general-purpose operating system protection to very specific application protection. There are solutions that can recover your data in seconds and those that will take days. There are backup storage devices that deduplicate data so you can store more backups, and those that simply store all the data but do so inexpensively. The problem is that most of these solutions do not address the core problem: there is too much data on primary storage. They are Band-Aids, not solutions.
Tuesday, January 19, 2010
The problem with these Band-Aid solutions is that they provide a false sense of security and cause customers to invest in elaborate point solutions and infrastructures to address the problem. The root of the problem is that the bulk of this data simply does not need to be backed up anymore. According to almost every data study, the primary storage devices at most data centers are storing inactive data that has not been accessed in the last six months to a year. In many data centers the portion of inactive data is as high as 80%.
Impact of the inactive data problem
In most data centers complete full backups of the environment are done either weekly or monthly. This means that every time a full backup is done as much as 80% of the data being moved across the network and being stored on the backup target has not changed in the last year or more. This reality has led to the rapid success of data deduplication devices that use block-level identification to eliminate some of the duplicate information.
The challenge, however, is that these deduplication devices don’t address the problem of still having to move all this data across the network. As a result, they really only help with storage of the backup. Even though they are disk based, they only slightly reduce the time required to do the backup. In addition, an ongoing investment in the network infrastructure is still required, which can become complicated and expensive.
Even if there is a budget for, and a willingness to invest in, the network, the problem of preparing all these files for the backup still remains. During backup, the backup application needs to examine each file to see if it has changed since the last backup. This is sometimes called “walking the file system” and can be very time consuming, especially on servers with a large quantity of files.
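To make the cost of walking the file system concrete, here is a minimal Python sketch. It assumes a backup tool that decides what to copy by comparing each file's modification time with the time of the last backup; the function name `files_changed_since` is hypothetical, and real backup products may use change journals or snapshots instead.

```python
import os


def files_changed_since(root, last_backup_time):
    """Walk root and return (changed_paths, files_examined).

    Every file is stat-ed even when it has not changed, so the scan
    cost grows with the number of files, not with the amount of
    changed data.
    """
    changed = []
    examined = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            examined += 1
            if mtime > last_backup_time:
                changed.append(path)
    return changed, examined
```

If 80% of the files are inactive, roughly 80% of the `os.stat` calls in this loop only confirm that nothing has changed.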
There is also the retention issue. With most backup systems, ‘important’ data is mixed with truly ‘critical’ data, which is mixed with data that’s needed for compliance. In most environments, different forms of data have different values. Some need to be kept to maintain regulatory compliance and some need to be kept for internal corporate governance. When all this data is mixed within the same backup data set, it’s all but impossible to apply specific retention schedules. As a result, most organizations decide to keep all data for significantly longer than they have to, which creates additional liability. It also makes data recovery the equivalent of finding a needle in a haystack.
Finally, there is the restoration issue, the whole reason that backups are performed in the first place. Static data mixed in with the active data lengthens the time needed to recover. For example, restoring a server to its original state may require the restoration of 1TB of data when only 200GB of data is actually needed. No matter what the technology is, it’s always faster to restore 200GB than 1TB. If this inactive data can be moved out of the way, then recoveries that used to take days can be done in hours.
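The 1TB versus 200GB comparison can be worked through with simple arithmetic. The sketch below assumes a sustained restore rate of 100 MB/s and decimal units (1GB = 1,000MB); both are illustrative assumptions, not figures from any particular product.

```python
def restore_hours(data_gb, throughput_mb_per_s):
    """Estimate restore time in hours at a constant transfer rate."""
    seconds = data_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


ASSUMED_RATE_MB_PER_S = 100  # hypothetical sustained restore rate

full_restore = restore_hours(1000, ASSUMED_RATE_MB_PER_S)   # all 1TB
active_restore = restore_hours(200, ASSUMED_RATE_MB_PER_S)  # active 200GB only

print(f"Full data set: {full_restore:.1f} hours")
print(f"Active data only: {active_restore:.1f} hours")
```

Whatever rate is assumed, the ratio stays the same: restoring only the active 20% takes one fifth of the time.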
Using Archive to address the inactive data problem
The simplest solution to the inactive data problem is to get rid of it: remove it from primary storage. If, hypothetically, all the inactive data in an environment were instantly deleted, it would free up as much as 80% of the storage in many organizations. Besides halting additional storage purchases, this would dramatically improve the backup and recovery processes. Staying with the 1TB example, it’s simply easier to scan, move and store 200GB of data than 1TB.
Of course, for most organizations deleting 80% of the data set is not only impractical, it may be illegal. Something else is needed: an archive. By establishing an archive storage tier, IT managers could move the inactive data set off primary storage and out of the backup process. Since this data would be stored separately, discrete retention policies could be set. Backup processes would not have to examine as many files to determine whether they need to be backed up, and recoveries would handle only the active working set of data. Not only would backup and recovery improve, the investment in backup hardware and network infrastructure would be greatly reduced.
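As an illustration of how archive candidates might be identified, the following Python sketch lists files that have been neither read nor modified within a given window. The 180-day default mirrors the six-month figure cited earlier; the function name `find_inactive_files` is hypothetical, and access times can be unreliable on file systems mounted with options such as `noatime`, so treat the result as a starting point rather than a policy.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_inactive_files(root, inactive_days=180, now=None):
    """Return (paths, total_bytes) for files unused for inactive_days."""
    now = time.time() if now is None else now
    cutoff = now - inactive_days * SECONDS_PER_DAY
    inactive = []
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that cannot be examined
            # Use the later of last access and last modification.
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                inactive.append(path)
                total_bytes += st.st_size
    return inactive, total_bytes
```

The returned byte total gives a rough estimate of how much primary storage, and how much of each full backup, an archive tier could take over.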
The concept of archive is not new. It has been a method of offloading primary storage and reducing the load on the backup process since the days of the first mainframes. Open systems and Windows platforms have seen little use of the process. The primary challenge has been that most archives were tape based. Using them required complicated, custom software and client-side agents. Also, recovering data from the archive and categorizing that data on the tape based archive was often challenging.
Over the past few years, disk based archives were developed to address these limitations. They were easy to access, typically via NFS or CIFS mount points, easy to index, and provided the rapid from-disk recoveries that tape based systems could not. They also addressed the limitations of simply using cheap disk arrays by providing the scalability and reliability that those arrays lacked.
Challenges to the Disk Archive
Disk based archives were not without their shortcomings. Some disk archives offered “tape-like” scalability, which they achieved with a clustered architecture of interconnected 1U servers, or nodes, each with its own storage. There was a limit, however, to how many nodes could be added to the typical storage cluster, and every node required power, space and cooling, all of which quickly come at a premium.
Also, in many cases these systems require a relatively large configuration of nodes to get started. Often the initial capacity requirement is 25TB or more, putting them beyond the practical needs of small to medium sized enterprises.
Finally, an archive, whether disk or tape, still needs to be managed. The storage needs to be implemented, allocated and monitored to make sure it’s working and doesn’t need maintenance or an upgrade. Given today’s reduced IT staffs, there may not be enough personnel bandwidth to accomplish these tasks, despite the improvement that disk archives may bring.
Cloud Storage as the Archive
A viable alternative may be found in cloud storage. Using cloud storage as an archive can provide many potential advantages to users. For cloud storage to be used as an archive, most organizations should look for a solution like those from Iron Mountain that uses a local appliance to cache recently archived data to local disk for rapid recoveries and then migrate that data to a cloud storage repository for the long term. This provides the same access that disk based archives do with the increased scalability of tape. Since most cloud archive solutions are delivered on a pay-as-you-need basis these can become viable alternatives for even the smallest of enterprises.
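The cache-then-migrate pattern described above can be modeled in a few lines of Python. This is a toy sketch only: the class `TieredArchive`, its methods, and the 30-day cache window are all hypothetical, both tiers are plain dictionaries, and nothing here reflects the actual design or API of any vendor's appliance.

```python
class TieredArchive:
    """Toy model: a local cache in front of a long-term cloud store."""

    def __init__(self, cache_days=30):
        self.cache_seconds = cache_days * 86400
        self.local_cache = {}  # name -> (data, archived_at)
        self.cloud_store = {}  # name -> data

    def archive(self, name, data, now):
        # Newly archived data lands on local disk first for fast recall.
        self.local_cache[name] = (data, now)

    def migrate(self, now):
        # Move items older than the cache window to the cloud tier.
        moved = []
        for name, (data, archived_at) in list(self.local_cache.items()):
            if now - archived_at >= self.cache_seconds:
                self.cloud_store[name] = data
                del self.local_cache[name]
                moved.append(name)
        return moved

    def recall(self, name):
        # Return (data, tier) so the caller can see where it was served from.
        if name in self.local_cache:
            return self.local_cache[name][0], "local"
        if name in self.cloud_store:
            return self.cloud_store[name], "cloud"
        raise KeyError(name)
```

Recently archived items are recalled from the local tier, while older items are still retrievable after they have been migrated, which is the access pattern the paragraph above describes.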
Additionally, some cloud archive systems can be written to directly through an API set. For example, Iron Mountain is partnering with Independent Software Vendors (ISVs) to allow the integration of a cloud archive directly into their applications. This is an ideal point at which to drive data to archive storage. The information is still fresh enough for the user to provide intelligence about the data set being archived, and the application can help as well. This combination increases the likelihood that the archive occurs in the first place, and the accuracy of recall when the data is needed.
A cloud archive also addresses the operational issues that other forms of archives have imposed on an IT staff. Since all the physical storage is outsourced, zero time is spent managing that data set. Additionally, none of the organization’s power, space or cooling resources are consumed. This fact alone could justify the investment in cloud-based archiving, compared with other archiving solutions which can’t make this claim.
Storage Switzerland’s White Paper on Cloud Archive provides important factors to consider when selecting a company to house your long term data. While many of these are significant, probably none is more important than the long term viability of the organization. There’s probably no better indicator of current stability and future longevity than years in business and experience in the given market. This is an area where Iron Mountain has a unique advantage. It’s a company that has been meeting the long term storage requirements of organizations of all sizes for the last six decades. In the area of cloud based storage, Iron Mountain has been providing solutions for over 10 years.
George Crump, Senior Analyst
This Article Sponsored by Iron Mountain
Solving Backup Problems with Cloud Archiving