Deduplication Weaknesses - The Network
Disk-based backup was supposed to cure the problems that plagued the backup process, but it introduced its own problems of cost and data retention, which forced quick migration to tape. Essentially, disk backup was used as a cache for the tape library. Data deduplication resolved many of the retention issues in disk backup, making the movement to tape less critical. But neither disk backup by itself nor disk backup with deduplication resolves one of the biggest headaches associated with backup: dealing with the backup network.
Most deduplication appliances perform their work target-side, that is, after the data has been transferred across the network to the destination device. This means that each time a weekly full backup is performed, the roughly 80% of the data that has not changed is retransmitted across the backup network to the destination deduplication device. Even though full backup windows are generally longer, this extra 80% data load still stresses the backup window. Most customers complain about meeting their full backup windows, not their nightly windows. As a result, the backup network infrastructure is overbuilt to meet a once-a-week or once-a-month requirement.
Disk-based archiving does resolve the network challenge and other backup process challenges, and the two technologies are complementary.
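As a rough illustration of this point, the Python sketch below splits the traffic of a single full backup into unchanged and changed data. The function name, the 10 TB data set, and the use of 80% as the unchanged fraction are assumptions chosen for illustration, not measurements from any particular appliance.

```python
def full_backup_network_load(primary_tb: float, unchanged_fraction: float) -> dict:
    """Break down what one full backup sends over the network.

    With target-side deduplication the whole data set crosses the
    network; duplicates are only discarded once they reach the appliance.
    """
    if not 0.0 <= unchanged_fraction <= 1.0:
        raise ValueError("unchanged_fraction must be between 0 and 1")

    redundant_tb = primary_tb * unchanged_fraction  # already held by the appliance
    changed_tb = primary_tb - redundant_tb          # genuinely new since the last full
    return {
        "sent_tb": primary_tb,        # everything is transmitted regardless
        "redundant_tb": redundant_tb,
        "changed_tb": changed_tb,
    }


# Hypothetical example: 10 TB of primary data, 80% unchanged since the last full.
load = full_backup_network_load(10.0, 0.8)
```

In this hypothetical case about 8 TB of the 10 TB sent is data the appliance will discard on arrival. The deduplication ratio improves storage efficiency but does nothing for the network.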
Many surveys indicate that over 80% of the data on primary storage is inactive. Backup devices that leverage data deduplication count on this to achieve their impressive optimization rates. In this scenario optimization rates of 20:1 are not uncommon. This is because each time the user does a full backup, most of the data has not changed and is eligible for deduplication.
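To see why repeated full backups of mostly unchanged data produce ratios of this size, consider a deliberately simplified model, sketched here in Python. It assumes every retained full backup has the same logical size and that a fixed fraction of the data changes between fulls; it ignores incrementals, compression, and sub-file effects. The function name and both parameters are hypothetical.

```python
def dedup_ratio(retained_fulls: int, change_fraction: float) -> float:
    """Estimate logical-to-physical ratio for a set of retained full backups.

    Simplified model (an assumption, not a vendor formula):
      logical  = retained_fulls * S
      physical = S + (retained_fulls - 1) * change_fraction * S
    where S is the size of one full backup. S cancels out of the ratio.
    """
    if retained_fulls < 1:
        raise ValueError("retained_fulls must be at least 1")
    if not 0.0 <= change_fraction <= 1.0:
        raise ValueError("change_fraction must be between 0 and 1")

    logical = retained_fulls
    physical = 1 + (retained_fulls - 1) * change_fraction
    return logical / physical
```

Under these assumptions, 30 retained fulls with 2% change between them yield a ratio of roughly 19:1. The ratio depends almost entirely on how little the data changes, which is why inactive data is what makes the numbers look so impressive.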
However, counting on the deduplication technology alone can be problematic, and many of these backup challenges can be better addressed by using an archive system along with a backup deduplication appliance.
Throughout the data center, a leading topic of discussion is better utilization of resources through technologies like server virtualization and thin provisioning. The backup network should be no different. Instead of upgrading the backup network to continue to meet full backup windows, another approach should be considered that better utilizes existing infrastructure and resources.
One of the first technologies discussed when looking at better utilization of the backup network is source-side deduplication. While this technology has merit, it places a performance load on the client servers and in most cases requires a complete replacement of the backup software application.
To get maximum deduplication, all the data must go through the new application. In most data centers this is simply not realistic. Most customers have special provisions for their database environments and virtualized server environments, and they use other applications, like email archiving, that come from a different supplier than their backup application.
Lastly, source-side deduplication, or any backup deduplication for that matter, does not address the problem at its root: there is too much data. By not addressing the root cause, customers end up building elaborate backup infrastructures that need constant upgrading and the deployment of new technologies that merely work around the problem.
Disk-based archiving addresses the root cause of the problem by reducing the amount of data on primary storage. Moving inactive data from primary storage onto a secondary archive or “value” tier reduces the amount of data that needs to be protected during the backup process by as much as 80%.
It is important that this data is moved to a tier of disk designed specifically for archiving. If, for example, a NAS with cheap SATA disk is selected, all that is achieved is the cost savings associated with moving this inactive data off primary storage. The data will still need to be protected and managed. In fact, a case could be made that placing data on a cheap SATA array increases its vulnerability, and as a result the protection responsibility will actually increase.
On the other hand, a disk-based archive system, like those offered by Permabit Technologies, is self-protecting, self-healing and self-upgrading. The inactive data can be moved to this tier almost as inexpensively as moving it to a cheap SATA tier, meaning that the archive system will also represent a significant cost savings vs. primary storage.
In addition, the disk archive system is self-protecting. These systems have moved beyond RAID 6 to a protection strategy that allows for multiple drive failures without loss of data. They also have replication built in, and ideally these systems should replicate from one to another. With replication in place, there is no need to perform backups of the system. The primary system has multiple points of redundancy and is then replicated to a secondary system, which has those same multiple points of redundancy. The net result is that the combined pair is more reliable than the tape that would be used to back it up.
With this data safely stored on the disk archive and the need to include that archive in the backup process removed, the full backup data set can be successfully reduced by over 80%.
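The effect on the backup window can be estimated with simple arithmetic, sketched below in Python. The function names, the 10 TB data set, and the 100 MB/s sustained throughput are all hypothetical values for illustration; real throughput depends on the network, the backup software, and the target device. Decimal units are assumed (1 TB = 1,000,000 MB).

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption for this sketch


def remaining_after_archive(primary_tb: float, archived_fraction: float) -> float:
    """Data left on primary storage after inactive data moves to the archive."""
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    return primary_tb * (1.0 - archived_fraction)


def full_backup_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_tb across the network at a sustained rate."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    seconds = data_tb * MB_PER_TB / throughput_mb_per_s
    return seconds / 3600.0


# Hypothetical example: 10 TB on primary storage, 100 MB/s sustained throughput.
before_hours = full_backup_hours(10.0, 100.0)
after_hours = full_backup_hours(remaining_after_archive(10.0, 0.8), 100.0)
```

With these assumed figures the full backup drops from roughly 27.8 hours to roughly 5.6 hours on the same network, without any upgrade to the backup infrastructure.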
When a disk-based archive system is implemented, customers typically find they have freed enough primary storage capacity to eliminate or reduce primary storage purchases for the next two to three years. Even when the buying cycle for primary storage resumes, they are able to purchase significantly less capacity because old data keeps moving to the archive.
The archive system has a similar effect on the backup infrastructure, which was built for a full backup load that has now been reduced by over 80%. Also, with a disk-based archiving system in place, the need for future upgrades to the backup infrastructure will be minimized, because as data ages it will be moved out of the primary backup process and stored permanently on the archive system.
Disk backup with deduplication serves a needed role in the enterprise. It is ideal for medium-term (30-day) retention of backup data and should be used to recover the last known good copy of data. Where these systems struggle is when their role is extended to a longer-term strategy.
Archive systems represent a new “value tier” of storage that is the perfect complement to a deduplication backup device. They reduce the size requirement of that device and the size of the investment in it. Additionally, archive systems reduce the network backup infrastructure requirement, provide longer-term data retention capabilities and increase IT efficiency.
In part II of this series we will address the second weakness of backup deduplication: long-term data retention. In the third and final part we will discuss how deduplication, while it improves storage efficiency, potentially worsens IT efficiency.
Tuesday, April 21, 2009
Deduplication made disk backup practical, but it is not without its challenges. In this three-part series Storage Switzerland will examine deduplication weaknesses and how disk-based archiving might be better positioned to resolve them. In this first part the backup network is examined.