Does Backup Need A Better DeDupe?
Deduplication has become a well-entrenched feature of many backup products, including backup hardware and software. The benefits of deduplication are especially clear in a backup implementation since so much of the data is redundant.
Precisely because those benefits are so clear, users need to be careful that deduplication is not treated as a ‘checkbox item’. A wide variety of deduplication technology is deployed in today’s backup environments, and implementation methods differ considerably. There is room for improvement in the deduplication technologies currently in use; backup deduplication is ready for an upgrade.
Deduplication Assumptions
Users are often unaware of how deduplication ends up in backup appliances and software. The common assumption is that the vendor developed the deduplication code internally as part of its own code base. That was the case for the early entrants into the disk backup market, for whom deduplication was a foundational component of their offerings.
Since then, some products’ deduplication code has come from other code bases that have been made publicly available as open source. The goal may be to let the vendor focus on its unique backup feature set and leverage publicly available code to save development time and cost. Unfortunately, these open source code bases were typically not designed to handle the capacity and performance demands that the backup process places on deduplication.
Another common assumption about deduplication in the backup process is that it is used everywhere, by everyone, and is a tried-and-true technology. However, according to a study by Storage Magazine, fewer than one-third of users surveyed are actually using deduplication.
Challenges With Backup Deduplication
It's Not Everywhere!
First, there is still an ongoing concern that deduplication will corrupt data or cause data loss. After all, the technology is designed NOT to write data by default. That means the catalogs that track what should and should not be written are critical to data safety and, as a result, users are wary of deduplication engines. This is compounded by the fact that many of these deduplication engines are new, often open source, and have not been properly vetted in high-capacity or high-object-count environments.
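To make the role of the catalog concrete, here is a minimal sketch of a deduplicating store. It is illustrative only: the class name DedupStore, the fixed-size chunking and the choice of SHA-256 are assumptions made for the example, not a description of any particular product.

```python
import hashlib


class DedupStore:
    """Toy fixed-size-chunk deduplicating store (hypothetical, illustrative)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes, each stored once
        self.catalog = {}  # backup name -> ordered list of fingerprints

    def write(self, name, data):
        """Store a backup; return how many chunks actually had to be written."""
        recipe = []
        new_chunks = 0
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                # Only previously unseen chunks are written.
                self.chunks[fp] = chunk
                new_chunks += 1
            recipe.append(fp)
        # The catalog entry is the only record of how to rebuild this backup.
        self.catalog[name] = recipe
        return new_chunks

    def read(self, name):
        """Rebuild a backup from its catalog entry."""
        return b"".join(self.chunks[fp] for fp in self.catalog[name])


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    monday = b"AAAABBBBAAAACCCC"
    print(store.write("monday", monday))   # chunks written for the first backup
    print(store.write("tuesday", monday))  # chunks written for an identical backup
    print(store.read("tuesday") == monday)
```

In this sketch the second backup writes no chunk data at all, so the only thing that makes it restorable is its catalog entry. Lose or corrupt that entry and the backup is gone even though every chunk is still on disk, which is why the maturity of the engine matters so much.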
Users should look for a deduplication engine that has undergone years of testing in a variety of situations and has a proven track record of scaling to petabytes of capacity without a corruption incident. For example, Permabit's Albireo deduplication engine is based on a technology that has been widely used in the field for over a decade and is now in use by many well-known storage companies.
Deduplication is Siloed
The second reason that deduplication has not been adopted broadly is that the technology has been siloed. Each backup application or disk backup appliance that adds deduplication does so without compatibility with other products. As a result, customers are forced to manage data from different deduplication engines.
This is true even if all their products come from the same vendor. There are several examples of vendors whose backup software can deduplicate client-side, server-side and on a hardware device, yet each step is unaware of the others’ deduplication efforts and is most likely incompatible with them. They are essentially separate products, each performing a solo task in a silo. In many cases, when data needs to move to another system, this leads to a re-inflation of the data, and in most cases it requires multiple deduplication "sweeps" to eliminate redundancy.
The industry needs a de facto standard to emerge that will allow cross-device deduplication. With a single meta-data controller in place, or the ability to provide inter-meta-data updates across applications, most of the re-inflation and multiple sweeps mentioned above can be eliminated. Doing so would also eventually provide cross-repository deduplication. As a result, data could stay in a deduplicated form throughout the data life cycle, from primary storage to backup storage to archive storage, saving the storage space needed for re-inflation as well as processor cycles and footprint.
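The cost of re-inflation can be illustrated with a small sketch. It assumes, purely for the example, that an engine is fully described by a (chunk size, hash algorithm) pair and that two engines can exchange meta-data only when those pairs match; the function names are hypothetical.

```python
import hashlib


def fingerprints(data, chunk_size, algo):
    """Ordered chunk fingerprints as one (hypothetical) engine would compute them."""
    return [
        hashlib.new(algo, data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def bytes_to_transfer(data, src, dst, dst_known):
    """Bytes that must be sent to move `data` from engine src to engine dst.

    src and dst are (chunk_size, algo) tuples; dst_known is the set of
    fingerprints the destination already holds.
    """
    if src != dst:
        # Incompatible engines share no meta-data: the source must
        # re-inflate the full stream so the destination can chunk and
        # fingerprint it all over again.
        return len(data)
    chunk_size, algo = src
    seen = set(dst_known)
    sent = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.new(algo, chunk).hexdigest()
        if fp not in seen:
            # Compatible engines only ship chunks the destination lacks.
            sent += len(chunk)
            seen.add(fp)
    return sent


if __name__ == "__main__":
    data = b"AAAABBBBCCCCDDDD"
    engine_a = (4, "sha256")
    engine_b = (8, "sha1")
    known = set(fingerprints(data[:8], *engine_a))  # destination has half already
    print(bytes_to_transfer(data, engine_a, engine_a, known))  # shared meta-data
    print(bytes_to_transfer(data, engine_a, engine_b, known))  # siloed engines
```

When both sides speak the same fingerprint "language", only the missing half of the data moves. When they do not, everything moves in full and has to be deduplicated a second time on arrival.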
Deduplication Has An Expense
Even though most products include deduplication as a feature, there is still an expense to deploying the technology. Because of inefficient implementation methods, more resources (RAM and processor cycles) are needed in either the backup server or the backup appliance. Instead of solving these inefficiencies, vendors have chosen to "throw hardware" at the problem. While this approach may be easier for them, it drives up the cost of these devices by requiring more powerful CPUs and additional RAM to hold deduplication meta-data tables.
This inefficiency will only get worse as the capacities of those devices increase. Eventually they won’t be able to store the deduplication meta-data in RAM alone and will be forced to use either hard disk space, which hurts performance, or flash memory, which drives up platform cost significantly.
Further compounding the problem is the move to virtualize backup servers and backup appliances. When these are implemented as virtual machines, their consumption of resources can impact production applications that are also virtualized. This drives up the cost of the physical host, which must accommodate the additional resources that deduplication demands of the backup virtual machine.
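A back-of-the-envelope calculation shows how quickly a naive in-RAM index outgrows a server. The figures used here (8 KiB average chunks, 32 bytes of meta-data per unique chunk) are assumptions chosen for illustration, not measurements of any product.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def index_ram_bytes(capacity_bytes, avg_chunk_bytes, entry_bytes):
    """RAM needed to hold one index entry per unique chunk of stored data."""
    unique_chunks = capacity_bytes // avg_chunk_bytes
    return unique_chunks * entry_bytes


if __name__ == "__main__":
    # Assumed: 8 KiB chunks and 32 bytes of index meta-data per chunk.
    for capacity_tib in (10, 100, 1000):
        ram = index_ram_bytes(capacity_tib * TIB, 8 * 1024, 32)
        print(f"{capacity_tib:>5} TiB stored -> {ram / GIB:,.0f} GiB of index")
```

Under these assumptions the index grows linearly with unique capacity, so every tenfold increase in stored data demands ten times the RAM unless the meta-data itself is made smaller or only partly resident.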
To avoid over-consumption of resources, deduplication code must focus on efficiency instead of attempting to work around issues related to poor resource utilization in the application design. This means designing methods to reduce the size of the meta-data catalogs and developing a more efficient process for caching the needed parts of that meta-data catalog in RAM.
Deduplication meta-data access is not the rudimentary first-in, first-out type of access that most caching was designed for. Deduplication indices cannot be cached effectively by traditional methods because hash values are randomly distributed, so chunks that sit next to each other in a backup have unrelated index keys. Companies like Permabit have addressed this by developing advanced heuristics that analyze access patterns to make sure that RAM meta-data catalog storage is efficient and used sparingly.
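The caching problem can be demonstrated with a small simulation. This is a generic illustration and not Permabit's heuristics: it compares a conventional per-entry LRU cache with a hypothetical cache that, on a miss, also loads the index entries of chunks that were originally written alongside the missing one. All names and sizes are assumptions.

```python
import hashlib
from collections import OrderedDict


class LRU:
    """Fixed-capacity least-recently-used set of fingerprints."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def hit(self, key):
        if key in self.items:
            self.items.move_to_end(key)
            return True
        return False

    def add(self, key):
        self.items[key] = True
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def disk_lookups(stream, cache_size, group_size, nights=1):
    """Count on-disk index reads while deduplicating repeat backups.

    group_size == 1 models a plain per-entry LRU cache; larger values
    model loading the neighbours written together with the missed chunk.
    """
    position = {fp: n for n, fp in enumerate(stream)}  # original write order
    cache = LRU(cache_size)
    reads = 0
    for fp in stream * nights:  # each night presents the same chunks again
        if cache.hit(fp):
            continue
        reads += 1  # cache miss: go to the on-disk index
        start = position[fp] - position[fp] % group_size
        for neighbour in stream[start:start + group_size]:
            cache.add(neighbour)
    return reads


if __name__ == "__main__":
    # SHA-256 values are effectively random, so adjacent chunks in the
    # backup have unrelated fingerprints.
    stream = [hashlib.sha256(str(i).encode()).digest() for i in range(10000)]
    print("per-entry LRU :", disk_lookups(stream, 1000, 1, nights=2))
    print("locality-aware:", disk_lookups(stream, 1000, 100, nights=2))
```

With the same amount of cache RAM, the per-entry LRU goes to disk for every chunk because the working set is larger than the cache and the keys offer no locality to exploit. Grouping entries by the order in which data arrived lets one disk read answer the next run of lookups.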
The Impact of Better DeDupe
The next wave of backup deduplication technology will solve these problems and increase deduplication adoption in the backup process. If backup vendors improve their deduplication code instead of throwing more hardware at the problem, deduplication could become safer, more efficient, less siloed and less expensive. Eventually, a single deduplication umbrella could cover all phases of the data life cycle, including primary, secondary, backup and archive storage.
Driving down the resources required by deduplication allows an appliance vendor to offer their solutions on lower cost platforms and make the technology more appealing to smaller organizations.
An API-driven or Windows/Linux-based virtual engine approach, like that offered by Permabit, also allows tier 2 hardware manufacturers to enter the enterprise deduplication market quickly and deliver a differentiated offering that will perform, scale and be resource efficient. As a result, vendors adopting Permabit's Albireo technology will be able to compete more effectively with the established market leaders and provide a cost-effective alternative to their users.
Permabit is a client of Storage Switzerland
Thursday, December 6, 2012
George Crump, Senior Analyst