Is Dedupe Breaking?
Deduplication has delivered storage efficiency gains in the disk backup market for many years. In the last year or so, the cost benefits of deduplication have also caught the eye of primary storage vendors, and a few have implemented backup-style deduplication in their primary offerings. While they won’t see the levels of redundancy typical for backup, users can certainly benefit from even the conservative 5X improvement in storage efficiency that primary storage deduplication is promising.
Alongside the push for storage efficiency through technologies like deduplication, flash is also on the rise. It is being applied to primary storage in hybrid (flash/HDD combination) systems, as caching storage for hot data and even as flash-only (no HDD) storage. Combining deduplication and flash will have a dramatic impact on the storage industry. Since flash storage is sold at a roughly 15X cost premium versus traditional hard drives, an efficiency increase of even 5X with deduplication can provide measurable savings, narrowing the price gap between flash and HDD. As a result, primary storage vendors, flash-only storage vendors and flash cache vendors are all rushing to deliver their deduplication features to market.
However, this rush to market may come at the cost of long-term data integrity and performance in some of these architectures. Consistent performance for the life of the system and the ability to scale out capacity and performance are critical requirements for any primary storage device that offers deduplication as an option. Deduplication, if not properly designed and tested, can degrade system performance, undermining the very goal it was meant to achieve.
In these situations, the dedupe engine may start to break down, degrading performance to the point that the system is unusable. Imagine investing in a hybrid, cache or solid state disk system and counting on deduplication to make the investment affordable, only to have performance degrade so badly that deduplication must be turned off. In most cases, this would cause a loss in storage efficiency (from no deduplication) and might also force an emergency purchase of additional storage just to land the un-deduped data.
Overview of Dedupe Operations
Most deduplication strategies operate by segmenting data and then generating a hash code for each of those segments. The hash value is checked prior to that data being stored to determine if it is a unique data segment or a duplicate. Think of the hash code as a serial number or fingerprint, each unique depending on the data in the segment. These serial numbers or codes are stored in a table against which the dedupe engine compares future data segments.
If a new segment’s code matches a code already in the table, that segment of data is a duplicate. In such a case, an entry is made in the table noting that the previously stored data segment now represents multiple segments in the original data set. The second instance of the segment is not stored, thereby reducing the capacity requirements for the data set. If the segment generates a code that the table does not have stored, then that data is new to the environment and needs to be physically stored on disk. This also means that a new entry needs to be made in the hash table.
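The write path described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name DedupeStore, the fixed-size segmenting, the choice of SHA-256 as the hash code and the in-memory dictionary standing in for the hash table are all assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy dedupe engine: fixed-size segments, SHA-256 fingerprints (assumed)."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        # Hash table: fingerprint -> [segment bytes, reference count]
        self.index = {}

    def write(self, data):
        """Store data and return the list of fingerprints that rebuilds it."""
        recipe = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fingerprint = hashlib.sha256(segment).digest()
            entry = self.index.get(fingerprint)
            if entry is not None:
                # Duplicate: note one more reference, store nothing new.
                entry[1] += 1
            else:
                # New segment: store it and add a hash table entry.
                self.index[fingerprint] = [segment, 1]
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its fingerprints."""
        return b"".join(self.index[fp][0] for fp in recipe)

    def stored_bytes(self):
        """Physical capacity consumed by unique segments."""
        return sum(len(segment) for segment, _ in self.index.values())


store = DedupeStore(segment_size=4)
recipe = store.write(b"AAAABBBBAAAA")   # third segment repeats the first
print(store.stored_bytes())             # 8 bytes stored for 12 bytes written
```

Every write performs one lookup per segment, which is why the speed of that lookup determines whether deduplication is visible to the application.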
Where Dedupe Can Break
Deduplication can impact storage system performance depending on how fast the above lookups can occur. Most vendors try to get around performance issues by storing as much of their hash table as possible in DRAM. An in-memory search will generate almost instant results that will not noticeably impact the application or user experience, yet will provide the efficiencies of deduplication. However, if the hash table in DRAM fills and spills over to HDD then latency impact is immediately seen and the performance degrades.
There are two key factors in making sure that deduplication performance does not degrade over time. The first is controlling the physical size of the hash table since the hash table will grow as more data is stored on the system. Vendors also need to use very efficient hash code algorithms in order to minimize the size of each entry in the hash table. The smaller the table is the more of it can be stored in DRAM, since memory is limited in capacity.
The alternative would be to make very large pools of DRAM available to the storage system to use for hash table management. Of course, the challenge is that this would add cost quickly, and most systems can only support a relatively small amount of RAM. Eventually the size of the deduplication index would be larger than the available RAM in the system.
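A rough calculation shows how quickly the index can outgrow memory. The figures below are assumptions chosen only for illustration (4 KiB segments, 32 bytes per hash table entry, every segment unique); real engines use different segment sizes and far more compact entries, which is exactly the point of the first factor above.

```python
TIB = 2 ** 40
GIB = 2 ** 30


def index_size_bytes(capacity_bytes, segment_size_bytes, entry_size_bytes):
    """Worst-case index size: one entry per stored segment, all unique."""
    segments = capacity_bytes // segment_size_bytes
    return segments * entry_size_bytes


# Assumed figures: 50 TiB of unique data, 4 KiB segments, 32-byte entries.
size = index_size_bytes(50 * TIB, 4 * 1024, 32)
print(size / GIB)   # 400.0 GiB of index

# Halving the entry size halves the index, so compact entries matter.
print(index_size_bytes(50 * TIB, 4 * 1024, 16) / GIB)   # 200.0
```

Under these assumptions the index for a 50 TiB store is hundreds of gigabytes, well beyond the DRAM of a typical storage controller, so some portion of it has to live on flash or disk.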
This leads to the second factor in deduplication performance, the efficiency of the lookup engine. The key here is for the deduplication engine to be intelligent about which sections of the index are put into memory. Essentially, it needs to use RAM as a cache for its index.
In many cases, vendors without properly vetted deduplication engines will move sections of the deduplication index or metadata into RAM based on a numerical sorting of the hash table. However, a segment's position in that sort order has no correlation with the data that will be accessed next, which makes the chances of a cache miss relatively high. As a result, the RAM in these cases needs to be very large or, more likely, the capacity supported per deduplication volume needs to be relatively small, which is what some primary vendors have suggested to their clients.
This limitation is the single biggest reason that the market is seeing so many deduplication solutions that are capacity limited. If you see storage systems that offer deduplication but only on a volume basis, and those volumes are limited in capacity, it’s a good indication that the deduplication engine cannot scale.
Users should also beware of systems that imply “unlimited” deduplication capabilities as it relates to capacity; this may be an even greater cause for concern. It may indicate that the vendor simply hasn’t stressed the deduplication code enough and is itself unaware that its deduplication won’t keep up with flash and HDD performance requirements for primary storage, will not scale to meet today’s data store sizes and may break in the future. Clearly some vendors may have properly vetted their technologies; the challenge for the user is knowing what the vendor means by “unlimited”.
Avoiding the Dedupe Break Down
Keeping the deduplication index small and efficient is hard work and is best delivered by vendors with years of experience developing dedupe solutions for data centers with very large capacity requirements and, more recently, for high performance I/O in flash deployments. Vendors like Permabit have leveraged their historical understanding of the impact of growing capacity on deduplication performance to fine tune their solutions and meet the needs of the modern data center. They have also proven they have the I/O performance to deliver dedupe to flash environments at the speed those flash vendors require.
Historically, experienced deduplication vendors have used heuristic algorithms to decide which sections of the deduplication index should be in RAM and which should be stored on flash or hard disk. Instead of loading a sorted set of the index into cache, these systems use location-based awareness of which data will be accessed next. Understanding how these data sets correlate takes years of real-world experience, but leads to a highly efficient use of storage RAM and, more importantly, performance-neutral deduplication.
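The difference between the two caching approaches can be shown with a small simulation. This is a generic sketch of locality-based grouping, not Permabit's or any other vendor's algorithm: the IndexCache class, the group size, the LRU eviction policy and the workload (re-reading 10,000 segments in the order they were first written) are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict


class IndexCache:
    """RAM cache over an on-disk index that is loaded one group at a time."""

    def __init__(self, fingerprints, group_size, max_groups, by_locality):
        # On-disk layout: write order (locality) or numeric hash order (sorted).
        ordered = list(fingerprints) if by_locality else sorted(fingerprints)
        groups = [ordered[i:i + group_size]
                  for i in range(0, len(ordered), group_size)]
        # Which on-disk group holds each fingerprint.
        self.group_of = {fp: gid for gid, grp in enumerate(groups) for fp in grp}
        self.cached = OrderedDict()   # group ids currently in RAM, LRU order
        self.max_groups = max_groups
        self.disk_reads = 0

    def lookup(self, fingerprint):
        """Look up a known fingerprint, loading its group on a cache miss."""
        gid = self.group_of[fingerprint]
        if gid in self.cached:
            self.cached.move_to_end(gid)      # hit: refresh LRU position
            return
        self.disk_reads += 1                  # miss: fetch the group from disk
        self.cached[gid] = None
        if len(self.cached) > self.max_groups:
            self.cached.popitem(last=False)   # evict least recently used group


# Assumed workload: 10,000 segments re-read in their original write order.
stream = [hashlib.sha256(str(i).encode()).digest() for i in range(10_000)]

locality = IndexCache(stream, group_size=100, max_groups=10, by_locality=True)
by_hash = IndexCache(stream, group_size=100, max_groups=10, by_locality=False)
for fp in stream:
    locality.lookup(fp)
    by_hash.lookup(fp)

print(locality.disk_reads)   # 100: each group is fetched exactly once
print(by_hash.disk_reads)    # far higher: neighbours in hash order are unrelated
```

With the same amount of RAM, the layout that groups fingerprints by where and when they were written turns one disk read into many subsequent cache hits, while the hash-sorted layout pays for a disk read on most lookups.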
Summary
The challenge of adequately testing the impact of deduplication is not limited to start-ups. Large storage companies face the same problems, and their customers tend to have more production data at risk if something goes wrong.
Rather than rushing untested code to market, suppliers should look for deduplication solutions that have been well vetted through years of real-world implementations. Companies like Permabit are delivering deduplication code as an API set that can be integrated into the storage software of almost any storage vendor, and Permabit recently announced a ready-to-run modular version specifically for Linux-based offerings.
Users, for their part, should be on the lookout for deduplication solutions that have run successfully for years in the market. They should ask how deduplication performance will change when the storage system is at 90% of capacity, or when it reaches double-digit TBs of capacity (~50+TB). In our upcoming webinar, Storage Switzerland and Permabit will discuss the “Five Key Questions To Ask Your Storage Vendor About Deduplication” so that you can protect yourself against storage solutions that may not work as advertised.
For a printable copy of this article please email info@storage-switzerland.com
Permabit is a client of Storage Switzerland
Monday, April 23, 2012
George Crump, Senior Analyst