Every feature that is added to a storage system impacts performance and consumes storage system resources. It’s the job of the suppliers to make sure that the user does not notice that resource consumption. Putting deduplication in a high performance primary storage system doesn’t mean that there will be no change in storage I/O. What it does mean, if implemented correctly, is that this change won’t affect application performance, nor will the use of the technology be noticeable by the users of the optimized storage.


Most features that are considered "required" today, like thin provisioning or snapshots, have some impact on performance. There’s overhead associated with tracking the unique data read and write characteristics, for example. Deduplication, when implemented on primary storage, will be no different. A lesson can be learned from snapshots and thin provisioning, which, as little as five years ago, were considered "risky" technologies that had too great a performance impact to use on a large scale. Developers, through the use of improved hardware and software, addressed most of these shortcomings to the point that the overhead associated with their use is seldom mentioned as an issue anymore. Primary storage deduplication will follow much the same path, as the software continues to improve and is further integrated into the core storage system architecture. All these features will benefit from the automatic performance increases that come from faster processors being used in the storage controllers.


There are three requirements for delivering high performance primary storage deduplication in such a way that it doesn’t impact the user experience: integration, index lookup speed and flexible deployment modes.



Integration via an API


It will be difficult to deliver high performance primary storage deduplication in a standalone appliance form factor. While the standalone method has the advantage of deduplicating existing storage resources without having to load a new version of the storage software, it is unlikely that acceptable performance levels can be maintained. It will be difficult to design these systems so that their performance impact goes unnoticed on real time data sets such as server and desktop virtual machine images. Most likely, these external deduplication appliances will be used as an archive alternative operating on near active data. Instead, what will be needed is an embedded integration within the storage controller software via a deduplication API set.


Without integration into the core software of the storage system, the appliance approach will most likely need to store the data in a specialized format that only it can read. It’s true that reader software could be widely distributed to minimize a data lock-in issue, but the greater concern is how this external read will impact performance. If the data is stored in a custom format, the data has to be converted back to the original format before it can be read. While this may be acceptable for the occasional file access of data that should have been archived, using this method to access real time information is likely not going to be done without a measurable performance impact.


If the deduplication capability is integrated into the storage system software, then the storage system stores the deduplicated data just as it stores any other data. It can leverage the extent management algorithms that are already in place, which today already ensure that data is read from the correct snapshot and manage the auto-allocation of thin provisioned volumes. It can use all the capabilities that the storage system vendor designed into the system to make sure that thin provisioning and snapshots would work with primary storage deduplication. In an upcoming article, Storage Switzerland will discuss this ‘re-hydrationless deduplication’ model in more detail. However, its implementation is one of the most important aspects of non-impact primary storage deduplication.



Index Lookup Speed


The science behind dissecting a piece of data into sub-chunks and calculating a unique ID for each chunk is relatively common across deduplication products. Where deduplication systems begin to differentiate themselves is in how they compare that unique ID to previously generated IDs stored in some sort of table. The size of this table and the speed at which duplicate information can be found in it are critical determinants of primary storage deduplication performance.
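
The chunk, fingerprint and lookup cycle described above can be sketched in a few lines. This is a minimal illustration only, not any vendor's implementation: it assumes fixed-size 4 KB chunks, SHA-256 as the unique ID, and an in-memory dictionary as the lookup table. The names deduplicate and rebuild are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes


def deduplicate(data, index, chunk_size=CHUNK_SIZE):
    """Split data into chunks and store each unique chunk once in index.

    Returns the ordered list of fingerprints needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()  # unique ID for the chunk
        if fingerprint not in index:                  # the index lookup
            index[fingerprint] = chunk                # new chunk: store it
        recipe.append(fingerprint)
    return recipe


def rebuild(recipe, index):
    """Reassemble the original data from its fingerprints."""
    return b"".join(index[fingerprint] for fingerprint in recipe)


index = {}
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
recipe = deduplicate(data, index)
print(len(recipe), "chunks written,", len(index), "unique chunks stored")
```

Every chunk written costs one lookup, which is why the size and speed of the table matter so much.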


These lookup tables have to be stored somewhere and table size, for the most part, dictates their location. One of the factors in how fast the lookup can be performed is the speed of the device on which the table is stored. The faster the device, the more expensive and precious the resource is. RAM is the ideal place to store a lookup table, but RAM within storage systems is scarcer than on a traditional server, and in both cases, it’s the most expensive form of storage. A well designed lookup table needs to store the required information in a very small space so that most of it can be stored in storage system RAM. Even if the lookup table does need to be stored on disk within the array, a small size will still be beneficial by allowing more of the lookup table to fit into the storage system cache.
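
A rough back-of-the-envelope calculation shows why entry size matters. The figures below are illustrative assumptions, not measurements from any product: 10 TiB of unique data, 4 KiB chunks, and either a 40-byte entry (a 32-byte fingerprint plus an 8-byte location) or a more compact 16-byte entry.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def index_size_bytes(unique_capacity_bytes, chunk_size_bytes, entry_bytes):
    """Estimate lookup table size: one entry per unique chunk."""
    # Ceiling division, so a partial trailing chunk still gets an entry.
    entries = -(-unique_capacity_bytes // chunk_size_bytes)
    return entries * entry_bytes


# Assumed example: 10 TiB of unique data in 4 KiB chunks.
full = index_size_bytes(10 * TIB, 4096, 40)
compact = index_size_bytes(10 * TIB, 4096, 16)
print(full // GIB, "GiB with 40-byte entries")
print(compact // GIB, "GiB with 16-byte entries")
```

Under these assumptions the table shrinks from 100 GiB to 40 GiB, which is the difference between a table that largely lives on disk and one that has a realistic chance of fitting in controller RAM or cache.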


Beyond storage speed, the speed at which these individual lookups or comparisons can be made is also critical, and comparing deduplication indices on this point is essentially no different from comparing application databases. Clearly, different databases can be written to perform certain functions faster than others and the same is true with deduplication indices. Beyond lookup performance, the speed of housekeeping is more important with deduplication on primary storage than with backup applications. Backup needs high insertion speed, but massive random erasures are not the norm. Primary storage, by comparison, will typically have less data ingested daily but more file deletions. Primary storage may also have less of a window in which to perform that maintenance.
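
One common way to handle deletions in a deduplicated store is reference counting, with a separate housekeeping pass that reclaims chunks nothing points to anymore. The sketch below assumes that approach for illustration; the article does not say which technique any given product uses, and the class and method names are hypothetical.

```python
class RefCountedIndex:
    """Toy index that tracks how many files reference each chunk."""

    def __init__(self):
        self.refs = {}  # fingerprint -> reference count

    def add(self, fingerprint):
        """Record a reference; return True if the chunk must be written."""
        is_new = fingerprint not in self.refs
        self.refs[fingerprint] = self.refs.get(fingerprint, 0) + 1
        return is_new

    def release(self, fingerprint):
        """Drop one reference, for example when a file is deleted."""
        self.refs[fingerprint] -= 1

    def housekeeping(self):
        """Remove unreferenced entries and return their fingerprints."""
        dead = [fp for fp, count in self.refs.items() if count == 0]
        for fp in dead:
            del self.refs[fp]
        return dead


idx = RefCountedIndex()
idx.add("chunk-a")   # first file
idx.add("chunk-a")   # second file shares the same chunk
idx.add("chunk-b")
idx.release("chunk-b")        # the only file using chunk-b is deleted
reclaimed = idx.housekeeping()
print(reclaimed)
```

The release call is cheap, but the housekeeping scan touches the whole table, which is why the maintenance window matters on primary storage.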



Flexible Deployment Modes


Finally, as part of the integration process mentioned above, the deduplication solution has to allow the storage supplier to select different deduplication modes based on the type of system, its available resources and the type of data that it’s storing. The storage system should have control over when the deduplication software shifts modes and, for example, doesn’t deduplicate data as it’s received but deduplicates either in parallel or as a post-process. In very performance-sensitive environments, this allows lookups to be done with a guarantee of not impacting performance of the storage system.
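
The mode switch described above might look something like the following sketch. It models only two of the modes, inline and post-process, and assumes the storage system flips a simple flag. The Mode and DedupStore names are hypothetical and do not represent a real API.

```python
import hashlib
from collections import deque
from enum import Enum


class Mode(Enum):
    INLINE = "inline"              # deduplicate as data is received
    POST_PROCESS = "post_process"  # land data first, deduplicate later


class DedupStore:
    def __init__(self):
        self.mode = Mode.INLINE
        self.chunks = {}        # fingerprint -> deduplicated chunk
        self.pending = deque()  # raw chunks waiting for post-processing

    def write(self, chunk):
        if self.mode is Mode.INLINE:
            self._dedup(chunk)
        else:
            # No lookup on the write path, so no lookup latency is added.
            self.pending.append(chunk)

    def _dedup(self, chunk):
        self.chunks.setdefault(hashlib.sha256(chunk).digest(), chunk)

    def run_post_process(self):
        """Deduplicate queued chunks, for example during an idle period."""
        while self.pending:
            self._dedup(self.pending.popleft())


store = DedupStore()
store.write(b"x")                 # inline: deduplicated immediately
store.mode = Mode.POST_PROCESS    # storage system decides to shift modes
store.write(b"x")
store.write(b"y")
queued = len(store.pending)
store.run_post_process()
print(queued, "chunks queued,", len(store.chunks), "unique chunks stored")
```

The trade-off is that post-process mode temporarily consumes capacity for data that has not yet been deduplicated.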


In addition to switching deduplication modes, it’s also vital that the storage system be in control of the level of granularity of the data segmentation employed. The finer the granularity, the greater the chance of finding redundant information, but the more lookups that need to be performed. In systems with a lower likelihood of file similarity, it may be advisable to use a coarser granularity, meaning larger segments, so that fewer lookups are spent searching for redundancy that is unlikely to be found.
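
The granularity trade-off can be made concrete with a small experiment. This sketch assumes fixed-size segments and SHA-256 fingerprints. The function name granularity_stats and the sample data are made up for illustration.

```python
import hashlib


def granularity_stats(data, segment_size):
    """Return (lookups performed, bytes stored) for a given segment size."""
    seen = set()
    lookups = 0
    stored = 0
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        lookups += 1  # one index lookup per segment
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            stored += len(segment)
    return lookups, stored


# Two 8 KiB regions that share their first 4 KiB.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"C" * 4096
coarse = granularity_stats(data, 8192)
fine = granularity_stats(data, 4096)
print("8 KiB segments:", coarse)
print("4 KiB segments:", fine)
```

With this sample data, halving the segment size doubles the number of lookups and reduces the bytes stored from 16,384 to 12,288. On data with little similarity, the extra lookups would buy no savings at all.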


The challenge for suppliers is that it is going to take significant additional effort on the part of the storage manufacturers to develop that technology in-house, as part of the core storage system software. As Storage Switzerland stated in a prior article, that level of effort needs to be balanced against other priorities that the storage suppliers may have, like protocol unification, automated tiering and power management technologies. For that reason, companies like Permabit are now offering deduplication as an API to these storage suppliers. This allows them to quickly integrate deduplication without having to stop development efforts elsewhere. We expect to begin seeing these solutions available by the end of 2010.


High performance primary storage deduplication is now a reality. It has to be implemented into the existing storage system software and has to manage the indexing issue. If vendors can develop this capability quickly or implement a complete API, then there is a tremendous opportunity for them not only to solve the short term capacity problem that their customers are facing but also to deliver the long term value of a single stack of storage for primary, archive and backup data sets.

George Crump, Senior Analyst

Permabit Technology is a client of Storage Switzerland