Lab Report Overview - The Deduplication of Primary Storage
Please note: The primary focus of this report is the deduplication of near-active data on primary storage, as distinguished from very active primary storage data such as databases or VMware images, and from deduplication for backup appliances. To that end, very active data sets were not included, since neither Ocarina nor NetApp positions its product for that use.
Testing Methodology
The first challenge with any deduplication test is to ensure systems are tested in alignment with how they were designed to work. This is one key reason for our focus on the NetApp and Ocarina Networks devices. They both work on near-active data sets and are specifically not designed for use with backup applications. Thus, we did not measure overall throughput or any other performance statistics, because with these systems I/O throughput is not the key challenge, unlike products that deduplicate during the backup process. Performance for users and applications running against deduplicated data will be the focus of a future report.
The second challenge is to ensure that the data set is representative of what the IT decision maker is likely to have in his or her environment. Due to the multiplicity of storage needs across industries, no single data set would accurately represent all environments. To address this shortcoming, the information is provided in two ways. First, we group the data sets by industry segment: Internet Media, Oil and Gas, Media and Entertainment, and Life Sciences. Second, there is an analysis of common file type formats: Microsoft® Office, PDFs, text and other files found in a typical corporate home share.
The final challenge, especially with deduplication, is to make sure the data sets are valid. Dedupe for primary data sets works against active online and nearline data, where the occurrence of duplicate data is unlikely. This is in contrast to backup data, where the same full backup runs every week and the chances of duplication are fairly high. We chose not to copy the same data set repeatedly as it would produce an unnatural amount of duplicates – and could easily be deduplicated down to almost nothing. We therefore selected a mix of data that contained few redundancies.
It is important when evaluating this technology that we match a potential real world data set to the data sets that were tested, which led to the decision to break the data set down in these different ways. That said, there is no one perfect data set, and no two storage environments are exactly the same.
Differences in Approach
While Ocarina and NetApp participate in the same market, it is important to note that they approach the problem differently. NetApp leverages its ownership of the file system by deduplicating in 4k block increments, which is the block size that its WAFL file system uses. It also does no compression. Ocarina can selectively use deduplication, compression or both. To ensure a fair comparison, we ran two sets of tests on the Ocarina system: one utilized Ocarina’s object dedupe only (no compression) and the other utilized both its compression and deduplication features.
Ocarina allows for user control over block size and compression types and so we ran a few tests making use of those capabilities. For example, deduplication could be adjusted based on block size.
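To make the idea of fixed-block deduplication concrete, the following is a minimal sketch in Python. It is not NetApp's or Ocarina's implementation; the function names, the use of SHA-256 as the block fingerprint, and the in-memory dictionary store are all assumptions made for illustration. Only the 4k block size comes from the description above.

```python
import hashlib

BLOCK_SIZE = 4096  # 4k, the WAFL block size cited in this report


def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Hypothetical helper: SHA-256 fingerprints and a dict store are
    illustrative choices, not either vendor's actual design.
    """
    store = {}   # fingerprint -> block contents, stored once
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return store, recipe


def rehydrate(store, recipe) -> bytes:
    """Rebuild the original data from the unique blocks and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


def stored_bytes(store) -> int:
    """Total bytes actually kept after deduplication."""
    return sum(len(block) for block in store.values())
```

Calling dedupe_blocks with a different block_size on the same data can change how many duplicates are found, which is one way to picture the adjustable block size mentioned above.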
It is important to note that the NetApp solution is "in place" deduplication only. It cannot migrate data to another storage tier as part of the deduplication process. This is in contrast to Ocarina, which has the ability to migrate data from primary storage, dedupe and compress that data, and then store it on another storage tier. Also important is that from an Ocarina perspective this can be from any type of storage to any other type.
While NetApp has added the functionality to dedupe a small number of manufacturers’ storage via its V-Series technology, it cannot migrate and dedupe data. Moving data from primary storage to secondary storage would require that the data be rehydrated to its original size and written to the secondary disk tier at its full size, and then re-deduplicated after being stored on that tier. Ocarina, meanwhile, was shown to be able to deduplicate that data during the migration process, eliminating the need for the rehydration and re-deduplication steps.
NetApp has product offerings that help in data movement, but they do not offer the same benefits as Ocarina’s one-step deduplication with migration. NetApp Volume SnapMirror®, which is really a DR product, operates at the block level and moves the data over in its deduped form; however, it cannot be used as a repository for nearline data since it must always mirror the source. SnapVault®, which is a NetApp archiving/backup product, must first rehydrate the data before transferring it over the network, leading to the need for the data to be re-deduplicated on the destination.
While specific performance measurements were not made, it is safe to state that both deduplication products take time to analyze and optimize the storage at which they are targeted. Even with modest data sets, taking overnight to process the assigned file system is not uncommon. Thus, an additional cycle, such as the rehydration and re-deduplication described above, adds a notable time penalty.
Finally, Ocarina’s dedupe and compression can be applied by rules and policies. You can choose to only process files that meet certain criteria – those over a certain size, those that have not been modified for five days, etc. Additionally, Ocarina’s solution gives you the ability to create rules that ensure that the hottest files do not get deduped or compressed, and those files that are less active do get processed. Since NetApp does not have these capabilities, for these tests, we processed all the files in a given volume, so that the results would be comparable.
This further implies that a storage administrator in a NetApp environment must take greater care in what data is on what volume, making sure that very active data is not mixed with near-active data. The result is a potential increase in overhead associated with managing a NetApp environment with deduplication activated.
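The rule-based selection described above for Ocarina (process only files over a certain size, or not modified for five days) can be sketched as follows. This is a hypothetical illustration in Python, not Ocarina's policy engine or syntax; the Policy class, its field names, and the default thresholds are assumptions.

```python
import os
import time
from dataclasses import dataclass

SECONDS_PER_DAY = 86400


@dataclass
class Policy:
    """Hypothetical policy: thresholds are illustrative, not vendor defaults."""
    min_size_bytes: int = 1024 * 1024  # skip small files
    min_idle_days: float = 5.0         # skip recently modified ("hot") files


def is_eligible(size_bytes, mtime, now, policy):
    """Return True if a file is large enough and has been idle long enough."""
    idle_days = (now - mtime) / SECONDS_PER_DAY
    return size_bytes >= policy.min_size_bytes and idle_days >= policy.min_idle_days


def select_files(root, policy, now=None):
    """Walk a directory tree and return the paths the policy would process."""
    now = time.time() if now is None else now
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if is_eligible(info.st_size, info.st_mtime, now, policy):
                selected.append(path)
    return sorted(selected)
```

Without a filter of this kind, every file on a volume is processed, which is why the tests in this report ran against all files in a given volume on both systems.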
Test Summary
As a general rule, in the deduplication-only tests Ocarina fared 0-450% better than NetApp. When compression was enabled along with deduplication, something that NetApp cannot do, Ocarina fared anywhere from 181-2573% better than the NetApp results.
It is important to note that the only data sets that had known duplicate data in them were the home shares data set and the archive data set. The others had no files that were duplicated at the file level.
The Ocarina advantage was its content-aware optimizers, which can act on specific file types. Another notable factor is that with Ocarina, optimized data can be stored on an alternate vendor’s storage solution. This allows for greater cost reduction and better efficiency. It also opens the deduplication field to non-NetApp solutions such as Isilon, BlueArc, and EMC.
Overall Summary
The goal of optimizing primary storage is to reduce the cost of that storage by eliminating or reducing the need for future storage purchases. The combination of deduplication and compression provides Ocarina with an advantage in achieving this goal. In fact, a case can be made that any primary storage optimization process MUST have compression.
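The case for compression can be illustrated with a small synthetic example: data that contains no duplicate blocks gains nothing from block deduplication, yet may still shrink considerably when the unique blocks are compressed. The sketch below is an assumption-laden illustration in Python using the standard zlib module; it does not reproduce either vendor's algorithms, and the sample data is synthetic, not one of the tested data sets.

```python
import hashlib
import zlib


def unique_blocks(data: bytes, block_size: int = 4096):
    """Return the unique fixed-size blocks in data (illustrative helper)."""
    unique = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique.setdefault(hashlib.sha256(block).digest(), block)
    return list(unique.values())


def dedupe_only_size(data: bytes, block_size: int = 4096) -> int:
    """Bytes stored when only duplicate blocks are eliminated."""
    return sum(len(block) for block in unique_blocks(data, block_size))


def dedupe_then_compress_size(data: bytes, block_size: int = 4096) -> int:
    """Bytes stored when each unique block is also compressed with zlib.

    Compressing block by block is an assumption made to keep the sketch short.
    """
    return sum(len(zlib.compress(block)) for block in unique_blocks(data, block_size))


# Synthetic text in which every line differs, so no 4k block repeats.
sample = "".join(f"record {i}: status=ok\n" for i in range(20000)).encode()

print("original bytes:       ", len(sample))
print("dedupe only:          ", dedupe_only_size(sample))
print("dedupe + compression: ", dedupe_then_compress_size(sample))
```

On this sample, deduplication alone stores every byte, while the compressed total is smaller than the original.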
All storage optimization techniques attempt to deliver greater storage capacity in the same amount of physical space. The challenge is that this additional data still needs to be managed by IT administration. Beyond the optimization gains, Ocarina's ability to increase IT administrator efficiency is possibly of greater importance.
First, it provides a single platform for storage optimization. If data center personnel are not careful about this aspect, they could wind up managing five or six different deduplication interfaces. This would, of course, negatively impact staff efficiency. By providing a single storage-agnostic interface, the storage optimization management process remains in check.
Second, it can perform deduplication on any type of storage and can MOVE that data to any type of secondary storage. Optimizing multiple tiers from a single traditional primary storage supplier may not always be the most cost-effective means of reducing cost. Leveraging a second tier provider not only further drives out costs, but many of these second tier providers focus on the archive storage capabilities that are required by this back-end, such as scale, redundancy and compliance.
Finally, and most important, managing the move to secondary storage while at the same time optimizing that data makes Ocarina one of the few storage optimization tools that also increases staff efficiency.
Ocarina Networks’ solution can provide the data management capabilities to scan file systems based on, for example, access dates. As described above, it can then move that data to a secondary tier, freeing up primary storage completely and moving that older data out of the day-to-day management path of the storage administrators.
Interestingly, Ocarina Networks could become the great leveler. NAS vendors looking to provide deduplication to their clients no longer need to invest precious development cycles creating their own deduplication methodology. Customers can in turn leverage the technology across multiple platforms, including NetApp.
Monday, May 11, 2009
Storage Switzerland was recently commissioned to perform a head-to-head deduplication test pitting Ocarina Networks against NetApp, two of the leaders in primary storage deduplication. The results were striking: on all data sets tested, Ocarina had better data reduction results. When testing the Ocarina deduplication solution alone, Ocarina was somewhat ahead of NetApp, with anywhere from 0-450% better data reduction. When comparing NetApp with the complete Ocarina solution, which combines deduplication and compression, Ocarina was ahead, with 181-2573% better results.
As this report will outline and as we have stated in other articles, deduplication by itself has limited value on primary storage, and really must be combined with compression for maximum primary storage efficiency.
In addition, the Ocarina solution was found to be a more flexible option, allowing for multiple policy settings, along with the remarkable ability to easily migrate optimized data to another tier of storage from any vendor. This flexibility translates to lower costs in storage as well as administrator time. In contrast, NetApp was more rigid in its deployment, with little or no migration capability and no method for adjusting jobs based on customer preferences or needs.
It is our view, based on these results, that Ocarina should be a top consideration for enterprises that are managing large amounts of primary storage with near-active data, such as email, documents, graphics, and other files.
Primary Storage Dedupe - Backstory