What is Primary Storage Deduplication?


Deduplication is essentially a process that analyzes data, typically at a sub-file level, and looks for segments that match other segments within the same file or within other files. If a match is found, only one copy of that data segment is actually stored and references to it are created as needed. Depending on the data type, the result can be a significant reduction (as much as 20X) in the amount of disk capacity consumed by deduplicated volumes. For this reason deduplication established its first foothold as a technology to better enable disk backup, a process with massive amounts of redundant data. And while data integrity was important, the impact of data loss was lower since backups are a second copy of data. However, as Storage Switzerland discussed in our recent articles “High Performance Primary Storage Deduplication” and “Making Primary Storage Deduplication Safe”, when applied to primary storage, deduplication has a greater responsibility to maintain data integrity and performance. In other words, capacity savings cannot come at the expense of application safety or performance.
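
For illustration only, the short sketch below (written in Python, with an arbitrary 4KB segment size and SHA-256 fingerprints that do not reflect any particular product) shows the basic mechanism: segments are fingerprinted, a segment seen before is not stored again, and the data is reconstructed from references to the single stored copies.

    # Minimal deduplication sketch: fixed-size segments are fingerprinted and
    # only previously unseen segments are stored; repeats become references.
    # Segment size and hash choice are illustrative assumptions.
    import hashlib

    CHUNK_SIZE = 4096  # hypothetical segment size

    def dedupe(data: bytes):
        store = {}       # fingerprint -> the one stored copy of that segment
        references = []  # ordered fingerprints that reconstruct the original
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in store:
                store[fingerprint] = chunk    # first occurrence: store it
            references.append(fingerprint)    # every occurrence: reference it
        return store, references

    def rehydrate(store, references) -> bytes:
        return b"".join(store[f] for f in references)

    data = b"A" * 16384 + b"B" * 4096         # highly redundant sample
    store, refs = dedupe(data)
    assert rehydrate(store, refs) == data
    print(f"{len(refs)} segments referenced, {len(store)} segments stored")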



What is Primary Storage Compression?


Compression, similar to deduplication, is a capacity-optimization technology that also looks for redundancy in a data set. But instead of looking for redundant information across files, it looks within the file, and then re-encodes that information, essentially creating a new file in an ‘optimized state’.
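
As a rough illustration, the Python sketch below uses the general-purpose zlib library purely as a stand-in for whatever algorithm a real storage system employs; it shows a redundant buffer being re-encoded into a smaller form and then restored byte-for-byte.

    # Compression re-encodes redundancy found within a single buffer/file.
    # zlib here is only a stand-in for a storage system's actual algorithm.
    import zlib

    original = b"the quick brown fox jumps over the lazy dog " * 500
    compressed = zlib.compress(original, 6)

    print(f"original:   {len(original)} bytes")
    print(f"compressed: {len(compressed)} bytes")

    # Decompression restores the exact original bytes for the application.
    assert zlib.decompress(compressed) == original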


When it comes to primary storage compression, data must pass through the compression/decompression engine in order to be provided back to the requesting application or user in a meaningful format. However, as we discuss in “What is Real-Time Data Compression”, since the storage infrastructure has roughly half as much data to move once it is compressed, the performance impact is often unnoticeable.
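
The sketch below is a conceptual picture of that data path, not any vendor's interface: a hypothetical CompressedStore compresses blocks as they are written and decompresses them as they are read, so the requesting application only ever sees the original data.

    # Inline compression in the I/O path (conceptual): compress on write,
    # decompress on read, transparent to the caller. The class and its
    # methods are invented for illustration.
    import zlib

    class CompressedStore:
        def __init__(self):
            self._blocks = {}  # block id -> compressed bytes

        def write(self, block_id: int, data: bytes) -> None:
            self._blocks[block_id] = zlib.compress(data)    # shrink on the way in

        def read(self, block_id: int) -> bytes:
            return zlib.decompress(self._blocks[block_id])  # expand on the way out

    store = CompressedStore()
    payload = b"application data " * 1000
    store.write(0, payload)
    assert store.read(0) == payload   # the application sees unchanged data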



Which is Better - Compression or Deduplication?


When deciding which optimization technology to use, or at least which one to use first, the answer depends on the data being put on the storage system. Deduplication requires the existence of redundant data; if it is there, deduplication will almost always provide greater optimization. Also, if the data is pre-compressed, running primary storage compression will not be as effective.
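
One informal way to make that call is to sample the data set and measure both its duplicate-segment ratio and its compressibility. The Python sketch below does this with arbitrary assumptions (4KB segments, SHA-256 fingerprints, zlib as the compressor); the helper names are invented for illustration.

    # Gauge which optimization a data sample favors: how many fixed-size
    # segments repeat (deduplication potential) versus how well the sample
    # compresses (compression potential). All parameters are illustrative.
    import hashlib
    import os
    import zlib

    CHUNK_SIZE = 4096

    def dedupe_ratio(sample: bytes) -> float:
        chunks = [sample[i:i + CHUNK_SIZE] for i in range(0, len(sample), CHUNK_SIZE)]
        unique = {hashlib.sha256(c).hexdigest() for c in chunks}
        return len(chunks) / max(len(unique), 1)

    def compression_ratio(sample: bytes) -> float:
        return len(sample) / max(len(zlib.compress(sample)), 1)

    def suggest(sample: bytes) -> str:
        return (f"dedupe ~{dedupe_ratio(sample):.1f}:1, "
                f"compression ~{compression_ratio(sample):.1f}:1 on this sample")

    # Unique, incompressible data (behaves like pre-compressed content).
    print(suggest(os.urandom(16 * CHUNK_SIZE)))
    # The same 4KB block repeated: redundant and compressible.
    block = (b"copy of the same block " * 200)[:CHUNK_SIZE]
    print(suggest(block * 16))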


Data that will be better served by deduplication includes:

  1. VMware VMDK files - because of the redundant data from the multiple instances of operating systems and applications

  2. Home directories with primarily office productivity files (spreadsheets, presentations, documents) - since many of these applications compress the data when it’s being saved, and redundancy often exists between files, like the repeated use of the company logo

  3. Exchange environments - where many of the attachments are office productivity files and there are multiple copies of the same or similar files, like multiple revisions of a presentation


Data that will be better served by compression includes:

  1. Net new data where there is no redundancy - audio and video, as well as industry-specific data like oil and gas seismic data

  2. Data where the current compression algorithm is not as effective as the primary storage compression algorithm - in many cases the application’s default compression provides minimal optimization so as not to sacrifice application performance (see the sketch after this list)
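
On that second point, the sketch below offers a generic illustration (using zlib compression levels purely as a stand-in for an application's own settings) of why a lightweight, performance-friendly setting typically leaves capacity savings on the table compared with a heavier one.

    # Speed-versus-savings trade-off: a light compression setting (level 1)
    # usually reduces data less than a heavier setting (level 9). zlib levels
    # stand in for whatever tunables a real application exposes.
    import zlib

    data = b"".join(f"record {i}: status=ok latency={i % 97} ms\n".encode()
                    for i in range(20000))

    for level in (1, 6, 9):
        out = zlib.compress(data, level)
        print(f"level {level}: {len(data)} -> {len(out)} bytes "
              f"({len(data) / len(out):.1f}:1)")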



Deduplication and Compression - Better Together


To combat storage growth, deduplication and compression should not be looked at as an either/or proposition. There was a time when it was assumed that deduplication and compression could not be used together. In reality this has never been the case; since the earliest days of deduplication, many systems could do both. Even when the capabilities are provided by separate suppliers, it has been proven that they are not incompatible technologies and can actually complement each other to provide maximum space optimization. Additionally, by having the data compressed, the deduplication engine is managing a compacted data set, which keeps its own processes more efficient.
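
The sketch below shows one conceptual arrangement of the combination: fixed-size segments are deduplicated first and only the unique segments are then compressed. Real systems differ in ordering and implementation details; this is only meant to show how the two sets of savings compound.

    # Conceptual dedupe-plus-compress pipeline: store each unique segment
    # once, compressed. Ordering and segment size are illustrative choices,
    # not a description of any particular product.
    import hashlib
    import zlib

    CHUNK_SIZE = 4096

    def optimize(data: bytes):
        unique = {}      # fingerprint -> compressed unique segment
        references = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in unique:
                unique[fingerprint] = zlib.compress(chunk)
            references.append(fingerprint)
        stored = sum(len(c) for c in unique.values())
        return references, stored

    block_a = (b"guest operating system files " * 150)[:CHUNK_SIZE]
    block_b = (b"office productivity documents " * 150)[:CHUNK_SIZE]
    data = (block_a + block_b) * 10           # redundant, compressible sample
    refs, stored = optimize(data)
    print(f"logical {len(data)} bytes -> roughly {stored} bytes stored "
          f"({len(data) / stored:.0f}:1 before metadata overhead)")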


For example, Permabit Technology tested various data sets with deduplication and compression as stand-alone optimization processes and then with the two combined. As the chart below shows, in their test cases deduplication provided a better level of optimization than compression did. However, when combined, the two technologies provided a significant further reduction in capacity requirements.

[Chart: Deduplication vs. Compression]

The two most impressive gains are for VMware VMDKs and Exchange data, where the data sets are both compressible and ‘deduplicatable’. In these cases the combination of the two technologies has a compound effect on reducing capacity requirements, and it can be applied to existing storage with little to no application performance impact.
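
As a purely hypothetical illustration of that compounding (the figures below are examples, not Permabit's measured results): if deduplication alone reduced a data set 10:1 and compression then shrank the remaining unique data another 3.5:1, the combined reduction would be the product of the two.

    # Hypothetical numbers only, to show how the two ratios multiply.
    dedupe_ratio = 10.0       # e.g., 10:1 from deduplication alone
    compression_ratio = 3.5   # e.g., 3.5:1 on the remaining unique data
    print(f"combined reduction: {dedupe_ratio * compression_ratio:.0f}:1")  # 35:1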



Summary


Storage capacity savings of 35 to 1 and 10 to 1 begin to rival the gains that drove the rapid adoption of capacity optimization in the backup market. Savings of this magnitude have been harder to achieve on primary storage because of the presence of pre-compressed data and the lower likelihood of redundant data. Combining the two technologies not only produces better net results for primary storage optimization, it yields the same or better results than those that initially turned heads in the backup use case that launched the deduplication market. It is also important to realize that one of the key benefits of primary storage optimization is the ‘ripple effect’ it creates: because less data travels through the environment, all downstream processes, like replication and backup, become more effective.


George Crump, Senior Analyst

Permabit Technology is a client of Storage Switzerland