As deduplication begins to move upstream toward primary storage, concern over its performance impact is leading some early deployment vendors to implement a post-process deduplication strategy. This strategy writes data to disk in real time, then performs the deduplication steps when the storage system is less busy, minimizing the chance of impacting application performance. Post-process deduplication may be too conservative an approach because, while it lowers the likelihood of added storage latency, it has no way to increase storage performance.

Inline deduplication, on the other hand, can actually increase storage performance. The inline process works before data is written to the hard disk system and, as a result, has the potential to improve storage performance. To be clear, there is overhead associated with inline primary storage deduplication. That overhead has to be mitigated, and then enough positive downstream storage performance gains must be seen to substantiate the claim of an overall performance increase.
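
As a rough illustration of the inline model (not any vendor's actual implementation), the write path can be sketched in a few lines of Python. The block size, hash choice, and data structures here are all hypothetical; the point is simply that the duplicate check happens before anything hits disk:

```python
import hashlib

class InlineDedupStore:
    """Minimal sketch of an inline deduplication write path: each
    block is hashed before it is written, and duplicates are replaced
    with a reference to the block that is already on disk."""

    def __init__(self):
        self.index = {}    # hash -> physical block id (the in-RAM lookup table)
        self.blocks = []   # simulated disk: only unique blocks are written

    def write(self, data: bytes) -> int:
        digest = hashlib.sha256(data).digest()
        if digest in self.index:            # duplicate: no disk write occurs
            return self.index[digest]
        self.blocks.append(data)            # unique: write to "disk"
        self.index[digest] = len(self.blocks) - 1
        return self.index[digest]

store = InlineDedupStore()
a = store.write(b"A" * 4096)
b = store.write(b"A" * 4096)   # deduplicated against the first write
c = store.write(b"B" * 4096)
print(len(store.blocks))       # 2 physical writes for 3 logical writes
```

Note that the duplicate path returns without touching the simulated disk at all, which is exactly where the downstream performance gains discussed below come from.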

Overhead Mitigation

Processing Power

The performance concern over deduplication can be addressed by leveraging the fact that processing power and RAM are continually becoming less expensive and more abundant. By properly embedding the deduplication technology into the storage system, the performance impact can be minimized to the point that it is insignificant in the overall performance scheme. On the processor side of the equation, many of the storage systems introduced since mid-2010 use Intel Westmere processors, which have plenty of excess processing power to handle the work of deduplication and minimize its processing impact.


RAM is also important because that’s where high performance deduplication products store the hash tables they use when performing their redundancy look-ups, making it essential to the performance of this critical step. As stated above, RAM is getting less expensive but still must be used efficiently. Products like Permabit’s Albireo API, which can be embedded into the storage controller, are extremely efficient in their use of RAM. For example, Permabit can store indices for 2.6PB of data in 64GB of RAM, yielding roughly 0.1 bytes of RAM per block indexed.
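
A quick back-of-the-envelope check shows how the quoted density works out. The article does not state a block size, so the 4KB figure below is an assumption:

```python
# Back-of-the-envelope check of the quoted index density.
# Assumption (not stated in the article): a 4 KB block size.
ram_bytes = 64 * 10**9            # 64 GB of RAM
data_bytes = 2.6 * 10**15         # 2.6 PB of indexed data
block_size = 4096                 # assumed block size

blocks_indexed = data_bytes / block_size
ram_per_block = ram_bytes / blocks_indexed
print(round(ram_per_block, 2))    # ~0.1 bytes of RAM per block
```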

Technology Integration

The final step is proper integration of the technology. Instead of running as an add-on process, deduplication that is tightly embedded into the storage system can leverage components of the storage software that are already in place, creating more efficient linkage back to the original data segments.

For example, the process that a storage system uses to track snapshots is very similar to how it could manage the references created by the deduplication system, yielding yet another gain in efficiency.
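
The shared mechanism in both cases is reference counting: a block stays on disk as long as any snapshot or dedupe pointer still refers to it. A hypothetical sketch of that bookkeeping, not modeled on any particular system:

```python
class RefCountedBlocks:
    """Sketch of snapshot-style reference counting reused for dedupe:
    a shared block is only reclaimed when its last reference drops."""

    def __init__(self):
        self.refs = {}  # block id -> reference count

    def add_ref(self, block_id):
        self.refs[block_id] = self.refs.get(block_id, 0) + 1

    def drop_ref(self, block_id):
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            del self.refs[block_id]    # safe to reclaim the block
            return True                # block freed
        return False                   # still shared elsewhere

rc = RefCountedBlocks()
rc.add_ref(7)                  # a snapshot references block 7
rc.add_ref(7)                  # a dedupe pointer references it too
print(rc.drop_ref(7))          # False: block is still shared
print(rc.drop_ref(7))          # True: last reference gone, reclaim
```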

Downstream Performance Gains

Fewer Writes

After resolving the issues of performance overhead, the next step is to examine the downstream efficiencies that inline deduplication can bring. The first, and potentially most obvious, is writing less data to disk. A data write is more resource intensive for a disk system than a data read, so every write that is eliminated improves performance, and the more writes eliminated, the better.

Less RAID Overhead

What is sometimes less obvious is that it’s not just the elimination of the initial write that improves performance. Eliminating the write also eliminates the RAID parity recalculation for that write and the actual writing of the associated parity data. This returns extra processing power to the storage system, and the secondary data protection writes don’t need to occur. Every extra step eliminated accumulates and adds to the overall efficiency.
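
The compounding effect is easy to quantify under standard RAID-5 behavior, where a small write typically costs four back-end I/Os (read old data, read old parity, write new data, write new parity). The 4:1 reduction ratio below is an assumed figure for illustration:

```python
# RAID-5 small-write penalty: each logical write typically costs four
# disk I/Os (read old data, read old parity, write new data, write
# new parity). Every write that dedupe eliminates therefore removes
# all four back-end operations, not just one.
RAID5_IO_PER_WRITE = 4

def backend_ios(logical_writes: int, dedupe_ratio: float) -> int:
    """Back-end I/Os issued after dedupe removes redundant writes.
    dedupe_ratio is e.g. 4.0 for an assumed 4:1 reduction."""
    unique_writes = int(logical_writes / dedupe_ratio)
    return unique_writes * RAID5_IO_PER_WRITE

print(backend_ios(1000, 1.0))  # 4000 I/Os with no dedupe
print(backend_ios(1000, 4.0))  # 1000 I/Os at a 4:1 reduction
```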

Better Cache Utilization

The next source of performance gain is better utilization of the cache area. Storage systems use cache memory to accelerate performance by placing a small amount of their most active data into high speed RAM. This memory, while also showing continual price reductions, is still expensive enough that it must be used efficiently. Deduplication improves that efficiency in proportion to its data reduction rate. For example, if the system is seeing a 4X increase in storage efficiency, a block of data in cache is 4X more likely to be accessed by different applications.
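
Put another way, deduplication multiplies the cache's effective reach: one cached physical block can stand in for every logical block that deduplicates to it. The 4:1 ratio used here is the article's example figure, not a measured result:

```python
def effective_cache_blocks(physical_blocks: int, dedupe_ratio: float) -> int:
    """With an N:1 reduction, each cached physical block can stand in
    for up to N logical blocks, multiplying the cache's effective reach."""
    return int(physical_blocks * dedupe_ratio)

# A cache holding 10,000 physical blocks behaves, at a 4:1 reduction,
# like a cache covering 40,000 logical blocks.
print(effective_cache_blocks(10_000, 4.0))
```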

Better Storage Interconnect Utilization

The final area of performance gain is in the storage system interconnect. Most storage systems essentially have their own internal storage network that connects the main storage system controllers (which contain the storage software, cache memory and storage processors) with the actual storage capacity. That capacity is usually found on dozens of storage shelves that either attach directly to the storage controller or do so in a daisy-chain fashion. Deduplication makes this interconnect more efficient because, as the data is reduced, less traffic has to travel through this network to reach its resting place on the actual storage devices.

An Out

While inline deduplication, when combined correctly with modern storage architectures, can improve performance in many environments, it’s also important to have the ability to throttle back the process if an unusual performance situation occurs. However, instead of giving up and falling all the way back to post-process deduplication, a newer capability called “parallel processing” should be an option.

In a parallel approach, data is written as it normally would be, without dedupe, and a copy of the data is sent to the deduplication process for analysis. As duplicates are found, an advice notification is sent to the storage system indicating that a duplicate exists and where the original resides. The storage system can then replace the duplicate data with a pointer to the original, saving the space. With this capability, dedupe has no impact on initial write performance and can be applied without concern for overall storage performance impact.
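
That flow can be sketched as follows. This is a simplified model of the parallel pattern described above, with hypothetical names and data structures, not the actual advice protocol of any product:

```python
import hashlib

class ParallelDedup:
    """Sketch of the 'parallel' model: writes land on disk immediately,
    while a copy is analyzed in the background; when a duplicate is
    found, an advice record tells the system where the original lives."""

    def __init__(self):
        self.disk = {}       # block id -> data, or ("ptr", original id)
        self.index = {}      # hash -> block id of the first copy seen
        self.pending = []    # copies queued for background analysis
        self.next_id = 0

    def write(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.disk[bid] = data             # full-speed write, no dedupe in path
        self.pending.append((bid, data))  # copy handed to the dedupe engine
        return bid

    def analyze(self):
        """Background pass: apply 'advice' and reclaim duplicate space."""
        for bid, data in self.pending:
            digest = hashlib.sha256(data).digest()
            if digest in self.index and self.index[digest] != bid:
                self.disk[bid] = ("ptr", self.index[digest])  # space reclaimed
            else:
                self.index[digest] = bid
        self.pending.clear()

s = ParallelDedup()
x = s.write(b"data" * 1024)
y = s.write(b"data" * 1024)   # lands on disk at full speed
s.analyze()                   # later: duplicate replaced with a pointer
print(s.disk[y])              # ('ptr', 0)
```

The key property is that `write()` never consults the index, so the initial write path is untouched; only the background `analyze()` pass pays the dedupe cost.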


Primary storage deduplication, when implemented correctly, should show a net gain in overall performance. The key is for the deduplication engine to be tightly embedded into the storage system and for the storage system to use the most current processors, taking advantage of the performance and multi-core capabilities that enable increased efficiency.

Once that is done, the downstream impact of inline deduplication provides the potential for a performance increase. This is now a reality, as companies like BlueArc and Xiotech, both known for making high performance storage systems, have selected Permabit’s Albireo API set to deliver highly optimized storage platforms without sacrificing performance. In fact, we believe they will see an overall performance improvement in many cases.

Permabit Technology is a client of Storage Switzerland

George Crump, Senior Analyst