Storage Optimization Deduplication vs. Real Time Compression
Data Deduplication
Data deduplication is the process of identifying redundant blocks of data within a given data set and then storing only one copy of that data. There are several existing and emerging use cases for data deduplication:
The initial and still common use case for data deduplication is to optimize the backup data set. Using data deduplication on a backup disk target, an approach pioneered by companies like Data Domain, is probably the ideal use case for the technology because of the high level of redundancy between multiple backup jobs. Efficiency rates of 12X are not uncommon on backup data stores.
Less common is deduplication for primary storage. Companies like NetApp are now able to deduplicate data on primary storage, as described above, by looking for redundant blocks of data and storing only one copy of those blocks. This use case does not deliver as impressive an efficiency ratio as backup data sets; a 50% efficiency is typical. The user also has to be careful not to apply the deduplication function to very active data, for fear of a performance impact. Still, there are certain application workloads that are ideal for the process. Deduplication on primary storage typically delivers the best results for user home directories and VMware boot images, both of which are significant consumers of storage capacity, making the investment in deduplication worthwhile.
Deduplication is also available for the third type of primary storage optimization: disk-based archive. Similar to deduplication on primary storage, the benefit is not as pronounced as it is with backup storage, again resulting in a typical 50% increase in efficiency.
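The block-level process described in this section (find redundant blocks, keep one copy of each) can be illustrated with a minimal sketch. It assumes fixed-size 4 KB blocks and SHA-256 fingerprints; both choices, and the function names `deduplicate`, `rebuild`, and `stored_bytes`, are assumptions made for illustration and do not describe how any vendor named here implements deduplication.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps a block's SHA-256 digest to its bytes,
    and recipe lists the digests in order so the data can be rebuilt.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is stored
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the stored unique blocks."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store) -> int:
    """Total size of the unique blocks actually kept."""
    return sum(len(block) for block in store.values())


if __name__ == "__main__":
    # Hypothetical data: three "backup jobs" that contain the same two blocks.
    job = b"A" * 4096 + b"B" * 4096
    backups = job * 3
    store, recipe = deduplicate(backups)
    assert rebuild(store, recipe) == backups
    print(len(backups), "bytes written,", stored_bytes(store), "bytes stored")
```

Repeated backup jobs are where this pays off most: each additional job adds entries to the recipe but few or no new blocks to the store, which is the redundancy the backup use case relies on.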
The Cascade Effect of Real-Time Compression
Real-time compression not only optimizes all data types, it also enhances the deduplication, backup and archive processes in several ways:
First, because real-time compression reduces the size of all the data that is to be analyzed by the various deduplication technologies, it makes them more efficient. Whether deduplication is being used for primary storage or backup, the initial baseline of data must be written to disk. If real-time compression is already in place, however, there is a decrease of almost 80% in the space required to store that data and a significant decrease in the amount of time required to write that data.
Second, deduplication solutions require processing power to analyze the data. In primary storage, it is common to perform the deduplication analysis after the data has been written, during less busy storage IO times. These non-busy storage IO windows are under constant pressure and are continually getting smaller. Real-time data compression relieves this pressure by reducing the deduplication processing time by over 86%.
For systems that perform real time deduplication, common in backup deduplication solutions, the advantages are even more pronounced: an improvement of over 80% in initial disk writes, a 90% increase in data reduction, and a 74% reduction in CPU utilization. Many backup data deduplication systems are IP network based for simplicity. When they are combined with real time compression, the data to be backed up can be sent across the wire in pre-compressed format, reducing network traffic by over 70%. The net result for backup deduplication solutions combined with real time compression is a reduction in the backup window of over 70%.
Similar results would be common when moving data to a disk-based archive. Since most of these systems are NAS based, the data moving to the disk archive would remain in its compressed state as it is transferred and stored on the archive. If the disk archive used deduplication, users could expect the same percentage improvements in overall storage capacity. This is important because disk-based archive is under constant pressure to drive down the cost of storing data, so any measurable increase in efficiency has significant value.
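The cascade described above, where data that is already compressed gives the deduplication pass less to analyze and less to store, can be sketched as follows. This is an illustration under stated assumptions, not the method of any product mentioned here: it uses zlib, compresses each fixed-size 4 KB block independently (so identical blocks still produce identical output and remain deduplicable), and runs on made-up sample data from the hypothetical helper `sample_data`. It does not reproduce the percentages quoted in this article.

```python
import hashlib
import zlib


def split_blocks(data: bytes, block_size: int = 4096):
    """Cut data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data: bytes, block_size: int = 4096):
    """Compress each block independently; zlib is an assumption of this sketch."""
    return [zlib.compress(block) for block in split_blocks(data, block_size)]


def dedupe(blocks):
    """Keep one copy of each distinct block, keyed by its SHA-256 digest."""
    store = {}
    for block in blocks:
        store.setdefault(hashlib.sha256(block).digest(), block)
    return store


def summarize(blocks):
    """Return (bytes the deduplication pass must hash, bytes kept after it)."""
    store = dedupe(blocks)
    return sum(len(b) for b in blocks), sum(len(b) for b in store.values())


def sample_data(copies: int = 3, distinct: int = 8, block_size: int = 4096) -> bytes:
    """Hypothetical test data: a few distinct text-like blocks, repeated."""
    blocks = [((b"customer-record-%04d;" % i) * 200)[:block_size]
              for i in range(distinct)]
    return b"".join(blocks) * copies


if __name__ == "__main__":
    data = sample_data()
    print("raw        (hashed, kept):", summarize(split_blocks(data)))
    print("compressed (hashed, kept):", summarize(compress_blocks(data)))
```

In this sketch both figures shrink once the blocks are compressed, while the number of unique blocks found stays the same. Real savings depend entirely on how compressible and how redundant the actual data is.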
Combining real time data compression with data deduplication technologies, no matter where they are used in the data life-cycle, can deliver a significant ROI that will likely more than justify the cost of the real time compression appliance.
Reducing the capacity requirements of primary storage, backup storage, and archive storage is just the beginning. Savings will also be derived from reduced power, cooling, and floor space requirements and lower replication costs, as well as from potentially delaying backup network upgrades or the addition of backup systems for improved deduplication performance.
Tuesday, March 24, 2009
Related
Storage Switzerland
Instant ROI with Real Time Compression
January 28th, 2009
Storage Switzerland
September 1st, 2008
SearchStorage
Six requirements for deploying primary storage optimization
SearchVMware
October 31st, 2008
Available Now Webcast on Primary Storage Optimization
Comparing Real-Time Compression and Other Primary Storage Optimization Methods
Storing more data on your existing capacity is a high-priority project in 2009, and three technologies have emerged to address this challenge: real-time compression, data deduplication, and disk-based archiving. Each has its own strengths, but real-time compression is the only solution that is purpose-built for primary storage optimization and that also uniquely complements deduplication and archive throughout the data life-cycle.
Real-Time Compression
Real time data compression, provided by companies like Storwize, utilizes an appliance that sits in-line between a NAS environment (CIFS or NFS) and the users of the data. Real-time compression has emerged as an ideal solution for primary storage optimization for several reasons:
The solution installs transparently and provides real-time, random access to data as it is being accessed or stored. The result is an easy installation, support for all application types, and data reduction levels of 50%-90%.
The legacy concern with an inline compression system centers on the potential for performance impact. However, in repeated testing this has not been the case. In fact, performance consistently increases, even in challenging workloads like Oracle or VMware, as discussed in our prior articles on the subject (see the sidebar).
Real-time compression optimizes all the data stored on primary storage including frequently accessed, high-performance data as well as inactive and redundant data.
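To put the quoted data reduction levels of 50%-90% in capacity terms, the short sketch below converts between a reduction percentage and an effective capacity multiplier. The function names are hypothetical, and reading a figure such as "12X" as the ratio of logical data to stored data is an assumption of this sketch rather than a definition given in the article.

```python
def capacity_multiplier(reduction_percent: float) -> float:
    """How many times more data fits on the same disk at a given reduction.

    A 50% reduction means each byte of data needs half a byte of disk,
    so the same capacity holds 2x the data.
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return 100 / (100 - reduction_percent)


def reduction_from_ratio(ratio: float) -> float:
    """Percentage of space saved for a given logical-to-stored ratio."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 100 * (1 - 1 / ratio)


if __name__ == "__main__":
    for percent in (50, 90):
        print(f"{percent}% reduction -> {capacity_multiplier(percent):.0f}x capacity")
    print(f"12X ratio -> {reduction_from_ratio(12):.1f}% reduction")
```

The relationship is not linear: moving from 50% to 90% reduction takes the effective capacity from 2x to 10x, which is why the upper end of the quoted range matters so much more than the lower end.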