In primary storage, the first reality is that the payoff from applying deduplication may not be as great because there is less redundant data. The second is that the performance impact must be minimal for the solution to be accepted by IT. Further, the loss of even a single transaction has to be avoided. In short, the requirements for deduplication on primary storage are dramatically higher than for backup storage. These issues have in fact severely limited the use of deduplication on primary storage for file data and have nearly eliminated it for block-based data.



Options for Primary Storage Deduplication


A hot topic of debate is when the system should examine data for redundancy. The alternative to real time deduplication is post-process deduplication: examining data during periods of inactivity, when the application servers have excess processing power and I/O is not being consumed by normal read/write tasks. Deduplication is then done in batch mode as cycles become available. Conventional wisdom holds that this approach is consistent with other scheduled IT processes. In practice, however, it quickly becomes complex. How, for example, does the system know and track which data has already been deduplicated? What happens if the deduplication backlog extends beyond a normal workday? As a result, primary storage deduplication has been more of a dream than an implementable reality.


Real time deduplication, in contrast, processes data in-line, before it is written to disk. This obviously consumes processing and I/O resources that would otherwise go to normal read/write operations. What is often forgotten, though, is that if the data is redundant it no longer has to be written to disk at all; only a metadata table needs to be updated. Keep in mind that writing or rewriting data is potentially the 'heaviest lifting' that a file system and its underlying storage components perform. Not only does the data have to be written, a RAID calculation has to be made and one or more parity blocks (two in the case of RAID 6) have to be written. Finally, if a snapshot is in place, the metadata tracking changed blocks also has to be updated. If the data is redundant, something that can be determined before it is written, all of these steps can be avoided. The result can be an actual performance improvement, rather than a loss, on even moderately redundant data sets.
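
To make the write-path savings concrete, below is a minimal sketch of an in-line deduplication decision. The function names, the SHA-256 fingerprint, and the in-memory reference table are illustrative assumptions rather than the actual ZFS implementation; the point is simply that a duplicate segment results in a metadata update instead of a data write, a parity calculation, and a snapshot-metadata update.

```python
import hashlib

# Illustrative in-memory deduplication table: fingerprint -> reference count.
# In a real file system this would be persistent, crash-consistent metadata.
dedup_table = {}

def write_segment(segment: bytes) -> str:
    """Sketch of an in-line dedup write path for one data segment."""
    fingerprint = hashlib.sha256(segment).hexdigest()

    if fingerprint in dedup_table:
        # Duplicate data: only the metadata (reference count) is updated.
        # The data write, RAID parity calculation, and snapshot bookkeeping
        # are all avoided.
        dedup_table[fingerprint] += 1
        return "metadata update only"

    # Unique data: the full write path is exercised.
    dedup_table[fingerprint] = 1
    write_block_to_disk(segment)       # the data itself
    write_parity_blocks(segment)       # RAID parity (two blocks for RAID 6)
    update_snapshot_metadata(segment)  # copy-on-write bookkeeping, if a snapshot exists
    return "full write"

# Placeholder I/O functions so the sketch runs; a real system does disk I/O here.
def write_block_to_disk(segment: bytes) -> None: pass
def write_parity_blocks(segment: bytes) -> None: pass
def update_snapshot_metadata(segment: bytes) -> None: pass

if __name__ == "__main__":
    block = b"A" * 4096
    print(write_segment(block))  # full write
    print(write_segment(block))  # metadata update only
```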


The other advantage of real time deduplication is in storage management. When deduplication is run after the fact, the storage administrator has to manage two separate data states, optimized and not yet optimized, on the same volume, which adds significantly to overall complexity. With real time deduplication, data is always stored in its optimized form, so there is no need to plan capacity separately for pre-optimized and optimized data sets.


However, as mentioned previously, every processing cycle on an application server, particularly in virtualized environments, is precious. So even though real time deduplication provides benefits, the tradeoff in application-server CPU utilization has historically not appeared justified.


Recently (in approximately March 2010), a new real time deduplication scenario has arisen: using an advanced file system, the Zettabyte File System (ZFS), embedded in the storage subsystem to handle deduplication. ZFS has been embedded into a storage platform called NexentaStor™ from Nexenta Systems, Inc. NexentaStor runs on a dedicated processor card in the storage subsystem. This provides two main advantages: 1) deduplication can be handled in real time, and 2) the application server is actually offloaded, since the deduplication work moves to the storage system.



The Advantage of Real Time File System-based Deduplication


With real time deduplication built right into the file system, the two resources impacted are the CPU power of the storage processor card and the read/write capacity of the storage system, since there is an additional read I/O operation to check the metadata tables. CPU is needed to create the hash key that identifies whether a data segment is unique. That key is then compared against a list of existing hashes, commonly called a hash table, and this lookup causes the additional read I/O. In both cases the open nature of ZFS and the NexentaStor platform helps alleviate the potential challenges.
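
A minimal sketch of where those two costs land, with an illustrative segment size and a hypothetical table location (this is not the ZFS deduplication table format): the SHA-256 calculation is the CPU cost, and the lookup against a persistent hash table is the additional read I/O, which is why hosting that table on a fast device matters.

```python
import dbm
import hashlib

SEGMENT_SIZE = 128 * 1024        # illustrative segment size, not a ZFS record size
TABLE_PATH = "/ssd/dedup-table"  # hypothetical path; ideally a flash SSD device

def is_duplicate(segment: bytes, table) -> bool:
    # CPU cost: computing the hash key for the segment.
    key = hashlib.sha256(segment).hexdigest()

    # I/O cost: one lookup against the persistent hash table.
    # If the table lives on SSD, this extra read stays cheap.
    if key in table:
        return True
    table[key] = b"1"  # record the new segment's fingerprint
    return False

def dedup_stream(stream, table_path: str = TABLE_PATH) -> int:
    """Walk a stream in fixed-size segments and count the duplicates found."""
    duplicates = 0
    with dbm.open(table_path, "c") as table:
        while True:
            segment = stream.read(SEGMENT_SIZE)
            if not segment:
                break
            if is_duplicate(segment, table):
                duplicates += 1
    return duplicates
```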


First, since the ZFS file system runs on practically any hardware, configuring a storage server with additional CPU horsepower and RAM is not nearly as expensive as it is with proprietary systems. Second, because the file system can leverage solid state disk (SSD) natively, the hash table can be stored on flash SSD, which is ideal for read-intensive lookups. These two factors quickly negate any performance impact a user might otherwise experience. Finally, specific to ZFS, as we discussed in a recent article, "What is OpenStorage and ZFS?", the file system already checksums data blocks as part of its internal effort to eliminate silent data corruption. When deduplication is enabled, the file system is simply extending its analysis of those segments.


Beyond deduplication, it is important that the optimization also include the ability to compress data. As stated earlier, deduplication relies on redundant data for its effectiveness. Compression, on the other hand, optimizes all data, redundant or not. The combination of the two, especially when done in real time, can greatly reduce capacity consumption.
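
As a hedged illustration of how the two techniques compound, the sketch below deduplicates a stream of segments and then compresses only the unique ones with zlib; the segment contents and sizes are invented for the example, and the ratio it prints is not a benchmark.

```python
import hashlib
import zlib

def optimize(segments):
    """Apply deduplication first, then compress the unique segments.

    Returns (logical_bytes, stored_bytes) so the combined reduction can be
    expressed as a ratio. Illustrative only; a real file system operates on
    on-disk blocks, not Python byte strings.
    """
    seen = set()
    logical = stored = 0
    for segment in segments:
        logical += len(segment)
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint in seen:
            continue                               # duplicate: nothing new stored
        seen.add(fingerprint)
        stored += len(zlib.compress(segment))      # unique: store it compressed
    return logical, stored

if __name__ == "__main__":
    data = [b"server image block " * 200] * 10 + [bytes(range(256)) * 16]
    logical, stored = optimize(data)
    print(f"{logical} logical bytes -> {stored} stored bytes "
          f"({logical / stored:.1f}:1 reduction)")
```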



Where to use Real Time File System-based Deduplication


Most NAS file systems allow real time deduplication to be enabled on a per-volume basis; however, some volumes are more appropriate candidates than others. The data set with the highest return on investment is probably still backup data. Given the speed of these systems, using a portion of the NAS storage pool to hold backup data is certainly appropriate and can be a viable alternative to other backup deduplication solutions on the market.


Beyond backup data, probably the next best use case for deduplication on primary storage is the virtual server or virtual desktop environment. All of those server and desktop images share a significant amount of common data, including the operating system files and the application files. Storage efficiencies of 50:1 or greater are not uncommon when deduplicating virtual images.


After virtual images, user home directories are prime targets for optimization. There tends to be a measurable amount of redundancy between files as users collaborate and save different versions of documents while they are edited. The same images and graphics, company logos for example, also tend to be reused across documents. While the reduction is not as dramatic as with virtual images, a 5:1 improvement in storage utilization from deduplication is typical.
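
On a ZFS-based platform such as NexentaStor, enabling this optimization for volumes like these comes down to per-dataset property changes. The sketch below wraps the standard ZFS commands in Python; the pool and dataset names are hypothetical, and an appliance would normally apply the same settings through its own management interface.

```python
import subprocess

# Hypothetical pool/dataset names; adjust to the volumes that benefit most
# (backup data, virtual machine images, home directories).
DATASETS = ["tank/backup", "tank/vm-images", "tank/home"]

def enable_optimization(dataset: str) -> None:
    """Turn on ZFS deduplication and compression for one dataset."""
    subprocess.run(["zfs", "set", "dedup=on", dataset], check=True)
    subprocess.run(["zfs", "set", "compression=on", dataset], check=True)

def show_status(dataset: str) -> None:
    # Confirm the properties took effect.
    subprocess.run(["zfs", "get", "dedup,compression", dataset], check=True)

if __name__ == "__main__":
    for ds in DATASETS:
        enable_optimization(ds)
        show_status(ds)
```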



Summary: Real Time File System-based Deduplication is here to stay


When properly implemented within the NAS file system, real time deduplication can deliver a tremendous gain in storage optimization, dramatically reducing storage purchasing and operational costs. It is also typically simpler to manage, since data is always stored in its optimized state. Finally, with proper architecting of the server hardware and its storage, this optimization can often be delivered without performance loss.

George Crump, Senior Analyst

Nexenta is a client of Storage Switzerland