How Should Primary Deduplication Be Delivered?
Primary storage deduplication is becoming a necessary feature for storage managers. How it should be delivered is a question that’s now at the top of many vendor product managers’ agendas. Their choices range from proprietary software implemented directly in the architecture, to an external appliance, to a third-party API set like that available from Permabit Technology. The choices vendors make directly impact end-user deployments, the extent of their product’s ROI and its long-term flexibility.
The promise of reduced capacity requirements and their associated costs, the ability to ‘bend’ the storage growth curve, and the increased efficiency of other operations like snapshots, replication and backup copies all make primary storage deduplication compelling. While the value of the technology is no longer debated, vendors are struggling with which of the above methods to use to integrate deduplication into their primary storage solutions.
Monday, February 14, 2011
Internally Developed
Storage hardware suppliers for the most part are actually software developers. While there may be some specific fine tuning, most use a standard storage platform delivered from an original equipment manufacturer (OEM) that specializes in the physical hardware. The majority of their development efforts and their products’ intellectual property are focused instead on software. As a result, some vendors may have the internal ability to develop deduplication on their primary storage platforms. The big challenge, though, is making primary storage deduplication fast enough and scalable enough that adding the feature does not negatively impact the customer experience. Suppliers do not want to add a feature only to lose a customer. Most suppliers are realizing that simply porting existing deduplication technology into primary storage will not work.
However, the straw that breaks the camel’s back in a decision against internal development may have nothing to do with available resources or knowledge of how deduplication works. It may have more to do with the reality of storage acquisitions. Larger storage companies will continue to purchase smaller companies in an effort to round out their product offerings. In doing so they will be introducing new development teams and storage source code into the organization, yet customers will eventually come to expect a common deduplication engine across these storage platforms. For an acquiring company that has decided to develop deduplication internally, each acquisition means a new development effort to integrate dedupe into that offering. For the acquired company it likely means that its internal development work, if it started any, is now a wasted effort. Finally, for the user this means increased wait time as each new platform is integrated or, worse, a migration from one deduplication platform to another.
Appliance Based Deduplication
Another option that vendors may consider is to provide primary storage deduplication for their customers as an appliance-based solution. In this approach, all deduplication is done by an external device. The advantages of this technique are that it should work across a wide variety of storage types and that integration with the vendor’s product would be almost instantaneous. However, there are several areas for concern.
First, most external appliance implementations store the deduplicated data in a proprietary format that only that appliance can read. If the appliance fails, there may be no way to read the deduplicated data until it has been repaired or replaced. Also, most deduplication techniques build tables of hash information that are used to determine whether a data segment already exists. If it does, a pointer to the original data segment is usually stored in place of the duplicate. Losing these pointers could be a serious problem, since the data they reference can no longer be reassembled, and losing the hash table means that fewer duplicates will be found. Some systems can manually rebuild the table, but that is a time-consuming process. Finally, because data is stored in a format proprietary to the device, the vendor’s storage controller no longer understands the data being written to it. As a result, many storage devices have to disable some of their own key features, like snapshots or replication, and allow the deduplication device to provide those functions. This essentially reduces much of the value of the storage solution to a hunk of hardware with limited software value.
The second challenge with deduplication appliances is that all data must flow through them during both reads and writes, which makes the appliance a potential performance bottleneck. Compared with a compression appliance, a deduplication appliance has significantly more work to do. Compression only operates on the individual file as it’s being read or written. Deduplication must segment the file, search a hash table to decide whether it has seen any of the file’s segments before, and create pointers to any duplicate blocks. If no duplicates are detected, it must still write the data, similar to a cache miss. In that case the appliance has spent time searching for duplicate segments without gaining the efficiency of skipping the write.
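The write path described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB segment size, the use of SHA-256 as the fingerprint, and the `DedupStore` class with its field names are all assumptions made for the example.

```python
import hashlib

SEGMENT_SIZE = 4096  # hypothetical fixed segment size in bytes


class DedupStore:
    """Toy in-memory store: a fingerprint table plus a pointer list per file."""

    def __init__(self):
        self.segments = {}       # fingerprint -> segment bytes (stored once)
        self.files = {}          # file name -> ordered list of fingerprints
        self.writes_avoided = 0  # duplicate segments that needed no write

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint in self.segments:
                # Duplicate: only a pointer is recorded, the write is skipped.
                self.writes_avoided += 1
            else:
                # Miss: the lookup cost was paid and the write still happens.
                self.segments[fingerprint] = segment
            pointers.append(fingerprint)
        self.files[name] = pointers


store = DedupStore()
store.write("a.bin", b"A" * 8192 + b"B" * 4096)
store.write("b.bin", b"A" * 4096 + b"C" * 4096)
print(len(store.segments), store.writes_avoided)  # prints: 3 2
```

Five segments arrive but only three are stored. Every segment, duplicate or not, pays for a hash computation and a table lookup, which is the extra work that sets deduplication apart from compression.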
When all of this work is done by an external appliance, performance impact is a legitimate concern. To circumvent this, many external deduplication appliances perform the deduplication step after all writes have occurred and the storage system is experiencing some idle time. While this removes the concern over deduplication performance impact during writes, it does make the management of the environment more complex. This ‘post-process’ method requires that users allocate storage capacity for both deduplicated and non-deduplicated data segments, which is difficult because users can’t be sure of their exact capacity consumption. Even with the post-process model, the appliance must still be involved in all reads so that deduplicated data can be reassembled.
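A rough worked example shows why post-process capacity planning is awkward. The figures below are invented for illustration, and the function assumes, as a simplification, that new writes shrink by a single known ratio once the deduplication pass runs. In practice that ratio is exactly what the user cannot predict.

```python
def post_process_capacity(deduped_tb, new_writes_tb, dedup_ratio):
    """Return (peak_tb, settled_tb) for one post-process cycle.

    peak_tb:    capacity needed before the pass runs, while new writes
                sit un-deduplicated next to already-deduplicated data.
    settled_tb: capacity in use after the pass, assuming the new writes
                reduce by dedup_ratio (a hypothetical, simplified model).
    """
    peak_tb = deduped_tb + new_writes_tb
    settled_tb = deduped_tb + new_writes_tb / dedup_ratio
    return peak_tb, settled_tb


# Hypothetical numbers: 20 TB already deduplicated, 10 TB of new writes, 4:1.
peak, settled = post_process_capacity(20.0, 10.0, 4.0)
print(peak, settled)  # prints: 30.0 22.5
```

The system must be provisioned for the peak figure even though it settles at the lower one, and the gap widens or narrows with a ratio that is only known after the fact.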
Embedded API Set
The internally developed approach provides too much control and consumes too many resources. The appliance approach gives up too much control to a third party device and increases the risk of data vulnerability. An embedded deduplication API set may provide just the right balance between the two. A deduplication API set, like any other API set, provides hooks to the supplier so they can take advantage of a library of source code without having to develop the whole library themselves. As improvements are made to the library, the user of the API sees immediate benefit. This is the cornerstone value of deduplication API sets like those from Permabit Technology.
The deduplication API can be embedded directly into the controller of the storage system itself. When that deduplication occurs (inline, parallel or post-process) is at the discretion of the storage vendor. Since vendors have this level of control, they can even shift modes as needed. For example, if storage CPU utilization is low, the system can use that excess resource to process deduplication inline, as the data is being received, so that users see an immediate capacity savings. However, if processor activity starts to increase to the point that CPU horsepower is needed elsewhere, the storage system could automatically throttle the deduplication process to a lower priority and shift to post-process, to make sure end-user response time remains acceptable. Once the peak has passed, deduplication priority can be raised again, optimizing processor utilization so that the capacity savings are quickly seen by the users.
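The mode shifting described above might look something like the following sketch. The function name `choose_mode`, the 80% and 60% thresholds, and the gap between them (added here so the controller does not flip back and forth around a single cutoff) are all assumptions for illustration, not part of any real deduplication API.

```python
def choose_mode(cpu_utilization, current_mode, high=0.80, low=0.60):
    """Pick the deduplication mode for the next interval.

    cpu_utilization is a fraction between 0 and 1. Above `high` the
    controller defers work to post-process; below `low` it deduplicates
    inline; in between it keeps whatever mode it is already in.
    """
    if cpu_utilization >= high:
        return "post-process"
    if cpu_utilization <= low:
        return "inline"
    return current_mode


mode = "inline"
history = []
for sample in [0.30, 0.70, 0.85, 0.70, 0.50]:  # hypothetical CPU samples
    mode = choose_mode(sample, mode)
    history.append(mode)
print(history)
# prints: ['inline', 'inline', 'post-process', 'post-process', 'inline']
```

The point is that the decision lives inside the storage controller, which already knows its own load, rather than in an external device that does not.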
The important point of this example is that the storage system has knowledge of the deduplication API and can control its use as needed. When it comes to reads, the API should be designed to leverage a capability available in almost all storage software: its use of metadata tables to track blocks of data. If done correctly, the rebuilding of deduplicated data is no different than managing multiple snapshots or clones of volumes. As a result, there is no performance impact and the deduplication engine does not need to be in place for data to be read.
Finally, and potentially most important, a deduplication API allows for seamless integration of other storage systems as acquisitions occur. As new storage systems are added to a company’s portfolio, all the company needs to do is enable the API set on the new platform to have a common deduplication engine across multiple platforms. This allows acquiring companies to be first to market with unified deduplication and should give them a strategic advantage over competitive solutions that do not have a unified approach.
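To illustrate the read-side claim, here is a minimal sketch in which a volume is reassembled purely from a block map, the same kind of metadata table a snapshot or clone would use. The names `block_store`, `volume_map` and `read_volume` are hypothetical, and the tiny 4-byte blocks are only there to keep the example readable.

```python
def read_volume(block_map, block_store):
    """Reassemble a volume by following its block map.

    No fingerprint table or deduplication engine is consulted here;
    the map alone says which physical block backs each logical block.
    """
    return b"".join(block_store[block_id] for block_id in block_map)


block_store = {0: b"AAAA", 1: b"BBBB"}  # physical blocks, each stored once
volume_map = [0, 1, 0, 0]               # logical block -> physical block id

print(read_volume(volume_map, block_store))  # prints: b'AAAABBBBAAAAAAAA'
```

Three logical blocks share one physical block, yet the read is an ordinary metadata lookup, which is why a read structured this way does not depend on the deduplication engine being present.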
Primary storage deduplication is quickly becoming “table stakes” in the competition for storage business in the data center. However, simply getting the feature by developing the technology internally or, especially, by using an external appliance can impact efficiency and be problematic from a management perspective. These approaches either won’t scale or give up too much control to an external device. An API may be the right balance of control and flexibility, allowing the storage supplier to offer a unified deduplication strategy that provides capacity optimization. It not only leverages the internally developed storage software feature set, like snapshots and replication, but may actually enhance those features as well.
Permabit Technology is a client of Storage Switzerland
George Crump, Senior Analyst