Internally Developed


Storage hardware suppliers are, for the most part, actually software developers. While there may be some specific fine tuning, most use a standard storage platform delivered by an original equipment manufacturer (OEM) that specializes in the physical hardware. The majority of their development effort and their product’s intellectual property is focused instead on software. As a result, some vendors may have the internal ability to develop deduplication on their primary storage platforms. The big challenge, though, is making primary storage deduplication fast enough and scalable enough that adding the feature does not negatively impact the customer experience. No supplier wants to add a feature only to lose a customer. Most suppliers are realizing that simply porting deduplication technology into primary storage will not work.


However, the straw that breaks the camel’s back in a decision against internal development may have nothing to do with available resources or knowledge of how deduplication works. It may have more to do with the reality of storage acquisitions. Larger storage companies will continue to purchase smaller companies in an effort to round out their product offerings. In doing so they will be introducing new development teams and storage source code into the organization, yet customers will eventually come to expect a common deduplication engine across these storage platforms. For the acquiring company that has decided to develop deduplication internally, each acquisition means a new development effort to integrate dedupe into that offering. For the acquired company it likely means that its internal development work, if any had started, is now a wasted effort. Finally, for the user this means increased wait time as each new platform is integrated or, worse, a migration from one deduplication platform to another.



Appliance Based Deduplication


Another option that vendors may consider is to provide primary storage deduplication for their customers as an appliance-based solution. In this approach, all deduplication is done by an external device. The advantages of this technique are that it should work across a wide variety of storage types and that integration with the vendor’s product would be almost instantaneous. However, there are several areas for concern.


First, most external appliance implementations store the deduplicated data in a proprietary format, which only that appliance can read. If there is an appliance failure there may be no way to read the deduplicated data until that appliance has been repaired or replaced. Also, most deduplication techniques create tables that store hash information. These tables are used to determine whether a data segment already exists. If it does, a pointer to the original data segment is usually stored in place of the duplicate. If these pointers are lost, the data they reference can no longer be reassembled, and if the hash table is lost, fewer duplicates will be found. Some systems may have the ability to manually rebuild the table, but that would be a time-consuming process. Since data is stored in a format proprietary to the device, the vendor’s storage controller no longer understands the data being written to it. As a result, many storage devices have to disable some of their own key features, like snapshots or replication, and allow the deduplication device to provide those functions. This essentially reduces much of the value of their storage solution to a hunk of hardware with limited software value!


The second challenge with deduplication appliances is that all data must flow through them, during both reads and writes, which could make the appliance a performance bottleneck. Compared with a compression appliance, a deduplication appliance has significantly more tasks to perform. Compression only operates on the individual file as it’s being read or written. Deduplication must segment the file, search a hash table to decide whether it has seen any of the file’s segments before, and create pointers to any duplicate blocks. If no duplicates are detected, it must still write the data, similar to a cache miss. In that case the appliance has spent time searching for repeated data segments without gaining the efficiency of skipping the write.
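
To make those steps concrete, here is a minimal sketch of a deduplicating write path in Python. It is illustrative only and does not describe any vendor’s implementation: the fixed 4 KB segment size, the use of SHA-256, and the DedupStore class and its method names are all assumptions made for the example.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment size in bytes (hypothetical)


class DedupStore:
    """Toy in-memory store illustrating segment, hash lookup, pointer."""

    def __init__(self):
        self.hash_table = {}  # segment digest -> index into self.segments
        self.segments = []    # unique segments actually written
        self.files = {}       # file name -> list of pointers (segment indexes)

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).digest()
            index = self.hash_table.get(digest)
            if index is None:
                # No duplicate found: the lookup cost was paid and the
                # segment must still be written.
                index = len(self.segments)
                self.segments.append(segment)
                self.hash_table[digest] = index
            # Duplicate or not, the file records a pointer to the segment.
            pointers.append(index)
        self.files[name] = pointers

    def read(self, name):
        # Reassemble the file by following its pointers.
        return b"".join(self.segments[i] for i in self.files[name])


store = DedupStore()
store.write("a", b"x" * 8192)                # two identical segments
store.write("b", b"x" * 4096 + b"y" * 4096)  # one old segment, one new
print(len(store.segments))                   # unique segments stored
```

Note that every segment pays for a hash computation and a table lookup whether or not a duplicate is found, which is why the miss case costs more than a plain write.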


When all of this work is done by an external appliance, performance impact is a legitimate concern. To circumvent this, many external deduplication appliances perform the deduplication step after all writes have occurred and the storage system is experiencing some idle time. While this removes the concern over deduplication performance impact during writes, it does make the management of the environment more complex. This ‘post-process’ method requires that users allocate storage capacity for both deduplicated and non-deduplicated data segments, which is difficult because users can’t be sure of their exact capacity consumption. Even with this post-process model, the appliance must still be involved in all reads so that deduplicated data can be reassembled.
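
The capacity planning problem can be illustrated with some simple arithmetic. The sketch below is a rough model rather than a sizing tool: the function name and parameters are hypothetical, and it assumes incoming data lands in full before being reduced by a single average deduplication ratio, which real workloads will not match exactly.

```python
def post_process_capacity(existing_deduped_tb, incoming_raw_tb, dedup_ratio):
    """Return (peak_tb, settled_tb) for one ingest cycle.

    peak_tb:    capacity needed while new data sits un-deduplicated
    settled_tb: capacity in use once the post-process pass has finished
    dedup_ratio is an assumed average, e.g. 4 means 4:1.
    """
    if dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 1")
    peak_tb = existing_deduped_tb + incoming_raw_tb
    settled_tb = existing_deduped_tb + incoming_raw_tb / dedup_ratio
    return peak_tb, settled_tb


# 20 TB already deduplicated, 10 TB of new writes, assumed 4:1 ratio
peak, settled = post_process_capacity(20, 10, 4)
print(peak, settled)
```

The user has to provision for the peak figure, while the settled figure depends on a ratio that is only known after the fact.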



Embedded API Set


The internally developed approach provides too much control and consumes too many resources. The appliance approach gives up too much control to a third-party device and increases the risk of data vulnerability. An embedded deduplication API set may provide just the right balance between the two. A deduplication API set, like any other API set, provides hooks for the supplier so they can take advantage of a library of source code without having to develop the whole library themselves. As improvements are made to the library, the user of the API sees immediate benefit. This is the cornerstone value of deduplication API sets like those from Permabit Technology.


The deduplication API can be embedded directly into the controller of the storage system itself. When that deduplication occurs (inline, parallel or post-process) is at the discretion of the storage vendor. Since vendors have this level of control, they can even shift modes as needed. For example, if storage CPU utilization is low they can use that excess resource to process deduplication inline, as the data is being received by the storage system, so that users see an immediate capacity savings. However, if processor activity increases to the point that CPU horsepower is needed elsewhere, the storage system could automatically throttle the deduplication process to a lower priority and shift to post-process to make sure end-user response time stays acceptable. Once the peak has passed, deduplication priority can be raised again, optimizing processor utilization so that the capacity savings are quickly seen by users.
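
A controller policy along these lines might look like the following sketch. It is not taken from any vendor’s API: the DedupMode names, the next_mode function, and the 80 percent and 50 percent thresholds are all hypothetical values chosen for illustration. Two thresholds are used so the system does not flip back and forth when utilization hovers near a single cutoff.

```python
from enum import Enum


class DedupMode(Enum):
    INLINE = "inline"
    POST_PROCESS = "post-process"


# Assumed thresholds, expressed as a fraction of controller CPU in use.
HIGH_WATERMARK = 0.80  # above this, defer deduplication
LOW_WATERMARK = 0.50   # below this, deduplicate as data arrives


def next_mode(current, cpu_utilization):
    """Pick the deduplication mode for the next interval."""
    if cpu_utilization >= HIGH_WATERMARK:
        return DedupMode.POST_PROCESS
    if cpu_utilization <= LOW_WATERMARK:
        return DedupMode.INLINE
    # Between the two thresholds, keep whatever mode is already active.
    return current


mode = DedupMode.INLINE
for sample in (0.30, 0.85, 0.65, 0.40):
    mode = next_mode(mode, sample)
    print(f"cpu={sample:.2f} mode={mode.value}")
```

In this run the mode stays post-process at 65 percent utilization and only returns to inline once the load has dropped below the lower threshold.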


The important point this example shows is that the storage system has knowledge of the deduplication API and can control its use as needed. When it comes to reads, the API should be designed to leverage a capability available in almost all storage software: the use of metadata tables to track blocks of data. If done correctly, reassembling deduplicated data is no different than managing multiple snapshots or clones of volumes. As a result, there should be no performance impact on reads, and the deduplication engine does not need to be in place for data to be read.
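
The idea can be sketched with a simple block map of the kind snapshots and clones already rely on. This is a simplified illustration under stated assumptions, not a description of how any particular storage system or API lays out its metadata; the BlockMap class and its method names are hypothetical.

```python
class BlockMap:
    """Toy metadata table: logical blocks point at shared physical blocks."""

    def __init__(self):
        self.physical = {}  # physical block number -> stored bytes
        self.refcount = {}  # physical block number -> logical references
        self.volumes = {}   # volume -> {logical block no.: physical block no.}

    def store(self, pbn, data):
        self.physical[pbn] = data
        self.refcount[pbn] = 0

    def map_block(self, volume, lbn, pbn):
        table = self.volumes.setdefault(volume, {})
        old = table.get(lbn)
        if old is not None:
            self.refcount[old] -= 1  # drop the reference being replaced
        table[lbn] = pbn
        self.refcount[pbn] += 1

    def read(self, volume, lbn):
        # A read is a plain table lookup; no hash index is consulted.
        return self.physical[self.volumes[volume][lbn]]


bm = BlockMap()
bm.store(0, b"shared block")
bm.map_block("vol1", 0, 0)  # original copy
bm.map_block("vol2", 7, 0)  # deduplicated copy, same physical block
print(bm.read("vol2", 7), bm.refcount[0])
```

Whether the second reference came from a snapshot, a clone, or deduplication makes no difference to the read path, which only follows the map.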


Finally, and potentially most important, a deduplication API allows for seamless integration of other storage systems as acquisitions occur. As new storage systems are added to a company’s portfolio, all they need to do is enable the API set on that new platform and they have a common deduplication engine across multiple platforms. This allows the acquiring companies to be first to market with unified deduplication and should give them a strategic advantage over competitive solutions that do not have a unified approach.


Primary storage deduplication is quickly becoming “table stakes” in the competition for storage business in the data center. However, simply getting the feature by developing the technology internally or, especially, by using an external appliance can impact efficiency and be problematic from a management perspective. These approaches either won’t scale or give up too much control to an external device. An API may be the right balance of control and flexibility, allowing the storage supplier to offer a unified deduplication strategy that provides capacity optimization. It not only leverages the internally developed storage software feature set, like snapshots and replication, but may actually enhance those features as well.

Permabit Technology is a client of Storage Switzerland

George Crump, Senior Analyst