George Crump, Senior Analyst

Optimization has to leverage deduplication and compression

Optimization, no matter where or how it occurs, is going to require resources and the value gained from data reduction has to be more than the cost of those resources. For example, a 5% reduction in capacity will seldom be worth the investment. This explains why deduplication was so successful in the backup market. In that application, there is plenty of redundant data and the expense of processing that data leads to a significant gain in space. Unlike backup data where the same data sets (full backups) are sent repeatedly to the storage device, primary storage doesn’t generally contain enough redundant data for deduplication alone to be effective. Compression in most cases is a more appropriate technology because it works within files, and data redundancy is not required. When compression is combined with deduplication there is typically a justifiable level of capacity savings to more than offset the resources expended. But successful enterprise-class data reduction isn’t just about algorithms. A complete optimization solution needs to integrate with existing storage and deliver reliability, scalability, and manageability.
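The interplay described above can be illustrated with a short sketch. This is not any vendor's implementation; it is a minimal model, assuming a fixed 4 KB block size, that estimates stored bytes under deduplication alone, compression alone, and the two combined:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def optimized_size(data: bytes) -> dict:
    """Estimate stored bytes under dedup alone, compression alone,
    and dedup followed by compression of the unique blocks."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    unique = {}  # content hash -> block (duplicates collapse to one entry)
    for b in blocks:
        unique.setdefault(hashlib.sha256(b).hexdigest(), b)
    return {
        "raw": len(data),
        "dedup_only": sum(len(b) for b in unique.values()),
        "compress_only": len(zlib.compress(data)),
        "dedup_plus_compress": sum(len(zlib.compress(b)) for b in unique.values()),
    }

# Two identical 4 KB blocks of text-like data: dedup halves the footprint,
# and compressing the remaining unique block shrinks it further.
block = (b"quarterly-report " * 256)[:BLOCK_SIZE]
sizes = optimized_size(block * 2)
```

Note how redundancy across blocks is what deduplication removes, while compression works inside each block regardless of redundancy, which is why the combination outperforms either alone on typical primary data.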

In-Band / Out-of-Band placement is critical

While often debated for backup, when and where the optimization actually occurs may be even more critical for the primary storage tier. Following the prime directive of "do no harm" to performance, out-of-band systems have an advantage. They can compress and deduplicate data when the storage system is not busy and after the data has proven to be inactive. The challenge with out-of-band technology is that it can violate the "don't change the user's workflow" rule, and it may require loading an agent or other software on every client machine. An ideal combination would then be an out-of-band optimizer and an in-band reader, similar to the Ocarina Networks ECOSystem product. Being in-band on the read process does not break the rule of not impacting performance. If the optimization solution is written correctly, the re-inflate process can be made significantly faster than the compress-and-deduplicate process, making the performance impact barely noticeable to the user, if at all.

Policy based control of the optimization process

Out-of-band placement of the optimization process also enables another requirement: policy-based control. Policies allow for designation of when data optimization should occur: continuously, once per night, once per week, or as capacity needs demand. They can also control which file types need to be optimized and which optimization techniques should be used. For example, if one of the storage clients is an application highly sensitive to transaction latency, the data reduction solution could be configured to avoid its data files. Policies can also drive where data is stored after optimization. For example, data could stay in place, or be moved to a secondary tier of storage, an alternate storage system, or even a cloud storage provider. An in-band read could then facilitate relocating the file when needed, for example bringing it back onto a tier 1 disk volume regardless of where it was moved.
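A policy of this kind might be modeled as below. The field names, extensions, and tier labels are illustrative assumptions, not the schema of any real product; the sketch only shows how schedule, file-type filters, technique, and destination could combine into a per-file decision:

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class OptimizationPolicy:
    name: str
    schedule: str              # e.g. "continuous", "nightly", "weekly", "on-demand"
    include_extensions: Set[str]  # empty set = match any file type
    exclude_extensions: Set[str]  # e.g. latency-sensitive database files
    technique: str             # "compress", "dedupe", or "compress+dedupe"
    destination: str           # "in-place", "secondary-tier", "cloud"

def select_policy(filename: str,
                  policies: List[OptimizationPolicy]) -> Optional[OptimizationPolicy]:
    """Return the first policy whose filters match, or None to skip the file."""
    ext = filename.rsplit(".", 1)[-1].lower()
    for p in policies:
        if ext in p.exclude_extensions:
            continue
        if not p.include_extensions or ext in p.include_extensions:
            return p
    return None

policies = [
    OptimizationPolicy("office-docs", "nightly", {"docx", "xlsx", "pdf"},
                       set(), "compress+dedupe", "secondary-tier"),
    OptimizationPolicy("catch-all", "weekly", set(),
                       {"mdf", "ldf"}, "compress", "in-place"),
]
```

With this list, an Office document is queued for nightly compress-plus-dedupe and movement to a secondary tier, while a database file matching an exclusion is left untouched.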

Data awareness for improved optimization

All data is not the same. Understanding what the data to be optimized is, and how best to optimize it, is critical for obtaining maximum space efficiency. The performance of specialized dedupe and compression algorithms, however, varies widely. Here again, out-of-band solutions show more of their value. Because these solutions can act on a file after hours, once it has not been accessed for a period, they have the luxury of time in which to process data. This more thorough reduction takes only a few extra seconds per file, but it could impact storage performance if it had to occur in real time, hundreds of times per second. Out-of-band solutions avoid this concern, yet deliver maximum space efficiency.

By running the optimization process out-of-band, time can also be spent on special use cases like photos and images that may already be compressed. Although natively compressed data types used to be isolated to the media and entertainment industries, compression is increasingly part of the file format in standard enterprise data such as Microsoft Office documents and PDF files. Part of the data awareness requirement will be the ability to understand these very specific and hard-to-optimize data types.

Having a broad selection of algorithms helps to deliver better aggregate results, but these content-aware file associations can’t be statically defined when the product is configured. The system needs to have proven logic for runtime selection of these algorithms.
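One way such runtime selection could work is sketched below, under two assumptions of mine: that a quick compressibility probe on a sample of the file is a usable signal, and that a short list of known pre-compressed formats is consulted first. The thresholds and codec names are hypothetical placeholders, not a real product's logic:

```python
import zlib

# Formats that are already compressed internally (ZIP-based Office files,
# JPEG, PDF with compressed streams) -- illustrative, not exhaustive.
PRECOMPRESSED = {"jpg", "jpeg", "png", "zip", "docx", "xlsx", "pdf", "mp4"}

def choose_algorithm(filename: str, sample: bytes) -> str:
    """Pick a codec at runtime from a quick compressibility probe plus a
    file-type hint, rather than a static per-extension table."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if not sample:
        return "store"
    ratio = len(zlib.compress(sample)) / len(sample)
    if ext in PRECOMPRESSED or ratio > 0.95:
        # Nearly incompressible: only a format-specific recompressor
        # (or storing as-is) is worth the CPU time.
        return "format-specific-or-store"
    if ratio > 0.6:
        return "fast-lz"          # modest gain: favor speed
    return "heavy-compressor"     # large gain available: spend the cycles
```

The design point is that the probe runs per file at optimization time, so the algorithm choice tracks the actual content rather than a mapping frozen at configuration time.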

Performance for the use case

Optimization solutions always have to strike a balance between performance and cost. The goal should be to return data as fast as possible while still yielding sizable capacity savings. This becomes the challenge for optimization solutions that aren't content aware and use a fixed algorithm selection. For example, in-line solutions, in an effort not to affect performance, must treat all data and all performance needs the same. To minimize the performance impact, these solutions need to be less aggressive in how they optimize data. The result can be minimal I/O performance impact, but fairly average optimization levels.

The reality, of course, is that all data is not created equal, and workflows and performance expectations are not all the same. The value of an out-of-band solution is that, through policies, it can interact with different data types in different ways.

For a simple example, consider the following file life cycle. First, for active data, no optimization is done. The active data set is relatively small in most environments, so the storage savings gained from optimizing it would not justify the cost. As that data becomes inactive (in days, not weeks), a policy could be set to only compress it and not deduplicate it. This allows for a quick optimization step and a very quick read-back if the data is needed. Assuming the data continues to age without being accessed, it could then be compressed with an optimized compressor and deduplicated for maximum efficiency. It could still be stored on local disk to ensure that, if it were needed, no data transfer would be required to recover it. Finally, as the data becomes truly stagnant, it could be moved by the optimizer to an archive and then eventually to a cloud storage provider for more permanent retention.
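The life cycle above can be sketched as a single age-to-action mapping. The day thresholds, technique names, and tier labels are illustrative assumptions; a real policy engine would make all of them tunable:

```python
from typing import Tuple

def lifecycle_action(days_since_access: int) -> Tuple[str, str]:
    """Map a file's idle time to an optimization step and a storage location.
    Thresholds here are placeholders, not recommendations."""
    if days_since_access < 3:
        return ("none", "tier1")                  # active: leave alone
    if days_since_access < 30:
        return ("fast-compress", "tier1")         # quick to apply, quick to read back
    if days_since_access < 180:
        return ("compress+dedupe", "local-disk")  # maximum reduction, still local
    if days_since_access < 365:
        return ("compress+dedupe", "archive")
    return ("compress+dedupe", "cloud")           # stagnant: long-term retention

# A file idle for ten days gets the lightweight treatment; one idle for
# over a year ends up fully optimized in cloud storage.
```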

Another key selection attribute for primary storage optimization solutions is the ability to scale performance up or down to meet workflow requirements. Ingest and I/O rates vary from application to application, and customer requirements may grow over time. A solution that scales in optimization throughput, for example via a scale-out architecture, will provide far more flexibility to the customer than a one-size-fits-all solution.
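The scaling idea can be illustrated in miniature: throughput grows by adding workers to a pool rather than by making any one worker faster, which is the same principle a scale-out architecture applies at the node level. This toy uses zlib as a stand-in for the optimization step:

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def optimize(blob: bytes) -> bytes:
    """Stand-in for a compress/dedupe pass on one file."""
    return zlib.compress(blob)

def optimize_batch(blobs, workers: int):
    """Process files in parallel; capacity scales by raising `workers`
    (or, in a scale-out design, by adding nodes)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(optimize, blobs))

blobs = [b"sensor-readings\n" * 1000 for _ in range(8)]
out = optimize_batch(blobs, workers=4)
```

Threads suffice here because zlib releases Python's GIL while compressing; a production system would more likely distribute work across processes or machines.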

Moving data in optimized form

As mentioned above it’s ideal if the optimization solution is really part of an entire ecosystem that can keep data in its optimized form and not have to ‘re-inflate’ it prior to handing it off to a separate deduplication engine in the backup process. Ideally, the primary storage deduplication system could integrate with the backup process so that the data can stay in its optimized form during backup operations as well as restore. Without optimized data movement, this backup and eventual restore operation would require re-inflating the data twice, increasing the backup window and restore time. Work on this front is still progressing, but look for vendors who understand this challenge and are working on a solution. When the final criteria are achieved, the walls between different deduplication silos will begin to come down and the data center can have one optimization solution for all data sets.

There are many other requirements that suppliers may try to add to the list. Some have little merit and are hard to justify internally. The primary objective is, again, to "do no harm": don't affect performance and don't affect the user experience, while at the same time delivering the highest possible optimization levels. These requirements support those objectives.

Ocarina Networks is a client of Storage Switzerland