If sub-file-level deduplication can be delivered, it has the potential to reduce the storage requirement by 3X or more, with the actual reduction depending on the data type. For example, virtualized server images yield significant space savings with deduplication, but seismic oil and gas data may not, since it’s typically all unique. However, if those seismic data sets are copied several times for processing and comparison, then the benefits of deduplication will be seen. Home directories are another example of a data set that provides a solid return on the deduplication investment. This is especially true in large organizations because, as the number of users and groups increases, the likelihood of duplicate information within files also increases.
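To make the dependence on data type concrete, here is a minimal sketch that estimates a deduplication ratio by fingerprinting fixed-size chunks. The function name `dedup_ratio`, the 4 KiB chunk size, and the synthetic data sets are assumptions chosen for illustration; real engines typically use more sophisticated chunking and indexing.

```python
import hashlib


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Return logical size divided by the size of the unique chunks."""
    if not data:
        return 1.0
    unique_sizes = {}  # chunk fingerprint -> chunk length
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique_sizes.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(data) / sum(unique_sizes.values())


# Synthetic stand-ins (assumptions, not real workloads):
# ten identical 4 KiB "server images" and ten all-unique "seismic" chunks.
vm_images = (bytes(range(256)) * 16) * 10
seismic = b"".join(i.to_bytes(4, "big") * 1024 for i in range(10))

print("virtual images:", dedup_ratio(vm_images))
print("seismic, single copy:", dedup_ratio(seismic))
print("seismic, three copies:", dedup_ratio(seismic * 3))
```

In this toy case the unique seismic data shows no reduction on its own, but the reduction appears as soon as the same data set is copied for processing.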


All of these reductions have a distinct impact on other services as well. For example, if the data set can be reduced, less capacity is consumed by the drive protection process (RAID or mirroring). It also means that if DR replication is being performed, less data needs to be replicated across the WAN, assuming that the deduplication product provides global deduplication. In this case, the bandwidth required to perform the replication is reduced, as is the storage capacity at the remote location. Efficiency increases even further if multiple sites are replicating into one DR site, because the likelihood of redundancy increases as more sites are brought into the replication process.
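The knock-on savings are simple arithmetic. The sketch below assumes a hypothetical 90 TB data set, the 3X reduction mentioned earlier, and a 25% RAID parity overhead; all three figures and the function name `physical_footprint` are illustrative assumptions, not measurements.

```python
def physical_footprint(logical_tb: float, dedup_ratio: float,
                       raid_overhead: float = 0.25) -> dict:
    """Estimate capacity consumed locally and replicated to a DR site.

    Assumes the data stays deduplicated during replication, so the amount
    sent over the WAN equals the amount stored after deduplication.
    """
    stored_tb = logical_tb / dedup_ratio
    return {
        "stored_tb": stored_tb,
        "raid_protected_tb": stored_tb * (1 + raid_overhead),
        "replicated_tb": stored_tb,
    }


before = physical_footprint(90, dedup_ratio=1)  # no deduplication
after = physical_footprint(90, dedup_ratio=3)   # assumed 3X reduction

for name in before:
    print(f"{name}: {before[name]} TB -> {after[name]} TB")
```

Under these assumptions the same 3X factor carries through to the RAID-protected capacity and to the data replicated to the remote site.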


Primary storage deduplication can help the storage manager finally get ahead of the curve when it comes to storage growth. Market-driven cost reductions on storage capacity are not keeping up with that growth. Those reductions are coming at a slower pace, as it’s taking longer to develop and certify higher-capacity hard drives. Optimizing the space required on primary storage is a key ingredient in controlling primary storage growth costs.



The Requirements of Primary Storage Deduplication


The capacity and efficiency savings that deduplication provides must be achieved without compromising the core commitments that storage managers make to the users of data. First, it must maintain data integrity. Making sure that deduplicated data can be read by the user or application is fundamental to the success of any deduplication platform. Delivering data integrity in deduplication can be challenging since, in essence, the process does not write all the data but instead links new data to existing segments. To provide the highest level of integrity, storage vendors should look for deduplication technologies that can leverage the same changed block tracking they have relied on for over a decade in their snapshot and thin provisioning features. Companies like Permabit are providing APIs to manufacturers that allow them to keep using their own changed block tracking techniques and integrate them with Permabit's deduplication engine.
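The idea of linking new data to existing segments, and the integrity checks it calls for, can be sketched as follows. This is a generic content-addressed store written for illustration; `SegmentStore` and `IntegrityError` are hypothetical names, and the sketch does not represent Permabit's engine or any vendor's changed block tracking.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored data does not match its fingerprint."""


class SegmentStore:
    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes
        self.refcount = {}  # fingerprint -> number of logical references

    def write(self, segment: bytes) -> str:
        fp = hashlib.sha256(segment).hexdigest()
        existing = self.segments.get(fp)
        if existing is None:
            self.segments[fp] = segment  # first copy: store the data
        elif existing != segment:
            # Byte-compare on a fingerprint match to guard against collisions.
            raise IntegrityError("fingerprint collision")
        # Duplicate or not, the write becomes a reference to the segment.
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp

    def read(self, fp: str) -> bytes:
        segment = self.segments[fp]
        # Verify on read so silent corruption is detected, not returned.
        if hashlib.sha256(segment).hexdigest() != fp:
            raise IntegrityError("segment failed verification")
        return segment
```

Because many logical writes can point at one stored segment, a single damaged segment affects every reference to it, which is why the verification step matters more here than in a store that keeps every copy.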


The second requirement is performance. Performance of the deduplication engine is, of course, always an issue, but it becomes especially important in the primary storage space, where any loss of performance affects the largest number of users. Some vendors have tried to get around the performance problem by moving to a post-process deduplication method, where deduplication runs in the background when the system is idle. There are a host of challenges with this approach. First, the storage manager inherits two storage areas to manage: the deduplicated area and the non-deduplicated area. Second, all the performance penalties of deduplication are incurred with few of the benefits. All redundant data still has to be written and processed, and there are no savings in the RAID parity data created or in the snapshot area.


The other option is to deduplicate inline. Deduplicating data as it is written provides a real-time view of the capacity consumed in the deduplicated state, since the storage is always in that state. It also means that redundant data can be eliminated before it needs to be written to the hard disk. This has a ripple effect: redundant data never causes a parity write or takes up space in a snapshot. The challenge is, of course, performance. As Permabit has shown in its testing, with the right deduplication engine high performance can be achieved without any impact on the user.
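The difference between the two approaches can be shown with a simplified model that counts block writes and peak capacity. The function `simulate` and the four-block workload are assumptions for illustration; the model ignores the extra I/O a background pass would generate and says nothing about any particular product.

```python
import hashlib


def simulate(blocks: list, inline: bool) -> dict:
    """Count disk writes and capacity for inline vs. post-process dedup."""
    fingerprints = {hashlib.sha256(b).digest() for b in blocks}
    if inline:
        # Duplicates are detected before they reach disk, so only unique
        # blocks are written and capacity never exceeds the unique set.
        disk_writes = len(fingerprints)
        peak_blocks = len(fingerprints)
    else:
        # Every block lands on disk first; the background pass removes
        # duplicates later, so the landing area must hold them all.
        disk_writes = len(blocks)
        peak_blocks = len(blocks)
    return {
        "disk_writes": disk_writes,
        "peak_blocks": peak_blocks,
        "final_blocks": len(fingerprints),
    }


# Hypothetical workload: one block written three times plus one unique block.
workload = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
print("inline:      ", simulate(workload, inline=True))
print("post-process:", simulate(workload, inline=False))
```

Both approaches end at the same final capacity in this model, but only the inline path avoids the redundant writes, and with them the associated parity and snapshot overhead.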



How Primary Storage Deduplication Can Be Delivered


Primary storage deduplication can be delivered in several ways. The first is for the storage vendor to build it through its own internal development effort. The problem is that if such a project were to start tomorrow, it would take years to create a releasable, reliable product. As evidenced by the vendors that have brought solutions to market thus far, creating a scalable, high-performance deduplication solution is not a simple task.


Another option is to leverage a third-party deduplication appliance. This allows fast access to the market, but the appliance is not really integrated with the primary storage system. As a result, it must store data in a proprietary format that sits outside of the storage system’s software and, to a large extent, bypasses the investment that the storage manufacturer and the user have made in that software. It also means that if, for some reason, the appliance fails or is not available, access to the deduplicated data may be lost.


The final option is to leverage an API set like that available from Permabit. Here, storage manufacturers can leverage the years of investment that the deduplication company has made in the technology, yet still integrate it under their existing storage software umbrella. The investment of resources to integrate is greater than simply adding an appliance, but nowhere near the level required to develop the software internally. The API set method should also be popular with the largest storage vendors because of their tendency to acquire new storage technologies. Using an API would allow them to deliver a common deduplication engine across their entire portfolio and quickly add that engine to new products as they are developed or purchased.
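As a rough picture of what API-level integration means, the sketch below shows a storage write path that consults a pluggable deduplication index while keeping its own block map and data layout. The interface is entirely hypothetical: `DedupIndex`, `InMemoryIndex`, and `Volume` are invented for illustration and are not Permabit's API or any vendor's software.

```python
import hashlib
from typing import Optional, Protocol


class DedupIndex(Protocol):
    """Hypothetical engine interface: maps fingerprints to block addresses."""

    def lookup(self, fingerprint: bytes) -> Optional[int]: ...

    def record(self, fingerprint: bytes, address: int) -> None: ...


class InMemoryIndex:
    """Toy stand-in for a licensed deduplication engine."""

    def __init__(self):
        self._addresses = {}

    def lookup(self, fingerprint: bytes) -> Optional[int]:
        return self._addresses.get(fingerprint)

    def record(self, fingerprint: bytes, address: int) -> None:
        self._addresses[fingerprint] = address


class Volume:
    """Stand-in for the vendor's own storage software and on-disk layout."""

    def __init__(self, index: DedupIndex):
        self.index = index
        self.blocks = []     # physical blocks, owned by the storage system
        self.block_map = {}  # logical block address -> physical address

    def write(self, lba: int, data: bytes) -> None:
        fingerprint = hashlib.sha256(data).digest()
        address = self.index.lookup(fingerprint)
        if address is None:
            # New data: the storage system writes it in its own format.
            address = len(self.blocks)
            self.blocks.append(data)
            self.index.record(fingerprint, address)
        self.block_map[lba] = address

    def read(self, lba: int) -> bytes:
        return self.blocks[self.block_map[lba]]
```

The design point is that the data and the block map remain inside the storage system, so reads never depend on a separate appliance, while the index can be swapped or shared across products.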



The Role of the User


How does the user get involved? Inform your storage vendor that you must have deduplication and that you want it now. You can't afford a delay in realizing the benefits of this technology in your data center. If the promised delivery is more than a year away, or if the solution they are bringing to market comes with caveats (like reduced performance), then encourage them to look at API-based solutions. If this strategy does not work, then begin to look for storage platforms that have leveraged an API-based method as part of your next storage refresh.



Summary


Primary storage deduplication is no longer a ‘nice to have’ feature but a ‘must have’, and it is becoming a core competency for storage vendors. Storage managers have to get ahead of the curve on storage growth so that they can curtail out-of-control budgets and facilities requirements. Given the increased pace of storage growth, without this type of technology there may be no practical way to close the gap between that growth and the slower decline in the cost of capacity. That is why our advice is, if you’re not getting primary storage deduplication, demand it!

Permabit Technology is a client of Storage Switzerland

George Crump, Senior Analyst