How Dedupe Improves Primary Storage Efficiency
We need duplicates of our data. For most organizations, working with a single copy of a file or data set isn’t efficient, or even possible. We use copies to support collaborative workflows, to facilitate changes in production data and to enable the data protection process itself. By duplicating data, we’re trading space, in the form of storage, CPU and networking capacity, for time and other benefits, like data protection. Deduplication can improve primary storage efficiency and reduce the cost of this common productivity technique.
Workflow and Collaboration
The workflow in modern corporations requires multiple copies of a data set, either in its entirety or in segments. We like to ‘spread out’ when we work and usually keep copies of source materials and older versions available throughout the production process; it helps our productivity. When the project is complete we can go back and ‘clean up’, although it’s common to save the intermediary copies and pieces of the final product in case we need to revise it or produce supporting information.
Wednesday, April 20, 2011
We also collaborate with other people, and each of them typically needs a copy of the data we’re working on, including supporting documents. Work is seldom completed in one sitting, especially when multiple people are involved. The solution to this time issue is to create a copy, both to save the work in progress and to allow ourselves and others to jump back into the project when we have time in the future. The ability to make and save an almost unlimited number of copies and revisions enables us to decouple time and space from our work process. We’re essentially spending storage space to overcome time and location challenges.
Protection
In addition to workflow efficiency, copies of data are made for protection purposes. More than simply backup copies, we make and keep duplicates and partial duplicates of most steps in the ‘lifecycle’ of a project, as a means of protecting our investment in time and resources. Similarly, copies are made of essentially any production data object before it is changed so that it can be rolled back to a previous ‘working state’ if something goes wrong. Database administrators routinely make copies of production databases before making any changes. Graphic, publishing and communications groups save a document before they modify it. Again, it’s an example of trading the cost of (storage) space for time and, in this case, peace of mind.
The reality is that duplicate data exists in IT infrastructures even after we no longer need it. We know this but don’t simply go and clean it out, because it’s not an efficient use of time. Finding and deleting valueless files and personal data is an almost impossible task for the person who owns that data, let alone for IT trying to do it on behalf of others. This could be data created by a former employee and stored on a shared file server, or intermediary documents no longer needed once the final document is complete. Secondary copies made for safety’s sake (like the database example) or old backups of data that’s been deleted also add to this collection of data that’s difficult to excise.
If duplication of data is inevitable and arguably an effective use of resources, how can a technology like deduplication help improve efficiency? The answer is by applying it to primary storage. Technologies like Albireo from Permabit can bring the benefits of deduplication to primary data sets and improve the efficiency of all the ‘downstream’ processes that handle that data. Unlike dedupe for backups, reducing the size of the primary data set has a multiplier effect as these data objects are processed and copied throughout the environment over their ‘lifecycle’. Like a stone dropped into the water, duplicate data ripples out into the environment.
For example, a database reduction of 5x, not atypical for primary storage deduplication, translates into a much larger absolute saving for an organization that creates ten copies of a production database - also not atypical - because every copy inherits the reduction. In a similar fashion, reducing MS Office file sizes by 5x becomes more significant when a workgroup generates 100 files during a project and all 20 members save them. When a company has 100 such workgroups this multiplier effect is dramatic.
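The multiplier effect described here is simple arithmetic, and a rough sketch can make it concrete. The figures below (a 1,000 GB database, ten full-size copies) are illustrative assumptions rather than measurements, and the function name `total_footprint` is hypothetical.

```python
def total_footprint(primary_gb, copies, reduction=1.0):
    """Capacity used by a primary data set plus `copies` full copies of it.

    Simplifying assumption: every copy is the same size as the primary,
    so a reduction applied to the primary carries over to each copy.
    """
    return primary_gb * (1 + copies) / reduction


# Hypothetical figures: a 1,000 GB database with ten copies, reduced 5x.
primary_gb, copies, reduction = 1000, 10, 5

# Space saved if only the primary copy is counted.
saved_on_primary = primary_gb - primary_gb / reduction

# Space saved once the primary and all of its copies are counted.
saved_overall = (total_footprint(primary_gb, copies)
                 - total_footprint(primary_gb, copies, reduction))

print(f"Saved on the primary alone: {saved_on_primary:,.0f} GB")
print(f"Saved across all copies:    {saved_overall:,.0f} GB")
```

Under these assumptions the saving grows in step with the number of copies: the 5x ratio stays the same, but the capacity it frees is eleven times larger than on the primary alone.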
On the storage side, operations like snapshots and replication can also generate a large amount of duplicate data. Many NAS systems, for example, routinely create snapshot reserves of 20% or more. Similar to the space-for-time concept above, snapshotting is a wonderful device which can improve efficiency and productivity in many ways. But while it’s another example of ‘cheap insurance’, it isn’t free. In this case, it costs 20% of the storage TCO - and that’s tier 1 storage for the most part. A deduplication technology, like Albireo, which can reduce primary storage requirements by 5x will also reduce storage consumed through the snapshot process by that same factor.
Primary data deduplication can reduce the footprint of a data set and thereby reduce storage requirements for all its subsequent duplicates. In the collaboration example, this leads to a compounding effect. When a primary data set is deduplicated, the size of each generation of that data set derived from it is dramatically reduced, as only pointers to the original copy are stored. Like the ‘magic’ of compound interest, a relatively small decrease in the size of primary files in a project data set can result in significant savings by the time that project is retired. And, unlike other storage reclamation efforts, primary storage deduplication is an embedded process that requires no administration time.
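The idea that later generations of a data set store "only pointers to the original copy" can be illustrated with a minimal block-level sketch. This is a generic, simplified model (fixed-size blocks indexed by SHA-256 digest, held in memory) and not a description of how Albireo or any other product is implemented; the class `DedupeStore` and the file names are hypothetical.

```python
import hashlib


class DedupeStore:
    """Toy block store: each unique block is kept once, files hold pointers."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (stored once)
        self.files = {}   # file name -> list of digests (the pointers)

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content has not been seen before.
            self.blocks.setdefault(digest, block)
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self):
        # What the files would occupy without deduplication.
        return sum(len(self.blocks[d])
                   for ptrs in self.files.values() for d in ptrs)

    def physical_bytes(self):
        # What the unique blocks actually occupy.
        return sum(len(b) for b in self.blocks.values())


store = DedupeStore()

# Eight distinct 4 KB blocks stand in for an original document.
original = b"".join(bytes([i]) * 4096 for i in range(8))
# A revision that changes only the last block.
revised = original[:7 * 4096] + bytes([99]) * 4096

store.write("report_v1", original)
store.write("report_v1_copy", original)  # exact duplicate
store.write("report_v2", revised)        # partial duplicate

print(store.logical_bytes(), store.physical_bytes())
```

In this sketch the exact copy adds no new blocks and the revision adds only the one block that changed, so three generations of the document occupy little more than the original.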
Deduplication is a technology that initially established itself in the backup space. But when implemented in primary storage, with solutions like Albireo, the impact of deduplication is multiplied for every downstream process using and creating copies of that primary data. It can reduce the size of the data ‘stone’ that the organization drops on IT so that the ripples of duplicate data from all the necessary copies and portions of copies don’t amount to as much. This translates into less capacity bought (tier 1 capacity, typically), less networking bandwidth consumed, and fewer CPU cycles consumed (as smaller files are manipulated).
Permabit Technology is a client of Storage Switzerland
Eric Slack, Senior Analyst