Making Primary Storage Deduplication Safe
There may be a misconception in IT that deduplication in primary storage systems is too new and too risky to be considered. The concern is that it will manipulate data imperfectly, hurt performance and somehow put data integrity at risk. While most deduplication solutions do alter data, some do not. There are solutions available today that leverage the same technological underpinnings that make snapshots and similar capabilities a reality in the modern storage system. If this approach is followed, primary deduplication should cause no performance impact and be as safe as the common disk-based snapshot.
When deduplication is used in the backup space, there’s less concern over how it goes about its work; after all, it’s operating on a copy of the data. Now that deduplication is moving ‘upstream’ and being implemented in primary (non-backup) storage systems, more attention is being paid to it. Understandably, since primary data is, in some cases, the only copy, there’s a perceived risk associated with any process that touches that data. However, if the primary deduplication solution can leverage the same storage fundamentals used for file system and snapshot management, the risk is no greater than that of a snapshot. This fundamental storage element is known as the “extent”.
Tuesday, September 7, 2010
Extents and Extent Management
An extent can be defined as “one or more adjacent blocks within a file system, presented as an address-length pair, which identifies the starting block address and the length of the extent” [VxFS admin guide]. Each file (and data volume) is a collection of the data blocks described by those extents, the management of which ensures that a file or volume can be read, modified, saved and reread. Indeed, extent management is a significant portion of the work file systems and storage controllers do every day.
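To make the address-length pair concrete, here is a minimal Python sketch of an extent and of a file described as an ordered list of extents. The class name, the helper function and the block addresses are illustrative assumptions, not taken from VxFS or any particular file system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Extent:
    """An address-length pair: starting block address plus number of adjacent blocks."""
    start: int
    length: int

    def blocks(self) -> range:
        # The adjacent block addresses this extent covers.
        return range(self.start, self.start + self.length)


def file_blocks(extents: List[Extent]) -> List[int]:
    """Return a file's block addresses in logical order, as described by its extents."""
    return [addr for extent in extents for addr in extent.blocks()]


# A hypothetical file stored in two extents: blocks 100-102, then blocks 240-241.
example_file = [Extent(100, 3), Extent(240, 2)]
print(file_blocks(example_file))
```

The file itself holds no data in this sketch; it is only the ordered list of extents that tells the system which blocks to fetch, and in what order.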
File systems and storage controllers have a solid track record of extent management. Both are so solid, in fact, that this extremely complex process is almost taken for granted as so much goes on ‘under the covers’, even in the simplest computer system OS, file system or storage controller. With the CPU, memory and I/O capabilities available today, storage systems do more than just manage this metadata to perform basic file operations. Snapshots, for example, involve another layer of abstraction for file systems and storage controllers that are managing extents.
Snapshots
Snapshots have become a very common way to create a ‘copy’ of a dataset by simply recording a copy of the extents instead of the actual data. Snapshots are widely used for a couple of reasons: they’re faster and require much less storage space than physical copies, and they work. They provide a copy of data to do something with, but still save the ‘original copy’. When a snapshot is taken and that data is modified, the ‘original’ data blocks are left untouched or saved in an area of disk reserved for snapshots, providing a copy with which the OS can reassemble the dataset as it was before that snapshot was taken.
Snapshots are a great example of how far users trust extent management with their data. A large file with a number of snapshots may have hundreds of extents representing the original data blocks and those that have been changed, but it continues to return the data without errors when called upon. Another example is change-based replication, which is again a series of references to blocks of data (extents) that may be physically stored all over the place. It too works, and users are confident in it.
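The idea of recording a copy of the mapping rather than the data can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it tracks one address per block instead of multi-block extents, it models only the variant in which original blocks are left untouched and new writes are redirected to fresh blocks, and the class and method names are hypothetical.

```python
class Volume:
    """Toy volume: a block map (logical index -> block address) over a block store."""

    def __init__(self):
        self.store = {}       # block address -> data, a stand-in for the disk
        self.next_addr = 0    # next free block address
        self.block_map = []   # live mapping of logical blocks to addresses
        self.snapshots = {}   # snapshot name -> saved copy of the mapping

    def _allocate(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.store[addr] = data
        return addr

    def append(self, data):
        self.block_map.append(self._allocate(data))

    def snapshot(self, name):
        # Only the mapping is copied; no data blocks are duplicated.
        self.snapshots[name] = list(self.block_map)

    def write(self, index, data):
        # Redirect the write to a new block; the old block stays untouched
        # so any snapshot that references it still sees the original data.
        # (A real system would also reclaim blocks nothing references.)
        self.block_map[index] = self._allocate(data)

    def read(self, block_map=None):
        mapping = self.block_map if block_map is None else block_map
        return [self.store[addr] for addr in mapping]


vol = Volume()
for chunk in (b"A", b"B", b"C"):
    vol.append(chunk)
vol.snapshot("before")
vol.write(1, b"X")
print(vol.read())                          # live view after the change
print(vol.read(vol.snapshots["before"]))   # view as of the snapshot
```

After the write, the volume holds four physical blocks in total: three shared between the live view and the snapshot, plus the one changed block.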
Deduplication
Deduplication can be added to a primary storage system in much the same way that snapshots are today, by leveraging extents. It too is a series of references to blocks of data and a system of recording metadata (extents) about those blocks that tells the storage system how to re-assemble that data when users or applications need it. Unlike snapshots, deduplication may put extents to use before the data is actually written.
Permabit’s Albireo technology operates outside of the normal data storage flow to avoid any performance impact. Referred to as an “advisory service”, this data reduction architecture first checks its own hash table (more metadata) to see whether an incoming data block already exists. If it finds a duplicate segment, it notifies the file system about the possible deduplication opportunity, letting the file system actually make the decision to reference the existing data block and create an extent that maps to the ‘new’ but duplicate version of that data segment. The result is a primary storage deduplication process that maintains data integrity and imposes no performance penalty.
For reads, deduplication, like snapshots, uses extents to map the user to the correct chain of data segments, even though those segments may be shared by other files as well. It is important to note that in this method data is not ‘re-hydrated’ (the reverse of the deduplication process), as doing so would cause a performance impact. Instead, as with a snapshot, the data is simply mapped together using the extent tree. The result is no performance impact.
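The division of labor described here, an index that only advises and a file system that decides, can be sketched roughly as follows. This is not Permabit's API: every class and method name is hypothetical, SHA-256 is an assumed fingerprint, and the byte-for-byte check before sharing a block is an assumption about how a cautious file system might act on a hint.

```python
import hashlib


class DedupAdvisor:
    """Hypothetical advisory index: block fingerprint -> address of a known block."""

    def __init__(self):
        self.index = {}

    def advise(self, data):
        # Returns a candidate block address, or None. It never touches the data path.
        return self.index.get(hashlib.sha256(data).digest())

    def record(self, data, addr):
        self.index[hashlib.sha256(data).digest()] = addr


class FileSystem:
    """Toy file system that owns the blocks and makes the final dedup decision."""

    def __init__(self, advisor):
        self.advisor = advisor
        self.store = {}      # block address -> data
        self.refcount = {}   # block address -> number of references
        self.files = {}      # file name -> ordered list of block addresses

    def write_block(self, name, data):
        hint = self.advisor.advise(data)
        # Assumed policy: accept the hint only if the stored bytes really match.
        if hint is not None and self.store.get(hint) == data:
            addr = hint                  # reference the existing block
        else:
            addr = len(self.store)       # blocks are never freed in this sketch
            self.store[addr] = data
            self.advisor.record(data, addr)
        self.refcount[addr] = self.refcount.get(addr, 0) + 1
        self.files.setdefault(name, []).append(addr)
        return addr

    def read_file(self, name):
        # Reads just follow the mapping; shared blocks need no special handling.
        return b"".join(self.store[addr] for addr in self.files[name])


fs = FileSystem(DedupAdvisor())
fs.write_block("a.txt", b"hello")
fs.write_block("a.txt", b"world")
fs.write_block("b.txt", b"hello")   # duplicate of a block already stored
print(fs.files, len(fs.store))
```

In this sketch a stale or wrong hint costs nothing but a missed opportunity, because the file system falls back to an ordinary write whenever the hint does not check out.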
Writeable Snapshots or Clones
If the concern is that each deduplicated file doesn’t include its own ‘dedicated’ data blocks, there’s another process already in use on primary data sets today that does this as well. Clones, or writeable snapshots, use shared data blocks to create a new ‘original’ file. Unlike simple snapshots, they don’t maintain a link to a ‘golden’ copy or original version of those blocks. Like a new file when first written to the storage system, writeable snapshots are themselves the only copy, but one that’s defined solely by existing data blocks which are managed as extents.
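A clone can be sketched the same way: the new file starts life as nothing more than a second mapping onto blocks that already exist, and either copy can then diverge. The sketch below assumes per-block reference counts and redirected writes; the names are hypothetical and no particular vendor's clone implementation is implied.

```python
class BlockPool:
    """Toy pool of reference-counted data blocks."""

    def __init__(self):
        self.data = {}    # block address -> payload
        self.refs = {}    # block address -> reference count
        self._next = 0

    def alloc(self, payload):
        addr = self._next
        self._next += 1
        self.data[addr] = payload
        self.refs[addr] = 1
        return addr

    def share(self, addr):
        self.refs[addr] += 1

    def release(self, addr):
        self.refs[addr] -= 1
        if self.refs[addr] == 0:
            # Nothing references this block any more, so it can be reclaimed.
            del self.refs[addr], self.data[addr]


class File:
    """A file is only an ordered mapping onto blocks in the pool."""

    def __init__(self, pool, block_map=None):
        self.pool = pool
        self.block_map = block_map if block_map is not None else []

    def append(self, payload):
        self.block_map.append(self.pool.alloc(payload))

    def clone(self):
        # The clone is defined solely by existing blocks; no data is copied.
        for addr in self.block_map:
            self.pool.share(addr)
        return File(self.pool, list(self.block_map))

    def write(self, index, payload):
        # Redirect the write so other files sharing the old block are unaffected.
        old = self.block_map[index]
        self.block_map[index] = self.pool.alloc(payload)
        self.pool.release(old)

    def read(self):
        return b"".join(self.pool.data[addr] for addr in self.block_map)


pool = BlockPool()
original = File(pool)
for chunk in (b"AA", b"BB"):
    original.append(chunk)
clone_file = original.clone()
clone_file.write(0, b"ZZ")
print(original.read(), clone_file.read())
```

Neither file owns ‘dedicated’ blocks here, yet each reads back exactly what was written to it, which is the same property a deduplicated file relies on.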
Extent management can be a complex process, especially at the level it’s applied in snapshots and replication. But it has been proven extremely reliable through years of use by almost every major storage platform and vendor. These sophisticated data management features are expected to be included in storage systems of almost every size. Users demand these features and most systems leverage extent management to provide them.
Deduplication has been in widespread use for almost ten years. It has become a mainstay in data backup and is now becoming more common in primary storage. In this newer implementation, it leverages extent management in much the same way as other array features like snapshots and change-based replication, which have been around for close to twenty years. However, in primary storage there is increased concern over its ability to maintain data integrity. Based on the fundamental similarity to the other, trusted data operations discussed above, and on their ability to operate outside the data path, deduplication technologies like Permabit’s Albireo can leverage this same key process and provide safe, reliable storage for primary (non-backup) data, and do so without any performance penalty.
Eric Slack, Senior Analyst
Permabit Technology is a client of Storage Switzerland