Are backup and power savings ‘MAID’ for each other?

Backup, as a storage activity, is well suited to MAID (massive array of idle disks) technology. When all copies are accounted for, backup typically consumes raw disk capacity equal to several times that of primary storage. This means disk backup systems include a lot of disk drives, some of which could be spun down. Backup is also a cyclical process, scheduled to run during part of each day, usually in the evening. This means backup disk drives are in an inactive or reduced-activity state for most of the business day. This is perfect for MAID, as it provides a regular window in which to spin down drives. Finally, restoring backed-up data doesn’t typically have a low-latency access requirement, unlike some primary storage. This, too, suits a MAID system, as spun-down drives have response and data retrieval times greater than those of traditional, spinning disk drives.

Deduplication is very common in backup systems and is well suited to the backup process, which contains a lot of redundant data. Deduplication can also enable WAN-optimized, off-site replication, which most companies associate with their backup process. Having replication capability in the backup system hardware, along with deduplication, is logical and simplifies the overall data protection system design.

Why not Deduplication AND MAID?

The growing interest in power management and the prevalence of deduplication in disk backup systems raise the question, “Should a disk system combine deduplication and MAID technologies?” Deduplication is certainly a green technology, but it has traditionally been considered so because of its disk capacity savings and space reduction when compared with the alternative of recording entire (uncompressed) backup jobs to disk. Adding MAID power savings to disk deduplication, by powering down drives when they are idle, reduces the system’s power footprint that much more.

New generations of disk array systems, like those from Nexsan, are combining deduplication and MAID technologies to provide even better power, space and cost efficiency, perfect for backup applications. While backup, as a storage process, has characteristics that complement MAID, deduplication also has subprocesses that can potentially enhance MAID performance in a disk array.

Deduplication Processes

Deduplication is generally divided into ‘in-line’ and ‘post-process’ methods. In-line deduplication ingests a data segment and immediately starts the dedupe process on it - as it’s ingesting the next segment. This was the original dedupe methodology and could be called ‘synchronous’, as the dedupe process runs at the same time that data is being ingested. Since meeting the backup window has always been paramount to backup systems, the original dedupe products were architected exclusively to reduce the time required for ingest and dedupe. Once these processes were completed, the backup job was essentially finished - as far as the client servers were concerned - and production on these systems could resume.

The rationale was that there would be ample time outside of the backup window to accomplish the other steps required in the complete dedupe process. These other steps could be called ‘housekeeping’ and run in the background as the next backup job is being processed, or while the system is idle. Deduplication, like other sophisticated data processes, involves a lot of data movement and reorganization as the backed-up data sets change. Space optimization, being the prime directive, requires that obsolete pointers and orphaned data blocks be removed and that the ‘whitespace’ created by the entire process be eliminated.
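The housekeeping step can be illustrated with a toy example. What follows is only a minimal sketch, not how any particular product works: the class name ToyDedupeStore, the fixed 4-byte chunk size and the use of SHA-256 with reference counts are all assumptions made for illustration. It shows that expiring a backup leaves orphaned chunks on disk until a separate sweep removes them.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


class ToyDedupeStore:
    """Hypothetical chunk store: unique chunks keyed by SHA-256, with refcounts."""

    def __init__(self):
        self.chunks = {}     # digest -> chunk bytes (each unique chunk stored once)
        self.refcount = {}   # digest -> number of references from backups
        self.backups = {}    # backup name -> ordered list of digests

    def ingest(self, name, data):
        """Split data into fixed-size chunks and store only unseen chunks."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            recipe.append(digest)
        self.backups[name] = recipe

    def expire(self, name):
        """Drop a backup; its chunks stay on disk until housekeeping runs."""
        for digest in self.backups.pop(name):
            self.refcount[digest] -= 1

    def housekeeping(self):
        """Sweep orphaned chunks (refcount 0) and return how many were removed."""
        orphans = [d for d, n in self.refcount.items() if n == 0]
        for digest in orphans:
            del self.chunks[digest]
            del self.refcount[digest]
        return len(orphans)


store = ToyDedupeStore()
store.ingest("monday", b"AAAABBBBCCCC")
store.ingest("tuesday", b"AAAABBBBDDDD")   # only the DDDD chunk is new
stored_after_ingest = len(store.chunks)
store.expire("monday")                      # CCCC is now orphaned
stored_before_sweep = len(store.chunks)
removed = store.housekeeping()
print(stored_after_ingest, stored_before_sweep, removed, len(store.chunks))
# prints: 4 4 1 3
```

In a real array the sweep touches many drives, which is why it keeps them spinning after the backup itself has finished.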

In addition to this cleanup process, dedupe systems now also feature replication as a method of getting data transferred off-site. The replication process occurs after the deduplication step, since dedupe decreases the bandwidth required to move the data. After replication and housekeeping are completed, the system is idle until the next backup job starts.

Post-process dedupe employs a cache, or ‘landing zone’, for the ingested data. This ‘asynchronous’ process enables the backup job to be written completely to this cache before the dedupe process is even started. Obviously, the ability to cache backup jobs completely, before they’re processed for dedupe, could mean client servers are returned to production sooner. But it also means more physical disk capacity is required to cache the backups. Since the data is cached first, the system doesn’t need to be architected to favor ingest and dedupe speed in order to keep the backup window to a minimum. After the ingest and dedupe phases, post-process systems also go through a ‘housekeeping’ phase, where data is rearranged and space is reclaimed so that the system is ready for the next data ingest cycle.

After the housekeeping phase, replication can take place - if it’s scheduled - just as with in-line systems. After that process, the system is ‘idle’, and in a state in which its drives can be spun down to accommodate a MAID architecture. Obviously, a dedupe system that’s optimized for MAID operation would maximize this idle phase, lowering operational costs.
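The arithmetic behind the idle phase is simple enough to sketch. The function below and the phase durations fed to it are hypothetical figures chosen for illustration, not measurements from any system, and the sketch assumes the phases run one after another within a 24-hour cycle.

```python
def idle_hours(ingest_dedupe, housekeeping, replication, cycle=24.0):
    """Hours per cycle left over for spin-down once all phases have run."""
    busy = ingest_dedupe + housekeeping + replication
    if busy > cycle:
        raise ValueError("phases exceed the cycle length")
    return cycle - busy


# Hypothetical durations in hours; only the housekeeping phase differs.
baseline = idle_hours(ingest_dedupe=6.0, housekeeping=4.0, replication=2.0)
shorter = idle_hours(ingest_dedupe=6.0, housekeeping=1.0, replication=2.0)
print(baseline, shorter)
# prints: 12.0 15.0
```

Under these assumed numbers, trimming three hours of housekeeping adds three hours in which drives could be spun down each day.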

Dedupe and MAID

This is where the next generation of dedupe differs from the original in-line or post-process technologies. These ‘MAID aware’ systems discard pointers and orphaned data throughout the entire process, organizing data blocks better during the ingest and dedupe phases. This serves to shorten the housekeeping cycle significantly. While it doesn’t really improve backup performance, per se, it does free up more disk drives sooner for the idle phase, where they can be spun down. These systems can also schedule this reclamation process and run it once a week or once a month, for example. This can consolidate the housekeeping process and further improve MAID effectiveness.
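One way to picture discarding orphaned data ‘throughout the entire process’ is a store that frees a chunk the moment its last reference goes away, so nothing is left for a later sweep. This is a sketch of that general idea only, not a description of any vendor’s implementation; the class EagerReclaimStore, its chunk size and its reference counting are assumptions for illustration.

```python
import hashlib


class EagerReclaimStore:
    """Hypothetical chunk store that reclaims space during ingest, not afterwards."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}    # digest -> (chunk bytes, reference count)
        self.backups = {}   # backup name -> list of digests

    def _release(self, digest):
        data, refs = self.chunks[digest]
        if refs == 1:
            del self.chunks[digest]            # reclaimed now, no later sweep
        else:
            self.chunks[digest] = (data, refs - 1)

    def ingest(self, name, data):
        """Store a backup; re-using a name releases the previous generation."""
        old_recipe = self.backups.get(name, [])
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            _, refs = self.chunks.get(digest, (chunk, 0))
            self.chunks[digest] = (chunk, refs + 1)
            recipe.append(digest)
        self.backups[name] = recipe
        # Release the old generation only after the new one holds its references.
        for digest in old_recipe:
            self._release(digest)


eager = EagerReclaimStore()
eager.ingest("server1", b"AAAABBBBCCCC")
eager.ingest("server1", b"AAAABBBBDDDD")   # CCCC is freed during this ingest
chunks_left = len(eager.chunks)
print(chunks_left)
# prints: 3
```

The trade-off in this sketch is a little extra bookkeeping during ingest in exchange for having no orphaned chunks waiting when the backup finishes.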

With the focus on cost containment and issues with data center power consumption, the design of disk-intensive data storage systems, like disk backup, needs to be optimized for power usage. MAID is a technology that can improve the power consumption profile of the typical disk backup system significantly, given the large number of spindles and extended periods of inactivity common with backups. While deduplication can help by reducing spindle count, it can also increase data movement and drive activity, as it adds a housekeeping process that must run before disks are idle and available for MAID spin-down.

The net of this is that traditional deduplication systems aren’t tuned to reduce these housekeeping processes and maximize the drive idle time needed for MAID to be effective. Next-generation deduplication systems are architected to minimize these back-end data movement and organization tasks, thus idling drives sooner. This pairing of MAID and ‘MAID aware’ deduplication in a backup system can provide another way to further reduce power costs in the data center. When coupled with the existing benefits of deduplication - 20:1 data reduction, diminished rack space and management requirements - power-managed deduplication systems can make disk backup an even more attractive solution.

Eric Slack, Senior Analyst

Nexsan is a client of Storage Switzerland