The Canary Wharf, London facility has about 1,000 employees and more than 1,200 virtual machines running on 22 VMware ESX hosts. At any given time, about 60% of these VMs are active. Before its backup redesign initiative, the infrastructure was backed up to ten LTO4 tape drives in three library systems managed with Symantec BackupExec. Primary data storage is provided by an EMC midrange array, with ESX hosts attached via Fibre Channel. Backup tapes were sent off-site through a third-party service.

SunGard is a very conservative company with an excellent track record of customer data protection and reliability; after all, they’re in the disaster recovery services business. These are test and development servers, not customer-facing production systems, but they still needed reliable, comprehensive data protection. Chapman didn’t feel that the existing system gave them the confidence that backups should provide. The backup infrastructure was also fragmented, with several business units running their own backup schedules, so there was no overall view of data loss exposure. The existing tape-based systems were “running flat out”, according to Chapman, and barely keeping up with the volume of data (>60TB total), especially given the number of virtual servers in the environment. Restores were also difficult and very time-consuming.

Traditional tape-based backup systems are relatively complex, since they physically handle every individual file associated with each server. As these systems grow to accommodate larger numbers of servers, their complexity grows too. In a virtual server environment this problem can be acute, and SunGard’s system needed to expand. Operating tape-based backup successfully at large scale typically requires significant investment in automation technology.

For SunGard, backing up so many server instances with a tape-only infrastructure that was over its limit caused frequent problems with missed backup windows and failed backups. When a restore was needed for one of these systems, IT personnel faced a labor-intensive process to find the required tape from the catalogue or to try to recreate the required data. Although the number of missed backups was low in percentage terms, the sheer number of VMs being backed up meant this manual restore process occurred too often. Each of these instances represented a very small risk, but in aggregate they constituted a significant exposure.
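The way small per-VM risks add up can be sketched with simple probability. The VM count below comes from the article; the miss rate is a purely hypothetical figure chosen for illustration (the article gives no percentage), and the function names are likewise invented. The sketch also assumes that each VM's backup fails independently, which real environments may not satisfy.

```python
# Illustrative only: how a low per-VM miss rate becomes a large aggregate exposure.

def expected_missed(vm_count: int, miss_rate: float) -> float:
    """Average number of VMs without a good backup in one cycle."""
    return vm_count * miss_rate


def prob_any_missed(vm_count: int, miss_rate: float) -> float:
    """Chance that at least one VM lacks a good backup in one cycle,
    assuming independent failures with the same rate for every VM."""
    return 1.0 - (1.0 - miss_rate) ** vm_count


VM_COUNT = 1200            # from the article
ASSUMED_MISS_RATE = 0.005  # hypothetical 0.5% per VM per cycle, not from the article

missed = expected_missed(VM_COUNT, ASSUMED_MISS_RATE)
any_missed = prob_any_missed(VM_COUNT, ASSUMED_MISS_RATE)

print(f"Expected unprotected VMs per cycle: {missed:.1f}")
print(f"Probability at least one VM is unprotected: {any_missed:.2%}")
```

Under these assumptions a miss rate that looks negligible for any single VM still leaves several VMs unprotected in nearly every cycle.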

Part-time insurance

The aggregated risk from this not wholly reliable backup of 1000+ VMs was akin to ‘part-time insurance’. There was always a chance that the one VMDK someone needed wouldn’t be available, and that would be discovered only after IT had spent hours on the restore process. According to Chapman, the complexity of tape-based restores made remediation effectively a full-time-equivalent job for the team. In addition to this OpEx cost, there was an opportunity cost in the impact that their extended RTO (24-48 hours) had on software development cycles. Time to market is a competitive advantage for SunGard. They simply needed a better way to back up this large number of virtual machines and provide reliable, efficient restores when called upon. Expanding SunGard’s existing tape infrastructure was one alternative, but not a very feasible one. Adding enough tape drives, library capacity and software to accomplish this objective could, as Chapman said, “break the budget in this competitive company with punishing financial targets.”

SunGard looked at some virtual tape library (VTL) solutions, as well as software-only deduplication systems, but chose an EMC Data Domain DD690 system for the primary data center and another for the DR site because of its capabilities and its integration with NetBackup. They back up via NFS using NetBackup with OpenStorage and VMware’s VCB (VMware Consolidated Backup), and the DD690 then replicates its data to the unit in the DR data center. The Data Domain system gives them a simple, disk-based solution to back up all their virtual machines - plus ‘headroom’ for more data growth. In fact, the backup window has shrunk to 25% of its former length. Chapman said that not only did they skip buying a big tape library, they now need only one library and four LTO4 drives to handle tape offload for archival and portability.

Why Data Domain works

Disk backup is a natural solution for large numbers of VMs, since as many data streams as needed can be written to the same disk file system. Since each VM is essentially a single file, backup consists of copying each VMDK file to the DD690, a much more efficient process than direct-to-tape backup. Deduplication is also especially efficient in a virtual server environment, since new VM image files are created from templates and therefore contain a high percentage of redundant data. This means that the backup target can accommodate more VM images in less space than traditional disk storage, as only net modifications actually consume capacity.
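The effect of deduplication on template-based VM images can be illustrated with a toy content-addressed store. This is a minimal sketch, not Data Domain's actual algorithm: it assumes fixed-size chunks and SHA-256 fingerprints for simplicity, and the class name, chunk size and sample images are all invented for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for this illustration


class DedupStore:
    """Toy deduplicating store: each unique chunk is kept exactly once."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = {}        # fingerprint -> chunk bytes
        self.logical_bytes = 0  # total bytes written by clients

    def write(self, data):
        """Store data and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)  # keep only new chunks
            recipe.append(fingerprint)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)

    @property
    def physical_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


# A hypothetical 100-chunk template, plus two VM images that each differ
# from the template in a single chunk.
template = b"".join(i.to_bytes(4, "big") * 1024 for i in range(100))
vm_a = template[:10 * CHUNK_SIZE] + b"A" * CHUNK_SIZE + template[11 * CHUNK_SIZE:]
vm_b = template[:20 * CHUNK_SIZE] + b"B" * CHUNK_SIZE + template[21 * CHUNK_SIZE:]

store = DedupStore()
recipes = [store.write(image) for image in (template, vm_a, vm_b)]

print("logical bytes written:", store.logical_bytes)
print("physical bytes stored:", store.physical_bytes)
print("dedup ratio: %.2f" % (store.logical_bytes / store.physical_bytes))
```

Three images are written, but only the template's chunks plus the two modified chunks consume space. The more images derived from the same template, the higher the ratio climbs, which is the behaviour the paragraph above describes.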

SunGard now has the capacity and the throughput to back up 100% of their VMs and provide restores directly from disk. For Chapman, this means confidence that every server, physical and virtual, is backed up every cycle. Restores also come from disk in seconds, rather than the hours or days that tape can take.

A competitive advantage

If the previous backup infrastructure represented an opportunity cost through long VM recovery times, the DD690 represents a new competitive advantage. In SunGard’s development environment, projects come and go. The developers always prefer to save the VM image in case it is needed again, since significant effort might be involved in building a depersonalized model environment. While keeping these VMDK files would speed development when a project resumed, it also required large amounts of disk space, which in their environment meant primary disk space at ~$5/GB.

The Data Domain system provides an ideal ‘near-line’ storage system to save VM images until they’re needed. Given the inherent redundancy of VM image files and the high deduplication rates they produce, the effective cost to store them is about 10% of what it would be on primary disk space. The result is a very low cost archive for as many VM images as the developers care to save, and a significant competitive advantage for SunGard.
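The arithmetic behind that comparison is straightforward. The sketch below uses the article's two figures (primary disk at about $5/GB and an effective near-line cost of about 10% of that); the number and size of archived images are hypothetical values picked only to make the example concrete.

```python
# Illustrative cost comparison for archiving dormant VM images.

PRIMARY_COST_PER_GB = 5.00    # from the article (~$5/GB)
NEARLINE_COST_FRACTION = 0.10  # from the article (about 10% of primary)

ARCHIVED_IMAGES = 200   # hypothetical number of saved VM images
IMAGE_SIZE_GB = 40      # hypothetical average VMDK size


def archive_cost(images: int, size_gb: float, cost_per_gb: float) -> float:
    """Cost of keeping `images` VM images of `size_gb` each."""
    return images * size_gb * cost_per_gb


nearline_cost_per_gb = PRIMARY_COST_PER_GB * NEARLINE_COST_FRACTION
primary_total = archive_cost(ARCHIVED_IMAGES, IMAGE_SIZE_GB, PRIMARY_COST_PER_GB)
nearline_total = archive_cost(ARCHIVED_IMAGES, IMAGE_SIZE_GB, nearline_cost_per_gb)

print(f"Effective near-line cost: ${nearline_cost_per_gb:.2f}/GB")
print(f"On primary disk: ${primary_total:,.0f}")
print(f"On near-line dedup storage: ${nearline_total:,.0f}")
```

Whatever the real image counts are, the ratio between the two totals stays at roughly ten to one under the article's figures.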

SunGard also uses DynamicOps Virtual Resource Manager (VRM) to automate the “VM Lifecycle Management” process, allowing their developers to archive VMs easily and then restart them quickly when a project is turned back on. In addition, as detailed in a previous Storage Switzerland article, these VMDKs can be booted directly from the DD690.

Storage Switzerland’s Take

The Data Domain system does what’s expected: it provides a disk component to replace an existing tape backup infrastructure, and does it very well. Tape simply works better for archiving when disk is in the mix, and a deduplicating file storage system is especially effective for backing up highly redundant image files. It also provides space to support restores from disk and to accommodate the often unpredictable growth rates that are common in virtual server environments.

But the DD690 does something else that may be unexpected. It takes the functional disadvantage of having to store and restore very high numbers of backup files and turns it into a competitive advantage. In a virtual server environment where VM templates abound and data redundancy is especially high, the DD690 provides a NAS tier that enables SunGard’s developers to keep whatever they want. They can use the ‘space-time’ equation to turn economical storage space into valuable development time and improve SunGard’s time-to-market performance in the process.

Eric Slack, Senior Analyst

Case Study

Data Domain is a client of Storage Switzerland
