Disk storage is still a critical component in IT infrastructures. The densities now common in disk array systems - upwards of 60 disk drives in a single, rack-mounted 4U or 5U enclosure - have driven home the need for reliable array designs, made by experienced companies using reliable components. However, the expansion in the number of companies producing disk array hardware means some are new to the space, and their products may be largely untested for primary storage, and especially for Highly Available (HA) storage. At the same time, the desire for cost reduction has pushed readily available commodity hardware into more storage applications, including places where it may not provide the reliability that’s needed.

Reducing storage expenses is a worthwhile goal, but if that reduction comes at the cost of reliability, it may not be a good trade-off. Care must be taken when evaluating disk array systems to understand the design factors that affect reliability, so that failure-prone products aren’t bought. The impacts of using commodity hardware for primary storage should be carefully considered, especially when truly HA, yet economical, storage systems are available.

The companies producing these HA systems, such as Nexsan with its SATABeast and SASBeast, understand the details of producing highly reliable storage systems. This includes component choice and the design and manufacturing processes, but also the commitment required throughout the organization to produce reliable storage products.

Reliability is a function of many small factors in disk array design, in component selection and testing, and in the manufacturing process. It improves incrementally, as more of these factors are addressed one by one. This requires several core abilities on the part of manufacturers. For one, they have to be able to identify these factors, many of which come to light only after years of experience and accumulated data in the business of building disk arrays. Next, they must be committed to pursuing these details and sticking with the often tedious processes involved, especially when there is pressure to control costs and these incremental measures are easy targets for budget cutting.

Similarly, the manufacturer must have the discipline to ‘stay the course’ of quality and reliability throughout the organization. This includes delaying a release when the product is overdue but testing shows it’s not ready. It can also mean rejecting a component change that could save some cost, because the change would mean losing historical statistics about the existing components and could thereby affect reliability. Discipline can also mean taking ownership of the product in the field, even when it’s not implemented in a supported configuration. Although ‘hiding behind the configuration matrix’ and not addressing these issues may save money and reduce workload, it also eliminates an opportunity to learn more about the product and its behavior in actual use.

On the product side, reliability starts with design and component selection. Because reliability improves incrementally, it depends on using components designed to meet the demands of storage arrays, not the usage profile of another application, like servers or PCs. This can be one of the design shortcomings of commodity hardware. For example, power supplies and fans must be designed to handle the loading profile that storage arrays subject them to. Drives must be designed to operate upright, in a rack, instead of flat, as they are in a server. They must also be designed for the failure analysis processes used in high reliability manufacturing, meaning they’re more apt to degrade predictably instead of suffering abrupt, catastrophic failure.

Using the right disk drive is a key factor in a reliable disk array, since the drives are the moving parts. All array manufacturers have access to the same drive suppliers; the trick is to design the right models into the system and know how to weed out the drives most likely to fail in the field. This means using only the top quality drive models and only those designed for the specific environment they’ll be used in. For rack mounted arrays, this includes drives with tied-shaft motors, which can operate more reliably on their sides than typical computer hard drives. The manufacturer must understand what makes a successful qualification process and must test drives in the actual chassis they’ll be running in, not individually. The QA process must be designed to find the weaker individual drives: it must test every drive (no sampling) and there must be ‘zero tolerance’ for failures. Again, commodity hardware wasn’t designed for this level of stress and isn’t tested to uncover these levels of failure.

The controller also has a foundational role to play in array reliability. General-purpose operating systems and firmware aren’t typically written to handle component failure gracefully, as the controller software in a high availability disk array must be. They also contain many more lines of code, adding complexity and cost for maintenance and revisions. A larger code base also means larger and more expensive CPUs are needed to run the controller and still deliver the required performance.

From a mechanical perspective, highly reliable storage systems need to address issues of weight and vibration, especially in the dense configurations common today. For a rack mounted array, 42 or 60 disk drives can cause problems with weight, balance and vibration when they’re put into a single drawer. Dividing these high density array enclosures into ‘sub-drawers’ or ‘active’ drawers can alleviate these issues, since each drawer moves only a smaller number of drives. Also, using heavier gauge steel and engineered plastics can reduce mechanical failures and help absorb vibration. Another design element that improves reliability is mounting disk drives back to back, in counter-rotating pairs. This technique allows the gyroscopic forces generated by the spinning disk drives to largely cancel each other out, reducing vibration and overall mechanical stress.

In array designs, controlling heat is always a priority. This is even more of a potential issue with high-density arrays, where physical designs must allow the airflow required to cool up to 60 drives in a single chassis. Beyond removing heat, preventing it in the first place is perhaps more important, since that means reducing energy usage. Power-efficient designs, while saving costs, can also improve reliability, as less heat means fewer failures.

Technologies like MAID (Massive Array of Idle Disks), which intelligently spin down drives when they’re not in use, can significantly reduce power usage, as can efficient firmware designs that run on lower-power CPUs. Such power management features have evolved to allow greater automation based on user-defined policies, resulting in greater applicability in performance-sensitive environments.
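
As a rough illustration of what a user-defined spin-down policy might look like, here is a minimal Python sketch. It does not represent any vendor’s firmware; the names `Drive`, `SpinDownPolicy` and `drives_to_spin_down`, along with the idle threshold and the "keep some drives spinning" setting, are hypothetical and chosen only to show the idea of trading power savings against readiness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Drive:
    """Hypothetical view of one drive slot as a controller might track it."""
    slot: int
    last_io: float          # time of the most recent I/O, in seconds
    spinning: bool = True


@dataclass
class SpinDownPolicy:
    """Hypothetical user-defined policy (both settings are assumptions)."""
    idle_seconds: float     # how long a drive must sit idle before spin-down
    min_spinning: int = 0   # drives kept spinning for performance-sensitive use


def drives_to_spin_down(drives: List[Drive], policy: SpinDownPolicy,
                        now: float) -> List[int]:
    """Return the slots that the policy would spin down at time `now`."""
    spinning = [d for d in drives if d.spinning]
    # Candidates are spinning drives idle for at least the threshold,
    # longest-idle first.
    idle = sorted(
        (d for d in spinning if now - d.last_io >= policy.idle_seconds),
        key=lambda d: d.last_io,
    )
    # Never drop below the number of drives the policy keeps ready.
    allowed = max(0, len(spinning) - policy.min_spinning)
    return [d.slot for d in idle[:allowed]]


if __name__ == "__main__":
    shelf = [Drive(0, last_io=0.0), Drive(1, last_io=900.0), Drive(2, last_io=100.0)]
    policy = SpinDownPolicy(idle_seconds=600.0, min_spinning=2)
    print(drives_to_spin_down(shelf, policy, now=1000.0))
```

Raising `min_spinning` favors response time over power savings, which is the kind of trade-off a policy-driven design leaves to the administrator.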

High availability designs require redundancy to eliminate single points of failure within the system. This means redundant power supplies and fans, and mirrored, battery-backed cache. Aside from these components, the motherboard can be another single point of failure. Similar to the active drawers used to reduce mechanical stress, the motherboard can also be split into sections that each service a subset of disk drives.
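
The effect of redundancy can be shown with some simple arithmetic. The Python sketch below assumes that component failures are independent and that a redundant group stays up as long as at least one member is working. Both are simplifying assumptions, and the 99% availability figure is purely illustrative, not a measured value for any product.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group that works if at least one copy works.

    Assumes independent failures: the group is down only when every
    copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    single = redundant_availability(0.99, copies=1)   # one power supply
    paired = redundant_availability(0.99, copies=2)   # redundant pair
    print(f"single: {single:.4f}  redundant pair: {paired:.4f}")
```

Under these assumptions, a second copy of a component cuts its contribution to downtime by a large factor, which is why any part left without a redundant partner tends to dominate the overall failure rate.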

More companies are looking to lower storage costs by using increasingly less expensive disk arrays. In response, more array suppliers are springing up and the concept of ‘commodity hardware’ is making the practice more accepted. However, the reliability of these products may not be what it’s assumed to be, or what’s needed, especially as more critical applications find their way onto this class of storage. Often, these products don’t support the kind of design and manufacturing technology needed for adequate reliability.

IT must understand the basic elements of building reliable disk arrays in order to make accurate evaluations of array vendors and choose appropriate storage systems. Disk array manufacturers must have the commitment to building reliable products, plus the knowledge of the right design and manufacturing processes to put out reliable disk array systems.

Eric Slack, Senior Analyst

This Article Sponsored by Nexsan