Enterprise SSD Requires Reliability

Before a flash memory cell can be written to, it must first be cleared of old data. This process includes reading the cell to confirm that it actually contains old data, and then resetting every bit in the cell to the ‘empty’ erased state (1’s in NAND flash). This is called an “erase cycle” and it is the cornerstone of flash reliability concerns, as every cycle shortens the life expectancy of the memory. The solid state memory that is installed in enterprise-class sustainable storage systems must accommodate a very high number of these erase cycles in order to maintain reliability.

To address the flash reliability issue, some systems use single-level cell (SLC) technology, which writes only a single bit of data per memory cell and typically supports about 100,000 erase cycles. This provides the highest number of erase cycles but does so at a premium price point, the factor that has typically relegated SSD to the role of a niche performance solution. At the other end of the erase cycle curve is multi-level cell (MLC) technology, which writes two bits of data into each cell and is typically rated at 5,000 erase cycles. As NAND manufacturing geometries continue to shrink from 3x nm to 2x nm processes (such as 25 nm), maintaining erase cycles in MLC becomes even more challenging, reducing the longevity to 1,000 to 3,000 cycles. While the additional bit per cell increases capacity and reduces the cost of the SSD, the significantly lower erase cycle rating means a higher probability of failure. To provide high reliability at an acceptable cost, Enterprise MLC (EMLC), which is similar in capacity to MLC, has emerged, providing a significantly higher rating of over 30,000 erase cycles which, especially for sustainable storage systems, may strike a perfect balance.

Sustainable Storage Has Different Write Patterns

Today the most common enterprise uses of solid state storage are in very high I/O or high transaction environments, like a cache. Caches are small, keeping costs down, but typically have a very high turnover rate, meaning that they’re constantly being emptied (erased) and refilled (written to). Cache is dynamic and its contents will change to reflect the data being accessed at that moment in time. Its relatively small size means that the same cells are written to over and over again. As a result, getting a reasonable life expectancy out of these devices requires the extreme endurance of SLC flash memory.

Sustainable storage, on the other hand, keeps all of its data on solid state storage, a storage area that is comparatively large and consistent. Since the sustainable storage system is where the data lives, as opposed to being a temporary area like cache, the data stored is not constantly erased and rewritten. As a result, an intelligent controller can spread those writes out across more flash memory modules, decreasing the number of erase cycles consumed per cell. Therefore, while enterprise solid state storage needs higher levels of reliability than MLC, it does not need the extreme reliability of SLC. EMLC modules like those used in Nimbus’ S-Class sustainable storage systems provide the perfect balance of affordability and reliability.

Observed another way, assume a 10TB Sustainable Storage System, such as the Nimbus S-Class. With 30,000 erase cycle NAND throughout the system, one could write 10TB of data 30,000 times, or 300,000 TB, before the NAND cells would wear out. That’s roughly 165 TB of data written per day, 365 days per year, for 5 years. Most IT managers would agree that this level of reliability is ample in an enterprise application. However, with 25 nm MLC rated at 3,000 erase cycles, this write load would burn out all the NAND memory in a mere 6 months. Thus, unless customers expect to write more than 16x the capacity of their storage array in fresh data per day, for over 5 years, EMLC provides adequate protection, as long as write I/O is evenly balanced across the system. This is where the flash controller plays a vital role.
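The endurance arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes perfectly even wear across all cells and ignores write amplification, which a real controller cannot fully avoid.

```python
DAYS_PER_YEAR = 365


def total_writable_tb(capacity_tb, erase_cycles):
    """Total data that can be written before every cell hits its rating,
    assuming writes are spread perfectly evenly (an idealization)."""
    return capacity_tb * erase_cycles


def sustainable_tb_per_day(capacity_tb, erase_cycles, years):
    """Daily write volume that uses up the rated cycles in exactly `years`."""
    return total_writable_tb(capacity_tb, erase_cycles) / (years * DAYS_PER_YEAR)


def days_until_worn_out(capacity_tb, erase_cycles, tb_per_day):
    """How long the array lasts under a constant daily write load."""
    return total_writable_tb(capacity_tb, erase_cycles) / tb_per_day


# 10 TB array built from 30,000-cycle EMLC, five-year service life
emlc_total = total_writable_tb(10, 30_000)
emlc_rate = sustainable_tb_per_day(10, 30_000, 5)

# Same array and same daily load, but 3,000-cycle 25 nm MLC
mlc_days = days_until_worn_out(10, 3_000, emlc_rate)

print(f"EMLC lifetime writes: {emlc_total:,} TB")
print(f"Sustainable load:     {emlc_rate:.1f} TB/day")
print(f"Full-array writes:    {emlc_rate / 10:.1f} per day")
print(f"MLC under that load:  {mlc_days:.1f} days")
```

The sustainable load works out to about 164 TB per day, or a little over 16 full-array writes per day, and the 3,000-cycle MLC lasts about 182 days under the same load, which matches the six-month figure.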

Intelligent Flash Controller

Having the right type of flash for the workload is important, but having a controller to exploit the advantages of that flash and minimize its weaknesses is equally important. As mentioned above, one of the key responsibilities of the flash controller is to spread out the write pattern to the flash memory so that writes are distributed evenly across all cells. This capability, known as “wear leveling”, makes sure that one cell does not get written to over and over, causing it to wear out prematurely. The more memory cells available per flash controller, which is the case with sustainable storage, the more distributed that writing can be and the longer the life expectancy of each individual cell will be. In addition, in the case of a cell failure, the excess cell capacity, known to insiders as “overprovisioning”, makes it easy to find an alternate location to write the data. Sustainable storage systems, like the Nimbus S-Class, have 28% overprovisioned spare capacity built into the system to accommodate this failover. Think of it as a hot spare that takes over without anyone having to replace a drive. In sustainable storage systems, this extra capacity is transparent and does not take away from the advertised storage capacity of the system.
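To make the wear leveling idea concrete, here is a minimal Python sketch. It is not how any particular controller, including the one in the Nimbus S-Class, is implemented; the `WearLeveler` class is hypothetical and simply sends each write to whichever block has been erased the fewest times so far.

```python
import heapq


class WearLeveler:
    """Toy allocator: every write goes to the least-erased block."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id); least-worn block is on top.
        self._heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._heap)

    def write(self):
        """Pick the least-worn block, charge it one erase cycle, return its id."""
        count, block = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (count + 1, block))
        return block

    def max_erase_count(self):
        """Erase count of the most-worn block."""
        return max(count for count, _ in self._heap)


small = WearLeveler(num_blocks=10)    # cache-sized pool of blocks
large = WearLeveler(num_blocks=100)   # much larger pool behind one controller
for _ in range(1000):
    small.write()
    large.write()

print(small.max_erase_count())  # 100 erase cycles on the busiest block
print(large.max_erase_count())  # 10 erase cycles on the busiest block
```

The same 1,000 writes cost each block ten times fewer erase cycles when the pool is ten times larger, which is the argument for having more memory cells per flash controller.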

Intelligent flash controllers can also reduce the time required for the erase cycle described above. The controller pre-scans the flash memory to remove old data so that when a write occurs, half of the erase cycle process will have already been performed. If this process, often called “garbage collection”, can stay ahead of inbound writes, then performance will be consistent as an application writes data to the storage. If not, the application can experience performance slowdowns and may actually appear to pause. Many traditional SSDs without overprovisioned capacity suffer noticeable performance degradation as they fill up with data. In the enterprise, predictable performance is essential, which makes overprovisioning an essential feature.

The amount of contiguous memory available also impacts how quickly the garbage collection process can complete. Similar to how disk defragmentation takes longer as a disk fills up, garbage collection takes longer when there is less flash memory available to erase and reuse. The advantage of sustainable storage, since it is 100% solid state, is that there is typically more addressable memory per flash controller than in conventional hybrid systems.
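The race between garbage collection and inbound writes can be illustrated with a deliberately simple simulation. Everything here is an assumption made for illustration: the `count_stalled_writes` function, the idea of counting in blocks per time tick, and the simplification that background collection always finds stale blocks to reclaim.

```python
def count_stalled_writes(write_rate, gc_rate, free_blocks, ticks):
    """Count writes that arrive when no pre-erased block is ready.

    write_rate  -- blocks the application writes per tick (hypothetical unit)
    gc_rate     -- blocks background garbage collection pre-erases per tick
    free_blocks -- pre-erased spare blocks available at the start
    """
    stalled = 0
    for _ in range(ticks):
        free_blocks += gc_rate           # background collection frees blocks
        if write_rate > free_blocks:     # not enough erased blocks on hand
            stalled += write_rate - free_blocks
            free_blocks = 0
        else:
            free_blocks -= write_rate
    return stalled


# Garbage collection keeps pace with writes: no stalls at all.
print(count_stalled_writes(write_rate=4, gc_rate=4, free_blocks=8, ticks=100))   # 0

# Collection falls behind: stalls begin once the spare pool is drained.
print(count_stalled_writes(write_rate=4, gc_rate=3, free_blocks=8, ticks=100))   # 92

# A larger spare pool postpones the slowdown but does not remove it.
print(count_stalled_writes(write_rate=4, gc_rate=3, free_blocks=28, ticks=100))  # 72
```

In this toy model, spare capacity buys time when collection falls behind, but only a collection rate that matches the write rate keeps performance consistent.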

Flash RAID

Even with the reliability of the NAND components and the intelligence of the flash controller, there is still a slight chance that a component may fail. If that occurs, enterprise-grade solid state storage systems should be protected by RAID technology similar to that used in mechanical hard drive-based systems. A solid-state-only system, as it does with other storage challenges, makes RAID better. RAID algorithms are struggling to keep pace with the ever increasing capacity of mechanical hard drives. Today when a drive fails it can take double-digit hours, if not days, for a rebuild to complete. A comparable sustainable storage system, based on a multitude of flash blades, typically rebuilds in less than an hour and is less affected by capacity than legacy HDD-based systems are.

Flash RAID also provides system administrators flexibility in RAID protection levels because performance between the different RAID levels is virtually identical. This allows them to choose the greater redundancy of double or even triple parity and see practically zero performance impact.
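For readers unfamiliar with how parity protection rebuilds lost data, the following Python sketch shows the basic single-parity (RAID 5 style) XOR technique. It is a generic illustration, not the Nimbus implementation, and it does not cover the double or triple parity schemes mentioned above, which need additional math. The `xor_blocks` helper and the sample data are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on a different flash blade.
data = [b"flash", b"blade", b"array"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Suppose the blade holding data[1] fails. XOR-ing the surviving
# data blocks with the parity block recovers the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

print(rebuilt)  # b'blade'
```

A rebuild is simply this calculation repeated across the whole failed device, so the time it takes is dominated by how fast the surviving devices can be read, which is where flash has the advantage over mechanical drives.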

System Redundancy

All of this redundancy to provide data protection is useless if the system itself fails. As is the case with conventional enterprise storage systems, sustainable storage needs to provide protection from a component failure. This means multiple paths to the storage network infrastructure, redundant power and cooling, and redundant controllers. It should also provide redundant paths to the flash modules themselves so that data access can continue despite a drive path failure.

The Reliability of Power Consumption

One of the well-known benefits of a solid state storage system is its power efficiency. This is where the sustainable storage moniker comes from. Efficient power consumption, however, is often thought of only in terms of dollars saved through lower electric bills. While that is certainly a positive, increased power efficiency has another benefit as well. According to a 2007 study by Google, which ranks among the largest consumers of storage in the world, heat and vibration are leading causes of drive failure in mechanical systems. Sustainable storage systems run significantly cooler and have no moving parts other than the fans, so there is no drive vibration to reduce drive life expectancy. This means the two biggest culprits in mechanical drive failure simply don’t exist in a solid-state-only storage system.

Enterprise IT managers expect their storage systems to provide near 100% uptime. A concern about implementing solid state storage in those environments has traditionally been its reliability. A solid-state-only storage system, like the Nimbus S-Class, not only addresses the concerns specific to solid state storage but can also address some of the traditional reliability issues that plague mechanical drives, such as RAID rebuild times and heat and vibration related failures. Solid-state-only storage should be considered not only because of its obvious performance advantages but also for its ability to improve reliability, availability and serviceability.

Nimbus Data Systems is a client of Storage Switzerland

George Crump, Senior Analyst