The occurrence of any of these failures does not mean that the entire SSS has gone bad; it merely means that a section has failed. At that point it’s up to the intelligence built into the SSS to deal with that failure.

The most common type of failure is a block failure, and most SSS handle this with some form of error-correcting code (ECC). Most enterprise SSS can also handle the third most common type of failure (multiple block failures, which amount to the failure of an entire flash memory module) by using some form of RAID algorithm. However, protection strategies typically ignore the second most common type of failure: the failure of a chip or plane.

Why Protecting against Plane Failure is Important

Most SSS, especially enterprise-class systems, are made up of dozens or even hundreds of flash chips. If a plane failure occurs, only 1/8th of one chip is actually impacted. However, the protection strategies employed by most solid state manufacturers mark the whole chip as bad, not just the plane. This is akin to having a bad sector on a mechanical drive and requiring the operating system to mark the whole platter as bad! Of course, in both cases a manufacturer’s warranty would likely be in place to come to the user’s rescue.
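To put rough numbers on that comparison, here is a minimal sketch of the arithmetic. It assumes eight planes per chip (implied by the 1/8th figure above) and a hypothetical 100-chip system; the function name and parameters are illustrative only and do not describe any vendor's firmware.

```python
from fractions import Fraction


def capacity_lost(total_chips: int, planes_per_chip: int = 8,
                  retire_whole_chip: bool = True) -> Fraction:
    """Fraction of raw capacity retired after a single plane failure.

    planes_per_chip=8 is an assumption taken from the "1/8th of one chip"
    figure in the text; real parts vary.
    """
    if retire_whole_chip:
        # Chip-level policy: the whole chip is marked bad.
        return Fraction(1, total_chips)
    # Plane-level policy: only the failed plane is marked bad.
    return Fraction(1, total_chips * planes_per_chip)


# Hypothetical system with 100 flash chips.
chip_level = capacity_lost(100, retire_whole_chip=True)
plane_level = capacity_lost(100, retire_whole_chip=False)
print(chip_level, plane_level, chip_level / plane_level)
```

Under these assumptions, a chip-level policy retires eight times as much capacity per plane failure as a plane-level policy would.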

While a warranty is important, the act of replacement causes its own issues. The biggest is time. The SSS module would have to be taken out, the replacement put into service, and the drive repopulated with data. It’s important to keep in mind that SSS is typically installed in performance-demanding, revenue-generating production environments where maintenance windows come very sparingly, usually at late hours when IT personnel would rather be sleeping and when you’d have to pay costly overtime rates. The issue is not just the time to physically make the change but also the time required to repopulate the SSS.

Again, these are performance-sensitive environments, and the applications that use them are counting on that performance to meet the demands and expectations of users. Despite the fact that these are flash-based systems, the time that it takes to copy 1 TB of mission-critical, performance-dependent data to them may cost the organization millions in lost productivity because of lower performance while the replacement happens. Also, while several flash controllers have reduced the time required to write to flash-based storage, writing is still the slowest operation that any solid state device (other than DRAM) performs.

Enter Plane Level SSS Protection

The first company to do anything about plane-level failures is Texas Memory Systems. Their newly patented Variable Stripe RAID™ (VSR™) allows for continued operation when a flash plane fails, allowing for a longer mean time between failures (MTBF). Up until now, most enterprise SSS used a RAID-like technology to protect against chip failure, where the flash media is typically grouped into stripes containing an equal, and fixed, number of chips. If all or part of a chip fails, the RAID-like technology can be used to reconstruct the data from the failed chip and place it on a new chip from a reserve of spare chips in the SSS. This works fine until the SSS is out of spare chips. Since the number of chips per stripe cannot be altered, the replacement process has to occur once the system is out of spare chips. Critically, this also means that if only a small part of the chip has failed, the whole chip has to be replaced, wasting the rest of the capacity on that chip.

VSR allows the number of chips per RAID stripe to be variable, meaning that if a chip fails, the size of the stripe can be adjusted to use the remaining chips. Essentially, bad chips are bypassed by VSR, allowing for long life even after the SSS is out of spare chips. The other interesting component of VSR is the granularity of the stripe: if only a small part of a chip fails, the rest of the capacity of that chip can still be used. The combination of these two capabilities leads to a significantly longer MTBF and a lower likelihood of going through the costly replacement process.
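The general idea of a stripe that shrinks around a failed plane can be illustrated with a toy model. The sketch below is not Texas Memory Systems' VSR implementation, whose internals are not described here; it assumes simple single-parity XOR striping, and the plane names and the helpers write_stripe and read_stripe are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def write_stripe(data, planes):
    """Spread data over all planes but the last, which holds XOR parity."""
    n_data = len(planes) - 1
    size = -(-len(data) // n_data)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(n_data)]
    stripe = dict(zip(planes, chunks))      # data blocks
    stripe[planes[-1]] = xor_blocks(chunks)  # parity block
    return stripe


def read_stripe(stripe, planes, length, failed=None):
    """Reassemble the data, rebuilding one failed plane from the survivors."""
    blocks = dict(stripe)
    if failed is not None:
        survivors = [blocks[p] for p in planes if p != failed]
        blocks[failed] = xor_blocks(survivors)
    return b"".join(blocks[p] for p in planes[:-1])[:length]


# Hypothetical layout: plane IDs are "chip/plane" labels.
planes = ["c0p0", "c0p1", "c1p0", "c1p1", "c2p0"]
data = b"mission critical data"
stripe = write_stripe(data, planes)

# Plane c0p1 fails: rebuild its block from the survivors, then re-stripe
# across the remaining planes instead of consuming a spare chip.
recovered = read_stripe(stripe, planes, len(data), failed="c0p1")
healthy = [p for p in planes if p != "c0p1"]
narrow = write_stripe(recovered, healthy)
print(read_stripe(narrow, healthy, len(data)) == data)
```

In this toy model the stripe simply becomes one member narrower, and plane c0p0 on the same chip stays in service, which mirrors the two properties described above: a variable stripe width and plane-level granularity.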


Performance-sensitive production systems that upgrade to memory-based storage quickly become dependent on that storage for operation. Having to "fail over" to mechanical storage while the SSS is replaced is essentially equivalent to being down. In addition, there is a laundry list of things that can go wrong during any product replacement. The better option is to leverage self-healing storage technologies like VSR that automatically work around failed components and help you avoid replacing the device in the first place.

Texas Memory Systems is a client of Storage Switzerland

George Crump, Senior Analyst
