Why Flash Wears Out and How to Make it Last Longer

The life expectancy and durability of flash-based storage are common topics in presentations involving solid-state drive (SSD) products. While flash endurance is a prime factor in the economics of this technology, the audience is seldom given an explanation of why NAND flash degrades with normal usage and eventually wears out, or what can be done to address it. In this article we’ll look at the endurance of flash devices and how advancements in SSD controller technology can increase a drive’s lifespan while improving its economics in the process.

The original flash technology, SLC (Single-Level Cell) NAND, stores one data bit (single level) in each cell. To improve bit density and increase capacity, MLC (Multi-Level Cell) NAND was developed, which can store two bits of information in the same physical location where SLC stores one. This makes MLC roughly half the price of SLC on a per-gigabyte (GB) basis. However, MLC has other issues.

MLC has Lower Endurance

All flash devices can sustain a finite number of writes and erasures, also called program/erase cycles, or P/E cycles. Partly because MLC media stores twice the information in the same physical space, its maximum endurance (measured in P/E cycles) is significantly lower than that of SLC media, by roughly a factor of 10. In NAND flash storage, cost is a function of endurance: the faster a device wears out, the more often it must be replaced. This endurance issue reduces the cost advantage MLC enjoys.

A common way to express lifespan is in Total Bytes Written (TBW), which is the product of the rated P/E cycles and the capacity of the flash device itself. Aside from MLC’s lower TBW, some characteristics of flash storage applications can exacerbate its endurance problem. Processes like caching and storage tiering generate more P/E cycles, hastening wear-out. Also, the shrinking lithographies inherent in the evolution of semiconductor-based products, like flash memory, increase bit errors, which further reduces useful life.
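The TBW relationship, and its effect on the SLC-versus-MLC cost comparison, can be worked through with a few lines of arithmetic. The sketch below is illustrative only: the 200 GB capacity and the 100,000-cycle SLC rating are assumed round numbers, not vendor specifications, and the relative prices simply encode the "half the price" and "factor of 10" figures given earlier.

```python
GB = 10**9
TB = 10**12

def total_bytes_written(pe_cycles: int, capacity_bytes: int) -> int:
    """Lifespan estimate: rated P/E cycles multiplied by device capacity."""
    return pe_cycles * capacity_bytes

# Assumed, illustrative figures (not taken from any datasheet).
capacity = 200 * GB
slc_pe_cycles = 100_000
mlc_pe_cycles = slc_pe_cycles // 10      # endurance lower by roughly a factor of 10

slc_tbw = total_bytes_written(slc_pe_cycles, capacity)
mlc_tbw = total_bytes_written(mlc_pe_cycles, capacity)

# Relative purchase price for the same capacity: MLC is roughly half of SLC.
slc_price, mlc_price = 1.0, 0.5

# Cost per terabyte actually written over the life of the device.
slc_cost_per_tb = slc_price / (slc_tbw / TB)
mlc_cost_per_tb = mlc_price / (mlc_tbw / TB)
cost_ratio = mlc_cost_per_tb / slc_cost_per_tb

print(f"SLC lifetime writes: {slc_tbw // TB} TB")
print(f"MLC lifetime writes: {mlc_tbw // TB} TB")
print(f"MLC cost per TB written relative to SLC: {cost_ratio:.1f}x")
```

Under these assumptions, half the purchase price combined with one tenth of the endurance leaves MLC about five times more expensive per byte written, which is why extending MLC endurance matters so much to its economics.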

Better MLC Endurance Improves Economics

One way to improve the economics of flash technology is to increase the lifespan of MLC devices. What’s needed is near-SLC endurance at an MLC cost. eMLC (enterprise MLC) media is a technology designed to do this, offering a 3x increase in TBW. However, this comes at the expense of slower performance than MLC and a higher cost. Rather than switching to different media such as eMLC, a better solution is to use intelligent SSD controller technology to make existing MLC last longer.

NAND flash stores information by controlling the number of electrons in a region called a “floating gate”. These electrons change the conductive properties of the memory cell (the gate voltage needed to turn the cell on and off), which in turn is used to store one or more bits of data in the cell. This is why the ability of the floating gate to hold a charge is critical to the cell’s ability to reliably store data.

Write and Erase Processes Cause Wear

When a cell is written to and erased during the normal course of use, the oxide layer separating the floating gate from the substrate degrades, reducing the cell’s ability to hold a charge for an extended period of time. Each solid-state storage device can sustain a finite amount of degradation before it becomes unreliable, meaning it may still function but not consistently. The number of writes and erasures (P/E cycles) a NAND device can sustain while still maintaining a consistent, predictable output defines its endurance.

In the analog recording world, the concept of Signal-to-Noise Ratio (SNR) is used to describe the quality or condition of a recording. This refers to the strength of the signal that carries the information compared to the random artifacts (or noise) that are generated by the recording media and the electronic components in the system. The SNR of a recording must be at a certain level in order for it to be replayed with any degree of quality. When a recording degrades, as happens when the metal oxide layer of a magnetic tape gets old, its SNR decreases. There are analog filters that can be applied in order to ‘clean up’ a noisy recording, but at some point the SNR gets low enough for the recording to be unusable.

Correcting Bit Errors

In digital recordings such as data storage, bit errors occur, somewhat like the noise in an analog recording. Since they’re almost impossible to prevent, the strategy is to correct them. Error Correction Code (ECC) is used in all digital data storage processes to address this issue. Somewhat like the parity calculations used in RAID (Redundant Array of Independent Disks) systems, ECC uses mathematical algorithms to fix these errors at the bit level. As with the analog example, at some point the rate of these errors, known as the Bit Error Rate (BER), can overwhelm the ECC engine’s ability to correct them.
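To make the idea of a correction limit concrete, here is a minimal sketch using the textbook Hamming(7,4) code, which can repair any single flipped bit in a 7-bit codeword. It is chosen only because it is small enough to read in full; it is an assumption for illustration, not the code any particular SSD uses, and production controllers rely on far stronger codes such as BCH or LDPC.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

# One bit error: within the code's ability, so the data is recovered.
one_error = list(codeword)
one_error[4] ^= 1
recovered = hamming74_decode(one_error)

# Two bit errors: the decoder "fixes" the wrong bit and returns bad data.
two_errors = list(codeword)
two_errors[4] ^= 1
two_errors[6] ^= 1
miscorrected = hamming74_decode(two_errors)

print(recovered == data)      # True
print(miscorrected == data)   # False
```

The second case is the small-scale version of an overwhelmed ECC engine: once the number of errors in a codeword exceeds what the code was designed for, correction silently fails.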

Some SSD controllers currently use a number of techniques to address these error-related issues, which ultimately affect endurance. When a block of cells reaches a point close to where bit error correction can no longer keep up with the high BER, the block is considered nearly worn out. When this occurs, the block can be replaced with a spare block from an inventory that enterprise-grade flash devices maintain for this purpose. This involves copying the data from the old block to the new one, something the SSD controller performs as an overhead process.
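The spare-block replacement described above might be sketched as follows. This is a toy model under stated assumptions: the class name, the `margin` parameter, and the error counts are all hypothetical, and a real controller tracks far more state than a dictionary of payloads.

```python
class BlockManager:
    """Toy model of spare-block replacement; names and thresholds are hypothetical."""

    def __init__(self, active_blocks, spare_blocks, ecc_limit, margin=0.8):
        self.data = {b: None for b in active_blocks}   # block id -> stored payload
        self.spares = list(spare_blocks)               # inventory of unused blocks
        self.retired = []
        # Retire a block before its errors actually exceed what ECC can correct.
        self.retire_at = ecc_limit * margin

    def check_block(self, block, bit_errors_per_read):
        """Move data to a spare if `block` is nearly worn out.

        Returns the id of the block that holds the data afterwards.
        """
        if bit_errors_per_read < self.retire_at:
            return block
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        new_block = self.spares.pop(0)
        self.data[new_block] = self.data.pop(block)    # copy data to the spare
        self.retired.append(block)
        return new_block

mgr = BlockManager(active_blocks=[0, 1, 2], spare_blocks=[100, 101], ecc_limit=40)
mgr.data[1] = b"payload"

healthy = mgr.check_block(0, bit_errors_per_read=5)    # well under the limit
moved = mgr.check_block(1, bit_errors_per_read=35)     # close to the ECC limit

print(healthy, moved, mgr.retired)
```

Retiring at a margin below the hard ECC limit reflects the "nearly worn out" wording: the data has to be copied while it can still be read correctly.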

Wear Leveling

To ensure that most blocks wear out at roughly the same time, most enterprise flash controllers use a process commonly referred to as “wear leveling.” This process is designed to evenly distribute write operations across all available blocks in a NAND device to reduce or eliminate concentration on a subset of blocks. But there are other ways to improve MLC endurance.
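One simple way to picture wear leveling is to always direct the next write to the block with the fewest erase cycles so far. The sketch below does this with a min-heap; the class and method names are hypothetical, and it models only the allocation decision, whereas real controllers must also relocate rarely changed data so that those blocks share in the wear.

```python
import heapq

class WearLeveler:
    """Toy allocator that always hands out the least-worn block."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id): the least-worn block is on top.
        self.heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self.heap)

    def allocate(self):
        """Choose a block for the next write and record one more P/E cycle."""
        count, block = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (count + 1, block))
        return block

    def erase_counts(self):
        return {block: count for count, block in self.heap}

leveler = WearLeveler(num_blocks=8)
for _ in range(1000):
    leveler.allocate()

counts = leveler.erase_counts()
spread = max(counts.values()) - min(counts.values())
print(counts)
print("difference between most and least worn block:", spread)
```

Because every write goes to a least-worn block, the erase counts in this model never differ by more than one cycle, no matter how many writes are issued.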

How to Improve MLC Endurance

Endurance in an SSD can be improved by reducing the amount of degradation that occurs to the recording media or increasing the system’s ability to effectively read degraded media. Said another way, an SSD manufacturer can improve life expectancy of MLC NAND flash if they can decrease the damage caused by the write and erase processes or improve their ability to prevent or correct bit-level errors caused by gate oxide degradation. An advanced SSD controller in a flash storage device can do both of these.

DSP Reduces Bit Errors

Similar to the way analog recording devices use filters, storage systems can use digital signal processing (DSP) to help reduce these bit errors and make the recording usable. This approach prevents read errors and reduces the workload on the ECC engine. DSP is only effective up to a certain level of degradation, but until that point it can extend the usable life of the MLC media.

Read Level Adjustments

Bit read errors occur when the SSD controller can’t distinguish which of the possible states (two in the case of SLC, four in the case of MLC) a cell has been programmed to. When data is read, thousands of these cells are polled in rapid succession to determine their stored values. In reality, the voltages captured in this read process vary widely from their nominal values, and their relative voltages shift as well. That’s where software comes in. An advanced SSD controller can perform routines to help recognize this variation from bit to bit and can turn what may look like a lot of noise into a readable sequence of binary information.

As the flash substrate wears, this noise level gets worse, but there are additional steps the controller can take to compensate for this. An advanced SSD controller can adjust the reference it uses to detect these binary voltage levels to better match the read output of the flash device. These “read level adjustments,” when combined with DSP and more powerful bit error detection, can enable an MLC NAND device to function reliably through more P/E cycles. The result is increased cell endurance and longer device life.
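A small simulation can show why moving the read references helps. Everything numeric here is an assumption made for illustration: the four nominal voltages, the amount of drift, and the noise level are arbitrary units rather than real device parameters, and the search uses the known stored values as a stand-in for the error feedback a controller would get from its ECC engine.

```python
import bisect
import random

NOMINAL = [0.0, 1.0, 2.0, 3.0]     # hypothetical target voltage for each MLC state
DEFAULT_REFS = [0.5, 1.5, 2.5]     # read references midway between the states

def read_state(voltage, refs):
    """Return the state (0 to 3) whose voltage window contains `voltage`."""
    return bisect.bisect_right(refs, voltage)

def count_errors(stored, voltages, refs):
    """Count cells whose read-back state differs from what was stored."""
    return sum(read_state(v, refs) != s for s, v in zip(stored, voltages))

def best_offset(stored, voltages, candidates):
    """Pick the reference shift that yields the fewest read errors."""
    return min(
        candidates,
        key=lambda off: count_errors(stored, voltages, [r + off for r in DEFAULT_REFS]),
    )

rng = random.Random(1)
stored = [rng.randrange(4) for _ in range(10_000)]

# Model a worn block: every cell has drifted downward and become noisier.
drift, noise = -0.35, 0.12
voltages = [NOMINAL[s] + drift + rng.gauss(0, noise) for s in stored]

candidates = [i / 20 for i in range(-12, 13)]          # shifts from -0.6 to +0.6
offset = best_offset(stored, voltages, candidates)
adjusted_refs = [r + offset for r in DEFAULT_REFS]

errors_default = count_errors(stored, voltages, DEFAULT_REFS)
errors_adjusted = count_errors(stored, voltages, adjusted_refs)

print("errors with default references: ", errors_default)
print("errors with adjusted references:", errors_adjusted)
print("chosen reference shift:", offset)
```

With the default references, cells that have drifted toward a boundary are misread as the neighboring state. Shifting the references to follow the drift puts them back between the voltage clusters, so far fewer errors reach the ECC engine.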

Flash Management

On the other side of the endurance equation, an advanced SSD controller can reduce the amount of damage done through the normal write and erase processes. Advanced SSD controller technologies, such as those developed by STEC, Inc., can control the way a cell is written and erased so as to minimize the impact of these processes on the NAND substrate. Every block of cells is different, so care must be taken to use the minimum voltage needed to program or erase each block.
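One plausible way to find the minimum voltage for a given block is a stepped program-and-verify loop: start low, raise the voltage in small increments, and stop at the first level that verifies. The sketch below is an assumption about how such a search could work, not a description of STEC's or any other vendor's method; the function name, the `verifies_at` callback, and the millivolt figures are all hypothetical.

```python
def find_min_program_voltage(verifies_at, start_mv, step_mv, max_mv):
    """Step the voltage upward and return the first level that verifies.

    `verifies_at` is a hypothetical callback standing in for the hardware's
    program-then-verify operation. Returns None if no level up to `max_mv` works.
    """
    for mv in range(start_mv, max_mv + 1, step_mv):
        if verifies_at(mv):
            return mv
    return None

# Two simulated blocks with different (made-up) programming thresholds.
def block_a(mv):
    return mv >= 15_200

def block_b(mv):
    return mv >= 16_900

voltage_a = find_min_program_voltage(block_a, start_mv=14_000, step_mv=250, max_mv=20_000)
voltage_b = find_min_program_voltage(block_b, start_mv=14_000, step_mv=250, max_mv=20_000)

print(voltage_a, voltage_b)   # 15250 17000
```

Using one fixed voltage high enough for every block would stress the blocks that need less, so a per-block search of this kind trades a little programming time for reduced wear.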

As flash cells wear, the write and erase operations described above will change as well. This degree of wear can actually be characterized to some extent by the data collected for the read processes described earlier. Together, this ability to reduce cell degradation during the write and erase steps, along with improvements to read technologies, can greatly improve MLC endurance.

NAND wear is a fact of life in flash storage products, and a result of normal operation. Endurance of the cells, and ultimately the lifespan of the storage product itself, is increased by minimizing this wear and improving the device’s ability to consistently read data despite this normal degradation. Advanced SSD controller technologies, like STEC’s CellCare, can significantly increase MLC flash endurance by using sophisticated write and erase processes that reduce NAND cell wear, together with software routines such as DSP and sophisticated error correction algorithms that increase read reliability.