Leverage DRAM to Fix Flash Endurance
There is no arguing the performance advantage that flash storage has over hard disk drives, nor its price advantage over DRAM. But flash technology does have limitations that give data center managers pause, and those limitations have slowed its pace of adoption.
First, flash storage has an endurance problem: it can only support a finite number of write and erase cycles before it fails. Second, flash handles write traffic significantly more slowly than read traffic, a gap that makes for inconsistent response times in mixed workloads.
The work-arounds for these limitations either increase the cost of flash storage or reduce its performance. An alternative solution is needed to correct these problems, one that will further drive down the cost of flash while increasing performance. One solution leverages DRAM to stabilize flash.
Flash Endurance and Write Amplification
When data on a flash storage device needs to be changed, or new data added, the cells holding the old data must first be erased. Erasure must be done a block at a time; NAND flash can’t be overwritten at the bit level. Once the block is erased, the new data can be written, or programmed, into it. This process is typically called a “program/erase cycle”.
Flash is unique among storage technologies in that it only supports a limited number of data changes, a finite number of program/erase cycles. Essentially, NAND flash cells degrade with use, experiencing an increase in error rates as they slowly wear out. Flash device suppliers have developed sophisticated technologies that control how data is written to the flash. Called “wear leveling”, these processes make sure that the flash is written to uniformly so that no one area of the flash wears out prematurely.
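As a rough illustration of the wear-leveling idea, the Python sketch below always hands out the free block with the fewest erases. It is a toy model, not any vendor's algorithm; the WearLeveler class and the simulate helper are hypothetical names invented for this example.

```python
import heapq


class WearLeveler:
    """Toy wear leveler: always hand out the least-erased free block."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id); every block starts unworn.
        self._free = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._free)
        self.erase_counts = [0] * num_blocks

    def allocate(self):
        """Return the id of the free block with the fewest erases so far."""
        _, block = heapq.heappop(self._free)
        return block

    def release(self, block):
        """Erase a block (one more P/E cycle) and return it to the free pool."""
        self.erase_counts[block] += 1
        heapq.heappush(self._free, (self.erase_counts[block], block))


def simulate(num_blocks=8, writes=1000):
    """Write repeatedly and report how many erases each block absorbed."""
    leveler = WearLeveler(num_blocks)
    for _ in range(writes):
        block = leveler.allocate()
        leveler.release(block)
    return leveler.erase_counts
```

In this simplified model the erase counts stay within one cycle of each other, which is the uniform wear the controller is aiming for. Real controllers also have to relocate rarely changed data, which this sketch ignores.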
As mentioned earlier, flash must be erased and programmed across blocks of cells. This means that on every write there are cells that are erased and reprogrammed that did not need to be. This program/erase cycle is also why flash performs so poorly on writes compared to reads: identifying available cells, clearing those cells and writing new data to them is a time-consuming process.
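The cost of block-at-a-time updates can be put into numbers as a write amplification factor: bytes physically written to flash divided by bytes the host asked to write. The sketch below assumes a worst case in which every touched block is rewritten in full; the function name and the 4 KiB and 512 KiB sizes are assumptions chosen for illustration, not figures from any particular device.

```python
def write_amplification(host_bytes, block_bytes):
    """Worst-case write amplification when whole blocks must be rewritten.

    Assumes every block touched by the update is erased and reprogrammed
    in full, which is a deliberate simplification.
    """
    blocks_touched = -(-host_bytes // block_bytes)  # ceiling division
    flash_bytes = blocks_touched * block_bytes
    return flash_bytes / host_bytes


# Hypothetical sizes: a 4 KiB update landing in a 512 KiB erase block.
small_update = write_amplification(4 * 1024, 512 * 1024)      # 128.0
# A write that exactly fills one block wastes nothing.
full_block = write_amplification(512 * 1024, 512 * 1024)      # 1.0
```

Real controllers map data at a finer granularity and do considerably better than this worst case, but the direction of the effect is the same: small, scattered writes cost far more flash wear than their size suggests.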
The design of the flash also impacts endurance. There are currently three types of flash: Single-Level Cell (SLC), Multi-Level Cell (MLC) and Triple-Level Cell (TLC). SLC stores one bit per cell, MLC stores two and TLC three. Storing more than one bit means the charge on the cell needs to be set to one of several levels rather than simply full or empty. That makes it more difficult to tell what value the cell holds, so the cell appears to go bad sooner.
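The arithmetic behind this is simple: a cell holding n bits has to distinguish 2^n charge levels. The sketch below also estimates how much of the cell's voltage window separates neighboring levels, under the simplifying assumption that the levels are evenly spaced. The function names are made up for this example.

```python
def charge_levels(bits_per_cell):
    """Number of distinct charge levels a cell must hold."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell):
    """Fraction of the voltage window between adjacent levels.

    Assumes evenly spaced levels, which real NAND does not guarantee.
    """
    return 1 / (charge_levels(bits_per_cell) - 1)


for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    print(f"{name}: {charge_levels(bits)} levels, "
          f"margin {relative_margin(bits):.2f} of the window")
```

Under that assumption SLC has the whole window between its two states, MLC a third of it and TLC a seventh, so the same amount of wear-induced drift causes read errors much sooner as bits per cell go up.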
As a result, SLC was originally considered the most appropriate flash technology for the enterprise since it had the longest endurance. The problem was that SLC, because it also had the lowest capacity per cell, was the most expensive flash type to deploy. Manufacturers therefore created work-arounds to allow MLC flash to provide acceptable endurance for the enterprise.
The MLC Work-Arounds
The most common work-around is over-provisioning. This means that more flash capacity is put into the flash device than is actually reported to the operating system. The extra capacity allows writes to be spread out across more flash cells, giving a longer life to the overall unit, and allows bad blocks to be removed from the storage pool and replaced by good blocks. The problem is that this extra capacity adds cost. And while MLC-based enterprise products can still be more cost effective than SLC-based products, the level of over-provisioning impacts their cost effectiveness.
Also, over-provisioning does nothing to organize writes so that the flash is written to more intelligently. Each program/erase cycle reduces the life of an entire block of flash in order to make room for data that, once written, may only occupy a small portion of the block.
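A first-order endurance estimate makes the trade-off concrete: the total data a host can write over a device's life is roughly physical capacity times rated program/erase cycles, divided by write amplification. The sketch below uses that approximation; the function name and all of the input values are hypothetical, chosen only to compare the two levers.

```python
def lifetime_host_writes(usable_bytes, over_provision, pe_cycles, write_amp):
    """Rough total bytes a host can write before the flash wears out.

    over_provision is extra physical capacity as a fraction of usable
    capacity (0.25 means 25% more flash than the OS sees). This is a
    first-order approximation, not a vendor endurance rating.
    """
    physical_bytes = usable_bytes * (1 + over_provision)
    return physical_bytes * pe_cycles / write_amp


# Hypothetical device: 100 units of usable capacity, 1,000 P/E cycles.
baseline = lifetime_host_writes(100, 0.0, 1000, 4.0)       # 25000.0
more_flash = lifetime_host_writes(100, 0.25, 1000, 4.0)    # 31250.0
better_writes = lifetime_host_writes(100, 0.0, 1000, 2.0)  # 50000.0
```

With these made-up inputs, adding 25% more flash buys 25% more life at 25% more flash cost, while halving write amplification doubles life without adding any flash. That is the gap that organizing writes more intelligently is meant to close.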
Recently, flash vendors have enhanced their controller technologies to modify how the flash is written to. They use ‘softer’ energy charges when write traffic is low to increase life. The softer the charge, the slower the write cycle but the longer the device will last. This of course impacts performance and is only appropriate under low write I/O situations.
The most pressing problem for flash vendors is that as the lithography of flash cells continues to shrink, the other function of these controllers, error correction, becomes increasingly important. It’s reasonable to assume that as the lithography of the NAND flash shrinks, the controller will have less and less capacity to spare for functions other than error correction.
Leveraging DRAM To Fix Flash
An alternative may be to fix flash endurance problems before they ever reach the flash itself. This can be done by leveraging DRAM to better organize data before it’s written to the flash module. In designs like the Marvell DragonFly, DRAM is used as a staging area for new or modified data on flash modules. DRAM is an even higher-speed memory technology than flash and suffers no endurance problems or performance loss when dealing with write traffic.
With a DRAM-leveraged flash system, each inbound write from an application is acknowledged the moment that data is received in the module’s DRAM, allowing the application to continue. DRAM, as mentioned earlier, is faster than flash, so the application sees a dramatic performance improvement and no delay due to the flash program/erase cycle.
Typically the DRAM area is relatively large, about 4GB, so that as much new data as possible can be collected in it before being written to flash. This enables two important capabilities that can dramatically increase the endurance of flash storage.
First, a large DRAM cache can eliminate a significant number of writes to the flash device, because data updates are often a series of re-writes and over-writes. The DRAM area allows the writes to ‘cool’ so that a more stable version of the data is written to flash. Without this capability, similar data is written repeatedly to the flash device within a few minutes. Second, a large DRAM cache also allows these writes to be organized so that no write is committed that consumes only part of a block, which would waste flash capacity and reduce flash life.
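Both capabilities can be sketched in a few lines of Python. The toy buffer below absorbs overwrites to the same logical page in memory and only commits the coldest pages to flash in completely filled blocks. It is a minimal illustration under assumed parameters, not how the DragonFly or any other product is implemented; the StagingBuffer class, its method names and the page and block sizes are all hypothetical.

```python
from collections import OrderedDict


class StagingBuffer:
    """Toy DRAM write buffer in front of flash (illustrative names only)."""

    def __init__(self, capacity_pages=8, pages_per_block=4):
        if capacity_pages < pages_per_block:
            raise ValueError("buffer must hold at least one full block")
        self.capacity_pages = capacity_pages
        self.pages_per_block = pages_per_block
        self.pending = OrderedDict()  # logical page -> latest data, coldest first
        self.host_writes = 0          # writes acknowledged to the application
        self.flash_page_writes = 0    # pages actually committed to flash
        self.flushed_blocks = []      # stand-in for the flash device

    def write(self, page, data):
        """Stage a write in DRAM; the caller is acknowledged on return."""
        self.host_writes += 1
        self.pending[page] = data        # an overwrite replaces the staged copy
        self.pending.move_to_end(page)   # recently written pages keep cooling
        while len(self.pending) > self.capacity_pages:
            self._flush_one_block()

    def _flush_one_block(self):
        """Commit the coldest pages to flash as one completely filled block."""
        block = [self.pending.popitem(last=False)
                 for _ in range(self.pages_per_block)]
        self.flushed_blocks.append(block)
        self.flash_page_writes += len(block)


buf = StagingBuffer(capacity_pages=8, pages_per_block=4)
for version in range(100):
    buf.write(0, version)        # 100 rewrites of one hot page stay in DRAM
for page in range(1, 9):
    buf.write(page, "data")      # filling the buffer forces one block flush
print(buf.host_writes, buf.flash_page_writes)  # 108 4
```

In this run the application issued 108 writes but only four pages reached flash, all in a single full block, and the hot page was committed once with its final value rather than 100 times.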
By leveraging DRAM, flash can provide DRAM-like write performance combined with its already excellent read performance. Staging writes in DRAM also increases the life expectancy of flash, since fewer writes are committed to it and those writes are better organized so that entire blocks are used.
Overcoming the DRAM Challenges
There are typically two concerns when leveraging DRAM as a storage area. First, DRAM is expensive compared to flash. This is easily overcome by using DRAM as a cache, as described above. 4GB of DRAM can handle a lot of inbound write traffic, and 4GB of DRAM is not going to add much to the overall cost of the storage device.
Second, DRAM is volatile, meaning if it loses power it loses data. This is addressed by using “super capacitors”. These maintain a charge on the circuit board which, in the event of a power loss, provides time to copy the contents of DRAM to a backup flash memory area.
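Whether the capacitors provide enough time is a sizing question that can be checked with basic physics: the usable energy in a capacitor discharging from one voltage to another is ½·C·(V_start² − V_min²), and hold-up time is that energy divided by the power the board draws. The sketch below compares that against the time needed to copy the DRAM to flash. Every number in it is a hypothetical placeholder, not a specification of any real device.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_watts):
    """Seconds a capacitor bank can power the board after main power is lost."""
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules / load_watts


def flush_seconds(dram_bytes, flash_write_bytes_per_s):
    """Seconds needed to copy the DRAM contents to the backup flash area."""
    return dram_bytes / flash_write_bytes_per_s


# Hypothetical values: 10 F bank, 5 V down to 3 V, 8 W load,
# 4 GiB of DRAM copied at 512 MiB/s.
available = holdup_seconds(10, 5.0, 3.0, 8.0)        # 10.0 seconds
needed = flush_seconds(4 * 2**30, 512 * 2**20)       # 8.0 seconds
print("safe" if available >= needed else "undersized")
```

With these placeholder values the capacitors would outlast the copy by a couple of seconds. A real design would also leave margin for capacitor aging and conversion losses, which this sketch leaves out.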
Conclusion
DRAM and flash are an ideal complement to each other. Flash is non-volatile, performs well and is cost effective. DRAM provides ultra-high performance. When used together, a storage device can provide DRAM-speed write performance and overcome many of the endurance challenges that flash faces.
Marvell is a client of Storage Switzerland
Wednesday, December 5, 2012
George Crump, Senior Analyst