George Crump, Senior Analyst

EMC is a client of Storage Switzerland

Enter Automated Tiering (AT) technologies. Added to an array, this capability lets the storage system self-optimize so that only the hot data is stored on the SSD tier. The storage system automatically and continually analyzes itself to determine which set of data should be on the SSD tier at any given moment. As a result, we have seen vendors race to market with their own take on AT. The problem is that the way these technologies are implemented can actually hurt SSD reliability.


One of the challenges with solid state disk (SSD) is that it wears out as more and more data is written to it. Many people assume that wear leveling techniques fix that problem, but as I discussed in a recent InformationWeek series, wear leveling simply brings predictability to solid state storage by making sure that the flash modules in that SSD wear out at basically the same time. It does not increase life expectancy. AT, if not implemented correctly, could actually hurt the life expectancy of the SSD.



How Can Automated Tiering Hurt SSD Reliability?


The goal with an SSD, because of the premium you paid for it, is to keep it as close to 100% full as possible with the most active data set possible. Unused capacity on SSD is multiple times as expensive as unused capacity on HDD. AT fills the requirement nicely by making sure the drive is full of frequently accessed data. As another set of data becomes hot, it is moved from HDD to SSD automatically. When that move happens, though, it consumes SSD write cycles: space has to be cleared off the SSD tier and the new data written. On a very active system this can happen thousands if not millions of times per day. Each time AT updates the SSD tier, the number of write cycles the drive has left is reduced.
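To make the cost of that churn concrete, here is a rough back-of-the-envelope sketch. It assumes a simple endurance model (total write budget equals capacity times rated program/erase cycles), and every number in it is hypothetical rather than taken from any vendor's rating. The function name and parameters are illustrative only.

```python
def estimated_lifetime_days(capacity_gb, rated_pe_cycles,
                            promoted_gb_per_day, write_amplification=1.0):
    """Rough number of days until the rated write budget is used up.

    Assumes a simplified model: the drive can absorb
    capacity_gb * rated_pe_cycles of writes in total.
    """
    if promoted_gb_per_day <= 0:
        raise ValueError("promoted_gb_per_day must be positive")
    total_write_budget_gb = capacity_gb * rated_pe_cycles
    effective_daily_writes_gb = promoted_gb_per_day * write_amplification
    return total_write_budget_gb / effective_daily_writes_gb


# Hypothetical 200 GB drive rated for 10,000 cycles.
light_churn = estimated_lifetime_days(200, 10_000, promoted_gb_per_day=500)
heavy_churn = estimated_lifetime_days(200, 10_000, promoted_gb_per_day=5_000)

print(f"light tiering churn: {light_churn:.0f} days")
print(f"heavy tiering churn: {heavy_churn:.0f} days")
```

Under these assumed figures, a tenfold increase in promotion traffic cuts the estimated life by the same factor, which is the effect the AT implementation needs to guard against.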


This kind of write wear may be more than what the SSD manufacturer assumed when rating the drive, and we are encountering cases where drives are "burning out" within a year or two of use. While the drive may be under warranty, you still have to go through the trouble of replacing the drive and rebuilding the array group. Ironically, thanks to wear leveling, you may experience a drive failure halo where a whole series of drives fail within a few weeks of each other.



How Smarter Storage Can Help


The storage system has a role to play in making sure that advanced features like automated tiering don't cause the SSD tier to fail sooner than the customer was expecting. We have to make sure the right data is being promoted at the right size increments and also leverage our old friend DRAM. EMC's FAST VP is a good example of a technology that provides this intelligence. Not only does FAST VP accelerate performance by moving hot data to the SSD tier, it does so in such a manner that it is less likely to prematurely wear out the SSD tier, and in fact it may even lengthen its life expectancy.



Smarter Promotion


Just because a file has been accessed once does not mean it should be promoted. There should be some function in the AT algorithms that determines an “intensity of access” to make sure that only data that can take advantage of the SSD tier is actually promoted. This could be as simple as promoting just specific LUNs or volumes, or it could be a QoS setting that allows for promotion of data based on a series of parameters.
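As a minimal sketch of what an "intensity of access" check might look like, the example below counts accesses per extent and only promotes once a threshold is reached, optionally restricted to a list of eligible volumes. The class, its parameters, and the threshold are all hypothetical; this is not how FAST VP or any specific product decides what to promote.

```python
from collections import Counter


class PromotionPolicy:
    """Hypothetical policy: promote only extents accessed often enough."""

    def __init__(self, min_accesses, eligible_volumes=None):
        self.min_accesses = min_accesses
        # None means every volume is eligible for promotion.
        self.eligible_volumes = (
            set(eligible_volumes) if eligible_volumes is not None else None
        )
        self.access_counts = Counter()

    def record_access(self, volume, extent):
        self.access_counts[(volume, extent)] += 1

    def should_promote(self, volume, extent):
        if self.eligible_volumes is not None and volume not in self.eligible_volumes:
            return False
        return self.access_counts[(volume, extent)] >= self.min_accesses


policy = PromotionPolicy(min_accesses=3, eligible_volumes={"db01"})

policy.record_access("db01", 7)
one_touch = policy.should_promote("db01", 7)   # a single access is not enough

for _ in range(2):
    policy.record_access("db01", 7)
hot = policy.should_promote("db01", 7)         # threshold reached

for _ in range(5):
    policy.record_access("archive", 1)
excluded = policy.should_promote("archive", 1)  # volume is not eligible

print(one_touch, hot, excluded)
```

A real implementation would likely measure accesses over a time window and let old counts decay; the fixed counter here only illustrates the idea of a promotion threshold.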



Granular Promotion


Another factor is the size of the data that has to be moved. Some AT technologies require that a minimum of 1GB of data be promoted to the SSD tier. If you only need a few MB of data, this is a massive waste of space, and you are writing more data to the SSD tier than you need to. While the difference between 10MB and 1GB may not seem like all that big a deal by today's capacity standards, multiply that by 1 million promotions per day and you could be writing significantly more data to the SSD tier than you need to. Again, each write reduces longevity. We want to make sure we are only writing to SSD what actually needs to be there.
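The arithmetic behind that comparison can be worked through in a few lines. The sketch below uses the article's own figures (10MB needed, a 1GB minimum promotion size, 1 million promotions per day) and assumes 1GB = 1024MB; the function name is illustrative.

```python
MB = 1
GB = 1024 * MB  # assumption: binary gigabytes


def daily_promotion_writes_mb(needed_mb, min_extent_mb, promotions_per_day):
    """Return (written_mb, needed_mb) per day.

    Each promotion is rounded up to a whole number of minimum-size extents.
    """
    extents = -(-needed_mb // min_extent_mb)  # ceiling division
    written = extents * min_extent_mb * promotions_per_day
    needed = needed_mb * promotions_per_day
    return written, needed


written, needed = daily_promotion_writes_mb(
    needed_mb=10, min_extent_mb=1 * GB, promotions_per_day=1_000_000
)
overhead = written / needed

print(f"written to SSD: {written:,} MB/day")
print(f"actually needed: {needed:,} MB/day")
print(f"overhead factor: {overhead:.1f}x")
```

With these inputs the tier absorbs roughly a hundred times more write traffic than the hot data itself requires, which is why promotion granularity matters for longevity.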



Extensive Use of RAM Caching


RAM caching may be another way to extend SSD life if that RAM cache is designed to safely cache writes as well as reads, like EMC's VMAX does. Most data has a very short creation/modification window and then becomes inactive. For example, a customer record is added to a database, the customer's personal information is added, and then the record goes dormant until that customer changes addresses or phone numbers. We want performance on this initial activity, so we’d like it on memory-based storage, but there are often a lot of re-writes of the same data during the initial creation/modification cycle. If all this can be cached to RAM in the storage array, we can eliminate much of the write activity that may be directed toward the SSD while still offering high performance.
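The effect of absorbing re-writes in RAM can be illustrated with a toy write-back cache. This is a simplified sketch under stated assumptions, not a description of how VMAX or any other array implements its cache: the class and its counters are hypothetical, and the "safe" part (battery backing, mirroring) is not modeled at all.

```python
class WriteCoalescingCache:
    """Toy write-back cache: repeated writes to a block reach the SSD once."""

    def __init__(self):
        self.dirty = {}       # block id -> latest data held in RAM
        self.ssd = {}         # block id -> data stored on the SSD tier
        self.host_writes = 0  # writes issued by the application
        self.ssd_writes = 0   # writes that actually reach flash

    def write(self, block, data):
        self.host_writes += 1
        self.dirty[block] = data  # overwrite in RAM, no flash wear yet

    def read(self, block):
        if block in self.dirty:
            return self.dirty[block]
        return self.ssd.get(block)

    def flush(self):
        # Only the final version of each dirty block is written to flash.
        for block, data in self.dirty.items():
            self.ssd[block] = data
            self.ssd_writes += 1
        self.dirty.clear()


cache = WriteCoalescingCache()

# A new customer record is modified several times in quick succession.
for revision in range(5):
    cache.write("customer-42", f"record revision {revision}")

cache.flush()
print(cache.host_writes, cache.ssd_writes)
```

In this example five application writes result in a single write to the SSD tier, and reads during the modification window are still served from memory.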


It's true that some AT technologies push all writes to the hard disk tier first to avoid repeatedly over-writing the same data on the flash tier, but this hurts performance, probably when you need it most. Writes are costly from a performance perspective, and they are typically a real-time activity. Depending on the AT technology, it may take days to promote that set of data to the active flash tier. Caching initial writes to a safe RAM cache solves these problems and allows the storage system to better align the data when it comes time to write it.


Storage systems need to be intelligent in how they enable certain features. You don't want to use a feature that fixes one problem but creates ten others. Instead, look for solutions that have thought the whole process through and not only fix one problem but can also ease the burden on some others.

EMC World Note