George Crump, Senior Analyst

Spectra Logic is a client of Storage Switzerland

Tape's First Big Data Role - The Access Tier

Conventional wisdom, or at least the wisdom of the disk manufacturers, is that the entire Big Data store needs to be online and available (how convenient). The reality is that keeping all data online is not a viable solution for most businesses from a cost perspective. An example of a Big Data use case that I heard recently was of a farming conglomerate that installed soil sensors on every tractor. These sensors took samples every time they rotated through the soil on every field that was worked. For the conglomerate, this meant tractors taking hundreds of soil samples every minute across thousands of acres of farmland throughout the United States. After each sample was taken and analyzed, the results were transmitted in real time to the corporate data center, where further analysis could generate a real-time soil map of all of the conglomerate’s farms. In parallel, weather data was correlated with the soil data so that the cause and effect of weather and other events on the soil could be determined.

The intent is to store all of this data for decades so that seasonal soil and weather predictions can be made. As you can imagine, it will take petabytes of storage per year to maintain all this information. Extended over decades, the prospect of storing this data on even the cheapest of disk systems is totally impractical. Also consider that this is not a backup disk area; high-performance analytics require high-performance disk systems. Purchasing petabytes of this kind of tier-one disk would render almost any Big Data project too costly to undertake.
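To see how quickly the volumes add up, here is a back-of-envelope sizing sketch. All of the specific figures (payload per sample, fleet size, duty cycle) are illustrative assumptions, not numbers from the article; only the "hundreds of samples per minute" rate comes from the example above.

```python
# Back-of-envelope sizing. Every constant below is an assumption chosen
# for illustration; only SAMPLES_PER_MIN reflects the article's
# "hundreds of samples every minute" per tractor.
SAMPLE_BYTES = 8192       # assumed payload per soil sample
SAMPLES_PER_MIN = 300     # per tractor, per the example above
TRACTORS = 5000           # assumed fleet size
HOURS_PER_DAY = 10        # assumed working hours per day
DAYS_PER_YEAR = 200       # assumed working days per year

bytes_per_year = (SAMPLE_BYTES * SAMPLES_PER_MIN * 60
                  * HOURS_PER_DAY * DAYS_PER_YEAR * TRACTORS)
print(f"~{bytes_per_year / 1e15:.1f} PB per year")  # → ~1.5 PB per year
```

Even with these modest assumptions the fleet generates on the order of a petabyte and a half per year, which over decades puts an all-disk approach out of reach.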

The answer is to leverage tape as part of the access tier through the use of an Active Archive. Active Archiving marries high-performance primary disk storage with secondary disk and then tape to create a single, fully integrated access point. The data that needs to be analyzed at a given moment would be loaded onto the high-performance tier; in the above example, that could be a comparison of 2011 soil samples to 2009 soil samples. The secondary disk tier could store inbound data from the tractor collection devices, which arrives over slower broadband connections. Finally, tape could store all the other years’ worth of data. The Active Archive software would automatically move data between the tiers based on access, or the movement could be pre-programmed into the application.
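The movement policy described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual Active Archive logic: the tier names, the age thresholds, and the function names are all assumptions.

```python
# Hypothetical three-tier layout, hottest to coldest.
TIERS = ["primary_disk", "secondary_disk", "tape"]

# Days since last access before a data set is demoted one tier.
# These thresholds are illustrative assumptions.
DEMOTION_AGE_DAYS = {"primary_disk": 30, "secondary_disk": 180}

def target_tier(current_tier, days_since_access):
    """Return the tier a data set should occupy, demoting it one
    level when it passes the age threshold for its current tier."""
    threshold = DEMOTION_AGE_DAYS.get(current_tier)
    if threshold is not None and days_since_access > threshold:
        return TIERS[TIERS.index(current_tier) + 1]
    return current_tier

def recall_for_analysis(catalog, dataset):
    """Promote a data set back to primary disk when an analytics job
    requests it, e.g. recalling 2009 soil samples for comparison."""
    catalog[dataset] = "primary_disk"
```

The key point the sketch captures is that the application sees one namespace (the `catalog`) while data quietly ages down toward tape and is recalled to fast disk only when an analysis actually needs it.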

Tape's Second Big Data Role - The Protection Tier

The use of tape as part of the access tier for primary storage is also a critical component in the protection of Big Data assets. In fairness, not all Big Data sets need to be backed up. If the data can be regenerated, that may be a more cost-effective strategy in the event of a disk failure. In most situations, though, the data can't be regenerated; in our soil example, we can't go back to 2009 and collect the samples again. The problem is how to back up petabytes of information on a nightly or weekly basis. The answer is quite simple: you don't. With an Active Archive process, new data can be copied to disk and tape simultaneously, so the backups happen as data is received. An Active Archive also helps with a major restore. Instead of having to restore the entire data set, only the data that's currently needed must be recovered.
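The protect-on-ingest idea can be sketched as follows. The writer objects and function name are hypothetical stand-ins, not a real Active Archive API; the point is only that the tape copy is made in the same step as the disk write, so no separate backup window exists.

```python
# A minimal sketch of protect-on-ingest: each inbound record is written
# to the disk tier and to tape in the same step, so the protection copy
# exists the moment the data arrives. Writers are hypothetical stand-ins
# (anything with an append method works for illustration).

def ingest(record, disk_writer, tape_writer):
    """Write one inbound record to both tiers before acknowledging it."""
    disk_writer.append(record)   # fast tier, available for analysis
    tape_writer.append(record)   # protection copy, made as data arrives
    return True                  # acknowledge only after both writes
```

Because protection rides along with ingest, a "backup job" over petabytes never has to run, and a restore can pull back just the data set an analysis currently needs.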


Tape libraries like those from Spectra Logic that can leverage Active Archive technology not only complement the Big Data initiative, they can become a major part of the Big Data infrastructure. In fact, many Big Data projects may not be cost effective unless tape is integrated into the solution. You can learn more about tape’s role in Big Data environments at InfoWorld’s Enterprise Data Explosion event in Santa Clara, California tomorrow, June 8.