Data Reduction for Online Image Storage
• Freeing up previously used capacity
• Eliminating or delaying future primary storage purchases
• Reducing backup windows for primary storage by moving old data to secondary storage
• Increasing the ROI of the archive tier by moving optimized data to it
The Digital Content Reality
Digital images make up an ever-increasing percentage of overall storage in many companies, especially in key Internet segments that offer online photo sharing or social networks. The challenge of digital image storage is not confined to Internet sites; many organizations store large numbers of digital images on an ongoing basis. For example, “Toll-Tags” are now prevalent in many states, removing the need to stop and pay the toll. Every car that goes through the special lane has its license plate photographed. The resulting image is matched to a toll tag account, and toll fees are charged to that account.
In all of these environments vast amounts of storage are required, so much so that many hardware manufacturers are trying to solve the problem with scalable and inexpensive storage solutions. The problem with these approaches is that, while the cost per GB is slightly lower than that of traditional storage, simply buying more and more cheap disks does not solve the power, space, and management issues that many digital content customers are facing. Providers need a way to store the data more efficiently while using fewer disks.
Generic Optimization
Traditional data centers need capacity optimization solutions that work with both traditional and image-centric data. For traditional data, they need standard, generic optimization solutions that provide reasonable data reduction across a variety of file types. Unfortunately, the standard solutions commonly used with online storage and backups are not digital content aware and, as a result, provide little or no improvement in storage utilization for image-rich data. For the most part, generic data reduction solutions leave providers of digital content, photo sharing sites, and other organizations that store large amounts of image data out in the cold.
Optimizing the capacity utilization of image-rich digital content requires content-aware processing for specific types of content. The problem is that digital content is typically already stored in a compressed or optimized format, and standard compression tools cannot provide enough extra space savings to justify the effort. Block-level deduplication solutions do not provide much relief either. While there is often much similarity between the content of files, each file is stored in a unique manner, and at the block level the files have little or no commonality. For example, if you have two photos that are identical in every way except that red-eye was removed from one, the human eye can see that the photos are nearly identical, but a dedupe solution might not find a single duplicate data block on disk.
Content Specific Optimization
The poor performance of standard tools on digital content stems from the fact that standard compression and data deduplication tools are too generic. Deduplication tools look at the physical disk level, searching for exact matches of strings of “1’s and 0’s” on disk. When the deduplication tool finds exact matches, it creates pointers that reference the original disk blocks and removes the duplicate blocks. Deduplication at the block level provides storage savings with some data types but is ineffective in many situations. Customers need deduplication methods that are content-aware and work at the file level.
Compression uses mathematical modeling and algorithms to store file information more efficiently, with fewer bits on disk than were used originally. The effectiveness of compression depends on the models and algorithms used. Generic compressors are designed to compress all types of data; they are not designed to compress specific file types with maximum efficiency. Moreover, standard solutions do not combine content-aware deduplication and compression to realize the highest storage reduction possible. With digital content, for example, there is very little digital redundancy at the physical block level between images that are nearly identical to the human eye. A combination of content-aware deduplication and compression would be very beneficial with this type of data.
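The block-level approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the 4 KB block size, the use of SHA-256, and the function names `dedupe`, `restore`, and `shared_blocks` are all assumptions made for the example, and seeded random bytes stand in for already-compressed image data.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def dedupe(data: bytes) -> tuple[dict[str, bytes], list[str]]:
    """Return (store of unique blocks, pointer list that rebuilds the data)."""
    store: dict[str, bytes] = {}
    pointers: list[str] = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of a block
        pointers.append(digest)          # every later copy is just a pointer
    return store, pointers


def restore(store: dict[str, bytes], pointers: list[str]) -> bytes:
    """Rebuild the original bytes by following the pointers."""
    return b"".join(store[p] for p in pointers)


def shared_blocks(a: bytes, b: bytes) -> int:
    """Count blocks of b whose exact bytes also appear as a block of a."""
    seen = {hashlib.sha256(blk).digest() for blk in split_blocks(a)}
    return sum(hashlib.sha256(blk).digest() in seen for blk in split_blocks(b))


# High-entropy stand-in for a compressed image file (hypothetical data).
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK_SIZE))

# Two exact copies collapse to one set of blocks plus pointers.
store, pointers = dedupe(original + original)

# A single inserted byte shifts every block boundary in the edited file.
edited = b"\x00" + original
```

In this sketch, `len(store)` is half of `len(pointers)` for the duplicated data, while `shared_blocks(original, edited)` finds nothing in common even though the two byte strings differ by a single byte. That is the behavior the text describes for near-identical images.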
A solution is needed that looks for compressible patterns and redundancy at the information level. For example, photos need to be examined in the DCT/pixel space, graphics in the visual space, and medical images at the spatial level. By providing 1-D, 2-D, and 3-D content-aware algorithms, companies like Ocarina Networks are able to optimize digital content, delivering 5X better optimization of image-rich data sets than compression and 10X better optimization than deduplication.
1-D optimization is sequential optimization. This includes traditional compressors such as arithmetic encoding, predictive encoding, and dictionary encoding. 2-D optimization applies to spatial compression and deduplication within an image. 3-D optimization applies to time-based deduplication across video frames, spatial deduplication across sets of images, and other temporally and spatially redundant information. An ideal use of 3-D optimization is across time-based images. A medical scan, for example, might take 100 pictures of a patient; while there are differences between the first and the last, images taken microns apart have a lot of similarity at the information level. A solution that can find that similarity and take advantage of it will achieve better data reduction.
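The idea of exploiting similarity between neighbouring slices can be illustrated with a simple delta scheme. This is only a sketch under stated assumptions: the slices are synthetic greyscale data, `zlib` stands in for the 1-D compressor, and the names `make_scan`, `encode`, and `decode` are hypothetical. It is not a description of how any product performs 3-D optimization.

```python
import random
import zlib

WIDTH, HEIGHT, SLICES = 64, 64, 20  # hypothetical scan dimensions


def make_scan(seed: int = 1) -> list[bytes]:
    """Synthetic scan: each slice changes a few pixels of the previous one."""
    rng = random.Random(seed)
    current = bytearray(rng.getrandbits(8) for _ in range(WIDTH * HEIGHT))
    slices = [bytes(current)]
    for _ in range(SLICES - 1):
        for _ in range(40):  # 40 pixel changes between neighbouring slices
            current[rng.randrange(len(current))] = rng.getrandbits(8)
        slices.append(bytes(current))
    return slices


def delta(prev: bytes, cur: bytes) -> bytes:
    """Per-pixel difference modulo 256; mostly zeros for similar slices."""
    return bytes((c - p) % 256 for p, c in zip(prev, cur))


def undelta(prev: bytes, diff: bytes) -> bytes:
    """Inverse of delta(): rebuild the current slice from the previous one."""
    return bytes((p + d) % 256 for p, d in zip(prev, diff))


def encode(slices: list[bytes]) -> list[bytes]:
    """Store the first slice whole and each later slice as a compressed delta."""
    out = [zlib.compress(slices[0])]
    for prev, cur in zip(slices, slices[1:]):
        out.append(zlib.compress(delta(prev, cur)))
    return out


def decode(encoded: list[bytes]) -> list[bytes]:
    """Reverse encode(): decompress and re-apply the deltas in order."""
    slices = [zlib.decompress(encoded[0])]
    for chunk in encoded[1:]:
        slices.append(undelta(slices[-1], zlib.decompress(chunk)))
    return slices


scan = make_scan()
independent = sum(len(zlib.compress(s)) for s in scan)  # each slice alone
encoded = encode(scan)
correlated = sum(len(c) for c in encoded)               # slices as deltas
```

Comparing `correlated` with `independent` shows how much smaller the set becomes once the redundancy across slices is used, and `decode(encoded)` returns the original slices unchanged.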
Using these different types of optimizers is ideal for digital content providers. In many cases the provider will have digital content stored in various formats: the original, an optimized or cleaned-up image, large, medium, and small versions of the image, and finally a thumbnail. The effect is that essentially the same image is stored six or more times.
Another example is when a photo is edited. Suppose a picture with red eye is identified. The user opens the file, edits the image to remove the red eye, and then saves the image to a new file. On disk the new image has a completely different set of 1’s and 0’s, yet visually it looks almost identical.
Using 2-D and 3-D optimization, the redundancy of the above examples can be captured by using content-aware deduplication and spatial correlation. The result is an 80 to 90% reduction in the amount of primary disk storage being used.
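To make the red-eye case concrete, the sketch below compares two images in pixel space and stores the edited one as a small patch against the original. It assumes the images have already been decoded to raw 8-bit greyscale pixels; the 8-pixel tile size and the names `changed_tiles`, `make_patch`, and `apply_patch` are hypothetical choices for this example, not a description of Ocarina's algorithms.

```python
TILE = 8  # hypothetical tile edge, in pixels


def changed_tiles(a, b, width, height, tile=TILE):
    """Return (tile_x, tile_y) for every tile whose pixels differ."""
    changed = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            x_end = min(tx + tile, width)
            for y in range(ty, min(ty + tile, height)):
                row = y * width
                if a[row + tx:row + x_end] != b[row + tx:row + x_end]:
                    changed.append((tx // tile, ty // tile))
                    break  # one differing row is enough for this tile
    return changed


def make_patch(original, edited, width, height, tile=TILE):
    """Keep only the pixel rows of the tiles that were actually edited."""
    patch = {}
    for tx, ty in changed_tiles(original, edited, width, height, tile):
        x0, y0 = tx * tile, ty * tile
        x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
        patch[(tx, ty)] = [
            bytes(edited[y * width + x0:y * width + x1]) for y in range(y0, y1)
        ]
    return patch


def apply_patch(original, patch, width, tile=TILE):
    """Rebuild the edited image from the original plus the stored tiles."""
    out = bytearray(original)
    for (tx, ty), rows in patch.items():
        x0, y0 = tx * tile, ty * tile
        for dy, row in enumerate(rows):
            start = (y0 + dy) * width + x0
            out[start:start + len(row)] = row
    return bytes(out)


# A 64x64 synthetic image and a copy with a 6x6 "red-eye" region repainted.
W = H = 64
original = bytes((3 * x + 5 * y) % 256 for y in range(H) for x in range(W))
edited = bytearray(original)
for y in range(20, 26):
    for x in range(20, 26):
        edited[y * W + x] = 0
edited = bytes(edited)

patch = make_patch(original, edited, W, H)
patch_pixels = sum(len(row) for rows in patch.values() for row in rows)
```

Here the edit touches only four of the 64 tiles, so the second image can be represented by a patch of 256 pixels instead of a second full copy of 4,096 pixels, and `apply_patch(original, patch, W)` reproduces the edited image exactly.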
Another data reduction technique is the more efficient packaging of thumbnail data. Thumbnails are small versions of an image that make it easy to scan through a large number of images. Because they are so small, most storage platforms store them inefficiently.
It is not uncommon for a thumbnail to be less than 2k in total size. Most file systems store data in blocks that range in size from 4k to 8k. With 8k blocks, a 100k Word document takes 13 blocks to store, wasting 4k of the 13th block. Storing 50 2k thumbnails, also 100k in total, requires 50 blocks because each thumbnail occupies its own block, wasting 6k per thumbnail, or 300k in all. In other words, the thumbnails waste more storage than they actually consume in total!
Expanding this example to real-world data sets shows the significance of the problem. Whereas typical user home directories might contain a few million Word documents, digital content sites measure their file counts in the tens if not hundreds of millions. That can result in hundreds of GB, if not TBs, of lost capacity.
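The block arithmetic in this example is easy to check. The sketch below assumes 8k blocks and assumes that every file starts on its own block, so a file always occupies a whole number of blocks; the function name `allocation` is made up for the illustration.

```python
KB = 1024
BLOCK = 8 * KB  # assumed file system block size


def allocation(file_size: int, block_size: int = BLOCK) -> tuple[int, int]:
    """Return (blocks used, bytes wasted) for one file stored on its own."""
    blocks = max(1, (file_size + block_size - 1) // block_size)  # round up
    return blocks, blocks * block_size - file_size


# One 100k document.
doc_blocks, doc_waste = allocation(100 * KB)

# Fifty 2k thumbnails, each stored as a separate file.
thumb_blocks, thumb_waste = allocation(2 * KB)
total_thumb_blocks = 50 * thumb_blocks
total_thumb_waste = 50 * thumb_waste

print(f"document:   {doc_blocks} blocks, {doc_waste // KB}k wasted")
print(f"thumbnails: {total_thumb_blocks} blocks, {total_thumb_waste // KB}k wasted")
```

The same 100k of data therefore occupies 13 blocks as a single document but 50 blocks as individual thumbnails, with 300k of the 400k allocated to the thumbnails left unused.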
By performing content-aware optimization, files can be packaged into block-efficient groups, significantly reducing this wasted capacity without a performance penalty on reads.
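One simple way to picture block-efficient packaging is a container that stores many small files back to back, with an index of offsets so each file can still be read directly. The sketch below is a hypothetical illustration of that idea only: the container layout, the 8k block size, and the names `pack` and `read_file` are assumptions, and the space needed for the index itself is ignored.

```python
KB = 1024
BLOCK = 8 * KB  # assumed file system block size


def blocks_for(size: int, block: int = BLOCK) -> int:
    """Blocks needed to hold `size` bytes, rounded up."""
    return (size + block - 1) // block


def pack(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate files and return (container, index of name -> offset, length)."""
    index: dict[str, tuple[int, int]] = {}
    parts: list[bytes] = []
    offset = 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    return b"".join(parts), index


def read_file(container: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Read one file straight from its offset; no unpacking of the others."""
    offset, length = index[name]
    return container[offset:offset + length]


# Fifty hypothetical 2k thumbnails.
thumbs = {f"thumb_{i:02d}.jpg": bytes([i]) * (2 * KB) for i in range(50)}

container, index = pack(thumbs)
unpacked_blocks = sum(blocks_for(len(data)) for data in thumbs.values())
packed_blocks = blocks_for(len(container))
```

With these assumptions the 50 thumbnails drop from 50 blocks to 13, and a read is a single offset lookup followed by one slice, which is why packing need not slow reads down.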
Digital Optimization Architecture
Delivering this level of data reduction requires time to analyze and optimize the data. The optimizer is not designed for use on real-time data; instead it runs as a separate process that can be scheduled during off-peak hours against data that has not been accessed in a few days. This timing is user customizable and can typically be set by last access time, last modify time, or file size. Ocarina’s out-of-band optimizer is typically delivered as a stand-alone appliance or as software installed on the customer’s dedicated servers, but companies like BlueArc and HP have embedded it directly into their NAS storage offerings.
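A selection policy of this kind can be sketched as a scan that picks files by idle time and size. The example below is an assumption-laden illustration rather than the product's policy engine: the function name `select_candidates`, its parameters, and the three-day default are hypothetical, and it relies on the file system reporting access times (some systems are mounted so that access times are not updated).

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def select_candidates(root, min_idle_days=3.0, min_size=0, now=None):
    """Return files under root untouched for min_idle_days and >= min_size bytes."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * SECONDS_PER_DAY
    picked = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        st = path.stat()
        # Treat the later of last access and last modify as "last touched".
        last_touched = max(st.st_atime, st.st_mtime)
        if last_touched <= cutoff and st.st_size >= min_size:
            picked.append(path)
    return picked


# Example call (the path is a placeholder, not a real location):
# candidates = select_candidates("/srv/photos", min_idle_days=3, min_size=1024)
```

A scheduler such as cron could run a scan like this during off-peak hours and hand the resulting list to the optimizer.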
Once the optimizer has completed its data reduction process, the file can be stored exactly where it was or it can be migrated off to a disk archive storage area.
When the file needs to be read out of its compressed format, a lightweight reader is used. Ocarina calls this the Ocarina ECOreader. There is minimal if any performance penalty when reading back the data because all the steps required to expand the data were worked out when the file was originally reduced. The ECOreader does not require a software agent to be installed on the client and can be deployed in a variety of ways depending on the storage architecture, providing transparent access to optimized files.
The implementation of the solution is typically seamless and requires minimal, if any, change to day-to-day operations.
Near Instant Return on Investment
Content-aware primary optimization solutions such as those from companies like Ocarina Networks allow digital content providers to reap the same benefits from primary storage optimization as traditional data centers. In fact, for digital content providers those benefits can be considerably larger. One Internet provider reduced its image data footprint by 30% by deploying the Ocarina ECOsystem.
Primary storage purchases are an ongoing, fast-growing expense and are difficult to predict for many digital content providers. The ability to reduce the capacity requirements of this data by up to 90% can make capacity available immediately to address current needs without having to spot purchase additional storage.
For example, where 50 TB of digital content is being stored prior to optimization, it can be reduced to 5 TB, freeing up 45 TB of capacity and allowing more data to be stored without an increase in power and cooling costs. Depending on the growth rate of the service, this could be sufficient for all of 2009. This creates a double ROI by allowing the business to expand without requiring additional primary storage assets. By deploying the Ocarina ECOsystem, one Internet media company reduced its yearly capital expenditure costs by 45%, and a major Internet company reduced its power and cooling costs by $300,000.
Since solutions like Ocarina’s allow access to the data in its compressed and deduplicated format, the amount of data that must be protected is also reduced by up to 90%. This means backing up less data over the network, storing less data on the disk backup tier, and storing less data on the eventual tape tier, delivering ROI on three levels. In cases where digital content providers use replication instead of traditional backup, shrinking the data on primary storage offers three benefits: space savings on the primary storage, less network bandwidth to copy data to the replica site, and less storage space at the replica site.
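The capacity figures in this example follow from simple arithmetic, sketched below. The function name `optimized_footprint` is invented for the illustration, and the reduction percentages are the ones quoted in the text, not measurements.

```python
def optimized_footprint(raw_tb: float, reduction_pct: float) -> tuple[float, float]:
    """Return (TB stored after optimization, TB freed) for a given reduction."""
    stored = raw_tb * (100 - reduction_pct) / 100
    return stored, raw_tb - stored


# The 90% case from the example: 50 TB shrinks to 5 TB, freeing 45 TB.
stored_tb, freed_tb = optimized_footprint(50, 90)

# The more modest 30% reduction quoted for one Internet provider.
stored_30, freed_30 = optimized_footprint(50, 30)

print(f"90% reduction: {stored_tb} TB stored, {freed_tb} TB freed")
print(f"30% reduction: {stored_30} TB stored, {freed_30} TB freed")
```

The actual reduction depends heavily on the data set, so the percentage should be treated as an input to be measured rather than a guaranteed result.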
In addition to data reduction, many companies are implementing an archive tier of storage. This leverages the usage trend seen at most organizations, where data is no longer regularly accessed after 90 days. The trend is similar at digital content providers and may in fact be more extreme, as content is often uploaded and never accessed again.
This archive tier is often more cost effective than primary storage and adds a high degree of scalability. In some cases it may offer data reduction of its own, but that optimization is as ineffective for digital content providers as the generic data reduction solutions are. Solutions like Ocarina’s can not only analyze, compress, and deduplicate the primary storage tier but also move that data, in its optimized form, to the secondary tier.
An optimized archive tier completes the triple cost-savings advantage of primary storage optimization: reduced primary storage acquisition costs and frequency, reduced investment in backup infrastructure, and a fully optimized archive tier. Because it delivers ROI on these three vectors, the investment costs of primary storage optimization are quickly and completely recovered. A major Internet media company that deployed the Ocarina ECOsystem recovered its investment costs within 6 months.
Tuesday, January 13, 2009
Coping with the Rising Cost of Storing Digital Image Content
Cost containment is the top priority in 2009 for many data centers, and many of them view reducing the ongoing investment in primary data storage as one of the best ways to achieve it. As a result, primary storage capacity optimization is a high-priority project for many data centers. This is particularly true for data centers with large amounts of image data, such as online photo sites, because this kind of data consists of large files and has a high year-to-year growth profile. One growing Internet media company reported that its storage is growing at 40% per year. Optimization solutions that reduce the amount of space those large files consume can therefore provide very high storage cost savings in image-centric data centers.
Primary storage optimization delivers an almost instant ROI by: