Foundational Storage Tier


The first step in Dedupe 2.0 is to establish a foundational repository. This tier of storage will be the physical landing place for all the deduplicated, or optimized, data in the environment as it is cleared off of primary or secondary storage. It may also be the initial destination for data that is known upfront to be unlikely to be accessed again.


This platform should also provide a base level of deduplication so it can receive data from sources that are not pre-optimized. A significant amount of this data will come from non-optimized backups, application-specific backup and/or archive utilities, and user-invoked copies made for ad-hoc retention.


In addition, these foundational repositories should have the ability to compress data as well as deduplicate it. There will be cases where the data being stored on this tier is unique enough that it may not achieve high deduplication rates. Unlike the Dedupe 1.0 era, where solutions were used almost exclusively for backup storage, Dedupe 2.0 may be used for many data repositories and applications in addition to backup. As a result, these repositories need to compress data as well as deduplicate it. Deduplication achieves efficiencies only on redundant data; compression achieves efficiencies on practically all data.
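The interplay between the two techniques can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: incoming data is split into fixed-size chunks, each chunk is fingerprinted, only previously unseen chunks are stored, and every unique chunk is additionally compressed so that even non-redundant data consumes less capacity. The chunk size and in-memory store are assumptions made for brevity.

```python
import hashlib
import zlib

CHUNK_SIZE = 8 * 1024  # fixed-size chunking for simplicity; real products often use variable-size chunks


class DedupeCompressStore:
    """Toy repository that deduplicates, then compresses, incoming data."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> compressed chunk bytes (stored once)
        self.recipes = {}  # object name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:                    # deduplication: redundant chunks cost nothing extra
                self.chunks[fp] = zlib.compress(chunk)   # compression: shrinks even unique data
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name):
        return b"".join(zlib.decompress(self.chunks[fp]) for fp in self.recipes[name])
```

Highly redundant data collapses in the `chunks` dictionary, while data that never repeats still benefits from the `zlib` step, which is the point the paragraph above makes.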


The storage component of Dedupe 2.0 will need the ability to store data from multiple sources, possibly including another deduplication product. Software applications that perform their own deduplication are now available from email archive vendors, database archive vendors and file movement vendors. There are also software providers that can add deduplication functionality to primary storage.


As this data is optimized, it needs to be stored on a device that won't negate the deduplication work already done, that can optimize the data further where possible, and that can retain the information for the required timeframe.


The foundational storage tier will need the ability to scale quickly and seamlessly without impacting user access to data. With the technology available today, this is likely to be accomplished by a clustered storage model where additional capacity can be added via nodes, like Lego blocks. The typical implementation would recognize the new storage, integrate it into the system and begin to load data in the background without taking the system offline.


The storage cluster should also allow for mixed node sizes. This will enable the foundational storage tier to take advantage of new drive capacities, and thus new node capacities, as they become available. The higher the capacity per node, the less expensive the foundational tier becomes. An inability to mix node sizes will prevent a customer from enjoying those savings and realizing the benefits of new technology densities, performance and costs.


Finally, the storage cluster should be able to support generationally different nodes. This will allow nodes to be upgraded in processing power, network bandwidth and power efficiency. Mixing generations allows the customer not only to improve the performance of the storage cluster but also to decommission nodes as they become too old to participate in the cluster.


Mixing nodes by capacity and generation also allows the storage platform to be upgraded without a fork-lift data migration. The total capacity of the foundational storage tier may become so great that there is no practical way to migrate its data to another platform. If the storage cluster can automatically rebalance data across its nodes, then migration happens in real time as new nodes and increased capacity are added. Old nodes can then be decommissioned and removed from the cluster, and once they are gone the cluster automatically rebalances itself.
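One way to picture this rebalancing behavior is capacity-weighted placement: each chunk is assigned to a node by a weighted hash, so a denser node naturally attracts proportionally more data, and changing the node set only relocates the chunks whose placement actually changes. The sketch below uses weighted rendezvous hashing; the node names and capacities are hypothetical, and this is an illustration of the concept rather than how any particular cluster works.

```python
import hashlib
import math


def place(chunk_id, nodes):
    """Pick a node via weighted rendezvous hashing; larger-capacity nodes win proportionally more chunks."""
    def score(node, capacity):
        digest = hashlib.sha256(f"{node}:{chunk_id}".encode()).hexdigest()
        u = (int(digest, 16) + 1) / (2**256 + 1)   # pseudo-random value in (0, 1) derived from node+chunk
        return -capacity / math.log(u)
    return max(nodes, key=lambda n: score(n, nodes[n]))


def rebalance(chunks, old_nodes, new_nodes):
    """Return only the chunks that must move when nodes are added, grown or retired."""
    return [c for c in chunks if place(c, old_nodes) != place(c, new_nodes)]


# Hypothetical example: add a denser new-generation node, then retire an old one.
chunks = [f"chunk-{i}" for i in range(10000)]
gen1 = {"node-a": 12, "node-b": 12, "node-c": 12}           # capacities in TB
gen2 = {**gen1, "node-d": 48}                               # new, higher-capacity node joins
print(len(rebalance(chunks, gen1, gen2)), "chunks migrate onto node-d in the background")
retired = {k: v for k, v in gen2.items() if k != "node-a"}  # decommission the oldest node
print(len(rebalance(chunks, gen2, retired)), "chunks drain off node-a before removal")
```

The key property is that only the affected fraction of data moves, which is what makes in-place generational upgrades practical instead of a wholesale migration.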


As data is placed on this foundational tier, it may need to be retained for a specific period of time based on corporate or government regulations. The foundational tier should therefore have the ability to secure data via encryption and WORM protection methods. There should also be the ability to define rules, or policies, that allow fine-grained customization of what data needs to be retained and for how long.
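A minimal sketch of how such retention policies might be expressed and evaluated is shown below. The classifications, retention periods and WORM/encryption flags are hypothetical examples, not a description of any product's policy engine.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RetentionPolicy:
    name: str
    retain_days: int   # how long the data must be kept
    worm: bool         # immutable (write-once-read-many) during the retention window
    encrypt: bool      # encrypt at rest


# Hypothetical policies keyed by data classification.
POLICIES = {
    "financial-records": RetentionPolicy("financial-records", 7 * 365, worm=True, encrypt=True),
    "project-archive":   RetentionPolicy("project-archive", 3 * 365, worm=False, encrypt=True),
    "general":           RetentionPolicy("general", 365, worm=False, encrypt=False),
}


def can_delete(classification: str, stored_on: date, today: date) -> bool:
    """Data may only be destroyed once its policy's retention window has expired."""
    policy = POLICIES.get(classification, POLICIES["general"])
    return today >= stored_on + timedelta(days=policy.retain_days)


print(can_delete("financial-records", date(2009, 1, 1), date(2012, 1, 1)))  # False: still under a 7-year hold
```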


The foundational storage tier should also be the final resting place for data as it ages out of backup policies. Those backup policies can then be adjusted so that data is stored for a much shorter period of time. Backup should be relied upon to recover the latest copy or two of data, not used as a foundational storage tier. The advantage is further cost savings from the decreased volume of backup data, but more importantly, assured destruction.


At some point in the life cycle of data it can, and should, be destroyed. Data centers can no longer afford to retain everything forever, even with this less expensive storage foundation. It is unlikely that the "Where do you want to go to lunch?" email needs to be retained for the next 100 years. The only way to ensure that destruction is a policy that makes the foundational tier the only container for aging data.


In fairness, this level of classification is beyond the scope of what is expected of the foundational tier. To achieve this level of data awareness, the foundational tier should be able to work easily with other data classification tools. If the foundational tier has an open access protocol like NFS/CIFS, then any off-the-shelf IT discovery or classification tool can interface with it without special API development on the part of the tool's developer.
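Because access is through a standard file protocol, a classification or discovery tool only needs a mounted path to do its work. The short sketch below illustrates the idea; the mount point and the age threshold are assumptions chosen for the example.

```python
import os
import time

MOUNT_POINT = "/mnt/foundation-tier"   # hypothetical NFS/CIFS mount of the repository
STALE_AFTER = 10 * 365 * 24 * 3600     # example threshold: untouched for roughly ten years


def find_destruction_candidates(root):
    """Walk the repository over its standard file interface and flag aged data;
    no vendor-specific API is required."""
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > STALE_AFTER:
                yield path


for candidate in find_destruction_candidates(MOUNT_POINT):
    print("destruction candidate:", candidate)
```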


As organizations implement a foundational tier and embrace the concept of a final resting point for data, it becomes critical that this data be replicated to another facility housing a mirror copy of the data set. The foundational storage platform should use its own replication software, which understands the data that has been compressed and deduplicated and compares it to what is already stored at the DR site. This allows the replication job to send only unique data, optimizing WAN bandwidth as well as remote storage requirements.
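The effect of deduplication-aware replication can be reduced to a simple exchange: the source learns which chunk fingerprints the DR site already holds and transmits only the missing chunks. The function below is an illustrative sketch under that assumption, not a real product's replication API.

```python
def replicate(source_chunks, dr_site_fingerprints):
    """source_chunks: dict of fingerprint -> compressed chunk bytes at the source.
    dr_site_fingerprints: set of fingerprints already stored at the DR site.
    Returns only the chunks that actually need to cross the WAN."""
    missing = {fp: blob for fp, blob in source_chunks.items()
               if fp not in dr_site_fingerprints}
    saved = len(source_chunks) - len(missing)
    print(f"sending {len(missing)} chunks; {saved} already present at the DR site")
    return missing


# Hypothetical usage: the fingerprints could come from a chunk store like the one sketched earlier.
to_send = replicate({"a1": b"...", "b2": b"...", "c3": b"..."}, dr_site_fingerprints={"a1", "b2"})
```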


For now, Dedupe 2.0 will have many initiators with specific expertise in deduplicating certain data types. While these may consolidate in the future, it is critical that, as a first step on the road to deduplication's next era, organizations establish a foundational tier of storage that can act as a deduplicated catch-all repository for the various outputs of these initiators.

George Crump, Senior Analyst