Solving the Storage I/O Performance Bottleneck
There is a dark cloud looming in storage. Over the last decade, conventional storage platforms have kept up with the demand for ever higher capacity at a lower cost per GB. The real specter on the horizon, however, is severe and inevitable performance degradation. This is critical because most organizations rely on a scalable facility for servicing I/O to rapidly deliver the information that drives revenue generation. Indeed, resolving the storage I/O performance bottleneck becomes even more critical in a soft economy, when profits are the most elusive.
Thursday, June 25, 2009
As we covered in our article "What's Causing the Storage I/O Bottleneck", multi-tenant workloads are at the heart of the problem. Multi-tenant workloads, also known as concurrent or aggregate workloads, are those in which data on a shared storage resource is accessed concurrently by multiple users or applications. This type of activity is no longer relegated to a few isolated companies whose performance demands are on the fringe of mainstream data center environments. In fact, any organization that is deploying a server virtualization project has, by definition, a multi-tenant workload demand.
In the case of server virtualization, where multiple virtual servers generate numerous, near-simultaneous I/O requests, the hypervisor essentially turns into a storage I/O mixing bowl: requests from many virtual servers are interspersed, all demanding attention from storage at almost the same time.
Multi-tenant workloads have moved well beyond the traditional sphere of simulation jobs often found in chip design or the SEG-Y data in Oil and Gas companies. There are many others, including DNA sequencing in bioinformatics, engine and propulsion testing in manufacturing, surveillance image processing in government, high-definition video in Media and Entertainment, and many of the Web 2.0-driven projects.
The performance demands of multi-tenant workloads increase exponentially in nearly every organization once they are unleashed into full production. The fact is, traditional storage workarounds, whether they are in highly customized SANs or high-speed NAS, are either too complex, too expensive or are, at best, band-aids for fixing storage I/O performance problems. As a result, the storage administrator is forced into the unenviable position of delivering a solution that his/her users must "live with".
A performance bottleneck that users have to "live with" can cost companies revenue, customers, or a competitive advantage, all of which may adversely affect profits and long-term viability.
For many storage engineers the gold standard for improving performance is simply adding more hard disk drive mechanisms to the storage system. This approach only works, however, as long as there are more outstanding requests to the storage system than there are drives to service them; under that condition, storage performance continues to scale as drives are added. In this scenario, most server applications eventually become their own bottleneck, because at some point they cannot generate enough requests to keep additional drives busy. The challenge that multi-tenant workloads introduce is that they can easily generate more requests than conventional storage systems can support, regardless of drive count. Essentially, the bottleneck moves from a lack of available disk drive mechanisms for servicing I/O to the storage controller or NAS head itself.
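The shifting bottleneck described above can be illustrated with a very rough model in which delivered IOPS is the minimum of three things: what the applications offer, what the drives can service, and what the controller or NAS head can process. This is a minimal sketch, not a measurement; the function name and every figure in it (150 IOPS per drive, a 30,000 IOPS controller ceiling, the two offered loads) are assumptions chosen only for illustration.

```python
def delivered_iops(drive_count, iops_per_drive, controller_limit, offered_load):
    """Delivered IOPS is capped by the slowest of three stages."""
    drive_capacity = drive_count * iops_per_drive
    return min(offered_load, drive_capacity, controller_limit)


# Hypothetical figures, not taken from any real product.
IOPS_PER_DRIVE = 150
CONTROLLER_LIMIT = 30_000
DRIVE_COUNTS = (40, 80, 160, 320)

# A single application that can only issue 12,000 IOPS: adding drives
# stops helping once the drives can absorb everything it generates.
single_app = [
    delivered_iops(n, IOPS_PER_DRIVE, CONTROLLER_LIMIT, 12_000)
    for n in DRIVE_COUNTS
]

# A multi-tenant load offering 100,000 IOPS: the drives scale until the
# controller ceiling is reached, after which more drives add nothing.
multi_tenant = [
    delivered_iops(n, IOPS_PER_DRIVE, CONTROLLER_LIMIT, 100_000)
    for n in DRIVE_COUNTS
]

print(single_app)    # [6000, 12000, 12000, 12000]
print(multi_tenant)  # [6000, 12000, 24000, 30000]
```

In the first series the application is the limit; in the second, the last doubling of drives yields only the remaining controller headroom, which is the situation the multi-tenant case creates.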
Multi-tenant workloads also demand an extensive amount of data sharing between tenants or the servers hosting the tenants. As a result, the ideal platform should be a NAS, which is designed specifically to share data. The problem, as we discussed in our article "Seeking a High IOPS NAS", is that traditional NAS architectures have many built-in limitations that impede performance, scalability and reliability.
Ironically, the solution for this problem is found within the very same high I/O workloads that created it to begin with: server virtualization and/or grid compute environments. These environments allow multiple tenant applications to either live on a single physical server or allow a single application to scale across many servers. The same architecture design is now available for storage. In fact, companies like Isilon Systems are providing scale-out NAS built on a clustered architecture that allows for a scalable, high-IOPS NAS to address both short-term and long-term storage I/O performance bottlenecks.
Symmetric Architecture
The first step in designing scale-out NAS storage is to base it on a symmetrical architecture that enables a series of nodes to be grouped together to act as a single entity. In this clustered architecture, each node is an industry-standard server equipped with SAS drives and network connections. The nodes are bonded together through either an InfiniBand network or an IP network.
Symmetrically Aware File System
Individual nodes are united through software intelligence to create a symmetrically aware file system capable of leveraging disparate components into a single entity. When this file system is applied to the hardware nodes, it creates a high IOPS NAS cluster that can address the challenges of today's – and tomorrow's – multi-tenant workloads.
Eliminating the Storage Compute Bottleneck
As stated earlier, with multi-tenant workloads no matter how large a traditional storage system is scaled, no matter how many drives are used, eventually the I/O capabilities of the storage compute engine become the bottleneck. The value of a scale-out NAS configuration is that the storage compute engine is no longer confined to a single system and a single set of controllers or heads, as is the case in a traditional NAS.
With a symmetrically aware file system, such as OneFS from Isilon, each node in the cluster provides storage compute resources. In fact, the file system can also make sure that all the nodes in the cluster actively participate. By comparison, some clustered storage solutions must designate a primary set of nodes, typically two, for each request. While these systems benefit from the redundancy of a cluster, they often have the same performance bottleneck as a traditional NAS.
With a cluster-aware file system, each file is broken down into small blocks and those blocks are distributed throughout nodes on the cluster. As a result, when a file is needed, multiple nodes in the cluster are able to deliver the data back to the requesting user or application. This dramatically improves overall performance, especially when hundreds, if not thousands, of these requests are made simultaneously from a multi-tenant application.
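The idea of breaking a file into blocks and spreading them over nodes can be sketched as follows. This is an illustrative toy, assuming simple round-robin placement and a tiny block size; the article does not describe the actual placement algorithm of OneFS or any other product, and the names `stripe` and `read_back` are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # bytes; deliberately tiny so the layout is easy to inspect


def stripe(data: bytes, node_count: int, block_size: int = BLOCK_SIZE):
    """Split data into blocks and place them round-robin across nodes."""
    layout = defaultdict(dict)  # node id -> {block index: block bytes}
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        layout[index % node_count][index] = data[offset:offset + block_size]
    return layout


def read_back(layout):
    """Gather blocks from every node and reassemble them in block order."""
    blocks = {}
    for node_blocks in layout.values():
        blocks.update(node_blocks)
    return b"".join(blocks[i] for i in sorted(blocks))


payload = b"multi-tenant workload data"
layout = stripe(payload, node_count=3)

# How many blocks each node holds, and therefore how much of the read
# each node can serve when the file is requested.
blocks_per_node = {node: len(blocks) for node, blocks in sorted(layout.items())}
restored = read_back(layout)

print(blocks_per_node)
print(restored == payload)
```

Because every node holds part of the file, a read can be served by all of them rather than by one controller pair, which is the property the paragraph above relies on.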
Compared to traditional storage solutions where performance flattens out long before the system reaches its theoretical maximum drive count, the symmetrical design of a scale-out NAS system allows performance to scale linearly as nodes are added. Each node in the cluster delivers additional storage capacity in the form of drives, additional cache memory, storage network I/O in the form of intra-cluster connections and additional connections out to the users. What’s more, each node contains additional processing power capable of addressing both external requests and internal ones (replication, backup, snapshot management and data protection).
Beyond Enterprise Reliability
Since multi-tenant environments often support hundreds of applications, it is critical that they provide levels of reliability beyond the standard “five 9’s” offered by enterprise-class storage systems. A failure in storage can affect hundreds of applications or the performance of a mission-critical, revenue-generating compute cluster. These workloads also can't be subject to a one-size-fits-all RAID protection scheme. Some applications or even specific files may demand specialized data protection so they can remain operational through multiple drive or node failures.
The symmetrical nature of a clustered high-IOPS NAS typically delivers beyond enterprise-class reliability. First, there is the inherent value of any clustered environment due to the redundant nodes. When coupled with a file system that is fully storage cluster-aware, like Isilon's OneFS, the platform can deliver granular levels of protection at an application or even file level. This allows not only for data availability and accessibility in the case of multiple drive or node failures, but also for rapid recovery.
Conventional storage systems that are limited to two controllers or heads must process the rebuilding of a failed drive in conjunction with their other tasks. As a result, these systems can take 10+ hours to rebuild one of today's high-capacity drives. Furthermore, under a full load, the time to recover can increase to 20 hours or more. Multi-tenant workloads by their very nature are almost always under a full load and, as a result, incur the worst-case rebuild times.
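Rebuild times of this order follow from simple arithmetic: drive capacity divided by the sustained rate at which the controller can rewrite it. The sketch below is a back-of-the-envelope illustration only; the 1 TB capacity and the 25 MB/s and 12.5 MB/s rebuild rates are assumed values picked to land in the range quoted above, not figures from the article or from any vendor.

```python
def rebuild_hours(drive_capacity_gb: float, rebuild_rate_mb_s: float) -> float:
    """Hours needed to rewrite a whole drive at a sustained rebuild rate."""
    seconds = (drive_capacity_gb * 1000) / rebuild_rate_mb_s  # GB -> MB
    return seconds / 3600


# Assumed values: a 1 TB drive, rebuilt at 25 MB/s on a lightly loaded
# system, and at half that rate when the controller is busy serving I/O.
idle = rebuild_hours(1000, 25)
loaded = rebuild_hours(1000, 12.5)

print(f"lightly loaded: {idle:.1f} h")
print(f"under full load: {loaded:.1f} h")
```

Halving the rate the controller can spare for the rebuild doubles the exposure window, which is why a system that is always under load sits at the worst case.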
Cost-Effective - Pay as you Grow Performance
Finally, performance has to be cost-effective and justifiable. As stated in the earlier article, theoretical high-end storage systems provide a top level of performance by utilizing specialized and expensive processors. Until the environment's storage performance demands scale to match the capabilities of these processors, the investment in them represents wasted capital. Ironically, soon after the environment’s performance demands have matched the capability of the specialized and expensive processor, they quickly scale right past the capabilities of that processor. What is needed is a solution that can start with a small footprint and scale modularly to keep pace with the growth of the environment.
Scale-out NAS is the embodiment of a pay-as-you-grow model. The cluster can start at nearly the precise size required to meet the performance and capacity demands of the environment, allowing the upfront investment to match the current need. Then, as the environment grows, nodes can be added, each of which improves every aspect of the storage cluster: capacity, storage performance and storage network performance.
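The pay-as-you-grow sizing logic can be sketched as a small calculation: the cluster needs enough nodes to satisfy whichever demand, performance or capacity, is larger at each stage. All of the numbers here (10,000 IOPS and 12 TB per node, a three-node minimum, and the three yearly demand figures) are hypothetical and exist only to make the example concrete.

```python
import math


def nodes_needed(required_iops, required_tb, iops_per_node, tb_per_node,
                 minimum_nodes=3):
    """Smallest node count that meets both performance and capacity needs."""
    by_performance = math.ceil(required_iops / iops_per_node)
    by_capacity = math.ceil(required_tb / tb_per_node)
    return max(by_performance, by_capacity, minimum_nodes)


# Hypothetical per-node capabilities and a hypothetical growth forecast.
IOPS_PER_NODE = 10_000
TB_PER_NODE = 12
forecast = [
    ("year 1", 25_000, 30),   # (label, required IOPS, required TB)
    ("year 2", 60_000, 50),
    ("year 3", 90_000, 130),
]

plan = [
    nodes_needed(iops, tb, IOPS_PER_NODE, TB_PER_NODE)
    for _, iops, tb in forecast
]

for (label, _, _), count in zip(forecast, plan):
    print(f"{label}: {count} nodes")
```

With these assumed inputs the cluster grows from 3 to 6 to 11 nodes, with the purchase at each step tracking the demand at that point rather than a projected peak.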
Most importantly, like their compute cluster brethren, NAS storage clusters can take advantage of industry-standard hardware to keep costs down and enable the NAS cluster vendor to focus on the abilities of the storage system, not on designing new chips.
The challenge of multi-tenant workloads can ideally be addressed by a highly scalable but cost-effective NAS. A scale-out NAS storage system allows you to leverage the simplicity of NAS sharing, keeping IT efficiency at a maximum, while at the same time outpacing the performance capabilities of most legacy storage architectures. Any organization planning to deploy a multi-tenant workload, whether it is as common as a virtualized server environment or a more specialized revenue-generating compute cluster, should closely examine a scale-out NAS solution to fulfill its storage needs.