What’s Causing the Storage I/O Bottleneck?
Of the three elements of data center performance (compute, network and storage), two have kept pace with growing digital demand. Compute power has kept pace via increased performance and core density, as well as increased intelligence through server virtualization and scale-out clustering or grid infrastructures. Networks have similarly kept pace with increased bandwidth and intelligent use of that capacity through QoS, prioritization and efficient use of wide area connectivity.
Meanwhile, storage performance has not kept pace. Instead it has remained frozen in the same architectural design for at least a decade: a high-performance SAN or NAS controller pair that drives an increasing number of disks. While increasing the number of drives can improve performance, there is a limit to the number of drives these controller pairs can support, as well as a limit to the amount of inbound traffic they can sustain. This controller (SAN) or head (NAS) is now the primary bottleneck limiting improved storage performance.
Storage I/O vs. Multi-tenant Workloads
To compound this problem, the workload is now changing. Workloads are now multi-tenant, with multiple shared servers and networks trying to access storage through this outdated model. Prior to multi-tenant workloads, a single application coming from a single server could only create a limited number of requests. Multi-tenant workloads, running either through multiple virtual machines on a single physical server or through a single application scaled across many physical servers in a cluster or grid, can now generate hundreds if not thousands of requests for storage I/O.
The impact is that these requests saturate the storage controller or head and the applications or servers have to wait for it to catch up, which in turn delays processing, eventually costing the company money.
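The saturation effect described above can be sketched with a simple single-queue (M/M/1) approximation. This is an illustrative model rather than a measurement: the function names, the 50,000 IOPS controller ceiling and the 2,000 IOPS per-VM figure are all assumptions chosen for the example.

```python
def controller_utilization(clients, iops_per_client, controller_iops):
    """Fraction of the controller's capacity consumed by the offered load."""
    return clients * iops_per_client / controller_iops


def mean_response_ms(clients, iops_per_client, controller_iops):
    """Mean response time in ms under an M/M/1 approximation.

    Returns infinity once the offered load meets or exceeds what the
    controller can serve, because the queue then grows without bound.
    """
    offered = clients * iops_per_client
    if offered >= controller_iops:
        return float("inf")
    # M/M/1 mean time in system is 1 / (service_rate - arrival_rate) seconds.
    return 1000.0 / (controller_iops - offered)


if __name__ == "__main__":
    CONTROLLER_IOPS = 50_000  # assumed ceiling of one controller pair
    VM_IOPS = 2_000           # assumed load generated by each virtual machine
    for vms in (1, 10, 20, 24, 30):
        util = controller_utilization(vms, VM_IOPS, CONTROLLER_IOPS)
        resp = mean_response_ms(vms, VM_IOPS, CONTROLLER_IOPS)
        print(f"{vms:>2} VMs: utilization {util:.0%}, mean response {resp:.3f} ms")
```

Under these assumed numbers, 20 VMs push the controller to 80% utilization, and 30 VMs offer more load than it can serve, so requests simply wait.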
A multi-tenant workload is one that typically has multiple owners or users at any given point in time. The presence of these multi-tenant workloads is increasing in quantity and in capacity. They are no longer uniquely restricted to a limited number of enterprises but are in fact very common in some form in almost every enterprise today. Many enterprises now have multiple sources of these workloads.
At a minimum, any organization implementing server virtualization today has multi-tenant workloads, in some cases 20 or 30 virtual servers running on a single physical server. NAS storage systems have become a preferred method for delivering storage services to the virtual hosts, and the access patterns of the virtual machines are inherently random. Storage performance scaling becomes critical in virtual environments when one or more virtual machines begin to consume all the available storage I/O resources. That adversely affects performance across all the other virtual machines on that host, creating a domino effect of lowered performance and lowered confidence in the virtualization project.
Beyond the very common virtual server use case, there is also a rise in the more traditional case of multi-tenant workloads: multiple processing servers grinding through a job. These workloads are not limited to the common examples of simulation jobs, such as those found in chip design, or the processing of SEG-Y data in the energy sector. There are many others: DNA sequencing in bioinformatics, engine and propulsion testing in manufacturing, surveillance image processing in government, high-definition video in media and entertainment, and many Web 2.0-driven projects.
Storage I/O performance is critical in these environments because work essentially stops while the processing or simulation job completes. When these jobs stop, so typically does the organization’s ability to create revenue. To get around these delays, the timing of job runs becomes critical to minimizing user impact, but even with the best planning possible, user productivity will suffer. When that productivity suffers, so does organizational profitability.
Another compounding factor is that all of these data sets have increased in complexity in recent years, becoming more granular, shifting to three dimensions, or significantly increasing color depth. This granularity not only increases the physical size required to store this data but also the processing and storage I/O required to create, modify, analyze or test the data.
In all cases reliable, predictable, scalable storage I/O performance is critical.
For example, an integrated circuit designer may need to run a simulation on a particular chip design. As with other environments, this data set is becoming significantly more complex and detailed. In the case of chip design, the chips become smaller or the number of functions on the chip increases. There is a tremendous need in these environments for synthesis and regression testing. As a result, the time required to process a simulation of the chip grows longer and longer. It is not uncommon for this type of job to take from three days to an entire month to run. There are two bottlenecks in this process: the time required for the CPU to process the data and the time for the storage to read and write the simulation scenarios.
In the virtual server example, the I/O patterns of the VMs are almost purely random by nature. While the virtual machines don't have to wait for a particular VM to finish its task, if one VM becomes busy it can dramatically impact the performance of the other systems. As in the case of simulation-type workloads, the storage I/O pattern on these systems is as large as it is random.
The Storage I/O Bottleneck
While it is necessary to address all of the performance bottlenecks (compute, network and storage), most of the challenge in these environments is handling the storage bottlenecks. The compute bottlenecks are well understood and can be dealt with by allocating a higher quantity of faster processors through techniques like clustered and grid computing, or by simply leveraging Moore's law. Networking has in similar fashion increased bandwidth via techniques like trunking or multi-homing. These techniques are well understood and will adequately handle the compute and network elements of the bottleneck.
What is lacking from most storage manufacturers is a similar scale-out model. Current dual-controller systems, especially many NAS-based systems, quickly become saturated by these workloads. Because of its shared nature, Network Attached Storage (NAS) should be an ideal storage platform for multi-tenant workloads. Unfortunately, these workloads combine highly random data access patterns with a very high number of storage I/O requests, coming either from a single server hosting many virtual machines or from a single application spread across multiple physical servers. As a result, single or even clustered NAS heads and ports can quickly become a severe bottleneck.
The result is that many organizations turn to a SAN, which is not shared by nature and is not as easy to manage as a single NAS file system. It too will still lead to bottlenecked storage performance, which again slows the business down, limits employee productivity and eventually loses the organization money, while also adding greater complexity to an already complex environment.
In either the SAN or NAS case, a job run tends to generate a significant number of sequential and random writes while at the same time requiring an equally large number of very random reads. This is a deadly combination that renders most cache on these storage systems useless, because the cache is too small to achieve a high rate of cache hits. The result is that, in addition to these bottlenecks, most if not all requests have to be served from the drive mechanism rather than the cache in front of it, further lowering performance.
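The reason a small cache helps so little with random reads can be shown with a rough calculation. The sketch below assumes reads are spread uniformly at random over the working set, so the expected hit ratio is simply cache size divided by working-set size. The function names and the latency figures (0.1 ms for a cache hit, 8 ms for a disk read) are illustrative assumptions, not vendor data.

```python
def hit_ratio(cache_gb, working_set_gb):
    """Expected cache hit ratio for uniformly random reads over the working set."""
    return min(1.0, cache_gb / working_set_gb)


def effective_read_ms(cache_gb, working_set_gb, cache_ms=0.1, disk_ms=8.0):
    """Average read latency as a hit-ratio-weighted mix of cache and disk latency.

    cache_ms and disk_ms are assumed, illustrative service times.
    """
    h = hit_ratio(cache_gb, working_set_gb)
    return h * cache_ms + (1.0 - h) * disk_ms


if __name__ == "__main__":
    # Assumed example: 16 GB of controller cache in front of a 4 TB working set.
    print(f"hit ratio: {hit_ratio(16, 4000):.3%}")
    print(f"effective read latency: {effective_read_ms(16, 4000):.3f} ms")
```

With an assumed 16 GB of cache in front of a 4 TB randomly accessed working set, fewer than 1 in 100 reads would be served from cache, so the average read costs nearly a full disk access.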
Solving the Storage I/O Problem
As these workloads become more prevalent across the enterprise, the ideal solution is to solve the NAS bottleneck and establish an easy-to-manage, high-performance NAS infrastructure. In fact, for many organizations it has become an absolute imperative.
One potential solution is to apply the same methodology behind clustered computing to the storage I/O platform: build a scale-out NAS solution that increases storage I/O performance and storage I/O bandwidth in parallel. This would allow the environment to scale as the workload demands. It would also allow coherent use of memory across the NAS solution, creating a very large but cost-effective cache. Finally, it would keep the inherent simplicity of a NAS environment as opposed to the more complex shared SAN solution.
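As a back-of-the-envelope sketch of the scale-out idea, the snippet below assumes ideal linear scaling, where every added node contributes its full I/O capability and its memory to a shared cache. Real clusters lose some of that to coordination overhead, and the function names and per-node figures here are hypothetical.

```python
def cluster_capacity(nodes, iops_per_node, cache_gb_per_node):
    """Aggregate IOPS and cache of a scale-out cluster, assuming linear scaling."""
    return {
        "iops": nodes * iops_per_node,
        "cache_gb": nodes * cache_gb_per_node,
    }


def nodes_needed(target_iops, iops_per_node):
    """Smallest node count whose aggregate IOPS meets the target (ceiling division)."""
    return -(-target_iops // iops_per_node)


if __name__ == "__main__":
    # Assumed example: each node sustains 20,000 IOPS and holds 32 GB of cache.
    needed = nodes_needed(60_000, 20_000)
    print(needed, cluster_capacity(needed, 20_000, 32))
```

The point of the model is that performance and cache grow together as nodes are added, instead of being capped by a single controller pair.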
In our next article in this series we will examine how customers are using different types of storage technologies available to them to solve their need for a high IOPS storage environment.
Thursday, June 4, 2009
There are three basic elements of performance in a data center: its processing power, harnessed by servers; its network, harnessed by switches and routers; and its storage, which consists of the disks harnessed by SAN and NAS controllers. Each of these elements is under constant strain to keep up with the digital demands of its users. Servers and networks have kept pace by adding power and using that power intelligently, but storage has not, and it has become the bottleneck of the enterprise. Now the storage bottleneck has moved beyond being an IT problem and has created a perilous situation for the organization as a whole. So what’s causing the storage I/O bottleneck?