Interaction vs. element monitoring

According to a VMware presentation at a recent VMworld conference, 90% of application performance issues are storage-related. The telltale symptom is latency, the time it takes an individual storage request from a server to be fulfilled by a LUN. One way to measure latency is to record all data requests from a single server HBA, for example, or to monitor traffic into and out of storage arrays or switches. This is how traditional SRM tools operate: they focus on ‘element’ monitoring, tracking the status of servers, storage arrays, switches, etc., rather than the interaction between these components.

Getting half of the conversation

Element monitoring could tell you which servers or arrays are experiencing I/O delays, but relying solely on this information makes it difficult to solve the problem. Similar to listening to one half of a telephone conversation, recording data from an individual element in the storage infrastructure can be a very ineffective way of figuring out what’s really going on. Element monitoring doesn’t capture complete transactions, something that’s needed to identify the cause of slow VM application performance (latency). In a virtual environment the problem can be compounded by the sheer number of VM server instances and the density of VMs that exist in a single ESX host.

Add to this the lack of visibility into specific server-LUN interactions caused by storage virtualization, and it becomes even more difficult to manage storage in an environment where VMs are competing for resources. Balancing these needs, or the requests for I/O service between servers or VMs, is a primary function of storage management. When large numbers of VMs attempt to access storage LUNs at the same time, latency increases for all of them. Again, it’s not enough to know which servers are experiencing latency; you also need to know when those slow transactions occur and with which LUNs.

ECT measures latency

VirtualWisdom gathers data from three sources, or probes. ProbeV reads MIBs via SNMP to compile traditional switch performance and utilization data; ProbeVM pulls the key server-side information from vCenter; and ProbeFCX reads data from physical-layer TAPs (aka optical splitters) in the Fibre Channel fabric. This real-time I/O data is used to compute SCSI transaction times, called Exchange Completion Times (ECTs). Along with this breadth of information, VirtualWisdom provides sophisticated diagnosis and troubleshooting capabilities that help both avoid performance problems and resolve them quickly, not just alert you to their existence.
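To make the ECT concept concrete, here is a minimal sketch of how an exchange completion time could be derived from per-frame timestamps captured on a Fibre Channel link: pair each SCSI command frame with the frame that closes its exchange and take the elapsed time. The field names (`ox_id`, `ts`) and data shape are illustrative assumptions, not Virtual Instruments’ actual data model or implementation.

```python
# Hypothetical sketch: ECT = time from the SCSI command frame to the
# frame that completes the same exchange (matched here by OX_ID).
# Timestamps are in milliseconds for simplicity.

def exchange_completion_times(frames):
    """Return one ECT (ms) per completed exchange, in completion order."""
    start = {}   # OX_ID -> timestamp of the opening command frame
    ects = []
    for f in frames:
        if f["type"] == "command":
            start[f["ox_id"]] = f["ts"]
        elif f["type"] == "response" and f["ox_id"] in start:
            ects.append(f["ts"] - start.pop(f["ox_id"]))
    return ects

frames = [
    {"type": "command",  "ox_id": 1, "ts": 0},
    {"type": "command",  "ox_id": 2, "ts": 1},
    {"type": "response", "ox_id": 1, "ts": 12},  # 12 ms exchange
    {"type": "response", "ox_id": 2, "ts": 42},  # 41 ms exchange
]
print(exchange_completion_times(frames))  # [12, 41]
```

Because the measurement is taken per exchange on the wire, it is naturally attributable to a specific initiator-target-LUN conversation rather than to a server as a whole.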

As an example, below is a screenshot of the VirtualWisdom “I/O Performance Trending” screen. It displays a number of details about storage latency in a virtual environment, derived from data taken directly from vCenter and from Virtual Instruments’ SAN TAPs, which enable analysis of Fibre Channel frames for transaction information.

Eric Slack, Senior Analyst

Virtual Instruments is a client of Storage Switzerland

Briefing Report

The top right box shows data coming from vCenter. Each entry is an average of all requests or transactions that host participated in during 5-minute intervals. It combines transactions from each server with ALL the LUNs that those hosts read data from or wrote data to. It shows that server latencies are all in the normal (20ms) range, except two, which are only slightly above normal.

The bottom left box shows data taken from VI’s ProbeFCX, which records transactions for each server-LUN pair, not just for each server. Each pair is referred to as an Initiator-Target-LUN (ITL) and its transaction time is the Exchange Completion Time (ECT). These, too, are averaged over 5-minute intervals like the vCenter data, but they tell a much different story. Instead of read latencies in the normal range, they show some hosts having extended latencies with certain LUNs. Specifically, transactions between server 6020 and port 16 are over 41ms, in the critical range, and the rest are well above normal. Compare this with the ‘element data’ taken from the server alone and you get a completely different impression of how long applications are waiting for storage I/O. Server 6020, for example, is experiencing latency about 3x what vCenter is showing.
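The averaging effect described above is easy to demonstrate numerically. In this sketch, the latency figures and names are made up for illustration (they are not taken from the screenshot): one hot server-LUN pair sits at a critical 41ms, yet the blended per-server average still looks close to normal.

```python
# Illustrative numbers only: per-(server, LUN) average latencies over a
# 5-minute interval, versus the single per-server figure produced when
# all of that server's LUNs are blended together.

itl_latency_ms = {                  # (initiator, LUN) -> avg latency, ms
    ("server6020", "lun14"): 8,
    ("server6020", "lun15"): 10,
    ("server6020", "lun16"): 41,    # critical: one hot server-LUN pair
}

server_avg = sum(itl_latency_ms.values()) / len(itl_latency_ms)
worst_itl = max(itl_latency_ms.values())

print(f"per-server average: {server_avg:.1f} ms")  # ~19.7 ms, looks near-normal
print(f"worst ITL: {worst_itl} ms")                # 41 ms, critical
```

A tool that only sees the per-server average would report roughly 20ms and move on; the per-ITL view is what exposes the 41ms outlier.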

A problem uncovered

The takeaway is that these servers are having problems with specific LUNs, problems that are creating latencies 2x-3x worse than what vCenter shows for the same servers. But this fact is lost when each server’s transactions with ALL LUNs are averaged together. The only way to capture this specific transaction data between server and LUN is with Fibre Channel frame monitoring through physical-layer TAPs, like those deployed with VirtualWisdom. With this information, we know:

  1. There’s a problem that doesn’t show up with a traditional monitoring tool

  2. We can ID which server-LUN pairs are having these problems and where to start looking for the cause

Storage Switzerland’s Take

There’s an old saying, “you don’t know what you don’t know”. The information above makes you wonder if an IT manager coined that phrase. It also makes you wonder how someone could get by without this kind of detail about their storage infrastructure. But it goes on all the time. Most companies already have some form of storage resource management (SRM) tool in place and ‘don’t know what they don’t know’, like trying to make sense of half a phone conversation.

The ‘solution’ for many is to buy more storage or switching infrastructure, a fact that’s confirmed by Virtual Instruments’ finding that most organizations have storage link utilization rates in the single digits. This brings another saying to mind: “when the only tool you have is a hammer (more hardware), everything looks like a nail (capacity problem)”.

It’s difficult to imagine someone choosing NOT to use a technology like this after they’ve been exposed to it. With the emphasis on VM densities and the move to put more critical production applications on VMware, having this kind of data would seem to be essential. Given that Fibre Channel SANs represent over 75% of the storage connected to VMware environments (VMworld 2009 survey by Virtual Instruments), production VMware shops should find this product suite very compelling.