The first step in that process is to find tools that enable staff to save time. IT managers need to go after the two big time wasters: confirming that what should have happened actually did happen, and addressing minor issues before they become major problems. Both can easily consume entire days.


Confirming the Right Events Occurred


In the data center, many automated processes run throughout the day and night. In most cases, each needs to be manually checked to confirm it completed, and the job of monitoring these processes can be very time-consuming.


The first part of the problem is that there is often no uniform method for collecting this data and no centralized collection point. Each process has to be manually audited, with someone checking its report or output log. In many cases this is more than just logging into a system to check its status: the report itself has to be run at the point of login or, worse, text logs need to be manually scanned to confirm that the task completed successfully.
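

As a rough sketch of what that manual log scan amounts to, the Python snippet below searches a text log for a completion marker. The log path and the success string are hypothetical placeholders; every backup or batch product writes its own log location and wording.

```python
# Minimal sketch of the manual log check described above, assuming a
# hypothetical log path and success marker; real products each use their
# own log locations and wording.
from pathlib import Path

LOG_FILE = Path("/var/log/nightly_backup.log")   # hypothetical path
SUCCESS_MARKER = "BACKUP COMPLETED SUCCESSFULLY"  # hypothetical marker

def backup_succeeded(log_path: Path, marker: str) -> bool:
    """Return True if the completion marker appears in the log."""
    if not log_path.exists():
        return False
    return marker in log_path.read_text(errors="ignore")

if __name__ == "__main__":
    status = "OK" if backup_succeeded(LOG_FILE, SUCCESS_MARKER) else "CHECK MANUALLY"
    print(f"{LOG_FILE}: {status}")
```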


Even when acceptable reports are available, the real-world data center runs a mixed set of applications. For example, the backup environment may be protected by three or four different applications: the UNIX team may use one tool, the Windows team another and the VMware team still another. While each of these backup applications may report on the status of the previous night's run, none of the reports are consolidated and each must be checked manually. Depending on the size of the backup environment this can take a full day and, worse, it has great potential for error. Falsely reporting a backup as successful can lead to catastrophe when that backup is needed for an emergency recovery.
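

To make the missing consolidation step concrete, here is a minimal sketch that rolls the separate checks into one summary. The three check functions are hypothetical stand-ins; each would wrap whatever report or log its team's backup product actually produces.

```python
# A sketch of the consolidation step described above. The three check
# functions are hypothetical placeholders, one per backup product.
from typing import Callable, Dict

def check_unix_backups() -> bool:
    # Placeholder: parse the UNIX team's backup report here.
    return True

def check_windows_backups() -> bool:
    # Placeholder: parse the Windows team's backup report here.
    return True

def check_vmware_backups() -> bool:
    # Placeholder: parse the VMware team's backup report here.
    return False

CHECKS: Dict[str, Callable[[], bool]] = {
    "UNIX": check_unix_backups,
    "Windows": check_windows_backups,
    "VMware": check_vmware_backups,
}

if __name__ == "__main__":
    # One consolidated view instead of three separate manual checks.
    for team, check in CHECKS.items():
        print(f"{team:8s} backups: {'OK' if check() else 'FAILED - investigate'}")
```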


Storage itself becomes an even larger challenge. In a NAS environment, for example, you may be using the snapshot and/or replication capabilities of those systems as a first point of recovery and, even within a single vendor's platform, may have multiple controllers or NAS heads.


What if you want to verify that all the snapshots for the last hour completed successfully and that you have enough storage reserved to complete the next hour's snapshots? You would have to log in to each controller or NAS head, know how to obtain this data if the systems came from different vendors, and know when and how to make adjustments based on that information. Again, it can take considerable time to gather this information, verify that it is accurate and be prepared for the next hour's snapshots, assuming it can even be done within the hour.
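

A minimal sketch of that hourly verification loop is shown below. It assumes a hypothetical get_snapshot_status() helper for each head; in practice each vendor would need its own query (CLI, SNMP or REST), and the reserve threshold shown is an arbitrary example.

```python
# Sketch of the hourly snapshot verification described above.
from dataclasses import dataclass
from typing import List

@dataclass
class SnapshotStatus:
    controller: str
    last_snapshot_ok: bool
    reserve_free_gb: float   # space left in the snapshot reserve

RESERVE_THRESHOLD_GB = 50.0  # assumed minimum for the next hour's snapshots

def get_snapshot_status(controller: str) -> SnapshotStatus:
    # Placeholder: replace with the vendor-specific query for this head.
    return SnapshotStatus(controller, last_snapshot_ok=True, reserve_free_gb=120.0)

def verify(controllers: List[str]) -> None:
    for name in controllers:
        status = get_snapshot_status(name)
        if not status.last_snapshot_ok:
            print(f"{name}: last hour's snapshot FAILED")
        if status.reserve_free_gb < RESERVE_THRESHOLD_GB:
            print(f"{name}: only {status.reserve_free_gb:.0f} GB of reserve left")

if __name__ == "__main__":
    verify(["nas-head-1", "nas-head-2", "filer-a", "filer-b"])  # hypothetical names
```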


In both examples, what often happens (due to time pressures) is that the responsibility gets delegated to multiple people who have other primary job functions, and the data simply is not collected.


Finally, beyond the day-to-day efficiency issues of monitoring data center processes, there is the time spent creating the reports that management needs to confirm these jobs are being done. Most often, creating these weekly or monthly reports involves a similar manual process: keying in loosely related data from a variety of sources in the hope that it will all add up correctly once consolidated.


This leads to a problem: the data center is generally too dynamic for a monthly status check. Thanks to server virtualization, servers can now be created or decommissioned in minutes, storage volumes can be added or removed in seconds, capacity can be added with no downtime or notification and storage can be allocated to existing servers automatically.


The dynamic nature of servers and storage demands a reporting tool that is just as dynamic, so that needs can not only be trended but also "course corrected" the moment an anomaly occurs.


There is also the issue of too much information. Many of the reports run by these individual systems are far too verbose and difficult to scan quickly for the exact information needed. What would be better is a "heads-up display" style of report that can be drilled into for detail when needed.


Preventing Minor Issues from Becoming Major Problems


The second major time consumer is when something serious but unexpected happens and forces other work to cease so that all hands can focus on fixing the problem. Unexpectedly running out of storage is a prime example, as is failing to recover data because snapshots didn't work or backup media wasn't available. A work stoppage could also come from a performance loss that makes an application unusable.


In most cases, the problem could have been prevented. Most systems will provide hints that something is wrong: that storage or backup capacity is running low, or that they are not working at all. Even performance issues can be diagnosed and predicted long before they become a work-stopping event. Ironically, serious issues like these happen because no one has the time to audit the systems proactively.


Performance, for example, is one of the tougher problems to monitor and requires ongoing analysis to predict accurately. You don't need a report that tells you everything is operational. What is needed instead is ongoing data that records spikes in load, I/O requests or bandwidth saturation. These spikes need to be constantly monitored and then reported on if they become too frequent. This type of monitoring would also allow for notification if spikes occur at critical times, so adjustments could be made to handle the temporary performance demand.
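

As an illustration of that spike-frequency idea, the sketch below counts how many recent samples exceed a saturation threshold and raises an alert when they become too frequent. The threshold, window size and sample values are all assumed for the example; a real collector would poll I/O or bandwidth counters on a fixed interval.

```python
# Sketch of spike-frequency monitoring over a rolling window of samples.
from collections import deque
from typing import Deque

SPIKE_THRESHOLD = 0.90      # e.g. 90% bandwidth saturation (assumed)
WINDOW_SIZE = 60            # keep the last 60 samples
MAX_SPIKES_PER_WINDOW = 5   # alert when spikes become this frequent (assumed)

class SpikeMonitor:
    def __init__(self) -> None:
        self.samples: Deque[float] = deque(maxlen=WINDOW_SIZE)

    def record(self, utilization: float) -> None:
        """Record one utilization sample and alert if spikes are too frequent."""
        self.samples.append(utilization)
        spikes = sum(1 for s in self.samples if s >= SPIKE_THRESHOLD)
        if spikes > MAX_SPIKES_PER_WINDOW:
            print(f"ALERT: {spikes} spikes in the last {len(self.samples)} samples")

if __name__ == "__main__":
    monitor = SpikeMonitor()
    for value in [0.4, 0.95, 0.97, 0.5, 0.93, 0.96, 0.99, 0.92]:  # sample data
        monitor.record(value)
```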


With this type of information in hand, the IT staff can respond proactively and address issues rather than dropping everything to put out today's fire. Contrary to popular belief, IT professionals are not firefighters.


There are tools available today, from companies like Tek-Tools, that allow for real-time, interactive monitoring of data. Critical in the dynamic data center, this type of monitoring can provide a global tool to manage multiple storage platforms, virtualization platforms and application environments.


Armed with these tools, the collection of data that verifies essential processes can be automated and exposed through a single access point. Reports can be set to provide only the level of detail required by the person viewing them, or expanded to provide further information. Finally, these reports can be emailed to the interested parties automatically, saving the IT staff additional time.
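

The e-mail step might look something like the sketch below, which uses Python's standard smtplib. The relay host, sender and recipient addresses are hypothetical, and the report body stands in for the consolidated checks described earlier.

```python
# Sketch of the automated report e-mail described above, assuming a
# hypothetical internal SMTP relay and recipient list.
import smtplib
from email.message import EmailMessage

SMTP_HOST = "mail.example.internal"          # hypothetical relay
RECIPIENTS = ["storage-team@example.com"]    # hypothetical recipients

def send_report(subject: str, body: str) -> None:
    msg = EmailMessage()
    msg["From"] = "reports@example.com"      # hypothetical sender
    msg["To"] = ", ".join(RECIPIENTS)
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    send_report("Nightly backup summary", "UNIX: OK\nWindows: OK\nVMware: FAILED")
```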


Automated reporting turns data collection from an hours-long job performed by multiple people who would rather (and probably should) be doing something else with their time into a single point-and-click task that takes minutes. The real-time nature of this type of tool is ideal for predicting and diagnosing potential problems in the data center, whether they are storage, server virtualization or application related. Combined with historical reporting and future-trending capabilities, IT professionals have the information they need to predict when a problem may occur. They can also use the data to determine the most efficient and cost-effective fix for that problem, and to plan the best time to implement it based on the severity of the problem and the available maintenance windows.
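

As a simple illustration of what future trending can provide, the sketch below fits a straight line to a hypothetical capacity history and extrapolates when the volume will fill. The sample figures are invented for the example; a real tool would use its own collected history and a more sophisticated model.

```python
# Sketch of trend-based capacity prediction via linear extrapolation.
from statistics import linear_regression  # Python 3.10+

days = [0, 7, 14, 21, 28]                  # days since first sample
used_tb = [10.0, 10.8, 11.7, 12.5, 13.4]   # hypothetical usage history
CAPACITY_TB = 20.0                         # hypothetical volume capacity

slope, intercept = linear_regression(days, used_tb)   # TB growth per day
day_volume_fills = (CAPACITY_TB - intercept) / slope
days_remaining = day_volume_fills - days[-1]

print(f"Growth rate: {slope:.2f} TB/day")
print(f"Estimated days until the volume is full: {days_remaining:.0f}")
```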


The term “doing more with less” is overused and often misunderstood. Achieving real data center efficiency will require not only improved resource utilization through server virtualization and storage optimization projects, but also greater staff efficiency, delivered through tools that allow for the quick collection and analysis of system data.