Good post from Eric Slack (thank you)
Slowdowns in a SAN infrastructure are usually caused by a combination of circumstances. A SAN is a complex system with many moving parts, all interacting with each other. Troubleshooting requires that you reproduce the problem, something that can seem nearly impossible given the number of subsystems and the number of variables (including time) that must be controlled.
Recreating the problem requires recreating the same workload, the same conditions, the same time base and the same sequence of events. For IT admins and network teams, documenting the entire situation, with multiple events occurring in multiple subsystems at the same time, can be a real challenge. It is like trying to record a live event with a still camera: you’re left wondering where to position the camera to capture the right ‘scene’ and when to take the picture. It’s a multi-dimensional problem that requires more than a one-dimensional solution.
Performance problems are caused by a combination of many factors. Monitoring all of them with the tools most companies have at their disposal can be next to impossible, and determining which storage systems or network elements to monitor can be a frustrating experience. As in the carnival game “Whack-a-Mole”, where participants try to hit figures as they pop up through a matrix of holes, administrators can’t tell which ports or systems to watch and are left trying to react fast enough to catch the problem.
Troubleshooting
When a performance issue occurs, most companies invoke a troubleshooting process that looks something like this:
An intermittent slowdown is reported with an application. After some basic investigation the server or application team discovers that a metric, like IOPS or throughput, has degraded. The first assumption is often that the problem lies with the I/O subsystem, the SAN. So the SAN team turns on some monitoring and waits for the problem to recur (this assumes they’re watching the right I/O path, with the right tools). But when the problem does recur, since it’s usually picked up by the same tools the first group used, it only reconfirms what’s already known – IOPS or MB/s are down.
Next, the component vendors (disk arrays, fabric switches, HBAs) are called. They immediately ask for a detailed description of the problem and when it occurred, plus diagrams of the infrastructure. They also ask for a lot of information, all of it time-stamped, including (but not limited to): log dumps from each monitoring tool in use, a GetConfig from the affected servers, iostat and sar output for UNIX hosts, and PerfMon data for all physical disk objects on Windows hosts.
If the problem occurs frequently or is predictable (a big “if”), it may be relatively easy to generate these data. When the logs capture the event with enough granularity, and with an accurate time base, the vendor(s) may be able to provide some insight into the root cause.
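To see why granularity matters, here is a rough sketch in Python. The numbers are made up for illustration (a steady 10,000 IOPS with a 5-second stall at 500 IOPS), and `window_averages` is a hypothetical helper, not part of any monitoring tool. It shows how a 60-second averaging interval can all but hide a stall that a 5-second interval exposes clearly.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Made-up data: 60 seconds at 10,000 IOPS with a 5-second stall at 500 IOPS.
per_second_iops = [10_000] * 60
for second in range(30, 35):
    per_second_iops[second] = 500

coarse = window_averages(per_second_iops, 60)  # one 60-second sample
fine = window_averages(per_second_iops, 5)     # twelve 5-second samples

print("60-second average:", round(coarse[0]))
print("lowest 5-second average:", min(fine))
```

With these sample values the single 60-second reading comes out at roughly 9,208 IOPS, which looks like a minor dip, while the lowest 5-second reading is 500 IOPS. The same event is either a footnote or an outage depending on the sampling interval.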
Troubleshooting complex systems like this is a difficult task even if you’re on-site full time and familiar with the environment. It can be nearly impossible to do part time, especially over a long distance, by technicians who have never seen the data center and are also working on several other customers’ issues.
More than likely, they’ll ask you for more metrics, with more care taken to provide the time differences of the various component clocks, because the time variable is critical to differentiate between cause and effect. As an example, one major storage vendor’s best practices document for troubleshooting includes the following: “Detail the exact timeline of all events before, during and after the event – be as specific as possible”.
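The clock-difference point can be illustrated with a small Python sketch. Everything here is assumed for the example: the component names, the offsets in `CLOCK_OFFSETS`, the event messages and the timestamps are invented, and `merge_timeline` is a hypothetical helper. The idea is simply to subtract each component's known clock offset so that all events land on one reference time base before they are sorted.

```python
from datetime import datetime, timedelta

# Hypothetical offsets: how far each component's clock runs ahead of the
# chosen reference clock (a negative value means it runs behind).
CLOCK_OFFSETS = {
    "host": timedelta(seconds=0),
    "switch": timedelta(seconds=42),
    "array": timedelta(seconds=-7),
}


def merge_timeline(events, offsets):
    """Return (reference_time, source, message) tuples sorted by reference time.

    events: iterable of (source, local_timestamp, message) tuples.
    """
    corrected = [
        (local_ts - offsets[source], source, message)
        for source, local_ts, message in events
    ]
    return sorted(corrected, key=lambda event: event[0])


# Invented sample events, stamped with each component's own clock.
events = [
    ("host", datetime(2024, 1, 1, 10, 0, 30), "application latency spike"),
    ("switch", datetime(2024, 1, 1, 10, 1, 5), "port congestion reported"),
    ("array", datetime(2024, 1, 1, 10, 0, 20), "write cache nearly full"),
]

timeline = merge_timeline(events, CLOCK_OFFSETS)
for reference_time, source, message in timeline:
    print(reference_time.isoformat(), source, message)
```

Read by their raw timestamps, these sample events appear in the order array, host, switch. After correction the order becomes switch, array, host, so the event that looked like a late effect is actually the earliest one. That reordering is the difference between cause and effect.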
After a time, without resolution, pressure builds to ‘throw hardware at the problem’ (something that the component vendors don’t object to). If this course is taken, with luck and often a lot of money spent, some degree of resolution is achieved, usually temporary, and the problem goes away for a while. But to really fix the problem, it has to be recreated, along with detailed information about what was going on with the other systems in the environment. You have to capture everything that’s pertinent to the problem, and you have to control the time dimension. Otherwise you’re just waiting for the mole to pop back up.
Read on here