Earlier in the week I posted a blog about Monitoring vs. Diagnostics, in which I noted that the skills, tools, and ultimately the visibility required to proactively monitor an environment versus reactively diagnose one are very different. I’m sure those of you who have lived this challenge would agree. For those who have not lived to tell the tale, I’ll share this example from the field.
Please note I am not attempting to call out or embarrass anyone. The nuance and balance between monitoring and diagnostics are lost on all but the most seasoned end user computing professionals. It’s quite common that desktop workload-focused IT teams are ill-prepared to solve a diagnostics problem with a monitoring solution. In this case, our example organization was experiencing crippling performance in its most important tier-one application: the electronic health record (EHR) solution from Epic Systems.
Not Sized for Growth or Spikes
Our story begins at a large health care organization that recently completed a 20,000-user desktop transformation from physical PCs to virtual desktop infrastructure (VDI). The multi-year project went well, and until recently everyone was happy. Management was realizing significant operational benefits, user expectations were being met, and the IT team had been proactively monitoring the environment.
Near the end of the desktop transformation, non-persistent images were built and the Epic Systems EHR solution was added to the appropriate pods. Unbeknownst to the VDI architects, the image design and pool density were not sized appropriately. Users did not complain, however, as there had been no Epic install prior to the VDI transformation; users believed the new EHR application’s performance was normal. Unfortunately, this train had already derailed. The IT team was just not yet aware of the impending crash.
Stratusphere UX was called in to assist after the current infrastructure monitoring solution, VMware vRealize Operations Manager (vROps), fell short. The VMware solution provided adequate visibility for proactive monitoring and alerting, but proved to be blind to the derailment that crippled the usability of the Epic VMs. All vROps views showed server and storage infrastructure to be operating within normal limits. Stratusphere showed similar indicators for the infrastructure, but the in-guest view told a very different story.
All Users, All Machines and All Applications
Stratusphere UX surfaced a few critical issues that were completely absent from the monitoring view provided by vROps. First, guest VM CPU queuing was abnormally high. Second, in most instances GDI object counts were in the 1 to 2,000 range; this is normal for Epic installed on physical PCs, but a bit high for even a standard VDI guest workload. And lastly, Stratusphere UX found that six applications were consistently in a locked state, all of them part of the Epic solution. We suspected these issues would tie back to an earlier timeframe, so we pulled the same machine and application metrics for a prior period for comparison. The screen capture below shows this ‘known good’ state, as captured by Stratusphere UX.
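To make the kind of in-guest threshold check described above concrete, here is a minimal sketch in Python. The function name, metric names, and limits are all illustrative assumptions for this post, not Stratusphere UX metrics or APIs; the GDI ceiling mirrors the range noted above, and the queue rule of thumb (queue length no more than roughly twice the vCPU count) is a common sizing heuristic, not a vendor formula.

```python
# Hypothetical in-guest health check. Names and thresholds are
# illustrative only; they are not taken from Stratusphere UX or vROps.
GDI_NORMAL_MAX = 2000        # upper end of the range described as tolerable
CPU_QUEUE_PER_VCPU_MAX = 2   # rule of thumb: queue <= ~2x available vCPUs

def flag_guest(gdi_objects, cpu_queue, vcpus, locked_apps):
    """Return a list of warnings for one guest VM's sampled metrics."""
    warnings = []
    if gdi_objects > GDI_NORMAL_MAX:
        warnings.append("GDI objects high: %d" % gdi_objects)
    if cpu_queue > CPU_QUEUE_PER_VCPU_MAX * vcpus:
        warnings.append("CPU queue %d too high for %d vCPU(s)" % (cpu_queue, vcpus))
    if locked_apps:
        warnings.append("%d application(s) in a locked state" % len(locked_apps))
    return warnings

# Example: a single-vCPU Epic guest under load trips all three checks.
print(flag_guest(gdi_objects=2400, cpu_queue=3, vcpus=1,
                 locked_apps=["EpicAppA", "EpicAppB"]))
```

The point of the sketch is simply that these are per-guest signals: an infrastructure-level view that averages across hosts would never evaluate them.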
It is worth noting that while vROps showed this same time period to be adequate with respect to server utilization and performance, Stratusphere UX clearly showed all of the affected target machines to be experiencing an issue, albeit just below the threshold for user rebellion. The composite user experience metric in the last column, VDI UX, shows that most users were receiving a ‘Good’ experience overall. That said, there was clearly a missed opportunity to identify this issue before it became system-wide, and perhaps to avoid the downtime and critical user experience issues that followed.
The Punchline
In the end it was determined that a combination of provisioning errors and missed symptoms contributed to this derailment. First, each of the Epic VMs was granted only a single CPU core, constraining Windows to a single-threaded kernel. While this is not an issue on physical PCs, it proved to be a major contributing factor in the downtime event.
And as previously noted, the CPU System and CPU User numbers were just below a perceptible threshold under ‘normal’ load. This too contributed to the failure. The Epic EHR solution requires a large amount of 2D video processing, and the burden of that work falls on the guest vCPU. Overall, the under-provisioning caused the CPU queue to rise to levels over 300 percent of available resources (remember, there was only one vCPU per Epic VM).
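As a back-of-the-envelope illustration of where a figure like 300 percent comes from (this is simple arithmetic, not a Stratusphere or vROps formula): express the run queue as a percentage of the vCPUs available to service it.

```python
def queue_pressure_pct(cpu_queue_length, vcpus):
    """CPU run queue expressed as a percentage of available vCPUs."""
    return 100.0 * cpu_queue_length / vcpus

# One vCPU per Epic VM: a sustained queue of 3 threads is 300% of capacity.
print(queue_pressure_pct(3, 1))   # 300.0
# The same queue on a 4-vCPU guest would be far less alarming:
print(queue_pressure_pct(3, 4))   # 75.0
```

The same absolute queue depth is catastrophic or benign depending on the vCPU count, which is why the single-core provisioning error mattered so much.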
As a result of these misses, users had come to consider the Epic system slow on a normal day; but when the workload of the VMs increased, even by a relatively small amount, the effect was catastrophic. Applications would hang, screens would freeze, and system performance would become unbearably slow.
Taken one at a time, these issues all seem small and insignificant, and the fact they were missed would come as no surprise to any IT peer. The real failure here is a lack of visibility to support diagnostics. Planning for the reactive need to troubleshoot, diagnose and determine root cause cannot be done after the train derails. Having the necessary metrics and trendable details in place ahead of time puts your organization in the best possible position for when catastrophe strikes. And please trust me … it will strike.