VM performance troubleshooting: A quick list of things to check

I often see virtual machines that perform poorly. There can be many reasons for this. I thought it was time to post a few “top 5 things to check in any given VMware ESX(i) environment” lists that might help you solve such issues.

Things to check on storage

Storage is often considered the bad guy when it comes to bad performance of virtual machines. As it turns out, this is not very often the case at all. Still, here are some storage-related things to check if you encounter a poorly performing VM:

  1. If using iSCSI or NFS via multiple 1Gbit links, check if the load is really balanced between uplinks. A lot of people still seem to configure etherchannels/port aggregates of multiple 1Gbit links and then just assume the load will be balanced. This is often NOT the case; you need multiple source and/or multiple destination IP addresses for ESX(i) to be able to spread the load across multiple links. For traffic coming into your ESX(i) host, it depends on the physical switch you use; a lot of switches balance loads on source and/or destination MAC addresses;

  2. Check latency on the disks connected to your VM. To do this, click your VM in vCenter (or on the host level), select the tab “Performance”, and choose “Datastore” from the drop-down menu. It will show you the latency encountered on any relevant datastore for that particular VM. Be sure to click the “advanced” button for more detailed graphs;

  3. Check how many IOPS are being transferred. You can use the same performance graph as with 2), but instead click “chart options”, unselect “Write latency” and “Read latency” and select “Average read requests per second” and “Average write requests per second”. These numbers will tell you how many IOPS the VM is performing, and they give you some idea of the load the VM puts on your storage device;

  4. Using the graph from 3), be sure to check the read/write ratio. Some arrays are much better at performing reads than writes, and a VM delivering a lot of writes to the array may hit your array harder than a read-intensive load would;

  5. If the workload appears to be very high for this virtual machine, you may look into a command line tool called vscsiStats. This tool can give great insight into the reads and writes, block sizes and latencies your VM encounters/generates. See VMdamentals.com: vscsiStats into the third dimension.
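
To see why item 1 insists on multiple source and/or destination IP addresses, here is a minimal sketch of an IP-hash style uplink choice. Everything in it is an assumption for illustration: the function name `pick_uplink`, the example addresses, and the hash itself (a plain XOR of both addresses modulo the number of uplinks). The hash your ESX(i) version or physical switch really uses may differ, but the principle is the same: one source/destination pair always lands on the same link.

```python
import ipaddress


def pick_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Return the index of the uplink a source/destination pair maps to.

    Simplified stand-in for an IP-hash policy (an assumption, not the
    exact ESX(i) or switch algorithm): XOR both addresses and take the
    result modulo the number of uplinks.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % uplink_count


# Hypothetical addresses: one VMkernel port talking to one storage target.
# The hash input never changes, so every packet takes the same uplink.
single_pair = {pick_uplink("10.0.0.11", "10.0.0.50", 2) for _ in range(1000)}

# Same host, but the storage is reachable on two target addresses.
two_targets = {
    pick_uplink("10.0.0.11", dst, 2) for dst in ("10.0.0.50", "10.0.0.51")
}

print("uplinks used with one target: ", sorted(single_pair))
print("uplinks used with two targets:", sorted(two_targets))
```

With a single pair only one uplink ever carries traffic, no matter how many links are in the aggregate; adding a second target address is what lets the second link join in.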

Things to check on resources

Faulty resource settings are among the things I see causing the most performance issues. Very often a limit was applied somewhere and then forgotten about. The top things to check on resources:

  1. When you click on a resource pool and select the tab “Resource Allocation”, you get a list of all VMs inside that resource pool. Click on CPU, and check all VMs inside that pool (assuming your troublesome VM is in there as well). Make sure the “Limit – MHz” column states “Unlimited” unless you had a specific reason to limit the performance of that VM (like an MS-DOS VM that will always use 100% of the available CPU);

  2. In the same list as in 1), click on “memory”. This is the real kicker: very often I do not see “Unlimited” there, but for each and every VM an amount of memory, often far less than configured! (Hint: you should see such a VM ballooning all the time). Normally you would change that back to “Unlimited”, unless you had a specific reason to limit the amount of memory of a VM;

  3. Same list again, press the storage button. Again, check for any limits that may apply but weren’t intended;

  4. For all of the items above, you can also look at the “% shares” values. These values indicate what percentage of shares that particular VM has within the resource pool. If one VM has 95% and the others have 1%, it is probably time to check your shares values. These values are relative, not absolute: make sure they are all in the same league (like 1000, 2000 or 3000) and not too far apart (like 10, 20, 4000);

  5. I have seen some environments where people tend to put limits and reservations on EVERYTHING. Unless you know exactly what you’re doing, I’d recommend removing them all. Reserving memory can improve performance a little, but in general it costs you more sorrow than it gains you. Same goes for limits… If you need to set any of these, first think about what you are trying to accomplish, then apply it. Never apply it “because you can”;

  6. Beware the number of resource pools you use, or even IF you should use resource pools at all. For example, if you have a big production pool with many VMs and a small test pool with test VMs, you can think up situations where production would suffer a performance hit while the test machines have all the resources they need (and even more). Take care, especially regarding the “Expandable reservation” setting and when using resource pools inside resource pools. Things can get ugly really quickly.
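
The “relative, not absolute” point from item 4 is easy to check with a few lines of arithmetic. The sketch below uses made-up VM names and the share values from the text; the helper names and the `max_ratio` threshold of 10 are my own assumptions for illustration, not anything vSphere defines.

```python
from typing import Dict, List


def share_percentages(shares: Dict[str, int]) -> Dict[str, float]:
    """Turn raw share values into each VM's percentage of the pool total."""
    total = sum(shares.values())
    return {vm: 100.0 * value / total for vm, value in shares.items()}


def lopsided_vms(shares: Dict[str, int], max_ratio: float = 10.0) -> List[str]:
    """List VMs whose shares are more than max_ratio times smaller than
    the largest value in the pool (max_ratio is an arbitrary threshold)."""
    largest = max(shares.values())
    return [vm for vm, value in shares.items() if value * max_ratio < largest]


# Hypothetical pools using the share values mentioned in the text.
same_league = {"vm-a": 1000, "vm-b": 2000, "vm-c": 3000}
far_apart = {"vm-a": 10, "vm-b": 20, "vm-c": 4000}

for name, pool in (("same league", same_league), ("far apart", far_apart)):
    percentages = share_percentages(pool)
    pretty = {vm: round(pct, 1) for vm, pct in percentages.items()}
    print(name, pretty, "starved:", lopsided_vms(pool))
```

In the second pool one VM holds over 99% of the shares, so under contention the other two are left with almost nothing, even though their absolute values do not look alarming on their own.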

Things to check on a VM level

Things can also go wrong at the VM level. VMs use memory and CPU, and need enough resources to function properly. The top 5 list of things going wrong at this level:

  1. CPU load. If you click on the VM, and select the “Performance” tab, you can click on “CPU” in the drop-down menu (make sure you are viewing the advanced graphs). This will show you the amount of CPU the VM is consuming. If it is very high, you may consider adding vCPUs, after you make sure that A) your physical host has more cores available than what you are going to configure inside the VM, and B) the applications inside the VM are actually able to utilize multiple vCPUs;

  2. If the VM appears to respond poorly, you may have a scheduling issue. To quickly check the general health (CPU-wise), use the same graph as in 1), but now click the “Chart Options” link, deselect all counters and then select “Ready”. This will show you the ready time of the vCPU(s). This number indicates the time that a VM was ready to execute, but vSphere could not assign a physical CPU to it. This value should normally be below 10 ms. Beware: the graph you are looking at contains the accumulated value over the sample time! In the real-time view, which uses 20-second samples, you should divide all numbers you see here by 20. See for more information on this.

  3. Memory could be a serious limit on VM performance as well. If you do not configure enough memory, the VM will usually respond by starting to swap its memory pages to disk (not to be confused with ESX(i) level swapping!!). So it is important to look at the memory available to a certain VM. As a rule of thumb, I always like to look at the “active memory” of a VM. You can see this by clicking the VM, selecting the “Performance” tab and then using the drop down box to select “memory”. There you will normally see a horizontal line (Granted memory) and below that a variable line (Active memory). I always like to stick to having the active memory at 1/2 to 1/3 of the Granted memory which usually does the trick. If active memory gets close to the Granted memory you can bet on swapping occurring inside the VM;

  4. If the physical host runs out of physical memory, the result might be that your VM(s) start to suffer from performance degradation as well. If the balloon driver is installed within the VM (this driver is installed by default when you deploy VMware Tools) you might see ballooning occurring inside one or more VMs. The balloon driver starts to reserve memory inside the VM once the physical memory of the host runs low. The response from the VM’s operating system will be to start swapping to its disk (again: this is NOT ESX(i) swapping!). Space successfully claimed by the balloon driver will become available to the ESX(i) host for other use. Seeing some degree of ballooning might not be a problem, but seeing a constant non-zero value for ballooning usually spells trouble. If ballooning does not “cut it” anymore and the host is still low on memory, then ESX(i) level swapping will start to occur. This is a definite indicator that you have a shortage of physical memory;

  5. Always install VMware Tools, and also make sure that the VMware Tools are up-to-date and actually running. The VMware Tools give you multiple things, from heartbeating between the OS and the hypervisor, to the balloon driver, to optimized disk and network drivers inside the VM. Not installing VMware Tools may cause your VM to perform very poorly!
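
The “divide by 20” step from item 2 trips a lot of people up, so here is the arithmetic as a small sketch. It assumes the real-time chart with its 20-second samples; the function names and the example chart value of 400 ms are made up for illustration. Besides the per-second figure, it also expresses ready time as a percentage of the sample interval, which is just the same number on a different scale.

```python
REALTIME_INTERVAL_S = 20  # real-time charts accumulate over 20-second samples


def ready_ms_per_second(summation_ms: float,
                        interval_s: int = REALTIME_INTERVAL_S) -> float:
    """Average ready time per second of wall-clock time."""
    return summation_ms / interval_s


def ready_percent(summation_ms: float,
                  interval_s: int = REALTIME_INTERVAL_S) -> float:
    """Share of the sample interval spent waiting for a physical CPU."""
    return 100.0 * summation_ms / (interval_s * 1000.0)


# Hypothetical value read from the real-time "Ready" graph.
chart_value_ms = 400

print("ready per second:", ready_ms_per_second(chart_value_ms), "ms")
print("ready percentage:", ready_percent(chart_value_ms), "%")
```

If you look at a chart with a longer sample time (past day, week and so on), pass that interval in seconds as `interval_s` instead of relying on the default.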

Things to check on the Operating System level

Finally there are some things to think about inside the virtual machine as well. The Operating System inside may perform better if you make sure the following items are in place:

  1. Disk Alignment. For any pre-Windows 7, pre-Windows 2008 Server or older Linux based systems, your disks may be misaligned. Misalignment may cause quite a performance hit, especially when your storage underneath does not have a lot of IOPS to spare. You can read all about it here: Throughput part 3: Data Alignment. Apart from misalignment, using a lot of snapshots on your VMs may also cause quite an impact to performance. For more details you could look at: Performance impact when using VMware Snapshots;

  2. For some special applications it may be important to format the virtual disks with a specific format or blocksize. For example, the database of a Microsoft SQL Server 2005 installation is generally put on an NTFS volume that has a blocksize (in NTFS called clustersize) of 64KB. It may make a big difference! You might even tune your underlying RAID sets to match the workload. For more of a deepdive on that, you could check out Throughput part 2: RAID types and segment sizes;

  3. Especially when your VM has issues at certain points in time and otherwise runs flawlessly, you might consider checking things like virus scanners. Not only on the impacted VM, but possibly on other VMs as well. It may very well be that your VM takes a performance hit because other VMs start their antivirus scans at that moment;

  4. Concerning P2V’ed virtual machines: If you P2V a machine (meaning you convert a physical machine to a VM) and you do not “clean up” afterwards, there may be a lot of unused drivers and even applications inside the VM, for example management software for the physical hardware. In the VM there are no fans to fail and no power supply to overheat. Always clean up P2V’ed VMs!

  5. Finally, an issue I have seen sporadically in P2V’ed VMs: If your physical machine had multiple CPUs, and the virtual machine has too, I have seen cases where inside the VM all seems well when looking at CPU usage, but on the vSphere level one of the cores gets stuck at 100%. Especially old Windows 2000 machines had this issue, but newer ones may show this as well. If you have this issue, strip the VM down to a single vCPU, see if this is enough, and if needed add more vCPUs back to it (while making sure you go from a uniproc kernel to a multiproc kernel). This usually fixes any issues around a “stuck” vCPU.
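
For the disk alignment check in item 1, the test itself is simple arithmetic: a partition is aligned when its starting byte offset is a multiple of the block size your storage works with. A minimal sketch follows; the 512-byte sector size and the 4 KiB and 64 KiB boundaries are assumptions for illustration, so use whatever boundary your array vendor recommends. Start sector 63 is the classic layout of older operating systems, start sector 2048 (a 1 MiB offset) is what newer ones typically use.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def is_aligned(start_sector: int, boundary_bytes: int,
               sector_bytes: int = SECTOR_BYTES) -> bool:
    """True if the partition's starting byte offset is a multiple of
    boundary_bytes."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


# Old-style layout versus the 1 MiB offset, against two example boundaries.
for start_sector in (63, 2048):
    for boundary in (4096, 65536):
        offset = start_sector * SECTOR_BYTES
        verdict = "aligned" if is_aligned(start_sector, boundary) else "MISALIGNED"
        print(f"start sector {start_sector} (offset {offset} bytes), "
              f"boundary {boundary}: {verdict}")
```

The starting sector or offset itself comes from the guest: a partition listing in sectors on Linux, or the partition starting offset reported by Windows.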

Of all the performance issues I have seen, I think these checklists will lead you to the problem in 95% of the cases. In the other cases you may have either “weird” issues like bugs in your vSphere environment or things like ill-configured networks.

Happy hunting!
