Recently I got an email from a dear ex-colleague of mine, Simon Huizenga, with a question: “Would this help speed up our homelab environment?” Since his homelab setup is very similar to mine, he pointed me towards an interesting VMware KB article: “Tuning ESX/ESXi for better storage performance by modifying the maximum I/O block size” (KB 1003469). What this article basically describes is that some arrays may suffer a performance hit when very large storage I/Os are issued, and how limiting the maximum I/O block size might improve performance in specific cases.
How it works
To get around a possible performance issue, the KB article states that you can simply change a setting within the advanced settings of ESX(i) to limit the block size any ESX(i) host will either request from or send to the storage array. Easy enough to test, I figured, and maybe interesting to people running older iSCSI, FC or even parallel SCSI arrays (as we do).
Strength in numbers?
When your storage array does indeed have a performance issue when you feed it very large blocks, you might gain performance by limiting the maximum block size, which results in more but smaller blocks being issued.
You should see it like this: let’s assume that a VM needs to send 1MB of data. It might issue a single 1MB block write to the array. But if you limit the maximum block size to, let’s say, 128KB, vSphere is forced to issue 8 writes of 128KB instead. So in the end, the total data going over the bus is roughly the same in both situations (when you exclude the per-request overhead of the 7 extra blocks sent in this case).
It is important to understand that this will not result in sending less data, but the array might possibly handle the transfer in a different manner.
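To make the arithmetic concrete, here is a minimal Python sketch of the splitting. The function name split_io is my own invention for illustration and is not how vSphere implements it internally; it only shows that a cap changes the number of requests, not the amount of data.

```python
KB = 1024
MB = 1024 * KB


def split_io(total_bytes, max_io_bytes):
    """Return the sizes of the requests a transfer is cut into.

    Hypothetical helper: it models the effect of a maximum I/O size,
    not the actual vSphere code path.
    """
    if total_bytes < 0 or max_io_bytes <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(total_bytes, max_io_bytes)
    sizes = [max_io_bytes] * full
    if rest:
        # A trailing partial request carries whatever is left over.
        sizes.append(rest)
    return sizes


chunks = split_io(1 * MB, 128 * KB)
print(len(chunks), sum(chunks))  # 8 1048576
```

A 1MB transfer capped at 128KB comes out as 8 requests that together still carry exactly 1MB.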
Why your array might not “like” large blocks
So why would an array dislike a single large block but have no problem with several smaller ones? This will differ per array, but I think in most cases the write cache, and the write cache rules in particular, are to blame. Unfortunately, information on this is VERY hard to get from storage vendors, since it is part of the magic that makes an array perform the way it does.
I know at least some arrays handle blocks of different sizes in a different manner. For example, I know for a fact that ZFS has (or at least used to have) a boundary of 32768 bytes hard-coded in its source, called “MaxByteSize”. Any block larger than this value is written to disk directly, which is often called “write through”. Anything equal to or smaller than this value is written to write cache (in ZFS terms called the ARC) and then acknowledged right away, often called “write back”. The latter is generally faster, as long as you have enough space in the write cache (and are not forcing a cache flush, in which case you still have to wait for the disk writes to complete).
So with a boundary like this the difference is obvious: 8 writes of 16KB would potentially all go straight into the cache and be acknowledged to the host right away, while a single write of 128KB would have to be written to physical spindles before it is acknowledged to the host, which is probably slower.
To find out if your array has this “feature”, you should either perform a lot of testing, or search the documentation to find out which block sizes are handled in what way and act accordingly (again, information on these things is very hard to get).
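The size-based rule can be sketched in a few lines of Python. Everything here is an assumption for illustration: the 32768-byte boundary is simply the value mentioned for ZFS, the names cache_policy and count_policies are made up, and a real array will apply more rules than a single size check.

```python
KB = 1024

# Assumed boundary in bytes; your array may use another value or another rule.
WRITE_BACK_LIMIT = 32768


def cache_policy(io_bytes, limit=WRITE_BACK_LIMIT):
    """Classify one write: at or below the limit goes to cache, above goes to disk."""
    return "write-back" if io_bytes <= limit else "write-through"


def count_policies(io_sizes, limit=WRITE_BACK_LIMIT):
    """Count how many writes in a list end up in each category."""
    counts = {"write-back": 0, "write-through": 0}
    for size in io_sizes:
        counts[cache_policy(size, limit)] += 1
    return counts


# The same 128KB of data, sent as eight small writes or as one large write.
print(count_policies([16 * KB] * 8))
print(count_policies([128 * KB]))
```

Under these assumptions the eight 16KB writes all qualify for the cache, while the single 128KB write does not.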
Not the holy grail
It seems pretty simple. Limit the block size, all blocks written to the array go into write cache as a result, performance increases. But is this really always true? As you might expect: No.
There are very specific reasons for arrays not putting large blocks in write cache. For starters, large blocks fill up your write cache really fast, not leaving room for a lot of other (small random) writes where caching will really help. Also, one big block on its own is sequential in nature, meaning that once the spindles underneath have performed their seeks, the entire block is written to disk in one sequential pass. When your array uses RAID5 or RAID6, things look even brighter, because the array is likely to be able to perform full stripe writes in this case.
So looking at the way things work, it is in general more effective to fill the cache with many small random writes rather than a few big ones. On the other hand you may have more than enough write cache anyway… And this is where you need to find a balance.
Make sure you actually HAVE big blocks
Tweaking vSphere is always nice, but make sure you actually have a chance of success to begin with. You might limit the block size to 128KB, but if your VMs perform block I/O no larger than 64KB, the adjustment is useless and testing is a pure waste of time.
If your array has no means of measuring block sizes, you could use vscsiStats (for examples you could look at “vscsiStats into the third dimension: Surface charts!”). Use this tool to find out if you actually have VMs performing “large” I/Os. In a Windows environment I would specifically target Windows 7 and Windows Server 2008 VMs, since these operating systems are known to write larger blocks when they get the chance than Windows 2003 / XP VMs do.
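Once you have an I/O length histogram, a quick check like the Python sketch below tells you whether a limit would touch any meaningful share of your I/O at all. The histogram layout (bucket upper bound in bytes mapped to an I/O count) and the numbers in it are made up for illustration and are not real vscsiStats output; share_above is a hypothetical helper. The result is only exact when the limit coincides with a bucket boundary.

```python
KB = 1024


def share_above(histogram, limit_bytes):
    """Return the fraction of I/Os in buckets whose upper bound exceeds the limit.

    histogram maps a bucket's upper bound in bytes to the number of I/Os
    counted in that bucket.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    large = sum(count for bound, count in histogram.items() if bound > limit_bytes)
    return large / total


# Made-up sample data, NOT measured on a real host.
sample = {
    4 * KB: 500,
    16 * KB: 300,
    64 * KB: 150,
    128 * KB: 40,
    512 * KB: 10,
}

print(share_above(sample, 64 * KB))
print(share_above(sample, 128 * KB))
```

If the share above your intended limit is close to zero, lowering the maximum block size is unlikely to change anything.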
How to change the maximum block size in ESX(i)
In order to adjust the maximum I/O size, you need to go to the advanced settings of the host, select “Disk” and modify the parameter Disk.DiskMaxIOSize (in kilobytes). By default this variable is set to 32767, which is 32MB minus 1KB. In my case, changing the parameter to 128 (KB) actually gave the feeling of higher speed and a snappier response (especially when using Win7 VMs, which tend to do I/Os larger than 64KB if needed). I need to perform more testing, though, to see if this really helps or not.
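If you prefer the command line over the GUI, the same parameter can be set from the host's shell. The Python sketch below only builds the command lines and does not run anything. To the best of my knowledge esxcfg-advcfg is the tool on ESX(i) 4.x and esxcli is its equivalent on ESXi 5.x and later, but verify the syntax against your own version first. The helper name max_io_commands is made up, and treating the default of 32767 as the upper bound is my assumption.

```python
# Default value of Disk.DiskMaxIOSize in KB, as stated above.
DEFAULT_MAX_IO_KB = 32767


def max_io_commands(size_kb):
    """Build (but do not execute) the commands that set Disk.DiskMaxIOSize.

    Assumption: valid values run from 1 up to the default of 32767 KB.
    """
    if not 1 <= size_kb <= DEFAULT_MAX_IO_KB:
        raise ValueError("size_kb must be between 1 and 32767")
    option = "/Disk/DiskMaxIOSize"
    return {
        # Classic syntax, as found on ESX(i) 4.x hosts.
        "esxcfg-advcfg": ["esxcfg-advcfg", "-s", str(size_kb), option],
        # esxcli syntax, as found on ESXi 5.x and later.
        "esxcli": ["esxcli", "system", "settings", "advanced", "set",
                   "-o", option, "-i", str(size_kb)],
    }


# Sanity check on the default: 32767 KB is 32 MB minus 1 KB.
assert DEFAULT_MAX_IO_KB == 32 * 1024 - 1

for tool, argv in max_io_commands(128).items():
    print(tool, "->", " ".join(argv))
```

Keep a note of the original value so you can put it back if the change does not help.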