Surviving total SAN failure

Almost every enterprise setup for ESX features multiple ESX nodes, multiple failover paths, multiple IP and/or fiber switches… But having multiple SANs is hardly ever done, except in Disaster Recovery environments. But what if your SAN decides to fail altogether? And even more important, how can you prevent impact if it happens to your production environment?



Using a DR setup to cope with SAN failure

One option to counter the problem of total SAN failure would of course be to use your DR-site’s SAN, and perform a failover (either manual or via SRM). This is kind of a hard call to make: Using SRM will probably not get your environment up within the hour, and if you have a proper underlying contract with the SAN vendor, you might be able to fix your issue on the primary SAN within the hour. No matter how you look at it, you will always have downtime in this scenario. But in these modern times of HA and even Fault Tolerance (vSphere4 feature), why live with downtime at all?


Using vendor-specific solutions

A lot of vendors have thought about this problem, and especially in the IP-storage corner one sees an increase in “highly available” solutions. Most of the time relatively simple building blocks are stacked so that the whole can survive the failure of a SAN (or a SAN component). This is one way to cope with the issue, but it generally comes with a lot of restrictions, such as vendor lock-in and an impact on performance.

Why not do it the simple way?

I have found that simple solutions are generally the best solutions. So I tried to approach this problem from a very simple angle: from within the VM. The idea is simple: you take two storage boxes which your ESX cluster can use, you put a VM’s disk on a LUN on the first storage box, and you add a software mirror on a LUN on the second storage box. It is almost too easy to be true. I used a Windows 2003 server VM, converted the boot drive to a dynamic disk, added a second disk (placed on the second storage box) to the VM, and chose “add mirror” on the boot disk, with the second disk as the target.

Unfortunately, it did not work right away. As soon as one of the storage boxes fails, VMware ESX reports “SCSI BUSY” to the VM, which will cause the VM to freeze forever. After adding the following to the *.vmx file of the VM, things got a lot better:

scsi0.returnBusyOnNoConnectStatus = "FALSE"
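If you would rather script this change than edit the file by hand, a minimal sketch could look like the one below. It only manipulates the text of a .vmx file. The helper name set_vmx_option and the sample contents are made up for illustration, and I am assuming that the VM is powered off while its .vmx is edited and that its disks hang off controller scsi0, as in the line above. Note the straight double quotes around the value.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key = "value"` set.

    An existing line for the same key is replaced (duplicates are dropped);
    otherwise the setting is appended at the end.
    """
    new_line = f'{key} = "{value}"'
    result = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name == key:
            if not replaced:
                result.append(new_line)
                replaced = True
            # any further line for the same key is dropped
        else:
            result.append(line)
    if not replaced:
        result.append(new_line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical .vmx contents, for illustration only.
    sample = 'displayName = "w2k3-mirror"\nscsi0.present = "TRUE"\n'
    print(set_vmx_option(sample, "scsi0.returnBusyOnNoConnectStatus", "FALSE"))
```

Working on the text rather than on the file itself keeps the sketch easy to test; reading and writing the actual .vmx (and keeping a backup copy of it) is left to whoever uses it.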

Now, as soon as one of the LUNs fails, the VM has a slight “hiccup” before it decides that the mirror is broken, and it continues to run without issue or even lost sessions! After the problem with the SAN is fixed, you simply perform an “add mirror” within the VM again, and after syncing you are ready for your next SAN failure. Of course you need to remember that if you have 100+ VMs to protect this way, there is a lot of work involved…
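With that many VMs, it helps to know at least which ones need attention after a failure. The sketch below is only a bookkeeping aid, not something that talks to ESX or vCenter: it assumes you maintain your own inventory that records, per VM, on which storage box each half of the mirror lives. All VM and storage box names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: VM name -> (storage box of disk 1, storage box of disk 2)
Inventory = Dict[str, Tuple[str, str]]


def vms_needing_resync(inventory: Inventory, failed_box: str) -> List[str]:
    """VMs that had a mirror half on the failed box and need a new "add mirror"."""
    return sorted(vm for vm, boxes in inventory.items() if failed_box in boxes)


def vms_without_protection(inventory: Inventory) -> List[str]:
    """VMs whose two mirror halves sit on the same box, so a failure takes out both."""
    return sorted(vm for vm, (first, second) in inventory.items() if first == second)


if __name__ == "__main__":
    inventory = {
        "web01": ("san-a", "san-b"),
        "db01": ("san-b", "san-a"),
        "test01": ("san-a", "san-a"),  # both halves on one box: not protected
    }
    print(vms_needing_resync(inventory, "san-b"))
    print(vms_without_protection(inventory))
```

The second check is the more useful one before anything fails: a mirror whose two halves ended up on the same storage box protects against nothing.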

This has proven to be a simple yet very effective way to protect your VMs from a total (or partial) SAN failure. A lot of people do not like the idea of using software RAID within the VMs, but eh, in the early days, who gave ESX a thought for production workloads? And just to keep the rumors going: To my understanding vSphere is going to be doing exactly this from an ESX point of view in the near future…

To my knowledge, at this time there are no alternatives besides the two described above to survive a SAN failure with “no” downtime (unless you go down the software clustering path of course).

7 Responses to “Surviving total SAN failure”

  • Robert says:

    Many thanks!
    I had the same idea, but as you point out ESX will freeze the VM.
    I have tested your suggestion on 3.5U5 with SLES11 and Windows 2003 SP2 and it works.
    Some VMware KB articles indicate that ESX 4.0 had some issues with this.

    • Hi Robert,

      It is actually not an issue, more of a feature. By default, losing the LUN will pause the VM rather than fail the I/O. This is what you would want most of the time, unless you are creating a software mirror 🙂

      Good to hear you got it working.

  • John says:

    G’Day Erik,

    I have ESXi boxes here at version 4.0.0 Update 1 bolted to 2 x HP EVAs and I have attempted to put your solution in place. After applying the virtual machine configuration I simulated a storage failure and the virtual machine just froze. Virtual Centre complained that it could not see the virtual machine’s config file. What would be terrific is if we were able to choose between a shared disk model and a non-shared disk model. I came across a white paper ( which talks a bit about the whole Fault Tolerant implementation, and they do talk about this non-shared disk configuration; however, to date I have not been able to find any more information on it. They also talk about Long Distance FT, which sounds interesting. Thanks again for the post, and if you have any suggestions on how I can get this working in 4.0 I would be very interested.


    • Hi John,

      I am unable to test it on vSphere 4.0 now. But I did test it on vSphere 4.1 🙂

      Everything still works as expected under vSphere 4.1. The VM “cuts loose” the failing virtual disk, and the test subject (I used a Windows 2003R2 x64 VM) popped up a message “Windows – FT Orphaning: A disk that is part of a fault-tolerant volume can no longer be accessed.”

      This works both for the LUN holding only the second disk and for the LUN holding the VM config itself (although in that case the vmx file etc. is gone! Just do not power off this VM, because without its config it won’t start again).

      Whenever I build something like this, I always create a second VM on the second LUN, with a single (“existing”) disk pointing at the second virtual disk. That is because if the LUN/SAN holding the VM config fails, you cannot restart the VM. In that case you should be able to start the second “standby” VM instead, which should be able to boot from the remaining virtual disk.

      Since I have done the testing and screenshotting anyway, I’ll create another blog entry “Surviving Total SAN Failure Revisited” – out soon!

    • John,

      I think I both found AND solved your problem… Stay tuned for the Revisit of the “Surviving SAN Failure”, I’ll probably post it tomorrow (after some more testing) 🙂

  • […] time ago I posted Surviving total SAN failure which I had tested on ESX 3.5 at the time. A recent commenter had trouble getting this to work on a […]
