How to recover a hung VM in the ESX farm
Posted by craig on September 24th, 2008 in Virtualization

Today I ran into a hung VM in our production environment that could not be reset, powered off, or removed from the ESX farm. We started troubleshooting from VirtualCenter, of course, which didn’t work. Next, I used the vmware-cmd command to issue a stop and then a stop hard, and that still did not work. I also restarted the management service on the ESX host. Once I had done that, the VM showed as powered off, but esxtop still showed it as running. I tried registering the VM on another host and powering it on there, but that failed because the original ESX host was still holding resources for the problem VM.
Since those two simple approaches didn’t work, I had to go further with the kill -9 option. I ran ps -ef to look up the PID for the VM, and it showed -1 as the PID, which is abnormal.
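As a rough sketch of the escalation described above (soft stop, then hard stop, then a management service restart), something like the following could be run on the ESX service console. The names run, stop_vm and DRY_RUN are made up for this example, `service mgmt-vmware restart` is my assumption about how the management service was restarted, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: escalate from a soft stop to a hard stop for one VM.
# DRY_RUN, run and stop_vm are hypothetical names for this example.
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

stop_vm() {
    cfg="$1"   # full path to the VM's .vmx file, as listed by `vmware-cmd -l`
    # Each later step runs only if the earlier one fails
    # (assumes the commands exit non-zero on failure).
    run vmware-cmd "$cfg" stop soft ||
        run vmware-cmd "$cfg" stop hard ||
        run service mgmt-vmware restart
    # Report the power state the host believes the VM is in.
    run vmware-cmd "$cfg" getstate
}

# Example (the path is illustrative):
# stop_vm /vmfs/volumes/datastore1/myvm/myvm.vmx
```

Even when getstate reports the VM as off, it is worth checking esxtop as well, since in my case the two disagreed.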
grep VMNAME /proc/vmware/vm/*/*
This command will show the PID as well.
In the normal case, you can just run kill -9 (PID number).
In my case, it didn’t work. The only option left was to run vm-support -X (VMID).
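The PID lookup and the sanity check on it can be sketched roughly as below. The helper names vm_pid, is_valid_pid and kill_vm are hypothetical, the script assumes the usual `ps -ef` layout with the PID in the second column, and kill_vm only prints what it would do rather than killing anything.

```shell
#!/bin/sh
# Sketch only: vm_pid, is_valid_pid and kill_vm are hypothetical helpers.

# Read `ps -ef` output on stdin and print the PID (column 2) of every line
# that mentions the VM name, skipping the awk/grep processes themselves.
vm_pid() {
    awk -v name="$1" 'index($0, name) && $0 !~ /(awk|grep) / { print $2 }'
}

# Succeed only for a plain positive integer greater than 1,
# so an abnormal value such as -1 is rejected.
is_valid_pid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 1 ]
}

# Print the kill command for a sane PID; otherwise point at vm-support.
kill_vm() {
    pid="$1"
    if is_valid_pid "$pid"; then
        echo "would run: kill -9 $pid"
    else
        echo "PID '$pid' looks abnormal; consider vm-support -X <VMID>"
        return 1
    fi
}

# Example (VMNAME is a placeholder):
# kill_vm "$(ps -ef | vm_pid VMNAME)"
```

The check matters because running kill -9 with a PID of -1 would not target the VM at all; on Linux, kill with -1 as the PID signals every process the caller is allowed to signal.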
Be warned that the vm-support -X process is painful. It took me more than 25 minutes to complete, and the VM was still hung afterwards.
In the end, I VMotioned all the other VMs to the remaining hosts and rebooted the ESX host, which solved the problem and brought everything back to normal. Troubleshooting took far too long given that the VM is considered critical. I would not suggest spending too much time troubleshooting with commands when there is a faster way to fix the problem.

4 Responses
I don’t agree with rebooting the ESX host, and I believe there must be a way to solve this, but I’m not sure how. Please feel free to share if anyone knows. Thanks.
I agree with you, but a mission-critical system doesn’t allow us time for trial and error. Rebooting the ESX host is therefore the faster way to recover, and we can investigate alternative solutions afterwards.
I have the same problem.
Sometimes ps -ef | grep "VM" or ps -aux | grep "VM" doesn’t work for me to locate the process.
When the hung server (VM) is mission critical, I lose less time by migrating the VMs to other hosts and rebooting the host with the hung VM, but it would be great if anybody knows a way to kill the entire affected VM process so it can be started again without rebooting the server.
Excuse my English, I’m Spanish.
According to the knowledge base and Google searches, the solution above should have worked; unfortunately, in my case it didn’t. It has only happened to me once, so I have not logged a case with VMware about it. If anyone has a solution, please share it with us.