Resolution – ESX host unexpectedly disconnects from Virtual Center (ESX 3.5 Update 2)
Posted by craig
- on August 23rd, 2008 in Virtualization | 9 Comments »

When I logged in to my Virtual Center today to check on my VM farm, it showed that one of my ESX hosts had disconnected from Virtual Center by itself. The host runs critical production workloads and sits in a cluster with HA and DRS enabled. The first thing I verified was that the ESX host and all of its VMs were still in production, and yes, none of the VMs had gone down; they kept running as normal while the host was disconnected.
Here is what I did to reconfigure my ESX host and rejoin it to the HA and DRS cluster in my production farm.
I disabled the HA and DRS features on the cluster and completely removed the ESX host from the inventory on the Virtual Center server. After that, I SSHed in to the ESX host, switched to root with su -, changed to /etc/init.d, and ran the command service mgmt-vmware status.
It showed that the service was running. I then issued the command service mgmt-vmware restart. The service takes a couple of minutes to restart fully. While it restarted, I logged on remotely to one of the VMs on that ESX host to confirm that the guests were not affected. Everything worked perfectly, with no downtime on the VM guests, and the credit for that goes to VMware's technology.
Once the service has restarted, you can add the host back to Virtual Center and reconfigure HA and DRS on the cluster. The ESX host is now back to normal and working as usual.
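The two service commands above can be wrapped in a small shell function. This is only a sketch of the steps described in this post, meant to be run as root on the ESX service console. The SERVICE_CMD variable is a hypothetical override that is not part of ESX; it exists only so the steps can be dry-run on another machine.

```shell
#!/bin/sh
# Sketch: check and restart the ESX management agent (mgmt-vmware).
# SERVICE_CMD is a hypothetical override (an assumption, not an ESX feature):
# set SERVICE_CMD=echo to print the commands instead of running them.
SERVICE_CMD="${SERVICE_CMD:-service}"

restart_mgmt_agent() {
    # Step 1: report the current state of the management agent.
    $SERVICE_CMD mgmt-vmware status
    # Step 2: restart the agent; allow a few minutes before
    # Virtual Center shows the host as connected again.
    $SERVICE_CMD mgmt-vmware restart
}
```

To try it without touching a real host, set SERVICE_CMD=echo and call restart_mgmt_agent; it then only prints the two commands it would run.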
Tags: DRS, ESX, HA, mgmt-vmware, ssh, virtual center, VMware
9 Responses
This is a common bug; normally just a ‘service mgmt-vmware restart’ fixes the issue. No need to pull your DRS cluster apart etc.
After issuing the command it takes about 2-5 mins for the VC to see the host as active.
Yes, in the best case that will work. But to be safe, when you have mission-critical VMs running on a cluster (we are talking about 40 VMs in the cluster), you may need to consider the best way to troubleshoot.
Hi Craig,
I manage 70 ESX hosts daily, so I always make sure I don’t bring down a host with running VMs.
I’m unsure how your way is safer, as you haven’t isolated the host from the VMs that were running on it when it disconnected. All you have done is create more work for yourself by turning off HA & DRS. Why do I say that?
I assume that because you turned DRS off, you’re not using any resource pools? Once you turn off DRS, it removes all resource pools from the cluster. So when you turn DRS back on, you have to recreate these pools and their settings.
If you have nested pools and different settings, it can take a while to recreate them and then move each VM into the correct pool.
All these settings and placements need to be noted down before you turn off DRS!
It’s not a case of you’re wrong and I’m right. It’s a different viewpoint for others who may have large DRS pool setups.
Turning off HA and DRS does not affect the way I manage my farm. I have not set up resource pools in the farm because I have not seen any bottleneck or performance issue in the environment I am currently running. I am using the R900 with 128 GB of memory, and my environment spans regions across Asia, Europe and NA, which means the peak hours for each VM are different.
I agree with your point. It is not about who is right or wrong here; it is really about sharing the different environments and the different ways we can manage a farm. The reason I turn it off is in case any bug hits HA; I am not really concerned about DRS. I once had a case where I could not reconfigure HA after I had disabled it, no matter how many times I tried. The only solution at the time was to re-image the entire ESX host, after which it was back to normal. So, in case something goes wrong with HA and forces the VMs to fail over and incur downtime, I prefer to disable HA to ensure that no failover will happen.
I’ve had this problem a couple of times, and reconnecting the server doesn’t usually work either because it doesn’t recognize the authentication information. Restarting the Management Agent over the console fixes the problem and doesn’t bring down any services.
Hi Craig, I had a problem like yours and found the solution by looking at the firewall between my networks.
Hi Alexander, your environment may be different, so the impact may be different too. Yes, you do need to make sure the correct ports and settings are in place on the firewall if you have one in between. In my case, we do not have a firewall within the LAN; the firewall applies to the external public network only. Thanks for sharing, as it may be useful for all of us.
This one worked great!! thx
You should also check whether scratch and the / directory have enough space left. Sometimes it also helps to manually delete vpxuser by connecting directly to the box with the client.
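The free-space check suggested above could be scripted roughly as follows. This is a minimal sketch: the function names and the 90% default threshold are assumptions made for illustration, and only / is checked, since the comment does not give the scratch path.

```shell
#!/bin/sh
# Sketch: warn when a filesystem is nearly full.
# usage_percent and check_space are hypothetical helper names;
# the default threshold of 90 (percent) is an arbitrary assumption.

usage_percent() {
    # df -P prints one line per filesystem; column 5 is the used capacity.
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_space() {
    used="$(usage_percent "$1")"
    limit="${2:-90}"
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $1 is ${used}% full"
        return 1
    fi
    echo "OK: $1 is ${used}% full"
}
```

Calling check_space / reports on the root filesystem; pass a second argument to use a different threshold, and pass the scratch location used on your host to check that as well.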