VMware HA – the wrong way.

December 8th, 2008 by Josh

I ran into an issue this past week where VirtualCenter wouldn’t start.  I was swamped with other issues, so in my haste I made a few bad decisions that made everything worse before I eventually discovered the root cause: the SQL log file was full.  I changed it to unrestricted growth and voilà, we were back in business.

Wait, no we weren’t!  None of the cluster hosts would enable HA; every one of them errored out.  I tried several things suggested on the VMware Communities forums without success, from removing the hosts from the cluster and re-adding them to simply disabling HA and enabling it again.  Was this the result of updates I had installed recently?  Did something else change?

I’ll take you back to when I first implemented this virtual platform.  It’s four Dell servers connected to a pre-Dell EqualLogic iSCSI SAN.  Each host has two Service Consoles configured, per the best practices documentation, although I didn’t understand why that was necessary at the time of implementation.  Service Console #1 is on the production LAN and uses the default gateway as its isolation address (the address shown as the default gateway setting in the Service Console configuration).  Service Console #2 is on the iSCSI network and uses a nonexistent IP as its isolation address.  If you understand how HA works and why a second Service Console is a good idea, then you’re probably cringing and calling me stupid right now.  So am I.

Back to last week.  We replaced our IPsec VPN to our EMEA network with a direct connection to their Colt MPLS network, and in the process made some changes to the firewall rules.  One of those changes accidentally deleted the ICMP rule.  Now the ESX hosts couldn’t ping the default gateway, which shouldn’t have been a problem because of the second Service Console.  But since Service Console #2 was pointed at an isolation address that didn’t exist, there was no redundancy and HA failed.

Solution: RTFM and understand what you’re doing before implementing an HA cluster.  If you don’t, then do what I did: fix the ICMP rule on your default gateway for Service Console #1, and add a valid isolation address for Service Console #2 just in case you’re stupid again later.
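
A check like this is easy to script.  What follows is a minimal Python sketch, not anything VMware ships: it pings each isolation address and reports the ones that don’t answer.  The function names, the labels, and the IP addresses are all made up for illustration (the IPs come from the reserved documentation ranges), and the ping flags assume a Linux-style ping binary.

```python
import subprocess
from typing import Callable, Dict, List


def ping(address: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; return True if a reply came back.

    Assumes a Linux-style ping binary (-c count, -W timeout in seconds).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # No usable ping binary: treat the address as unreachable.
        return False
    return result.returncode == 0


def unreachable_isolation_addresses(
    addresses: Dict[str, str],
    probe: Callable[[str], bool] = ping,
) -> List[str]:
    """Return the labels of the isolation addresses that do not answer."""
    return [label for label, ip in addresses.items() if not probe(ip)]


def main() -> None:
    # Hypothetical addresses; substitute your own isolation addresses.
    isolation_addresses = {
        "Service Console #1 (production LAN gateway)": "192.0.2.1",
        "Service Console #2 (iSCSI network)": "198.51.100.10",
    }
    for label in unreachable_isolation_addresses(isolation_addresses):
        print(f"No ping reply from isolation address for {label}")
```

Calling main() from a machine on the same networks as the Service Consoles would have flagged both of my problems: the gateway that stopped answering ICMP and the iSCSI-side address that never existed.  The probe argument is only there so the logic can be exercised without touching the network.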


Josh Currier - Blogged