When I began my journey toward virtualization, it was early and there weren’t many good tools for planning performance capacity. Step 1 started with a single host server with local storage, and after a year of running 14 low-resource servers on the one box, I had the leverage to take a big step up to Step 2.
Step 2 included an EqualLogic PS400E iSCSI SAN and four Dell PowerEdge 1950 III servers running VMware ESX Enterprise, plus a Virtual Center server. Our immediate goals for virtualization were modest: we planned to virtualize only file, application, and mail servers, about 15 to 20 physical servers in all. Our Oracle and SQL servers already had dedicated platforms, and until we broke into the realm of a ‘real’ SAN like EMC, I didn’t want to deal with the i/o headaches these systems would cause on our virtual platform.
The transition went great! I had some network performance issues early on, but we got things smoothed out and were back on track. What I didn’t see coming from all this was the subsequent explosion of servers on our network. In less than 2 years, the number of servers more than doubled, from about 20 to over 40. Engineering needs an application server? Sure! Deploy from template, done and ready to use in 30 minutes. IT needs a dev server? Yep, done. While the bulk of this growth was happening, I had no good view into the performance of our SAN. I could see server-specific performance stats in the performance tab of Virtual Center, but it was difficult to aggregate that data to see how the system as a whole was running. As I wrote recently, SAN Headquarters gave me the performance visibility I needed at the SAN level, and between that and some other recent revelations I realized that I could have planned our infrastructure better.
Some environments are slaves to the cpu, while others are slaves to disk i/o or memory. Ours is a slave to both memory and disk i/o. I was convinced, based on absolutely no hard evidence whatsoever, that our priority should be cpu -> memory -> disk performance. We got 64GHz of cpu, 64GB of RAM, and SATA drives in the SAN. We are now near the ceiling on host memory capacity and well through the ceiling on disk performance. Cpu usage is averaging about 8%. Yes, 8. Having tools like SAN Headquarters earlier on would have kept me from getting in so deep, but the ultimate cause of this mess is poor planning.
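To make that lopsidedness concrete, here is a minimal sketch that ranks resources by how close each one is to its ceiling. Only the 8% cpu figure comes from my numbers above; the memory and disk values are placeholders I made up to stand in for "near the ceiling" and "well through the ceiling", and the function names are just for illustration.

```python
# Utilization as a fraction of capacity.
# Only the cpu figure is real; memory and disk_io are hypothetical placeholders.
utilization = {
    "cpu": 0.08,
    "memory": 0.90,   # assumed stand-in for "near the ceiling"
    "disk_io": 1.10,  # assumed stand-in for "well through the ceiling"
}


def rank_bottlenecks(util):
    """Return (resource, utilization) pairs, most constrained first."""
    return sorted(util.items(), key=lambda kv: kv[1], reverse=True)


def headroom(util):
    """Remaining fraction of capacity per resource (negative = oversubscribed)."""
    return {name: round(1.0 - used, 2) for name, used in util.items()}


ranked = rank_bottlenecks(utilization)
for name, used in ranked:
    print(f"{name:8s} {used:5.0%}")
```

Whatever sits at the top of that list is where the next dollar should go, and whatever sits at the bottom is where I overspent.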
Memory - go deep! Yes, the beauty of a virtual platform is that you can scale the resources in a VM to meet that particular VM’s needs. With that said, give your VMs more than they need anyway. More headroom in physical memory = less risk of paging to disk, which lowers the likelihood of unnecessary disk i/o.
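A rough way to check memory headroom is to add up what the VMs are allocated and compare it with the physical RAM left if a host goes down. The sketch below assumes our 64GB is split evenly across the four hosts and uses a made-up 1.5GB per VM for 40 VMs; the function name and those per-VM numbers are hypothetical, and it ignores hypervisor overhead and any memory-sharing tricks.

```python
def memory_headroom_gb(host_ram_gb, vm_ram_gb, hosts_lost=1):
    """Physical RAM left after allocating every VM, assuming the largest
    `hosts_lost` hosts are down. Negative means the VMs no longer fit."""
    surviving = sorted(host_ram_gb)[: len(host_ram_gb) - hosts_lost]
    return sum(surviving) - sum(vm_ram_gb)


# Assumption: 64GB total, split evenly over four hosts.
hosts = [16, 16, 16, 16]
# Hypothetical: 40 VMs at 1.5GB each.
vms = [1.5] * 40

print(memory_headroom_gb(hosts, vms, hosts_lost=0))  # all hosts up
print(memory_headroom_gb(hosts, vms, hosts_lost=1))  # one host down
```

If the second number is negative, a single host failure pushes you into paging, which lands right back on the disks.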
Disk – remember that you’ll be running many server instances on shared storage. Exactly how shared depends on what type of SAN you get. EqualLogic creates one physical RAID array and then creates volumes on top of it. This means that all i/o is going to affect all other i/o to some degree. Other SAN technologies, like EMC, create individual RAID arrays within each chassis, giving each volume some isolation from the i/o generated by other volumes in the same chassis.
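Because every volume draws on the same spindles, it helps to estimate what the whole array can deliver and then divide by the number of VMs. The sketch below uses the common rule-of-thumb write penalties for RAID levels; the drive count, per-disk IOPS, and write mix are placeholder inputs I picked for illustration, not measurements from our SAN, so plug in your vendor's numbers.

```python
# Rule-of-thumb back-end I/Os generated per front-end write.
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(disks, iops_per_disk, write_fraction, raid_level):
    """Approximate front-end IOPS a single shared array can sustain."""
    raw = disks * iops_per_disk
    penalty = RAID_WRITE_PENALTY[raid_level]
    # Reads cost one back-end I/O each; writes cost `penalty` each.
    return raw / ((1 - write_fraction) + write_fraction * penalty)


def per_vm_share(total_iops, vm_count):
    """Average IOPS available per VM if the array is shared evenly."""
    return total_iops / vm_count


# Hypothetical inputs: 12 SATA drives at 80 IOPS each, 30% writes, RAID 5.
total = effective_iops(disks=12, iops_per_disk=80, write_fraction=0.3, raid_level="raid5")
print(round(total), round(per_vm_share(total, vm_count=40), 1))
```

The per-VM number is only an average. On a shared array, one busy VM can eat far more than its share, which is exactly why faster disks matter more here than on a SAN that isolates volumes.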
What I would have done differently:
1. Double the memory. We are stuck between disk i/o and available physical memory and can’t grow our VM farm until both have been addressed.
2. Faster disks. This is especially important with a platform like EqualLogic where the volumes share the same physical RAID array.
3. Less processor. 8%, seriously.
So talk to your vendors and get help if you’re not sure exactly what you need for your environment. Capacity planning for an existing implementation isn’t far off from planning a new one, but there are more tools available if you’ve already got VMware deployed, so I’ll follow up on that later.