About a year ago I was faced with the reality that we’ll keep needing more servers, and that after 3 years we have to start paying to renew the warranties. Combined with the fact that our resource utilization on most of our servers was lllllooooowwwww, it became pretty obvious that we needed to make a change (we can believe in! lol). The answer to our problems: Virtualization. Duh.
We already had two standalone VMware ESX servers running production servers and I couldn’t be happier, so the decision to use VMware as the platform for a large-scale (ok it’s just large-scale to us) virtualization initiative was a no-brainer. We ordered 4 Dell rack servers with ESX 3 Enterprise, and a 7TB EqualLogic PS400E iSCSI SAN, to be connected using Procurve 2810-48 switches on both the network and iSCSI side with separate switches for each. I set it all up and started migrating physical servers into the virtual environment. Abso-freaking-lutely awesome!
I’ve got the ESX hosts set up with two dual-port NICs (Broadcom onboard, Intel PCI-e) in each: two ports teamed for Service Console, LAN, and DMZ, and two ports teamed for iSCSI and VMotion. On the iSCSI side I’ve got a Procurve 2810-48 with flow control turned on, but it doesn’t support flow control and jumbo frames at the same time so jumbos are off. All is well.
That was then.
About a month ago I got a call from a guy in the Engineering Dept., saying that he’s opening a large SolidWorks drawing and it’s quite slow. I do some quick tests and sure enough, large file transfers from the file server are slow! 150Mbps if I’m lucky, on a GigE connection. I tested from some other systems and the results were the same. I called our sales rep with one of our vendors and asked if I could meet with their storage specialist to talk about this issue…and the following week we had our meeting. What he suggested was that, despite not being officially supported by VMware, we get a switch that would do flow control and jumbo frames simultaneously. A Procurve 2900 fit the bill, and as soon as it arrived I cut over to the new switch. No improvement, though that was no surprise since I still needed to enable jumbo frames on the vSwitches, which I did next. Still no improvement.
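To put that 150Mbps figure in perspective, here is a quick back-of-the-envelope sketch in Python. It assumes a nominal 1000 Mbps line rate for GigE and ignores protocol overhead, so treat the numbers as rough; the function names are just made up for illustration.

```python
# Rough unit conversion for network throughput figures.
# Assumption: a GigE link is treated as a nominal 1000 megabits per second,
# with no allowance for Ethernet/IP/TCP overhead.
GIGABIT_LINK_MBPS = 1000


def mbps_to_mbytes_per_sec(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def link_utilization(observed_mbps: float, link_mbps: float = GIGABIT_LINK_MBPS) -> float:
    """Return the fraction of the nominal link rate that was actually used."""
    return observed_mbps / link_mbps


observed = 150  # the file transfer rate seen from the file server, in Mbps
print(f"{observed} Mbps is about {mbps_to_mbytes_per_sec(observed):.2f} MB/s")
print(f"That is roughly {link_utilization(observed):.0%} of a GigE link")
```

In other words, the transfers were using only a small slice of what the wire could nominally carry.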
At this point I had posted about my little adventure on the VMware user forums and received a ton of suggestions, but nothing had helped. That posting was noticed by a rep at Dell who was nice enough to forward me a best practices document for using VMware with the EQL SAN, and to my delight I discovered that I had done everything in the proper manner. So…then what? More testing! I installed SQLio and IOmeter on the file server as suggested in the VMware Communities thread, and ran them against one of the data volumes on the SAN. The results seemed optimal! If the server, that lives on the SAN, can get data to and from the SAN normally, why can’t I get data normally from that server via the network? Is it a problem on the network side? Mooore testing.
Next I installed the Microsoft iSCSI initiator client on a physical server and connected to a volume on the SAN and ran the same file copy that I used in my initial test, a 1.8GB .iso file, and it was fast! So this isn’t a problem with the EQL or the Procurve on the iSCSI side, it’s either the Procurve 2810-48 on the network side, the physical host server, or ESX! I moved a smaller VM onto the local storage of one of the hosts and tested the file copy again and while it was juuust a tad faster it was still terrible. The more testing I did, the more I was sure that it was the Broadcom NICs on the host servers, so I ordered and installed an Intel GigE card to test with. No improvement. What?! Could it be the Procurve on the network side? I tested the file copy between physical servers, both connected to that switch, and it performed as expected. Nope, not the Procurve on the network side.
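For anyone wanting to repeat this kind of file copy test with numbers instead of a stopwatch, below is a minimal sketch in Python. This is not the tool I used; the function name and the command-line usage are made up for illustration, and it assumes the source and destination paths are already reachable (for example a mapped network share). OS caching can inflate the result, so use a file that is large compared to RAM, or a freshly rebooted client.

```python
import os
import shutil
import sys
import time


def copy_throughput_mbps(src: str, dst: str) -> float:
    """Copy src to dst and return the observed rate in megabits per second."""
    size_bytes = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    # Guard against a zero elapsed time on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (size_bytes * 8) / elapsed / 1_000_000


# Hypothetical usage: python copytest.py <source file> <destination file>
if __name__ == "__main__" and len(sys.argv) == 3:
    rate = copy_throughput_mbps(sys.argv[1], sys.argv[2])
    print(f"{rate:.0f} Mbps ({rate / 8:.1f} MB/s)")
```

Running the same copy from a physical server and from a VM gives directly comparable figures, which is the comparison that matters here.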
At one point I was directed to a great blog post about USB drivers interfering with network performance on ESX hosts, so I tried disabling the usb-ohci drivers on the hosts, but sadly I didn’t experience the same return of performance that many others did.
I’m running out of ideas at this point, and the last thing I can try (that I can think of) is iSCSI HBAs instead of the software iSCSI in ESX. The performance degradation definitely seems to be happening on the host server, so maybe, just maybe taking the software iSCSI initiator out of the mix will help, but I’m not confident that it will matter since the test using the local storage of one of the hosts showed similarly bad performance moving data to another network server. I guess doing it right doesn’t always work.
For the record I’m still quite happy with this system, most other servers on this SAN are running fine and the performance issue is only a problem on large file transfers. This system really is a match made in heaven, despite whatever is going on with ESX to cause this performance issue. I will post a follow-up to this when I make some progress. I’ve spent a lot of time on this, with a lot of help from a lot of people, and gotten nowhere. I’m sure it’s an issue on the network side on the ESX host but beyond that I can’t figure out a fix. If you have any suggestions, feel free to post a comment.
Some of the resources I found helpful:
http://communities.vmware.com/thread/166113?tstart=0
http://blog.scottlowe.org/2008/04/22/esx-server-ip-storage-and-jumbo-frames/
http://blog.scottlowe.org/2006/12/04/esx-server-nic-teaming-and-vlan-trunking/
http://www.tuxyturvy.com/blog/index.php?/archives/37-Troubleshooting-VMware-ESX-network-performance.html
Have you tried a Microsoft iSCSI initiator in the actual VM? I have heard that this is actually faster than the ESX iSCSI initiator if you do not have a hardware HBA in the ESX server…
I haven’t tried that, no…but all indications point to the ESX network side as the culprit. I’ll do some testing with the MS initiator though, thanks for the suggestion!
I am working through the same issue on some HP DL360 G4 and HP DL360 G5 servers using ESXi 3.5 Update3 with two Cisco 3650G switches. And in my case installing the MS initiator inside a virtual guest is fast (~100MBytes/sec) just like on a physical host, but letting VMware ESXi’s software iSCSI handle the I/O I only get around 7MBytes/sec each direction.
Please update here and on your VMware community post if you find out anything, and I will do the same.
Thanks!
Long story short, the VMware software iSCSI initiator is crap. Last week I got 4 QLogic 4062c iSCSI HBAs, and wouldn’t you know it I’m getting 500-700 Mbps throughput on file copy tests across the network. I knew that software iSCSI would be a little slower than hardware, but that’s a little ridiculous.
I think this may be the solution I was looking for:
I tested IOmeter on a new Win2003 host I created from scratch and its virtual disk throughput was great (80MB/sec @ 1MB, 50/50 RW, 50/50 random) to our iSCSI DataCore storage.
I tested IOmeter on an old P2V converted host and got poor performance (15MB/sec @ 1MB, 50/50 RW, 50/50 random). I also did more of a real-world test to see how the IO was rather than just disk bandwidth (~280 iops, 1MB/sec @ 4K 50/50 RW, 75/25 random, 64 command queue).
I then took the same P2V converted host and changed the virtual SCSI controller used from Buslogic to LSI, rebooted, let it install drivers, rebooted again, ran IOmeter, and got really good performance (50MB/sec @ 1MB, 50/50 RW, 50/50 random), (~1300 iops, 5MB/sec @ 4K 50/50 RW, 75/25 random, 64 command queue).
This made my day!!! Now I just have to go back and switch to LSI SCSI all my P2V’ed hosts. This doesn’t clear up my poor disk throughput when I am copying large vmdk files using VI Client, but at least my guest VMs will be zippy now!
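The IOmeter figures above hang together if you remember that throughput is just IOPS times block size. A small sanity check in Python, assuming "4K" means 4 KiB blocks and 1 MB means 1024 KiB (the helper name is made up for illustration):

```python
def throughput_mb_per_sec(iops: float, block_size_kib: float) -> float:
    """Throughput implied by an IOPS figure at a given block size."""
    return iops * block_size_kib / 1024


# 4K test on the P2V'ed host, before and after switching Buslogic -> LSI
buslogic = throughput_mb_per_sec(280, 4)
lsi = throughput_mb_per_sec(1300, 4)

print(f"Buslogic: {buslogic:.2f} MB/s")        # 1.09, matches the ~1MB/sec reported
print(f"LSI:      {lsi:.2f} MB/s")             # 5.08, matches the ~5MB/sec reported
print(f"IOPS improvement: {1300 / 280:.1f}x")  # 4.6x
```

So the small-block numbers and the reported MB/sec agree with each other, and the controller change was worth roughly a 4.6x gain in IOPS on that test.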
Cross-posted to: http://communities.vmware.com/post!reply.jspa?messageID=1191348
Josh,
Are you happy with the 2810 switches for iSCSI? We are having a hard time deciding on an HP iSCSI network switch. It’s either the 1800-24G or the 2810-24G.
Thanks,
Will
Hi Will,
As it turns out, none of my performance issues were caused by the 2810 switches. As recommended by a ‘storage expert’ I upgraded to a Procurve 2900 because it supported jumbo frames and flow control simultaneously, but it didn’t fix the problem. In light of that, yep I was more than happy with the 2810 switch.
I hear that software iSCSI has been much improved in ESX4 as well, though I haven’t had the opportunity to work with it yet.
Hi,
You might want to check teaming settings.
I have noticed that the “route based on…” teaming settings are far slower than “use explicit failover…”.
Hope it helps,
I haven’t done a direct comparison no, but thanks for the tip! I will do some testing with this next week.
I’d be curious whether you ever did any more testing or tried to isolate this issue. I have a nearly identical setup to yours before you got the QLogic HBAs (VMware ESX 3.5, EqualLogic SAN, both Broadcom built-in and PCI-e Intel GigE cards, etc.), even right down to having the Dell servers, except we have old Dell gigabit switches (although we don’t think they’re the main performance problem).
We notice similar issues, never getting more than 20MB/s or so from our virtualized, SAN-attached servers. General performance for small random I/O has still been OK though, as we have Exchange, SQL, file servers, etc. all running off this setup concurrently and it’s all still a little faster than it used to be before SANs and virtualization…although maybe it’s supposed to be A LOT faster and we’re just not seeing it?
However there’s one key factor we discovered in testing…Windows. We’re still mostly an XP/2003 shop, but have just started playing with some Win7/Server 2008 stuff. From a 2003 file server to an XP machine, I can’t get more than 20-27MB/s downloading a large file, even when trying to eliminate the local disk by downloading to a RAM disk or opening CAD files directly from the server. The funny thing is that we know our backups pull in close to 80MB/s, which is really starting to show good utilization of a gigabit pipe…and they’re running as clients on the servers, sending out to the tape library server over the LAN links, so we kind of figure the SAN/VMware infrastructure isn’t likely at fault… Then we tested with Windows 7 on an older laptop. And got much closer to expected speeds! 30-40MB/s copying a large file to the local drive, and even higher opening/loading a large file into memory… So I wonder if the problem isn’t more on the client (server?) OS side?
I’d be interested in hearing what you think. Thanks.
Hi Shawn,
I haven’t done much testing since getting the HBAs. The performance is better but not what I would like to see from this system. I do plan to test further but I’m spread pretty thin these days and it might be a while before this particular issue finds its way to the top of the list again. Thanks for the insight on Server 2008 and Win7, I’ll keep that in mind when I get back to testing!