So in my previous blog entry I wrote about how I upgraded a 3PAR T400 to support the new VMware vSphere 4.1 VAAI extensions. I did some quick tests just to confirm the array was responding to the three new SCSI primitives, and all was a go. But to better quantify the effects of VAAI I wanted to perform more controlled tests and share the results.
Environment
First let me give you a top-level view of the test environment. The host is an 8-core HP ProLiant blade server with a dual-port 8Gb HBA, dual 8Gb SAN switches, and two quad-port 4Gb FC host-facing cards in the 3PAR (one per controller). The ESXi server was zoned to only two ports on each of the 4Gb 3PAR cards, for a total of four paths. The ESXi 4.1 Build 320092 server was configured with native round-robin multipathing. The presented LUNs were 2TB in size, had zero detect enabled, and were formatted with VMFS 3.46 using an 8MB block size.
Testing Methodology
My testing goal was to exercise the XCOPY (SCSI opcode 0x83) and WRITE SAME (SCSI opcode 0x93) primitives. To test the WRITE SAME extension, I wanted to create large eager-zeroed disks, which forces ESXi to write zeros across the entire VMDK. Normally this takes a lot of SAN bandwidth and time just to transfer all of those zeros. Unfortunately I can't provide screenshots because the system is in production, so you will have to take my word for the results.
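As a rough illustration of why the offload matters, here is a toy model (my own numbers for transfer and extent sizes, not anything ESXi or 3PAR documents) comparing the commands issued and the bytes put on the wire when zeroing a 70GB VMDK with plain writes versus WRITE SAME:

```python
# Toy model of why WRITE SAME (opcode 0x93) offload is cheap for the host.
# Without VAAI, ESXi must stream every zero block over the fabric; with
# VAAI it sends a short command describing "this pattern, repeated N blocks".

BLOCK = 512          # logical block size in bytes
XFER = 1 * 1024**2   # assumed host transfer size without VAAI (illustrative)

def zero_fill_commands(vmdk_bytes, vaai=False):
    """Return (commands_issued, payload_bytes_on_wire) for zeroing a VMDK."""
    if not vaai:
        cmds = vmdk_bytes // XFER   # one WRITE per transfer-sized chunk
        return cmds, vmdk_bytes     # every zero byte crosses the SAN
    # With VAAI, each WRITE SAME carries a single 512-byte zero block plus
    # a block count; assume (illustratively) ESXi covers 1GB per command.
    extent = 1024**3
    cmds = vmdk_bytes // extent
    return cmds, cmds * BLOCK       # payload is one block per command

no_vaai = zero_fill_commands(70 * 1024**3)
with_vaai = zero_fill_commands(70 * 1024**3, vaai=True)
print(no_vaai)    # (71680, 75161927680) -> ~70GB of zeros on the wire
print(with_vaai)  # (70, 35840)          -> ~35KB on the wire
```

The absolute numbers are invented, but the shape of the result is the point: the fabric traffic drops from the full size of the VMDK to a handful of tiny commands.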
"Write Same" Without VAAI:
70GB VMDK: 2 minutes 20 seconds (500MB/sec)
240GB VMDK: 8 minutes 1 second (498MB/sec)
1TB VMDK: 33 minutes 10 seconds (502MB/sec)
Without VAAI the ESXi 4.1 host is sending a total of 500MB/sec through the SAN and into the four ports on the 3PAR. Because the T400 has an active/active concurrent controller design, both controllers can own the same LUN and distribute the I/O load. In the 3PAR IMC (InForm Management Console) I monitored the host ports, and all four were equally loaded at around 125MB/sec.
This shows that round-robin was functioning, and highlights the very well balanced design of the T400. But this configuration is what everyone has been using for the last 10 years... nothing exciting here, unless you want to weigh down your SAN and disk array with processing zeros. Boorrrringgg!!
Now what is interesting, and what very few arrays support, is a 'zero detect' feature, where the array is smart enough on thin provisioned LUNs to not write data when an incoming block is entirely zeros. So in the 3PAR IMC I monitored the back-end disk-facing ports and sure enough, there was virtually zero I/O. This means the controllers were accepting 500MB/sec of incoming zeros and writing practically nothing to disk. Pretty cool!
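The idea is simple enough to sketch in a few lines. This is purely illustrative (a Python dict standing in for the thin LUN's allocation map, not 3PAR's actual ASIC logic): the controller inspects each incoming block and only touches disk when the block contains non-zero data.

```python
# Sketch of thin-provisioning zero detect: inspect each incoming block and
# skip the back-end write entirely when the block is all zeros.

def write_with_zero_detect(backing: dict, lba: int, data: bytes) -> bool:
    """Write one block; return True only if back-end disk I/O happened."""
    if data.count(0) == len(data):   # every byte in the block is zero
        backing.pop(lba, None)       # nothing to store on a thin LUN
        return False
    backing[lba] = data
    return True

disk = {}  # stands in for allocated chunklets on the back end
wrote = [write_with_zero_detect(disk, lba, b"\x00" * 512) for lba in range(1000)]
print(sum(wrote), len(disk))   # 0 0 -> 1000 incoming zero blocks, zero disk writes
```

That is exactly the behavior observed on the disk-facing ports: the front end absorbs a flood of zeros while the back end stays idle.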
"Write Same" With VAAI: 20x Improvement
70GB VMDK: 7 seconds (10GB/sec)
240GB VMDK: 24 seconds (10GB/sec)
1TB VMDK: 1 minute 23 seconds (12GB/sec)
Now here's where your juices might start flowing if you are a storage and VMware geek at heart. Performing the exact same VMDK create operations on the same host, using the same LUNs, performance increased 20x!! Again I monitored the host-facing ports on the 3PAR, and this time I/O was virtually zero, and thanks to zero detection within the array, there was almost zero disk I/O as well. Talk about a major performance increase. Instead of waiting over 30 minutes to create a 1TB VMDK, you can create one in less than 90 seconds while placing no load on your SAN or disk array. Most other vendors are claiming only up to a 10x boost, so I was pretty shocked to see a consistent 20x increase in performance.
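For the curious, the speedup factors fall straight out of the timings listed above:

```python
# Speedup = (seconds without VAAI) / (seconds with VAAI), per VMDK size.
timings = {            # size -> (seconds without VAAI, seconds with VAAI)
    "70GB":  (2 * 60 + 20, 7),
    "240GB": (8 * 60 + 1, 24),
    "1TB":   (33 * 60 + 10, 60 + 23),
}
for size, (before, after) in timings.items():
    print(size, round(before / after, 1))
# 70GB 20.0
# 240GB 20.0
# 1TB 24.0
```

Note the 1TB case actually works out closer to 24x, slightly better than the headline 20x.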
In conclusion, I satisfied myself that 3PAR's implementation of the WRITE SAME command, coupled with their ASIC-based zero detection feature, drastically increases creation performance of eager-zeroed VMDK files. Next up will be my analysis of the XCOPY command, which produced some interesting results that surprised me.
Update: I saw that the vStorage blog did a similar comparison on the HP P4000 G2 iSCSI array. Of course the array configuration can dramatically affect performance, so this is not an apples-to-apples comparison. But nevertheless, I think the raw data is interesting to look at. For the P4000 the VAAI performance increase was only 4.4x, not the 20x of the 3PAR. In addition, VMDK creation throughput is drastically slower on the P4000.
Without VAAI:
T400 500MB/sec vs P4000 104MB/sec (T400 4.8x faster)
With VAAI:
T400 10GB/sec vs P4000 458MB/sec (T400 22x faster)
Interesting results with the 3Par.
Bit of an apples to oranges comparison :)
The P4000 is a midrange commodity 1Gb iSCSI cluster, and the 3PAR is no doubt doing its mesh cluster backplane, multiple 4Gb Fibre connections, custom controller ASICs, and a ton of cache?
Just out of curiosity - what's the price tag on the T400? I've never seen one in the wild in this part of the world (NZ).
Barrie, you are certainly right that the arrays are targeted at very different segments of the market. However, the 3PAR is surprisingly affordable given its feature set and performance.
3PAR array prices vary widely depending on capacity and which features you license. But looking at some pre-packaged SAS P4000s, a nicely configured two-tier Fibre Channel and SATA combination 3PAR T400 could be only ~2x more per GB. Their F series is even more affordable.
One thing to note - you do have to install the 3PAR VAAI plugin first on the vSphere host, or you won't see the effects.
Kind of obvious but took a little while for me to realise! :)
Yes, and what was worse is that the original driver package release was broken and didn't work. So no matter what you tried, VAAI didn't engage.
Hi Derek,
I have a semi-related question for you. I have a large VM environment attached to a T800. In the past our 3PAR SE has repeatedly told us that partition alignment isn't a concern, considering that the datastores are formatted with VMFS3. We had a meeting with our VMware reps today and they said we absolutely need to be concerned with alignment on the guest OS volumes. We are prepping a test for tomorrow, but I am curious whether you align all host partitions? We will concentrate on the heavy-hitting, I/O-intensive VMs first, but I don't really believe that alignment will have a huge effect.
Thoughts?
Your 3PAR SE rep is wrong, sorry to say. VMFS3 has no bearing on whether the guest OS has aligned I/O or not. Server 2003 and earlier DO NOT properly align volumes, whereas Server 2008 and later DO. Citrix has a good whitepaper on VDI I/O here: http://support.citrix.com/article/CTX130632
You can check out another blog I wrote about disk alignment here:
http://derek858.blogspot.com/2011/06/align-your-partitions-with-vmware.html
Where they state: "In order to minimize the utilization of the storage sub-systems, it is best practice to fully align the file systems at all layers (i.e. VM, Hypervisor, Storage). For optimal performance, the starting offset of a file system should align with the start of a block in the next lower layer of storage. For example, an NTFS file system that resides on a LUN should have an offset that is divisible by the block size or stripe size of the storage array presenting the LUN. Misalignment of block boundaries at any one of these storage layers can result in performance degradation, as multiple storage I/O operations can be required to access a single block of data."
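The divisibility rule in that quote is easy to demonstrate. This sketch (assuming a hypothetical 64KB array stripe unit; 3PAR's actual chunklet geometry differs) counts how many back-end chunks a single 4KB guest I/O touches for an aligned offset versus the legacy 63-sector partition start that Server 2003-era tools produce:

```python
# Count how many array chunks one guest I/O touches at a given byte offset.
# Aligned offsets keep every I/O inside one chunk; a 63-sector (31.5KB)
# partition start eventually puts 4KB blocks across a chunk boundary.

def ios_required(offset: int, io_size: int, chunk: int) -> int:
    """Number of back-end chunks touched by one guest I/O at byte `offset`."""
    first = offset // chunk
    last = (offset + io_size - 1) // chunk
    return last - first + 1

CHUNK = 64 * 1024                  # hypothetical stripe unit (assumption)
aligned = 1024 * 1024              # 1MB start: divisible by 64KB
legacy = 63 * 512 + 8 * 4096       # 9th 4KB block on a 63-sector partition

print(ios_required(aligned, 4096, CHUNK))  # 1 -> one back-end I/O
print(ios_required(legacy, 4096, CHUNK))   # 2 -> straddles a chunk boundary
```

That extra back-end I/O on a fraction of all guest writes is exactly the "performance degradation" the quote warns about.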
Yes, I am finding similar docs as well. This is so very unfortunate and frustrating, since we have been P2V'ing for 6 months and deploying Linux VMs... most are not aligned, and wouldn't you know it, VMware is melting the array!
What a PITA!
Do you know of any SR data that might show the effect misalignment is having, or do I just need to be concerned with I/O?
Thank you for the quick responses.
Kara
I don't have any data regarding the net effect of misaligned partitions. You could look at a third-party tool that performs VM alignment, or look at Converter 5.0, although I don't know if it supports aligning EXT partitions.
"I did some quick tests just to confirm the array was responding to the three new SCSI primitives,"
Question: what method did you use to confirm the aforementioned? Did you send the array SCSI commands with http://sourceforge.net/projects/s3-utils/ ?
Also, GREAT blog! Just added it to my RSS feed.
By observation, I confirmed that the commands were working. For example, when creating an EZT VMDK I monitored the SAN switch ports and 3PAR ports for I/O activity, and there was practically none. Same thing when doing a Storage vMotion: no fabric traffic to speak of. ESXTOP can also list the number of hardware locks per second, so I was able to confirm hardware-assisted locking worked as well.
Can anyone comment on what parameters they use for sector-aligning a Windows partition sitting on a 3PAR F400?
My name is Matt and I work for Dell. There are a lot of great comments happening on this post. Thank you so much for the information.
Pity, though, that the UNMAP command has to be turned off in vSphere 5 due to the performance issues it causes. When creating/deleting VMDKs from the vSphere client, the response time of our F400 array goes up into the 300+ms range, depending on the size of the disk created. We have UNMAP turned off at the host side now, which sucks because now I have to manually zero out the free disk space of the datastores when VMs are deleted or Storage vMotioned.
Even more strange is the fact that HP support wasn't able to figure this out; I ended up resolving the issue myself and pointing them to this fact. As of yet there is no ETA on a fix for this issue.
Sources:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007427
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2009330