- HA/DRS solves two problems: protection from unplanned downtime, and load balancing and defragmentation of resources.
- A large number of environments have HA/DRS settings configured incorrectly.
- Mistake #1: Not planning for HW evolution.
- vMotion requires similar processors.
- Always set EVC mode
- Mistake #2: Not planning for svMotion
- VMs cannot have snapshots
- VM disks must be persistent mode or RDMs
- Host must have sufficient resources to support two instances of the VMs running concurrently.
- Must be licensed and correctly configured with vMotion
- Host must have access to both source and target datastores
- Mistake #3: Not Enough Cluster Hosts
- HA failover requires additional "wasted" hardware resources
- Must plan for cluster reserve
- A fully prepared cluster must set aside one full server's worth of resources in preparation for HA.
- Enable admission control - Super important to enable. Will disallow starting VMs when resources are exhausted.
- Set host failures the cluster tolerates to 1 (or more). Ensures you always have at least one host's worth of resources available.
- Mistake #4: Setting Host Failures the Cluster Tolerates to 1
- Not all your VMs are priority one
- Some VMs can stay down if a host dies
- Can set the % to less than one server's worth of resources, since not all VMs need to restart if a host fails.
- Mistake #5: Forgetting to Prioritize VM restart
- VM restart priority is one of those oft-forgotten settings
- Comes into play when the Percentage policy is enabled
- Restart policy is per-host
- Per-VM settings must be configured for each VM
- This can create a problem down the road, as VMs may restart in the wrong order
- Mistake #6: Disabling Admission Control
- Many young admins may turn it off and forget about it
- Never disable admission control!!!
- Mistake #7: Not updating Percentage Policy
- Needs to be adjusted as your cluster size changes
- Host failures the cluster tolerates needs no adjusting
- Mistake #8: Buying (the occasional) Big Server
- Host failures the cluster tolerates sets aside the amount of resources needed to protect every server.
- It must set aside resources equal to your biggest server in the cluster
- Mistake #9: Neglecting Host Isolation Response
- Current recommendation is to leave powered on
- On converged networks you may not want to use "leave powered on"
- Heartbeat datastores - Adds redundancy
- Mistake #10: Assuming that Datastore heartbeats Prevent isolation Events
- Master determines the state of the unresponsive host
- Isolation response is triggered by the slave
- Mistake #11: Confusing your APD with your PDL
- An All Paths Down (APD) scenario exists when all communication is severed between host and device
- I/O is then queued until a SCSI response code officially reports the link is down
- This can lead to infinite queuing of device I/O
- A Permanent Device Loss (PDL) scenario exists when the host can see the device target but the target isn't listening
- Lets the host recognize that the device is gone and fail the I/O instead of queuing it
- APD is a more common scenario and APD will not trigger vSphere HA
- Look at new settings in 5.0 U1 and 5.1 - Most handy for metro clusters
- Mistake #12: Overdoing Reservations, limits, and Affinities
- HA may not consider these "soft affinities" at failover
- Consider using shares over reservations and limits
- Less impact on DRS and thus HA
- Mistake #13: Considering Using Shares without Considering using Shares
- Shares are only considered during periods of contention
- But setting shares on resource pools can have unexpected results
- Don't treat resource pools like folders
- Mistake #14: Doing memory limits at all
- Don't assign memory limits. Ever.
- Limit memory as close to the application as possible (such as in the SQL app)
- Mistake #15: Thinking you are smarter than DRS
- Not using fully automated mode
- Mistake #16: Not understanding DRS' Equations
- Every 5 minutes a DRS interval is invoked
- Takes into account VM entitlements, host capacity
- Mistake #17: Being too liberal with your migration threshold
- Priority 1 recommendations are mandatory
- Mistake #18: Combining VDI and Server Workloads in the same cluster
- ESXi hosts running VDI workloads tend to experience more load than hosts running server workloads.
- VDI forces DRS to work harder and more often
- Create separate clusters for VDI and everything else
- Mistake #19: Planning on Overcommit
- Overcommit creates extra work for the hypervisor
- Assign the right amount of memory to your VMs
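
Mistakes #3, #7 and #8 all come down to the same arithmetic: how much of the cluster has to sit idle so that the VMs on a failed host have somewhere to restart. The sketch below is not how vSphere HA computes anything internally; it is a back-of-the-envelope helper, with a hypothetical function name and made-up host sizes, that looks only at memory and assumes you want to cover the loss of the largest host(s).

```python
def reserve_percentage(host_capacities_gb, failures_to_tolerate=1):
    """Percent of total cluster capacity that must stay free so the
    largest `failures_to_tolerate` hosts can fail and still be covered.

    Hypothetical helper for illustration only; capacities are memory in GB.
    """
    if not 0 < failures_to_tolerate < len(host_capacities_gb):
        raise ValueError("need at least one surviving host")
    total = sum(host_capacities_gb)
    # Worst case: the biggest hosts are the ones that fail.
    largest = sorted(host_capacities_gb, reverse=True)[:failures_to_tolerate]
    return 100.0 * sum(largest) / total


# Four identical 96 GB hosts: one host is a quarter of the cluster.
four_hosts = reserve_percentage([96, 96, 96, 96])

# Add a fifth identical host: the percentage must be updated (Mistake #7).
five_hosts = reserve_percentage([96, 96, 96, 96, 96])

# Add one 256 GB server instead: the reserve jumps (Mistake #8).
big_server = reserve_percentage([96, 96, 96, 96, 256])

print(four_hosts, five_hosts, big_server)
```

With these made-up sizes the reserve is 25% for four equal hosts, drops to 20% when a fifth equal host is added, and climbs to 40% when the fifth host is a single big server. As Mistake #4 notes, the percentage you actually configure can be lower than this if some VMs are allowed to stay down after a host failure.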
Monday, August 27, 2012
VMworld 2012: Avoiding 19 Biggest HA & DRS Mistakes INF-VSP1232
The speaker was Greg Shields, a partner at Concentrated Technology. This session focused on the main HA/DRS mistakes that people make when virtualizing their infrastructure. Greg is a great speaker and has also presented sessions at TechEd and other conferences. HA/DRS settings are so easy to set and forget, or forget to set, that everyone should review all 19 mistakes and make sure they aren't making them in their own environment.