Fault Tolerance: Diamond in the Rough

One of vSphere 4.0’s most understated features, in my opinion, is Fault Tolerance. I truly see this as a capability of vSphere that goes overlooked by most people (especially those who are focused on the cost of a vSphere deployment…Fault Tolerance is included in the vSphere Advanced and higher packages). Not too long ago, companies paid millions of dollars to achieve a lockstep fault-tolerant solution. Today, with vSphere, you can enable Fault Tolerance on a VM with just the click of your mouse. I want to clarify the key points on Fault Tolerance that most of my clients seem to ask me about. This won’t be a deep technical discussion on Fault Tolerance; that has been covered by others already, and you can find those discussions in the links I have included at the end of this article.

I still find it amazing how often the spectrum of availability solutions gets confused by IT administrators and executives alike. So, first, a brief refresher on this spectrum:

Load Balancing: Multiple running copies of an application, failure may affect end user. Load balancing via a network load balancer is the lowest common denominator for availability. In practice, these solutions are typically used to scale out applications that can’t scale out on their own. Load balancers allow you to run multiple copies of the same stateless (typically REST-based) application. The nature of the client’s connection to the application determines what availability impact a load balancer has. If a failure occurs between a client’s connections, the client should not be affected by the failure. However, if the failure occurs during a client’s connection, the client most likely will be affected in some way, possibly losing their work (REST-style, stateless, short transactions are less affected; non-REST, long-lived connections are more affected). By definition, load balancing will increase the utilization on multiple servers…that’s what it’s designed to do. I spent five years crafting load balancing solutions for clients back in the late ’90s…and yet I still come across confusion here from time to time.
High Availability: Single running copy of an application, failure will affect end users. High Availability simply means that when a failure occurs, the highly available application will start running on another server. In vSphere, this means the environment will power on the VM on another ESX host to minimize downtime for users of the application. Typically, the user will be affected by the failure.
Fault Tolerance: Multiple (typically two) running copies of an application, failure will not affect end user. Fault Tolerance means that you are running two copies of the application in lockstep: whatever instruction gets executed on the primary also gets executed on the secondary. This doubles the resource utilization in your environment, but ensures that a failure has no impact on the end user. When a failure occurs, the IP address of the primary system moves to the secondary system and the user continues doing whatever they were doing, because the secondary system was processing the same instruction as the primary when it failed. By this definition, Fault Tolerance isn’t ideal for every application due to the higher cost of resource utilization. If you’re running at 80% utilization of your VM prior to Fault Tolerance, you will be running two VMs at 80% when Fault Tolerance is turned on.
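
To make the lockstep idea a bit more concrete, here is a toy sketch in Python. It is only an illustration of the concept described above, not how vSphere implements Fault Tolerance: `ToyVM`, `run_lockstep`, and `total_utilization_percent` are names I made up for this example, and the "VM" is just a counter whose state depends only on the inputs it has seen.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical deterministic 'VM': state depends only on the inputs processed."""
    name: str
    counter: int = 0
    log: list = field(default_factory=list)

    def execute(self, value: int) -> None:
        # Deterministic step: the same inputs in the same order give the same state.
        self.counter = self.counter * 31 + value
        self.log.append(value)


def run_lockstep(inputs, fail_after):
    """Feed every input to both copies; the primary fails before input `fail_after`."""
    primary = ToyVM("primary")
    secondary = ToyVM("secondary")
    primary_alive = True
    for i, value in enumerate(inputs):
        if primary_alive and i == fail_after:
            primary_alive = False  # simulated host failure
        if primary_alive:
            primary.execute(value)
        secondary.execute(value)  # the secondary processes the same input stream
    return primary, secondary


def total_utilization_percent(per_vm_percent: int, ft_enabled: bool) -> int:
    """Assumed cost model from the text: FT runs two copies at the same utilization."""
    return per_vm_percent * 2 if ft_enabled else per_vm_percent


if __name__ == "__main__":
    primary, secondary = run_lockstep([3, 1, 4, 1, 5], fail_after=2)
    print("primary processed:", primary.log)
    print("secondary processed:", secondary.log)
    print("utilization with FT:", total_utilization_percent(80, ft_enabled=True))
```

The point of the sketch is that the secondary ends up in exactly the state an uninterrupted run would have reached, which is why the user never notices the failure, and that the price is two copies consuming resources instead of one.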

What makes vSphere’s Fault Tolerance feature a diamond in the rough is that this zero-downtime solution is baked into the virtualization infrastructure that you may already own. For those key applications where zero downtime is valuable, it’s there to be turned on with minimal additional cost. There are some hardware requirements that you need to keep in mind, like an additional network for the FT messages to be passed across (two networks if you want a 100% fault tolerant system) and the right type of processor. But these requirements are similar to those of most comparable solutions.

What makes Fault Tolerance a bit rough is the fact that it only supports virtual machines with a single vCPU. If your application needs multiple vCPUs, you’re out of luck. At least for today. Considering Fault Tolerance is a 1.0 feature, this limitation is understandable. It’s even more understandable when you consider what is happening under the covers to keep the instructions in sync across two VMs. Watch the following video from VMware Principal Engineer Doug Scale for the details:

Now imagine the complexity of trying to track, synchronize, and replay the processing instructions for multiple processors. Going back to the basic Computer Science knowledge from my first year in college, it’s obvious to me that supporting multiple processors is orders of magnitude more challenging than supporting one processor. So you gotta start somewhere!
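
A rough back-of-the-envelope way to see that jump in complexity: with one processor there is a single instruction stream to replay, but with several processors the streams can interleave in many different orders. The Python sketch below simply counts those orderings. It is a deliberately simplified counting model of my own (the function name `possible_interleavings` is hypothetical, and it assumes every interleaving matters, which overstates things since only interactions between processors really need tracking); it is not a description of how VMware approaches the problem.

```python
from math import comb


def possible_interleavings(instructions_per_cpu: int, cpus: int) -> int:
    """Count the ways `cpus` instruction streams, each `instructions_per_cpu`
    long, can be merged into one sequence while keeping each stream's own order."""
    total = 1
    placed = 0
    for _ in range(cpus):
        placed += instructions_per_cpu
        # Choose which of the slots so far belong to the newly added stream.
        total *= comb(placed, instructions_per_cpu)
    return total


if __name__ == "__main__":
    for cpus in (1, 2, 4):
        print(cpus, "CPU(s):", possible_interleavings(10, cpus), "possible orderings")
```

Even in this toy model, a single stream has exactly one ordering while two short streams already have a very large number of them, which gives a feel for why a single-vCPU limit is a sensible place for a 1.0 feature to start.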

Taking all this into consideration, there are multiple applications that my clients are looking at as candidates for Fault Tolerance, from mail servers and messaging servers to custom applications where downtime needs to be avoided. It used to be apps where “downtime needs to be avoided at any cost”, but with vSphere Fault Tolerance it has become more like “avoided at a little cost”.

What apps do you have that you’re considering Fault Tolerance for? Tell me about them by leaving a comment.

Additional Resources on vSphere Fault Tolerance

Training Lab simulator for vSphere Fault Tolerance (FT) (via VMWARE INFO) – “See” FT in action thanks to this simulation created by the VMware Training team.
Check ESX CPU And VM OS Requirments for vSphere Fault Tolerance (via VM /ETC) – Make sure you meet the hardware requirements for FT…or use the new SiteSurvey utility from VMware, which checks for Fault Tolerance compatibility.
VMware Fault Tolerance at your home-lab (via Eric Sloof) – after reading the last two, find out how to set this up for testing at home…
vSphere Availability Guide (pdf) – for the full skinny on Fault Tolerance and HA in vSphere.
How does Fault Tolerance prevent a split brain scenario? (via The VMguy) – Understand what happens when failure does occur.
VMware engineers caution IT pros: Use Fault Tolerance sparingly (via SearchServerVirtualization) – a bit of a misleading title, but it reiterates my comments above: FT is different from HA, has specific use cases, and does use additional resources.
Don’t forget to search the VMware Knowledge Base for the latest articles on Fault Tolerance

Comments

William Lam says

July 6, 2009 at 9:22 pm

This is another good link to add to the resources for FT: http://www.yellow-bricks.com/2009/07/01/ftcli-pl-vmware-fault-tolerant-managment/

latoga Reply:
July 7th, 2009 at 8:32 am
Thanks for sharing your script William. Great addition to be able to manage FT from the command line!

John Troyer says

July 6, 2009 at 10:29 pm

What expanded my consciousness during our podcast with the FT team was the notion that FT was so easy to use that you could even just turn it on for a few weeks during the holiday or end of the quarter rush and then turn it off again later.

The fact that FT isn’t a multi-million dollar solution and instead is a feature that can be turned on with a flick of the mouse really opens up the use cases — could be quite a game-changing tool in your toolbox.

latoga Reply:
July 7th, 2009 at 8:38 am
This is a great point, John. I hadn’t considered the option of turning on FT just for a busy peak period for an application. Many of my clients are considering building dedicated environments targeted at internal users who require higher availability (e.g., SRM-enabled clusters, FT-specific sub-clusters). The ability to vMotion a VM to an FT cluster and turn FT on for a month when it is really needed allows an IT organization to be more responsive to the needs of their customers.

Now, to enable this for self service and include a way to calculate a charge for this via vCenter ChargeBack! 🙂

Nick Bond says

July 7, 2009 at 3:44 am

Nice feature, but surely the cost (in terms of resources) is too high?

It must be more practical to run two instances of the application (which is in effect what you are doing with FT), and actually use them by adding a load balancer into the infrastructure. Particularly if you need to then scale the application, two running instances with FT means four lots of resource…

Load balancing especially makes sense when you can use a Virtual Appliance version.

Nick

latoga Reply:
July 7th, 2009 at 8:44 am
The use cases for Load Balancing versus Fault Tolerance are very different, Nick. Typically, you wouldn’t use both. As I described, if your application requires zero downtime, the additional resources you are using are the price of that zero downtime, and load balancing cannot guarantee zero downtime. This zero downtime requirement is typically for applications that are critical to a business’s operations. The whole point here is that FT makes zero downtime cost a lot less than it did previously!

Nick Bond Reply:
July 8th, 2009 at 5:49 am
Interruptions to service for a subset of your users are different from application downtime, at least in my mind.

Any modern online application that cannot handle the interruption of a connection etc. is poorly architected.

I would suggest that you would have to use both (or just load balancing) for true full availability. Geographic separation of servers is essential to keep systems running in case of a catastrophic failure (fire, aircraft strike, etc.). Therefore global load balancing would be needed. I suspect running FT across multiple locations would be fraught with problems (if not impossible)?

Nick
