Devin’s Load Balancer for Exchange 2010

Overview

One of the biggest differences I’m seeing when deploying Exchange 2010 compared to previous versions is that for just about all of my customers, load balancing is becoming a critical part of the process. In Exchange 2003 FE/BE topologies, load balancing was a luxury, unheard of for all but the largest organizations with the deepest pockets. Only a handful of outfits offered load balancing products, and they were expensive. With Exchange 2007 and the dedicated CAS role, load balancing started becoming more common.

For Exchange 2003 and 2007, you could get all the same benefits of load balancing (as far as Exchange was concerned) by deploying an ISA server or ISA server cluster using Windows Network Load Balancing (WNLB). ISA included the concept of a “web farm” so it would round-robin incoming HTTP connections to your available FE servers (and Exchange 2007 CAS servers). Generally, your internal clients would talk directly to their mailbox servers, so this worked well. Hardware load balancers were typically used as a replacement for publishing with an ISA reverse proxy (and more rarely to load balance the ISA array instead of WNLB). Load balancers could perform SSL offloading, pre-authentication, and many of the same tasks people were formerly using ISA for. Some small shops deployed WNLB for Exchange 2003 FEs and Exchange 2007 CAS roles.

In Exchange 2010, everything changes. Outlook RPC connections now go to the CAS servers in the site, not to the MB server that hosts the active copy of the database. Mailbox databases now have an affiliation with either a specific CAS server or a site-specific RPC client access array, which you can see in the RpcClientAccessServer property returned by the Get-MailboxDatabase cmdlet (and change via the matching parameter on Set-MailboxDatabase). If you have two or more servers, I recommend you set up the RPC client access array as part of the initial deployment and get some sort of load balancer in place.
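
For example, you can check each database’s affiliation with a quick EMS one-liner (the output, of course, depends on your environment):

    # List every mailbox database and the CAS server or array it points to
    Get-MailboxDatabase | Format-Table Name, RpcClientAccessServer -AutoSize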

Load Balancing Options

At Trace3, we’re an F5 reseller, and F5 is one of the few load balancer companies out there that has really made an effort to understand and optimize Exchange 2010 deployments. However, I’m not on the sales side; I have customers using a variety of load balancing solutions for their Exchange deployments. At the end of the day, we want the customer to do what’s right for them. For some customers, that’s an F5. Others require a different solution. In those cases, we have to get creative – sometimes they don’t have budget, sometimes the networking team has their own plans, and on some rare occasions, the plans we made going in turned out not to be a good fit after all and now we have to come up with something on the fly.

If you’re not in a position to use a high-end hardware load balancer like an F5 BIG-IP or a Cisco ACE solution, and can’t look at some of the lower-cost (and correspondingly lower-feature) solutions that are now on the market, there are only a few alternatives:

  • WNLB. To be honest, I have attempted to use this in several environments now, and even when I spent time going over the pros and cons up front, it failed to meet expectations. If you’re virtualizing Exchange (as many of my customers are) and trying to avoid single points of failure, WNLB is clearly not the way to go. I no longer recommend it to my customers.
  • DNS round robin. This method at least has the advantage of, in theory, driving traffic to all of the CAS instances. In practice, though, it gets in the way of quickly resolving problems when they come up: clients cache the resolved address, so when one CAS instance goes down, they keep trying it until the cached record expires. It’s better than nothing, but not by much.
  • DAG cluster IP. Some clever people came up with this option for deployments where every server is a multi-role MB+HT+CAS box configured in a DAG. DAG = cluster, these smart people think, and clusters have a cluster IP address. Why can’t we just use that as the IP address of the RPC client access array? Sure enough, this works (you can see a DAG’s cluster address from EMS, as shown below), but it’s not tested or supported by Microsoft, and it isn’t a perfect solution. It’s not load balancing at all; the server holding the cluster IP address gets all the CAS traffic. Server sizing is important!
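
If you’re curious which address a DAG is holding, you can inspect it from EMS; the DAG name here is just a placeholder:

    # Show the cluster IP address(es) assigned to the DAG
    Get-DatabaseAvailabilityGroup DAG01 -Status | Format-List Name, DatabaseAvailabilityGroupIpv4Addresses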

The fact of the matter is, there are no great alternatives if you’re not going to use hardware load balancing. You’re going to have to compromise something.

Introducing Devin’s Load Balancer

For many of my customers, the environment ends up looking something like this:

  • The CAS/HT roles are co-located on one set of servers, while MB (and the DAG) is on another. This rules out the DAG cluster IP option.
  • They don’t want users to complain excessively when something goes wrong with one of the CAS/HT servers. This rules out DNS round robin.
  • They don’t have the budget for a hardware solution yet, or one is already in the works but isn’t ready yet because of scheduling. They need a temporary, low-impact solution. This effectively rules out WNLB.

I’ve come up with a quick and dirty fix I call Devin’s Load Balancer, or DLB for short. It looks like this:

  1. Pick one CAS server that can handle all the traffic for the site. This is our target server.
  2. Pick an IP address for the RPC client access array for the site. Create the DNS A record for the RPC client access array FQDN, pointing to the IP address.
  3. Create the RPC client access array in EMS, setting the name, FQDN, and site (see the sketch following this list).
  4. On the main network interface of the target server, add the IP address. If this IP address is on the same subnet as the main IP address, there is no need to create a secondary interface! Just add it as a secondary IP address/subnet mask.
  5. Make sure the appropriate mailbox databases are associated with the RPC client access array.
  6. Optionally, point the internal HTTP load balance array DNS A record to this IP address as well (or publish this IP address using ISA).
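
Here’s roughly what steps 2 through 5 might look like in practice. Every name, address, and site below is a placeholder for illustration, so substitute your own values:

    # Step 2 (on your DNS server): create the A record for the array FQDN
    dnscmd . /RecordAdd contoso.com outlook A 10.0.0.50

    # Step 3 (EMS): create the RPC client access array
    New-ClientAccessArray -Name "outlook" -Fqdn "outlook.contoso.com" -Site "Default-First-Site-Name"

    # Step 4 (on the target CAS server): add the array IP as a secondary address
    netsh interface ipv4 add address "Local Area Connection" 10.0.0.50 255.255.255.0

    # Step 5 (EMS): associate the mailbox databases with the array
    # (filter to the appropriate site's databases as needed)
    Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer "outlook.contoso.com"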

You may have noticed that this sends all traffic to the target server; it doesn’t really load balance. DLB also stands for Doesn’t Load Balance!

This configuration, despite its flaws, gives me what I believe are several important benefits:

  • It’s extremely easy to switch over or fail over. If something happens to my target server, I simply add the RPC client access array IP address as a secondary IP address on my next CAS instance (see the sketch after this list). There are no DNS cache entries to wait on, no switch configurations to modify, and no DNS records to update. If this is a planned switchover, clients are briefly disrupted but can immediately reconnect. I can make the change as soon as I get warning that something happened, and my clients can reconnect without any further action on their part.
  • It isolates what I do with the other CAS instances. Windows and Exchange have no clue they’re in a load-balanced pseudo-configuration. With WNLB, if I make any changes to the LB cluster (like adding or removing a member), all connections to the cluster IP addresses are dropped!
  • It makes it very easy to upgrade to a true load balancing solution. I set the true solution up in parallel with an alternate, temporary IP address. I use local HOSTS file entries on my test machines while I’m getting everything tested and validated. And then I simply take the RPC client access array IP address off the target server and put it on the load balancer. Existing connections are dropped, but new ones immediately connect with no timeouts – and now we’re really load balancing.
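
As an illustration, moving the array address during a planned switchover (or the final cutover to a real load balancer) is just a pair of commands; the interface name and address here are hypothetical:

    # On the old target server: release the array IP
    netsh interface ipv4 delete address "Local Area Connection" 10.0.0.50

    # On the new target server (or configure the same VIP on the load balancer)
    netsh interface ipv4 add address "Local Area Connection" 10.0.0.50 255.255.255.0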

Note that the CAS SSL certificate does not need to contain the FQDN of the RPC client access array as a SAN entry. Outlook’s MAPI connections to the array use RPC-level encryption, not SSL (RPC isn’t based on HTTP).

Even in a deployment where the customer is putting all roles into a single-server configuration, if there’s any thought at all that they might want to expand to an HA configuration in the future, I am now in the habit of configuring this. The RPC client access array is configured and somewhat isolated from the CAS configuration from day one, so future upgrades are easier and less disruptive.


13 thoughts on “Devin’s Load Balancer for Exchange 2010”

  1. Mark: I’ve not run into any issues with this configuration without the hotfix to date, but it’s only been used in smaller (typically 2-server) environments. Much past 500 mailboxes and they can usually afford *some* load balancer like the Kemp products. This is typically a temporary solution to the problem.

    Having said that, the KBs made me remember that you can script the provisioning of the additional IP address, so it shouldn’t be hard at all for a clever admin to come up with a tiny little bit of script to automatically move the DLB IP address over to another server if the DLB address fails to respond to a ping after a set timeout…
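
    For instance, a minimal watchdog sketch might look like the following. Everything here (names, addresses, interface label) is hypothetical, and it assumes PowerShell remoting is enabled on the standby server:

      # Hypothetical example: move the DLB IP to a standby CAS server if it stops answering pings
      $dlbIp   = "10.0.0.50"                # RPC client access array IP (placeholder)
      $mask    = "255.255.255.0"
      $standby = "CAS02"                    # server that should take over (placeholder)
      $nic     = "Local Area Connection"

      while ($true) {
          # Treat the DLB address as down only after five consecutive failed pings
          if (-not (Test-Connection -ComputerName $dlbIp -Count 5 -Quiet)) {
              # Add the array IP as a secondary address on the standby server
              Invoke-Command -ComputerName $standby -ScriptBlock {
                  param($nic, $ip, $mask)
                  netsh interface ipv4 add address $nic $ip $mask
              } -ArgumentList $nic, $dlbIp, $mask
              break
          }
          Start-Sleep -Seconds 30
      }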

  2. Sweet! I love that I don’t have to do the work of coming up with the scripting…although that’s not how I’d approach it, it seems to work nicely.

  3. Do you think this method would work on cas/hub/mbx servers that are in a DAG?
    Running on VMware? I am thinking that adding the additional IP would cause Windows to reset networking and cause the database to fail over to another node?

    Mark

  4. Mark: I’ve used this under VMware without any issues. This isn’t like Windows NLB, where adding a host to the NLB cluster or changing the cluster VIP address makes the whole cluster reset all network connections. It’s just adding a secondary IP address to one of the hosts, a basic Windows-level operation that doesn’t reset the interface.

  5. Good collection of thoughts and information.

    One thing I’d like to add: if you are introducing a new HLB into the network just for Exchange, plan the Exchange server IP routing scheme ahead of time. Routed versus transparent mode through the HLB makes a huge difference for RPC connections.

  6. @Devin… I don’t agree with your solution, which contradicts your own statements about the DAG cluster IP and the DLB. You said we should not use the DAG cluster IP because it does not offer load balancing. Well, I am in the support business, and customers want a high availability solution without a big hole in their pocket. Above all, customers don’t know what is being set up for them. I have implemented a two-node high availability solution with multi-role servers, using the DAG cluster IP as the RPC client access array. The solution works perfectly; it does not offer load balancing for the CAS role, but it does offer high availability, which is more than what the customer wanted. The customer was running a single Exchange 2007 node with 350 mailboxes hosted on old hardware. At least now we’ve got faster hardware to serve the RPC clients.

  7. Hi Devin – I found your article after researching an issue we recently encountered with our new Exchange 2010 deployment. We have 3 CAS servers load-balanced by a pair of F5s. The issue we ran into involved a failure on the primary F5, causing the CAS servers to behave unpredictably. We only got out of it by shutting all the CAS servers off, then rebooting them one at a time. Prior efforts to reboot them one at a time didn’t resolve the connectivity issue.

    Our other load-balanced applications didn’t have any issues with the F5 primary-to-secondary failover.

    Any thoughts?
