Devin’s Load Balancer for Exchange 2010

Overview

One of the biggest differences I’m seeing when deploying Exchange 2010 compared to previous versions is that for just about all of my customers, load balancing is becoming a critical part of the process. In Exchange 2003 FE/BE, load balancing was a luxury, unheard of for all but the largest organizations with the deepest pockets. Only a handful of outfits offered load balancing products, and they were expensive. With Exchange 2007 and the dedicated CAS role, it started becoming more common.

For Exchange 2003 and 2007, you could get all the same benefits of load balancing (as far as Exchange was concerned) by deploying an ISA server or ISA server cluster using Windows Network Load Balancing (WNLB). ISA included the concept of a “web farm” so it would round-robin incoming HTTP connections to your available FE servers (and Exchange 2007 CAS servers). Generally, your internal clients would directly talk to their mailbox servers, so this worked well. Hardware load balancers were typically used as a replacement for publishing with an ISA reverse proxy (and more rarely to load balance the ISA array instead of WNLB). Load balancers could perform SSL offloading, pre-authentication, and many of the same tasks people were formerly using ISA for. Some small shops deployed WNLB for Exchange 2003 FEs and Exchange 2007 CAS roles.

In Exchange 2010, everything changes. Outlook RPC connections now go to the CAS servers in the site, not to the MB server that hosts the active copy of the database. Mailbox databases now have an affiliation with either a specific CAS server or a site-specific RPC client access array, which you can see in the RpcClientAccessServer property returned by the Get-MailboxDatabase cmdlet (and change with the corresponding parameter of Set-MailboxDatabase). If you have two or more servers, I recommend you set up the RPC client access array as part of the initial deployment and get some sort of load balancer in place.
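
For example, a quick EMS check shows which endpoint each database is currently affiliated with (a minimal sketch; the names in the output will be whatever your environment uses):

    # Show each database and the CAS server or array it points at
    Get-MailboxDatabase | Format-Table Name, RpcClientAccessServer -AutoSize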

Load Balancing Options

At Trace3, we’re an F5 reseller, and F5 is one of the few load balancer companies out there that has really made an effort to understand and optimize Exchange 2010 deployments. However, I’m not on the sales side; I have customers using a variety of load balancing solutions for their Exchange deployments. At the end of the day, we want the customer to do what’s right for them. For some customers, that’s an F5. Others require a different solution. In those cases, we have to get creative: sometimes they don’t have the budget, sometimes the networking team has their own plans, and on some rare occasions, the plans we made going in turn out not to be a good fit after all, and we have to come up with something on the fly.

If you’re not in a position to use a high-end hardware load balancer like an F5 BIG-IP or a Cisco ACE solution, and can’t look at some of the lower-cost (and correspondingly lower-feature) solutions that are now on the market, there are a few alternatives:

  • WNLB. To be honest, I have attempted to use this in several environments now, and even when I spent time up front going over the pros and cons, it has failed to meet expectations every time. If you’re virtualizing Exchange (as many of my customers are) and are trying to avoid single points of failure, WNLB is clearly not the way to go. I no longer recommend this to my customers.
  • DNS round robin. This method at least has the advantage of driving traffic, in theory, to all of the CAS instances. In practice, though, it gets in the way of quickly resolving problems: clients cache the DNS responses, so when a CAS instance goes down, the clients that resolved to it keep trying the dead address until their cached records expire. It’s better than nothing, but not by much.
  • DAG cluster IP. Some clever people came up with this option for deployments that put multi-role MB+HT+CAS servers in a DAG. A DAG is built on a cluster, these smart people reason, and clusters have a cluster IP address. Why can’t we just use that as the IP address of the RPC client access array? Sure enough, this works (see the sketch below), but it’s not tested or supported by Microsoft, and it isn’t a perfect solution. It’s not load balancing at all; the server holding the cluster IP address gets all the CAS traffic. Server sizing is important!
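
If you want to see what IP address the DAG’s underlying cluster actually holds before trying this, a quick EMS peek works. To be clear, this is a sketch of that unsupported configuration, and "DAG1" plus all the DNS details are placeholders:

    # List the DAG's IP-related properties (hypothetical DAG name)
    Get-DatabaseAvailabilityGroup "DAG1" | Format-List Name, *Ip*

    # Then point the array FQDN's A record at the cluster IP
    # (placeholder DNS server, zone, and address)
    dnscmd dc01.contoso.com /recordadd contoso.com outlook A 10.0.0.50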

The fact of the matter is, there are no great alternatives if you’re not going to use hardware load balancing. You’re going to have to compromise something.

Introducing Devin’s Load Balancer

For many of my customers, the situation ends up looking something like this:

  • The CAS/HT roles are co-located on one set of servers, while MB (and the DAG) is on another. This rules out the DAG cluster IP option.
  • They don’t want users to complain excessively when something goes wrong with one of the CAS/HT servers. This rules out DNS round robin.
  • They don’t have the budget for a hardware solution yet, or one is already in the works but won’t be ready in time. They need a temporary, low-impact solution. This effectively rules out WNLB.

I’ve come up with a quick and dirty fix I call Devin’s Load Balancer, or the DLB for short. It looks like this:

  1. Pick one CAS server that can handle all the traffic for the site. This is our target server.
  2. Pick an IP address for the RPC client access array for the site. Create the DNS A record for the RPC client access array FQDN, pointing to the IP address.
  3. Create the RPC client access array in EMS, setting the name, FQDN, and site (the full sequence is sketched out after this list).
  4. On the main network interface of the target server, add the IP address. If this IP address is on the same subnet as the main IP address, there is no need to create a secondary interface! Just add it as a secondary IP address/subnet mask.
  5. Make sure the appropriate mailbox databases are associated with the RPC client access array.
  6. Optionally, point the internal HTTP load balance array DNS A record to this IP address as well (or publish this IP address using ISA).
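
Here’s roughly what steps 2 through 5 look like at the console. This is a sketch only; the array FQDN (outlook.contoso.com), the 10.0.0.50/255.255.255.0 address, the site name, the DNS server, and the interface name are all assumptions you’d replace with your own values:

    # Step 2: create the A record for the array FQDN on your DNS server
    dnscmd dc01.contoso.com /recordadd contoso.com outlook A 10.0.0.50

    # Step 3: create the RPC client access array in EMS
    New-ClientAccessArray -Name "outlook.contoso.com" -Fqdn "outlook.contoso.com" -Site "Default-First-Site-Name"

    # Step 4: on the target CAS, add the array IP as a secondary address on the main NIC
    netsh interface ipv4 add address "Local Area Connection" 10.0.0.50 255.255.255.0

    # Step 5: associate the mailbox databases with the array
    # (this points every database at the array; scope it down if you need to)
    Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer "outlook.contoso.com"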

You may have noticed that this sends all traffic to the target server; it doesn’t really load balance. DLB also stands for Doesn’t Load Balance!

This configuration, despite its flaws, gives me what I believe are several important benefits:

  • It’s extremely easy to switch over or fail over. If something happens to my target server, I simply add the RPC client access array IP address as a secondary IP address on my next CAS instance (the commands appear after this list). There are no DNS cache entries to wait out, no DNS records to update, and no switch configurations to modify. On a planned switchover, clients are briefly disrupted but can immediately reconnect. I can make the change as soon as I get warning that something happened, and my clients reconnect without any further action on their part.
  • It isolates the array from whatever I do with the other CAS instances. Windows and Exchange have no idea they’re part of a load-balanced pseudo-configuration. Compare that to WNLB, where any change to the LB cluster (like adding or removing a member) drops all connections to the cluster IP address!
  • It makes it very easy to upgrade to a true load balancing solution. I set the true solution up in parallel with an alternate, temporary IP address. I use local HOSTS file entries on my test machines while I’m getting everything tested and validated. And then I simply take the RPC client access array IP address off the target server and put it on the load balancer. Existing connections are dropped, but new ones immediately connect with no timeouts – and now we’re really load balancing.
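
The switchover itself is just moving that secondary IP address, using the same placeholder interface name and addresses as in the earlier sketch:

    # On the old target (if it's still reachable), release the array IP
    netsh interface ipv4 delete address "Local Area Connection" 10.0.0.50

    # On the new target CAS (or, later, the real load balancer), claim it
    netsh interface ipv4 add address "Local Area Connection" 10.0.0.50 255.255.255.0

And for validating the parallel solution before the cutover, a HOSTS entry on a test machine does the trick (10.0.0.60 is a placeholder for the load balancer’s temporary VIP):

    # C:\Windows\System32\drivers\etc\hosts on a test workstation
    10.0.0.60    outlook.contoso.com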

Note that the CAS SSL certificate does not need to contain the FQDN of the RPC client access array as a SAN entry. The array endpoint is straight RPC, not HTTP; it relies on RPC-level encryption rather than SSL, so the certificate never comes into play.

Even in a deployment where the customer is putting all roles on a single server, if there’s any thought at all that they might want to expand to an HA configuration in the future, I’m now in the habit of configuring this. The RPC client access array is in place and somewhat isolated from the CAS configuration from day one, so future upgrades are easier and less disruptive.