From Whence Redundancy? Exchange 2010 Storage Essays, part 1

Updated 4/13 with improved reseed time data provided by item #4 in the Top 10 Exchange Storage Myths blog post from the Exchange team.

Over the next couple of months, I’d like to slowly sketch out some of the thoughts and impressions that I’ve been gathering about Exchange 2010 storage over the last year or so and combine them with the specific insights that I’m gaining at my new job. In this inaugural post, I want to tackle what I have come to view as the fundamental question that will drive the heart of your Exchange 2010 storage strategy: will you use a RAID configuration or will you use a JBOD configuration?

In the interests of full disclosure, the company I work for now is a strong NetApp reseller, so of course my work environment is conducive to designing Exchange in ways that make it easy to sell the strengths of NetApp kit. However, part of the reason I picked this job is precisely because I agree with how they address Exchange storage and how I think the Exchange storage paradigm is going to shake out in the next 3-5 years as more people start deploying Exchange 2010.

In Exchange 2010, Microsoft re-designed the Exchange storage system to target what we can now consider to be the lowest common denominator of server storage: a directly attached storage (DAS) array of 7200 RPM SATA disks in a Just a Box of Disks (JBOD) configuration. This DAS/JBOD/SATA configuration (what I will now call DJS) has been unworkable for Exchange for almost its entire lifetime:

  • The DAS piece certainly worked for the initial versions of Exchange; that’s what almost all storage was back then. Big centralized SANs weren’t part of the commodity IT server world, reserved instead for the mainframe world. Server administrators managed server storage. The question was what kind of bus you used to attach the array to the server. However, as Exchange moved to clustering, it required some sort of shared storage. While a shared SCSI bus was possible, it not only felt like a hack, but also didn’t scale well beyond two nodes.
  • SATA, of course, wasn’t around back in 1996; you had either IDE or SCSI. SCSI was the serious server administrator’s choice, providing better I/O performance for server applications as well as faster bus speeds. SATA and its big brother SAS are both derived from the lessons that years of SCSI deployments have provided. Even for Exchange 2007, though, SATA’s poor random I/O performance made it unsuitable for Exchange storage; you had to use either SAS or FC drives.
  • RAID has historically been a requirement for Exchange deployments for two reasons: to combine enough drive spindles for acceptable I/O performance (back when disks were smaller than mailbox databases), and to ensure basic data redundancy. Redundancy became especially important once Exchange began supporting shared storage clustering, which demanded both the aggregate I/O performance only achievable with expensive disks and interfaces and a reduced chance of a storage failure becoming a single point of failure.

If you look at the marketing material for Exchange 2010, you would certainly be forgiven for thinking that DJS is the only smart way to deploy Exchange 2010, with SAN, RAID, and non-SATA systems supported only for those companies caught in the mire of legacy deployments. However, this isn’t at all true. There are a growing number of Exchange experts (and not just those of us who either work for storage vendors or resell their products) who think that while DJS is certainly an interesting option, it’s not one that’s a good match for every customer.

In order to understand why DJS is truly possible in Exchange 2010, and more importantly to understand where DJS configurations are a good fit and what underlying conditions and assumptions you need to meet in order to get the most value from DJS, we need to pull these three dimensions apart and discuss each of them on its own.

JBOD vs RAID

While I will go into more detail on all three dimensions at a later date, I want to focus on the JBOD vs. RAID question now. If you need some summaries, then check out fellow Exchange MVP (and NetApp consultant) John Fullbright’s post on the economics of DAS vs. SAN, as well as Microsoft’s Matt Gossage and his TechEd 2009 session on Exchange 2010 storage. Although there are good arguments for diving into drive technology or storage connection debates, I’ve come to believe that the central philosophical question you must answer in your Exchange 2010 design is at what level you will keep your data redundant. Until Exchange 2007, you had only one option: keeping your data redundant at the disk controller level. Using RAID technologies, you had two copies of your data[1]. Because you had a second copy of the data, shared storage clustering solutions could be used to provide availability for the mailbox service.

With Exchange 2007’s continuous replication features, you could add data redundancy at the application level and avoid the dependency on shared storage; CCR creates two copies, and SCR can be used to create one or more additional copies off-site. However, given the realities of Exchange storage, for all but the smallest deployments you had to use RAID to provide the required number of disk spindles for performance. With CCR, this really meant you were creating four copies; with SCR, you were creating an additional two copies for each target replica you created.

This is where Exchange 2010 throws a wrench into the works. By virtue of a re-architected storage engine, it’s possible under specific circumstances to design a mailbox database that will fit on a single drive while still providing acceptable performance. The reworked continuous replication options, now simplified into the DAG functionality, create additional copies at the application level. If you hit that sweet spot of the 1:1 database to disk ratio, then you only have a single copy of the data per replica and can get an n-1 level of redundancy, where n is the number of replicas you have. This is clearly far more efficient for disk usage… or is it? The full answer is complex; the simple answer is, “In some cases.”
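
To make the copy arithmetic concrete, here’s a minimal sketch in Python. The replica counts and the assumption that a mirrored RAID set holds two physical copies per replica are illustrative only, not a statement about any particular design:

```python
# Rough copy-count arithmetic for the replication scenarios discussed above.
# Assumption: a mirrored RAID set (RAID-1/RAID-10) holds 2 physical copies per
# replica, while a JBOD replica holds 1. Replica counts are illustrative.

def physical_copies(replicas: int, copies_per_replica: int) -> int:
    """Total physical copies of a mailbox database across all replicas."""
    return replicas * copies_per_replica

# Exchange 2007-style CCR (2 replicas) plus one SCR target, each on mirrored RAID:
ccr_scr_on_raid = physical_copies(replicas=3, copies_per_replica=2)   # 6 copies

# Exchange 2010 DAG with 4 replicas (2 local, 2 DR), 1:1 database-to-disk JBOD:
dag_on_jbod = physical_copies(replicas=4, copies_per_replica=1)       # 4 copies

# A JBOD DAG with n replicas tolerates the loss of n - 1 replicas.
print(f"CCR+SCR on RAID: {ccr_scr_on_raid} physical copies")
print(f"4-member DAG on JBOD: {dag_on_jbod} physical copies, survives {4 - 1} replica losses")
```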

In order to get the 1:1 database to disk ratio, you have to follow several guidelines:

  1. Have at least three replicas of the database in the DAG, regardless of which sites they are in. Doing so allows you to place both the EDB and transaction log files on the same physical drive, rather than separating them as you did in previous versions of Exchange.
  2. Ensure that you have at least two replicas per site. The reason for this is that unlike Exchange 2007, Exchange 2010 can reseed a failed replica from another passive copy. This allows you to avoid reseeding over your WAN, which is something you do not want to do.
  3. Size your mailbox databases to include no more users than will fit in the drive’s performance envelope (see the sizing sketch just after this list). Although Exchange 2010 converts many of the random I/O patterns to sequential, giving better performance, not all of them have been converted, so you still have to plan against the random I/O specs.
  4. Ensure that write transactions can get written successfully to disk. Use a battery-backed caching controller for your storage array to ensure the best possible performance from the disks. Use write caching for the physical disks, which means ensuring each server hosting a replica has a UPS.
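
As a rough illustration of guideline #3, here’s a back-of-the-envelope sizing sketch in Python. Every input value is a placeholder assumption – swap in the numbers from your own profiling or from the Exchange mailbox sizing calculator before trusting the output:

```python
import math

# --- Placeholder inputs; substitute your own measured/calculated values ---
disk_random_iops = 80        # rough envelope for a single 7200 RPM SATA drive (assumed)
iops_per_mailbox = 0.12      # per-user IOPS profile from your own sizing data (assumed)
io_overhead_factor = 1.2     # headroom for background maintenance and reseeds (assumed)

disk_capacity_gb = 2000      # raw size of a 2TB drive
capacity_overhead = 0.30     # reserve for content index, logs, and free space (assumed)
mailbox_quota_gb = 2         # per-user quota including the archive mailbox (assumed)

# How many users the drive can sustain from an I/O standpoint
users_by_iops = math.floor(disk_random_iops / (iops_per_mailbox * io_overhead_factor))

# How many users the drive can hold from a capacity standpoint
usable_gb = disk_capacity_gb * (1 - capacity_overhead)
users_by_capacity = math.floor(usable_gb / mailbox_quota_gb)

# The 1:1 database/disk design has to respect whichever limit is lower
users_per_database = min(users_by_iops, users_by_capacity)

print(f"IOPS limit:     {users_by_iops} users")
print(f"Capacity limit: {users_by_capacity} users")
print(f"Plan for:       {users_per_database} users per 1:1 database/disk")
```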

At this point, you probably have disk capacity to spare, which is why Exchange 2010 allows the creation of archive mailboxes in the same mailbox database. All of the user’s data is kept at the same level of redundancy, and the archived data – which is less frequently accessed than the mainline data – is stored without additional significant disk or I/O penalty. This all seems to indicate that JBOD is the way to go, yes? Two copies in the main site, two off-site DR copies, and I’m using cheaper storage with larger mailboxes and only four copies of my data instead of the minimum of six I’d have with CCR+SCR (or the equivalent DAG setup) on RAID configurations.

Not so fast. Microsoft’s claims around DJS configurations usually talk about the up-front capital expenditures. There’s more to a solid design than just the up-front storage price tag, and even if the DJS solution does provide savings in your situation, that is only the start. You also need to think about the lifetime of your storage and all the operational costs. For instance, what happens when one of those 1:1 drives fails?

Well, if you bought a really cheap DAS array, your first indication will be when Exchange starts throwing errors and the active copy moves to one of the other replicas. (You are monitoring your Exchange servers, right?) More expensive DAS arrays usually let you know directly that a disk has failed. Either way, you have to replace the disk. Again, with a cheap white-box array you’re on your own to buy replacement disks, while a good DAS vendor will provide replacements within the warranty/maintenance period. Once the disk is replaced, you have to re-establish the database replica. This brings us to the wonderful process known as database reseeding, which is not only a manual task, but one that can take quite a significant amount of time – especially if you made use of archive mailboxes and stuffed that DJS configuration full of data. Let’s take a closer look at what this means to you.

[Begin 4/13 update]

There’s a dearth of hard information out there about what types of reseed throughputs we can achieve in the real world, and the initial version of this post, where I assumed 20GB/hour as an “educated guess,” earned me a bit of ribbing in some quarters. In my initial example, I said that if we can reseed 20GB of data per hour (from a local passive copy, to avoid the I/O hit to the active copy), that’s 10 hours for a 200GB database, 30 hours for a 600GB database, or 60 hours – two and a half days! – for a 1.2TB database[2].

According to the Top 10 Exchange Storage Myths post on the Exchange team blog, 20GB/hour is way too low; in their internal deployments, they’re seeing between 35 and 70GB per hour. How would these speeds affect reseed times in my examples above? Well, let’s look at Table 1:

Table 1: Example Exchange 2010 Mailbox Database reseed times

Database Size    Reseed Throughput    Reseed Time
200GB            20GB/hr              10 hours
200GB            35GB/hr              6 hours
200GB            50GB/hr              4 hours
200GB            70GB/hr              3 hours
600GB            20GB/hr              30 hours
600GB            35GB/hr              18 hours
600GB            50GB/hr              12 hours
600GB            70GB/hr              9 hours
1.2TB            20GB/hr              60 hours
1.2TB            35GB/hr              35 hours
1.2TB            50GB/hr              24 hours
1.2TB            70GB/hr              18 hours
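
The arithmetic behind Table 1 is simply database size divided by sustained reseed throughput, rounded up to whole hours. Here’s a short Python sketch you can rerun with your own database sizes and measured throughputs:

```python
import math

# Reseed time = database size / sustained reseed throughput, rounded up to whole hours.
db_sizes_gb = [200, 600, 1200]            # 1.2TB = 1200GB
throughputs_gb_per_hr = [20, 35, 50, 70]  # 20 was my original guess; 35-70 is the Exchange team's range

for size in db_sizes_gb:
    for rate in throughputs_gb_per_hr:
        hours = math.ceil(size / rate)
        print(f"{size}GB at {rate}GB/hr: ~{hours} hours")
```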

As you can see, reseed time can be a key variable in a DJS design. In some cases, depending on your business needs, these times could make or break whether this is a good design. I’ve done some asking around and found that reseed times in the field are all over the map. I had several people talk to me at the MVP Summit and ask under what conditions I’d seen 20GB/hour, as that was too high. Astrid McClean and Matt Gossage of Microsoft had a great discussion with me and obviously felt that 20GB/hour is way too low.

Since then, I’ve received a lot of feedback and like I said, it’s all over the map. However, I’ve yet to hear anyone outside of Microsoft publicly state a reseed throughput higher than 20GB/hour. What this says to me is that getting the proper network design in place to support a good reseed rate hasn’t been a big point in deployments so far, and that in order to make a DJS design work, this may need to be an additional consideration.

If your replication network is designed to handle the amount of traffic required for normal DAG replication but doesn’t have sufficient throughput to handle reseed operations, you may be hurting yourself in the unlikely event that you suffer multiple simultaneous replica failures on the same mailbox database.

This is a bigger concern for shops that have a small tolerance for any given drive failure. In most environments, one of the unspoken effects of a DJS DAG design is that you are trading number of replicas – and database-level failover – for replica rebuild time. If you’re reduced from four replicas down to three, or from three down to two, during the time it takes to detect the disk failure, replace the disk, and complete the reseed, you’ll probably be okay with that window taking a longer period of time – as long as you have sufficient replicas.

All during the reseed time, you have one fewer replica of that database to protect you. If your business processes and requirements don’t give you that amount of leeway, you either have to design smaller databases (and waste the disk capacity, which brings us right back to the good old bad days of Exchange 2000/2003 storage design) or use RAID.
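
To put that exposure window in concrete terms, here’s a rough Python sketch. All of the durations and sizes are made-up assumptions – replace them with your own detection, logistics, and measured reseed figures:

```python
import math

# Illustrative assumptions only; plug in your own operational numbers.
detection_hours = 4        # time to notice and confirm the failed drive
replacement_hours = 24     # time to get a replacement disk installed
db_size_gb = 1200          # a 1.2TB database stuffed with archive mailboxes
reseed_gb_per_hr = 35      # sustained reseed throughput on the replication network

reseed_hours = math.ceil(db_size_gb / reseed_gb_per_hr)
exposure_hours = detection_hours + replacement_hours + reseed_hours

replicas_normal = 4
replicas_during_exposure = replicas_normal - 1

print(f"Reseed alone: ~{reseed_hours} hours")
print(f"Running with {replicas_during_exposure} of {replicas_normal} replicas for "
      f"~{exposure_hours} hours ({exposure_hours / 24:.1f} days)")
```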

[End 4/13 update]

Now, with a RAID solution, we don’t have that same problem. We still have a RAID volume rebuild penalty, but that’s happening inside the disk shelf at the controller, not across our network between Exchange servers. And with a well-designed RAID solution such as generic RAID 10 (1+0) or NetApp’s RAID-DP, you can actually survive the loss of more disks at the same time. Plus, a RAID solution gives me the flexibility to populate my databases with smaller or larger mailboxes as I need, and aggregate out the capacity and performance across my disks and databases. Sure, I don’t get that nice 1:1 disk to database ratio, but I have a lot more administrative flexibility and can survive disk loss without automatically having to begin the reseed dance.

Don’t get me wrong – I’m wildly enthusiastic that I, as an Exchange architect, have the option of designing to JBOD configurations. I like having choices, because that helps me make the right decisions to meet my customers’ needs. And that, in the end, is the point of a well-designed Exchange deployment – to meet your needs. Not the needs of Microsoft, and not the needs of your storage or server vendors. While I’m fairly confident that starting with a default NetApp storage solution is the right choice for many of the environments I’ll be facing, I also know how to ask the questions that lead me to consider DJS instead. There’s still a place for RAID at the Exchange storage table.

In further installments over the next few months, I’ll begin to address the SATA vs. SAS/FC and DAS vs. SAN arguments as well. I’ll then try to wrap it up with a practical and realistic set of design examples that pull all the pieces together.

[1] RAID-1 (mirroring) and RAID-10 (striping and mirroring) both create two physical copies of the data. RAID-5 does not, but it allows you to survive the loss of a single drive – effectively giving you a virtual second copy of the data.

[2] Curious why I picked these database sizes? 200GB is the recommended maximum size for Exchange 2007 (due to backup limitations), and 600GB/1.2TB are the realistic recommended maximums you can get from 1TB and 2TB disks today in a DJS replica-per-disk deployment; you need to leave room for the content index, transaction logs, and free space.

9 Comments



  1. Xiotech Employee <– just want to be clear :)

    WOW – Great job on this blog post. In fact, I’ll be forwarding it around my team.

    What’s really interesting is that we are seeing more and more companies being pushed to look at DAS solutions for their application environments. Mostly I think it’s around reducing the overall cost of the solution, as well as a certain predictability around performance and reliability. DAS solutions in the past have signified low cost, low performance, and low reliability, as well as an inability to scale very far. Not to mention, they were missing key features that the application vendors didn’t include, like COW snapshots, replication, deduplication, etc.

    Just think, 10 years ago we spent a lot of time educating people on SAN vs. DAS. We talked about how great it was to create a snapshot or replicate your data to a DR site, and how you couldn’t do that unless your data was in a storage array with those features.

    If we look at storage controllers today, we’ve packed a whole bunch of features into two servers/controller heads. In today’s storage array, not only is it doing RAID and cache protection (which is super important), but it’s also doing thin provisioning, replication, dedupe (in some cases), snapshots (full copy and COW), multi-tier migration, CIFS, NFS, FC, iSCSI, FCoE, etc. It’s getting to the point that performance predictability is pretty much going away: reliability of the code, the mixing of different technologies (1Gb, 2Gb, and 4Gb FC drive bays, SAS connections), and all the various “plumbing” connectivity options most arrays offer today. Not to mention, the fundamental building block of a storage array is data protection – 2TB drives, rebuild times, etc. all take a toll on controllers.

    Fast forward 10 years, and I think we’re seeing that the application vendors have caught up. Exchange 2010 is a great example of this; vSphere is another one. Solutions like our Emprise 5000 (I did mention I was with Xiotech, right? :) ) product line offer the ability to have DAS predictability, both from a performance standpoint and a price standpoint. You need 10TB of Tier 1, Tier 2, or Tier 3 storage? It costs X amount. You need 10,000 IOPS? It’s X amount, etc.

    Don’t get me wrong, there is CLEARLY a need for storage controllers to do some of the things you pointed out in your blog post, but I still think intelligent DAS has a very large place in the overall storage foundation of a datacenter.

    @StorageTexan
