I love my job – or where I’ve been lately
I felt like posting this email because I think it’s a great example of the complexity of the computing environment I manage (and design). I always get lots of questions on what I do, so hopefully this sheds a little light on the subject. And for the record, today was an easy day. This kind of thing, or completely redoing our enterprise wide area network routing topology at the drop of a hat, is why I’ve been posting so sparsely of late. I’ve been WAY busy. Anyhow, enjoy! Oh, and yes – I’ve changed server names to protect the innocent. 🙂
I received a helpdesk email this morning about 7a. Not a lot to go on here.
Hello,
It appears we’re not receiving emails into the Confirms and DIG inboxes within the Operations public folder. Thanks for your help!
When I finally had a free minute tonight to rehash the solution, here’s what I wrote:
Okay, so the issue we encountered today was a rather complex symptom caused by two separate but simple configuration errors, along with a misunderstood (but by-design) Exchange behavior. I engaged Microsoft’s Product Support Services group and was on the phone with their engineers for roughly 3.5 hours. Here are the details:
Symptom – External emails destined for an address associated with a mail-enabled Public Folder (PF) were being accepted by KRK12 and then routed on to SUN01 for delivery to the PF. SUN01 does not at this time have a local replica of the PFs in question (Confirms, DIG), so this behavior was quite confusing. Furthermore, once the messages were received by SUN01, message tracking indicated that they were being delivered locally and at the same time relayed on to KRKMail for delivery. The kicker was that those messages destined for KRKMail were being queued for delivery. While working on the issue we also realized that messages destined for recipients outside of KRK were also being queued (the majority of these were postmaster delivery delay notifications). All of these behaviors were only happening to PF-destined emails; emails to users were working without error.

Message routing explained – It turns out that the routing of emails to SUN01 is actually by “design” in the Exchange 2003 world. A front-end or bridgehead Exchange server should not have a Public Folder storage group located on that box, as such a configuration often leads to message routing issues (ironic, no?). Our environment (KRK12) follows that best practice. Because there is no local copy of the PF folder hierarchy on the bridgehead, that server queries Active Directory for a server in the local Exchange site (called an Administrative Group, or AG) that does have the PF hierarchy and can route the email destined for a PF. AD looks up the list of PF servers in its database and returns a server name from the array. The record it returns first is the most recent addition to the list – in this case SUN01. Should the first server returned be unreachable, the bridgehead queries AD again to get the next server in line. In our scenario, however, SUN01 was online and accepting emails with no issues.

Once the bridgehead identifies a suitable PF owner, it forwards the message in question to that PF server. That server accepts the message via SMTP, identifies it as a message going to a PF, and because the server is a PF server it actually delivers the message to the PF store. This is why we saw the message being delivered “locally” in message tracking on SUN01. The PF store gets the message, references its hierarchy, and in this case realizes that since there is no local replica of the PF contents, the message needs to be handed off to a PF server that does have those contents. This is why we saw the message being re-queued for delivery to KRKMail – the home of all our PFs at this time.
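To make that lookup-and-fallback behavior a little more concrete, here is a toy sketch of the logic as I understand it. This obviously isn’t Exchange code; the server names, the reachability check, and the replica table are just stand-ins based on today’s scenario.

# Toy model of the PF routing described above. Not Exchange code, just an
# illustration; every name here is hypothetical.

def pick_pf_server(pf_servers_from_ad, reachable):
    # AD hands back the PF servers in the local Administrative Group with the
    # most recently added server first; the bridgehead only falls back to the
    # next entry if the current one is unreachable.
    for server in pf_servers_from_ad:
        if reachable(server):
            return server
    raise RuntimeError("no reachable PF server in this Administrative Group")

def handle_pf_message(server, folder, replicas):
    # The chosen PF server accepts the message, checks its hierarchy, and
    # either delivers locally or re-queues it toward a server that actually
    # holds a replica of the folder.
    if folder in replicas.get(server, set()):
        return f"delivered locally on {server}"
    return f"re-queued from {server} toward a replica owner (KRKMail)"

# Today's scenario: SUN01 is the newest PF server in the AG, so AD returns it
# first even though it holds no replica of Confirms or DIG.
pf_servers_from_ad = ["SUN01", "KRKMail"]   # newest addition first
replicas = {"SUN01": set(), "KRKMail": {"Confirms", "DIG"}}

target = pick_pf_server(pf_servers_from_ad, reachable=lambda s: True)
print(handle_pf_message(target, "Confirms", replicas))

The fallback never kicked in for us, of course, because SUN01 was online and happily accepting mail; it just didn’t hold the replicas.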
In order to prevent messages from being routed to our Sunnyvale server first, we will need to move SUN01 into its own Administrative Group (Exchange site). This is actually an Exchange best practice – mimic your Active Directory topology in your Exchange organization – but it’s optional, and there are pros and cons to consider. One option is the simple configuration (our current state, with all servers in a single AG), which routes PF-bound messages the long way – to Sunnyvale first. The alternative is a geographically segregated Exchange topology that resembles our AD topology and prevents such long routing (the Kirkland AG would list only KRKMail as a local PF host in AD), but it is more complicated to set up because routing connectors between AGs need to be configured. My vote is for the latter: more work, yes, but more efficient routing and a purer “by the book” configuration.

Why SUN01 couldn’t send mail – This is where the story got even more convoluted, and in the end there was much smacking of heads on desks. Essentially there were two configuration errors in our environment, each of which, on its own, would have prevented proper mail delivery. The first issue located was an MX record on krk-group.net, our internal DNS domain. This is a big no-no in the Exchange world, and I don’t remember adding it when we built things out originally. Even worse, that record was configured incorrectly, pointing mail destined for our address space towards “mailkrk.krk-group.net” instead of “krkmail.krk-group.net”. D’oh! The internal MX record was not fixed; it was simply deleted from our zone.
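For what it’s worth, this is exactly the kind of thing a quick scripted sanity check would have caught. Here is a rough sketch using the third-party dnspython package (my choice for illustration, not something we actually run), pointed at the anonymized zone name from above. There shouldn’t be any MX records in the internal zone at all, so any hit is suspect.

import dns.resolver  # third-party "dnspython" package; an assumption, install with pip

INTERNAL_ZONE = "krk-group.net"   # anonymized internal DNS domain

try:
    answers = dns.resolver.resolve(INTERNAL_ZONE, "MX")
except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
    print("No MX records in the internal zone, which is what we want.")
else:
    # Any record listed here deserves a hard look; today's culprit pointed
    # at "mailkrk.krk-group.net" instead of the real mail server.
    for record in answers:
        print(f"Found: {INTERNAL_ZONE} MX {record.preference} {record.exchange}")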
The second error was a failure to add the SAV client object for SUN01 in Symantec System Center to the Exchange servers configuration group; it was using the generic defaults our non-group-member clients get. One of those settings enables SMTP scanning on the client side – that’s inbound and outbound scanning. What makes this interesting is that I do remember not installing the SMTP scanning bits when I installed the SAV client on SUN01 – odd. I added the server to the appropriate configuration group, bounced IIS on SUN01 (to restart the SMTP service), and all the message queues cleared out instantaneously. Messages delivered.
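And a note-to-self for next time: after bouncing IIS, a ten-second check that the SMTP service is actually answering beats staring at queue viewers. A minimal sketch using Python’s standard smtplib, with the anonymized hostname standing in for the real box:

import smtplib

SERVER = "SUN01"   # anonymized hostname; use the server's real FQDN

# Connect to the SMTP virtual server on port 25 and say EHLO. If the SMTP
# service came back up cleanly this prints a 250 response; if not, it raises.
with smtplib.SMTP(SERVER, 25, timeout=10) as smtp:
    code, banner = smtp.ehlo()
    print(f"EHLO response {code}: {banner.decode(errors='replace')}")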
Summary – As one of my favorite quotes goes, “We regret to inform you your sons are dead because they were stupid.” Little errors and omissions can make big issues where you least expect them to. It makes a good case for change management and documentation, and an even bigger case for documented server build instructions and checklists. As always, we’ve learned a lot from this incident. Our outstanding to-do is to discuss and decide our Exchange organization’s topology, and we should make that decision ASAP. I do not believe the changes require a maintenance window, especially since nobody is running their personal mailbox off the Sunnyvale server yet, but to be on the safe side I recommend the AG division and routing configuration be completed after-hours. I’ll be staying overnight sometime in the next couple of weeks when Verizon splices us into the OC-12 ring; that makes an ideal opportunity to effect these changes.