I felt like posting this email because I think it’s a great example of the complexity of the computing environment I manage (and design). I always get lots of questions on what I do, so hopefully this sheds a little light on the subject. And for the record, today was an easy day. This kind of thing, or completely redoing our enterprise wide area network routing topology at the drop of a hat, is why I’ve been posting so sparsely of late. I’ve been WAY busy. Anyhow, enjoy! Oh, and yes – I’ve changed server names to protect the innocent. 🙂
I received a helpdesk email this morning about 7a. Not a lot to go on here.
It appears we’re not receiving emails into the Confirms and DIG inboxes within the Operations public folder. Thanks for your help!
When I finally had a free minute tonight to rehash the solution, here’s what I wrote:
Okay, so the issue we encountered today was a rather complex symptom caused by two separate but simple configuration errors, along with a misunderstood by-design Exchange behavior. I engaged Microsoft’s Product Support Services group and was on the phone with their engineers for roughly 3.5 hours. Here are the details:
Symptom – External emails that were destined for an address associated with a mail-enabled Public Folder (PF) were being accepted by KRK12 and then routed on to SUN01 for delivery to the PF. SUN01 does not at this time have a local replica of the PFs in question (Confirms, DIG) so this behavior was quite confusing. Furthermore, once the messages were received by SUN01, message tracking indicated that they were being delivered locally and at the same time relayed on to KRKMail for delivery. The kicker was that those messages destined for KRKMail were being queued for delivery. While working on the issue we also realized that messages destined for recipients outside of KRK were also being queued (the majority of these were postmaster delivery delay notifications). All of these behaviors were only happening to PF-destined emails … emails to users were working without error.
Message routing explained – It turns out that the routing of emails to SUN01 is actually by “design” in the Exchange 2003 world. A front end or bridgehead Exchange server should not have a Public Folder storage group located on that box as such a configuration often leads to message routing issues (ironic, no?). Our environment (KRK12) follows that best practice. Because there is no local copy of the PF folder hierarchy on the bridgehead, that server queries Active Directory for a server in the local Exchange site (called an Administrative Group, or AG) that does have the PF hierarchy and can route the email destined for a PF. AD looks up the list of PF servers in its database, and returns a server name from the array. The record it returns first is the most recent addition to the list … in this case SUN01. Should the first server returned be unreachable, the bridgehead queries AD again to get the next server in line. In our scenario, however, SUN01 was online and accepting emails with no issues.
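For the curious, here is a minimal Python sketch of that selection behavior as I understood it from the support call. It is a toy model, not Exchange's actual code: the function name `pick_pf_server` and the `is_reachable` callback are made up for illustration, and the candidate order simply mirrors the "most recent addition first" behavior described above.

```python
from typing import Callable, Optional, Sequence


def pick_pf_server(
    candidates: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable server, trying candidates in directory order."""
    for server in candidates:
        if is_reachable(server):
            return server
    return None  # no public folder server could be reached


# Order as described in the post: the most recently added server comes first.
pf_servers = ["SUN01", "KRKMail"]

# SUN01 is online, so it wins even though it holds no replica.
print(pick_pf_server(pf_servers, lambda s: True))          # SUN01

# Only if SUN01 were unreachable would the next server be tried.
print(pick_pf_server(pf_servers, lambda s: s != "SUN01"))  # KRKMail
```

Notice that nothing in the selection asks whether the chosen server actually holds the folder contents, which is why a healthy SUN01 kept getting picked.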
Once the bridgehead identifies a suitable PF owner it forwards the message in question to that PF server. That server accepts the message via SMTP, identifies it as a message going to a PF, and because the server is a PF server it actually delivers the message to the PF store. This is why we saw the message being delivered “locally” in message tracking on SUN01. The PF store gets the message, references its hierarchy, and in this case realizes that since there’s no local replica of the PF contents the message needs to be delivered to a PF server that does have those contents. This is why we saw the message being re-queued for delivery to KRKMail – the home of all our PFs at this time.
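The two-step handoff is easier to see in code. This is a rough Python sketch under my own assumptions: the `PFServer` class, the `route_pf_message` function, and the event strings are all hypothetical and only imitate what message tracking showed us; they are not an Exchange API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PFServer:
    name: str
    replicas: Set[str] = field(default_factory=set)  # folders with local content


def route_pf_message(folder: str, first_hop: PFServer, servers: List[PFServer]) -> List[str]:
    """Return message-tracking-style events for one PF-destined message."""
    # Any PF server hands the message to its own PF store first.
    events = [f"{first_hop.name}: accepted via SMTP, delivered to local PF store"]
    if folder in first_hop.replicas:
        events.append(f"{first_hop.name}: stored in local replica of {folder}")
        return events
    # No local content: the store relays to a server that holds a replica.
    for server in servers:
        if server is not first_hop and folder in server.replicas:
            events.append(f"{first_hop.name}: queued for relay to {server.name}")
            return events
    events.append(f"{first_hop.name}: no replica found for {folder}")
    return events


sun01 = PFServer("SUN01")                         # hierarchy only, no contents
krkmail = PFServer("KRKMail", {"Confirms", "DIG"})

for event in route_pf_message("Confirms", sun01, [sun01, krkmail]):
    print(event)
```

Run against our setup, this produces both a "local" delivery and a relay to KRKMail for the same message, which matches the confusing pair of entries we saw on SUN01.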
In order to prevent messages from being routed to our Sunnyvale server first, we will need to move SUN01 into its own Administrative Group (Exchange site). This is actually an Exchange best practice – to mimic your Active Directory topology in your Exchange organization – but it’s optional. There are pros and cons with that configuration that should be considered. One option is a simple configuration (current status with all servers in a single AG) that will route messages to a PF the long way – to Sunnyvale first. The alternative is a geographically segregated Exchange topology that resembles our AD topology and prevents such long routing (the Kirkland AG will list only KRKMail as a local PF host in AD), but is more complicated in its configuration in that routing connectors between AGs need to be configured. My vote is for the latter: more work – yes, but more efficient routing and a purer “by the book” configuration.
Why SUN01 couldn’t send mail – This is where the story got even more convoluted, and in the end there was much smacking of heads on desks. Essentially there were two configuration errors in our environment that each, on their own, would have prevented proper mail delivery. The first issue located was an MX record on bgi-group.net, our internal DNS domain. This is a big no-no in the Exchange world, and I don’t remember adding it when we built things out originally. Even worse, that record was configured wrong, pointing mail destined for our address space towards “mailkrk.krk-group.net” instead of “krkmail.krk-group.net”. D’oh! The internal MX record was not fixed; it was simply deleted from our zone.
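A transposed host name like that is exactly the kind of thing a dumb script can catch. Below is a small Python sketch that compares MX targets against a list of known hosts. It works on plain lists rather than querying DNS, the function name `find_bad_mx` is made up, and the host names other than krkmail.krk-group.net are assumed for illustration. Keep in mind our actual fix was to delete the internal MX record altogether, not to correct it.

```python
from typing import Iterable, List


def find_bad_mx(mx_targets: Iterable[str], known_hosts: Iterable[str]) -> List[str]:
    """Return MX targets that match no known host (case-insensitive, trailing dot ignored)."""
    known = {host.lower().rstrip(".") for host in known_hosts}
    return [t for t in mx_targets if t.lower().rstrip(".") not in known]


# The misconfigured target from the incident.
mx_targets = ["mailkrk.krk-group.net"]

# Hypothetical inventory; only krkmail.krk-group.net is named in this post.
known_hosts = [
    "krk12.krk-group.net",
    "krkmail.krk-group.net",
    "sun01.krk-group.net",
]

print(find_bad_mx(mx_targets, known_hosts))  # ['mailkrk.krk-group.net']
```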
The second error was a failure to add the SAV client object for SUN01 in Symantec System Center to the Exchange servers configuration group – it was using the generic defaults our non-group-member clients get. One of those settings is to enable SMTP scanning on the client side – that’s inbound and outbound scanning. What makes this interesting is that I do remember not installing the SMTP scanning bits when I installed the SAV client on SUN01 – odd. I added the server to the appropriate configuration group, bounced IIS on SUN01 (to restart the SMTP service) and all the message queues cleared out instantaneously. Messages delivered.
Summary – As one of my favorite quotes goes, “We regret to inform you your sons are dead because they were stupid.” Little errors and omissions can make big issues where you least expect them to. It makes a good case for change management and documentation, and an even bigger case for documented server build instructions and checklists. As always we’ve learned a lot from this incident. Our outstanding to-do is to discuss/decide our Exchange organization’s topology. We should make a decision on that ASAP. I do not believe the changes require a maintenance window, especially since nobody is running off of the Sunnyvale server yet for their personal mailboxes, but to be on the safe side I recommend the AG division and routing configuration be completed after-hours. I’ll be staying overnight sometime in the next couple weeks with Verizon when they splice us into the OC-12 ring; that makes an ideal opportunity to effect these changes.
I’ve been using Mailblocks for about 16 months now and I swear by it. According to their tracking counter I’ve blocked over 80,000 spam messages in that time. I very rarely get a spam message now on the email accounts I filter through their service.
Alicea wanted to sign up over the weekend, but we were dismayed to see that they’re not accepting new accounts at this time. Warning bells went off in my head – are they shutting down? I fired off an email to their support line (I still wish they had a phone number) and got the boiler-plate response of “yeah we got your email, we’ll get to you when we can”. Except on closer inspection it’s a LOT better than your average receipt confirmation.
To give our customers the highest quality of service possible, we strive to respond to all emails within 24 hours of receipt. You will receive a personal response from our Support Team shortly. Our current average Email response time for the past 7 days is approximately 12 hours.
Notice that last line … “Our current average Email response time for the past 7 days is approximately 12 hours.” Now this may be horse hockey, or it may be a real statistic. I got a response at 11:03a PDT Monday from a submission at 10:43p PDT Saturday night – but that was over a Sunday and it wasn’t a service-down issue. But here’s what I’m getting at – they set a customer expectation that they’re involved, are actively monitoring their support emails, and set a realistic expectation for the user.
It’s way better than just saying “We’ve received your email and will respond in the order received.” Same way I’d rather hear “you’ve reached the Product X support queue; there are 3 callers ahead of you” or “the average hold time is 15 minutes” than nothing at all.
Now, would I rather have a phone number and receive an immediate answer to my question without going through voicemail menus or waiting in a queue? Sure. But that level of staffing (both in numbers and technical expertise) isn’t realistic for most organizations or products. If you can set an expectation with the customer it sends a very different, and positive, message than simply acknowledging they exist. Better to know it’ll be about 12 hours than wonder after 5 why I haven’t heard anything back.
What else did I learn that makes me high on Mailblocks?
At this time, we are working on creating the next generation web mail product. As soon as new information is released, we will be updating the main http://www.mailblocks.com web page. We apologize, but we do not have an exact time or date on this.
Can’t wait for the next generation! And I’m sure Alicea can’t wait to use their service for the first time. Now if they’d only publish a timeline … but that’s ever so dangerous for software release dates, isn’t it Microsoft? 🙂
A public interest lawyer who is also intending to run as a Republican in the 2006 Illinois gubernatorial race is taking his fight to Microsoft in hopes of preventing the company from releasing what he calls “bad code.” Andy Martin of The Committee to Fight Microsoft on Tuesday announced his intentions to block Microsoft from releasing its Windows Vista operating system. Martin intends to ask Microsoft for an unconditional warranty that the operating system is free of bugs that could result in security vulnerabilities.
“Bill Gates sells the public defective products, and then expects us to spend years being his guinea pigs, while he corrects the myriad of defects and vulnerabilities in his defective code. This is mass consumer fraud.” Martin argued. “It is unacceptable corporate behavior. Over four years after Windows XP was released I still receive regular ‘updates’ and ‘bug fixes,’ which reflect a product that was originally scandalously defective.”
I don’t think this asshole realizes that we’re talking about software here. I mean let’s think about releasing an OS in comparison with something Mr. Martin uses every day – a car. I’m just guessing here admittedly, but your average car probably has … oh I don’t know, we’ll go with 10,000 parts (if you count all the bolts and screws). EVERY car manufacturer has a recall of some kind every year … and those are just the defects that the government is worried enough about to mandate a fix. There are TONS of maintenance bulletins, etc. released as well.
Sounds just like patching to me … except an OS has millions of lines of code and a defect won’t KILL you! That’s right boys and girls, your computer won’t flip over at high speed and kill you (Explorer). Nor will a slight physical impact to the case cause the hard drive to explode in a ball of fiery death (Pinto). So why isn’t Mr. Martin flaming the car industry? Or airlines who lose luggage, cancel flights without notice, and generally do everything in their power to piss off their customers? [Oh, they occasionally kill their customers too.]
Let’s get back to the computer industry … why doesn’t Mr. Martin yell at Apple? Or Oracle? Or Novell? Or Symantec? [Hell, I yell at Symantec all the time.] Or Adobe? Or ANY other company that’s ever written a line of code and put it out for use by someone else?
Shut the hell up. Seriously.
Update: It seems I’ve got company in the “this guy is a complete idiot” department. Thanks Brandon @ Longhornblogs.com!