Here is an overview of how we run and manage the services we provide to you.
Monitoring and Escalations:
Ensuring our services are always up and running, and that we are there to take immediate action when things go wrong, is essential. These are some of the monitoring services we use to stay on top of our network's health:
- Real-time port monitoring for all SMTP services with Pingdom.com (1-minute cycles)
- Real-time system threshold and SMTP queue-level monitoring with Datadog (near real-time thanks to statsd)
- Polling the size of our Postfix queues and the status of the 60+ servers (a sketch of this kind of check follows this list)
- Escalation of these events to PagerDuty, where we have three people on call at any one point in time.
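To give a concrete idea of the queue check, here is a minimal sketch of that kind of probe, assuming the official datadog Python client (DogStatsD) and Postfix's postqueue on the box; the metric and tag names are just examples, not our production names.

```python
# Minimal sketch: report the Postfix queue size to Datadog via DogStatsD.
# Assumes the `datadog` Python package and Postfix's `postqueue` are available.
# Metric and tag names are illustrative only.
import re
import subprocess

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def postfix_queue_size() -> int:
    """Return the number of messages currently sitting in the Postfix queue."""
    out = subprocess.run(["postqueue", "-p"], capture_output=True, text=True).stdout
    if "Mail queue is empty" in out:
        return 0
    # The summary line looks like: "-- 1042 Kbytes in 137 Requests."
    match = re.search(r"in (\d+) Request", out)
    return int(match.group(1)) if match else 0

if __name__ == "__main__":
    statsd.gauge("smtp.queue.size", postfix_queue_size(), tags=["role:delivery"])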
On the architecture side:
We are using some of the biggest, beefiest Amazon cloud machines on the front end, load balanced, to handle the initial SMTP greeting and RBL blocking.
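For context, an RBL check at the edge boils down to a DNS blocklist lookup. Here is a minimal sketch of the idea; the blocklist zone below is just an example, not a statement of which lists we actually query.

```python
# Minimal sketch of an RBL (DNS blocklist) check at the SMTP edge.
# The blocklist zone is an example only.
import socket

def is_listed(client_ip: str, rbl_zone: str = "zen.spamhaus.org") -> bool:
    """Return True if client_ip appears on the given DNS blocklist."""
    # DNSBLs are queried by reversing the IPv4 octets: 1.2.3.4 -> 4.3.2.1.<zone>
    reversed_ip = ".".join(reversed(client_ip.split(".")))
    try:
        socket.gethostbyname(f"{reversed_ip}.{rbl_zone}")
        return True          # any A record means the IP is listed
    except socket.gaierror:
        return False         # NXDOMAIN: not listed

# A connection from a listed IP would be rejected before the SMTP conversation continues.
```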
Once a connection gets through that, it goes to an internal Amazon ELB load balancer, which spreads it across a battery of servers doing virus and spam filtering.
From there the message goes on to delivery, which runs PowerMTA so we can be sure we are playing nicely with the big providers like Gmail and Outlook as far as queue speed goes.
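PowerMTA manages the queue pacing for us, but to illustrate the general idea of throttling deliveries per recipient domain, here is a toy token-bucket sketch. The limits and domains are made up for the example, and this is not how PowerMTA is implemented.

```python
# Toy per-domain delivery throttle (illustrative only; rates are made up).
import time
from collections import defaultdict

PER_DOMAIN_RATE = {"gmail.com": 20.0, "outlook.com": 10.0}  # messages/second, example values

class TokenBucket:
    """Simple token bucket: allow() returns True while we are under the rate limit."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per recipient domain; unknown domains get a conservative default.
buckets = defaultdict(lambda: TokenBucket(rate=5.0, burst=10.0))
for domain, rate in PER_DOMAIN_RATE.items():
    buckets[domain] = TokenBucket(rate=rate, burst=rate * 2)

def may_deliver(recipient_domain: str) -> bool:
    """Ask before handing the next message for this domain to the delivery stage."""
    return buckets[recipient_domain].allow()
```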
There are three major processes, and the problem we had was with our concurrency and system buffers.
What we did not know at the time is that the ELB (Elastic Load Balancing) load balancer has no way of knowing when one of the spam-filtering servers goes offline: the port would sit open and the ELB would just keep sending connections to those machines. ELB does not know how to check for the SMTP 220 greeting, and when it sees an open port it tries to push as hard as possible.
We have SMTP pipelining enabled, so we do not shut down the connection from our front-end servers to the scanners. The front ends therefore think the filters are open and try to push 1,000 messages per minute per machine to the ELB.
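To make the gap concrete, here is a minimal sketch of the kind of health check ELB was not doing for us: a plain TCP connect is not enough, the check has to read the banner and confirm the 220 greeting. The host, port, and timeout values are illustrative.

```python
# Minimal sketch of an SMTP banner health check: connect AND confirm a 220 greeting.
import socket

def smtp_banner_healthy(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True only if the server accepts the connection and greets with 220."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512).decode("ascii", errors="replace")
            return banner.startswith("220")
    except OSError:
        return False
```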
This backed up, messages queued, and it sent us into a death spiral. We honestly waited too long to switch back to Dyn and let them filter, but we really thought we had the issue nailed. Each test takes 5-10 minutes, and by that point 50k messages were backed up and there was no way we could de-queue them. The biggest problem was that only a couple of the filters died, so the servers that had sessions open to running machines kept going while the others stopped. Once they started sending to the functional machines it became a cascading thundering herd, and there was nothing we could do; not even our Amazon auto scaler would help.
What we have done now…
We have set up far, far tighter alerting and thresholds for misbehaving systems.
We are running a process that watches the spam-filtering processes every 90 seconds (sketched below), and we are in discussions with Amazon about their ELB configuration.
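Here is a rough sketch of what that 90-second watchdog looks like, reusing the banner check from the earlier snippet and assuming the PagerDuty Events API v2 and the requests library; the module name, host list, and routing key are placeholders.

```python
# Sketch of a 90-second watchdog that pages on-call when a filter stops greeting.
import time
import requests  # assumed available

from smtp_health import smtp_banner_healthy  # the banner check above (hypothetical module name)

FILTER_HOSTS = ["filter-01.internal", "filter-02.internal"]  # placeholders
PAGERDUTY_ROUTING_KEY = "REPLACE_ME"                         # placeholder
CHECK_INTERVAL = 90  # seconds

def page_oncall(summary: str) -> None:
    """Trigger a PagerDuty incident via the Events API v2."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": "filter-watchdog", "severity": "critical"},
        },
        timeout=10,
    )

while True:
    for host in FILTER_HOSTS:
        if not smtp_banner_healthy(host):
            page_oncall(f"SMTP filter {host} is not answering with a 220 greeting")
    time.sleep(CHECK_INTERVAL)
```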
We have also more than doubled the infrastructure (not that we strictly need it, because you can't throw hardware at a software problem and expect it to hold), the new Ubuntu server is optimized, processes are watched, phones and pagers are ready, and after a lot of sleepless nights I think we have it solved.
We have been doing this for a long time, but Dyn had some extremely unusual configurations and accommodations for clients, built up over the span of 10 years, with opt-outs and some funky database requests, which required us to rebuild these servers from scratch based on their configurations. It was like a dart flying backwards: it looks inherently stable, but if you throw it that way it does some wild things.
We are ironing out the kinks to keep the spammers out and the customers happy, because in all honesty it should be an awesome setup when it's humming.
We hope that this was helpful. Please post your comments below or send any questions to support.