I am the CEO of DuoCircle and I am sending this email because we had a system outage today that impacted your service.


Starting at 3 AM PST on Monday, January 25th, we began experiencing deliverability problems on some of our servers.

The symptoms included:

* Timeouts
* Slow network responses
* Inability to connect to remote servers on port 25 (the SMTP port used for email)

Symptoms that you may have experienced:

* Delays in receiving email
* Receiving some messages but not others
* Emails arriving in a different order than they were sent

Initially, we believed that this was a connectivity issue with some of our Amazon Web Services servers, because the deliverability problems were not impacting all of our mail.

Our engineering staff determined that about 80% of our servers were unaffected, but the remaining 20% were timing out when attempting to deliver mail to our customers.

As soon as we noticed that the problem was isolated to a small number of machines, we stopped accepting inbound mail on the troubled machines so that no new mail would be impacted, and we sent the traffic to the nodes that were operational. Using this process, we were able to deliver all new mail to our clients; however, the messages that were already on the misbehaving machines still could not be delivered.

After we escalated to Amazon, it was determined that they had filtered port 25 on some of our servers because of an abuse complaint. However, after we talked to an Amazon support lead, it was determined that the filtering had been applied too aggressively and improperly in our case, and it was removed immediately.

We have since corrected the problem and are awaiting a root cause analysis from Amazon to understand how their escalation process was applied in this instance.

However, we are not without blame in this outage, and we are taking this opportunity to improve our own internal processes and escalations.

Our failures:
* Internally, we did not escalate this issue as quickly as we should have upon detection.
* We did not correctly correlate complaints from multiple clients into a larger system issue.
* Once we identified the problem, we did not escalate to Amazon Support as quickly as we could have. We waited for our own internal engineering team.

We would like all users to be aware that we have a status portal at http://status.duocircle.com, where you can track issues in real time.

 
