Nettica Postmortem May 17
Due to a system outage that began on Wednesday, May 17 at 20:08 UTC, 30 minutes of email needed to be re-delivered to your mail server.
We became aware of the delivery issue because our delivery checks from MXToolbox failed and alerted us within two minutes of the issue starting. Engineering staff responded immediately; there was no delay between the notification and the start of our investigation.
We posted to the status page within 12 minutes of the outage and created a communication channel at http://status.www.duocircle.com/incidents/y3wl4bjv1r5z for the incident. Our goal is to be extremely transparent during all incidents.
Hard drive consistency checks failed on one of the attached storage nodes, which required a reboot. When the system was rebooted, mail was delivered but not correctly written to disk.
Resolution and recovery
Since Monday, May 15, 2017, DuoCircle has experienced three service interruptions with the Nettica hosted email system (mailhostingservice.com). We were able to address the first two issues with a reboot and, subsequently, an updated SSL key. However, today, May 17, we had a hard drive issue with one of the connected nodes, and Nettica was down for 40 minutes from 20:08 UTC (1:08 PM PDT).
During this outage, mail was successfully queued on a separate cluster. However, when Nettica was returned to production, there was a gap of about 30 minutes of email because of a misconfiguration introduced when the server was restored to service.
Normally in an instance like this, the messages that the queuing cluster delivered to the main server during those 30 minutes would be lost. However, we have had a 100% recovery of the messages because of a proprietary system that we introduced a few weeks ago called Message Replay. It is sort of a DVR for email: we can rewind and play messages again at a later time.
As of midnight UTC we have restored all messages for our customers using mx1.mailhostingservice.com. Rather than lose 30 minutes of messages, we decided to restore all messages from 19:00 to 21:00 UTC to ensure that your mailbox is intact.
You may see new emails in your inbox with a timestamp from earlier today, or duplicates of messages that you have already received.
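Duplicates appear because the two-hour replay window (19:00 to 21:00 UTC) is deliberately wider than the roughly 30-minute gap. Message Replay is proprietary and its internals are not described here, so the following is only a minimal sketch of the idea, not the actual implementation. The `Message` type, the `split_replay` helper, and the use of the Message-ID header to recognize copies already in a mailbox are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Replay window stated above: 19:00 to 21:00 UTC on May 17, 2017.
WINDOW_START = datetime(2017, 5, 17, 19, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2017, 5, 17, 21, 0, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Message:
    """Hypothetical archived message (not the real Message Replay schema)."""
    message_id: str        # value of the Message-ID header
    received_at: datetime  # timezone-aware time the message was received


def in_replay_window(msg: Message) -> bool:
    """True if the message was received inside the replay window."""
    return WINDOW_START <= msg.received_at < WINDOW_END


def split_replay(archive, mailbox_ids):
    """Split the archived messages that fall inside the window into
    (restored, duplicates).

    restored:   messages the mailbox does not have yet
    duplicates: messages whose Message-ID is already in the mailbox
    """
    restored, duplicates = [], []
    for msg in archive:
        if not in_replay_window(msg):
            continue  # outside 19:00-21:00 UTC, never replayed
        if msg.message_id in mailbox_ids:
            duplicates.append(msg)
        else:
            restored.append(msg)
    return restored, duplicates


def utc(hour: int, minute: int) -> datetime:
    """Helper for building sample timestamps on May 17, 2017."""
    return datetime(2017, 5, 17, hour, minute, tzinfo=timezone.utc)


# Sample data, invented for this example.
archive = [
    Message("<a@example.com>", utc(19, 30)),  # delivered before the outage
    Message("<b@example.com>", utc(20, 20)),  # fell into the gap
    Message("<c@example.com>", utc(21, 30)),  # after the replay window
]
already_in_mailbox = {"<a@example.com>"}

restored, duplicates = split_replay(archive, already_in_mailbox)
```

In this sample, the message that fell into the gap is restored, the one delivered before the outage shows up a second time, and the one received after 21:00 UTC is not replayed. A mail client or filter could use the same Message-ID comparison to hide the extra copies.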
Corrective and Preventative Measures
We are in the process of ensuring that all customers are using the correct MX records. We have also escalated to the storage vendors to identify options and actions that we can take differently.
I am personally very impressed with the speed and professionalism of the teams that responded to the issue, and that our alerting system worked as designed. Thank you to our customers for your understanding. We will be migrating mailhostingservice.com over to a new system as soon as possible. – Brad Slavin