Postmortem: August 10th, 2014

On Sunday, August 10th, we had a major outage receiving check-ins from your snitches. During the outage we failed to receive and process snitch check-ins for about 4 hours and mistakenly sent out a large number of notifications claiming systems had failed to check in. I’m very sorry for any chaos, confusion, and wasted time this caused.

Background

On August 5th we deployed a new service for handling snitch check-ins. The service had been in development and testing for several months with the goal of improving reliability. The deployment went smoothly and it has met most of our expectations. One notable change is that we now queue snitch check-ins for processing rather than handling them immediately.

So what happened?

On August 10th at 6:00 EDT there was a cluster error on our backing RabbitMQ server because one of the nodes ran out of disk space. This blocked all of our connections to RabbitMQ, so snitch check-ins were queued for retry. At 6:35 EDT the service stopped processing snitch check-ins entirely.

At 8:50 EDT the first of our alarms was triggered as the servers running the collection service began to run out of memory. This was our first indication that anything was wrong. We quickly diagnosed the issue as the blocked connections on RabbitMQ and contacted the support team for our RabbitMQ service provider.

At 9:25 EDT our service provider fixed the failing node and unblocked all publishers. We restarted the servers running the collection service, and service had partially recovered by 9:30 EDT. By 9:40 EDT the workers had finished handling all outstanding messages in the queue.
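
To illustrate the failure mode described above, where check-ins queued for retry piled up until the servers ran short of memory, here is a minimal Python sketch of a retry buffer with a hard size cap. It is not our actual collection service code; the class name `BoundedRetryBuffer`, the `max_items` parameter, and the `publish` callback are all hypothetical and exist only for this example.

```python
from collections import deque
from typing import Callable, Deque


class BoundedRetryBuffer:
    """Hypothetical in-memory buffer for check-ins awaiting retry.

    The cap keeps a long broker outage from exhausting server memory.
    """

    def __init__(self, max_items: int) -> None:
        self.max_items = max_items
        self._pending: Deque[str] = deque()

    def __len__(self) -> int:
        return len(self._pending)

    def add(self, check_in: str) -> bool:
        # Refuse new work once the cap is reached so the caller can
        # report a failure instead of growing memory without bound.
        if len(self._pending) >= self.max_items:
            return False
        self._pending.append(check_in)
        return True

    def drain(self, publish: Callable[[str], bool]) -> int:
        # Publish in arrival order; stop at the first failure and keep
        # the remaining check-ins for the next attempt.
        sent = 0
        while self._pending:
            if not publish(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent
```

Rejecting new check-ins when the buffer is full is one possible design choice: it trades accepting every request for keeping the process alive, and it gives the caller a clear signal that something upstream is wrong.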

Our next steps

1. We were not aware of the issue until it had been ongoing for some time. We have identified a few key metrics and added alerting around them. We will continue to improve our alerting and metrics so we can react more quickly the next time something like this occurs.

2. We could have done a better job communicating the issues. In addition to Intercom and Twitter we have added a status page at status.deadmanssnitch.com.

3. The code for failure handling and retry has proven buggy. We will be reviewing it and addressing the issues we discovered when dealing with blocked connections.

4. False positive alerts can cause a lot of trouble. We want to avoid that as much as possible as we know the kind of havoc, confusion, and 4am emergency calls it can cause. We’re looking at ways to eliminate any single points of failure in the chain from receiving a snitch check-in until it has been stored and processed.
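
As a rough illustration of steps 1 and 4, the sketch below tracks when a check-in was last processed and holds back "missed check-in" notifications while ingestion itself looks unhealthy. This is an assumption about how such a guard could work, not a description of what we have deployed; `IngestHealth`, `max_silence_seconds`, and `should_notify_missed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IngestHealth:
    """Hypothetical tracker for how recently any check-in was processed."""

    max_silence_seconds: float
    last_processed_at: Optional[float] = None

    def record_processed(self, now: float) -> None:
        self.last_processed_at = now

    def is_healthy(self, now: float) -> bool:
        # Unhealthy if nothing has ever been processed, or if the
        # pipeline has been silent longer than the allowed window.
        if self.last_processed_at is None:
            return False
        return (now - self.last_processed_at) <= self.max_silence_seconds


def should_notify_missed(health: IngestHealth, now: float, deadline: float) -> bool:
    # Only report a missed check-in when the deadline has passed AND
    # ingestion is healthy; otherwise the miss may be our own fault.
    return now > deadline and health.is_healthy(now)
```

The same `is_healthy` check could also drive an internal alarm, so that a stalled pipeline is noticed from the lack of processed check-ins rather than from a secondary symptom such as memory pressure.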

Summary

We want you to be able to depend on Dead Man’s Snitch as much as we do. We know the trouble it can cause when monitoring is not working as expected. I am very sorry for any trouble the extra notifications caused. We are working hard to make sure Dead Man’s Snitch is a tool you can continue to rely on.

The best way to get notified of any issues we are having is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.

- Chris Gaffney
Collective Idea