On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized. I am incredibly sorry for the chaos and confusion this caused.
On March 6th at 9:30 EST we deployed a change that decoupled two of the models in our system (historical periods and check-ins). At 9:45 EST a user triggered an unscoped deletion of all historical period records when they changed the interval of their snitch.
We were alerted at 9:50 EST and immediately disabled our alerter process to avoid further confusion. We began diagnosing the cause and at 10:50 EST deployed a fix for the unscoped delete. Our next step was to restore the missing data from our backups. We decided to keep the system live, and because data created while the system kept running could conflict with the restore, we used a slower but more accurate process to restore the data.
At 17:30 EST we finished restoring most of the historical data and ran a set of data integrity checks to ensure everything was in a clean state. We then sent out one final set of “reporting” alerts for any snitches that were healthy but had been marked as failed.
We use a pull-request-based development process. Whenever a change is made, it is reviewed by another developer and then merged by the reviewer. It’s common for a change to go through several revisions before it is merged.
In this case, the unscoped deletion was introduced while implementing a suggestion to reduce the number of queries made during an interval change. In making that change, the scoping that limited the deletion to a single snitch’s periods was accidentally removed. The code was reviewed, but the scoping issue was missed in the final review.
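The exact code isn’t important, but the shape of the bug looks roughly like this (a minimal sketch with simplified model and method names, not our actual implementation):

```ruby
# Illustrative only -- model and method names are simplified, not our
# actual code.
class Snitch < ActiveRecord::Base
  def change_interval!(new_interval)
    # Intended: clear this snitch's historical periods in a single query.
    Period.where(snitch_id: id).delete_all

    # The bug: with the `where` scope accidentally dropped, the same call
    # deletes every historical period in the table.
    # Period.delete_all

    update!(interval: new_interval)
  end
end
```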
Additionally, we have an extensive test suite that gives us confidence when we make large changes to the system. Our tests did not uncover this issue because the unscoped delete still satisfied the conditions they checked.
1. We have reviewed our use of destructive operations that could be prone to scoping issues (e.g. Model.where(…).delete_all) and found that this was the only remaining instance in our codebase.
2. We have reviewed our tests around destructive behavior and have added cases to ensure they only affect the records they should (a sketch of this kind of test follows this list).
3. Our restore and recovery process took much longer than we would have liked. While we waited for the restore to finish we developed a set of tools for checking data integrity, and we will be fleshing these out further and making them part of our normal maintenance routine. Lastly, we will be planning and running operations fire drills to improve our readiness for cases like this.
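As a rough example of the kind of case described in point 2 above, the important assertion is that records belonging to other snitches survive the operation. This is a hypothetical RSpec/FactoryBot sketch, not a test from our actual suite:

```ruby
# Hypothetical test sketch -- factories and method names are illustrative.
describe "changing a snitch's interval" do
  it "deletes only that snitch's historical periods" do
    snitch = create(:snitch, :with_periods)
    other  = create(:snitch, :with_periods)

    snitch.change_interval!("hourly")

    expect(Period.where(snitch_id: snitch.id)).to be_empty
    # This is the assertion that catches an unscoped delete_all.
    expect(Period.where(snitch_id: other.id)).not_to be_empty
  end
end
```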
Monitoring failures can mean lost sleep, lost time, and added stress to an already stressful job. As an operations person I am well aware of the trouble a malfunctioning system can cause. I am very sorry for the chaos and confusion caused by our failings. We very much see Friday’s issues as a failure of our development process, and we are taking steps to improve that process.
Should we have future issues, the best way to be notified is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.
- Chris Gaffney
[i] Collective Idea
On Sunday, August 10th we had a major outage receiving check-ins from your snitches. During the outage we failed to receive and process snitch check-ins for about 4 hours and mistakenly sent out a large number of notifications claiming systems had failed to check in. I’m very sorry for any chaos, confusion, and wasted time this caused.
On August 5th we deployed a new service for handling snitch check-ins. The service had been in development and testing for several months with the goal of improving reliability. The deployment went smoothly and the service has met most of our expectations. One notable change is that check-ins are now queued for processing rather than handled immediately.
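Conceptually, the new flow looks something like the sketch below, using the Bunny RabbitMQ client. The queue name, payload shape, and processing step are simplified for illustration and are not our actual implementation:

```ruby
# Conceptual sketch of queue-based check-in handling using the Bunny gem.
require "bunny"
require "json"

connection = Bunny.new(ENV["RABBITMQ_URL"])
connection.start

channel = connection.create_channel
queue   = channel.queue("check_ins", durable: true)

# Receiving a check-in now just publishes a message...
channel.default_exchange.publish(
  { token: "example-token", received_at: Time.now.utc }.to_json,
  routing_key: queue.name,
  persistent:  true
)

# ...and a separate worker consumes and processes check-ins from the queue.
queue.subscribe(manual_ack: true, block: false) do |delivery_info, _properties, body|
  check_in = JSON.parse(body)
  # Look up the snitch by token, record the check-in, mark it healthy, etc.
  channel.ack(delivery_info.delivery_tag)
end
```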
So what happened?
On August 10th at 6:00 EDT there was a cluster error on our backing RabbitMQ server caused by one of the nodes running out of disk space. This caused all of our connections to RabbitMQ to become blocked and snitch check-ins to be queued for retry. At 6:35 EDT the service stopped processing snitch check-ins entirely.
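For background, when a RabbitMQ node runs low on disk (or memory) it raises a resource alarm and blocks every publishing connection until the alarm clears. Bunny exposes this through connection-level callbacks; a minimal sketch of detecting the condition might look like the following (notify_on_call is a placeholder, not our actual handling):

```ruby
# Sketch: detecting RabbitMQ resource alarms via Bunny's blocked-connection
# callbacks. notify_on_call is a placeholder, not a real API.
connection = Bunny.new(ENV["RABBITMQ_URL"])
connection.start

connection.on_blocked do |blocked|
  # The broker hit a resource alarm (e.g. low disk); publishes will be
  # held until it clears, so surface the condition immediately.
  notify_on_call("RabbitMQ blocked publishers: #{blocked.reason}")
end

connection.on_unblocked do
  notify_on_call("RabbitMQ resource alarm cleared; publishing resumed")
end
```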
At 8:50 EDT the first of our alarms was triggered as the servers running the collection service began to run out of memory. This was our first indication that anything was wrong. We quickly diagnosed the issue as the blocked connections to RabbitMQ and contacted the support team for our RabbitMQ service provider.
At 9:25 EDT our service provider fixed the failing node and unblocked all publishers. We restarted the servers running the collection service, and service was partially recovered by 9:30 EDT. By 9:40 EDT the workers had finished handling all outstanding messages in the queue.
Our next steps
1. We were not aware of the issue until it had been going on for some time. We have identified a few key metrics and added alerting around them. We will continue to improve our alerting and metrics so we can react more quickly the next time something like this occurs.
2. We could have done a better job of communicating during the issue. In addition to Intercom and Twitter, we have added a status page at status.deadmanssnitch.com.
3. The code for failure handling and retry has proven buggy. We will be reviewing it and addressing the issues we discovered when dealing with blocked connections.
4. False positive alerts can cause a lot of trouble. We want to avoid them as much as possible, as we know the kind of havoc, confusion, and 4am emergency calls they can cause. We’re looking at ways to eliminate any single points of failure in the chain from the moment a check-in is received until it has been stored and processed.
We want you to be able to depend on Dead Man’s Snitch as much as we do. We know the trouble it can cause when monitoring is not working as expected. I am very sorry for any trouble the extra notifications caused. We are working hard to make sure Dead Man’s Snitch is a tool you can continue to rely on.
The best way to be notified of any issues we are having is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.
- Chris Gaffney
[i] Collective Idea