On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized. I am incredibly sorry for the chaos and confusion this caused.
On March 6th at 9:30 EST we deployed a change that decoupled two of the models in our system (historical periods and check-ins). At 9:45 EST a user triggered an unscoped deletion of all historical period records when they changed the interval of their snitch.
We were alerted at 9:50 EST and immediately disabled our alerter process to avoid further confusion. We began diagnosing the cause and at 10:50 EST deployed a fix for the unscoped delete. Our next step was to restore the missing data from our backups. We decided to keep the system live, which meant using a slower but more accurate restore process to avoid conflicts with data written while the system kept running.
At 17:30 EST we finished the restoration of most of the historical data and ran a set of data integrity checks to ensure everything was in a clean state. We sent out one final set of “reporting” alerts for any snitches that were healthy but thought to be failed.
We use a pull request based development process. Whenever a change is made it is reviewed by another developer and then merged by the reviewer. It’s common to make several revisions to a change before it is merged.
In this case, the unscoped deletion was introduced as part of implementing a suggestion to reduce the number of queries made during an interval change. While making the change, the scoping that limited the delete to a single snitch's periods was accidentally removed. The code was reviewed but the scoping issue was missed on final review.
Additionally, we have an extensive test suite in place that gives us confidence when we make large changes to the system. Our tests did not uncover this issue since the unscoped delete satisfied our testing conditions.
1. We have reviewed our use of destructive operations that could be prone to scoping issues (e.g. Model.where(…).delete_all) and have found that this was the only instance of it left in our codebase.
2. We have reviewed our tests around destructive behavior and have added cases to ensure they only affect the records they should.
3. Our restore and recovery process took much longer than we would like. We developed a set of tools for checking data integrity while we waited for the restore to finish and we will be fleshing these out further and making them a part of our normal maintenance routine. Lastly we will be planning and running operations fire drills to improve our readiness for cases like this.
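To illustrate the kind of test case described in step 2, here is a minimal sketch in Python using an in-memory SQLite database. It is not our production code. The `periods` table, its `snitch_id` column, and all function names are hypothetical stand-ins. The key idea is that the fixture contains periods for a second snitch, so an unscoped delete would fail the check.

```python
import sqlite3


def delete_periods_for_snitch(conn, snitch_id):
    """Delete historical periods for one snitch only (scoped delete)."""
    # The WHERE clause is the scoping; dropping it would wipe every snitch's history.
    conn.execute("DELETE FROM periods WHERE snitch_id = ?", (snitch_id,))


def build_fixture():
    """Create a hypothetical periods table holding rows for two different snitches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE periods (id INTEGER PRIMARY KEY, snitch_id INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO periods (snitch_id) VALUES (?)", [(1,), (1,), (2,)]
    )
    return conn


def remaining_snitch_ids(conn):
    """Return the snitch_id of every period row still present."""
    rows = conn.execute("SELECT snitch_id FROM periods ORDER BY id")
    return [row[0] for row in rows]


conn = build_fixture()
delete_periods_for_snitch(conn, 1)
remaining = remaining_snitch_ids(conn)

# Checking only that snitch 1 has no periods left would also pass for an
# unscoped delete. Checking that snitch 2's period survived is what catches it.
assert remaining == [2]
```

A test that asserts only on the records that should disappear is satisfied by deleting everything, which is how the original bug got through.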
Monitoring failures can mean lost sleep, lost time, and added stress to an already stressful job. As an operations person I am well aware of the trouble a malfunctioning system can cause. I am very sorry for the chaos and confusion caused by our failings. We very much see Friday’s issues as a failure of our development process and are taking steps to improve that process.
Should we have future issues, the best place to get notified is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.
- Chris Gaffney
[i] Collective Idea
Anyone who’s had 10s or 100s of snitches hasn’t had an easy way to organize their dashboards. We’re excited to announce that we released snitch tags today! Tags allow you to easily group snitches together based on projects, clients, environments, whatever! They automatically group by status so you don’t have to scroll through your entire dashboard to find the failures.
Creating Snitch Tags
Add tags to a snitch when you create or edit a snitch.
Tags are displayed on each snitch install page for easy editing.
Snitches are automatically grouped by tag and status.
Try them out and let us know what you think!
Change Snitch Intervals Anytime
Until now, there wasn’t a way to change a snitch’s interval. You had to delete your snitch and create a new one with the new interval.
Now it’s super easy:
Don’t forget—it matters when you ping your snitch for the first time!
15 & 30 Minute Intervals
We’re excited to announce that your #1 most requested feature, 15 and 30 minute intervals, is now live! Monitoring tasks that run multiple times per hour is now easier than ever. You’ll also receive alerts up to 4x faster than before. Enjoy!
The dashboard needed some love. We redesigned your dashboard to make it even easier (and faster) to manage your snitches. It doesn’t stop here, though. We’ll be introducing new features and functionality in the near future. Stay tuned!
We carried the new UI into our beloved iPhone app. Push notification alerts are available if email alerts aren’t enough! The app is free to download here.
Never heard of Dead Man’s Snitch but use cron or Heroku Scheduler? Give it a try free!
On Sunday, August 10th we had a major outage receiving check-ins from your snitches. During the outage we failed to receive and process snitch check-ins for about 4 hours and mistakenly sent out a large number of notifications claiming systems had failed to check in. I’m very sorry for any chaos, confusion, and wasted time this caused.
On August 5th we deployed a new service for handling snitch check-ins. The service had been in development and testing for several months with the goal of improving reliability. The deployment went smoothly and it has met most of our expectations. One notable change is that we now queue snitch check-ins for processing instead of handling them immediately.
So what happened?
On August 10th at 6:00 EDT there was a cluster error on our backing RabbitMQ server because one of the nodes ran out of disk space. This caused all of our connections to RabbitMQ to become blocked and snitch check-ins to be queued for retry. At 6:35 EDT the service stopped processing snitch check-ins entirely.
At 8:50 EDT the first of our alarms was triggered as the servers running the collection service began to run out of memory. This was our first indication that anything was wrong. We quickly diagnosed the issue as the blocked connections on RabbitMQ and contacted the support team for our RabbitMQ service provider.
At 9:25 EDT our service provider fixed the failing node and unblocked all publishers. We restarted the servers running the collection service and service was partially recovered by 9:30 EDT. By 9:40 EDT the workers had finished handling all outstanding messages in the queue.
Our next steps
1. We were not aware of the issue until it had been ongoing for some time. We have identified a few key metrics and added alerting around them. We will continue to improve our alerting and metrics so we can react more quickly the next time something like this occurs.
2. We could have done a better job communicating the issues. In addition to Intercom and Twitter we have added a status page at status.deadmanssnitch.com.
3. The code for failure handling and retry has proven buggy. We will be reviewing it and addressing the issues we discovered when dealing with blocked connections.
4. False positive alerts can cause a lot of trouble. We want to avoid that as much as possible as we know the kind of havoc, confusion, and 4am emergency calls it can cause. We’re looking at ways to eliminate any single points of failure in the chain from receiving a snitch check-in until it has been stored and processed.
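As an illustration of the kind of metric mentioned in step 1, one option is to track how long it has been since a check-in was last processed and raise an alarm once that silence exceeds a threshold. The sketch below is only an example under assumptions: the function name, the 10 minute threshold, and the choice of metric are hypothetical and are not a description of the alerting we actually added.

```python
from datetime import datetime, timedelta


def processing_stalled(last_processed_at, now, max_silence=timedelta(minutes=10)):
    """Return True when no check-in has been processed for longer than max_silence.

    The 10 minute default is an assumed value chosen for illustration only.
    """
    return now - last_processed_at > max_silence


# Using the timeline above: processing stopped at 6:35 EDT.
last_processed = datetime(2014, 8, 10, 6, 35)
stalled_soon_after = processing_stalled(last_processed, datetime(2014, 8, 10, 6, 40))
stalled_later = processing_stalled(last_processed, datetime(2014, 8, 10, 6, 50))
```

With these assumed numbers the check at 6:40 is still quiet, while the check at 6:50 flags a stall. A threshold like this needs tuning against normal traffic so that quiet periods do not trigger false alarms.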
We want you to be able to depend on Dead Man’s Snitch as much as we do. We know the trouble it can cause when monitoring is not working as expected. I am very sorry for any trouble the extra notifications caused. We are working hard to make sure Dead Man’s Snitch is a tool you can continue to rely on.
The best place to get notified of any issues we are having is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.
- Chris Gaffney
[i] Collective Idea
Send all billing related emails directly to your accounting department. If left blank, billing emails will be sent to the Account Email. Enter your password to confirm your changes.
Two questions we often get are:
“I know my job failed but I haven’t received an alert yet—what’s going on?”
“I received an alert that said my job hasn’t checked in, then, ten minutes later I received a second email saying my job checked in. Why is this?”
Both have to do with what Dead Man’s Snitch refers to as periods and their relationship with snitch intervals. Each interval has a period start and end time.
At the end of each period Dead Man’s Snitch looks back to see if the job checked in during the period. If it did, great. If it didn’t, an alert email is sent.
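The look-back described above can be sketched in a few lines of Python. This is a simplified model and not the actual Dead Man’s Snitch implementation. The function names are hypothetical, and it assumes a period includes its start time but not its end time.

```python
from datetime import datetime, timedelta


def checked_in_during(check_ins, period_start, period_end):
    """Return True if any check-in falls inside [period_start, period_end)."""
    return any(period_start <= t < period_end for t in check_ins)


def alert_needed(check_ins, period_end, interval):
    """At the end of a period, look back over it and decide whether to alert."""
    period_start = period_end - interval
    return not checked_in_during(check_ins, period_start, period_end)


HOUR = timedelta(hours=1)
check_ins = [datetime(2015, 3, 6, 13, 55)]  # one check-in at 1:55 PM (arbitrary date)

# The 1-2 PM period contains the check-in, so no alert is needed when it ends.
alert_at_2pm = alert_needed(check_ins, datetime(2015, 3, 6, 14, 0), HOUR)
# The 2-3 PM period has no check-in, so an alert is needed when it ends.
alert_at_3pm = alert_needed(check_ins, datetime(2015, 3, 6, 15, 0), HOUR)
```

Here `alert_at_2pm` is `False` and `alert_at_3pm` is `True`. The decision is only ever made at a period boundary, which is why the timing of the ping matters.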
It Matters When You Ping Your Snitch
The closer a snitch is pinged to the end of a period, the sooner you will be alerted after a job fails. Take a look at this daily snitch. Notice how the email is sent soon after the missed check-in.
If you ping your snitch after a new period starts and your snitch fails to check in, you won’t receive an alert email until after the next period has ended. Notice the extra time between the snitch failure and the alert email.
After the first alert email, DMS will continue to send one email per failed period until the snitch is paused or pinged again. As soon as the snitch is pinged an email notification will be sent to let you know the snitch is reporting again.
Once your snitch has been pinged for the first time you’ll be able to view its history on the Activity page. The Activity page shows whether or not the snitch checked in during each period and the exact time of each check-in. The most recent period is shown first. For demonstration purposes, the screenshot below has additional notes on the right-hand side to explain the timestamps. For your convenience, we also display check-in times in your current timezone in a tooltip when you hover over the timestamps.
More Control Over Check-In Times
We recognize the current intervals and periods aren’t great for everyone. As we’ve grown we’ve received great feedback and requests for more control over snitch check-in times and intervals. We’re listening and couldn’t agree more. We plan to release both of these updates in the future.
Keep the right people (or the whole team) informed.
The first thing a lot of users do is create an hourly test Snitch to see how Dead Man’s Snitch works. Don't worry, you can delete it at any time.
1. Name your Snitch and set the interval to Hour. You can set a Snitch-specific email address if it's different than your account email. Click Save.
2. The next page displays your unique Snitch URL and methods for installing it: cURL, email, and Ruby. To kick off the checking process, request the URL using any of these three methods. The most common and basic method is cURL, which would look something like:
$ run_backups_or_something && curl https://nosnch.in/c2354d53d2
For the purpose of the test Snitch, cURL only the Snitch URL in terminal or paste the Snitch URL in your browser.
3. Head back to your dashboard. Your Snitch status light should now be a green checkmark. Since you're checking in your Snitch manually, it won't be pinged in the next hour (unless you do it manually again) and you’ll receive an alert email.
Note: Hourly snitches check in on the hour, every hour. Keep this in mind when you hit the Snitch URL since the first alert may take longer than you expected. For example: If you hit the URL for the first time at 1:55 PM and your snitch doesn’t check in between 2-3 PM, you’ll receive an alert around 3:01 PM. However, if you hit the URL for the first time just after the hour, say, 2:10 PM and your snitch doesn’t check in (from 3-4 PM), you won’t receive an alert until 4:01 PM. Since the URL was hit after the hour, the first full period is from 3-4 PM. Your snitch will continue to check in every hour, on the hour thereafter. This blog post goes into more detail on snitch check-ins and alerts.
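The arithmetic in the note can be worked through with a short Python sketch. It assumes periods are aligned to the interval (on the hour for hourly snitches) and that the first alert comes at the end of the first full period after the last check-in. The function names and the date are hypothetical, and the sketch returns the period end itself, while the actual email arrives about a minute later.

```python
from datetime import datetime, timedelta


def period_start(moment, interval):
    """Round moment down to the start of its interval-aligned period."""
    midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    periods_elapsed = (moment - midnight) // interval  # whole periods since midnight
    return midnight + periods_elapsed * interval


def earliest_alert(last_check_in, interval):
    """End of the first full period after the one containing the last check-in."""
    # The period containing the check-in is healthy. The following period is
    # the first one that can be missed, and the alert goes out when it ends.
    return period_start(last_check_in, interval) + 2 * interval


HOUR = timedelta(hours=1)
pinged_before_the_hour = earliest_alert(datetime(2015, 1, 1, 13, 55), HOUR)  # 1:55 PM
pinged_after_the_hour = earliest_alert(datetime(2015, 1, 1, 14, 10), HOUR)   # 2:10 PM
```

Under these assumptions the 1:55 PM ping gives a period end of 3:00 PM and the 2:10 PM ping gives 4:00 PM, matching the alert times in the note above.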
We are excited to announce that our new plans are live! Since acquiring Dead Man’s Snitch back in July we quickly learned that two plans weren’t enough for our wide range of users. Since then, we’ve collected great feedback from users to help shape the new plans.
We’re excited about the future developments of Dead Man’s Snitch and hope you are too. To complement the new plans, extra features are in the pipeline and Enhanced Intervals will be available soon!