We're extremely excited to announce that we now support sending alerts to PagerDuty! PagerDuty, an Incident Management System for IT Teams, provides alerting, on-call scheduling, escalation policies and incident tracking to increase uptime of your apps, servers, websites and databases. Dead Man's Snitch alerts will show up as incidents and your on-call team members will know immediately if your scheduled jobs go missing.
If there is a service you would love to see us add support for, please let us know!
We're extremely excited to announce that we now support sending alerts to Slack! Slack is a collaborative, real-time messaging app that brings all of your team’s communications together in one place. Dead Man's Snitch alerts will show up in your team's Slack channel, making it easier to collaborate with your team as soon as your scheduled jobs go missing.
If there is a service you would love to see us add support for, please let us know!
We're excited to announce that the first ever Dead Man's Snitch Android app is available for download! As an extension of your account you'll be able to:
Let us know what you think! If you're an iPhone user, our app just got a refresh, including swipe to pause and landscape mode.
You may have noticed that new snitches now send an email when your job checks in for the first time.
Don't worry, you won't receive an email for every check-in. We just want you to have confidence that your job and snitch are set up correctly.
Some customers may be required to include additional billing-related information on their receipts for accounting purposes.
Now you can easily add this information under your Account settings and it will be displayed on all of your receipts.
We've recently rolled out a couple of small features to make setting up your snitches easier.
Dead Man's Snitch lets you set up each of your snitches to send alerts to a unique email address. You could even send alerts to more than one email address if you separated the addresses with commas, as shown: "firstname.lastname@example.org, email@example.com".
It's easy to mistype an address, however. To help out, DMS will now autocomplete your email addresses. Any email used in your account will show up as an option in the autocomplete.
A similar autocomplete is now available on your snitch tags as well.
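As a rough illustration of the two ideas above, here is a minimal Ruby sketch that splits a comma-separated recipient string and suggests previously used values by prefix. The helper names `parse_recipients` and `suggest` and the example.com addresses are hypothetical, and prefix matching is an assumption; this is not how DMS implements its autocomplete.

```ruby
# Split a comma-separated recipient string into clean addresses.
def parse_recipients(raw)
  raw.split(",").map(&:strip).reject(&:empty?)
end

# Suggest previously used values (emails or tags) that start with
# whatever has been typed so far. Matching is case-insensitive.
def suggest(known, typed)
  prefix = typed.strip.downcase
  return [] if prefix.empty?

  known.select { |value| value.downcase.start_with?(prefix) }.uniq.sort
end

# Hypothetical addresses already used in an account.
known_emails = ["ops@example.com", "oncall@example.com", "dev@example.com"]

parse_recipients("ops@example.com, dev@example.com")
# => ["ops@example.com", "dev@example.com"]
suggest(known_emails, "o")
# => ["oncall@example.com", "ops@example.com"]
```

The same `suggest` helper works for tags, since it only compares strings.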
We've also expanded the snitch check-in setup page.
Since we now allow you to check in via email, we added directions explaining what email address to send your check-ins to.
We also added an example of checking in from Ruby using the Snitcher gem.
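For readers who want to see the shape of a check-in, the sketch below sends one as a plain HTTPS request using only Ruby's standard library instead of the Snitcher gem, so it runs without extra dependencies. The `https://nosnch.in` base URL, the `abc123` token, and the `checkin_uri` and `check_in` helper names are assumptions for illustration; copy the real check-in URL from your snitch's setup page.

```ruby
require "net/http"
require "uri"

# Assumed base URL for check-ins; use the URL shown on your setup page.
CHECKIN_BASE = "https://nosnch.in/"

# Build the check-in URI for a snitch token.
def checkin_uri(token)
  URI.join(CHECKIN_BASE, token)
end

# Send the check-in. Returns true on a 2xx response, false on any
# other response or on a network error.
def check_in(token)
  response = Net::HTTP.get_response(checkin_uri(token))
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Typical use: check in only after the job itself has succeeded, e.g.
#   check_in("abc123") if run_nightly_backup   # run_nightly_backup is hypothetical
```

Checking in only after the job succeeds means a failed or skipped run leaves the snitch without a check-in, which is what triggers the alert.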
We plan to add examples in more languages soon. If there are any languages you'd like to see on the setup page, let us know at firstname.lastname@example.org.
We hope these improvements make setting up your snitches easier.
We recently decided to move our blog off of Tumblr and into our main site. We wanted our blog to feel more like part of the main site, and an integrated blog is also more SEO-friendly.
We thought about publishing our blog as static files with Jekyll. However, it quickly became clear that sharing our main HTML layout file between Jekyll and our main Rails app was not going to be elegant. The layout file needed to be duplicated, making it tedious to keep up to date.
We only need a simple blog, so we ditched Jekyll and started to write vanilla Markdown files we could just render in Rails within our main HTML layout. It seemed to be working well and we were about to migrate all of our posts when someone pointed out a new kid on the block: ButterCMS.
To top it off, ButterCMS setup is dead simple.
1. Sign up.
2. Add their code library to your Rails or Django project.
3. Copy over your API token.
ButterCMS has been super supportive during our migration. Right now, they are advertising that they'll import your existing blog for free. They imported our Tumblr blog, and it came over fine.
We had to tweak ButterCMS's default views a bit, but we're happy that we don't have to manage our blog data. ButterCMS provides a nice little editor.
We are glad to now have a cleanly implemented blog on our main site. We plan to be posting more DMS tips and tricks, as well as showcasing some of the creative applications people are using DMS for.
In the meantime, if you're looking to integrate an SEO-friendly blog into your application, check out ButterCMS.
We’re excited to announce the ability to check in your snitches via email! Email check-ins have many use cases and are great for things like getting an email when your server goes down.
Get your snitch email address from the "Email" tab in the left column of the snitch install page. Alerts and check-ins work the same as using curl or the Snitcher gem.
Please note: Checking in via email makes it easier to use Dead Man’s Snitch in situations where HTTPS isn’t feasible, though there are some caveats. While email is reliable, messages can easily be delayed. Between the time an email is sent and the time it’s received, it usually goes through several intermediaries which may spool, delay, retry, or redirect it before it finally arrives at Dead Man’s Snitch. Because check-ins are time sensitive, be aware that false alerts could occur if a snitch checks in towards the end of its period. For this reason, we suggest using HTTPS if you can and using email only when it’s the only option.
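To make the timing caveat concrete, here is a small Ruby sketch of a simplified model in which a check-in only counts if it arrives before the end of the snitch's period. That model, the `arrives_in_time?` helper, and the example times and delays are assumptions for illustration, not a description of how DMS evaluates check-ins internally.

```ruby
# Simplified model (an assumption): a check-in counts only if it
# arrives at or before the deadline for the current period.
def arrives_in_time?(sent_at, delay_seconds, deadline)
  sent_at + delay_seconds <= deadline
end

deadline = Time.utc(2015, 1, 1, 1, 0, 0) # hypothetical end of the period
sent_at  = deadline - 120                # job checks in 2 minutes before it

arrives_in_time?(sent_at, 1, deadline)   # HTTPS request taking ~1 second => true
arrives_in_time?(sent_at, 300, deadline) # email spooled for 5 minutes    => false
```

The job ran on time in both cases; only the transport delay differs, which is why a late-in-period email check-in can produce a false alert.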
We made a small update to our tooltips! Originally we only displayed check-in times in UTC. Some of our users shared with us that this was inconvenient and asked if we could convert the timestamps to their local time.
Now, all tooltips convert snitch check-in timestamps to the user’s local time zone and display the converted time on hover.
These tooltips are available on your dashboard…
…and the individual snitch activity page.
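The conversion itself is simple to sketch. The Ruby snippet below takes a UTC timestamp and renders it at a given UTC offset. How the tooltips are actually implemented is not described here, so the `local_checkin_time` helper, the ISO 8601 input format, and the output format are all assumptions for illustration.

```ruby
require "time"

# Parse an ISO 8601 UTC timestamp and render it at the given UTC
# offset (for example "-05:00"), including the offset in the output.
def local_checkin_time(utc_string, utc_offset)
  Time.iso8601(utc_string).getlocal(utc_offset).strftime("%Y-%m-%d %H:%M %:z")
end

local_checkin_time("2015-03-06T14:30:00Z", "-05:00")
# => "2015-03-06 09:30 -05:00"
```

Keeping the stored timestamp in UTC and converting only for display avoids ambiguity when team members are in different time zones.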
We appreciate your feedback. If there’s something you would like to see added, reach out to us anytime!
On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized. I am incredibly sorry for the chaos and confusion this caused.
On March 6th at 9:30 EST we deployed a change that decoupled two of the models in our system (historical periods and check-ins). At 9:45 EST a user triggered an unscoped deletion of all historical period records when they changed the interval of their snitch.
We were alerted at 9:50 EST and immediately disabled our alerter process to avoid further confusion. We began diagnosing the cause and at 10:50 EST deployed a fix for the unscoped delete. Our next step was to restore the missing data from our backups. We decided to keep the system live and to use a slower but more accurate process to restore the data due to possible conflicts created by keeping the system running.
At 17:30 EST we finished the restoration of most of the historical data and ran a set of data integrity checks to ensure everything was in a clean state. We sent out one final set of “reporting” alerts for any snitches that were healthy but thought to be failed.
We use a pull request based development process. Whenever a change is made it is reviewed by another developer and then merged by the reviewer. It’s common to make several revisions to a change before it is merged.
In this case, the unscoped deletion was introduced as part of implementing a suggestion to reduce the number of queries made during an interval change. When making the change, the scoping that limited the deletion to a single snitch's periods was accidentally removed. The code was reviewed, but the scoping issue was missed on final review.
Additionally, we have an extensive test suite in place that gives us confidence when we make large changes to the system. Our tests did not uncover this issue since the unscoped delete satisfied our testing conditions.
1. We have reviewed our use of destructive operations that could be prone to scoping issues (e.g. Model.where(…).delete_all) and have found that this was the only instance of it left in our codebase.
2. We have reviewed our tests around destructive behavior and have added cases to ensure they only affect the records they should.
3. Our restore and recovery process took much longer than we would like. We developed a set of tools for checking data integrity while we waited for the restore to finish and we will be fleshing these out further and making them a part of our normal maintenance routine. Lastly we will be planning and running operations fire drills to improve our readiness for cases like this.
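The difference between a scoped and an unscoped delete, and why a test can miss it, can be sketched with a small in-memory stand-in for a table. Everything in this Ruby example (`Period`, `PeriodStore`, `sample_store`, `safe_delete?`) is hypothetical and is not our production code; it only illustrates the kind of check described in item 2.

```ruby
# Hypothetical in-memory stand-in for a table of historical periods.
Period = Struct.new(:id, :snitch_id)

class PeriodStore
  def initialize(periods)
    @periods = periods
  end

  def count_for(snitch_id)
    @periods.count { |period| period.snitch_id == snitch_id }
  end

  # Unscoped: the filter is missing, so every record is removed.
  def delete_all_unscoped
    @periods.clear
  end

  # Scoped: only the given snitch's records are removed.
  def delete_for(snitch_id)
    @periods.reject! { |period| period.snitch_id == snitch_id }
  end
end

def sample_store
  PeriodStore.new([Period.new(1, "a"), Period.new(2, "a"), Period.new(3, "b")])
end

# A check that only asserts "snitch a has no periods left" passes for
# both deletes. This one also asserts that another snitch's records
# are untouched, so it fails for the unscoped version.
def safe_delete?(store, snitch_id, other_id)
  others_before = store.count_for(other_id)
  yield store
  store.count_for(snitch_id).zero? && store.count_for(other_id) == others_before
end

safe_delete?(sample_store, "a", "b") { |store| store.delete_for("a") }       # => true
safe_delete?(sample_store, "a", "b") { |store| store.delete_all_unscoped }   # => false
```

The key design choice is that the test data includes records that should survive; without them, an over-broad delete satisfies the same assertions as a correct one.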
Monitoring failures can mean lost sleep, lost time, and added stress to an already stressful job. As an operations person, I am well aware of the trouble a malfunctioning system can cause. I am very sorry for the chaos and confusion caused by our failings. We very much see Friday’s issues as a failure of our development process and are taking steps to improve that process.
Should we have future issues, the best way to get notified is to subscribe to notifications at status.deadmanssnitch.com or to follow us on Twitter.
- Chris Gaffney
[i] Collective Idea