November 08, 2017
by Kelly Dunn

On-call and Response - How We Handle Service Outages at Simple

Since January 2017, we've substantially reduced the number of Service Outages at Simple and improved the reliability of our services by 60 percent! I want to share the techniques we've adopted to improve reliability, reduce stress while on-call, and strengthen our platform at Simple.

Here at Simple, we take pride in shipping the highest quality software possible. We do our best to participate in regular code review, write and run unit tests, perform exhaustive integration tests, and vet changes in a separate environment from production before deploying.

Despite these best efforts, our systems fail. In the second quarter of 2016, on-call engineers were exhausted from handling dozens of pages in the middle of the night. Seemingly innocuous web services would end up taking down the entire site for hours at a time on a regular cadence, sometimes once a week per service. Engineers, Customer Relations Agents, and our Customers were all frustrated with these outages. I personally felt the tension when trying to resolve these issues; the root cause was always just out of reach, and right before I could get to the bottom of an incident, a new one hit my pager, demanding my immediate attention.

We were also in the midst of a company reorganization: clear ownership of services was in flux, and business-critical web applications lacked sufficient attention and documentation, all while the Engineering Organization was shipping new features and migrating customers to BBVA.

After a prolonged period of stress that reached a boiling point, we formed a team of like-minded engineers who wanted to resolve these outages with data, transparency, and cross-team collaboration. We had advocates on the engineering management team who created space to work on these issues and promoted the value of the enablement that came out of our work. And after a few months of deep dives into our systems, working in close partnership with our fellow product-team engineers, we were able to substantially improve the quality of service of our app as well as the quality of life for on-call engineers at Simple.

I'm proud to say that since January 2017, we've substantially reduced the number of Service Outages at Simple, improving the reliability of our services by 60 percent! This is a huge step forward from where we were at the beginning of the year, and it's all due to the hard work and dedication of our on-call engineers and managers. We look forward to improving our systems and our response to these outages every day. We learned a great deal during this transition, and I want to share some of the techniques we've adopted to help improve reliability, reduce stress while on-call, and progressively improve our platform here at Simple.

Be Blameless

“Failure is Natural” is a key philosophical concept we embrace here at Simple. Instead of getting frustrated with outages, or doubting our own ability to respond, we prefer to take action and ask for help when we need it. When an outage happens, we stay objective, collect data, and use it to guide us towards understanding the root of the issue.

We keep track of these Service Outages each time they happen, sharing what data we can find and treating each new outage as a snapshot of symptoms. The goal is that, over time, we can find the root cause of the problem and collaborate on a more suitable architectural improvement. This rigorous documentation also helps with response; we use the log of incidents as a record of the steps we've taken for similar outages in the past. Should we ever be alone with a service outage at 3AM, we have a rich history of previous incidents and playbooks to guide us to resolution.

Every Friday morning, we start the day with our weekly incident-review meetings, or as we call them internally, "On-call Roundups". We use this meeting to go over every page handled by our on-call representatives. We focus on the pages, rather than the person, with the guiding goal of eliminating pain and making on-call easier in the future. We also talk about our service outages, capturing what we did in our notes so that the next person on-call is prepared. The weekly cadence of this meeting forced us to take action: if the same page showed up meeting after meeting, it raised the question, "What can we do to solve this?" or "Is this page even useful?" Sometimes the answer was "we just need more information".

Seek Out The Unknown

One of the first tasks we took on when attempting to improve the state of on-call was to add consistent levels of visibility to our services. We needed data to keep ourselves objective and, of course, to provide the evidence necessary to document our outages. This included removing metrics and healthchecks that didn't provide much diagnostic value, centralizing our logs and increasing their granularity, and creating generic, low-level metrics dashboards measuring basic system resource utilization (e.g., RAM, disk, network, CPU) as well as application- and runtime-specific metrics (JVM GC, API counters, Worker Burndown Rate).
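To make that concrete, here is a minimal sketch of what application-level instrumentation like this can look like. It uses the Prometheus Python client purely as an illustration; the post doesn't prescribe a metrics stack, and the metric names, labels, and port below are hypothetical rather than Simple's actual configuration.

```python
# A minimal sketch of application-level instrumentation, using the Prometheus
# Python client as an illustration. Metric names, labels, and the port are
# hypothetical, not Simple's actual configuration.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# API counters: how many requests each endpoint handles, and how many fail.
API_REQUESTS = Counter(
    "api_requests_total", "API requests handled", ["endpoint", "status"]
)

# Request latency, so dashboards can chart percentiles over time.
API_LATENCY = Histogram("api_request_seconds", "API request latency", ["endpoint"])

# Worker burndown: how deep the background-job queue is right now.
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting in the worker queue")


def handle_request(endpoint: str) -> None:
    """Simulate handling one API request and record metrics about it."""
    with API_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
        status = "200" if random.random() > 0.05 else "500"
    API_REQUESTS.labels(endpoint=endpoint, status=status).inc()


if __name__ == "__main__":
    # Expose metrics on :8000/metrics for a scraper (e.g. Prometheus) to collect.
    start_http_server(8000)
    while True:
        QUEUE_DEPTH.set(random.randint(0, 25))  # pretend queue depth
        handle_request("/transactions")
```

Low-level resource metrics (RAM, disk, network, CPU) usually come from a host-level agent rather than the application itself; the point is that both kinds of data end up on the same dashboards, in the same language everyone uses during an outage.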

After a period of time, we were able to correlate outages with our newly added metrics. We did this by sharing the dashboards with other product teams and using them to provide diagnostic data during normal, everyday events like deploying or running an ad-hoc job. Using the same metrics for both outages and regular operations helped us all communicate in the same language.

These new tools enabled us to clear the fog of war around our more complex incidents, turning our "unknown unknowns" into "known unknowns". This effectively whittled away the surface area of complexity in our platform. As more people rotated through our on-call schedule, we gathered more and more data, which would either bolster our previous suspicions or invalidate unfounded beliefs about how the system actually worked.

Some Pain, All Gain

Collecting system and application metrics gives you a good sense of your application's health, but it's also important to consider the health of your on-call engineers. A brief blip on a Grafana dashboard can translate into days of pain and expedited work spread across many engineers and teams. To understand the brittleness, stubborn interfaces, and tightly coupled nature of your systems, you must also measure the human cost of these outages.

During our On-call Roundups, the host of the meeting takes diligent notes as our engineers recall the pages they handled over the last week. We use these notes as a log of anecdotal data, which gives us a rough idea of relative pain. We also capture what needs to change and where our preparation falls short, such as missing playbooks, tooling, or resilient architecture components.

Sometimes the simple act of having on-call representatives in the same room talking about the same outages can yield amazing education moments; everyone learns at the same rate, and our collective experiences temper our playbooks into comprehensive scripts for remediation. Celebrating wins becomes natural and regular. We celebrate when someone doesn’t get paged during their shift, and on-call roundups are a great time to recognize everyone’s hard work that leads to those joyous moments.

Over the past year, we've paired this anecdotal data with our outage-tracking tickets, and we've been able to make calculated decisions about how to best focus our efforts to reduce pain and increase reliability. We do this by tracking the necessary work in a Kanban system and promoting it to our product teams as needed. Actively tracking this work is key to making improvements in our systems. The best agent of change here is understanding that reliability is a critical part of your product's success and should be prioritized and assessed as regular product work.

A Few Ounces Of Prevention

In addition to keeping our weekly meetings and upholding rigor while on-call, we actively contribute to the following supporting components which keep us nimble, informed, and confident during outages:

Playbooks: A detailed overview of a particular service outage, with clear, concise detection, remediation, and post-remediation steps. Playbooks include the specific Unix commands, links, or steps to take to get our systems back to a working state. Sometimes calling a partner company or a support engineer is necessary.

Note that the playbooks are intended to be living documents, which evolve from describing the symptoms of an outage to eventually providing a cure. Add to these as you learn more about your specific outage.

Tooling: Sometimes an outage requires an on-call engineer to rerun a job, replay messages in a queue, or massage data back into a correct state. It's essential that these tools are easy to use, have a clear guide on how to use them, and are simple in nature. Eventually, these operations should be re-introduced as fixes to the upstream code base.
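As an illustration of what such a tool might look like, here is a minimal sketch of a script that replays dead-lettered messages back onto a work queue. It assumes a RabbitMQ broker and the pika client purely for concreteness; the post doesn't name Simple's queueing technology, and the queue names below are hypothetical.

```python
# A minimal, hypothetical "replay dead-lettered messages" tool.
# Assumes RabbitMQ and the pika client for concreteness; queue names are made up.
import argparse

import pika


def replay(dead_letter_queue: str, work_queue: str, limit: int) -> int:
    """Move up to `limit` messages from the dead-letter queue back onto the work queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    # Fail fast if the destination queue doesn't exist, rather than silently
    # dropping messages published to the default exchange.
    channel.queue_declare(queue=work_queue, passive=True)

    moved = 0
    for _ in range(limit):
        method, _properties, body = channel.basic_get(queue=dead_letter_queue)
        if method is None:  # dead-letter queue is empty
            break
        # Re-publish to the work queue, then ack the dead-lettered copy so the
        # same message isn't replayed twice.
        channel.basic_publish(exchange="", routing_key=work_queue, body=body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
        moved += 1

    connection.close()
    return moved


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Replay dead-lettered messages.")
    parser.add_argument("--dead-letter-queue", default="payments.dead_letter")
    parser.add_argument("--work-queue", default="payments.work")
    parser.add_argument("--limit", type=int, default=100)
    args = parser.parse_args()
    moved = replay(args.dead_letter_queue, args.work_queue, args.limit)
    print(f"Replayed {moved} message(s)")
```

A hard message limit, a dry-run mode, and a short usage guide go a long way toward making a tool like this safe to run at 3AM.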

Communication Protocol: Adopt a sound protocol of your own, or at least a dedicated communication channel, so you have a clear history of an incident when one does occur. Remember, your actions and findings during an outage quickly become educational material for your future self.

Internally, we use Curt Micol’s Incident Management Structure (IMS). It provides us with a clear chain of command during outages, an etiquette for communication, and a way of measuring the severity of an outage at Simple.

Post On-call Clean Up: Sometimes an engineer has a particularly heavy on-call shift. During that time, they may not be able to finish the toil, remediation work, or documentation for the outages they handled.

On the backend engineering team, we set aside an extra business day devoted to cleaning up any remaining work or tackling a ticket in our on-call ticket tracker. This may be longer or shorter as your organization sees fit, but the guiding principle is that you truly invest in making your platform better by giving your engineers sufficient time to clean up after being on-call.

Conclusion

Systems will fail, and pagers will ring, but ultimately it's the quality of our response and preparedness that enables us to restore the health of our platform. Improvement takes time, and your organization may require a different level of response. But if you stay blameless, let the data point you in the right direction, and actively listen to those holding the pagers, you have a good chance of improving morale as well as your reliability.

