Incident management for high-velocity teams
Every development, operations, and IT team knows that sometimes incidents happen.
Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook, all of which have lost tens of millions to incidents in the past five years.
This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that’s a promise no company can keep.
It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month.
When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you treat that extra headroom as an error budget: time that your team can use to take risks.
What is an error budget?
An error budget is the maximum amount of time that a technical system can fail without contractual consequences.
For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is 52 minutes and 35 seconds per year.
If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 45 minutes, and 57 seconds.
Why do tech teams need error budgets?
At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?
The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks.
As we explain in our SRE article,
“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”
The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.
How to use an error budget
First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.
Error budgets based on uptime
Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track.
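The monthly check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the names `BudgetStatus` and `monthly_budget_status` are hypothetical, and the month length is assumed to be one twelfth of a 365.25-day year. Releases are treated as frozen once the budget is met or exceeded.

```python
from dataclasses import dataclass

# Assumption: an "average" month, i.e. one twelfth of a 365.25-day year.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


@dataclass
class BudgetStatus:
    budget_minutes: float     # total downtime the target allows this month
    remaining_minutes: float  # budget left after observed downtime (may be negative)
    releases_allowed: bool    # False once the budget is met or exceeded


def monthly_budget_status(target_percent: float, downtime_minutes: float) -> BudgetStatus:
    """Compare observed monthly downtime with the budget implied by the uptime target."""
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")
    budget = (1 - target_percent / 100) * MINUTES_PER_MONTH
    remaining = budget - downtime_minutes
    return BudgetStatus(budget, remaining, remaining > 0)


if __name__ == "__main__":
    status = monthly_budget_status(target_percent=99.9, downtime_minutes=10)
    print(f"Budget: {status.budget_minutes:.1f} min, "
          f"remaining: {status.remaining_minutes:.1f} min, "
          f"releases allowed: {status.releases_allowed}")
```

In practice the downtime figure would come from whatever monitoring tool the team already uses; here it is passed in by hand.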
To use this method effectively, you’ll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1%, 0.5%, or 0.1% of allowed downtime actually translates to. Common targets include:
SLA target | Yearly allowed downtime | Monthly allowed downtime
---|---|---
99.99% uptime | 52 minutes, 35 seconds | 4 minutes, 23 seconds
99.95% uptime | 4 hours, 22 minutes, 48 seconds | 21 minutes, 54 seconds
99.9% uptime | 8 hours, 45 minutes, 57 seconds | 43 minutes, 50 seconds
99.5% uptime | 43 hours, 49 minutes, 45 seconds | 3 hours, 39 minutes
99% uptime | 87 hours, 39 minutes | 7 hours, 18 minutes
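For targets that are not in the table, the conversion is simple arithmetic: the allowed downtime is the fraction of time not covered by the target, multiplied by the length of the period. The Python sketch below shows one way to do it. The function names are hypothetical, and the 365.25-day year is an assumption; a different year length or rounding rule will shift the results by a few seconds, so they may not match the table exactly.

```python
DAYS_PER_YEAR = 365.25             # assumption: average year including leap days
DAYS_PER_MONTH = DAYS_PER_YEAR / 12
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(target_percent: float, period_days: float) -> float:
    """Return the downtime, in seconds, that an uptime target allows over a period."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return (1 - target_percent / 100) * period_days * SECONDS_PER_DAY


def format_duration(seconds: float) -> str:
    """Format a duration as hours, minutes, and whole seconds (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.99, 99.95, 99.9, 99.5, 99.0):
        yearly = allowed_downtime_seconds(target, DAYS_PER_YEAR)
        monthly = allowed_downtime_seconds(target, DAYS_PER_MONTH)
        print(f"{target}% uptime | {format_duration(yearly)} per year | "
              f"{format_duration(monthly)} per month")
```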
Error budgets based on successful requests
SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
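One possible reading of a request-based budget, sketched in Python: instead of minutes of downtime, the budget is the number of requests that may fail while the success-rate objective is still met. The function name, the dictionary keys, and the example numbers are all hypothetical, and how a "failed" request is defined (for example, whether client-side delays count) is left to your SLO.

```python
def request_error_budget(slo_percent: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize how much of a request-based error budget has been used."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of requests that may fail before the objective is missed.
    allowed_failures = (1 - slo_percent / 100) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means fully spent
    }


if __name__ == "__main__":
    summary = request_error_budget(slo_percent=99.9, total_requests=1_000_000, failed_requests=400)
    print(f"Allowed failures: {summary['allowed_failures']:.0f}")
    print(f"Remaining: {summary['remaining_failures']:.0f}")
    print(f"Budget consumed: {summary['budget_consumed']:.0%}")
```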