What is an error budget—and why does it matter?

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences.

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is 52 minutes and 35 seconds per year.

If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds per year. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 45 minutes, and 57 seconds per year.
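
As a worked check of the 99.99% figure, here is the arithmetic written out. The 365.25-day year is an assumption (the article does not state which year length it uses), and the last few seconds of any of these figures shift slightly if you assume 365 days instead.

```latex
\begin{aligned}
\text{error budget} &= (1 - 0.9999) \times 365.25 \times 24 \times 60 \ \text{minutes} \\
&= 0.0001 \times 525{,}960 \ \text{minutes} \\
&= 52.596 \ \text{minutes} \\
&= 52 \ \text{minutes},\ 35.76 \ \text{seconds}
\end{aligned}
```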

Why do tech teams need error budgets?

At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?

The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks.

As we explain in our SRE article,

“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track.
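
The release rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `AvailabilityWindow` class, the function names, and the 30-day window are all assumptions made for the example. Releases are treated as frozen when availability is at or below the target, which follows the "at or below the defined SLA" wording in the quote earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityWindow:
    """Downtime recorded over one measurement window (hypothetical structure)."""

    total_minutes: float     # length of the window, e.g. 30 days = 43,200 minutes
    downtime_minutes: float  # downtime observed so far in the window

    @property
    def availability_percent(self) -> float:
        return 100 * (1 - self.downtime_minutes / self.total_minutes)


def budget_remaining_minutes(window: AvailabilityWindow, target_percent: float) -> float:
    """Allowed downtime for the window minus the downtime already recorded."""
    allowed = window.total_minutes * (100 - target_percent) / 100
    return allowed - window.downtime_minutes


def releases_allowed(window: AvailabilityWindow, target_percent: float) -> bool:
    """Releases proceed only while availability is strictly above the target."""
    return window.availability_percent > target_percent


if __name__ == "__main__":
    # Assumed example: a 30-day month with 20 minutes of downtime, 99.95% target.
    month = AvailabilityWindow(total_minutes=30 * 24 * 60, downtime_minutes=20)
    print("releases allowed:", releases_allowed(month, 99.95))
    print("budget left (min):", round(budget_remaining_minutes(month, 99.95), 1))
```

Tracking the remaining minutes alongside the yes/no decision lets a team see how close it is to a freeze before it happens.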

To use this method effectively, you’ll need to translate your SLA/SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1%, 0.5%, or 0.1% of allowed downtime actually translates to. Common targets include:

SLA target	Yearly allowed downtime	Monthly allowed downtime
99.99% uptime	52 minutes, 35 seconds	4 minutes, 23 seconds
99.95% uptime	4 hours, 22 minutes, 48 seconds	21 minutes, 54 seconds
99.9% uptime	8 hours, 45 minutes, 57 seconds	43 minutes, 50 seconds
99.5% uptime	43 hours, 49 minutes, 45 seconds	3 hours, 39 minutes
99% uptime	87 hours, 39 minutes	7 hours, 18 minutes
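
If your target is not listed, the conversion from a percentage to a downtime allowance is easy to script. The sketch below is one possible way to do it. The function names are hypothetical, and it assumes a 365.25-day year with a month taken as one twelfth of that. Because the table does not state its year length or rounding rule, results can differ from the published figures by a few seconds.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # assumption; use 365 if your agreement defines the year that way


def allowed_downtime(uptime_percent: float, period_days: float) -> timedelta:
    """Return the downtime allowance for an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return timedelta(days=period_days * (100 - uptime_percent) / 100)


def format_duration(duration: timedelta) -> str:
    """Format a duration as hours, minutes, and whole seconds (fractions dropped)."""
    total_seconds = int(duration.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hours")
    if hours or minutes:
        parts.append(f"{minutes} minutes")
    parts.append(f"{seconds} seconds")
    return ", ".join(parts)


if __name__ == "__main__":
    for target in (99.99, 99.95, 99.9, 99.5, 99.0):
        yearly = allowed_downtime(target, DAYS_PER_YEAR)
        monthly = allowed_downtime(target, DAYS_PER_YEAR / 12)
        print(f"{target}% uptime\t{format_duration(yearly)}\t{format_duration(monthly)}")
```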


Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
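
The article does not spell out the arithmetic for a request-based budget, so the following is only a rough sketch of one common reading: the budget is the share of requests that are allowed to fail under the SLO. The function name, the returned keys, and the example numbers are all assumptions made for illustration.

```python
def request_error_budget(total_requests: int, failed_requests: int, slo_percent: float) -> dict:
    """Compare failed requests against the failures an SLO permits (illustrative only)."""
    if total_requests < 0 or failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("request counts must be non-negative and consistent")
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")

    # Number of failed requests the SLO tolerates for this volume of traffic.
    allowed_failures = total_requests * (100 - slo_percent) / 100
    success_rate = (
        100 * (total_requests - failed_requests) / total_requests
        if total_requests
        else 100.0
    )
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "success_rate_percent": success_rate,
    }


if __name__ == "__main__":
    # Assumed example: one million requests, 400 failures, 99.9% success target.
    budget = request_error_budget(1_000_000, 400, 99.9)
    for name, value in budget.items():
        print(f"{name}: {value:.2f}")
```

A negative `remaining_failures` value means the budget is spent, which under the policy described earlier would put releases on hold.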

Stay on top of SLAs to resolve requests based on priorities, and use automated escalation rules to notify the right team members and prevent SLA breaches with Jira Service Management.

What is an error budget—and why does it matter? | Atlassian (2024)