What Is SRE (Site Reliability Engineering)?

As technology advances, new roles keep emerging. One of them, the site reliability engineer, has been around for about 15 years. The discipline behind it, site reliability engineering (SRE), is a term coined at Google to describe how the company runs its production systems, and it has recently gained popularity.

Many companies are now advertising for site reliability engineer positions or trying to implement SRE. But with movements like DevOps also becoming more prominent, you may be wondering if it’s really necessary to hire a site reliability engineer at your company.

In fact, given SRE’s close resemblance to DevOps, there is an ongoing debate over what SRE is and why site reliability engineers play an important role.

In this article, we’re going to talk about what SRE is. To give you a better understanding of SRE, I will also discuss how it relates to DevOps and why you should consider adding a site reliability engineer to your team.

We’ll start with a little bit of the history of SRE. Then, we’ll define the term, talk about its most important aspects, and cover the main job responsibilities of an SRE. After that, as promised, we’ll clarify the relationship between DevOps and SRE, clearly defining where one ends and the other begins.

Finally, we’ll go in-depth into the burning questions. What are the reasons for adopting this role? Should startups do it? Should your organization do it? By the end of the post, you’ll have the knowledge to answer those questions and make the right decision for your organization. Let’s dig in.

The Origins of SRE

In 2003, Benjamin Treynor, the originator of the term SRE, was put in charge of running a production team consisting of seven engineers. The purpose of this production team was to make sure that Google websites were available, reliable, and as serviceable as possible.

Since Benjamin was a software engineer, he designed and managed the team the way he would have wanted it to work if he were a site reliability engineer himself. He asked the team to spend half their time on operations tasks so that they would better understand software in production. That team eventually became Google’s present-day SRE team.

As Benjamin puts it, one of the contributing factors for the idea behind SRE was the division between the product development and operations team.

Each of these teams has differing goals. On the one hand, the development team aims to launch new features and see how users adopt them. On the other hand, the operations team makes sure that the service doesn’t break. When each team has their own way of doing things, it becomes difficult to achieve business goals.

As it turned out, SRE became the paradigm to help manage Google’s large-scale systems as well as facilitate the continuous introduction of new features.

So What Is SRE?

SRE essentially involves creating a bridge between development and operations. SRE’s approach to this is to apply a software engineering mindset to system administration topics.

Since SRE is a relatively new concept, there is no consensus on what the site reliability engineer role entails or what exactly it is. A quick survey of job expectations and requirements from different job listings makes this evident.

To explain more on the site reliability engineer role, some Google engineers have written a book about SRE that you can read online for free. The book explains how Google handles SRE in their organization.

Note: although you will learn a lot from the book about how to implement SRE, this doesn’t necessarily mean that your company should copy Google’s exact methods. The main consideration should be your organization’s needs. For instance, a large organization’s implementation of SRE is not the same as that of a startup, especially in terms of affording a team for this role.

Important Aspects of SRE

Still not convinced your organization should adopt SRE? Let’s have a look at some aspects that set the site reliability engineer role apart from other roles.

Site reliability engineers collaborate with other engineers, product owners, and customers to come up with targets and measures. This helps ensure system availability. You easily know when action should be taken once you’ve agreed upon a system’s uptime and availability. This is done through service level indicators (SLIs) and service level objectives (SLOs).
SRE introduces error budgets that help you measure risk and consequently balance availability and feature development. Having an error budget means that failure is accepted as normal and that requiring 100 percent availability is not necessary. With no unrealistic reliability targets set, a team has the flexibility to deliver updates and improvements to a system.
SRE believes in reducing toil. Therefore, it aims at automating tasks that require a human operator to manually work on a system. For instance, Google caps operations work at 50 percent of each site reliability engineer’s time. That half covers the feeding and daily care of existing applications, and the remaining time is meant for engineering work such as coding and automation.
A site reliability engineer should have a holistic understanding of the systems as well as the connections between the systems.
Site reliability engineers have the task of ensuring the early discovery of problems to reduce the cost of failure.
Since the goal of SRE is to solve problems between teams, the expectation is that both the SRE teams and the development teams have a holistic view of libraries, front end, back end, storage, and other components. And shared ownership means that no single team can jealously own individual components.
Performance is another area that an SRE can help improve. SRE teams can act proactively and help organizations uncover performance bottlenecks across the system. That way, they can be solved before they make it into production and cause frustration for the end-users.
To improve availability and also to uncover and fix performance issues, SREs need to know what’s happening with their systems. That’s why monitoring is a key aspect of SRE. Thanks to monitoring, SRE teams can have a comprehensive and up-to-date view of how their systems are behaving and how healthy they are.
Incident response is a key component of SRE, which is also facilitated by an efficient monitoring strategy.
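
To make the SLI, SLO, and error budget ideas from the list above more concrete, here is a minimal Python sketch. It assumes a request-based availability SLO, and the class name, field names, and numbers are all hypothetical. It is not how Google or any particular tool computes an error budget, only one simple way to express the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based availability SLO (illustrative only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that counted as failures

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the SLO has been missed."""
        return self.allowed_failures - self.failed_requests

    def can_ship_risky_change(self) -> bool:
        """Simple policy: keep releasing while some budget is left."""
        return self.remaining > 0


# Hypothetical numbers: a 99.9% SLO over one million requests.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {budget.sli:.4%}, budget left: {budget.remaining:.0f} requests")
```

With a budget like this, the conversation between development and operations shifts from "is the system perfect?" to "how much of the agreed failure allowance is left?"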

The Main Job Responsibilities of an SRE

We’ve just covered some important aspects of the site reliability engineer role. But what about the day-to-day practical concerns of this job title? That’s what we’ll cover right now.

One key aspect of the SRE job is automation. If you have, or aspire to have, this job title, then “everything that is automatable should be automated” should be the motto you live by. Any SRE worth their salt should always be searching for inefficiencies to eliminate. When feeling the pain of dealing with a particularly troublesome, error-prone, or tedious task, the SRE should immediately look for ways to automate that task away, so developers and other professionals don’t have to feel that pain too.

Another essential component of the SRE role is monitoring—and everything that comes after, such as responding to incidents, should they occur, writing post mortems, and conducting investigations to find out the root cause of issues.

SREs are also responsible for everything related to the release of the software. It’s part of an SRE’s job to ensure that the software release process happens in a safe, predictable, repeatable, and efficient way. SRE professionals define guidelines, conventions, and best practices to guide the team and organization on how to deploy and monitor code. They’re also responsible for the services in production, helping the organization keep them available and maintain healthy processes for responding to incidents.

Is There a Relationship Between SRE and DevOps?

You may have noticed that there are a lot of similarities between SRE and DevOps. It can be especially confusing because both SRE and DevOps aim to bridge the gap between operations and development. We also see the practices behind these concepts playing an important role in scaling and automating processes.

But what sets SRE apart from DevOps? DevOps bridges the gap between operations and development by aligning key goals and initiatives, while SRE uses team-lead engineers who have an operations background and mindset to remove departmental communication problems.

Another major difference between the two is the focus on coding. DevOps focuses on creation and testing—this involves moving the code through the pipeline effectively and efficiently. On the other hand, SRE focuses on creating a balance between site reliability and the need for new features.

Although DevOps helps reduce the problematic gap between operations and development, it doesn’t define clearly how to accomplish these goals. SRE embodies DevOps philosophies and goes even further to include ways of achieving reliability through engineering and operations work.

In other words, and as Google puts it, “SRE implements DevOps.”

The 4 Key Metrics of SRE

Throughout the post, we’ve been talking about how SREs help organizations ensure availability and efficient incident response processes through the use of metrics and targets. But what are the metrics that SRE adopts? Is there an “official” set of metrics?

The answer is “yes,” though “official” might be too strong a word. There is a set of metrics often called “the four golden signals of monitoring.” The recommendation comes from Google itself: if you can only focus on a few metrics, make them these four.

Latency

As you’re probably aware, latency means the time it takes to respond to a request. It’s extremely important here to treat successful and unsuccessful requests differently. There are situations in which a failed request is served very quickly. You should keep those latency times out of the figure for successful requests. Otherwise, you might believe your average latency for successful requests is much lower than it actually is.

SREs can use latency to define baselines to compare future measurements against. They can define what a good latency looks like, and compare the latency of successful requests against that baseline. Taking such measurements across the whole system can help teams detect poorly performing services and fix them quickly.
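
As a rough illustration of treating successful and failed requests differently, the Python sketch below computes a latency percentile for each group separately. The input format (pairs of HTTP status code and latency in milliseconds), the function names, and the rule that any status of 400 or above counts as a failure are all assumptions made for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_outcome(requests, pct=95):
    """Report latency separately for successful and failed requests.

    `requests` is a list of (http_status, latency_ms) pairs. Treating any
    status >= 400 as a failure is an assumption for this sketch.
    """
    ok = [ms for status, ms in requests if status < 400]
    failed = [ms for status, ms in requests if status >= 400]
    return {"success": percentile(ok, pct), "error": percentile(failed, pct)}


# Hypothetical sample: the two failures return almost instantly.
sample = [(200, 120), (200, 150), (200, 180), (200, 900), (500, 5), (503, 7)]
print(latency_by_outcome(sample))
```

Keeping the two groups apart means fast failures cannot pull the success figure down, and an unusually slow error path still shows up on its own.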

Errors

This metric refers to the rate of unsuccessful requests. Of course, “unsuccessful” might mean a variety of things. It could mean requests that fail by returning an error code. Also, it could mean a request that returns the correct status code (e.g. 200 OK) but the content is wrong for some reason.

Closely monitoring errors allows organizations to learn which types of incidents happen most frequently. It also gives them a comprehensive view of the system’s health. That’s especially true when errors are categorized by severity, which helps teams focus on the problems that really affect the end-users.
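
The following Python sketch shows one way the error rate could be computed so that a “200 OK” response with wrong content still counts as a failure. The category names, the `body_ok` flag (standing in for whatever content check your service can perform), and the decision to count client errors as failures are assumptions for illustration.

```python
from collections import Counter


def classify(status, body_ok):
    """Assign a response to a hypothetical error category."""
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "client_error"
    if not body_ok:
        return "wrong_content"  # correct status code, but the content is wrong
    return "ok"


def error_rate(responses):
    """Return (failure rate, counts per category).

    `responses` is a list of (http_status, body_ok) pairs.
    """
    counts = Counter(classify(status, body_ok) for status, body_ok in responses)
    total = sum(counts.values())
    failures = total - counts["ok"]
    rate = failures / total if total else 0.0
    return rate, counts


# Hypothetical sample: one silent failure and one explicit server error.
rate, counts = error_rate([(200, True), (200, True), (200, False), (500, True)])
print(f"error rate: {rate:.0%}", dict(counts))
```

Grouping failures into categories like this is a first step toward the severity-based view described above.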

Traffic

Traffic is a measure of the demand your system is receiving. Of course, traffic is a somewhat generic term that can acquire different meanings according to your specific scenario. For instance, when it comes to web apps or APIs, traffic might mean the number of HTTP requests per second the application receives.

Traffic is a key metric because it helps SREs identify issues that are caused by the stress of high traffic and tell them apart from problems that happen even in lower-traffic scenarios.
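
For a web app or API, a simple way to picture the traffic signal is to count requests per second from request timestamps. The Python sketch below assumes the timestamps are non-negative seconds (for example, Unix time) taken from an access log. The function names are hypothetical.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Map each whole second to the number of requests that arrived in it."""
    return Counter(int(ts) for ts in timestamps)


def peak_and_average(timestamps):
    """Return (peak requests in any one second, average requests per second)."""
    per_second = requests_per_second(timestamps)
    if not per_second:
        return 0, 0.0
    # Seconds with no requests still count toward the observed time span.
    span = max(per_second) - min(per_second) + 1
    return max(per_second.values()), len(timestamps) / span


# Hypothetical timestamps, in seconds since the start of the measurement.
peak, average = peak_and_average([0.1, 0.5, 0.9, 1.2, 3.7])
print(f"peak: {peak} req/s, average: {average:.2f} req/s")
```

Comparing the peak against the average is a quick way to see whether an incident coincided with a traffic spike or happened under ordinary load.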

Saturation

Saturation is another word for capacity. How full is your system? Can it take more load? Saturation as a metric helps you see which of your system’s resources are constrained. It applies to resources like disk capacity, CPU usage, and memory consumption.

Most resources start to cause performance degradation well before they reach 100% utilization. That’s why saturation is such an important monitoring signal: it helps SRE teams determine healthy thresholds for the utilization of those resources. Then, they can act preemptively and prevent system saturation from reaching a critical level at which performance for end-users will degrade.
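
A minimal Python sketch of the threshold idea follows. The threshold values are made up for illustration and would need to be derived from your own measurements. The resource names and the function name are likewise hypothetical.

```python
# Hypothetical warning thresholds, deliberately set well below 100% because
# performance tends to degrade before a resource is completely full.
WARN_AT = {"cpu": 0.70, "memory": 0.80, "disk": 0.85}


def saturated_resources(utilization, thresholds=WARN_AT):
    """Return the names of resources at or above their warning threshold.

    `utilization` maps a resource name to a fraction between 0.0 and 1.0.
    Resources without a configured threshold are ignored.
    """
    return sorted(
        name
        for name, used in utilization.items()
        if name in thresholds and used >= thresholds[name]
    )


# Hypothetical reading from a single host.
print(saturated_resources({"cpu": 0.65, "memory": 0.91, "disk": 0.85}))
```

In practice these readings would come from your monitoring system, and crossing a threshold would trigger an alert or a capacity review rather than a print statement.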

Should You Implement SRE?

The buzz about the value of site reliability engineers has many IT managers wondering if they should add one to their team. In most cases, the addition of site reliability engineers to a team happens during the design and development of large-scale systems. Although SRE was created at Google, other recognizable brands such as Netflix, GitHub, and Reddit already have these teams. This means that mainly cloud-native and SaaS companies have adopted SRE. Still, other companies are gradually adopting this role for their software development teams.

Some Reasons Why You Should Consider SRE

To sum up, here are a few key reasons why a site reliability engineer role is worthwhile:

SRE automates processes for reliability, which will save time for your in-house team. Through automation, a team eliminates manual reprogramming, which is tedious and laborious. Thus, SRE will help in recognizing and addressing operational flaws with no human interference.
A site reliability engineer combines the role of a system administrator and developer and this prevents conflicts that might arise. How? When you have a system admin and a developer, each embraces different ideas and methodologies at the time of development and troubleshooting. But a site reliability engineer utilizes the strengths of a system developer and those of a system admin to form an operational system.
The collaboration skill of site reliability engineers is critical for high-quality systems. It also comes in handy when there are problems during development or when a system fails. This is because site reliability engineers focus on finding a solution rather than dwelling on divisive matters of how things should be done.
A site reliability engineer uses an innovative approach to problem-solving and this boosts the likelihood that your team will come up with a disruptive product.

To successfully implement SRE, your organization should have people with the right experience. Such people will be able to lead development teams operationally.

Should a Startup Adopt SRE?

Well, yes, but remember that every organization has to deal with all sorts of different problems.

A startup might not have the budget to hire a team dedicated to systems reliability, nor the time to focus on many of the practices SRE fosters. But it’s important to remember that the SRE model comes from the book Google published a few years ago. Some companies realized they were already doing SRE, as well. What I’m trying to say is that you don’t have to adopt all the practices and principles from SRE to get some benefits.

That said, one essential practice I’d recommend you start implementing right away is setting SLOs. Even a single SLO target will foster a lot of discussion within teams. Start measuring to understand how your numbers look at the present moment. Then you might find out that you’re spending more time than you need creating the perfect CI/CD pipeline or trying to do several deployments per day.

You might be having more significant problems, but you need to start measuring. Getting feedback after every deployment or release is critical, so focus on that.

Understanding Your Organization’s Needs Will Help You Decide

Despite the debate surrounding SRE and DevOps, the most important thing is that they both aim to help applications and systems run more efficiently. With more advances in technology, we are bound to see new practices and roles coming up. Of utmost importance is being able to select or adopt new practices or roles that drive operational efficiency.

If you are trying to decide whether you should adopt SRE, think of the outcome. If you have large projects that require continuous improvement, SRE is right for you.