Although site reliability engineering has been around for a while, it has only recently gained fame in general software circles. But there are still a lot of questions as to what a site reliability engineer (SRE) is and does.
SREs have been compared to operations groups, system admins, and more. But the comparison falls short in encompassing their role in today’s modern software environment. SREs cover more responsibilities than operations and infrastructure. And though they may have a background in system administration, they also bring software development skills to the role.
SREs combine all these skills and ensure that complex distributed systems run smoothly.
So how do they do all this?
Read on to find out what SREs are in practice and which responsibilities they take on to keep those systems running.
What Is a Site Reliability Engineer?
Let’s define site reliability engineer.
The concept of site reliability engineering originated at Google and was popularized by a book of the same name, written by a team of Google engineers. The book provides a great in-depth look at site reliability engineering. However, today we want to stick to a simple definition.
Wikipedia defines site reliability engineering like this:
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
In a nutshell, we can say that an SRE is a professional with a solid background in coding and automation who uses that experience to solve problems in infrastructure and operations.
How Is SRE Different From DevOps?
At this point you may think this sounds similar to DevOps. We’ve been incorporating DevOps principles into our teams and organizations for a while. What’s the difference?
First of all, you shouldn’t think of DevOps as a role. Rather, consider it more of a cultural aspect. It can’t and shouldn’t be assigned to a person, but rather performed as a team.
Next, note that DevOps automates and simplifies the process of getting code from a developer’s laptop to production as smoothly as possible. But what happens once the code deploys to production?
That’s where site reliability engineering makes all the difference. SRE takes over once that code deploys to production, focusing on operations and on maintaining highly available services. And again, this isn’t just our typical operations or application maintenance role, manually responding to problems. The professional who assumes this role should be a software engineer and use those skills to automate their way to high availability.
In short, DevOps gets our code to production, while SRE ensures that it works properly once there.
What Does a Site Reliability Engineer Do?
Site reliability engineers have a lot of different responsibilities because there are a lot of ways to look at and solve reliability problems. And these responsibilities may vary based on the company size, domain, team size, and other factors.
Ultimately, they follow SRE principles to reduce toil, monitor and improve systems, and solve reliability problems when they occur. Let’s look at some of these further.
Automate All the Things
One difference between the SRE role and the traditional operations team involves automation.
In the past, operations folks would keep things running by watching dashboards, executing scripts, and carrying out other manual endeavors. However, in the SRE world, there’s a heavy emphasis on automation and removing repetitive toil.
Where did this drive come from? The engineering aspect of the SRE role.
When you put software developers in a position where they repeat the same functions day in and day out, they’ll be driven to automate. That’s what software developers do best.
And automation doesn’t stop at automating a software build, acceptance tests, and deployment. SRE automation covers CI/CD, infrastructure creation and patching, monitoring, alerting, and automated responses to certain incidents.
In Google’s SRE book, much of this work is also referred to as eliminating toil:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
—Site Reliability Engineering
But why do we focus so much on reducing toil?
Not only does reducing toil make the processes repeatable and automated, but it also increases the amount of time SREs have to build new tools and investigate infrastructure changes that further improve site reliability.
In summary, the less toil there is, the more time and resources are dedicated to making sure your software ecosystem runs reliably and the faster you can deliver business value.
Monitor Distributed Systems
With the increasing popularity of distributed systems, there’s a greater need for increased monitoring and automated alerting.
It’s not enough that your application is up and running. We also need to ensure that systems and infrastructure work properly. And we can’t do that without monitoring.
For this portion of the job, SREs can use a product like Scalyr to monitor and alert on any potential issues. This allows real-time system monitoring as well as analysis of long-term reliability trends. In fact, you can get started today with a free trial of Scalyr that can give you the visibility you need to monitor all your systems.
Define Service Level Indicators and Objectives
When you hear that a service has attained or is striving toward an uptime of 99.99 percent, you’re hearing about a service level objective (SLO). A service level indicator (SLI) is the metric, such as measured uptime or request success rate, that shows whether the objective is being met.
In other words, the SLO is the target, and the SLI is the agreed-upon measurement that gets compared against it.
SREs provide monitoring services for systems so that teams can begin to track their SLOs and SLIs. They also help set realistic objectives for the future and might advise on proper service level agreements (SLAs) for customers.
The SRE then works to make sure your application meets, though does not overly exceed, the stated SLO.
Now you may think that it’s odd to not work to exceed an SLO. However, it would be a waste of resources to make something more reliable than it needs to be. You could be using that time and effort on work that’s more critical instead.
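To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests) and a 30-day window; the function names and the sample numbers are hypothetical and chosen only for illustration. The share of the "error budget" consumed shows how close a service is to missing its SLO, and also how much room is left before extra reliability work stops paying off.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_consumed(total_requests: int, failed_requests: int,
                          slo_target: float) -> float:
    """Fraction of the error budget used (1.0 means the budget is spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.9999")
    # The SLO tolerates this many failed requests in the window.
    allowed_failures = total_requests * (1.0 - slo_target)
    return failed_requests / allowed_failures


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime a time-based SLO tolerates over the window, in minutes."""
    return window_days * 24 * 60 * (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: one million requests, 40 of them failed.
    print(f"SLI: {availability_sli(1_000_000, 40):.5f}")
    print(f"Budget used: {error_budget_consumed(1_000_000, 40, 0.9999):.0%}")
    print(f"Allowed downtime: {allowed_downtime_minutes(0.9999):.2f} minutes")
```

With a 99.99 percent target, a 30-day window leaves a little over four minutes of allowed downtime, which is why a team might reasonably decide that pushing beyond the stated objective is not the best use of its time.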
Provide On-Call Support
Similar to traditional operations roles, SREs take on on-call responsibilities. What does on call look like?
Typically, SREs rotate on-call shifts. This balances on-call duties with more in-depth engineering and automation activities, reducing burnout and improving focus when it’s needed.
When a high-priority page triggers, the engineer will investigate and diagnose the issue. The SRE might also pull in additional engineers or software developers if necessary. And the SRE works to resolve the issue to bring the service back up.
As with system monitoring, on-call support also provides metrics that can be used to drive improvement. With on-call support, site reliability engineers work to reduce metrics like mean time to acknowledge (MTTA) and mean time to resolve (MTTR). As you may suspect, the SRE role depends on actionable metrics like these to drive improvements in system reliability.
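As a rough illustration of how MTTA and MTTR can be derived from on-call data, the Python sketch below averages the delays recorded for each incident. The `Incident` record and its field names are hypothetical, and the sketch assumes that time to resolve is measured from the moment the page triggers; some teams measure it from acknowledgement instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps we need."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: Iterable[Incident]) -> timedelta:
    """Average delay between a page triggering and an engineer acknowledging it."""
    seconds = mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: Iterable[Incident]) -> timedelta:
    """Average delay between a page triggering and the incident being resolved."""
    seconds = mean(
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    history = [
        Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=30)),
        Incident(start, start + timedelta(minutes=6), start + timedelta(minutes=50)),
    ]
    print("MTTA:", mean_time_to_acknowledge(history))
    print("MTTR:", mean_time_to_resolve(history))
```

Tracking these two values over time, rather than looking at a single incident, is what turns on-call work into a metric a team can act on.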
Manage Incidents
An important part of the SRE role involves managing incidents.
Now you may say this is no different than the on-call responsibility. You find a problem and then fix it. How hard could it be?
Well, for managing incidents, SREs need to employ additional professional skills to make sure everything goes smoothly.
When an outage occurs, for example, there could be dozens of ways to diagnose and attempt to resolve the issue. And your first step involves using the monitoring and metrics you have to diagnose the issue.
Once you have the data or symptoms available, you’ll have to manage the incident properly. You’ll need someone to take point on facilitating and coordinating the actions of all involved. And that requires clearly defined roles.
Though not all companies include these Google-recommended incident roles, we should at least consider them. These roles include the following:
- An incident commander who maintains a high-level view of everything occurring.
- The engineers who execute processes or modifications to the infrastructure or systems.
- A communication role for relaying the right message to customers and management.
- A planning role in charge of planning any meetings, handoffs, and logistical needs.
Without clearly defined roles for our SREs, we could have SREs that step on each other’s toes as they try different solutions without up-front coordination and communication.
Conduct Postmortems
Now that we’ve lived through an incident and resolved it in the sections above, we’re ready for the postmortem.
Typically, an SRE facilitates or participates in these postmortems.
A postmortem brings together all relevant parties for analysis of the incident. The goal is to analyze what occurred during the incident and find the root cause. The participants also determine how similar incidents can be prevented, or resolved faster, in the future.
Some of the items that come out of a postmortem are listed below:
- Stories to improve reliability or monitoring.
- Additional documentation to assist with future incidents.
- Further investigation or testing to confirm or rule out any hypotheses related to the incident.
Postmortems provide a great tool for all teams to review incidents and find the best path forward.
Work With SRE and Development Teams
In addition to supporting development teams during on call, SREs also provide consulting and troubleshooting.
This assists both other SRE teams and software development teams that struggle with operational or reliability issues.
In this scenario, the SRE will assess current issues and determine which can be improved with automation or engineering effort. The SRE may also suggest solutions to reliability problems.
And perhaps most importantly, the SRE will drive changes to team processes and culture. After all, SRE involves more than tools and monitoring.
Responsibilities May Vary
In this post, we’ve discussed various activities that site reliability engineers participate in. Although these activities are done by many SREs, they aren’t set in stone.
Companies do vary their SRE roles and responsibilities based on need. In general, companies that are at different points in the SRE journey may have different needs.
For example, a newer company may need SRE support in getting general outages under control. And most energy goes toward that base level of reliability.
However, other companies that are further along in the journey may have eliminated company-wide outages. They may spend more time on improving or validating system and business-related metrics.
For example, your pizza shop application may need improved monitoring on its pizza recommendations once the site’s general availability is stable and reliable.
And remember, these tools and processes aren’t just for software companies. In fact, software engineers and ops folks should look at these processes and duties wherever they work. SRE principles can provide value to traditional manufacturing companies, retail companies, and more. Wherever there’s a need to improve the reliability and availability of applications, or even just to reduce toil, there’s room for SRE.
Should You Become an SRE?
You should consider becoming an SRE if you have a solid foundation in or passion for automation.
So ask yourself:
- Do you find yourself automating mundane development or operational tasks?
- Are you a systems engineer who wants to improve your coding skills?
- Are you a developer who wants to learn how to manage large-scale distributed systems?
- Do you want to learn how to solve system problems by using the monitoring data that you’ve helped set up?
If the answer to any of the questions above was “Yes,” then this role is for you.
There is no better way to stay in touch with the most recent developments in the DevOps world. You’d also be able to increase your knowledge and skills in areas that are currently in high demand.
As we discussed, SREs spend time on both technical and process-oriented responsibilities.
They do more than an operations or system administration team. Site reliability engineers employ their engineering skills to automate and reduce the manual intervention necessary for administration tasks.
Additionally, they work with other engineering teams to provide proper monitoring, incident response, and management. Over time, these functions improve the reliability and lower the maintenance costs of your distributed systems. You can try this kind of monitoring for free with Scalyr.
And finally, they spread the culture of site reliability engineering through your organization so that all teams learn to make decisions with reliability in mind.