Observability in Your Systems: The Why and How

In the DevOps and SRE world, observability has become an important term, especially when you’re talking about running production systems. No matter how much effort you put into creating good-quality software, there will always be things you miss, like users increasing exponentially, user data that’s longer than expected, or cache keys that never expire. I’ve seen it, and I’ve produced it too. For these reasons and more, systems need to be observable.

Today’s post is about answering the why and how of observability.

In a world where so many pieces of a system interconnect, new problems arise every day. The good news is that your systems lend themselves to observation when you instrument them appropriately. More on this later.

But first, let me clear the path by explaining what observability means.


What Does Observability Mean?

To better explain this concept, let me borrow the definition of observability from control theory, as Wikipedia puts it: “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

When you run a production system, can you infer its internal state by observing the outputs it produces? Which outputs describe the inner workings of the system? You might not know yet, but I’d say that CPU and memory metrics alone are not sufficient.

Observability helps you understand the internals of your production system by asking questions from the outside. Historically, we’ve monitored metric values like CPU and memory based on known past issues, and every time a new issue arises, we add a new monitor. The problem is that we usually end up with noisy monitors that people tend to ignore, or with monitors whose purpose no one understands.


For known problems, you have everything under control, at least in theory. You have an incredible runbook that you just need to follow, and customers shouldn’t notice that anything happened. But in reality, that’s often not how things work. Customers still complain about issues in your system even when your monitors look good.

Therefore, monitors for metrics alone are not enough. You need context, and you can get it from your logs.

Why Is Observability Needed?

When you work with a simple application that doesn’t have many moving parts, like a WordPress site, it’s relatively easy to control its stability. You can place monitors for CPU, memory, networking, and databases. Many of these monitors will help you understand where to apply known solutions to known problems.

In other words, these types of applications are mature enough that their common problems are well understood. But nowadays, the story is different. Almost everyone is working with distributed systems.

There are microservices, containers, cloud, serverless, and a lot of combinations of these technologies. All of these increase the number of ways a system can fail because there are so many parts interacting. And because of that diversity, it’s hard to understand present problems and predict future ones.

You should recognize what the SRE model emphasizes: systems will continue failing.

As your system grows in usage and complexity, new and different problems will keep emerging. Known problems from the past might not occur again, and you’ll have to deal with unknown problems regularly.

For instance, what usually happens is that when there’s a problem in a production environment, sysadmins are the ones trying to find out what the problem is. They make guesses based on the metrics they see. If CPU usage is high, it might be because traffic has increased. If memory is high, it might be because there’s a memory leak in the application. But these are just guesses.

Systems need to emit telemetry that calls out what the problem is. That’s difficult because it’s impossible to cover all failure scenarios. Even if you instrument your application with logs, at some point you’ll need more context; the system should be observable.

Observability is what helps you troubleshoot better in production. You’ll need to zoom in and zoom out over and over again: take a different path, dive deep into logs, read stack traces, and do everything you can to answer new questions and find out what’s causing problems.

How Is It Different From Monitoring?

So, at this point, you might be wondering how observability is different from monitoring. Well, as I said before, monitoring is not a good way to find out about unknown problems in the system. Monitoring asks the same questions over and over again. Is CPU usage under 80%? Is memory usage under 75%? Is latency under 500ms? This is valuable information, but monitoring is only useful for known problems.
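To make that concrete, here’s a minimal sketch of what those fixed questions look like in code. The thresholds and the function name are hypothetical; the point is that the questions never change.

```python
# A minimal sketch of traditional monitoring: the same fixed questions,
# evaluated against the latest metric values. Thresholds are examples only.
def failing_checks(cpu_percent: float, memory_percent: float, latency_ms: float) -> list[str]:
    """Return the fixed checks that are currently failing."""
    alerts = []
    if cpu_percent >= 80:
        alerts.append("CPU usage is at or above 80%")
    if memory_percent >= 75:
        alerts.append("memory usage is at or above 75%")
    if latency_ms >= 500:
        alerts.append("latency is at or above 500ms")
    return alerts

# Every check passes here, yet a customer-facing problem could still exist.
print(failing_checks(cpu_percent=42.0, memory_percent=60.0, latency_ms=120.0))  # -> []
```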

Observability, on the other hand, is about asking different questions almost all the time. That’s how you discover new things.

You understand your application’s usage by asking questions from the outside. Then you use all the information your systems produce, like logs, traces, and metrics. With this information, you can ask any question, not only the specific ones you ask with monitoring. In other words, observability is a superset of monitoring.

Correlating Data Is Fundamental

At this point, you can see that for observability to work, you need to have context. Metrics alone don’t give you that context.

From what I’ve seen, when memory usage increases, it’s just a sign that something terrible is about to happen. Usually, it’s a memory leak somewhere in the code or a customer who uploads a file bigger than expected. These are the types of problems you can’t spot just by looking at a metric. As the Twitter engineering team said a few years ago on their blog, the pillars of observability are:

  • Metrics
  • Traces
  • Logs
  • Alerts

It doesn’t mean that these are going to be the only sources of information you use to observe the system from the outside, but they’re the primary ones. Correlating these different sources of data is challenging but not impossible. For example, microservices lean heavily on HTTP headers to pass information between calls.

Something as simple as marking a user’s request with a unique ID can make the difference when debugging in production. By searching for that request ID in a centralized storage location, you can get all the context from a user’s call at a specific point in time…like the time when the user complained but your monitors said things were all good.
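Here’s a minimal sketch of that idea, assuming a Python service that calls another one over HTTP with the requests library. The X-Request-ID header name and the URL are conventions I’m assuming for the example, not a standard.

```python
# A minimal sketch of propagating a request ID between services.
# "X-Request-ID" is a common convention, not a standard, and the URL
# below is a made-up placeholder.
import uuid

import requests  # third-party HTTP client, assumed to be installed


def call_payment_service(order_id: str) -> requests.Response:
    # Generate a unique ID for this user request (or reuse one received
    # from upstream) and pass it along so every service logs the same ID.
    request_id = str(uuid.uuid4())
    return requests.get(
        "https://payments.example.com/charges",  # placeholder URL
        params={"order_id": order_id},
        headers={"X-Request-ID": request_id},
        timeout=5,
    )
```

Every service writes that same ID to its own logs, so a single search in your central log storage reconstructs the full path of the user’s call.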


How Can We Make Systems Observable?

Everything that happens inside and outside of the system can emit valuable information. An easy way to start making your systems observable is by collecting all sorts of metrics surrounding the application. I’m talking about CPU, memory, network, or disk metrics from the infrastructure that’s hosting the application.

Also include logs from the applications or services whose source code you don’t control. I’m talking about services like NGINX or Redis, or data from a cloud provider, such as AWS CloudWatch metrics.
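As a starting point for the infrastructure side, here’s a minimal sketch of sampling host metrics, assuming the third-party psutil package. In practice, you’d usually rely on a monitoring agent rather than rolling your own collector.

```python
# A minimal sketch of collecting host metrics around an application.
# psutil is a third-party package; real setups typically use an agent.
import json
import time

import psutil


def sample_host_metrics() -> dict:
    """Take one sample of CPU, memory, disk, and network counters."""
    net = psutil.net_io_counters()
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),  # averaged over 1 second
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }


if __name__ == "__main__":
    # Emit one JSON sample; a real collector would ship this to central storage.
    print(json.dumps(sample_host_metrics()))
```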

Another essential component of making systems observable is the logs from your applications. And for that, instrumentation is vital. Luckily for us, nowadays there are pretty good tools out there, like OpenCensus, that you can use to emit logs, traces, and metrics from inside your applications. Once in production, you might notice that the application isn’t emitting enough information, and that’s OK. You can continue adding more information as needed. But make sure you also get rid of the information you no longer need or use.
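As an illustration, here’s a minimal tracing sketch using OpenTelemetry, the project OpenCensus merged into. The span and attribute names are made up for the example, and the console exporter is only there so the sketch is self-contained.

```python
# A minimal sketch of in-application instrumentation with OpenTelemetry
# (the successor to OpenCensus). Span and attribute names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout so the example is self-contained; production setups
# export them to a collector or tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def handle_upload(user_id: str, size_bytes: int) -> None:
    # Each span records what happened plus the context you'll need later,
    # like who made the call and how big the upload was.
    with tracer.start_as_current_span("handle_upload") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("upload.size_bytes", size_bytes)
        # ... the actual business logic would go here ...


handle_upload("user-123", 5_242_880)
```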

One thing I must say: all this information must be stored in a central location and in a raw format. It’s counterproductive to have to query different tools in different places to get the information you need when asking new questions. And you need the ability to aggregate and disaggregate information as required.

Observability and Its Benefits for Different Roles

When you have observable systems, it’s not only the team in charge of running production that benefits. Other teams, like developers, can see what’s happening in the applications when they receive live traffic. There are times when replicating a problem in a development environment is difficult. However, when developers get a better perspective by inspecting logs, finding solutions to problems is easier.

But what about non-technical teams like marketing or sales? Well, they could ask developers to include certain types of identifiers, like a campaign or customer ID, in the logs. Then these teams can simply search the logs for a match and read the error messages. Perhaps a customer has a missing configuration in a campaign, and they can fix that problem for the customer.

However, there are some challenges when it comes to making your systems observable.

Challenges in Achieving Observability

One of the main challenges is having a way to correlate events. Following the previous example, you could add unique identifiers to a request to understand what happened before an error occurred in the system. You need a way to find the context of an error; otherwise, the information your systems emit will just be disconnected bits of data. As I mentioned before, instrumenting your application properly is crucial.

Using a logging framework alone won’t be enough, though; you need to agree, at a higher architectural level, on what information you’ll use to correlate events.
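For instance, here’s a minimal sketch of attaching an agreed-upon correlation field to every log line with Python’s standard logging module. The field name correlation_id is an assumption; what matters is that every service uses the same one.

```python
# A minimal sketch of stamping every log record with a shared correlation ID
# using the standard logging module. The field name is an assumption; what
# matters is that all services agree on it.
import logging
import uuid


class CorrelationIdFilter(logging.Filter):
    """Inject the current correlation ID into every log record."""

    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(correlation_id)s %(levelname)s %(message)s")
)

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.addFilter(CorrelationIdFilter(str(uuid.uuid4())))

logger.info("order created")       # both lines carry the same correlation ID,
logger.warning("payment retried")  # so they can be grouped in central storage
```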

Another challenge in observability is aggregating data. At some point, you’ll end up with too many events from different sources, especially if you’re working with a distributed architecture, which is becoming the norm. You want to avoid having to query different sources of information when troubleshooting. For instance, you want your database logs in the same location as your application logs. In other words, you need centralized logging.

Observability Helps You With Unknown Problems

Observability, in a nutshell, is the ability to ask questions from the outside to understand the inside of a system. Making your production systems observable doesn’t mean problems will solve themselves. You must keep examining all the information you have and asking whether it’s still useful.

Avoid falling into the trap of collecting everything just because you might need it in the future. It’s essential that you also spend time understanding the system and its architecture and components, both internal and external. Plus, it’s healthy to know where reliability is lacking in your system. You can use tools like DataSet to start aggregating your system’s logs in a centralized location and troubleshoot more effectively.

Observability will help your team, especially developers, to own the system and to better understand how it behaves in a live environment with real users.

One last piece of advice: don’t rely only on the past problems you’ve had. All your monitors might say the system is OK, and yet customers still complain. Observability is about having the right data when you need to debug production systems. It will help you with the unknown problems you might have, and the ones you already do.