When it comes to running production software, observability and monitoring provide the data you need to operate it reliably. They provide the foundation to improve the customer experience, reduce Mean Time to Repair (MTTR), and increase Mean Time Between Failures (MTBF). And those improvements give your customers the reliability and availability they expect.
But what does observability really mean? And what can monitoring actually provide? Are these just the latest buzzwords?
In this post, we’ll review observability and monitoring, what they are, and how they differ. And we’ll give you the info you need to start building these practices with Scalyr.
Observability vs. Monitoring
Let’s start with a simple explanation.
When someone mentions observability and monitoring, we often picture rows of dashboards and radiators providing critical data to a software command center. And if an incident or outage occurs, we picture engineers scrambling and cycling through the available data to solve the problem quickly. Observability and monitoring sound vaguely similar, so that picture probably covers it, right? Well, not exactly. That picture doesn’t tell the full story.
Both observability and monitoring involve more than fancy dashboards and engineers. They are related, but perhaps not in the way you think.
To put it simply, monitoring helps teams identify problems and receive notifications about them. Observability then helps teams resolve those problems by improving debugging and root cause analysis. Additionally, monitoring uses observability tools to track known metrics and failure points, while observability provides the tools to investigate unknown or unexpected issues. You need both monitoring and observability if you want to build reliable systems.
So that’s it. That’s the simple answer. But as you may expect, more goes into it than that. Let’s look further.
In control theory, observability means being able to infer the internal state of a system based on its outputs. In practice, we can't predict ahead of time what we'll want to know about our system. So we want to capture enough data to make sure we can analyze problems from different angles and different aggregates when problems inevitably occur.
What type of data adds to a system’s observability? To find out, let’s look at the commonly cited three pillars of observability: metrics, logs, and traces. All provide different views of what’s going on inside the system.
Let’s start with metrics. Metrics typically aggregate numeric data about your system and application. For example, you can have metrics around your available CPU and memory. Additionally, you can track metrics like response codes, traffic, latency, and errors.
Once you’ve got your system metrics defined, don’t forget custom metrics that capture relevant business or domain data. With custom metrics, you can track types of payments, shopping cart size, and the number of abandoned carts, to name a few.
We’re going to go a bit further with metrics, as they vary in type. The following are the most common, though you may see others.
- Gauge. Gauges represent measurements at a particular point in time. Metrics like CPU, memory, or queue counts use gauges.
- Counter. Counters measure events that occur. For example, you may count the number of requests your API receives, the number of errors that result, or the number of visitors to your application site.
- Histogram. With histograms, you can measure the distributions of events. One of the most common uses for histograms is latency. And instead of using just an average or max, you can determine the 50th, 90th, or 99th percentile of latency that your customers experience.
- GaugeHistogram. As a combination of the gauge and histogram, here you can see the distribution of gauge data. So if we take queue counts as an example gauge, we could plot how long the data has been in the queue with a histogram.
- Info. For information that doesn’t change during a process, you can use info. This can indicate an application version number or dependency version numbers.
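The metric types above can be sketched in a few lines of plain Python. This is a hypothetical, minimal illustration (real metrics libraries, such as Prometheus client libraries, provide much richer and thread-safe versions); the class and variable names are ours, not from any particular tool.

```python
from statistics import quantiles

class Gauge:
    """A value sampled at a point in time, e.g. CPU, memory, or queue depth."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

class Counter:
    """A monotonically increasing event count, e.g. requests or errors."""
    def __init__(self):
        self.count = 0

    def inc(self, amount=1):
        self.count += amount

class Histogram:
    """A distribution of observations, e.g. request latency."""
    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

    def percentile(self, p):
        # quantiles(..., n=100) yields 99 cut points; index p-1 is the
        # approximate p-th percentile of the observed samples
        return quantiles(self.samples, n=100)[p - 1]

# Record some illustrative request latencies (in seconds)
latency = Histogram()
for ms in (12, 15, 14, 200, 13, 16, 18, 11, 14, 950):
    latency.observe(ms / 1000)

requests = Counter()
requests.inc(10)

queue_depth = Gauge()
queue_depth.set(42)
```

Note how the histogram's 99th percentile exposes the slow outliers (200 ms and 950 ms here) that an average would hide, which is exactly why percentiles matter for latency.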
With metrics, you have the potential to measure anything that occurs in your system.
Next, let’s look at logs. Logs provide textual data regarding events that occur in your system.
If you’ve been in software for more than a few weeks, you know that some logs can provide the crucial piece of information that helps solve a nasty bug. But you also know that many logs don’t add much value. Those are the logs that you just skim over when looking for something helpful.
So we’ve seen good logs and bad logs. Good logs provide the necessary context to recreate and investigate issues. Bad logs clutter the text with useless information or, even worse, misleading information.
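The difference is easy to see side by side. Here's a small sketch using Python's standard `logging` module; the order and customer IDs are invented for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")

order_id, customer_id = "ord-123", "cust-456"  # illustrative values

# Bad: no context -- impossible to tie to a specific request or customer
log.error("payment failed")

# Good: enough context to recreate and investigate the issue
log.error(
    "payment failed: order_id=%s customer_id=%s gateway=stripe reason=card_declined",
    order_id, customer_id,
)
```

The second line can be searched, filtered, and correlated with metrics and traces; the first just adds clutter.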
For our third pillar, let’s talk about traces.
Both logs and metrics can relate to particular events that occur within the system. What happened at this particular place in our system at this time? That’s helpful info, but on its own it doesn’t let us follow one particular transaction or customer through the system. For that, we need tracing.
Tracing, using tools like Zipkin or Jaeger, lets us look at a series of related events through our system. For example, let’s say we want to follow our customer’s experience for a particular transaction that failed. We can do that with traces and tie relevant metrics, errors, and logs together to show the path through the code that a particular transaction took. With traces, we gain the ability to trace one process, transaction, or experience through our system.
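The core idea can be sketched without any tracing backend: generate one trace ID per transaction and tag every related event with it. This is a hypothetical minimal version; real tracers like Zipkin, Jaeger, or OpenTelemetry SDKs propagate this context across service boundaries for you, and the function and span names below are invented.

```python
import uuid
from contextvars import ContextVar

# The trace ID active for the current transaction
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="")

events = []  # stand-in for a trace backend

def record(span_name):
    # Tag every event with the active trace ID so related spans link up
    events.append({"trace_id": current_trace_id.get(), "span": span_name})

def handle_checkout():
    # One trace ID covers the whole transaction's path through the code
    current_trace_id.set(uuid.uuid4().hex)
    record("validate_cart")
    record("charge_payment")
    record("send_confirmation")

handle_checkout()
```

Because all three spans carry the same trace ID, a single query pulls back the failed transaction's full path, along with any logs and metrics tagged with that ID.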
Beware the Three Pillars
So now we have a high-level view of the three pillars. But you should use some caution when determining whether you have observability. Many people wrongfully assume that if they have monitoring, logging, and traceability, they have observability. But that’s not always the case.
In fact, common pitfalls can add more pain and make debugging more complex. For example, if you have three disparate systems that provide the logging, metrics, and traces, your engineers will have to context switch and attempt to correlate the data in those systems themselves. That can lead to errors, a longer time to debug, and frustration.
Additionally, some companies have the “silver bullet” pitfall. Remember that observability isn’t just about throwing a tool or dashboards at your application teams. It’s also about building a solid foundation of good logging and metric fundamentals.
Your teams can have all the tools at their disposal, but if your application reliability and availability aren’t improving, you might not have observability, no matter how many fancy dashboards you may have.
Now that we have a good understanding of observability, what about monitoring? With monitoring, we use some of our observability tools to identify issues, notify the software team of those issues, and even predict potential trends in our system’s reliability.
So this is where we finally get to use our dashboards with all of our metrics and logs! But even though dashboards look nice and make even the dullest applications seem exciting with their graphs and colors, you shouldn’t be sitting around watching dashboards and waiting for something to happen.
Instead, automated alerts should provide notifications when things need to be looked at or when systems experience issues. And the dashboards should provide relevant data for investigative purposes.
Now this doesn’t mean that you always need an incident to review dashboards. You can also explore the current health of the system or the activities taking place. Then you can start to see how different types of traffic or load affect other parts of the system. And from that, you can start to predict when issues may crop up in the future.
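An alert rule at its simplest is just a threshold check over a monitored metric. Here's a hedged sketch; the 5% error-rate threshold is an assumed example, not a recommendation, and real monitoring systems evaluate rules like this continuously over sliding time windows.

```python
from typing import Optional

# Assumed example threshold: page when more than 5% of requests fail
ERROR_RATE_THRESHOLD = 0.05

def check_error_rate(errors: int, requests: int) -> Optional[str]:
    """Return an alert message if the error rate breaches the threshold."""
    if requests == 0:
        return None  # no traffic, nothing to evaluate
    rate = errors / requests
    if rate > ERROR_RATE_THRESHOLD:
        return f"ALERT: error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}"
    return None  # healthy -- no notification needed

print(check_error_rate(3, 1000))   # healthy window
print(check_error_rate(80, 1000))  # breach -> notify the on-call engineer
```

The point is that the check runs automatically: engineers get paged on the breach and then turn to the dashboards to investigate, rather than watching graphs all day.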
Putting It All Together
Now we know that observability and monitoring go hand in hand. So what’s next? Start a free trial with Scalyr and see how you can combine observability and monitoring to ensure your teams can not only detect issues but also resolve them quickly.