When I first heard about whitebox monitoring, I thought I knew what it was just from the name. But like many software terms, it’s easy to conflate with something else, and easy to assume you understand it fully when you don’t. Whitebox monitoring is a valuable tool in our DevOps toolbox, so it’s prudent to make sure we understand what it is and how to use it.
We’re going to cover quite a bit about whitebox monitoring: why it’s valuable, how it differs from blackbox monitoring, and how such monitoring is implemented in a system.
Blackbox vs. Whitebox Monitoring
The whitebox/blackbox pairing shows up in many contexts; one famous example is blackbox vs. whitebox testing. I’ll quickly cover the difference between blackbox and whitebox monitoring.
The name blackbox evokes a feeling of mystery. Black is opaque; we cannot see through it. In the same way, we can’t see into the system we’re monitoring. Take a series of houses on a street, for example. From the outside, we can know a few things. We can see if anyone is home by whether the lights are on or how many cars are in the driveway. If we read the gas meter, we know how much gas they use per month. It’s the same for the water meter and water usage.
But we don’t know what happens inside. We have no idea what sort of interesting celebrations, arguments, or hobbies could be happening in these houses. It’s the same for monitoring. I can monitor the traffic and responses of the software. I can measure the CPU and memory utilization. But in blackbox monitoring, I have no idea what’s happening inside these requests. I don’t know what database calls it’s making or what sort of fancy rules it applies to the data.
Whitebox monitoring is, of course, the opposite of blackbox. The idea of a whitebox evokes a feeling of transparency, though maybe clearbox would be a better name. Imagine a household with a swear jar, a calendar of events, or a weekly menu. I can record and watch these things change, as long as I have a window to see inside. Whitebox monitoring is the same. It’s the act of putting windows into our code and showing the outside world, usually our developers, what’s happening inside.
What Do I Get out of Whitebox Monitoring?
If you know exactly what’s happening inside your application, you can easily answer questions that may pop up from time to time. Back to our household, what if we wanted to know how often they eat fish? Well, we just need to peek through the window and record their menu. We keep doing this every week, and we can get an average for how much fish they eat per month.
With software, having such things gives us some key benefits in two categories: operations and product insights.
Operational whitebox monitoring helps us keep our system alive and healthy. It also helps us keep the risk low when trying new architectures and techniques. It does this by giving us quick feedback once we’ve released something. For example, if we see traffic is slowing down for one of our endpoints, we need a way to drill down and see what internal method is causing the bottleneck. Monitoring database queries is a fantastic example of a whitebox monitor. It gives us insight into a key part of our system that is often the cause of slowdowns. Operational whitebox monitoring helps us build a thing right.
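As a concrete sketch of that kind of drill-down monitor, here is a minimal way to time individual database queries from inside the application. All names here (`query_timings`, `timed_query`, the query label) are illustrative, not from any particular library.

```python
import time
from contextlib import contextmanager

# Hypothetical in-process store of observed query durations, keyed by query name.
query_timings = {}

@contextmanager
def timed_query(name):
    """Record how long a named database query takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        query_timings.setdefault(name, []).append(elapsed)

# Usage: wrap the actual database call in the timer.
with timed_query("orders.find_by_customer"):
    time.sleep(0.01)  # stand-in for a real query

print(query_timings["orders.find_by_customer"])
```

In a real system, a reporter would ship these timings out of the process so you can find the slow query behind a slow endpoint.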
While operational monitoring helps us build a thing right, product insight monitors help us build the right thing. They help us answer the question, are we making customers happy? One popular version of this monitoring is shopping cart abandonment. It’s common in e-commerce to see how often customers put things inside a shopping cart, but never place an order. By looking at what they put in, or where they clicked next, we can improve our systems to encourage customers to place more orders.
How Do I Wire up Whitebox Monitoring?
Armed with an understanding of what whitebox monitoring is, the natural next question is how to implement it. The answer is extremely sensitive to the languages and frameworks you’re using, but I’ll describe some important characteristics here.
There are two main categories of whitebox monitoring that affect how we implement them: event-centric data and time-series data.
Event-centric monitoring is where we report out interesting events from our system. For our household, this could be tallying on a chalkboard every time a family has an argument. We may record what room the argument was in, how many decibels the speaking reached, and how long it lasted. We don’t know up front how the data will be used—we just know it’s interesting. With software, this could be things like what web requests occurred, whether they were successful, how many database calls they made, and how long they took.
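A minimal sketch of event-centric reporting might serialize each interesting event as one structured record. The function name and fields below are assumptions for illustration, not a real API.

```python
import json
import time

def emit_event(event_type, **fields):
    """Serialize an interesting event as one structured JSON line."""
    event = {"type": event_type, "ts": time.time(), **fields}
    line = json.dumps(event)
    # In a real system this line would be shipped to a log or event store.
    return line

# Usage: record a web request along with the details we find interesting.
line = emit_event("http_request", path="/checkout", status=200,
                  db_calls=3, duration_ms=42.5)
print(line)
```

Because each event carries its raw details, we can ask questions of this data later that we never anticipated when we wrote the code.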
Event-centric data is great when we need to answer questions that we don’t know we have ahead of time. Oftentimes production incidents catch us by surprise, and we need answers right now! Event-centric data can give us these answers.
Time-series data is information that is aggregated over a time period. For our households, it might be the number of dishes used per week or pounds of food consumed per day. In software, this can be things like the number of requests per minute or average latency per hour. The key to time-series data is that it’s aggregated inside the application before being sent through our window. This makes it ready to be shown in graphs so that we can see trends over time.
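The in-app aggregation step can be sketched as a simple per-minute counter; the class and method names below are hypothetical.

```python
import time
from collections import defaultdict

class MinuteCounter:
    """Aggregate counts into per-minute buckets before reporting them."""

    def __init__(self):
        self.buckets = defaultdict(int)

    def increment(self, name, ts=None):
        # Bucket each observation by the minute it occurred in.
        minute = int((ts if ts is not None else time.time()) // 60)
        self.buckets[(name, minute)] += 1

    def flush(self):
        """Return the aggregated points and reset; a reporter would ship these."""
        points, self.buckets = dict(self.buckets), defaultdict(int)
        return points
```

The application only ships the aggregated buckets through the window, which is exactly what makes the data cheap to store and ready to graph.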
The cost of whitebox monitoring is that it must be built into the app as a first-class concept. I need the application itself to tell me what’s going on inside it. In a household, this would be equivalent to the family telling me what they ate that night or giving me a key to their house. This usually consists of a reporter component that blasts out the data to somewhere else. It also involves some sort of instrumentation that plugs into our workflow life cycle. Many frameworks give us extensions to plug in such instrumentation. Sometimes we have to do it ourselves via things like aspect-oriented programming.
However we report, here are some key traits. Most reporters, especially event-centric ones, should be asynchronous. This is because we don’t want the time it takes to report out the data to interfere with our responsiveness to customers. We should also ensure its failures are isolated. We don’t want a failed log event to block someone trying to make a payment!
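Both traits can be sketched with a queue and a background thread. This is a minimal, assumed design (the class name and `send` callback are illustrative), not any particular product’s reporter.

```python
import queue
import threading

class AsyncReporter:
    """Ship events on a background thread so reporting never blocks
    or crashes the request path."""

    def __init__(self, send):
        self._send = send  # e.g. an HTTP post to a monitoring backend
        self._queue = queue.Queue(maxsize=1000)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def report(self, event):
        try:
            self._queue.put_nowait(event)  # never block the caller
        except queue.Full:
            pass  # drop the event rather than stall a payment

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a failed report stays isolated from the app
            finally:
                self._queue.task_done()
```

Dropping events when the queue is full is a deliberate trade-off here: losing a little telemetry is better than making customers wait.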
Once we have a reporter, we need something to report to. Sometimes we report to multiple places. We should have a persistent store somewhere that we can send this data to. As a note, this place should exist outside our production system. It’d be quite a pain to troubleshoot when production is down if your monitoring goes down with it!
For the record, Scalyr offers both storage and reporters as part of its product. Other offerings may focus solely on one or the other. Spring Sleuth, for example, only handles reporting and instrumentation. It leaves the choice of storage up to you.
Now that we have the data stored, we likely want to see it and see it frequently. We have a few options for that.
The almighty dashboard is the most popular way for visualizing monitors. There’s nothing quite like the feeling of confidence a trending graph gives. Dashboards excel at giving you a pulse of the system at a glance. Healthy ones let you see patterns and detect abnormalities. They should be uncluttered, focusing only on a few things. The flip side to this is that they should easily let you drill down at runtime into more specific data. This will make it easier to diagnose problems quickly.
Search is the ultimate ability to answer questions you didn’t know you would need to ask in advance. Searchable storage works very well with event-centric monitoring. It gives you the full flexibility to ask very specific questions about your system. No set of dashboards can anticipate every question a bug raises, but search can.
Alerts are when the system tells you that something is going on, as opposed to the other way around. These are great for letting you relax, instead of spending time every day looking through graphs or logs. They let the system do the hard work of detecting abnormalities. On the other hand, they can easily get out of hand. I have seen badly managed alerts ping someone every five minutes for four hours straight while the system was down.
What to Monitor
Once we get whitebox monitoring up and running, we want to leverage it effectively. I’ll be using a couple of ideas from Google’s Site Reliability Engineering book to touch on this.
Service Level Indicators
Service level indicators are the baseline of what’s happening in your system today. They’re useful for getting a pulse on how your system behaves, and you may find some surprising results. Which indicators you track is usually driven by what service level objectives you have.
Service Level Objectives
This is one of the keys to all monitoring. Monitoring is by no means a purely technical endeavor. A team should have objectives on how to make its customers happy. If your system deals with payments, an objective could be “we want payments to be 20% more likely to succeed without a snag.” In that case, we’d measure the rate of successful vs. failed payment transactions. Service level objectives are something on which you should intimately collaborate with all of your stakeholders. You all are responsible for them, and they’ll help guide what monitors you plug in.
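Tying the payment example together, the indicator and objective can be sketched in a few lines. The 99% target below is a made-up illustration, not a recommendation.

```python
def payment_success_rate(succeeded, failed):
    """SLI: the fraction of payment transactions that succeed."""
    total = succeeded + failed
    return succeeded / total if total else 1.0

# Hypothetical SLO: at least 99% of payments succeed.
SLO_TARGET = 0.99

sli = payment_success_rate(succeeded=990, failed=10)
meeting_slo = sli >= SLO_TARGET
print(sli, meeting_slo)
```

The indicator is just a measurement; the objective is the agreed-upon target your stakeholders hold you to.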
In a household, there are many interesting things that happen inside. We have to explicitly create systems that let us measure those things, like swear jars. In the same way, software systems are full of valuable details that let us peer into what’s happening. Building whitebox monitoring into our app will let us turn these details into valuable insights. There are many tools, like Scalyr, that help us implement this quickly, leaving us time to figure out what service level objectives we want to measure.