Logging and log analysis are critical to an engineer’s skill set. Engineers rely on logs to track how their systems are running, and when they fail to log important events, they often come to regret the gap.
See, engineers use logs for error resolution, security, and even for making improvements to the system. Having certain events in the logs and the tools to properly analyze them is what enables engineers to do these things. Let’s look at these ways in which logs are valuable in order to understand more about why engineers care so much about logging.
Using Logs for Error Resolution
Any given system will eventually throw exceptions. We’re taught to handle them gracefully or bubble them up the call stack and deal with them at the highest level. Typically, part of handling an exception is logging it. We find errors by examining the logs for exceptions, warnings, and other anomalies.
Before diving into some of the ways logging can help engineers resolve errors, I want to make sure the distinction between error and exception is clear. The wiki for Haskell presents a good summary, stating:
“It is an error to not handle an exception. If a file cannot be opened, you must respect that result. You can proceed as if the file could be opened, though. If you do so you might crash the machine or the runtime system terminates your program. All of these effects are possible consequences of a (programming) error.”
To sum up: errors are coding missteps. Exceptions are for “expected but irregular situations.” However, some programming languages use error (or Error) to mean exception, by this definition.
Detection
The first thing that comes to mind when thinking about logging is error detection. Your program, the platform, or the OS logs details whenever an exception occurs in your system. Whenever you release a new version of the system, you can check the logs to ensure there aren’t any exceptions being thrown (and logged) from the new version. The presence of exceptions in the logs would indicate an error in the new version and would potentially warrant a rollback.
For best results, don’t wait to check the logs until you release a new version. Instead, do routine checks on the logs for exceptions. It’s also best to have a proactive alerting system that sends alerts whenever exceptions are logged.
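As a rough illustration of a routine check, the sketch below scans log lines for entries at ERROR level or above and hands each one to an alert callback. The line format, the function names, and the notify callback are all assumptions made for this example; a real setup would plug in whatever format and alerting channel your tooling uses.

```python
import re
from typing import Callable, Iterable, List

# Assumed line format for this sketch: "<timestamp> <LEVEL> <message>"
ENTRY = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")
ALERT_LEVELS = {"ERROR", "CRITICAL"}


def find_exception_entries(lines: Iterable[str]) -> List[dict]:
    """Return the parsed entries whose level warrants an alert."""
    hits = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and match.group("level") in ALERT_LEVELS:
            hits.append(match.groupdict())
    return hits


def check_and_alert(lines: Iterable[str], notify: Callable[[dict], None]) -> int:
    """Send every alert-worthy entry to notify and return how many were sent."""
    hits = find_exception_entries(lines)
    for entry in hits:
        notify(entry)
    return len(hits)


sample = [
    "2024-05-01T10:00:00Z INFO service started",
    "2024-05-01T10:00:05Z ERROR Unhandled ValueError in /checkout",
    "2024-05-01T10:00:09Z WARNING slow response",
]
alerts = []  # stand-in for a pager, chat message, or email
count = check_and_alert(sample, alerts.append)
```

Running the same check on a schedule, rather than only after a release, is what turns it into the routine described above.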
Diagnosis
Analyzing exceptions and other unusual events is an important step in diagnosing errors. The information found in one or more entries can be valuable, sometimes even essential, to understanding the cause of errors and resolving them.
You can use the stack trace in the entry to find out where in the program the problem occurred. I’ve personally found that problems can be resolved much more efficiently with a clean stack trace. When the stack is unwound before the exception is logged, the trace only contains information about the exception handler. This is not helpful since every exception will have the same stack trace regardless of the origin of the exception.
My advice is to purposely throw exceptions during development as an experiment to see how useful your exception logging really is. This way, when your system makes it to production, you’ll at least have some confidence that you can rely on your logs to get you much closer to the source of your errors.
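Here is one way to run that experiment, sketched in Python with the standard logging module. The function names are hypothetical. The point is that logging the exception where it is caught keeps the frame that raised it in the trace, which you can confirm by inspecting what was written.

```python
import io
import logging

# Capture output in memory so the experiment can inspect what was logged.
stream = io.StringIO()
logger = logging.getLogger("trace_demo")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False


def parse_quantity(raw: str) -> int:
    return int(raw)  # raises ValueError for non-numeric input


def handle_order(raw_quantity: str) -> None:
    try:
        parse_quantity(raw_quantity)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback of
        # the exception being handled, including the frame that raised it.
        logger.exception("could not process order")


handle_order("three")  # deliberately bad input
logged = stream.getvalue()
```

If the logged text names only the handler and never mentions parse_quantity, the trace is not carrying you back to the origin of the problem.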
Parallel Reproduction
You should log event chains, especially if you have a system that uses parallel programming. When problems occur in these systems, they are typically difficult to reproduce. Rebuilding the event chain that produced the symptom in the first place might be the only way to discover the cause of the issue. You must log state changes as well in order to faithfully retrace the events that led to a failure condition.
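A minimal sketch of the idea: tag every entry with the worker that wrote it and an identifier for the unit of work, and log each state change as it happens. The job_id field, the state names, and the in-memory handler are assumptions for illustration only.

```python
import logging
import threading
import uuid


class ListHandler(logging.Handler):
    """Collects formatted entries in memory (a stand-in for a real log sink)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("%(threadName)s job=%(job_id)s %(message)s"))
log = logging.getLogger("event_chain_demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False


def process(job_id: str) -> None:
    state = "queued"
    for new_state in ("running", "done"):
        # Log every state change along with the job it belongs to.
        log.info("state %s -> %s", state, new_state, extra={"job_id": job_id})
        state = new_state


job_ids = [uuid.uuid4().hex[:8] for _ in range(3)]
threads = [
    threading.Thread(target=process, args=(job_id,), name=f"worker-{i}")
    for i, job_id in enumerate(job_ids)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()


def chain_for(job_id: str) -> list:
    """Rebuild the ordered event chain for one job from the interleaved log."""
    return [line for line in handler.lines if f"job={job_id}" in line]
```

Entries from different workers interleave unpredictably, but filtering on the identifier recovers each job's chain in the order it happened.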
Using Logs for Security (and Securing Logs)
Security is such a hot topic these days, especially when talking about logs. You may have heard about the emerging concerns of keeping logs secure (at rest and while in transit). But it isn’t all so glum; logs can also enhance security rather than simply presenting another security risk. Engineers can set up analyzers to watch logs for indicators of security issues.
Logs at Rest
Logs need to be secured. They should have the same access controls that apply to any other data. Use the principle of least privilege when giving access to logs and partition access rather than giving blanket access. (Partitioning will be easier if the logs are well organized.)
Although it may be tempting to log all parameters for every call on your servers, it’s not prudent to do so. Some parameters contain sensitive information, such as passwords or PII (personally identifiable information). Sensitive data must be secured with appropriate access controls and obfuscation! These days, there’s just no excuse for leaving that information lying out in the open.
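One way to obfuscate sensitive values before they reach a log file is a filter that masks them. The sketch below uses Python's logging.Filter; the key names and the key=value message format are assumptions, and a pattern like this is a safety net rather than a replacement for not logging secrets in the first place.

```python
import io
import logging
import re

# Hypothetical list of parameter names to mask; adjust to your own data.
SENSITIVE_KEYS = ("password", "ssn", "token")
PATTERN = re.compile(r"\b(%s)=(\S+)" % "|".join(SENSITIVE_KEYS), re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Masks the values of sensitive key=value pairs in each log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge the arguments into the text first
        record.msg = PATTERN.sub(r"\1=[REDACTED]", message)
        record.args = ()  # arguments are already merged into msg
        return True  # keep the record, now redacted


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())
logger = logging.getLogger("redaction_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("login attempt user=%s password=%s", "alice", "hunter2")
output = stream.getvalue()
```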
Logs in Transit
When transmitting log data, use a secure connection such as HTTPS or SFTP. That way, if the data is intercepted in transit, it will be unreadable to whoever intercepted it.
You should use secure transport protocols even on internal networks. You might have an undetected breach in the network with someone sniffing around for packets of juicy data to steal.
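As a small configuration sketch, Python's standard library ships a logging handler that can post entries over HTTPS. The host name and URL path below are hypothetical, and nothing is actually sent in this snippet; it only shows the handler being set up with certificate verification switched on.

```python
import logging
import logging.handlers
import ssl

# Default context: verifies the server certificate and checks the host name.
tls_context = ssl.create_default_context()

handler = logging.handlers.HTTPHandler(
    host="logs.example.internal:443",  # hypothetical internal log collector
    url="/ingest",                     # hypothetical endpoint path
    method="POST",
    secure=True,                       # use HTTPS instead of plain HTTP
    context=tls_context,
)

shipping_logger = logging.getLogger("shipping_demo")
shipping_logger.addHandler(handler)
```

HTTPHandler sends one request per entry, so for high volumes you would more likely use a dedicated log shipper. The same rule applies there: encrypted transport, even on the internal network.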
Logs As Indicators
When it comes to questions about securing your system and your environment, your logs have lots of answers. If you analyze everything from the firewall logs to the web server request logs, you can get a pretty good picture of how often your system comes under attack. Hopefully, you’ve got your system well secured; if not, you’ll appreciate the outcome of this kind of analysis as a guidepost telling you where to focus your efforts.
Even if you already do run a tight ship when it comes to security, it’s good to know the frequency of attempted attacks. If that number goes up for a period, it can be an indicator that someone is specifically targeting your company. Your security team can then notify employees to be extra vigilant.
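To make that concrete, here is a small sketch that counts failed logins per day and flags days that stand well above the average. The log format, the LOGIN_FAILED event name, and the "twice the average" threshold are all assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean

# Assumed entry format: "<ISO timestamp> <event> ip=<address>"
auth_log = (
    ["2024-05-01T09:00:00Z LOGIN_FAILED ip=203.0.113.7"] * 2
    + ["2024-05-01T09:05:00Z LOGIN_OK ip=198.51.100.4"]
    + ["2024-05-02T11:00:00Z LOGIN_FAILED ip=203.0.113.7"]
    + ["2024-05-03T02:00:00Z LOGIN_FAILED ip=203.0.113.9"] * 9
)


def failed_attempts_per_day(lines):
    """Count LOGIN_FAILED events per calendar day."""
    counts = Counter()
    for line in lines:
        timestamp, event, *_ = line.split()
        if event == "LOGIN_FAILED":
            counts[timestamp[:10]] += 1  # first 10 characters are YYYY-MM-DD
    return counts


def spike_days(counts, factor=2.0):
    """Return the days whose count exceeds factor times the daily average."""
    if not counts:
        return []
    baseline = mean(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * baseline)


daily = failed_attempts_per_day(auth_log)
spikes = spike_days(daily)
```

With this sample data the daily average is 4 failed attempts, so only the day with 9 is flagged.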
Using Logs to Make Improvements
Most technology companies need to make continuous improvements in order to keep an edge over the competition. So it should come as no surprise that I recommend using logs to improve your products. There’s valuable information in your logs! They’re gold when it comes to understanding user behavior, understanding system behavior, finding performance improvements, and allocating computing resources.
Understanding User Behavior
We may never fully understand user behavior, but there are billions of dollars to be had in trying. You may have heard of Google Analytics. Google has made a killing from quantifying certain aspects of user behavior, offering reports and metrics on how your users are interacting with your application. That’s great! Depending on your needs, you might also want to implement some custom analysis to supplement Google Analytics.
You could build your own tools, but that’s probably going to be wasteful. The tools are out there. You’re better off putting effort into activities that produce value in your domain.
Understanding System Behavior
Besides reducing errors that make their way into a system operating “in the wild,” you can analyze logs to improve features.
Web server logs offer a prime example of this. A web server configured to log some details about every request/response produces a lot of data. All that juicy data is useful when it comes to understanding what your web clients are doing and how your server is behaving. For example, if certain search queries or certain IDs in GET requests show up in your API logs more often than others, you can optimize for those. Moreover, those search queries may contain valuable user feedback if you look at them just right.
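As a sketch of that kind of analysis, the snippet below pulls the most frequent search terms out of request lines. The "/search" path, the "q" parameter, and the request-line format are assumptions for the example; real access logs carry more fields and would need a fuller parser.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def top_search_terms(request_lines, param="q", limit=3):
    """Count search terms in GET requests to /search, most frequent first."""
    terms = Counter()
    for line in request_lines:
        parts = line.split()
        if len(parts) < 2 or parts[0] != "GET":
            continue
        url = urlsplit(parts[1])
        if url.path != "/search":
            continue
        for value in parse_qs(url.query).get(param, []):
            terms[value.strip().lower()] += 1  # normalize case and spacing
    return terms.most_common(limit)


requests_seen = [
    "GET /search?q=red+shoes HTTP/1.1",
    "GET /search?q=Red+Shoes HTTP/1.1",
    "GET /search?q=blue+hat HTTP/1.1",
    "GET /items/42 HTTP/1.1",
    "POST /search?q=ignored HTTP/1.1",
]
popular = top_search_terms(requests_seen)
```

Terms that dominate the count are candidates for caching or precomputed results, and terms that return nothing are a hint about what users wish you offered.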
Finding Performance Improvements
Significant events should be logged explicitly. I don’t mean exceptions this time around, but events about which you want additional information. Perhaps you’d want to log an event whenever your logic follows a certain path. This might give you an idea about patterns that you can monitor. For instance, you could monitor cache misses, if you’re building a system that uses caches to optimize for speed.
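The cache example might look something like this sketch, where a small wrapper logs an event for every hit and miss. The LoggedCache class and the event names are hypothetical, and the in-memory stream stands in for wherever your logs actually go.

```python
import io
import logging

stream = io.StringIO()
cache_log = logging.getLogger("cache_demo")
cache_log.setLevel(logging.INFO)
cache_log.addHandler(logging.StreamHandler(stream))
cache_log.propagate = False


class LoggedCache:
    """A dictionary-backed cache that logs every hit and miss."""

    def __init__(self, loader):
        self._loader = loader  # called to produce a value on a miss
        self._store = {}

    def get(self, key):
        if key in self._store:
            cache_log.info("cache_hit key=%s", key)
            return self._store[key]
        cache_log.info("cache_miss key=%s", key)
        value = self._store[key] = self._loader(key)
        return value


cache = LoggedCache(loader=lambda key: key.upper())
for key in ["a", "b", "a", "a", "c"]:
    cache.get(key)

# Analyze the logged events to get the miss rate.
events = stream.getvalue().splitlines()
misses = sum(line.startswith("cache_miss") for line in events)
miss_rate = misses / len(events)
```

Here three of the five lookups are misses. Tracking that ratio over time tells you whether the cache is earning its keep.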
Allocating Computing Resources
Autoscaling in AWS uses CloudWatch metrics to trigger scaling. All that fancy talk is Amazonian for logging system performance over time, then analyzing trends to add or remove resources as needed.
You don’t have to be as sophisticated as autoscaling systems to use logs to understand how to allocate resources. If you have a few API endpoints on a single web server, for example, a simple query of requests can show you which requests make up the bulk of your load. If you break those off onto their own server(s), you’ll have a more balanced load than if you’d scaled the server directly.
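A simple query of that sort could be sketched as follows: group logged request paths by endpoint and compute each endpoint's share of the total. The paths and the rule for collapsing numeric IDs are assumptions for illustration.

```python
import re
from collections import Counter


def normalize(path: str) -> str:
    """Collapse numeric path segments so /api/users/5 and /api/users/6 group together."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


def load_share(request_paths):
    """Return (endpoint, fraction of all requests) pairs, busiest first."""
    counts = Counter(normalize(path) for path in request_paths)
    total = sum(counts.values())
    return [(endpoint, n / total) for endpoint, n in counts.most_common()]


# Hypothetical request paths pulled from a web server log.
paths = (
    [f"/api/reports/{i}" for i in range(6)]
    + [f"/api/users/{i}" for i in range(3)]
    + ["/api/health"]
)
shares = load_share(paths)
```

In this sample the reports endpoint accounts for 60% of requests, which makes it the obvious candidate for its own server.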
In Conclusion
For engineers, logs are our lifeline. They can get us out of jams, so long as we’ve set ourselves up for success. Luckily, the pathway to success is straightforward: logging effectively and being able to access those logs efficiently.