I’ve talked here before about log management in some detail. And I’ve talked about log analysis in high-level terms when making the case for its ROI. But I haven’t gone into a ton of detail about log analysis. Let’s do that today.
At the surface level, this might seem a little indulgent. What’s so hard? You take a log file and you analyze it, right?
Well, sure, but what does that mean, exactly? Do you, as a human, SSH into some server, open a gigantic server log file, and start thumbing through it like a newspaper? If I had to guess, I’d say probably not. It’s going to be some interleaving of tooling, human intelligence, and heuristics. So let’s get a little more specific about what that looks like, exactly.
Log Analysis, In the Broadest Terms
In the rest of this post, I’ll explain some of the most important elements of log analysis. But, before I do that, I want to give you a very broad working definition.
Log analysis is the process of turning your log files into data and then making intelligent decisions based on that data.
It sounds simple in principle. But it’s pretty involved in practice. Your production operations generate all sorts of logs: server logs, OS logs, application logs, etc. You need to take these things, gather them up, treat them as data, and make sense of them somehow. And it doesn’t help that log files contain some of the most unstructured and noisy data imaginable.
So log analysis takes you from “unstructured and noisy” to “ready to make good decisions.” Let’s see how that happens.
Collection and Aggregation
As I just mentioned, your production systems are going to produce all sorts of different logs. Your applications themselves produce them. So, too, do some of the things your applications use directly, such as databases. And then, of course, you have server logs and operating system logs. Maybe you need information from your mail server or other, more peripheral places. The point is, you’ve got a lot of sources of log data.
So, you need to collect these different logs somehow. And then you need to aggregate them, meaning you gather the collection together into a whole.
By doing this, you can start to regard your production operations not as a hodgepodge collection of unrelated systems but as a more deliberate whole.
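To make the idea a bit more concrete, here’s a minimal Python sketch of aggregation. It assumes each source is just an iterable of raw lines (an open file would work the same way), and the `aggregate` function and the source names "web" and "app" are hypothetical, made up for illustration. Real collection usually involves agents and network shipping, which this leaves out.

```python
from typing import Dict, Iterable, Iterator, Tuple


def aggregate(sources: Dict[str, Iterable[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, raw_line) pairs from every source, skipping blank lines."""
    for name, lines in sources.items():
        for line in lines:
            line = line.rstrip("\n")
            if line.strip():
                # Tag each line with where it came from so it stays traceable.
                yield name, line


# Hypothetical in-memory sources standing in for real log files.
sources = {
    "web": ["GET /index.htm 200\n", "\n"],
    "app": ["ERROR WebApp - Something happened!\n"],
}
combined = list(aggregate(sources))
```

Tagging every line with its origin is the important design choice here: once everything lives in one place, you still need to know which system said what.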
Parsing and Semantic Interpretation
Let’s say you’ve gathered up all of your log files and kind of smashed them together as your aggregation strategy. That might leave you with some, shall we say, variety.
111.222.333.123 HOME - [03/Mar/2017:02:44:19 -0800] "GET /some/subsite.htm HTTP/1.0" 200 198 "http://someexternalsite.com/somepage" "Mozilla/4.01 (Macintosh; I; PPC)"
2015-12-10 04:53:32,558  ERROR WebApp [(null)] - Something happened!
6/15/16,8:23:25 PM,DNS,Information,None,2,N/A,ZETA,The DNS Server has started.
As you can see, parsing these three very different styles of log entry would prove interesting. Each one seems to have a timestamp, albeit in a different format, and a couple of them carry a general message payload. But beyond that, what do you do?
That’s where the ideas of parsing and semantic interpretation come in. When you set up aggregation of the logs, you also specify different parsing algorithms, and you assign significance to the information that results. With some effort and intelligence, you can start weaving this into a chronological ordering of events that serve as parts of a whole.
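Here’s a rough sketch of what that could look like for the three sample lines above, in Python. The regular expressions, the common field names (`timestamp`, `source`, `level`, `message`), and the rule that a 5xx status counts as an error are all my assumptions for illustration. So is treating timestamps without an offset as UTC, which you’d want to verify against each real system.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumed patterns, written only to fit the sample lines shown above.
ACCESS = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) \S+')
APP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+)"
    r"\s+\[[^\]]*\]\s+-\s+(.*)$"
)


def _utc(ts: datetime) -> datetime:
    # Assumption: a timestamp with no offset is already in UTC.
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def parse_line(line: str) -> Optional[dict]:
    """Map one raw line onto a common record, or return None if unrecognized."""
    m = ACCESS.match(line)
    if m:
        ts = datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z")
        status = int(m.group(4))
        return {
            "timestamp": _utc(ts),
            "source": "web",
            "level": "ERROR" if status >= 500 else "INFO",
            "message": f"{m.group(3)} -> {status}",
        }
    m = APP.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        return {
            "timestamp": _utc(ts),
            "source": m.group(3),
            "level": m.group(2).upper(),
            "message": m.group(4),
        }
    # Comma-separated event log: date, time, source, level, ..., message.
    parts = line.split(",", 8)
    if len(parts) == 9:
        try:
            ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%m/%d/%y %I:%M:%S %p")
        except ValueError:
            return None
        return {
            "timestamp": _utc(ts),
            "source": parts[2],
            "level": parts[3].upper(),
            "message": parts[8],
        }
    return None


raw = [
    '111.222.333.123 HOME - [03/Mar/2017:02:44:19 -0800] '
    '"GET /some/subsite.htm HTTP/1.0" 200 198 '
    '"http://someexternalsite.com/somepage" "Mozilla/4.01 (Macintosh; I; PPC)"',
    "2015-12-10 04:53:32,558  ERROR WebApp [(null)] - Something happened!",
    "6/15/16,8:23:25 PM,DNS,Information,None,2,N/A,ZETA,The DNS Server has started.",
]
parsed = [parse_line(line) for line in raw]
# Sorting on the normalized timestamp gives the chronological ordering.
events = sorted((p for p in parsed if p is not None), key=lambda p: p["timestamp"])
```

Once every entry shares the same fields and a comparable timestamp, the three formats stop mattering and you can treat the result as one stream of events.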
Data Cleaning and Indexing
You’re going to need to do more with the data than just extract it and assign it semantic meaning, though. You’ll have missing entries where you need default values. You’re going to need to apply certain rules and transformations to it. And you’re probably going to need to filter some of the data out, frankly. Not every last byte captured by every last logging entity in your ecosystem is actually valuable to you.
In short, you’re going to need to “clean” the data a little.
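As a small illustration, a cleaning pass might look something like the Python sketch below. The record fields, the default values, and the choice to drop DEBUG entries are hypothetical. Your own rules will depend on what your systems log and what you care about.

```python
from typing import Dict, Iterable, List


def clean(
    records: Iterable[Dict],
    defaults: Dict,
    drop_levels: frozenset = frozenset({"DEBUG"}),
) -> List[Dict]:
    """Fill in defaults, normalize fields, and filter out unwanted records."""
    cleaned = []
    for record in records:
        # Missing or empty values fall back to the defaults.
        present = {k: v for k, v in record.items() if v not in (None, "")}
        merged = {**defaults, **present}
        # Transformations: upper-case the level, collapse runs of whitespace.
        merged["level"] = str(merged["level"]).upper()
        merged["message"] = " ".join(str(merged["message"]).split())
        # Filtering: skip levels we have decided are noise.
        if merged["level"] in drop_levels:
            continue
        cleaned.append(merged)
    return cleaned


# Hypothetical parsed records with the usual real-world messiness.
records = [
    {"level": "error", "source": "WebApp", "message": "Something   happened!"},
    {"level": None, "source": "DNS", "message": "The DNS Server has started."},
    {"level": "debug", "source": "WebApp", "message": "cache warm"},
]
defaults = {"level": "INFO", "source": "unknown", "message": ""}
cleaned = clean(records, defaults)
```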
Once you’ve done that, you’re in good shape, storage-wise. But you’re also going to want to do what databases do: index the data. This means storing it in such a way as to optimize information retrieval.
The reason you need to index as part of your storage and cleaning process is pretty straightforward. Any good log analysis paradigm is going to be predicated upon search. And not just any search — really good search.
This makes sense when you think about it. Logs collect tons and tons of data about what your systems are doing in production. To make use of that data, you’re going to need to search it, and the scale alone means that search has to be fast and sophisticated. We’re not talking about looking up the customer address in an MS Access table with 100 customer records.
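One common way to get fast search is an inverted index, which maps each word to the entries that contain it. The Python sketch below is a toy version under my own assumptions (a simple lowercase tokenizer, everything in memory, queries that require every term to match). Real log search tools do far more, but the core idea is the same: do the work once at storage time so lookups don’t have to scan everything.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(messages: List[str]) -> Dict[str, Set[int]]:
    """Map each token to the positions of the messages containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, message in enumerate(messages):
        for token in tokenize(message):
            index[token].add(position)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return positions of messages that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


# Hypothetical messages; the first two echo the samples earlier in the post.
messages = [
    "Something happened!",
    "The DNS Server has started.",
    "DNS lookup failed, something happened",
]
index = build_index(messages)
dns_hits = search(index, "dns")
both_hits = search(index, "dns happened")
```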
Once you have log files aggregated, parsed, stored, and indexed, you’re in good shape. But the story doesn’t end there. What happens with the information is just as important for analysis.
First of all, you definitely want good visualization capabilities. This includes relatively obvious things, like seeing graphs of traffic or dashboards warning you about spikes in errors. But it can also mean some relatively unusual or interesting visualization scenarios.
Part of log analysis means having the capability for deep understanding of the data, and visualization is critical for that.
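Before anything gets drawn, a dashboard has to bucket events into something countable. Here’s a small Python sketch of that step, with a plain-text bar chart standing in for a real graph. The `(timestamp, level)` event shape and the per-minute bucket size are assumptions for the sake of the example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def errors_per_minute(events: Iterable[Tuple[datetime, str]]) -> Counter:
    """Count ERROR events in each one-minute bucket."""
    counts: Counter = Counter()
    for timestamp, level in events:
        if level == "ERROR":
            counts[timestamp.replace(second=0, microsecond=0)] += 1
    return counts


def render(counts: Counter) -> str:
    """Draw one text bar per minute, in chronological order."""
    return "\n".join(
        f"{minute:%H:%M} {'#' * n} {n}" for minute, n in sorted(counts.items())
    )


# Hypothetical events; a real system would feed parsed log records in here.
events = [
    (datetime(2015, 12, 10, 4, 53, 32), "ERROR"),
    (datetime(2015, 12, 10, 4, 53, 48), "ERROR"),
    (datetime(2015, 12, 10, 4, 54, 5), "INFO"),
    (datetime(2015, 12, 10, 4, 54, 30), "ERROR"),
]
chart = render(errors_per_minute(events))
print(chart)
```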
You’ve stored and visualized your data, but now you also want to be able to slice and dice it to get a deeper understanding of it. You’re going to need analytics capability for your log analysis.
To get a little more specific, analytics involves automated assistance interpreting your data and discovering patterns in it. Analytics is a discipline unto itself, but it can include concerns such as the following:
- Statistical modeling and assessing the significance of relationships.
- Predictive modeling.
- Pattern recognition.
- Machine learning.
To zoom back out, you want to gather the data, have the ability to search it, and be able to visualize it. But then you also want automated assistance with combing through it, looking for trends, patterns, and generally interesting insights.
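As a taste of the simplest kind of automated assistance, here’s a Python sketch that flags spikes using a z-score, meaning how many standard deviations a count sits above the mean. The threshold of 2.0 and the sample counts are arbitrary choices on my part, and this stands in for the far more capable statistical and machine learning techniques listed above.

```python
from statistics import mean, pstdev
from typing import List


def find_spikes(counts: List[int], threshold: float = 2.0) -> List[int]:
    """Return indexes of intervals whose count is unusually high."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # Every interval is identical, so nothing stands out.
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]


# Hypothetical error counts per minute, with one obvious burst.
error_counts = [4, 5, 3, 4, 6, 5, 4, 40, 5, 4]
spikes = find_spikes(error_counts)
```

Here `spikes` picks out the interval with 40 errors. A single large outlier inflates both the mean and the standard deviation, so a fixed threshold like this is crude. That’s part of why mature tooling is worth having.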
Everything I’ve mentioned so far should be automated in your operation. Of course, the automation will require setup and intervention as you go. But you shouldn’t be doing this stuff yourself manually. In fact, you shouldn’t even write your own tools for this because good ones already exist.
But none of this is complete without human intervention, so I’ll close by mentioning that. Log analysis requires excellent tooling with sophisticated capabilities. But it also requires a team of smart people around it that know how to set it up, monitor it, and act on the insights that it provides.
Your systems generate an awful lot of data about what they’re doing, via many log files. Log analysis is critical to gathering, finding, visualizing, understanding, and acting on that information. It can even make the difference in keeping an edge over your competition.