Nothing is Wrong, It’s Just Not Working

Troubleshooting a production incident is stressful. But sometimes, you get to the end, give a satisfied sigh, and think: that was a worthwhile puzzle. This is such a tale.

It’s also an example of something near and dear to our hearts at Scalyr: the importance of detailed logs, especially for troubleshooting the 20% of hard problems that cause 80% of the grief. In this instance, we had a complex issue, with many links in the root cause chain, and metrics alone gave almost no indication of what was really happening. Without additional data to work with, we would have been helpless.

Scalyr, like so many businesses nowadays, is a 24×7 operation. Our customers rely on us to ingest and analyze signals on a continuous basis. One recent morning, alerts started triggering: the rate at which customer data was flowing into our storage tier was dropping. For a few minutes it surged back up, then dropped again:

Storage Ingestion Volume (Bps)

Time to troubleshoot. Checking dashboards, we quickly saw that data was piling up in our queuing tier:

Queued Data Size (Bytes)

So data was successfully moving from our customers into our queuing tier but was not then flowing into the storage tier. Something was wrong in the storage tier, or possibly the queuing tier. But what? We looked through the rest of the dashboard: CPU usage looked fine. I/O and network bandwidth looked fine. No noticeable spike in errors on the storage nodes. Virtually every graph looked fine, except for the graphs showing data backing up in the queue servers. There was nothing to indicate why data was backing up… except for one graph, showing that the queue servers were getting errors when sending data to the storage nodes:

Queue → Storage Errors (Events / Second)

Which, at this point, was not a surprise. But the errors were peaking at about 25 errors/second, which – confusingly – is less than 1% of our traffic. This error rate, while unusual, was not nearly high enough to explain the dramatic drop in data ingestion.

Fortunately, we don’t just record the number of errors, we also log every error. By breaking down fields of the log, we were able to see that the errors were distributed fairly evenly across queue servers:

Queue → Storage Errors By Queue Server

And we were able to look at the specific error messages:

26,674 Timeout waiting for connection from pool (storage-52)
16,895  Timeout waiting for connection from pool (storage-53)
6217 Connect to storage-52 failed: Read timed out
4930 Read timed out (storage-52)
4907 Read timed out (storage-53)
2030 Connect to storage-53 failed: Read timed out
1302 Read timed out (storage-54)
784 Timeout waiting for connection from pool (storage-54)
111 storage-53 failed to respond

Because the queue servers log detailed error messages, we were able to see that the errors involved just three particular storage nodes. And yet, by breaking down our ingestion logs by storage nodes, we could see that all storage nodes were experiencing dips in throughput:

Storage Ingestion Volume By Storage Node (Bps)

To recap: we had a serious problem, but none of our metrics or dashboards pointed toward a specific cause. The error logs provided a hint, but not a complete answer, because they only indicated problems on three storage nodes, and yet throughput was plummeting on all storage nodes. And no metrics or errors pointed toward any issue on any storage node, including those three. The error logs clearly showed a problem connecting to the three storage nodes, but the metrics for the storage nodes themselves looked fine, including from a network perspective.

Before we could investigate any further, the problem resolved! Ingestion volume:

Storage Ingestion Volume (Bps)

Queue backups:

Queued Data Size (Bytes)

“Problem resolved” is always a good thing. But we still had no idea what happened, meaning we had no idea whether it might happen again (protip: unexplained problems tend to recur), or what to do if it did. And troubleshooting after the fact is always harder: you can’t perform experiments, and you can’t gather new data. You have to work with whatever clues might exist in the data you’ve already collected. This is one reason that it’s important to have broad data collection in place before problems arise.

Used with permission

Fortunately, we had a lot of data to work with. At Scalyr, we’re heavy users of our own service, and we collect a wide variety of logs. All of the graphs presented above were derived from log data, which is why we were easily able to drill down to explore detailed error messages and identify specific affected servers. We still didn’t have a root cause, but we hadn’t yet exhausted the logs, either.

For instance: we have a mechanism that periodically logs a stack trace for every thread, along with information regarding the task each thread was processing. As a general rule, when a service is mysteriously experiencing lower throughput than usual, and there are no obvious causes – no problems with downstream services, no spike in CPU or I/O – one possible cause is that threads are being blocked. This can be very difficult to track down because it tends to not generate any obvious signals: there may be no relevant metrics, and a blocked thread doesn’t emit any logs. But our periodic stack-trace logs can often shed light on such situations.

And indeed! The logs showed that during the period when the problem was occurring, queue servers had large numbers of threads blocked on network I/O. Here’s a simplified version of the stack trace: Method)

Each of these queue server threads was waiting for a response from a storage node; specifically, one of the three storage nodes that had shown up earlier in the error logs. And at any particular moment in time, all of the blocked threads were blocked on just one storage node.

It turns out that our queue servers have a large, but capped, number of threads available for delivering data to storage nodes. And at times, every single one of those threads – on every queue server – was blocked on a single storage node.

We were finally getting toward an understanding of the problem. If a queue server’s threadpool was completely tied up delivering data to one storage node, that would explain why all storage nodes were experiencing a drop in throughput. But we still didn’t know the root cause. Why wasn’t that one storage node responding to requests?

Storage nodes also have a bounded threadpool, used to receive data from queue servers. Looking at the thread logs on that storage node, all of the threads in its ingestion threadpool were blocked. Fortunately, the details that we log for each thread include a customer ID. (How we arrange for this is an interesting topic in its own right! I’ll cover this in a subsequent post.)

As it turned out, all of the ingestion threads on the relevant storage node were blocked on a single customer. Something about the data flow from that customer was tying up all of the ingestion threads on that storage node; causing all queue delivery threads to block on that storage node; causing the widespread ingestion delays that initially triggered our alarms. This is much closer to a root cause!

Storage nodes can process many inbound data batches simultaneously, but each node can only process one batch at a time per customer. This is normally not an issue, but for some reason this particular storage node was not able to keep up with inbound data for this particular customer.

So, why was this particular customer a problem? Their data rate was fairly low. But, aha: checking the logs again, we were receiving a huge number of data batches from this customer. Each individual batch was very small, so the total volume was still low – meaning this customer didn’t stand out in our monitoring. But the number of individual batches was very high.

Looking through some debug logs, we were able to find the IP addresses that the customer was using to send data. We then ran a quick query to determine what fraction of our ingestion traffic came from that customer:

log="frontend-log" method="ingestion"
| group count = count() by ip
| let is_suspect_customer = ip in (<redacted>)
| group total = percent_of_total(sum(count)) by is_suspect_customer

About 8%. That might not seem high. However, because the customer’s data volume was low, our routing system was only directing their traffic to three storage nodes. Across all of our customers and storage nodes, having 8% of all traffic directed to just one customer on just three nodes was a huge outlier. And so now we finally had our understanding:

  • One customer was sending a large number of very small data batches.
  • Because the total data size was small, our routing tier distributed this data across just three storage nodes.
  • Each storage node can only process one batch per customer at a time; the high number of (small) batches caused data ingestion requests to stack up on the storage nodes.
  • This periodically caused all request processing threads on a storage node to block on that customer.
  • This quickly caused all data delivery threads on all queue servers to block on that storage node.
  • Thus blocking data ingestion for all customers.

Phew! Fortunately, we had enough information in our logs to follow this chain of events before the problem recurred. Once we had a complete understanding of the problem, it was easy to work around: we configured a limit on concurrent requests from this customer, and spread their traffic to more storage nodes.

Note that the root cause centers around two instances of the same problem: one customer monopolized all threads on a storage node; then that storage node monopolized all threads on the queue servers. Interestingly, our architecture includes two separate safety mechanisms intended specifically to prevent this exact type of issue. Both mechanisms failed. In a followup post, I’ll explain why, and what we did to implement a more robust solution.