What Debugging Scalyr Data Ingestion Issues Can Teach You about Production Troubleshooting

As a member of Scalyr’s support team, I deal with a wide variety of questions from customers and prospects. Some questions lead to really interesting work, such as Scalyr’s integration with a third-party platform like Fluentd or Stackdriver. However, the questions that usually get the most attention are more along the lines of “Why can’t I see my logs?” or “Data ingestion has been failing for the last x hours.”

Production troubleshooting is stressful work. When you have limited time to fix a problem, reliability and speed are the two things you care about most in your observability platform. We can relate to our users’ frustration when they notice that their logs are missing, and that’s why Scalyr makes troubleshooting data ingestion issues simple enough that most of the tools are already at your fingertips.

Read on to learn how we do that!

Sometimes It’s Us, Not You

Scalyr builds software primarily for DevOps engineers and SREs, so it only makes sense that we monitor our own production environment with Scalyr too. In fact, we maintain thousands of alerts and hundreds of dashboards to ensure the availability of our service, and we take preemptive action before any issue gets out of hand.

When you experience a production issue such as “missing logs” or “search times out,” first check whether the issue is even within your control before sinking more time into troubleshooting. We are quick to update our status page with any identified production issue, so if the time span of your missing logs matches one of our outages, it’s likely us who need to fix the problem, not you.

Lessons Learned from Agent Ingestion

The Scalyr agent is a lightweight piece of software that handles data ingestion, and it can be deployed to a typical host machine or a containerized environment. It collects all types of signal data from the client side so our users can troubleshoot their production issues.

Tailing Log Files Is a Great First Line of Defense

When you don’t see application logs on your monitoring platform, first verify that your application is actually logging. Using tail is a simple way to view the end of a file when analyzing logs and other text files that change over time. Additionally, you can add the -f flag to the tail command, as shown below, to get a real-time, streaming view of a changing file.

Linux agent: tail -f <log file>
Kubernetes agent: kubectl logs -f <pod name>
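
If you’re not sure the file is being written to at all, another quick sanity check is to watch its size and modification time. This is just generic shell usage, not a Scalyr command, and the log path below is a hypothetical example:

# Hypothetical application log path; substitute your own
ls -l /var/log/myapp/app.log
# Run it again after ~30 seconds; the size and timestamp should change if the app is logging
sleep 30 && ls -l /var/log/myapp/app.log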

Once you’ve confirmed that the application is actively logging but the same data can’t be found in the Scalyr app, we can move on to the next step and investigate the cause further.

Using Observable Tools Is a Huge Plus

A complete software platform typically includes utilities that make the system observable. The Scalyr agent has a runtime status command that you can use to see what’s happening. Users can reference the agent’s status output to confirm that there are no errors in the agent config and that all of the log files are being tracked by the agent with no data loss. Here’s how you can do that.

Linux agent: sudo scalyr-agent-2 status -v
Kubernetes agent: kubectl exec <scalyr agent pod> -- /usr/sbin/scalyr-agent-2 status -v

The status command output gives you the agent’s health, the last successful ingestion timestamp, and all of the log files the agent is actively tracking. Troubleshooting is a lot simpler when you’re using tools that let you easily determine whether something is wrong.
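
If the verbose output is long, one quick way to scan it for problems is to pipe it through grep. This is plain grep over the status output, not a dedicated Scalyr flag, and the pod name is a placeholder:

# Linux agent: surface any lines that mention errors or failures
sudo scalyr-agent-2 status -v | grep -i -E "error|fail"
# Kubernetes agent: same idea, run inside the agent pod
kubectl exec <scalyr agent pod> -- /usr/sbin/scalyr-agent-2 status -v | grep -i -E "error|fail"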

You Need to Have a Way to Query Your Logs

After you confirm that your application is up and running and the agent’s configuration is valid, you’ll need a more sophisticated tool for advanced troubleshooting. You can only get so far with simple utilities like tail and grep. Beyond those basics, you’ll need to start treating application logs as data, which makes fast, expressive queries a must-have.

Querying the logs for errors is what you need next. In the event of a data ingestion issue, Scalyr offers a simple query language you can use to search agent.log for errors, as shown below.

Linux agent: $logfile == "/var/log/scalyr-agent-2/agent.log" $serverHost == "<myhost>"
Kubernetes agent: $logfile == "/var/log/scalyr-agent-2/agent.log" $serverHost == "<agent pod name>"
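
If you want to narrow the results to likely problems, you can usually add a free-text term to the same filters. Treat this as a sketch; the exact keyword-search syntax is covered in the query language documentation:

$logfile == "/var/log/scalyr-agent-2/agent.log" $serverHost == "<myhost>" "error"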

If an error’s timestamp lines up with the period in which logs were dropped, that error is likely the cause, and we can investigate further based on the exception.

Lessons Learned from RESTful API Ingestion

You’ll Want to Use Your APIs Correctly

We’ve found that, besides ingesting with an agent, many of our customers enjoy ingesting messages through our RESTful API because of its flexibility and simplicity. However, even though it’s easy to use, users sometimes have trouble finding their messages in Scalyr after successful requests.

The most common cause is that Scalyr’s addEvents API requires timestamp parameters to be expressed in nanoseconds of UNIX time. If the timestamp is stale or in an incorrect format, the message’s timestamp will be assigned by our system instead (typically within minutes of the ingestion time).
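
To make that concrete, here’s a rough sketch of a request with a correctly formatted nanosecond timestamp. The endpoint and field names follow the addEvents API documentation as we understand it, and the token and session values are placeholders, so double-check the current API reference before copying this:

# Current UNIX time in nanoseconds (GNU date)
TS=$(date +%s%N)
curl -s https://www.scalyr.com/api/addEvents \
  -H "Content-Type: application/json" \
  -d '{
    "token": "<your Write Logs API token>",
    "session": "example-session-1",
    "events": [
      { "ts": "'"$TS"'", "attrs": { "message": "hello from addEvents" } }
    ]
  }'

A stale or malformed ts value here reproduces exactly the symptom described above.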

The key to avoiding this kind of mistake is to make sure API consumers review the API specification and understand all of the available options before optimizing their usage. As the API provider, it’s our job to keep the documentation up to date and emphasize the importance of using the correct format.

Lessons Learned from AWS Ingestion

Precise Exception Messages Could Save Your Day

If you have experience integrating third-party software into your application, you know it’s never as simple as the documentation suggests. Every application is unique, and you’ll need some trial and error to figure out the right steps.

One thing that can make the integration process less painful is having informative debugging logs when the integration fails. Scalyr has native integrations with various AWS services such as CloudWatch, RDS, and S3. Occasionally, our users run into misconfigurations that leave their AWS logs unavailable in their accounts.

Fortunately, Scalyr users can identify the potential causes by submitting a simple query in our UI. For example, you can search $tag == "S3BucketMonitor" and $tag == "cloudwatchMonitor" to see error messages and details of log ingestion from the S3 bucket and from CloudWatch, respectively. That gives users immediate guidance on what to fix, so they don’t have to guess at what’s causing failures.

Issues Are Inevitable, But Repeating Them Isn’t

At Scalyr, we build a platform to make DevOps engineers’ and SREs’ lives easier. Today, many companies have adopted agile development cycles and ship changes constantly, so it’s unrealistic to expect that your application will never run into an issue.

Once we accept that issues are inevitable, having the right tools makes troubleshooting far easier. On top of that, encouraging users to follow best practices can keep the same issues from recurring.

If you can take these two lessons with you, you might have more fun debugging production issues than you think!