99.99% uptime on a 9-to-5 schedule


Being “on call” is often the most dreaded part of server operations. In the immortal words of Devops Borat, “Devops is intersection of lover of cloud and hater of wake up at 3 in morning.” Building and operating sophisticated systems is often a lot of fun, but it comes with a dark side: being jarred out of a sound sleep by the news that your site is down — often in some new and mysterious way. Keeping your servers stable around the clock often clashes with a sane work schedule.
At Scalyr, we work hard to combat this. Our product is a server monitoring and log analysis service. It’s internally complex, running on about 20 servers, with mostly custom-built software. But in the last 12 months, with little after-hours attention, we’ve had less than one hour of downtime. There were only 11 pager incidents before 9:00 AM / after 5:00 PM, and most were quickly identifiable as false alarms, dismissible in less time than it would take for dinner to get cold.
In this article, I explain how we keep things running on a mostly 9-to-5 schedule.

Things We Don’t Do

There’s no point in avoiding a midnight beeper if you’re spending so much time grinding through QA that you’re still up at midnight anyway. Because we catch problems early (more on that below), we can spend less time on other safeguards. Some things we don’t do:

  • Manual QA, other than ad-hoc testing of new features.
  • TDD, or other formal QA methodologies.
  • Code coverage analysis.
  • Code review (usually).
  • Pair programming.

Nothing against any of these practices, but we haven’t found them necessary.

Catch Problems Early

Problems happen. Code has bugs, ops teams make mistakes, systems get overloaded or break. No matter how tight your operation, problems will occur.
The heart of our strategy is to catch problems early, before they become crises. Most problems are detectable in advance, if you look carefully enough. If you do manage to spot the early signs, you can identify and resolve a problem before it escalates to a critical production incident. This keeps your pager quiet. It also happens to greatly reduce customer-facing outages, if you’re into that kind of thing.
In the next few sections, I’ll discuss some of the techniques we use.

Monitor Tipping Points

Many production incidents have, as their root cause, “something filled up”. In 2010, Foursquare suffered a widely publicized 11-hour outage because one shard of their MongoDB database outgrew the available memory. Overflow can take many forms: CPU overload, network saturation, queue overflows, disk exhaustion, worker processes falling behind, API rate limits being exceeded.
These failures rarely come out of nowhere. The effect may manifest very suddenly, but the cause creeps up gradually. This gives you time to spot the problem before it reaches a tipping point.
Some tipping points are obvious, like a full disk. Some can be very subtle. For instance, a few weeks ago a critical background process in one of our servers started to bog down. It turned out there was heavy contention on one lock. In a few days, it would have reached the point where that background process couldn’t keep up, which would have caused a critical incident — we would have been unable to accept new log data. However, an alert warned us well before the tipping point, and we were able to casually make a small code fix and push it to production without incident.
It’s impossible to anticipate every possible tipping point, but a few rules of thumb will take you a long way:

  • Any form of storage can fill up. Monitor unused memory on each server, free space in each garbage-collected heap, available space on each disk volume, and slack in each fixed-size buffer, queue, or storage pool.
  • Any throughput-based resource can become oversubscribed. Monitor CPU load, network load, and I/O operations. If you’re using rate limiters anywhere, monitor those as well.
  • A background thread or fixed-size thread pool can choke even if the server has CPU to spare. Use variable-sized pools, or monitor the thread’s workload.
  • Lock contention can also cause a tipping point on an otherwise lightly loaded server. It’s difficult to monitor lock usage directly, but you can monitor something related. For some locks, we generate a log message each time we acquire or release the lock, allowing Scalyr’s log analysis to compute the aggregate time spent holding the lock (see the sketch just below).
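
To make the lock-logging idea concrete, here’s a minimal sketch in Python (illustrative names only; our actual instrumentation differs, but it produces the same kind of log line):

```python
import logging
import threading
import time
from contextlib import contextmanager

log = logging.getLogger("lock_metrics")

class InstrumentedLock:
    """Wrap a lock and log how long each holder keeps it.

    A log-analysis tool can aggregate heldMs per lock name and alert
    when total hold time creeps toward 100% of wall-clock time.
    """

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    @contextmanager
    def held(self):
        wait_start = time.monotonic()
        with self._lock:
            hold_start = time.monotonic()
            try:
                yield
            finally:
                # One easily parsed line per acquire/release pair.
                log.info("lock=%s waitMs=%.1f heldMs=%.1f",
                         self.name,
                         (hold_start - wait_start) * 1000,
                         (time.monotonic() - hold_start) * 1000)

# Usage: wrap the contended critical section.
index_lock = InstrumentedLock("logIndexLock")
with index_lock.held():
    pass  # ... work that holds the lock ...
```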

In each case, write an alert that will fire before the tipping point is reached. For instance, the library we use internally for rate limiting emits a log message whenever a limiter reaches 70% of the cutoff threshold.
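The limiter’s behavior is easy to sketch. It amounts to something like this (a simplified stand-in, not our internal library):

```python
import logging
import time

log = logging.getLogger("rate_limiter")

class WarningRateLimiter:
    """Fixed-window rate limiter that warns well before it starts rejecting.

    An alert watching for the "approaching limit" message fires before the
    tipping point, rather than after requests begin to fail.
    """

    def __init__(self, name, max_per_second, warn_fraction=0.7):
        self.name = name
        self.max_per_second = max_per_second
        self.warn_threshold = int(max_per_second * warn_fraction)
        self._window_start = time.monotonic()
        self._count = 0

    def allow(self):
        now = time.monotonic()
        if now - self._window_start >= 1.0:
            self._window_start, self._count = now, 0
        self._count += 1
        if self._count == self.warn_threshold:
            log.warning("limiter=%s approaching limit: %d of %d in current second",
                        self.name, self._count, self.max_per_second)
        return self._count <= self.max_per_second
```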
You can build out your tipping-point monitors by paying attention to outages and close calls. After each incident, ask yourself how you could have seen it coming. Sometimes, the hints are already there in your log, and you just need to add an alert. Other times, you might need to make a code change to log additional information.
Finally, I like to throw in a few alerts on high-level parameters like CPU usage and site traffic. When our traffic hits twice its current level, an alert email will fire. This doesn’t indicate any specific problem; it’s just a reminder for me to review our capacity. It may also call my attention to an unusual event, such as unanticipated press coverage. When the alert becomes noisy, I’ll give it a higher threshold.
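
Conceptually, that traffic alert is nothing more than a threshold against a baseline that gets revisited by hand (the real check runs as an alert over our access logs; the numbers below are made up):

```python
# Capacity-reminder check, conceptually. The baseline is a made-up figure,
# bumped whenever the alert fires and capacity has been reviewed.
BASELINE_REQUESTS_PER_MINUTE = 10_000

def should_send_capacity_email(current_requests_per_minute):
    # A reminder to review capacity, not a sign of an immediate problem.
    return current_requests_per_minute >= 2 * BASELINE_REQUESTS_PER_MINUTE
```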

Run Realistic Tests

QA—automated or manual testing—is the traditional method for detecting code bugs before they affect users. However, traditional QA has limitations:

  • Thorough coverage requires a lot of ongoing work.
  • It’s hard to detect performance problems and other operational, systems-level issues. These are precisely the issues most likely to set off a pager.
  • It’s hard to fully verify user interface behavior, especially visual appearance, in an automated way.
  • It can miss logic errors that only manifest in complex, real-world usage scenarios.

As a startup, we have a limited time budget for traditional QA. We write quite a few automated tests, but we don’t follow a formal testing process, and we do very little manual QA. Instead, we use several low-effort mechanisms that complement traditional QA.
First, we maintain a staging instance that runs a very realistic workload: it monitors our production servers. Because the staging instance is our primary tool for watching production, we spend a lot of time there. Many problems are uncovered as a natural consequence of our normal operations work. And as a complete (if downscaled) mirror of the production system, the staging instance can flush out complex systems-level issues. We even shard the staging instance across multiple backend servers, though it isn’t big enough to need sharding, so that it faithfully mirrors the production setup.
The same extensive alerts that monitor our production environment are also applied to staging, flagging any issues we don’t notice directly. Because problems on the staging instance don’t affect customers, we don’t send these alerts to PagerDuty. Instead, we get an e-mail and address the problem during regular working hours.
Second, we maintain a load-test instance which is loaded more heavily than any single production server. This instance receives a steady flow of logs from over 100 simulated clients, sending random snippets of real-world log data. This is a good way to detect problems that manifest under heavy load, before a load spike brings down the production servers.
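A workable load generator doesn’t have to be elaborate. Here’s a stripped-down sketch of the idea (the endpoint, file name, and numbers are placeholders, not our real setup):

```python
import random
import threading
import time
import urllib.request

# Placeholder ingestion endpoint and sample corpus.
INGEST_URL = "https://loadtest.example.com/api/uploadLogs"
SAMPLE_LINES = open("sample_production_logs.txt", "rb").read().splitlines()

NUM_CLIENTS = 100            # simulated clients
LINES_PER_BATCH = 50
SECONDS_BETWEEN_BATCHES = 1.0

def simulated_client(client_id):
    while True:
        # Send a random snippet of real-world log data, as a real agent would.
        start = random.randrange(max(1, len(SAMPLE_LINES) - LINES_PER_BATCH))
        body = b"\n".join(SAMPLE_LINES[start:start + LINES_PER_BATCH])
        req = urllib.request.Request(INGEST_URL, data=body,
                                     headers={"X-Client-Id": str(client_id)})
        try:
            urllib.request.urlopen(req, timeout=10).read()
        except OSError:
            pass  # failures surface through the load-test instance's own alerts
        time.sleep(SECONDS_BETWEEN_BATCHES)

threads = [threading.Thread(target=simulated_client, args=(i,), daemon=True)
           for i in range(NUM_CLIENTS)]
for t in threads:
    t.start()
time.sleep(3600)  # run for an hour
```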

Monitor Exceptions

Exceptions can be great early warning signs. We send an alert for every exception generated by our servers. (With a few, um, exceptions. See the section on noisy alerts, below.) This helps us identify:

  • Problems encountered by a user, even if they don’t report them to us.
  • Problems on the staging server that we didn’t notice directly.
  • Bugs in background tasks.
  • Low-probability race conditions.

A bonus is that we get to impress users by proactively reaching out to them before they even complain about an error.
We enhance the effectiveness of exception monitoring by heavily annotating our code with assertions and warning checks. If code is behaving in an unexpected way, or a data structure contains inconsistent values, it is likely to be caught by an assertion.
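As a sketch of what that annotation looks like in practice (an illustrative helper, not our actual API): a check that logs a stack trace for the exception alert to pick up, instead of crashing the server the way a hard assert would.

```python
import logging
import traceback

log = logging.getLogger("server_warnings")

def warn_if(condition, message, **context):
    """Log a warning, with stack trace, when an "impossible" condition occurs.

    An alert watching for ASSERTION-WARNING turns every violated assumption
    into an email, without taking the server down like a hard assert would.
    """
    if condition:
        log.error("ASSERTION-WARNING: %s context=%r\n%s",
                  message, context,
                  "".join(traceback.format_stack(limit=10)))

# Example: a consistency check dropped into the middle of normal code.
def apply_update(index, expected_size):
    warn_if(len(index) != expected_size,
            "index size mismatch",
            actual=len(index), expected=expected_size)
    # ... continue processing; the warning gets investigated during work hours.
```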

Roll Out Changes Gradually

We roll out major changes gradually, starting on the staging server. Many changes are also guarded by a user whitelist. We sometimes let an update “bake” on our internal account for as long as a month, then gradually enable it for other users. Major bugs are usually found on the staging server; smaller bugs are caught once a change has rolled out far enough to affect one or two production users.
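
The gating mechanism behind that whitelist can be very simple. A sketch of the idea (feature and account names here are made up):

```python
import zlib

# Per-feature rollout state, widened by hand as confidence grows.
ROLLOUT = {
    "new_query_planner": {
        "staging": True,                    # always on in staging
        "accounts": {"scalyr-internal"},    # explicit whitelist ("bake" period)
        "percent_of_accounts": 0,           # later raised: 0 -> 5 -> 50 -> 100
    },
}

def feature_enabled(feature, account_id, is_staging=False):
    stage = ROLLOUT.get(feature)
    if stage is None:
        return False
    if is_staging and stage["staging"]:
        return True
    if account_id in stage["accounts"]:
        return True
    # Stable bucketing, so an account gets the same answer on every request.
    return zlib.crc32(account_id.encode()) % 100 < stage["percent_of_accounts"]
```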

Clean Up Noisy Alerts

Early warnings are useless if they’re lost in a babble of false alarms. To catch problems, we have a lot of alerts and we set thresholds aggressively. This can lead to false warnings. When that happens, we adjust the alert. Scalyr’s log alerting features give us a variety of ways to screen out false alarms:

  • A freshly restarted server process takes a minute or two to settle into normal behavior; this triggers various alerts. We add an “AND server_uptime_seconds > 120” clause to those alerts.
  • We run various background tasks on a regular 60-second or 300-second cycle. We alert if any task isn’t running on schedule. The schedule can be briefly, and harmlessly, thrown off if a server is busy with higher-priority work. We use time averaging and grace periods to avoid alerting for these routine interruptions.
  • We use Amazon’s log retrieval API to import logs from RDS databases, and we alert on API errors. Amazon can be a bit flaky here, occasionally returning a brief spate of “InternalFailure” errors for no obvious reason. This particular error message seems unlikely to indicate a problem on our end, so we added an AND term to the alert trigger to screen out that specific message. This ensures that we’ll be notified of any other API errors that might arise, without polluting our alert channel with ignorable Amazon glitches.
  • For usage-dependent conditions, we use ratio triggers, which alert when some event occurs at an unexpected rate relative to overall traffic (see the sketch after this list).
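
In pseudocode, a ratio trigger amounts to something like this (a conceptual sketch, not Scalyr’s alert syntax):

```python
def ratio_alert(event_count, total_requests, expected_rate,
                tolerance=3.0, min_requests=1000):
    """Fire when an event happens at an unexpected rate relative to traffic.

    Comparing against overall traffic, rather than an absolute count, keeps
    the threshold meaningful whether the site is quiet or busy.
    """
    if total_requests < min_requests:
        return False  # too little traffic to judge a ratio
    return (event_count / total_requests) > expected_rate * tolerance

# e.g. a particular error normally accompanies ~0.5% of requests;
# alert if it climbs to several times that rate.
fire = ratio_alert(event_count=240, total_requests=20000, expected_rate=0.005)
```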

We also think hard about which alerts should set off a pager. Most of our alerts go to e-mail, allowing us to investigate potential issues at a convenient time. Only alerts indicating a serious and immediate problem are sent to PagerDuty. On the rare occasion that a false alarm goes to PagerDuty, we look for a way to tighten the alert condition, or consider removing that alert from the PagerDuty list. In practice, it has not been difficult to strike a good balance; we’ve been aware of every serious incident within minutes.

Summary

Alerts aren’t just for telling you when your site is down. Used properly, they can warn you well in advance, allowing you to fix problems on a convenient schedule and without affecting users.
It’s important to have a robust staging instance, and to monitor it as extensively as production. Start with broad, aggressive alerts, and iteratively narrow them to eliminate false alarms. It helps to use an alerting tool that can trigger based on complex conditions in your server logs.
These techniques—and a little luck—have helped us hit 99.99% uptime while waking up the on-call engineer less than once per quarter. Hopefully they can do the same for you.