SentinelOne | Service Availability: What It Is and Metrics You Should Know

One of the most effective ways to improve your business is to impress your customers and keep them satisfied. The first step is keeping the promises you make. Your customers come to you with certain expectations, and one of the most crucial is getting the service they need when they need it. That makes service availability essential to keeping customers happy. To help you with that, let’s start by understanding what service availability is and how to improve it. We’ll also look at important service availability metrics to measure and how these affect the service-level agreement.

What Is Service Availability?

There are different types of services. Physical stores in your locality, like a shopping center or a cinema, are commonly open only for certain hours in a day, whereas hospitals and websites provide service throughout the day. The IT world never sleeps. So, when it comes to IT, most of the services provided through applications run round the clock.

When an enterprise and a customer enter into a commitment, the customer expects to get service when they want it. But depending on the business, some provide service 24/7 (e.g., e-commerce applications), while others provide service during set hours (e.g., bank KYC verification). When the period of service is defined, customers are prepared for it, and they expect to be served during that promised period. Service availability is simply the measure of the service being available and accessible to the customers during the time you promised to keep the service available. It’s usually calculated as a percentage.

Let’s say that you own an e-commerce application and you promise to provide service round the clock. The period you promise is called the agreed service time. The period when your service is not available is called downtime. Service availability can be calculated using the following formula:

Service availability (%) = (Agreed service time - Downtime) / Agreed service time x 100

Let’s say that your service was down for five minutes in a day. So the downtime is 5 minutes and the agreed service time is 24 (hours/day) x 60 (minutes/hour) = 1,440 minutes. The service availability would be as follows:

Service availability = (1,440 - 5) / 1,440 x 100 ≈ 99.65%
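The calculation above can be sketched in a few lines of Python, taking availability as the share of the agreed service time that was not downtime. The function name and its argument names are illustrative, not part of any particular tool.

```python
def service_availability(agreed_minutes: float, downtime_minutes: float) -> float:
    """Return availability as a percentage of the agreed service time."""
    if agreed_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_minutes <= agreed_minutes:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return (agreed_minutes - downtime_minutes) / agreed_minutes * 100


agreed = 24 * 60  # 1,440 minutes of round-the-clock service in one day
availability = service_availability(agreed, downtime_minutes=5)
print(f"{availability:.2f}%")  # prints 99.65%
```

The same function works for any window (a week, a month, a year) as long as both arguments use the same unit.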

We should aim for high availability. Having service availability at 100% is great. But we don’t live in a perfect world. So how high is good enough?

The “Nines” of Service Availability

Any value of uptime below 99% is considered bad. But that doesn’t mean any value above 99% is good. The concept of nines describes how each additional nine in our service availability makes a huge difference.

Although 99% might sound good, when you look at it in terms of duration of downtime, you see that it’s a huge deal: over a year, 99% availability allows about 3.65 days of downtime. What if an e-commerce website went down for more than 3.5 days? Imagine the loss in business. A good value of service availability is 99.999% (five nines), which allows only about five minutes of downtime a year. Anything above that is really awesome! So, when building applications, you should aim for at least 99.999% availability if you want happy customers.
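To see how much each extra nine matters, here is a small sketch that converts an availability percentage into the downtime it allows per year. It assumes a 365-day year and round-the-clock agreed service time; the names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year that a given availability percentage still permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in ("99", "99.9", "99.99", "99.999"):
    minutes = allowed_downtime_minutes(float(level))
    print(f"{level}% -> {minutes:,.1f} minutes/year ({minutes / 1440:.3f} days)")
```

With these assumptions, 99% works out to 5,256 minutes (3.65 days) a year, while five nines leaves a little over five minutes.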

Availability Metrics

To know where you stand on service availability, you need stats. While building a system, you decide on certain availability benchmarks the system should meet. But to know if it’s happening in reality, you need to monitor some metrics. Here are the four most important service availability metrics to monitor.

Downtime Duration

This metric gives you the time for which the system is down. Downtime duration is an important variable to calculate service availability. Along with that, there are various conclusions you can make using this metric. If you have a backup system designed to kick in when the main system fails, downtime duration can tell you how long it takes for the backup to kick in. If the cause of downtime is being fixed manually, downtime duration can help you understand the efficiency of your incident response team. You can calculate downtime duration for different time frames: a day, a month, a year, or whatever window is relevant for your use case.
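As a minimal sketch, downtime duration for a chosen window can be computed from a list of outage start and end times. The outage records below are made up for illustration, and the code assumes that outages do not overlap each other.

```python
from datetime import datetime, timedelta

# Hypothetical outage records: (start, end) of each period the service was down.
outages = [
    (datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 2, 12)),
    (datetime(2024, 3, 9, 14, 30), datetime(2024, 3, 9, 14, 45)),
    (datetime(2024, 4, 2, 8, 0), datetime(2024, 4, 2, 8, 5)),
]


def downtime_in_window(outages, window_start, window_end):
    """Sum the downtime that falls inside [window_start, window_end)."""
    total = timedelta()
    for start, end in outages:
        # Clip each outage to the window so partial overlaps count correctly.
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total


march = downtime_in_window(outages, datetime(2024, 3, 1), datetime(2024, 4, 1))
print(march)  # prints 0:27:00
```

Changing the two window arguments gives the figure for a day, a month, or a year from the same records.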

Downtime Frequency

This metric tells you how often the system is down. In most cases, when there’s high downtime frequency, you can find a pattern or a particular reason for the outage. Like downtime duration, downtime frequency can be calculated for different time frames. Regular system downtime and overall high downtime frequency usually indicate an issue in the hardware or software components of the system.

Let’s look at the example of an e-commerce website. E-commerce websites usually have good offers during Black Friday week. This attracts a lot of customers within a given time frame. So, if your e-commerce website is never down throughout the year but has high downtime frequency during Black Friday week, it could mean that your system is lacking the hardware resources to cater to the traffic.

Although you can use downtime duration and downtime frequency individually, analyzing both these metrics together gives you even more specifics. For example, say you have a downtime duration of one hour a month. Just the value of downtime duration alone will not give you many details. But if you have downtime frequency as well, you’ll know whether the downtime of one hour was in one stretch or multiple downtimes that contributed to a total of one hour.
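The one-hour example can be sketched as follows. Both hypothetical months have the same total downtime, and only the frequency and per-outage figures tell them apart. The data and the `summarize` helper are illustrative.

```python
from datetime import timedelta

# Two hypothetical months that both add up to one hour of downtime.
month_a = [timedelta(minutes=60)]      # one long outage
month_b = [timedelta(minutes=5)] * 12  # twelve short outages


def summarize(outage_lengths):
    """Combine downtime duration and downtime frequency for one window."""
    total = sum(outage_lengths, timedelta())
    count = len(outage_lengths)
    return {
        "frequency": count,
        "total_downtime": total,
        "mean_outage": total / count if count else timedelta(),
        "longest_outage": max(outage_lengths, default=timedelta()),
    }


print(summarize(month_a))
print(summarize(month_b))
```

A single long outage and many short ones usually point to different causes, so it can help to report both numbers side by side.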

System Uptime

This metric tells you how long your system was available. You can calculate this as a time duration or as a percentage. System uptime can be used to tell your customers how well you provide a service. Monitoring system uptime continuously and confirming that its value stays consistent can help you define terms in your service-level agreement.

User Uptime

Just because the system is up and running doesn’t mean that the customers are getting serviced. When the system lacks the resources to cater to all the requests, it may go down completely, or it might go down only for certain users. In such cases, some users experience no downtime while others do. User uptime is a metric that measures uptime for each user or group of users.
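One way to approximate user uptime is to count, per user, how many of their requests or health probes succeeded. This is only a sketch under that assumption. The probe records and the `user_uptime` function are hypothetical, and a real system might group by region or customer segment instead.

```python
from collections import defaultdict

# Hypothetical per-user probe results: (user_id, request_succeeded).
probes = [
    ("alice", True), ("alice", True), ("alice", True), ("alice", True),
    ("bob", True), ("bob", False), ("bob", False), ("bob", True),
]


def user_uptime(probes):
    """Return the percentage of successful probes for each user."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for user, succeeded in probes:
        total[user] += 1
        ok[user] += succeeded  # True counts as 1, False as 0
    return {user: ok[user] / total[user] * 100 for user in total}


print(user_uptime(probes))  # prints {'alice': 100.0, 'bob': 50.0}
```

Here the system as a whole never went down, yet one user saw half of their requests fail, which system uptime alone would not reveal.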

These are the most important and primitive metrics for service availability. These metrics, alongside other information, such as timestamps, location, network segments, and application components, will help you get the complete picture of your service availability.

Once you’re clear about where you stand on service availability, you should have enough knowledge to make commitments to your customer. These commitments are set out in a service-level agreement.

Service-Level Agreement

A service-level agreement (SLA) is an agreement between the service provider and the service receiver (customer). An SLA contains the terms and conditions under which the service is provided. Typically, it contains the following:

Details about the service being provided
Service availability
Metrics used to measure service availability and reliability
Actions that will be taken if a commitment is not met

Service availability is a crucial part of SLAs and can lead to penalties if not fulfilled. Hence, before creating any SLA, be sure to understand your system and potential issues. Run multiple tests to get service availability values and check their consistency. Finally, have a contingency plan for when there’s unexpected downtime.
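Before committing to a number, it can help to translate it into a downtime budget and compare it with what you actually measured. The sketch below assumes a hypothetical 99.9% commitment over a 30-day month. The function names and figures are illustrative and not taken from any real SLA.

```python
def downtime_budget_minutes(sla_percent: float, agreed_minutes: float) -> float:
    """Downtime the SLA tolerates within the agreed service time."""
    return agreed_minutes * (1 - sla_percent / 100)


def sla_status(sla_percent: float, agreed_minutes: float, downtime_minutes: float) -> dict:
    """Compare measured downtime against the SLA's downtime budget."""
    budget = downtime_budget_minutes(sla_percent, agreed_minutes)
    return {
        "budget_minutes": budget,
        "remaining_minutes": budget - downtime_minutes,
        "sla_met": downtime_minutes <= budget,
    }


# Hypothetical 30-day month with a 99.9% commitment and 50 minutes of downtime.
status = sla_status(99.9, agreed_minutes=30 * 24 * 60, downtime_minutes=50)
print(status)
```

In this example the budget is roughly 43 minutes, so 50 minutes of downtime would breach the commitment. Running the same check against past months shows whether a target is realistic before it goes into an SLA.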

Improving Service Availability

As mentioned earlier, we don’t live in a perfect world. And that leaves room for improvement. Here are some tips to improve service availability:

Communicate and understand customer requirements.
Run tests.
There’s always a chance of failure. Have a plan B.
Implement fail-safes.
Eliminate single points of failure.
Consider scaling issues.
Perform continuous monitoring.
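To illustrate why eliminating single points of failure helps, here is a rough sketch of how component availabilities combine. It assumes that components fail independently of each other, which is rarely exactly true in practice, and the figures are made up.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up, so each one is a single point of failure."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """At least one of several identical, independently failing copies is up."""
    return 1 - (1 - availability) ** copies


# Two hypothetical components in a chain, each 99% available.
single = serial(0.99, 0.99)
# The same chain with a backup copy of each component.
with_backup = serial(redundant(0.99, 2), redundant(0.99, 2))
print(f"{single:.4%} vs {with_backup:.4%}")
```

Under these assumptions the chain without backups is available about 98.01% of the time, and adding one backup per component raises that to about 99.98%.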

Monitoring Availability

Service availability is no joke. Any compromise on availability has serious effects on business. You can’t just come in once a year, collect values from metrics, and expect to build a highly available system. Service availability needs continuous monitoring and clarity on data. Service availability might sound simple, and it doesn’t take much to explain to someone what it is. But when you’re on the path to achieving highly available systems, things get complex. You have to deal with various primitive and derived metrics, and you need clear visibility and a lot more. For all of this to happen, you need to choose a good tool.

Scalyr is one such tool that can help you with monitoring availability. You can use Scalyr to monitor and measure primitive metrics and also derive custom metrics. Its metrics and dashboards are generated from granular service health event data and logs. Scalyr’s data ingestion, storage, and sub-second query response give companies an opportunity to offer better service availability and more accurate service-level agreements to customers. Scalyr also provides an event data cloud service to help you improve your services. The point is that Scalyr is a complete package with everything you need for monitoring services and systems for availability and remediating service issues. If that’s what you’re looking for, you can give Scalyr a try.

This post was written by Omkar Hiremath. Omkar is a cybersecurity analyst who is enthusiastic about cybersecurity, ethical hacking, data science, and Python. He’s a part-time bug bounty hunter and is keenly interested in vulnerability and malware analysis.