Some things aren’t always what they seem.
You’re tasked with engineering a solution that your organization needs. You implement it with a tool that seems relatively easy to set up. But over time, you realize that there’s no Easy button.
Elasticsearch is an example of one of those things. It’s a great product for collecting event data fairly quickly and easily. You start with one data node in one cluster and go from there. And because it’s free and open-source (for now), it’s even better. But as your Elasticsearch cluster grows and collects more data, you start to have some scaling issues. In this post, I’m going to provide some information on scaling an Elasticsearch implementation, as well as some general recommendations for proactive ways to scale Elasticsearch.
Scaling Data Ingestion
One of the first issues you’ve probably run into is with indexing. By default, Elasticsearch refreshes its index every second, so that data becomes searchable. As indexed data grows, this quickly becomes an obvious bottleneck.
Elasticsearch ingests documents into various indices. The more documents, the higher the ingestion rate. Performance degrades and ingestion becomes a bottleneck as documents are distributed across more shards over more data nodes in a cluster.
This distribution is done, in part, to make searching and analyzing data faster later but has its drawbacks.
If you’re collecting data in the order of gigabytes per day, it’s usually fine to ingest data this way. But once you start to get into terabytes per day, you’ll see some of the following challenges with indexing.
Increased Data Node Resources
As the number of documents being collected increases, processing on the Elasticsearch nodes increases as well. So ingesting this data will take up more CPU, memory, and storage. And on an underpowered data node, ingestion will inevitably slow down to a crawl.
Increased Index Size
With more documents, the index size will continue to grow. This will take up more disk space on your data nodes. Also, with each refresh, more processing capacity is needed to keep up.
Increased Number of Shards
Elasticsearch breaks up an index into many primary shards. For redundancy purposes, it also creates a replica for each primary shard. So more shards mean more indices to maintain—and even more work for you.
These are challenges that force your data nodes to slow down data ingestion if you don’t do something. Elasticsearch tries to help out, by default. For example, the one-second refresh interval is applied only to indices that users query in the last 30 seconds. Otherwise, it doesn’t get refreshed. That helps a little.
But what if that’s not your situation? Or you want to be more proactive? You have some options.
Speeding up Indexing
If you’re running into any of the challenges above, here are some steps you can take to improve data ingestion.
Get More Data Node Resources
Indexing uses up a lot of CPU and memory resources. A quick fix is to add more data nodes. But you might need more resources too. So look at increasing resources on your existing nodes, especially disk space for index and shard sizes.
Increase Refresh Interval
If you have lots of dynamic data that your users constantly search for, consider increasing the refresh rate from one second to anything less frequent. You can do this with the index.refresh_interval setting.
But be careful with this. Leave it too frequently, and speed increases will be negligible. Make it too infrequent, and you’ll have no data ready for search. You have to test it and find the sweet spot for your organization.
Send Bulk Operations
By default, Elasticsearch indexes documents individually. Look at indexing many documents at once instead. These bulk operations will improve data ingestion efficiency.
But be sure to test what’s optimal for your organization. Start small, with 50 or 100 documents, and slowly increase. Otherwise, you’ll once again adversely affect memory and CPU.
Using a combination of the above and other scaling options, you can improve how long Elasticsearch takes to process incoming event data.
Elasticsearch is considered to be the heart of the Elastic Stack. The whole purpose of storing any of your data is to use Elasticsearch to look for collected events and analyze them. But as Elasticsearch collects more documents, you’ll need to scale how to search this data. Otherwise, you’ll have some unhappy users.
Search Query Challenges
The following are some challenges you’ll likely encounter with searching data in Elasticsearch.
Increased Number of Queries
As your system collects more data, there are more events and fields that you and your users can search and analyze. This creates an increased number of search queries run against Elasticsearch. These queries add more overhead and inevitably increase CPU and memory usage on your data nodes, slowing them down.
User-Expected Response Times
Whether Elasticsearch is used for infrastructure observability or an application data store, you’ll have users who expect faster and faster response times. It’s one thing to ensure you’re getting accurate data, but having the pressure to get it within a user-expected reasonable time is an added challenge.
With power comes great responsibility. Users with the ability to run various types of queries are likely to run a wildcard query of some kind. This will slow down your nodes significantly, especially if the search happens across remote clusters.
Again, Elasticsearch tries to help out with ways to maintain good search performance. For example, default searches return only the top 10 hits. Again, it’s helpful, but that’s not enough.
Speed up Queries
You can mitigate any of the scaling search challenges above, and others, to some degree. The following are some options to consider implementing.
Optimize Shards for Queries
As data grows, so will the number and size of shards used to store documents. Querying large shards or too many shards both slow down the search. But depending on the amount of data your clusters have, a small number of shards can have the same effect. So run some benchmarks on your data to determine the right number and size of shards.
Use Custom Routing
When Elasticsearch stores documents in an index, they’re routed to a specified shard in that index. The selected shard is based on doing a hash of the documents’ _id field, which it passes to the _routing field. It then divides it by the number of primary shards and takes the remainder.
shard_num = hash(_routing) % num_primary_shards
During a search, a data node sends out a broadcast to all index shards for the data the query is looking for. The more shards to search, the longer the search will take.
Instead, set the _routing field so that the node knows exactly which shard contains the queried data. This is custom routing. With it, you can specify the shard to search, thus removing the need for broadcast messages and multiple shard searches.
But be careful with this setting. It can lead to selecting some of the same nodes to store data, leading to increased processing on certain nodes and not others.
Implement User Restrictions
Users not only have expected response times but can also introduce problems that impact response time. Remember the long-running queries? Consider providing your users with documentation that outlines search best practices specific to your organization.
With these scaling options, you have a shot at maintaining good search response time as your Elasticsearch clusters grow.
Best Practices With Limits
For the above options for scaling Elasticsearch, there’s a certain level of testing you must do. Elasticsearch isn’t one size fits all. There aren’t one or two things that can fix all your scaling issues. A lot of it is very use-case specific.
Below are some general guidelines and best practices to ensure you avoid more scaling problems.
- Plan your shards. Develop a sharding strategy that takes into account the number and size of the shards across Elasticsearch clusters.
- Optimize disk storage. Improve index lifecycle to maximize data node disk storage. Implement separate nodes for hot, warm, cold, and frozen indices if needed.
- Decentralize node types. By default, various Elasticsearch node types—such as master nodes, data nodes, and ingest nodes—can install on one physical node. Distribute these node types across separate environments to improve scalability.
- Keep Elasticsearch updated. Stay up-to-date with upgraded Elasticsearch versions. A scaling issue today could be minimized or removed with more recent versions.
- Limit index size. Prevent indexing documents over 100MB, which will increase processing speed and network resources.
- Add a message queue to your architecture. Consider using a message queuing system, like Kafka, in front of Elasticsearch to help handle and queue data collection.
- Monitor the monitor. Create a separate cluster to monitor Elasticsearch. This way as performance degrades, you’ll likely see it before it impacts users.
Each of the above practices will help your Elasticsearch deployment scale better. However, each adds additional complexity too. Planning is good but takes time. Getting the resources for extra nodes takes money. Separate clusters increase management overhead.
All of these are limitations you have to consider when scaling Elasticsearch. So account for the tradeoffs as well.
Now that you’ve seen how you can scale Elasticsearch, it’s time to get to work. You can scale by increasing the resources of your data nodes. Other options include changing your Elasticsearch deployment architecture or making configuration changes. Depending on your scaling issues, you can try one or all of these things.
But what if you don’t want to deal with any of this? Some of the same scaling problems you fix now could come back later as Elasticsearch collects and stores more data.
If you’re not interested in troubleshooting all these scaling issues, there’s nothing wrong with considering other options. Scalyr is an alternative that does ingestion quite different from Elasticsearch. Check out this YouTube video on how Scalyr is different and see it for yourself. If you’ve tried a number of the scaling options above and elsewhere, and you’re still looking for something else, give Scalyr a shot. There’s a demo you can try out.
This post was written by Jean Tunis. Jean is the principal consultant and founder of RootPerformance, a performance engineering consultancy that helps technology operators minimize cost and lost productivity. He has worked in this space since 1999 with various companies, helping clients solve and plan for application and network performance issues.