Column-oriented databases have seen a resurgence in interest in recent years. The first column-oriented databases appeared decades ago. However, they have never gained a lot of traction in the market. In recent years, though, big data and cloud computing spurred a new interest in these types of databases. The reason for this is that the unique implementation of columnar databases fits surprisingly well into the workloads and applications run in cloud computing. More on this below.
In this post I’ll discuss the following:
- What are column-oriented databases? How do they differ from standard row-oriented databases?
- When should you choose a columnar database instead of a row-based one and vice versa?
- What are the popular use cases for using a columnar database?
- What are some commercial-grade column-oriented databases? And when should each one be used?
What Are Column-Oriented Databases?
Column-oriented databases are regular relational database management systems with full SQL support. The SQL syntax is usually the standard flavor you should be familiar with if you have ever worked with MySQL, PostgreSQL, and the like. The key difference is in the way the data in the database schema is stored in persistent storage (magnetic/SSD storage). In a row-oriented database, tables are stored with all the columns in one location (MyISAM, for instance, in one file or consecutive files). In contrast, in columnar databases, each column is stored in a separate location. This way if you want to do a SELECT query on the database and you need only one specific field, instead of retrieving all the tables into memory and filtering a column, you can fetch only the relevant column, which is much faster.
Row-Oriented vs. Column-Oriented Database Use Cases
As the cliche goes, in software engineering there are no perfect solutions, only trade-offs. Choosing columnar- or row-based databases is no exception. There are use cases for which a row-oriented database will be a better fit and vice versa.
Use Cases for Row-Oriented Databases
One thing that we need to remember when we are trying to understand whether we should use a row- or column-oriented database is a very important factor in real-life scenarios: popularity. Row-oriented databases are much more popular than their columnar counterparts. This fact has multiple important implications:
- If you need a DBA or a software engineer to set up, maintain, and optimize a database, it will be much easier and cheaper to find a professional that is familiar with row-oriented databases.
- There is much more documentation out there and industry knowledge in running production workloads based on row-oriented databases. If you run into an issue, it will be easier to fix it.
- The tooling ecosystem for row-oriented databases is much bigger. There are way more tools for MySQL, for instance, than there are for column-oriented databases.
What that means for you is that it will be cheaper and faster to set up a row-oriented database, and it will be easier to maintain. When we are talking about a mission-critical component as a DBMS, it’s not something to take lightly. So if price, the ease of hiring professionals to maintain the database, and the ease and speed of fixing issues are much more critical for you than performance, then row-oriented databases are the right choice for you.
Application Heavy on Writes
Another use case for row-oriented databases is an environment that has a lot of writes and is relatively thin on reads. If you store all the tables in one location, adding a row to this table requires access to only one location. On the other hand, if you use a columnar database to add a row to the table, you need to access each column location individually.
Use Cases for Column-Oriented Databases
Application Heavy on Reads
If your application is heavy on reads and light on writes, has lots of data, and you usually fetch only specific columns, then a column-oriented database can be a good choice. In this scenario, for a row-oriented database, it means that you need to load the entire table (or even schema, depending on the storage engine) to RAM before you start discarding irrelevant columns for your query. In contrast, in columnar databases, you can load only the relevant columns for your table. And if you need only one column, it means that you’ll access persistent storage only once. It will result in a massive performance gain. Real-world systems with this design are called OLAP (online analytical processing) systems. For example, a log aggregator is an OLAP system. Specifically, Scalyr is a log aggregator that uses an in-house column-oriented database internally to boost performance for its clients.
Another possible use case for column-oriented databases is when you have a lot of repetitions in your columns for the same values. Let’s say that you store referring domains to your website in a column. In our day and age of social media, it’s more than likely that you’ll see multiple occurrences of Facebook and Twitter as referrals, for instance. Many columnar databases use compression algorithms to efficiently store recurring values, thus gaining an additional performance boost. So if you have a lot of repetitive values, a columnar database is worth considering.
Examples of Column-Oriented Databases
Amazon Redshift is Amazon’s offering for a column-oriented database. It’s a good choice if you already host your deployment on AWS and use other AWS services. Amazon Redshift has excellent integration with other AWS products, such as Amazon Kinesis, Amazon Athena, and AWS Lake Formation. In addition, it provides dynamic scaling and handles all the system work for you (DBaaS). If you don’t want to manage your databases by yourself, or if you don’t have the capacity to, it’s worth considering as well.
Yes, you read it correctly. The well-known row-based database PostgreSQL has an option of column store as well. This is particularly handy if your team is already familiar with PostgreSQL and with using it in production environments. This way you can flatten the learning curve. An additional scenario for which PostgreSQL fits is when you want to use only one database for both use cases, column and row store. In addition, PostgreSQL is an open-source solution. This is something to keep in mind if you want to make changes to the source code so you can completely tune the database to your needs.
ClickHouse is an open-source database by Yandex that has a DBaaS offering as well. It is purely for column-oriented storage. This database excels in performance and, according to their blog, is substantially faster than AWS Redshift.
SQream is a DBaaS solution that fits reallybig data workloads (petabytes of data). In addition, it provides seamless integration with machine learning data sets to train and create reliable machine learning models based on massive data.
MariaDB is another popular open-source database that supports both row and column storage. Its benefit compared to PostgreSQL is its compatibility with the very popular MySQL. This allows you to enjoy a vast tooling ecosystem. The drawback compared to PostgreSQL is that it is not as feature rich as PostgreSQL.
In this post, I’ve explained what are column-oriented databases and the use cases for them. I’ve also provided a list of popular vendors in the field. As stated, before choosing a vendor, we need to assess first that a column-oriented database is right for your application. Once we have determined that it’s a good fit, we can select a vendor based on its specific strengths, price, and other factors.
Alexander Fridman wrote this post. Alexander is a veteran in the software industry with over 11 years of experience. He worked his way up the corporate ladder and has held the positions of Senior Software Developer, Team Leader, Software Architect, and CTO. Alexander is experienced in frontend development and DevOps, but he specializes in backend development.