Databases are the foundation of most software applications. And much like when building a house, consideration needs to be given to make sure the foundation is right. When we build a house, we ask a multitude of questions before laying any foundation. What type of soil is it? What size and shape are the house? Where do the services need to run? What type of weather is common in the area?
And yet, for most of us, when we build a software application, we simply grab whichever relational database management system (RDBMS) we’re most familiar with. Sure, we consider the topology of the tables, the datatypes, keys, and indices. But that’s like making sure the concrete is the right shape, is the right thickness, and is level. What if concrete is the wrong material for the job?
In this post, I’ll introduce an alternative database model, called a columnar database. In doing so, I’ll compare it to the familiar row-based RDBMSs and present pros and cons. The important thing to remember is that picking a database is about fit for your application. As such, I’ll also provide examples of when a columnar database is appropriate.
What Is a Columnar Database?
A columnar database does what it says on the tin. It stores data in columns rather than the rows of a traditional row-based database, such as an RDBMS. To do so, columnar databases alter the traditional structure on disc we’re all familiar with.
In a row-based relational database, data are stored contiguously by row. For example, a users table may be stored like:
1,Levi,24,Null,[email protected],108;2,Maya,56,Null,[email protected],Null;3,Elon,31,18845632724,[email protected],2343;
However, in a columnar database, the same data are stored like this:
1,2,3;Levi,Maya,Elon;24,56,31;Null,Null,18845632724;[email protected],[email protected],[email protected];108,Null,2343;
If you look carefully, you can see that both examples store the following data:
+----+------+-------+-------------+----------------+-------+ | ID | Name | Age | Mobile | Email | Karma | +----+------+-------+-------------+----------------+-------+ | 1 | Levi | 24 | Null | [email protected] | 108 | | 2 | Maya | 56 | Null | [email protected] | Null | | 3 | Elon | 31 | 18845632724 | [email protected] | 2343 | +----+------+-------+-------------+----------------+-------+
Columnar Database Topology
Of course, if you’re going to store data differently on disc, it makes sense that you present the data to users in an appropriate way. To do so, columnar databases use a concept called a keyspace to group related information. Inside the keyspace, a columnar database stores column families. For simplicity, you can think of the keyspace like a RDBMS schema and column families like tables. However, that’s where the similarity ends.
- Inside a keyspace, column families float around; they’re not linked into a rigid structure by foreign keys.
- Column families consist of data about a topic presented in multiple rows.
- Rows, however, can contain a different number of columns to each other, as well as data from different columns.
- Therefore, instead of spanning all rows, each column is contained within a row. The column data is presented as a key-value pair and a timestamp.
Consider making a “users” column family from the above example that returns the items Levi, Maya, and Elon. The Levi row might have the columns Age, Email, and Karma. Meanwhile, the Maya row might only have Age, and Email. Finally, Elon might have Age, Mobile, Email, and Karma. As a result, a query of the column family returns only the columns that are relevant to each person.
Comparing Columnar to Row-Oriented Databases
Columnar databases are fundamentally different to row-oriented databases such as RDBMSs. The following table summarizes the key differences.
|Columnar Databases||Row-Oriented Databases|
|Data are stored and read in columns||Data are stored and read in rows|
|Slow to write new records or update multiple columns in a record||Fast to write new records and update multiple columns in a record|
|Fast and easy to add new columns (horizontal scalability)||Slow to add new columns|
|Suited to online analytical processing (OLAP)||Suited to online transactional processing (OLTP)|
|Suits sparsely populated data||Suits densely populated data|
|Efficient at performing operations across a whole dataset, e.g., aggregation||Inefficient at performing operations across a whole dataset|
|Efficiently compressed||Inefficient compression|
|Suits distributed storage||Suits single-machine storage|
Pros and Cons
At first glance, it might seem a bit strange to store data by column rather than row. So, what are the pros and cons of columnar databases?
- Certain queries are much faster. For example, to perform an aggregation query on your users to find the average age, the drive can seek directly to the age column to read and average 24, 56, and 31. This is much faster than seeking to each of the rows and finding the column data one by one.
- More efficiently compressed. Column stores can be much more efficiently compressed than row stores because contiguous data in a column is all of the same type.
- Flexibly scaled. Adding new columns to a columnar database is very quick and easy due to the way data are stored. Similarly, data are easily distributed across multiple machines.
- Great for OLAP. This comes from the strength in performing aggregation queries and collecting all the data from multiple columns very quickly and efficiently.
- Slower write speed. Due to the way data are stored on disc, the database management software can’t just dump the row at the end of a list.
- Certain queries aren’t well suited. If you’re going to be searching multiple column values for individual records over and over, rather than aggregating or getting a single column for multiple records, columnar databases are probably not what you’re looking for.
- Not good for OLTP. OLTP involves a lot of reading and writing of multiple columns for individual items.
Example Use Cases
As stated in the previous sections, there are certain applications that suit columnar databases, and other applications that definitely do not. For example, it’s probably not a good idea to run an online shopping website on a columnar database. Firstly, you’ll probably be reading a lot of columns about individual items, rather than column information across multiple items. And secondly, the slow read and write speeds or full records will make it difficult to update and keep track of stock. No matter what you do with the backend and frontend code, a columnar database will likely cause nightmares for you and turn off your customers.
Simply put, you want to use a columnar database when you’re analyzing large volumes of data.
- When you want to improve the sales on your online store. You can take a snapshot of your row-based inventory OLTP database and convert it to a columnar store. This will allow you to efficiently find out what’s selling best, when it’s selling, what type of users buy which items, which marketing tactics are working, and so on.
- For analyzing your application logs. Again, take a snapshot of your application’s OLTP database and convert it onto a columnar store. Then you can perform efficient analysis on anything you like—for example, understanding user behavior, troubleshooting problems, auditing security, or testing whether your experiment succeeded. If you’re interested in this use case, Scalyr’s logging analytics is built using a columnar database. Try it here.
- When performing business intelligence tasks. This one speaks for itself: business intelligence is all about quickly analyzing and understanding large volumes of historical data.
Columnar databases may seem strange at first, given the prevalence of RDBMSs in software systems. However, when understood and applied to the right use cases, they offer benefits that no RDBMS can match. In short, they’re well suited to analytical processing, effectively compressed, and flexibly scaled horizontally.
This post was written by Michael de Ridder. Michael has worked in software development, data visualization, data science, research, consulting, and business analysis across healthcare, telecommunications, radio and finance. He enjoys the challenge of combining and utilizing the relationships between different domains and technology. A big fan of travel, Michael is a proponent for the benefits of work-life balance, believing that time away from a subject allows creativity to flourish.