In the twenty-first century, data is the new gold. Almost all (successful) companies rely on data to grow. Understanding and adapting to clients’ needs are the most direct ways to increase sales. The bigger the company, however, the more data needs to be processed to reach that understanding. That’s where data warehouse systems come into play. In this post, you’ll learn what cloud data warehouses are and see some examples of them.
(Cloud) Data Warehouses: What Are They?
A typical company runs many different applications and systems. One application might be used for ordering supplies from a third party, one might be used for running the shop, and another might be used for connecting to a shipping company. These are just a few examples. In order to answer critical business questions, data from all these different systems needs to be easily accessible, preferably in one place. With the data in one place, you can create reports and insights that support the decisions that are best for the business. Data warehouse systems do just that: they gather data from different sources across the whole company, process all of it, and make it available for data analytics.
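To make the idea of gathering data from several systems into one place more concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a real warehouse, and all table names, fields, and values are made up for illustration. A real cloud data warehouse would load data through its own connectors and at a much larger scale.

```python
import sqlite3

# Hypothetical extracts from three separate systems (names and values are invented).
supplier_orders = [{"sku": "A1", "unit_cost": 4.0}, {"sku": "B2", "unit_cost": 7.5}]
shop_sales = [
    {"sku": "A1", "quantity": 10, "unit_price": 6.0},
    {"sku": "B2", "quantity": 3, "unit_price": 12.0},
    {"sku": "A1", "quantity": 5, "unit_price": 6.0},
]
shipments = [{"sku": "A1", "shipping_cost": 9.0}, {"sku": "B2", "shipping_cost": 4.5}]


def build_warehouse():
    """Load the three extracts into one database so they can be queried together."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE costs (sku TEXT PRIMARY KEY, unit_cost REAL)")
    conn.execute("CREATE TABLE sales (sku TEXT, quantity INTEGER, unit_price REAL)")
    conn.execute("CREATE TABLE shipping (sku TEXT, shipping_cost REAL)")
    conn.executemany("INSERT INTO costs VALUES (:sku, :unit_cost)", supplier_orders)
    conn.executemany(
        "INSERT INTO sales VALUES (:sku, :quantity, :unit_price)", shop_sales
    )
    conn.executemany("INSERT INTO shipping VALUES (:sku, :shipping_cost)", shipments)
    conn.commit()
    return conn


def margin_report(conn):
    """Answer a business question that needs data from all three systems."""
    query = """
        SELECT s.sku,
               SUM(s.quantity * (s.unit_price - c.unit_cost))
                   - MAX(sh.total_shipping) AS margin
        FROM sales AS s
        JOIN costs AS c ON c.sku = s.sku
        JOIN (
            SELECT sku, SUM(shipping_cost) AS total_shipping
            FROM shipping
            GROUP BY sku
        ) AS sh ON sh.sku = s.sku
        GROUP BY s.sku
        ORDER BY s.sku
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for sku, margin in margin_report(build_warehouse()):
        print(sku, margin)
```

The margin per product can only be computed because purchasing, sales, and shipping data sit side by side. That is the core benefit a data warehouse provides.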
Traditionally, these systems required a lot of servers. You needed to purchase them, set up data centers to host them, and hire specialized engineers. But since a lot of companies are moving their infrastructure into the cloud nowadays, using cloud data warehouse systems instead of on-premises ones is becoming the new norm.
There are many advantages to cloud data warehouse systems. There’s no need for purchasing servers up front, no risk of under- or over-provisioning, much more flexibility, and very easy scaling. On top of that, they usually bring better performance. Also, since the data is in the cloud, business teams usually don’t need to wait for updated information as they often do with on-premises systems. From a technical perspective, they don’t differ that much from their on-premises counterparts.
If you want to go for cloud data warehouse systems, you first need to make some decisions. There are a few different architectures and pricing models for cloud data warehouse systems you can choose from.
Architecture vs Pricing
Most cloud data warehouses come as either pay-per-server clustered versions or pay-per-use fully managed solutions. With the first option, you’ll more or less be responsible for managing the cluster. You’ll also be responsible for scaling the cluster up and down when needed (or setting up auto-scalers). Basically, you will see your data warehouse as a cluster. This architecture is usually associated with a pay-per-server model. Because you know how big your cluster is and you control its capacity, costs are easy to predict.
Another option is to use a fully managed solution. In this case, the cluster is “hidden” from you. You won’t need to take care of it because the provider handles all maintenance tasks and scaling. This architecture will, in most cases, come with a pay-per-use model. Since there are no servers or clusters assigned to you, you’ll pay based on usage or per query. The advantage of this approach is that you don’t have to manage anything, so you can focus on your data. The downside is that predicting costs is more difficult. Without proper governance, a single typo in a query can lead to a surprising bill, for example when the mistyped query requires much more computing power or ends up running in a loop.
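The difference between the two pricing models can be sketched with some simple arithmetic. The Python snippet below is only an illustration: the prices, the per-terabyte billing unit, and the function names are all assumptions, not figures from any real provider. It also includes a very basic governance guard that rejects a query whose estimated cost exceeds a limit, assuming your platform can estimate how much data a query will process before running it.

```python
HOURS_PER_MONTH = 730  # rough average, good enough for an estimate


def cluster_monthly_cost(nodes, price_per_node_hour):
    """Pay-per-server: the bill depends only on cluster size and uptime."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def usage_monthly_cost(tb_processed, price_per_tb):
    """Pay-per-use: the bill depends on how much data your queries process."""
    return tb_processed * price_per_tb


def break_even_tb(nodes, price_per_node_hour, price_per_tb):
    """Monthly data volume at which both models cost the same."""
    return cluster_monthly_cost(nodes, price_per_node_hour) / price_per_tb


class BudgetExceeded(Exception):
    """Raised when a single query is estimated to cost more than allowed."""


def check_query_budget(estimated_tb, price_per_tb, max_cost_per_query):
    """Refuse to run a query whose estimated cost is above the limit."""
    cost = estimated_tb * price_per_tb
    if cost > max_cost_per_query:
        raise BudgetExceeded(
            f"estimated cost {cost:.2f} exceeds limit {max_cost_per_query:.2f}"
        )
    return cost


if __name__ == "__main__":
    # Hypothetical prices: 0.25 per node-hour, 5.00 per terabyte processed.
    print("cluster:", cluster_monthly_cost(nodes=2, price_per_node_hour=0.25))
    print("usage:", usage_monthly_cost(tb_processed=10, price_per_tb=5.0))
    print("break-even TB:", break_even_tb(2, 0.25, 5.0))
    try:
        check_query_budget(estimated_tb=50, price_per_tb=5.0, max_cost_per_query=10.0)
    except BudgetExceeded as error:
        print("query blocked:", error)
```

With these made-up numbers, the cluster costs the same every month no matter how much you query, while the pay-per-use bill is cheaper for light workloads and grows with every terabyte processed. The guard shows one way to avoid the surprise bill described above.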
Examples of Data Warehouses
Here are some examples of cloud data warehouses.
Azure Synapse Analytics
In November 2019, Microsoft announced its new platform called Azure Synapse Analytics, which replaced its previous offering, Azure SQL Data Warehouse. While SQL Data Warehouse was offered only in a pay-per-cluster architecture, the new Azure Synapse Analytics gives you a choice and offers both pay-per-cluster and pay-per-usage models. The key selling points of Azure Synapse Analytics are the ability to combine data warehousing with big data analytics in one platform and the ability to run at petabyte scale.
Amazon Redshift
Amazon’s Redshift platform is probably the most well-known cloud data warehouse solution on the market. It offers easy integration with data lake solutions and all other Amazon services. Therefore, it’s a natural option for companies that run the rest of their infrastructure on Amazon. Redshift promises to easily query petabytes of data from various data sources and databases. Amazon advertises Redshift as being three times faster than any other cloud data warehouse solution. Amazon is also working on AQUA (Advanced Query Accelerator), a new hardware-accelerated cache for Redshift, which should raise the bar even higher and make Redshift ten times faster than others. At the time of writing, however, AQUA is still in preview. On top of that, Amazon also claims to be the cheapest solution, but you should take that promise with a grain of salt, as it all depends on pricing models and your usage.
Google BigQuery
Google also has its own data warehouse solution. As one of the services available in Google Cloud Platform, Google BigQuery offers seamless integration with all the other services offered by Google. It’s a fully managed solution with a pay-per-usage pricing model. Google BigQuery offers a few unique functions, like multi-cloud analysis, building machine learning models using SQL, and an in-memory analysis service for very fast queries.
Snowflake
If you don’t need to deeply integrate your data warehouse with the rest of your infrastructure, you may want to consider Snowflake. Unlike the previously mentioned platforms, Snowflake is dedicated only to data processing and analysis. Therefore, it’s worth considering if you want a vendor fully focused on the platform. Snowflake even offers dedicated solutions for specific sectors (financial, health care, retail, etc.). It’s a fully managed solution with “almost zero” maintenance, and it offers a per-usage pricing model.
Other Platforms
You are not limited to the big four of cloud data warehouse solutions listed above. I encourage you to check other platforms too. Each provides some unique functions, and most of them will work fine for standard usage, so it’s worth checking all of them. Oracle Autonomous Data Warehouse, IBM Db2 Warehouse on Cloud, SAP Data Warehouse Cloud, Teradata Vantage, and Panoply are some other platforms to research.
Choosing the Platform
There is no golden formula for choosing a proper platform for your use case. It all depends on your specific usage and the rest of your infrastructure. For example, if you run your whole company on Google Cloud Platform, then it makes sense to use Google’s data warehouse offering, unless you need some specific features found elsewhere. If you don’t need the highest performance on the market, you might want to look for a slightly cheaper platform. The examples above cover a few well-known solutions, together with some tips that may help you make a choice.
Event Data Cloud
So far, we’ve talked about typical cloud data warehouse solutions. There is, however, another approach: Scalyr’s Event Data Cloud, a new way of dealing with data. It lets you ingest and query as much data as you need in real time. Typically, it can replace solutions based on Elasticsearch and lower infrastructure and operations costs. If you want to know more, check Scalyr’s white paper about Event Data Cloud.
As mentioned at the beginning of this post, data is the new gold. But, as with gold, you need to properly process data in order to get its true value. Data warehouses that are incorrectly set up or underperforming can give you more frustration than insight into your clients’ needs. There are also new ideas on the market, like Scalyr’s Event Data Cloud. With the information in this post, you can start figuring out which cloud data warehouse can work for you.