Cloud-based data warehouses are becoming more and more popular with each passing day. While the number of customers has increased, the number of options has increased as well. In this post, we will compare three of the most popular data warehouses: Redshift, Snowflake, and BigQuery. By the end of this post, you will get an idea of the parameters that you can use for comparing one warehouse with another.
Prerequisites and Expectations:
Since this will be more of a theoretical post, there are not many prerequisites. It is assumed that you are coming from a technical background, and have worked with at least one database in the past, and can understand the concepts about databases. Apart from this, an open mind and curiosity are pretty much all you need.
As for setting the expectations, please don’t expect a straightforward answer on which warehouse to choose by the end of this post. After all, the answer is very specific to your application and you will have to come up with that answer yourself. This post will give you a framework for comparing the warehouses, and walk you through the pros and cons of each solution for different metrics.
Let’s start with the ‘What’! What are the three terms being compared? Well, they are data warehousing solutions. Now, what is a data warehouse? Is it similar to a database? The answer is ‘Yes’ and ‘No’. Let’s understand the difference.
A database is optimized for ‘OLTP’ (Online Transactional Processing) use cases. Thus, it will be able to handle hundreds, or even thousands of concurrent users, will be used for atomic transactions (typically CRUD – Create, Read, Update, Delete), and will hardly have any downtimes, since a database downtime can bring the operations of a firm to a standstill.
Databases store data efficiently, avoiding duplicates wherever possible.
A data warehouse is optimized for ‘OLAP’ (Online Analytics Processing) use cases. Thus it is used for analyzing the stored data. It, therefore, need not have thousands of concurrent users, but only a small limited number. It is optimized for complex and lengthy analysis queries. It can have some downtime since the unavailability of the data warehouse does not disrupt the operations of any firm, just pause the analyses for some time.
Data warehouses can contain several duplicates since they are optimized for analytical queries. If data replication helps perform parallel computation and gives faster results, then why not?
Alright, now that we have understood how a data warehouse is different from a database, let’s get a brief overview of each of the 3 entities that will be compared in this post.
Redshift is a service provided by AWS. Like most AWS services, it is fully managed. It organizes data in columns instead of rows, which allows it to use its massive parallel processing (MPP) technology to speed up query execution. It integrates well with several other AWS services like Aurora, S3, Quicksight, etc.
Fun fact: As per the Wikipedia page for Redshift, the name redshift teases a shift from Oracle (red denoting Oracle).
BigQuery is Google’s answer to data warehousing. Like Redshift, BigQuery also organizes data in columns instead of rows for parallel query execution and it automatically allocates compute resources as per the requirements, so you need not worry about that (making BigQuery serverless). It is built to process read-only data.
Snowflake is not backed by any of the FAANG companies, unlike Redshift and BigQuery. It uses AWS, Azure, or Google Cloud resources in the background (users can select the service provider and the region). It is a data warehouse provided as a SaaS (Software as a Service). Snowflake’s disruptiveness resides in the fact that it de-hyphenates compute and storage, allowing customers to pay for the two separately. So if you have high storage requirements, but low compute requirements, you need not pay for a storage+compute package where you inevitably end up paying more for the compute part. Instead, you pay separately for the storage, and separately for the compute. Snowflake also uses columnar storage, like Redshift and BigQuery, for parallel query execution.
Alright. Now that we have gone through a brief overview of each of these 3 entities, let’s see how they compare against one another. We will look at 5 broad aspects:
- Performance and Use Cases
- Setup and Maintenance
Let’s take care of this elephant in the room first.
Redshift: Redshift’s pricing model is quite simple. It charges you per node per hour, with the price varying depending on the kind of node chosen. Because redshift doesn’t separate storage and compute, the pricing covers both storage and compute costs. So you get fixed storage per node and unlimited processing on that node. One key concept to note in the case of Redshift is ‘Reserved Instances’. So if you anticipate regular workloads, you can reserve an instance beforehand for 1-3 years. You tend to earn massive discounts with this approach.
BigQuery: BigQuery pricing is complicated. It charges separately for analysis and storage. The storage cost is straightforward ($20/TB per month for active storage). However, the analysis cost comes in two types: On-demand pricing (where charges are based on the number of bytes processed per query), and flat-rate pricing (where you purchase fixed slots or vCPUs, for either 60 seconds, 30 days, or 365 days).
Snowflake: Snowflake also charges for compute and storage separately. It has plans (Standard, Enterprise, Business Critical, etc.) for the compute part, with the minimum plan costing $2/hour, and the price and features incrementing with each plan. For the storage, you can choose either the On-Demand Storage ($35-$40/TB/month), or capacity storage (where you pay upfront, for a discount – $20-$23/TB/month).
Performance and Use Cases
Redshift: When evaluating the performance of Redshift, it is presumed that you have provisioned as many nodes as required. If the number of provisioned nodes is lesser than your computation needs, then your performance will be inferior. This is one of the drawbacks of having storage+compute bundled together. Your decision of the number of nodes has to incorporate both.
Having said that, Redshift claims to provide 10x better performance than any other data warehouse, scaling to petabytes of data, using its MPP engine. However, its performance takes a hit when dealing with semi-structured data formats like JSON. Redshift is more suited for everyday business operations, like creating nodes for any BI analysis and so on
BigQuery: The performance of BigQuery is independent of the storage (since it allows for partitioning of compute and storage) and quite fast as well. BigQuery is better suited for querying large chunks in a smaller time frame (spiky workloads).
Snowflake: In the case of Snowflake, since the storage can be accessed independently by each workload, a high number of concurrent users (100-1000) are supported. Snowflake, therefore, is a good option if you want to provide data-as-a-service, i.e., giving access to several clients to query the data at the same time.
Setup and Maintenance
Redshift: This requires the creation of an appropriate cluster of nodes, keeping both compute and storage requirements in mind. It requires periodic maintenance in the form of VACUUM/ANALYZE operations on the storage.
BigQuery: Here the customers need not worry about sizing at the time of setup, since this is practically serverless, and you can provision more storage and computation as needed. Overall maintenance required from the end-user is low.
Snowflake: Here again, customers need not worry about sizing at the time of setup. The only choice the customers need to make is the cloud service provider (AWS vs Azure vs GCP), and the plan for compute (Standard, Business Critical, Enterprise, etc.). Overall, the maintenance level here is low as well, practically zero.
Redshift: As you would have expected, this integrates with a host of other AWS services like S3 (Redshift Spectrum is an important topic you should look at), Athena, RDS, etc. It also integrates with several BI tools like Tableau, Quicksight, etc. If your business is running in the AWS ecosystem, using Redshift is a no-brainer.
BigQuery: Like Redshift integrates well with AWS services, BigQuery integrates well with GCP services, and also Google Workspace. Of course, the common BI tools work well with BigQuery.
Snowflake: This comes with its own set of integrations like Apache Spark, IBM Cognos, Tableau, etc.
Thus, the bottom line here is that each alternative comes with its own set of integrations. However, the intersections are few, and it is more of a question of the services you are comfortable with.
Redshift: Redshift allows for both horizontal and vertical scaling. However, it is a bit slow in the vertical scaling. It can take minutes and in some cases, hours to add more nodes. It can scale horizontally up to 500 concurrent connections and 50 simultaneous queries.
BigQuery: Since compute and storage are separated, both horizontal and vertical scaling is quite seamless. You can horizontally scale up to 1000 concurrent connections and 100 simultaneous queries.
Snowflake: Here again, horizontal and vertical scaling is seamless because storage and compute parts are separated. Snowflake offers instant scaling. There is no hard limit for parallelism on Snowflake.
Each alternative discussed in this post has its own sets of benefits and drawbacks. You are advised to consider these factors and then choose a warehouse that best suits your application. Please go through the official documentation of each of these warehouses, because the features of these warehouses keep getting upgraded and what may have been a limitation at the time of writing this post may not be one by the time you read it.
Know of any other metrics which can be used to compare these 3? Post them in the comments.
If you are a beginner in the field of data warehousing, you may want to check out this course on Udemy.
Also, you read more articles on AWS on iotespresso.com.