12 Misconceptions about Virtual Warehouses in Snowflake

If you are new to Snowflake, you may often find the concept of a “virtual warehouse” a bit confusing. There is also confusion between changing a warehouse size or the number of clusters. Another category of errors relates to the goal of vertical versus horizontal scaling. Here is a collection of frequent honest mistakes most of us come up with during the learning process.

1. Virtual warehouses can be dynamically resized

Virtual warehouses can be only manually resized anytime, even when you may have running queries and other queries queued for them. You cannot auto-scale vertically (up or down) in Snowflake, only horizontally (in or out). And this last auto-scale mode is supported by a multi-cluster environment in the Enterprise Edition, in which every cluster is a virtual warehouse of the same size, processing queries in parallel, that can be started or stopped.

In the previous image, a size-4 (Medium) warehouse has been manually resized to a size-8 (Large). Warehouses can be resized between 1 (X-Small) and 512 (6X-Large), in powers of 2 (i.e. 1, 2, 4, 8, 16, 32, … 256, 512). The warehouse can be independent or part of a multi-cluster. Resizing a multi-cluster warehouse will increase or decrease with the same increment or decrement the size of any individual warehouse.

The previous image shows an auto-scale of a multi-cluster warehouse. Individual warehouses are replicated here, each with the same size. In this case, a size-4 (Medium) warehouse has been replicated, for a cluster with two warehouses, each of the same size.

2. Resizing a virtual warehouse will improve concurrency

Remember this important difference between resizing a warehouse and resizing the number of clusters in a multi-cluster warehouse, as each has a different impact, on either performance or concurrency:

(a) Warehouse resizing improves performance. For the most part, queries scale linearly with regard to warehouse size, particularly for larger, more complex queries. This feature provides manual vertical scalability (up or down).

(b) Multi-cluster warehouses improve concurrency. They are designed specifically for handling queuing and performance issues related to large numbers of concurrent users and/or queries. This feature provides automatic horizontal scalability (in or out).

The size-8 (Large) warehouse on the left will likely process any individual query faster than any size-4 (Medium) warehouse from the right. However, if query 1 and query 2 are issued at the same time, query 2 will likely be queued and wait for query 1 to be processed first. However, each individual warehouse from the multi-cluster warehouse can execute in parallel query 1 and query 2.

3. You may have warehouses of different sizes in a multiple-cluster

In Enterprise Edition, when you define a virtual warehouse with multiple clusters, all individual clusters must use the exact same number of nodes, known as a “size”. For the same multi-cluster virtual warehouse you may not start in parallel clusters with a different number of nodes.

The multi-cluster warehouse on the left has two warehouses, each size-4 (Medium). The one on the right has warehouses of different sizes each, which is not possible.

4. A larger warehouse will run your query faster

Not necessarily. Simple queries may require just one or two nodes. And there are also many other factors that come into play. In a heavy multi-user environment, you may avoid your queries being queued by using a multi-cluster virtual warehouse, not a larger size. Your query may wait to be processed because of concurrency issues, not complexity.

The previous query may require just two nodes allocated. The rest of the nodes just consume credits and are not used. A size-2 (Small) warehouse could be cheaper and more efficient here if most queries are similar.

5. An increase in size will improve the performance of a currently slow-running query without restart

Virtual warehouses can be manually resized anytime, but they do not interrupt or impact the currently running queries. The additional resources do not impact any queries that are already running. But once they are fully provisioned they become available for use by any queries that are queued or newly submitted.

While query 1 still executes, you may manually resize the virtual warehouse, from a size-2 (Small) to size-8 (Large). However, this change does not interrupt or affect the execution of query 1, which continues to be served only by the initial two nodes. If the query could take advantage of more nodes, it must be stopped and executed again.

6. A query is always queued if the current warehouse is busy

Even with single-cluster virtual warehouses, the number of queries that can be concurrently processed is determined by the size and complexity of each query. As queries are submitted, the warehouse calculates and reserves the compute resources needed to process each query. Only if the warehouse does not have enough remaining resources to process a query, the query will be queued, pending resources that become available as other running queries complete.

A size-8 (Large) virtual warehouse may determine it needs just 2 nodes for query 1. Another incoming query may use some of the remaining nodes if they are idle, and so on. If there are enough resources for both query 1 and query 2, one single warehouse can execute both queries in parallel.

To enable fully automated scaling for concurrency, Snowflake recommends multi-cluster warehouses, which provide essentially the same benefits as creating additional warehouses and redirecting queries, but without requiring manual intervention. Queries that would be normally queued could be automatically redirected and executed by another cluster, in the same warehouse.

7. A large warehouse can provision fewer nodes for very simple queries

All nodes are started and provisioned altogether in a single-cluster virtual warehouse. Only in multi-cluster virtual warehouses, different clusters of nodes can be automatically and individually started and stopped.

This means that, for simple queries that may require one or two nodes for processing, you waste all other running nodes and pay more for nothing. If you’re the sole user on a Snowflake instance and run simple queries one at a time, a size-1 (X-Small) or size-2 (Small) virtual warehouse could be enough.

A size-8 (Large) warehouse will always start and provision 8 nodes. Never just one, two, or less than 8.

8. Increase the size of a virtual warehouse if you have one large file to ingest

Increasing the size of a warehouse does not always improve data loading performance. Data loading performance is influenced more by the number of files being loaded and the size of each file than the size of the warehouse.
Unless you are bulk loading hundreds or thousands of files concurrently, a smaller warehouse is generally sufficient. Using a larger warehouse will consume more credits and may not result in any performance increase.

The data ingestion with the size-8 (Large) warehouse could be faster if you ingest many files concurrently. Otherwise, one single ingestion of a large file may require more time because of the volume of data, not data processing. Each COPY INTO command may take up to 4 nodes per request. Using the big file with a size-8 (Large) warehouse could be a waste of compute credits.

9. Multi-cluster warehouses can auto-scale between a minimum and maximum number of compute nodes

Minimum and maximum configuration numbers (with accepted values between 1 and 10) relate to the number of single-cluster warehouses, not compute nodes. Each cluster in any virtual warehouse must have the same number of compute nodes or resources, known as “size”.

By resizing, a warehouse can be scaled up or down only manually. But the number of warehouses in a multi-cluster warehouse can be automatically scaled in or out.

The previous multi-cluster warehouse has been scaled out from 1 to 3 warehouses, each with the same size 4 (Medium). Minimum is set to 1 here, with a maximum of 3 or 10 (max allowed value), in auto-scale mode.

10. A multi-cluster warehouse will help process a complex query faster

Multi-cluster virtual warehouses help process multiple queries in parallel. One single query is always processed by one single warehouse, and only the size matters here, not the number of clusters. A complex query could benefit from a virtual warehouse with large size. A larger number of clusters do not matter in this case.

For the execution of query 1, it doesn’t really matter if you have one size-4 (Medium) warehouse or a multi-cluster warehouse with two warehouses of the same size. The query will still be allocated to just one of these individual warehouses, each with 4 compute nodes.

11. A query can be processed by more than one warehouse at the same time

A query can be executed by only one virtual warehouse. Through different ways, a USE WAREHOUSE statement is always issued for the current session, and this will be the virtual warehouse running your current query.

However, your query may be executed by a different cluster in this virtual warehouse, if you have a multi-cluster warehouse. You cannot say which cluster, but queued queries can be redirected to another running cluster when the other clusters are busy executing other queries.

Remember that a cluster is a collection of N compute nodes to be provisioned, and to execute queries, N being known as the “size”. A virtual warehouse is a collection of such clusters. For one single cluster, we identify the virtual warehouse as a collection of nodes. But in fact, our warehouse is one single cluster with N nodes.

The previous query 1 can be executed only by one warehouse, even if it may require more resources for optimal processing. It is not possible to allocate additional nodes for the same query from another warehouse.

12. Incoming queries are automatically redirected to any available virtual warehouse

Each query is run by one single warehouse, defined as the current context — with USE WAREHOUSE or defined as default for the current user. There is no automatic redirection to another warehouse, the process is always manual. Only in the Enterprise Edition, with a multi-cluster virtual warehouse, different clusters can be started in parallel for queries waiting in the queue.

The difference between the two size-4 (Medium) warehouses on the left and the similar warehouses on the right is the ones from the right are part of a multi-cluster warehouse, which is used as the current context and can eventually scale out to accommodate additional processing power. If the current context for query 1 and query 2 is the same separate warehouse, it is likely the second query will be queued and wait for the first one to execute.

Conclusions

We frequently advise our clients on Snowflake performance tuning and other related issues. We produce documentation and implement proofs-of-concept on the overall scalability of Snowflake systems we design and develop for them.

In one such system data files are dropped in S3, triggered events are queued in SQS queues, then processed by AWS lambda functions that call stored procedures in Snowflake. We’ve built entire core applications in Snowflake, with SQL and JavaScript-based stored procedures performing complex data ingestions and transformations.

--

--

Cristian Scutaru
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Multi-Cloud Data Expert (in AWS/Azure/GCP). Former Snowflake ”Data Superhero” and Certification SME.