Optimizing River Executions for Speed
The scale at which data is growing today means that data pipelines need to be structured to ingest, load, and process information in an efficient, scalable, and repeatable manner. Cloud data warehouses are great for this, with their near-limitless distributed computing power and their ability to scale by adding nodes, clusters, or related containerized compute entities. Rivery's position as an ELT tool, as opposed to an ETL tool, makes it uniquely poised to take advantage of these benefits. In this guide we go over some general tips at the warehouse, table, and SQL query level to help you reduce your Logic River runtime.
At the Data Warehouse Level:
At this level, you want to ensure that your data warehouse size (may be called warehouse type, depending on your CDWH) is appropriate for the amount of data that you expect to process. Every popular cloud data warehouse has some option to create and resize a warehouse. For example, in Snowflake you can use the ALTER command:
ALTER WAREHOUSE <your_warehouse_name> SET
  warehouse_size = XXLARGE
  auto_suspend = 300
  auto_resume = TRUE;
This resizes the warehouse to 2X-Large (32 nodes per cluster), allowing it to process larger workloads. The screenshot below shows a warehouse that may benefit from being resized:
Ideally, you want a warehouse that, during your query, only stores data in "local storage" (the fast SSD storage native to the warehouse) and not in "remote storage" (the S3 or Blob storage that the warehouse may spill over to, which is less efficient).
Another option at the data warehouse level is to designate more clusters (may be called nodes, depending on your CDWH), which effectively spreads your computation "horizontally" across multiple compute entities. In Snowflake, this can be accomplished with a statement like:
ALTER WAREHOUSE <your_warehouse_name> SET
  min_cluster_count = 1
  max_cluster_count = 6
  scaling_policy = 'STANDARD';
This will allow additional clusters to be added, up to the maximum count, as the compute demand of the query increases.
At the Table Level:
For data tables that are very large in size (1TB+), the addition of a "cluster key" (may be called a partition key, depending on your CDWH) may dramatically improve your query times. To illustrate the effect of a cluster key on performance for a data set within Snowflake, see the graphic below:
For this table, queries frequently partition by or filter on the Date column. By designating the Date column as a "cluster key," one can improve query performance versus the unclustered table by the metrics detailed below (the example uses a table with 1.3 billion rows):
If your SQL queries use the PARTITION BY statement frequently, the columns being partitioned by are good candidates to assign as cluster keys (in a large table).
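As a sketch in Snowflake, assuming a hypothetical large table named events with a Date column (both names are illustrative), designating a cluster key looks like:

```sql
-- Designate Date as the cluster key on a large table. Snowflake will
-- then co-locate micro-partitions by Date, so queries that filter or
-- partition on Date can prune partitions instead of scanning everything.
ALTER TABLE events CLUSTER BY (Date);

-- Inspect how well the table is clustered on that column:
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(Date)');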
Another strong way to optimize queries at the table level (besides cluster keys) is to augment original tables with sparse columns (essentially flags) that indicate whether a column holds a certain value or not. Queries are always fastest when evaluating True/False conditions rather than comparing specific strings. For example, if you are searching 10MM user_ids for a column where value = 'clicked on advertisement', it is better to preprocess at the table level: assign each user_id a 1/0 or True/False in a dedicated flag column indicating whether the condition holds, and then reference that column instead of repeating the string comparison in every query. Cloud data warehouses are generally much more efficient when such conditions are precomputed, since they can scan a small flag column rather than compare strings for equality.
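A minimal sketch of this preprocessing, assuming a hypothetical users table with a value column (all names are illustrative):

```sql
-- One-time preprocessing: materialize the string comparison as a
-- 1/0 flag column so downstream queries never repeat it.
ALTER TABLE users ADD COLUMN clicked_ad INT;

UPDATE users
SET clicked_ad = CASE
    WHEN value = 'clicked on advertisement' THEN 1
    ELSE 0
END;

-- Downstream queries now evaluate a cheap integer condition:
SELECT user_id FROM users WHERE clicked_ad = 1;
```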
At the Query Level:
There are various ways to fine-tune your SQL queries in Logic Rivers to reduce runtime. Although these depend on the datatypes of the table that you are querying as well as your cloud data warehouse, here are some guidelines that apply almost universally across platforms. Knowledge of the SQL “Order of Operations” in queries is useful as well:
| Execution Order | SQL Statement | Description |
|---|---|---|
| 1 | FROM | Choose tables and joins to pull in data. |
| 2 | WHERE | Apply filters on the extracted data. |
| 3 | GROUP BY | Perform an aggregation. |
| 4 | HAVING | Apply a filter on aggregated data. |
| 5 | SELECT | Return finalized data. |
| 6 | ORDER BY | Sort the returned finalized data. |
| 7 | LIMIT | Limit the number of rows returned. |
Avoid using SELECT *. Instead, only extract columns by name that are necessary for your use case/downstream data transformations.
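For instance (the table and column names here are illustrative):

```sql
-- Pulls every column, forcing the warehouse to read all of them:
SELECT * FROM orders;

-- Reads only the columns the downstream transformation needs:
SELECT order_id, customer_id, order_total
FROM orders;
```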
Index, Index, Index! As highlighted in the 'At the Table Level' section, indexing columns can greatly improve your queries by allowing you to take advantage of cluster/partition keys and other CDWH optimizations. At the query level, it allows you to use WHERE and other column-level statements instead of LIKE or other slower row-level search statements.
Eliminate functions in the WHERE clause. Depending on the type of function, the query might have to perform a full table scan every time the WHERE clause is evaluated.
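As a sketch, assuming an orders table with a created_at timestamp column (illustrative names), a function-free rewrite might look like:

```sql
-- Slower: the function must be applied to every row, which can
-- prevent partition pruning and force a full table scan.
SELECT order_id FROM orders
WHERE DATE(created_at) = '2021-06-01';

-- Faster: a bare column comparison against a range, which the
-- warehouse can satisfy by pruning partitions.
SELECT order_id FROM orders
WHERE created_at >= '2021-06-01'
  AND created_at <  '2021-06-02';
```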
Be careful using SELECT DISTINCT. The backend grouping of all selected columns that this statement triggers can consume large amounts of processing power. It is often better to select additional fields until the combination of columns makes each row effectively distinct.
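The trade-off can be sketched as follows (the customers schema is illustrative):

```sql
-- Forces a deduplication pass over every selected column:
SELECT DISTINCT first_name, last_name FROM customers;

-- Selecting enough columns that each row is naturally unique
-- lets the warehouse skip the distinct step:
SELECT first_name, last_name, customer_id FROM customers;
```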
If you have to use the LIKE operator, don't use wildcard operators at the beginning of the string unless absolutely necessary. With a leading wildcard, the RDBMS will abandon its default behavior of finding a suitable index and instead perform a full table scan to match the beginning of the string. Try to use wildcard operators only at the end of a string (e.g., LIKE 'foo%' instead of LIKE '%oo%') if possible.
Replace correlated subqueries with non-correlated subqueries, single statements, or regular subqueries where possible. Because a correlated subquery depends on the outer query, it may be re-evaluated once per outer row, and that interdependence (or a difference in "Big-O" between the two forms) can compound into a much slower overall query.
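A sketch of such a rewrite, using illustrative employees tables and columns:

```sql
-- Correlated: the subquery references the outer row (e.department_id),
-- so it may be re-evaluated once per employee.
SELECT e.employee_id, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary) FROM employees e2
    WHERE e2.department_id = e.department_id
);

-- Non-correlated: compute each department's average once,
-- then join against the result.
SELECT e.employee_id, e.salary
FROM employees e
INNER JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
```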
Use INNER JOIN instead of WHERE when performing joins. While WHERE joins are easier to read and sometimes preferred by data scientists/engineers, on some RDBMSs they trigger a cross join of the two tables first, followed by a WHERE filter on the backend. This places unnecessary computational and storage load on the system. It is better to use an explicit INNER/LEFT/RIGHT JOIN.
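The two forms side by side, with illustrative table names:

```sql
-- WHERE-clause join: some engines materialize the cross product first.
SELECT o.order_id, c.customer_name
FROM orders o, customers c
WHERE o.customer_id = c.customer_id;

-- Explicit join: the join condition is applied as the tables are combined.
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c
    ON o.customer_id = c.customer_id;
```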
When testing a query, always use a LIMIT statement to verify that you are extracting the right columns and executing the correct joins, groupings, etc. before you run the query against the entire dataset. Extrapolating the runtime from several samples of different row limits can sometimes give you an idea of how long the query will take on the full table, whether by linear, exponential, or logarithmic interpolation.
It is better to apply a filter using WHERE before an aggregation than with HAVING, unless the filter must apply to the aggregated columns. According to the SQL order of operations, HAVING executes after WHERE and would therefore filter rows that are extraneous and could have been discarded earlier with WHERE.
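For example (illustrative schema):

```sql
-- Less efficient: every region is aggregated, then filtered afterward.
SELECT region, SUM(order_total) AS total
FROM orders
GROUP BY region
HAVING region = 'EMEA';

-- More efficient: extraneous rows are discarded before aggregation.
SELECT region, SUM(order_total) AS total
FROM orders
WHERE region = 'EMEA'
GROUP BY region;
```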
At the Rivery Level:
If you have to flatten JSON stored in a VARIANT or related column before querying, separate the JSON flattening into its own Logic Step, so that the flattening does not run multiple times within a single query.
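A sketch in Snowflake, assuming a hypothetical raw_events table with a VARIANT column named payload (all names are illustrative):

```sql
-- Logic Step 1: flatten the JSON once and persist the result.
CREATE OR REPLACE TABLE events_flat AS
SELECT
    r.id,
    f.value:event_type::STRING AS event_type,
    f.value:ts::TIMESTAMP      AS event_ts
FROM raw_events r,
LATERAL FLATTEN(input => r.payload:events) f;

-- Logic Step 2 and beyond: query the flattened table directly,
-- without repeating the FLATTEN work in every query.
SELECT event_type, COUNT(*)
FROM events_flat
GROUP BY event_type;
```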
Ensure that your loading modes are set appropriately.
The three options available are Append Only, which unions the new incoming data with the data already in the table; Overwrite, which replaces your destination table with a new one holding the new data; and Upsert-Merge, which updates rows in the destination table where primary keys match between the source data and the existing data. Rows whose primary keys do not match are unioned on.
Utilize River and Global variables. If there is information, such as an access token, date, or set of user_ids, that is used repeatedly, consider storing it in variables and then referencing those variables throughout your workflow, rather than extracting it from scratch in every step that uses it.