Data extracted from certain sources often contains fields storing mixed data as JSON arrays. The keys in these JSON arrays represent "nested" fields of data.
Oftentimes, it is necessary to extract these nested fields into their own independent columns for ease of storage and querying in a relational database.
Some cloud data warehouses, like BigQuery and Snowflake, have built-in functions to unnest, or "flatten," these nested fields. For other warehouses, however, this process is more manual. In this tutorial, we are going to unnest fields from a table in Amazon Redshift, which does not yet support native flattening functions.
We are going to use the capabilities of a Rivery Logic River to orchestrate the unnesting process. A Logic River allows us to execute commands and transformations in our target database directly within the Rivery UI.
Once the data ingestion phase is complete, we will run a series of queries to unnest the data.
In this step we will call a data source-to-target River that pulls data from a source (such as Facebook Ads, Twitter Ads, or a REST API) and loads it into our Redshift database.
This table should contain one or more "atomic" columns, that is, columns holding a single scalar datatype that is not a JSON, record, or variant, plus one column that contains a JSON array with multiple keys. For example, a table like:
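For concreteness, here is a sketch of such a table using hypothetical names (the campaigns table, its columns, and the sample row are our own illustration, not part of the tutorial):

```sql
-- Hypothetical source table: two atomic columns plus one JSON-array column.
CREATE TABLE campaigns (
    campaign_id   integer,        -- atomic
    campaign_name varchar(100),   -- atomic
    daily_stats   varchar(max)    -- JSON array stored as text
);

-- One source row whose JSON array holds two elements with three keys each.
INSERT INTO campaigns VALUES
(1, 'spring_sale',
 '[{"day":"2021-03-01","clicks":"120","spend":"35.5"},
   {"day":"2021-03-02","clicks":"98","spend":"29.0"}]');
```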
This table will hold sequential numbers 0 through n-1, where n corresponds to the maximum number of elements in the JSON array(s) we want to unnest (array elements are referenced by zero-based index).
This sample script generates the "dummy" table of sequential numbers by numbering the rows of an existing table, while also allowing you to set a limit equal to the maximum number of JSON entries in a single line:
CREATE OR REPLACE VIEW seq_0_to_n AS (
    SELECT row_number() OVER (ORDER BY TRUE)::integer - 1 AS i
    FROM <table_name>
    LIMIT n
);
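Assuming <table_name> has at least n rows, the view yields the integers 0 through n-1. A quick sanity check (our own suggestion, not part of the original flow) might look like:

```sql
-- The sequence should start at 0 and be at least as long as the longest array.
SELECT i FROM seq_0_to_n ORDER BY i;

-- Compare against the longest JSON array you need to unnest:
SELECT MAX(JSON_ARRAY_LENGTH(<JSON_column_to_unnest>)) FROM <table_name>;
```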
We now want to join the sequence "dummy" table created in the previous step with the table we want to un-nest. This process essentially "explodes" the nested field, creating one row in the cross join for each element of the nested JSON array. To do this, we will use three Redshift built-in functions: JSON_EXTRACT_ARRAY_ELEMENT_TEXT(), JSON_ARRAY_LENGTH(), and JSON_EXTRACT_PATH_TEXT().
This is some sample code for executing this cross join:
WITH exploded_array AS (
    SELECT <tabular_column_1>, <tabular_column_2>, ... <tabular_column_n>,
        JSON_EXTRACT_ARRAY_ELEMENT_TEXT(<JSON_column_to_unnest>, seq.i) AS json
    FROM <table_name>, seq_0_to_n AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(<JSON_column_to_unnest>)
)
SELECT <tabular_column_1>, <tabular_column_2>, ... <tabular_column_n>,
    JSON_EXTRACT_PATH_TEXT(json, '<key_name_1>') AS <unnested_column_name_1>,
    CASE WHEN JSON_EXTRACT_PATH_TEXT(json, '<key_name_2>') = '' THEN 0 ELSE JSON_EXTRACT_PATH_TEXT(json, '<key_name_2>')::float8 END AS <unnested_column_name_2>,
    CASE WHEN JSON_EXTRACT_PATH_TEXT(json, '<key_name_3>') = '' THEN 0 ELSE JSON_EXTRACT_PATH_TEXT(json, '<key_name_3>')::float8 END AS <unnested_column_name_3>,
    CASE WHEN JSON_EXTRACT_PATH_TEXT(json, '<key_name_n>') = '' THEN 0 ELSE JSON_EXTRACT_PATH_TEXT(json, '<key_name_n>')::float8 END AS <unnested_column_name_n>
FROM exploded_array;
Note that you must ensure that all datatype mappings are compatible with the target fields for all non-varchar types.
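To make the template concrete, here is the same cross join written against a hypothetical campaigns table whose daily_stats column holds a JSON array with keys day, clicks, and spend (all of these names are our own illustration, not from the tutorial):

```sql
WITH exploded_array AS (
    -- One output row per element of the daily_stats JSON array.
    SELECT campaign_id, campaign_name,
           JSON_EXTRACT_ARRAY_ELEMENT_TEXT(daily_stats, seq.i) AS json
    FROM campaigns, seq_0_to_n AS seq
    WHERE seq.i < JSON_ARRAY_LENGTH(daily_stats)
)
SELECT campaign_id, campaign_name,
       JSON_EXTRACT_PATH_TEXT(json, 'day') AS stat_day,
       CASE WHEN JSON_EXTRACT_PATH_TEXT(json, 'clicks') = '' THEN 0
            ELSE JSON_EXTRACT_PATH_TEXT(json, 'clicks')::float8 END AS clicks,
       CASE WHEN JSON_EXTRACT_PATH_TEXT(json, 'spend') = '' THEN 0
            ELSE JSON_EXTRACT_PATH_TEXT(json, 'spend')::float8 END AS spend
FROM exploded_array;
```

A source row whose array holds two elements produces two rows in the result, each carrying the original atomic columns alongside the extracted keys.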
The flow described above can be implemented in three Logic Steps: the first ingests the data, the second generates the sequence, and the third performs the cross join and un-nesting.
The result of this flow should be a test_json_unnest table containing only atomic columns:
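As a hypothetical illustration (the table and column names are our own): a single source row whose JSON array held two elements would yield two fully atomic rows, for example:

```
campaign_id | campaign_name | stat_day   | clicks | spend
------------+---------------+------------+--------+-------
          1 | spring_sale   | 2021-03-01 |  120.0 |  35.5
          1 | spring_sale   | 2021-03-02 |   98.0 |  29.0
```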