Upsert Merge Explained
UPSERT Merge inserts/updates rows into a Target table through defined keys. When the defined keys are not found for a record the process will create a new row. When the record’s defined keys do exist, the process will update the row with the new information. Defined keys can be selected in the Column Mapping field in Rivery. More information can be found here.
In the following example, we’ll use UPSERT Merge to update a table in Snowflake:
Below is the Target table that we’d like to update:
List of records from a data Source that we’d like to add to the Target table:
Upsert-Merge can be selected in the Target Step of a Source to Target River.
As Upsert Merge adds/updates records in the Target table based on defined keys. These will need to be selected in the ‘Columns Mapping’ step in Rivery’s ‘Data source to Target’. You can define keys by clicking the Key icon next to the Source Field. Once selected, the key will appear to have gold color. More information on selecting Keys and Column Mapping can be found here.
Using ACTOR_ID as the Key, the River will add any New ACTOR_IDs not in the Target table and Update any existing ACTOR_IDs with the new FIRST_NAME and LAST_NAME.
After the River has run, the following represents the updated Target table in Snowflake:
As you can see, the rows for ACTOR_IDs 3 and 5 have been updated with the new values. The River has also added ACTOR_ID: 11 as this ID did not exist in the Target table.
Filter Logical Keys:
Within the Upsert-Merge option, there is an option to ‘Filter Logical key duplication between files.’.
This toggle allows you to perform a deduplication process on the inbound records from the Source. This can be useful if you have more than 1 record per key (Note: This will only perform the deduplication process on the inbound records and not the entirety of the Target database). As this process only keeps 1 record, the ‘Filter order expression’ can be used to decide which record is chosen. This works similarly to an ORDER BY statement in SQL where the River stores the first record from the top; the Target database’s syntax is used for this expression.
In our example, let’s assume there is another field for ‘last_modified_date’. If you’d like to perform the deduplication process on the ACTOR_ID field by selecting the most record, we’d add the following configuration in the ‘Filter Logical Key’: