Querying for Duplications - Finding Upsert-Merge Keys

francisco.munoz · August 12, 2022, 6:37pm

The following article outlines the steps to take when querying for possible duplications on the target.
The root case for these types of issues is usually due to incorrect primary key selection.

Step 1: Prepping the Duplicated River

Duplicate the river.
Clear the columns mapping
Fetch new column mapping
Set the river to overwrite and define a table
Load a chunk of data to the new table

Step 2: Looking for the Table Upsert-Merge Key

The usual suspects for keys are any column with ‘ID’ in it, no need to add the name column as well
i.e. if you have a ‘group_id’ and ‘group_name’ columns you only need the ‘group_id’.

If it’s a daily, weekly, monthly… report you should also have some kind of date column that needs to be part of the key as well.

Step 3: General flow

Identify possible keys and update the “group by having” query.
Run the “group by having” query to check if the keys are unique.
If the query returns results, query specific data that returns with the “specific query”.
If no results return, you have identified the table upsert merge keys!
Check for differences between the records and go to step 1.

”GROUP BY HAVING” query format :

SELECT key1, key2... count(*) as num_count 
FROM `db.schema.tabl` 
GROUP BY key1, key2 ..HAVING count(*) > 1

“SPECIFIC QUERY” format :

SELECT * 
FROM `db.schema.tabl` 
WHERE key1=val1 and key2=val2 ...

Practice Example:

Let’s take a look at this example.
Suppose we have a table ‘TEST_TABLE’ in our DWH where we have the data in question.

Here is a list of some of the columns in the table:

campaignId,
campaignName,
deleted,
campaignStatus,
app_appName,
app_adamId,
servingStatus,
servingStateReasons,
date
…

Our first step would be to build a query to check for potential keys.

We can immediately see that ‘campaignId’ is a potential key because it has ‘ID’ in the column name (‘app_adamId’ as well but we can ignore it for now)

Starting with the ‘campaignId’ we get the following “group by having” query:

SELECT campaignId, count(*) as num_count 
FROM `.rivery_demo.TEST_TABLE` 
GROUP BY campaignIdHAVING count(*) > 1

Below you can see the results:

Looking at the results we can see that 2575 records return, this means that ‘campaignId’ is not the complete upsert key for this table and we need to find additional keys.

Focusing on the results we query one ‘campaignId’ in particular and look for differences between the records using the “specific query”.

SELECT * 
FROM `.rivery_demo.TEST_TABLE` 
WHERE campaignId = 95955930

query_result

Looking at the data we can see that one difference is the ‘date’ column so we add it to the list of keys and try again :

SELECT campaignId,date, count(*) as num_count 
FROM `rivery_demo.TEST_TABLE` 
GROUP BY campaignId,dateHAVING count(*) > 1

And here are the results:

Still not good enough, we need to keep querying the data and adding keys until the “group by having” query returns no data.

3)Continue until no results return from the “group by having” query.

Topic		Replies	Views
Upsert Merge Explained Tutorial	0	2097	July 29, 2020
Netsuite Merge Method Not Working Source to Target	2	164	January 8, 2024
Prohibiting schema changes Source to Target	8	698	June 12, 2023
[Post Mortem] Insufficient support on dropped columns in multi-table rivers Post Mortem	1	734	June 24, 2020
Does the CDC capture type detect deleted rows from a table if they ar Truncated in Postgres? Source to Target	1	233	January 30, 2024

Querying for Duplications - Finding Upsert-Merge Keys

Step 1: Prepping the Duplicated River

Step 2: Looking for the Table Upsert-Merge Key

Step 3: General flow

Practice Example:

Related topics