Querying for Duplications - Finding Upsert-Merge Keys

The following article outlines the steps to take when querying for possible duplications on the target.
The root case for these types of issues is usually due to incorrect primary key selection.

Step 1: Prepping the Duplicated River

  1. Duplicate the river.
  2. Clear the columns mapping
  3. Fetch new column mapping
  4. Set the river to overwrite and define a table
  5. Load a chunk of data to the new table

Step 2: Looking for the Table Upsert-Merge Key

The usual suspects for keys are any column with ‘ID’ in it, no need to add the name column as well
i.e. if you have a ‘group_id’ and ‘group_name’ columns you only need the ‘group_id’.

If it’s a daily, weekly, monthly… report you should also have some kind of date column that needs to be part of the key as well.

Step 3: General flow

  1. Identify possible keys and update the “group by having” query.
  2. Run the “group by having” query to check if the keys are unique.
  3. If the query returns results, query specific data that returns with the “specific query”.
  4. If no results return, you have identified the table upsert merge keys!
  5. Check for differences between the records and go to step 1.

”GROUP BY HAVING” query format :

SELECT key1, key2... count(*) as num_count 
FROM `db.schema.tabl` 
GROUP BY key1, key2 ..HAVING count(*) > 1

“SPECIFIC QUERY” format :

SELECT * 
FROM `db.schema.tabl` 
WHERE key1=val1 and key2=val2 ...

Practice Example:

Let’s take a look at this example.
Suppose we have a table ‘TEST_TABLE’ in our DWH where we have the data in question.

Here is a list of some of the columns in the table:

campaignId,
campaignName,
deleted,
campaignStatus,
app_appName,
app_adamId,
servingStatus,
servingStateReasons,
date
…

  1. Our first step would be to build a query to check for potential keys.

We can immediately see that ‘campaignId’ is a potential key because it has ‘ID’ in the column name (‘app_adamId’ as well but we can ignore it for now)

Starting with the ‘campaignId’ we get the following “group by having” query:

SELECT campaignId, count(*) as num_count 
FROM `.rivery_demo.TEST_TABLE` 
GROUP BY campaignIdHAVING count(*) > 1

Below you can see the results:

Looking at the results we can see that 2575 records return, this means that ‘campaignId’ is not the complete upsert key for this table and we need to find additional keys.

  1. Focusing on the results we query one ‘campaignId’ in particular and look for differences between the records using the “specific query”.
SELECT * 
FROM `.rivery_demo.TEST_TABLE` 
WHERE campaignId = 95955930

query_result

Looking at the data we can see that one difference is the ‘date’ column so we add it to the list of keys and try again :

SELECT campaignId,date, count(*) as num_count 
FROM `rivery_demo.TEST_TABLE` 
GROUP BY campaignId,dateHAVING count(*) > 1

And here are the results:

Still not good enough, we need to keep querying the data and adding keys until the “group by having” query returns no data.

3)Continue until no results return from the “group by having” query.