Rivery Best Practices: Historical Backfill and Incremental Loads

When using Rivery to ingest data from a source, a common initial use case is that of historically backfilling the dataset into your cloud data warehouse. After this process is completed and data quality has been checked, the next logical step is to configure this process for incremental loading in the future. In this post we’ll go over how to optimize your historical backfill and incremental load configuration in the Rivery platform.

Historical Backfilling - Using an API Connection

In the Source tab of any Data Source to Target River using an API source, you’ll first establish the entity or report you would like to pull into your target. This tab will vary in options depending on which source you are using, but the general flow is this:

  1. Establish connection
  2. Choose report or entity
  3. Apply any additional filters desired
  4. Select time period

Once you get to the Please Choose Time Period option, you have the ability to select a predefined time range or input a custom date range.

For your historical load, you will first have to decide how far back you would like to backfill. Note - this may be dictated by the API itself, as some APIs only hold data for a certain amount of time.

Once you have decided on your date range for the backfill, you can set your time period in Rivery to reflect this range. For example, let’s say that you want to load data in from Facebook Ads from the beginning of 2019.

In the time period selection dropdown, select ‘Custom Date Range’.

Let’s put January 1, 2019 as our start date. Next, we can either specify an End Date or leave it empty. When the End Date is left empty the river will pull from the specified Start Date up to the time the river is executed.

Incremental Loading - Using an API Connection

Upon the first run when using a custom date range, the river will pull data from the Start Date indicated to the End Date indicated (if End Date is left empty, it will pull data from the Start Date until the current time). The next time the river runs, the Start Date will be updated to reflect the date/time of the latest run, and the End Date will be turned empty. Thus, the river load will pick up where it left off from the previous load, which means no action is necessary after the historical backfill run to further configure the river for subsequent incremental loads.

In addition, you will also want to make sure to set the ‘Loading Mode’ on the Target step of the river to your loading requirements.

For incremental loading, you will likely want to use ‘Append Only’ or ‘Upsert-Merge.’ Using the ‘Append Only’ mode will not compare any of the data values in your source data to your target data - it will simply add the records pulled in the current run to your target table. If the ‘Upsert-Merge’ mode is specified, you will be required to define a primary key in the Column Mapping step.

You can define your merge key by navigating to the Column Mapping tab and clicking the key symbol next to the field(s) you would like to use as a key. By setting a field as your primary key, you are telling Rivery to check the source data values of this field (data being pulled in each run) against the target values of this field (data already stored in your target table).

The ‘Upsert-Merge’ functionality works as such:

If a source key value does not match any of the values in the target, the corresponding record will be upserted, or appended to the target table. If a source key matches a key value that already exists in the target table, then the record from the source will be merged, or replace the existing target record.

Thus, this process will account for new and changed data in the source system. The primary key is how you specify the expected granularity of your target table.

Historical Backfilling and Incremental Loading - Using an Database Connection

When connecting to a database as your source, you’ll first select your river mode:

Use ‘Multi-Tables’ to ingest data from multiple source tables at once. This is the recommended option for the historical load of an entire database to your cloud data warehouse.

In the Target tab, you can set your default loading mode to ‘Upsert-Merge’ from the start. If the tables do not already exist in your target, Rivery will create them on the fly.

In the Mapping step, you will find a list of schemas in your source database, which you can select to see the metadata of their underlying tables.

Click on the ‘Edit’ button on the far right of the table name to see the metadata for an individual table.

Here you can define the primary key for ‘Upsert-Merge’, just like in the previous section. You can set the table to be loaded incrementally by clicking into the ‘Table Settings’ tab and selecting ‘Incremental’ as your extract mode.

Then define the field by which to create your increments. This can be either a date field or an integer used as a running number.

Thus, when connecting to a database source, your river can be configured for both its historical backfill as well as its subsequent incremental loads all at once. The first execution will pull the entire history and create the tables in the target data warehouse, and the future loads will abide by the conditions set in the ‘Loading Mode’ and ‘Extract Method’ configurations.

4 Likes