Automate Ingestion of Public Datasets using Actions

Many public datasets on the web have no supported API, but they can still be retrieved as CSV or JSON over a plain HTTP link.

In this example, we’ll use public data hosted on GitHub for COVID-19 cases across Canada. This data is publicly accessible via this link:

To automate the ingestion of this data into a data warehouse, we’ll walk through the following steps:

  1. Creating an Action river to use as a template
  2. Creating a Source to Target river for data ingestion (using the Action from Step 1 as the source) into a target data warehouse.

Set Up the Action Template

Create a new Action river and input the URL from above into the API URL window:

Then, in the Results section, select ‘Data’. Under Results Details, select ‘CSV’, as this is the expected output format.

To test that the call is successful, click the ‘Test Rest Action’ button on the right side of the screen.
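For readers who want to sanity-check the link outside the UI, the same kind of check can be sketched in a few lines of Python: issue an HTTP GET and confirm the body parses as CSV. This is only an illustration, not what the ‘Test Rest Action’ button runs internally. The function names are hypothetical, the dataset URL is not reproduced here (pass your own to fetch_csv_rows), and the sample rows are made up purely to show the output shape.

```python
import csv
import io
import urllib.request


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv_rows(url, timeout=30):
    """GET the URL and parse the response body as CSV.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    so a failing link surfaces as an exception.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # utf-8-sig drops a leading byte-order mark if the file has one
        text = response.read().decode("utf-8-sig")
    return parse_csv_text(text)


# Offline illustration with made-up rows (hypothetical columns, not real case data)
sample = "date,province,cases\n2020-03-01,Ontario,3\n2020-03-01,Quebec,1\n"
rows = parse_csv_text(sample)
print(len(rows), "rows; first row:", rows[0])
```

If fetch_csv_rows returns a non-empty list whose keys match the expected header, the link is serving CSV that the Action should be able to read.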

Save the Action.

Use the Action Template for Data Ingestion

Create a new Data Source to Target river and name it. In the Source tab, select ‘REST API’ as the source type. In the ‘Rest Action River’ dropdown, select the Action created in the previous step:

Next, select the desired target in the Target tab and enter the Database, Schema, Table, and other output settings. In ‘Additional Options’ under ‘File Settings’, make sure that the file type is set to CSV.

Move to the Column Mapping tab and click ‘Auto-Mapping’ to automatically detect and map the schema of the output table.
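To give a feel for what schema detection on a CSV involves, here is a minimal sketch that reads the header row and picks the narrowest type that fits every value in each column. It is an assumption-laden illustration, not the platform's actual Auto-Mapping logic: the type names (INTEGER, FLOAT, DATE, STRING), the function names, and the sample rows are all hypothetical.

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Return the narrowest of INTEGER, FLOAT, DATE, STRING that fits every non-empty value."""
    non_empty = [v for v in values if v]  # skip empty strings and missing cells
    if not non_empty:
        return "STRING"
    candidates = (("INTEGER", int), ("FLOAT", float), ("DATE", date.fromisoformat))
    for type_name, parser in candidates:
        try:
            for v in non_empty:
                parser(v)
            return type_name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "STRING"


def infer_schema(csv_text):
    """Map each header column to an inferred type name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {col: infer_type([row[col] for row in rows]) for col in (reader.fieldnames or [])}


# Made-up sample rows (hypothetical columns, not real case data)
sample = "date,province,cases,rate\n2020-03-01,Ontario,3,0.5\n2020-03-02,Quebec,,1.25\n"
schema = infer_schema(sample)
print(schema)
```

Whatever the tool detects, it is worth reviewing the mapped types before the first run, since a column that looks numeric in a small sample can contain text further down the file.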

At this point, either click ‘RUN’ to force start the process or navigate to the Schedule tab to set a frequency for automation.

And that’s it, no more manual downloads!