Advanced Feature - Triggering Rivers through Apache Airflow with the Rivery API

Using the Rivery API to schedule data transformations in Apache Airflow

What is the Rivery API?

The Rivery API allows you to integrate the functionality of Rivery’s platform into other applications or schedulers, regardless of the programming language they are written in. You can read more about how to set up authentication credentials for the Rivery API here.

What is Apache Airflow?

Apache Airflow is a popular open-source, Python-based platform used to author, schedule, and monitor workflows. Airflow’s scalability, extensibility, and integration with DevOps tools such as Docker have made it the go-to platform for data engineers building data ingestion and transformation workflows. We believe that Airflow’s customizability, dynamic nature, and scheduling options, combined with Rivery’s intuitive UI for building ELT pipelines in the cloud, make for an exciting combination that allows - for the first time - both technical and non-technical teams in an organization to build data workflows.

In this tutorial, geared toward advanced users coming from a data engineering background, we walk through how to enable and execute Rivers using the Apache Airflow platform to better integrate Rivery with existing enterprise data engineering architecture. While Rivery can stand on its own as a fully fledged data integration/management/orchestration tool, part of Rivery’s value comes from its adaptability to existing organizational data architecture.

This tutorial assumes a basic familiarity with Apache Airflow and Docker, including creating a containerized environment from a docker-compose .yml file and configuring a local webserver to run Apache Airflow.

  1. First, we need to create Rivery API credentials. To do this, log into your Rivery account and select the API Tokens option from the left-hand panel.

  2. Select the button to create a new token, give the token a name of your choice, and hit “Create”.

  3. IMPORTANT: After you name your token, you will be shown a screen with your unique token identifier. This is the only time this information will be displayed, so copy/paste it or write it down and keep it in a safe place:

  4. Navigate to your terminal and launch your Airflow webserver with the command “docker-compose up”, then navigate to http://localhost:[your port number]/ in your browser.

  5. The first thing we need to do is add the Rivery API as a connection in Airflow. The easiest way to do this is through the Airflow UI. Navigate to “Admin”, then click “Connections”:

  6. On the Connections screen, click the “Create” tab and enter the following credentials into the corresponding input boxes:

  7. Hit the “Save” button to create the connection.

  8. In the next set of steps, we will create a simple Airflow DAG (Directed Acyclic Graph) to call the river that we want to control. Open up your default text editor/Python IDE.

    a. Import the necessary libraries:
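    A typical set of imports for a DAG like this one might look as follows (this sketch assumes Airflow 1.x-style module paths; in Airflow 2.x the operators live under airflow.operators.bash and airflow.operators.dummy):

    ```python
    # Standard library helpers for scheduling arguments
    from datetime import datetime, timedelta

    # Core Airflow DAG object and the two operators used in this tutorial
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.dummy_operator import DummyOperator
    ```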

    b. Next, pass in the following default arguments to the DAG (configure these based on your desired preferences):
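    A minimal sketch of such a dictionary (the specific values here are illustrative, not requirements):

    ```python
    from datetime import datetime, timedelta

    # Illustrative defaults; tune owner, start_date, retries, etc. to your needs.
    default_args = {
        "owner": "airflow",
        "depends_on_past": False,
        "start_date": datetime(2021, 1, 1),
        "email_on_failure": False,
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }
    ```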
    c. Create the DAG object and pass in the default arguments created above:
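    One way this might look (the DAG id and schedule below are assumptions; replace them with your own):

    ```python
    from datetime import datetime
    from airflow import DAG

    # Minimal default arguments; use the fuller dict from the previous step.
    default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

    dag = DAG(
        "rivery_trigger_dag",        # hypothetical DAG id
        default_args=default_args,
        schedule_interval="@daily",  # adjust to your desired schedule
        catchup=False,
    )
    ```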
    d. Create a “DummyOperator” object to initiate DAG workflow like so:
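    For example (the task id is an assumption; "dag" is the DAG object created in the previous step):

    ```python
    from airflow.operators.dummy_operator import DummyOperator

    # A no-op task that marks the start of the workflow
    start_task = DummyOperator(task_id="start", dag=dag)
    ```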
    e. We will use the Airflow BashOperator to run the POST call to the Rivery API that triggers our river. For this POST call you will need two pieces of information: 1) the API token that you saved in step 3, and 2) your river_id, which you can get here:

    Once you have these two pieces of information, the full curl is as follows (replace the “<>” placeholders and the content between them with your specific API token and river_id). Additional escape characters have been added for Python string formatting.

    curl -X POST \"<rivery api url>\" \
    -H \"accept: application/json\" \
    -H \"Authorization: Bearer <your api token>\" \
    -H \"Content-Type: application/json\" \
    -d \"{ \\\"river_id\\\": \\\"<your river_id>\\\" }\"

    f. Create the BashOperator object and enter the full curl from step 8e as the “bash_command” parameter:
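    A sketch of what this might look like, with the curl embedded as a Python string (the task id is an assumption, and the <> placeholders must be replaced with your endpoint, token, and river_id):

    ```python
    from airflow.operators.bash_operator import BashOperator

    # The escaped quotes survive both Python and bash; "dag" is the DAG
    # object created earlier.
    trigger_river = BashOperator(
        task_id="trigger_river",
        bash_command="curl -X POST \"<rivery api url>\" "
                     "-H \"accept: application/json\" "
                     "-H \"Authorization: Bearer <your api token>\" "
                     "-H \"Content-Type: application/json\" "
                     "-d \"{ \\\"river_id\\\": \\\"<your river_id>\\\" }\"",
        dag=dag,
    )
    ```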

    g. Lastly, call the workflow by adding the following line of code:
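    Assuming the task names used above, the dependency line might look like:

    ```python
    # Run the dummy start task first, then the task that triggers the river
    start_task >> trigger_river
    ```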

  9. Save your finished DAG as a .py file in the same directory as your other DAGs.

  10. Open up your Airflow UI; you should now see your newly created DAG in your list of available DAGs:

  11. Switch your DAG to the “On” position if it is not already. Depending on your settings, this should trigger a run, which you can observe by clicking on the DAG. When all three of the boxes turn dark green, your DAG has run successfully and the River that you indicated has been called to run:

  12. On the Rivery console, under the “Activities” Tab, you should be able to see the run of the river that was triggered by the DAG:

  13. The ability to build complex workflows from Rivery’s simple GUI and then schedule and control them using an open-source, mainstream data engineering tool like Airflow enables many people in an organization to take part in the data management process. We look forward to seeing how Rivery’s platform and the resulting technical enablement bring value to your organization.
