New Feature - Using Rivery to drive ML models in BigQuery

Rivery’s recently added support for custom BigQuery SQL queries directly within a Logic river now enables advanced users to run BigQuery-supported machine learning models from directly within the Rivery platform.

BigQuery currently supports the following machine learning models:

  • Linear regression
  • Binary Logistic regression
  • Multiclass logistic regression
  • K-Means clustering
  • Matrix factorization
  • Time-series forecasting
  • Boosted Tree/XGBoost based classification/regression
  • Deep Neural Networks (DNN)

Advanced users can even generate custom ML models in other programming languages like Python using packages like Tensorflow and deploy them within Rivery.

In this post, we will work through an example of using the Rivery platform, BigQuery and binary logistic regression to predict binary classifiers on the famous Pima diabetes dataset. The Pima Indians diabetes dataset was released by the National Institute of Diabetes and Digestive and Kidney Diseases with the objective of predicting whether or not a patient would develop diabetes, based on certain health measurements.

Running a BigQuery ML model in a Rivery Logic River, step-by-step example:

1. First create a Data Source to Target river bringing in the dataset from google sheets into your BigQuery database.

2. Create a second Data Source to Target river to load into your BigQuery database, health measurements for patients whose risk for diabetes you are trying to predict. The format of the predictions table should be as shown below. In this case we want to predict the diabetes risk for four unique patients:

3. Create a Logic River calling the Data Source to Target river created step (1), and a second, concurrent step calling the Data Source to Target river created in step (2). Containerization will be useful for this. In this case, we named the river in step (1): “Pima Diabetes import to BigQuery” and the river in step (2): “Diabetes - New patients - import to BigQuery”

4. Add another step to the Logic River (outside of the container created in step (2)) selecting “SQL/Script” and “BigQuery” and using the same connection to your BigQuery database as used in the river created in step 1.

5. Building the Logistic regression ML model.

a. First thing you need to do is to split your dataset into a training set and a testing set (which you will use to validate your model performance before you decide to deploy it). In this example, we will be using an 80/20 train/test split. Our Pima dataset is 768 entries long, therefore we will be using 614 rows for training the data and the remaining rows to validate our model.

b. Enter the following query into your Logic step (ensure that the “SQL Script” slider is toggled) and replace the entries between “<>” (along with the “<>” themselves) with the project/table/model names relevant to your BigQuery cloud data warehouse:

6. Before we deploy our model, we would like to view the model’s performance. It may be necessary to choose another model type based on our specific use case if the accuracy is not up to par, or to bring in additional data to support model training. In a deployed clinical context, it may be necessary to aggregate models (known as bagging) to better predict the outcome. For this case, we stick with our simple classifier. Now we run the model we created over that data to see how accurately it predicts the outcome.

a. Create another Logic step with “SQL/Script” and “BigQuery” set in the options.

b. Enter the following query in to the “SQL” box:

c. Under “Advanced Options”, toggle the “Use Standard SQL” slider.

d. Select the “Target” button and choose the dataset, and a table name of your choosing. In this case, we call it “model_pima_diabetes_log_reg_performance_test_data”

e. Now, if you choose to run the river, you can see technical details about the performance of your ML model.

7. Now we use the model we created to predict whether a set of four patients for whom the required health metrics were recorded, will be at a high risk for developing diabetes.

a. Create another logic step with the same settings as in step (6a, 6c).

b. Enter the following SQL into the “SQL” box:

c. Set the “Target” table to a table name of your choosing, in this case we called the target table “diabetes_new_patient_data_predictions”.

d. Now we just need to create one more Logic step with the same settings as in step (6a, 6c) to join the patient_id’s from patients we are trying to predict, to the predictions from our model themselves, this can be accomplished by a simple join:

e. Select “Target” and save this final predictions table with a name of your choosing.

8. Run the river. In BigQuery, preview the table named in step (7e) and you can now see the patient_id and the model’s predictions (1 means likely to develop diabetes, 0 means unlikely).

9. You just used one of Rivery’s logic rivers to train, test, evaluate, and deploy a machine learning model! Advanced users can use the metrics in step (6e) to evaluate and compare numerous ML models and pick the best one for their use case.