Module - SageMaker

In this module we will learn how to use Amazon Managed Workflows for Apache Airflow (MWAA) to develop Machine Learning (ML) workflows, or pipelines. An ML workflow orchestrates a sequence of tasks, such as collecting data, transforming it, and training, testing, and evaluating an ML model, to achieve a business outcome.

Business requirement: Build a model to predict the fare amount for a taxi ride in New York City.

In this exercise we’ll use the same “NYC taxi ride” dataset and the same MWAA environment that we created and used earlier in the workshop.


We’ll start by exploring the data, transforming it, and training a model on it. We’ll fit the model using an Amazon SageMaker managed training cluster. We’ll then deploy to an endpoint to perform batch predictions on the test dataset. All of these tasks will be plugged into a workflow that can be orchestrated and automated through the MWAA integration with Amazon SageMaker.

The diagram below shows the ML workflow we’ll implement:

SageMaker Pipeline

The workflow performs the following tasks:

  1. Data pre-processing: Extract and pre-process data from Amazon S3 to prepare the training data.
  2. Train the model: Train a model with the Amazon SageMaker built-in XGBoost algorithm on the training data and generate model artifacts. The training job will be launched by the Airflow Amazon SageMaker operator.
  3. Tune the hyperparameters: A conditional/optional task to tune the hyperparameters of XGBoost to find the best model. The hyperparameter optimization (HPO) job will be launched by the Airflow Amazon SageMaker operator.
  4. Batch inference: Using the trained model, get inferences on the test dataset stored in Amazon S3 using the Airflow Amazon SageMaker operator.
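
To make the ordering of these tasks concrete, here is a minimal sketch in plain Python of how the optional tuning step could be wired in. The task names and the `hpo_enabled` flag are hypothetical, and the sketch assumes that the tuning job, when enabled, runs in place of the plain training job; the DAG we build later in the workshop may arrange this differently.

```python
from typing import Dict, List

# Hypothetical task names; the workshop's DAG may name its tasks differently.
PREPROCESS = "preprocess_data"
TRAIN = "train_model"
TUNE = "tune_hyperparameters"
BATCH_INFERENCE = "batch_inference"


def choose_model_task(hpo_enabled: bool) -> str:
    """Return the id of the task to run after pre-processing.

    Assumption: tuning replaces plain training when it is enabled.
    """
    return TUNE if hpo_enabled else TRAIN


def build_dependencies(hpo_enabled: bool) -> Dict[str, List[str]]:
    """Map each task to the tasks that run directly after it."""
    model_task = choose_model_task(hpo_enabled)
    return {
        PREPROCESS: [model_task],
        model_task: [BATCH_INFERENCE],
        BATCH_INFERENCE: [],
    }


def execution_order(dependencies: Dict[str, List[str]], start: str) -> List[str]:
    """Walk the linear chain of tasks from the start task to the last one."""
    order = [start]
    while dependencies[order[-1]]:
        order.append(dependencies[order[-1]][0])
    return order


if __name__ == "__main__":
    for flag in (False, True):
        chain = execution_order(build_dependencies(flag), PREPROCESS)
        print(f"hpo_enabled={flag}: " + " -> ".join(chain))
```

In Airflow, a function shaped like `choose_model_task` could serve as the callable of a branching operator, which follows whichever task id the callable returns.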

Airflow Amazon SageMaker operators

Amazon SageMaker operators are custom operators available with Airflow that allow it to talk to Amazon SageMaker and perform the following ML tasks:

  • SageMakerTrainingOperator: Creates an Amazon SageMaker training job.
  • SageMakerTuningOperator: Creates an Amazon SageMaker hyperparameter tuning job.
  • SageMakerTransformOperator: Creates an Amazon SageMaker batch transform job.
  • SageMakerModelOperator: Creates an Amazon SageMaker model.
  • SageMakerEndpointConfigOperator: Creates an Amazon SageMaker endpoint config.
  • SageMakerEndpointOperator: Creates an Amazon SageMaker endpoint to make inference calls.
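
Each of these operators takes a `config` dictionary that mirrors the request body of the corresponding Amazon SageMaker API call. As an illustration, the sketch below builds a training job configuration in the shape of the `CreateTrainingJob` request. The helper name `build_training_config`, the instance type, the hyperparameter values, and the CSV content type are assumptions made for this example, not the workshop’s actual settings; the image URI, role ARN, and S3 locations are left as parameters because they depend on your account and Region.

```python
from typing import Any, Dict


def build_training_config(
    job_name: str,
    image_uri: str,
    role_arn: str,
    train_s3_uri: str,
    output_s3_uri: str,
) -> Dict[str, Any]:
    """Build a dict in the shape of the SageMaker CreateTrainingJob request.

    All concrete values below are illustrative assumptions.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # XGBoost container image for your Region
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
                "ContentType": "text/csv",  # assumes pre-processing writes CSV
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # SageMaker expects hyperparameter values as strings.
        "HyperParameters": {
            "objective": "reg:squarederror",  # regression on the fare amount
            "num_round": "100",
        },
    }
```

A dictionary like this would then be passed to the operator, for example `SageMakerTrainingOperator(task_id="train_model", config=training_config, wait_for_completion=True)`. The module that the operator is imported from depends on the Airflow and Amazon provider versions installed in your MWAA environment.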

Okay, with that background, it’s time to build!