Module - Data Processing

Now that we have completed the setup of an Airflow environment in MWAA, let's look at how we can build a data pipeline incorporating the following steps (a sketch of the resulting DAG follows the list) -

  • An S3 Sensor that waits for raw files to arrive in a predefined S3 bucket prefix
  • A Glue crawler that creates the table metadata in the data catalog
  • A Glue job that transforms the raw data into a processed format, converting the file format along the way
  • An EMR job that generates the reporting data sets
  • S3-to-Redshift copy of the aggregated data [Optional]
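
To make the flow concrete, here is a minimal sketch of how these steps could be chained together using the Amazon provider operators. All resource names (buckets, prefixes, crawler, Glue job, EMR cluster ID, Redshift table) are placeholders, and the import paths shown assume Airflow 2.x with the apache-airflow-providers-amazon package installed in the MWAA environment; older Airflow versions expose the same operators under different module paths. Treat this as an outline of the dependency chain, not the exact DAG we will build.

```python
# Sketch of the pipeline DAG - all resource names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="data_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # triggered by the S3 copy we perform later in this module
    catchup=False,
) as dag:

    # 1. Wait for raw files to land in the raw prefix of the data lake bucket
    wait_for_raw_data = S3KeySensor(
        task_id="wait_for_raw_data",
        bucket_name="my-data-lake-bucket",           # placeholder bucket
        bucket_key="raw/nyc-taxi/*",                 # placeholder prefix
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    # 2. Crawl the raw data to create/update the Glue Data Catalog table
    crawl_raw_data = GlueCrawlerOperator(
        task_id="crawl_raw_data",
        config={"Name": "nyc-taxi-raw-crawler"},     # existing crawler, placeholder name
    )

    # 3. Run the Glue ETL job that converts the raw files into a processed format
    transform_raw_data = GlueJobOperator(
        task_id="transform_raw_data",
        job_name="nyc-taxi-transform",               # existing Glue job, placeholder name
        script_args={"--output_prefix": "processed/nyc-taxi/"},
    )

    # 4. Submit an EMR step that builds the reporting data sets, then wait for it
    add_reporting_step = EmrAddStepsOperator(
        task_id="add_reporting_step",
        job_flow_id="j-XXXXXXXXXXXXX",               # existing EMR cluster, placeholder ID
        steps=[{
            "Name": "build_reporting_datasets",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://my-data-lake-bucket/scripts/reporting.py"],
            },
        }],
    )
    wait_for_reporting_step = EmrStepSensor(
        task_id="wait_for_reporting_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='add_reporting_step', "
                "key='return_value')[0] }}",
    )

    # 5. (Optional) Copy the aggregated output into Redshift
    copy_to_redshift = S3ToRedshiftOperator(
        task_id="copy_to_redshift",
        schema="public",
        table="nyc_taxi_report",                     # placeholder table
        s3_bucket="my-data-lake-bucket",
        s3_key="reporting/nyc-taxi/",
        copy_options=["FORMAT AS PARQUET"],
        redshift_conn_id="redshift_default",
    )

    (wait_for_raw_data >> crawl_raw_data >> transform_raw_data
        >> add_reporting_step >> wait_for_reporting_step >> copy_to_redshift)
```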

Airflow Pipeline Architecture

Data Pipeline DAG

Apache Airflow provides several building blocks for setting up a DAG that schedules a single task or a collection of tasks, such as Operators, Sensors, Hooks, Tasks, and Connections. You can read more here - https://airflow.apache.org/docs/apache-airflow/1.10.12/concepts.html

Take a moment to get familiar with these components before we work on setting up the DAG for the data pipeline; the snippet below shows how they fit together.
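
As a quick, hypothetical illustration: an Operator instantiated inside a DAG becomes a Task, a Sensor is an Operator that waits for a condition, and a Hook talks to an external system using credentials stored in a Connection (for example the aws_default connection that resolves to the MWAA execution role). The bucket and key names are placeholders, and the import paths again assume Airflow 2.x with the Amazon provider.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def list_raw_files():
    # A Hook wraps an external system; it resolves credentials from the
    # "aws_default" Connection configured in the Airflow environment.
    s3 = S3Hook(aws_conn_id="aws_default")
    keys = s3.list_keys(bucket_name="my-data-lake-bucket", prefix="raw/")  # placeholder bucket
    print(keys)


with DAG(dag_id="concepts_demo", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    # A Sensor is an Operator that waits for a condition - here, a key in S3
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",          # each instantiated operator becomes a Task
        bucket_name="my-data-lake-bucket",
        bucket_key="raw/_SUCCESS",
        aws_conn_id="aws_default",
    )

    # A regular Operator doing work - here a PythonOperator calling our function
    list_files = PythonOperator(task_id="list_files", python_callable=list_raw_files)

    wait_for_file >> list_files           # task dependency
```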

Preparing the data set and Scripts

For this demo, we will work with the NYC taxi ride open data set. Details about the data set are available here - http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

We will not copy the data over to S3 just yet. We will hold off on this step until we have set up the DAG in Airflow, and then perform an S3 copy to trigger the S3 Sensor in the DAG (a small example of such a copy is sketched below).
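
When the time comes, the copy can be done with the AWS CLI or a small script. As a hypothetical example, a boto3 snippet along these lines (with placeholder file, bucket, and key names) would land a file in the raw prefix that the S3 Sensor watches:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local copy of the NYC taxi data into the prefix the S3 Sensor watches.
# File, bucket, and key names below are placeholders - use the ones from your setup.
s3.upload_file(
    Filename="yellow_tripdata_sample.csv",
    Bucket="my-data-lake-bucket",
    Key="raw/nyc-taxi/yellow_tripdata_sample.csv",
)
```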

There are a couple of scripts we will use over the course of the exercise. The script package needs to be downloaded from here.

Alright! Let's get started with writing the DAG code. You can use your favorite IDE to put together the DAG script - the Python file that we will deploy to Airflow.