Now that we have completed the setup of an Airflow instance in MWAA, let's look at how we can build a data pipeline incorporating the following steps -
Apache Airflow provides several components for setting up a DAG to schedule a single task or a collection of tasks, such as Operators, Sensors, Hooks, Tasks, and Connections. You can read more here - https://airflow.apache.org/docs/apache-airflow/1.10.12/concepts.html
Take some time to get familiar with these components before we work on setting up a DAG for the data pipeline.
For this demo, we will work with the NYC taxi ride open data set. Details about the data set are available at http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
We will not copy the data over to S3 just yet. We will hold off on this step until we have set up the DAG in Airflow, and then perform an S3 copy to trigger the S3 Sensor in the DAG.
There are a couple of scripts we will use during the course of this exercise. The script package needs to be downloaded from here.
Alright! Let’s get started with writing the DAG code. You can use your favorite IDE to put together the Python DAG file that we will deploy to Airflow.