Glue ETL Job

Next, we will import the job script into Glue via the AWS Console. To complete this activity, follow the steps below:

  • Edit the Glue job script nyc_raw_to_transform.py (previously downloaded) to set the bucket name to airflow-yourname-bucket in the data sink (last) step; a sketch of what this step typically looks like appears after this list.

    If you haven’t downloaded the scripts previously, you can download the package from here

  • Copy the modified job script to the S3 path s3://airflow-yourname-bucket/scripts/glue/. You will need to create the glue folder under scripts if this wasn’t done during setup. (A scripted alternative to the console upload is sketched after this list.)

  • Log in to the AWS Glue Console

  • Add a job by providing the following details:

    • Name - nyc_raw_to_transform
    • IAM Role - AWSGlueServiceRoleDefault
    • Select - An existing script that you provide
    • S3 path where the script is stored - s3://airflow-yourname-bucket/scripts/glue/nyc_raw_to_transform.py
    • Temporary directory - s3://airflow-yourname-bucket/glue-temp
    • Next, choose Save Job and Edit Script
    • Make sure that the datasink4 step is pointing to the right target bucket
    • Click on Save
    • Close the editor by clicking on the X mark
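
For reference, the data sink step at the end of a Glue-generated PySpark script typically looks like the snippet below. This is a minimal sketch: the upstream frame name (dropnullfields3), the output prefix under the bucket, and the parquet format are assumptions based on common Glue-generated code, so match them to the actual script you downloaded and change only the bucket name.

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,  # illustrative name of the upstream DynamicFrame
    connection_type="s3",
    # This is the path to edit: it must point at airflow-yourname-bucket
    # (the prefix under the bucket shown here is an assumption)
    connection_options={"path": "s3://airflow-yourname-bucket/data/transformed"},
    format="parquet",  # output format is an assumption
    transformation_ctx="datasink4",
)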

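If you prefer to script the upload and job creation instead of clicking through the console, a minimal boto3 sketch might look like the following. The local script path, Python version, and the assumption that your credentials and region are already configured are all illustrative; the job settings mirror the console values listed above.

import boto3

# Upload the modified job script to the scripts/glue prefix
s3 = boto3.client("s3")
s3.upload_file(
    "nyc_raw_to_transform.py",  # local path to the edited script (assumption)
    "airflow-yourname-bucket",
    "scripts/glue/nyc_raw_to_transform.py",
)

# Create the Glue job with the same settings entered in the console
glue = boto3.client("glue")
glue.create_job(
    Name="nyc_raw_to_transform",
    Role="AWSGlueServiceRoleDefault",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://airflow-yourname-bucket/scripts/glue/nyc_raw_to_transform.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--TempDir": "s3://airflow-yourname-bucket/glue-temp"},
)
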
Using the custom Glue operator and hook, the DAG task can be written as shown below to invoke an existing Glue job.

glue_task = AWSGlueJobOperator(
    task_id="glue_task",
    job_name="nyc_raw_to_transform",
    iam_role_name="AWSGlueServiceRoleDefault",
    dag=dag,
)

The above task runs the Glue job and waits for it to complete before triggering the next step in the data pipeline.
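
For context, here is a minimal sketch of how this task could sit inside a DAG. The import path assumes the custom operator is exposed through Airflow’s plugin mechanism, and the DAG id, schedule, and downstream dummy task are illustrative, not part of the workshop code.

from datetime import datetime

from airflow import DAG
from airflow.operators import AWSGlueJobOperator  # assumes the custom plugin is installed
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="nyc_taxi_pipeline",  # illustrative DAG id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

glue_task = AWSGlueJobOperator(
    task_id="glue_task",
    job_name="nyc_raw_to_transform",
    iam_role_name="AWSGlueServiceRoleDefault",
    dag=dag,
)

# The downstream step runs only after the Glue job reports completion
next_step = DummyOperator(task_id="next_step", dag=dag)
glue_task >> next_step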