Glue Crawlers

Currently there is no Glue Operator shipped with the latest stable release of Apache Airflow (at the time this workshop was prepared). However, it is fairly straightforward to write a custom Operator using the existing AWS hook and the boto3 library.

The scripts below are included in the code package that was downloaded during the S3 setup and deployed to Airflow during instance creation.

aws_glue_job/crawler_hook.py - contains the modules required to interact with the Glue API through the AWS hook. The two main hook methods defined in this script initialize a Glue job/crawler and wait for its completion once it has been submitted through the API.
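
For reference, a minimal sketch of what such a hook could look like is shown below. The class and method names are illustrative rather than the workshop's actual code; it assumes the Amazon provider's AwsBaseHook and the boto3 Glue client's start_crawler and get_crawler calls.

import time

from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook


class GlueCrawlerHook(AwsBaseHook):
    """Interact with the Glue crawler API via boto3 (illustrative sketch)."""

    def __init__(self, *args, **kwargs):
        # Ask the base hook for a boto3 Glue client
        kwargs['client_type'] = 'glue'
        super().__init__(*args, **kwargs)

    def initialize_crawler(self, crawler_name):
        # Submit an on-demand crawl through the Glue API
        self.get_conn().start_crawler(Name=crawler_name)

    def wait_for_completion(self, crawler_name, poll_interval=30):
        # Poll until the crawler leaves RUNNING/STOPPING and returns to READY
        while True:
            state = self.get_conn().get_crawler(Name=crawler_name)['Crawler']['State']
            if state == 'READY':
                break
            time.sleep(poll_interval)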

aws_glue_job/crawler_operator.py - contains a method that executes the Glue job/crawler using the hook.
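
The operator itself can be equally small: it starts the crawler in execute() and blocks until the crawl finishes. The sketch below is again illustrative and assumes the hypothetical GlueCrawlerHook from the previous sketch.

from airflow.models import BaseOperator

# Hypothetical hook from the previous sketch
from aws_glue_job.crawler_hook import GlueCrawlerHook


class AWSGlueCrawlerOperator(BaseOperator):
    """Run a Glue crawler and wait for it to complete (illustrative sketch)."""

    def __init__(self, crawler_name, iam_role_name=None,
                 aws_conn_id='aws_default', **kwargs):
        super().__init__(**kwargs)
        self.crawler_name = crawler_name
        self.iam_role_name = iam_role_name
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # Kick off the crawl, then poll until it completes
        hook = GlueCrawlerHook(aws_conn_id=self.aws_conn_id)
        hook.initialize_crawler(self.crawler_name)
        hook.wait_for_completion(self.crawler_name)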

Before we add the code block for the Glue crawler run step, let's create the crawler using the AWS CLI in the Cloud9 workspace created previously.

Edit the command below to set the Glue IAM service role ARN (noted previously) and the S3 bucket name in the crawl target path.

If you are at an AWS event and/or skipped the manual setup, get the Glue IAM service role ARN as follows: navigate to the IAM console, choose Roles, search for AWSGlueServiceRoleDefault, click the role listed, and copy the role ARN to replace in the command below.

aws glue create-crawler \
--name airflow-workshop-raw-green-crawler \
--role arn:aws:iam::111111111111:role/AWSGlueServiceRoleDefault \
--database-name default \
--targets '{"S3Targets":[{"Path":"s3://airflow-yourname-bucket/data/raw/green"}]}'
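
Optionally, you can confirm the crawler was created by describing it:

aws glue get-crawler --name airflow-workshop-raw-green-crawler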

Now we will add a step to the DAG to invoke the crawler once the data arrives in S3.

# Import the custom operator from the package described above
from aws_glue_job.crawler_operator import AWSGlueCrawlerOperator

glue_crawler = AWSGlueCrawlerOperator(
    task_id="glue_crawler",
    crawler_name='airflow-workshop-raw-green-crawler',
    iam_role_name='AWSGlueServiceRoleDefault',
    dag=dag)
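
Finally, chain the new task after whichever task lands the data in S3 so the crawler only runs once the files are in place. The upstream task name below is a placeholder for the actual task in your DAG:

# 'copy_to_s3' is a placeholder for the upstream task in your DAG
copy_to_s3 >> glue_crawler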

Proceed to the next step.