Redshift

This module is optional; complete it if you want to try out the load-to-Redshift part of the data pipeline.

Create Redshift IAM Role

In this step, we will create an IAM role for the Redshift cluster.

  • Go to: IAM Console
  • Click on Create role
  • Select Redshift from the list of services
  • Select Redshift - Customizable under Select your use case
  • Click Next: Permissions
  • In the search box, search for and check the following two policies:
    • AmazonS3FullAccess
    • AWSGlueConsoleFullAccess
  • Click Next: Tags
  • Click Next: Review
  • Enter the Role name as AmazonMWAA-workshop-redshift-role
  • Verify that the following two policies are attached to the role:
    • AmazonS3FullAccess
    • AWSGlueConsoleFullAccess
  • Click Create role

Read more about associating IAM roles with a Redshift cluster here: https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html
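The console steps above can also be sketched programmatically. The snippet below builds the trust policy that the "Redshift - Customizable" use case produces, plus the two managed-policy ARNs to attach; with boto3 these would feed `create_role` and `attach_role_policy` (shown only as comments, so nothing is actually created):

```python
import json

# Trust policy allowing Redshift to assume the role
# (what the "Redshift - Customizable" console choice generates).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# The two managed policies attached in the console steps above.
policy_arns = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
]

# Roughly equivalent boto3 calls (not executed here):
#   iam = boto3.client("iam")
#   iam.create_role(RoleName="AmazonMWAA-workshop-redshift-role",
#                   AssumeRolePolicyDocument=json.dumps(trust_policy))
#   for arn in policy_arns:
#       iam.attach_role_policy(
#           RoleName="AmazonMWAA-workshop-redshift-role", PolicyArn=arn)
print(json.dumps(trust_policy, indent=2))
```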


Create Subnet groups

We will create a subnet group associated with the MWAA Airflow VPC.

  • Go to: Redshift Console
  • Click Create cluster subnet group
  • Enter a name for the subnet group - AmazonMWAA-workshop-subnet-group
  • Enter a description - Subnet group for the Amazon MWAA workshop
  • Under Add subnets
    • From the VPC drop down, select MWAAEnvironment
    • Click on Add all the subnets for this VPC
  • Click Create cluster subnet group
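The same subnet group can be described as a single API request. The dictionary below mirrors the console fields and would be passed to boto3's `redshift.create_cluster_subnet_group(**params)`; the subnet IDs are placeholders for the MWAAEnvironment VPC's subnets:

```python
# Request parameters mirroring the console fields above.
# The subnet IDs are placeholders - substitute all subnets of the
# MWAAEnvironment VPC.
params = {
    "ClusterSubnetGroupName": "AmazonMWAA-workshop-subnet-group",
    "Description": "Subnet group for the Amazon MWAA workshop",
    "SubnetIds": ["subnet-XXXXXXXX", "subnet-YYYYYYYY"],
}
# With boto3 (not executed here):
#   boto3.client("redshift").create_cluster_subnet_group(**params)
```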

Create Redshift cluster

In this step, you will create a two-node Redshift cluster to store the aggregated data.

  • Go to: Redshift Console
  • Click Create Cluster
  • Leave Cluster identifier as redshift-cluster-1
  • Select dc2.large as Node Type
  • Select Number of Nodes as 2.
  • Verify Configuration summary.
  • Leave Database port (optional) as 5439
  • Change Master user name to admin
  • Enter Master user password.
    • Check the password rules listed below the Master user password field before choosing a password. (This password will be used to access the cluster.)
  • Expand Cluster permissions (optional)
    • Select previously created AmazonMWAA-workshop-redshift-role from the drop down menu.
    • Click Add IAM role
    • AmazonMWAA-workshop-redshift-role should appear under Attached IAM roles
  • Under Additional configurations
    • Toggle the Use defaults slider
    • Under Network and Security, select the MWAAEnvironment from the VPC drop down
    • Leave the rest of the settings at their defaults
  • Click Create Cluster

It will take a few minutes for the cluster to move into the Available state.

Once the cluster is up and in the Available state, move on to the next step.
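For reference, the cluster choices made above map onto a single `create_cluster` request. The snippet builds that parameter set; the IAM role's account ID and the password are placeholders, and the call itself is shown only as a comment:

```python
# Request parameters mirroring the console choices above.
# The account ID in the role ARN and the password are placeholders.
cluster_params = {
    "ClusterIdentifier": "redshift-cluster-1",
    "ClusterType": "multi-node",
    "NodeType": "dc2.large",
    "NumberOfNodes": 2,
    "Port": 5439,
    "MasterUsername": "admin",
    "MasterUserPassword": "REPLACE_WITH_A_VALID_PASSWORD",
    "IamRoles": [
        "arn:aws:iam::111122223333:role/AmazonMWAA-workshop-redshift-role"
    ],
    "ClusterSubnetGroupName": "AmazonMWAA-workshop-subnet-group",
}
# With boto3 (not executed here):
#   boto3.client("redshift").create_cluster(**cluster_params)
```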


Create S3 Gateway Endpoint

In this step, we will create an S3 Gateway Endpoint so that the Redshift cluster can communicate with S3 using its private IP.

Read more about Amazon Virtual Private Cloud (VPC) here: https://docs.amazonaws.cn/en_us/vpc/latest/userguide/what-is-amazon-vpc.html

Read more about S3 Gateway Endpoints here: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html

  • Go to: AWS VPC Console
  • Click Create Endpoint
  • Select AWS Services under Service category (which is the default selection)
  • In the Service name search box, search for “s3” and hit enter/return.
  • com.amazonaws.[region].s3 should come up as a search result. Select the entry from the list whose Type is Gateway.
  • Under VPC, choose the MWAAEnvironment VPC. This is the same VPC that was used for configuring the Redshift cluster.
  • Under Configure route tables, select the listed route table (this should be the main route table). You can verify this by checking Yes under Main column.
  • Leave Policy as default. (Full Access)
  • Click Create endpoint.
  • It should take a couple of seconds to provision. Once it is ready, the Status of the newly created S3 endpoint should show available.
  • Edit the name of the endpoint and make it RedshiftS3EP
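The endpoint created above corresponds to an `ec2.create_vpc_endpoint` request. The snippet below assembles the parameters; the VPC and route-table IDs are placeholders, and the region in the service name should match yours:

```python
# Request parameters for the S3 gateway endpoint. The VPC and route
# table IDs are placeholders; substitute your region in ServiceName.
endpoint_params = {
    "VpcEndpointType": "Gateway",
    "VpcId": "vpc-XXXXXXXX",                  # the MWAAEnvironment VPC
    "ServiceName": "com.amazonaws.us-east-1.s3",
    "RouteTableIds": ["rtb-XXXXXXXX"],        # the VPC's main route table
    # Leaving PolicyDocument unset keeps the default Full Access policy.
}
# With boto3 (not executed here):
#   boto3.client("ec2").create_vpc_endpoint(**endpoint_params)
# Afterwards, tag the endpoint with Name=RedshiftS3EP.
```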

Verify and add rules to the default security group

In this step, you will verify and add rules to the Redshift security group so that the Glue service can talk to Redshift.

  • Go to: VPC Security Groups
  • Select the Redshift security group. It should be the default security group if it was not changed during the Redshift cluster creation step.
    • If you have more than one security group in the list, follow the below step:
      • Go to: Redshift Console
      • Click on redshift-cluster-1
      • Click on Properties tab.
      • Scroll down and check Network and security section for Security Group id.
  • Once you have verified the Security Group, select the security group (check box at the beginning of the row).
  • Click Inbound Rules.
    • Click Edit inbound rules
    • Check whether a self-referencing rule exists. (This should exist by default; if not, add a rule as listed below.)
      • Type: All Traffic
      • Source: [the same security group which you are editing]
        Note : A self-referencing rule is required for Glue components to communicate.
    • Add HTTPS rule for Amazon S3 access.
      • Type: HTTPS
      • Source: [s3-prefix-list-id]
        Note : Under Source select Custom and type “pl” into the textbox beside it, the previously created S3 endpoint will show up and select it.
    • Add a Redshift rule to allow the Airflow environment to access the cluster.
      • Type: Redshift
        Note : Under Source select Custom and type “airflow” into the textbox beside it, the security group attached to the airflow instance will show up and select it.
    • Click Save rules.
  • Click Outbound Rules.
    • Click Edit rules
    • Leave the existing rule as-is, which has the following values:
      • Type: All Traffic
      • Destination: 0.0.0.0/0
    • Add a self-referencing rule. (This should exist by default; if not, add a rule as listed below.)
      • Type: All TCP
      • Destination: [the same security group which you are editing]
        Note : A self-referencing rule is required for Glue components to communicate.
    • Click Save rules.
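The three inbound rules above can be expressed as an `IpPermissions` list for boto3's `ec2.authorize_security_group_ingress`. All IDs below are placeholders for the groups and prefix list created earlier:

```python
# Placeholder IDs - substitute your own.
redshift_sg = "sg-XXXXXXXX"     # the Redshift (default) security group
airflow_sg = "sg-YYYYYYYY"      # the security group attached to Airflow
s3_prefix_list = "pl-XXXXXXXX"  # prefix list of the S3 gateway endpoint

ip_permissions = [
    # Self-referencing rule: required for Glue components to communicate.
    {"IpProtocol": "-1",
     "UserIdGroupPairs": [{"GroupId": redshift_sg}]},
    # HTTPS from the S3 gateway endpoint's prefix list.
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "PrefixListIds": [{"PrefixListId": s3_prefix_list}]},
    # Redshift port from the Airflow security group.
    {"IpProtocol": "tcp", "FromPort": 5439, "ToPort": 5439,
     "UserIdGroupPairs": [{"GroupId": airflow_sg}]},
]
# With boto3 (not executed here):
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId=redshift_sg, IpPermissions=ip_permissions)
```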

Create schema and Redshift tables

  • Go to: Redshift Query Editor
  • Enter the appropriate database connection details (Default Database name is dev unless changed)
  • Click Connect
  • Execute the queries below to create the schema and table for the aggregated data.
-- Create nyc schema.
CREATE SCHEMA nyc;

-- Create nyc.green table.
CREATE TABLE IF NOT EXISTS nyc.green (
  pulocationid      bigint,
  trip_type         bigint,
  payment_type      bigint,
  total_fare_amount float
);

That’s the end of the Redshift cluster setup. We will refer to the table we just created in the Airflow DAG later to load data into it.
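As a preview of that load, it typically boils down to a Redshift COPY from S3 into nyc.green. The snippet below just builds such a statement as a string; the S3 path and role ARN are placeholders, and the CSV format clause is an assumption to be matched to whatever the transform step actually writes:

```python
# Placeholders - substitute the bucket written by the pipeline and the
# ARN of the AmazonMWAA-workshop-redshift-role created earlier.
s3_path = "s3://YOUR-BUCKET/aggregated/green/"
iam_role_arn = (
    "arn:aws:iam::111122223333:role/AmazonMWAA-workshop-redshift-role"
)

# The kind of statement the Airflow DAG would issue against the cluster.
# CSV is an assumed format; match it to the transform job's output.
copy_sql = (
    f"COPY nyc.green FROM '{s3_path}' "
    f"IAM_ROLE '{iam_role_arn}' "
    "CSV;"
)
print(copy_sql)
```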