Code Ready ETL using Pyspark, VS Code, AWS Redshift, and S3

BigCodeGen
Dec 30, 2021


This tutorial demonstrates a fully functional ETL pipeline based on the following procedures:

  1. Setting up an Amazon Web Services (AWS) Redshift cluster, creating a table, and populating it from a data file stored in an AWS S3 bucket
  2. Configuring the VPC security group to enable inbound access to the AWS Redshift database, then using PySpark SQL in VS Code to generate some insights.

1. Set up AWS Redshift Cluster and S3 bucket with data query capabilities

An open-source dataset, Seattle Real-Time Fire 911 Calls, can be uploaded into an AWS S3 bucket named seattle-realtime-emergence-fire-call. This assumes an AWS account has already been created to launch an Amazon Redshift cluster and to create a bucket in Amazon S3, along with AWS credentials (an access key ID and secret access key) to load the data from Amazon S3. Readers can refer to a detailed tutorial on how to launch an AWS Redshift cluster, create a table, and load data into a database table from a data file in an Amazon S3 bucket. Figure 1 illustrates the architecture of the ETL pipeline.

Fig 1. ETL process pipeline architecture

With the data uploaded to the S3 bucket, proceed to the Amazon Redshift query editor to create the table into which the data file will be loaded. This task is done using the following SQL commands:

Figure 2: Create AWS Redshift table
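A minimal sketch of such a CREATE TABLE statement is shown below; the column names and types are assumptions based on the fields in the Seattle Real-Time Fire 911 Calls dataset, not necessarily the exact schema used in Figure 2.

```sql
-- Hypothetical table schema for the Seattle Real-Time Fire 911 Calls data;
-- column names and types are illustrative assumptions.
CREATE TABLE seattle_emergence_calls (
    address         VARCHAR(200),
    type            VARCHAR(100),
    call_datetime   TIMESTAMP,
    latitude        DOUBLE PRECISION,
    longitude       DOUBLE PRECISION,
    report_location VARCHAR(200),
    incident_number VARCHAR(50)
);
```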

The above code creates the seattle_emergence_calls table with the specified columns.

Next, load the data file from S3 into the table using the following SQL command:

Figure 3: Load CSV data file into the Redshift table from the S3 bucket
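A hedged sketch of the COPY command follows; the S3 object key, IAM role ARN, and region are placeholders and should be replaced with the actual values for the bucket and cluster.

```sql
-- Hypothetical COPY command; the S3 object key, IAM role ARN, and region
-- are placeholders.
COPY seattle_emergence_calls
FROM 's3://seattle-realtime-emergence-fire-call/Seattle_Real_Time_Fire_911_Calls.csv'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/myRedshiftRole'
REGION 'us-east-2'
CSV
IGNOREHEADER 1;
```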

As seen in the COPY SQL command, the header row of the CSV data file is ignored because the column names are already defined in the table schema in Figure 2. The other important parameters are the security CREDENTIALS and the REGION, which are set according to the AWS IAM role and the location of the AWS cloud computing resources.

The next procedure covers the configuration of the VPC security group to enable inbound access to the AWS Redshift database from Visual Studio Code (VS Code), and then using PySpark SQL to generate some insights.

2. Configure the VPC security group to enable inbound access to the AWS Redshift database and use PySpark SQL to generate insights

To implement this step, working knowledge of the Python programming language and the PySpark SQL library is assumed, and Visual Studio Code can be used as the Python development tool. The reader can refer to my previous Medium post for helpful tips on how to get this up and running.

Also, to connect to Redshift and read the data table from VS Code using PySpark: first, in the Redshift console, go to the Configuration tab; next, click the link next to VPC security group, which should open the EC2 console with the correct security group selected. In the dialog box displayed in the console, select Redshift from the Type dropdown menu. Next, from the Source dropdown, change Custom to Anywhere. Finally, click the Save button to enable the new Redshift inbound rule.
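For readers who prefer to script this step, the same inbound rule (TCP port 5439 open to anywhere) could be added with boto3 instead of the console; this is only a sketch, and the security group ID is a placeholder.

```python
import boto3

# Hypothetical ID of the Redshift cluster's VPC security group
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-2")

# Allow inbound Redshift traffic (TCP 5439) from anywhere, mirroring the
# console steps above. Restrict the CIDR range in a real deployment.
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Redshift inbound"}],
    }],
)
```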

Open Visual Studio Code, copy the following Python script, and save the file.
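A minimal sketch of such a script is shown below, assuming the Redshift JDBC driver and the spark-redshift community connector (with its dependencies) are available locally; the JAR path, cluster endpoint, database, credentials, table name, and S3 temp directory are placeholders.

```python
# Minimal sketch: read a Redshift table into a PySpark DataFrame.
# All paths, endpoints, and credentials below are placeholders.
from pyspark.sql import SparkSession

# Local path to the Redshift JDBC driver JAR (add connector JARs as needed)
JAR_PATH = "/path/to/redshift-jdbc42-2.1.0.9.jar"

# Build the SparkSession and register the driver JAR with Spark
spark = (SparkSession.builder.appName("RedshiftETL")
         .config("spark.jars", JAR_PATH)
         .getOrCreate())

# Constant variables: JDBC connection URL, target table, S3 staging directory
JDBC_URL = "jdbc:redshift://my-cluster.abc123.us-east-2.redshift.amazonaws.com:5439/dev?user=awsuser&password=my_password"
TABLE_NAME = "seattle_emergence_calls"
TEMP_DIR = "s3a://seattle-realtime-emergence-fire-call/temp/"

# Read the Redshift table into a DataFrame via the spark-redshift connector,
# forwarding Spark's S3 credentials to Redshift over JDBC
df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
      .option("url", JDBC_URL)
      .option("dbtable", TABLE_NAME)
      .option("tempdir", TEMP_DIR)
      .option("forward_spark_s3_credentials", "true")
      .load())

df.printSchema()
```

The line numbers referenced in the next paragraph refer to the original script; they line up only approximately with this sketch.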

In the above code, it can be observed that the Redshift JDBC driver JAR_path (Line 6) is specified and later added to the Spark SparkSession config (Lines 9-10). Redshift is essentially a PostgreSQL-based database: it listens on port 5439 and allows querying of semi-structured and structured data through a SQL workbench. Hence, the appropriate JDBC driver library must be referenced to perform read and write operations. In addition, ensure that a recent Java Runtime Environment (JRE) is installed on the operating system in use. Lines 14-16 define the constant variables. Lines 20-25 implement the Spark read operation. If the forward_spark_s3_credentials option is set to true, the data source automatically detects the credentials that Spark uses to connect to S3 and forwards them to Redshift over JDBC.

From here, it should be pretty easy to implement various transformations and generate insights with the PySpark DataFrame, as shown in the following example:
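For instance, a hypothetical aggregation that counts calls by incident type might look like the following; the type column name is an assumption based on the schema sketched earlier.

```python
from pyspark.sql import functions as F

# Hypothetical insight: the most common 911 call types by number of calls
calls_by_type = (df.groupBy("type")
                 .agg(F.count("*").alias("num_calls"))
                 .orderBy(F.desc("num_calls")))

calls_by_type.show(10, truncate=False)
```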

And, hopefully, the resulting insights will appear in the VS Code terminal window.
