This tutorial demonstrates a fully functional ETL pipeline through the following steps:
- Setting up an Amazon Redshift cluster, creating a table, and populating it from a data file in an AWS S3 bucket
- Configuring a VPC security group to enable inbound access to the Redshift database, then using PySpark SQL in VS Code to generate insights
1. Set up AWS Redshift Cluster and S3 bucket with data query capabilities
An open-source dataset, Seattle Real-Time Fire 911 Calls, can be uploaded into an AWS S3 bucket named seattle-realtime-emergence-fire-call. This assumes an AWS account has already been created, with credentials (an access key ID and secret access key) available to launch an Amazon Redshift cluster, create the S3 bucket, and load the data from Amazon S3. Readers can refer to a detailed tutorial on how to launch a Redshift cluster, create a table, and load data into a database table from a data file in an S3 bucket. Figure 1 illustrates the architecture of the ETL pipeline.
With the data uploaded to the S3 bucket, proceed to the AWS Redshift query editor to create the table into which the data file will be loaded. This task is done using the following SQL commands:
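The original SQL is shown as an image in the source; the sketch below reconstructs it under the assumption that the columns mirror the public dataset's fields. Adjust the names and lengths to match the schema in Figure 2.

```sql
-- Hypothetical schema; column names follow the public
-- Seattle Real-Time Fire 911 Calls dataset fields.
CREATE TABLE seattle_emergence_calls (
    address          VARCHAR(100),
    type             VARCHAR(80),
    datetime         VARCHAR(30),
    latitude         VARCHAR(20),
    longitude        VARCHAR(20),
    report_location  VARCHAR(60),
    incident_number  VARCHAR(20)
);
```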
The above code creates the seattle_emergence_calls table with the specified columns.
Next, load the data file from S3 into the table using the following SQL command:
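The original command is shown as an image; the sketch below is a typical COPY statement for this setup. The object key, IAM role ARN, and region are placeholders to replace with your own values.

```sql
-- Hypothetical COPY command: load the CSV from S3 into the table,
-- skipping the header row since the columns are defined in the schema.
COPY seattle_emergence_calls
FROM 's3://seattle-realtime-emergence-fire-call/<file>.csv'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/<redshift-role>'
REGION '<your-region>'
CSV
IGNOREHEADER 1;
```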
As seen in the COPY command, the header row of the CSV data file is ignored because the columns are already defined in the table schema shown in Figure 2. Other important parameters are CREDENTIALS and REGION, set according to the AWS IAM role and the location of the AWS cloud computing resources.
The next procedure covers configuring the VPC security group to enable inbound access to the AWS Redshift database from Visual Studio Code (VS Code), then using PySpark SQL to generate insights.
2. Configure the VPC security group to enable inbound access to the AWS Redshift database and use PySpark SQL to generate insights
To implement this step, a working knowledge of the Python programming language and the PySpark SQL library is assumed, and Visual Studio Code can be used as the Python development tool. The reader can refer to my previous Medium post for helpful tips on getting this up and running.
To connect to the Redshift database and read the data table from VS Code using PySpark: first, in the Redshift console, go to the Configuration tab; next, click the link next to VPC security group, which opens the EC2 console with the correct security group selected. In the inbound rules dialog displayed in the console, select Redshift from the Type dropdown menu. Next, from the Source dropdown, choose Anywhere. Finally, click the Save button to enable the new Redshift rule.
Open Visual Studio Code, copy the following Python script, and save the file.
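The original script is shown as an image in the source; the sketch below reconstructs its shape. The cluster endpoint, database, user, password, temp bucket, and jar file names are all placeholders, and the connector format name assumes the community spark-redshift data source.

```python
# A sketch of the connection script; every <angle-bracketed> value is a
# placeholder, and the jar names are assumptions for your downloaded drivers.
from pyspark.sql import SparkSession

# Add the Redshift JDBC driver and the spark-redshift connector to the session
spark = (
    SparkSession.builder
    .appName("redshift-insights")
    .config("spark.jars", "redshift-jdbc42.jar,spark-redshift.jar")
    .getOrCreate()
)

# Constant variables: connection details for the Redshift cluster
JDBC_URL = (
    "jdbc:redshift://<cluster-endpoint>:5439/<database>"
    "?user=<username>&password=<password>"
)
TABLE = "seattle_emergence_calls"
TEMP_DIR = "s3a://<temp-bucket>/spark-redshift-tmp/"

# Spark read operation: load the Redshift table into a DataFrame.
# forward_spark_s3_credentials forwards the S3 credentials Spark already
# uses to Redshift over JDBC.
df = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", JDBC_URL)
    .option("dbtable", TABLE)
    .option("tempdir", TEMP_DIR)
    .option("forward_spark_s3_credentials", "true")
    .load()
)

df.printSchema()
```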
In the above code, the Redshift JDBC driver is specified and later added to the Spark session. Redshift is based on PostgreSQL: both allow querying structured and semi-structured data through a SQL workbench, but Redshift listens on its own default port, 5439, so the appropriate Redshift JDBC driver library must be referenced to perform read and write operations. In addition, ensure that a recent Java Runtime Environment (JRE) is installed on the operating system in use. Next come the definitions of the constant variables, followed by the implementation of the Spark read operation. If the forward_spark_s3_credentials option is set to true, the data source automatically detects the credentials that Spark uses to connect to S3 and forwards them to Redshift over JDBC.
From here, it should be straightforward to implement various transformations and generate insights with the PySpark DataFrame, as shown in the following example:
The resulting insights should then appear in the VS Code terminal window.