
Querying Amazon Redshift with Events and Schedules

By Farid Neshat, Cloud Engineer, HeleCloud The Netherlands

Introduction

Amazon Web Services (AWS) has introduced the Redshift Data API, which can run arbitrary SQL statements on Redshift clusters asynchronously. It comes at no additional cost and can be added as an EventBridge target, giving significant potential for integration across your AWS ecosystem.

To use the Redshift Data API for multi-step Extract-Transform-Load (ETL) jobs, we have created “redshift-query-step-function”. This is a CloudFormation template that deploys a Step Functions state machine to your account which can run, track, and retry Redshift queries using the Redshift Data API, enabling far greater reliability. In this blog article, we will explore the numerous ways of running queries on Amazon Redshift based on events and schedules in a reliable and cost-effective way.

Background

At HeleCloud, we utilise many AWS services for customer projects, including Amazon Redshift, and we are often looking for ways to improve the performance and reliability of the queries we run.

Amazon Redshift is a managed data warehouse service that allows analysing terabytes of data using standard SQL. You can find more about it here.

When running queries on Redshift, the ability to automate them based on events or schedules is a requirement that comes up often. Example use cases include:

  • Loading data into Redshift upon creation of S3 objects
  • Analysing data from the previous day
  • Refreshing materialised views daily
  • Inserting the results of specific queries into a table or S3.

In previous years, the standard and easy solution would have been to deploy cron jobs to EC2 instances. For such a simple task, you had to keep the EC2 instance patched and continuously running, and you had to manually rerun failed queries whenever something went wrong (instance failures, networking disruptions, database failures).

What were the alternatives?

Data Pipeline

AWS Data Pipeline was an option that achieved the same goal in much the same manner. It would launch an EC2 instance from a given AMI and run the given queries in the configured steps. This was an improvement over running the EC2 instance ourselves, but it was still clunky, and we had to make sure failed queries were retried manually. AWS no longer adds features to it, and it supports neither newer EC2 instance types nor passwordless Redshift authentication via the Redshift GetClusterCredentials API.

The Redshift GetClusterCredentials API can generate temporary credentials for a Redshift user, which can then be used to authenticate and run SQL statements. Using this API means that we do not have to store Redshift credentials anywhere or worry about rotating them; we can simply use IAM roles for the compute environments that run the queries, making the whole setup more straightforward.
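
As a rough sketch, the passwordless flow could look like the following with boto3 and a PostgreSQL driver such as psycopg2 (the cluster identifier, database, user, and endpoint below are placeholders):

import boto3
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

redshift = boto3.client("redshift")

# Ask Redshift for short-lived credentials for an existing database user.
# The IAM identity running this code needs redshift:GetClusterCredentials.
creds = redshift.get_cluster_credentials(
    DbUser="master",                  # placeholder database user
    DbName="testdb",                  # placeholder database
    ClusterIdentifier="testcluster",  # placeholder cluster
    DurationSeconds=900,
)

# Connect with the temporary credentials instead of a stored password.
conn = psycopg2.connect(
    host="testcluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="testdb",
    user=creds["DbUser"],              # returned with an "IAM:" prefix
    password=creds["DbPassword"],
)

No password is stored or rotated anywhere; the credentials simply expire after the requested duration.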

AWS Lambda

AWS Lambda paired with Step Functions, running the queries directly, was a good option. However, the Lambda function had to stay alive while the query was running, which limited us to 15 minutes of run time per query. This meant long-running analysis queries and large inserts were not possible without breaking the work into smaller steps, which increased both complexity and cost.

AWS Glue

Then AWS Glue arrived, a managed Apache Spark environment that could ingest data from S3, transform it, and then load it into Redshift through S3. Using it for our simple queries that loaded data into Redshift and then transformed it there (ELT rather than ETL) was like using a spaceship to commute from one city to another (instead of between planets) when you could have simply used a car. However, there was no time limit and the environment was completely managed, so the complexity compared to the other solutions was lower.

The AWS Glue team saw these types of workloads and introduced AWS Glue Python Shell jobs, similar to AWS Lambda but with no time limit and Python as the only supported language. Since Python was our team's primary language, this was an advantage. AWS Glue jobs support retrying on failures, and they can be started based on the completion of other jobs.

We created a simple library called redshift-query that could run directly in AWS Glue Python Shell jobs, and wrapped it in a CloudFormation template that takes SQL statements as arguments and runs them based on events or schedules with passwordless authentication. This was our semi-managed query runner: it hid the complexity, and we could easily add new queries to it.

However, this was still not ideal, as we were paying for compute environments that were simply waiting for an SQL statement to finish running on Redshift, and they were idle 99% of the time.

Redshift Data API

A few months ago, AWS introduced the Redshift Data API, an API that runs SQL statements asynchronously; the results can be checked and fetched later. This removes the problem of managing and paying for compute environments that sit idle, merely waiting for queries to finish running.
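
As an illustration, a minimal boto3 sketch of the asynchronous flow might look like this (the cluster, database, and user are placeholders; the polling loop is only for demonstration, since the statement Id can equally be checked much later):

import time
import boto3

client = boto3.client("redshift-data")

# Submit the statement; the call returns immediately with a statement Id
# while Redshift runs the query in the background.
statement = client.execute_statement(
    ClusterIdentifier="testcluster",  # placeholder cluster
    Database="testdb",                # placeholder database
    DbUser="master",                  # temporary credentials are generated for this user
    Sql="select count(*) from favoritemovies;",
)

# Check the status until the statement reaches a terminal state.
while True:
    description = client.describe_statement(Id=statement["Id"])
    if description["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

if description["Status"] == "FINISHED" and description.get("HasResultSet"):
    result = client.get_statement_result(Id=statement["Id"])
    print(result["Records"])
else:
    print(description["Status"], description.get("Error"))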

To take advantage of this, we wrapped the Redshift Data API with AWS Lambda and Step Functions, which enables automatic retries and allows executions to be started from EventBridge events, for example on a schedule or when new files are added to S3. We have open sourced it as redshift-query-step-function.

Here is an example deployment:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Resources:
  RedshiftQueryStepFunction:
    Type: AWS::Serverless::Application
    Properties:
      Location:
        ApplicationId: arn:aws:serverlessrepo:eu-west-1:708984232979:applications/redshift-query-step-function
        SemanticVersion: 0.1.0
      Parameters:
        DefaultUser: master
        DefaultClusterId: testcluster
        DefaultDb: testdb
  Schedule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(1 day)
      Targets:
        - Arn: !GetAtt [RedshiftQueryStepFunction, Outputs.StateMachine]
          Id: step-function
          Input: |
            ["copy favoritemovies from 'dynamodb://Movies' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' readratio 50;",
             "vacuum favoritemovies;"]

Shortly after that, AWS introduced the Redshift Data API as a target for EventBridge rules. This means you no longer need AWS Lambda just to call the Redshift Data API. Here is how you can use it via CloudFormation:

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  Cluster:
    Type: String
    Default: 'test-cluster'
  DbUser:
    Type: String
    Default: 'test'
  DbName:
    Type: String
    Default: 'test'
Resources:
  Role: # The same role that submits a query must later be used to check its results, so it is also assumable from within the account
    Type: AWS::IAM::Role
    Properties:
      RoleName: EventBridgeRedshiftQueryRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              # Allow assuming this role from within the account, so the status
              # and error of failed queries can be checked via the Data API
              AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
              Service:
                - events.amazonaws.com # For EventBridge
            Action:
              - 'sts:AssumeRole'
      Policies:
        - PolicyName: 'RedshiftAccess'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - redshift-data:*
                Resource: "*"
              - Effect: Allow
                Action: redshift:GetClusterCredentials
                Resource:
                  - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbuser:${Cluster}/${DbUser}"
                  - !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:dbname:${Cluster}/${DbName}"
  RunQueryOnRedshiftRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(1 day)
      Targets:
        - Arn: !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:cluster:${Cluster}"
          Id: query-redshift
          RoleArn: !GetAtt [Role, Arn]
          RedshiftDataParameters:
            Database: !Ref DbName
            DbUser: !Ref DbUser
            Sql: "vacuum favoritemovies;"
            StatementName: "vacuum-movies" # Matched by the retry rule below
            WithEvent: true # Emit a status-change event to EventBridge when the statement finishes

Here is a video from AWS demonstrating setting this up via the console: https://www.youtube.com/watch?v=qvtb3ONjj8o

Once your queries have run, you can use the Redshift Data API to check their results. However, you can only do that if you call the API from the same IAM role that was provided to the Redshift Data ExecuteStatement API.
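
As a rough sketch, checking the outcome from that same role could look like this with boto3 (the account id is a placeholder; the role and statement names come from the template above):

import boto3

# Assume the same role that EventBridge uses to submit the statement.
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/EventBridgeRedshiftQueryRole",  # placeholder account id
    RoleSessionName="check-redshift-query",
)
credentials = assumed["Credentials"]

data = boto3.client(
    "redshift-data",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# Look up the most recent run of the scheduled statement and inspect its status.
statements = data.list_statements(StatementName="vacuum-movies", MaxResults=1)
latest = data.describe_statement(Id=statements["Statements"][0]["Id"])
print(latest["Status"], latest.get("Error"))

A simple retry can be added by catching the failure event that the Redshift Data API sends to EventBridge: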

RetryRedshiftQueryRule:
  Type: AWS::Events::Rule
  Properties:
    EventPattern:
      detail:
        state: ["FAILED"]
        statementName: ["vacuum-movies"]
      detail-type: ["Redshift Data Statement Status Change"]
      source: ["aws.redshift-data"]
    Targets:
      - Arn: !Sub "arn:aws:redshift:${AWS::Region}:${AWS::AccountId}:cluster:${Cluster}"
        Id: query-redshift
        RoleArn: !GetAtt [Role, Arn]
        RedshiftDataParameters:
          Database: !Ref DbName
          DbUser: !Ref DbUser
          Sql: "vacuum favoritemovies;"
          StatementName: "retry-vacuum-movies"

This is effective, but it retries almost immediately (usually within a minute in our experiments) rather than after a configurable delay, and if you want multiple retries you must create more targets, so it does not scale well in terms of reliability. This is fine for queries that do not need to be retried if the cluster goes down for an extended time. However, for some use cases you would prefer that queries keep being retried until they either expire or succeed.

For those cases, it is best to use the Step Functions solution mentioned above.

Conclusion

AWS has finally released a way to query Redshift in a serverless fashion via the Redshift Data API; however, there is no native way of retrying in case of failure. This post showed a few ways to use the Redshift Data API serverlessly and introduced methods to retry queries with different strategies.

If you have any questions about this blog, please feel welcome to get in touch. Or, for more ‘How To on AWS’ blogs, click here.