Amazon Redshift Checklist
This checklist aims to be an exhaustive list of all elements you should consider when using Amazon Redshift.
Table of Contents
How to use
All items in the Amazon Redshift Checklist are relevant to the majority of projects, but some can be omitted depending on your situation. We use 3 levels of flexibility:
- 🔴 means the item can't be omitted for any reason.
- 🟡 means the item is highly recommended but can be omitted in some particular cases.
- 🟢 means the item is recommended but can be omitted in certain situations.
Some resources carry an emoticon to help you identify the type of content/help you may find in the checklist:
- 📖 documentation or article
- 🛠 online tool
- 📹 media
Sister Projects
Checklist
Designing Tables
🔴 Select an appropriate table distribution style
To exploit the parallel nature of Redshift, data must be correctly distributed within each table of the cluster. Tables not distributed correctly (based on their query patterns) will generally lead to poor query performance.
- 📖 Choosing a data distribution style
- 📖 Amazon Redshift now recommends distribution keys for improved query performance
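As a sketch, assuming a hypothetical sales fact table that is frequently joined to a listing table on listid:

```sql
-- Co-locate rows that join on listid (DISTSTYLE KEY) so the join can run
-- on each node without redistributing data across the cluster.
CREATE TABLE sales (
    salesid   INTEGER NOT NULL,
    listid    INTEGER NOT NULL,
    saletime  TIMESTAMP,
    pricepaid DECIMAL(8,2)
)
DISTSTYLE KEY
DISTKEY (listid);

-- Small, rarely-updated dimension tables are often better as DISTSTYLE ALL,
-- which places a full copy of the table on every node.
CREATE TABLE dates (
    dateid  SMALLINT NOT NULL,
    caldate DATE NOT NULL
)
DISTSTYLE ALL;
```

Table and column names here are illustrative; the right distribution key depends on your own join and aggregation patterns.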
🟡 Set column compression
Ensures data is better compressed, using less storage space.
🟡 Select appropriate table sort keys
Ensures data is retrieved from within each node in the most performant way.
- 📖 Choosing sort keys
- 📖 Amazon Redshift now supports changing table sort keys dynamically
- 📖 Amazon Redshift now recommends sort keys for improved query performance
- 📖 Compound and Interleaved Sort Keys
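A minimal sketch, assuming a hypothetical table whose queries mostly filter on a time range:

```sql
-- A compound sort key on saletime lets Redshift skip blocks whose
-- zone maps fall outside the queried time range.
CREATE TABLE sales (
    salesid   INTEGER NOT NULL,
    saletime  TIMESTAMP NOT NULL,
    pricepaid DECIMAL(8,2)
)
COMPOUND SORTKEY (saletime, salesid);

-- Sort keys can also be changed on an existing table:
ALTER TABLE sales ALTER SORTKEY (saletime);
```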
🟢 Define table constraints
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.
- 📖 Defining constraints
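For illustration (table names hypothetical), constraints are declared with ordinary DDL even though Redshift will not enforce them:

```sql
-- Informational only: Redshift does not enforce these constraints,
-- but the planner uses them as hints. Your ETL must guarantee integrity.
CREATE TABLE listing (
    listid  INTEGER NOT NULL PRIMARY KEY,
    eventid INTEGER NOT NULL
);

CREATE TABLE sales (
    salesid INTEGER NOT NULL PRIMARY KEY,
    listid  INTEGER NOT NULL REFERENCES listing (listid)
);
```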
Loading Data
🔴 Use the COPY command
Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon EMR cluster, or a remote host that is accessed using a Secure Shell (SSH) connection.
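A typical S3 load looks like the following sketch; the bucket, IAM role ARN, and delimiter are placeholders:

```sql
-- Bulk-load pipe-delimited files from an S3 prefix into the sales table.
COPY sales
FROM 's3://my-bucket/tickit/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
REGION 'us-east-1';
```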
🟡 Compress data files
Compressed files generally load faster. Use GZIP, LZOP, BZIP2, or ZSTD.
- 📖 Compress your data files
- 📖 Redshift database benchmarks: COPY performance with compressed files
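COPY is told which codec was used via an option on the command; for example (file names and role are placeholders):

```sql
-- The GZIP option tells COPY the source files are gzip-compressed.
COPY sales
FROM 's3://my-bucket/tickit/sales/sales.txt.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';
```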
🟡 Use multi-row inserts
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible.
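A single statement carrying several rows is far cheaper than issuing one INSERT per row (table and values are illustrative):

```sql
-- One round trip and one commit for three rows.
INSERT INTO category (catid, catname) VALUES
    (1, 'MLB'),
    (2, 'NHL'),
    (3, 'NBA');
```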
🟡 Pre-sort data files in sort key order
Load your data in sort key order to avoid needing to vacuum.
🟡 Enable automatic compression
Use the COPY command with COMPUPDATE set to ON to automatically set column encoding for new tables during their first load.
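For example (paths and role are placeholders):

```sql
-- On the first load of a new table, COMPUPDATE ON samples the data
-- and applies column encodings automatically.
COPY sales
FROM 's3://my-bucket/tickit/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;
```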
🟢 Split data into multiple files
Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression.
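As an illustration (file names are hypothetical), the standard split utility can break a large load file into roughly equal chunks, which are then gzipped so the slices can be loaded in parallel:

```shell
# Generate a ~4 MB dummy load file (stand-in for real export data).
head -c 4000000 /dev/zero | tr '\0' 'x' > sales.txt

# Split into ~1 MB chunks: sales.txt.part_aa, sales.txt.part_ab, ...
split -b 1M sales.txt sales.txt.part_

# Compress each chunk; a COPY with the key prefix then loads them in parallel.
gzip sales.txt.part_*

ls sales.txt.part_*.gz
```

Aim for a number of files that is a multiple of the slices in your cluster so every slice does similar work.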
Performance
🔴 Enable automatic workload management (WLM)
With automatic WLM, Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query.
🟡 Enable concurrency scaling
Dynamically adds concurrent clusters, improving read query concurrency.
- 📖 Working with concurrency scaling
- 📖 Concurrency Scaling pricing
- 📹 Amazon Redshift Concurrency Scaling
🟡 🚀 Use AZ64 column compression encoding
Consider using AZ64, Redshift's proprietary column encoding algorithm.
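For example (table and columns hypothetical), AZ64 applies to numeric, date, and time types, so character columns still need another encoding such as ZSTD:

```sql
CREATE TABLE sales (
    salesid  INTEGER      ENCODE AZ64,
    saletime TIMESTAMP    ENCODE AZ64,
    sellerid INTEGER      ENCODE AZ64,
    notes    VARCHAR(256) ENCODE ZSTD  -- AZ64 does not support VARCHAR
);
```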
🟡 Analyse query performance
The STL_ALERT_EVENT_LOG table allows users to analyse and improve performance issues.
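A query along these lines surfaces the most frequent recent alerts together with the planner's suggested fix:

```sql
-- Most common alert events over the last 7 days, with suggested solutions.
SELECT event, solution, COUNT(*) AS occurrences
FROM stl_alert_event_log
WHERE event_time > DATEADD(day, -7, GETDATE())
GROUP BY event, solution
ORDER BY occurrences DESC;
```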
🟢 Disable automatic compression
Use the COPY command with COMPUPDATE set to OFF. Re-running compression analysis on every load of an already known data set will decrease performance.
- 📖 ANALYZE COMPRESSION
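For example (paths and role are placeholders), run the analysis once offline, then skip it on routine loads:

```sql
-- One-off: report recommended encodings for an existing table.
ANALYZE COMPRESSION sales;

-- Routine loads: encodings are already set, so skip the analysis.
COPY sales
FROM 's3://my-bucket/tickit/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE OFF;
```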
🟢 🚀 Use materialized views
Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing.
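A minimal sketch for a hypothetical daily-revenue dashboard query:

```sql
-- Precompute the aggregation once; dashboards read the result directly.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT TRUNC(saletime) AS sale_day, SUM(pricepaid) AS revenue
FROM sales
GROUP BY TRUNC(saletime);

-- Refresh after each load so the view reflects new data.
REFRESH MATERIALIZED VIEW daily_revenue;
```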
🟢 Enable short query acceleration (SQA)
SQA runs short-running queries in a dedicated space so that SQA queries aren't forced to wait in queues behind longer queries.
🟢 Use elastic resize scheduling
Consider scheduling an elastic cluster resize to accommodate heavier workloads, such as nightly ETL runs, and to shrink the cluster at times of the day when workloads are lighter.
🟢 Use TRUNCATE over DELETE
Consider using TRUNCATE instead of DELETE when clearing transient tables. TRUNCATE is much more efficient than DELETE and doesn't require a VACUUM and ANALYZE.
- 📖 TRUNCATE
Security
🔴 Enable cluster encryption
Ensure cluster encryption is turned on, protecting data at rest.
🔴 Disable public accessibility
Most clusters should not be publicly accessible and should therefore be set to private.
🔴 Enable enhanced VPC routing
Forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC.
🔴 Use user groups
To make permission management easier, create different user groups and grant privileges based on their roles. Add and remove users to/from groups instead of granting permissions to individual users.
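A sketch with hypothetical group, schema, and user names (the password is a placeholder):

```sql
-- Grant privileges to the group once...
CREATE GROUP analysts;
GRANT USAGE ON SCHEMA reporting TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO GROUP analysts;

-- ...then manage access purely through group membership.
CREATE USER jane PASSWORD 'Str0ngPassw0rd!' IN GROUP analysts;
ALTER GROUP analysts ADD USER john;
```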
🟡 🚀 Use federated user access
Consider providing user access via SAML 2.0 using AD FS, PingFederate, Okta, or Azure AD.
- 📖 Federate Database User Authentication Easily with IAM and Amazon Redshift
- 📖 Federate Amazon Redshift access with Microsoft Azure AD single sign-on
- 📖 Federate Amazon Redshift access with Okta as an identity provider
- 📖 Options for providing IAM credentials
🟡 🚀 Enable multi-factor authentication (MFA)
Consider enabling MFA for production workloads.
🟡 Use Secrets Manager for service accounts
Configure AWS Secrets Manager to automatically rotate Amazon Redshift passwords for service accounts, using the rotation Lambda function that Secrets Manager provides.
🟢 🚀 Use column-level access controls
Consider implementing column-level access controls to restrict users from accessing certain columns.
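For example (table, columns, and group are hypothetical):

```sql
-- Members of analysts can read only the listed columns; selecting any
-- other column of customers (e.g. ssn) is denied for them.
GRANT SELECT (custid, custname) ON customers TO GROUP analysts;
```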
Monitoring
🔴 Action Redshift Advisor recommendations
Redshift Advisor analyses your cluster and makes recommendations to improve performance and decrease costs.
🔴 Monitor long-running queries
Set an alarm to notify users when queries run for longer than expected, using the QueryDuration CloudWatch metric.
🔴 Monitor underutilised or overutilised clusters
Check whether your cluster is underutilised or overutilised using the CPUUtilization CloudWatch metric.
- 📖 Amazon Redshift performance data
- 🛠 isitfit
🔴 Monitor disk space usage
Check whether your cluster is running out of disk space, and whether you need to consider scaling, using the PercentageDiskSpaceUsed CloudWatch metric.
🔴 🚀 Enable CloudWatch anomaly detection
Applies machine-learning algorithms to the metric's past data to create a model of the metric's expected values.
🔴 Define query monitoring rules
Define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries.
🟡 Analyse workload performance
Optimise your cluster based on how much time queries spend on different stages of processing.
🟡 Use Redshift Advance Monitoring
This GitHub project provides an advanced monitoring system for Amazon Redshift that is completely serverless, based on AWS Lambda and Amazon CloudWatch. A Lambda function runs on a schedule, connects to the configured Redshift cluster, and generates CloudWatch custom alarms for common potential issues.
Consumption
🟡 🚀 Use the Data API
Using this API, you can access Amazon Redshift data with web services-based applications, including AWS Lambda, AWS AppSync, Amazon SageMaker notebooks, and AWS Cloud9. You can use its HTTP endpoint to run SQL statements without managing connections; calls to the Data API are asynchronous.
Cluster
🔴 Increase automated snapshot retention
The default retention period of 1 day can catch organisations out in a disaster recovery or rollback scenario. Consider increasing it to 35 days.
🟡 🚀 Use RA3 nodes
Consider using Redshift's new RA3 nodes with a mix of local cache and S3 backed elastic storage if compute requirements exceed dense compute or dense storage node levels.
- 📖 Amazon Redshift introduces RA3 nodes with managed storage enabling independent compute and storage scaling
- 📹 AWS re:Invent 2019: [NEW LAUNCH!] Amazon Redshift reimagined: RA3 and AQUA (ANT230)
- 📹 Amazon Redshift RA3 Nodes: Overview and How to Upgrade
🟢 Use Redshift Spectrum
Consider using Redshift Spectrum to allow users to query data directly from S3 through their Redshift cluster. This can replace a staging schema, whereby your staged data lives within your data lake and is read into Redshift via Spectrum.
- 📖 Getting started with Amazon Redshift Spectrum
- 📖 Why you're better off exporting your data to Redshift Spectrum, instead of Redshift
- 📖 Redshift Spectrum pricing
- 📹 Cost and usage controls for Amazon Redshift
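A minimal sketch, assuming a Glue Data Catalog database (names and role ARN are placeholders) that already contains an external sales table over files in S3:

```sql
-- Expose the Glue database to Redshift as an external schema.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Queries against the external schema scan the S3 data directly.
SELECT COUNT(*)
FROM spectrum_schema.sales
WHERE saledate >= '2020-01-01';
```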
🟢 🚀 Pause and resume clusters
Redshift can pause and resume a cluster within minutes. Take advantage of this feature on non-production clusters to save money.
🟢 🚀 Use elastic resize over classic resize
Consider using elastic resize over classic resize when changing the node types or the number of nodes in your Redshift cluster. Elastic resize is much quicker (minutes vs hours) and doesn't take your cluster out of commission.
Contributing
Open an issue or a pull request to suggest changes or additions.