awslabs / aws-lambda-redshift-loader

Amazon Redshift Database Loader implemented in AWS Lambda

License: Other

aws-lambda-redshift-loader's Introduction

A Zero Administration AWS Lambda Based Amazon Redshift Database Loader

Please note that this function is now deprecated; we recommend that you use the Auto COPY feature built into Redshift instead. Please see https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy-preview/ for more information.

With this AWS Lambda function, it's never been easier to get file data into Amazon Redshift. You simply drop files into pre-configured locations on Amazon S3, and this function automatically loads them into your Amazon Redshift clusters.

For automated delivery of streaming data to S3 and into Redshift, please consider using Amazon Kinesis Firehose instead of this function.


Using AWS Lambda with Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse available for less than $1000/TB/YR that provides AWS customers with an extremely powerful way to analyse their applications and business as a whole. To load their clusters, customers ingest data from a large number of sources, whether they are FTP locations managed by third parties, or internal applications generating load files. Best practice for loading Amazon Redshift is to use the COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html), which loads data in parallel from Amazon S3, Amazon DynamoDB or an HDFS file system on Amazon Elastic MapReduce (EMR).

Whatever the input, customers must run servers that look for new data on the file system, and manage the workflow of loading new data and dealing with any issues that might arise. That's why we created this AWS Lambda-based Amazon Redshift loader. It offers you the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically - with no servers to maintain. This is possible because AWS Lambda (http://aws.amazon.com/lambda) provides an event-driven, zero-administration compute service. It allows developers to create applications that are automatically hosted and scaled, while providing you with a fine-grained pricing structure.

Loader Architecture

The function maintains a list of all the files to be loaded from S3 into Amazon Redshift using a DynamoDB table. This list allows us to confirm that a file is loaded only one time, and allows you to determine when a file was loaded and into which table. Input file locations are buffered up to a specified batch size that you control, or you can specify a time-based threshold which triggers a load.

You can specify any of the many COPY options available, and we support loading CSV files (with any delimiter), AVRO files, and JSON files (with or without JSON paths specifications). All passwords and access keys are encrypted for security. With AWS Lambda you get automatic scaling, high availability, and built-in Amazon CloudWatch logging.

Finally, we've provided tools to manage the status of your load processes, with built in configuration management and the ability to monitor batch status and troubleshoot issues. We also support sending notifications of load status through Simple Notification Service - SNS (http://aws.amazon.com/sns), so you have visibility into how your loads are progressing over time.

Installing with AWS CloudFormation (Recommended)

This repository includes two CloudFormation templates (deploy.yaml and deploy-vpc.yaml) which create much of what is needed to set up the autoloader. This section details the setup and use of these templates.

This is a visual architecture of the CloudFormation installer:

Installer Architecture

The intent of this template is to simplify the setup work necessary to configure the autoloader.

Pre-work

Set up the KMS key to be used by the setup script which encrypts and decrypts the RedShift password. This key will require a specific alias, which is how the setup script picks it up. The alias must be LambdaRedshiftLoaderKey.
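
If you need to create this key, the following is a minimal sketch using the AWS CLI; the description and region shown are placeholders, and the key must be created in the region where the loader will run:

# create a customer managed KMS key (the description is an example only)
aws kms create-key --description "Lambda Redshift Loader encryption key" --region eu-west-1

# attach the alias that the setup script looks for, using the KeyId returned above
aws kms create-alias --alias-name alias/LambdaRedshiftLoaderKey --target-key-id <KeyId from create-key> --region eu-west-1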

Also, a user will be required with the necessary privileges to run the template. This user will require an access key, which is one of the input parameters required at runtime.

The template requires several input parameters: the ARN of the KMS key created above, and, if you are running in a VPC, the Subnet IDs through which the function should egress network traffic and the Security Groups that should be attached to the function.

Usage Steps

Create a CloudFormation stack with the template that you require, based upon whether you run in a VPC or not. You can also use the table of links below. This stack will include everything needed to set up the autoloader, with two exceptions: the KMS key must be created and managed separately, and a Redshift cluster will be required when setting up the autoloader. Note that this stack does not configure the autoloader itself, which must be done from your workstation or an EC2 instance.
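
As a hedged sketch, the stack can also be created from the AWS CLI rather than the console. The stack name and parameter key below are placeholders - use the parameter names actually declared in deploy.yaml or deploy-vpc.yaml:

aws cloudformation create-stack \
    --stack-name lambda-redshift-loader \
    --template-body file://deploy.yaml \
    --capabilities CAPABILITY_IAM \
    --parameters ParameterKey=KMSKeyARN,ParameterValue=arn:aws:kms:eu-west-1:123456789012:key/example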

Notes

  1. This stack will be created in the same region where you invoke the template.
  2. The input parameters are not cross-checked at template creation time, so make sure that the subnet choice matches the availability zone you require.
  3. The stack creates the Lambda trigger as well as the execution role - so they will be managed as part of the stack. It is expected that the EC2 instance and setup role can be used on an ongoing basis for the administration of the autoloader.

Launch Links

Region Launch without VPC Launch in VPC
eu-north-1
ap-south-1
eu-west-3
eu-west-2
eu-west-1
ap-northeast-2
ap-northeast-1
sa-east-1
ca-central-1
ap-southeast-1
ap-southeast-2
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2

Post Install

If you prefer to use an EC2 instance to configure the database loader rather than your laptop, then run this template using CloudFormation.

Once launched, log in to the EC2 instance created as part of the stack. It contains all the necessary components to set up the autoloader. Invoke the setup.js script on the created EC2 instance to begin configuring the autoloader.

Function Configuration

You can set the logging behaviour of the function by adding environment variables:

  • DEBUG - Sets log level debug and all messages will be shown in CloudWatch Logs
  • LOG_LEVEL - Sets the log level to whatever you require, per Winston documentation

You can also set the following environment variables to change the behaviour of the loader:

  • SuppressFailureStatusOnSuccessfulNotification - Allows the loader to silently fail the execution state if failure notifications were successfully sent to SNS. When set to true, you will not see failed Lambda events in CloudWatch.
  • SuppressWildcardExpansionPrefixList - Allows you to specify a set of S3 Prefixes for which Hive Partitioning Wildcard expansion will not be used. Use this setting if you have prefixes which include the = symbol but which should not be converted to value=*. Requires version 2.8.0+.
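
As an illustration of applying these variables, the AWS CLI can be used as shown below; the function name assumes the default LambdaRedshiftLoader, and note that update-function-configuration replaces the function's entire set of environment variables:

aws lambda update-function-configuration \
    --function-name LambdaRedshiftLoader \
    --environment "Variables={LOG_LEVEL=debug,SuppressFailureStatusOnSuccessfulNotification=true}"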

Installing Manually

This option is not recommended, but can be used if you just want to patch the binaries, etc. Proceed with caution.

Before you deploy - Lambda Execution Role

You need to create an IAM policy that AWS Lambda uses when it runs, and allows it to call SNS, use DynamoDB, write Manifest files to S3, perform encryption with the AWS Key Management Service, and pass STS temporary credentials to Redshift for the COPY command. We recommend that you obtain the required permissions from the contents of the deploy.yaml as they will always be up to date.

Deploy the function

  1. Go to the AWS Lambda Console in the same region as your S3 bucket and Amazon Redshift cluster.
  2. Select Create a Lambda function and select the 'Author from Scratch' option
  3. Enter the function name LambdaRedshiftLoader, and the Runtime value as 'Node.js '. The function name must be LambdaRedshiftLoader in order to use automated event source routing. The function was built and comprehensively tested on Node version .10, and is used by customers on a variety of other versions. Please report any issues around Node.js engine compatibility via Issues.
  4. Choose the IAM role that you would like to run the Lambda function under, as configured above
  5. Choose 'Create Function'
  6. Under the 'Function code' section, choose 'Upload a file from Amazon S3', and use the table below to find the correct s3 location for your region.
  7. Keep the default values of index.js for the filename and index.handler for the handler. We also recommend using the maximum timeout for the function to accommodate longer COPY times.
Region Function Code S3 Location
eu-north-1 s3://awslabs-code-eu-north-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ap-south-1 s3://awslabs-code-ap-south-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
eu-west-3 s3://awslabs-code-eu-west-3/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
eu-west-2 s3://awslabs-code-eu-west-2/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
eu-west-1 s3://awslabs-code-eu-west-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ap-northeast-2 s3://awslabs-code-ap-northeast-2/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ap-northeast-1 s3://awslabs-code-ap-northeast-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
sa-east-1 s3://awslabs-code-sa-east-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ca-central-1 s3://awslabs-code-ca-central-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ap-southeast-1 s3://awslabs-code-ap-southeast-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
ap-southeast-2 s3://awslabs-code-ap-southeast-2/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
eu-central-1 s3://awslabs-code-eu-central-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
us-east-1 s3://awslabs-code-us-east-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
us-east-2 s3://awslabs-code-us-east-2/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
us-west-1 s3://awslabs-code-us-west-1/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip
us-west-2 s3://awslabs-code-us-west-2/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.7.9.zip

When you're done, you'll see that the AWS Lambda function is deployed and you can submit test events and view the CloudWatch Logging log streams.

Lambda Function Versions

We previously released version 1.0 in distribution AWSLambdaRedshiftLoader.zip, which didn't use the Amazon Key Management Service for encryption. If you've previously deployed and used version 1.0 and want to upgrade to version 1.1, then you'll need to recreate your configuration by running node setup.js and reentering the previous values including connection password, symmetric encryption key, and optionally an S3 Secret Key. You'll also need to upgrade the IAM policy for the Lambda Execution Role as listed below, as it now requires permissions to talk to the Key Management Service.

Furthermore, version 2.0.0 adds support for loading multiple Redshift clusters in parallel. You can deploy the 2.x versions with a 1.1x configuration, and the Lambda function will transparently upgrade your configuration to a 2.x compatible format. This uses a loadClusters List type in DynamoDB to track all clusters to be loaded.

Configuring your VPC for connections between AWS Lambda and Redshift

Please click here for a full guide on how to configure AWS Lambda to connect to Redshift in VPC and non-VPC networking environments.

Entering the Configuration

Now that your function is deployed, we need to create a configuration which tells it how and if files should be loaded from S3. Simply install the AWS SDK for JavaScript and configure it with credentials as outlined at http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-intro.html and http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-configuring.html. You'll also need a local instance of Node.js - today the included client tools such as setup.js only run under pre-ES6 versions of Node (0.10 and 0.12 have been tested). NVM (https://github.com/creationix/nvm/blob/master/README.markdown) is a simple way to install and switch between Node versions. Then install the dependencies using the following command:

cd aws-lambda-redshift-loader && npm install

In order to ensure communication with the correct AWS Region, you'll need to set an environment variable AWS_REGION to the desired location. For example, for US East use us-east-1, and for Dublin use eu-west-1.

export AWS_REGION=eu-central-1

Next, run the setup.js script by entering node setup.js. The script asks questions about how the load should be done, including those outlined in the setup appendix at the end of this document. If you would rather automate this, then you can call the setup module directly by including common.js and calling the function setup(useConfig, dynamoDB, s3, lambda, callback), where useConfig is a valid configuration in the form of a DynamoDB Item, and dynamoDB, s3 and lambda are all client connections to the respective services.
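
The following is a minimal sketch of that programmatic path. It assumes that common.js exports the setup function with the signature described above, that the callback follows the usual Node.js (err, result) convention, and that myConfig.json is a hypothetical file containing a configuration already expressed as a DynamoDB Item (see the example configuration later in this document for the attribute names used):

var aws = require('aws-sdk');
var common = require('./common');

aws.config.update({ region: process.env['AWS_REGION'] });

var dynamoDB = new aws.DynamoDB();
var s3 = new aws.S3();
var lambda = new aws.Lambda();

// the configuration must be supplied in DynamoDB Item form (e.g. {"s3Prefix": {"S": "..."}})
var myConfig = require('./myConfig.json');

common.setup(myConfig, dynamoDB, s3, lambda, function (err) {
    if (err) {
        console.log('Setup failed: ' + JSON.stringify(err));
    } else {
        console.log('Configuration and event sources created');
    }
});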

Alternatively, you can populate config.json with your configuration values and run node setup-file.js to run a setup script that uses a JSON configuration file instead of reading the values from the command line.

All data used to manage the lifecycle of data loads is stored in DynamoDB, and the setup script automatically provisions the following tables:

  • LambdaRedshiftBatchLoadConfig - Stores the configuration of how files in an S3 input prefix should be loaded into Amazon Redshift.
  • LambdaRedshiftBatches - Stores the list of all historical and open batches that have been created. There will always be one open batch, and there may be multiple closed batches per S3 input prefix from LambdaRedshiftBatchLoadConfig.
  • LambdaRedshiftProcessedFiles - Stores the list of all files entered into a batch, which is also used for deduplication of input files.

Once the tables are configured, the setup script will automatically create an event source for the prefix you specified in S3, and start pushing ObjectCreated:* events to the database loader.

*** IMPORTANT ***

The tables used by this function are created with BillingMode: "PAY_PER_REQUEST" (from version 2.6.8), which means that they will autoscale as needed to cope with the demand placed on them by the loader, which is a function of the frequency that files land in S3. If you have extremely peaky workloads where thousands of files arrive at the same time, but infrequently, then you may see Provisioned Throughput based Throttling of the function, and we would advise you move to Provisioned IO to meet your Peak Read/Write capacity requirements.
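
If you do move to provisioned capacity, one way to do so is via the AWS CLI, as sketched below; the capacity values are placeholders that you should size for your own peak file arrival rate, and the same change can be applied to the other two tables as needed:

aws dynamodb update-table \
    --table-name LambdaRedshiftBatches \
    --billing-mode PROVISIONED \
    --provisioned-throughput ReadCapacityUnits=50,WriteCapacityUnits=50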

The S3 Prefix

When you enter the configuration, you must provide an S3 Prefix. This is used by the function to resolve which configuration to use for an incoming event. Two dimensions of dynamic resolution are provided to help you respond to events where the path varies over time or from provider to provider:

Hive Partitioning Style Wildcards

You may have implemented a great practice of segregating S3 objects using time oriented buckets. Data for January 2016 sits in a prefix mybucket/data/<type>/2016/01 while data for February is in mybucket/data/<type>/2016/02. Rather than having to create one configuration per year and month, you can instead use Hive Partitioning style prefixes. If you place S3 objects into a prefix mybucket/data/<type>/yyyy=2016/dd=01, you can then create a configuration with an S3 prefix mybucket/data/<type>/yyyy=*/dd=*. The incoming event will be pre-processed, and files which use this convention will always match the wildcard configuration.

Prefix Matching

In some cases, you may want to have a configuration for most parts of a prefix, but a special configuration for just a subset of the data within a prefix. In addition to Hive partitioning style wildcards, you can also create configuration hierarchies. In the above example, if you wanted to process data from 2016 with a single configuration, but had special rules for February only, you could create 2 configurations:

  • mybucket/data/<type>/2016 This will match anything that is submitted for 2016, regardless of other information provided.
  • mybucket/data/<type>/2016/02 This will only match input events that were submitted for February 2016, and give you the ability to provide a new configuration item. Some examples of matching are included below:
Input Prefix -> Matched Configuration
mybucket/data/uploads/2016/02/ELB -> mybucket/data/uploads/2016/02
mybucket/data/uploads/2016/01/ELB, mybucket/data/uploads/2016/03/ELB, mybucket/data/uploads/2016/04/ELB, mybucket/data/uploads/2016/.../ELB, mybucket/data/uploads/2016/12/ELB -> mybucket/data/uploads/2016
vendor-uploads/inbound/unregistered-new-data-feed/csv/upload.zip -> vendor-uploads/inbound/
vendor-uploads/inbound/vendor1/csv/upload.zip -> vendor-uploads/inbound/vendor1
vendor-uploads/inbound/vendor2/csv/upload.zip -> vendor-uploads/inbound/vendor2

Security

When the Redshift COPY command is created, by default the Lambda function will use a temporary STS token as credentials for Redshift to use when accessing S3. You can also optionally configure an Access Key and Secret Key which will be used instead, and the setup utility will encrypt the secret key.

Redshift supports two options for connecting to the cluster: IAM Role based authentication, and username/password based authentication. While we highly recommend using IAM Role based authentication, it is not available with this utility as it requires the use of the Redshift JDBC Driver, which we don't yet support in this module.

The database password, as well as the master symmetric key used for encryption operations, must be encrypted by the Amazon Key Management Service before running the setup utility. To perform this encryption, you will use the encryptValue.js script supplied with this project. The encryption is done with a KMS Customer Master Key with an alias named alias/LambdaRedshiftLoaderKey. If you supply an unencrypted password to the setup scripts, they will store the value in DynamoDB (bad), but it will not be usable by the system, which will throw a decryption error on execution.

If you would like support for storage of passwords in the SSM Parameter Store, please +1 this issue.

Loading multiple Redshift clusters concurrently

Version 2.0.0 and higher adds the ability to load multiple clusters at the same time. To configure an additional cluster, you must first have deployed version AWSLambdaRedshiftLoader-2.1.0.zip or higher and had your configuration upgraded to the 2.x format (you will see a new loadClusters List type in your configuration). You can then use the addAdditionalClusterEndpoint.js script to add new clusters into a single configuration. This will require you to enter the vital details for the cluster, including endpoint address and port, DB name and password.

You are now ready to go. Simply place files that meet the configured format into S3 at the location that you configured as the input location, and watch as AWS Lambda loads them into your Amazon Redshift cluster. You are charged by the number of input files that are processed, plus a small charge for DynamoDB. You now have a highly available load framework which doesn't require you to manage servers!

Support for Notifications & Complex Workflows

There are two options for creating complex ELT type data flows with the Lambda Loader. The first is to use pre and post-sql commands to prepare, load, and then react to new data being loaded into a table. This allows you to set the presql and postsql portions of the configuration to add SQL statements to manipulate the database as needed, and these statements are run transactionally with the COPY command.

In addition to pre/post-sql, this function can send notifications on completion of batch processing. Using SNS, you can then receive notifications through email and HTTP push to an application, or put them into a queue for later processing. You can even invoke additional Lambda functions to complete your data load workflow, using an SNS event source for another AWS Lambda function. If you would like to receive SNS notifications for succeeded loads, failed loads, or both, create SNS Topics and take note of their Amazon Resource Names (ARNs) for later use in the configuration setup. An example failure notification message:

{
  "batchId": "2790a034-4954-47a9-8c53-624575afd83d",
  "error": "{\"localhost\":{\"status\":-1,\"error\":{\"code\":\"ECONNREFUSED\",\"errno\":\"ECONNREFUSED\",\"syscall\":\"connect\",\"address\":\"127.0.0.1\",\"port\":5439}}}",
  "failedManifest": "meyersi-ire/redshift/manifest/failed/manifest-2018-04-26 10:34:02-5230",
  "key": "input/redshift-input-0.csv",
  "originalManifest": "meyersi-ire/redshift/manifest/manifest-2018-04-26 10:34:02-5230",
  "s3Prefix": "lambda-redshift-loader/input",
  "status": "error"
}

The loader can suppress a failure status of the Lambda function if you have configured a failure notification SNS topic. This should only be used if you know that you've created a downstream workflow which deals with these notifications completely, and you do not want Lambda-level failures to be exposed. Please note that this error suppression is only available for batch-level load failures, and not for other types of function failure (for example, if it's unable to complete a status update or send notifications at all). This suppression is only available if you've configured a failureTopicARN in your S3 prefix configuration.

To suppress Lambda level failures, set environment variable SuppressFailureStatusOnSuccessfulNotification = 'true' in your Lambda configuration.

Operations Guide

Viewing Previous Batches & Status

If you ever need to see what happened to batch loads into your Cluster, you can use the 'queryBatches.js' script to look into the LambdaRedshiftBatches DynamoDB table. It can be called by:

node queryBatches.js --region <region> --batchStatus <batchStatus> --startDate <beginning of date range> --endDate <end of date range>
  • region - the region in which the AWS Lambda function is deployed
  • batchStatus - the batch status you are querying for, including 'error', 'complete', 'pending', or 'locked'
  • startDate - optional date argument to use as a start date for querying batches
  • endDate - optional date argument to use as an end date for the query window

Running node queryBatches.js --region eu-west-1 --batchStatus error might return a list of all batches with a status of 'error' in the EU (Ireland) region, such as:

[
    {
        "s3Prefix": "lambda-redshift-loader-test/input",
        "batchId": "2588cc35-b52f-4408-af89-19e53f4acc11",
        "lastUpdateDate": "2015-02-26-16:50:18"
    },
    {
        "s3Prefix": "lambda-redshift-loader-test/input",
        "batchId": "2940888d-146c-47ff-809c-f5fa5d093814",
        "lastUpdateDate": "2015-02-26-16:50:18"
    }
]

If you require more detail on a specific batch, you can use describeBatch.js to show all detail for a batch. It takes 3 arguments as well:

  • region - the region in which the AWS Lambda function is deployed
  • batchId - the batch you would like to see the detail for
  • s3Prefix - the S3 Prefix the batch was created for
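
For example (the flag names below mirror those used by queryBatches.js and are an assumption - check the script's usage output if they differ):

node describeBatch.js --region eu-west-1 --batchId 7325a064-f67e-416a-acca-17965bea9807 --s3Prefix input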

This would return the batch information as it is stored in DynamoDB:

{
    "batchId": {
        "S": "7325a064-f67e-416a-acca-17965bea9807"
    },
    "manifestFile": {
        "S": "my-bucket/manifest/manifest-2015-02-06-16:20:20-2081"
    },
    "s3Prefix": {
        "S": "input"
    },
    "entries": {
        "L": [
            {"file":"input/sample-redshift-file-for-lambda-loader.csv", "size": N},
            "input/sample-redshift-file-for-lambda-loader1.csv",
            "input/sample-redshift-file-for-lambda-loader2.csv",
            "input/sample-redshift-file-for-lambda-loader3.csv",
            "input/sample-redshift-file-for-lambda-loader4.csv",
            "input/sample-redshift-file-for-lambda-loader5.csv"
        ]
    },
    "lastUpdate": {
        "N": "1423239626.707"
    },
    "status": {
        "S": "complete"
    }
}

Working with Processed Files

We'll only load a file once by default, but in certain rare cases you might want to re-process a file, such as if a batch goes into error state for some reason. If so, use the processedFiles.js script to query or delete processed files entries. The script takes an operation type and filename as arguments:

  • Use --query to query if a file has been processed at all, and if so by which batch.
  • Use --delete to delete a given file entry.
  • Use --reprocess for files that couldn't be added to a batch for some reason, which might include DynamoDB throttling. This will perform an in-place file copy on S3, which will then be received by the Lambda loader and it will attempt to reprocess the file. If it's already part of a batch, then it will be ignored. Please note this option is only supported by version 2.5.5 and higher.
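
Example invocations are shown below; the flag layout is an assumption based on the other scripts in this repository, so check the script's usage output for the exact syntax:

node processedFiles.js --region eu-west-1 --query input/sample-redshift-file-for-lambda-loader.csv
node processedFiles.js --region eu-west-1 --delete input/sample-redshift-file-for-lambda-loader.csv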

An example of the processed files store can be seen below:

Processed Files Table

Reprocessing a Batch

If you ever need to reprocess a batch - for example if it failed to load the required files for some reason - then you can use the reprocessBatch.js script. This takes the same arguments as describeBatch.js (region, batch ID & input location). The original input batch is not affected; instead, each of the input files that were part of the batch is removed from the LambdaRedshiftProcessedFiles table, and then the script forces an S3 event to be generated for the file. This will be captured and reprocessed by the function as it was originally. Please note that you can only reprocess batches that are not in 'open' status. Please also note that because this function reads and then re-writes object metadata, it can potentially overwrite metadata added by a different process. If your application frequently re-writes S3 metadata, use this feature with extreme caution.
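
For example, using the same hedged flag names as the describeBatch.js example above:

node reprocessBatch.js --region eu-west-1 --batchId 2588cc35-b52f-4408-af89-19e53f4acc11 --s3Prefix lambda-redshift-loader-test/input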

Unlocking a Batch

It is possible, but rare, that a batch could become locked without actually being processed by AWS Lambda. If this happens, use unlockBatch.js, including the region and batch ID, to set the batch back to 'open' state.

Deleting Old Batches

As the system runs for some time, you may find that your LambdaRedshiftBatches table grows to be very large. In this case, you may want to archive old Completed batches that you no longer require.

USE THIS FEATURE WITH CAUTION!!! IT WILL DELETE DATA!!!

If you would like to clear out old batch entries, then you can use the deleteBatches.js script. It will allow you to query for batches that are 'complete' and then clear them out of the system. It does not currently support deleting other types of batches (error, locked, pending), as these should be reprocessed or would make no sense to delete. To run the script, execute:

deleteBatches.js --region <region> --batchStatus <batchStatus> --startDate <beginning of date range> --endDate <end of date range>

This script produces console output (and can also be used programmatically). For example, the following invocation runs as a dry run and shows:

node deleteBatches.js --region eu-west-1 --batchStatus error
Dry run only - no batches will be modified
Resolved 1 Batches for Deletion
OK: Deletion of 0 Batches
Deleted Batch Information:
{ s3Prefix: 'lambda-redshift-loader-test/input',
  batchId: '43643fda-f829-4f60-820a-2ce331e62b18',
  status: 'complete',
  lastUpdateDate: '2016-03-10-10:33:12',
  lastUpdate: '1457605992.447' }

This allows you to test your batch deletion and understand the impact of performing such a change. When you are completely happy to delete batches as outlined in the dry run, then add --dryRun false to the command line, or supply false for the dryRun parameter when calling programmatically. This will ACTUALLY REALLY DELETE BATCH INFORMATION. To mitigate the risk of accidental data loss, the return of this function is an array of all the batch information that was deleted, so that you can save logfiles for future recovery if needed. For example:

node deleteBatches.js --region eu-west-1 --batchStatus error --endDate 1457434179 --dryRun false
Deleting 1 Batches in status error
OK: Deletion of 1 Batches
Deleted Batch Information:
{
  "batchId": {
    "S": "fe5876bc-9eeb-494c-a66d-ada4698f4405"
  },
  "clusterLoadStatus": {
    "S": {
      "db1.cluster.eu-west-1.redshift.amazonaws.com": {
        "error": {
          "code": "ETIMEDOUT",
          "errno": "ETIMEDOUT",
          "syscall": "connect"
        },
        "status": -1
      },
      "db2.cluster.amazonaws.com": {
        "error": {
          "code": "ENOTFOUND",
          "errno": "ENOTFOUND",
          "syscall": "getaddrinfo"
        },
        "status": -1
      }
    }
  },
  "entries": {
    "SS": [
      "lambda-redshift-loader-test/input/redshift-input-0.csv",
      "lambda-redshift-loader-test/input/redshift-input-2.csv"
    ]
  },
  "errorMessage": {
    "S": {
      "db1.cluster.eu-west-1.redshift.amazonaws.com": {
        "error": {
          "code": "ETIMEDOUT",
          "errno": "ETIMEDOUT",
          "syscall": "connect"
        },
        "status": -1
      },
      "db2.cluster.eu-west-1.redshift.amazonaws.com": {
        "error": {
          "code": "ENOTFOUND",
          "errno": "ENOTFOUND",
          "syscall": "getaddrinfo"
        },
        "status": -1
      }
    }
  },
  "lastUpdate": {
    "N": "1457434178.86"
  },
  "lastUpdateDate": "2016-03-08-10:49:38",
  "manifestFile": {
    "S": "my-bucket/lambda/redshift/failed/manifest-2016-03-08-10:47:30-1368"
  },
  "s3Prefix": {
    "S": "lambda-redshift-loader-test/input"
  },
  "status": {
    "S": "error"
  },
  "writeDates": {
    "NS": [
      "1457434049.802",
      "1457433786.56"
    ]
  }
}

As you can see, the entire contents of the batch are returned to you, so that you can ensure there is no possibility of data loss. The most important elements of this returned data structure are likely $entries.SS and $manifestFile.S, which would allow you to re-inject files into the loader if needed.

Reviewing Logs

For normal operation, you won't have to do anything from an administration perspective. Files placed into the configured S3 locations will be loaded when the number of new files equals the configured batch size. You may want to create an operational process to deal with failure notifications, but you can also just view the performance of your loader by looking at Amazon CloudWatch. Open the CloudWatch console, and then click 'Logs' in the lefthand navigation pane. You can then select the log group for your function, with a name such as /aws/lambda/<My Function>.

Each of these Log Streams was created by an AWS Lambda function invocation, and will be rotated periodically. You can see the last ingestion time, which is when AWS Lambda last pushed events into CloudWatch Logging.

You can then review each log stream, and see events where your function simply buffered a file, or where it performed a load.

Changing your stored Database Password or S3 Secret Key Information

Currently you must edit the configuration manually in Dynamo DB to make changes. If you need to update your Redshift DB Password, or your Secret Key for allowing Redshift to access S3, then you can use the encryptValue.js script to encrypt a value using the Lambda Redshift Loader master key and encryption context.

To run:

node encryptValue.js --region <region> --input <Value to Encrypt>

This script encrypts the value with Amazon KMS, and then verifies the encryption is correct before returning a JSON object which includes the input value and the encrypted Ciphertext. You can use the 'encryptedCiphertext' attribute of this object to update the Dynamo DB Configuration.
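
As a hedged sketch of applying that update, the AWS CLI could be used as follows; this assumes the configuration table is keyed on s3Prefix and that the cluster you are changing is the first entry in loadClusters, so verify both against your own item before running it:

aws dynamodb update-item \
    --region eu-west-1 \
    --table-name LambdaRedshiftBatchLoadConfig \
    --key '{"s3Prefix": {"S": "lambda-redshift-loader-test/input"}}' \
    --update-expression "SET loadClusters[0].connectPassword = :p" \
    --expression-attribute-values '{":p": {"S": "<encryptedCiphertext from encryptValue.js>"}}'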

Ensuring Loads happen every N minutes

If you have a prefix that doesn't receive files very often, and want to ensure that files are loaded every N seconds, use the following process to force periodic loads.

When you create the configuration, specify a batchTimeoutSecs and add a filenameFilterRegex such as '.*.csv' (which only loads CSV files that are put into the specified S3 prefix). Then every N seconds, schedule one of the included trigger file generators to run:

Using Scheduled Lambda Functions

You can use an included Lambda function to generate trigger files into all configured prefixes that have a regular expression filter, by completing the following:

  • Create a new AWS Lambda Function, and deploy the same zip file from the dist folder as you did for the AWS Lambda Redshift Loader. However, when you configure the Handler name, use createS3TriggerFile.handler, and configure it with the timeout and RAM required.
  • In the AWS Web Console, select Services/CloudWatch, and in the left hand navigation select 'Events/Rules'
  • Choose Event Source = 'Schedule' and specify the interval for your trigger files to be generated
  • Add Target to be the Lambda function you previously configured

Once done, you will see CloudWatch Logs being created on the configured schedule, and trigger files arriving in the specified prefixes

Through a CRON Job

You can use a Python based script to generate trigger files to specific input buckets and prefixes, using the following utility:

./path/to/function/dir/generate-trigger-file.py <region> <input bucket> <input prefix> <local working directory>

  • region - the region in which the input bucket for loads resides
  • input bucket - the bucket which is configured as an input location
  • input prefix - the prefix which is configured as an input location
  • local working directory - the location where the stub dummy file will be kept prior to upload into S3
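
For example, a crontab entry that generates a trigger file every five minutes might look like the following; the path, region, bucket, prefix and working directory are all placeholders:

*/5 * * * * /path/to/aws-lambda-redshift-loader/generate-trigger-file.py eu-west-1 my-input-bucket input /tmp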

These methods write a file called 'lambda-redshift-trigger-file.dummy' to the configured input prefix, which causes your deployed function to scan the open pending batch and load the contents if the timeout seconds limit has been reached. The batch timeout is calculated on the basis of when the first file was added to the batch.

Extending and Building New Features

We're excited to offer this AWS Lambda function under the Amazon Software License. The GitHub repository does not include all the dependencies for Node.js, so in order to build and run locally please install the project's dependencies with npm install.

Configuration Reference

The following section provides guidance on the configuration options supported. For items such as the batch size, please keep in mind that in Preview the Lambda function timeout is 60 seconds. This means that your COPY command must complete in less than ~ 50 seconds so that the Lambda function has time to complete writing batch metadata. The COPY time will be a function of file size, the number of files to be loaded, the size of the cluster, and how many other processes might be consuming WorkLoadManagement queue slots.

Item Required Notes
Enter the Region for the Redshift Load Configuration Y Any AWS Region from http://docs.aws.amazon.com/general/latest/gr/rande.html, using the short name (for example us-east-1 for US East 1)
Enter the S3 Bucket & Prefix to watch for files Y An S3 Path in format <bucket name>/<prefix>. Prefix is optional
Enter a Filename Filter Regex N A Regular Expression used to filter files which appeared in the input prefix before they are processed.
Enter the Cluster Endpoint Y The Amazon Redshift Endpoint Address for the Cluster to be loaded.
Enter the Cluster Port Y The port on which you have configured your Amazon Redshift Cluster to run.
Enter the Database Name Y The database name in which the target table resides.
Enter the Database Username Y The username which should be used to connect to perform the COPY. Please note that only table owners can perform COPY, so this should be the owner of the table to be loaded.
Enter the Database Password Y The password for the database user. Will be encrypted before storage in Dynamo DB.
Enter the Table to be Loaded Y The Table Name to be loaded with the input data.
Enter the comma-delimited column list N If you want to control the order of columns that are found in a CSV file, then list the columns here. Please see Column List Syntax for more information
Should the Table be Truncated before Load? (Y/N) N Option to truncate the table prior to loading. Use this option if you will subsequently process the input batch and only want to see 'new' data with this ELT process.
Ignore Header (first line) of the CSV file? (Y/N) N Option to ignore the first line of the CSV (Header)
Enter the Data Format (CSV, JSON or AVRO) Y Whether the data format is Character Separated Values, AVRO or JSON data (http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html).
If CSV, Enter the CSV Delimiter Yes if Data Format = CSV Single character delimiter value, such as ',' (comma) or '|' (pipe).
If JSON, Enter the JSON Paths File Location on S3 (or NULL for Auto) Yes if Data Format = JSON Location of the JSON paths file to use to map the file attributes to the database table. If not filled, the COPY command uses option 'json = auto' and the file attributes must have the same name as the column names in the target table.
Enter the S3 Bucket for Redshift COPY Manifests Y The S3 Bucket in which to store the manifest files used to perform the COPY. Should not be the input location for the load.
Enter the Prefix for Redshift COPY Manifests Y The prefix for COPY manifests.
Enter the Prefix to use for Failed Load Manifest Storage N On failure of a COPY, you can elect to have the manifest file copied to an alternative location. Enter that prefix, which will be in the same bucket as the rest of your COPY manifests.
Enter the Access Key used by Redshift to get data from S3. If NULL then Lambda execution role credentials will be used. N Amazon Redshift must provide credentials to S3 to be allowed to read data. Enter the Access Key for the Account or IAM user that Amazon Redshift should use.
Enter the Secret Key used by Redshift to get data from S3. If NULL then Lambda execution role credentials will be used. N The Secret Key for the Access Key used to get data from S3. Will be encrypted prior to storage in DynamoDB.
Enter the SNS Topic ARN for Successful Loads N If you want notifications to be sent to an SNS Topic on successful Load, enter the ARN here. This would be in format arn:aws:sns:<region>:<account number>:<topic name>.
Enter the SNS Topic ARN for Failed Loads N SNS Topic ARN for notifications when a batch COPY fails.
How many files should be buffered before loading? Y Enter the number of files placed into the input location before a COPY of the current open batch should be performed. Recommended to be an even multiple of the number of CPUs in your cluster. You should set the multiple such that this count causes loads to be performed every 2-5 minutes.
How old should we allow a Batch to be before loading (seconds)? N AWS Lambda will attempt to sweep out 'old' batches using this value as the number of seconds old a batch can be before loading. This 'sweep' is on every S3 event on the input location, regardless of whether it matches the Filename Filter Regex. Not recommended to be below 120.
Additional Copy Options to be added N Enter any additional COPY options that you would like to use, as outlined at (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). Please also see http://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift for information on good practices for COPY options in high frequency load environments.

An example configuration looks like this:

{
  "batchSize": 100,
  "batchTimeoutSecs": 100,
  "copyOptions": "FILLRECORD EMPTYASNULL BLANKSASNULL ",
  "csvDelimiter": "|",
  "currentBatch": "1fafd16b-6e44-458f-8222-db8f74f9aac4",
  "dataFormat": "CSV",
  "failedManifestKey": "redshift/manifest/failed",
  "failureTopicARN": "arn:aws:sns:eu-west-1:123456789:LambdaRedshiftLoadErrors",
  "lastBatchRotation": "2018-11-15 18:22:43",
  "lastUpdate": 1489594870.252,
  "loadClusters": [
    {
      "clusterDB": "master",
      "clusterEndpoint": "mycluster.xxxxxx-west-1.redshift.amazonaws.com",
      "clusterPort": 5439,
      "connectPassword": "AQECAHh+YtzV/K7+L/VDT7h2rYDCWFSUugXGqMxzWGXynPCHpQAAAGkwZwYJKoZIhvcNAQcGoFowWAIBADBTBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDGfZWdg/pXqRzMPlQAIBEIAmF4Xe+Hcy53+LM2/OGu04RySGIZ4pny12Krks/EblJhjlIVv3JIM=",
      "connectUser": "master",
      "targetTable": "test_table",
      "truncateTarget": false,
      "useSSL": true
    }
  ],
  "manifestBucket": "mybucket",
  "manifestKey": "redshift/manifest",
  "s3Prefix": "lambda-redshift-loader-test/input",
  "status": "open",
  "version": "2.6.2"
}

Release Notes

Version 2.7.9: Fixed a bug in how the content length was specified in the manifest file. Now, file size is captured on each entry into the pending batch, and this information is used to generate the correct content length.


Copyright 2014-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Amazon Software License (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at

http://aws.amazon.com/asl/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and limitations under the License.

aws-lambda-redshift-loader's People

Contributors

ap-hyperbole, aseroj, chrisjameswilliams, crayolakat, demee, dependabot[bot], eriklovdahl, garithd, gauravshah, gmanolache, hyandell, ianmeyers, if-gh, jagatfx, jjmartin, joemurray, kirillseva, ohgyun, orthographic-pedant, rio517, rncain, ryanaustincarlson, smoll, surya-shodan, waltmaguire


aws-lambda-redshift-loader's Issues

Large files ignored

Smaller files (~5 MB) are recognised, but as soon as I upload a 50 MB file, the system turns silent. No batch is created, no log stream. Is it because the lambda fails? Too little memory allocated? Why isn't a log stream created anyway?

SetTimeout issue in index.js

Line 1170: setTimeout(callback(), timeout);

Should just be setTimeout(callback, timeout);
Unless the anticipated result of the callback is a function.
Saw this while a coworker was working with this library so I'm not 100% on the context of the code.

The loader does not respond to ObjectCreated:CompleteMultipartUpload

When we upload a huge file to S3 in aws javascript sdk, e.g.
aws s3 cp 20mb_file s3://mybucket/metrics

The event name is ObjectCreated:CompleteMultipartUpload instead of ObjectCreated:Put.

Therefore the Lambda function will generate the error Invalid Event Name ObjectCreated:CompleteMultipartUpload and halt.

How can this problem be fixed? Can the loader support multipart uploads?

Unable to load configuration using date subdirs

I have this structure on S3: mybucket/logs/access/cdn/20150214/cdn_20150214_0001.log.gz. I have tried various prefixes to account for the extra date directory, but without success. I now have this s3Prefix in the LambdaRedshiftBatchLoadConfig table:

mybucket/logs/access/cdn/*

I copy up a file:

aws s3 cp cdn_20150214_0001.log.gz s3://mybucket/logs/access/cdn/20150214/

I get this error in the CloudWatch Logs stream:

unable to load configuration for mybucket/logs/access/cdn/20150214

The connection to Redshift will fail if the password contains special characters

Hello,

I just noticed that the loader has been updated to version 2.0 and it is using another DB client.
I tried to use the loader in my Lambda function, but it resulted in an error message when connecting to Redshift:
Error: getaddrinfo ENOTFOUND

Upon some inspection, I found the reason is that my DB password contains some special characters.
And I found that if I encode them properly, I can connect to Redshift without error.

e.g.

postgres://hello:#[email protected]:5439/db will fail, while
postgres://hello:%[email protected]:5439/db will succeed.

Perhaps an encodeURIComponent call would solve the problem?

Thanks,
Gene Ng

Leave unprocessed on Error?

We have some scripts to turn off our dev Redshift server at night/weekends, but the files that we are importing come in at all hours, so occasionally we will get a file that comes in and doesn't get imported / has an error. But it's added to the processed list, so if it comes in again during working hours, that file doesn't get imported because it's on the already-processed list. Is there some option that could be added that would not mark a file as processed if there was an error importing it? (Or better, somehow set up a delay to try again...)

Difference with Postgres driver vs. Redshift driver

I'm getting errors when loading my csv files from s3 into redshift with respect to the timestamp through this lambda function. Here is the error from CloudWatch:
screen shot 2015-04-21 at 2 54 29 pm

And in the stl_load_errors table:
screen shot 2015-04-21 at 2 54 45 pm

I can load these files without a problem running a simple COPY command from SqlWorkbench. The only difference I see is that this lambda function is using the Postgres driver (which Redshift no longer recommends) and my Workbench is connected with the Redshift driver. I don't really care, I just need my data loaded. Does Postgres not like the quotes around the date?

thanks,
Andy

No configuration found for <bucket-name>

Hi,

I have followed the steps outlined in your README document. I did a git clone and added AWSLambdaRedshiftLoader-2.0.0.zip to my Lambda function. Then I added aws-sdk as a node module. I have my AWS credentials for accessing the S3 bucket in the ~/.aws/credentials file, which I am assuming will be picked up automatically. Then I ran node setup; the transcript is given below.

After running the setup script, whenever I try uploading a test file to S3 I am getting the following log in CloudWatch. From the code, I can tell the configuration for the load in DynamoDB is not found whenever the Lambda function is triggered. Please help.

Cloudwatch Log:

START RequestId: 48273add-18ac-11e5-91f3-4f69165d9383 
2015-06-22T06:59:58.284Z    48273add-18ac-11e5-91f3-4f69165d9383    No Configuration Found for test-bucket
END RequestId: 48273add-18ac-11e5-91f3-4f69165d9383 
REPORT RequestId: 48273add-18ac-11e5-91f3-4f69165d9383  Duration: 919.97 ms Billed Duration: 1000 ms Memory Size: 128 MB    Max Memory Used: 18 MB  

Setup Script:-

Enter the Region for the Configuration > us-east-1
Enter the S3 Bucket & Prefix to watch for files > test-bucket
Enter a Filename Filter Regex > 
Enter the Cluster Endpoint > red****.*********.us-east-1.redshift.amazonaws.com
Enter the Cluster Port > 5439
Enter the Database Name > dev
Enter the Table to be Loaded > comma
Should the Table be Truncated before Load? (Y/N) > N
Enter the Database Username > ****
Enter the Database Password > ****
Error during Master Key creation
Error during resolution of Customer Master Key
Enter the Data Format (CSV or JSON) > CSV
Enter the CSV Delimiter > ,
Enter the S3 Bucket for Redshift COPY Manifests > testSymm
Enter the Prefix for Redshift COPY Manifests > /success/
Enter the Prefix to use for Failed Load Manifest Storage > /failure/
Enter the Access Key used by Redshift to get data from S3 > *******
Enter the Secret Key used by Redshift to get data from S3 > ********
Error during Master Key creation
Error during resolution of Customer Master Key
Enter the SNS Topic ARN for Successful Loads > 
Enter the SNS Topic ARN for Failed Loads > 
How many files should be buffered before loading? > 1
How old should we allow a Batch to be before loading (seconds)? > 1
Additional Copy Options to be added > 
Creating Tables in Dynamo DB if Required
{"TableName":"LambdaRedshiftBatchLoadConfig","Item":{"truncateTarget":{"BOOL":false},"currentBatch":{"S":"d17f10c7-a426-4e43-b8c5-c54a391cb6a6"},"version":{"S":"2.0.0"},"loadClusters":{"L":[{"M":{"clusterEndpoint":{"S":"redshift**8.redshift.amazonaws.com"},"clusterPort":{"N":"5439"},"clusterDB":{"S":"dev"},"targetTable":{"S":"comma"},"truncateTarget":{"BOOL":false},"connectUser":{"S":"test"},"connectPassword":{}}}]},"s3Prefix":{"S":"dataexplorer-secure"},"dataFormat":{"S":"CSV"},"csvDelimiter":{"S":","},"manifestBucket":{"S":"testSymm"},"manifestKey":{"S":"/success/"},"failedManifestKey":{"S":"/failure/"},"accessKeyForS3":{"S":"****"},"secretKeyForS3":{},"batchSize":{"N":"2"},"batchTimeoutSecs":{"N":"3"}}}
{"message":"Supplied AttributeValue is empty, must contain exactly one of the supported datatypes","code":"ValidationException","time":"2015-06-16T11:31:58.388Z","statusCode":400,"retryable":false,"retryDelay":0}```

Setup reads previous values and makes them default

This is more of a suggestion than an issue. It would be nice for upgrades or just general changes if it would try to load any previously existing setup info and set that as the default for each of the questions.

The loader does not respond to wildcard paths

Hello,
With reference to #5, I have tried the loader v1.1.2 for the wildcard paths function with the following settings:
Enter the S3 Bucket & Prefix to watch for files > mybucket/click_events/y=*/m=*

Then I have uploaded a file to the path s3://mybucket/click_events/y=2015/m=04/16.json using AWS nodejs SDK.
aws s3 cp ~/local/16.json s3://mybucket/click_events/y=2015/m=04/16.json

Unfortunately I have found this error message in the CloudWatch log:
unable to load configuration for mybucket/click_events/y%3D2015/m%3D04

Looks like it has an encoding problem?

Configurations overlap

I wanted to upgrade this lambda function to the new 2.0 one. So I deleted the current function and re-uploaded the new zip file. When I went to add my event source I got this error message: "Configurations overlap. Configurations on the same bucket cannot share a common event type." I only have this one lambda function and nothing else is receiving event updates from that bucket.

configureSample.sh fails with Cannot read property 'describeKey' of undefined

We're attempting to run the sample loader.

We have a bucket in S3, DynamoDB has the three tables created with setup.js
and we have a Redshift DB.

Aware that they have to be in the same region.

When executing the script "configureSample.sh", we get the following error:
Note: we had to use a tunnel to our cluster endpoint.

/Users/xxxx/xxxx/aws-lambda-redshift-loader/kmsCrypto.js:55
        kms.describeKey({
           ^
TypeError: Cannot read property 'describeKey' of undefined
    at getOrCreateMasterKey (/Users/xxxx/xxxx/aws-lambda-redshift-loader/kmsCrypto.js:55:5)
    at Object.encrypt (/Users/xxxx/xxxx/aws-lambda-redshift-loader/kmsCrypto.js:116:2)
    at Object.<anonymous> (/Users/xxxx/xxxxx/aws-lambda-redshift-loader/sample/scripts/createSampleConfig.js:51:11)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:501:10)
    at startup (node.js:129:16)
    at node.js:814:3

https://github.com/awslabs/aws-lambda-redshift-loader/blob/master/kmsCrypto.js#L44

Appears to not be working and the function is not called, hence kms is not initialised.

We hacked the code to explicitly initialise kms, but that only got us so far:

Configuration for dp-transform-test-01/input successfully written in us-east-1
/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:32
          throw err;
                ^
TypeError: undefined is not a function
    at Response.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/common.js:228:6)
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:350:18)
    at Request.callListeners (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:100:18)
    at Request.emit (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
    at Request.emit (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:592:14)
    at Request.transition (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:21:12)
    at AcceptorStateMachine.runTo (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:22:9)
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:594:12)

Any help or advice would be greatly appreciated.

Thank You

Manifest Creation error

Not sure why this error is raised; I haven't figured it out from looking at the logs:

2015-04-13T15:42:38.189Z b317c750-e1f3-11e4-ac7c-37fd0efd1aa2 Writing manifest to analitycs-events-store-manifests/load-indentify-manifest/manifest-2015-04-13-15:42:38-2323
2015-04-13T15:42:38.512Z b317c750-e1f3-11e4-ac7c-37fd0efd1aa2 Error on Manifest Creation
2015-04-13T15:42:38.569Z b317c750-e1f3-11e4-ac7c-37fd0efd1aa2 { [AccessDenied: Access Denied]
message: 'Access Denied',
code: 'AccessDenied',
time: Mon Apr 13 2015 15:42:38 GMT+0000 (UTC),
statusCode: 403,
retryable: false,
retryDelay: 30 }
2015-04-13T15:42:38.571Z b317c750-e1f3-11e4-ac7c-37fd0efd1aa2 Error Message: AccessDenied: Access Denied
2015-04-13T15:42:38.571Z b317c750-e1f3-11e4-ac7c-37fd0efd1aa2 error: {"errorMessage":"error"}

Here is config related to the manifest:
failedManifestKeyString: failed-indentify-manifest
manifestBucketString: analitycs-events-store-manifests
manifestKeyString: load-indentify-manifest

The user for S3 also has access to this bucket.

Unable to decrypt configuration items

I get an error after the manifest has been written, during decryption:

Error during Decryption
Unable to decrypt configuration items due to
{ [InvalidCiphertextException: null]
  message: null, 
  code: 'InvalidCiphertextException', 
  time: Wed Apr 15 2015 12:40:32 GMT+0000 (UTC), 
  statusCode: 400, 
  retryable: false, 
  retryDelay: 30 }
Error Message: InvalidCiphertextException: null
error:
{ "errorMessage": "error" }

The configuration was created using the setup.js script, although some fields, eg s3Prefix, have been manually edited in DynamoDB.

The loader did not respond to the put events in sub-directories

Hello, I have been using this loader since version 1.1.

The workflow is fine, except that if I upload the file in the sub-directories of the watched bucket, the file is not loaded to Redshift, and the log will say unable to load configuration for [path]

e.g.
I set the value of "Enter the S3 Bucket & Prefix to watch for files" to be mybucket/metric
When I upload a file exactly to mybucket/metric, it will be loaded to Redshift, which is nice.
But when I upload a file to mybucket/metric/2015/03, the Lambda won't do anything.

How can I set up the configuration so that Lambda will also respond to sub-directory uploads? Thank you.

Hard coded 'csv' copy option causing failed loads

The hard coded 'csv' copy option (line 765) that was added in 2.0.7 seems to be breaking our previously functional upload process.

// add data formatting directives to copy options
if (config.dataFormat.S === 'CSV') {
    copyOptions = copyOptions + ' csv delimiter \'' + config.csvDelimiter.S + '\'\n';
}

We are using gzipped CSV files, and we have been using a selection of copy options such as REMOVEQUOTES and ESCAPE that had been working pre-2.0.7.

Now, with the hard-coded csv option, Redshift complains with errors such as "CSV is not compatible with ESCAPE", etc.

I wonder whether we had been using it wrong previously, or whether this change was not properly verified.
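If it helps the discussion, here is a minimal sketch of the kind of guard we had in mind, assuming access to the config record and the copyOptions string from the snippet above (the function wrapper and the keyword list are our own, not existing loader code):

// Hypothetical guard: skip the hard-coded 'csv' directive when the
// user-supplied copy options already contain keywords that Redshift
// rejects in combination with CSV (e.g. ESCAPE, REMOVEQUOTES, FIXEDWIDTH).
function addDataFormatDirectives(config, copyOptions) {
    var csvIncompatible = /\b(ESCAPE|REMOVEQUOTES|FIXEDWIDTH)\b/i;

    if (config.dataFormat.S === 'CSV' && !csvIncompatible.test(copyOptions)) {
        copyOptions = copyOptions + ' csv delimiter \'' + config.csvDelimiter.S + '\'\n';
    }
    return copyOptions;
}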

3-hour running query; Redshift statement timeout needed

Over the weekend our Redshift cluster was locked up with this query running for 3 hours:

COPY pending_updates_devices from
's3://_-lambda-logs/_-updated-Success/manifest-2015-08-29-09:45:21-5214' with credentials as '' manifest JSON 'auto' ;

I'm thinking the Lambda function timed out but the query kept on executing. When we noticed it, we killed it and then everything started behaving normally again.

I'm wondering if there is a way to add a statement timeout to the loader script, something like this:

set statement_timeout to 60000;

I could edit the code and do this myself, but I'm wondering if something is already in place that can take care of this.
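Something along these lines is what I had in mind, as a sketch only (it assumes the node pg client that runs the COPY is passed in as a parameter, and that a fixed timeout value is acceptable):

// Hypothetical helper: apply a session-level statement timeout so Redshift
// cancels a hung COPY even after the Lambda invocation itself has ended.
function runCopyWithTimeout(client, copyCommand, timeoutMs, callback) {
    client.query('set statement_timeout to ' + timeoutMs, function (err) {
        if (err) {
            return callback(err);
        }
        // the COPY runs on the same session, so the timeout applies to it
        client.query(copyCommand, callback);
    });
}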

Thanks
Daniel

When trying to run configureSample.sh, getting an error with the Customer Master Key

Below is the error when trying to execute the sample:

CREATE USER
create table lambda_redshift_sample(
column_a int,
column_b int,
column_c int
);
CREATE TABLE
Unknown Error during Customer Master Key describe
Error during resolution of Customer Master Key
Enter the Region for the Redshift Load Configuration >

No configuration found for <bucket-name>

I have the same issue as issue #24. I tried the fix described in that topic, to no avail.

Event Data
START RequestId: 2c09fcea-b898-11e5-979e-c78c987ef064 Version: $LATEST
2016-01-11T19:19:03.336Z 2c09fcea-b898-11e5-979e-c78c987ef064 No Configuration Found for cloudar/
END RequestId: 2c09fcea-b898-11e5-979e-c78c987ef064
REPORT RequestId: 2c09fcea-b898-11e5-979e-c78c987ef064 Duration: 1026.64 ms Billed Duration: 1100 ms Memory Size: 128 MB Max Memory Used: 17 MB

I used the sample script in the sample directory. What steps can I take to further troubleshoot this?

config.batchTimeoutSecs not set when using Sample script

When running the sample script:

› ./configureSample.sh aaaa.bbbbbbbbb.us-east-1.redshift.amazonaws.com 5439 dev myusername
Password for user myusername:
create user test_lambda_load_user password 'Change-me1!';
CREATE USER
create table lambda_redshift_sample(
column_a int,
column_b int,
column_c int
);
CREATE TABLE
Enter the Region for the Redshift Load Configuration > us-east-1
Enter the S3 Bucket to use for the Sample Input > z1-trial-data
Enter the Access Key used by Redshift to get data from S3 > MYACCESSKEY
Enter the Secret Key used by Redshift to get data from S3 > MYSECRETKEY
Creating Tables in Dynamo DB if Required
Configuration for z1-trial-data/input successfully written in us-east-1

› aws s3 sync ../data s3://z1-trial-data/input
upload: ../data/sample-redshift-file-for-lambda-loader1.csv to s3://z1-trial-data/input/sample-redshift-file-for-lambda-loader1.csv
upload: ../data/sample-redshift-file-for-lambda-loader2.csv to s3://z1-trial-data/input/sample-redshift-file-for-lambda-loader2.csv
upload: ../data/sample-redshift-file-for-lambda-loader3.csv to s3://z1-trial-data/input/sample-redshift-file-for-lambda-loader3.csv
upload: ../data/sample-redshift-file-for-lambda-loader4.csv to s3://z1-trial-data/input/sample-redshift-file-for-lambda-loader4.csv
upload: ../data/sample-redshift-file-for-lambda-loader5.csv to s3://z1-trial-data/input/sample-redshift-file-for-lambda-loader5.csv

I get the following processing error:

2015-03-07 21:00:56 UTC
2015-03-07T21:00:56.579Z    09a8edeb-c50d-11e4-bf88-3550f9dbceb4    Found Redshift Load Configuration for mybucket/input 

2015-03-07 21:00:57 UTC
2015-03-07T21:00:57.002Z    09a8edeb-c50d-11e4-bf88-3550f9dbceb4    Adding Manifest Entry for mybucket/input/sample-redshift-file-for-lambda-loader4.csv

2015-03-07 21:00:57 UTC
2015-03-07T21:00:57.742Z    09a8edeb-c50d-11e4-bf88-3550f9dbceb4    Batch Size 2 reached 

2015-03-07 21:00:57 UTC
Failure while running task: TypeError: Cannot read property 'N' of undefined
at Response.<anonymous> (/var/task/index.js:394:73)
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:350:18) 
at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:100:18) 
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:77:10) 
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:604:14) 
at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:21:12) 
at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12) 
at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10 
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:22:9) 
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:606:12) 

2015-03-07 21:00:57 UTC
Process exited before completing request TypeError: Cannot read property 'N' of undefined
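For anyone hitting the same trace, a hedged guess at the cause: the sample configuration appears to be written without a batchTimeoutSecs attribute, so reading .N from the missing item throws. A defensive read along these lines would avoid the TypeError (the default value and function name here are illustrative only, not the loader's actual code):

// Hypothetical defensive read of an optional DynamoDB number attribute:
// fall back to a default when the configuration row has no batchTimeoutSecs.
var DEFAULT_BATCH_TIMEOUT_SECS = 60; // assumed default, not the loader's

function getBatchTimeoutSecs(config) {
    if (config.batchTimeoutSecs && config.batchTimeoutSecs.N) {
        return parseInt(config.batchTimeoutSecs.N, 10);
    }
    return DEFAULT_BATCH_TIMEOUT_SECS;
}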

Setup hangs on encryption?

I got as far as the password question, and then the setup script seems to hang:

$ AWS_REGION=eu-west-1 AWS_PROFILE=myprofile node setup.js
...
Enter the Database Password > somepwd

Is this something to do with the KMS encryption?

Please add support for loading encrypted files into Redshift.

I am trying to use aws-lambda-redshift-loader to load encrypted files from S3 into Redshift using an AWS client-side master symmetric key. The error log I am getting in CloudWatch is:

2015-06-23T09:06:07.587Z    0e6fdf2b-1987-11e5-80b1-671df805b3f4    Cluster Load Failure error: syntax error at or near "'aws_access_key_id=***;aws_secret_access_key=***;master_symmetric_key=**'" on Cluster ****.****.redshift.amazonaws.com

The Redshift COPY command supports this. Please add support for loading client-side encrypted files with a master symmetric key.
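For reference, here is a hedged sketch of the kind of statement the loader would need to build for this case, following the Redshift COPY documentation for client-side encrypted files (the table, manifest URL and key values are placeholders): the credentials string carries master_symmetric_key and the statement also needs the ENCRYPTED keyword.

// Hypothetical builder for a COPY of client-side encrypted files; the
// ENCRYPTED keyword must accompany master_symmetric_key in the credentials.
function buildEncryptedCopy(table, manifestUrl, accessKey, secretKey, masterSymmetricKey) {
    return 'copy ' + table +
        ' from \'' + manifestUrl + '\'' +
        ' with credentials as \'aws_access_key_id=' + accessKey +
        ';aws_secret_access_key=' + secretKey +
        ';master_symmetric_key=' + masterSymmetricKey + '\'' +
        ' manifest encrypted;';
}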

Error with reprocessBatch

I found some locked batches and ran into these errors trying to unlock and reprocess them. What do you think the problem is?

andrewguy$ node unlockBatch.js us-east-1 e4855fe7-cd46-48fb-b689-310ecb4b90d2 ttgred2/errorlog
Batch e4855fe7-cd46-48fb-b689-310ecb4b90d2 is not currently allocated as the open batch for Load Configuration on ttgred2/errorlog. Use reprocessBatch.js to rerun the load of this Batch.

andrewguy$ node reprocessBatch.js us-east-1 e4855fe7-cd46-48fb-b689-310ecb4b90d2 ttgred2/errorlog
{ [InvalidRequest: This copy request is illegal because it is trying to copy an object to itself without changing the object's metadata, storage class, website redirect location or encryption attributes.]
  message: 'This copy request is illegal because it is trying to copy an object to itself without changing the object\'s metadata, storage class, website redirect location or encryption attributes.',
  code: 'InvalidRequest',
  time: Wed Jun 24 2015 16:46:44 GMT-0400 (EDT),
  statusCode: 400,
  retryable: false,
  retryDelay: 30 }

Better error logging in batches

This is more a feature request than an issue report.

As the operations team monitoring the Lambda-based ETL process, I'd like the failed batches that are stored in DynamoDB to include more diagnostic information.

We had some loading errors in a newly set up table, and the information available in the batches is just:

{
    "xxxx.redshift.amazonaws.com": {
        "error": {
            "code": "0A000",
            "file": "/home/awsrsqa/padb/src/pg/src/backend/commands/commands_copy.c",
            "length": 138,
            "line": "2168",
            "name": "error",
            "routine": "DoCopy",
            "severity": "ERROR"
        },
        "status": -1
    }
}

We did eventually manage to find the corresponding log stream and diagnose it further; it was caused by a wrong copy command (which I suspect is yet another issue on its own), and quite a lot of more detailed error logs were actually available.

So I wonder: would it be possible to include more detail in the failed batches, so we can easily query it with just queryBatches.js?

Add Support for Specifying Column Names to support Defaulted Columns

We want a column in a Redshift table that defaults to GETDATE() so that we know when each row was copied into Redshift, but we cannot get it to work without code changes to the Loader.

Here is the walk-through.

Table definition (the field in question is "dbImportDateTime"):

CREATE TABLE userData (
userID bigint,
userInsertDateTime timestamp ,
firstName varchar(200) ,
domain varchar(400) ,
languageFriendly varchar(200) ,
countryFriendly varchar(200) ,
SharedUser smallint,
SignupBucket varchar(200) ,
SignupSource varchar(400) ,
SignupSubSource varchar(400) ,
SignupCampaign varchar(200) ,
SignupSegment varchar(200) ,
selectedRole varchar(400),
dbImportDateTime timestamp DEFAULT GETDATE(),
PRIMARY KEY (userID)
)
DISTKEY(userID)
COMPOUND SORTKEY(userID);

Copy Command Generated by Redshift Loader Lambda function

COPY userData
from 's3://redshift-copy-manifests/user-data/manifest-2015-07-15-20:06:10-7234'
with credentials as SNIP
manifest
json 's3://json-paths/user-data.txt';

Error from Copy Command

An error occurred when executing the SQL command:
Amazon Invalid operation: Number of jsonpaths and the number of columns should match. JSONPath size: 13, Number of columns in table or column list: 14;

JSON Path File
{
"jsonpaths": [
"$.userID",
"$.userInsertDateTime",
"$.firstName",
"$.domain",
"$.languageFriendly",
"$.countryFriendly",
"$.SharedUser",
"$.SignupBucket",
"$.SignupSource",
"$.SignupSubSource",
"$.SignupCampaign",
"$.SignupSegment",
"$.selectedRole"
]
}

Summary of Error
The error message is accurate: the JSON path file has 13 fields and the table has 14, but I was hoping the 14th field would be filled in with the default of GETDATE().

Solution
We found that by adding the column names, the copy works and the last column defaults to the load date. The copy command is shown below.

COPY userData (
userID,
userInsertDateTime,
firstName,
domain,
languageFriendly,
countryFriendly,
SharedUser,
SignupBucket,
SignupSource,
SignupSubSource,
SignupCampaign,
SignupSegment,
selectedRole
)
from 's3://redshift-copy-manifests/user-data/manifest-2015-07-15-20:06:10-7234'
with credentials as SNIP
manifest
json 's3://json-paths/user-data.txt';

Code Changes Needed
We will need to make some code changes to the Lambda Loader, but I thought I would check with you before making such a change, to make sure we are going down the right path and that you will merge our change back into your codebase.

  1. Add support for a new column in the LambdaRedshiftBatchLoadConfig table for specifying comma-delimited column names. If null, the copy command will not include the column names. If specified, the column names are added to the copy command.
  2. Update setup.js to support this new column.
  3. Update index.js to include the column names in the copy command if provided in the config table (see the sketch below).

Is this the right approach? Any other things we should watch out for?
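As a rough illustration of item 3 (the attribute names columnList and targetTable are assumptions for discussion, not the loader's actual schema), the copy target could be assembled roughly like this:

// Hypothetical: prepend an optional comma-delimited column list from the
// configuration row to the COPY target, so omitted columns take their defaults.
function buildCopyTarget(config) {
    var target = config.targetTable.S;
    if (config.columnList && config.columnList.S) {
        target = target + ' (' + config.columnList.S + ')';
    }
    return 'COPY ' + target + ' ';
}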

Diagnosing errors

Can you give a basic checklist on how to troubleshoot when a batch does not load? For example, under what circumstances would a batch get locked? I have batches that error out, but the message is often cryptic, like:

"errorMessage": {
    "S": "{\"ttg-blah.blah.us-east-1.redshift.amazonaws.com\":{\"status\":-1,\"error\":{\"name\":\"error\",\"length\":179,\"severity\":\"ERROR\",\"code\":\"25P02\",\"file\":\"\/home\/awsrsqa\/padb\/src\/pg\/src\/backend\/tcop\/postgres.c\",\"line\":\"1809\",\"routine\":\"execOneParseTree\"}}}"
  },

I look in stl_load_errors, but there is nothing related to that batch in there. Is there another place to look?

Flushing Partial Batches not working for multiple tables/config scenario

We have 5 rows in the LambdaRedshiftBatchLoadConfig table that write to 5 different target tables. When we use the dummy-file approach to flush the partial batches, only 1 open batch gets flushed. We run generate-trigger-file.py 5 times and we do see a dummy file in each of the 5 S3 locations. We are not sure why the flush is not working. We have tested this several times, and each time it only flushes 1 batch and not the other 4. Thanks for your help.

Enhancement to Support Running SQL Scripts Pre and Post Load

Ian

We have a scenario where we would like to run a SQL script against Redshift after a load succeeds. Before we implement it, we would like to run it by you and get your feedback. Here is the functional overview:

  1. During setup, the user can specify a URI to a SQL script in S3 to execute before or after the data load. One or both can be specified.
  2. At runtime, the appropriate SQL scripts are executed against the Redshift DB either before or after the copy command.

Any pitfalls we should watch out for? Any feedback for improvements? Do you know if we can get error codes from the SQL execution back to the Lambda function? Will this feature be useful to other users?
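To make the proposal concrete, here is a rough sketch of how the runtime side of item 2 might look, assuming the script is plain SQL stored in S3 and that the loader's pg client is passed in (all names here are placeholders for discussion, not existing loader code):

// Hypothetical hook: fetch a SQL script from S3 and run it on the same
// connection before or after the COPY, surfacing any error to the caller.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

function runScriptFromS3(client, bucket, key, callback) {
    s3.getObject({ Bucket: bucket, Key: key }, function (err, data) {
        if (err) {
            return callback(err);
        }
        // pg reports SQL errors through the callback, so error codes from the
        // script execution would flow back to the Lambda function
        client.query(data.Body.toString('utf8'), callback);
    });
}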

Thanks

configureSample.sh fails with requests error

When running the script, it gets so far and then the following error is thrown:

/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:34
          throw err;
                ^
TypeError: undefined is not a function
    at Response.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/common.js:228:6)
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:352:18)
    at Request.callListeners (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:100:18)
    at Request.emit (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
    at Request.emit (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:594:14)
    at Request.transition (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:21:12)
    at AcceptorStateMachine.runTo (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:22:9)
    at Request.<anonymous> (/Users/xxxxx/xxxxxx/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:596:12)

The following lines are output just before the error:

Creating Tables in Dynamo DB if Required
Configuration for dp-transform-test-01/input successfully written in us-east-1

Any help or advice would be appreciated.

Thank You

P.S. Thanks for the quick response to the related ticket:
#28

Flush Required

I started getting throttled on Lambda, so I changed some settings in my app.

Now I get this alert every time it runs:

"Unable to write <filename>.json.gz in 100 attempts. Failing further processing to Batch <batchId> which may be stuck in 'locked' state. If so, unlock the back using `node unlockBatch.js <batch ID>`, delete the processed file marker with `node processedFiles.js -d <filename>`, and then re-store the file in S3"

When manually running the Lambda, I get the error message that a flush is required.

I've tried walking through the code to see what it's looking for and what the problem is, and will continue to do so, but I thought maybe someone else could give me a hand.

Ideally I would just like to know what I can change in DynamoDB to get this working again. I tried changing the currentBatch entry in DynamoDB, and also changed the states of some of the batches to try to get them working. Nothing has worked. Every new file gets the same alert (and I get a lot of them).

Any help would be amazing. Thanks

Is it possible to include source file names in Redshift tables when issuing the COPY statement?

Does the AWS Lambda Redshift loader support including source file names in an additional column when copying from S3?

A possible scenario could be:

When I put 3 files a.txt, b.txt and c.txt in the S3 location s3://test-bucket/test/, and this location is configured to automatically populate the Redshift table 'testfilenames', then the resulting table would be populated in the following manner:

field1 field2 field3 filename
xty1 abc dey a.txt
lala xana dula a.txt
falal katal natal b.txt
rara fewa begnas b.txt
tata mama gaga c.txt

Please note that an extra column is added to the table that denotes the S3 file from which each record originates.

Thanks,

S3 Secret Key

More of a question than an issue.
I wonder why there's a need to specify an access/secret key for Redshift to access S3?
Shouldn't the Lambda function be able to receive AWSCredentials from the role it's running under and, from those credentials, get an access key, secret key and any token needed?

Below is an example of how I've done it successfully with a Lambda in Java.

StringBuilder copy = new StringBuilder();
AWSCredentials creds = credentialsProvider.getCredentials();
copy.append("COPY " + table + " (" + columns + ") ");
copy.append("FROM 's3://" + srcBucket + "/" + srcKey + "' ");
copy.append("CREDENTIALS 'aws_access_key_id=" + creds.getAWSAccessKeyId());
copy.append(";aws_secret_access_key=" + creds.getAWSSecretKey());
if (creds instanceof AWSSessionCredentials) {
    // temporary role credentials also carry a session token
    copy.append(";token=");
    copy.append(((AWSSessionCredentials) creds).getSessionToken());
}
copy.append("' ");
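For comparison, a hedged Node.js take on the same idea (a sketch only, not the loader's current behaviour): temporary credentials from the execution role carry a session token, which would need to be passed as token= in the COPY credentials string.

// Hypothetical: build the COPY credentials clause from the Lambda execution
// role's temporary credentials instead of a stored access/secret key pair.
var AWS = require('aws-sdk');

function buildRoleCredentialsClause(callback) {
    AWS.config.getCredentials(function (err) {
        if (err) {
            return callback(err);
        }
        var creds = AWS.config.credentials;
        var clause = 'aws_access_key_id=' + creds.accessKeyId +
            ';aws_secret_access_key=' + creds.secretAccessKey;
        if (creds.sessionToken) {
            // temporary role credentials always carry a session token
            clause = clause + ';token=' + creds.sessionToken;
        }
        callback(null, clause);
    });
}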

CopyCommand Bug

Hey guys, thanks very much for writing this beautiful tool; it is no doubt going to help with our Redshift loading.

I noticed a minor bug on line 748 of index.js; instead of

copyOptions = copyCommand + '  ' + config.dataFormat.S + '  \'auto\'  \n';

it should be

copyOptions = copyOptions + '  \'auto\'  \n';

as currently the whole copyCommand is repeated if JSON 'auto' is used.

Also, could you please add a space before 'with' on line 842?

Thanks and all the best!

Ren

More flexible configuration key

I have a Lambda function with event source bucket foo and prefix bar/, but the files I'd like to load into Redshift (created by a third party) are created with a path like foo/bar/baz/qux/norf/file123.gz. The /qux/norf/.. part of the path changes over time as new directories are created.

I would like the Lambda function to always find a single DynamoDB configuration with the key foo/bar. I've forked this repo and changed the searchKey in index.js directly, but I would love to contribute back a more flexible solution. Can you comment on whether that would make sense? Maybe the Lambda function could set the searchKey according to its event source?
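One possible shape for that flexibility, as a sketch only (getConfig here is a placeholder for however the configuration lookup is done, not the loader's API): walk up the key path, dropping trailing segments until a configuration entry matches.

// Hypothetical hierarchical lookup: try the full bucket/prefix first, then
// progressively drop trailing path segments until a configuration is found.
function resolveConfig(bucket, objectKey, getConfig, callback) {
    var segments = objectKey.split('/');
    segments.pop(); // drop the file name itself

    (function tryNext() {
        if (segments.length === 0) {
            return callback(new Error('No configuration found for ' + bucket + '/' + objectKey));
        }
        var searchKey = bucket + '/' + segments.join('/');
        getConfig(searchKey, function (err, config) {
            if (!err && config) {
                return callback(null, config);
            }
            segments.pop();
            tryNext();
        });
    })();
}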

Support DynamoDB streams?

Feature request.

Any plans to expand this to support DynamoDB streams as a data source? This project sounds like it has the right framework to go beyond S3 and also handle DynamoDB data.

npm install error on 2.6.1

0 info it worked if it ends with ok
1 verbose cli [ 'C:\\Program Files (x86)\\nodejs\\\\node.exe',
1 verbose cli   'C:\\Program Files (x86)\\nodejs\\node_modules\\npm\\bin\\npm-cli.js',
1 verbose cli   'install' ]
2 info using npm@2.7.4
3 info using node@v0.12.2
4 verbose node symlink C:\Program Files (x86)\nodejs\\node.exe
5 error install Couldn't read dependencies
6 verbose stack Error: Invalid version: "2.0.6.1"
6 verbose stack     at Object.module.exports.fixVersionField (C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\normalize-package-data\lib\fixer.js:190:13)
6 verbose stack     at C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\normalize-package-data\lib\normalize.js:30:38
6 verbose stack     at Array.forEach (native)
6 verbose stack     at normalize (C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\normalize-package-data\lib\normalize.js:29:15)
6 verbose stack     at final (C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\read-package-json\read-json.js:368:33)
6 verbose stack     at then (C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\read-package-json\read-json.js:127:33)
6 verbose stack     at C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\read-package-json\read-json.js:267:40
6 verbose stack     at evalmachine.<anonymous>:334:14
6 verbose stack     at C:\Program Files (x86)\nodejs\node_modules\npm\node_modules\graceful-fs\graceful-fs.js:102:5
6 verbose stack     at FSReqWrap.oncomplete (evalmachine.<anonymous>:95:15)
7 verbose cwd C:\Projects\aws-lambda-redshift-loader
8 error Windows_NT 6.3.9600
9 error argv "C:\\Program Files (x86)\\nodejs\\\\node.exe" "C:\\Program Files (x86)\\nodejs\\node_modules\\npm\\bin\\npm-cli.js" "install"
10 error node v0.12.2
11 error npm  v2.7.4
12 error Invalid version: "2.0.6.1"
13 error If you need help, you may report this error at:
13 error     <https://github.com/npm/npm/issues>
14 verbose exit [ 1, true ]

Matching?

I have an entry in DynamoDB with this prefix:

s3Prefix: analitycs-events-store/year=/month=/day=*

But I can't figure out how to match it in the Lambda; it always returns that it has not found the pattern in DynamoDB:

2015-04-13T14:16:17.567Z a3e758a8-e1e7-11e4-8f24-6fe7d03a27ad unable to load configuration for analitycs-events-store/year%3D2015/month%3D04/day%3D13

Any tips?
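A hedged guess based only on the log line above: the object key arrives in the S3 event URL-encoded (= becomes %3D), so the raw key never matches a prefix stored with literal = characters. Decoding the key before the lookup, roughly like this, would make the comparison line up (a sketch only, not the loader's code):

// Hypothetical normalisation of an S3 event key before configuration lookup:
// S3 event notifications URL-encode object keys, and '+' stands for a space.
function normaliseEventKey(rawKey) {
    return decodeURIComponent(rawKey.replace(/\+/g, ' '));
}

// e.g. normaliseEventKey('analitycs-events-store/year%3D2015/month%3D04/day%3D13')
//   -> 'analitycs-events-store/year=2015/month=04/day=13'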

Issue running setup.js

I just ran node setup.js for the first time and got this output. I did run export AWS_REGION=us-east-1 before this.

andrews-mbp:aws-lambda-redshift-loader andrewguy$ node setup.js
Enter the Region for the Redshift Load Configuration > us-east-1
Enter the S3 Bucket & Prefix to watch for files > ttgredshift/logload
Enter a Filename Filter Regex > error-csv.*
Enter the Cluster Endpoint > ttg-reporting.blahbalh..
Enter the Cluster Port > 5439
Enter the Database Name > reporting
Enter the Table to be Loaded > tablename
Should the Table be Truncated before Load? (Y/N) > n
Enter the Database Username > reporting
Enter the Database Password > MyPassword
Unknown Error during Customer Master Key describe
Error during resolution of Customer Master Key
/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:32
          throw err;
                ^
TypeError: Cannot read property '1' of null
    at Object.toLambdaStringFormat (/Users/andrewguy/src/aws-lambda-redshift-loader/kmsCrypto.js:236:32)
    at /Users/andrewguy/src/aws-lambda-redshift-loader/setup.js:139:19
    at /Users/andrewguy/src/aws-lambda-redshift-loader/kmsCrypto.js:124:11
    at Response.<anonymous> (/Users/andrewguy/src/aws-lambda-redshift-loader/kmsCrypto.js:102:16)
    at Request.<anonymous> (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:350:18)
    at Request.callListeners (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:100:18)
    at Request.emit (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/sequential_executor.js:77:10)
    at Request.emit (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:604:14)
    at Request.transition (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/request.js:21:12)
    at AcceptorStateMachine.runTo (/Users/andrewguy/src/aws-lambda-redshift-loader/node_modules/aws-sdk/lib/state_machine.js:14:12)

Unable to leverage "role" for S3 bucket access

In setup.js, it prompts for the access/secret key used to access the S3 buckets holding the data to load. It mentions "If NULL then Lambda execution role credentials will be used", but if you leave it NULL it errors out and says "You Must Provide an Access Key".

Preparing your Amazon Redshift Cluster

In the readme it states:

In order to load a cluster, we’ll have to enable AWS Lambda to connect. To do this, we must enable the cluster security group to allow access from the public internet. In the future, AWS Lambda will support presenting the service as though it was inside your own VPC.

Due to our security policy, we're unable to grant public access.

Will this prevent the sample loader from loading the example CSV files into Redshift?

Is there a workaround?

Thank You

Cannot reprocess batch

I have a batch that is in status locked:

$ AWS_PROFILE=myprofile node queryBatches.js eu-west-1 locked
[
  {
    "s3Prefix": "mybucket/logs/access/cdn",
    "batchId": "ea3cbcdb-4526-4ba2-aaa4-9f12503f7b54",
    "lastUpdateDate": "2015-04-15-13:02:24"
  }
]

I run the reprocessBatch.js script:

$ AWS_PROFILE=myprofile node reprocessBatch.js eu-west-1 ea3cbcdb-4526-4ba2-aaa4-9f12503f7b54 mybucket/logs/access/cdn
Submitted reprocess request for mybucket/logs/access/cdn/cdn_20150214_0001.log.gz
Processed 1 Files

I'm not sure exactly what should happen. There are no entries in the table LambdaRedshiftProcessedFiles. There is still only one entry in the table LambdaRedshiftBatches. There are no new log streams or events in CloudWatch.

Database connection error 403

I'm getting a 403 error when connecting. I have tested both the user and password and they seem to work. Any idea how I can verify that the user is properly encoded/decoded in DynamoDB?

2015-04-13T16:00:20.701Z 2c0c5f3e-e1f6-11e4-acad-4f348c1833fe Connecting to Database jdbc:postgresql://MY_INSTANCE.us-east-1.redshift.amazonaws.com:5439/events?tcpKeepAlive=true
2015-04-13T16:00:25.611Z 2c0c5f3e-e1f6-11e4-acad-4f348c1833fe { [AccessDenied: Access Denied]
message: 'Access Denied',
code: 'AccessDenied',
time: Mon Apr 13 2015 16:00:25 GMT+0000 (UTC),
statusCode: 403,
retryable: false,
retryDelay: 30 }
2015-04-13T16:00:25.751Z 2c0c5f3e-e1f6-11e4-acad-4f348c1833fe { [AccessDenied: Access Denied]
message: 'Access Denied',
code: 'AccessDenied',
time: Mon Apr 13 2015 16:00:25 GMT+0000 (UTC),
statusCode: 403,
retryable: false,
retryDelay: 30 }
2015-04-13T16:00:25.959Z 2c0c5f3e-e1f6-11e4-acad-4f348c1833fe Error Message: AccessDenied: Access Denied
2015-04-13T16:00:25.990Z 2c0c5f3e-e1f6-11e4-acad-4f348c1833fe error: {"errorMessage":"error"}

Unable to unlock batches in locked status

Hi,
I'm currently ending up with batches being in a locked state, and the loader trying to write to them results in an error. I tried to unlock them but ended up with the following message:

"Batch is not currently allocated as the open batch for Load Configuration on Use reprocessBatch.js to rerun the load of this Batch."

So, when re-running the batch with reprocessBatch.js, the files are processed. But what happens to the old batch? To my understanding it is still there in the locked state, and the loader is still trying to write to it, ending up with the following message in the logs:

Reload of Configuration Complete after attempting to write to Locked Batch Attempt 1, continuing doing so for 60 secs.

If we run reprocessBatch.js, will the files in the "old" locked batch end up in a new batch with open status? Should we delete the "old" batch which is locked? And how would one do that? Any pointers appreciated.
