dmanjunath / node-redshift

A simple collection of tools to help you get started with Amazon Redshift from node.js

node-redshift's Introduction

Overview

This package is a simple wrapper for common functionality you'll want when using Redshift. It can handle

  • Redshift connections & querying
  • Creating and running migrations
  • Creating and managing models
  • A CRUD API with an ORM wrapper and type validation

Warning! This package is new and still under development; the API is bound to change. Use at your own risk.

Installation

Install the package by running

npm install node-redshift

Link to npm repository https://www.npmjs.com/package/node-redshift

Setup

The code to connect to Redshift should look something like this:

//redshift.js
var Redshift = require('node-redshift');

var client = {
  user: user,
  database: database,
  password: password,
  port: port,
  host: host,
};

// The options object determines whether you get a connection pool or a raw connection
var redshiftClient = new Redshift(client, [options]);

module.exports = redshiftClient;

There are two ways to set up a connection to Redshift. By default, node-redshift uses connection pooling.

Raw Connection

Pass rawConnection: true in the options object when instantiating Redshift to get a raw connection. Raw connections need extra code to specify when to connect to and disconnect from Redshift. Here's how to instantiate a raw connection:

var redshiftClient = new Redshift(client, {rawConnection: true});

Connection Pooling

Connection pooling works by default with no extra configuration. Here's an example of connection pooling:

var redshiftClient = new Redshift(client); // omitting rawConnection gives you a pool

Setup Options

There are two options that can be passed in the options object to the Redshift constructor.

Option           Type     Description
rawConnection    Boolean  If you want a raw connection, pass true with this option
longStackTraces  Boolean  Default: true. To disable bluebird's longStackTraces, pass false

Usage

Query API

Please see examples/ folder for full code examples using both raw connections and connection pools.

If you're looking for a library to build robust, injection-safe SQL, I like sql-bricks for building query strings.

Both raw connections and connection-pool connections expose two query functions bound to the initialized Redshift object: query() and parameterizedQuery().

Both query() and parameterizedQuery() support callback and promise styles. If a function is passed as the last argument, it's used as the callback. If no callback is passed, e.g. query(query, [options]).then(...), a promise is returned.

//raw connection
var redshiftClient = require('./redshift.js');

redshiftClient.connect(function(err){
  if(err) throw err;
  else{
    redshiftClient.query('SELECT * FROM "TableName"', [options], function(err, data){
      if(err) throw err;
      else{
        console.log(data);
        redshiftClient.close();
      }
    });
  }
});

//connection pool
var redshiftClient = require('./redshift.js');

// options is an optional object with one property so far {raw: true} returns 
// just the data from redshift. {raw: false} returns the data with the pg object
redshiftClient.query(queryString, [options])
.then(function(data){
    console.log(data);
})
.catch(function(err){
    console.error(err);
});
//instead of promises you can also use callbacks to get the data
Parameterized Queries

If you parameterize the SQL string yourself, you can call the parameterizedQuery() function

//connection pool
var redshiftClient = require('./redshift.js');

// options is an optional object with one property so far {raw: true} returns 
// just the data from redshift. {raw: false} returns the data with the pg object
redshiftClient.parameterizedQuery('SELECT * FROM "TableName" WHERE "parameter" = $1', [42], [options], function(err, data){
  if(err) throw err;
  else{
    console.log(data);
  }
});
//you can also use promises to get the data
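Parameterized queries send the SQL text and the values array to the driver separately instead of interpolating strings, so the values can never change the query's structure. As a small illustration of the $1, $2 placeholder convention (this helper is hypothetical, not part of node-redshift), you can sanity-check that the placeholders in a statement match your values array:

```javascript
// Hypothetical helper, not part of node-redshift: counts the distinct $1..$n
// placeholders in a SQL string so it can be checked against the values array.
function placeholderCount(sql) {
  var matches = sql.match(/\$\d+/g) || [];
  var seen = {};
  matches.forEach(function (m) { seen[m] = true; });
  return Object.keys(seen).length;
}

var sql = 'SELECT * FROM "TableName" WHERE "parameter" = $1 AND "other" = $2';
console.log(placeholderCount(sql)); // 2 — matches a two-element values array
```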
Template Literal Queries

If you use template literals to write your SQL, you can use a tagged template parser like https://github.com/felixfbecker/node-sql-template-strings to parameterize the template literal

//connection pool
var redshiftClient = require('./redshift.js');
var SQL = require('sql-template-strings');

// options is an optional object with one property so far {raw: true} returns 
// just the data from redshift. {raw: false} returns the data with the pg object
let value = 42;

redshiftClient.query(SQL`SELECT * FROM "TableName" WHERE "parameter" = ${value}`, [options], function(err, data){
  if(err) throw err;
  else{
    console.log(data);
  }
});
//you can also use promises to get the data
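Under the hood, a tagged-template helper like sql-template-strings converts the template literal into a parameterized query: SQL text with $1, $2 placeholders plus a values array. A minimal sketch of that idea (not the real package's implementation):

```javascript
// Minimal sketch of a SQL tagged template (not the real sql-template-strings):
// turns SQL`... ${v} ...` into { text: '... $1 ...', values: [v] }.
function SQL(strings) {
  var values = Array.prototype.slice.call(arguments, 1);
  var text = strings.reduce(function (acc, part, i) {
    return acc + '$' + i + part;
  });
  return { text: text, values: values };
}

var value = 42;
var q = SQL`SELECT * FROM "TableName" WHERE "parameter" = ${value}`;
console.log(q.text);   // SELECT * FROM "TableName" WHERE "parameter" = $1
console.log(q.values); // [ 42 ]
```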
rawQuery()

If you want to make a one-time raw query, but you don't want to call connect and disconnect manually and you don't want to use connection pooling, you can use rawQuery()

//one-time raw query
var redshiftClient = require('./redshift.js');

// options is an optional object with one property so far {raw: true} returns 
// just the data from redshift. {raw: false} returns the data with the pg object
redshiftClient.rawQuery('SELECT * FROM "TableName"', [options], function(err, data){
  if(err) throw err;
  else{
    console.log(data);
  }
});
//you can also use promises to get the data
Query Options

There's only a single query option so far: {raw: true}, which returns just the data rows from Redshift. Passing {raw: false}, or omitting the option, returns the data along with the entire pg result object, which includes the row count, table statistics, etc.
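To make the difference concrete, here is a sketch of the unwrapping using a mocked pg-style result object (the literal below is illustrative; the real object comes from the pg driver):

```javascript
// Sketch of the {raw: true} option, using a mocked pg-style result object.
var pgResult = {
  rows: [{ id: 1, name: 'a' }, { id: 2, name: 'b' }],
  rowCount: 2,
  command: 'SELECT'
};

// {raw: true} yields just the rows; anything else yields the full pg result.
function unwrap(result, options) {
  return options && options.raw ? result.rows : result;
}

console.log(unwrap(pgResult, { raw: true }).length);    // 2 — just the rows
console.log(unwrap(pgResult, { raw: false }).rowCount); // 2 — full pg result
```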

CLI

There's a CLI with options for easy migration management. Creating a migration will create a redshift_migrations/ folder containing a state file called .migrate, which records your completed migrations. When you run db:migrate, the CLI computes which migrations have not yet been run on your Redshift instance, runs them, and updates .migrate.

WARNING!!! IF YOU HAVE SEPARATE DEV AND PROD REDSHIFT INSTANCES, DO NOT COMMIT THE .migrate FILE TO YOUR VCS OR DEPLOY TO YOUR SERVERS. YOU'LL NEED A NEW VERSION OF THIS FILE FOR EVERY INSTANCE OF REDSHIFT.

Create a new migration file in the redshift_migrations/ folder

node_modules/.bin/node-redshift migration:create <filename>

Run all remaining migrations on the database

node_modules/.bin/node-redshift db:migrate <filename>

Undo the last migration

node_modules/.bin/node-redshift db:migrate:undo <filename>

Create a model using the command line

node_modules/.bin/node-redshift model:create <filename>

Models

A model will look like this

'use strict';

var person = {
  'tableName': 'people',
  'tableProperties': {
    'id': {
      'type': 'key'
    },
    'name': {
      'type': 'string',
      'required': true
    },
    'email': {
      'type': 'string',
      'required': true
    }
  }
};

module.exports = person;
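The tableProperties block is what drives the ORM's type validation. For illustration only (this is not node-redshift's actual code), a validator over this model might check required fields and types like so:

```javascript
// Illustration only — not node-redshift's actual validation code.
// Checks a record against a model's tableProperties.
var person = {
  tableName: 'people',
  tableProperties: {
    id:    { type: 'key' },
    name:  { type: 'string', required: true },
    email: { type: 'string', required: true }
  }
};

function validate(model, record) {
  var errors = [];
  Object.keys(model.tableProperties).forEach(function (prop) {
    var spec = model.tableProperties[prop];
    if (spec.required && record[prop] === undefined) {
      errors.push(prop + ' is required');
    } else if (spec.type === 'string' && record[prop] !== undefined &&
               typeof record[prop] !== 'string') {
      errors.push(prop + ' must be a string');
    }
  });
  return errors;
}

console.log(validate(person, { name: 'Dheeraj' })); // [ 'email is required' ]
```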
Importing and using models with the ORM

There are two ways to import and use Redshift models. The first is to call redshift.import in every file where you want to use the model ORM.

var redshift = require("../redshift.js");
var person = redshift.import("./redshift_models/person.js");

person.create({name: 'Dheeraj', email: '[email protected]'}, function(err, data){
    if(err) throw err;
    else{
      console.log(data);
    }
  });

The alternative (my preferred way) is to abstract the import calls and export all the models on the redshift object right after initialization

//redshift.js
...redshift connection code...

var person = redshift.import("./redshift_models/person.js");
redshift.models = {};
redshift.models.person = person;

module.exports = redshift;

//usage in person.js
var redshift = require('./redshift.js');
var person = redshift.models.person;

person.create({name: 'Dheeraj', email: '[email protected]'}, function(err, data){
    if(err) throw err;
    else{
      console.log(data);
    }
  });

ORM API

There are three functions supported by the ORM: create, update, and delete.

/**
 * create a new instance of object
 * @param  {Object or Array}   data Object/Array with keys/values to create in database. keys are column names, values are data
 * @param  {Function} cb   
 * @return {Object}        Object that's inserted into redshift
 */
Person.create({emailAddress: '[email protected]', name: 'Dheeraj'}, function(err, data){
  if(err) throw err;
  else console.log(data);
});
 
/**
 * update an existing item in redshift
 * @param  {Object}   whereClause The properties that identify the rows to update. Essentially the WHERE clause in the UPDATE statement
 * @param  {Object}   data        Properties to overwrite in the record
 * @param  {Function} callback    
 * @return {Object}               Object that's updated in redshift
 *
 */
Person.update({id: 72}, {emailAddress: '[email protected]', name: 'Dheeraj'}, function(err, data){
  if(err) throw err;
  else console.log(data);
});

/**
 * delete rows from redshift
 * @param  {Object}   whereClause The properties that identify the rows to delete. Essentially the WHERE clause in the DELETE statement
 * @param  {Function} cb
 * @return {Object}        Object that's deleted from redshift
 */
Person.delete({emailAddress: '[email protected]', name: 'Dheeraj'}, function(err, data){
  if(err) throw err;
  else console.log(data);
});
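In all three functions, the whereClause object maps column names to values that are ANDed together. For illustration (this is a sketch, not the library's code), turning such an object into a parameterized WHERE fragment looks roughly like this:

```javascript
// Illustration only — not node-redshift's code. Builds a parameterized WHERE
// fragment from a whereClause object; individual conditions are ANDed together.
function buildWhere(whereClause) {
  var keys = Object.keys(whereClause);
  var text = keys.map(function (k, i) {
    return '"' + k + '" = $' + (i + 1);
  }).join(' AND ');
  var values = keys.map(function (k) { return whereClause[k]; });
  return { text: 'WHERE ' + text, values: values };
}

var w = buildWhere({ id: 72 });
console.log(w.text);   // WHERE "id" = $1
console.log(w.values); // [ 72 ]
```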

Upcoming features

  • Ability to customize the location of the .migrate file, or even load it from S3
  • Model checking prior to queries to verify property names and types
  • Add class & instance methods to models

License

MIT

node-redshift's People

Contributors

dmanjunath, glena, schmidp, thalesmello


node-redshift's Issues

Publish latest change?

Hey, @dmanjunath

Would you mind publishing a new version to NPM?

I would like to point my production repo to the NPM package rather than my current fork.

pg.types in node-redshift

Hi,

When a date is pulled from RedShift, nodejs makes a "new Date()", so we depend on Local timezone.
pg offers the possibility to override types parsing:

var pg = require('pg');
var types = pg.types;

var TIMESTAMP_OID = 1114;
var parseDateFn = function (val) {
    return val === null ? null : val;
};
types.setTypeParser(TIMESTAMP_OID, parseDateFn);

Unfortunately the node-redshift npm doesn't export this pg.types object and so we can't modify the way javascript will parse redshift data.

Would it be possible to add it?

Thanks

Re: Support for SSL, Keep alive as connection option

Couldn't see a place to put this request; it may not be an issue but rather a feature ask.

I need to pass following options to my connection -
?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory

Most of the code and documentation reference options like {rawConnection: true}; is there a way I can include the above?

Thanks

node-redshift issue

Hi,

I am trying to write a simple Node.js app using the node-redshift package,
using the examples here -> https://www.npmjs.com/package/node-redshift
but I am not sure how to provide the value for the "host" parameter.

Can someone suggest, please?

I am getting Error: getaddrinfo ENOTFOUND; surely I am not able to pass the correct value for the host parameter.

Connection constantly resets.

Hi,

Good day

Is there any option one can use to stop the connection from dying? I keep on getting the following error:

Error: read ECONNRESET
    at _errnoException (util.js:1003:13)
    at TCP.onread (net.js:611:25)
From previous event:
    at _queryHelper (/var/www/redshift/migrate_to_s3/node_modules/node-redshift/lib/query.js:107:12)
    at Redshift.query (/var/www/redshift/migrate_to_s3/node_modules/node-redshift/lib/query.js:28:10)
    at Timeout.unload [as _onTimeout] (/var/www/redshift/migrate_to_s3/migrate_by_subtype.js:104:20)
    at ontimeout (timers.js:462:13)
    at tryOnTimeout (timers.js:298:5)
    at Timer.listOnTimeout (timers.js:261:5)

Thanks.

Regards.
JJ

Use rawConnection or Pool in node-redshift for lambda?

I am stuck on whether to use rawConnection or the default connection pooling of node-redshift. The application is a Lambda function which is going to insert data into a Redshift table, and it's going to run at scale. I want to make sure that I use an optimized connection strategy and follow best practices! Any help would be appreciated.

library shouldn't attempt to configure bluebird

When trying to test a simple query, I got:
Error: cannot enable long stack traces after promises have been created

This exception was thrown by bluebird because my app created a promise before calling into node-redshift.

I eventually figured out I could work around this by passing { longStackTraces: false }, but I think that shouldn't be necessary - it's not uncommon for apps to be using promises elsewhere. I suggest making the default value of longStackTraces false, or removing it as an option altogether (library consumers can call Promise.config themselves if desired).

What about streaming?

Is there a way to stream rows out of Redshift instead of loading the entire dataset into memory before returning it to the client?

Thank you for your help

Accept promises in the migration interface

I think it would be nice if instead of

exports.up = function(next) {
  // something
  next();
};

It would be nice to write

exports.up = function() {
  return query('Something')
};

It would even allow us to use async functions in the queries, making it even nicer

exports.up = async function() {
  await query('something')
};

Redshift errors

Hi @dmanjunath ,

Me again :)

Just a quick question/remark about the way node-redshift npm handles Redshift errors:
In query.js, you reject the promise this way:

reject(new Error(err));

Unfortunately we lose the Redshift error number.

Is there a reason why you have chosen to "override" the native RedShift error?

Thanks a lot for your answer!

Nours

Issue with connecting to my database.

Please help! I have a Redshift cluster, and from SQL Workbench everything works fine: I can connect and work. I tried to connect to my database from a node.js server. I installed this module and tried to use it, but every time I was getting

error: getaddrinfo ENOTFOUND jdbc:.... (and my jdbc link)

I tried with both raw and pooled connections.

Here is my sample code

const Redshift = require('node-redshift');
const redShiftConnectionOptions = {
    user: 'myusername',
    database: 'mydatabase',
    password: 'mypassword',
    port: 5439,
    host: 'jdbc:redshift://my-redshift-link'
  };

  const redshiftClient = new Redshift(redShiftConnectionOptions, {
    rawConnection: true
  });

redshiftClient.connect(function(err) {
    if (err) {
      console.log('error is here');
      throw err;
    } else {
      redshiftClient.query('select * from "photo_tags"', function(err, data) {
        if (err) throw err;
        else {
          res.send(data);
          redshiftClient.close();
        }
      });
    }
  });

Every time it goes to connection error.

Use With AWS Lambda

Hello,

Is it possible to use this within a Lambda function?

I inserted my nodejs code that was working on my local machine into an AWS Lambda function and received the error "Error: Cannot find module 'node-redshift'.

I then uploaded the zipped module after downloading from git as it's own Lambda function and received the error "errorMessage": "Cannot find module '/var/task/index'". Is there a way around both these issues, to enable the use of node-redshift within an AWS Lambda function?

Thanks,

Nick

Query returning incorrect float values

Making a simple query like:
select some_float_column from some_table;
is returning a row with some_float_column = 1040480
But in the database it's actually 1040478.3

Same problem happening with other rows.

some_float_column is of type float4

Problems with raw connections

Hello!

It seems like your library is mostly suited to connection pooling, which is understandable. But my (and many others') use case is just a simple connection open, query, and close (inside the callback).

Having to execute the query as a callback to the connect function is a major drawback, along with not having an instance factory, which is the case for most mature database drivers.

In this sense, I've developed something that made my life really easy, and maybe you can integrate.

It's just a Redshift object factory and an easy way to query directly:

var redshiftFactory = function() {
    var redshiftClient = new Redshift(client, {rawConnection: true});
    redshiftClient.rawQuery = function(query, queryCb) {
        redshiftClient.connect(function(err) {
            if (err) console.log(err);
            else {
                redshiftClient.query(query, {raw: true}, function(err, data) {
                    if (err) {
                        console.log(err);
                    } else {
                        if (typeof queryCb === 'function') {
                            queryCb(err, data);
                        }
                        console.log(data);
                        redshiftClient.close();
                    }
                });
            }
        });
    };

    return redshiftClient;
}

module.exports = redshiftFactory;

This way, I can just call:

const redshiftClient = redshiftFactory();
redshiftClient.rawQuery('SELECT * FROM users', function(err, data) {});

Without worrying about connecting, all the ugly nested callbacks and also tightly controlling the scope of my connection.

Since Redshift is widely used for analytics (and since you'll find people using Node.js just as way to serve a data collecting API), a lot of people may be in the need of just using simple, straightforward, and fast connection handling.

Hope it helps

code: EPIPE

Hi there, this is a great component. However, after trying a few queries I get this error: {"code":"EPIPE"}. It looks like it got disconnected from the server. Have you seen this error before? Is there a way to force a reconnect from your code?

Here is a sample from my code

exports.data = function(req, res) {
    var redshiftClient = require('../redshift.js');
    var util = require('util');
    var from = req.query.from;
    var to = req.query.to;
    console.log('req:' + from + ' ' + to);

    var sql = "select ... from metrics where date between '%s' and '%s' group by 1,2 order by 1,3 desc,2";
    var query = util.format(sql, from, to);

    // options is an optional object with one property so far {raw: true} returns  
    // just the data from redshift. {raw: false} returns the data with the pg object 
    redshiftClient.query(query, {
        raw: true
    }, callback);

    function callback(error, data) {
        res.json({
            error: error,
            report: data
        });
    }
};

Thanks

Redshift and latency

Hi,
This is more a question than an issue. Is there anything to configure in node-redshift to control how long it takes for a insert or update to be reflected in the Redshift database?
I have an application with database tables in Aurora that I am trying to parallel in Redshift. Most of the tables I bulk load via S3 and Lambda with good response times, but one table, insight, keeps track of daily statistics, which are polled at 15-minute intervals. If no record exists for the date, one is inserted; otherwise, the existing record is updated with the new values. This add/update decision doesn't lend itself to bulk loading.
I am posting a simplified version of the function that is invoked every 15 minutes for this table. What I am observing in the log file is that the 'insights:' log statement is recorded in the log right on the 15 minute mark, but the 'queryString:' log statement can appear in the log much later, even skipping several 15 minute periods, then logging for all of those skipped periods at once. And when I check the AWS Redshift console for queries, there is even more latency until they finally appear there.
Is there some sort of "flush" option?
TIA,
Ed
/* jshint node: true */
'use strict';
var Redshift = require('node-redshift');
var util = require('util');
var config = require(__dirname + '/' + 'configuration.json');

var client = {
  user: config.my_user,
  database: config.my_db,
  password: config.my_password,
  port: 5439,
  host: config.my_host + '.us-east-1.redshift.amazonaws.com'
};
var redshiftClient = new Redshift(client, {rawConnection: false});

function addUpdateRedshiftInsights(insights) {
  util.log('insights:', insights);
  _.each(insights, function(i) {
    redshiftClient.query('select * from insight where insight_date=\'' + i.insight_date + '\';', {}, function(err, data) {
      var queryString;
      if (err) {
        util.log('error encountered');
        throw err;
      } else {
        if (data.rows[0]) {
          queryString = 'update insight set impressions=' + i.impressions + ',impressions_unique=' + i.impressions_unique + ',impressions_paid=' + i.impressions_paid + ' where insight_date=\'' + i.insight_date + '\';';
        } else {
          queryString = 'insert into insight (insight_date,impressions,impressions_unique,impressions_paid) values (\'' + i.insight_date + '\',' + i.impressions + ',' + i.impressions_unique + ',' + i.impressions_paid + ');';
        }
        util.log('queryString:', queryString);
        redshiftClient.query(queryString, {}, function(err, data) {
          if (err) throw err;
        });
      }
    });
  });
}

using redshift pool .. This socket has been ended by the other party

var redshift_info = {
  user: 'dsi',
  database: 'id...',
  password: 'pass',
  port: '5439',
  host: 'aws.xxxxxx.com',
  max: 10, // max number of clients in the pool
  idleTimeoutMillis: 5000,
};
var redshiftClient = new redshift(redshift_info, {rawConnection: false});

and insert some data into Redshift. The first insert is OK and some data is inserted into Redshift, but after that I get the error below:

2017-01-17 01:59:25 - error: Error: This socket has been ended by the other party
at Socket.writeAfterFIN [as write] (net.js:291:12)
at Connection.query (/app/node-server/node_modules/pg/lib/connection.js:204:15)
at Query.submit (/app/node-server/node_modules/pg/lib/query.js:138:16)
at Client._pulseQueryQueue (/app/node-server/node_modules/pg/lib/client.js:307:24)
at Client.query (/app/node-server/node_modules/pg/lib/client.js:335:8)
at runQuery (/app/node-server/node_modules/node-redshift/lib/query.js:37:15)
at Redshift.query (/app/node-server/node_modules/node-redshift/lib/query.js:33:8)

Hi, I get an error: { error: column "crop" does not exist in events...

full error:
{ error: column "crop" does not exist in events
at Connection.parseE (/Users/dang/Documents/applicaster/syteCMS/node_modules/pg/lib/connection.js:554:11)
at Connection.parseMessage (/Users/dang/Documents/applicaster/syteCMS/node_modules/pg/lib/connection.js:381:17)
at Socket. (/Users/dang/Documents/applicaster/syteCMS/node_modules/pg/lib/connection.js:117:22)
at emitOne (events.js:96:13)
at Socket.emit (events.js:188:7)
at readableAddChunk (_stream_readable.js:172:18)
at Socket.Readable.push (_stream_readable.js:130:10)
at TCP.onread (net.js:542:20)

my code:

var string = 'select created_at::date, name, count (*) from events where created_at::date > GETDATE()::date - 7 AND (name="crop" OR name="adclick") and origin="mako.co.il" group by created_at::date, name';
redshiftClient.query(string, null, function(error, answer){
  if (error){
    console.log(error);
  }
  else{
    console.log(answer);
  }
...

however, when I use SQL Workbench:
jdbc:redshift://syte-dw.cwvze9ydsqd4.us-east-1.redshift.amazonaws.com:5439/analytics

with the same query:
select created_at::date, name, count (*)
from events
where created_at::date > GETDATE()::date - 7 AND (name='crop' OR name='adclick') and origin='mako.co.il'
group by created_at::date, name

i get my results:
2016-12-07 adclick 16
2016-12-05 adclick 190
2016-12-06 adclick 192
2016-12-07 crop 474
2016-12-01 adclick 295
2016-12-02 adclick 154
2016-12-03 adclick 140
2016-12-05 crop 3688
2016-12-06 crop 3610
2016-12-01 crop 5362
2016-12-02 crop 3296
2016-12-03 crop 3087
2016-12-04 crop 5209
2016-12-04 adclick 229
