# Data Pipeline Configuration and Datscript Proposal

## Goal

Create a data pipeline configuration that makes sense. This involves:

- Creating and refining a datscript file format (outlined here; previously discussed in dat-ecosystem-archive/datproject-discussions#16)
- Making changes to the hackfile parser to parse datscript correctly
- Ultimately, making changes to gasket to support the functionality of datscript, with probable changes to syntax to correctly handle hackfile output (some comments here, early discussion in #7, some early proposed changes in #16)

Pipeline: datscript --> hackfile parser --> hackfile --> gasket
## Datscript

### Keywords

Command-types:

- `run`: runs the following commands serially
- `pipe`: pipes the following commands together
- `fork`: runs the following commands in parallel; the next command-type waits for these commands to finish
- `background`: similar to `fork`, but the next command-type does not wait for these commands to finish
- `map`: multiway pipe from one to many; pipes the first command to the rest of the commands
- `reduce`: multiway pipe from many to one; pipes the rest of the commands to the first command

Other keywords:

- `pipeline`: distinguishes a pipeline from the other command-types
## Datscript Syntax

A command-type (`run`, `pipe`, `fork`, `background`, `map`, or `reduce`) is followed by its args in either of two formats:

Format 1:

```
{command-type} {arg1}
  {arg2}
  {arg3}
  ...
```

Format 2:

```
{command-type}
  {arg1}
  {arg2}
  {arg3}
  ...
```

`pipeline {pipeline-name}` is followed by either of the previous command-type formats:

```
pipeline {pipeline-name}
  {command-type}
    {arg1}
    {arg2}
    {arg3}
    ...
```
## Commands in Detail

### Run Command

`run` will run each command serially; that is, it will wait for the previous command to finish before starting the next command.

The following all result in the same behavior, since the `run` command is serial:
Example 1:
Example 2:
Example 3 (not best-practice):
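The bodies of these three examples are not shown above. Based on the two argument formats defined earlier, three equivalent serial forms would presumably look like the following (the exact assignment of forms to example numbers is a guess; Example 3 mixes an inline first arg with indented args, which is likely why it is marked not best-practice):

```
run
  echo a
  echo b
```

```
run echo a
run echo b
```

```
run echo a
  echo b
```

All three wait for `echo a` to finish before starting `echo b`, which in plain shell terms is just `echo a; echo b`.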
### Pipe Command

`pipe` will pipe each command into the next; that is, it takes the first command, pipes its output to the next command, and so on until the end, finally piping to stdout. `pipe` with only one supplied command is undefined.

Example 1: prints "A" to stdout

```
pipe
  echo a
  transform-to-uppercase
  cat
```

Example 2: prints "A" to stdout

```
pipe echo a
  transform-to-uppercase
  cat
```

Example 3: INVALID, because both `transform-to-uppercase` and `cat` need input (and since these are separate groupings, these lines are NOT piped together)

```
pipe echo a
pipe transform-to-uppercase
pipe cat
```

Example 4: prints "A" to stdout, then prints "B" to stdout

```
pipe
  echo a
  transform-to-uppercase
  cat
pipe
  echo b
  transform-to-uppercase
  cat
```
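Assuming gasket compiles a `pipe` block down to an ordinary shell pipeline, Example 1 corresponds to the following sketch, with `tr` standing in for the hypothetical `transform-to-uppercase` command:

```shell
# Each command's stdout feeds the next command's stdin; the final stage
# writes to stdout. 'tr' replaces the hypothetical transform-to-uppercase.
out=$(echo a | tr '[:lower:]' '[:upper:]' | cat)
echo "$out"
```

As in the datascript example, the single lowercase "a" comes out the far end as "A".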
### Fork Command

`fork` will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. The forked commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

- will print a and b to stdout (in either order)
- after completing those commands, will print baz to stdout

```
fork
  echo a
  echo b
run echo baz
```

Example 2: Same output as Example 1 (not best-practice)

```
fork echo a
  echo b
run echo baz
```

Example 3: will print a and b to stdout (in either order) before exiting.
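Example 3's body is not shown above. Given that a `fork` with no following command-type makes gasket wait implicitly before exiting, it would presumably be just:

```
fork
  echo a
  echo b
```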
### Background Command

`background` will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. The background commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

- will print a, b, and baz to stdout (in any order)

```
background
  echo a
  echo b
run echo baz
```

Example 2: Same output as Example 1 (not best-practice)

```
background echo a
  echo b
run echo baz
```

Example 3: starts a node server; `run echo a` does not wait for `run node server.js` to finish. After completing the last command (in this case `run echo a`), gasket will NOT wait for the background command (`run node server.js`) to finish, but will properly terminate it.

```
background
  run node server.js
run echo a
```
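Assuming gasket maps these to ordinary shell job control, `fork` corresponds to `&` followed by `wait`, and `background` to `&` without a `wait`. A sketch (the parallel jobs' output is sorted only to make it deterministic):

```shell
# fork-like: launch in parallel with '&', then 'wait' so the next
# command-type runs only after both jobs finish.
fork_out=$( { echo a & echo b & wait; } | sort )
next_out=$(echo baz)    # in a real script this runs only after 'wait'

# background-like: '&' with no 'wait' -- the script continues (and may
# exit) while the job is still running; its output is discarded here.
(sleep 1; echo "server output") >/dev/null &
bg_out=$(echo a)
```

The `sleep 1` job here plays the role of the long-lived `node server.js`: the foreground `echo a` does not wait for it.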
### Map Command

`map` is a multiway pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as `fork` commands, so the map operation pipes the first command to them in parallel (and therefore no order is guaranteed). `map` with only one supplied command is undefined.

Example 1 (best-practice): in either order:

- pipes data.json to `dat import`
- pipes data.json to `cat`

```
map curl http://data.com/data.json
  dat import
  cat
```

Example 2: Same output as Example 1

```
map
  curl http://data.com/data.json
  dat import
  cat
```
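A rough shell sketch of the one-to-many idea, using `tee` to duplicate one producer's output to two stand-in consumers (`tr` and a temp file replace `dat import` and `cat`; gasket would run the consumers in parallel, which plain `tee` does not guarantee):

```shell
# One producer (echo abc) duplicated to two consumers via tee.
tmp=$(mktemp)                                              # consumer 2 reads this file
out=$(echo abc | tee "$tmp" | tr '[:lower:]' '[:upper:]')  # consumer 1: uppercase
file_out=$(cat "$tmp")                                     # consumer 2: plain copy
rm -f "$tmp"
```

Both consumers see the same single upstream stream, which is the essence of `map`.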
### Reduce Command

`reduce` is a multiway pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as `fork` commands, so the reduce operation pipes each of them to the first command in parallel (and therefore no order is guaranteed). `reduce` with only one supplied command is undefined.

Example 1 (best-practice): in either order:

- pipes papers to `dat import`
- pipes taxonomy to `dat import`

```
reduce dat import
  papers
  taxonomy
```

Example 2: Same output as Example 1

```
reduce
  dat import
  papers
  taxonomy
```
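A shell sketch of the many-to-one idea, assuming the producers are plain commands: both run in parallel, and their merged output feeds a single consumer (`sort` stands in for `dat import`, and also makes the nondeterministic arrival order deterministic):

```shell
# Two producers run in parallel; 'wait' lets both finish, and their
# combined stream is piped into one consumer.
out=$( { echo papers & echo taxonomy & wait; } | sort )
```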
## Defining and Executing a Pipeline

The `pipeline` keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.

Example 1: an import-data pipeline is defined. It imports 1, 2, and 3 in parallel before printing "done importing" to stdout. After converting from datscript to gasket, run the pipeline from the command line with `gasket run import-data`.

```
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
```

Example 2: Same output as Example 1, but run from within the datscript file.

```
run import-data

pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
```

You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.

Example 3: nested command-types in a pipeline. Will print a, then print b, then print C.

```
pipeline baz
  run
    echo a
    echo b
  pipe
    echo c
    transform-to-upper-case
    cat
```

Example 4: INVALID, because pipelines can only be defined at the shallowest layer.

```
pipeline foo
  run echo a
  pipeline bar
```

// TODO: Lots of tricky cases to think about here.

Example 6: executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that `pipeline foo` might be invalid syntax.

```
map bar
  foo

pipeline foo
  baz
  bat
```
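Under the assumption that gasket ultimately drives ordinary processes, the pipeline-as-a-command idea resembles a shell function: a named group of steps invocable like any other command. Here `echo "import N"` stands in for the hypothetical `import N` commands of Example 1:

```shell
# pipeline import-data, sketched as a shell function:
# a fork step ('&' + wait) followed by a run step.
import_data() {
  { echo "import 1" & echo "import 2" & echo "import 3" & wait; }  # fork
  echo "done importing"                                            # run
}

# Invoking the pipeline like a command; 'wait' guarantees the imports
# finish before the final line prints, so the last line is deterministic.
out=$(import_data | tail -n 1)
```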
## Misc

This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).