
dat-ecosystem-archive / gasket

Build cross platform data pipelines [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]


gasket's Introduction

deprecated

More info on active projects and modules at dat-ecosystem.org


gasket

Preconfigured pipelines for node.js

logo

$ npm install -g gasket
$ gasket # prints help
$ gasket completion --save # install tab completion

Usage

To set up a pipeline, add a gasket section to your package.json

{
  "name": "my-test-app",
  "dependencies" : {
    "transform-uppercase": "^1.0.0"
  },
  "gasket": {
    "example": [
      {
        "command": "echo hello world",
        "type": "pipe"
      },
      {
        "command": "transform-uppercase",
        "type": "pipe"
      }
    ]
  }
}

To run the above example pipeline, go to the repo and run

$ gasket run example # will print HELLO WORLD

gasket will spawn each command in the pipeline (it supports modules/commands installed via npm) and pipe them together (if the type is set to "pipe").
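
Conceptually, a two-step "pipe" pipeline behaves roughly like the following sketch (illustrative only, not gasket's actual source):

var spawn = require('child_process').spawn

// spawn both commands, then wire stdout of the first into stdin of the second
var first = spawn('echo', ['hello', 'world'])
var second = spawn('transform-uppercase') // assumes the npm module is installed

first.stdout.pipe(second.stdin)
second.stdout.pipe(process.stdout) // prints HELLO WORLD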

If you want to wait for the previous command to finish, set the type to "run" instead.

{
  "gasket": {
    "example": [
      {
        "command": "echo hello world",
        "type": "run"
      },
      {
        "command": "echo hello afterwards",
        "type": "run"
      }
    ]
  }
}

Running the above will print

hello world
hello afterwards

Modules in pipelines

In addition to commands, it supports node modules that return streams

{
  "gasket": [
    {
      "command": "echo hello world",
      "type": "pipe"
    },
    {
      "command": {"module":"./uppercase.js"},
      "type": "pipe"
    }
  ]
}

Where uppercase.js is a file that looks like this

var through = require('through2')
module.exports = function() {
  return through(function(data, enc, cb) {
    cb(null, data.toString().toUpperCase())
  })
}

If your module reads/writes JSON objects, set json: true in the pipeline entry. That will make gasket parse newline-separated JSON before passing the objects to the stream, and stringify the output again afterwards.

Running gasket run main will produce HELLO WORLD
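
For illustration, here is a minimal sketch of the json option (the exact placement of the json flag alongside command and type is an assumption based on the description above):

{
  "gasket": [
    {
      "command": {"module": "./add-excited.js"},
      "type": "pipe",
      "json": true
    }
  ]
}

Where add-excited.js works with parsed objects instead of raw buffers:

// add-excited.js
var through = require('through2')
module.exports = function() {
  return through.obj(function(obj, enc, cb) {
    obj.excited = true // the chunk is already a parsed JSON object
    cb(null, obj)      // gasket stringifies the output back to newline-separated JSON
  })
}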

Using gasket.json

If you don't have a package.json file, you can add the tasks to a gasket.json file instead

{
  "example": [
    {
      "command": "echo hello world",
      "type": "pipe"
    },
    {
      "command": "transform-uppercase",
      "type": "pipe"
    }
  ]
}

gasket as a module

You can use gasket as a module as well

var gasket = require('gasket')

var pipelines = gasket({
  example: [
    {
      "command": "echo hello world",
      "type": "pipe"
    },
    {
      "command": "transform-uppercase",
      "type": "pipe"
    }
  ]
})

pipelines.run('example').pipe(process.stdout)
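
The stream returned by run() is wired together with duplexify internally (see the stack trace in the issues below), so presumably you can also write input into the first command:

process.stdin.pipe(pipelines.run('example')).pipe(process.stdout)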

gasket's People

Contributors

finnp, johnnyman727, mafintosh, max-mapper, melaniecebula, ninabreznik, shama, timoxley


gasket's Issues

gasket as a module

hi, i'm having a great time discovering dat tools.
i bumped into this error trying the "gasket as a module" example.

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: This socket is closed.
    at Socket._write (net.js:638:19)
    at doWrite (_stream_writable.js:226:10)
    at writeOrBuffer (_stream_writable.js:216:5)
    at Socket.Writable.write (_stream_writable.js:183:11)
    at Socket.write (net.js:616:40)
    at Duplexify._write (/home/me/node/dat/node_modules/gasket/node_modules/duplexify/index.js:197:22)
    at doWrite (/home/me/node/dat/node_modules/gasket/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:237:10)
    at writeOrBuffer (/home/me/node/dat/node_modules/gasket/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:227:5)
    at Writable.write (/home/me/node/dat/node_modules/gasket/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:194:11)
    at write (/home/me/node/dat/node_modules/gasket/node_modules/duplexify/node_modules/readable-stream/lib/_stream_readable.js:623:24)

conditional process trigger

hi all:

i'm interested in integrating gasket into our neuro imaging analysis project.

TWO QUESTIONS!

1 we have a pipeline need as such:

=> client runs taskA
=> client runs taskB
( system watcher picks up on B's output )
........=> remote computer aggregates taskB results (some ML routines w/ other client results)
=> client runs taskC (using result from remote computation)

2 i'm assuming gasket is stateless. e.g. taskB fails, the system comes back up, tries again, and replays from taskA. correct?

discussion.

i was considering making my own pipeliner (which may be required if it's too out of scope for gasket to achieve these goals), but of course, i'd rather chat it out first!

in the end i'm going to have to produce something that has the following effect:

    "example": [
      {
        "command": "echo hello world",
        "type": "pipe"
      },
      // now i intercept the flow here, and store some state about my progress through the pipeline... proceed :)
      {
        "command": "transform-uppercase",
        "type": "cmd"
      },
      // once again i intercept the flow, and do some remote data computation/synchronization
      {
        "command": "crazy-async-xform",  // doesn't run immediately. gasket sets up a listener because of `onEvent` then waits. my app then emits the event, feeds the data in, and it proceeds
        "type": "cmd",
        "onEvent": "doCrazyXform"
      }
    ]

expose current task/progress

Would be nice to have access to the current task. For my use case, I would like to be able to access
it on error so I can have some insight into which task failed. Would possibly be useful to be able to access this for tracking progress as well.

Not sure how this would be exposed outside of the case of an error. The current 'data' event emits output from stdout, which seems appropriate.

(this was discussed in #dat)
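
A hypothetical shape for this (names invented here, not an existing gasket API) could attach the failing task to the error object:

// hypothetical sketch only -- err.task is not an implemented property
pipelines.run('example')
  .on('data', function(data) {
    // output from stdout, as today
  })
  .on('error', function(err) {
    console.error('failed in task:', err.task) // the current command could be exposed here
  })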

Data Pipeline Configuration and Datscript Proposal



Goal

Create a data pipeline configuration that makes sense. This involves:

  • Creating and refining a datscript file format (outlined here, previously discussed: dat-ecosystem-archive/datproject-discussions#16)
  • Making changes to hackfile parser to parse datscript correctly
  • Ultimately making changes to gasket to support functionality of datscript, probable changes to syntax to correctly handle hackfile output (some comments here, early discussion here: #7, some (early) proposed changes here: #16)

Pipeline: datscript --> hackfile parser --> hackfile --> gasket

Datscript


Keywords

Command-Types:

run: runs following commands serially
pipe: pipes following commands together
fork: runs following commands in parallel, next command-type waits for these commands to finish
background: similar to fork, but next command-type does not wait for these commands to finish
map: multiway-pipe from one to many; pipes first command to rest of commands
reduce: multiway-pipe from many to one; pipes rest of commands to first command

Other Keywords:

pipeline: keyword for distinguishing a pipeline from other command-types

Datscript Syntax


Command-type {run, pipe, fork, background, map, reduce} followed by args in either of the two formats:

Format 1:

{command-type} {arg1}
  {arg2}
  {arg3}
  ....

Format 2:

{command-type}
  {arg1}
  {arg2}
  {arg3}
  ....

pipeline {pipeline-name} followed by either of the previous command-type formats:

pipeline {pipeline-name}
    {command-type}
      {arg1}
      {arg2}
      {arg3}
      ....  

Commands in detail


Run Command:

run will run each command serially; that is, it will wait for the previous command to finish before starting the next command.

The following all result in the same behavior, since the run command is serial:

Example 1:

run bar
run baz

Example 2:

run
  bar
  baz

Example 3 (not best-practice):

run bar
  baz

Pipe Command:

pipe will pipe each command together; that is, it will take the first command and pipe it to the next command until it reaches the end, then pipe the result to stdout. pipe with only one supplied command is undefined.

Example 1: prints "A" to stdout

pipe
  echo a
  transform-to-uppercase
  cat

Example 2: prints "A" to stdout

pipe echo a
  transform-to-uppercase
  cat

Example 3: INVALID because both transform-to-uppercase and cat need input (and since these are separate groupings, these lines are NOT piped together)

pipe echo a
pipe transform-to-uppercase
pipe cat

Example 4: prints "A" to stdout, prints "B" to stdout

pipe
  echo a
  transform-to-uppercase
  cat
pipe
  echo b
  transform-to-uppercase
  cat

Fork Command:

fork will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. Each forked command is not guaranteed to be executed in the order you supply.

Example 1 (best-practice):

  • will print a and b to stdout (in either order)
  • after completing those commands, will print baz to std.out
fork
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

fork echo a
  echo b
run echo baz

Example 3: Will print a and b to stdout (in either order), before exiting.

fork
  echo a
  echo b

Background Command

background will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. Each background command is not guaranteed to be executed in the order you supply.

Example 1 (best-practice):

  • will print a, b, and baz to stdout (in either order)
background
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

background echo a
  echo b
run echo baz

Example 3: Starts a node server. run echo a does not wait for run node server.js to finish. After completing the last command (in this case, run echo a), gasket will NOT wait for the background command (run node server.js) to finish, but will properly exit it.

background
  run node server.js
run echo a

Map Command

map is a multiway-pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as fork commands. Therefore, the "map" operation pipes the first command to the rest of the provided commands in parallel (and therefore no order is guaranteed). map with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  • pipes data.json to dat import
  • pipes data.json to cat
map curl http://data.com/data.json
  dat import
  cat

Example 2: Same output as Example 1

map
  curl http://data.com/data.json
  dat import
  cat

Reduce Command

reduce is a multiway-pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as fork commands. Therefore, the "reduce" operation pipes each of the provided commands to the first command in parallel (and therefore no order is guaranteed). reduce with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  • pipes papers to dat import
  • pipes taxonomy to dat import
reduce dat import
  papers
  taxonomy

Example 2: Same output as Example 1

reduce
  dat import
  papers
  taxonomy

Defining and Executing a Pipeline

The pipeline keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.

Example 1: An import-data pipeline is defined. It imports 1, 2, 3 in parallel before printing "done importing" to stdout. After converting from datscript to gasket, run the pipeline from the command line with gasket run import-data

pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing

Example 2: Same output as Example 1, but run from within the datscript file.

run import-data
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing

You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.

Example 3: Nested command-types in a pipeline. Will print a, then print b, then print C

pipeline baz
  run
    echo a
    echo b
    pipe
      echo c
      transform-to-upper-case
      cat

Example 4: INVALID: Pipelines can only be defined at the shallowest layer.

pipeline foo
  run echo a
  pipeline bar

//TODO: Lots of tricky cases to think about here.
Example 6: Executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that pipeline foo might be invalid syntax.

map bar
  foo

pipeline foo
  baz
  bat

Misc

This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).

Variable replacement

Hey there,
I was writing the readme for one of our internal projects that is experimenting with gasket, and I used the pipeline example from here: http://blog.petersobot.com/pipes-and-filters

cat /usr/share/dict/words |     # Read in the system's dictionary.
grep purple |                   # Find words containing 'purple'
awk '{print length($1), $1}' |  # Count the letters in each word
sort -n |                       # Sort lines ("${length} ${word}")
tail -n 1 |                     # Take the last line of the input
cut -d " " -f 2 |               # Take the second part of each line
cowsay -f tux                   # Put the resulting word into Tux's mouth

Running with gasket, it failed like this:

awk: syntax error at source line 1
 context is
    {print length(), >>>  } <<<
awk: illegal statement at source line 1

Might gasket be replacing the variables wrongly? (The awk error suggests the $1 tokens were substituted away before awk saw them.)
Cheers!

Goals for Gasket

Hi there,
I'd like to know what the main goals are and what is planned for gasket. I'm asking because at PetroFeed we need a similar tool and were thinking of rolling our own… Things like debug logging (e.g. on('pipe', …)) and error handling (using domains, maybe) are some things we would need.

Is Gasket supposed to solve any Dat-specific needs? How would you go about passing arguments (for example: .pipe(require('csv-parse')({ columns: true })))? And how does it relate to transformer?

Thanks a lot!

progress API

when doing long pipelines like this:

"search-ncbi": [
    "dat cat",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],

it would be really nice to get some progress events so we can render a progress bar or something....

currently it just sits there while all the subcommands run

DEBUG=* is a little better, but its pretty verbose

I'm not sure if we need a progress bar or a 'percentage done' necessarily, just some sort of output that lets users know what's going on under the hood inside the pipeline
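
one minimal shape for this (purely illustrative, gasket does not currently emit such events) would be a per-step event as each subcommand starts:

// illustrative sketch -- 'step' is not an existing gasket event
pipelines.run('search-ncbi')
  .on('step', function(info) {
    // e.g. info = { index: 3, command: 'tool-stream extractProperty assemblyid' }
    console.error('[%d] running %s', info.index, info.command)
  })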

loses stderr?

I was attempting to use gasket to start a server and discovered that it doesn't print stack traces. Here's a simplified case where you can see stderr output is not printed. You can reproduce using console.error as well.

package.json

{
  "name": "foo",
  "gasket": [
    "echo 'I came through stderr' >/dev/stderr",
    "echo 'I came through stdout"
  ]
}
$ gasket run
=> I came through stdout
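
The likely fix, conceptually, is for gasket to forward each spawned command's stderr even though only stdout flows through the pipeline. A sketch of the missing wiring (not gasket's actual code):

var spawn = require('child_process').spawn

// reproduce the report's stderr write, then forward both streams:
// stdout keeps flowing through the pipeline, stderr goes straight to the terminal
var child = spawn('sh', ['-c', "echo 'I came through stderr' >&2"])
child.stdout.pipe(process.stdout)
child.stderr.pipe(process.stderr)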

[feedback wanted] api for pipe/parallel/serial mode

consider this use case:

https://gist.github.com/maxogden/80de2ba6a6f52ff382e3

the nulls are currently the only way to tell gasket to run the pipeline one at a time (serially). if the nulls are removed then all of the gasket run import -- lines would be spawned at once, which technically works but causes my computer to almost die
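
for reference, the null-separated form from the gist looks roughly like this (a sketch, reusing the URLs from the options below):

{
  "gasket": {
    "main": [
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-1.xml",
      null,
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-2.xml"
    ]
  }
}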

so what would be a better api for disabling the auto pipe mode?

ideas:

1: make the main pipeline an object instead of an array and add an option to change behavior, e.g.:

{
  "gasket": {
    "main": {
      "commands": [
        "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-1.xml",
        "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-2.xml"
      ],
      "serial": true
    }
  }
}

instead of "serial": true it could be "parallel": false or "pipe": false

2: make "pipe": false by default. then you could just do this:

{
  "gasket": {
    "main": [
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-1.xml",
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-2.xml"
    ]
  }
}

and they would be spawned/run one at a time and not get piped to each other. to get them to pipe together you would have to use the syntax from option 1

3: have 2 top level default keys for 'parallel' and 'serial' commands

{
  "gasket": {
    "pipes": [
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-1.xml",
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-2.xml"
    ],
    "serial": [
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-1.xml",
      "gasket run import -- http://www.fcc.gov/files/ecfs/14-28/14-28-RAW-Solr-2.xml" 
    ]
  }
}

e.g. in the above, doing gasket run pipes would act differently from gasket run serial (this one might be too magic). also i don't like the names serial and pipes that much

thoughts?

Some null-terminated commands not executed?

Create package.json file with the following:

{
  "name": "my-test-app",
  "dependencies" : {
    "transform-uppercase": "^1.0.0"
  },
  "gasket": {
    "example": [
        "echo hi1",
        null,
        "echo hi2",
        null,
        "echo hi3",
        null,
        "echo hi4"
    ]
  }
}

Run the following command

gasket run example

output is:

hi1
hi2

expected output is:

hi1
hi2
hi3
hi4

resume after crash.

Gasket steps (and any step in a generic task runner) should be idempotent, but in reality users are free to run whatever, e.g. rm -rf / o.0.

So, instead the pipeline should support resuming from last failed step.

A way of dealing with this that I've seen before is:

  1. storing a hash of the pipeline manifest,
  2. creating a temporary $PWD/gasket-progress.json file which stores the current progress of the pipeline execution
  3. on error, bail; show an error explaining the presence and purpose of the progress file
  4. allow the user to fix errors, and run gasket resume
  5. continue execution from the last failed step specified in $PWD/gasket-progress.json
  6. bail if the pipeline manifest source (gasket.json, package.json, etc.) changes (since progress wouldn't match)
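
A gasket-progress.json under this scheme might look something like this (purely illustrative, not an implemented format):

{
  "manifestHash": "<hash of the pipeline manifest>",
  "pipeline": "example",
  "lastFailedStep": 2
}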

Gasket Compatible Module List

It would be nice to maintain a list of 'gasket compatible' modules. It is hard to find them on npm because they don't usually have gasket as a dependency.

I would suggest adding instructions to the gasket README about adding the 'gasket' keyword to new npm modules.

Maybe also start a modules page (like https://github.com/rvagg/node-levelup/wiki/Modules) for some hand-curated module examples.
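
Concretely, adding the keyword would just mean module authors include it in their package.json so the module is searchable on npm (illustrative snippet):

{
  "name": "transform-uppercase",
  "keywords": ["gasket"]
}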

more verb proposals, catch and each

run
  catch email-me.py
    download
  process

pipeline download
  each scrape-each-search-page.py
    collect download-xml
      parse xml into json
    get pdf id from json
    construct full pdf url
    add to download queue

pipeline process
  map
    hash pdf
    catch email-me
      exists.py
      create-thumbnail

Invalid package.json causes data-plumber to emit misleading message about json path

I'm using @maxogden's data-plumber lesson to learn about Gasket.

Running data-plumber run package.json with an invalid package.json will cause data-plumber to search in the wrong directory for the package.json.

Test package.json (notice the extra comma in the gasket section):

{
  "name": "dataplumber",
  "version": "0.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "gasket": [
    "jsonfilter 'rows.*.doc.song'",
  ],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "transform-uppercase": "^1.0.0",
    "jsonfilter": "^1.0.1",
    "csv-parser": "^1.4.1",
    "trim-object-stream": "^1.0.0"
  }
}

When I run this in data-plumber, it gives the following output:

➜  dataplumber  data-plumber run package.json   
cat /usr/local/lib/node_modules/data-plumber/data/nextbus.xml | gasket run --config /Users/Jon/Code/dataplumber/package.json
ENOTDIR, open '/Users/Jon/Code/dataplumber/package.json/package.json'

The error shows that the path has an extra /package.json added to the end.

another bug?

test.json:

{"float":{"_":"2.9402447","name":"score"},"arr":[{"name":"applicant","str":"Carl Hockett"},{"name":"applicant_sort","str":"Carl Hockett"},{"name":"brief","bool":"true"},{"name":"city","str":"Santa Rosa"},{"name":"dateRcpt","date":"2014-02-21T05:00:00Z"},{"name":"disseminated","date":"2014-02-21T20:25:14.053Z"},{"name":"exParte","bool":"false"},{"name":"id","long":"6017590320"},{"name":"modified","date":"2014-02-21T20:25:14.053Z"},{"name":"pages","int":"1"},{"name":"proceeding","str":"14-28"},{"name":"regFlexAnalysis","bool":"false"},{"name":"smallBusinessImpact","bool":"false"},{"name":"stateCd","str":"CA"},{"name":"submissionType","str":"COMMENT"},{"name":"text","str":"7521074778.txt \nReclassify The Internet As A Common Carrier.  \nAs always, the disclaimer:  I only choose the best alternative option to \"Mr.\" as I \nobject to the title as offensive and refuse to use it.  Had I been a Warrant Officer\nin Vietnam, I would THEN accept it, but I wasn't, so \"Dr.\" it is, since you provide \nno better alternative.\nPage 1\n\n"},{"name":"viewingStatus","str":"Unrestricted"},{"name":"zip","str":"95407"}]}

package.json:

{
  "name": "data",
  "version": "0.0.0",
  "gasket": [
    "jsonmap \"delete this.arr\""
  ],
  "dependencies": {
    "jsonmap": "^1.1.1"
  }
}

what happens:

$ cat test.json | gasket run
$ cat test.json | gasket exec jsonmap "delete this.arr"
$ cat test.json | jsonmap "delete this.arr"
{"float":{"_":"2.9402447","name":"score"}}

they should all output the same thing
