# Data Pipeline Configuration and Datscript Proposal

## Goal

Create a data pipeline configuration that makes sense. This involves:

- Creating and refining a datscript file format (outlined here; previously discussed in dat-ecosystem-archive/datproject-discussions#16)
- Making changes to the hackfile parser to parse datscript correctly
- Ultimately, making changes to gasket to support the functionality of datscript, with probable changes to syntax to correctly handle hackfile output (some comments here, early discussion in #7, some early proposed changes in #16)

Pipeline: datscript --> hackfile parser --> hackfile --> gasket
## Datscript

### Keywords

Command-types:

- `run`: runs the following commands serially
- `pipe`: pipes the following commands together
- `fork`: runs the following commands in parallel; the next command-type waits for these commands to finish
- `background`: similar to `fork`, but the next command-type does not wait for these commands to finish
- `map`: multiway pipe from one to many; pipes the first command to the rest of the commands
- `reduce`: multiway pipe from many to one; pipes the rest of the commands to the first command

Other keywords:

- `pipeline`: distinguishes a pipeline from the other command-types
## Datscript Syntax

A command-type (`run`, `pipe`, `fork`, `background`, `map`, or `reduce`) is followed by its args in either of two formats:

Format 1:

```
{command-type} {arg1}
  {arg2}
  {arg3}
  ...
```

Format 2:

```
{command-type}
  {arg1}
  {arg2}
  {arg3}
  ...
```

`pipeline {pipeline-name}` is followed by either of the previous command-type formats:

```
pipeline {pipeline-name}
  {command-type}
    {arg1}
    {arg2}
    {arg3}
    ...
```
## Commands in Detail

### Run Command

`run` will run each command serially; that is, it will wait for the previous command to finish before starting the next command.

The following all result in the same behavior, since the `run` command is serial:
Example 1:
Example 2:
Example 3 (not best-practice):
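The bodies of these three examples are not shown above. Based on the two argument formats defined earlier, three equivalent serial forms would presumably look like the following (the exact assignment of forms to example numbers is a guess; Example 3 mixes an inline first arg with indented args, which is likely why it is marked not best-practice):

```
run
  echo a
  echo b
```

```
run echo a
run echo b
```

```
run echo a
  echo b
```

All three wait for `echo a` to finish before starting `echo b`, which in plain shell terms is just `echo a; echo b`.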
### Pipe Command

`pipe` will pipe each command into the next; that is, it takes the first command, pipes its output to the next command, and so on until the end, finally piping to stdout. `pipe` with only one supplied command is undefined.

Example 1: prints "A" to stdout

```
pipe
  echo a
  transform-to-uppercase
  cat
```

Example 2: prints "A" to stdout

```
pipe echo a
  transform-to-uppercase
  cat
```

Example 3: INVALID, because both `transform-to-uppercase` and `cat` need input (and since these are separate groupings, these lines are NOT piped together)

```
pipe echo a
pipe transform-to-uppercase
pipe cat
```

Example 4: prints "A" to stdout, then prints "B" to stdout

```
pipe
  echo a
  transform-to-uppercase
  cat
pipe
  echo b
  transform-to-uppercase
  cat
```
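Assuming gasket compiles a `pipe` block down to an ordinary shell pipeline, Example 1 corresponds to the following sketch, with `tr` standing in for the hypothetical `transform-to-uppercase` command:

```shell
# Each command's stdout feeds the next command's stdin; the final stage
# writes to stdout. 'tr' replaces the hypothetical transform-to-uppercase.
out=$(echo a | tr '[:lower:]' '[:upper:]' | cat)
echo "$out"
```

As in the datascript example, the single lowercase "a" comes out the far end as "A".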
### Fork Command

`fork` will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. The forked commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

- will print a and b to stdout (in either order)
- after completing those commands, will print baz to stdout

```
fork
  echo a
  echo b
run echo baz
```

Example 2: Same output as Example 1 (not best-practice)

```
fork echo a
  echo b
run echo baz
```

Example 3: will print a and b to stdout (in either order) before exiting.
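Example 3's body is not shown above. Given that a `fork` with no following command-type makes gasket wait implicitly before exiting, it would presumably be just:

```
fork
  echo a
  echo b
```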
### Background Command

`background` will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. The background commands are not guaranteed to execute in the order you supply them.

Example 1 (best-practice):

- will print a, b, and baz to stdout (in any order)

```
background
  echo a
  echo b
run echo baz
```

Example 2: Same output as Example 1 (not best-practice)

```
background echo a
  echo b
run echo baz
```

Example 3: starts a node server; `run echo a` does not wait for `run node server.js` to finish. After completing the last command (in this case `run echo a`), gasket will NOT wait for the background command (`run node server.js`) to finish, but will properly terminate it.

```
background
  run node server.js
run echo a
```
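Assuming gasket maps these to ordinary shell job control, `fork` corresponds to `&` followed by `wait`, and `background` to `&` without a `wait`. A sketch (the parallel jobs' output is sorted only to make it deterministic):

```shell
# fork-like: launch in parallel with '&', then 'wait' so the next
# command-type runs only after both jobs finish.
fork_out=$( { echo a & echo b & wait; } | sort )
next_out=$(echo baz)    # in a real script this runs only after 'wait'

# background-like: '&' with no 'wait' -- the script continues (and may
# exit) while the job is still running; its output is discarded here.
(sleep 1; echo "server output") >/dev/null &
bg_out=$(echo a)
```

The `sleep 1` job here plays the role of the long-lived `node server.js`: the foreground `echo a` does not wait for it.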
### Map Command

`map` is a multiway pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as `fork` commands, so the map operation pipes the first command to them in parallel (and therefore no order is guaranteed). `map` with only one supplied command is undefined.

Example 1 (best-practice): in either order:

- pipes data.json to `dat import`
- pipes data.json to `cat`

```
map curl http://data.com/data.json
  dat import
  cat
```

Example 2: Same output as Example 1

```
map
  curl http://data.com/data.json
  dat import
  cat
```
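A rough shell sketch of the one-to-many idea, using `tee` to duplicate one producer's output to two stand-in consumers (`tr` and a temp file replace `dat import` and `cat`; gasket would run the consumers in parallel, which plain `tee` does not guarantee):

```shell
# One producer (echo abc) duplicated to two consumers via tee.
tmp=$(mktemp)                                              # consumer 2 reads this file
out=$(echo abc | tee "$tmp" | tr '[:lower:]' '[:upper:]')  # consumer 1: uppercase
file_out=$(cat "$tmp")                                     # consumer 2: plain copy
rm -f "$tmp"
```

Both consumers see the same single upstream stream, which is the essence of `map`.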
### Reduce Command

`reduce` is a multiway pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as `fork` commands, so the reduce operation pipes each of them to the first command in parallel (and therefore no order is guaranteed). `reduce` with only one supplied command is undefined.

Example 1 (best-practice): in either order:

- pipes papers to `dat import`
- pipes taxonomy to `dat import`

```
reduce dat import
  papers
  taxonomy
```

Example 2: Same output as Example 1

```
reduce
  dat import
  papers
  taxonomy
```
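A shell sketch of the many-to-one idea, assuming the producers are plain commands: both run in parallel, and their merged output feeds a single consumer (`sort` stands in for `dat import`, and also makes the nondeterministic arrival order deterministic):

```shell
# Two producers run in parallel; 'wait' lets both finish, and their
# combined stream is piped into one consumer.
out=$( { echo papers & echo taxonomy & wait; } | sort )
```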
## Defining and Executing a Pipeline

The `pipeline` keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.

Example 1: an import-data pipeline is defined. It imports 1, 2, and 3 in parallel before printing "done importing" to stdout. After converting from datscript to gasket, run the pipeline from the command line with `gasket run import-data`.

```
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
```

Example 2: Same output as Example 1, but run from within the datscript file.

```
run import-data

pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
```

You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.

Example 3: nested command-types in a pipeline. Will print a, then print b, then print C.

```
pipeline baz
  run
    echo a
    echo b
  pipe
    echo c
    transform-to-upper-case
    cat
```

Example 4: INVALID, because pipelines can only be defined at the shallowest layer.

```
pipeline foo
  run echo a
  pipeline bar
```

// TODO: Lots of tricky cases to think about here.

Example 6: executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that `pipeline foo` might be invalid syntax.

```
map bar
  foo

pipeline foo
  baz
  bat
```
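Under the assumption that gasket ultimately drives ordinary processes, the pipeline-as-a-command idea resembles a shell function: a named group of steps invocable like any other command. Here `echo "import N"` stands in for the hypothetical `import N` commands of Example 1:

```shell
# pipeline import-data, sketched as a shell function:
# a fork step ('&' + wait) followed by a run step.
import_data() {
  { echo "import 1" & echo "import 2" & echo "import 3" & wait; }  # fork
  echo "done importing"                                            # run
}

# Invoking the pipeline like a command; 'wait' guarantees the imports
# finish before the final line prints, so the last line is deterministic.
out=$(import_data | tail -n 1)
```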
## Misc

This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).