preaction / etl-yertl
ETL With a Shell
Home Page: http://preaction.me/yertl
License: Other
Would parallelizing make this faster? One process for input/output, multiple for script interpretation?
Or perhaps parsing/compiling the script into something that doesn't require regexps to interpret?
Or perhaps just write the whole damned thing in C...
YAML Error: Expected separator '---'
Code: YAML_PARSE_ERR_NO_SEPARATOR
Line: 12
Document: 2
at /usr2/local/perlbrew/perls/perl-5.16.3/lib/site_perl/5.16.3/YAML/Loader.pm line 81.
This isn't very helpful with invalid YAML.
We need to know which file was bad (or STDIN), and we should keep going with the other files if we can, maybe?
group_by requires a lot of memory. A 500MB file has so far used 800MB of memory.
Of course, once we get into more complex programs, we want the group_by data as full documents, so we can't do anything that assumes we no longer need the full document structure...
An on-disk scratch area might work. I believe sort(1) does this when necessary... It'd be slow, but it would work.
A way of splitting / teeing a large file into a bunch of smaller files would also work. So, group_by, but instead writing the results to a directory so that individual files can be worked with.
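The directory-splitting idea could look something like this rough Python sketch. `partition_to_dir`, its arguments, and the output layout are all hypothetical, not anything yertl does today:

```python
import os
from collections import defaultdict

def partition_to_dir(docs, key, outdir):
    """Append each document to a per-key file, so a later group_by can
    work one small file at a time instead of holding everything in memory.
    (Hypothetical sketch; a real tool would write proper YAML documents.)"""
    os.makedirs(outdir, exist_ok=True)
    counts = defaultdict(int)
    for doc in docs:
        k = str(key(doc))
        counts[k] += 1
        with open(os.path.join(outdir, k), "a") as fh:
            fh.write(repr(doc) + "\n")
    return dict(counts)
```

Each key's documents can then be processed independently, keeping peak memory proportional to one group rather than the whole input.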
It seems JSON-specific, but that's what yto and yfrom are for:
https://metacpan.org/release/Data-Partial-Google
https://github.com/nemtsov/json-mask
Tiny language for selecting parts of a data structure
Convert the internal doc format (YAML for now) into the desired format.
CSV and JSON to start. Do not require the CSV module; make a way for progressive enhancement.
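The "progressive enhancement" idea, prefer an optional module but fall back to something naive when it's missing, might look like this Python sketch (`make_csv_formatter` is a hypothetical name; yertl would do this in Perl with Text::CSV):

```python
def make_csv_formatter():
    """Return a row-formatting function: use the csv module if available,
    otherwise fall back to a naive join. (Illustrative sketch only.)"""
    try:
        import csv, io
        def fmt(row):
            buf = io.StringIO()
            csv.writer(buf).writerow(row)
            return buf.getvalue().rstrip("\r\n")
    except ImportError:
        def fmt(row):  # naive fallback: no quoting of embedded commas
            return ",".join(str(v) for v in row)
    return fmt
```

The caller never knows which implementation it got; the enhanced path just quotes correctly where the fallback cannot.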
Write a document to the file named by FILTER. Along with string interpolation (#19), this would let us split one large stream into multiple smaller streams.
Perhaps tee( FILTER ) as well? Perhaps this should simply be > FILTER?
The problem seems to be that YAML::Tiny doesn't even try to parse certain parts of YAML which are useful.
I could do without the blessing into objects thing though...
The killer feature is being able to string together long-running, distributed workers.
I want it for logstash, mostly, but I could see it being infinitely useful in other places. Hum... But what I want it for is more line-oriented than it is document-oriented...
A DBI-based document input/output command. Should be able to read from databases and write to databases. Configuration (DSN, user, pass) should be in a File::HomeDir file.
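A rough sketch of the read side using Python's sqlite3 for illustration (`read_docs` and its arguments are hypothetical; the real command would use DBI and pull DSN/user/pass from the config file):

```python
import sqlite3

def read_docs(dsn, query):
    """Yield each row of `query` as a document (dict), the way a
    DBI-based input command might emit one YAML document per row.
    (Sketch only; yertl would do this in Perl with DBI.)"""
    conn = sqlite3.connect(dsn)
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    for row in conn.execute(query):
        yield dict(row)
    conn.close()
```

The write side would be the mirror image: take documents on input and run an INSERT or UPDATE per document.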
# Child (filename) exited without calling finalize()
# Failed test 'filename'
# at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 263.
Can't locate object method "indent_length" via package "JSON::XS" at /home/nbkyslo/.cpanm/work/1415647630.31935/App-YAML-Filter-0.012/t/bin/../../bin/yto line 80, <$fh> line 3.
# Child (DOC -> JSON) exited without calling finalize()
# Failed test 'DOC -> JSON'
# at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 276.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/bin/yto.t ......
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/2 subtests
This will be tricky... How do we detect date/time? It's most likely just text. Perhaps if the LHS of a comparison is some kind of datetime object we try to parse the RHS as a datetime string / epoch time / something?
I'd really like to be able to do: select( .timestamp > Date("2014-05-06") )
Humph. How long until I'd be better off with SQLite?
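The "if the LHS is a datetime, coerce the RHS" idea might be sketched like this in Python (`compare` and its coercion rules are hypothetical, not part of yertl):

```python
from datetime import datetime

def compare(lhs, rhs):
    """Three-way compare. If either side is a datetime, try to parse the
    other side as an ISO date string or an epoch number first.
    (Illustrative sketch of the idea above.)"""
    def coerce(val):
        if isinstance(val, datetime):
            return val
        if isinstance(val, (int, float)):
            return datetime.fromtimestamp(val)
        return datetime.fromisoformat(str(val))
    if isinstance(lhs, datetime) or isinstance(rhs, datetime):
        lhs, rhs = coerce(lhs), coerce(rhs)
    return (lhs > rhs) - (lhs < rhs)
```

With something like this, `select( .timestamp > Date("2014-05-06") )` only needs Date() to produce the datetime object; the comparison operator handles the rest.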
LINQ is MS's language for doing basically this. We should implement a command-line script for it. @tobyink has started a library for it: https://github.com/tobyink/p5-linq. This is very close to what we want to achieve: http://en.wikipedia.org/wiki/Language_Integrated_Query
A multitude of languages for working with documents is a good thing.
Perhaps YAML's object serialization can be useful here...
Here's a fun one! Nothing in the parser is prepared for this one...
We should appear to use YAML's syntax for this, which is good because it's approximately the same syntax that JSON (and thus jq) uses.
The leading . sigil will save us here though... so we can do:
# input
symbol: FOO
timestamp: 3
fields:
  foo: BAR
  baz: FUZZ

$ yq '{ foo: .fields.foo, .symbol: .timestamp }'

# output
foo: BAR
FOO: 3
In theory...
Running [make]...
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run aclocal-1.11 -I m4
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `aclocal-1.11' is missing on your system. You should only need it if
you modified `acinclude.m4' or `configure.ac'. You might want
to install the `Automake' and `Perl' packages. Grab them from
any GNU archive site.
cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run automake-1.11 --gnu
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `automake-1.11' is missing on your system. You should only need it if
you modified `Makefile.am', `acinclude.m4' or `configure.ac'.
You might want to install the `Automake' and `Perl' packages.
Grab them from any GNU archive site.
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run autoconf
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
make: *** [configure] Error 1
Making libmarpa: make Failure at inc/Marpa/R2/Build_Me.pm line 701.
-> FAIL Installing Marpa::R2 failed. See /home/nbkyslo/.cpanm/work/1418073009.30590/build.log for details. Retry with --force to force install it.
-> FAIL Installing the dependencies failed: Module 'Marpa::R2' is not installed
-> FAIL Bailing out the installation for App-YAML-Filter-0.012.
Classic Perl version mismatch because of PERL5LIB. This makes App-YAML-Filter completely uninstallable.
sort() should take 0-2 arguments:

sort()
- Sort the documents. The document may be a simple scalar, so this can be useful.
sort( EXPR )
- String sort based on the result of EXPR. This is more commonly useful than a strict numeric sort, so we make it the default here.
sort( CMP_EXPR, VAL_EXPR )
- Sort the values from VAL_EXPR using the expression CMP_EXPR.

This requires that we implement the two binary tri-value comparison operators cmp and <=>. We provide the values to CMP_EXPR as $a and $b, so numeric sorting becomes:

sort( $a <=> $b, .foo )

And reverse sorting becomes:

sort( $b cmp $a, .foo )

I'm not sure if I like all that this implies ($a/$b being special, what does sort( $a cmp $b ) do, reverse sorting is implicit rather than an explicit reverse() function), and the implementation of CMP_EXPR will be interesting bordering on clever. I do like the cmp/<=> operators though.

It appears that VAL_EXPR will make an implied Schwartzian transform, which is nice. sort( $a.baz cmp $b.baz, .foo ) will be possible, though it would be better expressed as simply sort( .foo.baz ). Same with sort( length( $a ) <=> length( $b ) ) being better expressed as sort( $a <=> $b, length( .foo ) ).
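The implied Schwartzian transform is just decorate-sort-undecorate: compute VAL_EXPR once per document, sort the decorated pairs with CMP_EXPR, then strip the decoration. A Python sketch (`yertl_sort`, `val_expr`, and `cmp_expr` are hypothetical names standing in for the proposal above):

```python
import functools

def yertl_sort(docs, val_expr, cmp_expr):
    """sort( CMP_EXPR, VAL_EXPR ) as decorate-sort-undecorate:
    val_expr runs once per document, cmp_expr compares the cached values
    ($a and $b in the proposal). (Sketch, not yertl's implementation.)"""
    decorated = [(val_expr(d), d) for d in docs]
    decorated.sort(key=functools.cmp_to_key(lambda a, b: cmp_expr(a[0], b[0])))
    return [d for _, d in decorated]
```

Because VAL_EXPR is evaluated only once per document, an expensive lookup like .foo.bar.baz never runs O(n log n) times.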
yq currently chokes on input like BAR/NONE, since / is not \w. Use Text::Balanced maybe?
Get the length of the result of the given EXPR. Should perform exactly as jq's function.
We should support frontmatter as a document format, putting the markdown in the "content" key.
This may or may not do what I wish: allow queries over a Statocles site.
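A minimal sketch of the frontmatter idea, assuming the usual `---` delimiters and doing only naive key/value parsing to stay dependency-free (a real implementation would hand the header to the YAML parser):

```python
def parse_frontmatter(text):
    """Split a frontmatter document into header fields plus the markdown
    body under a "content" key, as proposed above. (Illustrative sketch;
    header parsing here is deliberately naive.)"""
    if not text.startswith("---\n"):
        return {"content": text}
    header, _, body = text[4:].partition("\n---\n")
    doc = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        doc[key.strip()] = value.strip()
    doc["content"] = body
    return doc
```

That would make every page of a Statocles site a queryable document.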
EXPR1 | EXPR2 should give the result of EXPR1 as the document for EXPR2.
This makes if .foo.bar then .fuzz else empty simpler via select( .foo.bar ) | .fuzz.
I'm starting to wonder if if EXPR then EXPR else EXPR is even necessary... I suppose, as the most general case, it is...
-v could help immensely with debugging programs, especially when they're really bugs in the grammar.
If we're going to do this thing: a jq-like language, and perhaps other things like it. And if we want to support multiple document formats, we need to come up with some kind of umbrella name for the protocol these things use to talk to each other.
Right now all that protocol will be is a single envvar describing the document format (possibly via a MIME-like content-type string), but we need a name for it so other people can join the effort in their own languages.
Allow Perl's .. syntax, Perl's , syntax (which is just 1..5 expanded to 1,2,3,4,5), and jq/Python's [1:2] syntax.
I love how everyone has a different way of doing it. Ruby's slice being [start,length] rather than [start,end] is especially fun. I will not be implementing Ruby's way of doing it, since it does not allow cherry-picking individual array elements, like "I want the first and last element": @mylist[1, -1].
Python's slice syntax is, IMHO, the best. Though it's weird that they don't enforce an explicit reverse( mylist[0:2] ) and instead allow a negative step like mylist[::-1].
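For reference, the Python semantics being discussed here: end-exclusive [start:end] slices, a negative step instead of an explicit reverse, and cherry-picking (which plain slices cannot do):

```python
mylist = [10, 20, 30, 40, 50]

# [start:end] is end-exclusive, like jq's slice
assert mylist[1:3] == [20, 30]

# negative step reverses without an explicit reverse()
assert mylist[::-1] == [50, 40, 30, 20, 10]

# cherry-picking individual elements needs a comprehension in Python
first_last = [mylist[i] for i in (0, -1)]
assert first_last == [10, 50]
```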
>, >=, <, and <=.
These aren't of much use without allowing functions to be one of the operands, but I think I can find some use for them...
Deeper data structures can be modeled with additional header lines, like
FOO,FOO,BAR,BAR
A,B,A,B
1,2,3,4
5,6,7,8
Could yield:
FOO:
- A: 1
  B: 2
- A: 5
  B: 6
BAR:
- A: 3
  B: 4
- A: 7
  B: 8
Hash of hashes is the only thing I want to do with this for now.
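The two-header-row example above could be parsed roughly like this Python sketch (`parse_two_header_csv` is a hypothetical name; it assumes exactly two header rows and leaves cells as strings):

```python
import csv, io
from collections import defaultdict

def parse_two_header_csv(text):
    """Parse a CSV whose first row names the outer keys and whose second
    row names the inner keys, yielding a hash of lists of hashes as in
    the example above. (Illustrative sketch only.)"""
    rows = list(csv.reader(io.StringIO(text)))
    outer, inner = rows[0], rows[1]
    result = defaultdict(list)
    for data in rows[2:]:
        groups = {}
        for o, i, cell in zip(outer, inner, data):
            groups.setdefault(o, {})[i] = cell
        for o, rec in groups.items():
            result[o].append(rec)
    return dict(result)
```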
Use the standard shell construct $(EXPR), and maybe going so far as to implement more of the suggestions found in POSIX: http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html
This gets us which file we're currently in, which we can use to help debugging. It also makes scripting around yq easier: instead of a for loop, we can use find | xargs yq.
This does make testing harder, since we can't hijack <> like we can STDIN. At least, I don't think we can. There has to be a way... Or we'll just make a way to say "use STDIN instead of <>".
I think File::Slurp had a way to do <> in a nice way...
If I only want certain fields to make it through the filter, right now I have to build my own document up out of the fields I want:
yq '{ foo: .foo, bar: .bar, baz: .baz }'
Which can get unwieldy. A simpler hash slice syntax like:
yq '.{foo,bar,baz}'
would be much nicer.
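A hash slice is just a key subset. In Python terms (`hash_slice` is a hypothetical helper, not yertl syntax):

```python
def hash_slice(doc, keys):
    """Keep only the named keys: what a `.{foo,bar,baz}` syntax would do,
    instead of rebuilding the hash field by field. Missing keys are
    silently skipped. (Illustrative sketch.)"""
    return {k: doc[k] for k in keys if k in doc}
```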
jq glosses over the difference between arrays and hashes, mainly because JavaScript makes very little distinction (there's only one member lookup syntax, [...]). Python also has one member lookup that list and dict both share, but 0 and "0" would not get the same elements in a dictionary that contained both. Ruby also has one member lookup syntax for arrays and hashes, but I can't tell if it makes a difference between the string "0" and the number 0.
See Regexp::Common maybe?
Get the keys of a hash or an array.
I think I may now be conflating things that return multiple documents, and things that return a single document containing an array (the difference between an array and an arrayref). We must be cautious...
A generic web service I/O layer. Not sure how this one will work at all, but it'd be nice to make web services full citizens of the UNIX philosophy.
We need a basic operation and a bunch of optional dependencies. Some features may be filled by multiple dependencies and we should try to be as flexible as possible about that.
We're trying not to install half of CPAN here...
Doqu? Dock-You? Doq-U?
I cannot do:
{ foo: .foo }, { bar: .bar }
to yield two hashes out of the current document.
This is a shorthand for if FILTER then FILTER else empty. Also give it the name grep(), because why not?
Clean up the grammar by making , a list constructor and : a pair constructor. Then it will make more sense to put , separated stuff as args to functions
A generic reduce() function would encapsulate all uses of the current $scope hashref, allowing us to make a bunch of functions in terms of reduce().
Except... I do not like jq's syntax for this at all... Perhaps using Perl's $a and $b?
reduce( push( $a[$b.foo], $b ) ) <- group_by( .foo )
This apparently introduces variables, the topic variable (implicit doc lookups), and the push() function. And probably assignment and more binary ops...
I'll probably want to tackle these things one at a time...
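group_by expressed via a generic reduce, mirroring the `reduce( push( $a[$b.foo], $b ) )` sketch above ($a is the accumulator, $b the current document). Python for illustration; `group_by` and `push` here are hypothetical stand-ins:

```python
from functools import reduce

def group_by(docs, expr):
    """group_by( EXPR ) in terms of a generic reduce: push each document
    onto a list keyed by EXPR's result. (Sketch of the idea above, not
    yertl's implementation.)"""
    def push(acc, doc):                       # acc is $a, doc is $b
        acc.setdefault(expr(doc), []).append(doc)
        return acc
    return reduce(push, docs, {})
```

sum(), count(), min()/max() and friends would all fall out the same way, each as a reduce with a different accumulator.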
If the CSV file has been formatted to look like columns for visual inspection, we should remove that leading/trailing whitespace.
This use of xargs: cat *_pages.txt | awk '{ print $2 }' | xargs -I{} yq 'grep( .symbol eq {} )' ursa_data.yml is going to walk over the file once for each line item, which is impossibly slow with a 500MB file.
group_by( EXPR ) will be much nicer.
We should be able to sort documents based on an arbitrary expression.
Then if .foo.bar then . else empty will work.
I also could have used this one.
We are going to treat hashes and arrays differently, so this will not return the values of a hash, unlike jq (because JavaScript does not treat them differently). We will add a values() function that will do both (exactly like Perl's values function).
It'd be nice if there were trim(), rtrim(), and ltrim() for trimming whitespace around things.
Should we have everything in a global namespace?
This should be part of my "bare minimum" script template.
Instead of throwing exceptions like YAML::Tiny does, YAML does Carp::cluck whenever a parsing error occurs. Trap this and add which file we're currently in.
Maybe YAML::LoadFile already does this? Maybe if @ARGV contains more than one thing we should be using YAML::LoadFile?
Why can't we apply our mini-language to JSON? or XML (ew)? or any other structured document format?
Why not indeed!
Skip a preamble or some kind of data that we don't really care about
Same behavior as jq. I could have really used this one today.
uniq( EXPR ) should print the result of EXPR if and only if that result has not been seen before.
uniq() with no EXPR is the same as uniq(.). This way, when pipes are implemented, we can do .foo | uniq as the equivalent of uniq( .foo ), or more usefully, select( .foo eq bar ) | uniq as the equivalent of uniq( select( .foo eq bar ) ).
uniq() should yield values interactively: we should not have to wait until all input is exhausted before printing output, so uniq() will need to print a value only if it has not already been seen in a lookup cache. If there are multiple uniq() calls, each must keep its own cache of seen values. (Bonus: this lookup cache could build a histogram... somehow...)
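The interactive, per-instance seen-cache behavior can be sketched as a generator (Python for illustration; yertl's version would be Perl):

```python
def uniq(docs, expr=None):
    """Yield each document whose EXPR result has not been seen before,
    emitting as soon as input arrives rather than buffering everything.
    Each call gets its own seen-cache, as the note above requires."""
    expr = expr or (lambda d: d)
    seen = set()
    for doc in docs:
        key = expr(doc)
        if key not in seen:
            seen.add(key)
            yield doc
```

The `seen` set is also exactly the place a histogram could accumulate, by swapping it for a counter keyed the same way.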
Right now, binary comparisons only allow FILTERs. They should allow function calls as well.
They can't (yet) allow all EXPRs, since binary operators are EXPRs.