preaction / etl-yertl

ETL With a Shell

Home Page: http://preaction.me/yertl

License: Other

Perl 78.89% CSS 2.42% HTML 7.01% Raku 11.68%

etl-yertl's People

Contributors

Stargazers

Watchers

Forkers

mishin grtodd mcandre jkeenan standardgalactic

etl-yertl's Issues

Performance enhancements

Will parallelizing make faster? One process for input/output, multiple for script interpretation?

Or perhaps parsing/compiling the script into something that doesn't require regexp to interpret?

Or perhaps just write the whole damned thing in C...

Multiple files with no separator between them?

YAML Error: Expected separator '---'
   Code: YAML_PARSE_ERR_NO_SEPARATOR
   Line: 12
   Document: 2
 at /usr2/local/perlbrew/perls/perl-5.16.3/lib/site_perl/5.16.3/YAML/Loader.pm line 81.

This isn't very helpful with invalid YAML

We need to know what file was bad (or STDIN), and we should keep going with other files if we can, maybe?

Memory issues with group_by

group_by requires a lot of memory. A file of 500M has so far used 800M of memory.

Of course, once we get into more complex programs, we want to get the group_by data as full documents, so we can't do anything that assumes we do not need the full document structure anymore...

An on-disk scratch area might work. I believe sort(1) does this when necessary... It'd be slow, but it would work.

A way of splitting / teeing a large file into a bunch of smaller files would also work. So, group_by, but instead writing the results to a directory so that individual files can be worked with.
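
The "write the groups to a directory" idea could look roughly like the sketch below. It is in Python rather than Perl, and the function name `partition_by`, the one-file-per-key naming, and the JSON-lines output are all assumptions made for illustration; nothing here is Yertl's actual behavior.

```python
import json
import os
import tempfile


def partition_by(docs, key_func, out_dir):
    """Append each document to a file named after its group key.

    Only one document is held in memory at a time; the groups live on disk.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = set()
    for doc in docs:
        # Hypothetical naming scheme: one "<key>.jsonl" file per group.
        # Keys containing path separators would need escaping first.
        path = os.path.join(out_dir, f"{key_func(doc)}.jsonl")
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(doc) + "\n")
        paths.add(path)
    return sorted(paths)


docs = [
    {"symbol": "FOO", "n": 1},
    {"symbol": "BAR", "n": 2},
    {"symbol": "FOO", "n": 3},
]
out_dir = tempfile.mkdtemp()
files = partition_by(iter(docs), lambda d: d["symbol"], out_dir)
```

Memory use stays flat regardless of input size, at the cost of reopening a file for every document; caching open handles would be the obvious next step.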

App::jt is another See Also

It seems JSON specific, but that's what yto and yfrom are for

json-mask

https://metacpan.org/release/Data-Partial-Google

https://github.com/nemtsov/json-mask

Tiny language for selecting parts of a data structure

yfrom and yto

Convert the internal doc format (yaml for now) into the desired format.

CSV and JSON to start. Do not require the CSV module. Make a way for progressive enhancement.

write( FILTER )

Write a document to the file named by FILTER. Along with string interpolation #19, this would let us split up one large stream into multiple smaller streams.

Perhaps tee( FILTER ) as well?

Perhaps this should simply be > FILTER?

Evaluate YAML::Tiny

The problem seems to be that YAML::Tiny doesn't even try to parse certain parts of YAML which are useful.

I could do without the blessing into objects thing though...

[ymsg] Send/recv documents over messaging protocols (zeromq, activemq)

The killer feature is being able to string together long-running, distributed workers.

I want it for logstash, mostly, but I could see it being infinitely useful in other places. Hum... But what I want it for is more line-oriented than it is document-oriented...

DBI input/output -- ysql

A DBI-based document input/output command. Should be able to read from databases and write to databases. Configuration (DSN, user, pass) should be in a File::HomeDir file.

JSON::XS dependency requires a certain version

        # Child (filename) exited without calling finalize()

    #   Failed test 'filename'
    #   at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 263.
Can't locate object method "indent_length" via package "JSON::XS" at /home/nbkyslo/.cpanm/work/1415647630.31935/App-YAML-Filter-0.012/t/bin/../../bin/yto line 80, <$fh> line 3.
    # Child (DOC -> JSON) exited without calling finalize()

#   Failed test 'DOC -> JSON'
#   at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 276.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/bin/yto.t ......
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/2 subtests

This will be tricky... How do we detect date/time? It's most likely just text. Perhaps if the LHS of a comparison is some kind of datetime object we try to parse the RHS as a datetime string / epoch time / something?

I'd really like to be able to do: select( .timestamp > Date("2014-05-06") )

Humph. How long until I'd be better off with SQLite?

LINQ

LINQ is MS's language for doing basically this. We should implement a command-line script for it. @tobyink has started a library for it: https://github.com/tobyink/p5-linq. This is very close to what we want to achieve: http://en.wikipedia.org/wiki/Language_Integrated_Query

A multitude of languages for working with documents is a good thing.

Perhaps YAML's object serialization can be useful here...

Build hashes / arrays

Here's a fun one! Nothing in the parser is prepared for this one...

We should be able to use YAML's syntax for this, which is good because it's approximately the same syntax that JSON (and thus jq) uses.

The leading . sigil will save us here though... so we can do:

# input
symbol: FOO
timestamp: 3
fields:
    foo: BAR
    baz: FUZZ
$ yq '{ foo: .fields.foo, .symbol: .timestamp }'
foo: BAR
FOO: 3

In theory...
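
A rough Python sketch of the semantics being proposed: a hash constructor whose keys and values may each be either a literal or a filter evaluated against the current document. The helper names `path` and `build_hash` are hypothetical and say nothing about how the parser would actually represent this.

```python
def path(*keys):
    """Return a filter that looks up a nested key path in a document."""
    def lookup(doc):
        for key in keys:
            doc = doc[key]
        return doc
    return lookup


def build_hash(pairs):
    """Return a filter that builds a new hash from (key, value) pairs.

    A key or value that is callable is treated as a filter and evaluated
    against the document; anything else is used as a literal.
    """
    def evaluate(part, doc):
        return part(doc) if callable(part) else part

    def construct(doc):
        return {evaluate(k, doc): evaluate(v, doc) for k, v in pairs}
    return construct


doc = {"symbol": "FOO", "timestamp": 3, "fields": {"foo": "BAR", "baz": "FUZZ"}}

# Rough equivalent of: { foo: .fields.foo, .symbol: .timestamp }
program = build_hash([
    ("foo", path("fields", "foo")),
    (path("symbol"), path("timestamp")),
])
result = program(doc)
```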

Marpa::R2 does not install if /usr/bin/perl is different from used Perl

Running [make]...
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run aclocal-1.11 -I m4
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `aclocal-1.11' is missing on your system.  You should only need it if
         you modified `acinclude.m4' or `configure.ac'.  You might want
         to install the `Automake' and `Perl' packages.  Grab them from
         any GNU archive site.
 cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run automake-1.11 --gnu
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `automake-1.11' is missing on your system.  You should only need it if
         you modified `Makefile.am', `acinclude.m4' or `configure.ac'.
         You might want to install the `Automake' and `Perl' packages.
         Grab them from any GNU archive site.
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run autoconf
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
make: *** [configure] Error 1
Making libmarpa: make Failure at inc/Marpa/R2/Build_Me.pm line 701.
-> FAIL Installing Marpa::R2 failed. See /home/nbkyslo/.cpanm/work/1418073009.30590/build.log for details. Retry with --force to force install it.
-> FAIL Installing the dependencies failed: Module 'Marpa::R2' is not installed
-> FAIL Bailing out the installation for App-YAML-Filter-0.012.

Classic Perl version mismatch because of PERL5LIB. This causes App-YAML-Filter to not be installable, at all.

Enhanced sorting (numeric, reverse, custom comparisons)

sort() should take 0-2 arguments:

sort() - Sort the documents. The document may be a simple scalar, so this can be useful
sort( EXPR ) - String sort based on the result of EXPR. This is more commonly useful than a strict numeric sort, so we make it the default
sort( CMP_EXPR, VAL_EXPR ) - Sort the values from VAL_EXPR using the expression CMP_EXPR.

This requires that we implement the two binary tri-value comparison operators cmp and <=>. We provide the values to CMP_EXPR as $a and $b, so numeric sorting becomes:

sort( $a <=> $b, .foo )

And reverse sorting becomes:

sort( $b cmp $a, .foo )

I'm not sure if I like all that this implies ($a/$b being special, what does sort( $a cmp $b ) do, reverse sorting is implicit rather than an explicit reverse() function), and the implementation of the CMP_EXPR will be interesting bordering on clever. I do like the cmp/<=> operators though.

It appears that VAL_EXPR will make an implied Schwartzian Transform, which is nice. sort( $a.baz cmp $b.baz, .foo ) will be possible, though it would be better expressed as simply sort( .foo.baz ). Same with sort( length( $a ) <=> length( $b ) ) being better expressed as sort( $a <=> $b, length( .foo ) ).
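
To make the implied Schwartzian Transform concrete, here is a small Python sketch of sort( CMP_EXPR, VAL_EXPR ). The names `sort_docs` and `cmp` are made up for this example. Python has no separate cmp and <=> operators, so string versus numeric ordering is chosen here by what the value expression returns, which is an assumption of the sketch rather than part of the proposal.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Three-way comparison in the spirit of Perl's cmp and <=>: -1, 0, or 1."""
    return (a > b) - (a < b)


def sort_docs(docs, val_expr=lambda doc: doc, cmp_expr=cmp):
    """Sort documents by comparing the values that val_expr extracts."""
    # Decorate: evaluate val_expr exactly once per document.
    decorated = [(val_expr(doc), doc) for doc in docs]
    # Sort: cmp_expr only ever sees the extracted values ($a and $b).
    decorated.sort(key=cmp_to_key(lambda x, y: cmp_expr(x[0], y[0])))
    # Undecorate: hand back the original documents.
    return [doc for _, doc in decorated]


docs = [{"foo": 10}, {"foo": 9}, {"foo": 100}]

# sort( $a <=> $b, .foo )
numeric = sort_docs(docs, lambda d: d["foo"])
# sort( $b <=> $a, .foo )
reverse = sort_docs(docs, lambda d: d["foo"], lambda a, b: cmp(b, a))
# sort( .foo ), i.e. a string sort
as_text = sort_docs(docs, lambda d: str(d["foo"]))
```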

Allow quoted strings

yq currently chokes on input like BAR/NONE, since / is not \w. Use Text::Balanced maybe?

length( EXPR )

Get the length of the result of the given EXPR. Should perform exactly as jq's function.

support frontmatter

We should support frontmatter as a document format, putting the markdown in the "content" key.

This may or may not do what I wish: allow queries over a Statocles site.

Implement pipes between expressions

EXPR1 | EXPR2 should give the result of EXPR1 as the document for EXPR2.

This makes if .foo.bar then .fuzz else empty simpler via select( .foo.bar ) | .fuzz.

I'm starting to wonder whether the if EXPR then EXPR else EXPR form is even necessary... I suppose as the most general case, it is...
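
One way to picture the pipe: every filter maps a document to zero or more results, and a pipe feeds each result of the left side to the right side. The Python sketch below follows that model; `select`, `get`, and `pipe` are illustrative names, and treating a missing key as false is an assumption.

```python
def select(predicate):
    """select( EXPR ): yield the document only if the predicate is true."""
    def run(doc):
        if predicate(doc):
            yield doc
    return run


def get(key):
    """.key: yield the value stored under key."""
    def run(doc):
        yield doc[key]
    return run


def pipe(*filters):
    """EXPR1 | EXPR2 | ...: feed every result of one filter into the next."""
    def run(doc):
        results = [doc]
        for f in filters:
            results = [out for item in results for out in f(item)]
        yield from results
    return run


# Rough equivalent of: select( .foo.bar ) | .fuzz
program = pipe(select(lambda d: d.get("foo", {}).get("bar")), get("fuzz"))

docs = [
    {"foo": {"bar": 1}, "fuzz": "yes"},
    {"foo": {"bar": 0}, "fuzz": "no"},
    {"fuzz": "skipped"},
]
output = [out for doc in docs for out in program(doc)]
```

Because a filter that yields nothing plays the role of empty, the else empty branch falls out of the model for free.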

Add a -v flag to output internal messaging to STDERR

-v could help immensely with debugging programs, especially when they're really bugs in the grammar.

New Name for Document Shell System

If we're going to do this thing:

App::dbi - SQL input/output against DBI stuff
App::mango - MongoDB input/output using Mango
App::YAML::Filter - Filter documents using jq-like language
App::linq - Filter documents using LINQ
App::dpath - Filter documents using XPath-like syntax

And other things perhaps like

App::minion - Message-queue input/output using Minion
App::ws - Webservice input/output using WebService:: modules (I really wish these had a consistent API)

And if we want to support multiple document formats like

YAML
JSON
XML
CSV
BSON perhaps?

We need to come up with some kind of umbrella name for the protocol these things use to talk to each other.

Right now all that protocol will be is a single envvar describing the document format (possibly via a MIME-like content-type string), but we need a name for it so other people can join the effort in their own languages.

Array slices

Allow Perl's .. syntax, Perl's , syntax (which is just 1..5 expanded to 1,2,3,4,5), and jq/python's [1:2] syntax.

I love how everyone has a different way of doing it. Ruby's slice being [start,length] rather than [start,end] is especially fun. I will not be implementing Ruby's way of doing it, since it does not allow cherry-picking individual array elements, like "I want the first and last element": @mylist[0, -1].

Python's slice syntax is, imho, the best. Though it's weird that they don't enforce an explicit reverse( mylist[0:2] ) and instead allow a negative step, as in mylist[1::-1].
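
For comparison, the three slice styles under discussion can be sketched in a few lines of Python. The function `array_slice` and its spec strings are hypothetical; mixed specs such as a range inside a comma list are deliberately not handled.

```python
def array_slice(items, spec):
    """Select elements of items according to a slice spec string.

    Supported forms (spellings assumed for this sketch):
      "1..3"   inclusive range, as in Perl
      "0,-1"   explicit indexes, as in a Perl array slice
      "1:3"    half-open start:end, as in jq and Python
    """
    if ".." in spec:
        start, end = (int(part) for part in spec.split(".."))
        return [items[i] for i in range(start, end + 1)]
    if ":" in spec:
        start, end = (
            int(part) if part.strip() else None for part in spec.split(":")
        )
        return items[start:end]
    return [items[int(part)] for part in spec.split(",")]


items = ["a", "b", "c", "d", "e"]
perl_range = array_slice(items, "1..3")
perl_list = array_slice(items, "0,-1")
jq_style = array_slice(items, "1:3")
```

Note the off-by-one between the styles: the Perl range includes its end index, while the jq/Python form stops just before it.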

Greater-than, Less-than, and "or equal-to" numeric binary operators

>, >=, <, and <=.

These are of limited use until functions are allowed as operands, but I think I can find some use for them...

[yfrom][yto] CSV format should allow multiple header lines

Deeper data structures can be modeled with additional header lines, like

FOO,FOO,BAR,BAR
A,B,A,B
1,2,3,4
5,6,7,8

Could yield:

FOO:
    - A: 1
      B: 2
    - A: 5
      B: 6
BAR:
    - A: 3
      B: 4
    - A: 7
      B: 8

Hash of hashes is the only thing I want to do with this for now.
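
A minimal Python sketch of the hash-of-hashes case, using only the standard csv module. It assumes each data row becomes its own document, with each column's key path read top to bottom through the header lines; the function name `read_nested_csv` and the `header_lines` parameter are invented for this example.

```python
import csv
import io


def read_nested_csv(text, header_lines=2):
    """Yield one nested document per data row of a multi-header CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    headers, data = rows[:header_lines], rows[header_lines:]
    # Each column's key path is read top to bottom through the header rows.
    paths = list(zip(*headers))
    for row in data:
        doc = {}
        for key_path, value in zip(paths, row):
            node = doc
            for key in key_path[:-1]:
                node = node.setdefault(key, {})
            node[key_path[-1]] = value
        yield doc


text = "FOO,FOO,BAR,BAR\nA,B,A,B\n1,2,3,4\n5,6,7,8\n"
docs = list(read_nested_csv(text))
```

Values stay as strings here; deciding when a cell is a number is a separate question.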

String interpolation

Use the standard shell construct $(EXPR), and maybe going so far as to implement more of the suggestions found in POSIX: http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html

Use <> instead of <STDIN>

This gets us which file we're currently in, which we can use to help debugging. It also makes scripting around yq easier: Instead of a for loop, we can use find | xargs yq.

This does make testing harder, since we can't hijack <> like we can STDIN. At least, I don't think we can. There has to be a way... Or we'll just make a way to say "use STDIN instead of <>".

I think File::Slurp had a way to do <> in a nice way...
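
For readers more familiar with Python, its standard fileinput module behaves much like Perl's <>: it walks a list of files, falls back to standard input when the list is empty, and reports the current file name and line number. The sketch below shows how that information can be attached to every line; `read_lines` is a made-up name, and the temporary files exist only so the example can run.

```python
import fileinput
import os
import tempfile


def read_lines(paths):
    """Yield (source, line number, line) for every input line.

    With an empty path list, fileinput reads standard input instead.
    """
    with fileinput.input(files=paths) as stream:
        for line in stream:
            yield stream.filename(), stream.filelineno(), line.rstrip("\n")


# Two throwaway input files for demonstration.
tmp_dir = tempfile.mkdtemp()
first = os.path.join(tmp_dir, "first.yml")
second = os.path.join(tmp_dir, "second.yml")
with open(first, "w", encoding="utf-8") as fh:
    fh.write("a: 1\nb: 2\n")
with open(second, "w", encoding="utf-8") as fh:
    fh.write("c: 3\n")

lines = list(read_lines([first, second]))
```

Taking the path list as a parameter also eases the testing concern: a test can pass file names directly instead of hijacking a global input handle.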

Hash slices

If I only want certain fields to make it through the filter, right now I have to build my own document up out of the fields I want:

yq '{ foo: .foo, bar: .bar, baz: .baz }'

Which can get unwieldy. A simpler hash slice syntax like:

yq '.{foo,bar,baz}'

Would be much nicer.

jq glosses over the difference between arrays and hashes, mainly because JavaScript makes very little distinction (there's only one member lookup syntax, [...]). Python also has one member lookup syntax that list and dict share, but 0 and "0" would not get the same element in a dictionary that contained both. Ruby also has one member lookup syntax for arrays and hashes, but I can't tell if it distinguishes between the string "0" and the number 0.
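
A small Python sketch of what the proposed .{foo,bar,baz} slice would compute, plus a demonstration of the 0 versus "0" point about Python dictionaries. The name `hash_slice` is hypothetical, and silently skipping keys that are absent is an assumption, not a decided behavior.

```python
def hash_slice(doc, keys):
    """Return a new hash holding only the requested keys that doc contains."""
    return {key: doc[key] for key in keys if key in doc}


doc = {"foo": 1, "bar": 2, "baz": 3, "secret": 4}

# Rough equivalent of: .{foo,bar,baz}
sliced = hash_slice(doc, ["foo", "bar", "baz"])

# In Python, the integer 0 and the string "0" are distinct dictionary keys.
lookup = {0: "integer zero", "0": "string zero"}
distinct = lookup[0] != lookup["0"]
```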

Allow all valid numbers, unquoted (decimals, scientific)

See Regexp::Common maybe?

keys( EXPR )

Get the keys of a hash or an array.

I think I may now be conflating things that return multiple documents, and things that return a single document containing an array (the difference between an array and an arrayref). We must be cautious...

Web Service input/output

A generic web service I/O layer. Not sure how this one will work at all, but it'd be nice to make web services full citizens of the UNIX philosophy.

Optional Dependencies

We need a basic operation and a bunch of optional dependencies. Some features may be filled by multiple dependencies and we should try to be as flexible as possible about that.

We're trying not to install half of CPAN here...

IRC channel #doqu on irc.perl.org

Doqu? Dock-You? Doq-U?

Comma combinator does not play well with hash/array constructors

I cannot do:

{ foo: .foo }, { bar: .bar }

to yield two hashes out of the current document.

Add select( BOOLEAN ) function

This is a shorthand for if FILTER then FILTER else empty. Also give it the name grep(), because why not?

Lists and pairs

Clean up the grammar by making , a list constructor and : a pair constructor. Then it will make more sense to pass comma-separated items as arguments to functions.

reduce( )

A generic reduce() function would encapsulate all uses of the current $scope hashref, allowing us to make a bunch of functions in terms of reduce().

Except... I do not like jq's syntax for this at all...

Perhaps using Perl's $a and $b?

reduce( push( $a[$b.foo], $b ) ) <- group_by( .foo )

This apparently introduces variables, the topic variable (implicit doc lookups), the push() function. And probably assignment and more binary ops...

I'll probably want to tackle these things one-at-a-time...
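
As a sanity check on the idea, here is group_by written in terms of a generic reduce in Python, with the accumulator standing in for $a and the current document for $b. The helper names are invented, and starting from an empty hash is an assumption about what the initial value of $a would be.

```python
from functools import reduce


def push_into_group(key_expr):
    """Build the reduce step: push( $a[ key_expr($b) ], $b )."""
    def step(groups, doc):
        # groups plays the role of $a, doc the role of $b.
        groups.setdefault(key_expr(doc), []).append(doc)
        return groups
    return step


def group_by(docs, key_expr):
    """group_by( EXPR ) expressed as a reduce over the document stream."""
    return reduce(push_into_group(key_expr), docs, {})


docs = [
    {"foo": "x", "n": 1},
    {"foo": "y", "n": 2},
    {"foo": "x", "n": 3},
]
groups = group_by(docs, lambda d: d["foo"])
```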

[yfrom] CSV file should trim whitespace

If the CSV file has been formatted to look like columns for visual inspection, we should remove that leading/trailing whitespace.

Implement group_by( EXPR )

This use of xargs cat *_pages.txt | awk '{ print $2 }' | xargs -I{} yq 'grep( .symbol eq {} )' ursa_data.yml is going to walk over the file once for each line item, which is impossibly slow if there's a 500M file.

group_by( EXPR ) will be much nicer.

Implement sort( EXPR ) function

We should be able to sort documents based on an arbitrary expression.

allow if statement to take any filter

Refactor the LHS OP RHS to be a filter that returns true/false.
Add the boolean module to get true/false values.

Then if .foo.bar then . else empty will work.

Add .[] to flatten an array

I also could have used this one.

We are going to treat hashes and arrays differently, so this will not return the values of a hash, unlike jq (because JavaScript does not treat them differently). We will add a values() function that will do both (exactly like Perl's values function).

[yq] trim() functions

It'd be nice if there were trim(), rtrim(), and ltrim() for trimming whitespace around things.

Should we have everything in a global namespace?

Add a --version flag

This should be part of my "bare minimum" script template.

Capture and amend YAML warnings

Instead of throwing exceptions like YAML::Tiny does, YAML does Carp::cluck whenever a parsing error occurs. Trap this and add which file we're currently in.

Maybe YAML::LoadFile already does this? Maybe if @ARGV contains more than one thing we should be using YAML::LoadFile?

Generic Document Query Language

Why can't we apply our mini-language to JSON? or XML (ew)? or any other structured document format?

Why not indeed!

[yfrom] CSV format should allow skipping lines

Skip a preamble or some kind of data that we don't really care about

Add , to create multiple filters

Same behavior as jq. I could have really used this one today.

Implement uniq function

uniq( EXPR ) should print the result of EXPR if and only if that result has not been seen before.

uniq() with no EXPR is the same as uniq(.). This way, when pipes are implemented, we can do .foo | uniq as the equivalent of uniq( .foo ), or more usefully, select( .foo eq bar ) | uniq as the equivalent of uniq( select( .foo eq bar ) ).

uniq() should yield values interactively, we should not have to wait until all input is exhausted before printing output, so uniq() will need to print a value if it has not already been seen in a lookup cache. If there are multiple uniq(), each must keep its own cache of seen values. (bonus: this lookup cache could build a histogram... somehow...)
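
The streaming requirement maps naturally onto a generator. In this Python sketch each call to `uniq` owns its own cache of seen values, and a value is emitted the moment it is first encountered. Fingerprinting values with sorted-key JSON so that hashes and arrays can be cached is an assumption of the sketch.

```python
import itertools
import json


def uniq(docs, expr=lambda doc: doc):
    """Yield each distinct result of expr the first time it appears."""
    seen = set()  # private to this call, so multiple uniq() never share state
    for doc in docs:
        value = expr(doc)
        # Serialize so that unhashable values (hashes, arrays) can be cached.
        fingerprint = json.dumps(value, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield value


docs = [{"foo": "a"}, {"foo": "b"}, {"foo": "a"}]
by_field = list(uniq(docs, lambda d: d["foo"]))  # uniq( .foo )
whole_docs = list(uniq(docs))                    # uniq() is uniq( . )

# Output starts before the input ends: this works on an endless stream.
first_three = list(itertools.islice(uniq(itertools.count()), 3))
```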

Binary comparisons should allow function calls

Right now, binary comparisons only allow FILTERs. They should allow function calls as well.

They can't (yet) allow all EXPRs, since binary operators are EXPRs.
