etl-yertl's People

Contributors

jkeenan, ltriant

etl-yertl's Issues

Performance enhancements

Will parallelizing make it faster? One process for input/output, multiple for script interpretation?

Or perhaps parsing/compiling the script into something that doesn't require regexp to interpret?

Or perhaps just write the whole damned thing in C...

Multiple files with no separator between them?

YAML Error: Expected separator '---'
   Code: YAML_PARSE_ERR_NO_SEPARATOR
   Line: 12
   Document: 2
 at /usr2/local/perlbrew/perls/perl-5.16.3/lib/site_perl/5.16.3/YAML/Loader.pm line 81.

This isn't very helpful with invalid YAML

We need to know what file was bad (or STDIN), and we should keep going with other files if we can, maybe?

Memory issues with group_by

group_by requires a lot of memory. A file of 500M has so far used 800M of memory.

Of course, once we get into more complex programs, we want to get the group_by data as full documents, so we can't do anything that assumes we do not need the full document structure anymore...

An on-disk scratch area might work. I believe sort(1) does this when necessary... It'd be slow, but it would work.

A way of splitting / teeing a large file into a bunch of smaller files would also work. So, group_by, but instead writing the results to a directory so that individual files can be worked with.
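A rough sketch of the "write the groups to a directory" idea, in Python purely for illustration (the project itself is Perl). The function name, the one-file-per-key layout and the JSON-lines format are all assumptions made for the example; the point is only that one document is held in memory at a time.

```python
import json
import os
import tempfile


def split_by_key(docs, key, out_dir):
    """Append each document to out_dir/<value of key>.jsonl as it streams past.

    Only the current document is held in memory; the grouping lives on disk.
    Assumes (hypothetically) that every key value is a safe file name.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = set()
    for doc in docs:
        name = f"{doc[key]}.jsonl"
        with open(os.path.join(out_dir, name), "a", encoding="utf-8") as fh:
            fh.write(json.dumps(doc) + "\n")
        written.add(name)
    return sorted(written)


if __name__ == "__main__":
    stream = iter([
        {"symbol": "FOO", "n": 1},
        {"symbol": "BAR", "n": 2},
        {"symbol": "FOO", "n": 3},
    ])
    with tempfile.TemporaryDirectory() as scratch:
        print(split_by_key(stream, "symbol", scratch))
```

Reopening the file for every document is slow but keeps the sketch free of open-handle limits; a real version would probably cache a bounded number of handles.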

yfrom and yto

Convert the internal doc format (yaml for now) into the desired format.

CSV and JSON to start. Do not require the CSV module. Make a way for progressive enhancement.

write( FILTER )

Write a document to the file named by FILTER. Along with string interpolation #19, this would let us split up one large stream into multiple smaller streams.

Perhaps tee( FILTER ) as well?

Perhaps this should simply be > FILTER?

Evaluate YAML::Tiny

The problem seems to be that YAML::Tiny doesn't even try to parse certain parts of YAML which are useful.

I could do without the blessing into objects thing though...

DBI input/output -- ysql

A DBI-based document input/output command. Should be able to read from databases and write to databases. Configuration (DSN, user, pass) should be in a File::HomeDir file.

JSON::XS dependency requires a certain version

        # Child (filename) exited without calling finalize()

    #   Failed test 'filename'
    #   at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 263.
Can't locate object method "indent_length" via package "JSON::XS" at /home/nbkyslo/.cpanm/work/1415647630.31935/App-YAML-Filter-0.012/t/bin/../../bin/yto line 80, <$fh> line 3.
    # Child (DOC -> JSON) exited without calling finalize()

#   Failed test 'DOC -> JSON'
#   at /home/nbkyslo/perl5/lib/perl5/Test/Builder.pm line 276.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/bin/yto.t ......
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/2 subtests

Handle date/time comparisons

This will be tricky... How do we detect date/time? It's most likely just text. Perhaps if the LHS of a comparison is some kind of datetime object we try to parse the RHS as a datetime string / epoch time / something?

I'd really like to be able to do: select( .timestamp > Date("2014-05-06") )

Humph. How long until I'd be better off with SQLite?
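One possible shape for the "parse the RHS only if the LHS is a datetime" rule, sketched in Python for illustration only. The helper names are made up, and treating naive timestamps as UTC is an assumption of the sketch, not something decided above.

```python
from datetime import datetime, timezone


def to_datetime(value):
    """Coerce a datetime, an epoch number, or an ISO-8601 string to aware UTC."""
    if isinstance(value, datetime):
        return value if value.tzinfo else value.replace(tzinfo=timezone.utc)
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(value)  # accepts "2014-05-06"
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def date_gt(lhs, rhs):
    """Compare as datetimes only when the LHS already is one."""
    if isinstance(lhs, datetime):
        return to_datetime(lhs) > to_datetime(rhs)
    return lhs > rhs  # plain comparison for everything else
```

This leaves open the harder question of how a document field becomes a datetime object in the first place, since in the input it is most likely just text.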

Build hashes / arrays

Here's a fun one! Nothing in the parser is prepared for this one...

We should appear to use YAML's syntax for this, which is good because it's approximately the same syntax that JSON (and thus jq) uses.

The leading . sigil will save us here though... so we can do:

# input
symbol: FOO
timestamp: 3
fields:
    foo: BAR
    baz: FUZZ
$ yq '{ foo: .fields.foo, .symbol: .timestamp }'
foo: BAR
FOO: 3

In theory...
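To make the intended semantics concrete, here is a small Python sketch (illustration only, not the parser). It assumes the expression has already been parsed into (key, value) pairs, and that any term with a leading . is a lookup while everything else is a literal; lookup and build_hash are hypothetical names.

```python
def lookup(doc, path):
    """Resolve a dotted path such as '.fields.foo' against nested dicts."""
    value = doc
    for part in path.lstrip(".").split("."):
        if part:  # a bare "." resolves to the document itself
            value = value[part]
    return value


def build_hash(doc, pairs):
    """Build a new document from (key, value) pairs.

    Terms starting with '.' are looked up in doc; others are literals.
    """
    def resolve(term):
        if isinstance(term, str) and term.startswith("."):
            return lookup(doc, term)
        return term

    return {resolve(key): resolve(value) for key, value in pairs}


doc = {"symbol": "FOO", "timestamp": 3, "fields": {"foo": "BAR", "baz": "FUZZ"}}
# Mirrors '{ foo: .fields.foo, .symbol: .timestamp }'
result = build_hash(doc, [("foo", ".fields.foo"), (".symbol", ".timestamp")])
```

With the input above, result holds foo: BAR and FOO: 3, matching the output shown for the yq command.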

Marpa::R2 does not install if /usr/bin/perl is different from used Perl

Running [make]...
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run aclocal-1.11 -I m4
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `aclocal-1.11' is missing on your system.  You should only need it if
         you modified `acinclude.m4' or `configure.ac'.  You might want
         to install the `Automake' and `Perl' packages.  Grab them from
         any GNU archive site.
 cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run automake-1.11 --gnu
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
WARNING: `automake-1.11' is missing on your system.  You should only need it if
         you modified `Makefile.am', `acinclude.m4' or `configure.ac'.
         You might want to install the `Automake' and `Perl' packages.
         Grab them from any GNU archive site.
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /home/nbkyslo/.cpanm/work/1418073009.30590/Marpa-R2-2.100000/libmarpa_build/missing --run autoconf
/usr/bin/perl: symbol lookup error: /home/nbkyslo/perl5/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_apiversion_bootcheck
make: *** [configure] Error 1
Making libmarpa: make Failure at inc/Marpa/R2/Build_Me.pm line 701.
-> FAIL Installing Marpa::R2 failed. See /home/nbkyslo/.cpanm/work/1418073009.30590/build.log for details. Retry with --force to force install it.
-> FAIL Installing the dependencies failed: Module 'Marpa::R2' is not installed
-> FAIL Bailing out the installation for App-YAML-Filter-0.012.

Classic Perl version mismatch because of PERL5LIB. This causes App-YAML-Filter to not be installable, at all.

Enhanced sorting (numeric, reverse, custom comparisons)

sort() should take 0-2 arguments:

  • sort() - Sort the documents. The document may be a simple scalar, so this can be useful
  • sort( EXPR ) - String sort based on the result of EXPR. This is more commonly useful than a strict numeric sort, so we make it the default here
  • sort( CMP_EXPR, VAL_EXPR ) - Sort the values from VAL_EXPR using the expression CMP_EXPR.

This requires that we implement the two binary tri-value comparison operators cmp and <=>. We provide the values to CMP_EXPR as $a and $b, so numeric sorting becomes:

sort( $a <=> $b, .foo )

And reverse sorting becomes:

sort( $b cmp $a, .foo )

I'm not sure if I like all that this implies ($a/$b being special, what does sort( $a cmp $b ) do, reverse sorting is implicit rather than an explicit reverse() function), and the implementation of the CMP_EXPR will be interesting bordering on clever. I do like the cmp/<=> operators though.

It appears that VAL_EXPR will make an implied Schwartzian Transform, which is nice. sort( $a.baz cmp $b.baz, .foo ) will be possible, though it would be better expressed as simply sort( .foo.baz ). Same with sort( length( $a ) <=> length( $b ) ) being better expressed as sort( $a <=> $b, length( .foo ) ).
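A sketch of how sort( CMP_EXPR, VAL_EXPR ) could behave, written in Python for illustration. VAL_EXPR is evaluated once per document (the decorate step of the Schwartzian Transform), CMP_EXPR only ever sees those values as its two arguments, and the documents are undecorated at the end. cmp_num and cmp_str stand in for <=> and cmp; all names here are assumptions.

```python
from functools import cmp_to_key


def cmp_num(a, b):
    """Stand-in for $a <=> $b: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def cmp_str(a, b):
    """Stand-in for $a cmp $b: compares the string forms."""
    a, b = str(a), str(b)
    return (a > b) - (a < b)


def sort_docs(docs, val_expr=lambda doc: doc, cmp_expr=cmp_str):
    """Decorate each doc with val_expr(doc), sort by cmp_expr, undecorate."""
    decorated = [(val_expr(doc), doc) for doc in docs]
    decorated.sort(key=cmp_to_key(lambda x, y: cmp_expr(x[0], y[0])))
    return [doc for _, doc in decorated]


docs = [{"foo": 10}, {"foo": 9}, {"foo": 100}]
numeric = sort_docs(docs, lambda d: d["foo"], cmp_num)                        # sort( $a <=> $b, .foo )
reverse = sort_docs(docs, lambda d: d["foo"], lambda a, b: cmp_str(b, a))     # sort( $b cmp $a, .foo )
strings = sort_docs(docs, lambda d: d["foo"])                                 # sort( .foo )
```

Defaulting cmp_expr to the string comparison matches the one-argument form being a string sort.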

Allow quoted strings

yq currently chokes on input like BAR/NONE, since / is not \w. Use Text::Balanced maybe?

length( EXPR )

Get the length of the result of the given EXPR. Should perform exactly as jq's function.

support frontmatter

We should support frontmatter as a document format, putting the markdown in the "content" key.

This may or may not do what I wish: allow queries over a Statocles site.

Implement pipes between expressions

EXPR1 | EXPR2 should give the result of EXPR1 as the document for EXPR2.

This makes if .foo.bar then .fuzz else empty simpler via select( .foo.bar ) | .fuzz.

I'm starting to wonder whether if EXPR then EXPR else EXPR is even necessary... I suppose that, as the most general case, it is...
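A minimal Python sketch of the pipe semantics, for illustration only. It assumes every filter is a function from one document to zero or more results, so that empty is simply "yield nothing"; select, path and pipe are hypothetical helper names.

```python
def select(pred):
    """Pass the document through only when pred(doc) is true."""
    def run(doc):
        if pred(doc):
            yield doc
    return run


def path(*keys):
    """Look up a nested key, e.g. path('fuzz') for .fuzz."""
    def run(doc):
        value = doc
        for key in keys:
            value = value[key]
        yield value
    return run


def pipe(*filters):
    """EXPR1 | EXPR2 | ...: each result of one filter is the doc for the next."""
    def run(doc):
        docs = [doc]
        for flt in filters:
            docs = [out for d in docs for out in flt(d)]
        yield from docs
    return run


# select( .foo.bar ) | .fuzz
program = pipe(select(lambda d: d.get("foo", {}).get("bar")), path("fuzz"))
```

A document whose .foo.bar is false produces no output at all, which is what the else empty branch of the if form does.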

New Name for Document Shell System

If we're going to do this thing:

  • App::dbi - SQL input/output against DBI stuff
  • App::mango - MongoDB input/output using Mango
  • App::YAML::Filter - Filter documents using jq-like language
  • App::linq - Filter documents using LINQ
  • App::dpath - Filter documents using XPath-like syntax

And other things perhaps like

  • App::minion - Message-queue input/output using Minion
  • App::ws - Webservice input/output using WebService:: modules (I really wish these had a consistent API)

And if we want to support multiple document formats like

  • YAML
  • JSON
  • XML
  • CSV
  • BSON perhaps?

We need to come up with some kind of umbrella name for the protocol these things use to talk to each other.

Right now all that protocol will be is a single envvar describing the document format (possibly via a MIME-like content-type string), but we need a name for it so other people can join the effort in their own languages.

Array slices

Allow Perl's .. syntax, Perl's , syntax (which is just 1..5 expanded to 1,2,3,4,5), and jq/python's [1:2] syntax.

I love how everyone has a different way of doing it. Ruby's slice being [start,length] rather than [start,end] is especially fun. I will not be implementing Ruby's way of doing it, since it does not allow cherry-picking individual array elements, such as "I want the first and last element": @mylist[0, -1].

Python's slice syntax is, imho, the best. Though it's weird that they don't enforce an explicit reverse( mylist[0:2] ) and allow mylist[0:2:-1].
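For comparison, a small Python sketch of the three behaviours side by side. pick and span are made-up names standing in for the Perl-style list and .. forms; only the [1:2] form is native Python. Note the differing end conventions: Perl's range is inclusive, Python's slice end is exclusive.

```python
def pick(items, *indexes):
    """Perl-style list of indexes: cherry-pick individual elements."""
    return [items[i] for i in indexes]


def span(items, first, last):
    """Perl-style first..last range, inclusive at both ends."""
    return pick(items, *range(first, last + 1))


letters = ["a", "b", "c", "d", "e"]
first_and_last = pick(letters, 0, -1)   # like @mylist[0, -1]
middle = span(letters, 1, 3)            # like @mylist[1..3]
python_style = letters[1:2]             # end index is exclusive
```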

[yfrom][yto] CSV format should allow multiple header lines

Deeper data structures can be modeled with additional header lines, like

FOO,FOO,BAR,BAR
A,B,A,B
1,2,3,4
5,6,7,8

Could yield:

FOO:
    - A: 1
      B: 2
    - A: 5
      B: 6
BAR:
    - A: 3
      B: 4
    - A: 7
      B: 8

Hash of hashes is the only thing I want to do with this for now.
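A sketch of the two-header-line case in Python, for illustration only. It assumes exactly two header lines (outer key, then inner key) and leaves cell values as strings, since CSV carries no types; nested_csv is a hypothetical name.

```python
import csv
import io


def nested_csv(text):
    """Turn CSV with two header lines into {outer: [{inner: cell, ...}, ...]}."""
    rows = list(csv.reader(io.StringIO(text)))
    outer, inner = rows[0], rows[1]  # assumes exactly two header lines
    result = {}
    for row in rows[2:]:
        record = {}
        for top, sub, cell in zip(outer, inner, row):
            record.setdefault(top, {})[sub] = cell
        # Each data row contributes one hash per outer key.
        for top, fields in record.items():
            result.setdefault(top, []).append(fields)
    return result


sample = "FOO,FOO,BAR,BAR\nA,B,A,B\n1,2,3,4\n5,6,7,8\n"
structure = nested_csv(sample)
```

For the sample above, structure has the same shape as the YAML shown: FOO and BAR each hold a list of two hashes with keys A and B.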

Use <> instead of <STDIN>

This gets us which file we're currently in, which we can use to help debugging. It also makes scripting around yq easier: Instead of a for loop, we can use find | xargs yq.

This does make testing harder, since we can't hijack <> like we can STDIN. At least, I don't think we can. There has to be a way... Or we'll just make a way to say "use STDIN instead of <>".

I think File::Slurp had a way to do <> in a nice way...

Hash slices

If I only want certain fields to make it through the filter, right now I have to build my own document up out of the fields I want:

yq '{ foo: .foo, bar: .bar, baz: .baz }'

Which can get unwieldy. A simpler hash slice syntax like:

yq '.{foo,bar,baz}'

Would be much nicer.
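The behaviour being asked for is small enough to sketch in a few lines of Python (illustration only). Whether a missing key should be skipped, set to null, or raise an error is not decided above; skipping is an assumption of this sketch.

```python
def hash_slice(doc, *keys):
    """Keep only the named keys, like .{foo,bar,baz}; missing keys are skipped."""
    return {key: doc[key] for key in keys if key in doc}


doc = {"foo": 1, "bar": 2, "baz": 3, "noise": 4}
trimmed = hash_slice(doc, "foo", "bar", "baz")
```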

jq has glossed over the difference between arrays and hashes, mainly because JavaScript makes very little distinction (there's only one member lookup syntax, [...]). Python also has one member lookup that list and dict both share, but 0 and "0" would not get the same elements in a dictionary that contained both. Ruby also has one member lookup syntax for arrays and hashes, but I can't tell if it makes a difference between the string "0" and the number 0.

keys( EXPR )

Get the keys of a hash or an array.

I think I may now be conflating things that return multiple documents, and things that return a single document containing an array (the difference between an array and an arrayref). We must be cautious...

Web Service input/output

A generic web service I/O layer. Not sure how this one will work at all, but it'd be nice to make web services full citizens of the UNIX philosophy.

Optional Dependencies

We need a basic operation and a bunch of optional dependencies. Some features may be filled by multiple dependencies and we should try to be as flexible as possible about that.

We're trying not to install half of CPAN here...

Lists and pairs

Clean up the grammar by making , a list constructor and : a pair constructor. Then it will make more sense to put , separated stuff as args to functions

reduce( )

A generic reduce() function would encapsulate all uses of the current $scope hashref, allowing us to make a bunch of functions in terms of reduce().

Except... I do not like jq's syntax for this at all...

Perhaps using Perl's $a and $b?

reduce( push( $a[$b.foo], $b ) ) <- group_by( .foo )

This apparently introduces variables, the topic variable (implicit doc lookups), and the push() function. And probably assignment and more binary ops...

I'll probably want to tackle these things one-at-a-time...
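To show that group_by really is just a reduce, here is the same idea in Python (illustration only, using the standard functools.reduce rather than any syntax proposed above). The accumulator plays the role of $a and the current document the role of $b.

```python
from functools import reduce


def group_by(docs, key_expr):
    """Fold the documents into {key: [docs with that key]}."""
    def step(groups, doc):  # groups ~ $a, doc ~ $b
        groups.setdefault(key_expr(doc), []).append(doc)
        return groups

    return reduce(step, docs, {})


docs = [{"foo": "x", "n": 1}, {"foo": "y", "n": 2}, {"foo": "x", "n": 3}]
grouped = group_by(docs, lambda d: d["foo"])
```

The explicit initial value ({} here) is the piece the one-line reduce( push( ... ) ) form leaves implicit.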

Implement group_by( EXPR )

This use of xargs, cat *_pages.txt | awk '{ print $2 }' | xargs -I{} yq 'grep( .symbol eq {} )' ursa_data.yml, is going to walk over the file once for each line item, which is impossibly slow with a 500M file.

group_by( EXPR ) will be much nicer.

allow if statement to take any filter

  • Refactor the LHS OP RHS to be a filter that returns true/false.
  • Add the boolean module to get true/false values.

Then if .foo.bar then . else empty will work.

Add .[] to flatten an array

I also could have used this one.

We are going to treat hashes and arrays differently, so this will not return the values of a hash, unlike jq (because JavaScript does not treat them differently). We will add a values() function that will do both (exactly like Perl's values function).
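A sketch of the intended split between .[] and values(), in Python for illustration. What .[] should do when handed a hash is not stated above; raising an error is an assumption of this sketch, and both function names are hypothetical.

```python
def flatten(doc):
    """.[]: yield each element of an array as its own document."""
    if not isinstance(doc, list):
        raise TypeError(".[] expects an array")  # assumed behaviour for hashes
    yield from doc


def values_of(doc):
    """values(): works on both hashes and arrays, like Perl's values."""
    if isinstance(doc, dict):
        return list(doc.values())
    return list(doc)
```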

[yq] trim() functions

It'd be nice if there were trim(), rtrim(), and ltrim() for trimming whitespace around things.

Should we have everything in a global namespace?

Capture and amend YAML warnings

Instead of throwing exceptions like YAML::Tiny does, YAML does Carp::cluck whenever a parsing error occurs. Trap this and add which file we're currently in.

Maybe YAML::LoadFile already does this? Maybe if @ARGV contains more than one thing we should be using YAML::LoadFile?

Implement uniq function

uniq( EXPR ) should print the result of EXPR if and only if that result has not been seen before.

uniq() with no EXPR is the same as uniq(.). This way, when pipes are implemented, we can do .foo | uniq as the equivalent of uniq( .foo ), or more usefully, select( .foo eq bar ) | uniq as the equivalent of uniq( select( .foo eq bar ) ).

uniq() should yield values as it goes: we should not have to wait until all input is exhausted before printing output. So uniq() will need to print a value as soon as it is found to be missing from a lookup cache of seen values. If there are multiple uniq() calls, each must keep its own cache. (Bonus: this lookup cache could build a histogram... somehow...)
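A streaming sketch of uniq( EXPR ) in Python, for illustration only. Each call to make_uniq gets its own cache, and the cache stores counts rather than booleans, which is one way the histogram bonus could fall out. Serialising values with json.dumps so that hashes and arrays can be cache keys is an assumption of the sketch.

```python
import json


def make_uniq(expr=lambda doc: doc):
    """Return a uniq filter with its own private cache of seen values."""
    seen = {}  # serialised value -> number of times seen

    def uniq(docs):
        for doc in docs:
            value = expr(doc)
            key = json.dumps(value, sort_keys=True)
            seen[key] = seen.get(key, 0) + 1
            if seen[key] == 1:
                yield value  # emitted immediately, before more input is read

    uniq.seen = seen  # exposed so the counts can be inspected afterwards
    return uniq


uniq_foo = make_uniq(lambda d: d["foo"])
first_seen = list(uniq_foo([{"foo": "a"}, {"foo": "b"}, {"foo": "a"}]))
```

Because uniq is a generator, a value reaches the output as soon as it is first seen; the cache is the only thing that grows with the input.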
