
etsy / cascading.jruby


This project forked from mrwalker/cascading.jruby


A JRuby DSL for Cascading

Home Page: https://github.com/etsy/cascading.jruby/wiki

non-sox

cascading.jruby's People

Contributors

blinsay, gmarabout, morria, mrwalker, petergoldstein

cascading.jruby's Issues

insert cannot handle long constants, only ints

coerce_to_java currently converts all Fixnum to java.lang.Integer, but it should instead use java.lang.Long (just as we use java.lang.Double instead of java.lang.Float for Float).

Here's an example failure:

too big for int: 32503701600000
/opt/jruby/lib/ruby/gems/1.8/gems/cascading.jruby-0.0.9/lib/cascading/operations.rb:117:in `coerce_to_java'
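
The range problem can be illustrated in plain Ruby. This is a minimal sketch, not the library's coerce_to_java: `java_box_for`, `fits_in_int?` and the range constants are hypothetical names, and the function returns Java class names as strings so that it runs without JRuby.

```ruby
# Hypothetical sketch: choose a Java box type wide enough for a Ruby value.
# Returns class names as strings so it can run outside JRuby.
INT_RANGE  = (-2**31)..(2**31 - 1)   # range of java.lang.Integer
LONG_RANGE = (-2**63)..(2**63 - 1)   # range of java.lang.Long

def java_box_for(value)
  case value
  when Integer
    unless LONG_RANGE.include?(value)
      raise RangeError, "too big for long: #{value}"
    end
    'java.lang.Long'      # always widen, just as Float is widened to Double
  when Float
    'java.lang.Double'
  when String
    'java.lang.String'
  else
    raise ArgumentError, "unsupported type: #{value.class}"
  end
end

# True when the value would survive conversion to java.lang.Integer.
def fits_in_int?(value)
  INT_RANGE.include?(value)
end
```

The constant from the failure above, 32503701600000, is outside `INT_RANGE` but well inside `LONG_RANGE`, which is why always boxing as a long avoids the error.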

where implementation

where cannot simply wrap a Janino expression in a negation, because expressions can be preceded by import statements. A proper implementation will need to parse out the imports so that only the expression itself is negated.
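
One way to separate the imports from the expression is sketched below. It assumes the imports appear as leading `import a.b.C;` statements in the same string as the expression; that layout, the `IMPORT_PREFIX` pattern and the `negate_expression` helper are all assumptions for illustration, not the library's code.

```ruby
# Hypothetical sketch: negate only the expression part of a Janino snippet.
# Assumes imports are leading "import a.b.C;" statements (optionally static).
IMPORT_PREFIX = /\A(?:\s*import\s+(?:static\s+)?[\w.*]+\s*;)*/

def negate_expression(source)
  imports    = source[IMPORT_PREFIX]             # "" when there are no imports
  expression = source[imports.length..-1].strip  # everything after the imports
  [imports.strip, "!(#{expression})"].reject(&:empty?).join(' ')
end
```

With this helper, `negate_expression("import java.util.Date; x > 5")` leaves the import in place and wraps only `x > 5`. A regular expression is a shortcut here; a real fix might prefer a proper parse of the import statements.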

Node name uniqueness

Currently, there is no requirement that nodes in the composition hierarchy have unique names. This can lead to 2 kinds of problems:

  • If peers in the hierarchy are named the same thing, the last defined "wins." This means nodes created earlier are not available for sinking, joins, unions, or lookups (because they've been overwritten in the list of children).
  • If non-peers are named the same thing, then a lookup (like in sinks, joins, and union) becomes ambiguous. Whichever node is located via depth-first search "wins" (see Node#find_child).

This problem could be resolved in several ways:

  1. Enforce unique names in Node#add_child so collisions cannot happen
  2. Require node names to be unambiguously qualified when there is a potential collision (e.g. "cascade_name.flow_name.assembly_name.branch_name" as opposed to just "branch_name")

The current workaround is simply to use globally unique names, which the programmer must ensure themselves.
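
Both proposals can be sketched on a toy hierarchy. The `Node` class below is hypothetical and far simpler than the library's; `qualified_name` and `find_by_path` are invented names used only to illustrate the idea.

```ruby
# Hypothetical sketch of both proposals; this Node is not the library's class.
class Node
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = {}
    parent.add_child(self) if parent
  end

  # Proposal 1: refuse to overwrite a peer with the same name.
  def add_child(node)
    if @children.key?(node.name)
      raise ArgumentError, "duplicate child '#{node.name}' under '#{qualified_name}'"
    end
    @children[node.name] = node
  end

  # Proposal 2: a dotted path that is unambiguous across the hierarchy.
  def qualified_name
    parent ? "#{parent.qualified_name}.#{name}" : name
  end

  # Resolve a dotted path relative to this node; nil if a segment is missing.
  def find_by_path(path)
    path.split('.').inject(self) { |node, segment| node && node.children[segment] }
  end
end
```

Proposal 1 alone only prevents collisions between peers; non-peers with the same short name would still need the qualified lookup from proposal 2.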

debug_scope has an NPE after a left_join

assembly "joined" do
  left_join *["visit_lvl"] + joined_tables + [{:on => ['group', 'subgroup', 'ab_test', 'ab_test_group']}]

  debug_scope # this dies a horrible death

  insert "blah" => "blah"

  debug_scope # this is cool
end

Sources and sinks must have unique names

Currently sources and sinks must have unique names because they are stored as name -> tap maps. This results in a branch/pass anti-pattern when people want to write the output of an assembly to different taps. It can also be surprising when earlier sinks are compiled away by c.j.

We should raise an exception on duplicate source/sink names, but maybe also provide an optional parameter to distinguish between the name of the tap and the assembly from which it reads.
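
A rough sketch of that idea follows. `SinkRegistry`, its methods and the `:from` option are hypothetical names invented for illustration, and taps are represented by plain placeholder strings.

```ruby
# Hypothetical sketch: sink names stay unique, but several sinks may read
# from the same assembly through a separate :from option (an invented name).
class SinkRegistry
  Sink = Struct.new(:name, :assembly, :tap)

  def initialize
    @sinks = {}
  end

  # Registers a sink; the assembly defaults to the sink's own name.
  def sink(name, tap, options = {})
    raise ArgumentError, "duplicate sink name: #{name}" if @sinks.key?(name)
    @sinks[name] = Sink.new(name, options.fetch(:from, name), tap)
  end

  # All sinks that read from the given assembly.
  def sinks_for(assembly)
    @sinks.values.select { |s| s.assembly == assembly }
  end
end
```

Separating the sink name from the assembly name would let one assembly feed two taps without the extra branch/pass step, while a repeated sink name fails loudly instead of silently replacing the earlier one.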

Gotcha: hash form of :on argument to join

Keys in the hash form of the :on argument to join are sorted lexicographically and the first of these is the grouping key field name that propagates. This should either produce a warning or be documented somewhere visible.
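
The behaviour described above can be mimicked in a few lines of plain Ruby. `propagating_key` and `propagation_warning` are hypothetical helpers that show what such a warning might report; they are not part of the library.

```ruby
# Hypothetical helper: report which :on key would propagate, following the
# behaviour described above (keys sorted lexicographically, first one wins).
def propagating_key(on)
  on.keys.map(&:to_s).sort.first
end

# Builds a warning message for the hash form, or nil when nothing is ambiguous.
def propagation_warning(on)
  return nil unless on.is_a?(Hash) && on.size > 1
  "join :on given as a hash; '#{propagating_key(on)}' will propagate as the grouping key"
end
```

For example, with `{'user_id' => 'id', 'ab_test' => 'test_name'}` the key that propagates is `ab_test`, regardless of the order in which the hash was written.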

Primary key propagation with empty group_by

A group_by with no every block, or with an empty every block, does not seem to set the primary key to the grouping keys. Fixing this bug should be held off until we refactor the code so that Every can only occur in the group_by/join block.

Sink order causes composition/runtime planner mismatch

If you list sinks in the order in which assemblies/branches are defined, this is not an issue. However, if you list them in a different order, sometimes the cascade will pass composition-time checking but yield a runtime Planner exception. In the cases I've seen so far, it appears that a branch will inherit the fields of the tail pipe of its enclosing assembly (for some unknown reason, only at runtime).

The workaround is to list the sinks in the same order as the assemblies, but this is an annoying limitation and needs to be fixed.

SQL mode for composite_aggregator

min and max are the only composite aggregators for which SQL mode has been verified. We should cover this with tests and see whether it extends to the other aggregators.

cascading 2.0 wip versions

TL;DR - I wasn't able to get the exact cascading 2.0 WIP build that cascading.jruby is written against, and c.j doesn't work against the latest build as of yesterday (281).

I made some tweaks to fit the API refactors, which you can see here - https://github.com/blake-education/cascading.jruby

I got past the initial errors, only to stumble into "NativeException: java.lang.IllegalArgumentException: resultGroupFields and cogroup resulting joined fields must be same size"

I'm an absolute cascading newb, but it seems that the thrown exceptions were added in this commit to cascading from April 3 cwensel/cascading@689e3a8#diff-6

So at that point I gave up :)

I know cascading 2.0 is a moving target, so do you reckon you could state and archive the build against which c.j is tested? The cascading project doesn't seem to keep old 2.0-wip builds around.

Mystery branch/group_by bug

This bug needs to be researched and documented better. The general idea is that a group_by that follows a branch at the end of an assembly will fail to be included in the dataflow. The workaround is simply to wrap the group_by in another branch.
