activewarehouse / activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse

License: MIT License

Ruby 99.75% Logos 0.09% Batchfile 0.15%

activewarehouse-etl's Introduction

ActiveWarehouse

The ActiveWarehouse library provides classes and functions which help with building Data Warehouses using Rails.

Installation

To install ActiveWarehouse, add the gem to your Gemfile:

gem 'activewarehouse'

Generators

ActiveWarehouse comes with several generators. Each generator accepts the name in either CamelCase or underscored form; in the examples below, both commands produce the same result.

Facts

Creates a SalesFact class and a sales_facts table.

script/generate fact Sales
script/generate fact sales

Dimensions

Creates a RegionDimension class and a region_dimension table.

script/generate dimension Region
script/generate dimension region

Cubes

Creates a RegionalSalesCube class.

script/generate cube RegionalSales
script/generate cube regional_sales

Bridge

Creates a CustomerHierarchyBridge class.

script/generate bridge CustomerHierarchy
script/generate bridge customer_hierarchy

Dimension View

Creates an OrderDateDimension class which is represented by a view on top of the DateDimension.

script/generate dimension_view OrderDate Date
script/generate dimension_view order_date date

Model Naming

The rules for naming are as follows:

Facts:

  • Fact classes and tables follow the typical Rails rules: classes are singular and tables are pluralized.
  • Both the class and table name are suffixed by "_fact".

Dimensions:

  • Dimension classes and tables are both singular.
  • Both the class name and the table name are suffixed by "_dimension".

Cube:

  • Cube class is singular. If a cube table is created it will also be singular.

Bridge:

  • Bridge classes and tables are both singular.
  • Both the class name and the table name are suffixed by "_bridge".

Dimension View:

  • Dimension View classes are singular. The underlying data structure is a view on top of an existing dimension.
  • Both the class name and the view name are suffixed by "_dimension".
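
The naming rules above can be sketched in a few lines of Ruby. This is only an illustration: the helper names (underscore, camelize, warehouse_names) are hypothetical and not part of ActiveWarehouse, and the cube table name is an assumption, since the rules only say that a cube table, if created, is singular.

```ruby
# Illustrative helpers only; ActiveWarehouse itself relies on the Rails inflector.

# "RegionalSales" or "regional_sales" -> "regional_sales"
def underscore(name)
  name.to_s.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

# "regional_sales" or "RegionalSales" -> "RegionalSales"
def camelize(name)
  underscore(name).split('_').map(&:capitalize).join
end

# Returns the class and table names implied by the naming rules.
def warehouse_names(kind, name)
  base = underscore(name)
  case kind
  when :fact      then { :class => "#{camelize(base)}Fact",      :table => "#{base}_facts" }
  when :dimension then { :class => "#{camelize(base)}Dimension", :table => "#{base}_dimension" }
  when :bridge    then { :class => "#{camelize(base)}Bridge",    :table => "#{base}_bridge" }
  # Assumption: a cube table, when one exists, is named like the class.
  when :cube      then { :class => "#{camelize(base)}Cube",      :table => "#{base}_cube" }
  else raise ArgumentError, "unknown kind: #{kind.inspect}"
  end
end

p warehouse_names(:fact, 'Sales')
p warehouse_names(:bridge, 'customer_hierarchy')
```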

ETL

The ActiveWarehouse plugin does not directly handle Extract-Transform-Load processes. The ActiveWarehouse ETL gem, which is installed separately, covers that part. To install it, use:

gem install activewarehouse-etl

More information on the ETL process can be found at http://activewarehouse.rubyforge.org/etl

activewarehouse-etl's People

Contributors

aeden, byrnejb, cdimartino, colincasey, fearoffish, jayzes, jlecour, joshuabates, kennym, kookster, lgustafson, mainej, opencoderx, pdodds, pgericson, sasikumargn, seeingidog, smeyfroi, tchukuchuk, thbar, tylergannon

activewarehouse-etl's Issues

Comply with rubygems deprecation warnings

RubyGems 1.7.2 prints the deprecation warnings below. They no longer appear in RubyGems 1.8.5, but the deprecated calls should still be removed from the Rakefile.

→ rake test
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2011-10-01.
Gem::Specification#has_rdoc= called from /Users/jlecour/Projects/activewarehouse-etl/Rakefile:106
.
NOTE: Gem::Specification#default_executable= is deprecated with no replacement. It will be removed on or after 2011-10-01.
Gem::Specification#default_executable= called from /Users/jlecour/Projects/activewarehouse-etl/Rakefile:113
.
NOTE: Gem::Specification#has_rdoc= is deprecated with no replacement. It will be removed on or after 2011-10-01.
Gem::Specification#has_rdoc= called from /Users/jlecour/Projects/activewarehouse-etl/Rakefile:106
.
NOTE: Gem::Specification#default_executable= is deprecated with no replacement. It will be removed on or after 2011-10-01.
Gem::Specification#default_executable= called from /Users/jlecour/Projects/activewarehouse-etl/Rakefile:113
.
/Users/jlecour/.rvm/rubies/ree-1.8.7-2011.03/bin/ruby -I"lib:lib" -I"/Users/jlecour/.rvm/gems/ree-1.8.7-2011.03@aw-etl/gems/rake-0.9.0/lib" "/Users/jlecour/.rvm/gems/ree-1.8.7-2011.03@aw-etl/gems/rake-0.9.0/lib/rake/rake_test_loader.rb" "test/**/*_test.rb"
Using AdapterExtensions
initializing ETL engine

Using native MySQL
Resetting database
 * DEFERRED: the DateDimensionBuilder when building a date dimension with a fiscal year offset month should respect the fiscal year offset month.
 * DEFERRED: a model source should find n rows.
Loaded suite /Users/jlecour/.rvm/gems/ree-1.8.7-2011.03@aw-etl/gems/rake-0.9.0/lib/rake/rake_test_loader
Started
.................................................................................................................
Finished in 2.155438 seconds.

113 tests, 335 assertions, 0 failures, 0 errors

Here is my environment:
- OS X 10.6
- Ruby Enterprise Edition 1.8.7 2011.03
- Rake 0.9.0
- RubyGems 1.7.2
- Bundler 1.0.14

Rows should not be processed anymore once error threshold is reached

Currently, when the error threshold is reached, the remaining rows are still processed (see the test case below).

Reading the code, I believe the original intent was to stop processing the remaining rows.

require File.dirname(__FILE__) + '/test_helper'

class EngineTest < Test::Unit::TestCase

  context 'process' do

    should 'stop as soon as the error threshold is reached' do
      engine = ETL::Engine.new

      assert_equal 0, engine.errors.size

      engine.process ETL::Control::Control.parse_text <<CTL
        set_error_threshold 1
        source :in, { :type => :enumerable, :enumerable => (1..100) }
        after_read { |row| raise "Failure" }
CTL

      # will fail because engine.errors.size is 100
      assert_equal 1, engine.errors.size
    end

  end

end
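
The expected behaviour can be sketched independently of the engine. The process_rows method below is a hypothetical stand-in for the engine's row loop, not ETL::Engine itself; it assumes that "reached" means the error count is equal to or greater than the threshold, which is what the test case above expects.

```ruby
# Hypothetical stand-in for the engine's row loop (not the real ETL::Engine).
def process_rows(rows, error_threshold)
  errors = []
  processed = 0
  rows.each do |row|
    begin
      yield row
      processed += 1
    rescue StandardError => e
      errors << e
      # Stop as soon as the threshold is reached instead of reading on.
      break if errors.size >= error_threshold
    end
  end
  { :processed => processed, :errors => errors }
end

result = process_rows(1..100, 1) { |_row| raise "Failure" }
puts result[:errors].size
```

With a threshold of 1 and a block that always raises, the loop stops after the first row, so only one error is collected.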

Provide better error message on missing target

The following bits:

pre_process :truncate, :table => table

will give:

.../etl/engine.rb:205:in `establish_connection': No connection found for {} (ETL::ETLError)

whereas it should tell the user that no target database has been specified.
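
One possible shape for such a check is sketched below. The resolve_target! method and the MissingTargetError class are hypothetical names, and the sketch assumes the connection is selected through a :target key in the configuration hash.

```ruby
# Hypothetical error class and helper; not part of activewarehouse-etl.
class MissingTargetError < StandardError; end

# Returns the target name, or raises with an actionable message.
def resolve_target!(configuration, known_targets)
  target = configuration[:target]
  if target.nil?
    raise MissingTargetError,
          "No :target given in #{configuration.inspect}; " \
          "specify which database connection to use"
  end
  unless known_targets.include?(target)
    raise MissingTargetError,
          "Unknown target #{target.inspect}; known targets: #{known_targets.inspect}"
  end
  target
end

begin
  resolve_target!({ :table => 'people' }, [:datawarehouse])
rescue MissingTargetError => e
  puts e.message
end
```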

Unexpected results when using accents in foreign key lookup

While working on the etl sample, I noticed that some values (e.g. "José") are not looked up properly when using the SQLResolver, whereas everything works fine with the ActiveRecordResolver.

More testing is required before releasing this. I suspect the problem is linked to the mysql gem.

ETL::Control::FileDestination does not always generate valid CSV file

The class ETL::Control::FileDestination does not generate a valid CSV file in some cases. At least not one that could be read by a file source with a :parser[:name] of :delimited. It appears that FileDestination was originally designed to use FasterCSV to create the output file (based on the method "options"), but it apparently wasn't implemented.

Here is a case which will break under the current implementation:

  1. In one .ctl, read from a database source where the value of a column contains a double quote
  2. Output the results to a file using a file destination, making sure to set :enclose to a double quote (")
  3. In another .ctl, read from the file

In this case, a MalformedCSVError exception is raised from FasterCSV when attempting to read from the file in the second .ctl. Here is the sequence of events that leads to this exception.

  1. When data is read from the database it may also be written to the "source_data" directory. If you examine that file, you'll see that embedded double-quotes are escaped by an additional double-quote (""). This is the correct way to escape embedded double-quotes in a CSV file.
  2. Compare to the file output from step 2 above. Note that each field is enclosed by double-quotes and the embedded double-quote is escaped with a backslash (rather than another double-quote).
  3. When reading this file with a delimited file source, FasterCSV raises an error because it doesn't understand escaping embedded double-quotes with a backslash.

Because of the above issues, it appears there is no way to create a file in one .ctl and read it in another when embedded quotes are present.

The solution could involve using FasterCSV instead of the manual escaping and enclosing implemented in FileDestination. It may be wise to use another name, such as CsvDestination to avoid a breaking change.
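
The difference between the two escaping styles can be reproduced with Ruby's standard csv library, which is what FasterCSV became as of Ruby 1.9. This is a minimal sketch of the round trip, not a patch for FileDestination; on Ruby 1.8 the equivalent calls would go through FasterCSV.

```ruby
require 'csv'

row = ['say "hi"', 'plain']

# Writing through the CSV library doubles embedded double-quotes.
line = CSV.generate_line(row, :force_quotes => true)
puts line

# The generated line reads back to the original values.
round_trip = CSV.parse_line(line)

# A backslash-escaped line, like the one described above, is rejected.
backslash_line = '"say \"hi\"","plain"'
begin
  CSV.parse_line(backslash_line)
  rejected = false
rescue CSV::MalformedCSVError
  rejected = true
end

puts round_trip.inspect
puts "backslash-escaped line rejected: #{rejected}"
```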

Stack level too deep (rake test:matrix on Ruby 1.9.2)

[master] thbar@~/git/activewarehouse-etl: BUNDLE_GEMFILE=test/config/Gemfile.rails-2.3.x rvm 1.9.2 rake test --trace
rake test --trace

rvm 1.6.9 by Wayne E. Seguin ([email protected]) [https://rvm.beginrescueend.com/]

(in /Users/thbar/git/activewarehouse-etl)
/Users/thbar/.rvm/gems/ruby-1.9.2-p180/gems/rake-0.9.2/lib/rake/file_utils.rb:10: warning: already initialized constant RUBY
/Users/thbar/.rvm/gems/ruby-1.9.2-p180/gems/rake-0.9.2/lib/rake/file_utils.rb:84: warning: already initialized constant LN_SUPPORTED
rake/rdoctask is deprecated.  Use rdoc/task instead (in RDoc 2.4.2+)
/Users/thbar/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/rdoc/task.rb:30: warning: already initialized constant Task
** Invoke test (first_time)
** Execute test
WARNING: Global access to Rake DSL methods is deprecated.  Please include
    ...  Rake::DSL into classes and modules which use the Rake DSL methods.
WARNING: DSL method Rake::TestTask#ruby called at /Users/thbar/.rvm/gems/ruby-1.9.2-p180/gems/rake-0.9.2/lib/rake/file_utils_ext.rb:36:in `ruby'
rake aborted!
stack level too deep
/Users/thbar/.rvm/gems/ruby-1.9.2-p180/gems/rake-0.9.2/lib/rake/file_utils_ext.rb:116

Bring back Rails 3 patches

A number of forks introduce patches to make aw-etl work with Rails 3. I want to bring those patches back and have the tests pass in a rails3 branch.

Build a rake command for matrix testing

I'd like a single command that runs the following matrix, as on Travis:

  • rvm
    • ruby 1.8.7
    • ruby 1.9.2
    • jruby 1.6.2
  • rails
    • 2.3.11
    • 3.0.7
  • database
    • mysql
    • postgresql

etl could report the first row error when error threshold is reached

When a row error occurs, as in the following case:

after_read :check_unique, :keys => [:this_field_does_not_exist]

errors are stacked until the threshold is reached, but none is reported on STDOUT:

Processing control etl/prepare_user_dimension.ctl
Source: /Users/thbar/git/activewarehouse-etl-sample/data/git-commits.csv
.......................
Exiting due to exceeding error threshold: 100
ETL process complete

I think reporting the first error would be useful. It also seems that processing becomes much slower when errors are logged, so there may be something to optimize there.

Delimited parser is actually a CSV parser

Where the user would probably expect basic splitting on a separator, FasterCSV tries to interpret quotes and fails on some files (as in the work-in-progress sample).

The delimited parser should either:

  • be renamed to csv_parser
  • or actually support an option to not use fastercsv

Parsers shouldn't have to call Dir.glob themselves

A good part of the parsers currently use Dir.glob.

A couple of issues with that:

  • the code is duplicated whereas it should be pulled-up at the abstract parser level
  • the code itself can be considered buggy: when the user passes only one file with a typo (or a bad path), the file is silently ignored

Maybe a shared helper method should be created, which would verify that at least one file matches the pattern.
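
A minimal sketch of such a helper follows. The name resolve_source_files is hypothetical; the only library call it relies on is Dir.glob. Results are sorted because Dir.glob has not always guaranteed an order.

```ruby
# Hypothetical shared helper; not part of the current code base.
# Expands a glob pattern and fails loudly when nothing matches,
# so a typo in a single file name is no longer silently ignored.
def resolve_source_files(pattern)
  files = Dir.glob(pattern).sort
  raise ArgumentError, "No file matches #{pattern.inspect}" if files.empty?
  files
end

begin
  resolve_source_files('no/such/directory/typo.csv')
rescue ArgumentError => e
  puts e.message
end
```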

"do_bulk_load is an abstract method" when using mysql2

After my first attempt to use mysql2, the following error is raised:

/Users/thbar/.rvm/gems/ruby-1.9.2-p180@activewarehouse-etl-sample/gems/adapter_extensions-0.9.5.rc1/lib/adapter_extensions/connection_adapters/abstract_adapter.rb:40:in `do_bulk_load': do_bulk_load is an abstract method (NotImplementedError)
from /Users/thbar/.rvm/gems/ruby-1.9.2-p180@activewarehouse-etl-sample/gems/adapter_extensions-0.9.5.rc1/lib/adapter_extensions/connection_adapters/abstract_adapter.rb:18:in `bulk_load'
from /Users/thbar/git/activewarehouse-etl/lib/etl/processor/bulk_import_processor.rb:85:in `block in process'
from /Users/thbar/.rvm/gems/ruby-1.9.2-p180@activewarehouse-etl-sample/gems/activerecord-3.0.9/lib/active_record/connection_adapters/abstract/database_statements.rb:139:in `transaction'
from /Users/thbar/git/activewarehouse-etl/lib/etl/processor/bulk_import_processor.rb:70:in `process'

Look into ScdTest (non-deterministic)

  1) Failure:
test: when working with a slowly changing dimension of type 2 on run 1 should set the effective date. (ScdTest)
    [/Users/thbar/git/activewarehouse-etl/test/scd_test.rb:78:in `__bind_1308073273_32000'
     org/jruby/RubyProc.java:268:in `call'
     org/jruby/RubyMethod.java:117:in `call'
     /Users/thbar/.rvm/gems/jruby-1.6.2/gems/shoulda-2.11.3/lib/shoulda/context.rb:382:in `test: when working with a slowly changing dimension of type 2 on run 1 should set the effective date. '
     org/jruby/RubyProc.java:268:in `call'
     org/jruby/RubyKernel.java:2059:in `send'
     org/jruby/RubyArray.java:1602:in `each'
     org/jruby/RubyArray.java:1602:in `each']:
<"2011-06-14T19:41:13+00:00"> expected but was
<"2011-06-14T19:41:12+00:00">.

etl --read-locally picks wrong last file

As reported by another user at http://groups.google.com/group/activewarehouse-discuss/browse_thread/thread/56b83f0e26f45bf3, activewarehouse-etl does not choose the last available locally stored source file. This is because the call to Dir.glob in ETL::Control::Source#last_local_file_trigger doesn't return the array of filenames in sorted order.

In addition, if there are no cache files available the method will return nil, which will raise an unclear exception upstream. The exception should probably be raised in this method.

The solution is pretty straightforward on the surface because all cache file names contain a timestamp. However, cache files from a file source also contain a sequence number to handle the case of multiple source files. Since the sequence number is not left-padded, it can very easily cause sorting issues. The mitigating factor here is that only database sources call #last_local_file (and therefore #last_local_file_trigger) and that calling #last_local_file doesn't really make sense when multiple files are expected.
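
The sorting problem can be illustrated with a sketch that compares the sequence number numerically. The file name format used here ("<timestamp>.csv" or "<timestamp>_<sequence>.csv") is an assumption made for the example, and both method names are hypothetical; the real cache file names may differ.

```ruby
# Assumed name format: "<timestamp>.csv" or "<timestamp>_<sequence>.csv".
# Builds a key that orders by timestamp first, then by numeric sequence.
def cache_file_sort_key(path)
  name = File.basename(path, File.extname(path))
  timestamp, sequence = name.split('_', 2)
  [timestamp, sequence.to_i] # a missing sequence counts as 0
end

# Picks the most recent cache file, raising here rather than returning nil.
def latest_cache_file(paths)
  raise ArgumentError, "No local cache file available" if paths.empty?
  paths.max_by { |path| cache_file_sort_key(path) }
end

files = ['20110614_2.csv', '20110614_10.csv', '20110613_99.csv']
puts latest_cache_file(files)
```

A plain string sort would rank "20110614_2.csv" after "20110614_10.csv", which is the kind of issue the unpadded sequence number causes.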

Cryptic error message when declaring unknown screen level

The following (erroneous) statement:

after_post_process_screen(:warning) {
}

will raise:

in `after_post_process_screen': undefined method `<<' for nil:NilClass (NoMethodError)

Checking the level up front would make it possible to provide a better message here.
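
A sketch of such a check is shown below. The list of levels is an assumption made for the example and should be taken from the engine's source; validate_screen_level! is a hypothetical name.

```ruby
# Assumed set of screen levels; check the engine's source for the real list.
SCREEN_LEVELS = [:fatal, :error, :warn]

# Returns the level when it is known, otherwise raises a readable error
# instead of letting a NoMethodError on nil surface later.
def validate_screen_level!(level)
  return level if SCREEN_LEVELS.include?(level)
  raise ArgumentError,
        "Unknown screen level #{level.inspect}; expected one of #{SCREEN_LEVELS.inspect}"
end

begin
  validate_screen_level!(:warning)
rescue ArgumentError => e
  puts e.message
end
```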

Make rake test:all pass on travis-ci.org

  • the 4 required databases should be created when running on travis.
  • rake test:all should work fine for Ruby 1.8.7

From there, the goal will be to keep a clean build.

Make runtime dependency on some gems optional

Some destinations or processors use a number of gems, including tmail, net-sftp, zip and spreadsheet.

Currently, the requires occur unchecked, which means the etl command will raise an error if any of these is not installed.
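
One way to defer these requires is sketched below: the library is loaded only when the feature that needs it is used, and a missing library produces a message naming that feature. The require_optional method is a hypothetical helper, not existing code.

```ruby
# Hypothetical helper: load a library only when a feature needs it.
def require_optional(library, feature)
  require library
  true
rescue LoadError
  raise LoadError,
        "#{feature} needs the '#{library}' library; install it to use this feature"
end

# A destination or processor would call this when it is instantiated,
# for example (feature name is illustrative):
begin
  require_optional('a_library_that_is_not_installed', 'the SFTP uploader')
rescue LoadError => e
  puts e.message
end
```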

Explicit ActiveRecord::Base.establish_connection required for connection to work

In the etl sample, I currently need to do this:

after_post_process_screen(:fatal) do
  ActiveRecord::Base.establish_connection(:datawarehouse)

  assert_equal end_date - start_date + 1, DateDimension.count

  # ensure we keep constant ids despite the truncating
  assert_equal start_date, DateDimension.find(1).date
end

I'm not sure why establishing the connection is required at all here. Possibly a regression?

Declarations with unknown parameters should raise an error

Quite often it's a bit hard to get the syntax right at first.

For instance:

destination :out, :file => 'myfile.txt', :order => [:name, :email]

is incorrect and should be written:

destination :out, { :file => 'myfile.txt' }, { :order => [:name, :email] }
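
A check of this kind could look like the sketch below. The check_options! method and the list of allowed keys are hypothetical and only illustrate the idea of rejecting unknown keys early.

```ruby
# Hypothetical validation helper; the allowed keys are illustrative only.
def check_options!(declaration, options, allowed_keys)
  unknown = options.keys - allowed_keys
  return options if unknown.empty?
  raise ArgumentError,
        "#{declaration}: unknown option(s) #{unknown.inspect}; allowed: #{allowed_keys.inspect}"
end

# The first hash of the incorrect form mixes configuration and mapping keys.
begin
  check_options!('destination :out',
                 { :file => 'myfile.txt', :order => [:name, :email] },
                 [:file, :type])
rescue ArgumentError => e
  puts e.message
end
```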

Using a parser on non-existent single file should raise an error

Currently all the parsers use Dir.glob(file) to retrieve the list of files to be processed.

It would be better to:

  • let the caller handle this instead, so that parser implementer doesn't have to remember it and the code stays DRY
  • raise an error unless at least one file matches

Incremental reorganization/rewrite, test-first

I'd like to create an empty branch, add rspec/cucumber support, think about how to make things database agnostic, then gradually bring back most if not all of the original components, in a test-first fashion.

This is a fairly large effort to undertake, which I'll estimate in the early stage.

Pass tests with the mysql2 adapter

The current build triggers multiple failures, including:

<#<NoMethodError: undefined method `copy_table' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x00000104138ff8>>>.

and

RuntimeError: Unsupported adapter ActiveRecord::ConnectionAdapters::Mysql2Adapter for this destination

File destination handling of booleans

A typical use case of the "file destination" (lib/etl/control/destination/file_destination.rb) is to populate a file for bulk loading into a database. However, there is an issue when using SCD workflows with this type of destination. When defining a type 2 SCD, you must supply the name of the column that holds a boolean indicating whether it's the latest version of the row. Accordingly, during the SCD workflow activewarehouse-etl sets the column to true or false in the pipeline.

The trouble is, when the column is written to the file, true and false get written as "true" and "false" because that's the result of true.to_s or false.to_s. These are not valid values for a MySQL BOOLEAN, so the subsequent bulk load fails.

The easy solution is to modify file_destination.rb to coerce TrueClass and FalseClass to "1" and "0", respectively.

However, this may not be the best solution since there are other use cases for a file destination and perhaps others are depending on booleans being written as "true" or "false".
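
One way to keep both behaviours is to make the coercion opt-in. The sketch below uses a hypothetical format_value method with a coerce_booleans flag; neither exists in file_destination.rb today.

```ruby
# Hypothetical formatter: booleans become "1"/"0" only when requested,
# so existing users who rely on "true"/"false" are not affected.
BOOLEAN_AS_INTEGER = { true => '1', false => '0' }

def format_value(value, coerce_booleans = false)
  if coerce_booleans && BOOLEAN_AS_INTEGER.key?(value)
    BOOLEAN_AS_INTEGER[value]
  else
    value.to_s
  end
end

puts format_value(true)        # default behaviour, same as true.to_s
puts format_value(true, true)  # coerced for bulk loading
```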

Declarative order should be enforced

The expected ETL declaration order is (afaik):

  • source
  • after_read
  • transform
  • before_write
  • destination
  • screen
  • post_process
  • after_post_process_screen

By default, it would be nice to tell the user when declarations are out of order, for instance that a regular screen must not come after post_process.
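
Such a check could be sketched as follows, using the order listed above. The out_of_order_declarations method is a hypothetical name, and the sketch only reports problems rather than raising.

```ruby
# Expected order, as listed above.
DECLARATION_ORDER = [:source, :after_read, :transform, :before_write,
                     :destination, :screen, :post_process,
                     :after_post_process_screen]

# Returns a list of messages for declarations that appear too late.
def out_of_order_declarations(declarations)
  problems = []
  highest = -1
  declarations.each do |name|
    rank = DECLARATION_ORDER.index(name)
    next if rank.nil? # declarations outside the list are not checked here
    if rank < highest
      problems << "#{name} should come before #{DECLARATION_ORDER[highest]}"
    else
      highest = rank
    end
  end
  problems
end

puts out_of_order_declarations([:source, :destination, :post_process, :screen])
```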
