This project forked from snowplow-archive/bigquery-loader-cli

Prototype CLI app for uploading Snowplow enriched events to BigQuery

Home Page: http://snowplowanalytics.com

Scala 83.37% Shell 16.63%

BigQuery Loader CLI

Overview

A prototype command-line app to upload Snowplow enriched events from local storage to Google BigQuery.

Getting Started

1. Dependencies

You will need:

  • Some Snowplow enriched events, typically archived in Amazon S3
  • Java 7+ installed
  • A Google BigQuery account

2. Installing

The app is hosted on Bintray:

> wget http://dl.bintray.com/snowplow/snowplow-generic/bigquery_loader_cli_0.1.0.zip
> unzip bigquery_loader_cli_0.1.0.zip

3. BigQuery setup

First, sign up to BigQuery if you have not already done so, and enable billing.

Second, create a project, and make a note of the Project Number by clicking on the name of the project in the Google Developers Console.

Third, our command-line app will need credentials to access the BigQuery project:

  1. Click on the Consent screen link in the APIs and auth section of the Google Developers Console, add an email address and hit Save
  2. Click on the Credentials link in the APIs and auth section
  3. Click on the Create new Client ID button, selecting Installed application as the application type and Other as the installed application type
  4. Click Create Client ID and then Download JSON to save the file
  5. Save the client_secrets file to the directory into which you unzipped the command-line app
  6. Rename the client_secrets file to client_secrets_<projectId>.json, where <projectId> is the Project Number obtained earlier
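
Steps 5 and 6 can be scripted. The following is a minimal sketch, not part of the app: `install_client_secrets` is a hypothetical helper name, and the path of the downloaded JSON file is whatever your browser saved it as.

```shell
# Hypothetical helper (not part of the loader): move the downloaded credentials
# file next to the app under the name described in steps 5 and 6.
# Usage: install_client_secrets <downloaded.json> <projectId> [appDir]
install_client_secrets() {
  src="$1"
  project_id="$2"
  app_dir="${3:-.}"   # defaults to the current directory

  if [ ! -f "$src" ]; then
    echo "No such file: $src" >&2
    return 1
  fi
  if [ -z "$project_id" ]; then
    echo "Missing project number" >&2
    return 1
  fi

  # The app looks for client_secrets_<projectId>.json beside itself
  mv "$src" "$app_dir/client_secrets_${project_id}.json"
}
```

Call it with the downloaded file, your Project Number and the directory holding the unzipped app.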

4. Downloading some Snowplow enriched events

Assuming that you are running the Snowplow Hadoop-based data pipeline with EmrEtlRunner, you can quickly retrieve the enriched events for January 2015 using the following:

> aws --profile="xxx" s3 cp "s3://xxx-archive/enriched/good/" . --recursive \
    --exclude "*" --include "run=2015-01-*"
> find . -type f -execdir bash -c 'd="${PWD##*/}"; [[ "$1" != "$d-"* ]] && mv "$1" "../$d-$1"' - '{}' \;
> find . -type d -exec rm -d {} \;
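
The two find commands flatten the downloaded run folders into a single directory, prefixing each file with the name of the folder it came from. A more explicit sketch of the same idea is shown here, assuming exactly one level of run folders under the download directory; `flatten_runs` is a hypothetical helper name, not part of the app.

```shell
# Hypothetical helper (not part of the loader): move every file found one level
# down into <root>, renamed to <folder>-<file>, then remove the emptied folder.
# Usage: flatten_runs [root]
flatten_runs() {
  root="${1:-.}"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue        # glob matched nothing
    dir="${dir%/}"                   # strip the trailing slash
    name="${dir##*/}"                # folder name, e.g. run=2015-01-01-...
    for file in "$dir"/*; do
      [ -f "$file" ] || continue     # skip sub-folders and empty globs
      mv "$file" "$root/$name-${file##*/}"
    done
    # rmdir only succeeds on empty folders; anything left over stays in place
    rmdir "$dir" 2>/dev/null || true
  done
}
```

Hidden files and deeper nesting are deliberately left untouched, so a folder that still contains them is not removed.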

5. Uploading a first batch of events

To upload your data you simply type the command:

> java -jar bigquery-loader-cli-0.1.0 --create-table \
    <projectId> <datasetId> <tableId> <dataLocation>

where:

  • <projectId> is the Project Number obtained from the Google Developers Console
  • <datasetId> is the name of the dataset, which will be created if it doesn't already exist
  • <tableId> is the name of the table, which will be created if it doesn't already exist
  • <dataLocation> is the location of either a single file of Snowplow enriched events, or an un-nested folder of Snowplow enriched events
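
Because <dataLocation> must be a single file or an un-nested folder, it can help to check it before starting an upload. This is a minimal sketch under that assumption; `check_data_location` is a hypothetical helper name, and the app itself may perform its own validation.

```shell
# Hypothetical pre-flight check (not part of the loader): succeed if the
# location is a regular file, or a folder that contains no sub-folders.
# Usage: check_data_location <dataLocation>
check_data_location() {
  loc="$1"
  if [ -f "$loc" ]; then
    return 0
  fi
  if [ ! -d "$loc" ]; then
    echo "Not a file or folder: $loc" >&2
    return 1
  fi
  for entry in "$loc"/*; do
    if [ -d "$entry" ]; then
      echo "Nested folder found: $entry" >&2
      return 1
    fi
  done
  return 0
}
```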

The first time you run this command, you will be prompted to go through Google's browser-based authentication process.

6. Uploading further batches of events

To append further data to the table, run the command again, omitting the --create-table flag and changing <dataLocation> as appropriate.

Warning: loads are not idempotent. Running the command twice against the same files will result in two copies of the events being added to the table.
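
One way to guard against duplicate loads is to keep a record of the files already uploaded. The sketch below is an illustration only: `unloaded_files`, `mark_loaded` and the plain-text manifest are hypothetical and not part of the app.

```shell
# Hypothetical helpers (not part of the loader): remember which files have been
# uploaded in a plain-text manifest so that a re-run can skip them.

# Print the files directly inside <dataDir> whose names are not in <manifest>.
# Usage: unloaded_files <dataDir> <manifest>
unloaded_files() {
  data_dir="$1"
  manifest="$2"
  touch "$manifest"                  # make sure the manifest exists
  for file in "$data_dir"/*; do
    [ -f "$file" ] || continue
    # -x matches whole lines only, -F treats the name as a fixed string
    grep -qxF -- "${file##*/}" "$manifest" || printf '%s\n' "$file"
  done
}

# Record a file as loaded; call this only after its upload has succeeded.
# Usage: mark_loaded <file> <manifest>
mark_loaded() {
  printf '%s\n' "${1##*/}" >> "$2"
}
```

Keep the manifest outside the data folder, upload each file printed by `unloaded_files` one at a time, and call `mark_loaded` only after an upload succeeds. Whether the app signals failure through a non-zero exit code is an assumption you should verify before relying on it.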

Developer Quickstart

Assuming git, Vagrant and VirtualBox are installed:

 host> git clone https://github.com/snowplow/bigquery-loader-cli
 host> cd bigquery-loader-cli
 host> vagrant up && vagrant ssh
guest> cd /vagrant
guest> sbt test

Copyright and license

Copyright 2015 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributors

  • alexanderdean