
This project forked from mozilla/telemetry-batch-view


telemetry-batch-view

This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.


Raw JSON pings are stored on S3 within files containing framed Heka records. Reading the raw data directly, e.g. through Spark, can be slow: a given analysis typically uses only a few fields, and the JSON blobs are costly to parse. Furthermore, under certain circumstances a Heka file might contain only a handful of records.

Defining a derived Parquet dataset, which uses a columnar layout optimized for analytics workloads, can drastically improve the performance of analysis jobs while reducing space requirements. A derived dataset can, and should, also perform the heavy-duty operations common to all analyses that will read from it (e.g., parsing dates into normalized timestamps).
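As an illustration of the date normalization mentioned above, here is a minimal sketch using only the Java standard library. The object name DateNormalization, the helper toEpochMillis, and the choice of compact yyyyMMdd input mapped to midnight UTC are assumptions made for this example; they are not taken from the project's views.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of telemetry-batch-view.
object DateNormalization {
  // Assumed input format: compact ISO dates such as "20160101".
  private val compactDate = DateTimeFormatter.BASIC_ISO_DATE

  /** Parses a yyyyMMdd string into epoch milliseconds at midnight UTC.
    * Returns None when the input cannot be parsed as a date. */
  def toEpochMillis(raw: String): Option[Long] =
    Try(LocalDate.parse(raw, compactDate))
      .toOption
      .map(_.atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli)
}
```

Doing this once when the dataset is built means downstream analyses can compare and filter on a numeric column instead of re-parsing strings.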

Adding a new derived dataset

See the views folder for examples of jobs that create derived datasets.

See the Firefox Data Documentation for more information about the individual derived datasets. For help finding the right dataset for your analysis, see Choosing a Dataset.

Development

There are two possible workflows for hacking on telemetry-batch-view: you can either create a Docker container for building the package and running tests, or import the project into IntelliJ IDEA.

To run the tests in Docker, use the provided Dockerfile to build a container, then use the runtests.sh script to run the tests inside it:

docker build -t telemetry-batch-view .
./runtests.sh

You may need to increase the amount of memory allocated to Docker for this to work, as some of the tests are very memory hungry at present. At least 4 gigabytes is recommended.

You can also pass arguments to sbt (the Scala build tool we use for running the tests) through runtests.sh. For example, to run only the addon tests, try:

./runtests.sh "test-only com.mozilla.telemetry.AddonsViewTest"

If you wish to import the project into IntelliJ IDEA, apply the following changes to Preferences -> Languages & Frameworks -> Scala Compile Server:

  • JVM maximum heap size, MB: 2048
  • JVM parameters: -server -Xmx2G -Xss4M

Note that the first time the project is opened it takes some time to download all the dependencies.

Generating Datasets

See the documentation for specific views for details about running/generating them.

For example, to create a longitudinal view locally:

sbt "run-main com.mozilla.telemetry.views.LongitudinalView --from 20160101 --to 20160701 --bucket telemetry-test-bucket"
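The --from and --to flags in the command above take dates in yyyyMMdd form. How the view interprets them is described in its own documentation; the sketch below only shows one way such a range could be expanded into individual days. The object name SubmissionDates, the helper between, and the inclusive treatment of both endpoints are assumptions for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of telemetry-batch-view.
object SubmissionDates {
  private val fmt = DateTimeFormatter.BASIC_ISO_DATE

  /** Lists every day from `from` to `to` in yyyyMMdd form.
    * Both endpoints are treated as inclusive (an assumption).
    * Returns an empty list when `from` is after `to`. */
  def between(from: String, to: String): List[String] = {
    val start = LocalDate.parse(from, fmt)
    val end = LocalDate.parse(to, fmt)
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(day => !day.isAfter(end))
      .map(_.format(fmt))
      .toList
  }
}
```

For example, SubmissionDates.between("20160101", "20160103") yields the three days from January 1 through January 3.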

For distributed execution we pack all of the classes together into a single JAR and submit it to the cluster:

sbt assembly
spark-submit --master yarn --deploy-mode client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.11/telemetry-batch-view-*.jar --from 20160101 --to 20160701 --bucket telemetry-test-bucket

In the future, we will modify the Airflow jobs to pull the JAR from S3 rather than running git checkout and sbt assembly. Something like:

wget https://s3-us-west-2.amazonaws.com/net-mozaws-data-us-west-2-ops-mavenrepo/snapshots/telemetry-batch-view/telemetry-batch-view/1.1/telemetry-batch-view-1.1.jar

Caveats

If you run into memory issues while compiling or running the test suite, issue the following command before running sbt:

export _JAVA_OPTIONS="-Xms4G -Xmx4G -Xss4M -XX:MaxMetaspaceSize=256M"

Slow tests

By default, slow tests are not run when using sbt test. To run slow tests, use ./runtests.sh slow:test (or just sbt slow:test outside of the Docker environment).

Running on Windows

Executing Scala/Spark jobs can be particularly problematic on this platform. Here's a list of common issues and their solutions:

Issue: I see a weird reflection error or an odd exception when trying to run my code.

This is probably due to winutils being missing or not found. winutils is needed by Hadoop and can be downloaded from here.

Issue: java.net.URISyntaxException: Relative path in absolute URI: ...

This means that winutils cannot be found or that Spark cannot find a valid warehouse directory. Add the following lines at the beginning of your entry function to make it work:

System.setProperty("hadoop.home.dir", "C:\\path\\to\\winutils")
System.setProperty("spark.sql.warehouse.dir", "file:///C:/somereal-dir/spark-warehouse")
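If the same entry function also runs on other platforms, the two properties can be set conditionally. The following is a minimal sketch; the object WindowsSparkSetup and its methods are hypothetical names, and both paths passed in are placeholders that must point at real directories on your machine.

```scala
// Hypothetical helper, not part of telemetry-batch-view.
object WindowsSparkSetup {
  /** Returns true when the given OS name looks like Windows.
    * Defaults to the JVM's os.name system property. */
  def isWindows(osName: String = sys.props.getOrElse("os.name", "")): Boolean =
    osName.toLowerCase.contains("windows")

  /** Sets the winutils and warehouse properties only on Windows.
    * Call this before any Spark session or context is created. */
  def configure(hadoopHome: String, warehouseDir: String): Unit =
    if (isWindows()) {
      System.setProperty("hadoop.home.dir", hadoopHome)
      System.setProperty("spark.sql.warehouse.dir", warehouseDir)
    }
}
```

A call such as WindowsSparkSetup.configure("C:\\path\\to\\winutils", "file:///C:/somereal-dir/spark-warehouse") at the top of the entry function leaves non-Windows runs untouched.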

Issue: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------

See SPARK-10528. Run "winutils chmod 777 /tmp/hive" from a privileged prompt to make it work.

Any commit to master should also trigger a CircleCI build that publishes the package with sbt to our Maven repository in S3.

Contributors

acmiyaguchi, bsmedberg, cameres, chutten, daoshengmu, dexterp37, dzeber, fbertsch, haroldwoo, harterrt, maurodoglio, mhammond, mreid-moz, mythmon, relud, robhudson, robotblake, sapohl, saptarshiguha, sunahsuh, thomcc, uberi, vgutierrez9, vitillo, whd, wlach
