GithubHelp home page GithubHelp logo

mythmon / mozilla-pipeline-schemas Goto Github PK

View Code? Open in Web Editor NEW

This project forked from mozilla-services/mozilla-pipeline-schemas

0.0 3.0 1.0 822 KB

License: Other

CMake 20.88% Shell 0.38% Lua 54.00% Jupyter Notebook 19.25% Python 5.50%

mozilla-pipeline-schemas's Introduction

Mozilla Pipeline Schemas

This repository contains schemas for Mozilla's data ingestion pipeline and data lake outputs.

The JSON schemas are used to validate incoming submissions at ingestion time. The RapidJSON library is used for JSON Schema Validation. This has implications for what kinds of string patterns are supported, see the Conformance section in the linked document for further details.

To learn more about writing JSON Schemas, Understanding JSON Schema is a great resource.

The Parquet-MR schemas are used for direct to parquet output; some examples of Parquet-MR schemas can be found here: Parquet Schema Examples

Build

Prerequisites

CMake Build Instructions

git clone https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
mkdir release
cd release

cmake ..  # this is the build process (the schemas are built with cmake templates)
cpack -G TGZ # (DEB|RPM|ZIP)

# Integration Tests (run on schema-test EC2 instance)
  # If running locally
    # The following RPM's must be installed:
      # luasandbox, hindsight, luasandbox-lfs, luasandbox-lpeg, luasandbox-rjson, luasandbox-cjson, luasandbox-parquet
    # The following external libraries must be installed
      # parquet-cpp
make # this sets up the tests in the release directory
ctest -V -C hindsight # loads all the schemas and tests the inputs in the validation directory against them

Releases

  • The master branch is the current release and is considered stable at all times.
  • New versions can be released as frequently as every two weeks (our sprint cycle). The only exception would be for a high priority patch.
  • New releases occur the day after the sprint finishes.
    • The version in the dev branch is updated
    • The changes are merged into master
    • A new tag is created

Contributions

  • All pull requests must be made against the dev branch, direct commits to master are not permitted.
  • All non trivial contributions should start with an issue being filed (if it is a new feature please propose your design/approach before doing any work as not all feature requests are accepted).

Notes

All schemas are generated from the 'templates' directory and written into the 'schemas' directory (i.e., the artifacts are generated/saved back into the repository). The reason for this is twofold:

  1. It lets us easily see and refer to complete schemas as they are actually used. This means that the schemas can be referenced directly in bugs and such, as well as being fetched directly from the repo for testing other schema consumers (test being important here, as any production use should be using the installable packages).
  2. It gives us a changelog for each schema, rather than having to reason about changes to templated external pieces and when/how that impacted a given doctype's schema over time. This means that it should be easy to look back in time for the provenance of different parts of the schema for each doctype.

mozilla-pipeline-schemas's People

Contributors

mreid-moz avatar fbertsch avatar dexterp37 avatar georgf avatar trink avatar gregglind avatar whd avatar hannosch avatar marco-c avatar chutten avatar katejim avatar bgirard avatar liuche avatar st3fan avatar cnevinc avatar

Watchers

Michael Cooper avatar James Cloos avatar  avatar

Forkers

gaybro8777

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.