ubisoft / mobydq

:whale: Tool to automate data quality checks on data pipelines

Home Page: https://ubisoft.github.io/mobydq/

License: Apache License 2.0

Python 18.98% JavaScript 7.36% HTML 0.99% PLpgSQL 6.66% Dockerfile 1.77% Shell 23.26% TSQL 4.47% Vue 36.51%
data-quality data-pipeline data-warehouse big-data data-quality-checks data-quality-monitoring

mobydq's People

Contributors

alexisrolland, dependabot[bot], epsuchti, imaan1411, lerignoux, matalie, mcluky, moeby, pascalhonegger, sijonelis, thomasgassmann, wang14597

mobydq's Issues

Implement Continuous Integration Pipeline (CI)

As a developer on the data quality framework, I would like to have a continuous integration pipeline implemented. Whenever a pull request is merged into the development branch, a new build is triggered and executes all the automated unit tests. This would make it possible to monitor potential regression issues when delivering new releases.

The CI solution envisioned is https://circleci.com

  • Backend tests
  • Frontend tests
  • Frontend linting

Design Cool Data Visualizations

As a user of MobyDQ, I would like to have cool data visualizations on the dashboard page to monitor data quality indicator executions. Visualizations could be about:

  • Batch and indicator executions (start, end, duration, nb sessions...)
  • Indicator quality level objectives (nb alerts...)

Update Scripts Dockerfile to Support Connection to Hive

As a user of the data quality framework, I would like to be able to compute data quality indicators on a Hive database.

  • To create a test container db-hive including test data
  • To update the scripts Dockerfile in order to include Hive ODBC driver (Cloudera, Hortonworks, other?)
  • To update Data Source script to ensure connectivity to Hive database

Format JSON Payload to Remove Carriage Return

We need to send HTTP requests to GraphQL via the Flask API in order to intercept custom mutations and execute them (executeBatch, testDataSource). To do so, the web app should format the JSON payload so that it does not contain carriage returns.

Current structure generated by Apollo:

{
  "operationName":null,
  "variables":{},
  "query":"{
      allIndicators{
          nodes{
              id
              name
              description
              executionOrder
              flagActive
              createdDate
              updatedDate
              indicatorTypeId
              __typename
          }
      __typename
      }
  }"
}

Ideal structure:

{
  "operationName":null,
  "variables":{},
  "query":"{allIndicators{nodes{id,name,description,executionOrder,flagActive,createdDate,updatedDate,indicatorTypeId,__typename}__typename}}"
}
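
A minimal sketch of the formatting step, written in Python for illustration only (the issue assigns this work to the web app, so the same idea would have to be ported to the frontend). The helper names `compact_query` and `build_payload` are hypothetical. The sketch assumes the query contains no `#` comments and no string literals with embedded line breaks, since collapsing whitespace would change the meaning of both.

```python
import json
import re
from typing import Optional


def compact_query(query: str) -> str:
    """Collapse every run of whitespace (including \\r and \\n) into one space."""
    return re.sub(r"\s+", " ", query).strip()


def build_payload(query: str, variables: Optional[dict] = None) -> str:
    """Serialize a GraphQL request body as a single-line JSON string."""
    payload = {
        "operationName": None,
        "variables": variables or {},
        "query": compact_query(query),
    }
    # json.dumps without an indent argument never emits line breaks.
    return json.dumps(payload)
```

The result separates fields with single spaces rather than the commas shown in the ideal structure; GraphQL treats both as insignificant separators, so either form should be accepted.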

Group management

Only a super admin will be able to create groups and assign them to a user in the Admin page.

We still need to figure out the flow for assigning a user's very first group after they log in with their Google account.

Asynchronous Execution of Indicators in a Batch

As a user of the data quality framework, I would like to have the possibility to execute indicators in parallel / asynchronously in a given indicator group. This should allow a faster execution of a batch.

Example of solution design:

  • Indicators currently have an attribute "execution order". It is used to order the execution of indicators in an indicator group.
  • All indicators having the same value for execution order could be executed in parallel during a batch.
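
The design above could be sketched as follows, assuming each indicator is a dictionary with an `executionOrder` key (the field name used in the GraphQL schema) and that `execute` is a hypothetical callable that runs a single indicator. Thread-based parallelism is an assumption here; it suits indicators that mostly wait on database queries.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def run_batch(indicators, execute, max_workers=4):
    """Run indicators in execution order; equal orders run in parallel."""
    results = []
    # groupby only merges adjacent items, so sort by execution order first.
    ordered = sorted(indicators, key=itemgetter("executionOrder"))
    for _, group in groupby(ordered, key=itemgetter("executionOrder")):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() keeps input order; extend() waits for the whole group,
            # so the next execution order starts only after this one ends.
            results.extend(pool.map(execute, list(group)))
    return results
```

If `execute` raises for one indicator, the exception propagates out of `run_batch`; a real implementation would probably record the failure on the session instead.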

Update README File

As a user of the data quality framework, I would like the README file to be updated in order to reflect project changes and reality.

Restrict Access to GraphQL API for Flask API Only

As a developer on the data quality framework, I would like access to the GraphQL API to be restricted in order to prevent users without proper permissions from performing forbidden operations on the database.

When logging in through the Flask API, we will return a unique access token to the user.
This token will then have to be passed as auth data to GraphQL.
See https://medium.com/the-graphqlhub/graphql-and-authentication-b73aed34bbeb for reference.
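
A minimal sketch of the token flow, using only the Python standard library. The function names, the in-memory store, and the `Bearer` header format are all assumptions for illustration; the issue does not specify how tokens are generated, stored, or transmitted.

```python
import secrets

# Hypothetical in-memory store mapping token -> user id.
# A real deployment would persist tokens and expire them.
_tokens = {}


def issue_token(user_id):
    """Create a unique, unguessable access token at login."""
    token = secrets.token_urlsafe(32)
    _tokens[token] = user_id
    return token


def authenticate(authorization_header):
    """Return the user id for a 'Bearer <token>' header, or None."""
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return None
    return _tokens.get(authorization_header[len(prefix):])
```

The API layer would call `authenticate` on each incoming request and forward the query to GraphQL only when it returns a user id.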
----------------------
Simas Joneliunas on Thursday Aug 16, 2018 03:25 AM:
I'll take ownership of it a bit later into the project. First I want to have some frontend done so we can actually test the auth.
----------------------
Alexis ROLLAND on Monday Sep 03, 2018 12:16 AM:
I changed the docker compose prod configuration to prevent access to GraphQL from outside the Docker network. Only the Flask API and the App have their ports exposed. So I’m not sure this is still required.
----------------------
Simas Joneliunas on Monday Sep 03, 2018 12:24 AM:
Uh, the website (app container) is running React.
Thus ALL of the calls will be coming from outside.
We can't restrict it like that.
----------------------
Alexis ROLLAND on Monday Sep 03, 2018 12:30 AM:
Hmm, I see, we can reverse the change... Otherwise, one workaround I’m thinking of could be for React to send the requests to the Flask API, which can already route them to GraphQL and return the same result. In theory, just changing the base URL to http://server/data-quality/api/graphql should work. And then we can manage auth and permissions with Flask.

Migrate Wiki Pages to Github Page

As a maintainer of MobyDQ, I would like to have the documentation of the project publicly available on https://mobydq.github.io.
A new project repository, mobydq.github.io, has been built to automatically generate the GitHub Pages. The current MobyDQ wiki documentation should now be migrated to this new repository.

Design and Mockup Frontend User Flow

As a developer on the data quality framework, I would like to have a clearly designed user flow and mockups I can use to develop the frontend application.

Design New Type of Indicator for Anomaly Detection

As a user of the data quality framework, I would like to have a new type of indicator to detect outlying values in time series data sets using anomaly detection algorithms. This would make it possible to detect abnormal values without having to specify a threshold.
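
As one illustration of threshold-free detection, the sketch below applies Tukey's fences (the interquartile range rule), where the bounds are derived from the data itself. This is a stand-in technique chosen for brevity, not an algorithm named in the issue, and the function name `find_outliers` is hypothetical. It ignores trend and seasonality, which a real time series indicator would likely need to handle.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return (index, value) pairs falling outside Tukey's fences."""
    # quantiles() sorts internally and returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]
```

The multiplier `k=1.5` is the conventional default for this rule; the user never supplies a per-indicator threshold, but the sensitivity is still governed by that constant.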

Update Scripts Dockerfile to Support Connection to Impala

As a user of the data quality framework, I would like to be able to compute data quality indicators on an Impala database.

  • To create a test container db-impala including test data
  • To update the scripts Dockerfile in order to include Impala ODBC driver (Cloudera, other?)
  • To update Data Source script to ensure connectivity to Impala database

Implement Automated Unit Tests for Python Scripts All Indicators Modules

As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to monitor potential regression issues when delivering new releases.
Modules:

  • completeness
  • freshness
  • latency
  • validity
  • indicator (method which computes the session result)

Update Scripts Dockerfile to Support Connection to Oracle

As a user of the data quality framework, I would like to be able to compute data quality indicators on an Oracle database.

  • To create a test container db-oracle including test data
  • To update the scripts Dockerfile in order to include Oracle ODBC driver
  • To update Data Source script to ensure connectivity to Oracle database

Design Roles and Permissions Strategy

As a user of the data quality framework, I would like to have a clearly defined roles and permissions strategy. The solution should allow easy management of user permissions.

Some ideas:

  • Roles and permissions should support teams concept to allow multiple teams to use the same instance
  • Read and write access should be defined at the entity level (indicator, data source, etc.) and at the record level
  • Possibly reuse roles and permissions mechanisms from PostgreSQL database

Create new user stories for the implementation of Roles and Permissions once this story is completed.

Implement SSL Encryption for GraphQL API

As a user of the data quality framework, I would like the Flask API to use SSL encryption. In particular, this would ensure that logins and passwords sent over the network are not captured by third parties.

Implement Automated Unit Tests for Flask API

As a developer on the data quality framework, I would like to have automated unit tests implemented for the Flask API endpoints. This would make it possible to monitor potential regression issues when delivering new releases.

  • api, graphql endpoint
  • api, health endpoint

Asynchronous Execution of Source and Target Requests

As a user of the data quality framework, I would like the source and target requests to be executed in parallel so that one does not have to wait for the other before running. This could help reduce the number of discrepancies / alerts in the case of long-running queries.
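
A minimal sketch of running both requests at once with a thread pool. `run_source` and `run_target` are hypothetical zero-argument callables that each execute one query and return its result; how the project actually issues those queries is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_source_and_target(run_source, run_target):
    """Start both requests together and return (source, target) results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        source_future = pool.submit(run_source)
        target_future = pool.submit(run_target)
        # result() blocks until the request finishes and re-raises its errors.
        return source_future.result(), target_future.result()
```

Because both queries start at nearly the same moment, they observe the data at closer points in time than sequential execution would, which is the effect the story is after.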

Implement Automated Unit Tests for Python Scripts Data Source Module

As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to monitor potential regression issues when delivering new releases.

  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite
  • Hive
  • Impala
  • MariaDB
  • Oracle
  • Teradata
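
For the SQLite entry, a test can run entirely in memory with the standard library, as in the sketch below. `check_connectivity` is a hypothetical stand-in for the project's Data Source connectivity logic, not its actual method; the other databases in the list would need their own test containers and drivers.

```python
import sqlite3
import unittest


def check_connectivity(connection):
    """Hypothetical stand-in: return True if a trivial query succeeds."""
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    return cursor.fetchone() == (1,)


class TestSqliteDataSource(unittest.TestCase):
    def setUp(self):
        # An in-memory database needs no files and no cleanup on disk.
        self.connection = sqlite3.connect(":memory:")

    def tearDown(self):
        self.connection.close()

    def test_connectivity(self):
        self.assertTrue(check_connectivity(self.connection))

    def test_closed_connection_raises(self):
        self.connection.close()
        with self.assertRaises(sqlite3.ProgrammingError):
            check_connectivity(self.connection)
```

Saved to a file, these tests can be run with `python -m unittest`.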

Update Scripts Dockerfile to Support Connection to MariaDB

As a user of the data quality framework, I would like to be able to compute data quality indicators on a MariaDB database.

  • To create a test container db-mariadb including test data
  • To update the scripts Dockerfile in order to include MariaDB ODBC driver (see potential reuse of MySQL driver)
  • To update Data Source script to ensure connectivity to MariaDB database

Design Solution to Manage Indicator Logs

As a user of the data quality framework, I would like to have a solution to keep track of and access indicator execution logs. One solution envisioned would be to use an Elasticsearch Docker container to store logs.

Design Solution to Manage Authentication

As a user of the data quality framework, I would like to have a solution in place to manage user authentication. This will later make it possible to implement roles and permissions. Also see #112

Update Wiki Pages

As a user of the data quality framework, I would like the wiki pages to be updated in order to reflect project changes and reality. Also, probably move the wiki to a dedicated GitHub page.

General frontend improvements

  • Fix warnings, comply with best practice standards
  • Define browser support
  • General CSS improvements (layout, button spacing...)

Update Scripts Dockerfile to Support Connection to Teradata

As a user of the data quality framework, I would like to be able to compute data quality indicators on a Teradata database.

  • To create a test container db-teradata including test data
  • To update the scripts Dockerfile in order to include Teradata ODBC driver
  • To update Data Source script to ensure connectivity to Teradata database
