ubisoft / mobydq
:whale: Tool to automate data quality checks on data pipelines
Home Page: https://ubisoft.github.io/mobydq/
License: Apache License 2.0
As a developer on the data quality framework, I would like to have a continuous integration pipeline implemented. Whenever a pull request is merged into the development branch, a new build is triggered and executes all the automated unit tests. This would make it possible to catch potential regressions when delivering new releases.
The CI solution envisioned is https://circleci.com
As a user of MobyDQ, I would like to have cool data visualizations on the dashboard page to monitor data quality indicator executions. Visualizations could be about:
Update once #117 is done
As a user of the data quality framework, I would like to be able to compute data quality indicators on a Hive database.
We need to send HTTP requests to GraphQL via the Flask API in order to intercept custom mutations and execute them (executeBatch, testDataSource). To do so, the web app should format the JSON payload so that it does not contain carriage returns.
Current structure generated by Apollo:
{
"operationName":null,
"variables":{},
"query":"{
allIndicators{
nodes{
id
name
description
executionOrder
flagActive
createdDate
updatedDate
indicatorTypeId
__typename
}
__typename
}
}"
}
Ideal structure:
{
"operationName":null,
"variables":{},
"query":"{allIndicators{nodes{id,name,description,executionOrder,flagActive,createdDate,updatedDate,indicatorTypeId,__typename}__typename}}"
}
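A minimal sketch of how such a payload could be produced, assuming the raw query is available as a string before it is sent. The helper names compact_query and build_payload are hypothetical and not part of MobyDQ or Apollo. The sketch collapses whitespace to single spaces rather than inserting commas; GraphQL treats commas as insignificant, so both forms are equivalent. Note that it would also collapse whitespace inside string literals in the query, which may not be acceptable for every mutation.

```python
import json


def compact_query(query: str) -> str:
    """Collapse every run of whitespace (carriage returns, newlines, tabs) into one space."""
    return " ".join(query.split())


def build_payload(query: str) -> str:
    """Serialize a GraphQL request body whose query holds no line breaks (hypothetical helper)."""
    payload = {
        "operationName": None,
        "variables": {},
        "query": compact_query(query),
    }
    return json.dumps(payload)


raw_query = """{
  allIndicators {
    nodes {
      id
      name
    }
  }
}"""

body = build_payload(raw_query)
print(body)
```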
Only super admin will be able to create / assign groups to a user in the Admin page.
We still need to figure out the flow for assigning the very first group to a user after they log in with their Google account.
As a user of the data quality framework, I would like the Flask API to use SSL encryption, in particular to ensure that logins and passwords sent over the network are not captured by third parties.
As a user of the data quality framework, I would like to have the possibility to execute indicators in parallel / asynchronously in a given indicator group. This should allow a faster execution of a batch.
Example of solution design:
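One possible approach, as a minimal sketch and not the project's actual design: submit each indicator of a group to a thread pool and collect the results as they complete. The function execute_indicator is a hypothetical placeholder for the real indicator logic, and the result dictionary shape is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def execute_indicator(indicator_id: int) -> dict:
    """Hypothetical placeholder for the real computation of one indicator."""
    return {"indicator_id": indicator_id, "status": "Success"}


def execute_group(indicator_ids, max_workers: int = 4, run=execute_indicator) -> dict:
    """Run all indicators of a group concurrently and return results keyed by indicator id."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run, indicator_id): indicator_id for indicator_id in indicator_ids}
        for future in as_completed(futures):
            indicator_id = futures[future]
            try:
                results[indicator_id] = future.result()
            except Exception as error:
                # One failing indicator should not abort the rest of the batch.
                results[indicator_id] = {
                    "indicator_id": indicator_id,
                    "status": "Failed",
                    "error": str(error),
                }
    return results


print(execute_group([1, 2, 3]))
```

Threads suit this case because indicator execution is mostly waiting on database queries; CPU-bound work would call for processes instead.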
As a user of the data quality framework, I would like the README file to be updated in order to reflect project changes and reality.
As a developer on the data quality framework, I would like access to the GraphQL API to be restricted in order to prevent users without proper permissions from performing forbidden operations on the database.
When a user logs in through the Flask API, we will return a unique access token to the user.
This token will have to be passed as auth data to GraphQL.
https://medium.com/the-graphqlhub/graphql-and-authentication-b73aed34bbeb for reference.
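A minimal sketch of the token flow described above, assuming an in-memory store and a "Bearer" Authorization header. TokenStore and authorize are hypothetical names and do not come from the MobyDQ code base; a real deployment would also need expiry and persistent storage.

```python
import secrets


class TokenStore:
    """In-memory mapping from access token to user id (illustrative only)."""

    def __init__(self):
        self._tokens = {}

    def issue(self, user_id: str) -> str:
        """Create a unique, unguessable token for a user who just logged in."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = user_id
        return token

    def user_for(self, token: str):
        """Return the user id bound to the token, or None if the token is unknown."""
        return self._tokens.get(token)


def authorize(store: TokenStore, headers: dict):
    """Return the user id for an 'Authorization: Bearer <token>' header, else None."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not token:
        return None
    return store.user_for(token)


store = TokenStore()
token = store.issue("alice")
print(authorize(store, {"Authorization": f"Bearer {token}"}))
```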
----------------------
Simas Joneliunas on Thursday Aug 16, 2018 03:25 AM:
I'll take ownership of it a bit later into the project. First I want to have some frontend done so we can actually test the auth.
----------------------
Alexis ROLLAND on Monday Sep 03, 2018 12:16 AM:
I changed the docker compose prod configuration to prevent access to GraphQL from the outside of the Docker network. Only the Flask API and the App have their ports exposed. So I’m not sure this is still required.
----------------------
Simas Joneliunas on Monday Sep 03, 2018 12:24 AM:
Uh, the website (app container) is running react.
Thus ALL of the calls will be coming from outside.
We can't restrict it like that.
----------------------
Alexis ROLLAND on Monday Sep 03, 2018 12:30 AM:
Hum I see, we can reverse the change... otherwise one workaround I’m thinking of could be for React to send the requests to the Flask API, which can already route them to GraphQL and return the same result. In theory just changing the base URL to http://server/data-quality/api/graphql should work. And then we can manage auth and permissions with Flask.
As a maintainer of MobyDQ, I would like to have the documentation of the project publicly available on https://mobydq.github.io
A new project repository, mobydq.github.io, has been built to automatically generate the GitHub Pages. The current MobyDQ wiki documentation should now be migrated to this new repository.
As a developer on the data quality framework, I would like to have a clearly designed user flow and mockups I can use to develop the frontend application.
As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to catch potential regressions when delivering new releases.
As a user of the data quality framework, I would like to have a new type of indicator to detect outlying values in time series data sets using anomaly detection algorithms. This would make it possible to detect abnormal values without having to specify a threshold.
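As one illustration of the idea, and not the algorithm chosen by the project, the sketch below flags outliers with the modified z-score based on the median absolute deviation (MAD). The function name mad_outliers is hypothetical. The cutoff of 3.5 is a conventional default for this score, so the user does not have to tune a threshold per indicator, although it remains a parameter.

```python
from statistics import median


def mad_outliers(values, cutoff: float = 3.5):
    """Return the indexes of values whose modified z-score exceeds the cutoff."""
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [
        index
        for index, value in enumerate(values)
        if 0.6745 * abs(value - center) / mad > cutoff
    ]


daily_row_counts = [10, 11, 10, 12, 11, 10, 50, 11]
print(mad_outliers(daily_row_counts))  # → [6]
```

This treats the series as an unordered sample; data with trend or seasonality would need detrending or a rolling window first.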
As a user of the data quality framework, I would like to be able to compute data quality indicators on an Impala database.
As a user of the data quality framework, I would like to have a solution to manage scheduling and execution of batches and indicators. Possibly using an Airflow Docker container?
As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to catch potential regressions when delivering new releases.
Modules:
As a user of the data quality framework, I would like to be able to compute data quality indicators on an Oracle database.
As a user of the data quality framework, I would like to have a clearly defined roles and permissions strategy. The solution should allow easy management of user permissions.
Some ideas:
Create new user stories for the implementation of Roles and Permissions once this story is completed.
Extended from #122 => Also run and validate backend linting in CI
As a user of the data quality framework, I would like the Flask API to use SSL encryption, in particular to ensure that logins and passwords sent over the network are not captured by third parties.
See story #113
As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to catch potential regressions when delivering new releases.
Send a reset token via email; the user can then use a link to set a new password.
As a user of the data quality framework, I would like the source and target requests to be executed in parallel so that neither has to wait for the other before running. This could help reduce the number of discrepancies / alerts in case of long-running queries.
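A minimal sketch of this, assuming the source and target queries are wrapped in two callables. The name run_source_and_target is hypothetical, and the lambdas below stand in for real database queries.

```python
from concurrent.futures import ThreadPoolExecutor


def run_source_and_target(run_source, run_target):
    """Start both requests at once and return (source_result, target_result)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        source_future = pool.submit(run_source)
        target_future = pool.submit(run_target)
        # Each result() call blocks until that query is done; both run meanwhile.
        return source_future.result(), target_future.result()


# Stand-ins for the real source and target queries.
source_count, target_count = run_source_and_target(lambda: 100, lambda: 98)
print(source_count - target_count)
```

Starting both queries at the same moment narrows the window in which the underlying data can change between the two reads.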
As a developer on the data quality framework, I would like to have automated unit tests implemented for all methods in the Python scripts. This would make it possible to catch potential regressions when delivering new releases.
As a user of the data quality framework, I would like to be able to compute data quality indicators on a MariaDB database.
Implement user management features in front end application
user_X into JWT (see doc: https://www.graphile.org/postgraphile/jwt-guide)
user_group_X into JWT
As a user of the data quality framework, I would like to have a secure way to manage passwords for data sources created in the tool. Most likely using a combination of public and private keys to encrypt / decrypt passwords.
As a user of the data quality framework, I would like to have a solution to keep track of and access indicator execution logs. One solution envisioned would be to use an Elasticsearch Docker container to store logs.
As a user of the data quality framework, I would like to have a solution in place to manage user authentication. This will later make it possible to implement roles and permissions. Also see #112
As a user of the data quality framework, I would like Wiki pages to be updated in order to reflect project changes and reality. Also probably move wiki to dedicated GitHub page.
As a user of the data quality framework, I would like to be able to compute data quality indicators on a Teradata database.