sparsecode / daflow Goto Github PK

An Apache Spark-based Data Flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

License: Other

Scala 90.18% Shell 7.06% Dockerfile 2.76%

etl apache-spark scala parquet json hive avro csv join-data transformation-rules

daflow's Introduction

# DaFlow [Data Flow (ETL) Framework]

An Apache Spark-based Data Flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

daflow's People

thejanagala majiajue profbiyi ramuchava geekay2015 vishurkamble thhuestc szhorizon bpprak locnguyenhuu lavecoral wuchunfu nsinghdeveloper

daflow's Issues

Spline Support

Is your feature request related to a problem? Please describe.
https://absaoss.github.io/spline/
Spline is a data lineage tracking and visualization tool for Apache Spark™. It would be useful to analyze DaFlow metrics using Spline's output.

Describe the solution you'd like
Integrate Spline with the DaFlow job flow.

DaFlow Metrics

Explore the OpenCensus (https://opencensus.io/) APIs and use them to publish metrics to DataDog, Prometheus, and other monitoring backends.

Support YAML-based DaFlow job configurations.

Is your feature request related to a problem? Please describe.
Currently, DaFlow jobs can only be defined in XML. DaFlow should also accept job definitions in other formats, and YAML is one of the most popular.

Describe the solution you'd like
Build parser classes that read DaFlow job definitions from YAML files.

Support GraphQL in Schema Registry along with gRPC and Thrift.

Update README.md of the project.

Update the README.md files of the project's modules, and add a detailed README.md that covers building, deploying, and running the project.

Move project build from SBT to Maven

Is your feature request related to a problem? Please describe.
Currently, elt-framework is built with SBT. However, a multi-module project is easier to manage and extend with Maven. Moving the build from SBT to Maven is also needed to support refactoring the code.

Refactor the ETL launch job executor to make it more generic and robust

Currently, the ETL job launcher is tightly coupled to validating the schema of the transformed data, and the load step always uses the results of the validation step. This should be more generic: a boolean in the job config should decide whether the transformed data is validated or not.
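
The request above could be sketched roughly as follows. This is only an illustration under stated assumptions, not DaFlow's actual launcher: `JobConfig`, `validateTransformedData`, `Record`, and `Launcher` are hypothetical names, and plain collections stand in for Spark data.

```scala
// Hypothetical names throughout; none of these come from the DaFlow code base.
final case class JobConfig(validateTransformedData: Boolean)

final case class Record(id: Int, value: String)

object Launcher {
  // Stand-in for schema validation: a record counts as valid when its value is non-empty.
  def isValid(record: Record): Boolean = record.value.nonEmpty

  // Returns the records that would be handed to the load step.
  // Validation runs only when the job config asks for it.
  def recordsToLoad(config: JobConfig, transformed: Seq[Record]): Seq[Record] =
    if (config.validateTransformedData) transformed.filter(isValid)
    else transformed
}
```

With the flag set to false, the load step receives the transformed data unchanged, so validation becomes an optional stage rather than a hard dependency of loading.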

Refactor code to remove old Spark dependencies and other unwanted dependencies.

The legacy code needs refactoring, for example to remove old Spark dependencies.

Build easy demo for DaFlow usage.

Is your feature request related to a problem? Please describe.
DaFlow is a complex project with several modules built on several technologies, so it needs a simple, easy-to-follow usage showcase.

Describe the solution you'd like
A Docker container-based demo would be an easy way to showcase DaFlow usage.

Add support for schema registry module with support for multiple versions of schema

A module is required for validating and transforming a feed schema. Maintaining versions of a schema is one of the basic requirements. The schema should also be easily accessible from various endpoints using different access methods.

In the future, the schema registry framework could be extended to store data-type mappings for different vendors.
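
As a rough illustration of the versioning requirement, here is a minimal in-memory sketch. It is an assumption-laden example, not the design of DaFlow's schema registry: `Schema`, `InMemorySchemaRegistry`, and the representation of fields as a name-to-type map are all hypothetical.

```scala
// Hypothetical in-memory registry; not DaFlow's actual schema registry API.
final case class Schema(feed: String, version: Int, fields: Map[String, String])

final class InMemorySchemaRegistry {
  // Schemas per feed, ordered by version (oldest first).
  private var schemas = Map.empty[String, Vector[Schema]]

  // Registers a new version for the feed. Versions start at 1 and increase by one.
  def register(feed: String, fields: Map[String, String]): Schema = {
    val existing = schemas.getOrElse(feed, Vector.empty)
    val schema = Schema(feed, existing.size + 1, fields)
    schemas = schemas.updated(feed, existing :+ schema)
    schema
  }

  // Looks up one specific version of a feed's schema.
  def get(feed: String, version: Int): Option[Schema] =
    schemas.getOrElse(feed, Vector.empty).find(_.version == version)

  // Returns the most recently registered version, if any.
  def latest(feed: String): Option[Schema] =
    schemas.get(feed).flatMap(_.lastOption)
}
```

Older versions stay retrievable after a new one is registered, which is the property a feed needs when historical data was written under an earlier schema.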

Add support for multiple feeds in ETL job.

Currently, multiple feeds are supported in the extraction stage, but passing multiple feeds through the transformation stage and loading them is not supported. Support for multiple feeds is required in all stages. A strategy is also needed to guarantee atomicity when a job handles multiple feeds.

Entries in hive table for each feed run.
Publishing stats on Prometheus.

Right now, stats publishing is tightly coupled to the feed code. It needs to be separated from the feed code so that stats are published based on job_static_param.
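
One possible shape for this separation is sketched below. Everything here is an assumption for illustration: the source does not describe the format of job_static_param, so the sketch treats it as a comma-separated list of sink names, and `FeedStats`, `StatsPublisher`, and the `"hive"` and `"prometheus"` keys are hypothetical. The publishers only return a description string instead of talking to Hive or Prometheus.

```scala
// Hypothetical names; the real job_static_param format is not given in the source.
final case class FeedStats(feedName: String, rowsRead: Long, rowsWritten: Long)

trait StatsPublisher {
  def publish(stats: FeedStats): String
}

object HiveStatsPublisher extends StatsPublisher {
  // A real implementation would insert a row into a Hive table for each feed run.
  def publish(stats: FeedStats): String =
    s"hive: ${stats.feedName} read=${stats.rowsRead} written=${stats.rowsWritten}"
}

object PrometheusStatsPublisher extends StatsPublisher {
  // A real implementation would push or expose metrics for Prometheus.
  def publish(stats: FeedStats): String =
    s"prometheus: ${stats.feedName} read=${stats.rowsRead} written=${stats.rowsWritten}"
}

object StatsPublishing {
  private val known: Map[String, StatsPublisher] =
    Map("hive" -> HiveStatsPublisher, "prometheus" -> PrometheusStatsPublisher)

  // Assumes a comma-separated value such as "hive,prometheus"; unknown names are ignored.
  def publishersFor(param: String): Seq[StatsPublisher] =
    param.split(",").toSeq.map(_.trim.toLowerCase).flatMap(name => known.get(name))

  def publishAll(param: String, stats: FeedStats): Seq[String] =
    publishersFor(param).map(_.publish(stats))
}
```

The feed code would then only produce a `FeedStats` value and hand it over, without knowing which sinks are enabled.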
