# DaFlow [Data Flow (ETL) Framework]
An Apache Spark based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.
License: Other
Is your feature request related to a problem? Please describe.
https://absaoss.github.io/spline/
Spline is a data lineage tracking and visualization tool for Apache Spark™. It would be useful to analyze DaFlow metrics alongside the output of Spline.
Describe the solution you'd like
Integrate Spline with the DaFlow job flow.
Explore the OpenCensus (https://opencensus.io/) APIs and use them to publish metrics to DataDog, Prometheus, and other monitoring frameworks.
Is your feature request related to a problem? Please describe.
Currently, DaFlow job definitions can only be written in XML. DaFlow should also accept job definitions in other formats; YAML is one of the most popular.
Describe the solution you'd like
Build parser classes that read a DaFlow job definition from a YAML file.
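One way to make room for a second format is to have every parser produce the same format-independent job model and to pick the parser by file extension. The sketch below is only an illustration under assumptions: `JobDefinition`, its fields, the XML layout, and the `PARSERS` registry are hypothetical and are not DaFlow's actual classes or job schema. A YAML parser would be registered next to the XML one once written.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class JobDefinition:
    """Format-independent job model (hypothetical fields)."""
    name: str
    source: str
    target: str


def parse_xml_job(text: str) -> JobDefinition:
    # Hypothetical layout:
    # <job name="..."><source>...</source><target>...</target></job>
    root = ET.fromstring(text)
    return JobDefinition(
        name=root.attrib["name"],
        source=root.findtext("source", default=""),
        target=root.findtext("target", default=""),
    )


# Parsers keyed by file extension. A YAML parser returning the same
# JobDefinition would be added here under "yaml" and "yml".
PARSERS: Dict[str, Callable[[str], JobDefinition]] = {"xml": parse_xml_job}


def parse_job(filename: str, text: str) -> JobDefinition:
    """Select a parser from the file extension and parse the job text."""
    extension = filename.rsplit(".", 1)[-1].lower()
    if extension not in PARSERS:
        raise ValueError(f"unsupported job definition format: {extension}")
    return PARSERS[extension](text)
```

Because the rest of the job flow would only see `JobDefinition`, adding YAML in this design means adding one function and one registry entry.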
Update the README.md files of the project's modules and add a detailed README.md that covers building, deploying, and running the project.
Is your feature request related to a problem? Please describe.
Currently, etl-framework uses the sbt build tool. However, Maven is easier to use and more extensible for managing a multi-module project. Moving the build from sbt to Maven is also needed to support refactoring of the code.
Currently, the ETL Job Launcher is tightly coupled to schema validation of the transformed data, and the load step always uses the results of the validation step. This should be more generic: a boolean in the job config should decide whether the transformed data is validated.
The legacy code needs refactoring, for example to update old Spark dependencies.
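The decoupling requested for the job launcher can be sketched as a launcher that treats validation as an optional step. This is a minimal illustration, not the launcher's actual code: the function `run_job`, the config key `validate_transformed_data`, and the row representation are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def run_job(
    config: Dict[str, object],
    extract: Callable[[], List[Row]],
    transform: Callable[[List[Row]], List[Row]],
    validate: Callable[[List[Row]], List[Row]],
    load: Callable[[List[Row]], None],
) -> None:
    """Run extract, transform, optional validate, then load."""
    rows = transform(extract())
    # "validate_transformed_data" is a hypothetical config key; it
    # defaults to True so existing jobs keep validating.
    if config.get("validate_transformed_data", True):
        rows = validate(rows)  # validate returns only the rows that pass
    load(rows)
```

With this shape the load step receives whatever the previous step produced, so it no longer depends on validation results being present.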
Is your feature request related to a problem? Please describe.
DaFlow is a complex project with several modules built on several technologies, so it needs a simple, easy-to-follow usage showcase.
Describe the solution you'd like
A Docker container-based demo would be an easy way to showcase DaFlow usage.
A module is required for validation and transformation of a feed schema. Maintaining the versions of a schema is one of the basic requirements. The schema should also be easily accessible from various endpoints through different methods.
In the future, the schema registry framework could be extended to store data-type mappings for different vendors.
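The versioning requirement can be illustrated with a small in-memory registry. This is a sketch under assumptions, not a proposed design: the class `SchemaRegistry`, its method names, and the representation of a schema as a column-name to type-name mapping are all hypothetical, and a real registry would need persistent storage and access endpoints.

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]  # column name -> type name (assumed representation)


class SchemaRegistry:
    """Keeps every registered version of each feed's schema in memory."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Schema]] = {}

    def register(self, feed: str, schema: Schema) -> int:
        """Store a new schema version and return its 1-based version number."""
        versions = self._versions.setdefault(feed, [])
        versions.append(dict(schema))  # copy so later caller edits do not leak in
        return len(versions)

    def get(self, feed: str, version: Optional[int] = None) -> Schema:
        """Return the requested version, or the latest when version is None."""
        versions = self._versions[feed]  # KeyError for an unknown feed
        if version is None:
            return dict(versions[-1])
        if not 1 <= version <= len(versions):
            raise KeyError(f"{feed} has no schema version {version}")
        return dict(versions[version - 1])
```

Older versions stay readable after a new one is registered, which is what lets a job validate a feed against the schema version it was written for.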
Currently, extraction supports multiple feeds, but passing multiple feeds through the transformation stage and then loading them is not supported. Support for multiple feeds is required in every stage. A strategy is also needed for atomicity when a job handles multiple feeds.
The ETL framework currently supports basic transformation functions such as filter, explode, and select. Joining two feeds is one of the most common and basic ETL operations and should be supported as well.
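To make the requested operation concrete, the sketch below performs an inner join of two feeds on a shared key. It is plain Python over lists of dictionaries and only illustrates the semantics; in the framework itself the work would presumably be delegated to a Spark DataFrame join. The function name `inner_join` and the sample feeds are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Hash join: index the right feed by key, then probe it with the left feed."""
    index: Dict[object, List[Row]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined: List[Row] = []
    for left_row in left:
        # .get avoids creating empty entries for keys with no match
        for right_row in index.get(left_row[key], []):
            # On a column name clash the left feed's value wins.
            joined.append({**right_row, **left_row})
    return joined


customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amount": 10}, {"id": 3, "amount": 5}]
matched = inner_join(customers, orders, "id")
```

Only rows whose key appears in both feeds survive, so `matched` holds a single row for `id` 1.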
Apache Calcite is a library for SQL parsing and analysis that is used in many SQL-based big data projects. Explore how it could be used in ETL-framework.
Is your feature request related to a problem? Please describe.
Apache Iceberg is a new table format for large, slow-moving tabular data. The load stage of the ETL framework should support Iceberg.
Describe the solution you'd like
Exploration and implementation work are required to support this new format in the framework.
Currently, etl_framework supports two frameworks for publishing stats for the ETL feed.
Right now, stats publishing is tightly coupled to the feed code. It needs to be separated from the feed code so that stats are published based on job_static_param.
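The separation described above could take the shape of a small publisher interface that the feed code calls without knowing which backend is behind it. The sketch is an assumption-heavy illustration: `StatsPublisher`, `InMemoryPublisher`, `publisher_for`, and the `stats_publisher` key inside `job_static_param` are hypothetical names, and the in-memory publisher merely stands in for a real stats backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Stats = Dict[str, int]


class StatsPublisher(ABC):
    """Interface the feed code depends on instead of a concrete backend."""

    @abstractmethod
    def publish(self, stats: Stats) -> None:
        ...


class NoOpPublisher(StatsPublisher):
    """Used when the job does not ask for stats to be published."""

    def publish(self, stats: Stats) -> None:
        pass


class InMemoryPublisher(StatsPublisher):
    """Stand-in for a real backend; records what would be sent."""

    def __init__(self) -> None:
        self.published: List[Stats] = []

    def publish(self, stats: Stats) -> None:
        self.published.append(dict(stats))


def publisher_for(
    job_static_param: Dict[str, str],
    available: Dict[str, StatsPublisher],
) -> StatsPublisher:
    # "stats_publisher" is a hypothetical key inside job_static_param.
    name = job_static_param.get("stats_publisher", "")
    return available.get(name, NoOpPublisher())


def run_feed(rows: List[dict], publisher: StatsPublisher) -> int:
    """Feed code: processes rows and reports stats through the interface."""
    processed = len(rows)
    publisher.publish({"rows_processed": processed})
    return processed
```

The feed code never imports a backend directly, so adding or removing a stats framework would only touch the mapping passed to `publisher_for`.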