
setl-framework / setl

177 stars · 13 watchers · 31 forks · 1.39 MB

A simple Spark-powered ETL framework that just works 🍺

License: Apache License 2.0

Java 2.50% Scala 97.08% Shell 0.42%
spark etl framework scala setl pipeline data-transformation data-science data-engineering data-analysis

setl's People

Contributors

dependabot[bot], hoaihuongbk, joristruong, maroil, nourrammal, qxzzxq, r7l208


setl's Issues

SparkRepository can't handle both user defined schema and CompoundKey annotation

Description
When a repository has a user-defined schema, is used to both load and save data, and its type carries annotations such as ColumnName and/or CompoundKey, the repository can no longer read the data after save has been executed.

This happens because, when saving the data, the repository renames or adds columns according to the case class annotations, which breaks the user-defined schema.

Improve test coverage

Even though the average coverage rate is about 85%, there are still packages with a low coverage rate, especially:

Need to improve the tests and get a veeery green Codecov badge!

Spark master URL can't be overwritten in local environment

Describe the bug
When the app environment is local, the Spark master URL will always be set to local and can't be overwritten.

To Reproduce

new SparkSessionBuilder().setEnv("local").setSparkMaster("test").build()

The master URL will still be "local"

Expected behavior
We should be able to overwrite the master URL in the local environment by calling the setSparkMaster method of SparkSessionBuilder.
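
A minimal sketch of the expected behaviour once fixed, reusing the builder calls from above (the remote master URL is hypothetical, and exposing the built SparkSession via get() is an assumption about the builder API):

val builder = new SparkSessionBuilder()
  .setEnv("local")
  .setSparkMaster("spark://remote-master:7077") // hypothetical non-local master URL
  .build()

// Expected after the fix (assuming the SparkSession is exposed via get()):
// builder.get().sparkContext.master == "spark://remote-master:7077"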

Review Setl builder

The DCContext builder should be able to parse Spark-related configuration from the config file when setDCContextConfigPath is called.

In a section like the following:

context_config {
  spark {
    master_url = "url"
    app_env = ""
  }
}
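
A minimal sketch of how such a section could be read with the Typesafe Config library (the config file name is hypothetical, and how the builder forwards the values to Spark is left open):

import com.typesafe.config.ConfigFactory

// Load the file passed to setDCContextConfigPath (file name is hypothetical)
val conf = ConfigFactory.parseResources("application.conf")
val sparkSection = conf.getConfig("context_config.spark")

val masterUrl = sparkSection.getString("master_url") // "url"
val appEnv    = sparkSection.getString("app_env")    // ""

// The builder could then forward these values to the SparkSession configuration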

Handle primitive type delivery

We need someone to implement the following tests:

  • DeliverableDispatcher:
    For a given Deliverable[T] object, where T is a primitive type, and a Factory with an input of type T, DeliverableDispatcher should inject the deliverable into the factory. This can be tested by calling the method testDispatch(factory).
  • Pipeline:
    A primitive-type input should be settable by calling the setInput method of Pipeline, and the pipeline should be able to inject these inputs into their corresponding factories.

DeliverableDispatcher and Pipeline should be able to handle primitive type delivery.
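
A rough sketch of the expected usage (imports and package paths are omitted; the factory class, its field names and the addStage registration call are illustrative, not the confirmed API):

// Hypothetical factory that receives a primitive Int from the dispatcher
class TimesTwoFactory extends Factory[Int] {
  @Delivery var input: Int = _      // primitive-type deliverable to be injected
  private var output: Int = _

  override def read(): this.type = this
  override def process(): this.type = { output = input * 2; this }
  override def write(): this.type = this
  override def get(): Int = output
}

new Pipeline()
  .setInput[Int](21)                // method named in this issue
  .addStage(new TimesTwoFactory)    // registration call is an assumption
  .run()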

Cross compile and deployment for Scala 2.11 and 2.12

Is your feature request related to a problem? Please describe.
The CI is configured only to build, test and deploy setl_2.11, which is built against Scala binary version 2.11. We'd like to add support for Scala 2.12 and make the test and deployment automatic in our CI.

Proposed solution
We may use a shell script to replace the scala.version and scala.compat.version properties in the pom file.

Documentation hosting

Description
Need help:

  • We now only use Travis for CI. Should we continue hosting the Scala API doc on S3?
  • Do we need to implement the CI?

Publish coverage report

The test coverage report is created with scoverage and can be found in target/site/scoverage. We could publish it to a public bucket.

Handle multiple CompoundKeys on the same field

When several compound keys are set on the same field, only the last one is taken into account.

Example:
@CompoundKey("partition", "1") @CompoundKey("sort", "1") environment: String,
only the sort compound key is considered.

Implement `create` method for JDBCConnector

Is your feature request related to a problem? Please describe.
We'd like to be able to create a table in the database for the input DataFrame.

Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection.

The SQL query could be generated according to the schema of the DataFrame.
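
A minimal sketch of that approach, using java.sql.DriverManager and the schema of the input DataFrame (the connection parameters, table name and SQL type mapping are simplified assumptions):

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Build and execute a CREATE TABLE statement derived from the DataFrame schema
def createTable(df: DataFrame, url: String, user: String, password: String, table: String): Unit = {
  def sqlType(dt: DataType): String = dt match {
    case IntegerType | LongType => "BIGINT"
    case DoubleType | FloatType => "DOUBLE PRECISION"
    case BooleanType            => "BOOLEAN"
    case _                      => "TEXT"   // fallback, including StringType
  }

  val columns = df.schema.fields
    .map(f => s"${f.name} ${sqlType(f.dataType)}")
    .mkString(", ")

  val connection = DriverManager.getConnection(url, user, password)
  try {
    connection.createStatement().execute(s"CREATE TABLE $table ($columns)")
  } finally {
    connection.close()
  }
}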

If possible, check also if we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 😃 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

This issue is similar to #82

Check the feasibility of PySpark support

As I have never tried porting code between Scala and Python, I'd like to have some help.

We should try to have the same API as Scala SETL.

  • Can we call Scala code from Python?
  • It should be type-safe (use Python's type checking)
  • Use Python decorators in a similar way to the Java/Scala annotations
  • Use Python's reflection

Review configuration design

In org.apache.spark.sql.execution.datasources, Spark has already defined configuration classes such as JDBCOptions.

Can we integrate these classes to replace our conf classes?

Publish API doc

The API doc is created in target/site/scaladocs. We could publish it to a public bucket.

Annotation @Write to save a Dataset or DataFrame

Instead of requiring the user to implement the write method of a factory, we add a new annotation, @Write, to declare a DataFrame/Dataset to be written. The pipeline will inspect the factory and find all the variables carrying the Write annotation. Then, according to the type (DataFrame or Dataset), it will use a Connector or a SparkRepository to write the data.

Be careful when the variable to be written is a DataFrame: in this case, a delivery id should be specified in order to find the right connector.

We implement a default write method so that it doesn't need to be implemented by the user and does nothing when it is not overridden:

def write(): this.type = this
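
A rough sketch of what a factory could look like with the proposed annotation (@Write does not exist yet; the class name, the TestObject case class and the output field are illustrative, and imports are omitted):

case class TestObject(value: String)

class MyFactory extends Factory[Dataset[TestObject]] {

  @Write                                   // proposed annotation: the pipeline would handle the save
  private[this] var output: Dataset[TestObject] = _

  override def read(): this.type = this
  override def process(): this.type = this // output would be computed here
  override def write(): this.type = this   // the proposed default no-op
  override def get(): Dataset[TestObject] = output
}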

Add `create`, `drop` and `delete` methods for Connector

Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" (drop a table in a database or remove files/folders in a file system) or "delete" (remove records in a DB or partitions/files in a file system).

Describe the solution you'd like
Implement the "drop" and "delete" methods in both FileConnector and DBConnector.

Review ExcelConnector

Description
ExcelConnector doesn't inherit the class FileConnector.

We want to refactor this class so that it inherits from FileConnector and benefits from all its functionality.

Implement the `delete` method for JDBCConnector

Is your feature request related to a problem? Please describe.
We'd like to be able to delete rows of a table in the database.

Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection.

If possible, check also if we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 😃 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils)

The input argument "query" should contain only the condition of a complete SQL request. An exception should be thrown if the query is not well formatted.
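
A minimal sketch of that behaviour, where the caller passes only the WHERE condition (the connection parameters, table name and validation logic are simplified assumptions):

import java.sql.DriverManager

// Delete the rows matching a condition; `condition` is only the body of the WHERE clause
def delete(url: String, user: String, password: String, table: String, condition: String): Unit = {
  // Naive guard against receiving a full SQL statement instead of a condition
  require(!condition.trim.toUpperCase.startsWith("DELETE"),
    "The query argument must contain only the condition, not a complete SQL request")

  val connection = DriverManager.getConnection(url, user, password)
  try {
    connection.createStatement().executeUpdate(s"DELETE FROM $table WHERE $condition")
  } finally {
    connection.close()
  }
}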

This issue is similar to #81

Drop all the non-defined columns when loading data with SparkRepository

Is your feature request related to a problem? Please describe.
If a data source has a lot of columns and we are not able to create a case class that represents all of them, we cannot use SparkRepository.

Describe the solution you'd like
SparkRepository should keep only the defined fields of the case class (taking the ColumnName annotation into account as well) and drop the other columns.
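
A minimal sketch of the column-pruning part (it ignores the ColumnName handling, which would additionally need the annotation's rename mapping; the function name is illustrative):

import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

// Keep only the columns defined in the case class T, then convert to Dataset[T]
def keepDefinedColumns[T: Encoder](df: DataFrame): Dataset[T] = {
  val definedColumns = implicitly[Encoder[T]].schema.fieldNames
  df.select(definedColumns.head, definedColumns.tail: _*).as[T]
}

// Usage (the case class and its implicit encoder from spark.implicits._ are assumed in scope):
// val ds: Dataset[TestObject] = keepDefinedColumns[TestObject](rawDataFrame)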

Annotation @Cache to cache variables

Is your feature request related to a problem? Please describe.
Spark only starts the computation when an action is triggered, and sometimes the same computation could be triggered multiple times if the action is called multiple times (for example when one calls Factory.write() and then Factory.get(), the data processing could be triggered twice).

In the case where the computation is more costly than the storage, we'd prefer to cache the output data in order to accelerate the process.

Describe the solution you'd like
Developers can indeed handle the caching by explicitly calling cache() or persist(storageLevel). But is it possible to do it automatically?

I'd like to add a new annotation called Cache that could be applied to a factory class. The pipeline should automatically cache the output variable after the invocation of the process() method of this factory, so that when we save or get the output, the computation is not triggered more than once.
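
For reference, a minimal sketch of the manual workaround that the proposed @Cache annotation would automate (the data and the output path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A stand-in for a costly factory output
val output = spark.range(1000000L).selectExpr("id * 2 AS doubled")

// Without @Cache, the developer has to request the caching explicitly:
output.cache()

output.write.mode("overwrite").parquet("/tmp/setl-cache-demo") // first action: triggers the computation
output.count()                                                 // second action: served from the cache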

Mermaid diagram improvement

Is your feature request related to a problem? Please describe.
Improve the printing of the Mermaid code and generate a URL for the Mermaid live editor. Basically, the idea is to encode the diagram in the following format with the base64URL scheme:

{"code": "...","mermaid":{"theme":"default"}}

Describe the solution you'd like
Expected result in the console:

[Other logs...]

--------- MERMAID DIAGRAM ---------
classDiagram
class MyFactory {
  <<Factory[Dataset[TestObject]]>>
  +SparkRepository[TestObject]
}
------- END OF MERMAID CODE -------

You can copy the previous code to a markdown viewer that supports Mermaid. 

Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor/#/edit/{BASE64_OF_YOUR_DIAGRAM}
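
A minimal sketch of the encoding step (the payload JSON is built by hand here; the diagram string is the example above):

import java.nio.charset.StandardCharsets
import java.util.Base64

val mermaidCode =
  """classDiagram
    |class MyFactory {
    |  <<Factory[Dataset[TestObject]]>>
    |  +SparkRepository[TestObject]
    |}""".stripMargin

// Escape the diagram so it can be embedded in the JSON payload expected by the live editor
val escaped = mermaidCode.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")
val payload = s"""{"code":"$escaped","mermaid":{"theme":"default"}}"""

// Encode the payload with the base64URL scheme and build the live-editor link
val encoded = Base64.getUrlEncoder.encodeToString(payload.getBytes(StandardCharsets.UTF_8))
println(s"https://mermaid-js.github.io/mermaid-live-editor/#/edit/$encoded")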

Add `drop` method for DBConnector

Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" a table with a DBConnector

Benchmark Factory

Is your feature request related to a problem? Please describe.
Be able to benchmark the performance of a factory.

Describe the solution you'd like
By adding an annotation @Benchmark to either the read, process or write method, we benchmark the elapsed computing time.

The benchmarking process should be able to be activated or deactivated by the user.
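
A minimal sketch of the timing logic such a handler could apply around read/process/write (the @Benchmark annotation does not exist yet; this is plain elapsed-time measurement):

// Measure the elapsed wall-clock time of a block, e.g. a factory's process() call
def benchmark[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"[benchmark] $label took $elapsedMs%.1f ms")
  result
}

// Usage (the factory instance is hypothetical):
// benchmark("MyFactory.process")(myFactory.process())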

AutoLoad deliveries with the same type

Describe the bug
When two or more deliveries have the same type but different IDs, with autoLoad set to true, the pipeline is not able to deliver the loaded dataset.

In the delivery pool, there are repositories that correspond to the delivery ID.

Review wording

  • ConfigFile: file path of the Typesafe config file (xxx.conf/yyy.properties)
  • ConfigPath: the "key" of the configuration within a config file
  • call: a user can only "call" a method/function
  • invoke: a method can be "invoked" by another method
  • Delivery: a data transfer
  • Deliverable: the transferred data

Mermaid code generation from the DAG of pipeline

Is your feature request related to a problem? Please describe.
Be able to generate a Mermaid diagram after the inspection by the PipelineInspector.

Describe the solution you'd like

  • should be able to distinguish Dataset and Factory (data transformation)
  • should be able to show the data store type
  • should be able to show the data transfer direction
  • should be able to show the dataset structure (schema)

Review CI setting

  • Remove GitLab CI configuration
  • Remove AWS Codebuild configuration
  • Use Travis for deployment

CI for build and deployment

As we are no longer on the private JCDecaux GitLab, we need to choose a new tool for CI/CD.

I suggest that we use AWS CodeBuild.
We also have the possibility to use Travis CI, but it's a little bit complicated.

Check the user guide of CodeBuild: https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html

If you have some time, you can check whether we can create services for unit tests (as you did for Cassandra, for example) with CodeBuild.

Travis CI for PR

Use Travis for testing PRs so that the test report is visible to external contributors. Once the PR is merged, use AWS CodeBuild for the deployment.

Private variable with annotation Delivery

  • We should be able to add the Delivery annotation to private variables.
  • As private[this] var is handled differently by the Scala compiler compared to private var, we should review the design of PipelineInspector and DeliverableDispatcher.
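
For reference, a short sketch of the declarations that should be supported (imports and package paths are omitted; the repository fields, their type parameter and the read logic are illustrative):

case class TestObject(value: String)

class MyFactory extends Factory[Dataset[TestObject]] {

  @Delivery private var repository: SparkRepository[TestObject] = _        // should be injectable
  @Delivery private[this] var backupRepo: SparkRepository[TestObject] = _  // compiled differently from a plain private var

  private var output: Dataset[TestObject] = _

  override def read(): this.type = { output = repository.findAll(); this }
  override def process(): this.type = this
  override def write(): this.type = this
  override def get(): Dataset[TestObject] = output
}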
