setl-framework / setl
A simple Spark-powered ETL framework that just works 🍺
License: Apache License 2.0
Sub-issue of #54
Description
When a repository has a user-defined schema, is used to both load and save the data, and its type carries annotations like ColumnName and/or CompoundKey, the repository will not be able to read the data back after the execution of save. When it saves the data, it renames/adds columns according to the case class' annotations, which breaks the user-defined schema.
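For illustration, a minimal sketch of a case class whose annotations alter the persisted schema (the package path and field names below are assumptions, not taken from the issue):

import com.jcdecaux.setl.annotation.{ColumnName, CompoundKey}

case class User(
  @ColumnName("user_id") userId: String,          // persisted as "user_id" instead of "userId"
  @CompoundKey("partition", "1") country: String  // an extra key column is added on save
)

After a save, the stored columns no longer match the user-defined schema, so the subsequent read fails.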
Describe the bug
When the app environment is local, the Spark master URL is always set to local and can't be overridden.
To Reproduce
new SparkSessionBuilder().setEnv("local").setSparkMaster("test").build()
The master URL will still be "local"
Expected behavior
We should be able to override the master URL in the local environment by calling the setSparkMaster method of SparkSessionBuilder.
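For example (a sketch; the trailing get() returning the built SparkSession is an assumption about the builder API):

val spark = new SparkSessionBuilder()
  .setEnv("local")
  .setSparkMaster("local[4]")  // the explicitly set master should take precedence
  .build()
  .get()                       // assumed to return the underlying SparkSession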
The DCContext builder should be able to parse Spark-related configuration from the config file when setDCContextConfigPath is called, from a section like the following:
context_config {
  spark {
    master_url = "url"
    app_env = ""
  }
}
Need someone to implement the following test: given a Deliverable[T] object, where T is a primitive type, and a Factory with an input of type T, DeliverableDispatcher should inject the deliverable into the factory. This can be tested by calling the method testDispatch(factory).
It should also be possible to set primitive-type inputs with the setInput method of Pipeline, and the pipeline should be able to inject these inputs into their corresponding factories. In short, DeliverableDispatcher and Pipeline should be able to handle primitive type delivery.
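A rough sketch of such a test (the factory below is invented for illustration; Delivery, setInput, and testDispatch come from the issue):

class ThresholdFactory extends Factory[Double] {
  @Delivery var threshold: Double = _   // primitive-type input to be injected

  override def read(): this.type = this
  override def process(): this.type = this
  override def write(): this.type = this
  override def get(): Double = threshold * 2
}

new Pipeline()
  .setInput[Double](0.5)                // primitive delivery
  .addStage(new ThresholdFactory)
  .run()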
Is your feature request related to a problem? Please describe.
The CI is configured only to build, test, and deploy setl_2.11, which is built against Scala binary version 2.11. We'd like to add support for Scala 2.12 and make its testing and deployment automatic in our CI.
Proposed solution
We may use a shell script to replace the scala.version and scala.compat.version properties in the pom file.
Description
Need help: the test coverage report is created with scoverage and can be found in target/site/scoverage. We could publish it to a public bucket.
When several compound keys are set on the same field, only the last one is considered.
Example:
@CompoundKey("partition", "1") @CompoundKey("sort", "1") environment: String
Here, only the sort compound key is taken into account.
Is your feature request related to a problem? Please describe.
We'd like to be able to create a table in the database for the input DataFrame.
Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection. The SQL query could be generated from the schema of the DataFrame.
If possible, check also whether we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 🙂 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils).
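A minimal sketch of the DriverManager approach (the type mapping and method names below are assumptions for illustration, not the actual implementation):

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Map Spark SQL types to generic SQL column types (simplified for this sketch)
def toSqlType(dt: DataType): String = dt match {
  case StringType  => "TEXT"
  case IntegerType => "INTEGER"
  case LongType    => "BIGINT"
  case DoubleType  => "DOUBLE PRECISION"
  case BooleanType => "BOOLEAN"
  case _           => "TEXT"
}

// Generate a CREATE TABLE statement from the DataFrame schema and execute it via JDBC
def createTable(df: DataFrame, table: String, url: String, user: String, pwd: String): Unit = {
  val columns = df.schema.fields.map(f => s"${f.name} ${toSqlType(f.dataType)}").mkString(", ")
  val conn = DriverManager.getConnection(url, user, pwd)
  try conn.createStatement().execute(s"CREATE TABLE IF NOT EXISTS $table ($columns)")
  finally conn.close()
}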
This issue is similar to #82
As I have never tried porting code between Scala and Python, I'd like some help. We should try to have the same API as the Scala SETL.
In org.apache.spark.sql.execution.datasources.jdbc, Spark has already defined configuration classes like JDBCOptions. Can we integrate these classes to replace our conf classes?
The API doc is generated in target/site/scaladocs. We could publish it to a public bucket.
Sub-issue of #54
Instead of requiring the user to implement the write method of a factory, we add a new annotation @Write to declare a DataFrame/Dataset to be written. The pipeline will inspect the factory and find all the variables carrying the Write annotation. Then, according to the type (DataFrame or Dataset), we use a Connector or a SparkRepository to write the data.
Be careful when the variable to be written is a DataFrame: in this case, a delivery ID should be specified in order to find the right connector.
We implement a default write method so that it doesn't need to be implemented by the user, and it does nothing when it's not overridden:
def write(): this.type = this
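A hypothetical usage sketch of the proposed annotation (all names below are illustrative; the feature is not implemented yet):

case class Output(id: String)

class MyFactory extends Factory[DataFrame] {
  @Write(deliveryId = "rawOutput")    // a delivery ID is required for DataFrames to find the right connector
  var rawOutput: DataFrame = _

  @Write                              // Datasets can be matched by type via a SparkRepository
  var typedOutput: Dataset[Output] = _
}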
Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" (drop a table in a database, or remove files/folders in a file system) or "delete" (remove records in a DB, or partitions/files in a file system).
Describe the solution you'd like
Implement the "drop" and "delete" methods in both FileConnector and DBConnector.
Description
ExcelConnector doesn't inherit the class FileConnector. We want to refactor this class so that it inherits FileConnector and all of its functionality.
Is your feature request related to a problem? Please describe.
We'd like to be able to delete rows of a table in the database.
Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection.
If possible, check also whether we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 🙂 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils).
The input argument "query" should contain only the condition part of a complete SQL statement. An exception should be thrown if the query is not well formatted.
This issue is similar to #81
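A minimal sketch of the condition-only contract described above (the method and its validation are assumptions for illustration):

import java.sql.Connection

def delete(conn: Connection, table: String, query: String): Unit = {
  // "query" must be a bare condition such as "country = 'FR'", never a full statement
  require(!query.trim.matches("(?i)^(delete|select|insert|update)\\b.*"),
    s"Malformed query: expected a condition only, got '$query'")
  conn.createStatement().execute(s"DELETE FROM $table WHERE $query")
}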
Is your feature request related to a problem? Please describe.
If there are lots of columns in a data source and we are not able to create a case class that covers all of them, we cannot use SparkRepository.
Describe the solution you'd like
SparkRepository should keep only the fields defined in the case class (taking the ColumnName annotation into account as well) and drop the other columns.
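A sketch of the intended behavior (in practice the expected column names would be derived from the case class fields and their ColumnName annotations; here the list is passed in directly):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only the columns declared by the case class (after ColumnName renaming), drop the rest
def keepDefinedColumns(df: DataFrame, expected: Seq[String]): DataFrame =
  df.select(expected.filter(df.columns.contains).map(col): _*)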
Is your feature request related to a problem? Please describe.
Spark only starts the computation when an action is triggered, and the same computation can run multiple times if several actions are called (for example, when one calls Factory.write() and then Factory.get(), the data processing could be triggered twice).
In the case where the computation is more costly than the storage, we'd prefer to cache the output data in order to accelerate the process.
Describe the solution you'd like
Developers can indeed handle the caching by explicitly calling cache() or persist(storageLevel). But is it possible to do it automatically?
I'd like to add a new annotation called Cache that could be applied to a factory class. The pipeline should automatically cache the output after the invocation of the process() method of this factory, so that when we save or get the output, the computation is not triggered more than once.
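A hypothetical usage sketch of the proposed annotation (the parameter name and storage level handling are assumptions):

@Cache(storageLevel = "MEMORY_AND_DISK")  // the pipeline would persist the output right after process()
class CostlyFactory extends Factory[DataFrame] {
  // read/process/write/get as usual; write() and get() would then reuse the cached output
}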
Is your feature request related to a problem? Please describe.
Improve the printing of the Mermaid code and generate a Mermaid live editor URL. Basically, encode the diagram code in the following JSON format with the base64URL scheme:
{"code": "...","mermaid":{"theme":"default"}}
Describe the solution you'd like
Expected result in the console:
[Other logs...]
--------- MERMAID DIAGRAM ---------
classDiagram
  class MyFactory {
    <<Factory[Dataset[TestObject]]>>
    +SparkRepository[TestObject]
  }
------- END OF MERMAID CODE -------
You can copy the previous code to a markdown viewer that supports Mermaid.
Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor/#/edit/{BASE64_OF_YOUR_DIAGRAM}
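A sketch of the encoding step (the URL format is taken from the issue; the JSON escaping here is simplified):

import java.nio.charset.StandardCharsets
import java.util.Base64

def liveEditorUrl(mermaidCode: String): String = {
  val escaped = mermaidCode.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")
  val json    = s"""{"code":"$escaped","mermaid":{"theme":"default"}}"""
  val b64url  = Base64.getUrlEncoder.withoutPadding.encodeToString(json.getBytes(StandardCharsets.UTF_8))
  s"https://mermaid-js.github.io/mermaid-live-editor/#/edit/$b64url"
}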
Refactor setter methods
Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" a table with a DBConnector.
https://central.sonatype.org/pages/ossrh-guide.html
@maroil can you create an account in the OSSRH Jira and apply for the deployer right?
Once the permission is granted, we can start configuring our deployment!
FYI, the javadoc and source Maven plugins should normally already be configured correctly.
Is your feature request related to a problem? Please describe.
Be able to benchmark the performance of a factory.
Describe the solution you'd like
By adding an annotation @Benchmark to the read, process, or write method, we measure the elapsed computing time. The benchmarking could be activated/deactivated by the user.
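A hypothetical usage of the proposed annotation (the activation handling is an assumption):

class MyFactory extends Factory[DataFrame] {
  @Benchmark
  override def process(): this.type = {
    // the elapsed time of this method would be measured and logged by the framework
    this
  }
}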
Describe the bug
When two or more deliveries have the same type but different IDs, and autoLoad is set to true, the pipeline is not able to deliver the loaded dataset, even though the delivery pool contains repositories corresponding to the delivery IDs.
If a SparkRepository[T] has a DBConnector, then when the method findBy() is called, one should check whether the column we'd like to filter on is a key column or not.
This could be achieved by looking into the annotations of the case class T.
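A minimal sketch of that check using Scala runtime reflection (not the actual SETL code; the annotation lookup is an assumption):

import scala.reflect.runtime.{universe => ru}

// True if the case class field named `column` carries a CompoundKey annotation
def isKeyColumn[T: ru.TypeTag](column: String): Boolean =
  ru.typeOf[T].decl(ru.termNames.CONSTRUCTOR).asMethod.paramLists.flatten.exists { p =>
    p.name.toString == column &&
      p.annotations.exists(_.tree.tpe.typeSymbol.name.toString == "CompoundKey")
  }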
Is your feature request related to a problem? Please describe.
Be able to generate a Mermaid diagram after the inspection of PipelineInspector.
Describe the solution you'd like
As we are no longer on the private JCDecaux GitLab, we need to choose a new tool for CI/CD.
I suggest that we use AWS CodeBuild. We also have the possibility to use Travis CI, but it's a little bit more complicated.
Check the user guide of CodeBuild: https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html
If you have some time, you can check whether we can create services for unit tests (as you did for Cassandra, for example) with CodeBuild.
Tests not passing
Use Travis for testing PRs so that the test report is visible to external contributors. Once the PR is merged, use AWS CodeBuild for the deployment.
Delivery annotation on private variables
Since private[this] var is handled differently by the Scala compiler compared to private var, we should review the design of PipelineInspector and DeliverableDispatcher.
Is your feature request related to a problem? Please describe.
The common way in Scala to declare an uninitialized field is to use something empty instead of null (see https://alvinalexander.com/scala/scala-null-values-option-uninitialized-variables).
Describe the solution you'd like
private[this] val repo = SparkRepository[MyClass]