setl-framework / setl
A simple Spark-powered ETL framework that just works 🍺
License: Apache License 2.0
Sub-issue of #54
Description
When a repository has a user-defined schema, is used to both load and save the data, and its type carries annotations like ColumnName and/or CompoundKey, the repository will not be able to read the data back after the execution of save. When it saves the data, it renames/adds columns according to the case class' annotations, which breaks the user-defined schema.
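For illustration, a minimal sketch of a case class whose annotations alter the persisted schema (the package path and field names below are assumptions, not taken from the issue):

import com.jcdecaux.setl.annotation.{ColumnName, CompoundKey}

case class User(
  @ColumnName("user_id") userId: String,          // persisted as "user_id" instead of "userId"
  @CompoundKey("partition", "1") country: String  // an extra key column is added on save
)

After a save, the stored columns no longer match the user-defined schema, so the subsequent read fails.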
Describe the bug
When the app environment is local, the Spark master URL is always set to local and can't be overridden.
To Reproduce
new SparkSessionBuilder().setEnv("local").setSparkMaster("test").build()
The master URL will still be "local"
Expected behavior
We should be able to override the master URL in the local environment by calling the setSparkMaster method of SparkSessionBuilder.
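For example (a sketch; the trailing get() returning the built SparkSession is an assumption about the builder API):

val spark = new SparkSessionBuilder()
  .setEnv("local")
  .setSparkMaster("local[4]")  // the explicitly set master should take precedence
  .build()
  .get()                       // assumed to return the underlying SparkSession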
The DCContext builder should be able to parse Spark-related configuration from the config file when setDCContextConfigPath is called, from a section like the following:
context_config {
  spark {
    master_url = "url"
    app_env = ""
  }
}
Need someone to implement the following test: given a Deliverable[T] object, where T is a primitive type, and a Factory with an input of type T, DeliverableDispatcher should inject the deliverable into the factory. This can be tested by calling the method testDispatch(factory).
It should also be possible to set primitive-type inputs with the setInput method of Pipeline, and the pipeline should be able to inject these inputs into their corresponding factories. In short, DeliverableDispatcher and Pipeline should be able to handle primitive type delivery.
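A rough sketch of such a test (the factory below is invented for illustration; Delivery, setInput, and testDispatch come from the issue):

class ThresholdFactory extends Factory[Double] {
  @Delivery var threshold: Double = _   // primitive-type input to be injected

  override def read(): this.type = this
  override def process(): this.type = this
  override def write(): this.type = this
  override def get(): Double = threshold * 2
}

new Pipeline()
  .setInput[Double](0.5)                // primitive delivery
  .addStage(new ThresholdFactory)
  .run()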
Is your feature request related to a problem? Please describe.
The CI is configured only to build, test, and deploy setl_2.11, which is built against Scala binary version 2.11. We'd like to add support for Scala 2.12 and make its testing and deployment automatic in our CI.
Proposed solution
We may use a shell script to replace the scala.version and scala.compat.version properties in the pom file.
Description
Need help: the test coverage report is created with scoverage and can be found in target/site/scoverage. We could publish it to a public bucket.
When several compound keys are set on the same field, only the last one is considered.
Example:
@CompoundKey("partition", "1") @CompoundKey("sort", "1") environment: String
Here, only the sort compound key is taken into account.
Is your feature request related to a problem? Please describe.
We'd like to be able to create a table in the database for the input DataFrame.
Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection. The SQL query could be generated from the schema of the DataFrame.
If possible, check also whether we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 🙂 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils).
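A minimal sketch of the DriverManager approach (the type mapping and method names below are assumptions for illustration, not the actual implementation):

import java.sql.DriverManager
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Map Spark SQL types to generic SQL column types (simplified for this sketch)
def toSqlType(dt: DataType): String = dt match {
  case StringType  => "TEXT"
  case IntegerType => "INTEGER"
  case LongType    => "BIGINT"
  case DoubleType  => "DOUBLE PRECISION"
  case BooleanType => "BOOLEAN"
  case _           => "TEXT"
}

// Generate a CREATE TABLE statement from the DataFrame schema and execute it via JDBC
def createTable(df: DataFrame, table: String, url: String, user: String, pwd: String): Unit = {
  val columns = df.schema.fields.map(f => s"${f.name} ${toSqlType(f.dataType)}").mkString(", ")
  val conn = DriverManager.getConnection(url, user, pwd)
  try conn.createStatement().execute(s"CREATE TABLE IF NOT EXISTS $table ($columns)")
  finally conn.close()
}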
This issue is similar to #82
As I have never tried porting code between Scala and Python, I'd like some help. We should try to have the same API as the Scala SETL.
In org.apache.spark.sql.execution.datasources.jdbc, Spark has already defined configuration classes like JDBCOptions. Can we integrate these classes to replace our conf classes?
The API doc is generated in target/site/scaladocs. We could publish it to a public bucket.
Sub-issue of #54
Instead of requiring the user to implement the write method of a factory, we add a new annotation @Write to declare a DataFrame/Dataset to be written. The pipeline will inspect the factory and find all the variables carrying the Write annotation. Then, according to the type (DataFrame or Dataset), we use a Connector or a SparkRepository to write the data.
Be careful when the variable to be written is a DataFrame: in this case, a delivery ID should be specified in order to find the right connector.
We implement a default write method so that it doesn't need to be implemented by the user, and it does nothing when it's not overridden:
def write(): this.type = this
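A hypothetical usage sketch of the proposed annotation (all names below are illustrative; the feature is not implemented yet):

case class Output(id: String)

class MyFactory extends Factory[DataFrame] {
  @Write(deliveryId = "rawOutput")    // a delivery ID is required for DataFrames to find the right connector
  var rawOutput: DataFrame = _

  @Write                              // Datasets can be matched by type via a SparkRepository
  var typedOutput: Dataset[Output] = _
}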
Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" (drop a table in a database, or remove files/folders in a file system) or "delete" (remove records in a DB, or partitions/files in a file system).
Describe the solution you'd like
Implement the "drop" and "delete" methods in both FileConnector and DBConnector.
Description
ExcelConnector doesn't inherit the class FileConnector. We want to refactor this class so that it inherits FileConnector and all of its functionality.
Is your feature request related to a problem? Please describe.
We'd like to be able to delete rows of a table in the database.
Describe the solution you'd like
As the connection information is given in the configuration, we can use java.sql.DriverManager to create the DB connection. Then we can generate and execute the SQL query with the connection.
If possible, check also whether we can retrieve the connection directly from Spark. You may need to read a bit of the Spark source code 🙂 (cf. org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils).
The input argument "query" should contain only the condition part of a complete SQL statement. An exception should be thrown if the query is not well formatted.
This issue is similar to #81
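A minimal sketch of the condition-only contract described above (the method and its validation are assumptions for illustration):

import java.sql.Connection

def delete(conn: Connection, table: String, query: String): Unit = {
  // "query" must be a bare condition such as "country = 'FR'", never a full statement
  require(!query.trim.matches("(?i)^(delete|select|insert|update)\\b.*"),
    s"Malformed query: expected a condition only, got '$query'")
  conn.createStatement().execute(s"DELETE FROM $table WHERE $query")
}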
Is your feature request related to a problem? Please describe.
If there are lots of columns in a data source and we are not able to create a case class that covers all of them, we cannot use SparkRepository.
Describe the solution you'd like
SparkRepository should keep only the fields defined in the case class (taking the ColumnName annotation into account as well) and drop the other columns.
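A sketch of the intended behavior (in practice the expected column names would be derived from the case class fields and their ColumnName annotations; here the list is passed in directly):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only the columns declared by the case class (after ColumnName renaming), drop the rest
def keepDefinedColumns(df: DataFrame, expected: Seq[String]): DataFrame =
  df.select(expected.filter(df.columns.contains).map(col): _*)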
Is your feature request related to a problem? Please describe.
Spark only starts the computation when an action is triggered, and the same computation can run multiple times if several actions are called (for example, when one calls Factory.write() and then Factory.get(), the data processing could be triggered twice).
In the case where the computation is more costly than the storage, we'd prefer to cache the output data in order to accelerate the process.
Describe the solution you'd like
Developers can indeed handle the caching by explicitly calling cache() or persist(storageLevel). But is it possible to do it automatically?
I'd like to add a new annotation called Cache that could be applied to a factory class. The pipeline should automatically cache the output after the invocation of the process() method of this factory, so that when we save or get the output, the computation is not triggered more than once.
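A hypothetical usage sketch of the proposed annotation (the parameter name and storage level handling are assumptions):

@Cache(storageLevel = "MEMORY_AND_DISK")  // the pipeline would persist the output right after process()
class CostlyFactory extends Factory[DataFrame] {
  // read/process/write/get as usual; write() and get() would then reuse the cached output
}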
Is your feature request related to a problem? Please describe.
Improve the printing of the Mermaid code and generate a Mermaid live editor URL. Basically, encode the diagram code in the following JSON format with the base64URL scheme:
{"code": "...","mermaid":{"theme":"default"}}
Describe the solution you'd like
Expected result in the console:
[Other logs...]
--------- MERMAID DIAGRAM ---------
classDiagram
  class MyFactory {
    <<Factory[Dataset[TestObject]]>>
    +SparkRepository[TestObject]
  }
------- END OF MERMAID CODE -------
You can copy the previous code to a markdown viewer that supports Mermaid.
Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor/#/edit/{BASE64_OF_YOUR_DIAGRAM}
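A sketch of the encoding step (the URL format is taken from the issue; the JSON escaping here is simplified):

import java.nio.charset.StandardCharsets
import java.util.Base64

def liveEditorUrl(mermaidCode: String): String = {
  val escaped = mermaidCode.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")
  val json    = s"""{"code":"$escaped","mermaid":{"theme":"default"}}"""
  val b64url  = Base64.getUrlEncoder.withoutPadding.encodeToString(json.getBytes(StandardCharsets.UTF_8))
  s"https://mermaid-js.github.io/mermaid-live-editor/#/edit/$b64url"
}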
Refactor setter methods
Is your feature request related to a problem? Please describe.
Currently, we are only able to delete files with FileConnectors. We'd like to be able to "drop" a table with a DBConnector.
https://central.sonatype.org/pages/ossrh-guide.html
@maroil can you create an account in the OSSRH Jira and apply for the deployer right?
Once the permission is granted, we can start configuring our deployment!
FYI, the javadoc and source Maven plugins should normally already be configured correctly.
Is your feature request related to a problem? Please describe.
Be able to benchmark the performance of a factory.
Describe the solution you'd like
By adding an annotation @Benchmark to the read, process, or write method, we measure the elapsed computing time. The benchmarking could be activated/deactivated by the user.
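A hypothetical usage of the proposed annotation (the activation handling is an assumption):

class MyFactory extends Factory[DataFrame] {
  @Benchmark
  override def process(): this.type = {
    // the elapsed time of this method would be measured and logged by the framework
    this
  }
}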
Describe the bug
When two or more deliveries have the same type but different IDs, and autoLoad is set to true, the pipeline is not able to deliver the loaded dataset, even though the delivery pool contains repositories corresponding to the delivery IDs.
If a SparkRepository[T] has a DBConnector, then when the method findBy() is called, one should check whether the column we'd like to filter on is a key column or not.
This could be achieved by looking into the annotations of the case class T.
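A minimal sketch of that check using Scala runtime reflection (not the actual SETL code; the annotation lookup is an assumption):

import scala.reflect.runtime.{universe => ru}

// True if the case class field named `column` carries a CompoundKey annotation
def isKeyColumn[T: ru.TypeTag](column: String): Boolean =
  ru.typeOf[T].decl(ru.termNames.CONSTRUCTOR).asMethod.paramLists.flatten.exists { p =>
    p.name.toString == column &&
      p.annotations.exists(_.tree.tpe.typeSymbol.name.toString == "CompoundKey")
  }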
Is your feature request related to a problem? Please describe.
Be able to generate a Mermaid diagram after the inspection of PipelineInspector.
Describe the solution you'd like
As we are no longer on the private JCDecaux GitLab, we need to choose a new tool for CI/CD.
I suggest that we use AWS CodeBuild. We also have the possibility to use Travis CI, but it's a little bit more complicated.
Check the user guide of CodeBuild: https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html
If you have some time, you can check whether we can create services for unit tests (as you did for Cassandra, for example) with CodeBuild.
Tests not passing
Use Travis for testing PRs so that the test report is visible to external contributors. Once the PR is merged, use AWS CodeBuild for the deployment.
Delivery annotation on private variables
Since private[this] var is handled differently by the Scala compiler compared to private var, we should review the design of PipelineInspector and DeliverableDispatcher.
Is your feature request related to a problem? Please describe.
The common way in Scala to declare an uninitialized field is to use something empty instead of null (see https://alvinalexander.com/scala/scala-null-values-option-uninitialized-variables).
Describe the solution you'd like
private[this] val repo = SparkRepository[MyClass]