Comments (7)

vgkowski commented on August 18, 2024

> I am in the process of building my own solutions to the above, as I hadn't heard of data-solutions-framework-on-aws before. I've looked at aws-ddk, but it did not help with Glue development either. This is my project: glue-pyspark-dev-tools.
> If there is alignment, I'll be happy to help add my planned features to this project.

@dashmug I see your tool as an equivalent of the EMR toolkit but for Glue: a packaged solution based on this blog post. Am I correct?
If yes, your solution would tackle the local dev and unit testing parts, which is great! I think DSF would be complementary and could add value by packaging this local development work so it is deployable as a Glue job. We just need to ensure neither solution requires the other.

What I am thinking of now is to provide as part of DSF:

  1. An abstracted construct for the Glue job with smart defaults and best practices, similar to the SparkEmrServerlessJob construct.
  2. A Glue job packager construct that takes your local environment and makes it available/consumable by Glue, similar to the PySparkApplicationPackage but tailored to Glue's specifics (a rough sketch of the interface follows this list).
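
To make the proposal concrete, here is a rough interface sketch of what the pair could look like from a user's perspective in a Python CDK app. `GluePySparkApplicationPackage` and `GlueJob` are hypothetical names that do not exist in DSF today; the shape simply mirrors the existing `PySparkApplicationPackage` and `SparkEmrServerlessJob` constructs:

```python
from aws_cdk import Stack
from constructs import Construct

class GlueJobStack(Stack):
    def __init__(self, scope: Construct, id: str) -> None:
        super().__init__(scope, id)

        # Hypothetical packager: bundle the local entrypoint and shared
        # utilities and upload them to S3, as PySparkApplicationPackage
        # does for EMR. Paths are illustrative.
        package = GluePySparkApplicationPackage(  # hypothetical construct
            self, "Package",
            entrypoint_path="./src/jobs/ingest.py",
            dependencies_folder="./src/common",
        )

        # Hypothetical L2-style job construct with smart defaults
        # (Glue version, worker type, continuous logging, retries).
        GlueJob(  # hypothetical construct
            self, "IngestJob",
            script_location=package.entrypoint_uri,
            extra_python_files=[package.dependencies_uri],
        )
```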

vgkowski commented on August 18, 2024

Thanks for providing feedback! Can you give us more details on what you would like to see in this construct? Think about your user experience as a data engineer and how this construct could help you, given your preferences.

dashmug commented on August 18, 2024

A few ideas:

  1. Glue is non-trivial to replicate locally, so engineers end up iterating on their scripts in the cloud, which makes the development cycle slow.
  2. Glue's CDK constructs are still L1, which makes them too low-level, and the development experience is not great.
  3. Glue's CloudFormation deployment only deploys a single script per job. If you are developing multiple scripts that share common utility functions (to stay DRY), you have to package them into a Python package, upload it to S3, and then reference it in your Glue job. Again, all of this makes it not so developer-friendly (a sketch of these manual steps follows this list).
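
To illustrate point 3, here is a minimal sketch of the manual wiring this implies today with the L1 construct (CDK v2, Python). The role ARN, paths, and bucket layout are illustrative, and the shared-utilities zip is assumed to be built beforehand:

```python
from aws_cdk import Stack, aws_glue as glue, aws_s3_assets as assets
from constructs import Construct

class GlueJobsStack(Stack):
    def __init__(self, scope: Construct, id: str) -> None:
        super().__init__(scope, id)

        # Upload the job script and the pre-built shared-utilities zip as S3 assets.
        script = assets.Asset(self, "Script", path="src/jobs/ingest.py")
        common = assets.Asset(self, "CommonUtils", path="dist/common_utils.zip")

        glue.CfnJob(
            self, "IngestJob",
            name="ingest-job",
            role="arn:aws:iam::123456789012:role/GlueJobRole",  # illustrative role
            glue_version="4.0",
            command=glue.CfnJob.CommandProperty(
                name="glueetl",
                python_version="3",
                script_location=script.s3_object_url,
            ),
            default_arguments={
                # The shared package must be wired into every job by hand.
                "--extra-py-files": common.s3_object_url,
            },
        )
```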

I am in the process of building my own solutions to the above, as I hadn't heard of data-solutions-framework-on-aws before. I've looked at aws-ddk, but it did not help with Glue development either. This is my project: glue-pyspark-dev-tools.

If there is alignment, I'll be happy to help add my planned features to this project.

klescosia commented on August 18, 2024

Bouncing off your ideas...

  1. Yes, we end up iterating/running/testing scripts in the cloud. We also use Athena to test our transformation logic, since I mostly advocate using Spark SQL scripts for our transformations instead of PySpark (a sketch of this style follows the job list below).

Our jobs are structured as follows:

  • Ingestion
  • Staging
  • Transformation
  • Loading (to Redshift)
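
As a minimal sketch of that Spark-SQL-first style (table names and S3 paths are illustrative; Athena's SQL dialect differs from Spark SQL in places, so only dialect-compatible statements can be smoke-tested this way):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The transformation is plain SQL, so the same statement can be tried
# in the Athena console before the Glue job runs it.
TRANSFORM_SQL = """
    SELECT customer_id,
           SUM(amount) AS total_amount
    FROM staging_orders
    GROUP BY customer_id
"""

# Register the staged data and run the shared SQL (paths are illustrative).
spark.read.parquet("s3://my-bucket/staging/orders/").createOrReplaceTempView("staging_orders")
result = spark.sql(TRANSFORM_SQL)
result.write.mode("overwrite").parquet("s3://my-bucket/transformed/orders/")
```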

What I did for our deployment was to have two config files. One is a CSV file that contains the JobName, Classification (default/custom), Category (Ingestion, etc.), and ConnectionName (since our jobs run in a private network); the CDK loops through this CSV file to deploy the Glue jobs. The other config file manages the custom jobs (by Classification) that were tagged in the CSV file. (A sketch of this loop follows.)
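
A minimal sketch of that CSV-driven loop (CDK v2, Python); the role ARN, script bucket, and file layout are illustrative:

```python
import csv

from aws_cdk import Stack, aws_glue as glue
from constructs import Construct

class ConfigDrivenGlueStack(Stack):
    def __init__(self, scope: Construct, id: str) -> None:
        super().__init__(scope, id)

        # Columns: JobName, Classification (default/custom), Category, ConnectionName
        with open("config/jobs.csv", newline="") as f:
            for row in csv.DictReader(f):
                glue.CfnJob(
                    self, row["JobName"],
                    name=row["JobName"],
                    role="arn:aws:iam::123456789012:role/GlueJobRole",  # illustrative
                    command=glue.CfnJob.CommandProperty(
                        name="glueetl",
                        python_version="3",
                        script_location=f"s3://my-bucket/scripts/{row['JobName']}.py",
                    ),
                    # Jobs run in a private network, so attach the named connection.
                    connections=glue.CfnJob.ConnectionsListProperty(
                        connections=[row["ConnectionName"]],
                    ),
                    tags={
                        "Category": row["Category"],
                        "Classification": row["Classification"],
                    },
                )
```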

lmouhib commented on August 18, 2024

One more point to consider for the feature: provide a way to run unit tests by inferring the arguments from the job construct and running them against the Glue runtime Docker container. (A sketch of such a test follows.)
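
For example, a plain pytest file like the sketch below can be executed inside the official Glue Docker image (e.g. `amazon/aws-glue-libs:glue_libs_4.0.0_image_01`), where `pyspark` and `awsglue` are preinstalled; the test itself is illustrative:

```python
# test_transforms.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Inside the Glue image, pyspark (and awsglue) are preinstalled, so a
    # plain local SparkSession is enough for unit tests.
    return SparkSession.builder.master("local[*]").appName("glue-unit-tests").getOrCreate()

def test_dedup_keeps_distinct_rows(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    assert df.dropDuplicates().count() == 2
```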

vgkowski commented on August 18, 2024

> What I did for our deployment was to have two config files. One is a CSV file that contains the JobName, Classification (default/custom), Category (Ingestion, etc.), and ConnectionName (since our jobs run in a private network); the CDK loops through this CSV file to deploy the Glue jobs. The other config file manages the custom jobs (by Classification) that were tagged in the CSV file.

@klescosia Do I understand correctly that you have implemented a config-file-based approach on top of CDK and Glue to create Glue jobs in a simpler way than the CDK L1 construct?

klescosia commented on August 18, 2024

> > What I did for our deployment was to have two config files. One is a CSV file that contains the JobName, Classification (default/custom), Category (Ingestion, etc.), and ConnectionName (since our jobs run in a private network); the CDK loops through this CSV file to deploy the Glue jobs. The other config file manages the custom jobs (by Classification) that were tagged in the CSV file.
>
> @klescosia Do I understand correctly that you have implemented a config-file-based approach on top of CDK and Glue to create Glue jobs in a simpler way than the CDK L1 construct?

Yes, that is correct. We have many Glue jobs, each with different functionality and configurations. So I'm looping through the CSV file and instantiating glue.CfnJob (I'm using the Python CDK), and I also have a YAML file that stores the configurations (number of workers, worker types, S3 paths, etc.) for both default and custom jobs. (A sketch of the YAML lookup follows.)
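
A minimal sketch of that YAML lookup (requires PyYAML; keys and file names are illustrative):

```python
import yaml  # requires PyYAML

with open("config/job_settings.yaml") as f:
    settings = yaml.safe_load(f)

def job_settings(row: dict) -> dict:
    """Return the worker configuration for a CSV row: custom jobs get their
    own overrides, everything else uses the shared default profile."""
    if row["Classification"] == "custom":
        return settings["custom"][row["JobName"]]
    return settings["default"]

# e.g. inside the CfnJob loop shown earlier:
#   cfg = job_settings(row)
#   glue.CfnJob(..., worker_type=cfg["worker_type"],
#               number_of_workers=cfg["number_of_workers"])
```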
