vincentclaes / datajob

Build and deploy a serverless data pipeline on AWS with no effort.

Home Page: https://pypi.org/project/datajob/

License: Apache License 2.0

Languages: Python 94.66%, Makefile 0.93%, Dockerfile 4.41%
Topics: aws-cdk, data-pipeline, stepfunctions, glue, glue-job, sagemaker, aws, pipeline, serverless, machine-learning

datajob's Introduction


Build and deploy a serverless data pipeline on AWS with no effort.
Our goal is to let developers think about the business logic; datajob does the rest.



  • Deploy code to Python shell / PySpark AWS Glue jobs.
  • Use AWS SageMaker to create ML models.
  • Orchestrate the above jobs using AWS Step Functions, as simply as task1 >> task2
  • Let us know what you want to see next.


Installation

Datajob can be installed using pip.
Beware that we depend on the AWS CDK CLI!

pip install datajob
npm install -g [email protected] # the latest version of datajob depends on this version

Quickstart

You can find the full example in examples/data_pipeline_simple.

We have a simple data pipeline composed of 2 Glue jobs, orchestrated sequentially using Step Functions.

from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

app = core.App()

# The datajob_stack is the instance that will result in a cloudformation stack.
# We inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(scope=app, id="data-pipeline-simple") as datajob_stack:
    # We define 2 glue jobs with the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py"
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py"
    )

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        task1 >> task2

app.synth()

We add the above code in a file called datajob_stack.py in the root of the project.
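
For reference, a minimal project layout for this quickstart would look roughly like this (directory and file names taken from the example above; your own layout may differ):

data_pipeline_simple/
├── datajob_stack.py
└── glue_jobs/
    ├── task.py
    └── task2.py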

Configure CDK

Follow the steps here to configure your credentials.

export AWS_PROFILE=default
# use the aws cli to get your account number
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text --profile $AWS_PROFILE)
export AWS_DEFAULT_REGION=eu-west-1

# init cdk
cdk bootstrap aws://$AWS_ACCOUNT/$AWS_DEFAULT_REGION

Deploy

Deploy the pipeline using CDK.

cd examples/data_pipeline_simple
cdk deploy --app  "python datajob_stack.py" --require-approval never

Execute

datajob execute --state-machine data-pipeline-simple-workflow

The terminal will show a link to the step functions page to follow up on your pipeline run.


Destroy

cdk destroy --app  "python datajob_stack.py"

Examples

All our examples are in ./examples

Functionality

Deploy to a stage

Specify a stage to deploy an isolated pipeline.

Typical examples would be dev, prod, ...

cdk deploy --app "python datajob_stack.py" --context stage=my-stage
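
The stage appears to be woven into the names of the provisioned resources (compare the data-pipeline-simple-None-deployment-bucket error in the issues below). Assuming the state machine follows the same convention, executing the staged pipeline would look roughly like the command below; verify the actual state machine name in your account:

datajob execute --state-machine data-pipeline-simple-my-stage-workflow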

Using datajob's S3 data bucket

Dynamically reference the data bucket name in the arguments of your GlueJob via datajob_stack.context.data_bucket_name.

import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
        scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:
    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )

    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
        pyspark_job >> ...

You can find this example here.

Deploy files to the datajob's deployment bucket

Specify the path to the folder you would like to include in the deployment bucket.

from aws_cdk import core
from datajob.datajob_stack import DataJobStack

app = core.App()

with DataJobStack(
    scope=app, id="some-stack-name", include_folder="path/to/folder/"
) as datajob_stack:

    ...

Package your project as a wheel and ship it to AWS

You can find the example here

# We add the path to the project root in the constructor of DataJobStack.
# By specifying project_root, datajob will look for a .whl in
# the dist/ folder in your project_root.
with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:

Package your project using poetry

poetry build
cdk deploy --app "python datajob_stack.py"

Package your project using setup.py

python setup.py bdist_wheel
cdk deploy --app "python datajob_stack.py"

You can also use the datajob CLI to run both commands at once:

# for poetry
datajob deploy --config datajob_stack.py --package poetry

# for setup.py
datajob deploy --config datajob_stack.py --package setuppy

Processing big data using a Glue Pyspark job

import pathlib

from aws_cdk import core
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob

current_dir = str(pathlib.Path(__file__).parent.absolute())

app = core.App()

with DataJobStack(
        scope=app, id="datajob-python-pyspark", project_root=current_dir
) as datajob_stack:
    pyspark_job = GlueJob(
        datajob_stack=datajob_stack,
        name="pyspark-job",
        job_path="glue_job/glue_pyspark_example.py",
        job_type="glueetl",
        glue_version="2.0",  # we only support glue 2.0
        python_version="3",
        worker_type="Standard",  # options are Standard / G.1X / G.2X
        number_of_workers=1,
        arguments={
            "--source": f"s3://{datajob_stack.context.data_bucket_name}/raw/iris_dataset.csv",
            "--destination": f"s3://{datajob_stack.context.data_bucket_name}/target/pyspark_job/iris_dataset.parquet",
        },
    )

The full example can be found in examples/data_pipeline_pyspark.

Orchestrate stepfunctions tasks in parallel

# Task2 comes after task1. Task4 comes after task3.
# Task5 depends on both task2 and task4 being finished.
# Therefore task1 and task3 can run in parallel,
# as well as task2 and task4.
with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
    task1 >> task2
    task3 >> task4
    task2 >> task5
    task4 >> task5

More can be found in examples/data_pipeline_parallel

Orchestrate 1 stepfunction task

Use the Ellipsis object (...) to orchestrate a single job via Step Functions.

some_task >> ...
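
In context, a minimal sketch (assuming some_task is a GlueJob defined on the same datajob_stack, as in the earlier examples):

with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as sfn:
    # a single task followed by the Ellipsis object
    some_task >> ...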

Notify in case of error/success

Provide the parameter notification in the constructor of a StepfunctionsWorkflow object. This will create an SNS Topic that is triggered in case of failure or success. The email address will be subscribed to the topic and will receive the notification in its inbox.

with StepfunctionsWorkflow(datajob_stack=datajob_stack,
                           name="workflow",
                           notification="[email protected]") as sfn:
    task1 >> task2

You can provide 1 email or a list of emails ["[email protected]", "[email protected]"].
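
For example, with a list of recipients (addresses redacted here, as elsewhere on this page):

with StepfunctionsWorkflow(datajob_stack=datajob_stack,
                           name="workflow",
                           notification=["[email protected]", "[email protected]"]) as sfn:
    task1 >> task2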

Datajob in depth

The datajob_stack is the instance that will result in a CloudFormation stack. The project_root path helps datajob_stack locate the root of the project, where the setup.py or poetry pyproject.toml file can be found, as well as the dist/ folder containing the wheel of your project.

import pathlib
from aws_cdk import core

from datajob.datajob_stack import DataJobStack

current_dir = pathlib.Path(__file__).parent.absolute()
app = core.App()

with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:

    ...

When entering the contextmanager of DataJobStack:

A DataJobContext is initialized to deploy and run a data pipeline on AWS. The following resources are created:

  1. "data bucket"
    • an S3 bucket that you can use to dump ingested data, dump intermediate results and the final output.
    • you can access the data bucket as a Bucket object via datajob_stack.context.data_bucket
    • you can access the data bucket name via datajob_stack.context.data_bucket_name
  2. "deployment bucket"
    • an S3 bucket to deploy code, artifacts, scripts, config files, ...
    • you can access the deployment bucket as a Bucket object via datajob_stack.context.deployment_bucket
    • you can access the deployment bucket name via datajob_stack.context.deployment_bucket_name

When exiting the context manager, all the resources of our DataJobStack object are created.
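
A small sketch of how these attributes can be used inside the context manager (reusing app and current_dir from the example above and mirroring the Glue job arguments shown earlier; the s3 prefixes are just examples):

with DataJobStack(
    scope=app, id="data-pipeline-pkg", project_root=current_dir
) as datajob_stack:
    # the S3 Bucket objects created by the DataJobContext
    data_bucket = datajob_stack.context.data_bucket
    deployment_bucket = datajob_stack.context.deployment_bucket
    # the bucket names, handy for composing s3 paths in job arguments
    source = f"s3://{datajob_stack.context.data_bucket_name}/raw/"
    destination = f"s3://{datajob_stack.context.data_bucket_name}/target/"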

We can write the above example more explicitly...

import pathlib
from aws_cdk import core

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

current_dir = pathlib.Path(__file__).parent.absolute()

app = core.App()

datajob_stack = DataJobStack(scope=app, id="data-pipeline-pkg", project_root=current_dir)
datajob_stack.init_datajob_context()

task1 = GlueJob(datajob_stack=datajob_stack, name="task1", job_path="glue_jobs/task.py")
task2 = GlueJob(datajob_stack=datajob_stack, name="task2", job_path="glue_jobs/task2.py")

with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow") as step_functions_workflow:
    task1 >> task2

datajob_stack.create_resources()
app.synth()

Ideas

Any suggestions can be shared by starting a discussion.

These are the ideas we find interesting to implement:

  • add a time based trigger to the step functions workflow.
  • add an s3 event trigger to the step functions workflow.
  • add a lambda that copies data from one s3 location to another.
  • version your data pipeline.
  • cli command to view the logs / glue jobs / s3 bucket
  • implement sagemaker services
    • processing jobs
    • hyperparameter tuning jobs
    • training jobs
  • implement lambda
  • implement ECS Fargate
  • create a serverless UI that follows up on the different pipelines deployed on possibly different AWS accounts using Datajob

Feedback is much appreciated!

datajob's People

Contributors

dependabot[bot], lorenzocevolani, petervandenabeele, vincentclaes


datajob's Issues

upgrade to cdk 1.87.1

  • if you want this to work in a VS Code devcontainer
  • you need 1.87.1 for the cli and the python libs
  • fix dependencies

make the workflow name optional

    with StepfunctionsWorkflow(
        datajob_stack=mailswitch_stack, name="workflow"
    ) as step_functions_workflow:
        join_labels >> ...

It might also be easier to execute a workflow that has the same name as the stack.

README: Add explanation how to run the tests

On a Linux box with conda, this could be an explanation on how to get pytest running.

/home/peter_v/anaconda3/bin/python -m pip install --upgrade pip  # to avoid warnings about spyder 4.1.5 versions
make
sudo apt install nodejs  # to avoid massive warnings about RuntimeError: generator didn't stop after throw()

$ poetry run pytest
========================================== test session starts ===========================================
platform linux -- Python 3.8.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/peter_v/data/github/vincentclaes/datajob
collected 16 items                                                                                       

datajob_tests/test_datajob_context.py .                                                            [  6%]
datajob_tests/test_datajob_stack.py ....                                                           [ 31%]
datajob_tests/datajob_cli_tests/test_datajob_deploy.py .......                                     [ 75%]
datajob_tests/datajob_cli_tests/test_datajob_execute.py .                                          [ 81%]
datajob_tests/glue/test_glue_job.py .                                                              [ 87%]
datajob_tests/stepfunctions/test_stepfunctions_workflow.py ..                                      [100%]

=========================================== 16 passed in 5.62s ===========================================

subclass SomeMockedClass from DatajobBase


@stepfunctions_workflow.task
class SomeMockedClass(object):
    def __init__(self, unique_name):
        self.unique_name = unique_name
        self.sfn_task = Task(state_id=unique_name)

so that it better resembles reality

if stepfunctions workflow is None we should handle this

If an error occurs, it looks like the context manager's exit function is called while the workflow is still None, which raises another exception when trying to create the resources.

KeyError: 'AWS_DEFAULT_REGION'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py", line 91, in <module>
    ] >> crop_raster_per_country >> dump_data_layer_to_gbq >> dump_display_names_to_gbq
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 72, in __exit__
    self.create_resources()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 91, in create_resources
    [resource.create() for resource in self.resources]
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob_stack.py", line 91, in <listcomp>
    [resource.create() for resource in self.resources]
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/stepfunctions/stepfunctions_workflow.py", line 102, in create
    text_file.write(self.workflow.get_cloudformation_template())
AttributeError: 'NoneType' object has no attribute 'get_cloudformation_template'

mention cdk bootstrap in the readme

Do you wish to deploy these changes (y/n)? y
data-pipeline-simple-dev: deploying...

 โŒ  data-pipeline-simple-dev failed: Error: This stack uses assets, so the toolkit stack must be deployed to the environment (Run "cdk bootstrap aws://077590795309/eu-west-1")
    at Object.addMetadataAssetsToManifest (/usr/local/lib/node_modules/aws-cdk/lib/assets.ts:27:11)
    at Object.deployStack (/usr/local/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:205:29)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at CdkToolkit.deploy (/usr/local/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/usr/local/lib/node_modules/aws-cdk/bin/cdk.ts:204:9)
This stack uses assets, so the toolkit stack must be deployed to the environment (Run "cdk bootstrap aws://077590795309/eu-west-1")
Traceback (most recent call last):
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/bin/datajob", line 5, in <module>
    run()
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 20, in run
    app()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/datajob-KxqvMF6C-py3.6/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 51, in deploy
    call_cdk(command="deploy", args=args, extra_args=extra_args)
  File "/Users/vincent/Workspace/datajob/datajob/datajob.py", line 103, in call_cdk
    subprocess.check_call(shlex.split(full_command))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cdk', 'deploy', '--app', 'python /Users/vincent/Workspace/datajob/examples/data_pipeline_simple/datajob_stack.py', '-c', 'stage=dev']' returned non-zero exit status 1.
(datajob-KxqvMF6C-py3.6) Vincents-MacBook-Pro:data_pipeline_simple vincent$ 

expand datajob to deploy ecs fargate tasks

  • can we subclass from DataJobBase and implement the requirements for an ecs fargate task/job?
  • maybe name it FargateJob? (Job is consistent within the lib, but I think task is the correct term for ECS/Fargate.)
  • can we add ecs fargate job to stepfunctions workflow?
  • add test to create a class FargateJob
  • add example to synth a job with fargate in github actions

better handle when no aws account could be resolved

check this in advance and raise an error from within datajob

Unable to resolve AWS account to use. It must be either configured when you define your CDK or through the environment
Traceback (most recent call last):
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/bin/datajob", line 8, in <module>
    sys.exit(run())
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 17, in run
    app()
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 37, in deploy
    call_cdk(command="deploy", args=args, extra_args=extra_args)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/datajob.py", line 73, in call_cdk
    subprocess.check_call(shlex.split(full_command))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cdk', 'deploy', '--app', 'python /Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py', '-c', 'stage=stg']' returned non-zero exit status 1.
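
A possible pre-flight check (just a sketch of the proposal, not part of datajob today) could verify that credentials resolve to an account before shelling out to cdk:

import boto3
from botocore.exceptions import BotoCoreError, ClientError


def assert_aws_account_resolvable() -> str:
    """Raise a clear error from within datajob when no AWS account can be resolved."""
    try:
        # ask STS which account the current credentials belong to
        return boto3.client("sts").get_caller_identity()["Account"]
    except (BotoCoreError, ClientError) as e:
        raise RuntimeError(
            "Unable to resolve an AWS account. "
            "Configure your credentials (e.g. AWS_PROFILE) before deploying."
        ) from e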

implement a notification parameter (notification="[email protected]") on the StepfunctionsWorkflow

    # We instantiate a step functions workflow and orchestrate the glue jobs.
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name="workflow", notification="some-email...") as sfn:
        task1 >> task2
  • if we define a notification, we create an SNS topic to add to the pipeline
  • it accepts an email address as a string or a list of email addresses
  • it notifies in case of failure or in case of success

example deploy fails `Error: Invalid S3 bucket name (value: data-pipeline-simple-None-deployment-bucket)`

The None has a capital letter which is invalid.

I ran:

export AWS_DEFAULT_ACCOUNT=_____________29
export AWS_PROFILE=my-profile
export AWS_DEFAULT_REGION=your-region # e.g. eu-west-1

<..>/datajob/examples/data_pipeline_simple$ datajob deploy --config datajob_stack.py 
cdk command: cdk deploy --app  "python <..>/datajob/examples/data_pipeline_simple/datajob_stack.py"  -c stage=None
jsii.errors.JavaScriptError: 
  Error: Invalid S3 bucket name (value: data-pipeline-simple-None-deployment-bucket)
  Bucket name must only contain lowercase characters and the symbols, period (.) and dash (-) (offset: 21)

get a default sagemaker role

  • sagemaker processor/estimator can use a default role when none is supplied
  • maybe a static function from SagemakerBase?

bug with credentials

(node:17808) ExperimentalWarning: The fs.promises API is experimental
python: can't open file 'deployment_glue_datajob.py': [Errno 2] No such file or directory
Subprocess exited with error 2
DVCL643@10NB03610:~/workspace/python/aws_best_practices$ cd glue
DVCL643@10NB03610:~/workspace/python/aws_best_practices/glue$ cdk deploy --app  "python deployment_glue_datajob.py"
(node:10368) ExperimentalWarning: The fs.promises API is experimental
Traceback (most recent call last):
  File "deployment_glue_datajob.py", line 60, in <module>
    python_job >> pyspark_job
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 115, in __exit__
    self._build_workflow()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 91, in _build_workflow
    self.client = boto3.client("stepfunctions")
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\boto3\__init__.py", line 93, in client
    return _get_default_session().client(*args, **kwargs)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\boto3\session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\session.py", line 826, in create_client
    credentials = self.get_credentials()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\session.py", line 431, in get_credentials
    'credential_provider').load_credentials()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1975, in load_credentials
    creds = provider.load()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1102, in load
    credentials = fetcher(require_expiry=False)
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\botocore\credentials.py", line 1137, in fetch_credentials
    provider=method, cred_var=mapping['secret_key'])
botocore.exceptions.PartialCredentialsError: Partial credentials found in env, missing: AWS_SECRET_ACCESS_KEY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deployment_glue_datajob.py", line 60, in <module>
    python_job >> pyspark_job
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 74, in __exit__
    self.create_resources()
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 93, in create_resources
    [resource.create() for resource in self.resources]
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\datajob_stack.py", line 93, in <listcomp>
    [resource.create() for resource in self.resources]
  File "C:\Users\dvcl643\.virtualenvs\glue-E61qYPlY\lib\site-packages\datajob\stepfunctions\stepfunctions_workflow.py", line 104, in create
    text_file.write(self.workflow.get_cloudformation_template())
AttributeError: 'NoneType' object has no attribute 'get_cloudformation_template'
Subprocess exited with error 1

make all objects configurable

let the user add **kwargs to:

  • all the CDK objects and the create functions
  • all the step functions objects; check the stepfunctions_workflow

handle aws region better

the step functions workflow should first inherit the region from the datajob stack before checking environment variables

Traceback (most recent call last):
  File "/Users/vincent/Workspace/zippo-data-layer/deployment_zippo.py", line 87, in <module>
    with StepfunctionsWorkflow(datajob_stack=datajob_stack, name=stackname) as sfn:
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/jsii/_runtime.py", line 83, in __call__
    inst = super().__call__(*args, **kwargs)
  File "/Users/vincent/Library/Caches/pypoetry/virtualenvs/zippo-data-layer-_EDEGVNn-py3.6/lib/python3.6/site-packages/datajob/stepfunctions/stepfunctions_workflow.py", line 53, in __init__
    self.region = region if region else os.environ["AWS_DEFAULT_REGION"]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'AWS_DEFAULT_REGION'

self.region = region if region else os.environ["AWS_DEFAULT_REGION"]

include stack name in the tasks that we run.

If you have the same task name and stage across 2 different pipelines, you will have a conflict.
E.g. the name "task1" and stage "stg" will result in task1-stg;
we need to prefix this with our stack name, e.g. my-stack-task1-stg. A small illustration follows below.
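
A tiny illustration of the proposed naming (the helper is hypothetical, not part of datajob):

def unique_task_name(stack_name: str, task_name: str, stage: str) -> str:
    """Prefix the task name with the stack name to avoid conflicts across pipelines."""
    return f"{stack_name}-{task_name}-{stage}"


# today:    "task1" + "stg"                              -> "task1-stg" (can clash across stacks)
# proposed: unique_task_name("my-stack", "task1", "stg") -> "my-stack-task1-stg"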

bug when running last jobs in parallel

hi V, I noticed that the workflow fails if the last step is not a unique job.
Ex: task1 >> [task2, task3] fails
but [task1, task2] >> task3 works
(in my example each task is independent)
is this the expected behavior?

bugfix - let boto3 handle region

Currently we explicitly specify a region when defining a stepfunctions workflow.
We should let boto3 handle it implicitly.
Also, don't raise an error when no region is found.
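
A sketch of the proposed fix, letting boto3 resolve the region and tolerating the case where none is configured:

from typing import Optional

import boto3


def resolve_region(region: Optional[str] = None) -> Optional[str]:
    """Return the explicitly supplied region, or let boto3 resolve it (may be None)."""
    # instead of: region if region else os.environ["AWS_DEFAULT_REGION"]
    return region if region else boto3.session.Session().region_name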
