
dbx's Introduction

dbx by Databricks Labs


🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.




Concept

dbx simplifies Databricks workflows development, deployment and launch across multiple environments. It also helps to package your project and deliver it to your Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be actively used both inside CI/CD pipelines and as a part of local tooling for rapid prototyping.

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install dbx

Documentation

Please refer to the docs page: https://dbx.readthedocs.io.

Interface versioning

For CLI interfaces, we follow the SemVer approach. However, for API components we don't use SemVer as of now. This may lead to instability when using dbx API methods directly.

Legal Information

This software is provided as-is and is not officially supported by Databricks through customer technical support channels. Support, questions, and feature requests can be communicated through the Issues page of this repo. Please see the legal agreement and understand that issues with the use of this code will not be answered or investigated by Databricks Support.

Databricks recommends using Databricks Asset Bundles for CI/CD. Please see the migration guidance on how to migrate from dbx to Databricks Asset Bundles (DABs).

Feedback

Issues with dbx? Found a bug? Have a great idea for an addition? Feel free to file an issue.

Contributing

Please find more details about contributing to dbx in the contributing doc.

dbx's People

Contributors

allebacco, attilaszuts, chasdevs, copdips, dependabot[bot], dinispeixoto, dumontvi, elenamartina, elvas, fjakobs, gchandra10, greentim, guiferviz, jckegelman, jspreddy, matthayes, mitchstockdale, mshtelma, nfx, nididpi, pietern, pohlposition, renardeinside, saadansari-db, scholer, skylarbpayne, tmacedo, tyler-richardett, xeliba, yinxi-db


dbx's Issues

[BUG] Datafactory reflect removes parameters from updated pipelines and activities

Expected Behavior

Activities in ADF pipeline are updated, leaving existing parameters as they are. Pipeline parameters are unchanged.

Current Behavior

When datafactory reflect updates existing activities in an ADF pipeline, the parameters of the updated activities are removed. The parameters of the whole pipeline also disappear (probably variables disappear as well, I didn't check that).

Steps to Reproduce (for bugs)

  1. Create pipeline in ADF. Add parameter.
  2. Create Databricks Python activity in that pipeline. Add parameter.
  3. Update this pipeline using dbx datafactory reflect.
  4. Look for missing parameters.

Context

Your Environment

  • dbx version used: 0.2.2
  • Databricks Runtime version: 9.1 ML

dbx v0.1.0 crashes on Windows platform

Problem

dbx CLI version 0.1.0 crashes on Windows (see the traceback below). The previous version (0.0.14) works correctly.

Traceback:

$ dbx
Traceback (most recent call last):
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\tools\miniconda3\envs\data-platform-pipelines\Scripts\dbx.exe\__main__.py", line 4, in <module>
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\dbx\cli.py", line 8, in <module>
    from dbx.commands.datafactory import datafactory
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\dbx\commands\datafactory.py", line 8, in <module>
    from azure.identity import DefaultAzureCredential
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\__init__.py", line 9, in <module>
    from ._credentials import (
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\_credentials\__init__.py", line 5, in <module>
    from .authorization_code import AuthorizationCodeCredential
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\_credentials\authorization_code.py", line 8, in <module>
    from .._internal.aad_client import AadClient
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\_internal\__init__.py", line 50, in <module>
    from .certificate_credential_base import CertificateCredentialBase
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\_internal\certificate_credential_base.py", line 11, in <module>
    from .persistent_cache import load_service_principal_cache
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\azure\identity\_internal\persistent_cache.py", line 9, in <module>
    import msal_extensions
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\msal_extensions\__init__.py", line 12, in <module>
    from .cache_lock import CrossPlatLock
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\msal_extensions\cache_lock.py", line 5, in <module>
    import portalocker
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\portalocker\__init__.py", line 4, in <module>
    from . import portalocker
  File "c:\tools\miniconda3\envs\data-platform-pipelines\lib\site-packages\portalocker\portalocker.py", line 9, in <module>
    import win32file
ImportError: DLL load failed while importing win32file: The specified module could not be found.

Your Environment

OS Windows 10, 1909
conda 4.9.2 (using virtual env)
dbx version used: 0.1.0

For context, I created a fresh project using cicd-templates.
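
A hedged sketch of how the Azure dependency could be imported lazily inside the datafactory command, so that a broken pywin32/portalocker chain on Windows would not take down the whole CLI (illustrative only, not the actual dbx layout):

import click

@click.command(name="datafactory")
def datafactory():
    # Hypothetical: defer the Azure SDK import until the command actually runs,
    # so an ImportError in win32file only affects this command.
    try:
        from azure.identity import DefaultAzureCredential  # noqa: F401
    except ImportError as exc:
        raise click.ClickException(f"Azure dependencies could not be loaded: {exc}")
    ...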

[FEATURE] Add support for multi-task jobs inside dbx execute

Expected Behavior

Running a multi-task job is currently (dbx=0.2.0) only possible using dbx deploy followed by dbx launch. As such, running a multi-task job requires a new cluster to be created and launched for each task. Given that dbx execute allows execution of standard jobs against existing clusters, one would expect the ability to run a multi-task job against an existing cluster as well.
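
A hedged sketch of how a hypothetical --task option could pick a single task out of a multi-task definition for execution against an existing cluster (the helper is illustrative, not a dbx API):

def select_task(job_spec: dict, task_key: str) -> dict:
    # Hypothetical helper: pick one task from a multi-task job definition
    # so it can be executed in isolation against an existing cluster.
    tasks = job_spec.get("tasks", [])
    matches = [t for t in tasks if t.get("task_key") == task_key]
    if not matches:
        raise ValueError(f"Task '{task_key}' not found in job '{job_spec.get('name')}'")
    return matches[0]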

Current Behavior

Multi-task jobs are only possible with dbx using dbx launch. Desired behaviour is that this would be possible with dbx execute also.

Steps to Reproduce (for bugs)

Configure a job with multiple tasks and attempt to execute against an existing cluster using dbx execute

Environment

  • dbx version used: 0.2.0
  • Databricks Runtime version: DBR 9.0 ML

[BUG] Issue with encoding :sparkles: character when running dbx deploy command

Expected Behavior

Deployment finishing successfully

Current Behavior

Error
UnicodeEncodeError: 'charmap' codec can't encode character '\u2728' in position 88: character maps to <undefined>

Steps to Reproduce (for bugs)

Running dbx deploy

Context

When running the dbx deploy command, an error occurs just after the
"Updating job definitions - done" log message
and before the
"Deployment for environment {environment} finished successfully" message.

It is due to a problem with encoding the ✨ emoji (https://www.fileformat.info/info/unicode/char/2728/index.htm). The change was introduced in the latest release, namely in
https://github.com/databrickslabs/dbx/pull/115/files
in the dbx/commands/deploy.py file.

Reverting to the previous version (0.2.2) resolved the issue.
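
Until a fix lands, a hedged workaround sketch, assuming the failure is purely the console encoding on the Windows agent (cp1252) rather than the deploy logic itself:

import sys

def safe_echo(message: str) -> None:
    # Hypothetical workaround: fall back to a replacement rendering when the
    # console encoding cannot represent characters such as the ✨ emoji.
    encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    try:
        message.encode(encoding)
    except UnicodeEncodeError:
        message = message.encode(encoding, errors="replace").decode(encoding)
    print(message)

safe_echo("Deployment for environment default finished successfully \u2728")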

Your Environment

Azure Devops MSFT hosted agent (vmImage: "windows-latest")
Python 3.8 x64

  • dbx version used: 0.3.0
  • Databricks Runtime version:

[FEATURE] dbx launch --parameter parsing fails when param is a json string

Expected Behavior

When providing --parameters to a launch, I expect I can provide a json string like this:

dbx launch --job="my-job-name" --parameters='{"name": "sai", "age": 999}'

OR multiple like this:

dbx launch --job="my-job-name" --parameters='{"name": "sai", "age": 999}' --parameters='{"name": "Mojo jojo", "age": 10}'

I expect this to work because the underlying databricks jobs run-now --python-params works with an array of JSON strings, i.e. '["{\"name\": \"sai\", \"age\": 999}", "{\"name\": \"Mojo jojo\", \"age\": 10}"]'

Current Behavior

Fails with this error:

  File "/Users/sai/blah/.venv/lib/python3.8/site-packages/dbx/commands/launch.py", line 95, in launch
    override_parameters = parse_multiple(parameters)
  File "/Users/sai/blah/.venv/lib/python3.8/site-packages/dbx/utils/common.py", line 35, in parse_multiple
    tags_dict = {t[0]: t[1] for t in tags_splitted}
  File "/Users/sai/blah/.venv/lib/python3.8/site-packages/dbx/utils/common.py", line 35, in <dictcomp>
    tags_dict = {t[0]: t[1] for t in tags_splitted}
IndexError: list index out of range

Error Location:

tags_splitted = [t.split("=") for t in multiple_argument]

This parser expects the parameters to be of the form --parameters="asdf=pqrs".
If a parameter deviates from that expectation, it throws an IndexError.
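
A minimal reproduction of that assumption, adapted from the snippet above (the function name matches the traceback; the rest is illustrative):

def parse_multiple(multiple_argument):
    # Simplified copy of the parser referenced in the traceback: it assumes
    # every value looks like key=value, so index 1 exists after the split.
    tags_splitted = [t.split("=") for t in multiple_argument]
    return {t[0]: t[1] for t in tags_splitted}

print(parse_multiple(["cost-center=1234"]))        # {'cost-center': '1234'}
parse_multiple(['{"name": "sai", "age": 999}'])    # IndexError: list index out of range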

Steps to Reproduce (for bugs)

dbx launch --job="my-job-name" --parameters='{"name": "sai", "age": 999}'

Context

I want to pass a json payload to my python job when invoking using dbx launch.
This should be possible as the underlying databricks cli works this way.

Your Environment

  • dbx version used: DataBricks eXtensions aka dbx, version ~> 0.1.2
  • Databricks Runtime version: 8.1, 8.2

[ISSUE] CLI commands are a bit slow

Current Behavior

Even getting a simple dbx --help takes quite a long time (a comparison with the main CLI shows that dbx is 5.8 times slower):

time dbx --help # > dbx --help  2.09s user 1.06s system 72% cpu 4.310 total
time databricks --help > databricks --help  0.36s user 0.14s system 93% cpu 0.538 total

It would be great to debug why it behaves this way and how to improve it.
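
One quick way to see where the start-up time goes is Python's built-in import profiler; a small sketch, assuming the cost is dominated by eager imports when dbx.cli is loaded:

import subprocess
import sys

# Print a per-module import time breakdown for the dbx CLI entry module.
subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import dbx.cli"],
    check=False,
)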

  • dbx version used: 0.2.1
  • Databricks Runtime version: N/A

Plans for cluster deployments

Hi! Do you have any plans to include cluster deployment as well or is dbx only intended for jobs?

I think it would be nice to have a list of clusters at the same level as the list of jobs. Something like:

{
    "default": {
        "jobs": [...],
        "clusters": [...],
    }
}

I would be very happy to work on it if we agree on a definition. Something worth discussing would be whether jobs should reference clusters with the id (like the API requires) or with the name (it would require a lookup prior to job creation to substitute the name for the id).
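
If jobs referenced clusters by name, the lookup could be a thin wrapper over the Clusters API; a hedged sketch, assuming databricks-cli's ClusterService rather than any agreed dbx interface:

from databricks_cli.sdk.service import ClusterService

def cluster_id_by_name(api_client, cluster_name: str) -> str:
    # Hypothetical lookup a "clusters" section would need if jobs referenced
    # clusters by name instead of id.
    clusters = ClusterService(api_client).list_clusters().get("clusters", [])
    matches = [c["cluster_id"] for c in clusters if c.get("cluster_name") == cluster_name]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one cluster named {cluster_name}, found {len(matches)}")
    return matches[0]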

[BUG] DBX CLI creating jobs that already exist, causing duplicates

Expected Behavior

Executing: dbx deploy --deployment-file ../xxx/jobs/deployment.yaml --no-rebuild -e MOD

[dbx][2022-01-25 09:31:43.874] Processing deployment for job: xxx
[dbx][2022-01-25 09:31:43.915] Updating existing job with id: 542433 and name: xxx
  1. DBX deploying the deployment.yaml file and selecting the right environment
  2. DBX CLI checking for existing jobs in the workspace
  3. DBX only creating jobs if they do not exist yet, otherwise updating them
  4. Rinse and repeat

Current Behavior

  1. DBX deploying the deployment.yaml file and selecting the right environment
  2. DBX CLI checking for existing jobs in the workspace
  3. DBX creating jobs that already exist even though the name is the same

In the current situation, multiples of the same job are created in the Databricks workspace, even though the job already exists.

Extra information:

  • Some of the jobs were created manually prior to using the DBX CLI. After removing these and running the pipeline again, jobs are not re-created
    • This however happened also in other Databricks workspaces that did not have any existing jobs
  • Running databricks jobs list using API version 2.0 returns all jobs, including duplicates, up to 20 jobs; using the limit option gives a maximum of 25 jobs. Running databricks jobs list using API version 2.1 returns a subset of the most recent jobs, with no duplicates

Checking the code it seems DBX is using JobsApi:

all_jobs = jobs_service.list_jobs().get("jobs", [])

No call is made with a limit or offset, which leaves me wondering whether dbx only retrieves the default page of 20 jobs (rather than following the Databricks CLI's --all approach of retrieving all jobs), or whether the issue is in the Jobs API itself.

See: https://github.com/databricks/databricks-cli/blob/8e2849a91b594cadba702e13aebcc391afa6c54e/databricks_cli/jobs/api.py#L34 for the actual code in the Databricks CLI.

Does DBX take into account the has_more feature in the 2.1 Jobs API? https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsList
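
If pagination is indeed the culprit, a hedged sketch of listing all jobs through the 2.1 API while honouring has_more (illustrative only, not the dbx implementation):

def list_all_jobs(api_client) -> list:
    # Hypothetical pagination: keep requesting pages while has_more is true
    # instead of relying on the default page of jobs returned by /jobs/list.
    jobs, offset, limit = [], 0, 25
    while True:
        response = api_client.perform_query(
            "GET", "/jobs/list", data={"limit": limit, "offset": offset}, version="2.1"
        )
        jobs.extend(response.get("jobs", []))
        if not response.get("has_more"):
            return jobs
        offset += limit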

Steps to Reproduce (for bugs)

Context

DBX is used to manage jobs through a deployment.yaml file. Every time our Azure DevOps repository is updated, the pipeline will trigger the update/creation of these jobs based on the deployment.yaml file in the repository and the environment that is used in the command.

The pipeline is triggered and installs dbx, sets up a .databrickscfg file to use for the environments and deploys the jobs that are in the deployment.yaml file.

Your Environment

  • dbx version used: 0.3.0
  • Databricks Runtime version: 9.1-LTS

Support --no-package for a specific task in multitask

Expected Behavior

deployment.yml

environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "multitask-job-sample"  # Multitask job
        tasks:
          - task_key: "task1"
            new_cluster:
              CLUSTER_WITHOUT_TABLE_ACCESS_CONTROL
            spark_python_task:
              python_file: "file://pyspark_jobs/jobs/sample/entrypoint.py"
              parameters: ["--conf-file", "file:fuse://conf/test/sample.yml"]
          - task_key: "task2"
            new_cluster:
              CLUSTER_WITH_TABLE_ACCESS_CONTROL
            notebook_task:
              notebook_path: "/Repos/.../ownership"
            depends_on: 
              - task_key: "task1"

dbx deploy --jobs multitask-job-sample --no-package-task multitask-job-sample/task2

task1 should have dbfs:/Shared/dbx/projects/pyspark_jobs/.../job.whl as a dependent library;
task2 shouldn't.

Current Behavior

dbx deploy --jobs multitask-job-sample

dbx puts the .whl as a dependent library for both tasks

or

dbx deploy --jobs multitask-job-sample --no-package
Does not add the .whl to any task; it's all or nothing.

Context

Python jobs are not supported on TAC (table access control) clusters, because dbx tries to use some dbutils functions that are not whitelisted to copy the dependencies and the .py file from DBFS.

If I try to run the jobs, the following exception is returned:
Constructor public com.databricks.backend.daemon.dbutils.FSUtilsParallel(org.apache.spark.SparkContext) is not whitelisted.

To overcome that limitation, I created a notebook that contains the specific code to grant database ownership (code that only works on TAC clusters) and set it up on a TAC cluster, while my first task runs on a non-TAC cluster.

If I run dbx deploy --jobs multitask-job-sample, dbx will add the generated .whl as a dependency for both tasks, causing the second task to fail.
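
A hedged sketch of what a per-task exclusion could look like once a job definition has been rendered (the helper and the --no-package-task option are hypothetical):

def strip_package_from_task(task: dict, package_whl: str) -> dict:
    # Hypothetical post-processing step for a per-task --no-package option:
    # drop the project wheel from the task's dependent libraries.
    libraries = task.get("libraries", [])
    task["libraries"] = [lib for lib in libraries if lib.get("whl") != package_whl]
    return task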

Your Environment

  • dbx version used: 0.3.0
  • Databricks Runtime version: 9.1, 10.2

[FEATURE] Add support to create a new job instead of update existing

Expected Behavior

Suppose we have 2 users on Azure Databricks (Mary and John):
Mary deploys a job with the name "dev-pipeline" on Databricks.
John works on a different project and wants to deploy a job with the same name "dev-pipeline" on Databricks.

I expect two different jobs to be created on Databricks, one for each user.

Current Behavior

Currently dbx is updating any existing job that has the same name.

_update_job(jobs_service, job_id, job)

It would be nice to have a new parameter to support creating a new job instead of updating the existing one.
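
A hedged sketch of how such a flag might branch between resetting and creating (the endpoints are the standard Jobs 2.0 ones; the force_new flag is hypothetical):

def deploy_job(api_client, job: dict, existing_job_id=None, force_new: bool = False):
    # Hypothetical behaviour for a "force new job" flag: reuse the existing id
    # only when the caller did not explicitly ask for a brand-new job.
    if existing_job_id is not None and not force_new:
        return api_client.perform_query(
            "POST", "/jobs/reset", data={"job_id": existing_job_id, "new_settings": job}
        )
    return api_client.perform_query("POST", "/jobs/create", data=job)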

[BUG] Stuck at creating a new context when running dbx execute

Expected Behavior

dbx execute should run the job specified in the deployment configuration using the format mentioned in this link

Current Behavior

On running dbx execute, it gets stuck as shown below. The last time dbx execute was stuck for ~6 hours.

(screenshot of the stuck dbx execute output omitted)

Steps to Reproduce (for bugs)

  1. Install dbx (I tried v0.1.6 and v0.2)
  2. Run a job on an all-purpose cluster using dbx:
    dbx execute --cluster-name="<cluster name>" --job="<job-name>"

Context

Your Environment

  • dbx version used: 0.2 and 0.1.16
  • Databricks Runtime version: 7.3LTS and 8.4

[ISSUE] longer job deployment with YAML configuration

Expected Behavior

Job deployment time using YAML configuration is similar to deployment time using JSON configuration

Current Behavior

  • job deployment time with JSON configuration - 10sec
    dbx deploy --jobs=b2s_dim_business_hubspot --environment=b2s_jobs_prod --deployment-file=conf/deployment_prod.json
    [dbx][2021-12-10 09:07:23.217] Starting new deployment for environment b2s_jobs_prod
    [dbx][2021-12-10 09:07:23.224] No environment variables provided, using the ~/.databrickscfg
    [dbx][2021-12-10 09:07:24.931] Re-building package
    [dbx][2021-12-10 09:07:26.264] Package re-build finished
    [dbx][2021-12-10 09:07:26.264] Locating package file
    [dbx][2021-12-10 09:07:26.277] Package file located in: dist\databricks_jobs-1639123645-py3-none-any.whl
    [dbx][2021-12-10 09:07:26.291] Requirements file is not provided
    [dbx][2021-12-10 09:07:26.291] Deployment will be performed only for the following jobs: ['b2s_dim_business_hubspot']
    [dbx][2021-12-10 09:07:27.361] Deploying file: dist\databricks_jobs-1639123645-py3-none-any.whl
    [dbx][2021-12-10 09:07:29.262] Deploying file: jobs\batch\b2s\dim_business_hubspot.py
    [dbx][2021-12-10 09:07:30.098] Updating job definitions
    [dbx][2021-12-10 09:07:30.099] Processing deployment for job: b2s_dim_business_hubspot
    [dbx][2021-12-10 09:07:30.328] Updating existing job with id: 5093 and name: b2s_dim_business_hubspot
    [dbx][2021-12-10 09:07:31.960] Updating job definitions - done
    [dbx][2021-12-10 09:07:33.209] Deployment for environment b2s_jobs_prod finished successfully

  • job deployment time with YAML configuration - 14min
    dbx deploy --jobs=b2s_dim_business_hubspot --environment=b2s_jobs_prod --deployment-file=conf/deployment_prod.yaml
    [dbx][2021-12-10 11:36:21.829] Starting new deployment for environment b2s_jobs_prod
    [dbx][2021-12-10 11:36:21.836] No environment variables provided, using the ~/.databrickscfg
    [dbx][2021-12-10 11:36:23.423] Re-building package
    [dbx][2021-12-10 11:36:24.635] Package re-build finished
    [dbx][2021-12-10 11:36:24.635] Locating package file
    [dbx][2021-12-10 11:36:24.646] Package file located in: dist\databricks_jobs-1639132584-py3-none-any.whl
    [dbx][2021-12-10 11:36:24.672] Requirements file is not provided
    [dbx][2021-12-10 11:36:24.673] Deployment will be performed only for the following jobs: ['b2s_dim_business_hubspot']
    [dbx][2021-12-10 11:36:25.540] Deploying file: .
    [dbx][2021-12-10 11:50:50.117] File is already stored in the deployment, no action needed
    [dbx][2021-12-10 11:50:50.347] Deploying file: dist\databricks_jobs-1639132584-py3-none-any.whl
    [dbx][2021-12-10 11:50:52.150] Deploying file: jobs\batch\b2s\dim_business_hubspot.py
    [dbx][2021-12-10 11:50:52.825] Updating job definitions
    [dbx][2021-12-10 11:50:52.825] Processing deployment for job: b2s_dim_business_hubspot
    [dbx][2021-12-10 11:50:53.052] Updating existing job with id: 5093 and name: b2s_dim_business_hubspot
    [dbx][2021-12-10 11:50:54.667] Updating job definitions - done
    [dbx][2021-12-10 11:50:55.936] Deployment for environment b2s_jobs_prod finished successfully

Steps to Reproduce (for bugs)

Configure a simple task and attempt to deploy to an existing cluster using dbx deploy with JSON and YAML configuration.
Below is a comparison of both configurations
(screenshot comparing the JSON and YAML configurations omitted)

Your Environment

  • dbx version used: 0.2.0 and 0.2.2
  • Databricks Runtime version: 9.1 LTS

[BUG] version package uploaded

Expected Behavior

If I generate a new version of my package to deploy, for example moving from version 0.0.1 to 0.0.2, I expect version 0.0.2 to be deployed to the cluster, not 0.0.1.
This works only if I manually delete the wheel for version 0.0.1 from the dist folder.

Current Behavior

If I generate 2 wheel packages for different versions of the same library, the old version is pushed to the cluster.

Steps to Reproduce (for bugs)

  1. deploy a package with version 0.0.1 using dbx deploy
  2. modify the version in the __init__.py file
  3. execute dbx deploy once again, but the old version is still pushed
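
A hedged sketch of a more deterministic package lookup, assuming the problem is simply that an older wheel gets picked up from dist/:

from pathlib import Path

def locate_latest_wheel(dist_dir: str = "dist") -> Path:
    # Hypothetical fix sketch: pick the most recently built wheel instead of
    # whichever file a plain glob happens to return first.
    wheels = sorted(Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(f"No wheel files found in {dist_dir}")
    return wheels[-1]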

Context

Your Environment

  • dbx version used: 0.1.3
  • Databricks Runtime version: 7.5 ML

[FEATURE] Fetch and use cluster policy definitions when policy_id is defined in deployment config

Expected Behavior

When policy_id is used in the deployment configuration of a job, all mandatory fields defined in the cluster policy are automatically merged into the API request for job creation/modification.

Current Behavior

dbx doesn't include mandatory configuration fields from cluster policy.
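
A hedged sketch of the merge step this feature asks for, using the cluster policies GET endpoint; only top-level "fixed" policy elements are handled here, and the helper is hypothetical:

import json

def merge_policy_into_cluster(api_client, new_cluster: dict) -> dict:
    # Fetch the policy definition referenced by policy_id and apply its fixed
    # values to the cluster spec before the Jobs API call.
    policy_id = new_cluster.get("policy_id")
    if not policy_id:
        return new_cluster
    policy = api_client.perform_query(
        "GET", "/policies/clusters/get", data={"policy_id": policy_id}
    )
    definition = json.loads(policy.get("definition", "{}"))
    for path, rule in definition.items():
        if "." in path:
            continue  # nested attribute paths are out of scope for this sketch
        if isinstance(rule, dict) and rule.get("type") == "fixed":
            new_cluster.setdefault(path, rule["value"])
    return new_cluster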

Steps to Reproduce (for bugs)

  1. Create a cluster policy with arbitrary mandatory fields.
  2. Bootstrap a project (https://dbx.readthedocs.io/en/latest/quickstart.html)
  3. Include the policy_id field in deployment.json. The value must be the id of the policy created in step 1.
  4. dbx deploy will fail because a new job is expected to have the fields from step 1.

Context

Your Environment

  • dbx version used: 0.1.2
  • Databricks Runtime version: doesn't matter

[ISSUE] dbx execute - log print

Expected Behavior

dbx execute should print logs from a Databricks cicd-templates job to standard output

Current Behavior

Logs do not print to standard output

Steps to Reproduce (for bugs)

define job with log and execute

class SampleJob(Job):
    def launch(self):
        self.logger.warn("print me")

Context

Your Environment

  • dbx version used: version ~> 0.2.2
  • Databricks Runtime version: 9.1 LTS

[ISSUE] execute processes the parameters incorrectly when strict adjustment is enabled

Expected Behavior

Execute shall correctly process references in deployment.json when strict adjustment is enabled (example):

{
    "default": {
        "strict_path_adjustment_policy": true,
        "jobs": [
            {
                "name": "dbx-strict-adjustment-test-sample",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "Standard_F4s",
                    "num_workers": 2
                },
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "file://dbx_strict_adjustment_test/jobs/sample/entrypoint.py",
                    "parameters": [
                        "--conf-file",
                        "file://conf/test/sample.json"
                    ]
                }
            }
        ]
    }
}

Current Behavior

execute logic fails to resolve the upload since is_strict was not properly inherited.

Your Environment

  • dbx version used: 0.2.1
  • Databricks Runtime version: N/A

[BUG] Launch multitask job with parameters

Expected Behavior

Command launch and launch with --as-run-submit should be able to override parameters on multi-task job.

Current Behavior

Command launch and launch with --as-run-submit fail to override parameters on multi-task job.

Steps to Reproduce (for bugs)

For RunNow:

  1. Create a multi-task job.
  2. Deploy.
  3. Launch with parameter provided by --parameters or --parameters-raw keys.
    e.g.:
dbx launch \
  --job smaple \
  --environment default \
  --parameters-raw='{"key1": "value1", "key2": 2}' 

For RunSubmit:

  1. Create a multi-task job.
  2. Deploy with --files-only.
  3. Launch with parameter provided by --parameters or --parameters-raw keys.
    e.g.:
dbx launch \
  --job smaple \
  --environment default \
  --parameters-raw='{"key1": "value1", "key2": 2}' \
  --as-run-submit

Context

Your Environment

  • dbx version used: v0.3.0
  • Databricks Runtime version: 10.1

[BUG] Spark python task does not make use of --parameters on dbx execute

Expected Behavior

Using dbx execute --job=cicd-sample-project-sample --cluster-name=test-cluster should take the parameters for the spark python task which is part of the deployment.json.

Current Behavior

Using dbx execute --job=cicd-sample-project-sample --cluster-name=test-cluster does not take the spark-python-task parameters from deployment.json

Steps to Reproduce (for bugs)

  1. use the default deployment.json
{
    "default": {
        "jobs": [
            {
                "name": "cicd-sample-project-sample",
                "new_cluster": {
                    "spark_version": "7.3.x-cpu-ml-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "aws_attributes": {
                        "first_on_demand": 0,
                        "availability": "SPOT"
                    },
                    "num_workers": 2
                },
                "libraries": [],
                "email_notifications": {
                    "on_start": [],
                    "on_success": [],
                    "on_failure": []
                },
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "cicd_sample_project/jobs/sample/entrypoint.py",
                    "parameters": [
                        "--conf-file",
                        "conf/test/sample.json"
                    ]
                }
            }
        ]
    }
}
  2. execute the job using dbx execute --job=cicd-sample-project-sample --cluster-name=test-cluster on an interactive (all-purpose) cluster.
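
For reference, a hedged sketch of reading those parameters out of the deployment file so they can be forwarded to the entrypoint; build_entrypoint_argv is not a dbx helper, just an illustration of the desired behaviour:

import json

def build_entrypoint_argv(deployment_file: str, job_name: str) -> list:
    # Read the spark_python_task section for the given job and assemble the
    # argv the entrypoint expects, so the parameters can be forwarded by
    # whatever mechanism ends up executing the file.
    with open(deployment_file) as f:
        jobs = json.load(f)["default"]["jobs"]
    job = next(j for j in jobs if j["name"] == job_name)
    task = job["spark_python_task"]
    return [task["python_file"], *task.get("parameters", [])]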

Context

I am trying to run a python project that makes use of command line arguments.

Your Environment

  • dbx version used: 0.1.3
  • Databricks Runtime version: 7.3LTS

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) in common.py

It looks like the mlflow client version may have changed, as when the experiment doesn't exist, this line fails:
https://github.com/databrickslabs/cicd-templates-api/blob/65bda97b1b9e7be296b2c522b2b900938f03abe8/dbx/utils/common.py#L215

with this exception

  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/utils/rest_utils.py", line 156, in call_endpoint
    js_dict = json.loads(response.text)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
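
A hedged sketch of a more defensive lookup (not the actual dbx fix), treating both a missing experiment and a non-JSON error response as "the experiment does not exist yet":

import mlflow

def get_or_create_experiment(workspace_dir: str) -> str:
    try:
        experiment = mlflow.get_experiment_by_name(workspace_dir)
    except Exception:  # e.g. json.decoder.JSONDecodeError from some client versions
        experiment = None
    if experiment is None:
        return mlflow.create_experiment(workspace_dir)
    return experiment.experiment_id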

mlflow version:

(cicd-demo) miguel.peralvo@C02ZL29MMD6N cicd_demo % mlflow --version
mlflow, version 1.11.0

project.json:

(cicd-demo) miguel.peralvo@C02ZL29MMD6N cicd_demo % cat .dbx/project.json
{
    "environments": {
        "default": {
            "profile": "az-mp-test",
            "workspace_dir": "/Shared/dbx/projects/cicd_demo",
            "artifact_location": "dbfs:/dbx/cicd_demo"
        }
    }
}

Complete stack trace:

Traceback (most recent call last):
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/bin/dbx", line 8, in <module>
    sys.exit(cli())
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/dbx/commands/deploy.py", line 47, in deploy
    api_client = prepare_environment(environment)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/dbx/utils/common.py", line 215, in prepare_environment
    experiment = mlflow.get_experiment_by_name(environment_data["workspace_dir"])
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/tracking/fluent.py", line 353, in get_experiment_by_name
    return MlflowClient().get_experiment_by_name(name)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/tracking/client.py", line 174, in get_experiment_by_name
    return self._tracking_client.get_experiment_by_name(name)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/tracking/_tracking_service/client.py", line 130, in get_experiment_by_name
    return self.store.get_experiment_by_name(name)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/store/tracking/rest_store.py", line 303, in get_experiment_by_name
    response_proto = self._call_endpoint(GetExperimentByName, req_body)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/store/tracking/rest_store.py", line 52, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/site-packages/mlflow/utils/rest_utils.py", line 156, in call_endpoint
    js_dict = json.loads(response.text)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/miguel.peralvo/opt/anaconda3/envs/cicd-demo/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

[BUG] Azure Data Factory reflect AttributeError SubscriptionClient has no attribute subscriptions

Steps to Reproduce (for bugs)

Running command
dbx datafactory reflect
results in AttributeError: 'SubscriptionClient' object has no attribute 'subscriptions'

Stack trace

Traceback (most recent call last):
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\runpy.py", line 193, in _run_module_as_main      
    "__main__", mod_spec)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\Scripts\dbx.exe\__main__.py", line 7, in <module>    
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 1053, in main 
    rv = self.invoke(ctx)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\click\core.py", line 754, in invoke    return __callback(*args, **kwargs)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\dbx\commands\datafactory.py", line 
70, in reflect
    reflector = DatafactoryReflector(specs_file, subscription_name, resource_group, factory_name, name, 
environment)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\dbx\commands\datafactory.py", line 
91, in __init__
    self.subscription_id = self._get_subscription_id(subscription_name)
  File "C:\Users\lapinska\Anaconda3\envs\dbx_py375\lib\site-packages\dbx\commands\datafactory.py", line 
139, in _get_subscription_id
    sub for sub in self.sub_client.subscriptions.list() if sub.display_name == subscription_name        
AttributeError: 'SubscriptionClient' object has no attribute 'subscriptions'

Reason

The attribute was renamed from subscriptions to subscription in azure-mgmt-subscription==2.0.0:
subscription_client.py#L66
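
Based on the rename described above, a hedged compatibility sketch (not the actual dbx fix) that tolerates either attribute name:

from azure.identity import DefaultAzureCredential
from azure.mgmt.subscription import SubscriptionClient

def get_subscription_id(subscription_name: str) -> str:
    # Older releases expose `subscriptions`; azure-mgmt-subscription==2.0.0
    # exposes `subscription` (per the rename noted above).
    client = SubscriptionClient(DefaultAzureCredential())
    operations = getattr(client, "subscriptions", None) or client.subscription
    matches = [s for s in operations.list() if s.display_name == subscription_name]
    if not matches:
        raise ValueError(f"No subscription named {subscription_name} found")
    return matches[0].subscription_id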

Workaround

Fall back to version 1.0.0
pip install azure-mgmt-subscription==1.0.0

  • dbx version used: DataBricks eXtensions aka dbx, version ~> 0.3.0

[BUG] DBX fails to read environment variables

Hi everyone, I am having a slight problem with dbx not being able to read environment variables in both JSON and YAML deployment files.

Expected Behavior

I would like my deployment definitions (in both JSON and YML) to correctly read and inject the environment variables.

Current Behavior

I have created two deployment files - JSON and YML - and declared the use of environment variables like this:

# in JSON
...
"spark_version": "${DBX_SPARK_VERSION}",
"node_type_id": "${DBX_NODE_TYPE_ID}",
...

# in YML
...
spark_version: !ENV DBX_SPARK_VERSION
node_type_id: !ENV DBX_NODE_TYPE_ID
...

When issuing dbx deploy command, it gives me the following error for JSON:

Traceback (most recent call last):
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/databricks_cli/sdk/api_client.py", line 138, in perform_query
    resp.raise_for_status()
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://*.azuredatabricks.net/api/2.0/jobs/reset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/localspark/bin/dbx", line 8, in <module>
    sys.exit(cli())
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/commands/deploy.py", line 177, in deploy
    deployment_data = _create_jobs(deployment["jobs"], api_client)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/commands/deploy.py", line 379, in _create_jobs
    _update_job(jobs_service, job_id, job)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/commands/deploy.py", line 404, in _update_job
    raise e
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/commands/deploy.py", line 400, in _update_job
    jobs_service.reset_job(job_id, job)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/databricks_cli/sdk/service.py", line 131, in reset_job
    return self.client.perform_query('POST', '/jobs/reset', data=_data, headers=headers, version=version)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/databricks_cli/sdk/api_client.py", line 146, in perform_query
    raise requests.exceptions.HTTPError(message, response=e.response)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://*.azuredatabricks.net/api/2.0/jobs/reset
 Response from server:
 { 'error_code': 'INVALID_PARAMETER_VALUE',
  'message': 'Node type ${DBX_NODE_TYPE_ID} is not supported. Supported node '
             'types: Standard_DS3_v2, Standard_DS4_v2, Standard_DS5_v2, '
             'Standard_D4s_v3, Standard_D8s_v3, Standard_D16s_v3, '
             'Standard_D32s_v3, Standard_D64s_v3, Standard_D4a_v4, '
             'Standard_D8a_v4, Standard_D16a_v4, Standard_D32a_v4, '
             'Standard_D48a_v4, Standard_D64a_v4, Standard_D96a_v4, '
             ...
}

And following for YML:

Traceback (most recent call last):
  File "/root/miniconda3/envs/localspark/bin/dbx", line 8, in <module>
    sys.exit(cli())
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/commands/deploy.py", line 133, in deploy
    deployment = deployment_file_config.get_environment(environment)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/utils/common.py", line 105, in get_environment
    return self._read_yaml(self._path).get("environments").get(environment)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/dbx/utils/common.py", line 102, in _read_yaml
    return self.yaml.load(f)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/main.py", line 434, in load
    return constructor.get_single_data()
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 122, in get_single_data
    return self.construct_document(node)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 132, in construct_document
    for _dummy in generator:
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 722, in construct_yaml_map
    value = self.construct_mapping(node)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 446, in construct_mapping
    return BaseConstructor.construct_mapping(self, node, deep=deep)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 262, in construct_mapping
    value = self.construct_object(value_node, deep=deep)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 155, in construct_object
    data = self.construct_non_recursive_object(node)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 190, in construct_non_recursive_object
    data = constructor(self, node)
  File "/root/miniconda3/envs/localspark/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 738, in construct_undefined
    raise ConstructorError(
ruamel.yaml.constructor.ConstructorError: could not determine a constructor for the tag '!ENV'
  in "conf/deployment.yaml", line 3, column 20

I am using Miniconda to manage my virtual environment and variables. When issuing command conda env config vars list (to list my environment variables) I get:

DBX_SPARK_VERSION = 9.0.x-scala2.12
DBX_NODE_TYPE_ID = Standard_DS3_v2

The question now is whether I am referencing these variables incorrectly or whether there is some other issue. I have used the docs for reference, but there is no example of how to use environment variables in deployment definitions.
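
For what it's worth, a hedged sketch of how a custom !ENV tag can be wired into ruamel.yaml; this is illustrative only and may not match the syntax dbx actually supports:

import os
from ruamel.yaml import YAML

def _env_constructor(loader, node):
    # Hypothetical !ENV handler: resolve either `!ENV VAR_NAME` or
    # `!ENV ${VAR_NAME}` from the process environment.
    raw = loader.construct_scalar(node).strip()
    name = raw[2:-1] if raw.startswith("${") and raw.endswith("}") else raw
    return os.environ.get(name, "")

yaml = YAML()
yaml.constructor.add_constructor("!ENV", _env_constructor)

with open("conf/deployment.yaml") as f:
    deployment = yaml.load(f)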

Steps to Reproduce (for bugs)

  • Add two environment variables (DBX_SPARK_VERSION and DBX_NODE_TYPE_ID)
  • Create deployment files (JSON and YML) and reference both variables
  • Run dbx deploy

Context

Your Environment

  • Miniconda version used: 4.9.2
  • dbx version used: 0.2.0
  • Databricks Runtime version: 9.0

[FEATURE] add better error message for cases when hostname cannot be parsed

Expected Behavior

When the hostname variable is provided in an incorrect way, an error message shall point towards the issue.

Current Behavior

Currently the error message is misleading and doesn't really show the root cause:

 File "/opt/hostedtoolcache/Python/3.7.5/x64/lib/python3.7/site-packages/databricks_cli/sdk/api_client.py", line 72, in __init__
    if host[-1] == "/":
IndexError: string index out of range
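
A hedged sketch of the kind of up-front validation this feature asks for (the function name is hypothetical):

def validate_host(host: str) -> str:
    # Produce a readable error instead of the IndexError raised deep inside
    # the API client when the host string is empty or malformed.
    if not host:
        raise ValueError(
            "Databricks host is empty - check the DATABRICKS_HOST variable "
            "or the 'host' entry in ~/.databrickscfg"
        )
    if not host.startswith(("https://", "http://")):
        raise ValueError(f"Databricks host must include the scheme, got: {host}")
    return host.rstrip("/")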

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbx version used: 0.1.4
  • Databricks Runtime version: N/A

[BUG] dbx deploy fails when multiple jobs/tasks use the same conf file in parameters.

Sample YAML File (Cookie-Cutter):

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "9.1.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "i3.xlarge"

# please note that we're using FUSE reference for config file, hence we're going to load this file using its local FS path
environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "test-dbx-sample"
        <<:
          - *basic-static-cluster
        spark_python_task:
          python_file: "file://dbx_package/jobs/sample/entrypoint.py"
          parameters: ["--conf-file", "file:fuse://conf/test/sample.yml"]
      - name: "test-dbx-sample-integration-test"
        <<:
          - *basic-static-cluster
        spark_python_task:
          python_file: "file://tests/integration/sample_test.py"
          parameters: ["--conf-file", "file:fuse://conf/test/sample.yml"]

Expected Behavior

Deploying a job should result in the following message:

dbx deploy --deployment-file conf/deployment.yml --jobs test-dbx-sample

(base) vdi:~/git/test-dbx$ dbx deploy --deployment-file conf/deployment.yml --jobs test-dbx-sample
[dbx][2022-01-24 14:43:23.594] Starting new deployment for environment default
[dbx][2022-01-24 14:43:23.595] No environment variables provided, using the ~/.databrickscfg
[dbx][2022-01-24 14:43:24.530] Re-building package
[dbx][2022-01-24 14:43:25.362] Package re-build finished
[dbx][2022-01-24 14:43:25.362] Locating package file
[dbx][2022-01-24 14:43:25.362] Package file located in: dist/dbx_package-0.0.1-py3-none-any.whl
[dbx][2022-01-24 14:43:25.370] Requirements file is not provided
[dbx][2022-01-24 14:43:25.370] Deployment will be performed only for the following jobs: ['test-dbx-sample']
[dbx][2022-01-24 14:43:25.878] Deploying file: dbx_package/jobs/sample/entrypoint.py
[dbx][2022-01-24 14:43:26.769] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 14:43:27.380] Deploying file: dist/dbx_package-0.0.1-py3-none-any.whl
[dbx][2022-01-24 14:43:27.798] Updating job definitions
[dbx][2022-01-24 14:43:27.799] Processing deployment for job: test-dbx-sample
[dbx][2022-01-24 14:43:27.849] Creating a new job with name test-dbx-sample
[dbx][2022-01-24 14:43:28.481] Updating job definitions - done
[dbx][2022-01-24 14:43:29.127] Deployment for environment default finished successfully ✨

Current Behavior

Now, if I deploy multiple jobs that need the same conf file, or a job with multiple tasks needing the same conf file, the deployment fails with the error below:

(base) vdi:~/git/test-dbx$ dbx deploy --deployment-file conf/deployment.yml
[dbx][2022-01-24 14:47:20.060] Starting new deployment for environment default
[dbx][2022-01-24 14:47:20.060] No environment variables provided, using the ~/.databrickscfg
[dbx][2022-01-24 14:47:21.080] Re-building package
[dbx][2022-01-24 14:47:21.896] Package re-build finished
[dbx][2022-01-24 14:47:21.896] Locating package file
[dbx][2022-01-24 14:47:21.896] Package file located in: dist/dbx_package-0.0.1-py3-none-any.whl
[dbx][2022-01-24 14:47:21.905] Requirements file is not provided
[dbx][2022-01-24 14:47:22.409] Deploying file: dbx_package/jobs/sample/entrypoint.py
[dbx][2022-01-24 14:47:23.367] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 14:47:24.069] Deploying file: dist/dbx_package-0.0.1-py3-none-any.whl
[dbx][2022-01-24 14:47:24.594] Deploying file: tests/integration/sample_test.py
[dbx][2022-01-24 14:47:25.093] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 14:47:26.255] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 14:47:26.701] Deploying file: conf/test/sample.yml
Traceback (most recent call last):
  File "/home/username/anaconda3/bin/dbx", line 8, in <module>
    sys.exit(cli())
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 174, in deploy
    _adjust_job_definitions(
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 313, in _adjust_job_definitions
    _walk_content(adjustment_callback, job)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 499, in _walk_content
    _walk_content(func, item, content, key)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 499, in _walk_content
    _walk_content(func, item, content, key)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 502, in _walk_content
    _walk_content(func, sub_item, content, idx)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 504, in _walk_content
    parent[index] = func(content)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 307, in adjustment_callback
    return _adjust_path(p, artifact_base_uri, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 569, in _adjust_path
    adjusted_path = _strict_path_adjustment(candidate, adjustment, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 532, in _strict_path_adjustment
    _upload_file(local_path, adjusted_path, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 511, in _upload_file
    file_uploader.upload_file(local_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/username/anaconda3/lib/python3.8/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/home/username/anaconda3/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/utils/common.py", line 410, in upload_file
    mlflow.log_artifact(str(file_path), str(posix_path.parent))
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 605, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/client.py", line 955, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 355, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/store/artifact/dbfs_artifact_repo.py", line 119, in log_artifact
    self._databricks_api_request(
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/store/artifact/dbfs_artifact_repo.py", line 61, in _databricks_api_request
    return http_request_safe(host_creds=host_creds, endpoint=endpoint, method=method, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 162, in http_request_safe
    return verify_rest_response(response, endpoint)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 175, in verify_rest_response
    raise MlflowException("%s. Response body: '%s'" % (base_msg, response.text))
mlflow.exceptions.MlflowException: API request to endpoint /dbfs/dbx/test-dbx/efb1d96bdba44aa2bf9129a681c1fa32/artifacts/conf/test/sample.yml failed with error code 409 != 200. Response body: '<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 409 </title>
</head>
<body>
<h2>HTTP ERROR: 409</h2>
<p>Problem accessing /dbfs/dbx/test-dbx/efb1d96bdba44aa2bf9129a681c1fa32/artifacts/conf/test/sample.yml. Reason:
<pre>    File already exists, cannot overwrite: &apos;/dbx/test-dbx/efb1d96bdba44aa2bf9129a681c1fa32/artifacts/conf/test/sample.yml&apos;</pre></p>
<hr />
</body>
</html>
'
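
A hedged sketch of one way to avoid the duplicate upload: memoize local paths in a thin wrapper around the file uploader (the wrapper is hypothetical, not dbx's actual fix):

class CachingFileUploader:
    """Remember which local paths were already pushed so a conf file shared by
    several jobs or tasks is uploaded only once per deployment."""

    def __init__(self, uploader):
        self._uploader = uploader
        self._uploaded = set()

    def upload_file(self, local_path):
        if str(local_path) in self._uploaded:
            return  # already stored in this deployment, no action needed
        self._uploader.upload_file(local_path)
        self._uploaded.add(str(local_path))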

Steps to Reproduce (for bugs):

execute dbx deploy --deployment-file conf/deployment.yml with the above yaml

Or a yaml with multiple tasks in a job like this:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "9.1.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "i3.xlarge"

# please note that we're using FUSE reference for config file, hence we're going to load this file using its local FS path
environments:
  default:
    strict_path_adjustment_policy: true
    jobs:
      - name: "multiple-task-example"
        tasks:
          - task_key: "test-dbx-sample"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://dbx_package/jobs/sample/entrypoint.py"
              parameters: ["--conf-file", "file:fuse://conf/test/sample.yml"]
          - task_key: "test-dbx-sample-integration-test"
            <<: *basic-static-cluster
            spark_python_task:
              python_file: "file://tests/integration/sample_test.py"
              parameters: ["--conf-file", "file:fuse://conf/test/sample.yml"]
            depends_on:
              - task_key: "test-dbx-sample"

Error:

(base) vdi:~/git/test-dbx$ dbx deploy --deployment-file conf/deployment.yml
[dbx][2022-01-24 15:00:54.997] Starting new deployment for environment default
[dbx][2022-01-24 15:00:54.998] No environment variables provided, using the ~/.databrickscfg
[dbx][2022-01-24 15:00:55.769] Re-building package
[dbx][2022-01-24 15:00:56.521] Package re-build finished
[dbx][2022-01-24 15:00:56.521] Locating package file
[dbx][2022-01-24 15:00:56.521] Package file located in: dist/dbx_package-0.0.1-py3-none-any.whl
[dbx][2022-01-24 15:00:56.531] Requirements file is not provided
[dbx][2022-01-24 15:00:57.112] Deploying file: dbx_package/jobs/sample/entrypoint.py
[dbx][2022-01-24 15:00:58.253] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 15:00:58.825] Deploying file: tests/integration/sample_test.py
[dbx][2022-01-24 15:00:59.370] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 15:01:00.550] Deploying file: conf/test/sample.yml
[dbx][2022-01-24 15:01:01.086] Deploying file: conf/test/sample.yml
Traceback (most recent call last):
  File "/home/username/anaconda3/bin/dbx", line 8, in <module>
    sys.exit(cli())
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/anaconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 174, in deploy
    _adjust_job_definitions(
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 313, in _adjust_job_definitions
    _walk_content(adjustment_callback, job)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 499, in _walk_content
    _walk_content(func, item, content, key)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 502, in _walk_content
    _walk_content(func, sub_item, content, idx)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 499, in _walk_content
    _walk_content(func, item, content, key)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 499, in _walk_content
    _walk_content(func, item, content, key)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 502, in _walk_content
    _walk_content(func, sub_item, content, idx)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 504, in _walk_content
    parent[index] = func(content)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 307, in adjustment_callback
    return _adjust_path(p, artifact_base_uri, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 569, in _adjust_path
    adjusted_path = _strict_path_adjustment(candidate, adjustment, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 532, in _strict_path_adjustmentr
    _upload_file(local_path, adjusted_path, file_uploader)
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/commands/deploy.py", line 511, in _upload_file
    file_uploader.upload_file(local_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/username/anaconda3/lib/python3.8/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/home/username/anaconda3/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/home/username/anaconda3/lib/python3.8/site-packages/dbx/utils/common.py", line 410, in upload_file
    mlflow.log_artifact(str(file_path), str(posix_path.parent))
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 605, in log_artifact
    MlflowClient().log_artifact(run_id, local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/client.py", line 955, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 355, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/store/artifact/dbfs_artifact_repo.py", line 119, in log_artifact
    self._databricks_api_request(
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/store/artifact/dbfs_artifact_repo.py", line 61, in _databricks_api_request
    return http_request_safe(host_creds=host_creds, endpoint=endpoint, method=method, **kwargs)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 162, in http_request_safe
    return verify_rest_response(response, endpoint)
  File "/home/username/anaconda3/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 175, in verify_rest_response
    raise MlflowException("%s. Response body: '%s'" % (base_msg, response.text))
mlflow.exceptions.MlflowException: API request to endpoint /dbfs/dbx/test-dbx/7b4271f8e5744df3b4fac82e7a679564/artifacts/conf/test/sample.yml failed with error code 409 != 200. Response body: '<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 409 </title>
</head>
<body>
<h2>HTTP ERROR: 409</h2>
<p>Problem accessing /dbfs/dbx/test-dbx/7b4271f8e5744df3b4fac82e7a679564/artifacts/conf/test/sample.yml. Reason:
<pre>    File already exists, cannot overwrite: &apos;/dbx/test-dbx/7b4271f8e5744df3b4fac82e7a679564/artifacts/conf/test/sample.yml&apos;</pre></p>
<hr />
</body>
</html>
'

Context

I've developed a generic job that processes a table based on a command-line parameter, then picks its relevant info (source and target paths, etc.) from a conf file that is shared per data-processing layer (one for bronze, one for silver and one for gold).

Since these are multiple tasks running in parallel but sharing the same code, they also share the same conf file in their task definitions:

            spark_python_task:
              python_file: "file://test_dbx/jobs/bronze/entrypoint.py"
              parameters: ["--conf-file", "file:fuse://conf/dev/bronze.yml", "--table_name", "{{task_key}}"]

Your Environment

  • dbx version used: 0.3.0
  • Databricks Runtime version: 9.1.x-cpu-ml-scala2.12 (9.1)

[FEATURE] Add support for getting the current branch name within GitLab CI

Expected Behavior

When calling the get_current_branch_name() function during a deploy executed within a GitLab-CI job, the function returns the current branch name.

Current Behavior

When calling the get_current_branch_name() function during a deploy executed within a GitLab-CI job, the function returns None.

Context

I am calling dbx deploy within a GitLab-CI job. The name of the current branch can be retrieved with the CI_COMMIT_REF_NAME environment variable.

For more details about GitLab-CI variables, see Predefined variables reference.

Possible implementations

  1. in get_current_branch_name(), add an if "CI_COMMIT_REF_NAME" in os.environ: branch for reading the branch name, similar to the existing GitHub check (see the sketch after this list)
  2. (more generic) add a --branch-name argument to the deploy command for passing the branch name explicitly
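
A minimal sketch of option 1, assuming the current implementation already checks a GitHub-specific variable; the exact variable names used by dbx today may differ:

import os
from typing import Optional


def get_current_branch_name() -> Optional[str]:
    """Best-effort branch name lookup across common CI providers."""
    # GitHub Actions exposes the ref via GITHUB_REF, e.g. refs/heads/main
    if "GITHUB_REF" in os.environ:
        return os.environ["GITHUB_REF"].split("/")[-1]
    # GitLab CI exposes the branch or tag name directly
    if "CI_COMMIT_REF_NAME" in os.environ:
        return os.environ["CI_COMMIT_REF_NAME"]
    return None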

Your Environment

  • dbx version used: 0.1.3
  • Databricks Runtime version: 8.1 (includes Apache Spark 3.1.1, Scala 2.12)

[DISCUSSION] Question about environment variables in yaml deployment file

Expected Behavior

On this docs page (https://dbx.readthedocs.io/en/latest/quickstart.html?highlight=env#id2)

There is a note that says: "Unlike JSON, in YAML you have to specify the !ENV tag before your environment variables for it to be resolved in a valid manner."

I have tried to do this on my yaml but it is not working. Could you guys help me out with a working example?

deployment.yaml

environments:
  default:
    jobs:

      - name: test1
        <<: *basic-settings
        tasks:
          - task_key: "main-task"
            <<:
              - *basic-settings
              - *basic-autoscale-cluster
            spark_python_task:
              python_file: company/jobs/sample/entrypoint.py
              parameters:
                - "--e"
                - ${!ENV:PWD}
                - "--f"
                - '${!ENV:PATH}'
                - "--g"
                - '${!ENV:PWD}'
                - "--h"
                - '${!ENV:ENVI}'
                - "--i"
                - ${!ENVI}
                - "--j"
                - ${!PWD}
                - "--k"
                - ${!TERM}
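
For comparison, here is a minimal example of the syntax as I read the note, assuming the !ENV tag is resolved pyaml-env-style (i.e. !ENV followed by ${VAR}); a confirmation that this is the intended form would be great:

environments:
  default:
    jobs:
      - name: test1
        spark_python_task:
          python_file: "file://company/jobs/sample/entrypoint.py"
          parameters:
            - "--e"
            - !ENV ${PWD}
            - "--h"
            - !ENV ${ENVI}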


Your Environment

  • dbx version used: 0.2.1
  • Databricks Runtime version:

[FEATURE] Please support defining the deployment in yaml instead of json.

Current Behavior

The deployment is defined in conf/deployment.json.

JSON is cumbersome to read and write. Not to mention it causes duplication of config.

Context

  1. I want to be able to not duplicate config.
  2. I want to extract common config and include it in the job definitions.
  3. I want to reduce human error when configuring the jobs.
  4. I want to be able to extend/add functionality to the deployment config file in the future.

Proposal

  1. Allow --deployment-file to take a deployment(.json | .yaml | .yml) file.
  2. By default check for deployment(.json | .yaml | .yml) file in the conf folder.
  3. Parse the file appropriately.
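
A rough sketch of point 3, choosing the parser by file extension (function and module names here are illustrative, not a proposed dbx API):

import json
from pathlib import Path

import yaml  # PyYAML


def read_deployment_config(path: str) -> dict:
    """Parse a deployment file based on its extension."""
    p = Path(path)
    if p.suffix in (".yaml", ".yml"):
        return yaml.safe_load(p.read_text())
    if p.suffix == ".json":
        return json.loads(p.read_text())
    raise ValueError(f"Unsupported deployment file format: {p.suffix}")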

Your Environment

  • dbx version used: DataBricks eXtensions aka dbx, version ~> 0.1.2
  • Databricks Runtime version: 8.1, 8.2

[BUG] Custom tags are being replaced during deployment

Expected Behavior

dbx creates/updates a job definition with custom tags exactly as they were specified in deployment.json.

Current Behavior

dbx changes some custom tags if their value is the same as the project directory name.

Steps to Reproduce (for bugs)

Create a custom tag whose value is the same as the project directory name.

[FEATURE] Add support for scala spark_jar_task in dbx datafactory reflect

Expected Behavior

According to the docs (and the code), currently only spark_python_task is supported by the dbx datafactory reflect command. Scala jars are already supported for deployment, and updating a Jar activity in a similar way should be easy to implement.

Current Behavior

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\SPA\Scripts\dbx.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 70, in reflect
    reflector.launch()
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 230, in launch
    job_activity = self._generate_activity(job_spec, service_name)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 188, in _generate_activity
    python_file=job_spec.get("spark_python_task").get("python_file"),
AttributeError: 'NoneType' object has no attribute 'get'

Steps to Reproduce (for bugs)

Run dbx datafactory reflect on job defined here: https://github.com/renardeinside/dbx-scala-example

  • dbx version used: 0.2.1
  • Databricks Runtime version: 9.1

DBX execute/launch command issue - issue getting the result

Expected Behavior

To run execute/launch command and to get the result

Current Behavior

I am trying to implement databricks cicd.
When I run the execute/launch command, I get the following error. Could you please help me solve the issue? Also, where can I learn about the entrypoint and the Python code that goes inside it? I need the output of launch and execute to be shown; currently it fails with the following error:

File "C:\Users\AChowdhury1\02_Python_Executables\lib\site-packages\mlflow\utils\rest_utils.py", line 184, in verify_rest_response
raise MlflowException("%s. Response body: '%s'" % (base_msg, response.text))
mlflow.exceptions.MlflowException: API request to endpoint was successful but the response body was not in a valid JSON format. Response body: '<!doctype html><title>Databricks - Sign In</title>

<script src="login/login.6018450d.js"></script>'

Steps to Reproduce (for bugs)

Context

Your Environment

Windows

  • dbx version used:
    DataBricks eXtensions aka dbx, version ~> 0.2.0
  • Databricks Runtime version:
    8.3 (includes Apache Spark 3.1.1, Scala 2.12)

[IMPROVEMENT] Raise coverage up to 80%

Explanation

New requirements for Databricks Labs projects enforce an 80% level of code coverage.

The PR addressing this issue shall enforce codecov checks AND improve coverage up to 80%.

dbx execute fails in 0.0.11

Expected Behavior

dbx execute shall run smoothly on ML runtime 7.X

Current Behavior

The command fails during cluster execution with error:

CalledProcessError: Command 'pip install -U -r /path/to/whl' returned non-zero exit status 2.

Steps to Reproduce (for bugs)

As described above.

Context

dbx execute

Your Environment

  • dbx version used: 0.0.11
  • Databricks Runtime version: 7.4 ML

[FEATURE] Merge cicd-templates into the dbx

Feature summary

cicd-templates is actively used by many users, but supporting two interconnected repositories is hard. The repositories were kept separate mainly for various legacy reasons that arose at the time of the project releases.

Since in most cases users of cicd-templates also use dbx, it makes sense to add the cicd-templates functionality directly into dbx.
This will also make development with dbx much simpler, since project creation will be built into the dbx CLI.

The CLI design is to add a new command called init. Basic usage will start an interactive cookiecutter parameter dialogue:

dbx init 

Passing parameters will be supported using:

dbx init --template-parameters ...

For example:

dbx init --template-parameters project_name=sample_project ...

Feature rollout plan

Since this feature affects a heavily used repository, the following actions are expected:

  • separate PR to introduce init command
  • separate PR to migrate cicd-templates documentation into dbx documentation
  • Release of 0.3.0 with init functionality
  • Adding a deprecated header to the cicd-templates
  • Archiving the cicd-templates project

[QUESTION] Is dbx capable to update jar-based library on existing cluster?

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

Context

In my use case I need to work with one existing cluster instead of creating a new one for each job, due to security requirements (the cluster must be located in a specific VNet).
According to this docs page, the only way to update a library installed on a cluster is to first uninstall it, restart the cluster, and then install it again.

So I just want to confirm that dbx is currently not capable of "reinstalling" a library on an existing cluster?
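
If reinstallation has to be done outside dbx, a possible workaround (sketch only, assuming the legacy databricks CLI; the cluster id and wheel path are placeholders) would be:

databricks libraries uninstall --cluster-id <cluster-id> --whl dbfs:/dbx/my_project/dist/my_package-0.0.1-py3-none-any.whl
databricks clusters restart --cluster-id <cluster-id>
databricks libraries install --cluster-id <cluster-id> --whl dbfs:/dbx/my_project/dist/my_package-0.0.1-py3-none-any.whl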

Your Environment

  • dbx version used:
  • Databricks Runtime version:

[FEATURE] Add Support for `Jinja2` Deployment Files

Expected Behavior

It would be beneficial to be able to define environment-specific logic within a deployment file, with if statements, for loops, filters and other functionality that Jinja2 allows. This would enable, for example, the same job deployment configuration file to be reused for different scopes and environments, and would improve code reusability.

This functionality would also automatically cover passing environment variables to the deployment configuration, which is currently handled with hardcoded regex patterns as implemented in this PR.

Given a deployment file deployment.json.j2:

{
    "default": {
    "jobs": [
        {
        "name": "your-job-name",
        {% if (ENVIRONMENT == "production") %}
        "timeout_seconds": {{ TIMEOUT | default(-1)  }},
        {% else %}
        "timeout_seconds": {{ TIMEOUT | default(2700) }},
        {% endif %}
    [ ...]

...or a deployment file deployment.yaml.j2:

environments:
  default:
    jobs:
      - name: "your-job-name"
        {% if (ENVIRONMENT == "production") %} 
        timeout_seconds: {{ TIMEOUT | default(-1)  }}
        {% else %}
        timeout_seconds: {{ TIMEOUT | default(2700) }}
        {% endif %}
   [ ... ]

there could be a Deployment Config class defined as

class Jinja2DeploymentConfig(AbstractDeploymentConfig):

which could render Jinja and then pass the rendered template onto a json or yaml loader/reader.
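
A rough sketch of what that could look like (shown standalone here instead of subclassing AbstractDeploymentConfig, and with illustrative rendering/parsing details):

import json
import os
from pathlib import Path

import yaml
from jinja2 import Template


class Jinja2DeploymentConfig:
    """Renders a .j2 deployment file with environment variables, then parses the result."""

    def __init__(self, path: str):
        self._path = Path(path)

    def get_all_environments(self) -> dict:
        # expose environment variables (e.g. ENVIRONMENT, TIMEOUT) to the template
        rendered = Template(self._path.read_text()).render(**os.environ)
        if self._path.name.endswith((".yaml.j2", ".yml.j2")):
            return yaml.safe_load(rendered)
        return json.loads(rendered)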

Current Behavior

Currently, only plain json and yaml deployment files are supported.
No custom support is provided for logic/conditional statements, filters like | lower and for loops in deployment files.

Your Environment

  • dbx version used: v0.2.2
  • Databricks Runtime version: 9.1LTS

[DISCUSSION] Organizing different trigger conditions with Github Actions

I was going through the dbx CI/CD repository below:
databrickslabs/dbx: CLI tool for advanced Databricks jobs management. (github.com)
I have a Python project where I have set up CI integration and a testing framework as suggested by the dbx project. Now I want to promote the project based on tag creation for each environment (for dev to SIT, promotion happens when a t-* tag is created; for SIT to pre-prod, promotion happens when an rc-* tag is created).
Could you please help me understand how I can make this possible with the dbx CI/CD approach?
Any help is appreciated.
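
One way to express these trigger conditions, sketched below, is a separate GitHub Actions workflow per promotion, each filtered on a tag pattern (file names, job contents and the deployment command are illustrative; the rc-* workflow is analogous):

# e.g. .github/workflows/promote-to-sit.yml
name: promote-to-sit
on:
  push:
    tags:
      - 't-*'
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # (Python and dbx installation steps omitted for brevity)
      - name: Deploy with dbx
        run: dbx deploy --deployment-file conf/deployment.yml
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}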

Unneeded argument "--deployment-file" on "dbx launch" command at integration tests

Expected Behavior

The python_basic template uses dbx launch to run integration tests without errors.

Current Behavior

When trying to do integration tests, I get the following error

Run dbx launch --deployment-file conf/deployment.yml --job=mlops-project-dbx-sample-integration-test --as-run-submit --trace
Usage: dbx launch [OPTIONS]
Try 'dbx launch -h' for help.

Error: No such option: --deployment-file
Error: Process completed with exit code 2.

dbx launch doesn't have or need the --deployment-file option
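
Presumably the template's workflow step should simply drop that option, e.g.:

dbx launch --job=mlops-project-dbx-sample-integration-test --as-run-submit --trace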

Steps to Reproduce (for bugs)

  • Use dbx to create a new project based on the python_basic template
  • Set your DATABRICKS_HOST and DATABRICKS_TOKEN on GitHub Actions
  • Commit and push
  • Check the pipeline for the error in the integration tests step

Context

I'm trying to do a deployment of a raw project based on the template using GitHub Actions

Your Environment

  • dbx version used: 0.3.0
  • Databricks Runtime version: 10.2

[Bug] Deploy removes any packages starting with "pyspark".

Expected Behavior

I can install a package starting with "pyspark" (but not the "pyspark" package) without dbx stripping it during deploy.

Current Behavior

dbx removes any packages in the provided --requirements-file starting with "pyspark"

Steps to Reproduce (for bugs)

Add a package called "pysparktools" to requirements.txt and deploy it using --requirements-file. Look for the log message saying that "pyspark" was removed from the dependency list and look at the databricks job to verify that the "pysparktools" package was stripped from the deployment.

Context

Our team has a package called "pysparkblocks" which we want to install via a requirements.txt file for our jobs.
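
A sketch of a stricter filter that drops only the pyspark distribution itself rather than any requirement starting with "pyspark" (illustrative only; the real dbx filtering code may look different):

from typing import List


def filter_requirements(lines: List[str]) -> List[str]:
    """Drop only the `pyspark` package itself, keeping e.g. `pysparkblocks`."""
    kept = []
    for line in lines:
        # take the distribution name before any environment marker, extras or version specifier
        name = line.strip().split(";")[0].split("[")[0]
        for sep in ("==", ">=", "<=", "~=", ">", "<", "!="):
            name = name.split(sep)[0]
        if name.strip().lower() == "pyspark":
            continue
        kept.append(line)
    return kept


print(filter_requirements(["pyspark==3.1.2", "pysparkblocks>=0.1", "pandas"]))
# -> ['pysparkblocks>=0.1', 'pandas']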

Your Environment

  • dbx version used: v0.1.5
  • Databricks Runtime version: 8.3

[FEATURE] Add support for multi-task jobs

Expected Behavior

When the deployment configuration contains a job with multiple tasks, a valid job configuration is generated and deployed.

Current Behavior

The deployed configuration is invalid for a multi-task job: libraries (the libraries JSON object) specified at the job level have to be moved into each task JSON object.
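
A sketch of the expected adjustment, i.e. pushing a job-level libraries list down into every task (an illustrative transformation, not the actual dbx code; the wheel path is a placeholder):

def push_libraries_to_tasks(job: dict) -> dict:
    """Move a job-level `libraries` list into every task of a multi-task job."""
    libraries = job.pop("libraries", [])
    for task in job.get("tasks", []):
        # merge instead of overwrite, in case a task already defines its own libraries
        task["libraries"] = task.get("libraries", []) + libraries
    return job


job = {
    "name": "sample-multitask",
    "libraries": [{"whl": "dbfs:/dbx/sample/artifacts/dist/dbx_package-0.0.1-py3-none-any.whl"}],
    "tasks": [{"task_key": "first"}, {"task_key": "second"}],
}
print(push_libraries_to_tasks(job))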

Your Environment

  • dbx version used: 3.1.4
  • Databricks Runtime version: 8.4

[FEATURE] Add support for environment variable substitution in deployment.yaml or deployment.json

Expected Behavior

Given a deployment file like the one shown in the attached screenshot (image omitted here),

it would be beneficial to enable environment variable substitution at runtime, such as ${{ALERT_EMAIL}} in that screenshot. The environment variable can be set by the CI/CD tool, just like the DATABRICKS_TOKEN and DATABRICKS_HOST variables that dbx already supports.
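
A minimal sketch of what such substitution could look like, using the ${{VAR}} pattern from the screenshot purely as an example (not the actual dbx behaviour):

import os
import re

_ENV_PATTERN = re.compile(r"\$\{\{\s*(\w+)\s*\}\}")


def substitute_env_vars(text: str) -> str:
    """Replace ${{VAR}} placeholders with values from the environment, leaving unknown ones untouched."""
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(0)), text)


os.environ["ALERT_EMAIL"] = "team@example.com"
print(substitute_env_vars('{"on_failure": ["${{ALERT_EMAIL}}"]}'))
# -> {"on_failure": ["team@example.com"]}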

Current Behavior

No support for environment variables; everything has to be pre-set in the deployment files.

Your Environment

  • dbx version used: 0.1.6
  • Databricks Runtime version: 8.3

policy_name is not propagating in multitask jobs

Expected Behavior

policy_name shall be correctly propagated to the tasks inside multi-task job definitions.

Current Behavior

This property is ignored in multi-task jobs.

Steps to Reproduce (for bugs)

Context

Your Environment

  • dbx version used: 0.3.0
  • Databricks Runtime version: N/A

[ISSUE] Honoring provided job names when deploying to databricks

Expected Behavior

In the deployment.json/yaml file, the job name for each job is specified. On deploying to databricks, the expectation is that a job with the same name is deployed, and overwritten if necessary.

Current Behavior

Currently, dbx deploy creates a new job with a name corresponding to the full path of the artifact directory and a UUID. This creates a large number of hard-to-differentiate entries.

Steps to Reproduce (for bugs)

dbx deploy --environment xyz --deployment-file conf/deployment.yaml

Context

Your Environment

  • dbx version used: 0.2.0
  • Databricks Runtime version: N/A

[BUG] Datafactory reflect fails on jobs with existing_cluster_id defined

Expected Behavior

Steps in ADF pipeline are updated.

Current Behavior

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\SPA\Scripts\dbx.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 70, in reflect
    reflector.launch()
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 229, in launch
    service_name = self._create_linked_service(job_spec)
  File "C:\ProgramData\Anaconda3\envs\SPA\lib\site-packages\dbx\commands\datafactory.py", line 172, in _create_linked_service
    existing_cluster_id=cluster_spec.get("existing_cluster_id"),
AttributeError: 'NoneType' object has no attribute 'get'

Steps to Reproduce (for bugs)

  1. Use exemplary job. Change conf/deployment.json to use existing_cluster_id.
  2. Run dbx deploy --files-only --write-specs-to-file=.dbx/deployment-result.json
  3. Run dbx datafactory reflect --specs-file=.dbx/deployment-result.json --subscription-name *** --resource-group *** --factory-name *** --name ***

Context

It's probably enough to replace cluster_spec with job_spec in line 172.
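
In other words, something along these lines in _create_linked_service (a sketch based on the traceback above, not a tested patch):

# dbx/commands/datafactory.py, around line 172
existing_cluster_id=job_spec.get("existing_cluster_id"),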

Your Environment

  • dbx version used: 0.2.1
  • Databricks Runtime version: 9.1 LTS

[FEATURE] Generalize file paths and add options for FUSE path substitution

Expected Behavior

Currently, the logic that transforms local paths into dbfs:/-based paths lacks well-defined rules and has no option for /dbfs/ (FUSE) substitutions.
It would be great to implement the following:

  • if a local path is referenced like "file://<path>", simply upload the file and pass the dbfs:/ path
  • if a local path is referenced like "file:as-fuse://<path>", upload the file and pass the /dbfs/ path
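
A sketch of these rules, using the file:fuse:// prefix that already appears in the deployment examples above (helper names and the exact prefix spelling are illustrative):

def adjust_path(candidate: str, artifact_base_uri: str, upload) -> str:
    """Upload local references and rewrite them as dbfs:/ or /dbfs/ (FUSE) paths."""
    if candidate.startswith("file://"):
        local_path = candidate[len("file://"):]
        upload(local_path)
        return f"{artifact_base_uri}/{local_path}"            # dbfs:/... style
    if candidate.startswith("file:fuse://"):
        local_path = candidate[len("file:fuse://"):]
        upload(local_path)
        fuse_prefix = artifact_base_uri.replace("dbfs:/", "/dbfs/")
        return f"{fuse_prefix}/{local_path}"                   # /dbfs/... FUSE style
    return candidate  # not a local reference, leave as-is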

[BUG] Unable to change the artifact location from default

Expected Behavior

When running dbx deploy, files should be pushed to the chosen artifact location, e.g. dbfs:/dbx/{addition_to_path}/{current_folder_name}

Current Behavior

dbx always pushes the files to the default path, e.g. dbfs:/dbx/{current_folder_name}, even if the artifact location in project.json has been changed.

Steps to Reproduce (for bugs)

  1. Change the project.json artifact location
  2. Run dbx deploy
  3. See where the files have been pushed to (for me this is always the default)

Context

I want to be able to include initials in the path so that files can be pushed to different places in DBFS.
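
For reference, the kind of change being made; the surrounding keys are shown as I understand the .dbx/project.json layout and may not match it exactly:

{
  "environments": {
    "default": {
      "artifact_location": "dbfs:/dbx/my_initials/my_project"
    }
  }
}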

Your Environment

  • dbx version used: 0.1.2
  • Databricks Runtime version: N/A
