astronomer / astro-provider-databricks
Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows
License: Apache License 2.0
Hey, I've been using the library for a while now, love it, thanks for the good work.
I'm trying to implement retries at the notebook level. Databricks has parameters for this that you can change in the UI, and they appear like this in the job JSON:
{
"task_key": "databricks_lol__champion_builds__champion_builds_gold_0",
...
"max_retries": 2,
"min_retry_interval_millis": 60000,
"retry_on_timeout": true,
"timeout_seconds": 1200,
}
I'm unable to set those parameters with astro_databricks
Thanks in advance
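For reference, a hedged sketch of how these retry fields might be expressed if an operator accepts a raw Jobs API task spec (an assumption based on the DatabricksTaskOperator usage shown further down this page); whether astro_databricks actually forwards them is exactly what this issue asks:

# Untested sketch: retry settings expressed as Jobs API task fields.
# Whether astro_databricks currently forwards these is the open question here.
task_config = {
    "task_key": "champion_builds_gold",
    "notebook_task": {"notebook_path": "/path/to/notebook", "source": "WORKSPACE"},
    "max_retries": 2,
    "min_retry_interval_millis": 60000,
    "retry_on_timeout": True,
    "timeout_seconds": 1200,
}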
ATM we are missing Code Coverage, we should set it up and enable it in our CI.
The operator is always expecting job cluster information, even when you are only using the existing_cluster_id parameter.
Our teams at HealthPartners are encountering a recurring issue where each execution of an Airflow DAG leads to the creation of a new job, despite the job already existing within the Databricks workspace.
This issue is most likely linked to the Databricks REST API retrieving a limit of 20 jobs per request, by default. In instances where the workspace contains over 20 jobs, additional API requests are necessary utilizing the 'next_page_token' from the initial call to fetch the complete job list.
Under the "_get_job_by_name" function in operators/workflow.py: pass the job_name parameter to the jobs_api.list_jobs() method to leverage the API's built-in job name filtering capability. This approach is more efficient than fetching an exhaustive job list and subsequently filtering for the specific job.
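A minimal sketch of the pagination idea, assuming the databricks-cli JobsApi exposes offset/limit keyword arguments (the exact signature depends on the installed databricks-cli version; the Jobs 2.1 REST API also supports a name filter, which would be even more efficient if exposed):

# Hedged sketch: page through list_jobs() instead of relying on the default
# single page of 20 jobs. Not the provider's actual implementation.
def _get_job_by_name(job_name, jobs_api, page_size=20):
    offset = 0
    while True:
        response = jobs_api.list_jobs(offset=offset, limit=page_size)
        jobs = response.get("jobs", [])
        if not jobs:
            return None
        for job in jobs:
            if job.get("settings", {}).get("name") == job_name:
                return job
        offset += len(jobs)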
Within the Airflow UI, when a Databricks workflow job is run and a DAG or task is marked failed while the job is still running, it gets marked failed within Airflow, but the ongoing Databricks job run is not cancelled/killed and continues processing.
Issue: I have 3 tasks wrapped under a single Databricks cluster group. When one of the tasks fails and I use 'Repair All failed tasks' from the launch task, the Airflow UI sometimes shows the DAG as failed even though the repair is actually still running in Databricks. It's quite weird: sometimes it works fine in the UI, but mostly it just shows as failed.
Airflow 2.6.2 running in AKS using open source Helm chart.
Currently the launch operator defaults to 'all_success' and there doesn't appear to be an option to pass a trigger_rule to this operator. Thus, when any upstream task is skipped, the launch task skips and the linked tasks fail, despite different trigger_rules being applied to the DatabricksNotebookOperators.
Is it possible to pass a trigger_rule to this operator, and if not, can this functionality be added?
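A hypothetical sketch of the requested option; the launch_operator_kwargs parameter below does not exist in the package today and is only meant to illustrate the ask:

# Hypothetical: forward kwargs (such as trigger_rule) to the internal launch task.
from airflow import DAG
from airflow.utils.trigger_rule import TriggerRule
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

job_cluster_spec = []  # placeholder: a valid job cluster config goes here

with DAG("launch_trigger_rule_example", start_date=datetime(2024, 1, 1), schedule=None):
    workflow = DatabricksWorkflowTaskGroup(
        group_id="my_workflow",
        databricks_conn_id="databricks_default",
        job_clusters=job_cluster_spec,
        # Proposed parameter, not an existing argument in astro_databricks.
        launch_operator_kwargs={"trigger_rule": TriggerRule.NONE_FAILED},
    )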
▪︎ Installation
▪︎ Example or link to example DAGs
▪︎ Link to the changelog
▪︎ License
While discussing contributing this work into the Apache Airflow repo with @alexott, he gave the following feedback:
We need to talk about integrating your work with JobsCreate operator, which is now developed by @Sri Tikkireddy (PR: apache/airflow#32221).
From analysis of that code, it has a lot of overlap with your work, but it also has some valuable things, like the use of data classes from the Databricks Python SDK.
As you mentioned, you're using the SDK from the Databricks CLI - it's already considered deprecated and has been replaced by the Databricks Python SDK, which has a big advantage over the old SDK as it evolves together with the REST APIs.
If your code doesn't provide asynchronous execution, then using either SDK could be the best way forward, or we can switch to using the DatabricksHook functions.
In your code, instead of a JSON payload for tasks and a dedicated operator for notebooks, we can switch to using data classes from the new SDK - that will give self-documenting capabilities and type safety.
When a Databricks DAG is run in Airflow, and the repair buttons are clicked (either repair a single job or all failed jobs), the UI breaks as seen in the screenshot.
We might need to disable the links when the job is running or handle it gracefully by displaying a message that there is an ongoing job run and repair cannot take place.
At the moment, our customer who requested the Databricks support uses:
Python 3.9
Airflow 2.2.4
(based on Sigma)
However, we are only testing Python 3.8 and Airflow 2.5. We should extend the current CI setup astronomer/astronomer-cosmos#161 (PR astronomer/astronomer-cosmos#167) to also validate against these versions of Python and Airflow.
Add the needed support in the implementation for Airflow 2.2.4, 2.3 and 2.4 together with Python versions 3.8 & 3.9, and enable CI to run on such a matrix.
Context
As of 0.1.2, DatabricksNotebookOperator only supports Airflow Jinja templating in the field databricks_metadata.
Desired behaviour
The field notebook_params should also be templated.
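A minimal sketch of one way to get the desired behaviour today, assuming a simple subclass is acceptable; the proper fix would be to add the field to the operator's own template_fields:

# Sketch: extend template_fields so notebook_params is rendered by Jinja.
from astro_databricks.operators.notebook import DatabricksNotebookOperator


class TemplatedDatabricksNotebookOperator(DatabricksNotebookOperator):
    template_fields = (*DatabricksNotebookOperator.template_fields, "notebook_params")

Any Jinja expression passed via notebook_params (for example {{ ds }}) would then be rendered before the notebook task is submitted.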
In #52, pydantic was pinned to <2.0.0. However, in Airflow 2.8.0, Pydantic is bumped to >2.3.0. This causes the package installation to fail on Airflow > 2.8.0.
Update CI to:
Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator.
While I was going through Orchestrate Databricks jobs with Apache Airflow, I came across DatabricksSubmitRunOperator. Supporting it would be really nice functionality, as it would allow users to take full advantage of the DatabricksWorkflowTaskGroup from astro and ease the development of clean Airflow DAGs.
There are other ways to implement the above using Databricks Connect V2.
The ask is related to the discussion on How to Orchestrate Databricks Jobs Using Airflow, where Daniel Imberman (@dimberman) mentioned that this functionality is on the roadmap of this project.
I am trying to run the sample dag with a valid databricks connection that I use in other dags with our custom databricks operators.
The config for our databricks conn is:
{
"token": "dapisomevalidtoken",
"host": "https://my-databricks.cloud.databricks.com"
}
Are you expecting a different format for databricks conn?
Here is the call stack from the DatabricksWorkflowTaskGroup:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 165, in execute
job = _get_job_by_name(self.databricks_job_name, jobs_api)
File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 43, in _get_job_by_name
jobs = jobs_api.list_jobs().get("jobs", [])
File "/usr/local/lib/python3.10/site-packages/databricks_cli/jobs/api.py", line 36, in list_jobs
resp = self.client.list_jobs(job_type=job_type, expand_tasks=expand_tasks, offset=offset,
File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/service.py", line 341, in list_jobs
return self.client.perform_query(
File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/api_client.py", line 174, in perform_query
raise requests.exceptions.HTTPError(message, response=e.response)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url
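For comparison, one commonly documented layout for an Airflow Databricks connection puts the workspace URL in host and the personal access token in password; whether astro_databricks expects exactly this shape is an assumption worth verifying against the package README, but a 401 usually means the token is not being picked up:

# Hedged example: defining the connection as a JSON env var (Airflow 2.3+),
# with the workspace URL in `host` and the PAT in `password`.
import json
import os

os.environ["AIRFLOW_CONN_DATABRICKS_DEFAULT"] = json.dumps(
    {
        "conn_type": "databricks",
        "host": "https://my-databricks.cloud.databricks.com",
        "password": "dapisomevalidtoken",  # personal access token
    }
)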
A customer reported that, from time to time, instances of DatabricksNotebookOperator are stuck in a running state in Airflow even though they have completed on Databricks.
The logs need to explain what the Databricks job is trying to use; currently they are empty.
While checking our code, I noticed that the implementation could be improved.
https://github.com/astronomer/astro-provider-databricks/blob/3e1ca039a024a98f9079d178478aa24702e15453/src/astro_databricks/operators/notebook.py#L235C1-L238C64
The implementation seems to have been improved in our contribution to Airflow
apache/airflow#39178
Since this affects an Astronomer customer and we have not completed the migration yet, my suggestion is that:
The Databricks task names are generated as a combination of dag_id and task_id in the notebook operator, but this can overflow the character limit for workflow jobs, which allow only 100 characters. _get_databricks_task_id needs a change to return a trimmed task id when the task group is a DatabricksWorkflowTaskGroup.
This would also give a cleaner look in the Databricks UI, making task names easier to identify.
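A minimal sketch of the trimming idea, assuming a 100-character limit on the Databricks task_key (the real _get_databricks_task_id would also need to know whether it is running inside a DatabricksWorkflowTaskGroup):

# Sketch: build dag_id__task_id and trim it to fit the assumed 100-char limit,
# keeping the tail since the task_id suffix is usually the most distinctive part.
def _get_databricks_task_id(dag_id: str, task_id: str, max_length: int = 100) -> str:
    combined = f"{dag_id}__{task_id}"
    return combined if len(combined) <= max_length else combined[-max_length:]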
Release PR #76
Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While this feature unlocks all Databricks Python-based development (and, to some extent, SQL through spark.sql commands), it does not allow users to take advantage of Databricks SQL, which limits the flows users can create.
To solve this, we should offer support for sql_task tasks.
sql_task tasks allow Databricks to refer to query objects that have been created in the Databricks SQL editor. These queries can be parameterized by the user at runtime.
Solving this issue would involve two steps. The first step is to create a DatabricksSqlQueryOperator that expects a query ID instead of a SQL query; if run outside of a DatabricksWorkflowTaskGroup, this operator would be able to launch and monitor a SQL task on its own. The second step would be to create a convert_to_databricks_workflow_task method to convert the SQL operator task into a workflow task.
For this task to be completed, a SQL query should be added to the example DAG and should run through CI/CD.
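For illustration, a hedged sketch of what such a sql_task entry looks like in a Jobs API task spec (the query_id, warehouse_id, and parameter values are placeholders):

# Sketch of a Jobs API task referencing a saved Databricks SQL query.
sql_task_config = {
    "task_key": "run_saved_query",
    "sql_task": {
        "query": {"query_id": "12345678-aaaa-bbbb-cccc-1234567890ab"},  # placeholder
        "warehouse_id": "abcdef1234567890",  # placeholder SQL warehouse
        "parameters": {"run_date": "2024-01-01"},
    },
}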
When creating a dependency between two tasks created using DatabricksTaskOperator(), the dependency does not use the task_key specified, but instead uses dagName__groupId__taskKey. This is inconsistent with the tasks created on Databricks, because they correctly use the specified task_key.
Steps to reproduce the behavior:
Run the following code with a valid cluster config, updating the paths to two notebooks on Databricks that simply print hello.
from airflow.decorators import dag
from astro_databricks.operators.common import DatabricksTaskOperator
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

DATABRICKS_JOB_CLUSTER_KEY: str = "Airflow_Shared_job_cluster"
DATABRICKS_CONN_ID: str = "databricks_default"

job_cluster_spec: list[dict] = [
    # A valid cluster config
]


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dynamic_template():
    task_group = DatabricksWorkflowTaskGroup(
        group_id="projectv2",
        databricks_conn_id=DATABRICKS_CONN_ID,
        job_clusters=job_cluster_spec,
    )
    with task_group:
        print_1 = DatabricksTaskOperator(
            task_id="print_1",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_1",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test1",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2 = DatabricksTaskOperator(
            task_id="print_2",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_2",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test2",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2.set_upstream(print_1)


dynamic_template()
This should create a DAG with two tasks, print_1 and print_2, with print_2 dependent on print_1.
OS: macOS Ventura 13.6.1
Browser: Firefox
Version: 123.0.1
The parameter "existing_cluster_id" for DatabricksNotebookOperator is not being utilized.
Our teams at HealthPartners have several use cases for using both job compute and shared compute within a workflow for different tasks. However, the ability to use computes other than job compute is not supported, as only the job_cluster parameter is evaluated when constructing the JSON for a new workflow.
Under the "convert_to_databricks_workflow_task" function in operators/notebook.py: set existing_cluster_id in the result if self.existing_cluster_id is not empty.
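A hedged sketch of the suggested logic; the names result, existing_cluster_id and job_cluster_key come from the issue text and the Jobs API task format, while the surrounding code is an assumption rather than the package's actual implementation:

# Inside convert_to_databricks_workflow_task (sketch only):
if getattr(self, "existing_cluster_id", None):
    # Run the task on the user-supplied existing/shared cluster.
    result["existing_cluster_id"] = self.existing_cluster_id
    result.pop("job_cluster_key", None)
else:
    result["job_cluster_key"] = self.job_cluster_key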
Use case: execute all the Databricks tasks on a dedicated Airflow worker.
Problem: the task created by "_CreateDatabricksWorkflowOperator" executes on the default Airflow worker, as there is no provision to pass a queue value to "_CreateDatabricksWorkflowOperator".
Proposed solution:
Provide a way to pass operator parameters, or just the operator queue, while constructing the DatabricksWorkflowTaskGroup, so that those parameters can be passed along when _CreateDatabricksWorkflowOperator is constructed internally; a hypothetical sketch is included below.
Please let me know if any information is required.
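A hypothetical sketch of the proposed provision, similar in spirit to the trigger_rule request above; the launch_operator_kwargs parameter below does not exist in the package today:

# Hypothetical (this keyword does not exist today): forward operator kwargs such
# as `queue` to the internal _CreateDatabricksWorkflowOperator. As in the earlier
# trigger_rule sketch, this would live inside a DAG definition.
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup

job_cluster_spec = []  # placeholder: a valid job cluster config goes here

workflow = DatabricksWorkflowTaskGroup(
    group_id="my_workflow",
    databricks_conn_id="databricks_default",
    job_clusters=job_cluster_spec,
    launch_operator_kwargs={"queue": "databricks"},  # proposed argument
)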
Uncomment the Type-Check tests from the Github Actions workflow and fix the following MyPy issues:
https://github.com/astronomer/astro-providers-databricks/actions/runs/4384003378/jobs/7674945034
And any other that may be found.
Similar to Google libraries, this way, our package could play along very nicely with the Astro Python SDK.
I am trying to connect to a Databricks cluster, and I am getting this error: HTTPError: 403 Client Error: Forbidden for url:
Any thoughts about the reason for that?
Consider the case where you have 4 workflows you want to run in a row, each with a few notebooks. Right now, you could use this package to define each workflow separately, which would look like this:
However, the downside of this is that there are 4 separate launch tasks that launch clusters, even if the clusters across workflows are the same. It'd be neat to do something more like this:
This would clutter the Databricks Workflow UI quite a bit, but a user shouldn't need to use it that much. There would be time and cost savings to doing this.
Couple thoughts/notes:
Once we do this, add the badge to the README:
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/astronomer/astro-providers-databricks/main.svg)](https://results.pre-commit.ci/latest/github/astronomer/astro-providers-databricks/main)
The current implementation utilizes Databricks CLI - ApiClient for authentication, which requires the user to provide a username and password or a Personal Access Token (PAT).
It would help if the token generation and refreshing process could be automated using Azure service principal credentials. This would simplify the authentication process.
For example, a function call like _get_token could be implemented:
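A hedged sketch of such a helper using azure-identity; 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the well-known Azure Databricks resource/application ID, while the tenant, client, and secret values are placeholders supplied by the caller:

# Sketch: obtain a short-lived AAD token for Databricks from a service principal.
from azure.identity import ClientSecretCredential

DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"  # Azure Databricks app ID


def _get_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    credential = ClientSecretCredential(
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret,
    )
    return credential.get_token(f"{DATABRICKS_RESOURCE_ID}/.default").token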