astronomer / astro-provider-databricks

Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows

License: Apache License 2.0

airflow airflow-dags apache apache-airflow dags databricks databricks-notebooks python workflows

astro-provider-databricks's Introduction

Databricks Workflows in Airflow

The Astro Databricks Provider is an Apache Airflow provider to write Databricks Workflows using Airflow as the authoring interface. Running your Databricks notebooks as Databricks Workflows can result in a 75% cost reduction ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute).

While this is maintained by Astronomer, it's available to anyone using Airflow - you don't need to be an Astronomer customer to use it.

There are a few advantages to defining your Databricks Workflows in Airflow:

                                       via Databricks                 via Airflow
Authoring interface                    Web-based via Databricks UI    Code via Airflow DAG
Workflow compute pricing
Notebook code in source control
Workflow structure in source control
Retry from beginning
Retry single task
Task groups within Workflows
Trigger workflows from other DAGs
Workflow-level parameters

Example

The following Airflow DAG illustrates how to use the DatabricksWorkflowTaskGroup and DatabricksNotebookOperator to define a Databricks Workflow in Airflow:

from pendulum import datetime

from airflow.decorators import dag, task_group
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# define your cluster spec - can have from 1 to many clusters
job_cluster_spec = [
   {
      "job_cluster_key": "astro_databricks",
      "new_cluster": {
         "cluster_name": "",
         # ...
      },
   }
]

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def databricks_workflow_example():
   # the task group is a context manager that will create a Databricks Workflow
   with DatabricksWorkflowTaskGroup(
      group_id="example_databricks_workflow",
      databricks_conn_id="databricks_default",
      job_clusters=job_cluster_spec,
      # you can specify common fields here that get shared to all notebooks
      notebook_packages=[
         { "pypi": { "package": "pandas" } },
      ],
      # notebook_params supports templating
      notebook_params={
         "start_time": "{{ ds }}",
      }
   ) as workflow:
      notebook_1 = DatabricksNotebookOperator(
         task_id="notebook_1",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_1",
         source="WORKSPACE",
         # job_cluster_key corresponds to the job_cluster_key in the job_cluster_spec
         job_cluster_key="astro_databricks",
         # you can add to packages & params at the task level
         notebook_packages=[
            { "pypi": { "package": "scikit-learn" } },
         ],
         notebook_params={
            "end_time": "{{ macros.ds_add(ds, 1) }}",
         }
      )

      # you can embed task groups for easier dependency management
      @task_group(group_id="inner_task_group")
      def inner_task_group():
         notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_2",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

         notebook_3 = DatabricksNotebookOperator(
            task_id="notebook_3",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_3",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

      notebook_4 = DatabricksNotebookOperator(
         task_id="notebook_4",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_4",
         source="WORKSPACE",
         job_cluster_key="astro_databricks",
      )

      notebook_1 >> inner_task_group() >> notebook_4

   trigger_workflow_2 = TriggerDagRunOperator(
      task_id="trigger_workflow_2",
      trigger_dag_id="workflow_2",
      execution_date="{{ next_execution_date }}",
   )

   workflow >> trigger_workflow_2

databricks_workflow_example_dag = databricks_workflow_example()

Airflow UI

(screenshot)

Databricks UI

(screenshot)

Quickstart

Check out the following quickstart guides:

Documentation

The documentation is a work in progress; we aim to follow the Diátaxis system.

Changelog

Astro Databricks follows semantic versioning for releases. Read the changelog to learn more about the changes introduced in each version.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guidelines for a detailed overview on how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0

astro-provider-databricks's People

Contributors

chrishronek, crong-k, denimalpaca, dimberman, dwreeves, hang1225, iancmoritz, jacobsalway, jbandoro, jlaneve, josh-fell, julienledem, kaxil, mikewallis42, pankajkoti, patawan, petedejoy, pre-commit-ci[bot], rnhttr, ryw, shashanksinghfd, tatiana, tjanif, w0ut0


astro-provider-databricks's Issues

Enable code coverage

At the moment we are missing code coverage; we should set it up and enable it in our CI.

Support sending parameterized SQL queries to Databricks Jobs

Problem statement

Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While this feature unlocks all Databricks Python-based development (and, to some extent, SQL through spark.sql commands), it does not allow users to take advantage of Databricks SQL, which limits the flows that users can create.

To solve this, we should add support for sql_task tasks.

sql_task tasks allow Databricks to refer to query objects that have been created in the Databricks SQL editor. These queries can be parameterized by the user at runtime.


Solving this issue would involve two steps:

  • The first step is to create a DatabricksSqlQueryOperator that expects a query ID instead of a raw SQL query. If run outside of a DatabricksWorkflowTaskGroup, this operator would be able to launch and monitor a SQL task on its own.
  • The second step would be to create a convert_to_databricks_workflow_task method to convert the SQL operator task into a workflow task.

For this task to be completed, a SQL query should be added to the example DAG and should run through CI/CD.
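A rough sketch of the kind of Jobs API payload such a sql_task would map to; the field names follow the Databricks Jobs API 2.1 sql_task schema, while the operator that would emit it (and the query/warehouse IDs below) are assumptions about the proposal, not an existing API:

# Hypothetical payload a DatabricksSqlQueryOperator could contribute to the
# workflow's task list; query_id and warehouse_id values are placeholders.
sql_task_payload = {
    "task_key": "daily_rollup",
    "sql_task": {
        "query": {"query_id": "<query-id-from-the-databricks-sql-editor>"},
        "warehouse_id": "<sql-warehouse-id>",
        "parameters": {"run_date": "{{ ds }}"},
    },
}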

Clicking on repair all/repair single task for an ongoing DAG/task breaks the UI

When a Databricks DAG is run in Airflow and the repair buttons are clicked (either to repair a single task or all failed tasks), the UI breaks.

We might need to disable the links when the job is running or handle it gracefully by displaying a message that there is an ongoing job run and repair cannot take place.


Getting 401 Client Error: Unauthorized

I am trying to run the sample DAG with a valid Databricks connection that I use in other DAGs with our custom Databricks operators.

The config for our Databricks conn is:

{
  "token": "dapisomevalidtoken",
  "host": "https://my-databricks.cloud.databricks.com"
}

Are you expecting a different format for databricks conn?

Here is the call stack from the DatabricksWorkflowTaskGroup:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 165, in execute
    job = _get_job_by_name(self.databricks_job_name, jobs_api)
  File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 43, in _get_job_by_name
    jobs = jobs_api.list_jobs().get("jobs", [])
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/jobs/api.py", line 36, in list_jobs
    resp = self.client.list_jobs(job_type=job_type, expand_tasks=expand_tasks, offset=offset,
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/service.py", line 341, in list_jobs
    return self.client.perform_query(
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/api_client.py", line 174, in perform_query
    raise requests.exceptions.HTTPError(message, response=e.response)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url
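For reference, a minimal sketch of the same connection expressed with Airflow's Connection object, putting the workspace URL in host and the token in the extras; whether this exact layout is what the Astro provider expects is an assumption, not a confirmed answer:

from airflow.models.connection import Connection

# Sketch only: placeholder values, with the workspace URL in `host` and the PAT in
# the connection extras.
conn = Connection(
    conn_id="databricks_default",
    conn_type="databricks",
    host="https://my-databricks.cloud.databricks.com",
    extra='{"token": "dapisomevalidtoken"}',
)
print(conn.get_uri())  # URI form that can be exported as AIRFLOW_CONN_DATABRICKS_DEFAULT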

Connection to Databricks

I am trying to connect to a Databricks cluster and I am getting this error: HTTPError: 403 Client Error: Forbidden for url.
Any thoughts about the reason for that?

Enhancement - Authentication using OAuth tokens for service principals.

The current implementation utilizes the Databricks CLI ApiClient for authentication, which requires the user to provide a username and password or a Personal Access Token (PAT).
It would help if the token generation and refresh process could be automated using Azure service principal credentials. This would simplify the authentication process.

For example, a function like _get_token could be implemented
(https://github.com/apache/airflow/blob/da4912b5e562c7a30e0c54f79220c99a32e69ab9/airflow/providers/databricks/hooks/databricks_base.py#L213).
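For illustration only, a minimal sketch of fetching such a token for an Azure service principal via the client-credentials flow; this is not part of the provider today, the tenant/client values are placeholders, and "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d" is the well-known Azure Databricks resource ID:

import requests

# Sketch: exchange Azure service principal credentials for an AAD token that can be
# used as a Databricks bearer token.
def get_databricks_aad_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    response = requests.post(
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]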

Bug - Invalid dependency graph for tasks

Describe the bug

Creating a dependency between two tasks created using DatabricksTaskOperator() does not use the task_key specified, but uses dagName__groupId__taskKey instead. This is inconsistent with the tasks created on Databricks, because those correctly use the task_key specified.

To Reproduce

Steps to reproduce the behavior:

Run the following code with a valid cluster config, updating the paths to point to two notebooks on Databricks that could simply print hello.

from airflow.decorators import dag
from astro_databricks.operators.common import DatabricksTaskOperator
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

 
DATABRICKS_JOB_CLUSTER_KEY: str = "Airflow_Shared_job_cluster"
DATABRICKS_CONN_ID: str = "databricks_default"

 
job_cluster_spec: list[dict] = [
# A valid cluster config
]

 
@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dynamic_template():
    task_group = DatabricksWorkflowTaskGroup(
        group_id="projectv2",
        databricks_conn_id=DATABRICKS_CONN_ID,
        job_clusters=job_cluster_spec,
    )
    with task_group:
        print_1 = DatabricksTaskOperator(
            task_id="print_1",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_1",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test1",
                    "source": "WORKSPACE",
                },
            },
        )

        print_2 = DatabricksTaskOperator(
            task_id="print_2",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_2",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test2",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2.set_upstream(print_1)
dynamic_template()


Expected behavior

This should create a DAG with two tasks - print_1 and print_2 - and print_2 should be dependent on print_1.

Desktop (please complete the following information):

OS: macos Ventura 13.6.1
Browser Firefox
Version 123.0.1

Launch Operator trigger_rule Option

Currently the launch operator defaults to 'all_success', and there doesn't appear to be an option to pass a trigger_rule to this operator. Thus, when any upstream task is skipped, the launch task skips and linked tasks fail, despite different trigger_rules being applied to the DatabricksNotebookOperators.

Is it possible to pass a trigger_rule to this operator, and if not, can this functionality be added?

Support for using existing clusters in DatabricksNotebookOperator

Issue

The parameter "existing_cluster_id" for DatabricksNotebookOperator is not being utilized.

Our teams at HealthPartners have several use cases of using a job compute and shared compute within a workflow for different tasks. However the ability to use different computes (other than job) is not supported as only the job_cluster parameter is evaluated when constructing the JSON for a new workflow.

Proposed Solution

Under "convert_to_databricks_workflow_task" function in operators/notebook.py:

  • add key='existing_cluster_id' and value=self.existing_cluster_id to the variable result if self.existing_cluster_id is not empty.
  • add key='job_cluster_key' and value=self.job_cluster_key to the variable 'result' if self.existing_cluster_id is empty.
  • PR: #73
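A minimal sketch of what the bullets above describe, expressed as a standalone helper; the helper name is hypothetical, and in the provider this logic would sit inside convert_to_databricks_workflow_task when building the task's result dict:

# Hypothetical helper mirroring the proposal above.
def _cluster_fields(existing_cluster_id, job_cluster_key):
    # Prefer an existing all-purpose cluster when one is given; otherwise fall back
    # to the job cluster defined in the workflow's job_cluster_spec.
    if existing_cluster_id:
        return {"existing_cluster_id": existing_cluster_id}
    return {"job_cluster_key": job_cluster_key}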

Task names character limit in workflows

The Databricks task names are generated as a combination of dag_id and task_id in the notebook operator, but this can overflow the character limit for workflow jobs, which is only 100 characters. _get_databricks_task_id needs a change to return a trimmed task ID when the task group is a DatabricksWorkflowTaskGroup (see the sketch below).

This will also give a cleaner look in the Databricks UI, making it easier to identify the task names.
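A rough sketch of the kind of trimming this asks for; the real _get_databricks_task_id lives in the provider, and the separator and right-trimming strategy shown here are assumptions:

# Sketch: keep the most specific (right-most) part of the combined identifier so the
# generated task key stays within Databricks' 100-character limit.
def _get_databricks_task_id(dag_id: str, task_id: str, limit: int = 100) -> str:
    task_key = f"{dag_id}__{task_id.replace('.', '__')}"
    return task_key if len(task_key) <= limit else task_key[-limit:]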

Enable Pre-Commit service

Once we do this, add the badge to the README:

[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/astronomer/astro-providers-databricks/main.svg)](https://results.pre-commit.ci/latest/github/astronomer/astro-providers-databricks/main)

Address feedback to get this work into Apache Airflow

While discussing contributing this work into the Apache Airflow repo with @alexott, he gave the following feedback:

  • We need to talk about integrating your work with the JobsCreate operator, which is now being developed by @Sri Tikkireddy (PR: apache/airflow#32221).

  • From analysis of that code, it has a lot of overlap with your work, but it has some valuable things, like the use of data classes from the Databricks Python SDK.

  • As you mentioned, you're using the SDK from the Databricks CLI - it's already considered deprecated and is replaced by the Databricks Python SDK, which has a big advantage over the old SDK in that it evolves together with the REST APIs.

  • If your code doesn't provide asynchronous execution, then either use of the new SDK could be the best way forward, or we can switch to using DatabricksHook functions.

  • In your code, instead of a JSON payload for tasks and a dedicated operator for notebooks, we can switch to using data classes from the new SDK - this will give self-documenting capabilities and type safety.
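As a rough illustration of the last point, a task defined via data classes instead of a raw JSON payload could look like the sketch below; the class and field names follow my understanding of the Databricks Python SDK (databricks-sdk) and how an integrated operator might use them, so treat them as assumptions:

from databricks.sdk.service.jobs import NotebookTask, Task

# Sketch: a typed task definition rather than a hand-written JSON dict.
task = Task(
    task_key="notebook_1",
    job_cluster_key="astro_databricks",
    notebook_task=NotebookTask(notebook_path="/Shared/notebook_1"),
)
print(task.as_dict())  # serializes back to a Jobs API payload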

Support for pyspark job submit for Databricks Jobs using astro provider

Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While going through Orchestrate Databricks jobs with Apache Airflow, I came across DatabricksSubmitRunOperator. This would be really nice functionality, as it would allow users to take full advantage of DatabricksWorkflowTaskGroup from Astro and ease the development of clean Airflow DAGs.

There are other ways to implement the above using Databricks Connect V2.

The ask is related to the discussion on How to Orchestrate Databricks Jobs Using Airflow, where Daniel Imberman (@dimberman) expressed that this functionality is on the roadmap of this project.
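For context, this is roughly what submitting a plain PySpark file looks like today with the standard Databricks provider's DatabricksSubmitRunOperator; how (or whether) this gets folded into the DatabricksWorkflowTaskGroup is exactly what this issue asks about, and the cluster spec and script path below are placeholders:

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Sketch: the json payload follows the Jobs runs/submit API (new_cluster + spark_python_task).
submit_pyspark = DatabricksSubmitRunOperator(
    task_id="submit_pyspark_job",
    databricks_conn_id="databricks_default",
    json={
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "spark_python_task": {"python_file": "dbfs:/scripts/etl_job.py"},
    },
)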

Add support for TaskGroups from within the Databricks Workflow

Consider the case where you have 4 workflows you want to run in a row, each with a few notebooks. Right now, you could use this package to define each workflow separately.

However, the downside of this is that there are 4 separate launch tasks that launch clusters, even if the clusters across workflows are the same. It would be neat to instead combine them into a single Databricks Workflow, with Airflow TaskGroups representing each of the original workflows.

This would clutter the Databricks Workflow UI quite a bit, but a user shouldn't need to use that UI much, and there would be time and cost savings to doing this.

A couple of thoughts/notes:

  • on the Databricks Workflow side, we'd have to add additional dependencies across Airflow TaskGroups because there's no notion of TaskGroups for Databricks
  • cold starts for a Databricks Workflow seem to be ~4 minutes, while warm starts (within a few minutes of a previous workflow) are still consistently 1.5 minutes, so this would definitely save time/compute

Improve the README

  • Installation
  • Example or link to example DAGs
  • Link to the changelog
  • License

Repair All failed run - inconsistent behaviour

Issue: I have 3 tasks wrapped under a single Databricks cluster group. When one of the tasks fails and I use 'Repair All failed tasks' from the launch task, the Airflow UI sometimes shows the DAG as failed even though the repair is actually still running in Databricks. It is quite weird: sometimes it works fine in the UI, but mostly it shows as failed.

Airflow 2.6.2 running in AKS using open source Helm chart.

Duplicate Job Creation in Databricks During Airflow DAG Runs

Issue

Our teams at HealthPartners are encountering a recurring issue where each execution of an Airflow DAG leads to the creation of a new job, despite the job already existing within the Databricks workspace.

This issue is most likely linked to the Databricks REST API returning at most 20 jobs per request by default. In instances where the workspace contains over 20 jobs, additional API requests using the 'next_page_token' from the initial call are necessary to fetch the complete job list.

Proposed Solution

Under "_get_job_by_name" function in operators/workflow.py:

  • directly pass the job_name parameter to the jobs_api.list_jobs() method to leverage the API's built-in job name filtering capability. This approach is more efficient than fetching an exhaustive job list and subsequently filtering for the specific job.
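A sketch of the proposed change, assuming (as this issue does) that the installed databricks_cli version lets list_jobs forward the Jobs API's name filter; the exact keyword argument is an assumption:

# Sketch of _get_job_by_name in operators/workflow.py with server-side name filtering;
# whether list_jobs accepts a `name` kwarg depends on the databricks_cli version.
def _get_job_by_name(job_name, jobs_api):
    jobs = jobs_api.list_jobs(name=job_name).get("jobs", [])
    for job in jobs:
        if job.get("settings", {}).get("name") == job_name:
            return job
    return None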

New Feature Request: Handling retry at the notebook level

Hey, I've been using the library for a while now, love it, thanks for the good work.
I'm trying to implement retries at the notebook level. Databricks has parameters for this that you can change in the UI and that appear like this in the JSON:

{
  "task_key": "databricks_lol__champion_builds__champion_builds_gold_0",
  ...
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true,
  "timeout_seconds": 1200,
}

I'm unable to set those parameters with astro_databricks.

Thanks in advance

Support Airflow version 2.2.4, 2.3 and 2.4 and enable CI run on it

At the moment, our customer who requested the Databricks support uses:

Python 3.9
Airflow 2.2.4

(based on Sigma)

However, we are only testing Python 3.8 and Airflow 2.5. We should extend the current CI setup astronomer/astronomer-cosmos#161 (PR astronomer/astronomer-cosmos#167) to also validate against these versions of Python and Airflow.

Add the support needed in the implementation for Airflow 2.2.4, 2.3 and 2.4 together with Python 3.8 and 3.9, and enable CI to run on such a matrix.

Provision to pass queue name (to execute in specified airflow worker) while constructing DatabricksWorkflowTaskGroup

Use case: execute all the Databricks tasks on a dedicated Airflow worker.

Problem: the task created by "_CreateDatabricksWorkflowOperator" executes on the default Airflow worker, as there is no provision to pass a queue value to "_CreateDatabricksWorkflowOperator".

Proposed solution:
Provide a way to pass operator parameters, or just the operator queue, while constructing DatabricksWorkflowTaskGroup, so that those parameters can be passed along when _CreateDatabricksWorkflowOperator is constructed internally (see the sketch below).

Please let me know if any information is required.
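A purely hypothetical illustration of the proposal; the operator_extra_kwargs argument does not exist today and its name is an assumption, shown only to make the idea concrete:

# Hypothetical API: forwarding operator-level arguments (such as `queue`) from the
# task group to the internally created _CreateDatabricksWorkflowOperator.
workflow = DatabricksWorkflowTaskGroup(
    group_id="example_databricks_workflow",
    databricks_conn_id="databricks_default",
    job_clusters=job_cluster_spec,
    operator_extra_kwargs={"queue": "databricks-worker-queue"},  # proposed, not implemented
)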
