astronomer / astro-provider-databricks

Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows

License: Apache License 2.0

airflow airflow-dags apache apache-airflow dags databricks databricks-notebooks python workflows

astro-provider-databricks's Introduction

Databricks Workflows in Airflow

The Astro Databricks Provider is an Apache Airflow provider to write Databricks Workflows using Airflow as the authoring interface. Running your Databricks notebooks as Databricks Workflows can result in a 75% cost reduction ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute).

While this is maintained by Astronomer, it's available to anyone using Airflow - you don't need to be an Astronomer customer to use it.

There are a few advantages to defining your Databricks Workflows in Airflow:

                                       via Databricks                 via Airflow
Authoring interface                    Web-based via Databricks UI    Code via Airflow DAG
Workflow compute pricing
Notebook code in source control
Workflow structure in source control
Retry from beginning
Retry single task
Task groups within Workflows
Trigger workflows from other DAGs
Workflow-level parameters

Example

The following Airflow DAG illustrates how to use the DatabricksWorkflowTaskGroup and DatabricksNotebookOperator to define a Databricks Workflow in Airflow:

from pendulum import datetime

from airflow.decorators import dag, task_group
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from astro_databricks import DatabricksNotebookOperator, DatabricksWorkflowTaskGroup

# define your cluster spec - can have from 1 to many clusters
job_cluster_spec = [
   {
      "job_cluster_key": "astro_databricks",
      "new_cluster": {
         "cluster_name": "",
         # ...
      },
   }
]

@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def databricks_workflow_example():
   # the task group is a context manager that will create a Databricks Workflow
   with DatabricksWorkflowTaskGroup(
      group_id="example_databricks_workflow",
      databricks_conn_id="databricks_default",
      job_clusters=job_cluster_spec,
      # you can specify common fields here that get shared to all notebooks
      notebook_packages=[
         { "pypi": { "package": "pandas" } },
      ],
      # notebook_params supports templating
      notebook_params={
         "start_time": "{{ ds }}",
      }
   ) as workflow:
      notebook_1 = DatabricksNotebookOperator(
         task_id="notebook_1",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_1",
         source="WORKSPACE",
         # job_cluster_key corresponds to the job_cluster_key in the job_cluster_spec
         job_cluster_key="astro_databricks",
         # you can add to packages & params at the task level
         notebook_packages=[
            { "pypi": { "package": "scikit-learn" } },
         ],
         notebook_params={
            "end_time": "{{ macros.ds_add(ds, 1) }}",
         }
      )

      # you can embed task groups for easier dependency management
      @task_group(group_id="inner_task_group")
      def inner_task_group():
         notebook_2 = DatabricksNotebookOperator(
            task_id="notebook_2",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_2",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

         notebook_3 = DatabricksNotebookOperator(
            task_id="notebook_3",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_3",
            source="WORKSPACE",
            job_cluster_key="astro_databricks",
         )

      notebook_4 = DatabricksNotebookOperator(
         task_id="notebook_4",
         databricks_conn_id="databricks_default",
         notebook_path="/Shared/notebook_4",
         source="WORKSPACE",
         job_cluster_key="astro_databricks",
      )

      notebook_1 >> inner_task_group() >> notebook_4

   trigger_workflow_2 = TriggerDagRunOperator(
      task_id="trigger_workflow_2",
      trigger_dag_id="workflow_2",
      execution_date="{{ next_execution_date }}",
   )

   workflow >> trigger_workflow_2

databricks_workflow_example_dag = databricks_workflow_example()

Airflow UI

(screenshot)

Databricks UI

(screenshot)

Quickstart

Check out the following quickstart guides:

Documentation

The documentation is a work in progress; we aim to follow the Diátaxis system.

Changelog

Astro Databricks follows semantic versioning for releases. Read the changelog to learn more about the changes introduced in each version.

Contribution guidelines

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Read the Contribution Guidelines for a detailed overview on how to contribute.

Contributors and maintainers should abide by the Contributor Code of Conduct.

License

Apache License 2.0

astro-provider-databricks's People

Contributors

chrishronek, crong-k, denimalpaca, dimberman, dwreeves, hang1225, iancmoritz, jacobsalway, jbandoro, jlaneve, josh-fell, julienledem, kaxil, mikewallis42, pankajkoti, patawan, petedejoy, pre-commit-ci[bot], rnhttr, ryw, shashanksinghfd, tatiana, tjanif, w0ut0


astro-provider-databricks's Issues

Enable code coverage

At the moment we are missing code coverage; we should set it up and enable it in our CI.

Support sending parameterized SQL queries to Databricks Jobs

Problem statement

Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While this feature unlocks all Databricks Python-based development (and, to some extent, SQL through spark.sql commands), it does not allow users to take advantage of Databricks SQL, which limits the flows that users can create.

To solve this, we should add support for sql_task tasks.

sql_task tasks allow Databricks to refer to query objects that have been created in the Databricks SQL editor. These queries can be parameterized by the user at runtime.


Solving this issue would involve two steps:

  • The first step is to create a DatabricksSqlQueryOperator that expects a query ID instead of a raw SQL query. If run outside of a DatabricksWorkflowTaskGroup, this operator would be able to launch and monitor a SQL task on its own.
  • The second step would be to create a convert_to_databricks_workflow_task method to convert the SQL operator task into a workflow task.

For this task to be completed, a SQL query should be added to the example DAG and should run through CI/CD.
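A rough sketch of the kind of Jobs API payload such a sql_task would map to; the field names follow the Databricks Jobs API 2.1 sql_task schema, while the operator that would emit it (and the query/warehouse IDs below) are assumptions about the proposal, not an existing API:

# Hypothetical payload a DatabricksSqlQueryOperator could contribute to the
# workflow's task list; query_id and warehouse_id values are placeholders.
sql_task_payload = {
    "task_key": "daily_rollup",
    "sql_task": {
        "query": {"query_id": "<query-id-from-the-databricks-sql-editor>"},
        "warehouse_id": "<sql-warehouse-id>",
        "parameters": {"run_date": "{{ ds }}"},
    },
}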

Clicking on repair all/repair single task for an ongoing DAG/task breaks the UI

When a Databricks DAG is run in Airflow and the repair buttons are clicked (either to repair a single task or all failed tasks), the UI breaks.

We might need to disable the links when the job is running or handle it gracefully by displaying a message that there is an ongoing job run and repair cannot take place.


Getting 401 Client Error: Unauthorized

I am trying to run the sample DAG with a valid Databricks connection that I use in other DAGs with our custom Databricks operators.

The config for our Databricks conn is:

{
  "token": "dapisomevalidtoken",
  "host": "https://my-databricks.cloud.databricks.com"
}

Are you expecting a different format for databricks conn?

Here is the call stack from the DatabricksWorkflowTaskGroup:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 165, in execute
    job = _get_job_by_name(self.databricks_job_name, jobs_api)
  File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 43, in _get_job_by_name
    jobs = jobs_api.list_jobs().get("jobs", [])
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/jobs/api.py", line 36, in list_jobs
    resp = self.client.list_jobs(job_type=job_type, expand_tasks=expand_tasks, offset=offset,
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/service.py", line 341, in list_jobs
    return self.client.perform_query(
  File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/api_client.py", line 174, in perform_query
    raise requests.exceptions.HTTPError(message, response=e.response)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url
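For reference, a minimal sketch of the same connection expressed with Airflow's Connection object, putting the workspace URL in host and the token in the extras; whether this exact layout is what the Astro provider expects is an assumption, not a confirmed answer:

from airflow.models.connection import Connection

# Sketch only: placeholder values, with the workspace URL in `host` and the PAT in
# the connection extras.
conn = Connection(
    conn_id="databricks_default",
    conn_type="databricks",
    host="https://my-databricks.cloud.databricks.com",
    extra='{"token": "dapisomevalidtoken"}',
)
print(conn.get_uri())  # URI form that can be exported as AIRFLOW_CONN_DATABRICKS_DEFAULT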

Connection to Databricks

I am trying to connect to a Databricks cluster and I am getting this error: HTTPError: 403 Client Error: Forbidden for url.
Any thoughts about the reason for that?

Enhancement - Authentication using OAuth tokens for service principals.

The current implementation utilizes the Databricks CLI ApiClient for authentication, which requires the user to provide a username and password or a Personal Access Token (PAT).
It would help if the token generation and refresh process could be automated using Azure service principal credentials. This would simplify the authentication process.

For example, a function like _get_token could be implemented
(https://github.com/apache/airflow/blob/da4912b5e562c7a30e0c54f79220c99a32e69ab9/airflow/providers/databricks/hooks/databricks_base.py#L213).
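For illustration only, a minimal sketch of fetching such a token for an Azure service principal via the client-credentials flow; this is not part of the provider today, the tenant/client values are placeholders, and "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d" is the well-known Azure Databricks resource ID:

import requests

# Sketch: exchange Azure service principal credentials for an AAD token that can be
# used as a Databricks bearer token.
def get_databricks_aad_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    response = requests.post(
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]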

Bug - Invalid dependency graph for tasks

Describe the bug

Creating a dependency between two tasks created using DatabricksTaskOperator() does not use the task_key specified, but uses dagName__groupId__taskKey instead. This is inconsistent with the tasks created on Databricks, because those correctly use the task_key specified.

To Reproduce

Steps to reproduce the behavior:

Run the following code with a valid cluster config, updating the paths to point to two notebooks on Databricks that could simply print hello.

from airflow.decorators import dag
from astro_databricks.operators.common import DatabricksTaskOperator
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

 
DATABRICKS_JOB_CLUSTER_KEY: str = "Airflow_Shared_job_cluster"
DATABRICKS_CONN_ID: str = "databricks_default"

 
job_cluster_spec: list[dict] = [
# A valid cluster config
]

 
@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dynamic_template():
    task_group = DatabricksWorkflowTaskGroup(
        group_id="projectv2",
        databricks_conn_id=DATABRICKS_CONN_ID,
        job_clusters=job_cluster_spec,
    )
    with task_group:
        print_1 = DatabricksTaskOperator(
            task_id="print_1",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_1",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test1",
                    "source": "WORKSPACE",
                },
            },
        )

        print_2 = DatabricksTaskOperator(
            task_id="print_2",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_2",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test2",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2.set_upstream(print_1)
dynamic_template()


Expected behavior

This should create a DAG with two tasks - print_1 and print_2 - and print_2 should be dependent on print_1.

Desktop (please complete the following information):

OS: macos Ventura 13.6.1
Browser Firefox
Version 123.0.1

Launch Operator trigger_rule Option

Currently the launch operator defaults to 'all_success', and there doesn't appear to be an option to pass a trigger_rule to this operator. Thus, when any upstream task is skipped, the launch task skips and linked tasks fail, despite different trigger_rules being applied to the DatabricksNotebookOperators.

Is it possible to pass a trigger_rule to this operator, and if not, can this functionality be added?

Support for using existing clusters in DatabricksNotebookOperator

Issue

The parameter "existing_cluster_id" for DatabricksNotebookOperator is not being utilized.

Our teams at HealthPartners have several use cases of using a job compute and shared compute within a workflow for different tasks. However the ability to use different computes (other than job) is not supported as only the job_cluster parameter is evaluated when constructing the JSON for a new workflow.

Proposed Solution

Under "convert_to_databricks_workflow_task" function in operators/notebook.py:

  • add key='existing_cluster_id' and value=self.existing_cluster_id to the variable result if self.existing_cluster_id is not empty.
  • add key='job_cluster_key' and value=self.job_cluster_key to the variable 'result' if self.existing_cluster_id is empty.
  • PR: #73
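A minimal sketch of what the bullets above describe, expressed as a standalone helper; the helper name is hypothetical, and in the provider this logic would sit inside convert_to_databricks_workflow_task when building the task's result dict:

# Hypothetical helper mirroring the proposal above.
def _cluster_fields(existing_cluster_id, job_cluster_key):
    # Prefer an existing all-purpose cluster when one is given; otherwise fall back
    # to the job cluster defined in the workflow's job_cluster_spec.
    if existing_cluster_id:
        return {"existing_cluster_id": existing_cluster_id}
    return {"job_cluster_key": job_cluster_key}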

Task names character limit in workflows

The Databricks task names are generated as a combination of dag_id and task_id in the notebook operator, but this can overflow the character limit for workflow jobs, which is only 100 characters. _get_databricks_task_id needs a change to return a trimmed task ID when the task group is a DatabricksWorkflowTaskGroup (see the sketch below).

This will also give a cleaner look in the Databricks UI, making it easier to identify the task names.
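A rough sketch of the kind of trimming this asks for; the real _get_databricks_task_id lives in the provider, and the separator and right-trimming strategy shown here are assumptions:

# Sketch: keep the most specific (right-most) part of the combined identifier so the
# generated task key stays within Databricks' 100-character limit.
def _get_databricks_task_id(dag_id: str, task_id: str, limit: int = 100) -> str:
    task_key = f"{dag_id}__{task_id.replace('.', '__')}"
    return task_key if len(task_key) <= limit else task_key[-limit:]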

Enable Pre-Commit service

Once we do this, add the badge to the README:

[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/astronomer/astro-providers-databricks/main.svg)](https://results.pre-commit.ci/latest/github/astronomer/astro-providers-databricks/main)

Address feedback to get this work into Apache Airflow

While discussing contributing this work into the Apache Airflow repo with @alexott, he gave the following feedback:

  • We need to talk about integrating your work with the JobsCreate operator, which is now being developed by @Sri Tikkireddy (PR: apache/airflow#32221).

  • From analysis of that code, it has a lot of overlap with your work, but it has some valuable things, like the use of data classes from the Databricks Python SDK.

  • As you mentioned, you're using the SDK from the Databricks CLI - it's already considered deprecated and is replaced by the Databricks Python SDK, which has a big advantage over the old SDK in that it evolves together with the REST APIs.

  • If your code doesn't provide asynchronous execution, then either use of the new SDK could be the best way forward, or we can switch to using DatabricksHook functions.

  • In your code, instead of a JSON payload for tasks and a dedicated operator for notebooks, we can switch to using data classes from the new SDK - this will give self-documenting capabilities and type safety.
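As a rough illustration of the last point, a task defined via data classes instead of a raw JSON payload could look like the sketch below; the class and field names follow my understanding of the Databricks Python SDK (databricks-sdk) and how an integrated operator might use them, so treat them as assumptions:

from databricks.sdk.service.jobs import NotebookTask, Task

# Sketch: a typed task definition rather than a hand-written JSON dict.
task = Task(
    task_key="notebook_1",
    job_cluster_key="astro_databricks",
    notebook_task=NotebookTask(notebook_path="/Shared/notebook_1"),
)
print(task.as_dict())  # serializes back to a Jobs API payload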

Support for pyspark job submit for Databricks Jobs using astro provider

Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While going through Orchestrate Databricks jobs with Apache Airflow, I came across DatabricksSubmitRunOperator. This would be really nice functionality, as it would allow users to take full advantage of DatabricksWorkflowTaskGroup from Astro and ease the development of clean Airflow DAGs.

There are other ways to implement the above using Databricks Connect V2.

The ask is related to the discussion on How to Orchestrate Databricks Jobs Using Airflow, where Daniel Imberman (@dimberman) expressed that this functionality is on the roadmap of this project.
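For context, this is roughly what submitting a plain PySpark file looks like today with the standard Databricks provider's DatabricksSubmitRunOperator; how (or whether) this gets folded into the DatabricksWorkflowTaskGroup is exactly what this issue asks about, and the cluster spec and script path below are placeholders:

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Sketch: the json payload follows the Jobs runs/submit API (new_cluster + spark_python_task).
submit_pyspark = DatabricksSubmitRunOperator(
    task_id="submit_pyspark_job",
    databricks_conn_id="databricks_default",
    json={
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "spark_python_task": {"python_file": "dbfs:/scripts/etl_job.py"},
    },
)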

Add support for TaskGroups from within the Databricks Workflow

Consider the case where you have 4 workflows you want to run in a row, each with a few notebooks. Right now, you could use this package to define each workflow separately.

However, the downside of this is that there are 4 separate launch tasks that launch clusters, even if the clusters across workflows are the same. It would be neat to instead combine them into a single Databricks Workflow, with Airflow TaskGroups representing each of the original workflows.

This would clutter the Databricks Workflow UI quite a bit, but a user shouldn't need to use that UI much, and there would be time and cost savings to doing this.

A couple of thoughts/notes:

  • on the Databricks Workflow side, we'd have to add additional dependencies across Airflow TaskGroups because there's no notion of TaskGroups for Databricks
  • cold starts for a Databricks Workflow seem to be ~4 minutes, while warm starts (within a few minutes of a previous workflow) are still consistently 1.5 minutes, so this would definitely save time/compute

Improve the README

  • Installation
  • Example or link to example DAGs
  • Link to the changelog
  • License

Repair All failed run - inconsistent behaviour

Issue: I have 3 tasks wrapped under a single Databricks cluster group. When one of the tasks fails and I use 'Repair All failed tasks' from the launch task, the Airflow UI sometimes shows the DAG as failed even though the repair is actually still running in Databricks. It is quite weird: sometimes it works fine in the UI, but mostly it shows as failed.

Airflow 2.6.2 running in AKS using open source Helm chart.

Duplicate Job Creation in Databricks During Airflow DAG Runs

Issue

Our teams at HealthPartners are encountering a recurring issue where each execution of an Airflow DAG leads to the creation of a new job, despite the job already existing within the Databricks workspace.

This issue is most likely linked to the Databricks REST API returning at most 20 jobs per request by default. In instances where the workspace contains over 20 jobs, additional API requests using the 'next_page_token' from the initial call are necessary to fetch the complete job list.

Proposed Solution

Under "_get_job_by_name" function in operators/workflow.py:

  • directly pass the job_name parameter to the jobs_api.list_jobs() method to leverage the API's built-in job name filtering capability. This approach is more efficient than fetching an exhaustive job list and subsequently filtering for the specific job.
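A sketch of the proposed change, assuming (as this issue does) that the installed databricks_cli version lets list_jobs forward the Jobs API's name filter; the exact keyword argument is an assumption:

# Sketch of _get_job_by_name in operators/workflow.py with server-side name filtering;
# whether list_jobs accepts a `name` kwarg depends on the databricks_cli version.
def _get_job_by_name(job_name, jobs_api):
    jobs = jobs_api.list_jobs(name=job_name).get("jobs", [])
    for job in jobs:
        if job.get("settings", {}).get("name") == job_name:
            return job
    return None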

New Feature Request: Handling retry at the notebook level

Hey, I've been using the library for a while now, love it, thanks for the good work.
I'm trying to implement retries at the notebook level. Databricks has parameters for this that you can change in the UI and that appear like this in the JSON:

{
  "task_key": "databricks_lol__champion_builds__champion_builds_gold_0",
  ...
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true,
  "timeout_seconds": 1200,
}

I'm unable to set those parameters with astro_databricks.

Thanks in advance

Support Airflow version 2.2.4, 2.3 and 2.4 and enable CI run on it

At the moment, our customer who requested the Databricks support uses:

Python 3.9
Airflow 2.2.4

(based on Sigma)

However, we are only testing Python 3.8 and Airflow 2.5. We should extend the current CI setup astronomer/astronomer-cosmos#161 (PR astronomer/astronomer-cosmos#167) to also validate against these versions of Python and Airflow.

Add the support needed in the implementation for Airflow 2.2.4, 2.3 and 2.4 together with Python 3.8 and 3.9, and enable CI to run on such a matrix.

Provision to pass queue name (to execute in specified airflow worker) while constructing DatabricksWorkflowTaskGroup

Use case: execute all the Databricks tasks on a dedicated Airflow worker.

Problem: the task created by "_CreateDatabricksWorkflowOperator" executes on the default Airflow worker, as there is no provision to pass a queue value to "_CreateDatabricksWorkflowOperator".

Proposed solution:
Provide a way to pass operator parameters, or just the operator queue, while constructing DatabricksWorkflowTaskGroup, so that those parameters can be passed along when _CreateDatabricksWorkflowOperator is constructed internally (see the sketch below).

Please let me know if any information is required.
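A purely hypothetical illustration of the proposal; the operator_extra_kwargs argument does not exist today and its name is an assumption, shown only to make the idea concrete:

# Hypothetical API: forwarding operator-level arguments (such as `queue`) from the
# task group to the internally created _CreateDatabricksWorkflowOperator.
workflow = DatabricksWorkflowTaskGroup(
    group_id="example_databricks_workflow",
    databricks_conn_id="databricks_default",
    job_clusters=job_cluster_spec,
    operator_extra_kwargs={"queue": "databricks-worker-queue"},  # proposed, not implemented
)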
