astronomer / astro-provider-databricks
Orchestrate your Databricks notebooks in Airflow and execute them as Databricks Workflows
License: Apache License 2.0
Hey, I've been using the library for a while now, love it, thanks for the good work.
I'm trying to implement retries at the notebook level. Databricks has parameters for this that you can change in the UI, and they appear like this in the job JSON:
{
"task_key": "databricks_lol__champion_builds__champion_builds_gold_0",
...
"max_retries": 2,
"min_retry_interval_millis": 60000,
"retry_on_timeout": true,
"timeout_seconds": 1200,
}
I'm unable to set those parameters with astro_databricks
Thanks in advance
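For reference, a hedged sketch of how these retry fields might be expressed if an operator accepts a raw Jobs API task spec (an assumption based on the DatabricksTaskOperator usage shown further down this page); whether astro_databricks actually forwards them is exactly what this issue asks:

# Untested sketch: retry settings expressed as Jobs API task fields.
# Whether astro_databricks currently forwards these is the open question here.
task_config = {
    "task_key": "champion_builds_gold",
    "notebook_task": {"notebook_path": "/path/to/notebook", "source": "WORKSPACE"},
    "max_retries": 2,
    "min_retry_interval_millis": 60000,
    "retry_on_timeout": True,
    "timeout_seconds": 1200,
}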
ATM we are missing Code Coverage, we should set it up and enable it in our CI.
The operator is always expecting job cluster information, even when you are only using the existing_cluster_id parameter.
Our teams at HealthPartners are encountering a recurring issue where each execution of an Airflow DAG leads to the creation of a new job, despite the job already existing within the Databricks workspace.
This issue is most likely linked to the Databricks REST API retrieving a limit of 20 jobs per request, by default. In instances where the workspace contains over 20 jobs, additional API requests are necessary utilizing the 'next_page_token' from the initial call to fetch the complete job list.
Under the "_get_job_by_name" function in operators/workflow.py: pass the job_name parameter to the jobs_api.list_jobs() method to leverage the API's built-in job name filtering capability. This approach is more efficient than fetching an exhaustive job list and subsequently filtering for the specific job.
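A minimal sketch of the pagination idea, assuming the databricks-cli JobsApi exposes offset/limit keyword arguments (the exact signature depends on the installed databricks-cli version; the Jobs 2.1 REST API also supports a name filter, which would be even more efficient if exposed):

# Hedged sketch: page through list_jobs() instead of relying on the default
# single page of 20 jobs. Not the provider's actual implementation.
def _get_job_by_name(job_name, jobs_api, page_size=20):
    offset = 0
    while True:
        response = jobs_api.list_jobs(offset=offset, limit=page_size)
        jobs = response.get("jobs", [])
        if not jobs:
            return None
        for job in jobs:
            if job.get("settings", {}).get("name") == job_name:
                return job
        offset += len(jobs)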
Within the Airflow UI, when a Databricks workflow job is run and a DAG or task is marked failed while the job is still running, it gets marked failed within Airflow, but the ongoing Databricks job run is not cancelled/killed and continues processing.
Issue: I have 3 tasks wrapped under a single Databricks cluster group. When one of the tasks fails and I use 'Repair All failed tasks' from the launch task, the Airflow UI sometimes shows the DAG as failed even though the repair is actually still running in Databricks. It's quite weird: sometimes it works fine in the UI, but mostly it just shows as failed.
Airflow 2.6.2 running in AKS using open source Helm chart.
Currently the launch operator defaults to 'all_success' and there doesn't appear to be an option to pass a trigger_rule to this operator. Thus, when any upstream task is skipped, the launch task skips and the linked tasks fail, despite different trigger_rules being applied to the DatabricksNotebookOperators.
Is it possible to pass a trigger_rule to this operator, and if not, can this functionality be added?
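A hypothetical sketch of the requested option; the launch_operator_kwargs parameter below does not exist in the package today and is only meant to illustrate the ask:

# Hypothetical: forward kwargs (such as trigger_rule) to the internal launch task.
from airflow import DAG
from airflow.utils.trigger_rule import TriggerRule
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

job_cluster_spec = []  # placeholder: a valid job cluster config goes here

with DAG("launch_trigger_rule_example", start_date=datetime(2024, 1, 1), schedule=None):
    workflow = DatabricksWorkflowTaskGroup(
        group_id="my_workflow",
        databricks_conn_id="databricks_default",
        job_clusters=job_cluster_spec,
        # Proposed parameter, not an existing argument in astro_databricks.
        launch_operator_kwargs={"trigger_rule": TriggerRule.NONE_FAILED},
    )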
▪︎ Installation
▪︎ Example or link to example DAGs
▪︎ Link to the changelog
▪︎ License
While discussing contributing this work into the Apache Airflow repo with @alexott, he gave the following feedback:
We need to talk about integrating your work with JobsCreate operator, which is now developed by @Sri Tikkireddy (PR: apache/airflow#32221).
From analysis of that code, it has a lot of overlap with your work, but it also has some valuable things, like the use of data classes from the Databricks Python SDK.
As you mentioned, you're using the SDK from the Databricks CLI - it's already considered deprecated and has been replaced by the Databricks Python SDK, which has a big advantage over the old SDK as it evolves together with the REST APIs.
If your code doesn't provide asynchronous execution, then using either SDK could be the best way forward, or we can switch to using the DatabricksHook functions.
In your code, instead of a JSON payload for tasks and a dedicated operator for notebooks, we can switch to using data classes from the new SDK - that will give self-documenting capabilities and type safety.
When a Databricks DAG is run in Airflow, and the repair buttons are clicked (either repair a single job or all failed jobs), the UI breaks as seen in the screenshot.
We might need to disable the links when the job is running or handle it gracefully by displaying a message that there is an ongoing job run and repair cannot take place.
At the moment, our customer who requested the Databricks support uses:
Python 3.9
Airflow 2.2.4
(based on Sigma)
However, we are only testing Python 3.8 and Airflow 2.5. We should extend the current CI setup astronomer/astronomer-cosmos#161 (PR astronomer/astronomer-cosmos#167) to also validate against these versions of Python and Airflow.
Add the needed support in the implementation for Airflow 2.2.4, 2.3 and 2.4 together with Python versions 3.8 & 3.9, and enable CI to run on such a matrix.
Context
As of 0.1.2, DatabricksNotebookOperator only supports Airflow Jinja templating in the field databricks_metadata.
Desired behaviour
The field notebook_params should also be templated.
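A minimal sketch of one way to get the desired behaviour today, assuming a simple subclass is acceptable; the proper fix would be to add the field to the operator's own template_fields:

# Sketch: extend template_fields so notebook_params is rendered by Jinja.
from astro_databricks.operators.notebook import DatabricksNotebookOperator


class TemplatedDatabricksNotebookOperator(DatabricksNotebookOperator):
    template_fields = (*DatabricksNotebookOperator.template_fields, "notebook_params")

Any Jinja expression passed via notebook_params (for example {{ ds }}) would then be rendered before the notebook task is submitted.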
In #52, pydantic was pinned to <2.0.0. However, in Airflow 2.8.0, Pydantic is bumped to >2.3.0. This causes the package installation to fail on Airflow > 2.8.0.
Update CI to:
Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator.
While I was going through Orchestrate Databricks jobs with Apache Airflow, I came across DatabricksSubmitRunOperator. Supporting it would be really nice functionality, as it would allow users to take full advantage of the DatabricksWorkflowTaskGroup from astro and ease the development of clean Airflow DAGs.
There are other ways to implement the above using Databricks Connect V2.
The ask is related to the discussion on How to Orchestrate Databricks Jobs Using Airflow, where Daniel Imberman (@dimberman) mentioned that this functionality is on the roadmap of this project.
I am trying to run the sample dag with a valid databricks connection that I use in other dags with our custom databricks operators.
The config for our databricks conn is:
{
"token": "dapisomevalidtoken",
"host": "https://my-databricks.cloud.databricks.com"
}
Are you expecting a different format for databricks conn?
Here is the call stack from the DatabricksWorkflowTaskGroup:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 165, in execute
job = _get_job_by_name(self.databricks_job_name, jobs_api)
File "/usr/local/lib/python3.10/site-packages/astro_databricks/operators/workflow.py", line 43, in _get_job_by_name
jobs = jobs_api.list_jobs().get("jobs", [])
File "/usr/local/lib/python3.10/site-packages/databricks_cli/jobs/api.py", line 36, in list_jobs
resp = self.client.list_jobs(job_type=job_type, expand_tasks=expand_tasks, offset=offset,
File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/service.py", line 341, in list_jobs
return self.client.perform_query(
File "/usr/local/lib/python3.10/site-packages/databricks_cli/sdk/api_client.py", line 174, in perform_query
raise requests.exceptions.HTTPError(message, response=e.response)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url
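For comparison, one commonly documented layout for an Airflow Databricks connection puts the workspace URL in host and the personal access token in password; whether astro_databricks expects exactly this shape is an assumption worth verifying against the package README, but a 401 usually means the token is not being picked up:

# Hedged example: defining the connection as a JSON env var (Airflow 2.3+),
# with the workspace URL in `host` and the PAT in `password`.
import json
import os

os.environ["AIRFLOW_CONN_DATABRICKS_DEFAULT"] = json.dumps(
    {
        "conn_type": "databricks",
        "host": "https://my-databricks.cloud.databricks.com",
        "password": "dapisomevalidtoken",  # personal access token
    }
)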
A customer reported that, from time to time, instances of DatabricksNotebookOperator are stuck in a running state in Airflow even though they have completed on Databricks.
The logs need to explain what the Databricks job is trying to use; currently they are empty.
While checking our code, I noticed that the implementation could be improved.
https://github.com/astronomer/astro-provider-databricks/blob/3e1ca039a024a98f9079d178478aa24702e15453/src/astro_databricks/operators/notebook.py#L235C1-L238C64
The implementation seems to have been improved in our contribution to Airflow
apache/airflow#39178
Since this affects an Astronomer customer and we have not completed the migration yet, my suggestion is that:
The Databricks task names are generated as a combination of dag_id and task_id in the notebook operator, but this can overflow the character limit for workflow jobs, which allow only 100 characters. _get_databricks_task_id needs a change to return a trimmed task id when the task group is a DatabricksWorkflowTaskGroup.
This would also give a cleaner look in the Databricks UI, making task names easier to identify.
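A minimal sketch of the trimming idea, assuming a 100-character limit on the Databricks task_key (the real _get_databricks_task_id would also need to know whether it is running inside a DatabricksWorkflowTaskGroup):

# Sketch: build dag_id__task_id and trim it to fit the assumed 100-char limit,
# keeping the tail since the task_id suffix is usually the most distinctive part.
def _get_databricks_task_id(dag_id: str, task_id: str, max_length: int = 100) -> str:
    combined = f"{dag_id}__{task_id}"
    return combined if len(combined) <= max_length else combined[-max_length:]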
Release PR #76
Currently the DatabricksWorkflowTaskGroup only supports creating notebook tasks using the DatabricksNotebookOperator. While this feature unlocks all Databricks Python-based development (and, to some extent, SQL through spark.sql commands), it does not allow users to take advantage of Databricks SQL, which limits the flows users can create.
To solve this, we should offer support for sql_task tasks.
sql_task tasks allow Databricks to refer to query objects that have been created in the Databricks SQL editor. These queries can be parameterized by the user at runtime.
Solving this issue would involve two steps. The first step is to create a DatabricksSqlQueryOperator that expects a query ID instead of a SQL query; if run outside of a DatabricksWorkflowTaskGroup, this operator would be able to launch and monitor a SQL task on its own. The second step would be to create a convert_to_databricks_workflow_task method to convert the SQL operator task into a workflow task.
For this task to be completed, a SQL query should be added to the example DAG and should run through CI/CD.
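For illustration, a hedged sketch of what such a sql_task entry looks like in a Jobs API task spec (the query_id, warehouse_id, and parameter values are placeholders):

# Sketch of a Jobs API task referencing a saved Databricks SQL query.
sql_task_config = {
    "task_key": "run_saved_query",
    "sql_task": {
        "query": {"query_id": "12345678-aaaa-bbbb-cccc-1234567890ab"},  # placeholder
        "warehouse_id": "abcdef1234567890",  # placeholder SQL warehouse
        "parameters": {"run_date": "2024-01-01"},
    },
}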
When creating a dependency between two tasks created using DatabricksTaskOperator(), the dependency does not use the task_key specified, but instead uses dagName__groupId__taskKey. This is inconsistent with the tasks created on Databricks, because they correctly use the specified task_key.
Steps to reproduce the behavior:
Run the following code with a valid cluster config, updating the paths to two notebooks on Databricks that simply print hello.
from airflow.decorators import dag
from astro_databricks.operators.common import DatabricksTaskOperator
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
from pendulum import datetime

DATABRICKS_JOB_CLUSTER_KEY: str = "Airflow_Shared_job_cluster"
DATABRICKS_CONN_ID: str = "databricks_default"

job_cluster_spec: list[dict] = [
    # A valid cluster config
]


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def dynamic_template():
    task_group = DatabricksWorkflowTaskGroup(
        group_id="projectv2",
        databricks_conn_id=DATABRICKS_CONN_ID,
        job_clusters=job_cluster_spec,
    )
    with task_group:
        print_1 = DatabricksTaskOperator(
            task_id="print_1",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_1",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test1",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2 = DatabricksTaskOperator(
            task_id="print_2",
            databricks_conn_id=DATABRICKS_CONN_ID,
            job_cluster_key=DATABRICKS_JOB_CLUSTER_KEY,
            task_config={
                "task_key": "print_2",
                "notebook_task": {
                    "notebook_path": "path_to_notebook/print_test2",
                    "source": "WORKSPACE",
                },
            },
        )
        print_2.set_upstream(print_1)


dynamic_template()
This should create a DAG with two tasks, print_1 and print_2, with print_2 dependent on print_1.
OS: macOS Ventura 13.6.1
Browser: Firefox
Version: 123.0.1
The parameter "existing_cluster_id" for DatabricksNotebookOperator is not being utilized.
Our teams at HealthPartners have several use cases for using both job compute and shared compute within a workflow for different tasks. However, the ability to use computes other than job compute is not supported, as only the job_cluster parameter is evaluated when constructing the JSON for a new workflow.
Under the "convert_to_databricks_workflow_task" function in operators/notebook.py: set existing_cluster_id in the result if self.existing_cluster_id is not empty.
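A hedged sketch of the suggested logic; the names result, existing_cluster_id and job_cluster_key come from the issue text and the Jobs API task format, while the surrounding code is an assumption rather than the package's actual implementation:

# Inside convert_to_databricks_workflow_task (sketch only):
if getattr(self, "existing_cluster_id", None):
    # Run the task on the user-supplied existing/shared cluster.
    result["existing_cluster_id"] = self.existing_cluster_id
    result.pop("job_cluster_key", None)
else:
    result["job_cluster_key"] = self.job_cluster_key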
Use case: execute all the Databricks tasks on a dedicated Airflow worker.
Problem: the task created by "_CreateDatabricksWorkflowOperator" executes on the default Airflow worker, as there is no provision to pass a queue value to "_CreateDatabricksWorkflowOperator".
Proposed solution:
Provide a way to pass operator parameters, or just the operator queue, while constructing the DatabricksWorkflowTaskGroup, so that those parameters can be passed along when _CreateDatabricksWorkflowOperator is constructed internally; a hypothetical sketch is included below.
Please let me know if any information is required.
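A hypothetical sketch of the proposed provision, similar in spirit to the trigger_rule request above; the launch_operator_kwargs parameter below does not exist in the package today:

# Hypothetical (this keyword does not exist today): forward operator kwargs such
# as `queue` to the internal _CreateDatabricksWorkflowOperator. As in the earlier
# trigger_rule sketch, this would live inside a DAG definition.
from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup

job_cluster_spec = []  # placeholder: a valid job cluster config goes here

workflow = DatabricksWorkflowTaskGroup(
    group_id="my_workflow",
    databricks_conn_id="databricks_default",
    job_clusters=job_cluster_spec,
    launch_operator_kwargs={"queue": "databricks"},  # proposed argument
)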
Uncomment the Type-Check tests from the Github Actions workflow and fix the following MyPy issues:
https://github.com/astronomer/astro-providers-databricks/actions/runs/4384003378/jobs/7674945034
And any other that may be found.
Similar to Google libraries, this way, our package could play along very nicely with the Astro Python SDK.
I am trying to connect to a Databricks cluster, and I am getting this error: HTTPError: 403 Client Error: Forbidden for url:
Any thoughts about the reason for that?
Consider the case where you have 4 workflows you want to run in a row, each with a few notebooks. Right now, you could use this package to define each workflow separately, which would look like this:
However, the downside of this is that there are 4 separate launch tasks that launch clusters, even if the clusters across workflows are the same. It'd be neat to do something more like this:
This would clutter the Databricks Workflow UI quite a bit, but a user shouldn't need to use it that much. There would be time and cost savings to doing this.
Couple thoughts/notes:
Once we do this, add the badge to the README:
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/astronomer/astro-providers-databricks/main.svg)](https://results.pre-commit.ci/latest/github/astronomer/astro-providers-databricks/main)
The current implementation utilizes Databricks CLI - ApiClient for authentication, which requires the user to provide a username and password or a Personal Access Token (PAT).
It would help if the token generation and refreshing process could be automated using Azure service principal credentials. This would simplify the authentication process.
For example, a function call like _get_token could be implemented:
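A hedged sketch of such a helper using azure-identity; 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the well-known Azure Databricks resource/application ID, while the tenant, client, and secret values are placeholders supplied by the caller:

# Sketch: obtain a short-lived AAD token for Databricks from a service principal.
from azure.identity import ClientSecretCredential

DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"  # Azure Databricks app ID


def _get_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    credential = ClientSecretCredential(
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret,
    )
    return credential.get_token(f"{DATABRICKS_RESOURCE_ID}/.default").token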