Comments (5)
Thanks for opening this issue.
First of all, there is actually a minor bug in your code since you should be using dask.compute inside of the delayed driver function (do_math) as well, i.e.
import dask, time

@dask.delayed
def add(x, y):
    time.sleep(2)
    return x + y

@dask.delayed
def do_math(x, y, op):
    results = []
    for _ in range(4):
        results.append(op(x, y))
    return dask.compute(*results)
The reason why this isn't necessary right now is also the reason why you are experiencing "broken concurrency", i.e. the sub tasks are not executed in parallel but sequentially.
You are not merely executing nested tasks; you are also using a very functional approach to programming in that you are passing the delayed-decorated function add to the driver task do_math. If I were to type-annotate that function, it would look like
def do_math(x: int, y: int, op: Delayed) -> list[int]:
    ...
The problem is that there is an ambiguity internally about instances of Delayed objects and the results of the functions the delayed objects represent. In short, we are unpacking/unwrapping all Delayed instances (e.g. your decorated function) such that the called function receives the wrapped object. I'm not entirely sure but I think this is not something we can change easily.
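To make the unwrapping concrete, here is a minimal sketch. The names make_value and inspect are made up for illustration, and the synchronous scheduler is used only to keep the run simple. It passes a Delayed into another delayed function and reports what the function body actually receives. The same check is then applied to the decorated add function; what that second call reports may depend on the Dask version, so no output is claimed for it.

```python
import dask
from dask.delayed import Delayed


@dask.delayed
def make_value():
    return 41


@dask.delayed
def add(x, y):
    return x + y


@dask.delayed
def inspect(arg):
    # Runs inside the task: report what the function body was actually handed.
    return type(arg).__name__, isinstance(arg, Delayed)


lazy_value = make_value()
assert isinstance(lazy_value, Delayed)  # lazy on the outside

# The Delayed argument becomes a dependency and is computed first,
# so the body of `inspect` sees the plain result rather than a Delayed.
received_value = inspect(lazy_value).compute(scheduler="synchronous")
print(received_value)

# Same check for the decorated function itself (hypothetical probe;
# the reported type is version-dependent, so it is only printed).
received_op = inspect(add).compute(scheduler="synchronous")
print(received_op)
```

If the second call reports that the argument is no longer a Delayed, then calling op(x, y) inside the driver runs the plain function immediately, which would explain why the sub tasks execute one after another.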
The intended way to define these nested calls is to call the delayed functions directly.
@dask.delayed
def do_math(x, y):
    results = []
    for _ in range(4):
        # `add` here is the decorated delayed function and we're calling it directly, not via the function argument
        results.append(add(x, y))
    return dask.compute(*results)
There is nothing wrong with defining nested computations (we recommend not doing it because this is an advanced approach where users should know what they are doing to avoid a bad experience) but the particular approach you chose is clashing with some of our internals.
from dask.
FWIW I agree that this is not nice, I'm just not sure if I can offer a fix. I'm looking into one thing right now but can't promise much...
@fjetter: Thank you so much for taking the time to reply. The tl;dr here is --- I completely understand and agree. That said, I will be thorough in my reply here for the sake of any future individual who reads this.
First of all, there is actually a minor bug in your code since you should be using dask.compute inside of the delayed driver function (do_math) as well, i.e.
Great call. You're absolutely right, although it is a bit clunky to have to inject a dask.compute call in what is otherwise a Dask-free function. But it is necessary. Some other workflow languages have a dedicated operator for this kind of thing, but with only the @delayed decorator in Dask, you are right --- one must call dask.compute(*results) inside the function.
The problem is that there is an ambiguity internally about instances of Delayed objects and the results of the functions the delayed objects represent. In short, we are unpacking/unwrapping all Delayed instances (e.g. your decorated function) such that the called function receives the wrapped object. I'm not entirely sure but I think this is not something we can change easily.
Indeed, for my original report here, that is the problem. Dask tries to intelligently unpack the Delayed object when, in this functional programming approach, it is meant to just be passed around as-is. It would certainly be nice to be able to support this mechanism, if you find a way.
For now, I am using a hack in my own code where I define a custom @delayed decorator (e.g. @my_delayed) that behaves as follows:
from dask import delayed
from functools import wraps

def my_delayed(_func, **kwargs):
    @wraps(_func)
    def wrapper(*f_args, **f_kwargs):
        return _func(*f_args, **f_kwargs)

    # Wrap the Delayed so Dask does not unpack it when it is passed as an argument
    return Delayed_(delayed(wrapper, **kwargs))

class Delayed_:
    """
    A small Dask-compatible, serializable object to wrap delayed functions
    that we don't want to execute
    """

    __slots__ = ("func",)

    def __init__(self, func):
        self.func = func

    def __reduce__(self):
        return (Delayed_, (self.func,))

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)
This "protects" my Delayed object when no arguments are supplied yet, but it is obviously very hacky.
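As a rough usage sketch of that wrapper (a condensed copy of Delayed_ is repeated so the block runs on its own, and the synchronous scheduler is an assumption made only to keep the example simple), the driver below checks whether calling op still yields lazy Delayed objects before computing them:

```python
import dask
from dask import delayed
from dask.delayed import Delayed


class Delayed_:
    """Condensed copy of the wrapper above: hides the Delayed from Dask."""

    __slots__ = ("func",)

    def __init__(self, func):
        self.func = func

    def __reduce__(self):
        return (Delayed_, (self.func,))

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)


def add(x, y):
    return x + y


# Equivalent in spirit to decorating `add` with @my_delayed.
protected_add = Delayed_(delayed(add))


@delayed
def do_math(x, y, op):
    results = [op(x, y) for _ in range(4)]
    # Record whether the calls produced lazy tasks or eager values.
    still_lazy = all(isinstance(r, Delayed) for r in results)
    return still_lazy, dask.compute(*results, scheduler="synchronous")


still_lazy, values = do_math(1, 2, protected_add).compute(scheduler="synchronous")
print(still_lazy, values)
```

Because Delayed_ is not itself a Dask collection, Dask should leave it alone when it appears as an argument, so the inner calls remain lazy until dask.compute gathers them.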
There is nothing wrong with defining nested computations (we recommend not doing it because this is an advanced approach where users should know what they are doing to avoid a bad experience) but the particular approach you chose is clashing with some of our internals.
Great! I am fine with it being for advanced users. Edge cases are expected.
As a side note, there is a related issue that I reported on the Discourse page. Namely, the following works:
import dask
import time

def job():
    print('Sleeping')
    time.sleep(5)
    return True

@dask.delayed
def get_n():
    return 5

@dask.delayed
def subflow(op, n):
    results = []
    for _ in range(n):
        results.append(dask.delayed(op)())
    return results

def workflow():
    n = get_n()
    return subflow(job, n).compute()

dask.compute(workflow())
However, the following breaks concurrency:
import dask
import time

@dask.delayed
def job():
    print('Sleeping')
    time.sleep(5)
    return True

@dask.delayed
def get_n():
    return 5

@dask.delayed
def subflow(op, n):
    results = []
    for _ in range(n):
        results.append(op())
    return results

def workflow():
    n = get_n()
    return subflow(job, n).compute()

dask.compute(workflow())
I can't quite tell if it's an unintuitive Dask behavior like before or not, but it's likely related to the above comments. Either way, there are likely better ways to achieve the intended logic regardless.
Some other workflow languages have a dedicated operator for this kind of thing, but with only the @delayed decorator in Dask, you are right --- one must call dask.compute(*results) inside the function.
Can you elaborate on what you are referring to here? I assume you are not talking about Python operators, are you?
I don't see a way around a collect/gather/compute method/function to get the results of multiple concurrently running tasks (e.g. asyncio is using asyncio.gather just like our Client API).
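For comparison, a minimal standard-library sketch of the asyncio pattern mentioned here. The add and do_math coroutines are illustrative stand-ins for the functions in this thread, not Dask code; the point is only that the driver still needs an explicit gather step to collect concurrent results.

```python
import asyncio
import time


async def add(x, y):
    # Stand-in for slow work; sleeping yields control to the event loop.
    await asyncio.sleep(0.2)
    return x + y


async def do_math(x, y):
    # Create four coroutines, then gather them so they run concurrently.
    return await asyncio.gather(*(add(x, y) for _ in range(4)))


start = time.perf_counter()
results = asyncio.run(do_math(1, 2))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s (four sequential calls would take about 0.8s)")
```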
@fjetter sure, I can explain. Basically, it is very nice to avoid injecting any Dask-specific logic in the underlying functions (other than the applied decorator). However, in the presented scenario, that's not possible, as you nicely highlighted with the return statement. Several other workflow tools have a dedicated "subworkflow" type of decorator that will take care of the gather operation automatically. A good example is the @join_app in Parsl, but there are other examples as well.
For Dask, a similar idea would be something like:
from dask import delayed
from dask.distributed import worker_client
from functools import wraps

def join_delayed(_func, **kwargs):
    @wraps(_func)
    def wrapper(*f_args, **f_kwargs):
        with worker_client() as client:
            futures = client.compute(_func(*f_args, **f_kwargs))
            return client.gather(futures)

    return delayed(wrapper, **kwargs)
Anyway, that comment was a bit tangential to the main point. Didn't mean to get off course.