Comments (5)

fjetter commented on June 5, 2024

Thanks for opening this issue.

First of all, there is actually a minor bug in your code: you should be using dask.compute inside the delayed driver function (do_math) as well, i.e.

import dask, time

@dask.delayed
def add(x, y):
    time.sleep(2)
    return x + y

@dask.delayed
def do_math(x, y, op):
    results = []
    for _ in range(4):
        results.append(op(x, y))
    return dask.compute(*results)

The reason this isn't strictly necessary in your current code is also the reason you are experiencing "broken concurrency": the sub-tasks are not executed in parallel but sequentially.

You are not merely executing nested tasks; you are also taking a very functional approach by passing the delayed-decorated function add into the driver task do_math. If I were to type-annotate that function, it would look like

def do_math(x: int, y: int, op: Delayed) -> list[int]:
    ...

The problem is that there is an internal ambiguity between Delayed instances themselves and the results of the computations those Delayed objects represent. In short, we unpack/unwrap all Delayed instances (including your decorated function) so that the called function receives the wrapped object. I'm not entirely sure, but I think this is not something we can change easily.
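
To make that concrete, here is a small illustrative snippet (building on the code above, not taken from your report): if you inspect op inside do_math, the Delayed wrapper is already gone by the time the task runs.

@dask.delayed
def do_math(x, y, op):
    # `op` arrives here already unwrapped: it is the plain `add` function,
    # not a Delayed, so each op(x, y) call runs eagerly and sequentially.
    print(type(op))  # <class 'function'>
    results = []
    for _ in range(4):
        results.append(op(x, y))
    return dask.compute(*results)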

The intended way to define these nested calls is to call the delayed functions directly.

@dask.delayed
def do_math(x, y):
    results = []
    for _ in range(4):
        #  `add` here is the decorated delayed function and we're calling it directly, not via the function argument
        results.append(add(x, y))
    return dask.compute(*results)
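
For completeness, the driver is then launched the same way as before (a sketch; whether the inner tasks actually run in parallel depends on which scheduler executes the nested dask.compute):

dask.compute(do_math(1, 2))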

There is nothing wrong with defining nested computations (we recommend against it only because it is an advanced approach, and users should know what they are doing to avoid a bad experience), but the particular approach you chose clashes with some of our internals.

fjetter commented on June 5, 2024

FWIW I agree that this is not nice; I'm just not sure I can offer a fix. I'm looking into one thing right now but can't promise much...

Andrew-S-Rosen commented on June 5, 2024

@fjetter: Thank you so much for taking the time to reply. The tl;dr here is --- I completely understand and agree. That said, I will be thorough in my reply here for the sake of any future individual who reads this.

First of all, there is actually a minor bug in your code since you should be using dask.compute inside of the delayed driver function (do_math) as well, i.e.

Great call. You're absolutely right, although it is a bit clunky to have to inject a dask.compute call in what is otherwise a Dask-free function. But it is necessary. Some other workflow languages have a dedicated operator for this kind of thing, but with only the @delayed decorator in Dask, you are right --- one must call dask.compute(*results) inside the function.

The problem is that there is an ambiguity internally about instances of Delayed objects and the results of the functions the delayed objects represent. In short, we are unpacking/unwrapping all Delayed instances (e.g. your decorated function) such that the called function receives the wrapped object. I'm not entirely sure but I think this is not something we can change easily.

Indeed, for my original report here, that is the problem. Dask tries to intelligently unpack the Delayed object when, in this functional programming approach, it is meant to just be passed around as-is. It would certainly be nice to be able to support this mechanism, if you find a way.

As for right now, I am using a hack in my own code where I define a custom @delayed decorator (e.g. @my_delayed) that behaves as follows:

from dask import delayed
from functools import wraps

def my_delayed(_func, **kwargs):

    @wraps(_func)
    def wrapper(*f_args, **f_kwargs):
        return _func(*f_args, **f_kwargs)

    return Delayed_(delayed(wrapper, **kwargs))

class Delayed_:
    """
    A small Dask-compatible, serializable object to wrap delayed functions
    that we don't want to execute
    """

    __slots__ = ("func",)

    def __init__(self, func):
        self.func = func

    def __reduce__(self):
        return (Delayed_, (self.func,))

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)

This "protects" my Delayed object when no arguments are supplied yet, but it is obviously very hacky.

There is nothing wrong with defining nested computations (we recommend not doing it because this is an advanced approach where users should know what they are doing to avoid a bad experience) but the particular approach you chose is clashing with some of our internals.

Great! I am fine with it being for advanced users. Edge cases are expected.

As a sidenote, there is a related issue that I reported on the Discourse page. Namely, the following works:

import dask
import time

def job():
    print('Sleeping')
    time.sleep(5)
    return True

@dask.delayed
def get_n():
    return 5

@dask.delayed
def subflow(op, n):
    results = []
    for _ in range(n):
        results.append(dask.delayed(op)())
    return results

def workflow():
    n = get_n()
    return subflow(job, n).compute()

dask.compute(workflow())

However, the following breaks concurrency:

import dask
import time

@dask.delayed
def job():
    print('Sleeping')
    time.sleep(5)
    return True

@dask.delayed
def get_n():
    return 5

@dask.delayed
def subflow(op, n):
    results = []
    for _ in range(n):
        results.append(op())
    return results

def workflow():
    n = get_n()
    return subflow(job, n).compute()

dask.compute(workflow())

I can't quite tell whether this is the same kind of unintuitive Dask behavior as before, but it is likely related to the comments above. Either way, there are probably better ways to achieve the intended logic regardless.

fjetter commented on June 5, 2024

Some other workflow languages have a dedicated operator for this kind of thing, but with only the @delayed decorator in Dask, you are right --- one must call dask.compute(*results) inside the function.

Can you elaborate on what you are referring to here? I assume you are not talking about Python operators, are you?
I don't see a way around a collect/gather/compute method/function to get the results of multiple concurrently running tasks (e.g. asyncio uses asyncio.gather, just like our Client API).
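
For comparison, the asyncio version of the same shape (purely illustrative) needs the explicit gather in exactly the same place:

import asyncio

async def add(x, y):
    await asyncio.sleep(2)
    return x + y

async def do_math(x, y):
    # The explicit gather is the step that collects the concurrently
    # running coroutines into a list of results.
    return await asyncio.gather(*(add(x, y) for _ in range(4)))

print(asyncio.run(do_math(1, 2)))  # [3, 3, 3, 3]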

Andrew-S-Rosen commented on June 5, 2024

@fjetter sure, I can explain. Basically, it is very nice to avoid injecting any Dask-specific logic into the underlying functions (other than the applied decorator). However, in the presented scenario that's not possible, as you nicely highlighted with the return statement. Several other workflow tools have a dedicated "subworkflow" type of decorator that takes care of the gather operation automatically. A good example is the @join_app in Parsl, but there are other examples as well.

For dask, a similar idea would be something like:

from dask import delayed
from dask.distributed import worker_client
from functools import wraps

def join_delayed(_func, **kwargs):
    @wraps(_func)
    def wrapper(*f_args, **f_kwargs):
        # Run the wrapped function, submit whatever collection of Delayed
        # objects it returns from within the running task, and block until
        # the results are available.
        with worker_client() as client:
            futures = client.compute(_func(*f_args, **f_kwargs))
            return client.gather(futures)

    return delayed(wrapper, **kwargs)
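
Usage would then look roughly like this (an untested sketch; it assumes a running distributed Client, since worker_client only works inside a task executing on a worker):

import dask
from dask.distributed import Client

@dask.delayed
def add(x, y):
    return x + y

@join_delayed
def do_math(x, y):
    # Returns a list of Delayed objects; the wrapper computes and gathers
    # them on the worker, so no dask.compute is needed in the body.
    return [add(x, y) for _ in range(4)]

if __name__ == "__main__":
    client = Client()
    print(dask.compute(do_math(1, 2)))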

Anyway, that comment was a bit tangential to the main point. Didn't mean to get off course.
