nvidia / stdexec Goto Github PK

`std::execution`, the proposed C++ framework for asynchronous and parallel programming.

License: Apache License 2.0

C++ 85.84% CMake 1.59% Cuda 12.27% Python 0.09% C 0.21%

stdexec's Introduction

Senders - A Standard Model for Asynchronous Execution in C++

stdexec is an experimental reference implementation of the Senders model of asynchronous programming proposed by P2300 - std::execution for adoption into the C++ Standard.

Purpose of this Repository:

Provide a proof-of-concept implementation of the design proposed in P2300.
Provide early access to developers looking to experiment with the Sender model.
Collaborate with those interested in participating or contributing to the design of P2300 (contributions welcome!).

Disclaimer

stdexec is experimental in nature and subject to change without warning. The authors and NVIDIA do not guarantee that this code is fit for any purpose whatsoever.

Example

Below is a simple program that executes three senders concurrently on a thread pool. Try it live on godbolt!.

#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>

int main()
{
    // Declare a pool of 3 worker threads:
    exec::static_thread_pool pool(3);

    // Get a handle to the thread pool:
    auto sched = pool.get_scheduler();

    // Describe some work:
    // Creates 3 sender pipelines that are executed concurrently by passing to `when_all`
    // Each sender is scheduled on `sched` using `on` and starts with `just(n)` that creates a
    // Sender that just forwards `n` to the next sender.
    // After `just(n)`, we chain `then(fun)` which invokes `fun` using the value provided from `just()`
    // Note: No work actually happens here. Everything is lazy and `work` is just an object that statically
    // represents the work to later be executed
    auto fun = [](int i) { return i*i; };
    auto work = stdexec::when_all(
        stdexec::on(sched, stdexec::just(0) | stdexec::then(fun)),
        stdexec::on(sched, stdexec::just(1) | stdexec::then(fun)),
        stdexec::on(sched, stdexec::just(2) | stdexec::then(fun))
    );

    // Launch the work and wait for the result
    auto [i, j, k] = stdexec::sync_wait(std::move(work)).value();

    // Print the results:
    std::printf("%d %d %d\n", i, j, k);
}

Resources

Working with Asynchrony Generically: A Tour of Executors: Part 1 (Part 2) (Video): A comprehensive introduction to Senders and structured concurrency
What are Senders Good For, Anyway? (Blog): Demonstrates the value of a standard async programming model by wrapping a C-style async API in a sender
From Zero to Sender/Receiver in ~60 Minutes (Video): Live-coding a toy sender/receiver implementation from scratch
A Unifying Abstraction for Async in C++ (Video): A simple introduction to the concepts behind P2300
A Universal Async Abstraction for C++ (Blog): An introduction to Senders
A Universal I/O Abstraction for C++ (Blog): A look at how the Senders concepts interact with io_uring on Linux
Structured Concurrency (Video): An explanation of structured concurrency in C++ and its benefits
Executors: a Change of Perspective (Article): An article about the computational completeness of Senders
Structured Concurrency in C++ (Article): An article about how Senders manifest the principles of structured concurrency
Structured Networking in C++ (Video): A look at what a P2300-style networking library could look like
HPCWire Article: Provides a high-level overview of the Sender model and its benefits
NVIDIA HPC SDK Documentation: Documentation for the NVIDIA HPC SDK
P2300 - std::execution: Senders proposal to C++ Standard

Structure

This library is header-only, so all the source code can be found in the include/ directory. The physical and logical structure of the code can be summarized by the following table:

Kind	Path	Namespace
Things approved for the C++ standard	`<stdexec/...>`	`::stdexec`
Generic additions and extensions	`<exec/...>`	`::exec`
NVIDIA-specific extensions and customizations	`<nvexec/...>`	`::nvexec`

How to get `stdexec`

There are a few ways to get stdexec:

Clone from GitHub
- git clone https://github.com/NVIDIA/stdexec.git
Download the NVIDIA HPC SDK starting with 22.11
(Recommended) Use CMake Package Manager (CPM) to automatically pull stdexec as part of your CMake project. See below for more information.

You can also try it directly on godbolt.org where it is available as a C++ library or via the nvc++ compiler starting with version 22.11 (see below for more details).

Using `stdexec`

Requirements

stdexec requires compiling with C++20 (-std=c++20) but otherwise does not have any dependencies and only requires a sufficiently new compiler:

clang 13+
gcc 11+
nvc++ 22.11+ (required for GPU support). If using stdexec from GitHub, then nvc++ 23.3+ is required.

How you configure your environment to use stdexec depends on how you got stdexec.

NVHPC SDK

Starting with the 22.11 release of the NVHPC SDK, stdexec is available as an experimental, opt-in feature. Specifying the --experimental-stdpar flag to nvc++ makes the stdexec headers available on the include path. You can then include any stdexec header as normal: #include <stdexec/...>, #include <nvexec/...>. See godbolt example.

GPU features additionally require specifying -stdpar=gpu. For more details, see GPU Support.

GitHub

As a header-only C++ library, technically all one needs to do is add the stdexec include/ directory to your include path as -I<stdexec root>/include in addition to specifying any necessary compile options.

For simplicity, we recommend using the CMake targets that stdexec provides as they encapsulate the necessary configuration.

cmake

If your project uses CMake, then after cloning stdexec simply add the following to your CMakeLists.txt:

add_subdirectory(<stdexec root>)

This will make the STDEXEC::stdexec target available to link with your project:

target_link_libraries(my_project PRIVATE STDEXEC::stdexec)

This target encapsulates all of the necessary configuration and compiler flags for using stdexec.

CMake Package Manager (CPM)

To further simplify obtaining and including stdexec in your CMake project, we recommend using CMake Package Manager (CPM) to fetch and configure stdexec.

Complete example:

cmake_minimum_required(VERSION 3.14 FATAL_ERROR)

project(stdexecExample)

# Get CPM
# For more information on how to add CPM to your project, see: https://github.com/cpm-cmake/CPM.cmake#adding-cpm
include(CPM.cmake)

CPMAddPackage(
  NAME stdexec
  GITHUB_REPOSITORY NVIDIA/stdexec
  GIT_TAG main # This will always pull the latest code from the `main` branch. You may also use a specific release version or tag
)

add_executable(main example.cpp)

target_link_libraries(main STDEXEC::stdexec)

GPU Support

stdexec provides schedulers that enable execution on NVIDIA GPUs:

nvexec::stream_scheduler
- Single GPU scheduler that executes on the first available GPU (device 0)
- Defined in <nvexec/stream_context.cuh>
nvexec::multi_gpu_stream_scheduler
- Executes on all visible GPUs
- Defined in <nvexec/multi_gpu_context.cuh>

These schedulers are only supported when using the nvc++ compiler with -stdpar=gpu.

Example: https://godbolt.org/z/4cEMqY8r9

Building

stdexec is a header-only library and does not require building anything.

This section is only relevant if you wish to build the stdexec tests or examples.

The following tools are needed:

CMake
One of the following supported C++ compilers:
- GCC 11+
- clang 12+
- nvc++ 22.11 (nvc++ 23.3+ for stdexec from GitHub)

Perform the following actions:

# Configure the project
cmake -S . -B build -G<gen>
# Build the project
cmake --build build

Here, <gen> can be Ninja, "Unix Makefiles", XCode, "Visual Studio 15 Win64", etc.

Specifying the compiler

You can set the C++ compiler via -D CMAKE_CXX_COMPILER:

# Use GCC:
cmake -S . -B build/g++ -DCMAKE_CXX_COMPILER=$(which g++)
cmake --build build/g++

# Or clang:
cmake -S . -B build/clang++ -DCMAKE_CXX_COMPILER=$(which clang++)
cmake --build build/clang++

Specifying the stdlib

If you want to use libc++ with clang instead of libstdc++, you can specify the standard library as follows:

# Do the actual build
cmake -S . -B build/clang++ -G<gen> \
    -DCMAKE_CXX_FLAGS=-stdlib=libc++ \
    -DCMAKE_CXX_COMPILER=$(which clang++)

cmake --build build/clang++

Tooling

For users of VSCode, stdexec provides a VSCode extension that colorizes compiler output. The highlighter recognizes the diagnostics generated by the stdexec library, styling them to make them easier to pick out. Details about how to configure the extension can be found here.

stdexec's People

Contributors

Stargazers

Watchers

Forkers

meastp danra dietmarkuehl miscco npuichigo lucteo ysw1912 kshitij12345 gevtushenko kexianda villevoutilainen assafcohen kirkshoop huzifa-terkawi msimberg yaito3014 sithhell mfbalin havogt ericniebler lzhangzz williamspatrick satacker gonidelis huixie90 paulbendixen ccotter kyusic trxcllnt zyctree csukuangfj jhssugi 5l1v3r1 phoenixdigitalfx steve-downey cloudbee7 aurianer dongfengxin richardsonjf jrhemstad jamboree clownsw draugus cwharris bylee20 yuenxq gmh5225 benfrantzdale shonker maikel smallsunsun1 gavinljj masterghui rishabhrd yangrong688 lordliquid leehowes ipapadop ynsn yiwenxue autoantwort mahsanchez anrongwn botogoske samuelpmish runner-2019 bananatfx yafare bustercopley flowty heureka-shanghai pointkernel seanbaxter sourcegraph-ce igchor feiyunwill muertebt5 owl66 vasama rodman10 elicharlese muellan wsgeek normalzero gornishanov chalsliu paytonwu felixguendling skyline-v zelbok ajschmidt8 classic130 brugarolas dkolsen-pgi machine-milk lukester1975 noel1992 mhaseeb123 oliveiradan nickpdemarco

stdexec's Issues

{#intro-compare}: Phrasing of a sender being "bound to" a scheduler

The section 1.4 "What are the major design changes compared to P0443" has a dot-point:

Senders now advertise what scheduler, if any, they are bound to.

I thought the query this paper was adding was querying what scheduler the operation would complete on.
The semantics of a sender being "bound to" a scheduler is ill-defined IMO.

Consider changing to:

Senders can now advertise what scheduler they will complete on, if this is known in advance.

Use `get_underlying_executor` in the example for it

4.4 At the end of the example above, execution::get_underlying_scheduler(last) returns an object equivalent to sch.

The example should probably just use get_underlying_scheduler in some way.

{#design-senders}: Comparison of senders and futures

This section contains a the sentence "... unlike futures, the work that is being done to arrive at the values they will send is also directly described by the sender object itself"

I don't really understand what this sentence is trying to say.
How is this different from futures?

The tense of this sentence "work that is being done" seems to imply eagerness which is not necessarily the case for lazy senders, where the sender describes the work that will be done.

{#design}: Find better words for the two meanings of "algorithm"

Initially, we've been using the word "algorithm" indiscriminately to refer to any kind of a customizable function that accepts or returns senders. We have since introduces three classes of "algorithms" (using the original meaning of the word), as originally suggested by @brycelelbach:

sender factories, which are customizable functions that create senders without accepting senders as arguments;
sender adapters, which accept senders and return senders; and
sender algorithms, which consume senders without returning senders.

I like the split itself, as it provides helpful categories. I do not, however, like the name of the last category; I think we need to find a new name.

The reason for that is two-fold:

we need a collective name for factories, adapters, and algorithms; it comes up all over the place; and
"sender algorithms" is the phrase we have been using for those three categories collectively for years now, and I struggle to not just automatically say or type "algorithm" when I mean all three categories.

{#design-algorithm-start_detached}: `execution::start_detached` should take an optional allocator

As the implementation of execution::start_detached will most-likely need to heap-allocate storage for the operation-state, we should consider adding an overload that allows the caller to specify an optional custom allocator to allow them to control the allocation.

{#design-receiver}: Use of term "root of tree" and "leaf nodes" seems to have opposite meaning to my understanding of those terms

The section "Receivers serve as glue between senders" has a description of set_done which contains the sentence:

set_done, which signals that the predecessor requests that any work that has not started yet should not start, because it is not needed (this is cancellation, but propagating from the root of the execution tree down to its leaf nodes, as opposed to propagating from the leaf nodes up to the root - so not some_sender.cancel()).

The terminology used in this sentence for "root of the execution tree" and "leaf nodes" and the direction of propagation of the set_done signal is the opposite of the meaning I have been giving to those terms.

When I have been using the term "leaf node" or "leaf operation" I usually mean some operation that is at the lowest level - something that isn't composed of other operations. e.g. an individual read from a socket, a schedule() operation or a timer.

Similarly, the meaning I have been giving to the term "root of the tree" usually refers to the top-level sender in an expression-tree of senders.

And that with cancellation, a request for an operation to stop propagates in the direction from the root node towards leaf nodes. And that the response to the cancellation (i.e. an operation completing with set_done) propagates in the direction from leaf nodes towards the root node.

However, the description you have for set_done seems to be describing signals travelling in the opposite direction. i.e. from the root to the leaf nodes and I found this quite confusing.

I gather the difference is that I have been describing them from an expression-centric view of the computation (i.e. taking the composition of senders as a kind of AST), whereas the description here has taken an execution-centric view of the computation.

It would be good to use a consistent terminology here.

As an aside: I recall having much confusion in discussions with @kirkshoop about Rx vs coroutines/senders as we used the terms "up" and "down" with opposite meanings and so we ended up settling on use of terminology described in terms of towards root or towards leaf operations to make it more obvious which direction we were describing. But now I see that you also seem to have an opposite interpretation of that term...

Remove when_any

Too much of a debate over cancellation and detaching. It isn't vital, we should drop from this paper and add later.

Some debate over whether it should be detaching or not.

std::this_thread::sync_wait should not be pipeable

We've found algorithms like this to be confusing in pipe syntax. It also doesn't return a sender.

6.1 example then would be:

auto [result] = std::this_thread::sync_wait(execution::schedule(get_thread_pool())
| execution::then([]{ return 123; })
| execution::transfer(cuda::get_scheduler())
| execution::then([](int i){ return 123 * 5; })
| execution::transfer(get_thread_pool())
| execution::then([](int i){ return i - 5; }));

{#design-adapter-transfer}: Rename `transfer`

This is like via but returns a sender that returns sch for get_completion_scheduler.

It can also optimise internally if it knows what scheduler the incoming sender completes on.

Interestingly, though, we could build on issue 4 here: if the incoming sender is strictly lazy, the returned sender here might also be strictly lazy, but we always provide a completion scheduler here because that is the definition of a transition.

This would mean we don't necessarily accidentally introduce eagerness, but we do always inject the information we want.

{#design-propagation}: Update phrasing of associated/bound scheduler to be consistent with `get_completion_scheduler` semantics

"Section 4.4 - senders may be bound to schedulers" is still using some phrasing such as:

"bound to" in the section title
"... a standardized way for senders to advertise what scheduler (and by extension - what execution context) any work (save for explicit transitions) attached to them will execute on."

These should be revised to be consistent with the get_completion_scheduler semantics which only really refers to the execution context that the sender will call set_value() on (i.e. the completion context) not necessarily what context the sender will do (most of) it's work on.

e.g. I might have a sender, s, that does all of its work on a thread-pool and then wrap this in a complete_on(s, mainThread). This sender would report a get_completion_scheduler() of mainThread but is not technically the context that the operation "will execute on".

The term "predecessor" isn't always appropriate to describe an input to an algorithm

I've noticed in a few places that the paper uses the term "predecessor" to describe a sender input to an algorithm.
However, for algorithms that are designed to control the launch of sender operations I don't think the term "predecessor" is appropriate.

Some examples of algorithms where the use of 'predecessor' to describe input senders is less appropriate:

In the the on(sender, scheduler) algorithm I'm not sure it makes sense to call the sender a 'predecessor' since the 'on' operation technically starts before the input sender does. The input sender only starts started after the schedule() operation that transitions to the scheduler's context has completed.

Also, in the sequence(senders...) algorithm each of the input senders is only started after the prior senders have completed successfully. While these input senders define a sequence of predecessors relative to each other, I'm not sure it makes sense to call one of these input senders the predecessor of the sequence operation.

Even in the case of then(sender, func) the sender is the predecessor of the execution of func but is not necessarily the predecessor of the sender returned from then(), as in the lazy form, the execution of the input sender does not actually start until the then() sender is started. The execution of the input sender operation is nested within the execution of the then-operation.

I think we need to be careful about how we describe the relationship of 'predecessor' arguments and the returned sender. The 'predecessor' argument is not necessarily a predecessor of the returned sender but rather of the operation performed by the returned sender. I hope this subtle distinction makes sense.

Example usages from the paper that don't quite feel right.

A sender factory is a function which creates new senders without requiring a predecessor sender.

The strictly lazy versions of the adapters below (that is, all the versions whose names start with lazy_) are guaranteed to not start any predecessor senders passed into them.

we propose to expose both sync_wait, which is a simple, user-friendly version of the algorithm, but requires that value_types have only one possible variant, and sync_wait_with_variant, which accepts any sender, but returns an optional whose value type is the variant of all the possible tuples sent by the predecessor sender:

In these cases, I think a more generic term would be input sender rather than predecessor sender, which, to me, implies something about the sequencing of the execution of that particular sender with respect to a particular operation which is not necessarily applicable to these generic contexts.

Further, in the section "Receivers serve as glue between senders"

set_value, which is the moral equivalent of an operator() or a function call, which signals successful completion of its predecessors;

set_done, which signals that the predecessor requests that any work that has not started yet should not start...

This use of predecessor presumably refers to the sender to which the receiver was connected, which is not the usual term we've been using for that. Usually when referring to the completion signals called on a receiver we talk about "the operation completing with set_value/set_error/set_done". Where the operation is the execution of logic associated with a given operation-state returned from calling connect() on a sender and this particular receiver.

{#design-propagation}: Should all `schedule` implementations be required to customize `get_completion_scheduler`?

Do we need to require that all implementations of senders returned by schedule() customisations will also customise the get_completion_scheduler() CPO to return that same scheduler used to create them? Are there any algorithms that might require/assume that to be the case?

Should the existence of this query on the sender be part of the scheduler contract?

{#design-senders}: "senders describe work" - be more precise

Some disagreement in the call today about exactly what the definition of a Sender is.

sync_wait should provide a scheduler

sync_wait should say something (for the receiver) like "get_scheduler(r) returns an implementation-defined scheduler that is driven by the waiting thread such that scheduled tasks run on the thread of the caller."

That makes sync_wait provide a real scheduler, and give it a real forward progress guarantee (how we word that aspect is an open question).

Incorporate cancellation wording (P2175)

We agreed to incorporate cancellation from P2175, so we need to make that happen.

execution::let

On a previous call we agreed that let was an important construct to keep things alive while running a sender-returning task.

Should we include that as a core building block in the first version?

Add `sender_of<S, T>` concept

I want to be able to write an algorithm that only accepts senders that send a particular type, e.g. something like this:

std::execution::sender_of<std::span<std::byte>> auto
async_read(std::execution::sender_of<std::span<std::byte>> auto buffer, handle_type handle);

How does this approach lead to functioning networking?

First things first; the Networking proposal is paying, at best, lip service to P0443. io_services are not P0443 executors, they are far more than that, they combine errors and successful results into one channel that's provided into the callbacks the user passes into the work-submission.

We have lofty ideas how to senderandreceiverify Networking, but those ideas are, proposal-wise, pipe dreams, especially when we start talking about how to do chunked downloads with resumed downloads if an earlier download fails. I wonder how this approach would lead to being able to do that with Networking, and when this proposal would lead to being able to do that with Networking.

I also wonder how compatible that would be with how ASIO does things, and why the standardization of Networking should even consider waiting for it, because frankly, none of this executors and senders and receivers work has materially changed how Networking really does its job, nor has any of this work really provided anything useful for Networking, other than some basic concepts that we hope to extend to be useful for Networking's needs. Asynchronous streams? Great, when will Networking be able to use them? Schedulers that can actually schedule networking work, and do the various staged-chained operations mentioned in the other issue talking about executors? Great, when will Networking be able to use them?

In other words, why should Networking pretend to remain to be coupled with Executors, whatever they are, when it really currently isn't except on a very superficial second-class-citizen level, and can't be if it wants to ship in a reasonable time frame?

Provide a more precise definition for get_completion_scheduler than present in the user-facing design section

Give precise definition. Maybe:
if the sender completes with set_value, it will call set_value on the returned scheduler.

Add async read example

Terminal non-blocking operations and destructors of objects owning execution resources those operations use

We discuss the issue of coordinating terminal non-blocking operations like submit or execute with running the destructors of objects owning execution resources those operations use.

{#design-propagation}: Should `sender_with_completion_scheduler` be a concept?

Therefore, we propose that when at least one of the sender arguments to the joining algorithms has a completion scheduler, a user must also provide an explicit scheduler argument, which describes the scheduler that the returned sender will be bound to.

Concepts come from algorithms. It sounds like we have an algorithm that wants a concept of senders that have schedulers bound to them, e.g. the second when_all overload that doesn't take a scheduler:

execution::sender auto when_all(
    execution::scheduler auto sched,
    execution::sender auto ...predecessors
);

execution::sender auto when_all(
    execution::sender auto ...predecessors
) requires (have-completion-schedulers<predecessors...>);

Maybe we should add such a concept.

{#design-transitions}: Not clear what "execution context transitions must be explicit" means

This section states:

We propose that, for senders advertising their completion scheduler, all execution context transitions must be explicit; running user code anywhere but where they defined it to run must be considered a bug.

What does it mean here that "execution context transitions must be explicit"?

I would have thought that the requirement that if a sender reports a scheduler from get_completion_scheduler() then if it completes with set_value that it actually does so on a context associated with that scheduler would be sufficient here.

The algorithm execution::transfer seems to be describing the algorithm that ensures that the input operation actually completes on the provided context.

Note that it's not clear in the description of transfer in section 4.9.1 what it does if the predecessor completes with set_error/set_done - does it always transition or only if it completes with set_value?

I can imagine wanting to be able to compose senders that do report a completion scheduler with algorithms that are not explicit about where they will complete (at least not in the sense of customising get_completion_scheduler). e.g. the when_all() algorithm would be such an algorithm

Can we be a bit more precise about what is meant by "explicit context transitions" here?

{#design-schedulers}: Describing future intent for scheduler extensions

Section 4.2 describing schedulers just talks about them being solely in terms of the schedule() operation, which is indeed their primary defining operation.

However, this does not fully convey the intent that we intend to later extend the scheduler concept with more operations to create subsumptions where the associated execution context supports a wider set of operations.
e.g. a time_scheduler concept that adds schedule_at, schedule_after and now operations in addition to schedule (which would become equivalent to schedule_after(s, 0s).
Or a file_scheduler concept that adds an open_file_read_only() and open_file_read_write() operations where the read/write operations for those files will complete on that scheduler's context.

Do we want to add at least some paragraphs in here to describe the intent for execution contexts to be able to use the scheduler type for providing access to a wider array of operations?

{#design-propagation}: We talk about existing sender strategies ending up in one of 2 situations - are we missing a 3rd?

This section describes existing design situation of senders not having access to scheduling context information as ending up in one of two situations:

trying to submit work to CPU execution contexts (e.g. a thread pool) from an accelerator (e.g. a GPU), which assumes that the accelerator threads of execution are as capable as the CPU threads of execution (which they aren’t); or
forcibly interleaving two adjacent execution graph nodes that are both executing on an accelerator with glue code that runs on the CPU; this operation is prohibitively expensive on runtimes such as CUDA.

I think there is also a third strategy that should probably be mentioned here:

Having to customise every fundamental algorithm to avoid GPU<->CPU transitions
Where fundamental algorithm is defined as any algorithm whose implementation defines its own receiver/operation-state and calls execution::connect on other senders rather than simply being a composition of other sender algorithms.

The transition back to CPU is generally necessary for calling a given receiver and so you'd need to customise any algorithm that made use of receivers internally to instead use a GPU-native signalling strategy.

I'm not sure we've sufficiently explored this space to enumerate the set of algorithms that would need to be customised to determine whether this is viable or not.

{#examples}: Add more examples

We need to have more examples. Here's a list of some important use cases:

{#design-propagation}: `get_completion_scheduler` equivalent for `set_done`/`set_error`

The current description of get_completion_scheduler() seems like it only describes the context that set_value() will be called on and says nothing about where the set_error or set_done completion signals might be called.

However some senders may be able to provide a stronger guarantee, e.g. that all of its completion signals will be delivered on a context associated with the specified scheduler, and some algorithms may be able to take advantage of that.
e.g. transfer_when_all may be able to avoid rescheduling onto the context if the last operation to complete completes with set_error and its get_completion_scheduler() query indicates that set_error is already being called on the desired completion context.

Should we also provide another query that allows senders to customise it to report that they are providing a stronger guarantee?

What about cases where the completion scheduler is different for different completion signals?
e.g. set_value completes on s1 and set_error completes on s2?
Should there be a separate query to allow querying which context a particular signal will be delivered on?

Maybe we could have get_completion_scheduler_for<CPO>(sender auto) and then have get_completion_scheduler(sender auto) be used for the case where all of the completion signals will complete on that context. The get_completion_scheduler_for<CPO>() queries could default to calling get_completion_scheduler() if that is defined.

Possibly remove `when_any`

The semantics may not be sufficiently clear, especially with respect to cancellation.

{#design-schedulers}: What does "execution contexts don't necessarily manifest in C++ code" mean?

4.2 Since execution contexts don’t necessarily manifest in C++ code, it’s not possible to program directly against their API.

I'm not sure what this sentence means, and I'm not sure it's important. The distinction between scheduler and execution context is that scheduler is a lightweight, non-owning handle.

Replicating the P0443 executor error handling model

With a P0443 executor, I can program with a model where I have an executor that emits errors at the point of task submission, and none later. Because that's all that executors can do - they can emit errors when a task is submitted, but after that they have to just invoke the task. Now, if we go to schedulers&senders&receivers only, and I invoke std::execution::submit,

does that mean that submit() will never throw? ..and by extension, that..
..any errors, even if they are task-submission errors before the work is actually scheduled, are sent to the set_error channel of the receivers of any sender+receiver+operation_state?

do we plan to provide other error-managing algorithms than the one that's sorta kinda baked into submit(), which is "uh oh just terminate"?

The question seems relevant to me both from the point of view of users of all of this and authors of algorithms and authors of submit() customization points.

In some ways, it seems to me that the executor concept is still lurking inside this, we're just not naming it. As some wise people have suggested, concepts arise out of algorithms. Using a "no errors after submission" scheduler/execution context with a task that doesn't handle the three-channel API of a receiver, or using said scheduler/execution context with a task combined with a particular error handling algorithm, especially one that just terminates, seems to be such an algorithm (as a whole, not in the "sender algorithm" sense), and suggests that the executor concept is still there. Using such a concept has some plausible use cases, like an execution context and the related scheduler that just spawn threads until resource allocation fails, so they'll never emit errors after submission and before scheduling, because those are just one thing.

Add GUI-centric example

{#design-propagation}: `get_completion_scheduler` and type-erased senders

What do we want to do with type-erased sender wrappers, like an any_sender type, that want to be able to support but not require that wrapped senders have a get_completion_scheduler customisation.

There is no default value we can return from get_completion_scheduler in the case that the wrapped sender does not provide a get_completion_scheduler. With non-type-erased senders we can just use SFINAE to detect whether or not the sender provides a get_completion_scheduler, but for a type-erased sender this may be a runtime check rather than a compile-time check.

Should this query be returning a dummy scheduler that we can check the return value for? Or perhaps return the scheduler wrapped in a std::optional?

Should we have a meta-CPO for run-time queries of maybe-supported queries?
e.g. have a try_query<get_completion_scheduler>(sender) -> std::optional<any_scheduler> operation.

Where try_query<CPO> does the check to see if CPO is callable with the arguments and if so calls it and returns its result and otherwise returns std::nullopt.

get_scheduler

get_scheduler currently used instead of get_completion_scheduler in various places.
cuda::get_scheduler() may be helpfully renamed if we want the receiver query to be get_scheduler(r).
We should add the get_scheduler query for receivers as something that:
a) on defines as one of the valid CPOs on its receiver, along with set_value etc.
b) something that is propagated by all algorithms such that if the receiver passed to the algorithm defines it, the receiver the algorithm passes upstream defines it.

That is important because otherwise:
sync_wait(on(sched, just(3) | bulk(...));

does not carry the information to tell bulk where to schedule its work. It starts bulk on sched, but without backchanelling information about the scheduler bulk can't ask where to schedule more work.

Presumably we'll be adding a stop_token query similarly once cancellation support merges in.

Should `then` dispatch to `lazy_then` as the spec says?

https://github.com/brycelelbach/wg21_p2300_executors/blob/main/executors.bs#L1544

Is this a typo or intentional? It currently says the last thing that then tries to dispatch to is lazy_then. If it's intentional, can someone explain?

Examples of chaining and transfer with I/O and complex object lifetimes

On various occasions, I have tried to understand how P0443 schedulers and senders provide something roughly along the lines of the following:

launch an asynchronous operation that opens a file (well, socket), and produces a file descriptor
somehow "pipe" that to another asynchronous operation that reads data from the file, and produces that data to its consumers

The current approach looks like I might more easily wrap my head around how to do that, but I may be wrong. With P0443, it was all too easy to think "oh, I need a receiver that also acts as a sender, so I'll receive the descriptor from one sender and then act as a sender that initiates the data read". In this model, I might envision starting work on a scheduler to open the file, and then transfer that to another scheduler (most likely actually the same scheduler) and then piping the descriptor into the new-work, which would then produce different values/results.

But I have no idea whether that's the correct train of thought. Can we have an example of how to do the example described above?

There are different possible examples here that would be quite helpful. Suppose I have async work that sends a HTTP HEAD request to a web server, and produces an object of a class type describing the response of the request, and then I want to "pipe" that into subsequent work that examines that response object, and if successful, initiates more work to actually HTTP GET some data out of it, and then pipes that into subsequent work that actually reads the response body of the GET request.

And namely

Examples where all this is done with different (C++) types of data produced by the different pieces of work
Also examples where some of that data is shared; either by refcounted objects, or by Out-of-Band shared data that the different stages of work know about. Fine, chances are that they know about it by virtue of references or reference-semantic objects referring to it being passed through the "pipelines", but still.

{#design-senders}: Usability concerns with use of `execution::submit`

The example code snippet in section 4.3 uses the function execution::submit() to effectively launched a detached computation. This is equivalent to the std::thread::detach() method in that there is no generic way to then safely wait for that operation to complete to be able to safely shutdown or release its resources.

To be able to later join this operation the user will probably need to store some object whose destructor performs some synchronisation or manually attach a finally() algorithm to the input sender to do some synchronisation when the operation completes.

I am worried that the existence of this sort of algorithm in the standard library will encorourage the creation of detached computation that can not be joined in a structured way.

I am also worried that the existence of detached computations will force execution contexts to keep track of outstanding work and give some other way of joining all work that is scheduled on them during shutdown - adding runtime overhead to those implementations and also complications like how to tell when all work that will be scheduled as actually been scheduled (an empty queue is not necessarily a reliable indicator that no more work will be scheduled on a given context).

In libunifex we have a submit() CPO, but it instead takes a receiver that will receive the results of the operation when it completes, allowing the operation to still be joined. And we don't really have much usage of submit() within libunifex - it is only used within the implementation of via() for untyped senders (something libunifex doesn't really support very well anyway) and within execute() which also has similar challenges with ensuring work is joined.

I would prefer to see an API for launching work that still provides some way for joining that work at a later point in time. e.g. in libunifex we have async_scope::spawn() for doing this which allows later joining all work spawned using that async_scope object.

{#design-fork}: Clarify different flavours of multi-shot senders

The phrasing of section 4.6 seems to imply that all multi-shot senders have the forking semantics. i.e. that you can connect them multiple times to have multiple receivers receive the same result.

However, there are multiple possibilities here which I think should be called out explicitly.
Many of the purely-lazy senders will support being multi-shot-connectable but doing so will result in the sender having multiple independent copies of that operation launched, each potentially starting from scratch.

For example, the retry_when() algorithm in libunifex relies on this semantic to be able to retry a given input operation many times if it fails.

However, for an eagerly running operation, the natural thing for supporting multi-shot connect is to have the operation only execute once (it's already running and there is no easy way to start the computation again) and have multiple receivers each receive the same result from that single operation.

I think this section should call out these different interpretations for multi-connect senders explicitly and not leave the reader under the assumption that this always means "forking" a computation - it might mean duplicating a computation.

Do we need to tie is_lazy and get_completion_scheduler

Clearly the two together represents valuable information.

However, take a trivial algorithm implementation based on what is in the paper like:

auto eager_erasing_then(s1, foo) {
return unschedule(then(s1, foo));
}

so this is not lazy, in that foo may well have started eagerly if s1 was eager and had a completion scheduler and then specialised on that information.

However eager_erasing_then is eager but returns a scheduler that explicitly erases its completion scheduler. It should still have is_strictly_lazy be false, but get_completion_scheduler should not be defined.

Should `strictly_lazy_sender` be a concept?

Do we have algorithms that plan to be constrained on whether execution::sender_traits<Sender>::is_strictly_lazy is true? Perhaps we should introduce a concept for this.

"Senders are joinable" section still mentions that when_all requires predecessor senders not have a completion scheduler

With the splitting of when_all into when_all/transfer_when_all I don't think this constraint is required any more.

Can we apply the following change:

The when_all algorithm ~~requires that none of the predecessor senders have a completion scheduler, and~~ returns a sender that also does not have an associated scheduler.

Clarifying the use case for strictly_lazy and underlying_scheduler

I think we should be clear precisely when these are to be used and what they mean, maybe that way we can get to better naming and double check if we need both.

Specifically, think of two cases and some algorithm sort(sender, cmp) -> sender.

As i understand it, we have a few things we need from this information:

We need to know if sort will run eagerly or not. This is a correctness question for the caller. Essentially we can think of this as "will the work represented by sort potentially have started by the time sort returns".
We need to be able to decide to eagerly enqueue work for GPU-like execution scenarios.
We need to know what scheduler work will complete on, so that type erased incoming senders can be paired with subsequent work.

2 seems solvable entirely using customisation. So: sort(nvidia_sender, cmp) -> nvidia_sender would trigger eager work, using whatever method is available on an nvidia_sender to access the cuda streams and other state.

3 is for when we need to type erase the nvidia_sender and just know its completion scheduler. I can see how we can gain certain transitional efficiency here but I'm not sure I see how under type erasure we can do more than that. So if the incoming sender is type erased, and only forwards get_underlying_scheduler, we can't guarantee that the incoming sender enqueued the work, so we can't rely on any in-order queue. All we can do is know whether we need to enqueue work back onto a CPU to do transitions or not.

1 is a question for the algorithm. With the current plan it looks like we will make the eagerness decision for an algorithm based on whether the passed type exposes the eager flag. I think we should document in this case that even if the algorithm triggers a full customisation off the type, it should obey that eagerness query, so we can be confident of the behaviour. If the return type says it is potentially eager, then we know the algorithm acted in an eager fashion on that eager argument. It might be subtle, but it is at least detectable.

More fundamentally, is the assumption here simply that if the parameter is lazy, clearly the algorithm cannot be eager because it has to wait for the input anyway. If the parameter might be eager, the algorithm can be eager.

In a type erased scenario, even if the parameter is potentially eager, and we can get the scheduler from it, we can't know if it was already enqueued. So presumably we only actually get eagerness when we do customise the algorithm. Or is there another approach nvidia intends to use here?

Bifurcation of `when_(all|any)` seems likely to cause confusion in generic contexts.

Therefore, we propose that when at least one of the sender arguments to the joining algorithms has an underlying scheduler, a user must also provide an explicit scheduler argument, which describes the scheduler that the returned sender will be bound to.

How do you know if your senders have an underlying scheduler in a generic context?

We don't have a separate concept for such senders (but see #19). Maybe we need one.

If I'm in a generic context, I could imagine it not being clear to me if I have senders that have underlying schedulers or not.

I wonder if perhaps these two overloads should have different names.

Introduce separate strictly lazy and maybe eager variants of algorithms

As per our discussion on the 2021-05-26 telecon, we agreed to introduce separate strictly lazy and maybe eager variants of algorithms. The placeholder spelling of the strictly lazy variants will be lazy_*, and the maybe eager variants will be *. For example, lazy_then and then.

Interactions with monadic optional

This brings to mind two thoughts.

Bikeshedding. It uses transform, and_then, or_else. I wonder to what degree we will end up discussing alignment on terminology.
I wonder if it is cleaner for us to follow p2300 with a paper to define connect and start for std::optional and get the sender algorithms on optional for free. It would give us co_await on an optional for free too, once co_await sender is defined for the task type, though not coroutine optionals in general.

when_all with a scheduler

Separate algorithm to avoid potential overload ambiguity with a sender+scheduler.

when_all_on potentially, for consistency with just_on.

Consider alternate API designs for `sender_traits`

I find the whole design of sender_traits a bit questionable and murky. I'd like to see it fleshed out more. I'm not convinced a monolithic traits class is the right answer here.

{#design-factory-schedule} When does the sender returned by `schedule` complete?

schedule essentially returns a "ready" thing. Can we say anything to that effect?

Semantics of is_strictly_lazy

Unclear precisely what this should apply to.

Given:
auto s1 = some_sender_factory();
auto s2 = then(s1, foo);

It makes sense to ask if s2's type is strictly lazy, to know that foo might be already running before s2 is connected.

It also makes sense to ask the same of s1, and then might use that information to decide if it should be eager.

We might also want a query on the operation state to know if work was in some way enqueued after connect was called.

Is the intent that or different?

Michal expressed concern about ABI issues with this formulation - assuming we didn't talk past each other.

Story for non-blocking submission

We need a good answer to the Networking TS requirement for executors that don't block when they submit.

One possible solution would be to say "things that require that should define a trait, and then constrain on the trait".

{#intro-prior-art}, {#design-receivers}: Forward or undefined references to "callback" and "channel"

5.2 A receiver is a callback that supports more than one channel

What is a callback? What is a channel? This is not a good definition for the uninitiated.