
SYCL Open Source Specification

This repository contains the source markup used to generate the formal SYCL specifications found on https://www.khronos.org/sycl/.

If you are proposing a merge or pull request to the specification, this README describes how the specification HTML and PDF targets can be built. Proposed changes must successfully build these targets before being considered for inclusion by the SYCL Working Group.

Building the SYCL specification

Building using GitHub CI

When using our GitHub repository, pushing to a branch will trigger the GitHub Actions CI script under ../.github/workflows/CI.yml to build HTML and PDF versions of the spec. To see the outputs, click on the Actions tab in the top navigation bar, or go to https://github.com/KhronosGroup/SYCL-Docs/actions . Your commit should show up in the list of commits there; click through to the Actions summary for that commit.

On success, a green checkmark will appear by the commit name on this page. Output artifacts can be downloaded as a zipfile from the Artifacts section near the bottom of the page.

On failure, a red x will appear by the commit name. Click through to the build job and it should auto-scroll to the CI log showing where the build failed. Fix it and push a new commit to try again.

Note that to read the HTML specification correctly, with all the mathematical symbols, you also need the katex directory alongside the html one. This might not be the case if your download tool lazily unzips only the files you read.

If you are proposing a pull request from your own clone of our repository, you may need to enable GitHub Actions for your clone.

Building Using The Khronos Docker Image

Building the specification on your own machine requires a large set of tools. Rather than installing these tools yourself, if you can run Docker on a Linux-compatible host (probably including Windows WSL2 with an Ubuntu or Debian OS, and possibly including macOS), you can use the same pre-configured Docker image used by the CI builds.

If you are on Debian/Ubuntu Linux, install Docker with:

sudo apt update
sudo apt install docker.io

The Docker image used to build the specifications can then be downloaded or updated to the latest version via

docker pull khronosgroup/docker-images:asciidoctor-spec

The Dockerfile specifying this image can be found at https://github.com/KhronosGroup/DockerContainers if you need to build a modified or layered image. However, if something is missing or out of date in the image, please file an issue on the DockerContainers repository before trying to build your own image. We will try to keep the image updated as needed.

To build the specification using the image, use the Makefile inside the adoc directory:

cd adoc
make clean docker-html docker-pdf

Outputs will be located in $(OUTDIR) (by default, out/ in the adoc directory).

There are some variables defined in the Makefile you can set to change the behavior, such as to verbosely display the build process:

make QUIET= clean docker-html docker-pdf

If you need to invoke Docker without using make on the host, look at the actions in the docker-% target in adoc/Makefile and replicate them on your system.

Building On Your Native Machine

If you don't want to or can't use Docker (or a compatible replacement; it is possible that the Red Hat podman tool can run our Docker container, for example, though we do not support this), then you will need to install all the same tools in your own environment.

We cannot provide instructions to do this on every possible build environment. However, if you are using Debian/Ubuntu Linux, either native or via WSL2, you should be able to install the required tools by looking at the Dockerfile at

https://github.com/KhronosGroup/DockerContainers/blob/master/asciidoctor-spec.Dockerfile

Note that the Khronos Docker image layers on the official Ruby 3.1 Docker image, so you must install Ruby first.

If you have installed an older version of the tools and the Khronos image is updated, there may be minor changes in the Makefile and markup required by the new versions. For example, updating from asciidoctor-pdf 1.6.1 to 2.2.0 required changing the pdf-stylesdir attribute in the asciidoctor build to pdf-themesdir. Eventually, these changes may make using the older tools impractical. If this happens, update your tools to match the latest Docker image, and rebase your working branch on the current main branch.

Building Using GitLab CI

Finally, if you are a Khronos member working on our internal GitLab server, GitLab CI builds the specification just like GitHub CI. Go to the ...sycl/Specification repository page on the GitLab server and click through to CI/CD > Jobs (the rocket-ship icon on the left menu bar, or ...sycl/Specification/-/jobs). If your job succeeded, click on the Download icon for the latest CI job in the appropriate branch to download the zip file of build artifacts, or click on Passed to see build details.

The GitLab CI script is functionally equivalent to the GitHub CI script, but is located under .gitlab-ci.yml and uses a different YAML schema.

sycl-docs's People

Contributors

ad2605, aelovikov-intel, aerialmantis, alexeysachkov, bader, bso-intel, gmlueck, hdelan, illuhad, jackakirk, jzc, keryell, melirius, michoumichmich, mmoadeli, naghasan, nliber, nmnobre, npmiller, oddhack, pennycook, pkeir, psalz, ruyk, sbalint98, steffenlarsen, tapplencourt, tomdeakin, u235axe, verenabeckham


sycl-docs's Issues

Unary plus and minus operators for sycl::vec take a non-const reference to the value

From the SYCL 2020 specification (revision 2), page 351:

// OP is unary +, -
friend vec operatorOP(vec &rhs) const { /* ... */ }

Is there a reason why this takes a non-const lvalue reference? This makes it impossible to do something like:

const sycl::float4 v{};
-v; // Error: cannot bind const vec& to vec&

or

-sycl::float4{}; // Error: cannot bind vec&& to vec&

In my opinion, the line in the spec should be corrected to:

// OP is unary +, -
friend vec operatorOP(const vec &val) { /* ... */ }

Meaning of SYCL

I can't find the meaning of SYCL. Presumably it once meant "System OpenCL"?
I think that should be mentioned on the website, on Wikipedia, and in the specification.
It makes it clearer what SYCL "is".

Add queue(context, device) constructor overload

The specification doesn't provide the following queue constructor:

queue(const context &syclContext, const device &syclDevice, const property_list &propList = {});

The only alternative is to use:

queue(const context &syclContext, const device_selector &deviceSelector, const property_list &propList = {});

Having this constructor would be useful for any case where you have a device and a user-constructed context and want to create a queue, but don't want to pay the cost of having the queue implicitly construct another context.

It would also be useful in interop cases where you have a device and an interop context, or a user-constructed context and an interop device.
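What the issue asks for, in use, might look like this (a sketch of the proposed overload, which does not exist in the spec today; ctx is assumed to be a user-constructed context and dev a device from that context):

// Proposed: reuse ctx instead of implicitly constructing a second context.
cl::sycl::queue q{ctx, dev};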

buffer::get_access<access::mode::read> should be marked const.

The Problem

Calling get_access in read-only mode requires the buffer to be non-const, since get_access itself is never marked const. This is a problem for const data structures which encapsulate a SYCL buffer. It also contradicts const-correctness, which is strongly advocated by the C++ community. Finally, it is counterintuitive for the programmer.

Example

#include <CL/sycl.hpp>

int main()
{
    auto queue = cl::sycl::queue{};

    const auto in = cl::sycl::buffer<int, 1>{1024};
    auto out = cl::sycl::buffer<int, 1>{1024};

    queue.submit([&](cl::sycl::handler& cgh)
    {
        // candidate function not viable: method is not marked const
        auto in_acc = in.get_access<cl::sycl::access::mode::read>(cgh);
        auto out_acc = out.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.copy(in_acc, out_acc);
    });

    return 0;
}

Use case

Any const data structure encapsulating a SYCL buffer.

Possible solutions

  1. Mark get_access const for access::mode::read. I'd be happy to open a pull request for this if needed.
  2. Add wording to get_access's specification which explains why it cannot be marked const.

Over-complicated declaration of `select`?

Looking at select built-in, there are 6 declarations of it:

\addRowSixSL
{geninteger select (geninteger a, geninteger b, igeninteger c)}
{geninteger select (geninteger a, geninteger b, ugeninteger c)}
{genfloatf select (genfloatf a, genfloatf b, genint c)}
{genfloatf select (genfloatf a, genfloatf b, ugenint c)}
{genfloatd select (genfloatd a, genfloatd b, igeninteger64 c)}
{genfloatd select (genfloatd a, genfloatd b, ugeninteger64 c)}
{
For each component of a vector type:\newline
\codeinline{result[i] = (MSB of c[i] is set) ? b[i] : a[i].}\newline
For a scalar type:\newline
\codeinline{result = c ? b : a}.\newline
\codeinline{geninteger} must have the same number
of elements and bits as \codeinline{gentype}.
}

I'm a bit confused by the last four: why do the genfloatf versions accept genint as the last "mask" argument, whilst the genfloatd versions accept geninteger64?

I think we can remove the difference here and leave just genint/ugenint by updating the description: genint and ugenint must have the same number of elements and bits as the corresponding geninteger, genfloatf or genfloatd.

Explicit buffer allocation

Hi,

The handler class has copy and fill member functions. Can we have an allocate function? As the name implies, this function would allocate buffers [*].

The justification for this request is the following:
Allocations may include synchronization between kernels. Hence users may want to first allocate buffers, then run their kernels.
This makes reasoning about concurrency easier (as far as I know, explicit preemptive allocation is considered good practice in the OpenMP world).

[*] It's possible to mimic the feature right now by either directly using an
"OpenCL buffer" or by creating and launching a dummy "kernel". Both methods are kinda cumbersome.

Thanks

rev5 tag?

There are tags for rev6 & rev7, but not rev5. It looks like the initial commit was rev5. Is it exactly rev5? Could you tag it?

bfloat16 data type support in SYCL

As the title says: I know SYCL implementations define a half data type in the cl::sycl namespace. My question is whether bfloat16 is supported in a similar way, e.g. as cl::sycl::bfloat16?

Ambiguous queue constructors due to arbitrary property_list type

The following two queue constructors are not necessarily uniquely identifiable where the default property_list is used:

cl::sycl::queue::queue(const cl::sycl::device_selector&, const async_handler&, const cl::sycl::property_list& = {})
cl::sycl::queue::queue(const cl::sycl::device_selector&, const cl::sycl::property_list&)

For instance, I saw that the use below was calling the second constructor, when I would expect the first:

  cl::sycl::queue *queue = new cl::sycl::queue(dev, [&](cl::sycl::exception_list l)
  {
    // omitted ...   
  }); 

@illuhad identified the issue is caused by the property_list constructor accepting arbitrary types, and so my lambda function is used to construct a property_list rather than an async_handler!

There are a couple of solutions:

  1. Define a new cl::sycl::property type that the property_list constructors can accept instead of any type.
  2. Add explicit to the constructors.
  3. Rely on implementors to get it right! This would be bad, but it's what happens in TriSYCL currently.

Some more context can be found in the hipSYCL Issue I raised. The resolution to the second constructor there was causing other compile errors later.
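Until this is resolved, one possible workaround (a sketch; dev is the selector from the snippet above) is to name the handler, so overload resolution sees an exact async_handler argument rather than a raw lambda:

cl::sycl::async_handler handle_exceptions = [&](cl::sycl::exception_list l)
{
  // omitted ...
};
cl::sycl::queue *queue = new cl::sycl::queue(dev, handle_exceptions);

Since the second argument is now exactly an async_handler, the first constructor should be selected, though this still depends on how the implementation declares its property_list constructors.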

How to extract address space from raw pointers?

Imagine a device-side function with the following signature:

void foo(int* vec);

I don't know if vec comes from global, local, constant or private memory. However, inside foo I'd like to do something to vec which requires me to know the address space of the pointer, e.g. a cl::sycl::atomic_fetch_add. How do I tell the multi_ptr / atomic inside foo which address space is needed? Simply using a global_ptr will break if vec actually resides in local memory. Using multi_ptr will fail because the address space template parameter is missing. Creating an atomic by passing vec to its constructor will fail because vec isn't a multi_ptr. Using atomic_fetch_add on vec will fail because vec isn't an atomic type.

Some implementations (like ComputeCpp) internally use __global to annotate the pointer during device compilation. But even if there was a way to write something like void foo(__global int* vec) (there isn't as far as I know, ComputeCpp complains if I do this) this would be a bad idea because the address space attributes are implementation-defined.

Why do we need this? Sadly, there are libraries / frameworks out there that pass around raw pointers but where a SYCL backend is planned / worked on.

Edit: I also tried to overload foo with global_ptr, local_ptr etc. directly. This will fail because the call is ambiguous.
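One conceivable way to sidestep the problem (a sketch against the SYCL 1.2.1 API; it only helps if callers can supply a multi_ptr, and atomics are only defined for the global and local address spaces) is to make the address space a template parameter of foo instead of taking a raw int*:

#include <CL/sycl.hpp>

template <cl::sycl::access::address_space Space>
void foo(cl::sycl::multi_ptr<int, Space> vec) {
  // The address space is now carried by the type, so the atomic can be formed.
  cl::sycl::atomic<int, Space> a{vec};
  cl::sycl::atomic_fetch_add(a, 1);
}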

Create placeholder accessors without buffer

I'm facing a situation where I need to create a placeholder accessor without knowing the buffer the accessor will operate on. While thinking about potential solutions to this (as it is not possible in SYCL 1.2.1 Revision 6), I found that apparently ComputeCpp used to support this in an extension, where cl::sycl::handler::require() would optionally also take a buffer as a second argument, thus binding the provided accessor to that buffer. You can see it in action here. This extension has since been deprecated, unfortunately.

Are there any plans to introduce such functionality into future versions of SYCL?

SYCL memory model definition inconsistency.

Originally raised here: https://reviews.llvm.org/D99488#2656661.

The specification defines SYCL memory models as "based on the OpenCL v1.2 memory model", but later refers to generic address space, which has been introduced in OpenCL v2.0. See https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#the-generic-address-space.

Should we just drop OpenCL version from memory model definition wording?
E.g. "based on the OpenCL v1.2 memory model" -> "based on the OpenCL memory model"

Consider adding default constructor for cl::sycl::range

Currently the spec does define a default constructor for cl::sycl::id, but not for cl::sycl::range. I'm not sure whether this is intentional or an oversight; however, it can be quite a hassle when trying to write generic code.

For example consider this:

template <int dimensions>
struct my_range_holder {  cl::sycl::range<dimensions> range; };

There is no easy way of writing a constructor for this struct without having to resort to some kind of range-initialization functor, or if constexpr in C++17, both of which are far from pretty.

This issue first came to my attention because the Intel implementation seems to be the only one that actually doesn't provide some kind of default constructor for cl::sycl::range (thus "breaking" my code :)). I did a quick survey of the rest of the implementations, and it looks like everybody is doing something slightly different:

Implementation   Default-construction behavior
ComputeCpp       Initializes to 1
hipSYCL          Initializes to 0
triSYCL          Does not initialize

Clearly this is not great for portability, and even if you decide not to add a default constructor, at least consider explicitly deleting it in the spec. In my opinion, initializing ranges with 1, as ComputeCpp is doing, seems to be the most natural solution.
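For reference, a workaround along the lines hinted at above might look like this (a sketch; make_unit_range is an illustrative name, not part of any SYCL spec, and it follows ComputeCpp's initialize-to-1 behavior):

#include <CL/sycl.hpp>
#include <cstddef>
#include <utility>

// Build a range<dimensions> filled with 1s, e.g. range<3>{1, 1, 1}.
template <int dimensions, std::size_t... Is>
cl::sycl::range<dimensions> make_unit_range(std::index_sequence<Is...>) {
  return cl::sycl::range<dimensions>{(static_cast<void>(Is), std::size_t{1})...};
}

template <int dimensions>
struct my_range_holder {
  cl::sycl::range<dimensions> range =
      make_unit_range<dimensions>(std::make_index_sequence<dimensions>{});
};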

SYCL 2020 buffer interface feedback

Hi there. I’ve just been perusing the SYCL 2020 provisional spec and noticed that the buffer interface doesn’t provide size() and byte_size() member functions like the accessor interfaces do, only get_count() and get_size(). Having to specialise my code for the latter two was my most common frustration with 1.2.1 so I was delighted to see it being addressed in 2020. I was just wondering why this ‘fix’ wasn’t applied to buffers too? Is this something that could be added to the final 2020 spec?
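For illustration, the kind of specialization 1.2.1 forces generic code into might look like this (a sketch using the C++17 detection idiom; has_size and element_count are illustrative names, not spec API):

#include <cstddef>
#include <type_traits>
#include <utility>

template <typename T, typename = void>
struct has_size : std::false_type {};
template <typename T>
struct has_size<T, std::void_t<decltype(std::declval<const T&>().size())>>
    : std::true_type {};

// Element count of either an accessor (size()) or a 1.2.1-style buffer (get_count()).
template <typename T>
std::size_t element_count(const T& t) {
  if constexpr (has_size<T>::value)
    return t.size();
  else
    return t.get_count();
}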

Thanks for all your hard work with SYCL!

Clarify if swizzling vec<T, N> to vec<T, N-n> should be possible

Currently it is not clear if simple swizzles of a vec<T, 4> should include swizzles for smaller vector types, like: .xyz()
As stated in the spec in section 4.10.2.1 in the description for XYZW_SWIZZLE:

[...] Where XYZW_SWIZZLE is all permutations with repetition of x, y for numElements == 2, x, y, z for numElements == 3 and x, y, z, w for numElements == 4. For example xzyw and xyyy

Please clarify.
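For concreteness, the two cases in question might be written as follows (a sketch; the explicit swizzle<>() form is defined by the spec, while the .xyz() shorthand is the one needing clarification):

cl::sycl::vec<float, 4> v{1.f, 2.f, 3.f, 4.f};
auto a = v.swizzle<0, 1, 2>(); // explicit swizzle to 3 elements: defined
// auto b = v.xyz();           // simple swizzle to vec<float, 3>: unclear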

Also see: AdaptiveCpp/AdaptiveCpp#7

Consider adding support for explicit strided host to device copies

I sometimes find myself in a situation where I'd like to copy some strided data from the host to a SYCL buffer on the device, for subsequent use in a kernel. As it turns out however, the existing APIs for explicit memory operations only allow me to pass a contiguous host pointer as the source of a copy.

I had previously been playing with the idea of using a temporary SYCL buffer constructed with a pointer to my (strided) host memory, from which I could then create a host accessor to use as the src in an explicit copy operation. However, @AerialMantis pointed out to me that the copy is considered a kernel executed on the device (see section 4.8.6), and using host accessors inside kernel functions results in undefined behavior (as they are only allowed to be used on the host, see section 4.7.6.3).

Now, maybe one way of doing this could be to use a device accessor for my src as well, and hope that the SYCL runtime will recognize that instead of doing a H -> D -> D copy, this could be optimized to a strided H -> D copy. However, in doing so I'm pretty much at the mercy of the implementors, and have no guarantee about how much memory will actually be used for this operation (other than an upper bound). More likely than not, in any of the current implementations, the entire temporary host buffer would first be copied to the device (correct me if I'm wrong!).

The other option, which I'm using now, is to first do a host-side copy of the strided data into a contiguous staging buffer, and using that as the src for the copy operation. That is of course not ideal, and if host memory gets tight, might also not be feasible (especially if the implementation uses another pinned staging buffer internally...).

Ultimately I think, given that both OpenCL and CUDA provide APIs for doing strided H -> D copies, SYCL could also benefit from having something like this.

As I recently ran into this issue again, it got me thinking: Why not simply provide the ability to create a SYCL accessor for arbitrary user pointers? Like so:

float* my_ptr = static_cast<float*>(malloc(128 * 128 * sizeof(float)));
// ...
cl::sycl::accessor<
    float, 2, cl::sycl::access::mode::read_write /* mode is probably not needed */,
    cl::sycl::access::target::user_pointer>
    my_accessor(my_ptr,
                cl::sycl::range<2>(128, 128) /* range of data pointed to */,
                cl::sycl::range<2>(64, 128) /* optional sub-range to access */,
                cl::sycl::id<2>(32, 0) /* optional offset to access */);

With this API, my_accessor could then be used as the src in an explicit copy, implying that the copy should be strided. As an added bonus, such an accessor would allow users to index into their self-managed data just like they can for SYCL buffers, without having to worry about the data's layout in memory.

Searching broken within spec PDF output

I've heard that people are having trouble searching for text within the SYCL 1.2.1 spec PDF. Some of the issues appear to be on Mac specifically, but others are cross-platform. The problem may be related to underscores - that's where there are known reproductions.

Problem 1: Searching for italicized text with an underscore misses matches, even on a PDF reader under Windows. For example, SYCL1.2.1r5, p.76, the table has cl_command_queue_info in the right column. I can't make my PDF reader find any portion of the text, once I add an underscore to it.

Problem 2: Non-italicized hits missing on some Mac readers (Safari/Preview). A reproducer of the problem is async_handler. Searching with Safari/Preview on Mac apparently finds 8 results, while Acrobat reader on Mac finds 33 (which is consistent with what I see with a different reader in Windows).

SYCL 2020 Specification

Are there any plans to also publish the SYCL 2020 specification document here, so people could fix small mistakes and make pull requests against the new standard document?

(For example: I found a wrongly colored brace. Not a major problem, but easily fixable if the standard would be released here.)

Support specifying global offset in parallel_for_work_group

handler::parallel_for(nd_range<>, ...) and handler::parallel_for(range<>, ...) both support a global offset parameter by which all work-item ids passed to the kernel will be shifted. handler::parallel_for_work_group is missing such a parameter.

Specifying a global offset is useful for accessing only part of an index space without requiring manual index translation inside the kernel. This can be used to transparently split kernel executions into parts, e.g. for staying within device memory limits, all without modifying the actual kernel implementation.

As a practical motivation, we use offsets on parallel_for in our SYCL-based distributed memory runtime Celerity. The addition of this parameter would allow us to efficiently implement shared-memory functionality on top of parallel_for_work_group as well.

The primary goal of an offset parameter is to shift the global ids that are iterated within parallel_for_work_item. The offset can be specified either on the parallel_for_work_group or the parallel_for_work_item call. We prefer the former variant for our application.

When the offset is part of the call to parallel_for_work_group, it could be specified either in work items similar to nd_range, or in multiples of the local size. We advocate for the latter variant, since it allows shifting group indices instead of just altering the work-item id mapping. This makes a split invocation of parallel_for_work_group completely transparent to device code.

It might also be worth discussing if and how this applies to the implicitly-sized overload of parallel_for_work_group, although we do not see a use case for that variant at the moment. For the time being, we advocate for the parameter to be added to the explicit variant only.
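Put together, the proposed overload might look like this (a hypothetical signature sketched from the text above, not part of any SYCL spec; the offset is given in multiples of the local size):

myQueue.submit([&](cl::sycl::handler& cgh) {
  cgh.parallel_for_work_group<class chunk>(
      cl::sycl::range<2>{8, 8},   // number of work-groups in this chunk
      cl::sycl::range<2>{16, 16}, // work-group size
      cl::sycl::id<2>{4, 0},      // hypothetical offset, in work-groups
      [=](cl::sycl::group<2> g) {
        g.parallel_for_work_item([&](cl::sycl::h_item<2> item) {
          // Global ids here would be shifted by {4 * 16, 0 * 16}.
        });
      });
});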

Typo in the comment to unsampled_image_accessor::read() function

Look at the declarations of read() and write() methods of unsampled_image_accessor class:

/* Available only when: accessMode == access_mode::read
if dimensions == 1, coordT = int
if dimensions == 2, coordT = int2
if dimensions == 4, coordT = int4 */
template <typename coordT>
dataT read(const coordT &coords) const noexcept;
/* Available only when: accessMode == access_mode::write
if dimensions == 1, coordT = int
if dimensions == 2, coordT = int2
if dimensions == 3, coordT = int4 */
template <typename coordT>
void write(const coordT &coords, const dataT &color) const;

Compare the comments to read() and write() functions. Looks like the following line in the comment to the method read() is wrong:
if dimensions == 4, coordT = int4 (wrong)

In my opinion it should be corrected to be the same as the corresponding line in the comment to the method write():
if dimensions == 3, coordT = int4 (right)

Allow for capturing tuple types in SYCL kernels

The Problem

The SYCL specification demands that all data structures passed into a kernel are standard layout classes. This prevents most widely-used tuple types (std::tuple, boost::hana::tuple, etc.) from being passed, effectively disabling generic kernels with variadic parameters. In turn, this makes the implementation of a SYCL backend for abstract kernel libraries such as Alpaka or HPX, both heavily using variadic kernels, nearly impossible. HPX stumbled upon this a few years ago (see this paper, Section 4.4) and ultimately failed as there is still no SYCL backend (as far as I know).

Example

template <typename Func, typename... Args>
struct generic_kernel
{
    Func m_kernel;
    std::tuple<Args...> m_args;

    generic_kernel(Func&& kernel, Args&&... args)
    : m_kernel{kernel}, m_args{args...}
    {}

    auto operator()(cl::sycl::handler& cgh)
    {
        // copy by value to prevent 'this' pointer in device code
        auto k_kernel = m_kernel;
        auto k_args = m_args;

        cgh.single_task<class dummy>([=]()
        {
            // Error: class std::tuple is not standard layout
            std::apply(k_kernel, k_args);
        });
    }
};

Use case

Libraries such as Alpaka or HPX store their (abstract) kernel arguments in the form of variadic parameter packs which are saved in a tuple inside their (abstract) kernel structure. The actual (SYCL, CUDA...) kernel function itself is then called via std::apply or similar mechanisms. This approach already works for CUDA and it would be great to achieve the same with SYCL.

Users might also define their own generic kernel structs (launched with the same set of global sizes / work-group sizes) for a set of similar functions with differing behaviour (e.g. a set of filter kernels).

Possible solutions

  1. Lower the restrictions for kernel parameters so std::tuple et al. can be captured.
  2. Implement a SYCL tuple (maybe with conversion from std::tuple) which is able to be passed into SYCL kernels.

Is it allowed to wrap accessors?

The SYCL 1.2.1 Rev 6 spec states that all user defined data structures that are passed as arguments to kernels must have C++11 standard layout (section 3.10). Intel is proposing to relax this to trivially copyable. However, I cannot find any clarification on whether the built-in argument types, i.e., stream, sampler, and in particular accessor, must also fulfill this requirement, and by extension, whether it is allowed to wrap these types inside user defined data structures.

There is another sentence in section 3.10 that might imply it not being allowed (emphasis is mine):

The only way of passing pointers to a kernel is through the cl::sycl::accessor class, which supports the cl::sycl::buffer and cl::sycl::image classes. No hierarchical structures of these classes are supported and any other data containers need to be converted to the SYCL data management classes using the SYCL interface.

However from my reading I think this refers to buffer and image, not accessor.

In any case, out of the four SYCL implementations I tested, none provides accessors that are both standard layout and trivially copyable:

Impl         is_standard_layout   is_trivially_copyable
hipSYCL      false                false
ComputeCpp   true                 false
Intel        false                false
triSYCL      true                 false

Since these requirements are defined recursively, any user defined type wrapping an accessor from one of these implementations may thus not have standard layout (and certainly won't be trivially copyable).
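The table above can be reproduced with a small check along these lines (a sketch; the accessor template arguments follow SYCL 1.2.1):

#include <CL/sycl.hpp>
#include <iostream>
#include <type_traits>

using acc_t = cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                                 cl::sycl::access::target::global_buffer>;

int main() {
  std::cout << "is_standard_layout: "
            << std::is_standard_layout<acc_t>::value << '\n'
            << "is_trivially_copyable: "
            << std::is_trivially_copyable<acc_t>::value << '\n';
}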

I would actually find it quite puzzling if wrapping accessors were not allowed, as it seems a rather useful pattern to me. In particular since as far as I understand, wrapping and later "hydrating" accessors was one of the main justifications for introducing placeholder accessors in the first place. This pattern is currently e.g. extensively used in Codeplay's SYCL-BLAS.

Can you clarify this?

Allow 0-sized buffers

Suppose a user's code has hundreds of buffers, and one of them has to have size 0. Since 0-sized buffers are not allowed in SYCL, existing SYCL implementations just throw an error. I don't think this is good at all. What is the harm of allowing 0-sized buffers?

Namespace for half data type: cl::sycl or no namespace?

Khronos SYCL-CTS validates that a SYCL implementation defines the half data type in the cl::sycl namespace, although it is not clearly stated in the spec that this type must be defined.
Here are all the relevant references in the spec regarding the half data type.

4.10.1 Scalar data types.
...
Additional scalar data types which are supported by SYCL within the cl::sycl namespace are described in Table 4.93
...
For the purpose of interoperability and portability, SYCL defines a set of aliases to C++ types within the cl::sycl namespace using the cl_ prefix. These aliases are described in Table 4.94
...
cl_half
Alias to a 16-bit floating-point. The half data type must conform to the IEEE 754-2008 half precision storage format. A SYCL feature_not_supported exception must be thrown if the half type is used in a SYCL kernel function which executes on a SYCL device that does not support the extension khr_fp16

Should we extend the Table 4.93 with half data type?

half data type is mentioned in

4.10.2.2 Aliases
SYCL provides aliases for vec<dataT, numElements> as for the data types: char, short, int, long, float, double, half, cl_char, cl_uchar, cl_short, cl_ushort, cl_int, cl_uint, cl_long, cl_ulong, cl_float, cl_double and cl_half and the data types: signed char , unsigned char, unsigned short, unsigned int, unsigned long, long long and unsigned long long represented with the short hand schar, uchar, ushort, uint, ulong, longlong and ulonglong respectively, for number of elements: 2, 3, 4, 8, 16. For example the alias to vec<float, 4> would be float4.

NOTE: it's not clear here whether half is expected to be defined in the cl::sycl namespace or not. It doesn't seem so, as this list includes C++ fundamental types, which are not in the cl::sycl namespace.

5.1 Half Precision Floating-Point
The half scalar data type: half and the half vector data types: half1, half2, half3, half4, half8 and half16 must be available at compile-time. However if any of the above types are used in a SYCL kernel function, executing on a device which does not support the extension khr_fp16, the SYCL runtime must throw a feature_not_supported exception.

The conversion rules for half precision types follow the same rules as in the OpenCL 1.2 extensions specification [2, par. 9.5.1].

The math functions for half precision types follow the same rules as in the OpenCL 1.2 extensions specification [2, par. 9.5.2, 9.5.3, 9.5.4, 9.5.5]. The allowed error in ULP(Unit in the Last Place) is less than 8192, corresponding to Table 6.9 of the OpenCL 1.2 specification [1].

The cl::sycl namespace is not mentioned here either.

6.5 Built-in scalar data types
In a SYCL device compiler, the standard C++ fundamental types, including int, short, long, long long int need to be configured so that the device definitions of those types match the host definitions of those types. A device compiler may have this preconfigured so that it can match them based on the definitions of those types on the platform. Or there may be a necessity for a device compiler command-line option to ensure the types are the same.

The standard C++ fixed width types, e.g. int8_t, int16_t, int32_t,int64_t, should have the same size as defined by the C++ standard for host and device.
...
half
A 16-bit floating-point. The half data type must conform to the IEEE 754-2008 half precision storage format. A SYCL feature_not_supported exception must be thrown if the half type is used in a SYCL kernel function which executes on a SYCL device that does not support the extension khr_fp16.
Table 6.1: Fundamental data types supported by SYCL.

Here it seems like this type must be defined in the global namespace, like the rest of the C++ built-in scalar types.

OpenCL contexts should not be implicitly convertible to SYCL contexts

An OpenCL Context can be implicitly converted to a SYCL context. The spec (v1.2.1r5) does not forbid it, as the constructor:

context(cl_context clContext, async_handler asyncHandler = {});

is not marked explicit. This means the following is currently valid SYCL code:

#include <CL/cl.h>
#include <CL/sycl.hpp>

int use_sycl_context(cl::sycl::context sycl_context) {
  // Do something with context
  (void)sycl_context;
  return 0;
}

int use_cl_context(cl_context ocl_context) {
  return use_sycl_context(ocl_context);
}

I believe this is a bad idea, as a SYCL context constructor may involve a comparatively large amount of work beyond just initialising a cl_context pointer. If the conversion is only allowed to be explicit then a user must deliberately construct the SYCL context and so is much less likely to construct them when not expected.
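With the constructor marked explicit as suggested, the conversion would have to be spelled out at the call site (a sketch reusing use_sycl_context from the example above):

int use_cl_context_explicit(cl_context ocl_context) {
  // The SYCL context must now be constructed deliberately.
  return use_sycl_context(cl::sycl::context{ocl_context});
}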

Device definitions

\startTable{SYCL device selectors}

Currently the device selectors are defined as returning a device of a specific type, but there is no definition of what those types mean within this document. To find the definition of what a GPU device type is, one must look in the OpenCL specification, and even there the definition of what a device is remains quite sparse. It would be nice to have some definition of what a particular device selector will give you.

Support for SYCL 2020 reduction variables with `range` instead of `nd_range`

Can someone please comment on whether reduction variables are supported only with an nd_range and not with a range launch configuration.

For example, with Intel's SYCL implementation, the following code uses the range form, which is not yet implemented.

#include <CL/sycl.hpp>
#include <iostream>
#include <numeric>
#include <vector>
#include <cassert>

int main() {
    cl::sycl::queue myQueue(cl::sycl::gpu_selector{});

    std::vector<int> valuesVec(1024);
    std::iota(std::begin(valuesVec), std::end(valuesVec), 0);
    int* valuesBuf = cl::sycl::malloc_device<int>(1024, myQueue);
    myQueue.memcpy(valuesBuf, valuesVec.data(), 1024 * sizeof(int));

    // USM allocations with just 1 element to hold the reduction results
    int sumResult_host = 0, maxResult_host = 0;
    int* sumResult = cl::sycl::malloc_device<int>(1, myQueue);
    int* maxResult = cl::sycl::malloc_device<int>(1, myQueue);

    myQueue.submit([&](cl::sycl::handler& cgh) {
        // Create temporary objects describing variables with reduction semantics
        auto sumReduction = cl::sycl::ONEAPI::reduction(sumResult, cl::sycl::ONEAPI::plus<>());
        auto maxReduction = cl::sycl::ONEAPI::reduction(maxResult, cl::sycl::ONEAPI::maximum<>());
        // parallel_for performs two reduction operations
        // For each reduction variable, the implementation:
        // - Creates a corresponding reducer
        // - Passes a reference to the reducer to the lambda as a parameter
        cgh.parallel_for(cl::sycl::range<1>{1024},
                         sumReduction, maxReduction,
                         [=](cl::sycl::item<1> idx, auto& sum, auto& max) {
                             sum += valuesBuf[idx];
                             max.combine(valuesBuf[idx]);
                         });
    });

    myQueue.memcpy(&sumResult_host, sumResult, sizeof(int)).wait();
    myQueue.memcpy(&maxResult_host, maxResult, sizeof(int)).wait();
    std::cout << "value of Result_Host: " << sumResult_host << ", " << maxResult_host << std::endl;

    assert(maxResult_host == 1023 && sumResult_host == 523776);
}

User adding symbols to sycl namespace

Currently the spec only specifies how the implementer should use the sycl namespace. Section 4.1 states that extensions should go into a vendor namespace and implementation details into detail. It doesn't put any restriction on user code. Is the user allowed to put anything in the sycl namespace? I assume the intention is that this isn't allowed; it should be clarified.
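For concreteness, this is the kind of user code whose validity the spec leaves open (a sketch; host_vector is a hypothetical user-chosen name, and the plain sycl namespace assumes a SYCL 2020 implementation):

#include <sycl/sycl.hpp>
#include <vector>

namespace sycl {
// A user-added alias in the sycl namespace: allowed, or reserved?
template <typename T>
using host_vector = std::vector<T>;
}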

Clarify behavior of event::wait regarding asynchronicity in interop host_tasks

I couldn't find any information on how sycl::event::wait behaves regarding host_tasks that use an interop_handle to submit asynchronous operations to the backend queue underlying the SYCL queue. Consider this example:

auto e = my_queue.submit([&](sycl::handler& cgh) {
    cgh.host_task([=](sycl::interop_handle& ih) {
        auto stream = ih.get_native_queue<sycl::backend::cuda>();
        cudaMemcpyAsync(dst_ptr, src_ptr, count, cudaMemcpyDefault, stream);
        // cudaStreamSynchronize(stream); // Required?
    });
});

// Does this wait for the callback to return, or for `cudaMemcpyAsync` to be finished?
e.wait();

Basically, the example already says it: there are two different behaviors one could expect. Either:

  1. e.wait() returns as soon as the callback provided to handler::host_task has returned, or
  2. it returns only once the callback has returned, and all work on the underlying queue (in this example, the CUDA stream) has finished.

Personally, I would find option 2) much more useful, as option 1) would effectively mean that any form of interop would have to be blocking.

AllocatorT template parameter is not used in constructors of sampled_image class (unlike unsampled_image class)

Compare the current interfaces of sampled_image and unsampled_image classes in SYCL 2020 rev 3 specification:

template <int dimensions = 1, typename AllocatorT = sycl::image_allocator>
class sampled_image {
 public:
  sampled_image(const void *hostPointer, image_format format,
                image_sampler sampler, const range<dimensions> &rangeRef,
                const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  sampled_image(const void *hostPointer, image_format format,
                image_sampler sampler, const range<dimensions> &rangeRef,
                const range<dimensions - 1> &pitch,
                const property_list &propList = {});

  sampled_image(std::shared_ptr<const void> &hostPointer, image_format format,
                image_sampler sampler, const range<dimensions> &rangeRef,
                const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  sampled_image(std::shared_ptr<const void> &hostPointer, image_format format,
                image_sampler sampler, const range<dimensions> &rangeRef,
                const range<dimensions - 1> &pitch,
                const property_list &propList = {});
  /* ... */
};

template <int dimensions = 1, typename AllocatorT = sycl::image_allocator>
class unsampled_image {
 public:
  unsampled_image(image_format format, const range<dimensions> &rangeRef,
                  const property_list &propList = {});

  unsampled_image(image_format format, const range<dimensions> &rangeRef,
                  AllocatorT allocator, const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(image_format format, const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch,
                  const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(image_format format, const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch, AllocatorT allocator,
                  const property_list &propList = {});

  unsampled_image(void *hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const property_list &propList = {});

  unsampled_image(void *hostPointer, image_format format,
                  const range<dimensions> &rangeRef, AllocatorT allocator,
                  const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(void *hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch,
                  const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(void *hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch, AllocatorT allocator,
                  const property_list &propList = {});

  unsampled_image(std::shared_ptr<void> &hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const property_list &propList = {});

  unsampled_image(std::shared_ptr<void> &hostPointer, image_format format,
                  const range<dimensions> &rangeRef, AllocatorT allocator,
                  const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(std::shared_ptr<void> &hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch,
                  const property_list &propList = {});

  /* Available only when: dimensions > 1 */
  unsampled_image(std::shared_ptr<void> &hostPointer, image_format format,
                  const range<dimensions> &rangeRef,
                  const range<dimensions - 1> &pitch, AllocatorT allocator,
                  const property_list &propList = {});
  /* ... */
};

Both of them have an AllocatorT template type parameter, which is the type of the allocator used by the corresponding class. But no allocator of that type can be passed to the constructors of a sampled_image object (unlike unsampled_image). It looks like a bug in the specification.

Could local placeholder accessors be supported?

According to table 4.47 in SYCL 1.2.1 Rev 6, local accessors cannot be placeholders. As I'm currently being hampered by this restriction, I'm wondering why and whether it has to be this way. At least on OpenCL platforms, where local memory can be dynamically allocated before running a kernel, I don't see how a local placeholder accessor would be any different from a global placeholder accessor. However ultimately I don't have enough insight into all of the different platforms SYCL wants to support to know whether this would be generally feasible or not.

While my particular use case is a bit arcane, I think generally the same reasoning as was used in Codeplay's original proposal for placeholder accessors could be applied to local accessors (e.g. building a functor operating on local memory outside a CGF).

Has this been discussed at Khronos? Do you think it would be possible to support local placeholder accessors?

Need clarification on parallel_for_work_item() without specifying the work group size

Table 4.73 on page 162 explains how parallel_for_work_item() without specifying the work-group size should be used.

template <typename workItemFunctionT>
void parallel_for_work_item(workItemFunctionT func) const

Launch the work-items for this work-group. func is a function object type with a public member function void F::operator()(h_item<dimensions>) representing the work-item computation. This member function can only be invoked within a parallel_for_work_group context. It is undefined behavior for this member function to be invoked from within the parallel_for_work_group form that does not define work-group size, because then the number of work-items that should execute the code is not defined. It is expected that this form of parallel_for_work_item is invoked within the parallel_for_work_group form that specifies the size of a work-group.

There are two different APIs of parallel_for_work_group, one with specifying the work group size and the other without specifying it.
If I understand the above excerpt correctly, since the logical range of a work group is not specified, this API shouldn't be used inside a parallel_for_work_group that does not specify the work group size.
It makes sense to me because it is not possible to determine the total iteration space only from the number of work groups.

However, the example at the bottom of page 175 says this is a valid case.

myQueue.submit([&](handler & cgh) {
    // Issue 8 work-groups. The work-group size is chosen by the runtime because unspecified
    cgh.parallel_for_work_group<class example_kernel>(
        range<3>(2, 2, 2), [=](group<3> myGroup) {

            // Launch a set of work-items for each work-group. The number of work-items is chosen
            // by the runtime because the work-group size was not specified to parallel_for_work_group
            // and a logical range is not specified to parallel_for_work_item.
            myGroup.parallel_for_work_item([=](h_item<3> myItem) {
                //[work-item code]
            });
        });
});

Did I misunderstand the parallel_for_work_item API?

SYCL functions for explicit memory operations

While working on the implementation of these functions for triSYCL/triSYCL#66, I found the semantics unclear, especially concerning transfers from accessors to accessors.
Do we consider that we use get_linear_id() to access the source?
If so, there is no constraint requiring accessors of the same rank on source and destination.
Should we really require the same target, mode and value type as specified today?

There are also some destination/source inversions, fixed in #22.

Buffer reinterpret 1D range

When reinterpreting to a buffer of dimension 1, the 1.2.1 spec allows only one way to specify the range, because of the following paragraph (Member functions for the buffer class):

Must throw an invalid_object_error SYCL exception if the total size in bytes represented by the type and range of the reinterpreted SYCL buffer does not equal the total size in bytes represented by the type and range of this SYCL buffer.

Example:

buffer<int, 2> initial{range<2>{5, 6}};
buffer<float, 1> linear = initial.reinterpret<float, 1>(range<1>{initial.get_size() / sizeof(float)});

In this case it should be possible to just omit the range, because the SYCL runtime knows exactly what to do. If we also allow the dimensions parameter to be defaulted to the same one as the original buffer, it would simplify 1D buffers even further:

buffer<int> initial{range<1>{16}};
buffer<float> linear_float = initial.reinterpret<float>();
buffer<char> linear_char = initial.reinterpret<char>();

Explicit memory operations are underspecified

I've recently started to look into explicit memory operations again, and have come to the conclusion that the exact semantics of the various functions are still rather underspecified.

Since I'm not quite happy with the current state of support for explicit memory operations in the existing SYCL implementations, I've also created a PR for the CTS that greatly enhances the respective tests, thoroughly checking all of the expected behaviors (or at least, my interpretation thereof). Notably, this also includes a table with results for different SYCL implementations and backends, which I will reference below, so I encourage you to go check it out!

So here is the list of my concerns:

  • 5c3fd1b addressed a couple of bugs where the allowed combinations of accessors were artificially (and sometimes nonsensically) restricted by having only a single set of template parameters for both accessors (for example, both had to have the same access mode). The commit changed this, however in doing so, also permitted the two accessors to have different dimensionalities. My question is: Was this intentional? While some implementations do already support it, and I can certainly see use cases where this could come in handy, it seems like a rather specialized and low-level mechanism. I would argue that it introduces quite a bit of complexity for implementors (see my test cases), which makes it prone to bugs (see my results), for a relatively niche application. I think realistically, whoever needs such functionality could also just implement it as a custom kernel.
  • In a similar vein, the spec doesn't say anything about the shape of the source and destination accessors used in a device-to-device copy, only that the source must access at least as many bytes as the destination. Should it therefore be possible to copy between two 2D accessors of shape [16,32] and [32,16] (see the sketch after this list)?
    • Since source and destination can have different data types, a strided copy could even become impossible for certain shape combinations. For example, what happens when I want to copy from a [4,1] int32 accessor to a [8,2] int8 accessor? Each "row" (dimension 0) of the destination accessor is only 16 bits wide, but we want to copy 32 bits into it. (Edit: Okay, I guess it can be done by reinterpreting the source first, then copying 8-bit elements - still, this again seems like a very fringe use case).
  • If copies between different dimensionalities or differently shaped accessors should be possible, I think the spec should also clarify the semantics of such a copy. I think the most obvious interpretation would be to have each copied element maintain its relative linear id within the source and destination range, but I would argue that as long as its not specified, anything could be done here.
  • While the allowed access modes are listed for source and destination accessors, it says nowhere which access targets are legal. For instance, it doesn't make any sense to copy into a local accessor, as the accessor only exists within a single CGF, and it is not allowed to submit more than one action from within a CGF (e.g. a kernel). I also used to think it was possible to copy from a host accessor to a device accessor. It was then pointed out to me that a copy is considered a kernel in itself, and it is not allowed to use host accessors in device kernels. While that is fine, I think it wouldn't hurt to spell it out explicitly somewhere. Curiously, the commit message of 5c3fd1b seems to address local accessors, but the change itself appears to be missing from the diff.
  • When copying from or to the host using a raw pointer or shared_ptr, should the host memory be considered dense? The spec says "if an accessor accesses a range of 10 elements of int type, the host pointer must at least have 10 * sizeof(int) bytes of memory allocated". From this I would assume that copying the range [3,7] out of a [10,20] buffer at offset [2,5] would also only require 3 * 7 * sizeof(T) bytes on the host, thus being a dense copy of the strided source data. However, according to my results not all implementations behave this way (granted, this could realistically simply be bugs).
    • I think the example could be changed to a 2D buffer with a ranged accessor to illustrate this.
  • Lastly, should it be possible to only fill parts of a buffer by using a ranged accessor? I'd find it rather unintuitive if not, but none of the implementations I've tested seems to support it.
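As an illustration of the shape question above (see the second bullet), the following is the kind of copy whose legality and semantics the spec currently leaves open (a sketch against the SYCL 1.2.1-style API):

#include <CL/sycl.hpp>

int main() {
  cl::sycl::buffer<int, 2> src_buf{cl::sycl::range<2>{16, 32}};
  cl::sycl::buffer<int, 2> dst_buf{cl::sycl::range<2>{32, 16}};
  cl::sycl::queue{}.submit([&](cl::sycl::handler& cgh) {
    auto src = src_buf.get_access<cl::sycl::access::mode::read>(cgh);
    auto dst = dst_buf.get_access<cl::sycl::access::mode::write>(cgh);
    // Same element count and byte size, different shapes: is this legal,
    // and if so, which element ends up where?
    cgh.copy(src, dst);
  });
}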

`ND_range` where `ND` > 3

Hi,

Graphics is 3D, but Science is ND! (one told me you should always start your issue with a catchy sentence...)

Two examples that I am aware of:

The OpenMP version of su3bench looks like this:

#pragma omp target teams distribute
for (int i = 0; i < total_sites; ++i) {
  #pragma omp parallel for collapse(3)
  for (int j = 0; j < 4; ++j) {
    for (int k = 0; k < 3; k++) {
      for (int l = 0; l < 3; l++) {
        // ...

where the SYCL version looks like this monstrosity:

size_t total_wi = total_sites * THREADS_PER_SITE;
cgh.parallel_for(
    nd_range<1>{total_wi, wgsize}, [=](nd_item<1> item) {
      size_t myThread = item.get_global_id(0);
      size_t mySite = myThread / 36;
      if (mySite < total_sites) {
        int j = (myThread % 36) / 9;
        int k = (myThread % 9) / 3;
        int l = myThread % 3;
        // ...

The SYCL code just linearises the loop-nest and then manually computes the indexes. This is totally mechanical but makes the code quite ugly and hard to parse.

To make SYCL appealing to scientists, it would be nice if nd_range could be a "real" nd_range, ranging from 0 to N dimensions. From the spec's point of view, it should be "trivial" to generalize the linearization formula of https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:multi-dim-linearization with something like:

\sum_{i=0}^{N-1} id_i \prod_{j=i+1}^{N-1} r_j

In short, I would like to write something like:

cgh.parallel_for(
    range<4>{total_sites, 4, 3, 3}, [=](range<4> item) {
      // ...
    });

(As said during the panel discussion, this feature request should be a by-product of extending the dimensionality of buffer to also support more than 3 dimensions; I just put it here to keep a trace. Also possibly related to #107.)

Cheers,
Thomas

Ownership wrong term to describe relationship of host pointers and SYCL objects

The description given when passing raw pointers to the SYCL runtime is misleading:

When using a SYCL buffer, the ownership of the pointer passed to the constructor

A developer coming from OpenCL will likely understand the correct behavior, but a developer coming from C++ will be misled. Specifically the default behavior is as follows:

  • Data and state within the address space passed to the object is fully controlled by the SYCL runtime
  • Lifetime of the data is still managed by the original owner of the memory
  • Original owner only manages the lifetime and cannot legally view or modify the state of the data

This behavior is not the behavior implied by "the ownership of the pointer passed to the constructor of the class is, by default, passed to SYCL runtime". While the specification clarifies that "pointer cannot be used on the host side until the buffer or image is destroyed", it states nothing about lifetime management besides "memory pointed by host pointer will not be de-allocated by the runtime".

Furthermore, the description given throughout the specification simply states:
"The ownership of this memory is given to the constructed SYCL buffer for the duration of its lifetime". This is the same description given for both raw and smart pointers, further complicating the meaning:

Construct a SYCL \codeinline{buffer} instance with the \codeinline{hostData} parameter provided. The ownership of this memory is given to the constructed SYCL \codeinline{buffer} for the duration of its lifetime.

Construct a SYCL \codeinline{buffer} instance with the \codeinline{hostData} parameter provided. The ownership of this memory is given to the constructed SYCL \codeinline{buffer} for the duration of its lifetime.

Finally, the specification introduces something called "full ownership" which is really just ownership plus invalidation of any references to that data outside the SYCL runtime:

In the case where there is host memory to be used for initialization of data

Except in this instance, where "full ownership" is used differently:
In order to allow the \gls{sycl-runtime} to do memory management and allow

In C++, "ownership" of a pointer is in reference to lifetime. For example, unique_ptr:
https://en.cppreference.com/w/cpp/memory/unique_ptr/unique_ptr
"2) Constructs a std::unique_ptr which owns p"

Additionally, ownership implies nothing about control of the underlying data, which can still be referenced by other objects which have no control of the lifetime. Therefore, ownership is the wrong term to describe the behavior.

Note that the information needed to understand what the specification means is available, specifically:

A buffer can be constructed with associated host memory and a default

For the lifetime of the image object, the associated host memory must

The problem is that one has to read the specification completely or be misled. The specification does a good job overall of describing the behavior, but misleadingly uses "ownership". Because ownership is a term that already has strong meaning within C++, it is misleading to use that term outside the standard meaning.

Another term should be used and that term specifically defined to avoid any confusion. "Data control", "management", or something else that doesn't already have a strong meaning in C++ would be appropriate.

SYCL 2020 Reductions Feedback

The SYCL 2020 provisional spec (revision 1), section 4.10.2, calls for feedback on the two (or four, depending on how you see it) proposed variants for reductions. Since there doesn't seem to be such an issue here yet, I'm starting one now :).

Note that I will also be referencing some stuff from Intel's original reduction proposal, which this seems to be based on.

This also includes some considerations which are relevant to us at the Celerity project (cc @fknorr, @PeterTh).

On reducer vs reduction

At first glance, there doesn't seem to be much difference between the two APIs, other than the reduction variant being a bit more verbose and cumbersome to use. From that perspective, reducer is clearly preferable.

One issue comes to mind that may be relevant for pure library SYCL implementations: Reductions are often implemented through a series of kernel launches, each further reducing the input until a single value is left. To do so, a SYCL implementation would need to have access to the BinaryOperation used by a reduction. With the implicitly captured reducer variant, I imagine this could become difficult. Especially pure library implementations would have a hard time extracting the reducer objects out of opaque and implementation-defined kernel lambda objects. Even more so, a SYCL implementation would ideally have access to the BinaryOperation type statically in order to generate the necessary kernels, which as far as I can tell, is not possible for the reducer variant.
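
To illustrate the difference (a rough sketch only; sumBuf, sumAcc, in, and n are placeholders, and the reducer constructor shape in particular is my assumption based on the provisional text):

// reduction variant: the BinaryOperation (plus<int>) is part of the
// reduction object itself, so even a pure library implementation can
// see it statically
h.parallel_for(range<1>(n), reduction(sumBuf, h, plus<int>()),
               [=](id<1> i, auto& sum) { sum += in[i]; });

// reducer variant: the reducer is created up front and captured by the
// kernel lambda; the BinaryOperation is buried in an opaque closure type
auto r = reducer(sumAcc, plus<int>());  // assumed constructor shape
h.parallel_for(range<1>(n), [=](id<1> i) { r.combine(in[i]); });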

For Celerity, the fact that the proposed reducer API doesn't seem to include a default constructor is something that would cause us some headaches for sure. So if at all possible, we would greatly appreciate this being added :-).

On parallel_for vs parallel_reduce

Note: While not explicitly shown in the provisional spec, given that the Intel proposal is using reduction in combination with a new overload of parallel_for, and the reducer example from the spec uses parallel_for, I presume the option of not introducing a dedicated parallel_reduce (regardless of reducer vs reduction) is on the table as well.

One argument that might be made against introducing a dedicated parallel_reduce is that it would increase SYCL's API surface, which is of course always something that shouldn't be taken lightly. However, in reality, introducing the parameter pack overloads for parallel_for (to receive and forward the results of reduction calls) amounts to a surface extension just the same, only in a less obvious and self-documenting way.

After giving it some thought, I really can only think of reasons for NOT wanting to re-use parallel_for for reductions:

On some platforms, it might make sense to "de-parallelize" the reduction, doing multiple combine operations per thread to exploit ILP. By doing this transparently on parallel_for calls, users may be led to believe that they can achieve full parallelism for some other computation they are doing next to a reduction, even if they can't.

While the example shown in the spec only uses the simple parallel_for, from the Intel proposal I would assume that this would also be supported for ND-range parallel_for. In that case, "de-parallelization" becomes outright impossible, as work group semantics must remain intact (the user may issue a barrier, for example).

Additionally, if the user gets explicit control of work group sizes through ND-range parallel_for, things might become very awkward if an implementation wants to use local shared memory internally: If memory is allocated proportional to the group size, it may result in unexpected errors for the user if the limit is reached, and if the allocation size is decoupled from the group size, implementations may become very complex.

In general, and I think this applies to both variants, the proposed interface is extremely flexible, allowing users to have essentially arbitrary code executed within their reduction kernels (basically the "map" phase in what could be seen as a MapReduce operation). However, that flexibility may bring some downsides:

The Intel proposal shows "a simple way to implement this proposal" by using OpenCL 2.0's work_group_reduce_add. However, that group collective operation acts as an implicit barrier, which means that doing something like

cgh.parallel_reduce(..., [=](id<1> idx, auto& sum) {
  if(idx[0] % 2 == 0) {
    sum += input[idx];
  }
})

would deadlock, and a SYCL implementation would need compiler support to prove that this implementation strategy is safe. It may thus be worth considering restrictions on what can be done within a reduction kernel (e.g., regarding control flow), which would further warrant scoping those restrictions to a dedicated parallel_reduce command.
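
For instance, a divergence-free formulation of the example above (purely illustrative) would let every work-item reach the combine:

cgh.parallel_reduce(..., [=](id<1> idx, auto& sum) {
  // every work-item reaches the combine, so an implicit barrier inside the
  // underlying group collective cannot deadlock; non-matching items
  // contribute the identity instead
  sum += (idx[0] % 2 == 0) ? input[idx] : 0;
})

But requiring users to write in this style amounts to exactly the kind of restriction that would be easier to document on a dedicated parallel_reduce.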


Overall, while simply capturing a reducer inside a parallel_for is very sleek and elegant, I think that, for the reasons outlined above, the reduction + parallel_reduce variant is the more sensible combination, even if rather verbose.

Other notes

I understand that the reducer variant cannot accept placeholder accessors, as it needs to be bound to the current CGF somehow. However, the result of reduction is passed explicitly into the CGF's command function anyway, so I would argue that any placeholder accessors could simply be required to be bound by that point, and the interface could thus be made less strict (this is again something that we would appreciate having for Celerity).

Both reducer and reduction can only be instantiated/called with accessors that have either a read_write or discard_write access mode. While the idea makes sense, I find it somewhat strange to introduce new use cases for discard_write when it is being deprecated in favor of the noinit property in the same SYCL release.

I also wanted to point out that it feels strange that when using accessors, reducers ranging from 0 to 3 dimensions can be constructed, while USM is effectively limited to 0-dimensional (through pointer) and 1-dimensional (through span) reducers. Alas, as we won't see mdspan before C++23, I realize there is probably not much that can be done here.

Lastly, some minor spec bugs:

  • Both examples use 1-dimensional buffers of size 1 for their reduction variables, which in my understanding would result in an array-reducer, on which combine cannot be called directly (see the sketch after this list).
  • reducer's operator[] should be available for Dimensions > 0 (not > 1)
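
To spell out the first bullet (this is my reading of the provisional wording; the reducer shapes are assumed):

// with a 1-dimensional reduction variable, the resulting reducer is an
// "array reducer": combine is only reachable through operator[]
sum[0].combine(input[idx]);  // what the wording seems to require
sum.combine(input[idx]);     // what the spec examples actually write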

SYCL 1.2.1 spec (rev 7) section 4.8.9.3 example code in error?

There seem to be a number of problems with the example in the spec. One, there doesn't seem to be a "build_from_name" method in the OneAPI headers, nor in the spec itself. Two, the example uses both "MyProgram" and "myProgram" (a minor nit, but it leads me to believe this isn't from verified code). Lastly, I don't recognize a parallel_for with a signature that would match that of the example.

Originally:
https://stackoverflow.com/questions/62645410/sycl-spec-1-2-1-rev-7-section-4-8-9-3-error

SYCL 2020 accessors issues.

Here is a summary of the discussion between @gmlueck and @yuriykoch regarding issues in the accessors section of the SYCL 2020 specification.

  1. An accessor with the deprecated target local has an is_placeholder() method. But there was no such method for local accessors in SYCL 1.2.1. We should remove this method from the SYCL 2020 specification.

  2. Par. 3.13.3 states that:
    “Inside kernels, the functions and data types available are restricted by the underlying capabilities of SYCL backend devices.”

    That statement from section 3.13.3 is not very useful because it doesn't explain how an application can tell which data types are restricted by the backend. We describe this better, though, in other parts of the spec. For example, 4.6.4.3 states that the "sycl::half" type is only available on a device if the device has "aspect::fp16". We suggest removing the paragraph from 3.13.3.

  3. local_accessor description states that: “The underlying dataT type can be any C++ type”.

    We suggest changing this sentence to “The underlying dataT type can be any C++ type that the device supports.”

    The intent is to say that local accessors do not impose any further restrictions on the underlying type beyond those restrictions on the device code in general.

  4. Accessor API changes are missing from the "What has changed ..." section. It looks like the D.1 appendix could be extended with the following:

  • There is no explicit mention of the accessor::value_type and accessor::reference types' dependency on the access mode. The spec says "Member functions of accessor which return a reference to an element have been changed to return a const reference for read-only accessors.", but we should tweak that sentence to say that the member types are also changed for read-only accessors (see the static_assert sketch after this list).

  • There is no explicit mention of read_write access mode removal for image accessors. Even SYCL 1.2.1 did not allow read-write access mode for images, unless the target was “host_image”. It was probably an oversight that “host_unsampled_image_accessor” does not provide this ability.

  5. p. 180, the get_range() description for the target 'local' has a misprint:
    "the returned value is the the range"
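
As a concrete rendering of the value_type point above (a sketch; the const outcome is how I read the SYCL 2020 wording, not verified against an implementation):

#include <sycl/sycl.hpp>
#include <type_traits>

// for a read-only accessor, the member types themselves pick up const,
// not just the return types of the element-access member functions
static_assert(std::is_same_v<
    sycl::accessor<int, 1, sycl::access_mode::read>::value_type,
    const int>);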

set_final_data invalid default argument

The documentation for the image and buffer classes' set_final_data member functions contains incorrect default arguments. Both function declarations reference a nonexistent std::nullptr value instead of the nullptr keyword. This can be seen at the following locations:

void set_final_data(Destination finalData = std::nullptr);

void set_final_data(Destination finalData = std::nullptr);

{void set_final_data(Destination finalData = std::nullptr)}

{void set_final_data(Destination finalData = std::nullptr)}
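
Presumably each occurrence was meant to use the keyword:

void set_final_data(Destination finalData = nullptr);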

Implicit accessor-to-pointer casts

This is a duplicate of triSYCL/triSYCL#247, moving the discussion here.

The Problem

The SYCL specification (3.5.2.1) says the following:

Within kernels, accessors can be implicitly cast to C++ pointer types. The pointer types will contain a compile-time deduced address space. So, for example, if an accessor to global memory is cast to a C++ pointer, the C++ pointer type will have a global address space attribute attached to it. The address space attribute will be compile-time propagated to other pointer values when one pointer is initialized to another pointer value using a defined mechanism.

This is not reflected in accessor's interface, and none of the publicly available implementations support it.

Example

void vec_add(const int* a, const int* b, int* c, std::size_t size);

queue.submit([&](cl::sycl::handler& cgh)
{
    auto a = a_d.get_access<cl::sycl::access::mode::read>(cgh);
    auto b = b_d.get_access<cl::sycl::access::mode::read>(cgh);
    auto c = c_d.get_access<cl::sycl::access::mode::discard_write>(cgh);

    cgh.single_task<class vector_add>([=]()
    {
        // no known conversion from accessor to const int*
        vec_add(a, b, c, 1024);
    });
});

Use case

Libraries such as Alpaka or HPX provide abstraction layers over competing compute APIs such as CUDA. Unfortunately, their APIs work with raw pointers in their abstract kernels. In order to implement a SYCL backend, the accessor currently has to be transformed into a pointer explicitly (by going through multi_ptr). Being able to do this implicitly would be a lot easier.
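
For reference, the explicit conversion available today looks roughly like this (a sketch; address-space handling may differ between implementations):

cgh.single_task<class vector_add>([=]()
{
    // get_pointer() yields a multi_ptr; get() unwraps the raw,
    // address-space-qualified pointer that vec_add expects
    vec_add(a.get_pointer().get(), b.get_pointer().get(),
            c.get_pointer().get(), 1024);
});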

Possible solutions

  1. Remove the above wording and rely on multi_ptr's conversion instead.
  2. Allow accessor to be implicitly cast as well.

Can a SYCL implementation define symbols within the ::cl::sycl namespace in the <sycl/sycl.hpp> header?

The question originally raised by @psalz in KhronosGroup/SYCL-CTS#108.

Section 4.3 of the spec says the following:

SYCL provides one standard header file: <sycl/sycl.hpp>, which needs to be included in every translation unit that uses the SYCL programming API.

All SYCL classes, constants, types and functions defined by this specification should exist within the ::sycl namespace.

For compatibility with SYCL 1.2.1, SYCL provides another standard header file: <CL/sycl.hpp>, which can be included in place of <sycl/sycl.hpp>. In that case, all SYCL classes, constants, types and functions defined by this specification should exist within the ::cl::sycl C++ namespace.

For consistency, the programming API will only refer to the <sycl/sycl.hpp> header and the ::sycl namespace, but this should be considered synonymous with the SYCL 1.2.1 header and namespace.

@psalz noticed an inconsistency between implementations in whether ::cl::sycl symbols are defined by the <sycl/sycl.hpp> header:
DPC++ defines the ::cl::sycl namespace, whereas hipSYCL doesn't.

Is a SYCL implementation allowed to define the ::cl::sycl namespace in <sycl/sycl.hpp>, or may that only be done in <CL/sycl.hpp>?
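
For what it's worth, the behavior observed in DPC++ presumably amounts to something like the following inside <sycl/sycl.hpp> (a sketch, not any implementation's actual code):

namespace sycl { /* ... all SYCL 2020 symbols ... */ }

// additionally expose the SYCL 1.2.1 compatibility namespace from the
// same header by aliasing into ::cl::sycl
namespace cl { namespace sycl { using namespace ::sycl; } }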

SLM use

\subsubsection{Local accessor}

The specification of the shared local memory accessor is quite vague about what it actually does. I can understand this is meant to leave flexibility for vendor implementations, but it makes the accessor difficult to use in practice. It is not clear whether the accessor allocates one block of memory that the user must split between work-groups, or whether the specified amount is allocated per work-group.

In the example code below, does the runtime allocate 1K of SLM for each of the 16 work-groups (16K total), or just 1K of SLM that is distributed equally among the 16 work-groups? I don't believe the spec is clear in this regard.

queue.submit([&](cl::sycl::handler &cgh)
{
    // requests 256 ints (1K) of local memory -- per work-group, or in total?
    auto acc = cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                                  cl::sycl::access::target::local>(
        cl::sycl::range<1>(256), cgh);

    cgh.parallel_for_work_group(cl::sycl::range<1>(16), cl::sycl::range<1>(64),
        [=](cl::sycl::group<1> g)
        {
            g.parallel_for_work_item([&](cl::sycl::h_item<1> hi)
            {
                size_t i = hi.get_local_id(0);
                acc[i] = hi.get_global_id(0);
            });
        });
});

Possible error in SYCL specification document (1.2.1)

On page 87, at line 69 of the sycl::buffer interface overview, the function signature is:

buffer(buffer<T, dimensions, AllocatorT> b, 
    const id<dimensions> &baseIndex,
    const range<dimensions> &subRange);

However, later, in the constructor table on page 93, the function signature is:

buffer(buffer<T, dimensions, AllocatorT> &b,
    const id<dimensions> & baseIndex,
    const range<dimensions> & subRange)

In the table, the buffer b is passed by reference, while in the interface overview it's passed by value.
I guess one of the function definitions is wrong?

SYCL 2020 make_module API definition is incorrect

The SYCL 2020 provisional spec defines the make_module API as follows:

template<backend Backend>
kernel make_module(const backend_traits<Backend>::native_type<event> &backendObject, const context &targetContext);

Instead, it should apparently be:

template<backend Backend>
module make_module(const backend_traits<Backend>::native_type<module> &backendObject, const context &targetContext);

Does stream support cl_bool/bool operand types?

In the SYCL specification there is no mention of the bool or cl_bool types in Table 4.104, "Operand types supported by the stream class". However, there are checks for bool and cl_bool in the existing test for stream operator<<.
Is this a CTS issue or a spec issue?
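
For context, the usage in question looks roughly like this (a sketch; the stream buffer sizes are illustrative):

queue.submit([&](cl::sycl::handler& cgh)
{
    cl::sycl::stream os(1024, 256, cgh);

    cgh.single_task<class stream_bool_test>([=]()
    {
        os << true;                     // bool: not listed in Table 4.104
        os << static_cast<cl_bool>(1);  // cl_bool: also not listed
    });
});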
