gpuweb / gpuweb Goto Github PK

Where the GPU for the Web work happens!

License: Other

Makefile 0.37% Python 13.38% Shell 0.29% JavaScript 0.20% Perl 0.68% Bikeshed 82.20% Mermaid 0.60% C++ 2.21% Dockerfile 0.08%

gpgpu-computing gpu w3c

gpuweb's Introduction

W3C GPU for the Web Community Group

This is the repository for the W3C GPU for the Web Community Group WebGPU API and WebGPU Shading Language (WGSL) specifications. This specification is formally standardized by the W3C GPU for the Web Working Group.

We use the wiki and issue tracker as the main sources of information related to the work. This repository will hold the actual specification, examples, etc.

Work-in-progress specification: https://gpuweb.github.io/gpuweb/

Work-in-progress WGSL specification: https://gpuweb.github.io/gpuweb/wgsl/

Charter

The charter for this group is maintained in a separate repository.

Membership

Membership in the Community Group is open to anyone. We especially encourage hardware vendors, browser engine developers, 3d software engineers and any Web Developers with expertise in graphics to participate. You'll need a W3C account to join, and if you're affiliated with a W3C member, your W3C representative will confirm your participation. If you're not a W3C member, you're still welcome. All participants are required to agree to the Contributor License Agreement.

Contributions

You are not required to be a member of the Community Group or Working Group in order to file issues, errors, fixes or make suggestions. Anyone with a GitHub account can do so.

In order to assure that WebGPU specifications can be implemented on a Royalty-Free (RF) basis, all significant contributions need to be made with RF commitments. Members of the Working Group, and members of the Community Group who have signed the Final Specification Agreement have already committed to the terms of the W3C Patent Policy. Non-members will be requested to provide an RF commitment under terms similar to the W3C Patent Policy.

All contributions must comply with the group's contribution guidelines.

See CONTRIBUTING.md for technical guidance on contributing.

Code of Conduct

This group operates under W3C's Code of Conduct Policy.

Communication

Our primary public chat channel is via Matrix (what is matrix?) at #WebGPU:matrix.org.

For asynchronous concerns, we use GitHub for both our issue tracker and our discussions forum.

Both the Community Group and the Working Group have W3C email lists as well, though these are largely administrative.

gpuweb's People

Contributors

Stargazers

Watchers

Forkers

danielribeiro mhorga gomson mandar-shinde invingagan1 kangz kdashg kainino0x kvark grovesnl jiawei-shao grorg napokue darknoon lp249839965 richard-yunchao litherum devshgraphicsprogramming acidburn0zzz millerhooks zhangchn jussn drsimonhack rafaelcintron happy-ferret bbernhar matthew-mcallister austineng shadowkun jungkees a2un cuulee cloned-repos ryandev331 waywardmonkeys yuhwankim biyootiful jacekdaa iraiter2 damyanp nichijoucc quanpands qpanweb aashish24 angle-li lernaengine romandev polokobe bzbarsky hugebean oz90210 misback squadstar bagelflower lily-zayn vorg wujs bpeersmoz ykankaya dneto0 frzi jameshfisher mbrukman haoxli samkuhn tonysan pow2clk doug-moen paulkernfeld variousforks michaldybizbanski toji rconde01 06wj ecgouvea hadirgax yuyang0 xingjinglu googol-lab gpsbird vziatkov akosblazsik shaoboyan xiaoshzx hixio-mh rukai jseam2 kunalmohan longde123 almarklein ponitka curioustauseef shyran-systems marguelgtz cwfitzgerald autokagami tidoust lachlansneff sarahm0 wxohyer

gpuweb's Issues

WebGL - WebGPU interoperability

This would enable incremental porting of WebGL apps to WebGPU and access to specific WebGPU features from WebGL apps (e.g. #17).

Related material:

Cost of transferring the data could be reduced using shared graphics memory (IOSurface, libdrm GBM...)

Render target / render pass investigation

Render targets / Render passes

Things overlooked: programmable sample position, layered rendering.

Basically:

All APIs require the sample count and formats of the attachments at pipeline creation time.
D3D12 is "immediate mode" and you can set render targets, clear, resolve, discard at any time.
Metal render-targets part of the encoder state and force clear, resolve and discards to be declared at encoder creation time. Probably because it helps mobile tilers a lot.
Vulkan is similar to Metal but adds a "renderpass" concept declaring multiple passes at the same time, so that some of the attachments can be kept in tile memory only.

D3D12

In D3D12 attachments are defined by Render-Target Views (RTV) descriptors living in a CPU RTV descriptor heap. These descriptors are created by ID3D12Device::CreateRenderTargetView and defined by a D3D12_RENDER_TARGET_VIEW_DESC. I didn't look at all the dimensions of texture, but you seem to be able to select the mip-level, array slice, and depth (for a 3D texture) to render to. For depth stencil things are similar but with Depth-Stencil views (DSV).

RTV and DSV descriptors are then used in ID3D12GraphicsCommandList::OMSetRenderTargets to set the current attachments, without any other information. Clearing, discarding and resolving resources are respectively done through ID3D12GraphicsCommandList::ClearRenderTargetView (and ID3D12GraphicsCommandList::ClearDepthStencilView), ID3D12GraphicsCommandList::DiscardResource and ID3D12GraphicsCommandList::ResolveSubresource.

At pipeline creation time, only the format of the RTV and DSV need to be declared in D3D12_GRAPHICS_PIPELINE_STATE_DESC along with the sample count.

Metal

In Metal binding render targets is done for the duration of a MTLRenderCommandEncoder by specifying the MTLRenderPassDescriptor at render encoder creation. Each attachment (color, depth or stencil) is specified with a texture, a mip-level, a slice (for arrays) and a depth plane for 3D textures. Attachments also get a MTLLoadAction action (don't care, clear or load) to help optimize memory traffic as well as a MTLStoreAction (don't care, store and or resolve). The texture storing the resolved data is specified per-attachment with a texture, level, slice and depth as well. A clear value can be provided when the load action is "clear".

At pipeline creation time, the format of each attachment is set separately, but the sample counts set for the whole pipeline, see MTLRenderPipelineDescriptor's doc. I didn't find it in the doc but certainly, the pipeline's attachment format and sample count must match those of the render encoder's attachment.

Inside the render encoder, you can't clear, resolve or discard attachment, but can choose a store action previously set as unknown.

Vulkan

See both parts of "Renderpass" in these slides

Vulkan has a concept of renderpass where the structure of the rendering algorithm is described to the driver in advances. A renderpass contains a list of attachment formats and samples, and a list of subpass that use these attachments. Think of it like each subpass is the equivalent of a MTLRenderPassDescriptor. Dependencies between subpasses are expressed that the driver can optimize. Vulkan also has a concept of "input attachment" which is an attachment which can potentially stay in tile memory, or in the texture cache between different subpasses. This allwos things like a GBuffer to stay close to the ALU, instead of having to store it in main memory. Renderpasses are created with vkCreateRenderPass.

Pipelines are created with vkCreateGraphicsPipelines for use at a specific subpass of a renderpass, so that shader code manipulating the tile memory can be emitted.

In command buffers, renderpasses are started, stepped and ended with vkCmdBeginRenderPass, vkCmdNextSubpass and vkCmdEndRenderPass. The attachments must be given when beginning a renderpass as a compatible VkFramebuffer created with vkCreateFramebuffer from a bunch of VkImageViews.

Suggestion for WebGPU

The APIs all need the attachment formats and sample count for creating pipelines, so let's do that (specifying a subpass of a renderpass also gives that information). Metal and Vulkan need renderpasses explicitly started and ended with load and store actions to help with tilers, so let's do that.

The only things is whether we should have Metal single-passes or Vulkan render-passes. Some people [needs citation] has seen more than 30% improvement in power and perf on mobile when using Vulkan render passes, so I think we should go that direction.

Resource Round-Robining

Background

On Nov. 1, there was consensus in the group about scheduling resource uploads & downloads at a particular point in the device's queue so that the CPU and GPU wouldn't be accessing the same resource at the same time. This is the best (only?) solution for resources which live across multiple frames.

Let's consider the other side of the issue: when information on the CPU is only necessary on the GPU for a single frame. One example of this is the Model View Projection matrices, which are consistent throughout a single frame, but will change from frame to frame.

There are a few possible models for these kinds of resources:

Resource Churn The application allocates and destroys a new resource each frame. Obviously (I think we can all agree) we don't want this.
Scheduled Uploads The application uses a single resource, and schedules an upload at the start of each frame.
Explicit Recycling Allocate n resources up front, and recycle them each frame. We can piggyback off the design of the swapchain here, because the swapchain includes n buffers, and it guarantees that, when you're recording commands to draw into a particular swapchain buffer, that buffer is not being accessed by the GPU. If we make an array of our own resources which parallels the n buffers in the swapchain, the parallel item in our own resource array is unused by the GPU at recording time, and therefore is free for the CPU to populate.
Implicit recycling Just like above, a collection of resources will be created, but this array is owned by the implementation. All resources exist inside the implementation in a "free pool" or an "in-use pool." When an application asks for a resource, one is pulled from the free pool, or if the pool doesn't contain anything compatible, a new one is created. The application then attaches this resource to their recording commands, and notifies the implementation when they're done recording with this resource. At this point, however, the resource isn't returned to the free pool; instead, the implementation only returns the resource to the free pool when the GPU is finished with the resource. This way, any resource granted to the application is free to be immediately written into by the CPU. (Indeed, the resource acquisition function may even accept an argument to specify the resource's initial contents.) In this option, the same number of resources will be created and recycled as in option 3, but the application doesn't own the array.

Recommendation

Option 4 is most compatible with a Web API, and should be the model for WebGPU. This is for a few reasons.

Option 2 is a good start, but we can do better. In this model, the CPU-side memcpy() will occur on the GPU's timeline, taking time away from the GPU's execution. The other models allow for issuing the memcpy() during command recording, before ownership of the resource has been given to the GPU.

Option 3 improves upon 2, but has the drawback of making application logic dependent on the number of buffers in the swapchain, which is another potential source of non-portability. In particular, both Metal and Vulkan (and maybe Direct3D, I don't know) don't let the application specify exactly how many buffers the swapchain contains. Vulkan allows the application to request a certain number, but the actual implementation may return a different number than requested. Metal never tells you how many buffers are in the swapchain, but just gives you the "next one." We'd like to avoid web authors hardcoding a constant number of resources in their application because that happened to be how many buffers their local machine was using.

Option 4 has all the benefits of option 3, but has additional benefits:

Portability Application logic is insensitive to the number of buffers in the platform's swapchain.
Performance Letting the browser automatically recycle resources means that the browser can improve the performance of poorly-written applications. Recycling should be automatic; developers shouldn't have to opt-in to good performance.
Fingerprinting The nature of the platform swapchains would provide more entropy for fingerprinting.

WebGL - WebGPU interop: WebGPU computation API with existing WebGL

Combining WebGL with WebGPU compute shaders is a specific use case of #18 WebGL - WebGPU interoperability.
It would enable access to general computation API from existing WebGL apps.

404 Link on https://github.com/gpuweb/gpuweb/wiki/Proposals---Experiments

On the Proposals Experiments the Apple Proposal resolves to a github 404.

Compiling/validating shaders asynchronously

Many developers will agree that freezing the interface the user interacts with results in a bad user experience. It is for this reason that developers may choose to perform some tasks asynchronously. Currently with WebGL compiling a shader / linking a program can freeze the browser for a short while; depending on the complexity of the shader and the user's hardware. On mobile this is usually way more noticeable.

Some would argue it's up to the developer to perform all shader generating, compiling and program linking at the very start of their web app or game. But there are valid reasons to generate and compile shaders at a later moment. E.g. loading in a new model at a later time (streaming?), or the user changing the game's quality settings during runtime. Or perhaps the app performs computations based on the user's input.

This is why I propose that, whatever function will compile the shader, should be either done forcefully asynchronously, or at least provide the option to do so, to prevent the browser from temporarily freezing. Said function would return a Promise, passing the compiled and validated shader to the resolve function and passing any error messages to the reject function.

Promises are The Future of Async Javascript™ and it only seemed logical to choose Promises over callback functions. With the introduction of async functions (already implemented in some browsers!) I believe Promises have become a more strong and important feature of the Javascript language.

Continuing Apple's example of using the Metal API, a common path to compile a shader (library) may look something like this (ignoring any error handling):

async function createRenderPipelineState() {
    const source = await fetch(/* url */).then(response => response.text())
    const library = await gpu.createLibrary(source)
    const vertexF = library.functionWithName('vertex_main')
    const fragmentF = library.functionWithName('fragment_main')
    /** etc **/
    return gpu.createRenderPipelineState(pipelineDescriptor);
}

Transpilation

Transpilers are listed as one of the primary target audiences in the charter:

Developers of 3D and game engines/tools that are producing Web content via transpilation. For example, Unity and the Unreal Engine use emscripten to compile content for the Web.

How can we ensure that we design an API that supports this audience? Unless we shape the WebGPU API to match a specific native API, it may add a lot of burden on the transpiler (i.e. emscripten) to reorganize native API calls to fit the WebGPU API. In some cases this may not even be possible depending on the final shape and requirements of the WebGPU API.

As a trivial example, we've already decided to enforce begin/end various types of commands such as {Begin|End}Compute to better support Metal's encoders. So if a Vulkan application will be transpiled into a WebGPU application, the transpiler will need to insert these methods at the appropriate locations. In this example:

the transpiler may be able to deal with this pessimistically (i.e. cause many encoder switches in Metal which is what we were attempting to avoid), or
depending on the API requirements it may not even be possible (i.e. if there are certain constraints in which these begin/end commands may be invoked)

This one additional requirement will probably be dealt with fine, but I'm concerned about the aggregate impact these API shape and requirement changes may have.

Is it reasonable for us to agree to shape the WebGPU API to match one of the existing native APIs to simplify transpilation, given that transpilers are one of the primary target audiences for this API?

If the WebGPU API eventually diverges too far from any native API such that transpilers are forced to emulate various pieces of a WebGPU backend, we will lose many of the benefits that these low-level APIs offer. Likewise it seems as though it would be a major issue to expect game engines to create a separate WebGPU backend when OpenGL ES to WebGL transpilation was straightforward in the past.

It would be very useful to clarify how transpilation should work before settling on the fundamental API shape. For example, if we thought the WebGPU API shape should mirror Vulkan's API closely, then the path for device creation as discussed in #22 is obvious (shape it after Vulkan and constrain it further if necessary), and we can focus on finding the correct way to emulate it on D3D12 and Metal.

Investigation: Render Target Aliasing

We've heard from the Metal team that aliasing render targets is important for memory use.

The Problem

Imagine a WebGPU application that renders to a texture, then reads from that texture to render into another texture, then reads from that other texture to render into yet another texture. Long chains like this are common in games with post-processing.

The flow would look like the following:

   [ ]
    | Renders into
    V
Texture A
    |
    | Renders into
    V
Texture B
    |
    | Renders into
    V
Texture C
    |
    | Renders into
    V
The screen

With the current design, the application code would look like:

let textureA = device.createTexture();
let textureB = device.createTexture();
let textureC = device.createTexture();

let targetA = createRenderPassDescriptor(textureA);
commandEncoder.beginRenderPass(targetA);
commandEncoder.draw();
commandEncoder.endRenderPass();

let targetB = createRenderPassDescriptor(textureB);
commandEncoder.beginRenderPass(targetB);
commandEncoder.attachResource(textureA);
commandEncoder.draw();
commandEncoder.endRenderPass();

let targetC = createRenderPassDescriptor(textureC);
commandEncoder.beginRenderPass(targetC);
commandEncoder.attachResource(textureB);
commandEncoder.draw();
commandEncoder.endRenderPass();

commandEncoder.attachResource(theScreen);
commandEncoder.attachResource(textureC);
commandEncoder.draw();
commandEncoder.endRenderPass();

queue.submit([commandEncoder]);

This approach means that textures A and C both exist in memory at the same time. However; they don't need to - they are never accessed at the same time. In fact, they should be able to live in the same region of memory.

This is important for mobile devices because they usually have big and fairly high density screens, but not that much memory. Therefore, wasting memory as big as a frame buffer is quite unfortunate.

An application often can't simply reuse the same resource in multiple places in their frame (meaning: textures A and C usually can't literally be the same resource), because the resources may be required to be different sizes or formats.

Solution A: The Vulkan-Subpass Metaphor

One way to solve this problem is for the application to submit to us a plan of what the inputs and outputs of every future render pass will be so that the browser can figure out which targets can be aliased. The application would have to do this at the beginning of each frame.

This approach is unfortunate for a few reasons:

This is how Vulkan subpasses work, and we have heard from IHVs that applications usually don't use these facilities. We can only assume that lack of adoption wasn't an accident.
A scene graph renderer may not know ahead of time exactly how the frame will be laid out. For example, upon encountering a particular node, the engine may realize that they need a temporary buffer (e.g. to blur something).
Figuring out how to alias render targets is essentially the graph coloring problem, which is NP-complete. We don't want the browser to have to do this analysis.

Solution B: Heaps

This solution would look like

interface WebGPUTextureHeap {
    WebGPUTexture createTexture(WebGPUTextureDescriptor descriptor);
}

partial interface WebGPUDevice {
    WebGPUTextureHeap createTextureHeap();
}

And the application code would look like

let heap = device.createTextureHeap();
let textureA = heap.createTexture();
let textureB = device.createTexture(); // This one comes from the device because it shouldn't be shared with anything
let textureC = heap.createTexture();

[and the rest is all the same]

Here, any textures created from the same heap would be able to alias. This is effectively making the programmer provide the solution to the graph coloring problem at resource creation time.

However, this is unfortunate because a particular coloring is relevant to a particular frame, not a particular resource. Frame plans change over time as the scene graph changes, and forcing a particular coloring for the lifetime of the entire resource would lead to suboptimal utilization.

Solution C: Purgability

Recall that the runtime (browser) needs to be able to evict resources due to OS pressure without notifying the running WebGPU program. This means that a WebGPU texture isn't pinned to a specific address, and may be relocated by the runtime at any time, without the application knowing.

Therefore, the problem isn't about "aliasing," per se. A browser can freely move resources around wherever it wants, and if it moves a resource on top of another resource, the other resource's contents gets clobbered. If the application wasn't using that resource, then there is no problem.

In the example above, the only reason the browser has to keep all the resources around at the same time is because the application may suddenly decide to use all the textures as inputs in one additional draw call. Therefore, the browser doesn't know that a resource is able to be clobbered.

So, this solution would simply add a function on WebGPUTexture:

partial interface WebGPUTexture {
    void purge();
}

This lets the browser know that the application is done using a resource. A purged resource will be guaranteed to have all its contents reset to 0s, and, therefore, the browser is free to place other resources on top of it.

A purged resource may be used, just like normal, at any point after that, but doing so will un-purge the resource. Un-purging a resource forces the browser to move any resources located on top of it.

There is no undefined behavior because purging a resource will fill it with 0s at the time it's next read, even if no other resource was relocated on top of it.

So, using purgeability, the above example would become:

let textureA = device.createTexture();
let textureB = device.createTexture();
let textureC = device.createTexture();
...
// At the beginning, everything is purged
let targetA = createRenderPassDescriptor(textureA);
commandEncoder.beginRenderPass(targetA); // Un-purges target A
commandEncoder.draw();
commandEncoder.endRenderPass();

let targetB = createRenderPassDescriptor(textureB);
commandEncoder.beginRenderPass(targetB); // Un-purges target B
commandEncoder.attachResource(textureA);
commandEncoder.draw();
commandEncoder.endRenderPass();
textureA.purge();

let targetC = createRenderPassDescriptor(textureC);
commandEncoder.beginRenderPass(targetC); // Un-purges target C
commandEncoder.attachResource(textureB);
commandEncoder.draw();
commandEncoder.endRenderPass();
textureB.purge();

commandEncoder.attachResource(theScreen);
commandEncoder.attachResource(textureC);
commandEncoder.draw();
commandEncoder.endRenderPass();
textureC.purge();

queue.submit([commandEncoder]);

Notice that textures A and C are never un-purged at the same time, so the implementation is free to make them alias. However, if some surprising draw call comes along and uses them both at once, the implementation will happily de-alias them and continue on.

Position: Directly ingest SPIR-V, and not a human-writable shading language

Revision R1: 2018-01-01
Revision R2: 2018-01-03

WebGPU should directly ingest SPIR-V, and it should not ingest a human-writable shading language. [R2] The WebGPU group should bless HLSL and fully maintain a complete development path to enable development with HLSL.

Ease of SPIR-V ingestion

SPIR-V is a smaller and simpler implementation and verification target than any human-writable shading language.

Ingesting SPIR-V instead of a shading language removes substantial amounts of code from the WebGPU implementation, such as a preprocessor and (text) parser.
It eliminates code for language-level concepts like identifiers and expression trees, and all string processing, from the translator frontend and validator.
During the development of WebGL, hundreds of bugs and edge cases have been encountered which are exclusive to the ingestion of a human-writable shading language. Note that these can not be dismissed as simply bugs in the browser's shader translator; these types of bugs are inherent to ingesting a text-based language. Ingesting SPIR-V categorically eliminates bugs in these areas (example bugs provided):
- String-related operations: parsing, tokens, identifiers (bug), literals (bug, bug) - including needing to enforce a maximum identifier length (bug) in the specification to avoid implementation-specific limits
- Scoping and identifier namespacing/collisions (bug, bug)
- Abstract syntax trees - including needing to enforce a maximum tree depth/size/complexity (bug, bug) in the specification to avoid implementation-specific limits
- The preprocessor (bug, bug) - another entire language layered on top
- Weak typing (bug, bug) - implicit type conversions and type inference
- Implicit qualifiers and other optional syntax (bug, bug)
- Other error-prone high-level language features, like initializers (bug, bug), vec/mat destructuring (bug), etc.
This bug list, curated from the ANGLE and Chromium repositories, illustrates the classes of bugs that have been encountered during the development of the WebGL shading language specification, and which would be categorically eliminated by ingesting only the SPIR-V binary format.
Ingesting SPIR-V does add a few new concepts:
- Parsing of a simple binary format - much simpler to parse than HLSL/GLSL
- Control flow graph and single static assignment form
  - CFGs are simpler than ASTs and much easier to operate on.
  - SSA forms are simpler and much easier to operate on.
  - Code transformations and analyses are much easier to apply to CFG/SSA, which is why modern compilers use them internally. (Examples: bounds checks, bounds check elision, injecting temporary values and computations, constant propagation, dead code elimination.)
SPIR-V supports optional debug information like variable names and source line numbers (SPIR-V 1.2 §2.15), but it is only used for diagnostic output. It is not used during parsing or validation.
- Any error messages produced during shader validation can still refer to the developer's original variable names.
  - And if the ingested SPIR-V is produced by an SL compiler toolchain, as it almost always will be, it should always be valid. Any errors in the SL source should be caught by the SL compiler.
- Source maps, or a similar mechanism, can be used to provide the original source for shaders during development. Source maps are also used in WebAssembly to enable debugging and variable inspection for binaries compiled from any source language. (Recently implemented in Firefox.)

The benefits of ingesting a less complex verification target are significant:

[R2] HLSL/GLSL compiler bugs are no longer portability or security problems. (See Better for Web developers.)
Less buggy browser implementations
- A simpler implementation will be less buggy.
More consistent browser implementations
- Two implementations will be more consistent if they are less buggy.
- Tighter specification (see below) means better consistency.
Better security
- SPIR-V is a smaller attack surface and a smaller fuzzing target. Fuzzing can thus be more effective.
Potentially better ingestion performance
- By eliminating the parsing step and related complexity.

Better for Web developers

A hallmark of modern Web development is that applications are highly portable across browsers, with very minimal porting effort. Cross-browser consistency is critical.
- [R2] Portability problems caused by HLSL/GLSL compiler bugs are eliminated: If the application is responsible for compiling HLSL shaders (whether during build or at runtime), they will build once or ship with a single version of a compiler (such as DXC) which behaves consistently across clients. Any HLSL compiler bugs which affect an application can be caught in application testing, and will never be browser or system dependent. The developer can work around them, or they can fix them and upstream fixes.
[R2] Developers still get to take advantage of the well-established HLSL ecosystem and catalog.
Developers simultaneously get to take advantage of the rich SPIR-V ecosystem (optimizer, validator, cross-compilation, etc.).
- An evolution of HLSL can target SPIR-V. Targeting SPIR-V makes it much easier to evolve a language, because the browser does not need to implement language changes - an application developer can pick up new language developments whenever they want, [R2] as well as control for regressions.
- [R2] It also eases development of new languages, and allows some languages designed for Vulkan/OpenGL/OpenCL to be used in WebGPU. This flexibility provides value without undermining the well-lit path of HLSL.
It has been argued that Web developers have complained about the "lack of view-source" in WebAssembly, and that some WebAssembly WG members disagree with the final, binary format of WebAssembly (though we do not know what motivates either of these opinions). These viewpoints are not reflected in our experiences with WebGL developers. Vocal WebGL developers consistently request flexibility, performance, and performance features; view-source for shaders on third-party websites does not seem to be a big concern.
- View-source during development can be provided by SPIR-V debug information and/or source maps.
For developers who want or need online compilation of human-writable shaders, a shader compiler ([R2] like shaderc or DXC) can be easily ([R2] compiled to WebAssembly and) packaged as a JavaScript library. See Appendix: Engine development strategies for WebGPU shaders.

SPIR-V: tight specification and active ecosystem

SPIR-V is very stable. There have been minimal recent changes to its specification (diff from 1.1 to 1.2). This makes it a more stable target as there is minimal need for any given tool (such as an HLSL compiler) to target multiple SPIR-V versions.
SPIR-V is tightly specified. The complexity of the format is much lower than GLSL/HLSL, so its prose specification is able to be more precise. Lower complexity also leaves fewer holes for underspecification or self-contradiction.
- GLSL ES 3.20: 218 content pages for GLSL graphics/compute (mostly long prose, very dense)
- SPIR-V 1.2: 192 content pages for graphics/compute and OpenCL (some readable long prose, mostly well-organized tables with prose)
Since SPIR-V is less complex, a conformance test suite of the same "size" should intuitively be able to cover more cases and thus more effectively complement the prose spec. (We don't have data to go with this, since it's very hard to compare the size and coverage of SPIR-V in the Vulkan CTS with the size and coverage of GLSL in the OpenGL CTS.)
If we need to add a feature to SPIR-V, it can be done so as an Extension to SPIR-V. The WebGPU group can effectively own these Extensions.
SPIR-V's interaction with WebGPU will be defined by an Environment Specification. This will encompass limitations and additional validation rules we apply to SPIR-V. It can also require support for new SPIR-V extensions.
- SPIR-V adds constraints on the CFG (like structured control flow) that make it possible to translate into the AST-level (like GLSL/HLSL/MSL).
- Further constraints can be added (to an environment specification) to define a canonical ordering for the encoded CFG, such that it is easier to validate and easier to translate to the AST-level. (David Neto has started work on this.)
The SPIR-V ecosystem is very active, because it is used in Vulkan (and OpenCL, OpenGL, and Vulkan Portability). This lets us provide better tools for developers and better browser implementations. It allows WebGPU developers to take advantages of SPIR-V ecosystem developments, including human-writable shading language advancements like an evolution of HLSL.

Appendix: Engine development strategies for WebGPU shaders

There are ~four categories of engine development:

New WebGPU-only engines: They can base the engine around offline shader compilation, or use online shader compilation for prototyping.
Cross-compiled native engines (Unity, Unreal, etc.): They can already output both HLSL and SPIR-V.
User-generated shader content (Shadertoy): Shadertoy developers can trivially ship a shader compiler like shaderc or DXC. It will make users' content more consistent across browsers.
- Shadertoy could also, for example, choose to use a heavier compiler ([R2] or more expensive compiler options) with better diagnostic messages. Good diagnostic messages add even more complexity to an implementation, and for the reasons above, it is problematic to (1) add that complexity to every browser and (2) require it to run on every shader compile.
Engines which want to target both WebGL and WebGPU (Three.js, Cesium.js, etc.): These developers will want their existing GLSL shaders and GLSL-text-snipping engines to work on WebGPU. They will need to continue supporting WebGL 1.0 for compatibility. The development complexity is actually very similar regardless of the ingested format; [R2] ingesting SPIR-V does not make it harder:

Options if browser ingests HLSL	Options if browser ingests SPIR-V
1. Online cross-compile GLSL to HLSL with a third-party library. (shaderc+SPIRV-Cross?)	1. Online cross-compile GLSL to SPIR-V with a third-party library. (shaderc)
2. Maintain, in parallel with GLSL code, a separate library of HLSL and an HLSL-text-snipping engine.	2. Maintain a separate library of some SL, and offline compile it to SPIR-V (shaderc or DXC). Do online linking and specialization with a JS library.
3. An HLSL library and HLSL-text-snipping engine (as above), plus online-compile HLSL to GLSL (if maintaining WebGL as a "slower fallback path") with a third-party library.	3. An SL library compiled SPIR-V (as above), plus online-compile SPIR-V to GLSL (if maintaining WebGL as a "slower fallback path"). (Aside: offline compilation from SPIR-V to GLSL may even be possible, with a simple JS implementation of linking and specialization for GLSL.)

[R2] Addendum: Blessing and supporting HLSL (or an HLSL evolution)

Obviously, application developers will not normally write SPIR-V by hand. Instead, they will compile to SPIR-V (either online or offline) from a human-writable shading language. The WebGPU working group should bless a single such language, HLSL, and take responsibility for the HLSL development path.

HLSL has the richest library of existing source code.
HLSL to SPIR-V compilation is under development in DXC.
The WebGPU working group is expected to own specification, conformance, and development of HLSL or an evolution thereof.
The WebGPU working group should bless HLSL by:
- Using it in WebGPU documentation, and recommending its use in third-party documentation/tutorials/etc.
- Maintaining an online compilation path to WebGPU-compatible SPIR-V, e.g. DXC compiled to WASM.
- Maintaining an offline compilation path to WebGPU-compatible SPIR-V, e.g. DXC as a webpack packaging step using WASM DXC, and native or nodejs packages for Mac/Linux.

For developers with existing content in GLSL, we should also ensure the ecosystem includes a well-supported GLSL path (for example via shaderc).

Restriction on where each command can be done (encode, in/out of renderpasses)

Last meeting we were looking at different API's restriction on where each type of command could be done. This isn't about multi-queue scenarios, and assumes we are on a DIRECT queue on D3D12 and a queue with all bits set that is guaranteed to exist in Vulkan.

Metal

In Metal, to put commands in a MTLCommandList, the application has to use encoders. There are three types of encoders that support mostly disjoint operation subsets (all of them can do synchronization):

MTLBlitCommandEncoder can encode copies, blits, and friends and that's all.
MTLComputeCommandEncoder can do dispatch and set state / resources / pipeline / residency for compute shaders
MTLRenderCommandEncoder is created with a rendertarget bound, which stays for the duration of the encoder. It can do draws and set state / resource / pipeline / residency for graphics work. There's also commands for the Metal equivalent of queries and for setting the "store" operations for render targets.

Vulkan

Operations in Vulkan can either be done inside render passes, outside, or both but are all encoded via the same object.

Inside-only: Draws, clearing attachments for the current subpass
Outside-only: Dispatch, copies, and controlling query pools
Both: Setting [compute / graphics] [pipelines / resources / state], synchronization, beginning and ending queries.

D3D12

@RafaelCintron I haven't been able to find documentation on the restriction in the doc for ID3D12GraphicsCommandList. Is it because you are allowed to do any command anywhere, or because I didn't look hard enough?

Conclusion

Let's forget about Vulkan allowing to set graphics state outside of renderpasses, and compute state inside render passes. Let's also skip over API details we are not ready to look at (queries >_>).

Operations you can do inside Vulkan renderpasses are basically MTLRenderCommandEncoder operations, while operations you can do outside Vulkan renderpasses are both MTLComputeCommandEncoder and MTLBlitCommandEncoder operations. Which is great!

In my opinion, this means that either:

Changing between Blit and Compute encoders is cheap in Metal, in which case I think it would be nicer to allow mixing blits and compute operations in the API.
It is expensive, and we'll want to make the API have explicit boundaries between compute and copy operations.

@grorg do you think we could get data on this?

Raw notes for reference

Vulkan:
    Inside renderpasses:
        Draws:
            vkCmdDraw
            vkCmdDrawIndexed
            vkCmdDrawIndirect
            vkCmdDrawIndexedIndirect
        vkCmdClearAttachments - (only attachments of the current subpass)

    Outside
        Dispatch
            vkCmdDispatch
            vkCmdDispatchIndirect
        Copies
            vkCmdClearColorImage
            vkCmdClearDepthStencilImage
            vkCmdFillBuffer
            vkCmdUpdateBuffer
            vkCmdCopyBuffer
            vkCmdCopyImage
            vkCmdCopyBufferToImage
            vkCmdCopyImageToBuffer
            vkCmdBlitImage
            vkCmdResolveImage
        Controlling queries
            vkCmdResetQueryPool
            vkCmdCopyQueryPoolResults

    Both
        vkCmdBindPipeline
        vkCmdExecuteCommands - To run secondary command buffers
        Synchronization
            vkCmdSetEvent
            vkCmdResetEvent
            vkCmdWaitEvents
            vkCmdPipelineBarrier
        Binding compute/graphics resources
            vkCmdBindDescriptorSets
            vkCmdPushConstants
        Starting/stopping queries
            vkCmdBeginQuery
            vkCmdEndQuery
            vkCmdWriteTimestamp
        Setting graphics state
            vkCmdBindIndexBuffer
            vkCmdBindVertexBuffers
            vkCmdSetViewport
            vkCmdSetLineWidth
            vkCmdSetDepthBias
            vkCmdSetScissor
            vkCmdSetDepthBounds
            vkCmdSetStencilCompareMask
            vkCmdSetStencilWriteMask
            vkCmdSetBlendConstants

    N/A
        vkCmdBeginRenderPass
        vkCmdNextSubpass
        vkCmdEndRenderPass

Metal:
    Blit encoder:
     - Copying / blitting to/from/between buffers and textures
     - Filling a buffer
     - Generate mipmaps
     - Fence and synchronization

    Compute encoder:
     - Setting compute pipeline
     - Setting compute resources (includes push constants / root table constants like updates)
     - SetThreadGroupMemoryLength
     - Dispatch
     - Residency control

    Render encoder:
     - Setting all graphics state
     - Setting graphics resources
     - Draws
     - Synchronization
     - Setting the rendertarget "store" actions
     - Metal equivalent of queries
     - Residency control

Indirect draw/dispatch commands investigation

Native APIs provide ways to emit draws and dispatches that take arguments from buffers on the GPU which is useful when some of the parameters are computed on the GPU.

Native APIs

Metal

In Metal indirect commands are done by passing an indirect buffer and an offset to commands instead of the parameters:

@protocol MTLRenderCommandEncoder <MTLCommandEncoder>
    - (void) drawPrimitives: (MTLPrimitiveType) primitiveType
             indirectBuffer: (id <MTLBuffer>) indirectBuffer
       indirectBufferOffset: (NSUInteger) indirectBufferOffset;

    - (void) drawIndexedPrimitives: (MTLPrimitiveType) primitiveType
                         indexType: (MTLIndexType) indexType
                       indexBuffer: (id <MTLBuffer>) indexBuffer
                 indexBufferOffset: (NSUInteger) indexBufferOffset
                    indirectBuffer: (id <MTLBuffer>) indirectBuffer
              indirectBufferOffset: (NSUInteger) indirectBufferOffset;
@end

@protocol MTLComputeCommandEncoder <MTLCommandEncoder>
    - (void) dispatchThreadgroupsWithIndirectBuffer: (id <MTLBuffer>) indirectBuffer
                               indirectBufferOffset: (NSUInteger) indirectBufferOffset
                              threadsPerThreadgroup: (MTLSize) threadsPerThreadgroup;
@end

The format of the indirect buffers is defined by:

typedef struct {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t vertexStart;
    uint32_t baseInstance;
} MTLDrawPrimitivesIndirectArguments;

typedef struct {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
} MTLDrawIndexedPrimitivesIndirectArguments;

typedef struct {
    uint32_t threadgroupsPerGrid[3];
} MTLDispatchThreadgroupsIndirectArguments;

Vulkan

Vulkan is very similar to Metal except that:

the primitive type is taken from the last VkCmdBindPipeline
the index buffer / type and offset are taken from the last VkCmdBindIndexBuffer
the commands allow sending multiple draws at a fixed stride in the buffer; the maximum number for drawCount is a device limit that has to be at least 1.
baseInstance must be 0 unless the drawIndirectFirstInstance is enabled.

void vkCmdDrawIndirect(VkCommandBuffer commandBuffer, VkBuffer buffer, VkDeviceSize offset, uint32_t drawCount, uint32_t stride);
void vkCmdDrawIndexedIndirect(VkCommandBuffer commandBuffer, VkBuffer buffer, VkDeviceSize offset, uint32_t drawCount, uint32_t stride);
void vkCmdDispatchIndirect(VkCommandBuffer commandBuffer, VkBuffer buffer, VkDeviceSize offset);

The format of the indirect buffers is defined by:

typedef struct VkDrawIndirectCommand {
    uint32_t    vertexCount;
    uint32_t    instanceCount;
    uint32_t    firstVertex;
    uint32_t    firstInstance;
} VkDrawIndirectCommand;

typedef struct VkDrawIndexedIndirectCommand {
    uint32_t    indexCount;
    uint32_t    instanceCount;
    uint32_t    firstIndex;
    int32_t     vertexOffset;
    uint32_t    firstInstance;
} VkDrawIndexedIndirectCommand;

typedef struct VkDispatchIndirectCommand {
    uint32_t    x;
    uint32_t    y;
    uint32_t    z;
} VkDispatchIndirectCommand;

D3D12

In D3D12 executing commands indirectly is done through "command signature" that are a super-set of what's available in Metal and Vulkan. First a command signature is created that represents the order and layout of commands in the command buffer, then this signature is passed along with the indirect buffer to ID3D12GraphicsCommandList::ExecuteIndirect.

The interesting bits are:

That the layout of the D3D12_DRAW_ARGUMENTS, D3D12_DRAW_INDEXED_ARGUMENTS and D3D12_DISPATCH_ARGUMENTS structures match the respective structures in Metal and Vulkan.
A command signature has to be created against a root signature (because there are indirect commands to change root constants and descriptors).
Command signatures can take the maximum number of commands to process from a "count buffer" which makes command signatures a super-set of "MultiDrawIndirect".

typedef struct D3D12_DRAW_ARGUMENTS {
  UINT VertexCountPerInstance;
  UINT InstanceCount;
  UINT StartVertexLocation;
  UINT StartInstanceLocation;
} D3D12_DRAW_ARGUMENTS;

typedef struct D3D12_DRAW_INDEXED_ARGUMENTS {
  UINT IndexCountPerInstance;
  UINT InstanceCount;
  UINT StartIndexLocation;
  INT  BaseVertexLocation;
  UINT StartInstanceLocation;
} D3D12_DRAW_INDEXED_ARGUMENTS;

typedef struct D3D12_DISPATCH_ARGUMENTS {
  UINT ThreadGroupCountX;
  UINT ThreadGroupCountY;
  UINT ThreadGroupCountZ;
} D3D12_DISPATCH_ARGUMENTS;

Links to relevant parts of the D3D12 documentation:

Using command signatures: ID3D12GraphicsCommandList::ExecuteIndirect
Creating command signatures: D3D12_COMMAND_SIGNATURE_DESC and ID3D12Device::CreateCommandSignature
Specifying a single command in the signature: D3D12_INDIRECT_ARGUMENT_DESC and D3D12_INDIRECT_ARGUMENT_TYPE

Proposed API

All three APIs use the same layout for commands in the indirect buffer, which makes things easier. Because of restrictions of Metal and Vulkan, sending only a single DrawIndirect, DrawIndexedIndirect or DispatchIndirect command at a time should be supported.

The arguments for each of these should be an indirect buffer and offset, and that's it because the rest of the data will be present either in the last bound pipeline state, or in the last bound index buffer. Obviously graphics commands will have to be done in a render pass, and likewise the compute command should be done in a compute pass.

void CommandBuffer::DrawIndirect(Buffer* indirectBuffer, uint32_t indirectOffset);
void CommandBuffer::DrawIndexedIndirect(Buffer* indirectBuffer, uint32_t indirectOffset);
void CommandBuffer::DispatchIndirect(Buffer* indirectBuffer, uint32_t indirectOffset);

// Indirect buffer layout same as D3D12, Metal and Vulkan

Open question: Some Adreno 4XX and 5XX GPUs don't support drawIndirectFirstInstance. Should we require it anyways?

SPIR-V robust resource access

Posting here for criticism and comments. Once discussions have settled we could put this doc under and investigation/ directory in this repo.

SPIR-V robust resource access

Corentin Wallez, David Neto, Google

This document explores how SPIR-V can be made robust to satisfy the constraints of WebGPU. It describes both shader code transformations and API-side validation that result in complete robust resource access in the SPIR-V shader (attributes aren't handled). The SPIR-V logical addressing model that would have to be required for portability simplifies the analysis immensely and leave only a few cases to handle.

What is robust access

SPIR-V is a candidate shading language for WebGPU. One hard constraint of WebGPU is that it should be secure and in particular prevent shaders from reading memory that could contain data from other applications. This includes both accessing other applications' GPU addresses as well as reading uninitialized data that could contain leftover values. The same goes for preventing memory writes outside of the memory the application owns, and atomics.

The Vulkan API defines two primary types of resources: buffers and images. Buffers include the equivalent of OpenGL SSBO and UBO and views into those. (See Vulkan section 11. Resource Creation.)

OpenGL and Vulkan have introduced a concept of "robust buffer access" that can be enabled and uses implementation specific techniques to make sure read and writes to buffer resources cannot access out of the bounds of the resources passed to the shader. In bounds accesses aren't modified but out-of-bounds accesses can become any of the following:

Be discard for writes
Access any location within the resource for reads and writes
Return zero values for reads or (0, 0, 0, X) with X being 0, 1, -1, or extrema for integers, or -0.0, +0.0, -1.0, +1.0 for floating point values
Atomics can return undefined values.

Robust buffer access also covers vertex and index fetching which won't be discussed in this document. Vulkan and OpenGL don't cover robust image access however.

OpenGL ES robust buffer access extension

Vulkan robustBufferAccess feature (first item after the big feature table)

Because not all drivers implement robust buffer access (and some implement it incorrectly), we need to make sure we can add robust buffer access to existing shaders via a code transformation. Below we make a couple of assumptions:

Resources are cleared on creation, or appear as if they are, so that no uninitialized memory can be read.
On the API side we are able to validate at draw-time that the buffer views have a minimum size, and skip GPU work if the validation fails.

SPIR-V logical addressing mode

SPIR-V modules must declare and use a single addressing model that describes the shape of pointers. The logical addressing model encodes the constraints of "shaders for graphics" in APIs like OpenGL, Vulkan and even D3D12. These environments use a model where no recursion is allowed, and where there are no general-purpose pointers, but instead references within objects allocated before the shader begins. Such a reference is either an object reference (an OpVariable definition), or derived from such a reference via sub-object indexing (an "access chain"). Examples of sub-object indexing include going from a struct to one of its members, or from an array to one of its elements. The environments also require that if a shader was completely inlined, then the implementation could statically infer which resource each memory access touches.

Note on terminology: SPIR-V calls an object reference a "pointer". Different environments define what you can do with a pointer. We are describing logical addressing mode which are designed to tightly constrain pointer semantics, and therefore logical addressing mode pointers are not general-purpose pointers like we are used to in C++ or C.

The SPIR-V validation rules add the following constraints when using the logical addressing model:

An object referenced by an OpVariable cannot contain any pointer. (You can't load or store a pointer.)
All constant indices to array accesses must be positive (or unsigned).
A pointer can only be created from OpVariable, OpAccessChain, OpInBoundsAccessChain, OpFunctionParameter, OpImageTexelPointer, OpCopyObject.
A pointer can only be used by OpLoad, OpStore, OpAccessChain, OpInBoundsAccessChain, OpFunctionCall, OpImageTexelPointer, OpCopyMemory, OpCopyObject, OpAtomic.

Basically in logical addressing mode, pointers correspond to partial address computations that can only be used to act on elementary types or create pointers to parts of the pointee:

A pointer is an opaque value:
- A pointer has no bit representation, no size, and can't be stored or loaded.
- A pointer cannot be reinterpreted as a pointer to a different type.
- A pointer cannot be converted to a numeric type.
- You can't convert a different type (e.g. unsigned integer) to a pointer.
All "pointer arithmetic" is done via the "access chain" instructions OpAccessChain and OpInBoundsAccessChain.
They can only point to a full object (structure, array, matrix, vector, scalar or boolean, or an opaque object without substructure such as an image). Pointers at a range in an array aren't possible.
Pointed-to objects cannot contain pointers (OpVariable constraint).
There are no function pointers.

Our strategy to add robust buffer access to existing shaders is to make pointers that would be out of bounds be "clamped" to be in bounds. This way all subsequent out-of-bounds load, store or atomic operations act on some in-bounds location, which is an allowed robust buffer access behavior.

Ways to get a pointer are the following:

Via OpFunctionParameter or OpCopyObject that have the same value as an existing pointer that is in-bounds by construction.
Via OpAccessChain and OpInBoundsAccessChain which are address calculations in the sense that they go from a reference to an object X to a reference to a subobject of X. (An OpAccessChain instruction is the equivalent of an LLVM getelementptr instruction where the first index element is zero.) These operations will have to be instrumented to always return pointers that are in-bounds.
Via OpImageTexelPointer, which are address calculations for texels inside a swizzled image. Likewise these will have to be instrumented.
Via top-level OpVariables that point to resources. The API side can make sure the resources are big enough to hold the pointed type entirely (up to unsized arrays, see discussion below).
Via function-local OpVariables that point to local variables. These are in-bounds by construction, have the a lifetime of the whole function, and cannot be returned by the function.

Detail of transforms

Clamping of image accesses {#clamping-of-image-accesses}

Access to images without a sampler must be done in-bounds as there is no guarantee from SPIR-V. This covers both OpImageTexelPointer and other sampler-less image operations like OpImageRead, OpImageGather etc.

All these operations take an OpTypeImage or a pointer to an OpTypeImage. The size of images can be queried via OpImageQuerySize and OpImageQuerySizeLod from the OpTypeImage. Thus for all sampler-less image operations it is possible to query the image size and clamp the coordinates before they are passed to the operation (the API can validate that all images' views are at least 1x1 on the base mip level).

With such a transform, all pointers returned from OpImageTexelPointer are guaranteed to be in-bounds.

Simple access chain clamping

OpAccessChainInBounds is equivalent to OpAccessChain except that the application pinky-promises that the access is in bounds. We don't trust that application so we'll treat OpAccessChainInBounds like OpAccessChain.

The OpAccessChain instruction acts on a reference (pointer) to a composite object. Composite objects are matrices, vectors, or structures or arrays of other objects (composite or scalar). OpAccessChain is used to create a pointer to a sub-object from a bunch of indices that are used at each step to select the structure member or array element. (An OpAccessChain instruction is equivalent to an LLVM getelementptr instruction that has a first index of zero.)

The transform would insert code to clamp indices to a valid range. The valid range for each index depends on the pointee type being referenced at that level of indirection. For structure, matrix, vector, and fixed-size array types:

The minimum index is zero. SPIR-V requires each of these types (except structure) to have at least one member
The maximum index is derived from the pointee type definition. For example:
- The maximum index into an array is one less than the number of elements in the array. An array length may be a "specialization constant" but the values for these are fixed at pipeline creation time.
- The maximum index into a structure is one less than the number of structure members.

An index into structure value must be a compile-time constant and will have been validated by a SPIR-V validator before the robust buffer access transform is run. Access into structures with no member will have been validated by the validator too.

Except for a pointee type of OpTypeRuntimeArray, the maximum index is known at compile time or pipeline creation time and a clamp of the index with a constant will be added. The only case remaining is for OpTypeRuntimeArray.

Fat-pointers for unsized arrays

The top-level structure for a resource can have an OpTypeRuntimeArray as its last member that corresponds to an unsized array at the end of an SSBO in OpenGL / Vulkan. This is the only allowed case for unsized arrays.

The size of the array can be queried in the shader from a pointer to its parent struct via OpArrayLength. This means that when an access chain starts from the struct and also indexes the runtime array, we can easily add clamping. The only difficult case is when a pointer to an OpTypeRuntimeArray is created, as it is not possible to go back to its parent struct to query the size.

Pointers to unsized arrays must become "fat-pointers" that contain both the size and a pointer to the OpTypeRuntimeArray. This transformation isn't hard to make and can be done without force-inlining all the SPIR-V code. Then when an OpAccessChain operates on an OpTypeRuntimeArray, it clamps the first index with the size stored in the fat pointer. The API-side validation can ensure there is sufficient space for at least one element of the unsized array so the resulting pointer is in-bounds.

Validation on the API side

The code transforms assume the following:

The top level of image views isn't empty.
Buffer views contain enough space for the "sized" part of the structures, and at least one element of the unsized part if present.

Assumption 1) is easy to validate and something that we want to check for in WebGPU anyways as, for example, creating an empty image view is invalid in Vulkan.

Assumption 2) will require additional checks that wouldn't happen on backends with correct robust buffer access behavior. Such validation would work by computing, for each resource of each pipeline, what the minimum size needs to be, then on draw commands the size of bound buffer views would be checked (optimized with dirty bits etc. of course).

Conclusion

We've shown a practical way to instrument SPIR-V code to add robust resource access behavior, with help from a little API-side validation. All the statically checkable constraints on SPIR-V code for this behavior are already encoded in the logical addressing mode such that only some instrumentation for runtime validation is needed. The logical addressing is not too constraining however as it is sufficient for all Vulkan (and D3D12) games, and a requirement for WebGPU for portability.

A drawback of using API-side validation to add back robust resource access behavior is that depending on hardware / backing API support for robust buffer access, the application might see different results for the same program. A command buffer might mostly run when there is hardware support, but be validated out when emulation happens, leading to different rendering results. Requiring at least one element of the unsized part of resources is another wart that we couldn't get rid of.

Depth test state

As agreed in meeting, discussion of depth test state. The question was (as I understand it) do we include separate flag for enabling depth test or just infer it from write mask and depth function (i.e. depthTestEnable = depthFunc != ALWAYS || depthWrite).

Vulkan and D3D12 have a separate flag, whereas Metal doesn't. Reading section about depth test of the Vulkan spec suggests that ditching separate state flag for enabling depth test is OK, i.e. depthTestEnable set to VK_FALSE is indistinguishable in terms of observable side effects from setting it to VK_TRUE, depthCompareOp to VK_COMPARE_OP_ALWAYS and depthWriteEnable to VK_FALSE. I presume the same goes for D3D12 since I wasn't able to find in MSDN anything suggesting otherwise.

Also, I think similar reasoning can be applied to stencil test state.

[meta] Big questions

List of big questions to resolve for the content of the API (not the shape):

resource allocation
- how are resources bound to memory
- whether memory heaps are exposed or implied
resource capabilities - #37
- do we ask the user for the set of potential usages for each resource (akin D3D12_RESOURCE_FLAGS) or imply this
multiple queues
- do we expose multiple hardware queues (limited in Vulkan, unlimited in D3D12) or try to parallelize in the driver itself (like Metal)
memory barriers- #27:
- choices: implicit, half-explicit (with only the destination state provided), fully explicit (Vulkan, D3D12)
partial pipeline states
- do we create blend, depth-stencil, raterizer (and other) states independently, or only as a part of a bigger pipeline state?
- support for non-dynamic states (Vulkan only)
tile memory
- do we take advantage of tiled renderers?
- existing APIs: tile memory in Metal, multiple sub-passes in Vulkan
- blocked by D3D12
data upload/download - #45
- persistent mapping support
- coherency
- callbacks versus polling
web workers support
- resource creation
- command buffer recording
shader languages - #42,#43,#44:
- consume SPIR-V and/or HLSL
- support for pipeline/shader caches
DOM integration
- swapchain initialization, frame sync
- canvas/OffscreenCanvas interop
- WebXR integration
- video elements, image elements, etc.
security == portability (~~N==NP~~):
- is there a portability trade-off with performance that we can take without compromising the security?

Ability to issue command and create resources from workers

Guys from Google already said that it's planned, but here's an issue anyway so it won't be forgotten :)

Resource copying/clearing/updating investigations

Native APIs provide different constraints and features when it comes to resource copies and clears, where resources can be buffers or images. In this issue, we'll try to find a common ground (a least common denominator API) that is usable and efficient on all backends.

In Metal, all of the copy/clear operations are done via the MTLBlitCommandEncoder.
In Vulkan, these are transfer operations, supported on any queue type. They require TRANSFER_SRC flag on the source and TRANSFER_DST flag on the destination.

Operation table

operation/backend	Vulkan	D3D12	Metal
clear buffer	vkCmdFillBuffer	views only with ClearUnorderedAccessView*	nothing
clear image	vkCmdClearColorImage, vkCmdClearDepthStencilImage	views only with ClearRenderTargetView, ClearDepthStencilView	nothing
update buffer	vkCmdUpdateBuffer, limited to 64k	nothing	nothing
update image	nothing	nothing	nothing
buffer -> buffer	vkCmdCopyBuffer	CopyBufferRegion	copy
buffer -> image	vkCmdCopyBufferToImage	CopyTextureRegion	copy
image -> buffer	vkCmdCopyImageToBuffer	CopyTextureRegion	copy
image -> image	vkCmdCopyImage	CopyTextureRegion	copy
image blit	vkCmdBlitImage	nothing	generateMipmaps

Buffer Updates

In D3D12, the only way to update a buffer with new data coming from CPU is to use a staging buffer (that is mapped, filled, then copied to the destination).

In Metal, similar effect can be achieved by creating a buffer with makeBuffer that re-uses the existing storage.

In Vulkan, the implementation may have a fast-path for small buffer updates by in-lining the data right into the command buffer space. The implementation can fall back to a staging-like scheme for larger updates.

Image Blitting

Image blits are different from image copies for allowing format conversion and arbitrary scaling with filtering. A typical use case for blitting is mipmap generation. It is not clear to me why/how Vulkan provides this on a transfer-only queue, but other APIs are far more (and reasonably) limited with regards to where and how they can blit surfaces.

Alignment rules

Vulkan

VkPhysicalDeviceLimits has optimal alignments for buffer data when transferring to/from image:

optimalBufferCopyOffsetAlignment is the optimal buffer offset alignment in bytes for vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer
optimalBufferCopyRowPitchAlignment is the optimal buffer row pitch alignment in bytes for vkCmdCopyBufferToImage and vkCmdCopyImageToBuffer

These are not enforced by the validation layers but are recommended for optimal performance.

D3D12

MSDN section lists the following restrictions:

linear subresource copying must be aligned to D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT (512) bytes
row pitch aligned to D3D12_TEXTURE_DATA_PITCH_ALIGNMENT (256) bytes

Proposed API

Clears

D3D12 model appears to be the least common denominator. If we have the concept of views, we can have API calls to clear them. In Vulkan, these calls would trivially translate into direct clears. In Metal, we'd need to run a compute shader to clear the resources. Supporting multiple cear rectangles seems to complicate this scheme quite a bit, so I suggest only doing the full-slice clears.

Updates

Given the limited support of resource updates, I suggest not providing this API at all in favor of requiring the user to use staging resources manually.

Copies

All 3 APIs appear to provide the copy capability between buffers and textures. The difference is mostly about the alignment requirements. I suggest having device flags to the minimum offset/pitch required:

D3D12: equal to D3D12 constants
Vulkan: equal to optimal alignment features
Metal: some reasonable default selected by Apple

Blits

D3D12 doesn't support any sort of blitting, I'm inclined to propose no workarounds here. Users doing simple render passes for blitting textures shouldn't be slower than emulating this in the API, anyway.

Afterword

This analysis may be incomplete, corrections are welcome to go directly as the issue edits.

Command Queue investigations

Glossary

Instance (VkInstance, IDXGIFactory, -)
Adapter (VkPhysicalDevice, IDXGIAdapter, -)
Device (VkDevice, ID3D12Device, MTLDevice)
Command Queue {VkQueue, ID3D12CommandQueue, MTLCommandQueue}
Command Buffer (VkCommandBuffer, ID3D12CommandList, MTLCommandBuffer)
Render Pass (VkRenderPass, -, MTLRenderCommandEncoder)

Queues

Command Queue objects exist in all three APIs: VkQueue, ID3D12CommandQueue, and MTLCommandQueue.
Command Queues are specific to Contexts.
Command Buffers are submitted via Command Queues.
In Metal, Command Buffers are created from Command Queues, whereas in Vulkan and D3D12 they are created from Devices, and submitted to a same-Device Command Queue later.

While queues in D3D12 and Metal are created as-needed, queues in Vulkan are created at during device creation.
vkCreateDevice takes an array of VkDeviceQueueCreateInfos, which respectively specify the number of queues to create for each "queue family".
Queue families are enumerated via vkGetPhysicalDeviceQueueFamilyProperties(VkPhysicalDevice), which details the VkQueueFlags supported by that queue family.
A device's queues are then retrieved with vkGetDeviceQueue(device, queueFamilyIndex, queueIndex).

Command Queue types

There are three types of commands in each API:

Graphics (Graphics, Direct, Render)
Compute
Transfer (Transfer, Copy, Blit)

Compute and Transfer don't seem to be fleshed out in D3D12 yet? (ID3D12GraphicsCommandList is the only documented child of ID3D12CommandList)

D3D12

Each ID3D12CommandQueue has a single D3D12_COMMAND_LIST_TYPE:

D3D12_COMMAND_LIST_TYPE_DIRECT (Graphics+Compute+Transfer)
D3D12_COMMAND_LIST_TYPE_COMPUTE (Compute+Transfer)
D3D12_COMMAND_LIST_TYPE_COPY (Transfer)

Vulkan

Each VkQueue may support an or'd combination of flags (VkQueueFlags) in VkQueueFamilyProperties::queueFlags:

VK_QUEUE_GRAPHICS_BIT
VK_QUEUE_COMPUTE_BIT
VK_QUEUE_TRANSFER_BIT

Very relevant:
"If an implementation exposes any queue family that supports graphics operations, at least one queue family of at least one physical device exposed by the implementation must support both graphics and compute operations."

"All commands that are allowed on a queue that supports transfer operations are also allowed on a queue that supports either graphics or compute operations. Thus, if the capabilities of a queue family include VK_QUEUE_GRAPHICS_BIT or VK_QUEUE_COMPUTE_BIT, then reporting the VK_QUEUE_TRANSFER_BIT capability separately for that queue family is optional."

Metal

The command types are surfaced as three distinct MTL*CommandEncoder interfaces, created from the following MTLCommandBuffer methods:

makeRenderCommandEncoder() -> MTLRenderCommandEncoder (Graphics)
makeComputeCommandEncoder() -> MTLComputeCommandEncoder
makeBlitCommandEncoder() -> MTLBlitCommandEncoder (Transfer)

Command Buffers

Due to the differences in command buffer submission for Metal, I'll delve into Command Buffers a bit.

Command Buffer creation

vkAllocateCommandBuffers(VkCommandPool)
ID3D12Device::CreateCommandList(D3D12_COMMAND_LIST_TYPE, ID3D12CommandAllocator)
MTLCommandQueue::makeCommandBuffer()

Command Buffer recording

Begin:

vkBeginCommandBuffer(VkCommandBuffer)
(implicit with CreateCommandList)
(implicit with makeCommandBuffer)
End:
vkBeginCommandBuffer(VkCommandBuffer)
ID3D12GraphicsCommandList::Close()
(implicit with makeCommandBuffer)
Reset:
vkResetCommandBuffer(VkCommandBuffer)
ID3D12GraphicsCommandList::Reset()
(unsupported)

Command Buffer submission

vkQueueSubmit(VkQueue, VkSubmitInfo{ VkCommandBuffer[] }[])
ID3D12CommandQueue::ExecuteCommandLists(ID3D12CommandList[])
MTLCommandBuffer::enqueue()

Rough skeletons:

Vulkan:
device = vkCreateDevice(VkInstance, VkDeviceQueueCreateInfo[])
commandQueue = vkGetDeviceQueue(device)
commandBuffer = vkAllocateCommandBuffers(device)
// ...
vkResetCommandBuffer(commandBuffer)
vkBeginCommandBuffer(commandBuffer
vkCmdBeginRenderPass(commandBuffer, VkRenderPass, VkFramebuffer)
vkCmdDraw(commandBuffer)
vkCmdEndRenderPass(commandBuffer)
vkEndCommandBuffer(commandBuffer)
vkQueueSubmit(commandQueue, commandBuffer)

D3D12:
commandQueue = device.ID3D12Device::CreateCommandQueue()
commandBuffer = device.ID3D12Device::CreateCommandList()
// ...
commandBuffer.ID3D12GraphicsCommandList::Reset()
// implicit Render Pass
commandBuffer.ID3D12GraphicsCommandList::DrawInstanced()
commandQueue.ID3D12CommandQueue::ExecuteCommandLists(commandBuffer)

Metal:
commandQueue = device.MTLDevice::makeCommandQueue()
// ...
commandBuffer = commandQueue.MTLCommandQueue::makeCommandBuffer()
// potentially commandBuffer.MTLCommandBuffer::enqueue() already
renderPass = commandBuffer.MTLCommandBuffer::makeRenderCommandEncoder()
renderPass.MTLRenderCommandEncoder::drawPrimitives()
renderPass.MTLCommandEncoder::endEncoding()
commandBuffer.MTLCommandBuffer::commit()

Fences

Vulkan

VkFence

Signals host from GPU.

"Fences are a synchronization primitive that can be used to insert a dependency from a queue to the host. Fences have two states - signaled and unsignaled. A fence can be signaled as part of the execution of a queue submission command. Fences can be unsignaled on the host with vkResetFences. Fences can be waited on by the host with the vkWaitForFences command, and the current state can be queried with vkGetFenceStatus."

VkSemaphore

Signals command buffer from command buffer.

"Semaphores are a synchronization primitive that can be used to insert a dependency between batches submitted to queues. Semaphores have two states - signaled and unsignaled. The state of a semaphore can be signaled after execution of a batch of commands is completed. A batch can wait for a semaphore to become signaled before it begins execution, and the semaphore is also unsignaled before the batch begins execution."

VkEvent

Signals queue from queue or host.

"Events are a synchronization primitive that can be used to insert a fine-grained dependency between commands submitted to the same queue, or between the host and a queue. Events have two states - signaled and unsignaled. An application can signal an event, or unsignal it, on either the host or the device. A device can wait for an event to become signaled before executing further operations. No command exists to wait for an event to become signaled on the host, but the current state of an event can be queried."

D3D12

ID3D12Fence

Signals host or GPU from GPU

Set to a value with Signal(u64)
Polled with GetCompletedValue()->u64
ID3D12Fence::SetEventOnCompletion(UINT64 Value, HANDLE hEvent)
ID3D12Device1::SetEventOnMultipleFenceCompletion(ID3D12Fence[] fences, UINT64[] vals, HANDLE hEvent)
ID3D12CommandQueue::Signal(ID3D12Fence, u64)
ID3D12CommandQueue::Wait(ID3D12Fence, u64)

Windows Event

D3D12 uses Windows Events for the host side of gpu->host synchronization.

Metal

MTLFence

Signals command buffer from command buffer.

Created from MTLDevice
Signaled from encoder
Waited on by encoder

"Drivers may delay fence updates until the end of the command encoder; drivers may also wait for fences at the beginning of a command encoder. Therefore, you are not allowed to wait on a fence after it has been updated in the same command encoder."

MTLCommandBuffer

MTLCommandBuffer::add{Scheduled,Completed}Handler(MTLCommandBufferHandler)
MTLCommandBuffer::waitUntil{Scheduled,Completed}()

MTLCommandBufferHandler is "A block of code to be invoked".

Equivalency

Host can Poll/Wait on GPU with VkFence/ID3D12Fence::SetEventOnCompletion/MTLCommandBuffer::{addHandler,waitUntil}().
Command Buffer can Wait on Command Buffer with VkSemaphore/ID3D12CommandQueue::Wait/MTLFence
Host can delay Command Buffer execution by:
-- VkEvent
-- Not submitting Command Buffers

Proposal

Vulkan/D3D12-style queues, which are readily implementable on Metal.

D3D12 generally uses u64 signals instead of boolean signals in Vulkan and simple signal/wait in Metal.
Metal's semantics are readily implementable on the others, but are less sophisticated in comparison.

VkFence is implementable on ID3D12Fence, and more coarsely on MTLCommandBuffer::add*/wait*.
VkSemaphore is implementable on ID3D12CommandQueue::Signal/Wait and MTLFence.
VkEvent is emulatable on D3D12 and Metal, but does not have a direct equivalent.

Lever + Webgpu

I were inspired about webgpu announcement and continued with Vulkan bindings in my own programming language.

http://boxbase.org/entries/2017/feb/13/lever-and-webgpu/

https://github.com/cheery/lever/blob/master/samples/warpgpu/app0.lc
https://github.com/cheery/lever/blob/master/lib/sdl.lc

Maybe this inspires you back.

Do not link all functions to contexts

It'd be nice not to have all API function tied to some context object. Right now, since the API is context-based, it more or less makes sense, but, for example, in Metal we aren't tied to context objects (or, IIRC, event to a device object) so much. We can create buffers, shader libraries and textures w/o need to explicitly reference a context. Necessity of passing gl context sometimes creates not so nice constructs in code (for example, three.js).

Optional tesselation support in GpuWeb!

Hi,
seeing two current proposals (Apple one based on Webkit, and Google's Chromium based one) they add compute support which is nice but seems there is no exposure of the tesselation stage..
Notice not asking about geometry shaders support which came even earlier on Desktops (OpenGL 3.2) because on mobile they haven't get much successs for ex. current Apple Metal API doesnt' support it..
any way tesselation shaders is supported everywhere on next gen APIs (D3D12 has it, Metal since IOS10/Macos 10.12 and recent iDevices and Vulkan as optional feature)..
Seeing all three APIs seems tesselation should be exposed as an optional capability as Vulkan does..
Just saying hope tesselation is incorporated in WebGPU spec from start and not as an afterthought!!

Thanks..

Keep viability for compilation to WebAssembly or transpilation via Emscripten in mind

This is migrating the gpuweb/admin#2 issue. Originally from @thokra1

First off, something on the status-quo as I perceive it. My major gripe with WebGL is its horribly slow adoption rate and the fact that WebGL2 will be outdated as soon as it is ratified compared to GLES3.2 and OpenGL 4.5 certainly does not help. I currently find myself implementing an OpenGL 4.5/GLES2 based rendering framework for a special purpose (so Unity and Unreal are currently intentionally out of the picture). While being able to (de-facto) seemlessly transpile the GLES codepath to WebGL, it is very disconcerting having to drop all the goodness we have come to expect from OpenGL 4.5 (including but not limited to the AZDO paradigm). I imagine Unity and Unreal developers feel the same way. Frankly, as a developer mainly dabbling in stuff on the desktop, it is beyond me why exposing desktop OpenGL in a browser and on mobile devices should not be possible today - current hardware certainly should not pose much of a problem (please correct me if I'm wrong there).

Despite that major problem, having a single code base, for instance in C++, which can be compiled for different targets like x86_64 and ARM and (at this time) transpiled to JavaScript is pretty sweet. As a developer working primarily on and for the desktop, I'd like to be able to reuse my C++ code wherever possible. IIUC, the formalization of WebAssembly and the possibility to let your compiler spit out the IR will only make this more convenient - which is also a major step in the right direction. Even though my knowledge of the Web as a platform is limited, step-by-step elimination of hurdles for multi-platform deployment and consistent user experience is what I consider one of the most important mid- or long-term goals.

This also applies to the WebGPU proposal. For example, I'd like to be able to write something on the desktop and deploy it on all relevant platforms - ideally using the same set of features. Having a close or even one-to-one API correspondence currently enables me to do that when I limit myself to GLES2 and WebGL. I urge you to keep that in mind when specifying a WebGPU API - I don't care how this is achieved, but I desperately want it and I think we all need it. Ideally, I imagine a unified approach to all platforms which currently can be (at least to a degree) achieved with the help of Emscripten and frameworks like electron.

This immediately impacts certain choices currently made for the proposal which I find highly debatable, including the choice of shading language. Being forced to deploy shaders written in two different languages unnecessarily increases complexity. Also, why would you demand that developers learn a fringe shading language and why would you define another IR for something that is already in place with e.g. SPIR-V? I can see some of the benefits of the Metal SL (like being able to modularize properly without having to employ a de-factor VFS with GL_ARB_shading_language_include). I also understand that for something that comes out of Apple, it makes sense to advertise in-house technology, but I think we all might be better off if you just turn to your fellow Khronos members and handle this business with all members of the industry participating. Just improve GLSL or define something new as a concerted industry effort (unless you want the Metal SL to be just that, in which case I want you to provide a SPIR-V compiler).

Something on which I'd like to have some clarification is the fact that OpenGL and Vulkan, while usable to achieve the same observable result, they are very different APIs which both exist in their own right. I can see a ton of scenarios where implementing a Vulkan renderer is not necessary. On the other hand, OpenGL carries some historically based design flaws which can be properly handled with Vulkan (e.g. multi-threaded command buffer submission) and, by design and due to resulting implementations, is (supposedly) not able to outmatch Vulkan in regards to performance.

While the WebGPU proposal addresses the necessity to expose something akin to Vulkan in the browser, this does not solve any problems for people wanting to use the latest GPU features through OpenGL - unless someone comes up with an implementation of OpenGL atop WebGPU. I don't want to be forced to use either technology - I want to be enabled to choose the API according to necessity. Do you think this is possible and something to strive for?

s/GPU for the Web/GPU on the Web/

In the title of README.md it says "GPU for the Web" whereas "GPU on the Web" seems to be the CG title.

The case for passes

We've been going back and forth on the need for render/compute/blit passes, and I'd like to systematize the arguments and suggest models we could navigate from.

In native APIs, Vulkan has render passes only, Metal requires all the work to be done in one of the blit/compute/render pass. And D3D12 does not have any passes yet.

Here are the major areas (I can think of) that could take advantage of a pass concept, and the ideas on how they might be relevant to WebGPU:

Resource bindings inheritance - #24

In Metal, resource bindings are only valid with the scope of a pass, where in Vulkan/D3D12 they are scoped to command buffers. This allows Metal runtime/driver to allocate descriptors based on everything used in a pass. This isn't a benefit for WebGPU, given that we have explicit resource bind groups created by the user.

Synchronization and validation - #27, #59

In Vulkan the memory barriers and layout transitions are explicit. In D3D12 the resource state transitions as well as UAV/aliasing barriers are mostly explicit, but there are rules for automatic to/from conversion to the "common" state. In Metal, memory barriers are mostly implicit. They are inserted automatically at the render pass boundaries, around individual dispatch and transfer calls.

For WebGPU, we decided to not expose the memory barriers in the API, and thus we can insert the appropriate synchronization commands between passes. In order for this to be valid, we introduce the "Usage" (mutable) state of a resource and declare a requirement of each resource having a constant usage across a pass. However, we also decided to automatically synchronize individual dispatch (UAV barriers only) and transfer calls, much like Metal. It isn't a long stretch to go from here to a full synchronization (allowing layout changes, for example), in which case the construct of a compute and transfer pass stops being useful (in the context of memory barriers).

Work scheduling - #22

Hardware queues on Vulkan and D3D12 are explicit. In Metal, the MTLCommandQueue is also exposed, but it doesn't have a direct relation to a hardware queue. Metal runtime figures out the dependencies between passes (which is sound, given that all the work is split into passes) and reserves the right to actually execute them on different hardware queues (note: how much this is the case in reality is unknown to me).

In WebGPU we decided to not expose the hardware queues, thus I find the opportunity of automatically taking advantage of them to be quite strong to miss.

Type safety

Each kind of a pass has a restricted set of operations that can be recorded. E.g. one can't dispatch() or bind compute resources in a MTLBlitCommandEncoder pass. This makes the API more usable, assists documentation, and is generally is a good practice for API design.

SPIR-V sources of undefined behavior

Another investigation on SPIR-V, this time on the sources of undefined behavior.

Skipped classes of UB

The parallel nature of GPUs make it so the result of running a shader can depend on invocation ordering, how data-races are resolved, etc when cross-invocation interactions happen. We can't really validate or paper over these UBs so they won't be addressed here. This also interacts with some compiler optimizations that for example reorder writes to memory: if the shader doesn't have the proper barriers there will be UB.

Other UBs are when a the shader reads or writes memory outside of the resources it was provided. These issues were already discussed in #33.

Also mathematical floating point precision UB won't be addressed here, and we think we should not address them at all as they depend on how the hardware performs computations. The CG should however specify bounds on the error allowed for floating point operations.

UBs in SPIR-V

Overall the SPIR-V spec tries to be tight for valid shaders, up to spec bugs such as the OpPhi one @litherum pointed out. Sources of undefined behavior are called out in the spec and are the following:

OpUndef that can be evaluated to any bit pattern. We should forbid this for WebGPU.
Division by 0 and friends. We can either say it returns a fixed value like 0, or leave it as a UB as long as it doesn't break the security of WebGPU.
OpVariable without initializers. We should forbid this for WebGPU.
OpVectorIndexDynamic and OpVectorInsertDynamic, i.e. float4 v; v[4]; v[4] = foo;. We could provide "robust resource access"-like guarantees for this that access to invalid indices can either be removed or access any element.
OpVectorShuffle takes constant components, and allows for them to be FFFFFFFF which means the component is undefined. We should forbid this for WebGPU.
Shifts by more than the size of integer type produce undefined value. We could either keep this if it doesn't break the security of WebGPU or for example AND the shift with a mask.
OpBitFieldInsert, OpBitFieldSExtract and OpBitFieldUExtract when the range of bits set / extracted is out of bounds. We could either keep this if it doesn't break the security of WebGPU, or for example return 0 for invalid out of bounds.
Any operation using derivatives outside uniform control flow. It is impossible to know statically when control-flow will be uniform, and historically this UB hasn't been a problem for WebGL.

Other undefined behavior that won't affect WebGPU shaders:

OpConstantSampler with None sampler addressing mode and OOB access. OpConstantSampler isn't available in Vulkan and should not be exposed by WebGPU.
Geometry shader OpEmitVertex and OpEmitStreamVertex makes output variables undefined. We don't have geometry shaders in WebGPU because Metal doesn't have them.
OpCaptureEventProfilingInfo when called multiple times with the same event. This is a "big compute" functionality that won't be present in WebGPU.

UBs in extended instruction sets

Not all builtins are in the SPIR-V spec directly, and some are implemented in "extended instruction sets". The only one that's relevant is the GLSL.std.450 that essentially adds more math builtins. There are more UB in there related to the argument space for math functions acos(2 * Pi) = ?. Also clamping operations are undefined if max < min.

Choosing a binding model

This is the detailed investigation I promised to put on Github last meeting. It explains at a high-level the binding models of all 3 explicit APIs, and exposes why we believe Vulkan's binding model should be used as a starting point.

See also these slides which contain more info about bindless vs. fixed-function and the Vulkan binding model, and our initial investigation for NXT (with some outdated code samples).

The binding models of explicit APIs

Metal’s binding model

Metal’s binding model is basically the same as OpenGL’s and D3D11: there are per-stage global arrays of textures, buffers and samplers. Binding of shader variables to resources is done by tagging arguments to the main function with an index in the tables.

There are functions on the encoders (aka command buffers) to update ranges of the tables, with one update function for each stage, for each table type (and sometimes more than one, mostly for convenience). The update to the tables is immediate and the driver handles any synchronization that might be needed transparently. For example:

D3D12’s binding model

D3D12’s binding model is more geared towards “bindless” GPUs in which resources aren’t set in registers in fixed-function hardware and referenced by their index, but instead described by a descriptor living in GPU memory and referenced by their descriptor’s GPU virtual address.

Things look as if the shaders had only access to one global UBO by default, called “root signature” that is an array of elements that can be one of three things:

A small constant
An inlined descriptor for a resource
A pointer to a range of a “descriptor heap”

A shamelessly re-used slide.

Individual elements root signature can be updated directly in command lists with updates appearing immediately to subsequent draws / dispatches. For example:

ID3D12GraphicsCommandList::SetGraphicsRoot32BitConstant changes a constant at a specific index in the root signature
ID3D12GraphicsCommandList::SetGraphicsRootShaderResourceView puts a texture / buffer descriptor at a specific index in the root signature
ID3D12GraphicsCommandList::SetComputeRootDescriptorTable changes the pointer to a descriptor heap at a specific index in the root signature

Using a root signature in shaders is done like below. The root signature layout is described and gives each “binding” a register index, then these register indices are bound to variable names and finally, the “main” function is tagged with the root signature layout. The extra level of indirection using register indices seems to be to help with porting from D3D11 shaders.

// A example from the D3D12 docs, trimmed down, that defines a root signature with in order
//  - A constant buffer binding “b0”
//  - A pointer to a descriptor heap range containing…
//    - A constant buffer “b1”
//    - An array of 8 textures/buffers starting at “t1”
// - A constant named “b3”
#define MyRS1 \
              "CBV(b0), " \
              "DescriptorTable( CBV(b1), " \
                               "SRV(t1, numDescriptors = 8)), " \
              "RootConstants(num32BitConstants=3, b10)"
 
 
// Binding of part of the signature to variables names
cbuffer cbCS : register(b0) {
    uint4 g_param;
    float4 g_paramf;
};
 
// Using the root signature for this entry point (MyRS1 would be actually defined in a header).
[RootSignature(MyRS1)]
float4 main(float4 coord : COORD) : SV_Target
{
…
}

On the API side, a corresponding ID3D12RootSignature object must be created that represents the layout of the root signature. This is needed because the actual layout on GPU might not match what was declared, to allow the driver to optimize things or do some emulation on hardware that isn’t bindless enough.

When compiling an ID3D12PipelineState, the root signature must be provided so that the compiled shader code can be specialized for the actual layout of the root signature on the GPU. Then in ID3D12GraphicsCommandList::SetGraphicsRootSignature must(?) be called before any update to the root table or call to ID3D12GraphicsCommandList::SetPipelineState, so that the command list too knows what the actual layout on the GPU is.

Having the same root signature used to compile multiple pipeline makes sure the driver doesn’t have to shuffle data around when pipelines are changed, since they are guaranteed to access descriptors by following the same layout.

More info on root signatures can be found in the MSDN docs.

One thing we didn’t mention yet is how to populate a descriptor heap with data. MSDN docs aren’t super-clear but my understanding is that there are two types of descriptor heaps:

Non-shader visible descriptor heaps that can be used for staging before copying to shader-visible heaps, or for descriptors used to set render-targets, since it is still not bindless on most HW.
Shader-visible heaps, which can be written but not read by the CPU(?) and are essentially opaque heaps that can be on the GPU (for bindless use) and/or on CPU (for emulation on fixed-function HW).

ID3D12Device::CopyDescriptors can be used or descriptors created directly in the heaps. Then to be able to use a heap with, for example SetGraphicsRootDescriptorTable, the heap must be bound to the current command list with ID3D12GraphicsCommandList::SetDescriptorHeaps. Limitations are that only one sampler heap and one SRV/UAV/CBV descriptor heap can be bound at a time. Also heaps cannot be arbitrary large on some hardware, with samplers being at max 2048 descriptors and SRV/UAV/CBV heaps a million descriptors. Switching heaps can cause a wait-for-idle on some hardware.

One restriction worth pointing out is that descriptor heaps cannot be created to contain both samplers and other descriptors at the same time.

Vulkan’s binding model

Vulkan’s binding model is essentially a subset of D3D12’s and more opaque to the application. This is because Vulkan needs to run on mobile GPU that are more fixed-function than what D3D12 targets.

Another shameless slide reuse:

This is similar to D3D12’s root signature except that:

There are no descriptors directly in a root table.
There is a fixed number (at least 4) of root descriptor tables.
There is a fixed number of root constants.
Also in Vulkan these two concepts are not presented together, instead each descriptor table is called a descriptor set and root constants are called push constants.

Binding shader variables to specific locations in a descriptor set is done like shown below (also shows push constants) (from this slide):

// Example in GLSL because SPIRV would be much more verbose
layout(set = 1, binding = 0) uniform texture2D albedo;
 
layout(push_constant) uniform Block {
    int member1;
    float member2;
    ...
} pushConstants;
 
float foo = texture(t, texcoord).r + pushConstants.member1;
 
// Could get compiled to:
 
Descriptor* set1 = RootConstantRegister[1];
TextureDescriptor* albedo = set1[0];
 
Int pushConstants_member1 =
    RootConstantRegister[PUSH_CONSTANT_START + 0];
float foo = Sample2D(albedo, texcoord).r
    + pushConstants_member1;

On the API side, like in D3D12, a layout object needs to be created and used to during pipeline creation (vkCreateGraphicsPipelines). This object is VkPipelineLayout (vkCreatePipelineLayout) and is made of multiple VkDescriptorSetLayouts (vkCreateDescriptorSetLayout). Then descriptor sets are used in a command buffer (without need for synchronization) via vkCmdBindDescriptorSets and push constants are set with vkCmdPushConstants.

Descriptors are created from a VkDescriptorPool that is similar to D3D12 descriptor heaps, except that the VkDescriptorSets returned are opaque and can only be written to or copied via specialized API functions (not general GPU copy like D3D12(?)).

More info about this in the specification’s section on descriptor sets.

Metal 2

Metal 2 adds the Indirect Argument Buffer concept, which are opaque-layout buffers containing constants and descriptors that can be bound all at once. It is essentially a descriptor set.

In addition to allocating, populating and using IABs, applications must specify on the encoders which resources or resource heaps need to be resident for the draw. Also it looks like Metal allocates all descriptor in a single descriptor heap, and has the same type of limitation as D3D12, but for the whole app. (500.000 descriptors max, 2048 samplers max).

Why use Vulkan’s binding model

During our NXT investigation we found that the Vulkan binding model would be the best one to use, as converting from Vulkan to D3D12 or Metal doesn’t add much overhead while converting to D3D12 or Metal to Vulkan would be expensive. (and we are not smart enough to design a whole new binding model) Also for reasons, NXT uses the term “bind groups” instead of “descriptor sets”.

Vulkan to Metal

When the pipeline layout is compiled we can do “register allocation” of the descriptors in their respective tables. Then the shaders can be changed to refer to the position in the table instead of the (set, location) as in Vulkan. Finally when a descriptor set is bound in the command buffers, we just need to call the necessary MTLGraphicsCommandEncoder::set<Textures/Buffers/Samplers> at most once each. The arguments can have been precomputed at descriptor set creation).

Vulkan to D3D12

Essentially we would only use special case root signatures that look like the Vulkan binding model presented above. Then all operations on descriptors sets and push constants decay naturally to their D3D12 counterparts.

One thing is that it looks like D3D12 can only use a limited number of descriptor heaps in a command list while (AFAIK) Vulkan doesn’t have this limitation. In NXT we plan to solve this mismatch by not exposing descriptor pools and doing our own descriptor set bookkeeping (even in the Vulkan backend): with some sort of compacting GC for descriptor sets we should be able to keep the number of heaps small.

Another issue is that D3D12 has separate samplers heap, which can solved by having potentially two root descriptor tables per descriptor set that has samplers. The hope is that in most applications samplers aren’t present in many places in a pipeline layout.

Metal to Vulkan

In Metal, each a pipeline essentially has its own pipeline layout / root signature layout, so in Vulkan new descriptor sets would have to be created at each pipeline changes. Things could work a bit better on D3D12, as the D3D11 on D3D12 layer shows (Metal and D3D11’s binding models are the same). However this would prevent a large class of optimizations applications can do by reusing parts of the root table or descriptor sets.

D3D12 to Vulkan

There are many small mismatches because D3D12 is more flexible and explicit than Vulkan:

Translating from D3D12 to Vulkan would require runtime creation of descriptor sets when a root descriptor is set.
Vulkan requires the number of each type of descriptor for descriptor pool creation, whereas D3D12 requires the total number of descriptor for descriptor heap creation.
Descriptors aren’t placed explicitly in Vulkan, which would make supporting copy operations a challenge.
Root descriptor tables can alias in D3D12, supporting this in Vulkan would require a ton of state tracking.
...

Inheritance

Another advantage of using Vulkan is that it has very good inheritance semantics for descriptor sets: if you set a pipeline with a different pipeline layout, but with descriptor set layouts matching up to position N, then you don’t need to bind these descriptor sets again. This could translate to not having to reset the whole tables in Metal, and only changing a couple root table descriptors in D3D12.

Device Info APIs

Hey all,

During TPAC one item that the web perf working group discussed was adding an api (w3c/device-memory#15) to be able to get 'WEBGL_debug_renderer_info' info without having to spin up a web gl context. There are many users with lower end GPUs that can't really handle many of the experiences we want to ship so we need to get this info. Unfortunately the users with the lower end GPUs are the same ones where spinning up a web gl context can be extra expensive, there are some cases where spinning up a web gl context can cause the screen to appear to freeze for whole seconds. To address this we want to make an api that allows for this info to be exposed without having to make a web gl context.

There's not any new info being exposed here so fingerprinting isn't really a big concern. This mainly will just avoid paying the large cost of setting up a web gl context for the people who really shouldn't be using one. Also even without 'WEBGL_debug_renderer_info' being exposed the performance characteristics of the gpu will always be observable so all we do by not exposing this info is force the people on lower end devices to pay a large cost to render things.

One of the follow ups from TPAC was to see if anyone here has any big concerns before we move forward. Like I said this information is already available today and we need to and do get it, but doing so can be a very bad user experience on some devices.

WebGPUCommandEncoderDescriptor doesn't appear to be defined anywhere in the sketch IDL

gpuweb/design/sketch.webidl

Line 500 in 8ffe46f

 WebGPUCommandEncoder createCommandEncoder(WebGPUCommandEncoderDescriptor descriptor); 

I ⌘-F'ed for it, but didn't see any definition.

Proposal: synchronize unordered access views at pass boundaries

TL;DR: WebGPU should preserve the order of all side effects (transfer read/writes, pixel read/writes) except for Unordered Access Views, which are only synchronized at the render/compute pass boundaries.

Introduction

Graphics hardware provides certain guarantees about the order of operations. We don't observe the actual execution order of the shaders, since they are largely executed in parallel, but we can observe their side effects, such as output colors written to the texture targets.

In a graphics pass, the only side effect that is not ordered is Unordered Access Views (or UAV in short - the term comes from DirectX, while GL land calls it shader storage buffers - SSBO) writes. All the other side effects are ordered according to the primitive submission order (draw call order -> instance order -> primitive order):

RTV and DSV during rasterization (pixel, depth, and stencil writes)
- note: AMD_rasterization_order allows to remove the ordering guarantees from raster operations in order to get a 5% performance gain. We should be able to figure out internally if doing so introduces any (additional) data races without exposing it to the user.
stream output (i.e. transform feedback)
Ordered Access Views (OAV)

In a compute pass, UAV is the only way to get something out, so there is nothing left to be ordered. In transfer operations, read-write and write-write hazards are possible, and pipeline barriers are required to serialize those.

UAV

The mechanics of a UAV in D3D12 and Vulkan is such that the user is expected to place memory barriers if they want to serialize the side effects. In Metal, UAV are forcefully serialized at the draw call granularity in compute passes, and at the render pass granularity otherwise.

It's at the core of an UAVs to produce data races, and that is what allows the hardware to read/write them efficiently (no need to synchronize/serialize access). Therefore, for performance/efficiency reasons we don't believe that it's worth trying to enforce synchronization at a finer level than the draw calls. The cost of draw call-level synchronization is also expected to be unacceptably high for render operations, since a tiling GPU would have to flush the whole tile before proceeding after such a barrier. For this reason, Vulkan supports only a very limited set of pipeline barriers inside render passes.

Proposal

Document UAVs as a special kind of resource view that has a wide synchronization scope - the render/compute pass boundary. Any dependent operations within this scope are then considered non-portable, although it would be hard (if possible at all) for an implementation to detect those and warn appropriately.

Note that the proposal is based on the constraint that each resource would have to be only in either a single writable state or a combination of readable states during a pass. This automatically prevents a situation where the user would want to write to an UAV and then re-bind as an SRV/CBV within a pass.

For transfer operations, the API knows precise resources affected and their ranges, since those are explicitly provided by the user for copy/blit calls. Therefore, an implementation can figure out the possible hazards and insert appropriate barriers automatically. It doesn't have to be smart, could just optimize later by removing some of the barriers it considers unnecessary. For this reason, grouping operations into a "transfer/copy pass" does not appear to bear much of a value, and we think the group should reconsider having those passes in the API.

Issues

Why not insert automatic barriers between compute dispatches like in Metal?

Mainly because it's not consistent with render passes. If compute UAV side effects are synchronized at the dispatch boundary, then the users will seek ways to avoid hitting that synchronization point in cases where their use of an UAV is guaranteed to be portable at the logic level that is not visible to WebGPU implementation. These ways could consist of trying to build mega-shaders that do many operations at once, which is counter to what they'd do in Vulkan/DX12 and not productive (working around the API instead of taking the benefit of it). If the UAVs are synchronized at the pass barriers, the users always have an option to break a pass (and start a new one) if they need to depend on previous writes.

In Metal, automatic barriers made more sense because there is no constraint on a resource usage being static across a pass. If we do this in WebGPU, then we'd need to reconsider the static usage constaints, and it would hurt optimal performance on Vulkan and D3D12 backends.

Precompiled shader blobs API

Some closed platforms like game consoles disallow shader compilation on the user machine. This makes the showing of arbitrary WebGL (so far) content extremely tough.
It would be great for the gpuweb API to allow the developer to use Blobs as pre-compiled shader code. Although not very web-y because the blobs will have to be platform-depndent, it will deal in a standard way a problem that otherwise is pretty much impossible to solve.

Investigation: D3D12 does not support SRC_COLOR et al used in SrcBlendAlpha slot

D3D12

D3D12 does not support the following enums in the SrcBlendAlpha slot of a D3D12_RENDER_TARGET_BLEND_DESC:

D3D12_BLEND_SRC_COLOR
D3D12_BLEND_INV_SRC_COLOR
D3D12_BLEND_DST_COLOR
D3D12_BLEND_INV_DST_COLOR

"Blend options that end in _COLOR are not allowed." https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ns-d3d12-d3d12_render_target_blend_desc

Indeed, at runtime: "D3D12 ERROR: ID3D12Device::CreateBlendState: SrcBlendAlpha[ 0 ] is trying to use a D3D11_BLEND value (0x4) that manipulates color, which is invalid. [ STATE_CREATION ERROR #114: CREATEBLENDSTATE_INVALIDSRCBLENDALPHA]"

Vulkan

When given *_COLOR in the srcAlphaBlendFactor slot, Vulkan behaves as if the corresponding _ALPHA was passed instead: https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch26s01.html#id-1.28.2.17.5

Metal

Empirically, Metal behaves similarly to Vulkan, but I couldn't find anything in the docs.

Options

Two possible options:

Have WebGPU also disallow these combinations and error at validation time
Make the D3D12 backend do internally what Vulkan/GL/Metal do: map it to corresponding ALPHA variant.

Thoughts?

Position: Directly ingest a human-writable shading language

WebGPU should directly ingest this (#41) human-writable shading language. This is for a few reasons:

Easing Developers’ Burdens: Because almost all shaders are already written in this language, these preexisting shaders can be supplied directly into WebGPU without the need for an additional compilation step.
View-Source: We’ve gotten a significant amount of feedback from the WebAssembly effort that developers wish they had some way of viewing the source that a webpage was written in (and not the source that was supplied to the browser). We’ve heard this from both authors of webpages as well as users who wish they can inspect a webpage’s source. Because one language will be blessed as the preferred WebGPU language, developer tools would attempt to reverse-compile any ingested language to the blessed language, at which point the tools for the language are already in the browse and should thus be made directly available to the web.
Dynamism: Many 3D web applications dynamically create or modify existing shaders at runtime. This kind of workflow would require a webpage to ship a compiler, which is big enough that downloading a compiler from each page load would be slow and potentially costly on metered connections. We have not seen any history of webpages modifying a non-human-writable source program at runtime.
Interoperability: A human-authored shading language can be as interoperable as a non-human-authored shading language. There were times in the early days of Java where its (non-human-authored) byte code was not fully specified and behavior divergences occurred for well-formed programs in compliant runtimes. GLSL in WebGL is a perfect example of this a well-specified human-authored shading language. Secure HLSL improves upon GLSL by fully specifying every construct in the language, thereby eliminating the ambiguities that GLSL had. Secure HLSL includes a comprehensive test suite that the WebGPU Community Group is currently creating.
Size: A compiler for Secure HLSL does not necessarily need to be large. A compiler for a very similar language, WSL, is 22,000 lines of JavaScript (as opposed to LLVM’s 2.5 million lines of C++). Browser binary sizes do not necessarily need to increase dramatically to handle Secure HLSL.
Security: The WebGPU Community Group is proving that Secure HLSL is secure using a static proof solver. In addition, Secure HLSL compilers will be submitted to extensive security testing and fuzzing before ever being exposed to the Web.
Specificity: The WebGPU Community Group is currently writing a specification for Secure HLSL. This will result in a fully-defined specification, including security model and API bindings.
Performance: All modern browsers optimize all source code that passes through them, such as Javascript and WebAssembly. Web authors also routinely perform offline optimizations on their (human-authorable) Javascript sources before serving it to the browser. In addition, all shader source ends up going through the graphics driver where it may be further optimized. The WebGPU Community Group is gathering metrics showing that human-authorable shaders optimize to a form as fast or faster than an optimized form of non-human-authorable shaders.
Versatility: Just as TypeScript and Dart compile to Javascript, other human-authorable shading languages can be compiled to Secure HLSL.

Memory barriers investigations

Memory barrier is an abstraction provided to the graphics API user that allows controlling the internal mutable state of otherwise immutable objects. Such states are device/driver dependent and may include:

cache flushes
memory layout/access changes
compression states

Two failure cases (from AMD GDC 2016 presentation):

too many or too broad: bad performance
missing barriers: corruptions (*)

General information

Metal

Memory barriers are inserted automatically by the runtime/driver.

Direct3D 12

Quote from MSDN:

In Direct3D 11, drivers were required to track this state in the background. This is expensive from a CPU perspective and significantly complicates any sort of multi-threaded design.

Direct3D has 3 kinds of barriers:

State barrier: to tell that a resource needs to transition into a different state.
Alias barrier: to tell that one alias of a resource is going to be used instead of another.
UAV barrier: to wait for all operations on an UAV to finish before another operation on this UAV.

Resource states

A (sub-)resource can be either in a single read-write state, or in a combination of read-only states. Read-write states are:

D3D12_RESOURCE_STATE_RENDER_TARGET
D3D12_RESOURCE_STATE_STREAM_OUT
D3D12_RESOURCE_STATE_COPY_DEST
D3D12_RESOURCE_STATE_UNORDERED_ACCESS

For presentation, a resource must be in D3D12_RESOURCE_STATE_PRESENT state, which is equal to D3D12_RESOURCE_STATE_COMMON.

There are special rules for resource state promotion from the COMMON state and decay into COMMON. These transitions are implicit and specified to incur no GPU cost.

The barrier can span over multiple draw calls:

Split barriers provide hints to the GPU that a resource in state A will next be used in state B sometime later. This gives the GPU the option to optimize the transition workload, possibly reducing or eliminating execution stalls.

Vulkan

Typical synchonization use-cases

Pipeline barriers

Vulkan as a lot of knobs to configure the barriers in a finest detail. For example, user provides separate masks for source and target pipeline stages. By spreading out the source and target barriers, we can give GPU/driver more time to do the actual transition and minimize the stalls.

There are 3 types of barriers:

Global memory barrier: specifies access flags for all memory objects that exist at the time of its execution.
Buffer memory barrier: similar to a global barrier, but limited to a specified sub-range of buffer memory.
Image memory barrier: similar to a global barrier, but limited to a sub-range of image memory. In addition to changing the access flags, image barrier also includes the transition between image layouts.

Similarities with D3D12:

explicit barriers
both source and destination layout/states are requested, i.e. the driver doesn't track the current layout and expects/trusts the user to insert optimal barriers/transitions
image sub-resources carry independent layouts that can be changed individually or in bulk

Vulkan can transition to any layout if the current contents are discarded.

Note: barriers also allow resource transitions between queue families.

Implicit barriers

Barriers are inserted automatically between sub-passes of a render pass, based on the follow information:

initial and final layouts provided for each attachment
a layout provided for each attachment for each sub-pass
set of sub-pass dependencies, each specifying what parts of what destination sub-pass stages depend on some results of some stages of a source sub-pass

Vulkan implementation also automatically inserts layout transitions for read-only layouts of a resource used in multiple sub-passes.

Events

Vulkan event is a synchronization primitive that can be used to define memory dependencies within a command queue. Arguments of vkCmdWaitEvents are almost identical to vkCmdPipelineBarrier. The difference is an ability to move the start of transition earlier in the queue, similarly in concept to D3D12 split barriers.

Analysis

Tips for best performance (for AMD):

combine transitions
use the most specific state, but also - combine states
give driver time to handle the transition
- D3D12: split barriers
- Vulkan: vkCmdSetEvent + vkCmdWaitEvents

Nitrous engine (Oxide Games, GDC 2017 presentation slide 36) approach:

engine is auto-tracking the current state, the user requests new state only
extended (from D3D12) resource state range that maps to Vulkan barriers

Overall, in terms of flexibility/configuration, Vulkan barriers >> D3D12 barriers >> Metal. Developers seem to prefer D3D12 style (TODO: confirm with more developers!).

Translation between APIs

Metal API running on D3D12/Vulkan

We'd have to replicate the analysis already done by D3D11 and Metal drivers, but without a low-level access to the command buffer structure.

D3D12/Vulkan API running on Metal

All barriers become no-ops.

D3D12 API running on Vulkan

Given that D3D12 appears to have a smaller API surface and stricter set of allowed resources states (e.g. no multiple read/write states allowed), it seems possible to emulate (conservatively) D3D12 states on top of Vulkan. Prototyping would probably help here to narrow down the fine details.

Vulkan API on D3D12

Largely comes down to the following aspects:

ignoring the given pipeline stages
translating (image layout, access mask) -> D3D12 resource state
vkCmdWaitEvents should be possible to translate to a D3D12 split barrier, but more experiments are needed to confirm

Security/corruption issues

We've done some research with IHVs on how the hardware behaves when the resources are used in the case of a mismatched resource layout/state. E.g. an operation expects image to be in a shader-readable state, while the image is not.

The conclusion we got is that in most situations this workload will end up in either a GPU page fault (crash), or visual corruption with user data. It's relatively straightforward for Vulkan to add an extension, and for IHVs to implement it, that would guarantee security of such mismatched layout access. The extension would be defined similarly to robustBufferAccess and specify the exact behavior of the hardware and the lack of access to non-initialized memory not owned by the current instance.

Automation versus Validation

Inserting optimal Vulkan/D3D12 barriers at the right times appears to be a complex task, especially when taking multiple independent queues into consideration. It requires knowledge ahead of time on how a resource is going to be used in the future, and thus would need us to defer actual command buffer recording until we get more data on how resources are used. This would add more CPU overhead to command recording.

Simply validating that current transitions are sufficient appears to be more feasible, since it doesn't require patching command buffers and that logic can be moved completely into the validation layer.

Concrete proposals

TODO

Pipeline states investigation

This document attempts to describe the differences between pipeline states in the current generation of graphics APIs. It also touches some adjacent features like stencil reference values and blend factors. The investigation uses materials from https://github.com/jdashg/vulkan-portability/blob/master/pipeline-state.md.

Information

The concept of a pipeline state is similar between Vulkan/D3D12/Metal. A graphics PSO (short for Pipeline State Object) is an opaque object encapsulating the following information:

shaders for active stages
primitive topology, vertex input layout, tessellation
rasterizer, blend, depth/stencil states, multi-sampling
render targets descriptions

Graphics and compute pipelines are represented with separate types and constructed from different states. We'll focus on the graphics one, considering the compute to use a subset of graphics states.

Differences between APIs are mostly laid out within the specification of render targets and the range of supported states within particular stages.

Vulkan

vkCreateGraphicsPipelineStates creates multiple pipeline states at once, each from a separate VkGraphicsPipelineCreateInfo struct. Users can provide a pipeline cache object in order to re-use internal parts (opaque to the user) between pipelines.

Instead of specifying the formats of render targets, the pipeline state is created for a specific sub-pass of a render pass. A pipeline can then be used with any compatible render pass, according to the rules.

Vulkan allows certain states to be either baked into PSO or set independently during command encoding:

typedef enum VkDynamicState {
    VK_DYNAMIC_STATE_VIEWPORT = 0,
    VK_DYNAMIC_STATE_SCISSOR = 1,
    VK_DYNAMIC_STATE_LINE_WIDTH = 2,
    VK_DYNAMIC_STATE_DEPTH_BIAS = 3,
    VK_DYNAMIC_STATE_BLEND_CONSTANTS = 4,
    VK_DYNAMIC_STATE_DEPTH_BOUNDS = 5,
    VK_DYNAMIC_STATE_STENCIL_COMPARE_MASK = 6,
    VK_DYNAMIC_STATE_STENCIL_WRITE_MASK = 7,
    VK_DYNAMIC_STATE_STENCIL_REFERENCE = 8,
} VkDynamicState;

Vulkan multi-sampling state allows not only forcing sample shading on/off but also specifying the ratio of uniquely shaded samples. This is gated by sampleRateShading feature of the device. Similarly, independent blending is supported behind the independentBlend feature.

VkVertexInputRate does not seem to support instance rate values higher than 1, although it seems trivial to extend the range of accepted values for this type.

Direct3D 12

CreateGraphicsPipelineState creates a graphics PSO described by the D3D12_GRAPHICS_PIPELINE_STATE_DESC structure.

Color and depth render targets are described by their format.

The user can pass a cache blob of another PSO in order to re-use the internal compiled parts.

For multi-sampling, D3D12 exposes the DXGI_SAMPLE_DESC::Quality value, the semantics of which is rather opaque but the exposed capabilities are similar to Vulkan's minSampleShading. Comparing to Vulkan, various parts of multi-sampling state are spread over the rasterization state, blending, sample descriptors, and plain values in the PSO descriptor.

Limitations compared to Vulkan:

no VK_CULL_MODE_FRONT_AND_BACK
no VK_POLYGON_MODE_POINT
no VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN
only integer constant depth bias
no separate stencil read/write masks and reference value
no configurable sample shading
32bit sample mask

Extra features compared to Vulkan:

InstanceDataStepRate allows an advance of vertex attribute per N instances
conservative rasterization

Metal

A graphics PSO is created via makeRenderPipelineState from MTLRenderPipelineDescriptor.

There is no notion of pipeline cache as well as pipeline layout (aka root signature).

States that are surprisingly out of PSO:

depth bias
cull mode
triangle fill mode
depth/stencil

Limitations compared to Vulkan:

no VK_CULL_MODE_FRONT_AND_BACK
no VK_POLYGON_MODE_POINT
no VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN
no variable shading
no sample mask (interesting, why?)
no depth bounds

Extra features compared to Vulkan:

MTLVertexBufferLayoutDescriptor::stepRate allows to configure the rate of an instanced attribute

Analysis

In both Vulkan and D3D12 there is a separate object describing resource binding layout and push/root constants, called VkPipelineLayout and ID3D12RootSignature correspondingly. It makes sense to have it in GPUWeb as well (related to #19).

Vulkan dynamic states are useful in order to match the PSO-baked states with D3D12 and Metal. Thus, we consider all of the dynamic states to be used at all times by the Vulkan backend of GPUWeb.

In terms of pipeline states, it seems reasonable to use the intersection of capabilities between the APIs, which happens to match D3D12 except for:

instance rate, which we can limit to 1 for the MVP and revise later
sample mask

I find it rather unfortunate if we had to drop support for the sample mask, thus perhaps we could patch the shader code in order to apply the mask in the shader specifically for Metal and support it uniformly.

The question of whether render target formats are needed in the PSO depends on #23. If render sub-passes are accepted, I believe following Vulkan model (of providing the pass and sub-pass index) is straightforward.

TODO:

tessellation
format of vertex data and color/depth/stencil targets

GPU / CPU Transfers

This is Apple's proposal for GPU to CPU transfers, and visa versa.

We believe that for a first version (MVP), we can stick to an extremely simple model. If we later discover we need something more complicated for efficiency, we can add to the API.

partial interface HostAccessPass {
    Promise<ArrayBuffer> downloadData(GPUBuffer buffer, UnsignedLong offset, UnsignedLong length);
    void uploadData(GPUBuffer buffer, ArrayBuffer input, UnsignedLong offset);
}

Benefits

Asynchronous: It is impossible to synchronously read from a buffer, and therefore cause a GPU flush.
Portable: There is no ambiguity when the site's Javascript can request data to be downloaded or uploaded. (And it's not stateful.)
Well-defined: It is impossible to use this API to cause a data race between the CPU and GPU. Transfers will only ever occur when both the CPU and GPU are ready for them to occur.
Secure: ArrayBuffer automatically handles the situation of reading out of bounds.
Simple: Downloading and uploading are each a single easily-understandable call.
Implementable: Implementations which don't support mapping work naturally.
Optimizable: Web content doesn't need to have a special path for UMA vs discrete GPU scenarios, or have to know about how some buffers are CPU accessible but slow on the GPU but others are not CPU accessible but fast on the GPU. The implementation is more likely than the web app to handle all the cases in the most optimized way possible. (Write once, run anywhere.)
Easy to use: It's likely that any website code using this API will be correct. It's difficult (impossible?) to use this API wrong.
Style: The rest of the Web platform uses Promises and ArrayBuffers, and this API is no exception.

Drawbacks

All transfers require at least one copy.

Example

function performAsynchronousMath(gpuQueue, gpuBuffer, inputBuffer) {
    let uploadPass = queue.createHostAccessPass();
    uploadPass.uploadData(gpuBuffer, inputBuffer, 0);

    let computePass = queue.createComputePass();
    computePass.setState(...);
    computePass.setBuffer(buffer, ...);
    computePass.dispatch(...);

    let downloadPass = queue.createHostAccessPass();
    downloadPass.downloadData(buffer, 0, buffer.getLength()).then(function(arrayBuffer) {
        let typedArray = new Float32Array(arrayBuffer);
        for (let i = 0; i < buffer.getLength() / Float32Array.BYTES_PER_ELEMENT; ++i) {
            console.log(String(arrayBuffer[i]));
        }
    });

    queue.enqueue(uploadPass);
    queue.enqueue(computePass);
    queue.enqueue(downloadPass);
}

Out-of-bounds behavior in shaders

We all agree that some form of out-of-bounds checking needs to happen at runtime. From our discussion two weeks ago, we came up with two forms of this checking:

Clamping, where array indices which have a too-high index automatically access the last value of the array
Trapping, where such access causes the shader stage to immediately end, feeding zeroes to all pipeline stage output variables.

In Metal, option 1 looks like this:

data[clamp(index, 0U, dataCount)]

and option 2 looks like this:

if (index > dataCount) {
    Output output;
    // Fill in the fields of "output" with zeroes
    return output;
}
data[index];

I did some performance testing in Metal on "iPad mini with Retina display" to try to evaluate these two techniques. There are three styles of performance tests:

Tight Loop

In Metal, this code looks like:

device float* data [[ buffer(1) ]]; // Buffer the user-authored shader wants to read from
device uint32_t& dataCount [[ buffer(2) ]]; // Implicit buffer, which contains the length of "data"
...
VertexOutput output;
float sum = 0;
for (uint32_t i = 0; i < 100000; ++i) {
    sum += data[clamp(i, 0U, dataCount)];
}
output.position = [some function of "sum"];
return output

Here are the performance results. All times are in ms.

	Tight Loop Unchecked	Tight Loop Clamp	Tight Loop Trap
1000 items	5.26997131676063	5.68129132211016	5.52152075518696
3000 items	5.51838883162774	6.51381429188452	6.33112298559014
10000 items	6.21394229649268	9.44977917071966	8.00736172385653
100000 items	12.9947847222175	35.1627072386362	30.8669331404834

Totally Inlined Loop

In Metal, this code looks like:

device float* data [[ buffer(1) ]]; // Buffer the user-authored shader wants to read from
device uint32_t& dataCount [[ buffer(2) ]]; // Implicit buffer, which contains the length of "data"
...
VertexOutput output;
float sum = 0;
sum += data[clamp(0U, 0U, dataCount)];
sum += data[clamp(1U, 0U, dataCount)];
sum += data[clamp(2U, 0U, dataCount)];
...
output.position = [some function of "sum"];
return output

Here are the performance results. All times are in ms. Note that, for larger programs, the runtime compiler times out and doesn't emit code, so I couldn't test them.

	Inlined Unchecked	Inlined Clamp	Inlined Trap
1000 items	5.39240562908213	6.14405965171417	5.3955658573221
3000 items	5.65435510235397	8.08545490402	5.64488865150129
10000 items	-	-	-
100000 items	-	-	-

Partially Inlined Loop (chunk size = 100)

In Metal, this code looks like:

device float* data [[ buffer(1) ]]; // Buffer the user-authored shader wants to read from
device uint32_t& dataCount [[ buffer(2) ]]; // Implicit buffer, which contains the length of "data"
...
VertexOutput output;
float sum = 0;
uint32_t index = 0;
for (uint32_t i = 0; i < 100000 / 100; ++i) {
    index = i * 100 + 0; sum += data[clamp(index, 0U, dataCount)];
    index = i * 100 + 1; sum += data[clamp(index, 0U, dataCount)];
    index = i * 100 + 2; sum += data[clamp(index, 0U, dataCount)];
    ...
    index = i * 100 + 99; sum += data[clamp(index, 0U, dataCount)];
}
output.position = [some function of "sum"];
return output

Here are the performance results. All times are in ms.

	Partially Inlined Unchecked (Chunk = 100)	Partially Inlined Clamp (Chunk = 100)	Partially Inlined Trap (Chunk = 100)
1000 items	5.29934397381731	5.95148991506805	5.73344740079656
3000 items	5.45856812753239	7.18951789342246	6.69228472859371
10000 items	5.95523161258487	11.658497685183	10.0505332602401
100000 items	12.3861319985335	51.63057076149	39.1050798693512

Resource Capabilities

Background

Direct3D 12 and Vulkan have the concept of a resource which has a set of (conceptual) capabilities.

In one example, there may be a resource in these APIs which exists in a region of memory which the CPU can access, but is slow to use on the GPU. There may be another resource in these APIs which is fast to use on the GPU, but the CPU cannot access.

In another example, there may be a texture in these APIs which is tiled such that it can be sampled from in a fragment shader, but CPU-side code wouldn't know where to read/write data to update the contents of the texture (because the exact nature of the tiling is undefined). There may be another texture in these APIs which is not tiled so its contents can be updated by the CPU, but the fixed-function sampling hardware used in the fragment shader cannot sample from it.

Because a resource may not have all the capabilities an application requires, it is common practice to create "staging" resources. When an application wants to, for example, use the CPU to write to a texture which doesn't have CPU access, it writes to a staging texture that does have CPU access, and then enqueues a GPU command to copy the data from the staging texture to the destination texture. Indeed, this is so common that Direct3D 12 itself even has a helper function for it.

Recommendation

The WebGPU API should not have a concept of individual resource capabilities. Instead, every WebGPU API resource should have the same set of capabilities, which includes CPU reading and writing, GPU reading and writing, sampling, etc. On platform APIs where no such resource is available, the browser will need to use staging resources. This is possible because each CPU-side access will be associated with a particular point in the device's command queue (there was consensus during the Nov. 1 call about this). This point is the point where the runtime may emit GPU-side copy commands or synchronization commands.

There are a few reasons for this approach:

Convenience Developers of existing APIs use staging resources because they have to, not because they want to. If every developer is going to have to emit commands to use these staging resources, the browser should just do it for them.
Portability On some devices (for example, UMA devices), staging buffers are entirely unnecessary. On other devices (for example, discrete desktop graphics cards), staging buffers are a requirement to have good performance. The web developer should not be writing code which works on the device they're testing on, but doesn't work on other types of devices. The browser knows much more about the currently-running hardware than the web author.
Performance In an API which has a concept of resource capabilities, poorly-authored applications may use resources in a non-optimal manner. For example, a poorly-authored application might upload once into a CPU-visible resource, and then only sample from it every frame thereafter from the GPU. Removing the API concept of resource capabilities gives the browser the freedom to move the resource to a faster region of GPU memory.
Fingerprinting The properties of resource capabilities are intrinsic to the raw hardware of the device running the web application. The web platform has avoided giving hardware-specific knowledge to any running webpage.

Better specified context events

WebGL has events for loss and restoration of a context and creation failure error. However, there's a problem with them: sometimes they're not very helpful.

For instance, webglcontextlost event. Since it doesn't give the reason why we've lost the context, it's not clear what to do: show our user an error message or wait for restore event. I actually tried to log statusMessage property of event objects and it was empty.

It'd be good to have better specified context events.

Shimming for upgrades, a la database versions

In the same way that databases are updated from one version to the next with scripts that migrate how the data is stored, and in the same way that future versions of web features are shimmed with polyfills, should WebGPU have shims for updating?

This is not a proposal that old code needs to be supported forever with no effort, but a halfway point where old code needs to at least update to include a shim for their old code. New code will need to include a shim, and old code will need to include it.

This is a bit of a shot in the dark, so this issue is raised for discussion rather than resolution. Is there a median point where new code can be proposed and implemented with a shim in the same way as today's polyfills, while old code needs to remain minimally aware of what is going on and do the minimum needed without a rewrite?

Binding Inheritance from outside passes to inside.

Now that we agreed that compute and graphics work should be started and ended explicitly, an open question is whether resource bindings can be bound outside of "compute / render passes" and inherited inside these passes.

D3D12

D3D12 doesn't have explicit bounds for graphics / compute work so be default root signature tables/descriptors/values are persistent. There are two root signatures, one for graphics and one for compute. Also the root signature is invalidated on a call to SetGraphicsRootSignature that changes the layout.

Overall when setting resource bindings outside of passes, we don't know if they should be put in the compute and graphics root signature, and they will in most cases be invalidated by a change to the root signature layout at the beginning of the pass. @kvark pointed out that in Vulkan (and all APIs) the functions binding resources know whether we bind to graphics or compute.

Metal

Bindings are set inside of the MTLComputeCommandEncoder and MTLRenderCommandEncoder and do not propagate outside of them.

Vulkan

Supports inheritance from outside render passes to inside.

Proposal

Given that both D3D12 and Metal do not support inheritance, and given that vkCmdBindDescriptorSet is usually multiple order of magnitudes cheaper than vkCmdBeginRenderPass, it sounds ok to not have inheritance and ask the app to all the bindings inside the passes.

Shape of the API for object creation

We've been thinking of what the shape of the API should be to create the WebGPU objects apart from the command buffer/encoders. Our constraints were the following:

Performance and GC: WebGPU is meant to be a high-performance API on both the GPU and CPU side. Object creation should mostly happen during at load-time but some applications might need to defer object building to rendering time. We should make the overhead of the JS bindings minimal and minimize the amount of GC garbage.
Extensibility: WebGPU will have extensions that are enabled explicitly by developers when create the WebGPU device. Based on the extensions enabled new capabilities will be exposed, often as new creation arguments for WebGPU objects. Anisotropic filtering for sampler object is something that could be exposed as an extension. There might also be “features” that are like built-in extensions present in the main spec and have to be enabled at device creation time.
WebAssembly: It will be a first-class citizen of WebGPU meaning that there will be a version of the WebGPU API that’s native to WebAssembly and has as low overhead as possible. Ideally the WASM version API is exactly the same as the JS version.
Type safety: For the best developer experience we would like the API to be well-typed so that developers don’t spend hours debugging to find a type in on argument to an object creation. We would also make it easy for developers to use extensions and have correct code with/without the extensions. This safety can be “compile-time”, at runtime or runtime with a “debug device” / devtools open.
JSness: It would be nice if the API could be used in a way that looked idiomatic in Javascript.

This approach has been suggested by @fserb. Thanks!

Example WebIDL (that was validated with the WebIDL checker):

dictionary WebGPUSamplerDictionary {
    long wrapModeR; // Or some enum type?
    long wrapModeS;
    // If the anisotropic extension is supported
    long maxAnisotropy;
};

interface WebGPUSamplerDescriptor {
    attribute long wrapModeR; // Or some enum type?
    attribute long wrapModeS;

    void setWrapModeR(long value);
    void setWrapModeS(long value);

    // If the anisotropic extension is supported
    attribute long maxAnisotropy;
    void setMaxAnisotropy(long value);
};

interface WebGPUSampler {
    // Opaque handle
};

interface WebGPUDevice {
    WebGPUSamplerDescriptor createSamplerDescriptor();
    WebGPUSampler createSampler((WebGPUSamplerDictionary or WebGPUSamplerDescriptor) arg);
};

Example of the multiple ways to use this:

//--- Use of a descriptor with the setters
let desc = device.createSamplerDescriptor();
desc.setWrapModeR(...);
desc.setWrapModeS(...);

if (canUseAnisotropicFiltering) {
    // Throws if the extension is not implemented (“method doesn’t exist”)
    // Throws if the extension is not enabled (“extension isn’t enabled”)
    desc.setMaxAnisotropy(...);
}

let sampler = device.createSampler(desc);
// or let sampler = new WebGPUSampler(device, desc) with the corresponding IDL changes.

//--- Use of a descriptor with its attributes
let desc = device.createSamplerDescriptor();
desc.wrapModeR = ...;
desc.wrapModeS = ...;

// This is ignored if the extension is not present
desc.maxAnisotropy = ...;

let sampler = device.createSampler(desc);

//--- Use of a dictionary
let sampler = device.createSampler({
    wrapModeS: …,
    wrapModeR: …,
    maxAnisotropy: …,
});

Going back to the constraints:

Performance and GC: Descriptors should be able to be optimized and can even be reused if GC pressure is a concern. Setter functions could be slower than just setting the property directly on the descriptor (?).
Extensibility: Works, extension properties can be pre-filled on descriptors or have default values for missing dictionary arguments.
WebAssembly: Descriptor objects could be exposed as opaque objects in WASM, with methods to set values. This requires host bindings to work with opaque objects, but doesn’t require WASM property bags. @flagxor confirmed that this would work well with the host bindings currently being discussed in the WASM group.
Type safety: Great type safety in WASM. JS can use the setters for more type-safety.
JSness: The last example looks like idiomatic JS.

Would love to hear your thoughts!

security model, focused on shaders

Security is a key requirement for WebGPU.

However discussions so far have included assertions about the 'security' of a shading language. But we don't have a working definition of security, with respect to a shading language.

To me, it makes little sense to talk about the security of a language in isolation. Rather, I think we judge an implementation of WebGPU to be secure or not, and the implementation uses a shading language.

I believe an agreed-upon definition of security, focused on shaders, will clarify discussions on selecting a shader language. So, I'll start off with a strawman proposal:

Proposal

Definition: An "Access" of a resource is either a read or a write of that resource, in whole or in part.

Strawman definition of the security requirement:

An implementation of WebGPU, X, is 'secure' if
- An ill-behaved application A running on X cannot:
  - Access resources that A is not authorized to access. For example, follow the same-origin policy.
  - Cause other applications running on X to misbehave
    - Exception: Allow competition for CPU, GPU, memory, bandwidth, and other system facilities. This is necessarily fuzzy because it can be impossible to tell when an application begins misbehaving. An ill-behaved app can cause out-of-memory or denial-of-service on both GPU and CPU but may not produce incorrect results in other applications.

Restatement, focusing on shaders executing in a pipeline:

An implementation of WebGPU securely runs graphics pipelines if:
- For any ill-behaved application A running on X, any attempt to run a graphics or compute pipeline will
  - a) Not execute the pipeline, or
  - b)
    - b1) X sets up a valid pipeline and attempts to execute it. What 'validity' means is TBD, but includes valid binding of resources (images, buffers, etc.), and using only shaders that are statically valid. (i.e. the shaders pass all validation rules that can be checked statically)
    - b2) Execution of that pipeline cannot access resources that A is not authorized to access.
    - b3) Execution of that pipeline cannot cause other applications running on X to misbehave. Exception is competition for system resources.

Updated: Incorporated some feedback from #39 (comment)

Discussion

There is no duty to detect or report whether an application is actually ill-behaved. Attempting to do so may incur overhead. For a development environment it may be desirable to detect bad behaviour, but that is not a security requirement.
We don't care about the computed results or performance of an ill-behaved application. That may be a user-experience concern, but not a security concern.
We do not try to protect against denial of service attacks as doing so requires operating system support. Some operating systems provide defense against DOS by resetting the GPU if it is stuck on a task too long. For such OS we should report to the application if it has lost GPU resources due to a GPU reset ("context lost").

Is a clean break from WebGL really necessary?

In the other thread I think @grovesNL brought up a good point:

I agree, WebGL1/2 is tied to OpenGL ES today, but there doesn't seem to be a reason concepts like command queues couldn't be brought into future versions of WebGL or even WebGL2 behind extensions depending on the API changes required. The example of WebGPU provided was essentially equivalent to WebGL today except for the command queue functionality. I'd be surprised if OpenGL ES is not considering some of these concepts for future versions.

Perhaps the discussion should be around which low level features should be exposed to the web as a starting point. I'm slightly skeptical that a common abstraction would significantly vary from WebGL2's existing API, given that Vulkan appears to be out of the picture because it's too low level. If the goal is to provide a slightly more object-oriented API with some additional functionality then it's a quite a large investment for the benefits it provides.

It seems to me from discussions elsewhere that WebGPU choose for whatever reasons to start with a brand new API. But I don't see a good reason to break from the existing WebGL API just yet.

Use callbacks directly?

On the agenda for the next meeting, wait vs. poll vs. promise is listed.

Could we consider using callbacks directly? This approach seems like it would be possible to avoid the GC pressure associated with promises and not be concerned about blocking the main thread. It would also be easy to wrap into a promise if an application preferred the promise API instead.

Position: Blessing a single human-writable shading language

We should bless a single human-writable shading language. This is because of the community and ecosystem that is possible when most authors are writing in the same language. We’ve seen this in many places, the most visible of which is probably sites like Shadertoy, which is no the de-facto place where graphics programmers share new ideas and algorithms, including academic research and published papers. Widespread adoption of WebGPU will be hampered if teachers and influencers don’t have a language to spread.

This doesn’t mean that other languages will be forbidden. Instead, this is about the idea that one language will be promoted for documentation, teaching, marketing, etc.

The blessed language should be Secure HLSL. This is for a few reasons:

Compatibility: The majority of shaders in the world are written in HLSL, and almost all of them will be compatible with Secure HLSL. This will decrease the burden of porting an application to WebGPU.
Security: Secure HLSL has as strong, or stronger, security guarantees than already exist on the Web. The WebGPU Community Group is pursuing efforts to prove this with a static proof solver.
Expressivity: Secure HLSL includes facilities for generics and limited pointers, which makes it more suitable for GPGPU algorithms and CPU algorithms ported to the GPU.
Portability: The WebGPU Community Group is authoring a full specification which defines every operation. This includes full rules about variable scoping, every standard library function, overload priority resolution, among others.

Position: Don't ingest a non-human-writable shading language

WebGPU should not directly ingest a non-human-writable shading language. This is for a few reasons:

Consistency: As the language in which humans write shaders changes, the non-human-representation of the language must also change. Similarly, changing the non-human-representation of the language means that a similar change should be made in the language humans are writing. This makes it difficult to evolve WebGPU.
Ecosystem: Ecosystems exist around humans and the languages they write. There is no large popular repository of shaders in a non-human-readable format.
Documentation Accuracy: Documentation (à la MDN) will present concepts in the language authors are writing in, not the language the browser ingests. The additional level of indirection between what documentation describes and what the browser is actually doing will make the concepts more difficult to understand for web authors.
Security: No non-human-writable shading language meets the same security claims as the above human-writable shading language.
Debugability: Source maps in the only non-human-writable language browsers directly ingest (WebAssembly) still are not implemented in any major browser. Debugging a view of source code, without direct access to the original source, is difficult for browsers to perform.
Ownership: The WebGPU Community Group does not own any non-human-writable shading languages. Any time the Community Group desires to make a modification to the language, they would need to ask a separate group, perhaps in a separate standards body. Such a process would make it more difficult and slower to modify. Not all of the members of the WebGPU Community Group are present in other standards groups.
Portability: An additional compile step adds one more place where behavior deviances can occur. The same shader run through different compilers may have different characteristics (either semantic differences or performance differences). Directly ingesting the exact program the author writes means that this whole class of variance is eliminated.

Request to Apple open sourcing Metal shader compiler for WebGPU..

Hi,
interesting that current gpuweb WebKit patch uses metal shaders.
For me this might be even a good shader language to use in WebGPU, but then will be time for requesting Apple to open source Metal shader compiler stack like Microsoft has done recently with current HLSL shader compiler that translates to his new intermediate format DXIL.. the new Microsoft compiler uses LLVM/Clang stack and seems Apple Metal shader compiler too..

So briefly here (may be not the current best place to ask), with Khronos LLVM IR<->SPIR-V translator and DXIL open source only modern shader language compiler not open source is Metal and would make sense to open source it for WebGPU implementations on other OSes and even make easier porting modern macos/IOS games using Metal to other OSes..
Makes sense?
thanks..

CUDA for Web?

Hello. Will cuda-like support in gpuweb?
As I know, CUDA have multiple-gpu (and Nvidia-only) compute.

So would be cool to see these features:

Compute shaders with (or even CUDA-like programs) SSBO, UBO, TBO, etc.
Atomic operations and counters
Local, global invocations, and warp operations
Limited cycles with big number, but with possible to break in defined value

Also, I had idea for WebRL, that adds ray tracing API to Web, that unifies OpenRT, OpenRL, OptiX, RadeonRays, Embree...

develop webvulkan , adopt webgpu or continue with webgl ?

KhronosGroup/Vulkan-Docs#451

Accept HLSL (or some human-writable language) as a shader input format

We propose that WebGPU accept HLSL shaders as input.

HLSL is by far the most common source language for shaders. There is a huge amount of existing content as well as an extremely large, stable and experienced ecosystem.
Providing a human-writable/readable shading language will help the Web. We have 30 years of examples showing this. Binary-only formats have shown the opposite. WebGPU should support "View Source".
HLSL is already "secure". WebGL shows that we can take a similar language and translate it into multiple backends.
We should avoid requiring developers run a (likely-offline) compilation step in order to get a piece of content working. Any content that wants to create shaders on the client end won't need to ship a (likely-large) compiler.

Note: Accepting HLSL doesn't necessarily rule out accepting SPIR-V, although it doesn't seem worth duplicating work to get something that doesn't provide any more features.

Shaders transpilation to Metal.

Metal has an advantage that it runs on both the GPU and CPU, giving a performance edge.

If GPUWeb transpilated non-metal shaders into Metal on the client; there would be a performance increase. Another option is a development tool, that can convert any shader language to Metal, the second option would be my preference, as it wouldn't give any performance hit to the client.