This ticket is to track ideas/known solutions to GPU-driven. Vulkan

GPU driven rendering about tempest HOT 10 OPEN

try commented on May 26, 2024

GPU driven rendering

from tempest.

Comments (10)

Try commented on May 26, 2024

Now that mesh shading is released for OpenGothic, we can start thinking about next steps.

With VK_NV_mesh_shader everything fits the engine well; mesh shaders just need to be emulated on other platforms.

Idea for emulation workflow:

Split mesh shader into 2 compute shaders + 1 vertex shader
Shaders: counting pass + workload shader + vertex-passthrough
Extra data:
counting_buffer[], indirect_buffer[draw_count], var_buffer[]
var_buffer is a buffer holding the varyings written by the .mesh shader

Spirv patching notes:

OpDecorate %1234 BuiltIn PrimitiveCountNV    <-- should be replaced with OpNop or removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output  <-- should be changed to a shared (Workgroup) variable

Counting shader

// upfront. Using set=1 is ideal, since engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
    uint countersCount;
    uint counters[];
} engine;
---
// tail of the main function
  if(_gl_PrimitiveCountNV!=0) {
    uint pos = atomicAdd(engine.countersCount, 1);
    engine.counters[pos] = _gl_PrimitiveCountNV;
    }

Once the counters are written, an internal shader has to build the multi-draw-indirect buffer from prefix-summed counts.

// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
   uint32_t    indexCount;
   uint32_t    instanceCount;
   uint32_t    firstIndex; // prefix sum
   int32_t     vertexOffset; // can be abused to offset into var_buffer
   uint32_t    firstInstance; // caps: should be zero
   };
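
The prefix-sum step can be sketched on the CPU as a reference. This is only a minimal sketch, not the engine's code: `DrawCmd` mirrors VkDrawIndexedIndirectCommand, while `buildIndirect`, the one-command-per-counter mapping and the 3-indices-per-primitive (triangle) assumption are hypothetical simplifications.

```cpp
#include <cstdint>
#include <vector>

// Mirror of VkDrawIndexedIndirectCommand (same field order and types).
struct DrawCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
  };

// Hypothetical helper: turns primitive counts into draw commands.
// Assumes triangle output, i.e. 3 indices per primitive.
std::vector<DrawCmd> buildIndirect(const std::vector<uint32_t>& primitiveCounts) {
  std::vector<DrawCmd> cmds;
  cmds.reserve(primitiveCounts.size());
  uint32_t firstIndex = 0; // running exclusive prefix sum over index counts
  for(uint32_t prims : primitiveCounts) {
    DrawCmd c = {};
    c.indexCount    = prims*3;
    c.instanceCount = 1;
    c.firstIndex    = firstIndex;
    c.vertexOffset  = 0;
    c.firstInstance = 0;
    cmds.push_back(c);
    firstIndex += c.indexCount;
    }
  return cmds;
  }
```

On the GPU the same loop would be a parallel scan inside the internal shader; the sequential version is only meant to pin down what each field should contain.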

Final draw

Each vkCmdDrawMeshTasksNV call gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes it on to the fragment shader.

Multiple renderpasses

A VkEvent should be enough, for now, to synchronize with the previous set of compute shaders.

Split command-buffers

The generated compute shaders require a way to insert vkCmdDispatch commands right before the start of the render pass.
This can be done by deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:

if deferred: validation is delayed as well, which makes debugging harder
multiple Vulkan command buffers: a full-screen quad pass will produce a command buffer with a single draw call

Issues

Rasterization order: not considered; the z-buffer is more than enough to achieve correct 3D rendering
Mesh shader side effects: not possible because of the counting pass (the shader body is executed twice)
Per-primitive data: not for now
All buffers have to be preallocated with a finite size. Unfortunately, we can run out of buffer memory, and there are no lazily allocated buffers in Vulkan
No task shader support for now: OpenGothic doesn't need it
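
To illustrate the preallocation issue: with fixed-size buffers, every allocation from a shader has to check for overflow itself. Below is a minimal CPU-side sketch of such a bump allocator, using a counter similar in spirit to the `grow` fields that show up later in this thread. `ScratchHeap`, `allocate` and `kAllocFailed` are hypothetical names, and `std::atomic` stands in for a GLSL atomicAdd.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kAllocFailed = 0xFFFFFFFFu;

// Hypothetical fixed-size scratch heap; 'grow' is the bump pointer.
struct ScratchHeap {
  std::atomic<uint32_t> grow{0};
  uint32_t              capacity = 0; // preallocated size, in elements
  };

// Reserves 'size' elements; returns the start offset or kAllocFailed.
// The counter keeps growing on failure, so reading it back afterwards
// tells how much memory the frame would have needed.
// (32-bit wrap-around of the counter is ignored in this sketch.)
uint32_t allocate(ScratchHeap& h, uint32_t size) {
  const uint32_t offset = h.grow.fetch_add(size);
  if(offset>h.capacity || size>h.capacity-offset)
    return kAllocFailed;
  return offset;
  }
```

A failed allocation means the workgroup has to skip its output, so the caller still needs some policy for that case (for example, resizing the buffer for the next frame).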


Try commented on May 26, 2024

Some experiments:

Added libspiv - internal utility library for spir-v tooling
First attempts to convert .mesh to .comp

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_NV_mesh_shader"
               OpName %main "main"
               OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
               OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
               OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
               OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
               OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
               OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
               OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
               OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
               OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
               OpName %Vbo "Vbo"
               OpMemberName %Vbo 0 "vertices"
               OpName %_ ""
               OpName %PerVertexData "PerVertexData"
               OpMemberName %PerVertexData 0 "color"
               OpName %v_out "v_out"
               OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
               OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
               OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
               OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
               OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
               OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
               OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
               OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
               OpDecorate %_runtimearr_v2float ArrayStride 8
               OpMemberDecorate %Vbo 0 NonWritable
               OpMemberDecorate %Vbo 0 Offset 0
               OpDecorate %Vbo BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 0
               OpDecorate %v_out Location 0
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
               OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
               OpDecorate %80 DescriptorSet 1
               OpDecorate %80 Binding 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
               OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
               OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
               OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
     %uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
     %uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
    %v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
        %Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
          %_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
      %int_1 = OpConstant %int 1
      %int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
      %v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
         %54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
         %56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
         %58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
     %uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
    %v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
         %74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
         %77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
         %80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
       %main = OpFunction %void None %3
          %5 = OpLabel
         %27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
         %28 = OpLoad %v2float %27
         %31 = OpCompositeExtract %float %28 0
         %32 = OpCompositeExtract %float %28 1
         %33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
         %35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
               OpStore %35 %33
         %37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
         %38 = OpLoad %v2float %37
         %39 = OpCompositeExtract %float %38 0
         %40 = OpCompositeExtract %float %38 1
         %41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
         %42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
               OpStore %42 %41
         %44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
         %45 = OpLoad %v2float %44
         %46 = OpCompositeExtract %float %45 0
         %47 = OpCompositeExtract %float %45 1
         %48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
         %49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
               OpStore %49 %48
         %55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
               OpStore %55 %54
         %57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
               OpStore %57 %56
         %59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
               OpStore %59 %58
         %65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
               OpStore %65 %uint_0
         %66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
               OpStore %66 %uint_1
         %68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
               OpStore %68 %uint_2
               OpStore %g1_PrimitiveCountNV %uint_1
               OpReturn
               OpFunctionEnd

In here:

Mesh-related built-ins are promoted to shared memory
Entry point adjusted to have no out variables for SPIR-V < 1.4
Entry point changed to GLCompute
Extra SSBO binding introduced at (set=1, binding=0), to be used as the count buffer
gl_* prefix changed to g1_ to make spirv-cross happy


Try commented on May 26, 2024

Strategy update, for a compute-driven workflow:

single execution of .mesh.comp - this will simplify code-gen and C++ workflow
index sorting/packing with internal shaders
manual vertex pull in generated .vert

Extra descriptor set:

struct IndirectCmd { // 32 bytes
  uint    indexCount;
  uint    instanceCount;
  uint    firstIndex;    // prefix sum
  int     vertexOffset;  // can be abused to offset into var_buffer
  uint    firstInstance; // caps: should be zero

  uint    self;  // sequential id of dispatchMesh class, in render-pass
  uint    padd0;
  uint    padd1;
  }; // 32 bytes

layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
  IndirectCmd cmd[];
  } indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex

layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
  uint    grow;
  uint    ibo[];
  } ind;

layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
  uint    grow;
  uint    vbo[];
  } var;

layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
  uint    grow; // and dispatchX
  uint    dispatchY; // =1
  uint    dispatchZ; // =1
  uint    desc[];
  } mesh;

layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
  uint    ibo[];
  } indFlat;
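
For the CPU side, the IndirectCmd layout can be mirrored in C++ and checked at compile time. This is a sketch under assumptions: the struct simply repeats the GLSL fields above with fixed-width types, and `commandOffset` is a hypothetical helper, not an engine function.

```cpp
#include <cstddef>
#include <cstdint>

// C++ mirror of the GLSL IndirectCmd; the first five fields match
// VkDrawIndexedIndirectCommand, the rest is engine-internal data.
struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;

  uint32_t self;
  uint32_t padd0;
  uint32_t padd1;
  };

static_assert(sizeof(IndirectCmd)==32,          "must match the std430 array stride");
static_assert(offsetof(IndirectCmd,self)==20,   "'self' follows the 20-byte Vulkan command");

// Hypothetical helper: byte offset of command 'i' in the indirect buffer,
// usable as the 'offset' argument of vkCmdDrawIndexedIndirect.
constexpr std::size_t commandOffset(std::size_t i) {
  return i*sizeof(IndirectCmd);
  }
```

With this layout the draw call would pass 32 as the stride, so Vulkan skips over `self` and the padding between consecutive commands.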

Workflow by example:

      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.dispatchMesh(0,3);
      enc.dispatchMesh(3,2);

Will be translated as:

      enc.setUniforms(pso_compute_ms,ubo);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
      enc.dispatch(3, 1,1);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
      // TODO: pass base taskID somehow
      enc.dispatch(2, 1,1);
     ....
      VkBufferMemoryBarrier(comp -> comp, indirect.ind);
      // after all 'dispatchMesh' are done
      // the prefix-sum pass actually does 2 jobs:
      // indirect.cmd[i].firstIndex = prefixSum(indexCount);
      // indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in compactage pass
      enc.setUniforms(psoSum,uboSum);
      enc.dispatch(1,1,1); // 1 group with 256 threads
      // should be dispatch-indirect
      VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
      enc.setUniforms(psoCompactage,uboCompactage);
      enc.dispatchIndirect(mesh.grow,1,1);
      VkBufferMemoryBarrier(comp -> vert);

      // main rendering, as drawIndirect
      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.drawIndirect(indirect.cmd[0]);
      enc.drawIndirect(indirect.cmd[1]);
      // vert -> comp barrier at end of render-pass
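
The compactage pass from the listing above can be sketched as a CPU reference. This is a guess at the data flow rather than the actual shader: the thread does not specify what `mesh.desc` contains, so the `MeshletDesc` record (draw id, source offset, index count) is hypothetical, and the input is assumed to be the state left by the prefix-sum pass (firstIndex filled in, indexCount zeroed).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Cmd {
  uint32_t indexCount; // zeroed by the prefix-sum pass, re-accumulated here
  uint32_t firstIndex; // start of this draw's range in indFlat
  };

// Hypothetical record written by each mesh workgroup.
struct MeshletDesc {
  uint32_t cmdId;      // draw the workgroup belongs to ('self')
  uint32_t srcOffset;  // where its indices start in the scratch ibo
  uint32_t indexCount; // number of indices it produced
  };

// Copies every workgroup's indices into the contiguous range of its draw.
void compact(std::vector<Cmd>& cmds, const std::vector<MeshletDesc>& desc,
             const std::vector<uint32_t>& ibo, std::vector<uint32_t>& indFlat) {
  for(const MeshletDesc& d : desc) {
    Cmd& c = cmds[d.cmdId];
    // On the GPU this reservation would be an atomicAdd on c.indexCount.
    const uint32_t dst = c.firstIndex + c.indexCount;
    c.indexCount += d.indexCount;
    assert(dst+d.indexCount<=indFlat.size());
    for(uint32_t i=0; i<d.indexCount; ++i)
      indFlat[dst+i] = ibo[d.srcOffset+i];
    }
  }
```

After this pass each draw owns one contiguous index range, which is what lets a single indexed indirect draw per dispatchMesh call consume it.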


Try commented on May 26, 2024

Current implementation:

Each dispatch-mesh call works as a pair of compute shader + draw-indirect
The compute shader and the vertex passthrough shader are both generated from a single mesh shader: cc326ee
Once all compute passes related to the draw calls are finished, the output should be sorted (only in the prototype, not in the engine) and forwarded to vkCmdDrawIndexedIndirect

TODO:

Add VMeshShaderEmulated as special case in related pieces in engine
Take care of pipeline-memory allocation and scheduling in general


Try commented on May 26, 2024

First proof-of-concept triangle:

TODO: find a way to pass firstTask and selfId to the compute shader


Try commented on May 26, 2024

Current idea for passing firstTask and selfId:

Use Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1,1). This will break some builtin variables.

// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize;  // unaffected

// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;  // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected

// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
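
To make the effect on gl_WorkGroupID concrete, here is a small CPU simulation of the IDs such a dispatch produces. It relies on the Vulkan rule that vkCmdDispatchBase offsets each workgroup ID component by the corresponding base; `UVec3`, `workGroupIds` and `decode` are hypothetical names used only for this sketch.

```cpp
#include <cstdint>
#include <vector>

struct UVec3 { uint32_t x, y, z; };

// Enumerates the gl_WorkGroupID values of
// vkCmdDispatchBase(cmd, base.x, base.y, base.z, count.x, count.y, count.z).
std::vector<UVec3> workGroupIds(UVec3 base, UVec3 count) {
  std::vector<UVec3> ids;
  for(uint32_t z=0; z<count.z; ++z)
    for(uint32_t y=0; y<count.y; ++y)
      for(uint32_t x=0; x<count.x; ++x)
        ids.push_back(UVec3{base.x+x, base.y+y, base.z+z});
  return ids;
  }

struct MeshArgs { uint32_t taskId; uint32_t selfId; };

// What the generated compute shader could read back from gl_WorkGroupID:
// X already includes firstTask, Y carries the id of the draw.
MeshArgs decode(UVec3 workGroupId) {
  return MeshArgs{workGroupId.x, workGroupId.y};
  }
```

For example, a base of (firstTask=5, self=2, 0) with counts (3,1,1) yields X in 5..7 and Y fixed at 2. Note that a non-zero base requires the compute pipeline to be created with the dispatch-base flag.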


Try commented on May 26, 2024

Almost there:

Normals are bugged out because the translator can't handle arrayed varyings


Try commented on May 26, 2024

Running stable on OpenGothic:


Try commented on May 26, 2024

New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. In most cases this can be done if the vertex computation is a uniform-function.

What I mean by uniform-function:
A function that uses only constants, locals, uniforms, read-only SSBOs, and push constants, in any combination, and has no side effects.
It is similar to a pure function, but less restricted. This allows moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place
