
GPU driven rendering (tempest, open issue)

try avatar try commented on May 26, 2024
GPU driven rendering

from tempest.

Comments (10)

Try commented on May 26, 2024

Now that mesh shading is released for OpenGothic, we can start thinking about next steps.

With VK_NV_mesh_shader everything fits the engine well; mesh shaders just need to be emulated on other platforms.

Idea for emulation workflow:

  • Split mesh shader into 2 compute shaders + 1 vertex shader
  • Shaders: counting pass + workload shader + vertex-passthrough
  • Extra data:
    counting_buffer[], indirect_buffer[draw_count], var_buffer[]
    var_buffer is the buffer holding the varyings output by the .mesh shader

SPIR-V patching notes:

OpDecorate %1234 BuiltIn PrimitiveCountNV    <-- should be replaced with OpNop or removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output  <-- should be mutated to shared-variable
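
The first patch above (dropping the BuiltIn decoration) can be sketched as a single walk over the SPIR-V word stream. This is only an illustration, not the code used in the engine: the function name is hypothetical, and the numeric opcode/enum values are written from memory of the SPIR-V headers and should be checked against spirv.hpp before use. The second patch (changing the variable's storage class from Output to Workgroup) is not covered here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values, as recalled from the SPIR-V headers; verify against spirv.hpp.
constexpr uint32_t kOpNop                   = 0;
constexpr uint32_t kOpDecorate              = 71;
constexpr uint32_t kDecorationBuiltIn       = 11;
constexpr uint32_t kBuiltInPrimitiveCountNV = 5275;
constexpr size_t   kHeaderWords             = 5; // magic, version, generator, bound, schema

// Overwrites every `OpDecorate %id BuiltIn PrimitiveCountNV` with OpNop words.
// Returns the number of instructions patched.
inline size_t nopPrimitiveCountDecoration(std::vector<uint32_t>& code) {
  size_t patched = 0;
  for(size_t i = kHeaderWords; i < code.size(); ) {
    const uint32_t opcode = code[i] & 0xFFFFu; // low 16 bits: opcode
    const uint32_t length = code[i] >> 16;     // high 16 bits: word count
    if(length == 0 || i + length > code.size())
      break; // malformed stream: stop instead of reading out of bounds
    // OpDecorate layout: [header, target id, decoration, extra operands...]
    if(opcode == kOpDecorate && length == 4 &&
       code[i+2] == kDecorationBuiltIn && code[i+3] == kBuiltInPrimitiveCountNV) {
      for(uint32_t r = 0; r < length; ++r)
        code[i+r] = (1u << 16) | kOpNop; // one-word OpNop per slot keeps offsets stable
      ++patched;
    }
    i += length;
  }
  return patched;
}
```

Filling the slot with one-word OpNop instructions, rather than erasing the words, keeps every later instruction at its original offset, which is convenient when several patches are applied in one pass.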

Counting shader

// declared up front. Using set=1 is ideal, since the engine itself doesn't use multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
    uint countersCount;
    uint counters[];
} engine;
---
// tail of the main function
  if(_gl_PrimitiveCountNV!=0) {
    uint pos = atomicAdd(engine.countersCount, 1);
    engine.counters[pos] = _gl_PrimitiveCountNV;
    }

Once the counters are written, an internal shader has to build the multi-draw-indirect buffer, with prefix-summed counts.

// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
   uint32_t    indexCount;
   uint32_t    instanceCount;
   uint32_t    firstIndex; // prefix sum
   int32_t     vertexOffset; // can be abused to offset into var_buffer
   uint32_t    firstInstance; // caps: should be zero
   };
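
As a rough CPU-side illustration of the "prefix-summed counts" step, the sketch below fills one command per draw. It assumes the per-workgroup counters have already been aggregated into one primitive count per draw and that the topology is a triangle list; the struct is a local mirror of VkDrawIndexedIndirectCommand and `buildIndirect` is a hypothetical name. In the proposal itself this job would be done by an internal compute shader.

```cpp
#include <cstdint>
#include <vector>

// Local mirror of VkDrawIndexedIndirectCommand (no Vulkan headers needed).
struct DrawIndexedIndirectCommand {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
};

// primitiveCounts[i] is the triangle count recorded for draw i (assumption).
inline std::vector<DrawIndexedIndirectCommand>
buildIndirect(const std::vector<uint32_t>& primitiveCounts) {
  std::vector<DrawIndexedIndirectCommand> cmd;
  cmd.reserve(primitiveCounts.size());
  uint32_t running = 0; // exclusive prefix sum of index counts
  for(uint32_t prims : primitiveCounts) {
    const uint32_t indices = prims * 3; // assumption: triangle list
    cmd.push_back({indices, 1u, running, 0, 0u});
    running += indices;
  }
  return cmd;
}
```

A draw with zero primitives still gets a command, just with indexCount = 0, so the command index stays equal to the draw index.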

Final draw

Each vkCmdDrawMeshTasksNV call gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes its contents on to the fragment shader.

Multiple renderpasses

For now, a VkEvent should be enough to synchronize with the previous set of compute shaders.

Split command-buffers

Generating extra compute shaders requires a way to insert vkCmdDispatch commands at the beginning of a render pass.
This can be done by deferred command recording, or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:

  • if deferred: validation is delayed as well, which makes debugging problematic
  • multiple Vulkan command buffers: a full-screen quad pass will produce a command buffer with a single draw call

Issues

  • Rasterization order - not considered; the z-buffer is more than enough to achieve correct 3D rendering
  • Mesh shader side effects - not possible, since the counting pass runs the same shader code an extra time
  • Per-primitive data - not now
  • All buffers have to be preallocated with a finite size. Unfortunately we can run out of buffer memory, and there are no lazily allocated buffers in Vulkan
  • No task shader support for now - OpenGothic doesn't need it
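
To make the "finite size" issue concrete, here is a small CPU model of a preallocated buffer with an atomic cursor. It is only a sketch under assumptions: `ScratchBuffer`, `allocate` and `kNoSpace` are hypothetical names, and reporting overflow back so that a larger buffer can be used later is one possible policy, not something the thread decides.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kNoSpace = 0xFFFFFFFFu; // sentinel: request did not fit

// Hypothetical model of a fixed-capacity scratch buffer with an atomic cursor.
class ScratchBuffer {
public:
  explicit ScratchBuffer(uint32_t capacity) : capacity_(capacity) {}

  // Reserves `count` elements; returns the start offset, or kNoSpace on overflow.
  uint32_t allocate(uint32_t count) {
    const uint32_t at = cursor_.fetch_add(count); // same shape as GLSL atomicAdd
    if(at > capacity_ || count > capacity_ - at)
      return kNoSpace; // the caller must skip its writes
    return at;
  }

  // True once any request failed to fit; the cursor keeps the total demand.
  bool overflowed() const { return cursor_.load() > capacity_; }

private:
  uint32_t              capacity_;
  std::atomic<uint32_t> cursor_{0};
};
```

Because the cursor keeps growing past the capacity, it records how much space the frame actually needed (ignoring 32-bit wrap-around), which could be read back to size the buffer for the next attempt.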

Try commented on May 26, 2024

Some experiments:

  1. Added libspirv - an internal utility library for SPIR-V tooling
  2. First attempts to convert .mesh to .comp:
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_NV_mesh_shader"
               OpName %main "main"
               OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
               OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
               OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
               OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
               OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
               OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
               OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
               OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
               OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
               OpName %Vbo "Vbo"
               OpMemberName %Vbo 0 "vertices"
               OpName %_ ""
               OpName %PerVertexData "PerVertexData"
               OpMemberName %PerVertexData 0 "color"
               OpName %v_out "v_out"
               OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
               OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
               OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
               OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
               OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
               OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
               OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
               OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
               OpDecorate %_runtimearr_v2float ArrayStride 8
               OpMemberDecorate %Vbo 0 NonWritable
               OpMemberDecorate %Vbo 0 Offset 0
               OpDecorate %Vbo BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 0
               OpDecorate %v_out Location 0
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
               OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
               OpDecorate %80 DescriptorSet 1
               OpDecorate %80 Binding 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
               OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
               OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
               OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
               OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
     %uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
     %uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
    %v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
        %Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
          %_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
      %int_1 = OpConstant %int 1
      %int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
      %v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
         %54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
         %56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
         %58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
     %uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
     %v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
    %v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
         %74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
         %77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
         %80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
       %main = OpFunction %void None %3
          %5 = OpLabel
         %27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
         %28 = OpLoad %v2float %27
         %31 = OpCompositeExtract %float %28 0
         %32 = OpCompositeExtract %float %28 1
         %33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
         %35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
               OpStore %35 %33
         %37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
         %38 = OpLoad %v2float %37
         %39 = OpCompositeExtract %float %38 0
         %40 = OpCompositeExtract %float %38 1
         %41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
         %42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
               OpStore %42 %41
         %44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
         %45 = OpLoad %v2float %44
         %46 = OpCompositeExtract %float %45 0
         %47 = OpCompositeExtract %float %45 1
         %48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
         %49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
               OpStore %49 %48
         %55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
               OpStore %55 %54
         %57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
               OpStore %57 %56
         %59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
               OpStore %59 %58
         %65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
               OpStore %65 %uint_0
         %66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
               OpStore %66 %uint_1
         %68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
               OpStore %68 %uint_2
               OpStore %g1_PrimitiveCountNV %uint_1
               OpReturn
               OpFunctionEnd

In this dump:

  • Mesh-related builtins are promoted to shared memory
  • Entry point adjusted to have no out variables, for SPIR-V < 1.4
  • Entry point changed to GLCompute
  • Extra SSBO binding introduced at (set=1, binding=0), to be used as the count buffer
  • gl_* prefix changed to g1_ to make spirv-cross happy

Try commented on May 26, 2024

Strategy update for the compute-driven workflow:

  • single execution of .mesh.comp - this will simplify code-gen and C++ workflow
  • index sorting/packing with internal shaders
  • manual vertex pull in generated .vert

Extra descriptor set:

struct IndirectCmd { // 32 bytes
  uint    indexCount;
  uint    instanceCount;
  uint    firstIndex;    // prefix sum
  int     vertexOffset;  // can be abused to offset into var_buffer
  uint    firstInstance; // caps: should be zero

  uint    self;  // sequential id of dispatchMesh class, in render-pass
  uint    padd0;
  uint    padd1;
  }; // 32 bytes

layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
  IndirectCmd cmd[];
  } indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex

layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
  uint    grow;
  uint    ibo[];
  } ind;

layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
  uint    grow;
  uint    vbo[];
  } var;

layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
  uint    grow; // and dispatchX
  uint    dispatchY; // =1
  uint    dispatchZ; // =1
  uint    desc[];
  } mesh;

layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
  uint    ibo[];
  } indFlat;
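
On the C++ side, the 32-byte layout of IndirectCmd can be pinned down with a mirror struct and compile-time checks. This is a sketch rather than engine code; `drawOffset` is a hypothetical helper that shows how the padded struct size could serve as the stride when walking the buffer with vkCmdDrawIndexedIndirect.

```cpp
#include <cstddef>
#include <cstdint>

// C++ mirror of the GLSL IndirectCmd struct (std430: eight 4-byte scalars).
struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;

  uint32_t self;
  uint32_t padd0;
  uint32_t padd1;
};

static_assert(sizeof(IndirectCmd) == 32, "must match the GLSL-side layout");
static_assert(offsetof(IndirectCmd, firstInstance) == 16, "draw arguments come first");
static_assert(offsetof(IndirectCmd, self) == 20, "engine data follows the draw arguments");

// Byte offset of command i inside the indirect buffer (hypothetical helper).
constexpr size_t drawOffset(size_t i) { return i * sizeof(IndirectCmd); }
```

The first five fields line up with VkDrawIndexedIndirectCommand, so the draw call only reads the leading 20 bytes of each element and skips `self` and the padding.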

Workflow by example:

      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.dispatchMesh(0,3);
      enc.dispatchMesh(3,2);

Will be translated as:

      enc.setUniforms(pso_compute_ms,ubo);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
      enc.dispatch(3, 1,1);
      // vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
      // TODO: pass base taskID somehow
      enc.dispatch(2, 1,1);
     ....
      VkBufferMemoryBarrier(comp -> comp, indirect.ind);
      // after all 'dispatchMesh' are done
      // the prefix-sum pass actually does 2 jobs:
      // indirect.cmd[i].firstIndex = prefixSum(indexCount);
      // indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in the compaction pass
      enc.setUniforms(psoSum,uboSum);
      enc.dispatch(1,1,1); // 1 group with 256 threads
      // should be dispatch-indirect
      VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
      enc.setUniforms(psoCompactage,uboCompactage);
      enc.dispatchIndirect(mesh.grow,1,1);
      VkBufferMemoryBarrier(comp -> vert);

      // main rendering, as drawIndirect
      enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
      enc.setUniforms(pso,ubo);
      enc.drawIndirect(indirect.cmd[0]);
      enc.drawIndirect(indirect.cmd[1]);
      // vert -> comp barrier at end of render-pass
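
The interplay between the prefix-sum pass and the compaction pass can be modelled on the CPU as below. This is a simplified sketch: `Cmd`, `Meshlet` and both function names are made up for illustration, indices are assumed to be already rebased, and the read-modify-write on indexCount stands in for an atomicAdd in the real shader.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cmd { uint32_t indexCount; uint32_t firstIndex; };

// Output of one mesh workgroup: the draw it belongs to plus its indices.
struct Meshlet { uint32_t self; std::vector<uint32_t> indices; };

// Job 1: firstIndex = exclusive prefix sum of indexCount.
// Job 2: indexCount = 0, so the compaction pass can reuse it as a write cursor.
inline void prefixSumPass(std::vector<Cmd>& cmd) {
  uint32_t running = 0;
  for(Cmd& c : cmd) {
    c.firstIndex = running;
    running     += c.indexCount;
    c.indexCount = 0;
  }
}

// Copies each meshlet's indices into its draw's contiguous range of `flat`.
// `flat` must already be sized to the total index count.
inline void compactionPass(std::vector<Cmd>& cmd,
                           const std::vector<Meshlet>& meshlets,
                           std::vector<uint32_t>& flat) {
  for(const Meshlet& m : meshlets) {
    Cmd& c = cmd[m.self];
    const uint32_t dst = c.firstIndex + c.indexCount; // atomicAdd on the GPU
    for(size_t i = 0; i < m.indices.size(); ++i)
      flat[dst + i] = m.indices[i];
    c.indexCount += uint32_t(m.indices.size());
  }
}
```

After both passes every command again holds its full indexCount, and the indices of each draw sit in one contiguous range even if the meshlets of different draws were produced interleaved.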

Try commented on May 26, 2024

Current implementation:
[image]

  1. Each dispatch-mesh call works as a pair: compute shader + draw-indirect
  2. The compute shader, as well as the vertex pass-through shader, is generated from a single mesh shader: cc326ee
  3. Once all compute passes related to draw calls are finished, the output should be sorted (only in the prototype, not in the engine) and forwarded to vkCmdDrawIndexedIndirect

TODO:

  1. Add VMeshShaderEmulated as special case in related pieces in engine
  2. Take care of pipeline-memory allocation and scheduling in general

Try commented on May 26, 2024

First proof of concept, a triangle of sorts:
[image]

TODO: need to pass firstTask and selfId to the compute shader somehow

Try commented on May 26, 2024

Current idea for passing firstTask and selfId:

Use Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1, 1). This will break some built-in variables.

// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize;  // unaffected

// workgroup and invocation IDs
in uvec3 gl_WorkGroupID;  // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected

// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
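
The effect of the base offsets on the builtins can be worked through on the CPU. The sketch below assumes the usual Vulkan rule that WorkgroupId values range over [base, base + count) per component; in the example call above, firstTask lands in baseGroupX and self in baseGroupY. All type and function names are hypothetical. Note also that, as far as I know, a non-zero base requires the pipeline to be created with VK_PIPELINE_CREATE_DISPATCH_BASE.

```cpp
#include <cstdint>

struct UVec3 { uint32_t x, y, z; };

// gl_WorkGroupID of the group at `offset` inside a dispatch that starts at `base`.
inline UVec3 workGroupId(UVec3 base, UVec3 offset) {
  return {base.x + offset.x, base.y + offset.y, base.z + offset.z};
}

// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID,
// so any base offset leaks into it as well.
inline UVec3 globalInvocationId(UVec3 wg, UVec3 size, UVec3 local) {
  return {wg.x * size.x + local.x, wg.y * size.y + local.y, wg.z * size.z + local.z};
}

struct MeshArgs { uint32_t taskId; uint32_t selfId; };

// Decoding matching vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1, 1):
// X carries firstTask + task index, Y carries the id of the dispatchMesh call.
inline MeshArgs decode(UVec3 wg) {
  return {wg.x, wg.y};
}
```

For example, with firstTask = 5 and self = 2, the second workgroup of the dispatch sees gl_WorkGroupID = (6, 2, 0), which is why the Y component and everything derived from it is marked as polluted above.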

Try commented on May 26, 2024

Almost there:
[image]

Normals are broken, because the translator can't handle arrayed varyings

Try commented on May 26, 2024

Running stable on OpenGothic:
[image]

Try commented on May 26, 2024

New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. In most cases this can be done if the vertex computation is a uniform-function.

A uniform-function, to me, is:
a function that uses only constants, locals, uniforms, read-only SSBOs and push constants, in any combination, and has no side effects.
It is similar to a pure function, but less restricted. This would allow moving most of the computation to the vertex shader.

The only problem is gl_WorkGroupID.x, which is used all over the place.
