Comments (10)
Now that mesh shading has been released for OpenGothic, we can start thinking about next steps.
With VK_NV_mesh_shader everything fits the engine fine; mesh shaders just need to be emulated on other platforms.
Idea for emulation workflow:
- Split mesh shader into 2 compute shaders + 1 vertex shader
- Shaders: counting pass + workload shader + vertex-passthrough
- Extra data: counting_buffer[], indirect_buffer[draw_count], var_buffer[]
var_buffer is a buffer with the varyings output from the .mesh shader
Spirv patching notes:
OpDecorate %1234 BuiltIn PrimitiveCountNV <-- should be noped/removed
%gl_PrimitiveCountNV = OpVariable %_ptr_Output_uint Output <-- should be mutated to shared-variable
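The first patching note (removing the BuiltIn decoration) can be sketched on the raw SPIR-V word stream. This is only a minimal illustration, not the code used in the engine: the function name is hypothetical, and the numeric constants are the values as I recall them from the SPIR-V headers, so they should be checked against spirv.hpp before use. It overwrites the whole OpDecorate instruction with one-word OpNop instructions, so the rest of the module keeps its offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values, as recalled from the SPIR-V headers; verify against spirv.hpp.
constexpr uint32_t kSpirvMagic              = 0x07230203u;
constexpr uint32_t kOpNop                   = 0;
constexpr uint32_t kOpDecorate              = 71;
constexpr uint32_t kDecorationBuiltIn       = 11;
constexpr uint32_t kBuiltInPrimitiveCountNV = 5275;

// Hypothetical helper: replaces every `OpDecorate %id BuiltIn <builtin>`
// with OpNop words, in place. Returns the number of decorations removed.
inline size_t nopBuiltinDecoration(std::vector<uint32_t>& code, uint32_t builtin) {
  assert(code.size() >= 5 && code[0] == kSpirvMagic);
  size_t removed = 0;
  for(size_t i = 5; i < code.size(); ) {       // 5 words of module header
    const uint32_t op  = code[i] & 0xFFFFu;    // low 16 bits: opcode
    const uint32_t len = code[i] >> 16;        // high 16 bits: word count
    if(len == 0 || i + len > code.size())
      break;                                   // malformed stream, stop
    // OpDecorate layout: [header, target id, decoration, operands...]
    if(op == kOpDecorate && len >= 4 &&
       code[i+2] == kDecorationBuiltIn && code[i+3] == builtin) {
      for(uint32_t k = 0; k < len; ++k)
        code[i+k] = (1u << 16) | kOpNop;       // one-word OpNop
      ++removed;
    }
    i += len;
  }
  return removed;
}
```

The second note (mutating the Output variable into a shared one) needs more work, since the pointer type and storage class both have to change; it is not covered by this sketch.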
Counting shader
// upfront. Using set=1 is ideal, since engine doesn't work with multiple descriptor sets
layout(set = 1, binding = 0) buffer EngineInternal
{
uint countersCount;
uint counters[];
} engine;
---
// tail of the main function
if(_gl_PrimitiveCountNV!=0) {
uint pos = atomicAdd(engine.countersCount, 1);
engine.counters[pos] = _gl_PrimitiveCountNV;
}
Once the counters are done, an internal shader has to build the multi-draw-indirect buffer, with prefix-summed counts.
// recap note about indirect commands
struct VkDrawIndexedIndirectCommand {
uint32_t indexCount;
uint32_t instanceCount;
uint32_t firstIndex; // prefix sum
int32_t vertexOffset; // can be abused to offset into var_buffer
uint32_t firstInstance; // caps: should be zero
};
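A CPU-side reference of that build step, as a rough sketch: the struct mirrors VkDrawIndexedIndirectCommand, while buildIndirect and the assumption of one triangle-list draw per counter are mine, for illustration only. On the GPU this would be an internal compute shader instead.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the layout of VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
};

// Hypothetical helper: primitiveCounts[i] is the total number of triangles
// emitted for draw i. firstIndex becomes the exclusive prefix sum of the
// index counts of all previous draws.
inline std::vector<DrawIndexedIndirectCommand>
buildIndirect(const std::vector<uint32_t>& primitiveCounts) {
  std::vector<DrawIndexedIndirectCommand> cmd;
  cmd.reserve(primitiveCounts.size());
  uint32_t firstIndex = 0;
  for(uint32_t prims : primitiveCounts) {
    const uint32_t indexCount = prims * 3;           // triangle list
    assert(firstIndex <= UINT32_MAX - indexCount);   // no wrap-around
    cmd.push_back({indexCount, 1u, firstIndex, 0, 0u});
    firstIndex += indexCount;
  }
  return cmd;
}
```

A draw with zero primitives still gets a command here, with indexCount = 0, so command indices stay aligned with draw indices.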
Final draw
Each vkCmdDrawMeshTasks gets replaced by vkCmdDrawIndexedIndirect, which consumes var_buffer and passes it to the fragment shader.
Multiple renderpasses
vkEvent should be fine for synchronizing execution of the previous set of compute shaders, for now.
Split command-buffers
Generating extra compute shaders will require a way to insert vkCmdDispatch commands before the beginning of the render-pass.
This can be done by deferred command recording or by splitting one engine-level command buffer into multiple Vulkan command buffers.
Cons:
- if deferred: validation is going to be delayed as well, making debugging problematic
- multiple Vulkan command buffers: a full-screen quad pass will produce a command buffer with a single draw-call
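To make the deferred-recording option more concrete, here is a toy sketch. DeferredEncoder and its methods are hypothetical and only log command names; a real encoder would store Vulkan calls. The point is that an emulated mesh draw hoists its dispatch in front of the render pass that is currently open, while the indirect draw stays inside it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred encoder: commands are stored and replayed later.
class DeferredEncoder {
public:
  using Cmd = std::function<void(std::vector<std::string>&)>;

  void beginRenderPass() {
    passBegin_ = cmds_.size();   // remember where the pass starts
    record("beginRenderPass");
  }
  void endRenderPass() { record("endRenderPass"); }

  // Emulated mesh draw: the compute part goes before the pass,
  // the indirect draw is recorded in place.
  void dispatchMesh() {
    assert(passBegin_ <= cmds_.size());
    cmds_.insert(cmds_.begin() + static_cast<std::ptrdiff_t>(passBegin_),
                 make("dispatch"));
    ++passBegin_;                // keep hoisted dispatches in call order
    record("drawIndexedIndirect");
  }

  // Replays the stored commands; here it just collects their names.
  std::vector<std::string> replay() const {
    std::vector<std::string> out;
    for(const Cmd& c : cmds_)
      c(out);
    return out;
  }

private:
  static Cmd make(std::string name) {
    return [n = std::move(name)](std::vector<std::string>& out) { out.push_back(n); };
  }
  void record(std::string name) { cmds_.push_back(make(std::move(name))); }

  std::vector<Cmd> cmds_;
  size_t           passBegin_ = 0;
};
```

This also shows the downside listed above: nothing reaches the driver or the validation layers until replay() runs.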
Issues
- Rasterization order - not considered; the z-buffer is more than enough to achieve correct 3D rendering
- Mesh shader side effects - not possible due to the counting pass
- Per-primitive data - not now
- All buffers have to be preallocated with a finite size. Unfortunately we can run out of buffer memory, and there are no lazily-allocated buffers in Vulkan
- No task shader support for now - OpenGothic doesn't need it
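One possible way to live with the fixed-size buffers, sketched on the CPU: a bump allocator whose counter keeps growing past capacity, so the overflow can be detected and the required size read back afterwards. ScratchAllocator and the read-back policy are assumptions of mine, not something the engine is known to do; on the GPU the fetch_add would be an atomicAdd on a buffer-resident counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical fixed-capacity scratch allocator.
class ScratchAllocator {
public:
  explicit ScratchAllocator(uint32_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Returns the start offset of `count` elements, or nullopt when the
  // request does not fit. The counter is bumped either way, so that
  // requested() reports how much space would have been needed.
  // Note: the 32-bit counter itself may wrap on extreme overflow.
  std::optional<uint32_t> alloc(uint32_t count) {
    const uint32_t at = grow_.fetch_add(count, std::memory_order_relaxed);
    if(at > capacity_ || count > capacity_ - at)
      return std::nullopt;
    return at;
  }

  uint32_t requested()  const { return grow_.load(std::memory_order_relaxed); }
  bool     overflowed() const { return requested() > capacity_; }

private:
  uint32_t              capacity_;
  std::atomic<uint32_t> grow_{0};
};
```

With this scheme a frame that overflows drops some geometry, and the caller can reallocate to requested() elements before the next frame.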
from tempest.
Some experiments:
- Added libspiv - internal utility library for SPIR-V tooling
- First attempts to convert .mesh to .comp
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 10
; Bound: 82
; Schema: 0
OpCapability Shader
%1 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main"
OpExecutionMode %main LocalSize 1 1 1
OpSource GLSL 450
OpSourceExtension "GL_NV_mesh_shader"
OpName %main "main"
OpName %g1_MeshPerVertexNV "g1_MeshPerVertexNV"
OpMemberName %g1_MeshPerVertexNV 0 "g1_Position"
OpMemberName %g1_MeshPerVertexNV 1 "g1_PointSize"
OpMemberName %g1_MeshPerVertexNV 2 "g1_ClipDistance"
OpMemberName %g1_MeshPerVertexNV 3 "g1_CullDistance"
OpMemberName %g1_MeshPerVertexNV 4 "g1_PositionPerViewNV"
OpMemberName %g1_MeshPerVertexNV 5 "gl_ClipDistancePerViewNV"
OpMemberName %g1_MeshPerVertexNV 6 "gl_CullDistancePerViewNV"
OpName %g1_MeshVerticesNV "g1_MeshVerticesNV"
OpName %Vbo "Vbo"
OpMemberName %Vbo 0 "vertices"
OpName %_ ""
OpName %PerVertexData "PerVertexData"
OpMemberName %PerVertexData 0 "color"
OpName %v_out "v_out"
OpName %g1_PrimitiveIndicesNV "g1_PrimitiveIndicesNV"
OpName %g1_PrimitiveCountNV "g1_PrimitiveCountNV"
OpName %VkDrawIndexedIndirectCommand "VkDrawIndexedIndirectCommand"
OpMemberName %VkDrawIndexedIndirectCommand 0 "indexCount"
OpMemberName %VkDrawIndexedIndirectCommand 1 "instanceCount"
OpMemberName %VkDrawIndexedIndirectCommand 2 "firstIndex"
OpMemberName %VkDrawIndexedIndirectCommand 3 "vertexOffset"
OpMemberName %VkDrawIndexedIndirectCommand 4 "firstInstance"
OpDecorate %_runtimearr_v2float ArrayStride 8
OpMemberDecorate %Vbo 0 NonWritable
OpMemberDecorate %Vbo 0 Offset 0
OpDecorate %Vbo BufferBlock
OpDecorate %_ DescriptorSet 0
OpDecorate %_ Binding 0
OpDecorate %v_out Location 0
OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
OpDecorate %VkDrawIndexedIndirectCommand BufferBlock
OpDecorate %80 DescriptorSet 1
OpDecorate %80 Binding 0
OpMemberDecorate %VkDrawIndexedIndirectCommand 0 Offset 0
OpMemberDecorate %VkDrawIndexedIndirectCommand 1 Offset 4
OpMemberDecorate %VkDrawIndexedIndirectCommand 2 Offset 8
OpMemberDecorate %VkDrawIndexedIndirectCommand 3 Offset 12
OpMemberDecorate %VkDrawIndexedIndirectCommand 4 Offset 16
%void = OpTypeVoid
%3 = OpTypeFunction %void
%float = OpTypeFloat 32
%v4float = OpTypeVector %float 4
%uint = OpTypeInt 32 0
%uint_1 = OpConstant %uint 1
%_arr_float_uint_1 = OpTypeArray %float %uint_1
%uint_4 = OpConstant %uint 4
%_arr_v4float_uint_4 = OpTypeArray %v4float %uint_4
%_arr__arr_float_uint_1_uint_4 = OpTypeArray %_arr_float_uint_1 %uint_4
%g1_MeshPerVertexNV = OpTypeStruct %v4float %float %_arr_float_uint_1 %_arr_float_uint_1 %_arr_v4float_uint_4 %_arr__arr_float_uint_1_uint_4 %_arr__arr_float_uint_1_uint_4
%uint_3 = OpConstant %uint 3
%_arr_g1_MeshPerVertexNV_uint_3 = OpTypeArray %g1_MeshPerVertexNV %uint_3
%_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 = OpTypePointer Workgroup %_arr_g1_MeshPerVertexNV_uint_3
%g1_MeshVerticesNV = OpVariable %_ptr_Workgroup__arr_g1_MeshPerVertexNV_uint_3 Workgroup
%int = OpTypeInt 32 1
%int_0 = OpConstant %int 0
%v2float = OpTypeVector %float 2
%_runtimearr_v2float = OpTypeRuntimeArray %v2float
%Vbo = OpTypeStruct %_runtimearr_v2float
%_ptr_Uniform_Vbo = OpTypePointer Uniform %Vbo
%_ = OpVariable %_ptr_Uniform_Vbo Uniform
%_ptr_Uniform_v2float = OpTypePointer Uniform %v2float
%float_0 = OpConstant %float 0
%float_1 = OpConstant %float 1
%_ptr_Workgroup_v4float = OpTypePointer Workgroup %v4float
%int_1 = OpConstant %int 1
%int_2 = OpConstant %int 2
%PerVertexData = OpTypeStruct %v4float
%_arr_PerVertexData_uint_3 = OpTypeArray %PerVertexData %uint_3
%_ptr_Workgroup__arr_PerVertexData_uint_3 = OpTypePointer Workgroup %_arr_PerVertexData_uint_3
%v_out = OpVariable %_ptr_Workgroup__arr_PerVertexData_uint_3 Workgroup
%54 = OpConstantComposite %v4float %float_1 %float_0 %float_0 %float_1
%56 = OpConstantComposite %v4float %float_0 %float_1 %float_0 %float_1
%58 = OpConstantComposite %v4float %float_0 %float_0 %float_1 %float_1
%_arr_uint_uint_3 = OpTypeArray %uint %uint_3
%_ptr_Workgroup__arr_uint_uint_3 = OpTypePointer Workgroup %_arr_uint_uint_3
%g1_PrimitiveIndicesNV = OpVariable %_ptr_Workgroup__arr_uint_uint_3 Workgroup
%uint_0 = OpConstant %uint 0
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
%uint_2 = OpConstant %uint 2
%g1_PrimitiveCountNV = OpVariable %_ptr_Workgroup_uint Workgroup
%v3uint = OpTypeVector %uint 3
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_1 %uint_1
%v3float = OpTypeVector %float 3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
%74 = OpConstantComposite %v3float %float_1 %float_0 %float_0
%75 = OpConstantComposite %v3float %float_0 %float_1 %float_0
%76 = OpConstantComposite %v3float %float_0 %float_0 %float_1
%77 = OpConstantComposite %_arr_v3float_uint_3 %74 %75 %76
%VkDrawIndexedIndirectCommand = OpTypeStruct %uint %uint %uint %int %uint
%_ptr_Uniform_VkDrawIndexedIndirectCommand = OpTypePointer Uniform %VkDrawIndexedIndirectCommand
%80 = OpVariable %_ptr_Uniform_VkDrawIndexedIndirectCommand Uniform
%main = OpFunction %void None %3
%5 = OpLabel
%27 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_0
%28 = OpLoad %v2float %27
%31 = OpCompositeExtract %float %28 0
%32 = OpCompositeExtract %float %28 1
%33 = OpCompositeConstruct %v4float %31 %32 %float_0 %float_1
%35 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_0 %int_0
OpStore %35 %33
%37 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_1
%38 = OpLoad %v2float %37
%39 = OpCompositeExtract %float %38 0
%40 = OpCompositeExtract %float %38 1
%41 = OpCompositeConstruct %v4float %39 %40 %float_0 %float_1
%42 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_1 %int_0
OpStore %42 %41
%44 = OpAccessChain %_ptr_Uniform_v2float %_ %int_0 %int_2
%45 = OpLoad %v2float %44
%46 = OpCompositeExtract %float %45 0
%47 = OpCompositeExtract %float %45 1
%48 = OpCompositeConstruct %v4float %46 %47 %float_0 %float_1
%49 = OpAccessChain %_ptr_Workgroup_v4float %g1_MeshVerticesNV %int_2 %int_0
OpStore %49 %48
%55 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_0 %int_0
OpStore %55 %54
%57 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_1 %int_0
OpStore %57 %56
%59 = OpAccessChain %_ptr_Workgroup_v4float %v_out %int_2 %int_0
OpStore %59 %58
%65 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_0
OpStore %65 %uint_0
%66 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_1
OpStore %66 %uint_1
%68 = OpAccessChain %_ptr_Workgroup_uint %g1_PrimitiveIndicesNV %int_2
OpStore %68 %uint_2
OpStore %g1_PrimitiveCountNV %uint_1
OpReturn
OpFunctionEnd
In here:
- Mesh-related builtins promoted to shared memory
- Entry point adjusted to have no out variables for spirv<1.4
- Entry point changed to GLCompute
- Extra SSBO binding at (set=1, binding=0) introduced (set to be used as count buffer)
- gl_* prefix changed to g1_ to make spirv-cross happy
Strategy update, for compute-driven workflow:
- single execution of .mesh.comp - this will simplify code-gen and C++ workflow
- index sorting/packing with internal shaders
- manual vertex pull in generated .vert
Extra descriptor set:
struct IndirectCmd { // 32 bytes
uint indexCount;
uint instanceCount;
uint firstIndex; // prefix sum
int vertexOffset; // can be abused to offset into var_buffer
uint firstInstance; // caps: should be zero
uint self; // sequential id of dispatchMesh class, in render-pass
uint padd0;
uint padd1;
}; // 32 bytes
layout(set = 1, binding = 0, std430) buffer EngineInternal0 {
IndirectCmd cmd[];
} indirect; // indirect buffer, mostly set by CPU, except for indexCount, firstIndex
layout(set = 1, binding = 1, std430) buffer EngineInternal1 {
uint grow;
uint ibo[];
} ind;
layout(set = 1, binding = 2, std430) buffer EngineInternal2 {
uint grow;
uint vbo[];
} var;
layout(set = 1, binding = 3, std430) buffer EngineInternal3 {
uint grow; // and dispatchX
uint dispatchY; // =1
uint dispatchZ; // =1
uint desc[];
} mesh;
layout(set = 1, binding = 4, std430) buffer EngineInternal4 {
uint ibo[];
} indFlat;
Workflow by example:
enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
enc.setUniforms(pso,ubo);
enc.dispatchMesh(0,3);
enc.dispatchMesh(3,2);
Will be translated as:
enc.setUniforms(pso_compute_ms,ubo);
// vkCmdBindDescriptorSets(internalSet, dynOffset = 0);
enc.dispatch(3, 1,1);
// vkCmdBindDescriptorSets(internalSet, dynOffset = commandId);
// TODO: pass base taskID somehow
enc.dispatch(2, 1,1);
....
VkBufferMemoryBarrier(comp -> comp, indirect.ind);
// after all 'dispatchMesh' are done
// the prefix-sum pass actually does 2 jobs:
// indirect.cmd[i].firstIndex = prefixSum(indexCount);
// indirect.cmd[i].indexCount = 0; <-- will be re-accumulated in the compaction pass
enc.setUniforms(psoSum,uboSum);
enc.dispatch(1,1,1); // 1 group with 256 threads
// should be dispatch-indirect
VkBufferMemoryBarrier(comp -> comp, all helper buffers, except var);
enc.setUniforms(psoCompactage,uboCompactage);
enc.dispatchIndirect(mesh.grow,1,1);
VkBufferMemoryBarrier(comp -> vert);
// main rendering, as drawIndirect
enc.setFramebuffer({{fbo,Vec4(0,0,1,1),Tempest::Preserve}});
enc.setUniforms(pso,ubo);
enc.drawIndirect(indirect.cmd[0]);
enc.drawIndirect(indirect.cmd[1]);
// vert -> comp barrier at end of render-pass
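The prefix-sum and compaction passes from the translated workflow can be checked against a small CPU reference. This is a sketch under assumptions: IndirectCmd follows the 32-byte struct above, but MeshletOutput, prefixSumPass and compactionPass are hypothetical names, and the serial loop stands in for what would be atomicAdd-based compute shaders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct IndirectCmd {
  uint32_t indexCount;
  uint32_t instanceCount;
  uint32_t firstIndex;
  int32_t  vertexOffset;
  uint32_t firstInstance;
  uint32_t self;
  uint32_t padd0;
  uint32_t padd1;
};
static_assert(sizeof(IndirectCmd) == 32, "layout must stay at 32 bytes");

// Hypothetical: indices written by one mesh workgroup for draw `self`.
struct MeshletOutput {
  uint32_t              self;
  std::vector<uint32_t> indices;
};

// Job 1: firstIndex = exclusive prefix sum of indexCount.
// Job 2: indexCount = 0, to be re-accumulated by the compaction pass.
inline void prefixSumPass(std::vector<IndirectCmd>& cmd) {
  uint32_t sum = 0;
  for(IndirectCmd& c : cmd) {
    c.firstIndex = sum;
    sum         += c.indexCount;
    c.indexCount = 0;
  }
}

// Packs every meshlet's indices into the flat index buffer, right after
// the indices already packed for the same draw.
inline void compactionPass(std::vector<IndirectCmd>& cmd,
                           const std::vector<MeshletOutput>& meshlets,
                           std::vector<uint32_t>& indFlat) {
  for(const MeshletOutput& m : meshlets) {
    assert(m.self < cmd.size());
    IndirectCmd&   c  = cmd[m.self];
    const uint32_t at = c.firstIndex + c.indexCount; // atomicAdd on GPU
    assert(at + m.indices.size() <= indFlat.size());
    for(size_t i = 0; i < m.indices.size(); ++i)
      indFlat[at + i] = m.indices[i];
    c.indexCount += static_cast<uint32_t>(m.indices.size());
  }
}
```

Meshlets may arrive in any order; after both passes each draw owns one contiguous range of indFlat, which is what the indexed indirect draw needs.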
- Each dispatch-mesh call works as a pair of compute shader + draw-indirect
- Compute shader as well as vertex passthrough shaders are generated from a single mesh shader: cc326ee
- Once all compute passes related to draw-calls are finished, output should be sorted (only in prototype, not in engine) and forwarded to vkCmdDrawIndexedIndirect
TODO:
- Add VMeshShaderEmulated as a special case in related pieces in engine
- Take care of pipeline-memory allocation and scheduling in general
First proof of concept, kind of a triangle:
TODO: need to somehow pass firstTask and selfId to the compute shader
Current idea for passing firstTask and selfId:
Use the Y/Z inputs of vkCmdDispatchBase.
Use case: vkCmdDispatchBase(impl, firstTask, self, 0, taskCount, 1,1). This will break some builtin variables.
// workgroup dimensions
in uvec3 gl_NumWorkGroups; // not sure how this interacts with vkCmdDispatchBase
const uvec3 gl_WorkGroupSize; // unaffected
// workgroup and invocation IDs
in uvec3 gl_WorkGroupID; // Y is polluted
in uvec3 gl_LocalInvocationID; // unaffected
// derived variables
in uvec3 gl_GlobalInvocationID; // polluted, since it is byproduct of gl_WorkGroupID
in uint gl_LocalInvocationIndex; // unaffected
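A small CPU model of how the IDs come out under vkCmdDispatchBase, assuming my reading of the Vulkan spec is right: workgroup IDs range over [base, base + count) per component, and gl_GlobalInvocationID is derived from them. The dispatchBase function and Invocation struct are hypothetical; only the first invocation of each workgroup (local ID 0,0,0) is enumerated. As far as I remember, a non-zero base also requires the pipeline to be created with VK_PIPELINE_CREATE_DISPATCH_BASE.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using UVec3 = std::array<uint32_t, 3>;

struct Invocation {
  UVec3 workGroupID;
  UVec3 globalInvocationID; // for local invocation (0,0,0)
};

// Hypothetical model of vkCmdDispatchBase(base, count) with a given
// local workgroup size.
inline std::vector<Invocation> dispatchBase(UVec3 base, UVec3 count, UVec3 localSize) {
  assert(localSize[0] > 0 && localSize[1] > 0 && localSize[2] > 0);
  std::vector<Invocation> out;
  for(uint32_t z = 0; z < count[2]; ++z)
    for(uint32_t y = 0; y < count[1]; ++y)
      for(uint32_t x = 0; x < count[0]; ++x) {
        const UVec3 wg  = {base[0] + x, base[1] + y, base[2] + z};
        // gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize
        //                       + gl_LocalInvocationID (zero here)
        const UVec3 gid = {wg[0] * localSize[0],
                           wg[1] * localSize[1],
                           wg[2] * localSize[2]};
        out.push_back({wg, gid});
      }
  return out;
}
```

So with the use case above, a shader could read firstTask-relative work from gl_WorkGroupID.x and selfId from gl_WorkGroupID.y, at the cost of the polluted derived IDs listed in the comments.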
Normals are bugged-out, because translator can't handle arrayed varyings
New idea on how to avoid scratch-buffer traffic problems (and make the solution more Intel-friendly):
Decouple .mesh into separate index and vertex shaders. This can be done, in most cases, if the vertex computation is a uniform-function.
A uniform-function, to me, is:
a function that can use only constants, locals, uniforms, read-only SSBOs and push-constants in various combinations, and has no side effects.
It is similar to a pure function in a way, but less restricted. This will allow moving most of the computation to the vertex shader.
The only problem is gl_WorkGroupID.x, which is used all over the place.
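One speculative way around the gl_WorkGroupID.x dependency, which is not something proposed in this thread: let the index shader pack the workgroup ID into each emitted index, and let the generated vertex shader unpack it before calling the uniform-function. The names and the 8-bit split below are purely illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: at most 256 vertices per meshlet, so 8 bits for the local index.
constexpr uint32_t kLocalBits = 8;
constexpr uint32_t kLocalMask = (1u << kLocalBits) - 1u;

struct VertexRef {
  uint32_t workGroupId; // what the mesh shader saw as gl_WorkGroupID.x
  uint32_t localVertex; // vertex slot inside the meshlet
};

// Index-shader side: one 32-bit index carries both values.
inline uint32_t packIndex(uint32_t workGroupId, uint32_t localVertex) {
  assert(localVertex <= kLocalMask);
  assert(workGroupId <= (UINT32_MAX >> kLocalBits));
  return (workGroupId << kLocalBits) | localVertex;
}

// Vertex-shader side: recover both values from the vertex index.
inline VertexRef unpackIndex(uint32_t packed) {
  return {packed >> kLocalBits, packed & kLocalMask};
}
```

The vertex computation then only depends on read-only inputs plus the unpacked pair, which fits the uniform-function definition above; the price is fewer bits left for the workgroup ID.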