gpuopen-drivers / llpc

LLVM-Based Pipeline Compiler

License: MIT License



llpc's Issues

LGC shader compilation interface

Further to #507 ("LGC: shader compilation proposal"), this proposal concerns the LGC interface for shader compilation and later linking.

Current interface for pipeline compilation

The front-end compiles a whole pipeline using LGC APIs like this (a code sketch follows the list):

  • Create Pipeline object, populate it with tuning options and pipeline state.

  • For each shader stage, build IR module for it using Builder methods. (This part does not use the pipeline state if in BuilderRecorder mode.)

  • Call Pipeline::link to link the one or more IR modules for shader stages into a pipeline IR module.

  • At this point, the pipeline IR module is serializable for testing purposes (if in BuilderRecorder mode).

  • Call Pipeline::run to generate that pipeline module into a pipeline ELF.
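
For illustration, here is a minimal sketch of this flow in front-end C++. Only Pipeline::link and Pipeline::run are named above; the factory, setters, and helpers are assumptions.

// Hypothetical whole-pipeline compile against the LGC interface.
std::unique_ptr<lgc::Pipeline> pipeline = lgcContext->createPipeline(); // name assumed
pipeline->setOptions(tuningOptions);       // tuning options (setter assumed)
pipeline->setGraphicsState(pipelineState); // pipeline state (setter assumed)

// Build one IR module per shader stage using Builder methods.
llvm::SmallVector<llvm::Module *, 5> stageModules;
for (ShaderStage stage : activeStages)
  stageModules.push_back(buildStageModule(*builder, stage)); // helper assumed

// Link the stage modules into a single pipeline IR module, then
// generate the pipeline ELF from it.
llvm::Module *pipelineModule = pipeline->link(stageModules);
pipeline->run(pipelineModule, elfStream); // output argument assumed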

Proposed interface for shader compilation

The front-end would compile a vertex-processing half-pipeline, or a fragment shader, or a compute shader, like this (a code sketch follows the list):

  • Create Pipeline object, populate it with tuning options, and optionally some other pipeline state if it is available. One tuning option is "assume triangles".

  • If the other half-pipeline is already available as a compiled ELF, give it to the Pipeline, as it may enable more optimal shader input/output passing.

  • Call a new Pipeline function to get the hash of that part of the supplied state that will affect the half-pipeline to be compiled. (That includes linker metadata from the other half-pipeline compiled ELF if supplied.)

  • Front-end can hash that in with the input shader(s) to check its cache and retrieve a cached ELF.

  • On cache miss:

    • For each shader stage, build IR module for it using Builder methods. (This part does not use the pipeline state if in BuilderRecorder mode.)
    • Call Pipeline::link to link the one or more IR modules into a single half-pipeline IR module.
    • At this point, the half-pipeline IR module is serializable for testing purposes (if in BuilderRecorder mode).
    • Call Pipeline::run to generate a half-pipeline ELF. Where pipeline state is absent, some scheme is used for link-time resolution, as proposed in #507.
    • Front-end can cache it with the hash it calculated above.
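
In code form, the cache-aware flow above might look like the following sketch; the state-hash function is the proposal's "new Pipeline function", so its name and the cache API here are assumptions.

// Hypothetical half-pipeline (shader) compile with caching.
pipeline->setOptions(tuningOptions); // plus whatever pipeline state is known
if (otherHalfElf)
  pipeline->setOtherHalfElf(*otherHalfElf); // name assumed

Hash stateHash = pipeline->getRelevantStateHash();            // name assumed
Hash cacheKey = combineHashes(stateHash, hash(inputShaders)); // helpers assumed

if (std::optional<Elf> hit = cache.lookup(cacheKey))
  return *hit; // cache hit: reuse the half-pipeline ELF

// Cache miss: build, link, and generate as in the steps above.
llvm::Module *halfPipelineModule = pipeline->link(buildStageModules(*builder));
Elf elf = generateElf(*pipeline, halfPipelineModule); // Pipeline::run inside
cache.insert(cacheKey, elf);
return elf;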

Proposed interface to link ELFs into pipeline ELF

  • Create Pipeline object, and populate it with the complete pipeline state. (This can be the same Pipeline object as above, if that suits how the front-end is doing its compilation.)

  • From Pipeline, create new Linker object, providing it with the unlinked ELF(s).

  • Call Linker method to get a (possibly empty) list of hashes of required prologs/epilogs (fetch shader, VS epilog, FS epilog, CS prolog).

  • For each prolog/epilog:

    • Check cache for existing ELF; if found, give it to Linker object.
    • Otherwise, call Linker object to create the prolog/epilog ELF, and store ELF in the cache.
  • Call Linker function to do final ELF link, generating pipeline ELF.
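
In code form, the proposed link step might look like this; the Linker class is itself part of the proposal, so every name below is an assumption.

// Hypothetical final link of unlinked shader ELFs into a pipeline ELF.
pipeline->setGraphicsState(completePipelineState);
std::unique_ptr<lgc::Linker> linker = pipeline->createLinker(unlinkedElfs);

// Possibly empty list of hashes of required prologs/epilogs
// (fetch shader, VS epilog, FS epilog, CS prolog).
for (Hash glueHash : linker->getGlueShaderHashes()) {
  if (std::optional<Elf> hit = cache.lookup(glueHash))
    linker->addGlueShader(glueHash, *hit); // reuse cached prolog/epilog
  else
    cache.insert(glueHash, linker->compileGlueShader(glueHash));
}

Elf pipelineElf = linker->link(); // final pipeline ELF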

Structs with no members segfault

calcShaderBlockSize does getStructMemberType(0), but there are no members! The shader triggering this is attached below. (It has other problems.)
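
A minimal guard sketch for the fix; the member-count accessor alongside the reported getStructMemberType is an assumption:

// In calcShaderBlockSize: bail out before indexing member 0 of an
// empty struct (accessor name assumed).
if (structType->getStructMemberCount() == 0)
  return 0; // an empty struct occupies no block space (assumed policy)
auto *firstMemberType = structType->getStructMemberType(0); // now safe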

; SPIR-V
; Version: 1.2
; Generator: Khronos SPIR-V Tools Assembler; 0
; Bound: 165
; Schema: 0
               OpCapability Shader
               OpExtension "SPV_KHR_storage_buffer_storage_class"
               OpMemoryModel Logical Simple
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 1 1 1
               OpName %fn_return "fn_return"
               OpName %anonymous_36 "anonymous_36"
               OpName %anonymous_35 "anonymous_35"
               OpName %anonymous_34 "anonymous_34"
               OpName %anonymous_33 "anonymous_33"
               OpName %anonymous_32 "anonymous_32"
               OpName %anonymous_31 "anonymous_31"
               OpName %storage_buffer_u32_array_runtime_array_ptr "storage_buffer_u32_array_runtime_array_ptr"
               OpName %id "id"
               OpName %anonymous_30 "anonymous_30"
               OpName %input_u32_ptr "input_u32_ptr"
               OpName %fn_start_3 "fn_start_3"
               OpName %main "main"
               OpName %fn_2 "fn_2"
               OpName %x_2 "x_2"
               OpName %y_ptr_2 "y_ptr_2"
               OpName %x_ptr_2 "x_ptr_2"
               OpName %cond_3 "cond_3"
               OpName %var_9 "var_9"
               OpName %var_8 "var_8"
               OpName %loop_after_3 "loop_after_3"
               OpName %loop_continue_3 "loop_continue_3"
               OpName %loop_body_3 "loop_body_3"
               OpName %loop_merge_3 "loop_merge_3"
               OpName %loop_header_3 "loop_header_3"
               OpName %anonymous_29 "anonymous_29"
               OpName %fn_start_2 "fn_start_2"
               OpName %bignum_copy_storage_buffer_storage_buffer_storage_buffer "bignum_copy_storage_buffer_storage_buffer_storage_buffer"
               OpName %fn_1 "fn_1"
               OpName %bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0 "bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0"
               OpName %bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_1 "bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_1"
               OpName %bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_0 "bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_0"
               OpName %w_ptr_0 "w_ptr_0"
               OpName %anonymous_28 "anonymous_28"
               OpName %k_in_j_loop "k_in_j_loop"
               OpName %anonymous_27 "anonymous_27"
               OpName %anonymous_26 "anonymous_26"
               OpName %anonymous_25 "anonymous_25"
               OpName %anonymous_24 "anonymous_24"
               OpName %anonymous_23 "anonymous_23"
               OpName %anonymous_22 "anonymous_22"
               OpName %anonymous_21 "anonymous_21"
               OpName %anonymous_20 "anonymous_20"
               OpName %anonymous_19 "anonymous_19"
               OpName %anonymous_18 "anonymous_18"
               OpName %anonymous_17 "anonymous_17"
               OpName %w "w"
               OpName %w_ptr "w_ptr"
               OpName %x_1 "x_1"
               OpName %x_ptr_1 "x_ptr_1"
               OpName %k "k"
               OpName %anonymous_16 "anonymous_16"
               OpName %cond_2 "cond_2"
               OpName %var_7 "var_7"
               OpName %var_6 "var_6"
               OpName %loop_after_2 "loop_after_2"
               OpName %loop_continue_2 "loop_continue_2"
               OpName %loop_body_2 "loop_body_2"
               OpName %loop_merge_2 "loop_merge_2"
               OpName %loop_header_2 "loop_header_2"
               OpName %y_1 "y_1"
               OpName %y_ptr_1 "y_ptr_1"
               OpName %cond_1 "cond_1"
               OpName %var_5 "var_5"
               OpName %var_4 "var_4"
               OpName %loop_after_1 "loop_after_1"
               OpName %loop_continue_1 "loop_continue_1"
               OpName %loop_body_1 "loop_body_1"
               OpName %loop_merge_1 "loop_merge_1"
               OpName %loop_header_1 "loop_header_1"
               OpName %anonymous_15 "anonymous_15"
               OpName %fn_start_1 "fn_start_1"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer "bignum_mul_storage_buffer_storage_buffer_storage_buffer"
               OpName %fn_0 "fn_0"
               OpName %void "void"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_2 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_2"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_1 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_1"
               OpName %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_0 "bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_0"
               OpName %return_borrow "return borrow"
               OpName %r_ptr_0 "r_ptr_0"
               OpName %anonymous_14 "anonymous_14"
               OpName %anonymous_13 "anonymous_13"
               OpName %anonymous_12 "anonymous_12"
               OpName %anonymous_11 "anonymous_11"
               OpName %borrow "borrow"
               OpName %anonymous_10 "anonymous_10"
               OpName %anonymous_9 "anonymous_9"
               OpName %anonymous_8 "anonymous_8"
               OpName %y_0 "y_0"
               OpName %y_ptr_0 "y_ptr_0"
               OpName %x_0 "x_0"
               OpName %x_ptr_0 "x_ptr_0"
               OpName %cond_0 "cond_0"
               OpName %var_3 "var_3"
               OpName %var_2 "var_2"
               OpName %loop_after_0 "loop_after_0"
               OpName %loop_continue_0 "loop_continue_0"
               OpName %loop_body_0 "loop_body_0"
               OpName %loop_merge_0 "loop_merge_0"
               OpName %loop_header_0 "loop_header_0"
               OpName %anonymous_7 "anonymous_7"
               OpName %fn_start_0 "fn_start_0"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer "bignum_sub_storage_buffer_storage_buffer_storage_buffer"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1 "bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0 "bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_2 "bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_2"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_1 "bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_1"
               OpName %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_0 "bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_0"
               OpName %return_carry "return_carry"
               OpName %r_ptr "r_ptr"
               OpName %anonymous_6 "anonymous_6"
               OpName %anonymous_5 "anonymous_5"
               OpName %anonymous_4 "anonymous_4"
               OpName %anonymous_3 "anonymous_3"
               OpName %carry "carry"
               OpName %anonymous_2 "anonymous_2"
               OpName %anonymous_1 "anonymous_1"
               OpName %anonymous_0 "anonymous_0"
               OpName %struct_1 "struct_1"
               OpName %y "y"
               OpName %y_ptr "y_ptr"
               OpName %x "x"
               OpName %x_ptr "x_ptr"
               OpName %storage_buffer_u32_ptr "storage_buffer_u32_ptr"
               OpName %cond "cond"
               OpName %var_1 "var_1"
               OpName %var_0 "var_0"
               OpName %loop_after "loop_after"
               OpName %loop_continue "loop_continue"
               OpName %loop_body "loop_body"
               OpName %loop_merge "loop_merge"
               OpName %loop_header "loop_header"
               OpName %bool "bool"
               OpName %anonymous "anonymous"
               OpName %fn_start "fn_start"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer "bignum_add_storage_buffer_storage_buffer_storage_buffer"
               OpName %fn "fn"
               OpName %function_u32_ptr "function_u32_ptr"
               OpName %storage_buffer_u32_array_ptr "storage_buffer_u32_array_ptr"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1 "bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0 "bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_2 "bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_2"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_1 "bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_1"
               OpName %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_0 "bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_0"
               OpName %c_u32_1 "c_u32_1"
               OpName %c_u32_0 "c_u32_0"
               OpName %descriptor_set_0_0 "descriptor_set_0_0"
               OpName %descriptor_set_0_1 "descriptor_set_0_1"
               OpName %descriptor_set_0_2 "descriptor_set_0_2"
               OpName %storage_buffer_struct_ptr "storage_buffer_struct_ptr"
               OpName %struct_0 "struct_0"
               OpName %u32_array_runtime_array "u32_array_runtime_array"
               OpName %u32_array "u32_array"
               OpName %push_constants "push_constants"
               OpName %push_constant_struct_ptr "push_constant_struct_ptr"
               OpName %struct "struct"
               OpName %global_invocation "global_invocation"
               OpName %input_vector_ptr "input_vector_ptr"
               OpName %vector "vector"
               OpName %u32 "u32"
               OpDecorate %descriptor_set_0_0 Binding 0
               OpDecorate %descriptor_set_0_0 DescriptorSet 0
               OpDecorate %descriptor_set_0_1 Binding 1
               OpDecorate %descriptor_set_0_1 DescriptorSet 0
               OpDecorate %descriptor_set_0_2 Binding 2
               OpDecorate %descriptor_set_0_2 DescriptorSet 0
               OpDecorate %global_invocation BuiltIn GlobalInvocationId
        %u32 = OpTypeInt 32 0
        %164 = OpSpecConstant %u32 24
     %vector = OpTypeVector %u32 3
%input_vector_ptr = OpTypePointer Input %vector
     %struct = OpTypeStruct
%push_constant_struct_ptr = OpTypePointer PushConstant %struct
  %u32_array = OpTypeArray %u32 %164
%u32_array_runtime_array = OpTypeRuntimeArray %u32_array
   %struct_0 = OpTypeStruct %u32_array_runtime_array
%storage_buffer_struct_ptr = OpTypePointer StorageBuffer %struct_0
%storage_buffer_u32_array_ptr = OpTypePointer StorageBuffer %u32_array
%function_u32_ptr = OpTypePointer Function %u32
         %fn = OpTypeFunction %u32 %storage_buffer_u32_array_ptr %storage_buffer_u32_array_ptr %storage_buffer_u32_array_ptr
       %bool = OpTypeBool
%storage_buffer_u32_ptr = OpTypePointer StorageBuffer %u32
   %struct_1 = OpTypeStruct %u32 %u32
       %void = OpTypeVoid
       %fn_0 = OpTypeFunction %void %storage_buffer_u32_array_ptr %storage_buffer_u32_array_ptr %storage_buffer_u32_array_ptr
       %fn_1 = OpTypeFunction %void %storage_buffer_u32_array_ptr %storage_buffer_u32_array_ptr
       %fn_2 = OpTypeFunction %void
%input_u32_ptr = OpTypePointer Input %u32
%storage_buffer_u32_array_runtime_array_ptr = OpTypePointer StorageBuffer %u32_array_runtime_array
    %c_u32_0 = OpConstant %u32 0
    %c_u32_1 = OpConstant %u32 1
%descriptor_set_0_0 = OpVariable %storage_buffer_struct_ptr StorageBuffer
%descriptor_set_0_1 = OpVariable %storage_buffer_struct_ptr StorageBuffer
%descriptor_set_0_2 = OpVariable %storage_buffer_struct_ptr StorageBuffer
%push_constants = OpVariable %push_constant_struct_ptr PushConstant
%global_invocation = OpVariable %input_vector_ptr Input
%bignum_add_storage_buffer_storage_buffer_storage_buffer = OpFunction %u32 None %fn
%bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_0 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_1 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_2 = OpFunctionParameter %storage_buffer_u32_array_ptr
   %fn_start = OpLabel
%bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0 = OpVariable %function_u32_ptr Function
%bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1 = OpVariable %function_u32_ptr Function
               OpStore %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1 %c_u32_0
  %anonymous = OpISub %u32 %164 %c_u32_1
               OpStore %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0 %c_u32_0
               OpBranch %loop_header
%loop_header = OpLabel
               OpLoopMerge %loop_merge %loop_continue None
               OpBranch %loop_merge
 %loop_merge = OpLabel
      %var_0 = OpLoad %u32 %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0
       %cond = OpSLessThanEqual %bool %var_0 %anonymous
               OpBranchConditional %cond %loop_body %loop_after
  %loop_body = OpLabel
      %x_ptr = OpAccessChain %storage_buffer_u32_ptr %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_1 %var_0
          %x = OpLoad %u32 %x_ptr
      %y_ptr = OpAccessChain %storage_buffer_u32_ptr %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_2 %var_0
          %y = OpLoad %u32 %y_ptr
%anonymous_0 = OpIAddCarry %struct_1 %x %y
%anonymous_1 = OpCompositeExtract %u32 %anonymous_0 0
%anonymous_2 = OpCompositeExtract %u32 %anonymous_0 1
      %carry = OpLoad %u32 %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1
%anonymous_3 = OpIAddCarry %struct_1 %anonymous_1 %carry
%anonymous_4 = OpCompositeExtract %u32 %anonymous_3 0
%anonymous_5 = OpCompositeExtract %u32 %anonymous_3 1
%anonymous_6 = OpBitwiseOr %u32 %anonymous_2 %anonymous_5
               OpStore %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1 %anonymous_6
      %r_ptr = OpAccessChain %storage_buffer_u32_ptr %bignum_add_storage_buffer_storage_buffer_storage_buffer_arg_0 %var_0
               OpStore %r_ptr %anonymous_4
               OpBranch %loop_continue
%loop_continue = OpLabel
      %var_1 = OpIAdd %u32 %var_0 %c_u32_1
               OpStore %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_0 %var_1
               OpBranch %loop_header
 %loop_after = OpLabel
%return_carry = OpLoad %u32 %bignum_add_storage_buffer_storage_buffer_storage_buffer_var_1
               OpReturnValue %return_carry
               OpFunctionEnd
%bignum_copy_storage_buffer_storage_buffer_storage_buffer = OpFunction %void None %fn_1
%bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_0 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_1 = OpFunctionParameter %storage_buffer_u32_array_ptr
 %fn_start_2 = OpLabel
%bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0 = OpVariable %function_u32_ptr Function
%anonymous_29 = OpISub %u32 %164 %c_u32_1
               OpStore %bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0 %c_u32_0
               OpBranch %loop_header_3
%loop_header_3 = OpLabel
               OpLoopMerge %loop_merge_3 %loop_continue_3 None
               OpBranch %loop_merge_3
%loop_merge_3 = OpLabel
      %var_8 = OpLoad %u32 %bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0
     %cond_3 = OpSLessThanEqual %bool %var_8 %anonymous_29
               OpBranchConditional %cond_3 %loop_body_3 %loop_after_3
%loop_body_3 = OpLabel
    %x_ptr_2 = OpAccessChain %storage_buffer_u32_ptr %bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_0 %var_8
    %y_ptr_2 = OpAccessChain %storage_buffer_u32_ptr %bignum_copy_storage_buffer_storage_buffer_storage_buffer_arg_1 %var_8
        %x_2 = OpLoad %u32 %x_ptr_2
               OpStore %y_ptr_2 %x_2
               OpBranch %loop_continue_3
%loop_continue_3 = OpLabel
      %var_9 = OpIAdd %u32 %var_8 %c_u32_1
               OpStore %bignum_copy_storage_buffer_storage_buffer_storage_buffer_var_0 %var_9
               OpBranch %loop_header_3
%loop_after_3 = OpLabel
               OpReturn
               OpFunctionEnd
%bignum_mul_storage_buffer_storage_buffer_storage_buffer = OpFunction %void None %fn_0
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_0 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_1 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_2 = OpFunctionParameter %storage_buffer_u32_array_ptr
 %fn_start_1 = OpLabel
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0 = OpVariable %function_u32_ptr Function
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1 = OpVariable %function_u32_ptr Function
%bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2 = OpVariable %function_u32_ptr Function
%anonymous_15 = OpISub %u32 %164 %c_u32_1
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1 %c_u32_0
               OpBranch %loop_header_1
%loop_header_1 = OpLabel
               OpLoopMerge %loop_merge_1 %loop_continue_1 None
               OpBranch %loop_merge_1
%loop_merge_1 = OpLabel
      %var_4 = OpLoad %u32 %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1
     %cond_1 = OpSLessThanEqual %bool %var_4 %anonymous_15
               OpBranchConditional %cond_1 %loop_body_1 %loop_after_1
%loop_body_1 = OpLabel
    %y_ptr_1 = OpAccessChain %storage_buffer_u32_ptr %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_2 %var_4
        %y_1 = OpLoad %u32 %y_ptr_1
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2 %c_u32_0
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0 %c_u32_0
               OpBranch %loop_header_2
%loop_header_2 = OpLabel
               OpLoopMerge %loop_merge_2 %loop_continue_2 None
               OpBranch %loop_merge_2
%loop_merge_2 = OpLabel
      %var_6 = OpLoad %u32 %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0
     %cond_2 = OpSLessThanEqual %bool %var_6 %anonymous_15
               OpBranchConditional %cond_2 %loop_body_2 %loop_after_2
%loop_body_2 = OpLabel
%anonymous_16 = OpIAdd %u32 %var_6 %var_4
          %k = OpLoad %u32 %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2
    %x_ptr_1 = OpAccessChain %storage_buffer_u32_ptr %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_1 %var_6
        %x_1 = OpLoad %u32 %x_ptr_1
      %w_ptr = OpAccessChain %storage_buffer_u32_ptr %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_0 %anonymous_16
          %w = OpLoad %u32 %w_ptr
%anonymous_17 = OpIAddCarry %struct_1 %k %w
%anonymous_18 = OpCompositeExtract %u32 %anonymous_17 0
%anonymous_19 = OpCompositeExtract %u32 %anonymous_17 1
%anonymous_20 = OpUMulExtended %struct_1 %x_1 %y_1
%anonymous_21 = OpCompositeExtract %u32 %anonymous_20 0
%anonymous_22 = OpCompositeExtract %u32 %anonymous_20 1
%anonymous_23 = OpIAddCarry %struct_1 %anonymous_21 %anonymous_18
%anonymous_24 = OpCompositeExtract %u32 %anonymous_23 0
%anonymous_25 = OpCompositeExtract %u32 %anonymous_23 1
%anonymous_26 = OpIAdd %u32 %anonymous_22 %anonymous_25
%anonymous_27 = OpIAdd %u32 %anonymous_26 %anonymous_19
               OpStore %w_ptr %anonymous_24
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2 %anonymous_27
               OpBranch %loop_continue_2
%loop_continue_2 = OpLabel
      %var_7 = OpIAdd %u32 %var_6 %c_u32_1
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_0 %var_7
               OpBranch %loop_header_2
%loop_after_2 = OpLabel
%k_in_j_loop = OpLoad %u32 %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_2
%anonymous_28 = OpIAdd %u32 %var_4 %164
    %w_ptr_0 = OpAccessChain %storage_buffer_u32_ptr %bignum_mul_storage_buffer_storage_buffer_storage_buffer_arg_0 %anonymous_28
               OpStore %w_ptr_0 %k_in_j_loop
               OpBranch %loop_continue_1
%loop_continue_1 = OpLabel
      %var_5 = OpIAdd %u32 %var_4 %c_u32_1
               OpStore %bignum_mul_storage_buffer_storage_buffer_storage_buffer_var_1 %var_5
               OpBranch %loop_header_1
%loop_after_1 = OpLabel
               OpReturn
               OpFunctionEnd
%bignum_sub_storage_buffer_storage_buffer_storage_buffer = OpFunction %u32 None %fn
%bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_0 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_1 = OpFunctionParameter %storage_buffer_u32_array_ptr
%bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_2 = OpFunctionParameter %storage_buffer_u32_array_ptr
 %fn_start_0 = OpLabel
%bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0 = OpVariable %function_u32_ptr Function
%bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1 = OpVariable %function_u32_ptr Function
               OpStore %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1 %c_u32_0
%anonymous_7 = OpISub %u32 %164 %c_u32_1
               OpStore %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0 %c_u32_0
               OpBranch %loop_header_0
%loop_header_0 = OpLabel
               OpLoopMerge %loop_merge_0 %loop_continue_0 None
               OpBranch %loop_merge_0
%loop_merge_0 = OpLabel
      %var_2 = OpLoad %u32 %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0
     %cond_0 = OpSLessThanEqual %bool %var_2 %anonymous_7
               OpBranchConditional %cond_0 %loop_body_0 %loop_after_0
%loop_body_0 = OpLabel
    %x_ptr_0 = OpAccessChain %storage_buffer_u32_ptr %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_1 %var_2
        %x_0 = OpLoad %u32 %x_ptr_0
    %y_ptr_0 = OpAccessChain %storage_buffer_u32_ptr %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_2 %var_2
        %y_0 = OpLoad %u32 %y_ptr_0
%anonymous_8 = OpISubBorrow %struct_1 %x_0 %y_0
%anonymous_9 = OpCompositeExtract %u32 %anonymous_8 0
%anonymous_10 = OpCompositeExtract %u32 %anonymous_8 1
     %borrow = OpLoad %u32 %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1
%anonymous_11 = OpISubBorrow %struct_1 %anonymous_9 %borrow
%anonymous_12 = OpCompositeExtract %u32 %anonymous_11 0
%anonymous_13 = OpCompositeExtract %u32 %anonymous_11 1
%anonymous_14 = OpBitwiseOr %u32 %anonymous_10 %anonymous_13
               OpStore %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1 %anonymous_14
    %r_ptr_0 = OpAccessChain %storage_buffer_u32_ptr %bignum_sub_storage_buffer_storage_buffer_storage_buffer_arg_0 %var_2
               OpStore %r_ptr_0 %anonymous_12
               OpBranch %loop_continue_0
%loop_continue_0 = OpLabel
      %var_3 = OpIAdd %u32 %var_2 %c_u32_1
               OpStore %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_0 %var_3
               OpBranch %loop_header_0
%loop_after_0 = OpLabel
%return_borrow = OpLoad %u32 %bignum_sub_storage_buffer_storage_buffer_storage_buffer_var_1
               OpReturnValue %return_borrow
               OpFunctionEnd
       %main = OpFunction %void None %fn_2
 %fn_start_3 = OpLabel
%anonymous_30 = OpAccessChain %input_u32_ptr %global_invocation %c_u32_0
         %id = OpLoad %u32 %anonymous_30
%anonymous_31 = OpAccessChain %storage_buffer_u32_array_runtime_array_ptr %descriptor_set_0_0 %c_u32_0
%anonymous_32 = OpAccessChain %storage_buffer_u32_array_runtime_array_ptr %descriptor_set_0_1 %c_u32_0
%anonymous_33 = OpAccessChain %storage_buffer_u32_array_runtime_array_ptr %descriptor_set_0_2 %c_u32_0
%anonymous_34 = OpAccessChain %storage_buffer_u32_array_ptr %anonymous_31 %id
%anonymous_35 = OpAccessChain %storage_buffer_u32_array_ptr %anonymous_32 %id
%anonymous_36 = OpAccessChain %storage_buffer_u32_array_ptr %anonymous_33 %id
  %fn_return = OpFunctionCall %u32 %bignum_add_storage_buffer_storage_buffer_storage_buffer %anonymous_34 %anonymous_35 %anonymous_36
               OpReturn
               OpFunctionEnd

Pack shader input/output framework

The overview of shader input/output packing is described in DdnPackShaderInputOutput.md. There are five phases.

  • Phase 1 [Done] Support packing input/output on XX-FS.
    Commit Id [40b2a6b, 9b88f2e]: basic framework
    Commit Id [f1cebc3]: extend to handle 16-bit variables
    Commit Id [884d3b0]: turn on pack-in-out option by default

  • Phase 2 [Done]: Support input/output packing on tessellation pipeline. It packs on VS-TCS and TES-FS. Commit Id [8cd0205]

  • Phase 3 [Done]: Refactor PatchResourceCollect pass to prepare for supporting GS pipeline
    Commit Id [966a827]: refactor InOutLocationInfo
    Commit Id [796032b, 764ee20, 2a57f8b, 37ff8e4]: refactor inputLocInfoMap/outputLocInfoMap

  • Phase 4 [Done]: Support partial packing with dynamic indexing. Commit Id [3562e51]

  • Phase 5 [Done]: Support input/output packing on geometry pipeline. It packs on VS/TES-GS and GS-FS. Commit Id [b1bc7a3]

For VS-TCS and VS/TES-GS, scalars are treated as 32-bit and packed together.

Integrating with OSS-Fuzz

Greetings llpc developers and contributors,

We’re reaching out because your project is an important part of the open source ecosystem, and we’d like to invite you to integrate with our fuzzing service, OSS-Fuzz. OSS-Fuzz is a free fuzzing infrastructure you can use to identify security vulnerabilities and stability bugs in your project. OSS-Fuzz will:

  • Continuously run at scale all the fuzzers you write.
  • Alert you when it finds issues.
  • Automatically close issues after they’ve been fixed by a commit.

Many widely used open source projects like OpenSSL, FFmpeg, LibreOffice, and ImageMagick are fuzzing via OSS-Fuzz, which helps them find and remediate critical issues.

Even though typical integrations can be done in < 100 LoC, we have a reward program in place which aims to recognize folks who are not just contributing to open source, but are also working hard to make it more secure.

We want to stress that anyone who meets the eligibility criteria and integrates a project with OSS-Fuzz is eligible for a reward.

If you're not interested in integrating with OSS-Fuzz, it would be helpful for us to understand why—lack of interest, lack of time, or something else—so we can better support projects like yours in the future.

If we’ve missed your question in our FAQ, feel free to reply or reach out to us at [email protected].

Thanks!

Tommy
OSS-Fuzz Team

Cannot build against upstream LLVM

I tried to build LLPC against upstream LLVM by pointing XGL_LLVM_SRC_PATH to my system-installed LLVM. However, LLPC fails to build, reporting that Intrinsic::amdgcn_waterfall_begin
is not a valid intrinsic. Is there any way to work around this?

Shader compilation and specialization constants proposal

If we're going to be able to deal with specialization constants with shader compilation and linking, then we need some scheme of dealing with them with relocs.

A use of a specialization constant falls into one of two different categories:

  1. A normal use of a constant in an instruction or initializer, that can be changed with a reloc.
  2. A use in something that cannot be changed after compilation with a reloc, such as the size of an array type. For this category, we can cope if that constant is not specified in the pipeline state (so it just takes its default value in the SPIR-V). Maybe we also want to be able to cope if it is specified to have the same value as the default. We cannot cope if it is specified to have a different value.

Front-end (SPIR-V reader)

For a category 1 constant use, the SPIR-V reader should call Builder::CreateRelocConstant (which already exists), with a name that specifies that it is a specialization constant, its number, which part of it, and its default value.

CreateRelocConstant can only cope with i32. For anything else, the SPIR-V reader will have to split it up and/or bitcast it.
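
A sketch of what the SPIR-V reader might emit for a category 1 use; the symbol-name scheme here is an assumption, and only Builder::CreateRelocConstant itself is cited above.

// Emit a relocatable i32 for one 32-bit part of spec constant 5,
// encoding its number, part, and default value in the name (scheme assumed).
llvm::Value *part0 = builder->CreateRelocConstant("specconst.5.part0.default.42");

// Non-i32 uses must be split and/or bitcast by the reader, e.g. to f32:
llvm::Value *asFloat = builder->CreateBitCast(part0, builder->getFloatTy());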

For a category 2 constant use, the SPIR-V reader will call a new Builder call that we haven't invented yet, that says "this is an unchangeable specialization constant" with the same name specification as above.

Middle-end (LGC)

For a category 2 constant use, LGC needs to put the unchangeable specialization constant into the ELF in some way. Maybe it could use an LGC-private section in PAL metadata, but, for uniformity in handling the two categories of constant, I think I prefer the idea of a global variable that contains these unchangeable constants with relocs.

Linker

The linker gets provided with the specialization constants in pipeline state. It uses those to resolve the specialization constant relocs above.

A reloc for a specialization constant that is not provided uses the default value encoded in the symbol name.
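
The resolution rule in code, as a sketch (all names illustrative):

#include <cstdint>
#include <map>

// Resolve one category 1 spec-constant reloc: use the value from
// pipeline state if provided, else the default encoded in the symbol name.
uint32_t resolveSpecConstReloc(const std::map<unsigned, uint32_t> &provided,
                               unsigned specId, uint32_t defaultFromSymbol) {
  auto it = provided.find(specId);
  return it != provided.end() ? it->second : defaultFromSymbol;
}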

When processing a reloc in the global variable for unchangeable constants, if resolving it would change the value from the default, the linker returns a recoverable error so that we do a full pipeline compilation instead.

Possibly we could spot the failure case earlier, by spotting that the SPIR-V contains category 2 specialization constants. That might fail unnecessarily if the pipeline state doesn't actually set the constant, but that could be worth the saving in double compiles.

Fragment shader not writing to depth buffer correctly

I have noticed a shader that does not write to the depth buffer correctly. It seems like two OpKills with a load of an image are needed. I find it very odd that the load of the image makes a difference.

I will attach a reduced version of the SPIR-V, and the GCN that was generated. In the GCN, there is a sample trace through the code that shows the problem. Please let me know if I have misunderstood any of the instructions; I'm still learning GCN.

shaderdb tests don't run correctly

When I follow the instructions to run the tests, every test fails because it cannot find SPVGEN. The output from one of the tests is given below. SPVGEN has been built; I can see it in the build directory, but it is not on the path being searched. If I add the path to LD_LIBRARY_PATH, the tests start to work.

This is bad because the tests will pick up whichever spvgen.so happens to be in my LD_LIBRARY_PATH, not necessarily the one just built, so they may pass or fail for the wrong reason. It also adds an extra step. The build should ensure that the spvgen.so in the build directory is the only one that can be used.

stevenperron@stevenperron0:builds/Release ‹dev*›$ llvm/bin/llvm-lit -v llpc/test/shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag
-- Testing: 1 tests, 1 threads --
FAIL: LLPC_SHADERTEST :: shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag (1 of 1)
******************** TEST 'LLPC_SHADERTEST :: shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag' FAILED ********************
Script:
--
: 'RUN: at line 27';   /usr/local/google/home/stevenperron/llpc/amdvlk/drivers/xgl/builds/Release/llpc/amdllpc -spvgen-dir= -v -gfxip=9 /usr/local/google/home/stevenperron/llpc/amdvlk/drivers/llpc/test/shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag | /usr/local/google/home/stevenperron/llpc/amdvlk/drivers/xgl/builds/Release/llvm/bin/FileCheck -check-prefix=SHADERTEST /usr/local/google/home/stevenperron/llpc/amdvlk/drivers/llpc/test/shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag
--
Exit Code: 1

Command Output (stderr):
--
/usr/local/google/home/stevenperron/llpc/amdvlk/drivers/llpc/test/shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag:28:21: error: SHADERTEST-LABEL: expected string not found in input
; SHADERTEST-LABEL: {{^// LLPC}} SPIRV-to-LLVM translation results
                    ^
<stdin>:1:1: note: scanning from here
ERROR: Failed to load SPVGEN -- cannot compile GLSL
^
<stdin>:1:13: note: possible intended match here
ERROR: Failed to load SPVGEN -- cannot compile GLSL
            ^

--

********************
Testing Time: 0.10s
********************
Failing Tests (1):
    LLPC_SHADERTEST :: shaderdb/gfx9/ObjFloat16_TestTrinaryMinMaxFuncs_lit.frag

  Unexpected Failures: 1

Handling descriptor offsets as relocations

When constructing relocatable shader elf, there is some data that cannot be known at compile time. One example is the offset needed to do a descriptor load. We want to be able to use a relocation to represent this offset, which can then be filled in during the linking phase that will create the pipeline elf.

If we want to replace the constant offset with a relocation, we need to figure out 3 different ways of representing the value for the different phases of the compilation.

We already have a version of the code that will generate relocation and replace them. This code is in #457 and GPUOpen-Drivers/llvm-project#1. This version of the code is wrong in many ways, and needs to be improved.

Relocation in elf

This is the code that is generated by the initial version of llpc/llvm that generates relocations.

s_movk_i32 s0, 0x0                  // 000000000000: B0000000 
s_getpc_b64 s[6:7]                  // 000000000004: BE861C00 
0000000000000004:  R_AMDGPU_ABS32       doff_0_0
s_add_u32 s0, s3, s0                // 000000000008: 80000003 
s_mov_b32 s6, s4                    // 00000000000C: BE860004 
s_addc_u32 s1, s7, 0                // 000000000010: 82018007 
v_add_u32_e32 v0, s5, v0            // 000000000014: 68000005 
s_load_dwordx4 s[0:3], s[0:1], 0x0  // 000000000018: C00A0000 00000000 

This is wrong in a few different ways. First, the relocation offset is wrong. The relocation should point to the location in the s_movk_i32 instruction that contains the constant value. Second, the descriptor offset can only be 16-bit because that is all the s_movk_i32 instruction can hold. Third, the type of relocation is wrong. The R_AMDGPU_ABS32 is meant to be used as the absolute address of the symbol to which the relocation applies. We need to create a new relocation.

The code we want generated would be something more like:

s_mov_b32 s2, 0x0                                     // 000000000010: BE8200FF 00000000 
000000000014: R_AMDGPU_VAL32           doff_0_0
s_load_dwordx4 s[0:3], s[0:1], s2               // 000000000018: C0080000 00000002 

We still want to create a dummy symbol doff_0_0 (the descriptor offset for set 0 and binding 0). This symbol is intended to hold the constant offset of the resource, and the new relocation says we should replace the location with that constant 32-bit value.

For performance, I do not know what is best. I would like to force the s_mov_b32 to always immediately precede the s_load_dwordx4. Then we could implement an optimization in the linking phase: if the offset is less than 2^20, make the s_mov_b32 a nop instruction and rewrite the load so it uses an immediate offset instead of the register.

Representation in the compilation

The offset is added to the code during "Patch LLVM for descriptor load operations", which happens early in the compilation. We need a way to track that the offset of a particular descriptor is to be used, and that representation needs to change as we go through the different phases of the compilation.

Offset representation in llvm-ir

The translation from SPIR-V to LLVM IR generates a call to @llpc.descriptor.load.buffer. This gets expanded in "Patch LLVM for descriptor load operations", and part of the expansion is calculating the offset of the descriptor using the descriptor offset field of the pipeline info. See https://github.com/GPUOpen-Drivers/llpc/blob/dev/llpc/patch/llpcPatchDescriptorLoad.cpp#L316 for where this is calculated.

In our initial implementation, we chose to replace the offset with a call to a builtin function @llvm.amdgcn.desc.offset that represents the offset.

%20 = call i32 @llvm.amdgcn.desc.offset(i32 0, i32 0)
%21 = mul i32 0, 16
%22 = add i32 %21, %20
%23 = zext i32 %22 to i64
%24 = getelementptr [4294967295 x i8], [4294967295 x i8] addrspace(4)* %13, i64 0, i64 %23

This worked in the sense that it correctly marked that the offset of a particular descriptor is needed. However, the pattern recognition when lowering to MIR does not work well.

Offset representation in MIR

The next phase is MIR. The main problem is that ISel does not recognize the code generated above as having an offset, so it does operations on the base register instead of creating an "S_LOAD_DWORDX4_SGPR" with the offset in a register. I'm guessing we need to fix up the pattern matching.

The bigger problem is how to represent the offset of a particular descriptor. I wanted to create a dummy instruction that has two parameters, the set and the binding. However, I am not sure how to properly generate that instruction: %16:sreg_32 = S_DESC_OFFSET 0, 0.

Currently, in AMDGPU DAG->DAG Pattern Instruction Selection, I identify the builtin above and replace it with the new instruction. This is a sample of the code that would currently be generated:

%16:sreg_32 = S_DESC_OFFSET 0, 0
%17:sreg_32 = S_MOV_B32 0                                                                                                                                                                                                                                                             
%18:sreg_64 = REG_SEQUENCE killed %16:sreg_32, %subreg.sub0, killed %17:sreg_32, %subreg.sub1                                                                                                                                                                                         
%19:sreg_64 = S_ADD_U64_PSEUDO killed %13:sreg_64, killed %18:sreg_64, implicit-def dead $scc
%20:sgpr_128 = S_LOAD_DWORDX4_IMM killed %19:sreg_64, 0, 0, 0 :: (invariant load 16 from %ir.21, addrspace 4)               

We need to figure out how to define an instruction that takes two immediate operands in MIR but outputs an "S_MOV_B32" when doing machine code lowering. I can see how to do this, but I do not want to put the effort into it if the design is simply wrong.

An alternative would be to create S_LOAD_DWORDX4_RELOC. We would do pattern recognition to detect that the offset is a relocation. We would have to be careful how we handle large offsets and offsets into descriptor arrays.

Machine code lowering

In AMDGPUMCInstLower::lower, the S_DESC_OFFSET instruction is lowered to an s_movk_i32 instruction with a fixup for the immediate value. As has already been mentioned, this is a problem because it can hold only a 16-bit value. This is also where we choose the relocation type. I was unsure of what needs to be done to create a new relocation type, so I reused an existing one. Any pointers on how to create a new relocation type would be helpful.

Then SIMCCodeEmitter::getMachineOpValue is used to actually output the relocation. The offset is currently wrong, so it does not actually land on the immediate value. The size of the fixup is also wrong: it outputs a 32-bit fixup when only 16 bits can be written. All of this can easily be fixed once we have the instruction correct.

I should be able to output an s_mov_b32 instruction instead, and do a 32-bit fixup that would cover the entire offset.

Relocatable elf vertex input handling

Part of #431.

Fields in the pipeline info

The data for the vertex input state comes in two arrays. The order of the elements of the arrays does not matter, so we will show a template for a single entry.

binding[x].binding = b
binding[x].stride = s
binding[x].inputRate = r
attribute[y].location = l
attribute[y].binding = b
attribute[y].format = f
attribute[y].offset = o

This code only makes a difference when there is a load in a vertex shader from a variable that is decorated with location l. The binding stride and input rate come from the element in the binding array with the same value for binding as the attribute. In the template above, both have the same value b.

In the function PipelineContext::SetVertexInputDescriptions, we see that inputRate has only two valid values: VK_VERTEX_INPUT_RATE_VERTEX and VK_VERTEX_INPUT_RATE_INSTANCE. Which value is used will affect register usage, and will probably change the .note section. It also affects the code because it needs to know which register to use as the base address.

The offset and stride are only used to generate the vertex fetch instructions in AddVertexFetchInst. The offset seems to be added in two different ways. The first part is offset / stride, and this is added to the base address directly. The remainder offset % stride is added via the offset field in the buffer load instruction.
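
As a worked example of that split, using the numbers from the GCN sample further below (offset = 36, stride = 32):

// Split the attribute offset into an index part and an immediate part.
unsigned offset = 36, stride = 32;
unsigned indexAdd = offset / stride;  // 1: added to the base address directly
unsigned immOffset = offset % stride; // 4: goes in the load's offset: field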

The binding is used as a parameter to LoadVertexBufferDescriptor. To access the buffer, its base address needs to be loaded, and the binding field gives the offset at which that base address can be found.

The last field is the format field. This describes the format of the data in memory and how it should be transformed into a usable format for the shaders. A lot of this is encoded in the dfmt and nfmt fields of the tbuffer_load_format_* instructions. Some formats might require a few extra instructions, but they are not used in the cases that we are interested in for now. The formats that are used (preceded by their occurrence counts) are:

   2175 VK_FORMAT_R16G16B16A16_SFLOAT
   2740 VK_FORMAT_R16G16B16A16_SINT
 109154 VK_FORMAT_R16G16B16A16_SNORM
   6929 VK_FORMAT_R16G16_SFLOAT
  37087 VK_FORMAT_R16G16_SNORM
 145042 VK_FORMAT_R32G32B32A32_SFLOAT
  10686 VK_FORMAT_R32G32B32_SFLOAT
   4203 VK_FORMAT_R32G32_SFLOAT
   9640 VK_FORMAT_R8G8B8A8_UINT
  21257 VK_FORMAT_R8G8B8A8_UNORM

Here is a sample of the generated GCN for two loads from two vertex inputs:

s_getpc_b64 s[4:5]
s_mov_b32 s4, s2

; Add the base (s3) to the vertex index (v0) to get the index used for the fetch.
; The base register used will be affected by the input rate.
v_add_u32_e32 v0, s3, v0

; Load the two vertex buffer descriptors.  One for binding 0 and one for binding 1
; Note that the binding affects the offset used in this instruction.
s_load_dwordx4 s[0:3], s[4:5], 0x0
s_load_dwordx4 s[4:7], s[4:5], 0x10

; Add the "dividend" offset / stride (offset = 36, stride = 32)
v_add_u32_e32 v1, 1, v0

v_mov_b32_e32 v8, 1.0
s_waitcnt lgkmcnt(0)

; 4 loads each loading an element of the vertex input for 1 input
; Note the offset for the first element is 4, which is offset%stride.
; The offset for other elements depends on the format as well.
tbuffer_load_format_x v4, v1, s[4:7],  dfmt:2,  nfmt:1, 0 idxen offset:4
tbuffer_load_format_x v5, v1, s[4:7],  dfmt:2,  nfmt:1, 0 idxen offset:6
tbuffer_load_format_x v6, v1, s[4:7],  dfmt:2,  nfmt:1, 0 idxen offset:8
tbuffer_load_format_x v7, v1, s[4:7],  dfmt:2,  nfmt:1, 0 idxen offset:10

; 1 load loading all 4 elements of the vertex input for another input
tbuffer_load_format_xyzw v[0:3], v0, s[0:3],  dfmt:12,  nfmt:1, 0 idxen

Note the optimization for the second input. It loads all of the elements in a single load instruction. Whether or not this optimization is done depends on the stride, offset, and the format. See the condition in VertexFetch::AddVertexFetchInst. If we do not have that information readily available, we will have to disable this optimization for relocatable shader elf. We might be able to do a link-time optimization to merge the loads.

The Plan

First, disable the whole vertex load optimization for relocatable shader elf.

I believe that all of these fields can be replaced by relocations in the relocatable elf. We will need 3 relocations:

  1. For each load of a vertex buffer descriptor, we will need a relocation so that the offset on the load can be adjusted based on the binding.
  2. For each input there will be an add3 instruction that adds the base address, vertex/instance index, and the dividend.
    1. A relocation will be added to this instruction.
    2. The relocations will be used to fill in all 3 operands
  3. Add a relocation to every tbuffer_load_format keeping track of the location and element being loaded.
    1. This relocation will be used to update the dfmt, nfmt, and offset fields.
    2. This may be turned into a load immediate if the format does not have this field.

Long term, we will want to do an optimization: try to merge up to 4 consecutive tbuffer_load_format instructions that have relocations on them.

Improve accuracy of VGPR live range calculation

The live ranges computed for VGPRs in LLVM are conservative because they do not take thread-level control flow into account. For example:

    x = def();
    if (divergent_condition) {
        use(x);
        ... // A
    } else {
        use(x);
        ... // B
    }
    // No further use of x

Assuming that the order of basic blocks after structurization is as suggested by the pseudo-code, the live range of x will cover the region of code indicated by A, even though the lanes which are live there can reuse the same register for other values.

Fixing this issue will be facilitated by #758

PatchInOutImportExport shouldn't visit whole functions

The PatchInOutImportExport pass visits whole functions to find the call and return instructions. I think it would be more efficient to find the call instructions by iterating over users of the callee, and return instructions by looking at the terminator of each basic block. This should be faster when visiting very large functions with very few call and return instructions.
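
A sketch of the suggested traversal in LLVM terms (the callee set and visitor functions are assumptions):

// Find calls by walking users of each relevant callee instead of
// scanning every instruction in the function.
for (llvm::Function *callee : importExportCallees)
  for (llvm::User *user : callee->users())
    if (auto *call = llvm::dyn_cast<llvm::CallInst>(user))
      visitCallInst(*call);

// Find returns by checking only each basic block's terminator.
for (llvm::BasicBlock &block : function)
  if (auto *ret = llvm::dyn_cast<llvm::ReturnInst>(block.getTerminator()))
    visitReturnInst(*ret);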

Out of bounds read in ElfWriter::assembleNotes()

c7a4b44 uncovered a bug in the ElfWriter. The memcpy rounds the number of bytes it copies up to a multiple of sizeof(unsigned), but note.data holds only the unpadded descSize bytes, so when descSize is not a multiple of that alignment the memcpy reads past the end of the buffer.

const unsigned noteDescSize = alignTo(note.hdr.descSize, sizeof(unsigned));
memcpy(data, note.data, noteDescSize);
data += noteDescSize;

The buffer gets allocated in updateMetaNote:

auto data = new uint8_t[blob.size()];
memcpy(data, blob.data(), blob.size());
pNewNote->hdr.descSize = blob.size();
pNewNote->data = data;
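
One possible fix, as a sketch (not the project's actual patch): copy only the bytes the note owns and zero the alignment padding explicitly, so the memcpy never reads past note.data.

const unsigned noteDescSize = alignTo(note.hdr.descSize, sizeof(unsigned));
memcpy(data, note.data, note.hdr.descSize);                            // real bytes only
memset(data + note.hdr.descSize, 0, noteDescSize - note.hdr.descSize); // zero the pad
data += noteDescSize;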

Use .spvasm instead of .spvas for SPIR-V assembly text files

The standard suffix for SPIR-V assembly files is .spvasm. LLPC currently uses .spvas which causes some mild confusion.

I have a patch that moves LLPC to use .spvasm. I've renamed all the files in llpc/test/shaderdb but I'm not sure if this will cause friction in other external tools. Should LLPC support both?

[CI] Status / Release builds with assertions

Moving the discussion with @kuhar from #634 to an issue to not clutter up pull requests too much.

I didn’t make progress on release builds with assertions. The ‘Fast Jenkins Build’ does a debug build with sanitizers, so we currently have a CI build with assertions switched on.
There are some outstanding issues with this. Most prominently, you cannot see the error messages without going to the Jenkins website. There are some falsely failing builds/tests (sorry @trenouf, you got most of them) caused by the base image being rebuilt only once a day, and some failing builds because the md5 sums of the cache do not match (whatever goes wrong there…). I plan to fix these and also run CTS with sanitizers, but getting there will take time, especially as we will need to build our own infrastructure for some of these things.

Sanitizer support got pushed to GitHub today (yay 🎉), so we can add GitHub CI builds. Things needed in the GitHub workflow:

  • Set -DXGL_USE_SANITIZER="Address;Undefined"
  • Set the environment variable ASAN_OPTIONS=detect_leaks=0. I would appreciate seeing no leaks, but someone has to fix them first.
  • I think libclang-common-9.0-dev needs to be installed
  • There is one patch that I think still needs to be applied to drivers/spvgen/external/glslang: https://gist.github.com/Flakebi/530ff13056407237fdbdeaab5dd740bf
    This was merged upstream a while ago but nobody updated our pinned version.

So, about assertions in release builds. There does not seem to be a clean way to do this as CMake adds -DNDEBUG to release options by default. I see two options:

  1. Follow the way LLVM implements LLVM_ENABLE_ASSERTIONS, which means using regex to strip out NDEBUG from compilation flags.
  2. Add a new RelWithAsserts target that sets release flags manually and does not add NDEBUG. This sounds like it needs every subproject to support this target the same as Release, which does not make it appealing either.

Any opinions on this?

amdllpc does not preserve debug information

Hello there,

I'm not even sure if I'm posting this issue on the correct project in the suite, so apologies if that isn't the case.

I'm trying to compile a simple Vulkan fragment shader with amdllpc. According to the AMDGPU target user guide, the ELFs it produces can contain a .debug section with DWARF data. However, I can't seem to get it:

$ llvm-objdump --all-headers ellipse.elf

ellipse.elf:    file format ELF64-amdgpu

architecture: amdgcn
start address: 0x0000000000000000

Program Header:

Dynamic Section:
Sections:
Idx Name            Size     VMA              Type
  0                 00000000 0000000000000000
  1 .strtab         00000052 0000000000000000
  2 .text           00000250 0000000000000000 TEXT
  3 .note           000002a4 0000000000000000
  4 .AMDGPU.disasm  000017c8 0000000000000000
  5 .note.GNU-stack 00000000 0000000000000000
  6 .symtab         00000048 0000000000000000

SYMBOL TABLE:
0000000000000098         .text  00000000 BB0_1
0000000000000000 g     F .text  00000250 _amdgpu_ps_main

I've figured out that I need to set -trim-debug-info=false so the SPIR-V debug info isn't stripped, and I had a look at the SPIR-V lowering code and it seems like it preserves debug info. I can also see in the LLVM bitcode emitted to stdout that some symbols are there. How do I get the DWARF info out?

Here's the command line I'm using:

$ amdllpc -gfxip=9.0.6 ../ellipse.frag -trim-debug-info=false -enable-outs

And attached is the stdout output, which contains the GLSL source and all the intermediate stages: outs.log

Support loading multiPlane image descriptor

To support multiplane image descriptor loading, I have uploaded the following pipeline dump file for compiler implementation reference:
SpvPipelineMultiPlane.zip

The multiplane image contains an immutable sampler and no more than 3 planes; each plane uses 8 DWORDs for its resource descriptor.

The descriptor set layouts for all the cases are listed below.

  • A binding with array size N containing a 1-plane image:
    • descriptorRangeValue.arraySize = N
    • node.type = DescriptorYCbCrSampler
    • node.sizeInDwords = 8 * N

  • A binding with array size N containing a 2-plane image:
    • descriptorRangeValue.arraySize = N
    • node.type = DescriptorYCbCrSampler
    • node.sizeInDwords = 16 * N

  • A binding with array size N containing a 3-plane image:
    • descriptorRangeValue.arraySize = N
    • node.type = DescriptorYCbCrSampler
    • node.sizeInDwords = 24 * N

In general, node.sizeInDwords = resourceDescSizeInDwords * M * N, where M is the plane count; for example, a 3-plane binding with arraySize 4 takes 8 * 3 * 4 = 96 DWORDs.

The loaded descriptors for each plane should be fetchable in ImageBuilder::CreateImageSampleConvert().
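To make the sizing rule concrete, here is a minimal C++ sketch (the helper name is hypothetical, not an LLPC API) that computes node.sizeInDwords from the plane count and the binding's array size:

#include <cassert>
#include <cstdint>

// Size in DWords of one plane's resource descriptor (8, per the dump above).
constexpr uint32_t ResourceDescSizeInDwords = 8;

// Hypothetical helper: total size of a DescriptorYCbCrSampler node for a
// binding that is an array of `arraySize` images with `planeCount` planes.
uint32_t computeYCbCrNodeSizeInDwords(uint32_t planeCount, uint32_t arraySize)
{
    assert(planeCount >= 1 && planeCount <= 3 && "no more than 3 planes");
    return ResourceDescSizeInDwords * planeCount * arraySize;
}

For example, computeYCbCrNodeSizeInDwords(2, N) yields 16 * N, matching the 2-plane case above.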

Cache creator tool

Problem description

Compiling shaders while a game is running can be time consuming and lowers the quality of the experience. This can mean long wait times when a level starts or, much worse, hitching during gameplay. One solution is to create a cache of precompiled pipelines or shaders, so that much less work needs to be done while the game is running.

This leads to a second problem: how do we create the cache? For some games, building the cache as people play does not work; developers are not able to see how their game will perform while they are developing it. It is also a time consuming process that does not scale well, especially when a game has a large number of shaders that only run in very specific situations.

In this document, we propose a tool that we will call the “LLPC cache creator”. It will create a shader cache that LLPC can use, without running the application. The input to the tool will be the SPIR-V for the shaders plus a few other options, such as the target architecture. It will write a cache to disk that can be passed to LLPC.

Developer workflow

Developers write the shaders for their game. Those are given to the cache creator, with appropriate inputs, to produce the cache for LLPC. Once the cache is available, it is a file that can be loaded and passed to vkCreatePipelineCache.

The developer will not have to run their game to prime the cache; instead, this is accomplished offline.

Requirements

  • The shader cache that is built does not require running the game.
  • The input is
    • Spir-v binaries
    • The gpu architecture to target
    • Possible values for spec constants
    • Possible formats for images, samplers, frame buffers, etc.
  • The output is a shader cache that can be used by LLPC
  • Initially, the shader cache will have a 100% hit rate for
    • Compute pipelines meeting the constraints of the inputs
    • Graphics pipelines composed of just a vertex shader and a fragment shader meeting the constraints of the inputs.
  • On average, the runtime of the generated code when using the cache must not be more than 5% worse than the code generated by doing a full compile.
  • On average, the cache hit time must be no more than 20% of the cache hit time of the current implementation.
  • The size of the shader cache cannot be larger than the size of the cache generated by the current implementation.
  • The tool must run on a generic Linux system, even if no AMD GPU is available.

Relocatable shader elf

The cache creator tool has to build a cache that covers cases that may never come up, which increases the size of the cache. We need implementation strategies that reduce the cache's size in order to meet the requirement that it be no larger than with the current implementation. This will be accomplished by compiling CS, VS, and PS shaders into individual relocatable shader elf files that can be reused in multiple pipelines. These elf files will be stored in the cache in place of the current full pipeline elf files.

The relocatable elf will be linked in 5 stages (a sketch of stages 2 and 3 follows this list):

  • Apply the relocations in the .rel.text sections of the relocatable elf files, using information from the pipeline create info.
  • Append the updated .text sections to form the new .text section.
  • Copy the symbols from the symbol tables of the relocatable elfs, with updated offsets.
  • Merge the data in the .note sections of the relocatable elfs and the pipeline create info.
  • Merge the other sections by appending the data from one to the other.
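As an illustration of stages 2 and 3, here is a simplified C++ sketch. It uses plain stand-in structs rather than real ELF reading code (a real implementation would use LLVM's object-file libraries), concatenating the .text sections while rebasing symbol offsets:

#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins for the relevant ELF pieces.
struct Symbol {
    std::string name;
    uint64_t    value; // offset into its shader's .text section
};

struct ShaderElf {
    std::vector<uint8_t> text;    // .text contents, relocations already applied
    std::vector<Symbol>  symbols; // symbols defined in .text
};

struct LinkedPipeline {
    std::vector<uint8_t> text;
    std::vector<Symbol>  symbols;
};

// Stages 2 and 3: append each shader's .text section and re-emit its symbols
// with offsets rebased to the start of that shader's chunk in the merged
// section.
LinkedPipeline mergeTextAndSymbols(const std::vector<ShaderElf>& shaders)
{
    LinkedPipeline out;
    for (const ShaderElf& shader : shaders) {
        const uint64_t base = out.text.size();
        out.text.insert(out.text.end(), shader.text.begin(), shader.text.end());
        for (const Symbol& sym : shader.symbols)
            out.symbols.push_back({sym.name, base + sym.value});
    }
    return out;
}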

Constructing relocatable shader elf

To define the exact content of a relocatable shader elf file, every piece of data that appears in a .pipe file must be classified into one of the following categories:

  • Relocatable: This piece of information can be replaced by a relocation in .text section and can be fixed up when linking relocatable shader elf. These values will not need to be included in the hash for the cache.
  • Patchable: This piece of information does not affect the .text section, only the .note section. The .note section can be updated at link time. These values will not need to be included in the hash for the cache.
  • Determinable: This piece of information has to match data in the spir-v, and can have only a single value. It will not be handled in any special way. These values will be included in the hash for the cache, but are not strictly needed.
  • Bounded: This is information that has a finite range of possible values. These values must be included in the hash for the cache.
  • Unbounded: This is information that could take on an infinite (or very large) number of values. There is no special handling for these values when creating a relocatable elf. They must be part of the hash for the cache.

Attached is a first pass at how each value could be handled: pipeline state - Sheet1 (1).pdf

Cache creator tool

The cache creator tool will be incorporated into the amdllpc tool as an option. For each spir-v file it is given, it will construct all possible pipeline build infos, build the relocatable elf for each of them, and cache the results as it goes along.

The cache creator tool will handle each type of value in the build info as follows:

  • Relocatable: The specific value does not matter since it will not be used. A single default value will be used.
  • Patchable: Like relocatable values, a default value will be used.
  • Determinable: The spir-v will be parsed to find the single possible value, and that value will be used.
  • Bounded: The cache creator tool will create a relocatable shader elf for each possible value.
  • Unbounded: The user will have to provide a set of possible values as an input to the cache creator tool, and the tool will build a relocatable elf for each possible value.

Note that the combinations build on each other. Say there are two bounded values with 2 possible values each and two unbounded values for which the user supplies 3 possible values each; then a total of 2*2*3*3=36 relocatable elf versions will be generated.
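To make the combinatorics concrete, here is a minimal C++ sketch that counts the relocatable elf variants for a set of bounded and user-supplied unbounded values:

#include <cstddef>
#include <iostream>
#include <vector>

// Each entry is the number of candidate values for one bounded or
// user-supplied unbounded pipeline-state item.
size_t countVariants(const std::vector<size_t>& valueCountsPerItem)
{
    size_t total = 1;
    for (size_t count : valueCountsPerItem)
        total *= count;
    return total;
}

int main()
{
    // Two bounded items with 2 values each and two unbounded items with 3
    // user-provided values each, as in the example above.
    std::cout << countVariants({2, 2, 3, 3}) << "\n"; // prints 36
}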

To further reduce the number of combinations, we will have to classify which members of the pipeline create info are used by which shader types.

Invalid llvm-ir generated in the emulib.

When a function is defined with a particular calling convention, all calls to that function must use the same calling convention. Some of the code generated in the emulation library does not follow this rule. One example is @_Z9normalizeDv3_f. Here is the generated code:

; Function Attrs: nounwind
define spir_func <3 x float> @_Z9normalizeDv3_f(<3 x float>) #0 {
  %2 = call float @_Z6lengthDv3_f(<3 x float> %0)
  %3 = fdiv float 1.000000e+00, %2
  %4 = extractelement <3 x float> %0, i32 0
  %5 = extractelement <3 x float> %0, i32 1
  %6 = extractelement <3 x float> %0, i32 2
  %7 = call float @llvm.amdgcn.fmul.legacy(float %4, float %3)
  %8 = call float @llvm.amdgcn.fmul.legacy(float %5, float %3)
  %9 = call float @llvm.amdgcn.fmul.legacy(float %6, float %3)
  %10 = insertelement <3 x float> undef, float %7, i32 0
  %11 = insertelement <3 x float> %10, float %8, i32 1
  %12 = insertelement <3 x float> %11, float %9, i32 2
  ret <3 x float> %12
}

Notice the call to @_Z6lengthDv3_f does not have a calling convention mentioned, which means it defaults to the C calling convention. However, the definition of @_Z6lengthDv3_f is

; Function Attrs: nounwind
define spir_func float @_Z6lengthDv3_f(<3 x float>) #0 {
  %2 = extractelement <3 x float> %0, i32 0
  %3 = extractelement <3 x float> %0, i32 1
  %4 = extractelement <3 x float> %0, i32 2
  %5 = fmul float %2, %2
  %6 = fmul float %3, %3
  %7 = fadd float %5, %6
  %8 = fmul float %4, %4
  %9 = fadd float %7, %8
  %10 = call float @llvm.sqrt.f32(float %9)
  ret float %10
}

Notice the calling convention is spir_func. This can cause problems. It does not for now, because these function calls are inlined and inlining does not notice the mismatch. However, if this code is run through instcombine before inlining, the function call will be replaced by a store to an undefined address, which will in turn be turned into a trap.
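When emitting such calls with the LLVM C++ API, one way to guarantee the conventions match is to copy the callee's calling convention onto every call site; a minimal sketch:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Emit a call to `callee` that uses the callee's own calling convention
// (spir_func here) instead of defaulting to the C calling convention.
CallInst* emitMatchingCall(IRBuilder<>& builder, Function* callee,
                           ArrayRef<Value*> args)
{
    CallInst* call = builder.CreateCall(callee, args);
    call->setCallingConv(callee->getCallingConv());
    return call;
}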

LGC: shader compilation proposal

LGC: shader compilation proposal

There are several different efforts to move away from whole-pipeline
compilation in LLPC, or that will affect LLPC in the future. This proposal is
to unify them in new LGC (LLPC middle-end) functionality.

  • There is a "partial pipeline compilation" scheme in LLPC that kind of hacks
    into LGC's otherwise whole-pipeline compilation, and does ELF linking in the
    front-end using ad-hoc ELF reading and writing code, rather than LLVM code.

  • Steven et al have started work on their scheme to be able to compile separate
    shaders (VS, FS, CS) offline to pre-populate a shader cache, with some
    pipeline state missing, and some pipeline state guessed with multiple
    combinations per shader. This builds on the front-end linking functionality
    above. See Github issues
    Cache creator tool,
    Relocatable elf vertex input handling,
    Handling descriptor offsets as relocations.

  • There are AMD-internal discussions about shader compilation.

This proposal is to unify these different efforts to use new LGC (LLPC
middle-end) functionality. The link stage in particular requires knowledge that
should be in the middle-end, such as the workings of PAL metadata, and ELF
reading and writing, and needs to be shared and used by potential multiple LLPC
front-ends.

Background

Existing whole pipeline compilation

Whole-pipeline compilation in LLPC works like this:

  1. For each shader, run the front-end shader compilation: SPIR-V reader and
    various "lowering" passes use Builder to construct the IR for a shader. This
    phase does not use pipeline state.
  2. LGC (the middle-end) is given the pipeline state, and it links the shader IR
    modules into a pipeline IR module.
  3. LGC runs its middle-end passes and optimizations, then passes the resulting
    pipeline IR module to the AMDGPU back-end for pipeline ELF generation.

Existing(ish) shader and partial pipeline caching

Existing partial pipeline compilation

There are some changes on top of this to handle a "partial pipeline
compilation" mode. Part way through step 2, LGC calls a callback provided by
the front-end with a hash of each shader and the pipeline state and
input/output info pertaining to it. The callback in the front-end can ask to
omit a shader stage, if it finds it already has a cached ELF containing that
shader. Then, the front-end has a post-compilation ELF linking step to use the
part of that cached ELF for the omitted shader. This only works for VS-FS, and
has some other provisos, because of the way that it plucks the part of the
pipeline it needs out of a whole pipeline ELF.

This scheme has some disadvantages, especially the way that it allows the
middle-end to think that it is compiling a whole pipeline, but it then
post-processes the ELF to extract the part it needs. A more holistic approach
would be for the middle-end to know that it is not compiling a whole pipeline,
and for the link stage to be in the middle-end where knowledge of (for example)
PAL metadata should be confined to.

Steven et al's shader caching

Steven's scheme is to offline compile shaders to pre-populate a shader cache.
This would involve compiling a shader with most of the pipeline state missing
(principally resource descriptor layout, vertex buffer info and color export
info), and with some "bounded" items in the pipeline state set to a guessed
value. The resulting compiled shader ELF would be cached keyed on the input
SPIR-V and (I assume) the "bounded" parts of the pipeline state that were set.

The proposal

This proposal outlines a shader compilation scheme using relocs, prologs and
epilogs, and a pipeline linking stage, all handled in LGC (the LLPC
middle-end).

Shader compilation vs pipeline compilation

This proposal does not cover how and when a driver decides to do shader
compilation. Of the two compilation modes:

  • shader compilation and caching with pipeline linking for minimized compile time;
  • full pipeline compilation for optimal code;

there is scope for API and/or driver changes to use shader compilation first,
then kick off a background thread to do the optimized compilation and swap the
result in at the next opportunity.

Early vs late shader caching

We can divide existing and proposed shader caching schemes into two types:

  • Early shader caching caches the shader keyed on just its input language
    (SPIR-V for Vulkan), possibly combined with some of the pipeline state.
    Steven's scheme is an example.

  • Late shader caching caches the shader after some part of the compilation has
    taken place, and keys it on the state of the compilation at that point. The
    existing partial pipeline compilation scheme is an example.

I propose to focus here on early shader caching, which has the following pros and cons:

  • Pro: Minimize compilation time for cache-hit case
  • Pro: Fits in with Steven's scheme
  • Con: Only limited VS-FS optimization possible (although even late shader
    caching still has some limits on this, unless you make it so late you have
    done a large chunk of the compilation).

Nicolai also suggests taking the existing partial pipeline compilation scheme,
a late shader caching scheme, and tidying up its interface and implementation
(see Inter-shader data cache tracking).
One problem is that we pretty much have to choose one or the other; within one
application run, you can't use both at the same time, because trying to do so
means that a shader gets cached both early and late, and the next time the
same shader is seen, the early cache check always succeeds.

The choice partly depends on how you view the existing partial pipeline
compilation scheme: was a late shader caching scheme chosen for the possibility
of VS-FS optimizations, or was it chosen because that meant that it could be
implemented without implementing the relocs and prologs and epilogs in this
proposal? I suspect the latter, and I reckon we're better with an early shader
caching scheme for the two pros I list above.

What shaders are cached

This proposal makes no attempt to cache VS, TCS, TES, GS shaders that make up
part of a geometry or tessellation vertex-processing stage. The FS in such a
pipeline can still be cached though. So the shader types that can be cached
are:

  • CS
  • VS as long as it is standalone (if a VS accidentally gets compiled and cached
    and turns out not to be standalone, it just gets ignored, but hopefully you
    can tell that a VS is likely not to be standalone before getting to that
    point)
  • FS

In addition, we can compile the whole vertex-processing stage (VS-GS,
VS-TCS-TES, or VS-TCS-TES-GS) without the FS, or with an already-compiled FS.

Failure of shader compilation or pipeline linking

There needs to be scope for shader compilation or pipeline linking to fail, in
which case the front-end needs to do full pipeline compilation instead:

  • Shader compilation can fail if the compiler can tell in advance that the
    shader does something that will not work in the shader compilation model, for
    example a VS that is obviously not a standalone VS.

  • Pipeline linking can fail because the pipeline uses something that is not
    possible to implement in this model, for example:

    • converting sampler
    • specialization constant that shader compilation did not render as a reloc
      (used in a type) and whose value does not match the shader default
    • descriptor set split between its own table and the top-level table

    It also needs to fail when the pipeline uses something that has not yet
    been implemented in this model.

This kind of failure is different to normal compilation failure, in that it
needs to exit cleanly and clean up, because the driver or front-end is going to
retry as a full pipeline compilation. If any such condition is detected in an
LLVM pass flow, we need to come up with a clean exit mechanism, such as
deleting all the code in the module and detecting that at the end.

Prologs and epilogs

Compiling shaders with some or all pipeline state missing and without the other
shader to refer to means that the pipeline linker needs to generate prologs and
epilogs.

CS prolog

If the compilation of a CS without resource descriptor layout puts its user
data sgprs in the wrong order for the layout in the pipeline state, then the
linker needs to generate a CS prolog that loads and/or swaps around user data
sgprs. The linker picks up the descriptor set to sgpr mapping that the CS
compilation used from the user data registers in the PAL metadata.

VS prolog

If vertex buffer information is unavailable at VS compile time, then the linker
needs to generate a VS prolog (a "fetch shader") that loads vertex buffer
values required by the VS. The VS expects the values to be passed in vgprs, and
the linker picks up details of which vertex buffer locations and in what format
from extra pre-link metadata attached to the VS ELF.

VS epilog

If the VS (or whole vertex-processing stage) is compiled without information on
how the FS packs its parameter inputs, then the VS compilation does not know
how to export parameters, and the linker needs to generate a VS epilog. The VS
(or last vertex-processing-stage shader) exits with the parameter values in
vgprs, and the VS epilog takes those and exports them. The linker picks up
information on what parameter locations are in which vgprs and in what format
from extra pre-link metadata attached to the VS ELF, and information on how
parameter locations are packed and arranged from extra pre-link metadata
attached to the FS ELF.

No FS prolog

No FS prolog is ever needed. FS compilation decides how to pack and arrange its
input parameters.

FS epilog

If the FS is compiled without color export pipeline state, then it does not
know how to do its exports, and the linker needs to generate an FS epilog. The
FS exits with its color export values in vgprs (and the exec mask set to the
surviving pixels after kills/demotes), and the FS epilog takes those and
exports them. The linker picks up information on what color exports are in
which vgprs and in what format from extra pre-link metadata attached to the FS
ELF.

Prolog/epilog compilation notes

A prolog has the same input registers as the shader it will be attached to,
minus the vgprs that are generated by the prolog for passing to the shader
proper. That is, the shader's SPI register settings that determine what
registers are set up at wave dispatch apply to the prolog.

For a VS prolog where the VS is part of a merged shader (including the NGG
case), the code to set exec needs to be in the prolog.

The exact same set of registers are also outputs from the prolog, plus the
vgprs that are generated by the prolog.

A prolog/epilog is generated as an IR module then compiled. The compiled ELF is
cached with the hash of the inputs to the prolog/epilog IR generator being the
key.

In the context of a prolog being generated as IR then compiled (a sketch of this construction follows the list below):

  • Input args represent the input registers, with sgprs marked as "inreg", same
    as the IR for a shader.
  • IR can only have a single return value, which here is a struct containing the
    preserved input sgprs and vgprs, plus the vgprs generated by the prolog for
    passing to the shader. By including sgprs as ints and vgprs as floats in the
    return value struct, the back-end calling convention ensures that they are
    allocated to sgprs and vgprs appropriately.
  • We can assume that compiling a prolog will never need scratch, so with that
    single "shader prolog/epilog" calling convention, we don't need to worry that
    it doesn't know how to find the scratch descriptor (which is different
    between compute, single shader and merged shader including NGG).
  • Compiling the prolog with that "shader prolog/epilog" calling convention
    leaves its sgpr and vgpr usage in some well-known place, e.g. the
    SPI_SHADER_RSRC1_VS register in PAL metadata. The linker needs to take the
    maximum usage of that and the shader proper.
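As a concrete illustration of the above, here is a sketch of constructing such a prolog function with the LLVM C++ API; the counts, the function name, and the use of i32/float for sgpr/vgpr values are illustrative, not a fixed ABI:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Build the skeleton of a prolog: numSgprs i32 args marked "inreg" (so they
// are allocated to sgprs) followed by numInVgprs float args (allocated to
// vgprs). The single struct return value carries the preserved inputs plus
// the numGenVgprs vgprs the prolog generates for the shader proper.
Function* createPrologSkeleton(Module& m, unsigned numSgprs,
                               unsigned numInVgprs, unsigned numGenVgprs)
{
    LLVMContext& ctx = m.getContext();
    Type* i32Ty = Type::getInt32Ty(ctx);
    Type* floatTy = Type::getFloatTy(ctx);

    SmallVector<Type*, 16> argTys(numSgprs, i32Ty);
    argTys.append(numInVgprs, floatTy);

    SmallVector<Type*, 16> retTys(numSgprs, i32Ty);
    retTys.append(numInVgprs + numGenVgprs, floatTy);
    Type* retTy = StructType::get(ctx, retTys);

    Function* fn = Function::Create(FunctionType::get(retTy, argTys, false),
                                    GlobalValue::ExternalLinkage, "prolog", &m);
    for (unsigned i = 0; i < numSgprs; ++i)
        fn->getArg(i)->addAttr(Attribute::InReg); // sgpr args are "inreg"
    return fn;
}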

An epilog's input registers are the same as the shader's output registers,
which is the vgprs containing the values to export. (This may need to change to
also have some sgprs passed for VS epilog parameter export on gfx11, if
parameter exports are going to be replaced by normal off-chip memory writes.)

Prolog/epilog generation even in pipeline compilation

In a case where a particular prolog or epilog is not needed (e.g. the VS prolog
when vertex buffer information is available at VS compilation time), I propose
that LGC internally uses the same scheme of setting up a shader as if it is
going to use the prolog/epilog (including setting up the metadata for the
linker), and then uses the same code to generate the IR for the prolog/epilog
as would otherwise be used at link time. Then it would merge the prolog/epilog
into the shader at the IR stage, allowing optimizations from there.

The advantage of that is that there is less different code in LGC between the
shader and pipeline compilation cases.

A change this causes is that the vertex buffer loads are all at the start of
the VS, even in a pipeline compilation. I'm not sure whether that is good, bad
or neutral for performance. (Ignoring the NGG culling issue for now.)

NGG culling

An early version of this feature should probably just ignore this case, because
it is quite complex.

With NGG culling, it is advantageous to delay vertex buffer loads that are only
used for parameter calculations until after the culling. Thus, for an NGG VS,
there should be two VS prologs (fetch shaders). The VS compilation needs to
generate the post-culling part as a separate shader, such that the second fetch
shader can be glued in between them. At that point (the exit of the first
shader), sgprs and vgprs need to be as at wave dispatch, except that the vgprs
(vertex index etc) have been copied through LDS to account for the vertices
being compacted. Also exec needs to reflect the compacted vertices.

Jumping between prolog, shader and epilog

I'm not sure how possible this is, or if there is a better idea, but:

We want the generated code to reflect that it is going to jump to the next part
of the shader. So, when generating the prolog, or when generating the shader
proper when there will be an epilog, we want to have an s_branch with a reloc,
rather than an s_endpgm. Perhaps we could tell the backend that by defining a
new function attribute giving the symbol name to s_branch to when generating
what would otherwise be an s_endpgm.

Linking a prolog, shader and epilog would then just work with the s_branch.
Linking could optimize that by ensuring the chunks of code are glued together
in the right order, and removing a final s_branch. Alignment is a
consideration:

  • The start of the glued-together shader must be a multiple of 256.
  • The main part of the shader should start cache-line-aligned, so anything the
    compiler has done to align loop starts etc remains valid.
  • Padding could be done by adding s_nops, except that any final s_waitcnts
    should be moved to after the s_nops as an optimization.
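For the alignment bookkeeping, the padding computation itself is simple; a sketch (the 256-byte shader start alignment is from the first bullet above, and the 64-byte cache-line size in the example is illustrative):

#include <cstdint>

// Bytes of padding needed to raise `offset` to the next multiple of `align`
// (align must be a power of two). Used when gluing prolog/shader/epilog
// chunks so that the main part of the shader stays suitably aligned.
uint64_t paddingTo(uint64_t offset, uint64_t align)
{
    return (align - (offset & (align - 1))) & (align - 1);
}

// Example: pad a 60-byte prolog so the shader proper starts on a 64-byte
// (cache-line) boundary: paddingTo(60, 64) == 4, i.e. one 4-byte s_nop.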

The LGC interface

I propose that we extend LGC (LLPC middle-end) to handle the various requirements.

Currently LGC has an interface that says:

  • Here are the IR modules for the shaders and the pipeline state; link into a
    pipeline IR module.
  • Go and run middle-end and back-end passes to generate a pipeline ELF.

That interface needs to be extended to allow compilation of a shader with
missing or incomplete pipeline state, and to allow linking of
previously-compiled shader ELFs and pipeline state.

We would probably want to implement compilation of a geometry and/or
tessellation pipeline by providing LGC with IR modules for non-FS shaders, a
previously-compiled shader ELF for the FS, and the pipeline state. That allows
the other shaders to be compiled knowing which attribute exports will be unused
by the FS so can be removed.

Compilation modes

The compilation modes LGC would support (in probable order of implementation priority) are:

  1. Pipeline compilation, as now. Must be provided with full pipeline state.
    Generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  2. Compilation of a single shader with missing or partial pipeline state. The
    shader must be CS, FS, or VS in a non-tessellation non-geometry pipeline.
    For VS or FS, this may or may not be provided with the other shader already
    compiled, which would provide parameter information. Generates an ELF that
    needs to be pipeline linked. Then there is a link stage in LGC that takes
    such ELFs and generates a pipeline ELF satisfying the PAL pipeline ELF spec.
  3. Compilation of the vertex-processing part of a geometry or tessellation
    pipeline, with full pipeline state. This may or may not be provided with the
    already-compiled FS ELF, which would supply parameter layout information.
    Generates an ELF that needs to be pipeline linked.

Note that the above modes do not include any case where a shader is compiled
separately, and then in the link stage needs to be combined with another shader
to create a merged shader or an NGG prim shader.

Tuning options

As proposed by Rob, tuning options should always be made available at shader
compilation time. This probably does mean that all tuning has to be done per
shader, not per pipeline. Most tuning options are per-shader anyway, except the
NGG ones, which obviously apply only to the VS in a VS-FS pipeline.

Use of the LGC interface by the front-end

VS-FS parameter optimization

As pointed out by Nicolai, the use of early shader caching limits the parameter
optimizations that can be done between VS and FS, and how that is limited
depends on whether you compile the VS first or the FS first. I consider that it
is worth taking this hit because of the saving in compile time in the cache-hit
case.

FS first

In this scheme, at VS compilation time, we know exactly how parameters are
packed by the FS, so we can generate the parameter exports and we do not need a
VS epilog. We can also see where the FS does not use a parameter at all, and
DCE it and its calculation in the VS. However we cannot do constant parameter
propagation into the FS.

VS first

In this scheme, VS compilation does not know how parameters will be laid out by
the FS, so we need a VS epilog. This does allow constant parameter propagation
into the FS, because the VS's parameter metadata can include an indication that
a parameter is a constant so is not being returned in a vgpr at all. FS
compilation will see this metadata, and propagate the constant into the FS,
saving an export/import. (Note that LLPC doesn't do this at all currently.)
However, the dead parameter (one not used by the FS) optimization is limited to
the VS epilog spotting it does not need to export it. The calculation of the
dead parameter, and any vertex buffer load needed only for that, does not get
DCEd.

Other VS-FS parameter optimizations we miss out on

Here are some examples of potential optimizations Nicolai mentioned that we
miss out on by using early shader caching:

  • A transform that lifts certain instructions, such as "multiply parameter by a
    constant" to the vertex shader.
  • A transform that lifts uniformity backwards, e.g. if there is information
    (such as an annotation) in the fragment shader that proves that a parameter
    must be uniform, that information could be back-propagated into the vertex
    shader.
  • A transform that propagates range / scalar evolution information ("this
    parameter is always an integer between 0 and 10")

All these are possible when doing a full pipeline compile.

LLPC front-end changes

The LLPC interface would need to change so that a partial pipeline state (and
tuning options) is provided to the shader compile function. That function would
then check the shader cache, and, if a compile is needed, do front-end
compilation then call the LGC interface with the partial pipeline state.

The pipeline compile function would check the cache for its shaders or partial
pipeline. The difficulty here is that it does not know how much of the pipeline
state was known at shader compile time, so there may need to be some mechanism
for storing multiple shader ELFs for a particular shader in the cache, with a
way of finding one whose assumed pipeline state is compatible (see the sketch
below).
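A minimal sketch of such a cache shape (hypothetical types, not the LLPC cache interface): each shader hash maps to several ELFs, each tagged with the pipeline state it assumed, and lookup scans for a compatible entry:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical: the subset of pipeline state a shader ELF was compiled
// against, keyed by item name. An entry is compatible with a pipeline if
// every value it assumed matches the pipeline's value for that item.
struct AssumedState {
    std::unordered_map<std::string, uint32_t> items;

    bool isCompatibleWith(const AssumedState& full) const
    {
        for (const auto& kv : items) {
            auto it = full.items.find(kv.first);
            if (it == full.items.end() || it->second != kv.second)
                return false;
        }
        return true;
    }
};

struct CachedShaderElf {
    AssumedState         assumed;
    std::vector<uint8_t> elf;
};

class ShaderCache {
    std::unordered_map<uint64_t, std::vector<CachedShaderElf>> entries;

public:
    void insert(uint64_t shaderHash, CachedShaderElf e)
    {
        entries[shaderHash].push_back(std::move(e));
    }

    // Return the first cached ELF whose assumed state is compatible with the
    // pipeline being compiled, if any.
    const CachedShaderElf* lookup(uint64_t shaderHash,
                                  const AssumedState& pipelineState) const
    {
        auto it = entries.find(shaderHash);
        if (it == entries.end())
            return nullptr;
        for (const CachedShaderElf& e : it->second)
            if (e.assumed.isCompatibleWith(pipelineState))
                return &e;
        return nullptr;
    }
};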

amdllpc

Steven proposes using a modified amdllpc as his offline shader compile tool.
Thus, that will be calling the LLPC shader compile function with an incomplete
pipeline state containing values for the "bounded" items.

The proposed un-pipeline-linked ELF module

Such an ELF is the result of anything other than full pipeline compilation. It
contains various things to represent the parts of the pipeline state or
inter-shader-stage linking information that was unavailable at the time it was
compiled.

Representation of metadata needed for linking

Some of the items below list metadata that needs to be left in the unlinked ELF
for the link stage to read. I propose that we will define a new section in the
PAL metadata msgpack tree to put these in. The link stage will remove that
metadata.

Representation of final PAL metadata

Some parts of the PAL metadata can be directly generated in a shader compile
before linking. Hopefully all the link stage needs to do is merge the two
msgpack trees, ORing together any register that appears in both. That handles
the case that the same register has a part used by VS and a part used by FS.

Resource descriptor layout

If resource descriptor layout was unavailable at shader compile time, then the
load of a descriptor from its descriptor table has a reloc on its offset where
the symbol name gives the descriptor set and binding. Such relocs are resolved
at link time, when the resource descriptor layout pipeline state is available.
This work is already underway by Steven from Gibraltar.

In addition, an array of image or sampler descriptors needs a reloc for the
array stride. That is different depending on whether it is actually an array of
combined image+samplers, and you can't tell at shader compile time.

For a descriptor set pointer that can fit into a user data sgpr, the PAL
metadata register for that user data sgpr contains the descriptor set number.
The link stage updates that to give the spill table offset. Work on this
mechanism is underway by David Zhou in AMD
(although in the context of the front-end ELF linking mechanism). There needs
to be some way of telling whether the PAL metadata register represents a
fully-linked spill table offset, or an unlinked descriptor set number. I
believe David's work already does that.

For a descriptor set pointer that cannot fit into a user data sgpr, it is
loaded from the spill table with a reloc on the offset whose symbol gives the
descriptor set. That reloc is resolved at link time.
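As an illustration of how such a relocated offset can be materialized in IR, here is a sketch using the AMDGPU llvm.amdgcn.reloc.constant intrinsic, which recent LLVM provides for this purpose; the symbol naming convention below is hypothetical:

#include <string>

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Materialize a 32-bit offset that the linker fixes up once the resource
// descriptor layout is known. The reloc symbol name encodes the descriptor
// set and binding; the exact naming scheme here is illustrative.
Value* createDescOffsetReloc(IRBuilder<>& builder, unsigned descSet,
                             unsigned binding)
{
    LLVMContext& ctx = builder.getContext();
    std::string symName = "descoffset_" + std::to_string(descSet) + "_" +
                          std::to_string(binding);
    Value* symNameMd = MetadataAsValue::get(ctx, MDString::get(ctx, symName));
    return builder.CreateIntrinsic(Intrinsic::amdgcn_reloc_constant, {},
                                   {symNameMd});
}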

We will have to ban the driver putting any descriptors into the top level of
the descriptor layout:

  • Currently, if a descriptor set contains both dynamic and non-dynamic
    descriptors, the driver puts the dynamic ones in the top level. This proposal
    would not be able to find them.

  • Banning that also avoids the use of compact descriptors, which we also cannot
    cope with in this proposal.

A compute shader's user data has a restriction on which spill table entries can
be put into user data sgprs, and in what order. For that reason, the link stage
may need to prepend code to load and/or swap around sgprs for descriptor set
pointers.

Vertex inputs

If vertex input information is unavailable at VS compile time, then vertex
inputs are passed into the vertex shader in vgprs, with metadata saying which
inputs they are and what type. The link stage then constructs a "fetch shader",
and glues it on to the front of the shader.

The fetch shader has an ABI where the vertex shader's input registers are also
the fetch shader's inputs and outputs, except that the vertex input values are
obviously not part of the fetch shader's inputs.

Color exports

If color export information is unavailable at FS compile time, then color
exports are passed out of the fragment shader in vgprs, with metadata saying
which exports they are and what type. The link stage then constructs an FS
epilog, and glues it on to the back of the shader. The shader exits with exec
set to pixels that are not killed/demoted.

The following pipeline state items also affect color export code, so the
absence of any of them also forces the use of an FS epilog:

  • alphaToCoverageEnable
  • dualSourceBlendEnable

Parameter exports and attribute inputs

In a shader compile, parameter exports are passed out of the last stage
vertex-processing shader in vgprs, with metadata saying which parameters they
are. In an unlinked fragment shader, attributes are packed and there is
metadata saying how that is done. The link stage then ties them up, and adds an
epilog to the last stage vertex-processing stage.

enableMultiView

enableMultiView has several impacts:

  • What gl_Layer and gl_ViewIndex actually are
  • Whether and what to export as pos1

It looks like the best way of handling this if enableMultiView is unavailable
at VS compile time is to compile the two alternatives for each thing inside an
if..else..endif with a reloc as the condition.

perSampleShading

If the perSampleShading item is unavailable at FS compile time, and the FS uses
gl_SampleMask or gl_PointCoord, then the compiler needs to generate code for
both alternatives inside an if..else..endif where the condition is a reloc.

PAL metadata items

Certain pipeline state items do not affect compilation except for being copied straight into PAL metadata registers:

  • depthClipEnable
  • rasterizerDiscardEnable
  • topology
  • userClipPlaneMask

In a shader compile with a link stage, it is the link stage that copies these items into PAL metadata.

Relocatable items

As pointed out by Steven's document
pipeline state - Sheet1 (1).pdf,
the following items are relocatable. That is, if the item is unavailable in
pipeline state at shader compile time, a simple 32-bit constant load with a
reloc will work, so it can be resolved at link time:

  • deviceIndex
  • numSamples
  • samplePatternIdx

We should probably add the shadow descriptor table high 32 bits to this too.

Specialization constants

Steven's document claims that SPIR-V specialization constants can be handled by
relocs. That is only partly true:

  • Where a specialization constant is used somewhere a reloc can be used (an
    operand to an instruction in function code), then the SPIR-V reader could
    call a new Builder function "get reloc value". The name of the symbol
    referenced by the reloc is private to the SPIR-V LLPC front-end, and is not
    understood by LGC.

  • Where a specialization constant is used somewhere a reloc cannot be used
    (e.g. the size of an array type), then the SPIR-V reader uses the default
    value for that constant, and it somehow needs to record what value it used so
    the linker can later check that the specialization constants supplied with
    the pipeline do not clash with that. If they do clash, then the link fails
    and the front-end needs to start again compiling that shader.

  • At the link stage, the front-end needs to supply a list of symbol,value pairs
    to the linker to satisfy the relocs. I'm not sure whether it is worth
    encapsulating that in an ELF.

Bounded items that we need to make relocatable

These are pipeline state items that Steven's document lists as "bounded", that
is, there is a limited range of values that each one can take. Gibraltar's
proposal to handle this in their offline shader cache populating scheme is to
compile a shader multiple times with these items set to the most popular
values, in the hope of covering most cases that the shader is used in a
pipeline.

The implication of this is that the shader cache needs to be able to keep
multiple ELFs for the same shader, with different assumptions about these
pipeline state items. When a pipeline compile looks for a cached shader, there
needs to be some mechanism where it can find the one with a compatible state
for these items.

However, for the purposes of app runtime shader compilation, we need to find
some way of making these fixuppable by the link stage. In some cases, that
might involve generating code that can handle all possibilities, and then
having a branch with a reloc to select the required alternative.

  • perSampleShading

NGG control items

These items are supplied to the compiler through pipeline state to save needing
to load them at runtime from the primitive shader table. If they are
unavailable at shader compile time, then the compiler is forced to load from
the primitive shader table.

  • cullMode
  • depthBiasEnable
  • frontFace

These items are similar, except certain settings also need to force NGG
pass-through mode. Therefore, if the items are unavailable at shader compile
time, we need to force NGG pass-through mode.

  • polygonMode, except that setting polygonMode to line or point forces NGG pass-through mode

Items only needed for tessellation or geometry

These pipeline state items are only used for tessellation or geometry. Because
this proposal insists that a vertex-processing half-pipeline with tessellation
or geometry has to be compiled with full pipeline state, these items do not
need to be handled by a reloc:

  • patchControlPoints
  • switchWinding

The link stage

The link stage needs to:

  • generate CS or VS prolog;
  • generate VS epilog;
  • generate FS epilog;
  • merge and patch up PAL metadata;
  • glue prolog and epilogs on to the corresponding shader;
  • apply relocs;
  • assemble the pipeline ELF.

A prolog is generated to end with an s_branch with a reloc to branch to the VS.

Where an FS needs an epilog (color export information was unavailable at shader
compile time), it is generated with an s_branch with a reloc instead of an
s_endpgm, to branch to its epilog code.

In both cases, we can optimize by gluing sections in the right order, and
applying the optimization that a chunk of code that ends with an s_branch can
have the s_branch removed and turned into a fallthrough. There may need to be
special handling for a prolog to ensure that the CS or VS remains
instruction-cache-line-aligned, such as inserting s_nop padding before the
fetch shader.

Prologs will be generated as IR then compiled. They will be cached so that will
not happen very often.

[CI] Disable the amdvlk cron action on private forks

It seems like more people are suffering from annoying workflow failures on their private forks, e.g., https://github.com/dnovillo/llpc/runs/566051267. A temporary workaround is to disable 3rd party actions in github settings for the affected forks.

The amdvlk workflow should only run on this repo. I tried to fix it before, but couldn't figure out the proper syntax. I asked on the github community forums and got this helpful answer: https://github.community/t5/GitHub-Actions/Run-scheduled-workflows-only-from-the-main-repository/m-p/45855#M6361.

I'm opening an issue to not forget about this and not to lose the link.

NGG non-GS rework

We are going to rework the original NGG non-GS framework. Recent performance tuning suggests we had better make some architectural changes. This ticket lists the ongoing sub-tasks.

  1. Subgroup compaction (culling the whole subgroup or not) has proved to be less useful; vertex compaction is more flexible. NggCompactMode will remove the item NggCompactSubgroup and replace it with a new one for future use, NggCompactNone, which means no vertex compaction.

  2. Initially, deferred attribute shading was done only for vertex parameter data; all vertex position data was fetched before culling. In fact we only need pos0 data before culling (plus cull distance data if cull-distance culling is enabled); the other position data can also be treated as deferred attributes and fetched after culling and vertex compaction. The current granularity is not ideal.

  3. Some special cases disable culling:
    (1) Vertex count < 16.
    (2) gl_Position is not used in the shader or is constant.
    (3) A memory/resource write operation is involved; in that case culling and compaction cannot be done.

  4. The major change is to group the vertex compacted data together and stop allocating LDS regions for it based on the maximum subgroup thread count. Instead, an emulated ES-GS ring is used to dynamically address the info and allocate only the required size. The affected regions are: LdsRegionPosData, LdsRegionDrawFlag, LdsRegionCullDistance, LdsRegionVertThreadIdMap, LdsRegionCompactXXX.

  5. A single-wave subgroup optimization is under experiment: all primitive and vertex threads stay in one wave. This could avoid LDS barriers and much of the multi-wave calculation.

  6. As in 3, in the future more pipeline state will be passed to the compiler via the primitive shader const buffer, such as the primitive topology. The culling decision will then be made at run time, so the shader will use a calculated cullingMode flag (an SGPR) to decide whether to execute the culling code path. This comes not only from the separate-shader linking requirement but also from an extension requirement introducing the concept of dynamic state.

Improve support for the image instruction NSA (non-sequential address) encoding

Some shaders have highly redundant addresses for sequences of image loads (or samples). For example, consider a blur kernel that loads an NxN block of texels from an image. There are really only 2*N different coordinates for the N*N loads. Without the NSA encoding, putting those loads into a clause blocks 2*N*N registers. With the NSA encoding, only 2*N registers are blocked.

For instruction density, it is beneficial not to use the NSA encoding, so the LLVM backend currently uses an extremely simplistic heuristic that will fail in this case. Really, the register allocator should be aware that for image instructions, it is not necessary to put addresses consecutively on gfx10, while still taking into account that doing so is beneficial.

SpirvLowerGlobal shouldn't visit the whole module

The SpirvLowerGlobal pass visits the whole module to find loads and stores to global variables, and call and return instructions. I think it would be more efficient to find the load and store instructions by iterating over the users of the global variable (and recursively the users of any getelementptr instructions), and to find the call instructions by iterating over users of the callee, and to find the return instructions by looking at the terminator of each basic block. This should be faster when visiting very large functions with very few load/store/call/return instructions.
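A sketch of the suggested traversal for the load/store part (the recursion through getelementptr users is handled with a small worklist; GEPOperator covers both GEP instructions and GEP constant expressions):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Collect all loads and stores reachable from a global variable by walking
// its users, instead of scanning every instruction in the module.
void collectGlobalAccesses(GlobalVariable* gv,
                           SmallVectorImpl<Instruction*>& accesses)
{
    SmallVector<Value*, 8> worklist(1, gv);
    while (!worklist.empty()) {
        Value* v = worklist.pop_back_val();
        for (User* user : v->users()) {
            if (isa<LoadInst>(user) || isa<StoreInst>(user))
                accesses.push_back(cast<Instruction>(user));
            else if (isa<GEPOperator>(user))
                worklist.push_back(user);
        }
    }
}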

Build failure with GPUOpen-Drivers/llvm

When I build ICD with amd-vulkan-dev branch of GPUOpen-Drivers/llvm, it shows me compiler errors:

external/llpc/patch/llpcPatchBufferOp.cpp:1290:48: error: no matching member function for call to 'CreateMemSet'
            Value* const pMemSet = m_pBuilder->CreateMemSet(pCastMemoryPointer,
                                   ~~~~~~~~~~~~^~~~~~~~~~~~
external/llvm/include/llvm/IR/IRBuilder.h:440:13: note: candidate function not viable: no known conversion from 'const llvm::Align' to 'unsigned int' for 4th argument
  CallInst *CreateMemSet(Value *Ptr, Value *Val, uint64_t Size, unsigned Align,
            ^
external/llvm/include/llvm/IR/IRBuilder.h:448:13: note: candidate function not viable: no known conversion from 'uint32_t' (aka 'unsigned int') to 'llvm::Value *' for 3rd argument
  CallInst *CreateMemSet(Value *Ptr, Value *Val, Value *Size, unsigned Align,
            ^
external/llpc/patch/llpcPatchBufferOp.cpp:1352:48: error: no matching member function for call to 'CreateMemSet'
            Value* const pMemSet = m_pBuilder->CreateMemSet(pCastMemoryPointer,
                                   ~~~~~~~~~~~~^~~~~~~~~~~~
external/llvm/include/llvm/IR/IRBuilder.h:440:13: note: candidate function not viable: no known conversion from 'const llvm::Align' to 'unsigned int' for 4th argument
  CallInst *CreateMemSet(Value *Ptr, Value *Val, uint64_t Size, unsigned Align,
            ^
external/llvm/include/llvm/IR/IRBuilder.h:448:13: note: candidate function not viable: no known conversion from 'unsigned int' to 'llvm::Value *' for 3rd argument
  CallInst *CreateMemSet(Value *Ptr, Value *Val, Value *Size, unsigned Align,
            ^
2 errors generated.

This is mainly because the amd-vulkan-dev branch of GPUOpen-Drivers/llvm is behind the master branch of GPUOpen-Drivers/llvm-project.
The dev branch of GPUOpen-Drivers/llpc uses APIs that assume we use GPUOpen-Drivers/llvm-project.
See the difference in llvm/lib/IR/IRBuilder.cpp between the amd-vulkan-dev branch of GPUOpen-Drivers/llvm and the master branch of GPUOpen-Drivers/llvm-project.

Does this mean GPUOpen-Drivers will use GPUOpen-Drivers/llvm-project as its main LLVM repo?

Undefined behaviour by invalid enum values

The values defined here

// Internal built-ins for fragment input interpolation (I/J)
static const BuiltInKind BuiltInInterpPerspSample = static_cast<BuiltInKind>(0x10000000);
static const BuiltInKind BuiltInInterpPerspCenter = static_cast<BuiltInKind>(0x10000001);
static const BuiltInKind BuiltInInterpPerspCentroid = static_cast<BuiltInKind>(0x10000002);

are not actually members of the BuiltInKind enum, whose enumerators come from BuiltInDefs.h entries such as:

BUILTIN(BaryCoordNoPersp, 4992, N, P, v2f32) // Linearly-interpolated (I,J) at pixel center

UBSan rightfully flags this as undefined behavior; the values should be added to the enum here:

enum BuiltInKind {
#define BUILTIN(name, number, out, in, type) BuiltIn##name = number,
#include "lgc/BuiltInDefs.h"
#undef BUILTIN
};
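One possible fix along those lines: declare the internal built-ins as real enumerators instead of casting values from outside the enum's range, e.g.:

enum BuiltInKind {
#define BUILTIN(name, number, out, in, type) BuiltIn##name = number,
#include "lgc/BuiltInDefs.h"
#undef BUILTIN
  // Internal built-ins for fragment input interpolation (I/J); declaring
  // them here makes every used value a member of the enum, avoiding UB.
  BuiltInInterpPerspSample = 0x10000000,
  BuiltInInterpPerspCenter = 0x10000001,
  BuiltInInterpPerspCentroid = 0x10000002,
};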

Tagging @trenouf because I think he introduced that code.

NGG GS culling framework

I’m going to add NGG GS culling. The algebraic algorithms are the same as for non-GS (see the function NggPrimShader::DoCulling); no change there. The part particular to GS (compared to non-GS) is how to run the culling algorithms on each output primitive (triangle strip). The steps are:

  1. Run GS and generate output primitives.
  2. Revisit those output primitives’ connectivity data, fetch the position data (three vertices), and perform culling, within a loop over all output primitives.
  3. If the primitive survives, set a survive bit mask and update the connectivity data (changing vertex indices accordingly because of vertex compaction). If the primitive is culled, change its connectivity data to the NULL primitive (0x80000000).
  4. After the above steps, the outputs will be: a group of survive bits (in LDS, to be used when exporting vertex positions and parameters), the final surviving output vertex count (updating the original output vertex count), and the updated, vertex-compacted primitive connectivity data.

I am attaching a simulation program to illustrate and verify the processing; it will be the prototype of the future IR implementation. Any discussion of unclear points is welcome. A C++ sketch of step 3 follows.
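Here is a small C++ sketch of step 3 under simplified assumptions (a stand-alone loop over packed connectivity data, with the real culling test and the index remapping for vertex compaction stubbed out):

#include <cstddef>
#include <cstdint>
#include <vector>

// NULL primitive marker written to the connectivity data of culled
// primitives (see step 3 above).
constexpr uint32_t NullPrim = 0x80000000;

// Placeholder for the real test, which runs the shared culling algorithms
// (NggPrimShader::DoCulling) on the primitive's three vertex positions.
bool primitiveSurvives(uint32_t /*connectivity*/)
{
    return true;
}

// Step 3: mark survivors in a bit mask and null out culled primitives.
// Remapping the surviving vertex indices for vertex compaction is omitted.
uint32_t cullOutputPrims(std::vector<uint32_t>& connectivity,
                         std::vector<bool>& surviveMask)
{
    uint32_t surviveCount = 0;
    surviveMask.assign(connectivity.size(), false);
    for (size_t i = 0; i < connectivity.size(); ++i) {
        if (primitiveSurvives(connectivity[i])) {
            surviveMask[i] = true;
            ++surviveCount;
        } else {
            connectivity[i] = NullPrim;
        }
    }
    return surviveCount;
}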
