Comments (27)

linqun commented on July 30, 2024
  1. It seems the default value isn't mentioned in the document.
    For example: if the input format is rgb8 and the type of the shader input variable is vec4, we need to assign a default value of 1.0 to the w channel.
  2. vertex_attribute_divisor isn't mentioned in this document. Do you plan to support this feature in relocatable ELF?
  3. I have a small concern about always forcing per-channel loading. Is it possible to always generate two kinds of loads in the LLVM back end, and patch one of the loads with NOP instructions based on the real input info?
    The rest looks good.

from llpc.

trenouf commented on July 30, 2024

How do you cope with not knowing the rate (per-vertex or per-instance, plus the complication of the instance divisor that Qun mentioned) at compile time?

I think the "traditional" way of handling this, before the advent of pipeline compilation, was that any vertex inputs are just inputs to the shader, and the pipeline link stage generates the vertex buffer loads (a "fetch shader") and sticks it on the front. Maybe this is the way to go; trying to use relocs means that we lose the main advantage of pipeline compilation of vertex inputs, which is that we only need to load each used vertex buffer descriptor once.

nhaehnle commented on July 30, 2024

I had a bunch of detail comments here, like how you really shouldn't worry too much about the offset vs. stride thing since that's arguably an accident of the Vulkan API and should be fixed there.

However, I'm sorry to say that I think that from a high-level point of view, you're going down entirely the wrong path for this. The proposal is incredibly complex and doesn't have a lot of advantages (if any) over the much simpler thing:

Define an ABI whereby the main vertex shader expects to receive the relevant built-ins (like VertexID) as well as all of its vertex array inputs sequentially in VGPRs; access to the vertex array input VGPRs can be gated on executing an s_waitcnt vmcnt(0), for slightly better load pipelining.

You can then compile the vertex shader as-is without worrying about vertex attributes at all. At runtime, we can compile a dedicated "vertex fetch shader" which contains all the relevant loads into the relevant VGPRs, and that shader simply ends with an s_setpc_b64 that branches to the main shader. Empirically, compiling these shader snippets tends to be fast, and we probably only have to do it once for many different vertex shaders.

I strongly urge you to go down that path instead. If you think this path has relevant limitations, then let's dive into those details instead.

s-perron commented on July 30, 2024

Thanks for the feedback and your patience. After thinking about it, I agree the prologue route would be better. The main concern I have is the amount of time it takes to do the linking.

Here is how I see things going. Let me know if there are any details that I may have missed.

  1. LLPC will be changed so that, when generating a relocatable shader, all of the vertex inputs will be treated as inputs that are already in VGPRs.
  2. This should involve changes to the patching passes in LLPC and not much else. I can investigate that more.
  3. The link phase will prepend a prologue to the .text section of the vertex shader that will load all of the vertex inputs into the registers in which they are expected to be.
  4. This prologue will not be "compiled". That is, we will not generate LLVM IR and then generate the machine code using LLVM.
    1. I expect this would be too slow.
  5. The prologue will be generated by writing the required instructions directly in machine code.
    1. This has the disadvantage of potentially requiring a new implementation for each new hardware generation.
    2. It has the advantage of being much quicker than generating LLVM IR and compiling it using LLVM.

The mapping for locations will be to sort the inputs by location and then compress. This is pretty straightforward, except for arrays. Consider this example. The input values will need to be loaded into 4 registers because it is an array of 4 values. This information is available in the SPIR-V but, as far as I know, not in the VkPipelineVertexInputStateCreateInfo. When generating the prologue for the vertex shader, we will have to get this information from somewhere. I do not want to parse the SPIR-V. Does anybody have any suggestions?

nhaehnle commented on July 30, 2024
1. LLPC will be changed so that, when generating a relocatable shader, all of the vertex inputs will be treated as inputs that are already in VGPRs.
2. This should involve changes to the patching passes in LLPC and not much else.  I can investigate that more.

Agreed.

3. The link phase will prepend a prologue to the .text section of the vertex shader that will load all of the vertex inputs into the registers in which they are expected to be.
4. This prologue will not be "compiled".  That is, we will not generate LLVM IR and then generate the machine code using LLVM.
   1. I expect this would be too slow.

I'm not so sure about that. I think it makes sense to explore how long this actually takes through LLVM, because as you write, going through LLVM gives you the benefit that the same code can be used to generate vertex input fetches in all compilation modes. Given the occasional "weird" workaround that we need to implement (e.g. A2 component handling), it seems desirable to do all of that only once.

Besides, the number of different vertex fetch prologs is probably very small: they could easily be cached.

The mapping for locations will be to sort the inputs by location and then compress. This is pretty straightforward, except for arrays. Consider this example. The input values will need to be loaded into 4 registers because it is an array of 4 values. This information is available in the SPIR-V but, as far as I know, not in the VkPipelineVertexInputStateCreateInfo. When generating the prologue for the vertex shader, we will have to get this information from somewhere. I do not want to parse the SPIR-V. Does anybody have any suggestions?

How about: at the time of compiling the vertex shader, create an array of the following structure:

struct VertexInputLocationAbi {
    unsigned location; // the Location decoration in SPIR-V
    unsigned componentCount; // the number of components
    unsigned firstVgpr; // the VGPR index of the first component
};

I would suggest adding a convention that both location and firstVgpr must be sorted in increasing order. I would even seriously consider removing the firstVgpr member entirely, instead adding a firstInputVgpr variable next to the array, with VGPR indices for subsequent array entries being derived from firstInputVgpr and the componentCounts.

In the example you linked to, the array would be (in tuple notation):

[(0, 2, 0), (1, 3, 2), (2, 1, 5), (3, 4, 6), (4, 4, 10), (5, 4, 14), (6, 4, 18)]

It should be possible to generate this array during patching somehow, maybe llpcPatchEntryPointMutate (at that point you have to build the new LLVM IR function signature, so the data must be available somehow).

Generating the prolog should be possible by combining this array and the VkPipelineVertexInputStateCreateInfo.

linqun commented on July 30, 2024

FYI:

  1. The DXX and OGL drivers generate machine instructions directly for the fetch shader, and PAL also plans to support this style.
  2. The prologue style may conflict with NGG, especially when culling is enabled. @amdrexu do you have any comment about it?

amdrexu commented on July 30, 2024

For NGG, the fetch shader might have performance impacts, because we just want to fetch those vertex inputs that contribute to the position calculation. Other, irrelevant inputs are not loaded until the primitive passes the culling tests. If we loaded all vertex inputs at the beginning of fetch-shader execution, the impact could be noticeable if the number of vertex inputs is not small and a high culling rate is achieved for this vertex shader.

nhaehnle commented on July 30, 2024

This is a bit of a tangent, but:

The DXX and OGL drivers generate machine instructions directly for the fetch shader, and PAL also plans to support this style.

With the LLPC design, it fits better for LLPC to generate the fetch shader. That way, the ABI between fetch shader and main shader part is controlled entirely by one software component (LLPC), rather than having it spread between different software components.

This is independent of the question of whether the fetch shader is generated via LLVM or not. I still maintain that using LLVM makes sense as it is easier to maintain, and fetch shaders should be shared across enough pipelines that compiling a new one will actually be extremely rare.

NGG

It was made clear from the very beginning that this approach of pre-compilation comes at a potential runtime performance cost. So I think the important part here is to have an overall design that is adaptable if at some point somebody wants to integrate the pre-compilation with fancier culling techniques. I think this is largely the case as long as the relevant ABI is under the control of a single software component...

s-perron commented on July 30, 2024

I will start implementing this. I will generate the fetch shader using LLVM, and see what kind of impact it has.

For the concerns about NGG, there are a lot of conditions on when it will make a noticeable difference. I will implement it as is, and measure. I will keep this in mind when looking at the performance impact of relocatable shaders.

s-perron commented on July 30, 2024

Generating the fetch shader

The fetch shader will be code that is prepended to the start of a vertex shader. It will load all of the vertex inputs into registers, and those registers will be used by the vertex shader. We will go through the different phases of the LLPC compilation and explain how the compilation will work.

Invoking an LLVM compile

The parameters that are passed to LLVM will be the same as for the vertex shader, except that we will introduce a new shader stage, ShaderStageFetch. There will be some updates needed to handle the existence of the new shader stage.

Reading the SPIR-V

The first part of the compilation will be the same as for the vertex shader. This will continue until PatchEntryPointMutate. Between reading the SPIR-V and entry-point mutation, there are a small number of optimizations that are done. These can affect the entry-point interface, for example by removing all references to a built-in input. For now, I propose that we keep the creation of the entry point the same for simplicity. This requires recompiling the vertex shader up to PatchEntryPointMutate when building the fetch shader, so we can correctly determine the inputs and outputs for the fetch shader. In the future, we could improve this if we find it to be a problem. Possible improvements are described in another section.

Modifying the entry point

In PatchEntryPointMutate, the compilation of a fetch shader and a vertex shader will diverge. If we are building a fetch shader, the pass will change the entry point so that the inputs are the same as those of a vertex shader compiled without relocatable shaders. The other difference is that it will also create a return value for the fetch shader. The type of the return value will be used to match the input to the relocatable vertex shader.

The problem with this plan is that the type of the parameters used to hold the values of the vertex inputs must match the type of the OpVariable in the SPIR-V. This information is not readily available in PatchEntryPointMutate. It was stored as the type of the global value representing the vertex input, but that is deleted during "Lower SPIR-V globals (global variables, inputs, and outputs)". We could look for calls to the @llpc.input.import.generic.v4f32.i32.i32 builtin, but that is more work than should be necessary.

I would propose that we add the type information to the VertexInputDescription.

PatchEntryPointMutate should modify the return type as well, to return a struct that represents the outputs. We may need to modify the original return instructions in the vertex shader so that we still have valid LLVM IR.

Generating the body of the fetch shader

After patching the entry point, we have the information that we need to generate the body of the fetch shader. I am proposing that we create a new pass, GenerateFetchShaderBody. If we are not building a fetch shader, then this pass will exit early.

Otherwise, it will delete the body of the entry point, keeping the entry point itself, and will generate a load for every vertex input. The inputs will be determined by looking at the vertex input descriptions. The loads will be calls to a builtin that will be built by calling CreateReadGenericInput in the builder. What are the conditions on which passes can access the builder, and which ones cannot?

The outputs will be collected into a struct that will be a parameter to a return instruction. This struct will also include the parameters that are not vertex inputs, which are simply pass-through arguments.

Expanding the generic input builtins

For other shader stages, the import and export builtins are expanded during "Patch LLVM for input import and output export operations". We will do the same thing. Expanding the imports is not difficult: the fetch shader imports will be expanded the same way as vertex shader inputs are when not building a relocatable shader ELF.

The outputs should not use any builtins, so nothing needs to be done for them. The return will be handled during instruction selection.

Expanding the return

The return instruction will be expanded during instruction selection by SITargetLowering::LowerReturn. It already does many of the things that we want. It will end the shader with a RETURN_TO_EPILOG instruction carrying the return values. This makes sure the return values are considered live by any other passes. The RETURN_TO_EPILOG instruction is ignored when encoding the instructions, so the code simply falls through to the next instruction. This means that we can simply concatenate the fetch shader and the vertex shader to get the vertex shader to execute after the fetch shader. There is no need to rewrite branches or delete instructions.

Special care will be needed to make sure that the registers in the output of the fetch shader match the inputs to the vertex shader.

Potential optimizations

As was mentioned earlier, the biggest drawback of this approach is that building a fetch shader requires compiling the vertex shader a second time. This could be fixed in a couple of ways.

The first option is to decide on the vertex shader entry point much sooner, potentially based on the SPIR-V alone. The disadvantage of this approach is that the interface will include many parameters that will not be needed. It puts extra work on the driver to set up the parameters, and could force more parameters into the spill area if there are too many.

The second option is to somehow store the interface in the relocatable shader ELF for the vertex shader. The fetch shader generation would then start by reading that information and using it. The disadvantage of this option is that it seems more complicated to implement. I would like to keep this as a viable option if it is needed; however, I would like to implement something simpler at first, just to get something that works.

trenouf commented on July 30, 2024

Hi Steven

I don't think we want to be doing a whole compilation process starting with the SPIR-V to generate a fetch shader, and trying to common up much of the flow with a vertex shader compile.

The way I see it working is:

  • Compiling the vertex shader without the vertex state causes vertex inputs to be passed as vgprs, like you said.
  • Doing that gives you an unlinked ELF containing some state saying what it is expecting as vertex inputs in VGPRs, in terms of locations and types (like you said in "the second option" under "potential optimizations").
  • The link stage takes that state and the vertex info from the pipeline state that it now has, hashes it and checks a cache, then generates IR for the fetch shader and compiles it through to an ELF.
  • The code that generates the fetch IR is common between the whole pipeline compilation case and the separate fetch shader case.

That might be a bit hand-wavy, but I think it is actually simpler than modifying existing passes that think they are compiling a vertex shader to compile a fetch shader instead, and it fits better with the LGC (middle-end) abstraction by not involving the front end at all.

s-perron commented on July 30, 2024

Thanks. I have no problem implementing the path that places entry point information in the ELF. The reason I feel this will be a more complicated solution to implement is that it means I have to add this extra information to the pipeline state. Beyond that, I do not know the division of which data structures are used when, and how the state should be updated. That seemed like something that would be very difficult to get correct. It is a bit of a problem for both solutions, but it seems like even more work if I have to update the ELF.

Also, I had interpreted @nhaehnle's comment above as not wanting to pass information through the ELF:

It should be possible to generate this array during patching somehow, maybe llpcPatchEntryPointMutate (at that point you have to build the new LLVM IR function signature, so the data must be available somehow).

Generating the prolog should be possible by combining this array and the VkPipelineVertexInputStateCreateInfo.

I could have been reading that as a more precise statement than it was intended to be. I can look into passing information through the ELF file.

trenouf commented on July 30, 2024

I think the ideal would be for the extra state passed in the ELF to be an extra LLPC-internally-defined part of the msgpack PAL metadata in the .note record. The problem is, whenever I think of a way of doing something, it needs something else doing first. :-) In this case, we should really have a way of carrying the in-progress PAL metadata msgpack through the patch passes, instead of constructing it all in the pass that calls the config builder. Otherwise, you're going to have to put the vertex input metadata into ResourceInfo or InterfaceData, then copy it from there into the msgpack PAL metadata in the config builder.

linqun commented on July 30, 2024

It seems NGG isn't mentioned in this proposal. Could you add it? Please include both passthrough mode and non-passthrough mode, and it would be better if you could provide a manually written LLVM IR example. Thanks!
I suppose passthrough mode is compatible with the traditional pipeline; if you plan to support NGG non-passthrough mode, I think you will need to make some changes to your proposal.

s-perron commented on July 30, 2024

@trenouf Okay, I'll try to get back to you on how we can store the interface. I need to look into what information is needed.

s-perron commented on July 30, 2024

@linqun Can you be more specific on why NGG may cause a problem? As I mentioned on another issue, Yong and I are still learning the AMD arch and the details of LLPC. I thought that this design would reuse enough existing code that it would already handle all of the cases.

For example, the only reference to NGG in the entry point mutation code is to add a dummy "inreg" to match what PAL expects. Because I made generic statements like "the inputs are the same as a vertex shader when not building relocatable shaders", it is implied that this dummy input would still be added without specifically mentioning NGG.

s-perron commented on July 30, 2024

@linqun It is hard to say exactly what the code will look like. The initial version of the fetch shader will look something like:

define spir_func ret_struct_type @llpc.shader.fetch_shader(i32 inreg %0, i32 inreg %1, i32 inreg %2, i32 inreg %3, i32 inreg %4, i32 %5, i32 %6, i32 %7, i32 %8) local_unnamed_addr #1 !spirv.ExecutionModel !128 !llpc.shaderstage !128 {
.entry:
  ; Load the vertex inputs.  The actual code generated here is whatever BuilderImplInOut::ReadGenericInputOutput generates.
  %9 = call <4 x float> @llpc.input.import.generic.v4f32.i32.i32(i32 3, i32 0) #0
  %10 = call <4 x float> @llpc.input.import.generic.v4f32.i32.i32(i32 0, i32 0) #0

  ; Build the return value, inserting the pass-through parameters first
  %agg1 = insertvalue ret_struct_type undef, i32 %0, 0
  %agg2 = insertvalue ret_struct_type %agg1, i32 %1, 1
  ...
  %agg9 = insertvalue ret_struct_type %agg8, i32 %8, 8

  ; Then insert the vertex inputs
  %agg10 = insertvalue ret_struct_type %agg9, <4 x float> %9, 9
  %agg11 = insertvalue ret_struct_type %agg10, <4 x float> %10, 10
  ret ret_struct_type %agg11
}

Is this what you are looking for?

s-perron commented on July 30, 2024

I will also put this out as a reminder: the relocatable shader ELF currently works on VS-FS and CS pipelines only. I plan on getting this working for those cases first, just to show that we can do it. This also handles the majority of the cases. I don't want the design to exclude handling more cases in the future, but I will not be implementing or testing other pipelines for now.

linqun commented on July 30, 2024

If NGG + culling is enabled, the NGG framework will call the VS twice:

  • the first call is to calculate the position
  • after that we do culling
  • then we call the VS again to calculate the other output attributes for the remaining vertices

You can try it with the following simple shader:

#version 450
layout(location = 0) in vec4 v0;
layout(location = 1) in vec4 v1;
layout(location = 0) out vec4 color;
void main() {
    gl_Position = v0;
    color = v1;
}

amdllpc.exe -v -gfxip=10.1 --ngg-enable-backface-culling 11.vert

You will see two @llvm.amdgcn.struct.tbuffer.load.v4i32 calls in the patching result: one at the beginning of the LLVM IR, the other in block .expVertPos.

NGG + culling isn't enabled by default, and if you don't plan to support this mode, I think you only need to add a comment in your proposal to simplify the design.

amdrexu commented on July 30, 2024

For NGG culling, the VS is run twice: (1) a partial run, obtained by removing irrelevant instructions, that just fetches vertex values (only those that contribute to the position calculation), does the gl_Position calculation, writes the position values to LDS, and prepares culling; (2) a real run of the VS for all vertices that pass culling.

s-perron commented on July 30, 2024

@linqun Sorry, but I still do not see what the real problem is.

When you say "the NGG framework will call the VS twice", that sounds like a high-level informal description. What does that mean in real terms? Do we generate code for the vertex shader, and the driver will execute that code twice? I also noticed that LLPC splices the vertex shader when NGG is enabled; then the different parts of the shader are run under different circumstances. Is the splicing what you mean by running twice?

If you mean either of those, then they seem irrelevant to me. If the code that is generated produces the same results, it should not matter how many times the shader is invoked. When I compile the example code provided above, I still see a single entry point for the vertex shader. If we make the changes I am suggesting, guarded by an option, and prepend the fetch shader to that code, then the shader should produce the same results. It would be equivalent to doing code motion to move the loads of the vertex inputs to the start of the shader.

Is there a reason the vertex inputs might not be available at the start of the shader so that loading them at that time will cause a crash? Do the vertex inputs get moved around so that the first call needs one fetch shader and the second needs another?

Is the problem a potential performance loss because we are loading a vertex input that would not otherwise be loaded? If so, that is not NGG specific, and I am willing to live with it. I don't see what needs to change in the design.

trenouf commented on July 30, 2024

Hi @s-perron

I'm not a great expert on the NGG part here, but (to expand on what Rex said) I believe that a VS in NGG-with-culling mode works like this:

  • First copy of original VS, including code that loads from the vertex buffer, but only the position result is used, so half of the shader gets removed by DCE, and only vertex buffer loads that affect the position result remain.
  • Internal stuff to do culling of prims, gathering of remaining (unculled) vertices, and prim and position export.
  • Second copy of original VS, including code that loads from the vertex buffer, but only the param results are used and exported, so half the shader gets removed by DCE, and only vertex buffer loads that affect the params remain.

So there are actually two copies of the original VS, but each DCEd differently.

Note that a vertex running in the second copy of the VS is not necessarily running in the same lane (or even the same wave within a subgroup??) as that vertex was running in the first copy.

So having a fetch shader for the vertex buffer loads at the start of this doesn't really work, because

  1. it's difficult to see how the loaded values would get through to the second copy of the VS;
  2. values get loaded even for vertices that have been culled.

I wonder if there is some way of formulating this so there is a second fetch shader that gets pasted in between the internal part and the second copy of the VS. But I haven't really figured out how that could work.

With the current design, I think we'd just have to say "no NGG culling". As Qun says, NGG culling is not enabled by default anyway.

csyonghe commented on July 30, 2024

If there are two versions of the vertex shader, it should be trivial to run the same fetch shader generation process twice, one for each version, and append that to the front of each vertex shader variant. I guess what's proposed here will largely remain the same if we want to support NGG culling?

linqun commented on July 30, 2024

In the original NGG culling mode, we execute the second fetch only for vertices which are not clipped/culled. So if we always fetch all attributes for all vertices at the beginning of the vertex shader, it should work, but it may hurt performance a little. I suppose that is OK for you.

amdrexu commented on July 30, 2024

Tim's explanation is correct. There are performance impacts but no correctness issues with this fetch shader proposal. We can state this in the design doc as a to-be-done item. The performance impacts are twofold: (1) unnecessary vertex values are loaded from the vertex buffer in the first run of the VS; (2) values for culled vertices are loaded from the vertex buffer in the second run of the VS.

s-perron commented on July 30, 2024

Okay, thanks. I'll leave a comment in the code and we can create a follow-up issue. Since it is not a correctness issue, I will not fall back to full pipeline compilation when culling is enabled. However, we can revisit that decision once we have something working and users are able to give us feedback.

nhaehnle commented on July 30, 2024

FWIW, when I wrote about computing an array that matches VGPRs to vertex input components during llpcPatchEntryPointMutate, I was thinking of this happening during the initial compile of the vertex shader. The computed information would then be used later when determining the required fetch shader, and preserving that information in the ELF makes sense to me.
