loongson / gcc
This project forked from gcc-mirror/gcc
License: GNU General Public License v2.0
For English version, please keep scrolling down.
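This is an experimental vector calling convention for LA64, currently implemented on top of GCC. The code lives on the dev/vecarg branch. Discussion of any problems or shortcomings is welcome. Please also point out any place where the Chinese and English versions are ambiguous or disagree. Thank you!
The code is in #113.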
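A vector is either 128 or 256 bits wide and always contains multiple elements. Elements occupy the vector starting from the least significant bit and are indexed upward from 0.
Vector element types follow the LP64 data model.
An LA64 implementation may optionally provide 32 vector registers of 128 or 256 bits in hardware. If vector registers are implemented, a double-precision floating-point unit must be implemented as well.
The lower half of a 256-bit vector register is shared with the 128-bit vector register of the same number, and the lower half of a 128-bit vector register is shared with the floating-point register of the same number.
The usage convention for vector registers is as follows: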
Name | Usage | Preserved across calls |
---|---|---|
$vr0 - $vr1 (128-bit) / $xr0 - $xr1 (256-bit) | Argument registers / return value registers | No |
$vr2 - $vr7 (128-bit) / $xr2 - $xr7 (256-bit) | Argument registers | No |
$vr8 - $vr31 (128-bit) / $xr8 - $xr31 (256-bit) | Temporary registers | No |
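TODO: For registers whose full contents are preserved across calls (static registers / callee-saved registers), there is no final decision yet; an effective way of measuring performance is needed to support the choice.
So far, in performance tests of x264 and libjpeg-turbo together with the sleef vector math library (not yet submitted upstream), different allocations of s/t registers have made no noticeable difference to performance.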
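The vector calling convention extension is layered on top of LP64D and uses 128/256-bit vector registers to pass vector arguments and return values.
It can be enabled in the following ways:
To keep the behavior of the vector calling convention consistent across functions and compilation units, the following requirements must be observed:
p.s.: In the current GCC PoC implementation, the vecarg option corresponds to -mvecarg, and the vecarg attribute corresponds to __attribute__ ((vecarg)).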
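In the description of the vector calling convention below, wherever the passing of 128/256-bit vectors is described, we assume the compiler has enabled vector instruction support of the corresponding width.
VAR: vector registers 0-7 are used, in order of their number, to pass vector arguments. Vector registers 0-1 are also used to pass vector return values. A vector argument is always passed in a VAR whose VRLEN equals the bit width of the argument.
When the vector calling convention is enabled, an argument may be passed in one of the following ways: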
128-bit vector
256-bit vector
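In all cases, at most two registers (counting registers of all types together) are used to pass a structure; otherwise the structure is passed on the stack.
p.s.: If the structure contains zero-width bit-fields, zero-length arrays, empty structures, or empty unions as members, they are handled as described under "Other structures" in the base ABI.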
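Vector arguments are not passed in VAR/FAR.
For a 128-bit vector, if at least two GARs are available and the first GAR has an even number, this pair of GARs is used to pass the argument.
For a 256-bit vector, the existing base ABI definition applies according to the vector's bit width.
VARs 0-1 are used to pass the return value, in the same way the first argument in the argument list would be passed.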
This is an experimental vector calling convention implementation for LoongArch64, based on GCC. The ad-hoc implementation can be found in pull request #113.
Discussion of this prototype calling convention is welcome. Please also report any inconsistency between the Chinese and English versions. Thanks!
A vector is either 128 bits or 256 bits wide and always contains
multiple elements. Elements occupy the vector consecutively, starting
from the lowest bits, and are indexed starting from zero.
All elements of a vector have the same base scalar type, taken from the LP64 data model.
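The element layout can be illustrated with the GCC/Clang `vector_size` extension. This is only a sketch: the type name `v4i32` and the helper `element_at` are made up for illustration, and the check runs on the host's in-memory layout rather than on LoongArch vector registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: a 128-bit vector of four 32-bit elements.
   The name v4i32 is illustrative, not taken from the ABI draft. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Hypothetical helper: read element `idx` straight from the vector's bytes,
   assuming element 0 sits at the lowest address and the rest follow
   consecutively. */
static int32_t element_at(v4i32 v, unsigned idx)
{
    int32_t out;
    memcpy(&out, (const unsigned char *)&v + idx * sizeof(int32_t), sizeof out);
    return out;
}
```

With `v4i32 v = {10, 20, 30, 40};`, both `v[2]` and `element_at(v, 2)` read the third element.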
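LoongArch machines that implement LA64 can optionally have 32 vector registers,
which may be either 128-bit or 256-bit, depending on the hardware implementation. A double-precision
FPU is required for vector registers. Floating-point registers and vector registers that have the same
index suffix overlap as follows: the lower half of a 256-bit vector register is shared with the 128-bit vector register of the same index, and the lower half of a 128-bit vector register is shared with the floating-point register of the same index.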
Name | Usage | Preserved across calls |
---|---|---|
$vr0 - $vr1 (128-bit) / $xr0 - $xr1 (256-bit) | Argument registers / return value registers | No |
$vr2 - $vr7 (128-bit) / $xr2 - $xr7 (256-bit) | Argument registers | No |
$vr8 - $vr31 (128-bit) / $xr8 - $xr31 (256-bit) | Temporary registers | No |
TODO: For "static registers" (callee-saved registers), there is no final resolution yet; effective performance measurements are needed to decide.
In current performance tests that combine the vector calling convention with the x264 and libjpeg-turbo test tools and the sleef vector math library (LoongArch support not released yet), different static/temporary register allocations show no significant difference in performance.
The vector calling convention extension is based on LP64D and uses 128-bit/256-bit vector registers to pass vector arguments and return values. It can be enabled via:
For consistent behavior across objects and functions, the following rules should be observed when using the vector calling convention:
p.s.: For the current GCC PoC implementation, "vecarg option" refers to -mvecarg, and "vecarg attribute" refers to __attribute__ ((vecarg)).
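As a rough sketch of how the attribute might appear in source code, the fragment below wraps it in a macro so the same file still compiles on other targets. The macro name `VECARG`, the type `v4f32`, the function `add4`, and the placement of the attribute on a function definition are all assumptions made for illustration; only the spelling `__attribute__ ((vecarg))` comes from the note above, and a LoongArch compiler without the PoC patches may warn about an unknown attribute.

```c
#include <assert.h>

/* Hypothetical helper macro: expand to the PoC attribute only when
   targeting LoongArch, and to nothing elsewhere. */
#if defined(__loongarch__)
#define VECARG __attribute__((vecarg))
#else
#define VECARG
#endif

/* GCC/Clang vector extension: four 32-bit floats in 128 bits. */
typedef float v4f32 __attribute__((vector_size(16)));

/* Illustrative function taking and returning 128-bit vectors. */
VECARG v4f32 add4(v4f32 a, v4f32 b)
{
    return a + b;   /* element-wise addition */
}
```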
In the following description of the vector calling convention, we assume that 128/256-bit vector instruction support is enabled in the compiler when the corresponding convention is used.
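VAR: Vector registers 0 - 7 are reserved, in order of their number, for passing vector arguments, and vector registers 0 - 1 are also used for vector return values. A vector argument is always passed in a VAR whose VRLEN equals the argument's bit width.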
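The in-order use of VARs can be modelled with a small counter. This is a minimal sketch, not the GCC implementation: `struct var_state`, `alloc_var`, and `is_return_var` are hypothetical names, and what happens to a vector argument once all eight VARs are taken is not stated here, so the sketch only reports -1 in that case.

```c
#include <assert.h>

enum { NUM_VAR = 8 };   /* vector registers 0-7 carry vector arguments */

/* Hypothetical allocator state: number of the next free VAR. */
struct var_state { int next; };

/* Return the VAR number for the next vector argument, or -1 when all
   eight are taken (the fallback path is not modelled in this sketch). */
static int alloc_var(struct var_state *s)
{
    if (s->next >= NUM_VAR)
        return -1;
    return s->next++;
}

/* Only VAR 0 and VAR 1 double as return value registers. */
static int is_return_var(int var)
{
    return var == 0 || var == 1;
}
```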
When the vector calling convention is enabled, an argument is passed in one of the following ways:
$sp.
128-bit vector argument
256-bit vector argument
In all cases, at most 2 registers (summed over all register types) are used to pass a struct with a vector member; otherwise the structure is passed on the stack.
p.s.: If the struct contains a zero-width bit-field, zero-length array, empty struct, or empty union, the passing rule is the same as described under "Other Structure" in the base ABI document.
We don't use VAR/FAR to pass vector arguments.
For a 128-bit vector argument, if at least 2 GARs are available and the first GAR's number is even, this pair of GARs is used to pass the argument.
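The GAR pair condition can be written as a small predicate. This is a sketch under stated assumptions: the function name `can_use_gar_pair` is made up, GARs are taken to be the eight base-ABI argument registers $a0 - $a7 numbered 0 - 7, and the sketch does not model what happens when the condition fails.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_GAR = 8 };   /* $a0-$a7 in the base ABI, numbered 0-7 here */

/* True when a 128-bit vector can use the GAR pair starting at `next_gar`:
   two registers must remain and the first one must have an even number. */
static bool can_use_gar_pair(int next_gar)
{
    bool two_left = next_gar + 1 < NUM_GAR;
    bool even_first = (next_gar % 2) == 0;
    return two_left && even_first;
}
```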
A 256-bit vector argument follows the current base ABI conventions for its data bit width (256 bits).
VARs 0 - 1 are used for passing the return value. The return value is passed in the same way as the first argument of an argument list.
On an Arch x86-64 host, cross-compiling gnu-efi with Sun Haiyong's clfs cross-tools 4.0 and 5.0 (running HOSTARCH=x86_64 CROSSARCH=loongarch64 prefix=loongarch64-unknown-linux-gnu- make) fails with the following error when the build reaches libgnuefi.a:
make[1]: *** [/home/prcups/gnu-efi//lib/Makefile:78: libefi.a] Floating point exception (core dumped)
Our organization uses gcc 7.3, and we would like to be able to run gcc 7.3 on Loongson.
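When the CPU mispredicts a branch instruction in the instruction stream, the pipeline is flushed, which degrades performance.
The best remedy is to emit no branch instruction at all; for structurally simple if-then-else statements in a high-level language, we can try to eliminate the branch.
Below we analyze the C case in which a and b are compared and, depending on the result, c or d is assigned to out (a, b, c, d, and out are variables).
The test code template is as follows: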
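/* foo fills in the inputs and foo1 consumes the result; both are defined
   elsewhere so the compiler cannot fold the selection away. */
void foo(int *a, int *b, int *c, int *d);
void foo1(int out);

int main(void)
{
    int a, b, c, d, out;
    foo(&a, &b, &c, &d);
    out = a < b ? c : d;
    /* Equivalent form:
       if (a < b)
           out = c;
       else
           out = d;
    */
    foo1(out);
    return 0;
}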
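Depending on the types of a, b, c, and d, we consider the following four cases and design a branch-elimination instruction sequence for each.
a, b, c, d all fixed-point
When all four are fixed-point, the following sequence can be used; the current gcc backend already implements it.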
slt
maskeqz
masknez
or
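The fixed-point sequence above can be modelled in C to check that it selects the right value. This is only a sketch of the instruction semantics as I understand them from the LoongArch ISA manual (maskeqz yields 0 when the condition register is zero, masknez yields 0 when it is non-zero); the C helpers are models, not intrinsics, and the name `select_lt` is made up.

```c
#include <assert.h>
#include <stdint.h>

/* slt rd, rj, rk: rd = 1 if rj < rk (signed), else 0 */
static int64_t slt(int64_t rj, int64_t rk) { return rj < rk; }

/* maskeqz rd, rj, rk: rd = 0 if rk == 0, else rj */
static int64_t maskeqz(int64_t rj, int64_t rk) { return rk == 0 ? 0 : rj; }

/* masknez rd, rj, rk: rd = 0 if rk != 0, else rj */
static int64_t masknez(int64_t rj, int64_t rk) { return rk != 0 ? 0 : rj; }

/* Model of out = a < b ? c : d using slt / maskeqz / masknez / or. */
static int64_t select_lt(int64_t a, int64_t b, int64_t c, int64_t d)
{
    int64_t cond = slt(a, b);        /* 1 when a < b, else 0    */
    int64_t t = maskeqz(c, cond);    /* c when cond != 0, else 0 */
    int64_t f = masknez(d, cond);    /* d when cond == 0, else 0 */
    return t | f;                    /* exactly one side is non-masked */
}
```

Exactly one of the two masked values is forced to zero, so the final `or` reproduces the selected operand without any branch in the modelled sequence.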
a, b, c, d all floating-point
The current gcc backend already implements this.
fcmp
fsel
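A similar C model can illustrate the floating-point pair. It assumes, per my reading of the ISA manual, that `fsel fd, fj, fk, ca` writes fj when the condition flag is 0 and fk otherwise, and that an ordered less-than compare is false when either operand is NaN. The helper names `fcmp_lt` and `select_lt_f` are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Model of an ordered less-than fcmp: result is 0 if either input is NaN. */
static int fcmp_lt(double fj, double fk) { return fj < fk; }

/* Model of fsel fd, fj, fk, ca: fd = ca ? fk : fj */
static double fsel(double fj, double fk, int ca) { return ca ? fk : fj; }

/* Model of out = a < b ? c : d using fcmp + fsel. */
static double select_lt_f(double a, double b, double c, double d)
{
    int cc = fcmp_lt(a, b);
    return fsel(d, c, cc);   /* d sits in the fj slot, c in the fk slot */
}
```

Note the operand order: because fsel picks its second source when the flag is set, the "true" value c goes in the fk position.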
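a, b floating-point; c, d fixed-point
There are two options in this case.
The first is to compare with the floating-point fcmp, move the fcc to a general-purpose register with movcf2gr, and then use maskeqz, masknez, and or.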
fcmp
movcf2gr
maskeqz
masknez
or
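The first option for the mixed case can be sketched by chaining the two models: a floating-point compare produces the flag, which is then used as an integer mask. As before, the C functions only model the instructions under my reading of their semantics, and `select_lt_fi` and `movcf2gr_model` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an ordered less-than fcmp writing a condition flag (0 or 1). */
static int fcmp_lt(double fj, double fk) { return fj < fk; }

/* Model of movcf2gr: copy the condition flag into a general register. */
static int64_t movcf2gr_model(int cc) { return cc ? 1 : 0; }

/* maskeqz / masknez models, as in the fixed-point sequence. */
static int64_t maskeqz(int64_t rj, int64_t rk) { return rk == 0 ? 0 : rj; }
static int64_t masknez(int64_t rj, int64_t rk) { return rk != 0 ? 0 : rj; }

/* out = a < b ? c : d with floating-point a, b and fixed-point c, d. */
static int64_t select_lt_fi(double a, double b, int64_t c, int64_t d)
{
    int64_t cond = movcf2gr_model(fcmp_lt(a, b));
    return maskeqz(c, cond) | masknez(d, cond);
}
```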
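The second is to move c and d to floating-point registers with movgr2fr, perform floating-point branch elimination, and then move the result back with movfr2gr.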
movgr2fr
movgr2fr
fcmp
fsel
movfr2gr
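a, b fixed-point; c, d floating-point
As in the previous case, there are two options.
The first is to move the slt result to an fcc with movgr2cf and then use the floating-point fsel.
slt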
movgr2cf
fsel
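The second is to move the fixed-point data to floating-point registers, convert it to floating-point, and then perform floating-point branch elimination.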
movgr2fr
ffint
movgr2fr
ffint
fcmp
fsel
To implement branch elimination, the following changes to the LoongArch backend are needed:
workaround: xen0n@23231e3
A compiler built from the loongson/loongarch_upstream_v5.3 branch (commit 4244eaa) produces the following errors when compiling QEMU:
/tmp/cc9um8Th.s: Assembler messages:
/tmp/cc9um8Th.s:12175: Error: no match insn: ext.w.<size> $r5,$r5
/tmp/cc9um8Th.s:47097: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:47559: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:48939: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:49398: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:94462: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:102759: Error: no match insn: ext.w.<size> $r6,$r6
/tmp/cc9um8Th.s:103233: Error: no match insn: ext.w.<size> $r6,$r6
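This is clearly caused by how the pattern is written. As far as I can see, only MIPS has patterns of a similar shape, and the implementation details have since been changed. I tried adjusting it without success, so I have removed it on my branch for now. Please help investigate and fix it.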
Trapping division/modulus operations are signatures of MIPS codegen, and indeed here the trapping-by-default behavior and the flag seem to come from MIPS. However, as division-by-zero in LLVM IR is undefined behavior, why can't we just omit the trapping behavior altogether (and match RISCV in this regard), or at least disable the trapping by default?
RISCV gcc also does not trap for zero divisor at all.
To me it makes sense to enable -mno-check-zero-division as the default, or at least the default for optimized code (with -O2 or more).
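Sorry, issues cannot be filed against the Linux branch, so I am filing this here. Please ask Chen Huacai to update the function names in loongarch.h to match larchintrin.h. The latest gcc renamed several functions in larchintrin.h, but the Linux asm/loongarch.h has not been updated yet, so building the kernel with the latest gcc fails. Please let Chen know so it can be updated, thanks! Also, congratulations on the results of your work!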
When building gcc, the following features are reported as unsupported:
configure:3294: checking for libitm support
configure:3300: result: no
configure:3313: checking for libsanitizer support
configure:3319: result: no
configure:3332: checking for libvtv support
configure:3338: result: no
configure:3462: checking for libphobos support
configure:3468: result: no
Please confirm which of these features can be supported on LoongArch and add the corresponding support. Thank you.
During a bootstrap of upstream GCC r13-1271, stage 2 libgcc fails with:
/home/xry111/git-repos/gcc-build/./gcc/xgcc -B/home/xry111/git-repos/gcc-build/./gcc/ -B/home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/bin/ -B/home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/lib/ -isystem /home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/include -isystem /home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/sys-include -fno-checking -g -O2 -O2 -g -O2 -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fpic -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -fpic -I. -I. -I../.././gcc -I../../../gcc/libgcc -I../../../gcc/libgcc/. -I../../../gcc/libgcc/../gcc -I../../../gcc/libgcc/../include -DHAVE_CC_TLS -o _ffssi2.o -MT _ffssi2.o -MD -MP -MF _ffssi2.dep -DL_ffssi2 -c ../../../gcc/libgcc/libgcc2.c -fvisibility=hidden -DHIDE_EXPORTS
during GIMPLE pass: thread
In file included from ../../../gcc/libgcc/libgcc2.c:56:
../../../gcc/libgcc/libgcc2.c: In function ‘__clzdi2’:
../../../gcc/libgcc/libgcc2.h:202:25: internal compiler error: Segmentation fault
202 | #define __NW(a,b) __ ## a ## di ## b
Currently I have no clue about why this happens.