loongson / gcc

This project forked from gcc-mirror/gcc

License: GNU General Public License v2.0

gcc's Issues

Experimental implementation of LoongArch64 vector calling convention

This is an experimental implementation of a vector calling convention for LoongArch64, based on GCC. The code lives on the dev/vecarg branch; the ad-hoc implementation can be found in pull request #113.

Discussion of this prototype calling convention is welcome. Please also report any inconsistency between the Chinese and English versions. Thanks!

Keywords

VR: Vector Register
VAR: Vector Argument Register
VRLEN: Vector Register Length

Vector Types

A vector is either 128 or 256 bits wide and always contains multiple elements. Elements occupy the vector consecutively, starting from the lowest bits, and are indexed from zero.

All elements of a vector have the same base scalar type, as defined by the LP64 data model.

Vector Register

LoongArch machines that implement LA64 may optionally provide 32 vector registers, each either 128 or 256 bits wide depending on the hardware implementation. A double-precision FPU is required when vector registers are implemented. Floating-point registers and vector registers with the same index follow the overlapping rules below:

Each floating-point register overlaps the lower 64 bits of the 128-bit and 256-bit vector registers with the same index.
Each 128-bit vector register overlaps the lower 128 bits of the 256-bit vector register with the same index.

Name	Usage	Preserved across calls
$vr0 - $vr1 (128-bit) / $xr0 - $xr1 (256-bit)	Argument registers / return value registers	No
$vr2 - $vr7 (128-bit) / $xr2 - $xr7 (256-bit)	Argument registers	No
$vr8 - $vr31 (128-bit) / $xr8 - $xr31 (256-bit)	Temporary registers	No

TODO: For static (callee-saved) registers there is no final resolution yet; effective performance measurements are needed to decide.

In current performance tests, using the test tools of x264 and libjpeg-turbo together with the sleef vector math library (LoongArch support not yet released), different static/temporary register allocations under the vector calling convention showed no significant difference in performance.

Vector Calling Conventions

The vector calling convention extension is layered on top of LP64D and uses 128-bit/256-bit vector registers to pass vector arguments and return values. It can be enabled in either of the following ways:

Compile objects with the "vecarg option". All functions in those objects that take vector arguments or return vector values then follow this calling convention.
Mark specific functions with the "vecarg attribute". Each marked function follows this calling convention.

To keep behavior consistent across objects and functions, the following rules should be observed when using the vector calling convention:

When object A is compiled with the "vecarg option" and object B calls functions from A that take vector arguments or return vector values, object B also needs to be compiled with the "vecarg option".
When a function is marked with the "vecarg attribute", all of its declarations and its definition should carry the attribute.
All objects that use the vector calling convention should be compiled with the same SIMD instruction option (that is, the same maximum vector length).

p.s.: In the current GCC PoC implementation, the "vecarg option" is -mvecarg and the "vecarg attribute" is __attribute__ ((vecarg)).
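
As an illustration of the attribute rule above, here is a minimal C sketch. The vector type uses the standard GCC vector extension; `HAVE_VECARG`, `VECARG`, `v4i32` and `add_v4i32` are hypothetical names chosen for this example, and the attribute only has an effect with the PoC compiler from this issue.

```c
#include <stdint.h>

/* 128-bit vector of four 32-bit integers (GCC vector extension). */
typedef int32_t v4i32 __attribute__ ((vector_size (16)));

/* VECARG expands to the PoC attribute only when the hypothetical macro
   HAVE_VECARG is defined, so the file still builds with other compilers. */
#ifdef HAVE_VECARG
#define VECARG __attribute__ ((vecarg))
#else
#define VECARG
#endif

/* The declaration and the definition carry the same marking. */
VECARG v4i32 add_v4i32 (v4i32 a, v4i32 b);

VECARG v4i32
add_v4i32 (v4i32 a, v4i32 b)
{
  return a + b;   /* element-wise addition */
}
```

With the PoC compiler one would presumably build this either with -DHAVE_VECARG or with -mvecarg for the whole object; both a and b would then be candidates for passing in VARs.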

Subroutine Calling Sequence

In the following description of the vector calling convention, we assume that the compiler has vector instruction support enabled for the corresponding width (128-bit or 256-bit).

Registers

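VAR: Vector registers 0 - 7 are used, in numerical order, for passing vector arguments; vector registers 0 - 1 are also used for vector return values. A vector argument is always passed in the VAR whose VRLEN equals the bit width of the argument.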

Argument Passing

When the vector calling convention is enabled, an argument is passed in one of the following ways:

1. An argument register.
2. A pair of argument registers with adjacent numbers.
3. A pair of argument registers of different types, in any of the following combinations:
- a GAR and a FAR.
- a GAR and a VAR.
- a FAR and a VAR.
4. A contiguous block of memory in the stack arguments region, at a constant offset from the caller's outgoing $sp.
5. A combination of 1 and 4.

Passing Single Vector Argument

128-bit vector argument
- If at least 1 VAR is available, pass the argument in a single VAR.
- If no VAR is available and at least 2 GARs with adjacent numbers are available, pass the argument in that pair: the low 64 bits go in the first GAR and the high 64 bits in the second GAR.
- Otherwise, pass the argument entirely on the stack.
256-bit vector argument
- If at least 1 VAR is available, pass the argument in a single VAR.
- If no VAR is available and at least 1 GAR is available, store the 256-bit vector in the caller's stack space and pass its address in the GAR.
- If neither a VAR nor a GAR is available, pass the argument entirely on the stack.
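
The single-vector rules above can be read as a small decision procedure. The following C sketch is only a model for illustration and is not taken from the GCC patch; `reg_state`, `pass_kind` and `classify_vector_arg` are hypothetical names. It assumes argument registers are handed out in increasing order, so any two free GARs are adjacent, and it leaves the register counts untouched when an argument goes to the stack.

```c
/* How a single vector argument ends up being passed. */
enum pass_kind { PASS_VAR, PASS_GAR_PAIR, PASS_GAR_REF, PASS_STACK };

/* Hypothetical bookkeeping: number of argument registers still free. */
struct reg_state {
  int var_free;   /* free VARs, 0..8 */
  int gar_free;   /* free GARs, 0..8 */
};

/* bits is the vector width, 128 or 256. */
enum pass_kind
classify_vector_arg (int bits, struct reg_state *st)
{
  if (st->var_free >= 1)
    {
      st->var_free -= 1;          /* one VAR of matching width */
      return PASS_VAR;
    }
  if (bits == 128 && st->gar_free >= 2)
    {
      st->gar_free -= 2;          /* low half, then high half */
      return PASS_GAR_PAIR;
    }
  if (bits == 256 && st->gar_free >= 1)
    {
      st->gar_free -= 1;          /* address of the caller's stack copy */
      return PASS_GAR_REF;
    }
  return PASS_STACK;
}
```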

Passing Struct with Vector Member

In all cases, at most 2 registers (counting all register types together) are used to pass a struct with a vector member; otherwise the struct is passed on the stack.

If the struct has only 1 member and that member is a vector, the struct follows the rules for passing a single vector argument.
If the struct has 2 members:
1. If both members are vectors and 2 or more VARs with adjacent numbers are available, pass the struct in them: the first vector member goes in the first VAR and the second vector member in the second VAR.
2. If the struct contains 1 vector member and 1 floating-point member, and a VAR and a FAR are both available, use 1 VAR and 1 FAR to pass the struct.
3. If the struct contains 1 vector member and 1 integer member, and a VAR and a GAR are both available, use 1 VAR and 1 GAR to pass the struct.
4. Otherwise, if at least 1 GAR is available, pass the struct by reference; if not, pass it on the stack.

p.s.: If the struct contains a zero-width bit field, zero-length array, empty struct or empty union, it is handled as described under "Other structures" in the base ABI document.
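
The two-member rules above can be modelled in the same spirit. This C sketch is an illustration only, not the logic of the GCC patch; all type and function names are hypothetical, and it assumes the struct has exactly two members, at least one of which is a vector.

```c
enum member_kind { MEMBER_VECTOR, MEMBER_FLOAT, MEMBER_INT };

enum struct_pass {
  STRUCT_VAR_PAIR,   /* two adjacent VARs */
  STRUCT_VAR_FAR,    /* one VAR and one FAR */
  STRUCT_VAR_GAR,    /* one VAR and one GAR */
  STRUCT_BY_REF,     /* address passed in a GAR */
  STRUCT_STACK
};

/* Hypothetical bookkeeping: free argument registers of each type. */
struct free_regs { int vars; int fars; int gars; };

enum struct_pass
classify_two_member_struct (enum member_kind m0, enum member_kind m1,
                            const struct free_regs *r)
{
  int vectors = (m0 == MEMBER_VECTOR) + (m1 == MEMBER_VECTOR);
  int floats = (m0 == MEMBER_FLOAT) + (m1 == MEMBER_FLOAT);
  int ints = (m0 == MEMBER_INT) + (m1 == MEMBER_INT);

  if (vectors == 2 && r->vars >= 2)
    return STRUCT_VAR_PAIR;                       /* rule 1 */
  if (vectors == 1 && floats == 1 && r->vars >= 1 && r->fars >= 1)
    return STRUCT_VAR_FAR;                        /* rule 2 */
  if (vectors == 1 && ints == 1 && r->vars >= 1 && r->gars >= 1)
    return STRUCT_VAR_GAR;                        /* rule 3 */
  if (r->gars >= 1)
    return STRUCT_BY_REF;                         /* rule 4, first half */
  return STRUCT_STACK;                            /* rule 4, second half */
}
```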

Variadic arguments

VARs and FARs are not used to pass variadic vector arguments.

A 128-bit vector argument is passed in a pair of GARs if at least 2 GARs are available and the first GAR of the pair is even-numbered.

A 256-bit vector argument follows the existing base ABI rules for its bit width (256 bits).

Return Value

VARs 0 - 1 are used for passing the return value. A return value is passed by the same rules as the first argument of an argument list.

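Cross-compiling gnu-efi fails with "Floating point exception"

On an Arch x86-64 host, cross-compiling gnu-efi with either clfs cross-tools 4.0 or 5.0 from Sun Haiyong (running HOSTARCH=x86_64 CROSSARCH=loongarch64 prefix=loongarch64-unknown-linux-gnu- make) fails when it reaches libgnuefi.a:
make[1]: *** [/home/prcups/gnu-efi//lib/Makefile:78: libefi.a] Floating point exception (core dumped)

Loser, go fix the bug!!!

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106459

Could gcc 7.3 be supported?

Our organization uses gcc 7.3, and we would like to be able to run gcc 7.3 on Loongson.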

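On implementing branch elimination

0x0

When the CPU mispredicts a branch instruction in the instruction stream, the pipeline is flushed, which hurts performance.
The best remedy is not to emit the branch at all; for structurally simple if-then-else statements in a high-level language, we can try to eliminate the branch.

0x1

Below we analyse the case in C where a and b are compared and, depending on the result, c or d is assigned to out (a, b, c, d and out are variables).
The test code template is: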

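/* Assumed to be defined in another translation unit so the compiler
   cannot constant-fold the comparison. */
void foo(int *a, int *b, int *c, int *d);
void foo1(int out);

int main(void)
{
	int a,b,c,d,out;

	foo(&a,&b,&c,&d);

	out = a<b?c:d;
/*   Equivalently:
        if(a<b)
          out = c;
       else
          out = d;
*/
	foo1(out);
	return 0;
}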

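Depending on the types of a, b, c and d, we consider the following four cases and design a branch-free instruction sequence for each.

a, b, c, d all integer
When all four are integers the following sequence can be used; the existing GCC back end already implements it.
```
slt
maskeqz
masknez
or
```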
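
To make the data flow of the integer sequence explicit, here is a C model of the four instructions. It is a sketch for illustration only: the helpers mirror the documented semantics of slt, maskeqz and masknez on 64-bit registers, and `select_lt` is a hypothetical name. The ternaries inside the helpers only describe what each instruction computes; the hardware instructions themselves do not branch.

```c
#include <stdint.h>

/* slt rd, rj, rk: rd = 1 if rj < rk (signed), else 0. */
static int64_t slt (int64_t rj, int64_t rk) { return rj < rk ? 1 : 0; }

/* maskeqz rd, rj, rk: rd = 0 if rk == 0, else rj. */
static int64_t maskeqz (int64_t rj, int64_t rk) { return rk == 0 ? 0 : rj; }

/* masknez rd, rj, rk: rd = 0 if rk != 0, else rj. */
static int64_t masknez (int64_t rj, int64_t rk) { return rk != 0 ? 0 : rj; }

/* out = a < b ? c : d, expressed as slt / maskeqz / masknez / or. */
int64_t
select_lt (int64_t a, int64_t b, int64_t c, int64_t d)
{
  int64_t t = slt (a, b);       /* condition as 0 or 1 */
  int64_t x = maskeqz (c, t);   /* c when the condition holds, else 0 */
  int64_t y = masknez (d, t);   /* d when the condition fails, else 0 */
  return x | y;                 /* exactly one of x, y can be non-zero */
}
```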
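a, b, c, d all floating-point
The existing GCC back end already implements this case.
```
fcmp
fsel
```
a, b floating-point; c, d integer
There are two options for this case.

The first uses the floating-point fcmp, moves the resulting fcc to a general-purpose register with movcf2gr, and then uses maskeqz, masknez and or.
```
fcmp
movcf2gr
maskeqz
masknez
or
```
The second moves c and d to floating-point registers with movgr2fr, performs the floating-point branch elimination, and then moves the result back with movfr2gr.
```
movgr2fr
movgr2fr
fcmp
fsel
movfr2gr
```
a, b integer; c, d floating-point

As in the previous case, there are two options.

The first moves the slt result into an fcc with movgr2cf and then uses the floating-point fsel.
```
slt
movgr2cf
fsel
```
The second moves the integer data to floating-point registers, converts it to floating point, and then performs the floating-point branch elimination.
```
movgr2fr
ffint
movgr2fr
ffint
fcmp
fsel
```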
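
One point worth checking for the second option in the last case: the text does not say which ffint form is meant. If a single-precision conversion were used (an assumption made here purely for illustration), 32-bit integers above 2^24 are not all representable, so the floating-point comparison could disagree with the integer one. A portable C sketch of that effect, with hypothetical helper names:

```c
/* Compare two ints after converting both to single precision.
   volatile forces the values to be rounded to float. */
int
lt_via_float (int a, int b)
{
  volatile float fa = (float) a;
  volatile float fb = (float) b;
  return fa < fb;
}

/* Same comparison after converting both to double precision. */
int
lt_via_double (int a, int b)
{
  volatile double da = (double) a;
  volatile double db = (double) b;
  return da < db;
}
```

For example, 16777216 < 16777217 holds for integers, but both values round to the same float, so `lt_via_float` reports false while `lt_via_double` still reports true. A double represents every 32-bit integer exactly, so a double-precision conversion avoids the problem for 32-bit operands.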

0x2

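To implement branch elimination, the following changes to the LoongArch back end are needed:

Raise the branch cost moderately so that GCC generates more branch-free code.
Relax the condition of define_expand "movcc" so that cases where a, b, c and d mix floating-point and integer types can also be handled by loongarch_expand_conditional_move.
Add code to loongarch_expand_conditional_move to handle the cases described in 0x1.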

Problems with extenddi_truncate<mode> and similar patterns

workaround: xen0n@23231e3

A compiler built from the loongson/loongarch_upstream_v5.3 branch (commit 4244eaa) reports the following errors when compiling QEMU:

/tmp/cc9um8Th.s: Assembler messages:
/tmp/cc9um8Th.s:12175: Error: no match insn: ext.w.<size>       $r5,$r5
/tmp/cc9um8Th.s:47097: Error: no match insn: ext.w.<size>       $r6,$r6
/tmp/cc9um8Th.s:47559: Error: no match insn: ext.w.<size>       $r6,$r6
/tmp/cc9um8Th.s:48939: Error: no match insn: ext.w.<size>       $r6,$r6
/tmp/cc9um8Th.s:49398: Error: no match insn: ext.w.<size>       $r6,$r6
/tmp/cc9um8Th.s:94462: Error: no match insn: ext.w.<size>       $r6,$r6
/tmp/cc9um8Th.s:102759: Error: no match insn: ext.w.<size>      $r6,$r6
/tmp/cc9um8Th.s:103233: Error: no match insn: ext.w.<size>      $r6,$r6

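This is clearly caused by the way the pattern is written. As far as I can see, only MIPS has patterns of a similar shape, and their implementation details have since been changed. I tried adjusting ours without success, so I have removed them from my branch for now. Please help investigate and fix this.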

gcc compiler problem: internal compiler error: in output_constructor_regular_field, at varasm.c:5512

loongson/build-tools#12

consider using -mno-check-zero-division as the default?

@xen0n wrote:

Trapping division/modulus operations are signatures of MIPS codegen, and indeed here the trapping-by-default behavior and the flag seem to come from MIPS. However, as division-by-zero in LLVM IR is undefined behavior, why can't we just omit the trapping behavior altogether (and match RISCV in this regard), or at least disable the trapping by default?

RISCV gcc also does not trap for zero divisor at all.

To me it makes sense to enable -mno-check-zero-division as the default, or at least the default for optimized code (with -O2 or more).
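
Integer division by zero is undefined behavior in C as well, so code that must behave predictably cannot rely on the trap and has to test the divisor itself. A minimal sketch with a hypothetical helper `checked_div`, which also rejects the overflowing INT_MIN / -1 case:

```c
#include <limits.h>
#include <stdbool.h>

/* Divide n by d only when the result is defined.
   Returns true and stores the quotient on success, false otherwise. */
bool
checked_div (int n, int d, int *quotient)
{
  if (d == 0)
    return false;                 /* division by zero */
  if (n == INT_MIN && d == -1)
    return false;                 /* quotient would overflow */
  *quotient = n / d;
  return true;
}
```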

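Please ask Chen Huacai to update the function names in loongarch.h to match larchintrin.h

Sorry, since issues cannot be filed on the Linux branch, I am filing this here. Please ask Chen Huacai to update the function names in loongarch.h to match larchintrin.h. The latest gcc renamed several functions in larchintrin.h, but Linux's asm/loongarch.h has not been updated yet, so the kernel fails to build with the latest gcc. Please let Chen know so it can be updated. Thanks! Also, congratulations on your work!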

Incomplete gcc feature support

While building gcc, I found that the following features are not supported:

configure:3294: checking for libitm support                                      
configure:3300: result: no                                                       
configure:3313: checking for libsanitizer support                                
configure:3319: result: no                                                       
configure:3332: checking for libvtv support                                      
configure:3338: result: no                                                       
configure:3462: checking for libphobos support                                   
configure:3468: result: no

Please confirm which of these features can be supported on LoongArch and add the corresponding support. Thanks.

ICE bootstrapping GCC trunk on LoongArch

During a bootstrap of upstream GCC r13-1271, stage 2 libgcc fails with:

/home/xry111/git-repos/gcc-build/./gcc/xgcc -B/home/xry111/git-repos/gcc-build/./gcc/ -B/home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/bin/ -B/home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/lib/ -isystem /home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/include -isystem /home/xry111/gcc-trunk/loongarch64-unknown-linux-gnu/sys-include   -fno-checking -g -O2 -O2  -g -O2 -DIN_GCC    -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem ./include  -fpic -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector  -fpic -I. -I. -I../.././gcc -I../../../gcc/libgcc -I../../../gcc/libgcc/. -I../../../gcc/libgcc/../gcc -I../../../gcc/libgcc/../include  -DHAVE_CC_TLS   -o _ffssi2.o -MT _ffssi2.o -MD -MP -MF _ffssi2.dep -DL_ffssi2 -c ../../../gcc/libgcc/libgcc2.c -fvisibility=hidden -DHIDE_EXPORTS

during GIMPLE pass: thread
In file included from ../../../gcc/libgcc/libgcc2.c:56:
../../../gcc/libgcc/libgcc2.c: In function ‘__clzdi2’:
../../../gcc/libgcc/libgcc2.h:202:25: internal compiler error: Segmentation fault
  202 | #define __NW(a,b)       __ ## a ## di ## b

Currently I have no clue about why this happens.
