Comments (8)
We do actually do a bit of this already; the "MOSIndexIV" pass looks for 16-bit index variables in loops and lowers them to 8 bits.
If you look closely at the assembly in the second version of the code, you can see that this pass is actually working. The X register is carried around the loop completely separately from whatever else is going on there, and the array is accessed using the 8-bit ,X addressing mode. It appears (I'm guessing) that the loop's exit test doesn't receive the same treatment, however, which indicates a gap in the cases handled by that pass.
from llvm-mos.
If the MOSIndexIV pass is designed to narrow index variables in loops, then it takes advantage of some narrowing opportunities but probably misses others. Perhaps it would be good to have a more general data-narrowing pass that takes place earlier in compilation.
We'll probably need both. Loops are a bit special in that there's a circular dependency: both bytes of the 16-bit index feed the 16-bit increment, whose result feeds the next 16-bit increment, and so on. You have to look at the loop as a whole to see that the whole thing only requires 8 bits. It's a bit like cycle detection in a garbage collector.
In LLVM, this is the SCEV (scalar evolution) analysis, which generates, for say the int i in the above loop (and don't quote me on this): i16:<+, 0, 1>. This encodes that i begins at zero and increases by 1 each time through the loop. MOSIndexIV notes that since the loop trip count is 64, the max value of i is 64 (0 + 1 + 1 + 1... 64 times). We then turn the index from (gStuff + i16:<+, 0, 1>) into (gStuff + zero_ext(i8:<+, 0, 1>)). It seems like when loop strength reduction later goes looking for an induction variable to use to test whether we've hit 64 iterations, it should prefer the native i8:<+, 0, 1> to the original i16:<+, 0, 1>, but it doesn't (or it constructs a completely different IV, or something). We'll have to trace it and see what it's doing there.
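To make the rewrite concrete, here's a hand-written C analogue of the transformation as I understand it (my sketch, not compiler output; the recurrence comments use the SCEV-style notation from above):

```c
#include <assert.h>
#include <stdint.h>

char gStuff[64];

/* Original form: the 16-bit recurrence i16:<+, 0, 1> indexes the array. */
static void foo_wide(void) {
    for (int16_t i = 0; i < 64; i++)
        gStuff[i] = (char)i;
}

/* Rewritten form: the trip count is 64, so the index fits in 8 bits, and
 * the address becomes gStuff + zero_ext(i8:<+, 0, 1>), which maps onto
 * the ",X" addressing mode.  Note the exit test still runs on a 16-bit
 * counter here, mirroring the gap described above. */
static void foo_narrow(void) {
    uint8_t iv = 0;                       /* i8:<+, 0, 1> */
    for (int16_t n = 0; n < 64; n++) {
        gStuff[(uint16_t)iv] = (char)iv;  /* zero-extend, then add */
        iv++;
    }
}
```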
There should already be some machinery in LLVM to lower straight-line code, since it'd be important to, say, lower 64-bit operations to 32 bits on a 32-bit system. I haven't really gone looking for it yet, though; it may just be a matter of turning it on and/or telling it that we want it to take things down to 8 bits. A lot of this is triggered on the "native" int type, which we already set to 8 bits, but there may be some optimizations that hardcode i32 (there's a SURPRISING amount of that).
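As an illustration of the kind of lowering meant here, a 16-bit add decomposes into the 8-bit add-with-carry chain the 6502 actually executes (a sketch; the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of legalizing a 16-bit add down to the 8-bit operations the 6502
 * actually has: add the low bytes (CLC; ADC), then add the high bytes plus
 * the carry out of the first add (ADC). */
static uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
    uint8_t alo = (uint8_t)a, ahi = (uint8_t)(a >> 8);
    uint8_t blo = (uint8_t)b, bhi = (uint8_t)(b >> 8);
    uint8_t lo = (uint8_t)(alo + blo);          /* CLC; ADC (low bytes)   */
    uint8_t carry = lo < alo;                   /* carry flag of that add */
    uint8_t hi = (uint8_t)(ahi + bhi + carry);  /* ADC (high bytes)       */
    return (uint16_t)((uint16_t)lo | ((uint16_t)hi << 8));
}
```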
While optimizing Dhrystone, I enabled Loop Strength Reduction (LSR), even for non-native induction variables (IVs). Loop strength reduction rewrites all uses of variables that vary with the loop (induction variables) so that as few of them as possible need to be maintained, and so that updating them is as cheap as possible. This will, for example, convert a multiplication in a loop to an addition of the stride on each iteration, among a host of other optimizations. LSR uses information about target addressing modes to try to produce expressions that will later reduce to addressing modes, rather than computing those values into registers.
A ton of induction variables will be 16-bit or larger on MOS, but that doesn't eliminate the need to rewrite them to, e.g., count down to zero. LLVM didn't do this, since LSR wasn't smart enough to avoid producing a lot of non-native IVs when smaller IVs would suffice.
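In C terms, the count-down rewrite looks roughly like this (my sketch of the transformation, not LSR's actual output):

```c
#include <assert.h>

char gStuff[64];

/* Up-counting form: the exit test needs an explicit compare (CMP #64). */
static void fill_up(void) {
    for (unsigned char i = 0; i < 64; i++)
        gStuff[i] = (char)i;
}

/* Count-down rewrite (sketch): keep the original IV for addressing, and
 * drive the exit test with a second IV that counts down to zero, so the
 * branch can key off the decrement's zero flag with no compare. */
static void fill_down(void) {
    unsigned char i = 0;
    unsigned char n = 64;
    do {
        gStuff[i] = (char)i;
        i++;
    } while (--n != 0);
}
```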
Turning this on broke the old IndexIV optimization, since LSR completely rewrites away the narrow IV to a wider one. We need to actually teach LSR how the 6502's addressing modes work. This will require extending its internal interfaces quite a bit; it makes assumptions that are not valid on MOS, so there's no way at present to tell it about MOS addressing modes.
The latest spate of changes should have repaired the IndexIV optimization, assuming they pass testing.
Here are the latest results as of today.
This program:
char gStuff[64];
void foo(char x) {
    for (int i = 0; i < 64; i++) {
        gStuff[i] = i;
    }
}
Gets compiled to:
foo: ; @foo
; %bb.0: ; %entry
ldx #64
ldy #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tya
sta gStuff,y
clc
adc #1
tay
clc
txa
adc #-1
tax
lda #0
cmp #0
bne .LBB0_1
; %bb.2: ; %for.body
; in Loop: Header=BB0_1 Depth=1
txa
cpx #0
bne .LBB0_1
; %bb.3: ; %for.cond.cleanup
rts
Note the useless pattern lda #0 / cmp #0.
For reference, here are the results when int i is changed to char i:
foo: ; @foo
; %bb.0: ; %entry
lda #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tax
sta gStuff,x
clc
adc #1
cmp #64
bne .LBB0_1
; %bb.2: ; %for.cond.cleanup
rts
I'm still working on this pretty actively; it's been an absolutely wild ride through LLVM's codebase so far.
I've had to rather substantially alter the data model used by Loop Strength Reduction, which is the primary pass that deals with hardware addressing modes in LLVM. The 6502's addressing modes were completely unrepresentable in that pass, due to the hard assumption that all addressing modes are of the form "base + reg + scale * scalereg", where all of base, reg, and scalereg have exactly the same size.
After attempting several hacks and alternatives, I've broken that assumption in LoopStrengthReduction; the registers can now be narrower than the full addition, and they're implicitly zero-extended before adding. (This was a fairly hefty change, so it's taken up most of my time on this project over the last few weeks.)
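To illustrate the assumption being broken, here's a toy C model of the 6502's abs,X-style mode; the type and function names are mine, not LSR's actual interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of why the classic formula doesn't fit the 6502.  Classic LSR
 * assumes base + reg + scale*scalereg, with all operands the same width;
 * the 6502's "abs,X" mode is a 16-bit base plus an 8-bit index register
 * that the hardware implicitly zero-extends before the add. */
typedef struct {
    uint16_t base;  /* 16-bit absolute address encoded in the instruction */
    uint8_t index;  /* X or Y register: only 8 bits wide */
} MOSAbsIndexed;

static uint16_t effective_address(MOSAbsIndexed am) {
    return (uint16_t)(am.base + (uint16_t)am.index); /* zero_ext, then add */
}
```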
I've had to disable a number of sections of LSR in the scenario we're dealing with, since I haven't repaired the logic yet, and they were spitting out garbage. I'm still hopeful that once these sections are repaired, LSR should pick really good induction variables to travel around the loop with. At least, it looks fairly promising so far.
Thanks for keeping an eye on this; I'll post an update once I've finished tweaking LSR.
Oh, and one side effect of this is that LSR can really change the input to the rest of the code generator, which isn't very well optimized either. Often a really good decision by LSR has provoked a really poor decision elsewhere, just by coincidence. My goal for this pass is to get LSR putting out consistently good and clean loop code; this may actually make some of our benchmarks slower until we clean up all the little problems elsewhere in the codebase.
After quite a lot more futzing with LSR, the given example is now:
char gStuff[64];
void foo(char x) {
    for (int i = 0; i < 64; i++) {
        gStuff[i] = i;
    }
}
foo: ; @foo
; %bb.0: ; %entry
lda #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tax
sta gStuff,x
clc
adc #1
cmp #64
bne .LBB0_1
; %bb.2: ; %for.cond.cleanup
rts
.Lfunc_end0:
LSR seems to be doing a mostly okay job now from a cursory look at the benchmarks; at the very least, returns are starting to diminish. The changes to it seem fairly brittle; it's clear that we'll need to hammer on LSR a lot more throughout the life of the project, but que sera sera.
Closing this one for now, since there's at least a reasonable pass at handling this kind of optimization, even if it ends up not always "sticking" for one reason or another.