Comments (8)
We do actually do a bit of this already; the "MOSIndexIV" pass looks for 16-bit index variables in loops and lowers them to 8 bits.
If you look closely at the assembly in the second version of the code, you can see that this pass is actually working. The X register is carried around the loop completely separately from whatever else is going on there, and the array is accessed using the 8-bit ,X addressing mode. It appears (I'm guessing) that the loop's exit test doesn't receive the same treatment, however, which indicates a gap in the cases handled by that pass.
from llvm-mos.
If the MOSIndexIV pass is designed to narrow index variables in loops, then it takes advantage of some narrowing opportunities but probably misses others. Perhaps it would be good to have a more general data-narrowing pass that takes place earlier in compilation.
We'll probably need both. Loops are a bit special in that there's a circular dependency: both bytes of the 16-bit index feed the 16-bit increment, whose result feeds the next 16-bit increment, and so on. You have to look at the loop as a whole to see that the whole thing only requires 8 bits. It's a bit like cycle detection in a garbage collector.
In LLVM, this is the SCEV (scalar evolution) analysis, which generates, for say the int i in the above loop (and don't quote me on this): i16:<+, 0, 1>. This encodes that i begins at zero and increases by 1 each time through the loop. MOSIndexIV notes that since the loop trip count is 64, the max value of i is 64 (0 + 1 + 1 + 1... 64 times). We then turn the index from (gStuff + i16:<+, 0, 1>) into (gStuff + zero_ext(i8:<+, 0, 1>)). It seems like when loop strength reduction later goes looking for an induction variable to use to test whether we've hit 64 iterations, it should prefer the native i8:<+, 0, 1> to the original i16:<+, 0, 1>, but it doesn't (or it constructs a completely different IV, or something). We'll have to trace it and see what it's doing there.
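To make the rewrite concrete, here's a hand-written C analogue of the transformation as I understand it (my sketch, not compiler output; the recurrence comments use the SCEV-style notation from above):

```c
#include <assert.h>
#include <stdint.h>

char gStuff[64];

/* Original form: the 16-bit recurrence i16:<+, 0, 1> indexes the array. */
static void foo_wide(void) {
    for (int16_t i = 0; i < 64; i++)
        gStuff[i] = (char)i;
}

/* Rewritten form: the trip count is 64, so the index fits in 8 bits, and
 * the address becomes gStuff + zero_ext(i8:<+, 0, 1>), which maps onto
 * the ",X" addressing mode.  Note the exit test still runs on a 16-bit
 * counter here, mirroring the gap described above. */
static void foo_narrow(void) {
    uint8_t iv = 0;                       /* i8:<+, 0, 1> */
    for (int16_t n = 0; n < 64; n++) {
        gStuff[(uint16_t)iv] = (char)iv;  /* zero-extend, then add */
        iv++;
    }
}
```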
There should already be some machinery in LLVM to lower straight-line code, since it'd be important to, say, lower 64-bit operations to 32 bits on a 32-bit system. I haven't really gone looking for it yet, though; it may just be a matter of turning it on and/or telling it that we want it to take things down to 8 bits. A lot of this is triggered on the "native" int type, which we already set to 8 bits, but there may be some optimizations that hardcode i32 (there's a SURPRISING amount of that).
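As an illustration of the kind of lowering meant here, a 16-bit add decomposes into the 8-bit add-with-carry chain the 6502 actually executes (a sketch; the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of legalizing a 16-bit add down to the 8-bit operations the 6502
 * actually has: add the low bytes (CLC; ADC), then add the high bytes plus
 * the carry out of the first add (ADC). */
static uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
    uint8_t alo = (uint8_t)a, ahi = (uint8_t)(a >> 8);
    uint8_t blo = (uint8_t)b, bhi = (uint8_t)(b >> 8);
    uint8_t lo = (uint8_t)(alo + blo);          /* CLC; ADC (low bytes)   */
    uint8_t carry = lo < alo;                   /* carry flag of that add */
    uint8_t hi = (uint8_t)(ahi + bhi + carry);  /* ADC (high bytes)       */
    return (uint16_t)((uint16_t)lo | ((uint16_t)hi << 8));
}
```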
While optimizing Dhrystone, I enabled Loop Strength Reduction (LSR), even for non-native induction variables (IVs). Loop strength reduction rewrites all uses of variables that vary with the loop (induction variables) so that as few of them as possible need to be maintained, and so that updating them is as cheap as possible. This will, for example, convert a multiplication in a loop to an addition of the stride on each iteration, among a host of other optimizations. LSR uses information about target addressing modes to try to produce expressions that will later reduce to addressing modes, rather than computing those values into registers.
A ton of induction variables will be 16-bit or larger on MOS, but that doesn't eliminate the need to rewrite them to, e.g., count down to zero. LLVM didn't do this, since LSR wasn't smart enough to avoid producing a lot of non-native IVs when smaller IVs would suffice.
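In C terms, the count-down rewrite looks roughly like this (my sketch of the transformation, not LSR's actual output):

```c
#include <assert.h>

char gStuff[64];

/* Up-counting form: the exit test needs an explicit compare (CMP #64). */
static void fill_up(void) {
    for (unsigned char i = 0; i < 64; i++)
        gStuff[i] = (char)i;
}

/* Count-down rewrite (sketch): keep the original IV for addressing, and
 * drive the exit test with a second IV that counts down to zero, so the
 * branch can key off the decrement's zero flag with no compare. */
static void fill_down(void) {
    unsigned char i = 0;
    unsigned char n = 64;
    do {
        gStuff[i] = (char)i;
        i++;
    } while (--n != 0);
}
```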
Turning this on broke the old IndexIV optimization, since LSR completely rewrites away the narrow IV to a wider one. We need to actually teach LSR how the 6502's addressing modes work. This will require extending its internal interfaces quite a bit; it makes assumptions that are not valid on MOS, so there's no way at present to tell it about MOS addressing modes.
The latest spate of changes should have repaired the IndexIV optimization, assuming they pass testing.
Here are the latest results as of today.
This program:
char gStuff[64];
void foo(char x) {
    for (int i = 0; i < 64; i++) {
        gStuff[i] = i;
    }
}
Gets compiled to:
foo: ; @foo
; %bb.0: ; %entry
ldx #64
ldy #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tya
sta gStuff,y
clc
adc #1
tay
clc
txa
adc #-1
tax
lda #0
cmp #0
bne .LBB0_1
; %bb.2: ; %for.body
; in Loop: Header=BB0_1 Depth=1
txa
cpx #0
bne .LBB0_1
; %bb.3: ; %for.cond.cleanup
rts
Note the useless pattern lda #0 / cmp #0.
For reference, here are the results when int i is changed to char i:
foo: ; @foo
; %bb.0: ; %entry
lda #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tax
sta gStuff,x
clc
adc #1
cmp #64
bne .LBB0_1
; %bb.2: ; %for.cond.cleanup
rts
I'm still working on this pretty actively; it's been an absolutely wild ride through LLVM's codebase so far.
I've had to rather substantially alter the data model used by Loop Strength Reduction, which is the primary pass that deals with hardware addressing modes in LLVM. The 6502's addressing modes were completely unrepresentable in that pass, due to the hard assumption that all addressing modes are of the form "base + reg + scale * scalereg", where all of base, reg, and scalereg have exactly the same size.
After attempting several hacks and alternatives, I've broken that assumption in LoopStrengthReduction; the registers can now be narrower than the full addition, and they're implicitly zero-extended before adding. (This was a fairly hefty change, so it's taken up most of my time on this project over the last few weeks.)
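To illustrate the assumption being broken, here's a toy C model of the 6502's abs,X-style mode; the type and function names are mine, not LSR's actual interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of why the classic formula doesn't fit the 6502.  Classic LSR
 * assumes base + reg + scale*scalereg, with all operands the same width;
 * the 6502's "abs,X" mode is a 16-bit base plus an 8-bit index register
 * that the hardware implicitly zero-extends before the add. */
typedef struct {
    uint16_t base;  /* 16-bit absolute address encoded in the instruction */
    uint8_t index;  /* X or Y register: only 8 bits wide */
} MOSAbsIndexed;

static uint16_t effective_address(MOSAbsIndexed am) {
    return (uint16_t)(am.base + (uint16_t)am.index); /* zero_ext, then add */
}
```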
I've had to disable a number of sections of LSR in the scenario we're dealing with, since I haven't repaired the logic yet, and they were spitting out garbage. I'm still hopeful that once these sections are repaired, LSR should pick really good induction variables to travel around the loop with. At least, it looks fairly promising so far.
Thanks for keeping an eye on this; I'll post an update once I've finished tweaking LSR.
Oh, and one side effect of this is that LSR can really change the input to the rest of the code generator, which isn't very well optimized either. Often a really good decision by LSR has provoked a really poor decision elsewhere, just by coincidence. My goal for this pass is to get LSR putting out consistently good and clean loop code; this may actually make some of our benchmarks slower until we clean up all the little problems elsewhere in the codebase.
After quite a lot more futzing with LSR, the given example is now:
char gStuff[64];
void foo(char x) {
    for (int i = 0; i < 64; i++) {
        gStuff[i] = i;
    }
}
foo: ; @foo
; %bb.0: ; %entry
lda #0
.LBB0_1: ; %for.body
; =>This Inner Loop Header: Depth=1
tax
sta gStuff,x
clc
adc #1
cmp #64
bne .LBB0_1
; %bb.2: ; %for.cond.cleanup
rts
.Lfunc_end0:
LSR seems to be doing a mostly okay job now from a cursory look at the benchmarks; at the very least, returns are starting to diminish. The changes to it seem fairly brittle; it's clear that we'll need to hammer on LSR a lot more throughout the life of the project, but que sera sera.
Closing this one for now, since there's at least a reasonable pass at handling this kind of optimization, even if it ends up not always "sticking" for one reason or another.