
Vector operations · raptorjit · 26 comments · open

raptorjit commented on June 6, 2024
Vector operations

from raptorjit.

Comments (26)

lukego commented on June 6, 2024

Some operations would greatly benefit if RaptorJIT had vector operations.

I think it is worth being more explicit here. How widely applicable would vector intrinsics be for the kinds of applications that people will really want to write with RaptorJIT?

In Snabb land we are occasionally writing vector code and we do that by writing assembler code with DynASM (example: AVX2 IP checksum).

Having intrinsics support could be handy if it lets us write that code more simply in Lua or if it automatically retargets it for different instruction sets (e.g. AVX2 vs AVX512.)

On the other hand, assembler may be appropriate: when we write this code we are usually chasing maximum performance, and I believe that @wingo has opined that AVX512 is quirky enough that it's hard to imagine how to program it effectively except at the asm level.

wingo commented on June 6, 2024

IMHO parallelization in the compiler is too hard. If instead you add SIMD types to Lua, then we have to choose either a subset or a superset of the functionality of Snabb's targets. Even though we target server Xeons, there is a lot of variation in those chips: a subset would be suboptimal and go unused (we have some machines that don't even do AVX2), while a superset of SIMD types and operations would be very expensive to build, both in the runtime and in the compiler, and wouldn't provide notable benefits over just coding in assembler -- you need to know assembler anyway to be able to use these data types effectively.

Many of the considerations here apply: https://www.mail-archive.com/[email protected]/msg25237.html

I plan to continue coding in DynASM.

wingo commented on June 6, 2024

As an add-on -- the API that you propose @laaas could be implemented by a user, I think, and if you had a DynASM compiler in the backend, you could generate code specialized to the processor you are running on (and possibly specialized to its width, but that's trickier).

The problem with using DynASM is that it's a compiler barrier: LuaJIT models a call to a "foreign" function as reading and writing all memory (and clobbering all scratch registers), so it prevents code motion in many cases. The solution is to be able to annotate foreign functions with their effects; there is a bug somewhere for that. Otherwise, you need to make your DynASM block big enough so that it does enough work so that the barrier cost is not significant.

lukego commented on June 6, 2024

The problem with using DynASM is that it's a compiler barrier

Could potentially be solved by merging @fsfod's Intrinsics branch: LuaJIT/LuaJIT#116. This can take machine code, e.g. from DynASM, and stitch it into the JIT code.

L-as commented on June 6, 2024

Wow, that PR is old.

L-as commented on June 6, 2024

Yeah I definitely think those kinds of intrinsics are the way forward.

lukego commented on June 6, 2024

Wow, that PR is old.

This is part of the reason that RaptorJIT exists. There needs to be a community where new features can get feedback, get merged, get used. LuaJIT is not serving that specific purpose because it is being maintained very conservatively.

I am still interested in measuring the practical impact of the various barriers when it comes to calling assembler code. On the one hand it seems intuitive that intrinsics should provide a substantial speed-up for pulling in generated code; on the other hand I don't like relying on intuitive reasoning when it comes to optimizations :)

L-as commented on June 6, 2024

I actually just asked about this on (the dead) Slack, so I'll just reiterate what I said there:

I'm wondering if RaptorJIT is supposed to be that new upstream, as discussed here: https://github.com/fsfod/LuaJIT/issues/3#issuecomment-257885440
And yet Windows support, and support for other processor architectures, have been dropped.
But then again you also say this: "Forks focused on other CPU families (Atom, Xeon Phi, AMD, VIA, etc) are encouraged and may be merged in the future."
What is RaptorJIT supposed to be?
Is it just Snabb's fork of LuaJIT?
Or is it supposed to be that upstream branch?
And if not, should one be made?

lukego commented on June 6, 2024

What is RaptorJIT supposed to be?

RaptorJIT is supposed to be an active, forward-moving, community-oriented fork of LuaJIT for Linux/x86-64 server applications. It's a project where people working in the same broad application domain can share features and code maintenance.

It's also a place where people working in other application domains, like video games, can cherry-pick features that they want too. Or where people can develop their ideas for new features that can be developed for use with multiple LuaJIT forks. Or people who want to learn about tracing JIT implementation. Or... we will see who turns up :).

Is it just Snabb's fork of LuaJIT?

No. Snabb has had a LuaJIT branch for a long time, living directly in the Snabb repo, not being promoted to other people. Lots of other projects are surely doing the same thing too.

RaptorJIT is a community project where we can collaborate and merge things together instead of each playing in our own sandbox. It's more narrowly focused than LuaJIT but should serve a reasonable subset of the community - e.g. packet networking people, OpenResty people, HFT people, etc - and hopefully onboard new people from outside the current LuaJIT world too. That is what I would like to see, anyway.

I'm wondering if RaptorJIT is supposed to be that new upstream, as discussed here: fsfod/LuaJIT#3 (comment)

Not exactly.

LuaJIT upstream is in the interesting situation that the only branches that exist (master and v2.1) are in feature freeze. This would seem to mean that the whole project is stuck in feature freeze indefinitely for the simple reason that there is no branch accepting new features.

If somebody would create an official v2.2 or v3.0 branch and start landing code for new features then this situation would change, but that hasn't happened, and at this stage I don't have any expectation that it will in the future. I also think the "LuaJIT" brand has been tarnished with the perception that the project is abandonware and that this is likely to limit uptake in new projects.

So RaptorJIT is now a place where new features, like a garbage collector or the intrinsics support, can be merged and maintained if it turns out to make sense for this community. But it's not the place where LuaJIT development is done.

lukego commented on June 6, 2024

on (the dead) slack

The Slack is not dead, it just hasn't come to life yet :). It's early days for RaptorJIT and realtime chat requires a certain critical mass to be active. However - I would much prefer to have all substantial discussions here on Github and mostly use Slack for shooting the breeze. That's because people who hate IM should feel comfortable contributing to RaptorJIT :).

L-as commented on June 6, 2024

What's the difference between being unborn and dead?

L-as commented on June 6, 2024

Thank you for clarifying

lukego commented on June 6, 2024

What's the difference between being unborn and dead?

Let's call it embryonic :).

L-as commented on June 6, 2024

Another thought: With the recent EPYC CPUs, is it really a wise decision to lock yourselves down to Intel? They look interesting, but then again, I don't host any servers.

lukego commented on June 6, 2024

@laaas I see this as an "area under the curve" problem. Sure, other processors may become relevant in the future, but we shouldn't pay for them unless/until they become so.

There are a lot of direct and indirect costs to supporting many platforms: more testing required, more bugs, more distractions while adding new features, a higher barrier to entry for maintainers, a disincentive to take advantage of really good non-portable features, etc. These need to be offset by commensurate benefits somehow, and I don't see that being the situation today.

(I wouldn't be surprised if a major reason that LuaJIT struggles to find a new maintainer is the requirement to have an encyclopaedic knowledge of the processors and toolchains for games consoles, etc. Few people possess these skills and probably even fewer are eager to acquire them. Sure, you could solve it by spreading the work between many people, but then you still have a bootstrapping problem of getting everybody involved, organizing them, keeping them engaged, etc.)

L-as commented on June 6, 2024

They're relevant now though.

wingo commented on June 6, 2024

The new AMD CPUs would not require significant modification on the RaptorJIT level; it's still x86-64 with a modern ISA. From that perspective, it's a moot point to bring up in a RaptorJIT context I think; there's no area under that curve right now, and focusing on Xeon doesn't prevent EPYC targets in the future.

However in Snabb I am not currently interested in EPYC: it's the whole architecture (especially the PCI and memory buses) that needs testing and baking-in, and it's not something deployable now (or indeed within the next 2 years), so narrowing focus to Xeon makes sense in that context.

lukego commented on June 6, 2024

One relevant case is #55. This is an optimization that improved the md5 benchmark performance by 15%. It's a simple machine-code generation change that targets the Intel Core microarchitecture with a well-understood rationale that is explicitly advocated by Intel.

This would have been harder if we required new optimizations to give equal weight to Intel Core, AMD EPYC, etc. We would have to benchmark all of them, we would have to consider the trade-offs of improving one at the expense of another, and in case of conflict we would need to implement both and select between them at runtime. This to me would represent at least some "area under the curve" effort.

That change was very easy to land in RaptorJIT. OCaml has been pondering it for several years. I have not even submitted it to LuaJIT upstream because my head hurts just thinking about how I would advocate for it without knowing what CPUs other people are using and how the change impacts them.

lukego commented on June 6, 2024

Having said that: I find it easy to imagine a future in which our CI does cover more microarchitectures and we are able to consider the broad impact of our optimizations across many x86-64 CPU families. That would be neat. But doing that in an ad-hoc way, with everybody doing their own private testing with the hardware they have lying around, sounds like busy-work to me.

L-as commented on June 6, 2024

Couldn't you just have put in an if branch that checked whether it was indeed Intel? I don't think that would be very hard: just have an enum for the CPU families, then a global of this type.

if (cpu_family > CPU_HASWELL && cpu_family < CPU_SKYLAKE) // affected CPUs

lukego commented on June 6, 2024

@laaas That's what technical debt looks like, to me.

  • What do you do on the else clause? (This change may be beneficial on the other microarchitectures too.)
  • How do you keep track of whether both branches work? (The CI needs to test both versions now.)
  • How do you deal with the explosion of these ifdef lines? This gets out of control quickly e.g. for GC32 vs GC64 modes of LuaJIT.

I'm not saying this is bad. I'm only saying that it makes things more complicated, i.e. there is a cost, so there needs to be a commensurate benefit.

L-as commented on June 6, 2024

It wouldn't be ifdef, since it wouldn't be decided at compile time.
You could just change the global variable to something else:
raptorjit --emulate=RYZEN myfile.lua

L-as commented on June 6, 2024

I.e. you don't need to build a version for each configuration, just test it with each target. Of course performance regressions won't be found, but at least you'll find bugs.
This won't work with non-x86 architectures of course, unless you use QEMU, which doesn't sound that bad TBH.

lukego commented on June 6, 2024

@laaas Related: RaptorJIT Performance

RaptorJIT takes a quantitative approach to performance. The value of an optimization must be demonstrated by a reproducible benchmark. Optimizations that are not demonstrably beneficial for the currently supported CPUs are removed.

To me your code looks like an optimization for non-Haswell/Skylake CPUs. That's fine and well, and potentially very valuable. But I would want to have a reproducible benchmark to demonstrate its effectiveness.

L-as commented on June 6, 2024

People who do use those CPUs could test for performance regressions locally, no?
Also, back to the example: it wouldn't hurt performance for non-Skylake CPUs.
Basically it assumes that there is an enum like this:

enum CPU {
  BULLDOZER,
  PILEDRIVER,
  STEAMROLLER,
  EXCAVATOR,
  IVYBRIDGE,
  HASWELL,
  BROADWELL,
  SKYLAKE,
};

and that the if branch would then apply to all CPU families between those two families, i.e. HASWELL, BROADWELL & SKYLAKE. (I did write < instead of <= by accident).

The idea is then that this would be a change/optimisation that is for these CPUs. The code for other CPUs would not change, and there would thus be no regressions for them.

By not having such conditional branches, you are introducing regressions for other microarchitectures for no good reason other than reducing the amount of code, and for cases as small as this, I don't think that hurts much.

lukego commented on June 6, 2024

Could be time to create a separate issue if you want to advocate for a good way to support more microarchitectures.
