So! I am curious to better understand the current hashing approach before removing it. How good is it? How much could it be improved by increasing the hash size? Here is a first attempt at an analytical answer.
Suppose you have N hash buckets and H traces that are already hot. Suppose you now execute a bytecode B that is not supposed to be hot and will screw up your performance if it starts a trace. What is the probability that bytecode B will be traced due to a collision with one of the hot traces H in one of the N hash buckets?
It seems to me that this is equivalent to the "birthday problem", and more specifically the "same birthday as you" problem. In that case we can calculate the probability of screwing up performance by tracing bytecode B as 1-((N-1)/N)^H.
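That number is easy to sanity-check directly. The following is just the uniform-hash model above expressed in code, not anything from the actual VM:

```python
# Probability that bytecode B shares a bucket with at least one of H
# hot traces, assuming a perfectly uniform hash over N buckets
# (the "same birthday as you" model).
def p_collision(n_buckets, n_hot):
    return 1.0 - ((n_buckets - 1) / n_buckets) ** n_hot

print(p_collision(64, 10))   # roughly 0.146, i.e. about one chance in seven
```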
Here is how that looks on a graph (warning: log scales):
So what can we say about this?
Suppose that we were willing to accept a 1% chance of randomly screwing up performance. How many hot traces can we have when we execute the sensitive bytecode B that is not supposed to start a trace? Then 64 buckets are too few, 256 buckets are okay with up to 2 hot traces, 1024 buckets with up to 10, and so on up to 8192 buckets, which are okay with around 80 hot traces.
Suppose alternatively that we are very concerned about consistency and we are not willing to take on more than a 0.01% chance of degrading performance due to a random collision. Then the picture is bleak: we would need the 8192 hash buckets, and even that would not be sufficient, since a single hot trace already gives a collision probability of 1/8192, about 0.012%. So hashing does not seem like a suitable approach in this context.
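Solving the same formula for the largest tolerable H makes these thresholds easy to reproduce (again purely the uniform-hash model, not the real implementation):

```python
import math

# Largest H such that 1 - ((N-1)/N)**H stays within a probability
# budget, i.e. H_max = floor(log(1 - budget) / log((N-1)/N)).
def max_hot_traces(n_buckets, budget):
    return math.floor(math.log(1.0 - budget) /
                      math.log((n_buckets - 1) / n_buckets))

for n in (64, 256, 1024, 8192):
    print(n, max_hot_traces(n, 0.01), max_hot_traces(n, 0.0001))
```

At a 1% budget this reproduces the 2, 10, and roughly 80 hot traces for 256, 1024, and 8192 buckets; at a 0.01% budget even 8192 buckets allow none.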
RaptorJIT needs to support the latter kind of applications. For example, the Snabb CI executes around 100,000 benchmarks per month and we need to be able to resolve any issues that cause outlier results; extra noise from non-deterministic tracing would mask other problems. Likewise Snabb users need to be able to depend on consistent performance: it would not be okay if, for every 100 routers you deploy, one of them randomly performs badly. (This uncertainty would also lead to a "have you tried turning it off and on again?" approach to problems, which always makes support complicated.)
Add to this that the real-world behavior is probably worse than the theoretical case that I have sketched above, e.g. I assumed that the hash function would perfectly distribute bytecodes between buckets whereas the real implementation is a simple shift-and-mask.
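To illustrate why shift-and-mask falls short of uniform, here is a minimal model (the table size and shift amount are assumptions for illustration; the real interpreter derives the bucket from the bytecode's address):

```python
HOTCOUNT_BUCKETS = 64  # assumed table size, a power of two

def bucket_for_pc(pc):
    # Drop the low alignment bits (bytecode instructions are 4 bytes
    # wide), then mask down to the table size. The bucket is just the
    # address modulo 256 bytes, so regularly spaced code collides.
    return (pc >> 2) & (HOTCOUNT_BUCKETS - 1)

# Two loop headers exactly 256 bytes apart always share a bucket:
print(bucket_for_pc(0x1000), bucket_for_pc(0x1100))  # 0 0
```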
Conclusion: I have not yet been able to convince myself that the hashing approach is valid for RaptorJIT given our requirement of being the ideal compiler for soft-realtime applications.
... Caveat: Could be that I have bungled the analysis completely! Great if somebody wants to check my work :-)
from raptorjit.
a few data points:
- each trace exit has its own (non-hashed) hotcount. after 10 exits a side trace is generated. doesn't use or touch the global hotcount hash.
- not sure how you're defining "hot bytecodes" in your statistical thought experiment. but remember that only loops and functions actually touch the hotcounters. that means a lower density of "collisionable" bytecodes.
- i kinda like the idea of a HOTCOUNT bytecode. some (contradicting) ideas about taking it further:
- do the decrementing (and trace start) in the instruction itself, remove the .hotloop/.hotcall macros from loop/function instructions.
- use the second argument to indicate how much to decrement it. maybe the parser could get some hints about loop priority? currently loops are hardcoded to -=2, while functions are -=1. or maybe a profiler could tweak them?
- make this the patched instruction on trace compile (instead of the loop instructions). helps on the "detach from functions" idea; may simplify some hardcoded checks in the interpreter and the tracer, which enumerate instructions and their Jxxx/Ixxx variations. if the only "base/Jxxx/Ixxx" instruction is HOTCOUNT, we would know that all root traces start with it. again, simpler conditions on many parts.
- while decrementing is just decrementing, penalizing is doubling (plus a small random), and blacklisting comes at 60,000. to keep the same or similar heuristics you need 16 bits, so better to use the "D" argument.
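For reference, a sketch of the penalty arithmetic in that last point (the starting value of 36 and the size of the random term are assumptions in the spirit of the stock heuristics, not the real code):

```python
import random

PENALTY_MAX = 60000  # blacklisting threshold mentioned above

def penalty_steps(start=36, seed=0):
    # Each aborted trace doubles the penalty plus a small random,
    # so only about ten aborts happen before blacklisting kicks in.
    rng = random.Random(seed)
    p, steps = start, []
    while p < PENALTY_MAX:
        steps.append(p)
        p = p * 2 + rng.randrange(16)
    return steps

print(penalty_steps())
```

The later values run well past 255, which is why 8 bits cannot carry the penalty state and the 16-bit "D" argument is the natural home for it.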
each trace exit has its own (non-hashed) hotcount. after 10 exits a side trace is generated. doesn't use or touch the global hotcount hash.
Nice. So side-trace exits are already using individual buckets like the scheme we are considering.
not sure how you're defining "hot bytecodes" in your statistical thought experiment.
It's a bit fuzzy. I think one major problem with my formulation is that I assume the hash buckets get "hot" and then stay hot i.e. that the hotcount latches at zero. However, looking at the code the counter actually wraps back to 255. So it looks like in reality the effect of hash collisions will be to randomize the hotcount as each bytecode is traced when the counter passes zero on its way back to 255.
I am not sure what the implications are exactly. Could be that this scheme works better because even with a random initial count the loops will tend to be traced first if they are executing more often and decrementing the count by 2. However I still don't see how to construct an argument that this mechanism will provide robust and predictable behavior.
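One way to see the wrap effect concretely is a toy model of a single shared 8-bit bucket (the reset value of 56 and the reset-on-trace behaviour are assumptions for illustration):

```python
HOTCOUNT_START = 56  # assumed reset value; loops decrement by 2

def traced_sites(schedule, start=HOTCOUNT_START):
    # Each event is (site, decrement). The counter wraps modulo 256
    # instead of latching at zero; a site is "traced" only when its
    # decrement lands exactly on zero.
    count, traced = start, []
    for site, dec in schedule:
        count = (count - dec) & 0xFF  # wrap 0 -> 255
        if count == 0:
            traced.append(site)
            count = start  # assume the counter resets when tracing starts
    return traced

# A pure loop reaches zero after 28 decrements and gets traced:
print(traced_sites([("loop", 2)] * 28))                  # ['loop']
# A single colliding call flips the parity: the loop now steps over
# zero and the counter wraps to 255, so nothing gets traced at all.
print(traced_sites([("call", 1)] + [("loop", 2)] * 40))  # []
```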
but remember that only loops and functions actually touch the hotcounters. that means a lower density of "collisionable" bytecodes.
Right. However, a quick grep of the Snabb sources suggests that we have around 10,000 functions and loops. These are all being hashed onto 64 counters. This seems like madness to me. Seems like one would expect everything to be colliding with everything else, perhaps like when running the benchmark suite with only 8 hash buckets (since the benchmark programs are so tiny and have so few bytecodes that will bump the hotcount).
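The back-of-envelope numbers support that, even under the generous assumption of a perfectly uniform hash:

```python
# Expected number of sites that do NOT share their bucket with any
# other site, assuming `sites` sites uniformly hashed into `buckets`:
# each site is alone with probability ((buckets-1)/buckets)^(sites-1).
def expected_alone(sites, buckets):
    return sites * ((buckets - 1) / buckets) ** (sites - 1)

print(expected_alone(10_000, 64))    # astronomically close to zero
print(expected_alone(10_000, 8192))  # still only about 2950 of 10,000
```

At 64 buckets essentially every one of the 10,000 sites shares its counter with hundreds of others; even at 8192 buckets a majority still collide.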
do the decrementing (and trace start) in the instruction itself, remove the .hotloop/.hotcall macros from loop/function instructions.
I like this idea!
while decrementing is just decrementing, penalizing is doubling (plus a small random), and blacklisting comes at 60,000. to keep the same or similar heuristics you need 16 bits, so better to use the "D" argument.
I like this idea too. I would actually like to eliminate blacklisting completely and replace it with some suitable backoff mechanism (maybe in the spirit of TCP RTO.)
The risk I see with blacklisting is that it will occur on some very obscure code path of an application, e.g. a weird combination of configurations options or a DoS-like workload, and while I don't want to waste time constantly retracing during this period I do want a new recording to be made when the situation changes. Otherwise the blacklist is forever and maybe the server is still running code in the interpreter after years because of something funny that happened for a short period of time.
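A sketch of what such a backoff could look like (a hypothetical design in the spirit of TCP RTO, not existing RaptorJIT code; the constants are placeholders):

```python
class TraceBackoff:
    """Retry tracing with exponential backoff instead of a permanent
    blacklist: failures push the next attempt further out, but the
    site always gets another chance eventually."""

    def __init__(self, base=1_000, cap=10_000_000):
        self.base, self.cap = base, cap
        self.wait = base  # executions to wait before the next attempt

    def on_abort(self):
        self.wait = min(self.wait * 2, self.cap)  # back off, capped

    def on_success(self):
        self.wait = self.base  # behaviour changed: recover quickly

b = TraceBackoff()
for _ in range(5):
    b.on_abort()
print(b.wait)  # 32000
b.on_success()
print(b.wait)  # 1000
```

The cap plays the role of the 60,000 blacklist threshold, except that instead of giving up forever the site is merely retried very rarely.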
make this the patched instruction on trace compile (instead of the loop instructions)
This is an interesting aspect. I suppose this means the HOTC bytecode would need to come before the loop so that you branch into JIT code without the interpreter doing the setup?
It does sound like an attractive simplification if we would have two new bytecodes: HOTC for counting, patched/replaced with JTRACE to branch to mcode once the trace is recorded. Could we then retire the many special-case bytecodes for transitioning to mcode (JFORI JFORL JITERL JLOOP JFUNCF JFUNCV) and the many special-case bytecodes for marking blacklisted code (IFORL IITERL ILOOP IFUNCF IFUNCV)? If that would pan out then we have really reduced the bytecode list and can retire some really annoyingly clever code and the bytecode-numbering invariants required to make that work.
Or maybe it is not that simple... I suppose at least when we record the trace it would need to start with mcode equivalent to what the original bytecodes do and there is a risk that it is more complicated to do this in generated code than hand-written bytecode interpreter code. Have to see.
@javierguerragiraldez I made a really quick attempt at the first babystep of adding a HOTC bytecode that is emitted before loops and executes as a NOP. This causes segfaults... any idea why? :-)
You can also place function hotcounts in GCproto directly, instead of adding a HOTC at the beginning of every function. This way the call count will be easier to retrieve inside C part.
@iehrlich This sounds like a neat idea. If we put a counter into the prototype object (i.e. the representation of the bytecode for one function) then we could use that to precisely count the function-entry hot counters for bytecodes (FUNCF and FUNCV) instead of using the hashtable.
Then we would be precisely counting the hotness of both function heads and side-traces (as mentioned by @javierguerragiraldez.) However we are still using the probabilistic hashing approach to loops. Can the scheme be extended to cover those too?
However we are still using the probabilistic hashing approach to loops. Can the scheme be extended to cover those too?
Oh, sorry, I thought we were already past this point :) You can abandon the hotcount cache instantly once you've introduced the HOTC bytecodes. In such a bytecode, only 8 bits are occupied by the opcode, so you still have up to 24 bits to store the hotcount, right on top of the loop nest or function header. You'll have a 1:1 match between counters and counted events in this scenario.
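A bit-packing sketch of that layout (the opcode number and field positions are made up for illustration, assuming a 32-bit instruction word):

```python
OP_HOTC = 0x58  # hypothetical opcode number

def make_hotc(count):
    # 8-bit opcode in the low byte, 24-bit hotcount above it.
    assert 0 <= count < (1 << 24)
    return OP_HOTC | (count << 8)

def hotc_decrement(ins, by):
    # Returns (updated instruction, should a trace start?).
    count = max((ins >> 8) - by, 0)
    return OP_HOTC | (count << 8), count == 0

ins = make_hotc(112)          # e.g. loop threshold 56 at -2 per pass
hot = False
for _ in range(56):
    ins, hot = hotc_decrement(ins, 2)
print(hot)  # True: the loop became hot on the 56th pass
```

With 24 bits per counter there is no sharing at all, and the wraparound and collision issues from the hash table simply disappear.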