So! I am curious to better understand the current hashing approach before removing it. How good is it? How much could it be improved by increasing the hash size? Here is a first attempt at an analytical answer.
Suppose you have N hash buckets and H traces that are already hot. Suppose you now execute a bytecode B that is not supposed to be hot and will screw up your performance if it starts a trace. What is the probability that bytecode B will be traced due to a collision with one of the hot traces H in one of the N hash buckets?
It seems to me that this is equivalent to the "birthday problem", and more specifically the "same birthday as you" problem. In that case we can calculate the probability of screwing up performance by tracing bytecode B as 1-((N-1)/N)^H.
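That number is easy to sanity-check directly. The following is just the uniform-hash model above expressed in code, not anything from the actual VM:

```python
# Probability that bytecode B shares a bucket with at least one of H
# hot traces, assuming a perfectly uniform hash over N buckets
# (the "same birthday as you" model).
def p_collision(n_buckets, n_hot):
    return 1.0 - ((n_buckets - 1) / n_buckets) ** n_hot

print(p_collision(64, 10))   # roughly 0.146, i.e. about one chance in seven
```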
Here is how that looks on a graph (warning: log scales):
So what can we say about this?
Suppose that we were willing to accept a 1% chance of randomly screwing up performance. How many hot traces can we have when we execute the sensitive bytecode B that is not supposed to start a trace? Then 64 buckets are too few, 256 buckets are okay with up to 2 hot traces, 1024 buckets with up to 10, and so on up to 8192 buckets, which are okay with around 80 hot traces.
Suppose alternatively that we are very concerned about consistency and we are not willing to take on more than a 0.01% chance of degrading performance due to a random collision. Then the picture is bleak: we would need the 8192 hash buckets, and even that would not be sufficient, since a single hot trace already gives a collision probability of 1/8192, about 0.012%. So hashing does not seem like a suitable approach in this context.
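Solving the same formula for the largest tolerable H makes these thresholds easy to reproduce (again purely the uniform-hash model, not the real implementation):

```python
import math

# Largest H such that 1 - ((N-1)/N)**H stays within a probability
# budget, i.e. H_max = floor(log(1 - budget) / log((N-1)/N)).
def max_hot_traces(n_buckets, budget):
    return math.floor(math.log(1.0 - budget) /
                      math.log((n_buckets - 1) / n_buckets))

for n in (64, 256, 1024, 8192):
    print(n, max_hot_traces(n, 0.01), max_hot_traces(n, 0.0001))
```

At a 1% budget this reproduces the 2, 10, and roughly 80 hot traces for 256, 1024, and 8192 buckets; at a 0.01% budget even 8192 buckets allow none.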
RaptorJIT needs to support the latter kind of applications. For example, the Snabb CI executes around 100,000 benchmarks per month and we need to be able to resolve any issues that cause outlier results; extra noise from non-deterministic tracing would mask other problems. Likewise Snabb users need to be able to depend on consistent performance: it would not be okay if, for every 100 routers you deploy, one of them randomly performs badly. (This uncertainty would also lead to a "have you tried turning it off and on again?" approach to problems, which always makes support complicated.)
Add to this that the real-world behavior is probably worse than the theoretical case that I have sketched above, e.g. I assumed that the hash function would perfectly distribute bytecodes between buckets whereas the real implementation is a simple shift-and-mask.
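To illustrate why shift-and-mask falls short of uniform, here is a minimal model (the table size and shift amount are assumptions for illustration; the real interpreter derives the bucket from the bytecode's address):

```python
HOTCOUNT_BUCKETS = 64  # assumed table size, a power of two

def bucket_for_pc(pc):
    # Drop the low alignment bits (bytecode instructions are 4 bytes
    # wide), then mask down to the table size. The bucket is just the
    # address modulo 256 bytes, so regularly spaced code collides.
    return (pc >> 2) & (HOTCOUNT_BUCKETS - 1)

# Two loop headers exactly 256 bytes apart always share a bucket:
print(bucket_for_pc(0x1000), bucket_for_pc(0x1100))  # 0 0
```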
Conclusion: I have not yet been able to convince myself that the hashing approach is valid for RaptorJIT given our requirement of being the ideal compiler for soft-realtime applications.
... Caveat: Could be that I have bungled the analysis completely! Great if somebody wants to check my work :-)
from raptorjit.
a few data points:
- each trace exit has its own (non-hashed) hotcount. after 10 exits a side trace is generated. doesn't use or touch the global hotcount hash.
- not sure how you're defining "hot bytecodes" in your statistical thought experiment. but remember that only loops and functions actually touch the hotcounters. that means a lower density of "collisionable" bytecodes.
- i kinda like the idea of a HOTCOUNT bytecode. some (contradicting) ideas about taking it further:
- do the decrementing (and trace start) in the instruction itself, remove the .hotloop/.hotcall macros from loop/function instructions.
- use the second argument to indicate how much to decrement it. maybe the parser could get some hints about loop priority? currently loops are hardcoded to -=2, while functions are -=1. or maybe a profiler could tweak them?
- make this the patched instruction on trace compile (instead of the loop instructions). helps on the "detach from functions" idea; may simplify some hardcoded checks in the interpreter and the tracer, which enumerate instructions and their Jxxx/Ixxx variations. if the only "base/Jxxx/Ixxx" instruction is HOTCOUNT, we would know that all root traces start with it. again, simpler conditions on many parts.
- while decrementing is just decrementing, penalizing is doubling (plus a small random), and blacklisting comes at 60,000. to keep the same or similar heuristics you need 16 bits, so better to use the "D" argument.
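For reference, a sketch of the penalty arithmetic in that last point (the starting value of 36 and the size of the random term are assumptions in the spirit of the stock heuristics, not the real code):

```python
import random

PENALTY_MAX = 60000  # blacklisting threshold mentioned above

def penalty_steps(start=36, seed=0):
    # Each aborted trace doubles the penalty plus a small random,
    # so only about ten aborts happen before blacklisting kicks in.
    rng = random.Random(seed)
    p, steps = start, []
    while p < PENALTY_MAX:
        steps.append(p)
        p = p * 2 + rng.randrange(16)
    return steps

print(penalty_steps())
```

The later values run well past 255, which is why 8 bits cannot carry the penalty state and the 16-bit "D" argument is the natural home for it.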
each trace exit has its own (non-hashed) hotcount. after 10 exits a side trace is generated. doesn't use or touch the global hotcount hash.
Nice. So side-trace exits are already using individual buckets like the scheme we are considering.
not sure how you're defining "hot bytecodes" in your statistical thought experiment.
It's a bit fuzzy. I think one major problem with my formulation is that I assume the hash buckets get "hot" and then stay hot i.e. that the hotcount latches at zero. However, looking at the code the counter actually wraps back to 255. So it looks like in reality the effect of hash collisions will be to randomize the hotcount as each bytecode is traced when the counter passes zero on its way back to 255.
I am not sure what the implications are exactly. Could be that this scheme works better because even with a random initial count the loops will tend to be traced first if they are executing more often and decrementing the count by 2. However I still don't see how to construct an argument that this mechanism will provide robust and predictable behavior.
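One way to see the wrap effect concretely is a toy model of a single shared 8-bit bucket (the reset value of 56 and the reset-on-trace behaviour are assumptions for illustration):

```python
HOTCOUNT_START = 56  # assumed reset value; loops decrement by 2

def traced_sites(schedule, start=HOTCOUNT_START):
    # Each event is (site, decrement). The counter wraps modulo 256
    # instead of latching at zero; a site is "traced" only when its
    # decrement lands exactly on zero.
    count, traced = start, []
    for site, dec in schedule:
        count = (count - dec) & 0xFF  # wrap 0 -> 255
        if count == 0:
            traced.append(site)
            count = start  # assume the counter resets when tracing starts
    return traced

# A pure loop reaches zero after 28 decrements and gets traced:
print(traced_sites([("loop", 2)] * 28))                  # ['loop']
# A single colliding call flips the parity: the loop now steps over
# zero and the counter wraps to 255, so nothing gets traced at all.
print(traced_sites([("call", 1)] + [("loop", 2)] * 40))  # []
```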
but remember that only loops and functions actually touch the hotcounters. that means a lower density of "collisionable" bytecodes.
Right. However, a quick grep of the Snabb sources suggests that we have around 10,000 functions and loops. These are all being hashed onto 64 counters. This seems like madness to me. Seems like one would expect everything to be colliding with everything else, perhaps like when running the benchmark suite with only 8 hash buckets (since the benchmark programs are so tiny and have so few bytecodes that will bump the hotcount).
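The back-of-envelope numbers support that, even under the generous assumption of a perfectly uniform hash:

```python
# Expected number of sites that do NOT share their bucket with any
# other site, assuming `sites` sites uniformly hashed into `buckets`:
# each site is alone with probability ((buckets-1)/buckets)^(sites-1).
def expected_alone(sites, buckets):
    return sites * ((buckets - 1) / buckets) ** (sites - 1)

print(expected_alone(10_000, 64))    # astronomically close to zero
print(expected_alone(10_000, 8192))  # still only about 2950 of 10,000
```

At 64 buckets essentially every one of the 10,000 sites shares its counter with hundreds of others; even at 8192 buckets a majority still collide.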
do the decrementing (and trace start) in the instruction itself, remove the .hotloop/.hotcall macros from loop/function instructions.
I like this idea!
while decrementing is just decrementing, penalizing is doubling (plus a small random), and blacklisting comes at 60,000. to keep the same or similar heuristics you need 16 bits, so better to use the "D" argument.
I like this idea too. I would actually like to eliminate blacklisting completely and replace it with some suitable backoff mechanism (maybe in the spirit of TCP RTO.)
The risk I see with blacklisting is that it will occur on some very obscure code path of an application, e.g. a weird combination of configurations options or a DoS-like workload, and while I don't want to waste time constantly retracing during this period I do want a new recording to be made when the situation changes. Otherwise the blacklist is forever and maybe the server is still running code in the interpreter after years because of something funny that happened for a short period of time.
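A sketch of what such a backoff could look like (a hypothetical design in the spirit of TCP RTO, not existing RaptorJIT code; the constants are placeholders):

```python
class TraceBackoff:
    """Retry tracing with exponential backoff instead of a permanent
    blacklist: failures push the next attempt further out, but the
    site always gets another chance eventually."""

    def __init__(self, base=1_000, cap=10_000_000):
        self.base, self.cap = base, cap
        self.wait = base  # executions to wait before the next attempt

    def on_abort(self):
        self.wait = min(self.wait * 2, self.cap)  # back off, capped

    def on_success(self):
        self.wait = self.base  # behaviour changed: recover quickly

b = TraceBackoff()
for _ in range(5):
    b.on_abort()
print(b.wait)  # 32000
b.on_success()
print(b.wait)  # 1000
```

The cap plays the role of the 60,000 blacklist threshold, except that instead of giving up forever the site is merely retried very rarely.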
make this the patched instruction on trace compile (instead of the loop instructions)
This is an interesting aspect. I suppose this means the HOTC bytecode would need to come before the loop so that you branch into JIT code without the interpreter doing the setup?
It does sound like an attractive simplification if we would have two new bytecodes: HOTC for counting, patched/replaced with JTRACE to branch to mcode once the trace is recorded. Could we then retire the many special-case bytecodes for transitioning to mcode (JFORI JFORL JITERL JLOOP JFUNCF JFUNCV) and the many special-case bytecodes for marking blacklisted code (IFORL IITERL ILOOP IFUNCF IFUNCV)? If that would pan out then we have really reduced the bytecode list and can retire some really annoyingly clever code and the bytecode-numbering invariants required to make that work.
Or maybe it is not that simple... I suppose at least when we record the trace it would need to start with mcode equivalent to what the original bytecodes do and there is a risk that it is more complicated to do this in generated code than hand-written bytecode interpreter code. Have to see.
@javierguerragiraldez I made a really quick attempt at the first babystep of adding a HOTC bytecode that is emitted before loops and executes as a NOP. This causes segfaults... any idea why? :-)
You can also place function hotcounts in GCproto directly, instead of adding a HOTC at the beginning of every function. This way the call count will be easier to retrieve inside C part.
@iehrlich This sounds like a neat idea. If we put a counter into the prototype object (i.e. the representation of the bytecode for one function) then we could use that to precisely count the function-entry hot counters for bytecodes (FUNCF and FUNCV) instead of using the hashtable.
Then we would be precisely counting the hotness of both function heads and side-traces (as mentioned by @javierguerragiraldez.) However we are still using the probabilistic hashing approach to loops. Can the scheme be extended to cover those too?
However we are still using the probabilistic hashing approach to loops. Can the scheme be extended to cover those too?
Oh, sorry, I thought we were already past this point :) You can abandon the hotcount cache instantly once you've introduced the HOTC bytecodes. In such a bytecode, only 8 bits are occupied by the opcode, so you still have up to 24 bits to store the hotcount, right on top of the loop nest or function header. You'll have a 1:1 match between counters and counted events in this scenario.
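A bit-packing sketch of that layout (the opcode number and field positions are made up for illustration, assuming a 32-bit instruction word):

```python
OP_HOTC = 0x58  # hypothetical opcode number

def make_hotc(count):
    # 8-bit opcode in the low byte, 24-bit hotcount above it.
    assert 0 <= count < (1 << 24)
    return OP_HOTC | (count << 8)

def hotc_decrement(ins, by):
    # Returns (updated instruction, should a trace start?).
    count = max((ins >> 8) - by, 0)
    return OP_HOTC | (count << 8), count == 0

ins = make_hotc(112)          # e.g. loop threshold 56 at -2 per pass
hot = False
for _ in range(56):
    ins, hot = hotc_decrement(ins, 2)
print(hot)  # True: the loop became hot on the 56th pass
```

With 24 bits per counter there is no sharing at all, and the wraparound and collision issues from the hash table simply disappear.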