
Comments (12)

javierguerragiraldez commented on May 26, 2024

Yes, that sounds reasonable. I don't think it would mean removing much code; lib_jit.c isn't that long or complex, but I guess every little bit helps. The main goal of always-on logging does seem a good justification.


lukego commented on May 26, 2024

@javierguerragiraldez I would love your feedback on this feature idea!


javierguerragiraldez commented on May 26, 2024

Note that loom currently has very small runtime overhead in terms of time, but its memory use might be perceptible. That's because the installed hooks only record data; all processing is deferred to the "stop" function.
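For concreteness, here is a minimal sketch of that record-now, process-later pattern on top of the stock jit.attach API (my own sketch, not loom's actual code):

local events = {}

local function on_trace(what, tr, func, pc, otr, oex)
  -- Record raw event data only; no formatting or disassembly here.
  events[#events + 1] = { ev = "trace", what = what, tr = tr, otr = otr, oex = oex }
end

local function on_record(tr, func, pc, depth)
  events[#events + 1] = { ev = "record", tr = tr, pc = pc, depth = depth }
end

local function start()
  jit.attach(on_trace, "trace")
  jit.attach(on_record, "record")
end

local function stop()
  jit.attach(on_trace)    -- calling with no event name detaches the handler
  jit.attach(on_record)
  return events           -- all the expensive analysis happens on this data, afterwards
end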

In principle this could be much lighter by using binary dumps instead of Lua tables from within the interpreter (jit.* hooks are never compiled), but the subsequent processing would be much more complicated without easy access to the functions and the disassemblers (three of them! bytecode, IR, machine code). Well, the instruction formats are well documented (at least for the first two).

Maybe a "fork + read the COW" scheme could avoid having to dump everything to a file. I mean: during "heavy" execution, JIT events would be continuously logged, and from time to time the main process could fork an analyzer, effectively creating a snapshot of the current memory state. The forked process would read the log and use the (snapshotted) memory to flesh it out.
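A rough sketch of that forking scheme via the FFI, assuming a POSIX system (analyze is a placeholder for the log-processing step):

local ffi = require("ffi")
ffi.cdef[[ int fork(void); ]]

local function snapshot_and_analyze(analyze)
  local pid = ffi.C.fork()
  if pid == 0 then
    -- Child: sees a copy-on-write snapshot of the parent's heap as of fork().
    analyze()        -- read the event log, flesh it out from the snapshot
    os.exit(0)
  end
  -- Parent (pid > 0): returns immediately and keeps logging at full speed.
  return pid
end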

What events would be logged? LuaJIT currently reports: trace start/abort/end, each bytecode encountered during tracing, trace exits (until compiled), and each function compiled to bytecode. These should happen only during warmup.


lukego commented on May 26, 2024

@javierguerragiraldez Thanks for explaining that. The backend you are using for LOOM does seem very pragmatic. You keep the implementation lean by reusing all the existing code; deferring analysis to stop() should work well for test/benchmark suites; and forking to analyse may generalize to more environments too.

The hard requirement for me is that I want to always be able to analyze the traces for support problems on production systems, and even for problems that happened in the past. So I need the logging to be "always on" and so efficient that nobody is tempted to turn it off. I think that people will be reluctant to enable a feature based on fork() and copy-on-write heap analysis in the field. Just seems like a heavy-weight mechanism that can potentially cause performance problems e.g. for Snabb where you are always racing to keep the ingress queue of a 10G/100G NIC from overflowing.

So I do think that a new lightweight backend is needed for this use case. And once we switch to a backend based on offline inspection instead of online introspection then the downside is that we need to write new tool code (bummer) but the upside is that it can replace code in RaptorJIT. I see this as a net win: RaptorJIT is prohibitively complex and hard to maintain -- if some complexity can be moved to another component that is not linked at runtime then that is a win.

I also see the potential to reuse more functionality with external tools. RaptorJIT is very conservative with dependencies - everything it does, it does directly. However, tooling could be more liberal, even using Docker (or Nix) to include off-the-shelf components. So, for example, I do not for one second imagine writing a new x86-64 disassembler - I will simply call out to objdump. Likewise, if I want to bootstrap a confidence interval for my profiler report then I will call out to R or Torch, and if I want to visualize the IR tree and inspect the dependency chains I will use a toolkit like D3 or Roassal, and so on.
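For example, a first cut at that objdump call could be as simple as this (assuming the log reader has already written a trace's machine code to a raw file):

-- Disassemble raw x86-64 machine code that was dumped to a file,
-- relocated to the address where it originally ran.
local function disassemble(path, start_addr)
  local cmd = string.format(
    "objdump -D -b binary -m i386:x86-64 --adjust-vma=0x%x %s",
    start_addr, path)
  local f = assert(io.popen(cmd))
  local out = f:read("*a")
  f:close()
  return out
end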

So overall I feel that this approach will make JIT logging & profiling pervasive in production, it will simplify RaptorJIT by obsoleting the disassemblers and the basic analyzers and the introspection API, and it will liberate the tools from the constraints of the runtime application environment ("go crazy.")

I also see the inspection vs introspection question as deeply philosophical. If you want to have sophisticated tooling you can either make it separate - like gdb and Intel VTune etc - or you can make it built-in - like Lisp or Smalltalk live coding environments. I love the introspective live environments, but I don't think that this is the right operational approach for production server applications.

Does that make any sense to you?


lukego commented on May 26, 2024

> What events would be logged?

Good question!

Start with the same events that LuaJIT logs now. Just having this available in production would be a massive win. Then I could "LOOM" logs from applications that do not embed LOOM.

But I can imagine adding more...

  • Per-trace profiler counts (separate buckets for head vs loop vs FFI vs GC); see the sketch after this list. This would be so helpful for presenting logs with hundreds of mostly-irrelevant traces.
  • Per-trace logs of inscrutable optimizer decisions e.g. which upvalues were specialized on value due to low instance count of the prototype.
  • Exact duration in nanoseconds of each JIT and GC operation.

... More?
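For the first item: here is a sketch of what the stock jit.profile API can already bucket by VM state (N=compiled, I=interpreted, C=C code, G=GC, J=JIT compiler). Note that the sampler callback does not say which trace was running; that per-trace part is exactly what needs VM support:

local profile = require("jit.profile")
local buckets = {}

profile.start("f", function(thread, samples, vmstate)
  -- vmstate is a one-letter code: N/I/C/G/J as above.
  buckets[vmstate] = (buckets[vmstate] or 0) + samples
end)

-- ... run the workload ...

profile.stop()
for state, count in pairs(buckets) do print(state, count) end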


fsfod commented on May 26, 2024

You also want to save the vmdef.lua file (or all the data in it) in the log; otherwise an update to LuaJIT that adds something like a new IR opcode will make any old logs, or logs from other people, worthless.

> Exact duration in nanoseconds of each JIT and GC operation.

My gcperflog system, which I extracted from my experimental implementation of the new GC, could provide GC time stats and maybe counters. I also have a basic JIT time-tracking system in some other code that could be of use.

This is some of what I've been saving in my own trace analyser:

public class TraceEntry
{
  // Where the trace recording started and stopped (function id + bytecode PC).
  public int StartFunctionId;
  public int StartPC;
  public int StopFunctionId;
  public int StopPC;

  // Identity of the trace and its place in the trace tree.
  public ushort TraceNumber;
  public ushort ParentId;
  public ushort ParentExit;
  public ushort RootTraceId;

  // IR and snapshot statistics.
  public ushort IRInstCount;
  public ushort IRConstantsCount;
  public ushort IROffsetCount;
  public ushort NumberofSnapShots;
  public ushort SnapshotMapSize;
  public ushort Unused;           // presumably padding

  public int FunctionChangeCount;
  public int BytecodeCount;
  public int TraceDataOffset;

  // How the trace ended: its link type, or the abort reason.
  public TraceLink LinkType;
  public byte AbortErrorCode;
  public ushort LinkTrace;
  public uint ErrorInfo;

  // Generated machine code and timing.
  public int MachineCodeSize;
  public int unused;              // presumably padding
  public ulong MachineCodeAddress;
  public long StartTime;
  public long TraceTime;
}

public struct FunctionChange
{
  public int PCIndex;
  public int FunctionIndex;
  public int Depth;
}

public struct TracedBC
{
  public ushort PC;
  public ushort IRInstrCount;
}



lukego commented on May 26, 2024

@fsfod Thanks for sharing your work! I like the way you have defined a clear ABI for your low-level trace logs. I am thinking about how this approach could fit into my current mental framework.

Just to recap my plan is to:

  • Minimize code, overhead, and churn in the VM code.
  • Support diverse analysis tools (including new tools with old data.)
  • Let the tools absorb the complexity that is removed from the VM.

So I am imagining the VM logging raw native data structures (e.g. GCtrace) because a separate ABI would add code and overhead to the VM. However, this moves the problems to the tools, which will either need to use C headers or DWARF information to decode the logs. One approach could be to create a tool that translates from the native format to an ABI like yours: this could then be maintained outside the VM and support tools that don't want to see the native structures.
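To sketch that translator idea (every name and record layout below is hypothetical; a real tool would take the layout from the C headers or DWARF of the exact VM build that wrote the log):

local ffi = require("ffi")
ffi.cdef[[
typedef struct {   /* hypothetical native log record header */
  uint32_t type;   /* event kind */
  uint32_t size;   /* number of payload bytes that follow */
} log_rec;
]]

-- Iterate over the raw records in a log that was read into a string.
local function records(data)
  local base = ffi.cast("const char *", data)
  local len, off, hdr = #data, 0, ffi.sizeof("log_rec")
  return function()
    if off + hdr > len then return nil end
    local rec = ffi.cast("const log_rec *", base + off)
    local payload = base + off + hdr
    off = off + hdr + rec.size
    return rec.type, payload, rec.size
  end
end

-- A translator would loop over records(data) and re-emit each record
-- in a stable public ABI such as the one @fsfod describes above.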

One missing piece of the puzzle is how the tools get access to the native format. We could either embed the definition directly in the executable (e.g. DWARF or vmdef), or embed some ID for finding the sources (e.g. a version number, git revision, or sha256 hash of the sources), or both. It could be that other projects have come up with clever solutions to this problem.

> You also want to save the vmdef.lua file (or all the data in it) in the log; otherwise an update to LuaJIT that adds something like a new IR opcode will make any old logs, or logs from other people, worthless.

Do you think DWARF can solve this? Here is an excerpt from the debug information as pretty-printed by objdump:

 <1><2272>: Abbrev Number: 41 (DW_TAG_enumeration_type)
    <2273>   DW_AT_byte_size   : 4
    <2274>   DW_AT_type        : <0x70>
    <2278>   DW_AT_decl_file   : 23
    <2279>   DW_AT_decl_line   : 152
    <227a>   DW_AT_sibling     : <0x2410>
 <2><227e>: Abbrev Number: 34 (DW_TAG_enumerator)
    <227f>   DW_AT_name        : (indexed string: 0x742): IR_LT
    <2281>   DW_AT_const_value : 0
 <2><2282>: Abbrev Number: 34 (DW_TAG_enumerator)
    <2283>   DW_AT_name        : (indexed string: 0x782): IR_GE
    <2285>   DW_AT_const_value : 1
 <2><2286>: Abbrev Number: 34 (DW_TAG_enumerator)
    <2287>   DW_AT_name        : (indexed string: 0x735): IR_LE
    <2289>   DW_AT_const_value : 2
...

which corresponds to enum { IR_LT = 0, IR_GE = 1, IR_LE = 2, ... }. So I imagine a tool that knows the names and meanings of all the fields in the native format but can dynamically resolve the value/size/endian at runtime. This could generalize to quite sophisticated tools e.g. an analyzer running on x86-64 directly examining a core dump from an ARM64 VM. (RaptorJIT only cares about x86-64 but the separate analysis tools may want to support other LuaJIT forks too.)
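As a sketch, a tool could scrape those enumerators straight out of objdump's output; a serious implementation would use a real DWARF library, but the idea is this:

-- Recover enum name/value pairs from a binary's DWARF info, so the
-- analyzer can decode IR opcodes for the exact VM build behind a log.
local function dwarf_enumerators(binary)
  local f = assert(io.popen("objdump --dwarf=info " .. binary))
  local enums, name, in_enum = {}, nil, false
  for line in f:lines() do
    if line:find("DW_TAG_enumerator", 1, true) then
      in_enum = true
    elseif line:find("DW_TAG_", 1, true) then
      in_enum = false
    elseif in_enum then
      name = line:match("DW_AT_name.*:%s*([%w_]+)%s*$") or name
      local v = line:match("DW_AT_const_value%s*:%s*(%-?%d+)")
      if v and name then enums[name] = tonumber(v); name = nil end
    end
  end
  f:close()
  return enums
end

local ir = dwarf_enumerators("raptorjit")
print(ir.IR_LT, ir.IR_GE, ir.IR_LE)   --> 0  1  2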


lukego commented on May 26, 2024

... Incidentally, I wonder whether the jit.util introspection interface could be built on top of the @fsfod ABI? In that case tools like LOOM and jit.v and jit.dump could be used offline.


lukego commented on May 26, 2024

So @fsfod @javierguerragiraldez I wonder if this would be a reasonable position for RaptorJIT to take:

  • Remove jit.attach() and jit.util() completely.
  • Add extensive binary logging in the native format.
  • Ensure the log contains everything needed to rebuild jit.attach() and jit.util().

This keeps RaptorJIT lean and mean but still allows for adding a compatibility layer to run the unmodified jit.v, jit.dump, LOOM, etc. Those tools could either run online (examining the log for the running process) or offline (examining an archived log.) I think the same API would work for both: just replay the jit.attach() event callbacks and make sure the jit.util introspection respects the logical ordering of events (e.g. whether bytecode patches and JIT flushes are supposed to have happened yet.)
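To make that concrete, the compatibility layer could be roughly this shape (all names hypothetical, and the log decoding itself is elided):

local handlers = { trace = {}, record = {}, texit = {}, bc = {} }

-- Stand-in for jit.attach(): register a callback for an event class.
local function attach(cb, event)
  handlers[event][cb] = true
end

-- Replay a decoded event list through the registered callbacks, in order.
local function replay(log)
  for _, e in ipairs(log) do
    for cb in pairs(handlers[e.event]) do
      if e.event == "trace" then
        cb(e.what, e.tr, e.func, e.pc, e.otr, e.oex)
      elseif e.event == "record" then
        cb(e.tr, e.func, e.pc, e.depth)
      end
      -- ... and likewise for "texit" and "bc" events.
    end
  end
end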

Have to see whether such a compatibility library would actually be implemented. Could alternatively be that the native format or the @fsfod format or another new format gets traction with tools and nobody needs the legacy API anymore. Have to see what people will care about enough to hack on over time. (I'm planning to use the native format directly with low-level GDB-like tooling.)


lukego commented on May 26, 2024

@javierguerragiraldez I am curious actually about how much "re-enter Lua from VM" code can be removed. Everything? I see it used mostly for the JIT events and for the profiler. (Aside: I totally don't understand why the profiler is designed as it is.) So maybe if RaptorJIT removes that then we can completely drop the whole mechanism of re-entry to Lua? That could save some code/complexity. However, maybe something will spoil the party, e.g. GC finalizers...


javierguerragiraldez commented on May 26, 2024

GC finalizers shouldn't be a problem: it's (mostly) the same code as in Lua 5.1, and the call to the finalizer is done with the interpreter in a sane state. From the point of view of the interpreter, it's just a callback from whatever operation triggered the collection. It doesn't happen at JIT-critical points, and compiled traces have explicit calls to the collector checks.

gc_call_finalizer() (in lj_gc.c) does a simple lj_vm_pcall(), and it does have the "usual" L variable in context.
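This is easy to observe from plain Lua 5.1 / LuaJIT, where newproxy gives you a __gc-able userdata:

local p = newproxy(true)             -- userdata with a settable metatable
getmetatable(p).__gc = function()
  print("finalizer ran from an ordinary, sane interpreter state")
end
p = nil
collectgarbage("collect")            -- finalizer fires as a plain protected call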


javierguerragiraldez commented on May 26, 2024

BTW, both the JIT events and the profiler run in a preallocated Lua thread (coroutine?), which simplifies the separation of states. Of course, the global objects (G, J, anything else?) are shared and must be consistent before re-entering.

