
tlparse's Introduction

tlparse: Parse structured PT2 logs

tlparse parses structured torch trace logs and outputs HTML files that analyze the data.

Quick start: Run PT2 with the TORCH_TRACE environment variable set:

TORCH_TRACE=/tmp/my_traced_log python example.py

Feed input into tlparse:

tlparse /tmp/my_traced_log -o tl_out/

Adding custom parsers

You can extend tlparse with custom parsers, which take existing structured log data and output arbitrary files. To do so, first implement the StructuredLogParser trait for your own type:

pub struct MyCustomParser;
impl StructuredLogParser for MyCustomParser {
    fn name(&self) -> &'static str {
        "my_custom_parser"
    }
    fn get_metadata<'e>(&self, e: &'e Envelope) -> Option<Metadata<'e>> {
        // Get required metadata from the Envelope.
        // You'll need to update Envelope with your custom Metadata if you need new types here.
        todo!()
    }

    fn parse<'e>(
        &self,
        lineno: usize,
        metadata: Metadata<'e>,
        _rank: Option<u32>,
        compile_id: &Option<CompileId>,
        payload: &str,
    ) -> anyhow::Result<ParserResult> {
        // Use the metadata and payload however you'd like.
        // Return either a ParserOutput::File(filename, payload) or a ParserOutput::Link(name, url).
        todo!()
    }
}
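For illustration, here is a hypothetical parse implementation that simply writes each payload to its own output file. This is only a sketch: it assumes ParserResult is a collection of ParserOutput values and that ParserOutput::File takes a path-like filename plus the file contents, which may not exactly match the real tlparse API.

impl StructuredLogParser for MyCustomParser {
    // ... name() and get_metadata() as above ...

    fn parse<'e>(
        &self,
        lineno: usize,
        _metadata: Metadata<'e>,
        _rank: Option<u32>,
        _compile_id: &Option<CompileId>,
        payload: &str,
    ) -> anyhow::Result<ParserResult> {
        // Write the raw payload to a per-line output file so it shows up as
        // an artifact in the generated report. (Assumed API: ParserResult is
        // a Vec of ParserOutput and File takes (filename, contents).)
        let filename = format!("my_custom_parser/line_{}.txt", lineno);
        Ok(vec![ParserOutput::File(filename.into(), payload.to_string())])
    }
}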


tlparse's Issues

Improve stack trie suffix pruning

I'm still regularly seeing stack tries that look like this:

torch/nn/modules/module.py:1566 in _wrapped_call_impl
torch/nn/modules/module.py:1575 in _call_impl
torch/_dynamo/eval_frame.py:433 in _fn
torch/nn/modules/module.py:1566 in _wrapped_call_impl
torch/nn/modules/module.py:1575 in _call_impl
torch/_dynamo/convert_frame.py:1116 in __call__
[[2/0]](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp0umcnJ/index.html#[2/0]) [[2/1]](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp0umcnJ/index.html#[2/1]) torch/_dynamo/convert_frame.py:472 in __call__

The torch/_dynamo/convert_frame.py:1116 in __call__ frame definitely needs to be killed. But there's also some funny business with the _wrapped_call_impl indirection that is also unnecessary 🤔
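One possible approach, sketched below with made-up types, is to strip known wrapper frames against a denylist before inserting a stack into the trie. The Frame struct and the exact denylist are assumptions for illustration, not tlparse's actual data model.

// Hypothetical frame type; tlparse's real representation may differ.
#[derive(Clone, Debug)]
struct Frame {
    filename: String,
    line: u32,
    name: String,
}

// Wrapper frames that add no information to the stack trie.
const DENYLIST: &[(&str, &str)] = &[
    ("torch/nn/modules/module.py", "_wrapped_call_impl"),
    ("torch/nn/modules/module.py", "_call_impl"),
    ("torch/_dynamo/eval_frame.py", "_fn"),
    ("torch/_dynamo/convert_frame.py", "__call__"),
];

// Drop denylisted frames from a stack before it goes into the trie.
fn prune_wrapper_frames(stack: Vec<Frame>) -> Vec<Frame> {
    stack
        .into_iter()
        .filter(|f| {
            !DENYLIST
                .iter()
                .any(|(file, func)| f.filename.ends_with(file) && f.name == *func)
        })
        .collect()
}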

tlparse --latest

If I repeatedly TORCH_TRACE into a single directory, I'll accumulate lots of log files across runs. It would be convenient to have a --latest flag that makes tlparse process only the latest log.
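A minimal sketch of the selection logic such a flag could use, with only the standard library: pick the most recently modified file in the trace directory and feed just that one to the existing parsing path. The flag and the function below don't exist yet; they are illustrative only.

use std::fs;
use std::io;
use std::path::PathBuf;
use std::time::SystemTime;

// Return the most recently modified regular file in `dir`, if any.
fn latest_log(dir: &str) -> io::Result<Option<PathBuf>> {
    let mut newest: Option<(SystemTime, PathBuf)> = None;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if !meta.is_file() {
            continue;
        }
        let modified = meta.modified()?;
        let is_newer = match &newest {
            Some((best, _)) => modified > *best,
            None => true,
        };
        if is_newer {
            newest = Some((modified, entry.path()));
        }
    }
    Ok(newest.map(|(_, path)| path))
}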

UX polish wrt id fragments

It should be possible to click on something like [33/0] and get a URL with the fragment hash, so you can bookmark the link.

When you navigate to the #[33/0] fragment, the relevant element should be highlighted in yellow.
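Since the reports are static HTML, the highlighting half could be handled by a single CSS :target rule baked into the generated page. A sketch of how the generator might embed it (the constant name is made up):

// Style block the report generator could emit into index.html: the element
// whose id matches the URL fragment (e.g. #[33/0]) gets highlighted.
const FRAGMENT_HIGHLIGHT_CSS: &str = r#"<style>
  :target { background-color: yellow; }
</style>"#;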

Self documentation on reports

Reports should say which MAST job they were generated from, which tlparse command was used to generate them, and which version of tlparse was used.
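The command line and tlparse version are cheap to capture at generation time using only the standard library and Cargo's built-in version variable; the MAST job ID would have to come from the environment or a flag. The helper below is hypothetical:

// Collect self-describing provenance to stamp into the report header.
fn report_provenance() -> String {
    let version = env!("CARGO_PKG_VERSION"); // set by Cargo at build time
    let command: Vec<String> = std::env::args().collect();
    format!(
        "generated by tlparse {} with command: {}",
        version,
        command.join(" ")
    )
}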

Support collapsing nodes in stack trie

Example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/20240321-wei-guo-ads-regression-f543344225-rank0/index.html

You should be able to click on the minus sign to fold the tree up (so you can easily jump to the next sibling node).

We might also want to think about whether we actually want these children:

- <torch_package_0>.caffe2/torch/fb/module_factory/sync_sgd/train_loop_pipeline/memcpy_comm_compute/torchrec/train_step.py:598 in run
- <torch_package_0>.caffe2/torch/fb/module_factory/sync_sgd/train_loop_pipeline/memcpy_comm_compute/torchrec/train_step.py:691 in run

These are technically the same function, so maybe they should be put together; see the sketch below.
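A sketch of one way to put them together: key sibling nodes on (file, function name) and collect the line numbers, so entries that differ only by line collapse into one child. The Frame type is the same made-up one as in the pruning sketch above.

use std::collections::HashMap;

// Merge sibling frames that differ only by line number into a single entry
// carrying all of the observed line numbers.
fn merge_siblings(children: Vec<Frame>) -> Vec<(String, String, Vec<u32>)> {
    let mut merged: HashMap<(String, String), Vec<u32>> = HashMap::new();
    for f in children {
        merged.entry((f.filename, f.name)).or_default().push(f.line);
    }
    merged
        .into_iter()
        .map(|((file, name), lines)| (file, name, lines))
        .collect()
}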

Limited amount of runtime information associated with compiled frames

tlparse is currently a compile-time-only metrics collector. However, there is a small amount of runtime information that I think would be really useful:

  1. How many times a particular compiled frame was hit. In particular, if we recompile multiple times, we might be interested to know how hot each of the recompiles is, which tells you whether there is some warmup / unspec thing going on or legitimate multiple dispatch going on.
  2. How quickly the compiled frames run: a sort of poor man's profiling, but mostly I just want to see the timings from a compiled product's perspective, as opposed to the usual performance trace perspective (see the sketch after this list).
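Assuming runtime events like (compile id, wall time) were logged at all (no such log entry exists today; the record type below is purely hypothetical), the aggregation side is simple:

use std::collections::HashMap;
use std::time::Duration;

// Hypothetical runtime record: which compiled frame ran and for how long.
struct FrameRun {
    compile_id: String,
    wall_time: Duration,
}

// Aggregate hit count and total runtime per compiled frame, covering both
// "how hot is each recompile" and the poor man's profiling use case.
fn aggregate_runs(runs: &[FrameRun]) -> HashMap<&str, (u64, Duration)> {
    let mut stats: HashMap<&str, (u64, Duration)> = HashMap::new();
    for run in runs {
        let entry = stats
            .entry(run.compile_id.as_str())
            .or_insert((0, Duration::ZERO));
        entry.0 += 1;
        entry.1 += run.wall_time;
    }
    stats
}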

Dump information about is_dynamo_compiling queries

When diagnosing why code doesn't work with torch.compile but works without it, a query to is_dynamo_compiling is one way for the problem to turn out to be a userspace problem. It should be obvious when such a query has been hit under torch.compile, so that we can tell whether it is suspicious and needs further investigation.

Render custom information in index.html

Internally, we want to render things like the MAST job ID and other metadata, but we should also allow custom metadata (and artifacts of custom parsers) to be rendered somewhere in index.html when they aren't necessarily associated with a specific compile ID.
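A sketch of what a compile-ID-independent metadata section could look like on the Rust side; the helper and its place in index.html generation are hypothetical:

use std::collections::BTreeMap;

// Render arbitrary key/value metadata (e.g. a job ID) as an HTML table that
// could sit at the top of index.html, independent of any compile ID.
// A real implementation should HTML-escape the keys and values.
fn render_global_metadata(metadata: &BTreeMap<String, String>) -> String {
    let mut html = String::from("<table class=\"global-metadata\">\n");
    for (key, value) in metadata {
        html.push_str(&format!("  <tr><td>{}</td><td>{}</td></tr>\n", key, value));
    }
    html.push_str("</table>\n");
    html
}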
