I have been attempting to compile an existing, pre-trained PyTorch model using neuron-

Problems compiling existing PyTorch model about aws-neuron-sdk HOT 11 CLOSED

duckontheweb commented on July 25, 2024

from aws-neuron-sdk.

Comments (11)

duckontheweb commented on July 25, 2024

I don't know if it's helpful, but if I try to run the command in the Command line ... line of the error logs I get a segmentation fault:

$ /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
/home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
INFO:Finished Reading HHIR Json file...
INFO:Started Construction of IRS Graph...
INFO:Finished Construction of IRS Graph...
INFO:Reading Tensor Map Json file...
INFO:Finished Reading Tensor Map Json file...
INFO:Started Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Finished Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Started Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO (IRSInterface::Total Ops) :0
INFO (IRSInterface::CountDanglingNodes) :
	Number of dangling nodes = 0
	Memory Usage of dangling nodes = 0 Bytes
INFO (IRSInterface::CountElemwiseOp) :
	Number of elemwise ops = 0
INFO (IRSInterface::Average Fanouts of ElemwiseOp) :-nan
INFO (IRSInterface::Average Number of ElemwiseOp Consumer) :-nan
INFO (IRSInterface::CountSingleConsumerElemwiseOp) :
	Number of single consumer elemwise ops = 0
INFO (IRSInterface::Num TTs with TT Srcs): 0
INFO (IRSInterface::Num TTs with MM Srcs): 0
INFO (IRSInterface::Num TTs with TT AND MM Srcs): 0
INFO (IRSInterface::Num TTs with MM AND MM Srcs): 0
INFO (IRSInterface::Average Partition Usage of MM) : -nan
INFO:Finished Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Started ComputeProximity...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Finished ComputeProximity...Thu Feb 27 23:14:55 2020
Initializing RT...
INFO (Tensor Init): Adding Data Dependency from the init lists of DNs
INFO:Starting Scheduling...Thu Feb 27 23:14:55 2020
Segmentation fault (core dumped)

aws-taylor commented on July 25, 2024

Hello duckontheweb,

I'm sorry for the inconvenience; we've opened a ticket internally to track this issue. Before we can do much, we'll need more information. If you're able to share your model, that's the fastest way for us to be able to reproduce the issue. If you have sensitive IP, consider opening an AWS support ticket and sharing there.

In the meantime, I'd suggest you configure your system to dump core files and look for hints in the resulting stack trace.

ulimit -c unlimited      # Turn on core files for this shell session
# Run the program. A 'core.xxxx' file should be produced.
file core.xxxx           # Check which command created the core file
gdb <command> core.xxxx  # Fire up GDB to open the core file
bt                       # At the (gdb) prompt, look at the stack trace for hints
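
For anyone driving the compiler from a Python script, the same idea can be sketched with the standard library. This is only an illustration under assumptions: the helper names `enable_core_dumps` and `run_and_report` are hypothetical, it is POSIX-only, and whether a core file actually appears still depends on the system's hard limit and core pattern settings.

```python
import resource
import signal
import subprocess
import sys


def enable_core_dumps():
    """Raise the soft core-file limit to the hard limit in the child process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_and_report(cmd):
    """Run cmd; return the name of the signal that killed it, or None."""
    # preexec_fn runs in the child just before exec, so the limit only
    # applies to the launched command, not to this Python process.
    proc = subprocess.run(cmd, preexec_fn=enable_core_dumps)
    if proc.returncode < 0:
        # A negative return code means the child was terminated by a signal.
        name = signal.Signals(-proc.returncode).name
        print(f"{cmd[0]} was killed by {name}", file=sys.stderr)
        return name
    return None
```

Calling `run_and_report` with the failing command as a list of arguments should return `"SIGSEGV"` for a crash like the one above; the core file can then be opened with gdb as described.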

I've found that ABI incompatibilities can often result in segmentation faults like this, especially for binaries embedded in Python wheels. Reinstalling Python dependencies from source can help in these cases.

pip install --force-reinstall --no-binary <dependency> <dependency>

Hopefully this helps a little.

Regards,
Taylor

aws-taylor commented on July 25, 2024

Hello again duckontheweb,

I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.

Regards,
Taylor

duckontheweb commented on July 25, 2024

@aws-taylor Thanks for the quick reply! I'll follow your suggestions for:

Configuring the system to get stack traces
Reinstalling Python dependencies
Upgrading neuron-cc

I'm also going to try re-training our model using the PyTorch version that comes with torch_neuron and compiling all in the same script to see if that helps. I'll let you know what I find.

The model is IP, as you mentioned, so if those steps don't work I'll open up a support ticket and pursue it there. Should I reference this issue in any way in the support ticket?

aws-taylor commented on July 25, 2024

Hello duckontheweb,

>>Should I reference this issue in any way in the support ticket?
Yes, please reference this issue in any support ticket to ensure it is routed correctly.

-Taylor

duckontheweb commented on July 25, 2024

Thanks.

So, for my latest attempt, I tried re-training the model in an environment with all of the neuron libraries installed. The training went fine, but then I get this error when trying to run torch.neuron.trace:

Traceback (most recent call last):
  File "model/train_model.py", line 360, in <module>
    example_inputs=[dummy_image]
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 150, in trace
    transform_torch_graph_to_tensorflow( func, example_inputs, args, kwargs )
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 288, in transform_torch_graph_to_tensorflow
    input_calls_map = get_input_calls_map(jit_trace.graph, example_inputs)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 780, in get_input_calls_map
    func = _resolve_func(node)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 957, in _resolve_func
    assert hasattr(module, func_name), "Neuron compile failed.  Operator {}::{} is not supported".format(mod_name,func_name)
AssertionError: Neuron compile failed.  Operator prim::PythonOp is not supported

Is that an indication that we're just using an unsupported model architecture?

duckontheweb commented on July 25, 2024

>>I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.

I've been installing neuron-cc using:

$ pip install -U pip
$ pip install neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com

Is there a specific newer version that I should try to get for Python 3.6, or is it better to install it via apt?
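
One way to confirm which versions actually ended up in the environment is a short script like the sketch below. It assumes Python 3.8 or newer for `importlib.metadata` (the environment in this thread is Python 3.6, where `pip show` or `pip list` is the equivalent), and the distribution names in `PACKAGES` are assumptions for illustration.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed for illustration; adjust to your environment.
PACKAGES = ["neuron-cc", "torch-neuron", "torch", "torchvision"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output before and after an upgrade shows whether pip really replaced the package.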

aws-taylor commented on July 25, 2024

You may need to pass the --upgrade flag and possibly the --force-reinstall flag since you already have the software installed. Pip should be sufficient.

-Taylor

duckontheweb commented on July 25, 2024

Thanks. I've been installing this on a clean EC2 instance each time, so the flags didn't seem to have any effect.

I was able to make some progress. I realized that I had forgotten the --no-deps option on torchvision the last time around. Re-installing all of that led to a successful compilation of one of the models! For some reason it is not recognizing the CUDA version, though:

import torch
import torch_neuron

torch.__version__
# '1.3.0.1.0.90.0'

torch.version.cuda
# None 

torch.cuda.device_count()
# 0

The default CUDA version for this AMI is 10.0. I'll play around with it and see whether I have a version mismatch or something.

duckontheweb commented on July 25, 2024

So, it looks like I'm able to compile the model as long as I train it using the environment that I've set up for Neuron. This should help us get a little farther along. Thanks for the help!

awsrjh commented on July 25, 2024

Great, thanks for the info! Will close this. Feel free to reopen if you have any more issues.
