Comments (11)
I don't know if it's helpful, but if I run the command from the "Command line ..." line of the error logs directly, I get a segmentation fault:
$ /home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
/home/ubuntu/test_venv/lib/python3.6/site-packages/neuroncc/starfish/bin/list_sch --hhir hh-tr-external-move.json --verbose 0 --sb_size 75 --arith_intensity_target 2300 --sb_watermark_low 0.250000 --sb_watermark_high 0.750000 --sb_size_tol 1 --alloc simple1 --alloc_opt --depth_diff 0.100000 --verbose_start_cycle 0 --tt_dist --mm_meet_cnt 1 --load_speed_factor 0.300000 --schir sch_tmp.json --spill_depth_limit 5 --threshold_consecutive_num_spills_same_keep_vertices 10 --true_dep --mm_order
INFO:Finished Reading HHIR Json file...
INFO:Started Construction of IRS Graph...
INFO:Finished Construction of IRS Graph...
INFO:Reading Tensor Map Json file...
INFO:Finished Reading Tensor Map Json file...
INFO:Started Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Finished Construction of TIR Data... Thu Feb 27 23:14:55 2020
INFO:Started Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO (IRSInterface::Total Ops) :0
INFO (IRSInterface::CountDanglingNodes) :
Number of dangling nodes = 0
Memory Usage of dangling nodes = 0 Bytes
INFO (IRSInterface::CountElemwiseOp) :
Number of elemwise ops = 0
INFO (IRSInterface::Average Fanouts of ElemwiseOp) :-nan
INFO (IRSInterface::Average Number of ElemwiseOp Consumer) :-nan
INFO (IRSInterface::CountSingleConsumerElemwiseOp) :
Number of single consumer elemwise ops = 0
INFO (IRSInterface::Num TTs with TT Srcs): 0
INFO (IRSInterface::Num TTs with MM Srcs): 0
INFO (IRSInterface::Num TTs with TT AND MM Srcs): 0
INFO (IRSInterface::Num TTs with MM AND MM Srcs): 0
INFO (IRSInterface::Average Partition Usage of MM) : -nan
INFO:Finished Construction of TIR Graph... Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Started ComputeDepth...Thu Feb 27 23:14:55 2020
INFO: Finished ComputeDepth...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Started ComputeProximity...Thu Feb 27 23:14:55 2020
INFO (PriorityFunction): Finished ComputeProximity...Thu Feb 27 23:14:55 2020
Initializing RT...
INFO (Tensor Init): Adding Data Dependency from the init lists of DNs
INFO:Starting Scheduling...Thu Feb 27 23:14:55 2020
Segmentation fault (core dumped)
from aws-neuron-sdk.
Hello duckontheweb,
I'm sorry for the inconvenience; we've opened a ticket internally to track this issue. Before we can do much, we'll need more information. If you're able to share your model, that's the fastest way for us to be able to reproduce the issue. If you have sensitive IP, consider opening an AWS support ticket and sharing there.
In the meantime, I'd suggest you configure your system to dump core files and look for hints in the resulting stack trace.
ulimit -c unlimited      # Turn on core files
# Run the program. A 'core.xxxx' file should be produced.
file core.xxxx           # Check which command created the core file.
gdb <command> core.xxxx  # Fire up GDB to open the core file
bt                       # At the (gdb) prompt, look at the stack trace for hints.
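If the compiler binary is being launched from a Python script rather than an interactive shell, the same idea can be sketched in Python. This is a minimal illustration, not part of the Neuron tooling: the helper names are made up, and the demo command is a stand-in that simply sends SIGSEGV to itself. It raises the child's soft core-size limit to the hard limit (the closest equivalent of `ulimit -c unlimited` that cannot fail on a capped system) and reports which signal, if any, killed the child. Where the core file ends up still depends on the system's core_pattern setting.

```python
import resource
import signal
import subprocess
import sys


def _enable_core_dumps():
    """Runs in the child before exec: raise the soft core limit to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_and_report(cmd):
    """Run cmd and return the name of the signal that killed it, or None."""
    proc = subprocess.run(cmd, preexec_fn=_enable_core_dumps)
    if proc.returncode < 0:
        # A negative return code means the child was terminated by a signal.
        return signal.Signals(-proc.returncode).name
    return None


if __name__ == "__main__":
    # Hypothetical stand-in for the failing command: a child that segfaults on purpose.
    demo = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    print(run_and_report(demo))
```

To use it against the real failure, replace the demo list with the full list_sch command line split into arguments.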
I've found that ABI incompatibilities can often result in segmentation faults like this, especially for binaries embedded in Python wheels. Re-installing Python dependencies from source can help in these cases.
pip install --force-reinstall --no-binary <dependency> <dependency>
Hopefully this helps a little.
Regards,
Taylor
from aws-neuron-sdk.
Hello again duckontheweb,
I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.
Regards,
Taylor
from aws-neuron-sdk.
@aws-taylor Thanks for the quick reply! I'll follow your suggestions for:
- Configuring the system to get stack traces
- Reinstalling Python dependencies
- Upgrading neuron-cc
I'm also going to try re-training our model using the PyTorch version that comes with torch_neuron and compiling all in the same script to see if that helps. I'll let you know what I find.
The model is IP, as you mentioned, so if those steps don't work I'll open up a support ticket and pursue it there. Should I reference this issue in any way in the support ticket?
from aws-neuron-sdk.
Hello duckontheweb,
>>Should I reference this issue in any way in the support ticket?
Yes, please reference this issue in any support ticket to ensure it is routed correctly.
-Taylor
from aws-neuron-sdk.
Thanks.
So, for my latest attempt, I tried re-training the model in an environment with all of the neuron libraries installed. The training went fine, but then I get this error when trying to run torch.neuron.trace:
Traceback (most recent call last):
  File "model/train_model.py", line 360, in <module>
    example_inputs=[dummy_image]
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 150, in trace
    transform_torch_graph_to_tensorflow( func, example_inputs, args, kwargs )
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 288, in transform_torch_graph_to_tensorflow
    input_calls_map = get_input_calls_map(jit_trace.graph, example_inputs)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 780, in get_input_calls_map
    func = _resolve_func(node)
  File "/home/ubuntu/env/lib/python3.6/site-packages/torch_neuron/decorators.py", line 957, in _resolve_func
    assert hasattr(module, func_name), "Neuron compile failed. Operator {}::{} is not supported".format(mod_name,func_name)
AssertionError: Neuron compile failed. Operator prim::PythonOp is not supported
Is that an indication that we're just using an unsupported model architecture?
from aws-neuron-sdk.
>>I see that you're running version 1.0.6801.0+6001944336. There is a newer version available. In addition to the advice above, I would encourage you to update pip/apt/yum/conda as appropriate and try again.
I've been installing neuron-cc using:
$ pip install -U pip
$ pip install neuron-cc --extra-index-url https://pip.repos.neuron.amazonaws.com
Is there a specific newer version that I should try to get for Python 3.6, or is it better to install it via apt?
from aws-neuron-sdk.
You may need to pass the --upgrade flag and possibly the --force-reinstall flag since you already have the software installed. Pip should be sufficient.
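One way to confirm which version pip actually left in the environment, before and after upgrading, is to query the installed package metadata. The following is a rough sketch, not an official Neuron utility: `installed_version` is a made-up helper, and apart from neuron-cc the package names in the demo loop are assumptions. On Python versions older than 3.8 (such as the 3.6 environment in this thread) it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on older Pythons


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Only neuron-cc is named in the thread; the other names are illustrative.
    for name in ("neuron-cc", "torch-neuron", "torch"):
        print(name, installed_version(name))
```

Comparing the printed neuron-cc version across a plain install and an install with --upgrade shows whether the flags changed anything.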
-Taylor
from aws-neuron-sdk.
Thanks. I've been installing this on a clean EC2 instance each time, so the flags didn't seem to have any effect.
I was able to make some progress. I realized that I had forgotten the --no-deps option on torchvision the last time around. Re-installing all of that led to a successful compilation of one of the models! For some reason it is not recognizing the CUDA version, though:
import torch
import torch_neuron
torch.__version__
# '1.3.0.1.0.90.0'
torch.version.cuda
# None
torch.cuda.device_count()
# 0
The default CUDA version for this AMI is 10.0. I'll play around with it and see if I have a mismatch in versions or something.
from aws-neuron-sdk.
So, it looks like I'm able to compile the model as long as I train it using the environment that I've set up for Neuron. This should help us get a little farther along. Thanks for the help!
from aws-neuron-sdk.
Great, thanks for the info! Will close this. Feel free to reopen if you have any more issues.
from aws-neuron-sdk.