Comments (4)
Hi @alexandrekm - Since you can't share the actual model, I'm responding based on the error message you provided: `invalid literal for int() with base 10: '0.01'`
Is the model attempting to assign a floating-point value (0.01) to an integer-typed variable?
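For reference, that message is exactly what Python raises when `int()` is handed a string holding a float literal. A toy illustration (not from the model itself):

```python
# Reproduces the error message from the compilation log: int() cannot
# parse a base-10 string that contains a decimal point.
try:
    int("0.01")
except ValueError as exc:
    print(exc)  # invalid literal for int() with base 10: '0.01'

# If truncation is acceptable, converting through float() first works:
print(int(float("0.01")))  # 0
```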
from aws-neuron-sdk.
Hi @aws-donkrets, thanks for taking a look at this.
The model seems to convert from PyTorch to the SavedModel format successfully, but the subsequent neuron-cc compilation step fails.
Troubleshooting Steps:
- Cast Analysis: I haven't observed any explicit string-to-integer conversion within the model itself; the '0.01' value actually lives in a string. The cast may come from one of the frameworks we use, but since I don't have access to a debugger within neuron-cc (where this fails) I can't tell. Is this something that I can do myself?
- Compilation Breakdown: The compilation process appears to be two-fold (is this correct?):
  - Stage 1: Converts the PyTorch model to a SavedModel (presumably using torch.jit.trace, which succeeds on its own).
  - Stage 2: Compiles the SavedModel using neuron-cc.
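For context, Stage 1 can be sketched as follows. This is a hedged toy stand-in, assuming torch is installed: the `Toy` module and its `0.01` constant are illustrative, not the actual model; only the input shape matches the reported input tensor.

```python
# Hedged sketch of Stage 1: tracing a toy module with torch.jit.trace.
# The module is an illustrative stand-in, not the real model.
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        # a float constant like the '0.01' seen in the error message
        return torch.relu(x) * 0.01

model = Toy().eval()
example = torch.rand(1, 3, 448, 768)  # shape of the reported input tensor
traced = torch.jit.trace(model, example)
print(tuple(traced(example).shape))  # (1, 3, 448, 768)
```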
Reproducing the Error:
The failure can be isolated and reproduced by running just the second stage (neuron-cc compilation) with the specific commands extracted from the logs. Here's an example of the recreated command:
```
neuron-cc compile /home/ubuntu/code/neuron-cc-inf1/1/graph_def.pb \
    --framework TENSORFLOW \
    --pipeline compile SaveTemps \
    --output /home/ubuntu/code/neuron-cc-inf1/1/graph_def.neff \
    --io-config '{"inputs": {"tensor.1:0": [[1, 3, 448, 768], "float32"]}, "outputs": "... (list of outputs) ..."}' \
    --verbose 35
```
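As a sanity check on the recreated command, the `--io-config` value is ordinary JSON mapping tensor names to `[shape, dtype]` pairs. The snippet below parses it; the outputs list is elided in the log, so a hypothetical placeholder name stands in for it:

```python
# Parse the --io-config JSON from the neuron-cc command line.
# "output:0" is a hypothetical placeholder for the elided outputs list.
import json

io_config = json.loads(
    '{"inputs": {"tensor.1:0": [[1, 3, 448, 768], "float32"]},'
    ' "outputs": ["output:0"]}'
)
shape, dtype = io_config["inputs"]["tensor.1:0"]
print(shape, dtype)  # [1, 3, 448, 768] float32
```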
> I haven't observed any explicit string-to-integer conversion within the model itself. It's actually a string that contains the '0.01' value. This cast can come from one of the frameworks we use but since I do not have access to a debugger within neuron-cc (where this fails) I am not sure. Is this something that I can do myself?
I think the easiest thing you could try yourself is to come up with a minimal reproduction that does not contain any proprietary architectural information. The way you might approach this is to create a model with a single layer (instead of multiple) and then attempt to compile it like before. If this still causes an error, then remove submodules from the end of the layer until just a few operators can reproduce the failure. At this point you should be able to share a minimal set of operations to reproduce the issue.
> Compilation Breakdown: The compilation process appears to be two-fold (is this correct?):
> Stage 1: Converts the PyTorch model to a SavedModel (presumably using torch.jit.trace, which succeeds on its own).
> Stage 2: Compiles the SavedModel using neuron-cc.
Yes, exactly correct. Because there are a few stages to compilation, the easiest thing to do is come up with a minimal reproduction so we can determine exactly which component is failing.
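The narrowing strategy suggested above can be sketched generically. Here `compiles()` is a hypothetical predicate standing in for an actual neuron-cc run, and the op names are toy stand-ins:

```python
# Generic sketch of the minimal-reproduction strategy: drop submodules
# from the end until the shortest prefix that still fails remains.
def minimal_failing_prefix(ops, compiles):
    """Return the shortest prefix of `ops` that still fails to compile."""
    assert not compiles(ops), "full model must reproduce the failure"
    n = len(ops)
    # shrink from the end while the failure still reproduces
    while n > 1 and not compiles(ops[:n - 1]):
        n -= 1
    return ops[:n]

# Toy stand-in: pretend the third op is the one the compiler chokes on.
ops = ["conv", "relu", "bad_cast", "pool", "fc"]
compiles = lambda prefix: "bad_cast" not in prefix
print(minimal_failing_prefix(ops, compiles))  # ['conv', 'relu', 'bad_cast']
```

In practice, "removing submodules from the end" means rebuilding the traced model with fewer layers each time and re-running the compile until just a few operators reproduce the failure.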
I managed to understand what the issue was and disabling a part of the model solved it. Thanks for the help.