bindog / pytorch-model-parallel
A memory-balanced, communication-efficient model-parallel implementation of a FullyConnected layer with CrossEntropyLoss in PyTorch
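The description above does not say how a model-parallel FullyConnected + CrossEntropyLoss layer keeps communication low. One common approach, shown here only as a minimal sketch and not taken from the repository's code, is to split the class logits across devices and exchange just a few scalars per sample (the maximum and the sum of exponentials) instead of gathering the full logit matrix. The function names and the list-of-lists "shards" below are hypothetical stand-ins for per-GPU tensors.

```python
import math


def sharded_cross_entropy(logit_shards, label):
    """Cross-entropy for one sample whose class logits are split across shards.

    logit_shards: list of lists; each inner list stands in for the logits
                  held by one (hypothetical) device.
    label: global index of the ground-truth class.
    """
    # Step 1: each shard reports its local maximum (one scalar per shard).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each shard reports its local sum of exp(logit - global_max).
    local_sums = [
        sum(math.exp(z - global_max) for z in shard) for shard in logit_shards
    ]
    total = sum(local_sums)

    # Step 3: only the shard that owns the label supplies the target logit.
    offset = 0
    target_logit = None
    for shard in logit_shards:
        if offset <= label < offset + len(shard):
            target_logit = shard[label - offset]
        offset += len(shard)
    if target_logit is None:
        raise ValueError("label is outside the range covered by the shards")

    # loss = -log softmax(label) = log(total) - (target_logit - global_max)
    return math.log(total) - (target_logit - global_max)


def dense_cross_entropy(logits, label):
    """Reference single-device cross-entropy, used to check the sharded version."""
    m = max(logits)
    return math.log(sum(math.exp(z - m) for z in logits)) - (logits[label] - m)
```

In this sketch the per-sample traffic is one maximum, one partial sum, and one target logit per shard, regardless of how many classes each shard holds. Subtracting the global maximum before exponentiating also keeps the sums from overflowing.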
Environment: PyTorch 1.6, CUDA 10.2, 4 x Titan RTX
    output = self.am_branches[i](x.cuda(i), labels[i])
  File "/home/derron/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/derron/arcface-pytorch/head/metrics_parallel.py", line 102, in forward
    output[index] = phi[index]
RuntimeError: Could not run 'aten::nonzero' with arguments from the 'SparseCUDA' backend. 'aten::nonzero' is only available for these backends: [CPU, CUDA, Autograd, Profiler, Tracer].
Can @amp.float_function be removed? I run into inf values when I use it.
Could you share the environment you ran the code in?
Thanks a lot!
Have you ever run into this problem when running the training code? It happens after training has run for a few iterations:
*** Error in `/opt/conda/bin/python': double free or corruption (fasttop): 0x00007f0018011960 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f026f6987e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f026f6a137a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f026f6a553c]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x3cead6e)[0x7f01f8755d6e]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x3ceae19)[0x7f01f8755e19]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x3ceaf95)[0x7f01f8755f95]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine17evaluate_functionERNS0_8NodeTaskE+0x1210)[0x7f01f874d6b0]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so(_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE+0x1c4)[0x7f01f874f564]
/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so(_ZN5torch8autograd6python12PythonEngine11thread_initEi+0x2a)[0x7f026b2eebca]
/opt/conda/lib/python3.7/site-packages/torch/_C.cpython-37m-x86_64-linux-gnu.so(+0xf14f)[0x7f026be2d14f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f026f9f26ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f026f72841d]
======= Memory map: ========
200000000-200200000 rw-s 00000000 00:06 533 /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0
200400000-200404000 rw-s 00000000 00:06 533 /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0
200600000-200a00000 rw-s 00000000 00:06 533 /dev/nvidiactl
200a00000-201600000 ---p 00000000 00:00 0
201600000-201800000 rw-s 00000000 00:06 533 /dev/nvidiactl
201800000-201804000 rw-s 00000000 00:06 533 /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0
201a00000-201e00000 rw-s 00000000 00:06 533 /dev/nvidiactl
201e00000-201e04000 rw-s 00000000 00:06 533 /dev/nvidiactl
201e04000-202000000 ---p 00000000 00:00 0
202000000-202400000 rw-s 00000000 00:06 533 /dev/nvidiactl
202400000-202404000 rw-s 00000000 00:06 533 /dev/nvidiactl
202404000-202600000 ---p 00000000 00:00 0
202600000-202a00000 rw-s 00000000 00:06 533 /dev/nvidiactl
202a00000-202a04000 rw-s 00000000 00:06 533 /dev/nvidiactl
202a04000-202c00000 ---p 00000000 00:00 0
202c00000-203000000 rw-s 00000000 00:06 533 /dev/nvidiactl
203000000-203004000 rw-s 00000000 00:06 533 /dev/nvidiactl
You could give partial pc a try.