Comments (18)
@Gavin666Github But I have already trained with mutual learning on the same GPU many times and never hit this problem; GPU memory usage stays at roughly 95%, with no out-of-memory errors.
from alignedreid-re-production-pytorch.
Hi, I can't tell what is causing this problem either. I'm guessing your batch here is 8 identities with 4 images per identity. The number 1095010584 in the error message is very strange.
I ran into this problem too: training two models on the same GPU with train_ml.py raises this error, on both the market1501 and duke datasets (cuhk03 not tested yet). It may be thread-related; sometimes the first epoch finishes without an error, and sometimes the error appears before the first epoch completes.
duke test set
NO. Images: 19889
NO. IDs: 1110
NO. Query Images: 2228
NO. Gallery Images: 17661
NO. Multi-query Images: 0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 163, in hard_example_mining
dist_mat[is_neg].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 175, in hard_example_mining
ind[is_pos].contiguous().view(N, -1), 1, relative_p_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
market1501 trainval set
NO. Images: 12936
NO. IDs: 751
loading pickle file: /home/sobey123/code/project/AlignedReId/datasets/market1501/partitions.pkl
market1501 test set
NO. Images: 31969
NO. IDs: 751
NO. Query Images: 3368
NO. Gallery Images: 15913
NO. Multi-query Images: 12688
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 177, in hard_example_mining
ind[is_neg].contiguous().view(N, -1), 1, relative_n_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 159, in hard_example_mining
dist_mat[is_pos].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
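The RuntimeError in the logs above comes from reshaping a boolean-masked selection with `.view(N, -1)`, which only works when the mask selects the same number of elements in every row of the distance matrix. A minimal sketch in plain PyTorch (not the repo's code; `N = 8` and the labels are made up, versus `N = 128` in the logs) reproduces the failure mode when the label tensor becomes inconsistent, e.g. through a race between the two training threads:

```python
import torch

N = 8  # batch size; 2 images per identity here
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
dist_mat = torch.rand(N, N)

# hard_example_mining assumes every row has the same number of
# negatives, so the masked selection can be reshaped to [N, -1]:
is_neg = labels.unsqueeze(0) != labels.unsqueeze(1)
ok = dist_mat[is_neg].contiguous().view(N, -1)  # 48 elements -> shape (8, 6)

# If one thread sees a labels tensor that is no longer consistent
# (uneven identity counts), the mask selects a count that is not
# divisible by N, and view() raises the same RuntimeError as above:
labels_bad = labels.clone()
labels_bad[0] = 1  # identity counts now uneven across rows
is_neg_bad = labels_bad.unsqueeze(0) != labels_bad.unsqueeze(1)
raised = False
try:
    dist_mat[is_neg_bad].contiguous().view(N, -1)  # 46 elements, 46 % 8 != 0
except RuntimeError as e:
    raised = True
    print(e)
```

This only shows the mechanics of the error message; it does not pin down which part of train_ml.py actually corrupts the batch composition.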
@zouliangyu @ShiinaMitsuki Thanks for pointing out the problem! Have you tried running the two models on different GPUs, and does the problem still occur there? I'll debug this when I have time.
@huanghoujing Not yet; I only have one GPU available at the moment, which is how I found this problem.
It seems to be caused by the Python version; after switching back to 2.7 the problem did not appear. (lll¬ω¬)
@ShiinaMitsuki But the error message in the first post of this issue shows that Python 2.7 was being used.
Mutual learning needs to run on two GPUs; it won't run on one. Memory is not released in time, and GPU memory gradually overflows.
@Gavin666Github If you want to run on one GPU without reducing the batch size, the two models have to be updated alternately: each batch updates one model (after the other model's forward pass produces its outputs, delete the intermediate variables). The multithreading machinery is then unnecessary, but the code needs changes.
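The alternating-update idea can be sketched as follows. This is a hypothetical illustration, not the repo's API: `mutual_step` and the plain cross-entropy-plus-KL mutual loss are assumptions; only one model keeps a backward graph per batch, so the other model's intermediate activations are freed immediately.

```python
import torch
import torch.nn.functional as F

def mutual_step(model_a, model_b, opt_a, x, targets, T=1.0):
    """Update model_a for one batch, using model_b only as a fixed target.

    model_b runs under no_grad, so no graph (and no extra activations)
    is kept for it; this is what lets both models share one GPU.
    """
    model_b.eval()
    with torch.no_grad():
        logits_b = model_b(x)      # intermediates freed right away
    model_a.train()
    logits_a = model_a(x)
    ce = F.cross_entropy(logits_a, targets)
    # KL term pulling model_a's distribution toward model_b's
    kl = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                  F.softmax(logits_b / T, dim=1),
                  reduction='batchmean') * T * T
    loss = ce + kl
    opt_a.zero_grad()
    loss.backward()
    opt_a.step()
    return loss.item()

# Alternate roles batch by batch instead of running two threads:
# mutual_step(model_a, model_b, opt_a, x, y)  # batch k: update A
# mutual_step(model_b, model_a, opt_b, x, y)  # batch k+1: update B
```

Alternating calls replace the two threads: batch k updates model A against B's outputs, batch k+1 updates B against A's, so only one backward graph lives on the GPU at a time.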
@huanghoujing @ShiinaMitsuki Thanks for the guidance! The failure must have been due to insufficient memory (out of memory). My local machine (single GPU) only has 4 GB: running Global Loss alone is fine (with a fairly small batch size), but mutual learning with the same batch size fails. Switching to a multi-GPU server with 8 GB of memory per GPU works; problem completely solved.
@ShiinaMitsuki Can your mutual learning with batch size = 32x4 run on a single 12 GB card?
@ShiinaMitsuki For mutual learning on a single GPU, do you just set -d ((0,),(0,))?
@huanghoujing Yes, it works if you make it a bit smaller.
@Coler1994 -d ((0,),(0,)) --num_models 2
@ShiinaMitsuki @huanghoujing I just tried single-GPU mutual learning and it ran all the way through without hitting the bug. Amazing!
Hi, I ran into the same problem on Python 2.7.15. Has the OP solved it? I saw @ShiinaMitsuki say that switching to Python 2.7 fixed it @Coler1994, but we hit the problem on 2.7 itself. Could you share the exact command you used? Thanks.
@Coler1994 Which Python version did you run under? Could you share your code changes for reference?