Comments (18)
@Gavin666Github But I have already trained with mutual learning on the same GPU many times and never hit this problem; GPU memory usage stays at roughly 95%, with no out-of-memory errors.
from alignedreid-re-production-pytorch.
Hi, I can't tell what is causing this problem either. I'm guessing your batch here is 8 identities with 4 images per identity. The number 1095010584 in the error message is very strange.
I ran into this problem too: training two models on the same GPU with train_ml.py raises this error, on both the market1501 and duke datasets (cuhk03 not tested yet). It may be thread-related; sometimes the first epoch finishes without an error, and sometimes the error appears before the first epoch completes.
duke test set
NO. Images: 19889
NO. IDs: 1110
NO. Query Images: 2228
NO. Gallery Images: 17661
NO. Multi-query Images: 0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 163, in hard_example_mining
dist_mat[is_neg].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 175, in hard_example_mining
ind[is_pos].contiguous().view(N, -1), 1, relative_p_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
market1501 trainval set
NO. Images: 12936
NO. IDs: 751
loading pickle file: /home/sobey123/code/project/AlignedReId/datasets/market1501/partitions.pkl
market1501 test set
NO. Images: 31969
NO. IDs: 751
NO. Query Images: 3368
NO. Gallery Images: 15913
NO. Multi-query Images: 12688
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 177, in hard_example_mining
ind[is_neg].contiguous().view(N, -1), 1, relative_n_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 159, in hard_example_mining
dist_mat[is_pos].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
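The RuntimeError in the logs above comes from reshaping a boolean-masked selection with `.view(N, -1)`, which only works when the mask selects the same number of elements in every row of the distance matrix. A minimal sketch in plain PyTorch (not the repo's code; `N = 8` and the labels are made up, versus `N = 128` in the logs) reproduces the failure mode when the label tensor becomes inconsistent, e.g. through a race between the two training threads:

```python
import torch

N = 8  # batch size; 2 images per identity here
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
dist_mat = torch.rand(N, N)

# hard_example_mining assumes every row has the same number of
# negatives, so the masked selection can be reshaped to [N, -1]:
is_neg = labels.unsqueeze(0) != labels.unsqueeze(1)
ok = dist_mat[is_neg].contiguous().view(N, -1)  # 48 elements -> shape (8, 6)

# If one thread sees a labels tensor that is no longer consistent
# (uneven identity counts), the mask selects a count that is not
# divisible by N, and view() raises the same RuntimeError as above:
labels_bad = labels.clone()
labels_bad[0] = 1  # identity counts now uneven across rows
is_neg_bad = labels_bad.unsqueeze(0) != labels_bad.unsqueeze(1)
raised = False
try:
    dist_mat[is_neg_bad].contiguous().view(N, -1)  # 46 elements, 46 % 8 != 0
except RuntimeError as e:
    raised = True
    print(e)
```

This only shows the mechanics of the error message; it does not pin down which part of train_ml.py actually corrupts the batch composition.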
@zouliangyu @ShiinaMitsuki Thanks for pointing out the problem! Have you tried running the two models on different GPUs, and does the problem still occur there? I'll debug this when I have time.
@huanghoujing Not yet; I only have one GPU available at the moment, which is how I found this problem.
It seems to be caused by the Python version; after switching back to 2.7 the problem did not appear. (lll¬ω¬)
@ShiinaMitsuki But the error message in the first post of this issue shows that Python 2.7 was being used.
Mutual learning needs to run on two GPUs; it won't run on one. Memory is not released in time, and GPU memory gradually overflows.
@Gavin666Github If you want to run on one GPU without reducing the batch size, the two models have to be updated alternately: each batch updates one model (after the other model's forward pass produces its outputs, delete the intermediate variables). The multithreading machinery is then unnecessary, but the code needs changes.
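The alternating-update idea can be sketched as follows. This is a hypothetical illustration, not the repo's API: `mutual_step` and the plain cross-entropy-plus-KL mutual loss are assumptions; only one model keeps a backward graph per batch, so the other model's intermediate activations are freed immediately.

```python
import torch
import torch.nn.functional as F

def mutual_step(model_a, model_b, opt_a, x, targets, T=1.0):
    """Update model_a for one batch, using model_b only as a fixed target.

    model_b runs under no_grad, so no graph (and no extra activations)
    is kept for it; this is what lets both models share one GPU.
    """
    model_b.eval()
    with torch.no_grad():
        logits_b = model_b(x)      # intermediates freed right away
    model_a.train()
    logits_a = model_a(x)
    ce = F.cross_entropy(logits_a, targets)
    # KL term pulling model_a's distribution toward model_b's
    kl = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                  F.softmax(logits_b / T, dim=1),
                  reduction='batchmean') * T * T
    loss = ce + kl
    opt_a.zero_grad()
    loss.backward()
    opt_a.step()
    return loss.item()

# Alternate roles batch by batch instead of running two threads:
# mutual_step(model_a, model_b, opt_a, x, y)  # batch k: update A
# mutual_step(model_b, model_a, opt_b, x, y)  # batch k+1: update B
```

Alternating calls replace the two threads: batch k updates model A against B's outputs, batch k+1 updates B against A's, so only one backward graph lives on the GPU at a time.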
@huanghoujing @ShiinaMitsuki Thanks for the guidance! The failure must have been due to insufficient memory (out of memory). My local machine (single GPU) only has 4 GB: running Global Loss alone is fine (with a fairly small batch size), but mutual learning with the same batch size fails. Switching to a multi-GPU server with 8 GB of memory per GPU works; problem completely solved.
@ShiinaMitsuki Can your mutual learning with batch size = 32x4 run on a single 12 GB card?
@ShiinaMitsuki For mutual learning on a single GPU, do you just set -d ((0,),(0,))?
@huanghoujing Yes, it works if you make it a bit smaller.
@Coler1994 -d ((0,),(0,)) --num_models 2
@ShiinaMitsuki @huanghoujing I just tried single-GPU mutual learning and it ran all the way through without hitting the bug. Amazing!
Hi, I ran into the same problem on Python 2.7.15. Has the OP solved it? I saw @ShiinaMitsuki say that switching to Python 2.7 fixed it @Coler1994, but we hit the problem on 2.7 itself. Could you share the exact command you used? Thanks.
@Coler1994 Which Python version did you run under? Could you share your code changes for reference?