shiqiyu / opengait

A flexible and extensible framework for gait recognition. You can focus on designing your own models and comparing with state-of-the-arts easily with the help of OpenGait.

Python 99.22% Shell 0.78%


opengait's Issues

raise ArgumentError ctypes.ArgumentError

I added my own network on top of OpenGait's GaitSet and changed the features it returns:

    final_emb = final_emb.view(n, final_y, -1)  # [n, 128, _]
    n, s, _, h, w = sils.size()
    retval = {
        'training_feat': {
            'triplet': {'embeddings': final_emb, 'labels': labs}
        },
        'visual_summary': {
            'image/sils': sils.view(n*s, 1, h, w)
        },
        'inference_feat': {
            'embeddings': final_emb
        }
    }

However, when I train the model with DDP, the following error appears:
File "/project/code/OpenGait/lib/modeling/losses/base.py", line 23, in inner raise ArgumentError
ctypes.ArgumentError
    @functools.wraps(func)
    def inner(*args, **kwds):
        try:
            for k, v in kwds.items():
                kwds[k] = ddp_all_gather(v)

            loss, loss_info = func(*args, **kwds)
            loss *= torch.distributed.get_world_size()
            return loss, loss_info
        except:
            raise ArgumentError
    return inner

What causes this problem, and how can it be solved?
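The bare `except` in the decorator quoted in this issue replaces whatever went wrong inside the loss (for instance a shape mismatch) with `ArgumentError`, so the traceback no longer names the real cause. A minimal sketch of a way to surface it follows. The decorator name `gather_inputs`, the identity `gather` used in place of `ddp_all_gather`, and `toy_loss` are all hypothetical; this is not OpenGait's code.

```python
import functools


def gather_inputs(gather, world_size):
    """Hypothetical stand-in for the loss decorator, without the bare except."""
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwds):
            # Gather every keyword input across ranks (identity in this sketch).
            gathered = {k: gather(v) for k, v in kwds.items()}
            # Any exception raised by func now propagates with its own type
            # and message instead of being replaced by ArgumentError.
            loss, loss_info = func(*args, **gathered)
            return loss * world_size, loss_info
        return inner
    return decorator


@gather_inputs(gather=lambda v: v, world_size=2)
def toy_loss(embeddings, labels):
    if len(embeddings) != len(labels):
        raise ValueError(
            f"got {len(embeddings)} embeddings but {len(labels)} labels")
    return 0.5, {"loss": 0.5}


def real_error_message():
    """Return the message of the underlying error, which is now visible."""
    try:
        toy_loss(embeddings=[1, 2, 3], labels=[0, 1])
    except ValueError as exc:
        return str(exc)
    return None
```

With the original exception visible, it becomes possible to tell whether the reshaped `final_emb` is what the loss rejects.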

What do these two lines of code do?

OpenGait/lib/modeling/modules.py

Line 46 in 49cbc44

_ = x.size()

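PyCharm step-by-step debugging

Hello, I connect from Windows to a remote CentOS server and want to step through the code in a debugger. I do not know how to set up PyCharm's Run/Debug configurations, because the code uses distributed training and the shell command carries extra environment settings and arguments: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 lib/main.py.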
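One common approach, sketched here under assumptions, is to debug a single process and supply by hand the environment variables that the launcher would normally set for `init_method='env://'` (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`). The helper names below are hypothetical, and whether `lib/main.py` additionally expects a `--local_rank` argument depends on the code version.

```python
import os


def single_process_env(port=29500, visible_devices="0"):
    """Environment a launcher would normally provide, for one local process."""
    return {
        "CUDA_VISIBLE_DEVICES": visible_devices,
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": "0",
        "WORLD_SIZE": "1",
        "LOCAL_RANK": "0",
    }


def apply_env(env):
    """Export the variables into this process and return what was set."""
    os.environ.update(env)
    return {key: os.environ[key] for key in env}
```

The same key/value pairs can be entered in the "Environment variables" field of a Run/Debug configuration whose script is `lib/main.py`, so that the IDE starts the script directly instead of going through the launcher.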

GREW pretreatment `to_pickle` has size 0

System information (version)

Pytorch => 1.11
Operating System / Platform => Ubuntu 20.04
Cuda => 11.3

Detailed description

I'm trying to run GREW pretreatment code but it generates no GREW-pkl folder at the end of the process.
I debugged myself and checked if the --dataset flag is set properly and the to_pickle list size before saving the pickle file.
The flag is well set but the size of the list is always 0.

I downloaded the GREW dataset from the link you sent me and made the GREW-rearranged folder using the code provided.
I'll keep investigating what is causing this error, and if I find it I'll submit a PR with a fix.

Issue submission checklist

I checked the problem with documentation, FAQ, issues, and have not found solution.

About resuming training of a model

I would like to ask the authors how to continue training from an already trained model.

In `In x: [n, s, c, h, w]` here, what does each dimension stand for in the gait data?

OpenGait/lib/modeling/modules.py

Line 34 in 6e71c7a

class SetBlockWrapper(nn.Module):

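About the implementation details of the GLN network

Detailed description

The paper uses 1×1 convolutions in the lateral connections, but the code implements them with a 3×3 kernel. Does this implementation perform better? I am just asking out of curiosity.

Then at each stage, a 1 × 1 convolutional layer is taken to rearrange the features and adjust the channel dimension

lateral_layer code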

OpenGait/lib/modeling/models/gln.py

Lines 54 to 59 in 6a469fc

    self.lateral_layer1 = nn.Conv2d(
        in_channels[1]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
    self.lateral_layer2 = nn.Conv2d(
        in_channels[2]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
    self.lateral_layer3 = nn.Conv2d(
        in_channels[3]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
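For comparing the two choices, the sketch below works through the standard convolution arithmetic. Both a 1×1 kernel with no padding and a 3×3 kernel with `padding=1` keep the spatial size at stride 1, while the 3×3 variant has nine times as many weights. The channel counts and feature-map size are hypothetical, and the sketch says nothing about which variant is more accurate.

```python
def conv2d_out_size(size, kernel_size, stride=1, padding=0):
    """Output length of one spatial dimension for a standard 2D convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a square-kernel convolution with bias=False."""
    return in_channels * out_channels * kernel_size * kernel_size


# Hypothetical sizes, chosen only to illustrate the arithmetic.
in_ch, lateral_dim, height = 64, 256, 32

same_size_3x3 = conv2d_out_size(height, kernel_size=3, stride=1, padding=1)
same_size_1x1 = conv2d_out_size(height, kernel_size=1, stride=1, padding=0)
weights_3x3 = conv2d_weight_count(in_ch, lateral_dim, 3)
weights_1x1 = conv2d_weight_count(in_ch, lateral_dim, 1)
```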

Error occurs when testing GaitGL on the CASIA-B dataset

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

[2021-11-13 15:04:55] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 80000, 'save_name': 'GaitGL', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler'}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc', 'enable_distributed': True}
[2021-11-13 15:04:55] [INFO]: {'model': 'GaitGL', 'channels': [32, 64, 128], 'class_num': 74}
[2021-11-13 15:04:55] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': '/media/yangxilab/DiskB1/wxian/dataset/CASIA-B/CASIA-B-pkl/', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B'}
[2021-11-13 15:04:55] [INFO]: -------- Test Pid List --------
[2021-11-13 15:04:55] [INFO]: [075, 076, ..., 124]
[2021-11-13 15:04:56] [INFO]: Restore Parameters from output/CASIA-B/GaitGL/GaitGL/checkpoints/GaitGL-80000.pt !!!
[2021-11-13 15:04:58] [INFO]: Parameters Count: 3.09667M
[2021-11-13 15:04:58] [INFO]: Model Initialization Finished!
Transforming: 100%|██████████| 5485/5485 [01:24<00:00, 64.84it/s]
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 51, in run_model
Model.run_test(model)
File "/media/yangxilab/DiskB1/wxian/dataset/OpenGait-master/lib/modeling/base_model.py", line 417, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/media/yangxilab/DiskB1/wxian/dataset/OpenGait-master/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485

Testing other models, such as GaitSet, works normally, but testing GaitGL fails. I don't know why.

ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.

About the GaitSet learning rate

optimizer_cfg:
  lr: 0.1
  momentum: 0.9
  solver: SGD
  weight_decay: 0.0005

I would like to ask the authors why the learning rate is set to 0.1, and how the regularization was chosen.

`pretreatment.py` is causing a deadlock error on the OUMVLP dataset

System information (version)

Pytorch => 1.10
Operating System / Platform =>Ubuntu 20.04
Cuda => Cuda 11.5

Detailed description

pretreatment.py for the OUMVLP dataset results in random deadlocks.
Currently you are making asynchronous calls using a thread pool and storing the results in an ordinary Python list. As a result, when I run the code to convert OUMVLP files to pickles, random deadlocks occur during execution.

I forked the repository and am trying to solve this myself to make a PR, but I believe it is relevant to keep you informed of this problem.
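As an illustration of one alternative pattern (not the repository's code and not a confirmed fix), the sketch below submits the conversion tasks to a `concurrent.futures.ThreadPoolExecutor` and consumes each future as it completes, so a failing task is reported instead of being silently lost. `convert_one` is a hypothetical stand-in for converting one sequence to a pickle.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_one(seq_id):
    """Hypothetical stand-in for converting one sequence to a pickle file."""
    return seq_id, f"{seq_id}.pkl"


def convert_all(seq_ids, max_workers=4):
    """Run the conversions in a thread pool and collect results per sequence."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_one, seq_id) for seq_id in seq_ids]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker,
            # so errors surface here rather than disappearing.
            seq_id, out_name = future.result()
            results[seq_id] = out_name
    return results
```

Leaving the `with` block waits for every submitted task, so the caller sees all results or the first error raised.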

Steps to reproduce

Steps:

Run the code for the OUMVLP dataset on a personal machine's processor. The error occurs non-deterministically.

Issue submission checklist

I checked the problem with documentation, FAQ, issues, and have not found solution.

Training stuck sometimes

When training gait recognition in the wild (for example, on GREW, HID or Gait3D), the training process sometimes gets stuck, probably because of the sampler.

Problem about the given GaitGL-OUMVLP model

We tested the given GaitGL model directly on the OUMVLP dataset, but it can't achieve 89.9% at Rank@1 (excluding identical-view cases).

The results are:
===Rank-1 (Include identical-view cases)===
NM: 89.914
===Rank-1 (Exclude identical-view cases)===
NM: 89.502

and the results for each angle are:

[84.33692308 89.85384615 91.12230769 91.42153846 90.89076923 90.58769231
90.18384615 88.3 88.05384615 90.27461538 90.38692308 89.58
89.40307692 88.63846154].

We notice that the value of Rank@1(Including identical-view cases) is 89.9%, is there a transcription error?
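As a quick consistency check on the numbers quoted in this issue, the mean of the 14 per-angle values can be recomputed directly. It comes out at about 89.502, which matches the "Exclude identical-view cases" figure rather than 89.9.

```python
from statistics import fmean

# Per-angle Rank-1 values copied from the issue text.
per_angle = [
    84.33692308, 89.85384615, 91.12230769, 91.42153846, 90.89076923,
    90.58769231, 90.18384615, 88.3, 88.05384615, 90.27461538,
    90.38692308, 89.58, 89.40307692, 88.63846154,
]

mean_rank1 = fmean(per_angle)  # about 89.502
```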

Where is the definition of data_config['data_in_use'] and what does it do.

This repo is wonderful! However, I'm a little confused where is the definition of data_config['data_in_use'] and what does it do.

Loss info unexpected result

The log outputs unexpected loss info, for example:

[2021-12-09 14:51:09] [INFO]: Iteration 00100, Cost 61.51s, triplet_loss=0.1941, triplet_hard_loss=0.6903, triplet_loss_num=189226.8594, triplet_mean_dist=0.2503, softmax_loss=3.9386, softmax_hard_loss=0.6903, softmax_loss_num=189226.8594, softmax_mean_dist=0.2503, softmax_accuracy=0.1113, triplet_accuracy=0.1087
[2021-12-09 14:52:10] [INFO]: Iteration 00200, Cost 60.51s, triplet_loss=0.1943, triplet_hard_loss=1.1024, triplet_loss_num=98941.8203, triplet_mean_dist=0.7868, triplet_accuracy=0.2896, softmax_loss=3.0870, softmax_hard_loss=1.1024, softmax_loss_num=98941.8203, softmax_mean_dist=0.7868, softmax_accuracy=0.2903

Problem when running GaitGL with the provided weights

[2021-12-02 19:53:48] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 80000, 'save_name': 'GaitGL', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler'}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc', 'enable_distributed': True}
[2021-12-02 19:53:48] [INFO]: {'model': 'GaitGL', 'channels': [32, 64, 128], 'class_num': 74}
[2021-12-02 19:53:48] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': 'xxxxxi/datasets/Gait/CASIA-B/opengait-64', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B', 'pid_num': 74}
[2021-12-02 19:53:48] [INFO]: -------- Test Pid List --------
[2021-12-02 19:53:48] [INFO]: [075, 076, ..., 124]
[2021-12-02 19:53:53] [INFO]: Restore Parameters from xxxxxi/GaitGL/GaitGL/checkpoints/GaitGL-80000.pt !!!
[2021-12-02 19:53:53] [INFO]: Parameters Count: 3.09667M
[2021-12-02 19:53:53] [INFO]: Model Initialization Finished!
Transforming: 100%|█████████████████████████| 5485/5485 [01:05<00:00, 84.04it/s]
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 51, in run_model
Model.run_test(model)
File "/home/xxxxxi/project/Gait/OpenGait/lib/modeling/base_model.py", line 423, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/home/xxxxxi/project/Gait/OpenGait/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 230780) of binary: /usr/local/anaconda3/bin/python3.8
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

       lib/main.py FAILED

=======================================
Root Cause:
[0]:
time: 2021-12-02_19:55:02
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 230780)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
<NO_OTHER_FAILURES>

a

please delete

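Why do some of your reproduced experiments reach higher accuracy?

Do you think it is due to mixed-precision training, the initialization, or using one more loss function than the paper?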

Why is the loss multiplied by the world size?

Thank you for making this excellent work open source.
As written in the title, could you please explain the reason behind scaling the loss by the world size?
I didn't find such practice in other distributed training works so far!
Thank you again.
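One possible reading, offered here as an assumption and not as the authors' answer: DDP averages gradients across ranks, and if every rank computes the loss over the gathered global batch while only its own samples carry gradient, that average is the full gradient divided by the world size. Multiplying the loss by the world size would then cancel the factor. The toy sketch below checks this arithmetic for a per-sample loss 0.5 * (w * x)^2 with a scalar parameter w. All names are hypothetical, and the triplet loss is not separable per sample, so this is only an analogy.

```python
def local_grad(w, xs_local, n_total):
    """d/dw of (1/n_total) * sum(0.5 * (w * x) ** 2) over the local samples."""
    return sum(w * x * x for x in xs_local) / n_total


def ddp_average(grads):
    """DDP-style reduction: average the per-rank gradients."""
    return sum(grads) / len(grads)


w = 2.0
shards = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two samples each
world_size = len(shards)
n_total = sum(len(shard) for shard in shards)

# Gradient of the global-batch loss computed in a single process.
full_grad = local_grad(w, [x for shard in shards for x in shard], n_total)

# Without scaling, the averaged gradient is too small by a factor of world_size.
unscaled = ddp_average([local_grad(w, shard, n_total) for shard in shards])

# Scaling each rank's loss (and so its gradient) by world_size restores it.
scaled = ddp_average(
    [world_size * local_grad(w, shard, n_total) for shard in shards])
```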

Train GaitPart on OU-MVLP

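Hello, and thank you for your excellent work.
My GaitPart results on CASIA-B are close to yours. However, when training on OU-MVLP with one 1080 Ti and one GTX TITAN, with batch_size set to (24, 16), the accuracy after 350k iterations is only around 79%, still below 80%.
After changing the dataset path in the config file, can your code train on OUMVLP directly, without modifying any other part of the code?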

What does parallel_BN1d mean?

OpenGait/opengait/modeling/modules.py

Line 128 in 1be7333

if parallel_BN1d:

Can the code run in Windows 11 anaconda environment?

System information (version)

Pytorch = 1.6.0
Operating System / Platform = Windows 11
Cuda = 10.1.243

Detailed description

The problem appears when I run the code on my computer with the system information above. I created a new conda env and cloned the code into it, but I cannot run the code. The errors are below:
(opengait) C:\Users\Lenovo\OpenGait>set CUDA_VISIBLE_DEVICES=1 & python -m torch.distributed.launch --nproc_per_node=2 lib/main.py --cfgs ./config/baseline.yaml --phase train

Traceback (most recent call last):
Traceback (most recent call last):
File "lib/main.py", line 58, in <module>
File "lib/main.py", line 58, in <module>
torch.distributed.init_process_group('nccl', init_method='env://')
torch.distributed.init_process_group('nccl', init_method='env://')
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\site-packages\torch\distributed\launch.py", line 261, in <module>
main()
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\site-packages\torch\distributed\launch.py", line 256, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['C:\Users\Lenovo\anaconda3\envs\opengait\python.exe', '-u', 'lib/main.py', '--local_rank=1', '--cfgs', './config/baseline.yaml', '--phase', 'train']' returned non-zero exit status 1.

What do the parameters ipts, labs, seqL stand for here?

OpenGait/lib/modeling/models/gaitset.py

Line 49 in 49cbc44

ipts, labs, _, _, seqL = inputs

ipts, labs, _, _, seqL = inputs
Which five parts make up inputs? What do ipts, labs and seqL each refer to?

Is there any explanation about the proposed baseline model?

Was the baseline model proposed by the authors' own team, and is there a paper that explains it?

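A question about how accuracy is computed

I would like to ask: the accuracy of this kind of gait recognition model (for example, when testing on CASIA-B) seems to be reported per view angle, with 11 per-angle accuracies each for NM, BG and CL. Are these accuracies the average over all test samples? I had assumed accuracy would be reported according to the number of test samples. (I have only just started studying the GaitSet paper, so my understanding is still limited.)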
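The two conventions contrasted in this question can be written out as follows: averaging per-view accuracies gives every view equal weight, while pooling all samples weights each view by how many samples it has. This is a sketch with made-up counts and hypothetical function names. It does not state which convention OpenGait's evaluation follows.

```python
def per_view_accuracy(correct, total):
    """Accuracy in percent for each view, from per-view counts."""
    return {view: 100.0 * correct[view] / total[view] for view in total}


def mean_over_views(correct, total):
    """Average of the per-view accuracies: every view weighs the same."""
    acc = per_view_accuracy(correct, total)
    return sum(acc.values()) / len(acc)


def pooled_accuracy(correct, total):
    """Accuracy over all samples: views with more samples weigh more."""
    return 100.0 * sum(correct.values()) / sum(total.values())


# Made-up counts for two views, for illustration only.
correct = {"000": 90, "090": 40}
total = {"000": 100, "090": 50}
```

The two numbers coincide only when every view contributes the same number of test samples.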

What does feature_num correspond to?

OpenGait/lib/data/collate_fn.py

Line 37 in 49cbc44

feature_num = len(batch[0][0])

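About the GaitPart training loss not converging

I trained the GaitPart model for 80,000 iterations and the loss still shows no sign of converging. Have the authors seen anything similar?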

OUMVLP test phase: Error(s) in loading state_dict for Baseline

System information (version)

torch 1.8.0
Cuda 11.1

Detailed description

Distributed init (rank 0): env://
[2022-03-10 11:42:57] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 150000, 'save_name': 'Baseline', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler', 'batch_shuffle': False, 'frames_all_limit': 720}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc'}
[2022-03-10 11:42:57] [INFO]: {'model': 'Baseline', 'backbone_cfg': {'in_channels': 1, 'layers_cfg': ['BC-32', 'BC-32', 'M', 'BC-64', 'BC-64', 'M', 'BC-128', 'BC-128', 'BC-256', 'BC-256'], 'type': 'Plain'}, 'SeparateFCs': {'in_channels': 256, 'out_channels': 256, 'parts_num': 31}, 'SeparateBNNecks': {'class_num': 5153, 'in_channels': 256, 'parts_num': 31}, 'bin_num': [16, 8, 4, 2, 1]}
[2022-03-10 11:42:58] [INFO]: {'dataset_name': 'OUMVLP', 'dataset_root': 'OUMVLP-pkl', 'dataset_partition': './misc/partitions/OUMVLP.json', 'num_workers': 1, 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'OUMVLP'}

[2022-03-10 11:42:58] [INFO]: -------- Test Pid List --------
[2022-03-10 11:42:58] [INFO]: [00002, 00004, ..., 00022]
Traceback (most recent call last):
File "D:/OpenGait/lib/my_main.py", line 111, in <module>
mp.spawn(main,
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
while not context.join():
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
fn(i, *args)
File "D:\OpenGait\lib\my_main.py", line 104, in main
run_model(cfgs, training)
File "D:\OpenGait\lib\my_main.py", line 64, in run_model
model = Model(cfgs, training)
File "D:\OpenGait\lib\modeling\base_model.py", line 168, in __init__
self.resume_ckpt(restore_hint)
File "D:\OpenGait\lib\modeling\base_model.py", line 289, in resume_ckpt
self._load_ckpt(save_name)
File "D:\OpenGait\lib\modeling\base_model.py", line 262, in _load_ckpt
self.load_state_dict(model_state_dict, strict=load_ckpt_strict)
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\nn\modules\module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Baseline:
size mismatch for BNNecks.fc_bin: copying a param with shape torch.Size([31, 256, 5154]) from checkpoint, the shape in current model is torch.Size([31, 256, 5153]).
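The message says the checkpoint stores 5154 outputs for `BNNecks.fc_bin`, while the logged model config sets `class_num` to 5153, which suggests the two were built with different class counts. A small sketch of listing such mismatches from two name-to-shape mappings follows. `shape_mismatches` is a hypothetical helper, and with real tensors the mappings could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for differing entries."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes copied from the error message above.
checkpoint_shapes = {"BNNecks.fc_bin": (31, 256, 5154)}
model_shapes = {"BNNecks.fc_bin": (31, 256, 5153)}
found = shape_mismatches(checkpoint_shapes, model_shapes)
```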

A question about silhouette extraction

Can you recommend a good method for extracting gait silhouettes? Thank you!

A little bug

OpenGait/lib/utils/evaluation.py

Line 97 in cfda13e

result_dict["scalar/test_accuracy/BG"] = acc[0, :, :, i]

We think there may be a small bug here that makes the BG value identical to the NM value when with_test is set to true. I don't know how to submit my code, so I can only report it this way.
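The quoted line fills the BG entry using index 0 on the first axis of `acc`. A minimal sketch of indexing each condition by its own position is given below. It assumes the first axis is ordered NM, BG, CL and uses a simplified two-level list in place of the real array. Both are assumptions, not taken from evaluation.py.

```python
CONDITIONS = ["NM", "BG", "CL"]  # assumed order of the first axis of acc


def accuracy_by_condition(acc, rank_index=0):
    """acc[condition][rank] holds one accuracy per walking condition and rank."""
    result = {}
    for cond_index, name in enumerate(CONDITIONS):
        # Use the condition's own index; a fixed 0 would copy NM into BG and CL.
        result[f"scalar/test_accuracy/{name}"] = acc[cond_index][rank_index]
    return result


toy_acc = [[97.0], [92.0], [78.0]]  # made-up numbers for illustration only
summary = accuracy_by_condition(toy_acc)
```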

Training GaitGL on OUMVLP with the default config is too slow to tolerate

I tried GaitGL on the OUMVLP dataset (6000 IDs for training and the rest for testing) using the default config with 8 Tesla V100 GPUs. It costs 165 seconds per 100 iterations. If I want 10 epochs, time = 10 (epochs) x 130000 (seqs) / 8 (bs) / 100 * 165 = 3.1 days.
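The estimate in this issue can be reproduced step by step. The formula yields seconds, and dividing by 86400 converts to days. The function name is hypothetical.

```python
def estimated_days(epochs, sequences, batch_size, seconds_per_100_iters):
    """Training time in days from the quantities quoted in the issue."""
    iterations = epochs * sequences / batch_size       # 162500 iterations
    seconds = iterations / 100 * seconds_per_100_iters  # 268125 seconds
    return seconds / 86400


days = estimated_days(epochs=10, sequences=130000, batch_size=8,
                      seconds_per_100_iters=165)  # about 3.1 days
```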

I guess computing the triplet loss might cost a lot of time. I'm trying to modify the training procedure so that the triplet loss only takes effect in later epochs, while early epochs only compute the ID loss, namely the softmax loss.

Are there any other methods to speed it up?

Thanks

CUDA error: an illegal memory access was encountered caused by new DDP strategy of Pytorch1.9+

Thanks for your great contributions to gait recognition community!!
I find this repo cannot run well on pytorch1.9 with CUDA11.1.
It raises an illegal memory access error when I use default baseline config to train model on CASIA-B.

Sometimes it runs normally for the first several steps (but full_loss_num is very small), and then the error is raised.

After setting find_unused_parameters to True, the error disappears and training returns to normal.

It looks like some parameters have been registered but are not used anywhere in the model, which triggers this error.

Unable to find a valid cuDNN algorithm to run convolution

I am testing on a laptop with a single GPU, so I ran CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 lib/main.py --cfgs ./config/gaitset.yaml --phase train. It fails with:
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 49, in run_model
Model.run_train(model)
File "/home/yq/OpenGait/lib/modeling/base_model.py", line 371, in run_train
model.train_step(loss_sum)
File "/home/yq/OpenGait/lib/modeling/base_model.py", line 307, in train_step
self.Scaler.scale(loss_sum).backward()
File "/home/yq/anaconda3/envs/pt14/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yq/anaconda3/envs/pt14/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
Exception raised from try_all at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:692 (most recent call first):
What is going on here?

Data Sets

Thank you very much for the code provided. I have a question: how can I obtain or preprocess these datasets, "HID2021" and '0001-1000'?

    gallery_seq_type = {'0001-1000': ['1', '2'],
                        "HID2021": ['0'], '0001-1000-test': ['0']}
    probe_seq_type = {'0001-1000': ['3', '4', '5', '6'],
                      "HID2021": ['1'], '0001-1000-test': ['1']}

What is this line of code for?

OpenGait/lib/data/sampler.py

Line 33 in 6982c35

total_size = int(math.ceil(_ / self.world_size)

The sequence number of obtained features does not match with the number of the testing set when evaluating.

Thanks for your good work.
I'm using the OpenGait as a basic framework for doing my own research. But I met a problem that the sequence number of obtained features does not match with the number of the testing set when evaluating. Any idea?

How to test with Baseline-60000.pt

My command produces no response:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 lib/main.py --cfgs ./config/baseline.yaml --phase train
How should baseline.yaml be configured?

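A question about BatchNorm

I see in the code that many of the networks do not include BatchNorm, and the results are close to or even higher than those in the papers. But when I experimented with adding BatchNorm, the results got worse. What could be the reason?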

How to fix this problem when i want to test

System information (version)

Pytorch => ❔
Operating System / Platform => ❔
Cuda => ❔

Detailed description

Steps to reproduce

Issue submission checklist

I checked the problem with documentation, FAQ, issues, and have not found solution.

Error when testing with the baseline model

System information (version)

Detailed description

[2022-02-17 12:42:57] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': './dataset_output', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B'}
[2022-02-17 12:42:57] [INFO]: -------- Test Pid List --------
[2022-02-17 12:42:57] [INFO]: ['075']
[2022-02-17 12:43:00] [INFO]: Restore Parameters from output/CASIA-B/Baseline/Baseline/checkpoints/Baseline-60000.pt !!!
[2022-02-17 12:43:00] [INFO]: Parameters Count: 3.77914M
[2022-02-17 12:43:00] [INFO]: Model Initialization Finished!
Transforming: 100%|██████████| 110/110 [00:01<00:00, 76.67it/s]
Traceback (most recent call last):
File "lib/main.py", line 70, in <module>
run_model(cfgs, training)
File "lib/main.py", line 55, in run_model
Model.run_test(model)
File "/home/lalaland/PycharmProjects/OpenGait_/lib/modeling/base_model.py", line 470, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/home/lalaland/PycharmProjects/OpenGait_/lib/utils/evaluation.py", line 79, in identification
0) * 100 / dist.shape[0], 2)
ValueError: could not broadcast input array from shape (4,) into shape (5,)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10590) of binary: /home/lalaland/anaconda3/envs/torch/bin/python
Traceback (most recent call last):
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

lib/main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-02-17_12:43:06
host : lalaland-Predator-PH317-55
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10590)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error when running misc/pretreatment.py on Windows

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
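The idiom named in this error message can be sketched as follows: on Windows, child processes are started by re-importing the main module, so any code that creates a process pool has to sit behind the `if __name__ == "__main__":` guard. `square` and `main` are hypothetical stand-ins, not functions from misc/pretreatment.py.

```python
import multiprocessing as mp


def square(x):
    """Work function; defined at module level so child processes can import it."""
    return x * x


def main():
    # The pool is created only when main() is called, never at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(4))


if __name__ == "__main__":
    mp.freeze_support()  # only has an effect in frozen Windows executables
    print(main())
```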

DistributedDataParallel Usage

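Hello, I would like to ask where to change the number of GPUs and the world_size parameter for distributed training, and how to resolve the error below. Which PyTorch version is recommended? Thank you!
Traceback (most recent call last):
File "D:/OpenGait-1.0/OpenGait-1.0/lib/main.py", line 58, in <module>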
torch.distributed.init_process_group('nccl', init_method='env://')
File "D:\acada\envs\py36\lib\site-packages\torch\distributed\distributed_c10d.py", line 434, in init_process_group
init_method, rank, world_size, timeout=timeout
File "D:\acada\envs\py36\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

The re_ranking algorithm can improve the performance on the HID, but lower for both CASIA-B and OUMVLP?

I found that the re_ranking algorithm in evaluation.py can improve the performance on the HID dataset, but it doesn't help on CASIA-B and OUMVLP. How can this be explained?

The symlinks created by rearrange_OUMVLP do not seem to work in the pretreatment phase

The symlinks created by rearrange_OUMVLP do not seem to work in the pretreatment phase: it fails with a file-not-found error.

What does 'seqL' mean?

Hi,

OpenGait/lib/modeling/modules.py

Line 61 in 49cbc44

if seqL is None:

I guess it means the length of the sequence. What does seqL[0] mean? By the way, what is the shape of seqL, and what does each of its dimensions correspond to?

The problem about distributed

Your code uses distributed training. When I modified the model, the following problem appeared: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable)
However, I am sure that everything returned by forward is used to compute the loss. Is this related to distributed training?
Can your code be run without distributed training, without modifying it?

Problem reproducing GaitGL

System information (version)

Pytorch => ❔
Operating System / Platform => ❔
Cuda => ❔

Detailed description

Steps to reproduce

Issue submission checklist

I checked the problem with documentation, FAQ, issues, and have not found solution.

Model fails to test: IndexError: boolean index did not match indexed array along dimension 0;

hello, we find when we use 2 gpus to test the GaitGL, we will get an error, just like this

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 12345 --nproc_per_node=2 lib/main.py --cfgs ./config/gaitgl.yaml --iter 80000 --phase test

we get an error like this

File "/OpenGait/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485

It seems that the output feature is shorter than the input. We have checked other test code (GaitSet and GLN), which works well when tested on only 2 GPUs.

Because we only have two gpus, then how can we test the GaitGL on 2 gpus? Thank you very much!
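One reading that is consistent with the numbers in the error: if the sampler pads the 5485 test sequences up to a multiple of the batch size (4 in the configs logged elsewhere on this page) and splits them evenly across 2 processes, each process holds 2744 features, which is exactly the dimension reported. That would suggest the features were not gathered from both ranks before evaluation. The padding rule and helper names below are assumptions, not quotes from the sampler.

```python
import math


def padded_size(num_sequences, batch_size):
    """Round the dataset size up to a whole number of batches."""
    return math.ceil(num_sequences / batch_size) * batch_size


def per_rank_count(num_sequences, batch_size, world_size):
    """Number of sequences each process handles after padding."""
    return padded_size(num_sequences, batch_size) // world_size


total_padded = padded_size(5485, batch_size=4)             # 5488
one_rank = per_rank_count(5485, batch_size=4, world_size=2)  # 2744
```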

What does this function do?

OpenGait/lib/modeling/modules.py

Line 34 in 49cbc44

class SetBlockWrapper(nn.Module):

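Is the final `return x.view(*_)` of this function equivalent to `return x.view(n, s, c, h, w)`?
If so, the function changes nothing. What does it actually do? I do not understand it and would appreciate some guidance.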
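A wrapper of this kind typically folds the sequence dimension into the batch dimension so that an ordinary 2D block can process every frame, then unfolds it again. The shape-only sketch below illustrates that idea, assuming this is what the class does. The helper names and the toy block, which changes channels and halves h and w, are hypothetical. Under that assumption the final view restores only n and s, while c, h and w are whatever the block produced.

```python
def fold_shape(shape):
    """[n, s, c, h, w] -> [n*s, c, h, w]: treat every frame as a batch item."""
    n, s, c, h, w = shape
    return (n * s, c, h, w)


def block_2d_shape(shape, out_channels, pool=2):
    """Hypothetical 2D block that changes channels and pools h and w."""
    ns, _, h, w = shape
    return (ns, out_channels, h // pool, w // pool)


def unfold_shape(block_out_shape, n, s):
    """[n*s, c', h', w'] -> [n, s, c', h', w']: restore the sequence axis."""
    return (n, s) + tuple(block_out_shape[1:])


x_shape = (4, 30, 1, 64, 44)                      # made-up input size
folded = fold_shape(x_shape)                      # (120, 1, 64, 44)
processed = block_2d_shape(folded, out_channels=32)
restored = unfold_shape(processed, n=4, s=30)     # (4, 30, 32, 32, 22)
```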

model config and output discussion

System information (version)

Pytorch => ❔
Operating System / Platform => ❔
Cuda => ❔

Detailed description

Thank you for all the work! I am relatively new to this, so sorry for my many questions! I am trying to build an ensemble of all the models. How would I go about configuring the model_cfg? Also, which variables would I need to extract from the retval dictionary if I were to perform inference on new data? Thank you in advance!

Steps to reproduce

Issue submission checklist

I checked the problem with documentation, FAQ, issues, and have not found solution.
