shiqiyu / opengait

A flexible and extensible framework for gait recognition. You can focus on designing your own models and comparing with state-of-the-arts easily with the help of OpenGait.

Python 99.22% Shell 0.78%

opengait's People

Contributors

cclauss, chaofan996, chuanfushen, darkliang, davidlee528, gosiqueira, happydeeplearning, jackhanyuan, jdyjjj, shenchuanfu, wj1tr0y, zhouzi180

opengait's Issues

raise ArgumentError ctypes.ArgumentError

I used OpenGait's GaitSet, added my own network, and modified the network's output features:
    final_emb = final_emb.view(n, final_y, -1)  # [n, 128, _]
    n, s, _, h, w = sils.size()
    retval = {
        'training_feat': {
            'triplet': {'embeddings': final_emb, 'labels': labs}
        },
        'visual_summary': {
            'image/sils': sils.view(n*s, 1, h, w)
        },
        'inference_feat': {
            'embeddings': final_emb
        }
    }
However, when I train the model with DDP, the following error occurs:
File "/project/code/OpenGait/lib/modeling/losses/base.py", line 23, in inner raise ArgumentError
ctypes.ArgumentError
    @functools.wraps(func)
    def inner(*args, **kwds):
        try:
            for k, v in kwds.items():
                kwds[k] = ddp_all_gather(v)

            loss, loss_info = func(*args, **kwds)
            loss *= torch.distributed.get_world_size()
            return loss, loss_info
        except:
            raise ArgumentError
    return inner

What causes this problem, and how can it be solved?
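The bare `except:` in the wrapper quoted above replaces whatever actually went wrong (for instance a shape error inside the loss after the feature layout was changed) with an uninformative `ArgumentError`. A minimal sketch of a variant that keeps the original exception follows. It is not OpenGait's code: `gather` and `world_size` are hypothetical injected stand-ins for `ddp_all_gather` and `torch.distributed.get_world_size`, so that the example runs without a process group, and `toy_loss` is invented for illustration.

```python
import functools


def gather_and_scale(gather, world_size):
    """Build a loss decorator; `gather` and `world_size` are injected callables."""
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwds):
            # Gather every keyword input across ranks before computing the loss.
            gathered = {k: gather(v) for k, v in kwds.items()}
            try:
                loss, loss_info = func(*args, **gathered)
            except Exception as exc:
                # Chain the original error instead of hiding it.
                raise RuntimeError(f"{func.__name__} failed: {exc!r}") from exc
            return loss * world_size(), loss_info
        return inner
    return decorator


# Single-process stand-ins (assumptions for illustration only).
@gather_and_scale(gather=lambda v: v, world_size=lambda: 1)
def toy_loss(embeddings, labels):
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels differ in length")
    return float(len(embeddings)), {"count": len(labels)}
```

With this shape of wrapper, the traceback would end in the real cause (here a `ValueError`) rather than in `ctypes.ArgumentError`, which usually makes it easier to see which tensor dimension the loss did not expect.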

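PyCharm debug: single-step debugging

Hello, I connect from Windows to a remote CentOS server and want to step through the code in a debugger. I don't know how to configure PyCharm's Run/Debug configurations, because the code uses distributed training and the shell command carries extra environment variables and arguments: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 lib/main.py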
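One possible approach, sketched under assumptions: with `init_method='env://'`, PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment, and the launch utility normally sets them. For step debugging, `lib/main.py` could be run directly as a single process with those variables placed in the Run/Debug configuration. The helper name, the port number, and the single-GPU choice below are hypothetical; the script may also expect `--local_rank=0` among its parameters, since the launcher passes that argument.

```python
def single_process_ddp_env(master_addr="127.0.0.1", master_port=29500, gpu_ids="0"):
    """Environment for debugging a one-process 'distributed' run (illustrative)."""
    return {
        "CUDA_VISIBLE_DEVICES": gpu_ids,   # expose a single GPU to the debugger run
        "MASTER_ADDR": master_addr,        # rendezvous address read by env://
        "MASTER_PORT": str(master_port),   # any free port
        "RANK": "0",                       # the only process is rank 0
        "WORLD_SIZE": "1",                 # one process in total
    }


env = single_process_ddp_env()
# Print KEY=value pairs that can be copied into the Run/Debug configuration.
for key, value in env.items():
    print(f"{key}={value}")
```

Debugging one process is usually enough to inspect model code; behaviour that depends on several ranks would still need the real launcher.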

GREW pretreatment `to_pickle` has size 0

System information (version)

  • Pytorch => 1.11
  • Operating System / Platform => Ubuntu 20.04
  • Cuda => 11.3

Detailed description

I'm trying to run GREW pretreatment code but it generates no GREW-pkl folder at the end of the process.
I debugged myself and checked if the --dataset flag is set properly and the to_pickle list size before saving the pickle file.
The flag is well set but the size of the list is always 0.

I downloaded the GREW dataset from the link you sent me and created the GREW-rearranged folder using the code provided.
I'll keep investigating what is causing this error, and if I find it I'll submit a PR with a fix.

Issue submission checklist

  • I checked the problem with documentation, FAQ, issues, and have not found solution.

Implementation details of the GLN network

Detailed description

The paper uses a 1×1 convolution in the lateral connections, but the code uses a 3×3 kernel. Does this implementation perform better? I am simply curious.

Then at each stage, a 1 × 1 convolutional layer is taken to rearrange the features and adjust the channel dimension

lateral_layer code:

    self.lateral_layer1 = nn.Conv2d(
        in_channels[1]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
    self.lateral_layer2 = nn.Conv2d(
        in_channels[2]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
    self.lateral_layer3 = nn.Conv2d(
        in_channels[3]*2, lateral_dim, kernel_size=3, stride=1, padding=1, bias=False)
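To make the difference between the two kernel sizes concrete, here is a small sketch that counts weights and checks the output size for a bias-free convolution. The channel sizes are hypothetical placeholders, not values from the GLN config, and the sketch says nothing about which choice gives better accuracy.

```python
def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size


def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size along one dimension (standard convolution arithmetic)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Hypothetical sizes for illustration only.
in_ch, lateral_dim = 128, 256
one_by_one = conv2d_weight_count(in_ch, lateral_dim, 1)
three_by_three = conv2d_weight_count(in_ch, lateral_dim, 3)
print(one_by_one, three_by_three, three_by_three // one_by_one)
```

Both variants keep the spatial size (kernel 3 with padding 1, or kernel 1 with padding 0, at stride 1), so they are interchangeable in terms of shapes. The 3×3 version has nine times as many weights and mixes neighbouring positions, whereas the 1×1 version only mixes channels.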

Error occurs when testing GaitGL on the CASIA-B dataset


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[2021-11-13 15:04:55] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 80000, 'save_name': 'GaitGL', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler'}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc', 'enable_distributed': True}
[2021-11-13 15:04:55] [INFO]: {'model': 'GaitGL', 'channels': [32, 64, 128], 'class_num': 74}
[2021-11-13 15:04:55] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': '/media/yangxilab/DiskB1/wxian/dataset/CASIA-B/CASIA-B-pkl/', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B'}
[2021-11-13 15:04:55] [INFO]: -------- Test Pid List --------
[2021-11-13 15:04:55] [INFO]: [075, 076, ..., 124]
[2021-11-13 15:04:56] [INFO]: Restore Parameters from output/CASIA-B/GaitGL/GaitGL/checkpoints/GaitGL-80000.pt !!!
[2021-11-13 15:04:58] [INFO]: Parameters Count: 3.09667M
[2021-11-13 15:04:58] [INFO]: Model Initialization Finished!
Transforming: 100%|██████████| 5485/5485 [01:24<00:00, 64.84it/s]
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 51, in run_model
Model.run_test(model)
File "/media/yangxilab/DiskB1/wxian/dataset/OpenGait-master/lib/modeling/base_model.py", line 417, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/media/yangxilab/DiskB1/wxian/dataset/OpenGait-master/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485

Testing works for other models such as GaitSet, but fails for GaitGL. I don't know why.
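The numbers in the traceback allow a quick arithmetic check. The sketch below is only an observation, not a confirmed diagnosis: with 5485 test sequences and the sampler's `batch_size` of 4, there are 1372 batches, and 2744 is exactly two feature rows per batch. That pattern would fit a model returning fewer features than the sequences it received in each batch. The helper `check_feature_count` is hypothetical and shows how the mismatch could be reported before the boolean indexing fails.

```python
import math


def check_feature_count(num_features, num_sequences):
    """Fail early, with a readable message, if the evaluator inputs disagree."""
    if num_features != num_sequences:
        raise ValueError(
            f"got {num_features} feature rows for {num_sequences} test sequences"
        )


num_sequences = 5485   # from the progress bar in the log
batch_size = 4         # 'batch_size': 4 in the sampler config
num_batches = math.ceil(num_sequences / batch_size)
rows_if_two_per_batch = 2 * num_batches  # equals the 2744 in the IndexError
print(num_batches, rows_if_two_per_batch)
```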

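Question about the GaitSet learning rate

    optimizer_cfg:
      lr: 0.1
      momentum: 0.9
      solver: SGD
      weight_decay: 0.0005

I would like to ask the authors why the learning rate is set to 0.1, and how the regularization (weight decay) was chosen.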
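For readers unsure how these three numbers interact, here is a plain-Python sketch of one SGD update with momentum and L2 weight decay, following the usual formulation (weight decay added to the gradient, then a momentum buffer, then the step). It illustrates the update rule only. It does not explain why the authors chose 0.1, and the scalar parameter is an assumption made for readability.

```python
def sgd_step(param, grad, buf, lr=0.1, momentum=0.9, weight_decay=0.0005):
    """One SGD update for a scalar parameter; returns (new_param, new_buffer)."""
    g = grad + weight_decay * param   # L2 regularization enters as an extra gradient term
    buf = momentum * buf + g          # momentum buffer accumulates past gradients
    param = param - lr * buf          # the learning rate scales the final step
    return param, buf


# With a zero data gradient, only weight decay moves the parameter.
p, b = sgd_step(param=1.0, grad=0.0, buf=0.0)
print(p, b)
```

With `weight_decay: 0.0005` and `lr: 0.1`, a weight with no data gradient shrinks by a factor of about 5e-5 in the first step, so the regularization is mild compared with typical gradient magnitudes.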

`pretreatment.py` causes deadlock errors on the OUMVLP dataset

System information (version)

  • Pytorch => 1.10
  • Operating System / Platform =>Ubuntu 20.04
  • Cuda => Cuda 11.5

Detailed description

pretreatment.py for the OUMVLP dataset results in random deadlocks.
Currently you make asynchronous calls using a thread pool and store the results in an ordinary Python list. As a result, when I run the code to convert OUMVLP files to pickles, random deadlocks occur during execution.

I forked the repository and am trying to solve this myself so I can open a PR, but I believe it is relevant to keep you informed of the problem.
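One common way to make pooled work easier to diagnose is to keep a handle on every submitted task and collect results as they complete, so that a failing task surfaces as an exception instead of being lost. The sketch below uses the standard library's `concurrent.futures`. It is not the project's `pretreatment.py` and is not claimed to be the fix for this deadlock. The function name `convert_all` and the toy conversion are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_all(items, convert, max_workers=4):
    """Run `convert` on every item; return (results, failures) keyed by item."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its input so errors can be attributed.
        futures = {pool.submit(convert, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                failures[item] = exc
    return results, failures


# Toy conversion: the item 0 fails on purpose.
results, failures = convert_all([1, 2, 0], lambda x: 1 / x)
print(results, failures)
```

The `with` block waits for all tasks before returning, and every failure is recorded next to the input that caused it, which helps separate a genuine hang from a task that died silently.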

Steps to reproduce

Steps:

  1. Execute the code for OUMVLP dataset using personal machine processor. The error occurs in a non-deterministic fashion.

Issue submission checklist

  • I checked the problem with documentation, FAQ, issues, and have not found solution.

Training stuck sometimes

When training gait recognition in the wild (for example on GREW, HID, or Gait3D), the training process sometimes gets stuck, probably because of the sampler.

Problem about the given GaitGL-OUMVLP model

We tested the provided GaitGL model directly on the OUMVLP dataset, but it does not achieve 89.9% Rank@1 (excluding identical-view cases).

The results are:
===Rank-1 (Include identical-view cases)===
NM: 89.914
===Rank-1 (Exclude identical-view cases)===
NM: 89.502

and the results for each angle are:

[84.33692308 89.85384615 91.12230769 91.42153846 90.89076923 90.58769231
90.18384615 88.3 88.05384615 90.27461538 90.38692308 89.58
89.40307692 88.63846154].

We notice that the value of Rank@1(Including identical-view cases) is 89.9%, is there a transcription error?
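As a sanity check on the figures above, averaging the 14 per-angle values reproduces the "exclude identical-view" number. The sketch below only restates the posted results. It does not settle which figure is meant to be 89.9%.

```python
# Per-angle Rank-1 values copied from the report above.
per_angle = [
    84.33692308, 89.85384615, 91.12230769, 91.42153846, 90.89076923,
    90.58769231, 90.18384615, 88.3, 88.05384615, 90.27461538,
    90.38692308, 89.58, 89.40307692, 88.63846154,
]

mean_rank1 = sum(per_angle) / len(per_angle)
print(round(mean_rank1, 3))
```

So the 89.502 figure is consistent with a plain mean over the 14 probe angles, while 89.914 matches the value reported with identical-view cases included.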

Loss info unexpected result

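The log outputs unexpected loss info. In the example below, the softmax_hard_loss, softmax_loss_num, and softmax_mean_dist entries repeat the triplet values: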

[2021-12-09 14:51:09] [INFO]: Iteration 00100, Cost 61.51s, triplet_loss=0.1941, triplet_hard_loss=0.6903, triplet_loss_num=189226.8594, triplet_mean_dist=0.2503, softmax_loss=3.9386, softmax_hard_loss=0.6903, softmax_loss_num=189226.8594, softmax_mean_dist=0.2503, softmax_accuracy=0.1113, triplet_accuracy=0.1087
[2021-12-09 14:52:10] [INFO]: Iteration 00200, Cost 60.51s, triplet_loss=0.1943, triplet_hard_loss=1.1024, triplet_loss_num=98941.8203, triplet_mean_dist=0.7868, triplet_accuracy=0.2896, softmax_loss=3.0870, softmax_hard_loss=1.1024, softmax_loss_num=98941.8203, softmax_mean_dist=0.7868, softmax_accuracy=0.2903

Problem when running GaitGL with the provided weights


[2021-12-02 19:53:48] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 80000, 'save_name': 'GaitGL', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler'}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc', 'enable_distributed': True}
[2021-12-02 19:53:48] [INFO]: {'model': 'GaitGL', 'channels': [32, 64, 128], 'class_num': 74}
[2021-12-02 19:53:48] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': 'xxxxxi/datasets/Gait/CASIA-B/opengait-64', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B', 'pid_num': 74}
[2021-12-02 19:53:48] [INFO]: -------- Test Pid List --------
[2021-12-02 19:53:48] [INFO]: [075, 076, ..., 124]
[2021-12-02 19:53:53] [INFO]: Restore Parameters from xxxxxi/GaitGL/GaitGL/checkpoints/GaitGL-80000.pt !!!
[2021-12-02 19:53:53] [INFO]: Parameters Count: 3.09667M
[2021-12-02 19:53:53] [INFO]: Model Initialization Finished!
Transforming: 100%|█████████████████████████| 5485/5485 [01:05<00:00, 84.04it/s]
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 51, in run_model
Model.run_test(model)
File "/home/xxxxxi/project/Gait/OpenGait/lib/modeling/base_model.py", line 423, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/home/xxxxxi/project/Gait/OpenGait/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 230780) of binary: /usr/local/anaconda3/bin/python3.8
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxxxxi/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


       lib/main.py FAILED          

=======================================
Root Cause:
[0]:
time: 2021-12-02_19:55:02
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 230780)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
<NO_OTHER_FAILURES>


Why is the loss multiplied by the world size?

Thank you for making this excellent work open source.
As written in the title, could you please explain the reason behind scaling the loss by the world size?
I didn't find such practice in other distributed training works so far!
Thank you again.
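One common explanation, offered here as an assumption rather than something this thread confirms: when features are all-gathered so that every rank computes the loss over the global batch, gradients typically flow back only through each rank's own shard. DistributedDataParallel then averages gradients across ranks, which leaves the result a factor of `world_size` smaller than the full-batch gradient. The toy calculation below, with a hypothetical one-parameter model and two simulated ranks, shows that rescaling by the world size restores the full gradient.

```python
def local_grad(w, local_xs, total_count):
    """Gradient w.r.t. w of (1/N) * sum_j (w * x_j)^2, restricted to local samples."""
    return sum(2.0 * w * x * x for x in local_xs) / total_count


w = 0.5
shards = [[1.0, 2.0], [3.0, 4.0]]            # samples held by two simulated ranks
world_size = len(shards)
total = sum(len(shard) for shard in shards)

# Reference: gradient of the global-batch loss computed in one process.
full_grad = sum(2.0 * w * x * x for shard in shards for x in shard) / total

# Each rank only back-propagates through its own samples...
per_rank = [local_grad(w, shard, total) for shard in shards]
# ...and DDP-style averaging divides the sum of those pieces by world_size.
ddp_averaged = sum(per_rank) / world_size

# Scaling the loss by world_size scales the gradient by the same factor.
rescaled = ddp_averaged * world_size
print(full_grad, ddp_averaged, rescaled)
```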

Train GaitPart on OU-MVLP

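Hello, and thank you for your excellent work.
My GaitPart results on CASIA-B are close to yours. However, when training on OU-MVLP with one 1080 Ti and one GTX TITAN, with batch_size set to (24, 16), accuracy only reaches about 79%, just under 80%, after 350k iterations.
After changing the dataset path in the config file, can your code train on OUMVLP directly, without modifying any other part of the code?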

Can the code run in Windows 11 anaconda environment?

System information (version)

  • Pytorch = 1.6.0
  • Operating System / Platform = Windows 11
  • Cuda = 10.1.243

Detailed description

The problem appears when I run the code on my computer with the system information above. I created a new conda env and cloned this code into it, but I cannot run the code. The errors are below:
(opengait) C:\Users\Lenovo\OpenGait>set CUDA_VISIBLE_DEVICES=1 & python -m torch.distributed.launch --nproc_per_node=2 lib/main.py --cfgs ./config/baseline.yaml --phase train


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
Traceback (most recent call last):
File "lib/main.py", line 58, in <module>
File "lib/main.py", line 58, in <module>
torch.distributed.init_process_group('nccl', init_method='env://')
torch.distributed.init_process_group('nccl', init_method='env://')
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\site-packages\torch\distributed\launch.py", line 261, in <module>
main()
File "C:\Users\Lenovo\anaconda3\envs\opengait\lib\site-packages\torch\distributed\launch.py", line 256, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['C:\Users\Lenovo\anaconda3\envs\opengait\python.exe', '-u', 'lib/main.py', '--local_rank=1', '--cfgs', './config/baseline.yaml', '--phase', 'train']' returned non-zero exit status 1.
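An `AttributeError` on `torch.distributed.init_process_group` is what one would expect from a PyTorch build without distributed support, in which case `torch.distributed.is_available()` returns `False`. In addition, the NCCL backend is generally not available on Windows builds. The sketch below shows one hedged way to choose a backend before calling `init_process_group`. `pick_backend` and `probe_torch` are hypothetical helper names, and falling back to `"gloo"` is an assumption that may not suit every setup.

```python
def pick_backend(dist_available, nccl_available):
    """Return a backend name for init_process_group, or None if none is usable."""
    if not dist_available:
        return None          # this build has no torch.distributed support at all
    if nccl_available:
        return "nccl"        # preferred for multi-GPU training where supported
    return "gloo"            # assumed fallback when NCCL is missing


def probe_torch():
    """Query the installed PyTorch build; requires torch to be importable."""
    import torch.distributed as dist
    if not dist.is_available():
        return False, False
    return True, dist.is_nccl_available()


# Example decisions for three kinds of builds.
print(pick_backend(False, False), pick_backend(True, False), pick_backend(True, True))
```

In a real script the result of `probe_torch()` would be passed to `pick_backend`, and a `None` result would mean that distributed training cannot start with the installed build.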

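Question about how accuracy is computed

I would like to ask about the accuracy of gait recognition models like this one (for example, when testing on CASIA-B). It seems to be reported per view angle, with NM, BG, and CL each giving accuracy for 11 angles. Are these accuracies averages over all test samples? I assumed accuracy would be computed from the number of test samples. (I have only just started studying the GaitSet paper, so my understanding is still limited.)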
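As a sketch of how per-angle figures are commonly produced in cross-view evaluation (an assumption about the general protocol, not a quote of OpenGait's evaluator): accuracy is first computed for every pair of probe view and gallery view, and each probe view's reported value is the mean over gallery views with the identical view left out. The 3×3 matrix below is invented for illustration. CASIA-B would have 11 views.

```python
def mean_excluding_identical_view(acc):
    """acc[i][j]: accuracy for probe view i against gallery view j (square matrix).

    Returns, for each probe view, the mean over gallery views j != i.
    """
    n = len(acc)
    return [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(acc)]


# Hypothetical accuracies for three views.
acc = [
    [100.0, 50.0, 70.0],
    [60.0, 100.0, 80.0],
    [90.0, 30.0, 100.0],
]
per_view = mean_excluding_identical_view(acc)
overall = sum(per_view) / len(per_view)
print(per_view, overall)
```

Under this scheme the headline number is a mean of per-view means, so it equals a per-sample average only when every view pair contains the same number of probe sequences.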

OUMVLP test phase: Error(s) in loading state_dict for Baseline

System information (version)

  • torch 1.8.0
  • Cuda 11.1

Detailed description

Distributed init (rank 0): env://
[2022-03-10 11:42:57] [INFO]: {'enable_float16': False, 'restore_ckpt_strict': True, 'restore_hint': 150000, 'save_name': 'Baseline', 'eval_func': 'identification', 'sampler': {'batch_size': 4, 'sample_type': 'all_ordered', 'type': 'InferenceSampler', 'batch_shuffle': False, 'frames_all_limit': 720}, 'transform': [{'img_w': 64, 'type': 'BaseSilCuttingTransform'}], 'metric': 'euc'}
[2022-03-10 11:42:57] [INFO]: {'model': 'Baseline', 'backbone_cfg': {'in_channels': 1, 'layers_cfg': ['BC-32', 'BC-32', 'M', 'BC-64', 'BC-64', 'M', 'BC-128', 'BC-128', 'BC-256', 'BC-256'], 'type': 'Plain'}, 'SeparateFCs': {'in_channels': 256, 'out_channels': 256, 'parts_num': 31}, 'SeparateBNNecks': {'class_num': 5153, 'in_channels': 256, 'parts_num': 31}, 'bin_num': [16, 8, 4, 2, 1]}
[2022-03-10 11:42:58] [INFO]: {'dataset_name': 'OUMVLP', 'dataset_root': 'OUMVLP-pkl', 'dataset_partition': './misc/partitions/OUMVLP.json', 'num_workers': 1, 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'OUMVLP'}

[2022-03-10 11:42:58] [INFO]: -------- Test Pid List --------
[2022-03-10 11:42:58] [INFO]: [00002, 00004, ..., 00022]
Traceback (most recent call last):
File "D:/OpenGait/lib/my_main.py", line 111, in <module>
mp.spawn(main,
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
while not context.join():
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
fn(i, *args)
File "D:\OpenGait\lib\my_main.py", line 104, in main
run_model(cfgs, training)
File "D:\OpenGait\lib\my_main.py", line 64, in run_model
model = Model(cfgs, training)
File "D:\OpenGait\lib\modeling\base_model.py", line 168, in __init__
self.resume_ckpt(restore_hint)
File "D:\OpenGait\lib\modeling\base_model.py", line 289, in resume_ckpt
self._load_ckpt(save_name)
File "D:\OpenGait\lib\modeling\base_model.py", line 262, in _load_ckpt
self.load_state_dict(model_state_dict, strict=load_ckpt_strict)
File "D:\Installations\ComputerScience\Python\Anaconda3\envs\gait\lib\site-packages\torch\nn\modules\module.py", line 1482, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Baseline:
size mismatch for BNNecks.fc_bin: copying a param with shape torch.Size([31, 256, 5154]) from checkpoint, the shape in current model is torch.Size([31, 256, 5153]).
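The mismatch is in the last dimension of `BNNecks.fc_bin`: the model is built with `class_num` 5153 from the logged config, while the checkpoint holds 5154. A small hedged sketch for listing such mismatches before calling `load_state_dict` is shown below. `find_shape_mismatches` is a hypothetical helper working on plain shape tuples, so it runs without PyTorch.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes taken from the error message above.
ckpt = {"BNNecks.fc_bin": (31, 256, 5154)}
model = {"BNNecks.fc_bin": (31, 256, 5153)}
print(find_shape_mismatches(ckpt, model))
```

With real state dicts, the inputs could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. Whether the right remedy is to change `class_num` in the config or to use a different checkpoint depends on how the checkpoint was trained, which the log does not show.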

A little bug

result_dict["scalar/test_accuracy/BG"] = acc[0, :, :, i]

We may have found a small bug here: it makes the BG value the same as the NM value when with_test is set to true. I don't know how to submit my code, so I can only report it this way.

Training GaitGL on OUMVLP with the default config is too slow to tolerate

I tried GaitGL on the OUMVLP dataset, with 6000 IDs for training and the rest for testing, using the default config on 8 Tesla V100 GPUs. It costs 165 seconds per 100 iterations. If I want 10 epochs, time = 10 (epochs) x 130000 (seqs) / 8 (bs) / 100 x 165 s = 3.1 days.
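The estimate can be checked with a few lines of arithmetic. The sketch below only re-does the poster's calculation with the poster's own figures. The function name is hypothetical.

```python
def estimated_days(epochs, num_sequences, batch_size, seconds_per_100_iters):
    """Training time in days for a fixed per-iteration cost."""
    iterations = epochs * num_sequences / batch_size
    seconds = iterations / 100 * seconds_per_100_iters
    return seconds / 86400  # seconds per day


days = estimated_days(epochs=10, num_sequences=130000, batch_size=8,
                      seconds_per_100_iters=165)
print(round(days, 1))
```

This corresponds to 162,500 iterations and about 268,000 seconds in total.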

I guess computing the triplet loss might cost a lot of time. I'm trying to modify the training procedure so that the triplet loss only takes effect in later epochs, while early epochs compute only the ID loss, namely the softmax loss.

Are there any other methods to speed up training?

Thanks.

CUDA error: an illegal memory access was encountered caused by new DDP strategy of Pytorch1.9+

Thanks for your great contributions to gait recognition community!!
I find this repo cannot run well on pytorch1.9 with CUDA11.1.
It raises an illegal memory access error when I use default baseline config to train model on CASIA-B.

Sometimes it runs normally for the first several steps (but full_loss_num is very small), and then the error is raised.

After setting find_unused_parameters to True, the error disappears and the training process is back to normal.

It looks like some parameters have been registered but are not used anywhere in the model, which triggers this error.
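A simple way to look for registered-but-unused parameters is to run one forward and backward pass without DDP and list every trainable parameter whose gradient is still `None`. The sketch below is duck-typed so that it runs without PyTorch: it accepts any iterable of `(name, param)` pairs whose items expose `requires_grad` and `grad`, which is the shape of what `model.named_parameters()` yields. The helper name and the fake parameters are illustrative.

```python
from types import SimpleNamespace


def params_without_grad(named_params):
    """Names of trainable parameters that received no gradient in backward()."""
    return [name for name, p in named_params if p.requires_grad and p.grad is None]


# Stand-ins for parameters after a backward pass (illustrative values).
fake_params = [
    ("used.weight", SimpleNamespace(requires_grad=True, grad=0.1)),
    ("unused.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("frozen.weight", SimpleNamespace(requires_grad=False, grad=None)),
]
print(params_without_grad(fake_params))
```

With a real model this would be called as `params_without_grad(model.named_parameters())` right after `loss.backward()`. Any names it reports are candidates for removal, or a reason to keep `find_unused_parameters=True`.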

Unable to find a valid cuDNN algorithm to run convolution

I tested on a laptop with only one GPU, so I used:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 lib/main.py --cfgs ./config/gaitset.yaml --phase train
It reports this error:
Traceback (most recent call last):
File "lib/main.py", line 66, in <module>
run_model(cfgs, training)
File "lib/main.py", line 49, in run_model
Model.run_train(model)
File "/home/yq/OpenGait/lib/modeling/base_model.py", line 371, in run_train
model.train_step(loss_sum)
File "/home/yq/OpenGait/lib/modeling/base_model.py", line 307, in train_step
self.Scaler.scale(loss_sum).backward()
File "/home/yq/anaconda3/envs/pt14/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yq/anaconda3/envs/pt14/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
Exception raised from try_all at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:692 (most recent call first):
What is going on here?

Data Sets

Thank you very much for the code provided. I have a question: how can I obtain or preprocess these datasets, "HID2021" and '0001-1000'?

    gallery_seq_type = {'0001-1000': ['1', '2'],
                        "HID2021": ['0'], '0001-1000-test': ['0']}
    probe_seq_type = {'0001-1000': ['3', '4', '5', '6'],
                      "HID2021": ['1'], '0001-1000-test': ['1']}

How to test with Baseline-60000.pt

My run command produces no response:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 lib/main.py --cfgs ./config/baseline.yaml --phase train
How should baseline.yaml be configured?

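A question about BatchNorm

I see in the code that many of the networks do not use BatchNorm, yet the results are close to, or even higher than, those in the papers. When I experimented with adding BatchNorm, the results actually got worse. What could be the reason?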

How to fix this problem when I want to test

System information (version)

  • Pytorch => ❔
  • Operating System / Platform => ❔
  • Cuda => ❔

Detailed description

image

Steps to reproduce

Issue submission checklist

  • I checked the problem with documentation, FAQ, issues, and have not found solution.

Error when testing with the baseline model

System information (version)

Detailed description

[2022-02-17 12:42:57] [INFO]: {'dataset_name': 'CASIA-B', 'dataset_root': './dataset_output', 'num_workers': 1, 'dataset_partition': './misc/partitions/CASIA-B_include_005.json', 'remove_no_gallery': False, 'cache': False, 'test_dataset_name': 'CASIA-B'}
[2022-02-17 12:42:57] [INFO]: -------- Test Pid List --------
[2022-02-17 12:42:57] [INFO]: ['075']
[2022-02-17 12:43:00] [INFO]: Restore Parameters from output/CASIA-B/Baseline/Baseline/checkpoints/Baseline-60000.pt !!!
[2022-02-17 12:43:00] [INFO]: Parameters Count: 3.77914M
[2022-02-17 12:43:00] [INFO]: Model Initialization Finished!
Transforming: 100%|██████████| 110/110 [00:01<00:00, 76.67it/s]
Traceback (most recent call last):
File "lib/main.py", line 70, in <module>
run_model(cfgs, training)
File "lib/main.py", line 55, in run_model
Model.run_test(model)
File "/home/lalaland/PycharmProjects/OpenGait_/lib/modeling/base_model.py", line 470, in run_test
return eval_func(info_dict, dataset_name, **valid_args)
File "/home/lalaland/PycharmProjects/OpenGait_/lib/utils/evaluation.py", line 79, in identification
0) * 100 / dist.shape[0], 2)
ValueError: could not broadcast input array from shape (4,) into shape (5,)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10590) of binary: /home/lalaland/anaconda3/envs/torch/bin/python
Traceback (most recent call last):
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lalaland/anaconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

lib/main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-02-17_12:43:06
host : lalaland-Predator-PH317-55
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10590)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
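The log shows a single test subject ('075') and a shape mismatch of (4,) against (5,). One plausible reading, which is an assumption and not confirmed by the log, is that the evaluator asks for five ranks while only four gallery sequences are available per view, so only four sorted indices exist. The pure-Python sketch below shows a cumulative-match computation that clamps the rank count to the gallery size and pads the tail. The function and its arguments are hypothetical and do not reproduce `evaluation.py`.

```python
def cumulative_match(dist_row, gallery_labels, probe_label, num_rank=5):
    """Return num_rank booleans: is the probe's label found within the top k?"""
    k = min(num_rank, len(gallery_labels))   # never ask for more ranks than gallery items
    # Indices of the k nearest gallery items, closest first.
    order = sorted(range(len(dist_row)), key=dist_row.__getitem__)[:k]
    hits, found = [], False
    for idx in order:
        found = found or gallery_labels[idx] == probe_label
        hits.append(found)
    # Pad so the result always has num_rank entries.
    hits.extend([found] * (num_rank - k))
    return hits


# Four gallery items, five requested ranks.
hits = cumulative_match([0.9, 0.2, 0.5, 0.7], ["a", "b", "c", "d"], "c")
print(hits)
```

Testing with more subjects in the partition, so that each gallery view holds at least five sequences, would be another way to avoid the mismatch under the same assumption.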

Error when running misc/pretreatment.py on Windows

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
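The idiom named in the error message looks like this in practice. On Windows, child processes are started by re-importing the main module, so any code that creates a pool must sit behind the `__main__` guard. The worker function and pool size below are placeholders, not taken from `pretreatment.py`.

```python
import multiprocessing as mp


def square(x):
    """Work function; it must be defined at module level so children can import it."""
    return x * x


def main():
    # The pool is created only inside main(), never at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    mp.freeze_support()  # only needed for frozen executables; harmless otherwise
    print(main())
```

Wrapping the script's top-level pool creation in a function called from under this guard is the usual remedy for this RuntimeError.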

DistributedDataParallel Usage

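Hello, where do I change the number of GPUs and the world_size parameter for distributed training, and how do I resolve the error below? Which PyTorch version is recommended? Thanks!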
Traceback (most recent call last):
File "D:/OpenGait-1.0/OpenGait-1.0/lib/main.py", line 58, in <module>
torch.distributed.init_process_group('nccl', init_method='env://')
File "D:\acada\envs\py36\lib\site-packages\torch\distributed\distributed_c10d.py", line 434, in init_process_group
init_method, rank, world_size, timeout=timeout
File "D:\acada\envs\py36\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

The problem about distributed

Your code uses distributed training. When I modified the model, I ran into the following problem:
Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable)
However, I am sure that everything returned by forward is used to compute the loss. Is this related to distributed training?
Can your code run without distributed training, without any modification?

Problem reproducing GaitGL

System information (version)

  • Pytorch => ❔
  • Operating System / Platform => ❔
  • Cuda => ❔

Detailed description

Steps to reproduce

Issue submission checklist

  • I checked the problem with documentation, FAQ, issues, and have not found solution.

Model fails to test: IndexError: boolean index did not match indexed array along dimension 0;

Hello, we found that when we use 2 GPUs to test GaitGL, we get an error. The command is:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --master_port 12345 --nproc_per_node=2 lib/main.py --cfgs ./config/gaitgl.yaml --iter 80000 --phase test

and the error is:

File "/OpenGait/lib/utils/evaluation.py", line 67, in identification
gallery_x = feature[gseq_mask, :]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2744 but corresponding boolean dimension is 5485

It seems that the model outputs fewer features than there are input sequences. We have checked other test code (GaitSet and GLN), which works well when tested on 2 GPUs.

Since we only have two GPUs, how can we test GaitGL on 2 GPUs? Thank you very much!

model config and output discussion

System information (version)

  • Pytorch => ❔
  • Operating System / Platform => ❔
  • Cuda => ❔

Detailed description

Thank you for all the work! I am relatively new to this, so sorry for my many questions. I am trying to build an ensemble of all the models. How would I go about configuring the model_cfg? Also, which variables would I need to extract from the retval dictionary to perform inference on new data? Thank you in advance!
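As a starting point, the `retval` dictionary quoted in an earlier issue on this page stores the test-time features under `retval['inference_feat']['embeddings']`. The sketch below shows one simple, hypothetical way to fuse several models: L2-normalize each model's embedding and concatenate them. Plain Python lists stand in for tensors, and concatenation is only one of several ensemble strategies. It is not something the project prescribes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def ensemble_embedding(retvals):
    """Concatenate the normalized inference embeddings of several models."""
    fused = []
    for retval in retvals:
        fused.extend(l2_normalize(retval["inference_feat"]["embeddings"]))
    return fused


# Toy outputs from two hypothetical models for one sequence.
retvals = [
    {"inference_feat": {"embeddings": [3.0, 4.0]}},
    {"inference_feat": {"embeddings": [0.0, 2.0]}},
]
fused = ensemble_embedding(retvals)
print(fused)
```

Normalizing before concatenation keeps a model with larger feature magnitudes from dominating the distance computation. Each model would still need its own model_cfg and checkpoint.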

Steps to reproduce

Issue submission checklist

  • I checked the problem with documentation, FAQ, issues, and have not found solution.
