
jdai-cv / centerx

552 stars, 18 watchers, 86 forks, 196 KB

This repo is implemented based on detectron2 and centernet

License: Apache License 2.0

Languages: Python 98.03%, Shell 1.97%
Topics: centernet, detectron2, deep-learning, object-detection, fast-reid, caffe, onnx, tensorrt, centerx

centerx's People

Contributors

cpflame


centerx's Issues

How can I save checkpoints periodically? Why is the coco_eval AP on the val set always 0 during training? total_loss stays large

Hi author, I have a few questions:

  1. I noticed that the project only saves the model after training has fully finished. How can I save checkpoints periodically? I installed detectron2 via pip and then added a train function to DefaultTrainer in detectron2/engine/defaults.py (hoping to override TrainerBase.train). The code is below (it is TrainerBase.train() with one added print line plus the periodic-save code):
```python
    def train(self, start_iter: int, max_iter: int):
        """
        Args:
            start_iter, max_iter (int): See docs above
        """
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))
        import ipdb; ipdb.set_trace()
        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter

        with EventStorage(start_iter) as self.storage:
            try:
                self.before_train()
                print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', start_iter, max_iter)
                for self.iter in range(start_iter, max_iter):
                    self.before_step()
                    self.run_step()
                    self.after_step()
                    if self.iter % 100 == 0:
                        self.checkpointer.save("model_" + str(self.iter + 1))

                # self.iter == max_iter can be used by `after_train` to
                # tell whether the training successfully finished or failed
                # due to exceptions.
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()
```
However, the print output never appears and no checkpoint gets saved. What is the correct way to do this? (See the checkpointing sketch at the end of this issue.)

  1. Symptom: during training, the coco_eval AP on the val set is always 0.
    Environment: I use coco/centernet_res50_coco.yaml for an object detection task. The dataset is prepared in COCO format and trains and evaluates normally with xingyizhou's CenterNet project.
    Changes to cfg in centerX:
    cfg.DATASETS.TRAIN = ("table_aline_train",)
    cfg.DATASETS.TEST = ("table_aline_val",)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.MAX_ITER = 30
    cfg.OUTPUT_DIR = "./output/table_aline"
    cfg.SOLVER.IMS_PER_BATCH = 8
    cfg.SOLVER.BASE_LR = 0.00125
    cfg.INPUT.MAX_SIZE_TRAIN = 1024
    cfg.INPUT.MIN_SIZE_TRAIN = 512

In addition, in the main function I registered my dataset with register_coco_instances.
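
For reference, a typical registration looks like the sketch below (the JSON and image paths are hypothetical placeholders; only the dataset names match the cfg.DATASETS entries above):

```python
from detectron2.data.datasets import register_coco_instances

# hypothetical paths; only the dataset names correspond to cfg.DATASETS above
register_coco_instances("table_aline_train", {},
                        "datasets/table_aline/annotations/instances_train2017.json",
                        "datasets/table_aline/train2017")
register_coco_instances("table_aline_val", {},
                        "datasets/table_aline/annotations/instances_val2017.json",
                        "datasets/table_aline/val2017")
```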

I run the author's run.sh script on 2 GPUs.

train: 700+
val: 80+

The specific problem:
During training, the COCO evaluation results on the val set always look like this (every AP stays at 0):

[screenshot of the COCO evaluation output omitted]

After 2300+ iterations, total_loss dropped from 1281 to about 6.6. Many of the boxes produced at inference have scores close to 1, but their coordinates are far outside the image bounds (see the image sizes below), for example:
{"image_id": 7, "category_id": 1, "bbox": [-120932.8515625, -51244.3125, 250420.453125, 95695.1640625], "score": 1.0}, {"image_id": 7, "category_id": 1, "bbox": [-146367.90625, -59846.8046875, 301889.0625, 119286.0078125], "score": 1.0}

Debugging I have tried:
Comparing total_loss with training on the original CenterNet (where the loss converges to about 0.8), I suspect the bboxes loaded by the dataloader are wrong, so I printed the dataset information. For example:
In CenterNet.forward() in centerX/modeling/meta_arch/centernet.py I printed batched_inputs[0] and got the following:
{'file_name': '/mnt/maskrcnn-benchmark/datasets/table_aline/train2017/d-27.png', 'height': 2339, 'width': 1654, 'image_id': 174, 'image': tensor([[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]]]), 'instances': Instances(num_instances=2, image_height=723, image_width=512, fields=[gt_boxes: Boxes(tensor([[ 16.7869,  44.9777, 473.3902, 106.7382],
    [ 15.7797, 415.2047, 476.4118, 686.4136]])), gt_classes: tensor([0, 0])])}

In the annotations file, the corresponding annotations are:
{"category_id": 1, "id": 317, "image_id": 174, "iscrowd": 0, "segmentation": [[137.76953125, 1297.650390625, 1509.9000000000015, 1297.650390625, 1509.9000000000015, 2105.5, 137.76953125, 2105.5]], "area": 1108576.0, "bbox": [138.0, 1298.0, 1372.0, 808.0]}
{"category_id": 1, "id": 316, "image_id": 174, "iscrowd": 0, "segmentation": [[146.541015625, 194.87890625, 1507.0552978515625, 194.87890625, 1507.0552978515625, 379.3728790283203, 146.541015625, 379.3728790283203]], "area": 250240.0, "bbox": [147.0, 195.0, 1360.0, 184.0]},

By calculation, height/image_height ≈ width/image_width.
However, the original gt boxes (in x1,y1,x2,y2 format: [138, 1298, 1510, 2106] and [147, 195, 1507, 379]) and the boxes in batched_inputs do not follow that same ratio. Is this normal?
Surprisingly, when I uncommented the drawing code in the generate function in centerX/modeling/layers/centernet_gt.py and inspected many of the resulting images, the box positions looked fine.
I also noticed that although every image in a batch has a different shape, generate() is only given the shape of the last image in the batch and builds the gt for all images from that (scaled) shape, so that all score maps in a batch share one shape. Could this be the root cause? (The original CenterNet resizes all images to a fixed size before downsampling and building the gt.)

I'm stuck on this and don't know how to solve it. I'd be very grateful for pointers from the author or anyone familiar with the code. Thanks a lot!
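
Regarding question 1 above, here is a minimal sketch of periodic checkpointing that does not require editing the training loop, assuming centerX's Trainer keeps detectron2's standard DefaultTrainer behaviour (in stock detectron2 a PeriodicCheckpointer hook is registered and driven by SOLVER.CHECKPOINT_PERIOD; whether centerX preserves this is an assumption, and the period below is illustrative):

```python
from detectron2.engine import hooks

# Option 1: rely on DefaultTrainer's built-in PeriodicCheckpointer hook
cfg.SOLVER.CHECKPOINT_PERIOD = 100        # save a checkpoint every 100 iterations

# Option 2: register the hook explicitly on a trainer instance
trainer = Trainer(cfg)                    # centerX's Trainer, as used in train_net.py
trainer.resume_or_load(resume=False)
trainer.register_hooks([hooks.PeriodicCheckpointer(trainer.checkpointer, period=100)])
trainer.train()
```

Also note, as a hedged guess: if centerX defines its own Trainer in engine/defaults.py (as the tracebacks in other issues here suggest), edits made to the pip-installed detectron2.engine.defaults may never be executed, which would explain the missing print output.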

SWA in vanilla Detectron2

Could you please provide some guidance on how to use the SWA hook in Detectron2? Is adding an additional config option enough, or is some code change in CenternetTrainer necessary?

Thanks!

Hello, is there a paper describing this work in detail? & KD question

  1. Hi author, is there a paper that describes this work in detail? I'd like to study it.
  2. My ResNet-18 model trained on its own is 111M, but after KD (res50 as teacher, res18 as student) the res18 model is 166M, which does not match the original res18; the number of parameters changed. Is that expected?

Using GFocal loss

I changed the detection head to output a center-point prediction plus an ltrb box. With the original CenterNet loss it works fine, but with GFocal the loss becomes negative. Could we look into this together?

Is multi-model distillation essentially equivalent to pseudo-labeling?

Suppose we need to train on two categories that live in two different datasets:
Dataset D1: category A annotated, category B unannotated
Dataset D2: category B annotated, category A unannotated
Model M1: a model that detects category A
Model M2: a model that detects category B

When training on any image, we use M1 or M2 to predict the missing labels and treat them as supervision. That seems no different from using M1 and M2 offline to cross-label D1 and D2.

Problem when converting to Caffe

File "./modeling/layers/centernet_deconv.py", line 91, in forward
offset = torch.cat((o1, o2), dim=1)
File "/home/22/code/centerX-master/projects/speedup/pytorch_to_caffe.py", line 660, in __call__
out = self.obj(self.raw, *args, **kwargs)
TypeError: _cat() got an unexpected keyword argument 'dim'

The config file and the model do match, so why does this error still occur?
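
A possible workaround, assuming the `_cat` wrapper in pytorch_to_caffe.py only handles positional arguments when it re-dispatches to the original op: pass the concatenation dimension positionally instead of as a keyword. A standalone sketch of the change:

```python
import torch

# Placeholder tensors standing in for the two deconv-head outputs (shapes are illustrative).
o1 = torch.randn(1, 9, 32, 32)
o2 = torch.randn(1, 9, 32, 32)

# In modeling/layers/centernet_deconv.py, passing dim positionally avoids handing the
# tracer's _cat wrapper an unexpected `dim=` keyword (an assumption about the tracer).
offset = torch.cat((o1, o2), 1)   # instead of torch.cat((o1, o2), dim=1)
```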

Training problem

Hi, when I train my own model with centerX the AP is always 0. Do you know what might be the cause?

[screenshot of the training log omitted]

Training problem on a small dataset

Hi,
I ran into problems when training on my own small dataset.
I first used centerX to reproduce your res18 and res50 results on the COCO dataset.
After switching to my own small dataset, the original CenterNet code reaches an AP of about 0.59, but on centerX, after many attempts (adjusting the learning-rate schedule, the data augmentation, and so on), the best AP is only about 0.49. My guess is that the data augmentation is not quite right.
I also noticed that in your comparison experiments the data-augmentation result is slightly lower than the baseline. Do you have any thoughts or suggestions about this?

KD training

Hi, I want to run KD training with yamls/coco/centernet_res18_KD.yaml, but I get the error "exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth not found!". How do I obtain this teacher model? Thanks a lot.

Question about converting CenterNet to ONNX

May I ask: when converting CenterNet to ONNX, the gather op in the decode part receives int64 indices. Why does your code convert successfully, while my direct conversion fails with an error on this op? I also don't fully understand the conversion code that follows.

Help with the box format

I see that in coco_class you convert the COCO labels from xywh to xyxy. During training in detectron2, are the boxes also converted to xyxy?
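
For reference, the conversion being asked about can be sketched with detectron2's BoxMode; the numeric example reuses the annotation quoted in the first issue above:

```python
from detectron2.structures import BoxMode

# COCO annotations store [x, y, width, height]; detectron2 trains on absolute [x1, y1, x2, y2].
xywh = [138.0, 1298.0, 1372.0, 808.0]
xyxy = BoxMode.convert(xywh, BoxMode.XYWH_ABS, BoxMode.XYXY_ABS)
print(xyxy)  # [138.0, 1298.0, 1510.0, 2106.0]
```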

python api

Hi author, I'd like to ask:
how can I plot the PR curve and the loss curve?
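
A minimal sketch for the loss curve, assuming the default detectron2 writers are active so that scalars such as total_loss are appended as JSON lines to metrics.json under cfg.OUTPUT_DIR (the path below is a placeholder); running tensorboard --logdir on the same directory is another option. A PR curve would additionally need the precision array from the COCO evaluation, which is not covered here.

```python
import json
import matplotlib.pyplot as plt

# placeholder path: <cfg.OUTPUT_DIR>/metrics.json
records = [json.loads(line) for line in open("output/metrics.json")]
points = [(r["iteration"], r["total_loss"]) for r in records if "total_loss" in r]
iters, losses = zip(*points)

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("total_loss")
plt.savefig("loss_curve.png")
```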

AP greater than 1?

Loading and preparing results...
DONE (t=1.34s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
COCOeval_opt.evaluate() finished in 2.31 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.69 seconds.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.112
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.333
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.030
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.159
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.188
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.101
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.070
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.218
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.244
[05/31 20:26:18 d2.evaluation.coco_evaluation]: Evaluation results for bbox:

| AP     | AP50   | AP75  | APs   | APm    | APl    |
|:-------|:-------|:------|:------|:-------|:-------|
| 11.186 | 33.313 | 4.922 | 3.049 | 15.892 | 18.782 |

[05/31 20:26:18 d2.evaluation.coco_evaluation]: Per-category bbox AP:

| category | AP     | category | AP    |
|:---------|:-------|:---------|:------|
| person   | 18.383 | face     | 3.989 |
Is this normal? The AP values are greater than 1.
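
For context, a hedged note on the scale: detectron2's evaluation summary reports COCO AP multiplied by 100, so the 11.186 in the table corresponds to the 0.112 printed by COCOeval above; nothing exceeds 1 on the raw 0-1 scale.

```python
# the summary-table value is the raw COCOeval value scaled to percent
ap_raw = 0.112            # from the COCOeval printout above (0-1 scale)
ap_table = ap_raw * 100   # ~11.2, matching the 11.186 shown in the d2 table
print(ap_table)
```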

Training on my own dataset

I have prepared a dataset in COCO format. What do I need to change to point to my own dataset's path, and how do I choose resnet18? Sorry, I'm still a beginner.

Is there a paper with a detailed description?

Hi author, thanks for your contribution. Is there a paper for this project? Could you post it? I'd like to try incremental learning based on this project; do you think that's feasible?

A small issue with inference/demo.py

First of all, thanks a lot for your work; the "**特色 CenterNet" Zhihu article is also a very fun read.
However, when I reproduced this project on my own dataset and tried to run inference with inference/demo.py, I hit a small problem: the output images never contain any boxes.
I checked the results variable; cls, bbox and scores all look fine.
Then I noticed this snippet in inference/demo.py:

    for c, (x1, y1, x2, y2), s in zip(cls, bbox, scores):
        if c != 0.0 or s < 0.35:
            continue

Here c is the class id, and class 0.0 is usually defined as the background, right? But with this logic, any detection that is not background is skipped, so the object boxes can never be drawn, can they?
After changing it to

    if c == 0.0 or s < 0.35:

I get the correct output.

Custom dataset

Hi! In the centerX project, how do I train on my own dataset? I keep getting errors at the dataset registration step.

what's the version of detectron2?

There was an error while I was running it:

~/centerX/engine/defaults.py in __init__(self, cfg)
     69                 model, device_ids=[comm.get_local_rank()], broadcast_buffers=False
     70             )
---> 71         super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
     72 #         super().__init__(cfg)
     73 

TypeError: __init__() takes 1 positional argument but 4 were given

my detectron2 version is 0.3.

Hi, after running sh run.sh I get the error TypeError: __init__() takes 1 positional argument but 4 were given. What could be the cause? Thanks!

[12/22 10:43:48 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[12/22 10:43:52 d2.data.common]: Serialized dataset takes 451.21 MiB
[12/22 10:43:52 detectron2]: Using training sampler TrainingSampler
[12/22 10:43:57 detectron2]: initial from https://download.pytorch.org/models/resnet18-5c106cde.pth
[12/22 10:43:57 detectron2]: The checkpoint state_dict contains keys that are not used by the model:
fc.{weight, bias}
Traceback (most recent call last):
File "/home/yb/centerX/train_net.py", line 66, in
args=(args,),
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
daemon=False,
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/yb/centerX/train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/yb/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
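
A hedged sketch of what may be going on, assuming detectron2 >= 0.3: upstream refactored DefaultTrainer to inherit from TrainerBase (whose __init__ takes no extra arguments) and to delegate the inner loop to a SimpleTrainer held in self._trainer, so the call super(DefaultTrainer, self).__init__(model, data_loader, optimizer) in centerX's engine/defaults.py no longer matches. One possible adaptation mirroring the upstream pattern is sketched below (the class name is illustrative, and this is not the project's official fix); installing an older detectron2 release whose DefaultTrainer still inherits from SimpleTrainer is the other common route.

```python
from detectron2.engine.train_loop import SimpleTrainer, TrainerBase


class PatchedTrainer(TrainerBase):
    """Illustrative wrapper: hold a SimpleTrainer and delegate run_step to it,
    the same pattern newer detectron2 DefaultTrainer versions use via self._trainer."""

    def __init__(self, model, data_loader, optimizer):
        super().__init__()  # TrainerBase.__init__ takes no extra arguments in d2 >= 0.3
        self._trainer = SimpleTrainer(model, data_loader, optimizer)

    def run_step(self):
        self._trainer.iter = self.iter
        self._trainer.run_step()
```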

Could you please release a COCO pretrained model?

Hello, thanks for your great work! I'm trying to use this repo in my project, but it is hard to converge and takes much more time. Could you please release a COCO pretrained model? Thanks a lot ❤!

KD issues

If I use res50 (34.9) as the teacher and res18 (30.2) as the student, and also train for 140 epochs, would I get a better result (>31.0 mAP on COCO)?

My best result is res50 (34.9) + res18 (30.6) with 140 epochs; KD only gives a 0.4% improvement.

Error in training a model based on RegNet

When training the RegNetX_400MF model, the weights were downloaded and then the following error was reported:
File "centerX-master/modeling/backbone/regnet/regnet.py", line 539, in init_pretrained_weights
state_dict = torch.load(cached_file, map_location=torch.device('cpu'))['model_state']
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 763, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
What could be the problem?
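
A hedged note: `invalid load key, '<'` from torch.load usually means the cached file starts with '<', i.e. an HTML page (a redirect or error page) was saved in place of the checkpoint, so re-downloading the weights manually and replacing the cached file is the usual remedy. A quick check (the path is a placeholder for the cached_file from the traceback):

```python
# If the first bytes look like b'<!DOCTYPE html' or b'<html', the download is broken
# rather than the checkpoint format being wrong.
cached_file = "/path/to/cached/RegNetX-400MF_weights.pth"  # placeholder path
with open(cached_file, "rb") as f:
    print(f.read(64))
```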

Where to put the data

Hi, I'd like to train on the COCO dataset. Where should the coco folder be placed?

Reproducing the experiments

Training directly with your configs, I get mAP 24.1 for widerface res50 and 21.6 for res18. The only change I made was reducing the batch size from 64 to 16. Can that alone cause such a large drop? Also, during KD training (RES50 -> RES18), kd_cls_loss rises in the middle of training (BS=16). What might cause that?
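
For reference, a common rule of thumb when shrinking the batch size is to scale the base learning rate linearly with it; the sketch below shows that adjustment (a heuristic, not a guaranteed explanation for the gap described above):

```python
# linear LR scaling heuristic when going from batch size 64 to 16
base_lr_64 = cfg.SOLVER.BASE_LR          # the BASE_LR tuned for batch size 64
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.BASE_LR = base_lr_64 * 16 / 64
```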

adaptive loss weight

Hi, I ran into a similar training problem in another task and want to try the adaptive loss weight strategy. I am wondering how to select COMMUNISM.CLS_LOSS, COMMUNISM.WH_LOSS and COMMUNISM.OFF_LOSS. Can you share your experience with me? Anyway, thanks a lot.

The training curve is strange

Hi, thank you for sharing!
I trained three times on different COCO categories with the default configs. The AP curve always drops suddenly in the middle of training and then slowly rises again, as if training were restarted midway. Like this:

[screenshot of the AP curve omitted]

I checked most of the config params in the yaml files but did not find the reason. Could you tell me which config params cause this?

issues

Traceback (most recent call last):
File "train_net.py", line 66, in
args=(args,),
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/cc631/hailong/code/Dilated-FPN/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given

TypeError: __init__() takes 1 positional argument but 4 were given

Traceback (most recent call last):
File "train_net.py", line 77, in
args=(args,),
File "/home/zy/anaconda3/envs/CenterNet/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "train_net.py", line 63, in main
trainer = Trainer(cfg)
File "/home/zy/Downloads/2222/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given

Hi, how can this problem be solved?
