
jdai-cv / centerx

552 stars, 18 watchers, 86 forks, 196 KB

This repo is implemented based on detectron2 and centernet

License: Apache License 2.0

Languages: Python 98.03%, Shell 1.97%
Topics: centernet, detectron2, deep-learning, object-detection, fast-reid, caffe, onnx, tensorrt, centerx

centerx's People

Contributors

cpflame


centerx's Issues

How can I save checkpoints periodically? Why is the coco_eval AP on the val set always 0 during training? total_loss stays large

Hi author, I have a few questions:

  1. I noticed that the project only saves the model after training has fully finished. How can I save checkpoints periodically? I installed detectron2 via pip and then added a train function to DefaultTrainer in detectron2/engine/defaults.py (hoping to override TrainerBase.train). The code is below (it is TrainerBase.train() with one added print line plus the periodic-save code):
```python
    def train(self, start_iter: int, max_iter: int):
        """
        Args:
            start_iter, max_iter (int): See docs above
        """
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))
        import ipdb; ipdb.set_trace()
        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter

        with EventStorage(start_iter) as self.storage:
            try:
                self.before_train()
                print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', start_iter, max_iter)
                for self.iter in range(start_iter, max_iter):
                    self.before_step()
                    self.run_step()
                    self.after_step()
                    if self.iter % 100 == 0:
                        self.checkpointer.save("model_" + str(self.iter + 1))

                # self.iter == max_iter can be used by `after_train` to
                # tell whether the training successfully finished or failed
                # due to exceptions.
                self.iter += 1
            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()
```
However, the print output never appears and no checkpoint gets saved. What is the correct way to do this? (See the checkpointing sketch at the end of this issue.)

  1. Symptom: during training, the coco_eval AP on the val set is always 0.
    Environment: I use coco/centernet_res50_coco.yaml for an object detection task. The dataset is prepared in COCO format and trains and evaluates normally with xingyizhou's CenterNet project.
    Changes to cfg in centerX:
    cfg.DATASETS.TRAIN = ("table_aline_train",)
    cfg.DATASETS.TEST = ("table_aline_val",)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.MAX_ITER = 30
    cfg.OUTPUT_DIR = "./output/table_aline"
    cfg.SOLVER.IMS_PER_BATCH = 8
    cfg.SOLVER.BASE_LR = 0.00125
    cfg.INPUT.MAX_SIZE_TRAIN = 1024
    cfg.INPUT.MIN_SIZE_TRAIN = 512

In addition, in the main function I registered my dataset with register_coco_instances.
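
For reference, a typical registration looks like the sketch below (the JSON and image paths are hypothetical placeholders; only the dataset names match the cfg.DATASETS entries above):

```python
from detectron2.data.datasets import register_coco_instances

# hypothetical paths; only the dataset names correspond to cfg.DATASETS above
register_coco_instances("table_aline_train", {},
                        "datasets/table_aline/annotations/instances_train2017.json",
                        "datasets/table_aline/train2017")
register_coco_instances("table_aline_val", {},
                        "datasets/table_aline/annotations/instances_val2017.json",
                        "datasets/table_aline/val2017")
```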

I run the author's run.sh script on 2 GPUs.

train: 700+
val: 80+

The specific problem:
During training, the COCO evaluation results on the val set always look like this (every AP stays at 0):

[screenshot of the COCO evaluation output omitted]

After 2300+ iterations, total_loss dropped from 1281 to about 6.6. Many of the boxes produced at inference have scores close to 1, but their coordinates are far outside the image bounds (see the image sizes below), for example:
{"image_id": 7, "category_id": 1, "bbox": [-120932.8515625, -51244.3125, 250420.453125, 95695.1640625], "score": 1.0}, {"image_id": 7, "category_id": 1, "bbox": [-146367.90625, -59846.8046875, 301889.0625, 119286.0078125], "score": 1.0}

Debugging I have tried:
Comparing total_loss with training on the original CenterNet (where the loss converges to about 0.8), I suspect the bboxes loaded by the dataloader are wrong, so I printed the dataset information. For example:
In CenterNet.forward() in centerX/modeling/meta_arch/centernet.py I printed batched_inputs[0] and got the following:
{'file_name': '/mnt/maskrcnn-benchmark/datasets/table_aline/train2017/d-27.png', 'height': 2339, 'width': 1654, 'image_id': 174, 'image': tensor([[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]]]), 'instances': Instances(num_instances=2, image_height=723, image_width=512, fields=[gt_boxes: Boxes(tensor([[ 16.7869,  44.9777, 473.3902, 106.7382],
    [ 15.7797, 415.2047, 476.4118, 686.4136]])), gt_classes: tensor([0, 0])])}

In the annotations file, the corresponding annotations are:
{"category_id": 1, "id": 317, "image_id": 174, "iscrowd": 0, "segmentation": [[137.76953125, 1297.650390625, 1509.9000000000015, 1297.650390625, 1509.9000000000015, 2105.5, 137.76953125, 2105.5]], "area": 1108576.0, "bbox": [138.0, 1298.0, 1372.0, 808.0]}
{"category_id": 1, "id": 316, "image_id": 174, "iscrowd": 0, "segmentation": [[146.541015625, 194.87890625, 1507.0552978515625, 194.87890625, 1507.0552978515625, 379.3728790283203, 146.541015625, 379.3728790283203]], "area": 250240.0, "bbox": [147.0, 195.0, 1360.0, 184.0]},

By calculation, height/image_height ≈ width/image_width.
However, the original gt boxes (in x1,y1,x2,y2 format: [138, 1298, 1510, 2106] and [147, 195, 1507, 379]) and the boxes in batched_inputs do not follow that same ratio. Is this normal?
Surprisingly, when I uncommented the drawing code in the generate function in centerX/modeling/layers/centernet_gt.py and inspected many of the resulting images, the box positions looked fine.
I also noticed that although every image in a batch has a different shape, generate() is only given the shape of the last image in the batch and builds the gt for all images from that (scaled) shape, so that all score maps in a batch share one shape. Could this be the root cause? (The original CenterNet resizes all images to a fixed size before downsampling and building the gt.)

I'm stuck on this and don't know how to solve it. I'd be very grateful for pointers from the author or anyone familiar with the code. Thanks a lot!
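
Regarding question 1 above, here is a minimal sketch of periodic checkpointing that does not require editing the training loop, assuming centerX's Trainer keeps detectron2's standard DefaultTrainer behaviour (in stock detectron2 a PeriodicCheckpointer hook is registered and driven by SOLVER.CHECKPOINT_PERIOD; whether centerX preserves this is an assumption, and the period below is illustrative):

```python
from detectron2.engine import hooks

# Option 1: rely on DefaultTrainer's built-in PeriodicCheckpointer hook
cfg.SOLVER.CHECKPOINT_PERIOD = 100        # save a checkpoint every 100 iterations

# Option 2: register the hook explicitly on a trainer instance
trainer = Trainer(cfg)                    # centerX's Trainer, as used in train_net.py
trainer.resume_or_load(resume=False)
trainer.register_hooks([hooks.PeriodicCheckpointer(trainer.checkpointer, period=100)])
trainer.train()
```

Also note, as a hedged guess: if centerX defines its own Trainer in engine/defaults.py (as the tracebacks in other issues here suggest), edits made to the pip-installed detectron2.engine.defaults may never be executed, which would explain the missing print output.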

SWA in vanilla Detectron2

Could you please provide some guidance on how to use the SWA hook in Detectron2? Is adding an additional config option enough, or is some code change in CenternetTrainer necessary?

Thanks!

Hello, is there a paper describing this work in detail? & KD question

  1. Hi author, is there a paper that describes this work in detail? I'd like to study it.
  2. My ResNet-18 model trained on its own is 111M, but after KD (res50 as teacher, res18 as student) the res18 model is 166M, which does not match the original res18; the number of parameters changed. Is that expected?

Using GFocal loss

I changed the detection head to output a center-point prediction plus an ltrb box. With the original CenterNet loss it works fine, but with GFocal the loss becomes negative. Could we look into this together?

Is multi-model distillation essentially equivalent to pseudo-labeling?

Suppose we need to train on two categories that live in two different datasets:
Dataset D1: category A annotated, category B unannotated
Dataset D2: category B annotated, category A unannotated
Model M1: a model that detects category A
Model M2: a model that detects category B

When training on any image, we use M1 or M2 to predict the missing labels and treat them as supervision. That seems no different from using M1 and M2 offline to cross-label D1 and D2.

Problem when converting to Caffe

File "./modeling/layers/centernet_deconv.py", line 91, in forward
offset = torch.cat((o1, o2), dim=1)
File "/home/22/code/centerX-master/projects/speedup/pytorch_to_caffe.py", line 660, in __call__
out = self.obj(self.raw, *args, **kwargs)
TypeError: _cat() got an unexpected keyword argument 'dim'

The config file and the model do match, so why does this error still occur?
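
A possible workaround, assuming the `_cat` wrapper in pytorch_to_caffe.py only handles positional arguments when it re-dispatches to the original op: pass the concatenation dimension positionally instead of as a keyword. A standalone sketch of the change:

```python
import torch

# Placeholder tensors standing in for the two deconv-head outputs (shapes are illustrative).
o1 = torch.randn(1, 9, 32, 32)
o2 = torch.randn(1, 9, 32, 32)

# In modeling/layers/centernet_deconv.py, passing dim positionally avoids handing the
# tracer's _cat wrapper an unexpected `dim=` keyword (an assumption about the tracer).
offset = torch.cat((o1, o2), 1)   # instead of torch.cat((o1, o2), dim=1)
```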

Training problem

Hi, when I train my own model with centerX the AP is always 0. Do you know what might be the cause?

[screenshot of the training log omitted]

Training problem on a small dataset

Hi,
I ran into problems when training on my own small dataset.
I first used centerX to reproduce your res18 and res50 results on the COCO dataset.
After switching to my own small dataset, the original CenterNet code reaches an AP of about 0.59, but on centerX, after many attempts (adjusting the learning-rate schedule, the data augmentation, and so on), the best AP is only about 0.49. My guess is that the data augmentation is not quite right.
I also noticed that in your comparison experiments the data-augmentation result is slightly lower than the baseline. Do you have any thoughts or suggestions about this?

KD training

Hi, I want to run KD training with yamls/coco/centernet_res18_KD.yaml, but I get the error "exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth not found!". How do I obtain this teacher model? Thanks a lot.

Question about converting CenterNet to ONNX

May I ask: when converting CenterNet to ONNX, the gather op in the decode part receives int64 indices. Why does your code convert successfully, while my direct conversion fails with an error on this op? I also don't fully understand the conversion code that follows.

Help with the box format

I see that in coco_class you convert the COCO labels from xywh to xyxy. During training in detectron2, are the boxes also converted to xyxy?
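
For reference, the conversion being asked about can be sketched with detectron2's BoxMode; the numeric example reuses the annotation quoted in the first issue above:

```python
from detectron2.structures import BoxMode

# COCO annotations store [x, y, width, height]; detectron2 trains on absolute [x1, y1, x2, y2].
xywh = [138.0, 1298.0, 1372.0, 808.0]
xyxy = BoxMode.convert(xywh, BoxMode.XYWH_ABS, BoxMode.XYXY_ABS)
print(xyxy)  # [138.0, 1298.0, 1510.0, 2106.0]
```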

python api

Hi author, I'd like to ask:
how can I plot the PR curve and the loss curve?
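
A minimal sketch for the loss curve, assuming the default detectron2 writers are active so that scalars such as total_loss are appended as JSON lines to metrics.json under cfg.OUTPUT_DIR (the path below is a placeholder); running tensorboard --logdir on the same directory is another option. A PR curve would additionally need the precision array from the COCO evaluation, which is not covered here.

```python
import json
import matplotlib.pyplot as plt

# placeholder path: <cfg.OUTPUT_DIR>/metrics.json
records = [json.loads(line) for line in open("output/metrics.json")]
points = [(r["iteration"], r["total_loss"]) for r in records if "total_loss" in r]
iters, losses = zip(*points)

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("total_loss")
plt.savefig("loss_curve.png")
```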

AP greater than 1?

Loading and preparing results...
DONE (t=1.34s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
COCOeval_opt.evaluate() finished in 2.31 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.69 seconds.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.112
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.333
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.030
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.159
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.188
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.101
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.070
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.218
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.244
[05/31 20:26:18 d2.evaluation.coco_evaluation]: Evaluation results for bbox:

| AP     | AP50   | AP75  | APs   | APm    | APl    |
|:-------|:-------|:------|:------|:-------|:-------|
| 11.186 | 33.313 | 4.922 | 3.049 | 15.892 | 18.782 |

[05/31 20:26:18 d2.evaluation.coco_evaluation]: Per-category bbox AP:

| category | AP     | category | AP    |
|:---------|:-------|:---------|:------|
| person   | 18.383 | face     | 3.989 |
Is this normal? The AP values are greater than 1.
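
For context, a hedged note on the scale: detectron2's evaluation summary reports COCO AP multiplied by 100, so the 11.186 in the table corresponds to the 0.112 printed by COCOeval above; nothing exceeds 1 on the raw 0-1 scale.

```python
# the summary-table value is the raw COCOeval value scaled to percent
ap_raw = 0.112            # from the COCOeval printout above (0-1 scale)
ap_table = ap_raw * 100   # ~11.2, matching the 11.186 shown in the d2 table
print(ap_table)
```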

Training on my own dataset

I have prepared a dataset in COCO format. What do I need to change to point to my own dataset's path, and how do I choose resnet18? Sorry, I'm still a beginner.

Is there a paper with a detailed description?

Hi author, thanks for your contribution. Is there a paper for this project? Could you post it? I'd like to try incremental learning based on this project; do you think that's feasible?

A small issue with inference/demo.py

First of all, thanks a lot for your work; the "**特色 CenterNet" Zhihu article is also a very fun read.
However, when I reproduced this project on my own dataset and tried to run inference with inference/demo.py, I hit a small problem: the output images never contain any boxes.
I checked the results variable; cls, bbox and scores all look fine.
Then I noticed this snippet in inference/demo.py:

    for c, (x1, y1, x2, y2), s in zip(cls, bbox, scores):
        if c != 0.0 or s < 0.35:
            continue

Here c is the class id, and class 0.0 is usually defined as the background, right? But with this logic, any detection that is not background is skipped, so the object boxes can never be drawn, can they?
After changing it to

    if c == 0.0 or s < 0.35:

I get the correct output.

Custom dataset

Hi! In the centerX project, how do I train on my own dataset? I keep getting errors at the dataset registration step.

what's the version of detectron2?

There was an error while I was running it:

~/centerX/engine/defaults.py in __init__(self, cfg)
     69                 model, device_ids=[comm.get_local_rank()], broadcast_buffers=False
     70             )
---> 71         super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
     72 #         super().__init__(cfg)
     73 

TypeError: __init__() takes 1 positional argument but 4 were given

my detectron2 version is 0.3.

Hi, after running sh run.sh I get the error TypeError: __init__() takes 1 positional argument but 4 were given. What could be the cause? Thanks!

[12/22 10:43:48 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[12/22 10:43:52 d2.data.common]: Serialized dataset takes 451.21 MiB
[12/22 10:43:52 detectron2]: Using training sampler TrainingSampler
[12/22 10:43:57 detectron2]: initial from https://download.pytorch.org/models/resnet18-5c106cde.pth
[12/22 10:43:57 detectron2]: The checkpoint state_dict contains keys that are not used by the model:
fc.{weight, bias}
Traceback (most recent call last):
File "/home/yb/centerX/train_net.py", line 66, in
args=(args,),
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
daemon=False,
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/yb/centerX/train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/yb/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
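
A hedged sketch of what may be going on, assuming detectron2 >= 0.3: upstream refactored DefaultTrainer to inherit from TrainerBase (whose __init__ takes no extra arguments) and to delegate the inner loop to a SimpleTrainer held in self._trainer, so the call super(DefaultTrainer, self).__init__(model, data_loader, optimizer) in centerX's engine/defaults.py no longer matches. One possible adaptation mirroring the upstream pattern is sketched below (the class name is illustrative, and this is not the project's official fix); installing an older detectron2 release whose DefaultTrainer still inherits from SimpleTrainer is the other common route.

```python
from detectron2.engine.train_loop import SimpleTrainer, TrainerBase


class PatchedTrainer(TrainerBase):
    """Illustrative wrapper: hold a SimpleTrainer and delegate run_step to it,
    the same pattern newer detectron2 DefaultTrainer versions use via self._trainer."""

    def __init__(self, model, data_loader, optimizer):
        super().__init__()  # TrainerBase.__init__ takes no extra arguments in d2 >= 0.3
        self._trainer = SimpleTrainer(model, data_loader, optimizer)

    def run_step(self):
        self._trainer.iter = self.iter
        self._trainer.run_step()
```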

Could you please release a COCO pretrained model?

Hello, thanks for your great work! I'm trying to use this repo in my project, but it is hard to converge and takes much more time. Could you please release a COCO pretrained model? Thanks a lot ❤!

KD issues

If I use res50 (34.9) as the teacher and res18 (30.2) as the student, and also train for 140 epochs, would I get a better result (>31.0 mAP on COCO)?

My best result is res50 (34.9) + res18 (30.6) with 140 epochs; KD only gives a 0.4% improvement.

Error in training a model based on RegNet

When training the RegNetX_400MF model, the weights were downloaded and then the following error was reported:
File "centerX-master/modeling/backbone/regnet/regnet.py", line 539, in init_pretrained_weights
state_dict = torch.load(cached_file, map_location=torch.device('cpu'))['model_state']
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 763, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
What could be the problem?
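
A hedged note: `invalid load key, '<'` from torch.load usually means the cached file starts with '<', i.e. an HTML page (a redirect or error page) was saved in place of the checkpoint, so re-downloading the weights manually and replacing the cached file is the usual remedy. A quick check (the path is a placeholder for the cached_file from the traceback):

```python
# If the first bytes look like b'<!DOCTYPE html' or b'<html', the download is broken
# rather than the checkpoint format being wrong.
cached_file = "/path/to/cached/RegNetX-400MF_weights.pth"  # placeholder path
with open(cached_file, "rb") as f:
    print(f.read(64))
```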

Where to put the data

Hi, I'd like to train on the COCO dataset. Where should the coco folder be placed?

Reproducing the experiments

Training directly with your configs, I get mAP 24.1 for widerface res50 and 21.6 for res18. The only change I made was reducing the batch size from 64 to 16. Can that alone cause such a large drop? Also, during KD training (RES50 -> RES18), kd_cls_loss rises in the middle of training (BS=16). What might cause that?
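
For reference, a common rule of thumb when shrinking the batch size is to scale the base learning rate linearly with it; the sketch below shows that adjustment (a heuristic, not a guaranteed explanation for the gap described above):

```python
# linear LR scaling heuristic when going from batch size 64 to 16
base_lr_64 = cfg.SOLVER.BASE_LR          # the BASE_LR tuned for batch size 64
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.BASE_LR = base_lr_64 * 16 / 64
```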

adaptive loss weight

Hi, I ran into a similar training problem in another task and want to try the adaptive loss weight strategy. I am wondering how to select COMMUNISM.CLS_LOSS, COMMUNISM.WH_LOSS and COMMUNISM.OFF_LOSS. Can you share your experience with me? Anyway, thanks a lot.

The training curve is strange

Hi, thank you for sharing!
I trained three times on different COCO categories with the default configs. The AP curve always drops suddenly in the middle of training and then slowly rises again, as if training were restarted midway. Like this:

[screenshot of the AP curve omitted]

I checked most of the config params in the yaml files but did not find the reason. Could you tell me which config params cause this?

issues

Traceback (most recent call last):
File "train_net.py", line 66, in
args=(args,),
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/cc631/hailong/code/Dilated-FPN/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given

TypeError: __init__() takes 1 positional argument but 4 were given

Traceback (most recent call last):
File "train_net.py", line 77, in
args=(args,),
File "/home/zy/anaconda3/envs/CenterNet/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "train_net.py", line 63, in main
trainer = Trainer(cfg)
File "/home/zy/Downloads/2222/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given

Hi, how can this problem be solved?
