jdai-cv / centerx
This repo is implemented based on detectron2 and centernet
License: Apache License 2.0
Hello, since the deconv layers all use deformable convolution, how should their FLOPs be computed?
File "/home/ac/code/centerX-master/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
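Hello author, I have a few questions:
I noticed that the project currently saves the model only after training has completely finished. How can I save checkpoints periodically? I installed detectron2 via pip install, then added a train function to DefaultTrainer in detectron2.engine.defaults.py (hoping to override the train function in TrainerBase). The code is below (based on TrainerBase.train(), with one added print line and code to save the model periodically):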
`
def train(self, start_iter: int, max_iter: int):
    """
    Args:
        start_iter, max_iter (int): See docs above
    """
    logger = logging.getLogger(__name__)
    logger.info("Starting training from iteration {}".format(start_iter))
    import ipdb; ipdb.set_trace()
    self.iter = self.start_iter = start_iter
    self.max_iter = max_iter
    with EventStorage(start_iter) as self.storage:
        try:
            self.before_train()
            print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', start_iter, max_iter)
            for self.iter in range(start_iter, max_iter):
                self.before_step()
                self.run_step()
                self.after_step()
                if self.iter % 100 == 0:
                    self.checkpointer.save("model_" + str(self.iter + 1))
            # self.iter == max_iter can be used by `after_train` to
            # tell whether the training successfully finished or failed
            # due to exceptions.
            self.iter += 1
        except Exception:
            logger.exception("Exception during training:")
            raise
        finally:
            self.after_train()
`
However, nothing is printed and no model is saved. What is the correct way to do this?
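One possible reading of the question above: the tracebacks elsewhere on this page show that the project has its own `engine/defaults.py` with a `DefaultTrainer`, so an edit made inside the pip-installed detectron2 package may never be executed. detectron2 also provides a `hooks.PeriodicCheckpointer` hook driven by `cfg.SOLVER.CHECKPOINT_PERIOD`; whether this project's trainer registers it is not shown here. The sketch below only illustrates the hook pattern with stand-in classes. `FakeCheckpointer`, `PeriodicSaveHook`, and `run_training` are hypothetical names, not detectron2 or centerX APIs.

```python
class FakeCheckpointer:
    """Hypothetical stand-in: records checkpoint names instead of writing files."""

    def __init__(self):
        self.saved = []

    def save(self, name):
        self.saved.append(name)


class PeriodicSaveHook:
    """Saves every `period` iterations and once more at the end of training."""

    def __init__(self, checkpointer, period):
        self.checkpointer = checkpointer
        self.period = period

    def after_step(self, iteration, max_iter):
        next_iter = iteration + 1
        if next_iter % self.period == 0 and next_iter != max_iter:
            self.checkpointer.save("model_{:07d}".format(iteration))
        if next_iter == max_iter:
            self.checkpointer.save("model_final")


def run_training(max_iter, hook):
    """Toy training loop: only the hook call is modelled."""
    for iteration in range(max_iter):
        # before_step() and run_step() would go here in a real trainer
        hook.after_step(iteration, max_iter)


checkpointer = FakeCheckpointer()
run_training(max_iter=250, hook=PeriodicSaveHook(checkpointer, period=100))
print(checkpointer.saved)
```

Keeping the save logic in a hook rather than in an edited copy of `train()` means it survives reinstalling the library and does not depend on which `train()` implementation actually runs.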
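In addition, I registered my dataset in the main function with register_coco_instances.
I ran the author's run.sh script on 2 GPUs.
train: 700+
val: 80+
The specific problem
During training, COCO evaluation on the val set always gives the result shown in the figure below:
After 2300+ iterations, total_loss dropped from 1281 to about 6.6. Many of the boxes generated during inference have scores close to 1, but their positions lie far outside the image (see the sizes below), for example: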
{"image_id": 7, "category_id": 1, "bbox": [-120932.8515625, -51244.3125, 250420.453125, 95695.1640625], "score": 1.0}, {"image_id": 7, "category_id": 1, "bbox": [-146367.90625, -59846.8046875, 301889.0625, 119286.0078125], "score": 1.0}
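Debugging attempted so far
Comparing total_loss with training on the original CenterNet (where the loss converges to about 0.8), I suspected that the bboxes loaded by the dataloader might be wrong, so I printed the dataset information. For example:
In CenterNet.forward() in centerX/modeling/meta_arch/centernet.py, I printed batched_inputs[0] and got the following: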
{'file_name': '/mnt/maskrcnn-benchmark/datasets/table_aline/train2017/d-27.png', 'height': 2339, 'width': 1654, 'image_id': 174, 'image': tensor([[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]],
[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]],
[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]]]), 'instances': Instances(num_instances=2, image_height=723, image_width=512, fields=[gt_boxes: Boxes(tensor([[ 16.7869, 44.9777, 473.3902, 106.7382],
[ 15.7797, 415.2047, 476.4118, 686.4136]])), gt_classes: tensor([0, 0])])}
In the annotations file, the corresponding annotations are:
{"category_id": 1, "id": 317, "image_id": 174, "iscrowd": 0, "segmentation": [[137.76953125, 1297.650390625, 1509.9000000000015, 1297.650390625, 1509.9000000000015, 2105.5, 137.76953125, 2105.5]], "area": 1108576.0, "bbox": [138.0, 1298.0, 1372.0, 808.0]}
{"category_id": 1, "id": 316, "image_id": 174, "iscrowd": 0, "segmentation": [[146.541015625, 194.87890625, 1507.0552978515625, 194.87890625, 1507.0552978515625, 379.3728790283203, 146.541015625, 379.3728790283203]], "area": 250240.0, "bbox": [147.0, 195.0, 1360.0, 184.0]},
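By my calculation, height/image_height ≈ width/image_width.
However, the original gt bboxes (converted to x1,y1,x2,y2 format: [138, 1298, 1510, 2106], [147, 195, 1507, 379]) and the bboxes in batched_inputs do not share the ratio that the height and width do. Is this normal?
Surprisingly, though, when I uncommented the drawing code in the generate function in centerX/modeling/layers/centernet_gt.py and inspected many of the resulting images, the box positions were fine.
I noticed that each image actually has a different shape, but generate is passed only the shape of the last image in the current batch and produces the gt for all images according to that shape (after scale), so that the score maps in a batch share the same shape. Could this be the crux of the problem? (The original CenterNet resizes images to a uniform size before downsampling, building the gt, and so on.)
I am quite lost and do not know how to solve this. I hope the author or anyone familiar with this can point me in the right direction. Many thanks!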
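The coordinate comparison described in this post can be reproduced with a few lines. This is a minimal sketch that assumes a pure resize from the original size (1654 x 2339) to the logged size (512 x 723) and nothing else; the helper names `xywh_to_xyxy` and `scale_xyxy` are hypothetical. If the resized boxes do not match the logged gt_boxes, that would be consistent with an extra transform (for example a crop or shift augmentation) having been applied, though the post does not confirm which.

```python
def xywh_to_xyxy(box):
    """Convert a COCO-style [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def scale_xyxy(box, sx, sy):
    """Scale an [x1, y1, x2, y2] box by per-axis factors."""
    x1, y1, x2, y2 = box
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]


# Sizes and annotations copied from the post above.
orig_w, orig_h = 1654, 2339
new_w, new_h = 512, 723
sx, sy = new_w / orig_w, new_h / orig_h

annotations = [
    [138.0, 1298.0, 1372.0, 808.0],
    [147.0, 195.0, 1360.0, 184.0],
]

converted = [xywh_to_xyxy(a) for a in annotations]
resized = [scale_xyxy(b, sx, sy) for b in converted]

for original, small in zip(converted, resized):
    print(original, [round(v, 1) for v in small])
```

Comparing the printed resized boxes with the logged `gt_boxes` shows whether a plain resize explains the dataloader output.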
This project converts to an ONNX model successfully. How can I go on to convert it to TensorRT?
Could you please provide some guidance on how to use the SWA hook in Detectron2? Should I just add extra config, or is some code in CenternetTrainer necessary?
Thanks!
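Hello, are there any conclusions about inference time? For example, what FPS can it reach on a V100/2080Ti, and how does it compare with yolov5?
I trained the model successfully, but running "projects/speedup/centerX2onnx.py" to export ONNX fails with:
"RuntimeError: ONNX export failed: Couldn't export Python operator _ModulatedDeformConv"
What causes this?
Hello, are there pretrained weights for COCO or CrowdHuman, or any other weights that can be used for initialization?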
File "/home/a/.local/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 133, in train
self.run_step()
File "/home/a/code/centerX-master/engine/defaults.py", line 353, in run_step
self._write_metrics(metrics_dict)
TypeError: _write_metrics() missing 1 required positional argument: 'data_time'
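The traceback above shows the project's `run_step` passing one argument to a `_write_metrics` that also requires `data_time`, which suggests a mismatch between the project code and the installed detectron2 version. Matching the detectron2 version the project was written against is probably the cleaner fix. As an illustration only, the sketch below uses two stand-in functions (`write_metrics_old` and `write_metrics_new`, both hypothetical) and a small adapter that inspects the signature before calling.

```python
import inspect


def write_metrics_old(metrics_dict):
    """Stand-in for an older signature that takes only the metrics."""
    return ("old", metrics_dict, None)


def write_metrics_new(metrics_dict, data_time, prefix=""):
    """Stand-in for a newer signature that also requires data_time."""
    return ("new", metrics_dict, data_time)


def call_write_metrics(write_fn, metrics_dict, data_time):
    """Pass data_time only when the target function declares that parameter."""
    params = inspect.signature(write_fn).parameters
    if "data_time" in params:
        return write_fn(metrics_dict, data_time)
    return write_fn(metrics_dict)


metrics = {"total_loss": 6.6}
old_result = call_write_metrics(write_metrics_old, metrics, data_time=0.01)
new_result = call_write_metrics(write_metrics_new, metrics, data_time=0.01)
print(old_result)
print(new_result)
```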
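For a custom dataset, which parts need to be modified? It seems to involve detectron2.
In the multi-teacher KD experiment, resdcn18_KD_woGT_scratch always performs slightly worse than resdcn18 on the crowd dataset, even when pretrained on ImageNet, yet it outperforms on another dataset. Why does this happen?
I saw that CenterNet has a centerpose variant. If I wanted to do pose estimation (human keypoint detection) with this project, where should I start?
I changed the detection head output to center-point prediction plus ltrb. With the original CenterNet loss function everything works, but with Gfocal the loss becomes negative. Could we look into this together?
What does 'add_mafa' mean?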
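Suppose we need to train two classes that live in two separate datasets:
Dataset D1: class A is annotated, class B is not
Dataset D2: class B is annotated, class A is not
Model M1: a model that detects class A
Model M2: a model that detects class B
When training on any image, M1 or M2 predicts the missing labels, which then serve as supervision. That seems no different from using M1 and M2 offline to cross-annotate D1 and D2.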
File "./modeling/layers/centernet_deconv.py", line 91, in forward
offset = torch.cat((o1, o2), dim=1)
File "/home/22/code/centerX-master/projects/speedup/pytorch_to_caffe.py", line 660, in __call__
out = self.obj(self.raw, *args, **kwargs)
TypeError: _cat() got an unexpected keyword argument 'dim'
The config file and the model match, so why does this problem still occur?
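The traceback shows the call `torch.cat((o1, o2), dim=1)` being routed through a wrapper that forwards keyword arguments to a replacement function named `_cat`. One plausible cause, assuming the replacement names its axis parameter something other than `dim`, is that the keyword simply does not exist on the replacement, regardless of config or weights. The sketch below reproduces that mechanism with toy stand-ins; `Wrapper`, `raw_cat`, and the `dimension` parameter name are assumptions for illustration, not the converter's confirmed code.

```python
def raw_cat(inputs, axis):
    """Toy concatenation on nested lists: axis 0 stacks rows, axis 1 joins rows."""
    if axis == 0:
        return [row for matrix in inputs for row in matrix]
    return [sum(rows, []) for rows in zip(*inputs)]


def _cat(raw, inputs, dimension=0):
    """Hypothetical replacement whose axis parameter is `dimension`, not `dim`."""
    return raw(inputs, dimension)


class Wrapper:
    """Forwards calls to the replacement, mirroring the line in the traceback."""

    def __init__(self, raw, replacement):
        self.raw = raw
        self.obj = replacement

    def __call__(self, *args, **kwargs):
        return self.obj(self.raw, *args, **kwargs)


cat = Wrapper(raw_cat, _cat)
o1 = [[1], [2]]
o2 = [[3], [4]]

try:
    cat((o1, o2), dim=1)  # keyword form: fails because `_cat` has no `dim`
    error = None
except TypeError as exc:
    error = str(exc)

result = cat((o1, o2), 1)  # positional form: works with either parameter name
print(error)
print(result)
```

Under that assumption, passing the axis positionally in the model code is the smallest change that works both with and without the wrapper.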
Is TTF supported?
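Hello,
I ran into a problem training on my own small dataset.
I first used CenterX on the COCO dataset and reproduced your res18 and res50 results.
After switching to my own small dataset, the AP with the original CenterNet code is about 0.59, but with CenterX, after many attempts (for example, adjusting the learning rate schedule and the data augmentation), the best AP is only about 0.49. My analysis is that the data augmentation may not be quite right.
I noticed that in your comparison experiments, the result with data augmentation is slightly lower than the original result. Do you have any thoughts or suggestions on this?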
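Hi, I want to run KD training with yamls/coco/centernet_res18_KD.yaml, but I got the error "exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth not found!". How can I get this teacher model? Thanks a lot.
May I ask: when converting CenterNet to ONNX, the gather function in the decode part receives int64. Why does your code run successfully? When I convert directly, this function raises an error, and I do not quite understand the conversion code that follows.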
fixed
I see that in coco_class you convert the COCO labels from xywh to xyxy. Does detectron2 also convert the boxes to xyxy during training?
Hello author, I would like to ask:
How can I plot the PR curve and the loss curve?
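For the loss curve, one common route is to parse the training metrics file. detectron2's default trainer writes a `metrics.json` file (one JSON object per line) into the output directory; whether this project keeps that writer is an assumption here. The sketch below parses such lines into two lists that can be handed to any plotting library. The sample records are made up for illustration, and `read_loss_curve` is a hypothetical helper.

```python
import json


def read_loss_curve(lines, key="total_loss"):
    """Collect (iteration, value) pairs from JSON-lines records that contain `key`."""
    iterations, values = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record and "iteration" in record:
            iterations.append(record["iteration"])
            values.append(record[key])
    return iterations, values


# Made-up sample lines standing in for the contents of a metrics file.
sample_lines = [
    '{"iteration": 19, "total_loss": 1281.0, "lr": 0.001}',
    '{"iteration": 39, "total_loss": 6.6, "lr": 0.001}',
    '{"iteration": 59, "bbox/AP": 11.186}',
    "",
]

iters, losses = read_loss_curve(sample_lines)
print(iters, losses)
```

With a real run, the lines would come from opening the metrics file in the output directory, and the two lists can be plotted with, for example, matplotlib's `plt.plot(iters, losses)`. For PR curves, pycocotools' `COCOeval` keeps a precision array in `eval["precision"]` after `accumulate()`, which can be sliced per IoU threshold and category.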
Loading and preparing results...
DONE (t=1.34s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
COCOeval_opt.evaluate() finished in 2.31 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.69 seconds.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.112
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.333
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.049
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.030
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.159
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.188
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.101
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.070
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.218
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.244
[05/31 20:26:18 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|
11.186 | 33.313 | 4.922 | 3.049 | 15.892 | 18.782 |
[05/31 20:26:18 d2.evaluation.coco_evaluation]: Per-category bbox AP:
category | AP | category | AP |
:----------- | :------- | :----------- | :------ |
person | 18.383 | face | 3.989 |
Is this normal? The AP is greater than 1.
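The log above already contains both scales: the COCOeval summary lines report AP as a fraction (0.112), while the table reports the same quantity multiplied by 100 (11.186). The short check below, using only numbers copied from that log, confirms the two agree after dividing by 100; `table_to_fraction` is a hypothetical helper name.

```python
# Values copied from the evaluation log above.
summary_lines = {"AP": 0.112, "AP50": 0.333, "AP75": 0.049}
table_row = {"AP": 11.186, "AP50": 33.313, "AP75": 4.922}


def table_to_fraction(value):
    """Convert a percentage-style table entry back to a 0..1 fraction."""
    return value / 100.0


consistent = all(
    abs(round(table_to_fraction(table_row[name]), 3) - summary_lines[name]) < 1e-9
    for name in summary_lines
)
print(consistent)
```

So a table value such as 18.383 for `person` corresponds to an AP of about 0.184 on the usual 0 to 1 scale.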
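I have prepared a COCO-format dataset. What do I need to change to point to my own dataset's path, and how do I choose resnet18? Sorry, I am still a beginner.
Hello author, thank you for your contribution. Is there a paper corresponding to this project? Could you post it? I would like to implement incremental learning based on this project. Do you think that is feasible?
I found that if all four box parameters are 0, a CUDA error is raised after training for 20 epochs...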
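First of all, many thanks to the author for this work. The Zhihu article "**特色CenterNet" was also a very fun read.
However, when I reproduced this project on my own dataset and tried to run inference with inference/demo.py, I hit a small problem: the output images never have any boxes.
I checked the results variable, and cls, bbox, and scores all look fine.
Then I found this snippet in inference/demo.py:
for c,(x1,y1,x2,y2),s in zip(cls,bbox,scores):
    if c != 0.0 or s < 0.35:
        continue
Here c is the class, and 0.0 is usually defined as background, right? But the logic here skips the box-drawing loop whenever the class is not background, so wouldn't it never draw any object boxes?
After changing it to:
    if c == 0.0 or s < 0.35:
I get the correct output.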
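An alternative reading of the snippet above: a dataloader dump earlier on this page shows `category_id: 1` annotations arriving as `gt_classes: tensor([0, 0])`, which suggests class indices are 0-based with no background slot. Under that assumption, `c != 0.0` keeps only the first class rather than only background, and `c == 0.0` drops the first class. A filter that takes the wanted class indices explicitly avoids hard-coding either choice. This is a minimal sketch with made-up detections; `select_boxes` is a hypothetical helper, not part of inference/demo.py.

```python
def select_boxes(cls, bbox, scores, keep_classes, score_thresh=0.35):
    """Keep detections whose class index is wanted and whose score passes the threshold."""
    kept = []
    for c, box, s in zip(cls, bbox, scores):
        if int(c) not in keep_classes or s < score_thresh:
            continue
        kept.append((int(c), box, s))
    return kept


# Made-up detections: two classes, one low-score box.
cls = [0.0, 1.0, 1.0]
bbox = [(10, 10, 50, 50), (20, 20, 80, 90), (5, 5, 15, 15)]
scores = [0.90, 0.80, 0.10]

only_first_class = select_boxes(cls, bbox, scores, keep_classes={0})
both_classes = select_boxes(cls, bbox, scores, keep_classes={0, 1})
print(only_first_class)
print(both_classes)
```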
Hello! In the centerX project, how do I train on my own dataset? I keep getting errors at the registration step.
There was an error while I was running it:
~/centerX/engine/defaults.py in __init__(self, cfg)
69 model, device_ids=[comm.get_local_rank()], broadcast_buffers=False
70 )
---> 71 super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
72 # super().__init__(cfg)
73
TypeError: __init__() takes 1 positional argument but 4 were given
my detectron2 version is 0.3.
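The message "takes 1 positional argument but 4 were given" means the `__init__` that `super(DefaultTrainer, self)` resolved to accepts only `self`. The traceback does not show which base class that is in the reporter's environment, so the sketch below only reproduces the mechanism with stand-in classes and shows one workaround pattern (holding the inner trainer by composition). All class names here are hypothetical; they are not the detectron2 or centerX classes.

```python
class BaseNoArgs:
    """Hypothetical base class whose constructor accepts only `self`."""

    def __init__(self):
        self.hooks = []


class InnerTrainer(BaseNoArgs):
    """Stand-in for a trainer that really does take model, data_loader, optimizer."""

    def __init__(self, model, data_loader, optimizer):
        super().__init__()
        self.model = model
        self.data_loader = data_loader
        self.optimizer = optimizer


class FailingTrainer(BaseNoArgs):
    """Mirrors the failing call: forwards three arguments to a no-argument base."""

    def __init__(self, model, data_loader, optimizer):
        super(FailingTrainer, self).__init__(model, data_loader, optimizer)


class WorkingTrainer(BaseNoArgs):
    """Calls the base constructor without arguments and wraps an inner trainer."""

    def __init__(self, model, data_loader, optimizer):
        super().__init__()
        self._trainer = InnerTrainer(model, data_loader, optimizer)


try:
    FailingTrainer("model", "loader", "optim")
    message = None
except TypeError as exc:
    message = str(exc)

working = WorkingTrainer("model", "loader", "optim")
print(message)
print(working._trainer.model)
```

If the base class changed between library versions, installing the version the project was developed against is the other obvious route.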
[12/22 10:43:48 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[12/22 10:43:52 d2.data.common]: Serialized dataset takes 451.21 MiB
[12/22 10:43:52 detectron2]: Using training sampler TrainingSampler
[12/22 10:43:57 detectron2]: initial from https://download.pytorch.org/models/resnet18-5c106cde.pth
[12/22 10:43:57 detectron2]: The checkpoint state_dict contains keys that are not used by the model:
fc.{weight, bias}
Traceback (most recent call last):
File "/home/yb/centerX/train_net.py", line 66, in <module>
args=(args,),
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
daemon=False,
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/yb/anaconda3/envs/centernetx/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/yb/centerX/train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/yb/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
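I am not familiar with detectron and could not figure this out after a long look. What role do these class definitions play? How are classes that are not annotated in the training set learned?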
fixed
Hello, thanks for your great work! I'm trying to use this repo in my project, but it is hard to converge and takes much more time. Could you please release a COCO pretrained model? Thanks a lot ❤!
https://github.com/JDAI-CV/centerX/blame/master/projects/centerX.md#L82
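Looking again at the model distillation section: is the second point meant to say "for the heads that output width/height and center-point offset"?
If I use res50 (34.9) as the teacher and res18 (30.2) as the student, and also train for 140 epochs, should I get a better result (>31.0 mAP on COCO)?
My best result is res50 (34.9) + res18 (30.6) with 140 epochs; KD gives only a 0.4% improvement.
When training the RegNetX_400MF model, after downloading the weights I get the following error: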
File "centerX-master/modeling/backbone/regnet/regnet.py", line 539, in init_pretrained_weights
state_dict = torch.load(cached_file, map_location=torch.device('cpu'))['model_state']
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "anaconda3/envs/yolov3/lib/python3.7/site-packages/torch/serialization.py", line 763, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
What is the problem?
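The message `invalid load key, '<'` means the first byte handed to the unpickler was `<`, which is what an HTML or XML page starts with. A likely (but unconfirmed) cause is that the download saved an error or redirect page instead of the weights. A quick check of the cached file's first bytes can confirm this before retrying the download. The sketch below creates two temporary files to demonstrate; `looks_like_html` and the file names are hypothetical.

```python
import os
import tempfile


def looks_like_html(path, probe_bytes=64):
    """Return True when the file starts with '<', as an HTML/XML page would."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip()
    return head.startswith(b"<")


with tempfile.TemporaryDirectory() as tmp:
    # A file that is really an HTML error page saved under a weights name.
    bad_path = os.path.join(tmp, "bad_weights.pyth")
    with open(bad_path, "wb") as handle:
        handle.write(b"<!DOCTYPE html><html>Access denied</html>")

    # A file that starts like a binary pickle stream (protocol 2 header).
    good_path = os.path.join(tmp, "good_weights.pyth")
    with open(good_path, "wb") as handle:
        handle.write(b"\x80\x02}q\x00.")

    bad_is_html = looks_like_html(bad_path)
    good_is_html = looks_like_html(good_path)

print(bad_is_html, good_is_html)
```

If the cached file turns out to be HTML, deleting it and downloading the weights again (or manually, through a browser) is the usual next step.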
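Hello, I would like to train on the COCO dataset. Where should the coco folder be placed?
Training directly with your configs, widerface res50 reaches mAP 24.1 and res18 reaches mAP 21.6. The only change was the batch size, from 64 to 16. Would that make such a large difference? Also, during KD training (RES50 -> RES18), kd_cls_loss rises in the middle of training. What could be the reason (BS=16)?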
Hi, I had a similar training problem in another task and want to try the adaptive loss weight strategy. How should I choose COMMUNISM.CLS_LOSS, COMMUNISM.WH_LOSS and COMMUNISM.OFF_LOSS? Could you share your experience? Thanks a lot.
Hi~ Thank you for your sharing!
I trained three times with different categories in COCO using the default configs. The AP curve always dropped suddenly in the middle and then rose slowly, as if training had restarted midway. Like this:
I checked most of the config params in the yaml files but did not find the reason. Could you tell me which config params cause this?
Traceback (most recent call last):
File "train_net.py", line 66, in <module>
args=(args,),
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "train_net.py", line 52, in main
trainer = Trainer(cfg)
File "/home/cc631/hailong/code/Dilated-FPN/centerX/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
Traceback (most recent call last):
File "train_net.py", line 77, in <module>
args=(args,),
File "/home/zy/anaconda3/envs/CenterNet/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "train_net.py", line 63, in main
trainer = Trainer(cfg)
File "/home/zy/Downloads/2222/engine/defaults.py", line 71, in __init__
super(DefaultTrainer, self).__init__(model, data_loader, optimizer)
TypeError: __init__() takes 1 positional argument but 4 were given
Hello, how can this problem be solved?