Comments (48)

makefile avatar makefile commented on August 17, 2024

@huinsysu Those sizes control how the input is resized. 6000x4000 may be too large to fit on one GPU.
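For readers unsure what "input resize" means here, below is a minimal sketch of the short-side/long-side convention used by many Faster R-CNN style pipelines. It is an assumption that the data layer in this repository follows the same rule; `resize_scale` is a hypothetical helper and the 800/1333 limits are illustrative values, not settings taken from this thread.

```python
def resize_scale(height, width, short_size, long_size):
    """Return the factor by which an image would be resized.

    Assumed convention: scale the shorter side to short_size, unless
    that would push the longer side past long_size, in which case the
    longer side is scaled to long_size instead.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    scale = short_size / short_side
    if long_side * scale > long_size:
        scale = long_size / long_side
    return scale


# A 6000x4000 image is shrunk before it reaches the network, so
# short_size and long_size do not need to match the raw image size.
scale = resize_scale(height=4000, width=6000, short_size=800, long_size=1333)
print("scale:", scale)
print("resized:", round(6000 * scale), "x", round(4000 * scale))
```

Under this assumed rule, the two parameters bound the resized input rather than describe the original image.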

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

Hi. Training models without FPN, such as res50-12s-600-rfcn-cascade, on my own dataset works fine. But when I try to train res50-15s-800-fpn-cascade on my own dataset, decode_bbox_layer cannot get any valid bbox. After the code commented "screen out high IoU boxes, to remove redundant gt boxes", the number of valid_bbox_ids is 0.
So what might the problem be? Thanks. @zhaoweicai

from cascade-rcnn.

zhaoweicai avatar zhaoweicai commented on August 17, 2024

@makefile If you don't want to remove the redundant gt boxes, you can simply set gt_iou_thr=1.0 or higher. But a more important problem is that you might not have enough proposals. In your error case, there are only gt boxes and no negative boxes. You can try lowering the proposal threshold in the "BoxGroupOutput" layer to get more proposals. Alternatively, your training may have diverged and crashed; in that case, try a lower learning rate.
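To illustrate why a lower threshold yields more proposals, here is a minimal sketch of score-based filtering. It is not the actual BoxGroupOutput implementation; `filter_by_score` is a hypothetical helper, and the boxes, scores, and threshold values are made up for the example.

```python
def filter_by_score(boxes, scores, fg_thr):
    """Keep only the boxes whose foreground score is at least fg_thr."""
    return [box for box, score in zip(boxes, scores) if score >= fg_thr]


# Illustrative proposals as (xmin, ymin, xmax, ymax) with made-up scores.
boxes = [(10, 10, 50, 50), (20, 20, 80, 80), (5, 5, 30, 30)]
scores = [0.30, 0.004, 0.0005]

kept_strict = filter_by_score(boxes, scores, fg_thr=0.01)
kept_loose = filter_by_score(boxes, scores, fg_thr=0.001)
kept_all = filter_by_score(boxes, scores, fg_thr=0.0)

print(len(kept_strict), len(kept_loose), len(kept_all))
```

With a threshold of 0 every box survives, which matches the idea of setting the threshold very low when a small dataset produces too few proposals.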

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@zhaoweicai Thanks! Following your advice, I lowered fg_thr in the BoxGroupOutput layer and the problem disappeared.

from cascade-rcnn.

Peng-wei-Yu avatar Peng-wei-Yu commented on August 17, 2024

@zhaoweicai @makefile I am trying to train cascade rcnn on my own dataset and I hit this problem. I tried lowering iou_thr in the "BoxGroupOutput" layer, but the problem is still there. Can you give me any suggestions?
[screenshot "wenti" not preserved]

from cascade-rcnn.

jwnsu avatar jwnsu commented on August 17, 2024

The error seems related to multiple GPUs. With a single GPU, training proceeds, though not with every GPU id: GPU id 1 is fine, but GPU id 2 hits the same error as above. With 2 GPUs, I hit the same error as well.

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@Peng-wei-Yu Try lowering the fg_thr score threshold instead of the NMS threshold.

from cascade-rcnn.

Peng-wei-Yu avatar Peng-wei-Yu commented on August 17, 2024

@jwnsu @makefile Thank you for your help. I tried lowering fg_thr and using only GPU 1, but the problem is still there. Have you tried changing --weights in train_detection? I have decided to change the caffemodel and give it a try.

from cascade-rcnn.

jwnsu avatar jwnsu commented on August 17, 2024

FYI, the COCO model seems to work fine (e.g. coco/res50-15s-800-fpn-cascade is fine; res101 runs out of GPU memory on a 1080 Ti). I suggest you switch from the VOC flavor to the COCO one.

from cascade-rcnn.

zhaoweicai avatar zhaoweicai commented on August 17, 2024

@Peng-wei-Yu when you change the number of GPUs, you should change the learning rate at the same time, as described in the paper.
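As a rough illustration of adjusting the learning rate with the GPU count, here is a sketch that assumes a linear scaling rule. That rule is an assumption on my part; the thread only says the relation is described in the paper, so check the paper for the exact recipe. `scale_learning_rate` is a hypothetical helper and the numbers are illustrative.

```python
def scale_learning_rate(base_lr, base_gpus, num_gpus):
    """Scale base_lr in proportion to the number of GPUs.

    Assumes linear scaling: the rate tuned for base_gpus is multiplied
    by num_gpus / base_gpus.
    """
    if base_gpus <= 0 or num_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return base_lr * num_gpus / base_gpus


# Illustrative values only: a rate tuned for 2 GPUs, reused on 1 GPU.
print(scale_learning_rate(base_lr=0.002, base_gpus=2, num_gpus=1))
```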

from cascade-rcnn.

zhaoweicai avatar zhaoweicai commented on August 17, 2024

@jwnsu The code should have no problem with multi-GPU training or the VOC dataset. Try running the script a couple of times to see if the problem still happens. If it does, try lowering the learning rate a little. If that still does not fix it, maybe there is something wrong.

from cascade-rcnn.

Peng-wei-Yu avatar Peng-wei-Yu commented on August 17, 2024

@makefile @zhaoweicai When you trained cascade rcnn on your own data, which caffemodel did you use: your own caffemodel or ResNet-50-model-merge.caffemodel? The pictures in my own data are 1600*1200. Should I change short_size and long_size in train.prototxt?

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@Peng-wei-Yu If you use the author's prototxt, you should use the corresponding ResNet-50-model-merge.caffemodel, since it merges the BN layers into the scale layers to reduce memory use and speed things up. You can increase the input image size if you have enough memory, but the results may not improve much.

from cascade-rcnn.

Peng-wei-Yu avatar Peng-wei-Yu commented on August 17, 2024

@makefile Thank you very much. I'll give ResNet-50-model-merge.caffemodel a try.

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile @Peng-wei-Yu In the BoxGroupOutput layer, the original setting is 0.001. What value did you finally set?

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile @Peng-wei-Yu
When I was training, the batch size was 1. Each of my own training pictures contains at least one sample, so why is the total positive count 0 in many iterations during training? My rpn loss is also 0. Have you encountered such a problem?
[screenshot not preserved]

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@GuoxingYan I set fg_thr: 0.01 or 0 in all BoxGroupOutput layers. If your number of positive RoIs is always 0, your dataset may have a problem.

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile Did you try changing short_size and long_size in train.prototxt? When I changed only short_size or only long_size, I got an error.

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@GuoxingYan I did not try changing that. Since a Deconvolution layer is used for upsampling, the size may need to be a multiple of 32, 64, or larger.
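If the size does need to be a multiple of some stride, a small helper can pick the nearest valid value. This is a sketch only; `round_up_to_multiple` is a hypothetical helper, and whether this network requires 32, 64, or something else is not confirmed in this thread.

```python
def round_up_to_multiple(size, multiple):
    """Return the smallest multiple of `multiple` that is >= size."""
    if size <= 0 or multiple <= 0:
        raise ValueError("size and multiple must be positive")
    return ((size + multiple - 1) // multiple) * multiple


# Candidate values for short_size under each assumed stride.
for stride in (32, 64):
    print(stride, round_up_to_multiple(600, stride))
```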

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile thank you very much!!

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile Do you get the following problem when training FPN?
[screenshot not preserved]

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@GuoxingYan I haven't run into that. The integer seems abnormally large.

from cascade-rcnn.

licy5152 avatar licy5152 commented on August 17, 2024

@Peng-wei-Yu @zhaoweicai My own data size is 960*1280. I tried using ResNet-50-model-merge.caffemodel, but I also get this problem.
[screenshot "wx20180624-154016" not preserved]

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile @zhaoweicai @Peng-wei-Yu When I was training, I found that short_size in detection_data_param in train.prototxt is 800, which is exactly equal to img_width and img_height in proposal_target_param. So my question is: when I change short_size to 320, do img_width and img_height also need to be changed to 320?

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@GuoxingYan I think it needs to be.

from cascade-rcnn.

licy5152 avatar licy5152 commented on August 17, 2024

@makefile I trained on my own dataset. How can I get the output for every picture?

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@licy5152 I wrote a Python script, CascadeRCNN-demo.py, that imitates the MATLAB code. You can modify it for your own use.

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@makefile The link to your demo.py shows as invalid.

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@GuoxingYan It is probably a problem with your network.

from cascade-rcnn.

PacteraKun avatar PacteraKun commented on August 17, 2024

@makefile @zhaoweicai
When I was training on my own dataset, the following issue happened. However, I have already checked that no box in window_file.txt has xmin = 1664 and xmax = 636. I also could not find a bbox_util.cpp file under the workspace directory. Could you help me solve this issue? Thanks a lot.
[screenshot not preserved]

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@PacteraKun The situation you encountered is unusual, check carefully.
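One way to "check carefully" is to scan the annotations for boxes that cannot be valid before training. The sketch below assumes the boxes have already been parsed into `(label, xmin, ymin, xmax, ymax)` tuples; that layout and the `find_bad_boxes` helper are hypothetical, and the parsing of window_file.txt itself is left out because its format is not described in this thread.

```python
def find_bad_boxes(boxes, img_width, img_height):
    """Return (index, reasons) for every box that fails a basic check."""
    bad = []
    for index, (label, xmin, ymin, xmax, ymax) in enumerate(boxes):
        reasons = []
        if label <= 0:
            reasons.append("label is not positive")
        if xmin >= xmax or ymin >= ymax:
            reasons.append("min coordinate is not below max")
        if xmin < 0 or ymin < 0 or xmax > img_width or ymax > img_height:
            reasons.append("box lies outside the image")
        if reasons:
            bad.append((index, reasons))
    return bad


# Illustrative annotations for one 1920x1080 image.
boxes = [
    (1, 100, 100, 400, 300),   # fine
    (1, 1664, 200, 636, 500),  # xmin > xmax
    (0, 10, 10, 50, 50),       # non-positive label
]
for index, reasons in find_bad_boxes(boxes, img_width=1920, img_height=1080):
    print(index, reasons)
```

If the raw annotations pass checks like these, the bad coordinates may be produced later, for example by resizing or flipping in the data layer, but that is a guess rather than something confirmed here.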

from cascade-rcnn.

PacteraKun avatar PacteraKun commented on August 17, 2024

@makefile
Have you used cascade-rcnn to train on your own dataset successfully?

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@PacteraKun I once trained several models, but failed to visualize the demo results. Later I ported it to a framework I am more familiar with.

from cascade-rcnn.

lzh19961031 avatar lzh19961031 commented on August 17, 2024

@makefile May I ask what the labelmap_file in your test Python file is?

from cascade-rcnn.

DetectionIIT avatar DetectionIIT commented on August 17, 2024

@GuoxingYan @zhaoweicai
I hit the same error. Have you solved it?
Many outputs are equal to -1, and the model can't be saved.

I0806 23:44:24.048591 20123 solver.cpp:219] Iteration 9900 (2.14913 iter/s, 46.5305s/100 iters), loss = 0.440841
I0806 23:44:24.048627 20123 solver.cpp:238] Train net output #0: bbox_iou = -1
I0806 23:44:24.048635 20123 solver.cpp:238] Train net output #1: bbox_iou_2nd = -1
I0806 23:44:24.048638 20123 solver.cpp:238] Train net output #2: bbox_iou_3rd = -1
I0806 23:44:24.048641 20123 solver.cpp:238] Train net output #3: bbox_iou_pre = -1
I0806 23:44:24.048645 20123 solver.cpp:238] Train net output #4: bbox_iou_pre_2nd = -1
I0806 23:44:24.048648 20123 solver.cpp:238] Train net output #5: bbox_iou_pre_3rd = -1
I0806 23:44:24.048651 20123 solver.cpp:238] Train net output #6: cls_accuracy = 0.984375
I0806 23:44:24.048655 20123 solver.cpp:238] Train net output #7: cls_accuracy_2nd = 0.972656
I0806 23:44:24.048658 20123 solver.cpp:238] Train net output #8: cls_accuracy_3rd = 0.964844
I0806 23:44:24.048666 20123 solver.cpp:238] Train net output #9: loss_bbox = 0.0117847 (* 1 = 0.0117847 loss)
I0806 23:44:24.048671 20123 solver.cpp:238] Train net output #10: loss_bbox_2nd = 0.0129223 (* 0.5 = 0.00646114 loss)
I0806 23:44:24.048676 20123 solver.cpp:238] Train net output #11: loss_bbox_3rd = 0.00699362 (* 0.25 = 0.0017484 loss)
I0806 23:44:24.048681 20123 solver.cpp:238] Train net output #12: loss_cls = 0.0294972 (* 1 = 0.0294972 loss)
I0806 23:44:24.048686 20123 solver.cpp:238] Train net output #13: loss_cls_2nd = 0.0663875 (* 0.5 = 0.0331937 loss)
I0806 23:44:24.048689 20123 solver.cpp:238] Train net output #14: loss_cls_3rd = 0.0622066 (* 0.25 = 0.0155517 loss)
I0806 23:44:24.048696 20123 solver.cpp:238] Train net output #15: rpn_accuracy = 0.999953
I0806 23:44:24.048701 20123 solver.cpp:238] Train net output #16: rpn_accuracy = -1
I0806 23:44:24.048703 20123 solver.cpp:238] Train net output #17: rpn_bboxiou = -1
I0806 23:44:24.048708 20123 solver.cpp:238] Train net output #18: rpn_loss = 0.000343773 (* 1 = 0.000343773 loss)
I0806 23:44:24.048713 20123 solver.cpp:238] Train net output #19: rpn_loss = 0 (* 1 = 0 loss)
I0806 23:44:24.048717 20123 sgd_solver.cpp:105] Iteration 9900, lr = 0.0002
I0806 23:45:10.848093 20123 solver.cpp:587] Snapshotting to binary proto file /disk1/g201708021059/cascade-rcnn/examples/voc/res101-9s-600-rfcn-cascade/log/cascadercnn_voc_iter_10000.caffemodel
*** Aborted at 1533570310 (unix time) try "date -d @1533570310" if you are using GNU date ***
PC: @ 0x7f55674532e7 caffe::Layer<>::ToProto()
*** SIGSEGV (@0x0) received by PID 20123 (TID 0x7f55682b49c0) from PID 0; stack trace: ***
@ 0x7f5565dedcb0 (unknown)
@ 0x7f55674532e7 caffe::Layer<>::ToProto()
@ 0x7f55675d7533 caffe::Net<>::ToProto()
@ 0x7f55675f415f caffe::Solver<>::SnapshotToBinaryProto()
@ 0x7f55675f42f2 caffe::Solver<>::Snapshot()
@ 0x7f55675f7f7a caffe::Solver<>::Step()
@ 0x7f55675f8994 caffe::Solver<>::Solve()
@ 0x40d4c0 train()
@ 0x408d32 main
@ 0x7f5565dd8f45 (unknown)
@ 0x409442 (unknown)
@ 0x0 (unknown)
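To see at a glance which training outputs report -1, a log in the format above can be scanned with a short script. This is a sketch only: `find_sentinel_outputs` is a hypothetical helper, the regular expression is written against the "Train net output" lines shown in this comment, and the thread does not establish what -1 means for these outputs.

```python
import re

# Matches e.g. "Train net output #0: bbox_iou = -1".
OUTPUT_PATTERN = re.compile(
    r"Train net output #(\d+): (\S+) = (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def find_sentinel_outputs(log_text, sentinel=-1.0):
    """Return (output index, name) for each output equal to sentinel."""
    flagged = []
    for match in OUTPUT_PATTERN.finditer(log_text):
        index, name, value = match.groups()
        if float(value) == sentinel:
            flagged.append((int(index), name))
    return flagged


sample_log = (
    "I0806 23:44:24.048627 20123 solver.cpp:238] Train net output #0: bbox_iou = -1\n"
    "I0806 23:44:24.048651 20123 solver.cpp:238] Train net output #6: cls_accuracy = 0.984375\n"
    "I0806 23:44:24.048703 20123 solver.cpp:238] Train net output #17: rpn_bboxiou = -1\n"
)
print(find_sentinel_outputs(sample_log))
```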

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@Emmra https://blog.csdn.net/e01528/article/details/80913443 I hope this helps. If you can, please give it a like.

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@Emmra I have not run into the problem of being unable to save the caffemodel.

from cascade-rcnn.

GuoxingYan avatar GuoxingYan commented on August 17, 2024

@lzh19961031 Can you open that detection Python script?
I tried several times on my side and could not open it. If it is convenient, could you send it to me?
[email protected]

from cascade-rcnn.

huinsysu avatar huinsysu commented on August 17, 2024

@makefile @GuoxingYan @licy5152 May I ask what long_size and short_size in train.prototxt do? Some images in my dataset are 6000 by 4000. Do I need to set these two parameters to 6000 and 4000? Thanks!

from cascade-rcnn.

huinsysu avatar huinsysu commented on August 17, 2024

@makefile @zhaoweicai Hi, I tried lowering fg_th in the BoxGroupOutput layer, but I still get the error keep_num > 0 (0 vs. 0). Could I just set fg_th to 0 in all BoxGroupOutput layers? Thanks for your help!

from cascade-rcnn.

huinsysu avatar huinsysu commented on August 17, 2024

@licy5152 Hi, when I trained the model on my own dataset, I hit the same error as you. Would you please tell me how you solved it? Thanks!

from cascade-rcnn.

lininglouis avatar lininglouis commented on August 17, 2024

@zhaoweicai May I know the intuition behind using fg_thr (i.e. when the cls_score is 0.99 or higher) to filter the bboxes? It seems that you drop all those bboxes (they don't even get into the nms_by_cls_score or proposal stage). So why drop bboxes whose cls_score is higher than 0.99 by default?

from cascade-rcnn.

elgong avatar elgong commented on August 17, 2024

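The network trains normally now, but every time I terminate the program with Ctrl+C, a root process "irq/132-nvidia" appears, using 100% CPU and 0 memory. Re-running training then hangs at the very beginning, and nvidia-smi hangs too:

5242 root -51 0 0 0 0 R 100.0 0.0 29:08.18 irq/132-nvidia
Only a reboot fixes it. Have you run into this situation?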

from cascade-rcnn.

hu5tao avatar hu5tao commented on August 17, 2024

@makefile In the end, are you satisfied with the results on your dataset? I am preparing to train on my own dataset.

from cascade-rcnn.

makefile avatar makefile commented on August 17, 2024

@hu5tao not bad.

from cascade-rcnn.

qianfangjj avatar qianfangjj commented on August 17, 2024

@licy5152 I wrote a python script CascadeRCNN-demo.py imitate the matlab code, you can modify it to use.

@makefile Hello, I tried several times and could not open the CascadeRCNN-demo.py link. Would it be convenient for you to send me a copy? [email protected] Thank you!

from cascade-rcnn.

foralliance avatar foralliance commented on August 17, 2024

@lininglouis
0.99??
I think it just drops boxes whose cls_score is lower than 0.01.

from cascade-rcnn.

leizhu1989 avatar leizhu1989 commented on August 17, 2024

When I train on my own data, I get an error, but I don't know why. Could you give me some ideas? Thanks a lot.

I0604 13:28:15.270220 87804 detection_data_layer.cpp:142] num: 0 /home/zhulei/data/VOCdevkit/VOC2007/JPEGImages/IMG_0_112.jpg 3 1080 1920 windows to process: 36, RONI windows: 0
F0604 13:28:15.274016 87804 detection_data_layer.cpp:123] Check failed: label > 0 (0 vs. 0)
*** Check failure stack trace: ***
@ 0x7fb962af05cd google::LogMessage::Fail()
@ 0x7fb962af2433 google::LogMessage::SendToLog()
@ 0x7fb962af015b google::LogMessage::Flush()
@ 0x7fb962af2e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb963271781 caffe::DetectionDataLayer<>::DataLayerSetUp()
@ 0x7fb9631c27d5 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x7fb96338b6a2 caffe::Net<>::Init()
@ 0x7fb96338dd0e caffe::Net<>::Net()
@ 0x7fb963312515 caffe::Solver<>::InitTrainNet()
@ 0x7fb963312aa4 caffe::Solver<>::Init()
@ 0x7fb963312d8f caffe::Solver<>::Solver()
@ 0x7fb963335701 caffe::Creator_SGDSolver<>()
@ 0x40d912 train()
@ 0x408795 main
@ 0x7fb961335830 __libc_start_main
@ 0x4090a9 _start
@ (nil) (unknown)

from cascade-rcnn.
