bharatsingh430 / py-r-fcn-multigpu

Stars: 193 · Watchers: 17 · Forks: 97 · Size: 9.03 MB

Code for training py-faster-rcnn and py-R-FCN on multiple GPUs in caffe

License: MIT License

CMake 1.16% Makefile 0.28% HTML 0.08% CSS 0.10% Jupyter Notebook 55.79% C++ 32.06% Shell 0.44% Python 6.37% Cuda 2.59% MATLAB 0.37% M 0.01% Protocol Buffer 0.65% C 0.11%
faster-rcnn multi-gpu object-detection

py-r-fcn-multigpu's People

Contributors

bharatpublic, bharatsingh430

py-r-fcn-multigpu's Issues

pretrained model on mscoco

I'm very happy to see a released R-FCN model on MS COCO. However, I noticed that the coco test_agnostic.prototxt didn't fit the released model well. Would you please release the right prototxt?

How to train on my own data? About the class_aware folder

@bharatsingh430 @bharatpublic
1. What is the purpose of the prototxt files in the class_aware folder? The solver.prototxt points to a train.prototxt that is not in the class_aware folder.
2. During training, I got this warning:
/home/liuchuanjian/software/py-R-FCN-multiGPU-master/tools/../lib/fast_rcnn/bbox_transform.py:23: RuntimeWarning: invalid value encountered in log
  targets_dw = np.log(gt_widths / ex_widths)
How can I fix it?
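
A frequently reported cause of this warning in py-faster-rcnn-style code is an annotation whose box has non-positive width or height (for example x2 < x1 after a coordinate offset or flip), so that the ratio passed to np.log is zero or negative. The sketch below is not part of the repository: `find_degenerate_boxes` is a hypothetical helper, and it assumes boxes are stored as rows of (x1, y1, x2, y2) with the "+1" width convention.

```python
import numpy as np

def find_degenerate_boxes(boxes):
    """Return indices of boxes whose width or height is not positive."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    # Assumed convention: width = x2 - x1 + 1, height = y2 - y1 + 1.
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    return np.where((widths <= 0) | (heights <= 0))[0]

# Illustrative annotations only, not data from the issue.
boxes = [[10, 20, 50, 80],   # valid
         [60, 30, 40, 90],   # x2 < x1, negative width
         [5, 70, 25, 10]]    # y2 < y1, negative height
bad = find_degenerate_boxes(boxes)
print(bad)
```

Running a check like this over the ground-truth boxes before training would show whether the dataset contains such entries.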

Do I need to modify the learning rate when using several GPUs?

Hi,
In Caffe, the loss is averaged over iter_size (as in batch training). Is the loss also averaged in multi-GPU training (e.g. over the number of GPUs used for training)? If not, the learning rate should stay the same as the one used on a single GPU. Am I right?

Best,
jemmy li
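
The arithmetic behind this question can be sketched as follows. It does not settle whether this repository sums or averages gradients across GPUs. It only shows how the effective batch size changes, together with the linear scaling heuristic that is often applied when gradients are averaged. The function names and the `ims_per_batch` parameter are illustrative assumptions.

```python
def effective_batch_size(num_gpus, iter_size, ims_per_batch=1):
    """Number of images contributing to one weight update."""
    return num_gpus * iter_size * ims_per_batch

def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: lr grows in proportion to batch size.
    A rule of thumb only, and only sensible if gradients are averaged."""
    return base_lr * new_batch / float(base_batch)

single = effective_batch_size(num_gpus=1, iter_size=2)
multi = effective_batch_size(num_gpus=2, iter_size=1)
print(single, multi)

scaled = linearly_scaled_lr(0.001, base_batch=1, new_batch=4)
print(scaled)
```

Under these assumptions, one GPU with iter_size 2 and two GPUs with iter_size 1 see the same number of images per update, so the same learning rate would be a reasonable starting point for both.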

Provide ResNet-152 training network scripts

Hi @bharatsingh430, could you provide ResNet-152 training network scripts for this multi-GPU R-FCN code? I downloaded the ResNet-152 model and deploy prototxt and generated a training prototxt based on the deploy version, but the loss does not seem to decrease normally during training the way it does with ResNet-101 or ResNet-50. Could you provide it? Much appreciated! PS: I noticed that when loading the ResNet-152 model with my own training prototxt, the feature-extraction layers were not copied to the R-FCN network, whereas with the original py-rfcn code this works.

Using MultiGPU Faster-RCNN

Hi, thanks for sharing the code.
I am interested in running Faster R-CNN on multiple GPUs only (without the R-FCN modification).
As I understand it, I only need to set AGNOSTIC to False, and then everything else is the same as in the original Faster R-CNN code, right?

Error when using the COCO dataset

When using the COCO dataset for training, I get this error:

File "/home/aaa/soft-nms-master/tools/../lib/rpn/anchor_target_layer.py", line 138, in forward
argmax_overlaps = overlaps.argmax(axis=1)
ValueError: attempt to get argmax of an empty sequence

I found that the error occurs because the ratio of image width to height is too large or too small.

I would like to know the solution, or how you train on this dataset.
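
If extreme aspect ratios are indeed the trigger, one possible workaround is to drop such images from the training set before training. The sketch below is hypothetical and not taken from the repository: it assumes each roidb entry is a dict with 'width' and 'height' keys, and the ratio bounds are arbitrary placeholders to tune.

```python
def filter_by_aspect_ratio(roidb, min_ratio=0.15, max_ratio=6.0):
    """Keep only entries whose width/height ratio lies within the bounds."""
    kept = []
    for entry in roidb:
        ratio = entry['width'] / float(entry['height'])
        if min_ratio <= ratio <= max_ratio:
            kept.append(entry)
    return kept

# Illustrative entries only, not data from the issue.
roidb = [{'width': 640, 'height': 480},    # ratio about 1.33, kept
         {'width': 2000, 'height': 100},   # ratio 20, dropped
         {'width': 50, 'height': 900}]     # ratio about 0.06, dropped
kept = filter_by_aspect_ratio(roidb)
print(len(kept))
```

Filtering changes the training set, so it is worth logging how many images are removed.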

The detection results

Hi Bharat,
Glad to see a multi-GPU version of Faster R-CNN. Can you provide the test detection mAP on the PASCAL VOC dataset, or on COCO? I'm curious whether this version is better than or comparable to the single-GPU version.

caffe version

@bharatsingh430
@bharatpublic
Hi,

In the README, you point out:
Please use the version of caffe matches with this repository. I have merged many files between the latest version of caffe and py-r-fcn.

Which version of Caffe does "the latest version of caffe" refer to? Is it BVLC Caffe?

How to download the ImageNet pretrained model

Hi, in your README you mention "Please download ImageNet-pre-trained ResNet-50 and ResNet-100 model manually, and put them into ." Do you have a link to a pretrained model? With the ones I downloaded from the net, I cannot get R-FCN to run, and I don't know what is going wrong. Any help would be much appreciated!

About the coco branch and the "Results on MS-COCO" section

@bharatsingh430
@bharatpublic
Hi,

About the "Results on MS-COCO" section: using the trained model you provided,
 1. On the master branch, the test result was 29.0.
 2. On the coco branch, the test result was 30.6 (similar to the results you provided).

I have several questions:

On the coco branch, the following code was added to lib/fast_rcnn/test.py:

print '++++++ evaluate from stored json ++++++'
# pdb.set_trace()
imdb.evaluate_detections2(output_dir + '_coco/detections_val2014_results_e64e37ea-2268-432d-b3d9-581abbca029b.json', output_dir)
return

Running it as is reports an error:
IOError: [Errno 2] No such file or directory: '/home/jmx/py-R-FCN-multiGPU-coco-branch/output/rfcn_end2end_ohem/coco_2014_minival/coco_rfcn_coco/detections_val2014_results_e64e37ea-2268-432d-b3d9-581abbca029b.json'

When I delete the added code, the test runs normally and the result is 30.6.

Why does this happen? Why does the test run once this code is deleted? What is the purpose of the added code?

The difference between the two branches is 1.6 points (29.0 vs. 30.6). Why such a big gap?
Comparing the two branches, I found that parameters in many files (yml, prototxt) have changed. In addition to these changes, the following code has also changed:
 1. lib/datasets/coco.py
 2. lib/datasets/lmdb.py
 3. lib/pycocotools/cocoeval.py
 4. lib/roi_data_layer/layer.py

Are the changes in the code above significant for the results? Could you explain them?

I would really appreciate your reply.

Problem of multi-GPU "'NCCL' has no attribute 'new_uid'"

Hi, I cloned your code and installed NCCL. However, when I run train_multi_gpu.py, I get an error:

Traceback (most recent call last):
File "./tools/train_net_multi_gpu.py", line 109, in
max_iter=args.max_iters, gpus=gpus)
File "/home/ultron/py-R-FCN-multiGPU/tools/../lib/fast_rcnn/train_multi_gpu.py", line 205, in train_net_multi_gpu
uid = caffe.NCCL.new_uid()
AttributeError: type object 'NCCL' has no attribute 'new_uid'

I think I installed NCCL correctly. I found the same problem reported in another issue, but the suggestions there did not work for me.

Can't parse message of type "caffe.NetParameter" error

Hi, my setup is a single 1070 GPU, CUDA 8.0, cuDNN 5.0.

I followed your 4-step installation:
git clone --recursive https://github.com/bharatsingh430/py-R-FCN-multiGPU/
installed NCCL, successfully compiled Caffe, and built the Cython modules
downloaded resnet101_rfcn_final.caffemodel

but I get an error when executing
$RFCN/tools/demo_rfcn.py

I0510 17:45:45.567600 11746 net.cpp:202] conv1 does not need backward computation.
I0510 17:45:45.567605 11746 net.cpp:202] input does not need backward computation.
I0510 17:45:45.567610 11746 net.cpp:244] This network produces output bbox_pred
I0510 17:45:45.567615 11746 net.cpp:244] This network produces output cls_prob
I0510 17:45:45.567797 11746 net.cpp:257] Network initialization done.

[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "caffe.NetParameter" because it is missing required fields: layer[494].psroi_pooling_param.output_dim, layer[494].psroi_pooling_param.group_size
F0510 17:45:45.759696 11746 upgrade_proto.cpp:95] Check failed: ReadProtoFromBinaryFile(param_file, param) Failed to parse NetParameter file: /home/stream/py-R-FCN-multiGPU/data/rfcn_models/resnet101_rfcn_final.caffemodel

Any idea how to solve this?

Error in the VOC ResNet-50 prototxt

        num_output: 392 #8*(7^2) cls_num*(score_maps_size^2)

Shouldn't this be 1096? There are 21 classes.
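
The arithmetic behind these numbers can be sketched as follows, assuming the usual R-FCN layout with k x k position-sensitive score maps (k = 7 here), a classification layer with one group of maps per class, and a class-agnostic box layer with 4 coordinates for each of 2 groups (background and foreground). Under that reading, 392 would be the size of the box-regression layer rather than the classification layer, and the trailing comment in the prototxt would simply be a copied note. The helper names are illustrative only.

```python
def cls_num_output(num_classes, k=7):
    """Channels of the position-sensitive classification layer."""
    return num_classes * k * k

def bbox_num_output(num_classes, k=7, agnostic=True):
    """Channels of the position-sensitive box-regression layer."""
    groups = 2 if agnostic else num_classes
    return 4 * groups * k * k

cls_out = cls_num_output(21)                         # 21 * 49
bbox_agnostic = bbox_num_output(21)                  # 8 * 49
bbox_aware = bbox_num_output(21, agnostic=False)     # 84 * 49
print(cls_out, bbox_agnostic, bbox_aware)
```

With 21 classes this gives 1029 channels for classification, 392 for class-agnostic boxes, and 4116 for class-aware boxes.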

How to solve the "overflow encountered" RuntimeWarning?

Hello,
I use this code to train on VOC0712, and a RuntimeWarning is produced that needs to be fixed.
The output is:
I0324 18:50:32.082937 8933 solver.cpp:219] Iteration 0 (0 iter/s, 0.709095s/20 iters), loss = 4.24172
I0324 18:50:32.082983 8933 solver.cpp:238] Train net output #0: accuarcy = 0
I0324 18:50:32.082993 8933 solver.cpp:238] Train net output #1: loss_bbox = 1.67801e-05 (* 1 = 1.67801e-05 loss)
I0324 18:50:32.082998 8933 solver.cpp:238] Train net output #2: loss_cls = 3.04193 (* 1 = 3.04193 loss)
I0324 18:50:32.083003 8933 solver.cpp:238] Train net output #3: rpn_cls_loss = 0.779677 (* 1 = 0.779677 loss)
I0324 18:50:32.083009 8933 solver.cpp:238] Train net output #4: rpn_loss_bbox = 0.493717 (* 1 = 0.493717 loss)
I0324 18:50:32.155719 8933 sgd_solver.cpp:105] Iteration 0, lr = 0.001
/home/dmt/FV/py-R-FCN/tools/../lib/fast_rcnn/bbox_transform.py:47: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/dmt/FV/py-R-FCN/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/dmt/FV/py-R-FCN/tools/../lib/fast_rcnn/bbox_transform.py:47: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/dmt/FV/py-R-FCN/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
I0324 18:50:45.361021 8933 solver.cpp:219] Iteration 20 (1.50624 iter/s, 13.2781s/20 iters), loss = nan
I0324 18:50:45.361094 8933 solver.cpp:238] Train net output #0: accuarcy = 0
I0324 18:50:45.361110 8933 solver.cpp:238] Train net output #1: loss_bbox = nan (* 1 = nan loss)
I0324 18:50:45.361119 8933 solver.cpp:238] Train net output #2: loss_cls = 87.3365 (* 1 = 87.3365 loss)
I0324 18:50:45.361126 8933 solver.cpp:238] Train net output #3: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0324 18:50:45.361137 8933 solver.cpp:238] Train net output #4: rpn_loss_bbox = 1.85539e+33 (* 1 = 1.85539e+33 loss)
I0324 18:50:45.446388 8933 sgd_solver.cpp:105] Iteration 20, lr = 0.001

My environment is as follows:
Ubuntu 16.04LTS
CUDA 9.1
CUDNN 7.1
NVIDIA Tesla P100 GPU.
...
I have tried changing base_lr and switching to 0-based boxes, but neither helps. There is no runtime warning with a single GPU.
Can you help me fix it?
Thanks!
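
The overflow itself comes from exponentiating very large predicted deltas. A common mitigation in other detection codebases is to clamp dw and dh before calling np.exp, and the sketch below shows that idea in isolation. It is not the repository's bbox_transform_inv, and the clip value is an assumption. Clamping only silences the overflow: a loss that has already become nan points to diverging training, which clamping does not fix.

```python
import numpy as np

# Assumed upper bound on log-scale deltas: a box may grow at most 1000/16 times.
MAX_DELTA = np.log(1000.0 / 16.0)

def decode_sizes(widths, heights, dw, dh, max_delta=MAX_DELTA):
    """Decode predicted widths and heights with clamped log-scale deltas."""
    dw = np.minimum(dw, max_delta)
    dh = np.minimum(dh, max_delta)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h

# Illustrative values: the second row has an absurdly large delta.
widths = np.array([16.0, 32.0])
heights = np.array([16.0, 64.0])
dw = np.array([[0.0], [1e4]])
dh = np.array([[0.0], [1e4]])
pred_w, pred_h = decode_sizes(widths, heights, dw, dh)
print(pred_w.ravel(), pred_h.ravel())
```

Without the clamp, np.exp(1e4) overflows to inf and triggers the same RuntimeWarning as in the log above.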

Problem of multi-GPU "'NCCL' has no attribute 'new_uid'"

Hi, I cloned your code and installed NCCL. However, when I run train_multi_gpu.py, I get an error:

File "/home/bo718.wang/xiangyu.zhu/py-R-FCN-multiGPU-soft-nms/tools/../lib/fast_rcnn/train_multi_gpu.py", line 205, in train_net_multi_gpu
uid = caffe.NCCL.New_Uid()
AttributeError: type object 'NCCL' has no attribute 'New_Uid'

To make sure I had installed NCCL successfully, I typed the following commands:

caffe.NCCL
<class 'caffe._caffe.NCCL'>
caffe.NCCL.new_uid
Traceback (most recent call last):
File "", line 1, in
AttributeError: type object 'NCCL' has no attribute 'new_uid'

caffe.NCCL is available, but new_uid cannot be found in it.
Could you help me figure it out? Thanks very much.
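
A quick way to narrow this down is to check which pycaffe build Python actually imports and whether its NCCL class exposes new_uid. In upstream BVLC Caffe the Python NCCL methods appear to be compiled in only when Caffe is built with USE_NCCL enabled, so a class without new_uid may indicate a build without that flag, or a different pycaffe earlier on PYTHONPATH. Treat that as a hypothesis to verify. `describe_nccl_binding` is a hypothetical helper, not part of Caffe, and the stand-in class exists only so the sketch runs without Caffe installed.

```python
def describe_nccl_binding(caffe_module):
    """Summarize where the module was loaded from and what NCCL exposes."""
    nccl = getattr(caffe_module, 'NCCL', None)
    return {
        'module_file': getattr(caffe_module, '__file__', None),
        'has_nccl_class': nccl is not None,
        'has_new_uid': nccl is not None and hasattr(nccl, 'new_uid'),
    }

# Stand-in that mimics the situation in the issue: NCCL exists, new_uid does not.
class _FakeCaffe(object):
    __file__ = '/path/to/caffe/__init__.py'  # placeholder path

    class NCCL(object):
        pass

report = describe_nccl_binding(_FakeCaffe)
print(report)
```

With Caffe installed, the same helper can be called as `import caffe; print(describe_nccl_binding(caffe))` to see which build is being picked up.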

Can't parse message of type "caffe.NetParameter"

I run the demo and get this error. Please help.

Command line is:
./tools/test_net.py --gpu 0 --def models/pascal_voc/ResNet-50/rfcn_end2end/test_agnostic.prototxt --net resnet50_rfcn_final.caffemodel --imdb coco_2014_test --cfg experiments/cfgs/rfcn_end2end_ohem_pascal_voc.yml --set TEST.SOFT_NMS 1

error:
I0305 14:36:16.657363 5021 net.cpp:244] This network produces output bbox_pred
I0305 14:36:16.657366 5021 net.cpp:244] This network produces output cls_prob
I0305 14:36:16.657456 5021 net.cpp:257] Network initialization done.
[libprotobuf ERROR google/protobuf/message_lite.cc:121] Can't parse message of type "caffe.NetParameter" because it is missing required fields: layer[256].psroi_pooling_param.output_dim, layer[256].psroi_pooling_param.group_size
F0305 14:36:16.719847 5021 upgrade_proto.cpp:95] Check failed: ReadProtoFromBinaryFile(param_file, param) Failed to parse NetParameter file: resnet50_rfcn_final.caffemodel
*** Check failure stack trace: ***
Aborted (core dumped)

Thank you

GPU memory not released after interrupting the training script

Hi @bharatsingh430, I ran into a problem where GPU memory is not released after I interrupt the training script. In detail: I used 2 GPUs, [0,1]. When I pressed Ctrl+C to stop the training script and then ran nvidia-smi to check GPU usage, only GPU 1 had released its memory, while GPU 0 still held the allocated memory. Even after waiting a long time, the problem persists. What could cause this, and how can I fix it? PS: I tried killing the Python process, but it did not work. Waiting for your help! Thank you very much!

Any pascal voc results?

Hi,

I ran this code on PASCAL VOC, training on VOC 07+12 trainval and testing on VOC 07 test.
I get 78.3 mAP (averaged over three experiments), which is a little lower than the results reported by py-r-fcn.

single GPU, iter_size 2: 78.3 (vs 79.4)
two GPUs, iter_size 1: 78.3

Best,
jemmy

Cannot allocate memory

Hello,
I tried your code and it produces the following error:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./tools/train_fast.py", line 143, in train_fast_rcnn
    max_iter=max_iters, gpus=gpus)
  File "/home/tuan/Downloads/mGPU_faster-RCNN/py-R-FCN-multiGPU/tools/../lib/fast_rcnn/train_multi_gpu.py", line 214, in train_net_multi_gpu
    p.start()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Can you let me know if I am missing anything needed to get it working?
Thank you

Error when using the ZF model instead of VGG16

I used the code to train my own model, but there is an error related to the log output.
When you train your own model with this code, the final output is "Wrote snapshot to: xxx",
whereas with py-faster-rcnn the final output is "done solving".
This difference leads to an error in the test phase.
The command and its output:
The code:

  • ./tools/test_net.py --gpu --gpu 0,1,2,3 --def models/pascal_voc/VGG16/faster_rcnn_end2end/test.prototxt --net --imdb voc_2007_test --cfg experiments/cfgs/faster_rcnn_end2end.yml
    imagenet_train
    <function at 0x7f69d313e2a8>
    imagenet_val
    <function at 0x7f69d313e320>
    usage: test_net.py [-h] [--gpu GPU_ID] [--def PROTOTXT] [--net CAFFEMODEL]
    [--cfg CFG_FILE] [--wait WAIT] [--imdb IMDB_NAME] [--comp]
    [--set ...] [--vis] [--num_dets MAX_PER_IMAGE]
    [--rpn_file RPN_FILE]
    test_net.py: error: argument --gpu: expected one argument

As you can see, --net is passed without a value in the command above, and the parser reports "expected one argument".
Can anyone help solve this?
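
The parser error can be reproduced in isolation. The sketch below builds a minimal parser that mirrors only two of the options from the usage message above (the types and defaults are assumptions) and shows that the arguments parse once --gpu receives a single value and --net receives a path. The model file name is a placeholder.

```python
import argparse

def build_parser():
    """Minimal stand-in for the test_net.py parser (assumed types/defaults)."""
    parser = argparse.ArgumentParser(prog='test_net.py')
    parser.add_argument('--gpu', dest='gpu_id', type=int, default=0)
    parser.add_argument('--net', dest='caffemodel', type=str, default=None)
    return parser

def parses_ok(argv):
    """Return True if argv parses, False if argparse exits with an error."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        return False

bad = ['--gpu', '--gpu', '0,1,2,3', '--net']          # as in the failing command
good = ['--gpu', '0', '--net', 'model.caffemodel']    # placeholder model path
print(parses_ok(bad), parses_ok(good))
```

In the failing command, --gpu appears twice and --net is empty, which suggests the calling script did not manage to fill in the GPU id and the final model path before invoking test_net.py.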

Error appears when running demo_rfcn.py

Hello, is there anything wrong with the R-FCN model you provided?
When I run 'demo_rfcn.py', the following message appears:
(screenshot of the error message, not preserved)

When I run 'demo.py' for the Faster R-CNN demo, everything runs fine.
