wsolevaluation's Introduction

Evaluating Weakly Supervised Object Localization Methods Right (CVPR 2020)

CVPR 2020 paper | TPAMI paper

Junsuk Choe1,3*, Seong Joon Oh2*, Seungho Lee1, Sanghyuk Chun3, Zeynep Akata4, Hyunjung Shim1
* Equal contribution

1 School of Integrated Technology, Yonsei University
2 Clova AI Research, LINE Plus Corp. 3 Clova AI Research, NAVER Corp. 4 University of Tübingen

Weakly-supervised object localization (WSOL) has gained popularity over the last years for its promise to train localization models with only image-level labels. Since the seminal WSOL work of class activation mapping (CAM), the field has focused on how to expand the attention regions to cover objects more broadly and localize them better. However, these strategies rely on full localization supervision for validating hyperparameters and for model selection, which is in principle prohibited under the WSOL setup. In this paper, we argue that the WSOL task is ill-posed with only image-level labels, and propose a new evaluation protocol where full supervision is limited to a small held-out set that does not overlap with the test set. We observe that, under our protocol, the five most recent WSOL methods have not made a major improvement over the CAM baseline. Moreover, we report that existing WSOL methods have not reached the few-shot learning baseline, where the full supervision at validation time is used for model training instead. Based on our findings, we discuss some future directions for WSOL.


Overview of WSOL performances 2016-2019. The figure above shows that recent improvements in WSOL are illusory due to (1) different amounts of implicit full supervision through validation and (2) a fixed score-map threshold used to generate object boxes. Under our evaluation protocol, with the same validation set sizes and an oracle threshold for each method, CAM is still the best. In fact, our few-shot learning baseline, i.e., using the validation supervision (10 samples/class) at training time, outperforms existing WSOL methods.


1. Our dataset contribution

WSOL is an ill-posed problem when only image-level labels are available (see the paper for the argument). To solve the WSOL task, a certain amount of full supervision is inevitable, and prior WSOL approaches have utilized different amounts of implicit and explicit full supervision (usually through validation). We propose to fix the amount of full supervision per method by carefully designing validation splits (called train-fullsup in the paper), such that different methods use the same amount of localization-labelled validation data.

In this section, we explain how each dataset is split, and introduce our data contributions (image collections and new annotations) on the way.

The dataset splits

split | ImageNet | CUB | OpenImages
train-weaksup | ImageNet "train" | CUB-200-2011 "train" | OpenImages30k "train" (we curated the data)
train-fullsup | ImageNetV2 (we collected the annotations) | CUBV2 (we collected the images and annotations) | OpenImages30k "val" (we curated the data)
test | ImageNet "val" | CUB-200-2011 "test" | OpenImages30k "test" (we curated the data)

We propose three disjoint splits for every dataset: train-weaksup, train-fullsup, and test. The train-weaksup split contains images with weak supervision (the image-level labels). The train-fullsup split contains images with full supervision (either bounding boxes or binary masks). Users are free to use it for hyperparameter search, model selection, ablative studies, or even model fitting. The test split contains images with full supervision; it must be used only for the final performance report. For example, checking the test results multiple times with different model configurations violates the protocol, as the learner implicitly uses more full supervision than allowed. The splits and their roles are explained more extensively in the paper.

  • ImageNet
    • "train" and "val" splits of original ImageNet are treated as our train-weaksup and test.
    • ImageNetV2 is treated as our train-fullsup. Note that we have annotated bounding boxes on ImageNetV2.
  • CUB
    • "train" and "test" splits of original CUB-200-2011 are treated as our train-weaksup and test.
    • We contribute new images and annotations similar to the original CUB, namely CUBV2.
  • OpenImages
    • We curate the existing OpenImagesV5 for the task of WSOL.
    • We have randomly selected images from the original "train", "val", and "test" splits of the instance segmentation subset.

2. Dataset downloading and license

For the original ImageNet and CUB datasets, please follow the common procedure for downloading them. In this section, we only explain how to obtain the less-used (or newly introduced) datasets. We also provide the license status for each dataset. This section is for those who are interested in the full data for each dataset. If the aim is to utilize the data for WSOL evaluation and/or training, please follow the links below:

ImageNetV2

Download images

We utilize 10,000 images in the Threshold0.7 split of ImageNetV2 for our train-fullsup split. We have annotated bounding boxes on those images. The box labels are available here and are licensed by NAVER Corp. under Attribution 2.0 Generic (CC-BY-2.0).

CUBV2

Download images

We have collected and annotated CUBV2 on our own as the train-fullsup split. We have ensured that the data distribution follows the original CUB dataset and that there are no duplicate images. We have collected 5 images per class (1,000 images in total) from Flickr. Box labels and license files for all images are available here. Both class and box labels are licensed by NAVER Corp. under Attribution 2.0 Generic (CC-BY-2.0).

OpenImages30k

Download images
Download segmentation masks

The WSOL community has relied on the ImageNet and CUB datasets for at least the last three years. It is perhaps time for us to move on. We provide a WSOL benchmark based on the OpenImages30k dataset to offer a new perspective on the generalizability of past and future WSOL methods. To make it suitable for the WSOL task, we use 100 classes to ensure a minimum number of single-class samples per class. We have randomly selected 29,819, 2,500, and 5,000 images from the original "train", "val", and "test" splits of OpenImagesV5. The corresponding metadata can be found here. The annotations are licensed by Google LLC under Attribution 4.0 International (CC-BY-4.0). The images are listed as having an Attribution 2.0 Generic (CC-BY-2.0) license.

Dataset statistics

The table below summarizes the dataset statistics for each split.

# images per class | ImageNet (1,000 classes) | CUB (200 classes) | OpenImages (100 classes)
train-weaksup | ~1,200 | ~30 | ~300
train-fullsup | 10 | ~5 | 25
test | 10 | ~29 | 50

Licenses

The licenses corresponding to our dataset contributions are summarized as follows:

Dataset | Images | Class Annotations | Localization Annotations
ImageNetV2 | See the original GitHub repository | See the original GitHub repository | CC-BY-2.0 NAVER Corp.
CUBV2 | Follows original image licenses (see here) | CC-BY-2.0 NAVER Corp. | CC-BY-2.0 NAVER Corp.
OpenImages | CC-BY-2.0 (follows original image licenses; see here) | CC-BY-4.0 Google LLC | CC-BY-4.0 Google LLC

Detailed license files are summarized in the release directory.

Note: At the time of collection, images were marked as being licensed under the following licenses:

Attribution-NonCommercial License
Attribution License
Public Domain Dedication (CC0)
Public Domain Mark

However, we make no representations or warranties regarding the license status of each image. You should verify the license for each image yourself.

3. Code dependencies

Both the evaluation-only and eval+train scripts require only the following libraries:

pip freeze returns the version information as below:

munch==2.5.0
numpy==1.18.1
opencv-python==4.1.2.30
Pillow==7.0.0
six==1.14.0
torch==1.4.0
torchvision==0.5.0

4. WSOL evaluation

We support evaluation of weakly-supervised object localization (WSOL) methods on CUB, ImageNet, and OpenImages. The main script for evaluation is evaluation.py. We will show how to download the train-fullsup (validation) and test set images and localization annotations. An example evaluation script will be provided.

Prepare evaluation data

WSOL evaluation data consist of images and corresponding localization ground truths. On CUB and ImageNet, they are given as boxes, and on OpenImages, they are given as binary masks.

To prepare the evaluation data, first download the ImageNet "val" split from here and put the downloaded file at dataset/ILSVRC2012_img_val.tar.

Then, run the following command

./dataset/prepare_evaluation_data.sh

The script will download the train-fullsup (validation) and test images into dataset. Metadata and box annotations already exist in this repository under metadata. The OpenImages mask annotations are also downloaded by the above script and will be saved under dataset alongside the images.

The structure of image files looks like

dataset
└── ILSVRC
    └── val2
        └── 0
            ├── 0.jpeg
            ├── 1.jpeg
            └── ...
        └── 1
        └── ...
    └── val
        ├── ILSVRC2012_val_00000001.JPEG
        ├── ILSVRC2012_val_00000002.JPEG
        └── ...
└── CUB
    └── 001.Black_footed_Albatross
        ├── Black_Footed_Albatross_0046_18.jpg
        ├── Black_Footed_Albatross_0002_55.jpg
        └── ...
    └── 002.Laysan_Albatross
    └── ...
└── OpenImages
    └── val
        └── 0bt_c3
            ├── 1cd9ac0169ec7df0.jpg
            ├── 1cd9ac0169ec7df0_ignore.png
            ├── 1cd9ac0169ec7df0_m0bt_c3_6932e993.png
            └── ...
        └── 0bt9lr
        └── ...
    └── test   
        └── 0bt_c3
            ├── 0a51958fcd523ae4.jpg
            ├── 0a51958fcd523ae4_ignore.png
            ├── 0a51958fcd523ae4_m0bt_c3_41344f12.png
            ├── 0a51958fcd523ae4_m0bt_c3_48f37c0f.png
            └── ...
        └── 0bt9lr
        └── ...

Prepare heatmaps to evaluate

Our WSOL evaluation takes heatmaps of the same width and height as the input images. The evaluation script requires the heatmaps to meet the following criteria (a sketch of producing such files is given after the list):

  1. Heatmap file structure.
  • Heatmaps shall be located at the user-defined <heatmap_root>.
  • The <heatmap_root> folder contains the heatmap files, with file names dictated by the metadata/<dataset>/<split>/image_ids.txt files.
  • If an image_id contains slashes (/), e.g. val2/995/0.jpeg, then the corresponding heatmap shall be located in the corresponding sub-directory, e.g. <heatmap_root>/val2/995/0.npy.
  2. Heatmap data type.
  • Each heatmap file should be a .npy file that can be loaded as a numpy array with numpy.load().
  • The array shall be a two-dimensional array of shape (height, width), the same as the input image size.
  • The array shall be of type np.float.
  • The array values must be between 0 and 1.
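
As an illustration, here is a minimal sketch (not part of the repository) of producing score-map files in this format. The helper name save_heatmap and the min-max normalization are assumptions; adapt them to however your method produces score maps.

import os

import cv2
import numpy as np

def save_heatmap(scoremap, image_id, heatmap_root, image_size=(224, 224)):
    # Resize the raw score map to the input image size (cv2.resize expects
    # (width, height)), min-max normalize it to [0, 1], and save it as
    # <heatmap_root>/<image_id with a .npy extension>.
    resized = cv2.resize(scoremap.astype(np.float32), image_size,
                         interpolation=cv2.INTER_CUBIC)
    normalized = (resized - resized.min()) / (resized.max() - resized.min() + 1e-12)
    npy_path = os.path.join(heatmap_root, os.path.splitext(image_id)[0] + '.npy')
    os.makedirs(os.path.dirname(npy_path), exist_ok=True)
    np.save(npy_path, normalized.astype(float))

# Example: save_heatmap(cam, 'val2/995/0.jpeg', 'train_log/scoremaps/')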

Evaluate your heatmaps

We support three datasets, CUB, ImageNet, and OpenImages.

On CUB and ImageNet, we evaluate the MaxBoxAcc, the maximal box accuracy at the optimal heatmap threshold. The box accuracy is the ratio of images whose box, generated from the heatmap, overlaps with the ground-truth box with an IoU of at least 0.5. Please see the code and paper for the full details, and the sketch below for intuition.
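
For intuition only, below is a minimal sketch of the box accuracy at a single threshold. It is a simplification (one box per image via cv2.boundingRect), not the repository's implementation, which additionally sweeps score-map thresholds, supports multiple contours, and handles several IoU thresholds.

import cv2
import numpy as np

def iou(box_a, box_b):
    # Boxes are given as (x0, y0, x1, y1).
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / float(union)

def box_accuracy(heatmaps, gt_boxes, threshold, iou_threshold=0.5):
    # Ratio of images whose estimated box overlaps the ground-truth box
    # with IoU >= iou_threshold.
    hits = 0
    for cam, gt_box in zip(heatmaps, gt_boxes):
        x, y, w, h = cv2.boundingRect(np.uint8(cam >= threshold))
        hits += iou((x, y, x + w, y + h), gt_box) >= iou_threshold
    return hits / float(len(heatmaps))

# MaxBoxAcc is then the maximum of box_accuracy over a grid of thresholds
# (spaced by cam_curve_interval).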

On OpenImages, we evaluate the PxAP, pixel average precision. We generate the pixel-wise precision-recall curve, and compute the area under the curve. Please see the code and paper for the full details.
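
Likewise, a minimal sketch of PxAP, under the assumption that all pixels are sorted by score; the shipped evaluation instead sweeps a fixed grid of thresholds (cam_curve_interval) and handles ignore regions.

import numpy as np

def pxap(heatmaps, masks):
    # Pixel-wise average precision: every pixel is a sample, its score is the
    # heatmap value, and its label is the binary ground-truth mask value.
    scores = np.concatenate([h.ravel() for h in heatmaps])
    labels = np.concatenate([m.ravel() for m in masks]).astype(bool)
    order = np.argsort(-scores)                      # descending score
    tp = np.cumsum(labels[order]).astype(float)
    precision = tp / np.arange(1, tp.size + 1)
    recall = tp / max(labels.sum(), 1)
    # Area under the precision-recall curve, integrated over recall increments.
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))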

We present an example call to the evaluation API below:

python evaluation.py --scoremap_root=train_log/scoremaps/ \
                     --metadata_root=metadata/ \
                     --mask_root=dataset/ \
                     --dataset_name=CUB \
                     --split=val \
                     --cam_curve_interval=0.01

If the CUB evaluation data have been downloaded into dataset using the download script above and the corresponding heatmaps are saved under train_log/scoremaps/, this call evaluates the MaxBoxAcc.

Testing the evaluation code

The test code for the evaluation modules is given in evaluation_test.py. The unit tests ensure the correctness of the evaluation logic and help prevent unnoticed changes in the behaviour of underlying libraries (e.g. OpenCV, NumPy). To run the unit tests, run

nosetests

You may need to install nose first (pip3 install nose).

5. Library of WSOL methods

We support the training and evaluation of the following weakly-supervised object localization (WSOL) methods. Our implementation of the methods can be found in the wsol folder. Please add your own WSOL method to the list by making a pull request.

We provide the full training and evaluation scripts on the provided WSOL methods. Details will be explained in the next section.

Method | Paper | Original code
Class-Activation Mapping (CAM) | CVPR'16 | Code
Hide-and-Seek (HaS) | ICCV'17 | Code
Adversarial Complementary Learning (ACoL) | CVPR'18 | Code
Self-Produced Guidance (SPG) | ECCV'18 | Code
Attention-based Dropout Layer (ADL) | CVPR'19 | Code
CutMix | ICCV'19 | Code

Evaluation of WSOL methods. How much have WSOL methods improved upon the vanilla CAM model? MaxBoxAccV2 and PxAP performances over the test split are shown, relative to the vanilla CAM performance. We recommend MaxBoxAccV2 over the original box metric MaxBoxAcc used in the CVPR version. For details, see the latest arXiv version. Hyperparameters have been optimized over the identical train-fullsup split for all WSOL methods and the FSL baseline: (10, 5, 5) full supervision/class for (ImageNet, CUB, OpenImages). Note that we evaluate the last checkpoint of each training session. More detailed results and the corresponding hyperparameter sets are available here.

6. WSOL training and evaluation

We describe the data preparation and training scripts for the above six prior WSOL methods.

Prepare train+eval datasets

Our repository enables evaluation and training of WSOL methods on two commonly-used benchmarks, CUB and ImageNet, and our newly-introduced benchmark OpenImages. We describe below how to prepare those datasets.

ImageNet

Both the original ImageNet and ImageNetV2 are required for WSOL training. Note that the "val" split of the original ImageNet is used as our test split, and ImageNetV2 is used as the validation split (train-fullsup) in our framework.

To prepare the ImageNet data, download the ImageNet "train" and "val" splits from here and put the downloaded files at dataset/ILSVRC2012_img_train.tar and dataset/ILSVRC2012_img_val.tar.

Then, run the following command in the root directory to extract the images.

./dataset/prepare_imagenet.sh

You may need to install GNU parallel first (apt-get install parallel).

The structure of image files looks like

dataset
└── ILSVRC
    └── train
        └── n01440764
            ├── n01440764_10026.JPEG
            ├── n01440764_10027.JPEG
            └── ...
        └── n01443537
        └── ...
    └── val2
        └── 0
            ├── 0.jpeg
            ├── 1.jpeg
            └── ...
        └── 1
        └── ...
    └── val
        ├── ILSVRC2012_val_00000001.JPEG
        ├── ILSVRC2012_val_00000002.JPEG
        └── ...

The corresponding annotation files can be found here.

CUB

Both the original CUB-200-2011 and our CUBV2 datasets are required for WSOL training. Note that CUBV2 is used as the validation split (train-fullsup). Run the following command in the root directory to download the original CUB dataset and extract the image files.

./dataset/prepare_cub.sh

Note: you can also download the CUBV2 dataset from here. Put the downloaded file at dataset/CUBV2.tar and then run the above script.

The structure of image files looks like

dataset
└── CUB
    └── 001.Black_footed_Albatross
        ├── Black_Footed_Albatross_0001_796111.jpg
        ├── Black_Footed_Albatross_0002_55.jpg
        └── ...
    └── 002.Laysan_Albatross
    └── ...

The corresponding annotation files can be found here.

OpenImages

We provide a new WSOL benchmark, OpenImages30k, based on OpenImagesV5.

To download and extract the files, run the following command in the root directory

./dataset/prepare_openimages.sh

Note: you can also download the OpenImages30k dataset from here (images, masks). Put the downloaded OpenImages_images.zip and OpenImages_annotations.zip files in the dataset directory and run the above script.

The structure of image files looks like:

dataset
└── OpenImages
    └── train
        └── 0bt_c3
            ├── 0a9b7df4d832baf7.jpg
            ├── 0abee225b2418fe7.jpg
            └── ...
        └── 0bt9lr
        └── ...
    └── val
        └── 0bt_c3
            ├── 1cd9ac0169ec7df0.jpg
            ├── 1cd9ac0169ec7df0_ignore.png
            ├── 1cd9ac0169ec7df0_m0bt_c3_6932e993.png
            └── ...
        └── 0bt9lr
        └── ...
    └── test   
        └── 0bt_c3
            ├── 0a51958fcd523ae4.jpg
            ├── 0a51958fcd523ae4_ignore.png
            ├── 0a51958fcd523ae4_m0bt_c3_41344f12.png
            ├── 0a51958fcd523ae4_m0bt_c3_48f37c0f.png
            └── ...
        └── 0bt9lr
        └── ...

The corresponding annotation files can be found here.

Run train+eval

We support the following architecture and method combinations:

  • Architectures.

    • vgg16
    • inception_v3
    • resnet50
  • Methods (see Library of WSOL methods and paper for descriptions).

    • cam
    • has
    • acol
    • spg
    • adl
    • cutmix

Below is an example command line for the train+eval script.

python main.py --dataset_name OpenImages \
               --architecture vgg16 \
               --wsol_method cam \
               --experiment_name OpenImages_vgg16_CAM \
               --pretrained TRUE \
               --num_val_sample_per_class 5 \
               --large_feature_map FALSE \
               --batch_size 32 \
               --epochs 10 \
               --lr 0.00227913316 \
               --lr_decay_frequency 3 \
               --weight_decay 5.00E-04 \
               --override_cache FALSE \
               --workers 4 \
               --box_v2_metric True \
               --iou_threshold_list 30 50 70 \
               --eval_checkpoint_type last

See config.py for the full descriptions of the arguments, especially the method-specific hyperparameters.

During training, we evaluate the model on the train-fullsup split at every epoch and save a checkpoint (best_checkpoint.pth.tar) if the localization performance surpasses every previous score. We also save the last checkpoint (last_checkpoint.pth.tar) when training finishes. You can select the checkpoint type for evaluation on the test split by setting the eval_checkpoint_type argument accordingly. We suggest using the last checkpoint for evaluation.
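
A minimal sketch of the checkpointing logic described above; the helper name and state layout are assumptions, and main.py is the authoritative implementation.

import torch

def save_checkpoints(model, epoch, val_localization, best_so_far, log_folder):
    # Always overwrite the last checkpoint; additionally snapshot the model
    # whenever the validation localization score improves on the best so far.
    state = {'epoch': epoch, 'state_dict': model.state_dict()}
    torch.save(state, f"{log_folder}/last_checkpoint.pth.tar")
    if val_localization > best_so_far:
        torch.save(state, f"{log_folder}/best_checkpoint.pth.tar")
        best_so_far = val_localization
    return best_so_far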

Improved box evaluation

We introduce an improved box evaluation metric, MaxBoxAccV2, over the original metric used in the CVPR version: MaxBoxAcc. Key improvements are as follows:

  • Box evaluation using multiple IoU thresholds (default: 30%, 50%, 70%). If you set multi_iou_eval to True (default), the localization metric in the log shows the mean of MaxBoxAcc across all IoU thresholds. Otherwise, it only shows MaxBoxAcc at the 50% IoU threshold. The IoU threshold list can easily be set by changing the iou_threshold_list argument.

  • A new advanced bounding box mining scheme. Bounding boxes are extracted from all contours in the thresholded score map. You can use this feature by setting multi_contour_eval to True (default). Otherwise, bounding boxes are extracted from the largest connected component of the score map.

We recommend that future researchers use the MaxBoxAccV2 metric for box-based evaluation. Users can evaluate WSOL methods with this metric by setting box_v2_metric to True.
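
For intuition, a minimal sketch of the multi-contour box mining idea, assuming one box per external contour of the thresholded score map; evaluation.py contains the actual implementation.

import cv2
import numpy as np

def boxes_from_scoremap(scoremap, threshold):
    # With multi_contour_eval=True, one bounding box is proposed per contour of
    # the thresholded score map; with multi_contour_eval=False, only the largest
    # connected component would be used.
    mask = np.uint8(scoremap >= threshold) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        boxes.append((x, y, x + w, y + h))
    return boxes

MaxBoxAccV2 then averages the per-threshold maximal box accuracy over the IoU thresholds in iou_threshold_list (30, 50, 70 by default).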

7. Code license

This project is distributed under the MIT license.

Copyright (c) 2020-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

8. How to cite

@inproceedings{choe2020cvpr,
  title={Evaluating Weakly Supervised Object Localization Methods Right},
  author={Choe, Junsuk and Oh, Seong Joon and Lee, Seungho and Chun, Sanghyuk and Akata, Zeynep and Shim, Hyunjung},
  year = {2020},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  note = {to appear},
  pubstate = {published},
  tppubtype = {inproceedings}
}
@ARTICLE{choe2022tpami,
  author={Choe, Junsuk and Oh, Seong Joon and Chun, Sanghyuk and Lee, Seungho and Akata, Zeynep and Shim, Hyunjung},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Evaluation for Weakly Supervised Object Localization: Protocol, Metrics, and Datasets}, 
  year={2022},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TPAMI.2022.3169881}}


wsolevaluation's Issues

Sequence of normalization and resize of CAM?

Hi, Thanks for your good work.

I'm a little confused about lines 86-88 in your inference.py:

cam_resized = cv2.resize(cam, image_size, interpolation=cv2.INTER_CUBIC)
cam_normalized = normalize_scoremap(cam_resized)

In WSOL, after we get a certain class's CAM score of the feature size (hxw, e.g, 7x7), do we resize it to the original image (224x224) and then normalize the score to [0, 1], or do we normalize the CAM score (in the 7x7 shape) to [0, 1] and then resize it to the original image?

I'm looking forward to your reply.

Thanks in advance!

Best,

About ImageNetV2 file name

Hi,

I downloaded Threshold0.7 of ImageNetV2 to use it as train-fullsup.
However, the file names of the images are not 0.jpeg to 9.jpeg; they are in a format like 0af3f1b55de791c4144e2fb6d7dfe96dfc22d3fc.jpeg, 8e1374a4e20d7af22665b7749158b7eb9fa3826e.jpeg, etc.

How can I change the file name to correctly use the box labels you annotated?

Thanks.

Pretrained Resnet

Hi,

I noticed that for ResNet-50 you change the stride to 1 at layer3 (the name comes from PyTorch, torchvision.models.resnet50) in order to increase the feature map from 7x7 to 14x14. So I wonder: do you first make this change and then train the modified ResNet-50 on ImageNet, and finally train (or fine-tune) this new ResNet-50 on the CUB dataset?

expected str, bytes or os.PathLike object, not NoneType

Getting the above issue after running the following commands

!git clone https://github.com/clovaai/wsolevaluation.git
os.chdir("wsolevaluation")
!bash dataset/prepare_cub.sh

!python main.py --dataset_name CUB \
    --architecture vgg16 \
    --wsol_method cam \
    --experiment_name cub_vgg16 \
    --pretrained TRUE \
    --num_val_sample_per_class 0 \
    --large_feature_map TRUE \
    --batch_size 32 \
    --epochs 50 \
    --lr 0.000227913316 \
    --lr_decay_frequency 15 \
    --weight_decay 1.00E-04 \
    --override_cache FALSE \
    --workers 4 \
    --box_v2_metric True \
    --iou_threshold_list 30 50 70 \
    --eval_checkpoint_type last

ImageNetV2 has a lot of incorrect bounding-box annotations?

I used the bounding-box annotations you provided for testing and found many incorrect boxes.
for example:
100/5.jpeg,73,214,148,299
100/5.jpeg,180,85,200,111
100/5.jpeg,169,137,231,195
100/5.jpeg,394,163,462,207
100/5.jpeg,316,134,374,160
100/5.jpeg,136,206,171,235
100/5.jpeg,242,146,267,164

Custom datasets

Could you shed some light on how to modify the code for other custom datasets?
I want to use my own dataset instead of ImageNet, CUB, or OpenImages.
I have a dataset in VOC format.

?

how much should the position be set in resnet50

slow cv2.findContours

Hi,
cv2.findContours is extremely slow depending on the quality of the CAM (from .00005s to .001s per call). This easily brings the validation time from 2 minutes to 12 minutes, and things get worse when the CAM is bad (way too many contours per threshold: >1000/threshold).

contours = cv2.findContours(

Is there a way to speed it up without breaking the evaluation protocol?
I really appreciate your help.
Thanks

Interpretation of the result

Hi,
Thank you for providing the code for this amazing work. I have a question for which I seek your guidance. When I run the basic resnet50 code for CAM on CUB200, I get the following results on the test set

Split test, metric classification, current value: 50.517777010700726
Split test, metric localization, current value: 58.97480151881257
Split test, metric localization_IOU_30, current value: 96.08215395236452
Split test, metric localization_IOU_50, current value: 66.27545736969279
Split test, metric localization_IOU_70, current value: 14.566793234380393

I wanted to confirm whether "metric localization" corresponds to the MaxBoxAccV2 metric that you mention in your work. Also, what does "metric localization, current value" mean? In Table 6 of your paper, do you report "metric localization_IOU_50"? Looking forward to hearing from you.
Thanks

Can't see data for evaluation.py

I can't find the folder train_log/scoremaps/ for the CUB dataset. I am trying to use your evaluation-only script. Could you please tell me the procedure to run this code on heatmaps? How do we generate heatmaps or run your evaluation.py on custom object detection datasets?

num_val_sample_per_class

Hi, I wanted to ask about the argument num_val_sample_per_class in
python main.py --dataset_name OpenImages \
               --architecture vgg16 \
               --wsol_method cam \
               --experiment_name OpenImages_vgg16_CAM \
               --pretrained TRUE \
               --num_val_sample_per_class 5 \
               --large_feature_map FALSE \
               --batch_size 32 \
               --epochs 10 \
               --lr 0.00227913316 \
               --lr_decay_frequency 3 \
               --weight_decay 5.00E-04 \
               --override_cache FALSE \
               --workers 4 \
               --box_v2_metric True \
               --iou_threshold_list 30 50 70 \
               --eval_checkpoint_type last

You set it to 5 for the OpenImages dataset. Shouldn't it be 25 instead, since we have 25 samples per class in the validation set?

So when we use your code with the CUB and ImageNet datasets, to which number should we set the argument num_val_sample_per_class?

Thank you in advance for your reply :)

Cropping and Resizing

Can you please provide any insights on why you crop images in 224x224 patches by default and why you resize them to 256x256, regardless of input image size?

Results of OpenImage30K

Hi! Thank you for your work in bringing new benchmarks and a unified approach to evaluation for the WSOL community.
In the CVPR 2020 paper, some tables you provided seem to be inconsistent with those in https://docs.google.com/spreadsheets/d/1O4gu69FOOooPoTTtAEmFdfjs2K0EtFneYWQFk8rNqzw/edit#gid=0. For example, the result of InceptionV3 with ACoL in the paper is 63.0, but the value indicated by the linked page is lower. Is there something wrong with the way I am using it? Looking forward to your reply!

Some issue about the CUB val dataset

Thanks for your work.

When I follow the guidance to run the code on CUB, the error says:

sampled_indices = np.random.choice( indices, self.num_sample_per_class, replace=False)
........
........
File "mtrand.pyx", line 946, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

Looking back to the config.py, the recommended setting is 'args.num_val_sample_per_class<=5' for CUB.

def check_dependency(args):
    if args.dataset_name == 'CUB':
        if args.num_val_sample_per_class >= 6:
            raise ValueError("num-val-sample must be <= 5 for CUB.")
    if args.dataset_name == 'OpenImages':
        if args.num_val_sample_per_class >= 26:
            raise ValueError("num-val-sample must be <= 25 for OpenImages.")

However, the bird category '059.California_Gull' only contains 3 images for validation, while '002.Laysan_Albatross' and '007.Parakeet_Auklet' contain 6 images each. This leads to the error above.

The solution is to expand the number of 059.California_Gull images from 3 to 5, or to change the recommended setting to num_val_sample_per_class=3. It feels strange that no one has met this problem before. 👍

FSL baseline

Could you please release the code of the FSL baseline? Thank you.

Data split for FSL

Dear the authors,

While trying to reproduce the reported results of the few-shot learning baseline (FSL), I came up with a question. According to the paper, FSL exploited (10, 5, 5) samples per class for ImageNet, CUB, and OpenImages, respectively, and the same amount of supervision was applied to the CAM methods.
For FSL, the number of samples per class, e.g., 10 for ImageNet, is the sum of the samples for train and val, I believe, since FSL also needs some amount of val set. So my question is: for FSL, how did you split the training and val sets among the number of samples per class you specified (10, 5, 5)?

Hope my question is clear to you. Looking forward to hearing from you. Thank you!

Doubt regarding input image files in metadata

The code in the data loader reads the image_ids file as follows (the ids have .jpg appended):

with open(metadata['image_ids' + suffix]) as f:
    for line in f.readlines():
        image_ids.append(line.strip('\n'))

But while reading the OpenImages data and parsing localization.txt, it expects path_file.jpg.npy. Isn't that wrong? Shouldn't it be path_file.npy, with the .jpg dropped? It would be great if you could clear up this issue.

cannot reproduce the results using CAM-Inception on CUB dataset

Hi,

I've set large_feature_map = True, which means the final feature map used to generate CAMs is 28x28 (the image input size is 224x224 rather than 229x229). Also, I've set the LR of SPG_A3_1b, SPG_A3_2b and SPG_A4 10 times higher than that of the remaining blocks (Conv2d_4a_3x3, Conv2d_1a_3x3, Conv2d_2a_3x3, Conv2d_2b_3x3, Conv2d_3b_1x1, Mixed_5b, Mixed_5c, Mixed_5d, Mixed_6a, Mixed_6b, Mixed_6c, Mixed_6d and Mixed_6e), whose learning rate is 0.00224844746. The WD is 5e-4, the momentum is 0.9, and nesterov is True for the SGD optimizer.
The LR decay frequency is 15 epochs and I use the StepLR scheduler with gamma = 0.1.

The boxaccv2 is around 53%, more specifically, 0.92, 0.56, 0.1 for iou=0.3, 0.5 and 0.7, respectively.

Could you please tell me if I missed something? Or has someone else had a similar issue?

OpenImages PxAP performance changed between MaxBoxAcc and MaxBoxAccV2. Why?

Hi,
In https://arxiv.org/pdf/2001.07437.pdf, the PxAP performance over OpenImages changed between using MaxBoxAcc (Tab. 2) and MaxBoxAccV2 (Tab. 8). I didn't expect that.
Why is that?
Because the validation (for model selection) on OpenImages is done using PxAP (as for test as well), using MaxBoxAcc or MaxBoxAccV2 in the code configuration should not impact the results on PxAP, right?

Or did you run 30 new random trials for the hyperparameter search that led to different best hyperparameters for PxAP in Tab. 8? Or is it simply due to randomness if the code is not reproducible (running the same experiment twice with the same settings does not lead to exactly the same results)?

tab 2

tab 8

thanks

Dataset Structure

Thank you for putting together this brilliant collection of WSOL methods and shedding light on the reality of the progress in the field! Truly appreciated!
I was trying to run your code on my own custom dataset following the directions from #17
I was wondering whether the dataset folder hierarchy plays a crucial role in determining class labels. As long as I've correctly produced the class_labels.txt file, do I need to care about the subfolders? For example, I have two labels, 0 and 1. Do I need to create two subfolders, or is putting all the images in one folder enough?

Evaluation_test Failure and also path cannot find.

Hi,

Thanks for this great work done. I have two questions:
(1) When I run: python evaluation_test.py, it outputs:
FAIL: test_compute_bboxes_from_scoremaps_degenerate (main.EvalUtilTest)

Traceback (most recent call last):
File "evaluation_test.py", line 98, in test_compute_bboxes_from_scoremaps_degenerate
self.assertListEqual(boxes, [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
AssertionError: First sequence is not a list: ([array([[0, 0, 0, 0]]), array([[0, 0, 0, 0]]), array([[0, 0, 0, 0]]), array([[0, 0, 0, 0]]), array([[0, 0, 0, 0]])], [1, 1, 1, 1, 1])

======================================================================
FAIL: test_compute_bboxes_from_scoremaps_multimodal (main.EvalUtilTest)

Traceback (most recent call last):
File "evaluation_test.py", line 125, in test_compute_bboxes_from_scoremaps_multimodal
self.assertListEqual(boxes, [[0, 0, 4, 3],
AssertionError: First sequence is not a list: ([array([[0, 0, 4, 3]]), array([[0, 0, 2, 2]]), array([[0, 3, 3, 3]]), array([[2, 3, 3, 3]]), array([[0, 3, 1, 3]])], [1, 1, 1, 1, 1])

======================================================================
FAIL: test_compute_bboxes_from_scoremaps_unimodal (main.EvalUtilTest)

Traceback (most recent call last):
File "evaluation_test.py", line 110, in test_compute_bboxes_from_scoremaps_unimodal
self.assertListEqual(boxes, [[1, 1, 4, 3],
AssertionError: First sequence is not a list: ([array([[1, 1, 4, 3]]), array([[1, 1, 4, 3]]), array([[2, 1, 4, 3]]), array([[2, 2, 4, 3]]), array([[2, 2, 3, 3]])], [1, 1, 1, 1, 1])

(2) My second problem is when I run your suggested script: python evaluation.py --scoremap_root=train_log/scoremaps/ --metadata_root=metadata/ --mask_root=dataset/ --dataset_name=CUB --split=val --cam_curve_interval=0.01

It gives the following error:

Loading and evaluating cams.
Traceback (most recent call last):
File "evaluation.py", line 528, in
main()
File "evaluation.py", line 516, in main
evaluate_wsol(scoremap_root=args.scoremap_root,
File "evaluation.py", line 465, in evaluate_wsol
image_ids = get_image_ids(metadata)
File "/egundogdu/WSOL/wsolevaluation/data_loaders.py", line 62, in get_image_ids
with open(metadata['image_ids' + suffix]) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'metadata/image_ids.txt'

Do you have any idea with these issues?

Test Dataset description

The description of the test split says that the test data contains images with full supervision, but the splits used for testing in ImageNet, CUB, and OpenImages don't have bounding boxes or masks. Could you please explain this in detail?

Pretrained Models

Hi.
Thank you for such a clean repository. I would like to ask if it's possible to have access to a few of the pre-trained models. As many would use your repository to reproduce the current SOTA, having access to pre-trained models would indeed speed up the process of testing code validity. I think just having access to a pretrained model on CUB would make a huge difference because everyone can basically download it and validate the accuracies on a local machine (even without GPU).

Many thanks.

Configuration of reproducing the result in the CUB and ImageNet

Thanks first for the great work!

Following the demo configuration of OpenImages in README.MD, I could easily reproduce the expected accuracy in the given table.

python main.py --dataset_name OpenImages \
               --architecture vgg16 \
               --wsol_method cam \
               --experiment_name OpenImages_vgg16_CAM \
               --pretrained TRUE \
               --num_val_sample_per_class 5 \
               --large_feature_map FALSE \
               --batch_size 32 \
               --epochs 10 \
               --lr 0.00227913316 \
               --lr_decay_frequency 3 \
               --weight_decay 5.00E-04 \
               --override_cache FALSE \
               --workers 4 \
               --box_v2_metric True \
               --iou_threshold_list 30 50 70 \
               --eval_checkpoint_type last

May I ask what the configuration for the CUB and ImageNet datasets is?
I used the above one (only changing --dataset_name), and the experiment accuracy is much lower than reported in the given table.

Many thanks!

confusion regarding additional datasets

I am trying to understand the paper. Just clear out my confusion. I am confused whether the newly added datasets are used for training or not, after the optimal hyperparameters are selected.
For example:
For different hyperparameters, the model is first trained on the CUB dataset, then validated on the CUBv2 dataset. Hyperparameters of the model with the highest localization accuracy are then selected. A new model with the selected hyperparameters is then trained on the CUB training dataset along with the added CUBv2 dataset, and then finally tested on the CUB test dataset. Am I correct or are you doing something different?

logic problem

When generating the CAM, you use the ground-truth label to select the channel weights of the fully connected layer. But I think the model's predicted class is the right choice.

Hyperparameter to reproduce Table 2

Could you also provide the recommended command/hyperparameter settings to reproduce Table 2?

Especially for the learning rate/batch size/epochs etc. for different methods with different backbones on three datasets.

This would benefit a fair comparison for those who cite your work.

Thank you very much!!

GPU and training time required by this repo

Dear authors,

Thank you very much for this dedicated repo! This is extremely helpful to the WSOL community!

Some questions:

  1. Do all jobs covered by this repo require only one GPU to train and evaluate? Is it helpful or necessary to use multiple GPUs?
  2. What kind of GPU did you use for the jobs? How much memory did you consume?
  3. How much time to train on each dataset?

Thank you again!

Re-implementation confusion

Hi, I used your config params to train the ResNet-50 vanilla CAM, but I cannot reach your reported accuracy.
Here's my configurations:
CUDA_VISIBLE_DEVICES=6 python train.py --dataset_name CUB --architecture resnet50 --wsol_method cam --experiment_name CUB_CAM_resnet50_box_v2_metric --pretrained TRUE --large_feature_map FALSE --batch_size 32 --epochs 50 --lr 0.0002 --lr_decay_frequency 15 --weight_decay 0.0001 --override_cache TRUE --workers 4 --box_v2_metric True --iou_threshold_list 30 50 70 --eval_checkpoint_type last --data_root /data/lijinlong/datasets/CUB-200-2011/
result:

Final epoch evaluation on test set ...
Check train_log/CUB_CAM_resnet50_box_v2_metric/last_checkpoint.pth.tar loaded.
rank 0, Evaluate epoch 50, split test
Computing and evaluating cams.
Split train, metric loss, current value: 0.07756523653730615
Split train, metric loss, best value: 0.07533820327974217
Split train, metric loss, best epoch: 48
Split train, metric classification, current value: 99.84984984984985
Split train, metric classification, best value: 99.88321654988322
Split train, metric classification, best epoch: 43
Split val, metric classification, current value: 72.89999999999999
Split val, metric classification, best value: 74.2
Split val, metric classification, best epoch: 30
Split val, metric localization, current value: 46.36666666666667
Split val, metric localization, best value: 50.900000000000006
Split val, metric localization, best epoch: 1
Split val, metric localization_IOU_30, current value: 89.1
Split val, metric localization_IOU_30, best value: 92.6
Split val, metric localization_IOU_30, best epoch: 2
Split val, metric localization_IOU_50, current value: 43.9
Split val, metric localization_IOU_50, best value: 51.6
Split val, metric localization_IOU_50, best epoch: 1
Split val, metric localization_IOU_70, current value: 6.1
Split val, metric localization_IOU_70, best value: 8.9
Split val, metric localization_IOU_70, best epoch: 1
Split test, metric classification, current value: 77.06247842595789
Split test, metric localization, current value: 51.26567713726845
Split test, metric localization_IOU_30, current value: 95.11563686572316
Split test, metric localization_IOU_50, current value: 50.465999309630654
Split test, metric localization_IOU_70, current value: 8.215395236451501

CUDA_VISIBLE_DEVICES=5 python train.py --dataset_name CUB --architecture resnet50 --wsol_method cam --experiment_name CUB_CAM_resnet50 --pretrained TRUE --large_feature_map FALSE --batch_size 32 --epochs 50 --lr 0.0002 --lr_decay_frequency 15 --weight_decay 0.0001 --override_cache TRUE --workers 4 --box_v2_metric False --iou_threshold_list 30 50 70 --eval_checkpoint_type last --data_root /data/lijinlong/datasets/CUB-200-2011/
results:

Final epoch evaluation on test set ...
Check train_log/CUB_CAM_resnet50/last_checkpoint.pth.tar loaded.
rank 0, Evaluate epoch 50, split test
Computing and evaluating cams.
Split train, metric loss, current value: 0.078823547021007
Split train, metric loss, best value: 0.07638261178592304
Split train, metric loss, best epoch: 45
Split train, metric classification, current value: 99.76643309976645
Split train, metric classification, best value: 99.83316649983317
Split train, metric classification, best epoch: 43
Split val, metric classification, current value: 73.2
Split val, metric classification, best value: 74.0
Split val, metric classification, best epoch: 18
Split val, metric localization, current value: 43.5
Split val, metric localization, best value: 52.6
Split val, metric localization, best epoch: 1
Split val, metric localization_IOU_30, current value: 88.6
Split val, metric localization_IOU_30, best value: 93.2
Split val, metric localization_IOU_30, best epoch: 1
Split val, metric localization_IOU_50, current value: 43.5
Split val, metric localization_IOU_50, best value: 52.6
Split val, metric localization_IOU_50, best epoch: 1
Split val, metric localization_IOU_70, current value: 6.0
Split val, metric localization_IOU_70, best value: 9.4
Split val, metric localization_IOU_70, best epoch: 1
Split test, metric classification, current value: 76.61373835001726
Split test, metric localization, current value: 50.84570245081118
Split test, metric localization_IOU_30, current value: 95.11563686572316
Split test, metric localization_IOU_50, current value: 50.84570245081118
Split test, metric localization_IOU_70, current value: 8.439765274421816

Here is my model architecture and config params:
Namespace(acol_threshold=0.7, adl_drop_rate=0.75, adl_threshold=0.9, architecture='resnet50', architecture_type='cam', batch_size=32, box_v2_metric=True, cam_curve_interval=0.001, crop_size=224, cutmix_beta=1.0, cutmix_prob=1.0, data_paths=Munch({'train': '/data/lijinlong/datasets/CUB-200-2011/CUB', 'val': '/data/lijinlong/datasets/CUB-200-2011/CUB', 'test': '/data/lijinlong/datasets/CUB-200-2011/CUB'}), data_root='/data/lijinlong/datasets/CUB-200-2011/', dataset_name='CUB', dist_backend='nccl', dist_url='tcp://127.0.0.1', epochs=50, eval_checkpoint_type='last', experiment_name='CUB_CAM_resnet50_box_v2_metric', gpu=None, has_drop_rate=0.5, has_grid_size=4, iou_threshold_list=[30, 50, 70], large_feature_map=False, launcher='pytorch', local_rank=0, log_folder='train_log/CUB_CAM_resnet50_box_v2_metric', lr=0.0002, lr_classifier_ratio=10, lr_decay_frequency=15, mask_root='dataset/OpenImages', master_port='47562', metadata_root='metadata/CUB', momentum=0.9, multi_contour_eval=True, multi_iou_eval=True, multiprocessing_distributed=False, num_val_sample_per_class=0, override_cache=True, pretrained=True, pretrained_path=None, proxy_training_set=False, rank=-1, reporter=<class 'util.Reporter'>, reporter_log_root='train_log/CUB_CAM_resnet50_box_v2_metric/reports', resize_size=256, scoremap_paths=Munch({'train': 'train_log/CUB_CAM_resnet50_box_v2_metric/scoremaps/train', 'val': 'train_log/CUB_CAM_resnet50_box_v2_metric/scoremaps/val', 'test': 'train_log/CUB_CAM_resnet50_box_v2_metric/scoremaps/test'}), seed=None, spg_threshold_1h=0.7, spg_threshold_1l=0.01, spg_threshold_2h=0.5, spg_threshold_2l=0.05, spg_threshold_3h=0.7, spg_threshold_3l=0.1, spg_thresholds=((0.7, 0.01), (0.5, 0.05), (0.7, 0.1)), weight_decay=0.0001, workers=4, world_size=-1, wsol_method='cam') Loading model resnet50
`
ResNetCam(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
(fc): Linear(in_features=2048, out_features=200, bias=True)
)

IoU 50 and 70 are much lower than your report; did I miss something?
But for VGG16, the results are ok.
Thanks.

About dataset

I cannot obtain CUBV2 from prepare_cub.sh. Could you please check that the link is valid?

Request for configs and question about the MaxBoxAcc

Hi,
Thank you for sharing the code for the awesome paper!

I was trying to reproduce the results shown in Table 6 of the appendix but have had a hard time reaching the performance you reported. Would it be possible for you to share the configs for CAM, HaS, ACoL, and SPG for V, I, and R?

Also, I was wondering if there is a specific reason why you proposed to use the max of GT-known instead of the max of Top1-Loc. According to my understanding, Top1-Loc is a more comprehensive metric since it also takes classification performance into account. Although localization is important, if the classification prediction is wrong in the first place, a model would not be considered a "good model". It would be highly appreciated if you could elaborate on it, or if it has already been mentioned in the paper, please direct me to the relevant part. Thank you!

optimal oracle value

Hello. Thank you for such interesting work.
I have one question, though.
I understand that you conducted 30 random trials for each method with a single backbone.

Then how did you choose the optimal oracle value?
You mentioned that you find it with the train-fullsup set.
Do you mean that you conducted multiple experiments on the train-fullsup set to find the optimal oracle?

Top-1 localization

Can I directly use the model trained by your code to test top-1 localization? And could you please provide the implementation of the top-1 localization evaluation? Thank you.

calculate MaxBoxAcc

Hi,

I try to calculate the MaxBoxAcc version 1 for CAM method on CUB dataset, using VGG16, the number (~85%) is much higher than the number reported in the paper (~76%).

I calculate it as:

counter = 0
for all testing_image:
        get the current CAM and normalized it via min-max normalization (range to [0, 1])
        for 1,000 steps (I assume you sample the score map threshold 1,000 steps):
                c_CAM = c_CAM >= current_score_map_threshold
                get the current bbox and calculate the IOU
        if one of IOUs > 0.5 (there shall be 1,000 IOUs):
               counter += 1

and the final maxboxacc = counter / number_of_testing_images.

Do I miss something in the procedure? Btw, the bbox is estimated from all contours.

Optimal threshold for test set

Hi,

I have a question about the evaluation. I've tried to understand how the optimal threshold is used at test time. It seems you are searching for the optimal threshold on the test set again, instead of using the optimal threshold found on the validation set. Or am I missing something? Please correct me if I am wrong. I cannot really find the line of code where the optimal threshold is stored for the test set. Thank you.

ValueError: Cannot take a larger sample than population when 'replace=False'

Getting the mentioned error by running this command

!python main.py --dataset_name CUB \
    --architecture vgg16 \
    --wsol_method cam \
    --experiment_name CUB_vgg16_CAM \
    --pretrained TRUE \
    --num_val_sample_per_class 5 \
    --large_feature_map FALSE \
    --batch_size 32 \
    --epochs 10 \
    --lr 0.00227913316 \
    --lr_decay_frequency 3 \
    --weight_decay 5.00E-04 \
    --override_cache FALSE \
    --workers 4 \
    --multi_iou_eval False \
    --iou_threshold_list 30 50 70 \
    --multi_contour_eval False \
    --eval_checkpoint_type best
