amazon-science / tubelet-transformer

This is an official implementation of TubeR: Tubelet Transformer for Video Action Detection

Home Page: https://openaccess.thecvf.com/content/CVPR2022/supplemental/Zhao_TubeR_Tubelet_Transformer_CVPR_2022_supplemental.pdf

License: Apache License 2.0

Languages: Python 99.73%, Shell 0.27%
Topics: transformer, action-detection, ava, jhmdb, tubelet-transformer, tuber, ucf

tubelet-transformer's Introduction

TubeR: Tubelet Transformer for Video Action Detection

This repo contains the official code to reproduce the spatio-temporal action detection results of TubeR: Tubelet Transformer for Video Action Detection.

Updates

08/08/2022 Initial commits

Results and Models

AVA 2.1 Dataset

Backbone | Pretrain | #view | mAP | FLOPs | config | model
CSN-50 | Kinetics-400 | 1 view | 27.2 | 78G | config | S3
CSN-50 (with long-term context) | Kinetics-400 | 1 view | 28.8 | 78G | config | Coming soon
CSN-152 | Kinetics-400+IG65M | 1 view | 29.7 | 120G | config | S3
CSN-152 (with long-term context) | Kinetics-400+IG65M | 1 view | 31.7 | 120G | config | Coming soon

AVA 2.2 Dataset

Backbone | Pretrain | #view | mAP | FLOPs | config | model
CSN-152 | Kinetics-400+IG65M | 1 view | 31.1 | 120G | config | S3
CSN-152 (with long-term context) | Kinetics-400+IG65M | 1 view | 33.4 | 120G | config | Coming soon

JHMDB Dataset

Backbone | #view | frame-mAP@0.5 | video-mAP@0.5 | config | model
CSN-152 | 1 view | 87.4 | 82.3 | config | S3

Usage

The project is developed based on GluonCV-torch. Please refer to the tutorial for details.

Dependency

The project is tested and working with the following environment (an install sketch follows the list):

  • Torch 1.12 + CUDA 11.3
  • timm==0.4.5
  • tensorboardX
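A minimal install sketch, assuming pip and a CUDA 11.3 toolchain; torchvision is not listed above and is included here only as an assumption, so adjust versions to your setup:

# PyTorch 1.12 built against CUDA 11.3, from the official PyTorch wheel index
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
# remaining dependencies, pinned as listed above
pip install timm==0.4.5 tensorboardX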

Dataset

Please download the asset.zip and unzip it into ./datasets.
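For example, assuming asset.zip has been downloaded into the repository root:

# create the target directory and unpack the annotation assets into it
mkdir -p ./datasets
unzip asset.zip -d ./datasets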

  • [AVA] Please refer to DATASET.md for downloading and pre-processing the AVA dataset.
  • [JHMDB/UCF] Please refer to JHMDB for the JHMDB dataset and to the Dataset section for the UCF dataset. You can also refer to ACT-Detector to prepare these two datasets.

Inference

To run inference, first modify the config file (a hypothetical excerpt is sketched after this list):

  • Set the correct WORLD_SIZE, GPU_WORLD_SIZE, DIST_URL, WOLRD_URLS based on your experiment setup.
  • Set LABEL_PATH, ANNO_PATH and DATA_PATH to your local directories accordingly.
  • Download the pre-trained model and set PRETRAINED_PATH to the model path.
  • Make sure LOAD and LOAD_FC are set to True.
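A hypothetical excerpt of such a config; the key names follow the list above, but the exact nesting, key placement and values in the shipped configuration/*.yaml files may differ:

DDP_CONFIG:
  WORLD_SIZE: 1                      # number of nodes
  GPU_WORLD_SIZE: 8                  # total number of GPUs
  DIST_URL: tcp://127.0.0.1:11488
  WOLRD_URLS: ['127.0.0.1']          # key is spelled this way in the config
CONFIG:
  DATA:
    LABEL_PATH: /your/path/to/label_map.pbtxt    # hypothetical file names
    ANNO_PATH: /your/path/to/annotations/
    DATA_PATH: /your/path/to/frames/
  MODEL:
    PRETRAINED_PATH: /your/path/to/TubeR_CSN152_AVA21.pth
    LOAD: True        # load the full pre-trained TubeR checkpoint for evaluation
    LOAD_FC: True     # also load the classification head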

Then run:

# run testing
python3 eval_tuber_ava.py <CONFIG_FILE>

# for example, to evaluate the CSN-152 model on AVA 2.1, run:
python3 eval_tuber_ava.py configuration/TubeR_CSN152_AVA21.yaml

Training

To train TubeR from scratch, first modify the config file (a hypothetical excerpt is sketched after this list):

  • Set the correct WORLD_SIZE, GPU_WORLD_SIZE, DIST_URL, WOLRD_URLS based on your experiment setup.
  • Set LABEL_PATH, ANNO_PATH and DATA_PATH to your local directories accordingly.
  • Download the pre-trained feature backbone and transformer weights and set PRETRAIN_BACKBONE_DIR (CSN50, CSN152) and PRETRAIN_TRANSFORMER_DIR (DETR) accordingly.
  • Make sure LOAD and LOAD_FC are set to False.
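Relative to the inference excerpt above, training loads the separate backbone and transformer weights instead of a full TubeR checkpoint; a hypothetical sketch (again, exact nesting and file names may differ from the shipped configs):

CONFIG:
  MODEL:
    PRETRAIN_BACKBONE_DIR: /your/path/to/csn152_kinetics400_ig65m.pth   # hypothetical file name
    PRETRAIN_TRANSFORMER_DIR: /your/path/to/detr.pth
    LOAD: False       # do not load a full TubeR checkpoint when training from scratch
    LOAD_FC: False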

Then run:

# run training from scratch
python3 train_tuber.py <CONFIG_FILE>

# for example, to train on AVA 2.1 from scratch, run:
python3 train_tuber_ava.py configuration/TubeR_CSN152_AVA21.yaml

TODO

[ ] Add tutorial and pre-trained weights for TubeR with long-term memory

[ ] Add weights for UCF24

Citing TubeR

@inproceedings{zhao2022tuber,
  title={TubeR: Tubelet transformer for video action detection},
  author={Zhao, Jiaojiao and Zhang, Yanyi and Li, Xinyu and Chen, Hao and Shuai, Bing and Xu, Mingze and Liu, Chunhui and Kundu, Kaustav and Xiong, Yuanjun and Modolo, Davide and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13598--13607},
  year={2022}
}

tubelet-transformer's People

Contributors

amazon-auto, arthurlxy, bryanyzhu, coocoo90


tubelet-transformer's Issues

Details about DETR pretraining

Hi,

Thank you for your work on TubeR - it is super interesting. And really appreciate that the code is also open source.

I was comparing the code and the paper and noticed that the open source TubeR code initialises from a pre-trained DETR model (see link). This does not seem to be mentioned in the paper.

  • Have the results reported in the paper been obtained from models using pre-trained DETR weights?
  • If so, how do the results change when not using pre-trained DETR weights for TubeR?

Inference JHMDB mAP: 0.00000, Inference ava2.2 mAP: 0.00001

Thank you very much for your work. I have encountered the following problems and look forward to your answers. I used a single 3090 for inference; why are the inference results not correct?

JHMDB:
{'PascalBoxes_Precision/mAP@0.5IOU': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Basketball': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/BasketballDunk': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Biking': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/CliffDiving': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/CricketBowling': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Diving': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Fencing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/FloorGymnastics': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/GolfSwing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/HorseRiding': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/IceDancing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/LongJump': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/PoleVault': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/RopeClimbing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SalsaSpin': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SkateBoarding': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Skiing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Skijet': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/SoccerJuggling': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/Surfing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/TennisSwing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/TrampolineJumping': nan, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/VolleyballSpiking': nan, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/WalkingWithDog': nan} mAP: 0.00000
ava2.2:
{'PascalBoxes_Precision/mAP@0.5IOU': 5.804685659762801e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/bend/bow (at the waist)': 6.74129398343058e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/crouch/kneel': 8.161423910565086e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dance': 1.5340785366848546e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fall down': 1.6245995747063194e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/get up': 1.8427525662848433e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/jump/leap': 1.4822620209897411e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lie/sleep': 7.1702034136118856e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/martial art': 2.600186942913417e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/run/jog': 3.4277535995126226e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sit': 6.116903092404758e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/stand': 1.6757065787305266e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/swim': 6.461702800037559e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/walk': 5.273005180575442e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/answer phone': 4.863533319063223e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/carry/hold (an object)': 2.5340778408370884e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/climb (e.g., a mountain)': 1.443913393914046e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/close (e.g., a door, a box)': 6.563171051115627e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/cut': 6.245620212630477e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dress/put on clothing': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drink': 8.838041137011977e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drive (e.g., a car, a truck)': 2.7613813974371657e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/eat': 4.0655577881098433e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/enter': 1.1173525493390063e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hit (an object)': 1.3902296570436025e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift/pick up': 1.2300693719805807e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen (e.g., to music)': 2.1775085911619012e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/open (e.g., a window, a car door)': 2.4403168714603825e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/play musical instrument': 6.602051911313615e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/point to (an object)': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/pull (an object)': 4.33689787523033e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (an object)': 3.1405851538258606e-08, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/put down': 1.5989342340545008e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/read': 1.091865985129312e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/ride (e.g., a bike, a car, a horse)': 1.102664152188559e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sail boat': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/shoot': 8.63523810932688e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/smoke': 1.1636571772297917e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take a photo': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/text on/look at a cellphone': 2.2049693397968982e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/throw': 5.79245313156018e-08, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/touch (an object)': 1.8177643240392173e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/turn (e.g., a screwdriver)': 0.0, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (e.g., TV)': 2.0338172018682278e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/work on a computer': 6.156932327995735e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/write': 3.4058487513383753e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fight/hit (a person)': 5.2529223403893136e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/give/serve (an object) to (a person)': 2.9697631126085355e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/grab (a person)': 6.969206220854079e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand clap': 8.520963944770175e-08, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand shake': 2.1026831777453487e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand wave': 1.222112678680342e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hug (a person)': 2.4244765951677872e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/kiss (a person)': 4.0355634999615806e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift (a person)': 1.8139397140893772e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen to (a person)': 7.775082670289784e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (another person)': 3.990939991523882e-07, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sing to (e.g., self, a person, a group)': 6.23193706961935e-06, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take (an object) from (a person)': 4.257469986539582e-08, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/talk to (e.g., self, a person, a group)': 4.670111015421275e-05, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (a person)': 2.174873472814551e-05} mAP: 0.00001

Tuber CSN-152 model with memory

Hi,

First, thanks for your work and for providing the implementation. I wonder when the CSN-152 model with memory is going to be released.

Thank you.

The eval results from Tuber CSN-152 IG65+K400 model

Hi,

First, thanks for your work and for providing the implementation.

Following the steps you provided, I downloaded the pretrained CSN-152 Kinetics-400+IG65M model from the link you provided (TubeR_CSN152_AVA22), installed the same versions of PyTorch and the other packages you suggested, and changed only the paths to the data and model in the config file TubeR_CSN152_AVA22.yaml. I was not able to obtain the 31.1 mAP and have only gotten 27.8 mAP (I did 2 runs with the same results).

image

I wonder if I am doing everything right and how to proceed.

Thank you.

Question about temporal localization using action switch

Thank you for the wonderful work. I have read the paper and code, and have a question about temporal localization.
How do you determine the start and end of a tubelet during inference (evaluation)? I expect that the action switch extracts the tubelets in the range above the threshold (=0.5?). If this expectation is correct, then a case with occlusion (i.e., the action temporarily disappears in the middle of a tubelet) would result in multiple tubelets being extracted from a single query. Is this correct?

Questions about the code for JHMDB

Thanks for the great work. I have read the code for JHMDB and have some questions:
(1) The performance of mAP@0.5 is just 0.72, much lower than the 82.3 that is reported.
(2) I also notice that the provided evaluation code for JHMDB is for frame-mAP rather than video-mAP, because the AP is calculated at the frame level rather than the tubelet level.
(3) Although the query number is defined as 10*clip_len, only the predictions of the queries corresponding to the intermediate frame (key_pos) are extracted as the final prediction result during training and testing. In other words, such a pipeline is more like video object detection, where the input is a video clip but the goal is just to predict the object and its class in the middle frame of the input video. I did not find the place that reveals the properties of the so-called tubelet transformer.
In summary, are some configurations wrong in the current code?

Training problems for JHMDB datasets

I used a pre-training dataset; training worked fine, but the model did not predict correct results on the validation set.
To my surprise, everything works fine when continuing training with the weights provided by the author that have already been trained (TubeR_CSN152_JHMDB.pth). But training from scratch causes problems like the logs below.

Train
Epoch: [1][460/2839] lr: 5e-05 data_time: 0.025, batch time: 0.657 class_error: 11.761, loss: 7.271, loss_bbox: 0.085, loss_giou: 0.161, loss_ce: 0.356, loss_ce_b: 0.000
Epoch: [1][461/2839] lr: 5e-05 data_time: 0.031, batch time: 0.661 class_error: 11.736, loss: 7.273, loss_bbox: 0.085, loss_giou: 0.161, loss_ce: 0.356, loss_ce_b: 0.000

eval
Epoch: [0][9138/9139] data_time: 0.003, batch time: 0.069 class_error: 99.354, loss: 15.582, loss_bbox: 0.143, loss_giou: 0.233, loss_ce: 1.298
Epoch: [0][9139/9139] data_time: 0.003, batch time: 0.068 class_error: 99.354, loss: 15.582, loss_bbox: 0.143, loss_giou: 0.233, loss_ce: 1.298

per_class_len 24 per_class [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.02090471 0. nan nan nan]

DETR checkpoint mismatch for JHMDB

Thanks for your great work. I notice that the DETR model used in JHMDB is different from that in AVA. For example, as mentioned in models/tuber_ava.py line 43:

if self.dataset_mode != 'ava':
    self.avg_s = nn.AdaptiveAvgPool3d((1, 1, 1))
    self.query_embed = nn.Embedding(num_queries * temporal_length, hidden_dim)
else:
    self.query_embed = nn.Embedding(num_queries, hidden_dim)

Currently, it seems only the detr.pth for the AVA dataset is provided. As a result, when running the code for JHMDB, there will be an error:
size mismatch for module.query_embed.weight: copying a param with shape torch.Size([10, 256]) from checkpoint, the shape in current model is torch.Size([320, 256]).

Can you provide the right pre-trained DETR checkpoint (i.e., detr.pth) for JHMDB? Thanks.

Are tubelets actually predicted for AVA?

Hello,

Thank you for your excellent work on TubeR. And also for open sourcing the code and the main results.

Looking through the code, it appears that there is a lot of specialisation of the model happening for specific datasets (for example 1, 2, 3).

Most importantly, Figure 5 of the paper suggests that the model predicts tubelets on AVA. But based on the released code, I don't see where this happens.

Specifically, from the code it looks like when training on AVA the TubeR model does not actually use tubelet queries (i.e. the query_embed tensor does not have a temporal axis or the temporal_length multiplier). How can TubeR be used to output tubelet predictions on AVA in this case?

Thank you!

Question of Loading the trained model.

Hi authors,

Thank you for the great job. I have one question about loading the trained models, such as those for AVA 2.1 and 2.2: the epoch recorded in the checkpoint is 0. Does that mean the model was only trained for 0 epochs? Am I right?

image

Question about the DETR pretraining process

Thanks for the impressive work.
I have one question about the pretraining process of DETR (which you mentioned here: https://github.com/amazon-science/tubelet-transformer#training).

From here (#4 (comment)),
I figured that you took the DETR weights trained on the COCO dataset and re-trained them on AVA to detect human instances.

  1. Could you describe this process in more detail? (e.g., how did you manipulate the DETR structure to only detect humans, what exactly was the input, the position embedding, etc.)
  2. Was your intention of this pretraining to make queries focus more on classification after DETR architecture of TubeR learns how to localize actors well enough?
  3. Have you tried training the whole architecture without the pretrained DETR weights? I've tried several times but could not find a good configuration to make the actual learning happen.

Thanks in advance.

Cannot reproduce training results

Hello

I have been trying to reproduce the training results. However, I am not getting anywhere close to the results reported in the paper, or the checkpoint released in this repo (I get an mAP of 20, compared to the public model that gets 31.0 on AVA).

Can you provide some assistance in reproducing the paper's results, or explain why the code does not reproduce them?

These are the steps I have taken:

Firstly, I had to apply the changes from this issue to make the code work. Otherwise, the code provided would crash on loading the data.

I then followed the instructions in the readme. However, the performance on AVA after just a few epochs was very bad:

Epoch | 1 | 3
mAP on AVA | 1.43 | 1.98

Looking into the provided config further, we can see that MODEL.PRETRAINED = False, which means that the weights of the backbone are not loaded here

By loading the backbone pretrained weights, the performance did improve. But after training completed, the results are still nowhere close to what was reported in the paper (33.6), or what the publicly released checkpoint gets (31.0)

Epoch | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
mAP on AVA | 11.28 | 15.08 | 17.77 | 18.50 | 18.81 | 19.59 | 20.34 | 20.98 | 21.00 | 20.43 | 20.94 | 21.77 | 20.76 | 20.03 | 20.35 | 20.42 | 21.14 | 20.79 | 20.39 | 19.91

Can you please explain why the training code does not reproduce the results of the paper?
Also, why is the default training setting to not load the backbone weights? This setting was never explored in the paper.
I have also attached the config that I used to achieve the 20.0 mAP result on AVA above.

Missing annotation file 'ava_train_v21.json'

Thanks for the great work. In the assets.zip, there is no 'ava_train_v21.json' file that is needed to build the train dataloader. Kindly provide us the missing file, if possible.

Inference on AVA and JHMDB Needs Maintenance and Necessary Files

For the version I am using,

AVA2.1 inference needs several modifications:

  1. For the function loadvideo, the frames should be read using the video name. Change

     video_frame_list = sorted(glob(video_frame_path + '/*.jpg'))

     to

     video_frame_list = sorted(glob(video_frame_path + vid + '/*.jpg'))

  2. Change the path here for the annotations:

     f = open("/xxx/datasets/ava_val_excluded_timestamps_v2.1.csv")

  3. The fixes above would get the numbers listed in the README table, but there would still be a tensorboard "EOFError". Add the following lines:

     if cfg.DDP_CONFIG.GPU_WORLD_RANK == 0:
         writer.close()

AVA2.2 Inference

per_class [0.49119732        nan 0.32108856 0.58690862 0.1453127  0.25250868
 0.05269343 0.55119903 0.47336599 0.58118356 0.83511073 0.85809156
 0.4264426  0.79215918 0.7533182         nan 0.61339698        nan
        nan 0.04726829        nan 0.16529978        nan 0.23965087
        nan 0.04494236 0.306021   0.55275175 0.36725148 0.07057226
        nan        nan        nan 0.12159738        nan 0.03173127
 0.02196539 0.2641557         nan        nan 0.67544085        nan
 0.00367732        nan 0.01473403 0.03833153 0.03002702 0.37160171
 0.53368705        nan 0.21649021 0.1374056         nan 0.29578147
        nan 0.03978733 0.10253565 0.03219929 0.33915299 0.01752664
 0.28362901 0.3223239  0.14873739 0.52285939 0.14770317 0.11950478
 0.44886859 0.17733113 0.06789831 0.27917222        nan 0.46795067
 0.06238106 0.71983267        nan 0.05018591 0.31590126 0.09531384
 0.8376019  0.70844574]
{'PascalBoxes_Precision/mAP@0.5IOU': 0.30985340450933535, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/bend/bow (at the waist)': 0.4911973183134509, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/crouch/kneel': 0.3210885611841083, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dance': 0.5869086163647963, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fall down': 0.14531270272554303, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/get up': 0.25250867821227696, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/jump/leap': 0.05269343043207558, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lie/sleep': 0.5511990313327797, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/martial art': 0.47336599427812304, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/run/jog': 0.5811835550049768, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sit': 0.8351107282724392, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/stand': 0.8580915605931295, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/swim': 0.42644259946642094, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/walk': 0.7921591772441756, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/answer phone': 0.7533181965878357, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/carry/hold (an object)': 0.613396976906247, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/climb (e.g., a mountain)': 0.047268291513739374, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/close (e.g., a door, a box)': 0.16529978105316412, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/cut': 0.239650870599096, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/dress/put on clothing': 0.04494235744272522, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drink': 0.30602100382076136, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/drive (e.g., a car, a truck)': 0.5527517520577403, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/eat': 0.3672514840844659, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/enter': 0.07057225556756908, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hit (an object)': 0.12159737681929804, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift/pick up': 0.03173127096825363, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen (e.g., to music)': 0.021965385905557883, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/open (e.g., a window, a car door)': 0.2641556990694153, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/play musical instrument': 0.6754408509957595, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/point to (an object)': 0.0036773150722066972, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/pull (an object)': 0.01473402768023624, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (an object)': 0.038331529680086275, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/put down': 0.03002701544153771, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/read': 0.3716017145811048, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/ride (e.g., a bike, a car, a horse)': 0.5336870531261757, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sail boat': 0.21649020512834088, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/shoot': 0.13740559748226708, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/smoke': 0.2957814682780021, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take a photo': 0.03978732762876234, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/text on/look at a cellphone': 0.10253564997258985, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/throw': 0.03219929211064902, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/touch (an object)': 0.33915299353156436, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/turn (e.g., a screwdriver)': 0.017526643108955034, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (e.g., TV)': 0.28362901476702795, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/work on a computer': 0.322323903124391, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/write': 0.1487373880589133, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/fight/hit (a person)': 0.5228593870747025, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/give/serve (an object) to (a person)': 0.14770317484649234, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/grab (a person)': 0.11950477963584528, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand clap': 0.44886858836133026, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand shake': 0.17733112595251085, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hand wave': 0.06789830556787521, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/hug (a person)': 0.27917221591712854, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/kiss (a person)': 0.4679506698404774, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/lift (a person)': 0.062381058259554645, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/listen to (a person)': 0.7198326661128859, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/push (another person)': 0.050185914377705816, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/sing to (e.g., self, a person, a group)': 0.31590125934914154, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/take (an object) from (a person)': 0.09531383956904724, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/talk to (e.g., self, a person, a group)': 0.8376018955287321, 'PascalBoxes_PerformanceByCategory/AP@0.5IOU/watch (a person)': 0.7084457445779531}
mAP: 0.30985

Question about 'out_logits_b'.

Thanks for the great work. According to your code, I found that out_logits_b contains 3 class predictions, where label 1 represents the action box and 2 represents the non-action box. I don't understand the meaning of label 0, since no boxes refer to 0. Is my understanding wrong?
