This project forked from ch3cook-fdu/vote2cap-detr

Code release for ''End-to-End 3D Dense Captioning with Vote2Cap-DETR'' (CVPR2023)

License: MIT License

End-to-End 3D Dense Captioning with Vote2Cap-DETR (CVPR 2023)

Official implementation of "End-to-End 3D Dense Captioning with Vote2Cap-DETR" and "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning".

Thanks to the implementation of 3DETR, Scan2Cap, and VoteNet.

0. News

  • 2023-09-07. 🤗 We further propose an advanced model on arXiv and release some of the pre-trained weights on Hugging Face.

  • 2022-11-17. 🚩 Our model sets a new state-of-the-art on the Scan2Cap online test benchmark.

1. Environment

Our code is tested with PyTorch 1.7.1, CUDA 11.0 and Python 3.8.13. Besides PyTorch, this repo also requires the following Python dependencies:

matplotlib
opencv-python
plyfile
'trimesh>=2.35.39,<2.35.40'
'networkx>=2.2,<2.3'
scipy
cython
transformers

If you wish to use the multi-view features extracted by Scan2Cap, you should also install h5py:

pip install h5py

It is also REQUIRED to compile the CUDA-accelerated PointNet++ and the gIoU support. Run each build from the repository root.

PointNet++:

cd third_party/pointnet2
python setup.py install

gIoU:

cd utils
python cython_compile.py build_ext --inplace

The METEOR metric used for evaluating captioning performance also requires Java, so make sure a Java runtime is installed.
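The requirements in this section can be checked before training with a small script. The following is a minimal sketch, not part of the repository: the helper names `missing_modules` and `java_available` are hypothetical, and the import names assume the usual mapping from package to module (for example, opencv-python is imported as `cv2`). It does not verify version pins or the compiled extensions.

```python
import importlib.util
import shutil

# Import names for the packages listed in this section. opencv-python is
# imported as cv2 and cython as Cython. h5py is only needed for multi-view
# features, so it is treated as optional here.
REQUIRED_MODULES = ["torch", "matplotlib", "cv2", "plyfile", "trimesh",
                    "networkx", "scipy", "Cython", "transformers"]
OPTIONAL_MODULES = ["h5py"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def java_available():
    """Return True if a `java` executable is on PATH (needed for METEOR)."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED_MODULES))
    print("missing optional:", missing_modules(OPTIONAL_MODULES))
    print("java found:", java_available())
```

An empty "missing required" list only means the modules can be located; it does not confirm that the installed versions match the ones the code was tested with.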

2. Dataset Preparation

We follow Scan2Cap's procedure to prepare datasets under the ./data folder (Scan2CAD NOT required).

Preparing 3D point clouds from ScanNet. Download the ScanNetV2 dataset, set SCANNET_DIR in data/scannet/batch_load_scannet_data.py to the path of the scans folder, and run the following commands.

cd data/scannet/
python batch_load_scannet_data.py
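Before running the batch loader, it can help to confirm that SCANNET_DIR really points at the scans folder. The sketch below is an illustration only and is not part of the repository; `list_scene_dirs` is a hypothetical helper, and it assumes the standard ScanNet naming in which each scan lives in a folder such as scene0000_00.

```python
import re
from pathlib import Path

# ScanNet scan folders are named like scene0000_00 (scene id, then scan index).
SCENE_DIR_PATTERN = re.compile(r"^scene\d{4}_\d{2}$")


def list_scene_dirs(scans_dir):
    """Return sorted names of sub-folders of `scans_dir` that look like ScanNet scans."""
    root = Path(scans_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"SCANNET_DIR does not exist: {root}")
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and SCENE_DIR_PATTERN.match(p.name)
    )
```

If the returned list is empty, the path most likely points at the dataset root or at a single scene rather than at the scans folder itself.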

Preparing Language Annotations. Download the ScanRefer dataset by following its official instructions, and put it under ./data.

[Optional] To prepare for Nr3D, also download the Nr3D dataset and put it under ./data. Because Nr3D is distributed in .csv format, run the following command to process the data.

cd data; python parse_nr3d.py
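To give a sense of what this kind of conversion involves, here is a minimal sketch of turning CSV rows into a list of JSON-style records. It is not the contents of parse_nr3d.py: the function name `csv_to_records`, the input column names (`scan_id`, `target_id`, `instance_type`, `utterance`), the output keys, and the sample row are all assumptions made for illustration and should be checked against the real files.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Convert Nr3D-style CSV text into a list of ScanRefer-style dicts.

    The column names read here and the keys written are assumptions made
    for this sketch, not a description of the actual data format.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "scene_id": row["scan_id"],
            "object_id": row["target_id"],
            "object_name": row["instance_type"].replace(" ", "_"),
            "description": row["utterance"].strip().lower(),
        })
    return records


# Made-up sample row, used only to exercise the function.
sample = (
    "scan_id,target_id,instance_type,utterance\n"
    "scene0000_00,4,office chair,The chair closest to the door.\n"
)
print(json.dumps(csv_to_records(sample), indent=2))
```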

3. Download Pretrained Weights

You can download all the ready-to-use weights at huggingface.

| Model           | SCST | rgb | multi-view | normal | checkpoint   |
|-----------------|------|-----|------------|--------|--------------|
| Vote2Cap-DETR   | -    | ✓   | -          | ✓      | [checkpoint] |
| Vote2Cap-DETR   | -    | -   | ✓          | ✓      | [checkpoint] |
| Vote2Cap-DETR   | ✓    | ✓   | -          | ✓      | [checkpoint] |
| Vote2Cap-DETR   | ✓    | -   | ✓          | ✓      | [checkpoint] |
| Vote2Cap-DETR++ | -    | ✓   | -          | ✓      | [checkpoint] |
| Vote2Cap-DETR++ | -    | -   | ✓          | ✓      | [checkpoint] |
| Vote2Cap-DETR++ | ✓    | ✓   | -          | ✓      | [checkpoint] |
| Vote2Cap-DETR++ | ✓    | -   | ✓          | ✓      | [checkpoint] |

4. Training and Evaluating your own models

Though we provide training commands from scratch, you can also start with some pretrained parameters provided under the ./pretrained folder and skip certain steps.

Because of GitHub's storage limitations, we have uploaded all the pretrained weights to Hugging Face. We recommend downloading pretrained.zip to ./pretrained and unzipping it there.

[optional] 4.0 Pre-Training for Detection

If you have already downloaded pretrained.zip from Hugging Face and unzipped it to ./pretrained, you can SKIP this step, since it only regenerates the pre-trained weights in the ./pretrained folder.

To train Vote2Cap-DETR's detection branch on point cloud input without additional 2D features (i.e. [xyz + rgb + normal + height]):

python main.py --use_color --use_normal --detector detector_Vote2Cap_DETR --checkpoint_dir pretrained/Vote2Cap_DETR_XYZ_COLOR_NORMAL

To evaluate the pre-trained detection branch on ScanNet:

python main.py --use_color --use_normal --detector detector_Vote2Cap_DETR --test_ckpt pretrained/Vote2Cap_DETR_XYZ_COLOR_NORMAL/checkpoint_best.pth --test_detection

To train with additional 2D features (i.e. [xyz + multiview + normal + height]) rather than RGB inputs, replace --use_color with --use_multiview.
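The flag substitution described above can be captured in a small helper that assembles the pre-training command for either input type. This is a sketch under stated assumptions rather than code from the repository: `input_feature_flags` and `detection_pretrain_command` are hypothetical names, and the checkpoint directory is whatever you choose to pass in.

```python
def input_feature_flags(appearance, use_normal=True):
    """Return the main.py input flags for the chosen appearance feature.

    `appearance` is "rgb" (point colours) or "multiview" (2D multi-view features).
    """
    flags = {"rgb": "--use_color", "multiview": "--use_multiview"}
    if appearance not in flags:
        raise ValueError(f"unknown appearance feature: {appearance!r}")
    result = [flags[appearance]]
    if use_normal:
        result.append("--use_normal")
    return result


def detection_pretrain_command(appearance, checkpoint_dir):
    """Assemble the detection pre-training command as an argument list."""
    return [
        "python", "main.py", *input_feature_flags(appearance),
        "--detector", "detector_Vote2Cap_DETR",
        "--checkpoint_dir", checkpoint_dir,
    ]


# Reproduces the RGB command shown earlier in this section.
print(" ".join(detection_pretrain_command(
    "rgb", "pretrained/Vote2Cap_DETR_XYZ_COLOR_NORMAL")))
```

Returning an argument list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns.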

4.1 MLE Training for 3D Dense Captioning

Please make sure there are pretrained checkpoints under the ./pretrained directory. To train the model for 3D dense captioning with MLE training on ScanRefer:

python main.py --use_color --use_normal --use_pretrained --warm_lr_epochs 0 --pretrained_params_lr 1e-6 --use_beam_search --base_lr 1e-4 --dataset scene_scanrefer --eval_metric caption --vocabulary scanrefer --detector detector_Vote2Cap_DETR --captioner captioner_dcc --checkpoint_dir exp_scanrefer/Vote2Cap_DETR_RGB_NORMAL --max_epoch 720

Change --dataset scene_scanrefer to --dataset scene_nr3d to train the model for the Nr3D dataset.

4.2 Self-Critical Sequence Training for 3D Dense Captioning

To train the model with Self-Critical Sequence Training (SCST), use the following command:

python scst_tuning.py --use_color --use_normal --base_lr 1e-6 --detector detector_Vote2Cap_DETR --captioner captioner_dcc --freeze_detector --use_beam_search --batchsize_per_gpu 2 --max_epoch 180 --pretrained_captioner exp_scanrefer/Vote2Cap_DETR_RGB_NORMAL/checkpoint_best.pth --checkpoint_dir exp_scanrefer/scst_Vote2Cap_DETR_RGB_NORMAL

Set --dataset scene_nr3d (in place of scene_scanrefer) to train the model on the Nr3D dataset.
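For readers unfamiliar with SCST, the general idea is a policy-gradient loss in which a sampled caption is rewarded relative to a baseline caption produced by the same model. The sketch below shows that generic objective for a single caption in plain Python. It is not the implementation in scst_tuning.py, which may sample, score, and normalise rewards differently; the function name `scst_loss` and the example numbers are assumptions for illustration.

```python
import math


def scst_loss(sample_log_probs, sample_reward, baseline_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    sample_log_probs: per-token log-probabilities of the sampled caption.
    sample_reward:    metric score (e.g. CIDEr) of the sampled caption.
    baseline_reward:  score of the baseline caption (e.g. greedy decoding).
    """
    advantage = sample_reward - baseline_reward
    # Minimising this raises the likelihood of captions that beat the
    # baseline and lowers the likelihood of captions that score below it.
    return -advantage * sum(sample_log_probs)


# Illustrative numbers only: a two-token caption that beats the baseline.
log_probs = [math.log(0.5), math.log(0.25)]
loss = scst_loss(log_probs, sample_reward=1.2, baseline_reward=0.8)
print(loss)
```

When the sampled caption scores the same as the baseline, the advantage is zero and the caption contributes no gradient, which is what makes the baseline "self-critical".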

4.3 Evaluating the Weights

You can evaluate the model trained at each step by specifying the corresponding checkpoint directory:

python main.py --use_color --use_normal --dataset scene_scanrefer --vocabulary scanrefer --use_beam_search --detector detector_Vote2Cap_DETR --captioner captioner_dcc --batchsize_per_gpu 8 --test_ckpt [...]/checkpoint_best.pth --test_caption

Change --dataset scene_scanrefer to --dataset scene_nr3d to evaluate the model on the Nr3D dataset.
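When several training stages have been run, the evaluation command can be generated once per checkpoint directory. The following is a minimal sketch and not part of the repository: `evaluation_commands` is a hypothetical helper that reuses the flags from the command above and only varies --dataset and --test_ckpt.

```python
from pathlib import Path


def evaluation_commands(checkpoint_dirs, dataset="scene_scanrefer"):
    """Build one caption-evaluation command per directory holding checkpoint_best.pth."""
    commands = []
    for directory in checkpoint_dirs:
        checkpoint = Path(directory) / "checkpoint_best.pth"
        if not checkpoint.is_file():
            continue  # skip stages that have not produced a checkpoint yet
        commands.append([
            "python", "main.py", "--use_color", "--use_normal",
            "--dataset", dataset, "--vocabulary", "scanrefer",
            "--use_beam_search", "--detector", "detector_Vote2Cap_DETR",
            "--captioner", "captioner_dcc", "--batchsize_per_gpu", "8",
            "--test_ckpt", str(checkpoint), "--test_caption",
        ])
    return commands
```

For example, passing the two directories used in sections 4.1 and 4.2 (exp_scanrefer/Vote2Cap_DETR_RGB_NORMAL and exp_scanrefer/scst_Vote2Cap_DETR_RGB_NORMAL) yields one command per existing checkpoint, and each returned list can be handed to `subprocess.run(cmd, check=True)`.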

5. Make Predictions for online test benchmark

We also provide prediction code for the ScanRefer online test benchmark.

The following command will generate a .json file under the folder defined by --checkpoint_dir.

python predict.py --use_color --use_normal --dataset test_scanrefer --vocabulary scanrefer --use_beam_search --detector detector_Vote2Cap_DETR --captioner captioner_dcc --batchsize_per_gpu 8 --test_ckpt [...]/checkpoint_best.pth

6. BibTex

If you find our work helpful, please kindly cite our paper:

@inproceedings{chen2023end,
  title={End-to-end 3d dense captioning with vote2cap-detr},
  author={Chen, Sijin and Zhu, Hongyuan and Chen, Xin and Lei, Yinjie and Yu, Gang and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11124--11133},
  year={2023}
}
@misc{chen2023vote2capdetr,
  title={Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning}, 
  author={Sijin Chen and Hongyuan Zhu and Mingsheng Li and Xin Chen and Peng Guo and Yinjie Lei and Gang Yu and Taihao Li and Tao Chen},
  year={2023},
  eprint={2309.02999},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

7. License

Vote2Cap-DETR and Vote2Cap-DETR++ are both licensed under the MIT License.

8. Contact

If you have any questions or suggestions about this repo, please feel free to open issues or contact me ([email protected])!
