3dlg-hcvc / m3dref-clip

[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects

Home Page: https://3dlg-hcvc.github.io/multi3drefer/

License: MIT License

Languages: Python 82.81%, C++ 7.16%, Cuda 7.88%, C 2.15%
Topics: 3d, computer-vision, deep-learning, visual-grounding, clip, cuda, localization, pytorch, pytorch-lightning, transformer

m3dref-clip's Introduction

M3DRef-CLIP

PyTorch Lightning WandB

This is the official implementation for Multi3DRefer: Grounding Text Description to Multiple 3D Objects.

Model Architecture

Requirement

This repo contains CUDA implementations; please make sure your GPU has compute capability 3.0 or above.

We report the maximum GPU memory usage with batch size 4:

                 Training    Inference
GPU mem usage    15.2 GB     11.3 GB

Setup

Conda (recommended)

We recommend the use of miniconda to manage system dependencies.

# create and activate the conda environment
conda create -n m3drefclip python=3.10
conda activate m3drefclip

# install PyTorch 2.0.1
conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia

# install PyTorch3D with dependencies
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install pytorch3d -c pytorch3d

# install MinkowskiEngine with dependencies
conda install -c anaconda openblas
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

# install Python libraries
pip install .

# install CUDA extensions
cd m3drefclip/common_ops
pip install .
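
After installation, a quick sanity check like the following can confirm that the CUDA-enabled dependencies import and see the GPU. This is a minimal sketch (not part of the repo), using only the packages installed above:

# sanity_check.py -- minimal environment check, not part of the repo
import torch
import MinkowskiEngine as ME
import pytorch3d

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MinkowskiEngine:", ME.__version__)
print("PyTorch3D:", pytorch3d.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))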

Pip

Note: Setting up with pip (without conda) requires OpenBLAS to be pre-installed on your system.

# create and activate the virtual environment
virtualenv env
source env/bin/activate

# install PyTorch 2.0.1
pip install torch torchvision

# install PyTorch3D
pip install pytorch3d

# install MinkowskiEngine
pip install MinkowskiEngine

# install Python libraries
pip install .

# install CUDA extensions
cd m3drefclip/common_ops
pip install .

Data Preparation

Note: Both the ScanRefer and Nr3D datasets require the ScanNet v2 dataset. Please preprocess it first.

ScanNet v2 dataset

  1. Download the ScanNet v2 dataset (train/val/test); refer to ScanNet's instructions for more details. The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── scannetv2
    │   │   ├── scans
    │   │   │   ├── [scene_id]
    │   │   │   │   ├── [scene_id]_vh_clean_2.ply
    │   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json
    │   │   │   │   ├── [scene_id].aggregation.json
    │   │   │   │   ├── [scene_id].txt
  2. Pre-process the data; this converts the original meshes and annotations to .pth data:

    python dataset/scannetv2/preprocess_all_data.py data=scannetv2 +workers={cpu_count}
  3. Pre-process the multiview features from ENet: please refer to the instructions in ScanRefer's repo, with one modification:

    • comment out lines 51 to 56 in batch_load_scannet_data.py since we follow D3Net's setting that doesn't do point downsampling here.

    Then put the generated enet_feats_maxpool.hdf5 (116 GB) under m3drefclip/dataset/scannetv2 (a quick sanity-check sketch follows this list).
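
A minimal sketch for sanity-checking the preprocessed outputs. The .pth output location and the HDF5 layout (one dataset per scene id) are assumptions, so adjust paths and keys as needed:

# inspect_preprocessed.py -- minimal sketch, not part of the repo
import glob
import torch
import h5py

# pick any preprocessed .pth scene produced by preprocess_all_data.py (output location is an assumption)
pth_files = glob.glob("dataset/scannetv2/**/*.pth", recursive=True)
if pth_files:
    scene = torch.load(pth_files[0])
    print(pth_files[0], type(scene))

# the precomputed ENet multiview features (layout assumed: one dataset per scene id)
with h5py.File("dataset/scannetv2/enet_feats_maxpool.hdf5", "r") as f:
    scene_id = next(iter(f.keys()))
    print(scene_id, f[scene_id].shape)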

ScanRefer dataset

  1. Download the ScanRefer dataset (train/val). Also, download the test set. The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── scanrefer
    │   │   ├── metadata
    │   │   │   ├── ScanRefer_filtered_train.json
    │   │   │   ├── ScanRefer_filtered_val.json
    │   │   │   ├── ScanRefer_filtered_test.json
  2. Pre-process the data, "unique/multiple" labels will be added to raw .json files for evaluation purpose:

    python dataset/scanrefer/add_evaluation_labels.py data=scanrefer
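
For reference, "unique" in ScanRefer means the target's semantic class occurs only once in the scene, otherwise "multiple". Below is a rough illustration of that rule; the JSON field names (scene_id, object_id, object_name) follow the public ScanRefer release, but the added key name and the counting over annotated objects only are assumptions:

# label_sketch.py -- illustration only; the repo's actual logic lives in add_evaluation_labels.py
import json

with open("dataset/scanrefer/metadata/ScanRefer_filtered_val.json") as f:
    anns = json.load(f)

# collect the distinct annotated objects of each class per scene
objects_per_class = {}
for a in anns:
    objects_per_class.setdefault((a["scene_id"], a["object_name"]), set()).add(a["object_id"])

# "unique" if the target's class occurs once in the scene, otherwise "multiple"
# (approximation: only annotated objects are counted; the real script consults the full ScanNet scene)
for a in anns:
    n = len(objects_per_class[(a["scene_id"], a["object_name"])])
    a["eval_type"] = "unique" if n == 1 else "multiple"   # the "eval_type" key name is an assumption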

Nr3D dataset

  1. Download the Nr3D dataset (train/test). The raw dataset files should be organized as follows:

    m3drefclip # project root
    ├── dataset
    │   ├── nr3d
    │   │   ├── metadata
    │   │   │   ├── nr3d_train.csv
    │   │   │   ├── nr3d_test.csv
  2. Pre-process the data, "easy/hard/view-dep/view-indep" labels will be added to raw .csv files for evaluation purpose:

    python dataset/nr3d/add_evaluation_labels.py data=nr3d
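
For context, ReferIt3D's convention is roughly that "easy"/"hard" depends on how many same-class distractors the scene contains, and "view-dep" marks utterances that use view-dependent language (e.g. "left", "facing"). A rough sketch of the view-dependence heuristic only; the keyword list and the "utterance" column name are assumptions, not the repo's exact rule:

# view_dep_sketch.py -- illustration only; the actual logic lives in add_evaluation_labels.py
import csv

# rough keyword heuristic; the real keyword list may differ
VIEW_DEP_WORDS = {"left", "right", "front", "behind", "back", "facing", "leftmost", "rightmost", "looking"}

with open("dataset/nr3d/metadata/nr3d_train.csv") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    tokens = set(row["utterance"].lower().split())   # "utterance" column name is an assumption
    row["view_dep"] = bool(tokens & VIEW_DEP_WORDS)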

Multi3DRefer dataset

  1. Download the Multi3DRefer dataset (train/val). The raw dataset files should be organized as follows (a small inspection sketch follows):
    m3drefclip # project root
    ├── dataset
    │   ├── multi3drefer
    │   │   ├── metadata
    │   │   │   ├── multi3drefer_train.json
    │   │   │   ├── multi3drefer_val.json
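
Each Multi3DRefer description can ground zero, one, or multiple objects. A minimal sketch for checking the distribution of these cases in a split; the "object_ids" field name is an assumption about the released JSON:

# multi3drefer_stats.py -- minimal sketch, not part of the repo
import json
from collections import Counter

with open("dataset/multi3drefer/metadata/multi3drefer_val.json") as f:
    anns = json.load(f)

def bucket(num_targets):
    # group descriptions by how many objects they refer to
    if num_targets == 0:
        return "zero-target"
    return "single-target" if num_targets == 1 else "multi-target"

counts = Counter(bucket(len(a["object_ids"])) for a in anns)  # "object_ids" field name is assumed
print(counts)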

Pre-trained detector

We pre-trained PointGroup, as implemented in MINSU3D, on ScanNet v2 and use it as the detector, with coordinates + colors + multi-view features as inputs.

  1. Download the pre-trained detector. The detector checkpoint file should be organized as follows (a loading sanity check follows):
    m3drefclip # project root
    ├── checkpoints
    │   ├── PointGroup_ScanNet.ckpt
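
To confirm the checkpoint downloaded correctly before training, a minimal sketch, assuming a standard PyTorch Lightning checkpoint (a plain dict saved with torch.save that contains a state_dict entry):

# check_detector_ckpt.py -- sanity-check sketch, not part of the repo
import torch

ckpt = torch.load("checkpoints/PointGroup_ScanNet.ckpt", map_location="cpu")
print("top-level keys:", list(ckpt.keys()))
print("parameters in state_dict:", len(ckpt["state_dict"]))  # assumes a standard Lightning checkpoint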

Training, Inference and Evaluation

Note: Configuration files are managed by Hydra; you can add or override any configuration attribute by passing it as a command-line argument (a leading + adds a new key, ++ adds or overrides one).

# log in to WandB
wandb login

# train a model with the pre-trained detector, using predicted object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt

# train a model with the pretrained detector, using GT object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt model.network.detector.use_gt_proposal=True

# train a model from a checkpoint, it restores all hyperparameters in the .ckpt file
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={checkpoint_experiment_name} ckpt_path={ckpt_file_path}

# test a model from a checkpoint and save its predictions
python test.py data={scanrefer/nr3d/multi3drefer} data.inference.split={train/val/test} ckpt_path={ckpt_file_path} pred_path={predictions_path}

# evaluate predictions
python evaluate.py data={scanrefer/nr3d/multi3drefer} pred_path={predictions_path} data.evaluation.split={train/val/test}
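
For Multi3DRefer, the paper evaluates with an F1 score over predicted and ground-truth boxes at IoU thresholds of 0.25 and 0.5. Below is a minimal sketch of that metric for a single description using greedy IoU matching, with the zero-target case counted as correct when nothing is predicted; this is an illustration under those assumptions, not the repo's evaluate.py:

# f1_sketch.py -- illustrative F1@IoU for one description (axis-aligned 3D boxes)
import numpy as np

def box_iou(a, b):
    # IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    if not pred_boxes and not gt_boxes:      # zero-target: predicting nothing is correct
        return 1.0
    if not pred_boxes or not gt_boxes:
        return 0.0
    matched_gt = set()
    tp = 0
    for p in pred_boxes:                     # greedy matching, each GT box used at most once
        ious = [box_iou(np.array(p), np.array(g)) if i not in matched_gt else 0.0
                for i, g in enumerate(gt_boxes)]
        best = int(np.argmax(ious))
        if ious[best] >= thresh:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(pred_boxes)
    recall = tp / len(gt_boxes)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)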

Checkpoints

ScanRefer dataset

M3DRef-CLIP_ScanRefer.ckpt

Performance:

Split   IoU    Unique   Multiple   Overall
Val     0.25   85.3     43.8       51.9
Val     0.5    77.2     36.8       44.7
Test    0.25   79.8     46.9       54.3
Test    0.5    70.9     38.1       45.5

Nr3D dataset

M3DRef-CLIP_Nr3d.ckpt

Performance:

Split   Easy   Hard   View-dep   View-indep   Overall
Test    55.6   43.4   42.3       52.9         49.4

Multi3DRefer dataset

M3DRef-CLIP_Multi3DRefer.ckpt

Performance:

Split   IoU    ZT w/ D   ZT w/o D   ST w/ D   ST w/o D   MT     Overall
Val     0.25   39.4      81.8       34.6      53.5       43.6   42.8
Val     0.5    39.4      81.8       30.6      47.8       37.9   38.4

(ZT = zero target, ST = single target, MT = multiple targets; w/ D and w/o D = with and without distractors.)

Benchmark

ScanRefer

Convert M3DRef-CLIP predictions to ScanRefer benchmark format:

python dataset/scanrefer/convert_output_to_benchmark_format.py data=scanrefer pred_path={predictions_path} +output_path={output_file_path}

Nr3D

Please refer to the ReferIt3D benchmark to report results.

m3dref-clip's People

Contributors

eamonn-zh


m3dref-clip's Issues

Reproducing Nr3D results in Table 6.

Hi,

I am trying to train from scratch to reproduce the 49.4 overall accuracy on Nr3D, as reported in Table 6. However, I could only get around 46.5 under the settings of the provided config (changing data to nr3d and also setting the use_gt_proposal flag). Could you provide more details on the training settings to reproduce your result on Nr3D? Thanks!

RuntimeError: CUDA error: an illegal memory access was encountered

I encountered this error; everything is freshly cloned from this repo:

“RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. ”

RuntimeError when setting "model.network.detector.use_gt_proposal=True"

Dear author:

Thanks for your interesting work.

When I run the following command:

# train a model with the pretrained detector, using GT object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt model.network.detector.use_gt_proposal=True

an error has occurred:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

It seems like the model returns some variables that are not used in the loss calculation, and I wonder how to solve this.

Best!
Xiaolong
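
The error message itself names the workaround: enable unused-parameter detection in the DDP strategy. A minimal sketch of where that flag would go if you construct the Trainer yourself; whether train.py exposes this through its Hydra config is not shown here, so treat the example as an assumption:

# ddp_sketch.py -- illustration of the strategy flag named in the error message
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp_find_unused_parameters_true",  # string form suggested by the error message
)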

Crashes when run under the debugger, but works normally outside VSCode

Hi! Thanks for the work!

I constantly get an error when using the debugger to run the project:
"Error executing job with overrides: ['data=scanrefer', 'experiment_name=test', 'data.inference.split=test', 'ckpt_path=/path/to/my/M3DRef-CLIP_ScanRefer.ckpt']"

or other train/eval/test commands, but things work when I don't use the debugger. Do you have any idea which part causes this?

Could not append to config. An item is already at 'ckpt_path'. Either remove + prefix: 'ckpt_path=M3DRef-CLIP_ScanRefer.ckpt' Or add a second + to add or override 'ckpt_path': '++ckpt_path=M3DRef-CLIP_ScanRefer.ckpt'

Hello, I followed the readme and did all of it. When I ran the command:
python test.py data=scanrefer data.inference.split=val +ckpt_path=M3DRef-CLIP_ScanRefer.ckpt pred_path=output
The error is:
Could not append to config. An item is already at 'ckpt_path'. Either remove + prefix: 'ckpt_path=M3DRef-CLIP_ScanRefer.ckpt' Or add a second + to add or override 'ckpt_path': '++ckpt_path=M3DRef-CLIP_ScanRefer.ckpt'

How to test on Scanrefer benchmark?

This is my first time trying to test on the ScanRefer benchmark, but I encountered some difficulties. When I ran the test command according to the instructions in the readme, some errors occurred in the DataLoader (caused by the fact that the test set does not have instance ids, semantic labels, etc.), and the test set does not seem to be handled in test_epoch.

How do you conduct benchmark testing? Do I need to write additional code?

Hope to get your reply! Thanks!

Memory cost went too high

I have increased the number of views, and the memory cost grew beyond 24 GB; I'm using a single 3090.

Is there any way to reduce the memory cost without the performance dropping drastically?

Many thanks.

Proposal filtering for PointGroup

Hi,

I noticed that you chose not to filter the PointGroup proposals by NMS or proposal scores. Have you tried that before? I wonder about the effects of proposal filtering, e.g. whether the final performance goes down while early-stage performance is higher.

Best,
Tony

The training speed becomes so slow after a few epochs

Great work, and the code is clearly written!
However, when I was training with the default configuration on a single NVIDIA 3090 GPU, I noticed something strange.

  1. When using only 3D features, training and inference are relatively fast, but after 20 hours (the 24th epoch) they become very slow (tens of times slower), and GPU utilization, power consumption, etc. drop significantly.
  2. When using 2D+3D features, validation for one epoch takes more than 5 hours (I'm not sure if this is normal), and training becomes particularly slow after the first validation epoch (again tens of times slower), which confuses me.

Have you ever encountered these problems? Looking forward to your reply very much, thanks!

Visualization Script

Hi,

Thanks for the nice work! I am wondering whether you could provide the visualization scripts for your model. For example, the script to generate the sub-figures in your Figure 6, Figure 10 and Figure 12. Thanks!

Questions about the predictions on ScanRefer with the given ckpt

Dear author:

Thanks for your interesting work.

I have completed the entire process of training and inference following the README.md, but when I run the following commands with the given ckpt:

# get the predictions
python test.py data=scanrefer data.inference.split=val +ckpt_path={M3DRef-CLIP_ScanRefer.ckpt} pred_path={predictions_path}

# evaluate predictions
python evaluate.py data=scanrefer pred_path={M3DRef-CLIP_ScanRefer.ckpt} pred_path={predictions_path} data.evaluation.split=val

I get unsatisfactory performance, far lower than your results in readme.md:

===========================================
IoU         unique      multiple    overall     
-------------------------------------------
0.25        45.3        28.6        31.8        
0.50        33.1        21.9        24.1        
===========================================

I wonder if this is correct, and how I can achieve the same results as those in the README?

Thanks!!

M3DRef-CLIP on Scanrefer Test Benchmark

Hi,

Thank you for your awesome work!

For the ScanRefer benchmark submission, do you train on a combination of the train+val splits or only the train split? I am asking because training on train+val is common practice for ScanNet benchmarks, but I am not sure what people do for referential grounding benchmarks. Also, are there any other details/tricks for the test-set submission, or do you simply export the predictions with the provided checkpoint trained on the train set?

Thank you!

Access the Multi3DRefer test set

Hi,

Thanks for the nice work!
I am wondering whether you will host an online benchmark so we can evaluate on the test set?

CLIP text model output. How/Why two outputs word_features & sentence features?

I was wondering why you expect two outputs when calling word_features, sentence_features = self.clip_model.encode_text(clip_tokens) here.

As far as I understand, you are using a vanilla CLIP model, whose clip_model.encode_text() outputs only one embedding.
Evidently that can't be the case here, since you expect two different embeddings. So where did you implement the custom functionality to get two embeddings from encode_text()?
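
For context: OpenAI's stock CLIP encode_text returns only the pooled sentence embedding (the EOT token projected through text_projection), while the per-token transformer outputs exist just before that pooling step. Below is a hedged sketch of how a wrapper could expose both, using the standard openai/CLIP module attributes; this is not necessarily how this repo implements it:

# clip_text_sketch.py -- returning word-level and sentence-level features from a CLIP text encoder
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")  # CPU for portability of the sketch

def encode_text_two_outputs(model, text_tokens):
    x = model.token_embedding(text_tokens).type(model.dtype)       # (B, L, D) token embeddings
    x = x + model.positional_embedding.type(model.dtype)
    x = x.permute(1, 0, 2)                                         # (L, B, D) for the transformer
    x = model.transformer(x)
    x = x.permute(1, 0, 2)
    word_features = model.ln_final(x).type(model.dtype)            # per-token (word-level) features
    # pooled sentence feature: take the EOT token (highest token id) and project it
    eot = text_tokens.argmax(dim=-1)
    sentence_features = word_features[torch.arange(word_features.shape[0]), eot] @ model.text_projection
    return word_features, sentence_features

tokens = clip.tokenize(["the brown chair next to the table"])
words, sentence = encode_text_two_outputs(model, tokens)
print(words.shape, sentence.shape)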
