nv-tlabs / lion


Latent Point Diffusion Models for 3D Shape Generation

License: Other

lion's Introduction

LION: Latent Point Diffusion Models for 3D Shape Generation

NeurIPS 2022

Xiaohui Zeng Arash Vahdat Francis Williams Zan Gojcic Or Litany Sanja Fidler Karsten Kreis

Paper Project Page

Update

Added the point-cloud rendering code used for the paper figures; see utils/render_mitsuba_pc.py.
When opening an issue, please add @ZENGXH so that I can respond faster!

Install

Dependencies:
- CUDA 11.6

Set up the environment: install from the conda file

    conda env create --name lion_env --file=env.yaml 
    conda activate lion_env 

    # Install some other packages 
    pip install git+https://github.com/openai/CLIP.git 

    # build some packages first (optional)
    python build_pkg.py

Tested with conda version 22.9.0

Using Docker
- build the Docker image with bash ./docker/build_docker.sh
- launch the Docker container with bash ./docker/run.sh

Demo

Run python demo.py; it loads the released text2shape model from Hugging Face and generates a chair point cloud. (Note: the checkpoint is not released yet, so the files loaded in demo.py are not available at this point.)

Released checkpoint and samples

The checkpoint can be downloaded from here.
After downloading, verify the checksum with python ./script/check_sum.py ./lion_ckpt.zip
Put the downloaded files under ./lion_ckpt/
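If you want to double-check the archive independently of ./script/check_sum.py, the sketch below streams a file through Python's hashlib in chunks. The README does not say which hash algorithm check_sum.py uses or what the expected digest is, so the MD5 default here is an assumption; compare against whatever reference value you trust.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large archive is never fully loaded into memory.

    MD5 is an assumed default; ./script/check_sum.py may use a different algorithm.
    """
    h = hashlib.new(algorithm)
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected, algorithm="md5"):
    """Return True if the file's hex digest matches `expected` (case-insensitive)."""
    return file_digest(path, algorithm) == expected.strip().lower()
```

For example, verify("./lion_ckpt.zip", EXPECTED_DIGEST), where EXPECTED_DIGEST is a placeholder for the reference value you are checking against.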

Training

data

ShapeNet can be downloaded here.
Put the downloaded data at ./data/ShapeNetCore.v2.PC15k, or edit the pointflow entry in ./datasets/data_path.py to point to your ShapeNet dataset path.
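Before training, it can help to confirm the data landed where the loader expects it. This sketch assumes a PointFlow-style layout (one folder per ShapeNet synset ID, each containing train/val/test folders of .npy point clouds); that layout, the split names, and the synset ID in the usage line are assumptions, so adjust them if ./datasets/data_path.py points elsewhere.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")  # assumed split folder names (PointFlow-style layout)


def count_point_clouds(root, synset_ids, splits=SPLITS):
    """Count .npy files per (synset, split); a missing folder is counted as 0."""
    root = Path(root)
    counts = {}
    for sid in synset_ids:
        for split in splits:
            folder = root / sid / split
            counts[(sid, split)] = len(list(folder.glob("*.npy"))) if folder.is_dir() else 0
    return counts


def missing_splits(counts):
    """Return the (synset, split) pairs that contain no point clouds, sorted."""
    return sorted(key for key, n in counts.items() if n == 0)
```

Usage: missing_splits(count_point_clouds("./data/ShapeNetCore.v2.PC15k", ["02691156"])) should return an empty list if the (assumed) airplane folder is complete.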

train VAE

Run bash ./script/train_vae.sh $NGPU (the released checkpoint was trained with NGPU=4 on A100 GPUs).
To log the experiment with comet, add a .comet_api file in the current folder and write the API key in it as {"api_key": "${COMET_API_KEY}"}.

train diffusion prior

Requires the VAE checkpoint.
Run bash ./script/train_prior.sh $NGPU (the released checkpoint was trained with NGPU=8 and 2 nodes on V100 GPUs).

train diffusion prior with clip feat

This script trains the model for the single-view reconstruction or text2shape task.
- The idea is that we take the encoder and decoder trained on the data as usual (without conditioning input), and when training the diffusion prior, we feed the CLIP image embedding as the conditioning input: the shape-latent prior model takes the CLIP embedding through an AdaGN layer.
Requires the VAE checkpoint trained above.
Requires the rendered ShapeNet data; you can render it yourself or download it from here.
- Put the rendered data at ./data/shapenet_render/ or edit the clip_forge_image entry in ./datasets/data_path.py.
- The image data is read in ./datasets/pointflow_datasets.py via render_img_path; you may need to customize this variable depending on your folder structure.
Run bash ./script/train_prior_clip.sh $NGPU
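The conditioning described above (a CLIP embedding entering the shape-latent prior through AdaGN) can be pictured with the NumPy sketch below. It follows one common adaptive group-norm formulation, GroupNorm(h) * (1 + scale) + shift, with scale and shift projected from the conditioning vector. The projection weights, tensor shapes, and group count are illustrative assumptions, not LION's actual layer.

```python
import numpy as np


def group_norm(h, num_groups, eps=1e-5):
    """Normalize h of shape (batch, channels, length) within channel groups (no affine)."""
    b, c, n = h.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = h.reshape(b, num_groups, c // num_groups, n)
    mean = g.mean(axis=(2, 3), keepdims=True)
    var = g.var(axis=(2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c, n)


def ada_gn(h, cond, w, bias, num_groups=8):
    """Adaptive GroupNorm: per-channel scale/shift are predicted from a conditioning vector.

    h:    (batch, channels, length) features, e.g. per-point latents
    cond: (batch, cond_dim) conditioning embedding, e.g. a CLIP feature
    w:    (cond_dim, 2 * channels) hypothetical projection weights
    bias: (2 * channels,) hypothetical projection bias
    """
    scale_shift = cond @ w + bias  # (batch, 2 * channels)
    scale, shift = np.split(scale_shift, 2, axis=1)
    # Broadcast the per-channel scale/shift over the length axis.
    return group_norm(h, num_groups) * (1 + scale[:, :, None]) + shift[:, :, None]
```

With a zero projection the layer reduces to plain GroupNorm, so the conditioning only changes the features once the projection learns non-zero weights.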

(Optional) monitor experiments

(tested) Use comet-ml: add a .comet_api file under the LION folder. Example .comet_api file:

    {"api_key": "...", "project_name": "lion", "workspace": "..."}
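Rather than pasting the key by hand, the .comet_api file can be generated from an environment variable so the key stays out of version control. This is a minimal sketch; it assumes the key is stored in COMET_API_KEY and writes only the fields shown in the example above.

```python
import json
import os
from pathlib import Path


def write_comet_api(folder=".", project_name="lion", workspace=None, env_var="COMET_API_KEY"):
    """Write a .comet_api JSON file with the fields from the README example.

    Raises KeyError if the API key environment variable is not set.
    """
    config = {"api_key": os.environ[env_var], "project_name": project_name}
    if workspace is not None:
        config["workspace"] = workspace
    path = Path(folder) / ".comet_api"
    path.write_text(json.dumps(config))
    path.chmod(0o600)  # the file holds a secret, so keep it readable by the owner only
    return path
```

Run it from the LION folder, e.g. write_comet_api(workspace="my-workspace"), where "my-workspace" is a placeholder for your own Comet workspace.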

(not tested) Use wandb: add a .wandb_api file and set the environment variable with export USE_WB=1 before training. Example .wandb_api file:

    {"project": "...", "entity": "..."}

(not tested) Use TensorBoard: set the environment variable with export USE_TFB=1 before training.
See utils/utils.py for details of the experiment logger; I usually use comet-ml for my experiments.
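Because the logger is selected through environment variables, a small launcher can set them for a single run instead of exporting them in the shell. This sketch uses only the USE_WB and USE_TFB names from the notes above; it assumes comet is picked up from the .comet_api file alone, and the script path in the usage line is one of the training scripts listed earlier.

```python
import os
import subprocess


def training_env(logger="comet", base_env=None):
    """Return a copy of the environment with exactly one logger flag set.

    'comet' sets no flag (assumed to rely on .comet_api); 'wandb' sets USE_WB=1;
    'tensorboard' sets USE_TFB=1. Stale flags from the parent shell are removed.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.pop("USE_WB", None)
    env.pop("USE_TFB", None)
    if logger == "wandb":
        env["USE_WB"] = "1"
    elif logger == "tensorboard":
        env["USE_TFB"] = "1"
    elif logger != "comet":
        raise ValueError(f"unknown logger: {logger}")
    return env


def launch(script, ngpu, logger="comet"):
    """Run a training script such as ./script/train_vae.sh with the chosen logger."""
    return subprocess.run(["bash", script, str(ngpu)], env=training_env(logger), check=True)
```

For example, launch("./script/train_vae.sh", 4, logger="tensorboard").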

evaluate a trained prior

Download the test data (Table 1) from here, unzip it, and put it at ./datasets/test_data/
Download the released checkpoint from above.

    checkpoint="./lion_ckpt/unconditional/airplane/checkpoints/model.pt"
    bash ./script/eval.sh $checkpoint  # takes 1-2 hours
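Since each evaluation takes 1-2 hours, it can be handy to generate the command for every downloaded category at once. This sketch assumes other categories follow the same CATEGORY/checkpoints/model.pt pattern as the airplane path above, which the README does not confirm; it only builds the command strings, so you decide how and where to run them.

```python
import shlex
from pathlib import Path


def find_checkpoints(root="./lion_ckpt/unconditional"):
    """Find model.pt files laid out as ROOT/CATEGORY/checkpoints/model.pt (assumed layout)."""
    return sorted(Path(root).glob("*/checkpoints/model.pt"))


def eval_commands(checkpoints, script="./script/eval.sh"):
    """Build one shell-quoted `bash ./script/eval.sh CHECKPOINT` line per checkpoint."""
    return [f"bash {shlex.quote(script)} {shlex.quote(str(c))}" for c in checkpoints]
```

Usage: for cmd in eval_commands(find_checkpoints()): print(cmd)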

other test data

ShapeNet-Vol test data:
- please check here before using this data
- all categories: 1000 shapes are sampled from the full validation set
- chair, airplane, car
Table 21 and Table 20, PointFlow test data:
- check here before using this data
- mug and bottle
- 55-category data

Evaluate the samples with the 1-NNA metrics

Download the test data from here, unzip it, and put it at ./datasets/test_data/
Run python ./script/compute_score.py (Note: for the ShapeNet-Vol data and Tables 21 and 20, set norm_box=True.)
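For intuition about what the 1-NNA score measures, here is a small NumPy sketch: merge the generated and reference shapes, find each shape's nearest neighbor (excluding itself) under Chamfer distance, and report how often that neighbor comes from the same set. A score near 50% means the two sets are hard to tell apart; 100% means they are fully separable. This is an illustrative O(N^2) version with one common Chamfer variant (mean squared nearest-neighbor distance in both directions); it is not compute_score.py, which may use a different distance (for example EMD), normalization, or tie handling.

```python
import numpy as np


def chamfer(a, b):
    """Chamfer distance between point clouds a (n, 3) and b (m, 3): mean squared NN distance both ways."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (n, m) squared distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def one_nna(generated, reference, dist=chamfer):
    """1-nearest-neighbor accuracy of a two-sample test between two lists of point clouds."""
    clouds = list(generated) + list(reference)
    labels = np.array([0] * len(generated) + [1] * len(reference))
    n = len(clouds)
    D = np.full((n, n), np.inf)  # the diagonal stays inf so a shape never matches itself
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(clouds[i], clouds[j])
    nearest = D.argmin(axis=1)
    return float((labels[nearest] == labels).mean())
```

Lower is better down to the 50% floor; a generator that copies training shapes exactly can also score far from 50%, which is why the metric is reported against a held-out reference set.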

Citation

    @inproceedings{zeng2022lion,
        title={LION: Latent Point Diffusion Models for 3D Shape Generation},
        author={Xiaohui Zeng and Arash Vahdat and Francis Williams and Zan Gojcic and Or Litany and Sanja Fidler and Karsten Kreis},
        booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
        year={2022}
    }

lion's Issues

Multi GPU Training Problem

Hey @ZENGXH! Thanks again for your wonderful work.

But now I have run into another problem: why is the training process unstable?

I trained the VAE model on all categories (bash ./scripts/train_vae_all.sh) with a batch size of 12 on 8 V100 16GB GPUs.

However, the loss starts to increase after it decreases to around 14.

I want to know why the training process is so unstable and how to fix this problem.

Looking forward to your reply!

What is causal attention?

Hi,

Impressive work! When reading the paper, I could not find a mathematical description of causal attention.

Could you please tell me what the mathematical definition of causal attention is?

Thanks and best regards!
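The paper is the authority on what causal attention means in LION. For reference only, in the standard transformer sense it is scaled dot-product attention with a mask that stops position i from attending to any position j > i, i.e. softmax(QK^T / sqrt(d) + M) V with M_ij = -inf for j > i and 0 otherwise. The NumPy sketch below implements that textbook definition; whether LION uses exactly this form is not established here.

```python
import numpy as np


def causal_attention(q, k, v):
    """Scaled dot-product attention where query i only attends to keys 0..i.

    q, k: (seq_len, d); v: (seq_len, d_v). Returns (seq_len, d_v).
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability; the diagonal is always finite
    weights = np.exp(scores)  # exp(-inf) = 0, so masked positions get zero weight
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

The defining property is that changing a later value never changes an earlier output, which is what makes the attention "causal".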

ShapeNet-vol dataset

Could you share the ShapeNet-Vol dataset mentioned in Table 3 of the paper?

1-NN accuracy benchmarking

Are the numbers you report in tables 1 and 2 computed with compute_score or some other code?

python build_pkg.py failed

Hi, I've been trying to run the setup steps to do some training locally, but I'm stuck on the last (optional) step, and the demo and training don't run either. I ran the setup steps on WSL Ubuntu, but python build_pkg.py fails after the line below. I also tried running python demo.py, which also failed. Would you happen to have a Docker image for inference? It would be super helpful! I also tried installing CUDA 11.6 locally, to no avail. Any help would be much appreciated!

Detected CUDA files, patching ldflags
Emitting ninja build file /home/gina/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/grouping/grouping.cu -o grouping.cuda.o 
FAILED: grouping.cuda.o 
Killed
[2/14] c++ -MMD -MF trilinear_devox.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/interpolate/trilinear_devox.cpp -o trilinear_devox.o   
FAILED: trilinear_devox.o
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
[3/14] c++ -MMD -MF bindings.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/bindings.cpp -o bindings.o
FAILED: bindings.o
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
[4/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/interpolate/neighbor_interpolate.cu -o neighbor_interpolate.cuda.o
FAILED: neighbor_interpolate.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[5/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/voxelization/vox.cu -o vox.cuda.o
FAILED: vox.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[6/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/ball_query/ball_query.cu -o ball_query.cuda.o   
FAILED: ball_query.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[7/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/sampling/sampling.cu -o sampling.cuda.o
FAILED: sampling.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[8/14] c++ -MMD -MF vox.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/voxelization/vox.cpp -o vox.o
[9/14] /usr/bin/nvcc  -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/interpolate/trilinear_devox.cu -o trilinear_devox.cuda.o
FAILED: trilinear_devox.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[10/14] c++ -MMD -MF ball_query.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/ball_query/ball_query.cpp -o ball_query.o
[11/14] c++ -MMD -MF neighbor_interpolate.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/interpolate/neighbor_interpolate.cpp -o neighbor_interpolate.o
[12/14] c++ -MMD -MF grouping.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/grouping/grouping.cpp -o grouping.o
[13/14] c++ -MMD -MF sampling.o.d -DTORCH_EXTENSION_NAME=_pvcnn_backend -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /home/gina/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -c /mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/src/sampling/sampling.cpp -o sampling.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1717, in _run_ninja_build
    subprocess.run(
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "build_pkg.py", line 2, in <module>
    from models import pvcnn2
  File "/mnt/c/Users/G/GitHub/LION/models/pvcnn2.py", line 21, in <module>
    import third_party.pvcnn.functional as F
  File "/mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/__init__.py", line 1, in <module>
    from third_party.pvcnn.functional.ball_query import ball_query
  File "/mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/ball_query.py", line 3, in <module>
    from third_party.pvcnn.functional.backend import _backend
  File "/mnt/c/Users/G/GitHub/LION/third_party/pvcnn/functional/backend.py", line 8, in <module>
    _backend = load(name='_pvcnn_backend',
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1337, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1449, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/gina/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension '_pvcnn_backend'

training loss NaN

Hi, I trained the VAE model as described in the README, but the training loss becomes NaN. I use 4 GPUs and a batch size of 40, and I kept everything else the same as in the repo.
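Once a loss turns NaN, the parameters usually become NaN too, so every later step reports NaN as well. The useful information is the first step where it happened. A minimal sketch, assuming the per-step loss can be read as a Python float (for example via `.item()` on a tensor). `first_non_finite` and `check_loss` are hypothetical helpers, not part of LION, and they only locate the problem rather than fix it:

```python
import math


def first_non_finite(losses):
    """Return (step, value) for the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step, value
    return None


def check_loss(step, value):
    """Raise as soon as a loss value is NaN or infinite (hypothetical training-loop guard)."""
    if not math.isfinite(value):
        raise FloatingPointError(f"non-finite loss {value} at step {step}")
    return value


# Example: the third logged loss is the first NaN.
history = [0.91, 0.74, float("nan"), float("nan")]
print(first_non_finite(history))  # (2, nan)
```

In a PyTorch loop one could call `check_loss(step, loss.item())` right after the forward pass; `torch.autograd.set_detect_anomaly(True)` can additionally report the backward operation that produced the NaN, at a noticeable speed cost.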

Missing file or directory

I ran demo.py and encountered the error "No such file or directory: ./lion_ckpt/text2shape/chair/cfg.yml". Could you help me fix it?

cfg.yml for training the diffusion prior

Hi @ZENGXH ,

I successfully completed the first-stage HVAE training, but there is no cfg file for training the second-stage diffusion prior.

Could you please provide the diffusion prior cfg file?
Is it okay to use the first-stage VAE cfg.yml for the second-stage diffusion prior training? I have tried, but the shape of beta may be wrong.

Best regards,
Yingjie

VAE visualization problem

Hey,

I visualized my VAE training results and found that they do not perform as well as the results you show in TensorBoard.

My visualization results are as follows:

The results above were obtained after training for 4000 epochs with train_prior_clip, batch_size 12 and lr 1e-4 (I did not change any other parameters) on 8 V100 GPUs.

To check whether the VAE model was trained enough, I then visualized the training results at around 5000 epochs.

But the results did not get better.

Why can't I get reasonable results after 20 days of training? Do I need to train further? In general, this is an unacceptable training time.

Looking forward to your reply. @ZENGXH

SAP reconstruction results

Hi @ZENGXH, thanks for your great work.
Could you kindly share example code for the follow-up SAP surface reconstruction?
Thanks!

training VAE

Hello @ZENGXH, I'm trying to train the VAE, but I get errors when I run train_vae.sh.
Can you help me?

How to apply shape interpolation?

see title

AttributeError: 'Trainer' object has no attribute 'load_vae'

When I train the diffusion prior, it shows:
AttributeError: 'Trainer' object has no attribute 'load_vae'
script/train_prior.sh: line 20: latent_pts.pvd_mse_loss: command not found
@ZENGXH
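`line 20: latent_pts.pvd_mse_loss: command not found` means bash ran that line as a separate command instead of as an argument to the command before it. The script is not shown here, so the cause is an assumption, but a common one is a broken line continuation on the preceding line: either the trailing backslash is missing, or it is followed by whitespace. A hypothetical Python checker for both patterns (the second check is a heuristic keyed on dotted config keys such as `latent_pts.pvd_mse_loss`):

```python
import re

# A backslash followed by spaces or tabs does NOT continue the line in bash:
# it escapes the space instead, and the next line runs as its own command.
TRAILING_WS_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")


def find_broken_continuations(script):
    """Return sorted (line_number, reason) pairs for suspicious continuations."""
    problems = []
    lines = script.splitlines()
    for lineno, line in enumerate(lines, start=1):
        if TRAILING_WS_AFTER_BACKSLASH.search(line):
            problems.append((lineno, "whitespace after line-continuation backslash"))
    for idx in range(1, len(lines)):
        prev, cur = lines[idx - 1].rstrip(), lines[idx]
        tokens = cur.split()
        # Heuristic: an indented line starting with a dotted config key was
        # probably meant to continue the previous command, which therefore
        # should end with a backslash.
        if tokens and cur[:1].isspace() and "." in tokens[0] and prev and not prev.endswith("\\"):
            problems.append((idx + 1, "indented config key without continuation on previous line"))
    return sorted(problems)


example = "python train_dist.py \\ \n    latent_pts.pvd_mse_loss 1\n"
print(find_broken_continuations(example))
```

Passing the text of `script/train_prior.sh` to `find_broken_continuations` would list candidate lines to inspect by hand.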

Train Prior With CLIP problem

Hey,

Sorry to disturb you again. @ZENGXH

I finally got a reasonable result. I chose the VAE checkpoint at epoch=1599 as the final VAE training result. The results are as follows:

I visualized results from epochs after 1599 and found that the reconstruction results seem to get worse.

Then I started training the prior with CLIP features following train_prior_clip.sh and got an error. I debugged the code and found that it is caused by ddpm.num_steps=1. Is there any problem if I set ddpm.num_steps to 1000?

Looking forward to your reply.
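The VAE training command logged later in this thread passes `ddpm.num_steps 1`, so a config inherited from the VAE stage gives the diffusion prior a one-step noise schedule, which would also explain the "shape of beta" problem raised under "cfg.yml for training the diffusion prior". A minimal sketch, assuming a linear DDPM-style beta schedule with the common defaults `beta_start=1e-4` and `beta_end=0.02` (LION's actual schedule and step count are for the authors to confirm), shows how `num_steps` sets the length of the beta array and how much noise the last step reaches:

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (DDPM-style defaults, assumed)."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    if num_steps == 1:
        return [beta_start]
    delta = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * delta for i in range(num_steps)]


def alphas_cumprod(betas):
    """Cumulative products alpha_bar_t = prod over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


for steps in (1, 1000):
    betas = linear_beta_schedule(steps)
    # Number of steps, length of the beta array, and alpha_bar at the final step.
    print(steps, len(betas), alphas_cumprod(betas)[-1])
```

With a single step the final alpha_bar is 1 - 1e-4 = 0.9999, so the latents are barely noised and there is nothing meaningful for a prior to learn to denoise; with 1000 steps it falls close to zero, i.e. close to pure Gaussian noise.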

RuntimeError: Error building extension '_pvcnn_backend'

I have reinstalled PyTorch and the other packages based on the Dockerfile, but the error still occurs. @ZENGXH

Visualization

Thanks for the impressive work. May I ask how you generate the 3D visualizations of the voxel shapes, and of the point cloud shapes?

Details of ShapeNet-vol evaluation

Hello,
I'm bringing up the questions I had after you closed #16, in case you missed that part. I would like to know as much as possible about the (sub)set of examples you used to evaluate on ShapeNet-vol, so that I can meaningfully compare to LION in the absence of released weights/samples. The easiest option would be for you to share that subset as files; otherwise I would need:

The IDs of the models used.
The preprocessing scheme (in particular, do you apply the scale and loc parameters found in the .npz files of the dataset?).
Ideally the IDs of the points you picked in each model (since the dataset provides more than 2048 points per model).

Thanks for your swift responses on the previous issues.
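Per-model point IDs are easiest to share when the subsampling is deterministic. A minimal sketch, assuming each model stores more points than needed (15k per shape in ShapeNetCore.v2.PC15k) and that a seeded standard-library RNG is acceptable; the authors' actual sampling procedure is not stated in this thread, and `subsample_indices` and `take` are hypothetical helpers:

```python
import random


def subsample_indices(num_available, num_points=2048, seed=0):
    """Reproducibly pick `num_points` point indices out of `num_available`.

    Samples without replacement when enough points exist, otherwise with
    replacement. The same (num_available, num_points, seed) always yields the
    same indices, so they can be saved and shared alongside the model IDs.
    """
    rng = random.Random(seed)
    if num_available >= num_points:
        return sorted(rng.sample(range(num_available), num_points))
    return sorted(rng.randrange(num_available) for _ in range(num_points))


def take(points, indices):
    """Gather points (a list of [x, y, z]) at the given indices."""
    return [points[i] for i in indices]


# Example: one model with 15000 stored points, subsampled to 2048 with a per-model seed.
idx = subsample_indices(15000, 2048, seed=42)
print(len(idx), idx[:5])
```

Deriving the seed from the model's position in a fixed, sorted ID list is one simple way to make the whole evaluation subset reproducible from two small files.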

Training bash scripts missing

Neither ./script/train_vae.sh nor ./script/train_prior.sh, both referenced in README.md, is available.

Question about SAP

Thank you for your great work. I have a question: did you integrate the LION and SAP networks, or did you use SAP as a second network to generate meshes?

How do I know when to stop Stage 2 training?

Hi @ZENGXH ,

Is there any indicator of whether stage 2 is trained enough? I use a different dataset, so I don't know at which epoch/iteration to stop.

Best regards,
Yingjie

Per-category reference data for ShapeNet-vol evaluation in Table 3

Can you share the evaluation reference point cloud file for the 3 categories (Airplane, Chair, Car) of ShapeNet-vol?

Thank you.

Issue during training VAE

@ZENGXH Thanks again for the amazing work and for maintaining this repo with such quick replies.

While training the VAE, I ran into this issue during the evaluation step:

2023-03-16 01:28:50.786 | ERROR    | utils.utils:init_processes:1156 - An error has been caught in function 'init_processes', process 'Process-1' (38392), thread 'MainThread' (140360210575360):
Traceback (most recent call last):

  File "/home/alberto/Documents/LION/train_dist.py", line 239, in <module>
    p.start()
    │ └ <function BaseProcess.start at 0x7fa8273c2170>
    └ <Process name='Process-1' parent=38345 started>

  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
    │    │        │    │      └ <Process name='Process-1' parent=38345 started>
    │    │        │    └ <staticmethod(<function Process._Popen at 0x7fa8271f12d0>)>
    │    │        └ <Process name='Process-1' parent=38345 started>
    │    └ None
    └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           │                │                            └ <Process name='Process-1' parent=38345 started>
           │                └ <function DefaultContext.get_context at 0x7fa8271f1480>
           └ <multiprocessing.context.DefaultContext object at 0x7fa8273e68c0>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
           │     └ <Process name='Process-1' parent=38345 started>
           └ <class 'multiprocessing.popen_fork.Popen'>
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
    │    │       └ <Process name='Process-1' parent=38345 started>
    │    └ <function Popen._launch at 0x7fa67ea62950>
    └ <multiprocessing.popen_fork.Popen object at 0x7fa67eb5a110>
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
           │           │                          └ 7
           │           └ <function BaseProcess._bootstrap at 0x7fa8273c2a70>
           └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7fa8273c20e0>
    └ <Process name='Process-1' parent=38345 started>
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <Process name='Process-1' parent=38345 started>
    │    │        │    └ (0, 2, <function main at 0x7fa67ea62560>, Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar1...
    │    │        └ <Process name='Process-1' parent=38345 started>
    │    └ <function init_processes at 0x7fa67ea617e0>
    └ <Process name='Process-1' parent=38345 started>

> File "/home/alberto/Documents/LION/utils/utils.py", line 1156, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(exp_root='../exp', skip_sample=0, skip_nll=0, ntest=None, dataset='cifar10', data='/tmp/nvae-diff/data', autocast_t...
    └ <function main at 0x7fa67ea62560>

  File "/home/alberto/Documents/LION/train_dist.py", line 84, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7fa7d4598040>
    └ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>

  File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 285, in train_epochs
    eval_score = self.eval_nll(step=step, save_file=False)
                 │    │             └ 7599
                 │    └ <function BaseTrainer.eval_nll at 0x7fa7d4598700>
                 └ <trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>

  File "/home/alberto/Documents/LION/my_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {'step': 7599, 'save_file': False}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7fa67ea7ea70>,)
           └ <function BaseTrainer.eval_nll at 0x7fa7d4598670>

  File "/home/alberto/Documents/LION/trainers/base_trainer.py", line 805, in eval_nll
    results = compute_NLL_metric(
              └ <function compute_NLL_metric at 0x7fa7e5c50af0>

  File "/home/alberto/Documents/LION/utils/eval_helper.py", line 59, in compute_NLL_metric
    pair_vis(gen_pcs[worse_ten], ref_pcs[worse_ten],
    │        │       │           │       └ tensor([266, 263, 265,  51, 122,  91, 323, 298, 101, 319], device='cuda:0')
    │        │       │           └ tensor([[[-3.3173e-02,  4.3725e-02, -7.8650e-02],
    │        │       │                      [-3.2106e-02, -7.7591e-02, -2.6546e-02],
    │        │       │                      [-7.3885e-03, -6...
    │        │       └ tensor([266, 263, 265,  51, 122,  91, 323, 298, 101, 319], device='cuda:0')
    │        └ tensor([[[-0.0330,  0.0436, -0.0783],
    │                   [-0.0322, -0.0777, -0.0267],
    │                   [-0.0076, -0.0620,  0.0369],
    │                   .....
    └ <function pair_vis at 0x7fa7e5c50940>

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
2023-03-16 01:28:50.928 | INFO     | __main__:<module>:243 - join 1

But for some reason it didn't stop the training; the terminal just hung there.

MM-CD plot while training VAE

@ZENGXH Hi, could you please share the plot of MM-CD of the VAE training for 1 class?

Thanks in advance!

Testing on other categories

Hello, I am trying to reproduce the results on classes with small numbers of samples, like mug and bottle, but the evaluation step throws this error:

AssertionError: file not found: ./datasets/test_data/ref_val_mug.pt

In the drive folder, you only provided the files for chair, car and airplane. Could you please explain how to generate these files for other categories?
Thanks !

TypeError: expected Tensor as element 0 in argument 0, but got int

 emd = torch.cat(emd_lst)
       │     │   └ [9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999]
       │     └ <built-in method cat of type object at 0x7f3b1185d100>
       └ <module 'torch' from '/~/anaconda3/envs/pytorch13/lib/python3.8/site-packages/torch/__init__.py'>

TypeError: expected Tensor as element 0 in argument 0, but got int

Training to Generate Point Cloud with More Features

Hello, I really appreciate your great work and want to explore it more!

In the paper you mentioned that each point can have more features than only its xyz coordinates. If I were to train this model to generate point clouds of more complex objects, for example an object with different colors for different parts, could I just append the RGB values to each point's coordinates, so that it's now N x 6? Or should I treat a fixed set of colors as class labels and encode these labels as extra features appended to each point? In that case, do I need to redesign the loss? Would this make training take much longer?

Thank you so much!
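Whether LION trains well on extra per-point channels is for the authors to confirm; the training command logged later in this thread passes `ddpm.input_dim 3`, which suggests the point dimension is configurable, but that is an assumption. A minimal sketch of the N x 6 layout, assuming coordinates are already normalized to roughly [-1, 1] and colors arrive as 0 to 255 integers, rescales colors into the same range so a per-channel reconstruction loss is not dominated by raw color values (`append_rgb` is a hypothetical helper):

```python
def append_rgb(xyz, rgb):
    """Concatenate per-point colors onto coordinates, giving an N x 6 list.

    xyz: N points as [x, y, z], assumed already normalized (about [-1, 1]).
    rgb: N colors as [r, g, b] in 0..255, rescaled here to [-1, 1].
    """
    if len(xyz) != len(rgb):
        raise ValueError(f"got {len(xyz)} points but {len(rgb)} colors")
    rows = []
    for point, color in zip(xyz, rgb):
        rows.append(list(point) + [c / 127.5 - 1.0 for c in color])
    return rows


cloud = append_rgb([[0.1, 0.2, 0.3]], [[0, 255, 0]])
print(cloud)  # [[0.1, 0.2, 0.3, -1.0, 1.0, -1.0]]
```

Treating colors as continuous channels keeps the form of the loss unchanged (it simply covers six dimensions); encoding colors as class labels would call for a classification-style loss on those channels, which is a larger change.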

Have you tried generating points with normals?

@ZENGXH

I'm just asking this out of curiosity.

Have you tried generating points with normals?
I think an easier way to generate meshes would be to make LION generate point clouds together with their normals (N, 6).
Then it can build meshes by just applying DPSR directly instead of additionally fine-tuning SAP.

Many-class 3D shape Generation Model

Hi @ZENGXH , thanks for your great work!
I want to train a many-class unconditional 3D shape generation model. Does this mean I need to use all ShapeNet classes to train the VAE? I replaced "car" in the example with "all", but found that the loss became NaN at the beginning of training. How can I solve this problem?

NaN Loss

Hello,
I tried to train the VAE following the steps, but the loss becomes NaN as follows:

Determine vae model convergence

Hello! I'd like to ask how I can determine if my VAE model has converged. Which metrics or loss should I look at? When I'm training on the car dataset, as the KL weights increase, the latent points become more noisy, leading to a decrease in reconstruction quality. Is it possible that if I keep training the model, the reconstruction quality will continue to get worse? If so, how can I know when to stop training?

I used the default config with trainer.epochs set to 800.
step 25480
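The training command logged later in this thread sets `trainer.anneal_kl 1`, `sde.kl_const_coeff_vada 1e-7`, `sde.kl_max_coeff_vada 0.5` and `sde.kl_anneal_portion_vada 0.5`. Assuming these describe a linear warm-up of the KL weight over the first half of training (the exact formula in LION's trainer, and whether it counts steps or epochs, are assumptions here), the sketch below shows that the weight keeps rising until the midpoint, so noisier latent points and worse reconstructions are expected during that phase. One heuristic, not an official criterion, is to compare checkpoints only after the warm-up has finished:

```python
def kl_coeff(step, total_steps, anneal_portion=0.5, min_coeff=1e-7, max_coeff=0.5):
    """KL weight under an assumed linear warm-up.

    Rises linearly from min_coeff at step 0 to max_coeff at
    anneal_portion * total_steps, then stays at max_coeff.
    """
    anneal_steps = max(1, int(total_steps * anneal_portion))
    frac = min(step / anneal_steps, 1.0)
    return min_coeff + (max_coeff - min_coeff) * frac


# Hypothetical 800-epoch run, counting in epochs for simplicity.
for epoch in (0, 100, 200, 400, 799):
    print(epoch, round(kl_coeff(epoch, 800), 4))
```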

Evaluation on ShapeNet-vol

I would like to ask how you evaluated your code on ShapeNet-vol. Since the benchmark's complexity is quadratic in the number of examples, and already takes quite a while with just the airplane category of ShapeNet-pointflow, I am wondering how you computed the metrics for all categories combined.
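The quadratic cost comes from comparing every generated shape with every reference shape. A minimal pure-Python sketch, assuming the common Chamfer-based definitions of MMD and coverage (squaring and averaging conventions differ between codebases, and this is not LION's evaluation code), makes that len(gen) x len(ref) matrix explicit. 1-NNA additionally needs generated-to-generated and reference-to-reference distances, so its matrix grows with (|S_g| + |S_r|)^2:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(x, y):
    """Symmetric Chamfer distance: mean squared NN distance x->y plus y->x."""
    x_to_y = sum(min(sq_dist(p, q) for q in y) for p in x) / len(x)
    y_to_x = sum(min(sq_dist(q, p) for p in x) for q in y) / len(y)
    return x_to_y + y_to_x


def mmd_and_cov(gen, ref):
    """MMD-CD and COV-CD from the full len(gen) x len(ref) Chamfer matrix."""
    d = [[chamfer(g, r) for r in ref] for g in gen]  # the quadratic part
    # MMD: for each reference shape, distance to its closest generated shape.
    mmd = sum(min(d[i][j] for i in range(len(gen))) for j in range(len(ref))) / len(ref)
    # COV: fraction of reference shapes that are the nearest match of some generated shape.
    matched = {min(range(len(ref)), key=lambda j: d[i][j]) for i in range(len(gen))}
    return mmd, len(matched) / len(ref)


gen = [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]]]
ref = [[[0.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]]
print(mmd_and_cov(gen, ref))  # (1.0, 0.5)
```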

Hey, I reimplemented the encoder and diffusion process but cannot reach the metrics you report. When do you plan to release the training code?

Hi, when will you release your code?

ShapeNet normalization

When you mention global normalization to [-1, 1] in table 1 and per-example normalization to [-1, 1] in table 2, which flags on ShapeNet15kPointClouds does that correspond to? I'm thinking about the combination of normalize_per_shape, normalize_shape_box, normalize_std_per_axis and normalize_global. Thank you.
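Mapping the paper's wording onto specific flags is for the authors to confirm. The dataset log later in this thread prints `normalize_global` statistics as a per-axis mean and a single scalar std, so the codebase's global option appears to standardize rather than fit a box. The sketch below only contrasts computing one bounding box over the whole dataset with one box per shape, assuming "normalization to [-1, 1]" means centering the bounding box and scaling its longest side to length 2 (`box_normalize` is a hypothetical helper, not PointFlow's code):

```python
def bbox_center_and_half_extent(points):
    """Center of the axis-aligned bounding box and half of its longest side."""
    lo = [min(p[k] for p in points) for k in range(3)]
    hi = [max(p[k] for p in points) for k in range(3)]
    center = [(lo[k] + hi[k]) / 2 for k in range(3)]
    half = max(hi[k] - lo[k] for k in range(3)) / 2
    return center, half


def box_normalize(shapes, per_shape):
    """Map shapes into [-1, 1] using one box per shape or one box for the whole dataset.

    shapes: list of shapes, each a list of [x, y, z] points.
    Zero-size boxes are not handled.
    """
    if not per_shape:
        all_points = [p for shape in shapes for p in shape]
        center, half = bbox_center_and_half_extent(all_points)
    out = []
    for shape in shapes:
        if per_shape:
            center, half = bbox_center_and_half_extent(shape)
        out.append([[(p[k] - center[k]) / half for k in range(3)] for p in shape])
    return out


shapes = [[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]]
print(box_normalize(shapes, per_shape=True))   # each shape spans [-1, 1] on x
print(box_normalize(shapes, per_shape=False))  # the smaller shape only spans [-1, 0] on x
```

The practical difference is that per-shape normalization erases relative size between shapes, while the global variant preserves it.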

Can we directly sample the VAE's latent space and get reasonable generation results once the VAEs are trained?

@ZENGXH

VAE Training Time

Hey,
Thanks for your great work, @ZENGXH.
I would like to ask how long it takes to train the VAE on all categories.
I have trained the VAE on all categories on 8 V100 16GB GPUs for 15 days with batch size 12, but only 4000 epochs have completed.

Is there any way to accelerate the training process (for example, by increasing the batch size)?

Another problem is that it is hard for me to judge whether the VAE is well trained (I think visualization alone does not comprehensively reflect how well the VAE is trained). Especially when training takes this long, it is important to make sure the training is working.

Code release date?

Super impressive work, do you have an estimate for the time of your code release?

Release the source code

Thank you for your excellent work. I was deeply impressed by the excellent results in your paper. When will the source code and pretrained model be released?

Hi, thank you for the great work you have done. Can I ask when you are planning to make your code public?

Can LION support a variable number of points instead of fixing the number to 2048?

About code release

@ZENGXH Hi, all.

This is excellent work. What's the plan for the code release?

Thank you.

assert(context.shape[1] == self.num_points*self.context_dim) shapes don't match

Hi @ZENGXH, thanks for your hard work. I am testing a custom dataset of shape (1076, 200000, 3), i.e. point clouds with 200k points each. I've adjusted a few lines of code in pointflow_datasets.py. However, the final shapes don't match in models/latent_points_ada.py. Is there any way to solve this, or any suggestions?

  context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4

        # TODO: why do we need this??
        # self.train_points = self.all_points[:, :min(
        #     10000, self.all_points.shape[1])]  # subsample 15k points to 10k points per shape
        self.train_points = self.all_points[:, :min(
            200000, self.all_points.shape[1])]  # keep up to 200k points per shape
        self.tr_sample_size = min(10000, tr_sample_size)  # caps training samples at 10k points per shape
        self.te_sample_size = min(5000, te_sample_size)  # caps test samples at 5k points per shape

and the training-script settings:

    shapelatent.decoder_num_points 100000 \
    data.tr_max_sample_points 100000 data.te_max_sample_points 100000 \
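The printed numbers are consistent with the encoder having received 10000 points per shape instead of 100000: 40000 / 4 = 10000, which equals the hard-coded `min(10000, tr_sample_size)` cap in the modified dataset code, while `shapelatent.decoder_num_points` is 100000. This is one plausible reading, not a confirmed diagnosis; if it holds, the sample-size cap and `decoder_num_points` would need to agree. A small check mirroring the failing assertion (`points_in_context` and `check_context` are hypothetical helpers):

```python
def points_in_context(context_len, context_dim):
    """Number of latent points encoded in a flattened context of length context_len."""
    if context_len % context_dim:
        raise ValueError(f"{context_len} is not a multiple of context_dim={context_dim}")
    return context_len // context_dim


def check_context(context_len, num_points, context_dim):
    """Mirror of the assertion in models/latent_points_ada.py, with a readable message."""
    expected = num_points * context_dim
    if context_len != expected:
        return (f"context holds {points_in_context(context_len, context_dim)} points, "
                f"decoder expects {num_points} ({context_len} != {expected})")
    return "ok"


# Values printed in this issue.
print(check_context(40000, num_points=100000, context_dim=4))
# -> context holds 10000 points, decoder expects 100000 (40000 != 400000)
print(min(10000, 100000))  # effective training sample size under the modified cap
```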


2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/data_t_npy/
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/data__npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/data__npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/data__npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------

----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 347, in accept
    return self._accept(callback)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 384, in _accept
    callback(list_to_sent)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 511, in _send_stdout_messages_batch
    self._process_rest_api_send(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 591, in _process_rest_api_send
    sender(**kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 3231, in send_stdout_batch
    self.post_from_endpoint(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2031, in post_from_endpoint
    return self._result_from_http_method(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2053, in _result_from_http_method
    return method(url, payload, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2134, in post
    return super(RestApiClient, self).post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 1988, in post
    response = self.low_level_api_client.post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 536, in post
    return self.do(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 639, in do
    response = session.request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-24 22:37:47.706 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0824/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:214 - save config at ../exp/0824/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:217 - log dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-24 22:37:47.737 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.118s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-24 22:37:49.185 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1


2023-08-24 22:37:55.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:55.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:55.501 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:55.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:55.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:55.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:55.557 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:55.557 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:55.558 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:55.609 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:56.937 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:37:57.509 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/transform_buildingnet_npy/house/train 
2023-08-24 22:37:58.454 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:38:02.066 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:38:04.353 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:38:04.396 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:38:04.398 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_buildingnet_npy/house/val 
2023-08-24 22:38:04.514 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:38:04.855 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:38:04.863 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:38:04.865 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:38:05.123 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f5f3a86d880>
tr_x[-1].shape:  torch.Size([1, 10000, 3])
2023-08-24 22:38:05.383 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 10000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 10000, 3])
2023-08-24 22:38:05.396 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:38:05.397 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:38:05.398 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:38:05.399 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
> /home/bim-group/Documents/GitHub/LION/models/latent_points_ada.py(279)forward()
-> assert(context.shape[1] == self.num_points*self.context_dim)
(Pdb) 
```
2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/_npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/_npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/t_npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/_npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/75ce6d1e28c3496c9b264a8567167fcc
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 347, in accept
    return self._accept(callback)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 384, in _accept
    callback(list_to_sent)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 511, in _send_stdout_messages_batch
    self._process_rest_api_send(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 591, in _process_rest_api_send
    sender(**kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 3231, in send_stdout_batch
    self.post_from_endpoint(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2031, in post_from_endpoint
    return self._result_from_http_method(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2053, in _result_from_http_method
    return method(url, payload, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2134, in post
    return super(RestApiClient, self).post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 1988, in post
    response = self.low_level_api_client.post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 536, in post
    return self.do(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 639, in do
    response = session.request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
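In the log above, the failing assertion compares `context.shape[1]` (40000) with `self.num_points*self.context_dim` (400000). These values are consistent with the visualization batch `tr_x` holding 10000 points per shape (`tr_x[-1].shape:  torch.Size([1, 10000, 3])`) while the decoder was built with `shapelatent.decoder_num_points 100000`. The arithmetic can be sketched as follows. `expected_context_length` and `check_context` are hypothetical helpers, not part of LION, and the sketch assumes `context_dim = point_dim + latent_dim` (3 + 1 = 4), which is inferred from the printed values rather than confirmed from the source.

```python
# Hypothetical helpers reproducing the length check asserted in
# models/latent_points_ada.py (forward, line 279 in the log above).
# Assumption: context_dim = point_dim + latent_dim, i.e. 3 + 1 = 4 here.


def expected_context_length(num_points: int, point_dim: int, latent_dim: int) -> int:
    """Length of the flattened latent-point context for `num_points` points."""
    return num_points * (point_dim + latent_dim)


def check_context(batch_num_points: int, decoder_num_points: int,
                  point_dim: int = 3, latent_dim: int = 1) -> str:
    """Compare the context length a batch produces with what the decoder expects."""
    channels = point_dim + latent_dim
    got = expected_context_length(batch_num_points, point_dim, latent_dim)
    want = expected_context_length(decoder_num_points, point_dim, latent_dim)
    if got == want:
        return "ok"
    return (f"mismatch: batch gives {got} values ({batch_num_points} points x {channels}), "
            f"decoder expects {want} ({decoder_num_points} points x {channels})")


# Values from the log: 10000 points per shape in tr_x, decoder built for 100000.
print(check_context(batch_num_points=10000, decoder_num_points=100000))
# mismatch: batch gives 40000 values (10000 points x 4), decoder expects 400000 (100000 points x 4)
```

If this reading is right, the number of points fed to the model has to match `shapelatent.decoder_num_points` (100000 in the command above), or the decoder has to be built for the smaller point count.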

RuntimeError: Error building extension 'emd_ext'

How can I solve this locally?
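`emd_ext` is compiled on first use as a PyTorch JIT extension, so two local checks are often worthwhile: whether `nvcc` is on `PATH` (a mismatch between the CUDA toolkit used by `nvcc` and the CUDA version PyTorch was built with is one common cause of build failures), and whether a stale, half-built folder is left in the extensions cache. For reference, the training log earlier on this page shows the cache at `~/.cache/torch_extensions/py38_cu111`. The sketch below uses only the standard library. `extensions_root` and `find_builds` are hypothetical helpers, and the location logic (`TORCH_EXTENSIONS_DIR` if set, else `torch_extensions` under `$XDG_CACHE_HOME` or `~/.cache`) is an assumption based on that log and recent PyTorch behavior.

```python
import os
import shutil
from pathlib import Path


def extensions_root(env=None) -> Path:
    """Guess where PyTorch stores JIT-built extensions.

    Assumption: TORCH_EXTENSIONS_DIR overrides the location; otherwise it is
    a torch_extensions folder under $XDG_CACHE_HOME (default ~/.cache).
    """
    env = os.environ if env is None else env
    if env.get("TORCH_EXTENSIONS_DIR"):
        return Path(env["TORCH_EXTENSIONS_DIR"])
    cache = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache) / "torch_extensions"


def find_builds(name: str, env=None) -> list:
    """Return existing build folders for extension `name` under the root."""
    root = extensions_root(env)
    if not root.is_dir():
        return []
    # Builds usually sit under a per-Python/CUDA tag such as py38_cu111;
    # also check directly under the root in case there is no tag folder.
    hits = [p for p in root.glob(f"*/{name}") if p.is_dir()]
    direct = root / name
    if direct.is_dir():
        hits.append(direct)
    return sorted(hits)


if __name__ == "__main__":
    print("nvcc on PATH:", shutil.which("nvcc"))
    for build_dir in find_builds("emd_ext"):
        print("existing build folder:", build_dir)
        # Uncomment to delete it and force a clean rebuild on the next run:
        # shutil.rmtree(build_dir)
```

The compiler output reported together with `RuntimeError: Error building extension 'emd_ext'` usually names the actual cause, so it is worth posting that output as well.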

Increasing loss

Hello,
I am trying to train the VAE following the steps, but the loss is increasing.

Multiple GPU usage problem

Hi, thank you for the quick response and for maintaining the amazing repo!

I have a server with 4 GPUs and want to use all of them, so I set $NGPU to 4 when running train_vae.sh. However, process initialization gets stuck. You can see my log below:

```
2023-03-16 22:26:34.640 | INFO | main:get_args:206 - EXP_ROOT: /LION/trials + exp name: 0316/colon/f14d9fh_hvae_lion_B8, save dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.820 | INFO | main:get_args:211 - save config at /LION/trials/0316/colon/f14d9fh_hvae_lion_B8/cfg.yml
2023-03-16 22:26:34.821 | INFO | main:get_args:214 - log dir: /LION/trials/0316/colon/f14d9fh_hvae_lion_B8
2023-03-16 22:26:34.862 | INFO | main::228 - In Rank=0
2023-03-16 22:26:34.892 | INFO | main::234 - Node rank 0, local proc 0, global proc 0
2023-03-16 22:26:34.937 | INFO | main::228 - In Rank=1
2023-03-16 22:26:34.941 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.952 | INFO | main::234 - Node rank 0, local proc 1, global proc 1
2023-03-16 22:26:34.953 | INFO | utils.utils:init_processes:1152 - init_process: rank=0, world_size=4
2023-03-16 22:26:34.967 | INFO | main::228 - In Rank=2
2023-03-16 22:26:34.971 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:34.983 | INFO | main::234 - Node rank 0, local proc 2, global proc 2
2023-03-16 22:26:34.983 | INFO | utils.utils:init_processes:1152 - init_process: rank=1, world_size=4
2023-03-16 22:26:34.998 | INFO | main::228 - In Rank=3
2023-03-16 22:26:35.002 | DEBUG | utils.utils:init_processes:1141 - set port as 6010
2023-03-16 22:26:35.013 | INFO | main::234 - Node rank 0, local proc 3, global proc 3
2023-03-16 22:26:35.013 | INFO | utils.utils:init_processes:1152 - init_process: rank=2, world_size=4
2023-03-16 22:26:35.056 | INFO | main::242 - join 3
2023-03-16 22:26:35.060 | DEBUG | utils.utils:init_processes:1141 - set port as 6011
2023-03-16 22:26:35.073 | INFO | utils.utils:init_processes:1152 - init_process: rank=3, world_size=4
```

Nothing happens after this. I am using Docker. Do you have any idea how to solve this problem? Thank you in advance!
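In the log above, the `set port as` lines report 6010 three times and then 6011 just before `init_process: rank=3`. If each rank picks its port on its own (for example by probing for a free one), the last rank may have found 6010 already taken by the first rank's rendezvous server and moved on, so the four processes would never meet. This is one plausible reading of the hang, not a confirmed diagnosis. A common way to rule it out is to choose the address and port once in the parent process and let every rank inherit them; PyTorch's `env://` initialization reads `MASTER_ADDR` and `MASTER_PORT`. The helpers below are hypothetical (not part of LION) and use only the standard library.

```python
import os
import socket
from contextlib import closing
from typing import Optional


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def set_rendezvous_env(host: str = "127.0.0.1", port: Optional[int] = None) -> dict:
    """Fix MASTER_ADDR/MASTER_PORT once, in the parent process.

    Processes started afterwards inherit os.environ, so every rank sees the
    same address and port instead of choosing one independently.
    """
    if port is None:
        port = find_free_port(host)
    os.environ["MASTER_ADDR"] = host
    os.environ["MASTER_PORT"] = str(port)
    return {"MASTER_ADDR": host, "MASTER_PORT": str(port)}


if __name__ == "__main__":
    # Call this before starting the per-GPU processes, not inside each rank.
    print(set_rendezvous_env())
```

A port found this way is only guaranteed free at the moment of the check; the point of the design is that all ranks agree on a single value.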

Question about SVR

Hi, thanks for your great work! @ZENGXH

I want to know how to implement single-view reconstruction (SVR). From your supplementary materials, I know that you implement voxel-guided generation by fine-tuning the VAE encoder, and shape interpolation by Diffuse-Denoise. So I guess you simply replace the VAE encoder with the CLIP image encoder and then train on ShapeNet to produce plausible shapes from a single-view image. Is that right?

Looking forward to your reply!

NaN loss while training stage 1 VAE

Hi @ZENGXH ,

Thank you for sharing the code.

I am training the VAE (stage 1) on the ShapeNet15k dataset by following the instructions in the README.md file.
I am using the default config, except that the batch size is 16 (batch size 32 gave a CUDA out-of-memory error). The loss started increasing and eventually became nan.
So I retrained with a lower learning rate of 1e-4 (originally 1e-3). Again, the loss decreased, then increased, and became nan.

Please see the contents of the log file below:

2023-06-13 21:50:53.148 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 70/153] | [Loss] 335.14 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]    70 | [url] none
2023-06-13 21:51:53.192 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none
2023-06-13 21:51:53.251 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E0 iter[152/153] | [Loss] 233.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   152 | [url] none | [time] 2.0m (~267h) |[best] 0 -100.000x1e-2
2023-06-13 21:52:53.518 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 81/153] | [Loss] 108.90 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   234 | [url] none
2023-06-13 21:53:45.658 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[152/153] | [Loss] 100.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   305 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:54:46.026 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 81/153] | [Loss] 79.69 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   387 | [url] none
2023-06-13 21:55:38.097 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[152/153] | [Loss] 76.43 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   458 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 21:56:38.487 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 81/153] | [Loss] 66.25 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   540 | [url] none
2023-06-13 21:57:30.785 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E3 iter[152/153] | [Loss] 63.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   611 | [url] none | [time] 1.9m (~250h) |[best] 0 -100.000x1e-2
2023-06-13 21:58:31.106 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E4 iter[ 81/153] | [Loss] 58.29 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   693 | [url] none
2023-06-13 21:59:23.191 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E4 iter[152/153] | [Loss] 57.15 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   764 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:00:23.558 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E5 iter[ 81/153] | [Loss] 55.49 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   846 | [url] none
2023-06-13 22:01:15.726 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E5 iter[152/153] | [Loss] 55.84 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   917 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:02:16.029 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E6 iter[ 81/153] | [Loss] 58.48 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]   999 | [url] none
2023-06-13 22:03:08.117 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E6 iter[152/153] | [Loss] 59.70 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1070 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:04:08.409 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E7 iter[ 81/153] | [Loss] 64.31 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1152 | [url] none
2023-06-13 22:05:00.592 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E7 iter[152/153] | [Loss] 65.85 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1223 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:06:00.953 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E8 iter[ 81/153] | [Loss] 70.98 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1305 | [url] none
2023-06-13 22:06:53.085 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E8 iter[152/153] | [Loss] 72.55 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1376 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:07:53.497 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E9 iter[ 81/153] | [Loss] 77.83 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1458 | [url] none
2023-06-13 22:08:45.652 | INFO     | trainers.base_trainer:train_epochs:256 - [R0] | E9 iter[152/153] | [Loss] 79.42 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1529 | [url] none | [time] 1.9m (~249h) |[best] 0 -100.000x1e-2
2023-06-13 22:08:45.776 | INFO     | utils.exp_helper:get_evalname:94 - git hash: 13b1c
2023-06-13 22:08:47.341 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 1/36
2023-06-13 22:08:51.946 | INFO     | trainers.base_trainer:eval_nll:743 - eval: 31/36
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:65 - best 10: tensor([ 57,   1, 349, 131, 113, 282, 271, 201, 108, 182], device='cuda:0')
2023-06-13 22:09:00.621 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-CD: 5.0256807604398546e-09
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:72 - MMD-EMD: 1.9488379621179774e-05
2023-06-13 22:09:00.622 | INFO     | utils.eval_helper:compute_NLL_metric:77 -
------------------------------------------------------------
../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/recont_1529noemas1H13b1c.pt |
MMD-CD=0.000x1e-2 MMD-EMD=0.002x1e-2  step=1529
 none
 ------------------------------------------------------------
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-CD
2023-06-13 22:09:00.622 | INFO     | trainers.base_trainer:eval_nll:814 - add: MMD-EMD
2023-06-13 22:09:00.634 | INFO     | trainers.base_trainer:save:106 - save model as : ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16/checkpoints/best_eval.pth
2023-06-13 22:09:10.367 | INFO     | trainers.common_fun:validate_inspect_noprior:104 - writer: none
2023-06-13 22:09:46.203 | INFO     | trainers.base_trainer:train_epochs:219 - [R0] | E10 iter[ 49/153] | [Loss] 83.91 | [exp] ../../output/lion_output/0613/car/cb9303h_hvae_lion_B16 | [step]  1579 | [url] none

I looked at previous issues #9, #17, #18, #22, and #35, but did not find a solution.
Could you please tell me how to resolve this issue?

Also, could you please share the checkpoint you mentioned in this section?

Thank you,
Supriya
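To pin down where a run like this starts to diverge, the per-step losses can be pulled out of the trainer log shown above. This is a minimal sketch, assuming every loss line has the `[Loss] <value> ... [step] <n>` layout that appears in this log; the helper names `parse_losses` and `summarize` are hypothetical and not part of the LION code base.

```python
import math
import re

# Matches the "[Loss] <value> ... [step] <n>" fields of a trainer log line; "nan" is accepted too.
LOSS_RE = re.compile(r"\[Loss\]\s+(nan|[-+]?\d+(?:\.\d+)?).*?\[step\]\s+(\d+)", re.IGNORECASE)


def parse_losses(lines):
    """Return {step: loss} sorted by step; repeated lines for the same step collapse into one."""
    losses = {}
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            losses[int(match.group(2))] = float(match.group(1))
    return dict(sorted(losses.items()))


def summarize(losses):
    """Return (step of lowest finite loss, that loss, first step whose loss is NaN, or None)."""
    finite = {step: loss for step, loss in losses.items() if not math.isnan(loss)}
    first_nan = next((step for step, loss in losses.items() if math.isnan(loss)), None)
    if not finite:
        return None, None, first_nan
    best_step = min(finite, key=finite.get)
    return best_step, finite[best_step], first_nan
```

Fed the full log above, this reports step 846 (loss 55.49, epoch 5) as the lowest point, after which the loss climbs steadily; the first NaN step depends on how long the run is left going.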

Question about Diffuse-Denoise

@ZENGXH Hi Xiaohui, thanks so much for your impressive work and code! I have a question about the Diffuse-Denoise process you describe in Section 3.1 and App. C.1, which diffuses the latent features (z0 and h0) to step τ < T and then denoises them back to obtain multimodal generations. The results in the paper are fascinating, so I would like to know how to reproduce them with the released code. I tried to look for this function but still can't find it. ^^
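The Diffuse-Denoise idea described in the question (noise a latent up to an intermediate step τ < T, then run the reverse process back to step 0) can be sketched for a single latent array with standard DDPM formulas. This is a minimal sketch, not LION's implementation: the linear β schedule, T = 1000, and the noise predictor `eps_model(x, t)` are assumptions, and per the question the procedure would be applied to both latents (z0 and h0) using the trained priors.

```python
import numpy as np


def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Assumed DDPM-style linear schedule; LION's actual schedule may differ."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def diffuse(x0, tau, alpha_bars, rng):
    """Sample x_tau ~ q(x_tau | x_0) in closed form (tau is 1-indexed)."""
    abar = alpha_bars[tau - 1]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * rng.standard_normal(x0.shape)


def denoise(x_tau, tau, eps_model, betas, alphas, alpha_bars, rng):
    """DDPM ancestral sampling from step tau back to step 0."""
    x = x_tau
    for t in range(tau, 0, -1):
        i = t - 1
        eps = eps_model(x, t)  # hypothetical trained noise predictor
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])
        # Add fresh noise on every step except the last one.
        x = mean + np.sqrt(betas[i]) * rng.standard_normal(x.shape) if t > 1 else mean
    return x


def diffuse_denoise(x0, tau, eps_model, T=1000, seed=0):
    """Partially noise x0 to step tau, then denoise it back to a new sample."""
    if not 1 <= tau <= T:
        raise ValueError("tau must satisfy 1 <= tau <= T")
    betas, alphas, alpha_bars = linear_beta_schedule(T)
    rng = np.random.default_rng(seed)
    x_tau = diffuse(x0, tau, alpha_bars, rng)
    return denoise(x_tau, tau, eps_model, betas, alphas, alpha_bars, rng)
```

Calling `diffuse_denoise` with different seeds on the same latent gives different outputs; a small τ keeps them close to the input, while a larger τ allows more variation.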

How can I use CLIP-related features?

Hi, @ZENGXH. I appreciate your excellent work!

I am trying to use the CLIP-related features of this model, such as single-view reconstruction.
According to the original paper, this feature requires training the latent diffusion models conditioned on images.
I'd like to know how to do this.

We render 2D images from the 3D ShapeNet shapes, extracted the images’ CLIP [105]
image embeddings, and trained LION’s latent diffusion models while conditioning on the shapes’
CLIP image embeddings.

I guess I need to set clip_forge_enable = 1 when running train_prior,
but I need help understanding how to do it properly.
Could you explain how to do it?

Thank you in advance!
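The paper passage quoted in this question describes conditional latent diffusion training: each shape's latent is noised, and the denoiser is trained to predict that noise while also seeing the CLIP embedding of a rendered image of the same shape. A minimal sketch of one loss evaluation, assuming an epsilon-prediction objective and a hypothetical denoiser `eps_model(z_t, t, cond)` that takes the embedding as an extra input; rendering, CLIP feature extraction, and how LION actually injects the embedding (and which config flags enable it) are outside this sketch.

```python
import numpy as np


def conditional_eps_loss(z0, cond, eps_model, alpha_bars, rng):
    """Epsilon-prediction loss for a latent diffusion model conditioned on `cond`.

    z0:         clean latents, shape (batch, ...)
    cond:       per-shape conditioning vectors, e.g. CLIP image embeddings, shape (batch, d)
    eps_model:  hypothetical denoiser called as eps_model(z_t, t, cond)
    alpha_bars: cumulative products of (1 - beta_t), length T
    """
    batch = z0.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=batch)  # one random timestep per shape, in 1..T
    abar = alpha_bars[t - 1].reshape((batch,) + (1,) * (z0.ndim - 1))
    noise = rng.standard_normal(z0.shape)
    z_t = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * noise  # closed-form forward diffusion
    pred = eps_model(z_t, t, cond)  # the denoiser sees the image embedding here
    return float(np.mean((pred - noise) ** 2))
```

At sampling time the same denoiser would be called with the embedding of the input image, which is what lets a single view steer generation.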

Per-category evaluation results of resampled point clouds from fine-tuned SAP

Thanks @ZENGXH for fast and detailed responses.

I want to ask whether there are per-category (airplane, chair, car) evaluation results for the resampled point clouds, like Table 15 of the supplementary material.
