zhaozh10 / pointblip

This project is forked from pointcept/gpt4point.


PointBLIP: A Point Cloud Multi-modal model Embracing Diverse Data without Reliance on Image Domain

License: MIT License

Python 99.34% Jupyter Notebook 0.56% Shell 0.10%


PointBLIP: A Unified Framework for Point-Language Understanding without Reliance on Image Domain


Overview

This project presents PointBLIP, a 3D multi-modal model that aligns 3D point clouds with language, inspired by the 2D multi-modal model BLIP.

  • We directly align the representations of point clouds and language, without the additional alignment to the image modality required by classical methods.
  • Furthermore, we explore the great potential of point clouds for a variety of 3D multi-modal understanding and generation tasks.

🔥 Demos

These demos show 3D captions of objects from Objaverse.

  • The first row shows the meshes of the objects; the second row shows their point clouds without color.
  • Grey captions are generated by BLIP when provided with a rendered image view of the object.
  • White captions are generated by PointBLIP when provided with only the point cloud.

Notably, the white captions convey more intricate details about the objects' geometric attributes.

News

🔥 2023/08/13: The two-stage pre-training code of PointBLIP has been released.

🔥 2023/08/13: Part of the datasets used and the result files have been uploaded.

PointBLIP

Previous method ULIP:

Our PointBLIP:

ULIP is a representative work on aligning point clouds with other modalities (upper part). However, during training it must align 3D point clouds with both images and texts for the model to gain 3D semantic understanding.

To simplify this approach, our PointBLIP directly aligns texts with 3D point clouds (lower part). Besides, we add an LLM (Large Language Model) on top of the joint representation learning, which fully promotes the combination of 3D point cloud and text representations and applies successfully to multiple downstream tasks.

Our PointBLIP demonstrates three key attributes:

  • $\color{darkorange}{Directly\ Align\ Texts\ with\ 3D\ Point\ Clouds.}$ To improve the recognition ability and semantic understanding of 3D backbone models, we directly align the representations of 3D point clouds and texts. We do not introduce additional image representations during training, which simplifies the training process and fully aligns the representations (see the sketch after this list).
  • $\color{darkorange}{Bridge\ the\ Modality\ Gap\ Guided\ by\ BLIP2.}$ Inspired by the 2D multi-modal model BLIP2, we ingeniously utilize both pretrained 3D point cloud models and large language models. We bridge the modality gap between 3D point clouds and texts using a trainable module (the text encoder in the figure) pretrained in two stages.
  • $\color{darkorange}{LLM\ Empowers\ a\ Wide\ Range\ of\ 3D\ Semantic\ Tasks.}$ Incorporating a large language model enables a broader spectrum of 3D semantic understanding and generation tasks. Besides performing 3D classification directly with the trained representations, PointBLIP can perform 3D caption generation, 3D retrieval, and 3D question answering, fully exploring the semantic capabilities of 3D point clouds.
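To make the first attribute concrete, here is a minimal sketch (not the released code) of an InfoNCE-style contrastive loss that directly aligns point cloud and text embeddings; the encoder outputs, batch layout, and temperature value are assumptions for illustration:

import torch
import torch.nn.functional as F

def point_text_contrastive_loss(point_feats, text_feats, temperature=0.07):
    # point_feats: (B, D) pooled features from a 3D point cloud encoder
    # text_feats:  (B, D) pooled features from a text encoder
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = point_feats @ text_feats.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each point cloud to its caption and vice versa.
    loss_p2t = F.cross_entropy(logits, targets)
    loss_t2p = F.cross_entropy(logits.t(), targets)
    return (loss_p2t + loss_t2p) / 2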

3D Caption

Given a 3D point cloud as input, the caption generated by our PointBLIP better reflects the structural features, orientation, and finer details of the object.

These demos follow the same format as the initial demos: grey captions are produced by BLIP when presented with object image views, while white captions are generated by PointBLIP using only point clouds.

Point Cloud QA

Given a 3D point cloud and text as input, PointBLIP can interactively generate answers to questions.

Get Started

Preparation

1. Install salesforce-lavis

$ conda create -n lavis python=3.8
$ conda activate lavis

$ git clone https://github.com/salesforce/LAVIS.git SalesForce-LAVIS
$ cd SalesForce-LAVIS
$ pip install -e .

$ pip install positional_encodings

2. Prepare the dataset

We use Objaverse (80k objects) to train and evaluate the models, with the point clouds and text description labels provided by ULIP2. You can download the datasets directly here.

pointblip
├── figure
├── lavis
├── data
│   ├── objaverse_pc_parallel
│   ├── merged_data_new.json
│   ├── common_ids.txt

3. Convert the dataset into the training format

The absolute path of the converted dataset should be registered in lavis/configs/default.yaml as cache_root.
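A minimal sketch of the relevant entry, assuming the LAVIS convention of placing cache_root under the env key (the path is a placeholder):

env:
  # absolute path to the directory containing the converted dataset
  cache_root: "/abs/path/to/pointblip/data"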

Training

$ conda activate lavis
# use facebook/opt-2.7b:
# stage 1:
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/point_blip/train/pretrain_stage1_point_obja.yaml
# stage 2:
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/point_blip/train/pretrain_stage2_point_obja.yaml

Before stage-2 training, you need to place the stage-1 trained checkpoint under the model directory and update the corresponding config path.
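A minimal sketch of how the stage-2 config might reference the stage-1 checkpoint, assuming the LAVIS convention of a pretrained field under the model section (key names and path are assumptions):

model:
  # path to the stage-1 checkpoint placed under your model directory
  pretrained: "/path/to/model/stage1_checkpoint.pth"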

Evaluation

$ python -m torch.distributed.run --nproc_per_node=8 evaluate.py --cfg-path lavis/projects/point_blip/eval/caption_objaverse_opt2.7b_eval.yaml

Results will be saved as a .json file in lavis/output with the following format:

[
    {
        "image_id": "object hash id of objaverse",
        "2d_caption": "gt caption when training BLIP-3D",
        "caption": "generated caption by BLIP-3D"
    },
    ...
]
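For convenience, a minimal sketch for loading and inspecting such a result file (the file name is a placeholder; use the actual file written under lavis/output):

import json

# Placeholder path; substitute the actual output file name.
with open("lavis/output/result.json") as f:
    results = json.load(f)

# Print the first few entries: object id, reference caption, generated caption.
for entry in results[:5]:
    print(entry["image_id"])
    print("  reference :", entry["2d_caption"])
    print("  generated :", entry["caption"])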

Contributors

qi-zhangyang, aleafy, sunzey, zhaozh10
