GithubHelp home page GithubHelp logo

remyxai / vqasynth Goto Github PK

View Code? Open in Web Editor NEW
76.0 4.0 3.0 1.29 MB

Compose multimodal datasets 🎹

Home Page: https://twitter.com/smellslikeml/status/1756723056675094726

Python 87.57% Dockerfile 7.92% Shell 4.51%
data-pipeline data-processing dataset-generation multimodal-datasets multimodal-deep-learning synthetic-dataset-generation

vqasynth's Introduction

VQASynth

Enhance the reasoning of multimodal models with pipelines to synthesize VQA datasets.

Background

Inspired by SpatialVLM, this repo uses ZoeDepth to adapt Vision Langauge Models for spatial reasoning. The demos feature pipelines using LLaVA for object captioning and SAM for segmentation. One uses CLIPSeg for region proposal, while the other uses GroundingDINO.

VQASynth-diagram.png

Environment

Before running the demo scripts, ensure you have the following installed:

CLIPSeg-based SpatialVLM data processing (recommended):

cd tests/data_processing/
docker build -f clipseg_data_processing.dockerfile -t vqasynth:clipseg-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:clipseg-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output" 

GroundingDINO-based SpatialVLM data processing:

cd tests/data_processing/
docker build -f groundingDino_data_processing.dockerfile -t vqasynth:dino-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:dino-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output" 

The scripts will produce 3D point clouds, segmented images, labels, and prompt examples for a test image.

Run a Pipeline on Your Images

The main pipeline uses Docker Compose to process a directory of images into a VQA dataset including spatial relations between objects. The dataset follows conventions for training models like LLaVA. We recommend using an A10 GPU or larger for processing.

Make sure to update .env with the full path to your image directory and output directory. Then launch the pipeline with:

cd /path/to/VQASynth
docker compose -f pipelines/spatialvqa.yaml up --build

In your designated output directory, you'll find a json file processed_dataset.json containing the formatted dataset.

Here are some examples:

sample_1 sample_2 sample_3
Does the red forklift in warehouse appear on the left side of the brown cardboard boxes stacked? How close is the man in red hat walking from the wooden pallet with boxes? Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor?
Incorrect, the red forklift in warehouse is not on the left side of the brown cardboard boxes stacked. The man in red hat walking is 60.13 centimeters from the wooden pallet with boxes. Indeed, the man in blue shirt working is taller compared to the wooden pallet with boxes on floor.

Here's a sample of warehouse images captioned with spatial relationships similar to the table above.

wget https://remyx.ai/assets/vqasynth/vqasynth_warehouse_spaces.zip

# Data is formatted for LLaVA fine-tuning
unzip vqasynth_warehouse_spaces.zip 

Once completed, you can follow this resource on fine-tuning LLaVa.

Models

Check out our LLaVA 1.5 LoRA SpaceLLaVA and MobileVLM-based SpaceLLaVA-lite

Try SpaceLLaVA in Discord

image

Notebooks

We've hosted some notebooks visualizing and experimenting with the techniques included in this repo.

Notebook Description Launch
Spatial Reasoning with Point Clouds Visualize point clouds and evaluate spatial relationships Open In Colab

References

This project was inspired by or utilizes concepts discussed in the following research paper(s):

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

vqasynth's People

Contributors

smellslikeml avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

vqasynth's Issues

Inquiry about jupyter notebook Spatial Reasoning with Point Clouds

Hello, thank you for sharing your fantastic work!

I am looking into the jupyter notebookSpatial Reasoning with Point Clouds, and I have a inquiry.

image
image

The Heatmap's quality is not good but makes sense, however the sampled coordinates makes no sense.

Do you know why this happens? Is this may be a discrepancy between original image and reshaped image?

Thank you in advance

can you be so kind to release the dataset?

Thank for your remarkable job for VLLM community! but i wonder if you can release the data that you collected for spaceLLaMa training? collecting by myself might be way much more expensive than i can afford, thank you!

inference shell file

It is really surprising that the ckpt is accessible in hugging face. Can you provide a python file for inference?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.