mbzuai-oryx / groundinglmm

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Home Page: https://grounding-anything.com

Python 99.60% Shell 0.40%
foundation-models lmm vision-and-language vision-language-model llm-agent

groundinglmm's Introduction

GLaMM: Pixel Grounding Large Multimodal Model [CVPR 2024]

Oryx Video-ChatGPT

Mohamed bin Zayed University of AI, Australian National University, Aalto University, Carnegie Mellon University, University of California - Merced, Linköping University, Google Research

Paper | Dataset | Demo | Website | Video


📢 Latest Updates

  • Mar-21-24: We're excited to announce the release of the GranD dataset and the GranD automated annotation pipeline. 🔥
  • Feb-27-24: We're thrilled to share that GLaMM has been accepted to CVPR 2024! 🎊
  • Dec-27-23: GLaMM training and evaluation code, pretrained checkpoints, and the GranD-f dataset are released (click for details). 🔥🔥
  • Nov-29-23: The GLaMM online interactive demo is released (demo link). 🔥
  • Nov-07-23: The GLaMM paper is released (arxiv link). 🌟
  • 🌟 Featured: GLaMM is highlighted at the top of AK's Daily Papers page on HuggingFace! 🌟

GLaMM Overview

Grounding Large Multimodal Model (GLaMM) is an end-to-end trained LMM that provides visual grounding capabilities with the flexibility to process both image and region inputs. This enables the new unified task of Grounded Conversation Generation, which combines phrase grounding, referring expression segmentation, and vision-language conversation. Equipped with detailed region understanding, pixel-level grounding, and conversational abilities, GLaMM can interact with visual inputs provided by the user at multiple levels of granularity.


🏆 Contributions

  • GLaMM Introduction. We present the Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

  • Novel Task & Evaluation. We propose a new task of Grounded Conversation Generation (GCG). We also introduce a comprehensive evaluation protocol for this task.

  • GranD Dataset Creation. We create the GranD - Grounding-anything Dataset, a large-scale densely annotated dataset with 7.5M unique concepts grounded in 810M regions.


🚀 Dive Deeper: Inside GLaMM's Training and Evaluation

Delve into the core of GLaMM with our detailed guides on the model's Training and Evaluation methodologies.

  • Installation: Provides a guide to setting up the conda environment for running GLaMM training, evaluation, and the demo.

  • Datasets: Provides detailed instructions to download and arrange datasets required for training and evaluation.

  • GranD: Provides detailed instructions to download the GranD dataset and run the automated annotation pipeline.

  • Model Zoo: Provides downloadable links to all pretrained GLaMM checkpoints.

  • Training: Provides instructions on how to train the GLaMM model for its various capabilities including Grounded Conversation Generation (GCG), Region-level captioning, and Referring Expression Segmentation.

  • Evaluation: Outlines the procedures for evaluating the GLaMM model using pretrained checkpoints, covering Grounded Conversation Generation (GCG), Region-level captioning, and Referring Expression Segmentation, as reported in our paper.

  • Demo: Guides you through setting up a local demo to showcase GLaMM's functionalities.

👁️💬 GLaMM: Grounding Large Multimodal Model

The components of GLaMM are cohesively designed to handle both textual and optional visual prompts (image level and region of interest), allowing for interaction at multiple levels of granularity, and generating grounded text responses.

GLaMM Architectural Overview
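To make these input modes concrete, below is a small, purely illustrative sketch of the three prompt granularities described above; the class and field names are hypothetical and are not part of this repository's API.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class GroundedPrompt:  # hypothetical container, for illustration only
    text: str                                         # textual instruction or question
    image_path: Optional[str] = None                  # optional image-level visual prompt
    region_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # optional regions of interest (x1, y1, x2, y2)

# The three granularities mentioned above:
text_only = GroundedPrompt(text="What does grounded conversation generation mean?")
image_level = GroundedPrompt(text="Describe the image.", image_path="balloon.jpg")
region_level = GroundedPrompt(text="Can you describe this region?", image_path="balloon.jpg",
                              region_boxes=[(40, 60, 220, 300)])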


🔍 Grounding-anything Dataset (GranD)

The Grounding-anything Dataset (GranD) is a large-scale dataset with an automated annotation pipeline for detailed region-level understanding and segmentation masks. GranD comprises 7.5M unique concepts anchored in a total of 810M regions, each with a segmentation mask.

Dataset Annotation Pipeline


Below we present some examples of the GranD dataset.

GranD Dataset Sample

GranD Dataset Sample


📚 Building GranD-f for Grounded Conversation Generation

The GranD-f dataset is designed for the GCG task; it contains about 214K image-grounded text pairs that provide higher-quality data for the fine-tuning stage.

GranD-f Dataset Sample


🤖 Grounded Conversation Generation (GCG)

Introducing GCG, a task to create image-level captions tied to segmentation masks, enhancing the model's visual grounding in natural language captioning.
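The grounded output interleaves noun phrases with segmentation tokens, for example "The image shows a <p> hot air balloon </p> [SEG] flying over a <p> river </p> [SEG] ." (see the demo transcript quoted in the issues further down this page). As a minimal illustration only, not the repository's parsing code, such an output can be split into phrase/mask-slot pairs like this:

import re

def parse_grounded_caption(text):
    # Pair each "<p> phrase </p>" span with the [SEG] mask slot that follows it.
    pattern = re.compile(r"<p>\s*(.*?)\s*</p>\s*\[SEG\]")
    return list(enumerate(pattern.findall(text)))

caption = ("The image shows a <p> hot air balloon </p> [SEG] flying over a "
           "<p> river </p> [SEG] . The <p> sky </p> [SEG] is visible over the river.")
print(parse_grounded_caption(caption))
# [(0, 'hot air balloon'), (1, 'river'), (2, 'sky')]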

Results_GCG

GCG_Table


🚀 Downstream Applications

🎯 Referring Expression Segmentation

Our model excels in creating segmentation masks from text-based referring expressions.

Results_RefSeg

Table_RefSeg
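The referring-segmentation results above (and the evaluation logs quoted in the issues below) are reported with IoU-based metrics such as gIoU and cIoU. As a generic reminder of the underlying quantity, not the repository's exact evaluation code, per-image mask IoU can be computed as:

import numpy as np

def mask_iou(pred, gt):
    # Intersection-over-union between two binary masks of identical shape.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return np.logical_and(pred, gt).sum() / union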


🖼️ Region-Level Captioning

GLaMM generates detailed region-specific captions and answers reasoning-based visual questions.

Results_RegionCap

Table_RegionCap


📷 Image Captioning

Comparing favorably to specialized models, GLaMM provides high-quality image captioning.

Results_Cap


💬 Conversational Style Question Answering

GLaMM demonstrates its prowess in detailed, region-specific, and grounded conversations, highlighting its adaptability in intricate visual-language interactions while robustly retaining the reasoning capabilities inherent to LLMs.

Results_Conv


Results_Conv


📜 Citation

  @article{hanoona2023GLaMM,
          title={GLaMM: Pixel Grounding Large Multimodal Model},
          author={Rasheed, Hanoona and Maaz, Muhammad and Shaji, Sahal and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Xing, Eric and Yang, Ming-Hsuan and Khan, Fahad S.},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }

🙏 Acknowledgement

We are thankful to LLaVA, GPT4ROI, and LISA for releasing their models and code as open-source contributions.


groundinglmm's People

Contributors

hanoonar · mmaaz60 · sahalshajim


groundinglmm's Issues

3D implementation of GLaMM

Hi!

I have been experimenting with your model for quite some time now, specifically on medical imaging data.

I am currently working on looking into possibilities of extending your architecture such that it would be able to encode sequences of images and decode these accordingly to obtain 3D segmentations.

I was curious if you maybe have a take on how to tackle this. It would greatly help me, as I am doing my master's thesis on LMMs in medical imaging with your model as the main focus of interest! :)

Thank you in advance,
Rachel

mmcv version

Hi,

Currently the code, in particular mmdet, supports mmcv versions only up to 1.5.0. I tried versions from 1.4.7 (which you suggest) up to 1.5.0, but since I'm on AMD GPUs, mmcv won't install due to bugs in its ROCm support. Later versions address the issue; I successfully installed mmcv 2.1.0 with ROCm support on AMD GPUs. However, due to the current limitation of mmdet accepting only mmcv versions up to 1.5.0, I cannot properly use the code.

Is there any chance that you could update the code so that mmdet can support mmcv 2.1.0? Thank you.

Some bugs in the GranD_ReferringSegm_ds.py

Hello, I have found what may be several bugs in the GranD_ReferringSegm_ds.py file, such as:

  • an undefined argument max_gt_per_img in GrandReferSegmDataset
  • an incorrect implementation of the create_conversations method
  • some typos, e.g. data_masks = data_item['maks'], which should be data_masks = data_item['masks']

I would greatly appreciate it if you could address and rectify these concerns, followed by thorough testing of the code at your earliest convenience. Thank you!

Question about the seg-token mask computation

Hi Authors,
Thanks for the code and datasets!
I had a question about this line: mask = input_ids[:, 1:] == self.seg_token_idx. Why do we skip the first token? Shouldn't the output hidden states have a one-to-one mapping with the input_ids?

For region level captioning, does the model support multi-region inputs?

When I read the paper, I found that the model can handle multiple regions as input. Does this mean that, given an image and multiple boxes, the model can generate all the region descriptions at once? Looking at the code, however, it seems the model can only caption one box (region) at a time. If I need captions for all the regions in an image, does that mean I have to run inference several times? Looking forward to your reply.
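(A heavily hedged workaround sketch for readers with the same need, under the reading that the released code captions one box per forward pass: simply loop over the boxes. caption_single_region below is a hypothetical placeholder for whatever single-box inference call you already use; it is not an API of this repository.)

def caption_all_regions(model, image, boxes, caption_single_region):
    # caption_single_region(model, image, box) -> str is a hypothetical helper
    # standing in for the repository's one-box-at-a-time region captioning.
    return [caption_single_region(model, image, box) for box in boxes]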

about create_seg_token_mask

Hi! Thanks for your great work. I have a question about the following code:

def _create_seg_token_mask(self, input_ids):
    mask = input_ids[:, 1:] == self.seg_token_idx
    return torch.cat(
        [torch.zeros((mask.shape[0], 575)).bool().cuda(), mask, torch.zeros((mask.shape[0], 1)).bool().cuda()],
        dim=1
    )

Can you explain the meaning of the number 575, and why zero vectors are concatenated to the left and right of the mask? Thanks in advance for your answer!
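(A hedged reading, not an authoritative answer: the mask is computed over input_ids[:, 1:] and then zero-padded so that it lines up with the model's output sequence after the image features are spliced in; 575 would then correspond to the extra visual tokens, e.g. 24 x 24 = 576 CLIP patch embeddings replacing the single image placeholder token. The toy example below only demonstrates the resulting shapes; the interpretation of 575 is an assumption, and .cuda() is dropped so it runs on CPU.)

import torch

def create_seg_token_mask(input_ids, seg_token_idx, num_extra_tokens=575):
    # num_extra_tokens=575 is assumed to account for the image-patch embeddings
    # inserted into the sequence (576 patches minus the one placeholder token).
    mask = input_ids[:, 1:] == seg_token_idx
    return torch.cat(
        [torch.zeros((mask.shape[0], num_extra_tokens)).bool(), mask,
         torch.zeros((mask.shape[0], 1)).bool()],
        dim=1,
    )

input_ids = torch.tensor([[1, 5, 9, 7, 9, 2]])          # toy sequence with seg_token_idx = 9
print(create_seg_token_mask(input_ids, seg_token_idx=9).shape)
# torch.Size([1, 581])  ==  575 + (6 - 1) + 1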

FlashAttention

Hi,

Thank you for the codebase and the models! I notice that flash attention is one of the dependencies of the project. Since I'm working on AMD GPUs and installing flash attention with ROCm support is currently rather challenging, I was wondering whether the code uses flash attention only for training or also for inference. If it is used only for training, I might skip installing it, as I want to use GLaMM mostly for inference. I'm asking because another GitHub repo I've been looking at suggests installing flash attention only if training is required. Thank you!

Demo issue

image
Hello dear authors, your work is great! I set out to reproduce your code as soon as it was released today.

I have been keeping an eye on your work, and while reproducing the code today I ran into a problem in the demo section; the screenshots show the error that occurred.

There have also been several issues in the code.

Could you please help answer this question?
image

An error is reported when running eval

Hello, thank you very much for your contribution. I encountered an error while evaluating the GCG task, and the computed evaluation results also show errors.
Uploading 20240327192849.png…

GLaMM-FullScope model generates only a single mask

Hi @hanoonaR
Congrats on the CVPR acceptance. Great work, thank you for sharing the code and the model weights.

I have a couple of questions.

--------------------------------------------- Q1 --------------------------------------------------------
I was trying to reproduce the results using the balloon.jpg image available in the repo with the prompt "Describe the image. Please output interleaved segmentation mask." However, the network does not seem to generate multiple masks, despite the generated text being "The image shows a <p> hot air balloon </p> [SEG] flying over a <p> river </p> [SEG] . The <p> sky </p> [SEG] is visible over the river."

I went a step further to check whether the issue is on my side. Below are the generated_output_ids:

[  319, 13563,  1546,   263, 12758,  5199,   322,   385, 23116, 21082,
         20255, 29889,   450, 20255,  4076,  8444, 29892, 13173, 29892,   322,
          1248,   568,  6089,   304,   278,  5199, 29915, 29879,  5155, 29889,
          3148,  1001, 29901,   450, 32000,  -200, 29871, 32001, 16123,  2247,
           385,   975,  1493,   310,   278,  7623, 29889,    13,  4002, 29581,
           278,  1967, 29889,  3529,  1962,  1006,   280, 10511, 10768,   362,
         11105, 29889,   319,  1799,  9047, 13566, 29901,   450,  1967,  3697,
           263, 32005,  7375,  4799,  6411,   417,   265, 32006, 32004, 22764,
           975,   263, 32005,  8580, 32006, 32004,   869,   450, 32005, 14744,
         32006, 32004,   338,  7962,   975,   278,  8580, 29889,     2]

As you can see, id 29871 (seg_token_idx) is generated only once. I am not sure if I am missing something in my attempts to reproduce the results, and I would appreciate your educated guess as to what I might be doing wrong.

--------------------------------------------- Q2 --------------------------------------------------------
Another interesting property I observed: when I run tokenizer("[SEG]").input_ids, the output indices are [1, 29871, 32004], whereas tokenizer("a [SEG]").input_ids returns [1, 263, 32004]. As you can notice, the tokenizer outputs id 29871 (seg_token_idx) in the first case. Is this expected? I am curious to understand the intuition behind this.

Thank you, I appreciate any time you can spend to help with my questions.

Regards,
Pradyumna.

local llm interface for glamm

Description: GLaMM performs very well on semantic segmentation. I want to introduce GLaMM into my multi-agent workflow to solve a sub-task. My multi-agent system is built with the AutoGen framework.

Request: In AutoGen, we usually provide a local URL via LiteLLM so that AutoGen can call other LLM models (such as models served by Ollama), like:
litellm --model ollama/llama2
Is there a similar way to serve GLaMM? Thanks!

Issue with ngrok Error (ERR_NGROK_8012) on GLaMM Demo Page

Hello,

I am encountering an issue while trying to access the GLaMM Demo Page. The error message I received is as follows:
ngrok Error (ERR_NGROK_8012)

I tried refreshing the page as suggested in the error message, but the issue persists.

Thank you for your assistance.

Best regards,
w228h

A bug in region captioning evaluation scripts

Hi, thanks for your great work! I just noticed there might be a bug in eval/region_captioning/evaluate.py.

Specifically, when loading generated results from a collection of result files, it uses

for result_file in os.listdir(args.results_dir):
    all_results = json.load(open(f"{args.results_dir}/{result_file}", "r"))
merged_file_path = f"{args.results_dir}/merged.json"

In the end, only the results in the last result file are loaded into all_results, and the model is essentially evaluated on a subset of the test set when multiple GPUs are used for inference.
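A minimal sketch of one possible fix, assuming each per-GPU result file contains a JSON list, is to accumulate across files rather than overwrite:

all_results = []
for result_file in os.listdir(args.results_dir):
    if result_file == "merged.json":
        continue                           # skip a previously merged file, if present
    with open(f"{args.results_dir}/{result_file}", "r") as f:
        all_results.extend(json.load(f))   # accumulate instead of overwriting
merged_file_path = f"{args.results_dir}/merged.json"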

Inference speed

Hi,

I can now successfully run the code on AMD GPUs, but I've noticed that the inference speed is very low. Could this be because I have not installed flash attention (due to the complexity of compiling it for AMD), or am I missing something else?

Question about Output Quality Difference Between Local and Online Demo for MBZUAI/GLaMM-FullScope

Hello,

I've successfully run the demo locally and managed to obtain output results. However, I've noticed that the quality of the output significantly differs from what is showcased in the online demo, with the local results being notably inferior. I'm currently using the MBZUAI/GLaMM-FullScope for my tests. Could you please shed some light on why there might be such a discrepancy between the two?

Thank you for your assistance.

Data Annotation Pipeline

I was wondering if it would be possible to make the execution script for the automated annotation pipeline publicly available. I have reviewed the dataset definitions in groundingLMM/dataset, but I am uncertain about the process for generating annotations. Any guidance or access to the execution script would be greatly appreciated.

the demo caption is very simple

The demo caption is very simple, not like the detailed one in the paper. Did you limit the maximum output length?

The caption result is quite simple:
image
image

Running GranD Automated Annotation pipeline from scratch

@hanoonaR and @mmaaz60, I wish to run the automated annotation pipeline from scratch. As mentioned in #35, I tried the command
conda create --name grand_env_1 --file requirements_grand_env_1.txt
and I get the error:

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - async-timeout==4.0.2=pypi_0
  - terminaltables==3.1.10=pypi_0
  - ipython==8.14.0=pypi_0
  - pytz==2023.3=pypi_0
  - groundingdino==0.1.0=dev_0
  - openai-whisper==20230314=pypi_0
  - async-lru==2.0.3=pypi_0
  - jupyter-events==0.6.3=pypi_0
  - chardet==5.2.0=pypi_0
  - codecov==2.1.13=pypi_0
  - aiosignal==1.3.1=pypi_0
  - numpy==1.24.3=pypi_0
  - peft==0.3.0=pypi_0
  - fastapi==0.100.0=pypi_0
  - aliyun-python-sdk-kms==2.16.1=pypi_0
  - awq==0.1.0=pypi_0
  - mmcv-full==1.5.0=dev_0
  - multiscaledeformableattention==1.0=pypi_0
  - pycocotools==2.0.6=pypi_0
  - multiprocess==0.70.15=pypi_0
  - importlib-resources==6.0.0=pypi_0
  - pybind11==2.11.1=pypi_0
  - scipy==1.11.1=pypi_0
  - typepy==1.3.1=pypi_0
  - isort==4.3.21=pypi_0
  - mmdet==2.25.3=dev_0
  - onnxruntime==1.15.1=pypi_0
  - exceptiongroup==1.1.2=pypi_0
  - torchvision==0.15.2+cu117=pypi_0
  - supervision==0.11.1=pypi_0
  - nbconvert==7.7.2=pypi_0
  - httpcore==0.17.3=pypi_0
  - jupyter-console==6.6.3=pypi_0
  - jupyter-server-terminals==0.4.4=pypi_0
  - cupy-cuda117==10.6.0=pypi_0
  - qtconsole==5.4.3=pypi_0
  - quant-cuda==0.0.0=pypi_0
  - contourpy==1.1.0=pypi_0
  - yarl==1.9.2=pypi_0
  - setproctitle==1.3.2=pypi_0
  - pathtools==0.1.2=pypi_0
  - oss2==2.17.0=pypi_0
  - deepdiff==6.3.1=pypi_0
  - comm==0.1.3=pypi_0
  - coverage==7.3.0=pypi_0
  - imageio==2.31.1=pypi_0
  - cymem==2.0.7=pypi_0
  - json5==0.9.14=pypi_0
  - jupyter-client==8.3.0=pypi_0
  - keras==2.13.1=pypi_0
  - markdown-it-py==2.2.0=pypi_0
  - einops-exts==0.0.4=pypi_0
  - outdated==0.2.2=pypi_0
  - markupsafe==2.1.3=pypi_0
  - widgetsnbextension==4.0.8=pypi_0
  - pyarrow==12.0.1=pypi_0
  - addict==2.4.0=pypi_0
  - flatbuffers==23.5.26=pypi_0
  - platformdirs==3.10.0=pypi_0
  - prompt-toolkit==3.0.39=pypi_0
  - shortuuid==1.0.11=pypi_0
  - openxlab==0.0.15=pypi_0
  - bleach==6.0.0=pypi_0
  - pyproject-api==1.5.4=pypi_0
  - smmap==5.0.0=pypi_0
  - munkres==1.1.4=pypi_0
  - pyflakes==2.1.1=pypi_0
  - etils==1.3.0=pypi_0
  - anyio==3.7.1=pypi_0
  - dassl==0.6.3=dev_0
  - huggingface-hub==0.16.4=pypi_0
  - thinc==8.1.10=pypi_0
  - typer==0.9.0=pypi_0
  - httpx==0.24.0=pypi_0
  - zstandard==0.21.0=pypi_0
  - nh3==0.2.14=pypi_0
  - jupyterlab-widgets==3.0.8=pypi_0
  - timm==0.5.4=pypi_0
  - accelerate==0.21.0=pypi_0
  - tensorflow-metadata==1.13.1=pypi_0
  - nltk==3.8.1=pypi_0
  - pyparsing==3.0.9=pypi_0
  - texttable==1.6.7=pypi_0
  - openmim==0.3.9=pypi_0
  - opencv-python==4.8.0.74=pypi_0
  - six==1.16.0=pypi_0
  - spacy-alignments==0.9.0=pypi_0
  - spacy==3.6.0=pypi_0
  - spacy-loggers==1.0.4=pypi_0
  - langcodes==3.3.0=pypi_0
  - safetensors==0.3.1=pypi_0
  - wavedrom==2.0.3.post3=pypi_0
  - terminado==0.17.1=pypi_0
  - pure-eval==0.2.2=pypi_0
  - argon2-cffi==21.3.0=pypi_0
  - ninja==1.11.1=pypi_0
  - pycountry==22.3.5=pypi_0
  - overrides==7.3.1=pypi_0
  - hjson==3.1.0=pypi_0
  - nvidia-cuda-cupti-cu11==11.7.101=pypi_0
  - uvicorn==0.23.1=pypi_0
  - virtualenv==20.24.3=pypi_0
  - python-multipart==0.0.6=pypi_0
  - arrow==1.2.3=pypi_0
  - wcwidth==0.2.6=pypi_0
  - typing-inspect==0.9.0=pypi_0
  - trax==1.4.1=pypi_0
  - gdown==4.7.1=pypi_0
  - websockets==11.0.3=pypi_0
  - nbformat==5.9.1=pypi_0
  - onnx==1.14.0=pypi_0
  - astunparse==1.6.3=pypi_0
  - datasets==2.14.4=pypi_0
  - en-core-web-md==3.6.0=pypi_0
  - decorator==5.1.1=pypi_0
  - llava==1.0.0=pypi_0
  - tensorflow==2.13.0=pypi_0
  - pyre-extensions==0.0.29=pypi_0
  - tensorflow-hub==0.14.0=pypi_0
  - xtcocotools==1.13=pypi_0
  - nvidia-cuda-nvrtc-cu11==11.7.99=pypi_0
  - networkx==3.1=pypi_0
  - absl-py==1.4.0=pypi_0
  - kornia==0.6.4=pypi_0
  - gradio-client==0.2.10=pypi_0
  - pycryptodome==3.18.0=pypi_0
  - crcmod==1.7=pypi_0
  - scikit-learn==1.2.2=pypi_0
  - beautifulsoup4==4.12.2=pypi_0
  - toolz==0.12.0=pypi_0
  - dm-tree==0.1.8=pypi_0
  - pluggy==1.2.0=pypi_0
  - starlette==0.27.0=pypi_0
  - lit==16.0.6=pypi_0
  - debugpy==1.6.7=pypi_0
  - srsly==2.4.7=pypi_0
  - tcolorpy==0.1.3=pypi_0
  - en-core-web-trf==3.6.1=pypi_0
  - fsspec==2023.6.0=pypi_0
  - mmpose==0.24.0=dev_0
  - nvidia-nccl-cu11==2.14.3=pypi_0
  - flake8==3.7.9=pypi_0
  - jupyter==1.0.0=pypi_0
  - pycocoevalcap==1.2=pypi_0
  - torch==2.0.1+cu117=pypi_0
  - appdirs==1.4.4=pypi_0
  - click==8.1.6=pypi_0
  - libclang==16.0.6=pypi_0
  - attributedict==0.3.0=pypi_0
  - kiwisolver==1.4.4=pypi_0
  - pycodestyle==2.5.0=pypi_0
  - fschat==0.2.24=pypi_0
  - ipywidgets==8.0.7=pypi_0
  - requests==2.28.2=pypi_0
  - vllm==0.1.3=pypi_0
  - rouge-score==0.1.2=pypi_0
  - opencv-python-headless==4.8.0.74=pypi_0
  - jupyter-server==2.7.0=pypi_0
  - chumpy==0.70=pypi_0
  - littleutils==0.2.2=pypi_0
  - fastrlock==0.8.2=pypi_0
  - argon2-cffi-bindings==21.2.0=pypi_0
  - rfc3986-validator==0.1.1=pypi_0
  - ffmpy==0.3.1=pypi_0
  - numexpr==2.8.5=pypi_0
  - protobuf==4.23.4=pypi_0
  - defusedxml==0.7.1=pypi_0
  - preshed==3.0.8=pypi_0
  - blessings==1.7=pypi_0
  - pydantic==1.10.11=pypi_0
  - nvidia-curand-cu11==10.2.10.91=pypi_0
  - tqdm-multiprocess==0.0.11=pypi_0
  - triton==2.0.0=pypi_0
  - ml-dtypes==0.2.0=pypi_0
  - orjson==3.9.2=pypi_0
  - threadpoolctl==3.2.0=pypi_0
  - nvidia-nvtx-cu11==11.7.91=pypi_0
  - wandb==0.15.5=pypi_0
  - rouge==1.0.1=pypi_0
  - markdown2==2.4.9=pypi_0
  - pyyaml==6.0=pypi_0
  - jsonschema==4.18.4=pypi_0
  - certifi==2023.5.7=pypi_0
  - google-pasta==0.2.0=pypi_0
  - matplotlib-inline==0.1.6=pypi_0
  - detectron2==0.6=dev_0
  - h11==0.14.0=pypi_0
  - pandocfilters==1.5.0=pypi_0
  - gast==0.4.0=pypi_0
  - webencodings==0.5.1=pypi_0
  - matplotlib==3.7.2=pypi_0
  - nvidia-cufft-cu11==10.9.0.58=pypi_0
  - sentencepiece==0.1.99=pypi_0
  - sacrebleu==1.5.0=pypi_0
  - funcsigs==1.0.2=pypi_0
  - backcall==0.2.0=pypi_0
  - nvidia-cudnn-cu11==8.5.0.96=pypi_0
  - spacy-transformers==1.2.5=pypi_0
  - sqlitedict==2.1.0=pypi_0
  - googleapis-common-protos==1.59.1=pypi_0
  - jinja2==3.1.2=pypi_0
  - jax==0.4.13=pypi_0
  - docker-pycreds==0.4.0=pypi_0
  - python-json-logger==2.0.7=pypi_0
  - fire==0.5.0=pypi_0
  - nvidia-cuda-runtime-cu11==11.7.99=pypi_0
  - semantic-version==2.10.0=pypi_0
  - promise==2.3=pypi_0
  - referencing==0.30.0=pypi_0
  - uri-template==1.3.0=pypi_0
  - asttokens==2.2.1=pypi_0
  - importlib-metadata==6.8.0=pypi_0
  - gitpython==3.1.32=pypi_0
  - fonttools==4.41.0=pypi_0
  - ipython-genutils==0.2.0=pypi_0
  - tifffile==2023.8.12=pypi_0
  - aiohttp==3.8.4=pypi_0
  - sentry-sdk==1.28.1=pypi_0
  - uc-micro-py==1.0.2=pypi_0
  - stack-data==0.6.2=pypi_0
  - transformers==4.33.2=pypi_0
  - nvidia-cusolver-cu11==11.4.0.1=pypi_0
  - cmake==3.26.4=pypi_0
  - regex==2023.6.3=pypi_0
  - enchant==0.0.1=pypi_0
  - nvidia-cusparse-cu11==11.7.4.91=pypi_0
  - tokenizers==0.13.3=pypi_0
  - gym==0.26.2=pypi_0
  - tzdata==2023.3=pypi_0
  - fairscale==0.4.4=pypi_0
  - mistune==3.0.1=pypi_0
  - cryptography==41.0.3=pypi_0
  - parso==0.8.3=pypi_0
  - gitdb==4.0.10=pypi_0
  - pillow==9.5.0=pypi_0
  - wrapt==1.15.0=pypi_0
  - rfc3339-validator==0.1.4=pypi_0
  - humanfriendly==10.0=pypi_0
  - prometheus-client==0.17.1=pypi_0
  - frozenlist==1.4.0=pypi_0
  - opt-einsum==3.3.0=pypi_0
  - pytablewriter==1.0.0=pypi_0
  - fastjsonschema==2.18.0=pypi_0
  - confection==0.1.0=pypi_0
  - dill==0.3.7=pypi_0
  - nbclient==0.8.0=pypi_0
  - pathy==0.10.2=pypi_0
  - mpmath==1.3.0=pypi_0
  - isoduration==20.11.0=pypi_0
  - psutil==5.9.5=pypi_0
  - en-core-web-sm==3.6.0=pypi_0
  - entrypoints==0.3=pypi_0
  - aliyun-python-sdk-core==2.13.36=pypi_0
  - jupyter-core==5.3.1=pypi_0
  - pyzmq==25.1.0=pypi_0
  - annotated-types==0.5.0=pypi_0
  - colour-runner==0.1.1=pypi_0
  - tiktoken==0.3.3=pypi_0
  - flash-attn==1.0.7=pypi_0
  - altair==5.0.1=pypi_0
  - ipykernel==6.24.0=pypi_0
  - segment-anything==1.0=dev_0
  - ray==2.6.3=pypi_0
  - ordered-set==4.1.0=pypi_0
  - scikit-image==0.21.0=pypi_0
  - yapf==0.40.1=pypi_0
  - sympy==1.12=pypi_0
  - notebook==7.0.0=pypi_0
  - tinycss2==1.2.1=pypi_0
  - cycler==0.11.0=pypi_0
  - lm-eval==0.3.0=pypi_0
  - jupyterlab==4.0.3=pypi_0
  - idna==3.4=pypi_0
  - lazy-loader==0.3=pypi_0
  - inspecta==0.1.3=pypi_0
  - lmdb==1.4.1=pypi_0
  - openai==0.27.8=pypi_0
  - send2trash==1.8.2=pypi_0
  - colorama==0.4.6=pypi_0
  - jedi==0.18.2=pypi_0
  - jaxlib==0.4.13=pypi_0
  - wilds==1.2.2=pypi_0
  - numba==0.57.1=pypi_0
  - py-cpuinfo==9.0.0=pypi_0
  - auto-gptq==0.4.1+cu117=pypi_0
  - catalogue==2.0.9=pypi_0
  - rpds-py==0.9.2=pypi_0
  - python-dateutil==2.8.2=pypi_0
  - multidict==6.0.4=pypi_0
  - tabledata==1.3.1=pypi_0
  - notebook-shim==0.2.3=pypi_0
  - pandas==2.0.3=pypi_0
  - webcolors==1.13=pypi_0
  - smart-open==6.3.0=pypi_0
  - pydub==0.25.1=pypi_0
  - pickleshare==0.7.5=pypi_0
  - coloredlogs==15.0.1=pypi_0
  - h5py==3.9.0=pypi_0
  - traitlets==5.9.0=pypi_0
  - mccabe==0.6.1=pypi_0
  - nvidia-cublas-cu11==11.10.3.66=pypi_0
  - shapely==2.0.1=pypi_0
  - linkify-it-py==2.0.2=pypi_0
  - xxhash==3.3.0=pypi_0
  - blis==0.7.10=pypi_0
  - opendatalab==0.0.10=pypi_0
  - jsonlines==3.1.0=pypi_0
  - json-tricks==3.17.2=pypi_0
  - qtpy==2.3.1=pypi_0
  - murmurhash==1.0.9=pypi_0
  - grpcio==1.56.0=pypi_0
  - svgwrite==1.4.3=pypi_0
  - zipp==3.16.2=pypi_0
  - aiofiles==23.1.0=pypi_0
  - pathvalidate==3.1.0=pypi_0
  - spacy-legacy==3.0.12=pypi_0
  - tensorflow-io-gcs-filesystem==0.32.0=pypi_0
  - gin-config==0.5.0=pypi_0
  - msgpack==1.0.5=pypi_0
  - ogb==1.3.6=pypi_0
  - awq-inference-engine==0.0.0=pypi_0
  - nest-asyncio==1.5.6=pypi_0
  - tensorflow-datasets==4.9.2=pypi_0
  - tomli==2.0.1=pypi_0
  - deepspeed==0.9.5=pypi_0
  - tb-nightly==2.15.0a20230816=pypi_0
  - jupyterlab-server==2.24.0=pypi_0
  - sacremoses==0.0.53=pypi_0
  - tensorflow-estimator==2.13.0=pypi_0
  - dataproperty==1.0.1=pypi_0
  - filelock==3.12.2=pypi_0
  - rootpath==0.1.1=pypi_0
  - jmespath==0.10.0=pypi_0
  - tensorflow-text==2.13.0=pypi_0
  - jupyterlab-pygments==0.2.2=pypi_0
  - pygments==2.15.1=pypi_0
  - soupsieve==2.4.1=pypi_0
  - gradio==3.35.2=pypi_0
  - pywavelets==1.4.1=pypi_0
  - termcolor==2.3.0=pypi_0
  - ftfy==6.1.1=pypi_0
  - charset-normalizer==3.2.0=pypi_0
  - llvmlite==0.40.1=pypi_0
  - gym-notices==0.0.8=pypi_0
  - pexpect==4.8.0=pypi_0
  - bitsandbytes==0.42.0=pypi_0
  - cython==0.29.36=pypi_0
  - mbstrdecoder==1.1.3=pypi_0
  - model-index==0.1.11=pypi_0
  - einops==0.6.1=pypi_0
  - jsonschema-specifications==2023.7.1=pypi_0
  - mdurl==0.1.2=pypi_0
  - xformers==0.0.20=pypi_0
  - tornado==6.3.2=pypi_0
  - babel==2.12.1=pypi_0
  - ptyprocess==0.7.0=pypi_0
  - pydantic-core==2.3.0=pypi_0
  - rich==13.4.2=pypi_0
  - packaging==23.1=pypi_0
  - mmengine==0.8.2=pypi_0
  - setuptools==60.2.0=pypi_0
  - tqdm==4.66.1=pypi_0
  - joblib==1.3.1=pypi_0
  - tox==4.9.0=pypi_0
  - distlib==0.3.7=pypi_0
  - executing==1.2.0=pypi_0
  - attrs==23.1.0=pypi_0
  - mdit-py-plugins==0.3.3=pypi_0
  - wasabi==1.1.2=pypi_0
  - sniffio==1.3.0=pypi_0
  - black==22.3.0=pypi_0
  - fqdn==1.5.1=pypi_0
  - more-itertools==9.1.0=pypi_0
  - typing-extensions==4.7.1=pypi_0
  - array-record==0.4.0=pypi_0
  - urllib3==2.0.3=pypi_0
  - jupyter-lsp==2.2.0=pypi_0

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I think it's a channels issue, as a normal conda environment yml file has a section defining the channels for the packages. I have also tried adding conda-forge as a channel via conda config --add channels conda-forge, but I still get the same error.

Your help in reproducing the environments for the annotation pipeline of the dataset would be much appreciated.
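For what it's worth, one hedged workaround (a sketch only; the output filename is made up): since nearly every entry carries a pypi_0 build string, the explicit conda spec can be converted into a plain pip requirements file and installed with pip inside a fresh environment. The exact spec format on disk may differ slightly from the error listing, so the parsing below may need adjusting.

# Convert conda "name=version=build" (or "name==version=build") lines into pip "name==version".
with open("requirements_grand_env_1.txt") as src, open("requirements_pip.txt", "w") as dst:
    for raw in src:
        line = raw.strip().lstrip("- ").strip()
        if not line or line.startswith("#"):
            continue
        parts = [p for p in line.split("=") if p]
        if len(parts) >= 3 and parts[-1].startswith("pypi"):   # keep pip-installed packages only
            dst.write(f"{parts[0]}=={parts[1]}\n")
# dev_0 entries (local editable installs such as mmcv-full, mmdet, groundingdino)
# are skipped here and need to be built from their own repositories.
# Afterwards: pip install -r requirements_pip.txt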

Internal error from sentencepiece

Hi,

I have successfully installed the offline demo as instructed here

While running the command :

 python app.py --version ./GLaMM-FullScope

I am getting the following error:

Traceback (most recent call last):
  File "/mnt/winD/ML/gitProjs/LLVM/groundingLMM/app.py", line 271, in <module>
    tokenizer = setup_tokenizer_and_special_tokens(args)
  File "/mnt/winD/ML/gitProjs/LLVM/groundingLMM/app.py", line 42, in setup_tokenizer_and_special_tokens
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 682, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1805, in from_pretrained
    return cls._from_pretrained(
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 71, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/srikrishna/Install/conda/conda/envs/glamm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Thanks in advance for your time.

For further information visit https://errors.pydantic.dev/2.5/v/missing

Dear authors, could you please answer my question? I completed the installation according to the environment you provided, but the errors below still appear when running the code. The chat model does not produce a corresponding output sentence, although the bounding box in this picture is drawn.
image
I went to the website given in the error message to check my local error, but it did not provide relevant information. Please help.
image

Phrase grounding model

Hi,

Are you planning to release the checkpoint of the phrase grounding model? Thank you!

Regards

Pretraining instructions

First of all, thank you very much for sharing your amazing work.

I might have missed it, but it looks like there are currently no instructions on how to pretrain a model from scratch. Are you planning to share this too?

Release of pre-training instructions?

Hi!

I have recently taken a great interest in your work! However I was wondering: are you planning on releasing the pre-training code/instructions as well? I would love to experiment with training the model from scratch!

Thanks,
Rachel

Question on reproducing the evaluation/demo performance from pretrained models

Hello, first of all thanks for the great work!

I've been trying to reproduce the evaluation and demo; however, I don't find them producing the same quality of results as shown in the official materials.

Environment

  • CUDA 12.1, PyTorch 2.1.2+cu121
  • A100 with 40G RAM
  • Ubuntu 20.04
  • Followed the installation doc, and am running on version ba4f2b6

Demo
I set up the gradio environment, ran it with python app.py --version='GLaMM-FullScope', and tried a few examples listed on the page, but the quality is bad (as shown below). No luck if I change 'GLaMM-FullScope' to other models.

glamm-issue-1

glamm-issue-2

Evaluation
I downloaded COCO train 2014 and refCOCO series, and executed bash eval/referring_seg/run_evaluation.sh 'MBZUAI/GLaMM-RefSeg' './results_refseg_finetuned'.

In my initial attempt at numerical evaluation, the code produced an error from the assert here. I looked into it and found that cur_len is always total_len + 2. Without knowing how to fix it, I had to comment it out in order to run the script.

Here are the results I have obtained (not finished on every test set, but the performance is obviously bad):

[{'model': './results_refseg_finetuned', 'dataset': 'refcoco|val', 'giou': '0.050665364', 'ciou': '0.1175718'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|val', 'giou': '0.049141906', 'ciou': '0.11084828'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|testA', 'giou': '0.06668462', 'ciou': '0.15553774'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco|testB', 'giou': '0.034645554', 'ciou': '0.094948955'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|val', 'giou': '0.0536244', 'ciou': '0.11712953'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|testA', 'giou': '0.075434625', 'ciou': '0.15804651'}, {'model': './results_refseg_finetuned', 'dataset': 'refcoco+|testB', 'giou': '0.04812723', 'ciou': '0.110266894'}]

Any help or clue to resolve the performance issue is appreciated, thanks!

GranD Detailed Operation Guide

Your work is of great academic value and significance, and I am very grateful for the contributions you have made. I would like to ask about the specific operational steps for running the GranD automated annotation pipeline. Thank you for taking the time out of your busy schedule to look at my question.

About region caption

The generated results only describe the content and do not answer the specified prompt.
1715157993410
result:
1552a25d96b1424058997d306117c77

How are the relationships formed using objects from Level-1?

As mentioned in the title, Section 4.2, "Relationships and Landmarks," raises some points that may cause confusion:

1) Could you clarify how relationships are established using objects from Level-1?
2) What was the rationale behind introducing the landmark category at this stage? Were there other considerations involved?

I also wonder about the relationships derived from the short captions generated by the LLM.

Undefined `self.base_dir` in `GranDfDataset.__init__`

Hello, it seems that in the following code, self.base_dir is not defined before it is used, which would raise an AttributeError.

class GranDfDataset(GCGBaseDataset):
    """
    Human annotated dataset proposed in GLaMM as part of GranDf dataset.
    """
    def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=8000, precision="fp32",
                 image_size=224, num_classes_per_sample=3, validation=False, random_sampling=True):
        json_path = "GranDf_HA_GCG_train.json"
        image_dir = os.path.join(self.base_dir, "GranDf_HA_images", "train")
        mode = "Val" if validation else "Train"
        super().__init__(
            dataset_dir, tokenizer, global_image_encoder, epoch_samples, precision, image_size, num_classes_per_sample,
            validation, random_sampling, image_dir, json_path, )
        print('\033[92m' + "----GCG-{}: GranDf-GCG dataset initialized----".format(mode) + '\033[0m')
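A minimal sketch of one possible fix, assuming base_dir is meant to come from the dataset_dir argument (the intended value may differ), would be to assign it before it is used:

# Inside GranDfDataset.__init__, before image_dir is computed:
self.base_dir = dataset_dir   # assumed fix: define base_dir before using it
json_path = "GranDf_HA_GCG_train.json"
image_dir = os.path.join(self.base_dir, "GranDf_HA_images", "train")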

Data release

Hi! Loved reading the paper. Is there a release date for the data that you've used for training?

About GranD Pre-training Dataset

Hello GLaMM Team,

Thank you very much for sharing this fascinating work!

It seems that you have been incrementally uploading the pre-training dataset, GranD, to https://huggingface.co/datasets/MBZUAI/GranD. Just a few clarification questions:

  1. Does the whole dataset use all 11M images from SA-1B?
  2. Any estimation when the upload will be completed?

Thanks,
Shengcao

Supplementary materials

Hi! Very nice and promising work!

Where can I download the supplementary materials? Thank you!

Easiest way to fine-tune on custom data?

Hello! Thank you for this great work! Is there a preferred way to fine-tune this model on custom data? I am specifically interested in fine-tuning for open-vocabulary segmentation and referring segmentation.

Thank you!

Grand-env

Hello respected friends, your environment file seems a bit odd, and I can't even use pip to install some of its contents. For example, ca-certificates=2023.05.30=h06a4308_ does not seem to be the correct format for an installable package.
2024-03-22 220258

Empty output when inferring on the example image.

I used the GLaMM-FullScope model to perform inference on a sample image and received a peculiar output. I've verified the versions of the relevant installed libraries, and they align with the specified requirements. How can I address this problem?
image

The training losses in the GCG task

Hello, could you please provide a detailed explanation of the training losses in the GCG task? It seems that the segmentation task and the text generation task are separated. Are there any specific losses that make the phrases in the image-level captions match their corresponding segmentation masks?
QQ screenshot 20240307114726

Code release

Hi! Very nice and promising work! When will the code be released? I am really looking forward to experimenting with your code.
