salesforce / lavis Goto Github PK

LAVIS - A One-stop Library for Language-Vision Intelligence

License: BSD 3-Clause "New" or "Revised" License

Python 25.60% Jupyter Notebook 74.31% Shell 0.08%

deep-learning deep-learning-library image-captioning salesforce vision-and-language vision-framework vision-language-pretraining vision-language-transformer visual-question-anwsering multimodal-datasets

lavis's People

Contributors

Stargazers

Watchers

Forkers

yangwenz kfzyqin gjtjx zxz2jj dfrntl bjmeo8 codeaudit yosshi-git fireae plyfager techthiyanes jaedukseo vitasoftai gary109 simplelifetime huang-xx micseb gullalc babajideowoyele mbrukman yiwenwanganu ashwanthkumard lyrl suryatmodulus thliang01 hadryan cthed2 lengocgiang anhphtu sean-riddle pawan2411 hongbo-sun adahirwar undercontroller freakynit sanyamlakhanpal sontro wcirq ltoscano yui010206 yangboz gavinljj m2man garinmckayl huaxz1986 liuzhaofeng123 animesh mcmartel biztrology-kd mmderakhshani xinhaomei nolanmangalmm pszemraj skytodmoon xiaoheng-zhang99 surajx discordianfish gschurck quickgrid anthonytmh hi-zhenyu dd-consulting sonhv3112 animebing keunwoochoi lelas213 qingfengting2017 chenglele dxli94 yk-ren johnnypeck charleoy kurianbenoy-sentient marqo-ai yetiansh altryne albin-jo tilakd samconsidine kaytony glorf zache-fi cr-gjx raven38 pldlgb guangsen-wang nagyist hegdekartik gg-big-org swapnildreams100 chenjun0210 bekyilma eltociear gyanachand1 omvishal1 gillevi caoandong jameelhamdan timdingman-scale natnaelino

lavis's Issues

Confidence of the BLIP captioning model

Hi,

Is there a way to access the confidence of the generated caption?

Blip feature extractor API

Thanks for the great work. I have some questions about the BLIP feature extractor interface.

In the example code, you wrote

# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks

What are the other channels [:, 1:12, :] useful for?

In the example code of the API, there is another attribute called image_features (link), but they are not available. Can you comment on the difference between image_embeds and image_features, and how to access the latter?

print(features_image.image_embeds.shape)
print(features_image.image_features.shape)

Thanks!

CODE to reproduce examples in LAVIS technical report

Hi,
I would be grateful if you could kindly provide the code to regenerate the other test cases provided in the technical report.

I'm really interested in to regenerate zero shot classification, text localisation, multimodal classification, etc.

Additional requirements for COCO caption finetuning

Hi, thank you for sharing a great library.

I want to fine-tune the model on COCO caption, which uses pycocoevalcap that depends on stanford-corenlp.

While running run_scripts/blip/train/train_caption_cocosh, it causes an error when the SPICE score is calculated, because of Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module.

I guess this error is due to its dependency on a specific version of Java. Which version of Java do you use in the development?

Thank you!

error with blip_classification, model_type = 'base'

Thank you LAVIS team for this wonderful repo!

It appears that loading blip_classification with model_type='base' throws an error stemming from the configuration files. Is this expected behavior? If so, perhaps a more descriptive error (or solution) could be provided with the assertion?

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from lavis.models import load_model_and_preprocess
model, processors, _ = load_model_and_preprocess(name="blip_classification", 
                                                 model_type="base", 
                                                 is_eval=True, device=device)

AssertionError: Invalid number of classes provided, found -1

Build issues with decord for Mac with Python 3.9+

Decord is listed as a dependency but there is no prebuilt binaries for Mac above Python 3.8 (see issues here). This means Mac users of LAVIS effectively have to stay below 3.8. Can we get rid of Decord and replace it with something that's more well-maintained?

Can't load gpt_dialogue

model, vis_processors, txt_processors = load_model_and_preprocess(name="gpt_dialogue", model_type="base", is_eval=True, device=device)

TypeError                                 Traceback (most recent call last)

[<ipython-input-48-216011dd9979>](https://localhost:8080/#) in <module>
----> 1 model, vis_processors, txt_processors = load_model_and_preprocess(name="gpt_dialogue", model_type="base", is_eval=True, device=device)

[/content/LAVIS/lavis/models/__init__.py](https://localhost:8080/#) in load_model_and_preprocess(name, model_type, is_eval, device)
    171 
    172     # load model
--> 173     model = model_cls.from_pretrained(model_type=model_type)
    174 
    175     if is_eval:

TypeError: from_pretrained() missing 1 required positional argument: 'pretrained_model_name_or_path'

Discrepancy with BLIP paper results when using PyTorch > 1.10

I was trying to reproduce results with BLIP on VQAv2 test-dev and I observed a non-negligible difference between the VQA accuracy obtained using the published checkpoint (77.41%) and the number reported in the paper (78.25%).

These are the steps I followed:

Clone this repo
Install dependencies with pip install .
Create a symlink cache/coco/images pointing to the local copy of the COCO images
Modify lavis/projects/blip/eval/vqav2_eval.yaml as follows:
Run python -m torch.distributed.run --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/vqav2_eval.yaml (note I only have 4 A100 GPUs available)
Submit the test_vqa_result.json file generated in lavis/output/BLIP/VQA/... to EvalAI

After some debugging, I narrowed it down to a discrepancy in PyTorch versions: I was using the latest version (1.13.0), while LAVIS fixes the version to 1.10.0. So there is some change between PyTorch 1.10 and PyTorch 1.13 which causes a performance degradation when loading a checkpoint trained on 1.10. After downgrading the PyTorch version to 1.10.0, I am able to achieve 78.24% VQA accuracy on VQAv2 test-dev, almost the same number reported in the paper.

Reproduce the result on okvqa

Hi, I use the scripts LAVIS/run_scripts/blip/train/train_okvqa.sh to finetune blip on okvqa. However, the result in epoch 7 is only 45.12, which has a large gap between the result 55.4 you proposed. Are there any hyperparameter that need to modify in LAVIS/lavis/projects/blip/train/okvqa_ft.yaml ?

It would be highly appreciated if you could give some help.

Using albef to do image tagging

Hi, i want to do image tagging using albef model. I have writen the code for that but the top one tags are way off. Could you help we regarding this
def calculate_pred(img):
raw_image = Image.open(img).convert("RGB")
image = vis_processors"eval".unsqueeze(0).to(device)
text = [[txt_processors"eval"] for x in occ_list] #occ_list is list of occupation and template = "photo of a "
sample = [{"image": image, "text_input": text_input} for text_input in text]
img_feats = model.extract_features(sample[0], mode="image").image_embeds_proj[:,0,:]
img_feats = img_feats/img_feats.norm()
text_feats = torch.cat([model.extract_features(sample[i], mode="text").text_embeds_proj[:,0,:] / model.extract_features(sample[i], mode="text").text_embeds_proj[:,0,:].norm() for i in range(len(sample))])
_, index = torch.max(img_feats @ text_feats.T, dim=1)
return(occ_list[index.tolist()[0]])

VQA evaluation doesn't print accuracy

I am trying to replicate the VQAv2 evaluation by running bash run_scripts/pnp-vqa/eval/eval_vqav2.sh. However, the scores aren't printed out. I only get:

...
result file saved to /home/LAVIS/lavis/output/PNP-VQA/VQAv2_val/20221124171/result/val_vqa_result.json
loading VQA annotations and questions into memory...
creating index...
index created!
Loading and preparing results...
DONE (t=0.42s)
creating index...
index created!
computing accuracy
Finshed Percent: [####################] 99% Done computing accuracy
NCCL INFO [Service thread] Connection closed by localRank 0
NCCL INFO comm 0x7fa08c008fb0 rank 0 nranks 8 cudaDev 0 busId e00000 - Abort COMPLETE

Is this expected behavior? If so, where do I find the final scores? Thanks!

Training Tracking Tools

Hi! I'm trying to the LAVIS repo to pretrain BLIP on my own dataset. I was wondering if there're any plans to integrate some type of tracking tools like wandb or tensorboard into the codebase? Or any suggestions on how I can easily set it up?

Thanks!

Default process group has not been initialized, please make sure to call init_process_group.

When I evaluate coco-caption task on a single GPU device, It has this error.

  def evaluation(self, model, data_loader, cuda_enabled=True):
      metric_logger = MetricLogger(delimiter="  ")
      header = "Evaluation"
      # TODO make it configurable
      print_freq = 10
      results = []

      for samples in metric_logger.log_every(data_loader, print_freq, header):
          samples = prepare_sample(samples, cuda_enabled=cuda_enabled)

          eval_output = self.valid_step(model=model, samples=samples)
          results.extend(eval_output)

      dist.barrier()

      return results

The error occurs on “dist.barrier()”

`albef_vqa.rank_answers` returns single answer

I've ran following piece of code

import torch
from lavis.models import load_model, load_model_and_preprocess
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load sample image
raw_image = Image.open("LAVIS/docs/_static/merlion.png").convert("RGB")

model, vis_processors, txt_processors = load_model_and_preprocess(
            name="albef_vqa", model_type="vqav2", is_eval=True, device=device
        )

question = "Which city is this photo taken?"

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)

samples = {"image": image, "text_input": question}

answer_list = ["Singapore", "London", "Palo Alto", "Tokyo"]
answers = model.rank_answers(samples, answer_list=answer_list, num_ans_candidates=3)

answers

and got this as an output

{'name': 'blip_image_train', 'image_size': 384}
{'name': 'blip_image_eval', 'image_size': 384}
{'name': 'blip_question'}
{'name': 'blip_question'}

['Singapore']

Is this the expected behavior? Or have I missed something about the inputs?
I was expecting 3 ordered answers or maybe even 3 ordered answers with their probabilities (would be really nice), but got the output from predict_answers

Adding dataset for VQA task

Hi, how can I add the visual7w dataset for the VQA task? The adding datasets documentation is for AVSD task and I'm not sure how to do similar steps but for a different task... My data has images, questions, multiple options and answers. Thanks.

DDP set device id

Hi, I am using LAVIS to train albef model recently, while I find a potential issue.

At here, we are applying torch DDP to the original model that requires device id, i.e., the rank of current process.

However, I think it is not the responsibility of config file to set rank for each process and it results in failure. I suggest using get_rank instead.

How to apply multimodal feature for multimodal classification?

Hi everyone,
First of all, thank you so much for this great package.

I would be so grateful if you could kindly provide me with an example code of how the multimodal feature can be used for multimodal classification, as you mentioned here.

Best,

sbu caption dataset format

sub.json is organized in the format:
[{'image': '4385058960_b0f291553e.jpg',
'caption': 'a wooden chair in the living room',
'url': 'http://static.flickr.com/2723/4385058960_b0f291553e.jpg'},
...}

but the downloaded sbu_images.rar is extracted as:
0000/ 0001/ 0002/ 0003/ ... 0999/
in each directory contains 1000 images named in order:
000.jpg 001.jpg 002.jpg ... 999.jpg

Therefore, the image storage path does not correspond to the path in json. @dxli94

fine-tuned result of aokvqa

I want to fine-tune BLIP on aokvqa, and I download your fine-tuned checkpoint and directly evaluated on the validation set, the result is 50.22. However, when I fine-tune the model myself, the result is 41.89. I didn't change the hyperparameter in the config. Could you provide the hyperparameter that you fine-tune the BLIP on aokvqa?

By the way, did you use other vqa datasets to continue pre-train BLIP before fine-tuning it on the aokvqa?

Question about blip vqa

hello

First of all, thank you for building such a wonderful library.

I have a question about VQA using BLIP.

It is about samples among the arguments of predict_answers in BLIP VQA model.

samples (dict): A dictionary containing the following keys:
- image (torch.Tensor): A tensor of shape (batch_size, 3, H, W). Default H=480, W=480.
- text_input (list): A list of strings, each string is a question

In the case of text_input, it is a list format, and I think I can put multiple questions.

However, when I put several questions in the list and ran it, I got a tensor size mismatch error.

I want to know if the VQA is running only one question per image.

thanks

annotation urls for CC, SBU

Hi,

I was trying to download the datasets, after running the download scripts for CC,
when loading CC, it failed, errors as below

when I run below code

from lavis.datasets.builders import load_dataset

In [3]: data = load_dataset('conceptual_caption_12m') `

errors as

File ~/anaconda3/envs/lavis/lib/python3.8/urllib/request.py:383, in Request._parse(self)
    381 self.type, rest = _splittype(self._full_url)
    382 if self.type is None:
--> 383     raise ValueError("unknown url type: %r" % self.full_url)
    384 self.host, self.selector = _splithost(rest)
    385 if self.host:

ValueError: unknown url type: '/export/home/workspace/datasets/cc12m.json'

same goes for 'sbu_caption', can not find url ''/export/share/dongxuli/data/lavis/sbu/annotation/sbu.json',

can you please help with obtaining these annotation json files?

Best

Loading ALBEF feature extractor

while loading ALBEF feature extractor using
model, vis_processors, text_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)
It is returning:-
reshape position embedding from 256 to 196
None
{'name': 'blip_image_eval', 'image_size': 224}
None
{'name': 'blip_caption'}
Is this BLIP model or ALBEF ?

ModuleNotFoundError: No module named 'app'

Hi Dongxu,
I have received the following error when running the above command:
bash run_scripts/run_demo.sh

ModuleNotFoundError: No module named 'app'
File "/usr/local/lib/python3.8/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line 562, in _run_script exec(code, module.__dict__)

File "/home/ermia/PycharmProjects/LAVIS/app/main.py", line 8, in <module> from app.multipage import MultiPage

I'm digging the code to find the reasons for the error.

updating the required torch version

Hi, thanks for the great stuff!
Is there any plan to update the torch version (from ==1.10 to anything newer), or relax it?

How to interpret NLVR model outputs and input labels

As I understand from looking at blip paper, NLVR takes pair of images, a sentence for them and predicts if the sentence describes the image pair.

I have used the following code to generate output from comparing two images and their text input with total 3 comparisons on minibatch.

During predict I have to pass labels in samples dict. Are the values of labels only 0, 1 for False, True, or something else?
Each image pair in minibatch outputs two values in predictions. How to interpret these output predictions values?

Code

model, vis_processors, text_processors = load_model_and_preprocess("blip_nlvr", "nlvr", device=device, is_eval=True)

samples = {
    "image0": torch.randn((3, 3, 384, 384), device=device),
    "image1": torch.randn((3, 3, 384, 384), device=device),
    "text_input": [
        "there is a car with yellow color",
        "there are cars in one of the images",
        "there are bikes in both images"
    ],
    "label": torch.tensor([0, 1, 1], device=device),
}

with torch.no_grad():
    output = model.predict(samples)

Output

{'predictions': tensor([[ 0.6208, -0.7106],
         [ 0.6987, -0.7888],
         [ 1.3222, -1.4706]], device='cuda:0'),
 'targets': tensor([0, 1, 1], device='cuda:0')}

Failed to download msrvtt

from lavis.datasets.builders import load_dataset
msrvtt_dataset = load_dataset("msrvtt_caption")

as picture

another way still failed

Downloading https://download1602.mediafire.com/jslug277m67g/x3rrbe4hwp04e6w/train_val_videos.zip to train
Failed to download or extracting datasets. Aborting.
Merging to C:\Users\jianwei\anaconda3\envs\thesis\lib\site-packages\lavis\..\cache\msrvtt\videos
Failed to merging datasets. Aborting.```

Enable tensorboard visualization

Hi, thank you for the great work!

I wonder if there is any plan to incorporate the tensorboard visualization. Also if there is any plan to combine pytorch_lightning?

BLIP generated caption length

Hi, thanks for the amazing work you did with the library!

I am currently trying to fine-tune BLIP on a custom dataset. I followed your tutorial on the custom dataset generation and set up all the necessary files for the fine-tuning, and everything works as expected.
The only problem I've encountered is with the maximum length of the generated captions. In my training configuration file this length is set to 256, but the model never generates captions which are longer than ~50 words (roughly 90 tokens in average).

I have already increased BERT embedding's size to 256, hard-coding it in this line:

LAVIS/lavis/models/blip_models/blip_caption.py

Line 51 in 6c6c981

self.max_txt_len = max_txt_len

and changed the default max_lengths to 256 here:

LAVIS/lavis/models/blip_models/blip_caption.py

Line 214 in 6c6c981

max_txt_len = cfg.get("max_txt_len", 40)

and here

LAVIS/lavis/models/blip_models/blip_caption.py

Line 141 in 6c6c981

max_length=30,

My training config file looks like this

model:
  arch: blip_caption

  model_type: base_coco
  load_finetuned: False

datasets:
  custom_caption: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
        eval:
          name: "blip_image_eval"
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
        eval:
          name: "blip_caption"

run:
  task: captioning
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 0
  weight_decay: 0.05
  max_epoch: 20
  batch_size_train: 2
  batch_size_eval: 8
  num_workers: 1

  max_len: 256
  min_len: 5
  num_beams: 3

  seed: 42
  output_dir: "output/BLIP/Caption_custom"

  amp: False
  resume_ckpt_path: null

  evaluate: False 
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True

I am training the model with 5000 samples. Do you have any suggestion on what could be wrong or missing in my fine tuning configuration? Should I use different parameters for the optimiser? Is generating captions of this length even achievable with BLIP?

Thanks!

What are the differences between `base_coco` and `large_coco` model types for `blip_caption` ?

Video Question Answering examples

Hi there,
Could you please provide an example as how we should run Video Question Answering task using LAVIS?

Any examples about other video-related tasks well be very appreciated.

Error when running with transformers 4.22.2

When I run the BLIP captioning model with transformers 4.22.2 I get the following error:

import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']

 The following `model_kwargs` are not used by the model: ['encoder_hidden_states', 'encoder_attention_mask'] (note: typos in the generate arguments will also show up in this list)

Relation between ITE and GradCAM

在3.1节，第2自然段说是ITE完成图像与问题的相似度计算，第3自然段又说GradCAM完成相似度计算，这两个模块是啥关系？
怎么理解这句话?To identify relevant image patches, we feed the image v and the question t to the ITE network and apply a variation of GradCAM.... 应用GradCAM在干什么？

English translations by @dxli94 :
"In Section, para. 2, you mention ITE measures the similarities between images and questions, in para.3 you mention use GradCam to measure similarity." What is the relation between ITE and GradCAM?

Second, how to understand "To identify relevant image patches, we feed the image v and the question t to the ITE network and apply a variation of GradCAM"? What is GradCAM doing here?"

urllib.error.HTTPError: HTTP Error 403: Forbidden

Hi everyone, @dxli94
When I am running the below code, I receive the corresponding error:

import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

The error is about loading the model from a path, which seems unreachable due to some unknown issue.

 raise HTTPError(req.full_url, code, msg, hdrs, fp)
          │         │   │         │     │    │     └ <http.client.HTTPResponse object at 0x7fa4f4114820>
          │         │   │         │     │    └ <http.client.HTTPMessage object at 0x7fa4f41149a0>
          │         │   │         │     └ 'Forbidden'
          │         │   │         └ 403
          │         │   └ <property object at 0x7fa58770f1d0>
          │         └ <urllib.request.Request object at 0x7fa4f413d790>
          └ <class 'urllib.error.HTTPError'>

Which is related to the following file.
https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth

Could you please kindly make the models available somewhere else that can easily be accessible?

features generated by albef_feature_extraction are low-quality.

I follow the demo and use the features generated by albef_feature_extraction to perform zero-shot cross-modal retrieval on MSCOCO. The t2i recall scores are extremely low while that of i2t score looks normal, and I don't know the answer. What's more, I found the cosine similarity even between the paired image and text is low (about 0.09).

dependencies for `run_demo.sh`

running following in an empty venv

git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .

gives

with scipy installed via pip install scipy I'm getting

scikit-image installation does the trick

Finetune CLIP on COCO and Flickr30K

Hello, thanks for your nice work! Are there scripts and configuration files that can be used to finetune CLIP on COCO and Flickr30K, like BLIP (retrieval_coco_ft.yaml and train_retrieval_coco)? Thanks again!

LAVIS on Anaconda?

Hello there! First of all: this library is godlike. Thanks for all the effort!
Second: can we get LAVIS on the conda-forge channel? It would be awesome for everybody.

Error during installation

Getting following error

Could not find a version that satisfies the requirement decord>=0.6.0 (from lavis) (from versions: none)

Can you help please

Where are the pretrained models saved after downloading?

Hi, can anyone help me answer where are the all the pretrained model saved in system after the first time downloading it? I am trying to integrate lavis into my docker image and needs sort out the model save path

Error in load_model_and_preprocess

Hi, I met an error when loading the model pnp_vqa
model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)
...
File ~/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/serialization.py:600, in load(f, map_location, pickle_module, **pickle_load_args)
595 if _is_zipfile(opened_file):
596 # The zipfile reader is going to advance the current file position.
597 # If we want to actually tail call to torch.jit.load, we need to
598 # reset back to the original position.
599 orig_position = opened_file.tell()
--> 600 with _open_zipfile_reader(opened_file) as opened_zipfile:
601 if _is_torchscript_zip(opened_zipfile):
602 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
603 " dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to"
604 " silence this warning)", UserWarning)

File ~/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/serialization.py:242, in _open_zipfile_reader.init(self, name_or_buffer)
241 def init(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I think this issue happens when the file is not downloaded completely.
Is there any way to redownload the model?

is there a better guide for BLIP or CLIP pretraining?

Thank you for your great work! I was wondering how should implement image-text pre-training for CLIP or BLIP, which seems unclear in the project readme files.

Error in pnp_vqa

When I run the following line of code in pnp_vqa.ipynb on the colab:
model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)

there raises an error:

ConfigAttributeError                      Traceback (most recent call last)
[<ipython-input-5-3ec70409a921>](https://localhost:8080/#) in <module>
----> 1 model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)

10 frames
[/content/LAVIS/lavis/models/__init__.py](https://localhost:8080/#) in load_model_and_preprocess(name, model_type, is_eval, device)
    175 
    176     # load model
--> 177     model = model_cls.from_pretrained(model_type=model_type)
    178 
    179     if is_eval:

[/content/LAVIS/lavis/models/base_model.py](https://localhost:8080/#) in from_pretrained(cls, model_type)
     68         """
     69         model_cfg = OmegaConf.load(cls.default_config_path(model_type)).model
---> 70         model = cls.from_config(model_cfg)
     71 
     72         return model

[/content/LAVIS/lavis/models/pnp_vqa_models/pnp_vqa.py](https://localhost:8080/#) in from_config(cls, model_config)
    335                     image_captioning_model=image_captioning_model,
    336                     question_answering_model=question_answering_model,
--> 337                     offload_model= True if model_config.model_type == '3b' else False,
    338                     )
    339 

[/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py](https://localhost:8080/#) in __getattr__(self, key)
    354         except ConfigKeyError as e:
    355             self._format_and_raise(
--> 356                 key=key, value=None, cause=e, type_override=ConfigAttributeError
    357             )
    358         except Exception as e:

[/usr/local/lib/python3.7/dist-packages/omegaconf/base.py](https://localhost:8080/#) in _format_and_raise(self, key, value, cause, msg, type_override)
    235             msg=str(cause) if msg is None else msg,
    236             cause=cause,
--> 237             type_override=type_override,
    238         )
    239         assert False

[/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py](https://localhost:8080/#) in format_and_raise(node, key, value, msg, cause, type_override)
    898         ex.ref_type_str = ref_type_str
    899 
--> 900     _raise(ex, cause)
    901 
    902 

[/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py](https://localhost:8080/#) in _raise(ex, cause)
    796     else:
    797         ex.__cause__ = None
--> 798     raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
    799 
    800 

[/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py](https://localhost:8080/#) in __getattr__(self, key)
    350         try:
    351             return self._get_impl(
--> 352                 key=key, default_value=_DEFAULT_MARKER_, validate_key=False
    353             )
    354         except ConfigKeyError as e:

[/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py](https://localhost:8080/#) in _get_impl(self, key, default_value, validate_key)
    441         try:
    442             node = self._get_child(
--> 443                 key=key, throw_on_missing_key=True, validate_key=validate_key
    444             )
    445         except (ConfigAttributeError, ConfigKeyError):

[/usr/local/lib/python3.7/dist-packages/omegaconf/basecontainer.py](https://localhost:8080/#) in _get_child(self, key, validate_access, validate_key, throw_on_missing_value, throw_on_missing_key)
     76             validate_key=validate_key,
     77             throw_on_missing_value=throw_on_missing_value,
---> 78             throw_on_missing_key=throw_on_missing_key,
     79         )
     80         if isinstance(child, UnionNode) and not _is_special(child):

[/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py](https://localhost:8080/#) in _get_node(self, key, validate_access, validate_key, throw_on_missing_value, throw_on_missing_key)
    478         if value is None:
    479             if throw_on_missing_key:
--> 480                 raise ConfigKeyError(f"Missing key {key!s}")
    481         elif throw_on_missing_value and value._is_missing():
    482             raise MissingMandatoryValue("Missing mandatory value: $KEY")

ConfigAttributeError: Missing key model_type
    full_key: model.model_type
    object_type=dict

What should I do about the model_type? I'm looking forward to your reply :)

Hugging Face integration of `BLIP`

Dear authors,

We have a working implementation of BLIP and 3 of its variants in huggingface transformers (image captioning, visual question answering, image text retrieval): huggingface/transformers#20716 that is not merged yet

The license of the repository and model states that:

3. Neither the name of [Salesforce.com](http://salesforce.com/) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

We would like to promote the addition of this architecture to transformers library. Therefore I would like to ask you the permission for promoting this contribution

Thank you very much in advance

Custom dataset format

Hi,

Congrats on the amazing work!! I plan to fine-tune BLIP for image captioning on a custom dataset. What is the input format of the files and changes required in the .yaml files?

PNP-VQA on UnifiedQAv2 11b

Thanks for your work in Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training! I notice a config for a model with 3b parameters, but 11b is unavailable. What should I do to test UnifiedQAv2 with 11b parameters?

Thanks in advance, I'm looking forward to your reply :)

Error when running text localization example

The issue is about the text localization example.

The input image is "../docs/_static/merlion.png" while the input caption is changed to "Merlion near marina bay. It is a city in Singapore. It is a very beautiful city located in Asia. It attract a lot of tourists to come at all seasons. There is a famous hotel in the picture. The picture is capture in night time."

Below is the error message:

gradcam, _ = compute_gradcam(model, img, txt, txt_tokens, block_num=7)
File "/data/code/LAVIS/lavis/models/blip_models/blip_image_text_matching.py", line 147, in compute_gradcam
  cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
RuntimeError: The size of tensor a (35) must match the size of tensor b (48) at non-singleton dimension 2

Can you elaborate on how to fix this error?

release web demo

when is the release?

Convergence and force output

Hi,

I am following the example in https://opensource.salesforce.com/LAVIS//latest/tutorial.training-example.html and using my own dataset for a retrain. Perhaps my data are not enough so it keeps running for 100+ epochs. Is there a way to tune the tolerance for the convergence? Or is there a way to force output the model if the retrain reaches max_epoch? Thanks!

Generating captions from a heatmap

Congrats on the amazing work!
As related to my research, I want to generate captions of an image from an input heatmap. As stated in the PNP_VQA paper, LAVIS can generate captions based on the relevancy score, but the sample code of the authors (pnp_vqa.ipynb) requires an input question.
How can I do it without the input question?

Visual Genome dataset caption preprocessing

Hi,

I notice that the VG caption annotation provided and used in papers has around 800k captions, however the original VG annotation has 5.4M regional captions, wondering what kind of pre-processing is involved to reach the current provided VG caption annotations?

Best

salesforce / lavis Goto Github PK

lavis's People

Contributors

Stargazers

Watchers

Forkers

lavis's Issues

Code

Output

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Recommend Projects

Recommend Topics

Recommend Org

Jobs