salesforce / LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
License: BSD 3-Clause "New" or "Revised" License
Hi,
Is there a way to access the confidence of the generated caption?
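To illustrate what I mean by confidence: LAVIS's generate() does not seem to return scores, but with a plain HuggingFace decoder one can average per-token log-probabilities as a confidence proxy. The sketch below is a generic illustration, not LAVIS API; gpt2 is just a stand-in for BLIP's text decoder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in decoder, not BLIP's
lm = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("a photo of", return_tensors="pt")
out = lm.generate(**inputs, max_new_tokens=10,
                  return_dict_in_generate=True, output_scores=True)
gen_ids = out.sequences[0, inputs.input_ids.shape[1]:]   # newly generated tokens only
logps = [torch.log_softmax(step, dim=-1)[0, tok_id].item()
         for step, tok_id in zip(out.scores, gen_ids)]
confidence = sum(logps) / len(logps)                     # mean token log-probability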
Thanks for the great work. I have some questions about the BLIP feature extractor interface.
# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks
What are the other channels [:, 1:12, :] useful for?
image_features are also mentioned (link), but they are not available. Can you comment on the difference between image_embeds and image_features, and how to access the latter?
print(features_image.image_embeds.shape)
print(features_image.image_features.shape)
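For reference, here is the snippet I am experimenting with (a minimal sketch following the LAVIS demo; the last line reflects my guess that the projected embeddings, image_embeds_proj, are the closest available thing to "image_features"):
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
sample = {"image": image, "text_input": [txt_processors["eval"]("a city at night")]}

features_image = model.extract_features(sample, mode="image")
print(features_image.image_embeds.shape)        # raw per-patch ViT embeddings
print(features_image.image_embeds_proj.shape)   # projected embeddings used for similarity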
Thanks!
Hi,
I would be grateful if you could kindly provide the code to regenerate the other test cases provided in the technical report.
I'm really interested in regenerating zero-shot classification, text localisation, multimodal classification, etc.
Hi, thank you for sharing a great library.
I want to fine-tune the model on COCO caption, which uses pycocoevalcap, which in turn depends on stanford-corenlp.
While running run_scripts/blip/train/train_caption_coco.sh, an error occurs when the SPICE score is calculated: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module.
I guess this error is due to its dependency on a specific version of Java. Which version of Java do you use in the development?
Thank you!
Thank you LAVIS team for this wonderful repo!
It appears that loading blip_classification with model_type='base' throws an error stemming from the configuration files. Is this expected behavior? If so, perhaps a more descriptive error (or solution) could be provided with the assertion?
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, processors, _ = load_model_and_preprocess(name="blip_classification",
                                                 model_type="base",
                                                 is_eval=True, device=device)
AssertionError: Invalid number of classes provided, found -1
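For what it's worth, the assertion suggests num_classes never gets set in the released config. A workaround sketch, untested against the released configs; the num_classes key is inferred from the error message, so treat it as an assumption:
from omegaconf import OmegaConf
from lavis.common.registry import registry

model_cls = registry.get_model_class("blip_classification")
cfg = OmegaConf.load(model_cls.default_config_path(model_type="base")).model
cfg.num_classes = 2   # set this to your task's number of classes
model = model_cls.from_config(cfg)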
Decord is listed as a dependency, but there are no prebuilt binaries for Mac above Python 3.8 (see issues here). This means Mac users of LAVIS effectively have to stay on Python 3.8 or below. Can we get rid of Decord and replace it with something that is better maintained?
model, vis_processors, txt_processors = load_model_and_preprocess(name="gpt_dialogue", model_type="base", is_eval=True, device=device)
TypeError Traceback (most recent call last)
<ipython-input-48-216011dd9979> in <module>
----> 1 model, vis_processors, txt_processors = load_model_and_preprocess(name="gpt_dialogue", model_type="base", is_eval=True, device=device)
/content/LAVIS/lavis/models/__init__.py in load_model_and_preprocess(name, model_type, is_eval, device)
171
172 # load model
--> 173 model = model_cls.from_pretrained(model_type=model_type)
174
175 if is_eval:
TypeError: from_pretrained() missing 1 required positional argument: 'pretrained_model_name_or_path'
I was trying to reproduce results with BLIP on VQAv2 test-dev and I observed a non-negligible difference between the VQA accuracy obtained using the published checkpoint (77.41%) and the number reported in the paper (78.25%).
These are the steps I followed:
1. pip install .
2. Set up cache/coco/images pointing to the local copy of the COCO images.
3. Run lavis/projects/blip/eval/vqav2_eval.yaml as follows: python -m torch.distributed.run --nproc_per_node=4 evaluate.py --cfg-path lavis/projects/blip/eval/vqav2_eval.yaml (note I only have 4 A100 GPUs available).
4. Submit the test_vqa_result.json file generated in lavis/output/BLIP/VQA/... to EvalAI.
After some debugging, I narrowed it down to a discrepancy in PyTorch versions: I was using the latest version (1.13.0), while LAVIS pins the version to 1.10.0. So some change between PyTorch 1.10 and PyTorch 1.13 causes a performance degradation when loading a checkpoint trained on 1.10. After downgrading PyTorch to 1.10.0, I am able to achieve 78.24% VQA accuracy on VQAv2 test-dev, almost the same number reported in the paper.
Hi, I used the script LAVIS/run_scripts/blip/train/train_okvqa.sh to fine-tune BLIP on OK-VQA. However, the result at epoch 7 is only 45.12, which has a large gap from the 55.4 you reported. Are there any hyperparameters that need to be modified in LAVIS/lavis/projects/blip/train/okvqa_ft.yaml?
It would be highly appreciated if you could give some help.
Hi, I want to do image tagging using the ALBEF model. I have written the code for that, but the top-1 tags are way off. Could you help me with this?
def calculate_pred(img):
    raw_image = Image.open(img).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    # occ_list is a list of occupations and template = "photo of a "
    text = [txt_processors["eval"](template + x) for x in occ_list]
    sample = [{"image": image, "text_input": text_input} for text_input in text]
    img_feats = model.extract_features(sample[0], mode="image").image_embeds_proj[:, 0, :]
    img_feats = img_feats / img_feats.norm()
    text_feats = []
    for i in range(len(sample)):
        feats = model.extract_features(sample[i], mode="text").text_embeds_proj[:, 0, :]
        text_feats.append(feats / feats.norm())
    text_feats = torch.cat(text_feats)
    _, index = torch.max(img_feats @ text_feats.T, dim=1)
    return occ_list[index.tolist()[0]]
I am trying to replicate the VQAv2 evaluation by running bash run_scripts/pnp-vqa/eval/eval_vqav2.sh
. However, the scores aren't printed out. I only get:
...
result file saved to /home/LAVIS/lavis/output/PNP-VQA/VQAv2_val/20221124171/result/val_vqa_result.json
loading VQA annotations and questions into memory...
creating index...
index created!
Loading and preparing results...
DONE (t=0.42s)
creating index...
index created!
computing accuracy
Finshed Percent: [####################] 99% Done computing accuracy
NCCL INFO [Service thread] Connection closed by localRank 0
NCCL INFO comm 0x7fa08c008fb0 rank 0 nranks 8 cudaDev 0 busId e00000 - Abort COMPLETE
Is this expected behavior? If so, where do I find the final scores? Thanks!
Hi! I'm trying to use the LAVIS repo to pretrain BLIP on my own dataset. I was wondering if there are any plans to integrate some type of tracking tool like wandb or tensorboard into the codebase? Or any suggestions on how I can easily set it up myself?
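In the meantime, here is the minimal hook I was considering adding myself (plain torch.utils.tensorboard, nothing LAVIS-specific; the log_dir and tag names are arbitrary):
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="output/tensorboard")
for step, loss in enumerate([0.9, 0.7, 0.5]):   # stand-in for per-step training losses
    writer.add_scalar("train/loss", loss, step)
writer.close()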
Thanks!
When I evaluate the COCO caption task on a single-GPU device, I get this error.
def evaluation(self, model, data_loader, cuda_enabled=True):
    metric_logger = MetricLogger(delimiter=" ")
    header = "Evaluation"
    # TODO make it configurable
    print_freq = 10

    results = []

    for samples in metric_logger.log_every(data_loader, print_freq, header):
        samples = prepare_sample(samples, cuda_enabled=cuda_enabled)
        eval_output = self.valid_step(model=model, samples=samples)
        results.extend(eval_output)

    dist.barrier()
    return results
The error occurs on “dist.barrier()”
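A workaround sketch that I believe avoids the crash (guard the collective call so the same loop also runs without an initialized process group; untested inside the LAVIS codebase itself):
import torch.distributed as dist

# only synchronize when torch.distributed has actually been initialized
if dist.is_available() and dist.is_initialized():
    dist.barrier()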
I've run the following piece of code
import torch
from lavis.models import load_model, load_model_and_preprocess
from PIL import Image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample image
raw_image = Image.open("LAVIS/docs/_static/merlion.png").convert("RGB")
model, vis_processors, txt_processors = load_model_and_preprocess(
name="albef_vqa", model_type="vqav2", is_eval=True, device=device
)
question = "Which city is this photo taken?"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
samples = {"image": image, "text_input": question}
answer_list = ["Singapore", "London", "Palo Alto", "Tokyo"]
answers = model.rank_answers(samples, answer_list=answer_list, num_ans_candidates=3)
answers
and got this as an output
{'name': 'blip_image_train', 'image_size': 384}
{'name': 'blip_image_eval', 'image_size': 384}
{'name': 'blip_question'}
{'name': 'blip_question'}
['Singapore']
Is this the expected behavior? Or have I missed something about the inputs?
I was expecting 3 ordered answers, or maybe even 3 ordered answers with their probabilities (that would be really nice), but instead got the output of predict_answers.
Hi, how can I add the visual7w dataset for the VQA task? The adding-datasets documentation is for the AVSD task, and I'm not sure how to carry out similar steps for a different task... My data has images, questions, multiple options, and answers. Thanks.
Hi, I am using LAVIS to train the ALBEF model recently, and I found a potential issue.
Here, we apply torch DDP to the original model, which requires a device id, i.e., the rank of the current process.
However, I think it is not the responsibility of the config file to set the rank for each process, and it results in failure. I suggest using get_rank() instead.
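For illustration, the standard pattern I have in mind (a sketch; LOCAL_RANK is set by torch.distributed.run, and model stands in for the ALBEF model being wrapped):
import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])      # provided by torch.distributed.run
torch.cuda.set_device(local_rank)
model = torch.nn.parallel.DistributedDataParallel(
    model.to(local_rank), device_ids=[local_rank]
)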
Hi everyone,
First of all, thank you so much for this great package.
I would be so grateful if you could kindly provide me with an example code of how the multimodal feature can be used for multimodal classification, as you mentioned here.
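To make the request concrete, here is the kind of minimal sketch I have in mind (my own guess, not an official example; raw_image is any PIL RGB image, the caption is a placeholder, and the linear head is untrained):
import torch
import torch.nn as nn
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
sample = {"image": image, "text_input": [txt_processors["eval"]("a dog on the grass")]}

features = model.extract_features(sample)            # multimodal mode by default
cls_embedding = features.multimodal_embeds[:, 0, :]  # torch.Size([1, 768])
classifier = nn.Linear(768, 2).to(device)            # linear head; 2 classes as a placeholder
logits = classifier(cls_embedding)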
Best,
sub.json is organized in the format:
[{'image': '4385058960_b0f291553e.jpg',
'caption': 'a wooden chair in the living room',
'url': 'http://static.flickr.com/2723/4385058960_b0f291553e.jpg'},
...]
but the downloaded sbu_images.rar extracts as:
0000/ 0001/ 0002/ 0003/ ... 0999/
where each directory contains 1000 images named in order:
000.jpg 001.jpg 002.jpg ... 999.jpg
Therefore, the image storage paths do not correspond to the paths in the json. @dxli94
I want to fine-tune BLIP on A-OKVQA. I downloaded your fine-tuned checkpoint and directly evaluated it on the validation set; the result is 50.22. However, when I fine-tune the model myself, the result is 41.89. I didn't change the hyperparameters in the config. Could you provide the hyperparameters you used to fine-tune BLIP on A-OKVQA?
By the way, did you use other VQA datasets to continue pre-training BLIP before fine-tuning it on A-OKVQA?
Hello,
First of all, thank you for building such a wonderful library.
I have a question about VQA using BLIP.
It is about the samples argument of predict_answers in the BLIP VQA model.
samples (dict): A dictionary containing the following keys:
- image (torch.Tensor): A tensor of shape (batch_size, 3, H, W). Default H=480, W=480.
- text_input (list): A list of strings, each string is a question
Since text_input is a list, I think I can pass multiple questions.
However, when I put several questions in the list and ran it, I got a tensor size mismatch error.
I want to know if VQA runs only one question per image.
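For reference, the workaround I am currently trying (my own guess: repeat the image so the batch sizes match; not confirmed as the intended usage):
questions = [txt_processors["eval"](q) for q in
             ["which city is this?", "is it night time?"]]
images = image.repeat(len(questions), 1, 1, 1)   # image: (1, 3, H, W) from the vis processor
samples = {"image": images, "text_input": questions}
answers = model.predict_answers(samples, inference_method="generate")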
thanks
Hi,
I was trying to download the datasets. After running the download scripts for CC, loading CC failed with the errors below.
When I run the following code:
from lavis.datasets.builders import load_dataset
In [3]: data = load_dataset('conceptual_caption_12m')
the error is:
File ~/anaconda3/envs/lavis/lib/python3.8/urllib/request.py:383, in Request._parse(self)
381 self.type, rest = _splittype(self._full_url)
382 if self.type is None:
--> 383 raise ValueError("unknown url type: %r" % self.full_url)
384 self.host, self.selector = _splithost(rest)
385 if self.host:
ValueError: unknown url type: '/export/home/workspace/datasets/cc12m.json'
The same goes for 'sbu_caption'; it cannot find the url '/export/share/dongxuli/data/lavis/sbu/annotation/sbu.json'.
Can you please help with obtaining these annotation json files?
Best
While loading the ALBEF feature extractor using
model, vis_processors, text_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)
it returns:
reshape position embedding from 256 to 196
None
{'name': 'blip_image_eval', 'image_size': 224}
None
{'name': 'blip_caption'}
Is this the BLIP model or ALBEF?
Hi Dongxu,
I have received the following error when running this command:
bash run_scripts/run_demo.sh
ModuleNotFoundError: No module named 'app'
File "/usr/local/lib/python3.8/dist-packages/streamlit/runtime/scriptrunner/script_runner.py", line 562, in _run_script exec(code, module.__dict__)
File "/home/ermia/PycharmProjects/LAVIS/app/main.py", line 8, in <module> from app.multipage import MultiPage
I'm digging into the code to find the reason for the error.
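My current guess, based on the traceback, is that app is imported as a package relative to the repo root, so running from the repo root with it on PYTHONPATH might avoid the error (a workaround sketch, not an official fix):
cd LAVIS
PYTHONPATH=. streamlit run app/main.py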
Hi, thanks for the great stuff!
Is there any plan to update the torch version (from ==1.10 to anything newer), or to relax the pin?
As I understand from the BLIP paper, NLVR takes a pair of images and a sentence, and predicts whether the sentence describes the image pair.
I have used the following code to generate output comparing two images and their text input, with a total of 3 comparisons in the minibatch.
I also have a question about labels in the samples dict: are the values of labels only 0 and 1 for False and True, or something else?
model, vis_processors, text_processors = load_model_and_preprocess("blip_nlvr", "nlvr", device=device, is_eval=True)
samples = {
"image0": torch.randn((3, 3, 384, 384), device=device),
"image1": torch.randn((3, 3, 384, 384), device=device),
"text_input": [
"there is a car with yellow color",
"there are cars in one of the images",
"there are bikes in both images"
],
"label": torch.tensor([0, 1, 1], device=device),
}
with torch.no_grad():
output = model.predict(samples)
{'predictions': tensor([[ 0.6208, -0.7106],
[ 0.6987, -0.7888],
[ 1.3222, -1.4706]], device='cuda:0'),
'targets': tensor([0, 1, 1], device='cuda:0')}
from lavis.datasets.builders import load_dataset
msrvtt_dataset = load_dataset("msrvtt_caption")
Another way still failed:
Downloading https://download1602.mediafire.com/jslug277m67g/x3rrbe4hwp04e6w/train_val_videos.zip to train
Failed to download or extracting datasets. Aborting.
Merging to C:\Users\jianwei\anaconda3\envs\thesis\lib\site-packages\lavis\..\cache\msrvtt\videos
Failed to merging datasets. Aborting.
Hi, thank you for the great work!
I wonder if there is any plan to incorporate tensorboard visualization. Also, is there any plan to integrate pytorch_lightning?
Hi, thanks for the amazing work you did with the library!
I am currently trying to fine-tune BLIP on a custom dataset. I followed your tutorial on the custom dataset generation and set up all the necessary files for the fine-tuning, and everything works as expected.
The only problem I've encountered is with the maximum length of the generated captions. In my training configuration file this length is set to 256, but the model never generates captions longer than ~50 words (roughly 90 tokens on average).
I have already increased the BERT embedding size to 256, hard-coding it in this line:
My training config file looks like this:
model:
arch: blip_caption
model_type: base_coco
load_finetuned: False
datasets:
custom_caption: # name of the dataset builder
vis_processor:
train:
name: "blip_image_train"
eval:
name: "blip_image_eval"
text_processor:
train:
name: "blip_caption"
prompt: "a picture of "
eval:
name: "blip_caption"
run:
task: captioning
# optimizer
lr_sched: "linear_warmup_cosine_lr"
init_lr: 1e-5
min_lr: 0
weight_decay: 0.05
max_epoch: 20
batch_size_train: 2
batch_size_eval: 8
num_workers: 1
max_len: 256
min_len: 5
num_beams: 3
seed: 42
output_dir: "output/BLIP/Caption_custom"
amp: False
resume_ckpt_path: null
evaluate: False
train_splits: ["train"]
valid_splits: ["val"]
test_splits: ["test"]
device: "cuda"
world_size: 1
dist_url: "env://"
distributed: True
I am training the model with 5000 samples. Do you have any suggestions on what could be wrong or missing in my fine-tuning configuration? Should I use different parameters for the optimiser? Is generating captions of this length even achievable with BLIP?
Thanks!
Hi there,
Could you please provide an example of how to run the Video Question Answering task using LAVIS?
Any examples of other video-related tasks would be very appreciated.
When I run the BLIP captioning model with transformers 4.22.2 I get the following error:
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")  # any RGB image works here
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
The following `model_kwargs` are not used by the model: ['encoder_hidden_states', 'encoder_attention_mask'] (note: typos in the generate arguments will also show up in this list)
(Original posted in Chinese; English translation by @dxli94:)
"In Section 3.1, para. 2, you mention that ITE measures the similarity between images and questions, while in para. 3 you mention that GradCAM measures the similarity. What is the relation between ITE and GradCAM?
Second, how should one understand 'To identify relevant image patches, we feed the image v and the question t to the ITE network and apply a variation of GradCAM'? What is GradCAM doing here?"
Hi everyone, @dxli94
When I am running the below code, I receive the corresponding error:
import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
The error is about loading the model from a path, which seems unreachable due to some unknown issue.
raise HTTPError(req.full_url, code, msg, hdrs, fp)
│ │ │ │ │ │ └ <http.client.HTTPResponse object at 0x7fa4f4114820>
│ │ │ │ │ └ <http.client.HTTPMessage object at 0x7fa4f41149a0>
│ │ │ │ └ 'Forbidden'
│ │ │ └ 403
│ │ └ <property object at 0x7fa58770f1d0>
│ └ <urllib.request.Request object at 0x7fa4f413d790>
└ <class 'urllib.error.HTTPError'>
It is related to the following file:
https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth
Could you please kindly make the models available somewhere else that can easily be accessible?
I followed the demo and used the features generated by albef_feature_extractor to perform zero-shot cross-modal retrieval on MSCOCO. The t2i recall scores are extremely low while the i2t scores look normal, and I don't know why. What's more, I found that the cosine similarity even between a paired image and text is low (about 0.09).
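For reference, this is how I compute the similarity (following the demo; my assumption is that the *_embeds_proj outputs are already L2-normalized, so the dot product equals the cosine similarity):
feats_img = model.extract_features({"image": image}, mode="image")
feats_txt = model.extract_features({"text_input": [caption]}, mode="text")
sim = feats_img.image_embeds_proj[:, 0, :] @ feats_txt.text_embeds_proj[:, 0, :].t()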
Hello, thanks for your nice work! Are there scripts and configuration files that can be used to finetune CLIP on COCO and Flickr30K, like BLIP (retrieval_coco_ft.yaml and train_retrieval_coco)? Thanks again!
Hello there! First of all: this library is godlike. Thanks for all the effort!
Second: can we get LAVIS on the conda-forge channel? It would be awesome for everybody.
I am getting the following error:
Could not find a version that satisfies the requirement decord>=0.6.0 (from lavis) (from versions: none)
Can you help, please?
Hi, can anyone tell me where all the pretrained models are saved after they are first downloaded? I am trying to integrate LAVIS into my Docker image and need to sort out the model save path.
Hi, I met an error when loading the model pnp_vqa
model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)
...
File ~/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/serialization.py:600, in load(f, map_location, pickle_module, **pickle_load_args)
595 if _is_zipfile(opened_file):
596 # The zipfile reader is going to advance the current file position.
597 # If we want to actually tail call to torch.jit.load, we need to
598 # reset back to the original position.
599 orig_position = opened_file.tell()
--> 600 with _open_zipfile_reader(opened_file) as opened_zipfile:
601 if _is_torchscript_zip(opened_zipfile):
602 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
603 " dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to"
604 " silence this warning)", UserWarning)
File ~/anaconda3/envs/lavis/lib/python3.8/site-packages/torch/serialization.py:242, in _open_zipfile_reader.init(self, name_or_buffer)
241 def init(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer))
I think this issue happens when the file is not downloaded completely.
Is there any way to redownload the model?
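In case it helps others, a sketch for finding and removing the truncated file (the cache location is an assumption; on my machine downloaded checkpoints land in the torch hub cache):
import pathlib
import torch

ckpt_dir = pathlib.Path(torch.hub.get_dir()) / "checkpoints"
for ckpt in sorted(ckpt_dir.glob("*.pth")):
    print(ckpt.name, ckpt.stat().st_size)   # a truncated file is conspicuously small
# delete the incomplete .pth, then re-run load_model_and_preprocess to re-download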
Thank you for your great work! I was wondering how I should implement image-text pre-training for CLIP or BLIP; this seems unclear in the project README files.
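If it helps to have a starting point, my assumption is that pre-training follows the same launch pattern as the fine-tuning runs, e.g. (the config path below is a guess; substitute whichever pretrain config ships with the repo):
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/pretrain_14m.yaml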
When I run the following line of code in pnp_vqa.ipynb on the colab:
model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)
there raises an error:
ConfigAttributeError Traceback (most recent call last)
<ipython-input-5-3ec70409a921> in <module>
----> 1 model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)
10 frames
/content/LAVIS/lavis/models/__init__.py in load_model_and_preprocess(name, model_type, is_eval, device)
175
176 # load model
--> 177 model = model_cls.from_pretrained(model_type=model_type)
178
179 if is_eval:
/content/LAVIS/lavis/models/base_model.py in from_pretrained(cls, model_type)
68 """
69 model_cfg = OmegaConf.load(cls.default_config_path(model_type)).model
---> 70 model = cls.from_config(model_cfg)
71
72 return model
/content/LAVIS/lavis/models/pnp_vqa_models/pnp_vqa.py in from_config(cls, model_config)
335 image_captioning_model=image_captioning_model,
336 question_answering_model=question_answering_model,
--> 337 offload_model= True if model_config.model_type == '3b' else False,
338 )
339
/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py in __getattr__(self, key)
354 except ConfigKeyError as e:
355 self._format_and_raise(
--> 356 key=key, value=None, cause=e, type_override=ConfigAttributeError
357 )
358 except Exception as e:
/usr/local/lib/python3.7/dist-packages/omegaconf/base.py in _format_and_raise(self, key, value, cause, msg, type_override)
235 msg=str(cause) if msg is None else msg,
236 cause=cause,
--> 237 type_override=type_override,
238 )
239 assert False
/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py in format_and_raise(node, key, value, msg, cause, type_override)
898 ex.ref_type_str = ref_type_str
899
--> 900 _raise(ex, cause)
901
902
/usr/local/lib/python3.7/dist-packages/omegaconf/_utils.py in _raise(ex, cause)
796 else:
797 ex.__cause__ = None
--> 798 raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
799
800
/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py in __getattr__(self, key)
350 try:
351 return self._get_impl(
--> 352 key=key, default_value=_DEFAULT_MARKER_, validate_key=False
353 )
354 except ConfigKeyError as e:
/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py in _get_impl(self, key, default_value, validate_key)
441 try:
442 node = self._get_child(
--> 443 key=key, throw_on_missing_key=True, validate_key=validate_key
444 )
445 except (ConfigAttributeError, ConfigKeyError):
/usr/local/lib/python3.7/dist-packages/omegaconf/basecontainer.py in _get_child(self, key, validate_access, validate_key, throw_on_missing_value, throw_on_missing_key)
76 validate_key=validate_key,
77 throw_on_missing_value=throw_on_missing_value,
---> 78 throw_on_missing_key=throw_on_missing_key,
79 )
80 if isinstance(child, UnionNode) and not _is_special(child):
/usr/local/lib/python3.7/dist-packages/omegaconf/dictconfig.py in _get_node(self, key, validate_access, validate_key, throw_on_missing_value, throw_on_missing_key)
478 if value is None:
479 if throw_on_missing_key:
--> 480 raise ConfigKeyError(f"Missing key {key!s}")
481 elif throw_on_missing_value and value._is_missing():
482 raise MissingMandatoryValue("Missing mandatory value: $KEY")
ConfigAttributeError: Missing key model_type
full_key: model.model_type
object_type=dict
What should I do about the model_type? I'm looking forward to your reply :)
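While waiting for a fix, a workaround sketch (untested; it simply injects the key that the traceback shows from_config reading):
from omegaconf import OmegaConf
from lavis.common.registry import registry

model_cls = registry.get_model_class("pnp_vqa")
cfg = OmegaConf.load(model_cls.default_config_path(model_type="base")).model
cfg.model_type = "base"          # the key the traceback reports as missing
model = model_cls.from_config(cfg)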
Dear authors,
We have a working implementation of BLIP and 3 of its variants (image captioning, visual question answering, image-text retrieval) in huggingface transformers: huggingface/transformers#20716, which is not merged yet.
The license of the repository and model states that:
3. Neither the name of [Salesforce.com](http://salesforce.com/) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
We would like to promote the addition of this architecture to the transformers library. Therefore I would like to ask you for permission to promote this contribution.
Thank you very much in advance
Hi,
Congrats on the amazing work!! I plan to fine-tune BLIP for image captioning on a custom dataset. What is the input format of the files, and what changes are required in the .yaml files?
Thanks for your work on Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training! I notice a config for the model with 3b parameters, but 11b is unavailable. What should I do to test UnifiedQAv2 with 11b parameters?
Thanks in advance, I'm looking forward to your reply :)
The issue is about the text localization example.
The input image is "../docs/_static/merlion.png" while the input caption is changed to "Merlion near marina bay. It is a city in Singapore. It is a very beautiful city located in Asia. It attract a lot of tourists to come at all seasons. There is a famous hotel in the picture. The picture is capture in night time."
Below is the error message:
gradcam, _ = compute_gradcam(model, img, txt, txt_tokens, block_num=7)
File "/data/code/LAVIS/lavis/models/blip_models/blip_image_text_matching.py", line 147, in compute_gradcam
cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
RuntimeError: The size of tensor a (35) must match the size of tensor b (48) at non-singleton dimension 2
Can you elaborate on how to fix this error?
When is the release?
Hi,
I am following the example in https://opensource.salesforce.com/LAVIS//latest/tutorial.training-example.html and using my own dataset for retraining. Perhaps my data is not enough, so it keeps running for 100+ epochs. Is there a way to tune the tolerance for convergence? Or is there a way to force output of the model once the retraining reaches max_epoch? Thanks!
Congrats on the amazing work!
As related to my research, I want to generate captions of an image from an input heatmap. As stated in the PNP-VQA paper, LAVIS can generate captions based on the relevancy score, but the authors' sample code (pnp_vqa.ipynb) requires an input question.
How can I do this without the input question?
Hi,
I notice that the VG caption annotation provided and used in the papers has around 800k captions; however, the original VG annotation has 5.4M regional captions. I am wondering what kind of pre-processing is involved to arrive at the currently provided VG caption annotations?
Best