The flava-tutorials from apsdehal

Reference from CLIP

Hi, thanks for the tutorial!
Is there a source for the notebook from CLIP that is mentioned in the README?

ITM output changes due to padding

Hi,

Thank you for uploading these notebooks for FLAVA, they have been very helpful. I'm opening this issue because there is one thing I'm a bit confused about. Running the winoground-flava-example.ipynb example, I find that the image-text match outputs change depending on the input length (padding) and thus also on the batch size.

Under the header 'Look at an example from Winoground and get the image-caption scores from FLAVA', the text and image of Winoground sample 155 are processed with max_length=77, padding=True:

inputs_c0_i0 = flava_processor(text=[winoground[155]["caption_0"]], images=[winoground[155]["image_0"].convert("RGB")], return_tensors="pt", max_length=77, padding=True, return_codebook_pixels=True, return_image_mask=True).to("cuda")

According to the HuggingFace documentation, padding=True pads to the longest sequence in the batch. That means that for this sample, with a batch size of 1, no padding is applied. Inspecting the output of the processor confirms that no padding is applied.
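To make the difference between the two padding modes concrete, here is a minimal sketch in plain Python that mimics the documented behaviour. It is not the HuggingFace implementation: `pad_batch`, `PAD_ID`, and the token ids are hypothetical names and values chosen for illustration only.

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only


def pad_batch(batch, padding, max_length=None):
    """Mimic the two padding modes discussed above (illustrative, not the library code).

    padding=True          -> pad to the longest sequence in the batch
    padding="max_length"  -> pad every sequence to max_length
    Sequences longer than the target are left as-is (no truncation here).
    """
    if padding is True:
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError("unsupported padding mode in this sketch")

    input_ids = [seq + [PAD_ID] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask


# A batch of one made-up caption with 4 token ids.
single = [[101, 7, 8, 102]]

ids_longest, mask_longest = pad_batch(single, padding=True)
ids_max, mask_max = pad_batch(single, padding="max_length", max_length=77)

print(len(ids_longest[0]))  # length is unchanged: nothing to pad against
print(len(ids_max[0]))      # padded out to max_length
print(sum(mask_max[0]))     # number of real (non-pad) tokens
```

With a batch of one, pad-to-longest is a no-op, while pad-to-max_length appends pad tokens that the attention mask marks as 0.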

Now in the next section of the notebook ('Get FLAVA image-caption scores from the whole dataset'), padding is set as padding="max_length", max_length=77, causing all tokenized inputs to have length 77. This yields the following scores:

contrastive text score: 0.2525
contrastive image score: 0.135
contrastive group score: 0.09
itm text score: 0.3225
itm image score: 0.1975
itm group score: 0.14

Changing the setting here to padding=True (which effectively removes the padding, with everything else left as it was), the scores become:

contrastive text score: 0.2525
contrastive image score: 0.135
contrastive group score: 0.09
itm text score: 0.2125
itm image score: 0.1025
itm group score: 0.0475

We observe that the contrastive scores remain the same, but the itm scores have changed. Going back to sample 155, we find that going from padding=True (no padding) to padding="max_length" drastically changes the itm-score for each (image, text) pair.

FLAVA itm image-text match scores (no padding):
image_0, caption_0: 0.5821857452392578
image_0, caption_1: 0.6948502063751221
image_1, caption_0: 0.5699254274368286
image_1, caption_1: 0.7856388092041016

FLAVA itm image-text match scores (padding to max_length=77):
image_0, caption_0: 0.9999473094940186
image_0, caption_1: 0.9999871253967285
image_1, caption_0: 0.9997100234031677
image_1, caption_1: 0.9999109506607056
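To see what these numbers mean for the Winoground metrics on this one sample, the sketch below applies the standard Winoground text/image/group criteria to the four ITM scores listed above, rounded to six decimals. The dictionary keys and function names are my own choices for illustration, not taken from the notebook.

```python
def text_correct(s):
    # For each image, the matching caption must score higher than the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s):
    # For each caption, the matching image must score higher than the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s):
    return text_correct(s) and image_correct(s)


# ITM scores for sample 155 as reported above; key "cX_iY" = caption_X with image_Y.
no_padding = {
    "c0_i0": 0.582186, "c1_i0": 0.694850,
    "c0_i1": 0.569925, "c1_i1": 0.785639,
}
max_length_padding = {
    "c0_i0": 0.999947, "c1_i0": 0.999987,
    "c0_i1": 0.999710, "c1_i1": 0.999911,
}

for name, scores in [("no padding", no_padding), ("max_length", max_length_padding)]:
    print(name, text_correct(scores), image_correct(scores), group_correct(scores))
```

For this sample, the image criterion is satisfied without padding but not with padding to max_length, while the text and group criteria fail in both cases.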

What could be the cause of this, and would this mean that we cannot use FLAVA with HuggingFace for batch processing (since that will add padding)?
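One generic mechanism that can produce this kind of pattern is an attention layer that receives padded inputs without the corresponding attention mask. I have not confirmed that this is what happens in FLAVA; the toy example below only illustrates the effect in isolation. It uses a single attention-pooling step in plain Python, and the vectors, the `attention_pool` helper, and the pad embedding are all made up for illustration.

```python
import math


def attention_pool(query, keys, values, mask=None):
    """Single-query softmax attention over a list of vectors (toy sketch)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if mask is not None:
        # Masked positions get -inf so their softmax weight is exactly 0.
        scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


query = [1.0, 0.5]                              # made-up query vector
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up "real" token vectors
pad = [0.5, 0.5]                                # made-up pad embedding

padded_tokens = tokens + [pad] * 5
mask = [1] * len(tokens) + [0] * 5

unpadded = attention_pool(query, tokens, tokens)
padded_masked = attention_pool(query, padded_tokens, padded_tokens, mask)
padded_unmasked = attention_pool(query, padded_tokens, padded_tokens)

print(unpadded)
print(padded_masked)    # matches the unpadded result
print(padded_unmasked)  # pulled toward the pad embedding
```

If something like this were the cause, the output would depend on how many pad tokens are appended, which would fit the observation that the ITM scores move with padding while the contrastive scores do not. Comparing the model's outputs with and without the mask on a single sample would be one way to check.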

Finetuning example

Hi, thanks for your very good work and examples. Just wondering if there is any plan to make the 6th point available.

Fine-tune on custom task [to be added soon]

Thank you very much!
