Hello, I would like to finetune RAM++ tagging with other datasets.

Finetuning question about recognize-anything HOT 7 OPEN

adbmdp commented on September 26, 2024

Finetuning question

from recognize-anything.

Comments (7)

xinyu1205 commented on September 26, 2024

It means you need to modify the forward function of ram.py or ram_plus.py.
I also strongly recommend that you read the RAM or RAM++ paper before completing these tasks.

xinyu1205 commented on September 26, 2024

Thanks for your attention.
Actually, this is certainly feasible. The performance of the model depends on the quality of your fine-tuning dataset.

adbmdp commented on September 26, 2024

Thanks for your reply and your awesome work @xinyu1205 !!

OK let's say I want to train the model with a celebrity dataset.

I have trouble understanding which tag file I need to update with the new tags.
As I understand it:
parse_label_id refers to the tag indices present in ram/data/tag_list.txt
union_label_id refers to the tag indices present in ram/data/ram_tag_list.txt

But when I look in the COCO dataset, for example, I find entries like this:

{
  "image_path":"coco/val2014/COCO_val2014_000000522418.jpg",
  "parse_label_id":[
    [
      4480,
      4532,
      678
    ]
  ],
  "caption":[
    "there is a woman that is cutting a white cake"
  ],
  "union_label_id":[
    4480,
    2624,
    2051,
    678,
    2599,
    2577,
    4532,
    1238,
    215,
    2332,
    4439
  ]
}

For parse_label_id, I should find the ids in the ram/data/tag_list.txt file, right?
That file only has 3429 ids, yet I see an id of 4480!

So, to summarize: if I want to modify only the tagging part of RAM++, in which file should I add my tags (maybe just one)?
And my dataset could look something like this:

{
  "image_path":"datasets/celebrities/CELEB_00001.jpg",
  "parse_label_id":[
    [
      9999
    ]
  ],
  "caption":[
    "Michael Jordan"
  ],
  "union_label_id":[
     8888
  ]
}

xinyu1205 commented on September 26, 2024

parse_label_id refers to the tags parsed from the image caption.
union_label_id refers to the full set of tags for the image.
Therefore, if you only have an image-tag dataset, you just need to set the image tags as union_label_id.
And you only need the loss_tag and loss_dis in RAM or RAM++.
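
A minimal sketch of how an image-tag-only annotation entry could be assembled. It assumes a tag's id is its zero-based line index in the tag list file, which this thread does not confirm. The helper name `build_entry` and the three-item tag list are hypothetical; only the `image_path` and `union_label_id` keys come from the examples in this issue.

```python
import json


def build_entry(image_path, tags, tag_list):
    """Return one annotation dict carrying only union_label_id.

    ASSUMPTION: a tag's id is its zero-based position in tag_list,
    i.e. its line index in the tag list file.
    """
    index = {tag: i for i, tag in enumerate(tag_list)}
    missing = [t for t in tags if t not in index]
    if missing:
        raise ValueError(f"tags not in tag list: {missing}")
    return {
        "image_path": image_path,
        "union_label_id": sorted(index[t] for t in tags),
    }


# Hypothetical three-line tag list; a real one would be read from the file.
tag_list = ["cake", "woman", "michael jordan"]
entry = build_entry(
    "datasets/celebrities/CELEB_00001.jpg", ["michael jordan"], tag_list
)
print(json.dumps(entry))
```

Whether the repository's dataset loader accepts entries that omit caption or parse_label_id is not established here, so those keys may still need to be present.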

adbmdp commented on September 26, 2024

Ok so I just need:

{
  "image_path":"datasets/celebrities/CELEB_00001.jpg",
  "caption":[
    "Michael Jordan"
  ],
  "union_label_id":[
     new id from ram/data/ram_tag_list.txt
  ]
}

"And you only need the loss_tag and loss_dis in RAM or RAM++"

I don't know what you mean here, but I'll try to find out. Do I have to change some code in finetune.py?

Thanks again for taking the time to reply 👍 🥇

adbmdp commented on September 26, 2024

Thanks. I'll do that.

adbmdp commented on September 26, 2024

So I'm trying to fine-tune the model on just one tag as a test (on my CPU).
I've added a new tag to recognize-anything/ram/data/ram_tag_list.txt, so there are now 4586 lines in this file.

I've modified the forward function:

def forward(self, image, caption, image_tag, clip_feature, batch_text_embed):
        image_embeds = self.image_proj(self.visual_encoder(image))
        image_atts = torch.ones(image_embeds.size()[:-1],
                                dtype=torch.long).to(image.device)
    
        ##================= Distillation from CLIP ================##
        image_cls_embeds = image_embeds[:, 0, :]
        image_spatial_embeds = image_embeds[:, 1:, :]
    
        loss_dis = F.l1_loss(image_cls_embeds, clip_feature)
    
        ###===========multi tag des reweight==============###
        bs = image_embeds.shape[0]
    
        des_per_class = int(self.label_embed.shape[0] / self.num_class)
    
        image_cls_embeds = image_cls_embeds / image_cls_embeds.norm(dim=-1, keepdim=True)
        reweight_scale = self.reweight_scale.exp()
        logits_per_image = (reweight_scale * image_cls_embeds @ self.label_embed.t())
        logits_per_image = logits_per_image.view(bs, -1, des_per_class)
    
        weight_normalized = F.softmax(logits_per_image, dim=2)
        label_embed_reweight = torch.empty(bs, self.num_class, 512).to(image.device).to(image.dtype)
    
        for i in range(bs):
            reshaped_value = self.label_embed.view(-1, des_per_class, 512)
            product = weight_normalized[i].unsqueeze(-1) * reshaped_value
            label_embed_reweight[i] = product.sum(dim=1)
    
        label_embed = torch.nn.functional.relu(self.wordvec_proj(label_embed_reweight))
    
        ##================= Image Tagging ================##
    
        tagging_embed = self.tagging_head(
            encoder_embeds=label_embed,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
            mode='tagging',
        )
    
        logits = self.fc(tagging_embed[0]).squeeze(-1)
    
        loss_tag = self.tagging_loss_function(logits, image_tag)
    
        # Ignore the image-text alignment loss
        loss_alignment = None
    
        # Return only the loss_tag and loss_dis losses
        return loss_tag, loss_dis

Here is my finetune.yaml file :

train_file: [
            'outputs/data.json',
             ]
image_path_root: ""

# size of vit model; base or large
vit: 'swin_l'
vit_grad_ckpt: False
vit_ckpt_layer: 0

image_size: 384
batch_size: 26

# optimizer
weight_decay: 0.05
init_lr: 5e-06
min_lr: 0
max_epoch: 2
warmup_steps: 3000

class_num: 4586

I launch the fine-tuning like this:
python3 finetune.py --model-type ram_plus --config ram/configs/finetune.yaml --checkpoint outputs/ram_plus/ram_plus_swin_large_14m.pth --output-dir outputs/ram_plus_ft --device cpu

RuntimeError: Error(s) in loading state_dict for RAM_plus:
	size mismatch for label_embed: copying a param with shape torch.Size([233835, 512]) from checkpoint, the shape in current model is torch.Size([233886, 512]).

I think the error message indicates that there is a size mismatch between the pre-trained model's label_embed layer and the current model's label_embed layer. This is likely due to a difference in the number of tags or classes between the pre-trained model and the current model. But I have no clue how to resolve this.
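
The two shapes in the error can be related with a little arithmetic. This sketch assumes the checkpoint was trained with one tag fewer than the 4586 now in the tag file, and it reuses the relation from the forward function above, where des_per_class is label_embed.shape[0] divided by num_class. The variable names are illustrative.

```python
ckpt_rows = 233835    # label_embed rows stored in the checkpoint (from the error)
model_rows = 233886   # label_embed rows the current model allocates (from the error)
new_class_num = 4586  # class_num in finetune.yaml
old_class_num = new_class_num - 1  # ASSUMPTION: exactly one tag was added

# Number of description embeddings per tag implied by the checkpoint.
des_per_class, remainder = divmod(ckpt_rows, old_class_num)
assert remainder == 0

print(des_per_class)                                # 51
print(new_class_num * des_per_class == model_rows)  # True
print(model_rows - ckpt_rows)                       # 51
```

Under that assumption, the model expects 51 description embeddings for the new tag that the checkpoint does not contain. How those extra rows should be produced, or how label_embed should be handled when loading the checkpoint, is not covered in this thread.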

Thanks!
