Comments (7)
It means you need to modify the forward function of ram.py or ram_plus.py.
And I strongly recommend that you read the RAM or RAM++ paper before completing these tasks.
from recognize-anything.
Thanks for your attention.
Actually, this is certainly feasible. The performance of the model depends on the quality of your fine-tuning dataset.
from recognize-anything.
Thanks for your reply and your awesome work @xinyu1205 !!
OK let's say I want to train the model with a celebrity dataset.
I have trouble understanding which tag file I need to update with the new tags.
To my understanding, parse_label_id refers to the tag indices present in ram/data/tag_list.txt, and union_label_id refers to the tag indices present in ram/data/ram_tag_list.txt.
But when I look at the COCO dataset, for example, I find entries like this:
{
    "image_path": "coco/val2014/COCO_val2014_000000522418.jpg",
    "parse_label_id": [[4480, 4532, 678]],
    "caption": ["there is a woman that is cutting a white cake"],
    "union_label_id": [4480, 2624, 2051, 678, 2599, 2577, 4532, 1238, 215, 2332, 4439]
}
For parse_label_id, I should find the ID in the ram/data/tag_list.txt file, right?
But this file only has 3429 IDs, and I see an ID 4480!
So, to summarize: if I want to modify only the tagging part of RAM++, in which file should I add my tags (maybe just one)?
And can my dataset be something like this:
{
    "image_path": "datasets/celebrities/CELEB_00001.jpg",
    "parse_label_id": [[9999]],
    "caption": ["Michael Jordan"],
    "union_label_id": [8888]
}
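The mismatch between the 3429-line file and an ID like 4480 can be checked mechanically. Here is a minimal sketch, assuming label IDs are zero-based line indices into a tag list file (that indexing convention is my assumption, not something confirmed in this thread). The helper name `ids_out_of_range` is made up for illustration, and 4585 is the size of ram_tag_list.txt implied later in the thread (4586 lines after one tag was added).

```python
def ids_out_of_range(entry, num_tags):
    """Return the sorted label IDs in `entry` that cannot index a list of `num_tags` tags."""
    ids = set(entry.get("union_label_id", []))
    # parse_label_id is a list of ID lists, one per caption
    for group in entry.get("parse_label_id", []):
        ids.update(group)
    return sorted(i for i in ids if not 0 <= i < num_tags)


# The COCO entry quoted above, reduced to its label fields
coco_entry = {
    "parse_label_id": [[4480, 4532, 678]],
    "union_label_id": [4480, 2624, 2051, 678, 2599, 2577, 4532, 1238, 215, 2332, 4439],
}

print(ids_out_of_range(coco_entry, 3429))  # [4439, 4480, 4532]
print(ids_out_of_range(coco_entry, 4585))  # []
```

Under that assumption, the quoted IDs cannot all come from a 3429-entry list, but they all fit in a 4585-entry one.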
from recognize-anything.
parse_label_id refers to the tags parsed from the image caption.
union_label_id refers to the full set of tags for the image.
Therefore, if you only have an image-tag dataset, you just need to set the image tags as union_label_id.
And you only need loss_tag and loss_dis in RAM or RAM++.
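For an image-tag dataset, building such an entry could look roughly like the sketch below. It assumes union_label_id holds zero-based indices into the tag list, which is my reading and not confirmed here. `build_entry` and the three-item toy `tag_list` are hypothetical stand-ins; in practice the list would be the lines of the tag list file. Whether the data loader also requires other keys is not something this sketch settles.

```python
import json


def build_entry(image_path, caption, tags, tag_list):
    """Build one annotation dict whose union_label_id indexes `tags` in `tag_list`."""
    index = {tag: i for i, tag in enumerate(tag_list)}
    missing = [t for t in tags if t not in index]
    if missing:
        raise KeyError(f"tags not in tag list: {missing}")
    return {
        "image_path": image_path,
        "caption": [caption],
        "union_label_id": sorted(index[t] for t in tags),
    }


# Toy stand-in for a tag list with one newly appended tag (hypothetical contents)
tag_list = ["cake", "woman", "michael jordan"]

entry = build_entry(
    "datasets/celebrities/CELEB_00001.jpg",
    "Michael Jordan",
    ["michael jordan"],
    tag_list,
)
print(json.dumps(entry, indent=4))
```

Looking tags up by name rather than typing IDs by hand keeps the annotation in sync with the tag list if lines are ever added.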
from recognize-anything.
Ok so I just need:
{
    "image_path": "datasets/celebrities/CELEB_00001.jpg",
    "caption": ["Michael Jordan"],
    "union_label_id": [
        new id from ram/data/ram_tag_list.txt
    ]
}
"And you only need the loss_tag and loss_dis in RAM or RAM++"
I don't know what you mean here, but I'll try to find out. Do I have to change some code in finetune.py?
Thanks again for taking the time to reply 👍 🥇
from recognize-anything.
Thanks. I'll do that.
from recognize-anything.
So I'm trying to fine-tune the model on just one tag as a test (on my CPU).
I've added a new tag to recognize-anything/ram/data/ram_tag_list.txt, so there are now 4586 lines in this file.
I've modified the forward function:
# Assumes the module already has: import torch; import torch.nn.functional as F
def forward(self, image, caption, image_tag, clip_feature, batch_text_embed):
    # caption and batch_text_embed are unused once the alignment loss is dropped
    image_embeds = self.image_proj(self.visual_encoder(image))
    image_atts = torch.ones(image_embeds.size()[:-1],
                            dtype=torch.long).to(image.device)

    ##================= Distillation from CLIP ================##
    image_cls_embeds = image_embeds[:, 0, :]
    image_spatial_embeds = image_embeds[:, 1:, :]

    loss_dis = F.l1_loss(image_cls_embeds, clip_feature)

    ###===========multi tag des reweight==============###
    bs = image_embeds.shape[0]
    # number of description embeddings stored per tag
    des_per_class = int(self.label_embed.shape[0] / self.num_class)

    image_cls_embeds = image_cls_embeds / image_cls_embeds.norm(dim=-1, keepdim=True)
    reweight_scale = self.reweight_scale.exp()
    logits_per_image = (reweight_scale * image_cls_embeds @ self.label_embed.t())
    logits_per_image = logits_per_image.view(bs, -1, des_per_class)

    weight_normalized = F.softmax(logits_per_image, dim=2)
    label_embed_reweight = torch.empty(bs, self.num_class, 512).to(image.device).to(image.dtype)

    # weighted sum of each tag's description embeddings, per image
    for i in range(bs):
        reshaped_value = self.label_embed.view(-1, des_per_class, 512)
        product = weight_normalized[i].unsqueeze(-1) * reshaped_value
        label_embed_reweight[i] = product.sum(dim=1)

    label_embed = torch.nn.functional.relu(self.wordvec_proj(label_embed_reweight))

    ##================= Image Tagging ================##
    tagging_embed = self.tagging_head(
        encoder_embeds=label_embed,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=False,
        mode='tagging',
    )

    logits = self.fc(tagging_embed[0]).squeeze(-1)

    loss_tag = self.tagging_loss_function(logits, image_tag)

    # Skip the text-image alignment loss
    loss_alignment = None

    # Return only loss_tag and loss_dis
    return loss_tag, loss_dis
Here is my finetune.yaml file:
train_file: [
    'outputs/data.json',
]
image_path_root: ""
# size of vit model; base or large
vit: 'swin_l'
vit_grad_ckpt: False
vit_ckpt_layer: 0
image_size: 384
batch_size: 26
# optimizer
weight_decay: 0.05
init_lr: 5e-06
min_lr: 0
max_epoch: 2
warmup_steps: 3000
class_num: 4586
I launch the fine-tuning like this:
python3 finetune.py --model-type ram_plus --config ram/configs/finetune.yaml --checkpoint outputs/ram_plus/ram_plus_swin_large_14m.pth --output-dir outputs/ram_plus_ft --device cpu
and I get this error:
RuntimeError: Error(s) in loading state_dict for RAM_plus:
    size mismatch for label_embed: copying a param with shape torch.Size([233835, 512]) from checkpoint, the shape in current model is torch.Size([233886, 512]).
I think the error message indicates that there is a size mismatch between the pre-trained model's label_embed layer and the current model's label_embed layer. This is likely due to a difference in the number of tags or classes between the pre-trained model and the current model. But I have no clue how to resolve this.
Thanks!
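The two shapes in the error can be tied back to the tag count with a little arithmetic. This is only a sketch of that reasoning, using the numbers from the error message and the same rows-per-class division as `des_per_class` in the forward function above; `rows_per_class` is a made-up helper name, and 4585 is assumed to be the tag count before the new tag was added.

```python
def rows_per_class(num_rows, num_classes):
    """Number of label_embed rows stored per tag; fails if the split is uneven."""
    if num_rows % num_classes != 0:
        raise ValueError(f"{num_rows} rows do not divide evenly into {num_classes} classes")
    return num_rows // num_classes


checkpoint_rows = 233835  # label_embed rows in the checkpoint (from the error)
model_rows = 233886       # label_embed rows in the model built with class_num: 4586

print(rows_per_class(checkpoint_rows, 4585))  # 51
print(rows_per_class(model_rows, 4586))       # 51
print(model_rows - checkpoint_rows)           # 51
```

So the model built with 4586 classes expects 51 more rows than the checkpoint provides, exactly one tag's worth. If that reading is right, the checkpoint simply has no embedding rows for the added tag; how those rows should be supplied is not covered in this thread.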
from recognize-anything.