sunanhe / MKT
Official implementation of "Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer".
License: MIT License
Hi, I think this is great work. How long does the training stage take, and which GPUs did you use?
Hi, I am currently reproducing your paper, and I have a question.
According to your paper, the backbone is ViT and the VLP model is CLIP.
But the first-stage code loads both the backbone and the VLP model from the same argument (args.clip_path, https://github.com/sunanhe/MKT/blob/main/train_nus_first_stage.py).
Are they the same or different?
Hello, I am trying to reproduce your experiments, but I don't know how the dataset directory structure described in your paper should be built.
The dataset downloaded from the link you provide is not organized according to your structure, so which files should go into the feature and Flicker folders in your layout?
Also, where can I download nus_wide_test.h5 and nus_wide_train.h5? I could not find download links for these two files on the official NUS-WIDE site you pointed to. Could you please advise?
My experience is limited, so please bear with me if this is a silly question.
In the process of reproducing, I found that directly using CLIP for multi-label zero-shot image classification tasks on the NUSWIDE dataset yielded the following results:
It seems to align with the best results listed in your paper. Does the proposed method contribute to the improvement in recognizing samples from unseen classes? Or is the state-of-the-art performance primarily attributed to the powerful capabilities of the CLIP network?
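For context, the CLIP-only baseline described above can be sketched as follows. This is a minimal illustration with random stand-in features (`zero_shot_multilabel`, `top_k`, and the tensor shapes are hypothetical names and sizes); a real run would use CLIP's image and text encoders to produce the features:

```python
import torch

def zero_shot_multilabel(image_feat, label_emb, top_k=3):
    """Score every candidate label against one image embedding via cosine
    similarity and return the indices of the top-k labels. This is the
    common CLIP-only baseline, not the MKT method itself."""
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    label_emb = label_emb / label_emb.norm(dim=-1, keepdim=True)
    scores = image_feat @ label_emb.t()        # (1, n_labels)
    return scores.topk(top_k, dim=-1).indices  # (1, top_k)

# Stand-in features: 1 image, 5 candidate labels, 512-d embeddings.
torch.manual_seed(0)
img = torch.randn(1, 512)
labels = torch.randn(5, 512)
top = zero_shot_multilabel(img, labels, top_k=3)
```

Whether the gain over this baseline comes from the knowledge-transfer stages or from CLIP itself is exactly the question above; the sketch only pins down what the baseline computes.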
Your paper is really a good work about Open-Vocabulary Multi-Label Classification. But I have two questions:
--- How do you choose negative samples in your ranking loss?
--- The paper states: "Motivated by CoOp (Zhou et al. 2021), we introduce prompt tuning for the adaptation of label embedding. During the tuning process, all parameters except for context embedding of prompt template illustrated as the dotted box named prompt template in Figure 2, are fixed." What exactly are the parameters of the context embedding of the prompt template?
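On the first question, a common choice in multi-label ranking losses is to treat every label absent from the image as a negative and penalize each positive/negative score pair that violates a margin. The sketch below shows that common formulation; it is an illustration, not necessarily the exact MKT loss, and the margin value is an assumption:

```python
import torch

def pairwise_ranking_loss(logits, targets, margin=1.0):
    """For each image, require every positive label's score to exceed
    every negative label's score by `margin`. A sketch of the common
    pairwise formulation, not necessarily the exact MKT loss."""
    total, count = logits.new_zeros(()), 0
    for scores, t in zip(logits, targets):
        pos, neg = scores[t == 1], scores[t == 0]
        if pos.numel() and neg.numel():
            # (n_pos, n_neg) matrix of margin violations, hinged at zero
            viol = torch.clamp(margin + neg.unsqueeze(0) - pos.unsqueeze(1), min=0)
            total = total + viol.mean()
            count += 1
    return total / max(count, 1)
```

With this formulation the "negative samples" are simply all labels marked 0 for the image; no explicit sampling step is required unless the label vocabulary is very large.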
Hi, I am wondering if you could provide training code for Open Images as well :) Thanks in advance..!
Hello, I'm interested in your work and am reproducing the experimental results with your code. I find that the mAP in the ZSL setting on NUS-WIDE after first-stage training is 42.2 with the provided checkpoint, but the result I reproduce is around 36. Other results, such as F1 and the GZSL setting, are close to yours. I followed the settings and hyperparameters in the paper and cannot find the reason. Could you help me?
Thanks for releasing the code. When I use the code and data as provided, I get this error when training NUS-WIDE in the first stage.
So I checked the file "nus_wide_train.h5" downloaded from Google Drive; it is the same size, 7.40 GB.
Then I printed the labels during the dataloader process and got the output below (many labels are None):
I tried to fix this problem as follows:
However, I still get a dataloader error during training. Could you help me fix it?
label_emb.pt
For a new dataset, how do I create label_emb.pt?
Nice work! How was the "label_emb.pt" file generated, and what should I do if I want to generate it myself?
Thanks!
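For anyone else wondering: the usual way to produce such a file is to tokenize each class name inside a prompt template and run it through CLIP's text encoder. The sketch below keeps the tokenizer and encoder as arguments so it runs without downloading CLIP; in practice you would pass clip.tokenize and model.encode_text from the openai/CLIP package. The template string, normalization step, and file name are assumptions, not confirmed against the MKT code:

```python
import torch

def build_label_embeddings(class_names, tokenize, encode_text, path="label_emb.pt"):
    """Embed each class name with a prompt template through a CLIP-style
    text encoder and save the stacked result. `tokenize`/`encode_text`
    stand in for clip.tokenize and model.encode_text; the template and
    file name are illustrative, not confirmed against the MKT code."""
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        emb = encode_text(tokenize(prompts))        # (n_classes, dim)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize
    torch.save(emb, path)
    return emb

# Stand-in tokenizer/encoder so the sketch runs self-contained;
# replace with clip.tokenize and model.encode_text in practice.
fake_tokenize = lambda texts: torch.arange(len(texts))
fake_encode = lambda toks: torch.randn(toks.numel(), 512)
emb = build_label_embeddings(["dog", "cat"], fake_tokenize, fake_encode,
                             path="demo_label_emb.pt")
```

For a new dataset you would call this once with the new class-name list and point the training config at the resulting file.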
When will you release the code? Thanks!
This paper is great! I'm just wondering about the implementation with the NUS-WIDE dataset: is it intended that you download all the images from the NUS-WIDE image list yourself, since no file containing all the images is provided?
If so, which image size did the model use? The dataset provides three different sizes for each image.
Hi, thanks for your excellent work! Regarding the NUS-WIDE labels: when a label is -1, does it indicate that the corresponding category is not present in the image, or that the category is simply not annotated for that image?
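Whichever interpretation applies here, a common convention in NUS-WIDE-style annotations is 1 = present, 0 = absent, -1 = unannotated, and losses then mask out the -1 entries. A hedged sketch of that masking (illustrative convention and function name, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

def masked_bce(logits, targets):
    """Binary cross-entropy over annotated labels only:
    entries marked -1 are excluded from the loss.
    (Common convention; confirm against the dataset docs.)"""
    mask = targets != -1
    return F.binary_cross_entropy_with_logits(
        logits[mask], targets[mask].float()
    )

logits = torch.tensor([[2.0, -2.0, 0.5]])
targets = torch.tensor([[1, 0, -1]])  # third label unannotated
loss = masked_bce(logits, targets)
```

If -1 instead meant "definitely absent", one would map it to 0 rather than mask it, so the answer to this question changes the training signal.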
Hi! The dimension of the last fully connected layer is not mentioned in the paper. Can you provide more detail?
Why is the label embedding in stage2 initialized by nn.embedding instead of CLIP text_encoder?
# nn.Embedding here is presumably a container that is loaded with CLIP's
# pretrained token-embedding weights, not a fresh random initialization.
self.token_embedding = nn.Embedding(args.vocab_size, args.transformer_width)
self.label_emb = torch.zeros((len(self.name_lens), max(self.name_lens), self.transformer_width)).to(self.device)
for i, embed in enumerate(self.token_embedding(self.label_token)):
    # Copy only the class-name tokens, skipping the 4 prompt-prefix tokens.
    self.label_emb[i][:self.name_lens[i]] = embed[4:4+self.name_lens[i]].clone().detach()
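If it helps, here is a runnable toy version of the snippet above showing what the slicing does: embed[4:4+name_len] copies each class name's token embeddings while skipping the four prompt-prefix positions ahead of it. The vocab size, width, and token ids below are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy setup: vocab of 10 tokens, width 4; two class names of length 1 and 2.
# In the real code, token_embedding would hold CLIP's token-embedding weights.
torch.manual_seed(0)
token_embedding = nn.Embedding(10, 4)
name_lens = [1, 2]
# Each row: [SOT, prompt, prompt, prompt, <class-name tokens>, padding...]
label_token = torch.tensor([[1, 2, 3, 4, 5, 0, 0],
                            [1, 2, 3, 4, 6, 7, 0]])

label_emb = torch.zeros(len(name_lens), max(name_lens), 4)
for i, embed in enumerate(token_embedding(label_token)):
    # Class-name tokens start at position 4, after the prompt prefix.
    label_emb[i, :name_lens[i]] = embed[4:4 + name_lens[i]].clone().detach()
```

So the answer to "why nn.Embedding" may simply be that it stores CLIP's token-embedding table, while the full text encoder is not needed for this lookup; the authors would have to confirm.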
Hey, I was wondering how you generated the label embeddings in the "other files" section used for the first training stage. I couldn't see in the paper how these were generated. Is it just the CLIP model?