
Comments (5)

xinyu1205 avatar xinyu1205 commented on June 7, 2024

Did you use the image-tag recognition decoder on the tagging task to obtain the Grad-CAM? Figure 7 of Tag2Text is obtained from the backward gradient of the image-tag interaction encoder on the generation task.
I have also found that the image-tag recognition decoder's Grad-CAM is often a meaningless scatter plot, even when it predicts high logits. Normally, with good recognition performance, its Grad-CAM should be very accurate. I haven't found the reason yet.
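For reference, the Grad-CAM computation under discussion can be sketched as follows. This is a generic numpy illustration of the standard Grad-CAM formula (gradient-weighted sum of feature maps followed by ReLU), not the actual Tag2Text/RAM code, and the array shapes are assumptions:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Generic Grad-CAM: weight each channel's feature map by the mean of
    its gradients over the spatial dimensions, sum the weighted maps,
    then apply ReLU to keep only positive evidence.

    activations, gradients: arrays of shape (C, H, W).
    """
    weights = gradients.mean(axis=(1, 2))             # (C,)
    cam = np.tensordot(weights, activations, axes=1)  # (H, W)
    return np.maximum(cam, 0)

# Toy example with random feature maps and gradients.
acts = np.random.rand(8, 7, 7)
grads = np.random.rand(8, 7, 7)
cam = grad_cam(acts, grads)
print(cam.shape)  # (7, 7)
```

If the map produced this way is a meaningless scatter, the hooked gradients themselves are usually the thing to inspect first.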

from recognize-anything.

SKBL5694 avatar SKBL5694 commented on June 7, 2024

It seems that I am indeed computing Grad-CAM on the recognition task, because your code does not enable the generation task for RAM. I have added the generation task to RAM in the same way as Tag2Text, and I will test it. Thank you.
I also share your opinion that "with good recognition performance, its Grad-CAM should be very accurate." However, I have encountered similar situations in other discussions about Grad-CAM on Swin-Transformer-based models (e.g. pytorch-grad-cam/issues/84), and unfortunately those discussions were fruitless. I suspect the patch-merging operation in Swin Transformer makes the features lose their traditional spatial structure, but that cannot explain why Grad-CAM is sometimes accurate, so I am still confused. Thanks for your reply; I'll try it again. Thank you again for your excellent work and kind reply.
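One thing worth checking with Swin-style backbones is the token layout: Grad-CAM-style code expects (C, H, W) feature maps, while transformer blocks emit (B, N, C) token sequences, so the hooked activations and gradients usually need a reshape before the CAM is computed (pytorch-grad-cam exposes a `reshape_transform` argument for this purpose). A minimal numpy sketch of that reshape, assuming a square token grid with no class token:

```python
import numpy as np

def reshape_tokens(tokens, height, width):
    """Convert a (B, N, C) token sequence into (B, C, H, W) feature maps,
    the layout Grad-CAM-style code expects. Assumes N == height * width."""
    b, n, c = tokens.shape
    assert n == height * width
    maps = tokens.reshape(b, height, width, c)
    return maps.transpose(0, 3, 1, 2)  # (B, H, W, C) -> (B, C, H, W)

tokens = np.zeros((3, 49, 768))            # e.g. a 7x7 grid of 768-dim tokens
print(reshape_tokens(tokens, 7, 7).shape)  # (3, 768, 7, 7)
```

If the reshape does not match the model's actual token ordering (patch merging changes the grid size between stages), the resulting map degrades into exactly the kind of scatter described above.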


xinyu1205 avatar xinyu1205 commented on June 7, 2024

Thank you for your interest and your kind words. You are welcome to provide feedback if you have more issues.


SKBL5694 avatar SKBL5694 commented on June 7, 2024

I think I have a problem performing the backward gradient calculation with the image-tag interaction encoder. My approach is to pre-define a hook and then register it at the location I need. The general idea is as follows:

```python
gradients = None

def backward_hook(module, grad_input, grad_output):
    # Store the gradients flowing out of the visual encoder.
    global gradients
    print('Backward hook running...')
    gradients = grad_output  # tuple of tensors
    print(f'Gradients size: {gradients[0].size()}')

# Keep the handle under a different name so the hook function is not overwritten.
hook_handle = model.visual_encoder.register_full_backward_hook(backward_hook)
```

Then I call backward on the category for which I need the gradient. For example, with the recognition decoder, I can do the following for any class logit:
logits[0, 252-1].backward() (where 252 is the line number of the word "cat" in ram_tag_list)
But the interaction encoder's output is not a scalar; it has shape (#beam, max_length, #features), e.g. (3, 40, 768).
You mentioned that the Grad-CAM in Figure 7 is obtained from the interaction encoder. Does that mean I should call backward on this output to compute the gradient, or is some other operation needed?
In addition, I also tried to call .backward() on the output of the text generation decoder, but since self.text_decoder is an instance from the official transformers library, its generate method does not track gradients, so I cannot call backward() on its output. I hope you can give me some ideas; I want to reproduce results similar to Figure 7.
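For the non-scalar interaction-encoder output, one common workaround (a sketch of the general technique, not necessarily what Figure 7 actually does) is to reduce the (#beam, max_length, #features) tensor to a scalar before calling backward, e.g. by selecting one beam and summing the features at the token positions of interest; the token positions below are hypothetical:

```python
import torch

# Stand-in for the interaction encoder output: (#beam, max_length, #features).
output = torch.randn(3, 40, 768, requires_grad=True)

# Hypothetical token positions corresponding to the word of interest.
token_positions = [5, 6]

# Reduce to a scalar: pick beam 0, keep only the chosen positions, sum.
score = output[0, token_positions, :].sum()
score.backward()

print(output.grad.shape)  # torch.Size([3, 40, 768])
```

Any registered backward hooks fire during this backward pass, so the hooked gradients can then be combined with the activations as usual. The choice of reduction (sum, mean, a single feature norm) changes what the resulting map highlights.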


pribadihcr avatar pribadihcr commented on June 7, 2024

Hi @SKBL5694, have you resolved it?

