
Comments (44)

AvantiShri avatar AvantiShri commented on June 3, 2024 3

@EricSpeidel for what it’s worth, it has been shown that for networks that have only ReLU or maxpooling operations, the “z-rule” of LRP is equivalent to gradient-times-input when biases are included as inputs and the “epsilon” value used for numerical stability is set to 0. This equivalence was (to my knowledge) first noted in the precursor to the DeepLIFT paper (https://arxiv.org/pdf/1605.01713.pdf) and was also shown in a paper by Kindermans et al (https://arxiv.org/pdf/1611.07270.pdf). I think the Kindermans et al proof is more concise/polished, and I have pasted it below for convenience. Of course, it is worth noting that there are many other variants of LRP that are not equivalent to gradient*input and are of interest, particularly the variants involving RNNs.

[Screenshot of the Kindermans et al. proof omitted]
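To make the equivalence concrete, here is a minimal sketch of my own (not from either paper): it propagates the z-rule by hand through a tiny all-ReLU MLP, with epsilon set to 0 and the bias kept inside the denominator z_k, and compares the result with gradient*input.

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

x = torch.randn(1, 4, requires_grad=True)
out = model(x)

# gradient * input
grad = torch.autograd.grad(out, x)[0]
grad_times_input = (grad * x).detach()

# z-rule (epsilon = 0), propagated manually from the output back to the input
layers = list(model)
activations = [x.detach()]
for m in layers:
    activations.append(m(activations[-1]))

relevance = activations[-1]            # initialize with the output score
for idx in reversed(range(len(layers))):
    m = layers[idx]
    a = activations[idx]               # activations entering this layer
    if isinstance(m, nn.Linear):
        z = m(a)                       # z_k, bias included
        s = relevance / z              # R_k / z_k
        c = s @ m.weight               # sum_k w_jk * s_k
        relevance = a * c              # R_j = a_j * c_j
    # ReLU: relevance passes through unchanged

print(torch.allclose(relevance, grad_times_input, atol=1e-5))  # expected: True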


nanohanno avatar nanohanno commented on June 3, 2024 3

Hi @NarineK ,
we started working on a research project on attribution methods, specifically LRP. We need an implementation in PyTorch and are considering aligning with or even adding it to captum. What is the status of your LRP implementation, and would you be open to a contribution from our side or some form of collaboration? We would be very happy to add value to your brilliant project.
Best regards,
Hanno


marcoancona avatar marcoancona commented on June 3, 2024 3

Hi all,
I came across this discussion by chance. Sorry if I am late. I see there was quite some discussion about how to implement LRP in Captum. This is indeed something that gave me a lot to think about in the past.
While it is definitely possible to implement epsilon-LRP as done for DeepLIFT, I would argue that the epsilon rule is the least interesting one. In fact, DeepLIFT can be seen as a generalization of epsilon-LRP and, in my opinion, DeepLIFT is preferable for all use-cases.
More interesting would be to have the alpha-beta rule in place but, unfortunately, this cannot be computed by only replacing the gradient at the non-linear operations. Instead, it requires a backward hook at the linear layers and possibly an ad-hoc implementation for all types of layers. By the way, the same kind of difficulty applies to DeepLIFT RevealCancel, which would also be a very useful benchmark to support (as a fast Shapley value approximation) but is trickier to implement with gradient tricks.


NarineK avatar NarineK commented on June 3, 2024 1

Hi @EricSpeidel, thank you for bringing this up. Yes, this is on my radar.

There was also an issue created in pytorch discussions a while ago regarding LRP.
https://discuss.pytorch.org/t/layer-wise-relevance-propagation-lrp-in-pytorch/46730


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK , great to hear that. I tried to assemble a small design document with our basic idea of how to implement LRP. I hope this is what you had in mind. As we are just at the beginning of the project our plan is not very detailed and thorough, yet.
Furthermore, regarding implementation, is it now mandatory to use type hinting as stated in the CONTRIBUTING guidelines?
Best regards,
Hanno

Design documentation: Layer-wise relevance propagation in captum

author: Hanno Küpers, [email protected]

reviewer:

Overview

Layer-wise relevance propagation (LRP) is a method for generating attribution heatmaps for neural networks; see Bach et al. and Montavon et al. Relevance is propagated backwards through the network, relating the predicted confidence value to single pixels. Different rules exist for propagating relevance through each layer of the network. It is therefore not gradient based but relies on a layer-by-layer back-propagation of relevance, which gives the relevance values the sensitivity and completeness properties.
LRP may be integrated into the captum library, following the same API, to give user-friendly access to the method in PyTorch.

Context

Captum already contains implementations of several methods for generating plausibility heatmaps for neural networks. However, as many competing methods exist and need to be compared to find the method that fits a given problem, adding LRP will add value as it is a widely used and well known method.

Existing implementations

A collection of implementations exists from the group of Prof. Müller at Technical University Berlin, which also presents a basic implementation in PyTorch.
Furthermore, innvestigate, a larger framework of attribution methods exists for Keras, containing LRP with several implemented rules (z-rule, epsilon-rule, w^2-rule, alpha_beta-rule).

Milestones

  • Implementation of LRP-alpha1beta0 for CNNs with convolutional and pooling layers and tests until end of January 2020.
  • Addition of more propagation rules until end of February 2020.

Technical architecture

The method will be available as a class in captum.attr._core.layerwise_relevance_propagation. Attribution will be computed by the method attribute(). The compute_convergence_delta() method may return the difference between confidence and attribution, based on the completeness property. Here, the implementation in the class captum.attr._utils.attribution.GradientAttribution may work with the correct baseline value.

Different propagation rules may be implemented as inheriting classes, private classes that are passed to the main class or by selection of an input parameter of the main class.
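For illustration, hypothetical usage following this design (the module path and the class name LRP are working names from the proposal, not necessarily the final API):

from captum.attr._core.layerwise_relevance_propagation import LRP

lrp = LRP(model)                                  # propagation rules may be attached per layer
attributions = lrp.attribute(inputs, target=3)
# compute_convergence_delta() would compare the summed attributions
# with the model confidence, based on the completeness property
delta = lrp.compute_convergence_delta(attributions, model(inputs))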

  • First, the activations "a" are calculated for each layer (the layers of the model will be recognized during creation of the LRP object) by a complete forward pass.
  • Then, the score of the target class is propagated backwards, layer by layer, for which the weights "w" of the model are needed. Here, different rules are implemented.
  • The propagation typically consists of four steps (at least for alpha-beta rules; see the sketch after this list):
    1. Forward pass through the layer: z=a.dot(w)
    2. Divide the Relevance scores element-wise by z values: s=R_k/z
    3. Backward pass through the layer: c=w.dot(s)
    4. Multiply element-wise with activation: R_j=a*c
  • For the relevance propagation through convolutional and linear layers, pytorch's forward() and backward() functions will be used.
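For illustration, the four steps above could be sketched with autograd roughly as follows (an illustrative sketch in the style of the heatmapping.org tutorial, not part of the actual design; the function name is hypothetical):

import torch

def propagate_relevance(layer, a, relevance, eps=1e-9):
    # a: input activations of the layer, saved during the initial forward pass
    a = a.clone().detach().requires_grad_(True)
    z = layer(a) + eps                  # 1. forward pass through the layer (incl. bias)
    s = (relevance / z).detach()        # 2. element-wise division: s = R_k / z_k
    (z * s).sum().backward()            # 3. backward pass: c = w.dot(s), read from a.grad
    c = a.grad                          #    (parameter gradients accumulated here are ignored)
    return a.detach() * c               # 4. element-wise multiplication: R_j = a_j * c_j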

Testability

Tests on the attribution functionality will be implemented using test.attr.helpers.utils.BaseTest and models from tests.attr.helpers.basic_models.


NarineK avatar NarineK commented on June 3, 2024 1

Hi @nanohanno !
Thank you again for the design documentation for LRP!
I went through it and some of the papers and implementations that you have mentioned in your comment above.
In general sounds great to me.

Couple ideas and questions:

  1. How are you thinking of specifying what rules are going to be used for which layers?

    Marco, for example, applies the epsilon rule to all non-linear modules in his library:
 https://github.com/marcoancona/DeepExplain/blob/master/deepexplain/tensorflow/methods.py#L314

  2. We usually implement general input/feature attributions (the attribution of the output with respect to the input features) under attr/_core, layer attribution (the attribution of the output with respect to the selected layer) under attr/_core/layer and neuron attribution (the attribution of internal selected neuron with respect to the inputs) under attr/_core/neuron.
    Maybe we can think about those different variants as well while implementing it but they can be added later.
    Starting with an implementation in captum.attr._core.layerwise_relevance_propagation sounds great to me.

  3. I looked into this implementation: http://www.heatmapping.org/tutorial/
    and it looks like backward() is being called for each layer? Have you tried using backward hooks? Backward hooks have some imperfections, but they might let us get away with only one backward pass?

Thank you!


nanohanno avatar nanohanno commented on June 3, 2024 1

Dear @NarineK ,
thanks for your valuable feedback. To answer your points:

  1. Montavon et al. describe in the publication related to their PyTorch implementation (https://link.springer.com/chapter/10.1007/978-3-030-28954-6_10) that it makes sense to select different rules for different layers. Therefore, I intended to implement the propagation rules as separate classes and pass a list of rule instances to the LRP instance. As I understood it, Marco only implemented the epsilon (or z?) rule in his work and uses it for all layers.
  2. I guess there can be the option to return the relevance for all layers, not only the input layer. Thus, it can be a kind of layer attribution as well. Is that what you meant?
  3. You are right, backward() is called for each layer. Do you think the performance would be better using backward_hooks? Then, I guess for each layer a hook needs to be registered and backward() can be called once, right?

As you see, I have already started the implementation but I haven't found the best solution for all parts yet.
Thank you for your help


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK ,
I am back working on the implementation. Regarding your points:

  1. I try my best to stay consistent with the designed API, which I find nice and clear.
  2. Returning the attributions corresponding to a list of layers sounds good.
  3. I thought about using backward hooks for the implementation so that backward() could be used on the complete model. However, I did not find a way of doing it, because the backpropagation in the model is done using inputs coming from a forward pass on a modified model. Therefore, I stayed with the published implementation. I think gradient accumulation should not be a problem, as backward() is called on the output tensors of every layer only once. A big drawback, as I see it, is that each model layer needs to be detected at the beginning, and values are passed through the model by manually handing them from layer to layer. Here, custom layers like flatten() can cause problems and need to be handled. Unfortunately, I did not find a more robust way.
  4. I think the layer rules are focused on CV models; however, they also discuss input-layer rules for numerical values that are not pixels. I think the middle-layer rules should also apply to non-CV models.


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK ,

I have completed the first implementation of LRP now and pushed it to the forked repo.
After implementing it and reproducing published results on VGG16, I realized that the recommended implementation pattern does not support more complex models, for instance ones using skip connections. For the back-propagation of the relevance values, the computational graph needs to be executed in reverse. In this implementation, all layers are detected using forward hooks and the back-propagation is then done quasi-manually in a sequence. This solution reaches its limits when the graph is branched and layer inputs consist of multiple tensors that are added together.

The only option I found to overcome this limitation is to get the actual graph representation and use its structure for back-propagation, e.g. by getting the graph from a trace and then converting its .nodes() into the required information, similar to the following snippet:

def _get_graph(self, input):
    # trace the model to obtain its computational graph
    trace, out = torch.jit.get_trace_graph(self.model, args=(input,))
    trace_graph = torch.onnx.utils._optimize_graph(
        trace.graph(), torch.onnx.OperatorExportTypes.ONNX
    )
    # collect (scope, operator, outputs) for every node of the traced graph
    nodes = list()
    for node in trace_graph.nodes():
        op = node.kind()
        scope = node.scopeName()
        outputs = list(node.outputs())
        nodes.append((scope, op, outputs))
    return nodes

As I am not a very experienced PyTorch Developer I cannot say if this solution is feasible or if there is a smarter option. Do you have a recommendation?

On the good side, apart from the limitation to sequential models, the LRP module and rule module is working and I tried to provide documentation to all important methods. Do you think the LRP method is already of value for the project or do you need a more general applicability which may take a while to achieve?


NarineK avatar NarineK commented on June 3, 2024 1

Hi @nanohanno ! Thank you very much for the implementation. I'll check out the code and have a deeper look into it and will make some suggestions soon.


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK ,
sorry for not responding earlier. Thanks for your valuable feedback. The points you made make total sense to me. During the last days I have been busy taking a closer look at the mentioned paper and the DeepLift implementation. At the moment I am trying out a way to use backward hooks for the relevance propagation. I will come back to you in a couple of days when I have a clear idea.
Best regards,
Hanno


NarineK avatar NarineK commented on June 3, 2024 1

Awesome! Thank you :)


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK ,
I finished the basic implementation using backward hooks with the latest commit a1a4099. It can reproduce results from the previous implementation and works in principle with more complex architectures, as the backpropagation is based on autograd, which is great. However, models using skip connections are still a problem. The propagation works, but the addition in the skip connection is not treated correctly (there might also be other problems that I don't know about yet). As the addition is not an explicit module, the relevance is propagated over this node like the gradient. Actually, since the addition acts like a linear layer with all weights equal to 1, the relevance should be weighted according to the activations, using the z- or epsilon-rule. I have not found a solution yet. Do you have an idea?

I had a detailed look at Marco Ancona's publication and suggestion for implementation. In principle, his implementation using gradient override makes sense and the DeepLift implementation seems correct, but he states in the paper for the Tensorflow LRP snippet

After registering this function as the gradient for nonlinear activation functions, a call to tf.gradients() and the multiplication with the input will produce the $\epsilon$-LRP attributions.

However, as pointed out in the LRP literature, LRP focuses on the linear layers and not the non-linear parts. So the gradient override should be done on the linear layers, and a simpler gradient override (only passing the output gradient through as the input gradient) should be registered on the non-linear modules. Part of the confusion might come from the unclear definition of a layer and the role of its non-linear part. I had the feeling that, especially in the literature on attribution methods, this definition is often not very clear, which turns into problems when the method needs to be implemented in a deep learning framework.

Additionally, I moved towards using backward hooks on tensors instead of modules, as it appeared that backward hooks on modules (or complex modules) can lead to inconsistent results.

A brief description of the idea of the implementation (too bad, that LaTeX is not supported in github markdown):

Basic Rule (Z-Rule)

The basic rule for LRP is defined as:
$$ R_j= \sum_k\frac{a_j w_{jk}}{\sum_ja_jw_{jk}}R_k=a_j\sum_kw_{jk}\frac{R_k}{z_k}$$
with $R$ the respective relevances, $a_j$ the activations from the previous non-linear unit, $w_{jk}$ the weights of the linear layer, and $z_k$ the output from the linear layer. The formula on the right suggests that the propagation can be done by gradient backpropagation if the gradient on the output of the layer ($R_k$) is divided by the output of the layer $z_k$ and the resulting gradient on the input of the layer is multiplied by the layers activations $a_j$.

Inspired by the implementation of Ancona et al. we implemented LRP using a combination of forward and backward hooks to be able to use the autograd backward pass over a broad range of network architectures.
A forward hook (PropagationRule.forward_hook(self, module, inputs, outputs)) is registered on every linear module listed in SUPPORTED_LINEAR_LAYERS:

  1. This hook gets the activations of each layer ($a_i$) and saves it as module.activations.
  2. Then it registers a backward hook (PropagationRule._create_backward_hook_output(z_k)) on the output tensor outputs:
  3. This backward hook scales the output tensor in backward pass by the outputs $z_k$ ($R_k / z_k$).
  4. A second backward hook (PropagationRule._create_backward_hook_input(a_i)) is registered on the input tensor inputs:
  5. This hook multiplies the input gradient with the activations ($a_i * grad$).

When a forward and a backward pass are done on the inputs, the relevance is calculated during the backward pass. Intermediate values (relevance of hidden layers) may be extracted from the input tensors or via backward hooks on the modules.
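To make the mechanics concrete, here is a condensed, self-contained sketch of this pattern (the hook names follow the description above, but the code is illustrative: the stability term is simplified to a plain epsilon and only a single forward/backward pass is assumed):

import torch
import torch.nn as nn

EPS = 1e-9

def forward_hook(module, inputs, outputs):
    a = inputs[0].detach()              # activations a_j entering the linear layer
    z = outputs.detach()                # layer outputs z_k

    # turn the incoming relevance R_k into R_k / z_k
    outputs.register_hook(lambda grad: grad / (z + EPS))
    # multiply the resulting gradient by the activations: a_j * c_j = R_j
    inputs[0].register_hook(lambda grad: grad * a)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
for m in model:
    if isinstance(m, nn.Linear):        # non-linearities get no hooks and are skipped
        m.register_forward_hook(forward_hook)

x = torch.randn(1, 4, requires_grad=True)
out = model(x)
out.backward(torch.ones_like(out))      # relevance starts as 1, i.e. normalized by the output
relevance = x.grad                      # the input hook already multiplied by the input

Because the backward pass is started with a gradient of ones, the relevances are normalized by the output score, which is also what the convergence-delta discussion further down refers to.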

$\gamma$-Rule and $\alpha_1\beta_0$-Rule

For rules that manipulate the module itself, as for example the gamma rule, which enhances the positive weights of the module, a previous forward pass needs to be done to change the module.weight tensor and save the initial activations $a_i$ in a forward hook (PropagationRule_ManipulateModules.forward_hook_weights()) before changing the weights. The manipulation of the weights is defined in PropagationRule_ManipulateModules._manipulate_weights. Additionally, the initial activations need to be passed to the final forward hook by a forward_pre_hook (PropagationRule_ManipulateModules.forward_pre_hook_activations()).
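For illustration, the weight manipulation for the gamma rule could look roughly like this (w <- w + gamma * w^+; the function name mirrors _manipulate_weights above, but the exact signature in the branch may differ):

import torch

def _manipulate_weights(module, gamma=0.25):
    # gamma rule: amplify the positive part of the weights before propagation
    with torch.no_grad():
        module.weight += gamma * module.weight.clamp(min=0)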

Additional changes

I also worked on your suggestions for improvements. I moved the rules module, used util functions, used ABC for the abstract class, and added your small test case (which works now). I moved the VGG test to a notebook and will add it later as a tutorial; however, I left the test case in the tests because it was my reference for more complex models, but it will be removed later. Rules are saved as attributes of the layers (layer.rule). There are default rules now, and the suggested rules were an idea for providing optimized collections of rules for popular models such as VGG16. Finally, I have not tried multi-input or multi-output models yet due to time constraints.

I won't be available during the next two weeks but I am looking forward to reading your feedback.
Best regards,
Hanno


NarineK avatar NarineK commented on June 3, 2024 1

Hi @nanohanno!

Great Job!
Thank you very much again for all the work that you've done!

Regarding your questions:

The skip connection problem would have been better addressed in JIT graph mode. Wasn't that something that you also proposed in previous comments ? We will have hook support for JIT modules and hopefully that will help us with that issue.

In terms of addition:
We plan to add operator level hook support. If we were able to hook addition operator will that solve the problem ? 
Here is the PR for that, but I'm not sure when it will be merged:
pytorch/pytorch#28361 (comment)

Some other comments and suggestions:


  1. Is MaxPool2d linear?
There is a description of pooling on Wikipedia; it is described as a non-linear downsampling method: https://en.wikipedia.org/wiki/Convolutional_neural_network

  2. Do we want to add maxpool1d and maxpool3d similar to DeepLift ?
https://github.com/pytorch/captum/blob/master/captum/attr/_core/deep_lift.py#L1030

  3. For the .attribute method, let's fix the argument ordering and make it consistent with other algorithms:
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_core/layer_wise_relevance_propagation.py#L44

  4. Let's remove the hooks after creating them, for input_hook and output_hook.

  5. I've noticed that the rules are being passed as key-value pairs where the keys are the indices.
I think that it can be challenging for a user to specify the right index, especially if the network has layers at many different depths.
As a first version, maybe we can keep things simple and ask the user to attach the rules to each module before calling the attribute method. This will give the user full power to define and attach the rules the way they want. If there is no rule attached to a module, we can use a default rule.
Basically this will be done by the user:
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_core/layer_wise_relevance_propagation.py#L282
What do you think ?

  6. Do you want to move the logic for the cases when you output layer relevance into attr/layer/layer_lrp_propagation.py similar to what we did with other algorithms ? 
In that case you could move:
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_core/layer_wise_relevance_propagation.py#L311
`_select_layer_output(relevances) ` and related logic to that file. It will make things more modular and clean. 
We'd need to think about a neuron variant as well.

  7. In general, we also make sure that all algorithms run in distributed settings. This means that if you wrap your model with DataParallel and execute attribution in GPU environment, it should give us expected result. This way we can compute attribution in a much faster manner. We have test cases for that in test_dataparallel.py. You can add support for data_parallel in a separate PR.

  8. For backward_hook_activation we are actually modifying the gradients, because instead of returning grad_input you return grad_output, but the comment says that we are not modifying them. Otherwise we wouldn't need that hook?
    https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module.register_backward_hook

  9. For the epsilon rule: you use Marco's derivation, right? But if we use that special handling only for the epsilon rule, it makes the code more convoluted; we might as well use the weight-modification strategy for the epsilon rule, for generality. Would that be a problem when we attribute w.r.t. inputs?

  10. For the gamma and alpha-beta rules: I will do another pass and see if we can make the code more compact.

Thank you again!
Let me know if you have any questions!


nanohanno avatar nanohanno commented on June 3, 2024 1

Hi @NarineK ,
Thanks for your feedback.
Regarding the skip connections, it might be the best way to use the graph information in JIT mode. I brought that up earlier and that is what is done in the reference Tensorflow implementation innvestigate, as far as I understand. However, my impression was that JIT mode is not so well documented and it appeared difficult to get a working implementation. Definitely, one would have the complete information from the graph about all operations being done during backward pass. The incomplete information is the main problem that I see in our case.
That brings me to your second suggestion, where you mentioned that there will be support for hooks on add operations. I guess that will help with skip connections, as long as one can address the specific add operation to register a hook on. For that, the add operation would need to be incorporated in model.modules() or something like that, right?
Do I understand you correctly that both implementations (hooks in JIT mode and hooks on add module) will be done in the near future? So we would need to wait for that to try it out?

Regarding your other suggestions I made some changes already:

  1. You are right, MaxPool2D is not a linear layer. It is just in SUPPORTED_LINEAR_LAYERS because it is treated like the other linear layers or better to say not treated as a non-linear activation layer, i.e. skipped. Maybe we can change the name of the dict to SUPPORTED_LAYERS or SUPPORTED_LAYERS_WITH_RULES to be less confusing. In fact, in the tutorial it is argued that pooling layers behave best when MaxPool is converted to AveragePool. This conversion of a layer is however difficult without exchanging the module in the PyTorch model, which we do not intend to do. I evaluated the difference in the treatment of pooling layer in the different implementations and it seemed that it affects the attribution results just slightly.
  2. I guess we can add MaxPool1D and MaxPool3D, the behaviour should be the same. The list of layers is not necessarily complete anyways, only contains layers that I have tested so far.
  3. I changed the order.
  4. Can you elaborate? I don't understand what exactly you mean here. The hooks are removed finally in LRP._remove_backward_hooks() and the handles are deleted when LRP._remove_rules() is called.
  5. I like that idea, it makes the selection of the corresponding layer much clearer. I changed the involved functions and test cases accordingly.
  6. I moved the layer attribution code to layer.layer_lrp.py . I just changed the code slightly so that it works with inheritance from LayerAttribution. That means the relevances for all layers are still computed and at the end the values for the selected layer are returned. There is probably a more elegant way to only do propagation until the selected layer like you did for DeepLift, I am going to work on it at a later point.
  7. I added a warning and the handling of DataParallel models needs to be done later.
  8. Maybe the comment was unclear. What I was trying to say is that no specific rules are applied to non-linear layers; the relevance is propagated without any manipulation, so it basically skips over the layer. Therefore, I added a backward hook that returns the gradient on the output as the gradient on the input, meaning it skips the layer, right? I tried to make it clearer in the comment (see the sketch after this list).
  9. The implementation of the epsilon rule is similar to Marco's implementation. This rule is the basic rule, therefore the other rules inherit from the PropagationRule class which is basically the epsilon-rule. However, for the rules that manipulate the weights, more hooks are needed, which is why they need to be more complex. Does that make sense?
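For reference, a minimal sketch of such a pass-through hook on a simple non-linear module (the actual backward_hook_activation in the branch may differ):

def backward_hook_activation(module, grad_input, grad_output):
    # return the gradient w.r.t. the output unchanged as the gradient w.r.t.
    # the input, i.e. skip the non-linearity during relevance propagation
    return grad_output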

Hopefully, we can sort out the issues with skip connections soon.
Thanks again for your valuable feedback!


nanohanno avatar nanohanno commented on June 3, 2024 1

I worked on your suggestions and did some changes:

  1. Agreed, I did the change.
  2. Agreed, I did the changes.
  3. Yes, that is how I understood grad_input and grad_output. I also saw a remark on the limits of it, that it might give unexpected results if complex modules are used. Unfortunately it is not explicitly stated what a complex Module is in the documentation. However, in LRP it should only be applied to simple Modules like ReLU or other non-linear layers. Then grad_input=grad_output should act like skipping the layer in backward pass, right? It only works for simple layers where the dimensions of input and output are the same.
  4. We can reduce the code to a single parent class, PropagationRule_ManipulateModules, which would have an empty _manipulate_weights method for rules that do not manipulate the weights, like the epsilon rule for example. The reason why I used two classes is that if the weights are not changed for any layer, the first forward pass in LRP._change_weights does not need to be run, reducing the computation time. However, this case is probably very specific anyway, so clearer code might be more beneficial than small performance gains. I have now reduced it to a single parent class, PropagationRule.
  5. I rebased the branch to the current master and I am going to work on additional tests and type hints.

An important example of skip connections for our use case are the residual blocks in ResNets. In torchvision the addition is defined as out += identity. If this implicit addition operator could somehow be replaced by a custom operator on which a backward hook can be registered, it might work (see the sketch below). Actually, this custom addition operator could be a fully connected layer with all weights one and zero bias, right? Then the propagation rules are already set. However, the ways I imagined to detect the addition operator in the forward function and replace it seem very hacky. Do you have a clearer idea of how to tackle it in the current form or in JIT mode?
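To make the wrapper idea concrete, an illustrative sketch of an explicit addition module (a toy residual block, not torchvision's BasicBlock):

import torch
import torch.nn as nn

class Add(nn.Module):
    def forward(self, x, y):
        return x + y

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()
        self.add = Add()                  # explicit module instead of `out += identity`

    def forward(self, x):
        identity = x
        out = self.relu(self.conv(x))
        return self.add(out, identity)    # hooks and rules can now target self.add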

Thanks for your feedback!


nanohanno avatar nanohanno commented on June 3, 2024 1

Hey @NarineK ,
As I see it, the ReLU only needs to be skipped in the backward pass. In the forward pass it needs to act as normal to generate the output score of the model. Therefore, I only added the backward hook.

However, I had a look at the formalism again and if you take
$$ R_j= \sum_k\frac{a_j w_{jk}}{\sum_ja_jw_{jk}}R_k=a_j\sum_kw_{jk}\frac{R_k}{z_k}$$
as the basic propagation rule for LRP, and layer $j$ is a linear layer that follows a ReLU activation, then $a_j$ corresponds to the output of the ReLU layer, meaning it is zero if the output of the ReLU is zero (i.e. the input of the ReLU is zero or less). Thus, $R_j$ is also zero. Therefore, it does not matter how the relevance is propagated over the ReLU layer; it is always zero.
In our implementation, the ReLU layer is skipped in the propagation. That means, if the input of the ReLU is negative, the relevance is just taken from the upper layer, but it is already zero because the output of the ReLU is zero. Thus, we can also use the gradient for ReLU layers, which gives the same result (zero). There is no need for a special rule for ReLU layers. I do not know whether that works for other activation functions; possibly the varying statements in the literature are related to this point. Or is this actually what you meant?
This way, there should be no problem with inplace ReLUs but thanks for the remark, I did not think of that.

Regarding your question 5., yes, that hook is used for all rules. Do you see a problem?


nanohanno avatar nanohanno commented on June 3, 2024 1

Hey @NarineK ,
I added test cases for LRP and LayerLRP now, covering MaxPool1d, 2d and 3d, in-place ReLU, skip connections, all three implemented rules, tanh activation, output for all layers, and restoring the model after attribution, with the last commit 58d518a.
The test case for skip-connections is commented out and there is a problem with in-place ReLU for the LayerLRP attribution but not for LRP. I haven't understood the origin of the error yet.
The methods you mentioned in the previous post were related to LayerLRP and are now updated. I followed your suggestion and implemented the option to set layer to None to return a list of attributions for all layers.
Regarding pooling layers, it was noted in the initial implementation in the tutorial that MaxPool layers should be changed into AveragePool layers, but if I remember correctly, I did not see a thorough argumentation for it. I guess, it smoothens the output and is more robust against noise? However, when I checked in the old implementation there was only a very small difference in the results when using MaxPool or AveragePool in the backpropagation.


NarineK avatar NarineK commented on June 3, 2024 1

Awesome! Thank you very much @nanohanno ! I'll take a look.


NarineK avatar NarineK commented on June 3, 2024 1

Thank you so much, @nanohanno, for addressing all the comments.

  1. A question about convergence delta. Is this based on the LRP conservation rule ? sum(R_j) = sum(R_k) and in a global case: sum(R_i) = f(x) ?
    
In the code I see: 1 - torch.sum(attributions)

    So, you assume that sum of the relevances in the output layer is 1 ?
    Also, the documentation doesn't match with it. It would be good to add an explanation

  2. Do we also want to compute convergence delta for LayerLRP ?

  3. nit: I think that we can remove: Keyword Arguments: here


    Keyword Arguments:

  4. nit: can we check the length instead ?


    if list(layer.children()) == []:

  5. Are origin_weight and origin_bias used anywhere ?


    module.original_weights = module.weight.clone()

  6. I think that we need to pass additional_forward_args here too


    _ = _run_forward(self.model, inputs)

  7. In this case, I think that you need to use is_layer_tuple instead of is_input_tuple, because the relevance is returned with respect to a layer.

    return _format_attributions(is_inputs_tuple, relevances), delta

Here is an example: https://github.com/pytorch/captum/blob/master/captum/attr/_core/layer/layer_activation.py#L120

  8. Would you remind me why we were doing
grad / (outputs + torch.sign(outputs) * self.STABILITY_FACTOR) instead of
grad * output / (inputs + torch.sign(inputs) * self.STABILITY_FACTOR)
    as described here:
    https://openreview.net/pdf?id=Sy21R9JAW

  9. I think that another good test case would be to compare it with InputXGradient. According to the above-mentioned comments, they will be equivalent for models that have ReLUs or max-pools.
    you can use it by simply importing like:

from captum.attr import InputXGradient

ixg = InputXGradient(model)
ixg.attribute(inputs)
  10. It would be good to test on some of these classes with additional forward args.
BasicModel_MultiLayer allows us to do that.

  11. I'd recommend moving this:

    class Model(nn.Module):

    to https://github.com/pytorch/captum/blob/master/tests/helpers/basic_models.py

Or just use this one instead:
https://github.com/pytorch/captum/blob/master/tests/helpers/basic_models.py#L170
We'll ultimately need to rename it and give less DeepLift specific name.

At some point when you address the comments and rebase the PR, feel free to create a Pull Request against the master branch ;)


nanohanno avatar nanohanno commented on June 3, 2024 1

Hey @NarineK , thanks for the feedback. The following are replies to your points:

  1. You are right, the convergence delta is based on the completeness of LRP, which actually breaks down when epsilon values are used and relevance is absorbed during propagation. In the implementation, the total relevance is normalized to one because the gradient that the last output tensor receives at the beginning of relevance propagation is 1 and not the score value itself. I don't know how to change that in a robust way. Therefore, I used 1 - torch.sum(attributions) as the difference to the theoretical value. It is more of a sanity check to see whether too much relevance was absorbed during propagation. I updated the documentation, and we can discuss whether there is a robust way to change this so that the correct total values are returned.
  2. I worked on the convergence delta calculation for multi inputs and LayerLRP now.
  3. Done.
  4. Done.
  5. Thanks for finding that. It is not used anymore; it is a leftover from a former implementation of LRP._restore_state(), which now uses the state dictionary. I removed those parts now.
  6. Done.
  7. That is true, I added the functionality that you suggested.
  8. Discussed at the bottom.
  9. I added a testcase assuring that the results are equivalent. However, as mentioned for convergence delta, the relevance in the LRP implementation is normalized, and therefore it needs to be multiplied by the output score to be the exact same as the output from InputXGradient.
  10. I wrote a test case but found that the results deviate depending on whether an additional input (0) is given or not. I found that this is due to the backward hook on the input tensor not being called when autograd.grad() is used in gradient.compute_gradients(). I opened an issue on GitHub for it because I did not understand the discrepancies and could not use the PyTorch forum.
  11. I moved the model and renamed it to SimpleLRPModel. I guess it would have been better to use an existing model. When I checked in the beginning I had the impression that the models all seemed to be rather specific and often used multiple inputs. That is why I defined the simple model for the tests. Now, many tests depend on the model and its output.

Regarding point 8:
Due to problems with the current functionality of backward hooks on modules (the input gradient might not be the actual gradient w.r.t. the input of the module but w.r.t. the input of the last operation in a more complex module containing multiple operations), we chose to use hooks on the input and output tensors of each linear module instead. The implementation could easily be moved to module backward hooks:

def module_propagation_hook(self, grad_input, grad_output):
    inputs = self.input      # saved in the forward run
    outputs = self.output    # saved in the forward run
    relevance = inputs * grad_output / (outputs + EPSILON)
    return relevance

which get registered on linear and conv modules. This implementation is equivalent to the actual implementation using tensor hooks.

Ancona et al. present a different implementation, starting from the assumption of a model with alternating modules of linear or conv and non-linear layers. They register hooks not on linear but on the non-linear layers. Thus, the output of a linear layer is the input of a non-linear layer and the transformation happens at the non-linear modules instead of the linear modules.

def nonlinear_hook(input_non_linear, output_non_linear, grad):
    relevance = output_non_linear * grad / input_non_linear
    return relevance

Here, input_non_linear is the output of the previous linear layer and output_non_linear is the input of the following linear layer. Thus, the equation is equivalent to relevance = input_linear_following_layer * grad / output_linear_previous_layer. For the linear layers, the gradient is not changed and also multiplied according to the chain rule. Thus, after propagating the whole model both implementations come to the same result. Ancona et al. have the assumption that every linear layer is followed by a non-linear layer and therefore transformations can be done at the non-linearities. In my view it is beneficial to register directly at the linear layer -- or at the input and output tensors -- to target the actual layer directly and be able to set different rules depending on the type of layer, which would be confusing using the implementation of Ancona et al.
As a remark, the implementation of Ancona et al. implies that during back-propagation the gradient of the non-linear function is applied and not skipped as seen in other implementations. (Skipping and applying gradient gives the same result for ReLU). I hope my thoughts are clear and it makes sense :)

I am going to rebase and open a pull request now. Thanks again for your valuable feedback!

Have you thought about the problem with skip connections or additions, respectively? Do you want to wait until the necessary changes are done in PyTorch? For our project it will be important, so we may use a workaround in the model itself.


AvantiShri avatar AvantiShri commented on June 3, 2024 1

One quick note about the alpha-beta rule - this paper by @berleon found that it doesn't pass sanity checks: https://arxiv.org/pdf/1912.09818.pdf


yidinghao avatar yidinghao commented on June 3, 2024 1

Hi @nanohanno and @NarineK:

Thank you so much for your amazing work on this fantastic package! I'm very excited to use LRP on Captum.

I was just wondering, are there any plans to support PyTorch's recurrent modules, such as nn.LSTM and nn.GRU? I understand that DeepLIFT currently doesn't support these modules, since they don't expose the internal state of the forward pass. Will that be the case for LRP as well?

I was also wondering: LSTM and GRU have multiplicative connections, which are handled by special propagation rules (summarized by Arras et al., 2019). Will there be support for such rules as well?


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno, thank you very much for reaching out and showing interest in implementing LRP in Captum. We haven't started implementing it yet.

Yes, definitely, contributions from OSS community are welcome. Feel free to implement it in Captum. Let us know if you have any questions. It would, probably also be good, if you could share a small design document with us so that we can help/guide you with it.
Thank you :)


NarineK avatar NarineK commented on June 3, 2024

Thank you for the thorough description, @nanohanno!
I'm currently at NeurIPS. Sorry, for the delay in replying. I'll read through it in detail after the conference.

Regarding type hinting, we have started adding support for type hints. Here is an example PR that does it: #184

You can implement the first version and add type hinting later in a separate PR. We have just started adding support for type hinting in the algorithms.


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno ! Thank you! I saw the section on LRP in that book!

  1. That sounds great! I think it would be good to keep in mind to be consistent with our current API. In terms of inputs and output of attribute(...) methods.
  2. Currently, layer attribution is exposed through the implementations in attr/_core/layer. Users specify a specific layer that they want to attribute to. But we can think of providing a list of layers and return the attribution for a requested list of layers instead of always returning all at once. Otherwise it can lead to memory and performance issues on user side.
The attribution of the output with respect to only input features is exposed to users in the implementations under attr/_core/xxx.py
  3. In terms of backward hooks - yes, you'd need to register a backward hook for each module.

    I don't know the details of the implementation, but when you call backward() each time on the same tensors, your grads might unintentionally get accumulated. You might want to zero out the grads depending on the implementation logic.
  4. In the book that you mentioned, Computer Vision models were mentioned. Do those layer rules also apply to non-CV models ?


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno,

Thank you again for the implementation.

I've tried the following example model, and it looks like this one would also fail because z and relevance have different shapes. This can happen for a linear layer but should be fine for some non-linearities such as ReLUs and max-pools, for example.

import torch
import torch.nn as nn

# LRP_0 refers to the class from the work-in-progress branch
class MyModel(nn.Module):
    def __init__(self, inplace=False):
        super().__init__()
        self.lin = nn.Linear(2, 2)
        self.lin.weight = nn.Parameter(torch.ones(2, 2))

    def forward(self, input):
        return self.lin(input)[0].unsqueeze(0)

input = torch.tensor([[1.0, 2.0], [1.0, 3.0]])
model = MyModel()
model(input)

lrp = LRP_0(model)
lrp.attribute(input, 0)

I think that the current implementation is more similar to the layer attributions where we attribute to the outputs of the layers, since the activations are the outputs of the layer (see https://github.com/pytorch/captum/blob/master/captum/attr/_core/layer/layer_deep_lift.py#L49). But if we imagine attributing to the inputs of the layer (which are essentially the outputs of the previous layers) or to the inputs of the model, then we could derive attribution with respect to input features as well.

In this paper Marco has derivations where he is able to attribute to tuples of inputs that can later be concatenated:
https://arxiv.org/pdf/1711.06104.pdf
https://github.com/marcoancona/DeepExplain/blob/master/deepexplain/tensorflow/methods.py#L324
Potentially, if we have the epsilon rule for the input layer, we can rely on Marco's derivation of input * gradient? The gradient can be overridden in the backward hook, similar to how we did it for DeepLift. Currently I only have the rescale rule for DeepLift, but we should be able to use a similar technique for the other types of rules as well.
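For illustration, a gradient override in the spirit of Marco's derivation could look like the sketch below, written here as a custom autograd.Function rather than a backward hook (the class name and the epsilon handling are illustrative, not Captum or DeepExplain code):

import torch

class EpsilonLRPReLU(torch.autograd.Function):
    # ReLU whose "gradient" implements the epsilon-LRP rule g(z) = f(z) / z,
    # so that gradient * input reproduces the epsilon-LRP attributions
    @staticmethod
    def forward(ctx, x):
        y = torch.relu(x)
        ctx.save_for_backward(x, y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        eps = 1e-9
        stabilizer = eps * torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
        return grad_output * y / (x + stabilizer)

Replacing the non-linearities with EpsilonLRPReLU.apply and then computing input * gradient would yield the epsilon-LRP attributions for ReLU networks.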

Btw, the derivations that Marco did for epsilon-LRP, is it also possible to do similar derivations for the other LRP-Rules as well ?

In terms of using JIT to trace the models: We plan to enable JIT tracing for all algorithms, in a general, but I think that at this point we can have an implementation similar to other algorithms, DeepLift is an example.

I think we could potentially use backward hooks and avoid multiple backward calls using the original formulation from the book, page 196, eq 10.2 or have a reformulation of the rules similar to what Marco did for epsilon-Rule.

In the forward hook we can set attributes to the modules, that is, in the forward hook we can do something like module.inputs = inputs, module.activation =outputs, and use them in the backward hook. We have a similar logic in the deeplift implementation.

In general about the API:

I think that the current implementation corresponds to layer attribution when we attribute to the output of the layer. For the input feature attribution, we can have an additional attribution to each input feature that supports multiple features (it can be multiple tensors), similar to epsilon-LRP from Marco's implementation, as I mentioned above.

To be more generic, in the API we can have predefined rules for a selected subset of layers and use default rule for the rest of the layers, that will bring more flexibility.
We could potentially set a rule attribute to each layer, e.g. layer.rule = rule, and access the rules for corresponding layer with layer.rule. When we finish attribution we can remove those rules from the modules.

In this case the Constructor can take a list of layers and a list of rules that can be mapped to each other. In the case when the layers aren't specified in the input, we can use default Rule.

Since VGG has predefined rule for each layer, we can generate the full mapping of layer-rule with suggested_rules for example and pass to the constructor.

Let me know what you think.

Some other minor comments:

  1. Have a default rule that will be applied to all layers so that users aren't obligated to provide a rule.
Basically, instead of having LRP_0 we can use that logic when rules=None; that will be our default logic.
  2. The tests with real VGG model can time out, that's why we have toy models. Let's work only with toy models in the test cases
  3. Let's move _core/lrp_rules.py under _utils/lrp_rules.py since core contains core algorithms only
  4. suggested_rules, is this used only in test cases ?
  5. For Python 3 and higher we don't need to extend from object explicitly. Let's be consistent with other classes
  6. PropagationRule can be potentially described as an ABC class and propagate as an abstract method
  7. activations = (activations.data).requires_grad_(True) -> you can potentially use our functions, 
 gradient_mask = apply_gradient_requirements(inputs) and set back to the original state with undo_gradient_requirements(inputs, gradient_mask)
  8. For the VGG, do you mind moving it from the test case into a notebook and potentially make it part of the tutorials ? This can be done in a separate PR
  9. I see that you excluded the cases of having Sigmoid and Tanh in general, but the following paper(), for example, has applied the epsilon-LRP rule to Tanh. Do we want to exclude them in general?
https://github.com/nanohanno/captum/blob/5906b691c80d60cbf6c272ea34b8eabba650416a/captum/attr/_core/layer_wise_relevance_propagation.py#L282
  10. Currently I see that you support only targets of type int, but we support tuples etc. as well. That can be especially useful for multidimensional outputs.
  11. rules = repeat(BasicRule(), 1000) -> TypeError: 'itertools.repeat' object is not reversible
change to: rules = list(repeat(BasicRule(), 1000))

Let me know what you think!


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno,

let me know if you have any questions. We can also have a call and see how generic we can make the algorithms.

Thank you,
Narine


NarineK avatar NarineK commented on June 3, 2024

Thank you so much for taking time and making all changes, @nanohanno ! I'll have a deeper look into it and get back to you :)


NarineK avatar NarineK commented on June 3, 2024

Thank you very much for addressing the comments, @nanohanno and the quick feedback.

As you mentioned, skip connections would be best solved in graph mode, and that's what JIT provides.

Yes, operator level hooks will help with the skip connection where input addition is used but I don't know the timeline about it. I'm not sure if it will be available from model.modules() because model.modules() shows only pre-defined modules not the functional ones.

Do you have a concrete example of a skip connection? I'll look into it more closely.
Actually, do you think that adding a custom module in between, that does the addition only, might help (kind of like an addition wrapper) ?

JIT-ed models have some limitations, e.g. currently they don't support hooks. We can work on a subgraph level, but that looks a bit hacky. JIT hooks will be supported in the first half of this year.

  1. Yes, let's rename it to SUPPORTED_LAYERS_WITH_RULES that's a better name

  2. That's right! Ignore my comment

  3. Thank you for creating the layer_lrp.py variant. To be more consistent, do you want to rename layer_wise_relevance_propagation.py to lrp.py?
It would also be good to split the test cases for layer LRP into test_layer_lrp, similar to what we did for layer_deeplift.py.

  4. Yeah, in that case you are returning the gradients with respect to the outputs of the layer instead of the inputs of that layer. I don't know if it can be seen as skipping a layer, because instead of returning the gradients of the loss with respect to the inputs (as in the normal case) it returns the gradients with respect to the outputs of that layer.
    https://discuss.pytorch.org/t/exact-meaning-of-grad-input-and-grad-output/14186/3
    Is that how you also understood grad_output?

  5. Yes, that makes sense! What I meant is that instead of using Marco's derivations use the standard weight manipulation.
 Basically using something like this as an epsilon rule. Would there be any disadvantage ?


class EpsilonRule(PropagationRule_ManipulateModules):
  

Basically, treating epsilon rule like others if there is no advantage of using Marco's derivations.

  • I tried the test cases and it looks like most or all rules are epsilon. Are we testing other rules ? It would be good to add tests for each rule.

  • Let's also add a test case for skip connection. Even if it doesn't work as expected we can put a comment and make it fully functional once we find a better way of doing it.

  • Can we have more test cases including Maxpool1d, Maxpool2d and other activations that aren't tested ?

  • We've made many changes in captum. It looks like your branch is a little older; it would be good to rebase it.
    Currently DeepLift works with DataParallel. We also have type hints everywhere. You can add them at the end or in a separate PR.

Thank you :)


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno!

Thank you very much for addressing the comments.

  1. With respect to complex modules, I think that they might refer to the issues raised here:
    pytorch/pytorch#12331
    ReLU or other non-linear layers that you use aren't complex modules.

  2. Let's say that we have the following network:
    
Input -> Linear1 -> Relu -> Linear2 -> output


grad_input for Linear2 is doutput / dLinear2_input, which is doutput / dReLU(Linear1_output), right?
So in that case, in order to skip ReLUs, you might want to register a forward hook for the ReLU that returns the input instead of the output. You can verify it; I think that's how it works.

Assuming that the relu isn't inplace:


def frwd_hook(self, input, output):
    return input[0]

hook = model1.relu.register_forward_hook(frwd_hook)

That will basically skip ReLU layer but if in the backward pass you want to avoid using grad_input for ReLU then that works too but it doesn't quite skip the layer.

Another point I wanted to mention is in-place ReLUs.

In-place ReLUs are tricky: if you place a forward hook on one, the output is the same as the input. To avoid that, we needed to set a forward pre-hook.
Here is how we tested it for DeepLift:
https://github.com/pytorch/captum/blob/master/tests/attr/test_deeplift_basic.py#L88

The resnet model that you pointed me to has in-place ReLUs:
https://github.com/pytorch/vision/blob/b6f28ec1a8c5fdb8d01cc61946e8f87dddcfa830/torchvision/models/resnet.py#L51

  5. That sounds good!

    Another question:
    Is this also being used for gamma and alpha-beta rule:
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_utils/lrp_rules.py#L42
return grad / (outputs + self.STABILITY_FACTOR)


Good point! Yes, a custom addition operator could be a fully connected layer with weights all ones and zero bias! It could be hacky to modify existing layers and add new ones but you can define a new, slightly different architecture and load all weights and use the new model with a custom forward function that uses custom addition module. For your new linear layer all weights will be 1s (and biases: 0s) as you mentioned and other weights you can load from pre-trained model?

With JIT I'm not sure. We've just started looking deeper into them. The first thing we need is to add hook support for them.

Thank you :)


NarineK avatar NarineK commented on June 3, 2024

Hi @nanohanno !

I see, yeah, that makes sense. So, you are skipping the relevance coming from the ReLU layer. It makes sense then to skip it during backprop. Thank you for the explanation!

As you mentioned in the TODO are you planning to do something similar for pooling modules because they are simple filters with no weights ?

Regarding point 5.: that sounds good.
That sounds good to me. Let me know once you add the test cases and make all changes. I'll make another pass and see if there is anything we can improve.

I might be looking at the wrong branch:
Is _backward_hook_relevance being called anywhere ?
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_utils/lrp_rules.py#L30

Also, return_for_all_layers isn't being used anywhere ?
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_core/layer/layer_lrp.py#L145

According to our other layer methods we only return the relevance for the layer:
https://github.com/nanohanno/captum/blob/feature/layer_wise_relevance_propagation/captum/attr/_core/layer/layer_lrp.py#L28

We can for example, remove return_for_all_layers and return the relevance for all layers if layer is None, otherwise we will return the relevance for the layer specified by the input.
Note that layer has to be provided in the constructor and the user can explicitly set it to None if they want relevances for all layers. We need to check this with typehints to make sure it will work nicely.
We won't default it because returning the relevances for all layers is a large lists of tensors.


NarineK avatar NarineK commented on June 3, 2024

Thank you for addressing the comments, @nanohanno :)

  1. Thank you for updating the docs. It makes sense. If epsilon is a very small number, it should be a very small factor that gets multiplied through the chain rule, and we could probably try to factor that in, but I think that if epsilon is very small then it's fine.
    Also, if the input hook for the network works properly, we might need to also divide by the input, or not multiply by the input in the final relevance computation in lrp.py? (See further comments below.)

  2. About the issue: https://github.com/pytorch/pytorch/issues/35802

    I think that I have encountered similar behavior in the past. During backprop, when the gradients are computed for a tensor they are also set in its grad attribute, and you can access them through input.grad. In the case of autograd.grad they might have preferred not to modify the input. That could be a reason.

    But in a general case it looks like in the test cases it doesn't multiply by the inputs for the networks' input layer in the input tensor's backward hook.



def _backward_hook_input(grad):
    relevance = grad * inputs
    self.relevance_input = relevance.data
    return relevance


This might be hacky but, for the consistency, we might always want to multiply by that input as well.


This is a bit hacky, but as a workaround we can always add a scalar zero to the inputs to make autograd believe that the input is 0 + input instead of input, and that will allow us to enter the _backward_hook_input for the network's input.


import torch

# hook_grad stands for the tensor hook from above (e.g. _backward_hook_input)

input = torch.Tensor([2])
input.requires_grad = True

input_plus_zero = 0 + input
input_plus_zero.register_hook(hook_grad)

output = 5 * input_plus_zero
grad = torch.autograd.grad(output, input)

Let me know what you think.

I also briefly chatted with @albanD about the autograd behavior, and he told me he has an idea how to fix it, but I think it might take some time until the fix is in. I'd use a workaround and put a TODO comment about removing it once the fix lands.


  1. Thank you for explaining point 8. I think I understand it, and I also quickly implemented Marco's version to compare the relevances.
    One thing is that, if autograd.grad works as expected and the backward hook gets called for SimpleLRPModel, then our implementation will deviate from Marco's by a factor of the inputs; we'd need to divide by the input to get the same result?

About skip connections:
Since we do not have a time estimate for when we'll have hooks on functional operators and the other improvements, I think it would be good to use the workaround solutions for now and keep an eye on when they get implemented in PyTorch.

Another small nit:
1. For LRP, can we extend from GradientAttribution so that we can access gradient_func from self? (A rough sketch is below.)
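Something along these lines, as a rough sketch only (it assumes GradientAttribution is importable from captum.attr._utils.attribution and that its constructor sets gradient_func):

from captum.attr._utils.attribution import GradientAttribution

class LRP(GradientAttribution):
    def __init__(self, model):
        # The base class stores the forward function and exposes self.gradient_func.
        super().__init__(model)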

Also, let's move the discussion to the PR ;)

from captum.

felmoreno1726 avatar felmoreno1726 commented on June 3, 2024

Hi, thank you both very much for this work and for bringing this amazing framework into existence! I can't stress how awesome this is, and I can't wait to get started using LRP. I'm going through the test cases to get a sense of how to use the framework, and I wonder if there are any docs yet, or a way to generate documentation.

from captum.

nanohanno avatar nanohanno commented on June 3, 2024

Hey @felmoreno1726 , cool to hear that you are excited about the LRP implementation. I think you can find API docs on the homepage (https://captum.ai/api/), but not yet for LRP.

from captum.

NarineK avatar NarineK commented on June 3, 2024

Thank you very much for your inputs @marcoancona and @AvantiShri !
@marcoancona, @nanohanno has implemented alpha, beta and gamma rules in: #342
The implementation modifies/updates the weights accordingly and uses the gradients to compute LRP. It computes the gradients with respect to the linear layers instead of the ReLUs.
@nanohanno , explains it in more detail here: #143 (comment)
Let us know what you think about it.

@AvantiShri, thank you for sharing the reference paper. This looks very interesting. In the cat-and-dog sanity check it appears insensitive to the class, but it seems it was useful for a different application, such as localizing evidence for Alzheimer's disease. I'll read the theoretical insights on that as well.

from captum.

NarineK avatar NarineK commented on June 3, 2024

Thank you for reaching out and for your feedback, @yidinghao! At this point we want to finish up the implementation for CNNs and then see if we can expand it to LSTMs and GRUs.
I had a quick look into it, and my understanding is that, similar to DeepLIFT, we would need to access the individual elements (building blocks) of the LSTMs and GRUs. @nanohanno might know more about it.

from captum.

rGure avatar rGure commented on June 3, 2024

Hey everyone,
I am @nanohanno's successor in the research project from which the LRP code originated, and we needed to extend the implementation with a few rules that were not implemented yet. As before, we would be interested in contributing this code to captum if you are interested.

from captum.

NarineK avatar NarineK commented on June 3, 2024

Hi @rGure, great to hear that you are interested in contributing to captum.
That sounds good to me. I think it would be great to finish the outstanding PR and make it release ready; additional rules can be added in a separate PR after we merge the current one.
For the additional rules, I'd recommend opening a 🚀 Feature request issue on GitHub so we can discuss it there. We'll have more details on the contributing guidelines available on GitHub soon.

from captum.

NarineK avatar NarineK commented on June 3, 2024

Thank you, everyone, for the collaboration and feedback. The LRP implementation is merged with #342. Thank you, @nanohanno, for the implementation.
New feature requests related to LRP can be opened; we already have one open here: #485
Closing this issue for the time being.

from captum.

pribadihcr avatar pribadihcr commented on June 3, 2024

@nanohanno,

I got the following error using LRP:

TypeError: Module of type <class 'torch.nn.modules.flatten.Flatten'> has no rule defined and no default rule exists for this module type. Please, set a rule explicitly for this module and assure that it is appropriate for this type of layer.

Looks like Flatten is not supported yet.

from captum.

nanohanno avatar nanohanno commented on June 3, 2024

Hey @pribadihcr,
As the error message says, you can explicitly set a rule for that layer. It is described in the LRP documentation: custom rules for a given layer need to be defined as the attribute module.rule and need to be of type PropagationRule. If you find a suitable rule, you could add it to SUPPORTED_LAYERS_WITH_RULES and create a PR.
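For illustration, a minimal sketch of attaching a rule to the Flatten layer (the import path for EpsilonRule, the model, and the input shape are assumptions here, not part of the original discussion):

import torch
import torch.nn as nn
from captum.attr import LRP
# Assumed location of the built-in rules; adjust to your captum version.
from captum.attr._utils.lrp_rules import EpsilonRule

model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 26 * 26, 10),  # assumes 28x28 inputs
)
# Attach a rule directly to the layer that has no default rule.
model[2].rule = EpsilonRule()

lrp = LRP(model)
attributions = lrp.attribute(torch.randn(1, 1, 28, 28), target=0)

Whether EpsilonRule is the most appropriate choice for a weight-free layer like Flatten is a judgment call; any concrete PropagationRule attached as module.rule avoids the error.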
Good luck!

from captum.

zenghjian avatar zenghjian commented on June 3, 2024

Hi @nanohanno,

I found that implementing ResNet with LRP on Captum didn't work very well. Do you have any thoughts on this or can you provide an example of the implementation?

from captum.

pribadihcr avatar pribadihcr commented on June 3, 2024

Hi @nanohanno

I added nn.Flatten: PropagationRule to SUPPORTED_LAYERS_WITH_RULES and got the following error:

TypeError: Can't instantiate abstract class PropagationRule with abstract method _manipulate_weights
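(For context: PropagationRule is an abstract base class, so the map needs a concrete subclass rather than PropagationRule itself. A minimal, hypothetical sketch of such a subclass, assuming the _manipulate_weights(module, inputs, outputs) signature implied by the error:)

from captum.attr._utils.lrp_rules import PropagationRule

class PassThroughRule(PropagationRule):
    # Hypothetical rule for weight-free layers such as nn.Flatten.
    def _manipulate_weights(self, module, inputs, outputs):
        # Flatten has no weights, so there is nothing to manipulate.
        pass

A concrete class like this (or an existing rule such as EpsilonRule) can then be mapped in SUPPORTED_LAYERS_WITH_RULES or attached as module.rule, as described above.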

from captum.
