yikaiw / cen
[TPAMI 2023, NeurIPS 2020] Code release for "Deep Multimodal Fusion by Channel Exchanging"
License: MIT License
Hi, suppose I have two modalities A and B and perform channel exchanging (CE) between them. If I understand correctly, this will make one of the two feature vectors more and more important and the other less and less important. In the end, should I simply discard the less important one, or should I use soft_alpha to fuse the two?
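For reference, a minimal sketch (not the repository's actual soft_alpha code) of what soft fusion of the two decision outputs could look like, assuming the per-modality weights are learnable and normalized with a softmax; the class name SoftAlphaFusion and its interface are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAlphaFusion(nn.Module):
    """Fuse per-modality predictions with learnable softmax-normalized weights."""
    def __init__(self, num_modalities=2):
        super().__init__()
        # One learnable scalar per modality; softmax keeps the weights positive and summing to 1.
        self.alpha = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, preds):
        # preds: list of tensors of identical shape, one per modality
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * p for w, p in zip(weights, preds))
```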
Hi, I am trying to use channel exchanging in a multimodal self-supervised network (depth and RGB) and followed this line of code to add the sparsity constraint.
When I plot the scaling factors under the sparsity constraint, I find that all of them eventually decrease to zero. However, Figure 5 of your paper seems to show a stable fraction of scaling factors that never reach zero (they stay above the threshold and are therefore not exchanged). May I ask whether you have encountered the case where all scaling factors become zero in your experiments?
Best,
beniko
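For context, a minimal sketch of the kind of sparsity constraint being discussed: an L1 penalty on the scaling factors (gamma) of the normalization layers. The function name and the sparsity_weight value are illustrative assumptions, not the repository's exact code:

```python
def l1_sparsity_loss(norm_layers, sparsity_weight=1e-4):
    """L1 penalty on the scaling factors (gamma) of the given BN/IN layers."""
    penalty = sum(norm.weight.abs().sum() for norm in norm_layers)
    return sparsity_weight * penalty
```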
Hello, author! In the loss formula in the paper, my understanding is that only the ensemble output (ens) is used to compute the loss. Why does the code sum the losses of rgb, depth, and ens?
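For reference, a minimal sketch of the summed objective the question refers to, assuming cross-entropy is the task loss and out_rgb, out_depth, out_ens are the two modality-specific predictions and their ensemble (all names are illustrative assumptions):

```python
import torch.nn.functional as F

def total_loss(out_rgb, out_depth, out_ens, target):
    """Supervise each modality stream as well as the ensemble output (illustrative)."""
    return (F.cross_entropy(out_rgb, target)
            + F.cross_entropy(out_depth, target)
            + F.cross_entropy(out_ens, target))
```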
Your work is excellent. I would like to ask: after fusion with CEN, if I want to do multimodal sentiment analysis (three modalities: text, audio, and video), the paper says there are still three outputs after fusion, and each output now carries information from the other modalities. For the sentiment prediction, should I take only one of the modality outputs, or concatenate them?
Thanks for your excellent work! I want to input images with different heights and widths, but I get an error. Do the input images have to have the same height and width?
Thank you very much; I look forward to your reply.
Hi,
Thank you for your great work. I have a specific question whose answer I could not find in the paper:
Consider homogeneous data (images) where the different modalities (different image streams) are not different versions of the same view. For example, in pose-regression methods like MapNet, the multimodal input would be two completely different images: not two images (different views) of the same scene (like RGB+D) but two images of two different scenes (one taken at time t and the other at time t+1).
Do you feel that there could still be a gain with channel exchanging?
Thank you for your answer.
Thanks for your excellent work! I have some questions about the input size of NYUDv2.
Why is the processed image size not 480 x 640 in the provided NYUDv2 dataset?
And what is the AlignToMask() transformation used for in the NYUDv2 and SUNRGBD datasets?
Thank you very much; I look forward to your reply.
Where can I find the non-colorised labels or masks?
Hello, thank you very much for the code you shared. We used your algorithm to train a model (aerial RGB plus an elevation modality, starting from an ImageNet-pretrained model), but during validation we found that the scaling factors of the two modalities are very close; as the convolutional layers get deeper, the scaling factors of all channels in the two modalities become almost identical, so the effect of exchanging is not obvious.
May I ask where the problem might be?
Thanks in advance!
Hello,
Thank you for your very interesting work! I was planning to experiment with CEN, but I could not find the implementation of the sparsity constraint on channel exchanging mentioned in Section 3.3, namely that channel exchanging is only performed in different (disjoint) sub-parts of the channels for different modalities. Could you point me to where in the model this is implemented?
Thanks.
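For reference, a minimal sketch of one way the disjoint sub-part constraint could be realized, assuming each modality is only allowed to exchange within its own contiguous slice of the channels; the mask construction below is an illustrative assumption, not the repository's implementation:

```python
import torch

def disjoint_exchange_masks(num_channels, num_modalities):
    """Give each modality a disjoint slice of channels in which exchange is allowed."""
    masks = []
    step = num_channels // num_modalities
    for m in range(num_modalities):
        mask = torch.zeros(num_channels, dtype=torch.bool)
        mask[m * step:(m + 1) * step] = True
        masks.append(mask)
    return masks
```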
It was a pleasure to read your paper "Deep Multimodal Fusion by Channel Exchanging", and I have downloaded the corresponding code from GitHub, but how can I get the "train" and "val" datasets? Looking forward to your early reply! Thank you!
Hi, I think the function validate() in the segmentation experiment may be wrong. Its docstring says:

```python
"""Validate segmenter

Args:
    segmenter (nn.Module) : segmentation network
    val_loader (DataLoader) : training data iterator
    epoch (int) : current epoch
    num_classes (int) : number of classes to consider

Returns:
    Mean IoU (float)
"""
```

However, I cannot find any operation that computes the mean of the IoU between the RGB output and the depth output. It seems to just return the IoU of depth, not the mean value. Would you mind giving more details on this?
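For reference, a minimal sketch of what averaging the IoU over the two modality outputs could look like, assuming each output's per-class IoU is computed from a confusion matrix; both helper names are illustrative assumptions, not the repository's validate():

```python
import numpy as np

def iou_from_confusion(conf):
    """Per-class IoU from a (C x C) confusion matrix, averaged over classes."""
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return (inter / np.maximum(union, 1)).mean()

def mean_iou_over_modalities(conf_rgb, conf_depth):
    """Average the mean IoU of the RGB output and the depth output."""
    return 0.5 * (iou_from_confusion(conf_rgb) + iou_from_confusion(conf_depth))
```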
Hello, I am interested in your research. I want to get a deeper understanding by reading your paper, but I could not retrieve it on arxiv.org. Could you send me the paper? I promise it will be used only for personal research and will not be redistributed. My email address is: [email protected]. Thank you very much.
Hello! Thank you very much for such excellent work!
I am not quite sure how the method works on multimodal data. My understanding is: the data and labels of each modality form one group, the data of the two modalities are trained in parallel at the same time, and channel exchanging is carried out within each modality. I am not sure whether this understanding is correct, and I look forward to your answer!
Thanks!
Hi,
I was experimenting with your code on my own dataset. However, I realized that the image-to-image translation model only supports fusion of two modalities. I checked the code in detail, and it seems that the Exchange class is implemented for two sub-networks.
```python
import torch
import torch.nn as nn

class Exchange(nn.Module):
    def __init__(self):
        super(Exchange, self).__init__()

    def forward(self, x, insnorm, insnorm_threshold):
        # Instance-norm scaling factors of the two sub-networks
        insnorm1, insnorm2 = insnorm[0].weight.abs(), insnorm[1].weight.abs()
        x1, x2 = torch.zeros_like(x[0]), torch.zeros_like(x[1])
        # Keep channels whose scaling factor is above the threshold,
        # replace the rest with the other modality's channels
        x1[:, insnorm1 >= insnorm_threshold] = x[0][:, insnorm1 >= insnorm_threshold]
        x1[:, insnorm1 < insnorm_threshold] = x[1][:, insnorm1 < insnorm_threshold]
        x2[:, insnorm2 >= insnorm_threshold] = x[1][:, insnorm2 >= insnorm_threshold]
        x2[:, insnorm2 < insnorm_threshold] = x[0][:, insnorm2 < insnorm_threshold]
        return [x1, x2]
```
You can see here that's the case. Can you provide the exchange class for more than two modalities?
Thanks in advance
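For reference, a minimal sketch of how the Exchange module could be generalized to N modalities, assuming below-threshold channels are replaced by the average of the other modalities' channels (the averaging rule and class name are illustrative assumptions, not the repository's code):

```python
import torch
import torch.nn as nn

class ExchangeMulti(nn.Module):
    """Channel exchange for an arbitrary number of modalities (illustrative sketch)."""
    def forward(self, x, insnorm, insnorm_threshold):
        num_modalities = len(x)
        out = []
        for i in range(num_modalities):
            mask = insnorm[i].weight.abs() >= insnorm_threshold
            xi = torch.zeros_like(x[i])
            # Keep channels whose own scaling factor is above the threshold
            xi[:, mask] = x[i][:, mask]
            # Replace the rest with the average of the other modalities' channels
            others = torch.stack([x[j] for j in range(num_modalities) if j != i]).mean(dim=0)
            xi[:, ~mask] = others[:, ~mask]
            out.append(xi)
        return out
```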
Hi, thanks for your work. I wonder what 'averaged' means in Figure 3, since the visualized feature maps are chosen by the scaling factor in the BN layer. May I also ask which layer/stage these feature maps belong to specifically? I really want to know whether outdoor datasets show the same characteristics. I'd be grateful if you could describe it in more detail.
We have 2D depth data corresponding to an RGB image; it has values 0.0-5.0 that represent the distance in meters from the sensor to that object in a straight line.
We want to transfer-learn (from your pretrained weights) on our dataset, and it seems that we should save our depth data as PNGs, since that is the format used in the dataset you link to. What preprocessing should we run, if any, for our depth data to be in the correct format - maybe just scale 0-5 to 0-255 and save as a grayscale PNG?
(By the way, in our data 0.0 is the default value when something is too far away or no signal is returned. It seems like this is the same for the kinect depth data because the black patches are probably 0 values.)
And I don't think the paper mentions any preprocessing done on depth, but the utils/datasets.py file does have this:
```python
if key == 'depth':
    img = cv2.applyColorMap(cv2.convertScaleAbs(255 - img, alpha=1), cv2.COLORMAP_JET)
```
What is this doing?
Thanks in advance!
Eli
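For reference, a minimal sketch of the preprocessing being proposed in the question, assuming raw depth in meters in [0, 5] is linearly rescaled to an 8-bit grayscale PNG; the function and its parameters are illustrative assumptions, not the repository's pipeline:

```python
import cv2
import numpy as np

def depth_to_png(depth_m, out_path, max_depth=5.0):
    """Rescale a float depth map in meters to an 8-bit grayscale PNG."""
    depth = np.clip(depth_m, 0.0, max_depth)
    depth_8bit = (depth / max_depth * 255.0).astype(np.uint8)
    cv2.imwrite(out_path, depth_8bit)
```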