jperezrua / mfas
Implementation of CVPR 2019 paper "Mfas: Multimodal fusion architecture search"
Hi, thank you for your great work!
I am a little confused about the meaning of "self.alphas = self._create_alphas()" in the Searchable_xxx_Net classes. What is the _create_alphas() function used for?
Dear Authors,
Would you mind sharing the scripts used to prepare the raw mm_imdb dataset? Or could you tell me whether my interpretation is correct?
This is the relevant part of your datasets/mm_imdb.py:
image = np.load(imagepath)
label = np.load(labelpath)
text = np.load(textpath)
The "image" is the poster image, the "label" is the "genres" field, and the "text" is the "plot". Here is a sample from the raw mm_imdb dataset:
"plot": [
"A stationary camera looks at a large anvil with a blacksmith behind it and one on either side. The smith in the middle draws a heated metal rod from the fire, places it on the anvil, and all three begin a rhythmic hammering. After several blows, the metal goes back in the fire. One smith pulls out a bottle of beer, and they each take a swig. Then, out comes the glowing metal and the hammering resumes.",
"Three men hammer on an anvil and pass a bottle of beer around."
],
"votes": 1335,
"title": "Blacksmith Scene",
"smart canonical title": "Blacksmith Scene",
"long imdb canonical title": "Blacksmith Scene (1893)",
"certificates": [
"USA:Unrated"
],
"long imdb title": "Blacksmith Scene (1893)",
"country codes": [
"us"
],
"smart long imdb canonical title": "Blacksmith Scene (1893)",
"cover url": "http://ia.media-imdb.com/images/M/MV5BNDg0ZDg0YWYtYzMwYi00ZjVlLWI5YzUtNzBkNjlhZWM5ODk5XkEyXkFqcGdeQXVyNDk0MDg4NDk@._V1._SX100_SY75_.jpg",
"sound mix": [
"Silent"
],
"genres": [
"Short"
],
I am trying to reproduce the unimodal and multimodal results reported in the paper. I got the following accuracies by running the scripts provided in this repo:
best_3_1_1_1_3_0_1_1_1_3_3_0_0.9134.checkpoint: 90.03%
conf_[[3_0_0][1_3_0][1_1_1]_[3_3_0]]_both_0.896888457572633.checkpoint: 88.64%
As you can see, the results are reasonable (still about 1% lower than the numbers you report), which implies that I have set up the dataset correctly.
On the other hand, I get very different results from the skeleton unimodal net. I used the provided pre-trained checkpoints for each modality and loaded them into the models.central.Visual and models.central.Skeleton modules. I wrote a simple script to run the forward pass and compute the accuracy of these modules. The results (especially for the skeleton net) are very different from the paper:
skeleton_32frames_85.24.checkpoint: 48.02%
rgb_8frames_83.91.checkpoint: 85.23%
Do you have any idea what I am doing wrong here? I would appreciate your comment.
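A plain top-1 evaluation loop of the kind described might look like the sketch below; here `model` stands in for models.central.Visual or models.central.Skeleton, and `loader` is assumed to yield (input, label) batches, which may not match the repo's conventions.

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    # Generic top-1 accuracy loop. `model` and `loader` are placeholders
    # for the repo's modules and data pipeline (an assumption here).
    model.eval().to(device)
    correct, total = 0, 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1)   # predicted class per sample
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total
```

A preprocessing mismatch between such a loop and the training pipeline (frame sampling, normalization) is a common cause of gaps as large as the skeleton one.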
Dear Author,
Thanks for this work! I'm trying to reproduce the results. First, I want to know whether AV-MNIST is a public dataset, because I can't find it. In the meantime I'm trying to use mm_imdb, and I have some questions in addition to #8 :
Counter({'Drama': 13967, 'Comedy': 8592, 'Romance': 5364, 'Thriller': 5192, 'Crime': 3838, 'Action': 3550, 'Adventure': 2710, 'Horror': 2703, 'Documentary': 2082, 'Mystery': 2057, 'Sci-Fi': 1991, 'Fantasy': 1933, 'Family': 1668, 'Biography': 1343, 'War': 1335, 'History': 1143, 'Music': 1045, 'Animation': 997, 'Musical': 841, 'Western': 705, 'Sport': 634, 'Short': 471, 'Film-Noir': 338, 'News': 64, 'Adult': 4, 'Talk-Show': 2, 'Reality-TV': 1})
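(For reference, a histogram like the one above can be computed from the raw records roughly like this; the per-movie JSON layout follows the sample quoted in #8, and the directory layout is an assumption.)

```python
import json
from collections import Counter
from pathlib import Path

def genre_histogram(json_dir):
    # Count genre occurrences across raw mm_imdb records; assumes one
    # JSON file per movie containing a "genres" list, as in the sample.
    counts = Counter()
    for path in Path(json_dir).glob("*.json"):
        with open(path) as f:
            counts.update(json.load(f).get("genres", []))
    return counts
```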
Sincerely,
Somedaywilldo
Unexpected key(s) in state_dict: "fusion_layers.0.2.weight", "fusion_layers.0.2.bias", "fusion_layers.0.2.running_mean", "fusion_layers.0.2.running_var", "fusion_layers.0.2.num_batches_tracked", "fusion_layers.1.2.weight", "fusion_layers.1.2.bias", "fusion_layers.1.2.running_mean", "fusion_layers.1.2.running_var", "fusion_layers.1.2.num_batches_tracked", "fusion_layers.2.2.weight", "fusion_layers.2.2.bias", "fusion_layers.2.2.running_mean", "fusion_layers.2.2.running_var", "fusion_layers.2.2.num_batches_tracked", "fusion_layers.3.2.weight", "fusion_layers.3.2.bias", "fusion_layers.3.2.running_mean", "fusion_layers.3.2.running_var", "fusion_layers.3.2.num_batches_tracked".
I am testing the network you provided, and I am getting the above error about the fusion-layer weights.
Could you please provide a checkpoint file that includes the fusion-layer weights as well?
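Until such a checkpoint is available, one common workaround is to load with strict=False, which skips the unexpected fusion-layer keys (note this silently leaves those layers at their initial values, so treat the result with care). A minimal sketch of the mechanism, where the tiny model and the key names are placeholders, not the repo's actual network or checkpoint:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the real network.
model = nn.Sequential(nn.Linear(4, 2))
# Placeholder state dict with one extra fusion-layer key, mimicking
# the "Unexpected key(s)" situation above.
state = {"0.weight": torch.zeros(2, 4), "0.bias": torch.zeros(2),
         "fusion_layers.0.2.weight": torch.zeros(2)}
result = model.load_state_dict(state, strict=False)
print(result.unexpected_keys)   # the keys that were ignored
```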
Hi
Thanks for sharing your nice work,
I tried the AV-MNIST code for unimodal image classification with different hyper-parameters, but I could not get results better than 65-66%, while 75% accuracy is reported in the paper. Could you kindly guide me on how to fix that?
Thanks
Hi~
Thank you for sharing such great work!
I plan to search for an architecture on my own dataset, so I would like to know how to obtain the pretrained backbone models, such as rgb_8frames_83.91.checkpoint and skeleton_32frames_85.24.checkpoint, used in your work.
Hi juanmanpr, thank you for open-sourcing MFAS. However, I encounter a bug when I clone your repository on Windows. The details of the bug are shown below:
Cloning into 'mfas'...
remote: Enumerating objects: 63, done.
remote: Counting objects: 100% (63/63), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 63 (delta 20), reused 38 (delta 6), pack-reused 0
Unpacking objects: 100% (63/63), done.
fatal: cannot create directory at 'models/aux': Invalid argument
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
This bug is caused by a forbidden file name used in your repository. Here is what Microsoft says:
Do not use the following reserved names for the name of a file: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names followed immediately by an extension; for example, NUL.txt is not recommended. For more information, see Namespaces.
So I hope you can rename the folder models/aux. Thanks!
Hi~
Thank you for sharing such great work!
I would like to know how to use MFAS to search for an architecture on my own custom datasets, such as RGB and infrared images.
Hi, when I run the code to train the unimodal image network (the LeNet-5 structure, as depicted in the paper) on the disturbed MNIST (25% of the energy removed), I obtain an accuracy of ~53% instead of the ~74% described in the paper. I also tested the extreme case where only 1% of the energy is removed, which gives an accuracy of 95%, as expected. This implies the problem lies in my dataset rather than in the training settings, I believe.
I was wondering what the issue might be? Or have you ever come across this problem before? Thanks for your time.
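One way to phrase "X% of the energy removed" as an operation on an image is to zero the smallest-magnitude Fourier coefficients until the requested energy fraction is gone. This is only one plausible reading of the perturbation; the paper may implement it differently.

```python
import numpy as np

def remove_energy(img, frac=0.25):
    """Zero out the smallest-magnitude Fourier coefficients of `img`
    until roughly `frac` of the spectral energy has been removed.
    NOTE: a guess at the perturbation, not the paper's code."""
    f = np.fft.fft2(img.astype(np.float64)).ravel()
    order = np.argsort(np.abs(f))            # smallest coefficients first
    energy = np.abs(f[order]) ** 2
    cum = np.cumsum(energy) / energy.sum()
    f[order[cum <= frac]] = 0.0              # drop `frac` of the energy
    return np.real(np.fft.ifft2(f.reshape(img.shape)))
```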
Hi, I am unable to find the AV-MNIST dataset online. Could you kindly share a link? I am just starting out, so I am hoping to begin with a less complex dataset.
Thanks
I noticed you have a CIFAR-10 specialization. Is this being used to find the best structure of a CNN?
Hi,
Congratulations on the work. It seems really intriguing.
I came across a line in the paper:
However, the reader should consider that our fusion approach is in fact not limited to neural networks as primary feature extractors.
I was wondering if you could elaborate on this a little bit.
I was hoping to use an approach similar to the one described in the paper, but I don't want to restrict the search to pre-trained detectors. If I want to search for the pre-fusion and post-fusion layers as well, do you think the current framework can handle that? And what would be a good starting point?
I downloaded the dataset according to your instructions, but I am stuck at "change all video clips resolution to 256x256 30fps and copy them to the /ntu_rgbd_rgb/avi_256x256_30/ directory". How can I change all the video clips to 256x256 resolution at 30 fps?
Thank you in advance for your answer.
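ffmpeg can do this kind of conversion; below is a sketch that builds the command for one clip. The flags are standard ffmpeg usage, but the file names, output layout, and the decision to run one command per clip are my assumptions, not the authors' pipeline.

```python
import subprocess  # needed when actually running the command

def ffmpeg_cmd(src, dst, size=256, fps=30):
    # Build an ffmpeg invocation that rescales one clip to size x size
    # at `fps` frames per second; -y overwrites an existing output file.
    return ["ffmpeg", "-y", "-i", str(src),
            "-vf", f"scale={size}:{size}", "-r", str(fps), str(dst)]

# e.g., for each clip (requires ffmpeg on PATH; paths are examples):
# subprocess.run(ffmpeg_cmd("in.avi", "avi_256x256_30/in.avi"), check=True)
```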