
mixture-of-embedding-experts's Introduction

Mixture-of-Embedding-Experts

This GitHub repo provides a PyTorch implementation of the Mixture-of-Embedding-Experts (MEE) model [1].

Dependencies

Python 2 and PyTorch 0.3

Usage example

Creating an MEE block:

from model import MEE

'''
Initializing an MEE module
Inputs:
- video_modality_dim: dictionary mapping each video modality to a pair
  (input dimension, output embedding dimension). In this example there are
  four modalities: face (input dimension 128, output embedding dimension 128),
  audio, visual and motion.
- text_dim: dimensionality of the sentence representation (e.g. 1000)
'''

video_modality_dim = {'face': (128, 128), 'audio': (128*16, 128),
                      'visual': (2048, 2048), 'motion': (1024, 1024)}

text_dim = 1000

mee_block = MEE(video_modality_dim, text_dim)

MEE forward pass:

'''
Inputs:
- captions: an N x 1000 tensor (N sentences, each represented by a
  1000-dimensional feature).
- videos: a dictionary with the input for each modality; for instance,
  face_data is of size N x 128 and visual_data is of size N x 2048.
- ind: a dictionary providing a binary indicator vector for each modality:
  1 means the modality is provided for that sample, 0 means it is missing.
  For instance, if the visual modality is provided for all N inputs, then
  visual_ind = np.ones(N). If only the first half of the inputs has the
  visual modality, then visual_ind = np.concatenate((np.ones(N//2), np.zeros(N//2)), axis=0).
'''

videos = {'face': face_data, 'audio': audio_data, 'visual': visual_data, 'motion': motion_data}
ind = {'face': face_ind, 'audio': audio_ind, 'visual': visual_ind, 'motion': motion_ind}

# Gives the matrix of scores between all captions and all videos
matrix_result = mee_block(captions, videos, ind, conf=True)

# Gives the pairwise scores (one score per caption-video pair)
pairwise_result = mee_block(captions, videos, ind, conf=False)
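
Putting it together, a toy forward pass with random inputs might look as follows. This is a minimal sketch following the shape conventions above; with the PyTorch 0.3 release this repo targets, the tensors would additionally need wrapping in torch.autograd.Variable.

import numpy as np
import torch

N = 8  # batch of 8 caption-video pairs
captions = torch.randn(N, text_dim)
videos = {'face': torch.randn(N, 128), 'audio': torch.randn(N, 128*16),
          'visual': torch.randn(N, 2048), 'motion': torch.randn(N, 1024)}

# visual and motion are available for every sample;
# face and audio only for the first half of the batch
half = np.concatenate((np.ones(N // 2), np.zeros(N // 2)), axis=0)
ind = {'face': half, 'audio': half,
       'visual': np.ones(N), 'motion': np.ones(N)}

matrix_result = mee_block(captions, videos, ind, conf=True)     # N x N scores
pairwise_result = mee_block(captions, videos, ind, conf=False)  # N scores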

Reproducing results on the MPII and MSR-VTT datasets

Downloading the data:

wget https://www.rocq.inria.fr/cluster-willow/amiech/ECCV18/data.zip
unzip data.zip

Training on MSR-VTT:

python train.py --epochs=100 --batch_size=64 --lr=0.0004  --coco_sampling_rate=0.5 --MSRVTT=True --coco=True

Training on MPII:

python train.py --epochs=50 --batch_size=512 --lr=0.0001  --coco=True

Web demo

We implemented a small demo that uses the MEE model to perform text-to-video retrieval. You can search for videos from the MPII (Test/Val) or MSR-VTT datasets with your own query. The model was trained on the MPII dataset.

The demo is available at: http://willow-demo.inria.fr

References

If you use this code, please cite the following paper:

[1] Antoine Miech, Ivan Laptev and Josef Sivic. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv:1804.02516, 2018. https://arxiv.org/abs/1804.02516

@article{miech18learning,
  title={Learning a {T}ext-{V}ideo {E}mbedding from {I}ncomplete and {H}eterogeneous {D}ata},
  author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
  journal={arXiv:1804.02516},
  year={2018},
}

Antoine Miech


mixture-of-embedding-experts's Issues

Question on NetVLAD implementation

Hi @antoine77340,

I couldn't help noticing that in your implementation of NetVLAD you dropped the biases of the conv layer and only keep the multiplication with the weights, specifically on this line:

assignment = th.matmul(x,self.clusters)

The original NetVLAD paper learns both the weights and the biases of the conv layer, as per Equation (3) here: https://openaccess.thecvf.com/content_cvpr_2016/papers/Arandjelovic_NetVLAD_CNN_Architecture_CVPR_2016_paper.pdf

Do you have any rationale for keeping only the multiplication and not the biases?
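
In code, the difference between the two variants is roughly the following (an illustrative sketch with made-up shapes, not the repo's exact code):

import torch
import torch.nn.functional as F

M, D, K = 100, 2048, 16           # descriptors, descriptor dim, clusters (hypothetical)
x = torch.randn(M, D)             # local descriptors
clusters = torch.randn(D, K)      # learned cluster weights
bias = torch.randn(K)             # learned per-cluster biases

# repo's variant: soft-assignment from the weights only
a_repo = F.softmax(torch.matmul(x, clusters), dim=1)
# paper's Eq. (3): weights and biases
a_paper = F.softmax(torch.matmul(x, clusters) + bias, dim=1)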

Any insight would be welcome.

Thanks!

About COCO-related data

Hello,

I am a first-year Master's student doing a research internship on a subject related to video captioning. I find your work on MEE very inspiring and am trying to test your model, but I encountered a problem when running train.py on MSR-VTT:

These two files are missing:
coco_visual_path='data/X_train2014_resnet152.npy', coco_text_path='data/w2v_coco_train2014_1.npy'

Can you add these two files to the public FTP server?

Thanks in advance

Image Feature Extraction Detail Missing

In your journal paper, you mention:

For videos, we extract frames at 25 frames per second and resize each frame
to have a consistent height of 300 pixels

My question is: do you preserve the aspect ratio of the image when resizing to a height of 300 pixels?
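
For reference, the two behaviours correspond to different standard ffmpeg scale filters (paths are hypothetical; only the filter differs):

# height 300, width rescaled to preserve the aspect ratio (-2 keeps it even)
ffmpeg -i input.mp4 -vf scale=-2:300 frames/%06d.jpg
# height 300, input width kept unchanged (aspect ratio NOT preserved)
ffmpeg -i input.mp4 -vf scale=iw:300 frames/%06d.jpg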

Multiple features throw a batchnorm exception, problem with sizes

with open('/home/estathop/Desktop/ordered_word2vec_22_sentences.pickle', 'r') as f42:
    ordered_word_feats = pickle.load(f42)
with open('/home/estathop/Desktop/22featvec.pickle', 'r') as f66:
    listoffeats = pickle.load(f66)

Here I load features I pre-extracted myself. ordered_word_feats is a list of 22 arrays of shape N x 300, where N is the number of words in the text; listoffeats is a list of 22 arrays of shape N x 2048, where N is the number of frames in the video. 22 is the number of video-text pairs, and 300 and 2048 are the feature sizes. ordered_word_feats[0] is the text corresponding to the image features extracted from the video in listoffeats[0], [1] to [1], [2] to [2], etc.

''' Load and initialize the pre-trained model from the Miech et al. journal paper for cross-modality similarity scores '''
video_modality_dim = {'face': (128,128), 'audio': (128*16,128),
'visual': (2048,2048), 'motion': (1024,1024)}
the_model = Net(video_modality_dim, 300, audio_cluster=16)
the_model.load_state_dict(torch.load('/home/estathop/Desktop/journalmodel/msrvttjournal.pt'))
the_model.eval()

Here I load the model and put it in eval mode.

''' create indices for the last model'''
face_ind=np.zeros(1)
audio_ind=np.zeros(1)
motion_ind=np.zeros(1)

'''create the tensors for the last model'''

face_data = torch.from_numpy(np.zeros([1,128]))
audio_data = torch.from_numpy(np.zeros([1,1,128]))
motion_data =torch.from_numpy(np.zeros([1,1024]))

audio_data = audio_data.type(torch.FloatTensor)
audio_data = Variable(audio_data, requires_grad=False)

face_data = face_data.type(torch.FloatTensor)
face_data = Variable(face_data, requires_grad = False)

motion_data = motion_data.type(torch.FloatTensor)
motion_data = Variable(motion_data, requires_grad = False)

Here I create the indicator vectors and placeholder tensors for the modalities whose data I don't have.

pred_true=list()
pred_false=list()
for enum in ordered_word_feats:
    for enum2 in listoffeats:
         word_tensor_to_be = enum.reshape(1,len(enum),300)
         word_tensor = torch.from_numpy(np.array(word_tensor_to_be))
         visual_data = torch.from_numpy(np.array([enum2]))
         
         visual_data = visual_data.type(torch.FloatTensor)
         visual_data = Variable(visual_data, requires_grad = False)
         visual_ind= np.ones(len(enum2))

         ind = {'face': face_ind, 'audio': audio_ind, 'visual': visual_ind, 'motion': motion_ind}
         videos = {'face': face_data, 'audio': audio_data, 'visual': visual_data, 'motion': motion_data}
         ypreds = the_model(word_tensor, videos, ind)
         ypreds2 = the_model(word_tensor, videos, ind, False)
         pred_true.append(ypreds)
         pred_false.append(ypreds2)
         
         

And here is my problem: word_tensor_to_be is supposed to be a (1, N, 300) word feature tensor, with N the number of words in the text, 1 the batch size and 300 the feature size. enum and enum2 are the corresponding numpy arrays iterated from the parent lists of word and image feature vectors described above. My goal is to obtain the final similarity score for all pairs between the 22 texts and 22 videos. But I get the following error:

Traceback (most recent call last):

  File "<ipython-input-67-106513ee0f47>", line 3, in <module>
    ypreds = the_model(word_tensor, videos, ind)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "model.py", line 46, in forward
    return self.mee(text, aggregated_video, ind, conf)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "model.py", line 71, in forward
    video[self.m[i]] = l(video[self.m[i]])

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "model.py", line 128, in forward
    x = self.cg(x)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "model.py", line 145, in forward
    x1 = self.batch_norm(x1)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/batchnorm.py", line 66, in forward
    exponential_average_factor, self.eps)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/functional.py", line 1254, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled

RuntimeError: running_mean should contain 16889 elements not 2048 

As I read in the journal paper, visual_data should be an N x 2048 array, with visual_ind = np.ones(N).

enum2.shape
Out[68]: (16889, 2048)
len(enum2)
Out[69]: 16889

enum.shape
Out[70]: (411, 300)

len(enum)
Out[71]: 411

Any ideas? What am I missing again here?

Documented training on MPII fails

I downloaded both the source code and data.zip, but when I tried to execute
python train.py --epochs=50 --batch_size=512 --lr=0.0001 --coco=True
this message appeared. I guess it has something to do with the relative paths used when loading:

python train.py --epochs=50 --batch_size=512 --lr=0.0001  --coco=True
Namespace(GPU=True, MSRVTT=False, batch_size=512, coco=True, coco_sampling_rate=1.0, epochs=50, eval_qcm=False, lr=0.0001, lr_decay=0.95, margin=0.2, model_name='test', momentum=0.9, n_cpu=1, n_display=100, optimizer='adam', seed=1, text_cluster_size=32)
Pre-loading features ... This may takes several minutes ...
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    path_to_audio, mp_flow_path, mp_face_path, coco=args.coco) 
  File "/home/estathop/Mixture-of-Embedding-Experts/LSMDC.py", line 76, in __init__
    coco_visual = np.load(coco_visual_path)
  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: '../X_train2014_resnet152.npy'

I tried changing the coco_visual_path='../X_train2014_resnet152.npy' and coco_text_path='../w2v_coco_train2014_1.npy' paths in LSMDC.py, but with no success. Any thoughts?

Question about how to extract feature

In the paper, the appearance features are extracted using ResNet-152, and the motion features are computed using a Kinetics pre-trained I3D flow network.
I really want to know how this part is implemented.
If you like, could you share the implementation code or your approach for this part with me?
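
For the appearance features, a minimal sketch with a recent torchvision is shown below (the repo itself targets PyTorch 0.3, so this is illustrative rather than the authors' exact pipeline; the frame path is hypothetical):

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained ResNet-152 with the classifier removed, so the
# forward pass returns the 2048-d globally average-pooled features
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open('frames/000001.jpg').convert('RGB')
with torch.no_grad():
    feat = resnet(preprocess(img).unsqueeze(0))  # shape: (1, 2048)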

downloading data.zip is very slow.

Hello, when I try to download data.zip with wget https://www.rocq.inria.fr/cluster-willow/amiech/ECCV18/data.zip, it is very slow. How can I solve this problem? Can you give some advice?

about test sentences

Hi Antoine, many thanks for sharing the code and data. I would like to compare against your wonderful model on your test split, but I only obtained the sentence features of the test sentences. Could you provide the original 1000 test sentences? Thanks very much.

Meaning of negative final similarity score

When predicting with the pre-trained MSR-VTT model (the net() call) over 22 video-text pairs, I often get negative values; all values are between [-1, 1]. I understand that W_i is positive because it is a ratio of exponentials, but S_i(X, I_i) = <f_i(h(X)), g_i(h_i(I_i))> is a scalar product, so it can be negative, and that is the case in my output below. Do you think something is wrong with my pre-extracted features? Or should I not apply the formula for S_i as above? Or maybe there is a logical flaw in the code implementation? I thought the similarity was supposed to lie in [0, 1]. My guess is that the negative values come from the scalar product, which might be the cosine similarity between unit vectors; in that case a similarity of -0.11 would have the same meaning as +0.11, so taking the absolute value of the resulting matrix might resolve the issue and rescale it to [0, 1]. Is my intuition correct?

print testf
[[ -3.24183628e-02  -7.41406754e-02  -1.15471378e-01  -1.10864259e-01
   -1.98287945e-02  -9.02476162e-02  -3.47756259e-02  -1.00641184e-01
   -1.37256049e-02  -4.02769297e-02  -2.73261871e-02  -1.63121358e-01
    1.03009306e-03  -1.27487540e-01  -1.25783518e-01  -9.84312743e-02
   -6.67047054e-02  -1.32712394e-01   1.02249742e-01  -5.70785441e-02
   -1.87351570e-01  -1.41061395e-01]
 [  7.26137012e-02   2.90240813e-02   2.84049883e-02  -6.38301810e-03
   -1.30889425e-03   1.12976208e-01   7.28432909e-02   6.05115443e-02
   -2.32231170e-02   2.97102742e-02   7.48468796e-03   1.31310478e-01
   -6.04089908e-03   8.95220190e-02   7.28030363e-03   7.41472542e-02
    5.49364602e-03   8.67729634e-02   5.84026203e-02  -2.88652629e-02
    8.19042474e-02   7.37317577e-02]
 [ -1.24743320e-01  -7.99030811e-02  -1.87523142e-01  -1.84981748e-01
   -9.64704081e-02  -9.78353918e-02  -7.32782781e-02  -1.44911975e-01
   -1.22601211e-01  -9.01780128e-02  -1.56859562e-01  -1.05674535e-01
   -1.35695502e-01  -1.53662503e-01  -1.18376255e-01  -1.18971728e-01
   -1.30202979e-01  -1.04229242e-01   3.12037840e-02  -1.72016352e-01
   -1.46726742e-01  -1.64542437e-01]
 [  4.54021841e-02   5.45046367e-02  -4.80982177e-02  -3.07810716e-02
    5.26017025e-02   2.03696340e-02   7.51884207e-02  -6.23101927e-02
    3.00709307e-02   5.13408855e-02  -3.97063233e-03  -2.11593304e-02
    6.98746135e-03  -5.83432242e-03  -1.42855104e-03   2.54983772e-02
    2.02639643e-02  -1.00494139e-02   1.33574635e-01  -8.36661234e-02
   -1.17250439e-03  -4.25007083e-02]
 [ -6.39017150e-02  -6.95983917e-02   1.53941996e-02   2.44147126e-02
    7.68429637e-02  -4.73194942e-03  -3.85971069e-02  -1.54499374e-02
   -2.80762371e-02  -6.79559335e-02  -3.17716273e-04  -3.70029025e-02
    1.55389346e-02   8.87664855e-02   2.35865600e-02  -2.81906556e-02
   -6.47488162e-02  -4.09939513e-02  -6.59288093e-02  -5.31612001e-02
   -3.27354483e-02   4.72476240e-03]
 [ -5.82479425e-02   5.28509310e-03  -1.34228934e-02  -3.38902213e-02
    7.79314116e-02  -1.30767878e-02  -6.77357689e-02  -1.17590472e-01
   -9.07741673e-03  -6.06850572e-02   1.46346260e-02  -7.27157891e-02
   -9.40722525e-02  -7.05330148e-02  -7.52090067e-02  -1.21718809e-01
   -8.08869302e-02  -8.79719183e-02  -2.36649364e-02  -7.15887994e-02
   -1.28242001e-01  -4.69399728e-02]
 [  6.20326512e-02   5.70791811e-02   1.09690391e-01   2.75402516e-02
    2.72457376e-02  -3.65210623e-02  -7.09921494e-02  -1.91165768e-02
    2.58173421e-02   4.46821228e-02   3.41608748e-02   6.28040656e-02
    3.62423882e-02   1.15092315e-01   1.22108325e-01   2.77724136e-02
   -1.53239379e-02  -9.96280760e-02  -6.81859180e-02  -9.34418738e-02
    2.65506115e-02  -3.66509408e-02]
 [ -9.70247984e-02  -6.36440367e-02  -6.86436817e-02  -4.02032062e-02
   -5.69447018e-02  -1.59706306e-02  -1.27269909e-01  -6.11479245e-02
   -1.19319588e-01  -1.27181739e-01  -8.24628919e-02  -2.30135545e-02
   -1.33809328e-01  -5.88734336e-02  -8.47195759e-02  -8.17673281e-03
   -6.27270192e-02  -5.20532578e-02  -2.02408180e-01  -7.06336796e-02
    2.21954249e-02   1.72704495e-02]
 [  2.44908780e-02   1.48762435e-01   8.77285674e-02   2.01166011e-02
    7.41947144e-02   2.57322751e-02  -6.39501885e-02  -4.64443415e-02
    5.12235351e-02   5.36970757e-02   4.69269529e-02   9.78809074e-02
    1.87400058e-02   7.25047514e-02   3.15663517e-02  -1.81421228e-02
   -6.49784878e-02  -6.39207810e-02  -2.08910042e-03  -6.15730546e-02
   -6.85766563e-02  -7.96502605e-02]
 [ -1.65404044e-02  -6.24049036e-03   2.22393453e-01   9.45802256e-02
    8.80606249e-02  -7.35003650e-02  -2.13893130e-02  -4.27991115e-02
    1.08794300e-02   5.49021102e-02   1.67748239e-02  -7.45519949e-03
    2.73488872e-02   7.09499344e-02   4.78424244e-02   6.57194182e-02
   -3.95079814e-02  -7.97091424e-02  -4.54489850e-02   6.10869415e-02
    2.15491168e-02   8.13219044e-03]
 [  9.36637521e-02   1.34750977e-01   4.94977497e-02  -4.53725569e-02
    7.44696334e-02   4.21539359e-02   1.26292836e-02  -2.40726471e-02
    5.69366179e-02   9.32943001e-02   1.19255148e-02   5.91801144e-02
    4.14878801e-02   5.57211637e-02   6.24290481e-02   3.16961808e-03
   -2.23359000e-02  -1.63406823e-02   8.58647674e-02  -7.64402524e-02
   -7.02173822e-03  -5.49222305e-02]
 [ -2.01124288e-02   8.51725414e-03  -2.26553045e-02  -2.96279751e-02
    1.55003527e-02   4.25108224e-02   3.21909264e-02   1.79191828e-02
    5.17049469e-02   1.01298898e-01   1.30138109e-02   1.62815481e-01
   -2.85304580e-02   1.91626847e-02  -6.70741871e-03   3.82079966e-02
    1.79803967e-02   1.38818026e-01   5.86742572e-02   3.03571764e-02
   -3.46273519e-02   9.23256204e-03]
 [ -2.01635242e-01  -2.45678976e-01  -1.70555308e-01  -1.53034672e-01
   -1.75189063e-01  -2.81836778e-01  -2.59397984e-01  -1.27868578e-01
   -2.47567400e-01  -2.50188291e-01  -2.13203445e-01  -2.69998074e-01
   -1.26149431e-01  -1.61028355e-01  -2.41407141e-01  -1.62489206e-01
   -1.93650544e-01  -2.07882926e-01  -1.52326077e-01  -2.20853820e-01
   -2.72521377e-01  -2.28788257e-01]
 [ -9.80144367e-03   5.72884530e-02  -9.54776555e-02  -9.55424979e-02
    1.66246928e-02  -4.50232401e-02  -3.63711361e-03  -4.74896692e-02
    1.62540451e-02   2.24860869e-02  -3.62359099e-02  -7.94385076e-02
    5.69506362e-02  -7.04971924e-02  -6.24199435e-02  -5.67711666e-02
   -3.82282659e-02  -1.08942568e-01   1.63991898e-01  -9.31654871e-02
   -1.56522453e-01  -1.35957435e-01]
 [ -4.17577997e-02  -1.73674338e-02   4.40126471e-02   8.83409940e-03
    8.42970759e-02  -8.30247346e-03  -7.88934231e-02   4.44930084e-02
    1.77199277e-03  -4.80838940e-02   2.54752114e-02  -1.06255990e-02
    7.20020756e-02   1.02943242e-01   1.36651024e-02  -1.28620323e-02
   -6.20622821e-02  -6.62789419e-02  -3.03219929e-02  -9.53692570e-02
   -5.53613231e-02  -1.13135232e-02]
 [ -1.63695216e-01  -1.03556216e-01  -1.30339012e-01  -1.27679422e-01
   -6.53931201e-02  -1.41193062e-01  -9.00636315e-02  -1.05518721e-01
   -1.46505311e-01  -8.17040205e-02  -1.66507512e-01  -5.89924678e-02
   -1.40381262e-01  -7.63726383e-02  -8.16158280e-02  -2.68251300e-02
   -7.30719045e-02  -6.70554638e-02  -1.46689951e-01  -2.19214350e-01
   -8.05616453e-02  -1.12476036e-01]
 [ -1.21404991e-01  -1.14523426e-01  -3.42636555e-02  -4.87659359e-04
   -4.34590727e-02  -1.06190741e-01  -1.00279182e-01  -8.25343654e-03
   -8.18278939e-02  -1.54373929e-01  -5.62417321e-02  -9.04628560e-02
    7.84442015e-03   6.85879588e-02  -4.17854637e-02   2.97641614e-03
   -1.04182869e-01  -9.23364758e-02  -1.04770355e-01  -1.30434185e-01
   -7.37716258e-02  -8.28684494e-02]
 [  1.15031272e-01   2.56469369e-01   1.12943843e-01   4.46119495e-02
    7.23781735e-02   1.97715119e-01   7.89265186e-02   7.82785714e-02
    9.20869112e-02   1.70955852e-01   1.27202481e-01   3.34807426e-01
    3.07610724e-02   1.29343569e-01   1.30846947e-01   6.40140474e-02
    4.00245562e-02   1.46558061e-01   7.64195472e-02   1.31789222e-01
    8.06794465e-02   8.62019435e-02]
 [ -2.04161495e-01  -2.18299240e-01  -2.20326051e-01  -2.49812812e-01
   -1.86067030e-01  -2.44726986e-01  -2.16143042e-01  -2.22517475e-01
   -2.36097068e-01  -1.02723010e-01  -2.54578173e-01  -1.56383947e-01
   -2.23630071e-01  -2.24462405e-01  -1.85922280e-01  -1.94886908e-01
   -2.51363993e-01  -2.08146721e-01  -1.93092525e-01  -2.22982675e-01
   -2.58376628e-01  -2.90976077e-01]
 [ -6.82781860e-02   2.14418732e-02  -1.27149254e-01  -1.51303664e-01
   -5.04882149e-02  -7.07955211e-02  -1.25511080e-01  -1.69080257e-01
   -6.10480197e-02  -3.97486351e-02  -7.43140951e-02  -9.37108174e-02
   -9.71176326e-02  -1.51634470e-01  -3.22163440e-02  -1.65336177e-01
   -1.13592789e-01  -1.89894676e-01  -5.23247272e-02  -1.17505334e-01
   -1.58030182e-01  -1.58447787e-01]
 [ -6.47434965e-02   3.69024463e-02   2.49979142e-02   6.86121807e-02
    5.38992956e-02   1.13212280e-01   8.89809206e-02   6.67528510e-02
    2.75871567e-02  -4.95128818e-02   4.69142459e-02   9.03209001e-02
    1.42190894e-02   1.09815843e-01  -1.24358004e-02   1.21240906e-01
    2.09475979e-02   1.12565108e-01   4.05558012e-02   6.61107339e-03
    3.42538953e-02   8.72310326e-02]
 [ -6.16250597e-02  -4.59563434e-02  -7.37420321e-02  -1.23237528e-01
   -8.27581882e-02  -1.12275667e-01  -1.44379571e-01  -1.48147196e-01
   -1.08328998e-01  -5.41988574e-02  -1.40186787e-01  -3.67872417e-02
   -1.25735983e-01  -4.85603511e-02  -5.80897667e-02  -1.10250957e-01
   -1.32780135e-01  -1.50575638e-01  -5.27918190e-02  -2.26982966e-01
   -1.19033180e-01  -1.68496773e-01]]
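
A tiny numeric illustration of the point in the question (hypothetical values): the softmax weights w_i are positive and sum to 1, but the per-expert similarities s_i are signed dot products, so the weighted sum can come out negative.

import numpy as np

logits = np.array([0.2, -1.0, 0.5, 0.1])
w = np.exp(logits) / np.exp(logits).sum()   # positive weights summing to 1
s = np.array([0.11, -0.30, -0.35, 0.10])    # signed per-expert similarities
print((w * s).sum())                        # approx -0.10: a negative final score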

How to use other video-caption pairs for testing

I closed the previous issue because it was a little bit irrelevant.
As far as I know, some parts of the preprocessing chain are available online: Google's word2vec model for word embeddings, NetVLAD, the ImageNet pre-trained ResNet-152 CNN, the Kinetics pre-trained I3D flow network, and Google's audio CNN.
Say I have my own 100 video-caption pairs: how can I test retrieval (recall@K and median rank) with your model pre-trained on MSR-VTT or MPII, for example? Do I have to extract features the same way you describe in your journal paper and heavily change your code?
@antoine77340
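
For the metrics part, given an N x N text-to-video similarity matrix where the ground-truth video for text i is video i, recall@K and median rank can be computed independently of this repo (a minimal sketch):

import numpy as np

def retrieval_metrics(sim):
    # sim[i, j]: similarity between text i and video j; ground truth on the diagonal
    n = sim.shape[0]
    # rank of the correct video for each text (0 = retrieved first)
    ranks = np.array([(sim[i] > sim[i, i]).sum() for i in range(n)])
    return {'R@1': float((ranks < 1).mean()),
            'R@5': float((ranks < 5).mean()),
            'R@10': float((ranks < 10).mean()),
            'MedR': float(np.median(ranks) + 1)}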

Trying to use ypreds = net(text, videos, ind)

As described in the journal paper, I extracted frames from a random video (25 frames per second, with a standard height of 300 pixels), then passed each frame through a ResNet-152 model and extracted the 2048-d feature vector from the last global average pooling layer. This results in a 1 x 2048 numpy array per image:

import os
import subprocess
#from resnet_152_keras import resnet152_model
from resnet152 import ResNet152
from keras.preprocessing import image
from keras.applications.imagenet_utils import preprocess_input
import numpy as np
import pickle
from model import Net
import torch
from torch.autograd import Variable

def extract_frame(video, dst):
    '''
    Given the input video path, convert each frame of the video
    into jpg format in the destination directory.
    Args:
        video: video path
        dst: destination folder
    '''
    videoid = video.replace('.','/').split('/')
    print videoid[-2]
    with open(os.devnull, "w") as ffmpeg_log:
        command = 'ffmpeg -i ' + video + ' -r 25 ' + ' -vf scale=iw:300 '+' -f image2 ' + '{}%06d.jpg'.format(dst+videoid[-2])
        subprocess.call(command, shell=True, stdout=ffmpeg_log, stderr=ffmpeg_log)
        

extract_frame('/home/estathop/Desktop/extractim/trainvideo7011.mp4', '/home/estathop/Desktop/extractim/images/')

model = ResNet152(include_top=False, weights='imagenet')
    
img_path = '/home/estathop/Desktop/extractim/images/trainvideo7011000002.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features2 = model.predict(x)
featurakia2 = features2[0][0][0]

I also extracted text features with Google's pre-trained word2vec model via the gensim API, which in my case resulted in a 6 x 300 numpy array, one row per word.

You mention that your model can function with some modalities missing, so that's what I am trying to do.
Here is your pre-trained model being loaded, following your code:

video_modality_dim = {'face': (128,128), 'audio': (128*16,128),
'visual': (2048,2048), 'motion': (1024,1024)}

the_model = Net(video_modality_dim, 300, audio_cluster=16, text_cluster=32)
the_model.load_state_dict(torch.load('/home/estathop/Desktop/journalmodel/msrvttjournal.pt'))
the_model.eval()

word_feature_1 = wordsw2v[0]
testix = torch.from_numpy(np.array(wordsw2v[0]))
testix2 = torch.from_numpy(np.array(word_feature_1))
face_ind=np.zeros(0)
audio_ind=np.zeros(0)
motion_ind=np.zeros(0)
visual_ind= np.ones(1)
ind = {'face': face_ind, 'audio': audio_ind, 'visual': visual_ind, 'motion': motion_ind}
#torch.from_numpy(np.array([testList[1]]))
face_data = torch.from_numpy(np.empty([1,128]))

audio_data = torch.from_numpy(np.empty([1,128*16]))

motion_data =torch.from_numpy(np.empty([1,1024]))

visual_data = torch.from_numpy(np.array([featurakia2]))

text = torch.from_numpy(np.array([word_feature_1]))

audio_data = audio_data.type(torch.FloatTensor)
audio_data = Variable(audio_data, requires_grad=False)

#visual_data = torch.from_numpy(np.stack((featurakia,featurakia2)))

videos = {'face': face_data, 'audio': audio_data, 'visual': visual_data, 'motion': motion_data}

ypreds = the_model(text, videos, ind)

In the code above I build the indicator vectors, with everything except visual_ind set to zero and visual_ind set to 1 for one feature vector. I then construct the dictionary with the required data, passing empty numpy arrays for the missing modalities, and the text and visual features accordingly.
When I try `ypreds = the_model(text, videos, ind)`, the following error appears. I assume there is a dimension mismatch that I am unable to find. Any help would be appreciated.

ypreds = the_model(text, videos, ind)
Traceback (most recent call last):

  File "<ipython-input-89-2d0cb1ee9a7e>", line 3, in <module>
    ypreds = the_model(text, videos, ind)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/estathop/model.py", line 39, in forward
    aggregated_video['audio'] = self.audio_pooling(video['audio'])

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/estathop/loupe.py", line 47, in forward
    assignment = assignment.view(-1, max_sample, self.cluster_size)

RuntimeError: invalid argument 2: size '[-1 x 2048 x 16]' is invalid for input with 256 elements at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/TH/THStorage.cpp:80

What is the expert in the video-to-text task?

I read your work and found that the expert in the text-to-video retrieval task is W_i(X). May I ask what the expert is in the video-to-text retrieval task? I'm a little confused, since there are several features (appearance, audio, motion, face, ...) on the video side.

Thank you very much!

Kindly asking for supplementary source code

Training and evaluating work just fine. I have never worked with PyTorch before, and I was wondering if it is possible to provide source code or scripts to

  1. save and load the whole model
  2. run inference on test videos according to your journal paper, or anything else that can run data through your model.

I am trying to reproduce the results of your paper "Learning a Text-Video Embedding from Incomplete and Heterogeneous Data". It would take me days or weeks to do it manually alone, and I would be grateful for anything you can provide to speed up the process (a sketch for point 1 follows below).
@antoine77340
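
For point 1, the standard PyTorch pattern is short. A minimal sketch (the checkpoint path is hypothetical; the Net constructor arguments are copied from the issues above):

import torch
from model import Net

video_modality_dim = {'face': (128, 128), 'audio': (128*16, 128),
                      'visual': (2048, 2048), 'motion': (1024, 1024)}
model = Net(video_modality_dim, 300, audio_cluster=16, text_cluster=32)

# save only the learned parameters
torch.save(model.state_dict(), 'mee_checkpoint.pt')

# load them back into a freshly constructed model
model.load_state_dict(torch.load('mee_checkpoint.pt'))
model.eval()  # disable dropout / freeze batch-norm statistics before inference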
