speechvision
Flickr8k Data, validation
All files below contain utterance data in the same order. The data and models can be downloaded from: https://drive.google.com/open?id=1junt1_4Rk-Xdw8omz6MxxPp4tYN-axCZ
flickr8k_val_text.npy: text of each utterance read by crowd workers
flickr8k_val_spk.npy: speaker ID for each utterance
flickr8k_val_mfcc.npy: mean MFCC features for each utterance
flickr8k_val_conv.npy: mean convolutional layer activations
flickr8k_val_rec.npy: mean recurrent layer activations (for each of 4 layers)
flickr8k_val_emb.npy: utterance embeddings (after the self-attention layer)
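Because all files store utterances in the same order, row i of every array refers to the same utterance. A minimal sketch of loading and aligning them with NumPy (file names are from this README; the array shapes below are stand-ins so the example runs without the download):

```python
import numpy as np

# With the real data one would load the arrays directly, e.g.:
#   text = np.load("flickr8k_val_text.npy")
#   mfcc = np.load("flickr8k_val_mfcc.npy")
# Here tiny stand-in arrays (shapes/dimensions are assumptions,
# not taken from the release) keep the sketch self-contained.
text = np.array(["a dog runs", "a child smiles", "two men talk"])
spk  = np.array(["S01", "S02", "S01"])
mfcc = np.zeros((3, 13))    # one mean MFCC vector per utterance
emb  = np.zeros((3, 512))   # one embedding per utterance

# The alignment invariant: every file has one row per utterance,
# in the same order, so they can be zipped or indexed together.
assert text.shape[0] == spk.shape[0] == mfcc.shape[0] == emb.shape[0]

i = 1
print(spk[i], text[i], mfcc[i].shape, emb[i].shape)
```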
The mean activations (average over time) were extracted from model flickr8k-speech.zip, trained as described in:
Chrupała, G., Gelderloos, L., & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. ACL. arXiv preprint: https://arxiv.org/abs/1702.01991
Places Data, validation
All files below contain utterance data in the same order. The data and models can be downloaded from: https://drive.google.com/file/d/17OSXke01YsgzyCo6dZ-QNfLKFPxyGxom/view?usp=sharing
places_val_text.npy: text of each utterance read by crowd workers
places_val_spk.npy: speaker ID for each utterance
places_val_mfcc.npy: mean MFCC features for each utterance
places_val_conv.npy: mean convolutional layer activations
places_val_rec.npy: mean recurrent layer activations (for each of 4 layers)
places_val_emb.npy: utterance embeddings (after the self-attention layer)
The mean activations (average over time) were extracted from model ...