xinhaomei / audio-text_retrieval
Implementation of our paper 'On Metric Learning For Audio-Text Cross-Modal Retrieval'
Hello, we are trying to set up and train the models on a Mac M2 (10-core GPU, 8 GB unified memory) using the MPS backend, and also on a GTX 1650 Ti with 4 GB of VRAM, but we get the following errors:
Mac
RuntimeError: MPS backend out of memory (MPS allocated: 6.71 GB, other allocations: 939.02 MB, max allowed: 9.07 GB). Tried to allocate 2.85 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
GTX 1650
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.91 GiB (GPU 0; 4.00 GiB total capacity; 3.98 GiB already allocated; 0 bytes free; 4.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Would running this on a GPU with more memory help? Also, why is the environment specific to RTX 30xx cards?
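The two error messages above each name an environment variable that may help; a sketch of applying them before launching training (whether they are enough on a 4 GB / 8 GB device is uncertain, and lifting the MPS cap can destabilize the system, as the message itself warns):

```shell
# Mitigations taken from the error messages above (a sketch, not a guaranteed fix)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # reduce CUDA fragmentation
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0            # lift the MPS memory cap (may cause system failure)
# Reducing data.batch_size in settings.yaml (e.g. 64 -> 16) lowers peak memory further
echo "$PYTORCH_CUDA_ALLOC_CONF"
```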
Currently I'm using the following settings.yaml; the only modification is changing cnn_encoder.pretrained to Yes. I only made sure ./pretrained_models/audio_encoder/ResNet38.pth exists; I don't have path.word2vec.
mode: 'train'
exp_name: 'exp'
dataset: 'AudioCaps'
text_encoder: 'bert'
joint_embed: 1024
wav:
  sr: 32000
  window_size: 1024
  hop_length: 320
  mel_bins: 64
bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes
cnn_encoder:
  model: 'ResNet38'
  pretrained: Yes
  freeze: Yes
data:
  batch_size: 64
  num_workers: 8
training:
  margin: 0.2
  freeze: Yes
  loss: contrastive # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2
path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''
The best results within 50 epochs are:
Caption to audio: r1: 16.96, r5: 47.81, r10: 65.99, r50: 92.23, medr: 6.00, meanr: 17.94
Audio to caption: r1: 24.09, r5: 54.05, r10: 71.66, r50: 94.94, medr: 5.00, meanr: 13.57
which don't reach the fine-tuned results shown in your paper.
I went through the paper but couldn't find details about the fine-tuning procedure. Would you mind elaborating on it?
Could you explain the significance of mask in the NT-Xent loss function?
# Boolean mask: True where two items share the same label (positive pairs)
mask = labels.expand(n, n).eq(labels.expand(n, n).t()).to(a2t.device)
# Isolate the diagonal self-pairs [i, i] as a diagonal matrix
mask_diag = mask.diag()
mask_diag = torch.diag_embed(mask_diag)
# XOR removes the diagonal, leaving only the off-diagonal positives [i, j], i != j
mask = mask ^ mask_diag
# Zero the off-diagonal positives, then average the diagonal terms
a2t_loss = - self.loss(a2t).masked_fill(mask, 0).diag().mean()
t2a_loss = - self.loss(t2a).masked_fill(mask, 0).diag().mean()
From what we have inferred, the mask disregards the diagonal positive pairs [i, i] but takes into account the [i, j] positive pairs (where i != j).
In the final a2t_loss calculation, we take the mean of the diagonal values instead of averaging over the negative pairs. Since the NT-Xent loss is supposed to account for the similarities of negative pairs, where are those being calculated?
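One possible answer, sketched under the assumption that self.loss is a row-wise log-softmax (this is an assumption about the repo's code, not confirmed by the snippet): the negatives are not summed explicitly, they enter through the log-softmax denominator, since each diagonal term log_softmax(a2t)[i, i] = a2t[i, i] - logsumexp(a2t[i, :]) already contains every negative similarity in its logsumexp.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
labels = torch.arange(n)          # each audio clip pairs with exactly one caption
a2t = torch.randn(n, n)           # audio-to-text similarity matrix

# Assumed form of self.loss: row-wise log-softmax over the similarity matrix
log_probs = F.log_softmax(a2t, dim=1)

# Same masking as in the snippet above: keep only off-diagonal positives
mask = labels.expand(n, n).eq(labels.expand(n, n).t())
mask = mask ^ torch.diag_embed(mask.diag())

a2t_loss = -log_probs.masked_fill(mask, 0).diag().mean()

# Equivalent closed form: negatives appear only inside the logsumexp denominator
manual = -(a2t.diag() - torch.logsumexp(a2t, dim=1)).mean()
```

With unique labels (as here) the mask is all False, so the masked_fill is a no-op and the two expressions agree exactly; the masking only matters when a batch contains several captions for the same audio clip.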
I am trying to reproduce the results of "CNN14+BERT", but the results I obtain are significantly different from those shown in the tech report. I would like to know if I did anything wrong.
The config I used is:
mode: 'train'
exp_name: 'exp'
dataset: 'Clotho'
text_encoder: 'bert'
joint_embed: 1024
wav:
  sr: 32000
  window_size: 1024
  hop_length: 320
  mel_bins: 64
bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes
cnn_encoder:
  model: 'Cnn14'
  pretrained: Yes
  freeze: Yes
data:
  batch_size: 64
  num_workers: 0
training:
  margin: 0.2
  freeze: Yes
  loss: ntxent # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2
path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''
Finally I got:
Caption to audio: r1: 7.69, r5: 24.38, r10: 36.11, r50: 69.15, medr: 20.00, meanr: 63.24
But the results for CNN14+BERT in the tech report are:
R@1: 0.147, R@5: 0.377, R@10: 0.495
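For reference, all of these metrics follow from the per-query rank of the ground-truth item; a small helper illustrating the definitions (a hypothetical sketch, not code from this repo). Note the training log prints recall as percentages while the tech report uses fractions, so r1: 7.69 corresponds to 0.0769 against the reported 0.147.

```python
from statistics import mean, median

def retrieval_metrics(ranks, ks=(1, 5, 10, 50)):
    """ranks: 1-indexed rank of the correct item for each query (hypothetical helper).

    Returns recall@k as fractions, plus the median and mean rank.
    """
    r_at = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return r_at, median(ranks), mean(ranks)

# Example: three queries whose correct items ranked 1st, 2nd, and 11th
r_at, medr, meanr = retrieval_metrics([1, 2, 11])
```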
I wish to do a reproducibility study of this work.