xinhaomei / audio-text_retrieval
Implementation of our paper 'On Metric Learning For Audio-Text Cross-Modal Retrieval'
Hello, we are trying to set up and train the models on a Mac M2 (10-core GPU, 8 GB unified memory) using the MPS backend, and also on a GTX 1650 Ti with 4 GB of VRAM, but we get the following errors:
Mac
RuntimeError: MPS backend out of memory (MPS allocated: 6.71 GB, other allocations: 939.02 MB, max allowed: 9.07 GB). Tried to allocate 2.85 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
GTX 1650
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.91 GiB (GPU 0; 4.00 GiB total capacity; 3.98 GiB already allocated; 0 bytes free; 4.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Would running this on a GPU with more memory help? Also, why is the environment specific to RTX 30xx cards?
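The two error messages above each name an environment variable that may help; a sketch of applying them before launching training (whether they are enough on a 4 GB / 8 GB device is uncertain, and lifting the MPS cap can destabilize the system, as the message itself warns):

```shell
# Mitigations taken from the error messages above (a sketch, not a guaranteed fix)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # reduce CUDA fragmentation
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0            # lift the MPS memory cap (may cause system failure)
# Reducing data.batch_size in settings.yaml (e.g. 64 -> 16) lowers peak memory further
echo "$PYTORCH_CUDA_ALLOC_CONF"
```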
Currently I'm using the following settings.yaml; the only modification is changing cnn_encoder.pretrained to Yes. I only made sure ./pretrained_models/audio_encoder/ResNet38.pth exists; I don't have path.word2vec.
mode: 'train'
exp_name: 'exp'
dataset: 'AudioCaps'
text_encoder: 'bert'
joint_embed: 1024
wav:
  sr: 32000
  window_size: 1024
  hop_length: 320
  mel_bins: 64
bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes
cnn_encoder:
  model: 'ResNet38'
  pretrained: Yes
  freeze: Yes
data:
  batch_size: 64
  num_workers: 8
training:
  margin: 0.2
  freeze: Yes
  loss: contrastive # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2
path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''
The best results within 50 epochs are:
Caption to audio: r1: 16.96, r5: 47.81, r10: 65.99, r50: 92.23, medr: 6.00, meanr: 17.94
Audio to caption: r1: 24.09, r5: 54.05, r10: 71.66, r50: 94.94, medr: 5.00, meanr: 13.57
which don't reach the fine-tuned results shown in your paper.
I went through the paper but couldn't find details about the fine-tuning procedure. Would you mind elaborating on it?
Could you explain the significance of mask in the NT-Xent loss function?
# Boolean mask: True where two items share the same label (positive pairs)
mask = labels.expand(n, n).eq(labels.expand(n, n).t()).to(a2t.device)
# Isolate the diagonal self-pairs [i, i] as a diagonal matrix
mask_diag = mask.diag()
mask_diag = torch.diag_embed(mask_diag)
# XOR removes the diagonal, leaving only the off-diagonal positives [i, j], i != j
mask = mask ^ mask_diag
# Zero the off-diagonal positives, then average the diagonal terms
a2t_loss = - self.loss(a2t).masked_fill(mask, 0).diag().mean()
t2a_loss = - self.loss(t2a).masked_fill(mask, 0).diag().mean()
From what we have inferred, the mask disregards the diagonal positive pairs [i, i] but takes into account the [i, j] positive pairs (where i != j).
In the final a2t_loss calculation, we take the mean of the diagonal values instead of averaging over the negative pairs. Since the NT-Xent loss is supposed to account for the similarities of negative pairs, where are those being calculated?
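One possible answer, sketched under the assumption that self.loss is a row-wise log-softmax (this is an assumption about the repo's code, not confirmed by the snippet): the negatives are not summed explicitly, they enter through the log-softmax denominator, since each diagonal term log_softmax(a2t)[i, i] = a2t[i, i] - logsumexp(a2t[i, :]) already contains every negative similarity in its logsumexp.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
labels = torch.arange(n)          # each audio clip pairs with exactly one caption
a2t = torch.randn(n, n)           # audio-to-text similarity matrix

# Assumed form of self.loss: row-wise log-softmax over the similarity matrix
log_probs = F.log_softmax(a2t, dim=1)

# Same masking as in the snippet above: keep only off-diagonal positives
mask = labels.expand(n, n).eq(labels.expand(n, n).t())
mask = mask ^ torch.diag_embed(mask.diag())

a2t_loss = -log_probs.masked_fill(mask, 0).diag().mean()

# Equivalent closed form: negatives appear only inside the logsumexp denominator
manual = -(a2t.diag() - torch.logsumexp(a2t, dim=1)).mean()
```

With unique labels (as here) the mask is all False, so the masked_fill is a no-op and the two expressions agree exactly; the masking only matters when a batch contains several captions for the same audio clip.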
I am trying to reproduce the results of "CNN14+BERT", but the results I obtain are significantly different from those shown in the tech report. I would like to know if I did anything wrong.
The config I used is:
mode: 'train'
exp_name: 'exp'
dataset: 'Clotho'
text_encoder: 'bert'
joint_embed: 1024
wav:
  sr: 32000
  window_size: 1024
  hop_length: 320
  mel_bins: 64
bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes
cnn_encoder:
  model: 'Cnn14'
  pretrained: Yes
  freeze: Yes
data:
  batch_size: 64
  num_workers: 0
training:
  margin: 0.2
  freeze: Yes
  loss: ntxent # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2
path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''
Finally I got:
Caption to audio: r1: 7.69, r5: 24.38, r10: 36.11, r50: 69.15, medr: 20.00, meanr: 63.24
But the results for CNN14+BERT in the tech report are:
R@1: 0.147, R@5: 0.377, R@10: 0.495
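For reference, all of these metrics follow from the per-query rank of the ground-truth item; a small helper illustrating the definitions (a hypothetical sketch, not code from this repo). Note the training log prints recall as percentages while the tech report uses fractions, so r1: 7.69 corresponds to 0.0769 against the reported 0.147.

```python
from statistics import mean, median

def retrieval_metrics(ranks, ks=(1, 5, 10, 50)):
    """ranks: 1-indexed rank of the correct item for each query (hypothetical helper).

    Returns recall@k as fractions, plus the median and mean rank.
    """
    r_at = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return r_at, median(ranks), mean(ranks)

# Example: three queries whose correct items ranked 1st, 2nd, and 11th
r_at, medr, meanr = retrieval_metrics([1, 2, 11])
```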
I wish to do a reproducibility study of this work.