audio-text_retrieval's People

Contributors

xinhaomei

audio-text_retrieval's Issues

Need help to train on another GPU

Hello, we are trying to set up and train the models on a Mac M2 (10-core GPU, 8 GB RAM) using the MPS backend, and also on a GTX 1650 Ti with 4 GB of VRAM, but we get the following errors:

Mac:

RuntimeError: MPS backend out of memory (MPS allocated: 6.71 GB, other allocations: 939.02 MB, max allowed: 9.07 GB). Tried to allocate 2.85 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

GTX 1650 Ti:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.91 GiB (GPU 0; 4.00 GiB total capacity; 3.98 GiB already allocated; 0 bytes free; 4.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Would running this on a GPU with more memory help? Why is the environment specific to RTX 30x cards?
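
For reference, both errors show a single allocation of roughly 3 GB failing, so a GPU with more memory would likely help; the usual workarounds are lowering data.batch_size in settings.yaml or accumulating gradients over several smaller micro-batches. A minimal sketch of the accumulation pattern, with a placeholder model and data rather than this repository's model:

import torch
import torch.nn as nn

# Gradient accumulation sketch (placeholder model, not the repo's model):
# several small micro-batches are backpropagated before a single optimizer
# step, so the effective batch size stays large while peak GPU memory is
# driven by the micro-batch size.
model = nn.Linear(64, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

accum_steps = 4      # effective batch = micro_batch * accum_steps
micro_batch = 16     # e.g. instead of data.batch_size: 64

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(micro_batch, 64)
    y = torch.randn(micro_batch, 1024)
    loss = criterion(model(x), y) / accum_steps   # scale so gradients match a full batch
    loss.backward()                               # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()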

How to achieve the fine-tuned results?

Currently I'm using the following settings.yaml; the only modification is changing cnn_encoder.pretrained to Yes. I only made sure ./pretrained_models/audio_encoder/ResNet38.pth exists; I don't have path.word2vec.

mode: 'train'
exp_name: 'exp'
dataset: 'AudioCaps'
text_encoder: 'bert'
joint_embed: 1024

wav:
  sr: 32000
  window_size: 1024
  hop_length: 320
  mel_bins: 64

bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes

cnn_encoder:
  model: 'ResNet38'
  pretrained: Yes
  freeze: Yes

data:
  batch_size: 64
  num_workers: 8

training:
  margin: 0.2
  freeze: Yes
  loss: contrastive  # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2

path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''

The best results over 50 epochs are:

Caption to audio: r1: 16.96, r5: 47.81, r10: 65.99, r50: 92.23, medr: 6.00, meanr: 17.94
Audio to caption: r1: 24.09, r5: 54.05, r10: 71.66, r50: 94.94, medr: 5.00, meanr: 13.57

which don't reach the fine-tuned results shown in your paper:

[image: fine-tuned results from the paper]

I went through the paper but could not find details about the fine-tuning setup. Would you mind elaborating on this?
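
For reference, the settings above keep both encoders frozen (bert_encoder.freeze: Yes, cnn_encoder.freeze: Yes). One plausible reading of the fine-tuned setting, offered only as a guess and not confirmed by the paper, is to unfreeze the pretrained encoders:

bert_encoder:
  type: 'bert-base-uncased'
  freeze: No        # assumption: fine-tuning unfreezes the text encoder

cnn_encoder:
  model: 'ResNet38'
  pretrained: Yes
  freeze: No        # assumption: fine-tuning unfreezes the pretrained audio encoder

A lower learning rate than 1e-4 is often used once the encoders are unfrozen.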

Understanding the NT-Xent loss function

Could you explain the significance of mask in the NT-Xent loss function?

mask = labels.expand(n, n).eq(labels.expand(n, n).t()).to(a2t.device)
mask_diag = mask.diag()
mask_diag = torch.diag_embed(mask_diag)
mask = mask ^ mask_diag

a2t_loss = - self.loss(a2t).masked_fill(mask, 0).diag().mean()
t2a_loss = - self.loss(t2a).masked_fill(mask, 0).diag().mean()

From what we have inferred, mask disregards the diagonal positive pairs (i.e. [i, i]) but takes into account the [i, j] positive pairs (where i != j).

In the final a2t_loss calculation, we take the mean of the diagonal values instead of taking the mean over the negative pairs. Since the NT-Xent loss is supposed to account for negative-pair similarity, how is that being calculated?
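
For reference, if self.loss is a row-wise log-softmax over the similarity matrix (an assumption; check the loss definition in the repository), the negative pairs are not averaged separately: they enter each diagonal term through the softmax denominator. A minimal sketch:

import torch
import torch.nn.functional as F

# Minimal NT-Xent sketch, assuming self.loss is a row-wise log-softmax over
# the similarity matrix: each diagonal term already contains every negative
# of that row inside the softmax denominator.
torch.manual_seed(0)
n, tau = 4, 0.07                         # tau is an assumed temperature value
a2t = torch.randn(n, n)                  # similarity of audio i vs caption j

log_probs = F.log_softmax(a2t / tau, dim=1)
a2t_loss = -log_probs.diag().mean()

# Expanding one diagonal term shows where the negatives appear:
#   -log_probs[i, i] = -a2t[i, i] / tau + log( sum_j exp(a2t[i, j] / tau) )
# The sum over j runs over every caption, so all negative pairs contribute
# even though only the diagonal is kept after masked_fill.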

Cannot get similar results for CNN14+BERT

I am trying to reproduce the results for "CNN14+BERT", but the results I obtain are significantly different from those shown in the tech report. I would like to know if I did anything wrong.

The config I used is:

mode: 'train'
exp_name: 'exp'
dataset: 'Clotho'
text_encoder: 'bert'
joint_embed: 1024

wav:
  sr: 32000 
  window_size: 1024
  hop_length: 320
  mel_bins: 64

bert_encoder:
  type: 'bert-base-uncased'
  freeze: Yes

cnn_encoder:
  model: 'Cnn14'
  pretrained: Yes
  freeze: Yes

data:
  batch_size: 64
  num_workers: 0

training:
  margin: 0.2
  freeze: Yes
  loss: ntxent  # 'triplet', 'weight', 'ntxent'
  spec_augmentation: Yes
  epochs: 50
  lr: !!float 1e-4
  clip_grad: 2
  seed: 20
  resume: No
  l2_norm: Yes
  dropout: 0.2

path:
  vocabulary: 'data/{}/pickles/words_list.p'
  word2vec: 'pretrained_models/w2v_all_vocabulary.model'
  resume_model: ''

Finally I got:

Caption to audio: r1: 7.69, r5: 24.38, r10: 36.11, r50: 69.15, medr: 20.00, meanr: 63.24

But the results for CNN14+BERT in the tech report are:

R@1: 0.147, R@5: 0.377,  R@10: 0.495
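
For reference, the two sets of numbers appear to be on different scales: the training log reports percentages while the tech report uses fractions, so R@1: 0.147 corresponds to 14.7 against the 7.69 obtained here. A generic sketch of how recall@k, median rank and mean rank can be computed from a similarity matrix (illustrative only, not this repository's evaluation code, and assuming exactly one matching caption per audio clip):

import numpy as np

# Generic retrieval metrics over an (n x n) similarity matrix where row i is
# an audio clip and its matching caption sits at column i.
def retrieval_metrics(sim):
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)      # captions sorted by decreasing similarity
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    return {
        "r1":    100.0 * np.mean(ranks < 1),
        "r5":    100.0 * np.mean(ranks < 5),
        "r10":   100.0 * np.mean(ranks < 10),
        "r50":   100.0 * np.mean(ranks < 50),
        "medr":  float(np.median(ranks) + 1),   # 1-based median rank
        "meanr": float(np.mean(ranks) + 1),     # 1-based mean rank
    }

sim = np.random.rand(100, 100)
print(retrieval_metrics(sim))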
