wavcaps's People

Contributors

haoheliu, xinhaomei


wavcaps's Issues

Question of learning time

Could you please provide a rough estimate of the time it takes to complete one epoch with a batch size of 32? I would like to verify if my training setup is configured properly.

Considerations on issue with mute sample (592228)

Hello, thank you very much for your work on this dataset! It's honestly incredible🙏🏻

I wanted to ask, though, whether you ever considered removing samples that contain nothing but silence (either in the metadata or the blacklists) when building the dataset.

To be precise, I don't know how many mute samples there are, and I would argue that they might actually be useful during training, but I have had issues with one in particular:

592228

Even though the file is sampled at 1 Hz (hence very small on disk), loading it with librosa caused my entire training to crash without any particular error: librosa.load() was trying to resample it to the given sampling rate (16 kHz in my case), and since the audio is 200 hours long, the underlying memory allocation failed (around 85 GB of contiguous memory 😅).

Given the absence of a real error message (only a generic RuntimeError: DataLoader worker (pid(s) ***) exited unexpectedly), finding the issue wasn't as easy as solving it (I simply specified the duration parameter in librosa.load() so that the function doesn't try to load the entire file). Still, I wanted to ask whether anything of the sort happened to you (or anyone else) when dealing with the dataset.
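
For reference, a minimal sketch of the workaround described above; the 16 kHz target rate matches my setup, while the 30 s cap is just an illustrative value:

    import librosa

    # Cap the decoded length so a pathological file (e.g. sample 592228,
    # 200 hours at 1 Hz) can never trigger a huge resampling allocation.
    MAX_SECONDS = 30.0  # illustrative cap, not a WavCaps default

    def safe_load(path, sr=16000):
        # duration= makes librosa stop after MAX_SECONDS of audio, so the
        # resampler never sees the full 200-hour signal.
        audio, _ = librosa.load(path, sr=sr, duration=MAX_SECONDS)
        return audio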

Reproducibility

Sorry to bother you again. I want to reproduce the audio-text retrieval results, but I always get outcomes 2-3% lower than the table in your paper, so I wonder if there are any training log files I can refer to?

Question about pretrain

Hi, thank you for publishing WavCaps! It is really useful.
I am trying to reproduce the HTSAT-BART results using pretrain.py, but I ran into the following problems:

  • In pretrain.yaml, the json_files list contains datasets with a _pretrain.json suffix, which are not provided in this repo; I guess the _final.json files provided here are the same.
  • In pretrain_dataset.py, loading the JSON files raises an error because AudioCaps/train.json and Clotho/train.json do not contain the "duration" field.

Am I doing something wrong?
Thank you.
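
If it helps others hitting the same error, below is a minimal sketch for filling in the missing field; the top-level "data" list and the "audio" path key are assumptions about the JSON layout, so check them against the actual files (librosa >= 0.10 uses the path= keyword):

    import json
    import librosa

    def add_durations(json_path):
        # Assumed layout: {"data": [{"audio": "...", ...}, ...]} -- verify
        # against the actual AudioCaps/Clotho JSON files before running.
        with open(json_path) as f:
            meta = json.load(f)
        for entry in meta["data"]:
            if "duration" not in entry:
                # get_duration reads the file header where possible, which
                # is much cheaper than decoding the whole waveform.
                entry["duration"] = librosa.get_duration(path=entry["audio"])
        with open(json_path, "w") as f:
            json.dump(meta, f)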

pretrained audio_encoder

Hi, thank you for sharing this wonderful project. I want to know how the pretrained audio encoders were trained, because I would like to change the feature dimension and network structure based on your code.
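
If the goal is only a different output feature dimension, one common approach is to keep a pretrained encoder and retrain a small projection head; the sketch below is generic (encoder, encoder_dim, and new_dim are placeholders), not the WavCaps training recipe:

    import torch.nn as nn

    class ProjectedEncoder(nn.Module):
        # Wraps a pretrained encoder and maps its embeddings to a new size.
        def __init__(self, encoder, encoder_dim, new_dim):
            super().__init__()
            self.encoder = encoder  # pretrained backbone, optionally frozen
            self.proj = nn.Linear(encoder_dim, new_dim)  # trained from scratch

        def forward(self, audio):
            return self.proj(self.encoder(audio))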

sounddesc

Using the prompt from the paper for LLM processing, the results did not improve.

The different counting of datasets

Thank you very much for your great contributions to the field of audio!
I've downloaded the WavCaps dataset from HuggingFace and unzipped it. However, when I count each data source, the numbers differ slightly from those you report. My statistics are attached below.

Data Source          # audio (claimed)   # audio in JSON files   # audio in ZIP
FreeSound            262,300             262,300 (all)           214,208
BBC Sound Effects    31,291              31,201                  31,201
SoundBible           1,232               1,232                   1,320
AudioSet SL subset   108,317             108,317                 108,317
Total                403,140             403,050                 355,046
WavCaps (paper)      403,050

I found that the sequence of archives in FreeSound is discontinuous, and I wonder whether that means the index information in FreeSound.zip is out of date.
Do you know how the differences were introduced, and is there an easy way to fix them? Since the info in the JSON files is not aligned with the audio sets, it results in extra data-preprocessing work.

Thank you again for preparing the very meaningful datasets!
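
For anyone who wants to reproduce these counts, a minimal sketch; the JSON key and the paths are assumptions about how the archives were unzipped:

    import json
    from pathlib import Path

    def count_source(json_path, audio_dir):
        # Entries listed in the metadata (assumed top-level "data" list)...
        with open(json_path) as f:
            n_json = len(json.load(f)["data"])
        # ...versus audio files actually present on disk.
        n_disk = sum(1 for _ in Path(audio_dir).rglob("*.flac"))
        return n_json, n_disk

    # Hypothetical paths -- substitute your own layout.
    print(count_source("FreeSound/fsd_final.json", "FreeSound/"))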

Blacklist Metadata

Dear Author,

First and foremost, congratulations on your commendable work!

Now, onto the inquiry.

In the directory "blacklist" of the dataset there are 3 json files: "blacklist_exclude_all_ac.json", "blacklist_exclude_test_ac.json" and "blacklist_exclude_ub8k_esc50_vggsound.json".

I'd like to confirm my understanding of the file removal process from the WavCaps dataset as follows:

  • "blacklist_exclude_all_ac.json" excludes ALL data from audiocaps (overlaping with AudioSet) and Clotho (overlaping with FreeSound)
  • "blacklist_exclude_test_ac.json" excludes ALL test (or maybe ALL test AND validation data) data from audiocaps (overlaping with AudioSet) and Clotho (overlaping with FreeSound)
  • "blacklist_exclude_ub8k_esc50_vggsound.json" excludes ALL data from UrbanSound8K (overlaping with FreeSound), ESC-50 (overlaping with FreeSound) and VGGSound (overlaping with AudioSet)

Is this intrepretation correct?
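
For context, here is a sketch of how such a blacklist could be applied when assembling a training set. The schema assumed here, a flat JSON list of file IDs and metadata entries keyed by "id", is a guess, so check it against the actual blacklist files first:

    import json

    def load_blacklist(path):
        # Assumed schema: the JSON decodes to a flat list of file IDs.
        with open(path) as f:
            return set(json.load(f))

    def filter_entries(entries, blacklist):
        # Drop every metadata entry whose "id" appears in the blacklist.
        return [e for e in entries if e["id"] not in blacklist]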

How to download SoundBible files by URL?

I found that the URL from the SoundBible JSON file cannot be downloaded with wget. How can I download this dataset?
wget -O 0.wav http://soundbible.com/grab.php?id=2219&type=wav does not work
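
A likely cause is the unquoted &, which the shell treats as "run in background" and which silently drops the type=wav parameter; quoting the URL (wget -O 0.wav "http://soundbible.com/grab.php?id=2219&type=wav") should help. The same download in Python, as a sketch (SoundBible may additionally reject requests without a browser-like User-Agent):

    import urllib.request

    url = "http://soundbible.com/grab.php?id=2219&type=wav"
    # Some servers reject Python's default User-Agent, so set one explicitly.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp, open("0.wav", "wb") as out:
        out.write(resp.read())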

Very long audio utterances

Hi,

Thanks for making the dataset public.

In the FreeSound split of the WavCaps dataset, I found some very long audio files. For example:

soxi 71336.flac 65749.flac 118791.flac

Input File     : '71336.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 10.0M
Bit Rate       : 268k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'

Input File     : '65749.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 9.32M
Bit Rate       : 248k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'

Input File     : '118791.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 14.4M
Bit Rate       : 385k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'

There are many such files. How do you handle these during model development? Just clip them to a fixed duration? Sorry if I missed this in your paper.
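
For anyone with the same question: a common way to handle over-long clips during training is a random fixed-length crop, sketched below; the 10 s budget is illustrative, and this is not necessarily what the WavCaps authors did:

    import numpy as np

    def random_crop(audio, sr, max_seconds=10.0):
        # Take a random window from clips longer than the budget; shorter
        # clips pass through unchanged (pad them elsewhere if needed).
        max_len = int(sr * max_seconds)
        if len(audio) <= max_len:
            return audio
        start = np.random.randint(0, len(audio) - max_len + 1)
        return audio[start:start + max_len]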

Alignment between audio and text

Hi, I've found a problem with the pretrained models. Although clip-level audio features are well aligned with text features, fine-grained alignment between frame-level audio features and text features is not established. I directly tested the pretrained models on the DESED validation set (a dataset used for sound event detection) and found that all frame-level embeddings share high similarity with the label embeddings, rather than only the specific frames where the corresponding sound event actually takes place. A possible reason is that global average pooling is used in the audio encoder to aggregate local features (instead of the attention pooling used in the original CLIP), so I wonder if you have tried different pooling methods?
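
For reference, a minimal sketch of the attention-pooling alternative mentioned above: a single learned query attending over the frame features, in the spirit of CLIP's attention pool (dimensions and head count are illustrative):

    import torch
    import torch.nn as nn

    class AttentionPool(nn.Module):
        # Aggregates frame-level features with a learned query instead of a
        # global average, so informative frames can dominate the embedding.
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, dim))
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, frames):  # frames: (batch, time, dim)
            q = self.query.expand(frames.size(0), -1, -1)
            pooled, _ = self.attn(q, frames, frames)
            return pooled.squeeze(1)  # (batch, dim)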

Freesound JSON files

Thank you so much for providing such valuable resources for the audio community. The WavCaps dataset can be incredibly helpful for training and evaluating models.

I was wondering if it might be possible to upload the JSON files for the FreeSound dataset to the WavCaps GitHub repository, similar to the files provided for AudioSet, BBC, and SoundBible? Access to the JSON metadata through the repository would make those resources even more useful for researchers.

Thank you again for your contributions to advancing this important area of research.

AttributeError: 'SequentialSampler' object has no attribute 'set_epoch'

Hi, thank you for sharing this wonderful project. When I run pretrain.py, I hit "AttributeError: 'SequentialSampler' object has no attribute 'set_epoch'". Why does this occur? Could you also provide a requirements.txt file? I would like to know the versions of all packages.
Thanks.
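
The error suggests set_epoch() is being called on a non-distributed sampler; only torch.utils.data.distributed.DistributedSampler defines it, and a single-process DataLoader falls back to SequentialSampler. A common guard, sketched (not necessarily the repo's intended fix):

    from torch.utils.data.distributed import DistributedSampler

    def maybe_set_epoch(dataloader, epoch):
        # set_epoch reshuffles shards across processes each epoch; plain
        # samplers don't have it, which produces the AttributeError above.
        sampler = getattr(dataloader, "sampler", None)
        if isinstance(sampler, DistributedSampler):
            sampler.set_epoch(epoch)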

Issue with automatic audio captioning

Dear author, hello!
I am trying to use your model for automatic audio captioning, but I have some questions. I used train.py to train for 10 epochs from scratch and found that CNN14-BART generates captions normally, while HTSAT-BART only generates repeated captions that have nothing to do with the real captions. May I ask why?
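
If the decoder is the Hugging Face BART implementation, repetitive output at inference time can often be reduced with decoding constraints; a generic sketch follows (model and inputs are placeholders, and this only masks the symptom rather than fixing an undertrained model):

    def generate_caption(model, inputs):
        # Standard Hugging Face generate() arguments that curb repetition.
        return model.generate(
            inputs,
            num_beams=4,             # beam search instead of greedy decoding
            no_repeat_ngram_size=3,  # forbid repeating any 3-gram
            max_length=50,
        )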

Question audio caption evaluating

I'm testing audio captioning performance.
I encountered some problems when using eval_metrics.py in the caption folder.
For example, "image" and "image_id" do not exist in self.dataset, etc.

I guess you have modified pycocotools and pycocoevalcap. Could you provide the modified evaluation tools?
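
As a stopgap, the stock pycocoevalcap tooling expects COCO-style dictionaries, so the captions can be wrapped before scoring; a sketch, where the ids are simply the audio filenames:

    def to_coco_format(refs, hyps):
        # refs: {audio_id: [reference captions]}, hyps: {audio_id: caption}.
        # pycocoevalcap's PTBTokenizer consumes {id: [{"caption": ...}, ...]}.
        gts = {k: [{"caption": c} for c in v] for k, v in refs.items()}
        res = {k: [{"caption": v}] for k, v in hyps.items()}
        return gts, res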

AudioLDM checkpoint and training code

Dear Author,

First of all, I would like to express my gratitude for sharing your code and dataset with the community. Upon reviewing your paper, I noticed that you have conducted experiments on the text-to-audio generation task using the AudioLDM model.

I was wondering if you would be willing to share the training code and checkpoint files for AudioLDM. Access to these resources would provide invaluable insights for my project.

Please let me know if you would be able to share these resources.
I truly appreciate your time and consideration and look forward to hearing from you.

To reproduce zero-shot audio classification result

Dear author, I want to pretrain from scratch and reproduce the zero-shot audio classification result. Should I use the 'blacklist_exclude_ub8k_esc50_vggsound.json' as the blacklist file, and use the 'retrieval/settings/pretrain.yaml' as the configuration?

Question about training data on retrieval task

First of all, thanks for your code!
I wonder how you train AudioCaps + Clotho + WavCaps together. I have seen that the different datasets are fed into the "AudioCaptionDataset" module separately in your code, so I actually trained them one by one, but that couldn't reach the reported results.
Thanks again!
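
One standard way to train on the union of the datasets rather than sequentially is PyTorch's ConcatDataset, sketched below; AudioCaptionDataset is the module named above, the cfg objects are placeholders, and this is not necessarily the authors' exact setup:

    from torch.utils.data import ConcatDataset, DataLoader

    # One dataset per source, concatenated so every batch can mix AudioCaps,
    # Clotho, and WavCaps samples within a single training loop.
    datasets = [AudioCaptionDataset(cfg) for cfg in (ac_cfg, clotho_cfg, wavcaps_cfg)]
    loader = DataLoader(ConcatDataset(datasets), batch_size=32, shuffle=True)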
