xinhaomei / wavcaps
This repository contains metadata for the WavCaps dataset and code for downstream tasks.
Could you please provide a rough estimate of the time it takes to complete one epoch with a batch size of 32? I would like to verify if my training setup is configured properly.
Hello, thank you very much for your work on this dataset! It's honestly incredible🙏🏻
I wanted to ask though if you ever considered removing (either in the metadata or the blacklists) samples with nothing (only silence) in them, when building the dataset.
To be precise, I don't know how many silent samples there are, and I would argue that they might actually be useful during training, but I had issues with one in particular:
Even though the file is sampled at 1 Hz (hence very small on disk), loading it with librosa crashed my entire training without any specific error: librosa.load() tried to resample it to the target sampling rate (16 kHz in my case), but the audio is 200 hours long, so the underlying memory allocation failed (around 85 GB of contiguous memory 😅).
Given the absence of a real error message (only a generic RuntimeError: DataLoader worker (pid(s) ***) exited unexpectedly), finding the issue was harder than solving it (I simply specified the duration parameter in librosa.load() so that the function doesn't try to load the entire file). Still, I wanted to ask whether anything similar has happened to you (or anyone else) when dealing with the dataset.
Sorry to bother you again. I want to reproduce the audio-text retrieval results, but I consistently get scores 2-3% lower than the table in your paper. Are there any training log files I could refer to?
I get a "no such file" error when I try to run a captioning demo.
Hi, thank you for publishing WavCaps! It is really useful.
I am trying to reproduce the HTSAT-BART results using pretrain.py, but I got the following errors:
Am I doing something wrong?
Thank you.
Hi, thank you for sharing this wonderful project. I want to know how the pretrained audio encoder models were trained, as I want to change the feature dimension and network structure based on your code.
Using the LLM processing prompt from the paper, I see no improvement in results.
Thank you very much for your great contributions to the field of audio!
I've downloaded the WavCaps dataset from HuggingFace and unzipped it. However, when I counted the files from each data source, the numbers differed slightly from those you report. My statistics are below.
| Data Source | # audio claimed | # audio in JSON files | # audio in ZIP |
|---|---|---|---|
| FreeSound | 262300 | 262300 (all) | 214208 |
| BBC Sound Effects | 31291 | 31201 | 31201 |
| SoundBible | 1232 | 1232 | 1320 |
| AudioSet SL subset | 108317 | 108317 | 108317 |
| Total | 403140 | 403050 | 355046 |
| WavCaps (paper) | 403050 | | |
I found that the sequence of archives in FreeSound is discontinuous, and I wonder whether the index information in FreeSound.zip is simply outdated.
Do you know how these differences were introduced, and is there an easy way to reconcile them? Since the info in the JSON files is not aligned with the audio sets, this creates extra data-preprocessing work.
Thank you again for preparing the very meaningful datasets!
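One way to pin down the mismatch above is to diff the ids listed in a metadata JSON against the files actually inside the ZIP. This is a hedged sketch: the JSON schema (`{"data": [{"id": ...}]}`) is an assumption about the release format, and tiny synthetic files are built here so the snippet runs as-is.

```python
import json
import zipfile

def ids_from_json(json_path):
    # Assumed schema: top-level "data" list with an "id" field per entry.
    with open(json_path) as f:
        return {entry["id"] for entry in json.load(f)["data"]}

def ids_from_zip(zip_path):
    # Strip directory prefixes and the .flac extension to recover ids.
    with zipfile.ZipFile(zip_path) as zf:
        return {n.rsplit("/", 1)[-1].rsplit(".", 1)[0]
                for n in zf.namelist() if n.endswith(".flac")}

# --- synthetic stand-ins for the real metadata JSON and audio ZIP ---
with open("meta.json", "w") as f:
    json.dump({"data": [{"id": "71336"}, {"id": "65749"}, {"id": "118791"}]}, f)
with zipfile.ZipFile("audio.zip", "w") as zf:
    zf.writestr("FreeSound/71336.flac", b"")
    zf.writestr("FreeSound/65749.flac", b"")

missing = ids_from_json("meta.json") - ids_from_zip("audio.zip")
print(sorted(missing))  # ids present in the metadata but absent from the ZIP
```

Running the same diff against the real `FreeSound` metadata and archives would identify exactly which ~48k entries need to be filtered from the JSON before preprocessing.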
Hi, thanks for your brilliant work. I want to know if there are any plans to release more pretrained weights, such as CNN14-BERT-PT ?
Dear Author,
First and foremost, congratulations on your commendable work!
Now, onto the inquiry.
In the directory "blacklist" of the dataset there are 3 json files: "blacklist_exclude_all_ac.json", "blacklist_exclude_test_ac.json" and "blacklist_exclude_ub8k_esc50_vggsound.json".
I'd like to confirm my understanding of the file removal process from the WavCaps dataset as follows:
Is this interpretation correct?
I found that the URLs from the SoundBible JSON file cannot be downloaded with wget. How can I download this dataset?
wget -O 0.wav http://soundbible.com/grab.php?id=2219&type=wav
does not work
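A likely cause (a guess from the command as written, not a confirmed diagnosis): the URL is unquoted, so the shell treats `&` as a background operator and truncates the query string before wget ever sees `type=wav`. Quoting the URL keeps it intact:

```shell
# Unquoted, the shell splits the command at '&' and runs 'type=wav' as a
# separate command. Quoting preserves the full query string:
url='http://soundbible.com/grab.php?id=2219&type=wav'
echo "$url"
# Then download with the quoted URL (requires network access, and the
# host may have since moved to https):
# wget -O 0.wav "$url"
```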
Hi,
Thanks for making the dataset public.
In the FreeSound split of the WavCaps dataset, I found some very long audio files. For example:
```
$ soxi 71336.flac

Input File     : '71336.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 10.0M
Bit Rate       : 268k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'

Input File     : '65749.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 9.32M
Bit Rate       : 248k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'

Input File     : '118791.flac'
Channels       : 1
Sample Rate    : 32000
Precision      : 16-bit
Duration       : 00:05:00.00 = 9600000 samples ~ 22500 CDDA sectors
File Size      : 14.4M
Bit Rate       : 385k
Sample Encoding: 16-bit FLAC
Comment        : 'encoder=Lavf58.29.100'
```
There are many such files. How do you handle these during model development? Just clip them to a fixed duration? Sorry if I missed this in your paper.
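The clipping idea raised above can be sketched as a simple crop-or-pad step applied at training time. This is a minimal illustration under assumptions, not the paper's exact preprocessing: the 10-second window is a placeholder, and real code would crop waveforms loaded from disk rather than zero arrays.

```python
import numpy as np

def crop_or_pad(audio: np.ndarray, target_len: int, rng=np.random) -> np.ndarray:
    """Return a fixed-length view of `audio` in samples.

    Long clips get a random fixed-length crop; short clips are
    zero-padded at the end to the same length.
    """
    if len(audio) >= target_len:
        start = rng.randint(0, len(audio) - target_len + 1)
        return audio[start:start + target_len]
    return np.pad(audio, (0, target_len - len(audio)))

sr = 32000                                        # matches the soxi output above
ten_sec = 10 * sr                                 # assumed training window
long_clip = np.zeros(5 * 60 * sr, dtype=np.float32)   # a 5-minute file
short_clip = np.zeros(3 * sr, dtype=np.float32)       # a 3-second file
print(crop_or_pad(long_clip, ten_sec).shape,
      crop_or_pad(short_clip, ten_sec).shape)
```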
Hi, I've found a problem with the pretrained models. Although clip-level audio features are well aligned with text features, fine-grained alignment between frame-level audio features and text features is not established: when I test the pretrained models directly on the DESED validation set (a sound event detection dataset), all frame-level embeddings show high similarity to the label embeddings, rather than only the specific frames where the corresponding sound event actually takes place. A possible reason is that global average pooling is used in the audio encoder to aggregate local features (instead of the attention pooling used in the original CLIP). Have you tried different pooling methods?
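For reference, the attention-pooling alternative mentioned above can be sketched as a learned softmax weighting over frame embeddings instead of a uniform average. This is a hedged NumPy illustration: the scoring vector `w` is a random stand-in for a learned parameter, and the shapes are placeholders.

```python
import numpy as np

def attention_pool(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """frames: (T, D) frame embeddings; w: (D,) learned scoring vector.

    Unlike global average pooling, frames that score highly against `w`
    dominate the pooled clip embedding.
    """
    scores = frames @ w                       # (T,) one score per frame
    scores -= scores.max()                    # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return alpha @ frames                     # weighted sum, shape (D,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))           # 100 frames, 64-dim features
w = rng.normal(size=64)                       # stand-in for a learned query
pooled = attention_pool(frames, w)
print(pooled.shape)  # (64,)
```

In a real encoder the scoring vector (or a full query/key projection, as in CLIP's attention pool) would be trained jointly with the rest of the model.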
Thank you so much for providing such valuable resources for the audio community. The WavCaps dataset can be incredibly helpful for training and evaluating models.
I was wondering if it might be possible to upload the JSON files for the FreeSound dataset to the WavCaps GitHub repository, similar to the files provided for AudioSet, BBC, and SoundBible? Access to the JSON metadata through the repository would make those resources even more useful for researchers.
Thank you again for your contributions to advancing this important area of research.
Hi, thank you for sharing this wonderful project. When I run pretrain.py, I hit a bug: "AttributeError: 'SequentialSampler' object has no attribute 'set_epoch'". Why does this occur? Could you also provide a requirements.txt file? I would like to know the versions of all packages.
Thanks.
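A likely cause of the error above (an educated guess, not a confirmed diagnosis of the repository's code): the training loop calls `sampler.set_epoch(epoch)`, but only `torch.utils.data.DistributedSampler` defines `set_epoch`; the plain `SequentialSampler` used in a single-process run does not. A defensive guard avoids the AttributeError, sketched here with stand-in classes so the snippet is self-contained:

```python
class SequentialSamplerStub:
    """Stands in for torch's SequentialSampler (no set_epoch method)."""

class DistributedSamplerStub:
    """Stands in for torch's DistributedSampler (reseeds shuffling per epoch)."""
    def __init__(self):
        self.epoch = None
    def set_epoch(self, epoch):
        self.epoch = epoch

def maybe_set_epoch(sampler, epoch):
    # Only distributed/shuffle-reseeding samplers expose set_epoch;
    # silently skip for everything else instead of raising.
    if hasattr(sampler, "set_epoch"):
        sampler.set_epoch(epoch)

maybe_set_epoch(SequentialSamplerStub(), 3)   # no-op, no AttributeError
dist = DistributedSamplerStub()
maybe_set_epoch(dist, 3)
print(dist.epoch)  # 3
```

Running with the intended multi-GPU/distributed setup would also sidestep the error, since the sampler would then be a DistributedSampler.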
Dear author, hello!
I am trying to use your model for automated audio captioning, but I have some questions. I used train.py to train 10 epochs from scratch and found that CNN14-BART generates captions normally, while HTSAT-BART only generates repetitive captions that have nothing to do with the reference captions. May I ask why?
I'm testing the performance of audio caption.
I encountered some problems when using [eval_metrics.py] in the caption folder.
For example, "image" and "image_id" do not exist in self.dataset, etc.
I guess you have modified pycocotools and pycocoevalcap. Can you provide the modified evaluation tools?
Hi, thank you for sharing this wonderful project. How can I directly use your pretrained model of audio caption to test my audio cases without finetuning?
Thank you for providing such wonderful work! I couldn't find the audio captioning models pretrained on WavCaps (the CNN14-BART and HTSAT-BART baselines). Could you provide them?
Dear Author,
First of all, I would like to express my gratitude for sharing your code and dataset with the community. Upon reviewing your paper, I noticed that you have conducted experiments on the text-to-audio generation task using the AudioLDM model.
I was wondering if you would be willing to share the training code and checkpoint files for AudioLDM. Access to these resources would provide invaluable insights for my project.
Please let me know if you would be able to share these resources.
I truly appreciate your time and consideration and look forward to hearing from you.
Dear author, I want to pretrain from scratch and reproduce the zero-shot audio classification result. Should I use the 'blacklist_exclude_ub8k_esc50_vggsound.json' as the blacklist file, and use the 'retrieval/settings/pretrain.yaml' as the configuration?
First of all, thanks for your code!
I wonder how you train AudioCaps + Clotho + WavCaps together. I have seen that the different datasets are fed into the "AudioCaptionDataset" module separately in your code, so I actually trained on them one by one, but that couldn't reach the reported results.
Thanks again!