retrocirce / hts-audio-transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"

Home Page: https://arxiv.org/abs/2202.00874

License: MIT License

Python 53.49% Shell 0.16% Jupyter Notebook 46.35%
audio-classification sound-event-detection music-information-retrieval transformer-models python

hts-audio-transformer's Introduction

Hierarchical Token Semantic Audio Transformer

Introduction

The Code Repository for "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection", in ICASSP 2022.

In this paper, we devise a model, HTS-AT, by combining a Swin Transformer with a token-semantic module and adapting it to audio classification and sound event detection tasks. HTS-AT is an efficient and lightweight audio transformer with a hierarchical structure and only 30 million parameters. It achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than previous CNN-based models.

HTS-AT Architecture

Classification Results on AudioSet, ESC-50, and Speech Command V2 (mAP)

HTS-AT ClS Result

Localization/Detection Results on DESED dataset (F1-Score)

HTS-AT Localization Result

Getting Started

Install Requirements

pip install -r requirements.txt

We do not include PyTorch in the requirements, since different machines require different versions of CUDA and toolkits. Make sure you install PyTorch following the official guidance.

Install SoX and FFmpeg. We recommend running this code on Linux inside a Conda environment, in which case you can install them with:

sudo apt install sox 
conda install -c conda-forge ffmpeg

Download and Process Datasets

  • config.py
change the variable "dataset_path" to your AudioSet path
change the variable "desed_folder" to your DESED path
change classes_num to 527
./create_index.sh
// remember to change the paths in the script
// more information about this script is at https://github.com/qiuqiangkong/audioset_tagging_cnn

python main.py save_idc 
// count the number of samples in each class and save the npy files
Open the jupyter notebook at esc-50/prep_esc50.ipynb and process it
Open the jupyter notebook at scv2/prep_scv2.ipynb and process it
python conver_desed.py 
// will produce the npy data files

Set the Configuration File: config.py

The script config.py contains all the configurations you need to set before running the code. Please read the introductory comments in the file and change the settings as needed.

IMPORTANT NOTICE

Similar to many transformer architectures, HTS-AT needs a warm-up phase, otherwise the model will underfit at the beginning of training. To find a proper warm-up step or warm-up epoch, pay attention to these two hyperparameters in the configuration file. The default settings work for the full AudioSet (2.2M data samples). If your dataset has a different number of samples (e.g., 100K, 1M, 10M), you may need to adjust the warm-up step or epoch accordingly.
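As a rough illustration of what the warm-up does (a minimal sketch assuming a plain linear warm-up; see config.py for the repository's actual schedule and hyperparameter names):

import torch

# Minimal linear warm-up sketch (illustrative assumption, not the repo's exact scheduler):
# scale the base learning rate from ~0 up to its full value over `warmup_steps`,
# then keep it constant. Shrink `warmup_steps` for smaller datasets.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 1000  # tune this to your dataset size

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(3):
    optimizer.step()
    scheduler.step()
    print(scheduler.get_last_lr())  # learning rate ramps up step by step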

For the most important part: If you want to train/test your model on AudioSet, you need to set:

dataset_path = "your processed audioset folder"
dataset_type = "audioset"
balanced_data = True
loss_type = "clip_bce"
sample_rate = 32000
hop_size = 320 
classes_num = 527

If you want to train/test your model on ESC-50, you need to set:

dataset_path = "your processed ESC-50 folder"
dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000
hop_size = 320 
classes_num = 50

If you want to train/test your model on Speech Command V2, you need to set:

dataset_path = "your processed SCV2 folder"
dataset_type = "scv2"
loss_type = "clip_bce"
sample_rate = 16000
hop_size = 160
classes_num = 35

If you want to test your model on DESED, you need to set:

resume_checkpoint = "Your checkpoint on AudioSet"
heatmap_dir = "localization results output folder"
test_file = "output heatmap name"
fl_local = True
fl_dataset = "Your DESED npy file"

Train and Evaluation

Notice: our model now supports single-GPU training and evaluation.

All scripts are run via main.py:

Train: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train

Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test

Ensemble Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py esm_test 
// See config.py for settings of ensemble testing

Weight Average: python main.py weight_average
// See config.py for settings of weight averaging
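Conceptually, checkpoint weight averaging averages the parameter tensors of several saved checkpoints. A minimal sketch (the checkpoint paths below are placeholders; the actual list and output path are configured in config.py):

import torch

# Hypothetical sketch of checkpoint weight averaging: average the parameters
# of several checkpoints key by key and save the result as a new checkpoint.
ckpt_paths = ["ckpt_1.ckpt", "ckpt_2.ckpt", "ckpt_3.ckpt"]  # placeholder paths
state_dicts = [torch.load(p, map_location="cpu")["state_dict"] for p in ckpt_paths]
avg_state = {
    k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
    for k in state_dicts[0]
}
torch.save({"state_dict": avg_state}, "averaged.ckpt")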

Localization on DESED

CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
// make sure that fl_local=True in config.py
python fl_evaluate.py
// organize and gather the localization results
fl_evaluate_f1.ipynb
// Follow the notebook to produce the results
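If you only want a rough idea of how framewise heatmaps turn into event timestamps, here is a minimal sketch (it assumes a saved heatmap of framewise probabilities with shape (frames, classes); the actual file format and thresholds used by fl_evaluate.py and the notebook may differ):

import numpy as np

# Hypothetical post-processing sketch: threshold a framewise heatmap and
# convert runs of active frames into (onset, offset) segments for one class.
heatmap = np.load("heatmap.npy")          # assumed shape: (frames, classes)
frames_per_second = 100                   # assumed time resolution
active = heatmap[:, 0] > 0.5              # activity mask for class 0

edges = np.flatnonzero(np.diff(np.concatenate(([0], active.astype(int), [0]))))
for onset, offset in edges.reshape(-1, 2):
    print(onset / frames_per_second, offset / frames_per_second)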

Model Checkpoints:

We provide the model checkpoints on the three datasets (and additionally the DESED dataset) in this link. Feel free to download and test them.

Citing

@inproceedings{htsat-ke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {{ICASSP} 2022}
}

Our work is based on the Swin Transformer, a well-known transformer model for image classification.

hts-audio-transformer's People

Contributors

retrocirce


hts-audio-transformer's Issues

Key to checkpoints in drive

Hello!
Thank you for the awesome repo and work! I want to use the fine-tuned AudioSet encoder for the large variant of the model. However, I am not sure which of the provided checkpoints to choose.

Would it be possible to provide a key to the checkpoints stored on drive?

Thank You!

training and evaluation on a single GPU

Hi RetroCirce,
I want to use a single GPU for training and evaluation. How can I manually change the code in sed_model.py and main.py?

 Looking forward to your reply. Thanks.

Model Checkpoints

Hello. I am a student studying various audio transformer papers, and I am impressed by your work. I have a few questions.

  • In the model checkpoints you provide, the filenames for AudioSet range from 1 to 6. What do they signify?
  • How were the models in the 'other settings' folder trained?

Thank you.

masking and difference in mix-up strategy

Hi Ke,

Thanks for the great work and open sourcing the code!

I'd like to build from your excellent codebase, and I have a few questions regarding the details:

  1. I couldn't find any information about a padding mask. Is it not used in the model?
  2. The mix-up procedure seems to be a bit different from AST.
    2.1 In AST, they mix up the raw waveforms before applying transforms, while in HTS-AT you compute fbanks first and then mix up the fbanks.
    2.2 In AST, the mix-up waveform is randomly sampled from the entire dataset, while you sample within the current batch.
    2.3 In AST, they also mix up the labels using lambda * label1 + (1 - lambda) * label2, while HTS-AT does not mix labels.
    Not sure if these three differences make a big difference in performance, but I'm curious about your thoughts (a small sketch of the batch-level fbank mixup I mean is below).
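For clarity, here is a tiny sketch of what I mean by batch-level fbank mixup without label mixing (just my reading of the difference, not the repository's code):

import torch

# Illustrative sketch: permute the batch, blend the fbanks with a random
# lambda, and leave the labels untouched (unlike AST, which also mixes labels).
def batch_fbank_mixup(fbanks: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(fbanks.size(0))
    return lam * fbanks + (1.0 - lam) * fbanks[perm]

fbanks = torch.randn(8, 1024, 64)   # (batch, frames, mel bins)
mixed = batch_fbank_mixup(fbanks)
print(mixed.shape)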

Thanks,
Puyuan

How to perform localization and generate heatmap with AudioSet

Hello RetroCirce, I've been getting familiar with your codebase and I am having issues performing localization and generating heatmaps with the AudioSet dataset. I understand to perform localization with DESED you use the following code:

CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
// make sure that fl_local=True in config.py
python fl_evaluate.py
// organize and gather the localization results
fl_evaluate_f1.ipynb
// Follow the notebook to produce the results

Is there an AudioSet equivalent to this code?
I also noticed in fl_evaluate.py an option to process AudioSet with the following code:

# audioset_process(config.heatmap_dir, config.test_file)

but this function doesn't exist anywhere in the codebase.

Thank you very much for your time.

Question regarding framewise output timesteps

Hi, thank you for sharing this!

I'm trying to use HTS-AT for SED with strong labels, i.e. with known onset and offset times. I have found that with the default config, the input shape is (batch_size, 1001, 527) in the case of AudioSet, whereas the framewise output has shape (batch_size, 1024, 527), as implemented in the method forward_features of the HTSAT_Swin_Transformer class by:

if self.config.htsat_attn_heatmap:
    fpx = interpolate(torch.sigmoid(x).permute(0,2,1).contiguous() * attn, 8 * self.patch_stride[1]) 
else: 
    fpx = interpolate(torch.sigmoid(x).permute(0,2,1).contiguous(), 8 * self.patch_stride[1]) 

Now I wonder what the best strategy would be for computing the loss between the framewise output and the target labels. Normally, I would just generate a target label with the same timestep size as the input spectrogram and then optimize the BCE.

So the question is: Would you, in this case, resize the framewise output to the same size as the input timesteps and then proceed as described above? Or is there a better way?

To be more specific, would something like this make sense:

def out_frames(sec, timesteps = 1024, max_seconds = 10):
    return sec * (timesteps / max_seconds)

label       = np.zeros((1024, 527))
tmp_data    = np.array([ # sample data holding event labels for a given audio file
    [0,   1.5, 0], # onset, offset, class_id
    [9.85, 10, 1]
])

frame_start = np.floor(out_frames(tmp_data[:, 0])).astype(int)
frame_end   = np.ceil(out_frames(tmp_data[:, 1])).astype(int)
se_class    = tmp_data[:, 2].astype(int)

for ind, val in enumerate(se_class):
    label[frame_start[ind]:frame_end[ind], val] = 1 # 1 for active
    
"""
Resulting in:
array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
"""

If the timesteps of the spectrogram and the framewise output were the same, I would normally calculate frame_start and frame_end by

frame_start = np.floor(tmp_data[:, 0] * sr / hop_len).astype(int)
frame_end  = np.ceil(tmp_data[:, 1] * sr / hop_len).astype(int)

In the attached image, the timestep mismatch relative to the input spectrogram is visualized.

Any help is greatly (!) appreciated and thanks again for sharing your code!

Unexpectedly high accuracy of 99 percent

I am getting very high accuracy from the first epoch. I doubt everything is alright. The accuracy on the MIVIA dataset reaches 93 percent in the first epoch itself, and with the pretrained model it reaches 99 percent. Please let me know what the scenario is. (I do not have exposure to PyTorch Lightning.)

my training outputs looks as follows:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

| Name | Type | Params

0 | sed_model | HTSAT_Swin_Transformer | 28.6 M

27.6 M Trainable params
1.1 M Non-trainable params
28.6 M Total params
114.583 Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9361128142244022}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9509503372164316}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9634580012262416}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9735131820968731}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9872470876762722}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9832004904966278}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9729000613120785}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9894543225015328}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9917841814837522}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.992274678111588}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9931330472103004}
Validating: 0it [00:00, ?it/s]
cuda:0 {'acc': 0.9949724095646842}

About sound event detection

Hello,
For the sound event detection part, i.e. the localization part (running the model on the DESED data), what extra operations are involved compared with running a classification task on a dataset like ESC-50? Could you describe this in general terms?
The paper says the final Token Semantic Module implements the localization. As a beginner, I only see convolution operations inside that module; could you roughly explain the details of how localization is achieved?

Validation loss metric

Hi,

Why was the validation loss not tracked as a metric alongside validation accuracy? It's crucial and necessary for identifying overfitting during training. In your code, there are no logs for the validation loss, only for train loss. How can you tell while training if the model is overfitting?

Thank you!

Reproducing training on ESC-50 gives an error

My CUDA version is 11.1, PyTorch version is 1.9.0, and PyTorch Lightning version is 1.6.0.

When I reproduce your HTS-AT training on ESC-50, I get this error. Could you help me solve it?

Details

Traceback (most recent call last):
File "main.py", line 432, in
main()
File "main.py", line 428, in main
train()
File "main.py", line 398, in train
trainer.fit(model, audioset_data)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
result = self._run_optimization(
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1596, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/optim/adamw.py", line 65, in step
loss = closure()
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
closure_result = closure()
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in call
self._result = self.closure(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 143, in closure
self._backward_fn(step_output.closure_loss)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 311, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, *args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1370, in backward
loss.backward(*args, **kwargs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/data/chenyuanjian/anaconda3/envs/htstrans/lib/python3.8/site-packages/torch/autograd/init.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: upsample_bicubic2d_backward_out_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

Audioset dataset for pretraining

Dear author,
since the model depends on pretraining on AudioSet to reach the highest score, why not share the dataset and the pretrained model file? The AudioSet dataset keeps becoming partially invalid over time.

How can i use my own dataset in this model:

Hello, both ESC-50 and AudioSet consist of 5-second and 10-second clips, so each audio is no more than 10 seconds. But each audio in my own dataset is 60 seconds long. In this case, can your model be used on my dataset? How can I modify the code to use it on my dataset?

If you can answer, I would be grateful!

Learning rate

Hi
I think the learning rate in the code differs from the paper.
The paper says the learning rate is 0.05, 0.1, 0.2 in the first three epochs, but the code lr is 2e-5, 5e-5, 1e-4 (config.py lr_rate = [0.02, 0.05, 0.1]).
I revised lr_rate = [50, 100, 200] to match the paper, but model training shows bad results.
I want to know which setting is right to get the same results as reported in the paper.
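(For reference, my reading of the numbers above is that lr_rate holds multipliers applied to a base learning rate of 1e-3 during warm-up; this is an assumption, but it reproduces the quoted values:)

# Assumed relationship between the base lr and the lr_rate multipliers:
base_lr = 1e-3
lr_rate = [0.02, 0.05, 0.1]            # values from config.py
print([base_lr * r for r in lr_rate])  # -> roughly [2e-05, 5e-05, 0.0001]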

upsample_bicubic2d_backward_out_cuda

When training the model on ESC-50, I run into this error:
RuntimeError: upsample_bicubic2d_backward_out_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'.

If I set torch.use_deterministic_algorithms to False,
the accuracy can only reach 0.65.

env :
torch 1.12.1
pytorch-lightning 1.5.9

TypeError: cannot pickle 'module' object

I am running htsat_esc_training.ipynb and getting this error on my PC.

Python version: 3.9.12
Installed all requirements from requirements.txt.
Ran the notebook in VSCode.
No changes to the code.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
trainer properties:
gpus: 1
max_epochs: 100
auto_lr_find: True
accelerator: <pytorch_lightning.accelerators.gpu.GPUAccelerator object at 0x0000016B08410790>
num_sanity_val_steps: 0
resume_from_checkpoint: None
gradient_clip_val: 1.0

Error:


TypeError Traceback (most recent call last)
Cell In [26], line 3
1 # Training the model
2 # You can set different fold index by setting 'esc_fold' to any number from 0-4 in esc_config.py
----> 3 trainer.fit(model, audioset_data)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:740, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, train_dataloader, ckpt_path)
735 rank_zero_deprecation(
736 "trainer.fit(train_dataloader) is deprecated in v1.4 and will be removed in v1.6."
737 " Use trainer.fit(train_dataloaders) instead. HINT: added 's'"
738 )
739 train_dataloaders = train_dataloader
--> 740 self._call_and_handle_interrupt(
741 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
742 )

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:685, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
675 r"""
676 Error handling, intended to be used only for main trainer function entry points (fit, validate, test, predict)
677 as all errors should funnel through them
(...)
682 **kwargs: keyword arguments to be passed to trainer_fn
683 """
684 try:
--> 685 return trainer_fn(*args, **kwargs)
686 # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7
687 except KeyboardInterrupt as exception:

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:777, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
775 # TODO: ckpt_path only in v1.7
776 ckpt_path = ckpt_path or self.resume_from_checkpoint
--> 777 self._run(model, ckpt_path=ckpt_path)
779 assert self.state.stopped
780 self.training = False

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:1199, in Trainer._run(self, model, ckpt_path)
1196 self.checkpoint_connector.resume_end()
1198 # dispatch start_training or start_evaluating or start_predicting
-> 1199 self._dispatch()
1201 # plugin will finalized fitting (e.g. ddp_spawn will load trained model)
1202 self._post_dispatch()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:1279, in Trainer._dispatch(self)
1277 self.training_type_plugin.start_predicting(self)
1278 else:
-> 1279 self.training_type_plugin.start_training(self)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py:202, in TrainingTypePlugin.start_training(self, trainer)
200 def start_training(self, trainer: "pl.Trainer") -> None:
201 # double dispatch to initiate the training loop
--> 202 self._results = trainer.run_stage()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:1289, in Trainer.run_stage(self)
1287 if self.predicting:
1288 return self._run_predict()
-> 1289 return self._run_train()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\trainer.py:1319, in Trainer._run_train(self)
1317 self.fit_loop.trainer = self
1318 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1319 self.fit_loop.run()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\loops\base.py:145, in Loop.run(self, *args, **kwargs)
143 try:
144 self.on_advance_start(*args, **kwargs)
--> 145 self.advance(*args, **kwargs)
146 self.on_advance_end()
147 self.restarting = False

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\loops\fit_loop.py:234, in FitLoop.advance(self)
231 data_fetcher = self.trainer._data_connector.get_profiled_dataloader(dataloader)
233 with self.trainer.profiler.profile("run_training_epoch"):
--> 234 self.epoch_loop.run(data_fetcher)
236 # the global step is manually decreased here due to backwards compatibility with existing loggers
237 # as they expect that the same step is used when logging epoch end metrics even when the batch loop has
238 # finished. this means the attribute does not exactly track the number of optimizer steps applied.
239 # TODO(@carmocca): deprecate and rename so users don't get confused
240 self.global_step -= 1

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\loops\base.py:140, in Loop.run(self, *args, **kwargs)
136 return self.on_skip()
138 self.reset()
--> 140 self.on_run_start(*args, **kwargs)
142 while not self.done:
143 try:

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py:141, in TrainingEpochLoop.on_run_start(self, data_fetcher, **kwargs)
138 self.trainer.fit_loop.epoch_progress.increment_started()
140 self._reload_dataloader_state_dict(data_fetcher)
--> 141 self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_idx + 1)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\loops\utilities.py:121, in _update_dataloader_iter(data_fetcher, batch_idx)
118 """Attach the dataloader."""
119 if not isinstance(data_fetcher, DataLoaderIterDataFetcher):
120 # restore iteration
--> 121 dataloader_iter = enumerate(data_fetcher, batch_idx)
122 else:
123 dataloader_iter = iter(data_fetcher)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\utilities\fetching.py:198, in AbstractDataFetcher.iter(self)
196 self.reset()
197 self.dataloader_iter = iter(self.dataloader)
--> 198 self._apply_patch()
199 self.prefetching(self.prefetch_batches)
200 return self

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\utilities\fetching.py:133, in AbstractDataFetcher._apply_patch(self)
130 loader._lightning_fetcher = self
131 patch_dataloader_iterator(loader, iterator, self)
--> 133 apply_to_collections(self.loaders, self.loader_iters, (Iterator, DataLoader), _apply_patch_fn)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\utilities\fetching.py:181, in AbstractDataFetcher.loader_iters(self)
178 raise MisconfigurationException("The dataloader_iter isn't available outside the iter context.")
180 if isinstance(self.dataloader, CombinedLoader):
--> 181 loader_iters = self.dataloader_iter.loader_iters
182 else:
183 loader_iters = [self.dataloader_iter]

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\supporters.py:537, in CombinedLoaderIterator.loader_iters(self)
535 """Get the _loader_iters and create one if it is None."""
536 if self._loader_iters is None:
--> 537 self._loader_iters = self.create_loader_iters(self.loaders)
539 return self._loader_iters

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\trainer\supporters.py:577, in CombinedLoaderIterator.create_loader_iters(loaders)
568 """Create and return a collection of iterators from loaders.
569
570 Args:
(...)
574 a collections of iterators
575 """
576 # dataloaders are Iterable but not Sequences. Need this to specifically exclude sequences
--> 577 return apply_to_collection(loaders, Iterable, iter, wrong_dtype=(Sequence, Mapping))

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\pytorch_lightning\utilities\apply_func.py:95, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, *args, **kwargs)
93 # Breaking condition
94 if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 95 return function(data, *args, **kwargs)
97 elem_type = type(data)
99 # Recursively apply to collection items

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\torch\utils\data\dataloader.py:444, in DataLoader.iter(self)
442 return self._iterator
443 else:
--> 444 return self._get_iterator()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\torch\utils\data\dataloader.py:390, in DataLoader._get_iterator(self)
388 else:
389 self.check_worker_number_rationality()
--> 390 return _MultiProcessingDataLoaderIter(self)

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\torch\utils\data\dataloader.py:1077, in _MultiProcessingDataLoaderIter.init(self, loader)
1070 w.daemon = True
1071 # NB: Process.start() actually take some time as it needs to
1072 # start a process and pass the arguments over via a pipe.
1073 # Therefore, we only add a worker to self._workers list after
1074 # it started, so that we do not call .join() if program dies
1075 # before it starts, and del tries to join but will get:
1076 # AssertionError: can only join a started process.
-> 1077 w.start()
1078 self._index_queues.append(index_queue)
1079 self._workers.append(w)

File ~\AppData\Local\Programs\Python\Python39\lib\multiprocessing\process.py:121, in BaseProcess.start(self)
118 assert not _current_process._config.get('daemon'),
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect
124 # reference to the process object (see bpo-30775)

File ~\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)

File ~\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py:327, in SpawnProcess._Popen(process_obj)
324 @staticmethod
325 def _Popen(process_obj):
326 from .popen_spawn_win32 import Popen
--> 327 return Popen(process_obj)

File ~\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py:93, in Popen.init(self, process_obj)
91 try:
92 reduction.dump(prep_data, to_child)
---> 93 reduction.dump(process_obj, to_child)
94 finally:
95 set_spawning_popen(None)

File ~\AppData\Local\Programs\Python\Python39\lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)

TypeError: cannot pickle 'module' object


Which version of PyTorch would you recommend?
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

error in training

Hi,RetroCirce

I get the following error when running the ESC-50 training script. How can I solve it?

RuntimeError: upsample_bicubic2d_backward_out_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. Epoch 0: 0%| | 0/65 [00:04<?, ?it/s]

Questions about models.py

Hi. A dumb question: may I ask what models.py is used for? I thought your model was in htsat.py...

QUESTION

Can I run this project with a single GPU?
Or what should I change to run it on a single GPU?

Hello, I have the following problem when running the code, what is the reason?

RuntimeError: upsample_bicubic2d_backward_out_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'.
You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application.
You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation

Accuracy does not reach the level reported in the paper

I ran your model following your configuration on the ESC-50 dataset, with the Swin Transformer pretrained model loaded, but the final accuracy (shown in the attached image) only reaches 0.88, and the same holds for testing. I don't know where the problem is; could you give some advice? Thank you very much!

How can I test model?

Hello, in "htsat_esc_training.ipynb", how can I find the test_data? Or should I prepare the test_data myself? I'm not familiar with pytorch-lightning.

Usage on Strongly labelled Dataset for SED

Hi,

Thank you for your code and the great work!

I had a question regarding the usage of strongly labelled data for the task of sound event detection. How can I run inference/prediction to get the timestamps of the sound events along with their classes?

Thanks in advance!

How can run this project with one GPU?!

Hi.
I am trying to test this model on Google Speech Commands, but after the testing progress bar completes, I get this error:
Default process group has not been initialized, please make sure to call init_process_group

I googled this error and found that it happens because of 'SyncBatchNorm' on a single GPU, and that I should replace those layers with normal ones.
In your code I found the 'sync_batchnorm' parameter in pl.Trainer(), set its value to 'False', and ran the test command again, but it didn't work.

Could anyone please tell me how I can run this project on 1 GPU?
Thanks.

How to finetune on strong label dataset?

Hello, excellent work! But I have some questions about fine-tuning on a strongly labelled dataset. In issue 25 you mentioned "need to extract different output of HST-AT (I believe it is the last second layer feature-map output)"; does this "last second layer" refer to the output of the token semantic module? You also mentioned "the interpolation and resolution of the output may be different from the input localization time resolution ----- in that you need to find a way to align them." In your code the output time axis is interpolated to a length of 1024; is that one way to handle this? I would greatly appreciate an answer to my questions!

Getting started with a custom dataset

Hi,

Thank you for your excellent work!

I want to use HTS-Audio-Transformer for my custom dataset, different classification task.

Are there any instructions on how to run the model for a different dataset? From which file should I start?

Thanks

bug

Global seed set to 970131

each batch size: 8

INFO:root:total dataset size: 400

GPU available: True, used: True

TPU available: False, using: 0 TPU cores

IPU available: False, using: 0 IPUs

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Testing: 98%|█████████▊| 49/50 [00:03<00:00, 22.58it/s]Traceback (most recent call last):

File "E:/HTS/HTS/main.py", line 429, in

main()

File "E:/HTS/HTS/main.py", line 417, in main

test()

File "E:/HTS/HTS/main.py", line 247, in test

trainer.test(model, datamodule=audioset_data)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 911, in test

return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt

return trainer_fn(*args, **kwargs)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 954, in _test_impl

results = self._run(model, ckpt_path=self.tested_ckpt_path)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run

self._dispatch()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1275, in _dispatch

self.training_type_plugin.start_evaluating(self)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 206, in start_evaluating

self._results = trainer.run_stage()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1286, in run_stage

return self._run_evaluate()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1334, in _run_evaluate

eval_loop_results = self._evaluation_loop.run()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\base.py", line 151, in run

output = self.on_run_end()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 131, in on_run_end

self._evaluation_epoch_end(outputs)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 231, in _evaluation_epoch_end

model.test_epoch_end(outputs)

File "E:\HTS\HTS\sed_model.py", line 201, in test_epoch_end

gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())] 

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 748, in get_world_size

return _get_group_size(group)

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 274, in _get_group_size

default_pg = _get_default_group()

File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 358, in _get_default_group

raise RuntimeError("Default process group has not been initialized, "

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code 1

I ran the test process on the ESC-50 dataset with a single GPU and got the error above. I commented out the code related to gather_pred, gather_target, and some of the dist.xxxxx calls, and then it works.

I want to know whether I can run the code on a single GPU with the changes above.

About shape of input wav

Hi, thanks for your paper and code!
I want to do a task similar to masked audio prediction, which requires frame-level input. However, I found that the input dimension is [batch_size, 1, 256, 256], the patch embedding is [batch_size, 4096, 96], and the predicted fine_grained_embedding is torch.Size([batch_size, 1024, 768]). I want the input to correspond to the 1024 frames of the output. Where is the "pad to 1024 frames" mentioned in the paper? (See the attached image.)

Stopped during Audioset training

Hi, thanks for your good study.
I tried training on AudioSet using your code.
After 10% of epoch 2, training stops progressing (it always happens at this epoch).
It does not shut down; it just stops making progress.
Can I know why this happens?
I attached the log monitor.

get bad result for esc50

I have one GPU, so I changed some code in model.py, sed_model.py, and main.py,

and set config.py like this:

dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000
classes_num = 50

Then I only get ACC: 0.55.

I changed

deterministic=False
dist.init_process_group(backend="nccl", init_method="tcp://localhost:23456", rank=0, world_size=1) for the init_process_group error

The code can run, but I do not get the same results as in your paper.

cannot pickle 'module' object when running the htsat_esc_training

Once trainer.fit(model, audioset_data) is called, the error below is output. Any help on the matter would be greatly appreciated!


TypeError Traceback (most recent call last)
Cell In [10], line 3
1 # Training the model
2 # You can set different fold index by setting 'esc_fold' to any number from 0-4 in esc_config.py
----> 3 trainer.fit(model, audioset_data)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:740, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, train_dataloader, ckpt_path)
735 rank_zero_deprecation(
736 "trainer.fit(train_dataloader) is deprecated in v1.4 and will be removed in v1.6."
737 " Use trainer.fit(train_dataloaders) instead. HINT: added 's'"
738 )
739 train_dataloaders = train_dataloader
--> 740 self._call_and_handle_interrupt(
741 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
742 )

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:685, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
675 r"""
676 Error handling, intended to be used only for main trainer function entry points (fit, validate, test, predict)
677 as all errors should funnel through them
(...)
682 **kwargs: keyword arguments to be passed to trainer_fn
683 """
684 try:
--> 685 return trainer_fn(*args, **kwargs)
686 # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7
687 except KeyboardInterrupt as exception:

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:777, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
775 # TODO: ckpt_path only in v1.7
776 ckpt_path = ckpt_path or self.resume_from_checkpoint
--> 777 self._run(model, ckpt_path=ckpt_path)
779 assert self.state.stopped
780 self.training = False

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1199, in Trainer._run(self, model, ckpt_path)
1196 self.checkpoint_connector.resume_end()
1198 # dispatch start_training or start_evaluating or start_predicting
-> 1199 self._dispatch()
1201 # plugin will finalized fitting (e.g. ddp_spawn will load trained model)
1202 self._post_dispatch()

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1279, in Trainer._dispatch(self)
1277 self.training_type_plugin.start_predicting(self)
1278 else:
-> 1279 self.training_type_plugin.start_training(self)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py:202, in TrainingTypePlugin.start_training(self, trainer)
200 def start_training(self, trainer: "pl.Trainer") -> None:
201 # double dispatch to initiate the training loop
--> 202 self._results = trainer.run_stage()

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1289, in Trainer.run_stage(self)
1287 if self.predicting:
1288 return self._run_predict()
-> 1289 return self._run_train()

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1319, in Trainer._run_train(self)
1317 self.fit_loop.trainer = self
1318 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1319 self.fit_loop.run()

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/loops/base.py:145, in Loop.run(self, *args, **kwargs)
143 try:
144 self.on_advance_start(*args, **kwargs)
--> 145 self.advance(*args, **kwargs)
146 self.on_advance_end()
147 self.restarting = False

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py:234, in FitLoop.advance(self)
231 data_fetcher = self.trainer._data_connector.get_profiled_dataloader(dataloader)
233 with self.trainer.profiler.profile("run_training_epoch"):
--> 234 self.epoch_loop.run(data_fetcher)
236 # the global step is manually decreased here due to backwards compatibility with existing loggers
237 # as they expect that the same step is used when logging epoch end metrics even when the batch loop has
238 # finished. this means the attribute does not exactly track the number of optimizer steps applied.
239 # TODO(@carmocca): deprecate and rename so users don't get confused
240 self.global_step -= 1

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/loops/base.py:140, in Loop.run(self, *args, **kwargs)
136 return self.on_skip()
138 self.reset()
--> 140 self.on_run_start(*args, **kwargs)
142 while not self.done:
143 try:

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:141, in TrainingEpochLoop.on_run_start(self, data_fetcher, **kwargs)
138 self.trainer.fit_loop.epoch_progress.increment_started()
140 self._reload_dataloader_state_dict(data_fetcher)
--> 141 self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_idx + 1)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:121, in _update_dataloader_iter(data_fetcher, batch_idx)
118 """Attach the dataloader."""
119 if not isinstance(data_fetcher, DataLoaderIterDataFetcher):
120 # restore iteration
--> 121 dataloader_iter = enumerate(data_fetcher, batch_idx)
122 else:
123 dataloader_iter = iter(data_fetcher)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py:198, in AbstractDataFetcher.iter(self)
196 self.reset()
197 self.dataloader_iter = iter(self.dataloader)
--> 198 self._apply_patch()
199 self.prefetching(self.prefetch_batches)
200 return self

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py:133, in AbstractDataFetcher._apply_patch(self)
130 loader._lightning_fetcher = self
131 patch_dataloader_iterator(loader, iterator, self)
--> 133 apply_to_collections(self.loaders, self.loader_iters, (Iterator, DataLoader), _apply_patch_fn)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py:181, in AbstractDataFetcher.loader_iters(self)
178 raise MisconfigurationException("The dataloader_iter isn't available outside the iter context.")
180 if isinstance(self.dataloader, CombinedLoader):
--> 181 loader_iters = self.dataloader_iter.loader_iters
182 else:
183 loader_iters = [self.dataloader_iter]

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py:537, in CombinedLoaderIterator.loader_iters(self)
535 """Get the _loader_iters and create one if it is None."""
536 if self._loader_iters is None:
--> 537 self._loader_iters = self.create_loader_iters(self.loaders)
539 return self._loader_iters

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py:577, in CombinedLoaderIterator.create_loader_iters(loaders)
568 """Create and return a collection of iterators from loaders.
569
570 Args:
(...)
574 a collections of iterators
575 """
576 # dataloaders are Iterable but not Sequences. Need this to specifically exclude sequences
--> 577 return apply_to_collection(loaders, Iterable, iter, wrong_dtype=(Sequence, Mapping))

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py:95, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, *args, **kwargs)
93 # Breaking condition
94 if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 95 return function(data, *args, **kwargs)
97 elem_type = type(data)
99 # Recursively apply to collection items

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/torch/utils/data/dataloader.py:435, in DataLoader.iter(self)
433 return self._iterator
434 else:
--> 435 return self._get_iterator()

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/torch/utils/data/dataloader.py:381, in DataLoader._get_iterator(self)
379 else:
380 self.check_worker_number_rationality()
--> 381 return _MultiProcessingDataLoaderIter(self)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1034, in _MultiProcessingDataLoaderIter.init(self, loader)
1027 w.daemon = True
1028 # NB: Process.start() actually take some time as it needs to
1029 # start a process and pass the arguments over via a pipe.
1030 # Therefore, we only add a worker to self._workers list after
1031 # it started, so that we do not call .join() if program dies
1032 # before it starts, and del tries to join but will get:
1033 # AssertionError: can only join a started process.
-> 1034 w.start()
1035 self._index_queues.append(index_queue)
1036 self._workers.append(w)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/process.py:121, in BaseProcess.start(self)
118 assert not _current_process._config.get('daemon'),
119 'daemonic processes are not allowed to have children'
120 _cleanup()
--> 121 self._popen = self._Popen(self)
122 self._sentinel = self._popen.sentinel
123 # Avoid a refcycle if the target function holds an indirect
124 # reference to the process object (see bpo-30775)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/context.py:224, in Process._Popen(process_obj)
222 @staticmethod
223 def _Popen(process_obj):
--> 224 return _default_context.get_context().Process._Popen(process_obj)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/context.py:284, in SpawnProcess._Popen(process_obj)
281 @staticmethod
282 def _Popen(process_obj):
283 from .popen_spawn_posix import Popen
--> 284 return Popen(process_obj)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/popen_spawn_posix.py:32, in Popen.init(self, process_obj)
30 def init(self, process_obj):
31 self._fds = []
---> 32 super().init(process_obj)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/popen_fork.py:19, in Popen.init(self, process_obj)
17 self.returncode = None
18 self.finalizer = None
---> 19 self._launch(process_obj)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/popen_spawn_posix.py:47, in Popen._launch(self, process_obj)
45 try:
46 reduction.dump(prep_data, fp)
---> 47 reduction.dump(process_obj, fp)
48 finally:
49 set_spawning_popen(None)

File ~/opt/anaconda3/envs/htsaudio/lib/python3.8/multiprocessing/reduction.py:60, in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)

TypeError: cannot pickle 'module' object

Question about AudioSet and finetune learning rate.

Hi Ke Chen!
Great work! I mainly have two questions:
The first is about AudioSet. From reading your source code, I guess you used the dataset that Kong Qiuqiang shared on Baidu Cloud. If so, did you run into problems extracting it? I get errors during extraction (with both WinRAR and 7-Zip, although WinRAR can still extract the files), and errors when loading the data (apparently due to corrupted files). Even after excluding the corrupted files, I could not reproduce another work's results (the metrics are very poor), so I suspect this is related to the extraction not being done correctly. If you used this shared dataset, how did you extract it?
The second is about fine-tuning the ImageNet-pretrained Swin Transformer: what learning rate did you use? And is the learning schedule the same as training from scratch? The paper only seems to mention the learning rate and schedule for training from scratch.

I would greatly appreciate it if you could answer my questions!

cyclic window shifting in the (256,256) tensor

Hi,
Awesome repo. I have a question regarding the architecture and token interaction. Don't you think the way HTS-AT creates the (256, 256) tensor from the (1024, 64) spectrogram causes problematic token interactions during cyclic window shifting?
What I understood is that you cut the (1024, 64) spectrogram into 4 pieces of (256, 64) each along dim=0, and these 4 pieces are then concatenated along dim=1, resulting in a final tensor of shape (256, 256). When you then do a cyclic window shift on this, a window can contain tokens from two different pieces, i.e. some from the low-frequency region of piece 1 and some from the high-frequency region of piece 2. (A small sketch of the reshape I mean is below.)
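A small sketch of the cut-and-concatenate I am describing (my own illustration, not the repository's reshape_wav2img implementation):

import torch

# Cut a (1024, 64) spectrogram into 4 pieces of (256, 64) along time (dim=0),
# then concatenate the pieces along frequency (dim=1) to get (256, 256).
spec = torch.randn(1024, 64)        # (time, mel bins)
pieces = spec.chunk(4, dim=0)       # four (256, 64) pieces
img = torch.cat(pieces, dim=1)      # (256, 256)
print(img.shape)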

type of GPU

What type of GPU was used for training, and how long did the training take?

the size of the input spectrum

Hello, may I ask whether the size of the input spectrogram is fixed? What are the time and frequency dimensions of the spectrogram, respectively?

RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 64) output (H: 1024, W: 64)

Hello, RetroCirce.

Thank you for your great work. I want to try inference on a single audio file.

Based on the notebook from test_esc.zip in #11 (comment), I made small changes and created my own notebook, available in my gist (https://gist.github.com/Mizuho32/fba4105ab95fad1e64b9cf1421c21597).
But I get the error listed below. (For details, please refer to the output of the 3rd cell of https://gist.github.com/Mizuho32/fba4105ab95fad1e64b9cf1421c21597.)

RuntimeError: Input and output sizes should be greater than 0, but got input (H: 0, W: 64) output (H: 1024, W: 64)

Anything wrong? Please help me 🙏
I used this audio https://www.youtube.com/watch?v=zzNdwF40ID8.

My torch versions are

torch==2.0.1
torchaudio==2.0.2
torchcontrib==0.0.2
torchlibrosa==0.0.9
torchmetrics==0.11.4

Training and inferring with a dataset containing 4 classes

Hi Ke,

I hope you are having a happy holiday :)

I have used htsat_esc_training.ipynb to retrain a model on my own data (I removed all the ESC-50 data and replaced it with my data, split into 5 folds), which only contains 4 classes. When I load the state_dict for prediction, the following error happens:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [2], line 2
      1 # Inference
----> 2 Audiocls = Audio_Classification(model_path, config)
      4 pred_prob = Audiocls.predict("scptj-6t8dl.wav")
      6 print('Audiocls predict output: ', pred_prob)

Cell In [1], line 44, in Audio_Classification.__init__(self, model_path, config)
     42 for key in ckpt["state_dict"]:
     43     temp_ckpt[key[10:]] = ckpt['state_dict'][key]
---> 44 self.sed_model.load_state_dict(temp_ckpt)
     45 self.sed_model.to(self.device)
     46 self.sed_model.eval()

File c:\Users\jonat\source\repos\HTS-Audio-Transformer-main\HTSATvenv\lib\site-packages\torch\nn\modules\module.py:1604, in Module.load_state_dict(self, state_dict, strict)
   1599         error_msgs.insert(
   1600             0, 'Missing key(s) in state_dict: {}. '.format(
   1601                 ', '.join('"{}"'.format(k) for k in missing_keys)))
   1603 if len(error_msgs) > 0:
-> 1604     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   1605                        self.__class__.__name__, "\n\t".join(error_msgs)))
   1606 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for HTSAT_Swin_Transformer:
	size mismatch for tscam_conv.weight: copying a param with shape torch.Size([50, 768, 2, 3]) from checkpoint, the shape in current model is torch.Size([4, 768, 2, 3]).
	size mismatch for tscam_conv.bias: copying a param with shape torch.Size([50]) from checkpoint, the shape in current model is torch.Size([4]).
	size mismatch for head.weight: copying a param with shape torch.Size([50, 50]) from checkpoint, the shape in current model is torch.Size([4, 4]).
	size mismatch for head.bias: copying a param with shape torch.Size([50]) from checkpoint, the shape in current model is torch.Size([4]).

I have changed the model_path to be one of the checkpoints from when I trained the model:
model_path = './workspace/results/exp_htsat_esc_50/checkpoint/lightning_logs/version_34/checkpoints/l-epoch=38-acc=1.000.ckpt'
This confuses me a lot. Do you have any idea why this is happening?

About the token semantic module

Thanks for your reply.
I went back over the code for the token semantic module, and there are still some things I don't understand that I'd like to ask you about:

        B, N, C = x.shape
        SF = frames_num // (2 ** (len(self.depths) - 1)) // self.patch_stride[0]
        ST = frames_num // (2 ** (len(self.depths) - 1)) // self.patch_stride[1]
        x = x.permute(0,2,1).contiguous().reshape(B, C, SF, ST)        
        B, C, F, T = x.shape

        # group 2D CNN
        c_freq_bin = F // self.freq_ratio
        x = x.reshape(B, C, F // c_freq_bin, c_freq_bin, T)
        x = x.permute(0,1,3,2,4).contiguous().reshape(B, C, c_freq_bin, -1)

        x = self.tscam_conv(x)
        x = torch.flatten(x, 2) # B, C, T

1. In the group 2D part, the feature map has already been split into the SF, ST form above. What is the purpose of this additional reshaping of the feature map here?

2. After self.tscam_conv, the feature map shape becomes B, Class, T'. Does this T' have any physical meaning?

3. After the above operations, I see that the code upsamples x to generate fpx as the framewise_output and uses it for localization (to determine onset and offset times?). Why can this fpx be used for localization, and what is the physical meaning of the 1024 in fpx (B, 1024, 527)?

I hope you can find time to resolve my confusion.

Originally posted by @dong-0412 in #19 (comment)

Learning rate for small datasets

Hi, thank you for your great work.

Here, you mentioned that the warm-up should be adjusted according to the dataset. Could you give some advice for small datasets? For example, I have approximately 15K samples; how should I set lr_scheduler_epoch and lr_rate?

I compared the AudioSet config and the ESC config, but the mentioned parameters are the same.

Different length audio input for infer mode

Hi, thanks for the interesting work!

I have a question about the infer mode in htsat.py. During training, the audio input is always 10 seconds long. At inference time, the model needs to handle variable-length audio input, which could be longer or shorter than 10 seconds. But for the infer mode in htsat.py,

 if infer_mode:
            # in infer mode. we need to handle different length audio input
            frame_num = x.shape[2]
            target_T = int(self.spec_size * self.freq_ratio)
            repeat_ratio = math.floor(target_T / frame_num)
            x = x.repeat(repeats=(1,1,repeat_ratio,1))
            x = self.reshape_wav2img(x) 

What if the input frame_num > target_T (which should be 256 x 4 = 1024 here)? If so, repeat_ratio will be 0. So how does the model process audio longer than 10 seconds at inference time?
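A small numeric check of the floor division in the snippet above (using target_T = 256 x 4 = 1024):

import math

# Once the input has more frames than target_T, floor() rounds the repeat
# ratio down to 0, and repeating a tensor 0 times along a dim yields an
# empty tensor.
target_T = 256 * 4
for frame_num in (500, 1024, 1500):
    print(frame_num, math.floor(target_T / frame_num))
# 500 -> 2, 1024 -> 1, 1500 -> 0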

Question about reshape log_mel to img size

Hi,

Could you explain a little more about how reshape_wav2img function works?

Why can we simply interpolate T and F to spec_size? What do target_T and target_F represent here?

A detailed explanation about the reshape process is appreciated. Thank you!

Error during AudioSet training

The data is not shuffled, and training crashes at the same point every run:
KeyError: "Unable to open object (object 'hdf5_path' doesn't exist)"
(See the attached image.)

Pretrained .ckpt file

Hello, I saw that in the data preparation part of htsat_esc_training.ipynb, after downloading the ESC-50 files you are asked to download a file named htsat_audioset_pretrain.ckpt, but I could not download it. 1OK8a5XuMVLyeVKF117L8pfxeZYdfSDZv is the corresponding file ID. I would like to know how to download this file without using gdown.

The mAP following Audioset Recipe is very low

Hi, I downloaded the model checkpoint files you provided on Google Drive and followed the README AudioSet evaluation recipe:
Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
I expected to get performance similar to your paper, but I got a very low mAP.
The number of eval samples I used is 18,887.
I would like to know your eval set size if possible.
Attached is a picture of the single-model evaluation (HTSAT_AudioSet_Saved_1.ckpt) results.
Thanks.

Error during training: Segmentation fault (core dumped)

INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [1,2,3,4]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [1,2,3,4]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2,3,4]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [1,2,3,4]

| Name | Type | Params

0 | sed_model | HTSAT_Swin_Transformer | 28.9 M

27.8 M Trainable params
1.1 M Non-trainable params
28.9 M Total params
115.404 Total estimated model params size (MB)
Epoch 0: 8%|███████▍ | 5/65 [00:03<00:36, 1.62it/s, loss=3.99, v_num=8, loss_step=3.990cuda:0 4 {'acc': 0.02}███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.10it/s]
Segmentation fault (core dumped)
