I try to use the feature extractor on my audiofiles. My audio files are all 16000H

AssertionError: choose a window size 400 that is [2, 1] about ast HOT 2 OPEN

GrafKnusprig commented on July 18, 2024

AssertionError: choose a window size 400 that is [2, 1]

from ast.

Comments (2)

YuanGongND commented on July 18, 2024

16kHz 5second should work, our ESC-50 recipe is in this setting. Which line report this error? is your audio monochannel or multi-channel? check shape of waveform.

-Yuan

from ast.

GrafKnusprig commented on July 18, 2024

First of all, thanks for the answer!

My waveform looks like this:

Waveform shape: torch.Size([1, 80000])
Waveform dtype: torch.float32
Number of channels: 1

80000 because of the 16000Hz and the 5 seconds.

And the error happens in:

Cell In[14], line 86, in preprocess_function(examples)
     79     # print(f"Waveform max: {waveform.max()}")
     80     # print(f"Waveform min: {waveform.min()}")
     81     # print(f"Waveform mean: {waveform.mean()}")
     82     # print(f"Waveform std: {waveform.std()}")
     83     # printing the number of channels in the waveform
     84     print(f"Number of channels: {waveform.shape[0]}")
---> 86     input_values = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt").input_values
     87     inputs['input_values'].append(input_values.squeeze(0))
     88 return inputs

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in ASTFeatureExtractor.__call__(self, raw_speech, sampling_rate, return_tensors, **kwargs)
    216     raw_speech = [raw_speech]
    218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
    221 # convert into BatchFeature
    222 padded_inputs = BatchFeature({"input_values": features})

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in <listcomp>(.0)
    216     raw_speech = [raw_speech]
    218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
    221 # convert into BatchFeature
    222 padded_inputs = BatchFeature({"input_values": features})

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:119, in ASTFeatureExtractor._extract_fbank_features(self, waveform, max_length)
    117 if is_speech_available():
    118     waveform = torch.from_numpy(waveform).unsqueeze(0)
--> 119     fbank = ta_kaldi.fbank(
    120         waveform,
    121         sample_frequency=self.sampling_rate,
    122         window_type="hanning",
    123         num_mel_bins=self.num_mel_bins,
    124     )
    125 else:
    126     waveform = np.squeeze(waveform)

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:591, in fbank(waveform, blackman_coeff, channel, dither, energy_floor, frame_length, frame_shift, high_freq, htk_compat, low_freq, min_duration, num_mel_bins, preemphasis_coefficient, raw_energy, remove_dc_offset, round_to_power_of_two, sample_frequency, snip_edges, subtract_mean, use_energy, use_log_fbank, use_power, vtln_high, vtln_low, vtln_warp, window_type)
    542 r"""Create a fbank from a raw audio signal. This matches the input/output of Kaldi's
    543 compute-fbank-feats.
    544 
   (...)
    587     where m is calculated in _get_strided
    588 """
    589 device, dtype = waveform.device, waveform.dtype
--> 591 waveform, window_shift, window_size, padded_window_size = _get_waveform_and_window_properties(
    592     waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient
    593 )
    595 if len(waveform) < min_duration * sample_frequency:
    596     # signal is too short
    597     return torch.empty(0, device=device, dtype=dtype)

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:142, in _get_waveform_and_window_properties(waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient)
    139 window_size = int(sample_frequency * frame_length * MILLISECONDS_TO_SECONDS)
    140 padded_window_size = _next_power_of_2(window_size) if round_to_power_of_two else window_size
--> 142 assert 2 <= window_size <= len(waveform), "choose a window size {} that is [2, {}]".format(
    143     window_size, len(waveform)
    144 )
    145 assert 0 < window_shift, "`window_shift` must be greater than 0"
    146 assert padded_window_size % 2 == 0, (
    147     "the padded `window_size` must be divisible by two." " use `round_to_power_of_two` or change `frame_length`"
    148 )

AssertionError: choose a window size 400 that is [2, 1]

I know that's a lot to ask, but do you have any ideas about what could be wrong? i'm lost.

Thanks a lot.

UPDATE:
If I run it with stereo files I get the error:
AssertionError: choose a window size 400 that is [2, 2]

do i use the wrong feature extractor?

# Load the model and feature extractor
model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
model = ASTForAudioClassification.from_pretrained(model_name)
feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)

from ast.

AssertionError: choose a window size 400 that is [2, 1] about ast HOT 2 OPEN

Comments (2)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs