Comments (2)

YuanGongND avatar YuanGongND commented on July 18, 2024

16 kHz, 5-second audio should work; our ESC-50 recipe uses this setting. Which line reports this error? Is your audio mono or multi-channel? Check the shape of the waveform.

-Yuan

from ast.

GrafKnusprig avatar GrafKnusprig commented on July 18, 2024

First of all, thanks for the answer!

My waveform looks like this:

Waveform shape: torch.Size([1, 80000])
Waveform dtype: torch.float32
Number of channels: 1

That is 80000 samples: 16000 Hz × 5 seconds.

And the error happens in:

Cell In[14], line 86, in preprocess_function(examples)
     79     # print(f"Waveform max: {waveform.max()}")
     80     # print(f"Waveform min: {waveform.min()}")
     81     # print(f"Waveform mean: {waveform.mean()}")
     82     # print(f"Waveform std: {waveform.std()}")
     83     # printing the number of channels in the waveform
     84     print(f"Number of channels: {waveform.shape[0]}")
---> 86     input_values = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt").input_values
     87     inputs['input_values'].append(input_values.squeeze(0))
     88 return inputs

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in ASTFeatureExtractor.__call__(self, raw_speech, sampling_rate, return_tensors, **kwargs)
    216     raw_speech = [raw_speech]
    218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
    221 # convert into BatchFeature
    222 padded_inputs = BatchFeature({"input_values": features})

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in <listcomp>(.0)
    216     raw_speech = [raw_speech]
    218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
    221 # convert into BatchFeature
    222 padded_inputs = BatchFeature({"input_values": features})

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:119, in ASTFeatureExtractor._extract_fbank_features(self, waveform, max_length)
    117 if is_speech_available():
    118     waveform = torch.from_numpy(waveform).unsqueeze(0)
--> 119     fbank = ta_kaldi.fbank(
    120         waveform,
    121         sample_frequency=self.sampling_rate,
    122         window_type="hanning",
    123         num_mel_bins=self.num_mel_bins,
    124     )
    125 else:
    126     waveform = np.squeeze(waveform)

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:591, in fbank(waveform, blackman_coeff, channel, dither, energy_floor, frame_length, frame_shift, high_freq, htk_compat, low_freq, min_duration, num_mel_bins, preemphasis_coefficient, raw_energy, remove_dc_offset, round_to_power_of_two, sample_frequency, snip_edges, subtract_mean, use_energy, use_log_fbank, use_power, vtln_high, vtln_low, vtln_warp, window_type)
    542 r"""Create a fbank from a raw audio signal. This matches the input/output of Kaldi's
    543 compute-fbank-feats.
    544 
   (...)
    587     where m is calculated in _get_strided
    588 """
    589 device, dtype = waveform.device, waveform.dtype
--> 591 waveform, window_shift, window_size, padded_window_size = _get_waveform_and_window_properties(
    592     waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient
    593 )
    595 if len(waveform) < min_duration * sample_frequency:
    596     # signal is too short
    597     return torch.empty(0, device=device, dtype=dtype)

File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:142, in _get_waveform_and_window_properties(waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient)
    139 window_size = int(sample_frequency * frame_length * MILLISECONDS_TO_SECONDS)
    140 padded_window_size = _next_power_of_2(window_size) if round_to_power_of_two else window_size
--> 142 assert 2 <= window_size <= len(waveform), "choose a window size {} that is [2, {}]".format(
    143     window_size, len(waveform)
    144 )
    145 assert 0 < window_shift, "`window_shift` must be greater than 0"
    146 assert padded_window_size % 2 == 0, (
    147     "the padded `window_size` must be divisible by two." " use `round_to_power_of_two` or change `frame_length`"
    148 )

AssertionError: choose a window size 400 that is [2, 1]

I know that's a lot to ask, but do you have any idea what could be wrong? I'm lost.

Thanks a lot.

UPDATE:
If I run it with stereo files, I get this error instead:
AssertionError: choose a window size 400 that is [2, 2]

Am I using the wrong feature extractor?

# Load the model and feature extractor
model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
model = ASTForAudioClassification.from_pretrained(model_name)
feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)
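
One possible reading of the two assertion messages, offered as a sketch rather than a confirmed diagnosis: the second number in the message tracks the channel count (1 for mono, 2 for stereo), which suggests that `len(waveform)` inside `fbank` is measuring the channel dimension instead of the sample dimension. The pure-Python snippet below only mimics the shape bookkeeping visible in the traceback (the `unsqueeze(0)` in `_extract_fbank_features` and the `window_size` formula in `kaldi.py`). The helper name `kaldi_window_check` is hypothetical, and the channel-selection step is an assumption about what happens between those two frames.

```python
# Shape bookkeeping only; no torch, torchaudio, or transformers calls are made.
MILLISECONDS_TO_SECONDS = 0.001


def kaldi_window_check(input_shape, sample_frequency=16000, frame_length_ms=25.0):
    """Mimic the length check from the traceback for a waveform of `input_shape`.

    Assumptions (not confirmed by the thread):
    - the extractor prepends one dimension (the unsqueeze(0) in the traceback);
    - fbank then indexes channel 0 before calling len() on the result.
    """
    after_unsqueeze = (1,) + tuple(input_shape)  # unsqueeze(0)
    after_channel_select = after_unsqueeze[1:]   # assumed waveform[channel, :]
    length = after_channel_select[0]             # what len(waveform) would see
    # Same formula as line 139 of kaldi.py in the traceback.
    window_size = int(sample_frequency * frame_length_ms * MILLISECONDS_TO_SECONDS)
    return window_size, length, 2 <= window_size <= length


for shape in [(1, 80000), (2, 80000), (80000,)]:
    print(shape, kaldi_window_check(shape))
# (1, 80000) (400, 1, False)
# (2, 80000) (400, 2, False)
# (80000,) (400, 80000, True)
```

If this reading is right, handing the extractor a 1-D array instead of a `[channels, samples]` tensor, for example `waveform.squeeze(0).numpy()` for mono input, might avoid the assertion. Stereo input would first need to be reduced to one channel, for instance with `waveform.mean(dim=0)`.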
