Comments (2)
16 kHz / 5 seconds should work; our ESC-50 recipe uses exactly this setting. Which line reports this error? Is your audio mono or multi-channel? Check the shape of the waveform.
-Yuan
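To check the shape and channel count as suggested, a minimal sketch (using a dummy tensor in place of a real `torchaudio.load()` result) could look like:

```python
import torch

# Dummy stereo clip standing in for torchaudio.load() output:
# shape is (channels, samples) -- here 2 channels, 5 s at 16 kHz.
waveform = torch.randn(2, 80000)
print(waveform.shape)  # torch.Size([2, 80000])

# Downmix multi-channel audio to mono by averaging channels.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
print(waveform.shape)  # torch.Size([1, 80000])
```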
First of all, thanks for the answer!
My waveform looks like this:
Waveform shape: torch.Size([1, 80000])
Waveform dtype: torch.float32
Number of channels: 1
80000 samples, because of the 16000 Hz sampling rate and the 5 seconds.
And the error happens in:
Cell In[14], line 86, in preprocess_function(examples)
79 # print(f"Waveform max: {waveform.max()}")
80 # print(f"Waveform min: {waveform.min()}")
81 # print(f"Waveform mean: {waveform.mean()}")
82 # print(f"Waveform std: {waveform.std()}")
83 # printing the number of channels in the waveform
84 print(f"Number of channels: {waveform.shape[0]}")
---> 86 input_values = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt").input_values
87 inputs['input_values'].append(input_values.squeeze(0))
88 return inputs
File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in ASTFeatureExtractor.__call__(self, raw_speech, sampling_rate, return_tensors, **kwargs)
216 raw_speech = [raw_speech]
218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
221 # convert into BatchFeature
222 padded_inputs = BatchFeature({"input_values": features})
File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:219, in <listcomp>(.0)
216 raw_speech = [raw_speech]
218 # extract fbank features and pad/truncate to max_length
--> 219 features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]
221 # convert into BatchFeature
222 padded_inputs = BatchFeature({"input_values": features})
File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\transformers\models\audio_spectrogram_transformer\feature_extraction_audio_spectrogram_transformer.py:119, in ASTFeatureExtractor._extract_fbank_features(self, waveform, max_length)
117 if is_speech_available():
118 waveform = torch.from_numpy(waveform).unsqueeze(0)
--> 119 fbank = ta_kaldi.fbank(
120 waveform,
121 sample_frequency=self.sampling_rate,
122 window_type="hanning",
123 num_mel_bins=self.num_mel_bins,
124 )
125 else:
126 waveform = np.squeeze(waveform)
File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:591, in fbank(waveform, blackman_coeff, channel, dither, energy_floor, frame_length, frame_shift, high_freq, htk_compat, low_freq, min_duration, num_mel_bins, preemphasis_coefficient, raw_energy, remove_dc_offset, round_to_power_of_two, sample_frequency, snip_edges, subtract_mean, use_energy, use_log_fbank, use_power, vtln_high, vtln_low, vtln_warp, window_type)
542 r"""Create a fbank from a raw audio signal. This matches the input/output of Kaldi's
543 compute-fbank-feats.
544
(...)
587 where m is calculated in _get_strided
588 """
589 device, dtype = waveform.device, waveform.dtype
--> 591 waveform, window_shift, window_size, padded_window_size = _get_waveform_and_window_properties(
592 waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient
593 )
595 if len(waveform) < min_duration * sample_frequency:
596 # signal is too short
597 return torch.empty(0, device=device, dtype=dtype)
File D:\GitLab\ss24-aai-lab\.venv\Lib\site-packages\torchaudio\compliance\kaldi.py:142, in _get_waveform_and_window_properties(waveform, channel, sample_frequency, frame_shift, frame_length, round_to_power_of_two, preemphasis_coefficient)
139 window_size = int(sample_frequency * frame_length * MILLISECONDS_TO_SECONDS)
140 padded_window_size = _next_power_of_2(window_size) if round_to_power_of_two else window_size
--> 142 assert 2 <= window_size <= len(waveform), "choose a window size {} that is [2, {}]".format(
143 window_size, len(waveform)
144 )
145 assert 0 < window_shift, "`window_shift` must be greater than 0"
146 assert padded_window_size % 2 == 0, (
147 "the padded `window_size` must be divisible by two." " use `round_to_power_of_two` or change `frame_length`"
148 )
AssertionError: choose a window size 400 that is [2, 1]
I know that's a lot to ask, but do you have any ideas about what could be wrong? I'm lost.
Thanks a lot.
UPDATE:
If I run it with stereo files I get the error:
AssertionError: choose a window size 400 that is [2, 2]
Am I using the wrong feature extractor?
# Load the model and feature extractor
from transformers import ASTForAudioClassification, ASTFeatureExtractor

model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
model = ASTForAudioClassification.from_pretrained(model_name)
feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)
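A likely cause, judging from the traceback: the extractor expects each example as a 1-D array, not a `(channels, samples)` tensor. With shape `[1, 80000]`, the kaldi `fbank` call ends up seeing a length-1 leading dimension, hence `[2, 1]` in the assertion (and `[2, 2]` for stereo). A sketch of a fix, assuming a mono torch waveform, would squeeze the channel axis and pass a NumPy array:

```python
import torch

# Stand-in for a loaded mono clip of shape (1, 80000).
waveform = torch.randn(1, 80000)

# ASTFeatureExtractor expects 1-D raw speech per example;
# squeeze the channel axis and hand over a numpy array.
raw = waveform.squeeze(0).numpy()  # shape (80000,)

# Hypothetical call mirroring the snippet above (not run here):
# input_values = feature_extractor(raw, sampling_rate=16000,
#                                  return_tensors="pt").input_values
```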