jameslyons / python_speech_features
This library provides common speech features for ASR including MFCCs and filterbank energies.
License: MIT License
Hi, and thank you for these wonderful tools!
I'm having some trouble getting the same length for several MFCC arrays (for different audio files, in order to train a neural network afterwards).
On all my files I do:
sr, wave = wav_feat.read(path+wav)
mfcc = python_speech_features.mfcc(wave,sr,numcep=20)
So I obtain MFCC arrays with differing numbers of frames. Every frame contains 20 features, but what can I do to get MFCC arrays of the same size for all files?
I know that sounds easy, but I'm a beginner...
Maybe I could use len(wave), the rate and other parameters...
Thank you very much!
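One common approach for fixed-size network inputs is to pad or truncate every MFCC array to a chosen number of frames. A minimal NumPy sketch (`fix_length` and `num_frames` are hypothetical names, not library functions):

```python
import numpy as np

def fix_length(mfcc_feat, num_frames):
    """Pad with zero rows, or truncate, an (N, numcep) MFCC array
    so that it has exactly num_frames rows."""
    n, d = mfcc_feat.shape
    if n >= num_frames:
        return mfcc_feat[:num_frames]          # truncate extra frames
    pad = np.zeros((num_frames - n, d))        # zero-pad missing frames
    return np.vstack([mfcc_feat, pad])
```

Whether zero-padding or truncation is appropriate depends on the training setup; masking padded frames in the network is often a good idea.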
How can I install this in Python?
Congratulations, your programs are amazing! But when I run base.py, this error appears: cannot import name 'sigproc'.
With a 22-minute audio file, calling
mfcc(signal, samplerate=16000, numcep=26, lowfreq=300, highfreq=4000, appendEnergy=True)
File "/vagrant/dossier/gsapi/memo/features/base.py", line 54, in mfcc
feat,energy = fbank(signal,samplerate,winlen,winstep,nfilt,nfft,lowfreq,highfreq,preemph)
File "/vagrant/dossier/gsapi/memo/features/base.py", line 80, in fbank
frames = sigproc.framesig(signal, winlen*samplerate, winstep*samplerate)
File "/vagrant/dossier/gsapi/memo/features/sigproc.py", line 55, in framesig
return frames*win
MemoryError
I am just calling it in batches for now to avoid this problem, but it might be something the library should handle better.
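The batching workaround can be sketched in plain NumPy. `features_in_chunks` is a hypothetical helper; note that the chunk boundaries here ignore frame overlap, so a few frames at each boundary will differ slightly from a single full-signal pass:

```python
import numpy as np

def features_in_chunks(signal, feature_fn, chunk_samples):
    """Apply feature_fn to successive chunks of a long signal and stack
    the results, keeping peak memory bounded by the chunk size."""
    parts = []
    for start in range(0, len(signal), chunk_samples):
        chunk = signal[start:start + chunk_samples]
        parts.append(feature_fn(chunk))        # e.g. a call to mfcc()
    return np.vstack(parts)
```

Choosing a chunk size that is a multiple of the hop length keeps the frame grid mostly aligned across chunks.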
I did a pip install, and the source code on PyPI does not have a delta function in it. I checked on GitHub and it is there. It would be great if the source code on PyPI got updated as well.
Hi, I was looking at the source code and I saw that pspec = sigproc.powspec(frames,nfft) (the power spectrum) in def fbank(...) uses numpy.absolute and then numpy.square. This way you perform a sqrt followed by a square operation!
It would be more efficient to compute the power spectrum directly with the formula Real * Real + Imag * Imag, or something similar.
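The point above can be checked numerically: both forms give the same power spectrum; the direct Re² + Im² form just skips the square root. A sketch assuming the library's 1/NFFT scaling:

```python
import numpy as np

frames = np.random.randn(4, 400)
nfft = 512
spec = np.fft.rfft(frames, nfft)                    # complex spectrum, (4, 257)
pow1 = 1.0 / nfft * np.square(np.abs(spec))         # sqrt then square
pow2 = 1.0 / nfft * (spec.real**2 + spec.imag**2)   # direct, no sqrt
```

In practice the FFT itself dominates the runtime, so the gain is small, but the direct form is cleaner.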
Hi, I am trying to port this algorithm to JavaScript and I am running into the following:
feat = numpy.dot(pspec,fb.T)
(https://github.com/jameslyons/python_speech_features/blob/master/features/base.py#L56)
The issue I am running into is that pspec and fb should have matching dimensions here, but for some reason they don't. Is there something in the algorithm, some kind of balance between parameters for example, that should cause these two arrays to have compatible dimensions?
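For reference, the two arrays only need to share their last axis, not their full shape: pspec is (numframes, NFFT//2 + 1) and fb is (nfilt, NFFT//2 + 1), and the dot with fb.T contracts the shared NFFT//2 + 1 axis. A shape-only sketch with dummy data:

```python
import numpy as np

numframes, nfft, nfilt = 10, 512, 26
# power spectrum: one row per frame, NFFT//2 + 1 = 257 frequency bins
pspec = np.abs(np.random.randn(numframes, nfft // 2 + 1))
# mel filterbank: one row per filter, same 257 frequency bins
fb = np.abs(np.random.randn(nfilt, nfft // 2 + 1))
feat = np.dot(pspec, fb.T)   # -> (numframes, nfilt)
```

So in a port, the number of frequency bins on both sides must be derived from the same NFFT.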
compute-mfcc-feats --window-type=hamming --dither=0.0 --use-energy=false --sample-frequency=8000 --num-mel-bins=40 --num-ceps=40 --low-freq=40 --raw-energy=false --remove-dc-offset=false --high-freq=3800 scp:wav.scp ark,scp:feats.ark,feats.scp
mfcc(signal=sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=40, nfilt=40, lowfreq=40, highfreq=3800,
appendEnergy=False, winfunc = lambda x: np.hamming(x) )
Is there some difference?
Hi,
I used your code and it was great; I just want to express my appreciation :)
Good luck,
Elahe
I generated 13 MFCC coefficients using mfcc().
mfcc_feat = mfcc(audio_data, sample_rate, winlen=0.025, winstep=0.01,
numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
preemph=0.97, ceplifter=22, appendEnergy=True)
How can I get delta and delta-delta cepstrum so I can build a 39 dimensional MFCC?
Deepa
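A 39-dimensional vector is built by stacking the deltas and delta-deltas next to the cepstra. The library ships a delta(feat, N) function for this; the sketch below reimplements the standard regression formula in plain NumPy so the shapes are explicit:

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta features over a (num_frames, numcep) array,
    using the standard formula with edge padding at the boundaries."""
    denom = 2 * sum(i**2 for i in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    out = np.empty_like(feat, dtype=float)
    for t in range(len(feat)):
        # weighted sum of neighbours: -N..N around frame t
        out[t] = np.dot(np.arange(-N, N + 1), padded[t:t + 2 * N + 1]) / denom
    return out
```

Given a (num_frames, 13) array mfcc_feat, the 39-dimensional features would then be np.hstack([mfcc_feat, delta(mfcc_feat, 2), delta(delta(mfcc_feat, 2), 2)]).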
I'm wondering if we can use this lib for MFCC generation in the Android NDK?
Hi, could you please add an example of how to use python_speech_features.base.mfcc with winfunc set to a Hann (or other) window?
Thank you
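In the meantime, here is a sketch of the winfunc contract: it must be a callable that takes the frame length in samples and returns a window of that length, which the library multiplies into each frame.

```python
import numpy as np

# The frame length the callable will receive is winlen * samplerate:
frame_len = int(round(0.025 * 16000))   # 400 samples for the defaults
win = np.hanning(frame_len)             # a Hann window of that length
```

A call could then look like `python_speech_features.mfcc(sig, rate, winfunc=np.hanning)`; any callable with the same contract (e.g. np.hamming, or a lambda wrapping scipy.signal.windows.hann) works the same way.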
Hi,
If I want the 39 coefficients (13 MFCC + 13 delta + 13 delta-delta), how can I use the code to get them?
Thanks
Hi there,
I was wondering if there is a reason why the logfbank function doesn't also return the frame energy the way fbank does. I usually use logfbank features together with the energy, which obliges me to redefine logfbank, while it would be very easy to return the energy as well.
Thanks,
Bertrand
Hello,
I wonder whether the window type is Hamming. Thank you.
ImportError Traceback (most recent call last)
in ()
1 import python_speech_features
----> 2 from python_speech_features import mfcc
3 from python_speech_features import delta
4 from python_speech_features import logfbank
5 import scipy.io.wavfile as wav
ImportError: cannot import name 'mfcc'
How can I solve this error?
I have 31 seconds of audio at 16000 Hz.
I run MFCC on the audio at default settings (0.01 s step size).
This should mean I get 31 s / 0.01 s = 3100 frames.
What I actually get from calling mfcc() is 6200 frames. Am I misunderstanding something?
I noticed that you use numpy.log() to compute the log instead of numpy.log10(). I found another reference that uses numpy.log10() instead.
numpy.log() is the natural logarithm, not base 10 like numpy.log10().
Which one is correct, or can they be used interchangeably? Could you explain why you chose log() over log10()? I want to extract log mel filterbank features but I don't know which log should be used.
Hello, what method would you suggest for training on the MFCCs, and how should I set up the feature vectors?
Should I implement a state by state approach for each sample?
Hello.
I think in the logpowspec function
if norm: return lps - numpy.max(lps) else: return lps
was supposed to be if norm: return lps / numpy.max(lps) else: return lps
Also, consider accepting frame_len and frame_step in samples rather than seconds, as this lets the user enter exact powers of two so the FFT works just right.
Hi, in the file "sigproc.py" in line 102 & 103:
complex_spec = numpy.fft.rfft(frames, NFFT)
return numpy.absolute(complex_spec)
"rfft" return real not complex, you can use "fft" instead or keep it and no need for "numpy.absolute"
I'm a little confused about the length of the mfcc output array. The following code
from python_speech_features import mfcc
import scipy.io.wavfile as wav
(rate,sig) = wav.read('test.wav')
mfcc_feat = mfcc(sig,rate)
print("rate="+str(rate))
print("sig.size="+str(sig.size))
print("mfcc_feat.shape="+str(mfcc_feat.shape))
produces:
rate=16000
sig.size=1760
mfcc_feat.shape=(10, 13)
I was expecting a shape of (11, 13), since the audio length is 110 ms (160 samples per 10 ms), which should result in 11 steps of 10 ms each, shouldn't it?
(If I append some more frames I'll get 11 steps starting from 1841 frames, while sig.size=1840 still gives 10 steps.)
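The frame count follows from the framing rule used by framesig: one frame for the first full window (25 ms = 400 samples by default), plus one per additional full step. A sketch of that rule (expected_frames is a hypothetical helper) reproduces both the 10-frame and 11-frame observations above:

```python
import math

def expected_frames(nsamples, samplerate, winlen=0.025, winstep=0.01):
    """Number of frames produced by standard framing: one frame for the
    first window, then one more per step beyond it (ceiling division)."""
    frame_len = int(round(winlen * samplerate))
    frame_step = int(round(winstep * samplerate))
    if nsamples <= frame_len:
        return 1
    return 1 + int(math.ceil((nsamples - frame_len) / frame_step))
```

So the count is governed by the 25 ms window as well as the 10 ms step, which is why 110 ms of audio yields 10 frames rather than 11.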
Hey all,
Thanks for the awesome Python module! I was just wondering, have any of you run into an issue where the mfcc function (and presumably other functions such as fbank) hangs on OSX? I was running some multi-processed code and these functions confusingly never completed on my Mac. However, when I ran the equivalent function on the same data in Ubuntu 16.04, the functions returned as expected.
Have any of you run into this before? I'm not sure what could cause this, especially since the source code seems to primarily call numpy operations. I'll investigate further and hopefully post more details here. Unfortunately the code I was running was heavily multi-processed, so it might take a bit of refactoring before I can properly debug things. I just wanted to post here and see if anybody else had run into a similar issue?
When using mfcc, the winfunc parameter can be set to numpy.hamming, but numpy.hamming is a function that takes an int input as the number of points in the output window (see the numpy.hamming docs). However, in sigproc.py it is frame_len that is used.
Could you please explain how np.hamming works in mfcc?
What if I want to input a specific window length?
Thank you !
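For what it's worth, the window length is derived from winlen and samplerate: the window callable is invoked once with that frame length, and the resulting window is multiplied into every frame. A rough sketch of that step (not the library's exact code):

```python
import numpy as np

samplerate, winlen = 16000, 0.025
frame_len = int(round(winlen * samplerate))     # 400 samples per frame
frames = np.random.randn(5, frame_len)          # 5 dummy frames
win = np.tile(np.hamming(frame_len), (5, 1))    # one window row per frame
windowed = frames * win                         # element-wise windowing
```

So you never pass a window length yourself; to change it, you change winlen (and/or samplerate), not the window function.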
Hi,
First of all I really like your contribution.
Second, when using it I got a few deprecation warning messages. No big deal; I was able to fix them by just casting frame_len, numframes, and padlen to ints in the framesig function inside sigproc.py.
Hope it helps
Thanks for sharing
Hi,
I am using your library to compute MFCC features that will be used to train a neural network for speech recognition. While searching for data augmentation options, one very popular one is VTLP (vocal tract length perturbation), which basically consists of warping the frequency axis by a random factor.
I am wondering how difficult it would be to implement this augmentation in your code (I suppose the warping should be done right before the MFCC extraction, but I am still not sure)?
What are the available window functions?
The docstring says 'output will be NxNFFT', but it is actually Nx(NFFT//2+1) due to numpy.fft.rfft.
I compared the MFCCs from librosa with the python_speech_features package and got totally different results.
Which one is correct?
librosa list of first frame coefficients:
[-395.07433842032867, -7.1149347948192963e-14, 3.5772469223901538e-14, -1.7476140989485184e-14, 3.1665300829452658e-14, -4.4214136625668904e-14, 6.7157035631648599e-14, 1.5013974158050108e-14, 2.9512326634271699e-14, 7.2275398398734558e-14, -1.5043753316598812e-13, -2.2358383003147776e-14, 1.6209256159527285e-13]
python_speech_features list of first frame coefficients:
[-169.91598446684722, 1.3219891974654943, 0.22216979881740945, -0.7368248288464827, 0.26268194306407788, 1.8470757480486224, 3.2670900572694435, 2.3726120692753563, 1.4983949546889608, 0.67862219561000914, -0.44705590991616034, 0.39184067109778226, -0.48048214059101707]
import librosa
import python_speech_features
from scipy.signal.windows import hann
n_mfcc = 13
n_mels = 40
n_fft = 512 # in librosa, win_length is assumed to be equal to n_fft implicitly
hop_length = 160
fmin = 0
fmax = None
y, sr = librosa.load(librosa.util.example_audio_file())
sr = 16000 # fake sample rate just to make the point
# librosa
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
n_mfcc=n_mfcc, n_mels=n_mels,
hop_length=hop_length,
fmin=fmin, fmax=fmax)
# python_speech_features
# no preemph nor ceplifter in librosa, so setting to zero
# librosa default stft window is hann
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
preemph=0, ceplifter=0, appendEnergy=False, winfunc=hann)
print(list(mfcc_librosa[:, 0]))
print(list(mfcc_speech[0, :]))
I'm not an expert in this kind of stuff, so I'm sorry if this is a waste of time.
From the numpy.fft.rfft documentation [in our case: n=NFFT, input=frame]:
"Number of points along transformation axis in the input to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input along the axis specified by axis is used."
Isn't this cropping something we want to avoid? Because, as far as I've seen, there's no check in the code on how the frame size compares to NFFT.
Looking at the source code, I can't see how the parameters highfreq and lowfreq are used in the calls to fbank and logfbank. Are they perhaps being ignored?
Thanks.
python_speech_features depends on NumPy and SciPy, but it doesn't declare them as requirements in setup.py.
I guess most users will know what to do when they get the ImportError, but it would be more convenient if they didn't have to install NumPy and SciPy manually.
Is the entire feature-extraction code present in example.py, or is that just a portion of the code? Also, where will the extracted features end up?
If you are playing a song on your laptop and increase the volume from 0 to 100, the audio becomes louder and louder.
Say I have an .mp3 or .wav; how do I capture this perceived loudness/intensity at regular intervals (maybe every 0.1 seconds) using python_speech_features?
Any advice is appreciated.
Thanks
Vivek
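As far as I know python_speech_features doesn't expose a loudness function, but a short-term RMS level in dB is a common proxy and easy to compute with plain NumPy. A hedged sketch (rms_db is a hypothetical helper; true perceived loudness also involves frequency weighting, such as A-weighting, which this ignores):

```python
import numpy as np

def rms_db(signal, samplerate, hop=0.1):
    """RMS level in dB over consecutive hop-second windows: a crude
    loudness proxy, floored to avoid log(0) on silent windows."""
    n = int(hop * samplerate)
    levels = []
    for start in range(0, len(signal) - n + 1, n):
        chunk = signal[start:start + n].astype(float)
        rms = np.sqrt(np.mean(chunk ** 2))
        levels.append(20 * np.log10(max(rms, 1e-12)))
    return np.array(levels)
```

Doubling the amplitude raises each window's level by about 6 dB, which matches the intuition that turning the volume up makes every interval louder.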
Hi,
Thanks for your code. However, I have some questions. I tried to use machine learning to remove noise from an audio file, and I used MFCCs as my features. My goal is to use these features as input data and get another MFCC matrix as output, and from that matrix to obtain a new signal.
My first question is:
Through MFCC I get 39 features in each frame, and this feature dimension seems to be too small. What other features could I add?
My second question is:
How can I get a signal back from an MFCC matrix?
Looking forward to your response. Thanks!
If I try to import the delta function, my IDE flags it as an unresolved reference. Needless to say, trying to run the program gives the ImportError: cannot import name 'delta' exception.
Edit: After going through your code here on GitHub, I can see that delta is indeed defined. Any ideas why I'm getting this error?
Edit 2: It's got to do with the version distributed on PyPI. The base.py file there does not have the delta function defined in it. You might want to fix this. Cheers! :D
Hi:
I recorded a wav file originally at a 44.1 kHz sample rate, and then converted it to 16 kHz with sox. After that I used this Python script to calculate the MFCC features of the 44.1 kHz file and the 16 kHz file, but found that the results were completely different. For the same recording, whether at 44.1 kHz or 16 kHz, I think the results should be the same. Shouldn't they?
Is the max number of cepstra to return 26? Can the function return 40 points?
It would be awesome if you could also host this package on PyPi for easier inclusion as a dependency in projects.
Hi, my aim is to extract the noise from an audio file recorded in a classroom, meaning that I want to remove the teacher's voice. What should I do?
I am a newcomer to audio processing. I am using
wavfile.read(buf)
However, the audio's shape is (8127488, 2). How can I turn it into (N, 1) and feed it into mfcc?
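A shape of (N, 2) means the file is stereo; averaging the two channels is the usual fix before calling mfcc. A minimal NumPy sketch (assuming equal channel weighting is acceptable):

```python
import numpy as np

stereo = np.random.randn(8, 2)   # stands in for wavfile.read's (N, 2) data
mono = stereo.mean(axis=1)       # average left and right -> shape (N,)
```

Alternatively, just picking one channel with `stereo[:, 0]` also works if the channels are near-identical.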
If you put in 0 for the pre-emphasis filter and 0 for the lifter, all NaNs are returned. I think these stages should not be run when the values are zero; I'm guessing a zero results in a divide-by-zero somewhere that produces the NaNs. These stages should just be skipped when a zero is passed in, or let me know if there is another way to bypass these filters.
Hi,
I tried to compare the MFCC features generated using HTK, and those generated by python_speech_features. Unfortunately, somehow they always mismatch.
Below is the configuration I used for HTK
SOURCEFORMAT = NIST
TARGETKIND = MFCC_0
TARGETRATE = 100000
SAVECOMPRESSED = F
SAVEWITHCRC = F
WINDOWSIZE = 250000
USEHAMMING = F
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
The configuration for python_speech_features is the default. I also tried adding USEPOWER = F/T, and still the features obtained are very different (actually, for file TIMITcorpus/TIMIT/TRAIN/DR8/FBCG1/SX442, I got 358 frames from HTK, but only 354 frames from python_speech_features).
Any insight? I'm a newbie in speech recognition, and may have committed some silly mistakes..
Error while deframing, in lines 62-63:
error: index 1 is out of bounds for axis 0 with size 1
For the FilterBank Features section, logfbank is used, which returns just a single array. However, the documentation states:
"A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the energy in each frame (total energy, unwindowed)"
I think this is a mistake, since the energy is returned by the fbank function, not logfbank.