jameslyons / python_speech_features
This library provides common speech features for ASR including MFCCs and filterbank energies.
License: MIT License
Hi, and thank you for these wonderful tools!
I'm having some trouble getting the same length for several MFCC arrays (for different audio files, in order to train a neural network afterwards).
On all my files I do:
sr, wave = wav_feat.read(path+wav)
mfcc = python_speech_features.mfcc(wave,sr,numcep=20)
So I obtain MFCC arrays with differing numbers of frames. Every frame contains 20 features, but what can I do to get MFCC arrays of the same size for all files?
I know that sounds easy, but I'm a beginner...
Maybe I could use len(wave), the rate and other parameters...
Thank you very much!
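One common approach for fixed-size network inputs is to pad or truncate every MFCC array to a chosen number of frames. A minimal NumPy sketch (`fix_length` and `num_frames` are hypothetical names, not library functions):

```python
import numpy as np

def fix_length(mfcc_feat, num_frames):
    """Pad with zero rows, or truncate, an (N, numcep) MFCC array
    so that it has exactly num_frames rows."""
    n, d = mfcc_feat.shape
    if n >= num_frames:
        return mfcc_feat[:num_frames]          # truncate extra frames
    pad = np.zeros((num_frames - n, d))        # zero-pad missing frames
    return np.vstack([mfcc_feat, pad])
```

Whether zero-padding or truncation is appropriate depends on the training setup; masking padded frames in the network is often a good idea.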
How can I install this in Python?
Congratulations, your programs are amazing! But when I run base.py, this error appears: cannot import name 'sigproc'.
With a 22-minute audio file, calling
mfcc(signal, samplerate=16000, numcep=26, lowfreq=300, highfreq=4000, appendEnergy=True)
File "/vagrant/dossier/gsapi/memo/features/base.py", line 54, in mfcc
feat,energy = fbank(signal,samplerate,winlen,winstep,nfilt,nfft,lowfreq,highfreq,preemph)
File "/vagrant/dossier/gsapi/memo/features/base.py", line 80, in fbank
frames = sigproc.framesig(signal, winlen*samplerate, winstep*samplerate)
File "/vagrant/dossier/gsapi/memo/features/sigproc.py", line 55, in framesig
return frames*win
MemoryError
I am just calling it in batches for now to avoid this problem, but it might be something the library should handle better.
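The batching workaround can be sketched in plain NumPy. `features_in_chunks` is a hypothetical helper; note that the chunk boundaries here ignore frame overlap, so a few frames at each boundary will differ slightly from a single full-signal pass:

```python
import numpy as np

def features_in_chunks(signal, feature_fn, chunk_samples):
    """Apply feature_fn to successive chunks of a long signal and stack
    the results, keeping peak memory bounded by the chunk size."""
    parts = []
    for start in range(0, len(signal), chunk_samples):
        chunk = signal[start:start + chunk_samples]
        parts.append(feature_fn(chunk))        # e.g. a call to mfcc()
    return np.vstack(parts)
```

Choosing a chunk size that is a multiple of the hop length keeps the frame grid mostly aligned across chunks.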
I did a pip install, and the source code on PyPI does not have a delta function in it. I checked on GitHub and it is there. It would be great if the source code on PyPI got updated as well.
Hi, I was looking at the source code and I saw that pspec = sigproc.powspec(frames,nfft) (the power spectrum) in def fbank(...) uses numpy.absolute and then numpy.square. This way you perform a sqrt followed by a square operation!
It would be more efficient to compute the power spectrum directly with the formula Real * Real + Imag * Imag, or something similar.
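The point above can be checked numerically: both forms give the same power spectrum; the direct Re² + Im² form just skips the square root. A sketch assuming the library's 1/NFFT scaling:

```python
import numpy as np

frames = np.random.randn(4, 400)
nfft = 512
spec = np.fft.rfft(frames, nfft)                    # complex spectrum, (4, 257)
pow1 = 1.0 / nfft * np.square(np.abs(spec))         # sqrt then square
pow2 = 1.0 / nfft * (spec.real**2 + spec.imag**2)   # direct, no sqrt
```

In practice the FFT itself dominates the runtime, so the gain is small, but the direct form is cleaner.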
Hi, I am trying to port this algorithm to JavaScript and I am running into the following:
feat = numpy.dot(pspec,fb.T)
(https://github.com/jameslyons/python_speech_features/blob/master/features/base.py#L56)
The issue I am running into is that pspec and fb should have matching dimensions here, but for some reason they don't. Is there something in the algorithm, some kind of balance between parameters for example, that should cause these two arrays to have compatible dimensions?
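For reference, the two arrays only need to share their last axis, not their full shape: pspec is (numframes, NFFT//2 + 1) and fb is (nfilt, NFFT//2 + 1), and the dot with fb.T contracts the shared NFFT//2 + 1 axis. A shape-only sketch with dummy data:

```python
import numpy as np

numframes, nfft, nfilt = 10, 512, 26
# power spectrum: one row per frame, NFFT//2 + 1 = 257 frequency bins
pspec = np.abs(np.random.randn(numframes, nfft // 2 + 1))
# mel filterbank: one row per filter, same 257 frequency bins
fb = np.abs(np.random.randn(nfilt, nfft // 2 + 1))
feat = np.dot(pspec, fb.T)   # -> (numframes, nfilt)
```

So in a port, the number of frequency bins on both sides must be derived from the same NFFT.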
compute-mfcc-feats --window-type=hamming --dither=0.0 --use-energy=false --sample-frequency=8000 --num-mel-bins=40 --num-ceps=40 --low-freq=40 --raw-energy=false --remove-dc-offset=false --high-freq=3800 scp:wav.scp ark,scp:feats.ark,feats.scp
mfcc(signal=sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=40, nfilt=40, lowfreq=40, highfreq=3800,
appendEnergy=False, winfunc = lambda x: np.hamming(x) )
Is there some difference?
Hi,
I used your code and it was great; I just want to express my appreciation :)
Good luck,
Elahe
I generated 13 MFCC coefficients using mfcc().
mfcc_feat = mfcc(audio_data, sample_rate, winlen=0.025, winstep=0.01,
numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
preemph=0.97, ceplifter=22, appendEnergy=True)
How can I get delta and delta-delta cepstrum so I can build a 39 dimensional MFCC?
Deepa
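A 39-dimensional vector is built by stacking the deltas and delta-deltas next to the cepstra. The library ships a delta(feat, N) function for this; the sketch below reimplements the standard regression formula in plain NumPy so the shapes are explicit:

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta features over a (num_frames, numcep) array,
    using the standard formula with edge padding at the boundaries."""
    denom = 2 * sum(i**2 for i in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    out = np.empty_like(feat, dtype=float)
    for t in range(len(feat)):
        # weighted sum of neighbours: -N..N around frame t
        out[t] = np.dot(np.arange(-N, N + 1), padded[t:t + 2 * N + 1]) / denom
    return out
```

Given a (num_frames, 13) array mfcc_feat, the 39-dimensional features would then be np.hstack([mfcc_feat, delta(mfcc_feat, 2), delta(delta(mfcc_feat, 2), 2)]).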
I'm wondering if we can use this lib for MFCC generation in the Android NDK?
Hi, could you please add an example of how to use python_speech_features.base.mfcc with winfunc set to a Hann (or other) window?
Thank you
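In the meantime, here is a sketch of the winfunc contract: it must be a callable that takes the frame length in samples and returns a window of that length, which the library multiplies into each frame.

```python
import numpy as np

# The frame length the callable will receive is winlen * samplerate:
frame_len = int(round(0.025 * 16000))   # 400 samples for the defaults
win = np.hanning(frame_len)             # a Hann window of that length
```

A call could then look like `python_speech_features.mfcc(sig, rate, winfunc=np.hanning)`; any callable with the same contract (e.g. np.hamming, or a lambda wrapping scipy.signal.windows.hann) works the same way.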
Hi,
If I want the 39 coefficients (13 MFCC + 13 delta + 13 delta-delta), how can I use the code to get them?
Thanks
Hi there,
I was wondering if there is a reason why the logfbank function doesn't also return the frame energy the way fbank does. I usually use logfbank features together with the energy, which obliges me to redefine logfbank, while it would be very easy to return the energy as well.
Thanks,
Bertrand
Hello,
I wonder whether the window type is Hamming. Thank you.
ImportError Traceback (most recent call last)
in ()
1 import python_speech_features
----> 2 from python_speech_features import mfcc
3 from python_speech_features import delta
4 from python_speech_features import logfbank
5 import scipy.io.wavfile as wav
ImportError: cannot import name 'mfcc'
How can I solve this error?
I have 31 seconds of audio at 16000 Hz.
I run MFCC on the audio at default settings (0.01 s step size).
This should mean I get 31 s / 0.01 s = 3100 frames.
What I actually get from calling mfcc() is 6200 frames. Am I misunderstanding something?
I noticed that you use numpy.log() to compute the log instead of numpy.log10(). I found another reference that uses numpy.log10() instead.
numpy.log() is the natural logarithm, not base 10 like numpy.log10().
Which one is correct, or can they be used interchangeably? Could you explain why you chose log() over log10()? I want to extract log mel filterbank features but I don't know which log should be used.
Hello, what method would you suggest for training on the MFCCs, and how should I set up the feature vectors?
Should I implement a state by state approach for each sample?
Hello.
I think in the logpowspec function
if norm: return lps - numpy.max(lps) else: return lps
was supposed to be if norm: return lps / numpy.max(lps) else: return lps
Also, consider accepting frame_len and frame_step in samples rather than seconds, as this lets the user enter exact powers of two so the FFT works just right.
Hi, in the file "sigproc.py" in line 102 & 103:
complex_spec = numpy.fft.rfft(frames, NFFT)
return numpy.absolute(complex_spec)
"rfft" return real not complex, you can use "fft" instead or keep it and no need for "numpy.absolute"
I'm a little confused about the length of the mfcc output array. The following code
from python_speech_features import mfcc
import scipy.io.wavfile as wav
(rate,sig) = wav.read('test.wav')
mfcc_feat = mfcc(sig,rate)
print("rate="+str(rate))
print("sig.size="+str(sig.size))
print("mfcc_feat.shape="+str(mfcc_feat.shape))
produces:
rate=16000
sig.size=1760
mfcc_feat.shape=(10, 13)
I was expecting a shape of (11, 13), since the audio length is 110 ms (160 samples per 10 ms), which should result in 11 steps of 10 ms each, shouldn't it?
(If I append some more frames I'll get 11 steps starting from 1841 frames, while sig.size=1840 still gives 10 steps.)
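The frame count follows from the framing rule used by framesig: one frame for the first full window (25 ms = 400 samples by default), plus one per additional full step. A sketch of that rule (expected_frames is a hypothetical helper) reproduces both the 10-frame and 11-frame observations above:

```python
import math

def expected_frames(nsamples, samplerate, winlen=0.025, winstep=0.01):
    """Number of frames produced by standard framing: one frame for the
    first window, then one more per step beyond it (ceiling division)."""
    frame_len = int(round(winlen * samplerate))
    frame_step = int(round(winstep * samplerate))
    if nsamples <= frame_len:
        return 1
    return 1 + int(math.ceil((nsamples - frame_len) / frame_step))
```

So the count is governed by the 25 ms window as well as the 10 ms step, which is why 110 ms of audio yields 10 frames rather than 11.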
Hey all,
Thanks for the awesome Python module! I was just wondering, have any of you run into an issue where the mfcc function (and presumably other functions such as fbank) hangs on OSX? I was running some multi-processed code and these functions confusingly never completed on my Mac. However, when I ran the equivalent function on the same data in Ubuntu 16.04, the functions returned as expected.
Have any of you run into this before? I'm not sure what could cause this, especially since the source code seems to primarily call numpy operations. I'll investigate further and hopefully post more details here. Unfortunately the code I was running was heavily multi-processed, so it might take a bit of refactoring before I can properly debug things. I just wanted to post here and see if anybody else had run into a similar issue?
When using mfcc, the winfunc parameter can be set to numpy.hamming, but numpy.hamming is a function that takes an int input as the number of points in the output window (see the numpy.hamming docs). However, in sigproc.py it is frame_len that is used.
Could you please explain how np.hamming works in mfcc?
What if I want to input a specific window length?
Thank you !
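For what it's worth, the window length is derived from winlen and samplerate: the window callable is invoked once with that frame length, and the resulting window is multiplied into every frame. A rough sketch of that step (not the library's exact code):

```python
import numpy as np

samplerate, winlen = 16000, 0.025
frame_len = int(round(winlen * samplerate))     # 400 samples per frame
frames = np.random.randn(5, frame_len)          # 5 dummy frames
win = np.tile(np.hamming(frame_len), (5, 1))    # one window row per frame
windowed = frames * win                         # element-wise windowing
```

So you never pass a window length yourself; to change it, you change winlen (and/or samplerate), not the window function.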
Hi,
First of all I really like your contribution.
Second, when using it I got a few deprecation warning messages. No big deal; I was able to fix them by just casting frame_len, numframes, and padlen to ints in the framesig function inside sigproc.py.
Hope it helps
Thanks for sharing
Hi,
I am using your library to compute MFCC features that will be used to train a neural network for speech recognition. While searching for data augmentation options, one very popular one is VTLP (vocal tract length perturbation), which basically consists of warping the frequency axis by a random factor.
I am wondering how difficult it would be to implement this augmentation in your code (I suppose the warping should be done right before the MFCC extraction, but I am still not sure)?
What are the available window functions?
The docstring says 'output will be NxNFFT', but it is actually Nx(NFFT//2+1) due to numpy.fft.rfft.
I compared the MFCCs from librosa with the python_speech_features package and got totally different results.
Which one is correct?
librosa list of first frame coefficients:
[-395.07433842032867, -7.1149347948192963e-14, 3.5772469223901538e-14, -1.7476140989485184e-14, 3.1665300829452658e-14, -4.4214136625668904e-14, 6.7157035631648599e-14, 1.5013974158050108e-14, 2.9512326634271699e-14, 7.2275398398734558e-14, -1.5043753316598812e-13, -2.2358383003147776e-14, 1.6209256159527285e-13]
python_speech_features list of first frame coefficients:
[-169.91598446684722, 1.3219891974654943, 0.22216979881740945, -0.7368248288464827, 0.26268194306407788, 1.8470757480486224, 3.2670900572694435, 2.3726120692753563, 1.4983949546889608, 0.67862219561000914, -0.44705590991616034, 0.39184067109778226, -0.48048214059101707]
import librosa
import python_speech_features
from scipy.signal.windows import hann
n_mfcc = 13
n_mels = 40
n_fft = 512 # in librosa, win_length is assumed to be equal to n_fft implicitly
hop_length = 160
fmin = 0
fmax = None
y, sr = librosa.load(librosa.util.example_audio_file())
sr = 16000 # fake sample rate just to make the point
# librosa
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
n_mfcc=n_mfcc, n_mels=n_mels,
hop_length=hop_length,
fmin=fmin, fmax=fmax)
# python_speech_features
# no preemph nor ceplifter in librosa, so setting to zero
# librosa default stft window is hann
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
preemph=0, ceplifter=0, appendEnergy=False, winfunc=hann)
print(list(mfcc_librosa[:, 0]))
print(list(mfcc_speech[0, :]))
I'm not an expert in this kind of stuff, so I'm sorry if this is a waste of time.
From the numpy.fft.rfft documentation [in our case: n=NFFT, input=frame]:
"Number of points along transformation axis in the input to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input along the axis specified by axis is used."
Isn't this cropping something we want to avoid? Because, as far as I've seen, there's no check in the code on how the frame size compares to NFFT.
Looking at the source code, I can't see how the parameters highfreq and lowfreq are used in the calls to fbank and logfbank. Are they perhaps being ignored?
Thanks.
python_speech_features depends on NumPy and SciPy, but it doesn't declare them as requirements in setup.py.
I guess most users will know what to do when they get the ImportError, but it would be more convenient if they didn't have to install NumPy and SciPy manually.
Is the entire feature-extraction code present in example.py, or is that just a portion of the code? Also, where will the extracted features end up?
If you are playing a song on your laptop and increase the volume from 0 to 100, the audio becomes louder and louder.
Say I have an .mp3 or .wav; how do I capture this perceived loudness/intensity at regular intervals (maybe every 0.1 seconds) using python_speech_features?
Any advice is appreciated.
Thanks
Vivek
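As far as I know python_speech_features doesn't expose a loudness function, but a short-term RMS level in dB is a common proxy and easy to compute with plain NumPy. A hedged sketch (rms_db is a hypothetical helper; true perceived loudness also involves frequency weighting, such as A-weighting, which this ignores):

```python
import numpy as np

def rms_db(signal, samplerate, hop=0.1):
    """RMS level in dB over consecutive hop-second windows: a crude
    loudness proxy, floored to avoid log(0) on silent windows."""
    n = int(hop * samplerate)
    levels = []
    for start in range(0, len(signal) - n + 1, n):
        chunk = signal[start:start + n].astype(float)
        rms = np.sqrt(np.mean(chunk ** 2))
        levels.append(20 * np.log10(max(rms, 1e-12)))
    return np.array(levels)
```

Doubling the amplitude raises each window's level by about 6 dB, which matches the intuition that turning the volume up makes every interval louder.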
Hi,
Thanks for your code. However, I have some questions. I tried to use machine learning to remove noise from an audio file, and I used MFCCs as my features. My goal is to use these features as input data and get another MFCC matrix as output, and from that matrix to obtain a new signal.
My first question is:
Through MFCC I get 39 features in each frame, and this feature dimension seems to be too small. What other features could I add?
My second question is:
How can I get a signal back from an MFCC matrix?
Looking forward to your response. Thanks!
If I try to import the delta function, my IDE flags it as an unresolved reference. Needless to say, trying to run the program gives the ImportError: cannot import name 'delta' exception.
Edit: After going through your code here on GitHub, I can see that delta is indeed defined. Any ideas why I'm getting this error?
Edit 2: It's got to do with the version distributed on PyPI. The base.py file there does not have the delta function defined in it. You might want to fix this. Cheers! :D
Hi:
I recorded a wav file originally at a 44.1 kHz sample rate, and then converted it to 16 kHz with sox. After that I used this Python script to calculate the MFCC features of the 44.1 kHz file and the 16 kHz file, but found that the results were completely different. For the same recording, whether at 44.1 kHz or 16 kHz, I think the results should be the same. Shouldn't they?
Is the max number of cepstra to return 26? Can the function return 40 points?
It would be awesome if you could also host this package on PyPi for easier inclusion as a dependency in projects.
Hi, my aim is to extract the noise from an audio file recorded in a classroom, meaning that I want to remove the teacher's voice. What should I do?
I am a newcomer to audio processing. I am using
wavfile.read(buf)
However, the audio's shape is (8127488, 2). How can I turn it into (N, 1) and feed it into mfcc?
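A shape of (N, 2) means the file is stereo; averaging the two channels is the usual fix before calling mfcc. A minimal NumPy sketch (assuming equal channel weighting is acceptable):

```python
import numpy as np

stereo = np.random.randn(8, 2)   # stands in for wavfile.read's (N, 2) data
mono = stereo.mean(axis=1)       # average left and right -> shape (N,)
```

Alternatively, just picking one channel with `stereo[:, 0]` also works if the channels are near-identical.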
If you put in 0 for the pre-emphasis filter and 0 for the lifter, all NaNs are returned. I think these stages should not be run when the values are zero; I'm guessing a zero results in a divide-by-zero somewhere that produces the NaNs. These stages should just be skipped when a zero is passed in, or let me know if there is another way to bypass these filters.
Hi,
I tried to compare the MFCC features generated using HTK, and those generated by python_speech_features. Unfortunately, somehow they always mismatch.
Below is the configuration I used for HTK
SOURCEFORMAT = NIST
TARGETKIND = MFCC_0
TARGETRATE = 100000
SAVECOMPRESSED = F
SAVEWITHCRC = F
WINDOWSIZE = 250000
USEHAMMING = F
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
The configuration for python_speech_features is the default. I also tried adding USEPOWER = F/T, and still the features obtained are very different (actually, for file TIMITcorpus/TIMIT/TRAIN/DR8/FBCG1/SX442, I got 358 frames from HTK, but only 354 frames from python_speech_features).
Any insight? I'm a newbie in speech recognition, and may have committed some silly mistakes..
Error while deframing, in lines 62-63:
error: index 1 is out of bounds for axis 0 with size 1
For the FilterBank Features section, logfbank is used, which returns just a single array. However, the documentation states:
"A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the energy in each frame (total energy, unwindowed)"
I think this is a mistake, since the energy is returned by the fbank function, not logfbank.