
jameslyons / python_speech_features


This library provides common speech features for ASR including MFCCs and filterbank energies.

License: MIT License

Python 100.00%

python_speech_features's People

Contributors

adamstark, cwiiis, erikmav, groupw66, henrikalmer, hshteingart, jameslyons, janluke, jhoelzl, mr-yamraj, sbirch, shuttle1987, tbfly, timgates42, ybdarrenwang


python_speech_features's Issues

WARNING:root

frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
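
The warning says a 1103-sample frame is cut down to 512 samples before the FFT. A minimal sketch of one way to avoid that, assuming you want the smallest power of two that is at least the frame length (the helper name `next_pow2` is hypothetical, not part of the library); the result would be passed as the `nfft` argument of `mfcc`:

```python
def next_pow2(frame_len):
    """Smallest power of two that is >= frame_len (frame_len >= 1)."""
    return 1 << (frame_len - 1).bit_length()

frame_len = 1103             # frame length in samples, taken from the warning
nfft = next_pow2(frame_len)  # FFT size large enough to hold the whole frame
print(nfft)                  # 2048
```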

Length of MFCC arrays

Hi and thank you for these wonderful tools!

I am having trouble getting MFCC arrays of the same length for several audio files (I want to train a neural network on them afterwards).
I run:

sr, wave = wav_feat.read(path+wav)
mfcc = python_speech_features.mfcc(wave,sr,numcep=20)

on all my files.
This gives me one MFCC array per file. Every row contains 20 features, but the arrays do not all have the same length. How can I get MFCC arrays of the same size using the parameters?

I know that sounds easy, but I'm a beginner...
Maybe I could use len(wave), the rate and other parameters...

Thank you very much !
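
The number of rows follows from the duration of each file, so the `mfcc` parameters alone will not equalize it. One common workaround, sketched here under the assumption that zero-padding short files and cutting long ones is acceptable for the network, is to force every array to a fixed number of frames. The helper name `fix_num_frames` and the target of 100 frames are hypothetical, and the arrays are stand-ins for real `mfcc` output:

```python
import numpy as np

def fix_num_frames(feat, num_frames):
    """Return feat cut or zero-padded along axis 0 to exactly num_frames rows."""
    feat = feat[:num_frames]                   # drop extra frames, if any
    missing = num_frames - feat.shape[0]       # rows still needed (0 if none)
    return np.pad(feat, ((0, missing), (0, 0)), mode="constant")

short = np.ones((80, 20))    # stand-in for the MFCCs of a short file
long = np.ones((150, 20))    # stand-in for the MFCCs of a long file
batch = np.stack([fix_num_frames(m, 100) for m in (short, long)])
print(batch.shape)           # (2, 100, 20)
```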

Memory Error for large audio files

22 minute audio file
mfcc(signal, samplerate=16000, numcep=26, lowfreq=300, highfreq=4000, appendEnergy=True)

File "/vagrant/dossier/gsapi/memo/features/base.py", line 54, in mfcc
feat,energy = fbank(signal,samplerate,winlen,winstep,nfilt,nfft,lowfreq,highfreq,preemph)
File "/vagrant/dossier/gsapi/memo/features/base.py", line 80, in fbank
frames = sigproc.framesig(signal, winlen*samplerate, winstep*samplerate)
File "/vagrant/dossier/gsapi/memo/features/sigproc.py", line 55, in framesig
return frames*win
MemoryError

For now I am calling it in batches to avoid this problem, but this might be something the library should handle better.
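
A rough sketch of the batching idea, assuming frames start every `frame_step` samples from sample 0 and that `frame_len >= frame_step`. The generator name `chunk_bounds` is hypothetical. It yields sample ranges that overlap just enough for each chunk to hold a whole number of frames, so the per-chunk feature arrays can be concatenated. If pre-emphasis is applied per chunk, the first frame of each chunk may differ slightly from what processing the whole file would give.

```python
def chunk_bounds(num_samples, frame_len, frame_step, frames_per_chunk):
    """Yield (start, end) sample ranges that each hold frames_per_chunk frames."""
    span = (frames_per_chunk - 1) * frame_step + frame_len
    start = 0
    while True:
        end = min(start + span, num_samples)
        yield start, end
        if end == num_samples:       # the last (possibly shorter) chunk
            break
        start += frames_per_chunk * frame_step

# 1000 samples, 400-sample frames, 160-sample step, 3 frames per chunk
print(list(chunk_bounds(1000, 400, 160, 3)))  # [(0, 720), (480, 1000)]
```

Each range would then be sliced out of the signal and passed to `mfcc` on its own, which keeps only one chunk's frames in memory at a time.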

PIP install does not have a delta function

I did a pip install, and the source code on PyPI does not have a delta function in it. I checked on GitHub and it is there. It would be great if the source code on PyPI also got updated.

Troubles when porting

Hi, I am trying to port this algorithm to JavaScript and I am running into the following:

feat = numpy.dot(pspec,fb.T)

(https://github.com/jameslyons/python_speech_features/blob/master/features/base.py#L56)

The issue I am running into is that pspec and fb here should have the same dimensions, but for some reason they don't. Is there something in the algorithm, some kind of balance between parameters for example, which should cause these two arrays to have the same dimensions?
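
For the matrix product to work, the two arrays do not need identical dimensions; they only need the same number of columns before `fb` is transposed. A small sketch with zero-filled stand-ins, assuming the usual layout of one power spectrum per row of `pspec` and one filter per row of `fb`, each with `nfft // 2 + 1` frequency bins (the sizes 98, 512 and 26 are illustrative):

```python
import numpy as np

num_frames, nfft, nfilt = 98, 512, 26
pspec = np.zeros((num_frames, nfft // 2 + 1))  # one power spectrum per frame
fb = np.zeros((nfilt, nfft // 2 + 1))          # one triangular filter per row
feat = np.dot(pspec, fb.T)                     # (frames x bins) . (bins x filters)
print(pspec.shape, fb.shape, feat.shape)       # (98, 257) (26, 257) (98, 26)
```

In a port, a mismatch usually means the filterbank and the spectrum were built with different FFT sizes.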

can't get same result as compute-mfcc-feats.

compute-mfcc-feats --window-type=hamming --dither=0.0 --use-energy=false --sample-frequency=8000 --num-mel-bins=40 --num-ceps=40 --low-freq=40 --raw-energy=false --remove-dc-offset=false --high-freq=3800 scp:wav.scp ark,scp:feats.ark,feats.scp

mfcc(signal=sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=40, nfilt=40, lowfreq=40, highfreq=3800,
appendEnergy=False, winfunc = lambda x: np.hamming(x) )

Is there some difference?

No issue!

Hi,
I used your code, it was great I just want to appreciate :)

Good luck,
Elahe

39 dimensional MFCC

I generated 13 MFCC coefficients using mfcc().

mfcc_feat = mfcc(audio_data, sample_rate, winlen=0.025, winstep=0.01,
numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
preemph=0.97, ceplifter=22, appendEnergy=True)

How can I get delta and delta-delta cepstrum so I can build a 39 dimensional MFCC?
Deepa

delta and delta-delta

Hi,
If I want the 39 coefficients (13 MFCC + 13 delta + 13 delta-delta),
how can I use the code to get them?
Thanks
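
A sketch of how the 39 dimensions are typically assembled: compute deltas of the 13 MFCCs, compute deltas of those deltas, and stack the three blocks side by side. Imports quoted elsewhere on this page suggest the library exposes its own `delta` function. The `delta` below is my own stand-in using the standard regression formula with repeated edge frames, so it runs without the library and may not match the library's output exactly. The random array stands in for real `mfcc` output.

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta over +/-N neighbouring frames (edges repeated)."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    weights = np.arange(-N, N + 1)
    out = np.empty(feat.shape, dtype=float)
    for t in range(feat.shape[0]):
        # weighted sum of the 2N+1 frames centred on frame t
        out[t] = weights @ padded[t:t + 2 * N + 1] / denom
    return out

mfcc_feat = np.random.default_rng(0).standard_normal((50, 13))  # stand-in
d1 = delta(mfcc_feat)        # delta
d2 = delta(d1)               # delta-delta
feat39 = np.hstack((mfcc_feat, d1, d2))
print(feat39.shape)          # (50, 39)
```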

logfbank missing energy

Hi there,

I was wondering if there is a reason why the fbank and logfbank functions don't both return the energy. I usually use logfbank features with energy, which forces me to redefine logfbank, although it would be very easy to return the energy as well.

Thanks,
Bertrand
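
Until the library changes, one possible workaround is to take the two values that `fbank` returns and do the log step yourself. A minimal sketch, assuming `feat` is the (frames x filters) energy matrix and `energy` is the per-frame total energy, both strictly positive. The helper name `log_features_with_energy` is hypothetical and the constant arrays are stand-ins for real `fbank` output.

```python
import numpy as np

def log_features_with_energy(feat, energy):
    """Prepend log frame energy as column 0 of the log filterbank features."""
    return np.hstack((np.log(energy)[:, np.newaxis], np.log(feat)))

feat = np.full((5, 26), 2.0)   # stand-in for the first value returned by fbank
energy = np.full(5, 3.0)       # stand-in for the second value returned by fbank
out = log_features_with_energy(feat, energy)
print(out.shape)               # (5, 27)
```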

Import Error

ImportError Traceback (most recent call last)
in ()
1 import python_speech_features
----> 2 from python_speech_features import mfcc
3 from python_speech_features import delta
4 from python_speech_features import logfbank
5 import scipy.io.wavfile as wav

ImportError: cannot import name 'mfcc'
How can I solve this error?

log() or log10() ?

I noticed that you use numpy.log() to compute the log instead of numpy.log10(). I found another reference that uses numpy.log10() instead of numpy.log().

numpy.log() is the natural logarithm, whereas numpy.log10() is the base-10 logarithm.

Which one is correct? Or can they be used interchangeably? Would you explain why you chose log() instead of log10()? I want to extract log mel filterbank features but I don't know which log should be used.
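
The two logarithms differ only by a constant factor, since ln(x) = log10(x) * ln(10). Features computed one way are therefore a uniformly rescaled copy of the other; what matters is using the same choice consistently for training and testing. A quick check with made-up energy values:

```python
import math

energies = [0.5, 2.0, 20.0, 300.0]             # made-up filterbank energies
natural = [math.log(e) for e in energies]      # what numpy.log would give
base10 = [math.log10(e) for e in energies]     # what numpy.log10 would give
scale = math.log(10)                           # about 2.302585
rescaled = [b * scale for b in base10]
print(all(math.isclose(n, r) for n, r in zip(natural, rescaled)))  # True
```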

Training samples

Hello, What method would you suggest for training the MFCC and how to setup feature vectors?
Should I implement a state by state approach for each sample?

incorrect normalization in logpowspec

Hello.
I think in the logpowspec function
if norm: return lps - numpy.max(lps) else: return lps

was supposed to be if norm: return lps / numpy.max(lps) else: return lps

Also, consider accepting frame_len and frame_step in terms of samples rather than seconds, as this lets the user enter precise powers of two for the fft to work just right.
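
On the normalization question: if `lps` holds decibel values (10 * log10 of the power, which is an assumption here), then subtracting the maximum in the log domain is the same as dividing by the peak power in the linear domain, so the subtraction looks like the intended operation. A small check with made-up power values:

```python
import math

power = [4.0, 25.0, 100.0]                       # made-up linear power values
peak = max(power)
db = [10 * math.log10(p) for p in power]         # log power spectrum in dB
subtracted = [d - max(db) for d in db]           # normalize in the log domain
divided = [10 * math.log10(p / peak) for p in power]  # normalize, then take log
print(all(math.isclose(s, d, abs_tol=1e-9)
          for s, d in zip(subtracted, divided)))  # True
```

After this normalization the peak sits at 0 dB and every other value is negative.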

Filterbank=80

It works fine for filterbank=40. But when I try 80, the output of the third filterbank is a constant value, like this:
-36.04365,-36.04365,-36.04365,-36.04365,-36.04365,-36.04365

I have attached the image showing speech,spectrogram, logfilterbank for 80 filters

rfft vs fft

Hi, in the file "sigproc.py" in line 102 & 103:

complex_spec = numpy.fft.rfft(frames, NFFT)
return numpy.absolute(complex_spec)

"rfft" return real not complex, you can use "fft" instead or keep it and no need for "numpy.absolute"

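For reference, the "r" in `numpy.fft.rfft` refers to the real-valued input, not the output: the result is a complex array holding only the non-negative frequency bins, so taking the magnitude with `numpy.absolute` is still needed. A short check on a synthetic 64-sample sine frame (the frame itself is made up):

```python
import numpy as np

frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64)    # real-valued test frame
spec = np.fft.rfft(frame, 64)
print(np.iscomplexobj(spec), spec.shape)               # True (33,)
# rfft matches the first NFFT/2 + 1 bins of the full complex FFT
print(np.allclose(spec, np.fft.fft(frame, 64)[:33]))   # True
magnitude = np.absolute(spec)                          # real-valued magnitudes
```
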
Question about length of mfcc output array

I'm a little confused with the length of the mfcc output array. The following code

from python_speech_features import mfcc
import scipy.io.wavfile as wav
(rate,sig) = wav.read('test.wav')
mfcc_feat = mfcc(sig,rate)
print("rate="+str(rate))
print("sig.size="+str(sig.size))
print("mfcc_feat.shape="+str(mfcc_feat.shape))

produces:

rate=16000
sig.size=1760
mfcc_feat.shape=(10, 13)

I was expecting a shape of (11, 13), since the audio length is 110ms (160 frames per 10ms), which should result in 11 steps with 10ms each, or shouldn't it?
(If I append some more frames I'll get 11 steps starting from 1841 frames, while sig.size=1840 still gives 10 steps.)
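
The reported numbers are consistent with a framing rule in which the first frame starts at sample 0, each frame is 25 ms (400 samples at 16 kHz), frames advance by 10 ms (160 samples), and a final partial frame is zero-padded. That rule is an assumption inferred from the figures above, and the helper name `num_frames` is hypothetical:

```python
import math

def num_frames(num_samples, frame_len=400, frame_step=160):
    """Frames needed to cover the signal; the last one may be zero-padded."""
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)

for n in (1760, 1840, 1841):
    print(n, num_frames(n))   # 1760 10, then 1840 10, then 1841 11
```

So the count depends on how many 160-sample steps fit after the first 400-sample window, not on the signal duration divided by 10 ms.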

Functions hanging on OS X

Hey all,

Thanks for the awesome python module 👍 I was just wondering, have any of you run into an issue where the mfcc function (and presumably other functions such as fbank) hangs on OS X? I was running some multi-processed code and these functions confusingly never completed on my Mac. However, when I ran the equivalent function in Ubuntu 16.04 on the same data, the functions returned as expected.

Have any of you run into this before? I'm not sure what could cause this, especially seeing that the source code seems to primarily call numpy operations. I'll investigate this further and hopefully post more details here. Unfortunately the code I was running was heavily multi-processed, so it might take a bit of refactoring to allow me to properly debug things. I just wanted to post on here and see if anybody else had run into a similar issue? 😃

Question about hamming window length

When using mfcc, the window parameter can be numpy.hamming, but numpy.hamming is a function that takes an int as the number of points in the output window.
See the numpy.hamming doc.
However, frame_len is used in sigproc.py.
Could you please explain how np.hamming works in mfcc?
What if I want to pass a specific window length?
Thank you !
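
A sketch of the mechanism as I understand it: the window argument is a function, and the framing code calls it with the frame length to build a window of matching size, which is then multiplied into every frame. The window length therefore follows from `winlen * samplerate` rather than being chosen separately. The helper `apply_window` is a hypothetical stand-in for that step, not the library's code.

```python
import numpy as np

def apply_window(frames, winfunc=np.hamming):
    """Multiply every row of frames by winfunc(frame_len)."""
    frame_len = frames.shape[1]
    return frames * winfunc(frame_len)   # the window broadcasts over the rows

frames = np.ones((3, 400))               # three stand-in frames of 400 samples
windowed = apply_window(frames)
print(np.allclose(windowed[0], np.hamming(400)))   # True

# Any callable taking the number of points works, e.g. a Hann window:
hann_windowed = apply_window(frames, winfunc=lambda n: np.hanning(n))
```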

Small issue with variables

Hi,

First of all I really like your contribution.

Second, when using it I got a few deprecation warnings. No big deal; I was able to fix them by casting frame_len, numframes and padlen to ints in the function framesig inside sigproc.py.

Hope it helps
Thanks for sharing

Data augmentation using VTLP

Hi,

I am using your library to compute MFCC features that will be used to train a neural network to perform speech recognition. When I've searched for data augmentation options, one very popular is the one named VTLP (Vocal tract length perturbation), which basically consists of warping frequency axis by a random factor.

I am wondering how difficult it would be to implement this augmentation in your code (I suppose this warping should be done right before the MFCC extraction, but I am still not sure).

inconsistency with librosa

I compared the MFCCs of librosa with those of the python_speech_features package and got totally different results.

Which one is correct?
librosa list of first frame coefficients:

[-395.07433842032867, -7.1149347948192963e-14, 3.5772469223901538e-14, -1.7476140989485184e-14, 3.1665300829452658e-14, -4.4214136625668904e-14, 6.7157035631648599e-14, 1.5013974158050108e-14, 2.9512326634271699e-14, 7.2275398398734558e-14, -1.5043753316598812e-13, -2.2358383003147776e-14, 1.6209256159527285e-13]

python_speech_features list of first frame coefficients:

[-169.91598446684722, 1.3219891974654943, 0.22216979881740945, -0.7368248288464827, 0.26268194306407788, 1.8470757480486224, 3.2670900572694435, 2.3726120692753563, 1.4983949546889608, 0.67862219561000914, -0.44705590991616034, 0.39184067109778226, -0.48048214059101707]

import librosa
import python_speech_features
from scipy.signal.windows import hann

n_mfcc = 13
n_mels = 40
n_fft = 512 # in librosa, win_length is assumed to be equal to n_fft implicitly
hop_length = 160
fmin = 0
fmax = None
y, sr = librosa.load(librosa.util.example_audio_file())
sr = 16000  # fake sample rate just to make the point

# librosa
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax)

# python_speech_features
# no preemph nor ceplifter in librosa, so setting to zero
# librosa default stft window is hann
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                          preemph=0, ceplifter=0, appendEnergy=False, winfunc=hann)


print(list(mfcc_librosa[:, 0]))
print(list(mfcc_speech[0, :]))

What if the frame length is greater than NFFT?

I'm not an expert in this kind of stuff, so I'm sorry if this will be a waste of time.

From the numpy.fft.rfft documentation [in our case: n=NFFT, input=frame]:
"Number of points along transformation axis in the input to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input along the axis specified by axis is used."

Isn't this cropping something we want to avoid? As far as I can see, there is no check in the code on how the frame size compares to NFFT.
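
The cropping described in the quoted documentation is easy to confirm with numpy alone. In this sketch the 1103-sample frame is random stand-in data; with n=512 only the first 512 samples contribute, while an n of at least 1103 (here 2048) keeps the whole frame and zero-pads it:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(1103)        # stand-in for one 1103-sample frame

cropped = np.fft.rfft(frame, n=512)      # n < len(frame): input is cropped
print(np.allclose(cropped, np.fft.rfft(frame[:512])))   # True

padded = np.fft.rfft(frame, n=2048)      # n > len(frame): input is zero-padded
print(padded.shape)                      # (1025,)
```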

Missing requirements in setup.py

python_speech_features depends on NumPy and SciPy, but it doesn't declare them as requirements in setup.py.

I guess most users will know what to do when they get the ImportError, but it would be more convenient if they didn't have to install NumPy and SciPy manually.

How do i run the code?

Is the entire code that extracts the features present in example.py,
or is that just a portion of the code? Also, where will the extracted features be stored?

Inverse MFCC to signal and rate

Hi,

Thanks for your code. However, I have some questions. I am trying to use machine learning to remove noise from an audio file, and I use MFCCs as my features. My goal is to use these features as input data, get another MFCC matrix as output data, and then reconstruct a new signal from that matrix.

My first question is,
With MFCC I get 39 features per frame, and this feature dimension seems too small. What other features could I add?

My second question is,
How could I get signal from MFCC matrix?

Looking forward to your response. Thanks!

No delta?

If I try to import the delta module, my IDE flags it as unresolved reference. Needless to say, trying to run the program gives the ImportError: cannot import name 'delta' exception.

Edit: After going through your code here on GitHub, I can see that delta is indeed defined. Any ideas on why I'm getting the error that I am?

Edit 2: It's got to do with the version distributed by PyPI. The base.py file there does not have the delta function defined in it. You might want to fix this. Cheers! :D

Why there's big difference using 16k and 44.1k sample rate

Hi:
I recorded some wav files originally at a 44.1k sample rate, and then converted them to 16k with sox. After that I used this Python script to calculate the MFCC features of the 44.1k file and the 16k file, but found that the results were completely different. For the same recording, whether it is stored at 44.1k or 16k, I think the result should be the same. Shouldn't it?

Put package onto pypi

It would be awesome if you could also host this package on PyPi for easier inclusion as a dependency in projects.

obtain the noise data

Hi, my aim is to get the noise data from an audio file recorded in a classroom, meaning that I want to remove the teacher's voice. What should I do?

How to turn wav in to (N,1)?

I am a newcomer to audio processing. I am using

wavfile.read(buf)

However, the audio's shape is (8127488, 2). How can i turn it into (N, 1) and feed it into the mfcc.
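
A shape of (N, 2) means the file is stereo, with one column per channel. Two common ways to get a single channel are averaging the columns or keeping just one of them. A minimal sketch with a tiny made-up array standing in for the data returned by `wavfile.read`, assuming a one-dimensional array of N samples is what `mfcc` expects:

```python
import numpy as np

# stand-in for stereo data of shape (N, 2) as returned by wavfile.read
audio = np.array([[100, 300], [-50, 50], [0, 20]], dtype=np.int16)

mono = audio.mean(axis=1)   # average the two channels -> shape (N,)
left = audio[:, 0]          # or keep only the first channel -> shape (N,)
print(mono.shape, left.shape)   # (3,) (3,)
```

If an explicit (N, 1) column is really required, `mono.reshape(-1, 1)` produces it.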

Input 0 for both preemph and ceplifter results in all NANs returned

If you put in 0 for the pre-emphasis filter and 0 for the lifter all Nans will be returned. I think if these values are zero the filter components should not be run. I am guessing maybe putting zero in results in a divide by zero somewhere that returns all Nans. These stages should just be removed if a zero is passed in or let me know if you can bypass these filters by another way.

inconsistent result with HTK

Hi,

I tried to compare the MFCC features generated using HTK, and those generated by python_speech_features. Unfortunately, somehow they always mismatch.

Below is the configuration I used for HTK

SOURCEFORMAT = NIST
TARGETKIND = MFCC_0
TARGETRATE = 100000
SAVECOMPRESSED = F
SAVEWITHCRC = F
WINDOWSIZE = 250000
USEHAMMING = F
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F

The configuration for python_speech_features is default.

I also tried adding USEPOWER = F/T, and the features obtained are still very different (actually, for file TIMITcorpus/TIMIT/TRAIN/DR8/FBCG1/SX442, I got 358 frames for HTK, but only 354 frames for python_speech_features).

Any insight? I'm a newbie in speech recognition, and may have committed some silly mistakes..

Minor issue in README.md

For FilterBank features section logfbank is used which has a return value of just a single array. However, the documentation states that:

A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the enrgy in each frame (total energy, unwindowed)

I think it's a mistake since the energy is returned by fbank function, not logfbank.
