

ar1st0crat commented on July 18, 2024

Hi!

Check this page

Also, this seems strange:

Audio length is 371499 in librosa (len(audio)), 134784 in NWaves. File is mono.

What is the sampling rate and duration (in seconds) of the signal?


OrangeOlko commented on July 18, 2024

Thanks! I started from that page, but without success.

print(librosa.get_duration(audio))
16.848027210884354

The sampling rate is 8000 Hz.


ar1st0crat commented on July 18, 2024

So the number of samples should indeed be int(16.848027210884354 * 8000) = 134784.

mfcc = librosa.feature.mfcc(y = audio, sr = 8000, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)),
                                    dct_type=2, norm='ortho', htk=False, fmin=0, center=False, n_mels=128, window='hanning')

is equivalent to

int sr = 8000;           // sampling rate
int fftSize = 1024;
int filterbankSize = 128;

var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);

// Put the concrete value of int(np.floor(len(audio) / 20)) from Python here;
// for a 134784-sample signal this is 134784 / 20 = 6739.
int hopLength = 6739;

var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 13,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f,
    DctType = "2N",
    LifterSize = 0
};

var extractor = new MfccExtractor(opts);

Note: set center=False in librosa (as I explained in the wiki).
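The effect of center=True can be seen from the frame-count arithmetic alone. The sketch below is plain Python with a hypothetical helper (frame_count is not a librosa function); it assumes that librosa's center=True pads n_fft // 2 samples on each side of the signal before framing and that only full frames are produced, and it uses the 134784-sample, 8 kHz signal from this thread.

```python
def frame_count(n_samples: int, n_fft: int, hop: int, center: bool) -> int:
    """Number of full analysis frames (sketch of librosa-style framing)."""
    if center:
        # Assumption: center=True pads n_fft // 2 samples on both sides.
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop


n = 134784      # samples in the 8 kHz file
hop = n // 20   # same as int(np.floor(len(audio) / 20)) for non-negative n

print(frame_count(n, 1024, hop, center=False))  # 20
print(frame_count(n, 1024, hop, center=True))   # 21
```

Besides the extra frame, the padding shifts every frame: with center=True, librosa frame i is centred at sample i * hop, whereas a framer that starts frame i at sample i * hop (assumed here for NWaves) covers a different stretch of audio, so a frame-by-frame comparison would disagree.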

PS. Your hop_length depends on len(audio), so specify its concrete value to avoid confusion.


OrangeOlko commented on July 18, 2024

Thank you very much for the example!

I took this value from the Python debug output:
print(int(np.floor(len(audio)/20)))

int hopLength = 18574;

I also set center=False.

After all this, I got these output shapes:
(13, 20) in librosa
(13, 8) in NWaves

What else could be wrong?


var left = waveFile[Channels.Left];   // waveFile is loaded earlier (not shown)
int sr = 8000;                        // sampling rate
int fftSize = 1024;
int filterbankSize = 128;

var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);

int hopLength = 18574;                // value printed by the Python code above

var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 13,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f,
    DctType = "2N",
    LifterSize = 0
};

var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(left);


ar1st0crat commented on July 18, 2024

You need to find out why librosa returns 371499 samples, because the number of samples should indeed be int(16.848027210884354 * 8000) = 134784.

Also, do you understand what hop_length means (both in librosa and in NWaves)? Currently you're trying to extract 20 short frames from a relatively long signal, and the distance between two adjacent frames is quite large as well (this is a very unusual scenario).
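A quick way to see what these settings mean is to express them in time and count the frames. This is a plain-Python sketch with a hypothetical helper; it assumes both libraries keep only full frames when no padding is applied (center=False), and it uses the numbers from this thread (1024-sample frames, hop_length = 18574, 371499 samples in librosa versus 134784 in NWaves).

```python
def frames_without_center(n_samples: int, frame_size: int, hop: int) -> int:
    """Number of full frames when no padding is applied (center=False)."""
    if n_samples < frame_size:
        return 0
    return 1 + (n_samples - frame_size) // hop


SR = 8000      # native sampling rate of the file
FRAME = 1024   # n_fft / frame size in samples
HOP = 18574    # hop_length = int(np.floor(371499 / 20))

print(f"frame = {1000 * FRAME / SR:.0f} ms, hop = {HOP / SR:.2f} s")  # frame = 128 ms, hop = 2.32 s
print(frames_without_center(371499, FRAME, HOP))  # 20: librosa's len(audio)
print(frames_without_center(134784, FRAME, HOP))  # 8: the samples NWaves sees
```

So each frame covers 128 ms, but consecutive frames start more than two seconds apart, and the different signal lengths alone explain the (13, 20) versus (13, 8) shapes.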

UPD.
It seems the signal is resampled to 22050 Hz during loading.

According to the librosa docs:

Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.
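The numbers in this thread are consistent with that explanation, which can be checked with integer arithmetic. In the sketch below the helper names are hypothetical, and rounding the resampled length up is an assumption about how librosa's resampler sizes its output.

```python
NATIVE_SR = 8000     # sampling rate stored in the file
LIBROSA_SR = 22050   # librosa.load default when sr is not given
N_LIBROSA = 371499   # len(audio) reported after librosa.load


def samples_at(n_samples: int, from_sr: int, to_sr: int) -> int:
    """Length of the same signal at another rate, rounded down."""
    return n_samples * to_sr // from_sr


def resampled_length(n_samples: int, from_sr: int, to_sr: int) -> int:
    """Output length of a resampler that rounds up (assumed librosa behaviour)."""
    return -(-n_samples * to_sr // from_sr)  # integer ceiling division


duration_s = N_LIBROSA / LIBROSA_SR
native_len = samples_at(N_LIBROSA, LIBROSA_SR, NATIVE_SR)

print(duration_s)  # the duration librosa.get_duration reported
print(native_len)  # 134784, the length NWaves sees
print(resampled_length(native_len, NATIVE_SR, LIBROSA_SR))  # 371499
```

Going from 134784 samples at 8 kHz to 22050 Hz gives 371498.4 samples, which rounds up to exactly the 371499 that librosa returned.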


OrangeOlko commented on July 18, 2024

Thanks, I will try to find out.


ar1st0crat commented on July 18, 2024

I've already found out (see my previous comment):

Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.

Simply call librosa.load(..., sr=None).


OrangeOlko commented on July 18, 2024

Thanks again for your help!
You are right about sr=None: now I get arrays of the same size in librosa and in NWaves, but the values inside are different.
I tried changing all the parameters, but none of them gave a better result.

[image]

mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)), dct_type=2, norm='ortho', htk=False,fmin=0,center=False, n_mels=128, window='hanning')

[image]


ar1st0crat commented on July 18, 2024

You need to analyze the results more carefully. Compare them frame by frame.
For example, here are the results of my experiments:

[image]

The values differ only very slightly, and this is due to round-off errors, so the algorithm is implemented correctly. In the first frame of your signal (and in many others as well) the first coefficient looks very different because the corresponding frame contains silence (sample values are very close to 0). In this case you essentially get a big value in mfcc_0 and zeros in the other coefficients (NWaves shows values on the order of 10^-5 ... 10^-7, but basically they are zeros). In any case, frames containing silence will most likely be discarded during feature analysis.
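Why silence produces one large coefficient and near-zero others can be shown with a stdlib-only Python sketch. It assumes that in a silent frame every mel-band energy falls to the log floor, so the log-mel vector is constant, and that the DCT is the orthonormal DCT-II (librosa's dct_type=2, norm='ortho'; treating NWaves' "2N" as the same transform is an assumption).

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II (the dct_type=2, norm='ortho' convention)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i, v in enumerate(x))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


N_MELS = 128
LOG_FLOOR = 1e-10  # LogFloor from the NWaves options above

# Silent frame (assumption): every band sits at the floor, so the log-mel
# vector is constant at 10 * log10(1e-10) = -100 dB.
log_mel = [10 * math.log10(LOG_FLOOR)] * N_MELS

mfcc = dct2_ortho(log_mel)[:13]
print(round(mfcc[0], 2))                     # -1131.37, i.e. sqrt(128) * (-100)
print(max(abs(c) for c in mfcc[1:]) < 1e-9)  # True: the other coefficients are ~0
```

A constant input has all of its energy in the k = 0 basis vector, so mfcc_0 carries the whole (large, negative) value and the remaining coefficients are zero up to rounding.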

Also, read more about:

  1. the first MFCC coefficient and what to do with it;
  2. filter banks and their settings (usually 24-40 bands are enough; I don't know why librosa uses 128 by default);
  3. windowed analysis, and what frame size and hop size mean.


OrangeOlko commented on July 18, 2024

Thank you very much for the details, I will investigate this!


OrangeOlko commented on July 18, 2024

I wanted to post the final solution, and along the way I found errors in my code; I'm listing them here in case they help others.

  1. The audio file was 8-bit, so librosa and the Windows libraries read the samples differently. The file was converted to 16-bit.
  2. For this file, float32 samples were needed to get the same results, so the file is now read with soundfile directly, which lets you set the dtype parameter (soundfile.read defaults to float64):
import soundfile as sf
audio, sr = sf.read(filename, dtype='float32')
  3. As @ar1st0crat mentioned, librosa resamples to 22050 Hz by default, so sr=None can be passed when loading. In my case we used soundfile, which doesn't resample the audio, so this parameter isn't needed.
  4. As the first MFCC coefficient doesn't carry relevant information, it can be omitted. But even with it, the results are now similar.
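The thread doesn't say exactly how each library scaled the 8-bit samples, so the following is only a sketch of the usual convention, using the standard library and made-up sample values: 8-bit WAV data is unsigned and centred on 128, and converting it to 16-bit by removing the offset and shifting left by 8 bits preserves the float values under the common /128 and /32768 scalings.

```python
import io
import struct
import wave

# Build a tiny 8-bit mono WAV in memory (hypothetical test data).
samples_u8 = bytes([0, 64, 128, 192, 255, 128])  # unsigned, 128 = silence
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(1)  # 1 byte per sample = 8-bit PCM
    w.setframerate(8000)
    w.writeframes(samples_u8)

buf.seek(0)
with wave.open(buf, "rb") as r:
    raw = r.readframes(r.getnframes())

# Common 8-bit convention: remove the 128 offset, divide by 128.
floats_8bit = [(b - 128) / 128 for b in raw]

# 8 -> 16 bit: remove the offset, shift left by 8 bits, pack as int16.
as_int16 = [(b - 128) << 8 for b in raw]
pcm16 = struct.pack(f"<{len(as_int16)}h", *as_int16)
floats_16bit = [v / 32768 for v in struct.unpack(f"<{len(as_int16)}h", pcm16)]

print(floats_8bit == floats_16bit)  # True: both paths give the same floats
```

If two readers disagree on the 8-bit scaling or offset, their float samples (and therefore the MFCCs) differ, while a 16-bit file leaves less room for such differences.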

Librosa code:

mfcc = librosa.feature.mfcc(y=audio, sr=sr, center=False, hop_length=int(np.floor(len(audio) / 20)),
                            n_mels=128, n_fft=1024, n_mfcc=20, fmax=4000, fmin=0, norm='ortho',
                            window='hann', htk=False, power=1, dct_type=2)

NWaves code:

// sr and chunkData (the samples as a List<float>) come from the loading code, not shown
int fftSize = 1024;
int filterbankSize = 128;
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);
var hopCount = 20;
var hopLength = chunkData.Count / hopCount;   // same as int(np.floor(len(audio) / 20))
var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 20,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f,
    DctType = "2N",
    LifterSize = 0,
    FftSize = fftSize,
    HighFrequency = 4000,
    SpectrumType = SpectrumType.Magnitude     // magnitude spectrum, like power=1 in librosa
};

var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(chunkData.ToArray());
var mfccFlatten = new List<float>();

// skip the first coefficient (MFCC 0); coefficients outer, frames inner in time order
for (int m = 1; m < 20; m++)
    for (int frame = 0; frame < mfccVectors.Count; frame++)
        mfccFlatten.Add(mfccVectors[frame][m]);
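The flattening loop walks coefficients in the outer loop and frames in the inner loop, which should match the row-major order of librosa's (n_mfcc, n_frames) output with the first row dropped, i.e. mfcc[1:, :].flatten() in NumPy. A pure-Python sketch with hypothetical helper names and toy data to check that ordering:

```python
def flatten_like_nwaves_loop(mfcc_vectors, n_coeffs):
    """Frames x coeffs input; coefficient-major output, skipping coeff 0."""
    out = []
    for m in range(1, n_coeffs):    # outer loop: coefficient index
        for frame in mfcc_vectors:  # inner loop: frames in time order
            out.append(frame[m])
    return out


def flatten_like_librosa(mfcc_matrix):
    """Coeffs x frames input; row-major output without row 0,
    which is what mfcc[1:, :].flatten() does for a NumPy array."""
    return [v for row in mfcc_matrix[1:] for v in row]


# Toy data: 3 frames x 4 coefficients, value = 10 * frame + coeff.
frames = [[10 * f + c for c in range(4)] for f in range(3)]  # NWaves layout
matrix = [list(col) for col in zip(*frames)]                 # librosa layout

print(flatten_like_nwaves_loop(frames, 4) == flatten_like_librosa(matrix))  # True
```

Because NWaves returns one vector per frame while librosa returns one row per coefficient, the transpose is what makes the two flattenings line up.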

Results using NWaves:
[image]

Results using Librosa:
[image]
