seanwood / gcc-nmf

Real-time GCC-NMF Blind Speech Separation and Enhancement

License: MIT License

speech-separation speech-enhancement gcc-nmf nmf real-time real-time-processing speech speech-processing cross-correlation generalized-cross-correlation

gcc-nmf's Introduction

GCC-NMF

GCC-NMF is a blind source separation and denoising algorithm that combines the GCC spatial localization method with the NMF unsupervised dictionary learning algorithm. GCC-NMF has been used for stereo speech separation and enhancement in both offline and real-time settings. Though we have focused on speech applications so far, GCC-NMF is a generic source separation and denoising algorithm and may well be applicable to other types of signals.

This GitHub repository provides:

  1. A standalone Python application to run and visualize GCC-NMF in real time.

  2. A series of iPython notebooks presenting GCC-NMF in tutorial style, building towards the low-latency, real-time context described in the sections below.

Journal Papers

Conference Papers

Real-time Speech Enhancement: RT-GCC-NMF

The Real-time Speech Enhancement standalone Python application implements the RT-GCC-NMF real-time speech enhancement algorithm. Users may interactively modify system parameters, including the NMF dictionary size and the GCC-NMF masking function parameters, and hear their effect on speech enhancement quality in real time.


Offline Speech Separation

The Offline Speech Separation iPython notebook shows how GCC-NMF can be used to separate multiple concurrent speakers in an offline fashion. The NMF dictionary is first learned directly from the mixture signal, and sources are subsequently separated by attributing each atom at each time to a single source based on the dictionary atoms' estimated time delay of arrival (TDOA). Source localization is achieved with GCC-PHAT.

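As a rough illustration of the localization step, here is a minimal GCC-PHAT sketch in plain NumPy (an illustration of the general technique, not the code used in the notebooks); roughly speaking, GCC-NMF applies this correlation with each NMF atom's spectrum as an additional frequency weighting.

    import numpy as np

    def estimateTDOA(x1, x2, sampleRate):
        # Estimate the time delay of arrival between two channels via GCC-PHAT.
        n = 2 * len(x1)                                         # zero-pad against circular wrap
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        crossSpectrum = X1 * np.conj(X2)
        phat = crossSpectrum / (np.abs(crossSpectrum) + 1e-16)  # PHAT: keep phase only
        gcc = np.fft.irfft(phat, n)
        gcc = np.concatenate((gcc[-(n // 2):], gcc[:n // 2]))   # center zero lag
        return (np.argmax(gcc) - n // 2) / float(sampleRate)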

Offline Speech Enhancement

The Offline Speech Enhancement iPython notebook demonstrates how GCC-NMF can be used for offline speech enhancement, where instead of multiple speakers we have a single speaker plus noise. In this case, individual atoms are attributed either to the speaker or to the noise at each point in time based on the atom TDOAs, as above. The target speaker is again localized with GCC-PHAT.

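The attribution step can be sketched as a simple mask over atoms (names here are illustrative, not the repository's API): atoms whose estimated TDOA falls near the target TDOA are kept, and the target spectrogram is reconstructed Wiener-style from the masked decomposition.

    import numpy as np

    def getAtomMask(atomTDOAs, targetTDOA, tdoaEpsilon):
        # atomTDOAs: (numAtoms, numFrames) TDOA estimate per atom per frame
        return (np.abs(atomTDOAs - targetTDOA) < tdoaEpsilon).astype(float)

    def getTargetSpectrogram(W, H, mask, mixtureSpectrogram, epsilon=1e-12):
        # W: (numFrequencies, numAtoms), H: (numAtoms, numFrames)
        V = W @ H + epsilon                          # full NMF reconstruction
        targetV = W @ (H * mask)                     # target atoms only
        return mixtureSpectrogram * (targetV / V)    # soft time-frequency mask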

Online Speech Enhancement

The Online Speech Enhancement iPython notebook demonstrates an online variant of GCC-NMF that works frame by frame to perform speech enhancement in real time. Here, the NMF dictionary is pre-learned from a different dataset than the one used at test time, NMF coefficients are inferred frame by frame, and speaker localization is performed with an accumulated GCC-PHAT method.

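Frame-by-frame coefficient inference can be sketched with standard KL-divergence multiplicative updates, holding the pre-learned dictionary W fixed (a generic NMF sketch, not the notebooks' code):

    import numpy as np

    def inferCoefficients(W, v, numUpdates=20, epsilon=1e-12):
        # W: (numFrequencies, dictionarySize), v: magnitude spectrum of one frame
        h = np.ones(W.shape[1])
        for _ in range(numUpdates):
            wh = W @ h + epsilon
            h *= (W.T @ (v / wh)) / (W.sum(axis=0) + epsilon)  # KL multiplicative update
        return h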

Low Latency Speech Enhancement

In the Low Latency Speech Enhancement iPython notebook we extend the online GCC-NMF approach to reduce algorithmic latency via an asymmetric STFT windowing strategy. Long analysis windows maintain the high spectral resolution required by GCC-NMF, while short synthesis windows drastically reduce algorithmic latency with little effect on speech enhancement quality. Algorithmic latency can be reduced from over 64 ms with traditional symmetric STFT windowing to below 2 ms with the proposed asymmetric STFT windowing, provided sufficient computational power is available.

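The latency figures follow directly from the window lengths. For example, assuming a 16 kHz sample rate and that algorithmic latency is dominated by the synthesis window length:

    sampleRate = 16000                   # assumed; the notebooks may use a different rate
    symmetricWindowSize = 1024           # symmetric analysis/synthesis window
    asymmetricSynthesisSize = 32         # short synthesis window in the asymmetric scheme
    print(1000.0 * symmetricWindowSize / sampleRate)       # 64.0 ms
    print(1000.0 * asymmetricSynthesisSize / sampleRate)   # 2.0 ms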


gcc-nmf's Issues

real-time gcc-nmf error?

I am trying to make the real-time gcc-nmf work. I have the correct data paths, etc., and the script created the pretrained files. The following error appears after starting the script:

C:\gccNMF>python demo5.py
INFO:root:GCCNMFConfig: loading configuration params...
INFO:root:TDOA
INFO:root: targetTDOAEpsilon: 5.0
INFO:root: targetTDOANoiseFloor: 0.0
INFO:root: numSpectrogramHistory: 128
INFO:root: microphoneSeparationInMetres: 0.1
INFO:root: numTDOAs: 64
INFO:root: numTDOAHistory: 128
INFO:root: targetTDOABeta: 2.0
INFO:root: gccPHATNLAlpha: 2.0
INFO:root: gccPHATNLEnabled: False
INFO:root:NMF
INFO:root: dictionarySize: 64
INFO:root: dictionaryType: Pretrained
INFO:root: numHUpdates: 0
INFO:root: dictionarySizes: [64, 128, 256, 512, 1024]
INFO:root:Audio
INFO:root: deviceIndex: None
INFO:root: sampleRate: 44100
INFO:root: numChannels: 2
INFO:root:STFT
INFO:root: blockSize: 512
INFO:root: windowSize: 1024
INFO:root: hopSize: 512
INFO:root:GCCNMFPretraining: Loading pretrained W (size 64): ./pretrainedW\W_64.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 128): ./pretrainedW\W_128.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 256): ./pretrainedW\W_256.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 512): ./pretrainedW\W_512.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 1024): ./pretrainedW\W_1024.npy
INFO:root:RealtimeGCCNMF: Starting with audio path: ./test.wav
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

WARNING:theano.sandbox.cuda:The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 860M (CNMeM is enabled with initial size: 70.0% of memory, cuDNN 5005)
INFO:root:GCCNMFConfig: loading configuration params...
INFO:root:TDOA
INFO:root: targetTDOAEpsilon: 5.0
INFO:root: targetTDOANoiseFloor: 0.0
INFO:root: numSpectrogramHistory: 128
INFO:root: microphoneSeparationInMetres: 0.1
INFO:root: numTDOAs: 64
INFO:root: numTDOAHistory: 128
INFO:root: targetTDOABeta: 2.0
INFO:root: gccPHATNLAlpha: 2.0
INFO:root: gccPHATNLEnabled: False
INFO:root:NMF
INFO:root: dictionarySize: 64
INFO:root: dictionaryType: Pretrained
INFO:root: numHUpdates: 0
INFO:root: dictionarySizes: [64, 128, 256, 512, 1024]
INFO:root:Audio
INFO:root: deviceIndex: None
INFO:root: sampleRate: 44100
INFO:root: numChannels: 2
INFO:root:STFT
INFO:root: blockSize: 512
INFO:root: windowSize: 1024
INFO:root: hopSize: 512
INFO:root:GCCNMFPretraining: Loading pretrained W (size 64): ./pretrainedW\W_64.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 128): ./pretrainedW\W_128.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 256): ./pretrainedW\W_256.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 512): ./pretrainedW\W_512.npy
INFO:root:GCCNMFPretraining: Loading pretrained W (size 1024): ./pretrainedW\W_1024.npy
INFO:root:RealtimeGCCNMF: Starting with audio path: ./test.wav
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

WARNING:theano.sandbox.cuda:The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 860M (CNMeM is enabled with initial size: 70.0% of memory, cuDNN 5005)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
prepare(preparation_data)
File "C:\Python27\lib\multiprocessing\forking.py", line 509, in prepare
'parents_main', file, path_name, etc
File "C:\gccNMF\demo5.py", line 5, in
RealtimeGCCNMF()
File "C:\gccNMF\runRealtimeGCCNMF.py", line 50, in init
self.initProcesses(params)
File "C:\gccNMF\runRealtimeGCCNMF.py", line 91, in initProcesses
self.audioProcess.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 258, in init
cmd = get_command_line() + [rhandle]
File "C:\Python27\lib\multiprocessing\forking.py", line 358, in get_command_li
ne
is not going to be frozen to produce a Windows executable.''')
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.

        This probably means that you are on Windows and you have
        forgotten to use the proper idiom in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce a Windows executable.

Traceback (most recent call last):
File "demo5.py", line 5, in
RealtimeGCCNMF()
File "C:\gccNMF\runRealtimeGCCNMF.py", line 50, in init
self.initProcesses(params)
File "C:\gccNMF\runRealtimeGCCNMF.py", line 91, in initProcesses
self.audioProcess.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 277, in init
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Python27\lib\pickle.py", line 224, in dump
self.save(obj)
File "C:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python27\lib\pickle.py", line 425, in save_reduce
save(state)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
save(v)
File "C:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python27\lib\pickle.py", line 425, in save_reduce
save(state)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 568, in save_tuple
save(element)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 492, in save_string
self.write(BINSTRING + pack("<i", n) + obj)
IOError: [Errno 22] Invalid argument

what is wrong? thanks!
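
The RuntimeError text itself points at the standard fix: on Windows, multiprocessing spawns child processes by re-importing the main module, which is also why the configuration log above appears twice. Guarding the entry point stops the child from starting processes again. A minimal sketch, assuming demo5.py constructs RealtimeGCCNMF at module level:

    # Hypothetical demo5.py with the Windows-safe idiom from the error message:
    from multiprocessing import freeze_support
    from runRealtimeGCCNMF import RealtimeGCCNMF   # import path is an assumption

    if __name__ == '__main__':
        freeze_support()    # harmless unless frozen into a Windows executable
        RealtimeGCCNMF()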

gccNMF model

hi
I tried to run the file "runGccNMF.py", and it shows "No module named 'gccNMF'".
I'm not sure how to fix this problem; if you can, please tell me how to solve it.

best
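
A likely cause (a guess, since the working directory isn't shown): the script is launched from outside the repository root, so the gccNMF package directory is not on Python's module search path. Running from the repository root usually resolves this; alternatively, the path can be added explicitly:

    import sys
    sys.path.insert(0, r'/path/to/gcc-nmf')   # hypothetical checkout location
    import gccNMF                             # should now resolve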

Is there a logical error?

In the processFrames method of the GCCNMFProcessor class, targetTDOAIndex is set after the call to getTFMask, which means the mask is computed using the target TDOA of the previous 6 frames, rather than the latest 6 frames including the current one.
My English is not very good, please forgive me.

about offline speech separation

”Due to differences in TDOA estimation for the 1m and 5cm microphone separation settings, including increased spatial aliasing and lower spatial resolution in the 5cm case,we also compared results averaged over these two settings separately. While we found somewhat decreased scores and increased variance in the 5cm case, the results were generally comparable.“
In the offline speech separation task, does the code need changes for the 5 cm versus 1 m recordings? When I ran your code, the 1 m data was separated correctly, but none of the 5 cm recordings gave correct results: only one or two peaks can be obtained, when in truth there should be three sources.
My English is not very good; this text is from Google Translate, please forgive me.

Why gain factor when reconstructing the signal?

Hi seanwood @seanwood, I'm a junior in BSS, and thank you for your useful and effective open-source GCC-NMF. I happened to come across an unexplained factor when applying the ISTFT and reconstructing the waveform:

def getTargetSignalEstimates(targetSpectrogramEstimates, windowSize, hopSize, windowFunction, numSamples):
    numTargets, numChannels, numFreq, numTime = targetSpectrogramEstimates.shape
    stftGainFactor = hopSize / float(windowSize) * 2
......
    return array(targetSignalEstimates) * stftGainFactor

Of course the outcome is good, but I really don't know why it multiplies by 2*hop_size/n_fft. Could you please give an explanation? Thank you.
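
One plausible reading of the factor (my interpretation, not confirmed by the author): overlap-adding a Hann analysis window at hop H produces a nearly constant gain of sum(window)/H ~= windowSize/(2*H), since a Hann window has mean ~0.5; multiplying by 2*H/windowSize simply undoes that gain. A quick numerical check:

    import numpy as np

    windowSize, hopSize = 1024, 256
    window = np.hanning(windowSize)
    ola = np.zeros(windowSize * 8)
    for start in range(0, len(ola) - windowSize + 1, hopSize):
        ola[start:start + windowSize] += window     # overlap-add the bare window

    print(ola[windowSize:-windowSize].mean())   # ~= windowSize / (2*hopSize) = 2.0
    print(2.0 * hopSize / windowSize)           # stftGainFactor = 0.5, its reciprocal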

Synthesis window in lowLatencySpeechEnhancement.ipynb

Hi,

is there a reason why the synthesis window is not applied?

See also the attached sketch based on the lowLatencySpeechEnhancement.ipynb example.

fix = False

  • your version
  • clearly visible modulation in the OLA output
  • output gain != 1

fix = True

  • adequate hop size
  • unity gain
  • synthesis window after irfft

[Plots: "orig" shows visible modulation in the OLA output; "fixed" shows unity gain.]

from matplotlib.pyplot import *
from numpy import *
from numpy.fft import rfft, irfft

# Apply the fix
fix = False

# Preprocessing params
fftSize = 1024

# Asymmetric windowing params
analysisWindowSize = fftSize
synthesisWindowSize = 128

asymmetricHopSize = synthesisWindowSize // 4 if fix else (synthesisWindowSize * 3) // 4
m = synthesisWindowSize // 2
k = analysisWindowSize
d = 0

# Symmetric windowing params
symmetricWindowSize = fftSize
symmetricHopSize = asymmetricHopSize # to better compare results

# Generate test signal
stereoSamples = ones((1, fftSize*10))
numChannels, numSamples = stereoSamples.shape

def getAsymmetricAnalysisWindow(k, m, d):
    risingSqrtHann = sqrt( hanning(2*(k-m-d)+1)[:2*(k-m-d)] )
    fallingSqrtHann = sqrt( hanning(2*m+1)[:2*m] )

    window = zeros(k)
    window[:d] = 0
    window[d:k-m] = risingSqrtHann[:k-m-d]
    window[k-m:] = fallingSqrtHann[-m:]

    return window

def getAsymmetricSynthesisWindow(k, m, d):
    risingSqrtHannAnalysis = sqrt( hanning(2*(k-m-d)+1)[:2*(k-m-d)] )
    risingNormalizedHann = hanning(2*m+1)[:m] / risingSqrtHannAnalysis[k-2*m-d:k-m-d]
    fallingSqrtHann = sqrt( hanning(2*m+1)[:2*m] )

    window = zeros(k)
    window[:-2*m] = 0
    window[-2*m:-m] = risingNormalizedHann
    window[-m:] = fallingSqrtHann[-m:]

    return window

def performOnlineSpeechEnhancement(analysisWindow, synthesisWindow, hopSize):
    # Setup variables to save speech enhancement results
    numFrequencies = len(rfft(zeros(len(analysisWindow))))
    numFrames = (numSamples-len(synthesisWindow)) // hopSize

    if fix:
        gainFactor = hopSize / sum(analysisWindow * synthesisWindow)
    else:
        gainFactor = hopSize / float(len(synthesisWindow)) * 2

    targetEstimateSamplesOLA = zeros_like(stereoSamples)
    inputSpectrogram = zeros( (2, numFrequencies, numFrames), 'complex64')
    outputSpectrogram = zeros( (2, numFrequencies, numFrames), 'complex64')

    for frameIndex in range(numFrames):
        # compute FFT
        frameStart = frameIndex * hopSize
        frameEnd = frameStart + analysisWindowSize
        stereoSTFTFrame = rfft( stereoSamples[:, frameStart:frameEnd] * analysisWindow )
        inputSpectrogram[..., frameIndex] = stereoSTFTFrame
        outputSpectrogram[..., frameIndex] = stereoSTFTFrame

        # reconstruct time domain samples
        recStereoSTFTFrame = irfft(stereoSTFTFrame)

        if fix:
            # apply synthesis window as well
            recStereoSTFTFrame *= synthesisWindow

        # overlap-add to output samples
        targetEstimateSamplesOLA[:, frameStart:frameEnd] += recStereoSTFTFrame

    targetEstimateSamplesOLA *= gainFactor

    return inputSpectrogram, outputSpectrogram, targetEstimateSamplesOLA

analysisWindow = getAsymmetricAnalysisWindow(k, m, d)
synthesisWindow = getAsymmetricSynthesisWindow(k, m, d)

symmetricWindow = sqrt(hanning(symmetricWindowSize))

symmetricResults = performOnlineSpeechEnhancement(symmetricWindow, symmetricWindow, symmetricHopSize)
asymmetricResults = performOnlineSpeechEnhancement(analysisWindow, synthesisWindow, asymmetricHopSize)

title('fixed' if fix else 'orig')
plot(symmetricResults[-1][-1], label='symmetric', color='b', alpha=0.5)
plot(asymmetricResults[-1][-1], label='asymmetric', color='r', alpha=0.5)
legend()
show()

two small problems?

The first problem is that the low-latency algorithm runs, but the output for both the symmetric and asymmetric cases is silence, no sound. I copied the exact code from the Python notebooks.

The second problem is with the low-latency and online speech enhancement algorithms. Compared to the first two algorithms, which output the correct WAV format, these last two output everything correctly except that the bit depth is doubled for some reason: instead of the signed 16-bit WAV input, I get 32-bit float WAV output. How can I fix this?

thanks!
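
On the second problem, a likely explanation (an assumption, since the file-writing code isn't shown) is that the notebooks hand float arrays directly to scipy.io.wavfile, which stores them as 32-bit float WAV. Rescaling and casting to int16 before writing yields a standard 16-bit PCM file:

    import numpy as np
    from scipy.io import wavfile

    def writeInt16Wav(path, sampleRate, samples):
        samples = np.clip(samples, -1.0, 1.0)    # assume float samples in [-1, 1]
        wavfile.write(path, sampleRate, (samples * 32767).astype(np.int16))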

Any Audio demo?

Hi,
I'm new to audio enhancement, and there seems to be a lot to learn. Is there any audio sample or pre-trained model available for quick evaluation?
Thanks

argmax error

I am running python 2.7 64-bit with latest scipy and numpy.

The first demo, with multiple speakers, works, but for the speech enhancement task I get an error:
argMaxGCCNMF = argmax(gccNMF, axis=1)
NameError: name 'argmax' is not defined

how to fix this? Thank you!
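
A likely fix (a guess: the failing cell expects an earlier "from numpy import *" to have run) is to import argmax explicitly:

    from numpy import argmax, random

    gccNMF = random.rand(4, 64, 100)        # stand-in for the real GCC-NMF array
    argMaxGCCNMF = argmax(gccNMF, axis=1)   # the line from the traceback now runs
    print(argMaxGCCNMF.shape)               # (4, 100)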

Preprocessing for chimeTrainSet.npy

I am interested in making trainSet.npy for 'onlineSpeechEnhancement'.

What is the way to make a training set for pre-learning the dictionary?

How can I make a training set from other wav files?

Thank you.
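
One plausible approach (an assumption: trainSet.npy holds magnitude STFT frames, frequencies by frames, matching the STFT parameters above) is to window and transform your own wav files and stack the magnitude spectra:

    import numpy as np
    from scipy.io import wavfile

    def buildTrainSet(wavPaths, windowSize=1024, hopSize=512):
        window = np.hanning(windowSize)
        frames = []
        for path in wavPaths:
            sampleRate, samples = wavfile.read(path)
            samples = samples.astype(np.float64)
            if samples.ndim > 1:                 # mix down to mono
                samples = samples.mean(axis=1)
            for start in range(0, len(samples) - windowSize, hopSize):
                frame = samples[start:start + windowSize] * window
                frames.append(np.abs(np.fft.rfft(frame)))
        return np.array(frames).T                # numFrequencies x numFrames

    # np.save('trainSet.npy', buildTrainSet(['speech1.wav', 'speech2.wav']))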

runRealtimeGCCNMF

Hello,

I have a question about a problem I'd like to fix. When I run 'runRealtimeGCCNMF.py' and hit the play button, the variable 'realGCC' in 'gccNMFProcessor.py' at 'processFrames' contains only NaN values for the default input 'dev_Sq1_Co_A_mix.wav'.

How can I fix it?

Thank you
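
A common source of NaNs in PHAT-weighted cross-correlation (an assumption about this particular case, since realGCC's computation isn't shown) is dividing the cross-power spectrum by a zero magnitude, e.g. on silent frames. The usual guard is a small epsilon in the denominator:

    import numpy as np

    def phatWeightedCrossSpectrum(X1, X2, epsilon=1e-16):
        # PHAT weighting with a guard against 0/0 = NaN on silent frames
        crossSpectrum = X1 * np.conj(X2)
        return crossSpectrum / (np.abs(crossSpectrum) + epsilon)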
