Cross-domain Diffusion based Speech Enhancement for Very Noisy Speech

Authors: Heming Wang and DeLiang Wang

Abstract: Deep learning based speech enhancement has achieved remarkable success, but challenges remain in low signal-to-noise ratio (SNR), nonstationary noise scenarios. In this study, we propose to incorporate diffusion-based learning into an enhancement model to improve robustness in extremely noisy conditions. Specifically, a frequency-domain diffusion-based generative module is employed, and it accepts the enhanced signal obtained from a time-domain supervised enhancement module as an auxiliary input to learn to recover clean speech spectrograms. Experimental results on the TIMIT dataset demonstrate the advantage of this approach, showing better enhancement performance than other strong baselines at both -5 dB and -10 dB SNR.
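The cross-domain idea in the abstract can be sketched in code: a diffusion module operating in the spectrogram (frequency) domain denoises step by step while being conditioned on the auxiliary output of a time-domain enhancement module. The sketch below is a minimal, illustrative DDPM-style formulation; the function names, noise schedule values, and the placeholder noise-prediction model are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.05):
    # Linear variance schedule (illustrative values, not the paper's).
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    # Forward process: corrupt a clean spectrogram x0 to diffusion step t.
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def reverse_step(x_t, t, cond, eps_model, betas, alphas, alpha_bars, rng):
    # One reverse (denoising) step. `cond` is the auxiliary enhanced
    # spectrogram from the time-domain supervised module; `eps_model`
    # stands in for a trained network that predicts the added noise.
    eps = eps_model(x_t, t, cond)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean
```

At inference, one would start from Gaussian noise in the spectrogram domain and iterate `reverse_step` from t = T-1 down to 0, passing the time-domain module's enhanced spectrogram as `cond` at every step, so the generative module is guided toward the clean speech rather than sampling unconditionally.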


Diffusion Process


Spectrogram demonstrations



Audio Demos

We provide audio demos for one male and one female speaker at two SNR levels, -5 dB and -10 dB:

Factory Noise -5 dB

Noisy Mixture GCRN DPARN Proposed Clean Speech

Factory Noise -10 dB

Noisy Mixture GCRN DPARN Proposed Clean Speech

Babble Noise -5 dB

Noisy Mixture GCRN DPARN Proposed Clean Speech

Babble Noise -10 dB

Noisy Mixture GCRN DPARN Proposed Clean Speech