Cross-domain Diffusion based Speech Enhancement for Very Noisy Speech
Authors:
Heming Wang and DeLiang Wang
Abstract:
Deep learning based speech enhancement has achieved remarkable success, but challenges remain in low signal-to-noise ratio (SNR) nonstationary noise scenarios. In this study, we propose to incorporate diffusion-based learning into an enhancement model and improve robustness in extremely noisy conditions. Specifically, a frequency-domain diffusion-based generative module is employed, and it accepts the enhanced signal obtained from a time-domain supervised enhancement module as an auxiliary input to learn to recover clean speech spectrograms. Experimental results on the TIMIT dataset demonstrate the advantage of this approach and show better enhancement performance over other strong baselines in both -5 and -10 dB SNR noisy conditions.
Diffusion Process
spectrograms demonstrations
Audio Demos
We provide audio demos for one male and one female speakers under two SNR levels, -5 dB and -10 dB: