Unifying Robustness and Fidelity:
Using Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

Authors: Anonymous

Abstract: Enhancing speech signals in adverse acoustic environments is a long-standing challenge in speech processing. Existing deep-learning-based enhancement methods struggle to remove background noise and reverberation in challenging real-world scenarios. To address this, we propose a novel approach that uses a pretrained speech codec to synthesize clean speech from degraded inputs. In addition, we conduct a comprehensive comparison with two other pipelines: mapping-based and vocoder-based speech enhancement. Experimental results on both simulated and recorded datasets demonstrate the effectiveness and robustness of the proposed method. We observe that generative methods are more robust to degradation than conventional mapping-based speech enhancement. In particular, by leveraging the codec, we achieve improved audio quality with reduced background noise and reverberation.


Architectures of the mapping-based, vocoder-based, and codec-based pipelines


Fig. 1: The architecture of the proposed codec-based pipeline.
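The codec pipeline in Fig. 1 replaces waveform regression with token prediction: an encoder maps degraded speech features to the discrete tokens of a frozen, pretrained neural codec, whose decoder then synthesizes clean audio. Below is a minimal PyTorch sketch; the `codec.decode` interface, the mel-spectrogram input, and all layer sizes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CodecEnhancer(nn.Module):
    """Predict clean codec tokens from degraded speech features, then
    decode them with a frozen pretrained neural codec."""

    def __init__(self, codec, n_codebooks=8, codebook_size=1024,
                 n_mels=80, dim=512):
        super().__init__()
        self.codec = codec.eval()                      # pretrained, kept frozen
        for p in self.codec.parameters():
            p.requires_grad = False
        self.encoder = nn.GRU(n_mels, dim, num_layers=2, batch_first=True)
        # one classification head per codebook of the codec
        self.heads = nn.ModuleList(
            nn.Linear(dim, codebook_size) for _ in range(n_codebooks))

    def forward(self, degraded_mel):                   # (B, T, n_mels)
        h, _ = self.encoder(degraded_mel)              # (B, T, dim)
        logits = [head(h) for head in self.heads]      # per-codebook logits
        # greedy token selection; training would apply cross-entropy against
        # tokens extracted from the paired clean speech
        tokens = torch.stack([l.argmax(-1) for l in logits], dim=1)
        return self.codec.decode(tokens)               # clean waveform
```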


Fig. 2: The architecture of the vocoder-based pipeline.
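Analogously, the vocoder pipeline in Fig. 2 regresses clean acoustic features from the degraded input and leaves waveform synthesis to a frozen, pretrained neural vocoder. A hedged sketch under the same assumptions (mel-spectrogram features, arbitrary layer sizes, a vocoder callable on mel input):

```python
import torch.nn as nn

class VocoderEnhancer(nn.Module):
    """Map a degraded mel-spectrogram to a clean one, then synthesize
    the waveform with a frozen pretrained neural vocoder."""

    def __init__(self, vocoder, n_mels=80, dim=512):
        super().__init__()
        self.vocoder = vocoder.eval()                  # pretrained, kept frozen
        for p in self.vocoder.parameters():
            p.requires_grad = False
        self.net = nn.Sequential(                      # simple feature regressor
            nn.Linear(n_mels, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_mels))

    def forward(self, degraded_mel):                   # (B, T, n_mels)
        clean_mel = self.net(degraded_mel)             # enhanced features
        return self.vocoder(clean_mel)                 # clean waveform
```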


Fig. 3: The architecture of the mapping-based pipeline.
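For contrast, the conventional mapping-based baseline in Fig. 3 (the "STFT-based" system in the demo tables) estimates a magnitude mask in the STFT domain and reuses the noisy phase, which is where fidelity suffers under heavy noise and reverberation. A minimal sketch, with the mask network and STFT settings chosen for illustration only:

```python
import torch
import torch.nn as nn

class MappingEnhancer(nn.Module):
    """STFT-domain enhancement: estimate a magnitude mask, apply it to
    the noisy spectrogram, and invert using the noisy phase."""

    def __init__(self, n_fft=512, hop=128, dim=512):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        n_bins = n_fft // 2 + 1
        self.rnn = nn.GRU(n_bins, dim, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(dim, n_bins), nn.Sigmoid())

    def forward(self, wav):                            # (B, samples)
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop,
                          window=window, return_complex=True)
        mag = spec.abs().transpose(1, 2)               # (B, T, F)
        h, _ = self.rnn(mag)
        mask = self.mask(h).transpose(1, 2)            # (B, F, T), in [0, 1]
        enhanced = spec * mask                         # keep the noisy phase
        return torch.istft(enhanced, self.n_fft, self.hop, window=window)
```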


Results


Objective scores of all pipelines for comparison.


MOS results for both simulated and real-world samples.


Audio Demos

We provide audio demos for both simulated data and real-world recordings:

I. Reverberant Samples

ID | Unprocessed | STFT-based | Vocoder-based | Codec-based | Ground Truth
A  | [audio]     | [audio]    | [audio]       | [audio]     | [audio]
B  | [audio]     | [audio]    | [audio]       | [audio]     | [audio]
C  | [audio]     | [audio]    | [audio]       | [audio]     | [audio]
D  | [audio]     | [audio]    | [audio]       | [audio]     | [audio]

II. Real-World Recording Samples (no ground-truth references exist for real recordings)

ID | Unprocessed | STFT-based | Vocoder-based | Codec-based
E  | [audio]     | [audio]    | [audio]       | [audio]
F  | [audio]     | [audio]    | [audio]       | [audio]
G  | [audio]     | [audio]    | [audio]       | [audio]
H  | [audio]     | [audio]    | [audio]       | [audio]