Perceptual metric for speech quality evaluation (PMSQE): Source code and audio examples

Juan M. Martín-Doñas, Angel M. Gomez, Jose A. Gonzalez, Antonio M. Peinado

A Deep Learning Loss Function based on the Perceptual Evaluation of the Speech Quality

This paper proposes a perceptual metric for speech quality evaluation which is suitable, as a loss function, for training deep learning methods. This metric, derived from the perceptual evaluation of speech quality (PESQ) algorithm, is computed on a per-frame basis from the power spectra of the reference and processed speech signals. Thus, two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics. The proposed loss function is evaluated for noisy speech enhancement with deep neural networks. Experimental results show that our metric achieves significant gains in speech quality (evaluated using an objective metric and a listening test) compared to using MSE or other perceptually based loss functions from the literature.
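To make the idea concrete, the following minimal PyTorch sketch shows how two disturbance terms can amend a spectral MSE loss. It is an illustration only, not the released PMSQE code: the log compression is a crude stand-in for the Bark-scale and loudness mappings of the actual metric, and the weights alpha and beta are hypothetical.

    import torch

    def pmsqe_like_loss(ref_power, deg_power, alpha=0.1, beta=0.3, eps=1e-8):
        """Toy per-frame perceptual loss in the spirit of PMSQE (not the
        authors' implementation). ref_power / deg_power: tensors of shape
        (frames, bins) holding the power spectra of the reference and
        processed speech."""
        # Log compression as a simplified stand-in for the loudness mapping.
        ref_l = torch.log(ref_power + eps)
        deg_l = torch.log(deg_power + eps)
        diff = deg_l - ref_l
        # Symmetric disturbance: any spectral deviation is penalized.
        d_sym = diff.abs().mean(dim=-1)
        # Asymmetric disturbance: additive (positive) distortions weigh more,
        # mimicking PESQ's asymmetry between added noise and attenuation.
        d_asym = torch.clamp(diff, min=0.0).mean(dim=-1)
        # The two disturbance terms amend the plain spectral MSE.
        mse = ((deg_power - ref_power) ** 2).mean(dim=-1)
        return (mse + alpha * d_sym + beta * d_asym).mean()

Given STFT power spectra P_ref and P_deg of shape (frames, bins), loss = pmsqe_like_loss(P_ref, P_deg) can be backpropagated like any other PyTorch loss; for the exact per-frame PMSQE computation, see the released code linked below.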

Paper

TensorFlow and PyTorch Code (Updated 15-05-2019)

Accepted for publication in IEEE Signal Processing Letters.

Some audio examples are provided below.

Clean and noisy speech signals, together with enhanced speech signals from a DNN trained with the MSE loss function, the wMSE-SVS loss function, and the proposed PMSQE approach.

Noise / SNR (clean, noisy, MSE, wMSE-SVS, and PMSQE samples for each condition):

Bus station: 15 dB, 5 dB
Street: 10 dB, 0 dB
Pedestrian street: 10 dB, 5 dB
Mall: 5 dB, 0 dB
Cafe: 15 dB, 10 dB
Car: 10 dB, 0 dB
Bus: 10 dB, 5 dB
Babble: 5 dB, 0 dB

Contact: Juan M. Martín-Doñas