Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation
Abstract:
The effective use of synthetic multi-channel data for training denoising DNNs has been demonstrated for several speech technologies such as ASR and speaker recognition. This paper compares the use of real and synthetic data for training denoising DNNs for multi-microphone speaker recognition. Large reductions in error rates (37% and 50% for the AVG and POOL EERs and 20% and 30% for the AVG and POOL min DCFs) are attained on Mixer 6 microphone data using Mixer 1 and 2 multi-microphone data to train a denoising DNN. Nearly the same reduction in error rate is realized using room impulse response and noise estimates (RIRs) derived from the Mixer 1 and 2 data and applied to just the telephone channel. Applying RIRs from three publicly available databases used in the Kaldi Aspire evaluation system yields lower but significant reductions in error rate (16% and 34% relative improvement in AVG and POOL EER and 13% and 25% relative improvement in AVG and POOL min DCFs). In all cases, the telephone channel performance on SRE10 is improved by the denoising DNNs, with the real Mixer 1 and 2 trained DNN reducing EER by 12% and min DCF by 8.9%.
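
As an illustration of how synthetic parallel data of this kind can be produced, the following is a minimal sketch that convolves a clean (telephone-channel) waveform with a room impulse response and adds noise at a chosen signal-to-noise ratio. It is not the procedure used in this paper: the function name `simulate_far_field`, its parameters, and the SNR-based noise scaling are assumptions made for the example, and the abstract does not specify how the noise estimates are mixed in.

```python
import numpy as np


def simulate_far_field(clean, rir, noise, snr_db):
    """Return a synthetic microphone-channel version of `clean`.

    Hypothetical helper (not from the paper): applies a room impulse
    response by convolution, then adds noise scaled to `snr_db`.
    """
    # Apply the room impulse response and trim to the original length
    # so the clean and degraded signals stay sample-aligned (parallel).
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Repeat or truncate the noise so it matches the signal length.
    noise = np.resize(noise, len(clean))

    # Scale the noise so that signal power / noise power equals the
    # requested SNR (assumed mixing rule for this sketch).
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))

    return reverberant + gain * noise


# Toy usage with random data standing in for real audio.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1600)
rir = np.array([1.0, 0.0, 0.5, 0.0, 0.25])  # made-up impulse response
noise = rng.standard_normal(400)
noisy = simulate_far_field(clean, rir, noise, snr_db=10.0)
```

Each (`clean`, `noisy`) pair could then serve as a target and input example for training a denoising network, with the clean signal as the regression target.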