Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection
Abstract:
Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving the performance of VAD at low signal-to-noise ratios. Here we explore contextual information by machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and the raw acoustic feature expanded by a given window (called a resolution). At the middle level, we describe a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single frame using only one DNN and then aggregates the base predictions into a stronger prediction for that frame; this differs from computationally expensive boosting methods, which train ensembles of classifiers to obtain multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates contextual information by concatenating cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization to unseen test scenarios, approaching the performance achieved with noise-dependent training.
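To make the bDNN aggregation step concrete, below is a minimal sketch assuming that the single DNN, given a context window of 2W+1 frames centered on frame t, outputs one base prediction for every frame in that window, and that the base predictions for a frame are combined by simple averaging over all windows covering it. The function and variable names are hypothetical and for illustration only, not the paper's implementation.

```python
import numpy as np

def bdnn_aggregate(base_preds: np.ndarray, W: int) -> np.ndarray:
    """Aggregate bDNN base predictions (averaging assumed).

    base_preds : (T, 2W+1) array; row t holds the DNN's predictions
        for frames t-W .. t+W when the window is centered on frame t.
    W : half-width of the context window.
    Returns a length-T array of aggregated per-frame predictions.
    """
    T = base_preds.shape[0]
    agg = np.zeros(T)
    count = np.zeros(T)
    for t in range(T):
        # Column j of row t is the base prediction for frame u = t - W + j.
        for j, u in enumerate(range(t - W, t + W + 1)):
            if 0 <= u < T:
                agg[u] += base_preds[t, j]
                count[u] += 1
    # Every frame appears in at least its own window, but guard anyway.
    return agg / np.maximum(count, 1)
```

Averaging over the base predictions that a frame receives from neighboring windows acts like an ensemble over contexts while requiring only a single network's forward pass per window, which is the abstract's stated advantage over training multiple boosted classifiers.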