Lip Tracking for Audio-Visual Speech Recognition.
AIR FORCE INST OF TECH WRIGHT-PATTERSON AFB OH
Human speech is conveyed through both acoustic and visual channels and is therefore inherently multi-modal. Further, the two channels are largely complementary in that the acoustic signal typically contains information about the manner of articulation while the visual signal embodies knowledge of the place of articulation. This orthogonal nature of the audio and visual components has enticed researchers to develop audio-visual speech recognition systems that have been shown to be robust to acoustic noise. A fundamental requirement of automatic audio-visual speech recognition is the need for real-time tracking; however, this necessity has been largely ignored by the lipreading community. This work presents a new approach for tracking unadorned lips in real time (50 fields/sec). The tracking framework presented combines comprehensive shape and motion models learnt from continuous speech sequences with focused image feature detection methods. Statistical models of the grey-level appearance of the mouth are shown to enable identification of the lip boundary in poorly contrasted grey-level images. The combined armory of these modeling approaches permits robust, real-time tracking of unadorned lips. Isolated-word recognition experiments using dynamic time warping and Hidden Markov Model-based recognizers demonstrate that real-time, contour-based lip tracking can be used to provide robust recognition of degraded speech. In noisy acoustic conditions, the performance of recognizers incorporating visual shape parameters is superior to that of acoustic-only solutions, providing error rate reductions of up to 44%.
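The isolated-word experiments above pair per-frame lip-shape parameters with a dynamic-time-warping recognizer. A minimal sketch of the DTW distance on 1-D feature sequences follows; the function name, the absolute-difference frame cost, and the symmetric step pattern are illustrative assumptions, not the thesis's actual implementation.

```python
def dtw_distance(a, b):
    # Dynamic time warping distance between two 1-D feature sequences
    # (e.g. one lip-shape parameter sampled over time). Per-frame cost is
    # the absolute difference; allowed steps are diagonal match, insertion,
    # and deletion, so sequences of different lengths can be aligned.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # deletion
                                 D[i][j - 1],      # insertion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

In an isolated-word setup, each test utterance would be compared against one stored template per vocabulary word and assigned the label of the nearest template under this distance.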
- Voice Communications