Automatic lipreading has major potential impact for speech recognition, supplementing and complementing the acoustic modality. Most attempts at lipreading have been performed on small vocabulary tasks, due to a shortage of suitable audio-visual datasets. In this work we use the publicly available TCD-TIMIT database, designed for large vocabulary continuous audio-visual speech recognition. We compare the viseme recognition performance of the two most widely used features for lipreading, the Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework, also exploiting recent advances in AAM fitting. We find that DCT features outperform AAM features by more than 6% on a viseme recognition task with 56 speakers, yet the overall accuracy of the DCT remains low (32-34%). We conclude that a fundamental rethink of how visual features are modelled may be needed for this task.
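To make the DCT baseline concrete, the following is a minimal sketch of how low-frequency 2D-DCT features are commonly extracted from a cropped mouth region of interest (ROI) in the lipreading literature. It is not the authors' exact pipeline: the ROI size (32x32), the number of retained coefficients (44), and the zigzag ordering are illustrative assumptions, not values taken from the paper.

import numpy as np
from scipy.fft import dctn

def zigzag_indices(n):
    """(row, col) pairs of an n x n block in zigzag (low-frequency-first) order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],                       # anti-diagonal index
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def dct_features(roi, num_coeffs=44):
    """Low-frequency 2D-DCT feature vector from a square grayscale mouth ROI."""
    coeffs = dctn(roi.astype(np.float64), norm='ortho')  # 2D DCT-II of the whole ROI
    idx = zigzag_indices(roi.shape[0])                   # lowest frequencies first
    return np.array([coeffs[r, c] for r, c in idx[:num_coeffs]])

# Example: a 32x32 mouth crop yields a 44-dimensional static feature vector;
# in an HMM framework, delta and delta-delta coefficients would typically be
# appended per frame before training.
roi = np.random.rand(32, 32)   # stand-in for a real mouth crop
print(dct_features(roi).shape)  # (44,)

In this scheme the zigzag scan keeps the coefficients that capture coarse mouth shape and appearance while discarding high-frequency detail, which is what makes the DCT a cheap, holistic appearance feature in contrast to the model-based AAM.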
Cite as: Sterpu, G., Harte, N. (2017) Towards Lipreading Sentences with Active Appearance Models. Proc. The 14th International Conference on Auditory-Visual Speech Processing, 70-75, doi: 10.21437/AVSP.2017-14
@inproceedings{sterpu17_avsp,
  author={George Sterpu and Naomi Harte},
  title={{Towards Lipreading Sentences with Active Appearance Models}},
  year=2017,
  booktitle={Proc. The 14th International Conference on Auditory-Visual Speech Processing},
  pages={70--75},
  doi={10.21437/AVSP.2017-14}
}