Abstract
This paper analyzes catastrophic fusion, a problem that arises in multimodal recognition systems that integrate the output of several modules while operating in non-stationary environments. For concreteness we frame the analysis in terms of automatic audio-visual speech recognition (AVSR), but the issues are general and arise in any multimodal recognition system that must work across a wide variety of contexts. Catastrophic fusion occurs when the performance of a multimodal system is inferior to that of some of its isolated modules, e.g., when an audio-visual speech recognizer performs worse than the audio recognizer alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. In practice, when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based on Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic AVSR problem with excellent results.
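The mechanism the abstract describes can be illustrated with a toy version of the Gaussian discrimination task it mentions. The sketch below is not taken from the paper: the class means, the outlier value, and the mixing weight `eps` and background density `bg` are all hypothetical choices used only to show the idea. Naive fusion sums per-channel log-likelihoods, so a channel whose context assumptions are violated (here, a wildly out-of-range visual observation) dominates the fused score. One common robustification, in the spirit of competitive models, mixes each channel's likelihood with a flat "background" model, so an out-of-context observation saturates at the background density and that channel effectively abstains.

```python
import math

def gauss(x, mu, sigma=1.0):
    # Gaussian likelihood of observation x under class mean mu
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def robust_lik(x, mu, eps=0.05, bg=0.01):
    # Mix the module's Gaussian with a flat background model; eps and bg
    # are illustrative values, not the paper's settings
    return (1 - eps) * gauss(x, mu) + eps * bg

# Class means shared by both channels (audio, visual) in this toy setup
means = {"A": 1.0, "B": -1.0}

x_audio = 0.8    # consistent with class A, within the audio module's context
x_visual = -9.0  # outlier: the context violates the visual module's assumptions

def naive_score(c):
    # Standard fusion: sum of per-channel log-likelihoods
    return math.log(gauss(x_audio, means[c])) + math.log(gauss(x_visual, means[c]))

def robust_score(c):
    # Robustified fusion: each channel's likelihood is floored by the background
    return math.log(robust_lik(x_audio, means[c])) + math.log(robust_lik(x_visual, means[c]))

naive = max(means, key=naive_score)    # the mismatched visual channel flips the decision
robust = max(means, key=robust_score)  # the outlier saturates; the audio channel decides
print(naive, robust)
```

With these numbers, naive fusion chooses class B because the visual observation, though far from both class means, is relatively less improbable under B; robust fusion recovers class A because both robustified visual likelihoods collapse to roughly `eps * bg`, leaving the in-context audio channel to decide.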
Cite this article
Movellan, J.R., Mineiro, P. Robust Sensor Fusion: Analysis and Application to Audio Visual Speech Recognition. Machine Learning 32, 85–100 (1998). https://doi.org/10.1023/A:1007468413059