Abstract
This paper analyzes catastrophic fusion, a problem that arises in multimodal recognition systems that integrate the output of several modules while operating in non-stationary environments. For concreteness we frame the analysis in terms of automatic audio-visual speech recognition (AVSR), but the issues are general and arise in any multimodal recognition system that must work across a wide variety of contexts. Catastrophic fusion occurs when the performance of a multimodal system is inferior to that of some of its isolated modules, e.g., when an audio-visual speech recognizer performs worse than the audio recognizer alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. In practice, when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based on Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic AVSR problem with excellent results.
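The mechanism the abstract describes can be illustrated with a toy version of the Gaussian discrimination task it mentions. The sketch below is not taken from the paper: the class means, the outlier value, and the mixing weight `eps` and background density `bg` are all hypothetical choices used only to show the idea. Naive fusion sums per-channel log-likelihoods, so a channel whose context assumptions are violated (here, a wildly out-of-range visual observation) dominates the fused score. One common robustification, in the spirit of competitive models, mixes each channel's likelihood with a flat "background" model, so an out-of-context observation saturates at the background density and that channel effectively abstains.

```python
import math

def gauss(x, mu, sigma=1.0):
    # Gaussian likelihood of observation x under class mean mu
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def robust_lik(x, mu, eps=0.05, bg=0.01):
    # Mix the module's Gaussian with a flat background model; eps and bg
    # are illustrative values, not the paper's settings
    return (1 - eps) * gauss(x, mu) + eps * bg

# Class means shared by both channels (audio, visual) in this toy setup
means = {"A": 1.0, "B": -1.0}

x_audio = 0.8    # consistent with class A, within the audio module's context
x_visual = -9.0  # outlier: the context violates the visual module's assumptions

def naive_score(c):
    # Standard fusion: sum of per-channel log-likelihoods
    return math.log(gauss(x_audio, means[c])) + math.log(gauss(x_visual, means[c]))

def robust_score(c):
    # Robustified fusion: each channel's likelihood is floored by the background
    return math.log(robust_lik(x_audio, means[c])) + math.log(robust_lik(x_visual, means[c]))

naive = max(means, key=naive_score)    # the mismatched visual channel flips the decision
robust = max(means, key=robust_score)  # the outlier saturates; the audio channel decides
print(naive, robust)
```

With these numbers, naive fusion chooses class B because the visual observation, though far from both class means, is relatively less improbable under B; robust fusion recovers class A because both robustified visual likelihoods collapse to roughly `eps * bg`, leaving the in-context audio channel to decide.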
Cite this article
Movellan, J.R., Mineiro, P. Robust Sensor Fusion: Analysis and Application to Audio Visual Speech Recognition. Machine Learning 32, 85–100 (1998). https://doi.org/10.1023/A:1007468413059