Abstract
Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities to achieve robust categorization. In this paper we propose a multimodal approach to object category detection that uses audio and visual information. The auditory channel is modeled with biologically motivated spectral features and a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high-level and the other low-level. Experiments on six different object categories, under increasingly difficult conditions, show the strengths and weaknesses of the two approaches and clearly underline the open challenges for multimodal category detection.
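The distinction between low-level and high-level fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain concatenation of features, and the convex weighted sum with its `audio_weight` parameter are all assumptions chosen for illustration, since the abstract does not specify how either scheme is realized.

```python
from typing import List, Sequence


def low_level_fusion(audio_features: Sequence[float],
                     visual_features: Sequence[float]) -> List[float]:
    """Feature-level fusion (illustrative): concatenate the per-modality
    feature vectors so that a single classifier can consume the result."""
    return list(audio_features) + list(visual_features)


def high_level_fusion(audio_score: float,
                      visual_score: float,
                      audio_weight: float = 0.5) -> float:
    """Decision-level fusion (illustrative): each modality is scored by its
    own classifier, and the two confidence scores are combined afterwards.
    The convex weighted sum is an assumed combination rule."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


if __name__ == "__main__":
    # Hypothetical feature values and classifier scores.
    print(low_level_fusion([0.1, 0.7], [0.3, 0.9, 0.2]))
    print(high_level_fusion(audio_score=0.8, visual_score=0.4))
```

In a sketch like this, low-level fusion lets one classifier see both modalities at once, while high-level fusion keeps each modality's model separate and merges only their decisions.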
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luo, J., Caputo, B., Zweig, A., Bach, JH., Anemüller, J. (2008). Object Category Detection Using Audio-Visual Cues. In: Gasteratos, A., Vincze, M., Tsotsos, J.K. (eds) Computer Vision Systems. ICVS 2008. Lecture Notes in Computer Science, vol 5008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79547-6_52
DOI: https://doi.org/10.1007/978-3-540-79547-6_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79546-9
Online ISBN: 978-3-540-79547-6
eBook Packages: Computer Science, Computer Science (R0)