Abstract
In this paper, we review the major approaches to multimodal human-computer interaction from a computer vision perspective. In particular, we focus on body, gesture, gaze, and affective interaction (facial expression recognition and emotion recognition in audio). We discuss user and task modeling and multimodal fusion, highlighting challenges, open issues, and emerging applications for Multimodal Human-Computer Interaction (MMHCI) research.
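The multimodal fusion the abstract mentions is often done at the decision level: each modality produces its own class probabilities, which are then combined. The sketch below is a minimal illustration of this general idea, not the method of the paper; the modality names, class labels, and weights are illustrative assumptions.

```python
# Hedged sketch of decision-level ("late") multimodal fusion:
# each modality independently outputs class probabilities, and a
# weighted average combines them into a single fused estimate.

def late_fusion(modality_probs, weights):
    """Weighted average of per-modality class-probability lists."""
    num_classes = len(next(iter(modality_probs.values())))
    total_w = sum(weights[m] for m in modality_probs)
    return [
        sum(weights[m] * modality_probs[m][c] for m in modality_probs) / total_w
        for c in range(num_classes)
    ]

# Hypothetical example: facial analysis leans toward "happy",
# while speech prosody is uncertain between "happy" and "neutral".
probs = {
    "face":  [0.7, 0.2, 0.1],   # P(happy, neutral, sad) from vision
    "audio": [0.4, 0.4, 0.2],   # same classes from audio
}
weights = {"face": 0.6, "audio": 0.4}
fused = late_fusion(probs, weights)  # fused distribution over the 3 classes
```

A design note: late fusion is easy to engineer because each modality can fail or be retrained independently, whereas feature-level (early) fusion can capture cross-modal dependencies at the cost of a joint model.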
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Jaimes, A., Sebe, N. (2005). Multimodal Human Computer Interaction: A Survey. In: Sebe, N., Lew, M., Huang, T.S. (eds) Computer Vision in Human-Computer Interaction. HCI 2005. Lecture Notes in Computer Science, vol 3766. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573425_1
DOI: https://doi.org/10.1007/11573425_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29620-1
Online ISBN: 978-3-540-32129-3
eBook Packages: Computer Science (R0)