Abstract
We propose a method for describing human activities in video images based on concept hierarchies of actions. A major difficulty in transforming video images into textual descriptions is bridging the semantic gap between them, a task also known as the inverse Hollywood problem. In general, the concepts of human events or actions can be classified by semantic primitives. By associating these concepts with semantic features extracted from video images, appropriate syntactic components such as verbs and objects are determined and then translated into natural language sentences. We also demonstrate the performance of the proposed method through several experiments.
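The pipeline sketched in the abstract — extract semantic features, select the matching action concept from a hierarchy, fill syntactic slots, and emit a sentence — can be illustrated with a toy example. This is not the authors' implementation; the concept names, features, and case-frame logic below are all invented for illustration.

```python
# Toy sketch (not the paper's system): map semantic features
# extracted from video to the most specific action concept in a
# small hand-made hierarchy, then fill a minimal case frame
# (agent, verb, object) to produce a sentence.

# Each concept lists the semantic features it requires; a more
# specific concept requires more features than its parent.
CONCEPT_HIERARCHY = [
    ("move",     {"agent_moving"}),
    ("walk",     {"agent_moving", "upright_posture"}),
    ("approach", {"agent_moving", "distance_to_object_decreasing"}),
    ("pick up",  {"hand_near_object", "object_moving_with_hand"}),
]

def select_concept(features):
    """Return the most specific concept whose required features
    are all present in the observed feature set."""
    candidates = [(verb, req) for verb, req in CONCEPT_HIERARCHY
                  if req <= features]
    if not candidates:
        return None
    # Most specific = largest set of satisfied required features.
    return max(candidates, key=lambda c: len(c[1]))[0]

def describe(agent, obj, features):
    """Fill a minimal case frame and realize it as a sentence."""
    verb = select_concept(features)
    if verb is None:
        return f"{agent} is present."
    # Transitive verbs take the object slot; intransitive ones do not.
    phrase = {"pick up": f"picks up the {obj}",
              "approach": f"approaches the {obj}"}.get(verb, verb + "s")
    return f"{agent} {phrase}."

print(describe("The person", "cup",
               {"hand_near_object", "object_moving_with_hand"}))
# → The person picks up the cup.
```

In the same spirit as the paper's approach, specificity is decided by how many semantic features a concept explains: "walk" is preferred over "move" when posture evidence is available.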
Kojima, A., Tamura, T. & Fukunaga, K. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions. International Journal of Computer Vision 50, 171–184 (2002). https://doi.org/10.1023/A:1020346032608