
Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions

  • Published in: International Journal of Computer Vision (2002)

Abstract

We propose a method for describing human activities in video images based on a concept hierarchy of actions. The major difficulty in transforming video images into textual descriptions is bridging the semantic gap between them, a task also known as the inverse Hollywood problem. In general, the concepts of human events or actions can be classified by semantic primitives. By associating these concepts with the semantic features extracted from video images, appropriate syntactic components such as the verb and objects are determined and then translated into natural language sentences. We also demonstrate the performance of the proposed method through several experiments.
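The pipeline the abstract describes can be illustrated with a minimal sketch: select the most specific action concept whose semantic primitives match the features extracted from video, fill a case frame (agent, verb, object), and render a sentence. This is not the authors' implementation; all concept names, primitives, and templates below are hypothetical stand-ins.

```python
# Hedged sketch (assumed names throughout, not the paper's actual system):
# a tiny concept hierarchy of actions, where each concept lists the semantic
# primitives that must hold for it to apply. More specific concepts (more
# primitives) are preferred when several match.
CONCEPT_HIERARCHY = {
    "move": {"primitives": {"agent_moving"}, "verb": "moves"},
    "walk": {"primitives": {"agent_moving", "on_foot"}, "verb": "walks"},
    "approach": {"primitives": {"agent_moving", "distance_decreasing"},
                 "verb": "approaches"},
}

def select_concept(features):
    """Pick the most specific concept whose primitives are all present."""
    candidates = [(len(c["primitives"]), name, c)
                  for name, c in CONCEPT_HIERARCHY.items()
                  if c["primitives"] <= features]
    if not candidates:
        return None
    _, _, best = max(candidates)  # ties broken by concept name
    return best

def generate_sentence(agent, features, target=None):
    """Fill a case frame (agent, verb, optional object) and render it."""
    concept = select_concept(features)
    if concept is None:
        return f"{agent} is present."
    sentence = f"{agent} {concept['verb']}"
    if target:
        sentence += f" {target}"
    return sentence + "."

print(generate_sentence("The person",
                        {"agent_moving", "distance_decreasing"},
                        target="the desk"))
# → The person approaches the desk.
```

The key design point mirrored here is that sentence generation is driven by concept selection in the hierarchy, not by template matching alone: the same case frame yields different verbs depending on which semantic primitives the vision stage detects.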




Cite this article

Kojima, A., Tamura, T. & Fukunaga, K. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions. International Journal of Computer Vision 50, 171–184 (2002). https://doi.org/10.1023/A:1020346032608
