Coherent Multi-sentence Video Description with Variable Level of Detail

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8753)

Abstract

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions’ level of detail. In this paper, we address both of these limitations: for a variable level of detail, we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.
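
To make the abstract's two-step approach concrete, here is a minimal Python sketch, not the authors' implementation: it assumes the per-segment semantic representations are already predicted, picks one topic for the whole video by summing per-segment topic scores (a stand-in for the paper's consistent-topic modeling at the SR level), and fills sentence templates instead of using a learned language-generation model. The SegmentSR fields, the topic-voting rule, and the template generator are all hypothetical.

```python
# Illustrative sketch of the two-step pipeline from the abstract:
# (1) take a semantic representation (SR) per video segment, then
# (2) enforce a consistent topic across segments before generating text.
# Not the authors' implementation; all names here are hypothetical.

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class SegmentSR:
    activity: str               # predicted verb, e.g. "cuts"
    obj: str                    # predicted object, e.g. "cucumber"
    topic_scores: dict = field(default_factory=dict)  # per-segment topic posterior

def consistent_topic(segments):
    """Pick one topic for the whole video by summing per-segment scores."""
    totals = Counter()
    for seg in segments:
        for topic, score in seg.topic_scores.items():
            totals[topic] += score
    return totals.most_common(1)[0][0]

def generate(segments):
    """Template-based stand-in for the learned language generation."""
    topic = consistent_topic(segments)
    sentences = [f"The person {s.activity} the {s.obj}." for s in segments]
    return f"The person prepares {topic}. " + " ".join(sentences)

segments = [
    SegmentSR("washes", "cucumber", {"cucumber salad": 0.6, "soup": 0.4}),
    SegmentSR("cuts", "cucumber", {"cucumber salad": 0.7, "soup": 0.3}),
]
print(generate(segments))
```

The point of consistent_topic is the across-sentence consistency the abstract mentions: once a single topic is fixed for the whole video, the generated sentences cannot drift between, say, a salad and a soup from one sentence to the next.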

Notes

  1. Details can be found in [17].

  2. www.nist.gov/tac/2011/Summarization/Guided-Summ.2011.guidelines.html

  3. The BLEU score per description is much higher than per sentence because the n-grams can be matched anywhere in the full description (see the sketch after these notes).

  4. The BLEU score for the human descriptions is not fully comparable, since one reference fewer is available, which typically has a strong effect on the BLEU score.
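
The toy example below illustrates note 3. It assumes NLTK as the BLEU scorer (the paper does not name its tooling) and the sentences are invented. Scoring the concatenated multi-sentence description lets every candidate n-gram match anywhere in the full reference, so the per-description score comes out well above the average per-sentence score.

```python
# Toy illustration of note 3, assuming NLTK as the BLEU scorer
# (the paper does not name its tooling; sentences are invented).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid zero scores for short sentences

ref_sents = [
    "the person washes the cucumber".split(),
    "the person cuts the cucumber into slices".split(),
]
cand_sents = [
    "the person cuts the cucumber".split(),    # content of reference sentence 2
    "the person washes the cucumber".split(),  # content of reference sentence 1
]

# Per sentence: each candidate sentence is scored only against the
# reference sentence at the same position.
per_sentence = [
    sentence_bleu([ref], cand, smoothing_function=smooth)
    for ref, cand in zip(ref_sents, cand_sents)
]

# Per description: concatenate both sides and score once, so candidate
# n-grams may match anywhere in the full reference description.
ref_full = [tok for sent in ref_sents for tok in sent]
cand_full = [tok for sent in cand_sents for tok in sent]
per_description = sentence_bleu([ref_full], cand_full, smoothing_function=smooth)

print(f"mean per-sentence BLEU: {sum(per_sentence) / len(per_sentence):.3f}")
print(f"per-description BLEU:  {per_description:.3f}")
```

With this data both candidate sentences occur in the reference description, just in a different order, so the per-description score should print several times larger than the per-sentence mean.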

References

  1. Das, P., Xu, C., Doell, R.F., Corso, J.: Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)

  2. Dyer, C., Muresan, S., Resnik, P.: Generalizing word lattice translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2008)

  3. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010)

  4. Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., Saenko, K.: YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013)

  5. Gupta, A., Srinivasan, P., Shi, J.B., Davis, L.: Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)

  6. Khan, M.U.G., Zhang, L., Gotoh, Y.: Human focused video description. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (2011)

  7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (demo) (2007)

  8. Kojima, A., Tamura, T., Fukunaga, K.: Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. (IJCV) 50, 171–184 (2002)

  9. Krishnamoorthy, N., Malkarnenkar, G., Mooney, R.J., Saenko, K., Guadarrama, S.: Generating natural-language video descriptions using text-mined knowledge. In: AAAI Conference on Artificial Intelligence (AAAI) (2013)

  10. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: Understanding and generating simple image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)

  11. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2012)

  12. Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A.C., Berg, T.L., Daumé III, H.: Midge: Generating image descriptions from computer vision detections. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2012)

  13. Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. (TACL) 1, 25–36 (2013)

  14. Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: IEEE International Conference on Computer Vision (ICCV) (2013)

  15. Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: Script data for attribute-based recognition of composite activities. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 144–157. Springer, Heidelberg (2012)

  16. Schmidt, M.: UGM: Matlab code for undirected graphical models (2013). http://www.di.ens.fr/~mschmidt/Software/UGM.html

  17. Senina, A., Rohrbach, M., Qiu, W., Friedrich, A., Amin, S., Andriluka, M., Pinkal, M., Schiele, B.: Coherent multi-sentence video description with variable level of detail. arXiv:1403.6173 (2014)

  18. Siddharth, N., Barbu, A., Siskind, J.M.: Seeing what you're told: Sentence-guided activity recognition in video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)

  19. Tan, C.C., Jiang, Y.G., Ngo, C.W.: Towards textually describing complex video contents with audio-visual concept classifiers. In: ACM Multimedia (2011)

  20. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008). http://www.vlfeat.org/

  21. Wang, H., Kläser, A., Schmid, C., Liu, C.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. (IJCV) 103, 60–79 (2013)

  22. Yu, H., Siskind, J.M.: Grounded language learning from videos described with sentences. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2013)

  23. Zukerman, I., Litman, D.: Natural language processing and user modeling: Synergies and limitations. User Model. User-Adap. Inter. 11, 129–158 (2001)

Acknowledgments

Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).

Author information

Correspondence to Anna Rohrbach.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B. (2014). Coherent Multi-sentence Video Description with Variable Level of Detail. In: Jiang, X., Hornegger, J., Koch, R. (eds) Pattern Recognition. GCPR 2014. Lecture Notes in Computer Science, vol 8753. Springer, Cham. https://doi.org/10.1007/978-3-319-11752-2_15

  • DOI: https://doi.org/10.1007/978-3-319-11752-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11751-5

  • Online ISBN: 978-3-319-11752-2

  • eBook Packages: Computer Science, Computer Science (R0)
