Abstract
Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions' level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus of three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than related work.
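The topic-consistency idea from the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual model: it assumes each video segment comes with per-topic scores and candidate SRs (all names and numbers below are invented), and simply picks the single topic that maximizes the summed per-segment scores before decoding every segment under that topic.

```python
def consistent_topic_decode(segment_scores):
    """Enforce one topic across all segments of a video.

    segment_scores: list of dicts mapping topic -> (score, SR),
    one dict per video segment.
    Returns the chosen topic and the per-segment SRs decoded under it.
    """
    # Only topics available in every segment are candidates.
    topics = set.intersection(*(set(s) for s in segment_scores))
    # Pick the topic with the highest total score over all segments.
    best_topic = max(topics, key=lambda t: sum(s[t][0] for s in segment_scores))
    return best_topic, [s[best_topic][1] for s in segment_scores]

# Toy example: two segments of a cooking video, two candidate topics.
segments = [
    {"salad": (0.9, "wash cucumber"), "soup": (0.4, "wash potato")},
    {"salad": (0.5, "slice cucumber"), "soup": (0.8, "peel potato")},
]
topic, srs = consistent_topic_decode(segments)
print(topic, srs)
```

Even though the second segment locally prefers "soup", the globally best topic is "salad" (0.9 + 0.5 > 0.4 + 0.8), so both segments are decoded consistently under it.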
Notes
1. Details can be found in [17].
2.
3. The BLEU score per description is much higher than per sentence because the n-grams can be matched anywhere in the full descriptions.
4. The BLEU score for the human description is not fully comparable because one fewer reference is available, which typically has a strong effect on the BLEU score.
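The effect described in notes 3 and 4 can be illustrated with a toy BLEU computation. This is a simplified, single-reference variant (unigrams and bigrams, no smoothing) and the example sentences are invented; it shows why concatenating sentences before scoring inflates BLEU: n-grams that appear in the wrong sentence still find a match somewhere in the full description.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty; single reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, r_ngr[g]) for g, c in c_ngr.items())
        precisions.append(overlap / max(1, sum(c_ngr.values())))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# The candidate gets the two sentences in the wrong order.
cand_sents = ["the person washes the carrot", "the person cuts the carrot"]
ref_sents  = ["the person cuts the carrot", "the person washes the carrot"]

# Per-sentence scoring: each sentence only matches its aligned reference.
per_sentence = sum(bleu(c, r) for c, r in zip(cand_sents, ref_sents)) / 2

# Per-description scoring: concatenated texts share all their n-grams.
per_description = bleu(" ".join(cand_sents), " ".join(ref_sents))

print(per_sentence, per_description)
```

Here the per-description score reaches 1.0 while the per-sentence score stays well below it, mirroring the gap noted above.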
References
Das, P., Xu, C., Doell, R.F., Corso, J.: Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
Dyer, C., Muresan, S., Resnik, P.: Generalizing word lattice translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2008)
Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010)
Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2013)
Gupta, A., Srinivasan, P., Shi, J.B., Davis, L.: Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Khan, M.U.G., Zhang, L., Gotoh, Y.: Human focused video description. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (2011)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (demo) (2007)
Kojima, A., Tamura, T., Fukunaga, K.: Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. (IJCV) 50, 171–184 (2002)
Krishnamoorthy, N., Malkarnenkar, G., Mooney, R.J., Saenko, K., Guadarrama, S.: Generating natural-language video descriptions using text-mined knowledge. In: AAAI Conference on Artificial Intelligence (AAAI) (2013)
Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: Understanding and generating simple image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2012)
Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A.C., Berg, T.L., Daumé III, H.: Midge: Generating image descriptions from computer vision detections. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2012)
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. (TACL) 1, 25–36 (2013)
Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: IEEE International Conference on Computer Vision (ICCV) (2013)
Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: Script data for attribute-based recognition of composite activities. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 144–157. Springer, Heidelberg (2012)
Schmidt, M.: UGM: Matlab code for undirected graphical models (2013). http://www.di.ens.fr/~mschmidt/Software/UGM.html
Senina, A., Rohrbach, M., Qiu, W., Friedrich, A., Amin, S., Andriluka, M., Pinkal, M., Schiele, B.: Coherent multi-sentence video description with variable level of detail. arXiv:1403.6173 (2014)
Siddharth, N., Barbu, A., Siskind, J.M.: Seeing what you're told: Sentence-guided activity recognition in video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
Tan, C.C., Jiang, Y.G., Ngo, C.W.: Towards textually describing complex video contents with audio-visual concept classifiers. In: ACM Multimedia (2011)
Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008). http://www.vlfeat.org/
Wang, H., Kläser, A., Schmid, C., Liu, C.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. (IJCV) 103, 60–79 (2013)
Yu, H., Siskind, J.M.: Grounded language learning from videos described with sentences. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2013)
Zukerman, I., Litman, D.: Natural language processing and user modeling: Synergies and limitations. User Model. User-Adap. Inter. 11, 129–158 (2001)
Acknowledgments
Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B. (2014). Coherent Multi-sentence Video Description with Variable Level of Detail. In: Jiang, X., Hornegger, J., Koch, R. (eds) Pattern Recognition. GCPR 2014. Lecture Notes in Computer Science(), vol 8753. Springer, Cham. https://doi.org/10.1007/978-3-319-11752-2_15
Print ISBN: 978-3-319-11751-5
Online ISBN: 978-3-319-11752-2