
Video Question Answering with Spatio-Temporal Reasoning


Abstract

Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among the many tasks in this line of research, visual question answering (VQA) has been one of the most successful, where the goal is to learn a model that understands visual content at region-level detail and finds its associations with question-answer pairs in natural language. Despite the rapid progress of the past few years, most existing work in VQA has focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention, and show its effectiveness over conventional VQA techniques through empirical evaluations.
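To make the model description concrete, below is a minimal sketch of a dual-LSTM video QA model with question-conditioned temporal attention, written in PyTorch. It is not the authors' released implementation: the spatial attention branch is omitted, per-frame visual features are assumed to be pre-extracted, and all class names, layer sizes, and the answer-classification head are illustrative assumptions.

# Minimal sketch (not the authors' code): dual-LSTM video QA with
# question-conditioned temporal attention over pre-extracted frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMVideoQA(nn.Module):
    def __init__(self, vocab_size, word_dim=300, feat_dim=2048,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # One LSTM encodes the video frame features, the other the question.
        self.video_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        # Additive (Bahdanau-style) temporal attention over video time steps.
        self.att_v = nn.Linear(hidden_dim, hidden_dim)
        self.att_q = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, frame_feats, question_tokens):
        # frame_feats: (B, T, feat_dim); question_tokens: (B, L) word indices
        v_states, _ = self.video_lstm(frame_feats)              # (B, T, H)
        _, (q_hidden, _) = self.text_lstm(self.embed(question_tokens))
        q_hidden = q_hidden[-1]                                 # (B, H)
        # Score each video time step against the question representation.
        scores = self.att_out(torch.tanh(
            self.att_v(v_states) + self.att_q(q_hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)                        # (B, T, 1)
        v_context = (alpha * v_states).sum(dim=1)               # (B, H)
        return self.classifier(torch.cat([v_context, q_hidden], dim=1))

For the multiple-choice tasks (Repeating Action, State Transition) such a backbone would typically score each candidate answer appended to the question rather than classify over a fixed vocabulary; the open-ended FrameQA and counting tasks map more directly onto the classification head sketched here.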


Notes

  1. https://github.com/renmengye/imageqa-qgen.

  2. https://code.google.com/archive/p/flashfox/.

  3. http://developer.wordnik.com/.

  4. http://thesaurus.altervista.org/service.

  5. https://www.wordsapi.com/.


Acknowledgements

This work was supported by IITP grants (No. 2019-0-01082, SW StarLab; No. 2017-0-01772, Video Turing Test), the Brain Research Program through the NRF (2017M3C7A1047860) funded by the Korea government (MSIT), and the Academic Research Program of Yahoo Research. Gunhee Kim is the corresponding author.

Author information


Corresponding author

Correspondence to Gunhee Kim.

Additional information

Communicated by Christoph H. Lampert.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Table 7 shows the statistics of unique questions (Q), answers (A) and words (W) for each task. Nearly half of the question sentences in each task are unique, except for the Repeating Action task; this is because the templates used for question generation allow variation only in the subject of the question sentences. However, since each question consists of a (query sentence, video) pair rather than a question sentence alone, and no video is shared across questions, all questions in our dataset are effectively unique. Note that the number of unique answers for the Repetition Count task is 10 because it allows only a limited set of answers (i.e., 0, or 2 to 10+).
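As a side note, uniqueness statistics of this kind can be reproduced with a short script. The sketch below assumes one tab-separated annotation file per task with "question" and "answer" columns; the file and column names are hypothetical and may differ from the released dataset.

# Sketch for counting unique questions (Q), answers (A), and words (W) per task,
# assuming hypothetical TSV files with "question" and "answer" columns.
import csv

def unique_stats(tsv_path):
    questions, answers, words = set(), set(), set()
    with open(tsv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            q = row['question'].strip().lower()
            a = row['answer'].strip().lower()
            questions.add(q)
            answers.add(a)
            words.update(q.split())
            words.update(a.split())
    return len(questions), len(answers), len(words)

for task in ('count', 'action', 'transition', 'frameqa'):  # placeholder task names
    n_q, n_a, n_w = unique_stats(f'train_{task}_question.tsv')
    print(f'{task}: unique Q={n_q}, A={n_a}, W={n_w}')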

Figures 16, 17, 18 and 19 show screenshots of the instructions and tasks for the Repetition and State Transition tasks. These are the actual interfaces presented to Amazon Mechanical Turk workers.

Fig. 16 Instructions for the Repetition task

Fig. 17 The main task page for the Repetition task

Fig. 18 Instructions for the State Transition task

Fig. 19 The main task page for the State Transition task

Fig. 20 The question/answer word distribution of each task in TGIF-QA. We select verbs for video questions and nouns for frame questions. The inner and outer labels for each task indicate (verbs, answers) for the Repetition Count, (repetition counts, answers) for the Repeating Action, (verbs, answers) for the State Transition, and (nouns, answers) for the FrameQA task

In Fig. 20, multi-level pie charts display the question/answer word distribution of each task in TGIF-QA. The charts show that each task includes a diverse set of words, making it hard for models to take advantage of any bias in the data.
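A distribution along the lines of Fig. 20 can be approximated with an off-the-shelf POS tagger. The sketch below assumes NLTK as a dependency and an in-memory list of (question, answer) string pairs per task; it keeps verbs from video-question sentences and nouns from FrameQA sentences, and tallies the answers separately.

# Rough sketch of the per-task question/answer word tally behind a Fig. 20-style
# plot, assuming NLTK and a list of (question, answer) pairs for one task.
import nltk
from collections import Counter

nltk.download('punkt', quiet=True)                       # tokenizer model
nltk.download('averaged_perceptron_tagger', quiet=True)  # POS tagger model

def word_distribution(qa_pairs, pos_prefix):
    """Tally question words whose POS tag starts with pos_prefix ('VB' or 'NN'),
    plus the raw answer strings."""
    q_words, answers = Counter(), Counter()
    for question, answer in qa_pairs:
        tagged = nltk.pos_tag(nltk.word_tokenize(question.lower()))
        q_words.update(w for w, t in tagged if t.startswith(pos_prefix))
        answers[answer.lower()] += 1
    return q_words, answers

# Verbs for the video tasks, nouns for FrameQA, as in Fig. 20 (variable names
# such as transition_pairs and frameqa_pairs are placeholders):
# verb_dist, ans_dist = word_distribution(transition_pairs, 'VB')
# noun_dist, ans_dist = word_distribution(frameqa_pairs, 'NN')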


Cite this article

Jang, Y., Song, Y., Kim, C.D. et al. Video Question Answering with Spatio-Temporal Reasoning. Int J Comput Vis 127, 1385–1412 (2019). https://doi.org/10.1007/s11263-019-01189-x
