Abstract
Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA has focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention and show its effectiveness over conventional VQA techniques through empirical evaluations.
Acknowledgements
This work was supported by IITP Grant (No. 2019-0-01082, SW StarLab) (No. 2017-0-01772, Video Turing Test), the Brain Research Program through the NRF (2017M3C7A1047860) funded by the Korea government (MSIT), and the Academic Research Program in Yahoo Research. Gunhee Kim is the corresponding author.
Additional information
Communicated by Christoph H. Lampert.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Table 7 shows the statistics of unique questions (Q), answers (A) and words (W) for each task. Nearly half of the question sentences in each task are unique, except for the Repeating Action task. This is because the templates used for question generation allow variation only in the subject of the question sentence. However, since each question consists of a (query sentence, video) pair rather than a question sentence alone, and no identical video is used for different questions, all questions in our dataset are effectively unique. Note that the number of unique answers for the Repetition Action task is 10 because the task allows only a limited set of answers (i.e. 0 or 2 to 10+).
Figures 16, 17, 18 and 19 show screenshots of the instructions and tasks for the Repetition and State Transition tasks. They are the actual interfaces presented to the workers on Amazon Mechanical Turk.
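As a rough illustration of how such uniqueness statistics could be computed, the following minimal sketch counts unique question sentences, answers and words, and also unique (question, video) pairs. The record layout, field names, tokenization and toy examples are all assumptions made for illustration; they are not the dataset's actual format or the authors' counting procedure.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are assumptions for illustration.
QAPair = namedtuple("QAPair", ["video_id", "question", "answer"])


def unique_stats(pairs):
    """Count unique question sentences (Q), answers (A), words (W),
    and unique (question, video) pairs."""
    questions = {p.question.lower() for p in pairs}
    answers = {p.answer.lower() for p in pairs}
    words = set()
    for p in pairs:
        # Naive whitespace tokenization; a real pipeline may tokenize differently.
        words.update(p.question.lower().rstrip("?").split())
        words.update(p.answer.lower().split())
    question_video = {(p.question.lower(), p.video_id) for p in pairs}
    return {
        "Q": len(questions),
        "A": len(answers),
        "W": len(words),
        "Q+video": len(question_video),
    }


# Toy, made-up examples (not taken from TGIF-QA).
toy = [
    QAPair("gif_001", "How many times does the man wave?", "3"),
    QAPair("gif_002", "How many times does the man wave?", "10+"),
    QAPair("gif_003", "How many times does the cat jump?", "3"),
]

stats = unique_stats(toy)
print(stats)
```

In this toy input, two records share the same question sentence, so only two question sentences are unique, yet all three (question, video) pairs are distinct. This mirrors the point made above about template-generated questions paired with different videos.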
In Fig. 20, multi-pie graphs display the question/answer word distribution of each task in TGIF-QA. The graphs show that each task includes a diverse set of words and thus it is hard for models to take advantage of any bias in data.
Cite this article
Jang, Y., Song, Y., Kim, C.D. et al. Video Question Answering with Spatio-Temporal Reasoning. Int J Comput Vis 127, 1385–1412 (2019). https://doi.org/10.1007/s11263-019-01189-x
DOI: https://doi.org/10.1007/s11263-019-01189-x