DOI: 10.1145/3219819.3220036

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

Published: 19 July 2018

ABSTRACT

Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features and learning their joint embedding via multimodal fusion or attention mechanisms. Some recent studies use external, VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be unrelated to the question at hand and carry limited semantic information. To better exploit the semantic knowledge in images, we propose a novel framework that learns visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset on top of the Visual Genome dataset using a semantic similarity module, in which each instance consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A well-defined relation detector is then adopted to predict question-related visual relation facts. We further propose a multi-step attention model that applies visual attention and semantic attention sequentially to extract related visual knowledge and semantic knowledge. We conduct comprehensive experiments on two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
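
The following is a minimal, illustrative sketch (not the authors' released code) of the sequential attention idea described in the abstract: question-guided visual attention over image region features, followed by semantic attention over embeddings of candidate relation facts. The framework, module names, and feature dimensions here are assumptions made for illustration only.

```python
# Hypothetical sketch of sequential visual-then-semantic attention (PyTorch).
# Dimensions and layer names are illustrative assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequentialAttention(nn.Module):
    def __init__(self, q_dim=1024, v_dim=2048, f_dim=300, hid=512):
        super().__init__()
        # Visual attention: score each image region against the question.
        self.v_proj = nn.Linear(v_dim, hid)
        self.q_proj_v = nn.Linear(q_dim, hid)
        self.v_score = nn.Linear(hid, 1)
        # Semantic attention: score each relation-fact embedding against
        # the question fused with the attended visual feature.
        self.f_proj = nn.Linear(f_dim, hid)
        self.ctx_proj = nn.Linear(q_dim + v_dim, hid)
        self.f_score = nn.Linear(hid, 1)

    def forward(self, q, regions, facts):
        # q:       (B, q_dim)      question embedding
        # regions: (B, R, v_dim)   image region features
        # facts:   (B, K, f_dim)   candidate relation-fact embeddings
        a_v = self.v_score(torch.tanh(self.v_proj(regions) +
                                      self.q_proj_v(q).unsqueeze(1)))
        a_v = F.softmax(a_v, dim=1)                 # attention over R regions
        v_att = (a_v * regions).sum(dim=1)          # (B, v_dim)

        ctx = torch.cat([q, v_att], dim=1)          # visual context guides step 2
        a_f = self.f_score(torch.tanh(self.f_proj(facts) +
                                      self.ctx_proj(ctx).unsqueeze(1)))
        a_f = F.softmax(a_f, dim=1)                 # attention over K facts
        f_att = (a_f * facts).sum(dim=1)            # (B, f_dim)
        return v_att, f_att                         # would feed an answer classifier


# Toy usage with random tensors.
model = SequentialAttention()
q = torch.randn(2, 1024)
regions = torch.randn(2, 36, 2048)
facts = torch.randn(2, 10, 300)
v_att, f_att = model(q, regions, facts)
print(v_att.shape, f_att.shape)  # torch.Size([2, 2048]) torch.Size([2, 300])
```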


Supplemental Material

lu_visual_question_answering.mp4 (MP4, 307.2 MB)


Published in

        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
ISBN: 978-1-4503-5552-0
DOI: 10.1145/3219819

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 July 2018


        Qualifiers

        • research-article

        Acceptance Rates

KDD '18 paper acceptance rate: 107 of 983 submissions, 11%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
