ABSTRACT
Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both visual and textual modalities. Existing methods mainly extract image and question features and learn their joint embedding via multimodal fusion or attention mechanisms. Some recent studies employ external, VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be irrelevant to the question at hand and carry limited semantic information. To better exploit the semantic knowledge in images, we propose a novel framework that learns visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset on top of the Visual Genome dataset via a semantic similarity module, in which each instance consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A relation detector is then trained to predict relation facts relevant to a given visual question. We further propose a multi-step attention model that applies visual attention and semantic attention sequentially to extract the related visual and semantic knowledge. Comprehensive experiments on two benchmark datasets demonstrate that our model achieves state-of-the-art performance and verify the benefit of incorporating visual relation facts.
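To make the sequential attention described above concrete, here is a minimal PyTorch-style sketch: visual attention over image regions guided by the question, followed by semantic attention over detected relation-fact embeddings guided by the question plus the attended image feature. All dimensions, layer choices, and the concatenate-and-classify fusion at the end are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of sequential visual + semantic attention for VQA.
# Shapes, hidden sizes, and the final fusion are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialAttention(nn.Module):
    """Visual attention over image regions, then semantic attention over
    relation-fact embeddings, each guided by an evolving query vector."""

    def __init__(self, q_dim=512, v_dim=2048, r_dim=512, hid=512, n_answers=3000):
        super().__init__()
        # Step 1: visual attention scores each region against the question.
        self.v_proj = nn.Linear(v_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.v_score = nn.Linear(hid, 1)
        # Step 2: semantic attention scores each candidate relation fact
        # against the question concatenated with the attended image feature.
        self.r_proj = nn.Linear(r_dim, hid)
        self.q2_proj = nn.Linear(q_dim + v_dim, hid)
        self.r_score = nn.Linear(hid, 1)
        # Assumed late fusion: concatenate all sources, classify over answers.
        self.classifier = nn.Linear(q_dim + v_dim + r_dim, n_answers)

    def forward(self, q, v_regions, r_facts):
        # q: (B, q_dim) question embedding (e.g., from an RNN encoder)
        # v_regions: (B, K, v_dim) region features (e.g., from a CNN)
        # r_facts: (B, M, r_dim) embeddings of detected relation facts
        # --- visual attention ---
        joint = torch.tanh(self.v_proj(v_regions) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.v_score(joint).squeeze(-1), dim=1)   # (B, K)
        v_att = (alpha.unsqueeze(-1) * v_regions).sum(dim=1)        # (B, v_dim)
        # --- semantic attention, guided by question + attended image ---
        query = torch.cat([q, v_att], dim=1)
        joint2 = torch.tanh(self.r_proj(r_facts) + self.q2_proj(query).unsqueeze(1))
        beta = F.softmax(self.r_score(joint2).squeeze(-1), dim=1)   # (B, M)
        r_att = (beta.unsqueeze(-1) * r_facts).sum(dim=1)           # (B, r_dim)
        # Fuse question, visual knowledge, and semantic knowledge; classify.
        return self.classifier(torch.cat([q, v_att, r_att], dim=1))
```

The key design point the sketch reflects is that the semantic step is conditioned on the output of the visual step: relation facts are scored against both the question and the attended image feature, so the two attention stages operate in sequence rather than in parallel.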