ABSTRACT
Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and influences real-world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images into the input space of a language model using an object-aware transformer, followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to the language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.
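The auxiliary-sentence idea can be sketched as a sentence-pair construction: the tweet is augmented with the generated image translation (a caption-like description), and a second question-style sentence asks about the target, so an unmodified text-only language model receives the multimodal signal as plain text. The templates and function names below are illustrative assumptions, not the paper's exact prompts:

```python
def build_auxiliary_sentence(tweet: str, target: str, caption: str) -> tuple:
    """Sketch of sentence-pair construction for a BERT-style classifier.

    The first sentence carries the tweet plus the image translation;
    the second is an auxiliary question about the target entity.
    Templates here are hypothetical placeholders.
    """
    # Append the image translation to the tweet text.
    first_sentence = f"{tweet} The image shows {caption}."
    # Auxiliary sentence framing sentiment classification as a question.
    second_sentence = f"What is the sentiment polarity toward {target}?"
    return first_sentence, second_sentence


# Example usage with a made-up tweet, target, and generated caption.
pair = build_auxiliary_sentence(
    "Congrats on the big launch!",
    "SpaceX",
    "a rocket lifting off from a launch pad",
)
# pair[0] → "Congrats on the big launch! The image shows a rocket lifting off from a launch pad."
# pair[1] → "What is the sentiment polarity toward SpaceX?"
```

The two strings would then be fed to a sequence-pair classifier (e.g. a BERT model with a `[CLS] first [SEP] second [SEP]` encoding), which is why no architectural changes to the language model are needed.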
Exploiting BERT for Multimodal Target Sentiment Classification through Input Space Translation