DOI: 10.1145/3474085.3475692
research-article

Exploiting BERT for Multimodal Target Sentiment Classification through Input Space Translation

Published: 17 October 2021

ABSTRACT

Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real-world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.
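To make the pipeline described in the abstract concrete, below is a minimal sketch (in Python, using the Hugging Face transformers library) of how text translated from an image could be packed into an auxiliary sentence and passed to an unmodified BERT classifier as an ordinary sentence pair. This is not the authors' released implementation (see the linked repository for that); the helper name build_auxiliary_sentence, the example caption, and the object tags are hypothetical placeholders standing in for the output of the object-aware translation stage.

```python
# A minimal sketch of the auxiliary-sentence idea described above, NOT the
# authors' released implementation. The helper build_auxiliary_sentence, the
# example caption, and the object tags are illustrative assumptions standing
# in for the output of the paper's image-to-text translation stage.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # negative / neutral / positive
)

def build_auxiliary_sentence(target: str, caption: str, tags: list[str]) -> str:
    """Fold the target entity and the image translation into a second sentence."""
    return f"{target} - {caption} " + " ".join(tags)

tweet = "Great showing from the home team tonight!"
aux = build_auxiliary_sentence(
    target="home team",
    caption="a group of players celebrating on a field",  # hypothetical caption
    tags=["person", "sports ball"],                        # hypothetical detector tags
)

# BERT consumes the tweet and the auxiliary sentence as an ordinary sentence
# pair, so the language model itself needs no multimodal modifications.
inputs = tokenizer(tweet, aux, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted sentiment class for the target entity
```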


Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

