Research article · DOI: 10.1145/3240508.3240633

A Unified Framework for Multimodal Domain Adaptation

Published: 15 October 2018

ABSTRACT

Domain adaptation aims to train a model on labeled data from a source domain while minimizing test error on a target domain. Most existing domain adaptation methods focus only on reducing the domain shift of single-modal data. In this paper, we consider the new problem of multimodal domain adaptation and propose a unified framework to solve it. The proposed multimodal domain adaptation neural network (MDANN) consists of three modules: (1) a covariant multimodal attention module that learns a common feature representation for multiple modalities; (2) a fusion module that adaptively fuses the attended features of the different modalities; and (3) hybrid domain constraints that comprehensively learn domain-invariant features by constraining the single-modal features, the fused features, and the attention scores. By jointly attending and fusing under an adversarial objective, the network adaptively combines the most discriminative and domain-adaptive parts of the features. Extensive experiments on two real-world cross-domain applications (emotion recognition and cross-media retrieval) demonstrate the effectiveness of the proposed method.
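To make the architecture sketched in the abstract concrete, below is a minimal PyTorch illustration of a domain-adversarial multimodal network: two modality encoders, softmax attention over the modalities, attention-weighted fusion, and gradient-reversal domain discriminators on both the single-modal and the fused features. Every name, layer size, and the specific use of gradient reversal (in the style of Ganin and Lempitsky's DANN) is an illustrative assumption here, not the authors' actual MDANN implementation.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the
    backward pass, so the encoders learn to confuse the domain discriminators."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class MultimodalDANN(nn.Module):
    # dim_a/dim_v are hypothetical input feature sizes for the two modalities.
    def __init__(self, dim_a=128, dim_v=128, dim_h=64, n_classes=4):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, dim_h), nn.ReLU())  # e.g. audio
        self.enc_v = nn.Sequential(nn.Linear(dim_v, dim_h), nn.ReLU())  # e.g. visual
        self.attn = nn.Linear(2 * dim_h, 2)            # attention score per modality
        self.classifier = nn.Linear(dim_h, n_classes)  # task head (source labels)
        # Domain discriminators on single-modal and fused features, echoing the
        # "hybrid domain constraints" described in the abstract.
        self.dom_a = nn.Linear(dim_h, 2)
        self.dom_v = nn.Linear(dim_h, 2)
        self.dom_f = nn.Linear(dim_h, 2)

    def forward(self, x_a, x_v, lambd=1.0):
        h_a, h_v = self.enc_a(x_a), self.enc_v(x_v)
        w = torch.softmax(self.attn(torch.cat([h_a, h_v], dim=1)), dim=1)
        fused = w[:, :1] * h_a + w[:, 1:] * h_v        # attention-weighted fusion
        logits = self.classifier(fused)                # class prediction
        rev = lambda h: GradReverse.apply(h, lambd)    # adversarial branch
        doms = self.dom_a(rev(h_a)), self.dom_v(rev(h_v)), self.dom_f(rev(fused))
        return logits, doms
```

In a training loop, one would apply cross-entropy on `logits` for labeled source samples and cross-entropy on the three domain outputs (source vs. target) for samples from both domains; the gradient reversal turns the domain losses into an adversarial objective that pushes the single-modal and fused features toward domain invariance.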

Published in

MM '18: Proceedings of the 26th ACM International Conference on Multimedia
October 2018, 2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
Copyright © 2018 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions, 28%
Overall acceptance rate: 995 of 4,171 submissions, 24%
