Research Article
DOI: 10.1145/3503161.3547865

DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos

Published: 10 October 2022

ABSTRACT

Current works on facial expression learning in videos consume significant computational resources to learn spatial channel feature representations and temporal relationships. To mitigate this issue, we propose a Dual Path multi-excitation Collaborative Network (DPCNet) that learns the information critical for facial expression representation from fewer keyframes in a video. Specifically, DPCNet learns important regions and keyframes from a tuple of four view-grouped frames through multi-excitation modules and produces dual-path representations of one video that are kept consistent under two regularization strategies. A spatial-frame excitation module and a channel-temporal aggregation module are introduced consecutively to learn the spatial-frame representation and to generate a complementary channel-temporal aggregation, respectively. Moreover, we design a multi-frame regularization loss that enforces the representations of multiple frames in the dual view to be semantically coherent. To obtain consistent prediction probabilities from the dual path, we further propose a dual-path regularization loss that minimizes the divergence between the distributions of the two paths' embeddings. Extensive experiments and ablation studies show that DPCNet significantly improves the performance of video-based facial expression recognition (FER) and achieves state-of-the-art results on the large-scale DFEW dataset.
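The abstract does not specify how the dual-path regularization is implemented. As a rough illustrative sketch only (the function name, the choice of symmetric KL divergence, and the class count are our assumptions, not the authors' published implementation), a consistency term that penalizes divergence between the two paths' predicted class distributions could be written in PyTorch roughly as follows:

    import torch
    import torch.nn.functional as F

    def dual_path_consistency(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
        # Log-probabilities of each path's prediction over expression classes.
        log_p = F.log_softmax(logits_a, dim=-1)
        log_q = F.log_softmax(logits_b, dim=-1)
        # F.kl_div(input, target) computes KL(target || input), with input given in log space.
        kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
        kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
        # Symmetric KL: small when the two paths agree on the class distribution.
        return 0.5 * (kl_pq + kl_qp)

    # Hypothetical usage: a batch of 8 videos, 7 expression classes per path.
    loss_reg = dual_path_consistency(torch.randn(8, 7), torch.randn(8, 7))

In practice such a term would be added to the classification loss with a weighting coefficient; the exact divergence measure and weighting used by DPCNet are not given in the abstract.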


Supplemental Material

MM22-fp0487.mp4 (MP4, 37 MB)


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
