ABSTRACT
Current approaches to facial expression learning in videos consume significant computational resources to learn spatial-channel feature representations and temporal relationships. To mitigate this issue, we propose a Dual Path multi-excitation Collaborative Network (DPCNet) that learns the critical information for facial expression representation from fewer keyframes in a video. Specifically, DPCNet learns important regions and keyframes from a tuple of four view-grouped frames via multi-excitation modules and produces dual-path representations of one video that are kept consistent under two regularization strategies. A spatial-frame excitation module and a channel-temporal aggregation module are introduced consecutively to learn the spatial-frame representation and to generate complementary channel-temporal aggregation, respectively. Moreover, we design a multi-frame regularization loss that enforces the representations of multiple frames in the dual view to be semantically coherent. To obtain consistent prediction probabilities from the dual path, we further propose a dual-path regularization loss that minimizes the divergence between the distributions of the two paths' embeddings. Extensive experiments and ablation studies show that DPCNet significantly improves the performance of video-based FER and achieves state-of-the-art results on the large-scale DFEW dataset.
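The dual-path regularization described above penalizes disagreement between the class-probability distributions produced by the two paths. The abstract does not specify the exact divergence used; the following is a minimal sketch assuming a symmetric Jensen-Shannon divergence as the consistency term, with hypothetical inputs `p` and `q` standing in for the two paths' softmax outputs.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.

    Sketch of a dual-path consistency loss (assumed form, not the
    paper's exact definition): given the class probabilities predicted
    by each path, penalize their disagreement symmetrically.
    """
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(A || B) with a small epsilon for numerical stability.
        return sum(ai * math.log((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b))

    # JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M).
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical predictions from the two paths incur (near-)zero loss.
p = [0.7, 0.2, 0.1]
print(js_divergence(p, p))
```

Because the Jensen-Shannon divergence is symmetric and bounded, it avoids the asymmetry of a plain KL term when neither path's output should be treated as the "target" distribution.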
DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos