ABSTRACT
Current approaches to facial expression learning in videos consume significant computational resources to learn spatial-channel feature representations and temporal relationships. To mitigate this issue, we propose a Dual Path multi-excitation Collaborative Network (DPCNet) that learns the critical information for facial expression representation from fewer keyframes in a video. Specifically, DPCNet learns important regions and keyframes from a tuple of four view-grouped frames via multi-excitation modules and produces dual-path representations of one video that are kept consistent under two regularization strategies. A spatial-frame excitation module and a channel-temporal aggregation module are introduced consecutively to learn the spatial-frame representation and to generate complementary channel-temporal aggregation, respectively. Moreover, we design a multi-frame regularization loss that enforces the representations of multiple frames in the dual view to be semantically coherent. To obtain consistent prediction probabilities from the dual path, we further propose a dual-path regularization loss that minimizes the divergence between the distributions of the two paths' embeddings. Extensive experiments and ablation studies show that DPCNet significantly improves the performance of video-based FER and achieves state-of-the-art results on the large-scale DFEW dataset.
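The dual-path regularization described above penalizes disagreement between the class-probability distributions produced by the two paths. The abstract does not specify the exact divergence used; the following is a minimal sketch assuming a symmetric Jensen-Shannon divergence as the consistency term, with hypothetical inputs `p` and `q` standing in for the two paths' softmax outputs.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.

    Sketch of a dual-path consistency loss (assumed form, not the
    paper's exact definition): given the class probabilities predicted
    by each path, penalize their disagreement symmetrically.
    """
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(A || B) with a small epsilon for numerical stability.
        return sum(ai * math.log((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b))

    # JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M).
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical predictions from the two paths incur (near-)zero loss.
p = [0.7, 0.2, 0.1]
print(js_divergence(p, p))
```

Because the Jensen-Shannon divergence is symmetric and bounded, it avoids the asymmetry of a plain KL term when neither path's output should be treated as the "target" distribution.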
DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos