ABSTRACT
This paper presents the hybrid deep learning network submitted to the group-level emotion recognition category of the 6th Emotion Recognition in the Wild (EmotiW 2018) Grand Challenge [9]. Deep models trained individually on faces, scenes, skeletons, and salient regions extracted with visual attention mechanisms are fused to classify the emotion of a group of people in an image as positive, neutral, or negative. Experimental results show that the proposed hybrid network achieves 78.98% and 68.08% classification accuracy on the validation and testing sets, respectively. These results outperform the baselines of 64% and 61% and earned first place in the challenge.
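To illustrate the kind of fusion the abstract describes, the sketch below combines the class probabilities of four independently trained streams (faces, scene, skeleton, salient regions) by a weighted average and picks the arg-max class. This is a minimal illustration only: the stream names, uniform weights, and score-level averaging are assumptions for exposition, not the authors' exact fusion scheme.

```python
import numpy as np

CLASSES = ["positive", "neutral", "negative"]

def fuse_streams(stream_probs, weights=None):
    """Weighted average of per-stream class probabilities.

    stream_probs: dict mapping stream name -> length-3 probability vector
                  (e.g. softmax outputs of the face, scene, skeleton,
                  and salient-region models).
    weights:      optional dict of per-stream weights; defaults to uniform.
    Returns the predicted group-level emotion label.
    """
    names = sorted(stream_probs)
    w = np.array([1.0 if weights is None else weights[n] for n in names])
    w = w / w.sum()  # normalize so the fused scores stay a convex combination
    probs = np.stack([np.asarray(stream_probs[n], dtype=float) for n in names])
    fused = w @ probs  # (num_streams,) @ (num_streams, 3) -> (3,)
    return CLASSES[int(np.argmax(fused))]

# Hypothetical per-stream outputs for one image: three streams favour
# "positive", one favours "neutral"; the fused prediction is "positive".
example = {
    "face":     [0.7, 0.2, 0.1],
    "scene":    [0.5, 0.4, 0.1],
    "skeleton": [0.3, 0.5, 0.2],
    "salient":  [0.6, 0.3, 0.1],
}
print(fuse_streams(example))  # -> positive
```

Score-level averaging is only one option; per-stream weights can also be tuned on the validation set, or the streams' features can be concatenated and fed to a trained fusion classifier.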
REFERENCES
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-Up and Top-Down Attention for Image Captioning and VQA. CoRR abs/1707.07998 (2017). http://arxiv.org/abs/1707.07998
- J. Bullington. 2005. Affective computing and emotion recognition systems: the future of biometric surveillance? In Proceedings of the 2nd Annual Conference on Information Security Curriculum Development. ACM, 95--99.
- Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. 2017. VGGFace2: A dataset for recognising faces across pose and age. CoRR abs/1710.08092 (2017). http://arxiv.org/abs/1710.08092
- Z. Cao, T. Simon, S. Wei, and Y. Sheikh. 2016. Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050 (2016).
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
- A. Dhall, A. Asthana, and R. Goecke. 2010. Facial expression based automatic album creation. In International Conference on Neural Information Processing. Springer, 485--492.
- A. Dhall, R. Goecke, and T. Gedeon. 2015. Automatic group happiness intensity analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 13--26.
- A. Dhall, J. Joshi, K. Sikka, R. Goecke, and N. Sebe. 2015. The more the merrier: Analysing the affect of a group of people in images. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, Vol. 1. IEEE, 1--8.
- Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. 2018. EmotiW 2018: Audio-Video, Student Engagement and Group-Level Affect Prediction. In ACM International Conference on Multimodal Interaction 2018 (in press). ACM.
- I. J. Goodfellow et al. 2013. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing. Springer, 117--124.
- X. Guo, L. F. Polanía, and K. E. Barner. 2017. Group-level emotion recognition using deep models on image scene, faces, and skeletons. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 603--608.
- Xin Guo, Luisa F. Polanía, and Kenneth E. Barner. 2018. Smile detection in the wild based on transfer learning. (2018).
- Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. 2016. MS-Celeb-1M: A Dataset and Benchmark for Large Scale Face Recognition. In European Conference on Computer Vision. Springer.
- K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778.
- S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735--1780.
- Jie Hu, Li Shen, and Gang Sun. 2017. Squeeze-and-Excitation Networks. CoRR abs/1709.01507 (2017). http://arxiv.org/abs/1709.01507
- Xiaohua Huang, Abhinav Dhall, Guoying Zhao, Roland Goecke, and Matti Pietikäinen. 2015. Riesz-based Volume Local Binary Pattern and A Novel Group Expression Model for Group Happiness Intensity Analysis. In BMVC. 1--9.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105.
- J. Li, S. Roy, J. Feng, and T. Sim. 2016. Happiness level prediction with sequential inputs via multiple regressions. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 487--493.
- Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. 2016. Large-Margin Softmax Loss for Convolutional Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning. 507--516.
- Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
- Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2204--2212. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
- W. Mou, O. Celiktutan, and H. Gunes. 2015. Group-level arousal and valence recognition in static images: Face, body and context. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 5. IEEE, 1--6.
- P. M. Niedenthal and M. Brauer. 2012. Social functionality of human emotion. Annual Review of Psychology 63 (2012), 259--285.
- O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep Face Recognition. In British Machine Vision Conference.
- F. E. Pollick, H. M. Paterson, A. Bruderlin, and A. J. Sanford. 2001. Perceiving affect from arm movement. Cognition 82, 2 (2001), B51--B61.
- Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical Sequence Training for Image Captioning. CoRR abs/1612.00563 (2016). http://arxiv.org/abs/1612.00563
- K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818--2826.
- L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao. 2017. Group emotion recognition with individual facial emotion CNNs and global image based CNNs. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 549--552.
- T. Simon, H. Joo, I. Matthews, and Y. Sheikh. 2017. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. In CVPR.
- T. Vandal, D. McDuff, and R. El Kaliouby. 2015. Event detection: Ultra large-scale clustering of facial expressions. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1. IEEE, 1--8.
- Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual Attention Network for Image Classification. CoRR abs/1704.06904 (2017). http://arxiv.org/abs/1704.06904
- S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. 2016. Convolutional pose machines. In CVPR.
- J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan. 2009. Toward practical smile detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 11 (2009), 2106--2111.
- J. Wu and J. M. Rehg. 2011. CENTRIST: A Visual Descriptor for Scene Categorization. IEEE Trans. Pattern Anal. Mach. Intell. 33, 8 (2011), 1489--1501.
- Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII. 451--466.
- Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2015. Stacked Attention Networks for Image Question Answering. CoRR abs/1511.02274 (2015). http://arxiv.org/abs/1511.02274
- K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct. 2016), 1499--1503.
Index Terms
- Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attentions