ABSTRACT
Engagement detection is essential in many areas, such as driver attention tracking, employee engagement monitoring, and student engagement evaluation. In this paper, we propose a novel approach using attention-based hybrid deep models for the engagement prediction in the wild category of the 8th Emotion Recognition in the Wild (EmotiW 2020) Grand Challenge (Dhall et al. 2020). The task is to predict the engagement intensity of subjects in videos, where the subjects are students watching educational videos from Massive Open Online Courses (MOOCs). To address it, we propose a hybrid deep model based on multi-rate and multi-instance attention. The novelty of the proposed model lies in three aspects: (a) an attention-based Gated Recurrent Unit (GRU) deep network, (b) heuristic multi-rate processing of video-based data, and (c) a rigorous and accurate ensemble model. Experimental results on the validation and test sets show promising improvements: our method achieves a mean squared error (MSE) of 0.0541 on the test set, a 64% improvement over the baseline. The proposed model won first place in the engagement prediction in the wild challenge.
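The abstract gives no implementation details, so the following is only a minimal sketch, in plain Python, of the three ingredients it names and of the MSE metric. Everything here is an assumption made for illustration: the function names are hypothetical, "multi-rate" is read as simple frame subsampling at several strides, the ensemble is a plain average, the attention scores are supplied by hand rather than learned, and the GRU itself is omitted.

```python
import math


def subsample(frames, rate):
    """Keep every `rate`-th frame (one simple reading of multi-rate processing)."""
    return frames[::rate]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, scores):
    """Weighted average of per-step state vectors, weighted by softmax(scores).

    In a real model the states would come from a GRU and the scores from a
    learned layer; here both are passed in directly.
    """
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


def ensemble_mean(per_model_predictions):
    """Average the predictions of several models, video by video."""
    n_models = len(per_model_predictions)
    return [sum(col) / n_models for col in zip(*per_model_predictions)]


def mse(predictions, targets):
    """Mean squared error, the metric reported in the abstract."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


if __name__ == "__main__":
    # Toy per-frame feature vectors; real inputs would be face or pose features.
    frames = [[float(i), 1.0] for i in range(6)]
    for rate in (1, 2, 3):
        sampled = subsample(frames, rate)
        # Equal scores give uniform attention, so pooling reduces to the mean.
        pooled = attention_pool(sampled, [0.0] * len(sampled))
        print(rate, len(sampled), pooled)
```

With equal scores the attention weights are uniform, so pooling reduces to a mean over the sampled frames. Unequal scores shift the pooled vector toward the highest-scoring steps.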
References
- Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118. Carnegie Mellon University School of Computer Science.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv: cs.CL/1409.0473
- Joseph E. Beck. 2005. Engagement Tracing: Using Response Times to Model Student Disengagement. In Proceedings of the 2005 Conference on Artificial Intelligence in Education: Supporting Learning Through Intelligent and Socially Informed Technology. IOS Press, Amsterdam, The Netherlands, 88–95. http://dl.acm.org/citation.cfm?id=1562524.1562542
- Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2018. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CoRR, Vol. abs/1812.08008 (2018). arXiv: 1812.08008 http://arxiv.org/abs/1812.08008
- Cheng Chang, Cheng Zhang, Lei Chen, and Yang Liu. 2018. An Ensemble Model Using Face and Body Tracking for Engagement Detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI '18). ACM, New York, NY, USA, 616–622. https://doi.org/10.1145/3242969.3264986
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014a. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv: cs.CL/1406.1078
- Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR, Vol. abs/1406.1078 (2014). arXiv: 1406.1078 http://arxiv.org/abs/1406.1078
- Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv: cs.NE/1412.3555
- Abhinav Dhall, Garima Sharma, Roland Goecke, and Tom Gedeon. 2020. EmotiW 2020: Driver Gaze, Group Emotion, Student Engagement and Physiological Signal based Challenges. In ACM International Conference on Multimodal Interaction 2020.
- Matthias Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. 2015. Efficient and robust automated machine learning. Advances in Neural Information Processing Systems, Vol. 28 (2015), 2944–2952.
- V. García, J. S. Sánchez, and R. A. Mollineda. 2012. On the Effectiveness of Preprocessing Methods When Dealing with Different Levels of Class Imbalance. Know.-Based Syst., Vol. 25, 1 (Feb. 2012), 13–21. https://doi.org/10.1016/j.knosys.2011.06.013
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput., Vol. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Michael I. Jordan. 1990. Attractor Dynamics and Parallelism in a Connectionist Sequential Machine. IEEE Press, 112–127.
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14). IEEE Computer Society, Washington, DC, USA, 1725–1732. https://doi.org/10.1109/CVPR.2014.223
- Kenneth R. Koedinger, John R. Anderson, William H. Hadley, and Mary A. Mark. 1997. Intelligent Tutoring Goes To School in the Big City. International Journal of Artificial Intelligence in Education (IJAIED), Vol. 8 (1997), 30–43. https://telearn.archives-ouvertes.fr/hal-00197383
- Reed W. Larson and Maryse H. Richards. 1991. Boredom in the Middle School Years: Blaming Schools versus Blaming Students. American Journal of Education, Vol. 99, 4 (1991), 418–443. https://doi.org/10.1086/443992
Index Terms
- Multi-rate Attention Based GRU Model for Engagement Prediction