ABSTRACT
This paper presents our contribution to the Audio/Visual+ Emotion Challenge (AV+EC 2015), whose goal is to predict continuous values of the emotion dimensions arousal and valence from audio, visual, and physiological modalities. We employ the long short-term memory recurrent neural network (LSTM-RNN), the state-of-the-art classifier for dimensional recognition. Beyond the standard LSTM-RNN prediction architecture, we investigate two techniques for the dimensional emotion recognition problem. The first is the use of ε-insensitive loss as the objective to optimize. Compared to the squared loss, the most widely used loss function for dimensional emotion recognition, ε-insensitive loss is more robust to label noise: it ignores small errors, which yields a stronger correlation between predictions and labels. The second is temporal pooling, which enables temporal modeling of the input features and increases the diversity of the features fed into the forward prediction architecture. Experimental results demonstrate the effectiveness of the key components of the proposed method, and competitive results are obtained.
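The two techniques named above can be sketched briefly. The snippet below is a minimal NumPy illustration, not the authors' implementation: the ε threshold and the pooling window size are illustrative assumptions, and the pooling shown is simple non-overlapping max-pooling over frames.

```python
import numpy as np

def eps_insensitive_loss(pred, target, eps=0.05):
    """epsilon-insensitive loss (as in SVR): absolute errors below eps
    incur no penalty; larger errors are penalized linearly."""
    err = np.abs(pred - target)
    return np.maximum(err - eps, 0.0)

def temporal_max_pool(features, window=4):
    """Max-pool frame-level features over non-overlapping temporal
    windows of `window` frames; trailing frames are dropped."""
    t, d = features.shape
    n = t // window
    return features[: n * window].reshape(n, window, d).max(axis=1)

# Small errors (within eps) contribute nothing to the objective,
# which is what makes the loss tolerant to annotation noise.
labels = np.array([0.10, 0.20, 0.30])
preds = np.array([0.12, 0.50, 0.30])
print(eps_insensitive_loss(preds, labels, eps=0.05))

# Pooling 6 frames of 2-D features with window=3 yields 2 pooled frames.
frames = np.arange(12, dtype=float).reshape(6, 2)
print(temporal_max_pool(frames, window=3).shape)
```

In an actual pipeline, the pooled features would replace (or augment) the frame-level features fed to the LSTM-RNN, and the ε-insensitive loss would be minimized in place of the squared loss during training.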