Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

ABSTRACT
Depression is a common but serious mental disorder that affects people all over the world. Beyond easing diagnosis, a computer-aided automatic depression assessment system is needed to reduce subjective bias in the diagnostic process. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects in the AVEC 2019 DDS Challenge database, the E-DAIC corpus. For the speech modality, we use deep spectrum features extracted from a pretrained VGG-16 network and employ a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. For the textual modality, we extract BERT embeddings and employ a Convolutional Neural Network (CNN) followed by an LSTM layer. The unimodal speech and linguistic models achieve concordance correlation coefficient (CCC) scores of 0.497 and 0.608, respectively, on the E-DAIC development set. We further combine the two modalities with a feature-fusion approach, feeding the final representation of each unimodal model into a fully connected layer that estimates the PHQ score. This multimodal approach achieves a CCC of 0.696 on the development set and 0.403 on the test set of the E-DAIC corpus, an absolute improvement of 0.283 over the challenge baseline.
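As a point of reference, the concordance correlation coefficient (Lin, 1989) used as the evaluation metric above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard formula, not the challenge's official scoring script, which may differ in implementation details:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient.

    Unlike Pearson's r, CCC penalizes systematic shifts in mean or
    scale between predicted and gold PHQ scores, not just poor
    correlation.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()  # population variances
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

CCC equals 1 only under perfect agreement; a model whose predictions are merely correlated with the targets but offset or rescaled scores strictly below its Pearson correlation.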