DOI: 10.1145/2661806.2661819
research-article

Model Fusion for Multimodal Depression Classification and Level Detection

Published: 7 November 2014

ABSTRACT

Audio-visual emotion and mood disorder cues have recently been explored to develop tools that assist psychologists and psychiatrists in evaluating a patient's level of depression. In this paper, we present several multimodal depression level predictors built on a model fusion approach, in the context of the AVEC 2014 challenge. We show that an i-vector representation of short-term audio features contains useful information for depression classification and prediction. We also apply a classification step prior to regression, which allows different regression models to be used depending on the presence or absence of depression. Our experiments show that combining our audio-based model with two other models based on LGBP-TOP video features leads to an improvement of 4% over the baseline model proposed by the challenge organizers.
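The classify-then-regress idea in the abstract can be sketched as a two-stage pipeline: a binary classifier first decides depressed vs. non-depressed, then a separate regressor per class predicts the depression level. The sketch below is an illustration only, not the authors' implementation: the feature vectors, labels, and model choices (SVM classifier, support vector regressors) are assumptions standing in for the paper's i-vector audio features and fused models.

```python
# Minimal sketch of a classify-then-regress scheme, under assumed synthetic
# data: stage 1 predicts presence/absence of depression, stage 2 routes each
# sample to a regressor trained only on that class.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-recording feature vectors (e.g. i-vectors).
X = rng.normal(size=(200, 20))
# Synthetic depression level on a BDI-like 0..63 scale, tied to one feature.
levels = np.clip(X[:, 0] * 10.0 + 20.0 + rng.normal(scale=2.0, size=200),
                 0.0, 63.0)
depressed = (levels >= 20.0).astype(int)  # binary label derived from level

clf = SVC().fit(X, depressed)             # stage 1: presence/absence
reg = {c: SVR().fit(X[depressed == c], levels[depressed == c])
       for c in (0, 1)}                   # stage 2: one regressor per class

def predict_level(x):
    """Route a sample through the classifier, then the matching regressor."""
    c = int(clf.predict(x.reshape(1, -1))[0])
    return float(reg[c].predict(x.reshape(1, -1))[0])
```

Training separate regressors per class lets each one fit a narrower range of target values, which is the motivation the abstract gives for inserting the classification step before regression.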


Published in

AVEC '14: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge
November 2014, 110 pages
ISBN: 978-1-4503-3119-7
DOI: 10.1145/2661806

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



Acceptance Rates

AVEC '14 paper acceptance rate: 8 of 22 submissions (36%). Overall acceptance rate: 52 of 98 submissions (53%).
