openSMILE: the Munich Versatile and Fast Open-Source Audio Feature Extractor

ABSTRACT
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
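To illustrate the two-stage pipeline the abstract describes (low-level descriptors, then delta regression and statistical functionals over them), here is a minimal sketch in Python/NumPy. The function names, the regression window width, and the edge-padding strategy are illustrative assumptions for this sketch, not openSMILE's actual C++ API; the delta formula is the standard HTK-style regression over a symmetric window.

```python
import numpy as np

def delta(lld, width=2):
    """Delta-regression coefficients over a (frames x dims) LLD matrix.

    Uses the standard regression formula
        d_t = sum_{k=1..width} k * (x_{t+k} - x_{t-k}) / (2 * sum k^2),
    repeating the first/last frame at the edges (one common convention;
    openSMILE's exact edge handling may differ).
    """
    n = len(lld)
    padded = np.pad(lld, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    num = sum(
        k * (padded[width + k : width + k + n] - padded[width - k : width - k + n])
        for k in range(1, width + 1)
    )
    return num / denom

def functionals(lld):
    """A few statistical functionals summarising each LLD contour
    over time, yielding one fixed-length vector per dimension."""
    return {
        "mean": lld.mean(axis=0),
        "std": lld.std(axis=0),
        "min": lld.min(axis=0),
        "max": lld.max(axis=0),
    }

# Example: a toy one-dimensional "descriptor" contour (a linear ramp).
contour = np.arange(10, dtype=float).reshape(-1, 1)
deltas = delta(contour)          # interior frames of a unit ramp give slope 1.0
summary = functionals(contour)   # e.g. summary["mean"] == [4.5]
```

Concatenating the functionals of the descriptors and of their deltas is what turns a variable-length frame sequence into the fixed-length feature vector that classifiers such as those in the emotion-challenge feature sets expect.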