On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis

Wang, DeLiang

doi:10.1007/0-387-22794-6_12

Conclusion

In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, concerned with the goal of computation and the general processing strategy, must be separated from the algorithm level; that is, the what must be separated from the how. This chapter attempts a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the computational auditory scene analysis (CASA) problem.

My analysis results in the proposal of the ideal binary mask as a main goal of CASA. This goal is consistent with characteristics of human auditory scene analysis. It is also consistent with more specific objectives such as enhancing automatic speech recognition (ASR) and speech intelligibility. The resulting evaluation metric is simple and general, and is easy to apply when the premixing target is available. The goal of the ideal binary mask has led to effective speech separation algorithms that attempt to explicitly estimate such masks.
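
To make the notion of an ideal binary mask concrete, the following is a minimal sketch and not the chapter's own implementation. It assumes the commonly used formulation in which a time-frequency unit is labeled 1 when the target energy exceeds the interference energy by a local criterion (0 dB here) and 0 otherwise. It also assumes a plain short-time Fourier transform in place of an auditory filterbank. The names `stft_power`, `ideal_binary_mask`, and `lc_db` are hypothetical and chosen only for illustration.

```python
import numpy as np

def stft_power(x, frame_len=256, hop=128):
    """Power spectrogram from Hann-windowed frames.

    Assumption: an STFT stands in for an auditory (e.g. gammatone) filterbank.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape: (frames, bins)

def ideal_binary_mask(target, interference, lc_db=0.0, eps=1e-12):
    """Return 1 where the local target-to-interference ratio exceeds lc_db, else 0.

    Requires the premixed target and interference signals, which is why the
    mask is 'ideal': it cannot be computed from the mixture alone.
    """
    t_pow = stft_power(target)
    i_pow = stft_power(interference)
    local_snr_db = 10.0 * np.log10((t_pow + eps) / (i_pow + eps))
    return (local_snr_db > lc_db).astype(np.uint8)

# Toy signals (assumed for illustration): a 500 Hz target and a 2 kHz interferer.
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 500 * t)
interference = 0.5 * np.sin(2 * np.pi * 2000 * t)
mask = ideal_binary_mask(target, interference)
```

In this toy case the mask keeps the frequency bin around 500 Hz, where the target dominates, and discards the bin around 2 kHz, where the interference dominates. An estimated mask from a separation algorithm could then be compared against this ideal mask when the premixing signals are available.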

References

Bird, J. and Darwin, C.J., 1997, Effects of a difference in fundamental frequency in separating two sentences, in: Psychophysical and Physiological Advances in Hearing, A.R. Palmer, et al., ed., Whurr, London.
Bodden, M., 1993, Modeling human sound-source localization and the cocktail-party-effect, Acta Acust. 1: 43–55.
Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge MA.
Brown, G.J. and Cooke, M., 1994, Computational auditory scene analysis, Computer Speech and Language 8: 297–336.
Brungart, D., Chang, P., Simpson, B., and Wang, D. L., in preparation.
Carlyon, R.P. and Shackleton, T.M., 1994, Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms?, J. Acoust. Soc. Am. 95: 3541–3554.
Cherry, E.C., 1953, Some experiments on the recognition of speech, with one and with two ears, J. Acoust. Soc. Am. 25: 975–979.
Cooke, M., 1993, Modelling Auditory Processing and Organization, Cambridge University Press, Cambridge U.K.
Cooke, M., Green, P., Josifovski, L., and Vizinho, A., 2001, Robust automatic speech recognition with missing and unreliable acoustic data, Speech Comm. 34: 267–285.
Cowan, N., 2001, The magic number 4 in short-term memory: a reconsideration of mental storage capacity, Behav. Brain Sci. 24: 87–185.
Ellis, D.P.W., 1996, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science.
Gibson, J.J., 1966, The Senses Considered as Perceptual Systems, Greenwood Press, Westport CT.
Glotin, H., 2001, Elaboration et étude comparative de systèmes adaptatifs multi-flux de reconnaissance robuste de la parole: incorporation d’indices de voisement et de localisation, Ph.D. Dissertation, Institut National Polytechnique de Grenoble.
Helmholtz, H., 1863, On the Sensation of Tone (A.J. Ellis, Trans.), Dover Publishers, Second English ed., New York.
Hu, G. and Wang, D.L., 2001, Speech segregation based on pitch tracking and amplitude modulation, in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79–82.
Hu, G. and Wang, D.L., 2003, Monaural speech separation, in: Advances in Neural Information Processing Systems (NIPS’02), MIT Press, Cambridge MA, pp. 1221–1228.
Hu, G. and Wang, D.L., 2004, Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. Neural Net., in press.
Hyvärinen, A., Karhunen, J., and Oja, E., 2001, Independent Component Analysis, Wiley, New York.
Jourjine, A., Rickard, S., and Yilmaz, O., 2000, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, in Proceedings of IEEE ICASSP, pp. 2985–2988.
Krim, H. and Viberg, M., 1996, Two decades of array signal processing research: The parametric approach, IEEE Sig. Proc. Mag. 13: 67–94.
Lee, T.-W., 1998, Independent Component Analysis: Theory and Applications, Kluwer Academic, Boston.
Lim, J., ed., 1983, Speech Enhancement, Prentice Hall, Englewood Cliffs NJ.
Marr, D., 1982, Vision, Freeman, New York.
McCabe, S.L. and Denham, M.J., 1997, A model of auditory streaming, J. Acoust. Soc. Am. 101: 1611–1621.
Moore, B.C.J., 1998, Cochlear Hearing Loss, Whurr Publishers, London.
Moore, B.C.J., 2003, An Introduction to the Psychology of Hearing, Academic Press, 5th ed., San Diego, CA.
Nakatani, T. and Okuno, H.G., 1999, Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Comm. 27: 209–222.
Norris, M., 2003, Assessment and extension of Wang’s oscillatory model of auditory stream segregation, Ph.D. Dissertation, University of Queensland School of Information Technology and Electrical Engineering.
O’Shaughnessy, D., 2000, Speech Communications: Human and Machine, IEEE Press, 2nd ed., Piscataway NJ.
Pashler, H.E., 1998, The Psychology of Attention, MIT Press, Cambridge MA.
Roman, N., Wang, D.L., and Brown, G.J., 2001, Speech segregation based on sound localization, in Proceedings of IJCNN, pp. 2861–2866.
Roman, N., Wang, D.L., and Brown, G.J., 2003, Speech segregation based on sound localization, J. Acoust. Soc. Am. 114: 2236–2252.
Rosenthal, D.F. and Okuno, H.G., ed., 1998, Computational Auditory Scene Analysis, Lawrence Erlbaum, Mahwah NJ.
Roweis, S.T., 2001, One microphone source separation, in: Advances in Neural Information Processing Systems (NIPS’00), MIT Press.
Stubbs, R.J. and Summerfield, Q., 1988, Evaluation of two voice-separation algorithms using normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 84: 1236–1249.
Stubbs, R.J. and Summerfield, Q., 1990, Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 87: 359–372.
Treisman, A., 1999, Solutions to the binding problem: progress through controversy and convergence, Neuron 24: 105–110.
van der Kouwe, A.J.W., Wang, D.L., and Brown, G.J., 2001, A comparison of auditory and blind separation techniques for speech segregation, IEEE Trans. Speech Audio Process. 9: 189–195.
van Veen, B.D. and Buckley, K.M., April 1988, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine, pp. 4–24.
Wang, D.L., 1996, Primitive auditory segregation based on oscillatory correlation, Cognit. Sci. 20: 409–456.
Wang, D.L. and Brown, G.J., 1999, Separation of speech from interfering sounds based on oscillatory correlation, IEEE Trans. Neural Net. 10: 684–697.
Weintraub, M., 1985, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering.
Wrigley, S.N. and Brown, G.J., 2004, A computational model of auditory selective attention, IEEE Trans. Neural Net., in press.

Author information

Authors and Affiliations

Department of Computer Science and Engineering and Center of Cognitive Science, The Ohio State University, Columbus, OH
DeLiang Wang

Editor information

Editors and Affiliations

East Bay Institute for Research and Education, USA
Pierre Divenyi

About this chapter

Cite this chapter

Wang, D. (2005). On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_12

DOI: https://doi.org/10.1007/0-387-22794-6_12
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)
