Conclusion
In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, concerned with the goal of computation and the general processing strategy, must be separated from the algorithm level; that is, the question of what is computed must be separated from the question of how. This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA problem.
My analysis results in the proposal of the ideal binary mask as a main goal of CASA. This goal is consistent with characteristics of human auditory scene analysis. The goal is also consistent with more specific objectives such as enhancing ASR and speech intelligibility. The resulting evaluation metric has the properties of simplicity and generality, and is easy to apply when the premixing target is available. The goal of the ideal binary mask has led to effective speech separation algorithms that attempt to explicitly estimate such masks.
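As a rough illustration of why the goal is easy to apply when the premixing signals are available, the following Python sketch builds a binary mask from target and interference energies in each time-frequency unit. It assumes the commonly used criterion that a unit is labeled 1 when the target energy exceeds the interference energy by a local threshold (0 dB here). The function names, the `lc_db` parameter, and the unit-wise agreement score are illustrative assumptions, not the chapter's own formulation or its evaluation metric.

```python
import numpy as np


def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Label each time-frequency unit 1 if the target dominates, else 0.

    Assumption: both inputs are non-negative energy arrays of the same shape,
    computed from the premixed target and interference (for example, one value
    per filterbank channel and time frame). lc_db is a hypothetical local
    SNR threshold in dB; 0 dB means "target stronger than interference".
    """
    target_energy = np.asarray(target_energy, dtype=float)
    interference_energy = np.asarray(interference_energy, dtype=float)
    # Compare energies directly to avoid dividing by zero in silent units.
    threshold = 10.0 ** (lc_db / 10.0)
    return (target_energy > threshold * interference_energy).astype(np.uint8)


def mask_agreement(estimated_mask, ideal_mask):
    """Fraction of units where an estimated mask matches the ideal one.

    This is an illustrative stand-in score, not necessarily the metric
    proposed in the chapter.
    """
    estimated_mask = np.asarray(estimated_mask)
    ideal_mask = np.asarray(ideal_mask)
    return float(np.mean(estimated_mask == ideal_mask))


# Toy example: 2 channels x 2 frames of made-up energies.
target = [[4.0, 1.0], [1.0, 0.0]]
interference = [[1.0, 4.0], [1.0, 0.0]]
ibm = ideal_binary_mask(target, interference)
score = mask_agreement([[1, 0], [1, 0]], ibm)
```

With the strict inequality used here, units where target and interference are equal (including silent units) are labeled 0; that tie-breaking choice is also an assumption of this sketch.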
References
Bird, J. and Darwin, C.J., 1997, Effects of a difference in fundamental frequency in separating two sentences, in: Psychophysical and Physiological Advances in Hearing, A.R. Palmer et al., eds., Whurr, London.
Bodden, M., 1993, Modeling human sound-source localization and the cocktail-party-effect, Acta Acust. 1: 43–55.
Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge MA.
Brown, G.J. and Cooke, M., 1994, Computational auditory scene analysis, Computer Speech and Language 8: 297–336.
Brungart, D., Chang, P., Simpson, B., and Wang, D.L., in preparation.
Carlyon, R.P. and Shackleton, T.M., 1994, Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms?, J. Acoust. Soc. Am. 95: 3541–3554.
Cherry, E.C., 1953, Some experiments on the recognition of speech, with one and with two ears, J. Acoust. Soc. Am. 25: 975–979.
Cooke, M., 1993, Modelling Auditory Processing and Organization, Cambridge University Press, Cambridge U.K.
Cooke, M., Green, P., Josifovski, L., and Vizinho, A., 2001, Robust automatic speech recognition with missing and unreliable acoustic data, Speech Comm. 34: 267–285.
Cowan, N., 2001, The magic number 4 in short-term memory: a reconsideration of mental storage capacity, Behav. Brain Sci. 24: 87–185.
Ellis, D.P.W., 1996, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science.
Gibson, J.J., 1966, The Senses Considered as Perceptual Systems, Greenwood Press, Westport CT.
Glotin, H., 2001, Elaboration et étude comparative de systèmes adaptatifs multi-flux de reconnaissance robuste de la parole: incorporation d’indices de voisement et de localisation, Ph.D. Dissertation, Institut National Polytechnique de Grenoble.
Helmholtz, H., 1863, On the Sensations of Tone (A.J. Ellis, Trans.), Dover Publishers, Second English ed., New York.
Hu, G. and Wang, D.L., 2001, Speech segregation based on pitch tracking and amplitude modulation, in: Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79–82.
Hu, G. and Wang, D.L., 2003, Monaural speech separation, in: Advances in Neural Information Processing Systems (NIPS’02), MIT Press, Cambridge MA, pp. 1221–1228.
Hu, G. and Wang, D.L., 2004, Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. Neural Net., in press.
Hyvärinen, A., Karhunen, J., and Oja, E., 2001, Independent Component Analysis, Wiley, New York.
Jourjine, A., Rickard, S., and Yilmaz, O., 2000, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, in: Proceedings of IEEE ICASSP, pp. 2985–2988.
Krim, H. and Viberg, M., 1996, Two decades of array signal processing research: The parametric approach, IEEE Sig. Proc. Mag. 13: 67–94.
Lee, T.-W., 1998, Independent Component Analysis: Theory and Applications, Kluwer Academic, Boston.
Lim, J., ed., 1983, Speech Enhancement, Prentice Hall, Englewood Cliffs NJ.
Marr, D., 1982, Vision, Freeman, New York.
McCabe, S.L. and Denham, M.J., 1997, A model of auditory streaming, J. Acoust. Soc. Am. 101: 1611–1621.
Moore, B.C.J., 1998, Cochlear Hearing Loss, Whurr Publishers, London.
Moore, B.C.J., 2003, An Introduction to the Psychology of Hearing, Academic Press, 5th ed., San Diego, CA.
Nakatani, T. and Okuno, H.G., 1999, Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Comm. 27: 209–222.
Norris, M., 2003, Assessment and extension of Wang’s oscillatory model of auditory stream segregation, Ph.D. Dissertation, University of Queensland School of Information Technology and Electrical Engineering.
O’Shaughnessy, D., 2000, Speech Communications: Human and Machine, IEEE Press, 2nd ed., Piscataway NJ.
Pashler, H.E., 1998, The Psychology of Attention, MIT Press, Cambridge MA.
Roman, N., Wang, D.L., and Brown, G.J., 2001, Speech segregation based on sound localization, in: Proceedings of IJCNN, pp. 2861–2866.
Roman, N., Wang, D.L., and Brown, G.J., 2003, Speech segregation based on sound localization, J. Acoust. Soc. Am. 114: 2236–2252.
Rosenthal, D.F. and Okuno, H.G., eds., 1998, Computational Auditory Scene Analysis, Lawrence Erlbaum, Mahwah NJ.
Roweis, S.T., 2001, One microphone source separation, in: Advances in Neural Information Processing Systems (NIPS’00), MIT Press.
Stubbs, R.J. and Summerfield, Q., 1988, Evaluation of two voice-separation algorithms using normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 84: 1236–1249.
Stubbs, R.J. and Summerfield, Q., 1990, Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 87: 359–372.
Treisman, A., 1999, Solutions to the binding problem: progress through controversy and convergence, Neuron 24: 105–110.
van der Kouwe, A.J.W., Wang, D.L., and Brown, G.J., 2001, A comparison of auditory and blind separation techniques for speech segregation, IEEE Trans. Speech Audio Process. 9: 189–195.
van Veen, B.D. and Buckley, K.M., April 1988, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine, pp. 4–24.
Wang, D.L., 1996, Primitive auditory segregation based on oscillatory correlation, Cognit. Sci. 20: 409–456.
Wang, D.L. and Brown, G.J., 1999, Separation of speech from interfering sounds based on oscillatory correlation, IEEE Trans. Neural Net. 10: 684–697.
Weintraub, M., 1985, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering.
Wrigley, S.N. and Brown, G.J., 2004, A computational model of auditory selective attention, IEEE Trans. Neural Net., in press.
Copyright information
© 2005 Springer Science + Business Media, Inc.
About this chapter
Cite this chapter
Wang, D. (2005). On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_12
DOI: https://doi.org/10.1007/0-387-22794-6_12
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)