On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis

Speech Separation by Humans and Machines

Conclusion

In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, which is concerned with the goal of computation and the general processing strategy, must be separated from the algorithm level; this is the separation of what from how. This chapter attempts a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the computational auditory scene analysis (CASA) problem.

My analysis results in the proposal of the ideal binary mask as a main goal of CASA. This goal is consistent with characteristics of human auditory scene analysis. It is also consistent with more specific objectives such as enhancing automatic speech recognition (ASR) and speech intelligibility. The resulting evaluation metric is simple and general, and is easy to apply when the premixing target is available. The goal of the ideal binary mask has led to effective speech separation algorithms that attempt to explicitly estimate such masks.
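As a rough illustration of why such a goal is easy to apply when the premixing signals are available, the following minimal sketch computes a binary mask over time-frequency units. It assumes the commonly used formulation in which a unit is labeled 1 when the target energy exceeds the interference energy by a local criterion in dB (0 dB here); the excerpt above does not state the definition, so this is an assumption. The names `ideal_binary_mask`, `lc_db`, and `mask_agreement` are hypothetical, and the unit-wise agreement score is a simple stand-in, not the evaluation metric defined in the chapter.

```python
import math


def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Return a 0/1 mask over time-frequency units.

    Both inputs are equally shaped lists of rows (one row per frequency
    channel, one entry per time frame) holding premixing energies.
    A unit is 1 when its local SNR in dB exceeds lc_db (assumed criterion).
    """
    mask = []
    for t_row, n_row in zip(target_energy, interference_energy):
        row = []
        for t, n in zip(t_row, n_row):
            if t <= 0.0:
                row.append(0)  # no target energy: unit cannot be target-dominant
            elif n <= 0.0:
                row.append(1)  # target energy with no interference
            else:
                snr_db = 10.0 * math.log10(t / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def mask_agreement(estimated, ideal):
    """Fraction of time-frequency units on which two masks agree.

    Illustrative stand-in only; not the metric proposed in the chapter.
    """
    pairs = [(e, i) for e_row, i_row in zip(estimated, ideal)
             for e, i in zip(e_row, i_row)]
    return sum(e == i for e, i in pairs) / len(pairs)


# Toy premixing energies: 2 frequency channels, 3 time frames.
target = [[4.0, 1.0, 0.0], [2.0, 8.0, 3.0]]
noise = [[1.0, 2.0, 5.0], [2.0, 1.0, 0.0]]

ibm = ideal_binary_mask(target, noise)
estimate = [[1, 0, 0], [1, 1, 1]]  # a made-up estimated mask
score = mask_agreement(estimate, ibm)
```

In this toy case the unit with equal target and interference energy is labeled 0, because the comparison is strict; a different local criterion can be passed through `lc_db`.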

References

  • Bird, J. and Darwin, C.J., 1997, Effects of a difference in fundamental frequency in separating two sentences, in: Psychophysical and Physiological Advances in Hearing, A.R. Palmer, et al., ed., Whurr, London.

  • Bodden, M., 1993, Modeling human sound-source localization and the cocktail-party-effect, Acta Acust. 1: 43–55.

  • Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge MA.

  • Brown, G.J. and Cooke, M., 1994, Computational auditory scene analysis, Computer Speech and Language 8: 297–336.

  • Brungart, D., Chang, P., Simpson, B., and Wang, D. L., in preparation.

  • Carlyon, R.P. and Shackleton, T.M., 1994, Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms?, J. Acoust. Soc. Am. 95: 3541–3554.

  • Cherry, E.C., 1953, Some experiments on the recognition of speech, with one and with two ears, J. Acoust. Soc. Am. 25: 975–979.

  • Cooke, M., 1993, Modelling Auditory Processing and Organization, Cambridge University Press, Cambridge U.K.

  • Cooke, M., Green, P., Josifovski, L., and Vizinho, A., 2001, Robust automatic speech recognition with missing and unreliable acoustic data, Speech Comm. 34: 267–285.

  • Cowan, N., 2001, The magic number 4 in short-term memory: a reconsideration of mental storage capacity, Behav. Brain Sci. 24: 87–185.

  • Ellis, D.P.W., 1996, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science.

  • Gibson, J.J., 1966, The Senses Considered as Perceptual Systems, Greenwood Press, Westport CT.

  • Glotin, H., 2001, Elaboration et étude comparative de systèmes adaptatifs multi-flux de reconnaissance robuste de la parole: incorporation d’indices de voisement et de localisation, Ph.D. Dissertation, Institut National Polytechnique de Grenoble.

  • Helmholtz, H., 1863, On the Sensation of Tone (A.J. Ellis, Trans.), Dover Publishers, Second English ed., New York.

  • Hu, G. and Wang, D.L., 2001, Speech segregation based on pitch tracking and amplitude modulation, in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79–82.

  • Hu, G. and Wang, D.L., 2003, Monaural speech separation, in: Advances in Neural Information Processing Systems (NIPS’02), MIT Press, Cambridge MA, pp. 1221–1228.

  • Hu, G. and Wang, D.L., 2004, Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. Neural Net., in press.

  • Hyvärinen, A., Karhunen, J., and Oja, E., 2001, Independent Component Analysis, Wiley, New York.

  • Jourjine, A., Rickard, S., and Yilmaz, O., 2000, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, in Proceedings of IEEE ICASSP, pp. 2985–2988.

  • Krim, H. and Viberg, M., 1996, Two decades of array signal processing research: The parametric approach, IEEE Sig. Proc. Mag. 13: 67–94.

  • Lee, T.-W., 1998, Independent Component Analysis: Theory and Applications, Kluwer Academic, Boston.

  • Lim, J., ed., 1983, Speech Enhancement, Prentice Hall, Englewood Cliffs NJ.

  • Marr, D., 1982, Vision, Freeman, New York.

  • McCabe, S.L. and Denham, M.J., 1997, A model of auditory streaming, J. Acoust. Soc. Am. 101: 1611–1621.

  • Moore, B.C.J., 1998, Cochlear Hearing Loss, Whurr Publishers, London.

  • Moore, B.C.J., 2003, An Introduction to the Psychology of Hearing, Academic Press, 5th ed., San Diego, CA.

  • Nakatani, T. and Okuno, H.G., 1999, Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Comm. 27: 209–222.

  • Norris, M., 2003, Assessment and extension of Wang’s oscillatory model of auditory stream segregation, Ph.D. Dissertation, University of Queensland School of Information Technology and Electrical Engineering.

  • O’Shaughnessy, D., 2000, Speech Communications: Human and Machine, IEEE Press, 2nd ed., Piscataway NJ.

  • Pashler, H.E., 1998, The Psychology of Attention, MIT Press, Cambridge MA.

  • Roman, N., Wang, D.L., and Brown, G.J., 2001, Speech segregation based on sound localization, in Proceedings of IJCNN, pp. 2861–2866.

  • Roman, N., Wang, D.L., and Brown, G.J., 2003, Speech segregation based on sound localization, J. Acoust. Soc. Am. 114: 2236–2252.

  • Rosenthal, D.F. and Okuno, H.G., eds., 1998, Computational Auditory Scene Analysis, Lawrence Erlbaum, Mahwah NJ.

  • Roweis, S.T., 2001, One microphone source separation, in: Advances in Neural Information Processing Systems (NIPS’00), MIT Press.

  • Stubbs, R.J. and Summerfield, Q., 1988, Evaluation of two voice-separation algorithms using normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 84: 1236–1249.

  • Stubbs, R.J. and Summerfield, Q., 1990, Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 87: 359–372.

  • Treisman, A., 1999, Solutions to the binding problem: progress through controversy and convergence, Neuron 24: 105–110.

  • van der Kouwe, A.J.W., Wang, D.L., and Brown, G.J., 2001, A comparison of auditory and blind separation techniques for speech segregation, IEEE Trans. Speech Audio Process. 9: 189–195.

  • van Veen, B.D. and Buckley, K.M., April 1988, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine, pp. 4–24.

  • Wang, D.L., 1996, Primitive auditory segregation based on oscillatory correlation, Cognit. Sci. 20: 409–456.

  • Wang, D.L. and Brown, G.J., 1999, Separation of speech from interfering sounds based on oscillatory correlation, IEEE Trans. Neural Net. 10: 684–697.

  • Weintraub, M., 1985, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering.

  • Wrigley, S.N. and Brown, G.J., 2004, A computational model of auditory selective attention, IEEE Trans. Neural Net., in press.


Copyright information

© 2005 Springer Science + Business Media, Inc.

About this chapter

Cite this chapter

Wang, D. (2005). On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_12

  • DOI: https://doi.org/10.1007/0-387-22794-6_12

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-8001-2

  • Online ISBN: 978-0-387-22794-8

  • eBook Packages: Engineering, Engineering (R0)
