Elsevier

Neurocomputing

Volume 129, 10 April 2014, Pages 199-207

Real-time frequency-based noise-robust Automatic Speech Recognition using Multi-Nets Artificial Neural Networks: A multi-views multi-learners approach

https://doi.org/10.1016/j.neucom.2013.09.040

Highlights

  • A noise-robust Multi-Networks Speech Recogniser model is provided.

  • The proposed model is based on Multi-Nets Artificial Neural Networks.

  • The proposed model was verified using unforeseen testing data corrupted with additive noise.

  • A detailed performance comparison is made between the proposed model and monolithic ANN-based ASRs.

  • The proposed recogniser improved the recognition rate by up to 20.14% compared with MVSL ANN-based ASRs.

Abstract

Automatic Speech Recognition (ASR) is a technology for identifying uttered word(s) represented as an acoustic signal, and one of the most important aspects of a noise-robust ASR system is its ability to recognise speech accurately under noisy conditions. This paper studies the application of Multi-Nets Artificial Neural Networks (M-N ANNs), a realisation of the multiple-views multiple-learners approach, as Multi-Networks Speech Recognisers (M-NSRs) to provide a real-time, frequency-based noise-robust ASR model. M-NSRs define the speech features associated with each word as a different view and apply a standalone ANN as one of the learners to approximate that view; in contrast, multiple-views single-learner (MVSL) ANN-based speech recognisers employ only one ANN to memorise the features of the entire vocabulary. In this research, an M-NSR was built and evaluated using unforeseen test data corrupted with white, brown, and pink noise; more specifically, 27 experiments were conducted on noisy speech to measure the accuracy and recognition rate of the proposed model. Furthermore, the results of the M-NSR were compared in detail with those of an MVSL ANN-based ASR system. The M-NSR improved the average recognition rate by up to 20.14% when given noise-corrupted test data in our experiments. It is shown that the M-NSR, with its higher degree of generalisability, can handle frequency-based noise, as it achieves a higher recognition rate than the reference model under noisy conditions.

Introduction

An Automatic Speech Recognition (ASR) system identifies uttered word(s) represented as an acoustic signal. An ASR system relies on a given lexicon and prior knowledge of a problem domain to recognise spoken word(s). ASR has several applications in voice-enabled control systems such as those implemented in health care, military, telephony and other domains. Nonetheless, speech recognisers are generally unable to reach human-level performance under realistic (i.e. noisy) conditions. Although most recent speech recognisers achieve high recognition rates in the lab, their performance in real-life applications under noisy environments remains unsatisfactory. This is because noise may introduce a mismatch between the data considered for ASR modelling and the actual speech data encountered while the recogniser is in use; this mismatch degrades the recognition rate of the ASR system [1]. Therefore, for speech communication, accurate classification of uttered words under noisy conditions is necessary [2].

A pattern recognition system usually consists of two major components: a feature extractor and a classifier. The feature extractor converts the raw input into a representation that can readily be mapped to a class decision, and the classifier performs that mapping. Artificial Neural Networks (ANNs) are a common choice for the classifier: they are mathematical models imitating natural neural systems and have been used widely by both academia and industry. ANNs are considered universal classifiers; to put it differently, they can in theory learn any mapping function given samples of inputs to the function under simulation and their responses [3].
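To illustrate this two-stage pipeline, here is a minimal sketch, assuming MFCC features computed with the librosa package and a small feed-forward ANN from scikit-learn; the function names, layer size, and the time-averaging of the features are illustrative choices rather than the configuration used in the paper.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(signal, sample_rate=16000, n_mfcc=13):
    """Feature extractor: turn a raw waveform into a fixed-length vector."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Average over time so every utterance yields a vector of the same size.
    return mfcc.mean(axis=1)

# Classifier: a single feed-forward ANN mapping feature vectors to word labels.
classifier = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)

# Hypothetical usage, assuming train_waveforms and train_labels are available:
# X = np.stack([extract_features(w) for w in train_waveforms])
# classifier.fit(X, train_labels)
# predicted = classifier.predict(extract_features(test_waveform).reshape(1, -1))
```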

Lately, studies of noise-robust ASR systems have shown a trend towards isolating noise from speech data rather than enhancing recognisers to accept and handle noisy data [4]. Nonetheless, noise reduction may not be effective when isolating the noise is difficult or impossible, and such methods are usually not implemented in real time. In these methods, pre-noise-processing operations need to be carried out in order to define a noise profile, detect the noise, and finally isolate it before the speech data are given to an ASR system. The drawbacks of noise isolation techniques are further discussed in Section 2.2. On the other hand, a real-time, noise-robust ASR system can recognise speech more accurately without requiring any noise processing. In addition, noise reduction alone may not always provide clean data, since some noise is likely to remain in the speech after noise-robust methods are applied. Finally, noise reduction algorithms may require a considerable amount of processing overhead, which makes them unsuitable for low-capacity devices. Therefore, it is important to consider approaches that can provide real-time, noise-robust speech recognisers without requiring any noise processing. Such approaches should improve ASR recognition rate and generalisability so that the effects of any remaining noise are minimised.

Multi-Nets ANNs (M-N ANNs) are based on a novel approach proposed by Sun and Qingiu called Multiple-Views Multiple-Learners (MVML) [5], [6]. The general principle of MVML is that when the function under simulation is complex due to multiple views, using multiple learners increases the classification performance. Sun conducted a survey of multi-view learning theories and discussed its usefulness; he concluded that MVML is an effective approach with widespread applicability [7].

In the context of M-N ANNs, each ANN learns one of the views, and together they form an ensemble of learners. M-N ANNs have been shown to recognise complex patterns better than Multiple-Views Single-Learner (MVSL) ANN-based classifiers because they distribute the complexity of the function among several parallel and independent neural networks, making the overall classification easier [8]; under similar conditions, an MVSL ANN-based system is unable to learn the function with adequate accuracy.

In this paper, we investigated whether M-N ANNs can be used to increase the recognition rate and accuracy of an ASR system under noisy conditions. A speaker-independent Multi-Networks Speech Recogniser (M-NSR) is provided using M-N ANNs, in which each neural network represents one of the words in the vocabulary. The model proposed here assumes that the speech data are corrupted with noise, regardless of whether a noise-robustness method has been applied. The model adopts a real-time approach, meaning that it does not need to perform any specific noise processing or digital signal processing to isolate or compensate for the noise; instead, we concentrated on providing a more accurate ASR model that identifies noisy acoustic signals with a better recognition rate. We also investigated whether increasing the ASR model's generalisability helps to provide a frequency-based noise-robust ASR model. For evaluation purposes, the trained M-NSR was used to identify unforeseen test data that were corrupted with frequency-based white, brown and pink noise, each having a different Signal-to-Noise Ratio (SNR).
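As an illustration of this kind of test corruption (not the exact procedure used in the paper), the sketch below mixes white noise into a clean utterance at a chosen SNR, using the usual definition SNR_dB = 10*log10(P_signal/P_noise); brown and pink noise would be obtained by spectrally shaping the white noise (1/f^2 and 1/f power spectra, respectively).

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Corrupt a clean waveform with white noise scaled to the requested SNR (dB)."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(clean))   # white noise; filter it for brown/pink variants
    p_signal = np.mean(clean ** 2)            # power of the clean signal
    p_noise = np.mean(noise ** 2)             # power of the unscaled noise
    # Choose a gain so that 10*log10(p_signal / (gain**2 * p_noise)) equals snr_db.
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Hypothetical usage: corrupt one utterance at three SNR levels.
# noisy = {snr: add_noise_at_snr(clean_signal, snr) for snr in (10, 5, 0)}
```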

Since this paper studies the advantages of an MVML ANN-based ASR system over an MVSL one, an MVSL ANN-based ASR system was provided as the reference model; it was trained and tested with the same data and the same methodology. Finally, the results obtained from the proposed M-NSR model and the reference model were meticulously compared.

Section snippets

Related work

This section is divided into two parts. The first part briefly surveys the applications of MVSL ASR systems based on ANNs. The second part highlights some of the state-of-the-art approaches in building noise-robust speech recognisers.

Multi-networks speech recognisers

This section explains how M-N ANNs can be employed to formulate the M-NSR. The main drawback of ANN classifiers comprising only one neural network is that they learn all the views using a single ANN; thus, they are considered an MVSL approach. The consequence is that the monolithic ANN may fail to provide an accurate approximation of all the views, because the burden of functional complexity and the number of views are too heavy for one neural network.

On the other hand, M-N ANNs use
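To make this formulation concrete, the following is a minimal sketch of the one-network-per-word idea: each vocabulary word gets its own small ANN trained to respond with 1 for utterances of its word and 0 otherwise, and recognition selects the word whose network responds most strongly. The class name, the scikit-learn learners, and all parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

class MultiNetSpeechRecogniser:
    """One standalone ANN per vocabulary word (an MVML-style ensemble)."""

    def __init__(self, vocabulary, hidden_units=32):
        self.vocabulary = list(vocabulary)
        # One independent learner per view (word).
        self.nets = {w: MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=1000)
                     for w in self.vocabulary}

    def fit(self, features, labels):
        for word, net in self.nets.items():
            # Each net learns a single view: 1 for its own word, 0 for every other word.
            targets = np.array([1.0 if label == word else 0.0 for label in labels])
            net.fit(features, targets)

    def predict(self, feature_vector):
        # The word whose network produces the strongest response wins.
        scores = {w: net.predict(feature_vector.reshape(1, -1))[0]
                  for w, net in self.nets.items()}
        return max(scores, key=scores.get)
```

Approximating a per-word confidence with a regressor trained on 0/1 targets is only one possible choice; a per-word binary classifier with a probability output would serve the same purpose.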

Evaluation

This section explains the evaluation of the proposed approach. The ASR system identifies the digits 0–9 and eight additional common words such as “yes”, “no” and “help”; therefore, the vocabulary size is |W| = 18. The AN4 Lexicon database was used to provide the training, validation, and testing speech data. The training and validation speech materials were extracted from 74 male and female speakers (64 speakers for training and 10 for validation) as well as the acoustic samples of another 10 participants of both

Results and discussion

Three sets (K=1, 2, 3) of ten experiments were conducted on unforeseen test data in order to perform 3-fold cross-validation. For each K-fold, nine experiments were designed to study the performance of the proposed approach with a different type and level of noise. Additionally, one experiment was conducted on unforeseen, clean test data. Table 4 shows the recognition rates, and Table 5 depicts the NRMSEs of these experiments in comparison with the results of the reference model. Moreover, the
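For readers who wish to recompute the headline metrics, the sketch below shows one conventional way to obtain a recognition rate and a range-normalised RMSE for a fold; the exact normalisation behind the NRMSE values in Table 5 is not reproduced here, so the max-to-min range convention is an assumption.

```python
import numpy as np

def recognition_rate(predicted_words, actual_words):
    """Fraction of test utterances whose word was identified correctly."""
    predicted_words = np.asarray(predicted_words)
    actual_words = np.asarray(actual_words)
    return float(np.mean(predicted_words == actual_words))

def nrmse(outputs, targets):
    """RMSE normalised by the range of the target values (one common convention)."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    rmse = np.sqrt(np.mean((outputs - targets) ** 2))
    return float(rmse / (targets.max() - targets.min()))
```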

Conclusion

This paper studies the application of M-N ANNs to provide a noise-robust ASR system without requiring noise pre-processing or post-processing. An M-NSR based on the MVML approach is proposed, which improves upon the recognition rate of MVSL ANN-based speech recognisers in noisy environments; the M-NSR employs standalone ANNs as multiple learners, each of which represents a word of the vocabulary as a different view. The proposed ASR system was evaluated with unforeseen testing data that were corrupted with

Acknowledgement

This research is supported by UM High Impact Research Grant UM.C/HIR/MOHE/FCSIT/05 from the Ministry of Higher Education Malaysia.


References (40)

Cited by (50)

  • A judgment-based model for usability evaluating of interactive systems using fuzzy Multi Factors Evaluation (MFE)

    2022, Applied Soft Computing
    Citation Excerpt :

    The effect of usability metrics for multi-model ASR is determined through a FIS (phase 1). Active learning methods can be divided into four categories: (1) single-view, single-learner (SVSL); (2) single-view, multi-learner (SVML); (3) multi-view, single-learner (MVSL); and (4) multi-view, multi-learner (MVML) [54]. Multi-model ASR is based on the AMLMs (SVSL, SVML, MVSL, and MVML) and acts like four separate interactive systems.

  • Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process

    2021, Engineering Applications of Artificial Intelligence
    Citation Excerpt :

    This consequently affects the overall performance of the system. While different approaches have addressed the question of solving or mitigating the errors produced in the ASR-M (Fernández-Díaz and Gallardo-Antolín, 2020; Graves et al., 2013; Squartini et al., 2012; Zhou et al., 2014; Trentin and Matassoni, 2003; Hannun et al., 2014; Shahamiri and Salim, 2014; Salem et al., 2007; Amrouche et al., 2010), only a few papers analyze the impact of these errors in subsequent components. Voleti et al. (2019) analyzed the effects of word substitution errors on sentence embeddings, and Simonnet et al. (2018) measured the impact of word substitution errors produced by ASR-M on NLU-M. Nevertheless, the question of the relationship between the different types of ASR-M errors and their influence on the EOTD-M has not been addressed.

  • Machine Learning for the prediction of the dynamic behavior of a small scale ORC system

    2019, Energy
    Citation Excerpt :

    The choice of the number of neurons in the hidden layer represents an unsolved question. Most of the scientists use empirical methods to define them, which lead to the selection of a number of hidden neurons comprehended between the number of the input nodes and that of the output nodes [22]. Another factor that affects the choice of the number of neurons in the hidden layer is the number of instances available in the training set.

  • A novel architecture combined with optimal parameters for back propagation neural networks applied to anomaly network intrusion detection

    2018, Computers and Security
    Citation Excerpt :

    Also, we have applied a novel rule for calculating the number of nodes of the hidden layer for generating new architectures of the neural networks of some IDSs. This method has been used in the work of Shahamiri and Salim (2014) to build a neural network for an Automatic Speech Recognition (ASR) system, but as we know, it is not exploited previously in an anomaly-based intrusion detection system. Further information about this novel rule is given in section 1.2


Seyed Reza Shahamiri received the B.S. and M.S. degrees in software engineering from Islamic Azad University, Esfahan, Iran in 2004 and 2007, respectively, and the Ph.D. in computer science from Universiti Teknologi Malaysia, 2011. He is currently a senior lecturer at the Department of Software Engineering, University of Malaya. His research interests include artificial neural networks and pattern recognition, software testing and reliability, automated speech recognition, and software engineering. He is also a certified software tester from Malaysian Software Testing Board (MSTB), a member of International Software Testing Qualifications Board (ISTQB).

Siti Salwah Binti Salim is a professor at the Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya. She holds a Ph.D. degree in computer science from the University of Manchester Institute of Science and Technology (UMIST), United Kingdom, 1998. She supervises Ph.D. and masters students in the areas of requirements engineering, human computer interaction, automatic speech recognition, component based software development and e-learning. She also leads and teaches modules at both B.Sc. and M.Sc. levels in software engineering.

1 Tel.: +60 3 79676300; fax: +60 3 79579249.
