Elsevier

Applied Acoustics

Volume 175, April 2021, 107829
Applied Acoustics

A novel acoustic scene classification model using the late fusion of convolutional neural networks and different ensemble classifiers

https://doi.org/10.1016/j.apacoust.2020.107829Get rights and content

Highlights

  • Enhanced fusion model for classifying environmental acoustic scenes is proposed.

  • This model makes use of convolutional neural networks (CNNs) and ensemble classifiers.

  • The fusion model had a 10%-increase in accuracy compared to the CNN model.

  • The model outperformed previous CNN models of acoustic scene classification.

  • The results help to improve environmental acoustic scene classification.

Abstract

Recent evidence suggests that convolutional neural networks (CNNs) can model acoustic scene classification (ASC) with high accuracy. Ensemble classifiers have also shown high accuracy in different machine learning areas. However, little is known about fusion models between CNNs and different ensemble classifiers for ASC. This study presents an enhanced CNN classification model using the late fusion between CNNs and ensemble classifiers to predict different classes of acoustic scenes. A CNN model was first built to classify fifteen acoustic scene environments. Different ensemble classifier models were then used for this classification problem. Late fusion of CNN and ensemble classifier models was then applied. The results showed that late fusion models have higher classification accuracy, as compared to individual CNN or ensemble classifier models. The best model was obtained by fusion of the CNN and discriminant random subspace classifier with an increase in the average accuracy of 10% as compared to the average accuracy of the CNN model. When compared with previous research on ASC, the late fusion model between CNN and ensemble classifiers showed higher accuracy. Therefore, this method has robust applicability for future ASC problems.

Introduction

Acoustic scene classification (ASC) is the way to classify different environments depending on their sound characteristics. The scene, in this context, refers to the acoustic environment summarised in one situation such as “restaurant” or “office”. Acoustic scenes could be pre-recorded or live streaming audio [1]. ASC plays an important role in many areas, such as context awareness in smart devices, hearing aids, robots, and many other applications [2]. However, there is a need for high-performance ASC models [3]. Therefore, many algorithms and methods have been developed to achieve accurate ASC models.

There have been many methods for ASC, mostly using CNNs [2], [4]. However, early fusion CNN models showed high accuracy for ASC. Fusion means combining one or more characters at the same time. In terms of modelling, fusion can be classified into early and late fusion. Early fusion could refer to combining the features using more than one method or other concepts such as refining frequency resolution before beginning model training. Recently, early fusion models have extensively been used for ASC. For example, Yang et al. [5] used multistage feature extraction fusion for ASC. Su et al. [6] also used aggregated feature extraction for ASC. Zhang et al. [7] used fine-resolution frequency for feature selection of ASC. Mulimani et al. [8] also used fisher vector for feature extraction of ASC.

Late fusion refers to combining the results of different models after building each model separately [9]. Recently, late fusion models were used in many areas because of their higher predictability as compared to individual models. For example, they have shown higher predictability than early fusion models when used for semantic video analysis [10]. Late fusion can be achieved by combining CNN model with other models such as SVM or different CNN models with different feature extraction methods [11]. Recently, it was also used for emotion recognition for audio-visual data [12], [13]. They were also used for recognising human activity [14]. However, the use of late fusion models for ASC has not been applied before between CNN and different ensemble classifier models for ASC problems.

Most studies optimising ASC models are based on the early fusion of feature characteristics before using them in CNN models. It is hypothesised that late fusion of different models could yield higher predictive power, as compared to when using only one model. Therefore, this study proposes a late fusion model between CNN and ensemble classifier models. Different ensemble classifiers are studied and their accuracy, when fused with CNN, is also presented. The results help to improve ASC predicted accuracies.

Section snippets

Data source

The dataset of TUT Acoustic scenes 2017 challenge was used [15]. A description of acoustic scenes included in the dataset can be found through http://www.cs.tut.fi/sgn/arg/dcase2016/acoustic-scenes#library. The dataset consists of various acoustic scenes recorded from distinct locations. Each acoustic scene has 312 segments for training noise samples and 108 for testing noise samples. For each original recording location, a 3–5-minute-long audio recording was captured.

Acoustic scene types

Acoustic scenes were

Overview

Fig. 1 shows the proposed late fusion model procedures. Data was first entered and was then split into 10-sec segments. Feature extraction was then done by applying Mel-spectrograms to convolutional neural networks (CNNs) and wavelet scattering for ensemble classifiers. Hyper-parameter tuning was then done for CNN and ensemble classifier models separately. The fusion of CNN and ensemble classifier models was then applied to maximise the accuracy obtained. Each model (i.e. the CNN, ensemble

Results of convolutional neural network (CNN) models

Fig. 4 shows the accuracy and loss during the training process for each iteration of the fifteen epochs included. The accuracy and loss are almost the same after epoch 8. The final overall average accuracy of CNN was 72.9% with SD ±20%. Fig. 5a shows the confusion matrix of CNN models for all included acoustic scenes.

Results of ensemble classifier models and their fusion with CNNs

Different ensemble classifiers were run as shown in Table 2. The average accuracy of ensemble classifier models ranged between 40.4 and 76.5%. The results of the fusion of these

Conclusion

Accurate acoustic scene classification (ASC) models are of great help in many areas. This study presented an enhanced model for ASC by the late fusion of convolutional neural networks (CNNs) and ensemble classifiers. The results showed that the late fusion model had a higher accuracy for ASC, compared to the individual convolutional neural network (CNN) or ensemble classifier models. This fusion model had an average increase in accuracy of 10% as compared to the CNN model average accuracy. A

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (32)

  • A. Mesaros et al.

    Acoustic scene classification: An overview of dcase 2017 challenge entries

  • M.J. Bianco et al.

    Machine learning in acoustics: Theory and applications

    J Acoust Soc Am

    (2019)
  • Y. Su et al.

    Performance analysis of multiple aggregated acoustic features for environment sound classification

    Appl Acoust

    (2020)
  • X. Dong et al.

    Late fusion via subspace search with consistency preservation

    IEEE Trans Image Process

    (2019)
  • G. Li et al.

    Early versus late fusion in semantic video analysis

  • B.T. Atmaja et al.

    Multitask learning and multistage fusion for dimensional audiovisual emotion recognition

    IEEE

    (2020)
  • Cited by (21)

    • Acoustic scene classification: A comprehensive survey

      2024, Expert Systems with Applications
    • Deep mutual attention network for acoustic scene classification

      2022, Digital Signal Processing: A Review Journal
      Citation Excerpt :

      All these TFFs contain valuable information and thus can be utilized in a complementary way to improve features' discriminative ability. Following this idea, researchers [21,24–26] have proposed some fusion solutions to learn the deep representations from various TFFs and combine them in different manners. The strategies of deep representations fusion can effectively improve the performance of ASC systems and therefore has received much interest from the community [27–29].

    • Late fusion for acoustic scene classification using swarm intelligence

      2022, Applied Acoustics
      Citation Excerpt :

      For instance, Paseddula et al. [25] perform decision level DNN score fusions using the weighted sum rule for improving the performance. Alamir et al. [26] create a late fusion model based on CNNs and ensemble classifiers by multiplying different classifier responses to enhance the CNN classification model. Both of them obtained higher performance than individual CNN or DNN models.

    View all citing articles on Scopus
    View full text