A novel acoustic scene classification model using the late fusion of convolutional neural networks and different ensemble classifiers
Introduction
Acoustic scene classification (ASC) is the task of classifying environments according to their sound characteristics. The scene, in this context, refers to the acoustic environment summarised as one situation, such as “restaurant” or “office”. Acoustic scenes can be pre-recorded or live-streamed audio [1]. ASC plays an important role in many areas, such as context awareness in smart devices, hearing aids, robots, and many other applications [2]. However, there is still a need for high-performance ASC models [3], and many algorithms and methods have therefore been developed to improve ASC accuracy.
Many methods have been proposed for ASC, most of them based on CNNs [2], [4]. Among these, early fusion CNN models have shown high accuracy. Fusion means combining two or more characteristics at the same time. In terms of modelling, fusion can be classified as early or late. Early fusion refers to combining features obtained by more than one method, or to related steps such as refining the frequency resolution, before model training begins. Recently, early fusion models have been used extensively for ASC. For example, Yang et al. [5] used multistage feature extraction fusion for ASC. Su et al. [6] used aggregated feature extraction for ASC. Zhang et al. [7] used fine-resolution frequency features for ASC. Mulimani et al. [8] used Fisher vectors for feature extraction in ASC.
Late fusion refers to combining the results of different models after each model has been built separately [9]. Late fusion models have recently been used in many areas because of their higher predictive power compared with individual models. For example, they have shown higher predictive power than early fusion models in semantic video analysis [10]. Late fusion can be achieved by combining a CNN model with other models such as SVMs, or by combining different CNN models that use different feature extraction methods [11]. It has also been used for emotion recognition from audio-visual data [12], [13] and for human activity recognition [14]. However, late fusion between CNNs and different ensemble classifier models has not previously been applied to ASC problems.
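As an illustration of the late fusion idea described above, the following minimal sketch combines the class-probability outputs of two separately built models and then picks the most likely scene. The function names, the example scores, and the choice of fusion rules (class-wise product or mean) are assumptions made for illustration only; they are not taken from the cited studies.

```python
from math import prod


def late_fusion(prob_sets, rule="product"):
    """Fuse per-model class-probability vectors into one vector.

    prob_sets: list of equal-length probability lists, one per model.
    rule: "product" multiplies scores class by class; "mean" averages them.
    Both rules are illustrative assumptions, not a prescribed method.
    """
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all models must score the same classes")
    if rule == "product":
        fused = [prod(p[c] for p in prob_sets) for c in range(n_classes)]
    elif rule == "mean":
        fused = [sum(p[c] for p in prob_sets) / len(prob_sets)
                 for c in range(n_classes)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    total = sum(fused)
    # Renormalise so the fused scores sum to 1 (skipped if all are zero).
    return [f / total for f in fused] if total > 0 else fused


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores for three scenes: office, restaurant, park.
cnn_probs = [0.50, 0.40, 0.10]
ensemble_probs = [0.20, 0.70, 0.10]

fused = late_fusion([cnn_probs, ensemble_probs])
label = predict(fused)
```

In this made-up example the first model alone would choose the first scene, while the fused scores favour the second scene, because the second model is far more confident about it. Each model is trained and tuned independently; only their outputs are combined.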
Most studies that optimise ASC models rely on the early fusion of features before they are passed to CNN models. It is hypothesised that late fusion of different models could yield higher predictive power than a single model. Therefore, this study proposes a late fusion model combining a CNN with ensemble classifier models. Different ensemble classifiers are studied, and their accuracy when fused with the CNN is presented. The results help to improve the prediction accuracy of ASC.
Section snippets
Data source
The TUT Acoustic Scenes 2017 challenge dataset was used [15]. A description of the acoustic scenes included in the dataset can be found at http://www.cs.tut.fi/sgn/arg/dcase2016/acoustic-scenes#library. The dataset consists of various acoustic scenes recorded at distinct locations. Each acoustic scene has 312 segments of noise samples for training and 108 for testing. For each original recording location, a 3–5-minute-long audio recording was captured.
Acoustic scene types
Acoustic scenes were
Overview
Fig. 1 shows the procedure of the proposed late fusion model. The data were first loaded and then split into 10-s segments. Features were then extracted as Mel-spectrograms for the convolutional neural networks (CNNs) and by wavelet scattering for the ensemble classifiers. Hyper-parameters were then tuned separately for the CNN and the ensemble classifier models. The CNN and ensemble classifier models were then fused to maximise the accuracy obtained. Each model (i.e. the CNN, ensemble
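The segmentation step in the overview can be sketched as follows. This is a minimal illustration, assuming the recording is already available as a flat sequence of samples; the function name, the 100 Hz sample rate of the dummy signal, and the decision to drop a trailing partial segment are all assumptions, not details from the study.

```python
def split_into_segments(samples, sample_rate, segment_seconds=10):
    """Split a recording into consecutive fixed-length segments.

    Any trailing samples that do not fill a whole segment are dropped
    (an assumption made for this sketch).
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    seg_len = int(sample_rate * segment_seconds)
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


# Dummy 35-second recording at a hypothetical 100 Hz sample rate.
recording = [0.0] * (35 * 100)
segments = split_into_segments(recording, sample_rate=100)
```

With these made-up values the recording yields three full 10-second segments and the last 5 seconds are discarded. Each segment would then go through feature extraction separately for the two model families.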
Results of convolutional neural network (CNN) models
Fig. 4 shows the accuracy and loss at each iteration of the fifteen training epochs. Both the accuracy and the loss change little after epoch 8. The final overall average accuracy of the CNN was 72.9% (SD ±20%). Fig. 5a shows the confusion matrix of the CNN model for all included acoustic scenes.
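One common way to obtain an average accuracy and its standard deviation from a confusion matrix is to compute the accuracy of each class and then summarise across classes. The sketch below assumes that rows hold the true class and columns the predicted class, and that the sample standard deviation is used; both conventions, the function name, and the numbers are illustrative assumptions and are not the study's data.

```python
from statistics import mean, stdev


def per_class_accuracy(confusion):
    """Return the fraction of correctly classified items for each class.

    confusion[i][j] counts items of true class i predicted as class j
    (row/column convention assumed for this sketch).
    """
    accuracies = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[i] / total if total else 0.0)
    return accuracies


# Made-up confusion matrix for three scenes.
confusion = [
    [90, 10, 0],
    [20, 70, 10],
    [5, 15, 80],
]

accs = per_class_accuracy(confusion)
avg_accuracy = mean(accs)   # average over classes
sd_accuracy = stdev(accs)   # sample standard deviation over classes
```

A large standard deviation relative to the mean indicates that some scenes are classified much better than others, which is the kind of spread a per-scene confusion matrix makes visible.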
Results of ensemble classifier models and their fusion with CNNs
Different ensemble classifiers were run, as shown in Table 2. The average accuracy of the ensemble classifier models ranged from 40.4% to 76.5%. The results of the fusion of these
Conclusion
Accurate acoustic scene classification (ASC) models are valuable in many areas. This study presented an enhanced ASC model based on the late fusion of convolutional neural networks (CNNs) and ensemble classifiers. The results showed that the late fusion model achieved higher ASC accuracy than the individual CNN or ensemble classifier models. On average, the accuracy of the fusion model was 10% higher than that of the CNN model. A
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (32)
- et al. Two-level fusion-based acoustic scene classification. Appl Acoust (2020)
- et al. Acoustic event recognition using cochleagram image and convolutional neural networks. Appl Acoust (2019)
- et al. Multi-scale semantic feature fusion and data augmentation for acoustic scene classification. Appl Acoust (2020)
- et al. Acoustic scene classification using deep CNN with fine-resolution feature. Expert Syst Appl (2020)
- et al. Robust acoustic event classification using fusion fisher vector features. Appl Acoust (2019)
- et al. Late fusion framework for Acoustic Scene Classification using LPCC, SCMC, and log-Mel band energies with Deep Neural Networks. Appl Acoust (2021)
- et al. An efficient model-level fusion approach for continuous affect recognition from audiovisual signals. Neurocomputing (2020)
- et al. The effect of type and level of background noise on food liking: A laboratory non-focused listening test. Appl Acoust (2021)
- et al. Subjective responses to wind farm noise: A review of laboratory listening test methods. Renew Sustain Energy Rev (2019)
- An artificial neural network model for predicting the performance of thermoacoustic refrigerators. Int J Heat Mass Transf (2021)
- Acoustic scene classification: An overview of DCASE 2017 challenge entries
- Machine learning in acoustics: Theory and applications. J Acoust Soc Am
- Performance analysis of multiple aggregated acoustic features for environment sound classification. Appl Acoust
- Late fusion via subspace search with consistency preservation. IEEE Trans Image Process
- Early versus late fusion in semantic video analysis
- Multitask learning and multistage fusion for dimensional audiovisual emotion recognition. IEEE