1 Introduction

Medical diagnosis often relies on audio or visual information gathered through different methods, and a medical specialist interprets these data. Becoming such a specialist takes skill, experience, and time. The World Health Organization (WHO) has published statistics [2] on the number of physicians available per 1000 population: around 45% of WHO member states have fewer than 1 physician per 1000 people. With physicians overbooked and diagnosis time limited, the risk of misdiagnosis grows. There is therefore a need for ways to help medical specialists save time while making accurate diagnoses, and automatic, computer-aided diagnosis can help achieve this goal.

Respiratory diagnosis is typically based on audio samples collected by a specialist from various parts of the body using a dedicated instrument (e.g., sonography, stethoscope). Listening to these sounds (normal, wheeze, crackle, etc.) supports the diagnosis of disease [43][63], which shows that audio samples are useful for classifying respiratory diseases. Heart-related disorders are one of the leading causes of death around the globe [1], and many studies have been published on the audio analysis of heart-related problems. A review paper [17] on signal processing techniques concludes that the gathering, analysis, and classification of heart sounds is required for better diagnosis. As Machine Learning and Deep Learning techniques are deployed in the medical sector, diagnosis has become more accurate and considerably faster [16][10]. This recent approach appears viable for improving the performance of systems aimed at diagnosing disease at an early stage, and it could also help doctors and nursing staff prepare reliable and faster diagnosis reports [68].

Breath sounds may be reduced and expiration may be prolonged in COPD patients [70]. In patients with COPD, particularly those with chronic bronchitis, coarse crackles may be heard at the start of inspiration [20]. These crackles have a “popping” quality, vary in quantity and timing, and can be heard in any part of the lungs [35]. These early inspiratory crackles can also be heard during expiration, and coughing can make them disappear [35]. They are caused by boluses of gas passing through an intermittently obstructed airway [20]. Normal breath sounds are characterised by a low-pitched noise during inspiration and are barely discernible during expiration [64]. They have no distinct peaks and are not melodic [56]. The lobar and segmental airways create the inspiratory component of the sound, whereas the more proximal regions produce the expiratory component. Normal lung sounds are thought to be caused by air turbulence [56].

The International Conference on Biomedical and Health Informatics (ICBHI) organized a challenge aimed at the automatic identification of wheezes and crackles in respiratory audio samples. For this task, a corpus was built [66] consisting of audio samples from healthy people and from patients with seven respiratory pathologies: asthma, chronic obstructive pulmonary disease (COPD), lower respiratory tract infection (LRTI), pneumonia, bronchiolitis, upper respiratory tract infection (URTI), and bronchiectasis. A detailed description of the selected dataset is given in the dataset section. The leading cause of the pathologies in the corpus is smoking; apart from smoking, some are genetic while others are caused by environmental factors [22].

Chronic Obstructive Pulmonary Disease (COPD) is very similar to asthma, causing the same effects such as shortness of breath and cough, which makes its detection difficult [72]. It is caused mainly by smoking, and its symptoms can also be misinterpreted as signs of old age. Upper Respiratory Tract Infection (URTI) is common around fall and winter but can occur at any time. It is caused mainly by viruses [38], and its symptoms are often confused with pneumonia [22]. People infected with pneumonia mostly recover in a short time, but recovery can be difficult for some and the disease can be fatal in certain cases, so its detection is necessary.

Bronchiectasis is considered a chronic illness. In this condition the airways of the lungs widen, the air passages are damaged, and mucus and bacteria build up in the lungs, causing airway blockage. It can often be confused with bronchiolitis or the common cold [77], but there is a key difference: bronchiolitis mostly affects young children and can be cured, whereas bronchiectasis is a chronic disease.

The symptoms of Lower Respiratory Tract Infection (LRTI) depend on the progression of the disease. At early stages the symptoms are similar to bronchiolitis and bronchiectasis, while later stages can lead to a pneumonia-like condition.

Recent advances in technology and the availability of data have enabled research on such conditions. The Convolutional Neural Network (CNN), a Deep Learning architecture, has helped with the segmentation and classification of medical and biological data [39][71], brain tumor classification [58], and the prostate [40], to name a few. In this paper, we propose a new way of preprocessing the audio files combined with a CNN architecture to classify the pathologies. The contribution of this work is a new approach to classifying audio samples. A Deep Learning approach using a convolutional neural network was used in [59] and achieved an accuracy of 83%. We propose a new convolutional neural network architecture along with a new data preprocessing approach and compare the results with the previous work. We use a matrix of MFCC features fed to a 1D CNN, which is able to capture all the important features: with a kernel size of 1, no important feature is left behind. The main targets of this work are as follows:

1.1 Contribution

Based on the above discussion, we propose a solution that can overcome the limitations of previous research and produce better, more accurate results.

  1. A novel approach incorporating feature transformation by combining the preprocessing steps MFCC, Melspectrogram, and Chroma CENS with a CNN.

  2. Incorporation of future deep learning feature extraction methods, such as d-vectors and i-vectors, for better audio signal representations.

  3. Analysis on the ICBHI dataset (chronic, non-chronic, and normal classes) with a comparison of the three features (MFCC, Melspectrogram, and Chroma CENS).

The proposed research has numerous practical applications in medical science and offers a low-cost solution if deployed at the hospital level. It has the potential to work in a semi-supervised way, which can make healthcare more accessible as well as cost-effective. This work incorporates more information from the audio because it depends on three features rather than one. Deploying this service could let the general public use the technology for free: they would only need to record their respiratory sound with an instrument and use the deployed service to obtain a diagnosis.

1.2 Paper organization

The paper is divided into seven sections. Section 2 gives detailed information about previous approaches and related work in the field of respiratory sound classification. Section 3 describes the dataset used for the evaluation of the proposed approach. Section 4 presents the proposed research work. Section 5 describes the execution and implementation scenario, including the machine on which the algorithm was carried out. Section 6 discusses the experiments and results and provides a comparison table showing the performance of different approaches. Section 7 presents the conclusion and future work for the proposed research.

2 Related work

Various studies have addressed this challenge from the perspective of audio processing. Pasterkamp et al. [57] played a major role in reporting the types and characteristics of respiratory sounds. The presence of wheezes and crackles has helped many research works classify different categories of diseases. It is also not necessary to consider differences in gender or age in this task of respiratory analysis [24][18]. These respiratory sound samples are subject to noise such as artefacts and heart sounds [53].

Many previous studies on respiratory pathology analysis have used machine learning and various signal processing algorithms [52]. A review published in 2007 [62] highlighted research aiming to identify markers such as wheezes and crackles. Once these markers are identified, the problem becomes a classification task [12] with which machine learning can help. Islam et al. [30] were able to detect asthma from lung sound samples with the help of wheezes. They used samples collected from the backs of 60 individuals, half of whom had asthma. The classification was done with an Artificial Neural Network (ANN) and with a Support Vector Machine (SVM); the best accuracy, around 93%, was obtained with the SVM. Other research on the diagnosis of wheezes and crackles [25] used different neural network configurations and obtained an accuracy of 93% for crackles and 91.7% for wheezes. Bardou et al. [7] used a dataset with seven classes: stridor, squeak, polyphonic wheeze, monophonic wheeze, fine crackle, coarse crackle, and normal. The best classification results were obtained with CNNs.

Yunseo et al. [36] dealt with a similar problem. The daily number of patients with respiratory diseases in Seoul was gathered, and the daily number of patients treated for respiratory disorders per 10,000 residents was predicted using meteorological and air-pollution parameters. To determine the relevance of feature selection, they used a relief-based feature selection method. Two alternative prediction models were developed using gradient boosting and Gaussian process regression (GPR).

Mahmoud et al. [4] found that by averaging the predictions of many models trained and assessed individually on distinct sound types, simple binary classifiers may reach an AUC of 96.4% and an accuracy of 96%. The goal of that research was to highlight the relevance of the human voice, as well as other respiratory noises, in sound-based COVID-19 diagnosis.

Eu Sun Lee et al. [37] used a machine learning approach as an AI tool to examine the associations between weather and air pollution factors and respiratory illness patients attending EDs, based on the consistently reported and systemized data registry of national emergency medical facilities. The study used data from the three days before a visit, as well as the daily temperature difference and other information, to estimate the exact values of meteorological conditions and the magnitude of their effect.

Chen et al. [13] proposed a ResNet architecture for the classification of Optimized S-Transform (OST) based feature maps of wheezes, crackles, and normal sounds. The RGB maps of the features were fed into the ResNet architecture for classification. The results were compared with ResNet on Short Term Fourier Transform features (ResNet-STFT) and ResNet on S-transform features (ResNet-ST), with the best results obtained by ResNet-OST. Jakovljevic et al. [31] proposed an algorithm to classify audio samples from the ICBHI corpus [66] into wheezes, crackles, both wheezes and crackles, and normal. Their methodology consisted of noise suppression using spectral subtraction, after which features were extracted and Hidden Markov Models were used for classification, obtaining an average of specificity and sensitivity of 39.56%. Perna et al. [59] used a CNN to classify the different pathologies present in the ICBHI corpus [66]. MFCC features were extracted from the audio and fed into a 2D CNN architecture for classification. They compared the effects of different activation functions and of the SMOTE and RUS sampling techniques, and obtained an accuracy of 83%, an F1 score of 0.84, and a recall of 0.82. Perna et al. [60] introduced a Long Short Term Memory (LSTM) based approach for the classification of respiratory pathologies. The authors used MFCC features to classify the audio samples and reached an accuracy of about 99% when the normal and abnormal classes were considered. The LSTM was further used to classify three classes (chronic, non-chronic, and normal) and four classes (normal, wheeze, crackle, and both), reaching accuracies of about 98% and 74% respectively, with an F1 score of 0.91 and a recall of 0.82 for the three-class case. Garcia et al. [21] proposed a CNN architecture using the melspectrogram image as a feature. The melspectrograms were classified into three classes (chronic, non-chronic, and normal) with a CNN model, reaching an accuracy of about 99% on the three classes, an F1 score of 0.900, and a recall of 0.986.

Guler et al. [26] studied power spectral density features of respiratory sounds, which were classified into crackles, wheezes, and normal respiratory sounds. Electret microphones were used to record the respiratory sounds of 129 subjects. For the classification, an artificial neural network (ANN) and a genetic algorithm (GA) based ANN were used, with accuracies of 81-91% and 83-93% respectively.

Alsmadi et al. [3] proposed an autoregressive model for the classification of respiratory sounds. An ECM-77B microphone was used to record the respiratory sounds of 43 subjects, and the k-nearest neighbour (k-NN) algorithm was used for classification, leading to a recognition rate of 96%. Dockur et al. [15] proposed an incremental supervised neural network for the classification of respiratory sounds. For the evaluation they had 18 subjects and used power spectrum features, after which a grow and learn (GAL) network was used as the incremental supervised network and their approach was compared with previous ones. Sankar et al. [67] studied a feedforward neural network approach for this task. The features they used were respiratory rate, energy index, strength of the dominant frequency, and the dominant frequency itself. An electret microphone was used to record the respiratory sounds of six subjects, leading to a classification accuracy of 98.7%.

Hashemi et al. [29] highlighted the use of wavelet-based features along with a multilayer perceptron network for this classification task. An electronic stethoscope was used to record the respiratory sounds of 140 subjects, and the approach led to a recognition rate of 89.28%. Flietstra et al. [19] proposed a support vector machine (SVM) for this problem. An STG 16 lung sound analyser was used to record the respiratory sounds of 257 subjects, and they obtained a mean classification accuracy of 84%. Palaniappan et al. [54] presented a comparative study of SVM and k-NN for the classification of respiratory sounds. They evaluated their approach on the R.A.L.E. dataset [52] using MFCC features along with a one-way ANOVA test [44] and reached classification accuracies of about 92.1% and 98.26% for SVM and k-NN respectively.

Gadge et al. [50] studied the analysis of respiratory sounds and developed a MATLAB-based tool that helps filter heart sounds out of acoustic pulmonary sounds. Mhetre et al. [47] developed a tool for plotting and viewing spectrograms of respiratory sounds. Lin et al. [41] proposed a neural network architecture to classify wheeze and normal breath sounds using an average truncate method. An ECM microphone was used to record the respiratory sounds of 58 subjects, and they reported a specificity of 1 and a sensitivity of 0.946. Umeki et al. [73] proposed a hidden Markov model (HMM) for classifying normal and abnormal respiratory sounds with the help of the respiratory rate derived from breath sounds, reaching a classification accuracy of about 83.7%. Maruf et al. [45] presented SVM, Gaussian mixture model (GMM), and ANN classifiers for normal versus crackle respiratory sounds using spatial-temporal features, reporting classification accuracies of 92.6%, 85.3%, and 97.56% for SVM, ANN, and GMM respectively. Lin et al. [42] developed a recognition system capable of detecting wheeze with a back-propagating neural network based on spectrogram features, which reached a specificity of 1 and a sensitivity of 0.946.

Yadav et al. [75] described a machine learning-based method for determining whether a recorded pulmonary signal is normal or abnormal. They extracted wavelet coefficients, which were fed into an SVM classifier that efficiently classifies pulmonary sounds as normal or abnormal. The experimental findings indicate an accuracy of 92.30% for classifying pulmonary tones.

Goudarzi et al. [23] proposed a novel approach based on recurrent fuzzy functions. The dataset used had 31 asthma, 27 COPD, and 25 normal patients, and with 10-fold cross validation an accuracy of less than 80% was reported. Naves et al. [49] used higher-order statistical features extracted from respiratory sounds, although the number of samples used in that work was too limited to draw firm conclusions. They used two classifiers, Naive Bayes and k-NN, along with Fisher's discriminant analysis for dimensionality reduction, and reached an accuracy of about 94.6%. Palaniappan et al. [51] presented a reliable SVM classifier based on parametric feature extraction, reaching accuracies of 89.68% and 88.72% for MFCC and AR coefficients respectively. Chambers et al. [8] proposed looking at a trend for a patient rather than focusing only on the audio content of every respiratory cycle; their macro-level approach gave an accuracy of about 85%. Chinazunwa et al. [74] proposed a mobile application to classify respiratory sounds using machine learning methods, reaching accuracies of about 88.9%, 75.8%, and 86.7% with k-NN, SVM, and random forest respectively.

Murat et al. [6] proposed a convolutional neural network for this task and compared their approach with an SVM approach. The classification covered several classes such as rhonchus, rale, normal, and singular respiratory sounds; the maximum accuracy was 86% for both SVM and CNN in healthy versus pathological classification. Chen et al. [11] proposed the classification of rale, rhonchus, wheeze, and normal sounds using MFCC features and a k-NN classifier, reaching an accuracy of about 93.2% with data from 140 subjects. Aras et al. [5] used data from 27 subjects and obtained an accuracy of about 96% for the classes rale, rhonchus, and normal with k-NN and with MFCC and LFCC as features. Serbes et al. [69] used data from 26 patients and classified crackles with an accuracy of 97.2% using an SVM classifier and wavelet transform feature extraction. Yamashita et al. [76] proposed an HMM-based approach to classify normal and emphysema sounds using data from 114 patients, reaching a maximum accuracy of 88.7% with segmentation of the audio samples. Feng et al. [32] used data from 21 subjects to classify normal and abnormal sounds with a temporal-spectral dominance spectrogram as the feature extraction method and k-NN as the classifier, reaching an accuracy of 92.4%.

Charleston et al. [9] proposed the use of AR model features with a multilayer perceptron classifier on data from 27 subjects, reaching accuracies of 75% and 93% for the normal and abnormal classes. Yamamoto et al. [46] used raw data from 114 subjects to build a predictor with 84.2% accuracy for normal and abnormal sounds with an HMM classifier. Kahya et al. [33] proposed k-NN with wavelet transform features extracted from the sounds of 20 different subjects, reaching an accuracy of 46%. Riella et al. [65] proposed FFT and STFT features along with a multilayer perceptron to reach an accuracy of 92.8%, classifying only wheeze sounds. Kandaswamy et al. [34] proposed a multilayer perceptron with wavelet transform and STFT features, reaching an accuracy of 94.2% for lung sounds. Nishi et al. [27] proposed an SVM classifier using MFCC features for detecting COPD from lung sounds, reaching an accuracy of about 98.2%. From our literature survey we found that the classification varied across different features and classifiers; to address this, our proposed approach includes three features together for classification along with a convolutional neural network. A comparative analysis is shown in Table 1 for a better understanding of the approaches, methodologies, and classification classes used by different authors, along with the advantages and limitations of the previous approaches.

Table 1 A comparative analysis for better understanding of approaches and the classification done by researchers

3 Dataset details

The International Conference on Biomedical and Health Informatics (ICBHI) corpus [66] consists of a total of 5.5 hours of recordings containing 6898 respiratory cycles from 126 patients. The categories are Chronic Obstructive Pulmonary Disease (COPD), Upper Respiratory Tract Infection (URTI), Healthy, Pneumonia, Bronchiectasis, and Bronchiolitis. The number of audio files present in the corpus is given in Table 2.

The respiratory cycles are annotated by domain experts to indicate the presence of respiratory pathologies. Each annotation includes the beginning and end of the respiratory cycle as well as the presence of crackles and wheezes. The audio recordings were gathered using various equipment, with durations ranging from 1 s to 90 s, an average duration of 2.7 s, a standard deviation of 1.17 s, and a median of about 2.54 s. The data were obtained from 7 different locations on the patient's chest: right and left anterior, right and left lateral, trachea, and right and left posterior.

For comparison purposes, the dataset has been divided into 3 parts: Chronic, Non-Chronic, and Healthy. The Chronic part contains COPD, Bronchiectasis, and Asthma, while the Non-Chronic part contains Bronchiolitis, URTI, LRTI, and Pneumonia. This allows a proper evaluation of our proposed method with respect to the previous approach in [59].

4 Proposed research work

In our proposed approach, highlighted in Fig. 1, we first augment the data using a delay function that adds a delayed copy of the audio to the original audio. The delay function adds delays of 250 ms, 500 ms, 750 ms, and 1000 ms, so we are able to produce 4 different versions of a single audio sample. Next, we use 3 types of features: Mel-frequency cepstral coefficients (MFCCs), Melspectrogram features, and Chroma energy normalized statistics (CENS) features. The number of features used is 39; the 39 MFCC features include the delta and double delta features as well. A brief description of the 3 features is given in the following subsections. After the feature transformation, the features are stacked together so that they behave as a single multidimensional feature, which is used for classification. Finally, a newly proposed CNN architecture is used for the classification of 3 classes: chronic, non-chronic, and healthy audio. A detailed explanation is given in the following subsections and in Section 5. A sketch of the delay-based augmentation is shown below.
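As an illustration, the following sketch shows one way such delay-based augmentation could be implemented with librosa and NumPy; the exact handling of the delayed copy (zero-padding and trimming), the normalisation, and the file naming are our assumptions, since the paper does not specify them.

```python
import librosa
import numpy as np

def delay_augment(path, delays_ms=(250, 500, 750, 1000)):
    """Mix an audio file with delayed copies of itself (assumed augmentation scheme)."""
    y, sr = librosa.load(path)                   # librosa's default 22050 Hz, res_type='kaiser_best'
    augmented = []
    for d in delays_ms:
        shift = int(sr * d / 1000)               # delay expressed in samples
        delayed = np.concatenate([np.zeros(shift), y])[: len(y)]  # delayed copy, trimmed to length
        mixed = y + delayed                      # add the delayed audio to the original
        mixed /= np.max(np.abs(mixed)) + 1e-9    # normalise to avoid clipping
        augmented.append(mixed)
    return sr, augmented

# Example usage (hypothetical file name):
# import soundfile as sf
# sr, versions = delay_augment("patient_001.wav")
# for i, v in enumerate(versions):
#     sf.write(f"patient_001_delay{i}.wav", v, sr)
```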

Fig. 1: Proposed execution flow with flowchart

4.1 Mel-frequency cepstral coefficients (MFCCs)

The classification of pathologies is done by transforming different features from the audio samples, one of which is the Mel-frequency cepstral coefficients, which have produced good results in audio classification tasks [55][28]. In audio processing, the short-term spectrum representation of a sound based on the Mel scale of frequency is called the Mel-frequency cepstrum. Various studies have shown the effectiveness of MFCCs in audio processing.

MFCC features are based on human hearing perception, which does not perceive frequencies above 1 kHz linearly. The MFCC features are based on the known variation of the ear's critical bandwidths [61]: the filters are spaced linearly below 1 kHz and logarithmically above 1 kHz. MFCC extraction has 7 computational steps [48].

  1. Pre-emphasis: the audio sample is passed through a filter that emphasises the higher frequencies.

  2. Framing: the digitised audio sample is segmented into frames with a duration of around 20 to 40 ms.

  3. Windowing: a Hamming window is applied to each frame before the next block in the process.

  4. Fast Fourier Transform: the frames are converted into the frequency domain using the Fast Fourier Transform (FFT).

  5. Mel Filter Bank Processing: a filter bank following the mel scale is applied, as shown in Fig. 2. Triangular filters are used to calculate the weighted sum of the spectral components and to map the output onto the mel scale.

  6. Discrete Cosine Transform: the log mel spectrum is converted back to the time domain using the Discrete Cosine Transform; the results of this conversion are the cepstral coefficients.

  7. Delta Energy and Delta Spectrum: since the cepstral features change over time, additional features such as delta (velocity) and double delta (acceleration) features are appended.

To deal with variable audio lengths, each of the 39 MFCC features is averaged over all frames, producing a final vector of 39 features for a single audio file. MFCC features are somewhat more decorrelated than melspectrogram features, which is why they have proved beneficial with linear models such as Gaussian mixture models. A minimal sketch of this extraction is given below.
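The sketch below illustrates how the 39-dimensional MFCC vector could be computed with librosa; splitting the 39 values into 13 base coefficients plus their delta and double delta features, and averaging over frames, are our assumptions based on the description above.

```python
import librosa
import numpy as np

def mfcc_vector(path, n_mfcc=13):
    """Return a 39-dimensional MFCC vector for one audio file (13 MFCCs + delta + double delta, averaged)."""
    y, sr = librosa.load(path)                                 # librosa's default 22050 Hz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # (13, n_frames)
    delta = librosa.feature.delta(mfcc)                        # velocity features
    delta2 = librosa.feature.delta(mfcc, order=2)              # acceleration (double delta) features
    stacked = np.vstack([mfcc, delta, delta2])                 # (39, n_frames)
    return stacked.mean(axis=1)                                # average over frames -> (39,)
```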

Fig. 2: Mel scale filter bank, from (Young et al., 1997)

The formula for converting from frequency to Mel scale is:

$$ M(f)=1125 \ln (1+f / 700) $$
(1)

4.2 Melspectrogram features

The Melspectrogram is calculated on a time-series input using the librosa package. The spectrogram magnitude is calculated first and is then mapped onto the mel scale. The melspectrogram is calculated with the following steps (a librosa-based sketch is given after the list):

  1. Separate windows: the input signal is sampled in windows of 2048 samples; the window is then shifted by 512 samples and the next window is taken.

  2. Compute FFT: the fast Fourier transform of each window is calculated, converting the signal from the time domain to the frequency domain.

  3. Generate a Mel scale: the frequency spectrum is separated into 39 evenly spaced mel bands.
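A minimal sketch of this computation with librosa follows; reducing the melspectrogram to a 39-value vector per file by averaging over frames is our assumption, mirroring the MFCC treatment above.

```python
import librosa
import numpy as np

def melspectrogram_vector(path, n_mels=39, n_fft=2048, hop_length=512):
    """Return a 39-dimensional melspectrogram vector (per-file frame averaging assumed)."""
    y, sr = librosa.load(path)                                              # librosa's default 22050 Hz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels)                     # (39, n_frames)
    return mel.mean(axis=1)                                                 # average over frames -> (39,)
```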

4.3 Chroma energy features

Chroma Energy Normalized Statistics (CENS): the main idea behind CENS features is to take statistics over large windows, which smooths out local deviations in tempo, articulation, and musical ornaments such as chords and trills, making it easier to match the similarity of audio sounds using the selected features. To compute the CENS features, each chroma vector is first normalised with the L1 norm, which expresses the relative energy distribution. The next step is quantisation based on a chosen threshold, and the last step is smoothing and downsampling. A librosa-based sketch is given below.
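The sketch below shows how the CENS features could be obtained with librosa and reduced to a per-file vector. Note that librosa's chroma_cens defaults to 12 chroma bins; raising n_chroma (and bins_per_octave) to 39 so that the dimensionality matches the other two features is our assumption, since the paper does not state how the 39-wide CENS vector is obtained.

```python
import librosa
import numpy as np

def chroma_cens_vector(path, n_chroma=39):
    """Return a per-file Chroma CENS vector (n_chroma=39 is an assumption to match the other features)."""
    y, sr = librosa.load(path)                                 # librosa's default 22050 Hz
    cens = librosa.feature.chroma_cens(y=y, sr=sr,
                                       n_chroma=n_chroma,
                                       bins_per_octave=n_chroma)  # (n_chroma, n_frames)
    return cens.mean(axis=1)                                       # average over frames
```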

5 Execution and implementation scenario

The implementation of the algorithm is carried out on a machine having 8 GB of DDR4 RAM, a 4 GB Nvidia GTX 1050 Graphics Processing Unit (GPU), and an Intel Core i7-7700HQ Central Processing Unit (CPU) at 2.80 GHz, a 64-bit processor. The implementation of this multi-class classification proceeds in the following order:

  1. Data Augmentation: a delay function introduces delays of 250 ms, 500 ms, 750 ms, and 1000 ms in the audio files. The delayed audio and the original audio are added together and saved on the local machine to augment the data.

  2. Feature Transformation: the 3 features, namely MFCC, Melspectrogram, and Chroma CENS, are extracted from the audio samples. The audio is first sliced into equal parts of 30 ms and passed through a pre-emphasis filter, after which the 3 features are computed with the librosa library and stacked together. The audio is loaded as a floating-point time series, keeping the sampling rate the same and using kaiser_best as the resample type. The feature vector can be visualized in Fig. 3. A sketch covering steps 2-4 is given after this list.

  3. Dividing Data: the selected dataset of audio files is split into three partitions used for training, testing, and validation. The feature vectors are created based on the split sizes, chosen as 70%, 20%, and 10% for training, testing, and validation respectively, keeping in mind that the samples used for validation are the original samples and not the augmented ones.

  4. Reshaping Data: the data are reshaped for the CNN input using the NumPy package in Python.

  5. Classifier: the feature vectors are passed to a CNN classifier, which classifies them into the different categories based on the probabilities given by the softmax layer.
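To make these steps concrete, the sketch below shows one possible way to assemble the 39 x 3 feature matrix per file, split the data, and reshape it for the CNN. The helpers mfcc_vector, melspectrogram_vector, and chroma_cens_vector are the hypothetical functions sketched in Section 4, and the file list, label encoding, and use of scikit-learn for splitting are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def build_feature_matrix(paths):
    """Stack the three 39-dimensional vectors into a (n_files, 39, 3) array."""
    feats = []
    for p in paths:
        mfcc = mfcc_vector(p)                 # (39,)
        mel = melspectrogram_vector(p)        # (39,)
        cens = chroma_cens_vector(p)          # (39,)
        feats.append(np.stack([mfcc, mel, cens], axis=-1))  # (39, 3)
    return np.array(feats)                                   # (n_files, 39, 3)

# Assumed usage:
# X = build_feature_matrix(all_paths)                 # features
# y = np.array(all_labels)                            # 0 = chronic, 1 = non-chronic, 2 = healthy (assumed encoding)
# 70% train, 20% test, 10% validation; the validation set should come from original, non-augmented files
# X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.7, stratify=y)
# X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=2/3, stratify=y_rest)
# Reshape for the 2D CNN input: (n_files, 39, 3, 1)
# X_train = X_train[..., np.newaxis]
```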

As the dataset contains just 920 files, the data need to be augmented so that the CNN layers can train properly. The data augmentation step produces 5 versions of every audio file (the original plus four versions delayed in steps of 250 ms); the number of augmented files is given in Table 2. The final augmented dataset therefore has a total of 4600 audio files. Next, the feature transformation is performed and a feature vector of size 4600 x 39 x 3 is created. We can visualize this like a regular image, where the red layer holds the MFCC features, the green layer the Melspectrogram features, and the blue layer the Chroma CENS features; every audio file thus has 3 features corresponding to a class. The next step is reshaping the data with the NumPy package, so the final dimension of the feature vector becomes 4600 x 39 x 3 x 1, which allows us to feed it into the CNN architecture. Figure 4 shows the amplitude of a healthy audio sample with respect to time (seconds). The architecture is defined in Section 5.1, and the model is trained for 100 epochs. The significance of our preprocessing step is that it includes all features of the audio file, and since convolution layers have been shown to perform very well on image classification tasks, the CNN layers help us correctly predict the classes of the dataset. The final softmax layer assigns probabilities to the different classes, so the output of the CNN is one of the 3 cases, which lets us diagnose a chronic disease, a non-chronic disease, or a healthy audio sample.

Table 2 Original Data Size and Augmented data size
Fig. 3: Feature visualization

Fig. 4: Plot of a healthy audio sample

For a robust classification, we applied K-fold cross validation with 5 folds. In every fold, the model is trained on a different split of the training and testing data, while the validation dataset is kept separate and used only for validation; this helps us understand how the model performs on unseen data. So the model is trained on the training and testing feature vectors and validated on the validation feature vector. The mean recall obtained is 0.991, the F1 score 0.993, and the precision 0.994. The entire algorithm is shown as pseudo code in Algorithm 1, and a cross-validation sketch is given below.
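A minimal sketch of this 5-fold procedure with scikit-learn and Keras follows; build_model stands for the CNN described in Section 5.1 and is an assumed helper, and the variable names, batch size, and macro averaging of the metrics are likewise assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support

def cross_validate(X, y_onehot, X_val, y_val, build_model, n_splits=5):
    """5-fold training with a separate, fixed validation set of original (non-augmented) files."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = build_model()                                  # the CNN of Section 5.1 (assumed helper)
        model.fit(X[train_idx], y_onehot[train_idx],
                  validation_data=(X[test_idx], y_onehot[test_idx]),
                  epochs=100, batch_size=32, verbose=0)
        y_pred = np.argmax(model.predict(X_val), axis=1)       # evaluate on the held-out original files
        p, r, f1, _ = precision_recall_fscore_support(y_val, y_pred, average='macro')
        scores.append((p, r, f1))
    return np.mean(scores, axis=0)                             # mean precision, recall, F1 over the folds
```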

5.1 CNN architecture for our proposed approach

The CNN architecture contains 11 layers. The first is a Conv2D layer with 64 filters, kernel size 3, stride 1, 'relu' activation, and 'same' padding, which lets the output have the same length as the input. The second layer is a max-pooling layer with 'same' padding. The third layer is a Conv2D layer with 128 filters and the same parameters as the first layer. The fourth layer is a max-pooling layer with 'same' padding. The fifth layer is a dropout layer with a rate of 0.3, so each input unit is dropped with a probability of 30%. The sixth layer is a flatten layer, which flattens the nodes. The seventh layer is a dense layer with 256 nodes and 'relu' activation. The eighth layer is a dropout layer with the same specification as the fifth layer. The ninth layer is a dense layer with 512 nodes and 'relu' activation. The tenth layer is a dropout layer with the same specification as the fifth layer. The last layer is a dense layer with 3 nodes and 'softmax' activation. The optimizer used for this model is Adam with its default parameters. The model can be visualized in Fig. 5, and a Keras sketch follows.
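A minimal Keras sketch of this architecture is given below, assuming an input shape of (39, 3, 1) and a pool size of 2 (the pool size and the loss function are not stated in the paper and are our assumptions).

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(39, 3, 1), n_classes=3):
    """Sketch of the 11-layer CNN described above; pool size 2 and categorical cross-entropy are assumptions."""
    model = models.Sequential([
        layers.Conv2D(64, kernel_size=3, strides=1, padding='same',
                      activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2, padding='same'),
        layers.Conv2D(128, kernel_size=3, strides=1, padding='same', activation='relu'),
        layers.MaxPooling2D(pool_size=2, padding='same'),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```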

Fig. 5: CNN architecture

Kongtao Chen et al. [14] noted that high computational and storage requirements sometimes hinder the deployment of convolutional neural networks, and that structured model pruning is a possible way around these constraints. They showed that structured model pruning on TPUs may considerably reduce model memory use and improve performance without sacrificing accuracy, particularly for small datasets (e.g., CIFAR-10).

Algorithm 1: Pseudo code of the proposed approach

6 Experiment results

As a baseline, each feature was first used independently for the classification task. These independent features were passed through the same CNN network to identify the 3 classes. The experimental results are shown in Table 3. As can be observed from the table, the independent features were not able to perform well, so they were concatenated to improve the performance of the model.

Table 3 Analysis of independent features

Three metrics have been selected for testing the performance of the model: precision, recall, and F1-score. Precision is

$$ \text{precision}=\frac{T P}{T P+F P} $$
(2)

where TP stands for true positive, FP stands for false positive. Recall is

$$ \text{recall}=\frac{T P}{T P+F N} $$
(3)

where FN stands for false negative and TP for true positive. The F1-score is defined as

$$ F 1=\frac{2 \times \text{precision} \times \text{recall}}{\text{precision}+\text{recall}} $$
(4)
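For reference, these metrics can be computed directly with scikit-learn; macro averaging over the three classes and the variable names are our assumptions.

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the three classes (averaging choice assumed)."""
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    return precision, recall, f1
```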

The evaluation of our proposed approach is done on the ICBHI corpus, and the accuracy and loss curves are plotted in Figs. 6 and 7 respectively. We can see from the plots that there is no overfitting, and our proposed approach outperforms previous approaches for the classification of diseases. The comparison with [59] also shows that our proposed method outperforms the previous approach; the comparison between the approaches is shown in Table 4. The newly proposed preprocessing transforms 3 kinds of features and thereby extracts most of the information from the audio sample.

Fig. 6: Training and validation accuracy

Fig. 7: Training and validation loss

The method performs better than previous techniques because the transformation of different features captures enough information from the audio sample for proper classification. The preprocessing proposed in this paper differs from previous methods, and as Table 4 suggests, classifying the feature vector produced by this preprocessing outperforms previous approaches.

Table 4 Experimental results

Figure 8 shows the comparison of features and highlights the first 10 coefficients and their amplitudes (to show the difference in amplitude between the different features). From the figure we can see the variation of the features within the audio sample. For the healthy audio sample plotted in Fig. 4, the audio after preprocessing can be visualized as in Fig. 8 (but with 39 features).

Fig. 8: Comparison of features

Combining the features helps the model learn a better pattern of lung sounds and thus classify audio samples more accurately. The MFCC and Melspectrogram features provide log-scaled spectral information, while Chroma CENS matches the similarity of audio sounds by taking statistics over large windows, which smooths out local deviations. Together, these features give a far better description of the audio sample than a single feature alone, and the results point in the same direction.

Apart from this, a few noise suppression techniques were considered to improve the quality of the audio samples. However, recent noise suppression techniques degraded the quality of the samples because they have a hard time differentiating lung sounds from background noise; such techniques are better suited to speech and background noise problems.

7 Conclusion and future work

This paper presents a new and effective approach to the classification of respiratory diseases that can help the biomedical field with early detection. Our approach relies on a novel feature transformation method along with data augmentation, which together produce a state-of-the-art model. The dataset used to evaluate the model is the ICBHI corpus [66], a publicly available research dataset. The proposed approach combines the preprocessing steps MFCC, Melspectrogram, and Chroma CENS with a CNN, which helps make an accurate diagnosis from lung sounds. Moreover, the proposed approach outperforms previous approaches and presents a new way of preprocessing the audio files, reaching a recall of 0.991, an F1 score of 0.993, and a precision of 0.994. This work helps us extract maximum information from the audio sample and leverage it for a better, more robust, and more accurate diagnosis. For future work, we plan to include other features such as the spectrogram, wavelet transform, FFT, or STFT and evaluate their performance for respiratory pathology diagnosis. Besides improving performance, this service can be deployed on edge devices; for example, it could run on a Jetson Nano and be made available to the public.