1 Introduction

Over the past two decades, the availability and use of various medical imaging modalities, such as X-rays, positron emission tomography (PET) scans and magnetic resonance imaging (MRI), have resulted in a massive increase in the size of medical image repositories. This underscores the need for effective healthcare information management systems (HIMS) that support intuitive, easy-to-use applications such as image categorization, similar-image retrieval and query-based image retrieval, enabling efficient utilization of the available historical data. Content-based image retrieval (CBIR) is a prominent research field that focuses on designing effective techniques for managing large repositories of images and supporting an efficient retrieval process [1]. A CBIR system typically has two phases: an offline phase, in which features are extracted from an extensive collection of training images and stored, and an online phase, in which features are generated for the input query image and a similarity or distance metric is computed between these features and those of the images in the database. The images with a lower distance (or higher similarity) to the query image are returned as the retrieved results [2].

Various feature extraction methods have previously been used to capture image-specific features that enable effective retrieval in CBIR systems. These include different colour spaces (Lab, HSV), shape features and edge-based features. The Lab colour space defines three values: L (lightness, from black to white), a (from green to red) and b (from blue to yellow). The HSV colour space remaps RGB colours onto dimensions that humans can comprehend more intuitively. The most challenging issue for such systems is the semantic gap between the visual information perceived by the human vision system (HVS) and the visual information captured by the imaging device [3]. Edges detected using gradient approaches, which locate the minima and maxima of the first image derivative using statistical methods (first-order, second-order, etc.), have also been employed in CBIR systems; however, accurately identifying edges in medical images is a time-consuming and challenging process. Transform-based features (2D wavelet, Gabor, etc.) and texture-based features have likewise been employed in CBIR systems to support image retrieval across multiple domains [4, 5]. The performance of such CBIR systems is highly reliant on the chosen features [6], which may result in overfitting when the quantity of input images is inadequate. Additionally, when the dimension of the input images is very large, the feature selection procedure may require substantial computing time.

These challenges can be addressed by using artificial intelligence (AI) techniques to develop systems that are intelligent enough to learn to perform like the HVS. Deep learning techniques can learn complex features directly from raw training images. Existing research underscores a significant limitation of content-based medical image retrieval (CBMIR) systems: limited accuracy when the images in a dataset are highly similar to one another. In this work, we adopt deep neural models to classify medical images and learn latent feature representations for developing CBMIR systems that retrieve X-ray images of various lung diseases, including the recently emergent COVID-19 lung infections. The proposed two-stage approach reduces the search space and facilitates retrieving images of the relevant class from among multiple classes. The primary contributions of this work are listed below:

  • A novel CBMIR framework for COVID-19-related symptoms, built as a two-step approach consisting of classification followed by retrieval of similar disease images.

  • Benchmarking state-of-the-art transfer learning models for effective classification of radiology images.

  • Applying the enhanced CBMIR system to standard X-ray datasets for accurate diagnosis of COVID-19 and other lung diseases, with the aim of reducing the mortality rate.

The rest of the paper is organized as follows: Sect. 2 discusses existing deep learning approaches for designing CBIR systems and for identifying various lung diseases. Section 3 details our approach and the proposed model architecture, along with implementation specifics. Section 4 presents the experimental evaluation and benchmarking results, followed by the conclusion and scope for future work.

2 Related work

With the wide variety of imaging modalities in use for diagnosing diseases in the medical domain, the volume of medical image data produced is growing at a tremendous rate [7]. Therefore, managing such huge databases efficiently and productively is a critical requirement. CBMIR systems aim to solve this issue by supporting query by example (QBE) interfaces, which can return the closest matched images from a huge database, given a query image. CBMIR also helps doctors by providing functionalities like identification of similar patient characteristics through automated image analysis and feature extraction [8, 9].

CBMIR techniques have been adapted by researchers to address various clinical tasks across imaging modalities. Pilevar [10] detected the contours and signatures of decomposed sub-regions of the input images and then extracted feature vectors using a function mapped to the input training images; this approach attained 93% accuracy on a simulated dataset of 5000 radiography images. Mizotin et al. [8] utilized a bag of visual words (BoVW) approach along with scale-invariant feature transform (SIFT) features for detecting Alzheimer’s disease from MRIs. Similarly, CBMIR on skin lesion images was proposed by Jiji et al. [11]. Ponciano et al. [12] evaluated the impact of CBMIR by asking specialists about their confidence in the proposed system’s diagnoses and reported high acceptance by radiologists. Rahman et al. [13] and Qayyam et al. [2] implemented CBMIR approaches that utilize a classification technique: the query image’s class is predicted first, and the search space is reduced to that particular class, which was found to improve the model’s accuracy. As fewer image feature comparisons are required in this technique, the computational cost is also significantly reduced.

Deep learning-based algorithms learn many levels of representation to model complex data interactions [14]. Deep learning (DL) models are also known for their ability to self-generate intermediate representations, which can capture more complex structures such as pictorial structure [15]. Deep learning has been applied to clinical tasks such as image retrieval and classification [16, 17] and disease prediction [18, 19], among others.

Hu et al. [20] used CT scan images and designed a supervised deep learning method for automating the detection and classification of COVID-19. COVID-Net [21] proposed a deep convolutional neural network (CNN) for identifying COVID-19 in chest X-rays and also introduced an open-source test dataset named COVIDx; COVID-Net was pre-trained on the ImageNet dataset and then trained on COVIDx. Several studies have detected COVID-19 using transfer learning [22]; for instance, VGG19 can be used to detect COVID-19 and to distinguish COVID-19 from pneumonia in images acquired through multiple modalities such as computerized tomography (CT) scans, X-rays and ultrasound. Apart from COVID-19, other lung diseases such as severe acute respiratory syndrome (SARS) have also been studied, and methods exist for predicting their presence. Bannerjee et al. [23] used multiple models, such as random forests and shallow learning, for the prediction of SARS. Pineda et al. [24] worked on the classification of influenza, using classifiers such as artificial neural networks (ANN), support vector machines (SVM), decision trees and naive Bayes. Serener et al. [25] used seven different deep learning architectures, including ResNet18, VGG, MobileNetv2, DenseNet121 and SqueezeNet, for the classification of Mycoplasma, and reported the best performance for ResNet18 and MobileNetv2 among the experimented methods. Lanera et al. [26] used the elastic-net regularized generalized linear model (GLMNet), LogitBoost algorithms and related methods for the prediction of Varicella.

In addition to enabling effective classification for detecting various lung diseases, it is also necessary to support an efficient retrieval system, since the volume of data related to COVID-19 and other lung diseases keeps increasing. This motivates the need for CBMIR on chest X-ray lung disease datasets. Deep learning-based CBIR has been explored for the retrieval of natural images, and there is ample scope for developing deep learning-based CBMIR systems to enable effective diagnosis of lung disorders.

In our work, a classification step is proposed that can facilitate accurate retrieval by managing similar images belonging to a single domain (e.g., lung X-ray images). This can also aid in differentiating diseases similar to COVID-19. An efficient retrieval system is proposed to quickly retrieve similar disease images, which can facilitate early diagnosis and ultimately reduce the mortality rate.

3 Methodology

Figure 1 illustrates the proposed methodology, which integrates medical image classification with content-based medical image retrieval. The proposed CBMIR methodology was experimentally evaluated on the COVID-19 chest X-ray image dataset (see Footnote 1). The dataset contains the chest X-ray images as well as the metadata connected with each image, along with the disease classes and subclasses linked to respiratory issues. It consists of 584 COVID-19 images and around 120 images of other types of diseases. Only 33 COVID-19 images were considered, to eliminate the bias caused by the varied number of images in each subclass. Therefore, the dataset used in this work consists of 152 images across three classes and several subclasses. Sample images from the three main classes are shown in Fig. 2. The following subsections describe the two phases of the proposed methodology in detail.

Fig. 1: Proposed methodology for integrating medical image classification with CBMIR

Fig. 2: Image samples from different classes showing the inter-class similarities: (a) viral, (b) bacterial, (c) fungal

3.1 Classification process

The first phase consists of the classification step, a supervised learning approach in which a model is trained to classify the medical images. In this phase, the features of all images in the dataset are extracted and stored in an index; this is known as the offline indexing phase. The classification model has two steps: the first extracts features, and the second uses these extracted features to classify the images. Generally, CBIR systems consist only of the feature extraction step, using a deep convolutional neural network (DCNN). However, the features extracted from the dense layers of a DCNN can also be used to classify the query images with a classification model, and the predicted class of the query image can then reduce the search space for retrieving similar images. DCNNs generally consist of a large number of layers: the convolution layers apply filters to the input image to extract features; in a fully connected layer, every node is connected to every node in the adjacent layer; and the pooling layers reduce the computational requirements by reducing the number of parameters.

Recent studies have shown that transfer learning from state-of-the-art models such as VGG19, ResNet50 and InceptionV3 can perform exceedingly well for the detection of COVID-19 from chest X-ray images [27, 28]. In view of this, the VGG19 [29] and ResNet50 [30] models were adapted for our work. VGG19 is trained on the ImageNet dataset [31], which consists of millions of images. It consists of 19 layers and takes an input image of size (\(224 \times 224\)). It uses kernels of size (\(3 \times 3\)) with a stride of 1, and max pooling with a window size of (\(2 \times 2\)) and a stride of 2; ReLU (rectified linear unit) activations introduce non-linearity. The ResNet50 model contains 50 layers and is also trained on the ImageNet dataset, which covers around 1000 object categories. Like VGG19, it takes an input image of size (\(224\times 224\)); kernels of various sizes are used for better feature extraction, followed by an average pooling layer and a fully connected layer. The feature vector is formed from the last layers of each network, resulting in 4096 and 2048 features for VGG19 and ResNet50, respectively. Given the limited size of the dataset, a more pragmatic approach is followed: pre-trained VGG19 and ResNet50 models are used as feature generators in this phase. Since these models have been trained extensively on large datasets, their feature extraction capability is better than that of a model trained from scratch.
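
As a minimal sketch of this step (assuming the standard Keras application models with ImageNet weights; the layer name 'fc2' follows the Keras VGG19 definition, and the image path is hypothetical), the pre-trained networks can be turned into fixed feature generators as follows:

```python
import numpy as np
from tensorflow.keras.applications import VGG19, ResNet50
from tensorflow.keras.applications.vgg19 import preprocess_input as vgg_preprocess
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet_preprocess
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# VGG19: take the 4096-dimensional output of its last dense layer ('fc2').
vgg = VGG19(weights="imagenet")
vgg_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)

# ResNet50: global average pooling after the last conv block yields 2048 features.
resnet_extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path, extractor, preprocess):
    """Load an X-ray, resize it to 224x224 and return its feature vector."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = np.expand_dims(image.img_to_array(img), axis=0)
    return extractor.predict(preprocess(x))[0]

# e.g. feats = extract_features("xray.png", vgg_extractor, vgg_preprocess)  # shape (4096,)
```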

For the classification step, 80% of the data was used for training and the remaining 20% for testing. Since the images in the dataset have varied dimensions, they are resized to (\(224\times 224\)), as required by both VGG19 and ResNet50. Pre-trained weights were loaded into the models and rendered untrainable. All layers except the final output layer were retained from the pre-trained models, and a Dense layer with a softmax activation function was added. The model was implemented in Python using the Keras framework.
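
A sketch of this classification setup under the assumptions above (ResNet50 shown as the base; variable names and hyperparameters such as the optimizer and epoch count are illustrative, not taken from the paper):

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

NUM_CLASSES = 2  # e.g. Viral vs. Bacterial in the binary setting

# Pre-trained base without its 1000-way ImageNet output layer.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))
base.trainable = False  # pre-trained weights are rendered untrainable

# New Dense layer with a softmax activation over the target classes.
outputs = Dense(NUM_CLASSES, activation="softmax")(base.output)
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# 80/20 split on pre-loaded arrays X (N, 224, 224, 3) and one-hot labels y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20)
```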

3.2 Image retrieval process

Image retrieval is achieved by extracting low-level features from images using the classifier and representing them as feature vectors, which can then be compared for similarity. Two feature vectors are given as inputs to a similarity function or distance metric, and the resulting output (a scalar) represents the closeness of the feature vectors. The general steps for retrieving relevant images in the CBIR system, using the feature vectors obtained from the classifier, are shown in Fig. 1. The process involves three main steps: creating the feature database, finding an optimal similarity metric, and ranking the retrieved images. To obtain each image’s representative characteristics, the classifier is first trained on the feature sets extracted from each image in the dataset.

The features extracted from the trained models are stored for similarity measurement. Chi-square, Euclidean and cosine distance measures were experimented with for determining the similarity between the features extracted from the query image and the stored features. Given a query image, its features are extracted by the pre-trained neural models and compared to those previously indexed using the specified distance function, and the most relevant results are ranked according to that distance.
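
A minimal sketch of this online step (assuming the offline index is a plain dict from image identifiers to stored feature vectors, and a pluggable distance function such as those defined by Eqs. (6)-(8) in Sect. 4):

```python
def retrieve(query_features, index, distance_fn, top_k=10):
    """Rank indexed images by their distance to the query feature vector.

    index: dict mapping image identifiers to pre-computed feature vectors.
    distance_fn: any metric where a smaller value means a closer match.
    """
    scored = [(img_id, distance_fn(query_features, feats))
              for img_id, feats in index.items()]
    scored.sort(key=lambda pair: pair[1])  # ascending distance = most relevant first
    return scored[:top_k]

# e.g. results = retrieve(extract_features("query.png", vgg_extractor, vgg_preprocess),
#                         index, euclidean_distance)
```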

4 Experimental results and analysis

The proposed method is evaluated using various standard metrics. For classification, supervised learning is used, and the different models are evaluated using the standard accuracy metric, computed as per Eq. (1). Precision, average precision (AP), recall and mean average precision (mAP) are used to assess image retrieval (Eqs. (2)-(5)), where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. Precision and recall are used for plotting the precision-recall curve and for computing AP and mAP to assess the model’s retrieval performance.

$$\begin{aligned} \text{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(1)
$$\begin{aligned} \text{Precision},\; P &= \frac{TP}{TP+FP} \end{aligned}$$
(2)
$$\begin{aligned} \text{Recall},\; R &= \frac{TP}{TP+FN} \end{aligned}$$
(3)
$$\begin{aligned} \text{Average precision},\; AP &= \sum_{n} (R_n - R_{n-1})\, P_n \end{aligned}$$
(4)
$$\begin{aligned} \text{Mean average precision},\; mAP &= \frac{1}{N} \sum_{n=1}^{N} AP_n \end{aligned}$$
(5)
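
As a worked sketch of Eqs. (4) and (5) (hypothetical helper functions; the ranking and the set of relevant images are assumed to come from the retrieval step):

```python
import numpy as np

def average_precision(relevant_ids, ranking):
    """AP per Eq. (4): sum over ranks n of (R_n - R_{n-1}) * P_n."""
    relevant_ids = set(relevant_ids)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for n, img_id in enumerate(ranking, start=1):
        if img_id in relevant_ids:
            tp += 1
        precision = tp / n               # P_n
        recall = tp / len(relevant_ids)  # R_n
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_values):
    """mAP per Eq. (5): mean of the per-query AP values."""
    return float(np.mean(ap_values))

# e.g. average_precision({"a", "c"}, ["a", "b", "c"]) == 1.0*0.5 + (2/3)*0.5 ≈ 0.83
```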

Various distance metrics, namely Euclidean, chi-square and cosine distance, were used to find the similarity between images based on their features. The Euclidean distance measures the distance between two real-valued vectors, the chi-square distance measures the discrepancy between two feature distributions, and cosine similarity measures how similar two images are irrespective of the magnitude of their feature vectors (Eqs. (6), (7) and (8), where n is the length of the feature vectors and x and y are two image feature vectors). Together, these measures give a holistic view of the results with respect to different aspects.

$$\begin{aligned} \text{Euclidean-distance}(x,y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \end{aligned}$$
(6)
$$\begin{aligned} \text{Cosine-similarity}(x,y) &= \frac{x \cdot y}{\Vert x \Vert \, \Vert y \Vert} \end{aligned}$$
(7)
$$\begin{aligned} \text{Chi-square}(x,y) &= \frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i + e^{-10}} \end{aligned}$$
(8)
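
The three measures translate directly from Eqs. (6)-(8); a minimal NumPy sketch (we read the small constant e^{-10} in Eq. (8) as a guard against division by zero):

```python
import numpy as np

def euclidean_distance(x, y):
    """Eq. (6): L2 distance between two feature vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    """Eq. (7): cosine of the angle between x and y (1 means identical direction)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # For ranking by distance, use 1.0 - cosine_similarity(x, y).

def chi_square_distance(x, y, eps=np.exp(-10)):
    """Eq. (8): chi-square distance; eps avoids division by zero."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))
```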

To illustrate the improved accuracy attained by adding the classification step, we performed ablation studies, i.e., ran the pipeline with and without classification and compared the results. Different models were trained on the dataset, and ResNet50 was found to give better accuracy than VGG19 for the classification step. First, binary classification on the class labels Viral and Bacterial was performed; these classes have four and six subclasses, respectively. The training and validation loss curves for the trained model are shown in Fig. 3. As shown in Table 1, the model attained a binary classification accuracy of 81.66% (Viral and Bacterial) and a multi-class classification accuracy of 76% on all three classes (Viral, Bacterial and Fungal). Since the Fungal class has only two subclasses with very few images, the binary classification model was considered.

Table 1 Classification accuracy for different models (highest accuracy in bold)
Table 2 Class-wise mAP values with different distance measures, without and with classification (highest value in bold)
Fig. 3: Loss estimation graph for binary classification

Fig. 4: Precision-recall performance for (a) Viral class (without classification), (b) Bacterial class (without classification), (c) Viral class (with classification) and (d) Bacterial class (with classification)

To evaluate the image retrieval phase, various standard metrics are used: average precision, the precision-recall curve, AP for each query, and mAP for the different classes and subclasses. For the similarity measures, the Euclidean, chi-square and cosine distance metrics are tested and compared. In addition, for each subclass, the area under the precision-recall curve (AUPRC) is determined for a query image.

Table 2 presents a comparison of the mAP values for the different classes without and with classification, and Table 3 compares the mAP values for the different subclasses without and with classification. Among the three distance metrics, the best result is observed with the cosine distance. It is also evident that, for all three distance metrics, the mAP values with classification are higher than those without classification. Some subclasses, such as Varicella, Influenza, E. coli and Chlamydophila, show variability in their mAP values because of their limited sample sizes.

Table 3 Subclass-wise mAP values with different distance measures (without and with classification)
Table 4 Subclass-wise AUPRC values with and without classification

The AUPRC values are also calculated for retrieving the different categories, to compare the performance with and without classification, as illustrated in Table 4. It can be observed that the AUPRC values increase when the classification phase is included. The precision-recall curve is plotted for each class and for all the subclasses of the Viral and Bacterial classes using the Euclidean metric, both with and without classification, as shown in Fig. 4. The precision obtained for both the Viral and Bacterial classes is clearly higher when the classification phase is included, as is evident from Fig. 4.

5 Conclusion and future work

This article presents a content-based medical image retrieval system based on a deep learning architecture, in which the query image is first classified to reduce the search space when finding relevant images. This improves the precision of the CBMIR system and also reduces the computational overhead. The proposed method was tested on the open COVID-19 chest X-ray image data collection. The model was successfully trained, and a 5-fold cross-validation accuracy of 81% was observed. To compare the influence of classification on retrieval performance, we employed a variety of distance measures (Euclidean, chi-square and cosine distances) and performance measures (mAP and AUPRC). A significant improvement was observed with the addition of the classification step. As part of future work, we intend to test the model on a larger dataset with more subclasses. We also plan to adapt other deep neural models to improve the performance of the CBMIR system and scale it for real-world applications.