1 Introduction

Melanoma is the most serious form of skin cancer; it begins in cells known as melanocytes. Its incidence is increasing: in Europe alone, an estimated 144,200 cases and 20,000 deaths occurred in 2018 [1], whereas in the USA, 106,110 new cases of invasive melanoma (62,260 in men and 43,850 in women) and 7,180 deaths (4,600 in men and 2,580 in women) were expected in 2021 [2]. The lesion is first diagnosed through an initial clinical screening, potentially followed by dermoscopic analysis, biopsy and histopathological examination [3]. Despite the expertise of dermatologists, early diagnosis of melanoma remains challenging, since it presents in many different shapes, sizes and colors, even between samples in the same category [4]. Providing a comprehensive set of tools is therefore necessary to simplify diagnosis and assist dermatologists in their decision-making processes [3].

Several automated computer image analysis strategies have been used as tools for medical practitioners to provide accurate lesion diagnostics, including descriptor-based methods [5, 6] and convolutional neural networks (CNNs) [3, 7, 8]. Descriptor-based methods require the prior extraction of handcrafted features [9], which relies on the expertise of dermatologists and introduces a margin of error. By contrast, CNN models can automatically learn high-level features from raw images [3], thus allowing applications to be developed in a shorter timeframe. Furthermore, Nasr-Esfahani et al. [10] showed that CNN models can outperform methods based on handcrafted features: the authors obtained 7% and 19% better sensitivity compared to a color-based and a texture-based descriptor, respectively. Recently, Brinker et al. [11] demonstrated that CNN models can match the prediction performance of 145 dermatologists.

CNN models have proven effective in solving several complex problems [12, 13]. However, they still present several issues that hamper their accuracy in diagnosing skin conditions. CNN models can learn from a wide variety of nonlinear data points; as such, they are prone to overfitting on datasets with few samples per category, thus attaining poor generalization capacity. It is noteworthy that, so far, most of the existing public skin datasets encompass only a few hundred or a few thousand images, which can limit the learning capacity of CNN models. In addition, CNN models are sensitive to certain characteristics of the data, such as large inter-class similarities and intra-class variances, variations in viewpoint, changes in lighting conditions, occlusions, and background clutter [14]. Currently, the majority of skin datasets are made up of dermoscopic images reviewed by expert dermatologists; however, there is a growing tendency to collect images taken by common digital cameras [15]. This can reduce invasive treatments and their associated expenses, in addition to fostering the development of modern tools for cheaper and better melanoma diagnosis. Finally, CNN models are approximately invariant to small translations of the input, but they are not rotation-, color- or lighting-invariant [16, 17]. Invariance is an important concept in image recognition: it means that if the input is transformed, its learned representation remains the same as that of the original. Given the vast variability in the morphology of moles, this is an important issue to resolve in order to attain a more effective melanoma diagnosis, since a model should detect rotations or changes in proportion and adapt itself so that the learned representation stays the same.

Several techniques can be applied to overcome some of these issues; the most proven include data augmentation [18], transfer learning [3], ensemble learning [19] and, more recently, generative adversarial networks [20] and multi-task learning [21]. However, in most cases, researchers rely on their own expertise to select which techniques to apply, and there is no specific pattern to follow that will definitively produce a model with a high level of reliability. Furthermore, most research does not follow a standardized experimental protocol and only includes a limited number of datasets.

As a consequence of the above, this work proposes a novel approach for diagnosing melanoma from images. First, the proposal is inspired by ensemble learning and is built via a genetic algorithm. The genetic algorithm is designed to find the optimal members of such ensembles while considering the entire training phase of the possible members. In addition, the abstract features of all models in the ensemble are merged, resulting in an additional prediction. Next, these individual predictions are combined and a final diagnosis is made. This approach can be seen as a double ensemble, in which all predictive components are doubly related. The aim is to find the models that contribute best at the ensemble level, rather than at the individual level. As a result, the state of each CNN model that best trains, generalizes and fits with the other CNN models is selected [22]. In this manner, the training process is implicitly regularized, which has been shown to improve convergence, mitigate overfitting and improve generalization performance [19, 23]. To the best of our knowledge, this is the first attempt at intelligently constructing ensembles of CNN models by means of a genetic algorithm to better solve the challenge of diagnosing melanoma from images. Second, a novel lesion segmentation method is applied, which is capable of efficiently obtaining reliable segmentation masks without prior information and using just a CPU. Bear in mind that state-of-the-art biomedical segmentation methods commonly require a prior training stage and the use of GPUs. In addition, common techniques such as transfer learning and data augmentation have been applied to further improve performance. To evaluate the suitability of this proposal, an extensive experimental study was conducted on sixteen melanoma-image datasets, enabling a better analysis of the effectiveness of the model. The results showed that the proposed approach achieved promising results and was competitive compared to six state-of-the-art CNN models which have previously been used for diagnosing melanoma [3, 10, 24,25,26]. These works were summarized in Pérez et al. [8], and the most relevant ones are explained in the next section. To perform a fair comparison, the above architectures were assessed in Sect. 5.5 using the same large number of datasets, and a quantification of the results was included. Finally, a web application was developed in order to support dermatologists in decision making.

The rest of this work is organized as follows: Sect. 2 briefly presents the state of the art in solving melanoma diagnosis problems, mainly using CNN models; Sect. 3 describes the proposed ensemble convolutional architecture; Sect. 4 describes the genetic algorithm; Sect. 5 presents the analysis and discussion of the results; and finally, Sect. 6 outlines conclusions and future work.

2 Related works

The popularity of CNN models increased when AlexNet [27] won the well-known ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, reducing the top-5 error rate from 26.1% to 15.3%. Since then, CNN models have been widely applied in image classification tasks [28,29,30]. However, most authors apply additional, more sophisticated techniques on top of CNN models to achieve better performance in melanoma diagnosis, such as segmentation [31], data augmentation [18], transfer learning [3], and CNN-based ensembles [19].

Data augmentation is employed to add new data to the input space, which helps reduce overfitting [32] and obtain transformation-invariant models [17]. This technique is usually performed by means of random transformations [33]. In addition, most of the datasets available for melanoma diagnosis lack balance between categories, so data augmentation can help tackle the imbalance issue [34]. For example, Esteva et al. [3] showed the suitability of CNN models as a powerful tool for melanoma diagnosis. The authors augmented the images by a factor of 720\(\times\) using basic transformations such as rotations, flips and crops. They also compared the performance of one CNN model to that of 21 board-certified dermatologists on biopsy-proven clinical images; the results showed that the CNN model achieved a performance on par with the experts. Pérez et al. [8] showed that 12 CNN models achieve better performance when applying data augmentation; for example, Xception [35] increased its average performance by 76%.
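As an illustration, the following is a minimal sketch of this kind of label-preserving pipeline using the modern tf.keras preprocessing layers; the transformation factors and the 224\(\times\)224 target size are illustrative assumptions, not the settings used in the cited studies.

```python
import tensorflow as tf

# Minimal augmentation pipeline with flip-, rotation- and crop-based
# transformations. The crop is only informative when inputs are larger
# than 224x224; color-based transformations are deliberately absent.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(factor=0.5),          # up to +/-180 degrees
    tf.keras.layers.RandomCrop(height=224, width=224),
])

def augment_n(image, n=10):
    # Applying the pipeline n times yields n label-preserving variants.
    return [augment(image[tf.newaxis, ...], training=True)[0] for _ in range(n)]
```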

Transfer learning has been successfully applied in image classification tasks [3, 30, 36, 37]. For example, Shin et al. [30] applied it to thoraco-abdominal lymph node detection and interstitial lung disease classification; in both cases, the weights obtained after training on ImageNet were beneficial. Esteva et al. [3] used Google's InceptionV3 [38] architecture pretrained on ImageNet: the authors removed the final classification layer and re-trained the network with 129,450 skin lesion images. Pérez et al. [8] studied the impact of applying transfer learning to skin images, where MobileNet [39] increased its average performance by 42%.
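A minimal sketch of this recipe in Keras, assuming an ImageNet-pretrained InceptionV3 whose classification layer is dropped and replaced by a new binary head; the head's layer sizes and the freezing strategy are illustrative assumptions rather than the cited authors' exact settings.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Load InceptionV3 without its ImageNet classification layer.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
base.trainable = False                           # optionally train only the new head first

x = layers.Dense(256, activation="relu")(base.output)
out = layers.Dense(1, activation="sigmoid")(x)   # melanoma vs. nevus
model = models.Model(base.input, out)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
```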

On the other hand, ensemble learning [40] has also proven effective in solving complex problems [41, 42]. The combination of several classifiers built from different hypothesis spaces can reach better results than a single classifier [43]. For example, Mahbod et al. [19] proposed an ensemble-based approach in which two different CNN architectures were trained with skin lesion images; the results were competitive with state-of-the-art methods for melanoma diagnosis. Harangi et al. [23] proposed an ensemble composed of the well-known CNN architectures AlexNet [27], VGG and InceptionV1 [44]. The ensemble was assessed on the International Symposium on Biomedical Imaging (ISBI) 2017 challenge and obtained very competitive results. However, such ensemble models are usually built by relying on prior knowledge.

In order to achieve better performance, efforts have been made not only to develop new training strategies but also to improve the input data. Skin lesion segmentation plays an important role in melanoma diagnosis: it isolates the region of interest and can significantly improve the performance of the model. It is a highly complex task, and it matters because areas unrelated to the lesion can lead CNN models to misclassify samples. Some authors, however, decided not to use such preprocessing techniques; Mahbod et al. [19] skipped complex preprocessing steps, as well as segmentation methods, but applied basic data augmentation techniques to prevent overfitting. Nevertheless, if irrelevant information is removed from the image, the models could achieve better performance. Since 2016, the International Skin Imaging Collaboration (ISIC) project has annually organized a challenge in which more than 180 teams have already participated. From 2016 to 2018, there was a specific task on lesion segmentation; in ISIC-2016, considering the top performances, the use of segmentation yielded 8% better sensitivity compared to its non-use. Several segmentation methods can be found in the literature. For example, Ronneberger et al. [45] designed a CNN model (U-Net) for biomedical image segmentation. The model relies on data augmentation to use the available labeled images more efficiently. The U-Net architecture achieved high performance in different biomedical segmentation applications, such as neuronal structures in electron microscopic recordings and skin lesion segmentation [46, 47]. However, U-Net requires a costly training process using a GPU and, more importantly, prior knowledge from data already segmented by expert dermatologists. Furthermore, Alom et al. [48] proposed a recurrent residual U-Net model (R2U-Net). The model was tested on blood vessel segmentation in retinal images, skin cancer segmentation, and lung lesion segmentation, and showed better performance than the U-Net model. Huang et al. [31] proposed a new segmentation method based on end-to-end object scale-oriented fully convolutional networks. The authors achieved 92.5% sensitivity and outperformed all CNN models in their study, 1.4% better than the winner of ISIC-2016.

Considering the above, it would be interesting to design a deep learning model that combines features from different approaches, such as segmentation, data augmentation, transfer learning and ensemble learning. We hypothesized that evolutionary optimization methods could be an effective approach to find the optimal combination of CNN models [49, 50], considering all states of all the models. Evolutionary methods have proven useful in solving many complex problems, finding and optimizing neural network architectures [51], and mining imbalanced data [52]. Regarding segmentation, this work applies an extension of the Chan-Vese segmentation algorithm [53] and evaluates it on specialized datasets. Regarding augmentation, it is important to augment data in both the training and test phases [18], which can increase performance significantly.

3 Ensemble-based convolutional architecture

First of all, the source datasets were preprocessed following an extension of the Chan-Vese segmentation algorithm, as shown in Fig. 1. This algorithm is designed to segment objects without clearly defined boundaries and is based on curve-evolution techniques, the Mumford-Shah [54] functional for segmentation, and level sets. Chan-Vese has previously been applied to skin image segmentation [55], making it an effective starting point for accurately segmenting skin images. First, Chan-Vese is applied to each input image, yielding a mask with positive and negative values. Second, the positive pixels within the central 40% of the image are selected (cluster of pixels P). After using several recognized skin lesion detection applications such as SkinVision, we realized that most of them require images to be centered on the lesion in order to perform an accurate diagnosis [56]. In addition, after reviewing a large number of skin image datasets, we noticed that most images are centered on the lesion, which is an advantage when applying segmentation. Third, all positive clusters (Q) that intersect P are merged, producing a new segmentation mask M, \(M = Q_1 \cup Q_2 \cup \cdots \cup Q_n\), where \(P \cap Q_i \ne \emptyset\) for every \(i\). Finally, the segmented image is obtained by applying the mask M to the original image. In Sect. 5, the proposed segmentation method is compared to other state-of-the-art biomedical segmentation methods to corroborate its effectiveness. The segmented images are then used as input to the architecture described below.
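The sketch below reproduces the described mask-building steps on top of the standard Chan-Vese implementation in scikit-image; the helper name and parameter values are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.measure import label
from skimage.segmentation import chan_vese

def segment_lesion(image, center_frac=0.4):
    # 1) Chan-Vese on the grayscale image yields a positive/negative mask.
    cv = chan_vese(rgb2gray(image))
    # 2) P: positive pixels inside the central 40% window of the image.
    h, w = cv.shape
    dh, dw = int(h * center_frac / 2), int(w * center_frac / 2)
    center = np.zeros_like(cv, dtype=bool)
    center[h // 2 - dh:h // 2 + dh, w // 2 - dw:w // 2 + dw] = True
    # 3) Merge every positive connected component Q_i intersecting P into M.
    components = label(cv)
    hit_labels = np.unique(components[cv & center])
    mask = np.isin(components, hit_labels[hit_labels > 0])
    # 4) Apply the final mask M to the original image.
    return image * mask[..., np.newaxis]
```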

Fig. 1

Preprocessing steps applied before training. The U-Net score is the average of the Dice coefficient and the Jaccard similarity, both previously used as evaluation metrics in the segmentation task of the ISIC-2018 contest. Input image taken from the DERM-LIB dataset

Fig. 2

Example architecture of the proposed ensemble model. The figure shows how a new ensemble is obtained by selecting those CNN models with \(g^j_k>0\). In (a), the chromosome contains the CNN models and their respective epochs; in (b), the ensemble is composed of DenseNet (epoch 33), NASNet (epoch 32) and Xception (epoch 11); in (c), a late fusion approach is used to merge all the representations

Let us say \(\Phi\) is a model with m independent CNN models, which learn representations from the same feature space, plus an extra prediction block (a stack of dense layers) that yields an extra prediction. The prediction block operates on the concatenation of all the representations learned by the individual CNN models. Figure 2 shows a specific example of the proposal. An image i is passed as input to the jth CNN model of \(\Phi\), and a chromosome indicates which tuples of CNN models and weights determine the ensemble's architecture. Each CNN outputs its learned representation (\(r^{j}_{i}\)) and a partial prediction \(\hat{o}^{j}_{i}\) for the label of this sample, using the weights specified in the chromosome, e.g., the weights of DenseNet trained until epoch 33 (\(g_1=33\)). Thereafter, the representations \(r^{j}_{i}\) learned by each CNN model are flattened, concatenated and passed to the prediction block of the model. A late fusion approach is used to concatenate all the representations learned by the CNN models (Fig. 2c), which has proven to obtain better performance than merely combining the individual predictions [57]. Nevertheless, in Sect. 5.5 the prediction block is assessed against its non-use. The weights of the individual CNN models are then frozen and only the prediction block is trained, for 20 epochs, providing an additional prediction (\(\hat{o}^{m+1}_{i}\)) for the label of the sample. In this work, the prediction block is composed of Dense(512 ReLUs) \(\rightarrow\) Dense(256 ReLUs) \(\rightarrow\) Dense(128 ReLUs) \(\rightarrow\) Dense(1 unit, sigmoid). Then, \(\Phi\) predicts the label of an image by a soft-voting procedure over the individual predictions. Regarding efficiency, each CNN model was trained just once and its weights at every epoch of interest were stored on disk.
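A minimal sketch of such a prediction block in Keras, assuming a list of already trained member models whose penultimate-layer outputs serve as the representations \(r^{j}_{i}\); which layer to tap and the freezing details are assumptions.

```python
from tensorflow.keras import layers, models

def build_prediction_block(members):
    # Freeze the individual CNN models; only the dense stack will be trained.
    for m in members:
        m.trainable = False
    inputs = [m.input for m in members]
    # Flatten and concatenate the learned representations (late fusion).
    feats = [layers.Flatten()(m.layers[-2].output) for m in members]
    x = layers.Concatenate()(feats)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)   # extra prediction o^{m+1}
    return models.Model(inputs, out)
```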

Once the prediction for a given training sample is computed, the losses produced by each CNN model and by the prediction block of \(\Phi\) are calculated. The loss produced by the jth CNN model of \(\Phi\) on the ith training image (denoted \(\mathcal {L}^{j}(i)\)) is computed by means of binary cross-entropy. The goal is to iteratively minimize the prediction errors on the training samples. Mini-batch Gradient Descent (MGD) [58] can be applied to solve this optimization problem, since the first derivative of the loss function is well-defined. This algorithm consists of randomly splitting the training set into small batches used to calculate the model error. The method has several advantages: the model update frequency is higher than in batch gradient descent, which allows for a more robust convergence and helps avoid local minima; batch-based updates are computationally more efficient than stochastic gradient descent; and splitting into small batches avoids having to keep all training data in memory.
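For reference, with true label \(y_i \in \{0,1\}\) and partial prediction \(\hat{o}^{j}_{i}\), this loss takes the standard binary cross-entropy form:

$$\begin{aligned} \mathcal {L}^{j}(i) = -\left[ y_i \log \hat{o}^{j}_{i} + (1-y_i) \log \left( 1-\hat{o}^{j}_{i}\right) \right] . \end{aligned}$$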

4 Ensemble model via genetic algorithm

In this work, a genetic algorithm (GA) was designed to find the optimal members of an ensemble model. Next, the different components and steps of the proposed method are explained in detail.

4.1 Individuals and chromosome codification

Let us say that the population of the GA has q individuals \(\{I_1, I_2, \ldots , I_q\}\), where the jth individual (\(I_j\)) has a chromosome encoded as a list of integer values (from 0 to 150), as shown in Fig. 2a. A chromosome is composed of K genes, where \(g^j_k\) represents the kth gene of the individual \(I_j\). Each gene index k is independent of the others and is associated with a specific CNN model. Its value indicates the epoch of the CNN model to be considered as part of the ensemble: if \(g^j_k>0\), the CNN model is selected; otherwise, it is not. Consequently, with this encoding, each individual of the population represents a full solution to the problem.

4.2 Fitness function

The fitness function used to evaluate the individual \(I_j\) can be calculated as

$$\begin{aligned} f_j= \frac{1}{n \times p}\sum _{i=1}^{n} \left[ \mathcal {L}_{m+1}^{j}(i) + \sum _{\substack{k=1 \\ g^j_k>0}}^{m} \mathcal {L}_k^{j}(i)\right] , \end{aligned}$$
(1)

where \(\mathcal {L}_k^{j}(i)\) is the loss value for the ith image in the CNN model with index k, and \(\mathcal {L}_{m+1}^{j}(i)\) is the loss value of the prediction block; p is the number of times that \(g^j_k>0\) is fulfilled; m represents the total number of CNN models encoded in the chromosome; and n is the total number of images. In summary, for each individual, the fitness function computes the average loss over the chosen CNN models and the prediction block; a lower average means a more desirable individual.
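A minimal sketch of Eq. (1), assuming the per-image losses have been precomputed; `losses[k][i]` (loss of CNN model k on image i) and `block_loss[i]` (loss of the prediction block) are hypothetical containers.

```python
def fitness(chromosome, losses, block_loss):
    # Indices of the CNN models selected by the chromosome (gene > 0).
    selected = [k for k, g in enumerate(chromosome) if g > 0]
    p, n = len(selected), len(block_loss)
    total = sum(block_loss[i] + sum(losses[k][i] for k in selected)
                for i in range(n))
    return total / (n * p)   # lower fitness = more desirable individual
```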

4.3 Creation of the initial population

Maintaining a diverse population, especially in the early iterations of the algorithm, is crucial to ensure a good exploration of the search space. The number of training epochs (e) and the number of CNN models (m) determine the total number of possible combinations in the search space. To guarantee diversity, the chromosome of each individual \(I_j\) of the population is randomly created, but repeated individuals are not allowed; in this manner, it is possible to avoid early convergence of the method to local minima. In this work, an individual is considered repeated when all genes \(g_k^j\) in the chromosome are identical. For example, let \(g_k^A\) and \(g_k^B\) be the genes of chromosomes A and B, respectively, where k is the index of a CNN model (e.g., DenseNet, InceptionV3 or Xception) and m is the maximum number of CNN models in each ensemble. A and B are identical if \(g_k^A=g_k^B\) for every \(k=1,2,\ldots,m\). Conversely, individuals are not repeated if they use a different CNN in any of their genes, i.e., two models are considered different if they have different architectures, or the same architecture but different weights.
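A minimal sketch of this duplicate-free random initialization; the function name and the epoch range are illustrative.

```python
import random

def init_population(q, m, max_epoch=150):
    # Chromosomes of m integer genes in [0, max_epoch]; resample until q
    # distinct individuals (no two identical at every locus) are obtained.
    population, seen = [], set()
    while len(population) < q:
        chrom = tuple(random.randint(0, max_epoch) for _ in range(m))
        if chrom not in seen:
            seen.add(chrom)
            population.append(list(chrom))
    return population
```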

4.4 Parent selection

The parents are selected by a tournament selection procedure to create the intermediate population [59]. A tournament size of 2 was used in this work; the smaller the tournament size, the lower the selection pressure, so the search space can be explored more widely. To achieve this, two individuals are randomly selected and compared, and the better one is selected with replacement, i.e., this individual can be selected again in further rounds. This process is repeated until the required number of individuals is reached. As the generations progress, the algorithm can focus increasingly on promising regions of the search space.
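A minimal sketch of this size-2 tournament, assuming lower fitness values are better (per Eq. 1):

```python
import random

def tournament_select(population, fitnesses, size=2):
    # Sample `size` individuals at random and keep the fittest one.
    contestants = random.sample(range(len(population)), size)
    winner = min(contestants, key=lambda idx: fitnesses[idx])
    return population[winner]

# Selection is with replacement: the intermediate population is filled by
# repeating the tournament until it reaches the population size, e.g.:
# parents = [tournament_select(pop, fit) for _ in range(len(pop))]
```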

4.5 Genetic operators

Figure 3 shows the genetic operators applied. First, a custom flat crossover [60, 61] is performed with crossover rate \(p_c\), where the general mechanism is adapted to an integer representation scheme. Flat crossover is applied to each pair of genes located at the same locus and representing the same CNN model; as a result, a random integer value is chosen from the interval \([g^1_{k}, g^2_{k}]\). Once the new offspring is generated, a one-point mutation operator is applied with probability \(p_m\). The mutation operator randomly selects a gene and changes its epoch value to a new value selected from the range of possible epochs, e.g., [0, 150]. Finally, the offspring is obtained. It is noteworthy that these operators always yield valid individuals.
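A minimal sketch of both operators; producing a single child per crossover and the default rates shown are simplifying assumptions.

```python
import random

def flat_crossover(parent_a, parent_b, p_c=0.9):
    # For each locus, draw the child's gene uniformly from the integer
    # interval spanned by the two parents' genes at that locus.
    if random.random() > p_c:
        return list(parent_a)
    return [random.randint(min(a, b), max(a, b))
            for a, b in zip(parent_a, parent_b)]

def one_point_mutation(chromosome, p_m=0.3, max_epoch=150):
    # Re-draw one randomly chosen gene from the range of possible epochs
    # [0, max_epoch]; a value of 0 excludes that CNN model from the ensemble.
    child = list(chromosome)
    if random.random() < p_m:
        k = random.randrange(len(child))
        child[k] = random.randint(0, max_epoch)
    return child
```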

Fig. 3

Example of the genetic operators applied in the genetic algorithm

4.6 Population update

In this work, a generational elitist algorithm [62] was used to update the population from one generation to the next. As a result, the best individual in the last generation is the best individual of the whole evolution. To achieve this, the population in each generation keeps all new individuals, as long as the best parent is not better than all the children; in such cases, the best parent replaces the worst child. The worst child is determined by sorting the individuals of the new population according to their fitness values. After replacing the worst child, the new population replaces the previous one. At the end of the algorithm, the best individual in the last generation is the best ensemble. In this work, the number of individuals generated in each generation is 100 and the population size is kept constant in order to alleviate the computational cost of the CNNs. However, these values could be tuned depending on the context, including the possibility of increasing or decreasing them. Regarding efficiency, a map from each explored individual to its corresponding fitness value is cached.

Fig. 4

Integration of the genetic algorithm in the training phase

4.7 Integration of the genetic algorithm in the training phase

Figure 4 shows how the training phase was conducted. First, the source images are preprocessed using the method explained above. Second, given a set of CNN models, the initial population of the GA is created. For example, if one individual has \(g_1^j=50\) and another has \(g_1^{j+z}=51\) (\(z>0\)), the latter is obtained by training \(g_1^j\) for one more epoch. In general, each CNN model was trained just once, when needed, and its weights at every demanded epoch were stored on disk; in the worst case, each CNN model is trained once for 150 epochs. Third, each ensemble represented by a chromosome is trained for n epochs, bearing in mind that only the prediction block is updated. Then, the fitness of each individual of the current population is calculated by considering the loss obtained on the training images by the members of the ensemble. Fourth, the parents are selected, crossover and mutation are performed, and a new population is created. Fifth, the population of the GA is updated. Steps three to five are repeated until one of the stopping criteria is satisfied. In this work, we applied the most frequently used stopping criterion, a specified maximum number of generations; in addition, we stopped the search when an individual achieved the top performance. Finally, the ensemble model \(\Phi\) is obtained.

5 Experimental study

This section describes the experimental study carried out in this work. First, the datasets and the experimental protocol are portrayed, and then, the experimental results and a discussion of them are presented.

5.1 Datasets

Table 1 shows a summary of the benchmark datasets. The UDA [66], MSK [67], HAM10000 [65] and BCN20000 [63] datasets are included in the ISIC repository. Their images are composed strictly of melanocytic lesions that are biopsy-proven and annotated as malignant or benign. Moreover, images from BCN20000 would be considered hard to diagnose and had to be excised and histopathologically diagnosed. PH2 [69] comprises dermoscopic images, clinical diagnoses and the identification of several dermoscopic structures. The Dermofit Image Library [64] gathers 1,300 focal high-quality skin lesion images captured under standardized conditions. Each image has a diagnosis based on expert opinion and, like PH2, includes a binary segmentation mask that denotes the lesion area. MED-NODE [68] collects 170 non-dermoscopic images from common digital cameras; this type of image is very important for testing the models on data from affordable devices. SDC-198 [70] contains 6,584 real-world images from 198 categories to encourage further research and application in real-life scenarios. Finally, DERM7PT [21] is a benchmark dataset composed of clinical and dermoscopic images, allowing an assessment of how different it is to use dermoscopic images versus images taken with digital cameras.

Table 1 Summary of the benchmark datasets

Only the images labeled as melanoma and nevus were considered, 36,703 images in total. Most datasets present a high imbalance ratio (ImbR), up to ten times in the case of MSK-3, which commonly hampers the learning process. The intra-class (IntraC) and inter-class (InterC) metrics show the average distances between images belonging to the same class and between images belonging to different classes, respectively. Both metrics were computed using the Euclidean distance, with each image i represented as a vector. The ratio (DistR) between these metrics showed that both distances are similar, which commonly indicates a high degree of overlap between classes. Finally, the silhouette score (Silho) [71] was calculated, representing how similar an image is to its own cluster compared to other clusters. The results indicated that images were not well matched to their own cluster, and even samples belonging to different clusters are close in the feature space.
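A minimal sketch of how these separability statistics can be computed, assuming each image has been flattened into one row of X with binary labels y; the exact DistR definition used in Table 1 is an assumption here.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

def class_overlap_stats(X, y):
    d = pairwise_distances(X, metric="euclidean")
    same = y[:, None] == y[None, :]
    off_diag = ~np.eye(len(y), dtype=bool)
    intra = d[same & off_diag].mean()          # IntraC: same-class distance
    inter = d[~same].mean()                    # InterC: cross-class distance
    return {"IntraC": intra, "InterC": inter,
            "DistR": intra / inter,            # ratio near 1 => heavy overlap
            "Silho": silhouette_score(X, y, metric="euclidean")}
```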

5.2 Experimental settings

Firstly, the proposed segmentation algorithm was applied to PH2 and DERM-LIB; both datasets were manually segmented by expert dermatologists and are commonly used as benchmarks due to their quality [64, 69]. The aim was to determine how close the method comes to expert segmentation. We also compared the segmentation algorithm to U-Net and R2U-Net, which are CNN architectures designed for biomedical image segmentation; bear in mind that these architectures need prior training. In order to obtain the segmentation mask of an image X, the CNN architectures were trained for 150 epochs with 10% of the images as a validation set and the rest as a training set. The best model obtained during validation was applied to the image X to obtain a segmentation mask. This procedure was applied to both the PH2 and DERM-LIB datasets.

The second phase aimed at analyzing the effect of the GA's hyperparameters on the model's performance. The model was dubbed Genetic Algorithm Programming-based Ensemble CNN (GAPE-CNN). It should be noted that the main aim of this work was to perform a preliminary study assessing the effectiveness that can be attained by the proposed architecture in melanoma diagnosis. Consequently, only the hyperparameter values listed in Table 2 were considered, because including more settings would require large computational resources for training the CNN models. Three different numbers of generations were evaluated; larger values can lead to a larger number of possible ensembles, but increase the training cost. Three crossover probabilities were tested; higher values increase the probability that the genetic information of parents is combined to generate new individuals. Three mutation probabilities were also evaluated; higher values make it easier to escape from local minima and add more exploration of the search space.

Table 2 Basic configuration used

In the third phase, the proposal was compared to the following state-of-the-art CNN models that have previously been used in melanoma diagnosis: InceptionV3 [3], DenseNet [25], VGG16 [10], MobileNet [24], Xception [26] and NASNetMobile [72, 73]. Table 2 shows the configuration used to train all the models: the learning rate \((\alpha )\) was set to 0.01 and reduced by a factor of 0.2 if no improvement in predictive performance was observed for 10 epochs; the weights of the networks were initialized using the Xavier method [74] wherever transfer learning was not present, e.g., in the prediction block and the baseline CNN models; a batch size of 8 was used due to the medium size of the datasets, and the models were trained for 150 epochs. Mini-batch gradient descent was used for training the models; it is one of the most widely used optimizers for training CNNs and, despite its simplicity, performs well across a variety of applications [75] and has been successfully applied to training networks for melanoma diagnosis [7, 18, 76]. In this work, no tuning process was carried out, so the results cannot be attributed to over-adjustment. The datasets used in this work correspond to binary classification problems, so the cost function used for training was the average of the binary cross-entropy over all training samples.

Data augmentation was mainly applied to tackle the imbalance problem in melanoma diagnosis by applying and combining random rotation-based, flip-based and crop-based transformations on the original images. Bear in mind that color-based transformations were not considered, in order not to alter the color space, which is important for the diagnosis of melanoma [77]; as a result, the changes performed during data augmentation do not change the labels of the samples. Data augmentation was applied to both training and test data, which can increase performance significantly [18]. After splitting a dataset into training and test sets, the training data were balanced by creating new images until the number of melanoma images equaled the number of normal ones, and the generated training images were considered independent of the original ones. On the other hand, the test data were expanded by randomly augmenting each test image at least ten times, with the generated images remaining linked to their original. Consequently, given an original test image X, the class probabilities for X and its related set of images \(S_X\) were averaged to yield the final prediction; thus any CNN model behaves like an ensemble, where the final probabilities for a test image are computed using a soft-voting strategy.
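A minimal sketch of this test-time soft voting, assuming a trained Keras-style model and an `augment_fn` like the pipeline sketched in Sect. 2:

```python
import numpy as np

def tta_predict(model, image, augment_fn, n=10):
    # Average the class probability over the original test image and n
    # randomly augmented variants of it (soft voting).
    batch = np.stack([image] + [np.asarray(augment_fn(image)) for _ in range(n)])
    return float(model.predict(batch).mean())
```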

5.3 Evaluation process

In order to quantitatively evaluate the segmentation method, several performance metrics are considered, including accuracy (ACC), F1-score, Dice coefficient (DC), Jaccard similarity (JS), and the U-Net score (UNS), which is the average of DC and JS. Accuracy and F1-score are calculated using Eqs. 2 and 3, the Dice coefficient and Jaccard similarity using Eqs. 4 and 5, and the U-Net score using Eq. 6:

$$\begin{aligned}&ACC = \frac{TP+TN}{TP+TN+FP+FN}, \end{aligned}$$
(2)
$$\begin{aligned}&F1 = \frac{TP}{TP+\frac{1}{2}(FP+FN)}, \end{aligned}$$
(3)
$$\begin{aligned}&DC = 2\frac{|GT \cap SR|}{|GT| + |SR|}, \end{aligned}$$
(4)
$$\begin{aligned}&JS = \frac{|GT \cap SR|}{|GT \cup SR|}, \end{aligned}$$
(5)
$$\begin{aligned}&UNS = \frac{DC+JS}{2}, \end{aligned}$$
(6)

where TP, FP, TN and FN represent correctly selected pixels, incorrectly selected pixels, correctly discarded pixels and incorrectly discarded pixels, respectively; GT and SR denote the ground-truth pixels and the segmentation result, respectively. The comparison between the segmentation methods is performed using UNS, which summarizes DC and JS. These metrics have previously been used in the segmentation task of the ISIC-2018 contest.

Regarding the evaluation of the CNN models, a 3-times 10-fold cross-validation was performed on the datasets, and the results were averaged across all fold executions. In each fold, the Matthews Correlation Coefficient (MCC) was used to measure the predictive performance of the models. MCC is widely used in bioinformatics as a performance metric [78], and it is especially suited to analyzing predictive performance on unbalanced data. MCC is computed as:

$$\begin{aligned} \small { MCC=\frac{t_p\times t_n - f_p\times f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}, } \end{aligned}$$
(7)

where \(t_p\), \(t_n\), \(f_p\), and \(f_n\) are the number of true positives, true negatives, false positives, and false negatives, respectively. The MCC value always lies in the range \([-1,1]\), where 1 represents a perfect prediction, 0 indicates a performance similar to random prediction, and -1 an inverse prediction.

Finally, non-parametric statistical tests were used to detect whether there was any significant difference in predictive performance. Friedman's test [79] was conducted where a multiple comparison was carried out; Hommel's post hoc test [80] was employed for multiple comparisons against a control method; Shaffer's post hoc test [81] was employed for pairwise comparisons; and Wilcoxon's signed-rank test [82] was performed where only two individual methods were compared. All hypothesis testing was conducted at 95% confidence.

5.4 Software and hardware

The experimental study (training and testing) was executed on Ubuntu 18.04 with four NVIDIA GeForce RTX 2080-Ti GPUs (11 GB GDDR6 each) and four NVIDIA GeForce GTX 1080-Ti GPUs (11 GB GDDR5X each). All experiments were implemented in Python v3.6, and the CNN models were developed using the Keras framework v2.2.4 [83] as a high-level API with TensorFlow v1.12 [84] as the backend. In addition, the proposals were implemented in a web-based application to assist dermatologists in decision making. Amazon Web Services was used as the deployment platform, providing secure and resizable compute capacity in the cloud. Regarding efficiency, all TensorFlow models were converted to TensorFlow Lite models; TensorFlow Lite is a lightweight library for deploying models using minimal computational resources. Table 3 shows the minimum amount of hard drive space, RAM and inference time required by each CNN model; the time was measured when processing one sample. Bear in mind that these values were measured at deployment time using a CPU, not a GPU. As expected, MobileNet was the lightest model and required the fewest resources. The largest difference between using TensorFlow Lite or not concerns RAM: this technology saves more than ten times the memory compared to the default deployment. As a result, the models needed only 510 MB of RAM and one CPU core to perform inference. In this work, a basic Amazon EC2 t2.micro instance was selected, with 1 GB of RAM and one CPU core (\$5.26/month), saving \$36.72/month compared to an Amazon EC2 t2.large (\$41.98/month), which is the closest instance type capable of supporting more than 5 GB of RAM. A sketch of the conversion is shown after Table 3.

Table 3 Summary of the computational resources at deployment time; "D" and "L" denote default tensorflow-cpu and tensorflow-lite, respectively; HDD includes both architecture and weights
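A minimal sketch of the conversion referenced above, using the TensorFlow 2.x converter API; the original deployment used TF 1.12, whose converter interface differs slightly, and the file name is illustrative.

```python
import tensorflow as tf

# `model` is assumed to be a trained Keras model (e.g., an ensemble member).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("gape_cnn.tflite", "wb") as f:
    f.write(tflite_model)
```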

5.5 Results and discussion

In this section, the main results are presented. Additional material supporting this work can be found on the accompanying web page.

5.5.1 Segmentation methods

Table 4 shows the segmentation performance obtained on a small sample (594 images in total). Both datasets contain high-quality images and manual segmentations performed by expert dermatologists. Each image was evaluated using UNS and the other metrics. The proposed segmentation method obtained the best average performance on both datasets, with 81% and 80% UNS on PH2 and DERM-LIB, respectively. In addition, Fig. 5 shows the Friedman's test ranking, where the proposed method outperformed both U-Net and R2U-Net. On DERM-LIB, Friedman's test rejected the null hypothesis with a p-value < 2.2E-16; the proposal was ranked first, and Shaffer's post hoc test, conducted afterward, showed that it achieved significantly better performance than U-Net and R2U-Net, while U-Net in turn significantly outperformed R2U-Net. On PH2, in contrast, no significant differences were found between the three methods: Friedman's test did not reject the null hypothesis, with a p-value of 6.476E-1 (the test was conducted with two degrees of freedom, and the Friedman statistic was 86.898E-2). However, the proposal was ranked first and attained 111% and 74% less variance than U-Net and R2U-Net, respectively. In this way, the CNN models can focus on relevant pixels, easing the learning of better abstract and discriminative features for melanoma diagnosis. Moreover, the proposal did not require prior training, a clear advantage over CNN-based methods, and it can be run with only a CPU, avoiding the significant computational power required for GPU-based training. Finally, R2U-Net obtained better performance than U-Net on PH2, but the opposite occurred on DERM-LIB; a larger number of datasets would be needed to validate the differences between the two methods for skin lesion diagnosis.

Table 4 Preprocessing performance obtained in DERM-LIB dataset
Table 5 Average MCC values obtained by using six state-of-the-art CNN models.
Fig. 5

All pairwise comparisons between the proposed method and the state-of-the-art biomedical segmentation methods in DERM-LIB. The null hypothesis was rejected with adjusted p-value < 2.2E-16

Fig. 6

Each sub-figure shows the methods ordered from left to right according to the ranking computed by Friedman's test; the proposed segmentation method achieved the best performance for all CNN models. The lines summarize the significant differences found by Shaffer's post hoc test, such that groups of models that are not significantly different (at α = 0.05) are connected by a line

In addition, we corroborated that CNN models trained with segmented data achieved better performance. All models achieved their best performance using segmented data, and on average the best ones were MobileNet, DenseNet, and InceptionV3, in that order. Although MobileNet was designed with efficiency in mind, it managed to surpass more complex CNN models. Nevertheless, DenseNet and InceptionV3 surpassed MobileNet on BCN20000, which is the largest and one of the most complex datasets. Skin image datasets are commonly small, which is where MobileNet usually achieves its best performance [85, 86]. Although these results could also be caused by the use of default hyperparameters, all CNN models shared the same conditions in order to avoid advantaging any of them. The number of epochs could be another cause: it is traditionally large when training CNN models, often hundreds or thousands, allowing the learning algorithm to run until the model error has been sufficiently minimized, and it is common to find examples in the literature and in tutorials where the number of epochs is set to 500, 1,000 or more. However, in this work we used 150 epochs, mainly because of the number of available images, and even then all models and the proposal achieved competitive performance.

UDA-2 was the most challenging dataset: on average, the models achieved only 44% MCC. UDA-2 has the lowest silhouette value (0.020), indicating a high level of overlap between classes, which increases the difficulty. The overall best performance was achieved on the DERM-LIB, PH2 and HAM10000 datasets, with 90%, 85% and 79% MCC, respectively. These datasets have the three highest silhouette values, meaning a low level of overlap between images, which makes the task easier. Table 5 and Fig. 6 summarize the results obtained after training the six CNN models with the segmentation method. The models in Fig. 6 are ordered from left to right according to the ranking computed by Friedman's test, where the proposed segmentation method achieved the best performance on all datasets, followed by TDA. Friedman's test rejected the null hypothesis for all CNN models. Overall, all CNN models using segmented data presented better results than without it, proving suitable for melanoma diagnosis. Shaffer's post hoc test found that InceptionV3 and Xception significantly improved their performance when segmented data was used, compared to the baseline and to transfer learning combined with data augmentation. All baseline CNN models were significantly surpassed by the other techniques. The results showed that the proposed segmentation method was able to improve the performance of all CNN models considered in this work. In the following sections, all CNN models are compared using segmented data, which has already proven to achieve the best performance.

5.5.2 Analyzing the impact of three main hyperparameters

First of all, the advantages of using the extra prediction block were analyzed. The best epochs from each CNN model were combined and all possible ensembles were evaluated; in the end, 57 ensembles were obtained after discarding the empty set and the one-element sets. The best one was selected as the baseline (BL) and included in the comparison. In addition, another baseline model was obtained by applying the same procedure but simply combining the predictions of the CNN models without the prediction block (BLs). Table 6 shows the average MCC values on the test sets comparing BL versus BLs. The results show that the models using the extra prediction block obtained the best performance on all datasets compared to BLs. Finally, Wilcoxon's test rejected the null hypothesis with a p-value of 2.189E-4, confirming the benefit and effectiveness of the prediction block for combining the features of the individual CNN models. Henceforth, the rest of the experimental study was executed using the proposed architecture.

Table 6 Average MCC values on test sets; BL represents the best ensemble obtained by combining the individual best models from each architecture and BLs represents the same as BL, but simply merging the predictions without using the extra prediction block
Table 7 Average MCC values on test sets; g, \(p_c\) and \(p_m\) represent generations, crossover rate and mutation rate, respectively

Table 7 shows the average MCC results attained by the proposed genetic algorithm using different values of g, \(p_c\) and \(p_m\). For each dataset, the best MCC value is highlighted in bold. Overall, all models obtained high performance, the lowest and highest average performances being 96.9% and 97.8%, respectively. The models attained their best performance on the MSK-3, PH2, DERM-LIB and SDC-198 datasets. Regarding the parameters controlling the GA, the results showed that the higher the number of generations, the better the predictive performance; a higher number of generations implies more exploration and a higher chance of creating offspring with a more accurate set of CNN models. Furthermore, the best results peaked when using a high mutation probability, suggesting that a higher mutation probability leads to better exploration of good sets of epochs. Finally, it was observed that 250 generations are enough to obtain a performance that is statistically similar to the top performance obtained with 400 generations.

The Rank column shows the average ranking computed by Friedman's test, which found that \(g=400\), \(p_c=90\%\) and \(p_m=30\%\) were the best settings for the GA, obtaining the best predictive performance on average. Friedman's test rejected the null hypothesis with a p-value of 8.771E-15 (Friedman statistic of 124.28 with 26 degrees of freedom). Afterward, Hommel's post hoc test was conducted with this configuration as the control method, which significantly outperformed 50% of all other considered configurations. Next, the best GA is compared to several state-of-the-art CNN models that have previously been used in melanoma diagnosis and that are, at the same time, the baselines used to build the ensemble.

5.5.3 Comparing with state-of-the-art CNN models

Table 8 shows the fold changes between the proposed model and each of the state-of-the-art CNN models. The BL ensemble surpassed all individual state-of-the-art models, except on DERM-LIB, where DenseNet201 and MobileNet surpassed it by a small margin. Note that these models are the same ones considered by the GA to build the ensemble, so this comparison plays another role: it corroborates that ensemble learning is more suitable for melanoma diagnosis than individual models. The results were very promising, since the proposed GAPE-CNN achieved the highest MCC values on all datasets. It is noteworthy that the proposal achieved a predictive performance 165% and 130% higher than Xception and InceptionV3, respectively, on the UDA-2 dataset. It also achieved high performance on the two largest datasets. The best average predictive performance was observed on the DERM-LIB and PH2 datasets, with MCC values of 98% and 96%, respectively. The lowest overall predictive performance was observed on UDA-2, with 56% MCC; however, even there, the proposal was at least 62% and 21% better than all the individual CNN models and the BL, respectively. Overall, the worst performance was attained by VGG16.

Table 8 Average MCC values of the different models on each dataset
Table 9 Average MCC values on test sets; BL represents the best ensemble obtained by combining the individual best model from each architecture and GAPE-CNN represents our proposal

Table 9 summarizes the average MCC values on the test sets comparing the baseline versus GAPE-CNN. The proposal surpassed BL on all datasets, and Wilcoxon's test rejected the null hypothesis with a p-value of 2.189E-4, confirming the benefit and effectiveness of using genetic algorithms to learn the set of CNN models from which an ensemble is built. Each independent CNN model provides a partial prediction, which is aggregated to yield a final decision. The architecture follows an ensemble approach, which has been demonstrated to be an effective way to improve the learning process in many real-world problems [87]. Furthermore, the proposed model applies transfer learning and data augmentation techniques. Data are augmented not only at the training stage but also at the test stage, thus allowing better predictive performance. Augmenting data in both phases has been shown to be an effective way to improve melanoma diagnosis [88], and it is also an excellent approach to cope with imbalanced data and with the high inter- and intra-class variability present in most skin image datasets.

5.5.4 Dermoscopic versus non-dermoscopic images

Figure 7 shows the average performance attained by the models when the datasets are grouped into dermoscopic and non-dermoscopic ones. The results showed that the proposed architecture attained the best performance regardless of the type of image, denoting the effectiveness of the approach. All CNN models attained their best performance on dermoscopic images, by a slight margin. NASNetMobile was the architecture that benefited most from dermoscopic images, with an improvement of about 8%. The biggest improvement was found in the comparison with VGG: the proposal outperformed VGG on dermoscopic and non-dermoscopic images by 31% and 30%, respectively. The second-best performance behind the proposal was achieved by the BL on both types of images; the proposal surpassed it by 5% in both cases. To sum up, the results obtained through the experimental study revealed that GAPE-CNN was effective for diagnosing melanoma, attaining better predictive performance than the state-of-the-art models.

Fig. 7

Average MCC values on test sets, grouping the datasets into dermoscopic (green bars) and non-dermoscopic (orange bars); "D": DenseNet201; "I": InceptionV3; "M": MobileNet; "N": NASNetMobile; "V": VGG16; "X": Xception; "B": baseline; "G": our proposal

6 Conclusions

It is clear that automatic melanoma diagnosis via deep learning models is a challenging task, mainly due to the lack of data and the differences even between samples from the same category. We addressed these problems through a series of contributions. First, to preprocess the data, we validated and applied an extension of the Chan-Vese algorithm. The segmentation masks indicated that the proposal achieved better performance than state-of-the-art segmentation methods, and the results showed that all CNN models improved their performance when using segmented data. Second, the training and testing data were enriched using data augmentation, reducing overfitting and obtaining transformation-invariant models. Third, we increased performance by applying transfer learning from ImageNet-pretrained weights; transfer learning also alleviated the need for a large amount of training data. Finally, to further improve the discriminative power of the CNN models, we proposed a novel ensemble method based on a genetic algorithm, which finds an optimal set of CNN models. An extensive experimental study was conducted on sixteen image datasets, demonstrating the utility and effectiveness of the proposed approach, which attained high predictive performance even on datasets with complex properties. The results also showed that the proposed ensemble model is competitive with state-of-the-art computational methods.

Future work will conduct more extensive experiments to validate the full potential of the proposed architecture, for example by considering a wider set of hyperparameters to be tuned as well as a larger number of datasets. Finally, it is noteworthy that our approach benefits from combining different CNN models in two ways: abstract features are combined in the extra prediction block, and the individual predictions of the ensemble and of that block are combined to obtain a diagnosis. This approach is not restricted to melanoma diagnosis and could be applied to other real-world problems in the future.