Introduction

Loss of property and human life due to earthquake-triggered landslides is significantly high and is expected to increase due to climate change (Froude and Petley 2016; Gariano and Guzzetti 2016). About 47,000 earthquake-induced landslide casualties were reported from 2004 to 2010 (Petley 2012). Earthquake-induced landslides (EQIL) have direct and indirect long-term socioeconomic and environmental effects (Fan et al. 2018). The direct and indirect effects of landslides, for example, through the formation and breakout of landslide dams, are a significant natural hazard in the mountain regions of the Himalayas (Dhital 2015). Studies show unprecedented loss to both human lives and the economy in the Himalayan regions due to landslides, contributing up to 30% of the world’s total landslide-related damage value (Dahal and Hasegawa 2008; Haigh and Rawat 2011). In Northern India, for example, during the recent 2021 Uttarakhand landslides, 24 people were killed by landslides and around 150 were missing (Meena et al. 2021a). A large number of people are affected in the Himalayan regions by small and large-scale landslides, especially during the monsoon seasons (Khanal and Watanabe 2005; Thapa and Dhital 2000; Upreti and Dhital 1996). Although landslides often occur in remote areas, the resulting catastrophic flash floods from landslide dam outbreak cause extensive damage to settlements, hydroelectric projects, and agriculture fields in the downstream areas (Meena and Tavakkoli Piralilou 2019).

To better analyze the frequency and distribution of landslides, there is a growing demand for event-based inventories that can be used to determine the probability of landslide occurrence in space and time as a basis for hazard and risk assessment. There is still insufficient information on landslide occurrences for many areas to make reliable hazard maps (Reichenbach et al. 2018). Landslide susceptibility and hazard modeling require accurate and complete landslide inventory datasets. This inventory dataset is usually used for training hazard models to find potential landslide-prone areas (Guzzetti et al. 2012).

The accuracy and completeness of landslide inventory datasets are essential for making spatial predictions for future events (Hakan and Luigi 2020). Because event-based landslide inventories often have to be mapped in remote and mountainous areas, remote sensing data are the primary source of information for mapping these events (Chen et al. 2018).

Landslide boundaries can be detected in remote sensing images with pixel-based, feature-based, and object-based classification techniques (Lu et al. 2020; Su et al. 2020). Pixel-based methods classify each pixel individually and therefore do not take the spatial context into account. In contrast, feature-based methods (such as the gray level co-occurrence matrix and principal component analysis) (Whitworth et al. 2002) and object-based image analysis (OBIA) explicitly leverage the spatial information in satellite images (Bacha et al. 2020; Hölbling et al. 2012; Martha et al. 2010). During the last decade, machine learning models, and deep learning models such as Convolutional Neural Networks (CNNs) in particular, have been applied successfully to a broad range of image segmentation and object detection tasks (Ding et al. 2016; Ghorbanzadeh et al. 2020; Jin et al. 2019; Liu et al. 2019; Shi et al. 2020).

CNN models have yielded promising results for the classification of aerial images (Bui et al. 2019; Ghorbanzadeh et al. 2021, 2020; Meena et al. 2021b; Yu et al. 2017). Numerous studies have used CNNs for landslide detection (see Table 1). Many authors used CNN models for automated landslide detection in mountainous regions with multi-temporal high-resolution remote sensing data or mono-temporal medium-resolution image data (Chen et al. 2018), whereas others optimized their models and compared them with existing baseline models such as Fully Convolutional Networks (FCNs) (Lei et al. 2019). Hyperspectral data for landslide detection were first investigated by Ye et al. (2019). Recent studies have combined topographical factors, such as elevation and its derivatives (slope, aspect, and curvature), with remote sensing data to improve landslide detection (Sameen and Pradhan 2019; Liu et al. 2020b; Prakash et al. 2020).

Table 1 Overview of some recently published studies on the automated mapping of landslides using deep learning approaches sorted according to the use of topographical features

Deep learning models usually require a large amount of training data to detect objects efficiently. However, since landslide inventories are generated for small areas using manual interpretation and fieldwork, they commonly contain just a few samples, which limits the training of deep learning models (Chen et al. 2020; Liu et al. 2020a; Qi et al. 2020). Therefore, the main objective of this study was to evaluate and compare the performance of machine learning and deep learning models trained with a small dataset composed of only 239 landslide polygons (55 polygons for training and 184 polygons for testing). The fully convolutional U-Net deep learning model and other machine learning models were trained for landslide detection with 5-m RapidEye optical satellite imagery and resampled 12.5-m ALOS PALSAR digital elevation data.

Study area

The study area is located in Rasuwa district, Nepal, in the Higher Himalayas, and was one of the regions most affected by the 2015 Gorkha earthquake (see Fig. 1). Most of the study area falls within Langtang National Park, and there are several hydropower projects along the Trishuli River. The landslides triggered by the earthquake damaged hydropower plants, agricultural land, and human settlements. On 25 April 2015, during the Gorkha earthquake, more than 80 people were killed by EQILs and flood events near the hydropower plant camps at Mailung village. Local authorities and foreign institutes have made several attempts to study the impact of landslides on human settlements and hydropower plants in the region. However, field visits are not feasible in many inaccessible hilly areas, so remote sensing tools can help supplement them. The study area is also strongly affected by monsoonal rains, and every year several deep-seated landslides are reactivated, such as the one near Ramche village.

Fig. 1

A Location of the study area in Nepal, B landslide training and testing zones in the study area, and C sampling points along the center line of the landslide polygons (black) and non-landslide class (purple)

Data used and methodology

Datasets

The landslides were visually interpreted as polygons from RapidEye imagery acquired on 04 November 2016 (Planet Labs Inc.) and field observations. The data has 5 m spatial resolution in five spectral bands: blue (440–510 nm), green (520–590 nm), red (630–685 nm), red-edge (690–730 nm), and near-infrared (760–850 nm) (RapidEye 2011).

A total of 239 landslide polygons were mapped in the entire study area, 55 in the training zones and 184 in the test zone (the training zones are shown in yellow and the testing zones in red in Fig. 1c). For training the models, 117 sampling points were manually selected along the centerline of the landslide polygons in the training zones. Another 57 points were selected outside the landslide polygons to represent non-landslide samples (see Fig. 1c). Therefore, a total of 174 sampling points were used to train the models. Each point was used as the centroid of training patches of four sizes: 16 × 16, 32 × 32, 64 × 64, and 128 × 128 pixels (Fig. 2).

Fig. 2

Conceptualization of generating the patches for training the models
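The patch generation around the sampling points can be illustrated with a minimal sketch. It assumes a single-band image stored as a nested list and a hypothetical sampling point; the function name `extract_patch` and the choice to skip windows that cross the image border are illustrative assumptions, not the procedure used in this study.

```python
def extract_patch(image, row, col, size):
    """Return a size x size window of `image` centred on (row, col).

    `image` is a 2-D list of pixel values (one band). Returns None when the
    window would cross the image border (an assumption made for this sketch).
    """
    half = size // 2
    top, left = row - half, col - half
    if top < 0 or left < 0:
        return None
    if top + size > len(image) or left + size > len(image[0]):
        return None
    return [r[left:left + size] for r in image[top:top + size]]


# Toy 200 x 200 image whose pixel value encodes its own position.
image = [[r * 200 + c for c in range(200)] for r in range(200)]
point = (100, 100)  # hypothetical sampling point (row, col)

# One patch per patch size, all centred on the same sampling point.
patches = {s: extract_patch(image, point[0], point[1], s) for s in (16, 32, 64, 128)}
for size, patch in patches.items():
    print(size, len(patch), len(patch[0]))
```

For a multi-band dataset the same window would be cut from every band, so a 16 × 16 patch of dataset-2 would hold seven such windows.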

Two datasets were created to train the models. Dataset-1 consists of the five spectral bands (RGB, red-edge, NIR) from the RapidEye satellite. Dataset-2 consists of the same five bands and two extra topographical bands (elevation and slope). The elevation and slope data were derived from a digital elevation model (DEM), resampled to 5-m spatial resolution, from the Phased Array type L-band Synthetic Aperture Radar (PALSAR) of the Advanced Land Observing Satellite (ALOS).

All models were trained with the same data so that their results could be compared directly. The deep learning models were trained with the Python library TensorFlow 2.0 and the machine learning models with Scikit-Learn.

Classifiers

U-Net model

U-Net (Ronneberger et al. 2015) is a state-of-the-art deep learning model used for semantic segmentation tasks. This model has an encoder-decoder architecture shaped like the letter “U” (Fig. 3). The encoder path is composed of blocks of two 3 × 3 convolutional layers followed by a 2 × 2 max-pooling layer. The convolutional layers apply 3 × 3 moving windows that slide across the image, calculating a dot product at each position, which can be summarized by Eq. 1 (Zhang et al. 2018):

Fig. 3

The architecture of the U-Net model. The numbers below the convolution represent the number of filters used to train the model

$${O}^{l }=\sigma ({O}^{l-1 }*{W}^{l }+{b}^{l})$$
(1)

where \({O}^{l-1}\) refers to the output of the (l-1)th layer, \({W}^{l}\) represents the weights, and \({b}^{l}\) represents the bias. \(\sigma\) indicates the non-linear activation function. The rectified linear unit (ReLU) was used as the activation function in this research. ReLU is commonly used because it is more efficient than other functions and mitigates the vanishing gradient problem during the training step (Wang et al. 2019). The function returns 0 when the input is negative and the input value itself when it is positive. The max-pooling layers keep only the maximum value within each 2 × 2 window of the feature maps generated by the convolution operation. Thus, after a max-pooling operation, each spatial dimension of the feature map is reduced to half of the input size.
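To make Eq. 1 and the pooling step concrete, the following sketch applies one 3 × 3 window, a ReLU, and a 2 × 2 max-pooling to a toy single-band image. It uses plain Python lists so it runs without dependencies; the function names, the toy kernel, and the absence of padding are assumptions made for illustration and do not reproduce the layers of the trained network.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return x if x > 0 else 0.0


def conv2d_relu(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding), add `bias`, apply ReLU."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
            row.append(relu(s + bias))
        out.append(row)
    return out


def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# 6 x 6 toy image whose values increase from left to right.
image = [[float(c) for c in range(6)] for _ in range(6)]
kernel = [[-1, 0, 1]] * 3                  # responds to left-to-right increases
fmap = conv2d_relu(image, kernel)          # 4 x 4 feature map
pooled = max_pool_2x2(fmap)                # 2 x 2 after pooling
negated = conv2d_relu(image, [[1, 0, -1]] * 3)  # negative responses are zeroed
print(len(fmap), len(pooled), pooled, negated[0])
```

The pooled map has half the height and width of the feature map, which is the halving described above.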

The decoder path recovers the spatial location by using up-convolutions and concatenations from the encoder path (Ronneberger et al. 2015). The up-convolution layers increase the dimensions of the feature maps. The layers’ output is concatenated with the feature map from the symmetrical position in the encoder path. In the last layer, a sigmoid function was used to output the class predictions in a 0–1 probability range. A threshold of 0.5 was used to determine the positive (> 0.5) and the negative (< 0.5) classes after the prediction.
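The final sigmoid and thresholding step can be sketched as follows. The raw output values are invented for the example, and assigning a probability of exactly 0.5 to the negative class is an assumption, since the text only specifies the cases above and below the threshold.

```python
import math


def sigmoid(z):
    """Map a raw network output to a probability in the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-z))


def to_mask(raw_outputs, threshold=0.5):
    """Binary mask: 1 (landslide) where the probability exceeds the threshold."""
    return [[1 if sigmoid(z) > threshold else 0 for z in row]
            for row in raw_outputs]


# Hypothetical raw outputs of the last layer for a 2 x 3 pixel window.
raw_outputs = [[-2.0, 0.0, 3.0],
               [1.2, -0.4, 0.1]]
mask = to_mask(raw_outputs)
print(mask)
```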

Several papers describe and explain the U-Net structure and how convolutional neural networks are trained (Ghorbanzadeh et al. 2019a, b; Prakash et al. 2020; Wang et al. 2019). In this study, we use a fully convolutional neural network that calculates the per-pixel probability of belonging to a landslide. Unlike the previous work by Ghorbanzadeh et al. (2019a, b), which used a classical convolutional neural network to generate patch-wise landslide classifications, the neural network used in our study is more efficient for landslide segmentation problems because the result is a binary output with the same size as the input image (Prakash et al. 2020, 2021; Qi et al. 2020). The network hyperparameter tuning process considered different numbers of filters (8, 16, 32), learning rates (0.01, 0.001, 0.0001), and batch sizes (8, 16, 32). The learning rate was reduced by a factor of 0.1 when the validation loss plateaued for more than twenty epochs. To avoid overfitting, the models were saved only when the validation loss decreased.
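The learning-rate reduction and the save-on-improvement rule can be illustrated by replaying a list of validation losses. This is a sketch under stated assumptions: "reduced by a factor of 0.1" is read as multiplying the learning rate by 0.1, the plateau counter is reset after each reduction, and the loss values are invented. TensorFlow provides the `ReduceLROnPlateau` and `ModelCheckpoint` callbacks for this kind of behaviour, but whether they were used here is not stated.

```python
def replay_training(val_losses, lr=0.001, factor=0.1, patience=20):
    """Replay per-epoch validation losses.

    Returns the final learning rate and the epochs at which the model
    would have been saved (only when the validation loss improved).
    """
    best = float("inf")
    epochs_without_improvement = 0
    saved_epochs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
            saved_epochs.append(epoch)          # checkpoint on improvement only
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                lr *= factor                    # plateau longer than `patience`
                epochs_without_improvement = 0
    return lr, saved_epochs


# Invented loss curve: three improving epochs, a 25-epoch plateau, one improvement.
losses = [1.0, 0.8, 0.7] + [0.75] * 25 + [0.6]
final_lr, saved = replay_training(losses)
print(final_lr, saved)
```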

Support vector machine (SVM) model

SVM is a machine learning method that uses kernel functions to map the dataset into a higher-dimensional space, where it determines a hyperplane that separates the classes of the training data (Cortes and Vapnik 1995). The hyperplane is chosen to maximize the margin, that is, its distance to the closest training samples, which are known as support vectors. This method gained popularity for landslide mapping due to accurate results, even with small datasets and unknown statistical distributions (Moosavi et al. 2014; Mountrakis et al. 2011; Pawłuszek and Borkowski 2016).

The classification result is affected by the kernel function (e.g., linear, sigmoid, polynomial, radial basis). Thus, various kernel functions were evaluated to find the best classifier.
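For reference, the four kernel families named above can be written as plain functions of two feature vectors. The forms follow the common textbook definitions; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for illustration, not the settings tuned in this study.

```python
import math


def dot(x, y):
    """Inner product of two equally long feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


# Two hypothetical two-band pixel vectors.
x, y = [1.0, 2.0], [3.0, 4.0]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, y))
```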

K-nearest neighbors algorithm (KNN) model

K-nearest neighbors is a machine learning algorithm that classifies a sample according to the K training samples closest to it in feature space. The algorithm outputs a class probability that reflects the uncertainty with which a given individual item can be assigned to any given class (Marjanovic et al. 2009). In this study, the distance between the feature space points was calculated using the Euclidean distance method. An optimal K value was determined by testing K in a 1–10 range.
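A minimal sketch of this idea, assuming two-band pixel vectors and invented training samples: the class probability is taken as the fraction of the K nearest neighbors (by Euclidean distance) that belong to each class. The function name and data are hypothetical.

```python
import math
from collections import Counter


def knn_predict_proba(train_X, train_y, query, k):
    """Class probabilities from the k training samples nearest to `query`."""
    nearest = sorted((math.dist(x, query), label)
                     for x, label in zip(train_X, train_y))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}


# Invented samples: class 0 = non-landslide, class 1 = landslide.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.70)]
train_y = [0, 0, 0, 1, 1, 1]
query = (0.70, 0.75)

proba_k3 = knn_predict_proba(train_X, train_y, query, k=3)
proba_k5 = knn_predict_proba(train_X, train_y, query, k=5)
print(proba_k3, proba_k5)
```

With K = 3 all neighbors are landslide samples, while K = 5 pulls in two non-landslide samples and lowers the landslide probability, which is why K has to be tuned.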

Random forests (RF) model

Random forest is an ensemble method widely used for landslide detection (Chen et al. 2014). The method is based on multiple decision trees, each slightly different because it is trained with a random subset of the training dataset. The technique is less prone to overfitting than a single decision tree because the trees’ output classes are combined by majority voting: the class with the most votes becomes the model’s prediction.
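The two ingredients described above, random subsets and majority voting, can be sketched as follows. Drawing the subsets with replacement (bootstrap sampling) is an assumption that follows the usual random forest formulation, and the per-tree predictions are invented rather than produced by trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw a random subset of the same size, with replacement."""
    return [rng.choice(samples) for _ in samples]


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
samples = list(range(10))                   # stand-in for training samples
subset = bootstrap_sample(samples, rng)     # what one tree would be trained on

# Hypothetical predictions of five trees for one pixel (1 = landslide).
prediction = majority_vote([1, 0, 1, 1, 0])
print(subset, prediction)
```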

Multiple input patches

Different patch sizes may affect the model accuracy because landslides have different shapes and sizes, which may not be well-represented depending on the patch size. Moreover, since the negative class is usually more frequent than the positive class in remote sensing imagery, larger patches may negatively influence the model because they can increase the imbalance between the positive and negative class (Ghorbanzadeh et al. 2019a, b).

In this work, the patches used to train the models had side lengths that are multiples of 16 pixels, since this is a condition for effectively training the U-Net model. The models were trained with 16 × 16, 32 × 32, 64 × 64, and 128 × 128 pixel patches to compare and evaluate how the different patch sizes affect the accuracy of the model. The models were also trained with 256 × 256 pixel patches. However, since those results were inferior to the results for the other patch sizes, only the four smaller patch sizes are considered in the “Results” section.
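The multiple-of-16 condition can be checked with a short sketch. It assumes an encoder with four 2 × 2 max-pooling stages, as in the original U-Net; the number of stages in the network used here is taken from that assumption, not from the text. Each stage halves the side length, so the side must be divisible by 2^4 = 16 for the decoder to restore the input size exactly.

```python
def sizes_through_encoder(patch_size, n_pool=4):
    """Side length after each 2 x 2 pooling stage.

    Raises ValueError if a stage would have to halve an odd side length.
    """
    sizes = [patch_size]
    for stage in range(1, n_pool + 1):
        if sizes[-1] % 2 != 0:
            raise ValueError(
                f"side {sizes[-1]} cannot be halved at pooling stage {stage}")
        sizes.append(sizes[-1] // 2)
    return sizes


for size in (16, 32, 64, 128):
    print(size, sizes_through_encoder(size))

try:
    sizes_through_encoder(24)       # 24 -> 12 -> 6 -> 3, then fails
except ValueError as err:
    print("24:", err)
```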

Results

The machine and deep learning models were trained using only 174 samples to evaluate and compare the performance of the algorithms on small datasets. In total, sixteen result maps were generated for each dataset (dataset-1 and dataset-2). The result maps (Figs. 4a, b; 5a, b; 6a, b; 7a, b) are named after the algorithm, the patch size, and the number of bands in the dataset used to train the algorithm. Therefore, the maps U-Net_16_5 and U-Net_16_7 (Fig. 4a and b) correspond to the U-Net deep learning algorithm trained with the 16 × 16 patch size using the dataset with five optical bands (dataset-1) and seven bands (dataset-2), respectively. The best results were achieved by U-Net models with a learning rate of 0.001; SVM models trained with a polynomial kernel function and a scalable gamma parameter (γ); KNN models trained with nine neighbors; and RF models with 200 trees and a maximum depth of 8.

Fig. 4

a Landslide detection results using U-Net model in sampled area in the test zone using dataset-1. b Landslide detection results using U-Net model in sampled area in the test zone using dataset-2

Fig. 5

a Landslide detection results using SVM model in sampled area in the test zone using dataset-1. b Landslide detection results using SVM model in sampled area in the test zone using dataset-2

Fig. 6

a Landslide detection results using KNN model in sampled area in the test zone using dataset-1. b Landslide detection results using KNN model in sampled area in the test zone using dataset-2.

Fig. 7

a Landslide detection results using RF model in sampled area in the test zone using dataset-1. b Landslide detection results using RF model in sampled area in the test zone using dataset-2.

Figure 8 shows how the total area of the landslides detected by the different models changes when the topographical information of dataset-2 is added. As seen in Fig. 8b, most models detect a larger total area with dataset-2 than with dataset-1 when compared against the manually interpreted ground truth area. This difference is due to false positives introduced by the slope and elevation bands of dataset-2. Although detection improves in built-up areas and on river sand bars, the models generate false positives in forests and agricultural areas.

Fig. 8

Area of detected landslides using different machine learning and U-Net models against the manually interpreted ground truth (red color). A Dataset-1, B dataset-2

The models were evaluated with the precision, recall, F1-score, and Matthews Correlation Coefficient (MCC) metrics, which are calculated from the numbers of true positives (TP), false positives (FP), false negatives (FN), and, for the MCC, true negatives (TN) (Fig. 9). Precision (Eq. 2) is the proportion of pixels classified as landslides that actually are landslides. Recall (Eq. 3) is the proportion of all landslide pixels that were correctly classified as landslides.

Fig. 9

Confusion matrix showing true class and predicted classes of landslides and other features and four different evaluation metrics

$$\mathrm{Precision}=\frac{TP}{TP+FP} \times 100$$
(2)
$$\mathrm{Recall}=\frac{TP}{TP+FN} \times 100$$
(3)

F1-score (Eq. 4) is the harmonic mean of precision and recall; therefore, the highest values of F1-score correspond to models with better performance. Landslide datasets usually have an imbalance between the positive (landslides) and negative (background) classes. Because the MCC (Eq. 5) also accounts for the true negatives, it is better suited for comparing results on imbalanced datasets (Baldi et al. 2000).

$$\mathrm{F1\text{-}score}=\frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}\times 100$$
(4)
$$\mathrm{MCC}=\frac{TP \times TN-FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \times 100$$
(5)
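The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The pixel counts are invented for illustration and are not results of this study. Precision and recall are kept as fractions inside the F1-score so that the factor of 100 is applied only once, and no guard against empty classes (zero denominators) is included.

```python
import math


def segmentation_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and MCC, each scaled to 0-100."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": 100 * precision, "recall": 100 * recall,
            "f1": 100 * f1, "mcc": 100 * mcc}


# Invented pixel counts for an imbalanced scene (background dominates).
metrics = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case precision, recall, and F1-score are all 80, while the MCC is about 77.78, because it also takes the large number of true negatives into account.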

The results show that among the models trained with dataset-1, the U-Net_128_5 model achieved the highest MCC (71.06) and F1-score (71.12). Nevertheless, its MCC is just 0.63, 1.59, and 2.65 higher than those of the SVM, KNN, and RF algorithms, respectively (Table 2). SVM_128_5 achieved the highest precision (80.28), while U-Net_16_5 had the highest recall (83.94).

Table 2 The results of landslide detection in the study area based on the different ML and U-Net model for dataset-1; accuracies are stated as precision, recall, F1-measure, and MCC. The best values are in bold

The U-Net also performed best on dataset-2 (Table 3). However, whereas in dataset-1 the model trained with the 128 × 128 patch size achieved the best F1-score and MCC, in dataset-2 the model trained with the 16 × 16 patch size achieved the highest F1-score (69.42) and MCC (69.70). The patch size seems to be more relevant to dataset-2, since every algorithm achieved its best results with the 16 × 16 patch size. In dataset-1, the SVM and KNN also achieved their best results with the 16 × 16 patch size; however, the best U-Net and RF models were trained with the 128 × 128 and 32 × 32 patch sizes, respectively.

Table 3 The results of landslide detection in the study area based on the different ML and U-Net model for dataset-2; accuracies are stated as precision, recall, F1-measure, and MCC. The best values are in bold

Comparing the results of both datasets, each algorithm achieved better results when trained with dataset-1 than with dataset-2. U-Net_128_5 was the best model overall. Similar to what was observed by Ghorbanzadeh et al. (2019a, b) with machine learning models trained in the same area, the topographical layers helped differentiate human settlement areas, which have identical spectral responses to landslides; however, the models generated more false positives in the steeper areas. A visual evaluation of the segmentation produced by each algorithm (Fig. 10) shows that the U-Net segmentation is smoother, more continuous, and more similar to the manual annotations than the results of the other ML methods. SVM, KNN, and RF results show similar segmentation patterns and mistakes.

Fig. 10

Enlarged maps of a sub-area from the test zone. Landslide detection results are overlaid on the inventory data

Discussion

The U-Net deep learning model achieved the best results in this study according to the evaluation metrics, which shows that U-Net can achieve robust results even with few training samples. However, the MCC and F1-score values were similar among all the models: with a small dataset, the machine learning and deep learning algorithms behave similarly, and the accuracy metrics alone do not single out a best algorithm. The U-Net results nonetheless resemble the manual annotations more closely in terms of the smoothness and continuity of the predicted landslides, demonstrating better segmentation results than the other models. The models evaluated by Ghorbanzadeh et al. (2019a, b) in the same study area were trained with a bigger dataset composed of 3500 samples, which was augmented to 7000 samples. In that study, the CNN model achieved the best results, with an F1-score 5.73% greater than that of the best machine learning model. The larger gap between machine learning and deep learning accuracy reported by those authors highlights the importance of the dataset size. In this study, despite the slightly higher accuracy achieved by the U-Net, the deep learning algorithms were computationally more expensive, needing a GPU (GeForce RTX 2060, 8 GB memory) for the training process, while the machine learning algorithms only used the CPU (Intel Core i7-10700K).

The patch size is an important parameter because it affects the model’s accuracy. The U-Net trained with the optical data showed a pattern similar to that observed by Soares et al. (2020), where the U-Net models trained with smaller patches (32 × 32) yielded a greater recall while the models trained with bigger patches (128 × 128) achieved a greater precision. The models trained with bigger patches were more restrictive (made fewer false-positive errors) than the models trained with smaller patches. Nevertheless, this pattern was observed neither in the U-Net models trained with the topographical dataset nor in the results of the machine learning models.

The topographical data did not improve the results of the models in this study. This may be related to the resampled DEM and to the small number of samples. Since the dataset is composed of only 174 samples, the models were not exposed to a wide variety of topographic features. Therefore, the pattern learned from the training samples may not represent the test area, and consequently, the results were worse. Moreover, the Hughes Phenomenon (Hughes 1968), also known as the Curse of Dimensionality in the field of machine learning, may also be related to the inferior results with the topographical dataset. Since the dataset with two extra topographical bands has a higher dimensionality, a greater number of samples is needed to improve the models’ accuracy. The small number of samples used to train the models was not enough for the classifiers to reliably classify the landslide areas; therefore, the classification performance degraded with the higher dimensional data. This phenomenon may also explain why the models trained with the 16 × 16 patch size (smaller patch, with lower dimensionality) achieved the best results within this dataset.
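As a back-of-the-envelope illustration of the dimensionality argument, the sketch below counts the input values contained in one training patch for each patch size and band count. Equating dimensionality with the number of values per patch is an assumption that fits patch-based models such as U-Net; for a classifier trained on single pixel vectors, the feature vector has only five or seven entries.

```python
def values_per_patch(patch_size, n_bands):
    """Number of input values in one square patch with `n_bands` bands."""
    return patch_size * patch_size * n_bands


N_SAMPLES = 174  # sampling points used for training in this study

counts = {(size, bands): values_per_patch(size, bands)
          for size in (16, 32, 64, 128)
          for bands in (5, 7)}

for (size, bands), n_values in counts.items():
    print(f"{size} x {size}, {bands} bands: {n_values} values per patch "
          f"({n_values / N_SAMPLES:.0f} values per training sample)")
```

Going from five to seven bands increases the number of values per patch by 40% at every patch size, while the number of training samples stays at 174.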

The training and test areas used in this study contain landslides with similar spectral characteristics, which may also explain the comparable results achieved with the machine learning and deep learning models. However, since machine learning algorithms are trained using a one-dimensional vector of pixel values, those models do not learn the spatial pattern of the landslides, such as their shapes. Deep learning models, in contrast, are trained with two-dimensional patches that keep the spatial information of the images, so they are expected to achieve better results than the machine learning algorithms in areas with different spectral characteristics. According to the literature, U-Net-like architectures achieve the best results for segmenting landslides both in test areas with spectral characteristics similar to those of the training zones and in test areas with different spectral characteristics, highlighting their generalization capacity and good accuracy in landslide segmentation (Qi et al. 2020; Prakash et al. 2020; Soares et al. 2020; Yi et al. 2020; Prakash et al. 2021).

Conclusions

This work evaluates the performance of different machine learning and deep learning models trained with small datasets and different patch sizes for landslide segmentation. The U-Net deep learning model achieved the best results on dataset-1 and dataset-2. However, all the models achieved similar MCC and F1-scores, highlighting that deep learning models achieve results comparable to those of machine learning algorithms with small datasets. The extra topographic features (slope and elevation) did not improve the models’ overall results, although they reduced false positives in built-up areas and riverbeds. In this study, U-Net has slightly better results than the other machine learning approaches. Because the outcome can depend on the model architecture and the complexity of geographical features in the imagery, the use of U-Net for landslide detection should still be considered preliminary. One reason the U-Net model performs better is its encoder-decoder and skip-connection structure, which preserves the structural integrity of the output even with little training data (Ronneberger et al. 2015). This suggests that the model can be trained with the limited data that are generally available for new events and then used to detect landslides for those events.

This study is one of the first efforts to use U-Net for landslide detection in the Himalayas. U-Net has the potential to further improve automated landslide detection in the future, not only because of the architecture described above but also because its output is a segmentation result that provides the landslide boundary and the delineation of the landslide body. In further work, the encoder part of the model could be replaced with deeper backbones such as Visual Geometry Group (VGG) networks and Residual Neural Networks (ResNet-50) (Simonyan and Zisserman 2014; He et al. 2016) to further improve the results and detect more landslides with fewer false positives, since more complex models tend to overcome such artifacts.

The use of only spectral bands can be a limitation for landslide detection since the geology and the degree of saturation of the soil directly affect the targets’ spectral response. Areas with higher soil saturation may appear darker, while less saturated areas appear lighter. Moreover, rocks with different weathering conditions will show different spectral responses. Thus, to avoid algorithm misclassifications and improve the results, further studies need to use images covering a more comprehensive range of time and different seasons. This way, the models can learn and predict a broader range of spectral responses of the landslides and achieve better results.