Abstract

Numerous human actions such as “Phoning,” “PlayingGuitar,” and “RidingHorse” can be inferred by static cue-based approaches even when their motions in video are available, since a single still image may already sufficiently explain a particular action. In this research, we investigate human action recognition in still images and utilize deep ensemble learning to automatically decompose the body pose and perceive its background information. First, we construct an end-to-end NCNN-based model by attaching the nonsequential convolutional neural network (NCNN) module to the top of a pretrained model. The nonsequential network topology of NCNN can learn spatial- and channel-wise features separately with parallel branches, which helps improve model performance. Subsequently, to further exploit the advantage of the nonsequential topology, we propose an end-to-end deep ensemble learning based on weight optimization (DELWO) model, which fuses the deep information derived from multiple models automatically from the data. Finally, we design a deep ensemble learning based on voting strategy (DELVS) model to pool together multiple deep models with weighted coefficients to obtain a better prediction. Moreover, model complexity can be reduced by lessening the number of trainable parameters, thereby mitigating overfitting on small datasets to some extent. We conduct experiments on Li’s action dataset and on the uncropped and 1.5x cropped Willow action datasets, and the results validate the effectiveness and robustness of our proposed models in terms of mitigating overfitting on small datasets. We have open-sourced our code on GitHub (https://github.com/yxchspring/deep_ensemble_learning) in order to share our model with the community.

1. Introduction

Human action recognition [1–6] is one of the most important research fields in computer vision. Although recognizing the motion of a human action in video can provide discriminative clues for classifying a specific action, many human actions (e.g., “Phoning,” “InteractingWithComputer,” and “Shooting,” as shown in Figure 1) can be represented by a single still image [2]. In particular, certain actions (e.g., “PlayingGuitar,” “RidingHorse,” and “Running,” as shown in Figure 1) may require static cue-based approaches even when their motions in videos are available [2]. Recognizing these human actions with the video-based approaches mentioned above [5, 6, 8] may be inappropriate because their motion changes are slight and lack distinguishability. The inherently static nature of these actions motivates us to address such human action recognition tasks in still images [2]. Classifying human actions in still images is a more challenging task, especially when only a single image is available along with disturbance and a cluttered background.

More and more work [9–14] has recently focused on human action recognition in still images. In this research, we strive to develop a robust human action model for still images that does not need manual feature engineering, explicit body pose estimation and reasoning, or part-based representations. Specifically, we concentrate on employing deep ensemble learning to address such tasks. First of all, we explore the application of nonsequential network topology to human action recognition in still images. We propose to attach a nonsequential convolutional neural network (NCNN) module to the pretrained model. The NCNN module has three independent branches, and each branch can learn spatial- and channel-wise features separately. The end-to-end NCNN-based model is then trained to learn domain-specific knowledge on small datasets. Secondly, different kinds of models may discover different aspects of the “truth,” so we further examine the benefits of deep ensemble learning in terms of improving classification performance. We propose an end-to-end deep ensemble learning based on weight optimization (DELWO) model to fuse the information derived from multiple deep models and achieve better performance. DELWO also has a nonsequential network topology and is a generalized multi-input model with each pretrained model as an input. Besides, we propose a deep ensemble learning based on voting strategy (DELVS) model to integrate prediction results using different voting strategies to obtain better predictions. Our proposed models side-step the trivial tasks related to manual feature design, body part-based modeling, action poselet-based representation, etc.

In practice, how to mitigate overfitting is one important concern in computer vision tasks when only a few data are available for training. For instance, in Li’s action dataset [7], as shown in Figure 1, only 180 images are used as the training set. For another example, only 208 and 280 images are used as the training set for the uncropped and 1.5x cropped Willow action datasets [2], respectively. We are committed to constructing deep CNN models that can mitigate the overfitting issue on small datasets to some degree. The main features of our proposed models are as follows:

(1) Our proposed NCNN-based model and DELWO model are both end-to-end, which can directly produce a prediction for one input and also make batch processing possible during model training. This greatly reduces memory consumption and makes it feasible to train our deep learning models on a single PC with a CPU.

(2) Our NCNN-based model and DELWO model have a nonsequential network topology. The advantages of the NCNN-based model are as follows: first, it can automatically learn information from different channels; second, it contributes to fine-tuning the top layers by optimizing the weight parameters of the NCNN module so that the model becomes more competent for a domain-specific task. The DELWO model fuses the deep information derived from multiple models and then automatically exploits the advantages of each model from the data.

(3) We propose two deep ensemble models, DELWO and DELVS. DELWO avoids manually specifying the weight coefficients of the various models. The DELVS model determines the weight parameters in advance and then combines multiple models to pool together the predictions using different voting strategies, endeavoring to achieve better performance. These two kinds of deep ensemble models are proposed to explore their performance on a domain-specific task.

The whole framework of our proposed algorithm is elaborated in Figure 2.

The rest of the paper is organized as follows. We first review the related work for human action recognition in still images in Section 2. In Section 3, we elucidate the specific methodology including data processing and model construction. We report the experimental results in Section 4, and this is followed by the conclusions and future work drawn in Section 5.

2. Related Work

Existing work mainly focuses on feature engineering (e.g., bag-of-features), body part-based modeling, or action poselet-based representation for human action recognition. Delaitre et al. [2] studied human action recognition in still images using the bag-of-features model. Qi et al. [14] proposed a hint-enhanced CNN framework that jointly learns pose hints and deep feature extraction. Kong et al. [15] proposed to extract a depth motion maps pyramid descriptor for each action, followed by a discriminative collaborative representation classifier, to perform human action recognition. Gupta et al. [16] utilized the body pose as a clue for action recognition. Zhang et al. [17] proposed a foreground trajectory extraction approach based on a saliency sampling strategy intended to lessen the loss of valid action trajectories. Felzenszwalb et al. [18] proposed a structural part-based model to represent human actions. Ko et al. [19] proposed an action poselet-based approach and a two-layer classification model to infer human actions. Cai et al. [20] proposed an improved CNN for human action recognition that extracts depth sequence features using depth motion maps and obtains three projected maps: the front, side, and top views.

With the outstanding performance of deep learning in computer vision, it is an important step forward to construct a deep learning model that automatically analyzes the body pose and perceives its background information. However, training a deep CNN model from scratch on a small dataset often suffers from overfitting. Data augmentation, a powerful technique for mitigating overfitting, generates more training data using random transformations such as rotation, shift, and shear. It ensures that the model never sees exactly the same sample twice during training and therefore exposes the model to more aspects of the data. Another technique is to adopt pretrained deep networks (e.g., VGG16 [21]) as the initial model to extract deep features, which makes a deep learning model effective even when only a small dataset is available.

Traditional CNN models (e.g., LeNet-5 [22] and VGG16 [21]) are sequential in the sense that they consist of linear stacks of layers. Such sequential models may be inflexible in some cases. The inception module proposed by Szegedy et al. [23] possesses a nonsequential network topology: the model follows the structure of a directed acyclic graph. The input to the inception module is processed separately by several parallel branches, followed by a concatenation layer that merges the output of each branch back into a single tensor. The number of trainable parameters of a network considerably determines the degree of model overfitting. Lin et al. [24] proposed global average pooling (GAP) to replace the fully connected layer preceding the Softmax layer in CNNs. GAP greatly reduces the number of trainable parameters, makes the model lighter, and thus alleviates overfitting. Multi-input models [25–28] are another kind of nonsequential network topology; they have multiple input layers and can make full use of multimodal or multiple types of data. Nickfarjam and Ebrahimpour-Komleh [25] adopted deep belief networks with a multi-input topology to conduct shape-based human action classification and improved model performance.

In addition, ensemble learning [29, 30] can pool together different models to achieve better performance: it combines different classifiers and incorporates their predictions either through weighted coefficients or majority votes. It can take advantage of different models to explore as many parts of the “truth” of the data as possible [31].

3. Methodology

3.1. Data Processing

The availability of very little data is a common situation when a classification model needs to be trained for image recognition. To mitigate overfitting, we adopt the data augmentation technique to improve the performance of the CNN. Data augmentation generates more training data via random transformations of the training images and exposes the model to more aspects of the data distribution [31]. The random transformations in this research comprise rotation within 0–90 degrees, width shift within 0–0.2, height shift within 0–0.2, shear within 0–0.2, zoom within 0–0.2, horizontal flip, and vertical flip. Besides, we conduct image-wise centralization to implement sample normalization.
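As an illustration, these transformations can be configured with the Keras ImageDataGenerator API; the directory layout, image size, and batch size below are assumptions for the sketch, not values taken from the paper.

```python
# A minimal sketch of the augmentation settings described above (Keras API).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=90,       # random rotation within 0-90 degrees
    width_shift_range=0.2,   # horizontal shift within 0-0.2 of image width
    height_shift_range=0.2,  # vertical shift within 0-0.2 of image height
    shear_range=0.2,         # shear within 0-0.2
    zoom_range=0.2,          # zoom within 0-0.2
    horizontal_flip=True,
    vertical_flip=True,
    samplewise_center=True,  # image-wise centralization (per-sample mean subtraction)
)

train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical layout: one subfolder per action class
    target_size=(224, 224),
    batch_size=16,
    class_mode="categorical",
)
```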

3.2. Nonsequential Convolutional Neural Network Model

In this section, we present our proposed NCNN module. First, our proposed NCNN-based model is applicable to small datasets. Second, it is an end-to-end model that directly produces the output for each input sample. More importantly, batch processing greatly reduces memory consumption and makes it possible to train our models when only CPUs are available (although GPUs are preferable). Compared with adding a convolution layer with the same number of filters, the NCNN module makes the model lighter and improves its generalization ability.

Figure 3 shows the structure of the VGG16 used in this research; the VGG16_base module and the classifier module are connected by a GAP layer. The VGG16_base module is based on VGG16 [21]; the weights of its convolution layers are initialized from the pretrained VGG16, and those layers are frozen during model training.
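A minimal sketch of this baseline, assuming a Keras implementation: a frozen VGG16 convolutional base, a GAP layer, and a classifier with the two “FC-2048” layers described later in this section. The optimizer and the number of classes (six, as in Li’s dataset) are assumptions.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

num_classes = 6  # e.g., Li's action dataset

# Pretrained convolutional base, frozen during training.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# GAP layer connecting the base to the classifier module.
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(2048, activation="relu")(x)  # FC-2048 ("FC1")
x = layers.Dense(2048, activation="relu")(x)  # FC-2048 ("FC2")
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = Model(base.input, outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```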

To mitigate overfitting, speed up model training, and improve generalization ability, we construct the NCNN-based model (e.g., VGG16_NCNN). Specifically, VGG16_NCNN incorporates three modules: the VGG16_base module, the NCNN module, and the classifier module after the GAP layer. The whole structure of VGG16_NCNN is illustrated in Figure 4. As shown in Figure 4, the NCNN module has three branches. In the notation “ConvK-F,” K denotes the kernel size and F denotes the number of filters. Branch A possesses one “Conv1-128” layer, i.e., a kernel size of 1 × 1 with 128 filters. Similarly, Branch B has two convolution layers: one “Conv1-128” layer followed by one “Conv3-128” layer. Branch C has one 3 × 3 average pooling layer followed by a “Conv1-128” layer. Finally, the last activation outputs of the three branches are concatenated together to form the resulting concatenated layer. As with VGG16, a three-layer classifier module is constructed after the GAP layer.
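The NCNN module itself can be sketched with the Keras functional API as follows; the ReLU activations and the “same” padding with stride 1 (needed so the three branch outputs can be concatenated) are assumptions not spelled out in the text.

```python
from tensorflow.keras import layers

def ncnn_module(x):
    # Branch A: one Conv1-128 layer (1x1 kernel, 128 filters).
    a = layers.Conv2D(128, 1, activation="relu", padding="same")(x)
    # Branch B: Conv1-128 followed by Conv3-128.
    b = layers.Conv2D(128, 1, activation="relu", padding="same")(x)
    b = layers.Conv2D(128, 3, activation="relu", padding="same")(b)
    # Branch C: 3x3 average pooling followed by Conv1-128.
    c = layers.AveragePooling2D(3, strides=1, padding="same")(x)
    c = layers.Conv2D(128, 1, activation="relu", padding="same")(c)
    # Concatenate along the channel axis: 3 x 128 = 384 channels,
    # matching the 384-dimensional GAP output cited below.
    return layers.concatenate([a, b, c])
```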

The weights of the VGG16_base module are initialized from the pretrained VGG16 model, and those of the NCNN module are randomly initialized. During model training, the weights of the VGG16_base module are frozen in order to prevent the error backpropagated through the randomly initialized layers from destroying the pretrained convolution layers.

The advantage of having the NCNN module and a GAP layer is that they effectively decrease the number of trainable parameters. Our model is therefore lighter, which greatly facilitates the mitigation of model overfitting and promotes generalization ability. Specifically, when the input size is 224 × 224, the number of parameters between the last layer of the VGG16_base module and the first layer of the classifier module would be (7 × 7 × 512) × 2048 + 2048 = 51,382,272 without a GAP layer. For VGG16 with GAP, the number of parameters between the GAP layer and the first layer of the classifier module is 512 × 2048 + 2048 = 1,050,624. For VGG16_NCNN, the number of parameters between the GAP layer and the first layer of the classifier module is 384 × 2048 + 2048 = 788,480.
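These dense-layer counts follow the usual formula (inputs × units + units biases) and can be verified directly:

```python
# Quick arithmetic check of the parameter counts quoted above.
flatten_to_fc = (7 * 7 * 512) * 2048 + 2048  # no GAP:        51,382,272
vgg16_gap_to_fc = 512 * 2048 + 2048          # VGG16 + GAP:    1,050,624
ncnn_gap_to_fc = 384 * 2048 + 2048           # VGG16_NCNN:       788,480
print(flatten_to_fc, vgg16_gap_to_fc, ncnn_gap_to_fc)
```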

Without the NCNN module and a GAP layer, the total number of trainable parameters would be 70,307,655. With them, the total number of trainable parameters of VGG16 (see Figure 4) is 5,261,319, and that of our proposed VGG16_NCNN model is 5,868,039. Although the total number of trainable parameters of VGG16_NCNN is slightly larger than that of VGG16, its effectiveness is enhanced: by training the parameters of the NCNN module, we can effectively fine-tune the model (i.e., fine-tune the layers in the red dotted box of Figure 4), making it more suitable for our domain-specific tasks. The number of trainable parameters determines the complexity of a model, and ideally the size of the data and the complexity of the model should be well matched. Therefore, adding the NCNN module and a GAP layer mitigates overfitting to some extent.

To train our model, we need to minimize the following categorical cross-entropy loss function:

\[
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log p_{ik}, \tag{1}
\]

where $p_{ik}$ denotes the probability of predicting sample $i$ to class $k$, $N$ denotes the sample size, $K$ denotes the number of classes, and $y_{ik}$ is the true label for sample $i$ belonging to class $k$.

Specifically, when the first “FC-2048” layer of the classifier module is regarded as the input, formula (1) can be further expressed as

\[
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log \sigma\!\left(\sum_{m=1}^{M} v_{km}\, h\!\left(\sum_{d=1}^{D} w_{md}\, x_{id} + b_m\right) + c_k\right), \tag{2}
\]

where $x_{id}$ denotes the $d$-th activation of “FC1” for sample $i$, $w$ and $b$ ($v$ and $c$) denote the weights and biases following “FC1” (“FC2”), “FC1” and “FC2” denote the first and second “FC-2048” layers of the classifier module, $\sigma$ denotes the sigmoid function, $h$ denotes the ReLU activation function, and $D$ and $M$ represent the numbers of nodes in “FC1” and “FC2,” respectively.

When an unseen sample comes in, the following function is employed to produce the prediction class for sample $i$:

\[
\hat{y}_i = \arg\max_{k}\, p_{ik}. \tag{3}
\]
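As an illustration, the following NumPy sketch implements formulas (1) and (3) directly; it is a minimal reference implementation for clarity, not the authors’ released code.

```python
import numpy as np

def categorical_crossentropy(y, p, eps=1e-12):
    # Formula (1): L = -(1/N) * sum_i sum_k y_ik * log(p_ik)
    # y: one-hot labels (N x K); p: predicted probabilities (N x K).
    return -np.mean(np.sum(y * np.log(p + eps), axis=1))

def predict_class(p):
    # Formula (3): y_hat_i = argmax_k p_ik
    return np.argmax(p, axis=1)
```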

3.3. Deep Ensemble Learning Based on Weight Optimization

Different deep models will focus on different aspects of the “truth.” Therefore, in order to incorporate more information about the “truth,” we design an end-to-end DELWO model that directly produces the output for each input sample. As shown in Figure 5, the training data are fed into multiple deep models with the NCNN module, and the GAP outputs of the models are concatenated to form a longer layer (the GAP concatenation module), which is connected to the classifier module (a minimal code sketch follows the list below). In this research, we define three kinds of DELWO models:

(1) DELWO1 fuses VGG16 [21], VGG19 [21], and ResNet50 [32] in the deep model module, and the filter size of GAP is set to 128 in the GAP concatenation module.

(2) DELWO2 fuses VGG16_NCNN, VGG19_NCNN, and ResNet50_NCNN in the deep model module, and the filter size of GAP is set to 128 in the GAP concatenation module.

(3) DELWO3 fuses VGG16_NCNN, VGG19_NCNN, and ResNet50_NCNN in the deep model module, and the filter size of GAP is set to 384 in the GAP concatenation module.
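A minimal sketch of this multi-input fusion, assuming Keras; the 1 × 1 convolution is one possible reading of “the filter size of GAP is set to 128,” and the classifier sizes mirror those of the NCNN-based model.

```python
from tensorflow.keras import Model, layers

def build_delwo(branches, num_classes, gap_filters=128):
    # `branches` is a list of frozen Keras models (e.g., include_top=False
    # backbones, optionally with the NCNN module) producing 4D feature maps.
    inputs = [layers.Input(shape=(224, 224, 3)) for _ in branches]
    gaps = []
    for m, x in zip(branches, inputs):
        f = m(x)
        # Assumed reading of "filter size of GAP": a 1x1 conv reducing channels.
        f = layers.Conv2D(gap_filters, 1, activation="relu")(f)
        gaps.append(layers.GlobalAveragePooling2D()(f))
    merged = layers.concatenate(gaps)              # GAP concatenation module
    h = layers.Dense(2048, activation="relu")(merged)
    h = layers.Dense(2048, activation="relu")(h)
    out = layers.Dense(num_classes, activation="softmax")(h)
    return Model(inputs, out)
```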

Figure 5 elaborates the specific process of human action classification using DELWO2. When the model is well trained, the testing data are fed into it to produce the final prediction. The specific training steps are similar to those of the NCNN-based model.

3.4. Deep Ensemble Learning Based on Voting Strategy

Ensemble learning is a powerful technique for achieving better prediction results. We assume that different models focus on different aspects of the “truth.” Therefore, pooling together different models can discover as many parts of the “truth” as possible. In this research, we pool together multiple deep models using DELVS, aiming to obtain better prediction results. The prediction class for sample i is estimated using the following three kinds of functions.

3.4.1. Hard Voting

Hard voting (i.e., majority voting) predicts the final class label by computing the label majority over all classifiers, as shown in the following equation:

\[
\hat{y}_i = \operatorname{mode}\{C_1(x_i), C_2(x_i), \ldots, C_m(x_i)\}, \tag{4}
\]

where $x_i$ denotes sample $i$, $C_j(x_i)$ denotes the prediction label of the $j$-th classifier, the mode function computes the majority of all the prediction labels, and $m$ denotes the number of classifiers.
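A minimal NumPy sketch of formula (4), assuming integer class labels; resolving ties to the smallest label is an implementation choice, not one specified in the paper.

```python
import numpy as np

def hard_vote(labels):
    # labels: (n_samples, m) array of the m classifiers' predicted labels.
    return np.array([np.bincount(row).argmax() for row in labels])
```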

3.4.2. Soft Voting

Soft voting predicts the final class label by computing the weighted sum of the prediction probabilities of each class over all classifiers; the label is assigned to the class with the highest probability sum, as shown in the following formula:

\[
\hat{y}_i = \arg\max_{k} \sum_{j=1}^{m} w_j\, p_{ik}^{(j)}, \tag{5}
\]

where $w_j$ denotes the weight of the $j$-th classifier, $p_{ik}^{(j)}$ denotes the probability with which the $j$-th classifier predicts sample $i$ into the $k$-th class, and $m$ denotes the number of classifiers. It is worth noting that each weight is set to $1/m$ in this voting strategy.
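A corresponding NumPy sketch of formula (5); passing no weights reproduces the uniform $1/m$ weighting.

```python
import numpy as np

def soft_vote(probas, weights=None):
    # probas: list of m arrays, each of shape (n_samples, n_classes).
    m = len(probas)
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights)
    weighted = sum(wj * p for wj, p in zip(w, probas))  # inner sum of (5)
    return np.argmax(weighted, axis=1)
```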

3.4.3. Tuning Weight Voting

Soft voting adopts a uniformly weighted average, which sometimes fails to highlight the differences in model contributions. Therefore, we adopt a grid search to find the optimal weights and obtain better predictions. Specifically, the weight parameters are swept with a fixed step size over a specified range, and the weight coefficients yielding the best overall accuracy are taken as the optimal result. The function is shown in formula (6), which is similar to formula (5):

\[
\hat{y}_i = \arg\max_{k} \sum_{j=1}^{m} w_j^{*}\, p_{ik}^{(j)}, \tag{6}
\]

where $w_j^{*}$ denotes the optimal weight coefficient of the $j$-th classifier and $\sum_{j=1}^{m} w_j^{*} = 1$.
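A sketch of this grid search over validation data, reusing soft_vote() from the previous sketch; the 0.1 step size and the unit-sum constraint are assumptions for illustration.

```python
import itertools
import numpy as np

def tune_weights(probas, y_true, step=0.1):
    # probas: list of m (n_samples, n_classes) validation probability arrays.
    best_acc, best_w = -1.0, None
    grid = np.arange(0.0, 1.0 + step, step)
    for w in itertools.product(grid, repeat=len(probas)):
        if abs(sum(w) - 1.0) > 1e-6:   # keep only weights summing to 1
            continue
        acc = np.mean(soft_vote(probas, list(w)) == y_true)
        if acc > best_acc:             # keep the best overall accuracy
            best_acc, best_w = acc, w
    return best_w, best_acc
```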

In this research, we define three kinds of DELVS models, DELVS1, DELVS2, and DELVS3:

(1) DELVS1 integrates the predictions of VGG16 [21], VGG19 [21], and ResNet50 [32] using the three voting strategies mentioned above to obtain the final prediction results.

(2) DELVS2 integrates the predictions of VGG16_NCNN, VGG19_NCNN, and ResNet50_NCNN using the three voting strategies mentioned above.

(3) DELVS3 integrates the predictions of VGG16 [21], VGG19 [21], ResNet50 [32], VGG16_NCNN, VGG19_NCNN, and ResNet50_NCNN using the three voting strategies mentioned above.

Figure 6 elaborates the specific process of human action classification using DELVS. When the testing data are fed into the m classification models, the three voting strategies are each applied to obtain the final prediction.

4. Results

In this section, we evaluate the performance of our proposed models in the following datasets: Li’s action dataset, Willow action dataset, and 1.5x cropped Willow action dataset. We first demonstrate the specific experimental setup and then present the detailed experimental results.

4.1. Experimental Setup
4.1.1. Datasets

For Li’s action dataset (the data and code of Li’s paper are available at https://github.com/lipiji/PG_BOW_DEMO), six common human action categories are released with a total of 240 images: “Phoning,” “PlayingGuitar,” “RidingBike,” “RidingHorse,” “Running,” and “Shooting” [7]. These images are cropped in advance, and the images in each category are of the same size (see Figure 1(a)). For each class, we randomly choose 30 images for training, 10 for validation, and the remaining 20 for testing.

For the Willow action dataset (the data and code of Delaitre’s paper are available at https://www.di.ens.fr/willow/research/stillactions/), seven human action categories were collected by Delaitre et al. [2], comprising 911 images: “InteractingWithComputer,” “Photographing,” “PlayingMusic,” “RidingBike,” “RidingHorse,” “Running,” and “Walking.” This dataset contains more challenging, noncropped consumer photographs exhibiting natural viewpoint variation, occlusions, varying scene layouts, and variations in object appearance and people’s clothing [2]. In addition, the images in each category are of different sizes. The location of the person in each image is manually annotated with a bounding box. To evaluate the performance of our models, we conduct experiments on both the uncropped (i.e., original images with background) Willow action dataset and the 1.5x cropped version (i.e., the annotated bounding box of the human is rescaled to 1.5 times its size and the image is cropped accordingly). The training, validation, and testing set splits are consistent with those in Delaitre’s work [2].
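For reference, the 1.5x cropping can be sketched as follows with PIL; the clamping to the image borders is an implementation assumption.

```python
from PIL import Image

def crop_1p5x(img: Image.Image, box, scale=1.5):
    # box: annotated person bounding box (x1, y1, x2, y2).
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2          # box center
    w, h = (x2 - x1) * scale, (y2 - y1) * scale    # 1.5x width and height
    left = max(0, int(cx - w / 2))
    top = max(0, int(cy - h / 2))
    right = min(img.width, int(cx + w / 2))
    bottom = min(img.height, int(cy + h / 2))
    return img.crop((left, top, right, bottom))
```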

4.1.2. Model Parameter Setup

For each model, we adopt the pretrained model to initialize the weights, while the weights of the NCNN module and the following classifier module are randomly initialized. It is worth noting that all the convolution layers of the pretrained model are frozen during training to ensure that their weights are not destroyed; otherwise, the error backpropagated through the randomly initialized classifier layers would be too large, making the model more difficult to train. We conducted experiments on multiple pretrained models, including VGG16 [21], VGG19 [21], InceptionV3 [23], DenseNet [33], ResNet50 [32], and MobileNet [34]; however, only VGG16, VGG19, and ResNet50 achieved good performance in human action recognition, so we carry out further research on these models. To assess the performance of our proposed models, comparison algorithms including speeded-up robust features (SURF) [35], bag-of-features (BOF) [36], and pyramid bag-of-features (PBOF) [37] are used. To compare with nondeep ensemble learning approaches, we further evaluate bagging-based ensemble learning [38] (e.g., random forests (RF) [39]), boosting-based ensemble learning [40] (e.g., gradient boosting machines (GBMs) [41]), and voting-based ensemble learning [42] over support vector machine (SVM), RF, and GBM classifiers. All the nondeep ensemble learning approaches are evaluated on 512-dimensional GIST descriptors. The GIST method [43, 44] (with an SVM classifier) is also included in the comparison experiments.

4.2. Experimental Results
4.2.1. Results for Nonsequential Convolutional Neural Network Model

Table 1 shows the classification performance on Li’s action dataset. SURF achieves the worst performance. BOF and PBOF perform better than SURF but still do not exceed the models we propose. Among the nondeep ensemble learning approaches, only the voting-based approach surpasses GIST, while RF and GBM achieve worse results than GIST. VGG16, VGG19, and ResNet50 perform well, which shows that the pretrained weights are suitable for classifying the human action categories in Li’s action dataset. More importantly, VGG19_NCNN and ResNet50_NCNN outperform the baseline models in terms of overall accuracy and loss. In particular, the ResNet50_NCNN model achieves the best overall accuracy and the least loss. From Table 1 and Figure 7, we are convinced that the NCNN-based models perform on par with or better than the baseline models, despite the slightly higher loss obtained by VGG16_NCNN.

Table 2 presents the results of the NCNN-based models on the Willow action dataset. This is a more challenging dataset: natural and challenging disturbances occur in the images, including viewpoint variation, occlusions, varying scene layouts, and variations in object appearance and people’s clothing. Consequently, the seven comparison algorithms fail to classify these human actions. In particular, the performance of the nondeep ensemble learning approaches does not exceed that of the GIST approach. From Table 2 and Figure 8, we can see that the NCNN-based models generally outperform the baseline models in terms of overall accuracy and loss, except for VGG19_NCNN in terms of overall accuracy.

Besides, we can see that all the models struggle when classifying the “Photographing,” “Running,” and “Walking” actions. Because these three actions look similar (see Figure 1(b)), uncertainty arises in their predictions. As shown in Figure 8, in the “Photographing” row, this action was misclassified as “Walking” at a rate of 0.28 by VGG16_NCNN, 0.38 by VGG19_NCNN, and 0.16 by ResNet50_NCNN. In the “Running” row, VGG16_NCNN and VGG19_NCNN achieve better classification than ResNet50_NCNN, revealing that they can better discriminate “Running” from “Walking” to a certain extent. However, in the “Walking” row, none of the models can reliably distinguish this action from “Running.”

Table 3 presents the results of the NCNN-based models on the 1.5x cropped Willow action dataset. From Table 3 and Figure 9, we can see that the NCNN-based models outperform the baseline models in terms of overall accuracy and loss. Similar to the results in Table 2, the performance of the nondeep ensemble learning approaches does not exceed that of the GIST approach. It is worth noting that VGG19 and VGG19_NCNN perform better than the other models when classifying the “Running” action, and ResNet50_NCNN performs better than the other models when classifying the “Walking” action. ResNet50_NCNN achieves the best overall accuracy and loss among all models, which validates the effectiveness of our proposed NCNN-based method.

The overall experimental results of the thirteen algorithms on the three datasets are shown in Figure 10. Almost all the comparison methods, including SURF, BOF, PBOF, and GIST, as well as the nondeep ensemble learning approaches RF, GBM, and voting, fail at this task; only BOF, PBOF, GIST, RF, GBM, and voting show good results on Li’s action dataset. The performance of all the deep learning-based models is better than that of the comparison algorithms.

4.2.2. Results for Deep Ensemble Learning Based on Weight Optimization

Table 4 shows the results of DELWO1, DELWO2, and DELWO3 on Li’s action dataset. DELWO1 and DELWO3 obtain the best overall accuracy, and DELWO2 obtains the least loss. Compared with the nonensemble models, all three DELWO models perform better than the best model in Table 1. Table 4 gives the specific experimental results, and Figure 11 illustrates the ROC curves of DELWO1, DELWO2, and DELWO3, which demonstrate the robustness of the DELWO models.

Table 5 shows the results of DELWO1, DELWO2, and DELWO3 on the Willow action dataset. DELWO2 obtains the best overall accuracy, DELWO3 obtains the second best, and DELWO1 obtains the least loss. Compared with the nonensemble models on the Willow action dataset, the DELWO models improve overall accuracy by almost 5%. The specific performance is shown in Table 5 and Figure 12.

Similarly, Table 6 shows the results of DELWO1, DELWO2, and DELWO3 on the 1.5x cropped Willow action dataset. DELWO3 obtains the best overall accuracy, and DELWO2 obtains the least loss. Compared with the nonensemble models on the 1.5x cropped Willow action dataset, the DELWO models improve overall accuracy by almost 6%. In particular, DELWO3 performs best when identifying the “Running” action. The detailed performance is shown in Table 6 and Figure 13.

4.2.3. Results for Deep Ensemble Learning Based on Voting Strategy

Table 7 presents the results of DELVS1, DELVS2, and DELVS3 on Li’s action dataset. Comparing Table 7 with Table 1, we can see that the DELVS models perform better than the NCNN-based models. It is worth noting that DELVS2 (tuning) and DELVS3 (tuning) obtain better results than DELWO1 and DELWO3 on Li’s action dataset. In general, the tuning weight voting method achieves the best results among the three voting strategies. The detailed performance of the DELVS models is given in Table 7 and Figure 14.

Table 8 presents the results of DELVS1, DELVS2, and DELVS3 on the Willow action dataset. Comparing Table 8 with Table 2, we can see that the DELVS models perform better than the NCNN-based models. It is worth noting that DELVS3 (tuning) obtains the best results on the Willow action dataset, and the tuning weight voting method achieves the best results among the three voting strategies. The detailed performance of the DELVS models is given in Table 8 and Figure 15.

Similarly, on the 1.5x cropped Willow action dataset, the tuning weight voting method performs best among the three voting strategies. Compared with the nonensemble models on this dataset, the DELVS models improve overall accuracy by almost 5%. However, the performance of the best model, DELVS2 (tuning), does not exceed that of DELWO3, which shows that the DELWO model is more competent on this dataset. The detailed performance of the DELVS models on the 1.5x cropped Willow action dataset is given in Table 9 and Figure 16.

4.3. Experimental Analysis

From Figure 17, we can conclude that the deep ensemble models outperform the nonensemble ones. The DELVS models obtain the best results on Li’s action dataset and the Willow action dataset, while the DELWO models obtain the best results on the 1.5x cropped Willow action dataset. This suggests that the deep ensemble models can discover more aspects of the “truth” and remain robust in the face of interference.

Comparing the experimental results on the uncropped and 1.5x cropped Willow action datasets, we can draw another conclusion: all the deep models achieve better overall accuracy on the uncropped Willow action dataset. We therefore speculate that the “background” of each action provides useful signals and cues for classification. For example, the “InteractingWithComputer” action usually occurs indoors, while “RidingBike,” “RidingHorse,” etc. often occur outdoors. Since background information is usually linked with specific actions, it is valuable to incorporate it when classifying the corresponding human actions.

Figure 18 shows the detailed class activation heatmaps computed with Grad-CAM [45] for the different deep models on three “RidingHorse” examples. The NCNN-based models detect a larger response area than the baselines to some degree. Although the DELWO models find a smaller response area, they retain the most essential part and are more compact. We speculate that this may be why the DELWO models show greater robustness in classifying these actions.
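For reproducibility, a minimal Grad-CAM sketch of the kind used to produce Figure 18, assuming a single-input functional Keras model; last_conv_name (the name of the last convolution layer) and the input preprocessing are left to the caller and depend on the chosen backbone.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_idx=None):
    # Build a model returning both the last conv feature maps and the predictions.
    conv_layer = model.get_layer(last_conv_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])   # add batch dimension
        if class_idx is None:
            class_idx = int(tf.argmax(preds[0]))         # top predicted class
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)                 # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # GAP over spatial dims
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)[0]  # weighted feature maps
    cam = tf.nn.relu(cam)                                  # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # normalize to [0, 1]
```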

5. Conclusions

In this research, we propose deep ensemble learning approaches to automatically perform human action recognition in still images. Human actions such as “Phoning,” “RidingHorse,” and “Running” require static cue-based approaches due to the nature of these actions, and recognizing human actions in still images is complementary to video-based methods. Mitigating overfitting has always been one of the most challenging tasks in computer vision and machine learning, and it becomes more intractable when a deep learning model needs to be trained on small datasets. Therefore, mitigating overfitting during training is an important issue.

To address these issues, first, the weights of the convolution layer modules of our models are initialized from pretrained models in the spirit of transfer learning. In addition, we adopt data augmentation to generate more training data and further mitigate overfitting. Second, the GAP trick greatly shrinks the number of trainable parameters; our models are therefore lighter and generalize well to unseen data. Furthermore, thanks to the end-to-end structure, it is feasible to train our models on a single PC with a CPU. Moreover, the nonsequential network topology enables the NCNN-based model to learn spatial- and channel-wise features separately in parallel branches. The DELWO model, with a generalized nonsequential network topology, fuses deep features from multiple models automatically from the data. The DELVS model pools together different classifiers to produce a better prediction.

Our experimental results reveal that the “background” information may provide helpful signals and cues for classifying human actions. Incorporating the action and background information will be part of our follow-up work. The nonsequential network topology possesses powerful advantages over traditional sequential topologies, so further developing a nonsequential model with separable convolution layers and multiple inputs will be another line of our follow-up research. For example, we could use the action information and the background information as two independent inputs to jointly learn a nonsequential model; that is, the nonsequential model could be trained on multiple modalities of input concurrently, and this idea will be the focus of our follow-up research.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the Doctoral Scientific Research Foundation of Jiangxi University of Science and Technology (Grant no. jxxjbs19029), Science and Technology Developing Project of Jilin Province, China (Grant no. 20150204007GX), Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Science and Technology Development Plan Project of Jilin Province (Grant no. 20180520017JH), and Science and Technology Project of the Jilin Provincial Education Department (Grant no. JJKH20170107KJ).