1 Introduction

Facial expressions are the primary source for human emotion recognition, and a deep understanding of these emotional responses gives a major advantage to any dependent field. The information retrieved by emotion detection and recognition technology can leverage a wide range of applications, such as lie detection, pain assessment, surveillance, healthcare, consumer electronics, law enforcement and many others that depend on Human-Computer Interfaces. The challenge in the field is to develop a way to integrate the large spectrum of expressions intrinsically related to the human capacity to express feelings. However, there are six basic (or prototypic) expressions that are consensual across cultures, namely Anger, Disgust, Fear, Happiness, Sadness and Surprise [4]. These have been used by the majority of Facial Expression Recognition (FER) systems.

A classic FER system is structured in three stages, starting with face detection, followed by feature extraction and, in the final stage, expression recognition. The first stage is the most explored and can be considered an exhausted topic; a few surveys have identified and categorized many of the proposed improvements and extensions [1]. Viola and Jones gave face detection a major boost [16], and their classifier is still widely used despite the recent appearance of more effective alternatives [6]. The CIFE dataset survey [12] includes a descriptive table of the algorithms used in the above stages and concludes that feature-based approaches perform better. Most of the drawbacks come from the dataset construction, performed under a controlled environment (poses, lighting, no occlusions), which jeopardizes the robustness of the system in the presence of such variations. Despite these constraints, in the past few years hardware innovations and the consequent development of deep hierarchical learning models have made it possible to extract complex data interactions from large-scale datasets and build advanced solutions on demand. Within this context, an important candidate solution has been revisited - Convolutional Neural Networks (CNNs) - leading to significant results in image classification, as demonstrated in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Some works stand out for their performance in this large-scale image classification problem: AlexNet [10] from 2012 and, more recently, GoogLeNet [15], inspired by a deeper concept - “Network in Network”.

In this paper, we propose and develop a Convolutional Neural Network for facial emotion recognition. Our model was trained with seven expressions, adding the expression of Contempt to the prototypic set, based on its presence in a widely explored and reliable dataset, the Extended Cohn-Kanade [13]. In order to enhance the quality of the results, we augmented the dataset with random perturbations from a wide set including skew, translation, scale and horizontal flip. We specified several heuristics for the model configuration and ran the experiments in a standalone version of Caffe [9], supported by an Amazon EC2 instance with access to advanced GPU hardware. We are aware that, in the literature, the methods depend on the type of dataset, falling into static images or dynamic image sequences. In the case of sequence-based approaches, the temporal information can be a concern due to real-time constraints. In this work, we look at how static methods can be applied to sequences of events without compromising reliability, and even improving recognition speed. We put forward a real-time video framework built around the Viola-Jones face tracker available in OpenCV. This framework shows promising preliminary results with our deep approach and matches the question we try to answer: how deep can we rely on emotion recognition? Future work should fully answer this question.

This paper is organized as follows. In the next section we present the related work, highlighting inspiring studies in the field of emotion recognition. In Sect. 3 our proposed model is described, both for static image data and for sequence-based data in real-time application. In Sect. 4 the datasets, the experimental setup, the technicalities of the approach and the evaluation metrics of our classifier are described. In Sect. 5 the results are discussed. Finally, in Sect. 6 the conclusions are presented along with future work.

2 Related Work

The annual scientific contest Emotion Recognition in the Wild Challenge (EmotiW) has been defining the state of the art, focusing on affective sensing in uncontrolled environments. From the latest contest (EmotiW 2015), two ways of dealing with expressions can be identified: one uses static images from the Static Facial Expressions in the Wild (SFEW) dataset; the other takes an acted point of view, resorting to the Acted Facial Expressions in the Wild (AFEW) dataset. Among the static-image approaches, the work in [17] proposes a 3-way detection of the face, with a hierarchical selection among Joint Cascade Detection and Alignment (JDA), a Deep CNN-based detector (DCNN) and MoT, along with a simple network (11 layers). These detectors are processed in a multiple-network framework in order to enhance performance. The approach also includes a pre-processing phase to improve accuracy, which might be considered a drawback for the classification response. An approach resorting to video is strongly dependent on spatio-temporal issues. In [3] a Synchrony Autoencoder (SAE) was introduced for local motion feature extraction, together with an assembled hybrid CNN-RNN network.

3 Proposed Emotion Recognition Model

Our proposed model was constructed with LeNet-5 as a baseline and has been progressively developed. Its current stage is depicted in Fig. 1.

The classic LeNet-5 architecture [11] is a combination of seven layers: a convolutional layer with 6 feature maps and a \(5\times 5\) kernel, followed by a sub-sampling (pooling) layer that halves the resolution. This pooling layer and the next convolutional layer (16 feature maps) are not fully connected, in order to break symmetry. The network also includes another pooling stage, maintaining the 16 feature maps and the \(5\times 5\) kernel. As opposed to our version of LeNet-5, it then includes a convolutional layer whose kernel covers the whole input (yielding \(1\times 1\) feature maps), mapping to 120 units, before the last fully connected layers with 84 and 10 neurons, respectively (since LeNet-5 addresses digit recognition).

Fig. 1. Flowchart of the different stages of our CNN model, adapted from the classic LeNet-5 - LeNet-Ov (our version)

Our model starts with a convolutional layer with a \(5\times 5\) kernel and 20 feature maps plus a shared bias per map, ending up with 520 parameters. The next layer, a pooling layer, performs max downsampling with a \(2\times 2\) kernel. This convolution-pooling process is repeated, except that the pooling stride is changed to 2 and the convolution produces 50 instead of 20 feature maps, augmenting the parameters to 1300. We adopted the alternative neuron output proposed by Krizhevsky et al. [10]: instead of the standard functions \(f(x)=tanh(x)\) or \(f(x)=(1+e^{-x})^{-1}\), we use the faster \(f(x)=max(0,x)\), designated Rectified Linear Units (ReLUs). Convolutional neural networks with ReLUs proved to be six times faster than an equivalent network with saturating neurons in reaching a \(25\%\) training error rate. Taking this into account, the fully connected layer that follows, containing 500 outputs, is also followed by a ReLU. So far the structure is similar to LeNet; however, we introduced a Dropout layer between the fully connected layers, reducing their connectivity by a ratio of 0.4 to 0.5, i.e., randomly dropping units so that they do not contribute to the forward pass. This procedure helps to overcome overfitting: dropout prevents co-adaptation and showed a significant improvement of about \(10\%\) in accuracy on the ILSVRC 2012 validation and test sets [14]. Finally, the last fully connected layer is responsible for shrinking the feature maps to our class problem - 7 (expressions). The weights of the network follow the xavier initialization, except for the first convolutional layer, for which we experimented with gaussian, unitball and xavier fillers.
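For concreteness, the following is a minimal sketch of this layout written with Caffe's Python NetSpec interface (pycaffe). The layer names, the LMDB source path, the batch size and the input scaling are illustrative assumptions rather than the exact training configuration used in our experiments; the first convolutional layer could equally be configured with the gaussian or positive unitball fillers mentioned above.

```python
# Minimal sketch of the LeNet-Ov layout described above (pycaffe NetSpec).
# Layer names, LMDB path and batch size are illustrative assumptions.
from caffe import layers as L, params as P
import caffe

def lenet_ov(lmdb_path, batch_size=64):
    n = caffe.NetSpec()
    # Input: cropped face images and expression labels from an LMDB database.
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB,
                             source=lmdb_path,
                             transform_param=dict(scale=1.0 / 255), ntop=2)
    # Conv1: 20 feature maps, 5x5 kernel (xavier filler shown here).
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20,
                            weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, pool=P.Pooling.MAX)
    # Conv2: 50 feature maps, 5x5 kernel; second pooling uses stride 2.
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50,
                            weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    # Fully connected layer with 500 outputs, followed by ReLU and Dropout
    # (the paper reports a dropout ratio between 0.4 and 0.5).
    n.ip1 = L.InnerProduct(n.pool2, num_output=500,
                           weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.ip1, in_place=True)
    n.drop1 = L.Dropout(n.relu1, dropout_ratio=0.5, in_place=True)
    # Final fully connected layer shrinks the features to the 7 expression classes.
    n.ip2 = L.InnerProduct(n.drop1, num_output=7,
                           weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
    return n.to_proto()
```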

4 Experimental Design

The Extended Cohn-Kanade dataset [13] (CKP) is labeled from 0 to 7, corresponding to Neutral, Anger, Contempt, Disgust, Fear, Happiness, Sadness and Surprise, and contains 593 sequences across 123 subjects, with posed and non-posed (candid) expressions captured in \(640\times 490\) px or \(640\times 480\) px frames, depending on the channel. The class Contempt, despite being excluded from the range of the six basic expressions, was used mainly because it was reported to be recognized above \(75\%\) in both Western and non-Western cultures [5]. On the other hand, the neutral face was not considered in the training stage, since it is hardly present in video-based classification. Only labeled images at the peak of expression (apex state) were considered (1631 images), cropped as depicted in Fig. 1. The CKP dataset was split into \(70\%\) for training; the remaining images were used for the validation and test phases.

In order to feed the network properly, the CKP images were augmented with random perturbations, based on the expressive results [17] of an experiment over a lower-resolution dataset (the FER dataset [8]). The perturbations - skew, translation, scale, rotation and horizontal flip - were applied separately in order to achieve a wider set, instead of the proposed overlapping method. Skew parameters were randomly selected from \(\{-0.1,0,0.1\}\); translation parameters were sampled from \(\{0, \delta \}\), where \(\delta \) is a random sample from [0, 4]; scaling uses the \(\delta \) value to define a random parameter \(c = 47/(47-\delta )\); and the rotation angle is sampled randomly from \(\{-\pi /18, 0, \pi /18\}\). The final augmented version has 978, 288 and 132 images per class in the training, validation and test phases, respectively.
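The following minimal sketch illustrates how these perturbation parameters can be sampled. The function name, the uniform choice among perturbation types and the constant 47 implied by \(c = 47/(47-\delta )\) are our own assumptions; applying the chosen perturbation to an image is left out.

```python
# Sketch of the separate (non-overlapping) perturbation sampling described above.
import math
import random

def sample_perturbation():
    """Pick one perturbation type and draw its parameter as described in the text."""
    delta = random.uniform(0, 4)  # delta sampled from [0, 4]
    kind = random.choice(["skew", "translation", "scale", "rotation", "flip"])
    if kind == "skew":
        return kind, random.choice([-0.1, 0.0, 0.1])            # skew factor
    if kind == "translation":
        return kind, random.choice([0.0, delta])                 # shift in pixels
    if kind == "scale":
        return kind, 47.0 / (47.0 - delta)                       # scale factor c
    if kind == "rotation":
        return kind, random.choice([-math.pi / 18, 0.0, math.pi / 18])  # radians
    return kind, None                                            # horizontal flip

# Example: draw a few perturbations for one training image.
for _ in range(5):
    print(sample_perturbation())
```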
We also added to the test phase a set of images populated with frames (Fig. 2) from a real-time video framework composed of a Viola-Jones OpenCV face tracker [16] together with our classifier (a sketch of this capture loop is shown after Fig. 2). Images were resized to the classifier input shape (\(224\times 224\) px), captured at 35 frames per second, and classified in 0.250 s on average (including the cropping process) into an expression displayed on the command line.

The experiments started by settling the network, questioning the relevant inner parameters together with an appropriate solver. Fixing the network involved testing some prominent networks from the state of the art - AlexNet from the ILSVRC 2012 classification challenge [10], the more recent GoogLeNet and the classic LeNet-5 - in order to report a fair judgment. The non-augmented dataset was considered small enough for a CNN input and therefore a candidate for a sanity check on the hyperparameters. In this context, we used GoogLeNet and LeNet-5, and both passed the test, overfitting with a high training accuracy between [0.9, 0.92], whereas the validation loss, computed by summing the total weighted loss over the network (for the outputs with non-zero loss), reached a value of 0.8. In order to select the right network, we conducted some preliminary classification experiments over a short amount of training time, and the best gains in the test set came from LeNet-5, as depicted in Fig. 3. Since the results over the training (Table 1) and validation sets showed a slow loss decay, no further experiments on GoogLeNet and AlexNet were carried out, because training memory and processing time would become an issue. Considering these marks, a new version of LeNet-5 (LeNet-Ov) was explored, with its baseline parameters taken from the Caffe standards.

Fig. 2. CKP static images vs our frame sequence, both cropped to \(224\times 224\) pixels
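For reference, a minimal sketch of a capture-and-crop loop like the one behind the frames in Fig. 2 is given below, assuming the Haar cascade bundled with opencv-python and the default webcam; the classifier call itself is only indicated by a placeholder comment and all names are illustrative.

```python
# Hedged sketch of a Viola-Jones capture-and-crop loop (OpenCV).
import cv2

# Frontal-face Haar cascade shipped with opencv-python.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
capture = cv2.VideoCapture(0)  # default webcam

while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Crop the detected face and resize to the classifier input shape.
        crop = cv2.resize(gray[y:y + h, x:x + w], (224, 224))
        # ... forward `crop` through the trained network and print the label ...
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
```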

The network was prepared taking into consideration the dependency of convolutional neural networks on the feature extraction process, and how the first layers and their high-level information are determinant to the success of the classification. Therefore, we tested different ways of initializing the weights involved. Our network was tested with three types of fillers, namely gaussian, positive unitball and xavier. The gaussian filler draws values from a Gaussian distribution, limiting the number of non-zero inputs to 3 and using a standard deviation of 0.01 (increased from the default 0.005). The positive unitball filler fills a blob with values in [0, 1] such that \(\forall i,\ \sum _{j} x_{ij}=1\). Finally, the xavier weight filler initializes the incoming matrix with values from a uniform distribution within \([-\sqrt{\frac{3}{n}}, \sqrt{\frac{3}{n}}]\), where n is the number of input neurons. This Caffe version of xavier differs from the one initially introduced by Glorot [7] in that it drops the output-size information.

Our version of the network is optimized with a stochastic gradient descent solver, since AlexNet trained with AdaGrad (see Fig. 3) was not expressive. The solver hyperparameters were highly influenced by the results [2] of an automatic hyperparameter optimization over the MNIST dataset. Our parameters fit their hyperspace, with 0.09 for the momentum, 0.0005 for the weight decay and an initial learning rate of 0.001, dropped by a factor of 10 in the last \(10\%\) of the iterations. Our model produces a set of discrete class labels (0–6), or predicted classes, which can be compared with the actual classes using the information provided by the confusion matrices shown in Fig. 3. It is then possible to extract from the test set classification (Table 2) three types of outcomes, as depicted in Table 3, and infer at least two scores, Precision (P) and Recall (R), computed as sketched after this paragraph. For the framework used to test the model (Fig. 4), a confusion matrix with 30 frames per class is presented. Ideally we should have gathered 132 different subjects to test our framework, equal to the number used to test on the CKP dataset; however, this simple test is enough to infer the learning stage. During this test, we tried to mimic the CKP dataset, expecting the same labels in return.
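As an illustration only, the minimal sketch below shows how per-class precision, recall and F1 can be derived from such a confusion matrix; the \(7\times 7\) example matrix is synthetic, and the row/column convention (rows as actual classes, columns as predicted classes) is our assumption rather than taken from the paper's tables.

```python
# Per-class precision, recall and F1 from a confusion matrix (rows = actual,
# columns = predicted); the example 7x7 matrix below is made up.
import numpy as np

def precision_recall_f1(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                    # true positives per class
    precision = tp / cm.sum(axis=0)     # TP / (TP + FP), column-wise
    recall = tp / cm.sum(axis=1)        # TP / (TP + FN), row-wise
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with a random 7-class confusion matrix (one row per expression).
rng = np.random.default_rng(0)
cm = rng.integers(0, 30, size=(7, 7)) + 30 * np.eye(7, dtype=int)
p, r, f = precision_recall_f1(cm)
print(np.round(f, 3))
```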

5 Results and Discussion

In this section we discuss the results of the preliminary experiments, as well as the final model achievements compared with our video framework. Figure 3 shows the density of the first classification tests, trained for just 20000 iterations. This evidence, crossed with Table 1, was used to select the best network shape for the problem, filtered by the most accurate results over a short period of training. From Table 1 we can also infer that the poor expressiveness between methods reflects the lack of some augmentation diversity. As explained in the previous section, the most complex networks (AlexNet and GoogLeNet) were discarded due to their poor performance, which reinforces the imbalance between their capacity and the low dimension of the dataset (even augmented). The best of the three versions of our assembled model used the unitball configuration, trained over 100000 iterations, and achieved a top recognition F1-score of 0.906, the average value retrieved from Table 3. This version's accuracy, also \(90\%\) on the test set, is close to the current state of the art on the CKP dataset, which reports an accuracy of \(93\%\). We could improve the accuracy with a threshold value to keep only the classes inferred with the best confidence; however, the purpose of this paper is to evaluate a sequence-based system that is temporally dependent, and because of this we made an effort to reduce the recognition time. According to Table 2, the model with the best results was the one used to perform the tests in the video framework, which captured 30 frames per class. Figure 4 shows that, in our framework, the expression of Contempt is the most expressive, misguiding the recognition of the others. Moreover, the expressions of Anger, Happiness and Sadness were not properly learned. Although the results do not match the desirable performance of the static test version, there are some significant achievements compared with the state of the art, where some of the “emotions in the wild” [17] were also misclassified or inexistent (such as Disgust in their test set). In this case, the difficulty of our video environment constraints can be weighed against the complexity of their dataset, which presents more candid images.

Fig. 3. Normalized confusion matrices - (a) AlexNet/AdaGrad; (b) AlexNet/SGD; (c) GoogLeNet/SGD; (d) modified LeNet/SGD

Table 1. Train loss results used to select the shallow network
Table 2. Test accuracy with three versions of the CNN model (a), (b) and (c).
Table 3. Precision, recall and F1 for the three versions (a), (b) and (c) above.
Fig. 4. Confusion matrix for the classification of the framework on the test set

6 Conclusion

In this paper, we presented a deep inference model for emotion recognition from facial expressions. We developed a convolutional neural network tailored to the problem of emotion recognition from static facial expressions, using the CKP dataset. After several enhancements, the results of our crafted network design are significant, attaining 90% accuracy on the test set. Both the inclusion of a Dropout layer among the architectural decisions and the augmentation of the dataset with several transformations led to a superior performance on the seven facial expressions. Although the results are preliminary, the deep inference model is also able to perform on a sequence of video frames, in real time, at least for some of the facial expressions (a problem that also exists in “wilder” datasets). This paper showed that we can start from static baselines to build a sequence model, thus answering in a positive way our initial question of whether we can rely on a deep emotion recognition model. Future work should focus on the improvement of the training set by increasing the transformations and promoting different lighting conditions and positions. Moreover, the real-time face tracker should also be improved by including active learning with human judgment.