
1 Introduction

Accurately estimating human gaze direction has many applications, including assistive technologies for users with motor disabilities [4], gaze-based human-computer interaction [19], visual attention analysis [16], consumer behavior research [34], AR, VR, and more. Traditionally, gaze has been estimated with specialized hardware: infrared illumination directed into the user's eyes, dedicated cameras, and sometimes a headrest. Recently, deep-learning-based approaches have taken first steps towards fully unconstrained gaze estimation under free head motion, in environments with uncontrolled illumination, and using only a single commodity (and potentially low-quality) camera. However, this remains a challenging task due to inter-subject variance in eye appearance, self-occlusions, and variations in head pose and eye rotation. As a consequence, current approaches attain accuracies only in the order of \(6^\circ \) and remain far from the requirements of many application scenarios. While demonstrating the feasibility of purely image-based gaze estimation and introducing large datasets, these learning-based approaches [14, 43, 44] have leveraged convolutional neural network (CNN) architectures, originally designed for image classification, with only minor modifications. For example, [43, 45] simply append head pose orientation to the first fully connected layer of either LeNet-5 or VGG-16, while [14] merges multiple input modalities by replicating convolutional layers from AlexNet. In [44] the AlexNet architecture is modified to learn so-called spatial weights that emphasize important activations by region when full-face images are provided as input. Typically, the proposed architectures are supervised only via a mean-squared error loss on the gaze direction output, represented either as a 3-dimensional unit vector or as pitch and yaw angles in radians.

Fig. 1. Our sequential neural network architecture first estimates a novel pictorial representation of 3D gaze direction, then performs gaze estimation from the minimal image representation to yield improved performance on MPIIGaze, Columbia and EYEDIAP.

In this work we propose a network architecture that has been specifically designed with the task of gaze estimation in mind. An important insight is that first regressing to an abstract but gaze-specific representation helps the network to more accurately predict the final 3D gaze direction. Furthermore, introducing this gaze representation also allows for intermediate supervision, which we experimentally show to further improve accuracy. Our work is loosely inspired by recent progress in the field of human pose estimation, where earlier work directly regressed joint coordinates [32]. More recently, the need for a more task-specific form of supervision has led to the use of confidence maps or heatmaps, where the position of a joint is depicted as a 2-dimensional Gaussian [20, 31, 35]. This representation allows for a simpler mapping between input image and joint position, allows for intermediate supervision, and hence for deeper networks. However, this concept of heatmaps for regularizing training is not directly applicable to gaze estimation, since the crucial eyeball center is not observable in 2D image data. We propose a conceptually similar representation for gaze estimation, called gazemaps. Such a gazemap is an abstract, pictorial representation of the eyeball, the iris and the pupil at its center (see Fig. 1).

The simplest depiction of an eyeball's rotation can be made via a circle and an ellipse, the former representing the eyeball and the latter the iris. The gaze direction is then defined by the vector connecting the center of the larger circle with the center of the ellipse. Thus, 3D gaze direction can be pictorially represented in the form of an image, where a spherical eyeball and circular iris are projected onto the image plane, resulting in a circle and an ellipse. Changes in gaze direction hence result in changes in the positioning of the ellipse (cf. Fig. 2a). This pictorial representation can easily be generated from existing training data, given known gaze direction annotations. At inference time, recovering gaze direction from such a pictorial representation is a much simpler task than regressing directly from raw pixel values. However, adapting the input image to fit our pictorial representation is non-trivial: for a given eye image, a circular eyeball and an ellipse must be fitted, then centered and rescaled to the expected shape. We experimentally observe that this task can be performed well using a fully convolutional architecture. Furthermore, we show that our approach significantly outperforms prior work on the final task of gaze estimation.

Our main contribution is a novel architecture for appearance-based gaze estimation. At the core of the proposed architecture lies the pictorial representation of 3D gaze direction, to which the network fits the raw input images and from which additional convolutional layers estimate the final gaze direction. In addition, we perform: (a) an in-depth analysis of the effect of intermediate supervision using our pictorial representation, (b) a quantitative evaluation and comparison against state-of-the-art gaze estimation methods on three challenging datasets (MPIIGaze, EYEDIAP, Columbia) in the person-independent setting, and (c) a detailed evaluation of the robustness of a model trained using our architecture with respect to gaze direction, head pose, and image quality. Finally, we show that our method reduces gaze error by \(18\%\) compared to the state-of-the-art [45] on MPIIGaze.

2 Related Work

Here we briefly review the most important work in eye gaze estimation, as well as work from adjacent areas such as image classification and human pose estimation that touches on relevant aspects of network architecture.

2.1 Appearance-Based Gaze Estimation with CNNs

Traditional approaches to image-based gaze estimation are typically categorized as feature-based or model-based. Feature-based approaches reduce an eye image down to a set of features based on hand-crafted rules [11, 12, 24, 39] and then feed these features into simple, often linear machine learning models to regress the final gaze estimate. Model-based methods instead attempt to fit a known 3D model to the eye image [28, 33, 37, 40] by minimizing a suitable energy.

Appearance-based methods learn a direct mapping from raw eye images to gaze direction. Learning this direct mapping can be very challenging due to changes in illumination, (partial) occlusions, head motion and eye decorations. Due to these challenges, appearance-based gaze estimation methods require large, diverse training datasets and typically leverage some form of convolutional neural network architecture.

Early works in appearance-based methods were restricted to laboratory settings with fixed head pose [1, 30]. These initial constraints have been progressively relaxed, notably by the introduction of new datasets collected in everyday settings [14, 43] or in simulated environments [27, 36, 38]. The increasing scale and complexity of training data has given rise to a wide variety of learning-based methods, including variations of linear regression [7, 17, 18], random forests [27], k-nearest neighbours [27, 38], and CNNs [14, 25, 36, 43, 44, 45]. CNNs have proven to be more robust to visual appearance variations and are capable of person-independent gaze estimation when provided with sufficient scale and diversity of training data. Person-independent gaze estimation can be performed without a user calibration step and can be applied directly to areas such as visual attention analysis on unmodified devices [21], interaction on public displays [46], and identification of gaze targets [42], albeit at the cost of greater training data requirements and computational cost.

Several CNN architectures have been proposed for person-independent gaze estimation in unconstrained settings, mostly differing in the input data modalities they support. Zhang et al. [43, 44] adapt the LeNet-5 and VGG-16 architectures such that head pose angles (pitch and yaw) are concatenated to the first fully-connected layers. Despite its simplicity, this approach yields the current best gaze estimation error of \(5.5^\circ \) in the within-dataset cross-person evaluation on MPIIGaze with single eye image and head pose input. In [14] separate convolutional streams are used for left/right eye images, a face image, and a \(25\times 25\) grid indicating the location and scale of the detected face in the image frame. Their experiments demonstrate that this approach yields improvements compared to [43]. In [44] a single face image is used as input and so-called spatial-weights are learned. These emphasize important features based on the input image, yielding considerable improvements in gaze estimation accuracy.

We introduce a novel pictorial representation of eye gaze and incorporate it into a deep neural network architecture via intermediate supervision. To the best of our knowledge, we are the first to apply a fully convolutional architecture to the task of appearance-based gaze estimation. We show that together these contributions lead to a significant performance improvement of \(18\%\), even when using a single eye image as sole input.

2.2 Deep Learning with Auxiliary Supervision

It has been shown [15, 29] that applying a loss function to intermediate outputs of a network can improve performance on a variety of tasks. This technique was introduced to address the vanishing gradients problem during the training of deeper networks. In addition, such intermediate supervision allows the network to quickly learn a rough estimate of the final output and then refine it, simplifying the mappings that need to be learned at every layer. Subsequent works have adopted intermediate supervision [20, 35] to good effect for human pose estimation by replicating the final output loss.

Another technique for improving neural network performance is the use of auxiliary data through multi-task learning. In [23, 47], the architectures consist of a single shared convolutional stream which is split into separate fully-connected layers or regression functions for the auxiliary tasks of gender classification, face visibility, and head pose. Both works show marked improvements over state-of-the-art results in facial landmark localization. In these approaches, the introduction of multiple learning objectives imposes an implicit prior on the network to learn a representation that is informative to all tasks. In contrast, we explicitly introduce a gaze-specific prior into the network architecture via gazemaps.

Most similar to our contribution is the work in [9], where facial landmark localization performance is improved by applying an auxiliary emotion classification loss. A key aspect to note is that their network is sequential, that is, the emotion recognition network takes only facial landmarks as input. The detected facial landmarks thus act as a manually defined representation for emotion classification and create a bottleneck in the full data flow. It is shown experimentally that applying such an auxiliary loss (for a different task) yields improvements over state-of-the-art results on the AFLW dataset. In our work, we learn to regress an intermediate and minimal representation for gaze direction, forming a bottleneck before the main task of regressing two angle values. Thus, an important distinction from [9] is that while we employ an auxiliary loss term, it directly contributes to the task of gaze direction estimation. Furthermore, the auxiliary loss is applied as an intermediate task. We detail this further in Sect. 3.1.

Recent work in multi-person human pose estimation [3] learns to estimate joint location heatmaps alongside so-called “part affinity fields”. When combined, the two outputs enable the detection of multiple people's joints with reduced ambiguity in terms of which person a joint belongs to. In addition, at the end of every image scale, the architecture concatenates feature maps from each separate stream such that information can flow between the “part confidence” and “part affinity” maps. Thus, they operate in the image representation space, taking advantage of the strengths of convolutional neural networks. Our work is similar in spirit in that it introduces a novel image-based representation.

3 Method

A key contribution of our work is a pictorial representation of 3D gaze direction, which we call gazemaps. This representation is formed of two boolean maps, which can be regressed by a fully convolutional neural network. In this section, we describe our representation (Sect. 3.1) and then explain how we construct our architecture to use this representation as a reference for intermediate supervision during training (Sect. 3.2).

Fig. 2. Our pictorial representation of 3D gaze direction, essentially a projection of simple eyeball and iris models onto binary maps (a). Example pairs are shown in (b) with (left-to-right) input image, iris map, eyeball map, and a superimposed visualization.

3.1 Pictorial Representation of 3D Gaze

In the task of appearance-based gaze estimation, an input eye image is processed to yield gaze direction in 3D. This direction is often represented as a 3-element unit vector \(\varvec{v}\) [6, 25, 44], or as two angles representing eyeball pitch and yaw \(\varvec{g}= \left( \theta ,\,\phi \right) \) [27, 36, 43, 45]. In this section, we propose an alternative to previous direct mappings to \(\varvec{v}\) or \(\varvec{g}\).

If we denote the input eye image as \(\varvec{x}\) and the regression target as \(\varvec{g}\), a conventional gaze estimation model learns \(f: \varvec{x}\rightarrow \varvec{g}\). The mapping f can be complex, as reflected by the improvements in accuracy that have been attained simply by adopting newer CNN architectures, ranging from LeNet-5 [25, 43] and AlexNet [14, 44] to VGG-16 [45], the current state-of-the-art CNN architecture for appearance-based gaze estimation. We hypothesize that it is possible to learn an intermediate image representation of the eye, \(\varvec{m}\). That is, we define our model as \(\varvec{g}= k \circ j(\varvec{x})\), where \(j: \varvec{x}\rightarrow \varvec{m}\) and \(k:\varvec{m}\rightarrow \varvec{g}\). We conjecture that the combined complexity of learning j and k is significantly lower than that of learning f directly, allowing neural network architectures with significantly lower model complexity to be applied to the same task of gaze estimation with equivalent or higher performance.

Thus, we propose to estimate so-called gazemaps (\(\varvec{m}\)) and from that the 3D gaze direction (\(\varvec{g}\)). We reformulate the task of gaze estimation into two concrete tasks: (a) reduction of input image to minimal normalized form (gazemaps), and (b) gaze estimation from gazemaps.
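As a purely schematic illustration of this decomposition, the PyTorch snippet below composes two placeholder modules standing in for the hourglass network (j) and the DenseNet regressor (k) described in Sect. 4; the module definitions, shapes, and variable names are illustrative assumptions and not a reference implementation.

```python
import torch
import torch.nn as nn

# Placeholders: j maps an eye image x to two gazemap channels m,
# k maps the gazemaps m to the gaze direction g = (pitch, yaw).
j = nn.Conv2d(1, 2, kernel_size=3, padding=1)          # x -> m (stand-in for the hourglass network)
k = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                  nn.Linear(2, 2))                     # m -> g (stand-in for the DenseNet)

x = torch.randn(1, 1, 90, 150)   # a single grayscale eye image
g = k(j(x))                      # two angles: eyeball pitch and yaw
```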

The gazemaps for a given input eye image should be visually similar to the input yet distill only the information necessary for gaze estimation, to ensure that the mapping \(k: \varvec{m}\rightarrow \varvec{g}\) is simple. To do this, we consider that an average human eyeball has a diameter of \({\approx }\)24 mm [2] while an average human iris has a diameter of \({\approx }\)12 mm [5]. We then assume a simple model of the human eyeball and iris, where the eyeball is a perfect sphere and the iris is a perfect circle. For an output image of dimensions \(m\times n\), we assume a projected eyeball diameter of \(2r = 1.2\,n\) and calculate the iris centre coordinates \(\left( u_i,\,v_i\right) \) to be:

$$\begin{aligned} u_i&= \frac{m}{2} - r^\prime \sin \phi \cos \theta \end{aligned}$$
(1)
$$\begin{aligned} v_i&= \frac{n}{2} - r^\prime \sin \theta \end{aligned}$$
(2)

where \(r^\prime = r \cos \left( \sin ^{-1} \frac{1}{2}\right) \), and gaze direction \(\varvec{g}=\left( \theta ,\phi \right) \). The iris is drawn as an ellipse with major-axis diameter of r and minor-axis diameter of \(r\left| \cos \theta \cos \phi \right| \). Examples of our gazemaps are shown in Fig. 2b where two separate boolean maps are produced for one gaze direction \(\varvec{g}\).
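To illustrate how such gazemaps can be generated from a known gaze direction, the sketch below rasterizes the eyeball circle and iris ellipse following Eqs. (1)-(2). OpenCV is assumed for rasterization; the output resolution, the axis-aligned ellipse orientation, and the function name are illustrative choices, not the authors' reference implementation.

```python
import cv2
import numpy as np


def draw_gazemaps(theta, phi, height=45, width=75):
    """Render the two boolean gazemaps for gaze pitch/yaw (theta, phi) in radians."""
    r = 0.6 * height                       # projected eyeball radius: 2r = 1.2 * n
    r_prime = r * np.cos(np.arcsin(0.5))   # offset radius used in Eqs. (1)-(2)

    # Iris centre (u_i, v_i), Eqs. (1)-(2).
    u_i = width / 2.0 - r_prime * np.sin(phi) * np.cos(theta)
    v_i = height / 2.0 - r_prime * np.sin(theta)

    # Eyeball map: a filled circle of radius r centred in the image.
    eyeball = np.zeros((height, width), np.uint8)
    cv2.circle(eyeball, (width // 2, height // 2), int(round(r)), 255, -1)

    # Iris map: an ellipse with major-axis diameter r and minor-axis diameter
    # r * |cos(theta) * cos(phi)|. The ellipse orientation is not specified in
    # the text; it is drawn axis-aligned here as a simplification.
    iris = np.zeros((height, width), np.uint8)
    semi_axes = (int(round(r / 2.0)),
                 int(round(r * abs(np.cos(theta) * np.cos(phi)) / 2.0)))
    cv2.ellipse(iris, (int(round(u_i)), int(round(v_i))), semi_axes, 0, 0, 360, 255, -1)

    return iris > 0, eyeball > 0
```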

Learning to predict gazemaps from a single eye image alone is not a trivial task. Not only do extraneous factors such as image artifacts and partial occlusion need to be accounted for, but a simplified eyeball must also be fitted to the given image based on iris and eyelid appearance. The detected regions must then be scaled and centered to produce the gazemaps. Thus the mapping \(j: \varvec{x}\rightarrow \varvec{m}\) requires a more complex neural network architecture than the mapping \(k: \varvec{m}\rightarrow \varvec{g}\).

3.2 Neural Network Architecture

Our neural network consists of two parts: (a) regression from eye image to gazemap, and (b) regression from gazemap to gaze direction \(\varvec{g}\). While any CNN architecture can be used for (b), task (a) requires a fully convolutional architecture such as those used in human pose estimation. We adapt the stacked hourglass architecture from Newell et al. [20] for this task. The hourglass architecture has proven effective in tasks such as human pose estimation and facial landmark detection [41], where complex spatial relations need to be modeled at various scales to estimate the location of occluded joints or keypoints. The architecture performs repeated multi-scale refinement of feature maps, from which the desired output confidence maps can be extracted via \(1\times 1\) convolution layers. We exploit this fact to have our network predict gazemaps instead of classical confidence maps or heatmaps for joint positions. In Sect. 5, we demonstrate that this works well in practice.

In our gazemap-regression network, we use 3 hourglass modules with intermediate supervision applied on the gazemap outputs of the last module only. The minimized intermediate loss is:

$$\begin{aligned} \mathcal {L}_\mathrm {gazemap} = -\alpha \sum _{p\in \mathcal {P}} \varvec{m}(p) \log \hat{\varvec{m}}(p), \end{aligned}$$
(3)

where we calculate a cross-entropy between the predicted gazemap \(\hat{\varvec{m}}\) and the ground-truth gazemap \(\varvec{m}\) over the pixels p in the set of all pixels \(\mathcal {P}\). In our evaluations, we set the weight coefficient \(\alpha \) to \(10^{-5}\).
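A minimal PyTorch sketch of Eq. (3) could look as follows (the text does not name a framework). Here pred_logits stands for the raw \(1\times 1\)-convolution outputs of the last hourglass module and target for the ground-truth boolean gazemaps, both of shape (N, 2, H, W); treating a sigmoid of the logits as \(\hat{\varvec{m}}\) is an assumption.

```python
import torch
import torch.nn.functional as F


def gazemap_loss(pred_logits, target, alpha=1e-5):
    """Pixel-wise cross-entropy of Eq. (3), summed over all pixels of both maps."""
    # The predicted gazemaps are interpreted as per-pixel probabilities via a
    # sigmoid; how \hat{m} is normalized is not stated in the text.
    log_m_hat = F.logsigmoid(pred_logits)
    return -alpha * (target * log_m_hat).sum()
```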

For the regression to \(\varvec{g}\), we select DenseNet, which has recently been shown to perform well on image classification tasks [10] while using fewer parameters than previous architectures such as ResNet [8]. The loss term for gaze direction regression (per input) is:

$$\begin{aligned} \mathcal {L}_\mathrm {gaze} = \left| \left| \varvec{g}- \hat{\varvec{g}}\right| \right| ^2_2, \end{aligned}$$
(4)

where \(\hat{\varvec{g}}\) is the gaze direction predicted by our neural network.
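A matching sketch of Eq. (4) in the same style is given below; combining the two terms by simple summation is an assumption, since the text specifies only the weight \(\alpha \) of the gazemap term.

```python
import torch


def gaze_loss(pred, target):
    # Squared L2 error on (pitch, yaw) per input, Eq. (4), averaged over the batch here.
    return ((target - pred) ** 2).sum(dim=-1).mean()


# Assumed overall training objective (not stated explicitly in the text):
# total_loss = gazemap_loss(pred_maps, gt_maps) + gaze_loss(pred_gaze, gt_gaze)
```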

4 Implementation

In this section, we describe the fully convolutional (Hourglass) and regressive (DenseNet) parts of our architecture in more detail.

4.1 Hourglass Network

In our implementation of the Stacked Hourglass Network [20], we provide images of size \(150\times 90\) as input, and refine 64 feature maps of size \(75\times 45\) throughout the network. The half-scale feature maps are produced by an initial convolutional layer with filter size 7 and stride 2 as done in the original paper [20]. This is followed by batch normalization, ReLU activation, and two residual modules before being passed as input to the first hourglass module.

Our architecture contains 3 hourglass modules, as visualized in Fig. 1. In human pose estimation, the commonly used outputs are 2-dimensional confidence maps, which are pixel-aligned to the input image. Our task differs, and thus we do not apply intermediate supervision to the output of every hourglass module. This allows the input image to be processed at multiple scales over many layers, with the necessary features becoming aligned to the final output gazemap representation. Instead, we apply \(1\times 1\) convolutions to the output of the last hourglass module only and apply the gazemap loss term there (Eq. 3); see Fig. 3.

Fig. 3. Intermediate supervision is applied to the output of an hourglass module by performing \(1\times 1\) convolutions. The intermediate gazemaps and feature maps from the previous hourglass module are then concatenated back into the network to be passed onto the next hourglass module as is done in the original Hourglass paper [20].
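The wiring described in the caption can be sketched as follows; the fusion via a \(1\times 1\) convolution after concatenation and all variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

feat_channels = 64   # feature maps refined throughout the network (Sect. 4.1)

to_gazemaps = nn.Conv2d(feat_channels, 2, kernel_size=1)            # 1x1 conv -> 2 gazemap channels
merge = nn.Conv2d(feat_channels + 2, feat_channels, kernel_size=1)  # fuse after concatenation

hourglass_out = torch.randn(1, feat_channels, 45, 75)   # output of an hourglass module
gazemap_logits = to_gazemaps(hourglass_out)             # supervised with the loss of Eq. (3)

# Gazemaps and feature maps are concatenated and mapped back to the feature
# space before being passed on to the next hourglass module (when one follows).
next_input = merge(torch.cat([hourglass_out, gazemap_logits], dim=1))
```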

4.2 DenseNet

As described in Sect. 3.1, our pictorial representation allows a simpler function to be learnt for the actual task of gaze estimation. To demonstrate this, we employ a very lightweight DenseNet architecture [10]. Our gaze regression network consists of 5 dense blocks (5 layers per block) with a growth rate of 8, bottleneck layers, and a compression factor of 0.5. This results in just 62 feature maps at the end of the DenseNet and, subsequently, 62 features after global average pooling. Finally, a single linear layer maps these features to \(\varvec{g}\). The resulting network is lightweight, with just 66k trainable parameters.
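As a rough stand-in for this configuration, torchvision's DenseNet can be instantiated with a growth rate of 8, five dense blocks of five layers, bottleneck layers, and 0.5 compression, as sketched below. Note that this is only an approximation: torchvision's stem, pooling, and resulting feature count (78 rather than 62) differ from the lightweight network described above, and the number of initial features is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import DenseNet

gaze_head = DenseNet(growth_rate=8,                 # k = 8
                     block_config=(5, 5, 5, 5, 5),  # 5 dense blocks, 5 layers each
                     num_init_features=8,           # assumption: not stated in the text
                     bn_size=4,                     # bottleneck layers
                     num_classes=2)                 # final linear layer -> (pitch, yaw)

# Adapt the stem to take the two gazemap channels as input.
gaze_head.features.conv0 = nn.Conv2d(2, 8, kernel_size=7, stride=2, padding=3, bias=False)

gazemaps = torch.sigmoid(torch.randn(1, 2, 90, 150))  # gazemaps assumed at input resolution here
pitch_yaw = gaze_head(gazemaps)                       # shape (1, 2)
```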

4.3 Training Details

We train our neural network with a batch size of 32, learning rate of 0.0002 and \(L_2\) weights regularization coefficient of \(10^{-4}\). The optimization method used is Adam [13]. Training occurs for 20 epochs on a desktop PC with an Intel Core i7 CPU and Nvidia Titan Xp GPU, taking just over 2 h for one fold (out of 15) of a leave-one-person-out evaluation on the MPIIGaze dataset.

During training, slight data augmentation is applied in terms of image translation and scaling, and the learning rate is multiplied by 0.1 after every 5k gradient update steps to address over-fitting and to stabilize the final error.
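The optimization setup described above can be sketched as follows in PyTorch (a framework choice assumed here, with weight decay standing in for the \(L_2\) regularization coefficient); the placeholder model and synthetic data are for illustration only.

```python
import torch

model = torch.nn.Linear(2, 2)   # placeholder for the full hourglass + DenseNet model

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-4)
# Multiply the learning rate by 0.1 after every 5k gradient update steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.1)

for step in range(10):   # stand-in for 20 epochs of batches of size 32
    x, y = torch.randn(32, 2), torch.randn(32, 2)
    optimizer.zero_grad()
    loss = ((model(x) - y) ** 2).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```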

Fig. 4. Example of image representations learned by our architecture in the absence or presence of \(\mathcal {L}_\mathrm {gazemap}\). Note that the pictorial representation is more consistent, and that the hourglass network is able to account for occlusions. Predicted gaze directions are shown in green, with ground-truth in red. (Color figure online)

5 Evaluations

We perform our evaluations primarily on the MPIIGaze dataset, which consists of images of 15 laptop users in everyday settings. The dataset has been used as the standard benchmark for unconstrained appearance-based gaze estimation in recent years [25, 36, 38, 43, 44, 45]. Our focus is on cross-person single-eye evaluations, where 15 models are trained per configuration or architecture in a leave-one-person-out fashion. That is, a neural network is trained on 14 people's data (1500 entries each from left and right eyes), then tested on the test set of the left-out person (1000 entries). The mean over 15 such evaluations is used as the final error metric representing cross-person performance. As MPIIGaze represents real-world settings well, cross-person evaluations on this dataset are indicative of the real-world person-independence of a given model.
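The per-person errors reported below are mean angular errors in degrees. A common way to compute the angular error from predicted and ground-truth (pitch, yaw) pairs is sketched here; the exact pitch/yaw-to-vector sign convention is an assumption, as the text does not spell it out.

```python
import numpy as np


def pitchyaw_to_vector(pitchyaw):
    """Convert (pitch, yaw) in radians to a 3D unit gaze vector (assumed convention)."""
    theta, phi = pitchyaw[..., 0], pitchyaw[..., 1]
    return np.stack([-np.cos(theta) * np.sin(phi),
                     -np.sin(theta),
                     -np.cos(theta) * np.cos(phi)], axis=-1)


def angular_error(pred, gt):
    """Angle in degrees between predicted and ground-truth gaze directions."""
    a, b = pitchyaw_to_vector(np.asarray(pred)), pitchyaw_to_vector(np.asarray(gt))
    cos_sim = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
```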

To further test the generalization capabilities of our method, we also perform evaluations on two additional datasets in this section: Columbia [26] and EYEDIAP [7], on which we perform 5-fold cross-validation. While Columbia displays large diversity between its 55 participants, the images are of high quality, having been taken with a DSLR. EYEDIAP, on the other hand, suffers from the low resolution of the VGA camera used, as well as the large distance between camera and participant. We select screen target (CS/DS) and static head pose sequences (S) from the EYEDIAP dataset, sampling every 15 s from its VGA video streams (V). Training on moving head sequences (M) with just single eye input proved infeasible, with all models experiencing diverging test error during training. Performance improvements on MPIIGaze, Columbia, and EYEDIAP would indicate that our model is robust to cross-person appearance variations and to the challenges caused by low eye image resolution and quality.

In this section, we first evaluate the effect of our gazemap loss (Sect. 5.1), then compare the performance (Sect. 5.2) and robustness (Sect. 5.3) of our approach against state-of-the-art architectures.

5.1 Pictorial Representation (Gazemaps)

We postulated in Sect. 3.1 that by providing a pictorial representation of 3D gaze direction that is visually similar to the input image, we could achieve improvements in appearance-based gaze estimation. In our experiments we find that applying the gazemap loss term generally offers performance improvements compared to the case where the loss term is not applied. This improvement is particularly pronounced when the DenseNet growth rate is high (e.g. \(k=32\)), as shown in Table 1.

Table 1. Cross-person gaze estimation errors in the absence and presence of \(\mathcal {L}_\mathrm {gazemap}\), with DenseNet (\(k=32\)).

By observing the output of the last hourglass module and comparing it against the input images (Fig. 4), we can confirm that even without intermediate supervision, our network learns to isolate the iris region, yielding a similar image representation of gaze direction across participants. Note that this representation is learned only with the final gaze direction loss, \(\mathcal {L}_\mathrm {gaze}\), and that blobs representing iris locations are not necessarily aligned with the actual iris locations in the input images. Without intermediate supervision, the learned minimal image representation may incorporate visual factors such as occlusion due to hair and eyeglasses, as shown in Fig. 4a.

This supports our hypothesis that an intermediate representation consisting of an iris and eyeball contains the required information to regress gaze direction. However, due to the nature of learning, the network may also learn irrelevant details such as the edges of the glasses. Yet, by explicitly providing an intermediate representation in the form of gazemaps, we enforce a prior that helps the network learn the desired representation, without incorporating the previously mentioned unhelpful details.

5.2 Cross-Person Gaze Estimation

We compare the cross-person performance of our model by conducting a leave-one-person-out evaluation on MPIIGaze and 5-fold evaluations on Columbia and EYEDIAP. In Sect. 3.1 we argued that the mapping k from gazemap to gaze direction should not require a complex architecture to model. Thus, our DenseNet is configured with a low growth rate (\(k=8\)). To allow for a fair comparison, we re-implement two architectures for single-eye image inputs (of size \(150\times 90\)): AlexNet and VGG-16. Both architectures have been used in recent works on appearance-based gaze estimation and are thus suitable baselines [44, 45]. Implementation and training details of these architectures are provided in the supplementary materials.

Table 2. Mean gaze estimation error in degrees for within-dataset cross-person k-fold evaluation. Evaluated on (a) MPIIGaze, (b) Columbia, and (c) EYEDIAP datasets.
Fig. 5. Gazemap predictions (middle) on Columbia and EYEDIAP datasets with ground-truth (red) and predicted (green) gaze directions visualized on input eye images (left). Ground-truth gazemaps are shown on the far right of each triplet. (Color figure online)

In the MPIIGaze evaluations (Table 2a), our proposed approach outperforms the current state-of-the-art approach by a large margin, yielding an improvement of \(1.0^\circ \) (\(5.5^\circ \rightarrow 4.5^\circ \), i.e. \(18.2\%\)). This significant improvement is achieved in spite of the reduced number of trainable parameters used in our architecture (90 M vs 0.7 M). Our performance also compares favorably to that reported in [44] (\(4.8^\circ \)), where full-face input is used in contrast to our single-eye input. While our results cannot be compared directly with those of [44] due to the different definition of gaze direction (face-centred as opposed to eye-centred), the similar performance suggests that eye images may be sufficient as input for the task of gaze direction estimation. Our approach thus attains performance comparable to models taking face input while using considerably fewer parameters than recently introduced architectures (129\(\times \) fewer than GazeNet).

We additionally evaluate our model on the Columbia Gaze and EYEDIAP datasets in Table 2b and c, respectively. While the high image quality results in all three methods performing comparably on Columbia Gaze, our approach still prevails with an improvement of \(0.4^\circ \) over AlexNet. On EYEDIAP, the mean error is very high due to the low resolution and low quality of the input. Note that no head pose estimation is performed, with only single eye input being relied on for gaze estimation. Our gazemap-based architecture shows its strengths in this case, performing \(0.9^\circ \) better than VGG-16, an \(8\%\) improvement. Sample gazemap and gaze direction predictions are shown in Fig. 5, where it is evident that, despite the lack of visual detail, gazemaps can be fitted well enough to yield improved gaze estimation error.

By evaluating our architecture on 3 different datasets with different properties in the cross-person setting, we can conclude that our approach provides significantly higher generalization capabilities compared to previous approaches. Thus, we bring gaze estimation closer to direct real-world applications.

Fig. 6. Robustness of AlexNet (red), VGG-16 (green), and our approach (blue) to different head pose (top), gaze direction (middle), and image quality (bottom). The lines are a moving average. (Color figure online)

5.3 Robustness Analysis

In order to shed more light on our model's performance, we perform an additional robustness analysis. More concretely, we aim to analyze how our approach performs in difficult and challenging situations, such as extreme head pose and gaze direction. To do so, we evaluate a moving average over the outputs of our within-MPIIGaze evaluations, where the y-values correspond to the mean angular error and the x-values correspond to one of the following factors of variation: head pose (pitch and yaw) or gaze direction (pitch and yaw). Additionally, we consider image quality (contrast and sharpness) as a qualitative factor. In order to isolate each factor of variation from the rest, we evaluate the moving average only on the points whose remaining factors are close to their median values. Intuitively, this corresponds to data points where the person moves only in one specific direction while staying at rest in all of the remaining directions. This is not the case for the image quality analysis, where all data points are used. Figure 6 plots the mean angular error as a function of the different movement variations and image qualities. The top row corresponds to variation in head pose, the middle row to variation in gaze direction, and the bottom row to varying image quality. To calculate image contrast, we use the RMS contrast metric, whereas to compute sharpness we employ a Laplacian-based formula as outlined in [22]. Both metrics are explained in the supplementary materials. The figure shows that we consistently outperform competing architectures for extreme head and gaze angles. Notably, we show more consistent performance in particular over large ranges of head pitch and gaze yaw angles. In addition, we surpass prior works on images of varying quality, as shown in Figs. 6e and f.
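For reference, the two image-quality measures can be computed along the following lines; the RMS contrast is the standard deviation of normalized intensities, and the variance of the Laplacian is used as a common Laplacian-based sharpness score (the exact formula of [22] may differ).

```python
import cv2
import numpy as np


def rms_contrast(gray):
    """RMS contrast: standard deviation of intensities normalized to [0, 1]."""
    return (gray.astype(np.float64) / 255.0).std()


def laplacian_sharpness(gray):
    """A common Laplacian-based sharpness score: variance of the Laplacian response."""
    return cv2.Laplacian(gray.astype(np.float64), cv2.CV_64F).var()
```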

6 Conclusion

Our work is a first attempt at designing an explicit prior for the task of gaze estimation with a neural network architecture. We do so by introducing a novel pictorial representation which we call gazemaps. An accompanying architecture and training scheme using intermediate supervision arise naturally as a consequence, with a fully convolutional architecture being employed for the first time for appearance-based eye gaze estimation. Our gazemaps are anatomically inspired and are experimentally shown to outperform approaches that use significantly more model parameters and, at times, more input modalities. We report improvements of up to \(18\%\) on MPIIGaze, along with improvements on two additional datasets against competitive baselines. In addition, we demonstrate that our final model is more robust than prior work to various factors such as extreme head poses and gaze directions, as well as poor image quality.

Future work can look into alternative pictorial representations for gaze estimation and alternative architectures for gazemap prediction. Additionally, there is potential in using synthesized gaze directions (and corresponding gazemaps) for unsupervised training of the gaze regression function, to further improve performance.