
1 Introduction

In this work, we consider the task of learning deep architectures that can transform input images into new images in a certain way (deep image resynthesis). Generally, using deep architectures for image generation has become a very active topic of research. While a lot of very interesting results have been reported over recent years and even months, achieving photo-realism beyond the task of synthesizing small patches has proven to be hard.

Previously proposed methods for deep resynthesis usually tackle the resynthesis problem in a general form and strive for universality. Here, we take an opposite approach and focus on a very specific image resynthesis problem (gaze manipulation) that has a long history in the computer vision community [1, 7, 13, 16, 18, 20, 24, 26, 27] and some important real-life applications. We show that by restricting the scope of the method and exploiting the specifics of the task, we are indeed able to train deep architectures that handle gaze manipulation well and can synthesize output images of high realism (Fig. 1).

Fig. 1.

Gaze redirection with our model trained for vertical gaze redirection. The model takes an input image (middle row) and the desired redirection angle (here varying between \(-15\) and \(+15^\circ \)) and re-synthesizes the image with the new gaze direction. Note the preservation of fine details, including specular highlights, in the resynthesized images.

Few image parts have as dramatic an effect on the perception of an image as the regions depicting the eyes of a person in that image. Humans (and even non-humans [23]) can infer a lot of information about the owner of the eyes, her intent, her mood, and the world around her from the appearance of the eyes and, in particular, from the direction of the gaze. The role of gaze in human communication has long been known to be very important [15].

In some important scenarios, there is a need to digitally alter the appearance of eyes in a way that changes the apparent direction of the gaze. These scenarios include gaze correction in video-conferencing, where the intent and the attitude of a person engaged in a video chat are distorted by the displacement between the face on her screen and the web camera (e.g. while the intent might be to gaze into the eyes of the other person, the apparent gaze direction in a transmitted frame will be downwards). Another common scenario that needs gaze redirection is the “talking head” type of video, where a speaker reads text appearing alongside the camera, but it is desirable to redirect her gaze into the camera. One more example is the editing of photos (e.g. group photos) and movies (e.g. during post-production) in order to make the gaze direction consistent with the intent of the photographer or the movie director.

All of these scenarios put very high demands on the realism of the result of the digital alteration, and some of them also require real-time or near real-time operation. To meet these challenges, we develop a new deep feed-forward architecture that combines several principles of operation (coarse-to-fine processing, image warping, intensity correction). The architecture is trained end-to-end in a supervised way using a specially collected dataset that depicts the change of the appearance under gaze redirection in real life.

Qualitative and quantitative evaluations demonstrate that our deep architecture can synthesize very high-quality eye images, as required by the nature of the applications, and does so at several frames per second. Compared to several recent methods for deep image synthesis, the output of our method contains a larger amount of fine detail (comparable to the amount in the input image). The quality of the results also compares favorably with the results of the random forest-based gaze redirection method of [16]. Our approach thus has practical importance in the application scenarios outlined above, and also contributes to the actively-developing field of image generation with deep models.

2 Related Work

Deep Learning and Image Synthesis. Image synthesis using neural networks is receiving growing attention [2, 3, 5, 8, 9, 19]. More related to our work are methods that learn to transform input images in certain ways [6, 17, 22]. These methods proceed by learning internal compact representations of images using encoder-decoder (autoencoder) architectures, and then transforming images by changing their internal representation in a certain way that can be trained from examples. We have conducted numerous experiments following this approach, combining standard autoencoders with several ideas that have been reported to improve the results (convolutional and up-convolutional layers [3, 28], adversarial loss [8], variational autoencoders [14]). However, despite our efforts (see the supplementary material), we have found that for large enough image resolutions, the outputs of the network lacked high-frequency details and were biased towards the typical mean of the training data (the “regression-to-mean” effect). This is consistent with the results demonstrated in [6, 17, 22], which also exhibit noticeable blurring.

Compared to [6, 17, 22], our approach can learn to perform only a restricted set of image transformations. However, the perceptual quality and, in particular, the amount of high-frequency detail are considerably better in the case of our method, owing to the fact that we deliberately avoid any compression of the input data within the processing pipeline. This is crucial for the class of applications that we consider.

Finally, the idea of spatial warping that lies at the core of the proposed system has been previously suggested in [12]. In relation to [12], parts of our architecture can be seen as spatial transformers with the localization network directly predicting a sampling grid instead of low-dimensional transformation parameters.

Gaze Manipulation. An early work on monocular gaze manipulation [24] did not use machine learning, but relied on pre-recording a number of potential eye replacements to be copy-pasted at test time. The idea of gaze redirection using supervised learning was suggested in [16], which also used warping fields that were, in their case, predicted by machine learning. Compared to their method, we use a deep convolutional network as a predictor, which allows us to achieve better result quality. Furthermore, while the random forests in [16] are trained for a specific angle of gaze redirection, our architecture allows the redirection angle to be specified as an input and to change continuously within a certain range. Most practical applications discussed above require such flexibility. Finally, the realism of our results is boosted by the lightness adjustment module, which has no counterpart in the approach of [16].

Less related to our approach are methods that aim to solve the gaze problem in videoconferencing by synthesizing 3D-rotated views of either the entire scene [1, 20, 26] or of the face (which is subsequently blended into the unrotated head) [7, 18]. Of these works, only [7] operates in a monocular setting without relying on extra imaging hardware. The general problem with novel view synthesis is how to fill disoccluded regions. In the cases when the 3D-rotated face is blended into the image of the unrotated head [7, 18], there is also a danger of distorting the head proportions characteristic of a person.

3 The Model

In this section, we discuss the architecture of our deep model for re-synthesis. The model is trained on pairs of images corresponding to eye appearance before and after the redirection. The redirection angle serves as an additional input parameter that is provided both during training and at test time.

As in [16], the bulk of gaze redirection is accomplished by warping the input image (Fig. 2). The task of the network is therefore the prediction of the warping field. This field is predicted in two stages in a coarse-to-fine manner, where the decisions at the fine scale are informed by the result of the coarse stage. Beyond coarse-to-fine warping, the photorealism of the result is improved by performing a pixel-wise correction of the brightness, where the amount of correction is again predicted by the network. All operations outlined above are implemented in a single feed-forward architecture and are trained jointly end-to-end.

Fig. 2.

The proposed system takes an input eye region, feature points (anchors), and a correction angle \( \alpha \), and sends them to the multi-scale neural network (see Sect. 3.2) that predicts a flow field. The flow field is then applied to the input image to produce an image of a redirected eye. Finally, the output is enhanced by processing with the lightness correction neural network (see Sect. 3.4).

We now provide more details on each stage of the procedure, starting with a more detailed description of the data used to train the architecture.

3.1 Data Preparation

At training time, our dataset allows us to mine pairs of images containing eyes of the same person looking in two different directions separated by a known angle \( \alpha \). The head pose, the lighting, and all other nuisance parameters are (approximately) the same between the two images in a pair. Following [16] (with some modifications), we extract the image parts around each of the eyes and resize them to a characteristic scale. For simplicity of explanation, let us assume that we need to handle left eyes only (the right eyes can be handled at training and at test time via mirroring).

To perform the extraction, we employ an external face alignment library [25] producing, among other things, \( N = 7 \) feature points \( \{ (x_i^{\text{ anchor }}, y_i^{\text{ anchor }} ) \, | \, i = 1, \ldots , N \} \) for the eye (six points along the edge and also the pupil center). Next, we compute a tight axis-aligned bounding box \( \mathcal {B}^{\prime } \) of the points in the input image. We enlarge \( \mathcal {B}^{\prime } \) to the final bounding-box \( \mathcal {B} \) using a characteristic radius R that equals the distance between the corners of an eye. The size of \( \mathcal {B} \) is set to \( 0.8 R \times 1.0 R \). We then cut out the interior of the estimated box from the input image, and also from the output image of the pair (using exactly the same bounding box coordinates). Both images are then rescaled to a fixed size (\( W \times H = 51 \times 41 \) in our experiments). The resulting image pair serves as a training example for the learning procedure (Fig. 4-Right).
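For concreteness, the cropping procedure can be sketched as follows (a NumPy/OpenCV sketch; the orientation of the \( 0.8 R \times 1.0 R \) box, the choice of eye corners, and the rounding conventions are illustrative assumptions rather than the exact settings of our implementation):

```python
import numpy as np
import cv2  # used here only for resizing; any image library would do


def extract_eye_crop(image, anchors, out_w=51, out_h=41):
    """Cut out an eye region around the N = 7 anchor points and resize it.

    `image` is an H x W x 3 array; `anchors` is a (7, 2) array of (x, y)
    feature points (six points along the eyelid edge plus the pupil center).
    """
    xs, ys = anchors[:, 0], anchors[:, 1]
    # Center of the tight axis-aligned bounding box B' of the anchor points.
    cx, cy = (xs.min() + xs.max()) / 2.0, (ys.min() + ys.max()) / 2.0
    # Characteristic radius R: distance between the eye corners
    # (assumed here to be the horizontally most distant anchors).
    left, right = anchors[np.argmin(xs)], anchors[np.argmax(xs)]
    R = np.linalg.norm(right - left)
    # Enlarge B' to the final box B of size 0.8R x 1.0R (height x width).
    half_w, half_h = 0.5 * R, 0.4 * R
    xa, xb = int(round(cx - half_w)), int(round(cx + half_w))
    ya, yb = int(round(cy - half_h)), int(round(cy + half_h))
    crop = image[ya:yb, xa:xb]
    # Rescale to the fixed training resolution W x H = 51 x 41.
    return cv2.resize(crop, (out_w, out_h))
```

The same box coordinates are applied to both images of a training pair before resizing, as described above.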

3.2 Warping Modules

Each of the two warping modules takes as an input the image, the position of the feature points, and the redirection angle. All inputs are expressed as maps as discussed below, and the architecture of the warping modules is thus “fully-convolutional”, including several convolutional layers interleaved with Batch Normalization layers [11] and ReLU non-linearities (the actual configuration is shown in the supplementary material). To preserve the resolution of the input image, we use ‘same’-mode convolutions (with zero padding), set all strides to one, and avoid using max-pooling.
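For illustration, the general pattern of a warping module can be sketched as follows (a PyTorch sketch; the number of layers and the channel widths are placeholders, the actual configuration is given in the supplementary material):

```python
import torch.nn as nn


class WarpingModule(nn.Module):
    """Fully-convolutional flow predictor: 'same'-mode convolutions with
    stride 1, interleaved with BatchNorm and ReLU, no pooling, so the
    spatial size of the input map stack is preserved."""

    def __init__(self, in_channels=33, width=32, num_layers=4):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(num_layers):
            layers += [nn.Conv2d(c, width, kernel_size=3, padding=1),
                       nn.BatchNorm2d(width),
                       nn.ReLU(inplace=True)]
            c = width
        # The last convolutional layer outputs a two-channel flow (dx, dy).
        layers.append(nn.Conv2d(c, 2, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):      # x: (B, C_in, H, W)
        return self.net(x)     # flow: (B, 2, H, W)
```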

Coarse Warping. The last convolutional layer of the first (half-scale) warping module produces a pixel-flow field (a two-channel map), which is then upsampled to the full resolution, yielding \( \mathbf {D}_{\text{ coarse }}(I, \alpha ) \), and applied to warp the input image by means of a bilinear sampler \( \mathbf {S} \) [12, 21] that produces the coarse estimate:

$$\begin{aligned} O_{\text{ coarse }} = \mathbf {S} \left( I, \mathbf {D}_{\text{ coarse }}(I, \alpha ) \right) . \end{aligned}$$
(1)

Here, the sampling procedure \( \mathbf {S} \) computes each pixel of \(O_{\text{ coarse }}\) by sampling the input image at the position determined by the flow field:

$$\begin{aligned} O_{\text{ coarse }}(x,y,c) = I\{x+\mathbf {D}_{\text{ coarse }}(I, \alpha )(x,y,1),y+\mathbf {D}_{\text{ coarse }}(I, \alpha )(x,y,2),c\}, \end{aligned}$$
(2)

where c corresponds to a color channel (R,G, or B), and the curly brackets correspond to bilinear interpolation of \(I(\cdot ,\cdot ,c)\) at a real-valued position. The sampling procedure (1) is piecewise differentiable [12].
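A bilinear sampler implementing (1)–(2) can be sketched as follows (a PyTorch sketch; the conversion of pixel offsets into the normalized coordinates expected by `grid_sample` is a detail of this sketch, not of the original implementation):

```python
import torch
import torch.nn.functional as F


def bilinear_sample(image, flow):
    """Eq. (2): O(x, y, c) = I{x + flow_x(x, y), y + flow_y(x, y), c}.

    `image`: (B, 3, H, W); `flow`: (B, 2, H, W) with channels (dx, dy)
    given in pixels. grid_sample expects positions normalized to [-1, 1].
    """
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij")
    x_src = xs.unsqueeze(0) + flow[:, 0]                  # (B, H, W)
    y_src = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack([2.0 * x_src / (w - 1) - 1.0,      # normalize x
                        2.0 * y_src / (h - 1) - 1.0],     # normalize y
                       dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```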

Fine Warping. In the fine warping module, the coarse image estimate \(O_{\text{ coarse }}\) and the upsampled low-resolution flow \(\mathbf {D}_{\text{ coarse }}(I, \alpha )\) are concatenated with the input data (the image, the angle encoding, and the feature point encoding) at the original scale and passed to the \( 1\times \)-scale network, which predicts another two-channel flow \(\mathbf {D}_{\text{ res }}\) that amends the coarse pixel-flow additively [10]:

$$\begin{aligned} \mathbf {D}(I, \alpha ) = \mathbf {D}_{\text{ coarse }}(I, \alpha ) + \mathbf {D}_{\text{ res }}(I, \alpha , O_{\text{ coarse }}, \mathbf {D}_{\text{ coarse }}(I, \alpha )) \, , \end{aligned}$$
(3)

The amended flow is used to obtain the final output (again, via bilinear sampler):

$$\begin{aligned} O = \mathbf {S} \left( I, \mathbf {D}(I, \alpha ) \right) . \end{aligned}$$
(4)

The purpose of coarse-to-fine processing is two-fold. First, the half-scale (coarse) module effectively increases the receptive field of the model, resulting in a flow that moves larger structures in a more coherent way. Second, the coarse module gives a rough estimate of how the redirected eye will look. This estimate is useful for locating problematic regions that can only be fixed by a network operating at a finer scale.
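Putting the pieces together, the coarse-to-fine prediction (1)–(4) can be sketched as follows, reusing the module and sampler sketched above (the fine module is assumed to take \(33 + 3 + 2 = 38\) input channels, and the rescaling of flow values when changing resolution is an assumed implementation detail):

```python
import torch
import torch.nn.functional as F


def coarse_to_fine(inputs_full, inputs_half, coarse_net, fine_net):
    """Eqs. (1)-(4): predict a coarse flow at half scale, upsample it, warp,
    then predict and additively apply a residual flow at the original scale.
    The image is assumed to occupy the first three channels of the stack."""
    image = inputs_full[:, :3]

    # Coarse stage: half-scale flow, upsampled (and rescaled) to full size.
    d_coarse = 2.0 * F.interpolate(coarse_net(inputs_half),
                                   size=image.shape[2:],
                                   mode="bilinear", align_corners=True)
    o_coarse = bilinear_sample(image, d_coarse)                 # Eq. (1)

    # Fine stage: residual flow predicted from the full-scale inputs, the
    # coarse estimate and the upsampled coarse flow; amend additively (Eq. 3).
    d_res = fine_net(torch.cat([inputs_full, o_coarse, d_coarse], dim=1))
    d_final = d_coarse + d_res
    return bilinear_sample(image, d_final), o_coarse            # Eq. (4)
```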

3.3 Input Encoding

As discussed above, alongside the raw input image, the warping modules also receive information about the desired redirection angle and the feature points, both encoded as image-sized feature maps.

Embedding the Angle. Similarly to [6], we treat the correction angle as an attribute and embed it into a higher dimensional space using a multi-layer perceptron \( \mathbf {F}_{\text{ angle }} (\alpha ) \) with ReLU non-linearities. The precise architecture is FC(16) \( \rightarrow \) ReLU \( \rightarrow \) FC(16) \( \rightarrow \) ReLU. Unlike [6], we do not output separate features for each spatial location but rather opt for a single position-independent 16-dimensional vector. The vector is then expressed as 16 constant maps that are concatenated into the input map stack. During learning, the embedding of the angle parameter is also updated by backpropagation.
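A minimal sketch of this embedding (PyTorch; the tiling into constant maps follows the description above):

```python
import torch.nn as nn


class AngleEmbedding(nn.Module):
    """FC(16) -> ReLU -> FC(16) -> ReLU applied to the scalar angle; the
    resulting 16-dimensional vector is tiled into 16 constant H x W maps."""

    def __init__(self, dim=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, dim), nn.ReLU(inplace=True))

    def forward(self, alpha, h, w):       # alpha: (B, 1) correction angles
        emb = self.mlp(alpha)             # (B, 16)
        return emb[:, :, None, None].expand(-1, -1, h, w)   # (B, 16, H, W)
```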

Embedding the Feature Points. Although, in theory, a convolutional neural network of an appropriate architecture should be able to extract the necessary features from the raw input pixels, we found it beneficial to augment the three color channels with 14 additional feature maps containing information about the eye anchor points.

In order to get the anchor maps, for each previously obtained feature point located at \( (x_i^{\text{ anchor }}, y_i^{\text{ anchor }}) \), we compute a pair of maps:

$$\begin{aligned} \begin{aligned} \varDelta _x^i[x, y] = x - x_i^{\text{ anchor }}, \\ \varDelta _y^i[x, y] = y - y_i^{\text{ anchor }}, \end{aligned} \quad \forall (x, y) \in \{ 0, \ldots , W \} \times \{ 0, \ldots , H \}, \end{aligned}$$
(5)

where W, H are the width and height of the input image, respectively. The embedding gives the network “local” access to features similar to those used by the decision trees in [16].

Ultimately, the input map stack consists of 33 maps (RGB + 16 angle embedding maps + 14 feature point embedding maps).
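The anchor maps of (5) and the assembly of the 33-channel input stack can be sketched as follows (PyTorch; shapes follow the conventions used above):

```python
import torch


def anchor_maps(anchors, h, w):
    """Eq. (5): for each of the N = 7 anchor points, two maps holding the
    signed x- and y-offsets from that point (14 maps in total).
    `anchors`: (B, 7, 2) tensor of (x, y) coordinates in pixels."""
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=anchors.dtype, device=anchors.device),
        torch.arange(w, dtype=anchors.dtype, device=anchors.device),
        indexing="ij")
    dx = xs[None, None] - anchors[:, :, 0, None, None]    # (B, 7, H, W)
    dy = ys[None, None] - anchors[:, :, 1, None, None]
    return torch.cat([dx, dy], dim=1)                     # (B, 14, H, W)


def build_input_stack(image, anchors, alpha, angle_embedding):
    """RGB (3) + angle embedding maps (16) + anchor maps (14) = 33 maps."""
    b, _, h, w = image.shape
    return torch.cat([image,
                      angle_embedding(alpha, h, w),
                      anchor_maps(anchors, h, w)], dim=1)
```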

Fig. 3.

Visualization of three challenging redirection cases where the lightness correction module helps considerably compared to the system based solely on coarse-to-fine warping (CFW), which has difficulties expanding the area to the left of the iris. The ‘Mask’ column shows the soft mask corresponding to the parts where lightness is increased. Lightness correction fixes problems with inpainting the disoccluded eye-white and, moreover, emphasizes the specular highlight, increasing the perceived realism of the result.

3.4 Lightness Correction Module

While the bulk of appearance changes associated with gaze redirection can be modeled using warping, some subtle but important transformations are more photometric than geometric in nature and require a more general transformation. In addition, the warping approach can struggle to fill in disoccluded areas in some cases.

To increase the generality of the transformation that can be handled by our architecture, we add the final lightness adjustment module (see Fig. 2). The module takes as input the features computed within the coarse warping and fine warping modules (specifically, the activations of the third convolutional layer), as well as the image produced by the fine warping module. The output of the module is a single map M of the same size as the output image that is used to modify the brightness of the output O using a simple element-wise transform:

$$\begin{aligned} O_\text {final} (x,y,c) = O(x,y,c) \cdot (1-M(x,y)) + M(x,y), \end{aligned}$$
(6)

assuming that the brightness in each channel is encoded between zero and one. The resulting pixel colors can thus be regarded as blends between the colors of the warped pixels and the white color. The actual architecture for the lightness correction module in our experiments is shown in the supplementary material.
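Element-wise, the correction (6) amounts to a blend with white controlled by the predicted mask (a short sketch; keeping the mask in \([0, 1]\) via a sigmoid is an assumption of this sketch, not a stated detail of the architecture):

```python
import torch


def lightness_correction(o, mask_logits):
    """Eq. (6): O_final = O * (1 - M) + M, with O in [0, 1] per channel.
    `o`: (B, 3, H, W) warped output; `mask_logits`: (B, 1, H, W)."""
    m = torch.sigmoid(mask_logits)   # soft mask M in [0, 1]
    return o * (1.0 - m) + m
```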

This idea can, of course, be generalized further to a larger number of colors in the palette for admixing, and these colors can be defined either manually or made dataset-dependent or even image-dependent. Our initial experiments along these directions, however, have not brought consistent improvements in photorealism for the gaze redirection task.

4 Experiments

4.1 Dataset

There are no publicly available datasets suitable for the gaze correction task with a continuously varying redirection angle. Therefore, we collect our own dataset (Fig. 4). To minimize head movement, a person places her head on a special stand and follows with her gaze a point moving on a screen in front of the stand. While the point is moving, we record several images with eyes looking in different fixed directions (about 200 per video sequence) using a webcam mounted in the middle of the screen. For each person we record 2–10 sequences, changing the head pose and lighting conditions between sequences. Training pairs are collected by taking two images with different gaze directions from the same sequence. We manually exclude bad shots, where the person is blinking or is not changing gaze direction monotonically as anticipated. Most of the experiments were done on a dataset of 33 persons and 98 sequences. Unless noted otherwise, we train the model for vertical gaze redirection in the range between \( -30^\circ \) and \( 30^\circ \).

Fig. 4.

Left – dataset collection process. Right – examples of two training pairs (input image with superimposed feature points on top, output image at the bottom).

4.2 Training Procedure

The model was trained end-to-end on batches of size 128 using the Adam optimizer [14]. We used the regular \( \ell _2 \)-distance between the synthesized output \(O_\text {output}\) and the ground truth \(O_\text {gt}\) as the objective function. We tried to improve over this simple baseline in several ways. First, we tried to put emphasis on the actual eye region (not the rectangular bounding box) by adding more weight to the corresponding pixels, but were not able to obtain any significant improvements. Our earlier experiments with an adversarial loss [8] were also inconclusive. As the residual flow predicted by the \( 1\times \)-scale module tends to be quite noisy, we attempted to smooth the flow field by imposing a total variation penalty. Unfortunately, this resulted in a slightly worse \( \ell _2 \)-loss on the test set.

Sampling Training Pairs. We found that biasing the selection process towards more difficult and unusual head poses and larger redirection angles improved the results. For this reason, we used the following sampling scheme aimed at reducing the dataset imbalance. We split all possible correction angles (that is, the range between \( -30^\circ \) and \( 30^\circ \)) into 15 bins. The set of samples falling into a bin is further divided into “easy” and “hard” subsets depending on the input’s tilt angle (the angle between the segment connecting the two most distant eye feature points and the horizontal baseline). A sample is considered “hard” if its tilt is \( \geqslant 8^\circ \). This subdivision helps to identify training pairs corresponding to rare head poses. We form a training batch by picking 4 correction angle bins uniformly at random and sampling 24 “easy” and 8 “hard” examples from each of the chosen bins.
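A sketch of this sampling scheme (plain Python; the data layout as a list of records with hypothetical `angle` and `tilt` fields is an assumption):

```python
import random


def sample_batch(examples, n_bins=15, bins_per_batch=4,
                 n_easy=24, n_hard=8, hard_tilt_deg=8.0):
    """Split correction angles in [-30, 30] degrees into 15 bins, mark a pair
    as 'hard' if the eye tilt is >= 8 degrees, then draw 4 bins and take
    24 easy + 8 hard pairs from each: 4 * (24 + 8) = 128, the batch size."""
    bins = [[] for _ in range(n_bins)]
    for ex in examples:
        idx = min(int((ex["angle"] + 30.0) / 60.0 * n_bins), n_bins - 1)
        bins[idx].append(ex)

    batch = []
    for b in random.sample(range(n_bins), bins_per_batch):
        easy = [ex for ex in bins[b] if abs(ex["tilt"]) < hard_tilt_deg]
        hard = [ex for ex in bins[b] if abs(ex["tilt"]) >= hard_tilt_deg]
        batch += random.choices(easy or bins[b], k=n_easy)  # with replacement
        batch += random.choices(hard or bins[b], k=n_hard)
    return batch
```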

4.3 Quantitative Evaluation

We evaluate our approach on our dataset. We randomly split the initial set of subjects into a development set (26 persons) and a test set (7 persons). Several methods were compared using the mean squared error (MSE) between the synthesized and the ground-truth images extracted using the procedure described in Sect. 3.1.

Models. We consider 6 different models:

  1. A system based on Structured Random Forests (RF) proposed in [16]. We train it for \(15^\circ \) redirection only using the reference implementation.

  2. A single-scale (SS, \( 15^\circ \) only) version of our method with a single warping module operating on the original image scale, trained for \( 15^\circ \) redirection only.

  3. A single-scale (SS) version of our method with a single warping module operating on the original image scale.

  4. A multi-scale (MS) network without coarse warping. It processes inputs at two scales and uses features from both scales to predict the final warping transformation.

  5. The coarse-to-fine warping-based system described in Sect. 3 (CFW).

  6. The coarse-to-fine warping-based system with a lightness correction module (CFW + LCM).

The latter four models are trained for the task of vertical gaze redirection in the range between \( -30^\circ \) and \( 30^\circ \). We call such models unified (as opposed to single-angle correction systems).

\( 15^\circ \) Correction. In order to have a common ground with existing systems, we first restrict ourselves to the case of \( 15^\circ \) gaze correction. Following [16], we present a graph of sorted normalized errors (Fig. 5), where each error is divided by the MSE between the input image and the ground truth, and the errors on the test set are then sorted for each model.
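Under this reading of the normalization (each test error divided by the MSE of leaving the input unchanged, which is our assumption about the convention), the curve of Fig. 5 can be computed as follows (a NumPy sketch):

```python
import numpy as np


def sorted_normalized_errors(outputs, inputs, targets):
    """Per-sample MSE of the model divided by the per-sample MSE obtained
    when the input image itself is used as the prediction, sorted ascending.
    All arrays: (N, H, W, 3) with values in [0, 1]."""
    model_mse = ((outputs - targets) ** 2).mean(axis=(1, 2, 3))
    input_mse = ((inputs - targets) ** 2).mean(axis=(1, 2, 3))
    return np.sort(model_mse / input_mse)
```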

Fig. 5.

Ordered errors for \( 15^\circ \) redirection. Our multi-scale models (MS, CFW, CFW + LCM) show results that are comparable or superior to the Random Forests (RF) approach [16].

It can be seen that the unified multi-scale models are, in general, comparable or superior to the RF-based approach of [16]. Interestingly, the lightness adjustment extension (Sect. 3.4) shows quite significant improvements for the samples with low MSE; those are mostly cases similar to the ones shown in Fig. 3. It is also worth noting that the single-scale model trained for this specific correction angle consistently outperforms [16], demonstrating the power of the proposed architecture. However, we note that the results of the methods can be improved using an additional registration procedure, one example of which is described in Sect. 4.5.

Arbitrary Vertical Redirection. We also compare the different variants of the unified networks and plot the error distribution over different redirection angles (Fig. 6). For small angles, all the methods demonstrate roughly the same performance, but as the amount of correction increases, the task becomes much harder (which is reflected by the growing error), revealing the differences between the models. Again, the best results are achieved by the palette model (CFW + LCM), which is followed by the multi-scale networks making use of coarse warping.

Fig. 6.

Distribution of errors over different correction angles.

Fig. 7.

Sample results on the hold-out set. The full version of our model (CFW + LCM) outperforms the other methods.

4.4 Perceptual Quality

We demonstrate the results of redirection by \(15^\circ \) upwards in Fig. 7. The CFW-based systems produce results visually closer to the ground truth than RF. The effect of the lightness correction is pronounced: on the input image lacking eye-white, Random Forests and CFW fail to produce an output with sufficient eye-white and copy-paste red pixels instead, whereas CFW + LCM achieves a good correspondence with the ground truth. A downside of the LCM, however, can be blurring/lower contrast because of the blending procedure (6).

User Study. To confirm the improvements corresponding to different aspects of the proposed models, which may not be adequately reflected by the \( \ell _2 \)-measure, we performed an informal user study enrolling 16 subjects unrelated to computer vision and comparing four methods (RF, SS, CFW, CFW + LCM). Each user was shown 160 quadruplets of images; in each quadruplet, one of the images was obtained by re-synthesis with one of the methods, while the remaining three were unprocessed real images of eyes. Forty randomly sampled results from each of the compared methods were thus embedded. When a quadruplet was shown, the task of the subject was to click on the artificial (re-synthesized) image as quickly as possible. For each method, we then recorded the number of correct guesses out of 40 (for an ideal method the expected number would be 10, and for a very poor one it would be 40). We also recorded the time the subject took to decide on each quadruplet (a better method should take longer to spot). Table 1 shows the results of the experiment. Notably, the gap between the methods is much wider than it might seem from the MSE-based comparisons, with the CFW + LCM method outperforming the others very considerably, especially when the timings are taken into account.

Table 1. User assessment for the photorealism of the results for the four methods. During the session, each of the 16 test subjects observed 40 instances of results of each method embedded within 3 real eye images. The participants were asked to click on the resynthesized image in as little time as they could. The first three parts of the table specify the number of correct guesses (the smaller the better). The last line indicates the mean time needed to make a guess (the larger the better). Our full system (coarse-to-fine warping and lightness correction) dominated the performance.

Horizontal Redirection. While most of our experiments concern vertical gaze redirection, the same models can be trained to redirect the gaze horizontally (and, with a trivial generalization, over a 2D family of angles). In Fig. 8, we provide qualitative results of CFW + LCM for horizontal redirection. Some examples showing the limitations of our method are also given. The limitations concern cases with severe disocclusions, where large areas have to be filled by the network.

We provide more qualitative results on the project webpage [4].

4.5 Incorporating Registration

We found that the results can be further perceptually improved (see [4]) if the objective is slightly modified to take into account the misalignment between the input and ground-truth images. To that end, we enlarge the bounding box \( \mathcal {B} \) that we use to extract the output image of a training pair by \( k = 3 \) pixels in all directions. Given that \( O_{\text {gt}} \) now has the size \( (H + 2k) \times (W + 2k) \), the new objective is defined as:

$$\begin{aligned} \mathcal {L}(O_{\text {output}}, O_{\text {gt}}) = \min _{i, j} \text{ dist } \left( O_{\text {output}}, O_{\text {gt}}[i : i + H, j : j + W] \right) , \end{aligned}$$
(7)

where \( \text {dist}(\cdot ) \) can be either the \( \ell _2 \)- or the \( \ell _1 \)-distance (the latter giving slightly sharper results), and \( O_{\text {gt}}[i : i + H, j : j + W] \) corresponds to an \( H \times W \) crop of \( O_{\text {gt}} \) with the top-left corner at position \( (i, j) \). Being an alternative to the offline registration of input/ground-truth pairs [16], which is computationally prohibitive in large-scale scenarios, this small trick greatly increases the robustness of the training procedure against small misalignments in the training set.
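A sketch of this registration-tolerant objective (7) in PyTorch (the exhaustive search over the \((2k+1)^2\) shifts is the straightforward implementation; the actual one may differ in details):

```python
import torch


def registration_tolerant_loss(output, gt_enlarged, k=3, dist="l1"):
    """Eq. (7): compare the H x W prediction with every H x W crop of the
    ground truth enlarged by k pixels on each side; take the per-sample
    minimum over shifts. `output`: (B, 3, H, W);
    `gt_enlarged`: (B, 3, H + 2k, W + 2k)."""
    b, c, h, w = output.shape
    losses = []
    for i in range(2 * k + 1):
        for j in range(2 * k + 1):
            diff = output - gt_enlarged[:, :, i:i + h, j:j + w]
            losses.append(diff.abs().mean(dim=(1, 2, 3)) if dist == "l1"
                          else (diff ** 2).mean(dim=(1, 2, 3)))
    # Per-sample minimum over all shifts, averaged over the batch.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()
```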

Fig. 8.

Horizontal redirection with a model trained for both vertical and horizontal gaze redirection. For the first six rows, the angle varies from \( -15^\circ \) to \( 15^\circ \) relative to the central (input) image. The last two rows push the redirection to extreme angles (up to \(45^\circ \)), breaking our model down.

5 Discussion

We have suggested a method for realistic gaze redirection, allowing the gaze to be changed continuously within a certain range. At the core of our approach is the prediction of a warping field using a deep convolutional network. We embed the redirection angle and the feature points as image-sized maps and suggest a “fully-convolutional” coarse-to-fine architecture of warping modules. In addition to warping, photorealism is increased by a lightness correction module. Quantitative comparison of the MSE error, qualitative examples, and a user study show the advantage of the suggested techniques and the benefit of their combination within an end-to-end learnable framework.

Our system is reasonably robust against different head poses (e.g., see Fig. 3) and deals correctly with the situations where a person wears glasses (see [4]). Most of the failure modes (e.g., corresponding to extremely tilted head poses or large redirection angles involving disocclusion of the different parts of an eye) are not inherent to the model design and can be addressed by augmenting the training data with appropriate examples.

We have concentrated on gaze redirection, although our approach might be extended to other similar tasks, e.g. the re-synthesis of faces. In contrast with autoencoder-based approaches, our architecture does not compress the data to a representation with a lower explicit or implicit dimension, but directly transforms the input image. Our method thus might be better suited for fine detail preservation, and less prone to the “regression-to-mean” effect.

The computational performance of our method is up to 20 fps on a mid-range consumer GPU (NVIDIA GeForce 750M), which is, however, slower than the competing method of [16], which achieves a similar speed on a CPU. Our models are, however, much more compact than the forests of [16] (250 Kb vs 30–60 Mb in our comparisons), while also being universal. We are currently working on the unification of the two approaches.

Speed optimization of the proposed system is another topic for future work. Finally, we plan to further investigate non-standard loss functions for our architectures (e.g. the one proposed in Sect. 4.5), as the \( \ell _2 \)-loss is not closely enough related to perceptual quality of results (as highlighted by our user study).