1 Introduction

Learning to infer the 3D shape of complex objects given only a few images is one of the grand challenges of computer vision. One of the many benefits of deep learning has been a resurgence of interest in this task. Many recent works have developed the idea of inferring 3D shape given a set of classes (e.g. cars, chairs, rooms). This modern treatment of class-based reconstruction follows on from the classic pre-deep-learning work of Blanz and Vetter (1999) for faces, and later work on other semantic categories (Kar et al. 2015; Cashman and Fitzgibbon 2013) or cuboidal room structures (Fouhey 2015; Hedau et al. 2009).

Fig. 1

An overview of SiDeNet. First, images of an object are taken at various viewpoints \(\theta _1 \cdots \theta _N\) by rotating the object about the vertical axis. Given a set of these views (the number of which may vary at test time), SiDeNet predicts the depth of the sculpture at the given views and the silhouette at a new view \(\theta '\). Here, renderings of the predicted depth at two of the given views and silhouette predictions at new viewpoints are visualised. The depth predictions are rendered using the depth value for the colour (e.g. dark red is further away and yellow/white nearer) (Color figure online)

This work extends this area in two directions: first, it considers 3D shape inference from multiple images rather than a single one (though this is considered as well); second, it considers the quite generic class of piecewise smooth textured sculptures and the associated challenges.

To achieve this, a deep learning architecture is introduced which can take into account a variable number of views in order to predict depth for the given views and the silhouette at a new view (see Fig. 1 for an overview). This approach has a number of benefits. First, the network learns how to combine the given views (it is an architectural solution) without using multi-view stereo. As a result, the views need not be photometrically consistent. This is useful if the views exhibit changes in exposure/lighting/texture or are taken in different contexts (e.g. the object may be damaged in one). By enforcing that the same network must be able to predict 3D from single and multiple views, the network must be able to infer 3D shape using global information from one view and combine this information given multiple views; this is a different approach from building up depth locally using correspondences, as would be done in a traditional multi-view stereo approach.

Second, using a view-dependent representation means that the model makes few assumptions about the distribution of input shapes or their orientation. This is especially beneficial if there is no canonical frame or natural orientation over the input objects (e.g. a chair facing front and upright is at \(0^{\circ }\)). This generalisation power is demonstrated by training/evaluating SiDeNet on a dataset of sculptures which have a wide variety of shapes and textures. SiDeNet generalises to new unseen shapes without requiring any changes.

Finally, as only image representations are used, the quality of the 3D model is not limited by the 3D resolution of a voxel grid or a finite set of points but by the image resolution.

Contributions This work brings the following contributions. First, a fully convolutional architecture and loss function, termed SiDeNet (Sects. 3, 4), is introduced for understanding 3D shape. It can incorporate additional views at test time, and the predictions improve as additional views are incorporated, both when using 2D convolutions to predict depth/silhouettes and when using 3D convolutions to latently infer the 3D shape. Further, unlike many contemporary methods, this holds without assuming that the objects have a canonical representation. Second, a dataset of complex sculptures, which is augmented in 3D, is introduced (Sect. 5). This dataset demonstrates that the learned 3D representation is sufficient for silhouette prediction as well as new view synthesis for a set of unseen objects with complex shapes and textures. Third, a thorough evaluation demonstrates how incorporating additional views improves results and the benefits of the data augmentation scheme (Sect. 6), as well as that SiDeNet can be used directly on real images. This evaluation also demonstrates how SiDeNet can incorporate multiple views without requiring photometric consistency, and that SiDeNet is competitive with or better than comparable state-of-the-art methods for 3D prediction and at leveraging multiple views on both the Sculptures and ShapeNet datasets. Finally, the architecture is investigated to determine how information is encoded and aggregated across views in Sect. 8.

This work is an extension of that described in Wiles and Zisserman (2017). The original architecture is referred to as SilNet, and the improved architecture (the subject of this work) as SiDeNet. SilNet learns about the visual hull of the object: it is trained on low-resolution images to predict a low-resolution silhouette of the object. SiDeNet improves on this in three ways. The loss function is improved by adding an additional term for depth, which enforces that the network learns to predict concavities on the 3D shape (Sect. 3). The architecture is improved by increasing the resolution of the input and predicted images (Sect. 4). The dataset acquisition phase is improved by adding data augmentation in 3D (Sect. 5). These changes are analysed in Sect. 6.

2 Related Work

Inferring 3D shape from one or more images has a long history in computer vision. However, single- and multi-image approaches have largely taken divergent routes. Multi-image approaches typically enforce geometric constraints such that the estimated model satisfies the silhouette and photometric constraints imposed by the given views, whereas single-image approaches typically impose priors in order to constrain the problem. However, recent deep learning approaches have started to tackle these problems within the same model. This section is therefore divided into three areas: multi-image approaches without deep learning, single-image approaches without deep learning, and newer deep learning approaches which attempt to combine the two problems in one model.

2.1 Multi-image

Traditionally, given multiple images of an object, 3D can be estimated by tracking feature points across multiple views; these constraints are then used to infer the 3D at the feature points using structure-from-motion (SfM), as explained in Hartley and Zisserman (2004). Additional photometric and silhouette constraints can also be imposed on the estimated shape of the object. Silhouette-based approaches estimate the visual hull (introduced by Laurentini 1994) from a set of silhouettes with known camera positions, either in 3D using voxels (or another 3D representation) or in the image domain by interpolating between views (e.g. the work of Matusik et al. 2000). Other approaches improve on this by constructing the latent shape subject to silhouette as well as photometric constraints; they differ in how they represent the shape and how they enforce the geometric and photometric constraints (Boyer and Franco 2003; Kolev et al. 2009; Vogiatzis et al. 2003; see Seitz et al. 2006 for a thorough review). The limitation of these approaches is that they require multiple views of the object at test time in order to impose constraints on the generated shape, and they cannot extrapolate to unseen portions of the object.

2.2 Single Image

Given a single image, correspondences cannot be used to derive the 3D shape of the object. As a result, single-image approaches must impose priors in order to recover 3D information. The prior may be based on the class by modelling the deviation from a mean shape. This approach was introduced in the seminal work of Blanz and Vetter (1999). The class-based reconstruction approach has continued to be developed for semantic categories (Cashman and Fitzgibbon 2013; Prasad et al. 2010; Vicente et al. 2014; Xiang et al. 2014; Kar et al. 2015; Rock et al. 2015; Kong et al. 2017) and cuboidal room structures (Fouhey 2015; Hedau et al. 2009). Another direction is to use priors on shading, texture, or illumination to infer aspects of 3D shape (Zhang et al. 1999; Blake and Marinos 1990; Barron and Malik 2015; Witkin 1981).

2.3 Deep Learning Approaches

Newer deep learning approaches have traditionally built on the single-image philosophy of learning a prior distribution of shapes for a given object class. However, in these cases the distribution is implicitly learned for a specific object class from a single image using a neural network. These methods rely on a large number of images of a given object class that are usually synthetic. The distribution may be learned by predicting the corresponding 3D shape from a given image for a given object class using a voxel, point cloud, or surface representation (Girdhar et al. 2016; Wu et al. 2016; Fan et al. 2016; Sinha et al. 2017; Yan et al. 2016; Tulsiani et al. 2017; Rezende et al. 2016; Wu et al. 2017). These methods differ in whether they are supervised or use weak supervision (e.g. the silhouette or photometric consistency, as in Yan et al. 2016; Tulsiani et al. 2017). A second set of methods learn a latent representation by attempting to generate new views conditioned on a given view. This approach was introduced in the seminal work of Tatarchenko et al. (2016) and improved on by Zhou et al. (2016), Park et al. (2017).

While demonstrating impressive results, these deep learning methods are trained/evaluated on a single or small number of object classes and often do not consider the additional benefits of multiple views. The following approaches consider how to generalise to multiple views and/or the real domain.

The approaches that consider the multi-view case are the following. Choy et al. (2016) use a recurrent neural network on the predicted voxels given a sequence of images to reconstruct the model. Kar et al. (2017) use the known camera position to impose geometric constraints on how the views are combined in the voxel representation. Finally, Soltani et al. (2017) pre-determine a fixed set of viewpoints of the object and then train a network for silhouette/depth from these known viewpoints. However, changing any of the input viewpoints or output viewpoints would require training a new network.

More recent approaches such as the works of Zhu et al. (2017), Wu et al. (2017) have attempted to fine-tune the model trained on synthetic data on real images using the silhouette or another constraint, but they only extend to semantic classes that have been seen in the synthetic data. Novotny et al. (2017) directly learn on real data using 3D reconstructions generated by a SfM pipeline. However, they require many views of the same object and enough correspondences at train time in order to make use of the SfM pipeline.

This paper improves on previous work in three ways. First, an image-based approach is used for predicting the silhouette and depth, thereby enforcing that the latent model learns about 3D shape without having to explicitly model the full 3D shape. Second, our method of combining multiple views using a latent embedding acts globally as opposed to locally (e.g. Choy et al. 2016 combine information for subsets of voxels and Kar et al. 2017 combine information along projection rays). Additionally, our method does not require photometric consistency or geometric modelling of the camera movement and intrinsic parameters; it is an architectural solution. In spirit, our method of combining multiple views is more similar to multi-view classification/recognition architectures such as the works of Su et al. (2015), Qi et al. (2016). Third, a new Sculptures dataset is curated from SketchFab (2018) which exhibits a wide variety of shapes from many semantic classes. Many contemporary methods train/test on ShapeNet core, which contains a set of semantic classes. Training on class-specific datasets raises the question: to what extent have these architectures actually learnt about shape, and how well will they generalise to unseen objects that vary widely from the given class (as an extreme example, how accurately would these models reconstruct a tree when trained on beds and bookcases)? We investigate this on the Sculptures dataset.

3 Silhouette and Depth: A Multi-task Loss

The loss function used enforces two principles: first that the network learns about the visual hull, and second that it learns to predict the surface (and thus also concavities) at the given view. This is done by predicting, for a given image (or set of images), the silhouette in a new view and the depth at the given views. We expand on these two points in the following.

3.1 Silhouette

The first task considered is how to predict the silhouette at a new view given a set of views of an object. The network can do well at this task only if it has learned about the 3D shape of the object. To predict the silhouette at a new angle \(\theta '\), the network must at least encode the visual hull (the visual hull is the volume swept out by the intersection of the back-projected silhouettes of an object as the viewpoint varies). Using a silhouette image has desirable properties: first, it is a 2D representation and so is limited by the 2D image size (e.g. as opposed to the size of a 3D voxel grid). Second, pixel intensities do not have to be modelled.

3.2 Depth

However, using the silhouette and thereby enforcing the visual hull has the limitation that the network is not forced to predict concavities on the object, as they never appear on the visual hull. The proposed solution to this is to use a multi-task approach. Instead of having the learned representation describe only the silhouette in the new view, the representation must learn additionally to predict the depth of the object in the given views. This enforces that the representation must have a richer understanding of the object, as it must model the concavities on the object as opposed to just the visual hull (which using a silhouette loss imposes). Using a depth image is also a 2D representation, so as with using an image for the silhouette, it is limited by the 2D image size.

4 Implementation

In order to actually implement the proposed approach, the problem is formulated as described in Sects. 4.1 and 4.2 and a fully convolutional CNN architecture is used, as described in Sect. 4.3.

4.1 Loss Function

The loss function is implemented as follows. Given a set of images with their corresponding viewpoints \((I_1, \theta _1), \ldots , (I_N, \theta _N)\), a representation x is learned such that x can be used not only to predict the depth in the given views \(d_1, \ldots , d_N\) but also to predict the silhouette S at a new viewpoint \(\theta '\). Moreover, the number of input views N should be changeable at test time such that as N increases, the predictions \(d_1, \ldots , d_N, S\) improve.

To do this, the images and their corresponding viewpoints are first encoded using a convolutional encoder f to give a latent representation \(fv_i\). The same encoder is used for all viewpoints giving \(f(I_1, \theta _1), \ldots , f(I_N, \theta _N) = fv_1, \ldots , fv_N\). These are then combined to give the latent view-dependent representation x. x is then decoded using a convolutional decoder \(h_{sil}\) conditioned on the new viewpoint \(\theta '\) to predict the silhouette S in the new view. Optionally, x is also decoded via another convolutional decoder \(h_{depth}\), which is conditioned on the given image and viewpoints to predict the depth at the given viewpoints—\(d_i = h_{depth}(x, I_i, \theta _i)\). Finally, the binary cross entropy loss is used to compare S to the ground truth \(S^{{gt}}\) and the \({L}_1\) loss to compare \(d_i\) to the ground truth \(d_{i_{gt}}\).
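As a concrete illustration of this formulation, the following PyTorch sketch shows the flow from per-view encodings to the combined representation x and the two decoders. The module boundaries and tensor shapes are assumptions for illustration, not the exact SiDeNet layers (which are given in Table 10).

```python
# Minimal PyTorch sketch of the formulation above; the encoder/decoder modules
# are placeholders, not the exact SiDeNet layers.
import torch
import torch.nn as nn

class SiDeNetSketch(nn.Module):
    def __init__(self, encoder, sil_decoder, depth_decoder):
        super().__init__()
        self.encoder = encoder              # f: (I_i, theta_i) -> fv_i
        self.sil_decoder = sil_decoder      # h_sil: (x, theta') -> S
        self.depth_decoder = depth_decoder  # h_depth: (x, I_i, theta_i) -> d_i

    def forward(self, images, thetas, theta_new):
        # images: list of N tensors (B, 3, H, W); thetas: list of N tensors (B, 2)
        # holding [sin(theta_i), cos(theta_i)]; theta_new: (B, 2) for the new view.
        fvs = [self.encoder(I, t) for I, t in zip(images, thetas)]
        x = torch.stack(fvs, dim=0).max(dim=0).values  # combine views; size independent of N
        sil = self.sil_decoder(x, theta_new)            # silhouette S at the new view
        depths = [self.depth_decoder(x, I, t) for I, t in zip(images, thetas)]
        return depths, sil
```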

4.2 Improved Loss Functions

Implementing the loss functions naively as described in Sect. 4.1 is problematic. First, the depth being predicted is the absolute depth, which means the model must guess the absolute position of the object in the scene. This is inherently ambiguous. Second, the silhouette prediction decoder struggles to model the finer detail on the silhouette, instead focusing on the middle of the object, which is usually filled.

As a result, both losses are modified. For the depth prediction, the means of both the ground-truth and predicted depth are shifted to 0.

The silhouette loss at a given pixel (i, j) is weighted by \(w_{i,j}\), based on the Euclidean distance from that pixel to the silhouette (denoted \(\text {dist}_{i,j}\)):

$$w_{i,j} = \begin{cases} \text {dist}_{i,j}, &\quad \text {if } \text {dist}_{i,j} \le T \\ c, &\quad \text {otherwise.} \end{cases}$$
(1)

In practice, \(T=20\) and \(c=5\). The rationale for the fall-off when \(\text {dist}_{i,j} > T\) is that most of the objects are centred and have few holes, so modelling the pixels far from the silhouette is easy. Using the fall-off incentivises SiDeNet to correctly model the pixels near the silhouette. Weighting based on the distance to the silhouette also models the fact that it is ambiguous whether pixels on the silhouette boundary are part of the background or foreground.
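As a sketch, the weight map of Eq. (1) can be built from the binary ground-truth silhouette with a Euclidean distance transform; the use of scipy and measuring the distance to the silhouette boundary (from both inside and outside) are assumptions about implementation details.

```python
# Sketch of the per-pixel weights of Eq. (1) from a binary silhouette mask,
# assuming the distance is measured to the silhouette boundary.
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_weights(sil_gt, T=20.0, c=5.0):
    # sil_gt: (H, W) binary array, 1 = foreground.
    # Distance of every pixel to the nearest boundary pixel: inside pixels use the
    # distance to the background, outside pixels the distance to the foreground.
    dist = np.where(sil_gt > 0,
                    distance_transform_edt(sil_gt),
                    distance_transform_edt(1 - sil_gt))
    return np.where(dist <= T, dist, c)
```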

Fig. 2

A diagrammatic explanation of the multi-task loss function used. Given the input images, the images are combined to give a feature vector x which is used by both decoders (denoted in green—depth—and orange—silhouette) to generate the depth predictions for the given views and the silhouette prediction in a new view (Color figure online)

In summary, the complete loss functions are

$$\mathcal {L}_{sil} = -\sum _{i,j} w_{i,j} \left( S^{gt}_{i,j} \log (S_{i,j}) + \left( 1 - S^{gt}_{i,j}\right) \log (1 - S_{i,j})\right) ;$$
(2)
$$\mathcal {L}_{depth} = \sum _{i=1}^N \left| d_i - d_{i_{gt}} \right|_1.$$
(3)

The loss function is visualised in Fig. 2. Note that in this example the network’s prediction exhibits a concavity in the groove of the sculpture’s folded arms.
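A sketch of how Eqs. (2) and (3), together with the modifications of Sect. 4.2, might be computed in PyTorch; the tensor shapes, the clamping epsilon, and computing the depth means over whole images (rather than, say, foreground pixels only) are assumptions.

```python
# Sketch of the multi-task loss, Eqs. (2) and (3), with the mean-centred depth
# of Sect. 4.2; shapes and the stabilising epsilon are assumptions.
import torch

def sidenet_loss(sil_pred, sil_gt, weights, depth_preds, depth_gts,
                 lambda_sil=1.0, lambda_depth=1.0, eps=1e-7):
    # Weighted binary cross entropy for the silhouette at the new view, Eq. (2).
    sil_pred = sil_pred.clamp(eps, 1.0 - eps)
    l_sil = -(weights * (sil_gt * torch.log(sil_pred)
                         + (1.0 - sil_gt) * torch.log(1.0 - sil_pred))).sum()

    # L1 depth loss over the given views, Eq. (3), with means shifted to zero.
    l_depth = 0.0
    for d, d_gt in zip(depth_preds, depth_gts):
        l_depth = l_depth + ((d - d.mean()) - (d_gt - d_gt.mean())).abs().sum()

    return lambda_sil * l_sil + lambda_depth * l_depth
```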

4.3 Architecture

This section describes the various components of SiDeNet, which are visualised in Fig. 3 and described in detail in Table 10. This architecture takes as input a set of images of size \(256\times 256\) and corresponding viewpoints (encoded as \([\sin \theta _i,\cos \theta _i]\) so that \(0^{\circ }, 360^{\circ }\) map to the same value) and generates depth and silhouette images at a resolution of size \(256\times 256\). SiDeNet takes the input image viewpoints as additional inputs because there is no implicit coordinate frame that is true for all objects. For example, a bust may be oriented along the z-axis for one object and the x-axis for another and there is no natural mapping from a bust to a sword. Explicitly modelling the coordinate frame using the input/output viewpoints removes these ambiguities.

SiDeNet is modified to produce a latent 3D representation in SiDeNet3D, which is visualised in Fig. 4 and described in Sect. 4.4. This architecture is useful for two reasons. First, it demonstrates that the method of combining multiple views is useful in this scenario as well. Second, it is used to evaluate whether the image representation does indeed allow for more accurate predictions, as the 3D representation necessitates using fewer convolutional transposes and so generates a smaller \(57\times 57\) silhouette image.

Fig. 3

A diagrammatic overview of the architecture used in SiDeNet. Weights are shared across encoders and decoders (e.g. portions of the architecture having the same colour indicate shared weights). The blue, orange, and purple arrows denote concatenation. The input angles \(\theta _1 \cdots \theta _N\) are broadcast over the feature channels as illustrated by the orange arrows. The feature vectors are combined to form x (indicated by the yellow block and arrows). This value is then used to predict the depth at the given views \(\theta _1 \cdots \theta _N\) and the silhouette at a new view \(\theta '\). The size of x is invariant to the number of input views N, so an extra view \(\theta _i\) can be added at test time without any increase in the number of parameters. Please see Table 10 for the precise details (Color figure online)

Encoder The encoder f takes the given image \(I_i\) and viewpoint \(\theta _i\) and encodes them to a latent representation \(fv_i\). For all architectures, this is implemented using a convolutional encoder, which is illustrated in Fig. 3. The layer parameters and design are based on the encoder portion of the pix2pix architecture by Isola et al. (2017), which is in turn based on the UNet architecture of Ronneberger et al. (2015).

Combination function To combine the feature vectors of each encoder, any function that satisfies the following property could be considered: given a set of feature vectors \(fv_i\), the combination function should combine them into a single latent vector x such that for any number of feature vectors, x always has the same number of elements. In particular, an element-wise max pool over the feature vectors and an element-wise average pool are considered. This vector x must encode properties of 3D shape useful for both depth prediction and silhouette prediction in a new view.
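Both choices can be written as one pooling function over a variable-length list of per-view features, consistent with the sketch in Sect. 4.1; the shapes are assumptions.

```python
# Sketch of the combination function: an element-wise pool over a variable
# number of per-view feature vectors, giving an x of fixed size.
import torch

def combine(feature_vectors, mode="max"):
    # feature_vectors: list of N tensors, each (B, C) or (B, C, H, W).
    stacked = torch.stack(feature_vectors, dim=0)   # (N, B, ...)
    if mode == "max":
        return stacked.max(dim=0).values            # element-wise max pool
    return stacked.mean(dim=0)                      # element-wise average pool
```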

Decoder (depth) The depth branch predicts the depth of a given image using skip connections (taken from the corresponding input branch) to propagate the finer details. The exact filter sizes are modelled on the pix2pix and UNet networks.

Decoder (silhouette) The silhouette branch predicts the silhouette of a given image at a new viewpoint \(\theta '\). The layers are the same as the decoder (depth) branch without the skip connections (as there is no corresponding input view).

4.4 3D Decoder

For SiDeNet3D, the silhouette decoder is modified to generate a latent 3D representation encoded using a voxel occupancy grid. Using a projection layer, this grid is projected to 2D, which allows the silhouette loss to be used to train the network in an end-to-end manner. This is done as follows. First, the decoder is implemented as a sequence of 3D convolutional transposes which generate a voxel grid of size \(V = 57\times 57\times 57\) (please refer to appendix A.1 for the precise details). This grid is then transformed to the desired output viewpoint \(\theta '\) to give \(V'\) using a nearest neighbour sampler, as described by Jaderberg et al. (2015). \(V'\) is projected to generate the silhouette in the new view using the \(\max \) function. As the \(\max \) function is differentiable, the silhouette loss can be back-propagated through this layer and the entire network trained end-to-end.
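A hedged sketch of such a rotate-and-project layer: the voxel grid is resampled at a rotated grid (nearest neighbour) and then max-projected along the viewing direction. The axis conventions and the use of torch's affine_grid/grid_sample are assumptions, not the exact SiDeNet3D implementation.

```python
# Sketch of a differentiable rotate-and-project layer for SiDeNet3D.
# Axis conventions and the use of grid_sample are assumptions.
import torch
import torch.nn.functional as F

def project_silhouette(voxels, theta):
    # voxels: (B, 1, D, H, W) occupancy in [0, 1]; theta: (B,) rotation in radians.
    cos, sin = torch.cos(theta), torch.sin(theta)
    zeros, ones = torch.zeros_like(theta), torch.ones_like(theta)
    # 3D affine matrices rotating the sampling grid about the vertical axis.
    rot = torch.stack([
        torch.stack([cos,  zeros, sin,  zeros], dim=1),
        torch.stack([zeros, ones, zeros, zeros], dim=1),
        torch.stack([-sin, zeros, cos,  zeros], dim=1),
    ], dim=1)                                            # (B, 3, 4)
    grid = F.affine_grid(rot, voxels.shape, align_corners=False)
    rotated = F.grid_sample(voxels, grid, mode='nearest', align_corners=False)
    # Max over the depth dimension gives the silhouette in the new view; the max
    # is differentiable (almost everywhere), so the loss can be back-propagated.
    return rotated.max(dim=2).values                     # (B, 1, H, W)
```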

Fig. 4

A diagrammatic overview of the projection in SiDeNet3D. A set of 3D convolutional transposes up-sample from the combined feature vector x to generate the \(57\times 57\times 57\) voxel (V). This is then projected using a max-operation over each pixel location to generate the silhouette in a new view. Please see Table 10 for a thorough description of the three different architectures

The idea of using a differentiable projection layer was also considered by Yan et al. (2016), Tulsiani et al. (2017), Gadelha et al. (2016), Rezende et al. (2016). However, unlike these works, our approach can additionally incorporate extra views at test time.

Table 1 Overview of the datasets. Gives the number of sculptures in the train/val/test set as well as the number of views per object
Fig. 5

Sample renderings of the three different datasets. Zoom in for more details. Best viewed in colour. a SketchFab dataset. Two sample renderings of seven objects. The first three fall into the train set, the rest into the test set. b SynthSculpture dataset. Sample renderings of eight objects. These samples demonstrate the variety of objects, e.g. toys, animals, etc. c ShapeNet. Seven sample renderings of the chair subset (Color figure online)

5 Dataset

Three datasets are used in this work: a large dataset of scanned sculptures downloaded from SketchFab (2018), an additional set of synthetic sculptures (SynthSculptures), and a subset of the synthetic ShapeNet objects (Chang et al. 2015). An overview of the datasets is given in Table 1. Note that unlike our dataset, ShapeNet consists of object categories for which one can impose a canonical view (e.g. that \(0^{\circ }\) corresponds to a chair facing the viewer). This allows methods trained on this dataset to make use of rotations or transformations relative to the canonical view. However, for the sculpture dataset this property does not exist, necessitating the view-dependent representation used by SiDeNet.

Performing data augmentation in 3D is also investigated and shown to increase performance in Sect. 6.2.

5.1 Sculpture Datasets

SketchFab: sculptures from SketchFab A set of realistic sculptures is downloaded from SketchFab (the same sculptures as used in Wiles and Zisserman 2017, but with different renderings). These are accurate reconstructions of the original sculptures, generated by users using photogrammetry, and come with realistic textures. Some examples are given in Fig. 5a.

SynthSculptures This dataset includes an additional set of 77 sculptures downloaded from TurboSquid using the query sculpture. These objects vary in realism and come from a range of object classes: the meshes range from low-quality models that are clearly polygonised to high-quality, highly realistic ones, and the classes range from abstract sculptures to jewellery to animals. Some examples are given in Fig. 5b.

Rendering The sculptures and their associated material (if it exists) are rendered in Blender (Blender Online Community 2017). The sculptures are first resized to be within a uniform range (this is necessary for the depth prediction component of the model). Then, for each sculpture, five images are rendered from viewpoints chosen uniformly at random between \(0^{\circ }\) and \(120^{\circ }\) as the object is rotated about the vertical axis. Three light sources are added to the scene and translated randomly with each render. Some sample sculptures (and renders) for SketchFab and SynthSculptures are given in Fig. 5.

3D augmentation 3D data augmentation is used to augment the two sculpture datasets by modifying the dimensions and material of a given 3D model. The x, y, z dimensions of a model are each randomly scaled to between 0.5 and 1.4 times the original dimension. Then a material is randomly chosen from a set of standard Blender materials; these include varieties of wood, stone, and marble. Finally, the resulting model is rendered from five viewpoints exactly as described above. The whole process is repeated 20 times for each model. Some example renderings using data augmentation for a selection of models from SynthSculptures are illustrated in Fig. 6.
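For illustration, the augmentation loop might look like the following Blender (bpy) sketch; the material pool, output paths, and rotating the object itself (rather than the camera) are assumptions, not the authors' exact script.

```python
# Illustrative bpy sketch of the 3D augmentation loop; object handles, material
# pool, paths, and the viewpoint mechanism are placeholders.
import math
import random
import bpy

obj = bpy.context.active_object
materials = list(bpy.data.materials)      # pool of pre-loaded stock materials

for k in range(20):                       # 20 augmentations per model
    # Randomly scale each of the x, y, z dimensions to [0.5, 1.4] of the original.
    obj.scale = tuple(random.uniform(0.5, 1.4) for _ in range(3))
    # Swap in a randomly chosen material.
    if materials:
        obj.data.materials.clear()
        obj.data.materials.append(random.choice(materials))
    # Render five random viewpoints in [0, 120] degrees about the vertical axis.
    for v in range(5):
        obj.rotation_euler[2] = math.radians(random.uniform(0.0, 120.0))
        bpy.context.scene.render.filepath = f"//renders/aug_{k:02d}_{v}.png"
        bpy.ops.render.render(write_still=True)
```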

Dataset split The sculptures from SketchFab are divided at the sculpture level into train, val, test so that there are 372/20/33 sculptures respectively. All sculptures from SynthSculptures are used for training. For a given iteration during train/val/test, a sculpture is randomly chosen from which a subset of the 5 rendered views is selected.

Fig. 6

Seven sample augmentations of three models in the SynthSculpture dataset using the 3D augmentation setup described in Sect. 5.1. These samples demonstrate the variety of materials, sizes and viewpoints for a given 3D model using the 3D data augmentation method

5.2 ShapeNet

ShapeNet (Chang et al. 2015) is a dataset of synthetic objects divided into a set of semantic classes. To compare this work to that of Yan et al. (2016), their subdivision, train/val/test split and renderings of the ShapeNet chair subset are used. The synthetic objects are rendered under simple lighting conditions at fixed \(15^{\circ }\) intervals about the vertical axis, giving a total of 24 views per object. We additionally collect depth maps for each render using the extrinsic/intrinsic parameters of Yan et al. (2016). Some example renderings are given in Fig. 5c. Again, at train/val/test time an object is randomly chosen and a subset of its 24 renders selected.

6 Experiments

This section first evaluates the design choices: the utility of the data augmentation scheme is demonstrated in Sect. 6.2, the effect of the different architectures in Sect. 6.3, the multi-task loss in Sect. 6.4, and the effect of the choice of \(\theta '\) in Sect. 6.8. Second, it evaluates the method of combining multiple views: Sects. 6.5 and 6.6 demonstrate how increasing the number of views at test time improves performance on the Sculpture dataset irrespective of whether the input/output views are photometrically consistent. Section 6.7 demonstrates that the approach works on ShapeNet and Sect. 6.9 evaluates the approach in 3D. SiDeNet's ability to perform new view synthesis is exhibited in Sect. 7, as well as its generalisation capability to real images. Finally, the method by which SiDeNet encodes a joint embedding of shape and viewpoint is investigated in Sect. 8.

6.1 Training Setup

The networks are written in pytorch (Paszke et al. 2017) and trained with SGD with a learning rate of 0.001, momentum of 0.9 and a batch size of 16. They are trained until the loss on the validation set stops improving or for a maximum of 200 iterations, whichever happens first. The tradeoff between the two losses, \(\mathcal {L} = \lambda _{depth} \mathcal {L}_{depth} + \lambda _{sil} \mathcal {L}_{sil}\), is set such that \(\lambda _{depth} = 1\) and \(\lambda _{sil} = 1\).

6.1.1 Evaluation Measure

The evaluation measures used are the intersection over union (IoU) for the silhouette, the \(L_1\) error for the depth, and the chamfer distance when evaluating in 3D. The IoU for a given predicted silhouette S and ground truth silhouette \(\bar{S}\) is evaluated as \(\frac{\sum _{x,y}(I(S) \cap I(\bar{S}))}{\sum _{x,y}(I(S) \cup I(\bar{S}))}\), where I is an indicator function that equals 1 if the pixel is a foreground pixel and 0 otherwise. This is then averaged over all images to give the mean IoU.
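For reference, a numpy sketch of this mean IoU; thresholding the predicted silhouette at 0.5 to obtain foreground pixels is an assumption.

```python
# Sketch of the silhouette IoU measure; the 0.5 foreground threshold is an assumption.
import numpy as np

def mean_iou(pred_sils, gt_sils, threshold=0.5):
    # pred_sils, gt_sils: arrays of shape (num_images, H, W).
    ious = []
    for S, S_bar in zip(pred_sils, gt_sils):
        fg_pred = S > threshold
        fg_gt = S_bar > threshold
        intersection = np.logical_and(fg_pred, fg_gt).sum()
        union = np.logical_or(fg_pred, fg_gt).sum()
        ious.append(intersection / max(union, 1))
    return float(np.mean(ious))
```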

Table 2 Effect of data augmentation. This table demonstrates the utility of using 3D data augmentation to effectively enlarge the number of sculptures being trained with. SketchFab is always used and sometimes augmented (denoted by Augment). SynthSculpture is sometimes used (denoted by Used) and sometimes augmented. The models are evaluated on the test set of SketchFab. Lower is better for \(L_1\) and higher is better for IoU
Table 3 Ablation study of the different architectures, which vary in size and complexity. \(_{basic}\) refers to using the standard \(L_1\) and binary cross entropy loss without the improvements described in Sect. 4.2. The models are evaluated on the test set of SketchFab. Lower is better for \(L_1\) and higher is better for IoU. The sizes denote the size of the corresponding images (e.g. \(256\times 256\) corresponds to an output image of this resolution)

The \(L_1\) loss is simply the average over all foreground pixels: \(L_1 = \frac{1}{N} \sum _{p_x} | d_{p_x}^{pred} - d_{p_x}^{gt} |_1\) where \(p_x\) is a foreground pixel and N the number of foreground pixels. Note that the predicted and ground truth depth are first normalised by subtracting off the mean depth. This is then averaged over the batch. When there are multiple input views, the depth error is only computed for the first view, so the comparison across increasing numbers of views is valid.

The chamfer distance used is the symmetrised version. Given the ground truth point cloud g with N points and the predicted one p with M points, the error is \(CD = \frac{1}{N} \sum _{i=1}^N \min _j | g_i - p_j|^2 + \frac{1}{M} \sum _{i=1}^M \min _j | g_j - p_i|^2 \).
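A brute-force numpy sketch of this symmetric chamfer distance (adequate for the 2500-point clouds used in Sect. 6.9):

```python
# Brute-force sketch of the symmetric chamfer distance between two point clouds.
import numpy as np

def chamfer_distance(g, p):
    # g: (N, 3) ground-truth points; p: (M, 3) predicted points.
    d2 = ((g[:, None, :] - p[None, :, :]) ** 2).sum(axis=-1)   # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```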

6.1.2 Evaluation Setup

Unless otherwise stated, the results are for the max-pooling version of SiDeNet, with input/output view size \(256\times 256\), trained with 2 distinct views, data augmentation of both datasets (Sect. 6.2), \(\lambda _{depth} = 1\) and \(\lambda _{sil} = 1\), and the improved losses described in Sect. 4.2.

6.2 The Effect of the Data Augmentation

First, the effect of the 3D data augmentation scheme is considered. The results for four models trained with varying amounts of data augmentation (described in Sect. 5.1) are reported in Table 2 and demonstrate the benefit of using the 3D data augmentation scheme. (These models are trained with the non-improved losses.) Using only 2D augmentation was also tried but was not found to improve performance.

6.3 Ablation Study of the Different Architectures

This section compares the performance of \(\hbox {SiDeNet}_{57\times 57}\), SiDeNet3D, and SiDeNet on the silhouette/depth prediction tasks, as well as using average vs max-pooling. SiDeNet/SiDeNet3D are described in Sect. 4.3. \(\hbox {SiDeNet}_{57\times 57}\) modifies SiDeNet to generate a \(57\times 57\) silhouette (for the details of all architectures, please refer to “Appendix A.1”). This section additionally compares the simple version of the loss functions, described in Sect. 4.1, to the improved version described in Sect. 4.2. Finally, the performance of predicting the mean depth value is given as a baseline. See Table 3 for the results.

These results demonstrate that while the choice of pooling function makes little difference to the results, the improved loss functions do improve performance. Weighting the more difficult parts of the silhouette (e.g. around the boundary) more strongly encourages the model to learn a better representation.

Finally, \(\hbox {SiDeNet}_{57\times 57}\) does worse than SiDeNet on both the \(L_1\) depth error and the silhouette IoU. While the difference is small in this case, the benefit of using a larger image/representation becomes clear as more data is introduced and the predictions become more accurate. This is demonstrated by the chairs on ShapeNet in Sect. 6.7.

6.4 The Effect of Using \(\mathcal {L}_{depth}\) and \(\mathcal {L}_{sil}\)

Second, the effect of the individual components of the multi-task loss is considered. The multi-task loss enforces that the network learns a richer 3D representation; the network must predict concavities in order to perform well at predicting depth and it must learn about the visual hull of the object in order to predict silhouettes at new viewpoints. As demonstrated in Table 4, using the multi-task loss does not negatively affect the prediction accuracy as compared to predicting each component separately. This demonstrates that the model is able to represent both aspects of shape at the same time.

Some visual results are given in Figs. 11 and 12. Example (b) in Fig. 12 demonstrates how the model has learned to predict concavities, as it is able to predict grooves in the relief.

6.5 The Effect of Increasing the Number of Views

Next, the effect of increasing the number of input views is investigated with interesting results.

For SiDeNet, as with SilNet, increasing the number of views improves results over all error metrics in Table 5. Some qualitative results are given in Fig. 7. It is interesting to note that not only does the silhouette performance improve given additional input views but so does the depth evaluation metric. So incorporating additional views improves the depth prediction for a given view using only the latent vector x.

Table 4 Effect of the multi-task loss. This table demonstrates the effect of the multi-task loss. As can be seen, using both losses does not negatively affect the performance of either task. The models are evaluated on the test set of SketchFab. Lower is better for \(L_1\) and higher is better for IoU
Table 5 Effect of incorporating additional views at test time. This architecture was trained with one, two, or three views. These results demonstrate how additional views can be dynamically incorporated at test time and results on both depth and silhouette measures improve. The models are evaluated on the test set of SketchFab. Lower is better for \(L_1\) and higher is better for IoU

A second interesting point is that a model trained with more views can predict better than one trained with fewer views: e.g. training with three views and testing on one or two views does better than training on two views and testing on two, or training on one view and testing on one. It seems that when training with additional views and testing with a smaller number, the network can make use of information learned from the additional views. This demonstrates the generalisability of the SiDeNet architecture.

Fig. 7

(a–c) Qualitative results for increasing the number of input views on SiDeNet for three different sculptures. SiDeNet’s depth and silhouette predictions are visualised as the number of input views is increased. To the left are the input views, the centre gives the depth prediction for the first input view, and the right gives the predicted silhouette for each set of input views. The silhouette in the red box gives the ground truth silhouette. The scale on the side gives the error in depth—blue means the depth prediction is perfectly accurate and red that the prediction is off by 1 unit. (The depth error is clamped between 0 and 1 for visualisation purposes.) As can be seen, performance improves with additional views. This is most clearly seen for the ram in (c) (Color figure online)

6.6 The Effect of Non-photometrically Consistent Inputs

A major benefit of SiDeNet is that it does not require photometrically consistent views: provided the object has the same shape, the views may vary in lighting or material. While the sculpture renderings used already vary in lighting conditions across different views (Sect. 5), this section considers the extreme case: how does SiDeNet perform when the texture is modified across the input views? To perform this comparison, SiDeNet is tested on the sculpture dataset with a randomly chosen texture for each view (see Fig. 6 for some sample textures demonstrating the variety of the 20 textures). It is then tested again on the same test set but with the texture fixed across all input views. The results are reported in Table 6.

Surprisingly, with no additional training, SiDeNet performs nearly as well when the input/output views have randomly chosen textures. Moreover, performance improves given additional views. The network appears to have learned to combine input views with varying textures without being explicitly trained for this. This demonstrates a real benefit of SiDeNet over traditional approaches—the ability to combine multiple views of an object for shape prediction without requiring photometric consistency.

6.7 Comparison on ShapeNet

SiDeNet is compared to Perspective Transformer Nets by Yan et al. (2016) by training and testing on the chair subset of the ShapeNet dataset. The comparison demonstrates three benefits of our approach: the ability to incorporate multiple views, the benefit of our 3D data augmentation scheme, and the benefits of staying in 2D. This is done by comparing the accuracy of SiDeNet’s predicted silhouettes to those of Yan et al. (2016). Their model is trained with the intention of using it for 3D shape prediction, but we focus on the 2D case here to demonstrate that using an image representation means that, with the same data, we can achieve better prediction performance in the image domain, as we are not limited by the latent voxel resolution. To compare the generated silhouettes, their implementation of the IoU metric is used: \(\frac{\sum _{x,y} I(S_{x,y}) \times \bar{S}_{x,y}}{\sum _{x,y}\left[ (I(S_{x,y}) + \bar{S}_{x,y}) > 0.9\right] }\).

Multiple setups for SiDeNet are considered: fine-tuning from the model trained on the sculptures with data augmentation (e.g. both in Table 1), with/without the improved loss function and for multiple output sizes. To demonstrate the benefits of the SiDeNet architecture, SiDeNet is trained only with the silhouette loss, so both models are trained with the exact same information. The model from Yan et al. (2016) is fine-tuned from a model trained for multiple ShapeNet categories. The results are reported in Table 7.

Table 6 The effect of using non-photometrically consistent inputs. These results demonstrate that SiDeNet trained with views of an object with the same texture generalises at runtime to incorporating views of an object with differing textures. Additional views can be dynamically incorporated at test time and results on both depth and silhouette measures improve. The model is trained with 2 views. The models are evaluated on the test set of SketchFab. Lower is better for \(L_1\) and higher is better for IoU
Table 7 Comparison to Perspective Transformer Nets (PTNs) (Yan et al. 2016) on the silhouette prediction task on the chair subset of ShapeNet. Their model is first trained on multiple ShapeNet categories and fine-tuned on the chair subset. SiDeNet is optionally first trained on the Sculpture dataset or trained directly on the chair subset. As can be seen, SiDeNet outperforms PTN given one view and improves further given additional views. These results also demonstrate the utility of various components of SiDeNet: using a larger \(256\times 256\) image to train the silhouette prediction task and using the improved, weighted loss function. It is also interesting to note that pre-training with the complex sculpture class gives a small boost in performance (e.g. it generalises to this very different domain of chairs). The value reported is the mean IoU metric for the silhouette; higher is better

These results demonstrate the benefits of various components of SiDeNet, which outperforms Yan et al. (2016). First, using a 2D representation means a much larger silhouette image can be used to train the network. As a result, much better performance can be obtained (e.g. \(\hbox {SiDeNet}_{256\times 256_{\mathrm{basic}}}\) has much better performance than \(\hbox {SiDeNet}_{57\times 57_{\mathrm{basic}}}\)). Second, the improved, weighted loss function for the silhouette (Sect. 4.2) improves performance further. Third, fine-tuning a model trained with the 3D sculpture augmentation scheme gives an additional small boost in performance. Finally, using additional views improves results for all versions of SiDeNet. Some qualitative results are given in Fig. 10.

Fig. 8

The effect of varying the range of \(\theta '\) used at train time on the IoU error at test time (Color figure online)

6.8 The Effect of Varying \(\theta '\)

In order to see how well SiDeNet can extrapolate to new angles (and thereby how much it has learned about the visual hull), the following experiment is performed on ShapeNet. SiDeNet is first trained with various ranges of \(\theta ', \theta _i\). For example, if the range is \([15^{\circ } \cdots 120^{\circ }]\), then all randomly selected input angles \(\theta _i\) and \(\theta '\) are constrained to be within this range during training. At test time, a random chair is chosen and the silhouette IoU evaluated for each target viewpoint \(\theta '\) in the full range (e.g. \([15^{\circ } \cdots 360^{\circ }]\)), while the input angles \(\theta _i\) are still constrained to be in the constrained range (e.g. \([15^{\circ } \cdots 120^{\circ }]\)). This evaluates how well the model extrapolates to unseen viewpoints at test time and how well it has learned about shape. If the model were perfect, then there would be no performance degradation as \(\theta '\) moved out of the constrained range used to train the model. The results are given in Fig. 8. As can be seen (and would be expected), for various training ranges the performance degrades as a function of how much \(\theta '\) differs from the range used to train the model. The model is able to extrapolate outside of the training range, but the more the model must extrapolate, the worse the prediction.

Table 8 CD (\(\times 100\)) on the ShapeNet dataset. The models evaluated on depth predict a depth map which is back-projected to generate a 3D point cloud
Fig. 9

Comparison of multi-view methods on ShapeNet. Renderings of the given chair are given in the top row, followed by SiDeNet’s and Kar et al. (2017)’s predictions. For each chair, for each row, the point clouds from left to right show the ground truth followed by the predictions for one, two, three, and four views respectively. The colour denotes the z value. As can be seen SiDeNet’s predictions are higher quality than those of Kar et al. (2017) for these examples

Table 9 CD (\(\times 100\)) on the Sculptures dataset. The models evaluated on depth predict a depth map which is back-projected to generate a 3D point cloud. The models evaluated on 3D are compared using the explicitly or implicitly learned 3D

6.9 Comparison in 3D

We additionally evaluate SiDeNet’s 3D predictions and consider the two cases: using the depth maps predicted by SiDeNet and the voxels from SiDeNet3D.

SiDeNet The depth maps are compared to those predicted using the depth map version of Kar et al. (2017) in Table 8. This comparison is only done on ShapeNet, as for the Sculpture dataset we found it was necessary to subtract off the mean depth to predict high quality depth maps (Sect. 4.2); for ShapeNet there is less variation between the chairs, so this is not necessary. As a result, SiDeNet is trained with 2 views and the improved silhouette loss, but the depth predicted is the absolute depth. The comparison is performed as follows for both methods. For each chair in the test set, an initial view is chosen and the depth back-projected using the known extrinsic/intrinsic camera parameters. Then, for each additional view, the viewpoints are chosen by sampling evenly around the z-axis (e.g. if the first view is at \(15^{\circ }\), then two views would be at \(15^{\circ }, 195^{\circ }\) and three views at \(15^{\circ }, 195^{\circ }, 255^{\circ }\)) and the depth again back-projected to give a point cloud. 2500 points are randomly chosen from the predicted point cloud and aligned using ICP (Besl and McKay 1992) with the ground truth point cloud. This experiment evaluates the method of pooling information in the two methods and demonstrates that SiDeNet’s global method of combining information performs better than that of Kar et al. (2017), which combines information along projection rays. Some qualitative results are given in Fig. 9.
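A hedged sketch of the back-projection and alignment steps of this protocol; the pinhole intrinsics K and the use of Open3D's point-to-point ICP are assumptions about tooling rather than the exact evaluation code.

```python
# Sketch of back-projecting a predicted depth map and aligning the sampled
# point cloud to the ground truth with ICP; intrinsics and Open3D usage assumed.
import numpy as np
import open3d as o3d

def backproject(depth, K, mask):
    # depth: (H, W); K: (3, 3) intrinsics; mask: (H, W) boolean foreground mask.
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    # Camera-frame points; the known extrinsics are then applied to move them
    # to a common frame before merging views.
    return np.stack([x, y, z], axis=1)

def align_and_sample(points, gt_points, n=2500, icp_thresh=0.05):
    points = points[np.random.choice(len(points), n, replace=False)]
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(gt_points))
    reg = o3d.pipelines.registration.registration_icp(src, tgt, icp_thresh)
    src.transform(reg.transformation)
    return np.asarray(src.points)
```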

SiDeNet3D SiDeNet3D is trained with 2 views and the improved losses. The predicted voxels from the 3D projection layer are extracted and marching cubes used to fit a mesh over the iso-surface. The threshold value is chosen on the validation set. A point cloud is extracted by randomly sampling from the resulting mesh.

SiDeNet3D is compared to a number of other methods in Table 9 for the Sculpture dataset. For SiDeNet3D and all baseline models, 2500 points are randomly chosen from the predicted point cloud and aligned with the ground truth point cloud using ICP. The resulting point cloud is compared to the ground truth point cloud by reporting the chamfer distance (CD). As can be seen, the performance of our method improves as the number of input views increases.

Additionally, SiDeNet3D performs better than other baseline methods on the Sculpture dataset in Table 9 which demonstrates the utility of explicitly encoding the input viewpoint and thereby representing the coordinate frame of the object. We note again that there is no canonical coordinate frame and the input viewpoint does not align with the output shape, so just predicting the 3D without allowing the network to learn the transformation from the input viewpoint to the 3D (as done in all the baseline methods) leads to poor performance.

Baselines The baseline methods which do not produce point clouds are converted as follows. To convert Yan et al. (2016) to a point cloud, marching cubes is used to fit a mesh over the predicted voxels. Points are then randomly chosen from the extracted mesh. To convert Tatarchenko et al. (2016) to a point cloud, the model is used to predict depth maps at \([0^{\circ }, 90^{\circ }, 180^{\circ }, 270^{\circ }]\). The known intrinsic/extrinsic camera parameters are used to back-project the depth maps. The four point clouds are then combined to form a single point cloud.

7 Generating New Views

Finally, SiDeNet’s representation can be qualitatively evaluated by performing two tasks that require new view generation: rotation and new view synthesis.

7.1 Rotation

As SiDeNet is trained with a subset of views for each dataset (e.g. only 5 views of an object from a random set of viewpoints in \([0^{\circ }, 120^{\circ }]\) for the Sculpture dataset and 24 views taken at \(15^{\circ }\) intervals for ShapeNet), the angle representation can be probed by asking SiDeNet to predict the silhouette as the angle is continuously varied within the given range of viewpoints. Given a fixed input, if the angle is varied continuously, then the output should similarly vary continuously. This is demonstrated in Fig. 10 for both the Sculpture and ShapeNet databases.

Fig. 10

Qualitative results for rotating an object using the angle embedding of \(\theta '\). As the angle \(\theta '\) is rotated from \([0^{\circ }, 360^{\circ }]\) while the input images and viewpoints are kept fixed, it can be seen that the objects rotate continuously for ShapeNet (a–d) and the Sculpture database (e). Additionally, the results for ShapeNet improve given additional input views. For example, in (d), the base of the chair is incorrectly predicted as solid given one view but correctly predicted given additional views

7.2 New View Synthesis

Using the predicted depth, new viewpoints can be synthesised, as demonstrated in Fig. 11. This is done by rendering the predicted depth map of the object as a point cloud using Open3D (Zhou et al. 2018), both at the given viewpoint and at a \(45^{\circ }\) rotation. At both viewpoints the object is rendered in three ways: using a textured point cloud, relighting the textured point cloud, and rendering the point cloud using the predicted z value.

Fig. 11

This figure demonstrates how new views of a sculpture can be synthesised. For each sculpture the input views are shown to the left. The sculpture is then rendered at two viewpoints. At each viewpoint, three renderings are shown: (i) the rendered, textured point cloud, (ii) the point cloud relit and (iii) the depth cloud rendered by using the z-value for the colour (e.g. dark red is further away and yellow/white nearer). Zoom in for details (Color figure online)

7.3 Real Images

Finally, the generalisability of what SiDeNet has learned is tested on another dataset of real images of sculptures, curated by Zollhöfer et al. (2015). Images of two sculptures (augustus and relief) are taken. The images are segmented and padded such that the resulting images have the same properties as the Sculpture dataset (e.g. distance of the sculpture to the boundary and background colour). Each image is then input to the network with viewpoint \(0^{\circ }\). The resulting prediction is rendered as in Sect. 7.2 at multiple viewpoints and under multiple lighting conditions in Fig. 12. This figure demonstrates that SiDeNet generalises to real images, even though it is trained only on synthetic images of a comparatively small number of sculptures (only \(\approx 400\)). Moreover, these real images exhibit perspective effects, yet SiDeNet still produces realistic predictions.

Fig. 12

SiDeNet’s predictions for real images. This figure demonstrates how SiDeNet generalises to real images. For each sculpture the input view (before padding and segmentation) is shown to the left. The predicted point cloud is then rendered at two viewpoints. At each viewpoint, three renderings are shown: (i) the rendered, textured point cloud, (ii) the point cloud relit and (iii) the depth cloud rendered by using the z-value for the colour (e.g. dark red is further away and yellow/white nearer). Zoom in for details (Color figure online)

8 Explainability

This section delves into SiDeNet, attempting to understand how the network learns to incorporate multiple views. To this end, the network is investigated using two methods. The first considers how well the original input images can be reconstructed given the angles and feature encoding x. The second considers how well the original input viewpoints \(\theta _i\) can be predicted as a function of the embedding x and what this implies about the encoding. This is done for both the max and average pooling architectures.

8.1 Reconstruction

The first investigation demonstrates that the original input images can be relatively well reconstructed given only the feature encoding x and the input views. These reconstructions in Fig. 13 demonstrate that x must hold some viewpoint and image information.

To reconstruct the images, the approach of Mahendran and Vedaldi (2015) is followed. Two images and their corresponding viewpoints are input to the network and a forward pass is computed. Then the combined feature vector x is extracted (so it contains the information from the input views and their viewpoints). The two images are reconstructed, starting from noise, by minimising a cost function consisting of two losses: the first loss, \(\mathcal {L}_{MSE}\), simply says that the two reconstructed images, when input to the network, should give a feature vector \(x'\) that is the same as x. The second loss, the total variation regulariser \(\mathcal {L}_{TV}\) (as in Mahendran and Vedaldi 2015 and Upchurch et al. 2017), states that the reconstructed images should be smooth.

$$\mathcal {L}_{MSE} = \sum _{i} (x_i - x'_i)^2$$
(4)
$$\mathcal {L}_{TV} = \sum _{i,j} \left( (I_{i,j+1} - I_{i,j})^2 + (I_{i+1,j} - I_{i,j})^2\right) ^{\beta / 2}$$
(5)

This gives the total loss \(\mathcal {L} = \mathcal {L}_{MSE} + \lambda _{TV} \mathcal {L}_{TV}\). Here, \(\beta \) and \(\lambda _{TV}\) are chosen such that \(\beta =2\) and \(\lambda _{TV}=0.001\). The cost function is optimised using SGD (with momentum 0.975 and learning rate 1, which is decreased by a factor of 0.1 every 1000 steps).
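A hedged PyTorch sketch of this inversion; the helper that recomputes the embedding, the initialisation, and the number of optimisation steps are assumptions for illustration.

```python
# Sketch of reconstructing the input images from the target embedding x by
# optimising Eqs. (4) and (5); initialisation and the embed() helper are assumptions.
import torch

def invert(network, target_x, thetas, image_shape, steps=3000,
           lambda_tv=0.001, beta=2.0):
    images = [torch.randn(image_shape, requires_grad=True) for _ in thetas]
    opt = torch.optim.SGD(images, lr=1.0, momentum=0.975)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)
    for _ in range(steps):
        opt.zero_grad()
        x_prime = network.embed(images, thetas)       # assumed helper returning x'
        loss = ((target_x - x_prime) ** 2).sum()      # L_MSE, Eq. (4)
        for I in images:                              # total variation, Eq. (5)
            dh = I[..., :, 1:] - I[..., :, :-1]       # horizontal differences
            dv = I[..., 1:, :] - I[..., :-1, :]       # vertical differences
            loss = loss + lambda_tv * ((dh[..., :-1, :] ** 2
                                        + dv[..., :, :-1] ** 2) ** (beta / 2)).sum()
        loss.backward()
        opt.step()
        sched.step()
    return [I.detach() for I in images]
```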

8.2 Analysis of Feature Embeddings

In the reconstructions above, it seems that some viewpoint information is propagated through the network despite the aggregation function. Here, we want to understand precisely how this is done. To do so, the following experiment is conducted: how well can the various viewpoints (i.e. \(\theta _1 \cdots \theta _N\)) be predicted from the embedding x for a given architecture? If the hypothesis that the embedding x encodes viewpoint is correct, then these viewpoints should be accurately predicted.

As a result, x is considered to determine how much of it is viewpoint-independent and how much of it is viewpoint-dependent. This is done by using each hidden unit in x to predict the viewpoint \(\theta _1\) using ordinary least squares regression (Friedman et al. 2001) (only \(\theta _1\) is considered as x is invariant to the input ordering). Training pairs are obtained by taking two images with corresponding viewpoints \(\theta _1\) and \(\theta _2\), passing them through the network and obtaining x.

The p value for each hidden unit is computed to determine whether there is a significant relation between the hidden unit and the viewpoint. If the p value is large (\(> 0.05\)), there is no evidence of a relation between the hidden unit and the viewpoint, suggesting that it contains viewpoint-independent information (presumably shape information). The number of hidden units with p value less than c, as c is varied, is visualised in Fig. 14 for both architectures. As can be seen, more than \(80\%\) of the hidden units for both architectures are significantly related to the viewpoint.
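A sketch of this per-unit analysis using ordinary least squares from statsmodels; treating the angle as a linear regression target follows the description above, and the statsmodels usage is an assumption about tooling.

```python
# Sketch of regressing the viewpoint theta_1 on each hidden unit of x and
# counting units with a significant relation.
import numpy as np
import statsmodels.api as sm

def significant_units(X, theta1, alpha=0.05):
    # X: (num_samples, num_hidden_units) embeddings; theta1: (num_samples,) viewpoints.
    p_values = []
    for k in range(X.shape[1]):
        design = sm.add_constant(X[:, k])             # intercept + single hidden unit
        fit = sm.OLS(theta1, design).fit()
        p_values.append(fit.pvalues[1])               # p value of the unit's coefficient
    p_values = np.array(p_values)
    return (p_values < alpha).mean(), p_values        # fraction of significant units
```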

Since so many of the hidden units have a significant relation to the viewpoint, they would be expected to vary as a function of the input angle. To investigate this, the activations of the hidden units are visualised as a function of the angle \(\theta _1\). For two objects, all input values are kept fixed (e.g. the images and other viewpoint values) except for \(\theta _1\) which is varied between \(0^{\circ }\) and \(360^{\circ }\). A subset of the hidden units in x are visualised as \(\theta _1\) is varied in Fig. 15. As can be seen, the activation either varies in a seemingly sinusoidal fashion—it is maximised at some value for \(\theta _1\) and decays as \(\theta _1\) is varied—or it is constant.

Moreover, the activations are not the same if the input images are varied. This implies that the hidden units encode not just viewpoint but also viewpoint-dependent information (e.g. shape—such as the object is tall and thin at \(90^{\circ }\)). This information is aggregated over all views with either aggregation method. The aggregation method controls whether the most ‘confident’ view (e.g. if using max) is chosen or all views are considered (e.g. avg). Finally, this analysis demonstrates the utility of encoding the input viewpoints in the architecture. When generating the silhouette and depth at a new or given viewpoint, these properties can be easily morphed into the new view (e.g. if the new viewpoint is at \(90^{\circ }\) then components nearer \(90^{\circ }\) can be easily considered with more weight by the model).

Fig. 13

Reconstruction of the original input images for max/avg pooling architectures. The ability to propagate view and viewpoint information through the network is demonstrated by the fact that the input images can be reconstructed given the latent feature vector and input angles using the approach of Mahendran and Vedaldi (2015)

Fig. 14

Visualises the relation between the individual hidden units and the viewpoint. Each hidden unit is used in a separate regression to predict the viewpoint. The p value for each hidden unit is computed and for a given set of values c, the number of hidden units with a p value \(< c\) is plotted. This demonstrates that the majority of hidden units in both architectures are correlated with the viewpoint. For the max architecture, \(98\%\) of the hidden units have \(p < 0.05\) and for the avg pool architecture \(90\%\)

8.3 Discussion

In this section, to understand what two versions of SiDeNet—avg and max—have learned, two questions have been posed. How well can the original input images be reconstructed from the angles and latent vector x? How is x encoded such that views can be aggregated and that with more views, performance improves? The subsequent analysis has not only demonstrated that the original input views can be reconstructed given the viewpoints and x but has also put forward an explanation for how the views are aggregated: by using the hidden units to encode shape and viewpoint together.

9 Summary

This work has introduced a new architecture SiDeNet for learning about 3D shape, which is tested on a challenging dataset of 3D sculptures with a high variety of shapes and textures. To do this a multi-task loss is used; the network learns to predict the depth for the given views and the silhouette at a new view. This loss has multiple benefits. First, it enforces that the network learns a complex representation of shape, as predicting the silhouette enforces that the network learns about the visual hull of the object and predicting the depth that the network learns about concavities on the object’s surface. Second, using an image-based representation is beneficial, as it does not limit the resolution of the generated model; this benefit is demonstrated on the ShapeNet dataset. The trained network can then be used for various applications, such as new view synthesis and can even be used directly on real images.

Fig. 15

Visualisation of the activation of hidden units as a function of \(\theta _i\) for the two architectures. \(\theta _i\) is varied between \(0^{\circ }, 360^{\circ }\) and all other values kept constant. Each hidden unit is normalised to between 0 and 1 over this sequence of \(\theta _i\) and visualised. This figure demonstrates two things. First, the activation is either a continuous, smooth function of \(\theta _i\) or constant (visualised as white in the figure). Second, the hidden units activated depend on the input views, as they vary from view to view. This implies that the hidden units encode viewpoint-dependent information (e.g. object properties and the associated viewpoint). a The activation for a subset of hidden units for the avg-pooling architecture for two different sets of input images (left and right). b The activation for a subset of hidden units for the max-pooling architecture for two different sets of input images (left and right)

The second benefit of the SiDeNet architecture is the view-dependent representation and the ability to generalise over additional views at test-time. Using a view-dependent representation means that no implicit assumptions need to be made about the nature of the 3D objects (e.g. that there exists a canonical orientation). Additionally, SiDeNet can leverage additional views at test time and results (both silhouette and depth) improve with each additional view, even when the views are not photometrically consistent.

While the architecture is able to capture a wide variety of shapes and styles as demonstrated in our results, it is most likely that SiDeNet would improve given more data. However, despite the sculpture dataset being small compared to standard deep learning datasets, it is interesting that SiDeNet can be used to boost performance on a very different synthetic dataset of chairs and predict depth, out-of-the-box, on real sculpture images.