1 Introduction

The problem of recovering a geometric representation of the 3D world given a set of images is classically defined as multi-view 3D reconstruction in computer vision. Traditional pipelines such as Structure from Motion (SfM) (Ozyesil et al. 2017) and visual Simultaneous Localization and Mapping (vSLAM) (Cadena et al. 2016) typically rely on hand-crafted feature extraction and matching across multiple views to reconstruct the underlying 3D model. However, if the viewpoints are separated by large baselines, feature matching becomes extremely challenging due to significant appearance changes or self-occlusions (Lowe 2004). Furthermore, the reconstructed 3D shape is usually a sparse point cloud without geometric details.

Fig. 1

Overview of our attentional aggregation module for multi-view 3D reconstruction. A set of N images is passed through a common encoder to produce a set of deep features, one element for each image. The network is trained with our FASet algorithm

Recently, a number of deep learning approaches, such as 3D-R2N2 (Choy et al. 2016), LSM (Kar et al. 2017), DeepMVS (Huang et al. 2018) and RayNet (Paschalidou et al. 2018), have been proposed to estimate the 3D dense shape from multiple images and have shown encouraging results. Both 3D-R2N2 (Choy et al. 2016) and LSM (Kar et al. 2017) formulate multi-view reconstruction as a sequence learning problem, and leverage recurrent neural networks (RNNs), particularly GRU, to fuse the multiple deep features extracted by a shared encoder from input images. However, there are three limitations. First, the recurrent network is permutation variant, i.e., different permutations of the input image sequence give different reconstruction results (Vinyals et al. 2015). Therefore, inconsistent 3D shapes are estimated from the same image set with different permutations. Second, it is difficult to capture long-term dependencies in the sequence because of vanishing or exploding gradients (Bengio et al. 1994; Hochreiter et al. 2001), so the estimated 3D shapes are unlikely to be refined even if more images are given during training and testing. Third, the RNN unit is inefficient because each element of the input sequence must be processed sequentially without parallelization (Martin and Cundy 2018), so it is time-consuming to generate the final 3D shape given a sequence of images.

The recent DeepMVS (Huang et al. 2018) applies max pooling to aggregate deep features across a set of unordered images for multi-view stereo reconstruction, while RayNet (Paschalidou et al. 2018) adopts average pooling to aggregate the deep features corresponding to the same voxel from multiple images to recover a dense 3D model. The very recent GQN (Eslami et al. 2018) uses sum pooling to aggregate an arbitrary number of orderless images for 3D scene representation. Although max, average and summation poolings do not suffer from the above limitations of RNNs, they tend to be ‘hard attentive’, since they only capture the max/mean values or the summation without learning to attentively preserve the useful information. In addition, the above pooling based neural nets are usually optimized with a specific number of input images during training, and are therefore neither robust nor general to a dynamic number of input images during testing. This critical issue is also observed in GQN (Eslami et al. 2018).

In this paper, we introduce a simple yet efficient attentional aggregation module, named AttSets. It can be easily included in an existing multi-view 3D reconstruction network to aggregate an arbitrary number of elements of a deep feature set. Inspired by the attention mechanism which shows great success in natural language processing (Bahdanau et al. 2015; Raffel and Ellis 2016), image captioning (Xu et al. 2015), etc., we design a feed-forward neural module that can automatically learn to aggregate each element of the input deep feature set. In particular, as shown in Fig. 1, given a variable sized deep feature set, whose elements are usually learnt view-invariant visual representations from a shared encoder (Paschalidou et al. 2018), our AttSets module firstly learns an attention activation for each latent feature through a standard neural layer (e.g., a fully connected layer, a 2D or 3D convolutional layer), after which an attention score is computed for the corresponding feature. Subsequently, the attention scores are simply multiplied by the original elements of the deep feature set, generating a set of weighted features. Finally, the weighted features are summed across the different elements of the deep feature set, producing a fixed-size aggregated feature which is then fed into a decoder to estimate 3D shapes. Basically, this AttSets module can be seen as a natural extension of sum pooling into a “weighted” sum pooling with learnt feature-specific weights. AttSets shares similar concepts with the concurrent work (Ilse et al. 2018), but it does not require the additional gating mechanism in Ilse et al. (2018). Notably, our simple feed-forward design allows the attention module to be separately trainable according to the property of its gradients.

In addition, we propose a new Feature-Attention Separate training (FASet) algorithm that elegantly decouples the base encoder–decoder (to learn deep features) from the AttSets module (to learn attention scores for features). This allows the AttSets module to learn desired attention scores for deep feature sets and guarantees that the AttSets based neural networks are robust and general to dynamically sized deep feature sets. Basically, in the proposed training algorithm, the base encoder–decoder neural layers are only optimized when the number of input images is 1, while the AttSets module is only optimized when there is more than 1 input image. Eventually, the whole optimized AttSets based neural network achieves superior performance with a large number of input images, while simultaneously being extremely robust and able to generalize to a small number of input images, even to a single image in the extreme case. Compared with the widely used feed-forward attention mechanisms for visual recognition (Hu et al. 2018; Rodríguez et al. 2018; Liu et al. 2018; Sarafianos et al. 2018; Girdhar and Ramanan 2017), our FASet algorithm is the first to investigate and improve the robustness of attention modules to dynamically sized input feature sets, whilst existing works are only applicable to fixed sized input data.

Overall, our novel AttSets module and FASet algorithm are distinguished from all existing aggregation approaches in three ways. (1) Compared with RNN approaches, AttSets is permutation invariant and computationally efficient. (2) Compared with the widely used pooling operations, AttSets learns to attentively select and weight important deep features, thereby aggregating useful information more effectively for better 3D reconstruction. (3) Compared with existing visual attention mechanisms, our FASet algorithm enables the whole network to be general to variable sized sets, being more robust and suitable for realistic multi-view 3D reconstruction scenarios where the number of input images usually varies dramatically.

Our key contributions are:

  • We propose an efficient feed-forward attention module, AttSets, to effectively aggregate deep feature sets. Our design allows the attention module to be separately optimizable according to the property of its gradients.

  • We propose a new two-stage training algorithm, FASet, to decouple the base encoder/decoder and the attention module, guaranteeing the whole network to be robust and general to an arbitrary number of input images.

  • We conduct extensive experiments on multiple public datasets, demonstrating consistent improvement over existing aggregation approaches for 3D object reconstruction from either single or multiple views.

Fig. 2

Attentional aggregation module on sets. This module learns an attention score for each individual deep feature

2 Related Work

(1) Multi-view 3D Reconstruction 3D shapes can be recovered from multiple color images or depth scans. To estimate the underlying 3D shape from multiple color images, classic SfM (Ozyesil et al. 2017) and vSLAM (Cadena et al. 2016) algorithms firstly extract and match hand-crafted geometric features (Hartley and Zisserman 2004) and then apply bundle adjustment (Triggs et al. 1999) for both shape and camera motion estimation. Ji et al. (2017b) use “maximizing rigidity” for reconstruction, but this requires 2D point correspondences across images. Recent deep neural net based approaches tend to recover dense 3D shapes through learnt features from multiple images and achieve compelling results. To fuse the deep features from multiple images, both 3D-R2N2 (Choy et al. 2016) and LSM (Kar et al. 2017) apply the recurrent unit GRU, resulting in the networks being permutation variant and inefficient for aggregating long sequences of images. Recent SilNet (Wiles and Zisserman 2017, 2018) and DeepMVS (Huang et al. 2018) simply use max pooling to preserve the first order information of multiple images, while RayNet (Paschalidou et al. 2018) applies average pooling to preserve the first moment information of multiple deep features. MVSNet (Yao et al. 2018) proposes a variance-based approach to capture the second moment information for multiple feature aggregation. These pooling techniques only capture partial information, ignoring the majority of the deep features. Recent SurfaceNet (Ji et al. 2017a) and SuperPixel Soup (Kumar et al. 2017) can reconstruct 3D shapes from two images, but they are unable to process an arbitrary number of images. As for multiple depth image reconstruction, the traditional volumetric fusion method (Curless and Levoy 1996; Cao et al. 2018) integrates multiple viewpoint information by averaging truncated signed distance functions (TSDF). The recent learning based OctNetFusion (Riegler et al. 2017) also adopts a similar strategy to integrate multiple depth information. However, this integration might result in information loss since TSDF values are averaged (Riegler et al. 2017). PSDF (Dong et al. 2018) was recently proposed to learn a probabilistic distribution through Bayesian updating in order to fuse multiple depth images, but it is not straightforward to include the module in existing encoder–decoder networks.

(2) Deep Learning on Sets In contrast to traditional approaches operating on fixed dimensional vectors or matrices, deep learning tasks defined on sets usually require learning functions to be permutation invariant and able to process an arbitrary number of elements in a set (Zaheer et al. 2017). Such problems are widespread. Zaheer et al. introduce general permutation invariant and equivariant models in Zaheer et al. (2017), and they end up with a sum pooling for permutation invariant tasks such as population statistics estimation and point cloud classification. In the very recent GQN (Eslami et al. 2018), sum pooling is also used to aggregate an arbitrary number of orderless images for 3D scene representation. Gardner et al. (2017) use average pooling to integrate an unordered deep feature set for a classification task. Su et al. (2015) use max pooling to fuse the deep feature set of multiple views for 3D shape recognition. Similarly, PointNet (Qi et al. 2017) also uses max pooling to aggregate the set of features learnt from point clouds for 3D classification and segmentation. In addition, higher-order statistics based pooling approaches are widely used for 3D object recognition from multiple images. Vanilla bilinear pooling is applied for fine-grained recognition in Lin et al. (2015) and is further improved in Lin and Maji (2017). Concurrently, log-covariance pooling is proposed in Ionescu et al. (2015), and is recently generalized by harmonized bilinear pooling in Yu et al. (2018). Bilinear pooling techniques are further improved in the recent work (Yu and Salzmann 2018; Lin et al. 2018). However, both first-order and higher-order pooling operations ignore a majority of the information of a set. In addition, the first-order poolings do not have trainable parameters, while the higher-order poolings have only a few parameters available for the network to learn. These limitations cause the pooling based neural networks to be optimized with regard to the specific statistics of data batches during training, and they are therefore unable to be robust and generalize well to variable sized deep feature sets during testing.

(3) Attention Mechanism The attention mechanism was originally proposed for natural language processing (Bahdanau et al. 2015). Being coupled with RNNs, it achieves compelling results in neural machine translation (Bahdanau et al. 2015), image captioning (Xu et al. 2015), image question answering (Yang et al. 2016), etc. However, all these coupled attention approaches are permutation variant and computationally time-consuming. Dispensing with recurrence and convolutions entirely and relying solely on the attention mechanism, the Transformer (Vaswani et al. 2017) achieves superior performance in machine translation tasks. Similarly, being decoupled from RNNs, attention mechanisms are also applied for visual recognition (Hu et al. 2018; Rodríguez et al. 2018; Liu et al. 2018; Sarafianos et al. 2018; Zhu et al. 2018; Nakka and Salzmann 2018; Girdhar and Ramanan 2017), semantic segmentation (Li et al. 2018), long sequence learning (Raffel and Ellis 2016), and image generation (Zhang et al. 2018). Although the above decoupled attention modules can be used to aggregate variable sized deep feature sets, they are designed to operate on fixed sized features for tasks such as image recognition and generation. The robustness of attention modules regarding dynamic deep feature sets has not been investigated yet.

Compared with the original attention mechanism, our AttSets does not couple with RNNs. Instead, AttSets is a simplified feed-forward module which shares similar concepts with the concurrent work (Ilse et al. 2018). However, our AttSets is much simpler, without requiring the additional gating mechanism in Ilse et al. (2018). Besides, we further propose a dedicated FASet algorithm, enabling the AttSets based network to be remarkably robust and general to arbitrarily sized deep sets. This algorithm is the first to investigate and improve the robustness of feed-forward attention mechanisms.

3 AttSets

3.1 Problem Definition

This paper considers the problem of aggregating an arbitrary number of elements of a set \({\mathcal {A}}\) into a fixed single output \({\varvec{y}}\). Each element of set \({\mathcal {A}}\) is a feature vector extracted from a shared encoder, and the fixed dimension output \({\varvec{y}}\) is fed into a subsequent decoder, such that the whole network can process an arbitrary number of input elements.

Given N elements in the input deep feature set \({\mathcal {A}} = \{{\varvec{x}}_1, {\varvec{x}}_2, \ldots , {\varvec{x}}_N\}\), \({\varvec{x}}_n \in {\mathbb {R}}^{1\times D}\), where N is an arbitrary value, while D is fixed for a specific encoder, and the output \({\varvec{y}} \in {\mathbb {R}}^{1\times D}\), which is then fed into the subsequent decoder, our task is to design an aggregation function f with learnable weights \(\varvec{ W }\): \({\varvec{y}} = f({\mathcal {A}}, \varvec{ W })\), which should be permutation invariant, i.e., for any permutation \(\pi \):

$$\begin{aligned} f(\{ {\varvec{x}}_1, \ldots , {\varvec{x}}_N \}, \varvec{ W }) = f(\{{\varvec{x}}_{\pi (1)}, \ldots , {\varvec{x}}_{\pi (N)} \}, \varvec{ W }) \end{aligned}$$
(1)

The common pooling operations, e.g., max/mean/sum, are the simplest instantiations of function f where \(\varvec{ W } \in \emptyset \). However, these pooling operations are predefined to capture partial information.
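
To make this concrete, the following sketch (ours, not part of the original implementation) instantiates f with the three common poolings and checks that each is permutation invariant while carrying no learnable weights:

```python
# Minimal sketch: max/mean/sum pooling as parameter-free instantiations of f.
import numpy as np

def pool(A, mode):
    """Aggregate a feature set A of shape (N, D) into a single (D,) vector."""
    return {"max": A.max, "mean": A.mean, "sum": A.sum}[mode](axis=0)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 1024))          # N = 5 elements, D = 1024
perm = rng.permutation(len(A))
for mode in ("max", "mean", "sum"):
    # permutation invariant, but W is empty: nothing is learnt about the features
    assert np.allclose(pool(A, mode), pool(A[perm], mode))
```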

3.2 AttSets Module

The basic idea of our AttSets module is to learn an attention score for each latent feature of the whole deep feature set. In this paper, each latent feature refers to each entry of an individual element of the feature set, with an individual element usually represented by a latent vector, i.e., \({\varvec{x}}_n\). The learnt scores can be regarded as a mask that automatically selects useful latent features across the set. The selected features are then summed across multiple elements of the set.

Fig. 3

Implementation of AttSets with fully connected layer, 2D ConvNet, and 3D ConvNet. These three variants of AttSets can be flexibly plugged into different locations of an existing encoder–decoder network

As shown in Fig. 2, given a set of features \({\mathcal {A}} = \{{\varvec{x}}_1, {\varvec{x}}_2, \ldots , {\varvec{x}}_N\}\), \({\varvec{x}}_n \in {\mathbb {R}}^{1\times D}\), AttSets aims to fuse it into a fixed dimensional output \({\varvec{y}}\), where \({\varvec{y}} \in {\mathbb {R}}^{1\times D}\).

To build the AttSets module, we first feed each element of the feature set \({\mathcal {A}}\) into a shared function g, which can be a standard neural layer, i.e., a linear transformation layer without any non-linear activation functions. Here we use a fully connected (fc) layer as an example; the bias term is dropped for simplicity. The output of function g is a set of learnt attention activations \({\mathcal {C}}=\{{\varvec{c}}_1, {\varvec{c}}_2, \ldots , {\varvec{c}}_N\}\), where

$$\begin{aligned}&{\varvec{c}}_n = g({\varvec{x}}_n, {\varvec{W}}) = {\varvec{x}}_n{\varvec{W}},\nonumber \\&\quad ({\varvec{x}}_n \in {\mathbb {R}}^{1\times D}, \quad {\varvec{W}} \in {\mathbb {R}}^{D\times D}, \quad {\varvec{c}}_n \in {\mathbb {R}}^{1\times D} ) \end{aligned}$$
(2)

Secondly, the learnt attention activations are normalized across the N elements of the set, computing a set of attention scores \({\mathcal {S}}=\{{\varvec{s}}_1, {\varvec{s}}_2, \ldots , {\varvec{s}}_N\}\). We choose softmax as the normalization operation, so the attention scores for the nth feature element are

$$\begin{aligned} {\varvec{s}}_n&= [s^1_n, s^2_n, \ldots , s^d_n, \ldots , s^D_n],\nonumber \\ s^d_n&= \frac{e^{c^d_n}}{\sum ^N_{j=1}{ e^{c^d_j}}}, \quad c^d_n, c^d_j \, \textit{are the dth entries of}\, {\varvec{c}}_n, {\varvec{c}}_j. \end{aligned}$$
(3)

Thirdly, the computed attention scores \({\mathcal {S}}\) are multiplied by their corresponding original feature set \({\mathcal {A}}\), generating a new set of deep features, denoted as weighted features \({\mathcal {O}}=\{{\varvec{o}}_1, {\varvec{o}}_2, \ldots , {\varvec{o}}_N\}\), where

$$\begin{aligned} {\varvec{o}}_n = {\varvec{x}}_n * {\varvec{s}}_n \end{aligned}$$
(4)

Lastly, the set of weighted features \({\mathcal {O}}\) are summed up across the total N elements to get a fixed size feature vector, denoted as \({\varvec{y}}\), where

$$\begin{aligned} {\varvec{y}}= & {} [y^1, y^2, \ldots , y^d, \ldots , y^D],\nonumber \\ y^d= & {} \sum ^N_{n=1}o^d_n, \qquad o^d_n \,\textit{is the dth entry of}\, {\varvec{o}}_n. \end{aligned}$$
(5)

In the above formulation, we show how AttSets gradually aggregates a set of N feature vectors \({\mathcal {A}}\) into a single vector \({\varvec{y}}\), where \({\varvec{y}} \in {\mathbb {R}}^{1\times D}\).
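
The whole forward pass can be written in a few lines. The following NumPy sketch is an illustrative re-implementation of Eqs. 2–5 for the fc-based variant (ours, not the released code), with the bias term dropped as in the text:

```python
import numpy as np

def attsets_fc(A, W):
    """Aggregate a feature set A of shape (N, D) into one (D,) vector (Eqs. 2-5)."""
    C = A @ W                                             # Eq. 2: attention activations, (N, D)
    C = C - C.max(axis=0, keepdims=True)                  # subtract max for numerical stability
    S = np.exp(C) / np.exp(C).sum(axis=0, keepdims=True)  # Eq. 3: softmax over the N elements
    O = A * S                                             # Eq. 4: weighted features, (N, D)
    return O.sum(axis=0)                                  # Eq. 5: sum across the set

N, D = 8, 1024                     # N is arbitrary, D is fixed by the encoder
A = np.random.randn(N, D)
W = 0.01 * np.random.randn(D, D)
y = attsets_fc(A, W)
print(y.shape)                     # (1024,)
```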

3.3 Permutation Invariance

The output of AttSets module \({\varvec{y}}\) is permutation invariant with regard to the input deep feature set \({\mathcal {A}}\). Here is the simple proof.

$$\begin{aligned}{}[y^1, \ldots , y^d, \ldots , y^D] = f(\{{\varvec{x}}_1, \ldots , {\varvec{x}}_n, \ldots , {\varvec{x}}_N\}, {\varvec{W}}) \end{aligned}$$
(6)

In Eq. 6, the dth entry of the output \({\varvec{y}}\) is computed as follows:

$$\begin{aligned} y^d&= \sum ^N_{n=1}o^d_n = \sum ^N_{n=1}(x^d_n*s^d_n) \nonumber \\&= \sum ^N_{n=1}\left( x^d_n * \frac{e^{c^d_n}}{\sum ^N_{j=1} e^{c^d_j} } \right) \nonumber \\&= \sum ^N_{n=1}\left( x^d_n * \frac{e^{({\varvec{x}}_n{\varvec{w}}^d)} }{ \sum ^N_{j=1}e^{({\varvec{x}}_j{\varvec{w}}^d)}} \right) \nonumber \\&=\frac{\sum ^N_{n=1} \left( x^d_n * e^ {({\varvec{x}}_n{\varvec{w}}^d)} \right) }{\sum ^N_{j=1}e^{({\varvec{x}}_j{\varvec{w}}^d)}}, \end{aligned}$$
(7)

where \({\varvec{w}}^d\) is the dth column of the weights \({\varvec{W}}\). In the above Eq. 7, both the denominator and the numerator are summations of a permutation equivariant term over all elements of the set. Therefore the value \(y^d\), and hence the full vector \({\varvec{y}}\), is invariant to different permutations of the deep feature set \({\mathcal {A}}=\{{\varvec{x}}_1, {\varvec{x}}_2, \ldots , {\varvec{x}}_n, \ldots , {\varvec{x}}_N\}\) (Zaheer et al. 2017).
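
This invariance is easy to verify numerically; the short check below (ours, reusing the same toy forward pass sketched in Sect. 3.2) confirms that shuffling the set elements leaves \({\varvec{y}}\) unchanged:

```python
import numpy as np

def attsets_fc(A, W):
    C = A @ W
    C = C - C.max(axis=0, keepdims=True)
    S = np.exp(C) / np.exp(C).sum(axis=0, keepdims=True)
    return (A * S).sum(axis=0)

rng = np.random.default_rng(1)
A, W = rng.normal(size=(6, 32)), rng.normal(size=(32, 32))
for _ in range(10):
    perm = rng.permutation(len(A))
    assert np.allclose(attsets_fc(A, W), attsets_fc(A[perm], W))
print("output is permutation invariant")
```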

3.4 Implementation

In Sect. 3.2, we described how our AttSets aggregates an arbitrary number of vector features into a single vector, where the attention activation learning function g embeds a fully connected layer. AttSets can also be easily implemented with both 2D and 3D convolutional neural layers to aggregate both 2D and 3D deep feature sets, and can thus be flexibly included in a 2D encoder/decoder or 3D encoder/decoder. Particularly, as shown in Fig. 3, to aggregate a set of 2D features, i.e., a tensor of \((width\times height\times channels)\), the attention activation learning function g embeds a standard conv2d layer with a stride of \((1\times 1)\). Similarly, to fuse a set of 3D features, i.e., a tensor of \((width\times height\times depth\times channels)\), the function g embeds a standard conv3d layer with a stride of \((1\times 1\times 1)\). The filter size of the above conv2d/conv3d layer can be 1, 3 or larger; the larger the filter size, the larger the local spatial area with which the learnt attention score is correlated.

Instead of embedding a single neural layer, the function g is also flexible enough to include multiple layers, but the tensor shape of the output of function g is required to be consistent with that of the input element \({\varvec{x}}_n\). This guarantees that each individual feature of the input set \({\mathcal {A}}\) will be associated with a learnt and unique weight. For example, a standard 2-layer or 3-layer ResNet module (He et al. 2016) could be a candidate for the function g. The more layers g embeds, the higher the expected capacity of the AttSets module.

Compared with fc enabled AttSets, the conv2d or conv3d based AttSets variants tend to have fewer learnable parameters. Note that both the conv2d and conv3d based AttSets are still permutation invariant, as the function g is shared across all elements of the deep feature set and it does not depend on the order of the elements (Zaheer et al. 2017).
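
As a concrete illustration of the convolutional variants, the sketch below (ours, covering only the \(1\times 1\) filter case with illustrative tensor sizes) implements a conv2d-based AttSets: g becomes a per-location linear map over channels, while the softmax and summation still run over the set dimension N:

```python
import numpy as np

def attsets_conv2d_1x1(A, Wc):
    """A: (N, H, W, C) 2D feature set;  Wc: (C, C) weights of a 1x1 conv2d."""
    act = A @ Wc                                                # attention activations, same shape as A
    act = act - act.max(axis=0, keepdims=True)                  # numerical stability
    S = np.exp(act) / np.exp(act).sum(axis=0, keepdims=True)    # softmax over the N set elements
    return (A * S).sum(axis=0)                                  # aggregated (H, W, C) feature map

A = np.random.randn(5, 4, 4, 256)        # e.g., a set of five 4x4x256 feature maps
Wc = 0.01 * np.random.randn(256, 256)
print(attsets_conv2d_1x1(A, Wc).shape)   # (4, 4, 256)
```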

4 FASet

4.1 Motivation

Algorithm 1 Feature-Attention Separate training (FASet)

Our AttSets module can be included in an existing encoder–decoder multi-view 3D reconstruction network, replacing the RNN units or pooling operations. Basically, in an AttSets enabled encoder–decoder net, the encoder–decoder serves as the base architecture to learn visual features for shape estimation, while the AttSets module learns to assign different attention scores to combine those features. As such, the base network tends to have robustness and generality with regard to different input image content, while the AttSets module tends to be general regarding an arbitrary number of input images.

However, achieving this robustness is not straightforward. The standard end-to-end joint optimization approach cannot guarantee that the base encoder–decoder and AttSets learn visual features and the corresponding scores separately, because there are no explicit feature score labels available to directly supervise the AttSets module.

Let us revisit the previous Eq. 7 as follows and draw insights from it.

$$\begin{aligned} y^d =\frac{\sum ^N_{n=1} \left( x^d_n * e^ {({\varvec{x}}_n{\varvec{w}}^d)} \right) }{\sum ^N_{j=1}e^{({\varvec{x}}_j{\varvec{w}}^d)}} \end{aligned}$$
(8)

where N is the size of an arbitrary input set and \({\varvec{w}}^d\) are the AttSets parameters to be optimized. If N is 1, then the equation can be simplified as

$$\begin{aligned} y^d&= x^d_n \end{aligned}$$
(9)
$$\begin{aligned} \frac{\partial y^d}{\partial x^d_n}&=1, \qquad \frac{\partial y^d}{\partial {\varvec{w}}^d} ={\varvec{0}}, \qquad N=1 \end{aligned}$$
(10)

This shows that all parameters, i.e., \({\varvec{w}}^d\), of the AttSets module are not going to be optimized when the size of the input feature set is 1.

However, if \(N>1\), Eq. 8 cannot be simplified to Eq. 9. Therefore,

$$\begin{aligned} \frac{\partial y^d}{\partial x^d_n} \ne 1, \qquad \frac{\partial y^d}{\partial {\varvec{w}}^d } \ne {\varvec{0}}, \qquad N>1 \end{aligned}$$
(11)

This shows that both the parameters of AttSets and the base encoder–decoder layers will be optimized simultaneously, if the whole network is trained in the standard end-to-end fashion.
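
The two cases can be checked numerically. The finite-difference sketch below (ours, reusing the toy forward pass from Sect. 3.2) confirms that the gradient of the output with respect to \({\varvec{W}}\) vanishes when \(N=1\) and is non-zero when \(N>1\):

```python
import numpy as np

def attsets_fc(A, W):
    C = A @ W
    C = C - C.max(axis=0, keepdims=True)
    S = np.exp(C) / np.exp(C).sum(axis=0, keepdims=True)
    return (A * S).sum(axis=0)

def max_abs_grad_wrt_W(A, W, eps=1e-5):
    """Finite-difference gradient of sum(y) with respect to every entry of W."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (attsets_fc(A, Wp).sum() - attsets_fc(A, Wm).sum()) / (2 * eps)
    return np.abs(g).max()

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))
print(max_abs_grad_wrt_W(rng.normal(size=(1, 8)), W))   # 0: W receives no gradient when N = 1
print(max_abs_grad_wrt_W(rng.normal(size=(4, 8)), W))   # > 0: W is optimized when N > 1
```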

Here arises the critical issue. When \(N>1\), all derivatives of the parameters in the encoder are different from the derivatives when \(N=1\), due to the chain rule of differentiation applied backwards from \(\frac{\partial y^d}{\partial x^d_n}\). Put simply, the derivatives of the encoder are N-dependent. As a consequence, the encoded visual features and the learnt attention scores would be N-biased if the whole network is jointly trained. Such a biased network is unable to generalize to an arbitrary value of N during testing.

To illustrate the above issue, assume the base encoder–decoder and the AttSets module are jointly trained with 5 images per object; the base encoder will eventually be optimized towards 5-view object reconstruction during training. The trained network can indeed perform well given 5 views during testing, but it is unable to predict a satisfactory object shape given only 1 image.

To alleviate the above problem, a naive approach is to enumerate various values of N during joint training, such that the final optimized network can be somewhat robust and general to an arbitrary N during testing. However, this approach would inevitably optimize the encoder to learn the mean features of the input data over varying N, so the overall performance would not be optimal. In addition, it is impractical and time-consuming to enumerate all values of N during training.

4.2 Algorithm

To resolve the critical issue discussed in Sect. 4.1, we propose a Feature-Attention Separate training (FASet) algorithm that decouples the base encoder–decoder and the AttSets module, enabling the base encoder–decoder to learn robust deep features and the AttSets module to learn the desired attention scores for the feature sets.

In particular, the base encoder–decoder neural layers are only optimized when the number of input images is 1, while the AttSets module is only optimized when there is more than 1 input image. In this regard, the parameters of the base encoding layers have consistent derivatives throughout the whole training stage, and are thus optimized steadily. In the meantime, the AttSets module is optimized solely based on multiple elements of learnt visual features from the shared encoder.

The trainable parameters of the base encoder–decoder are denoted as \(\varvec{\varTheta }_{base}\), the trainable parameters of the AttSets module are denoted as \(\varvec{\varTheta }_{att}\), and the loss function of the whole network is represented by \(\ell \), which is determined by the specific supervision signal of the base network. Our FASet is shown in Algorithm 1. It can be seen that \(\varvec{\varTheta }_{base}\) and \(\varvec{\varTheta }_{att}\) are completely decoupled from one another, and thus separately optimized in two stages. In stage 1, \(\varvec{\varTheta }_{base}\) is first optimized until convergence, which guarantees that the base encoder–decoder is able to learn robust and general visual features. In stage 2, \(\varvec{\varTheta }_{att}\) is optimized to learn attention scores for individual visual features. Basically, this module learns to select and weight important deep features automatically.

In the FASet algorithm, once \(\varvec{\varTheta }_{base}\) is well optimized in stage 1, it is not necessary to train it again, since the two-stage algorithm guarantees that optimizing \(\varvec{\varTheta }_{base}\) is agnostic to the attention module. The FASet algorithm is a crucial component in maintaining the superior robustness of the AttSets module, as shown in Sect. 5.9. Without it, the feed-forward attention mechanism is ineffective with respect to dynamic input sets.
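
The skeleton below sketches the two stages of Algorithm 1 at a high level. All names (encoder, attsets, decoder, loss_fn, the optimizers and the dataset sampling interface) are hypothetical placeholders standing in for the actual network components, not the released implementation:

```python
def train_faset(dataset, encoder, attsets, decoder, loss_fn,
                opt_base, opt_att, stage1_steps, stage2_steps, stage2_num_views):
    # Stage 1: optimize only Theta_base (encoder + decoder) with single-view
    # batches; AttSets passes the single feature through unchanged and its
    # weights receive zero gradient (Eq. 10).
    for _ in range(stage1_steps):
        views, gt = dataset.sample(num_views=1)
        features = [encoder(v) for v in views]                  # one latent feature per view
        loss = loss_fn(decoder(attsets(features)), gt)
        opt_base.step(loss, params=encoder.params + decoder.params)

    # Stage 2: freeze Theta_base and optimize only Theta_att with multi-view
    # batches (any N > 1, fixed or random per batch), so AttSets learns
    # attention scores over the frozen visual features.
    for _ in range(stage2_steps):
        views, gt = dataset.sample(num_views=stage2_num_views)
        features = [encoder(v) for v in views]
        loss = loss_fn(decoder(attsets(features)), gt)
        opt_att.step(loss, params=attsets.params)
```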

5 Evaluation

Base Networks To evaluate the performance and various properties of AttSets, we choose the encoder–decoders of 3D-R2N2 (Choy et al. 2016) and SilNet (Wiles and Zisserman 2017) as two base networks.

  • Encoder–decoder of 3D-R2N2. The original 3D-R2N2 consists of (1) a shared ResNet-based 2D encoder which encodes images of size \(127\times 127 \times 3\) into 1024-dimensional latent vectors, (2) a GRU module which fuses N 1024-dimensional latent vectors into a single \(4\times 4\times 4\times 128\) tensor, and (3) a ResNet-based 3D decoder which decodes that tensor into a \(32\times 32\times 32\) voxel grid representing the 3D shape. Figure 4 shows the architecture of the AttSets based multi-view 3D reconstruction network, where the only difference is that the original GRU module is replaced by AttSets in the middle. This network is called Base\(_{\text {r2n2}}\)-AttSets.

  • Encoder–decoder of SilNet. The original SilNet consists of (1) a shared 2D encoder which encodes images of size \(127\times 127\times 3\), together with the image viewing angles, into 160-dimensional latent vectors, (2) a max pooling module which aggregates N latent vectors into a single vector, and (3) a 2D decoder which estimates an object silhouette (\(57\times 57\)) from the single latent vector and a new viewing angle. Instead of being explicitly supervised by 3D shape labels, SilNet aims to implicitly learn a 3D shape representation from multiple images via the supervision of 2D silhouettes. Figure 5 shows the architecture of the AttSets based SilNet, where the only difference is that the original max pooling is replaced by AttSets in the middle. This network is called Base\(_{\text {silnet}}\)-AttSets.

Fig. 4

The architecture of the Base\(_{\text {r2n2}}\)-AttSets multi-view 3D reconstruction network. The base encoder–decoder is the same as in 3D-R2N2

Fig. 5

The architecture of Base\(_{\text {silnet}}\)-AttSets for multi-view 3D shape learning. The base encoder–decoder is the same as SilNet

Competing Approaches We compare our AttSets and FASet with three groups of competing approaches. Note that all the following competing approaches are connected at the same location of the base encoder–decoder shown in the pink block of Figs. 4 and 5, with the same network configurations and training/testing settings.

  • RNNs. The original 3D-R2N2 makes use of the GRU (Choy et al. 2016; Kar et al. 2017) unit for feature aggregation and serves as a solid baseline.

  • First-order poolings. The widely used max/mean/sum pooling operations (Huang et al. 2018; Paschalidou et al. 2018; Eslami et al. 2018) are all implemented for comparison.

  • Higher-order poolings. We also compare with the state-of-the-art higher-order pooling approaches, including bilinear pooling (BP) (Lin et al. 2015), and the very recent MHBN (Yu et al. 2018) and SMSO poolings (Yu and Salzmann 2018).

Datasets All approaches are evaluated on four large open datasets.

  • ShapeNet\(_{\text {r2n2}}\) Dataset (Choy et al. 2016). The released 3D-R2N2 dataset consists of 13 categories of 43,783 common objects with synthesized RGB images from the large scale ShapeNet 3D repository (Chang et al. 2015). For each 3D object, 24 images are rendered from different viewing angles circling around each object. The train/test dataset split is 0.8:0.2.

  • ShapeNet\(_{\text {lsm}}\) Dataset (Kar et al. 2017). Both LSM and 3D-R2N2 datasets are generated from the same 3D ShapeNet repository (Chang et al. 2015), i.e., they have the same ground truth labels regarding the same object. However, the ShapeNet\(_{\text {lsm}}\) dataset has totally different camera viewing angles and lighting sources for the rendered RGB images. Therefore, we use the ShapeNet\(_{\text {lsm}}\) dataset to evaluate the robustness and generality of all approaches. All images of ShapeNet\(_{\text {lsm}}\) dataset are resized from \(224\times 224\) to \(127\times 127\) through linear interpolation.

  • ModelNet40 Dataset. ModelNet40 (Wu et al. 2015) consists of 12,311 objects belonging to 40 categories. The 3D models are split into 9,843 training samples and 2,468 testing samples. Each 3D model is voxelized into a \(30\times 30\times 30\) grid in (Qi et al. 2016), and 12 RGB images are rendered from different viewing angles. All 3D shapes are zero-padded to \(32\times 32\times 32\), and the images are linearly resized from \(224\times 224\) to \(127\times 127\) for training and testing.

  • Blobby Dataset (Wiles and Zisserman 2017). It contains 11,706 blobby objects. Each object has 5 RGB images paired with viewing angles and the corresponding silhouettes, which are generated from Cycles in Blender under different lighting sources and texture models.

Metrics The explicit 3D voxel reconstruction performance of Base\(_{\text {r2n2}}\)-AttSets and the competing approaches is evaluated on three datasets: ShapeNet\(_{\text {r2n2}}\), ShapeNet\(_{\text {lsm}}\) and ModelNet40. We use the mean Intersection-over-Union (IoU) (Choy et al. 2016) between predicted 3D voxel grids and their ground truth as the metric. The IoU for an individual voxel grid is formally defined as follows:

$$\begin{aligned} IoU = \frac{\sum _{i=1}^{L} \left[ I (h_i>p) * I(\bar{h_i}) \right] }{ \sum _{i=1}^{L} \left[ I \left( I(h_{i} >p) + I(\bar{h_i}) \right) \right] } \end{aligned}$$

where \(I(\cdot )\) is an indicator function, \(h_{i}\) is the predicted value for the ith voxel, \(\bar{h_i}\) is the corresponding ground truth, p is the binarization threshold, and L is the total number of voxels in a whole voxel grid. As there is no validation split in the above three datasets, to calculate the IoU scores we independently search for the optimal binarization threshold from 0.2 to 0.8 with a step of 0.05 for all approaches for fair comparison. In our experiments, we found that the optimal thresholds of the different approaches all end up at 0.3 or 0.35.
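
A compact re-implementation of this metric and the threshold search is sketched below (ours, for illustration only; the function names are not from the released evaluation code):

```python
import numpy as np

def voxel_iou(pred, gt, p):
    """pred: predicted occupancy probabilities; gt: binary ground truth voxels."""
    pred_bin = pred > p
    gt_bin = gt.astype(bool)
    union = np.logical_or(pred_bin, gt_bin).sum()
    if union == 0:
        return 1.0                                  # guard for empty grids (not stated in the paper)
    return np.logical_and(pred_bin, gt_bin).sum() / union

def best_mean_iou(preds, gts, thresholds=np.arange(0.2, 0.80001, 0.05)):
    """Search the binarization threshold that maximizes the mean IoU over a test set."""
    mean_ious = [np.mean([voxel_iou(p, g, t) for p, g in zip(preds, gts)])
                 for t in thresholds]
    best = int(np.argmax(mean_ious))
    return thresholds[best], mean_ious[best]
```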

The implicit 3D shape learning performance of Base\(_{\text {silnet}}\)-AttSets and the competing approaches is evaluated on the Blobby dataset. The mean IoU between predicted 2D silhouettes and the ground truth is used as the metric (Wiles and Zisserman 2017).

Table 1 Group 1: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 2 images per object in Stage 2, while other competing approaches are fine-tuned given 2 images per object in Stage 2
Table 2 Group 2: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 8 images per object in Stage 2, while other competing approaches are fine-tuned given 8 images per object in Stage 2
Table 3 Group 3: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 16 images per object in Stage 2, while other competing approaches are fine-tuned given 16 images per object in Stage 2

5.1 Evaluation on ShapeNet\(_{\text {r2n2}}\) Dataset

To fully evaluate the aggregation performance and robustness, we train the Base\(_{\text {r2n2}}\)-AttSets and its competing approaches on ShapeNet\(_{\text {r2n2}}\) dataset. For fair comparison, all networks (the pooling/GRU/AttSets based approaches) are trained according to the proposed two-stage training algorithm.

Training Stage 1 All networks are trained given only 1 image for each object, i.e., \(N=1\) in all training iterations, until convergence. Basically, this is to guarantee all networks are well optimized for the extreme case where there is only one input image.

Training Stage 2 To enable these networks to be more robust for multiple input images, all networks are further trained given more images per object. Particularly, we conduct the following five parallel groups of training experiments.

  • Group 1. All networks are further trained given only 2 images for each object, i.e., \(N=2\) in all iterations. As to our Base\(_{\text {r2n2}}\)-AttSets, the well-trained encoder–decoder from previous stage 1 is frozen, and we only optimize the AttSets module according to our FASet algorithm (Algorithm 1). As to the competing approaches, e.g., GRU and all poolings, we fine-tune the whole networks because they do not have separate parameters suitable for such special training. To be specific, we use a smaller learning rate (1e\({-}5\)) to carefully train these networks until convergence to achieve better performance when \(N=2\).

  • Group 2/3/4. Similarly, in these three groups of second-stage training experiments, N is set to be 8, 16, 24 separately.

  • Group 5. All networks are further trained until convergence, but N is uniformly and randomly sampled from [1, 24] for each object during training. In the above Group 1/2/3/4, N is fixed for each object, while N is dynamic for each object in this Group 5.

Table 4 Group 4: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 24 images per object in Stage 2, while other competing approaches are fine-tuned given 24 images per object in Stage 2
Table 5 Group 5: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given a random number of images per object in Stage 2, i.e., N is uniformly sampled from [1, 24], while other competing approaches are fine-tuned given a random number of views per object in Stage 2
Fig. 6

IoUs of Group 1

Fig. 7

IoUs of Group 2

Fig. 8

IoUs of Group 3

Fig. 9

IoUs of Group 4

Fig. 10

IoUs of Group 5

The above experiment Groups 1/2/3/4 are designed to investigate how all competing approaches are further optimized towards the statistics of a fixed N during training, thus resulting in different levels of robustness given an arbitrary N during testing. By contrast, the paradigm in Group 5 aims at enumerating all possible N values during training, so the overall performance is expected to be more robust regarding an arbitrary number of input images during testing, compared with the above Group 1/2/3/4 experiments.

Testing Stage All networks trained in the above five groups of experiments are separately tested given \(N = \{1, 2, 3, 4, 5, 8, 12, 16, 20, 24\}\). The permutations of input images are the same for all different approaches for fair comparison. Note that we do not test the networks which are only trained in Stage 1, because the AttSets module is not yet optimized and the corresponding Base\(_{\text {r2n2}}\)-AttSets is unable to generalize to multiple input images during testing. Therefore, it is meaningless to compare performance when the network is solely trained on a single image.

Results Tables 1, 2, 3, 4 and 5 show the mean IoU scores of all 13 categories for experiments of Group 1–5, while Figs. 6, 7, 8, 9 and 10 show the trends of mean IoU changes in different Groups. Figure 11 shows the estimated 3D shapes in experiment Group 5, with an increasing number of images from 1 to 5 for different approaches.

Fig. 11

Qualitative results of multi-view reconstruction from different approaches in experiment Group 5

Table 6 Per-category mean IoU for single view reconstruction on ShapeNet\(_{\text {r2n2}}\) testing split

We notice that the IoU scores on the ShapeNet data repository reported in the original LSM (Kar et al. 2017) are higher than our scores. However, the experimental settings in LSM (Kar et al. 2017) are quite different from ours in the following two aspects. (1) The original LSM requires both RGB images and the corresponding viewing angles as input, while our experiments do not. (2) The original LSM dataset has different styles of rendered color images and different train/test splits compared with our experimental settings. Therefore the reported IoU scores in LSM are not directly comparable with ours, and we do not include those results in this paper to avoid confusion. Note that the aggregation module of LSM (Kar et al. 2017), i.e., GRU, is the same as that used in 3D-R2N2 (Choy et al. 2016), and is indeed fully evaluated throughout our experiments.

To highlight the performance of single view 3D reconstruction, Table 6 shows the optimal per-category IoU scores for different competing approaches from experiments Group 1–5. In addition, we also compare with the state-of-the-art dedicated single view reconstruction approaches including OGN (Tatarchenko et al. 2017), AORM (Yang et al. 2018) and PointSet (Fan et al. 2017) in Table 6. Overall, our AttSets based approach outperforms all others by a large margin for either single view or multi view reconstruction, and generates much more compelling 3D shapes.

Table 7 Mean IoU for multi-view reconstruction of all 13 categories from ShapeNet\(_{\text {lsm}}\) dataset. All networks are well trained in previous experiment Group 5 of Sect. 5.1
Fig. 12

Qualitative results of multi-view reconstruction from different approaches in ShapeNet\(_{\text {lsm}}\) testing split

Analysis We investigate the results as follows:

  • The GRU based approach can generate reasonable 3D shapes in all experiment Groups 1–5 given either few or multiple images during testing, but the performance saturates quickly after more images are given, e.g., 8 views, because the recurrent unit is hardly able to capture features from longer image sequences, as illustrated in Fig. 9.

  • In Groups 1–4, all pooling based approaches are able to estimate satisfactory 3D shapes when given a similar number of images as given in training, but they are unlikely to predict reasonable shapes given an arbitrary number of images. For example, in experiment Group 4, all pooling based approaches have inferior IoU scores given only a few images, as shown in Table 4 and Fig. 9, because the pooled features from fewer images during testing are unlikely to be as general and representative as the pooled features from more images during training. Therefore, those models trained on 24 images fail to generalize well to only one image during testing.

  • In Group 5, as shown in Table 5 and Fig. 10, all pooling based approaches are much more robust compared with Groups 1–4, because the networks are optimized over an arbitrary number of images during training. However, these networks tend to end up with middling performance. Compared with Group 4, all approaches in Group 5 tend to have better performance when \(N=1\), while being worse when \(N=24\). Compared with Group 1, all approaches in Group 5 are likely to be better when \(N=24\), while being worse when \(N=1\). Basically, these networks tend to be optimized to learn the mean features overall.

  • In all experiments Group 1–5, all approaches tend to have better performance when given enough input images, i.e., \(N=24\), because more images are able to provide enough information for reconstruction.

  • In all experiment Groups 1–5, our AttSets based approach clearly outperforms all others in either single or multiple view 3D reconstruction, and it is more robust to a variable number of input images. Our FASet algorithm completely decouples the base network to learn visual features for accurate single view reconstruction, as illustrated in Fig. 9, while the trainable parameters of the AttSets module are separately responsible for learning attention scores for better multi-view reconstruction, as shown in Fig. 9. Therefore, the whole network does not suffer from the limitations of GRU or pooling approaches, and achieves better performance for reconstruction from either fewer or more images.

Table 8 Group 1: mean IoU for multi-view reconstruction of all 40 categories in ModelNet40 testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 12 images per object in Stage 2, while other competing approaches are fine-tuned given 12 images per object in Stage 2
Table 9 Group 2: mean IoU for multi-view reconstruction of all 40 categories in ModelNet40 testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given a random number of images per object in Stage 2, i.e., N is uniformly sampled from [1, 12], while other competing approaches are fine-tuned given a random number of views per object in Stage 2

5.2 Evaluation on ShapeNet\(_{\text {lsm}}\) Dataset

To further investigate how well the learnt visual features and attention scores generalize across different styles of images, we use the well-trained networks of the previous Group 5 of Sect. 5.1 to test on the large ShapeNet\(_{\text {lsm}}\) dataset. Note that we only borrow the synthesized images from the ShapeNet\(_{\text {lsm}}\) dataset corresponding to the objects in the ShapeNet\(_{\text {r2n2}}\) testing split. This guarantees that all the trained models have never seen either the style of the LSM rendered images or the 3D object labels before. The image viewing angles from the original ShapeNet\(_{\text {lsm}}\) dataset are not used in our experiments, since the Base\(_{\text {r2n2}}\) network does not require image viewing angles as input. Table 7 shows the mean IoU scores of all approaches, while Fig. 12 shows the qualitative results.

Our AttSets based approach outperforms all others given either few or multiple input images. This demonstrates that our Base\(_{\text {r2n2}}\)-AttSets approach does not overfit the training data, but has better generality and robustness over new styles of rendered color images compared with other approaches.

5.3 Evaluation on ModelNet40 Dataset

We train the Base\(_{\text {r2n2}}\)-AttSets and its competing approaches on ModelNet40 dataset from scratch. For fair comparison, all networks (the pooling/GRU/AttSets based approaches) are trained according to the proposed FASet algorithm, which is similar to the two-stage training strategy of Sect. 5.1.

Training Stage 1 All networks are trained given only 1 image for each object, i.e., \(N=1\) in all training iterations, until convergence. This guarantees all networks are well optimized for single view 3D reconstruction.

Fig. 13

Qualitative results of multi-view reconstruction from different approaches in ModelNet40 testing split

Training Stage 2 We further conduct the following two parallel groups of training experiments to optimize the networks for multi-view reconstruction.

  • Group 1. All networks are further trained given all 12 images for each object, i.e., \(N=12\) in all iterations, until convergence. As to our Base\(_{\text {r2n2}}\)-AttSets, the well-trained encoder–decoder from previous Stage 1 is frozen, and only the AttSets module is trained. All other competing approaches are fine-tuned using a smaller learning rate (1e\({-}5\)) in this stage.

  • Group 2. All networks are further trained until convergence, but N is uniformly and randomly sampled from [1, 12] for each object during training. Only the AttSets module is trained, while all other competing approaches are fine-tuned in this Stage 2.

Table 10 Group 1: mean IoU for silhouettes prediction on the Blobby dataset. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 2 images per object, i.e., N=2, while other competing approaches are fine-tuned given 2 views per object in Stage 2

Testing Stage All networks trained in above two groups are separately tested given \(N=[1,2,3,4,5,8,12]\). The permutations of input images are the same for all different approaches for fair comparison.

Results Tables 8 and 9 show the mean IoU scores of Groups 1 and 2 respectively, and Fig. 13 shows qualitative results of Group 2. The Base\(_{\text {r2n2}}\)-AttSets surpasses all competing approaches by a large margin for both single and multiple view 3D reconstructions, and all the results are consistent with previous experimental results on both ShapeNet\(_{\text {r2n2}}\) and ShapeNet\(_{\text {lsm}}\) datasets.

5.4 Evaluation on Blobby Dataset

In this section, we evaluate the Base\(_{\text {silnet}}\)-AttSets and the competing approaches on the Blobby dataset. For fair comparison, the GRU module is implemented with a single fully connected layer of 160 hidden units, which has similar network capacity to our AttSets based network. All networks (the pooling/GRU/AttSets based approaches) are trained with the proposed two-stage FASet algorithm as follows:

Table 11 Group 2: mean IoU for silhouettes prediction on the Blobby dataset. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 4 images per object, i.e., N=4, while other competing approaches are fine-tuned given 4 views per object in Stage 2
Fig. 14

Qualitative results of silhouettes prediction from different approaches on the Blobby dataset

Training Stage 1 All networks are trained given only 1 image together with the viewing angle for each object, i.e., N=1 in all training iterations, until convergence. This guarantees the performance of single view shape learning.

Training Stage 2 Another two parallel groups of training experiments are conducted to further optimize the networks for multi-view shape learning.

  • Group 1. All networks are further trained given only 2 images for each object, i.e., N=2 in all iterations. As to Base\(_{\text {silnet}}\)-AttSets, only the AttSets module is optimized, with the well-trained base encoder–decoder being frozen. For fair comparison, all competing approaches are fine-tuned given 2 images per object until convergence for better performance when N=2.

  • Group 2. Similar to the above Group 1, all networks are further trained given all 4 images for each object, i.e., N=4, until convergence.

Testing Stage All networks trained in above two groups are separately tested given N = [1,2,3,4]. The permutations of input images are the same for all different networks for fair comparison.

Results Tables 10 and 11 show the mean IoUs of the above two groups of experiments and Fig. 14 shows the qualitative results of Group 2. Note that the IoUs are calculated on predicted 2D silhouettes instead of 3D voxels, so they are not numerically comparable with the previous experiments on the ShapeNet\(_{\text {r2n2}}\), ShapeNet\(_{\text {lsm}}\), and ModelNet40 datasets. We do not include the IoU scores of the original SilNet (Wiles and Zisserman 2017), because those scores are obtained with an end-to-end training strategy. In this paper, we uniformly apply the proposed two-stage FASet training paradigm to all approaches for fair comparison. Our Base\(_{\text {silnet}}\)-AttSets consistently outperforms all competing approaches for shape learning from either single or multiple views.

Fig. 15

Qualitative results of multi-view 3D reconstruction from real-world images

5.5 Qualitative Results on Real-World Images

To the best of our knowledge, there is no public real-world dataset for multi-view 3D object reconstruction. Therefore, we manually collect real world images from Amazon online shops to qualitatively demonstrate the generality of all networks which are trained on the synthetic ShapeNet\(_{\text {r2n2}}\) dataset in experiment Group 4 of Sect. 5.1, as shown in Fig. 15.

In the meantime, we use these real-world images to qualitatively show the permutation invariance of different approaches. In particular, for each object, we use 6 different permutations in total for testing. As shown in Fig. 16, the GRU based approach generates inconsistent 3D shapes given different image permutations. For example, the arm of a chair and the leg of a table can be reconstructed in permutation 1, but fail to be recovered in another permutation. By comparison, all other approaches are permutation invariant, as shown by the results in Fig. 15.

Fig. 16

Qualitative results of inconsistent 3D reconstruction from the GRU based approach

Table 12 Mean time consumption for a single object (\(32^3\) voxel grid) estimation from different number of images (milliseconds)

5.6 Computational Efficiency

To evaluate the computation and memory cost of AttSets, we implement Base\(_{\text {r2n2}}\)-AttSets and the competing approaches in Python 2.7 and Tensorflow 1.2 with CUDA 9.0 and cuDNN 7.1 as the back-end driver and library. All approaches share the same Base\(_{\text {r2n2}}\) network and run on the same Titan X GPU and software environment. Table 12 shows the average time consumption to reconstruct a single 3D object given different numbers of images. Our AttSets based approach is as efficient as the pooling methods, while Base\(_{\text {r2n2}}\)-GRU (i.e., 3D-R2N2) takes more time when processing an increasing number of images due to the sequential computation mechanism of its GRU module. In terms of the total trainable weights, the max/mean/sum pooling based approaches have 16.66 million, while the AttSets based net has 17.71 million. By contrast, the original 3D-R2N2 has 34.78 million, and the BP/MHBN/SMSO approaches have 141.57, 60.78 and 17.71 million respectively. Overall, our AttSets outperforms the recurrent unit and pooling operations without incurring notable computation and memory cost.

Table 13 Mean IoU of AttSets variants on all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split

5.7 Comparison Between Variants of AttSets

We further compare the aggregation performance of fc, conv2d and conv3d based AttSets variants which are shown in Fig. 3 in Sect. 3.4. The fc based AttSets net is the same as in Sect. 5.1. The conv2d based AttSets is plugged into the middle of the 2D encoder, fusing a (N, 4, 4, 256) tensor into (1, 4, 4, 256), where N is an arbitrary image number. The conv3d based AttSets is plugged into the middle of the 3D decoder, integrating a (N, 8, 8, 8, 128) tensor into (1, 8, 8, 8, 128). All other layers of these variants are the same. Both the conv2d and conv3d based AttSets networks are trained using the paradigm of experiment Group 4 in Sect. 5.1. Table 13 shows the mean IoU scores of three variants on ShapeNet\(_{\text {r2n2}}\) testing split. fc and conv3d based variants achieve similar IoU scores for either single or multi view 3D reconstruction, demonstrating the superior aggregation capability of AttSets. In the meantime, we observe that the overall performance of conv2d based AttSets net is slightly decreased compared with the other two. One possible reason is that the 2D feature set has been aggregated at the early layer of the network, resulting in features being lost early. Figure 17 visualizes the learnt attention scores for a 2D feature set, i.e., (N, 4, 4, 256) features, via the conv2d based AttSets net. To visualize 2D feature scores, we average the scores along the channel axis and then roughly trace back the spatial locations of those scores corresponding to the original input. The more visual information the input image has, the higher attention scores are learnt by AttSets for the corresponding latent features. For example, the third image has richer visual information than the first image, so its attention scores are higher. Note that, for a specific base network, there are many potential locations to plug in AttSets and it is also possible to include multiple AttSets modules into the same net. Fully evaluating these factors is out of the scope of this paper.

Fig. 17

Learnt attention scores for deep feature sets via conv2d based AttSets

Table 14 Mean IoU of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split for feature-wise and element-wise attentional aggregation

5.8 Feature-Wise Attention versus Element-Wise Attention

Our AttSets module is initially designed to learn unique feature-wise attention scores for the whole input deep feature set, and we demonstrate that it significantly improves the aggregation performance over dynamic feature sets in previous Sects. 5.1, 5.2, 5.3 and 5.4. In this section, we further investigate the advantage of this feature-wise attentive pooling over element-wise attentional aggregation.

Fig. 18

Learnt attention scores for deep feature sets via element-wise attention and feature-wise attention AttSets

Table 15 Mean IoU of different training algorithms on all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split

For element-wise attentional aggregation, the AttSets module instead learns a single attention score for each element of the feature set \({\mathcal {A}} = \{{\varvec{x}}_1, {\varvec{x}}_2, \ldots , {\varvec{x}}_N\}\), followed by softmax normalization and weighted summation pooling. In particular, as shown in the previous Fig. 2, the shared function \(g({\varvec{x}}_n, {\varvec{W}})\) now learns a scalar, instead of a vector, as the attention activation for each input element. Eventually, all features within the same element are weighted by a learnt common attention score. Intuitively, the original feature-wise AttSets performs fine-grained aggregation, while the element-wise AttSets learns to aggregate features coarsely.
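
The two variants differ only in the shape of the attention activation produced by g, as the sketch below illustrates (ours, on the toy fc setting from Sect. 3.2; not the trained network):

```python
import numpy as np

def attsets_element_wise(A, w):
    """A: (N, D) feature set;  w: (D,) weights, giving one scalar score per element."""
    c = A @ w                                                   # (N,) one activation per element
    s = np.exp(c - c.max()) / np.exp(c - c.max()).sum()         # softmax over the N elements
    return (A * s[:, None]).sum(axis=0)                         # all D features of element n share s[n]

def attsets_feature_wise(A, W):
    """A: (N, D) feature set;  W: (D, D) weights, giving one score per latent feature."""
    C = A @ W
    C = C - C.max(axis=0, keepdims=True)
    S = np.exp(C) / np.exp(C).sum(axis=0, keepdims=True)        # softmax over N, independently per feature
    return (A * S).sum(axis=0)

A = np.random.randn(4, 16)
print(attsets_element_wise(A, np.random.randn(16)).shape)       # (16,)
print(attsets_feature_wise(A, np.random.randn(16, 16)).shape)   # (16,)
```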

Following the same training settings as experiment Group 4 in Sect. 5.1, we conduct another group of experiments on the ShapeNet\(_{\text {r2n2}}\) dataset for element-wise attentional aggregation. Table 14 compares the mean IoU for 3D object reconstruction through feature-wise and element-wise attentional aggregation. Figure 18 shows an example of the learnt attention scores and the predicted 3D shapes. As expected, the feature-wise attention mechanism clearly achieves better aggregation performance than the coarse element-wise approach. As shown in Fig. 18, the element-wise attention mechanism tends to focus on a few images while completely ignoring the others. By comparison, the feature-wise AttSets learns to fuse information from all images, thus achieving better aggregation performance.

5.9 Significance of FASet Algorithm

In this section, we investigate the impact of the FASet algorithm by comparing it with standard end-to-end joint training (JoinT). Particularly, in JoinT, all parameters \(\varTheta _{base}\) and \(\varTheta _{att}\) are jointly optimized with a single loss. Following the same training settings as experiment Group 4 in Sect. 5.1, we conduct another group of experiments on the ShapeNet\(_{\text {r2n2}}\) dataset under the JoinT training strategy. As shown by its IoU scores in Table 15, the JoinT training approach tends to optimize the whole net with respect to the multi-view training batches, and is thus unable to generalize well to fewer images during testing. Basically, the network itself is unable to dedicate the base layers to learning visual features and the AttSets module to learning attention scores, if it is not trained with the proposed FASet algorithm. The theoretical reason is discussed previously in Sect. 4.1. The FASet algorithm may also be applicable to other learning based aggregation approaches, as long as the aggregation module can be decoupled from the base encoder/decoder.

6 Conclusion

In this paper, we present the AttSets module and the FASet training algorithm to aggregate elements of deep feature sets. AttSets together with FASet offers permutation invariance, computational efficiency, robustness and implementation flexibility, and we provide both theory and extensive experiments to support its performance for multi-view 3D reconstruction. Both quantitative and qualitative results explicitly show that AttSets significantly outperforms other widely used aggregation approaches. Nevertheless, all of our experiments are dedicated to multi-view 3D reconstruction. It would be interesting to explore the generality of AttSets and FASet on other set-based tasks, especially tasks which constantly take multiple elements as input.