
1 Introduction

In the past few years, the performance of generic image recognition on large-scale datasets (e.g., ImageNet [8], Places [56]) has undergone unprecedented improvements, thanks to breakthroughs in the design and training of deep neural networks (DNNs). This fast-paced research progress has also drawn the attention of industry, leading to software like Google Lens that lets smartphone users recognize whatever they snapshot. Yet recognizing the fine-grained category of daily objects such as car models, animal species or food dishes remains a challenging task for existing methods. The reason is that the global geometry and appearance of fine-grained classes can be very similar, so identifying their subtle differences on key parts is of vital importance. For instance, to differentiate the two dog species in Fig. 1, it is important to consider their discriminative features on the ear, tail and body length, which are extremely difficult to notice even for humans without domain expertise.

Fig. 1. Two distinct dog species from the proposed Dogs-in-the-Wild dataset. Our method is capable of capturing the subtle differences on the head and tail without manual part annotations.

Thus the majority of efforts in the fine-grained community focus on how to effectively integrate part localization into the classification pipeline. In the pre-DNN era, various parametric [9, 24, 29] and non-parametric [25] part models were employed to extract discriminative part-specific features. Recently, with the popularity of DNNs, object part localization and feature representation can both be learned in a more effective way [2, 18, 22, 48, 49]. The major drawback of these strongly-supervised methods, however, is that they rely heavily on manual object part annotations, which are too expensive to be widely applied in practice. Therefore, weakly-supervised frameworks have received increasing attention in recent research. For instance, the attention mechanism can be implemented as sequential decision processes [27] or multi-stream part selections [10] without the need for part annotations. Despite this great progress, these methods still suffer from several limitations. First, their additional steps, such as part localization and feature extraction for the attended regions, can incur expensive computational cost. Second, their training procedures are sophisticated, requiring multiple alternations or cascaded stages due to complex architecture designs. More importantly, most works tend to detect the object parts in isolation, while neglecting their inherent correlations. As a consequence, the learned attention modules are likely to focus on the same region and lack the capability to localize multiple parts with discriminative features that can differentiate between similar fine-grained classes.

From extensive experimental studies, we observe that an effective visual attention mechanism for fine-grained classification should follow three criteria: (1) the detected parts should be well spread over the object body to extract non-correlated features; (2) each part feature alone should be discriminative for separating objects of different classes; (3) the part extractors should be lightweight so that they can be scaled up for practical applications. To meet these demands, this paper presents a novel framework with two major improvements. First, we propose the one-squeeze multi-excitation module (OSME) to localize different parts, inspired by the latest ImageNet winner SENet [13]. It is fully differentiable and can directly extract part features at a budgeted computational cost. Second, inspired by metric learning losses, we propose the multi-attention multi-class constraint (MAMC) to coherently enforce the correlations among different parts during training. In addition, we have released a large-scale dog species dataset named Dogs-in-the-Wild, which exhibits higher category coverage, data volume and annotation quality than similar public datasets. Experimental results show that our method achieves substantial improvements on four benchmark datasets. Moreover, our method can easily be trained end-to-end; unlike many existing methods that require multiple feedforward passes for feature extraction [41, 52] or multiple alternating training stages [10, 31], only one stage and one feedforward pass are required for each training step.

2 Related Work

2.1 Fine-Grained Image Recognition

In the task of fine-grained image recognition, since the inter-class differences are subtle, more specialized techniques, including discriminative feature learning and object part localization, need to be applied. A straightforward way is supervised learning with manual object part annotations, which has shown promising results in classifying birds [2, 9, 48, 49], dogs [16, 25, 29, 48], and cars [17, 20, 24]. However, it is usually laborious and expensive to obtain object part annotations, which severely restricts the applicability of such methods.

Consequently, more recently proposed methods tend to localize object parts with weakly-supervised mechanisms, such as the combination of pose alignment and co-segmentation [18], dynamic spatial transformation of the input image for better alignment [14], and parallel CNNs for bilinear feature extraction [23].

Compared with previous works, our method also adopts a weakly-supervised mechanism, but it directly extracts the part features without cropping them out and can be efficiently scaled up to multiple parts.

In recent years, more advanced methods have emerged with improved results. For instance, bipartite-graph labeling [57] leverages the label hierarchy over the fine-grained classes, which is less expensive to obtain. The work in [51] exploits a unified CNN framework with a spatially weighted representation based on the Fisher vector [30]. [3] and [45] incorporate human knowledge and various computer vision algorithms into a human-in-the-loop framework to combine the complementary strengths of both. In [34], average and bilinear pooling are combined so that the pooling strategy is learned during training. [6] uses dataset bootstrapping with human help, and in [50] the label structures are exploited. These techniques can also potentially be combined with our method in future work.

2.2 Visual Attention

The aforementioned part-based methods have shown strong performance in fine-grained image recognition. Nevertheless, one of their major drawbacks is that they need meaningful definitions of the object parts, which are hard to obtain for non-structured objects such as flowers [28] and food dishes [1]. Therefore, methods that enable CNNs to attend to loosely defined regions of general objects have emerged as a promising direction.

For instance, the soft proposal network [58] combines random walks and CNNs for object proposals. The works in [52] and [26] introduce long short-term memory [12] and reinforcement learning to attention-based classification, respectively. Class activation mapping [55] generates a heatmap of the input image, which provides a better way to visualize attention. On the other hand, the idea of multi-scale feature fusion or recurrent learning has become increasingly popular in recent works. For instance, the work in [31] extends [55] into a cascaded multi-stage framework, which refines the attention region by iteration. The residual attention network [41] obtains the attention mask of the input image by up-sampling and down-sampling, and a series of such attention modules are stacked for feature map refinement. The recurrent attention CNN [10] alternates between the optimization of softmax and pairwise ranking losses, which jointly contribute to the final feature fusion. An acceleration method [21] based on reinforcement learning has even been proposed specifically for the recurrent attention models above.

In parallel to these efforts, our method not only automatically localizes the attention regions, but also directly captures the corresponding features without explicitly cropping the ROIs and feedforwarding them again, which makes our method highly efficient.

2.3 Metric Learning

Apart from the techniques above, deep metric learning aims to learn an appropriate similarity measurement between sample pairs, which provides another promising direction for fine-grained image recognition. The pioneering work on Siamese networks [4] formulates deep metric learning with a contrastive loss that minimizes the distance between positive pairs while keeping negative pairs apart. Despite its great success on face verification [33], contrastive embedding requires that the training data contain precise real-valued pair-wise similarities or distances. The triplet loss [32] addresses this issue by optimizing the relative distances of the positive pair and one negative pair within a triplet of samples. Triplet loss has proven extremely effective for fine-grained product search [43], and was later improved to automatically search for discriminative patches [44]. Nevertheless, compared with the softmax loss, the triplet loss is difficult to train due to its slow convergence. To alleviate this issue, the N-pair loss [37] considers multiple negative samples during training and exhibits higher efficiency and performance. More recently, the angular loss [42] enhances the N-pair loss by integrating a high-order constraint that captures additional local structure of triplet triangles.

Our method differs from previous metric learning works in two aspects: first, we take object parts instead of whole images as instances in the feature learning process; second, our formulation simultaneously considers the part and class labels of each instance.

Fig. 2. Overview of our network architecture. Here we visualize the case of learning two attention branches given a training batch with four images of two classes. The MAMC and softmax losses would be replaced by a softmax layer in testing. Unlike hard-attention methods like [10], we do not explicitly crop the parts out. Instead, the feature maps (\(\mathbf {S}^1\) and \(\mathbf {S}^2\)) generated by the two branches provide soft response for attention regions such as the birds’ head or torso, respectively.

3 Proposed Method

In this section, we present our proposed method, which can efficiently and accurately attend to discriminative regions despite being trained only with image-level labels. As shown in Fig. 2, the framework of our method is composed of two parts: (1) a differentiable one-squeeze multi-excitation (OSME) module that extracts features from multiple attention regions with a slight increase in computational burden; (2) a multi-attention multi-class (MAMC) constraint that enforces the correlation of the attention features in favor of the fine-grained classification task. In contrast to many prior works, the entire network of our method can be effectively trained end-to-end in one stage.

3.1 One-Squeeze Multi-excitation Attention Module

There have been a number of visual attention models exploring weakly supervised part localization, and the previous works can be roughly categorized into two groups. The first type of attention is also known as part detection, i.e., each attention is equivalent to a bounding box covering a certain area. Well-known examples include the early work on recurrent visual attention [27], the spatial transformer networks [14], and the recent recurrent attention CNN [10]. This hard-attention setup can benefit a lot from the object detection community in its formulation and training. However, its architectural design is often cumbersome, as part detection and feature extraction are separated in different modules. The second type of attention can be considered as imposing a soft mask on the feature map, which originates from activation visualization [46, 54]. It was later found that such masks can be extended to localize parts [31, 55] and to improve the overall recognition performance [13, 41]. Our approach also falls into this category. We adopt the idea of SENet [13], the latest ImageNet winner, to capture and describe multiple discriminative regions in the input image. Compared to other soft-attention works [41, 55], we build on SENet because of its superior performance and scalability in practice.

As shown in Fig. 2, our framework is a feedforward neural network where each image is first processed by a base network, e.g., ResNet-50 [11]. Let \(\mathbf {x}\in \mathbb {R}^{W' \times H' \times C'}\) denote the input fed into the last residual block \(\tau \). The goal of SENet is to re-calibrate the output feature map,

$$\begin{aligned} \mathbf {U}= \tau (\mathbf {x}) = [ \mathbf {u}_1, \cdots , \mathbf {u}_C ] \in \mathbb {R}^{W \times H \times C}, \end{aligned}$$
(1)

through a pair of squeeze-and-excitation operations. In order to generate P attention-specific feature maps, we extend the idea of SENet by performing one-squeeze but multi-excitation operations.

In the first one-squeeze step, we aggregate the feature maps \(\mathbf {U}\) across spatial dimensions \(W \times H\) to produce a channel-wise descriptor \(\mathbf {z}= [z_1, \cdots , z_C] \in \mathbb {R}^{C}\). The global average pooling is adopted as a simple but effective way to describe each channel statistic:

$$\begin{aligned} z_c = \frac{1}{W H} \sum _{w=1}^W \sum _{h=1}^H \mathbf {u}_c(w,h). \end{aligned}$$
(2)

In the second multi-excitation step, a gating mechanism is independently employed on \(\mathbf {z}\) for each attention \(p = 1, \cdots , P\):

$$\begin{aligned} \mathbf {m}^p= \sigma \Big ( \mathbf {W}_{2}^p \delta (\mathbf {W}_{1}^p \mathbf {z}) \Big ) = [m^p_1, \cdots , m^p_C] \in \mathbb {R}^C, \end{aligned}$$
(3)

where \(\sigma \) and \(\delta \) refer to the Sigmoid and ReLU functions respectively. We adopt the same design as SENet by forming a pair of dimensionality-reduction and dimensionality-increasing layers parameterized by \(\mathbf {W}_{1}^p \in \mathbb {R}^{\frac{C}{r} \times {C}}\) and \(\mathbf {W}_{2}^p \in \mathbb {R}^{{C} \times \frac{C}{r}}\). Because of the property of the Sigmoid function, each \(\mathbf {m}^p\) encodes a non-mutually-exclusive relationship among channels. We therefore use it to re-weight the channels of the original feature map \(\mathbf {U}\),

$$\begin{aligned} \mathbf {S}^p = [m^p_1 \mathbf {u}_1, \cdots , m^p_{C} \mathbf {u}_C] \in \mathbb {R}^{W \times H \times C}. \end{aligned}$$
(4)

To extract attention-specific features, we feed each attention map \(\mathbf {S}^p\) to a fully connected layer \(\mathbf {W}_3^p \in \mathbb {R}^{D \times WHC}\):

$$\begin{aligned} \mathbf {f}^p = \mathbf {W}_{3}^p {{\mathrm{vec}}}(\mathbf {S}^p) \in \mathbb {R}^{D}, \end{aligned}$$
(5)

where the operator \({{\mathrm{vec}}}(\cdot )\) flattens a matrix into a vector.

In a nutshell, the proposed OSME module seeks to extract P feature vectors \(\{\mathbf {f}^p\}_{p=1}^P\) for each image \(\mathbf {x}\) by adding a few layers on top of the last residual block. Its simplicity enables the use of relatively deep base networks and an efficient one-stage training pipeline.
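To make the OSME computation concrete, the following is a minimal PyTorch sketch of Eqs. 1 to 5. It is an illustrative re-implementation under our own assumptions (PyTorch, NCHW tensor layout, a 14\(\times \)14 feature map from a 448\(\times \)448 input, and the hyperparameters reported later in Sect. 5), not the authors' released Caffe code.

```python
import torch
import torch.nn as nn


class OSME(nn.Module):
    """One-squeeze multi-excitation module (sketch of Eqs. 1-5).

    Takes the feature map U produced by the last residual block and
    returns P attention-specific feature vectors f^1, ..., f^P.
    """

    def __init__(self, channels=2048, height=14, width=14,
                 num_attentions=2, reduction=16, feat_dim=1024):
        super().__init__()
        self.num_attentions = num_attentions
        # One shared squeeze: global average pooling over W x H (Eq. 2).
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # P independent excitation branches (Eq. 3).
        self.excitations = nn.ModuleList([
            nn.Sequential(
                nn.Linear(channels, channels // reduction),   # W_1^p
                nn.ReLU(inplace=True),                        # delta
                nn.Linear(channels // reduction, channels),   # W_2^p
                nn.Sigmoid(),                                 # sigma
            )
            for _ in range(num_attentions)])
        # P fully connected projections on the flattened maps (Eq. 5).
        # Note: each W_3^p has D x (W*H*C) parameters, which is large at C=2048, 14x14.
        self.projections = nn.ModuleList([
            nn.Linear(channels * height * width, feat_dim)    # W_3^p
            for _ in range(num_attentions)])

    def forward(self, u):
        # u: output of the last residual block, shape (B, C, H, W).
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                        # Eq. 2
        feats = []
        for p in range(self.num_attentions):
            m = self.excitations[p](z)                        # Eq. 3, shape (B, C)
            s = u * m.view(b, c, 1, 1)                        # Eq. 4, channel re-weighting
            feats.append(self.projections[p](s.flatten(1)))   # Eq. 5, shape (B, D)
        return feats


# Toy usage with a deliberately small feature map to keep the example cheap to run.
if __name__ == "__main__":
    osme = OSME(channels=256, height=7, width=7, num_attentions=2, feat_dim=64)
    f1, f2 = osme(torch.randn(4, 256, 7, 7))
    print(f1.shape, f2.shape)  # torch.Size([4, 64]) twice
```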

It is worth clarifying that SENet was not originally designed for learning visual attention. By adopting the key idea of SENet, our proposed OSME module implements a lightweight yet effective attention mechanism that enables end-to-end, one-stage training on large-scale fine-grained datasets.

3.2 Multi-attention Multi-class Constraint

Apart from the attention mechanism introduced in Sect. 3.1, the other crucial problem is how to guide the extracted attention features towards the correct class label. A straightforward way is to directly evaluate the softmax loss on the concatenated attention features [14]. However, the softmax loss is unable to regulate the correlations between attention features. As an alternative, another line of research [10, 26, 27] tends to mimic human perception with a recurrent search mechanism. These approaches iteratively generate the attention region from coarse to fine by taking previous predictions as references. Their limitation, however, is that the current prediction is highly dependent on the previous one, so the initial error can be amplified over iterations. In addition, they require advanced techniques such as reinforcement learning or careful initialization in a multi-stage training scheme. In contrast, we take a more practical approach by directly enforcing the correlations between parts during training. There is some prior work such as [44] that introduces geometrical constraints on local patches. Our method, on the other hand, explores much richer correlations of object parts through the proposed multi-attention multi-class constraint (MAMC).

Suppose that we are given a set of training images \(\{(\mathbf {x}, y), \cdots \}\) of K fine-grained classes, where \(y = 1, \cdots , K\) denotes the label associated with the image \(\mathbf {x}\). To model both the within-image and inter-class attention relations, we construct each training batch, \(\mathcal {B}= \{(\mathbf {x}_{i}, \mathbf {x}_{i}^+, y_i)\}_{i=1}^N\), by sampling N pairs of images similar to [37]. For each pair \((\mathbf {x}_i, \mathbf {x}_i^+)\) of class \(y_i\), the OSME module extracts P attention features \(\{\mathbf {f}_i^p, \mathbf {f}_i^{p+}\}_{p=1}^P\) from multiple branches according to Eq. 5.
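To make this sampling concrete, a minimal sketch of the batch construction is given below; the dictionary-based data layout and the function name are our own illustration rather than the paper's implementation, and the labels can be any hashable class identifiers.

```python
import random


def sample_npair_batch(images_by_class, n_pairs):
    """Build one batch B = {(x_i, x_i^+, y_i)}_{i=1..N}: N distinct classes,
    two images per class, i.e. 2N images in total."""
    classes = random.sample(sorted(images_by_class), n_pairs)
    batch = []
    for y in classes:
        x, x_pos = random.sample(images_by_class[y], 2)  # a positive pair of class y
        batch.append((x, x_pos, y))
    return batch


# Toy usage: 5 classes with a handful of image identifiers each.
images_by_class = {y: [f"img_{y}_{k}.jpg" for k in range(6)] for y in range(5)}
print(sample_npair_batch(images_by_class, n_pairs=3))
```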

Given 2N samples in each batch (Fig. 3a), our intuition comes from the natural clustering of the 2NP features (Fig. 3b) extracted by the OSME modules. Picking \(\mathbf {f}_i^p\), which corresponds to the \(i^{th}\) class and \(p^{th}\) attention region, as the anchor, we divide the remaining features into four groups:

  • same-attention same-class features, \(\mathcal {S}_{sasc}(\mathbf {f}_i^p) = \{\mathbf {f}_i^{p+} \}\);

  • same-attention different-class features, \(\mathcal {S}_{sadc}(\mathbf {f}_i^p) = \{ \mathbf {f}_j^p, \mathbf {f}_j^{p+} \}_{j \ne i}\);

  • different-attention same-class features, \(\mathcal {S}_{dasc}(\mathbf {f}_i^p) = \{ \mathbf {f}_i^{q}, \mathbf {f}_i^{q+} \}_{q \ne p}\);

  • different-attention different-class features, \(\mathcal {S}_{dadc}(\mathbf {f}_i^p) = \{ \mathbf {f}_j^q, \mathbf {f}_j^{q+} \}_{j \ne i, q \ne p}\).

Our goal is to excavate the rich correlations among the four groups in a metric learning framework. As summarized in Fig. 3c, we compose three types of triplets according to the choice of the positive set for the anchor \(\mathbf {f}_i^p\). To keep notation concise, we omit \(\mathbf {f}_i^p\) in the following equations.

Same-attention same-class positives. The most similar feature to the anchor \(\mathbf {f}_i^p\) is \(\mathbf {f}_i^{p+}\), while all the other features should lie at a larger distance from the anchor. The positive and negative sets are then defined as:

$$\begin{aligned} \mathcal {P}_{sasc} = \mathcal {S}_{sasc}, \ \mathcal {N}_{sasc} = \mathcal {S}_{sadc} \cup \mathcal {S}_{dasc} \cup \mathcal {S}_{dadc}. \end{aligned}$$
(6)

Same-attention different-class positives. Features from different classes but extracted from the same attention region should be more similar to the anchor than those differing in both attention and class:

$$\begin{aligned} \mathcal {P}_{sadc} = \mathcal {S}_{sadc}, \ \mathcal {N}_{sadc} = \mathcal {S}_{dadc}. \end{aligned}$$
(7)

Different-attention same-class positives. Similarly, for features from the same class but extracted from different attention regions, we have:

$$\begin{aligned} \mathcal {P}_{dasc} = \mathcal {S}_{dasc}, \ \mathcal {N}_{dasc} = \mathcal {S}_{dadc}. \end{aligned}$$
(8)

For any combination of positive set \(\mathcal {P}\in \{\mathcal {P}_{sasc}, \mathcal {P}_{sadc}, \mathcal {P}_{dasc} \}\) and negative set \(\mathcal {N}\in \{\mathcal {N}_{sasc}, \mathcal {N}_{sadc}, \mathcal {N}_{dasc} \}\), we expect the anchor to be closer to the positive than to any negative by a distance margin \(m > 0\), i.e.,

$$\begin{aligned} \Vert \mathbf {f}_i^p - \mathbf {f}^+ \Vert ^2 + m \le \Vert \mathbf {f}_i^p - \mathbf {f}^- \Vert ^2, \ \forall \mathbf {f}^+ \in \mathcal {P}, \mathbf {f}^- \in \mathcal {N}. \end{aligned}$$
(9)
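To make the grouping explicit, the following small Python sketch enumerates the four groups and the three positive/negative combinations of Eqs. 6 to 8 for a single anchor, using (class, attention, slot) indices; this bookkeeping scheme is our own illustration, not the authors' code.

```python
from itertools import product


def mamc_sets(n_pairs, n_parts, anchor):
    """Enumerate the four groups of Sect. 3.2 for an anchor (i, p, 0), where each
    feature is indexed by (class i, attention p, slot s); slot 0/1 denotes x_i / x_i^+."""
    i, p, _ = anchor
    feats = [(j, q, s)
             for j, q, s in product(range(n_pairs), range(n_parts), (0, 1))
             if (j, q, s) != anchor]
    sasc = [f for f in feats if f[0] == i and f[1] == p]   # same attention, same class
    sadc = [f for f in feats if f[0] != i and f[1] == p]   # same attention, diff. class
    dasc = [f for f in feats if f[0] == i and f[1] != p]   # diff. attention, same class
    dadc = [f for f in feats if f[0] != i and f[1] != p]   # diff. attention, diff. class
    # The three (positive set, negative set) combinations of Eqs. 6-8.
    return [(sasc, sadc + dasc + dadc),   # Eq. 6
            (sadc, dadc),                 # Eq. 7
            (dasc, dadc)]                 # Eq. 8


# With N = 4 pairs and P = 2 attentions: 1/14, 6/6 and 2/6 positives/negatives.
for pos, neg in mamc_sets(n_pairs=4, n_parts=2, anchor=(0, 0, 0)):
    print(len(pos), "positives vs", len(neg), "negatives")
```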

To better understand the three constraints, let’s consider the synthetic example of six feature points shown in Fig. 4. In the initial state (Fig. 4a), the \(\mathcal {S}_{sasc}\) feature point (green hexagon) stays further away from the anchor \(\mathbf {f}_i^p\) at the center than the others. After applying the first constraint (Eq. 6), the underlying feature space is transformed to Fig. 4b, where the \(\mathcal {S}_{sasc}\) positive point (green \(\checkmark \)) has been pulled towards the anchor. However, the four negative features (cyan rectangles and triangles) are still in disordered positions. In fact, \(\mathcal {S}_{sadc}\) and \(\mathcal {S}_{dasc}\) should be considered as the positives compared to \(\mathcal {S}_{dadc}\) given the anchor. By further enforcing the second (Eq. 7) and third (Eq. 8) constraints, a better embedding can be achieved in Fig. 4c, where \(\mathcal {S}_{sadc}\) and \(\mathcal {S}_{dasc}\) are regularized to be closer to the anchor than the ones of \(\mathcal {S}_{dadc}\).

Fig. 3. Data hierarchy in training. (a) Each batch is composed of 2N input images in N-pair style. (b) OSME extracts P features for each image according to Eq. 5. (c) The grouping of features for the three MAMC constraints by picking one feature \(\mathbf {f}_i^p\) as the anchor.

Fig. 4. Feature embedding of a synthetic batch. (a) Initial embedding before learning. (b) The resulting embedding after applying Eq. 6. (c) The final embedding after enforcing Eqs. 7 and 8. See text for more details.

3.3 Training Loss

To enforce the triplet constraint in Eq. 9, a common approach is to minimize the following hinge loss:

$$\begin{aligned} \Big [ \Vert \mathbf {f}_i^p - \mathbf {f}^+ \Vert ^2 - \Vert \mathbf {f}_i^p - \mathbf {f}^- \Vert ^2 + m \Big ]_+. \end{aligned}$$
(10)

Despite being broadly used, optimizing Eq. 10 with standard triplet sampling leads to slow convergence and unstable performance in practice. Inspired by recent advances in metric learning, we enforce each of the three constraints by minimizing the N-pair loss [37],

$$\begin{aligned} L^{np} = \frac{1}{N} \sum _{\mathbf {f}_i^p \in \mathcal {B}} \Big \{ \sum _{\mathbf {f}^+ \in \mathcal {P}}\log \Big (1 + \sum _{\mathbf {f}^- \in \mathcal {N}}\exp (\mathbf {f}_{i}^{pT} \mathbf {f}^- - \mathbf {f}_{i}^{pT} \mathbf {f}^+) \Big ) \Big \}. \end{aligned}$$
(11)

In general, for each training batch \(\mathcal {B}\), MAMC jointly minimizes the softmax loss and the N-pair loss with a weight parameter \(\lambda \):

$$\begin{aligned} L^{mamc} = L^{softmax} + \lambda \Big ( L^{np}_{sasc} + L^{np}_{sadc} + L^{np}_{dasc} \Big ). \end{aligned}$$
(12)
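The following is a minimal PyTorch sketch of Eqs. 11 and 12; the way anchors, positives and negatives are gathered, the function names, and the normalization over anchors (the paper normalizes each term by N) are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def n_pair_term(anchor, positives, negatives):
    """Inner term of Eq. 11 for one anchor f_i^p:
    sum_{f+} log(1 + sum_{f-} exp(f_i^pT f- - f_i^pT f+))."""
    sim_pos = positives @ anchor                          # shape (|P|,)
    sim_neg = negatives @ anchor                          # shape (|N|,)
    diff = sim_neg.unsqueeze(0) - sim_pos.unsqueeze(1)    # shape (|P|, |N|)
    return torch.log1p(torch.exp(diff).sum(dim=1)).sum()


def mamc_loss(logits, labels, constraint_batches, lam=0.5):
    """Eq. 12: softmax loss plus lambda times the three N-pair terms.
    `constraint_batches` holds three lists of (anchor, positives, negatives)
    tuples, one list per constraint type (sasc / sadc / dasc). We average over
    anchors for simplicity; the paper normalizes each N-pair term by N."""
    loss = F.cross_entropy(logits, labels)
    for triplets in constraint_batches:
        if triplets:
            loss = loss + lam * sum(n_pair_term(a, p, n)
                                    for a, p, n in triplets) / len(triplets)
    return loss


# Toy usage with random, L2-normalized features of dimension 8.
def rand(*shape):
    return F.normalize(torch.randn(*shape), dim=-1)

constraint_batches = [[(rand(8), rand(1, 8), rand(14, 8))],   # sasc
                      [(rand(8), rand(6, 8), rand(6, 8))],    # sadc
                      [(rand(8), rand(2, 8), rand(6, 8))]]    # dasc
logits, labels = torch.randn(4, 10), torch.randint(0, 10, (4,))
print(mamc_loss(logits, labels, constraint_batches))
```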

Given a batch of N pairs of images and P parts, MAMC is able to generate \(2(PN-1)+4(N-1)^2(P-1)+4(N-1)(P-1)^2\) constraints of the three types (Eqs. 6 to 8), while the N-pair loss can only produce \(N-1\). To put it in perspective, we are able to generate about \(130\times \) more constraints than the N-pair loss from the same data under the typical setting where \(P = 2\) and \(N = 32\). This implies that MAMC leverages much richer correlations among the samples, and is able to obtain better convergence than either the triplet or the N-pair loss.
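As a sanity check on this count, the short script below (our own, for illustration) evaluates the stated formula at P = 2 and N = 32 and compares it with the \(N-1\) constraints of the plain N-pair loss; it yields 4094 constraints, roughly \(132\times \) more, consistent with the \(130\times \) figure above.

```python
def num_mamc_constraints(n, p):
    # Evaluate the constraint-count formula stated in the text (Eqs. 6-8).
    return 2 * (p * n - 1) + 4 * (n - 1) ** 2 * (p - 1) + 4 * (n - 1) * (p - 1) ** 2


n, p = 32, 2
total = num_mamc_constraints(n, p)
print(total, round(total / (n - 1), 1))  # 4094 constraints vs N-1 = 31, about 132x
```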

4 The Dogs-in-the-Wild Dataset

Large image datasets (such as ImageNet [8]) with high-quality annotations have enabled the dramatic development of visual recognition. However, most datasets for fine-grained recognition are outdated, non-natural and relatively small (as shown in Table 1). Recently, there have been several attempts, such as Goldfinch [19] and the iNaturalist Challenge [38], to build large-scale fine-grained benchmarks. However, there is still no comprehensive dataset with a sufficiently large data volume, highly accurate annotations, and full coverage of common dog species. We hence introduce the Dogs-in-the-Wild dataset with 299,458 images of 362 dog categories, which is 15\(\times \) larger than Stanford Dogs [16]. We generate the list of dog species by combining multiple sources (e.g., Wikipedia), and then crawl the images with search engines (e.g., Google, Baidu). The label of each image is then checked with crowd sourcing. We further prune small classes with fewer than 100 images, and merge extremely similar classes by applying a confusion matrix and manual validation. The whole annotation process is conducted three times to guarantee the annotation quality. Last but not least, since most of the experimental baselines are pre-trained on ImageNet, which has substantial category overlap with our dataset, we exclude all ImageNet images from our dataset for fair evaluation.

Figure 5a and b qualitatively compare our dataset with the two most relevant benchmarks, Stanford Dogs [16] and the dog section of Goldfinch [19]. It can be seen that our dataset is more challenging in two aspects: (1) the intra-class variation of each category is larger; for instance, almost all common patterns and hair colors of Staffordshire Bull Terriers are covered in our dataset, as illustrated in Fig. 5a. (2) More types of surrounding environment are covered, including but not limited to natural scenes, indoor scenes and even artificial scenes; and the dog itself can either be in its natural appearance or dressed up, such as the first Boston Terrier in Fig. 5a. Another feature of our dataset is that all of our images are manually examined to minimize annotation errors. Although Goldfinch has a comparable number of classes and data volume, it is common to find noisy images in it, as shown in Fig. 5b.

We then present the statistics of the three datasets in Fig. 5c and Table 1. It can be observed that our dataset is significantly more imbalanced in terms of images per category, which is more consistent with real-life situations and notably increases the classification difficulty. Note that the curves in Fig. 5c are smoothed for better visualization. On the other hand, the average number of images per category of our dataset is higher than that of the other two datasets, which contributes to its high intra-class variation and makes it less vulnerable to overfitting.

Table 1. Statistics of the related datasets
Fig. 5. Qualitative and quantitative comparison of dog datasets. (a) Example images from Stanford Dogs and Dogs-in-the-Wild; (b) Common bad cases from Goldfinch that are completely non-dog. (c) Images per category distribution.

5 Experimental Results

We conduct our experiments on four fine-grained image recognition datasets, including three publicly available datasets CUB-200-2011 [39], Stanford Dogs [16] and Stanford Cars  [20], and the proposed Dogs-in-the-Wild dataset. The detailed statistics including class numbers and train/test distributions are summarized in Table 1. We adopt top-1 accuracy as the evaluation metric.

In our experiments, the input images are resized to 448\(\times \)448 for both training and testing. We train on each dataset for 60 epochs; the batch size is set to 10 (N=5), and the base learning rate is set to 0.001, which decays by a factor of 0.96 every 0.6 epoch. The reduction ratio r of \(\mathbf {W}_1^p\) and \(\mathbf {W}_2^p\) in Eq. 3 is set to 16, following [13]. The weight parameter \(\lambda \) is empirically set to 0.5, as it achieves consistently good performance. For the FC layers, we set the channels \(C=2048\) and \(D=1024\). Our method is implemented in Caffe [15] and runs on one Tesla P40 GPU.
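For reference, the stated schedule corresponds to a simple exponential decay of the learning rate. The short helper below is our own restatement of these hyperparameters (not the authors' training script), assuming the decay is applied in discrete 0.6-epoch steps.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.96, decay_every=0.6):
    """Base LR of 0.001 decayed by a factor of 0.96 every 0.6 epoch."""
    return base_lr * decay ** int(epoch / decay_every)


for e in (0, 1, 10, 30, 60):
    print(f"epoch {e:>2}: lr = {learning_rate(e):.2e}")
```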

5.1 Ablation Analysis

To fully investigate our method, Table 2a provides a detailed ablation analysis on different configurations of the key components.

Base networks. To extract convolutional features before the OSME module, we choose VGG-19 [36], ResNet-50 and ResNet-101 [11] as candidate base networks. Based on Table 2a, ResNet-50 and ResNet-101 are selected given their good balance between performance and efficiency. We also note that although a better ResNet-50 baseline on CUB is reported in [21] (84.5%), it is implemented in Torch [5] and tuned with more advanced data augmentation (e.g., color jittering, scaling). Our baselines, on the other hand, are trained with simple augmentation (e.g., mirroring and random cropping) and match the Caffe baselines of other works, such as 82.0% in [26] and 78.4% in [7].

Importance of OSME. OSME is important for attending to discriminative regions. For ResNet-50 without MAMC, using OSME alone with \(P=2\) offers a 3.2% performance improvement over the baseline (84.9% vs. 81.7%). With MAMC, using OSME boosts the accuracy by 0.5% compared to not using it (i.e., using two independent FC layers instead, 86.2% vs. 85.7%). We also notice that two attention regions (\(P=2\)) already lead to promising results, while more attention regions (\(P=3\)) provide slightly better performance.

MAMC constraints. Applying the first MAMC constraint (Eq. 6) achieves 0.5% better performance than the baseline with ResNet-50 and OSME. Using all of the three MAMC constraints (Eqs. 6 to 8) leads to another 0.8% improvement. This indicates the effectiveness of each of the three MAMC constraints.

Complexity. Compared with the ResNet-50 baseline, our method provides a significantly better result (+4.5%) with only 30% more time, while a similar method [10] offers a less optimal result but takes \(3.6\times \) more time than ours.

Table 2. Experimental results. “Anno.” stands for using extra annotation (bounding box or part) in training. “1-Stage” indicates whether the training can be done in one stage. “Acc.” denotes the top-1 accuracy in percentage

5.2 Comparison with State-of-the-Art

Quantitative experimental results are shown in Table 2b–e.

We first analyze the results on the CUB-200-2011 dataset in Table 2b. It can be observed that with ResNet-101, our method achieves the best overall performance (tied with MACNN) against the state-of-the-art. Even with ResNet-50, our method exceeds the second best method using extra annotation (PN-CNN) by 0.8%, and exceeds the second best method without extra annotation (RAM) by 0.2%. Among the weakly supervised methods without extra annotation, PDFR and MG-CNN combine features from multiple scales, and RACNN is trained with multiple alternating stages, whereas our method is trained in only one stage to obtain all the required features. Yet our method outperforms these three methods by 2.0%, 4.8% and 1.2%, respectively. The methods B-CNN and RAN share a similar multi-branch idea with the OSME in our method, where B-CNN connects two CNN features with an outer product, and RAN combines the trunk CNN feature with an additional attention mask. Our method, on the other hand, applies the OSME for multi-attention feature extraction in one step, and surpasses B-CNN and RAN by 2.4% and 3.7%, respectively.

Our method exhibits similar performance on Stanford Dogs and Stanford Cars, as shown in Table 2c and d. On Stanford Dogs, our method exceeds all of the compared methods except RACNN, which requires multiple stages for feature extraction and is hard to train end-to-end. On Stanford Cars, our method obtains 93.0% accuracy, outperforming all of the compared methods.

Finally, on the Dogs-in-the-Wild dataset, our method still achieves the best result by remarkable margins. Since this dataset is newly proposed, the results in Table 2e can serve as baselines for future explorations. Moreover, by comparing the overall performances in Table 2c and e, we find that the accuracies on Dogs-in-the-Wild are significantly lower than those on Stanford Dogs, which reflects the relatively higher classification difficulty of this dataset.

Adopting our network with ResNet-101, we visualize the \(\mathbf {S}^p\) in Eq. 4 of each OSME branch (which corresponds to an attention region) as its channel-wise average heatmap, as shown in the third and fourth columns of Fig. 6. For comparison, we also show the outputs of the last conv layer of the baseline network (ResNet-101) as heatmaps in the second column. It can be seen that the highlighted regions of the OSME outputs reveal more meaningful parts than those of the baseline, parts that humans also rely on to recognize the fine-grained label, e.g., the head and wing for birds, the head and tail for dogs, and the headlight/grill and frame for cars.

Fig. 6. Visualization of the attention regions detected by the OSME. For each dataset, the first column shows the input image, the second column shows the heatmap from the last conv layer of the baseline ResNet-101; the third and fourth columns show the heatmaps of the two detected attention regions via OSME.

6 Conclusion

In this paper, we propose a novel CNN with the multi-attention multi-class constraint (MAMC) for fine-grained image recognition. Our network extracts attention-aware features through the one-squeeze multi-excitation (OSME) module, supervised by the MAMC loss that pulls positive features closer to the anchor while pushing negative features away. Our method does not require bounding box or part annotations, and can be trained end-to-end in one stage. Extensive experiments against state-of-the-art methods demonstrate the superior performance of our method on various fine-grained recognition tasks on birds, dogs and cars. In addition, we have collected and released Dogs-in-the-Wild, a comprehensive dog species dataset with the largest data volume, full category coverage, and accurate annotation compared with existing similar datasets.