1 Introduction

The availability of large-scale labeled training images is one of the key factors that contribute to recent successes in visual object recognition and classification. It is well known, however, that object frequencies in natural images follow long-tailed distributions [1–3]. For example, some animal or plant species are simply rare by nature; it is uncommon to find alpacas wandering around the streets. Furthermore, brand new categories can emerge with zero or few labeled images, as newly defined visual concepts or products are introduced every day. In this real-world setting, it would be desirable for computer vision systems to be able to recognize instances of those rare classes while demanding minimal human effort and few labeled examples.

Zero-shot learning (ZSL) has long been believed to hold the key to the above problem of recognition in the wild. ZSL differentiates two types of classes: seen and unseen, where labeled examples are available for seen classes only. Without labeled data, models for unseen classes are learned by relating them to seen ones. This is often achieved by embedding both seen and unseen classes into a common semantic space, such as visual attributes [4–6] or word2vec representations of the class names [7–9]. This common semantic space enables transferring models for the seen classes to those for the unseen ones [10].

The setup for ZSL is that once models for unseen classes are learned, they are judged based on their ability to discriminate among unseen classes, assuming the absence of seen objects during the test phase. Originally proposed in the seminal work of Lampert et al. [4], this setting has almost always been adopted for evaluating ZSL methods [8, 10–28].

But does this problem setting truly reflect what recognition in the wild entails? While the ability to learn novel concepts is by all means a trait that any zero-shot learning system should possess, it is merely one side of the coin. The other important, yet so far under-studied, trait is the ability to remember past experiences, i.e., the seen classes.

Why is this trait desirable? Consider how data are distributed in the real world. The seen classes are often more common than the unseen ones; it is therefore unrealistic to assume that we will never encounter them during the test stage. For models generated by ZSL to be truly useful, they should not only accurately discriminate among the seen or the unseen classes themselves but also accurately distinguish the seen classes from the unseen ones.

Thus, to understand better how existing ZSL approaches will perform in the real world, we advocate evaluating them in the setting of generalized zero-shot learning (GZSL), where test data come from both seen and unseen classes and we need to classify them into the joint labeling space of both types of classes. Previous work in this direction is scarce; see the related work (Sect. 2) for details.

Our contributions include an extensive empirical study of several existing ZSL approaches in this new setting. We show that a straightforward application of classifiers constructed by those approaches performs poorly. In particular, test data from unseen classes are almost always classified as a class from the seen ones. We propose a surprisingly simple yet very effective method called calibrated stacking to address this problem. This method is mindful of the two conflicting forces: recognizing data from seen classes and recognizing data from unseen ones. We introduce a new performance metric called Area Under Seen-Unseen accuracy Curve (AUSUC) that can evaluate ZSL approaches on how well they can trade off between the two. We demonstrate the utility of this metric by evaluating several representative ZSL approaches under this metric on three benchmark datasets, including the full ImageNet Fall 2011 release dataset [29] that contains approximately 21,000 unseen categories.

We complement our comparative studies of learning methods by further establishing an upper bound on the performance limit of ZSL. In particular, our idea is to use class-representative visual features as idealized semantic embeddings to construct ZSL classifiers. We show that there is a large gap between existing approaches and this ideal performance limit, suggesting that improving class semantic embeddings is vital to achieving GZSL.

The rest of the paper is organized as follows. Section 2 reviews relevant literature. We define GZSL formally and shed light on its difficulty in Sect. 3. In Sect. 4, we propose a method to remedy the observed issues in the previous section and compare it to related approaches. Experimental results, detailed analysis, and discussions are provided in Sects. 5, 6, and 7, respectively.

2 Related Work

There has been very little work on generalized zero-shot learning. [8, 17, 30, 31] allow the label space of their classifiers to include seen classes, but they only test on data from the unseen classes. [9] proposes a two-stage approach that first determines whether a test data point is from a seen or unseen class and then applies the corresponding classifiers. However, their experiments are limited to only 2 or 6 unseen classes. We describe and compare to their methods in Sects. 4.3, 5, and the Supplementary Material. In the domain of action recognition, [32] investigates the generalized setting with only up to 3 seen classes. [33, 34] focus on training a zero-shot binary classifier for each unseen class (against the seen ones); it is not clear how to distinguish multiple unseen classes from the seen ones. Finally, open set recognition [35–37] considers testing on both types of classes, but treats the unseen ones as a single outlier class.

3 Generalized Zero-Shot Learning

In this section, we describe formally the setting of generalized zero-shot learning. We then present empirical evidence to illustrate the difficulty of this problem.

3.1 Conventional and Generalized Zero-Shot Learning

Suppose we are given the training data \(\mathcal {D}= \{({\varvec{x}}_n\in \mathbb {R}^{\mathsf {D}},y_n)\}_{n=1}^\mathsf {N}\) with the labels \(y_n\) from the label space of seen classes \(\mathcal {S} = \{1,2,\cdots ,\mathsf {S}\}\). Denote by \(\mathcal {U} = \{\mathsf {S}+1,\cdots ,\mathsf {S}+\mathsf {U}\}\) the label space of unseen classes. We use \(\mathcal {T} = \mathcal {S} \cup \mathcal {U}\) to represent the union of the two sets of classes.

In the (conventional) zero-shot learning (ZSL) setting, the main goal is to classify test data into the unseen classes, assuming the absence of the seen classes in the test phase. In other words, each test data point is assumed to come from and will be assigned to one of the labels in \(\mathcal {U}\).

Existing research on ZSL has been almost entirely focused on this setting [4, 8, 10–28]. However, in real applications, the assumption of encountering data only from the unseen classes is hardly realistic. The seen classes are often the most common objects we see in the real world. Thus, the objective in conventional ZSL does not truly reflect how the classifiers will perform recognition in the wild.

Motivated by this shortcoming of the conventional ZSL, we advocate studying the more general setting of generalized zero-shot learning (GZSL), where we no longer limit the possible class memberships of test data — each of them belongs to one of the classes in \(\mathcal {T}\).

3.2 Classifiers

Without loss of generality, we assume that for each class \(c \in \mathcal {T}\), we have a discriminant scoring function \(f_c({\varvec{x}})\), from which we would be able to derive the label for \({\varvec{x}}\). For instance, for an unseen class u, the method of synthesized classifiers [28] defines \(f_u({\varvec{x}}) = \varvec{w}_u^{\text {T}}{\varvec{x}}\), where \(\varvec{w}_u\) is the model parameter vector for the class u, constructed from its semantic embedding \(\varvec{a}_u\) (such as its attribute vector or the word vector associated with the name of the class). In ConSE [17], \(f_u({\varvec{x}}) = \cos (s({\varvec{x}}), \varvec{a}_u)\), where \(s({\varvec{x}})\) is the predicted embedding of the data sample \({\varvec{x}}\). In DAP/IAP [38], \(f_u({\varvec{x}})\) is a probabilistic model of attribute vectors. We assume that similar discriminant functions for seen classes can be constructed in the same manner given their corresponding semantic embeddings.
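To fix notation for the sketches that follow, the snippet below (Python/NumPy, with hypothetical variable names) shows one way to collect such discriminant scores into a single matrix over the joint label space. It assumes linear scoring functions of the SynC form; ConSE-style cosine scores would fill the columns analogously.

```python
import numpy as np

def stack_scores(X, W_seen, W_unseen):
    """Collect the discriminant scores f_c(x) of all classes in T into one
    (N, S + U) matrix, with the seen classes in the first S columns.

    X:        (N, D) test features.
    W_seen:   (S, D) parameter vectors of seen-class classifiers.
    W_unseen: (U, D) parameter vectors of unseen-class classifiers.
    Scores are assumed linear, f_c(x) = w_c^T x.
    """
    return np.hstack([X @ W_seen.T, X @ W_unseen.T])
```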

How do we assess an algorithm for GZSL? We define and differentiate the following performance metrics: \(A_{\mathcal {U} \rightarrow \mathcal {U}}\), the accuracy of classifying test data from \(\mathcal {U}\) into \(\mathcal {U}\); \(A_{\mathcal {S} \rightarrow \mathcal {S}}\), the accuracy of classifying test data from \(\mathcal {S}\) into \(\mathcal {S}\); and finally \(A_{\mathcal {S} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {U} \rightarrow \mathcal {T}}\), the accuracies of classifying test data from the seen or unseen classes, respectively, into the joint labeling space. Note that \(A_{\mathcal {U} \rightarrow \mathcal {U}}\) is the standard performance metric used for conventional ZSL and \(A_{\mathcal {S} \rightarrow \mathcal {S}}\) is the standard metric for multi-class classification. Furthermore, note that we do not report \(A_{\mathcal {T} \rightarrow \mathcal {T}}\), as simply averaging \(A_{\mathcal {S} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {U} \rightarrow \mathcal {T}}\) to compute \(A_{\mathcal {T} \rightarrow \mathcal {T}}\) might be misleading when the two metrics are not balanced, as shown below.

3.3 Generalized ZSL is Hard

To demonstrate the difficulty of GZSL, we report the empirical results of using a simple but intuitive algorithm for GZSL. Given the discriminant functions, we adopt the following classification rule

$$\begin{aligned} \hat{y} = \arg \max _{c\in \mathcal {T}} \quad f_c({\varvec{x}}) \end{aligned}$$
(1)

which we refer to as direct stacking.
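The following minimal sketch (not the original implementation) illustrates the direct stacking rule of Eq. (1) together with the normalized-by-class-size accuracies \(A_{\mathcal {S} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {U} \rightarrow \mathcal {T}}\); it assumes the stacked score matrix from the earlier snippet and hypothetical label arrays.

```python
import numpy as np

def direct_stacking(scores):
    """Eq. (1): predict the highest-scoring class over the joint label space T.
    scores: (N, S + U) array; the first S columns correspond to seen classes."""
    return np.argmax(scores, axis=1)

def per_class_accuracy(y_true, y_pred, classes):
    """Accuracy averaged over the given classes (i.e., normalized by class size)."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes if np.any(y_true == c)]
    return float(np.mean(accs))

# Hypothetical usage: A_{S->T} and A_{U->T} under direct stacking.
# y_pred = direct_stacking(stack_scores(X_test, W_seen, W_unseen))
# A_S_to_T = per_class_accuracy(y_test, y_pred, seen_classes)
# A_U_to_T = per_class_accuracy(y_test, y_pred, unseen_classes)
```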

We apply this rule to “stack” classifiers from the following zero-shot learning approaches: DAP and IAP [38], ConSE [17], and Synthesized Classifiers (SynC) [28]. We tune the hyper-parameters for each approach based on class-wise cross validation [26, 28, 33]. We test GZSL on two datasets, AwA [38] and CUB [39]; details about these datasets can be found in Sect. 5.

Table 1. Classification accuracies (%) on conventional ZSL (\(A_{\mathcal {U} \rightarrow \mathcal {U}}\)), multi-class classification for seen classes (\(A_{\mathcal {S} \rightarrow \mathcal {S}}\)), and GZSL (\(A_{\mathcal {S} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {U} \rightarrow \mathcal {T}}\)), on AwA and CUB. Significant drops are observed from \(A_{\mathcal {U} \rightarrow \mathcal {U}}\) to \(A_{\mathcal {U} \rightarrow \mathcal {T}}\).

Table 1 reports experimental results based on the 4 performance metrics we have described previously. Our goal here is not to compare between methods. Instead, we examine the impact of relaxing the assumption of the prior knowledge of whether data are from seen or unseen classes.

We observe that, in this GZSL setting, the classification performance on unseen classes (\(A_{\mathcal {U} \rightarrow \mathcal {T}}\)) drops significantly from the performance in conventional ZSL (\(A_{\mathcal {U} \rightarrow \mathcal {U}}\)), while that on seen classes (\(A_{\mathcal {S} \rightarrow \mathcal {T}}\)) remains roughly the same as in the multi-class task (\(A_{\mathcal {S} \rightarrow \mathcal {S}}\)). That is, nearly all test data from unseen classes are misclassified into the seen classes. This stark degradation in performance highlights the challenge of GZSL: because we only see labeled data from the seen classes during training, and the classifiers for those classes are never trained on “negative” examples from the unseen classes, the scoring functions of the seen classes tend to dominate those of the unseen classes, biasing predictions toward the label space of \(\mathcal {S}\).

4 Approach for GZSL

The previous example shows that the classifiers for unseen classes constructed by conventional ZSL methods should not be naively combined with models for seen classes to expand the labeling space required by GZSL.

In what follows, we propose a simple variant to the naive approach of direct stacking to curb such a problem. We also develop a metric that measures the performance of GZSL, by acknowledging that there is an inherent trade-off between recognizing seen classes and recognizing unseen classes. This metric, referred to as the Area Under Seen-Unseen accuracy Curve (AUSUC), balances the two conflicting forces. We conclude this section by describing two related approaches: despite their sophistication, they do not perform well empirically.

4.1 Calibrated Stacking

Our approach stems from the observation that the scores of the discriminant functions for the seen classes are often greater than the scores for the unseen classes. Thus, intuitively, we would like to reduce the scores for the seen classes. This leads to the following classification rule:

$$\begin{aligned} \hat{y} = \arg \max _{c\;\in \;\mathcal {T}}\quad f_c({\varvec{x}}) - \gamma \mathbb {I}[c\in \mathcal {S}], \end{aligned}$$
(2)

where the indicator \(\mathbb {I}[\cdot ]\in \{0,1\}\) indicates whether or not c is a seen class and \(\gamma \) is a calibration factor. We term this adjustable rule calibrated stacking.

Another way to interpret \(\gamma \) is to regard it as the prior likelihood of a data point coming from the unseen classes. When \(\gamma =0\), the calibrated stacking rule reverts to the direct stacking rule described previously.

It is also instructive to consider the two extreme cases of \(\gamma \). When \(\gamma \rightarrow +\infty \), the classification rule will ignore all seen classes and classify all data points into one of the unseen classes. When there is no new data point coming from seen classes, this classification rule essentially implements what one would do in the setting of conventional ZSL. On the other hand, when \(\gamma \rightarrow -\infty \), the classification rule only considers the label space of seen classes as in standard multi-way classification. The calibrated stacking rule thus represents a middle ground between aggressively classifying every data point into seen classes and conservatively classifying every data point into unseen classes. Adjusting this hyperparameter thus gives a trade-off, which we exploit to define a new performance metric.
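A minimal sketch of the calibrated stacking rule in Eq. (2), using the same score-matrix layout as before; the variable names are illustrative only.

```python
import numpy as np

def calibrated_stacking(scores, num_seen, gamma):
    """Eq. (2): subtract the calibration factor gamma from every seen-class score
    before taking the argmax over the joint label space T.

    scores:   (N, S + U) array with seen classes in the first `num_seen` columns.
    gamma:    calibration factor; gamma = 0 recovers direct stacking, large positive
              values suppress seen classes, large negative values favor them.
    """
    adjusted = scores.copy()
    adjusted[:, :num_seen] -= gamma
    return np.argmax(adjusted, axis=1)
```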

4.2 Area Under Seen-Unseen Accuracy Curve (AUSUC)

Varying the calibration factor \(\gamma \), we can compute a series of classification accuracies (\(A_{\mathcal {U} \rightarrow \mathcal {T}}\), \(A_{\mathcal {S} \rightarrow \mathcal {T}}\)). Figure 1 plots those points for the dataset AwA using the classifiers generated by the method in [28] based on class-wise cross validation. We call such a plot the Seen-Unseen accuracy Curve (SUC).

Fig. 1.
figure 1

The Seen-Unseen accuracy Curve (SUC) obtained by varying \(\gamma \) in the calibrated stacking classification rule Eq. (2). The AUSUC summarizes the curve by computing the area under it. We use the method SynC\(^\text {o-vs-o}\) on the AwA dataset, and tune hyper-parameters as in Table 1. The red cross denotes the accuracies obtained by direct stacking.

On the curve, \(\gamma =0\) corresponds to direct stacking, denoted by a cross. The curve is similar to many familiar curves for representing conflicting goals, such as the Precision-Recall (PR) curve and the Receiver Operating Characteristic (ROC) curve, with two ends corresponding to the extreme cases (\(\gamma \rightarrow -\infty \) and \(\gamma \rightarrow +\infty \)).

A convenient way to summarize the plot with one number is to use the Area Under SUC (AUSUC). The higher the area is, the better an algorithm is able to balance \(A_{\mathcal {U} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {S} \rightarrow \mathcal {T}}\). In Sects. 5, 6, and the Supplementary Material, we evaluate the performance of existing zero-shot learning methods under this metric, as well as provide further insights and analyses.
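The sketch below illustrates how the SUC and its area could be computed by sweeping \(\gamma \) and reusing the calibrated_stacking and per_class_accuracy helpers defined earlier; in practice the sweep should extend far enough in both directions to reach the two end points of the curve.

```python
import numpy as np

def seen_unseen_curve(scores, y_true, num_seen, seen_classes, unseen_classes, gammas):
    """Sweep gamma to trace (A_{U->T}, A_{S->T}) pairs of the Seen-Unseen accuracy Curve."""
    pts = []
    for gamma in gammas:
        y_pred = calibrated_stacking(scores, num_seen, gamma)
        pts.append((per_class_accuracy(y_true, y_pred, unseen_classes),
                    per_class_accuracy(y_true, y_pred, seen_classes)))
    return np.array(pts)  # columns: A_{U->T}, A_{S->T}

def ausuc(points):
    """Area under the SUC via the trapezoidal rule (points sorted by A_{U->T})."""
    pts = points[np.argsort(points[:, 0])]
    x, y = pts[:, 0], pts[:, 1]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```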

An immediate and important use of the metric AUSUC is for model selection. Many ZSL methods require tuning hyperparameters; previous work tunes them based on the accuracy \(A_{\mathcal {U} \rightarrow \mathcal {U}}\). The selected model, however, does not necessarily balance optimally between \(A_{\mathcal {U} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {S} \rightarrow \mathcal {T}}\). Instead, we advocate using AUSUC for model selection and hyperparameter tuning. Models with higher values of AUSUC are likely to perform in balance for the task of GZSL. For detailed discussions, see the Supplementary Material.
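As an illustration of this model-selection strategy, the hypothetical helper below picks the hyperparameter setting with the highest validation AUSUC; fit_fn and scores_fn stand in for whatever training and scoring routines a particular ZSL method provides, and seen_unseen_curve and ausuc are the helpers sketched above.

```python
def select_by_ausuc(candidates, fit_fn, scores_fn, y_val, num_seen,
                    seen_classes, unseen_classes, gammas):
    """Return the hyperparameter setting whose validation AUSUC is highest.

    candidates: iterable of hyperparameter settings to try.
    fit_fn:     hypothetical callback that trains a model for one setting.
    scores_fn:  hypothetical callback returning the (N, S + U) validation scores.
    """
    best, best_ausuc = None, -1.0
    for hp in candidates:
        model = fit_fn(hp)
        pts = seen_unseen_curve(scores_fn(model), y_val, num_seen,
                                seen_classes, unseen_classes, gammas)
        a = ausuc(pts)
        if a > best_ausuc:
            best, best_ausuc = hp, a
    return best
```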

4.3 Alternative Approaches

Socher et al. [9] propose a two-stage zero-shot learning approach that first predicts whether an image belongs to a seen or an unseen class and then applies the corresponding classifiers. The first stage is based on the idea of novelty detection and assigns a high novelty score to a test data point if it is unlikely to come from the seen classes. They experiment with two novelty detection strategies: the Gaussian and LoOP models [40]. We briefly describe and contrast them to our approach below; the details are in the Supplementary Material.

Novelty Detection. The main idea is to assign a novelty score \(N({\varvec{x}})\) to each sample \({\varvec{x}}\). With this novelty score, the final prediction rule becomes

$$\begin{aligned} \hat{y} = \left\{ \begin{array}{l@{\quad }r} \mathop {\arg \max }\nolimits _{c\;\in \;\mathcal {S}}\quad f_c({\varvec{x}}), \quad &{} \text {if } N({\varvec{x}}) \le -\gamma .\\ \mathop {\arg \max }\nolimits _{c\;\in \;\mathcal {U}}\quad f_c({\varvec{x}}), \quad &{} \text {if } N({\varvec{x}}) > -\gamma . \end{array} \right. \end{aligned}$$
(3)

where \(-\gamma \) is the novelty threshold; scores above this threshold indicate that a data point likely belongs to an unseen class. To estimate \(N({\varvec{x}})\), the Gaussian model first fits a Gaussian mixture model to the data points of the seen classes. The novelty score of a data point is then its negative log probability under this mixture model. Alternatively, the novelty score can be estimated using the Local Outlier Probabilities (LoOP) model [40]. The idea there is to compute the distances of \({\varvec{x}}\) to its nearest seen classes. Such distances are then converted to an outlier probability, interpreted as the likelihood of \({\varvec{x}}\) being from unseen classes.
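The following sketch shows one plausible instantiation of the Gaussian strategy and the two-stage rule of Eq. (3) using scikit-learn; the exact modeling choices in [9] (e.g., the space in which the mixture is fit and the number of components) may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_novelty_score(X_seen_train, X_test, n_components=1):
    """Novelty score as the negative log-likelihood of a test point under a
    Gaussian mixture fit to seen-class training data (sketch of the Gaussian strategy)."""
    gmm = GaussianMixture(n_components=n_components).fit(X_seen_train)
    return -gmm.score_samples(X_test)

def two_stage_prediction(scores, novelty, num_seen, gamma):
    """Eq. (3): threshold the novelty score at -gamma and route each test point
    to the seen-class or unseen-class classifiers accordingly."""
    seen_pred = np.argmax(scores[:, :num_seen], axis=1)
    unseen_pred = num_seen + np.argmax(scores[:, num_seen:], axis=1)
    return np.where(novelty > -gamma, unseen_pred, seen_pred)
```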

Relation to Calibrated Stacking. If we define a new form of novelty score \(N({\varvec{x}}) = \max _{u\;\in \;\mathcal {U}} f_u({\varvec{x}}) - \max _{s\;\in \;\mathcal {S}} f_s({\varvec{x}})\) in Eq. (3), we recover the prediction rule in Eq. (2). However, this relation holds only if we are interested in predicting a single label \(\hat{y}\). When we are interested in predicting a set of labels (for example, hoping that the correct label is among the top K predicted labels, i.e., the Flat hit@K metric, cf. Sect. 5), the two prediction rules will give different results.

5 Experimental Results

5.1 Setup

Datasets. We mainly use three benchmark datasets: the Animals with Attributes (AwA) [38], CUB-200-2011 Birds (CUB) [39], and ImageNet (with full 21,841 classes) [41]. Table 2 summarizes their key characteristics.

Table 2. Key characteristics of the studied datasets.

Semantic Spaces. For the classes in AwA and CUB, we use 85-dimensional and 312-dimensional binary or continuous-valued attributes, respectively [38, 39]. For ImageNet, we use 500-dimensional word vectors (word2vec) trained by the skip-gram model [7, 42], provided by Changpinyo et al. [28]. We ignore classes without word vectors, resulting in 20,345 (out of 20,842) unseen classes. Following [28], we normalize all but the binary embeddings to have unit \(\ell _2\) norm.

Visual Features. We use the GoogLeNet deep features [43] pre-trained on ILSVRC 2012 1K [41] for all datasets (all extracted with the Caffe package [44]). Extracted features come from the 1,024-dimensional activations of the pooling units, as in [20, 28].

Zero-Shot Learning Methods. We examine several representative conventional zero-shot learning approaches, described briefly below. Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP) [38] are probabilistic models that perform attribute prediction as an intermediate step and then use it to compute MAP predictions of unseen class labels. ConSE [17] makes use of pre-trained classifiers for seen classes and their probabilistic outputs to infer the semantic embedding of each test example, and then classifies it into the unseen class with the most similar semantic embedding. SynC [28] is a recently proposed multi-task learning approach that synthesizes a novel classifier based on semantic embeddings and base classifiers that are learned with labeled data from the seen classes. Two versions of this approach, SynC\(^\text {o-vs-o}\) and SynC\(^\text {struct}\), use one-versus-others and Crammer-Singer style [45] loss functions, respectively, to train the classifiers. We use binary attributes for DAP and IAP, and continuous attributes and word2vec for ConSE and SynC, following [17, 28, 38].

Generalized Zero-Shot Learning Tasks. There are no previously established benchmark tasks for GZSL. We thus define a set of tasks that reflects more closely how data are distributed in real-world applications.

We construct the GZSL tasks by composing the test data as a combination of images from both seen and unseen classes. We follow existing splits of the datasets for conventional ZSL to separate seen and unseen classes. Moreover, for the datasets AwA and CUB, we hold out 20% of the data points from the seen classes (previously, all of them were used for training in the conventional zero-shot setting) and merge them with the data from the unseen classes to form the test set; for ImageNet, we combine its validation set (having the same classes as its training set) and the 21K classes that are not in the ILSVRC 2012 1K dataset.

Evaluation Metrics. While we will primarily report the performance of ZSL approaches under the Area Under Seen-Unseen accuracy Curve (AUSUC) metric developed in Sect. 4.2, we explain below how its two accuracy components \(A_{\mathcal {S} \rightarrow \mathcal {T}}\) and \(A_{\mathcal {U} \rightarrow \mathcal {T}}\) are computed.

For AwA and CUB, the seen and unseen accuracies correspond to (normalized-by-class-size) multi-way classification accuracy, where the seen accuracy is computed on the held-out 20% of images from the seen classes and the unseen accuracy is computed on images from the unseen classes.

For ImageNet, the seen and unseen accuracies correspond to Flat hit@K (F@K), defined as the percentage of test images for which the model returns the true label in its top K predictions. Note that F@1 is the unnormalized multi-way classification accuracy. Moreover, following the procedure in [8, 17, 28], we evaluate on three scenarios of increasing difficulty: (1) 2-hop contains 1,509 unseen classes that are within two tree hops of the 1K seen classes according to the ImageNet label hierarchy. (2) 3-hop contains 7,678 unseen classes that are within three tree hops of the seen classes. (3) All contains all 20,345 unseen classes.
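For concreteness, Flat hit@K could be computed as in the sketch below (sample-averaged, unlike the per-class-normalized accuracies used for AwA and CUB); the variable names are illustrative.

```python
import numpy as np

def flat_hit_at_k(scores, y_true, k):
    """Flat hit@K: fraction of test images whose true label is among the top-K
    scoring classes; k = 1 reduces to unnormalized multi-way accuracy."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean(np.any(topk == np.asarray(y_true)[:, None], axis=1)))
```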

5.2 Which Method to Use to Perform GZSL?

Table 3 provides an experimental comparison between several methods utilizing seen and unseen classifiers for generalized ZSL, with hyperparameters cross-validated to maximize AUSUC. Empirical results on additional datasets and ZSL methods are in the Supplementary Material.

The results show that, irrespective of which ZSL method is used to generate models for the seen and unseen classes, our method of calibrated stacking outperforms the other methods for generalized ZSL. In particular, despite their probabilistic justification, the two novelty detection methods do not perform well. We believe that this is because most existing zero-shot learning methods are discriminative and optimized to take full advantage of class labels and semantic information. In contrast, both the Gaussian and the LoOP approaches model all the seen classes as a whole, possibly at the cost of modeling inter-class differences.

Table 3. Performances measured in AUSUC of several methods for Generalized Zero-Shot Learning on AwA and CUB. The higher the better (the upper bound is 1).
Fig. 2.
figure 2

Comparison between several ZSL approaches on the task of GZSL for AwA and CUB.

Fig. 3.
figure 3

Comparison of the performances of ConSE and SynC on the task of GZSL for ImageNet, where the unseen classes are within 2 tree hops of the seen classes.

Table 4. Performances measured in AUSUC by different zero-shot learning approaches on GZSL on ImageNet, using our method of calibrated stacking.

5.3 Which Zero-Shot Learning Approach is More Robust to GZSL?

Figure 2 contrasts in detail several ZSL approaches when tested on the task of GZSL, using the method of calibrated stacking. Clearly, the SynC method dominates all other methods over the whole range. The crosses on the plots mark the results of direct stacking (Sect. 3).

Figure 3 contrasts in detail ConSE to SynC, the two known methods for large-scale ZSL. When accuracy is measured in Flat hit@1 (i.e., multi-class classification accuracy), neither method dominates the other, suggesting that the two methods make different trade-offs. However, when we measure hit rates in the top \(K>1\), SynC dominates ConSE. Table 4 gives a summarized comparison in AUSUC between the two methods on the ImageNet dataset. We observe that SynC in general outperforms ConSE except when Flat hit@1 is used, in which case the two methods’ performances are nearly indistinguishable. Additional plots can be found in the Supplementary Material.

6 Analysis on (Generalized) Zero-Shot Learning

Zero-shot learning, whether in the conventional or the generalized setting, is a challenging problem as there is no labeled data for the unseen classes. The performance of ZSL methods depends on at least two factors: (1) how seen and unseen classes are related, and (2) how effectively that relation can be exploited by learning algorithms to generate models for the unseen classes. For generalized zero-shot learning, the performance further depends on how classifiers for seen and unseen classes are combined to classify new data into the joint label space.

Despite extensive study in ZSL, several questions remain understudied. For example, given a dataset and a split of seen and unseen classes, what is the best possible performance of any ZSL method? How far are we from there? What is the most crucial component we can improve in order to reduce the gap between the state-of-the-art and the ideal performances?

In this section, we empirically analyze ZSL methods in detail and shed light on some of those questions.

Setup. As ZSL methods do not use labeled data from the unseen classes to train classifiers, a reasonable estimate of their best possible performance is the performance attained on a multi-class classification task in which annotated data for the unseen classes are provided.

Concretely, to construct the multi-class classification task on AwA and CUB, we randomly select 80% of the data along with their labels from all classes (seen and unseen) to train classifiers. The remaining 20% are used to assess both the multi-class classifiers and the classifiers from ZSL. Note that, for ZSL, only the seen classes from the 80% are used for training; the portion belonging to the unseen classes is not used.

On ImageNet, to reduce the computational cost (constructing multi-class classifiers would otherwise involve 20,345-way classification), we subsample another 1,000 unseen classes from its original 20,345 unseen classes. We call this new dataset ImageNet-2K (including the 1K seen classes from ImageNet). The subsampling procedure is described in the Supplementary Material; its main goal is to keep the proportion of difficult unseen classes unchanged. Out of those 1,000 unseen classes, we randomly select 50 samples per class and reserve them for testing, and we use the remaining examples (along with their labels) to train 2,000-way classifiers.

For ZSL methods, we use either attribute vectors or word vectors (word2vec) as semantic embeddings. Since SynC\(^\text {o-vs-o}\) [28] performs well on a range of datasets and settings, we focus on this method. For multi-class classification, we train one-versus-others SVMs. Once we obtain the classifiers for both seen and unseen classes, we combine them using the calibrated stacking decision rule (as in generalized ZSL) and vary the calibration factor \(\gamma \) to obtain the Seen-Unseen accuracy Curve, exemplified in Fig. 1.

Fig. 4.
figure 4

We contrast the performance of GZSL to multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL uses word2vec (in red) and the idealized visual features (G-attr, in black) as semantic embeddings. (Color figure online)

How Far Are We From the Ideal Performance? Figure 4 displays the Seen-Unseen accuracy Curves for ImageNet-2K; additional plots on ImageNet-2K and similar ones on AwA and CUB are in the Supplementary Material. Clearly, there is a large gap between the performance of GZSL using the default word2vec semantic embeddings and the ideal performance indicated by the multi-class classifiers. Note that the cross marks indicate the results of direct stacking. The multi-class classifiers not only dominate GZSL over the whole range (thus attaining very high AUSUCs) but are also capable of learning classifiers that are well balanced (such that direct stacking works well).

How Much Can Idealized Semantic Embeddings Help? We hypothesize that a large portion of the gap between GZSL and multi-class classification can be attributed to the weak semantic embeddings used by the GZSL approach.

Table 5. Comparison of performances measured in AUSUC between GZSL (using word2vec and G-attr) and multi-class classification on ImageNet-2K. Few-shot results are averaged over 100 rounds. GZSL with G-attr improves upon GZSL with word2vec significantly and quickly approaches multi-class classification performance.
Table 6. Comparison of performances measured in AUSUC between GZSL with word2vec and GZSL with G-attr on the full ImageNet with 21,000 unseen classes. Few-shot results are averaged over 20 rounds.

We investigate this by using a form of idealized semantic embeddings. As the success of zero-shot learning relies heavily on how accurately the semantic embeddings represent visual similarity among classes, we examine the idea of using visual features themselves as semantic embeddings. Concretely, for each class, a semantic embedding is obtained by averaging the visual features of the images belonging to that class. We call these embeddings G-attr as we derive the visual features from GoogLeNet. Note that, for unseen classes, we only use the reserved training examples to derive the semantic embeddings; we do not use their labels to train classifiers.
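A minimal sketch of how G-attr embeddings could be derived, assuming hypothetical feature and label arrays; the optional \(\ell _2\) normalization mirrors how the other non-binary embeddings are treated.

```python
import numpy as np

def g_attr(features, labels, classes):
    """G-attr: per-class semantic embeddings obtained by averaging the GoogLeNet
    features of each class's images (for unseen classes, only the reserved ones)."""
    emb = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Optionally normalize to unit l2 norm, as done for the other embeddings.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```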

Figure 4 shows the performance of GZSL using G-attr: the gap to the multi-class classification performance is significantly reduced compared to GZSL using word2vec. In some cases (see the Supplementary Material for more comprehensive experiments), GZSL can almost match the performance of multi-class classifiers without using any labels from the unseen classes!

How Much Labeled Data Do We Need to Improve GZSL’s Performance? Imagine we are given a budget to label data from unseen classes; how much can those labels improve GZSL’s performance?

Table 5 contrasts the AUSUCs obtained by GZSL to those from multi-class classification on ImageNet-2K, where GZSL is allowed to use visual features as embeddings; those features can be computed from a few labeled images of the unseen classes, a scenario we refer to as “few-shot” learning. Using about 100 (randomly sampled) labeled images per class, GZSL quickly approaches the performance of the multi-class classifiers, which use about 1,000 labeled images per class. Moreover, G-attr visual features used as semantic embeddings improve upon word2vec more significantly under Flat hit@1 than for \(K>1\).

We further examine the whole ImageNet with 20,345 unseen classes in Table 6, where we keep 80% of the unseen classes’ examples to derive G-attr and test on the rest, and we observe similar trends. Specifically, on Flat hit@1, G-attr derived from merely 1 image per class already triples the performance of word2vec, while G-attr from 100 images yields over a tenfold improvement. See the Supplementary Material for details, including results on AwA and CUB.

7 Discussion

Zero-shot learning (ZSL) methods have been studied in the unrealistic setting where the test data are assumed to come from unseen classes only. In contrast, we advocate studying the problem of generalized ZSL, where the test data’s class memberships are unconstrained. Naively using the classifiers constructed by ZSL approaches, however, does not perform well in this generalized setting. Instead, we propose a simple but effective method that balances two conflicting forces: recognizing data from seen classes versus unseen ones. We develop a performance metric to characterize this trade-off and examine the utility of this metric in evaluating various ZSL approaches. Our analysis also leads us to investigate the best possible performance of any ZSL method. We show that there is a large gap between existing approaches and this best possible performance. Moreover, we show that this gap can be reduced significantly if idealized semantic embeddings are used. Thus, an important direction for future research is to improve the quality of semantic embeddings of seen and unseen classes.