
1 Introduction

Person re-identification is the problem of matching a person acquired by disjoint cameras at different time instants. The problem has recently gained increasing attention (see [1] for a recent survey) due to its open challenges, such as changes in viewing angle, background clutter, and occlusions. To address these issues, existing approaches either seek the best feature representations (e.g., [2–4]) or propose to learn optimal matching metrics (e.g., [5–7]). While they have obtained reasonable performance on commonly used datasets (e.g., [8–10]), we believe that these approaches have not yet considered a fundamental related problem: how to learn from the data continuously collected in an installed system and adapt existing models to this new data. This is an important problem to address if re-identification methods are to work over long time-scales.

To illustrate the problem, let us consider a simplified scenario in which, at every time instant, a considerable amount of visual data is generated by two cameras. From each camera we obtain a large set of probe and gallery persons that have to be matched. Since this task evolves over time, it is unlikely that the a priori selected features or the learned model return the correct gallery match for every probe at any instant. In addition, after each such match is computed, the information provided by the considered images is discarded. This results in a loss of valuable information which could have been used to update the model, thus ideally yielding better performance over time.

The above problem could be overcome if the data were exploited in a continuous learning process in which the model is updated with every single probe-gallery match. Since we do not know whether a match is correct or not, the model might be updated with wrong information. To tackle this issue, each match could be manually labeled, but doing so for a large corpus of data is clearly impossible. However, if the human labor is kept to a minimum, the model can ideally be adapted over time without compromising performance. Thus, the main idea of the paper is a person re-identification solution based on an incremental adaptation of the learned model with a human in the loop.

Fig. 1. Illustration of the re-identification pipeline highlighting our contribution. Dashed lines indicate the training stage, solid lines the deployment stage. Existing methods do not consider the information provided by a matched probe-gallery pair to update the model. We propose to use such information to improve the model performance by adapting it to the dynamic environmental variations.

Contributions: As shown in Fig. 1, this work brings in two main contributions: (i) an incremental learning algorithm that allows the model to be adapted over time, and (ii) a method to reduce the human labeling effort required to properly update the model. These objectives are achieved as follows.

  (i) We propose a low-rank sparse similarity-dissimilarity metric learning method (Sect. 3.2) which

      (a) learns two low-rank projections onto discriminant manifolds providing optimal embeddings for a similarity and a dissimilarity measure;

      (b) introduces sparsity-inducing regularizers that allow identification and exploitation of the most discriminative dimensions for matching; and

      (c) is trained in an incremental fashion through a stochastic derivation of the Alternating Direction Method of Multipliers (ADMM) [11].

  (ii) We introduce an unsupervised graph-based approach which, for every probe, identifies only the most relevant gallery persons among a large set of available ones (Sect. 3.3). Such a set, obtained by exploiting dominant sets clustering [12], contains the most informative gallery persons, which are first provided to the human labeler and then exploited to update the model.

To substantiate our contributions we have conducted experiments on three benchmark datasets for person re-identification. Results demonstrate that (i) the proposed approach for identifying the most informative gallery persons yields better re-identification performance than using completely labeled data; and (ii) the proposed low-rank sparse similarity-dissimilarity approach, trained in an incremental fashion with such informative gallery persons, hence with significantly less manual labor, performs on par with or even better than state-of-the-art methods trained on \(100\,\%\) labeled data. In fact, with only \(15\,\%\) labeled data we improve the previous best rank 1 result by more than \(8\,\%\) on the PRID450S dataset. These experiments show how re-identification models can be continuously adapted over time with limited human effort and without sacrificing performance.

2 Relation to Existing Work

The person re-identification problem has been studied from different perspectives, ranging from partially seen persons [13] to low resolution images [14] – also considered in camera networks [15], which can eventually be synthesized in the open-world re-identification idea [16]. In the following, we focus on metric and active learning methods relevant to our work.

Metric Learning approaches focus on learning discriminant metrics which aim to yield an optimal matching score/distance between a gallery and a probe image.

Since the early work of [17], many different solutions have been introduced [18]. In the re-identification field, metric learning approaches have been proposed by relaxing [19] or enforcing [20] positive semi-definite (PSD) conditions, as well as by considering equivalence constraints [21–23]. While most of the existing methods capture the global structure of the dissimilarity space, local solutions [24–27] have been proposed too. Following the success of both approaches, methods combining them in ensembles [5, 7, 28] have been introduced.

Different solutions yielding similarity measures have also been investigated by proposing to learn listwise [29] and pairwise [30] similarities as well as mixtures of polynomial kernel-based models [9]. Related to these similarity learning models are the deep architectures which have been exploited to tackle the task [31–33].

With respect to all such methods, the closest ones to our approach are [6, 20]. Specifically, in [6], the authors jointly exploit the metric in [21] and learn a low-rank projection onto a subspace with a discriminative Euclidean distance. The solution is obtained through generalized eigenvalue decomposition. In [20], a soft-margin PSD constrained metric with low-rank projections is learned via a proximal gradient method. Both works exploit a batch optimization approach.

Though sharing the idea of finding discriminative low-rank projections, our method differs significantly from these works. Specifically, we introduce (i) an incremental learning procedure along with a stochastic ADMM solver which can handle noisy observations of the true data; (ii) a low-rank similarity-dissimilarity metric learning which brings a significant performance gain with respect to each of its components; and (iii) additional sparsity regularizers on the low-rank projections that allow self-discovery of the relevant components of the underlying manifold.

Active Learning: In an effort to bypass tedious labeling of training data there has been recent interest in “active learning” [34] to intelligently select unlabeled examples for the experts to label in an interactive manner.

This can be achieved by choosing one sample at a time by maximizing the value of information [35], reducing the expected error [36], or minimizing the resultant entropy of the system [37]. More recently, works selecting batches of unlabeled data by exploiting classifier feedback to maximize informativeness and sample diversity [38, 39] were proposed. Specific application areas in computer vision include, but are not limited to, tracking [40], scene classification [35, 41], semantic segmentation [42], video annotation [43] and activity recognition [44].

Active learning has been a relatively unexplored area in person re-identification. Including the human in the loop has been investigated in [8, 45, 46]. These methods focus on post-ranking solutions and exploit human labor to refine the initial results by relying on full [8] or partial [45] image selection. In [46], the authors introduce an active learning strategy that exploits mid-level attributes to train a set of attribute predictors aiding active selection of images.

Different from such approaches, in our method human labor is not required to improve the post-rank visual search, but to reliably update the learned model over time. We do not rely on additional attribute predictors, whose training calls for a large number of annotated attributes. By bypassing the need for attribute annotation, we reduce both the computational complexity and the additional manual effort. We introduce a graph-based solution that exploits the information provided by a single probe-gallery match as well as the information shared between all the persons in the entire gallery. With this, a small set of highly informative probe-gallery pairs is delivered to the human, whose effort is thus limited.

3 Temporal Model Adaptation for Re-identification

An overview of the proposed solution is illustrated in Fig. 2. Specifically, to achieve model adaptation over time, we first introduce a similarity-dissimilarity metric learning approach which can be trained in an incremental fashion (Sect. 3.2). Then, to limit the human labeling effort required to properly update the model, we propose an unsupervised graph-based approach that identifies only the most informative probe-gallery samples (Sect. 3.3).

Fig. 2. Proposed temporal model adaptation scheme. An off-line procedure exploits labeled image pairs to train the initial similarity-dissimilarity model. As new unlabeled pairs arrive, a score for each of them is computed using the learned model. These scores are then used to identify a relevant set of gallery persons for each probe. Such a set, containing the most informative samples, is exploited to construct the relevant pairs, which are first provided to the human annotator and then used to update the model.

3.1 Preliminaries

Let \(\mathcal {P}= \{\mathbf {I}_p\}_{p=1}^{|\mathcal {P}|}\) and \(\mathcal {G}= \{\mathbf {I}_g\}_{g=1}^{|\mathcal {G}|}\) be the set of probe and gallery images acquired by two disjoint cameras. Let \(\mathbf {x}_p\in \mathbb {R}^{d}\) and \(\mathbf {x}_g\in \mathbb {R}^{d}\) be the feature representations of \(\mathbf {I}_p\) and \(\mathbf {I}_g\) of two persons p and g. Let \(\mathcal {X}= \{(\mathbf {x}_p, \mathbf {x}_g; y_{p,g})^{(i)}\}_{i=1}^{n}\) denote the training set of \(n = |\mathcal {P}|\times |\mathcal {G}|\) probe-gallery pairs where \(y_{p,g}\in \{-1, +1\}\) indicates whether p and g are the same person (\(+1\)) or not (\(-1\)). Finally, let an iteration be a parameter update computed by visiting a single sample and let an epoch denote a complete cycle on the training set.
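As a concrete illustration of this notation, the following sketch (ours, not the authors' code) materializes the training set \(\mathcal {X}\) of all \(n = |\mathcal {P}|\times |\mathcal {G}|\) probe-gallery pairs; feature extraction is assumed to have been performed elsewhere.

```python
def build_pair_set(X_probe, X_gallery, probe_ids, gallery_ids):
    """Materialize the training set X of Sect. 3.1.

    X_probe:   (|P|, d) array of probe features x_p
    X_gallery: (|G|, d) array of gallery features x_g
    probe_ids, gallery_ids: person identity of each row
    Returns a list of (x_p, x_g, y) with y = +1 for a true match, -1 otherwise.
    """
    pairs = []
    for p, x_p in enumerate(X_probe):
        for g, x_g in enumerate(X_gallery):
            y = 1 if probe_ids[p] == gallery_ids[g] else -1
            pairs.append((x_p, x_g, y))
    return pairs
```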

3.2 Low-Rank Sparse Similarity-Dissimilarity Learning

Objective: The image feature representations \(\mathbf {x}\) might be very high-dimensional and contain non-discriminative components. Hence, learning a metric in such a feature space might yield non-optimal generalization performance. To overcome this problem we propose to learn a low-rank metric which self-determines the discriminative dimensions of the underlying manifold.

Towards such an objective, inspired by the success of similarity learning on image retrieval tasks [47–49], we propose to learn a similarity function

$$\begin{aligned} \sigma _{\mathbf {K}}(\mathbf {x}_p, \mathbf {x}_g) = \mathbf {x}_p^T \mathbf {K}^T \mathbf {K}\mathbf {x}_g\end{aligned}$$
(1)

parameterized by the low-rank projection matrix \(\mathbf {K}\in \mathbb {R}^{r\times d}\), with \(r \ll d\). This provides an embedding in which the dot product between the projected feature vectors is “large” if p and g are the same person, “small” otherwise. The similarity function is then coupled with the output of a metric learning solution that aims to find a matrix \(\mathbf {P}\in \mathbb {R}^{r\times d}\) that projects the high-dimensional vectors to a low-dimensional manifold with a discriminative Euclidean dissimilarity

$$\begin{aligned} \delta _{\mathbf {P}}(\mathbf {x}_p, \mathbf {x}_g) = \Vert \mathbf {P}\mathbf {x}_p- \mathbf {P}\mathbf {x}_g\Vert _{2}^{2} = (\mathbf {x}_p-\mathbf {x}_g)^T \mathbf {P}^T \mathbf {P}(\mathbf {x}_p-\mathbf {x}_g) \end{aligned}$$
(2)

which is “small” if p and g are the same person, “larger” otherwise. This results in the score function

$$\begin{aligned} S_{\mathbf {K},\mathbf {P}}(p,g)= y_{p,g}(\underbrace{\sigma _{\mathbf {K}}(\mathbf {x}_p, \mathbf {x}_g)}_{ \uparrow \text {for } p=g,\ \downarrow \text {for } p\ne g} - \underbrace{(1/2) \delta _{\mathbf {P}}(\mathbf {x}_p, \mathbf {x}_g)}_{\downarrow \text {for } p=g,\ \uparrow \text {for } p\ne g}) \end{aligned}$$
(3)

which, when included in a margin hinge loss, yields

$$\begin{aligned} \ell _{\mathbf {K},\mathbf {P}}(p,g)= \max \left( 0, 1 - S_{\mathbf {K},\mathbf {P}}(p,g)\right) . \end{aligned}$$
(4)

Notice that zero loss is achieved if \(S_{\mathbf {K},\mathbf {P}}(p,g)\ge 1\), i.e., when the difference between \(\sigma _{\mathbf {K}}\) and \(\frac{1}{2} \delta _{\mathbf {P}}\) is either greater than or equal to 1 for positive pairs or less than or equal to \(-1\) for negative ones. In other cases a linear penalty is paid.
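For concreteness, the following NumPy sketch evaluates Eqs. (1)–(4) for a single pair; \(\mathbf {K}\) and \(\mathbf {P}\) are \(r\times d\) arrays and \(\mathbf {x}_p\), \(\mathbf {x}_g\) are d-dimensional vectors. This is an illustrative implementation, not the authors' released code.

```python
import numpy as np

def similarity(K, x_p, x_g):
    # sigma_K(x_p, x_g) = x_p^T K^T K x_g  (Eq. 1)
    return (K @ x_p) @ (K @ x_g)

def dissimilarity(P, x_p, x_g):
    # delta_P(x_p, x_g) = ||P x_p - P x_g||_2^2  (Eq. 2)
    diff = P @ (x_p - x_g)
    return diff @ diff

def score(K, P, x_p, x_g, y):
    # S_{K,P}(p, g) = y * (sigma_K - 0.5 * delta_P)  (Eq. 3)
    return y * (similarity(K, x_p, x_g) - 0.5 * dissimilarity(P, x_p, x_g))

def hinge_loss(K, P, x_p, x_g, y):
    # l_{K,P}(p, g) = max(0, 1 - S_{K,P})  (Eq. 4): zero when the margin is
    # met, linear penalty otherwise
    return max(0.0, 1.0 - score(K, P, x_p, x_g, y))

# Example usage with random d = 8 features projected to r = 3 dimensions
rng = np.random.default_rng(0)
K, P = rng.standard_normal((3, 8)), rng.standard_normal((3, 8))
x_p, x_g = rng.standard_normal(8), rng.standard_normal(8)
print(hinge_loss(K, P, x_p, x_g, y=+1))
```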

Obtaining the low-rank projections through Eq. (4) with a fixed r implies that such a value should be carefully selected before the learning process begins. To overcome this problem, we impose additional constraints on the low-rank projection matrices. In particular, the \(\ell _{2,1}\) norm has been shown to perform robust feature selection through the induced group sparsity [50–53]. Motivated by such findings, we can set \(r=d\) and then leverage an \(\ell _{2,1}\) norm regularizer to drive the rows of \(\mathbf {P}\) and \(\mathbf {K}\) to decay to zero. This corresponds to rejecting non-discriminative dimensions of the underlying manifold.

Let \(\varOmega _{\mathbf {K},\mathbf {P}}= \alpha \Vert \mathbf {K}\Vert _{2,1} + \beta \Vert \mathbf {P}\Vert _{2,1}\) be the cost associated with the low-rank projection matrix regularizers where \(\alpha \) and \(\beta \) are the corresponding trade-off parameters controlling the regularization strength. Then, considering that we want to optimize the empirical risk over \(\mathcal {X}\), we can write our objective as

$$\begin{aligned} \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {K},\mathbf {P}} \mathcal {J}_{\mathbf {K},\mathbf {P}}+ \varOmega _{\mathbf {K},\mathbf {P}}\qquad \text{ where } \qquad \mathcal {J}_{\mathbf {K},\mathbf {P}}= \frac{1}{n} \sum _{i=1}^{n} \ell _{\mathbf {K},\mathbf {P}}\left( p^{(i)},g^{(i)}\right) \end{aligned}$$
(5)

and \(p^{(i)}\) and \(g^{(i)}\) denote the identities of persons p and g in the i-th pair of \(\mathcal {X}\).
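The objective of Eq. (5) can then be evaluated as the empirical hinge loss plus the row-wise \(\ell _{2,1}\) penalties. The sketch below makes the group-sparsity structure explicit; it is an illustration rather than the original implementation.

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the Euclidean norms of the rows of M; rows driven to
    # zero correspond to rejected (non-discriminative) manifold dimensions
    return np.linalg.norm(M, axis=1).sum()

def objective(K, P, pairs, alpha, beta):
    """Empirical risk J_{K,P} plus the regularizer Omega_{K,P} (Eq. 5).

    pairs: iterable of (x_p, x_g, y) tuples as defined in Sect. 3.1.
    """
    losses = []
    for x_p, x_g, y in pairs:
        s = y * ((K @ x_p) @ (K @ x_g)                     # sigma_K, Eq. (1)
                 - 0.5 * np.sum((P @ (x_p - x_g)) ** 2))   # delta_P, Eq. (2)
        losses.append(max(0.0, 1.0 - s))                   # hinge loss, Eq. (4)
    return np.mean(losses) + alpha * l21_norm(K) + beta * l21_norm(P)
```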

Incremental Learning: The objective function in Eq. (5) is a sum of two functions which are both convex but non-smooth. A solution to this kind of problem that allows us to perform incremental updates can be obtained using the ADMM optimization algorithm [11].

ADMM solves optimization problems defined by means of the corresponding augmented Lagrangian. By introducing two additional constraints \(\mathbf {K}-\mathbf {U}=\mathbf {0}\) and \(\mathbf {P}-\mathbf {V}=\mathbf {0}\) we can define the augmented Lagrangian for Eq. (5) as

$$\begin{aligned} L_{\mathbf {K}, \mathbf {P}, \mathbf {U}, \mathbf {V}, \mathbf {\Lambda }, \mathbf {\Psi }}&= \mathcal {J}_{\mathbf {K},\mathbf {P}}+ \varOmega _{\mathbf {U},\mathbf {V}}+ \langle \mathbf {\Lambda }, \mathbf {K}- \mathbf {U}\rangle + \langle \mathbf {\Psi }, \mathbf {P}- \mathbf {V}\rangle \nonumber \\&\quad + \frac{\rho }{2} \left( \vert \vert {\mathbf {K}-\mathbf {U}}\vert \vert ^{2}_{F} + \vert \vert {\mathbf {P}-\mathbf {V}}\vert \vert ^{2}_{F}\right) \end{aligned}$$
(6)

where \(\mathbf {\Lambda }\in \mathbb {R}^{r\times d}\) and \(\mathbf {\Psi }\in \mathbb {R}^{r\times d}\) are two Lagrangian multipliers, \(\langle \cdot ,\cdot \rangle \) denotes the inner product, \(\vert \vert {\cdot }\vert \vert _{F}\) is the Frobenius norm, and \(\rho > 0\) is a penalty parameter.

To solve the optimization problem, at each epoch s, ADMM alternately minimizes \(L\) with respect to a single parameter, \(\mathbf {K}\), \(\mathbf {P}\), \(\mathbf {U}\), \(\mathbf {V}\), \(\mathbf {\Lambda }\) or \(\mathbf {\Psi }\), keeping the others fixed. The result of each minimization gives the updated parameter.

Standard deterministic ADMM implicitly assumes true data values are available, hence overlooking the existence of noise [54]. Noticing that only \(\mathbf {K}\) and \(\mathbf {P}\) depend on the data samples, we define the corresponding update rules using the scalable stochastic ADMM approach [55, 56] which can handle such an issue.

Update K and P: Let \(\nabla _{\mathbf {K}}\ell \) and \(\nabla _{\mathbf {P}}\ell \) denote the subgradient components of Eq. (4), computed for all samples, with respect to \(\mathbf {K}\) and \(\mathbf {P}\), respectively. Then, at each iteration t, i.e., for the t-th random sample, we compute

(7)
(8)

where \(\eta \) is the step size and \(\mathbf {\tilde{K}}^{(t)}\) and \(\mathbf {\tilde{P}}^{(t)}\) denote the parameters for a specific iteration t, while \(\mathbf {K}^{(s)}\) and \(\mathbf {P}^{(s)}\) represent the parameters obtained for epoch s. Once T iterations are completed, the two low-rank matrices are updated as

$$\begin{aligned} \mathbf {K}^{(s+1)}= \frac{1}{T}\sum _{t=1}^{T}\mathbf {\tilde{K}}^{(t)}\qquad \quad \mathbf {P}^{(s+1)}= \frac{1}{T}\sum _{t=1}^{T}\mathbf {\tilde{P}}^{(t)} \end{aligned}$$
(9)
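Since the exact primal update rules of the scalable stochastic ADMM [55, 56] are not reproduced here, the sketch below uses a plain stochastic subgradient step on the augmented Lagrangian of Eq. (6) as an illustrative stand-in, followed by the averaging of Eq. (9). The subgradient expressions follow directly from Eqs. (1)–(4); the update form itself is an assumption of this sketch.

```python
import numpy as np

def loss_subgradients(K, P, x_p, x_g, y):
    # Subgradients of the hinge loss (Eq. 4) w.r.t. K and P for one pair.
    d = x_p - x_g
    s = y * ((K @ x_p) @ (K @ x_g) - 0.5 * np.sum((P @ d) ** 2))
    if 1.0 - s <= 0.0:                                  # zero loss: zero subgradient
        return np.zeros_like(K), np.zeros_like(P)
    gK = -y * (K @ (np.outer(x_g, x_p) + np.outer(x_p, x_g)))
    gP = y * (P @ np.outer(d, d))
    return gK, gP

def primal_epoch(K, P, U, V, Lam, Psi, pairs, eta, rho, T, rng):
    # One epoch s: T stochastic steps on L w.r.t. K and P, then averaging (Eq. 9).
    K_t, P_t = K.copy(), P.copy()
    K_sum, P_sum = np.zeros_like(K), np.zeros_like(P)
    for _ in range(T):
        x_p, x_g, y = pairs[rng.integers(len(pairs))]   # t-th random sample
        gK, gP = loss_subgradients(K_t, P_t, x_p, x_g, y)
        K_t = K_t - eta * (gK + Lam + rho * (K_t - U))  # gradient of Eq. (6) in K
        P_t = P_t - eta * (gP + Psi + rho * (P_t - V))  # gradient of Eq. (6) in P
        K_sum += K_t
        P_sum += P_t
    return K_sum / T, P_sum / T                         # K^(s+1), P^(s+1)
```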

Update U and V: To derive the updates for the two regularizers, we first compute the partial derivatives of Eq. (6) with respect to \(\mathbf {U}\) and \(\mathbf {V}\) while keeping other parameters fixed. Then, solving for a stationary point yields

$$\begin{aligned} \mathbf {U}^{(s+1)}&= \left( \mathbf {K}^{(s+1)}_{i,:}+ \mathbf {\Lambda }^{(s)}_{i,:}/\rho \right) \max \Bigl (0, 1- \alpha / \left( \rho \vert \vert { \mathbf {K}^{(s+1)}_{i,:}+\mathbf {\Lambda }^{(s)}_{i,:}/\rho }\vert \vert _2 \right) \Bigr ) \end{aligned}$$
(10)
$$\begin{aligned} \mathbf {V}^{(s+1)}&= \left( \mathbf {P}^{(s+1)}_{i,:}+ \mathbf {\Psi }^{(s)}_{i,:}/\rho \right) \max \Bigl (0, 1- \beta / \left( \rho \vert \vert {\mathbf {P}^{(s+1)}_{i,:}+\mathbf {\Psi }^{(s)}_{i,:}/\rho }\vert \vert _2 \right) \Bigr ) \end{aligned}$$
(11)

whose closed-form solutions have been obtained using the group soft-thresholding technique [51]; the index \(i=1,\ldots ,r\) denotes the i-th row of the corresponding parameter matrix.
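A direct implementation of the row-wise group soft-thresholding in Eqs. (10)–(11) could look as follows; the small constant guarding against zero-norm rows is our own addition.

```python
import numpy as np

def group_soft_threshold(M, Dual, rho, reg):
    """Row-wise group soft-thresholding for the U and V updates (Eqs. 10-11).

    M:    K^(s+1) or P^(s+1)
    Dual: the matching multiplier Lambda^(s) or Psi^(s)
    reg:  the trade-off parameter alpha (for U) or beta (for V)
    """
    A = M + Dual / rho
    row_norms = np.linalg.norm(A, axis=1, keepdims=True)
    # scale each row by max(0, 1 - reg / (rho * ||row||)); guard zero rows
    scale = np.maximum(0.0, 1.0 - reg / (rho * np.maximum(row_norms, 1e-12)))
    return A * scale

# U_next = group_soft_threshold(K_next, Lam, rho, alpha)   # Eq. (10)
# V_next = group_soft_threshold(P_next, Psi, rho, beta)    # Eq. (11)
```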

Update \(\mathbf {\Lambda }\) and \(\mathbf {\Psi }\): Results from Eq. (9) and Eqs. (10)–(11) can finally be used to update the dual variables, i.e., the Lagrangian multipliers, as

$$\begin{aligned} \mathbf {\Lambda }^{(s+1)}&= \mathbf {\Lambda }^{(s)}+ \rho (\mathbf {K}^{(s+1)}- \mathbf {U}^{(s+1)}) \end{aligned}$$
(12)
$$\begin{aligned} \mathbf {\Psi }^{(s+1)}&= \mathbf {\Psi }^{(s)}+ \rho (\mathbf {P}^{(s+1)}- \mathbf {V}^{(s+1)}) \end{aligned}$$
(13)

To conclude, after S epochs have been performed, the optimal estimates for the two low-rank projection matrices are given by \(\mathbf {K}^{(S)}\) and \(\mathbf {P}^{(S)}\).
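Putting the pieces together, a schematic training loop over S epochs, reusing the primal_epoch and group_soft_threshold helpers sketched above, could look as follows. The random initialization of \(\mathbf {K}\) and \(\mathbf {P}\) is an assumption of this sketch.

```python
import numpy as np

def train(pairs, d, r, alpha, beta, eta, rho, S, T, seed=0):
    # Full training loop: S epochs of the stochastic ADMM scheme of Sect. 3.2.
    rng = np.random.default_rng(seed)
    K = 0.01 * rng.standard_normal((r, d))
    P = 0.01 * rng.standard_normal((r, d))
    U, V = K.copy(), P.copy()
    Lam, Psi = np.zeros((r, d)), np.zeros((r, d))
    for s in range(S):
        K, P = primal_epoch(K, P, U, V, Lam, Psi, pairs, eta, rho, T, rng)  # Eq. (9)
        U = group_soft_threshold(K, Lam, rho, alpha)                        # Eq. (10)
        V = group_soft_threshold(P, Psi, rho, beta)                         # Eq. (11)
        Lam = Lam + rho * (K - U)                                           # Eq. (12)
        Psi = Psi + rho * (P - V)                                           # Eq. (13)
    return K, P          # K^(S), P^(S): the learned low-rank projections
```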

3.3 Model Adaptation with Reduced Human Effort

In the previous section we have presented a similarity-dissimilarity learning model which can be trained in an incremental fashion. To achieve model adaptation over time, we propose to perform incremental steps to minimize Eq. (6) with new image pairs that are progressively acquired as time passes. This requires human labeling of such pairs. To limit such a manual effort and improve model generalization, we aim to select only a small set of informative gallery persons to update the model. These are persons for which the positive/negative association with the probe is very uncertain. Given a probe, such gallery persons form its probe relevant set.

Probe Relevant Set Selection: Let \(\mathcal {H}= \{ \mathbf {x}_p, \mathbf {x}_g\ | \ g=1,\ldots ,|\mathcal {G}| \}\) denote the probe-gallery set for probe p. We represent such a set as an undirected graph with no loops. More precisely, let \(G= (V, E, \mathbf {W})\) denote a graph where \(V= \{p, g \ | \ g=1, \ldots , |\mathcal {G}|\}\) is the set of vertices, \(E\subseteq V\times V\) is the set of edges, and \(\mathbf {W}\in \mathbb {R}^{|V|\times |V|}_{+}\) denotes the symmetric adjacency matrix of positive edge weights such that, for any two vertices i and j, \(\mathbf {W}_{i,j} = f(S_{\mathbf {K},\mathbf {P}}(i,j))\) if \(i\ne j\), and \(\mathbf {W}_{i,j} = 0\) otherwise. \(f(\cdot )\) is the Platt function [57], used to ensure positive edge weights.
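As an illustration, the adjacency matrix \(\mathbf {W}\) can be assembled as below; the plain logistic used as \(f(\cdot )\) is a stand-in for the fitted Platt function [57], and the unlabeled score \(\sigma _{\mathbf {K}} - \frac{1}{2}\delta _{\mathbf {P}}\) (Eq. (3) without the label y) is assumed.

```python
import numpy as np

def build_graph(K, P, x_probe, X_gallery):
    """Adjacency matrix W for the probe-gallery graph of Sect. 3.3.

    Node 0 is the probe, nodes 1..|G| are the gallery persons. Edge weights
    are f(score); here f is a logistic used as a stand-in for Platt scaling.
    """
    feats = np.vstack([x_probe[None, :], X_gallery])        # (|V|, d)
    n = feats.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            x_i, x_j = feats[i], feats[j]
            s = (K @ x_i) @ (K @ x_j) - 0.5 * np.sum((P @ (x_i - x_j)) ** 2)
            W[i, j] = W[j, i] = 1.0 / (1.0 + np.exp(-s))    # positive weight
    return W
```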

To obtain the probe relevant set, we aim to cluster G in such a way that (i) a cluster contains the probe and gallery persons which are similar to each other, and (ii) all persons outside a cluster should be dissimilar to the ones inside. To achieve such an objective, we exploit the dominant sets clustering technique [12].

Dominant set clustering partitions a graph into dominant sets on the basis of the coherency between vertices as measured by the edge weights. A dominant set is a subset of the graph nodes having high internal and low external coherency.

To obtain such partitions, the dominant sets approach relies on the participation vector \(\mathbf {h}\), whose entries express the probability that the corresponding persons participate in the cluster. More precisely, the objective is

$$\begin{aligned} \hat{\mathbf {h}}= \mathop {{{\mathrm{arg\,max}}}}\limits _{\mathbf {h}} \mathbf {h}^T \mathbf {W}\mathbf {h}\qquad \qquad \text{ s.t. } \quad \mathbf {h}\in \mathcal {S} \end{aligned}$$
(14)

where \(\mathcal {S}\) is the standard simplex of \(\mathbb {R}^{|V|}\).

Let the participation vector be initialized to a uniform distribution, i.e., \(h_i = 1/|V|\) for \(i=1,\ldots ,|V|\). Then, as shown in [12], a solution to the optimization problem can be obtained by an iterative procedure that, at each iteration k, updates the participation vector as

$$\begin{aligned} h_i^{(k+1)} = h_i^{(k)} \frac{(\mathbf {W}\mathbf {h}^{(k)})_i}{(\mathbf {h}^{(k)})^T \mathbf {W}\mathbf {h}^{(k)}} \qquad \qquad \text{ for } \ i = 1, \ldots , |V| \end{aligned}$$
(15)

The iterative updates are applied as long as the difference in the objective function between two consecutive iterations is higher than a predefined threshold \(\epsilon \). When this condition is no longer satisfied, a local optimum has been reached and the non-zero entries of the participation vector \(\hat{\mathbf {h}}\) specify the nodes included in the dominant set. Notice that dominant sets clustering can easily be extended to partition a graph into multiple dominant sets. This is obtained by removing the person identities included in the current dominant set from \(\mathcal {H}\), creating the new graph structure, and then repeating the process. In our approach this procedure is applied until the dominant set containing the probe person p is found. This is the probe relevant set for person p and is denoted as \(\mathcal {D}_{p} = \{ i \ | \ i \ne p \wedge h_i > 0 \}\).
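A compact sketch of the replicator-dynamics iteration of Eq. (15) and of the peeling procedure that returns the probe relevant set \(\mathcal {D}_p\) is given below; the numerical threshold used to decide which entries of \(\hat{\mathbf {h}}\) are non-zero is our own choice.

```python
import numpy as np

def dominant_set(W, eps=0.1, max_iter=1000):
    # Replicator-dynamics iteration of Eq. (15), started from the uniform
    # participation vector h_i = 1/|V| and stopped once the objective h^T W h
    # (Eq. 14) changes by less than eps between consecutive iterations.
    n = W.shape[0]
    h = np.full(n, 1.0 / n)
    obj = h @ W @ h
    for _ in range(max_iter):
        denom = h @ W @ h
        if denom == 0.0:                 # degenerate graph with all-zero weights
            break
        h = h * (W @ h) / denom
        new_obj = h @ W @ h
        if abs(new_obj - obj) < eps:
            break
        obj = new_obj
    return h

def probe_relevant_set(W, probe_idx=0, eps=0.1):
    # Peel off dominant sets until the one containing the probe is found; its
    # non-zero entries (other than the probe itself) form D_p.
    active = list(range(W.shape[0]))
    while active:
        h = dominant_set(W[np.ix_(active, active)], eps)
        members = [active[i] for i, hi in enumerate(h) if hi > 1e-8]
        if not members:
            return []
        if probe_idx in members:
            return [i for i in members if i != probe_idx]
        active = [i for i in active if i not in set(members)]
    return []
```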

Incremental Model Update: Armed with the probe relevant set, we can now achieve temporal model adaptation by performing the incremental learning steps described in Sect. 3.2. Towards this objective, we first ask the human annotator to label only the probe relevant pairs in \(\{(\mathbf {I}_p, \mathbf {I}_g) \ | \ g \in \mathcal {D}_p \}\). Then, using the current parameters \(\mathbf {K}\) and \(\mathbf {P}\) as a “warm-restart”, we exploit the newly labeled samples to run \(\hat{S}\) epochs, each providing \(\hat{T}\) incremental iterations. When such a process is completed the updated model parameters \(\mathbf {K}\) and \(\mathbf {P}\) are obtained.
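Schematically, the whole adaptation step for a new batch could be driven as follows, assuming the helpers sketched earlier in this section and a hypothetical ask_human() function standing in for the annotator.

```python
def adapt_on_new_batch(K, P, U, V, Lam, Psi, probes, gallery,
                       alpha, beta, eta, rho, S_hat, T_hat, ask_human, rng):
    # Temporal model adaptation (Sect. 3.3): for every probe, select its probe
    # relevant set via dominant sets, have the human label only those pairs,
    # then warm-restart the stochastic ADMM updates of Sect. 3.2.
    labeled_pairs = []
    for x_p in probes:
        W = build_graph(K, P, x_p, gallery)            # scores from current model
        for g in probe_relevant_set(W, probe_idx=0):
            x_g = gallery[g - 1]                       # node 0 is the probe
            y = ask_human(x_p, x_g)                    # +1 / -1 annotation
            labeled_pairs.append((x_p, x_g, y))
    if not labeled_pairs:
        return K, P, U, V, Lam, Psi
    for _ in range(S_hat):                             # warm-restarted epochs
        K, P = primal_epoch(K, P, U, V, Lam, Psi, labeled_pairs,
                            eta, rho, T_hat, rng)
        U = group_soft_threshold(K, Lam, rho, alpha)
        V = group_soft_threshold(P, Psi, rho, beta)
        Lam = Lam + rho * (K - U)
        Psi = Psi + rho * (P - V)
    return K, P, U, V, Lam, Psi
```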

3.4 Discussion

Through the preceding sections we have introduced two main contributions that allow us to achieve model adaptation over time. Specifically, the goal has been achieved (i) by proposing a stochastic similarity-dissimilarity metric learning procedure that can be incrementally updated, and (ii) by introducing a graph-based approach that allows us to identify the most informative pairs that should be labeled by the human. All the steps are summarized in Algorithm 1.


MLAPG [20] and XQDA [6], which learn a discriminant subspace as well as a distance function in the learned subspace, are the closest to the proposed approach. However, neither of them updates the model over time. In addition, our solution differs in the stochastic ADMM optimization, the combination of both a similarity and a dissimilarity measure, and the sparsity regularization.

4 Experimental Results

Datasets: We evaluated our approach on three publicly available benchmark datasets, namely VIPeR [58], PRID450S [59], and Market1501 [60] (see Fig. 3 for a few sample images). Following the literature, we run 10 trials on the VIPeR and PRID450S datasets, while we use the available partitions for Market1501. We report the average performance using the Cumulative Matching Characteristic (CMC). We refer to our method as Temporal Model Adaptation (\(\mathrm {TMA}\)).

VIPeR [58] is considered one of the most challenging datasets. It contains 1,264 images of 632 persons viewed by two cameras. Most image pairs have viewpoint changes larger than \(90^\circ \). Following the general protocol, we split the dataset into a training and a test set each including 316 persons.

Fig. 3. 15 image pairs from the (a) VIPeR, (b) PRID450S and (c) Market1501 datasets. Columns correspond to different persons, rows to different cameras.

PRID450S [59] is a more recent dataset containing 450 persons viewed by two disjoint cameras with viewpoint changes, background interference and partial occlusions. As done in the literature [61, 62], we partitioned the dataset into a training and a test set, each containing 225 individuals.

Market1501 [60] is the largest currently available person re-identification dataset. It contains 32,668 images of 1,501 persons taken from 6 disjoint cameras. Multiple images of the same person have been obtained by means of a state-of-the-art detector, thus providing a realistic setup. To run the experiments, we used the available code to obtain the same BoW feature representation as well as the same train/test partitions, containing 750 and 751 person identities, respectively.

Implementation: To model person appearance we adopted the Local Maximal Occurrence (LOMO) representation [6]. We selected \(\alpha = 0.001\), \(\beta = 0.001\), \(\eta = 1\), and \(\rho =1\) by performing 5-fold cross validation over \(\{1, 0.5, 0.1, 0.05, 0.01, 0.001\}\). The temporal model adaptation followed the common batch framework used in active learning [34], partitioning each training set into 4 disjoint batches. Due to the adopted randomization procedure, each batch contains approximately \(z=(|\mathcal {P}|/4)\times (|\mathcal {G}|/4)\) pairs. We used the first batch to train the initial model with \(T=2z\), \(S=200\), and no further stopping criteria. The remaining ones were used for the batch-incremental updates with \(\hat{T}=2z\) and \(\hat{S}=150\) (in the following, the subscript of \(\mathrm {TMA}\) indicates the number of model updates that have been performed, one per batch). Finally, to select the relevant gallery images in each batch we set \(\epsilon =0.1\) (see Table 5).

Table 1. Comparison with state-of-the-art methods on the VIPeR dataset. Best results for each rank are in boldface font.

4.1 State-of-the-art Comparisons

In the following we compare the results of our approach with those of existing methods. In addition to the incremental performance, we also provide our results when no model adaptation is exploited and all the training data is included in a single batch (\(\mathrm {TMA}_0\)).

VIPeR: Results in Table 1 show that our approach outperforms recent solutions even when only about \(5\,\%\) of the data is used. This result indicates that, partially due to the feature representation (see the results of KISSME in Table 4), our approach produces a solution robust to viewpoint variations. Incremental updates make \(\mathrm {TMA}_{4}\) the second best. In such a case, only LMF+LADF performs better. However, that approach is a combination of two methods, which, as shown in [5], generally improves the performance. Indeed, a rank 1 recognition rate of \(48.19\,\%\) is achieved by summing the \(\mathrm {TMA}_0\) and LADF scores. If the same batches as \(\mathrm {TMA}_{1-4}\) are considered to train LADF, the fused rank 1 performances are \(35.6\,\%\), \(37.9\,\%\), \(40.8\,\%\) and \(43.4\,\%\), respectively, which represents an average improvement of \(11\,\%\) over standalone LADF.

Finally, results obtained with \(\mathrm {TMA}_0\) show that the best rank 1 is achieved, but the performance at higher ranks is slightly worse than the one obtained using incremental updates (\(\mathrm {TMA}_4\)). Hence, using all the available data requires additional manual labor and might even lead to decreasing performance. This strengthens our contribution by showing that, by identifying the most informative samples to train with, better results can be achieved with reduced human effort.

PRID450S: In Table 2 we report the performance comparison between existing methods and our approach on the PRID450S dataset. Results show that our solution outperforms the methods used for comparison regardless of the amount of data used for training. In particular, using only \(14.25\,\%\) of the data, an \(8\,\%\) improvement with respect to the best existing approach is obtained at rank 1. By training only with the initially available data (i.e., \(\mathrm {TMA}_1\)), our solution outperforms SCNCDFinal [61], which, on the VIPeR dataset, had better performance (until the \(3^{rd}\) batch update). This may suggest that our approach is robust to the background clutter and occlusions from which PRID450S suffers.

Table 2. Comparison with state-of-the-art methods on the PRID 450S dataset. Best results for each rank are in boldface font.
Table 3. Rank 1 and mAP performance comparison with existing methods on the Market 1501 dataset. Best result is in boldface font.

Market1501: Comparisons of our approach with existing methods on the Market1501 dataset are shown in Table 3. The obtained performance is consistent with that achieved on the VIPeR and PRID450S datasets. Our approach performs significantly better than the methods used for comparison even when using only \(5.23\,\%\) of labeled data. Incremental updates bring in relevant improvements and with \(\mathrm {TMA}_4\) we achieve the best rank 1 recognition rate, i.e., \(44.74\,\%\). Using the LOMO feature representation instead of the BoW one provided by [60] yields about a \(3\,\%\) rank 1 performance gain. Results on this dataset demonstrate that our approach can scale to a real scenario and achieve competitive performance with significantly less manual labor. The reason for the improved performance with much less training data is that our method identifies the most discriminating examples to train with, and does not waste labeling effort on those that would add little or no value to the re-identification accuracy.

4.2 Influence of the Temporal Model Adaptation Components

To better understand the achieved performance, we have run additional experiments by separately considering the similarity-dissimilarity metric learning approach and the probe relevant set selection method.

Similarity-Dissimilarity Metric: In the following, we first analyze the contribution of the similarity and the dissimilarity components. Then, we compare our performance with existing methods using the same LOMO representation.

Fig. 4. Comparison of the similarity-dissimilarity learning components. (a)–(d) show the results on the VIPeR dataset computed using incremental batch updates. For each curve, the percentage of manually labeled samples is indicated in parentheses. The inset shows the results for the rank range 1–10.

Contribution of the components: In Fig. 4, we report the results obtained using either the learned similarity, the learned dissimilarity, or both. Results show that most of the performance contribution is provided by the dissimilarity. The similarity has significantly lower performance and calls for more labeled pairs. This is due to the fact that the majority of the edges of the corresponding graph have weak weights, thus causing the maximization procedure to select more samples before the stopping condition is met. Enforcing agreement on a specific pair by jointly optimizing the similarity and the dissimilarity measure results in the best performance. With respect to the dissimilarity alone, this yields a negligible increase in manual labor and improved results (\(7\,\%\) at rank 1).

Table 4. Comparison with metric learning approaches on the VIPeR dataset. Results obtained using truncated projections (100 dimensions) are given for three representative ranks. Last row shows the percentage of manually labeled samples. Best results for each rank are in bold. Most of the results are from [20].

Comparison with existing methods: In Table 4, we compare our similarity-dissimilarity approach with general state-of-the-art metric learning approaches, namely ITML [65], LMNN [66], and LDML [67], and with re-identification specific ones, namely PRDC [30], KISSME [21], LADF [24], XQDA [6], and MLAPG [20]. To provide a fair comparison, we used the same settings as in [20]. Precisely, the 100 principal components found by PCA have been exploited to train LMNN, ITML, KISSME, and LADF. Since the other methods, i.e., XQDA, PRDC, LDML, MLAPG and \(\mathrm {TMA}\), are able to discover the discriminative features, we used all the principal components. For a fair comparison, the projections learned by XQDA, MLAPG and \(\mathrm {TMA}\) were truncated to 100 dimensions.

Results in Table 4 show that our approach, trained with only \(4.91\,\%\) of the available data, achieves the \(4^{th}\) best rank 1 result. As shown in Fig. 4, this successful result is due to the competition between the similarity and the dissimilarity components. Performing incremental updates yields significant improvements and, after the \(4^{th}\) update is completed, the best rank 1 recognition rate is achieved. At higher ranks, \(\mathrm {TMA}\) performs on par with other methods but with substantially fewer labeled pairs (i.e., \(15.77\,\%\) of all possible annotations).

Discussion: Results have demonstrated that, while the dissimilarity metric has more impact on the performance, enforcing competition with the similarity measure yields better results. Additional evaluations showed that removing the \(\ell _{2,1}\) norms causes a degradation of \(3\,\%\). Comparisons with existing approaches have shown that, under the same conditions, our approach achieves good results using only 1/6 of the data. Incremental updates produce considerable improvements with a significantly reduced human effort. This substantiates the benefits of the proposed similarity-dissimilarity learning approach and demonstrates the feasibility of temporal model adaptation for the task.

Probe Relevant Set Selection: In the following, we provide an analysis of the graph-based solution used to identify the most informative gallery persons. We report on the effects of the \(\epsilon \) parameter, then compare with three alternative selection criteria.

Table 5. Analysis of the \(\epsilon \) parameter used to obtain the probe relevant set. Each entry in the table shows the rank 1 performance as well as the percentage of labeled data (in brackets). Best results for each rank are in bold.

Influence of \(\epsilon \): To verify the influence of the \(\epsilon \) parameter, we have computed the results in Table 5. These show that large values of \(\epsilon \) produce coarse, under-segmented sets, hence identify a large number of relevant pairs to label. Small values of \(\epsilon \), e.g., 0.01, produce over-segmented graphs, hence small dominant sets. Indeed, after the \(4^{th}\) update, less than \(10\,\%\) of all the available pairs have been used for training. This results in similar performance improvements, but with a different manual effort. The reason behind this is that, in the former case, the probe relevant sets contain additional persons which are not “similar” to the probe or to any other gallery person. This causes the model to be updated with uninformative pairs which weaken its discriminative power. In the latter case, too few informative pairs are found and the model overfits such samples.

Fig. 5. Re-identification performance on the VIPeR dataset computed using four different probe relevant set selection criteria. (a)–(c) show the performances achieved using the \(2^{nd}\)–\(4^{th}\) batch incremental updates. The percentage of manually labeled samples is given within parentheses. The inset shows the results on a log scale for the reduced rank range 1–50.

Selection Criteria Comparison: In Fig. 5, we compare our probe relevant set selection approach with three different criteria. Before exploiting such criteria, we applied Platt scaling [57] to the obtained scores to get the probability of each probe-gallery pair being positive.

  (i) Unsupervised: Each pair having probability less than 0.5 has been assigned the negative label; the remaining ones have been assigned the positive label.

  (ii) Semi-Supervised: The top and bottom 20 ranked pairs have been labeled as positive and negative, respectively. The remaining pairs have been human labeled.

  (iii) Supervised: Every pair has been human labeled.

Results show that, using the unsupervised or the semi-supervised criterion, the performance obtained with incremental updates tends to decrease. This behavior is due to the fact that, right after the first update, the produced scores induce very small or very large probabilities. This yields zero manual labor, but, as a consequence, the model is updated with a large portion of mislabeled samples. Using our solution, the performance reaches that obtained with the fully supervised approach. In particular, with the \(4^{th}\) batch update our approach yields the highest rank 1 recognition rate (\(41.46\,\%\) vs \(39.87\,\%\)) with \(5\,\%\) less manual labor. Additional experiments considering a human mislabeling error \(C\in \{5,\ldots ,95\}\%\) show that the model update is effective when \(C\le 15\,\%\).

Discussion: In this section, we have shown that our approach is moderately sensitive to the selection of \(\epsilon \), which, to some extent, controls the human effort. In addition, it performs better than a fully supervised approach in which all the samples are manually labeled. This demonstrates that the proposed approach identifies the most informative pairs that should be used to update the model.

4.3 Computational Complexity

In Table 6, we compare the computational performance of deterministic ADMM and our stochastic solution. While achieving similar rank 1 performance, deterministic ADMM brings in more complexity, hence its training time is considerably higher. In particular, while d might be arbitrarily large, n and K are usually small (both depend on the number of samples which are manually labeled), thus our solution is more desirable in a continuous learning scenario.

Finally, notice that, while the initial training is more expensive than for existing approaches, e.g., KISSME [21], the proposed incremental learning solution is more effective in the long term since, unlike others, it does not require re-training from scratch.

Table 6. Comparison between deterministic ADMM and our stochastic solution. VIPeR results computed by running MATLAB code on an Intel Xeon 2.6 GHz. Complexity is reported for the parameter updates, which differ between the two solutions.

5 Conclusion

In this paper we have proposed a person re-identification approach based on the temporal adaptation of the learned model with a human in the loop. First, to allow temporal adaptation, we have proposed a similarity-dissimilarity metric learning approach which can be trained in an incremental fashion by means of a stochastic version of the ADMM optimization method. Then, to update the model with the proper information, we have included the human in the loop and proposed a graph-based approach to select the most informative pairs that should be manually labeled. The selection of informative pairs is obtained through the dominant sets graph partitioning technique. Experiments conducted on three datasets have shown that performance similar to or better than that of existing methods can be achieved with significantly less manual labor.