Performance Measures and a Data Set for Multi-target, Multi-camera Tracking

Ristani, Ergys; Solera, Francesco; Zou, Roger; Cucchiara, Rita; Tomasi, Carlo

doi:10.1007/978-3-319-48881-3_2

Ergys Ristani¹⁵,
Francesco Solera¹⁶,
Roger Zou¹⁵,
Rita Cucchiara¹⁶ &
…
Carlo Tomasi¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9914))

Included in the following conference series:

European Conference on Computer Vision

18k Accesses
864 Citations

Abstract

To help accelerate progress in multi-target, multi-camera tracking systems, we present (i) a new pair of precision-recall measures of performance that treats errors of all types uniformly and emphasizes correct identification over sources of error; (ii) the largest fully-annotated and calibrated data set to date with more than 2 million frames of 1080 p, 60 fps video taken by 8 cameras observing more than 2,700 identities over 85 min; and (iii) a reference software system as a comparison baseline. We show that (i) our measures properly account for bottom-line identity match performance in the multi-camera setting; (ii) our data set poses realistic challenges to current trackers; and (iii) the performance of our system is comparable to the state of the art.

This material is based upon work supported by the National Science Foundation under grants CCF-1513816 and IIS-1543720 and by the Army Research Office under grant W911NF-16-1-0392.

You have full access to this open access chapter, Download conference paper PDF

A survey of datasets for visual tracking

Article 25 September 2015

TAO: A Large-Scale Benchmark for Tracking Any Object

Long-Term Tracking in the Wild: A Benchmark

Keywords

1 Introduction

Multi-Target, Multi-Camera (MTMC) tracking systems automatically track multiple people through a network of cameras. As MTMC methods solve larger and larger problems, it becomes increasingly important (i) to agree on straightforward performance measures that consistently report bottom-line tracker performance, both within and across cameras; (ii) to develop realistically large benchmark data sets for performance evaluation; and (iii) to compare system performance end-to-end. This paper contributes to these aspects.

Performance Measures. Multi-Target Tracking has been traditionally defined as continuously following multiple objects of interest. Because of this, existing performance measures such as CLEAR MOT report how often a tracker makes what types of incorrect decisions. We argue that some system users may instead be more interested in how well they can determine who is where at all times.

To see this distinction, consider the scenario abstractly depicted in Fig. 1(a) and (c). Airport security is following suspect A spotted in the airport lobby. They need to choose between two trackers, Fig. 1(a) and (c). Both tag the suspect as identity 1 and track him up to the security checkpoint. System Fig. 1(a) makes a single mistake at the checkpoint and henceforth tags the suspect as identity 2, so it loses the suspect at the checkpoint. After the checkpoint, system Fig. 1(c) repeatedly flips the tags for suspect A between 1 and 2, thereby giving police the correct location of the suspect several times also between the checkpoint and the gate, and for a greater overall fraction of the time. Even though system Fig. 1(a) incurs only one ID switch, airport security is likely to prefer system Fig. 1(c), which reports the suspect’s position longer—multiple ID switches notwithstanding—and ultimately leads to his arrest at the gate.

We do not claim that one measure is better than the other, but rather that different measures serve different purposes. Event-based measures like CLEAR MOT help pinpoint the source of some errors, and are thereby informative for the designer of certain system components. In the interest of users in applications such as sports, security, or surveillance, where preserving identity is crucial, we propose two identity-based measures (ID precision and ID recall) that evaluate how well computed identities conform to true identities, while disregarding where or why mistakes occur. Our measures apply both within and across cameras.

Data Set. We make available a new data set that has more than 2 million frames and more than 2,700 identities. It consists of $8\times 85$ min of 1080 p video recorded at 60 fps from 8 static cameras deployed on the Duke University campus during periods between lectures, when pedestrian traffic is heavy. Calibration data determines homographies between images and the world ground plane. All trajectories were manually annotated by five people over a year, using an interface we developed to mark trajectory key points and associate identities across cameras. The resulting nearly 100,000 key points were automatically interpolated to single frames, so that every identity comes with single-frame bounding boxes and ground-plane world coordinates across all cameras in which it appears. To our knowledge this is the first dataset of its kind.

Reference System. We provide code for an MTMC tracker that extends a single-camera system that has shown good performance [1] to the multi-camera setting. We hope that the conceptual simplicity of our system will encourage plug-and-play experimentation when new individual components are proposed.

We show that our system does well on a recently published data set [2] when previously used measures are employed to compare our system to the state of the art. This comparison is only circumstantial because most existing results on MTMC tracking report performance using ground-truth person detections and ground-truth single-camera trajectories as inputs, rather than using the results from actual detectors and single-camera trackers. The literature typically justifies this limitation with the desire to measure only what a multi-camera tracker adds to a single-camera system. This justification is starting to wane as MTMC tracking systems approach realistically useful performance levels. Accordingly, we evaluate our system end-to-end, and also provide our own measures as a baseline for future research.

2 Related Work

We survey prior work on MTMC performance measures, data sets, and trackers.

Measures. We rephrase existing MTMC performance measures as follows.

A fragmentation occurs in frame t if the tracker switches the identity of a trajectory in that frame, but the corresponding ground-truth identity does not change. The number of fragmentations at frame t is $\phi _t$, and $\varPhi = \sum _t \phi _t$.
A merge is the reverse of a fragmentation: The tracker merges two different ground truth identities into one between frames $t'$ and t. The number of merges at frame t is $\gamma _t$, and $\varGamma = \sum _t \gamma _t$.
A mismatch is either a fragmentation or a merge. We define $\mu _t = \phi _t + \gamma _t$ and $M = \sum _t \mu _t$.

When relevant, each of these error counts is given a superscript w (for “within-camera”) when the frames $t'$ and t in question come from the same camera, and a superscript h (for “handover”) otherwise.

The number of false positives $fp_t$ is the number of times the tracker detects a target in frame t where there is none in the ground truth, the number of false negatives $fn_t$ is the number of true targets missed by the tracker in frame t, and $tp_t$ is the number of true positive detections at time t. The capitalized versions TP, FP, FN are the sums of $tp_t$, $fp_t$, and $fn_t$ over all frames (and cameras, if more than one), and the superscripts w and h apply here as well if needed.

Precision and recall are the usual derived measures, $P = TP/(TP + FP)$ and $R = TP/(TP + FN)$.

Single-camera, multi-object tracking performance is typically measured by the Multiple Object Tracking Accuracy (MOTA):

$$\begin{aligned} \text {MOTA} = 1-\frac{FN + FP + \varPhi }{T} \end{aligned}$$

(1)

and related scores (MOTP, MT, ML, FRG) [3–5]. MOTA penalizes detection errors ($FN+FP$) and fragmentations ($\varPhi $) normalized by the total number T of true detections. If extended to the multi-camera case, MOTA and its companions under-report across-camera errors, because a trajectory that covers $n_f$ frames from $n_c$ cameras has only about $n_c$ across-camera detection links between consecutive frames and about $n_f - n_c$ within camera ones, and $n_c \ll n_f$. To address this limitation handover errors [6] and multi-camera object tracking accuracy (MCTA) [2, 7] measures were introduced, which we describe next.

Handover errors focus only on errors across cameras, and distinguish between fragmentations $\varPhi ^h$ and merges $\varGamma ^h$. Fragmentations and merges are divided further into crossing ($\varPhi ^h_X$ and $\varGamma ^h_X$) and returning ($\varPhi ^h_R$ and $\varGamma ^h_R$) errors. These more detailed handover error scores help understand different types of tracker failures, and within-camera errors are quantified separately by standard measures.

MCTA condenses all aspects of system performance into one measure:

$$\begin{aligned} \text {MCTA} = \underbrace{\frac{2PR}{P+R}}_{F_1}\underbrace{\left( 1-\frac{M^w}{T^w}\right) }_{\text {within camera}}\underbrace{\left( 1-\frac{M^h}{T^h}\right) }_{\text {handover}} \;. \end{aligned}$$

(2)

This measure multiplies the $F_1$ detection score (harmonic mean of precision and recall) by a term that penalizes within-camera identity mismatches ($M^w$) normalized by true within-camera detections ($T^w$) and a term that penalizes wrong identity handover mismatches ($M^h$) normalized by the total number of handovers. Consistent with our notation, $T^h$ is the number of true detections (true positives $TP^h$ plus false negatives $FN^h$) that occur when consecutive frames come from different cameras.

Comparing to MOTA, MCTA multiplies within-camera and handover mismatches rather than adding them. In addition, false positives and false negatives, accounted for in precision and recall, are also factored into MCTA through a product. This separation brings the measure into the range [0, 1] rather than $[-\infty , 1]$ as for MOTA. However, the reasons for using a product rather than some other form of combination are unclear. In particular, each error in any of the three terms is penalized inconsistently, in that its cost is multiplied by the (variable) product of the other two terms.

Data Sets. Existing multi-camera data sets allow only for limited evaluation of MTMC systems. Some have fully overlapping views and are restricted to short time intervals and controlled conditions [8–10]. Some sports scenarios provide quality video with many cameras [11, 12], but their environments are severely constrained and there are no blind spots between cameras. Data sets with disjoint views come either with low resolution video [2, 6, 13], a small number of cameras placed along a straight path [2, 6], or scripted scenarios [2, 8–10, 13, 14]. Most importantly, all existing data sets only have a small number of identities. Table 1 summarizes the parameters of existing data sets. Ours is shown in the last row. It contains more identities than all previous data sets combined, and was recorded over the longest time period at the highest temporal resolution (60 fps).

Table 1. Summary of existing data sets for MTMC tracking. Ours is in the last row.

Full size table

Systems. MTMC trackers rely on pedestrian detection [15] and tracking [16] or assume single-camera trajectories to be given [6, 13, 17–26]. Spatial relations between cameras are either explicitly mapped in 3D [13, 19], learned by tracking known identities [25, 27, 28], or obtained by comparing entry/exit rates across pairs of cameras [6, 18, 26]. Pre-processing methods may fuse data from partially overlapping views [29], while some systems rely on completely overlapping and unobstructed views [9, 17, 30–32]. People entry and exit points may be explicitly modeled on the ground [6, 18, 19, 26] or image plane [24, 27]. Travel time is also modeled, either parametrically [13, 27] or not [6, 19, 24–26].

Appearance is captured by color [6, 13, 18–21, 23–25, 27, 29] and texture descriptors [6, 13, 18, 20, 22, 29]. Lighting variations are addressed through color normalization [18], exemplar based approaches [20], or brightness transfer functions learned with [23, 25] or without supervision [13, 19, 24, 29]. Discriminative power is improved by saliency information [33, 34] or learning features specific to body parts [6, 18, 20–23, 27], either in the image [35–37] or back-projected onto an articulated [38, 39] or monolithic [40] 3D body model.

All MTMC trackers employ optimization to maximize the coherence of observations for predicted identities. They first summarize spatial, temporal, and appearance information into a graph of weights $w_{ij}$ that express the affinity of node observations i and j, and then partition the nodes into identities either greedily through bipartite matching or, more generally, by finding either paths or cliques with maximal internal weights. Some contributions are as follows (Table 2):

Table 2. Optimization techniques employed by MTMC systems.

Full size table

In this paper, we extend a previous clique method [1] to formulate within- and across-camera tracking in a unified framework, similarly to previous MTMC flow methods [2, 29]. In contrast with [23], we handle identities reappearing in the same camera and differently from [8, 9] we handle co-occuring observations in overlapping views naturally, with no need for separate data fusion methods.

3 Performance Measures

Current event-based MTMC tracking performance measures count mismatches between ground truth and system output through changes of identity over time. The next two Sections show that this can be problematic both within and across cameras. The Section thereafter introduces our proposed measures.

3.1 Within-Camera Issues

With event-based measures, a truly-unique trajectory that switches between two computed identities over n frames can incur penalties that are anywhere between 1, when there is exactly one switch, and $n-1$, in the extreme case of one identity switch per frame. This can yield inconsistencies if correct identities are crucial. For example, in all cases in Fig. 1, the tracker covers a true identity A with computed identities 1 and 2. Current measures would make cases (b) and (c) equally bad, and (a) much better than the other two.

And yet the key mistake made by the tracker is to see two identities where there is one. To quantify the extent of the mistake, we need to decide which of the two computed identities we should match with A for the purpose of performance evaluation. Once that choice is made, every frame in which A is assigned to the wrong computed identity is a frame in which the tracker is in error.

Since the evaluator—and not the tracker—makes this choice, we suggest that it should favor the tracker to the extent possible. If this is done for each tracker under evaluation, the choice is fair. In all cases in Fig. 1, the most favorable choice is to tie A to 1, because this choice explains the largest fraction of A.

Once this choice is made, we measure the number of frames over which the tracker is wrong—in the example, the number of frames of A that are not matched to 1. In Fig. 1, this measure makes (a) and (b) equally good, and (c) better than either. This penalty is consistent because it reflects precisely what the choice made above maximizes, namely, the number of frames over which the tracker is correct about who is where. In (a) and (b), the tracker matches ground truth 67 % of the time, and in (c) it matches it 83 % of the time.

Figure 1 is about fragmentation errors. It can be reinterpreted in terms of merge errors by exchanging the role of thick and thin lines. In this new interpretation, choosing the longest ground-truth trajectory as the correct match for a given computed trajectory explains as much of the tracker’s output as possible, rather than as much of the ground truth. In both directions, our truth-to-result matching criterion is to let ground truth and tracker output explain as much of each other’s data as possible, in a way that will be made quantitative later on.

3.2 Handover Issues

Event-based measures often evaluate handover errors separately from within-camera errors: Whether a mismatch is within-camera or handover depends on the identities associated to the very last frame in which a trajectory is seen in one camera, and on the very first frame in which it is seen in the next—a rather brittle proposition. In contrast, our measure counts the number of incorrectly matched frames, regardless of other considerations: If only one frame is wrong, the penalty is small. For instance, in the cases shown in Fig. 2, current measures either charge a handover penalty when the handover is essentially correct (a) or fail to charge a handover penalty when the handover is essentially incorrect (b). Our measure charges a one-frame penalty in case (a) and a penalty nearly equal to the trajectory length in camera II in case (b), as appropriate. These cases are not just theoretical. In Sect. 6, we show that 74 % of the 5,549 handovers computed by our tracker in our data set show similar phenomena.

These issues are exacerbated in measures, such as MCTA, that combine measures of within-camera mismatches and handover mismatches into a single value by a product (Eq. 2). If one of the anomalies discussed above changes a within-camera error into a handover error or vice versa, the corresponding contribution to the performance measure can change drastically, because the penalty moves from one term of the product to another: If the product has the form wh (“within” and “handover”), then a unit contribution to w has value h in the product, and changing that contribution from w to h changes its value to w.

3.3 The Truth-To-Result Match

To address these issues, we propose to measure performance not by how often mismatches occur, but by how long the tracker correctly identifies targets. To this end, ground-truth identities are first matched to computed ones. More specifically, a bipartite match associates one ground-truth trajectory to exactly one computed trajectory by minimizing the number of mismatched frames over all the available data—true and computed. Standard measures such as precision, recall, and $F_1$-score are built on top of this truth-to-result match. These scores then measure the number of mismatched or unmatched detection-frames, regardless of where the discrepancies start or end or which cameras are involved.

To compute the optimal truth-to-result match, we construct a bipartite graph $G = (V_T, V_C, E)$ as follows. Vertex set $V_T$ has one “regular” node $\tau $ for each true trajectory and one “false positive” node $f^{+}_{\gamma }$ for each computed trajectory $\gamma $. Vertex set $V_C$ has one “regular” node $\gamma $ for each computed trajectory and one “false negative” node $f^{-}_{\tau }$, for each true trajectory $\tau $. Two regular nodes are connected with an edge $e\in E$ if their trajectories overlap in time. Every regular true node $\tau $ is also connected to its corresponding $f^{-}_{\tau }$, and every regular computed node $\gamma $ is also connected to its corresponding $f^{+}_{\gamma }$.

The cost on an edge $(\tau , \gamma )\in E$ tallies the number of false negative and false positive frames that would be incurred if that match were chosen. Specifically, let $\tau (t)$ be the sequence of detections for true trajectory $\tau $, one detection for each frame t in the set $\mathcal {T}_{\tau }$ over which $\tau $ extends, and define $\gamma (t)$ for $t\in \mathcal {T}_{\gamma }$ similarly for computed trajectories. The two simultaneous detections $\tau (t)$ and $\gamma (t)$ are a miss if they do not overlap in space, and we write

$$\begin{aligned} m(\tau , \gamma , t, \varDelta ) = 1\;. \end{aligned}$$

(3)

More specifically, when both $\tau $ and $\gamma $ are regular nodes, spatial overlap between two detections can be measured either in the image plane or on the reference ground plane in the world. In the first case, we declare a miss when the area of the intersection of the two detection boxes is less than $\varDelta $ (with $0< \varDelta < 1$) times the area of the union of the two boxes. On the ground plane, we declare a miss when the positions of the two detections are more than $\varDelta =1$ meter apart. If there is no miss, we write $m(\tau , \gamma , t, \varDelta ) = 0$. When either $\tau $ or $\gamma $ is an irregular node ($f^{-}_{\tau }$ or $f^{+}_{\gamma }$), any detections in the other trajectory are misses. When both $\tau $ and $\gamma $ are irregular, m is undefined. We define costs in terms of binary misses, rather than, say, Euclidean distances, so that a miss between regular positions has the same cost as a miss between a regular position and an irregular one. Matching two irregular trajectories incurs zero cost because they are empty.

With this definition, the cost on edge $(\tau , \gamma )\in E$ is defined as follows:

$$\begin{aligned} c(\tau , \gamma , \varDelta ) = \underbrace{ \sum \limits _{t \in \mathcal {T}_{\tau }} m(\tau , \gamma , t, \varDelta )}_{\text {False Negatives}} + \underbrace{ \sum \limits _{t \in \mathcal {T}_{\gamma }} m(\tau , \gamma , t, \varDelta )}_{\text {False Positives}} \;. \end{aligned}$$

(4)

A minimum-cost solution to this bipartite matching problem determines a one-to-one matching that minimizes the cumulative false positive and false negative errors, and the overall cost is the number of mis-assigned detections for all types of errors. Every $(\tau , \gamma )$ match is a True Positive ID (IDTP). Every $(f^{+}_{\gamma }, \gamma )$ match is a False Positive ID (IDFP). Every $(\tau , f^{-}_{\tau })$ match is a False Negative ID (IDFN). Every $(f^{+}_{\gamma }, f^{-}_{\tau })$ match is a True Negative ID (IDTN).

The matches $(\tau , \gamma )$ in IDTP imply a truth-to-result match, in that they reveal which computed identity matches which ground-truth identity. In general not every trajectory is matched. The sets

$$\begin{aligned} MT = \{\tau \;|\; (\tau , \gamma ) \in IDTP \} \;\;\;\text { and }\;\;\; MC = \{\gamma \;|\; (\tau , \gamma ) \in IDTP \} \end{aligned}$$

(5)

contain the matched ground-truth trajectories and matched computed trajectories, respectively. The pairs in IDTP can be viewed as a bijection between MT and MC. In other words, the bipartite match implies functions $\gamma = \gamma _{m}(\tau )$ from MT to MC and $\tau = \tau _{m}(\gamma )$ from MC to MT.

3.4 Identification Precision, Identification Recall, and $F_1$ Score

We use the IDFN, IDFP, IDTP counts to compute identification precision (IDP), identification recall (IDR), and the corresponding $F_1$ score $IDF_1$. More specifically,

$$\begin{aligned} IDFN= & {} \sum _{\tau \in AT} \sum \limits _{t \in \mathcal {T}_{\tau }} m(\tau , \gamma _{m}(\tau ), t, \varDelta ) \end{aligned}$$

(6)

$$\begin{aligned} IDFP= & {} \sum _{\gamma \in AC} \sum \limits _{t \in \mathcal {T}_{\gamma }} m(\tau _{m}(\gamma ), \gamma , t, \varDelta ) \end{aligned}$$

(7)

$$\begin{aligned} IDTP= & {} \sum _{\tau \in AT} \text {len}(\tau ) - IDFN \;=\; \sum _{\gamma \in AC} \text {len}(\gamma ) - IDFP \end{aligned}$$

(8)

where AT and AC are all true and computed identities in MT and MC.

$$\begin{aligned} IDP= & {} \frac{IDTP}{IDTP + IDFP} \end{aligned}$$

(9)

$$\begin{aligned} IDR= & {} \frac{IDTP}{IDTP + IDFN} \end{aligned}$$

(10)

$$\begin{aligned} IDF_1= & {} \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN} \end{aligned}$$

(11)

Identification precision (recall) is the fraction of computed (ground truth) detections that are correctly identified. $IDF_1$ is the ratio of correctly identified detections over the average number of ground-truth and computed detections. ID precision and ID recall shed light on tracking trade-offs, while the $IDF_1$ score allows ranking all trackers on a single scale that balances identification precision and recall through their harmonic mean.

Our performance evaluation approach based on the truth-to-result match addresses all the weaknesses mentioned earlier in a simple and uniform way, and enjoys the following desirable properties: (1) Bijectivity: A correct match (with no fragmentation or merge) between true identities and computed identities is one-to-one. (2) Optimality: The truth-to-result matching is the most favorable to the tracker. (3) Consistency: Errors of any type are penalized in the same currency, namely, the number of misassigned or unassigned frames. Our approach also handles overlapping and disjoint fields of view in exactly the same way—a feature absent in all previous measures.

3.5 Additional Comparative Remarks

Measures of Handover Difficulty. Handover errors in current measures are meant to account for the additional difficulty of tracking individuals across cameras, compared to tracking them within a single camera’s field of view. If a system designer were interested in this aspect of performance, a similar measure could be based on the difference between the total number of errors for the multi-camera solution and the sum of the numbers of single-camera errors:

$$\begin{aligned} E_M - E_S \;\;\text { where }\;\; E_M = IDFP_M + IDFN_M \;\;\text { and }\;\; E_S = IDFP_S + IDFN_S \;. \end{aligned}$$

(12)

The two errors can be computed by computing the truth-to-result mapping twice: Once for all the data and once for each camera separately (and then adding the single-camera errors together). The difference above is nonnegative, because the multi-camera solution must account for the additional constraint of consistency across cameras. Similarly, simple manipulation shows that ID precision, ID recall, and $IDF_1$ score are sorted the other way:

$$\begin{aligned} IDP_S - IDP_M \ge 0 \;\;\;\text {,}\;\;\; IDR_S - IDR_M \ge 0 \;\;\;\text {,}\;\;\; F_{1S} - F_{1M} \ge 0 \end{aligned}$$

and these differences measure how well the overall system can associate across cameras, given within-camera associations.

Comparison with CLEAR MOT. The first step in performance evaluation matches true and computed identities. In CLEAR MOT the event-based matching defines the best mapping sequentially at each frame. It minimizes Euclidean distances (within a threshold $\varDelta $) between unmatched detections (true and computed) while matched detections from frame $t-1$ that are still within $\varDelta $ in t are preserved. Although the per-frame identity mapping is 1-to-1, the mapping for the entire sequence is generally many-to-many.

In our identity-based measures, we define the best mapping as the one which minimizes the total number of mismatched frames between true and computed IDs for the entire sequence. Similar to CLEAR MOT, a match at each frame is enforced by a threshold $\varDelta $. In contrast, our reasoning is not frame-by-frame and results in an ID-to-ID mapping that is 1-to-1 for the entire sequence.

The second step evaluates the goodness of the match through a scoring function. This is usually done by aggregating mistakes. MOTA aggregates FP, FN and $\varPhi $ while we aggregate IDFP and IDFN counts. The notion of fragmentation is not present in our evaluation because the mapping is strictly 1-to-1. In other words our evaluation only checks whether every detection of an identity is explained or not, consistently with our definition of tracking. Also, our aggregated mistakes are binary mismatch counts instead of, say, Euclidean distances. This is because we want all errors to be penalized in the same currency. If we were to combine the binary IDFP and IDFN counts with Euclidean distances instead of IDTP, the unit of error would be ambiguous: We won’t be able to tell whether the tracker under evaluation is good at explaining identities longer or following their trajectories closer.

Comparison with Identity-Aware Tracking. Performance scores similar to ours were recently introduced for this specific task [56]. The problem is defined as computing trajectories for a known set of true identities from a database. This implies that the truth-to-result match is determined during tracking and not evaluation. Instead, our evaluation applies to the more general MTMC setting where the tracker is agnostic to the true identities.

4 Data Set

Another contribution of this work is a new, manually annotated, calibrated, multi-camera data set recorded outdoors on the Duke University campus with 8 synchronized cameras (Fig. 3)^{Footnote 1}. We recorded 6,791 trajectories for 2,834 different identities (distinct persons) over 1 h and 25 min for each camera, for a total of more than 10 video hours and more than 2 million frames. There are on average 2.5 single-camera trajectories per identity, and up to 7 in some cases.

The cumulative trajectory time is more than 30 h. Individual camera density varies from 0 to 54 people per frame, depending on the camera. There are 4,159 hand-overs and up to 50 people traverse blind spots at the same time. More than 1,800 self-occlusion events happen (with 50 % or more overlap), lasting 60 frames on average. Our videos are recorded at 1080 p resolution and 60 fps to capture spatial and temporal detail. Two camera pairs (2–8 and 3–5) have small overlapping areas, through which about 100 people transit, while the other cameras are disjoint. Full annotations are provided in the form of trajectories of each person’s foot contact point with the ground. Image bounding boxes are also available and have been semi-automatically generated. The first 5 min of video from all the cameras are set aside for validation or training, and the remaining 80 min per camera are for testing.

Unlike many multi-camera data sets, ours is not scripted and cameras have a wider field of view. Unlike single-camera benchmarks where a tracker is tested on very short videos of different challenging scenarios, our data set is recorded in a fixed environment, and the main challenge is persistent tracking under occlusions and blind spots.

People often carry bags, backpacks, umbrellas, or bicycles. Some people stop for long periods of time in blind spots and the environment rarely constrains their paths. So transition times through blind spots are often but not always informative. 891 people walk in front of only one camera—a challenge for trackers that are prone to false-positive matches across cameras.

Working with this data set requires efficient trackers because of the amount of data to process. To illustrate, it took 6 days on a single computer to generate all the foreground masks with a standard algorithm [57] and 7 days to generate all detections on a cluster of 192 cores using the DPM detector [58]. Computing appearance features for all cameras on a single machine took half a day; computing all tracklets, trajectories, and identities together also took half a day with the proposed system (Sect. 5). People detections and foreground masks are released along with the videos.

Limitations. Our data set covers a single outdoor scene from fixed cameras. Soft lighting from overcast weather could make tracking easier. Views are mostly disjoint, which disadvantages methods that exploit data from overlapping views.

5 Reference System

We provide a reference MTMC tracker that extends to multiple cameras a system that was previously proposed for single camera multi-target tracking [1]. Our system takes target detections from any detection system, aggregates them into tracklets that are short enough to rely on a simple motion model, then aggregates tracklets into single camera trajectories, and finally connects these into multi-camera trajectories which we call identities.

In each of these layers, a graph $\mathcal {G} = (V, E)$ has observations (detections, tracklets, or trajectories) for nodes in V, and edges in E connect any pairs of nodes i, j for which correlations $w_{ij}$ are provided. These are real values in $[-1, 1]$ that measure evidence for or against i and j having the same identity. Values of $\pm \infty $ are also allowed to represent hard evidence. A Binary Integer Program (BIP) solves the correlation clustering problem [59] on $\mathcal {G}$: Partition V so as to maximize the sum of the correlations $w_{ij}$ assigned to edges that connect co-identical observations and the penalties $-w_{ij}$ assigned to edges that straddle identities. Sets of the resulting partition are taken to be the desired aggregates.

Solving this BIP is NP-hard and the problem is also hard to approximate [60], hence the need for our multi-layered solution to keep the problems small. To account for unbounded observation times, solutions are found at all levels over a sliding temporal window, with solutions from previous overlapping windows incorporated into the proper BIP as “extended observations”. For additional efficiency, observations in all layers are grouped heuristically into a number of subgroups with roughly consistent appearance and space-time locations.

Our implementation includes default algorithms for the computation of appearance descriptors and correlations in all layers. For appearance, we use the methods from the previous paper [1] in the first layers and simple striped color histograms [61] for the last layer. Correlations are computed from both appearance features and simple temporal reasoning.

6 Experiments

This Section shows that (i) traditional event based measures are not good proxies for a tracker’s ID precision or ID recall, defined in Sect. 3; (ii) handover errors, as customarily defined, cause frequent problems in practice; and (iii) the performance of our reference system, when evaluated with existing measures, is comparable to that of other recent MTMC trackers. We also give detailed performance numbers for our system on our data under a variety of performance measures, including ours, to establish a baseline for future comparisons.

ID Recall, ID Precision and Mismatches. Figure 4 shows that fragmentations and merges correlate poorly with ID recall and ID precision, confirming that event- and identity-based measures quantify different aspects of performance.

Truth-to-Result Mapping. Section 3 and Fig. 2 describe situations in which traditional, event-based performance measures handle handover errors differently from ours. Figure 5 shows that these discrepancies are frequent in our results.

Table 3. Top Table: MCTA score comparison on the existing NLPR data sets, starting from ground truth single camera trajectories. The last column contains the average dataset ranks. Bottom Table: Single-camera (white background) and multi-camera (grey background) results on our DukeMTMC data set. For each separate camera we report both standard multi-target tracking measures as well as our new measures.

Full size table

Traditional System Performance Analysis. Table 3 (top) compares our reference method to existing ones on the NLPR MCT data sets [2] and evaluates performance using the existing MCTA measure. The results are obtained under the commonly used experimental setup where all systems start with the same input of ground-truth single-camera trajectories. On average, our baseline system ranks second out of six by using our simple default appearance features. The highest ranked method [18] uses features based on discriminative learning.

System Performance Details. Table 3 (bottom) shows both traditional and new measures of performance, both single-camera and multi-camera, for our reference system when run on our data set. This table is meant as a baseline against which new methods may be compared.

From the table we see that our $IDF_1$ score and MOTA do not agree on how they rank the sequence difficulty of cameras 2 and 3. This is primarily because they measure different aspects of the tracker. Also, they are different in the relative value differences. For example, camera 6 appears much more difficult than 7 based on MOTA, but the difference is not as dramatic when results are inspected visually or when $IDF_1$ differences are considered.

7 Conclusion

We define new measures of MTMC tracking performance that emphasize correct identities over sources of error. We introduce the largest annotated and calibrated data set to date for the comparison of MTMC trackers. We provide a reference tracker that performs comparably to the state of the art by standard measures, and we establish a baseline of performance measures, both traditional and new, for future comparisons. We hope in this way to contribute to accelerating advances in this important and exciting field.

Notes

1.
http://vision.cs.duke.edu/DukeMTMC.

References

Ristani, E., Tomasi, C.: Tracking multiple people online and in real time. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9007, pp. 444–459. Springer, Heidelberg (2015)
Google Scholar
Cao, L., Chen, W., Chen, X., Zheng, S., Huang, K.: An equalised global graphical model-based approach for multi-camera object tracking [cs]. arXiv:11502.03532, February 2015
Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP J. Image Video Process. 246309, 1–10 (2008)
Article Google Scholar
Wu, B., Nevatia, R.: Tracking of multiple, partially occluded humans based on static body part detection. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 951–958. IEEE (2006)
Google Scholar
Milan, A., Schindler, K., Roth, S.: Challenges of ground truth evaluation of multi-target tracking. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 735–742. IEEE (2013)
Google Scholar
Kuo, C.-H., Huang, C., Nevatia, R.: Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 383–396. Springer, Heidelberg (2010)
Chapter Google Scholar
Multi-camera Object Tracking Challenge: ECCV Workshop on Visual Surveillance and Re-Identification (2014). http://mct.idealtest.org
Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-camera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 267–282 (2008)
Article Google Scholar
Berclaz, J., Fleuret, F., Türetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell. 33(9), 1806–1819 (2011)
Article Google Scholar
Ferryman, J., Shahrokni, A.: An overview of the PETS 2009 challenge (2009)
Google Scholar
D’Orazio, T., Leo, M., Mosca, N., Spagnolo, P., Mazzeo, P.L.: A semi-automatic system for ground truth generation of soccer video sequences. In: Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2009, pp. 559–564. IEEE (2009)
Google Scholar
De Vleeschouwer, C., Chen, F., Delannay, D., Parisot, C., Chaudy, C., Martrou, E., Cavallaro, A., et al.: Distributed video acquisition and annotation for sport-event summarization. In: NEM summit 2008: Towards Future Media Internet (2008)
Google Scholar
Zhang, S., Staudt, E., Faltemier, T., Roy-Chowdhury, A.: A camera network tracking (CamNeT) dataset and performance baseline. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 365–372, January 2015
Google Scholar
Per, J., Kenk, V.S., Mandeljc, R., Kristan, M., Kovačič, S.: Dana36: a multi-camera image dataset for object identification in surveillance scenarios. In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 64–69. IEEE (2012)
Google Scholar
Benenson, R., Omran, M., Hosang, J., Schiele, B.: Ten years of pedestrian detection, what have we learned? In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014 Workshops. LNCS, vol. 8926, pp. 613–627. Springer, Heidelberg (2015)
Google Scholar
Leal-Taixé, L., Milan, A., Reid, I., Roth, S., Schindler, K.: Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv: 1504.01942, April 2015
Bredereck, M., Jiang, X., Korner, M., Denzler, J.: Data association for multi-object Tracking-by-Detection in multi-camera networks. In: 2012 Sixth International Conference on Distributed Smart Cameras (ICDSC), pp. 1–6, October 2012
Google Scholar
Cai, Y., Medioni, G.: Exploring context information for inter-camera multiple target tracking. In: 2014 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 761–768, March 2014
Google Scholar
Chen, K.W., Lai, C.C., Lee, P.J., Chen, C.S., Hung, Y.P.: Adaptive learning for target tracking and true linking discovering across multiple non-overlapping cameras. IEEE Trans. Multimedia 13(4), 625–638 (2011)
Article Google Scholar
Chen, X., An, L., Bhanu, B.: Multitarget tracking in nonoverlapping cameras using a reference set. IEEE Sens. J. 15(5), 2692–2704 (2015)
Article Google Scholar
Chen, X., Huang, K., Tan, T.: Direction-based stochastic matching for pedestrian recognition in non-overlapping cameras. In: 2011 18th IEEE International Conference on Image Processing (ICIP), pp. 2065–2068, September 2011
Google Scholar
Daliyot, S., Netanyahu, N.S.: A framework for inter-camera association of multi-target trajectories by invariant target models. In: Park, J.-I., Kim, J. (eds.) ACCV Workshops 2012, Part II. LNCS, vol. 7729, pp. 372–386. Springer, Heidelberg (2013)
Chapter Google Scholar
Das, A., Chakraborty, A., Roy-Chowdhury, A.K.: Consistent re-identification in a camera network. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part II. LNCS, vol. 8690, pp. 330–345. Springer, Heidelberg (2014)
Google Scholar
Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006)
Chapter Google Scholar
Javed, O., Shafique, K., Rasheed, Z., Shah, M.: Modeling inter-camera space time and appearance relationships for tracking across non-overlapping views. Comput. Vis. Image Underst. 109(2), 146–162 (2008)
Article Google Scholar
Makris, D., Ellis, T., Black, J.: Bridging the gaps between cameras. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, vol. 2, June 2004
Google Scholar
Jiuqing, W., Li, L.: Distributed optimization for global data association in non-overlapping camera networks. In: 2013 Seventh International Conference on Distributed Smart Cameras (ICDSC), pp. 1–7, October 2013
Google Scholar
Calderara, S., Cucchiara, R., Prati, A.: Bayesian-competitive consistent labeling for people surveillance. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 354–360 (2008)
Article Google Scholar
Zhang, S., Zhu, Y., Roy-Chowdhury, A.: Tracking multiple interacting targets in a camera network. Comput. Vis. Image Underst. 134, 64–73 (2015)
Article Google Scholar
Ayazoglu, M., Li, B., Dicle, C., Sznaier, M., Camps, O.: Dynamic subspace-based coordinated multicamera tracking. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2462–2469, November 2011
Google Scholar
Kamal, A., Farrell, J., Roy-Chowdhury, A.: Information consensus for distributed multi-target tracking. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2403–2410, June 2013
Google Scholar
Hamid, R., Kumar, R., Grundmann, M., Kim, K., Essa, I., Hodgins, J.: Player localization using multiple static cameras for sports visualization. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 731–738, June 2010
Google Scholar
Martinel, N., Micheloni, C., Foresti, G.L.: Saliency weighted features for person re-identification. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 191–208. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_14
Google Scholar
Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
Google Scholar
Bedagkar-Gala, A., Shah, S.: Multiple person re-identification using part based spatio-temporal color appearance model. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1721–1728, November 2011
Google Scholar
Bedagkar-Gala, A., Shah, S.K.: Part-based spatio-temporal model for multi-person re-identification. Pattern Recogn. Lett. 33(14), 1908–1915 (2012). Novel Pattern Recognition-based Methods for Re-identification in Biometric Context
Article Google Scholar
Cheng, D., Cristani, M., Stoppa, M., Bazzani, L., Murino, V.: Custom pictorial structures for re-identification. In: Proceedings of the British Machine Vision Conference, pp. 68.1–68.11. BMVA Press (2011). doi:10.5244/C.25.68
Baltieri, D., Vezzani, R., Cucchiara, R.: Learning articulated body models for people re-identification. In: Proceedings of the 21st ACM International Conference on Multimedia, MM 2013, pp. 557–560. ACM, New York (2013)
Google Scholar
Cheng, D., Cristani, M.: Person re-identification by articulated appearance matching. In: Gong, S., Cristani, M., Yan, S., Loy, C.C. (eds.) Person Re-Identification. Advances in Computer Vision and Pattern Recognition, pp. 139–160. Springer, London (2014)
Chapter Google Scholar
Baltieri, D., Vezzani, R., Cucchiara, R.: Mapping appearance descriptors on 3d body models for people re-identification. Int. J. Comput. Vis. 111(3), 345–364 (2015)
Article MathSciNet Google Scholar
Brendel, W., Amer, M., Todorovic, S.: Multiobject tracking as maximum weight independent set. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1273–1280. IEEE (2011)
Google Scholar
Shu, G., Dehghan, A., Oreifej, O., Hand, E., Shah, M.: Part-based multiple-person tracking with partial occlusion handling. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1815–1821. IEEE (2012)
Google Scholar
Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. Int. J. Comput. Vis. 75(2), 247–266 (2007)
Article Google Scholar
Izadinia, H., Saleemi, I., Li, W., Shah, M.: (MP)$^{2}$T: multiple people multiple parts tracker. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp. 100–114. Springer, Heidelberg (2012)
Chapter Google Scholar
Pirsiavash, H., Ramanan, D., Fowlkes, C.C.: Globally-optimal greedy algorithms for tracking a variable number of objects. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1201–1208. IEEE (2011)
Google Scholar
Zhang, L., Li, Y., Nevatia, R.: Global data association for multi-object tracking using network flows. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8. IEEE (2008)
Google Scholar
Butt, A.A., Collins, R.T.: Multiple target tracking using frame triplets. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part III. LNCS, vol. 7726, pp. 163–176. Springer, Heidelberg (2013)
Chapter Google Scholar
Chari, V., Lacoste-Julien, S., Laptev, I., Sivic, J.: On pairwise costs for network flow multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5537–5545 (2015)
Google Scholar
Collins, R.T.: Multitarget data association with higher-order motion models. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1744–1751. IEEE (2012)
Google Scholar
Dehghan, A., Assari, S.M., Shah, M.: Gmmcp tracker: globally optimal generalized maximum multi clique problem for multiple object tracking. In: CVPR, vol. 1, p. 2 (2015)
Google Scholar
Kumar, R., Charpiat, G., Thonnat, M.: Multiple object tracking by efficient graph partitioning. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9006, pp. 445–460. Springer, Heidelberg (2015)
Google Scholar
Shafique, K., Shah, M.: A noniterative greedy algorithm for multiframe point correspondence. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 51–65 (2005)
Article Google Scholar
Tang, S., Andres, B., Andriluka, M., Schiele, B.: Subgraph decomposition for multi-target tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5033–5041 (2015)
Google Scholar
Wen, L., Li, W., Yan, J., Lei, Z., Yi, D., Li, S.Z.: Multiple target tracking based on undirected hierarchical relation hypergraph. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1282–1289. IEEE (2014)
Google Scholar
Roshan Zamir, A., Dehghan, A., Shah, M.: GMCP-tracker: global multi-object tracking using generalized minimum clique graphs. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 343–356. Springer, Heidelberg (2012)
Chapter Google Scholar
Yu, S.I., Meng, D., Zuo, W., Hauptmann, A.: The solution path algorithm for identity-aware multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3871–3879 (2016)
Google Scholar
Yao, J., Odobez, J.M.: Multi-layer background subtraction based on color and texture. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8. IEEE (2007)
Google Scholar
Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
Article Google Scholar
Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Foundations of Computer Science (2002)
Google Scholar
Tan, J.: A note on the inapproximability of correlation clustering (2008)
Google Scholar
Liu, C., Gong, S., Loy, C.C., Lin, X.: Person re-identification: what features are important? In: Fusiello, A., Murino, V., Cucchiara, R. (eds.) ECCV 2012. LNCS, vol. 7583, pp. 391–401. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33863-2_39
Chapter Google Scholar
Chen, W., Cao, L., Chen, X., Huang, K.: A novel solution for multi-camera object tracking. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 2329–2333. IEEE (2014)
Google Scholar

Download references

Author information

Authors and Affiliations

Computer Science Department, Duke University, Durham, USA
Ergys Ristani, Roger Zou & Carlo Tomasi
Department of Engineering, University of Modena and Reggio Emilia, Modena, Italy
Francesco Solera & Rita Cucchiara

Authors

Ergys Ristani
View author publications
You can also search for this author in PubMed Google Scholar
Francesco Solera
View author publications
You can also search for this author in PubMed Google Scholar
Roger Zou
View author publications
You can also search for this author in PubMed Google Scholar
Rita Cucchiara
View author publications
You can also search for this author in PubMed Google Scholar
Carlo Tomasi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ergys Ristani .

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Gang Hua
Facebook AI Research (FAIR), Menlo Park, California, USA
Hervé Jégou

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C. (2016). Performance Measures and a Data Set for Multi-target, Multi-camera Tracking. In: Hua, G., Jégou, H. (eds) Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science(), vol 9914. Springer, Cham. https://doi.org/10.1007/978-3-319-48881-3_2

Download citation

DOI: https://doi.org/10.1007/978-3-319-48881-3_2
Published: 03 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48880-6
Online ISBN: 978-3-319-48881-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Performance Measures and a Data Set for Multi-target, Multi-camera Tracking

Abstract