
1 Introduction

Object tracking is a common task in computer vision, with a long history spanning decades [29, 42, 48]. Despite considerable progress in the field, object tracking remains a challenging task. Current trackers perform well on established benchmarks such as OTB [46, 47] and VOT [21,22,23,24,25,26]. However, most of these datasets are fairly small and do not fully represent the challenges faced when tracking objects in the wild (Fig. 1).

Fig. 1. Examples of tracking from our novel TrackingNet Test set.

Following the rise of deep learning in computer vision, the tracking community is currently embracing data-driven learning methods. Most trackers submitted to the annual VOT17 challenge [22] use deep features, whereas such features were absent from earlier editions like VOT13 [25] and VOT14 [26]. In addition, nine of the ten top-performing trackers in VOT17 [22] rely on deep features, outperforming the previous state-of-the-art trackers. However, the tracking community still lacks a dedicated large-scale dataset to train deep trackers. As a consequence, deep trackers are often restricted to using models pretrained for object classification [6] or to object detection datasets such as ImageNet Videos [40]. For example, SiameseFC [2] and CFNet [43] show outstanding results by training Convolutional Neural Networks (CNNs) specifically for tracking.

Since classical trackers rely on handcrafted features and existing tracking datasets are small, there has been no clear split between data used for training and testing. Recent benchmarks [22, 33] now consider putting aside a sequestered test set to provide a fair comparison. Hence, it is common to see trackers developed and trained on the OTB [47] dataset before competing on VOT [24]. Note that VOT15 [23] is sampled from existing datasets like OTB100 [47] and ALOV300 [41], resulting in overlapping sequences (e.g. basketball, car, singer). Even though this redundancy is limited, one needs to be careful when selecting training video sequences, since training deep trackers on testing videos would be unfair. As a result, there is usually not enough data to train deep networks for tracking, and data from other fields are used to pre-train models, which is a limiting factor for certain architectures.

In this paper, we present TrackingNet, a large-scale object tracking dataset designed to train deep trackers. Our dataset has several advantages. First, the large training set enables the development of deep architectures designed specifically for tracking. Second, the specificity of the dataset for object tracking enables novel architectures to focus on the temporal context between consecutive frames; current large-scale object detection datasets do not provide data densely annotated in time. Third, TrackingNet represents real-world scenarios by sampling over YouTube videos. As such, TrackingNet videos contain a rich distribution of object classes, which we enforce to be shared between training and testing. Last, we evaluate tracker performance on a sequestered testing set with a similar distribution over object classes and motion. Trackers do not have access to the annotations of these videos but can obtain results and insights through an evaluation server.

Contributions. (i) We present TrackingNet, the first large-scale dataset for object tracking. We analyze the characteristics, attributes and uniqueness of TrackingNet when compared with other datasets (Sect. 3). (ii) We provide insights into different techniques to generate dense annotations from coarse ones. We show that most trackers can produce accurate and reliable dense annotations over 1 second-long intervals (Sect. 4). (iii) We provide an extended baseline for state-of-the-art trackers benchmarked on TrackingNet. We show that pretraining deep models on TrackingNet can improve their performance on other datasets by increasing their metrics by up to 1.7% (Sect. 5).

2 Related Work

In the following, we provide an overview of the various research directions in object tracking. The tasks in the field can be clustered into multi-object tracking [27, 33] and single-object tracking [24, 47]. The former focuses on tracking multiple instances of class-specific objects, relying on strong and fast object detection algorithms and on estimating associations between consecutive frames. The latter is the target of this work. It approaches the problem via tracking-by-detection, which consists of two main components: model representation, either generative [19, 39] or discriminative [14, 49], and object search, a trade-off between computational cost and dense sampling of the region of interest.

Correlation Filter Trackers. In recent years, correlation filter (CF) trackers [1, 4, 16, 17] have emerged as the most common, fastest and most accurate category of trackers. CF trackers learn a filter at the first frame, which represents the object of interest. This filter localizes the target in successive frames before being updated. The main reason behind the impressive performance of CF trackers lies in the approximate dense sampling achieved by circulantly shifting the target patch samples [17]. Also, the remarkable runtime performance is achieved by efficiently solving the underlying ridge regression problem in the Fourier domain [4]. Since the inception of CF trackers with single-channel features [4, 17], they have been extended with kernels [16], multi-channel features [9] and scale adaptation [30]. In addition, many works enhance the original formulation by adapting the regression target [3], adding context [12, 35], spatially regularizing the learned filters and learning continuous filters [10].
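To make the closed-form Fourier-domain solution concrete, the sketch below shows a minimal single-channel correlation filter in the spirit of MOSSE [4]: the filter is obtained by solving the ridge regression in the Fourier domain, and detection reduces to an element-wise product followed by an inverse FFT. The Gaussian target width, the regularization value and the function names are illustrative; multi-channel features, kernels, cosine windowing and model updates are omitted.

```python
import numpy as np

def train_cf(patch, sigma=2.0, lam=1e-2):
    """Learn a single-channel correlation filter from one grayscale target patch.

    Minimal sketch of the closed-form ridge regression in the Fourier domain
    (MOSSE/DCF style): H* = (G . conj(F)) / (F . conj(F) + lambda).
    """
    h, w = patch.shape
    # Desired response: a Gaussian peak centred on the target.
    ys, xs = np.mgrid[:h, :w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    F = np.fft.fft2(patch)
    G = np.fft.fft2(np.fft.ifftshift(g))
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H, patch):
    """Correlate the learned filter with a search patch and return the
    displacement of the response peak from the patch centre."""
    resp = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(resp), resp.shape)
    h, w = patch.shape
    # Wrap displacements larger than half the patch (circular correlation).
    dy = dy - h if dy > h / 2 else dy
    dx = dx - w if dx > w / 2 else dx
    return dx, dy
```

The dense sampling mentioned above comes for free: the circular correlation implicitly evaluates the filter at every cyclic shift of the patch.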

Deep Trackers. Besides the CF trackers that use deep features from object detection networks, few works explore more complete deep learning approaches. A first approach consists of learning generic features on a large-scale object detection dataset and successively fine-tuning domain-specific layers to become target-specific in an online fashion. MDNET [36] shows the success of such a method by winning the VOT15 [23] challenge. A second approach consists of training a fully convolutional network and using a feature map selection method to choose between shallow and deep layers during tracking [45]. The goal is to find a good trade-off between general semantic and more specific discriminative features, as well as to remove noisy and irrelevant feature maps.

While both of these approaches achieve state-of-the-art results, their computational cost prevents them from being deployed in real applications. A third approach consists of using Siamese networks that predict motion between consecutive frames. Such trackers are usually trained offline on a large-scale dataset using either deep regression [15] or a CNN matching function [2, 13, 43]. Due to their simple architecture and lack of online fine-tuning, only a forward pass has to be executed at test time. This results in very fast run-times (up to 100 fps on a GPU) while achieving competitive accuracy. However, since the model is not updated at test time, the accuracy highly depends on how well the training dataset captures the appearance nuisances that occur while tracking various objects. Such approaches would benefit from a large-scale dataset like the one we propose in this paper.
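As a rough illustration of the matching-function idea (not any specific published architecture), the sketch below cross-correlates a template embedding with a search-region embedding; the shared embedding CNN that produces both feature maps is assumed and not shown.

```python
import numpy as np

def siamese_score_map(search_feat, templ_feat):
    """Sliding-window cross-correlation of a template embedding with a
    search-region embedding, in the style of SiameseFC-like matching [2].

    Both inputs are (channels, H, W) feature maps produced by a shared
    embedding network (not shown); the peak of the returned score map
    gives the predicted displacement of the target in feature-map units.
    """
    c, th, tw = templ_feat.shape
    _, sh, sw = search_feat.shape
    score = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(score.shape[0]):
        for j in range(score.shape[1]):
            window = search_feat[:, i:i + th, j:j + tw]
            score[i, j] = np.sum(window * templ_feat)
    return score
```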

Object Tracking Datasets. Numerous datasets are available for object tracking, the most common ones being OTB [47], VOT [24], ALOV300 [41] and TC128 [31] for single-object tracking and MOT [27, 33] for multi-object tracking. VIVID [5] is an early attempt to build a tracking dataset for surveillance purposes. OTB50 [46] and OTB100 [47] provide 51 and 98 video sequences, respectively, annotated with 11 different attributes and upright bounding boxes for each frame. TC128 [31] comprises 129 videos, based on similar attributes and upright bounding boxes. ALOV300 [41] comprises 314 video sequences labelled with 14 attributes. VOT [24] proposes several challenges with up to 60 video sequences. It introduced rotated bounding boxes as well as extensive studies on object tracking annotations. VOT-TIR is a specific dataset from VOT focusing on Thermal InfraRed videos. NUS PRO [28] gathers an application-specific collection of 365 videos for people and rigid object tracking. UAV123 and UAV20L [34] gather another application-specific collection of 123 videos and 20 long videos captured from a UAV or generated from a flight simulator. NfS [11] provides a set of 100 videos captured at a high frame rate, in an attempt to focus on fast motion. Table 1 provides a detailed overview of the most popular tracking datasets.

Despite the availability of several datasets for object tracking, large-scale datasets are necessary to train deep trackers. Therefore, current deep trackers rely on object detection datasets such as ImageNet Video [40] or Youtube-BoundingBoxes [38]. Those datasets provide bounding boxes on videos, but the annotations are relatively sparse in time or given at a low frame rate; thus, they lack motion information about the object dynamics in consecutive frames. Still, they are widely used to pre-train deep trackers, since they provide deep feature representations with object knowledge that can be transferred from detection to tracking.

Table 1. Comparison of current datasets for object tracking.

3 TrackingNet

In this section, we introduce TrackingNet, a large-scale dataset for object tracking. TrackingNet assembles a total of 30,643 video segments with an average duration of 16.6s. All the 14,431,266 frames extracted from the 140 hours of visual content are annotated with a single upright bounding box. We provide a comparison with other tracking datasets in Table 1 and Fig. 2.

Fig. 2. Comparison of tracking datasets distributed across the number of videos and the average length of the videos. The size of the circles is proportional to the number of annotated bounding boxes. Our dataset has the largest number of videos and frames, while the video length remains reasonable for short video tracking.

Our work attempts to bridge the gap between data-hungry deep trackers and scarcely available large-scale datasets. Our proposed tracking dataset is larger than the previous largest one by two orders of magnitude. We build TrackingNet to address object tracking in the wild; therefore, the dataset covers a large variety of frame rates, resolutions, contexts and object classes. In contrast with previous tracking datasets, TrackingNet is split between training and testing. We carefully select 30,132 training videos from Youtube-BoundingBoxes [38] and build a novel set of 511 testing videos with a distribution similar to the training set.

3.1 From YT-BB to TrackingNet Training Set

Youtube-BoundingBoxes (YT-BB) [38] is a large scale dataset for object detection. This dataset consists of approximately 380,000 video segments, annotated every second with upright bounding boxes. Those videos are gathered directly from YouTube, with a wide diversity in resolution, frame rate and duration.

Since YT-BB focuses on object detection, the object class is provided along with the bounding boxes. The dataset proposes a list of 23 object classes representative of the videos available on the YouTube platform. For the sake of tracking, we remove the object classes that lack motion by definition, in particular potted plant and toilet. Since the person class represents 25% of the annotations, we split it into 7 different classes based on their context. Overall, the distribution of the object classes in TrackingNet is shown in Fig. 3.

Fig. 3. Definition of object classes and macro classes.

To ensure decent video quality for tracking purposes, we filter out 90% of the videos based on attribute criteria. First, we avoid short segments by removing videos shorter than 15 seconds. Second, we only consider bounding boxes that cover less than 50% of the frame. Last, we preserve segments that contain at least a reasonable amount of motion between bounding boxes, as sketched below. During this filtering, we preserve the original distribution of the 21 object classes provided by YT-BB, to prevent bias in the dataset. We end up with a training set of 30,132 videos, which we split into 12 training subsets, each of which contains 2,511 videos and preserves the original YT-BB object class distribution.
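The filtering logic can be summarized by a small sketch like the following; the `segment` structure, the normalized box format and the motion threshold are our assumptions, since the text only specifies the 15-second and 50%-coverage rules together with a "reasonable amount of motion".

```python
def keep_segment(segment, min_len_s=15.0, max_area_frac=0.5, min_motion=0.01):
    """Decide whether a YT-BB segment is kept for the TrackingNet training set.

    `segment` is assumed to expose a duration in seconds and a list of 1-fps
    annotations as (x, y, w, h) boxes normalised to [0, 1]. The motion
    threshold is illustrative; the paper only requires a "reasonable amount
    of motion" between consecutive annotations.
    """
    if segment.duration < min_len_s:                            # rule 1: length
        return False
    boxes = segment.boxes
    if any(w * h >= max_area_frac for (_, _, w, h) in boxes):   # rule 2: box size
        return False
    # rule 3: at least some displacement between consecutive annotations
    motion = sum(abs(b1[0] - b0[0]) + abs(b1[1] - b0[1])
                 for b0, b1 in zip(boxes[:-1], boxes[1:]))
    return motion / max(len(boxes) - 1, 1) >= min_motion
```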

Coarse annotations are provided by YT-BB at 1 fps. In order to increase the annotation density, we rely on a mixture of state-of-the-art trackers to fill in the missing annotations. We claim that any tracker is reliable over a short time span of 1 second. We present in Sect. 4 the performance of state-of-the-art trackers on 1-second-long video segments from OTB100. As a result, we densely annotate the 30,132 videos using a weighted average between a forward and a backward pass of the DCF tracker [16]. By doing so, we provide a densely annotated training dataset for object tracking, along with code for automatically downloading videos from YouTube and extracting the annotated frames.
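A schematic of the densification between consecutive 1-fps keyframes is sketched below; `run_tracker` stands for any off-the-shelf tracker and its interface, the 30-fps assumption and the box format are placeholders rather than the released annotation tool.

```python
def densify(frames, kf_boxes, run_tracker, fps=30):
    """Fill in per-frame boxes between 1-fps keyframe annotations.

    `frames` is the list of video frames, `kf_boxes[i]` is the (x, y, w, h)
    box at frame i*fps, and `run_tracker(frames, init_box)` is a placeholder
    for any tracker returning one box per frame.
    """
    dense = {}
    for k in range(len(kf_boxes) - 1):
        seg = frames[k * fps:(k + 1) * fps + 1]
        fw = run_tracker(seg, kf_boxes[k])                   # forward pass
        bw = run_tracker(seg[::-1], kf_boxes[k + 1])[::-1]   # backward pass
        for t in range(len(seg)):
            w = 1.0 - t / (len(seg) - 1)                     # linear decay (Eq. 2)
            dense[k * fps + t] = [w * f + (1 - w) * b
                                  for f, b in zip(fw[t], bw[t])]
    return dense
```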

3.2 From YT-CC to TrackingNet Testing Set

Alongside the training dataset, we compile a novel dataset for testing, which comprises 511 YouTube videos under a Creative Commons licence, namely YT-CC. We carefully select those videos to reflect the object class distribution of the training set. We ensure that those videos are free of copyright restrictions, so they can be shared. We then use Amazon Mechanical Turk workers (Turkers) to annotate those videos. We annotate the initial bounding boxes and define specific rules for the Turkers to carefully annotate the successive frames. We define the objects as in YT-BB for object detection, i.e. with the smallest bounding box fitting any visible part of the object to track.

Annotations should be defined in a deterministic way, using rules that are agreed upon and abided by during the annotation process. By defining the smallest upright bounding box around an object, we avoid any ambiguity. However, the bounding box may contain a large amount of background. For instance, the arms and the legs are always included for the person class, regardless of the person’s pose. We argue that a tracker should be able to cope with deformable objects and to understand what it is tracking. In a similar fashion, the tails of animals are always included. In addition, the bounding box of an object is adjusted as a function of its visibility in the frame. Estimating the position of an occluded part of the object is not deterministic and hence should be avoided. For instance, the handle of an object of class knife could be hidden by the hand; in such cases, only the blade is annotated.

We use the VATIC tool [44] to annotate the frames. It incorporates an optical flow algorithm to guess the position of the bounding boxes in successive frames. Turkers may be tempted to annotate loose bounding boxes around the object or to rely solely on the optical flow to determine the bounding box location and size. To avoid such behavior, we visually inspect every single frame after each annotation round, rewarding good Turkers and rejecting bad annotations. We either restart the video annotation from scratch or ask Turkers to fine-tune previous results. With our supervision in the loop, we ensure the quality of our annotations after a few iterations, discourage bad annotators and incentivize good ones.

3.3 Attributes

Subsequently, each video is annotated with the list of attributes defined in Table 2. We provide 15 attributes for our testing set: the first 5 are extracted automatically by analyzing the variation of the bounding boxes over time, while the last 10 are manually verified by visually inspecting the 511 videos of our dataset. An overview of the attribute distribution is given in Fig. 4 and compared to OTB100 [47] and VOT17 [22].

Table 2. List and description of the 15 attributes that characterize videos in TrackingNet. Top: automatically estimated. Bottom: visually inspected.
Fig. 4. (Top to bottom, left to right) Distribution of the tracking videos in terms of video length, bounding box resolution, motion change and scale variation, along with the attribute distribution for the main tracking datasets.

First, we claim to have better control over the number of frames per video in our dataset, with less variation than in other datasets. We argue that such limited length diversity is more suitable for training with a constant batch size. Second, the distribution of bounding box resolutions is more diverse in TrackingNet, providing more diversity in the scale of the objects to track. Third, we show that challenges in OTB100 [47] and VOT17 [22] focus on objects with slightly larger motion, while TrackingNet shows a more natural motion distribution over the fastest moving instances in YT-BB; similar conclusions can be drawn from the distribution of the aspect ratio change attribute. Fourth, more than 30% of the OTB100 instances have a constant aspect ratio, while VOT17 shows a flatter distribution. Once again, we argue that TrackingNet contains a more natural distribution of objects present in the wild. Last, we show statistics over the 15 attributes, which will be used to generate attribute-specific tracking results in Sect. 5. Overall, we see that our sequestered testing set has an attribute distribution similar to that of our training set.

3.4 Evaluation

Annotations for the testing set are not disclosed, to ensure a fair comparison between trackers. We thus evaluate the trackers through an online server. Following the OTB100 protocol, we perform a One Pass Evaluation (OPE) and measure the success and precision of the trackers over the 511 videos. The success S is measured as the Intersection over Union (IoU) between the ground truth bounding boxes (\(BB^{gt}\)) and the ones generated by the trackers (\(BB^{tr}\)). The trackers are ranked using the Area Under the Curve (AUC) measurement [47]. The precision P is usually measured as the distance in pixels between the centers \(C^{gt}\) and \(C^{tr}\) of the ground truth and tracker bounding boxes, respectively. The trackers are ranked using this metric with a conventional threshold of 20 pixels.

Since the precision metric is sensitive to the resolution of the images and the size of the bounding boxes, we propose a third metric, \(P_{norm}\). We normalize the precision over the size of the ground truth bounding box, following Eq. 1. The trackers are then ranked using the AUC for normalized precision between 0 and 0.5. By substituting the original precision with the normalized one, we ensure the consistency of the metric across different scales of the objects to track. Note that for bounding boxes of similar scale, success and normalized precision behave similarly, both reflecting how far a prediction is from the ground truth; we argue, however, that they differ when object scales vary. For the sake of consistency, we provide results using precision, normalized precision and success.

$$\begin{aligned} S&= \frac{| BB^{tr} \cap BB^{gt} |}{| BB^{tr} \cup BB^{gt} |}&P&= \Vert C^{tr} - C^{gt} \Vert _2 \\ P_{norm}&= \Vert W \left( C^{tr} - C^{gt} \right) \Vert _2&W&= \text {diag}\left(BB^{gt}_x, BB^{gt}_y\right) \end{aligned}$$
(1)
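For reference, the sketch below implements the three metrics of Eq. 1 for (x, y, w, h) boxes in pixels. Reading the normalization W as dividing the centre error by the ground-truth box width and height follows the stated intent of normalizing over the box size, and the threshold grid used for the AUC is illustrative.

```python
import numpy as np

def iou(bb_tr, bb_gt):
    """Success S: intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(bb_tr[0], bb_gt[0]), max(bb_tr[1], bb_gt[1])
    x2 = min(bb_tr[0] + bb_tr[2], bb_gt[0] + bb_gt[2])
    y2 = min(bb_tr[1] + bb_tr[3], bb_gt[1] + bb_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bb_tr[2] * bb_tr[3] + bb_gt[2] * bb_gt[3] - inter
    return inter / union if union > 0 else 0.0

def center(bb):
    return np.array([bb[0] + bb[2] / 2.0, bb[1] + bb[3] / 2.0])

def precision(bb_tr, bb_gt):
    """P: Euclidean distance between box centres, in pixels."""
    return np.linalg.norm(center(bb_tr) - center(bb_gt))

def norm_precision(bb_tr, bb_gt):
    """P_norm: centre error normalised by the ground-truth box size."""
    return np.linalg.norm((center(bb_tr) - center(bb_gt)) /
                          np.array([bb_gt[2], bb_gt[3]], dtype=float))

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 101)):
    """AUC of the success plot: fraction of frames whose IoU exceeds each
    threshold, averaged over the threshold grid."""
    ious = np.asarray(ious)
    return float(np.mean([np.mean(ious > t) for t in thresholds]))
```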

4 Dataset Experiments

Since the TrackingNet Training Set (\(\sim \)30K videos) is compiled from the YT-BB dataset, it is originally annotated with bounding boxes every second. While such sparse annotations might be satisfactory for some vision tasks, e.g. object classification and detection, deep network based trackers rely on learning the temporal evolution of bounding boxes over time. For instance, Siamese-like architectures [43, 45] need to observe a large number of similar and dissimilar patches of the same object. Unfortunately, manually extending YT-BB is not feasible for such a large number of frames. Thus, we entertain the possibility of tracker-aided annotation to generate the dense bounding box annotations missing between the sparse original YT-BB ones. State-of-the-art trackers not only achieve impressive performance on standard tracking benchmarks, but they also perform well at high frame rates.

To assess such capability, we conduct four different experiments to decide which tracker would perform best in densely annotating OTB100 [47]. We choose among the following trackers: ECO [6], CSRDCF [32], BACF [12], SiameseFC [2], STAPLE\(_{\text {CA}}\) [35], STAPLE [1], SRDCF [7], SAMF [30], CSK [17], KCF [18], DCF [18] and MOSSE [4]. To mimic the 1-second annotation in the TrackingNet Training Set, we assume that all videos of OTB100 are captured at 30 fps and split the dataset into 1916 shorter sequences of 30 frames each. We evaluate the previously highlighted trackers on these 1916 sequences by running them forward and backward through each sequence.

$$\begin{aligned} \mathbf {x}^t_{\text {WG}} = w_{t}\, \mathbf {x}^{t}_{\text {FW}} + \left( 1-w_{t} \right) \mathbf {x}^t_{\text {BK}} \end{aligned}$$
(2)

The results of the forward and backward passes are then combined, either by directly averaging the two results or by taking a convex combination (weighted average) according to Eq. 2, where \(\mathbf {x}^t_{\text {FW}}\), \(\mathbf {x}^t_{\text {BK}}\) and \(\mathbf {x}^t_{\text {WG}}\) are the tracking results at frame t for the forward pass, the backward pass and the weighted average, respectively. We test linear, quadratic, cubic and exponential decay schedules for the weight \(w_{t}\). Note that the maximum sequence length is 30, thus \(t \in [1,30]\). The weighted average gives more weight to the results of the forward pass for frames closer to the first frame and vice versa. Figure 5 and Table 3 show that most trackers perform almost equally well, with the best performance obtained using the weighted average strategy. Since STAPLE\(_{\text {CA}}\) [35] achieves reasonable accuracy at a frame rate of 30 fps, we find it suitable for annotating the large training set of TrackingNet. We run STAPLE\(_{\text {CA}}\) in both a forward and a backward pass and combine the results in a weighted average with linear decay, as described in Eq. 2 with \(w_{t} = 1 - t / 30\).
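The decay schedules tested for \(w_{t}\) in Eq. 2 can be sketched as follows; the exponential rate is illustrative, as the text does not specify it.

```python
import numpy as np

def decay_weights(n=30, kind="linear"):
    """Weights w_t, t = 1..n, giving more influence to the forward pass
    near the start of the 1-second segment (Eq. 2)."""
    t = np.arange(1, n + 1)
    if kind == "linear":
        return 1.0 - t / n
    if kind == "quadratic":
        return (1.0 - t / n) ** 2
    if kind == "cubic":
        return (1.0 - t / n) ** 3
    if kind == "exponential":
        return np.exp(-5.0 * t / n)   # decay rate is illustrative
    raise ValueError(kind)

def combine(fw, bw, kind="linear"):
    """Convex combination of forward and backward trajectories (Eq. 2).
    `fw` and `bw` are arrays of shape (n, 4) holding (x, y, w, h) boxes."""
    fw, bw = np.asarray(fw, float), np.asarray(bw, float)
    w = decay_weights(len(fw), kind)[:, None]
    return w * fw + (1.0 - w) * bw
```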

Fig. 5. Tracking results of 12 trackers on the OTB100 dataset after splitting it into sequences of 30 frames. Left to right: forward pass, backward pass, linear and exponential decay averages as in Eq. 2.

Table 3. Tracking results on the 1-second-long OTB100 sequences using different averaging strategies.

5 Tracking Benchmark

In our benchmark, we compare a large variety of tracking algorithms that cover all common tracking principles. The majority of current state-of-the-art algorithms are based on discriminative correlation filters with handcrafted or deep features. We select trackers to cover a large set of combinations of features and kernels. MOSSE [4], CSK [17], DCF [16], KCF [16] use simple features and do not adapt to scale variations. DSST [9], SAMF [30], and STAPLE [1] use more sophisticated features like Colornames and try to compensate for scale variations. We also include trackers that propose some kind of general framework to improve upon correlation filter tracking. These include SRDCF [8], SAMF\(_{\text {AT}}\) [30], STAPLE\(_{\text {CA}}\) [35], BACF [12] and ECO-HC [6]. We include CFNet [43] and SiameseFC [2] to represent CNN matching trackers and MEEM [49] and DLSSVM [37] for structured SVM-based trackers. Last, we include some baseline trackers such as TLD [20], Struck [14], ASLA [19] and IVT [39] for reference. Table 4 summarizes the selected trackers along with their representation scheme, search method, runtime and a generic description.

Table 4. Evaluated Trackers. Representation: PI - Pixel Intensity, HOG - Histogram of Oriented Gradients, CN - Color Names, CH - Color Histogram, GK - Gaussian Kernel, K - Keypoints, BP - Binary Pattern, SSVM - Structured Support Vector Machine. Search: PF - Particle Filter, RS - Random Sampling, DS - Dense Sampling.

5.1 State-of-the-art Benchmark on TrackingNet

Figure 6 shows the results on the complete dataset. Note that the highest success rate achieved by any tracker is about 60%, compared to around 90% on OTB. The top-performing tracker is MDNET [36], which trains in an online fashion and is therefore able to adapt best. However, this comes at the cost of a very slow runtime. Next are CFNet [43] and SiameseFC [2], which benefit from being trained on a large-scale dataset (ImageNet Videos). However, as we show later, their performance can be further improved by using our training dataset.

Fig. 6. Benchmark results on OTB100 (top) and on TrackingNet (bottom).

Fig. 7. Benchmark results on TrackingNet with variable frame rate (tracker fps).

5.2 Real-Time Tracking

For many real applications, tracking is not very useful if it cannot be done in real time. Therefore, we conduct an experiment to evaluate how well trackers would perform in more realistic settings where frames are skipped if a tracker is too slow. We do this by subsampling each sequence based on the tracker’s speed. Figure 7 shows the results of this experiment across the complete dataset. As expected, most trackers that run below real-time degrade. In the worst case, this degradation can be as much as 50%, as is the case for Struck [14]. More recent trackers, in particular deep learning ones, are much less affected. CFNet [43], for example, does not degrade at all even though it only sees every third frame. This is probably due to the fact that it relies on a generic object matching function that was trained on a large-scale dataset.
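The skipped-frame protocol can be approximated with a sketch like the one below; the hold-last-box behaviour on skipped frames and the `track_step` interface are our assumptions about how the subsampled evaluation is scored.

```python
import math

def realtime_run(frames, init_box, track_step, tracker_fps, video_fps=30):
    """Simulate real-time tracking by skipping frames a slow tracker would miss.

    `track_step(frame, prev_box)` is a placeholder for one tracker update
    returning a new box. Skipped frames keep the last predicted box, which
    is our assumption for how they are scored.
    """
    stride = max(1, math.ceil(video_fps / float(tracker_fps)))
    box = init_box
    results = [init_box]
    for i, frame in enumerate(frames[1:], start=1):
        if i % stride == 0:       # the tracker is fast enough for this frame
            box = track_step(frame, box)
        results.append(box)       # held box on skipped frames
    return results
```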

5.3 Retraining on TrackingNet

We fine-tune SiameseFC [2] on a fraction of TrackingNet to show how our data can improve the tracking performance of deep-learning based trackers. The results are shown in Table 5. By training on only one of the twelve chunks (2,511 videos) of our training dataset, we observe an increase in all the metrics on TrackingNet Test and OTB100. Fine-tuning on more chunks is expected to improve the performance even further.

Table 5. Fine-tuning results for SiameseFC on OTB100 and TrackingNet Test.
Fig. 8. Per-attribute results on TrackingNet Test.

5.4 Attribute Specific Results

Each video in TrackingNet Test is annotated with 15 attributes described in Sect. 3. We evaluate all trackers per attribute to get insights about challenges facing state-of-the-art tracking algorithms. We show the most interesting results in Fig. 8 and refer the reader to the supplementary material for the remaining attributes. We find that videos with in-plane rotation, low resolution targets, and full occlusion are consistently the most difficult. Trackers are least affected by illumination variation, partial occlusion, and object deformation.

6 Conclusion

In this work, we present TrackingNet, which is, to the best of our knowledge, the largest dataset for object tracking. We show how existing large-scale object detection datasets can be leveraged for object tracking through a novel interpolation method. We also benchmark more than 20 tracking algorithms on this novel dataset and shed light on which attributes are especially difficult for current trackers. Lastly, we verify the usefulness of our large dataset in improving the performance of some deep learning based trackers.

In the future, we aim to extend the test set from 500 to 1000 videos. We plan to sample the extra 500 videos from different classes within the same categories (e.g. tortoise/animal). This will allow for further evaluation with regard to generalization. After publication, we plan to release the training set with our interpolated annotations. We will also release the test sequences with initial bounding box annotations and the corresponding integration for the OTB toolkit. At the same time, we will publish our online evaluation server to allow researchers to rank their tracking algorithms instantly.