The Eighth Visual Object Tracking VOT2020 Challenge Results

Conference paper in Computer Vision – ECCV 2020 Workshops (ECCV 2020), Lecture Notes in Computer Science, volume 12539.

Abstract

The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) the VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2020 challenge focused on “real-time” short-term tracking in RGB, (iii) VOT-LT2020 focused on long-term tracking, namely coping with target disappearance and reappearance, (iv) the VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery and (v) the VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology and of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelties was introduced. The performance of the tested trackers typically far exceeds that of standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

Notes

  1. http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1.
  2. http://www.homeoffice.gov.uk/science-research/hosdb/i-lids.
  3. http://www-sop.inria.fr/orion/ETISEO.
  4. http://vision.fe.uni-lj.si/cvbase06/.
  5. http://www.micc.unifi.it/LTDT2014/.
  6. http://videonet.team.
  7. http://votchallenge.net.
  8. http://www.votchallenge.net/vot2020/participation.html.
  9. http://www.votchallenge.net/vot2019/res/list0_prohibited_1000.txt.
  10. The target was sought in a window centered at its estimated position in the previous frame. This is the simplest dynamic model; it assumes that all positions within the search region have equal prior probability of containing the target.
  11. This includes standard FFT-based as well as more recent deep-learning-based DCFs (e.g., [5, 13]).
  12. https://github.com/NVIDIA/TensorRT.
  13. https://github.com/Daikenan/LTMU.

References

  1. Babenko, B., Yang, M.H., Belongie, S.: Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011)
  2. Berg, A., Ahlberg, J., Felsberg, M.: A thermal object tracking benchmark. In: 12th IEEE International Conference on Advanced Video- and Signal-based Surveillance, Karlsruhe, Germany, 25–28 August 2015. IEEE (2015)
  3. Berg, A., Johnander, J., de Gevigney, F.D., Ahlberg, J., Felsberg, M.: Semi-automatic annotation of objects in visual-thermal video. In: IEEE International Conference on Computer Vision, ICCV Workshops (2019)
  4. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
  5. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: IEEE International Conference on Computer Vision, ICCV (2019)
  6. Bhat, G., Johnander, J., Danelljan, M., Khan, F.S., Felsberg, M.: Unveiling the power of deep tracking. In: ECCV, pp. 483–498 (2018)
  7. Bhat, G., et al.: Learning what to learn for video object segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 777–794. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_46
  8. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)
  9. Chen, K., et al.: Hybrid task cascade for instance segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
  10. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
  11. Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X.: High-performance long-term tracking with meta-updater. In: CVPR (2020)
  12. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. In: CVPR, pp. 6638–6646 (2017)
  13. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: CVPR, pp. 4660–4669 (2019)
  14. Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: CVPR (2020)
  15. Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1561–1575 (2016)
  16. Dunnhofer, M., Martinel, N., Luca Foresti, G., Micheloni, C.: Visual tracking by means of deep reinforcement learning and an expert demonstrator. In: The IEEE International Conference on Computer Vision (ICCV) Workshops, October 2019
  17. Dunnhofer, M., Martinel, N., Micheloni, C.: A distilled model for tracking and tracker fusion (2020)
  18. Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: Computer Vision Pattern Recognition (2019)
  19. Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: a benchmark for higher frame rate object tracking. CoRR abs/1703.05884 (2017). http://arxiv.org/abs/1703.05884
  20. Goyette, N., Jodoin, P.M., Porikli, F., Konrad, J., Ishwar, P.: Changedetection.net: a new change detection benchmark dataset. In: CVPR Workshops, pp. 1–8. IEEE (2012)
  21. Guo, C., Zhang, L.: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process. 19(1), 185–198 (2009)
  22. Gustafsson, F.K., Danelljan, M., Bhat, G., Schön, T.B.: Energy-based models for deep probabilistic regression. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 325–343. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_20
  23. Gustafsson, F.K., Danelljan, M., Timofte, R., Schön, T.B.: How to train your energy-based model for regression. CoRR abs/2005.01698 (2020). https://arxiv.org/abs/2005.01698
  24. Henriques, J., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. PAMI 37(3), 583–596 (2015)
  25. Huang, L., Zhao, X., Huang, K.: GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. arXiv:1810.11981 (2018)
  26. Huang, L., Zhao, X., Huang, K.: GlobalTrack: a simple and strong baseline for long-term tracking. In: AAAI (2020)
  27. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360 (2016)
  28. Valmadre, J., et al.: Long-term tracking in the wild: a benchmark. arXiv:1803.09502 (2018)
  29. Jung, I., Son, J., Baek, M., Han, B.: Real-time MDNet. In: ECCV, pp. 83–98 (2018)
  30. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 34(7), 1409–1422 (2012). https://doi.org/10.1109/TPAMI.2011.239
  31. Kristan, M., et al.: The seventh visual object tracking VOT2019 challenge results. In: ICCV2019 Workshops, Workshop on Visual Object Tracking Challenge (2019)
  32. Kristan, M., et al.: The visual object tracking VOT2018 challenge results. In: ECCV2018 Workshops, Workshop on Visual Object Tracking Challenge (2018)
  33. Kristan, M., et al.: The visual object tracking VOT2017 challenge results. In: ICCV2017 Workshops, Workshop on Visual Object Tracking Challenge (2017)
  34. Kristan, M., et al.: The visual object tracking VOT2016 challenge results. In: ECCV2016 Workshops, Workshop on Visual Object Tracking Challenge (2016)
  35. Kristan, M., et al.: The visual object tracking VOT2015 challenge results. In: ICCV2015 Workshops, Workshop on Visual Object Tracking Challenge (2015)
  36. Kristan, M., et al.: The visual object tracking VOT2013 challenge results. In: ICCV2013 Workshops, Workshop on Visual Object Tracking Challenge, pp. 98–111 (2013)
  37. Kristan, M., et al.: The visual object tracking VOT2014 challenge results. In: ECCV2014 Workshops, Workshop on Visual Object Tracking Challenge (2014)
  38. Kristan, M., et al.: A novel performance evaluation methodology for single-target trackers. IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2137–2155 (2016)
  39. Leal-Taixé, L., Milan, A., Reid, I.D., Roth, S., Schindler, K.: MOTChallenge 2015: towards a benchmark for multi-target tracking. CoRR abs/1504.01942 (2015). http://arxiv.org/abs/1504.01942
  40. Li, A., Li, M., Wu, Y., Yang, M.H., Yan, S.: NUS-PRO: a new visual tracking challenge. IEEE-PAMI (2015)
  41. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282–4291 (2019)
  42. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8971–8980, June 2018
  43. Li, C., Liang, X., Lu, Y., Zhao, N., Tang, J.: RGB-T object tracking: benchmark and baseline. Pattern Recogn. (2019, submitted)
  44. Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: algorithms and benchmark. IEEE Trans. Image Process. 24(12), 5630–5644 (2015)
  45. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  46. Lukežič, A., Kart, U., Kämäräinen, J., Matas, J., Kristan, M.: CDTB: a color and depth visual object tracking dataset and benchmark. In: ICCV (2019)
  47. Lukežič, A., Vojíř, T., Čehovin Zajc, L., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6309–6318, July 2017
  48. Lukežič, A., Čehovin Zajc, L., Vojíř, T., Matas, J., Kristan, M.: Now you see me: evaluating performance in long-term visual tracking. CoRR abs/1804.07056 (2018). http://arxiv.org/abs/1804.07056
  49. Lukežič, A., Čehovin Zajc, L., Vojíř, T., Matas, J., Kristan, M.: Performance evaluation methodology for long-term single object tracking. IEEE Trans. Cybern. (2020)
  50. Lukežič, A., Matas, J., Kristan, M.: D3S - a discriminative single shot segmentation tracker. In: CVPR (2020)
  51. Memarmoghadam, A., Moallem, P.: Size-aware visual object tracking via dynamic fusion of correlation filter-based part regressors. Signal Process. 164, 84–98 (2019). https://doi.org/10.1016/j.sigpro.2019.05.021. http://www.sciencedirect.com/science/article/pii/S0165168419301872
  52. Moudgil, A., Gandhi, V.: Long-term visual object tracking benchmark. arXiv preprint arXiv:1712.01358 (2017)
  53. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
  54. Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In: ECCV, pp. 300–317 (2018)
  55. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR, pp. 4293–4302 (2016)
  56. Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)
  57. Pernici, F., del Bimbo, A.: Object tracking by oversampling local features. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2538–2551 (2013). https://doi.org/10.1109/TPAMI.2013.250
  58. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 22(10), 1090–1104 (2000)
  59. Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: Computer Vision and Pattern Recognition, pp. 7464–7473 (2017)
  60. Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., Felsberg, M.: Learning fast and robust target models for video object segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Computer Vision Foundation, June 2020
  61. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  62. Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1–3), 125–141 (2008)
  63. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
  64. Oh, S.W., Lee, J.Y., Kim, S.J.: Fast video object segmentation by reference-guided mask propagation. In: Computer Vision Pattern Recognition, pp. 7376–7385 (2018)
  65. Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. TPAMI (2013). https://doi.org/10.1109/TPAMI.2013.230
  66. Solera, F., Calderara, S., Cucchiara, R.: Towards the evaluation of reproducible robustness in tracking-by-detection. In: Advanced Video and Signal Based Surveillance, pp. 1–6 (2015)
  67. Song, S., Xiao, J.: Tracking revisited using RGBD camera: unified benchmark and baselines. In: ICCV (2013)
  68. Tao, R., Gavves, E., Smeulders, A.W.M.: Tracking for half an hour. CoRR abs/1711.10217 (2017). http://arxiv.org/abs/1711.10217
  69. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355 (2019)
  70. Čehovin, L., Kristan, M., Leonardis, A.: Is my new tracker really better than yours? Technical report 10, ViCoS Lab, University of Ljubljana, October 2013. http://prints.vicos.si/publications/302
  71. Čehovin, L.: TraX: The visual Tracking eXchange Protocol and Library. Neurocomputing (2017). https://doi.org/10.1016/j.neucom.2017.02.036
  72. Čehovin, L., Leonardis, A., Kristan, M.: Visual object tracking performance measures revisited. IEEE Trans. Image Process. 25(3), 1261–1274 (2016)
  73. Vojíř, T., Noskova, J., Matas, J.: Robust scale-adaptive mean-shift for tracking. Pattern Recogn. Lett. 49, 250–258 (2014)
  74. Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: a unifying approach. In: CVPR, pp. 1328–1338 (2019)
  75. Wang, X., Kong, T., Shen, C., Jiang, Y., Li, L.: SOLO: segmenting objects by locations. arXiv preprint arXiv:1912.04488 (2019)
  76. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: Computer Vision Pattern Recognition (2013)
  77. Wu, Y., Lim, J., Yang, M.H.: Object tracking benchmark. PAMI 37(9), 1834–1848 (2015)
  78. Xiao, J., Stolkin, R., Gao, Y., Leonardis, A.: Robust fusion of color and depth data for RGB-D target tracking using adaptive range-invariant depth models and spatio-temporal consistency constraints. IEEE Trans. Cybern. 48, 2485–2499 (2018)
  79. Xu, N., Price, B., Yang, J., Huang, T.: Deep GrabCut for object selection. In: Proceedings of British Machine Vision Conference (2017)
  80. Xu, T., Feng, Z.H., Wu, X.J., Kittler, J.: AFAT: adaptive failure-aware tracker for robust visual object tracking. arXiv preprint arXiv:2005.13708 (2020)
  81. Xu, Y., Wang, Z., Li, Z., Ye, Y., Yu, G.: SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188 (2019)
  82. Yan, B., Wang, D., Lu, H., Yang, X.: Alpha-Refine: boosting tracking performance by precise bounding box estimation. arXiv preprint arXiv:2007.02024 (2020)
  83. Yan, B., Zhao, H., Wang, D., Lu, H., Yang, X.: Skimming-Perusal Tracking: a framework for real-time and robust long-term tracking. In: IEEE International Conference on Computer Vision (ICCV) (2019)
  84. Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S.: RepPoints: point set representation for object detection. In: The IEEE International Conference on Computer Vision (ICCV), pp. 9657–9666, October 2019
  85. Lin, Y., Shen, J., Pantic, M.: Mobile face tracking: a survey and benchmark. arXiv:1805.09749v1 (2018)
  86. Young, D.P., Ferryman, J.M.: PETS Metrics: on-line performance evaluation service. In: Proceedings of the 14th International Conference on Computer Communications and Networks, ICCCN 2005, pp. 317–324 (2005)
  87. Zhang, L., Danelljan, M., Gonzalez-Garcia, A., van de Weijer, J., Khan, F.S.: Multi-modal fusion for end-to-end RGB-T tracking. In: IEEE International Conference on Computer Vision, ICCV Workshops (2019)
  88. Zhang, P., Zhao, J., Wang, D., Lu, H., Yang, X.: Jointly modeling motion and appearance cues for robust RGB-T tracking. CoRR abs/2007.02041 (2020)
  89. Zhang, Y., Wu, Z., Peng, H., Lin, S.: A transductive approach for video object segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4000–4009, June 2020
  90. Zhang, Y., Wang, D., Wang, L., Qi, J., Lu, H.: Learning regression and verification networks for long-term visual tracking. CoRR abs/1809.04320 (2018)
  91. Zhang, Z., Peng, H.: Deeper and wider Siamese networks for real-time visual tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4591–4600, June 2019
  92. Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: object-aware anchor-free tracking. arXiv preprint arXiv:2006.10721 (2020)
  93. Zhu, P., Wen, L., Bian, X., Ling, H., Hu, Q.: Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437 (2018)

Acknowledgements

This work was supported in part by the following research programs and projects: Slovenian research agency research programs P2-0214, Z2-1866, P2-0094, Slovenian research agency project J2-8175. Jiři Matas and Ondrej Drbohlav were supported by the Czech Science Foundation Project GACR P103/12/G084. Aleš Leonardis was supported by MURI project financed by MoD/Dstl and EPSRC through EP/N019415/1 grant. Michael Felsberg and Linbo He were supported by WASP, VR (ELLIIT and NCNN), and SSF (SymbiCloud). Roman Pflugfelder and Gustavo Fernández were supported by the AIT Strategic Research Programme 2020 Visual Surveillance and Insight. The challenge was sponsored by the Faculty of Computer Science, University of Ljubljana, Slovenia.

Author information

Correspondence to Matej Kristan.

Appendices

A VOT-ST2020 and VOT-RT2020 Submissions

This appendix provides a short summary of trackers considered in the VOT-ST2020 and VOT-RT2020 challenges.

1.1 A.1 Discriminative Single-Shot Segmentation Tracker (D3S)

A. Lukezic

alan.lukezic@fri.uni-lj.si

Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but they are restricted to bounding-box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker named D3S [50], which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, and the other assuming a rigid object, to simultaneously achieve high robustness and online target segmentation.

1.2 A.2 Visual Tracking by Means of Deep Reinforcement Learning and an Expert Demonstrator (A3CTDmask)

M. Dunnhofer, G. Foresti, C. Micheloni

{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it

A3CTDmask is the combination of the A3CTD tracker [16] with a one-shot segmentation method for target object mask generation. A3CTD is a real-time tracker built on a deep recurrent regression network architecture trained offline using a reinforcement learning based framework. After training, the proposed tracker is capable of producing bounding box estimates through the learned policy or by exploiting the demonstrator. A3CTDmask exploits SiamMask [74] by reinterpreting it as a one-shot segmentation module. The target object mask is generated inside a frame patch obtained through the bounding box estimates given by A3CTD.

1.3 A.3 Deep Convolutional Descriptor Aggregation for Visual Tracking (DCDA)

Y. Li, X. Ke

liyuezhou.cm@gmail.com, kex@fzu.edu.cn

This work aims to mine the target representation capability of the pre-trained VGG16 model for visual tracking. Based on spatial and semantic priors, a central attention mask is designed for robustness-aware feature aggregation, and an edge attention mask is used for accuracy-aware feature aggregation. To make full use of the scene context, a regression loss is developed to learn a discriminative feature for complex scenes. The DCDA tracker is implemented on a Siamese network, with feature fusion and template enhancement strategies.

1.4 A.4 IOU Guided Siamese Networks for Visual Object Tracking (IGS)

M. Dasari, R. Gorthi

{ee18d001, rkg}@iittp.ac.in

In the proposed IOU-SiamTrack framework, a new block called the ‘IOU module’ is introduced. This module accepts the feature-domain response maps and converts them into the image domain with the help of anchor boxes, as is done in the inference stage of [41, 42]. Using the classification response map, the top-K ‘probable’ bounding boxes, i.e. those with the top-K responses, are selected. The IOU module then calculates the IoU of the probable bounding boxes w.r.t. the estimated bounding box and outputs the one with the maximum IoU score as the predicted bounding box, as sketched below. As training progresses, the predicted box becomes increasingly aligned with the ground truth, since the network is guided to minimise the IoU loss.
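
The IoU-guided selection step can be illustrated with a small sketch (this is not the authors' code; the corner-format boxes and the top-K selection routine are assumptions):

    import numpy as np

    def iou(a, b):
        # Boxes in (x1, y1, x2, y2) corner format.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def select_box(boxes, scores, estimated_box, k=5):
        # Keep the K boxes with the highest classification responses ...
        top_k = np.argsort(scores)[-k:]
        # ... and output the one that overlaps the current estimate the most.
        ious = [iou(boxes[i], estimated_box) for i in top_k]
        return boxes[top_k[int(np.argmax(ious))]]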

1.5 A.5 SiamMask_SOLO (SiamMask_S)

Y. Jiang, Z. Feng, T. Xu, X. Song

yj.jiang@stu.jiangnan.edu.cn, {z.feng, tianyang.xu}@surrey.ac.uk,

x.song@jiangnan.edu.cn

The SiamMask_SOLO tracker is based on the SiamMask algorithm. It utilizes a multi-layer aggregation module to make full use of different levels of deep CNN features. Besides, to balance the three branches, the mask branch is replaced by a SOLO [75] head that uses CoordConv and an FCN, which improves the performance of the proposed SiamMask_SOLO tracker in terms of both accuracy and robustness. The original refinement module is kept for a further performance boost.

1.6 A.6 Diverse Ensemble Tracker (DET50)

N. Wang, W. Zhou, H. Li

wn6149@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn

In this work, we leverage an ensemble of diverse models to learn manifold representations for robust object tracking. Based on the DiMP method, a shared backbone network (ResNet-50) is applied for feature extraction and multiple head networks for independent predictions. To shrink the representational overlaps among multiple models, both model diversity and response diversity regularization terms are used during training. This ensemble framework is end-to-end trained in a data-driven manner. After box-level prediction, we use SiamMask for mask generation.

1.7 A.7 VPU_SiamM: Robust Template Update Strategy for Efficient Object Tracking (VPU_SiamM)

A. Gebrehiwot, J. Bescos, Á. García-Martín

awet.gebrehiwot@estudiante.uam.es, {j.bescos, alvaro.garcia}@uam.es

The VPU_SiamM tracker is an improved version of SiamMask [74], which tracks without any target update strategy. In order to obtain more discriminant features and to enhance robustness, VPU_SiamM applies a target template update strategy that leverages both the initial ground-truth template and a supplementary updatable template. The initial template provides highly reliable information and increases robustness against model drift, while the updatable template integrates new target information from the target location predicted in the current frame. During online tracking, VPU_SiamM applies both forward and backward tracking strategies by updating the updatable target template with the predicted target. The tracking decision in the next frame is taken where both templates yield a high response (score) in the search region. A data augmentation strategy was used when training the refinement branch to make it robust to motion-blurred and low-resolution inputs at inference time.
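
A minimal sketch of the dual-template decision described above (the element-wise-minimum combination rule and the threshold are our assumptions, not the authors' implementation):

    import numpy as np

    def dual_template_decision(resp_init, resp_upd, thr=0.6):
        # resp_init / resp_upd: response maps of the search region computed with
        # the initial and the updatable template, respectively.
        joint = np.minimum(resp_init, resp_upd)   # high only where BOTH templates agree
        peak = np.unravel_index(np.argmax(joint), joint.shape)
        confident = joint[peak] > thr
        # The caller updates the updatable template only when the decision is confident.
        return peak, confident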

1.8 A.8 RPT: Learning Point Set Representation for Siamese Visual Tracking (RPT)

H. Zhang, L. Wang, Z. Ma, W. Lu, J. Yin, M. Cheng

1067166127@qq.com, {wanglinyuan, kobebean, lwhfh01}@zju.edu.cn, {yin_jun, cheng_miao}@dahuatech.com

The RPT tracker is formulated with a two-stage structure. The first stage is composed of two parallel subnets, one for target estimation with RepPoints [84] in an offline-trained embedding space, the other trained online to provide high robustness against distractors [13]. The online classification subnet is a lightweight two-layer convolutional neural network. The target estimation head is constructed with Siamese-based feature extraction and matching. In the second stage, the set of RepPoints with the highest confidence (i.e. online classification score) is fed into a modified D3S [50] to obtain the segmentation mask. A segmentation map is obtained by combining the enhanced target location channel with the target and background similarity channels. The backbone is a ResNet-50 pre-trained on ImageNet, while the target estimation head is trained using pairs of frames from the YouTube-BoundingBoxes [59], COCO [45] and ImageNet VID [63] datasets.

1.9 A.9 Tracking Student and Teacher (TRASTmask)

M. Dunnhofer, G. Foresti, C. Micheloni

{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it

TRASTmask is the combination of the TRAST tracker [17] with a one-shot segmentation method for target object mask generation. TRAST tracker consists of two components: (i) a fast processing CNN-based tracker, i.e. the Student; and (ii) an off-the-shelf tracker, i.e. the Teacher. The Student is trained offline based on knowledge distillation and reinforcement learning, where multiple tracking teachers are exploited. Tracker TRASTmask uses DiMP [5] as the Teacher. The target object mask is generated inside a frame patch obtained through the bounding box estimates given by TRAST tracker.

1.10 A.10 Ocean: Object-aware Anchor-free Tracking (Ocean)

Z. Zhang, H. Peng

zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com

We extend our object-aware anchor-free tracking framework [92] with novel transduction and segmentation networks, enabling it to predict an accurate target mask. The transduction network is introduced to infuse the knowledge of the mask given in the first frame. Inspired by the recent work TVOS [89], it compares the pixel-wise feature similarities between the template and search features, and then transfers the mask of the template to an attention map based on these similarities. We add the attention map to the backbone features to learn target-background-aware representations. Finally, a U-Net-shaped segmentation pathway is designed to progressively refine the enhanced backbone features into the target mask. The code will be completely released at https://github.com/researchmm/TracKit.
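
The transduction idea can be sketched as follows; the cosine similarity, softmax normalisation and tensor shapes are illustrative assumptions rather than the released implementation:

    import numpy as np

    def transduction_attention(feat_t, feat_s, mask_t):
        # feat_t: (C, Ht, Wt) template features, feat_s: (C, Hs, Ws) search features,
        # mask_t: (Ht, Wt) first-frame mask. Returns an (Hs, Ws) attention map.
        C, Ht, Wt = feat_t.shape
        _, Hs, Ws = feat_s.shape
        t = feat_t.reshape(C, -1)
        s = feat_s.reshape(C, -1)
        t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-9)
        s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-9)
        sim = t.T @ s                                    # pixel-wise cosine similarities (Nt, Ns)
        w = np.exp(sim) / np.exp(sim).sum(axis=0)        # soft assignment over template pixels
        attn = (mask_t.reshape(-1) @ w).reshape(Hs, Ws)  # transfer the template mask
        return attn  # added to the backbone features downstream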

1.11 A.11 Tracking by Student FUSing Teachers (TRASFUSTm)

M. Dunnhofer, G. Foresti, C. Micheloni

{matteo.dunnhofer, gianluca.foresti, christian.micheloni}@uniud.it

The tracker TRASFUSTm is the combination of the TRASFUST tracker [17] with a one-shot segmentation method for target object mask generation. The TRASFUST tracker consists of two components: (i) a fast-processing CNN-based tracker, i.e. the Student, and (ii) a pool of off-the-shelf trackers, i.e. the Teachers. The Student is trained offline based on knowledge distillation and reinforcement learning, where multiple tracking teachers are exploited. After learning, the Student is capable of selecting, through the learned evaluation method, the prediction of the best Teacher in the pool, thus performing robust fusion. The trackers DiMP [5] and ECO [12] were chosen as Teachers. The target object mask is generated inside a frame patch obtained through the bounding box estimates given by the TRASFUSTm tracker.

1.12 A.12 Alpha-Refine (AlphaRef)

B. Yan, D. Wang, H. Lu, X. Yang

yan_bin@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,

xyang@remarkholdings.com

We propose a simple yet powerful two-stage tracker, which consists of a robust base tracker (super-dimp) and an accurate refinement module named Alpha-Refine [82]. In the first stage, super-dimp robustly locates the target, generating an initial bounding box. In the second stage, based on this result, Alpha-Refine crops a small search region to predict a high-quality mask for the tracked target. Alpha-Refine exploits pixel-wise correlation for fine feature aggregation and uses a non-local layer to capture global context information. Besides, Alpha-Refine also deploys a delicate mask prediction head [60] to generate high-quality masks. The complete code and trained models of Alpha-Refine will be released at github.com/MasterBin-IIAU/AlphaRefine.
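
The two-stage flow can be summarised with the following hedged sketch; the crop factor and the tracker/refiner interfaces are placeholders, not the released API:

    def track_frame(frame, base_tracker, refiner, search_factor=2.0):
        # Stage 1: the base tracker (e.g. super-dimp) produces a coarse box (x, y, w, h).
        x, y, w, h = base_tracker.track(frame)
        # Stage 2: crop a small search region around the coarse box
        # (boundary clipping omitted for brevity) ...
        cx, cy = x + w / 2.0, y + h / 2.0
        side = search_factor * max(w, h)
        crop = frame[int(cy - side / 2):int(cy + side / 2),
                     int(cx - side / 2):int(cx + side / 2)]
        # ... and let the refinement module predict a high-quality mask and refined box.
        mask, refined_box = refiner.refine(crop, (x, y, w, h))
        return refined_box, mask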

1.13 A.13 Hierarchical Representations with Discriminative Meta-Filters in Dual Path Network for Tracking (DPMT)

F. Xie, N. Wang, K. Yang, Y. Yao

220191672@seu.edu.cn, 20181222016@nuist.edu.cn,

yangkang779@163.con, 220191672@seu.edu.cn

We propose a novel dual-path network with discriminative meta-filters and hierarchical representations. The DPMT tracker consists of two pathways: (i) a Geographical Sensitivity Pathway (GASP) and (ii) a Geometrical Sensitivity Pathway (GESP). The modules in GASP are more sensitive to the spatial location of targets and distractors, while the subnetworks in GESP are designed to refine the bounding box to fit the target. Under this dual-path design, GASP is trained to have more discriminative power between foreground and background, while GESP focuses more on the appearance model of the object.

1.14 A.14 SiamMask (siammask)

Q. Wang, L. Zhang, L. Bertinetto, P.H.S. Torr, W. Hu

qiang.wang@nlpr.ia.ac.cn, {lz, luca}@robots.ox.ac.uk, philip.torr@eng.ox.ac.uk, wmhu@nlpr.ia.ac.cn

Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. In this way, our tracker gains a better instance-level understanding of the object to track by exploiting rich object mask representations offline. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes. Code is publicly available at https://github.com/foolwood/SiamMask.

1.15 A.15 OceanPlus: Online Object-Aware Anchor-Free Tracking (OceanPlus)

Z. Zhang, H. Peng, Z. Wu, K. Liu, J. Fu, B. Li, W. Hu

zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com,

Wu.Zhirong@microsoft.com, liukaiwen2019@ia.ac.cn, jianf@microsoft.com,

bli@nlpr.ia.ac.cn, wmhu@nlpr.ia.ac.cn

This model is an extension of the Ocean tracker (A.10). Inspired by recent online models, we introduce an online branch to accommodate changes in object scale and position. Specifically, the online branch inherits the structure and parameters of the first three stages of the Siamese backbone network. The fourth stage keeps the same structure as the original ResNet-50, but its initial parameters are obtained through the pre-training strategy proposed in [5]. The segmentation refinement pathway is the same as in Ocean. We refer the readers to the Ocean tracker (A.10) and https://github.com/researchmm/TracKit for more details.

1.16 A.16 fastOcean: Fast Object-Aware Anchor-Free Tracking (fastOcean)

Z. Zhang, H. Peng

zhangzhipeng2017@ia.ac.cn, houwen.peng@microsoft.com

To speed up the inference of our submitted tracker OceanPlus, we use TensorRT (see Note 12) to re-implement the model. All structural and model parameters are the same as in OceanPlus. Please refer to OceanPlus (A.15) and Ocean (A.10) for more details.

1.17 A.17 Siamese Tracker with Discriminative Feature Embedding and Mask Prediction (SiamMargin)

G. Chen, F. Wang, C. Qian

{chenguangqi, wangfei, qianchen}@sensetime.com

SiamMargin is based on the SiamRPN++ algorithm [41]. In the training stage, a discrimination loss is added to the embedding layer, so that a discriminative embedding is learned offline. In the inference stage, the template feature of the object in the current frame is obtained by ROIAlign from the features of the current search region and is updated via a moving-average strategy, as sketched below. The discriminative embedding features are leveraged to accommodate appearance changes through proper online updating. Lastly, the SiamMask [74] model is appended to obtain a pixel-level mask prediction.
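
The moving-average template update amounts to an exponential moving average over the ROIAlign-ed features; a one-line sketch (the learning rate is an assumed value):

    def update_template(template, new_template, lr=0.01):
        # Blend a small fraction of the current-frame feature into the stored template.
        return (1.0 - lr) * template + lr * new_template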

1.18 A.18 Siamese Tracker with Enhanced Template and Generalized Mask Generator (SiamEM)

Y. Li, Y. Ye, X. Ke

liyuezhou.cm@gmail.com, yyfzu@foxmail.com, kex@fzu.edu.cn

SiamEM is a Siamese tracker with an enhanced template and a generalized mask generator. SiamEM improves SiamFC++ [81] by obtaining feature results for the template and a flipped template in the network head, while making decisions based on quality scores to predict bounding boxes. The segmentation network presented in [10] is used as the mask generation network.

1.19 A.19 TRacker by Using ATtention (TRAT)

H. Saribas, H. Cevikalp, B. Uzun

{hasansaribas48, hakan.cevikalp, eee.bedirhan}@gmail.com

The tracker ‘TRacker by using ATtention’ uses a two-stream network, consisting of a 2D-CNN and a 3D-CNN, to exploit both spatial and temporal information in video streams. To obtain temporal (motion) information, the 3D-CNN is fed with a stack of the previous four frames sampled with a stride of one. To extract spatial information, the 2D-CNN is used. The two-stream network outputs are then fused using an attention module. We use the ATOM [13] tracker with a ResNet backbone as the baseline. Code is available at https://github.com/Hasan4825/TRAT.

1.20 A.20 InfoGAN Based Tracker: InfoVITAL (InfoVital)

H. Kuchibhotla, M. Dasari, R. Gorthi

{ee18m009, ee18d001, rkg}@iittp.ac.in

The InfoGAN architecture (generator, discriminator and a Q-network) is incorporated into the tracking-by-detection framework, using mutual information to bind two latent-code distributions to the target and background samples. The additional Q-network helps properly estimate the assigned distributions, and the network is trained offline in an adversarial fashion. During online testing, the additional information from the Q-network is used to obtain the target location in subsequent frames. This helps assess the drift from the exact target location from frame to frame and also during occlusion.

1.21 A.21 Learning Discriminative Model Prediction for Tracking (DiMP)

G. Bhat, M. Danelljan, L. Van Gool, R. Timofte

{goutam.bhat, martin.danelljan, vangool, timofter}@vision.ee.ethz.ch

DiMP is an end-to-end tracking architecture, capable of fully exploiting both target and background appearance information for target model prediction. The target model here constitutes the weights of a convolution layer which performs the target-background classification. The weights of this convolution layer are predicted by the target model prediction network, which is derived from a discriminative learning loss by applying an iterative optimization procedure. The model prediction network employs a steepest descent based methodology that computes an optimal step length in each iteration to provide fast convergence. The online learned target model is applied in each frame to perform target-background classification. The final bounding box is then estimated using the overlap maximization approach as in [13]. See [5] for more details about the tracker.
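
For intuition, steepest descent with an analytically optimal step length for a quadratic (ridge-regression style) objective looks as follows; this is a generic illustration of the optimisation principle, not the DiMP model predictor itself:

    import numpy as np

    def steepest_descent(X, y, lam=0.1, n_iter=5):
        # Minimise 0.5*||X w - y||^2 + 0.5*lam*||w||^2 with an exact step length per iteration.
        w = np.zeros(X.shape[1])
        for _ in range(n_iter):
            g = X.T @ (X @ w - y) + lam * w                # gradient of the quadratic loss
            denom = np.sum((X @ g) ** 2) + lam * np.sum(g ** 2)
            alpha = np.sum(g ** 2) / (denom + 1e-12)       # optimal step for this descent direction
            w = w - alpha * g
        return w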

1.22 A.22 SuperDiMP (SuperDiMP)

G. Bhat, M. Danelljan, F. Gustafsson, T. B. Schön, L. Van Gool, R. Timofte

{goutam.bhat, martin.danelljan}@vision.ee.ethz.ch, {fredrik.gustafsson,

thomas.schon}@it.uu.se, {vangool, timofter}@vision.ee.ethz.ch

SuperDiMP [23] combines the standard DiMP classifier from [5] with the EBM-based bounding-box regressor from [14, 22]. Instead of training the bounding box regression network to predict the IoU with an L2 loss [5], it is trained using the NCE+ approach [23] to minimize the negative log-likelihood. Furthermore, the tracker uses improved training and inference settings.

1.23 A.23 Learning What to Learn for Video Object Segmentation (LWTL)

G. Bhat, F. Jaremo Lawin, M. Danelljan, A. Robinson, M. Felsberg, L. Van Gool, R. Timofte

goutam.bhat@vision.ee.ethz.ch, felix.jaremo-lawin@liu.se,

martin.danelljan@vision.ee.ethz.ch, {andreas.robinson, michael.felsberg}@liu.se, {vangool, timofter}@vision.ee.ethz.ch

LWTL is an end-to-end trainable video object segmentation (VOS) architecture which captures the current target object information in a compact parametric model. It integrates a differentiable few-shot learner module, which predicts the target model parameters using the first-frame annotation. The learner is designed to explicitly optimize an error between the target model prediction and a ground-truth label, which ensures a powerful model of the target object. Given a new frame, the target model predicts an intermediate representation of the target mask, which is input to the offline-trained segmentation decoder to generate the final segmentation mask. LWTL learns the ground-truth labels used by the few-shot learner to train the target model. Furthermore, a network module is trained to predict spatial importance weights for different elements in the few-shot learning loss. All modules in the architecture are trained end-to-end by maximizing segmentation accuracy on annotated VOS videos. See [7] for more details.

1.24 A.24 Adaptive Failure-Aware Tracker (AFAT)

T. Xu, S. Zhao, Z. Feng, X. Wu, J. Kittler

tianyang.xu@surrey.ac.uk, zsc960813@163.com, z.feng@surrey.ac.uk,

wu_xiaojun@jiangnan.edu.cn, j.kittler@surrey.ac.uk

The Adaptive Failure-Aware Tracker [80] is based on a Siamese structure. First, a multi-RPN module is employed to predict the central location using a ResNet-50 backbone. Second, a two-cell LSTM is established to perform quality prediction with an additional motion model. Third, a fused mask branch is exploited for segmentation.

1.25 A.25 Ensemble Correlation Filter Tracking Based on Temporal Confidence Learning (TCLCF)

C. Tsai

chiyi_tsai@gms.tku.edu.tw

TCLCF is a real-time ensemble correlation filter tracker based on a temporal confidence learning method. In the current implementation, we use four different correlation filters to collaboratively track the same target. The TCLCF tracker is a fast and robust generic object tracker that requires no GPU acceleration. Therefore, it can be deployed on embedded platforms with limited computing resources.

1.26 A.26 AFOD: Adaptive Focused Discriminative Segmentation Tracker (AFOD)

Y. Chen, J. Xu, J. Yu

{yiwei.chen, jingtao.xu, jiaqian.yu}@samsung.com

The proposed tracker is based on D3S and DiMP [5], employing a ResNet-50 backbone. AFOD calculates the feature similarity to the foreground and background of the template as proposed in D3S. For discriminative features, AFOD updates the target model online. AFOD adaptively utilizes different strategies during tracking to update the scale of the search region and to adjust the prediction. Moreover, the Lovász hinge loss is used to learn the IoU score in offline training. The segmentation module is trained using both the YouTube-VOS 2019 and DAVIS 2016 datasets. The offline training process includes two stages: (i) the BCE loss is used for optimization and (ii) the Lovász hinge is applied for further fine-tuning. For inference, two ResNet-50 models are used: one for segmentation and another for the target model.

1.27 A.27 Fast Saliency-Guided Continuous Correlation Filter-Based Tracker (FSC2F)

A. Memarmoghadam

a.memarmoghadam@yahoo.com

The tracker FSC2F is based on the ECOhc approach [12]. A fast spatio-temporal saliency map is added using the PQFT approach [21]. The PQFT model utilizes intensity, colour, and motion features for a quaternion representation of the search-image context around the previous pose of the tracked object. Attentional regions in the coarse saliency map can therefore constrain the target confidence peaks. Moreover, a faster scale estimation algorithm is obtained by enhancing the fast fDSST method [15] via joint learning of sparsely-sampled scale spaces.

1.28 A.28 Adaptive Visual Tracking and Instance Segmentation (DESTINE)

S.M. Marvasti-Zadeh, J. Khaghani, L. Cheng, H. Ghanei-Yakhdan, S. Kasaei

mojtaba.marvasti@ualberta.ca, khaghani@ualberta.ca, lcheng5@ualberta.ca,

hghaneiy@yazd.ac.ir, kasaei@sharif.edu

DESTINE is a two-stage method consisting of axis-aligned bounding box estimation and mask prediction, respectively. First, DiMP50 [5] is used as the baseline tracker, switching to ATOM [13] when the IoU and the normalized L1 distance between their results meet predefined thresholds. Then, to segment the estimated bounding box, the segmentation network of FRTM-VOS [60] uses the mask predicted by SiamMask [74] as its scores. Finally, DESTINE selects the best target mask according to the ratio of foreground pixels in the two predictions. The code is publicly released at https://github.com/MMarvasti/DESTINE.

1.29 A.29 Scale Adaptive Mean-Shift Tracker (ASMS)

Submitted by VOT Committee

The mean-shift tracker optimizes the Hellinger distance between the template histogram and the target candidate in the image. This optimization is done by gradient descent. ASMS [73] addresses the problem of scale adaptation and presents a novel, theoretically justified scale estimation mechanism which relies solely on the mean-shift procedure for the Hellinger distance. ASMS also introduces two improvements of the mean-shift tracker that make the scale estimation more robust in the presence of background clutter – a novel histogram colour weighting and a forward-backward consistency check. Code is available at https://github.com/vojirt/asms.
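
The Hellinger distance between the normalised template histogram p and a candidate histogram q, which the mean-shift iterations minimise, can be computed as follows (a generic sketch, not the released C++ code):

    import numpy as np

    def hellinger_distance(p, q):
        # p, q: colour histograms normalised to sum to one.
        bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
        return np.sqrt(max(0.0, 1.0 - bc))   # Hellinger distance in [0, 1]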

1.30 A.30 ATOM: Accurate Tracking by Overlap Maximization (ATOM)

Submitted by VOT Committee

ATOM separates the tracking problem into two sub-tasks: (i) target classification, where the aim is to robustly distinguish the target from the background; and (ii) target estimation, where an accurate bounding box for the target is determined. Target classification is performed by training a discriminative classifier online. Target estimation is performed by an overlap maximization approach, where a network module is trained offline to predict the overlap between the target object and a bounding box estimate, conditioned on the target appearance in the first frame. See [13] for more details.

1.31 A.31 Discriminative Correlation Filter with Channel and Spatial Reliability - C++ (CSRpp)

Submitted by VOT Committee

The CSRpp tracker is the C++ implementation of the Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) tracker [47].

1.32 A.32 Incremental Learning for Robust Visual Tracking (IVT)

Submitted by VOT Committee

The idea of the IVT tracker [62] is to incrementally learn a low-dimensional sub-space representation, adapting on-line to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two features: a method for correctly updating the sample mean, and a forgetting factor to ensure less modelling power is expended fitting older observations.

1.33 A.33 Kernelized Correlation Filter (KCF)

Submitted by VOT Committee

This tracker is a C++ implementation of the Kernelized Correlation Filter [24] operating on simple HOG features and Colour Names. The KCF tracker is equivalent to a Kernel Ridge Regression trained with thousands of sample patches around the object at different translations. It implements multi-thread multi-scale support and sub-cell peak estimation, and replaces the linear-interpolation model update with a more robust update scheme. Code is available at https://github.com/vojirt/kcf.
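
At its core, KCF solves the ridge regression over all cyclic shifts of the target patch in the Fourier domain; a minimal single-channel, linear-kernel sketch (the released tracker uses HOG/Colour Names features and a Gaussian kernel):

    import numpy as np

    def kcf_train(x, y, lam=1e-4):
        # x: training patch, y: desired Gaussian-shaped response of the same size.
        xf = np.fft.fft2(x)
        kf = xf * np.conj(xf)                 # linear-kernel auto-correlation in the Fourier domain
        alphaf = np.fft.fft2(y) / (kf + lam)  # dual ridge-regression solution
        return alphaf, xf

    def kcf_detect(alphaf, xf, z):
        # z: search patch; the peak of the response map gives the translation.
        kzf = np.fft.fft2(z) * np.conj(xf)    # cross-correlation kernel
        return np.real(np.fft.ifft2(alphaf * kzf))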

1.34 A.34 Multiple Instance Learning tracker (MIL)

Submitted by VOT Committee

The MIL tracker [1] uses a tracking-by-detection approach, more specifically Multiple Instance Learning instead of traditional supervised learning methods, and shows improved robustness to inaccuracies of the tracker and to incorrectly labelled training samples.

1.35 A.35 Robust Siamese Fully Convolutional Tracker (RSiamFC)

Submitted by VOT Committee

The RSiamFC tracker is an extended SiamFC tracker [4] with a robust training method which applies a transformation to a training sample to generate a pair of samples for feature extraction.

1.36 A.36 VOS SOTA Method (STM)

Submitted by VOT Committee

Please see the original paper for details [56].

1.37 A.37 (UPDT)

Submitted by VOT Committee

Please see the original paper for details [6].

B VOT-LT2020 Submissions

This appendix provides a short summary of trackers considered in the VOT-LT2020 challenge.

1.1 B.1 Long-Term Visual Tracking with Assistant Global Instance Search (Megtrack)

Z. Mai, H. Bai, K. Yu, X. Qiu

marchihjun@gmail.com, 522184271@qq.com, valjean1832@outlook.com,

qiuxi@megvii.com

The Megtrack tracker applies a two-stage method that consists of local tracking and multi-level search. The local tracker is based on the ATOM [13] algorithm, improved by initializing online correlation filters with backbone feature maps and by inserting a bounding box calibration branch into the target estimation module. SiamMask [74] is cascaded to further refine the bounding box after locating the centre of the target. The multi-level search uses an RPN-based regression network to generate candidate proposals before applying GlobalTrack [26]. Appearance scores are calculated using both the online-learned RTMDNet [29] and the offline-learned one-shot matching module, and are linearly combined to leverage the former’s high robustness and the latter’s discriminative power. Using a pre-defined threshold, the highest-scored proposal is considered the current tracker state and used to re-initialize the local tracker for subsequent tracking.

1.2 B.2 Skimming-Perusal Long-Term Tracker (SPLT)

B. Yan, H. Zhao, D. Wang, H. Lu, X. Yang

{yan_bin, haojie_zhao}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,

xyang@remarkholdings.com

This is the original SPLT tracker [83] without modification. SPLT consists of a perusal module and a skimming module. The perusal module aims at obtaining precise bounding boxes and determining the target’s state in a local search region. The skimming module is designed to quickly filter out most unreliable search windows, speeding up the whole pipeline.

1.3 B.3 A Baseline Long-Term Tracker with Meta-Updater (LTMU_B)

K. Dai, D. Wang, J. Li, H. Lu, X. Yang

dkn2014@mail.dlut.edu.cn, {wdice, jianhual}@dlut.edu.cn, lhchuan@dlut.edu.cn,

xyang@remarkholdings.com

The tracker LTMU_B is a simplified version of LTMU [11] and LT_DSE with comparable performance, without the RPN-based regression network, the sliding-window-based re-detection module and the complex mechanism for model updating and target re-localization. The short-term tracker of LTMU_B contains two components. One is responsible for target localization and is based on the DiMP algorithm [5], using ResNet-50 as the backbone network; the update of DiMP is controlled by the meta-updater proposed in LTMU (see Note 13). The second component is the SiamMask network [74], used for refining the bounding box after locating the centre of the target; it takes the local search region as input and outputs tight bounding boxes of candidate proposals. For the verifier, we adopt the MDNet network [55], which uses VGG-M as the backbone and is pre-trained on the ILSVRC VID dataset. The classification score is obtained by feeding the tracking result’s feature to three fully connected layers. GlobalTrack [26] is utilised as the global detector.

1.4 B.4 Robust Long-Term Object Tracking via Improved Discriminative Model Prediction (RLTDiMP)

S. Choi, J. Lee, Y. Lee, A. Hauptmann

seokeon@kaist.ac.kr, {ljhyun33, swack9751}@korea.ac.kr, alex@cs.cmu.edu

We propose an improved Discriminative Model Prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline tracker is SuperDiMP which combines the bounding-box regressor of PrDiMP [14] with the standard DiMP [5] classifier. To make our model more discriminative and robust, we introduce uncertainty reduction using random erasing, background augmentation for more discriminative feature learning, and random search with spatio-temporal constraints. Code available at https://github.com/bismex/RLT-DIMP.
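
Random erasing itself is a standard augmentation; a hedged sketch of one erased sample (the size range and fill values are assumptions, and the uncertainty estimate would come from running the tracker on several such erased patches):

    import numpy as np

    def random_erase(patch, max_frac=0.3, rng=None):
        # patch: image array of shape (H, W) or (H, W, C); returns a copy with a
        # random rectangle replaced by noise.
        rng = rng or np.random.default_rng()
        h, w = patch.shape[:2]
        eh = int(h * rng.uniform(0.1, max_frac))
        ew = int(w * rng.uniform(0.1, max_frac))
        y, x = rng.integers(0, h - eh), rng.integers(0, w - ew)
        out = patch.copy()
        out[y:y + eh, x:x + ew] = rng.integers(0, 256, size=(eh, ew) + patch.shape[2:])
        return out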

1.5 B.5 Long-Term MDNet (ltMDNet)

H. Fan, H. Ling

{hefan, hling}@cs.stonybrook.edu

We design a long-term tracker by adapting MDNet [55]. Specifically, we utilize an instance-aware detector [26] to generate target proposals. These proposals are then forwarded to MDNet for classification. Since the detector operates on the full image, the final tracker can locate the target anywhere in the image, which allows it to robustly deal with full occlusion and out-of-view cases. The instance-aware detector is implemented based on Faster R-CNN with a ResNet-50 backbone. MDNet is implemented as in the original paper.

1.6 B.6 (CLGS)

Submitted by VOT Committee

In this work, we develop a complementary local-global search (CLGS) framework to conduct robust long-term tracking, which consists of a robust local tracker based on SiamMask [74], a global detector based on Cascade R-CNN [8], and an online verifier based on Real-time MDNet [29]. During online tracking, the SiamMask model locates the target in a local region and estimates the size of the target according to the predicted mask. The online verifier is used to judge whether the target is found or lost. Once the target is lost, a global R-CNN detector (without class prediction) is used to generate region proposals over the whole image. The online verifier then finds the target among the region proposals again. Besides, we design an effective online update strategy to improve the discrimination power of the verifier.

1.7 B.7 (LT_DSE)

Submitted by VOT Committee

This algorithm divides each long-term sequence into several short episodes and tracks the target in each episode using short-term tracking techniques. Whether the target is visible or not is judged from the outputs of the short-term local tracker and of the classification-based verifier updated online. If the target disappears, image-wide re-detection is conducted, which outputs the possible location and size of the target. Based on these, the tracker crops the local search region that may include the target and sends it to the RPN-based regression network. The candidate proposals from the regression network are then scored by the online-learned verifier. If the candidate with the maximum score is above the pre-defined threshold, the tracker regards it as the target and re-initializes the short-term components, as outlined in the sketch below. Finally, the tracker conducts short-term tracking until the target disappears again.
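
The control flow above follows a common long-term tracking loop; the following pseudo-implementation is a hedged sketch with placeholder component interfaces:

    def long_term_step(frame, visible, local_tracker, regressor, verifier, detector, thr=0.5):
        if visible:
            region = local_tracker.track(frame)          # short-term tracking
        else:
            region = detector.redetect(frame)            # image-wide re-detection
        candidates = regressor.propose(frame, region)    # RPN-based candidate boxes
        scores = [verifier.score(frame, c) for c in candidates]
        best = max(range(len(scores)), key=lambda i: scores[i])
        if scores[best] > thr:
            local_tracker.reinitialize(frame, candidates[best])
            return candidates[best], True                # target considered visible
        return None, False                               # target considered absent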

1.8 B.8 (SiamDW_LT)

Submitted by VOT Committee

SiamDW_LT is a long-term tracker that utilizes deeper and wider backbone networks with fast online model updates. The basic tracking module is a short-term Siamese tracker, which returns confidence scores to indicate the tracking reliability. When the Siamese tracker is uncertain about its tracking accuracy, an online correction module is triggered to refine the results. When the Siamese tracker fails, a global re-detection module is activated to search for the target over the whole image. Moreover, object disappearance and occlusion are also detected from the tracking confidence. In addition, we introduce a model ensemble to further improve tracking accuracy and robustness.

C VOT-RGBT2020 Submissions

This appendix provides a short summary of trackers considered in the VOT-RGBT2020 challenge.

1.1 C.1 Multi-model Continuous Correlation Filter for RGBT Visual Object Tracking (M2C2Frgbt)

A. Memarmoghadam

a.memarmoghadam@yahoo.com

Inspired by the ECO tracker [12], we propose a robust yet efficient tracker, named M2C2Frgbt, that utilizes multiple models of the tracked object and estimates its position in every frame by weighted cumulative fusion of their respective regressors in a ridge-regression optimization problem [51]. Moreover, to accelerate tracking, we propose a faster scale estimation method in which the target scale filter is jointly learned via sparsely sampled scale spaces constructed from the thermal infrared data alone. Our scale estimation approach improves the running speed of the baseline fDSST [15] by more than 20% while maintaining its tracking performance. To suppress unwanted samples, which mostly belong to occlusions or other non-object data, we conservatively update every model on the fly in a non-uniform, sparse manner.

1.2 C.2 Jointly Modelling Motion and Appearance Cues for Robust RGB-T Tracking (JMMAC)

P. Zhang, S. Chen, D. Wang, H. Lu, X. Yang

pyzhang@mail.dlut.edu.cn, shuhaochn@mail.dlut.edu.cn, wdice@dlut.edu.cn,

lhchuan@dlut.edu.cn, xyang@remarkholdings.com

Our tracker is based on [88] and consists of two components, i.e. multimodal fusion of appearance trackers and camera motion estimation. In the multimodal fusion, we develop a late fusion method to infer the fusion weight maps of the RGB and thermal (T) modalities. The fusion weights are determined by offline-trained global and local Multimodal Fusion Networks (MFNet) and are then used to linearly combine the response maps of the RGB and T modalities obtained from the ECO trackers. In MFNet, a truncated VGG-M network is used as the backbone to extract deep features. In camera motion estimation, when drastic camera motion is detected, the movement is compensated to correct the search region using a key-point-based image registration technique. Finally, we employ YOLOv2 to refine the bounding box. The scale estimation and model updating methods are borrowed from ECO by default.
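
The late fusion step, combining the per-modality response maps with predicted weights, can be sketched as follows; MFNet itself is not shown and the weight values are placeholders.

    import numpy as np

    # Illustrative late fusion of RGB and thermal response maps with predicted weights.
    def late_fuse(resp_rgb, resp_t, w_rgb, w_t, eps=1e-8):
        """resp_*: HxW response maps from the per-modality trackers;
        w_*: fusion weights (scalars or HxW maps) predicted by the fusion networks."""
        w_sum = w_rgb + w_t + eps
        fused = (w_rgb * resp_rgb + w_t * resp_t) / w_sum       # linear combination
        dy, dx = np.unravel_index(np.argmax(fused), fused.shape)
        return fused, (dy, dx)

    # Example with a scalar weight per modality.
    r_rgb, r_t = np.random.rand(61, 61), np.random.rand(61, 61)
    fused, peak = late_fuse(r_rgb, r_t, w_rgb=0.7, w_t=0.3)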

1.3 C.3 Accurate Multimodal Fusion for RGB-T Object Tracking (AMF)

P. Zhang, S. Chen, B. Yan, D. Wang, H. Lu, X. Yang

{pyzhang, shuhaochn, yan_bin}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn, xyang@remarkholdings.com

We achieve multimodal fusion for RGB-T tracking by linear combining the response maps obtained from two monomodality base trackers, i.e., DiMP. The fusion weight is obtained by the Multimodal Fusion Network proposed in [88]. To achieve high accuracy, the bounding box obtained from fused DiMP is then refined by a refinement module in visible modality. The refinement module, namely Alpha-Refine, aggregates features via a pixel-level correlation layer and a non-local layer and adaptively selects the most adequate results from three branches, namely bounding box, corner and mask heads, which can predict more accurate bounding boxes. Note that the target scale estimated by IoUNet in DiMP is also applied in visible modality which is followed by Alpha-Refine and the model updating method is borrowed from DiMP in default.

1.4 C.4 SqueezeNet Based Discriminative Correlation Filter Tracker (SNDCFT)

A. Varfolomieiev

a.varfolomieiev@kpi.ua

The tracker uses FHOG and convolutional features extracted from both the visible and infrared modalities. As convolutional features, the output of the ‘fire2/concat’ layer of the original SqueezeNet network [27] is used (no additional pre-training of the network is performed). The core of the tracker is a spatially regularized discriminative correlation filter computed with an ADMM optimizer. The DCF filter is computed independently for each feature modality and is updated in every frame using simple exponential forgetting.
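
The exponential-forgetting update amounts to a convex blend of the previous filter and the one estimated on the current frame; a minimal sketch, assuming a fixed learning rate (the value below is a placeholder):

    import numpy as np

    # Sketch of an exponential-forgetting model update (learning rate assumed).
    def update_filter(prev_filter, new_filter, learning_rate=0.02):
        """Blend the previous DCF with the filter estimated on the current frame."""
        return (1.0 - learning_rate) * prev_filter + learning_rate * new_filter

    # Example: filters stored in the Fourier domain as complex arrays.
    f_prev = np.zeros((64, 64), dtype=np.complex128)
    f_new = np.fft.fft2(np.random.rand(64, 64))
    f_prev = update_filter(f_prev, f_new)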

1.5 C.5 Decision Fusion Adaptive Tracker (DFAT)

H. Li, Z. Tang, T. Xu, X. Zhu, X. Wu, J. Kittler

hui_li_jnu@163.com, 1030415519@vip.jiangnan.edu.cn, tianyang.xu@surrey.ac.uk, xuefeng_zhu95@163.com, wu_xiaojun@jiangnan.edu.cn, j.kittler@surrey.ac.uk

The Decision Fusion Adaptive Tracker is based on a Siamese structure. First, multi-layer deep features are extracted by ResNet-50. Then, a multi-RPN module is employed to predict the central location from the multi-layer deep features. Finally, an adaptive weighting strategy for decision-level fusion is utilized to generate the final result. In addition, the template features are updated by a linear template update strategy.
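
One possible reading of the adaptive decision-level fusion and the linear template update is sketched below; the softmax-style weighting and the learning rate are illustrative assumptions.

    import numpy as np

    # Illustrative decision-level fusion of multi-layer RPN predictions with
    # adaptive, confidence-driven weights (weighting scheme is an assumption).
    def fuse_decisions(boxes, scores, temperature=1.0):
        """boxes: (N, 4) predictions from N RPN heads as (x, y, w, h);
        scores: their classification confidences; higher score -> larger weight."""
        boxes = np.asarray(boxes, dtype=np.float64)
        scores = np.asarray(scores, dtype=np.float64)
        w = np.exp(scores / temperature)
        w = w / w.sum()
        return (w[:, None] * boxes).sum(axis=0)   # fused box

    def update_template(template, new_template, eta=0.1):
        """Linear template update for the Siamese branch (eta assumed)."""
        return (1.0 - eta) * template + eta * new_template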

1.6 C.6 Multi-modal Fusion for End-to-End RGB-T Tracking (mfDiMP)

Submitted by VOT Committee

The mfDiMP tracker is an end-to-end tracking framework for fusing the RGB and TIR modalities in RGB-T tracking [87]. It fuses the modalities at the feature level in both the IoU predictor and the model predictor of DiMP [87] and won the VOT-RGBT2019 challenge.

1.7 C.7 Online Deeper and Wider Siamese Networks for RGBT Visual Tracking (SiamDW-T)

Submitted by VOT Committee

SiamDW-T is based on previous work by Zhang and Peng [91] and extends it with two fusion strategies for RGBT tracking. A simple fully connected layer is appended to classify each fused feature as background or foreground. SiamDW-T achieved the second rank in VOT-RGBT2019 and its code is available at https://github.com/researchmm/VOT2019.

D VOT-RGBD2020 Submissions

This appendix provides a short summary of trackers considered in the VOT-RGBD2020 challenge.

1.1 D.1 Accurate Tracking by Category-Agnostic Instance Segmentation for RGBD Image (ATCAIS)

Y. Wang, L. Wang, D. Wang, H. Lu, X. Yang

{wym097,wlj,wdice,lhchuan}@dlut.edu.cn, xyang@remarkholdings.com

The proposed tracker combines instance segmentation and depth information for accurate tracking. ATCAIS is based on the ATOM tracker and the HTC instance segmentation method, which is re-trained in a category-agnostic manner. The instance segmentation results are used to detect background distractors and to refine the target bounding boxes to prevent drifting. The depth values are used to detect target occlusion or disappearance and to re-find the target.
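
One way such a depth-based occlusion/disappearance test could look is sketched below; the use of the median depth and the tolerance value are assumptions for illustration.

    import numpy as np

    # Sketch of a depth-based occlusion / disappearance test (statistics and
    # tolerance are illustrative assumptions).
    def depth_occlusion_check(depth_map, box, target_depth, rel_tolerance=0.25):
        """box = (x, y, w, h) in pixels; target_depth is the depth remembered
        from reliable frames. Returns True if the region's depth deviates
        strongly from the expected target depth (likely occlusion or loss)."""
        x, y, w, h = [int(round(v)) for v in box]
        region = depth_map[y:y + h, x:x + w]
        region = region[region > 0]          # ignore invalid depth readings
        if region.size == 0:
            return True
        current_depth = np.median(region)
        return abs(current_depth - target_depth) > rel_tolerance * target_depth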

1.2 D.2 Depth Enhanced DiMP for RGBD Tracking (DDiMP)

S. Qiu, Y. Gu, X. Zhang

{shoumeng, gyz, xlzhang}@mail.sim.ac.cn

DDiMP is based on SuperDiMP, which combines the standard DiMP classifier from [5] with the bounding box regressor from [5]. The model update strategy during tracking is enhanced by using the model’s confidence in the current tracking result. The output of the IoU-Net is used to determine whether to fine-tune the shape, size, and position of the target. To handle scale variations, the target is searched over five scales \(1.025^{\{-2, -1, 0, 1, 2\}}\), and depth information is utilized to prevent the scale from changing too quickly. Finally, two trackers with different model-update confidence thresholds run in parallel, and the output with the higher confidence is selected as the tracking result for the current frame.
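
The five-scale search and a depth-constrained limit on the scale change can be sketched as follows; only the scale set \(1.025^{\{-2, -1, 0, 1, 2\}}\) is taken from the description, while the damping rule and tolerances are illustrative assumptions.

    import numpy as np

    # Sketch of a five-scale search with a depth-damped scale update.
    SCALE_FACTORS = 1.025 ** np.arange(-2, 3)   # {1.025^-2, ..., 1.025^2}

    def choose_scale(score_per_scale, prev_scale, depth_prev, depth_curr,
                     max_step=1.025, depth_tol=0.15):
        """Pick the best-scoring scale, but limit the step when the depth
        change suggests the apparent size should not jump."""
        best = prev_scale * SCALE_FACTORS[int(np.argmax(score_per_scale))]
        depth_stable = abs(depth_curr - depth_prev) <= depth_tol * depth_prev
        if depth_stable:
            # Depth says the distance barely changed: clamp the scale step.
            best = float(np.clip(best, prev_scale / max_step, prev_scale * max_step))
        return best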

1.3 D.3 Complementary Local-Global Search for RGBD Visual Tracking (CLGS-D)

H. Zhao, Z. Wang, B. Yan, D. Wang, H. Lu, X. Yang

{haojie_zhao, zzwang, yan_bin}@mail.dlut.edu.cn, {wdice, lhchuan}@dlut.edu.cn,

xyang@remarkholdings.com

The CLGS-D tracker is based on SiamMask, FlowNetv2, CenterNet, Real-time MDNet and a novel box refinement module. The SiamMask model serves as the base tracker. MDNet is used to judge whether the target is found or lost. Once the target is lost, CenterNet generates region proposals over the whole image, and FlowNetv2 estimates the motion of the target by producing a flow map. The region proposals are then filtered with the aid of the flow and depth maps. Finally, an online “verifier” re-identifies the target among the remaining region proposals. A novel module is also used in this work to refine the bounding box.
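
Filtering the region proposals with the flow and depth maps could, for example, look like the following sketch; the distance and depth tolerances are placeholder assumptions.

    import numpy as np

    # Illustrative filtering of region proposals by flow-predicted position and
    # depth consistency (thresholds are assumptions).
    def filter_proposals(proposals, flow_center, depth_map, target_depth,
                         max_dist=100.0, depth_tol=0.3):
        """proposals: list of (x, y, w, h); flow_center: target position obtained
        by propagating the last known box with the optical flow."""
        kept = []
        for (x, y, w, h) in proposals:
            cx, cy = x + w / 2.0, y + h / 2.0
            if np.hypot(cx - flow_center[0], cy - flow_center[1]) > max_dist:
                continue                      # too far from the flow prediction
            region = depth_map[int(y):int(y + h), int(x):int(x + w)]
            region = region[region > 0]
            if region.size and abs(np.median(region) - target_depth) > depth_tol * target_depth:
                continue                      # depth inconsistent with the target
            kept.append((x, y, w, h))
        return kept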

1.4 D.4 Siamese Network for Long-term RGB-D Tracking (Siam_LTD)

X.-F. Zhu, H. Li, S. Zhao, T. Xu, X.-J. Wu

{xuefeng_zhu95,hui_li_jnu,zsc960813,wu_xiaojun}@163.com,

tianyang.xu@surrey.ac.uk

Siam_LTD employs ResNet-50 to extract backbone features and an RPN branch to locate the target centre. In addition, a re-detection mechanism is introduced.
