Abstract

Discriminative correlation filter- (DCF-) based trackers are computationally efficient and achieve excellent tracking performance in challenging applications. However, most of them suffer from low accuracy and robustness because of the limited diversity of information extracted from a single type of spectral image (the visible spectrum). Fusion of visible and infrared imaging sensors, a typical form of multisensor cooperation, provides complementary features and consistently helps distinguish the target from the background efficiently in visual tracking. Therefore, this paper proposes a discriminative fusion correlation learning model that improves DCF-based tracking performance by efficiently combining multiple features from visible and infrared images. Fusion learning filters are extracted via late fusion with early estimation, in which the filters are weighted by their performance to improve the flexibility of fusion. Moreover, the proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning. Extensive experiments show that the proposed method achieves superior performance in challenging visible and infrared tracking tasks.

1. Introduction

Visual tracking has received widespread attention for its extensive applications in video surveillance, autonomous driving, human-machine interaction, military applications, robot vision, etc. [1, 2]. Depending on the appearance model, existing tracking algorithms can be divided into two categories: generative and discriminative tracking. Generative tracking algorithms build a target model and search for the candidate image patch with maximal similarity. For example, Wang et al. [3] proposed a regression-based object tracking framework that incorporates the Lucas-Kanade algorithm into an end-to-end deep learning paradigm. Chi et al. [4] trained a dual network with random patches measuring the similarities between the network activation and target appearance to improve the robustness of visual tracking. In contrast, the goal of discriminative algorithms is to learn a classifier that, given an initial image patch containing the target, discriminates between the target's appearance and that of the environment. Yang et al. [5] proposed a temporal restricted reverse-low-rank learning algorithm for visual tracking that jointly represents target and background templates via candidates, exploits the low-rank structure among consecutive target observations, and enforces the temporal consistency of the target at a global level. A peak strength metric [6] was proposed to measure the discriminative capability of the learned correlation filter; it effectively strengthens the peak of the correlation response, leading to more discriminative performance than previous methods.

Besides these efforts, other researchers have developed tracking methods that are both generative and discriminative. For instance, Zhang et al. [7] obtained an object likelihood map to adaptively regularize correlation filter learning, suppressing background clutter while making full use of long-term stable target appearance information. Qi et al. [8] proposed a structure-aware local sparse coding algorithm, which encodes a target candidate using templates with both global and local sparsity constraints and obtains a more precise and discriminative sparse representation to account for appearance changes. In [9], an adaptive set of filtering templates is learned to alleviate the drifting problem by carefully selecting object candidates in different situations to jointly capture target appearance variations; a variety of simple yet effective features are also integrated into the filter learning process to further improve the discriminative power of the filters. In the salient-sparse-collaborative tracker [10], an object salient feature map is built to create a salient-sparse discriminative model and a salient-sparse generative model that together handle appearance variation and effectively reduce tracking drift. A multilayer convolutional network-based visual tracking algorithm based on important region selection [11] builds high entropy selection and background discrimination models and obtains feature maps by weighting the template filters with cluster weights, which makes the training samples informative enough to provide stable information and discriminative enough to resist distractors. Generally speaking, discriminative and generative methods have complementary advantages in appearance modeling, and the success of a visual tracking method depends not only on its representation ability against appearance variations but also on its discriminability between target and background, which calls for a more robust training model [12].

Recently, discriminative correlation filter- (DCF-) based visual tracking methods [13–18] have shown excellent performance in real-time visual tracking owing to their robustness and computational efficiency. DCF-based methods learn an optimal correlation filter that is used to locate the target in the next frame. The significant gain in speed is obtained by exploiting the fast Fourier transform (FFT) at both the learning and detection stages [14]. Bolme et al. [13] presented an adaptive correlation filter, named the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters by optimizing the output sum of squared error. Based on MOSSE, Danelljan et al. [14, 15] proposed a scale-adaptive tracking approach that learns separate discriminative correlation filters for translation and scale estimation, achieving accurate and robust scale estimation in a tracking-by-detection framework. Galoogahi et al. [16] proposed a computationally efficient background-aware correlation filter based on hand-crafted features that efficiently models how both the foreground and background of the object vary over time. The work in [17] reformulates DCFs as a one-layer convolutional neural network that integrates feature extraction, response map generation, and model update with residual learning. Johnander et al. [18] proposed a unified formulation for learning a deformable convolution filter in which the deformable filter is represented as a linear combination of subfilters, and both the subfilter coefficients and their relative locations are inferred jointly. However, the above trackers fail when the target undergoes severe appearance changes because of the limited information supplied by a single feature type.

Multiple-feature fusion captures more useful information than a single feature, thus providing higher precision, certainty, and reliability for visual tracking. Wu et al. [19] proposed a data fusion approach via sparse representation with applications to robust visual tracking. Uzkent et al. [20] proposed an adaptive fusion tracking method that combines likelihood maps from multiple bands of hyperspectral imagery into a single, more distinctive representation, which increases the margin between the mean values of foreground and background pixels in the fused map. Chan et al. [21] proposed a robust adaptive fusion tracking method, which incorporates a novel complex cell into the group of object representations to enhance global distinctiveness. Feature fusion also achieves superior performance in correlation filter-based tracking. For example, Rapuru et al. [22] proposed a robust tracking algorithm that efficiently fuses tracking, learning, and detection with the systematic model update strategy of the kernelized correlation filter tracker.

Although much effort has been made, single-sensor feature fusion-based tracking suffers from low accuracy and robustness due to the lack of diversity information. Fusion of visible and infrared sensors, a typical form of multisensor cooperation, provides complementary features and enables more robust and accurate tracking [23]. Li et al. [24] designed a fusion scheme containing joint sparse representation and a colearning update model to fuse color visual spectrum and thermal spectrum images for object tracking. Li et al. [25] proposed an adaptive fusion scheme based on collaborative sparse representation in a Bayesian filtering framework for online tracking. Mangale and Khambete [26] developed a reliable camouflaged-target detection and tracking system using fusion of visible and infrared imaging. Yun et al. [23] proposed a compressive time-space Kalman fusion tracker with time-space adaptability for visible and infrared images and introduced an extended Kalman filter to update the fusion coefficients optimally. A visible and infrared fusion tracking algorithm based on a multiview multikernel fusion model is presented in [27]. Zhang et al. [28] transferred visible tracking data to infrared data to obtain better tracking performance. Lan et al. [29] proposed a joint feature learning and discriminative classifier framework for multimodality tracking, which jointly eliminates outlier samples caused by large variations and learns discriminability-consistent features from heterogeneous modalities. Li et al. [30] proposed a convolutional neural network architecture comprising a two-stream ConvNet and a FusionNet, demonstrating that tracking with visible and infrared fusion outperforms tracking with a single sensor in terms of accuracy and robustness.

DCF-based trackers have a significantly low computational load and are thus especially suitable for a variety of real-time challenging applications. However, most DCF-based trackers suffer from low accuracy and robustness due to the lack of diversity information extracted from a single type of spectral image (the visible spectrum). Therefore, this paper proposes a discriminative fusion correlation learning model to improve DCF-based tracking performance by combining multiple features from visible and infrared imaging sensors. The main contributions of our work are summarized as follows:
(i) A discriminative fusion correlation learning model is presented to fuse visible and infrared features such that valuable information from all sensors is preserved.
(ii) The proposed fusion learning filters are obtained via late fusion with early estimation, in which the filters are weighted by their performance to improve the flexibility of fusion.
(iii) The proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning.

The remainder of this paper is organized as follows. In Section 2, the multichannel discriminative correlation filter is introduced. In Section 3, we describe our work in detail. The experimental results are presented in Section 4. Section 5 concludes with a general discussion.

2. Multichannel Discriminative Correlation Filter

Multichannel DCF provides superior robustness and efficiency in dealing with challenging tracking tasks [14]. In the multichannel DCF-based tracking algorithm, $d$-channel Histogram of Oriented Gradients (HOG) features [14] are extracted from the target sample to maintain diverse information. During the training process, the goal is to learn a correlation filter $h$, which is achieved by minimizing the error of the correlation response compared to the desired correlation output $g$ as

$\varepsilon = \Bigl\| \sum_{l=1}^{d} h^{l} \star f^{l} - g \Bigr\|^{2} + \lambda \sum_{l=1}^{d} \bigl\| h^{l} \bigr\|^{2}$, (1)

where $\star$ denotes circular correlation and $\lambda$ is the weight parameter [14]. $f^{l}$ and $h^{l}$ are the $l$-th channel feature and the corresponding correlation filter, respectively. The correlation output $g$ is chosen to be a Gaussian function with a parametrized standard deviation [14].

The minimization of (1) can be solved in the Fourier domain, where the optimal filter is given by

$H^{l} = \dfrac{\bar{G} F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}$, (2)

where $F^{l}$, $G$, and $H^{l}$ are the discrete Fourier transforms (DFTs) of $f^{l}$, $g$, and $h^{l}$, respectively. The bar denotes complex conjugation, and the multiplications and divisions in (2) are performed pointwise. The numerator $A^{l}$ and denominator $B$ of the filter in (2) are updated as

$A_{t}^{l} = (1 - \eta) A_{t-1}^{l} + \eta \bar{G}_{t} F_{t}^{l}, \qquad B_{t} = (1 - \eta) B_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}$, (3)

where $\eta$ is a learning rate parameter.
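To make the learning and update steps concrete, the following minimal NumPy sketch maintains the numerator and denominator of (2) and applies the running-average update of (3). It is a sketch under our reconstruction of the formulas; the function names, the Gaussian label helper, and the default learning rate eta = 0.025 are illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    # Desired correlation output g: a 2D Gaussian centred in the patch.
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_dcf(feats, g):
    # Per-channel numerator A^l = conj(G) * F^l and shared denominator
    # B = sum_k conj(F^k) * F^k of the filter in Eq. (2).
    G = np.fft.fft2(g)
    F = [np.fft.fft2(f) for f in feats]
    A = [np.conj(G) * Fl for Fl in F]
    B = np.add.reduce([(np.conj(Fl) * Fl).real for Fl in F])
    return A, B

def update_dcf(A, B, feats, g, eta=0.025):
    # Running-average update of Eq. (3) with learning rate eta.
    A_new, B_new = train_dcf(feats, g)
    A = [(1 - eta) * Ao + eta * An for Ao, An in zip(A, A_new)]
    B = (1 - eta) * B + eta * B_new
    return A, B
```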

During the tracking process, the DFT of the correlation score $y$ of the test sample $z$ is computed in the Fourier domain as

$Y = \dfrac{\sum_{l=1}^{d} \bar{A}^{l} Z^{l}}{B + \lambda}$, (4)

where $Z^{l}$ and $Y$ are the DFTs of $z^{l}$ and $y$, respectively, and $A^{l}$ and $B$ are the numerator and denominator of the filter updated in the previous frame. The correlation score $y$ is then obtained by taking the inverse DFT $y = \mathcal{F}^{-1}\{Y\}$. The estimate of the current target state is obtained by finding the maximum correlation score among the test samples.
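A corresponding detection sketch, continuing the helpers above and assuming A and B are maintained as in the previous block (the regularizer lam is an illustrative default):

```python
def detect(A, B, feats, lam=0.01):
    # Eq. (4): Y = sum_l conj(A^l) * Z^l / (B + lambda); y = IDFT(Y).
    Z = [np.fft.fft2(z) for z in feats]
    num = np.add.reduce([np.conj(Al) * Zl for Al, Zl in zip(A, Z)])
    y = np.fft.ifft2(num / (B + lam)).real
    # The target estimate is the location of the maximum response.
    peak = np.unravel_index(np.argmax(y), y.shape)
    return y, peak
```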

3. Proposed Discriminative Fusion Correlation Learning

In this section, we introduce the tracking framework of the proposed algorithm, the general scheme of which is shown in Figure 1. Firstly, multichannel features are extracted from the visible and infrared images, respectively, according to [14]. Secondly, the proposed discriminative filter selection and fusion filter learning are applied to obtain the fusion response map. Finally, the discriminative filters and fusion filters are updated with the tracking result obtained from the response map. We discuss these steps in detail below.

3.1. Discriminative Filter Selection

Following DCF-based trackers, we obtain the target correlation output $y_{T}$ by

$y_{T} = \sum_{l=1}^{d} h_{T}^{l} \star f_{T}^{l}$, (5)

where $f_{T}^{l}$ and $h_{T}^{l}$ are the target sample and the target correlation filter corresponding to the $l$-th channel feature among the $d$ channels, respectively. In this paper, the desired output $g_{T}$ is selected as a 2D Gaussian function [13].

Before tracking, the optimal target correlation filters are chosen in the training step by minimizing (5) against $g_{T}$ in the Fourier domain, which yields

$H_{T}^{l} = \dfrac{\bar{G}_{T} F_{T}^{l}}{\sum_{k=1}^{d} \bar{F}_{T}^{k} F_{T}^{k} + \lambda}$, (6)

where $F_{T}^{l}$, $G_{T}$, and $H_{T}^{l}$ are the DFTs of $f_{T}^{l}$, $g_{T}$, and $h_{T}^{l}$, respectively.

Different from the single training sample of the target appearance, multiple background samples at different locations around the target need to be considered to maintain a stable model. However, extracting multichannel features from each background sample increases the computational complexity significantly, and in practice single-channel features from multiple background samples are sufficient to achieve satisfactory performance. Therefore, in this paper, we extract $m$ background samples randomly within an annulus around the target location [11] and obtain the background correlation output $y_{B}$ as

$y_{B} = \sum_{j=1}^{m} h_{B} \star f_{B}^{j}$, (7)

where $f_{B}^{j}$ denotes the $j$-th background sample.

Similarly, the optimal background correlation filter is selected in the training step by minimizing (7) against the desired background output $g_{B}$ in the Fourier domain, which yields

$H_{B} = \dfrac{\sum_{j=1}^{m} \bar{G}_{B} F_{B}^{j}}{\sum_{j=1}^{m} \bar{F}_{B}^{j} F_{B}^{j} + \lambda}$, (8)

where $F_{B}^{j}$ and $G_{B}$ are the DFTs of $f_{B}^{j}$ and $g_{B}$, respectively.
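As an illustration, the following sketch draws the background sample centres from an annulus and pools the samples into a single-channel filter, MOSSE-style, matching our reconstruction of (8). The annulus radii r_in and r_out and the fixed random seed are illustrative parameters, not values from the paper.

```python
def sample_annulus(center, r_in, r_out, m, rng=None):
    # Draw m background sample centres uniformly from an annulus around
    # the target location; the radii r_in < r_out are free parameters.
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, m)
    radius = rng.uniform(r_in, r_out, m)
    cy, cx = center
    return [(int(round(cy + r * np.sin(t))), int(round(cx + r * np.cos(t))))
            for r, t in zip(radius, theta)]

def train_background_filter(bg_patches, g_B):
    # Single-channel background filter of Eq. (8), pooled over the m
    # samples: A_B = sum_j conj(G_B) F^j, B_B = sum_j conj(F^j) F^j.
    G = np.fft.fft2(g_B)
    F = [np.fft.fft2(p) for p in bg_patches]
    A_B = np.add.reduce([np.conj(G) * Fj for Fj in F])
    B_B = np.add.reduce([(np.conj(Fj) * Fj).real for Fj in F])
    return A_B, B_B
```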

While tracking, the DFT of the estimated discriminative correlation score $y_{D}$ of the test sample $z$ is defined as

$Y_{D} = Y_{T} - Y_{B} = \dfrac{\sum_{l=1}^{d} \bar{A}_{T}^{l} Z^{l}}{B_{T} + \lambda} - \dfrac{\bar{A}_{B} Z}{B_{B} + \lambda}$, (9)

where $Y_{T}$ and $Y_{B}$ are the DFTs of the target and background correlation scores $y_{T}$ and $y_{B}$, respectively, and $Z$ is the DFT of $z$. $A_{T}^{l}$ and $B_{T}$ denote the numerator and denominator of the filter in (6), and $A_{B}$ and $B_{B}$ denote the numerator and denominator of the filter in (8). The discriminative correlation score $y_{D}$ is then obtained by taking the inverse DFT $y_{D} = \mathcal{F}^{-1}\{Y_{D}\}$. The estimate of the current target state is obtained by finding the maximum correlation score among the test samples as $p^{*} = \arg\max y_{D}$.
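A sketch of this step, continuing the helpers above; note that both the subtraction in (9) and the choice of applying the background filter to the first feature channel are assumptions of our reconstruction rather than details confirmed by the paper.

```python
def discriminative_response(A_T, B_T, A_B, B_B, feats, lam=0.01):
    # Eq. (9) as reconstructed above: the single-channel background
    # response is subtracted from the multichannel target response so
    # that background-like regions are suppressed.
    y_T, _ = detect(A_T, B_T, feats, lam)
    y_B, _ = detect([A_B], B_B, [feats[0]], lam)  # assumption: channel 0
    y_D = y_T - y_B
    p = np.unravel_index(np.argmax(y_D), y_D.shape)
    return y_D, p
```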

3.2. Fusion Learning Filter

As shown by Wagner et al. [31], late fusion with early estimation provides better performance than early fusion with late estimation. Based on this conclusion, we use the discriminative correlation filters to obtain estimates of the target location in the visible and infrared images separately and then fuse them with the fusion correlation filters. Let $p_{k}$ denote the estimated target location in the $k$-th image (visible or infrared). We define the region $R$ as the minimum bounding rectangle that contains the sample regions at $p_{1}$ and $p_{2}$, extract the fusion test samples $z_{F}^{k}$ from $R$, and define the DFT of the fusion correlation score of the fusion sample as

$Y_{F} = \sum_{k} \omega_{k} \bigl( Y_{FT}^{k} - Y_{FB}^{k} \bigr)$, (10)

where $Y_{FT}^{k}$ and $Y_{FB}^{k}$ are the DFTs of the target and background fusion correlation scores $y_{FT}^{k}$ and $y_{FB}^{k}$ computed from the fusion sample of the $k$-th image, and $\omega_{k}$ denotes the image weight

$\omega_{k} = \dfrac{\max\bigl(y_{D}^{k}\bigr)}{\sum_{j} \max\bigl(y_{D}^{j}\bigr)}$,

where $y_{D}^{k}$ is the discriminative correlation score of the $k$-th image computed by (9).

After obtaining $Y_{F}$, the fusion correlation score $y_{F}$ is obtained by taking the inverse DFT $y_{F} = \mathcal{F}^{-1}\{Y_{F}\}$. The fusion location of the current target state is obtained by finding the maximum fusion correlation score among the test samples.
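A minimal sketch of the peak-weighted late fusion, simplified so that the two discriminative responses are fused directly rather than re-correlated over the bounding rectangle R; the weights mirror the $\omega_{k}$ built from the response maxima above.

```python
def fuse_responses(y_D_vis, y_D_ir):
    # Peak-weighted late fusion: each spectrum's discriminative response
    # is weighted by the relative strength of its own peak (Eq. (10)).
    peaks = np.array([y_D_vis.max(), y_D_ir.max()])
    w = peaks / peaks.sum()
    y_F = w[0] * y_D_vis + w[1] * y_D_ir
    p = np.unravel_index(np.argmax(y_F), y_F.shape)
    return y_F, p, w
```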

The whole tracking process of DFCL is summarized in Algorithm 1.

Input: the t-th visible and infrared images
For t = 1 to number of frames do
 1. Crop the samples and extract the l-th (l = 1, ..., d) channel features for the visible and infrared images, respectively.
 2. Compute the discriminative correlation scores y_D using Eq. (9).
 3. Compute the fusion correlation score y_F using Eq. (10).
 4. Obtain the tracking result p* by maximizing y_F.
 5. Extract the l-th (l = 1, ..., d) channel features of the target samples and the j-th (j = 1, ..., m) background samples f_B^j.
 6. Update the discriminative correlation filters H_T^l and H_B using Eq. (6) and Eq. (8), respectively.
end for
Output: Target result p* and the discriminative correlation filters H_T^l and H_B
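Tying the earlier sketches together, the following per-frame loop illustrates steps 1-4 of Algorithm 1. It is a simplified sketch: raw-intensity patches stand in for the d-channel HOG features, and crop and the filter tuples are illustrative constructions, not the paper's implementation.

```python
def crop(img, center, size):
    # Axis-aligned square crop around center; for brevity the sketch
    # assumes the patch lies fully inside the image.
    cy, cx = center
    h = size // 2
    return img[cy - h:cy + h, cx - h:cx + h].astype(np.float64)

def track_frame(vis, ir, center, size, filters, lam=0.01):
    # One pass of steps 1-4 of Algorithm 1. `filters` holds one
    # (A_T, B_T, A_B, B_B) tuple per spectrum (visible, infrared).
    responses = []
    for img, (A_T, B_T, A_B, B_B) in zip((vis, ir), filters):
        z = crop(img, center, size)
        y_D, _ = discriminative_response(A_T, B_T, A_B, B_B, [z], lam)
        responses.append(y_D)
    y_F, peak, _ = fuse_responses(*responses)
    # Convert the response peak into a shift relative to the patch centre.
    dy, dx = peak[0] - size // 2, peak[1] - size // 2
    return (center[0] + dy, center[1] + dx), y_F
```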

4. Experiments

The proposed DFCL algorithm was tested on several challenging real-world sequences; this section presents qualitative and quantitative analyses of the tracking results.

4.1. Experimental Environment and Evaluation Criteria

DFCL was implemented in C++ with the .NET Framework 4.0 in Visual Studio 2010 on an Intel Dual-Core 1.70 GHz CPU with 4 GB RAM. Two metrics, i.e., location error (in pixels) and overlapping rate, are used to evaluate the tracking results quantitatively. The location error is computed as $e = \| c_{G} - c_{T} \|_{2}$, where $c_{G}$ and $c_{T}$ are the centers of the ground truth (either downloaded from a standard database or located manually) and tracking bounding boxes, respectively. The tracking overlapping rate is defined as $s = \operatorname{area}(B_{G} \cap B_{T}) / \operatorname{area}(B_{G} \cup B_{T})$, where $B_{G}$ and $B_{T}$ denote the ground truth and tracking bounding boxes, respectively, and $\operatorname{area}(\cdot)$ is the rectangular area function. A smaller location error and a larger overlapping rate indicate higher accuracy and robustness.
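For reference, both metrics are straightforward to compute; a sketch using (x, y, w, h) bounding boxes:

```python
def location_error(c_gt, c_tr):
    # Centre location error e = ||c_G - c_T||_2 in pixels.
    return float(np.hypot(c_gt[0] - c_tr[0], c_gt[1] - c_tr[1]))

def overlap_rate(b_gt, b_tr):
    # Overlapping rate s = area(B_G & B_T) / area(B_G | B_T)
    # for boxes given as (x, y, w, h).
    x1 = max(b_gt[0], b_tr[0])
    y1 = max(b_gt[1], b_tr[1])
    x2 = min(b_gt[0] + b_gt[2], b_tr[0] + b_tr[2])
    y2 = min(b_gt[1] + b_gt[3], b_tr[1] + b_tr[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b_gt[2] * b_gt[3] + b_tr[2] * b_tr[3] - inter
    return inter / union
```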

4.2. Experimental Results

The performance of DFCL was compared with the state-of-the-art trackers Struck [32], ODFS [33], STC [34], KCF [35], and ROT [36]; the DCF-based trackers MOSSE [13], DSST [14], and fDSST [15]; and the visible-infrared fusion trackers TSKF [23], MVMKF [27], L1-PF [19], JSR [24], and CSR [25]. Figures 2–6 present the experimental results of the test trackers on the challenging visible sequences Biker [37], Campus [37], Car [38], Crossroad [39], Hotkettle [39], Inglassandmobile [39], Labman [40], Pedestrian [41], and Runner [38], as well as their corresponding infrared sequences Biker-ir, Campus-ir, Car-ir, Crossroad-ir, Labman-ir, Hotkettle-ir, Inglassandmobile-ir, Pedestrian-ir, and Runner-ir. Single-sensor trackers were tested separately on the visible and the corresponding infrared sequences, while the visible-infrared fusion trackers obtain their results using information from both. For convenience of presentation, some tracking curves are not shown in their entirety in the figures. Next, the performance of the trackers on each sequence is described in detail.

(a) Sequences Biker and Biker-ir: Biker presents an example of complex background clutter. The target human in the visible sequence encounters similar background disturbance (i.e., bikes), which causes the ODFS, MOSSE, fDSST, TSKF, and MVMKF trackers to drift away from the target. The corresponding infrared sequence Biker-ir provides temperature information that eliminates the background clutter in Biker. However, when the target approaches another person at around Frame #20, Struck, ODFS, STC, MOSSE, TSKF, and MVMKF do not perform well because they cannot distinguish the target from persons with a similar temperature in infrared sequences. Only KCF, ROT, DSST, and our DFCL achieve precise and robust performance on these sequences.

(b) Sequences Campus and Campus-ir: the target in Campus and Campus-ir undergoes background clutter, occlusion, and scale variation. At the beginning of Campus, ODFS, STC, KCF, and ROT lose the target due to background disturbance. Only TSKF and DFCL perform well, while Struck, fDSST, and MVMKF do not achieve accurate results. Thanks to the infrared information provided by Campus-ir, fewer test trackers fail when background clutter occurs, as shown in Figure 2, but Struck, KCF, and ROT mistake another person for the target. As shown in Figure 2, most of the trackers fail, whereas DFCL outperforms the others in most metrics (location accuracy and success rate).

(c) Sequences Car and Car-ir: Car and Car-ir demonstrate the efficiency of DFCL in coping with heavy occlusion. The target car is occluded by lampposts and trees many times, which causes most trackers to fail. Only TSKF, MVMKF, and DFCL are able to handle the occlusion throughout the tracking process in this sequence. As shown in Figure 2, most trackers perform better on Car-ir than on Car because the infrared features help detect the target among similar surrounding background. STC, TSKF, MVMKF, and DFCL are able to handle this problem, with DFCL giving the most accurate result, as shown in Figure 2.

(d) Sequences Crossroad and Crossroad-ir: the target in Crossroad and Crossroad-ir undergoes heavy background clutter while she crosses the road. While the target is passing the road lamp, both ODFS and JSR lose the target. Then, when a car passes by the target, Struck, TSKF, and MVMKF drift away from the target. When the target walks toward the sidewalk, most of the trackers cannot handle the heavy background clutter, but our tracker delivers satisfactory results, as shown in Figures 2–4.

(e) Sequences Hotkettle and Hotkettle-ir: in these sequences, tracking is difficult because of complex, changing background clutter. Most trackers perform better on Hotkettle-ir than on Hotkettle because the temperature divergence makes the hot target more distinct against the cold background. Struck, KCF, DSST, fDSST, and DFCL achieve robust and accurate tracking performance, as shown in Figures 2–4.

(f) Sequences Inglassandmobile and Inglassandmobile-ir: these sequences demonstrate the performance of the 14 trackers under background clutter, illumination changes, and occlusion. As shown in Figure 2, when the illumination changes at around Frame #300, ODFS and fDSST lose the target, and KCF, TSKF, and L1-PF drift slightly away from it. When the target approaches a tree, the background clutter causes most trackers to fail, as can be seen in Figure 2. Our DFCL overcomes these challenges and performs well on these sequences.

(g) Sequences Labman and Labman-ir: the experiments on Labman and Labman-ir evaluate tracking performance under appearance variation, rotation, scale variation, and background clutter. In Labman, when the target man walks into the laboratory, ODFS, STC, and MOSSE lose the target. When the man keeps shaking and turning his head at around Frame #400, KCF, ROT, and DSST fail. Again, most trackers achieve better tracking performance on Labman-ir, as shown in Figure 2.

(h) Sequences Pedestrian and Pedestrian-ir: the target in Pedestrian and Pedestrian-ir undergoes heavy background clutter and occlusion. As shown in Figure 2, the other trackers fail on Pedestrian, whereas our tracker shows satisfactory performance in terms of both accuracy and robustness. The efficient infrared features extracted from Pedestrian-ir ensure the tracking success of Struck, STC, and DFCL, as can be seen in Figures 2–4.

(i) Sequences Runner and Runner-ir: Runner and Runner-ir contain examples of heavy occlusion, abrupt movement, and scale variation. The target running man is occluded by lampposts, trees, a stone tablet, and bushes many times, resulting in tracking failures for most trackers. The abrupt movement and scale variation also cause many trackers to drift away from the target in both Runner and Runner-ir, as shown in Figure 2. Once again, our DFCL is able to overcome these problems and achieve good performance.

Figures 5 and 6 quantify the performance in terms of average location error (pixels) and success rate. The success rate is defined as the fraction of frames in which tracking succeeds, where a frame counts as a success if the overlapping rate exceeds 0.5 [33]. A smaller average location error and a larger success rate indicate higher accuracy and robustness. Figures 5 and 6 show that DFCL performs satisfactorily on most of the tracking sequences.
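The success rate follows directly from the overlap metric sketched in Section 4.1; a one-function sketch using overlap_rate from above:

```python
def success_rate(gt_boxes, tr_boxes, thr=0.5):
    # Fraction of frames whose overlapping rate exceeds thr (0.5 in [33]).
    ious = [overlap_rate(g, t) for g, t in zip(gt_boxes, tr_boxes)]
    return sum(i > thr for i in ious) / len(ious)
```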

To validate the effectiveness of the discriminative filter selection model of DFCL, we compared DCL (the proposed DFCL without the fusion learning model) with DFCL and the original DCF tracker MOSSE on visible sequences. The performances shown in Figure 7 demonstrate the efficiency of the discriminative filter selection model, especially on the sequences with background clutter, i.e., Biker, Hotkettle, Inglassandmobile, and Pedestrian.

5. Conclusion

Discriminative correlation filter- (DCF-) based trackers are computationally efficient and more robust than most other state-of-the-art trackers in challenging tracking tasks, making them especially suitable for a variety of real-time applications. However, most DCF-based trackers suffer from low accuracy due to the lack of diversity information extracted from a single type of spectral image (the visible spectrum). Fusion of visible and infrared sensors, a typical form of multisensor cooperation, provides complementary features and consistently helps distinguish the target from the background efficiently in visual tracking. For these reasons, this paper proposes a discriminative fusion correlation learning model to improve DCF-based tracking performance by combining multiple features from visible and infrared imaging sensors. The proposed fusion learning filters are obtained via late fusion with early estimation, in which the filters are weighted by their performance to improve the flexibility of fusion. Moreover, the proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning. Numerous real-world video sequences were used to test DFCL and other state-of-the-art algorithms; only representative videos are presented here. Experimental results demonstrate that DFCL is highly accurate and robust.

Data Availability

The data used to support the findings of this study were supplied by China University of Mining and Technology under license and so cannot be made freely available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Natural Science Foundation of Jiangsu Province (BK20180640, BK20150204), Research Development Programme of Jiangsu Province (BE2015040), the State Key Research Development Program (2016YFC0801403), and the National Natural Science Foundation of China (51504214, 51504255, 51734009, 61771417, and 61873246).