Volume 13 Number 6
December 2016
Cite as: Yi Song, Shu-Xiao Li, Cheng-Fei Zhu and Hong-Xing Chang. Object Tracking with Dual Field-of-view Switching in Aerial Videos. International Journal of Automation and Computing, vol. 13, no. 6, pp. 565-573, 2016. doi: 10.1007/s11633-016-0949-7

Object Tracking with Dual Field-of-view Switching in Aerial Videos

Author Biography:
  • Yi Song graduated from Hunan University, China in 2010. He is now a Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences, China.
    His research interests include computer vision and image analysis.
    E-mail: yi.song@ia.ac.cn
    ORCID iD: 0000-0003-0932-8806

    Shu-Xiao Li graduated from Xi'an Jiaotong University, China in 2003. He received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2008. He is currently an associate professor at CASIA.
    His research interests include computer vision, image processing, and their applications.
    E-mail: shuxiao.li@ia.ac.cn

    Hong-Xing Chang graduated from Beihang University, China in 1986. He received the M.Sc. degree from Beihang University, China in 1991. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences, China.
    His research interests include computer vision, integrated information processing, and their applications.
    E-mail: hongxing.chang@ia.ac.cn

  • Corresponding author: Cheng-Fei Zhu graduated from the University of Science and Technology of China in 2004. He received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2010. He is currently an assistant professor at CASIA.
    His research interests include computer vision and image processing.
    E-mail: chengfei.zhu@ia.ac.cn (Corresponding author)
    ORCID iD: 0000-0002-6484-7089
  • Received: 2014-09-10
    Published Online: 2016-02-01
Fund Project:

This work was supported by the National Natural Science Foundation of China (Nos. 61175032, 61302154 and 61304096).




Abstract: Visual object tracking plays an important role in intelligent aerial surveillance by unmanned aerial vehicles (UAV). In ordinary applications, aerial videos are captured by cameras with a fixed-focus lens or a zoom lens, so the field-of-view (FOV) of the camera is fixed or changes smoothly. In this paper, a special application of visual tracking in aerial videos captured by a dual FOV camera is introduced, which differs from ordinary applications in that the camera quickly switches its FOV during capturing. Firstly, the tracking process with the dual FOV camera is analyzed, and it is concluded that the critical part of the whole process is the accurate tracking of the target at the moment of FOV switching. Then, a cascade mean shift tracker is proposed to deal with target tracking under FOV switching. The tracker utilizes kernels with multiple bandwidths to execute mean shift locating, which is able to deal with the abrupt motion of the target caused by FOV switching. The target is represented by the background weighted histogram to make it well distinguished from the background, and a modification is made to the weight value in the mean shift process to accelerate the convergence of the tracker. Experimental results show that our tracker performs well in both accuracy and efficiency. To the best of our knowledge, this paper is the first attempt to apply a visual object tracking method to the situation where the FOV of the camera switches in aerial videos.

  • Aerial surveillance by video cameras has been a significant task for unmanned aerial vehicles (UAV)[1-3]. As a low-level processing task in aerial videos, robust visual tracking of the target provides an efficient guide for UAVs to handle more complicated tasks such as object recognition[4, 5] and event detection[6]. Object tracking in aerial videos faces many challenges, such as low spatial resolution, camera shaking, illumination change and appearance change of the background, and researchers have developed various approaches to handle these problems[7-10].

    In aerial surveillance for UAV, most raw videos or image sequences are captured by normal pan-tilt-zoom (PTZ) video cameras. A large amount of research concerns processing the videos captured by cameras with zoom lenses[11, 12]. With this kind of camera, multiple moving target detection[10] and multiple target tracking[6, 13] in aerial surveillance imagery of UAV have recently been studied to help the system accomplish higher level tasks. In contrast, our study focuses on the field-of-view (FOV)-switching camera. This kind of camera uses a lens system containing two or three different lenses. By switching the lens, the camera achieves an abrupt change of focal length, so its FOV switches quickly. The camera has the advantage that the time for switching its lens is usually less than one second, which is about three to four times faster than zooming a zoom lens camera to acquire the same resolution of the target. As different resolutions of the target can be rapidly shown to the observer by switching the FOV, the camera is usually used in cases where a more detailed view of the object is required to appear quickly on the screen during observation. It can also provide extra information about the target for the task of aerial surveillance. In the wide FOV, the observer can seek and assign the target of interest, but the target is not identifiable due to the low resolution. As the camera switches to the narrow FOV, the target is magnified and a higher resolution of the target is acquired, so that it can be further confirmed and recognized. This kind of camera can be used in both civilian and military applications. For example, in the task of low altitude surveillance by UAV, the camera is applied for fast confirmation and recognition of targets on the ground. A few works have studied the design of the FOV-switching camera[14, 15]. However, the processing of the videos captured by this camera is rarely studied. In this work, we are interested in online visual tracking under this kind of camera, as target tracking is significant for the automatic aerial surveillance system. To the best of our knowledge, our work is the first attempt to deal with visual tracking in aerial surveillance under the FOV-switching camera.

    As mentioned, researchers have made progress on multiple target detection and tracking in aerial surveillance imagery of UAV. Shen et al.[10] combined spatial and temporal saliency to detect moving objects in surveillance videos. Prokaj et al.[6] studied wide area aerial imagery and proposed learning motion patterns to track multiple targets. In their later work[13], they combined a background subtraction tracker and a state regressor to achieve multiple target tracking in aerial imagery. These studies provide valuable ideas and methods for object tracking in aerial videos, although our focus is on single target tracking.

    For general single object tracking, numerous tracking methods have appeared to solve tracking problems in videos[11, 12]. Most of these methods may not be able to deal with the abrupt location change or scale change of the target when the FOV of the camera quickly switches. Furthermore, for object tracking in aerial videos, the tracker should handle additional problems such as the small size of the object, camera shaking, and the need for real-time processing. Methods such as the particle filter using importance sampling[16] or Markov chain Monte Carlo sampling[17] estimate the posterior probability of the target distribution through particle samples to locate the target. They may tackle abrupt motions of the target by proposing a large standard deviation for the distribution, but they incur a large computational burden to obtain an accurate state. The mean shift tracker[18] has achieved great success in real-time object tracking thanks to its notable efficiency. The tracker locates the target by iteratively shifting its kernel to find the most similar candidate. Its simplicity and robustness account for its good performance in applications where few resources are available, such as onboard implementation. However, the original mean shift tracker has the drawback that it finds the best state only locally.

    Our work focuses on online real-time object tracking under the switching of the dual FOV camera. The FOV of the camera switches from wide to narrow during image capturing. In this situation, the two key requirements for the tracker are accurate global locating and real-time processing. The main contribution of this paper lies in the following three points.

    1) The tracking process on the image sequence captured by the dual FOV camera is analyzed, and it is concluded that the critical part of the process is the accurate tracking of the target at the moment when the FOV switches. The analysis helps to understand the capturing process of the dual FOV camera in order to develop the tracking algorithm.

    2) A cascade mean shift tracker is developed to locate the object in the case of FOV switching. The tracker yields good results in globally locating the target, and can deal with the abrupt motion of the target caused by the FOV switching.

    3) A simulated dataset is generated from a public aerial surveillance dataset, on which the tracker is evaluated under FOV switching by comparison with other popular trackers.

    The remainder of the paper is organized as follows. Section 2 analyzes the target tracking over the image sequence captured by the dual FOV camera. Section 3 gives the details of the cascade mean shift tracking method to deal with the FOV switching. In Section 4, the experimental results are presented. Conclusions are given in Section 5.

  • A simple illustration of the image sequence captured by the dual FOV camera is shown in Fig. 1. Generally, a capturing cycle can be divided into three stages. In the first stage, the camera stays in wide FOV (WFOV) mode for capturing, illustrated as from frame 1 to frame $t-1$. In this stage, the observer searches for the target to be tracked and locates it by some detection mechanism. Once the target is located in the center of the WFOV and is locked for further observation, the camera switches into narrow FOV (NFOV) mode, which constitutes the second stage. The switching of the FOV magnifies the object for better observation and identification. Note that a signal is sent to the system at time t to inform it that the camera is about to switch, which means that the beginning moment of the second stage is known. It takes about 0.5-1 s for the optical system in the camera to complete the switch. During the switching, the sensor of the camera keeps working, but the captured images become distorted or unpredictable. The time length of the second stage is set to 1 s to make sure that the FOV switching is finished within this stage, so that frame t and frame $t+l$ are two normal images. The third stage is the remaining stage, i.e., from frame $t+l+1$ to frame M. In this stage, the camera stays in NFOV mode until the end of the capturing cycle. If the target is exactly what the observer is looking for, it is continuously tracked to the end of the capturing process. On the contrary, if the object does not interest the observer, or the target gets out of view, the camera switches back into WFOV mode to observe other objects or to relocate the same target. In this case, another capturing cycle of the three stages begins.

    Figure 1.  Image sequence captured by dual FOV camera

    In fact, the accurate tracking of the target in the second stage determines the successful tracking of the whole process. As the FOV of the camera stays the same during the first and third stages, the image sequence in these two stages can be handled by an ordinary tracking method. As for the second stage, two challenges remain to be addressed: 1) Suppose the last complete image captured in the WFOV and the first complete image captured in the NFOV are denoted as the FFOS (former frame of the switching) and the LFOS (latter frame of the switching), respectively. How to exclude transition images and identify the FFOS and the LFOS should be studied.

    2) During the switching from WFOV to NFOV, the scale of the target is enlarged abruptly from the FFOS to the LFOS. Moreover, the location of the target changes abruptly because of the time delay during the FOV switching. Thus, tracking the object from the FFOS to the LFOS imposes different restrictions from ordinary tracking methods.

    This paper mainly addresses the second challenge stated above, whereas the first is left for future research. Technically, one may use online scene cut detection methods[19] to identify the FFOS and the LFOS. Since these two frames are basic requirements for the study, we simply set the first image and the last image of the sequence in the second stage as the FFOS and the LFOS, respectively. This setting includes a larger time delay, thus introducing more difficulties. As for the scale change, the focal lengths of the dual FOV camera can be obtained before the observation task, which means that the scaling of the object between the FFOS and the LFOS can be estimated.

    Our method is devoted to tracking the target in the second stage. The target in the FFOS is assigned with a bounding box, and the objective of our tracker is to locate the target in the LFOS. The main problem imposed on the tracking is the abrupt change of the object location, and we propose to use a cascade mean shift method to solve it.

  • From the analysis in Section 2, the accurate locating of the object at the moment of the FOV switching is critical for the tracking. In this section, the proposed cascade mean shift tracker, which is intended to tackle the abrupt motion of the target between the FFOS and the LFOS, is described in detail.

    The method is based on the annealed mean shift framework[20], but it is improved for the situation of FOV switching. Firstly, the LFOS is preprocessed to reduce the computational burden of the mean shift procedure. Secondly, a background weighted histogram is utilized to model the target to obtain robust tracking. Thirdly, a modification to the weight value in the shifting procedure is proposed to accelerate the convergence. The procedure of our method is shown in Fig. 2.

    Figure 2.  The procedure of our method. (a) The target and its bounding rectangle in the FFOS. (b) The LFOS. (c) The cascade mean shift tracking in the preprocessed LFOS (PLFOS). The initial positions and the windows of each mean shift tracker are shown in different colors. The red window on the right is the initial position of the cascade tracker, which is the same position of the target in the FFOS, and the red window on the left is the convergence position of the tracker. (d) The final location of the target obtained by the tracker in the LFOS. In the figure, the scaling factor $S=3.0$ for FOV switching, and the number of the trackers $N=3$ in the cascade tracker

  • With the given information of the focal lengths in the WFOV mode and the NFOV mode of the camera, the scaling of the object between the FFOS and the LFOS can be estimated by $S=\frac{f_n}{f_w}$, where $f_w$ and $f_n$ denote the focal lengths used in WFOV mode and NFOV mode, respectively.

    By scaling, the image resolution of the LFOS can be adjusted to approximately that of the FFOS. This is realized by down-sampling the LFOS by the factor S, while the 2D coordinate of the image center remains unchanged. With this preprocessing, the image region for further processing is greatly reduced. The cascade mean shift procedure applied to the preprocessed LFOS (PLFOS) is faster than that applied to the LFOS, since it uses smaller windows for shifting.
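    As a minimal sketch of this preprocessing step (assuming OpenCV; the paper does not specify the resampling filter, so area interpolation is our choice):

```python
import cv2

def preprocess_lfos(lfos, S):
    """Down-sample the LFOS by the factor S = f_n / f_w so that the
    target in the PLFOS has roughly the same resolution as in the
    FFOS. The resampling filter is our choice; the paper does not
    specify one."""
    h, w = lfos.shape[:2]
    return cv2.resize(lfos, (int(round(w / S)), int(round(h / S))),
                      interpolation=cv2.INTER_AREA)
```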

  • The standard mean shift tracker[18] uses the kernel weighted color histogram to model the target. However, this may reduce tracking performance when the color of the target is similar to that of the background. We propose to use the background weighted histogram[21] to integrate the background information into the target model, which better separates the target from the background.

    Typically, the kernel weighted color histogram $\hat{{ q}}=\{ \hat{q}_u \}_{u=1, \cdots, m}$ is a histogram with m bins weighted by a spatial kernel:

    \begin{equation} {\hat{q}_u} = C\sum\limits_{i = 1}^n {k\left( {{{\left\| {{ x}_i^ * } \right\|}^2}} \right)\delta [b({ x}_i^*) - u]} \label{equ:ColorHistogram} \end{equation}

    (1)

    where $k(\cdot)$ denotes the kernel function, $\{{ x}_i^*\}_{i=1, \cdots, n}$ denotes the normalized locations of the pixels inside the spatial window, $b(\cdot)$ gives the corresponding bin of the pixel in position ${ x}_i^*$, $\delta(\cdot)$ is the Kronecker delta function, and C is a normalization constant. The background weighted color histogram of the object, $\hat{{ q}}_b=\{ \hat{q}_u^b \}_{u=1, \cdots, m}$, can be calculated by

    \begin{equation} {\hat{q}_u^b} = C_b\sum\limits_{i = 1}^n {{\beta}_uk\left( {{{\left\| {{ x}_i^ * } \right\|}^2}} \right)\delta [b({ x}_i^*) - u]} = \hat{\beta}_u\hat{q}_u \end{equation}

    (2)

    where Cb is a normalization constant and $\hat{\beta}_u = \frac{\beta_u C_b}{C}$. The weight value ${\beta}_u$ is calculated using the ratio of the background histogram and the object histogram for each bin:

    \begin{equation} {\beta}_u = \frac{1}{1+L(u)} = \frac{1}{1+\frac{\max(h_{bk}(u), \epsilon)}{\max(h_{ob}(u), \epsilon)}} \end{equation}

    (3)

    where the object histogram $h_{ob}$ is calculated within the object bounding box in the FFOS, and the background histogram $h_{bk}$ is calculated within a window around the object. Neither of them is weighted by a spatial kernel. In the equation, the ratio value $L(u)$ indicates the likelihood that the color of bin u belongs to the background.

    Using the background weighted histogram makes the model describe the object more accurately. If the color feature in bin u has a low probability of belonging to the background (which indicates that $h_{bk}(u)$ is small), then the weight value $\beta_u$ is high. This means that the model increases the effect of the object color distribution, while decreasing the effect of the background color distribution. Compared with the kernel weighted color histogram, the background weighted histogram enhances the contrast between the object and the background, which discriminates the object from the background well. An analysis of the background weighted histogram as a target model in mean shift tracking can be found in [22].
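    A minimal sketch of the model construction in (1)-(3), assuming RGB input and the $16\times16\times16$ binning used in the experiments (helper names and data layout are our assumptions):

```python
import numpy as np

def bin_index(pixel, bins=16):
    """Map an RGB pixel to its bin u in a bins^3 color histogram."""
    r, g, b = np.asarray(pixel, dtype=int) // (256 // bins)
    return int((r * bins + g) * bins + b)

def background_weighted_histogram(obj_patch, bk_patch, kernel, eps=1e-6):
    """Sketch of (1)-(3): the kernel weighted object histogram q_u is
    re-weighted by beta_u, computed from the unweighted object and
    background histograms. `kernel` holds k(||x_i*||^2) per pixel of
    obj_patch; bk_patch is the surrounding background window."""
    m = 16 ** 3
    q, h_ob, h_bk = np.zeros(m), np.zeros(m), np.zeros(m)
    for (y, x), k_val in np.ndenumerate(kernel):
        u = bin_index(obj_patch[y, x])
        q[u] += k_val        # kernel weighted histogram, as in (1)
        h_ob[u] += 1.0       # unweighted object histogram
    for px in bk_patch.reshape(-1, 3):
        h_bk[bin_index(px)] += 1.0   # unweighted background histogram
    beta = 1.0 / (1.0 + np.maximum(h_bk, eps) / np.maximum(h_ob, eps))  # (3)
    q_b = beta * q           # (2), up to the normalization constant
    return q_b / max(q_b.sum(), eps)
```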

  • The mean shift algorithm requires that the initial window overlaps with the target, and it iteratively shifts the window to the local maximum of the similarity function to obtain the target location. Unfortunately, owing to the time delay during the FOV switching, the displacement of the target between the FFOS and the PLFOS is usually very large, which means that the initial shift window from the FFOS may be far away from the target in the PLFOS. In this case, the standard mean shift algorithm is not guaranteed to converge to the target location.

    The proposed cascade mean shift tracker is able to tackle the abrupt motion of the object, since it uses kernels with multiple bandwidths for locating. The idea is that the location of the object can be found by a coarse-to-fine mechanism.

    For constructing the cascade mean shift tracker, we generate a set of kernels $\{k_{\tiny{{ h}_r}}(\cdot)\;|\;r = 0, 1, 2, \cdots, N-1\}$ with bandwidths ${ h}_r$. The sequence of bandwidths ${ h}_r\,(r=0, 1, 2, \cdots, N-1)$ increases with r, so that ${ h}_0<{ h}_1<\cdots<{ h}_{N-1}$, where ${ h}_0=(h_x^0, h_y^0)$ denotes the bandwidth of the kernel used in the FFOS. The value of ${ h}_0$ equals half the size of the target window in the FFOS. In our method, Gaussian kernels are adopted for the tracker, as proposed in [20]. Then, each kernel is integrated into a mean shift tracker, generating N trackers.

    The cascade tracking process starts by employing the mean shift tracker with the kernel $k_{\tiny{{ h}_{N-1}}}$. During the process, when the tracker with the kernel $k_{\tiny{{ h}_r}}$ converges to the position ${ y}^{(r)}$, the next tracker with $k_{\tiny{{ h}_{r-1}}}$ starts to work using ${ y}^{(r)}$ as its initial position, converging to the position ${ y}^{(r-1)}$. This process ends when the tracker with the kernel $k_{\tiny{{ h}_0}}$ converges to the position ${ y}^{(0)}$, which is used as the final output of the cascade tracker, indicating the target location in the PLFOS.
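    A sketch of this cascade under the above definitions; `mean_shift` is the single-bandwidth tracker sketched after (4) and (5) below, and the bandwidth schedule follows the experimental settings (${ h}_2=3{ h}_0$, ${ h}_1=\frac{{ h}_0+{ h}_2}{2}$), linearly interpolated for general N as our assumption:

```python
import numpy as np

def cascade_mean_shift(plfos, q_b, y_ffos, h0, N=3):
    """Sketch of the cascade: run mean shift with decreasing
    bandwidths h_{N-1} > ... > h_0, feeding each convergence point
    to the next, finer tracker. `mean_shift` is the single-bandwidth
    tracker sketched below."""
    h0 = np.asarray(h0, dtype=float)
    bandwidths = [h0 + r * 2.0 * h0 / (N - 1) for r in range(N)]  # h0 .. 3*h0
    y = np.asarray(y_ffos, dtype=float)    # start at the FFOS target position
    for h in reversed(bandwidths):         # coarse to fine: h_{N-1} first
        y = mean_shift(plfos, q_b, y, h)
    return y                               # target location in the PLFOS
```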

    The tracking begins at the location of the target in the FFOS. To obtain global convergence, the starting bandwidth ${ h}_{N-1}$ should be large enough for the kernel window to overlap with the real target. In the tracking process, the candidate histogram is calculated using (\ref{equ:ColorHistogram}) within the window centered at ${ y}$, denoted as $\hat{{ p}}({ y})$. The similarity between the target histogram and the candidate histogram is measured by the Bhattacharyya coefficient, $ \rho\left[\hat{{ p}}({ y}), \hat{{ q}}_b \right]= \sum\nolimits_{u = 1}^m {\sqrt {{\hat{p}_u}({{ y}}){\hat{q}_u^b}} } $. By maximizing this coefficient, the mean shift algorithm converges to the target location by iteratively shifting from the old candidate location $\hat{{ y}}_0$ to the new location $\hat{{ y}}$

    \begin{equation} \hat{{ y}} = \frac{{\sum\limits_{i = 1}^n {{{{ x}}_i}{w_i} g_{\tiny{{ h}_r}}\left(\left\| \frac{{{ {{{\hat{{ y}}}_0} - {{{ x}}_i}}}}} {{ h}_r} \right\|^2 \right)} }}{{\sum\limits_{i = 1}^n {{w_i} g_{\tiny{{ h}_r}}\left(\left\| \frac{{{ {{{\hat{{ y}}}_0} - {{{ x}}_i}}}}} {{ h}_r}\right\|^2 \right)} }} \label{equ:MeanShift} \end{equation}

    (4)

    where $g_{\tiny{{ h}_r}}(x)=-k'_{\tiny{{ h}_r}}(x)$, and the weight $w_i$ is computed by

    \begin{equation} {w_i} = \sum\limits_{u = 1}^m {\sqrt {\frac{{{\hat{q}_u^b}}}{{{\hat{p}_u}({{\hat{{ y}}}_0})}}} } \delta [b({{{ x}}_i}) - u]. \label{equ:Weight} \end{equation}

    (5)

    The ratio $\frac{\hat{q}_u^b}{ \hat{p}_u({{\hat{{ y}}}_0})}$ determines the weight value for each position within the window, and thus indicates the direction of the shift vector. From (4) and (5), it is easy to see that the vector shifts towards positions where the ratio value is large. Since the kernels used for tracking are no smaller than the original kernel, a large number of background pixels are involved in calculating the candidate histograms. Therefore, the ratio values calculated at target pixels become much higher than those calculated at background pixels, which guarantees the convergence of the tracker to the target position.
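    A sketch of one tracker of the cascade, implementing (4) and (5) with Gaussian kernels (so the profile derivative g is again Gaussian); it reuses `bin_index` from the histogram sketch above, and boundary handling is simplified:

```python
import numpy as np

def window_pixels(img, y, h):
    """Pixel coordinates (x, y) in the window of half-size h = (h_x, h_y)
    centered at y, clipped to the image bounds."""
    x0, x1 = max(int(y[0] - h[0]), 0), min(int(y[0] + h[0]), img.shape[1] - 1)
    y0, y1 = max(int(y[1] - h[1]), 0), min(int(y[1] + h[1]), img.shape[0] - 1)
    gx, gy = np.meshgrid(np.arange(x0, x1 + 1), np.arange(y0, y1 + 1))
    return np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)

def candidate_histogram(img, coords, y, h):
    """Kernel weighted candidate histogram p(y), as in (1), with a
    Gaussian kernel; reuses bin_index from the histogram sketch."""
    p = np.zeros(16 ** 3)
    k = np.exp(-0.5 * np.sum(((coords - y) / h) ** 2, axis=1))
    for (cx, cy), k_val in zip(coords.astype(int), k):
        p[bin_index(img[cy, cx])] += k_val
    return p / max(p.sum(), 1e-12)

def mean_shift(img, q_b, y0, h, n_iter=20, tol=0.5):
    """One tracker of the cascade: iterate the update (4) with the
    weights (5) until the shift falls below tol."""
    y, h = np.asarray(y0, dtype=float), np.asarray(h, dtype=float)
    for _ in range(n_iter):
        coords = window_pixels(img, y, h)
        p = candidate_histogram(img, coords, y, h)
        u = np.array([bin_index(img[cy, cx]) for cx, cy in coords.astype(int)])
        w = np.sqrt(q_b[u] / np.maximum(p[u], 1e-12))               # weight (5)
        g = np.exp(-0.5 * np.sum(((y - coords) / h) ** 2, axis=1))  # Gaussian g
        y_new = (coords * (w * g)[:, None]).sum(0) / max((w * g).sum(), 1e-12)
        if np.linalg.norm(y_new - y) < tol:
            return y_new                                            # converged
        y = y_new
    return y
```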

    In our method, we expand the contrast between the target and the background, so a modified form of the weight value $w_i$ is proposed:

    \begin{align} {w_i} =& \sum\limits_{u = 1}^m { \left(\frac{{{\hat{q}_u^b}}}{{{\hat{p}_u}({{\hat{{ y}}}_0})}}\right)^2 } \delta [b({{{ x}}_i}) - u] \notag\\ =& \sum\limits_{u = 1}^m { \frac{\hat{\beta}_u^2\hat{q}_u^2}{{{\hat{p}_u^2}({{\hat{{ y}}}_0})}} } \delta [b({{{ x}}_i}) - u]. \label{equ:WeightModified} \end{align}

    (6)

    As the weight value at each position is amplified, the convergence is faster than with (5). This modified version is also reported to yield better performance for tackling large displacements of the object[23].
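    In the mean shift sketch above, switching from (5) to (6) is a one-line change:

```python
# Modified weight (6): square the ratio instead of taking its square
# root, amplifying the contrast between target and background pixels.
w = (q_b[u] / np.maximum(p[u], 1e-12)) ** 2
```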

  • It is important to determine whether the target still exists in the FOV after the camera switches into NFOV mode. For the tracking, a determination should be made as to whether the target is lost in the LFOS. Let ${ y}_p^{(0)}$ denote the final position obtained by the cascade mean shift tracker in the PLFOS. The kernel weighted color histogram of the candidate is calculated at ${ y}_p^{(0)}$ with the bandwidth ${ h}_0$, and its Bhattacharyya coefficient with the target histogram is computed. If the value is below a given threshold $\rho_t$, the candidate is abandoned and the algorithm declares the target lost. Otherwise, the position ${ y}_p^{(0)}$ is mapped to the corresponding position ${ y}_l^{(0)}$ in the LFOS and accepted as the location of the target.
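    A sketch of this test, reusing the helpers above; the mapping back to the LFOS assumes a plain resize in the preprocessing, so PLFOS coordinates scale by S:

```python
def confirm_target(plfos, q_b, y_p0, h0, rho_t=0.5, S=3.0):
    """Sketch of the lost-target test: compare the candidate at the
    converged position with the target model via the Bhattacharyya
    coefficient, then map back to LFOS coordinates if accepted."""
    coords = window_pixels(plfos, y_p0, h0)
    p = candidate_histogram(plfos, coords, y_p0, h0)
    rho = np.sum(np.sqrt(p * q_b))       # Bhattacharyya coefficient
    if rho < rho_t:
        return None                      # target declared lost
    return np.asarray(y_p0, dtype=float) * S   # position y_l in the LFOS
```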

  • Good efficiency is critical for a visual tracking algorithm. The main computational cost of our proposed cascade mean shift tracker is the calculation of the kernel weighted candidate histogram for obtaining the weight value $w_i$ in (5), since it must be computed within kernel windows of different sizes in each iteration. Let the number of cascade kernels be denoted by N, and the iteration number of each mean shift tracker by $n_r\,(r=0, 1, 2, \cdots, N-1)$; then the computational time for calculating candidate histograms in each frame is about

    \begin{equation} T_N=\sum\limits_{r = 0}^{N-1} n_r\times2{h_x^r}\times2{h_y^r}\times T_c \end{equation}

    (7)

    where $h_x^r$ and $h_y^r$ are the bandwidths of the kernel $k_{\tiny{{ h}_ r}}$, and $T_c$ is the time cost for a pixel to be counted into the histogram. Since the kernel weights are precomputed, the time cost $T_c$ is a constant.

    As the LFOS is preprocessed, the time cost of processing each frame is $S\times S$ times smaller than directly processing the original LFOS, since each kernel bandwidth is S times smaller in each dimension, where S is the scaling factor between the FFOS and the LFOS. The iteration number $n_r$ for each mean shift tracker to converge is between 2 and 3 on average. In the implementation, N is typically set to 3, and ${ h}_ r = (h_x^r, h_y^r)\,(r=0, 1, 2)$ is properly set to obtain global convergence of the tracker, see Section 4. The preprocessing procedure, the small number of iterations needed for convergence, and the default settings of the algorithm guarantee the real-time processing ability of the cascade mean shift tracker.
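    As a rough worked example of (7) (our estimate under the settings of Section 4, not a reported figure): take ${ h}_0=(18, 16)$, i.e., half the average $36\times32$ target size in EgTest03, with $N=3$, ${ h}_1=(36, 32)$, ${ h}_2=(54, 48)$ and $n_r=3$ iterations per tracker. Then $T_3 \approx 3\times(36\times32+72\times64+108\times96)\times T_c \approx 4.8\times10^4\, T_c$ per frame on the PLFOS, roughly $S\times S=9$ times less than the same computation on the original LFOS.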

  • The tracking method is evaluated on a simulated dataset. Five image sequences from the VIVID dataset[24] (i.e., EgTest01-EgTest05) are processed to generate five sets of image pairs representing the FOV switching of a dual FOV camera. The process for an original sequence $\{I_n\}_{n=1, 2, \cdots, N}$ is as follows. Firstly, each image in the sequence is magnified by the scaling factor S to generate a larger image, using bicubic interpolation for each pixel value. Then, the center region of each magnified image, with the same size as the original image, is clipped out, generating a new image sequence $\{I'_n\}_{n=1, 2, \cdots, N}$. Finally, images captured in WFOV mode are represented by a subsequence $\{I_n\}_{n=1, \cdots, t}$ of $\{I_n\}$, and images captured in NFOV mode are represented by a subsequence $\{I'_n\}_{n=t+l, \cdots, N}$ of $\{I'_n\}$, with the FOV switching at $I_t$. In this way, the FFOS is represented by $I_t$, and the LFOS is represented by $I'_{t+l}$, so that the set of image pairs $\{(I_t, I'_{t+l})\,|\,t=1, 2, \cdots, N-l\}$ covers all the possible FOV switching moments on the timeline. The five sets of image pairs generated from the five sequences constitute a simulated dataset, which can be used to evaluate tracking performance under dual FOV switching. To obtain the simulated dataset for our experiments, the scaling factor S is set to 3.0, and the frame delay l is set to 20.
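    A sketch of this pair generation (assuming OpenCV; `seq` is the list of original frames):

```python
import cv2

def make_image_pair(seq, t, S=3.0, l=20):
    """Sketch of the simulated FOV switch: the FFOS is the original
    frame I_t; the LFOS is frame I_{t+l}, magnified by S with bicubic
    interpolation and center-cropped back to the original size."""
    ffos = seq[t]
    h, w = ffos.shape[:2]
    big = cv2.resize(seq[t + l], None, fx=S, fy=S,
                     interpolation=cv2.INTER_CUBIC)
    y0, x0 = (big.shape[0] - h) // 2, (big.shape[1] - w) // 2
    return ffos, big[y0:y0 + h, x0:x0 + w]
```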

    Examples of the target in the FFOS and the corresponding LFOS are shown in Fig. 3. The total number of image pairs used for evaluation is 8 493. The five sets of image pairs present several difficulties for tracking. In the image sequence EgTest01, the target goes through illumination change, and sometimes passes by similar objects. In EgTest02, the camera shakes a lot during image capturing, causing a large displacement of the object. In EgTest03, similar object distraction and partial occlusion of the target exist. In EgTest04, the camera defocuses at times, and some frames were dropped, which also causes abrupt motion. In EgTest05, the target is sometimes fully occluded, and it goes through severe illumination change as well. Moreover, as there is a time delay during the FOV switching, the target assigned in the FFOS may disappear in the LFOS. These difficulties may cause tracking failure in the LFOS.

    Figure 3.  Examples of image pairs generated by (a) EgTest01 (b) EgTest02 (c) EgTest03 (d) EgTest04 (e) EgTest05. The left image of each pair indicates the FFOS, and the right one indicates the corresponding LFOS. The target in the FFOS and the groundtruth in the LFOS are bounded in a red rectangle

  • For measuring the tracking accuracy of the trackers, we use the F-measure proposed in [17]. The area of the tracking result rectangle in the image is denoted as $A(\Omega_X)$, and the area of the groundtruth rectangle is denoted as $A(\Omega_G)$. The precision P and recall R of the tracking are calculated as

    \begin{equation} P = \frac{A(\Omega_X\cap\Omega_G)}{A(\Omega_X)}, \quad R = \frac{A(\Omega_X\cap\Omega_G)}{A(\Omega_G)} \end{equation}

    (8)

    where $A(\Omega_X\cap\Omega_G)$ denotes the area of the overlap between the tracking result and the groundtruth. The F-measure is computed as $F=\frac{2\times P\times R}{P+R}$, the harmonic mean of P and R. A high value of F indicates that both precision and recall are high, i.e., high-quality tracking.

    The groundtruth of the target is manually labeled for each image. If the target in the image is fully occluded or does not exist, the groundtruth area is zero for this image. In this case, if the tracking algorithm declares the target lost, the F-measure for this image is 1.0; otherwise it is 0.0.
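    A sketch of this measure with the lost-target convention folded in (rectangles as (x, y, w, h) tuples are our encoding choice):

```python
def f_measure(rect_x, rect_g):
    """Sketch of (8) with the lost-target convention: rectangles are
    (x, y, w, h) tuples, and None encodes 'no target in the image'
    (groundtruth) or 'target declared lost' (tracker output)."""
    if rect_x is None or rect_g is None:
        return 1.0 if rect_x is None and rect_g is None else 0.0
    ix = max(0.0, min(rect_x[0] + rect_x[2], rect_g[0] + rect_g[2])
                  - max(rect_x[0], rect_g[0]))
    iy = max(0.0, min(rect_x[1] + rect_x[3], rect_g[1] + rect_g[3])
                  - max(rect_x[1], rect_g[1]))
    inter = ix * iy                        # A(Omega_X intersect Omega_G)
    p = inter / (rect_x[2] * rect_x[3])    # precision P
    r = inter / (rect_g[2] * rect_g[3])    # recall R
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```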

  • In order to evaluate the performance of our tracking method under FOV switching, a comparison is made between the proposed cascade mean shift tracker (CMS) and six other trackers on the five sets of image pairs. The six trackers used for comparison are the standard mean shift tracker (MS)[18], the annealed mean shift tracker (AMS)[20], two particle filter trackers (PF40, PF250)[16], the Wang-Landau Monte Carlo (WLMC) tracker[17] and the N-Fold Wang-Landau (NFWL)-based tracker[17]. PF40 is the particle filter tracker with standard deviation 40 for the sample distribution, and PF250 is the one with standard deviation 250. The WLMC tracker and the NFWL tracker are able to handle the abrupt motion and scale change of the target between frames[17].

    In the implementation of our tracker, the histograms are calculated in the RGB color space with $16\times16\times16$ bins. For all the experiments, three mean shift trackers constitute the cascade tracker, i.e., $N=3$. The bandwidths of the kernels are set as ${ h}_2 = 3{ h}_0$, ${ h}_1 = \frac{{ h}_0+{ h}_2}{2}$ for all five sets except the set generated from EgTest02, where ${ h}_2 = 6{ h}_0$ to handle the large displacement of the target caused by camera shaking. For the other six trackers, the appearance model of the target is represented by the kernel weighted color histogram. For the MS tracker, we enlarge the window size of the target in the FFOS by the scaling factor S to obtain the kernel size for shifting in the LFOS. The number of particles in PF40 and PF250 is 200, and the state value for the scale is fixed to S. For the WLMC tracker, the number of states is 200 for constructing the Markov chain, and the state value in scale space is also fixed to S. For the NFWL tracker, the number of states is set to 400, considering the sampling in scale space. The threshold $\rho_t$ for our tracker to declare the target lost is set to 0.5. For the other trackers, if the Bhattacharyya coefficient between the output candidate and the target model is below 0.5, the target is also declared lost.

    The performance of the trackers on the five image pair sets is shown in Table 1, and some representative tracking results are compared in Fig. 4. As displayed in the table, CMS yields the highest average F-measure on all five sets. The MS tracker and the PF40 tracker easily lose the target when the displacement between the FFOS and the LFOS is large, as shown in column 2 of Fig. 4. The MS tracker fails because there is no overlap between the target and the initial shift window in the LFOS. In PF40, the standard deviation of the sample distribution is not large enough to generate a sample on the target, so it loses the target. As can be seen in columns 1 and 3 of Fig. 4, the PF250, WLMC and NFWL trackers are easily distracted by similar objects, even when those objects are not close to the real target. These three trackers are able to generate target samples even when the target in the LFOS is far from the initial position in the FFOS. However, as there is a time delay in the FOV switching stage, the appearance of the target in the LFOS might change compared with the target in the FFOS. The trackers may then locate another object whose appearance is more similar to the model than that of the real target in the LFOS, which causes tracking failure. The AMS tracker performs well when the appearance of the target is discriminative from the background, as shown in columns 1 and 2 of Fig. 4. If the target merges into the background, for example under partial occlusion, the tracker may obtain an inaccurate position for the target, as shown in columns 3 and 4 of Fig. 4. The reason is that, in terms of discriminating the target from the background, the kernel weighted color histogram model does not perform as well as the background weighted color histogram model. In contrast, CMS deals well with the abrupt motion of the target, without being distracted by similar objects far from the target. Our proposed tracker also yields better performance than the other trackers when the target goes through illumination change and partial occlusion, as shown in columns 3 and 4 of Fig. 4.

    Table 1.  Accuracy of different trackers

    Figure 4.  The comparison of the tracking results for the seven trackers. Images in the first row show the target (bounded by a red rectangle) in the FFOSs, and the tracking results of CMS (row 2), MS (row 3), AMS (row 4), PF40 (row 5), PF250 (row 6), WLMC (row 7) and NFWL (row 8) in the corresponding LFOSs are shown below. In images from row 2 to row 8, the red rectangles indicate the groundtruths, and the green rectangles indicate the tracking results of each tracker. Note that the MS and PF40 trackers declare the target lost in column 2, but the output candidates of the two trackers are shown for better illustration. The figure shows the good performance of our proposed tracker

    For the evaluation of the computational efficiency of the trackers, the image pair set generated from EgTest03 is employed in the following experiments. This set consists of 2 551 image pairs. The size of the image is $640\times480$, while the average size of the target in the FFOS is about $36\times32$. Firstly, the number of convergence iterations per image is compared between two CMS trackers: one uses the standard shifting weight value (calculated by (5)), and the other uses the modified weight value (calculated by (6)). The results for the middle 300 image pairs are shown in Fig. 5. As can be seen, the CMS tracker using the modified weight converges faster than the one using the standard weight for most images. Therefore, we adopt the modified weight in our proposed tracker. Then, an experiment is set up to evaluate the computational speed of all the trackers. The test is conducted in a C++ implementation on a desktop with an Intel Core 2 Duo 2.66 GHz CPU and 2 GB RAM. As shown in Table 2, CMS is the second most efficient tracker, with the MS tracker taking the lead. The AMS tracker takes the longest time to process a frame, more than five times the cost of the CMS tracker, since it operates directly on the LFOS instead of the down-sampled LFOS. The good efficiency of our tracker makes it suitable for real-time object tracking tasks.

    Figure 5.  Comparison of the number of the convergence iterations in each image. The red points with "*" refer to the tracker using the standard weight, and the blue points with "o" refer to the one using the modified weight

    Table 2.  Average time cost of different trackers

  • Online real-time tracking with a dual FOV camera is a special case in the field of visual object tracking. Accurate tracking of the target across the WFOV-to-NFOV switch determines the success of tracking in the whole process. A cascade mean shift tracker is presented in this paper, which can handle the abrupt motion of the target caused by the dual FOV switching of the camera. With the given information of the camera focal lengths in the dual FOV modes, the tracker can locate the target accurately and efficiently under FOV switching. Our future work includes the accurate determination of the target scale in the LFOS, and the integration of the tracker into the whole tracking process under the FOV-switching camera. As the infrared (IR) sensor is commonly used in FOV-switching cameras deployed for aerial videos, research on visual tracking in that condition is also of great importance in the future.
