Head detection using motion features and multi level pyramid architecture

https://doi.org/10.1016/j.cviu.2015.04.007

Highlights

  • A two-stage head detection system using motion features is proposed.

  • A multi-level histogram architecture targeting low-resolution images is developed.

  • State-of-the-art motion features, including HOOF and MBH, have been employed.

  • HOOF has been extended to Relative Motion Distance for better head representation.

  • The results are validated on the PETS 2009 dataset and compare favourably with existing schemes.

Abstract

Monitoring large crowds using video cameras is a challenging task. Detecting humans in video is becoming essential for monitoring crowd behavior. However, occlusion and low resolution in the region of interest hinder accurate crowd segmentation. In such scenarios, it is likely that only the head is visible, and often it is very small. Most existing people-detection systems rely on low-level visual appearance features such as the Histogram of Oriented Gradients (HOG), which are unsuitable for detecting human heads at low resolutions. In this paper, a novel head detector based on motion histogram features is presented. Shape and motion information, including crowd direction and magnitude, is learned and used to detect humans in occluded crowds. We introduce novel features based on a multi-level pyramid architecture for the Motion Boundary Histogram (MBH) and the Histogram of Oriented Optical Flow (HOOF), derived from TV-L1 optical flow. In addition, a new feature, the Relative Motion Distance (RMD), is proposed to efficiently capture correlation statistics. To distinguish human heads from objects with similar features, a two-stage Support Vector Machine (SVM) classifier is used, with an explicit kernel mapping of the motion histogram features performed using Bhattacharyya-distance kernels; the second classification stage reduces the number of false positives. The proposed features and system were tested on videos from the PETS 2009 dataset and compared with state-of-the-art features, with excellent results.

Introduction

Security and safety are of paramount importance in large arenas where people gather, such as train stations, stadiums, and other public places. There is an urgent need for new video-surveillance technologies that help in understanding crowd behavior. In addition to live surveillance, understanding crowd behavior will aid off-line simulations that contribute to better architectural designs and emergency-management strategies. Crowds are characteristically dense, and objects of interest are frequently occluded (that is, partially hidden). Based on visual observation, accurate detection and tracking of the heads of people in crowds is the only viable option in such scenarios. Recently, some work has been carried out on head detection in crowds, and there remains considerable scope for improvement. Working with the popular Caltech Pedestrian Dataset, researchers have found that the head is the only visible part in 97% of occlusions [1]. Therefore, extracting features from the head and shoulders is more appropriate and feasible than trying to model the entire human body in crowds.

Recently, various methods for modeling human movement have been developed by the video-surveillance community. Researchers have proposed numerous object-related features with extrinsic and intrinsic properties [2]. The extrinsic properties include color, intensity, and motion information, while the intrinsic properties include curvature information. A multi-level representation framework using a sliding-window technique has become a fundamental tool for building low-level feature descriptors. These descriptors are applied in human detection, tracking, and behavior modeling. Selecting appropriate features is the key to successful people detection [3], and researchers frequently utilise local features that are invariant to illumination and small deformations.

Since Dalal et al. [4] proposed the Histogram of Oriented Gradients (HOG), human detection using HOG features has gained a lot of attention. However, there are several limitations when HOG is applied to detecting human heads [5]. HOG does not offer enough variation for accurate discrimination, and it often causes the learning algorithm to mistakenly report other head-like objects as heads. Nevertheless, based on the detailed argument presented in Dollar et al. [1], HOG and its extension to part models [6] remain the preferred features in static environments, owing to their robustness in modeling the local shape, the relation between parts, and the appearance of the target object. Generally, classifiers such as the Support Vector Machine (SVM) and cascade boosting [4], [7] are employed to map these low-level features into an interpretable score representing the likelihood that a human is present. However, appearance features alone are weak discriminators of humans [3], and the authors of that study advocate combining multiple feature channels rather than relying on a single one. Extending earlier work [5], the Motion Boundary Histogram (MBH) computed from optical flow, together with the Histogram of Oriented Optical Flow (HOOF), is included in the set of motion features. A new feature, the Relative Motion Distance (RMD), computed from the statistical correlation of HOOF, is proposed to enrich the motion feature set.
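
As a rough illustration of this feature pipeline (a minimal sketch, not the authors' implementation; the bin count and the OpenCV TV-L1 entry point are assumptions), a HOOF descriptor can be computed from a dense flow field by binning flow orientations weighted by flow magnitude and L1-normalising the result:

```python
import numpy as np

def hoof(flow, n_bins=9):
    """Histogram of Oriented Optical Flow for one window.

    flow: H x W x 2 array of (dx, dy) flow vectors, e.g. from TV-L1.
    Returns an L1-normalised histogram (a point on the simplex).
    """
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)                                 # flow magnitude
    ang = np.arctan2(dy, dx) % (2 * np.pi)                 # orientation in [0, 2*pi)
    idx = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Dense TV-L1 flow between consecutive grey frames, via opencv-contrib:
#   tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
#   flow = tvl1.calc(prev_gray, next_gray, None)
```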

To capture behavior at a variety of spatial resolutions, Spatial Pyramid Matching (SPM), a multi-level histogram feature, was introduced by Lazebnik et al. [8] and is applied widely in human action recognition and human detection [9], [10]. The resolution of the input image where features are computed is kept fixed, whereas the spatial resolution at which they are aggregated into block features is varied. This produces a higher-dimensional feature representation containing discriminative information over and above the raw video. In this work, we report a multi-level classification scheme based on motion histogram features.
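
A minimal sketch of such a multi-level histogram, assuming 2^l x 2^l cells at level l and reusing the hoof helper from the previous sketch (the paper's exact grid layout may differ):

```python
import numpy as np

def pyramid_histogram(feature_map, hist_fn, levels=3):
    """SPM-style multi-level histogram over one detection window.

    feature_map: H x W x C array (e.g. the flow inside a window).
    hist_fn: maps a sub-window to an L1-normalised histogram.
    Level l splits the window into 2**l x 2**l cells; all cell
    histograms are concatenated into one long descriptor.
    """
    H, W = feature_map.shape[:2]
    parts = []
    for level in range(levels):
        n = 2 ** level
        ys = np.linspace(0, H, n + 1, dtype=int)
        xs = np.linspace(0, W, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                parts.append(hist_fn(cell))
    return np.concatenate(parts)
```

With the 9-bin hoof above and three levels, this yields 9 x (1 + 4 + 16) = 189 dimensions per detection window.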

Kernel selection is among the most important factors in SVM learning [11]. The key to successfully training an SVM model is to construct an appropriate kernel that maps the data into a linearly separable feature space. General-purpose kernels are available, but a kernel matched to the target features can substantially improve the generalization ability of the learning system [12]. Ablavsky et al. [13] and Vemulapalli et al. [14] note that histograms lie on a Riemannian manifold rather than in Euclidean space. A corresponding kernel is therefore needed for such data, which arise as finite probability distributions, normalised histograms, and other computer vision descriptors [15]. One way to achieve this is to map the manifold to a Reproducing Kernel Hilbert Space (RKHS) using a kernel and an appropriate distance metric. Another is to perform classification directly on the tangent space of the manifold [2]. Two popular kernels, the Bhattacharyya kernel and the Histogram Intersection kernel, employed in [13], are applied and compared in this work. Notably, because both reduce to simple arithmetic operations, they can be evaluated quickly and result in better classification accuracy.
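
To make the explicit-mapping idea concrete: for L1-normalised histograms the Bhattacharyya kernel admits an exact explicit feature map, the element-wise square root, so a linear SVM on the mapped features behaves as a Bhattacharyya-kernel SVM at linear-SVM cost. Below is a sketch on synthetic stand-in data (the dimensionality, random labels, and scikit-learn classifier are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

# For L1-normalised histograms p, q the Bhattacharyya kernel
#   K(p, q) = sum_i sqrt(p_i * q_i) = <sqrt(p), sqrt(q)>
# is an ordinary inner product after the explicit map phi(p) = sqrt(p).

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(189), size=200)   # toy stand-in for pyramid HOOF/MBH
y = rng.integers(0, 2, size=200)            # toy head / non-head labels

clf = LinearSVC(C=1.0).fit(np.sqrt(X), y)
scores = clf.decision_function(np.sqrt(X))  # per-window detection scores
```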

The main contributions of this paper are: 1) the use of high-level motion features and kernel machines for head detection in occluded scenarios; 2) new features based on the statistical relations between motion histogram features (i.e. the RMD), combined with various other motion features, including the multi-level MBH and HOOF; 3) the observation that motion features based on a combination of HOOF and MBH lie on the simplex manifold, making it appropriate to employ Bhattacharyya-distance kernels, which yield better results than the linear and RBF kernels traditionally employed; and 4) an evaluation of the proposed method, with successful results, on the publicly available PETS 2009 dataset for people counting in surveillance, together with a comparison against other state-of-the-art features.

Section snippets

Background

In this section, the background for this work, other state-of-the-art histogram-based methods, and appropriate kernel selection are presented. Proposed methods have come a long way, from grappling with complex methods for detecting the entire body to more sophisticated approaches for detecting the head in occluded scenarios. Some of the earliest work in tracking groups of people was carried out by McKenna et al. [16], making use of color as a dominant feature. A detailed survey of crowd

Methodology

The overall methodology is described in Fig. 2. As can be seen, the raw video is first subjected to low-level motion feature extraction using optical flow. High-level features are then extracted in the form of histograms of the motion features. A discriminative feature, the relative motion distance, is then derived and used as the final feature for classification. At the training stage, the extracted features are shown to lie in the simplex manifold space and the appropriate kernel is
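
A heavily hedged sketch of how such a pipeline might be wired together, reusing the hoof and pyramid_histogram helpers from the earlier sketches (window size, stride, thresholds, and the classifier interfaces are all assumptions; the paper's exact settings appear in the full text):

```python
import numpy as np
import cv2

def detect_heads(prev_gray, next_gray, stage1, stage2, win=24, step=8):
    """Sliding-window head detection over a TV-L1 flow field.

    stage1, stage2: trained SVMs exposing decision_function(); stage 2
    re-scores stage-1 candidates to prune false positives.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # needs opencv-contrib
    flow = tvl1.calc(prev_gray, next_gray, None)
    H, W = flow.shape[:2]
    heads = []
    for y in range(0, H - win + 1, step):
        for x in range(0, W - win + 1, step):
            f = pyramid_histogram(flow[y:y + win, x:x + win], hoof)
            f = np.sqrt(f)                            # Bhattacharyya map
            if stage1.decision_function([f])[0] > 0 and \
               stage2.decision_function([f])[0] > 0:
                heads.append((x, y, win, win))        # (x, y, width, height)
    return heads
```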

Performance evaluation and discussion

In this section, we discuss the evaluation of the proposed motion features for detecting heads in a publicly available crowd video dataset. Different combinations of motion features were tested: the results of votes from all the features in Stage 1 and Stage 2, from combinations of shape and relative motion (MBHU, MBHV, RMD), and from each feature individually. The results from explicit-mapping kernel selection, including the linear, intersection, linear-Bhattacharyya, and the

Conclusion

In this paper, a head detector that uses motion features exclusively was presented, with excellent results. The specific targets for this technology include crowd monitoring and the detection of heads in occluded environments. A new motion feature, the relative motion distance (RMD), was proposed, combining two other motion histogram features (HOOF and MBH) within a two-stage SVM head detector. Further, we demonstrated the increasing discriminability of linear and intersection

Acknowledgment

This work is partially supported by the ARC Linkage Project LP100200430, partnering the University of Melbourne, the Melbourne Cricket Club, and ARUP. The authors would like to thank representatives and staff of ARUP and the MCG. We also thank Mr. Aravinda S. Rao for his valuable feedback.

References (46)

  • S.J. McKenna et al.

    Tracking groups of people

    Comput. Vis. Image Underst.

    (2000)
  • B.K. Horn et al.

    Determining optical flow

    Artif. Intell.

    (1981)
  • P. Dollar et al.

    Pedestrian detection: an evaluation of the state of the art

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • O. Tuzel et al.

    Pedestrian detection via classification on Riemannian manifolds

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2008)
  • C. Wojek et al.

    Multi-cue onboard pedestrian detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009

    (2009)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005

    (2005)
  • F.-C. Hsu et al.

    Human head detection using histograms of oriented optical flow in low quality videos with occlusion

    Proceedings of the 7th International Conference on Signal Processing and Communication Systems, ICSPCS 2013

    (2013)
  • P.F. Felzenszwalb et al.

    Object detection with discriminatively trained part-based models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2010)

  • Q. Zhu et al.

    Fast human detection using a cascade of histograms of oriented gradients

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2006)
  • S. Lazebnik et al.

    Beyond bags of features: spatial pyramid matching for recognizing natural scene categories

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2006)
  • S. Maji et al.

    Classification using intersection kernel support vector machines is efficient

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008

    (2008)
  • A. Maki et al.

    Co-occurrence flow for pedestrian detection

    Proceedings of the 18th IEEE International Conference on Image Processing (ICIP 2011)

    (2011)
  • K.P. Murphy

    Machine Learning: A Probabilistic Perspective

    (2012)
  • G. Camps-Valls et al.

    Kernel Methods in Bioengineering, Signal and Image Processing

    (2007)
  • V. Ablavsky et al.

    Learning parameterized histogram kernels on the simplex manifold for image and action classification

    Proceedings of the IEEE International Conference on Computer Vision (ICCV 2011)

    (2011)
  • R. Vemulapalli et al.

    Kernel learning for extrinsic classification of manifold features

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013)

    (2013)
  • A. Vedaldi et al.

    Efficient additive kernels via explicit feature maps

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • B. Zhan et al.

    Crowd analysis: a survey

    Mach. Vis. Appl.

    (2008)
  • P. Dollar et al.

    Integral channel features

    Proceedings of the British Machine Vision Conference

    (2009)
  • B. Wu et al.

    Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008

    (2008)
  • P. Sabzmeydani et al.

    Detecting pedestrians by learning shapelet features

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007

    (2007)
  • S. Walk et al.

    New features and insights for pedestrian detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010)

    (2010)
  • D. Park et al.

    Exploring weak stabilization for motion feature extraction

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013)

    (2013)
This paper has been recommended for acceptance by Xiaofei He.